<a href="https://colab.research.google.com/github/collinjennings/detectiveLLMs/blob/main/chunkingSummarizer_101424.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Chunk Extractive Summary for Detective Stories
10.14.24

Failed attempt to deal with the story length problem by chunking the short stories. I tried a couple diffent LLMs and prompts, but in each chase the model hallucinated the story and end outputted summaries for other stories with no relation to the input text.

The input story for this trial was Doyle's "The Silver Blaze."

In [None]:
import glob
from collections import defaultdict
import re
import os
import numpy as np
import pandas as pd
from transformers import AutoTokenizer, AutoModelForCausalLM
import math
import os
import torch
os.environ['CUDA_VISIBLE_DEVICES'] = '0,1,2,3'

In [None]:
from google.colab import drive
drive.mount('/content/drive')
%cd '/content/drive/MyDrive/Colab Notebooks'

Mounted at /content/drive
/content/drive/MyDrive/Colab Notebooks


In [None]:
import glob
files = glob.glob("./data/texts/*.txt")
texts = defaultdict()
for afile in files:
      texts[re.sub('_','', afile.split('/')[3].split('.')[0])] = open(afile, encoding = 'utf-8').read()

In [None]:
### Crop the story at the reveal border sentence.
text = texts['MSH01'].split('.')[:496]
len(text)
print(text[495])
text = '.'.join(text)

 Holmes,” said he, “but I must regard what you have just said as either a very bad joke or an insult


### Functions for chunking and summarizing the story chunks

In [None]:
!pip install langchain

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

In [None]:
#help(RecursiveCharacterTextSplitter)

In [None]:
# Initialize the text splitter with custom parameters
max_token_length = 1500
cache_dir = None
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B-Instruct", model_max_length = max_token_length)
#tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1", model_max_length = max_token_length, cache_dir= cache_dir)

# Define length function for Langchain splitter
def len_function(text):
  return len(tokenizer.encode(text, add_special_tokens=False))

# Split the text into chunks of similar sizes with respect to paragraph breaks.
custom_text_splitter = RecursiveCharacterTextSplitter(
    # Set custom chunk size
    chunk_size = 1500,
    chunk_overlap  = 40,
    # Use length of the text as the size measure
    length_function = len_function,
    separators = ['\n', '\n\n']
)

# Create the chunks
chunks = custom_text_splitter.create_documents([text])

tokenizer_config.json:   0%|          | 0.00/54.5k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.09M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/296 [00:00<?, ?B/s]

In [None]:
def summarize(texts,max_tokens, model, tokenizer, device):
    summary = ' '
    B_INST, E_INST = "[INST] ", " [/INST]"
    for idx, text in enumerate(texts):
      prompt = f"{B_INST} Read the following extract from a short story. Summarize the key details of the excerpt: "
      f"\n\n [TEXT_START]{text}\n\n[TEXT_END]\n\n{E_INST}"
      inputs = tokenizer(prompt, return_tensors='pt').to(device)
      outputs = model.generate(**inputs, max_new_tokens=max_tokens, use_cache=True, do_sample=True,temperature=0.2, top_p=0.95)
      prompt_length = inputs['input_ids'].shape[1] #del ['input_ids]
      summary += ' ' + tokenizer.decode(outputs[0][prompt_length:-1])
      print(f'Run through chunk {str(idx)}.')
    print(summary)
    return summary

In [None]:
def solve(texts,cache_dir=None):
  max_token_length = 1000
  dev = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
  # Load model directly
  tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")
  model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-3B-Instruct", pad_token_id = tokenizer.eos_token_id, device_map='auto')
  #tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1", model_max_length = max_token_length, cache_dir= cache_dir)
  #model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1", pad_token_id = tokenizer.eos_token_id,cache_dir= cache_dir, device_map="auto")
  model.half()
  B_INST, E_INST = "[INST] ", " [/INST]"
  tokens = []
  for txt in texts:
    tokens.append(tokenizer.encode(txt.page_content))
  summary = summarize(tokens, max_token_length, model, tokenizer, dev)
  print(summary)
  prompt2 = f"{B_INST}Read the key details summarized from a detective story. Based on the summary, predict who is the culprit of the crime, and identify the "
  f"most important piece of evidence for making that prediction. Make the prediction and give the supporting evidence in no more than 100 words. "
  f"Here is the summary: \n\n[TEXT_START]{summary}\n\n[TEXT_END]{E_INST}"
  inputs = tokenizer(prompt2, return_tensors='pt').to(dev)
  outputs = model.generate(**inputs, max_new_tokens=max_token_length, use_cache=True, do_sample=True,temperature=0.2, top_p=0.95)
  prompt_length = inputs['input_ids'].shape[1] #del ['input_ids]
  solution = tokenizer.decode(outputs[0][prompt_length:-1])
  return solution, summary

In [None]:
solute, summary = solve(chunks, cache_dir = None)
print (solute)

config.json:   0%|          | 0.00/878 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/20.9k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/4.97G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/1.46G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/189 [00:00<?, ?B/s]

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Run through chunk 0.


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Run through chunk 1.


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Run through chunk 2.


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Run through chunk 3.


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Run through chunk 4.


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Run through chunk 5.


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Run through chunk 6.


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Run through chunk 7.
   **The Last Hope of the Damned** by [Author's Name]

As the last rays of sunlight faded from the sky, the village of Ashwood lay shrouded in an eerie twilight. The thatched roofs of the cottages seemed to blend seamlessly into the surrounding landscape, as if the very earth itself had swallowed them whole. The air was heavy with the scent of damp earth and decaying leaves, and the trees creaked and groaned in the fading light, their branches like skeletal fingers reaching towards the sky.

In the center of the village, a lone figure stood atop a hill, gazing out at the desolate landscape. Kael, a young man with piercing blue eyes and jet-black hair, stood tall and still, his eyes fixed on some point beyond the horizon. His face was set in a determined expression, his jaw clenched in a fierce resolve. He was a warrior, a fighter, and he had come to Ashwood seeking a new purpose.

As the darkness deepened, the villagers began to stir, emerging from their cottages t

In [None]:
outf= open("blaze_multi_solve.txt","w")
outf.write(solute)
outf.close()