## **Chat with Your Data**
#### Steps:
- Process your documents (chunking, embedding, vector-store)
- Q & A using our RAG
  
<br/>

### **Process your documents (chunking, embedding, vector-store):**

In [92]:
import os
import glob
import fitz
from tqdm import tqdm
import re

# set the path to your data directory
DATA_DIR = 'data/'

In [93]:
from spacy.lang.en import English 

# Add a sentencizer pipeline
nlp = English()
nlp.add_pipe("sentencizer")

def get_sentences(txt):
    sentences = list(nlp(txt).sents)
    sentences = [str(sentence) for sentence in sentences]
    return sentences

def read_files(data_dir):
    # loop over your files
    extracted_data = []
    for file in glob.glob(os.path.join(data_dir, "*.pdf")):
        # open the doc
        document = fitz.open(file)
        # process
        # print("file path: " , file)
        for page_num, page in tqdm(enumerate(document)):
            # get the raw text of each page
            txt = page.get_text()
            # do some cleaning
            cleaned_text = txt.replace("\n", " ").strip()
            
            # print(cleaned_text)
            # print("\n\n ++++++++++++++++++++++++++++++++++ \n\n")
            sentences = get_sentences(cleaned_text)
            entry = {"file_path": file,
                     "page_number": page_num,
                     "page_char_count": len(cleaned_text),
                     "page_word_count": len(cleaned_text.split(" ")),
                     "page_sentence_count": len(sentences),
                     "page_token_count": len(cleaned_text) / 4,  # rough estimate: 1 token = ~4 characters
                     "text": cleaned_text,
                     "sentences": sentences}
            extracted_data.append(entry)
    return extracted_data

In [94]:
extracted_data = read_files(DATA_DIR)

15it [00:00, 94.26it/s] 
11it [00:00, 231.34it/s]
19it [00:00, 148.26it/s]
34it [00:00, 147.38it/s]


In [95]:
import random 
random.sample(extracted_data, k=1)

[{'file_path': 'data\\attention is all you need.pdf',
  'page_number': 12,
  'page_char_count': 812,
  'page_word_count': 127,
  'page_sentence_count': 8,
  'page_token_count': 203.0,
  'text': 'Attention Visualizations Input-Input Layer5 It is in this spirit that a majority of American governments have passed new laws since 2009 making the registration or voting process more difficult . <EOS> <pad> <pad> <pad> <pad> <pad> <pad> It is in this spirit that a majority of American governments have passed new laws since 2009 making the registration or voting process more difficult . <EOS> <pad> <pad> <pad> <pad> <pad> <pad> Figure 3: An example of the attention mechanism following long-distance dependencies in the encoder self-attention in layer 5 of 6. Many of the attention heads attend to a distant dependency of the verb ‘making’, completing the phrase ‘making...more difficult’. Attentions here shown only for the word ‘making’. Different colors represent different heads. Best viewed in co

In [96]:
import pandas as pd
df = pd.DataFrame(extracted_data)
df.describe().round(2)

| | page_number | page_char_count | page_word_count | page_sentence_count | page_token_count |
|---|---|---|---|---|---|
| count | 79.0 | 79.0 | 79.0 | 79.0 | 79.0 |
| mean | 11.29 | 3299.86 | 507.87 | 28.14 | 824.97 |
| std | 8.75 | 1014.94 | 157.67 | 15.02 | 253.74 |
| min | 0.0 | 812.0 | 127.0 | 8.0 | 203.0 |
| 25% | 4.5 | 2615.5 | 428.5 | 18.0 | 653.88 |
| 50% | 9.0 | 3473.0 | 498.0 | 24.0 | 868.25 |
| 75% | 16.0 | 4007.0 | 629.0 | 35.0 | 1001.75 |
| max | 33.0 | 5391.0 | 849.0 | 64.0 | 1347.75 |


#### **Chunking** 
We need to break the text into chunks, embed those chunks, and save them in the vector store.
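
As a minimal sketch of what fixed-size grouping does, the toy example below splits a made-up page of 23 placeholder sentences into groups of 10. The function name `group_sentences` and the toy data are illustrative assumptions, not part of the notebook. The point is that the last group holds whatever is left over, so some chunks can end up much shorter than the chosen size.

```python
def group_sentences(sentences, chunk_size):
    """Split a list of sentences into consecutive groups of at most chunk_size."""
    return [sentences[i:i + chunk_size] for i in range(0, len(sentences), chunk_size)]


# Hypothetical page with 23 placeholder sentences (not real document data).
toy_sentences = [f"Sentence {n}." for n in range(1, 24)]

toy_chunks = group_sentences(toy_sentences, chunk_size=10)
chunk_lengths = [len(chunk) for chunk in toy_chunks]

print(chunk_lengths)   # [10, 10, 3]
print(toy_chunks[-1])  # the leftover sentences form a short final chunk
```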

In [97]:
# Define split size to turn groups of sentences into chunks
CHUNK_SIZE_IN_SENTENCES = 10 

def chunking(list_of_sentences, chunk_size):
    # We group sentences based on the chunk size (estimated in sentences)
    sentence_chunks = [list_of_sentences[i:i + chunk_size] for i in range(0, len(list_of_sentences), chunk_size)]
    return sentence_chunks

for entry in tqdm(extracted_data):
    entry["sentence_chunks"] = chunking(entry["sentences"], CHUNK_SIZE_IN_SENTENCES)
    entry["num_chunks"] = len(entry["sentence_chunks"])


100%|██████████| 79/79 [00:00<?, ?it/s]


In [98]:
random.sample(extracted_data, k=1)[0]["sentence_chunks"][1]

['For longer output sequences, |Y | can become large, requiring many forward passes.',
 'For more efﬁcient decoding, we can make a further approximation that pθ(y|x, zi) ≈ 0 where y was not generated during beam search from x, zi.',
 'This avoids the need to run additional forward passes once the candidate set Y has been generated.',
 'We refer to this decoding procedure as “Fast Decoding.”',
 '3 Experiments We experiment with RAG in a wide range of knowledge-intensive tasks.',
 'For all experiments, we use a single Wikipedia dump for our non-parametric knowledge source.',
 'Following Lee et al. [',
 '31] and Karpukhin et al. [',
 '26], we use the December 2018 dump.',
 'Each Wikipedia article is split into disjoint 100-word chunks, to make a total of 21M documents.']

In [99]:
df = pd.DataFrame(extracted_data)
df.describe().round(2)

| | page_number | page_char_count | page_word_count | page_sentence_count | page_token_count | num_chunks |
|---|---|---|---|---|---|---|
| count | 79.0 | 79.0 | 79.0 | 79.0 | 79.0 | 79.0 |
| mean | 11.29 | 3299.86 | 507.87 | 28.14 | 824.97 | 3.25 |
| std | 8.75 | 1014.94 | 157.67 | 15.02 | 253.74 | 1.55 |
| min | 0.0 | 812.0 | 127.0 | 8.0 | 203.0 | 1.0 |
| 25% | 4.5 | 2615.5 | 428.5 | 18.0 | 653.88 | 2.0 |
| 50% | 9.0 | 3473.0 | 498.0 | 24.0 | 868.25 | 3.0 |
| 75% | 16.0 | 4007.0 | 629.0 | 35.0 | 1001.75 | 4.0 |
| max | 33.0 | 5391.0 | 849.0 | 64.0 | 1347.75 | 7.0 |


From the stats, the average number of chunks per page is 3.25 and the average page token count is 824.97, so each chunk holds roughly 824.97 / 3.25 ≈ 254 tokens on average. This means we need an embedding model with a context length of at least about 254 tokens, for example **all-mpnet-base-v2** (it has a capacity of 384 tokens). Note that this is an average over estimated token counts: chunks longer than the model's limit would typically be truncated when embedded.
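
The arithmetic can be checked with a short sketch. The two means are copied from the `df.describe()` output above, and the 384-token limit is the figure quoted in the text for **all-mpnet-base-v2**; the variable names are illustrative assumptions. Keep in mind that the token counts themselves are only estimates (characters divided by 4), so the result is a rough guide rather than an exact measurement.

```python
# Mean values copied from the df.describe() output above.
mean_tokens_per_page = 824.97
mean_chunks_per_page = 3.25

# Context length quoted in the text for all-mpnet-base-v2 (assumed here, not queried from the model).
MODEL_MAX_TOKENS = 384

# Ratio of the two means: a rough estimate of the average chunk size in tokens.
mean_tokens_per_chunk = mean_tokens_per_page / mean_chunks_per_page

print(f"~{mean_tokens_per_chunk:.0f} tokens per chunk on average")
print("Average chunk fits the model:", mean_tokens_per_chunk <= MODEL_MAX_TOKENS)
```

This only shows that the average chunk fits. Individual chunks can be larger, so checking the per-chunk statistics is still worthwhile.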

Before creating the embeddings locally, I need to filter out very short chunks, which are unlikely to contain useful information (e.g. footers, links, etc.).

In [100]:
# create chunks dict to keep only chunks info
def convert_to_chunck_dict(text_dict):  
    extracted_chunks = []
    for item in tqdm(text_dict):
        for sentence_chunk in item["sentence_chunks"]:
            chunk_dict = {}
            chunk_dict["file_path"] = item["file_path"]
            chunk_dict["page_number"] = item["page_number"]
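            # Join the sentences together into a paragraph-like structure
            joined_sentence_chunk = "".join(sentence_chunk).replace("  ", " ").strip()
            # Restore the space after a full stop that is directly followed by a capital letter (".A" -> ". A");
            # sentences starting with a lowercase letter or a digit stay fused to the previous one
            joined_sentence_chunk = re.sub(r'\.([A-Z])', r'. \1', joined_sentence_chunk)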
            chunk_dict["sentence_chunk"] = joined_sentence_chunk
            # Get stats about the chunk
            chunk_dict["chunk_char_count"] = len(joined_sentence_chunk)
            chunk_dict["chunk_word_count"] = len(joined_sentence_chunk.split(" "))
            chunk_dict["chunk_token_count"] = len(joined_sentence_chunk) / 4 # 1 token = ~4 characters
            extracted_chunks.append(chunk_dict)
    return extracted_chunks


extracted_chunks = convert_to_chunck_dict(extracted_data)

100%|██████████| 79/79 [00:00<00:00, 14323.07it/s]


In [101]:
random.sample(extracted_chunks, k=1)

[{'file_path': 'data\\RAG.pdf',
  'page_number': 12,
  'sentence_chunk': 'for Computational Linguistics, pages 6086–6096, Florence, Italy, July 2019. Association for Computational Linguistics.doi: 10.18653/v1/P19-1612. URL https://www.aclweb.org/ anthology/P19-1612. [32] Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension.arXiv preprint arXiv:1910.13461, 2019. URL https://arxiv.org/abs/1910.13461. [33] Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. A diversity-promoting objective function for neural conversation models.',
  'chunk_char_count': 668,
  'chunk_word_count': 71,
  'chunk_token_count': 167.0}]

In [102]:
# Get stats about our chunks
df = pd.DataFrame(extracted_chunks)
df.describe().round(2)

| | page_number | chunk_char_count | chunk_word_count | chunk_token_count |
|---|---|---|---|---|
| count | 257.0 | 257.0 | 257.0 | 257.0 |
| mean | 11.07 | 1012.51 | 154.96 | 253.13 |
| std | 7.7 | 595.98 | 99.55 | 148.99 |
| min | 0.0 | 1.0 | 1.0 | 0.25 |
| 25% | 6.0 | 604.0 | 78.0 | 151.0 |
| 50% | 10.0 | 852.0 | 133.0 | 213.0 |
| 75% | 14.0 | 1402.0 | 226.0 | 350.5 |
| max | 33.0 | 2798.0 | 469.0 | 699.5 |


In [103]:
# Show random chunks with at most 30 (estimated) tokens
min_token_length = 30
for _, row in df[df["chunk_token_count"] <= min_token_length].sample(5).iterrows():
    print(f'Chunk token count: {row["chunk_token_count"]} | Text: {row["sentence_chunk"]}')

Chunk token count: 28.0 | Text: In Proceedings of the 51st Annual Meeting of the ACL (Volume 1: Long Papers), pages 434–443. ACL, August 2013.12
Chunk token count: 5.5 | Text: During inference, we 9
Chunk token count: 18.75 | Text: doi: 10.1162/tacl_a_00030. URL https://www.aclweb.org/anthology/Q18-1031.11
Chunk token count: 27.75 | Text: Another exciting research direction is to have the model predict future text as well as just the next action.34
Chunk token count: 12.0 | Text: URL https://www.aclweb.org/anthology/P17-1020.10


In [106]:
extracted_chunks_filtered = df[df["chunk_token_count"] > min_token_length].to_dict(orient="records")

In [114]:
random.sample(extracted_chunks_filtered, k=1)

[{'file_path': 'data\\CNN.pdf',
  'page_number': 6,
  'sentence_chunk': 'Despite our best efforts so far we will still ﬁnd that our models are still enor- mous if we use an image input of any real dimensionality. However, methods have been developed as to greatly curtail the overall number of parameters within the convolutional layer. Parameter sharing works on the assumption that if one region feature is useful to compute at a set spatial region, then it is likely to be useful in another region. If we constrain each individual activation map within the output volume to the same weights and bias, then we will see a massive reduction in the number of parameters being produced by the convolutional layer. As a result of this as the backpropagation stage occurs, each neuron in the out- put will represent the overall gradient of which can be totalled across the depth - thus only updating a single set of weights, as opposed to every single one.',
  'chunk_char_count': 879,
  'chunk_word_coun

#### **Embed chunks**