## **Chat with Your Data**
#### Steps:
- Process your documents (chunking, embedding, vector-store)
- Q & A using our RAG
  
<br/>

### **Process your documents (chunking, embedding, vector-store):**

In [1]:
import os
import glob
import fitz
from tqdm import tqdm
import re

# set the path to your data directory
DATA_DIR = 'data/'

In [2]:
from spacy.lang.en import English 

# Add a sentencizer pipeline
nlp = English()
nlp.add_pipe("sentencizer")

def get_sentences(txt):
    sentences = list(nlp(txt).sents)
    sentences = [str(sentence) for sentence in sentences]
    return sentences

def read_files(data_dir):
    # loop over your files
    extracted_data = []
    for file in glob.glob(os.path.join(data_dir, "*.pdf")):
        # open the doc
        document = fitz.open(file)
        # process
        # print("file path: " , file)
        for page_num, page in tqdm(enumerate(document)):
            # get the raw text of each page
            txt = page.get_text()
            # do some cleaning
            cleaned_text = txt.replace("\n", " ").strip()
            
            # print(cleaned_text)
            # print("\n\n ++++++++++++++++++++++++++++++++++ \n\n")
            sentences = get_sentences(cleaned_text)
            entry = {"file_path": file,
                     "page_number": page_num,
                     "page_char_count": len(cleaned_text),
                     "page_word_count": len(cleaned_text.split(" ")),
                     "page_sentence_count": len(sentences),
                     "page_token_count": len(cleaned_text) / 4,
                     "text": cleaned_text,
                     "sentences": sentences}
            extracted_data.append(entry)
    return extracted_data

In [3]:
extracted_data = read_files(DATA_DIR)

15it [00:00, 84.15it/s] 
11it [00:00, 226.97it/s]
19it [00:00, 136.28it/s]
34it [00:00, 132.00it/s]


In [4]:
import random 
random.sample(extracted_data, k=1)

[{'file_path': 'data\\video pretraining VPT.pdf',
  'page_number': 8,
  'page_char_count': 3716,
  'page_word_count': 588,
  'page_sentence_count': 24,
  'page_token_count': 929.0,
  'text': 'Trained on Contractor Data Trained on IDM Labeled Web Data Figure 8: (Left) Zero-shot rollout performance of foundation models trained on varying amounts of data. Models to the left of the dashed black line (points ≤1k hours) were trained on contractor data (ground-truth labels), and models to the right were trained on IDM pseudo-labeled subsets of web_clean. Due to compute limitations, this analysis was performed with smaller (71 million parameter) models except for the final point, which is the 0.5 billion parameter VPT foundation model. (Right) The corresponding performance of each model after BC fine-tuning each model to the contractor_house dataset. contractor data, and those trained on 5k hours and above are trained on subsets of web_clean, which does not contain any IDM contractor data. Sca

In [5]:
import pandas as pd
df = pd.DataFrame(extracted_data)
df.describe().round(2)

| | page_number | page_char_count | page_word_count | page_sentence_count | page_token_count |
|---|---|---|---|---|---|
| count | 79.0 | 79.0 | 79.0 | 79.0 | 79.0 |
| mean | 11.29 | 3299.86 | 507.87 | 28.14 | 824.97 |
| std | 8.75 | 1014.94 | 157.67 | 15.02 | 253.74 |
| min | 0.0 | 812.0 | 127.0 | 8.0 | 203.0 |
| 25% | 4.5 | 2615.5 | 428.5 | 18.0 | 653.88 |
| 50% | 9.0 | 3473.0 | 498.0 | 24.0 | 868.25 |
| 75% | 16.0 | 4007.0 | 629.0 | 35.0 | 1001.75 |
| max | 33.0 | 5391.0 | 849.0 | 64.0 | 1347.75 |


#### **Chunking** 
We need to break the text into chunks, then embed these chunks and save them in the vector store.
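
As a quick illustration of fixed-size sentence chunking, here is a minimal sketch on a toy list. The names `group_sentences` and `toy_sentences` are made up for this example and are not part of the notebook's pipeline. It mainly shows that the last chunk of a page can be shorter than the others.

```python
def group_sentences(sentences, chunk_size):
    # Slice the list into consecutive groups of at most `chunk_size` sentences
    return [sentences[i:i + chunk_size] for i in range(0, len(sentences), chunk_size)]

# Hypothetical page with 12 sentences
toy_sentences = [f"Sentence {n}." for n in range(1, 13)]
toy_chunks = group_sentences(toy_sentences, 5)

# Number of sentences in each chunk
chunk_lengths = [len(chunk) for chunk in toy_chunks]
print(chunk_lengths)  # [5, 5, 2]
```

The trailing chunk holds only the leftover sentences, which is one way short chunks end up in the data.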

In [6]:
# Define split size to turn groups of sentences into chunks
CHUNK_SIZE_IN_SENTENCES = 5 

def chunking(list_of_sentences, chunk_size):
    # We group sentences based on the chunk size (estimated in sentences)
    sentence_chunks = [list_of_sentences[i:i + chunk_size] for i in range(0, len(list_of_sentences), chunk_size)]
    return sentence_chunks

for entry in tqdm(extracted_data):
    entry["sentence_chunks"] = chunking(entry["sentences"], CHUNK_SIZE_IN_SENTENCES)
    entry["num_chunks"] = len(entry["sentence_chunks"])


100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 79/79 [00:00<00:00, 78930.45it/s]


In [7]:
random.sample(extracted_data, k=1)[0]["sentence_chunks"][1]

['Our model achieves 28.4 BLEU on the WMT 2014 English- to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU.',
 'On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature.',
 'We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.',
 '∗Equal contribution.',
 'Listing order is random.']

In [8]:
df = pd.DataFrame(extracted_data)
df.describe().round(2)

| | page_number | page_char_count | page_word_count | page_sentence_count | page_token_count | num_chunks |
|---|---|---|---|---|---|---|
| count | 79.0 | 79.0 | 79.0 | 79.0 | 79.0 | 79.0 |
| mean | 11.29 | 3299.86 | 507.87 | 28.14 | 824.97 | 6.0 |
| std | 8.75 | 1014.94 | 157.67 | 15.02 | 253.74 | 3.04 |
| min | 0.0 | 812.0 | 127.0 | 8.0 | 203.0 | 2.0 |
| 25% | 4.5 | 2615.5 | 428.5 | 18.0 | 653.88 | 4.0 |
| 50% | 9.0 | 3473.0 | 498.0 | 24.0 | 868.25 | 5.0 |
| 75% | 16.0 | 4007.0 | 629.0 | 35.0 | 1001.75 | 7.0 |
| max | 33.0 | 5391.0 | 849.0 | 64.0 | 1347.75 | 13.0 |


From the stats, the average number of chunks per page is 6 and the average page token count is about 825, so each chunk holds roughly 825 / 6 ≈ 137 tokens on average. This means we need to choose an embedding model with a context length of at least that many tokens, for example the **all-mpnet-base-v2** model (it has a capacity of 384 tokens).
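
The estimate can be reproduced from the `describe()` table above. This is a minimal sketch: the two means are copied by hand from that table, the 384-token limit is the one quoted in this notebook, and the variable names are my own.

```python
# Values copied from the describe() output above
mean_page_tokens = 824.97    # mean of page_token_count
mean_chunks_per_page = 6.0   # mean of num_chunks
max_seq_length = 384         # context length quoted for all-mpnet-base-v2

# Average estimated tokens per chunk, and whether that fits the model
est_tokens_per_chunk = mean_page_tokens / mean_chunks_per_page
fits = est_tokens_per_chunk <= max_seq_length
print(f"~{est_tokens_per_chunk:.1f} tokens per chunk on average, fits: {fits}")
```

Keep in mind this is only an average based on the "4 characters per token" rule of thumb. Individual chunks can be longer, so checking the maximum chunk size is worthwhile too.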

Before creating the embeddings locally, I need to filter out very short chunks, which are unlikely to carry important information (e.g. footers, links).

In [9]:
# create chunks dict to keep only chunks info
def convert_to_chunck_dict(text_dict):  
    extracted_chunks = []
    for item in tqdm(text_dict):
        for sentence_chunk in item["sentence_chunks"]:
            chunk_dict = {}
            chunk_dict["file_path"] = item["file_path"]
            chunk_dict["page_number"] = item["page_number"]
            # Join the sentences of the chunk into one paragraph-like string
            joined_sentence_chunk = "".join(sentence_chunk).replace("  ", " ").strip()
            # Re-insert the space lost after a sentence-ending period (".A" -> ". A")
            joined_sentence_chunk = re.sub(r'\.([A-Z])', r'. \1', joined_sentence_chunk)
            chunk_dict["sentence_chunk"] = joined_sentence_chunk
            # Get stats about the chunk
            chunk_dict["chunk_char_count"] = len(joined_sentence_chunk)
            chunk_dict["chunk_word_count"] = len(joined_sentence_chunk.split(" "))
            chunk_dict["chunk_token_count"] = len(joined_sentence_chunk) / 4  # rough estimate: 1 token ~= 4 characters
            extracted_chunks.append(chunk_dict)
    return extracted_chunks


extracted_chunks = convert_to_chunck_dict(extracted_data)

100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 79/79 [00:00<00:00, 13172.33it/s]


In [10]:
random.sample(extracted_chunks, k=1)

[{'file_path': 'data\\RAG.pdf',
  'page_number': 5,
  'sentence_chunk': 'BART completes the generation "The Sun Also Rises" is a novel by this author of "The Sun Also Rises" indicating the title "The Sun Also Rises" is stored in BART’s parameters. Similarly, BART will complete the partial decoding "The Sun Also Rises" is a novel by this author of "A with "The Sun Also Rises" is a novel by this author of "A Farewell to Arms". This example shows how parametric and non-parametric memories work together—the non-parametric component helps to guide the generation, drawing out speciﬁc knowledge stored in the parametric memory.4.4 Fact Veriﬁcation Table 2 shows our results on FEVER. For 3-way classiﬁcation, RAG scores are within 4.3% of state-of-the-art models, which are complex pipeline systems with domain-speciﬁc architectures and substantial engineering, trained using intermediate retrieval supervision, which RAG does not require.',
  'chunk_char_count': 866,
  'chunk_word_count': 133,
  'c

In [11]:
# Get stats about our chunks
df = pd.DataFrame(extracted_chunks)
df.describe().round(2)

| | page_number | chunk_char_count | chunk_word_count | chunk_token_count |
|---|---|---|---|---|
| count | 474.0 | 474.0 | 474.0 | 474.0 |
| mean | 11.13 | 548.66 | 84.16 | 137.17 |
| std | 7.61 | 321.96 | 53.99 | 80.49 |
| min | 0.0 | 1.0 | 1.0 | 0.25 |
| 25% | 6.0 | 314.0 | 40.0 | 78.5 |
| 50% | 10.0 | 477.5 | 75.0 | 119.38 |
| 75% | 14.0 | 763.0 | 120.0 | 190.75 |
| max | 33.0 | 1905.0 | 296.0 | 476.25 |


In [12]:
# Show random chunks with under 30 tokens in length
min_token_length = 30
for row in df[df["chunk_token_count"] <= min_token_length].sample(5).iterrows():
    print(f'Chunk token count: {row[1]["chunk_token_count"]} | Text: {row[1]["sentence_chunk"]}')

Chunk token count: 6.75 | Text: We only collected labels 17
Chunk token count: 1.75 | Text: 1960.13
Chunk token count: 10.0 | Text: arXiv preprint arXiv:1508.04025, 2015.11
Chunk token count: 22.5 | Text: Later in the project, as we needed more data and as some contractors asked to terminate 19
Chunk token count: 25.25 | Text: Curran Associates, Inc., 2015. URL http://papers.nips.cc/paper/5846-end-to-end-memory-networks.pdf.14


In [13]:
extracted_chunks_filtered = df[df["chunk_token_count"] > min_token_length].to_dict(orient="records")

In [14]:
random.sample(extracted_chunks_filtered, k=1)

[{'file_path': 'data\\RAG.pdf',
  'page_number': 2,
  'sentence_chunk': '2.1 Models RAG-Sequence Model The RAG-Sequence model uses the same retrieved document to generate the complete sequence. Technically, it treats the retrieved document as a single latent variable that is marginalized to get the seq2seq probability p(y|x) via a top-K approximation. Concretely, the top K documents are retrieved using the retriever, and the generator produces the output sequence probability for each document, which are then marginalized, pRAG-Sequence(y|x) ≈ X z∈top-k(p(·|x)) pη(z|x)pθ(y|x, z) = X z∈top-k(p(·|x)) pη(z|x) N Y i pθ(yi|x, z, y1:i−1) RAG-Token Model In the RAG-Token model we can draw a different latent document for each target token and marginalize accordingly. This allows the generator to choose content from several documents when producing an answer. Concretely, the top K documents are retrieved using the retriever, and then the generator produces a distribution for the next output toke

#### **Embed chunks**
Import the text embedding model **all-mpnet-base-v2**, which outputs vectors of size **768** and has a context length of **384** tokens.

In [15]:
from sentence_transformers import SentenceTransformer
embedding_model = SentenceTransformer(model_name_or_path="all-mpnet-base-v2", device="cuda")
test = "This is a test text!!"
embedding = embedding_model.encode(test)
print("embedding shape: ", embedding.shape)
print("embedding values: ", embedding)



embedding shape:  (768,)
embedding values:  [-2.98330616e-02 -9.65827703e-02  9.01226245e-04  7.41711119e-03
 -4.27282527e-02  2.93111559e-02 -8.72458238e-03  8.85402597e-03
  3.43042314e-02 -9.74845886e-03  6.65283054e-02 -1.87870618e-02
  3.86397354e-02 -2.12698132e-02  1.95645131e-02 -4.51359116e-02
  4.06676419e-02 -3.41284163e-02 -2.89035011e-02  1.64026618e-02
 -5.03091551e-02  2.55158283e-02 -1.66003518e-02 -6.22245744e-02
 -2.18643956e-02  8.85844044e-03 -4.52556312e-02 -3.97380888e-02
  5.33500360e-03 -1.09222541e-02  2.95482390e-02 -1.83249377e-02
  1.36156548e-02 -4.51841801e-02  1.41395890e-06  9.03561246e-03
 -2.42098961e-02 -1.42207881e-02 -1.03771966e-03  1.08437771e-02
  4.88634109e-02  2.92940773e-02  1.86034199e-02  4.02783975e-02
 -2.57998984e-02  6.53411495e-03  4.77776714e-02  2.44616047e-02
 -2.95096375e-02  7.03382939e-02  1.46438566e-03 -7.21641025e-03
  2.66191293e-03 -4.00283672e-02  6.15323633e-02  5.02035860e-03
  1.76253971e-02  3.06243580e-02 -2.64169946e-

In [16]:
# A quick speed test for creating the embeddings:
# my local CPU vs. my local GPU (NVIDIA RTX 3060)
import time

def create_embeddings(chunks, embedding_model):
    # Embed each chunk one by one
    for item in tqdm(chunks):
        item["embedding"] = embedding_model.encode(item["sentence_chunk"])

# cpu
embedding_model.to("cpu")
t1 = time.time()
create_embeddings(extracted_chunks_filtered, embedding_model)
t2 = time.time()
print(f"CPU time: {round(t2 - t1, 2)} s" )

# gpu
embedding_model.to("cuda")
t1 = time.time()
create_embeddings(extracted_chunks_filtered, embedding_model)
t2 = time.time()
print(f"GPU time: {round(t2 - t1, 2)} s" )

100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 450/450 [00:43<00:00, 10.40it/s]


CPU time: 43.26 s


100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 450/450 [00:07<00:00, 58.17it/s]

GPU time: 7.74 s





Embed all texts in batches: 

In [23]:
text_chunks = [item["sentence_chunk"] for item in extracted_chunks_filtered]
embedding_model.to("cpu")
t1 = time.time()
text_chunk_embeddings = embedding_model.encode(text_chunks,
                                               batch_size=32,
                                               convert_to_tensor=True) 
t2 = time.time()
print(f"CPU time batched: {round(t2 - t1, 2)} s" )

CPU time batched: 57.83 s


In [24]:
text_chunks = [item["sentence_chunk"] for item in extracted_chunks_filtered]
embedding_model.to("cuda")
t1 = time.time()
text_chunk_embeddings = embedding_model.encode(text_chunks,
                                               batch_size=32,
                                               convert_to_tensor=True) 
t2 = time.time()
print(f"GPU time batched: {round(t2 - t1, 2)} s" )

GPU time batched: 3.4 s
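
To make `batch_size=32` concrete for the 450 chunks embedded above, here is a plain-Python sketch of how a list splits into batches. It only illustrates the idea and is not the library's internal implementation; `iter_batches` and `placeholder_chunks` are hypothetical names.

```python
import math

def iter_batches(items, batch_size):
    # Yield consecutive slices of at most `batch_size` items
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

num_chunks = 450   # number of filtered chunks reported by the progress bar above
batch_size = 32    # same batch size as in the encode() calls

# Stand-in strings instead of the real chunk texts
placeholder_chunks = [f"chunk {i}" for i in range(num_chunks)]
batch_sizes = [len(batch) for batch in iter_batches(placeholder_chunks, batch_size)]

print(len(batch_sizes), batch_sizes[-1])  # 15 2
```

So the model would process 14 full batches of 32 texts plus one final batch of 2, instead of 450 separate calls.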


In [18]:
# Save embeddings to file
import csv
embeddings_df = pd.DataFrame(extracted_chunks_filtered)
embeddings_df_save_path = "embeddings.csv"
embeddings_df.to_csv(embeddings_df_save_path, index=False, escapechar="\\")

In [19]:
# Import saved file and view
embeddings_loaded_df = pd.read_csv(embeddings_df_save_path)
embeddings_loaded_df.head()

| | file_path | page_number | sentence_chunk | chunk_char_count | chunk_word_count | chunk_token_count | embedding |
|---|---|---|---|---|---|---|---|
| 0 | `data\\attention is all you need.pdf` | 0 | Provided proper attribution is provided, Googl... | 1165 | 154 | 291.25 | `[ 2.07077805e-02 2.70413030e-02 -1.68691296e-...` |
| 1 | `data\\attention is all you need.pdf` | 0 | Our model achieves 28.4 BLEU on the WMT 2014 E... | 620 | 95 | 155.0 | `[ 1.10328430e-03 5.08999750e-02 3.29319574e-...` |
| 2 | `data\\attention is all you need.pdf` | 0 | Jakob proposed replacing RNNs with self-attent... | 658 | 90 | 164.5 | `[ 1.54208224e-02 1.14086864e-03 -6.22348813e-...` |
| 3 | `data\\attention is all you need.pdf` | 0 | Lukasz and Aidan spent countless long days des... | 392 | 49 | 98.0 | `[ 2.14229785e-02 5.34767583e-02 -1.31562511e-...` |
| 4 | `data\\attention is all you need.pdf` | 1 | 1 Introduction Recurrent neural networks, long... | 1115 | 158 | 278.75 | `[-2.53893668e-03 4.21451516e-02 -5.52566350e-...` |
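
In the loaded table, the `embedding` column comes back as text (a bracketed, space-separated string) rather than as numeric vectors. A minimal sketch of converting such a string back into numbers, assuming that format; `parse_embedding` is a hypothetical helper and the sample string is made up, not taken from the file:

```python
def parse_embedding(text):
    """Turn a string like '[ 0.5 -0.25 1.0e-02]' back into a list of floats."""
    # Drop surrounding whitespace and brackets, then split on any whitespace
    # (this also covers line breaks inside long vectors)
    return [float(value) for value in text.strip().strip("[]").split()]

# Made-up short vector in the same textual format as the CSV column
sample = "[ 0.5 -0.25\n  1.0e-02]"
vector = parse_embedding(sample)
print(vector)  # [0.5, -0.25, 0.01]
```

With pandas, this could be applied to the whole column, for example `embeddings_loaded_df["embedding"].apply(parse_embedding)`.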
