# Preprocessors testing

As shown in the notebook "create_preprocessors.ipynb", we create three different preprocessing strategies to split our documents into chunks (as mentioned, proposal chunking won't be considered):

* Sentence splitting
* Semantic chunking
* Sequential semantic chunking

We also consider three retrieval scenarios, implemented directly in code (the expected volume of data for this project is not large enough to make a vector DB necessary, and a hand-written implementation is useful for understanding the concepts):

* Word matching using a TF-IDF vectorizer (as seen in the course in the minsearch implementation)
* Hybrid search, with embeddings computed with sentence transformers.
* Hybrid search with RRF.
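
As an illustration of the last scenario, here is a minimal sketch of reciprocal rank fusion (RRF). It assumes each retriever returns an ordered list of chunk ids; the function name, the sample ids, and the constant `k=60` (a commonly used default) are assumptions for illustration, not taken from this project's code.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of chunk ids into a single ranking.

    Each id scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is an assumed default.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by id for determinism.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


# Hypothetical result lists from the keyword and the embedding retriever.
keyword_ranking = ["c1", "c2", "c3"]
vector_ranking = ["c3", "c1", "c4"]

fused = reciprocal_rank_fusion([keyword_ranking, vector_ranking])
print(fused)
```

Because RRF only uses rank positions, it does not require the keyword and embedding scores to be on the same scale.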

This notebook compares how well each scenario supports retrieval. Although evaluating the full RAG pipeline end to end with each methodology would be preferable, given time restrictions we only explore the performance of the retrieval step, by checking the hit rate and MRR (mean reciprocal rank) against a ground truth dataset.
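
A minimal sketch of the two retrieval metrics, reading the second one as mean reciprocal rank. It assumes that, for each ground truth question, we already have a list of booleans marking whether each retrieved chunk (in rank order) is the expected one; the function names and the sample data are hypothetical.

```python
def hit_rate(relevance):
    """Fraction of questions whose expected chunk appears in the results."""
    return sum(any(row) for row in relevance) / len(relevance)


def mrr(relevance):
    """Mean of 1/rank of the first relevant result (0 when it is missing)."""
    total = 0.0
    for row in relevance:
        for rank, is_hit in enumerate(row, start=1):
            if is_hit:
                total += 1.0 / rank
                break
    return total / len(relevance)


# Hypothetical results for three questions, top-3 retrieved chunks each.
relevance = [
    [True, False, False],   # expected chunk found at rank 1
    [False, False, True],   # expected chunk found at rank 3
    [False, False, False],  # expected chunk not retrieved
]

print(hit_rate(relevance), mrr(relevance))
```

Hit rate only records whether the expected chunk was retrieved at all, while MRR also rewards placing it near the top.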

## Libraries



In [1]:
import os
import sys
import time 
import json
import random
import pandas as pd
import hashlib

from tqdm import tqdm
from dotenv import load_dotenv

project_path = os.path.dirname(os.getcwd())
sys.path.append(project_path)

from src.preprocess import extract_text_from_pdf, get_sentences, get_semantic_chunks, get_sequential_semantic_chunks
from src.rag import RAG

# Load variables from a local .env file (if present) before reading the key
load_dotenv()
GOOGLE_API_KEY = os.environ['GOOGLE_API_KEY']



## Ground truth data

For the ground truth data we choose 5 documents at random, chunk them, and then generate questions for each chunk. This means a separate ground truth set has to be generated for each chunking strategy.
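
One possible shape for such a ground truth set is sketched below. It assumes a callable that takes a chunk's text and returns a JSON list of questions; `build_ground_truth`, `fake_llm`, and the record fields are hypothetical stand-ins, not this project's question-generation code.

```python
import json


def build_ground_truth(chunks, generate_questions, n_questions=5):
    """Link every generated question to the id of the chunk it came from."""
    records = []
    for chunk in chunks:
        raw = generate_questions(chunk["text"], n_questions)
        for question in json.loads(raw):
            records.append({
                "question": question,
                "chunk_id": chunk["id"],
                "paper": chunk["paper"],
            })
    return records


def fake_llm(text, n):
    # Stand-in for a real LLM call; returns a JSON list of n questions.
    return json.dumps([f"Question {i + 1} about: {text[:20]}" for i in range(n)])


# Hypothetical chunk with the same fields as the chunk dictionaries built below.
chunks = [{"id": "abc-0", "paper": "example.pdf", "text": "Some chunk text."}]
ground_truth = build_ground_truth(chunks, fake_llm, n_questions=2)
print(ground_truth)
```

Keeping the chunk id on every record lets a retrieval result be scored as a hit when it returns that same id.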


In [2]:
doc_categories = os.listdir(os.path.join(project_path, 'docs'))
papers = []
index = 0
for category in doc_categories:
    for document in os.listdir(os.path.join(project_path, 'docs', category)):
        index += 1
        papers.append({
            "index": index,
            "category":category,
            "paper":document
        })

In [3]:
papers[:3]

[{'index': 1,
  'category': 'deeplearning',
  'paper': 'an_overview_of_gradient_descent_optimization_algorithms.pdf'},
 {'index': 2,
  'category': 'deeplearning',
  'paper': 'attention_is_all_you_need.pdf'},
 {'index': 3,
  'category': 'deeplearning',
  'paper': 'dense_x_retieval_what_retrieval_granularity_shoud_we_use.pdf'}]

In [4]:
random.seed(123)
sample = random.sample(papers, 5)

In [5]:
sample

[{'index': 2,
  'category': 'deeplearning',
  'paper': 'attention_is_all_you_need.pdf'},
 {'index': 9,
  'category': 'deeplearning',
  'paper': 'the_matrix_calculus_you_need_for_deeplearning.pdf'},
 {'index': 3,
  'category': 'deeplearning',
  'paper': 'dense_x_retieval_what_retrieval_granularity_shoud_we_use.pdf'},
 {'index': 14,
  'category': 'time_series',
  'paper': 'another_lookat_measures_of_forecast_accuracy.pdf'},
 {'index': 4,
  'category': 'deeplearning',
  'paper': 'knowledge_card_filling_llms_knowledge_gaps_with_plug_in_specialied_language_models.pdf'}]

## Getting the chunked data

### Sentence splitting

In [6]:
# Creating chunks
chunks = []
for doc in tqdm(sample):
    index = doc.get('index')
    category = doc.get('category')
    paper = doc.get('paper')
    
    doc_id = hashlib.sha256(
        f'{category}-{paper}-{index}'.encode('utf-8')
    ).hexdigest()
    
    pdf_path = os.path.join(
        project_path, 'docs', category, paper
    )
    
    text = extract_text_from_pdf(pdf_path)
    
    doc_chunks = get_sentences(text, doc_id)
    
    for doc_chunk in doc_chunks:                
        chunks.append({
            'id': f"{doc_id}-{doc_chunk['chunk']}",  # double quotes outside: nested same-type quotes fail before Python 3.12
            'category':category,
            'paper': paper,
            'text': doc_chunk['text']
        })
 

100%|██████████| 5/5 [00:05<00:00,  1.01s/it]


In [7]:
df_sentence_splitting = pd.DataFrame(chunks)

In [8]:
df_sentence_splitting.head()

| | id | category | paper | text |
|---|---|---|---|---|
| 0 | e1ccff07e5c99304d9674e3bb8b21a9f3ad63a70834970... | deeplearning | attention_is_all_you_need.pdf | Attention Is All You Need\nAshish Vaswani\nGo... |
| 1 | e1ccff07e5c99304d9674e3bb8b21a9f3ad63a70834970... | deeplearning | attention_is_all_you_need.pdf | The best\nperforming models also connect the e... |
| 2 | e1ccff07e5c99304d9674e3bb8b21a9f3ad63a70834970... | deeplearning | attention_is_all_you_need.pdf | We propose a new simple network architecture, ... |
| 3 | e1ccff07e5c99304d9674e3bb8b21a9f3ad63a70834970... | deeplearning | attention_is_all_you_need.pdf | Experiments on two machine translation tasks s... |
| 4 | e1ccff07e5c99304d9674e3bb8b21a9f3ad63a70834970... | deeplearning | attention_is_all_you_need.pdf | Our model achieves 28.4 BLEU on the WMT 2014 E... |


In [33]:
df_sentence_splitting.to_csv(
    os.path.join(project_path, 'data', 'testing', 'sentence_splitting.csv')
)

### Semantic chunking

In [34]:
# Creating chunks
chunks = []
for doc in tqdm(sample):
    index = doc.get('index')
    category = doc.get('category')
    paper = doc.get('paper')
    
    doc_id = hashlib.sha256(
        f'{category}-{paper}-{index}'.encode('utf-8')
    ).hexdigest()
    
    pdf_path = os.path.join(
        project_path, 'docs', category, paper
    )
    
    text = extract_text_from_pdf(pdf_path)
    
    doc_chunks = get_semantic_chunks(text, doc_id)
    
    for doc_chunk in doc_chunks:                
        chunks.append({
            'id': f"{doc_id}-{doc_chunk['chunk']}",  # double quotes outside: nested same-type quotes fail before Python 3.12
            'category':category,
            'paper': paper,
            'text': doc_chunk['text']
        })

df_semantic_chunking = pd.DataFrame(chunks)
df_semantic_chunking.head()

Found Intel OpenMP ('libiomp') and LLVM OpenMP ('libomp') loaded at
the same time. Both libraries are known to be incompatible and this
can cause random crashes or deadlocks on Linux when loaded in the
same Python program.
Using threadpoolctl may cause crashes or deadlocks. For more
information and possible workarounds, please see
    https://github.com/joblib/threadpoolctl/blob/master/multiple_openmp.md

100%|██████████| 5/5 [07:42<00:00, 92.42s/it]


| | id | category | paper | text |
|---|---|---|---|---|
| 0 | e1ccff07e5c99304d9674e3bb8b21a9f3ad63a70834970... | deeplearning | attention_is_all_you_need.pdf | We used a beam size of 21and= 0:3\nfor both W... |
| 1 | e1ccff07e5c99304d9674e3bb8b21a9f3ad63a70834970... | deeplearning | attention_is_all_you_need.pdf | The Transformer allows for signiﬁcantly more p... |
| 2 | e1ccff07e5c99304d9674e3bb8b21a9f3ad63a70834970... | deeplearning | attention_is_all_you_need.pdf | [4]Jianpeng Cheng, Li Dong, and Mirella Lapata. |
| 3 | e1ccff07e5c99304d9674e3bb8b21a9f3ad63a70834970... | deeplearning | attention_is_all_you_need.pdf | Attention Is All You Need\nAshish Vaswani\nGo... |
| 4 | e1ccff07e5c99304d9674e3bb8b21a9f3ad63a70834970... | deeplearning | attention_is_all_you_need.pdf | At each step the model is auto-regressive\n[10... |


In [36]:
df_semantic_chunking.to_csv(
    os.path.join(project_path, 'data', 'testing', 'semantic_chunking.csv')
)

### Sequential semantic chunking

In [6]:
# Creating chunks
chunks = []
for doc in tqdm(sample):
    index = doc.get('index')
    category = doc.get('category')
    paper = doc.get('paper')
    
    doc_id = hashlib.sha256(
        f'{category}-{paper}-{index}'.encode('utf-8')
    ).hexdigest()
    
    pdf_path = os.path.join(
        project_path, 'docs', category, paper
    )
    
    text = extract_text_from_pdf(pdf_path)
    
    doc_chunks = get_sequential_semantic_chunks(text, doc_id)
    
    for doc_chunk in doc_chunks:                
        chunks.append({
            'id': f"{doc_id}-{doc_chunk['chunk']}",  # double quotes outside: nested same-type quotes fail before Python 3.12
            'category':category,
            'paper': paper,
            'text': doc_chunk['text']
        })

df_sequential_semantic_chunking = pd.DataFrame(chunks)
df_sequential_semantic_chunking.head()

100%|██████████| 5/5 [2:11:10<00:00, 1574.14s/it]


| | id | category | paper | text |
|---|---|---|---|---|
| 0 | e1ccff07e5c99304d9674e3bb8b21a9f3ad63a70834970... | deeplearning | attention_is_all_you_need.pdf | Attention Is All You Need\nAshish Vaswani\nGo... |
| 1 | e1ccff07e5c99304d9674e3bb8b21a9f3ad63a70834970... | deeplearning | attention_is_all_you_need.pdf | This\nconsists of two linear transformations w... |
| 2 | e1ccff07e5c99304d9674e3bb8b21a9f3ad63a70834970... | deeplearning | attention_is_all_you_need.pdf | We\ntrained the base models for a total of 100... |
| 3 | e1ccff07e5c99304d9674e3bb8b21a9f3ad63a70834970... | deeplearning | attention_is_all_you_need.pdf | We\nused beam search with a beam size of 4and ... |
| 4 | e1ccff07e5c99304d9674e3bb8b21a9f3ad63a70834970... | deeplearning | attention_is_all_you_need.pdf | While single-head\nattention is 0.9 BLEU worse... |


In [8]:
df_sequential_semantic_chunking.to_csv(
    os.path.join(project_path, 'data', 'testing', 'sequential_semantic_chunking.csv')
)