# Chunking and search algorithms test

We consider three preprocessing strategies to split our documents into chunks:

* Sentence splitting
* Semantic chunking
* Sequential semantic chunking

On the other hand, we contemplate three retrieval scenarios, implemented directly in code (the expected volume of data for this project is not large enough for a vector DB to be necessary, and the implementation is useful for understanding the concepts):

* Word matching using a TF-IDF vectorizer (as seen in the course in the minsearch implementation)
* Hybrid search, implementing the embeddings with sentence transformers.
* Hybrid search with RRF.
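
As a point of reference for the last scenario, reciprocal rank fusion (RRF) merges several ranked lists by giving each document the sum of 1 / (k + rank) over the lists in which it appears. The sketch below is a generic illustration, not the code in src.rag; the function name and the toy document ids are hypothetical, and k=60 is just a commonly used default.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of document ids (best first) into one ranking.

    Hypothetical helper for illustration: each document scores
    sum(1 / (k + rank)) over the lists it appears in, ranks starting at 1.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by id to stay deterministic
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


keyword_ranking = ["a", "b", "c"]   # e.g. from word matching
semantic_ranking = ["b", "c", "d"]  # e.g. from embedding similarity
fused = reciprocal_rank_fusion([keyword_ranking, semantic_ranking], k=60)
print(fused)
```

Documents that appear near the top of both lists ("b" and "c" here) end up ahead of documents that appear in only one list.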

This notebook compares how well each scenario supports a good retrieval strategy. Although verifying the end-to-end behavior by evaluating the final RAG response with each methodology would be preferable in the optimization process, given time restrictions we will only explore retrieval performance and use the MRR (mean reciprocal rank, reported as 'mmr' in the outputs) to optimize the hyperparameters, using a ground-truth dataset generated for each chunking strategy. Given the best parameters for the best-performing search method for each chunking alternative, we will compare the chunking strategies in a future notebook.
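
The evaluation cells in this notebook report two numbers, a hit rate and a rank-based score that appears to be the mean reciprocal rank. A minimal sketch of both metrics follows, assuming each query yields a list of booleans marking whether each returned document is the expected one; the function names and toy data are hypothetical and this is not the code in src.evaluation.

```python
def hit_rate(relevance):
    """Fraction of queries whose expected document appears in the results."""
    return sum(any(row) for row in relevance) / len(relevance)


def mean_reciprocal_rank(relevance):
    """Average of 1 / rank of the first hit per query (0 when there is no hit)."""
    total = 0.0
    for row in relevance:
        for rank, is_hit in enumerate(row, start=1):
            if is_hit:
                total += 1.0 / rank
                break
    return total / len(relevance)


# One row per query, one flag per returned document (toy data)
relevance = [
    [False, True, False],   # expected document found at rank 2
    [True, False, False],   # found at rank 1
    [False, False, False],  # not found
    [False, False, True],   # found at rank 3
]
print(hit_rate(relevance), mean_reciprocal_rank(relevance))
```

The hit rate only asks whether the expected chunk was retrieved at all, while the reciprocal rank also rewards placing it near the top, which is why it is the more informative target for tuning.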

## Libraries

In [1]:
import os
import sys
import json
import pandas as pd

from dotenv import load_dotenv

project_path = os.path.dirname(os.getcwd())
sys.path.append(project_path)

from src.rag import RAG
from src.evaluation import evaluate

load_dotenv()

GOOGLE_API_KEY = os.environ['GOOGLE_API_KEY']



Now we are ready to check the performance of our search algorithms using the ground truth dataset. For this we will evaluate both our minsearch (word matching) and hybrid search (word matching and semantic search) with each set of chunks and ground truth data. Notice that since rrf is a parameter of our hybrid search approach, we evaluate it within the optimization process to get the best parameters.
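
To make the role of a mixing weight such as alpha concrete, here is a minimal sketch of one common way to blend keyword and semantic scores. It assumes that both score sets are min-max normalized and that alpha weights the semantic side; the actual convention in src.rag may differ, and all names below are hypothetical.

```python
def min_max_normalize(scores):
    """Rescale a non-empty {doc_id: score} mapping to the [0, 1] range."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 0.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def hybrid_scores(keyword_scores, semantic_scores, alpha=0.5):
    """Blend scores as alpha * semantic + (1 - alpha) * keyword (assumed convention)."""
    kw = min_max_normalize(keyword_scores)
    sem = min_max_normalize(semantic_scores)
    doc_ids = set(kw) | set(sem)
    return {
        d: alpha * sem.get(d, 0.0) + (1 - alpha) * kw.get(d, 0.0)
        for d in doc_ids
    }


keyword_scores = {"a": 2.0, "b": 1.0, "c": 0.0}   # toy TF-IDF style scores
semantic_scores = {"a": 0.2, "b": 0.9, "c": 0.1}  # toy cosine similarities

for alpha in (0.1, 0.9):
    blended = hybrid_scores(keyword_scores, semantic_scores, alpha=alpha)
    ranking = sorted(blended, key=blended.get, reverse=True)
    print(alpha, ranking)
```

Under this convention a small alpha keeps the ranking close to the keyword ordering, and a large alpha lets the semantic similarity dominate.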

## Sentence splitting

### Data indexing



In [3]:
df_sentence_splitting = pd.read_csv(
    os.path.join(project_path, 'data', 'testing', 'sentence_splitting.csv')
)
print(f"The data set contains: {df_sentence_splitting.shape[0]} sentences")
sentece_splitting_docs = df_sentence_splitting.to_dict(orient="records")
sentece_splitting_docs[:3]

The data set contains: 2929 sentences


[{'id': 'e1ccff07e5c99304d9674e3bb8b21a9f3ad63a708349704476b45c169163a8b4-1',
  'category': 'deeplearning',
  'paper': 'attention_is_all_you_need.pdf',
  'text': 'Attention Is All You Need\nAshish Vaswani\x03\nGoogle Brain\navaswani@google.comNoam Shazeer\x03\nGoogle Brain\nnoam@google.comNiki Parmar\x03\nGoogle Research\nnikip@google.comJakob Uszkoreit\x03\nGoogle Research\nusz@google.com\nLlion Jones\x03\nGoogle Research\nllion@google.comAidan N. Gomez\x03y\nUniversity of Toronto\naidan@cs.toronto.eduŁukasz Kaiser\x03\nGoogle Brain\nlukaszkaiser@google.com\nIllia Polosukhin\x03z\nillia.polosukhin@gmail.com\nAbstract\nThe dominant sequence transduction models are based on complex recurrent or\nconvolutional neural networks that include an encoder and a decoder.'},
 {'id': 'e1ccff07e5c99304d9674e3bb8b21a9f3ad63a708349704476b45c169163a8b4-2',
  'category': 'deeplearning',
  'paper': 'attention_is_all_you_need.pdf',
  'text': 'The best\nperforming models also connect the encoder and deco

In [5]:
# Initialize the RAG
ss_rag = RAG(api_key=GOOGLE_API_KEY)

# Define search fields
text_fields = [
    'category',
    'paper',
    'text'
]
keyword_fields = ['id']

# Ingest the documents
ss_rag.minsearch_index(
    docs=sentece_splitting_docs,
    text_fields=text_fields,
    keyword_fields=keyword_fields
)
ss_rag.hybserach_index(
    docs=sentece_splitting_docs,
    text_fields=text_fields,
    keyword_fields=keyword_fields
)



Let's check that both algorithms are working:

In [9]:
query = "What specific architectural or training techniques enabled the model to achieve such a significant BLEU score improvement compared to previous single models?"

In [10]:
ss_rag.minsearch(query, num_results=2)

[{'id': 'e1ccff07e5c99304d9674e3bb8b21a9f3ad63a708349704476b45c169163a8b4-157',
  'category': 'deeplearning',
  'paper': 'attention_is_all_you_need.pdf',
  'text': 'On the WMT 2014 English-to-French translation task, our big model achieves a BLEU score of 41:0,\noutperforming all of the previously published single models, at less than 1=4the training cost of the\nprevious state-of-the-art model.'},
 {'id': 'e1ccff07e5c99304d9674e3bb8b21a9f3ad63a708349704476b45c169163a8b4-6',
  'category': 'deeplearning',
  'paper': 'attention_is_all_you_need.pdf',
  'text': 'On the WMT 2014 English-to-French translation task,\nour model establishes a new single-model state-of-the-art BLEU score of 41.8 after\ntraining for 3.5 days on eight GPUs, a small fraction of the training costs of the\nbest models from the literature.'}]

In [11]:
ss_rag.hybsearch(query, num_results=2)

[{'id': 'e1ccff07e5c99304d9674e3bb8b21a9f3ad63a708349704476b45c169163a8b4-6',
  'category': 'deeplearning',
  'paper': 'attention_is_all_you_need.pdf',
  'text': 'On the WMT 2014 English-to-French translation task,\nour model establishes a new single-model state-of-the-art BLEU score of 41.8 after\ntraining for 3.5 days on eight GPUs, a small fraction of the training costs of the\nbest models from the literature.'},
 {'id': 'e1ccff07e5c99304d9674e3bb8b21a9f3ad63a708349704476b45c169163a8b4-157',
  'category': 'deeplearning',
  'paper': 'attention_is_all_you_need.pdf',
  'text': 'On the WMT 2014 English-to-French translation task, our big model achieves a BLEU score of 41:0,\noutperforming all of the previously published single models, at less than 1=4the training cost of the\nprevious state-of-the-art model.'}]

### Ground truth data

In [12]:
df_ss_gt = pd.read_csv(
    os.path.join(project_path, 'data', 'testing', 'ground-truth-sentence-splitting.csv')
)
ss_gt_docs = df_ss_gt.to_dict(orient="records")
ss_gt_docs[:3]

[{'id': 'e1ccff07e5c99304d9674e3bb8b21a9f3ad63a708349704476b45c169163a8b4-157',
  'question': 'What specific architectural or training techniques enabled the model to achieve such a significant BLEU score improvement compared to previous single models?'},
 {'id': 'e1ccff07e5c99304d9674e3bb8b21a9f3ad63a708349704476b45c169163a8b4-157',
  'question': 'How does the reduced training cost of this model compare to the cost of ensemble models, and what are the trade-offs involved in choosing one over the other?'},
 {'id': 'e1ccff07e5c99304d9674e3bb8b21a9f3ad63a708349704476b45c169163a8b4-157',
  'question': "What are the limitations of the model's performance on the WMT 2014 English-to-French translation task, and how might these be addressed in future research?"}]

We can start by evaluating the algorithms with default parameters:

In [13]:
evaluate(ss_gt_docs, lambda q: ss_rag.minsearch(q['question']))

100%|██████████| 770/770 [00:02<00:00, 372.41it/s]


{'hit_rate': 0.43636363636363634, 'mmr': 0.34350803957946807}

In [14]:
evaluate(ss_gt_docs, lambda q: ss_rag.hybsearch(q['question']))

100%|██████████| 770/770 [01:39<00:00,  7.70it/s]


{'hit_rate': 0.4688311688311688, 'mmr': 0.34527674706246125}

### Parameters optimization

Now we will use the minserach_fit and hybsearch_fit methods implemented within the RAG class, which use a simple optimization process (a random search within the parameter space) to find the best parameters for each search algorithm. The best parameters for the best-performing algorithm will be saved for the comparison of the RAG's performance with each chunking strategy.
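
The random search idea can be sketched as follows. This is a generic illustration under stated assumptions, not the RAG class implementation: the function name, the way boolean and integer ranges are sampled, the iteration count and the toy objective are all hypothetical.

```python
import random


def random_search(objective, param_ranges, n_iterations=20, seed=0):
    """Sample parameters at random and keep the best-scoring combination.

    Hypothetical helper: floats are drawn uniformly, ints inclusively,
    and boolean ranges are treated as a two-way choice.
    """
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_iterations):
        params = {}
        for name, (low, high) in param_ranges.items():
            if isinstance(low, bool):  # check bool first: bool is a subclass of int
                params[name] = rng.choice([low, high])
            elif isinstance(low, int):
                params[name] = rng.randint(low, high)
            else:
                params[name] = rng.uniform(low, high)
        score = objective(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Toy objective standing in for the retrieval metric: best when text is near 2.0
def toy_objective(params):
    return -(params["text"] - 2.0) ** 2


ranges = {"text": (0.0, 3.0), "rrf": (False, True), "k": (30, 100)}
best_params, best_score = random_search(toy_objective, ranges, n_iterations=50, seed=1)
print(best_params, best_score)
```

In the notebook, the objective would be the retrieval metric computed over the ground truth questions, so every sampled combination costs one full evaluation pass.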

In [15]:
param_ranges = {
    'category': (0.0, 3.0),
    'paper': (0.0, 3.0),
    'text': (0.0, 3.0)
}
ss_rag.minserach_fit(ground_truth=ss_gt_docs, param_ranges=param_ranges)

100%|██████████| 770/770 [00:02<00:00, 339.70it/s]

Model fitted with provided ground truth data.
Best parameters are:
{'category': 1.3363782475591928, 'paper': 1.723769131509511, 'text': 1.4854316318119176}
Best score was:
0.34350803957946807





In [18]:
param_ranges = {
    'category': (0.0, 3.0),
    'paper': (0.0, 3.0),
    'text': (0.0, 3.0),
    'alpha': (0.0, 1.0),
    'rrf': (False, True),
    'k': (30, 100)
}
ss_rag.hybsearch_fit(ground_truth=ss_gt_docs, param_ranges=param_ranges)

100%|██████████| 770/770 [01:42<00:00,  7.55it/s]

Model fitted with provided ground truth data.
Best parameters are:
{'category': 2.929045533967244, 'paper': 2.948397953982448, 'text': 2.73834686069514, 'alpha': 0.1003540150371196, 'rrf': 0, 'k': 86}
Best score was:
0.3673907441764584





In [19]:
# Best minsearch parameters
ss_rag.boost_dict

{'category': 1.3363782475591928,
 'paper': 1.723769131509511,
 'text': 1.4854316318119176}

In [20]:
# Evaluation after optimization
evaluate(ss_gt_docs, lambda q: ss_rag.minsearch(q['question']))

100%|██████████| 770/770 [00:02<00:00, 346.36it/s]


{'hit_rate': 0.43636363636363634, 'mmr': 0.34350803957946807}

In [21]:
# Best hybsearch parameters
ss_h_best_params = {
    "boost_dict":ss_rag.h_boost_dict,
    "alpha":ss_rag.alpha,
    "rrf":ss_rag.rrf,
    "k":ss_rag.k
}
ss_h_best_params

{'boost_dict': {'category': 2.929045533967244,
  'paper': 2.948397953982448,
  'text': 2.73834686069514},
 'alpha': 0.1003540150371196,
 'rrf': 0,
 'k': 86}

In [31]:
# Evaluation after optimization
evaluate(ss_gt_docs, lambda q: ss_rag.hybsearch(q['question']))

100%|██████████| 770/770 [01:38<00:00,  7.80it/s]


{'hit_rate': 0.4662337662337662, 'mmr': 0.3673907441764584}

In [32]:
evaluate(ss_gt_docs, lambda q: ss_rag.hybsearch(q['question'], num_results=20))

100%|██████████| 770/770 [01:37<00:00,  7.91it/s]


{'hit_rate': 0.5025974025974026, 'mmr': 0.36991665767142207}

### Saving best parameters

The best-performing algorithm was the hybrid search, so next we save its parameters.

In [29]:
ss_params_file = os.path.join(project_path, 'src', 'parameters', 'ss_best_params.json')
with open(ss_params_file, "w") as file:
    json.dump(ss_h_best_params, file)
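
Downstream code can read the saved parameters back with json.load. The round trip below is a minimal sketch that writes to a temporary folder rather than the project's parameters folder; the values are illustrative rounded numbers, not the fitted ones, and the variable names are hypothetical.

```python
import json
import os
import tempfile

# Illustrative values only (rounded); the real ones come from the fit step
example_params = {
    "boost_dict": {"category": 2.9, "paper": 2.9, "text": 2.7},
    "alpha": 0.1,
    "rrf": 0,
    "k": 86,
}

with tempfile.TemporaryDirectory() as tmp_dir:
    params_file = os.path.join(tmp_dir, "best_params.json")
    # Write the parameters as JSON
    with open(params_file, "w") as file:
        json.dump(example_params, file, indent=4)
    # Read them back, as a later notebook or script would
    with open(params_file) as file:
        loaded_params = json.load(file)

print(loaded_params["alpha"], loaded_params["boost_dict"]["text"])
```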

## Semantic chunking


### Data indexing

In [2]:
df_semantic_chunking = pd.read_csv(
    os.path.join(project_path, 'data', 'testing', 'semantic_chunking.csv')
)
print(f"The data set contains: {df_semantic_chunking.shape[0]} chunks")
sc_docs = df_semantic_chunking.to_dict(orient="records")
sc_docs[:3]

The data set contains: 269 chunks


[{'id': 'e1ccff07e5c99304d9674e3bb8b21a9f3ad63a708349704476b45c169163a8b4-1',
  'category': 'deeplearning',
  'paper': 'attention_is_all_you_need.pdf',
  'text': 'We used a beam size of 21and\x0b= 0:3\nfor both WSJ only and the semi-supervised setting.'},
 {'id': 'e1ccff07e5c99304d9674e3bb8b21a9f3ad63a708349704476b45c169163a8b4-2',
  'category': 'deeplearning',
  'paper': 'attention_is_all_you_need.pdf',
  'text': 'The Transformer allows for signiﬁcantly more parallelization and can reach a new state of the art in\ntranslation quality after being trained for as little as twelve hours on eight P100 GPUs.\nFor translation tasks, the Transformer can be trained signiﬁcantly faster than architectures based\non recurrent or convolutional layers.'},
 {'id': 'e1ccff07e5c99304d9674e3bb8b21a9f3ad63a708349704476b45c169163a8b4-3',
  'category': 'deeplearning',
  'paper': 'attention_is_all_you_need.pdf',
  'text': '[4]Jianpeng Cheng, Li Dong, and Mirella Lapata.'}]

In [3]:
# Initialize the RAG
sc_rag = RAG(api_key=GOOGLE_API_KEY)

# Define search fields
text_fields = [
    'category',
    'paper',
    'text'
]
keyword_fields = ['id']

# Ingest the documents
sc_rag.minsearch_index(
    docs=sc_docs,
    text_fields=text_fields,
    keyword_fields=keyword_fields
)
sc_rag.hybserach_index(
    docs=sc_docs,
    text_fields=text_fields,
    keyword_fields=keyword_fields
)



### Ground truth data

In [4]:
df_sc_gt = pd.read_csv(
    os.path.join(project_path, 'data', 'testing', 'ground-truth-semantic-chunking.csv')
)
sc_gt_docs = df_sc_gt.to_dict(orient="records")
sc_gt_docs[:3]

[{'id': 'e1ccff07e5c99304d9674e3bb8b21a9f3ad63a708349704476b45c169163a8b4-31',
  'question': "What are the key differences between the research presented in the papers cited with the 'CoRR' prefix and those published by Curran Associates, Inc.?"},
 {'id': 'e1ccff07e5c99304d9674e3bb8b21a9f3ad63a708349704476b45c169163a8b4-31',
  'question': "How do the papers cited with 'abs/1409.0473' and 'abs/1703.03906' contribute to the development of the attention mechanism in deep learning?"},
 {'id': 'e1ccff07e5c99304d9674e3bb8b21a9f3ad63a708349704476b45c169163a8b4-31',
  'question': "What are the potential implications of the research presented in 'abs/1406.1078' and 'abs/1412.3555' for the field of natural language processing?"}]

In [5]:
evaluate(sc_gt_docs, lambda q: sc_rag.minsearch(q['question']))

100%|██████████| 770/770 [00:01<00:00, 420.01it/s]


{'hit_rate': 0.7077922077922078, 'mmr': 0.5527313955885382}

In [6]:
evaluate(sc_gt_docs, lambda q: sc_rag.hybsearch(q['question']))

100%|██████████| 770/770 [01:24<00:00,  9.11it/s]


{'hit_rate': 0.7376623376623377, 'mmr': 0.5253035456606884}

### Parameters optimization

In [7]:
param_ranges = {
    'category': (0.0, 3.0),
    'paper': (0.0, 3.0),
    'text': (0.0, 3.0)
}
sc_rag.minserach_fit(ground_truth=sc_gt_docs, param_ranges=param_ranges)

100%|██████████| 770/770 [00:02<00:00, 337.27it/s]

Model fitted with provided ground truth data.
Best parameters are:
{'category': 0.9353338909809019, 'paper': 2.471768789824002, 'text': 2.815544505155288}
Best score was:
0.5527313955885382





In [8]:
param_ranges = {
    'category': (0.0, 3.0),
    'paper': (0.0, 3.0),
    'text': (0.0, 3.0),
    'alpha': (0.0, 1.0),
    'rrf': (False, True),
    'k': (30, 100)
}
sc_rag.hybsearch_fit(ground_truth=sc_gt_docs, param_ranges=param_ranges)

100%|██████████| 770/770 [01:31<00:00,  8.44it/s]

Model fitted with provided ground truth data.
Best parameters are:
{'category': 2.4542322611256355, 'paper': 1.4225317885901219, 'text': 1.9479982311584607, 'alpha': 0.1780985829329771, 'rrf': 0, 'k': 67}
Best score was:
0.5758508554937124





### Saving best parameters

In [9]:
# Best minsearch parameters
sc_rag.boost_dict

{'category': 0.9353338909809019,
 'paper': 2.471768789824002,
 'text': 2.815544505155288}

In [10]:
# Evaluation after optimization
evaluate(sc_gt_docs, lambda q: sc_rag.minsearch(q['question']))

100%|██████████| 770/770 [00:01<00:00, 388.41it/s]


{'hit_rate': 0.7077922077922078, 'mmr': 0.5527313955885382}

In [11]:
# Best hybsearch parameters
sc_h_best_params = {
    "boost_dict":sc_rag.h_boost_dict,
    "alpha":sc_rag.alpha,
    "rrf":sc_rag.rrf,
    "k":sc_rag.k
}
sc_h_best_params

{'boost_dict': {'category': 2.4542322611256355,
  'paper': 1.4225317885901219,
  'text': 1.9479982311584607},
 'alpha': 0.1780985829329771,
 'rrf': 0,
 'k': 67}


In [12]:
# Evaluation after optimization
evaluate(sc_gt_docs, lambda q: sc_rag.hybsearch(q['question']))

100%|██████████| 770/770 [00:01<00:00, 395.20it/s]


{'hit_rate': 0.7077922077922078, 'mmr': 0.5527313955885382}

In [13]:
sc_params_file = os.path.join(project_path, 'src', 'parameters', 'sc_best_params.json')
with open(sc_params_file, "w") as file:
    json.dump(sc_h_best_params, file)

## Sequential semantic chunking

### Data indexing

In [14]:
df_sequential_semantic_chunking = pd.read_csv(
    os.path.join(project_path, 'data', 'testing', 'sequential_semantic_chunking.csv')
)
print(f"The data set contains: {df_sequential_semantic_chunking.shape[0]} chunks")
ssc_docs = df_sequential_semantic_chunking.to_dict(orient="records")
ssc_docs[:3]

The data set contains: 154 chunks


[{'Unnamed: 0': 0,
  'id': 'e1ccff07e5c99304d9674e3bb8b21a9f3ad63a708349704476b45c169163a8b4-1',
  'category': 'deeplearning',
  'paper': 'attention_is_all_you_need.pdf',
  'text': 'Attention Is All You Need\nAshish Vaswani\x03\nGoogle Brain\navaswani@google.comNoam Shazeer\x03\nGoogle Brain\nnoam@google.comNiki Parmar\x03\nGoogle Research\nnikip@google.comJakob Uszkoreit\x03\nGoogle Research\nusz@google.com\nLlion Jones\x03\nGoogle Research\nllion@google.comAidan N. Gomez\x03y\nUniversity of Toronto\naidan@cs.toronto.eduŁukasz Kaiser\x03\nGoogle Brain\nlukaszkaiser@google.com\nIllia Polosukhin\x03z\nillia.polosukhin@gmail.com\nAbstract\nThe dominant sequence transduction models are based on complex recurrent or\nconvolutional neural networks that include an encoder and a decoder.\nThe best\nperforming models also connect the encoder and decoder through an attention\nmechanism.\nWe propose a new simple network architecture, the Transformer,\nbased solely on attention mechanisms, dispen

In [15]:
# Initialize the RAG
ssc_rag = RAG(api_key=GOOGLE_API_KEY)

# Define search fields
text_fields = [
    'category',
    'paper',
    'text'
]
keyword_fields = ['id']

# Ingest the documents
ssc_rag.minsearch_index(
    docs=ssc_docs,
    text_fields=text_fields,
    keyword_fields=keyword_fields
)
ssc_rag.hybserach_index(
    docs=ssc_docs,
    text_fields=text_fields,
    keyword_fields=keyword_fields
)



### Ground truth data

In [16]:
df_ssc_gt = pd.read_csv(
    os.path.join(project_path, 'data', 'testing', 'ground-truth-sequential-semantic-chunking.csv')
)
ssc_gt_docs = df_ssc_gt.to_dict(orient="records")
ssc_gt_docs[:3]

[{'id': 'e1ccff07e5c99304d9674e3bb8b21a9f3ad63a708349704476b45c169163a8b4-1',
  'question': 'What specific challenges or limitations of recurrent neural networks in sequence transduction tasks are addressed by the Transformer architecture, and how does the reliance on attention mechanisms overcome these limitations?'},
 {'id': 'e1ccff07e5c99304d9674e3bb8b21a9f3ad63a708349704476b45c169163a8b4-1',
  'question': 'The paper mentions that the Transformer model achieves superior performance in machine translation tasks. What specific aspects of the Transformer architecture contribute to this improved quality, and how do they differ from traditional encoder-decoder models?'},
 {'id': 'e1ccff07e5c99304d9674e3bb8b21a9f3ad63a708349704476b45c169163a8b4-1',
  'question': 'The Transformer architecture is described as being more parallelizable than recurrent models. Explain how this parallelization is achieved and what implications it has for training efficiency and scalability.'}]

In [17]:
evaluate(ssc_gt_docs, lambda q: ssc_rag.minsearch(q['question']))

100%|██████████| 770/770 [00:01<00:00, 399.03it/s]


{'hit_rate': 0.6506493506493507, 'mmr': 0.4196438878581736}

In [18]:
evaluate(ssc_gt_docs, lambda q: ssc_rag.hybsearch(q['question']))

100%|██████████| 770/770 [01:28<00:00,  8.69it/s]


{'hit_rate': 0.7298701298701299, 'mmr': 0.47098433312718985}

### Parameters optimization

In [19]:
param_ranges = {
    'category': (0.0, 3.0),
    'paper': (0.0, 3.0),
    'text': (0.0, 3.0)
}
ssc_rag.minserach_fit(ground_truth=ssc_gt_docs, param_ranges=param_ranges)

100%|██████████| 770/770 [00:01<00:00, 385.92it/s]

Model fitted with provided ground truth data.
Best parameters are:
{'category': 0.08962309625236509, 'paper': 1.5419571656285616, 'text': 1.0558143115463618}
Best score was:
0.4196438878581736





In [20]:
param_ranges = {
    'category': (0.0, 3.0),
    'paper': (0.0, 3.0),
    'text': (0.0, 3.0),
    'alpha': (0.0, 1.0),
    'rrf': (False, True),
    'k': (30, 100)
}
ssc_rag.hybsearch_fit(ground_truth=ssc_gt_docs, param_ranges=param_ranges)

100%|██████████| 770/770 [01:30<00:00,  8.52it/s]

Model fitted with provided ground truth data.
Best parameters are:
{'category': 2.903764967645502, 'paper': 0.6323336051521776, 'text': 2.1479318697981338, 'alpha': 0.3737468123409873, 'rrf': 0, 'k': 68}
Best score was:
0.5116094619666043





### Saving best parameters

In [21]:
# Best minsearch parameters
ssc_rag.boost_dict

{'category': 0.08962309625236509,
 'paper': 1.5419571656285616,
 'text': 1.0558143115463618}

In [22]:
# Evaluation after optimization
evaluate(ssc_gt_docs, lambda q: ssc_rag.minsearch(q['question']))

100%|██████████| 770/770 [00:01<00:00, 388.49it/s]


{'hit_rate': 0.6506493506493507, 'mmr': 0.4196438878581736}

In [23]:
# Best hybsearch parameters
ssc_h_best_params = {
    "boost_dict":ssc_rag.h_boost_dict,
    "alpha":ssc_rag.alpha,
    "rrf":ssc_rag.rrf,
    "k":ssc_rag.k
}
ssc_h_best_params

{'boost_dict': {'category': 2.903764967645502,
  'paper': 0.6323336051521776,
  'text': 2.1479318697981338},
 'alpha': 0.3737468123409873,
 'rrf': 0,
 'k': 68}

In [24]:
# Evaluation after optimization
evaluate(ssc_gt_docs, lambda q: ssc_rag.hybsearch(q['question']))

100%|██████████| 770/770 [01:35<00:00,  8.10it/s]


{'hit_rate': 0.7571428571428571, 'mmr': 0.5116094619666043}

In [26]:
ssc_params_file = os.path.join(project_path, 'src', 'parameters', 'ssc_best_params.json')
with open(ssc_params_file, "w") as file:
    json.dump(ssc_h_best_params, file, indent=4)