# Ground truth data generation

As shown in the notebook "create_preprocessors.ipynb", we create three different preprocessing strategies to split our documents into chunks (as mentioned, proposal chunking is not considered):

* Sentence splitting
* Semantic chunking
* Sequential semantic chunking

This notebook creates the ground truth for each strategy so that they can be evaluated.

## Libraries



In [1]:
import os
import sys
import time 
import json
import random
import pandas as pd
import hashlib

from tqdm import tqdm
from dotenv import load_dotenv

project_path = os.path.dirname(os.getcwd())
sys.path.append(project_path)

from src.preprocess import (
    extract_text_from_pdf,
    get_sentences,
    get_semantic_chunks,
    get_sequential_semantic_chunks,
)
from src.rag import RAG

load_dotenv()

GOOGLE_API_KEY = os.environ['GOOGLE_API_KEY']



## Ground truth data

For the ground truth data we choose 5 documents at random, chunk them, and then generate questions for each chunk. This means a separate ground truth set is generated for each chunking strategy.


In [6]:
doc_categories = os.listdir(os.path.join(project_path, 'docs'))
papers = []
index = 0
for category in doc_categories:
    for document in os.listdir(os.path.join(project_path, 'docs', category)):
        index += 1
        papers.append({
            "index": index,
            "category":category,
            "paper":document
        })

In [7]:
papers[:3]

[{'index': 1,
  'category': 'deeplearning',
  'paper': 'an_overview_of_gradient_descent_optimization_algorithms.pdf'},
 {'index': 2,
  'category': 'deeplearning',
  'paper': 'attention_is_all_you_need.pdf'},
 {'index': 3,
  'category': 'deeplearning',
  'paper': 'dense_x_retieval_what_retrieval_granularity_shoud_we_use.pdf'}]

In [8]:
random.seed(123)
sample = random.sample(papers, 5)

In [9]:
sample

[{'index': 2,
  'category': 'deeplearning',
  'paper': 'attention_is_all_you_need.pdf'},
 {'index': 9,
  'category': 'deeplearning',
  'paper': 'the_matrix_calculus_you_need_for_deeplearning.pdf'},
 {'index': 3,
  'category': 'deeplearning',
  'paper': 'dense_x_retieval_what_retrieval_granularity_shoud_we_use.pdf'},
 {'index': 14,
  'category': 'time_series',
  'paper': 'another_lookat_measures_of_forecast_accuracy.pdf'},
 {'index': 4,
  'category': 'deeplearning',
  'paper': 'knowledge_card_filling_llms_knowledge_gaps_with_plug_in_specialied_language_models.pdf'}]
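
The seeded draw above only repeats across machines if `papers` is built in the same order each time, and `os.listdir` returns entries in arbitrary order. Below is a minimal sketch of one way to make the draw independent of directory order. The helper `sample_papers` and the file names are hypothetical, and this is not how the sample shown above was produced.

```python
import random

def sample_papers(papers, k, seed=123):
    """Return k papers chosen reproducibly, independent of input order."""
    # Sort by (category, paper) so the draw does not depend on directory order.
    ordered = sorted(papers, key=lambda p: (p["category"], p["paper"]))
    # A dedicated generator keeps the global `random` state unchanged.
    rng = random.Random(seed)
    return rng.sample(ordered, k)

# Hypothetical records, shaped like the entries in `papers`.
demo_papers = [
    {"category": "time_series", "paper": "b.pdf"},
    {"category": "deeplearning", "paper": "a.pdf"},
    {"category": "deeplearning", "paper": "c.pdf"},
    {"category": "time_series", "paper": "d.pdf"},
]

first = sample_papers(demo_papers, 2)
second = sample_papers(list(reversed(demo_papers)), 2)
print(first == second)
```

Both calls select the same two papers even though the input order differs, because the list is sorted before sampling.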

## Getting the chunked data

### Sentence splitting

In [10]:
# Creating chunks
chunks = []
for doc in tqdm(sample):
    index = doc.get('index')
    category = doc.get('category')
    paper = doc.get('paper')
    
    doc_id = hashlib.sha256(
        f'{category}-{paper}-{index}'.encode('utf-8')
    ).hexdigest()
    
    pdf_path = os.path.join(
        project_path, 'docs', category, paper
    )
    
    text = extract_text_from_pdf(pdf_path)
    
    doc_chunks = get_sentences(text, doc_id)
    
    for doc_chunk in doc_chunks:
        chunks.append({
            # Double quotes outside so the inner single quotes parse on Python < 3.12
            'id': f"{doc_id}-{doc_chunk['chunk']}",
            'category': category,
            'paper': paper,
            'text': doc_chunk['text']
        })
 

100%|██████████| 5/5 [00:04<00:00,  1.08it/s]


In [11]:
df_sentence_splitting = pd.DataFrame(chunks)

In [12]:
df_sentence_splitting.head()

| | id | category | paper | text |
|---|---|---|---|---|
| 0 | e1ccff07e5c99304d9674e3bb8b21a9f3ad63a70834970... | deeplearning | attention_is_all_you_need.pdf | Attention Is All You Need\nAshish Vaswani\nGo... |
| 1 | e1ccff07e5c99304d9674e3bb8b21a9f3ad63a70834970... | deeplearning | attention_is_all_you_need.pdf | The best\nperforming models also connect the e... |
| 2 | e1ccff07e5c99304d9674e3bb8b21a9f3ad63a70834970... | deeplearning | attention_is_all_you_need.pdf | We propose a new simple network architecture, ... |
| 3 | e1ccff07e5c99304d9674e3bb8b21a9f3ad63a70834970... | deeplearning | attention_is_all_you_need.pdf | Experiments on two machine translation tasks s... |
| 4 | e1ccff07e5c99304d9674e3bb8b21a9f3ad63a70834970... | deeplearning | attention_is_all_you_need.pdf | Our model achieves 28.4 BLEU on the WMT 2014 E... |


In [13]:
df_sentence_splitting.to_csv(
    os.path.join(project_path, 'data', 'testing', 'sentence_splitting.csv'),
    index=False
)

### Semantic chunking

In [15]:
# Creating chunks
chunks = []
for doc in tqdm(sample):
    index = doc.get('index')
    category = doc.get('category')
    paper = doc.get('paper')
    
    doc_id = hashlib.sha256(
        f'{category}-{paper}-{index}'.encode('utf-8')
    ).hexdigest()
    
    pdf_path = os.path.join(
        project_path, 'docs', category, paper
    )
    
    text = extract_text_from_pdf(pdf_path)
    
    doc_chunks = get_semantic_chunks(text, doc_id)
    
    for doc_chunk in doc_chunks:
        chunks.append({
            # Double quotes outside so the inner single quotes parse on Python < 3.12
            'id': f"{doc_id}-{doc_chunk['chunk']}",
            'category': category,
            'paper': paper,
            'text': doc_chunk['text']
        })

df_semantic_chunking = pd.DataFrame(chunks)
df_semantic_chunking.head()

Found Intel OpenMP ('libiomp') and LLVM OpenMP ('libomp') loaded at
the same time. Both libraries are known to be incompatible and this
can cause random crashes or deadlocks on Linux when loaded in the
same Python program.
Using threadpoolctl may cause crashes or deadlocks. For more
information and possible workarounds, please see
    https://github.com/joblib/threadpoolctl/blob/master/multiple_openmp.md

100%|██████████| 5/5 [07:12<00:00, 86.58s/it]


| | id | category | paper | text |
|---|---|---|---|---|
| 0 | e1ccff07e5c99304d9674e3bb8b21a9f3ad63a70834970... | deeplearning | attention_is_all_you_need.pdf | We used a beam size of 21and= 0:3\nfor both W... |
| 1 | e1ccff07e5c99304d9674e3bb8b21a9f3ad63a70834970... | deeplearning | attention_is_all_you_need.pdf | The Transformer allows for significantly more p... |
| 2 | e1ccff07e5c99304d9674e3bb8b21a9f3ad63a70834970... | deeplearning | attention_is_all_you_need.pdf | [4]Jianpeng Cheng, Li Dong, and Mirella Lapata. |
| 3 | e1ccff07e5c99304d9674e3bb8b21a9f3ad63a70834970... | deeplearning | attention_is_all_you_need.pdf | Attention Is All You Need\nAshish Vaswani\nGo... |
| 4 | e1ccff07e5c99304d9674e3bb8b21a9f3ad63a70834970... | deeplearning | attention_is_all_you_need.pdf | At each step the model is auto-regressive\n[10... |


In [16]:
df_semantic_chunking.to_csv(
    os.path.join(project_path, 'data', 'testing', 'semantic_chunking.csv'),
    index=False
)

### Sequential semantic chunking

In [6]:
# Creating chunks
chunks = []
for doc in tqdm(sample):
    index = doc.get('index')
    category = doc.get('category')
    paper = doc.get('paper')
    
    doc_id = hashlib.sha256(
        f'{category}-{paper}-{index}'.encode('utf-8')
    ).hexdigest()
    
    pdf_path = os.path.join(
        project_path, 'docs', category, paper
    )
    
    text = extract_text_from_pdf(pdf_path)
    
    doc_chunks = get_sequential_semantic_chunks(text, doc_id)
    
    for doc_chunk in doc_chunks:
        chunks.append({
            # Double quotes outside so the inner single quotes parse on Python < 3.12
            'id': f"{doc_id}-{doc_chunk['chunk']}",
            'category': category,
            'paper': paper,
            'text': doc_chunk['text']
        })

df_sequential_semantic_chunking = pd.DataFrame(chunks)
df_sequential_semantic_chunking.head()

100%|██████████| 5/5 [2:11:10<00:00, 1574.14s/it]


| | id | category | paper | text |
|---|---|---|---|---|
| 0 | e1ccff07e5c99304d9674e3bb8b21a9f3ad63a70834970... | deeplearning | attention_is_all_you_need.pdf | Attention Is All You Need\nAshish Vaswani\nGo... |
| 1 | e1ccff07e5c99304d9674e3bb8b21a9f3ad63a70834970... | deeplearning | attention_is_all_you_need.pdf | This\nconsists of two linear transformations w... |
| 2 | e1ccff07e5c99304d9674e3bb8b21a9f3ad63a70834970... | deeplearning | attention_is_all_you_need.pdf | We\ntrained the base models for a total of 100... |
| 3 | e1ccff07e5c99304d9674e3bb8b21a9f3ad63a70834970... | deeplearning | attention_is_all_you_need.pdf | We\nused beam search with a beam size of 4and ... |
| 4 | e1ccff07e5c99304d9674e3bb8b21a9f3ad63a70834970... | deeplearning | attention_is_all_you_need.pdf | While single-head\nattention is 0.9 BLEU worse... |


In [8]:
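# Note: index=False is missing here, so the DataFrame index is also written
# and shows up as an extra 'Unnamed: 0' column when the file is read back.
df_sequential_semantic_chunking.to_csv(
    os.path.join(project_path, 'data', 'testing', 'sequential_semantic_chunking.csv')
)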
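
The three loops in this section differ only in the chunking function they call, so they could share one helper. Below is a minimal sketch under the assumption (taken from the loops above) that a chunker returns a list of dictionaries with `chunk` and `text` keys. `make_doc_id`, `build_chunks`, `fake_load_text` and `fake_chunker` are hypothetical names; the last two stand in for `extract_text_from_pdf` and the real chunkers.

```python
import hashlib

def make_doc_id(category, paper, index):
    """Deterministic document id, built the same way as in the loops above."""
    return hashlib.sha256(f"{category}-{paper}-{index}".encode("utf-8")).hexdigest()

def build_chunks(docs, load_text, chunker):
    """Apply `chunker` to every document and return flat chunk records."""
    records = []
    for doc in docs:
        doc_id = make_doc_id(doc["category"], doc["paper"], doc["index"])
        text = load_text(doc)
        for chunk in chunker(text, doc_id):
            records.append({
                "id": f"{doc_id}-{chunk['chunk']}",
                "category": doc["category"],
                "paper": doc["paper"],
                "text": chunk["text"],
            })
    return records

# Stand-ins for the PDF reader and a chunking function (assumptions).
def fake_load_text(doc):
    return "First sentence. Second sentence."

def fake_chunker(text, doc_id):
    parts = [p.strip() for p in text.split(".") if p.strip()]
    return [{"chunk": i, "text": p} for i, p in enumerate(parts, start=1)]

demo_docs = [{"index": 1, "category": "deeplearning", "paper": "a.pdf"}]
demo_records = build_chunks(demo_docs, fake_load_text, fake_chunker)
print(len(demo_records))
```

With the real functions, each strategy would become a single call with a different `chunker` argument.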

## Ground truth dataset generation using an LLM

In [136]:
prompt_template = """
You emulate a user of our academic assistant application, designed to aid college students, master students, and researchers.

Formulate 5 insightful questions this user might ask based on a provided text chunk from an academic paper.
Make the questions specific to this chunk and its context within the paper.
Ensure the questions probe deeper into the subject matter, prompting critical thinking and analysis.
Be complete and not too short.
Use as few words as possible from the record.

The record:

category: {category}
paper: {paper}
text: {text}

Provide the output in parsable JSON without using code blocks.
The response must not include any json reference (like ```json).
The code must be parsable as a python dictionary.
Ensure that all special characters are correctly escaped using backslashes.
The response must have the following format, without code references at the beginning:

{{"questions": ["question1", "question2", ..., "question5"]}}
""".strip()

In [18]:
rag = RAG(api_key=GOOGLE_API_KEY)
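
The prompt template is filled with `str.format`, which is why the JSON example in it uses doubled braces: `{{` and `}}` are emitted as literal `{` and `}`, while `{category}`, `{paper}` and `{text}` are replaced. A small sketch with a shortened, hypothetical template (`demo_template`) and a made-up record:

```python
# Shortened stand-in for the prompt template (assumption: same placeholder style).
demo_template = """
paper: {paper}
text: {text}

{{"questions": ["question1", "question2", ..., "question5"]}}
""".strip()

demo_record = {
    "id": "abc-1",  # extra keys such as 'id' are ignored by str.format
    "category": "deeplearning",
    "paper": "a.pdf",
    "text": "Some chunk text with {braces}.",
}

demo_prompt = demo_template.format(**demo_record)
print(demo_prompt)
```

Braces inside the substituted values are not interpreted again, so chunk text that contains `{` or `}` passes through unchanged.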

We will use the number of samples in the smallest dataset, which in this case is the one produced by sequential semantic chunking.

In [23]:
SMALLEST_DATASET_SIZE = 154


def subsample_dictionaries(list_of_dicts, sample_size):
    """
    Randomly subsamples a specified number of dictionaries from a list.

    Args:
        list_of_dicts: The list of dictionaries to sample from.
        sample_size: The desired number of dictionaries in the subsample.

    Returns:
        A new list containing the randomly selected dictionaries.
    """
    if sample_size > len(list_of_dicts):
        raise ValueError("Sample size cannot exceed the size of the list.")

    return random.sample(list_of_dicts, sample_size)
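
The value 154 is hard-coded from the size of the smallest dataset. As a sketch of an alternative, the target size could be derived from the datasets themselves and the subsample could be seeded so that reruns pick the same records. The helpers `smallest_size` and `subsample` and the toy datasets are hypothetical and not part of the notebook.

```python
import random

def smallest_size(datasets):
    """Size of the smallest dataset in a name -> list-of-records mapping."""
    return min(len(records) for records in datasets.values())

def subsample(records, sample_size, seed=123):
    """Seeded variant of subsample_dictionaries, so reruns pick the same records."""
    if sample_size > len(records):
        raise ValueError("Sample size cannot exceed the size of the list.")
    return random.Random(seed).sample(records, sample_size)

# Toy datasets; in the notebook the sizes are 2929, 269 and 154.
demo_datasets = {
    "sentence_splitting": [{"id": f"s-{i}"} for i in range(10)],
    "semantic_chunking": [{"id": f"c-{i}"} for i in range(6)],
    "sequential_semantic_chunking": [{"id": f"q-{i}"} for i in range(4)],
}

target = smallest_size(demo_datasets)
demo_samples = {name: subsample(recs, target) for name, recs in demo_datasets.items()}
print(target, {name: len(recs) for name, recs in demo_samples.items()})
```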

### Ground truth for sentence splitting

In [14]:
df_sentence_splitting = pd.read_csv(
    os.path.join(project_path, 'data', 'testing', 'sentence_splitting.csv')
)
print(f"The data set contains: {df_sentence_splitting.shape[0]} sentences")
df_sentence_splitting.head()


The data set contains: 2929 sentences


| | id | category | paper | text |
|---|---|---|---|---|
| 0 | e1ccff07e5c99304d9674e3bb8b21a9f3ad63a70834970... | deeplearning | attention_is_all_you_need.pdf | Attention Is All You Need\nAshish Vaswani\nGo... |
| 1 | e1ccff07e5c99304d9674e3bb8b21a9f3ad63a70834970... | deeplearning | attention_is_all_you_need.pdf | The best\nperforming models also connect the e... |
| 2 | e1ccff07e5c99304d9674e3bb8b21a9f3ad63a70834970... | deeplearning | attention_is_all_you_need.pdf | We propose a new simple network architecture, ... |
| 3 | e1ccff07e5c99304d9674e3bb8b21a9f3ad63a70834970... | deeplearning | attention_is_all_you_need.pdf | Experiments on two machine translation tasks s... |
| 4 | e1ccff07e5c99304d9674e3bb8b21a9f3ad63a70834970... | deeplearning | attention_is_all_you_need.pdf | Our model achieves 28.4 BLEU on the WMT 2014 E... |


In [17]:
sentece_splitting_docs = df_sentence_splitting.to_dict(orient="records")
sentece_splitting_docs[:3]

[{'id': 'e1ccff07e5c99304d9674e3bb8b21a9f3ad63a708349704476b45c169163a8b4-1',
  'category': 'deeplearning',
  'paper': 'attention_is_all_you_need.pdf',
  'text': 'Attention Is All You Need\nAshish Vaswani\x03\nGoogle Brain\navaswani@google.comNoam Shazeer\x03\nGoogle Brain\nnoam@google.comNiki Parmar\x03\nGoogle Research\nnikip@google.comJakob Uszkoreit\x03\nGoogle Research\nusz@google.com\nLlion Jones\x03\nGoogle Research\nllion@google.comAidan N. Gomez\x03y\nUniversity of Toronto\naidan@cs.toronto.eduŁukasz Kaiser\x03\nGoogle Brain\nlukaszkaiser@google.com\nIllia Polosukhin\x03z\nillia.polosukhin@gmail.com\nAbstract\nThe dominant sequence transduction models are based on complex recurrent or\nconvolutional neural networks that include an encoder and a decoder.'},
 {'id': 'e1ccff07e5c99304d9674e3bb8b21a9f3ad63a708349704476b45c169163a8b4-2',
  'category': 'deeplearning',
  'paper': 'attention_is_all_you_need.pdf',
  'text': 'The best\nperforming models also connect the encoder and deco

We will use a similar amount of data for the three cases, so we subsample 154 documents.

In [24]:
ss_sampled_docs = subsample_dictionaries(sentece_splitting_docs, SMALLEST_DATASET_SIZE)
print(f"Sampled number of senteces: {len(ss_sampled_docs)}")
ss_sampled_docs[:2]

Sampled number of senteces: 154


[{'id': 'e1ccff07e5c99304d9674e3bb8b21a9f3ad63a708349704476b45c169163a8b4-157',
  'category': 'deeplearning',
  'paper': 'attention_is_all_you_need.pdf',
  'text': 'On the WMT 2014 English-to-French translation task, our big model achieves a BLEU score of 41:0,\noutperforming all of the previously published single models, at less than 1=4the training cost of the\nprevious state-of-the-art model.'},
 {'id': '95c543b8fe00c9b1adbed90ddbc9e22abacf34b8ecc804f58850ed67a54b4cc7-250',
  'category': 'deeplearning',
  'paper': 'dense_x_retieval_what_retrieval_granularity_shoud_we_use.pdf',
  'text': 'Sub-sentence en-\ncoder: Contrastive learning of propositional semantic\nrepresentations.'}]

In [32]:
prompt = prompt_template.format(**ss_sampled_docs[0])

questions = rag.llm(prompt)

json.loads(questions[0])

{'questions': ['What specific architectural or training techniques enabled the model to achieve such a significant BLEU score improvement compared to previous single models?',
  'How does the reduced training cost of this model compare to the cost of ensemble models, and what are the trade-offs involved in choosing one over the other?',
  "What are the limitations of the model's performance on the WMT 2014 English-to-French translation task, and how might these be addressed in future research?",
  "How does the model's performance on this specific task translate to its potential for other language pairs or different NLP tasks?",
  "Considering the model's efficiency and performance, what are the potential implications for the development and deployment of large-scale language models in real-world applications?"]}

In [145]:
def generate_questions(doc:dict) -> dict:    
    """Function that uses an LLM to generate questions about
    a record from our documents.

    Returns:
        dict: Dictionary of questions
    """
    prompt = prompt_template.format(**doc)
    response = rag.llm(prompt)
    return json.loads(response[0])
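
`json.loads` raises an error if the model wraps its reply in a code fence or returns a different structure, despite the instructions in the prompt. Below is a hedged sketch of a more tolerant parser. `parse_questions` is a hypothetical helper and the replies are made up; it only handles a surrounding fence and checks the number of questions.

```python
import json

def parse_questions(raw, expected=5):
    """Parse an LLM reply into a list of questions, tolerating code fences."""
    text = raw.strip()
    # Drop a surrounding Markdown fence if the model added one anyway.
    if text.startswith("`"):
        text = text.strip("`").strip()
        if text.lower().startswith("json"):
            text = text[4:]
    data = json.loads(text)
    questions = data.get("questions") if isinstance(data, dict) else None
    if not isinstance(questions, list) or len(questions) != expected:
        raise ValueError(f"Expected {expected} questions, got: {questions!r}")
    return [str(q).strip() for q in questions]

# Made-up replies: one clean, one wrapped in a fence.
fence = "`" * 3
plain = '{"questions": ["a", "b", "c", "d", "e"]}'
fenced = fence + "json\n" + plain + "\n" + fence

print(parse_questions(plain))
print(parse_questions(fenced))
```

Raising `ValueError` on an unexpected structure lets the calling loop decide whether to retry or skip the chunk.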

In [34]:
generate_questions(ss_sampled_docs[0])

{'questions': ['What specific architectural or training techniques enabled the model to achieve such a significant BLEU score improvement compared to previous single models?',
  'How does the reduced training cost of this model compare to the cost of ensemble models, and what are the trade-offs involved in choosing one over the other?',
  "What are the limitations of the model's performance on the WMT 2014 English-to-French translation task, and how might these be addressed in future research?",
  "How does the model's performance on this specific task translate to its potential for other language pairs or different NLP tasks?",
  "Considering the model's efficiency and performance, what are the potential implications for the development and deployment of large-scale language models in real-world applications?"]}

In [35]:
ss_results = {}

In [62]:
for doc in tqdm(ss_sampled_docs):
    doc_id = doc['id']
    if doc_id in ss_results:
        continue
    questions = generate_questions(doc)
    ss_results[doc_id] = questions['questions']
    time.sleep(5)  # Pause between calls to avoid overusing the API

100%|██████████| 154/154 [14:20<00:00,  5.59s/it]
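
The loop above can be rerun after an interruption because ids already in the results dictionary are skipped. The sketch below shows the same pattern with a simple retry when a reply cannot be parsed. `generate_all`, its parameters and the stub `flaky_generate` are hypothetical; the real `generate_questions` would be passed in place of the stub.

```python
import time

def generate_all(docs, generate, results=None, pause=5.0, max_retries=3, sleep=time.sleep):
    """Fill `results` (id -> questions), skipping ids that are already done."""
    results = {} if results is None else results
    for doc in docs:
        doc_id = doc["id"]
        if doc_id in results:
            continue  # already generated in an earlier, interrupted run
        for attempt in range(1, max_retries + 1):
            try:
                results[doc_id] = generate(doc)["questions"]
                break
            except (ValueError, KeyError):
                # Malformed reply: wait a little longer before each retry.
                if attempt == max_retries:
                    raise
                sleep(pause * attempt)
        sleep(pause)  # throttle calls to the API
    return results

# Stub that fails once, then succeeds (stands in for the LLM call).
calls = {"n": 0}

def flaky_generate(doc):
    calls["n"] += 1
    if calls["n"] == 1:
        raise ValueError("malformed JSON")
    return {"questions": [f"q about {doc['id']}"]}

done = generate_all(
    [{"id": "a-1"}, {"id": "a-2"}],
    flaky_generate,
    results={"a-2": ["cached"]},
    sleep=lambda seconds: None,  # no real waiting in this demo
)
print(done)
```

`json.JSONDecodeError` is a subclass of `ValueError`, so parsing failures from `json.loads` are covered by the `except` clause.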


In [63]:
ss_final_results = []

for doc_id, questions in ss_results.items():
    for q in questions:
        ss_final_results.append((doc_id, q))

In [64]:
ss_final_results[0]

('e1ccff07e5c99304d9674e3bb8b21a9f3ad63a708349704476b45c169163a8b4-157',
 'What specific architectural or training techniques enabled the model to achieve such a significant BLEU score improvement compared to previous single models?')

In [66]:
df_ss_results = pd.DataFrame(ss_final_results, columns=['id', 'question'])

In [67]:
df_ss_results.to_csv(
    os.path.join(project_path, 'data', 'testing', 'ground-truth-sentence-splitting.csv'),
    index=False
)
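
Before using the saved file for evaluation, it may be worth checking that every chunk received the 5 questions requested in the prompt. A minimal sketch with made-up results; `flatten_results` mirrors the flattening step above and `check_ground_truth` is a hypothetical helper.

```python
from collections import Counter

def flatten_results(results):
    """Turn id -> [questions] into (id, question) rows, as in the cell above."""
    return [(doc_id, q) for doc_id, questions in results.items() for q in questions]

def check_ground_truth(rows, expected_per_doc=5):
    """Return ids whose number of questions differs from `expected_per_doc`."""
    counts = Counter(doc_id for doc_id, _ in rows)
    return {doc_id: n for doc_id, n in counts.items() if n != expected_per_doc}

demo_results = {
    "doc-1": ["q1", "q2", "q3", "q4", "q5"],
    "doc-2": ["q1", "q2", "q3"],  # the model returned too few questions
}
demo_rows = flatten_results(demo_results)
problems = check_ground_truth(demo_rows)
print(problems)
```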

### Ground truth for semantic chunking

In [68]:
df_semantic_chunking = pd.read_csv(
    os.path.join(project_path, 'data', 'testing', 'semantic_chunking.csv')
)
print(f"The data set contains: {df_semantic_chunking.shape[0]} sentences")
df_semantic_chunking.head()

The data set contains: 269 sentences


| | id | category | paper | text |
|---|---|---|---|---|
| 0 | e1ccff07e5c99304d9674e3bb8b21a9f3ad63a70834970... | deeplearning | attention_is_all_you_need.pdf | We used a beam size of 21and= 0:3\nfor both W... |
| 1 | e1ccff07e5c99304d9674e3bb8b21a9f3ad63a70834970... | deeplearning | attention_is_all_you_need.pdf | The Transformer allows for significantly more p... |
| 2 | e1ccff07e5c99304d9674e3bb8b21a9f3ad63a70834970... | deeplearning | attention_is_all_you_need.pdf | [4]Jianpeng Cheng, Li Dong, and Mirella Lapata. |
| 3 | e1ccff07e5c99304d9674e3bb8b21a9f3ad63a70834970... | deeplearning | attention_is_all_you_need.pdf | Attention Is All You Need\nAshish Vaswani\nGo... |
| 4 | e1ccff07e5c99304d9674e3bb8b21a9f3ad63a70834970... | deeplearning | attention_is_all_you_need.pdf | At each step the model is auto-regressive\n[10... |


In [69]:
semantic_chunking_docs = df_semantic_chunking.to_dict(orient="records")
sc_sampled_docs = subsample_dictionaries(semantic_chunking_docs, SMALLEST_DATASET_SIZE)
print(f"Sampled number of senteces: {len(sc_sampled_docs)}")
sc_sampled_docs[:2]

Sampled number of senteces: 154


[{'id': 'e1ccff07e5c99304d9674e3bb8b21a9f3ad63a708349704476b45c169163a8b4-31',
  'category': 'deeplearning',
  'paper': 'attention_is_all_you_need.pdf',
  'text': 'CoRR , abs/1409.0473, 2014.\nCoRR , abs/1703.03906, 2017.\nCoRR , abs/1406.1078, 2014.\nCoRR , abs/1412.3555, 2014.\nCurran Associates,\nInc., 2015.\nCoRR , abs/1512.00567, 2015.\nCoRR , abs/1606.04199, 2016.'},
 {'id': 'baf07af8cb279975638f4875627751cd1ad7639ee446444a5496c3be708f5a12-45',
  'category': 'deeplearning',
  'paper': 'knowledge_card_filling_llms_knowledge_gaps_with_plug_in_specialied_language_models.pdf',
  'text': 'In Proceedings of the\n2021 Conference of the North American Chapter of the Association for Computational Linguistics:\nHuman Language Technologies , pp.\nIn Proceedings of the 2022 Conference of\nthe North American Chapter of the Association for Computational Linguistics: Human Language\nTechnologies , pp.\nIn Proceedings of the 2021\nConference of the North American Chapter of the Association for C

In [70]:
sc_results = {}

In [83]:
for doc in tqdm(sc_sampled_docs):
    doc_id = doc['id']
    if doc_id in sc_results:
        continue
    questions = generate_questions(doc)
    sc_results[doc_id] = questions['questions']
    time.sleep(3)

100%|██████████| 154/154 [04:47<00:00,  1.87s/it]


In [84]:
sc_final_results = []

for doc_id, questions in sc_results.items():
    for q in questions:
        sc_final_results.append((doc_id, q))

In [85]:
df_sc_results = pd.DataFrame(sc_final_results, columns=['id', 'question'])

In [86]:
df_sc_results.to_csv(
    os.path.join(project_path, 'data', 'testing', 'ground-truth-semantic-chunking.csv'),
    index=False
)

### Ground truth for sequential semantic chunking

In [21]:
df_sequential_semantic_chunking = pd.read_csv(
    os.path.join(project_path, 'data', 'testing', 'sequential_semantic_chunking.csv')
)
print(f"The data set contains: {df_sequential_semantic_chunking.shape[0]} sentences")
df_sequential_semantic_chunking.head()

The data set contains: 154 sentences


| | Unnamed: 0 | id | category | paper | text |
|---|---|---|---|---|---|
| 0 | 0 | e1ccff07e5c99304d9674e3bb8b21a9f3ad63a70834970... | deeplearning | attention_is_all_you_need.pdf | Attention Is All You Need\nAshish Vaswani\nGo... |
| 1 | 1 | e1ccff07e5c99304d9674e3bb8b21a9f3ad63a70834970... | deeplearning | attention_is_all_you_need.pdf | This\nconsists of two linear transformations w... |
| 2 | 2 | e1ccff07e5c99304d9674e3bb8b21a9f3ad63a70834970... | deeplearning | attention_is_all_you_need.pdf | We\ntrained the base models for a total of 100... |
| 3 | 3 | e1ccff07e5c99304d9674e3bb8b21a9f3ad63a70834970... | deeplearning | attention_is_all_you_need.pdf | We\nused beam search with a beam size of 4and ... |
| 4 | 4 | e1ccff07e5c99304d9674e3bb8b21a9f3ad63a70834970... | deeplearning | attention_is_all_you_need.pdf | While single-head\nattention is 0.9 BLEU worse... |


I forgot to exclude the index when writing the file, so it was saved as an extra column. Since rerunning this chunking is quite slow, I drop that column next instead of regenerating the file.

In [90]:
df_sequential_semantic_chunking.drop('Unnamed: 0', axis=1, inplace=True)
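
The extra `Unnamed: 0` column comes from pandas writing the DataFrame index by default. The small sketch below reproduces the effect in memory with a made-up DataFrame and shows two ways to avoid it: reading the first column back as the index, or writing with `index=False`.

```python
import io
import pandas as pd

df = pd.DataFrame({"id": ["a-1", "a-2"], "text": ["first", "second"]})

# Writing with the default index=True adds an unnamed first column.
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
reloaded = pd.read_csv(with_index)  # columns: 'Unnamed: 0', 'id', 'text'

# Option 1: read the first column back as the index.
with_index.seek(0)
as_index = pd.read_csv(with_index, index_col=0)

# Option 2: do not write the index in the first place.
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
clean = pd.read_csv(without_index)

print(list(reloaded.columns), list(as_index.columns), list(clean.columns))
```

Option 1 would let the existing CSV be read cleanly without rerunning the slow chunking step.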

In [91]:
ssc_docs = df_sequential_semantic_chunking.to_dict(orient="records")
ssc_docs[:2]

[{'id': 'e1ccff07e5c99304d9674e3bb8b21a9f3ad63a708349704476b45c169163a8b4-1',
  'category': 'deeplearning',
  'paper': 'attention_is_all_you_need.pdf',
  'text': 'Attention Is All You Need\nAshish Vaswani\x03\nGoogle Brain\navaswani@google.comNoam Shazeer\x03\nGoogle Brain\nnoam@google.comNiki Parmar\x03\nGoogle Research\nnikip@google.comJakob Uszkoreit\x03\nGoogle Research\nusz@google.com\nLlion Jones\x03\nGoogle Research\nllion@google.comAidan N. Gomez\x03y\nUniversity of Toronto\naidan@cs.toronto.eduŁukasz Kaiser\x03\nGoogle Brain\nlukaszkaiser@google.com\nIllia Polosukhin\x03z\nillia.polosukhin@gmail.com\nAbstract\nThe dominant sequence transduction models are based on complex recurrent or\nconvolutional neural networks that include an encoder and a decoder.\nThe best\nperforming models also connect the encoder and decoder through an attention\nmechanism.\nWe propose a new simple network architecture, the Transformer,\nbased solely on attention mechanisms, dispensing with recurrenc

In [92]:
ssc_results = {}

In [144]:
for doc in tqdm(ssc_docs):
    doc_id = doc['id']
    if doc_id in ssc_results:
        continue
    questions = generate_questions(doc)
    ssc_results[doc_id] = questions['questions']
    time.sleep(3)

100%|██████████| 154/154 [03:05<00:00,  1.20s/it]


In [146]:
ssc_final_results = []

for doc_id, questions in ssc_results.items():
    for q in questions:
        ssc_final_results.append((doc_id, q))
        
df_ssc_results = pd.DataFrame(ssc_final_results, columns=['id', 'question'])
df_ssc_results.to_csv(
    os.path.join(
        project_path, 'data', 'testing', 'ground-truth-sequential-semantic-chunking.csv'
    ),
    index=False
)