## Imports and Data

Disclaimer: The code in this document is for explanatory purpose. Some of it may not run. Two other notebooks and a python file have been provided which have been tested and should be used for experimentation.

In [None]:
!pip install -q accelerate==0.20.3
!pip install -q transformers==4.30.0
!pip install -q sentence-transformers==2.2.2
# !pip install -q torch==2.0.1

[0m

In [None]:
# standard
import os
import numpy as np
import pandas as pd

# DL
import torch
import transformers
from transformers import AutoModelForQuestionAnswering, AutoTokenizer, pipeline
import sentence_transformers

# ML
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from nltk.tokenize import sent_tokenize

In [None]:
# Reading the data from the pdf files
# this uses the file present in the raw_data folders on the drive

data = []
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        with open(os.path.join(dirname, filename), 'r') as f:
            res = ' '.join(f.readlines()).split('.')
            data.extend(res)
data = np.array(data)
data

## Open Source Models Initialisation

Firstly, I experiment with open source models that can answer questions.

I experiment with various question-answering models. However the amount of data that we have is too much for a model to input at once.
I need a way to effectively extract relevant parts of the source document and and allow the model to use that as it's knowledge base.


Question-Answering DL models take a question and context and use the given context to generate answers. 
I use the following model, which also returns a score for how confident the model is of the answer given the context along with the answers.

This will be our extraction strategy.

In [None]:
# https://huggingface.co/deepset/roberta-base-squad2
model_name = "deepset/roberta-base-squad2"

# a) Get predictions
nlp = pipeline('question-answering', model=model_name, tokenizer=model_name)
QA_input = {
    'question': 'What is 42?',
    'context': '42 is the answer to the ultimate question of life, the universe, and everything'
}

# sample output
res = nlp(QA_input)
res
'''
{'score': 0.44963333010673523,
 'start': 6,
 'end': 79,
 'answer': 'the answer to the ultimate question of life, the universe, and everything'}
'''

## Retreiver Experiments

Using the above model on all sentences (which is what our documents consist of) will be very expensive in compute. Further, given unknown/new words (like GARDASIL 9) which may not be in the initial training of the model, the results from the model may not be reliable

To solve this, I use Term Frequency - Inverse Document Frequency to extract documents from that are similar to the question.

I find the tf-idf for all terms in the initial data, and then use this for the terms in the question to extract the most similar documents. For similarity I use the cosine_similarity.

I get the k most similar documents

In [None]:
questions = [
    "When did the GARDASIL 9 recommendations change?",
    "What were the past 3 recommendation changes for GARDASIL 9?",
    "Is GARDASIL 9 recommended for Adults?",
    "Does the ACIP recommend one dose GARDASIL 9?"
]

In [None]:
# Retrieving relvant documents using tf-idf

corpus = data
vectorizer = TfidfVectorizer()

query = questions[1]

# top k docs
k = 50

query_emb = vectorizer.fit_transform(corpus)
doc_emb = vectorizer.transform([query])
Z = cosine_similarity(doc_emb, query_emb)[0]
top_ind = np.argsort(Z)[::-1][:k]
top_docs = corpus[top_ind]

Now that we have the code, we use the earlier model on these top few documents to get the most relevant few.
I set (closest matches) n as 5.

In [None]:
n = 5
results = []
for doc in top_docs:
    QA_input = QA_input = {
        'question': query,
        'context': doc
    }
    res = nlp(QA_input)
    results.append((res['score'], res['answer'], doc))

# get top n scores
final = sorted(results)[::-1][:n]
final

In [None]:
'''
Some results from the above two methods are given below. We observe that for all 4 questions, it produces good results, the retrieved documents are relevant and contain the answeer.
'''

'''
When did the GARDASIL 9 recommendations change?

[(0.08958052843809128,
  'December 10, 2014',
  ' December 10, 2014 Approval letter—\n GARDASIL 9'),
 (0.00470396876335144,
  'February 2015',
  ' 11During its February 2015 meeting, the Advisory Committee \n on Immunization Practices (ACIP) recommended 9-valent \n human papillomavirus (HPV) vaccine (9vHPV) (Gardasil 9, \n Merck and Co'),
 (0.0006312825134955347,
  'new data',
  '\n Why are the recommendations being modified now?\n The updated recommendations contain new data on the \n epidemiology of typhoid fever and vaccine effectiveness and safety'),
 (0.0006220066570676863,
  'month 7',
  ' The main analyses \n were restricted to participants who received all 3 doses, had no evidence of current or past infection with the relevant vaccine HPV type through 1 month after the third dose (month 7), and did not deviate from protocol'),
 (0.000401303666876629,
  'month 96',
  ' Long-term extension study of Gardasil in adolescents; results \n through month 96 [Presentation]')]
  
  

What were the past 3 recommendation changes for GARDASIL 9?

[(0.41020330786705017,
  'low level of evidence) among males',
  ' Evidence supporting 9vHPV use was evaluated using \n the Grading of Recommendations, Assessment, Development, \n and Evaluation (GRADE) framework ( 5) and determined to \n be type 2 (moderate level of evidence) among females and 3 (low level of evidence) among males; the recommendation was categorized as a Category A recommendation (for all persons \n in an age- or risk-factor–based group) (6)'),
 (0.3149503171443939,
  'low level of evidence) among males',
  ' The evidence supporting 9vHPV vaccination was evaluated using the Grading of \n Recommendations, Assessment, Development, and Evaluation \n (GRADE) framework and determined to be type 2 (moderate level of evidence) among females and 3 (low level of evidence) among males; the recommendation was designated as a \n Category A recommendation (recommendation for all persons \n in an age- or risk-factor–based group)'),
 (0.0032194838859140873,
  'month 7',
  ' The main analyses \n were restricted to participants who received all 3 doses, had no evidence of current or past infection with the relevant vaccine HPV type through 1 month after the third dose (month 7), and did not deviate from protocol'),
 (0.0014037188375368714,
  '9vHPV, 4vHPV or 2vHPV',
  '\n What are the new recommendations?\n 9vHPV, 4vHPV or 2vHPV can be used for routine vaccination of \n females aged 11 or 12 years and females through age 26 years who have not been vaccinated previously or who have not \n completed the 3-dose series'),
 (0.001384895178489387,
  'noninferior',
  ' The GMTs were noninferior for all nine HPV vaccine types in the co-administered group (all p<0')]
  
  
Is GARDASIL 9 recommended for Adults?

[(0.056292034685611725,
  'Approval letter',
  ' December 10, 2014 Approval letter—\n GARDASIL 9'),
 (0.042542651295661926,
  '9 through 26 years',
  '  These \n recommendations for children and adults aged 9 through 26 years and for adults aged >26 years apply to all persons, \n † F or persons initiating vaccination before their 15th birthday, the recommended \n immunization schedule is 2 doses of HPV vaccine (0, 6–12 month schedule)'),
 (0.03578556701540947,
  'through the recommended age',
  ' Therefore, vaccination \n is recommended through the recommended age for females regardless of whether they have an abnormal Pap test result, and for females or males regardless of known HPV infection, HPV-associated precancer lesions, or anogenital warts'),
 (0.012184708379209042,
  '9vHPV',
  ' Vaccination of males is \n recommended with 4vHPV (as long as this formulation is \n available) or 9vHPV'),
 (0.010534054599702358,
  '9vHPV',
  '\n † Vaccination of females \n is recommended with 2vHPV, 4vHPV (as long as this for-mulation is available), or 9vHPV')]


Does the ACIP recommend one dose GARDASIL 9?

[(0.040387365967035294,
  'ACIP did not recommend \n catch-up vaccination',
  ' ACIP did not recommend \n catch-up vaccination for all adults aged 27 through 45 years, \n but recognized that some persons who are not adequately vaccinated might be at risk for new HPV infection and might benefit from vaccination in this age range; therefore, ACIP recommended shared clinical decision-making regarding potential HPV vaccination for these persons'),
 (0.026912059634923935,
  '9-valent',
  ' 11During its February 2015 meeting, the Advisory Committee \n on Immunization Practices (ACIP) recommended 9-valent \n human papillomavirus (HPV) vaccine (9vHPV) (Gardasil 9, \n Merck and Co'),
 (0.02402137592434883,
  'guidance',
  ' FDA licensure of quadrivalent human papillomavirus vaccine (HPV4, Gardasil) for use in males and guidance from the Advisory Committee on Immunization Practices (ACIP)'),
 (0.01927875354886055,
  '9-valent',
  ' Characteristics of the three human papillomavirus (HPV) vaccines licensed for use in the United States \n Characteristic Bivalent (2vHPV)* Quadrivalent (4vHPV)† 9-valent (9vHPV)§\n Brand name Cervarix Gardasil Gardasil 9\n VLPs 16, 18 6, 11, 16, 18 6, 11, 16, 18, 31, 33, 45, 52, 58\n Manufacturer GlaxoSmithKline Merck and Co'),
 (0.012921948917210102,
  'Approval letter',
  ' December 10, 2014 Approval letter—\n GARDASIL 9')]


'''

pass

# Faster Retreival

Since we only care about embedding similarities we can also use sentence transformers to find embeddings.

This is a significantly faster than the sequence to sequence model used above in two ways:

- The embeddings for the entire corpus can be computed at initialisation time, inference time only requires computing the embedding of the query, making it a much faster retreival method, especially during inference.
- Since we are not generating a response in this step, and only retreiving, we do not need a model that is producing a generated answer, we only care about the embeddings. 

I experiment with 4 models, 2 of which are trained on question answering. Note: These are the best few models on huggingface in SentenceSimilaeity.

In [None]:


# https://huggingface.co/sentence-transformers/multi-qa-mpnet-base-dot-v1
# finetuned on question answering
model1 = sentence_transformers.SentenceTransformer('sentence-transformers/multi-qa-mpnet-base-dot-v1')

# https://huggingface.co/sentence-transformers/all-mpnet-base-v2
model2 = sentence_transformers.SentenceTransformer('sentence-transformers/all-mpnet-base-v2')

# https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2
model3 = sentence_transformers.SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

# https://huggingface.co/sentence-transformers/multi-qa-MiniLM-L6-cos-v1
model4 = sentence_transformers.SentenceTransformer('sentence-transformers/multi-qa-MiniLM-L6-cos-v1')


# The following function to get outputs for query and as well the docs returns the most similar documents
def top_embeds(model, documents, question, n):
    query_emb = model.encode([question])
    doc_emb = model.encode(documents)
    scores  = cosine_similarity(query_emb, doc_emb)[0]
    doc_score_pairs = list(zip(top_docs, scores))
    doc_score_pairs = sorted(doc_score_pairs, key=lambda x: x[1], reverse=True)[:n]
    return doc_score_pairs
    pass


In [None]:
query = questions[0]
a = top_embeds(model1, top_docs, query, 5)
b = top_embeds(model2, top_docs, query, 5)
c = top_embeds(model3, top_docs, query, 5)
d = top_embeds(model4, top_docs, query, 5)

print("Query: " + query)
print("Model 1")
print(a)
print("Model 2")
print(b)
print("Model 3")
print(c)
print("Model 4")
print(d)

The results were unsatisfactory

# Generation

We now move on to generation.

We need a model that will take the best documents we have retreived and use them to produce a good verbal output.

We experiment with a plethora of generators, whose outputs produced are given as comments in the cells.

One thing to note here is that we focus on text to text models. This is because question-answering models are trained to find a start and end token in the context that contains the correct answer. Some of our questions cannot be answered that way and require a definitive answer such as yes or no, or a rephrase of what's given in the context. So for this, we use text to text models which work better to produce general results 

In [None]:
from transformers import AutoTokenizer, BartForConditionalGeneration

model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")
tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")

ARTICLE_TO_SUMMARIZE = (
    "PG&E stated it scheduled the blackouts in response to forecasts for high winds "
    "amid dry conditions. The aim is to reduce the risk of wildfires. Nearly 800 thousand customers were "
    "scheduled to be affected by the shutoffs which were expected to last through at least midday tomorrow."
)
inputs = tokenizer([ARTICLE_TO_SUMMARIZE], max_length=1024, return_tensors="pt")

# Generate Summary
summary_ids = model.generate(inputs["input_ids"], num_beams=2, min_length=0, max_length=20)
tokenizer.batch_decode(summary_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]

'PG&E scheduled the blackouts in response to forecasts for high winds amid dry conditions'

In [None]:
generator = transformers.pipeline("text2text-generation", model="facebook/bart-large-cnn")
prompt = "Answer this question: Is GARDASIL 9 recommended for Adults?\n Given this is Context: December 10, 2014 Approval letter— GARDASIL 9, These recommendations for children and adults aged 9 through 26 years and for adults aged >26 years apply to all persons, For persons initiating vaccination before their 15th birthday, the recommended \n immunization schedule is 2 doses of HPV vaccine (0, 6–12 month schedule), Therefore, vaccination \n is recommended through the recommended age for females regardless of whether they have an abnormal Pap test result, and for females or males regardless of known HPV infection, HPV-associated precancer lesions, or anogenital warts"
generator(prompt)

In [None]:
generator = transformers.pipeline("text2text-generation", model="bigscience/bloom-560m")
prompt = "Answer this question: Is GARDASIL 9 recommended for Adults?\n Given this is true: December 10, 2014 Approval letter— GARDASIL 9, These recommendations for children and adults aged 9 through 26 years and for adults aged >26 years apply to all persons, For persons initiating vaccination before their 15th birthday, the recommended \n immunization schedule is 2 doses of HPV vaccine (0, 6–12 month schedule), Therefore, vaccination \n is recommended through the recommended age for females regardless of whether they have an abnormal Pap test result, and for females or males regardless of known HPV infection, HPV-associated precancer lesions, or anogenital warts"
generator(prompt)

In [None]:
generator = transformers.pipeline("text2text-generation", model="google/flan-t5-base")
prompt = "Answer this question: Is GARDASIL 9 recommended for Adults? Be elaborate\n. The following is true: December 10, 2014 Approval letter— GARDASIL 9, These recommendations for children and adults aged 9 through 26 years and for adults aged >26 years apply to all persons, For persons initiating vaccination before their 15th birthday, the recommended \n immunization schedule is 2 doses of HPV vaccine (0, 6–12 month schedule), Therefore, vaccination \n is recommended through the recommended age for females regardless of whether they have an abnormal Pap test result, and for females or males regardless of known HPV infection, HPV-associated precancer lesions, or anogenital warts"
generator(prompt)

In [None]:
generator = transformers.pipeline("text2text-generation", model="google/flan-t5-large")
prompt = "Answer this question: Is GARDASIL 9 recommended for Adults?\n. The following is true: December 10, 2014 Approval letter— GARDASIL 9, These recommendations for children and adults aged 9 through 26 years and for adults aged >26 years apply to all persons, For persons initiating vaccination before their 15th birthday, the recommended \n immunization schedule is 2 doses of HPV vaccine (0, 6–12 month schedule), Therefore, vaccination \n is recommended through the recommended age for females regardless of whether they have an abnormal Pap test result, and for females or males regardless of known HPV infection, HPV-associated precancer lesions, or anogenital warts"
generator(prompt)

In [None]:
prompt = "Answer this question: Is GARDASIL 9 recommended for Adults? Be elaborate. \n The following is true: December 10, 2014 Approval letter— GARDASIL 9, These recommendations for children and adults aged 9 through 26 years and for adults aged >26 years apply to all persons, For persons initiating vaccination before their 15th birthday, the recommended \n immunization schedule is 2 doses of HPV vaccine (0, 6–12 month schedule), Therefore, vaccination \n is recommended through the recommended age for females regardless of whether they have an abnormal Pap test result, and for females or males regardless of known HPV infection, HPV-associated precancer lesions, or anogenital warts"
generator(prompt)

In [None]:
T0generator = transformers.pipeline("text2text-generation", model="bigscience/T0_3B", torch_dtype=torch.bfloat16)

In [None]:
prompt = "Answer this question: Is GARDASIL 9 recommended for Adults? \n The following is true: December 10, 2014 Approval letter— GARDASIL 9, These recommendations for children and adults aged 9 through 26 years and for adults aged >26 years apply to all persons, For persons initiating vaccination before their 15th birthday, the recommended \n immunization schedule is 2 doses of HPV vaccine (0, 6–12 month schedule), Therefore, vaccination \n is recommended through the recommended age for females regardless of whether they have an abnormal Pap test result, and for females or males regardless of known HPV infection, HPV-associated precancer lesions, or anogenital warts"
T0generator(prompt)

# Restructing our Data

On noticing inconsistencies and errors in the earilier data, the data was scraped from the webpages using the HTML. The new data consists of section headers, as well as the respective contents.


We are interested in the headings and the bodies. This is what we retrieve.
The metadata can give us important information about dates. We can include retrieval for this as well.

In [3]:

file_name = '/kaggle/input/qna-data/section_wise_data.txt'
with open(file_name, 'r') as f:
    data = f.read()
    pass

# webpages
level1 = data.split("<<<<>>>>")[:-1]
tree_data = []
for paper in level1:
    parts = paper.split("<<<>>>")
    tree_data.append({"meta": parts[0], "content":parts[1]})
    pass

for paper in tree_data:
    parts = paper["content"].split("<<>>")
    paper["content"] = []
    for part in parts:
        innersplit = part.split("<>")
        paper["content"].append({"heading": innersplit[0], "body":innersplit[1]})
        pass
    pass

'''
Schema
[
    {
        "meta": metadata about the paper
        "content":[
            {
                "heading": section heading like introduction
                "body": body of the section 
            }
            {
                "heading": ...
                "body": ... 
            }
            ...
        ]
    },
    {
    
    }
    ...
]


'''
pass

# Result on New data

We experiment on the new data we have.

In [25]:
questions = [
    "When did the GARDASIL 9 recommendations change?",
    "What were the past 3 recommendation changes for GARDASIL 9?",
    "Is GARDASIL 9 recommended for Adults?",
    "Does the ACIP recommend one dose GARDASIL 9?"
]

corpus = [content["body"] for paper in tree_data for content in paper["content"]]
headings = [content["heading"] for paper in tree_data for content in paper["content"]]

corpus = np.array(corpus)
headings = np.array(headings)

# Retrieving relvant documents using tf-idf


vectorizer = TfidfVectorizer()

query = questions[2]

# top k docs
k = 50

query_emb = vectorizer.fit_transform(corpus)
doc_emb = vectorizer.transform([query])
Z = cosine_similarity(doc_emb, query_emb)[0]
top_ind = np.argsort(Z)[::-1][:k]
top_docs = corpus[top_ind]
top_headers = headings[top_ind]


model_name = "deepset/roberta-base-squad2"
nlp = pipeline('question-answering', model=model_name, tokenizer=model_name)

n = 5
results = []
for i in range(len(top_docs)):
    QA_input = QA_input = {
        'question': query,
        'context': top_docs[i]
    }
    res1 = nlp(QA_input)
    QA_input = QA_input = {
        'question': query,
        'context': headings[i]
    }
    res2 = nlp(QA_input)
    results.append((res1['score']+res2['score'], top_docs[i], headings[i]))

# get top n scores
final = sorted(results)[::-1][:n]
final


# Experiment Failed -----

[(0.12598911951255332,
  '\nVaccine efficacy and safety. Data were considered from 11 clinical trials of 9vHPV, 4vHPV, and/or 2vHPV in adults aged 27 through 45 years, along with supplemental bridging immunogenicity data. In per-protocol analyses from three trials, 4vHPV and 2vHPV demonstrated significant efficacy against a combined endpoint of persistent vaccine-type HPV infections, anogenital warts, and cervical intraepithelial neoplasia (CIN) grade 1 (low-grade lesions) or worse. In nine trials, seroconversion rates to vaccine-type HPV after 3 doses of any HPV vaccine were 93.6%–100% at 7 months after the first dose. Overall evidence on benefits was GRADE evidence level 2, for moderate-quality evidence. In nine trials, few serious adverse events and no vaccine-related deaths were reported. Overall evidence on harms was also GRADE evidence level 2, for moderate-quality evidence. In the efficacy trial that was the basis for 9vHPV licensure for adults through age 45 years, per-protocol

The results are still not satisfactory, the retrieved documents are not very good. We abandon this experiment.

# Sentences + Generator

In [3]:
file_name = '/kaggle/input/qna-data/section_wise_data.txt'
with open(file_name, 'r') as f:
    data = f.read()
    pass

# webpages
level1 = data.split("<<<<>>>>")[:-1]
tree_data = []
for paper in level1:
    parts = paper.split("<<<>>>")
    tree_data.append({"meta": parts[0], "content":parts[1]})
    pass

for paper in tree_data:
    parts = paper["content"].split("<<>>")
    paper["content"] = []
    for part in parts:
        innersplit = part.split("<>")
        paper["content"].append({"heading": innersplit[0], "body":innersplit[1]})
        pass
    pass

corpus = [content["body"] for paper in tree_data for content in paper["content"]]
corpus = ' '.join(corpus)
corpus = sent_tokenize(corpus)
corpus = np.array(corpus)

In [52]:
class QnAModel():
    def __init__(self, model_corpus, retreiver_model = None, ST_retreiver: str = None, gen_model = None):
        self.corpus = model_corpus
        
        # Retreiver model
        if retreiver_model == None:
            retreiver_model = "deepset/roberta-base-squad2"
        self.retreiver_model = pipeline('question-answering', model=retreiver_model, tokenizer = retreiver_model)
        
        
        # Sentence tarnsformer model (unused)
        self.ST_retreiver = sentence_transformers.SentenceTransformer(ST_retreiver)
        
        #Generator Model
        if gen_model == None:
            gen_model = pickle.load(open('/kaggle/input/models/flan-t5-large-finetuned-finetuning_final_data-10_epochs.h5', 'rb'))
            self.gen_model = transformers.pipeline("text2text-generation", model = gen_model, tokenizer = 'google/flan-t5-large')
        else:
            self.gen_model = transformers.pipeline("text2text-generation", model = gen_model, tokenizer = gen_model)
        
        pass
    
    def tf_idf_retreival(self, query, k):
        # Retrieving relvant documents using tf-idf
        vectorizer = TfidfVectorizer()

        query_emb = vectorizer.fit_transform(self.corpus)
        doc_emb = vectorizer.transform([query])
        Z = cosine_similarity(doc_emb, query_emb)[0]
        top_ind = np.argsort(Z)[::-1][:k]
        top_docs = self.corpus[top_ind]
        return top_docs
        
    def DL_retreiver(self, query, top_documents, n):
        # Retrieving top documents using an end to end question answering model
        results = []
        for doc in top_documents:
            QA_input = QA_input = {
                'question': query,
                'context': doc
            }
            res = self.retreiver_model(QA_input)
            results.append((doc, res['score']))

        # get top n scores
        final = sorted(results, key=lambda x: x[1])[::-1][:n]
        return final
        
    def SentenceTransform_retreiver(self, query, top_documents, n):
        # retreiving top documents using a sentence transformer using vector embedding similarities 
        if self.ST_retreiver == None:
            raise Exception("Sentence Transformer not provided.")
        
        query_emb = self.ST_retreiver.encode([question])
        doc_emb = self.ST_retreiver.encode(top_documents)
        scores  = cosine_similarity(query_emb, doc_emb)[0]
        doc_score_pairs = list(zip(top_docs, scores))
        doc_score_pairs = sorted(doc_score_pairs, key=lambda x: x[1], reverse=True)[:n]
        return doc_score_pairs
    
    def generate_result(self, query, top_documents, max_length):
        # Generating results using a text to text generator model
        
        # Our top douments are the context for the model
        context = ' '.join(top_documents)
        
        # We prompt the task to the model
        prompt = f'Answer this question elaborately: {query} \n Given this is true: {context}'
        return self.gen_model(prompt, max_length=max_length)
    
    def answer_question(self, query, k = 50, n = 5, max_length = 75, use_sent_transformer = False):
        # tf idf
        top_documents = self.tf_idf_retreival(query, k)
        # retreive
        if use_sent_transformer:
            top_documents = self.SentenceTransform_retreiver(query, top_documents, n)
        else:
            top_documents = self.DL_retreiver(query, top_documents, n)
            #print(top_documents)
        
        
        # generate
        top_documents = [pair[0] for pair in top_documents]          
        
        output = self.generate_result(query, top_documents, max_length)
        
        return output

In [54]:
questions = [
    "When did the GARDASIL 9 recommendations change?",
    "What were the past 3 recommendation changes for GARDASIL 9?",
    "Is GARDASIL 9 recommended for Adults?",
    "Does the ACIP recommend one dose GARDASIL 9?"
]
model = QnAModel(corpus, gen_model = "bigscience/T0_3B")

Downloading (…)lve/main/config.json:   0%|          | 0.00/632 [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/11.4G [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/1.86k [00:00<?, ?B/s]

Downloading spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/1.79k [00:00<?, ?B/s]

In [58]:
for i in range(4):
    print(questions[i])
    print(model.answer_question(k = 80, n = 8, query=questions[i]))

When did the GARDASIL 9 recommendations change?
[{'generated_text': 'The recommendations for children and adults aged 9 through 26 years and for adults aged >26 years apply'}]
What were the past 3 recommendation changes for GARDASIL 9?
[{'generated_text': 'The recommendation was designated as a Category A recommendation (recommendation for all persons'}]
Is GARDASIL 9 recommended for Adults?
[{'generated_text': 'No'}]
Does the ACIP recommend one dose GARDASIL 9?
[{'generated_text': 'No'}]


# Finetuning


In [3]:
from datasets import load_dataset
dataset = load_dataset("text", data_files = {"train": "/kaggle/input/qna-data/finetuning_train.txt", "test": "/kaggle/input/qna-data/finetuning_test.txt", "val": "/kaggle/input/qna-data/finetuning_val.txt"})

Downloading and preparing dataset text/default to /root/.cache/huggingface/datasets/text/default-6c9f7dbab8f1c63d/0.0.0/4b86d314f7236db91f0a0f5cda32d4375445e64c5eda2692655dd99c2dac68e8...


Downloading data files:   0%|          | 0/3 [00:00<?, ?it/s]

Extracting data files:   0%|          | 0/3 [00:00<?, ?it/s]

Dataset text downloaded and prepared to /root/.cache/huggingface/datasets/text/default-6c9f7dbab8f1c63d/0.0.0/4b86d314f7236db91f0a0f5cda32d4375445e64c5eda2692655dd99c2dac68e8. Subsequent calls will reuse this data.


  0%|          | 0/3 [00:00<?, ?it/s]

In [4]:
from transformers import T5Tokenizer, T5ForConditionalGeneration, T5TokenizerFast, AutoModelForCausalLM

tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-large")
model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-large", load_in_8bit=True)

Downloading spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/2.20k [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/2.54k [00:00<?, ?B/s]

In [None]:
def tokenize_function(examples):
    return tokenizer(examples["text"])
block_size = 128

In [None]:
def group_texts(examples):
    # Concatenate all texts.
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the small remainder, we could add padding if the model supported it instead of this drop, you can
        # customize this part to your needs.
    total_length = (total_length // block_size) * block_size
    # Split by chunks of max_len.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result

In [None]:
tokenized_datasets2 = dataset.map(tokenize_function, batched=True, num_proc=4, remove_columns=["text"])

In [None]:
lm_datasets2 = tokenized_datasets2.map(
    group_texts,
    batched=True,
    batch_size=1000,
    num_proc=4,
)

In [None]:
model_checkpoint2 = "google/flan-t5-large"

In [None]:
from transformers import Trainer, TrainingArguments

model_name = model_checkpoint2.split("/")[-1]
training_args = TrainingArguments(
    f"{model_name}-finetuned-finetuning_final_data-5_epochs.txt",
    evaluation_strategy = "epoch",
    learning_rate=5e-5,
    weight_decay=0.01,
    num_train_epochs=10,
    auto_find_batch_size = True
)

In [None]:
from transformers import DataCollatorForLanguageModeling
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15, mlm=False)

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_datasets2["train"],
    eval_dataset=lm_datasets2["val"],
    data_collator=data_collator,
)

In [None]:
torch.cuda.empty_cache()
trainer.train()