
<div style="color:#ffffff;
          font-size:50px;
          font-style:italic;
          text-align:left;
          font-family: 'Lucida Bright';
          background:#4686C8;">
  	&nbsp; Q&A with Fine-Tuning
</div>
<br>   
<div style="
          font-size:20px;
          text-align:left;
          font-family: 'Palatino';
          ">
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Project: Q&A using pretrained model with Fine-Tuning<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Author: George Barrinuevo<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Date: 07/09/2025<br>
</div>

<br><div style="color:#ffffff;
          font-size:30px;
          font-style:italic;
          text-align:left;
          font-family: 'Lucida Bright';
          background:#4686C8;">
  	      &nbsp; Project Notes
</div>
<div style="
          font-size:16px;
          text-align:left;
          font-family: 'Cambria';">
    
<b>My Thoughts</b><br>
- This notebook demonstrates a question-answering (Q&A) system built on a pretrained model. It is for educational purposes only.
- The model used is pretrained and already fine-tuned for question answering. Reusing a pretrained model saves considerable cost and time compared with training a model from scratch.
- The notebook was developed on Google Colab.

<b>Technical Details</b><br>
<u>Tokenizer for model</u>
- The tokenizer pre-processes the input text into the format the model expects. It is highly recommended to use the tokenizer that matches the model being used.

<u>Model</u>
- A pretrained model is used to save the cost and time of training a large model from scratch. Fine-tuning then adapts the pretrained model to a specific task.
- The pretrained model bert-large-uncased-whole-word-masking-finetuned-squad uses the BertForQuestionAnswering architecture and is already fine-tuned on the SQuAD 1.1 dataset. This architecture extracts an answer from a passage of text, given a question. The final inference cell uses the smaller distilbert-base-uncased-distilled-squad model with the DistilBertForQuestionAnswering architecture.

<u>Dataset</u>
- SQuAD (Stanford Question Answering Dataset) is a dataset designed for training and evaluating question answering systems. This notebook downloads the v2.0 training set and saves it, so it can be reloaded on later runs.

<u>NLP parser</u>
- spaCy is an NLP library. From the input text, its pipeline runs a tokenizer, tagger, parser and NER (named entity recognition). See https://spacy.io/.

<u>Lemmatization</u>
- Lemmatization groups the inflected forms of a word under a single base form. For example, 'walks', 'walking' and 'walked' all share the base word 'walk', which is their lemma.

<u>Tokenizing the corpus text</u>
- The corpus text (dataset) is converted to numeric vectors with TF–IDF (term frequency–inverse document frequency). A plain 'bag of words' model disregards word order (and thus most of syntax and grammar) and keeps only the term frequency, that is, how often each word occurs in a document. TF-IDF improves on this by multiplying the term frequency by the 'inverse document frequency', which reduces a word's weight the more documents it appears in, so common words count for less.
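
The weighting described above can be sketched in plain Python. This is a minimal illustration of the textbook formula (term frequency times log of inverse document frequency), not the exact computation in scikit-learn's TfidfVectorizer, which by default also smooths the IDF term and normalizes each row. The function name `tf_idf` and the toy documents are assumptions made for this example.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document (textbook TF-IDF, no smoothing)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            # Term frequency (share of the document) times inverse document frequency.
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Toy corpus, made up for illustration.
docs = ["the cat sat", "the dog ran", "the cat ran home"]
weights = tf_idf(docs)
for doc, w in zip(docs, weights):
    print(doc, "->", w)
```

In this toy corpus 'the' occurs in every document, so its weight drops to zero, while 'home' occurs in only one document and receives the largest weight in that document.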

<u>Pipeline process</u>
- All of the above is assembled in a pipeline:<br>
      SQuAD -> spacy -> lemma -> TF-IDF -> tokenizer -> model -> answer


In [None]:
from huggingface_hub import notebook_login
notebook_login()


In [None]:
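# The PyPI package for sklearn is named scikit-learn; pyarrow is needed by pandas for the feather cache files.
!pip install torch transformers scikit-learn spacy[cuda92] pyarrow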

In [None]:
import os
import requests
import random
import pickle
import pandas as pd
import json
import spacy
import numpy as np
import torch

from transformers import DistilBertTokenizer, DistilBertForQuestionAnswering, BertTokenizer, BertForQuestionAnswering
from tqdm import tqdm
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
CACHE_DIR = os.getenv('QA_CACHE_PATH', 'cache')
DATA_DIR = os.getenv('QA_DATA_PATH', 'data')

SQUAD_URL = "https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json"
SQUAD_TRAIN = f"{DATA_DIR}/train-v2.0.json"
LEMMA_CACHE = f"{CACHE_DIR}/lemmas.feather"

DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
TOPK = 10 if DEVICE.type == 'cuda' else 5

In [None]:
os.system("jupyter nbextension enable --py widgetsnbextension")
os.system("python3 -m spacy download en_core_web_sm")

# Create the data and cache directories if they do not exist yet.
os.makedirs(DATA_DIR, exist_ok=True)
os.makedirs(CACHE_DIR, exist_ok=True)

print("Downloading pretrained models to cache")
# Load the tokenizer from the same checkpoint as the model so that the two match.
tokenizer = BertTokenizer.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')
model = BertForQuestionAnswering.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')

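if not os.path.isfile(SQUAD_TRAIN):
    print(f"Downloading squad dataset as {SQUAD_TRAIN}")
    response = requests.get(SQUAD_URL, stream=True)
    response.raise_for_status()  # stop on an HTTP error instead of saving an error page

    print('Saving SQUAD data')
    with open(SQUAD_TRAIN, "wb") as handle:
        # Read in 8 KiB chunks; iter_content() defaults to 1 byte per chunk, which is very slow.
        for data in tqdm(response.iter_content(chunk_size=8192)):
            handle.write(data)

print('All done')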


In [None]:
spacy.prefer_gpu()
sp = spacy.load("en_core_web_sm")

In [None]:
# Load the saved SQuAD data.
# If there are any issues loading the file below, delete the SQUAD_TRAIN file and re-run
# this notebook. The notebook will then download a fresh copy of the file.

with open(SQUAD_TRAIN) as f:
    doc = json.load(f)
doc.keys(), type(doc["data"]), len(doc["data"])

In [None]:
# Extract the paragraphs (contexts) and the answerable questions from the SQuAD dataset.
# If there are any issues with the loaded data, delete the SQUAD_TRAIN file and re-run this notebook.
# The notebook will download a fresh copy of the file and save it for reuse on the next run.

paragraphs = []
questions = []
for topic in doc["data"]:
    for pgraph in topic["paragraphs"]:
        paragraphs.append(pgraph["context"])
        for qa in pgraph["qas"]:
            if not qa["is_impossible"]:
                questions.append(qa["question"])

len(paragraphs), len(questions), random.sample(paragraphs, 2), random.sample(questions, 5)
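
The nested loop above follows the SQuAD v2.0 JSON layout: topics contain paragraphs, and each paragraph carries a context plus a list of questions. The following sketch reproduces the same extraction on a hand-written miniature document. The content of `toy_doc` is hypothetical and only mirrors the field names of the real file.

```python
# Hypothetical miniature document in the SQuAD v2.0 layout (not real dataset content).
toy_doc = {
    "version": "v2.0",
    "data": [
        {
            "title": "Example topic",
            "paragraphs": [
                {
                    "context": "The sky is blue on a clear day.",
                    "qas": [
                        {"id": "q1", "question": "What color is the sky?",
                         "is_impossible": False,
                         "answers": [{"text": "blue", "answer_start": 11}]},
                        # Unanswerable questions are flagged and skipped by the notebook.
                        {"id": "q2", "question": "Who painted the sky?",
                         "is_impossible": True, "answers": []},
                    ],
                }
            ],
        }
    ],
}

# Same traversal as the cell above: topic -> paragraph -> question.
toy_paragraphs = [p["context"] for t in toy_doc["data"] for p in t["paragraphs"]]
toy_questions = [
    qa["question"]
    for t in toy_doc["data"]
    for p in t["paragraphs"]
    for qa in p["qas"]
    if not qa["is_impossible"]
]
print(toy_paragraphs)
print(toy_questions)
```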

In [None]:
def lemmatize(phrase):
    return " ".join([word.lemma_ for word in sp(phrase)])

In [None]:
%%time
# Load the lemmatization cache. If it does not exist, create it by lemmatizing every paragraph.
# The 'lemmas' list holds the lemmatized version of the corresponding entry in 'paragraphs'.
# tqdm() displays a progress bar.
# Note: the %%time cell magic must be the first line of the cell.

if not os.path.isfile(LEMMA_CACHE):
    lemmas = [lemmatize(par) for par in tqdm(paragraphs)]
    df = pd.DataFrame(data={'context': paragraphs, 'lemmas': lemmas})
    df.to_feather(LEMMA_CACHE)

df = pd.read_feather(LEMMA_CACHE)
paragraphs = df.context
lemmas = df.lemmas

In [None]:
# Print out an example of a paragraph and lemma.
print(f'A paragraph: {paragraphs[1]}')
print(f'A lemma: {lemmas[1]}')


In [None]:
%%time
# Load the saved TF-IDF data. If it does not exist, create it and save it for later use.
# Stop words are words that carry little semantic value and are therefore filtered out.
# The TF-IDF matrix is built from the lemmatized paragraphs (contexts) and cached for the next run.

VECTOR_CACHE = f"{CACHE_DIR}/vectors.pickle"
if not os.path.isfile(VECTOR_CACHE):
    vectorizer = TfidfVectorizer(
        stop_words='english', min_df=5, max_df=.5, ngram_range=(1,3))
    tfidf = vectorizer.fit_transform(lemmas)
    with open(VECTOR_CACHE, "wb") as f:
        pickle.dump(dict(vectorizer=vectorizer, tfidf=tfidf), f)
else:
    with open(VECTOR_CACHE, "rb") as f:
        cache = pickle.load(f)
        tfidf = cache["tfidf"]
        vectorizer = cache["vectorizer"]

len(vectorizer.vocabulary_)
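
Both the lemma cache and the TF-IDF cache follow the same "load if present, otherwise build and save" pattern. A minimal standard-library sketch of that pattern is shown below. The helper name `load_or_build` and the demo values are assumptions for illustration and are not part of the notebook.

```python
import os
import pickle
import tempfile

def load_or_build(path, build):
    """Return the pickled value at `path`, building and saving it first if missing."""
    if os.path.isfile(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    value = build()
    with open(path, "wb") as f:
        pickle.dump(value, f)
    return value

calls = []

def expensive():
    # Stand-in for a slow step such as fitting a vectorizer.
    calls.append(1)
    return {"answer": 42}

with tempfile.TemporaryDirectory() as tmp:
    cache_path = os.path.join(tmp, "demo.pickle")
    first = load_or_build(cache_path, expensive)   # builds and saves
    second = load_or_build(cache_path, expensive)  # loads from disk

print(first, second, len(calls))
```

Pickle files can execute arbitrary code when loaded, so this pattern is only appropriate for cache files the notebook wrote itself.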

In [None]:
# Lemmatize the question, then convert it to TF-IDF weights.

question = "When did the last country to adopt the Gregorian calendar start using it?"
t_lemmatize = lemmatize(question)
query = vectorizer.transform([t_lemmatize])
# Number of vocabulary terms matched by the question, and the matched terms themselves.
print((query > 0).sum(), vectorizer.inverse_transform(query))
print(f't_lemmatize: {t_lemmatize}')
print(f'query: {query}')

In [None]:
%%time
# Score every paragraph against the query and sort the scores in descending order.
# The 3 highest-scoring paragraphs (contexts) are displayed.

scores = (tfidf * query.T).toarray()
results = np.flip(np.argsort(scores, axis=0), axis=0)
[paragraphs[i] for i in results[:3, 0]]
# paragraphs[results[0]]
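
The scoring step above is a dot product between each paragraph vector and the query vector, followed by a descending sort. Because TfidfVectorizer L2-normalizes its rows by default, that dot product equals the cosine similarity. The sketch below shows the same idea with plain dictionaries as sparse vectors. The function name `rank` and the toy weights are assumptions for illustration.

```python
def rank(doc_vectors, query_vector, top_n=3):
    """Return (index, score) pairs for the top_n documents, best first."""
    # Sparse dot product: only terms present in the document contribute.
    scores = [
        sum(weight * query_vector.get(term, 0.0) for term, weight in vec.items())
        for vec in doc_vectors
    ]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in order[:top_n]]

# Toy sparse vectors, made up for illustration.
doc_vectors = [
    {"calendar": 0.8, "gregorian": 0.6},
    {"football": 1.0},
    {"calendar": 0.5, "julian": 0.5},
]
query_vector = {"gregorian": 0.7, "calendar": 0.7}

ranked = rank(doc_vectors, query_vector)
print(ranked)
```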

In [None]:
%%time
# Take the 10 highest-scoring paragraphs, keep those whose score passes a threshold, and put them in a
# dataframe with 'question' and 'context' columns. Store the result in a cache file.

THRESH = 0.01
# scores has shape (n_paragraphs, 1), so index column 0 to get a scalar score.
candidate_idxs = [(i, scores[i, 0]) for i in results[0:10, 0]]
contexts = [(paragraphs[i], s)
    for (i, s) in candidate_idxs if s > THRESH]

question_df = pd.DataFrame.from_records([{
    'question': question,
    'context':  ctx
} for (ctx, s) in contexts])

question_df.to_feather(f"{CACHE_DIR}/question_context.feather")

In [None]:
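# Run the fine-tuned Question & Answer model on the data prepared above.
# Note: this replaces the BERT tokenizer and model loaded earlier with DistilBERT distilled on SQuAD.

tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased-distilled-squad')
model = DistilBertForQuestionAnswering.from_pretrained('distilbert-base-uncased-distilled-squad')

print(f'-----------------')
# Use the 10th-ranked candidate (index 9); this assumes all 10 candidates passed the threshold.
t_record = question_df.to_dict(orient="records")[9]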
question = t_record['question']
text = t_record['context']
print(f'\nquestion: {question}')
print(f'\ntext: {text}')

inputs = tokenizer(question, text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

answer_start_index = torch.argmax(outputs.start_logits)
answer_end_index = torch.argmax(outputs.end_logits)

predict_answer_tokens = inputs.input_ids[0, answer_start_index : answer_end_index + 1]
t_answer = tokenizer.decode(predict_answer_tokens)
print(f'\nanswer: {t_answer}')

Here is an example output of this notebook.

---------------
question: When did the last country to adopt the Gregorian calendar start using it?

text: Philip II of Spain decreed the change from the Julian to the Gregorian calendar, which affected much of Roman Catholic Europe, as Philip was at the time ruler over Spain and Portugal as well as much of Italy. In these territories, as well as in the Polish–Lithuanian Commonwealth (ruled by Anna Jagiellon) and in the Papal States, the new calendar was implemented on the date specified by the bull, with Julian Thursday, 4 October 1582, being followed by Gregorian Friday, 15 October 1582. The Spanish and Portuguese colonies followed somewhat later de facto because of delay in communication.

answer: 15 october 1582
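
The final cell picks the start and end positions independently with argmax, which can yield an end position before the start position and therefore an empty answer. A common refinement, sketched below with plain Python lists, is to choose the (start, end) pair with the highest combined score subject to start <= end. This is not what the notebook does. The function name `best_span`, the `max_len` limit, and the toy logits are assumptions for illustration.

```python
def best_span(start_logits, end_logits, max_len=30):
    """Return the (start, end) pair with the highest summed logit and start <= end."""
    best = (0, 0)
    best_score = float("-inf")
    for s, s_logit in enumerate(start_logits):
        # Only consider ends at or after the start, up to max_len tokens long.
        for e in range(s, min(s + max_len, len(end_logits))):
            score = s_logit + end_logits[e]
            if score > best_score:
                best_score = score
                best = (s, e)
    return best

# Toy tokens and logits, made up for illustration.
tokens = ["[CLS]", "when", "?", "[SEP]", "15", "october", "1582", "[SEP]"]
start_logits = [0.1, 0.0, 0.0, 0.0, 3.5, 0.3, 3.2, 0.0]
end_logits = [0.1, 0.0, 2.5, 0.0, 0.2, 0.5, 2.0, 0.0]

# Independent argmax, as in the notebook: here the end lands before the start.
naive_start = max(range(len(start_logits)), key=start_logits.__getitem__)
naive_end = max(range(len(end_logits)), key=end_logits.__getitem__)
naive_span = tokens[naive_start:naive_end + 1]

# Joint selection keeps the span valid.
start, end = best_span(start_logits, end_logits)
answer = " ".join(tokens[start:end + 1])
print(naive_span, "vs", answer)
```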