<a href="https://colab.research.google.com/github/acastellanos-ie/NLP-MBD-EN2022S-ELECTIVES-3/blob/main/qa_practice_dl/QA_practice_with_HF.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

First, make sure you have the Hugging Face Transformers library installed. If you don't have it, you can install it via pip:

In [15]:
pip install -Uqq transformers

# First Try: QA system based on DistillBERT

build a simple Question Answering model using the [DistilBERT](https://huggingface.co/docs/transformers/model_doc/distilbert) model. Start by importing the necessary libraries:

In [3]:
from transformers import DistilBertTokenizer, DistilBertForQuestionAnswering
import torch

Load the pre-trained DistilBERT model and tokenizer:

In [4]:
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased', return_token_type_ids=True)
model = DistilBertForQuestionAnswering.from_pretrained('distilbert-base-uncased-distilled-squad')

Downloading (…)solve/main/vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/483 [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/451 [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/265M [00:00<?, ?B/s]

function to ask a question, given the question and the context

In [5]:
def ask_question(question, context):
    # Encode the input question and context using the tokenizer
    inputs = tokenizer(question, context, return_tensors='pt', padding=True, truncation=True)

    # Get the model output, which includes the start and end logits for the answer
    outputs = model(**inputs)

    # Find the start and end positions of the answer in the input
    answer_start = torch.argmax(outputs.start_logits)
    answer_end = torch.argmax(outputs.end_logits)

    # If the answer_end comes before the answer_start, return an empty answer
    if answer_end < answer_start:
        return ""

    # Convert the tokens back to the original text
    answer = tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(inputs.input_ids[0][answer_start:answer_end+1]))

    return answer


Finally, use the ask_question function to test your model:

In [6]:
context = "The Eiffel Tower is a wrought-iron lattice tower on the Champ de Mars in Paris, France. It is named after the engineer Gustave Eiffel, whose company designed and built the tower."
question = "Who designed the Eiffel Tower?"

answer = ask_question(question, context)
print("Answer:", answer)


Answer: gustave eiffel


This example uses the DistilBERT model fine-tuned on the SQuAD dataset. It is a lighter version of BERT, making it faster and requiring less memory. You can encourage your students to try other models from the Hugging Face Model Hub, such as BERT, RoBERTa, or ALBERT, and compare their performance and speed.

# Second Try: Connect to the web

Instead of answering the questions based on a context that we provide, We will connect to the web to answer questions

Install the required libraries:

In [17]:
! pip install -Uqq googlesearch-python beautifulsoup4 requests

[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m129.4/129.4 KB[0m [31m5.0 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m62.8/62.8 KB[0m [31m4.3 MB/s[0m eta [36m0:00:00[0m
[?25h

In [10]:
from googlesearch import search
import requests
from bs4 import BeautifulSoup

Define a function to fetch a relevant webpage using Google search:

In [12]:
def fetch_relevant_webpage(query, num_results=1):
    search_results = []
    for url in search(query, num_results=num_results):
        search_results.append(url)
    return search_results

Define a function to extract the text content from the fetched webpage using BeautifulSoup

In [8]:
def extract_text_from_webpage(url):
    # Send a GET request to fetch the webpage content
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')

    # Find all the paragraphs in the webpage
    paragraphs = soup.find_all('p')

    # Extract the text from each paragraph and join them together
    text = ' '.join([p.get_text() for p in paragraphs])

    return text

Modify the `ask_question` function to accept a query instead of a context, and use the above functions to fetch and extract context

In [11]:
def ask_question_with_query(question, query, num_results=1):
    urls = fetch_relevant_webpage(query, num_results=num_results)
    
    for url in urls:
        context = extract_text_from_webpage(url)
        answer = ask_question(question, context)
        if answer:
            return answer

    return "Sorry, I couldn't find an answer to your question."

Test the updated function by asking a question without providing context

In [12]:
question = "Who won the last Soccer World Cup?"
query = question

answer = ask_question_with_query(question, query)
print("Answer:", answer)

Answer: the only two teams remaining in contention


# Abstractive QA

Abstractive question answering involves generating answers that may not be directly extracted from the input context. For this task, you can use the  [T5 (Text-to-Text Transfer Transformer) model](https://huggingface.co/docs/transformers/model_doc/t5), which is designed for various NLP tasks, including abstractive question answering.

In [18]:
! pip install -Uqq sentencepiece

In [3]:
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small", model_max_length=512)
model = T5ForConditionalGeneration.from_pretrained("t5-small")


Modify the `ask_question` function to use the T5 model:

In [4]:
def ask_question_abstractive(question, context):
    # Encode the input question and context using the tokenizer
    input_text = f"answer: {question} context: {context}"
    inputs = tokenizer(input_text, return_tensors="pt", padding=True, truncation=True)

    # Get the model output, which includes the generated answer
    outputs = model.generate(inputs["input_ids"], max_length=128, num_return_sequences=1)

    # Convert the generated tokens back to the original text
    answer = tokenizer.decode(outputs[0], skip_special_tokens=True)

    return answer


Test the abstractive question answering function with a query and a question:

In [24]:
question = "Which were the causes of the 2008 finantial crysis?"
query = question

context = extract_text_from_webpage(fetch_relevant_webpage(query)[0])
answer = ask_question_abstractive(question, context)
print("Answer:", answer)


Answer: cheap credit and lax lending standards that fueled a housing bubble


This example demonstrates a simple way to create an abstractive question answering system using the T5 model. Note that the performance of the model can be further improved by using larger T5 models like "t5-base" or "t5-large". However, these models require more computational resources and may take longer to run.

Keep in mind that the quality of the generated answers might vary depending on the context provided, and you can further refine the results by adjusting the model's decoding strategy or incorporating additional information retrieval techniques.