# **Textual Data Analysis - Exercise - 8**


---


## **Name: Ayesha Zafar**
## **Date: 11/02/2025**


---

In this exercise you will test a small idea:

"Let us assume we train a model which receives a question and a text segment on its input, and predicts YES/NO whether the text segment contains the answer to the question. It should then be so that if the answer is YES, the explanation of the prediction should point out to the answer in the text."

I tested this and, well, seems like this small idea works and now your job is to replicate it, i.e. arrive at this output:



I trained a BERT model for you for that task, and you can find it here: http://dl.turkunlp.org/TKO_8964_2023/english-binarized-weighted.model.tgz

It is taking its input in the form "[CLS] question [SEP] context [SEP]" and the output has two logit values, the first one is for the negative class (question not answered) and the second one for the positive class (question answered). You can download the model, unpack it, and run as follows:


    MODEL_NAME = 'english-binarized-weighted.model'
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
    tokenized = tokenizer(text=question, text_pair=context, return_tensors='pt')
    prediction = model(**tokenized)


The rest you should be able to base on the example notebook we had in the lecture. It is basically the same code, and you should be able to replicate the result for at least the Q-A pair in the screenshot above.

---



Step 1. Installing necessary libraries

In [1]:
!pip install torch transformers wget

Collecting wget
  Downloading wget-3.2.zip (10 kB)
  Preparing metadata (setup.py) ... done
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_cupti_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu12==9.1.0.70 (from torch)
  Downloading nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cublas-cu12==12.4.5.8 (from torch)
  Downloading nvidia_cublas_cu12-12.4.5.8-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cufft-cu12==11.2.1.3 (from torch)
  Downloading nvidia_cufft_cu12-11.2.1.3-py3-none-manylinux2014_x86_64.whl.metadata

Step 2. Importing required libraries

In [2]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import os
import tarfile
import urllib.request

Step 3. Defining model source, file and folder names

In [3]:
MODEL_URL = "http://dl.turkunlp.org/TKO_8964_2023/english-binarized-weighted.model.tgz"
MODEL_FILENAME = "english-binarized-weighted.model.tgz"
MODEL_FOLDER = "english-binarized-weighted.model"

Step 4. Downloading the model if it doesn't already exist

In [4]:
if not os.path.exists(MODEL_FILENAME):
    print("Downloading model.")
    urllib.request.urlretrieve(MODEL_URL, MODEL_FILENAME)
else:
    print("Model already exists.")

Downloading model.
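
The existence check in Step 4 treats any file with the expected name as a complete download, so an interrupted transfer would be reused silently. A minimal sketch of one way around this, assuming it is acceptable to download to a temporary name first and rename afterwards. The helper name `download_if_missing` and the `.part` suffix are illustrative assumptions, not part of the exercise.

```python
import os
import urllib.request


def download_if_missing(url: str, filename: str) -> bool:
    """Download url to filename unless filename already exists.

    Returns True if a download happened, False if the file was already there.
    """
    if os.path.exists(filename):
        return False
    partial = filename + ".part"  # hypothetical temporary name
    try:
        urllib.request.urlretrieve(url, partial)
        # Rename only after the transfer finished, so a file with the final
        # name is never a half-written download.
        os.replace(partial, filename)
    finally:
        # Remove a leftover partial file if the download failed midway.
        if os.path.exists(partial):
            os.remove(partial)
    return True
```

In the notebook this could be called as `download_if_missing(MODEL_URL, MODEL_FILENAME)` in place of the `if`/`else` block.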


Step 5. Extracting the model if not already done

In [5]:
if not os.path.exists(MODEL_FOLDER):
    print("Extracting model.")
    with tarfile.open(MODEL_FILENAME, "r:gz") as tar:
        tar.extractall()
else:
    print("Model is already extracted.")

Extracting model.
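
`tar.extractall()` trusts every path stored in the archive. For an archive fetched over plain HTTP, it may be worth checking the member paths before extracting. The following is a rough sketch under that assumption. The helper name `safe_extract` is hypothetical, and the check only covers path traversal through member names, not every possible archive trick. Recent Python versions also offer a `filter` argument on `extractall` for this purpose.

```python
import os
import tarfile


def safe_extract(archive_path: str, dest_dir: str = ".") -> None:
    """Extract a .tgz archive, refusing members that would land outside dest_dir."""
    dest_root = os.path.realpath(dest_dir)
    with tarfile.open(archive_path, "r:gz") as tar:
        for member in tar.getmembers():
            # Resolve where this member would be written.
            target = os.path.realpath(os.path.join(dest_root, member.name))
            # Reject names such as "../x" that escape the destination folder.
            if os.path.commonpath([dest_root, target]) != dest_root:
                raise ValueError(f"Unsafe path in archive: {member.name}")
        tar.extractall(dest_root)
```

In the notebook, `safe_extract(MODEL_FILENAME)` would stand in for the `with tarfile.open(...)` block.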


Step 6. Loading the pre-trained tokenizer and model

In [6]:
MODEL_NAME = MODEL_FOLDER
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)

Step 7. Defining the input question and context

In [11]:
question = "When was the University of Turku founded?"
context = ("The University of Turku (Finnish: Turun yliopisto, in Swedish: Åbo universitet, "
           "shortened UTU), located in Turku in southwestern Finland, is the third largest university "
           "in the country as measured by student enrollment, after the University of Helsinki and "
           "Tampere University. It is a multidisciplinary university with eight faculties. "
           "It was established in 1920 and also has facilities at Rauma, Pori, Kevo, and Seinäjoki.")

Step 8. Tokenizing input and performing inference

In [12]:
tokenized_input = tokenizer(text=question, text_pair=context, return_tensors='pt')
with torch.no_grad():
    outputs = model(**tokenized_input)

Step 9. Extracting logits and computing probabilities

In [13]:
logits = outputs.logits.squeeze()
probs = torch.softmax(logits, dim=0)
negative_prob, positive_prob = probs.tolist()
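
To make the softmax step concrete, here is a small pure-Python sketch of what it computes for two logits. The logit values are hypothetical, not the model's actual output. It also checks the standard identity that with two classes the positive probability equals the sigmoid of the logit difference.

```python
import math


def softmax2(neg_logit: float, pos_logit: float) -> tuple[float, float]:
    """Numerically stable softmax over two logits (negative, positive)."""
    m = max(neg_logit, pos_logit)  # subtract the max to avoid overflow
    e_neg = math.exp(neg_logit - m)
    e_pos = math.exp(pos_logit - m)
    total = e_neg + e_pos
    return e_neg / total, e_pos / total


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


# Hypothetical logits, chosen only for illustration.
neg_p, pos_p = softmax2(-1.2, 1.7)

# The two probabilities sum to 1 ...
assert abs(neg_p + pos_p - 1.0) < 1e-12
# ... and the positive one is sigmoid(pos_logit - neg_logit).
assert abs(pos_p - sigmoid(1.7 - (-1.2))) < 1e-12
```

So comparing `positive_prob > negative_prob` is equivalent to checking whether the positive logit is the larger of the two.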

Step 10. Printing output results

In [14]:
answer_found = positive_prob > negative_prob
print(f"Answer Found: {answer_found}")
print(f"Negative Score: {negative_prob:.4f}, Positive Score: {positive_prob:.4f}")

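if answer_found:
    # Placeholder only: this marks the second half of the context regardless of
    # which tokens drove the prediction. It is not a model-based explanation.
    words = context.split()
    midpoint = len(words) // 2
    highlighted_context = " ".join(words[:midpoint]) + " **" + " ".join(words[midpoint:]) + "**"
    print(f"Highlighted Context: {highlighted_context}")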

Answer Found: True
Negative Score: 0.0540, Positive Score: 0.9460
Highlighted Context: The University of Turku (Finnish: Turun yliopisto, in Swedish: Åbo universitet, shortened UTU), located in Turku in southwestern Finland, is the third largest university in the country as measured by student **enrollment, after the University of Helsinki and Tampere University. It is a multidisciplinary university with eight faculties. It was established in 1920 and also has facilities at Rauma, Pori, Kevo, and Seinäjoki.**
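
The highlight in Step 10 marks the second half of the context rather than anything derived from the model, so it does not yet test the idea from the exercise. One simple way to get a model-based highlight is occlusion (leave-one-out): remove each word in turn and see how much the positive score drops. This is a sketch of that technique, not necessarily the method used in the lecture notebook. The helper names and the toy scorer are assumptions made so the example runs without the model.

```python
from typing import Callable, List, Tuple


def occlusion_scores(question: str, context: str,
                     score_fn: Callable[[str, str], float]) -> List[Tuple[str, float]]:
    """Leave-one-out attribution: drop in score_fn when each word is removed."""
    words = context.split()
    base = score_fn(question, context)
    scores = []
    for i, word in enumerate(words):
        reduced = " ".join(words[:i] + words[i + 1:])
        scores.append((word, base - score_fn(question, reduced)))
    return scores


def highlight(scored: List[Tuple[str, float]], top_k: int = 1) -> str:
    """Wrap the top_k highest-scoring words in ** markers."""
    ranked = sorted(range(len(scored)), key=lambda i: scored[i][1], reverse=True)
    keep = set(ranked[:top_k])
    return " ".join(f"**{w}**" if i in keep else w
                    for i, (w, _) in enumerate(scored))


# Toy stand-in scorer (an assumption, NOT the trained model): it scores a
# context highly only if it contains a 4-digit number.
def toy_score(question: str, context: str) -> float:
    has_year = any(w.strip(".,").isdigit() and len(w.strip(".,")) == 4
                   for w in context.split())
    return 0.9 if has_year else 0.1


demo = highlight(occlusion_scores(
    "When was it founded?",
    "It was established in 1920 and has eight faculties.",
    toy_score))
print(demo)  # It was established in **1920** and has eight faculties.

# With the notebook's model, score_fn could look like this (untested here):
# def model_score(question, context):
#     enc = tokenizer(text=question, text_pair=context, return_tensors="pt")
#     with torch.no_grad():
#         return torch.softmax(model(**enc).logits, dim=-1)[0, 1].item()
```

Occlusion needs one forward pass per word, which is acceptable for a single short paragraph. Gradient-based attribution methods give token-level scores from far fewer passes.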
