Submission for <br />
Exercise task 8 <br />
of UTU course TKO_8964-3006 <br />
Textual Data Analysis <br />
by Botond Ortutay <br />

---

**Instructions:**

In this exercise you will test a small idea:

"Let us assume we train a model which receives a question and a text segment on its input, and predicts YES/NO whether the text segment contains the answer to the question. It should then be so that if the answer is YES, the explanation of the prediction should point out to the answer in the text."

I tested this and, well, seems like this small idea works and now your job is to replicate it, i.e. arrive at this output:

<img src="ext8_ex_outp.png">

I trained a BERT model for you for that task, and you can find it here: http://dl.turkunlp.org/TKO_8964_2023/english-binarized-weighted.model.tgz

It is taking its input in the form "[CLS] question [SEP] context [SEP]" and the output has two logit values, the first one is for the negative class (question not answered) and the second one for the positive class (question answered). You can download the model, unpack it, and run as follows:

```python
MODEL_NAME = 'english-binarized-weighted.model'
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
tokenized = tokenizer(text=question, text_pair=context, return_tensors='pt')
prediction = model(**tokenized)
```
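As a small illustration of how the two logits can be read, the sketch below turns a `[negative, positive]` pair into a YES/NO label and a probability for the positive class. The function name `decide` and the example logit values are made up for illustration and are not output of the provided model. It uses plain Python rather than `torch` so that it runs on its own.

```python
import math

def decide(logits):
    """Turn [negative_logit, positive_logit] into a label and P(answered)."""
    neg, pos = logits
    # A two-class softmax reduces to a sigmoid of the logit difference.
    p_yes = 1.0 / (1.0 + math.exp(neg - pos))
    return ("YES" if p_yes > 0.5 else "NO"), p_yes

# Made-up logits for illustration only, not real model output.
label, p_yes = decide([-1.2, 2.3])
print(label, round(p_yes, 3))
```

Comparing the two logits directly gives the same label; the probability is only useful if a confidence value is wanted alongside the decision.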

The rest you should be able to base off the example notebook we had on the lecture, it is basically the exact same code and you should be able to replicate the result for at least the Q-A pair in the screenshot above.

---

**Solution:**

**Library & environment setup:**

In [1]:
from captum.attr import LayerIntegratedGradients
from IPython.display import display, Markdown
import numpy as np
import pandas as pd
import random
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# checking GPU availability
print(torch.cuda.is_available()) # should output True

True


**Model downloading & unpacking:**

In [2]:
#NOTE: bash code here
# Downloading the model
!wget http://dl.turkunlp.org/TKO_8964_2023/english-binarized-weighted.model.tgz
# Locally extracting .tar.gz package with a single command:
!tar -xzvf english-binarized-weighted.model.tgz
# Printing the local directory structure to make sure the model is now on disk
!echo
!tree english-binarized-weighted.model


Saving 'english-binarized-weighted.model.tgz'
english-binarized-weighted.model/training_args.bin
english-binarized-weighted.model/pytorch_model.bin
english-binarized-weighted.model/tokenizer.json
english-binarized-weighted.model/vocab.txt
english-binarized-weighted.model/config.json
english-binarized-weighted.model/special_tokens_map.json
english-binarized-weighted.model/tokenizer_config.json

english-binarized-weighted.model
├── config.json
├── pytorch_model.bin
├── special_tokens_map.json
├── tokenizer_config.json
├── tokenizer.json
├── training_args.bin
└── vocab.txt

1 directory, 7 files


**Functions for later use:**

In [3]:
# Forwarding function (needed for LayerIntegratedGradients)
def forwarder(inputs, token_type_ids, attention_mask):
    pred = model(inputs, token_type_ids=token_type_ids, attention_mask=attention_mask)
    return pred.logits #return the output of the classification layer

**Importing model to Python:**

In [4]:
MODEL_NAME = 'english-binarized-weighted.model'
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)



**Function for analyzing question-context pairs**

Objectives:
 1. Determine whether the context contains an answer to the question
 2. If the context contains an answer: where within the context is it?
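The second objective relies on integrated gradients. As a rough, framework-free sketch of the idea (not Captum's implementation, which works on the embedding layer of the model), the toy below attributes the output of a small hand-written function to its inputs by averaging gradients along a straight path from a baseline to the input. The function `f`, its gradient `grad_f` and all values are assumptions chosen for illustration.

```python
def integrated_gradients(grad_fn, x, baseline, steps=50):
    """Approximate integrated gradients with a midpoint Riemann sum."""
    avg_grads = [0.0] * len(x)
    for s in range(steps):
        alpha = (s + 0.5) / steps
        # Point on the straight line between the baseline and the input.
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i, g in enumerate(grad_fn(point)):
            avg_grads[i] += g / steps
    # Scale the averaged gradient by the distance travelled per input.
    return [(xi - b) * g for xi, b, g in zip(x, baseline, avg_grads)]

# Toy "model": f(x) = x0 * x1 + x2 ** 2, with its gradient written by hand.
def f(x):
    return x[0] * x[1] + x[2] ** 2

def grad_f(x):
    return [x[1], x[0], 2 * x[2]]

x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
attrs = integrated_gradients(grad_f, x, baseline)
print(attrs, sum(attrs), f(x) - f(baseline))
```

The attributions sum to the difference between the function value at the input and at the baseline, which is why the choice of baseline matters for the per-token scores.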

In [5]:
"""
Function for analyzing question-context pairs

Objectives:
 1. Determine whether the context contains an answer to the question
 2. If the context contains an answer: where within the context is it?
---
In:
question     str                              question in question-context pair
context      str                              long string where we search for the answer
model        BertForSequenceClassification    bert model we use to perform analysis 
tokenizer    BertTokenizerFast                tokenizer to use with model
"""
def qAndCAnalyzer(question,context,model,tokenizer):
    tokenized = tokenizer(text=question, text_pair=context, return_tensors='pt')
    prediction = model(**tokenized)
    
    # Does the context not answer the question? (from prediction output)
    if (prediction["logits"][0][0] > prediction["logits"][0][1]):
        dString = 'Context doesn\'t have an answer to your question!'
        
    # If the context has an answer, we move on to further analysis
    else:
        # Tokenized output to tensor tuple (currently it's <class 'transformers.tokenization_utils_base.BatchEncoding'>, which would break things below)
        tokenizedTT = (tokenized["input_ids"],tokenized["token_type_ids"],tokenized["attention_mask"])
        
        # Baseline input of the same length as tokenizedTT, made of [PAD] tokens.
        # One [PAD] per token, minus 2 because the tokenizer adds [CLS] and the final [SEP] itself.
        # Known limitation: the [SEP] between question and context is not preserved in the baseline.
        ref = tokenizer(" ".join(["[PAD]"]*(len(tokenized["input_ids"][0])-2)),return_tensors="pt")
        baselineTT = (ref["input_ids"],ref["token_type_ids"],ref["attention_mask"])
        
        # Set up layer integrated gradients on the BERT embedding layer. Uses the forwarding function from above.
        gradients = LayerIntegratedGradients(forwarder, model.bert.embeddings)

        # Compute attribution scores for the positive class (target=1) relative to the baseline
        attrs = gradients.attribute(inputs=tokenizedTT, baselines = baselineTT, return_convergence_delta=False,target=1)

        # Sum over the embedding dimension to get one score per token, then drop the batch dimension
        attrs = attrs.sum(dim=-1).squeeze(0)

        # Results to numpy format for easier handling below
        attrNp = attrs.detach().cpu().numpy()

        # Generating fancy output. Text background will be green if a token supports the positive prediction (the darker the stronger) and red if it works against it (the darker the stronger)
        dString = ''
        k = 0
        for i in tokenized["input_ids"]:
            for j in tokenizer.convert_ids_to_tokens(i):
                if(attrNp[k] > np.percentile(attrNp, 99)):
                    dString += '<mark style="background-color:#7cb342">' + j + '</mark> '
                elif(attrNp[k] > np.percentile(attrNp, 95)):
                    dString += '<mark style="background-color:#aed581">' + j + '</mark> '
                elif(attrNp[k] > np.percentile(attrNp, 91)):
                    dString += '<mark style="background-color:#dcedc8">' + j + '</mark> '
                elif(attrNp[k] < np.percentile(attrNp, 1)):
                    dString += '<mark style="background-color:#ff0000">' + j + '</mark> '
                elif(attrNp[k] < np.percentile(attrNp, 4)):
                    dString += '<mark style="background-color:#ff4747">' + j + '</mark> '
                elif(attrNp[k] < np.percentile(attrNp, 9)):
                    dString += '<mark style="background-color:#ff8f8f">' + j + '</mark> '
                else:
                    dString += j + " "
                k += 1
    # Fancy output
    display(Markdown(dString))
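The baseline used in the function above drops the `[SEP]` between question and context. One possible alternative, sketched here on plain lists of ids, is to build the baseline from the real input ids by keeping `[CLS]` and `[SEP]` where they are and replacing everything else with padding. The helper name `build_baseline_ids` and the numeric ids are hypothetical; with a Hugging Face tokenizer the ids would normally come from `tokenizer.cls_token_id`, `tokenizer.sep_token_id` and `tokenizer.pad_token_id`.

```python
def build_baseline_ids(input_ids, cls_id, sep_id, pad_id):
    """Replace every ordinary token with padding, keeping [CLS] and [SEP] in place."""
    keep = {cls_id, sep_id}
    return [tok if tok in keep else pad_id for tok in input_ids]

# Hypothetical ids for illustration only.
CLS, SEP, PAD = 101, 102, 0
input_ids = [CLS, 1332, 1108, SEP, 1109, 1239, 1104, SEP]
baseline_ids = build_baseline_ids(input_ids, CLS, SEP, PAD)
print(baseline_ids)
```

With a baseline built this way, the `token_type_ids` and `attention_mask` of the real input could presumably be reused unchanged, since the sequence layout is identical.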

**Testing with example provided in instructions:**

In [6]:
question = "When was University of Turku founded?"
context = "The University of Turku (Finnish: Turun yliopisto, in Swedish: Åbo universitet), located in Turku in soutwestern Finland, is the third largest university in the country as measured by student enrollment, after the University of Helsinki and Tampere University. It is a multidisciplinary university with eight faculties. It was established in 1920 and also has facilities at Rauma, Pori, Kevo and Seili. The university is a member of the Coimbra Group and the European Campus of City - Univiersities (EC2U)."

qAndCAnalyzer(question,context,model,tokenizer)

[CLS] <mark style="background-color:#7cb342">When</mark> <mark style="background-color:#ff8f8f">was</mark> <mark style="background-color:#ff0000">University</mark> <mark style="background-color:#ff0000">of</mark> <mark style="background-color:#ff4747">Tu</mark> <mark style="background-color:#ff4747">##rk</mark> <mark style="background-color:#ff4747">##u</mark> founded ? <mark style="background-color:#7cb342">[SEP]</mark> The <mark style="background-color:#aed581">University</mark> of Tu <mark style="background-color:#dcedc8">##rk</mark> ##u ( Finnish : Tu ##run y ##lio ##pis ##to <mark style="background-color:#aed581">,</mark> in Swedish : Å ##bo un ##ivers ##ite ##t ) , located in Tu ##rk ##u in so ##ut ##wes ##tern Finland <mark style="background-color:#dcedc8">,</mark> is the third largest university in the country as <mark style="background-color:#aed581">measured</mark> <mark style="background-color:#dcedc8">by</mark> <mark style="background-color:#dcedc8">student</mark> <mark style="background-color:#dcedc8">enrollment</mark> , after the University of <mark style="background-color:#aed581">Helsinki</mark> and Tam ##per ##e University . It is a multi ##disciplinary university with eight faculties . It was established in <mark style="background-color:#aed581">1920</mark> <mark style="background-color:#dcedc8">and</mark> also has facilities at Ra ##uma , Po ##ri , Ke ##vo and Se ##ili . The university is a <mark style="background-color:#ff8f8f">member</mark> of the Co <mark style="background-color:#ff8f8f">##im</mark> ##bra Group <mark style="background-color:#ff8f8f">and</mark> the <mark style="background-color:#ff8f8f">European</mark> Campus of <mark style="background-color:#ff8f8f">City</mark> - Un ##iv ##iers ##ities ( EC ##2 ##U ) <mark style="background-color:#ff8f8f">.</mark> <mark style="background-color:#ff4747">[SEP]</mark> 
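The output above is shown per wordpiece, so a word such as "Turku" is split into `Tu ##rk ##u` with a separate score for each piece. A possible post-processing step, sketched below, joins the `##` continuation pieces back onto the preceding token and sums their scores. The helper `merge_wordpieces` is hypothetical, the scores are made up, and summing (rather than averaging or taking the maximum) is just one design choice.

```python
def merge_wordpieces(tokens, scores):
    """Join '##' continuation pieces to the preceding token and sum their scores."""
    words, word_scores = [], []
    for tok, score in zip(tokens, scores):
        if tok.startswith("##") and words:
            words[-1] += tok[2:]
            word_scores[-1] += score
        else:
            words.append(tok)
            word_scores.append(score)
    return words, word_scores

tokens = ["University", "of", "Tu", "##rk", "##u", "founded"]
scores = [0.10, -0.02, 0.03, 0.05, 0.01, 0.20]  # made-up attribution scores
words, word_scores = merge_wordpieces(tokens, scores)
print(list(zip(words, word_scores)))
```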

**Testing with context that doesn't have the answer:**

In [7]:
question = "When was the declaration of Independence signed?"
context = "The Sucunduri Formation is a Neoproterozoic geologic formation in Brazil. While reports made in the 1950s state that dinosaur remains were among the fossils that have been recovered from the formation, none of which referred to a specific genus, later research has questioned this interpretation. The formation crops out in Amazonas."

qAndCAnalyzer(question,context,model,tokenizer)

Context doesn't have an answer to your question!

**Testing with random samples:**

In [8]:
"""
Let's create a very small dataset of 20 question-context pairs
10 of the contexts have the answers and 10 don't
"""
qcPairs = pd.DataFrame([
    ["When was University of Turku founded?","The University of Turku (Finnish: Turun yliopisto, in Swedish: Åbo universitet), located in Turku in soutwestern Finland, is the third largest university in the country as measured by student enrollment, after the University of Helsinki and Tampere University. It is a multidisciplinary university with eight faculties. It was established in 1920 and also has facilities at Rauma, Pori, Kevo and Seili. The university is a member of the Coimbra Group and the European Campus of City - Univiersities (EC2U).",True,"1920"],
    ["When was the declaration of Independence signed?","The Sucunduri Formation is a Neoproterozoic geologic formation in Brazil. While reports made in the 1950s state that dinosaur remains were among the fossils that have been recovered from the formation, none of which referred to a specific genus, later research has questioned this interpretation. The formation crops out in Amazonas.",False,"-"],
    ["What kind of seasoning to use for Chicken Fajitas?","Sheet Pan Chicken Fajitas <br><br> 1. Preheat the oven to 400 degrees Fahrenheit. Mix all of the spices for the fajita seasoning in a small bowl and set aside (chili powder, paprika, onion powder, garlic powder, cumin, cayenne pepper, sugar, and salt). <br><br> 2. Cut the onion and bell peppers into 1/4-inch wide strips. Slice the chicken breast into thin strips. Add the chicken and vegetables to a large baking sheet or casserole dish. <br><br> 3. Cut the onion and bell peppers into 1/4-inch wide strips. Slice the chicken breast into thin strips. Add the chicken and vegetables to a large baking sheet or casserole dish. <br><br> 4. Bake the chicken and vegetables in the preheated oven for 35-40 minutes, stirring once halfway through. Squeeze the juice from half of the lime over top of the meat and vegetables after they come out of the oven. <br><br> 5. To serve, scoop a small amount of meat and vegetables into the center of each tortilla. Top with a few sprigs of cilantro, a dollop of sour cream, and an extra squeeze of lime if desired.",True,"chili powder, paprika, onion powder, garlic powder, cumin, cayenne pepper, sugar, and salt"],
    ["Where does the US Route 66 start?","Illinois Route 84 (Route 84 or IL 84) is a long state highway that runs along the Mississippi River in northwestern Illinois. Illinois 84 runs from south of Green Rock (now Colona) at U.S. Route 6 to the Wisconsin state line at Highway 80 by Hazel Green, Wisconsin. Illinois 84 is 93.93 miles (151.17 km) long. ",False,"-"],
    ["Which historical event did Leo Alexander play a critical role in?","Leo Alexander (October 11, 1905 – July 20, 1985) was an American psychiatrist, neurologist, educator, and author, of Austrian-Jewish origin. He was a key medical advisor during the Nuremberg Trials. Alexander wrote part of the Nuremberg Code, which provides legal and ethical principles for scientific experiment on humans. ",True,"The Nuremberg Trials"],
    ["What is the capital of Denmark?","Cogenhoe United Football Club is a football club based in Cogenhoe, near Northampton, Northamptonshire, England. They are currently members of the United Counties League Premier Division South and play at Compton Park. ",False,"-"],
    ["Why did the French government reward some of its citizens after World War I?","During the war of 1914-1918, the populations of the invaded and occupied regions of France were put under severe strain. Thus, at the end of hostilities, it seemed necessary to pay tribute to the courage of these people by rewarding them with several medals such as the Medal for victims of the invasion, the Medal of French Fidelity and the Medal for civilian prisoners, deportees and hostages of the 1914-1918 Great War. It was on the proposal of the Minister for the Liberated Regions that the Medal for victims of the invasion (French: Médaille des victimes de l'invasion) was created on 30 June 1921 in three classes (bronze, silver and silver gilt)",True,"the populations of the invaded and occupied regions of France were put under severe strain. Thus, at the end of hostilities, it seemed necessary to pay tribute to the courage of these people"],
    ["What is the biggest music publishing company in India?","Zubeen Garg is an Indian singer, music director, composer, lyricist, music producer, actor, film director, film producer, script writer and philanthropist. He primarily works for and sings in the Assamese, Bengali and Hindi-language film and music industries.",False,"-"],
    ["Who was Lolita Chammah?","Chammah is a surname. Notable people with the surname include: <br> - Itai Chammah (born 1985), Israeli swimmer <br> - Lolita Chammah (born 1983), French actress <br> - Ronald Chammah, French film director <br> - Walid Chammah (born 1954), Lebanese banker",True,"French actress"],
    ["What is the capital of Poland?","The poiseuille (symbol Pl) has been proposed as a derived SI unit of dynamic viscosity, named after the French physicist Jean Léonard Marie Poiseuille (1797–1869).",False,"-"],
    ["Which album did the TLC song 'Waterfalls' appear on?","'Waterfalls' is a song by American hip-hop group TLC, released in May 1995 by LaFace and Arista as the third single from the group's second album, CrazySexyCool (1994). The single was also released in the United Kingdom on July 24, 1995.",True,"CrazySexyCool"],
    ["What caused the 2016 Killer Clowns trend?","The Anzick site (24PA506), located adjacent to Flathead Creek, a tributary of the Shields River in Wilsall, Park County, Montana, United States, is the only known Clovis burial site in the New World. The term 'Clovis' is used by archaeologists to define one of the New World's earliest hunter-gatherer cultures and is named after the site near Clovis, New Mexico, where human artifacts were found associated with the procurement and processing of mammoth and other large and small fauna.",False,"-"],
    ["What is a common nickname for Pimento cheese?","Sobchak: You been doing this long? I just assume we're all heavy hitters. And it makes sense. The vet recommends the best of the best. Dealing with some of these ethnic types, blood tends to run a little hotter. That's just science. Physiology. There's historical precedent. Know what I'm saying? So, what you packing? <br> Mike Ehrmantraut: A pimento. <br> Sobchak: Sorry, what? <br> Mike Ehrmantraut: Pimento sandwich. <br> Sobchak: That's funny. Pimento. No, I mean, what are you carrying? You know, the piece? What's the make? <br> Mike Ehrmantraut: Pimento is a cheese. They call it the caviar of the south.",True,"the caviar of the south"],
    ["What’s the most popular TV series in Pakistan?","Ghalati is a 2019 Pakistani romantic drama television series that premiered on ARY Digital on December 19, 2019. It is directed by Saba Hameed and written by Asma Sayani. Produced by Humayun Saeed under Six Sigma Plus, it stars Hira Mani and Affan Waheed",False,"-"],
    ["What kind of plant is Cypreus Niveus?","Cyperus niveus is a species of sedge that is native to parts of Africa, the Middle East and Asia.",True,"sedge"],
    ["Where in India is Rajasthan located?","Jaydeep Bihani is an Indian politician serving as a member of the 16th Rajasthan Assembly. He is a member of the Bharatiya Janata Party and represents the Ganganagar Assembly constituency from Sri Ganganagar district.",False,"-"],
    ["Which TV network did the show 'Orleans' air on?","Orleans is an American drama television series created by Toni Graphia and John Sacret Young, that aired on CBS from January 7 through April 10, 1997. It ran for 8 episodes. The series was said to be inspired by the experiences of creator/producer Toni Graphia, who was the daughter of a Louisiana judge.",True,"CBS"],
    ["Who is the most famous football player in the Netherlands?","Raoul Henar (born 4 August 1972) is a Dutch former professional footballer who played as a forward for Eerste Divisie and Eredivisie clubs Stormvogels Telstar, SC Veendam, Helmond Sport and FC Volendam between 1997 and 2004.",False,"-"],
    ["Which street is the San Bernardino Valley campus located on?","KVCR-DT (channel 24) is a PBS member television station in San Bernardino, California, United States. It is owned by the San Bernardino Community College District alongside NPR member KVCR (91.9 FM). The two stations share studios at the San Bernardino Valley College campus on North Mt. Vernon Avenue in San Bernardino; KVCR-DT's transmitter is located atop Box Springs Mountain.",True,"North Mt. Vernon Avenue"],
    ["Who owns the Mexican TV station Las Estrellas?","M.D.: Life on the Line (Spanish: Médicos, línea de vida) is a Mexican telenovela that premiered on Las Estrellas on 11 November 2019. The series is produced by José Alberto Castro for Televisa, and stars Livia Brito, Daniel Arenas and Rodolfo Salas. Production began in August 2019. <br><br> On 7 February 2020, producer José Alberto Castro confirmed that the series had been renewed for a second season.",False,"-"]
], columns=["question","context","answer in context?","answer"])
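# Sample 5 distinct row indices from the whole dataset (range(0, 19) would exclude the last row)
for i in random.sample(range(len(qcPairs)), 5):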
    dString = "Selecting random question-context pair for analysis.<br>**Question:** " + qcPairs["question"].loc[qcPairs.index[i]] + "<br>"
    if(qcPairs["answer in context?"].loc[qcPairs.index[i]]):
        dString += "The answer to the question **is** in the context.<br> **Expected answer:** "  + qcPairs["answer"].loc[qcPairs.index[i]] + "<br>"
    else:
        dString += "The answer to the question **is not** in the context.<br>"
    dString += "**Analysis results:**<br>"
    display(Markdown(dString))
    qAndCAnalyzer(qcPairs["question"].loc[qcPairs.index[i]],qcPairs["context"].loc[qcPairs.index[i]],model,tokenizer)

Selecting random question-context pair for analysis.<br>**Question:** What caused the 2016 Killer Clowns trend?<br>The answer to the question **is not** in the context.<br>**Analysis results:**<br>

Context doesn't have an answer to your question!

Selecting random question-context pair for analysis.<br>**Question:** Where does the US Route 66 start?<br>The answer to the question **is not** in the context.<br>**Analysis results:**<br>

Context doesn't have an answer to your question!

Selecting random question-context pair for analysis.<br>**Question:** When was University of Turku founded?<br>The answer to the question **is** in the context.<br> **Expected answer:** 1920<br>**Analysis results:**<br>

[CLS] <mark style="background-color:#7cb342">When</mark> <mark style="background-color:#ff8f8f">was</mark> <mark style="background-color:#ff0000">University</mark> <mark style="background-color:#ff0000">of</mark> <mark style="background-color:#ff4747">Tu</mark> <mark style="background-color:#ff4747">##rk</mark> <mark style="background-color:#ff4747">##u</mark> founded ? <mark style="background-color:#7cb342">[SEP]</mark> The <mark style="background-color:#aed581">University</mark> of Tu <mark style="background-color:#dcedc8">##rk</mark> ##u ( Finnish : Tu ##run y ##lio ##pis ##to <mark style="background-color:#aed581">,</mark> in Swedish : Å ##bo un ##ivers ##ite ##t ) , located in Tu ##rk ##u in so ##ut ##wes ##tern Finland <mark style="background-color:#dcedc8">,</mark> is the third largest university in the country as <mark style="background-color:#aed581">measured</mark> <mark style="background-color:#dcedc8">by</mark> <mark style="background-color:#dcedc8">student</mark> <mark style="background-color:#dcedc8">enrollment</mark> , after the University of <mark style="background-color:#aed581">Helsinki</mark> and Tam ##per ##e University . It is a multi ##disciplinary university with eight faculties . It was established in <mark style="background-color:#aed581">1920</mark> <mark style="background-color:#dcedc8">and</mark> also has facilities at Ra ##uma , Po ##ri , Ke ##vo and Se ##ili . The university is a <mark style="background-color:#ff8f8f">member</mark> of the Co <mark style="background-color:#ff8f8f">##im</mark> ##bra Group <mark style="background-color:#ff8f8f">and</mark> the <mark style="background-color:#ff8f8f">European</mark> Campus of <mark style="background-color:#ff8f8f">City</mark> - Un ##iv ##iers ##ities ( EC ##2 ##U ) <mark style="background-color:#ff8f8f">.</mark> <mark style="background-color:#ff4747">[SEP]</mark> 

Selecting random question-context pair for analysis.<br>**Question:** What is the biggest music publishing company in India?<br>The answer to the question **is not** in the context.<br>**Analysis results:**<br>

Context doesn't have an answer to your question!

Selecting random question-context pair for analysis.<br>**Question:** Where in India is Rajasthan located?<br>The answer to the question **is not** in the context.<br>**Analysis results:**<br>

Context doesn't have an answer to your question!
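To summarize a run like the one above, the expected labels can be compared with the YES/NO decisions. The sketch below does this for the five samples printed above, with the labels typed in by hand from that output (one pair where the answer is present and was highlighted, four where it is absent and was rejected). The helper `tally` is hypothetical and not part of the exercise code.

```python
def tally(expected, predicted):
    """Count agreement between expected and predicted YES/NO labels."""
    counts = {"tp": 0, "tn": 0, "fp": 0, "fn": 0}
    for exp, pred in zip(expected, predicted):
        if exp and pred:
            counts["tp"] += 1
        elif not exp and not pred:
            counts["tn"] += 1
        elif pred:
            counts["fp"] += 1
        else:
            counts["fn"] += 1
    total = sum(counts.values())
    accuracy = (counts["tp"] + counts["tn"]) / total if total else 0.0
    return counts, accuracy

# Labels copied by hand from the five samples shown above.
expected = [False, False, True, False, False]
predicted = [False, False, True, False, False]
counts, accuracy = tally(expected, predicted)
print(counts, accuracy)
```

Five samples are far too few to say much about the model; the same tally over all 20 pairs would give a slightly more informative picture.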