# CS 195: Natural Language Processing
## Question Answering

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ericmanley/f23-CS195NLP/blob/main/F2_3_QuestionAnswering.ipynb)


## References

Hugging Face Task Guide on Question Answering: https://huggingface.co/docs/transformers/tasks/question_answering


## Installing necessary modules

In [1]:
import sys
!{sys.executable} -m pip install transformers datasets evaluate rouge_score

Defaulting to user installation because normal site-packages is not writeable

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip available: [0m[31;49m22.2.2[0m[39;49m -> [0m[32;49m23.2.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49m/Library/Frameworks/Python.framework/Versions/3.10/bin/python3 -m pip install --upgrade pip[0m


In [4]:
# Alternative for Poetry-managed environments
!poetry add transformers datasets evaluate rouge_score

Configuration file exists at /Users/elijahlueders/Library/Preferences/pypoetry, reusing this directory.

Consider moving TOML configuration files to /Users/elijahlueders/Library/Application Support/pypoetry, as support for the legacy directory will be removed in an upcoming release.
The following packages are already present in the pyproject.toml and will be skipped:

  • transformers
  • datasets
  • evaluate
  • rouge_score

If you want to update it to the latest compatible version, you can use `poetry update package`.
If you prefer to upgrade it to the latest available version, you can use `poetry add package@latest`.

Nothing to add.


## Question Answering

We'll use a [RoBERTa-based model](https://huggingface.co/deepset/roberta-base-squad2) fine-tuned on the [SQuAD2.0](https://huggingface.co/datasets/squad_v2) question answering dataset.

It requires two inputs:
* a question
* a context: the text in which to look for the answer

It returns:
* an answer
* the location of the answer in the context (`start` and `end` character offsets)
* a confidence score

In [5]:
times_delphic_story = """
How does the Supreme Court ruling on affirmative action affect Drake?
The answer has little to do with affirmative action.
Over the summer, the Supreme Court ruled against the admissions programs of Harvard University and the University of North Carolina in an affirmative action decision. Before the decision, race already wasn’t a factor in Drake University admissions, according to Provost Sue Mattison. 
“Affirmative action, with regards to admissions, only impacts those really highly selective institutions that limit the number of incoming students,” Mattison said. “So that doesn’t apply to Drake and most institutions across the country.”
She said schools like Harvard and UNC have enough applicants that they can pick and choose which applicants fill a certain number of spots.
Drake’s admissions team found that the university has “admitted all students who have a 3.0 high school GPA or [higher],” Mattison said. “Even though we’ve asked for a person’s race on the admissions form, it does not have an impact on the admissions decision, and it doesn’t displace anybody.”
Possible effects of the court’s ruling 
Mark Kende, director of Drake’s Constitutional Law Center, said the Supreme Court “basically has embraced an idea that it calls colorblindness.”
“If you take their principle of colorblindness and extend it beyond universities, to other places, it could raise some problems,” Kende said. “But we don’t know yet.”
Financial aid programs that prioritize applicants of a particular race over another are more vulnerable after the court’s decision, according to Kende. He said it’s not clear what impact the decision might have on university hiring practices that consider an employee’s race, as well as corporations’ diversity programs.
Following the Supreme Court’s decision, Missouri Attorney General Andrew Bailey said Missouri institutions subject to the U.S. Constitution or Title VI must stop using race-based standards “to make decisions about things like admissions, scholarships, programs and employment.” 
The University of Missouri System said that “a small number of our programs and scholarships have used race/ethnicity as a factor for admissions and scholarships,” and that “these practices will be discontinued.”
Drake is taking a different approach in the wake of the affirmative action decision. The university is monitoring maybe about forty to fifty scholarships, according to Ryan Zantingh, Drake’s director of financial aid. This is more in anticipation of a comparable case on financial aid that considers race, rather than a reaction to the affirmative action ruling.
Mattison said she thinks Drake is still trying to determine how the Supreme Court decision will impact Drake’s Crew Scholars program, which is for incoming students of color.  
“There are ways that we can ensure that we continue Crew Scholars while still being compliant,” Mattison said.
Donors for some Drake scholarships specified that they wanted to support a student of color or a woman in a STEM field, Mattison said.
“And so we’re still working through what that actually means, and what we have to do to continue to achieve the values that we expect,” Mattison said. “There are ways that we can change the wording of some of the scholarships.”
Like all students, students of color may qualify for scholarships for first-generation students or students with financial need. 
“There’s a lot of overlap between students of color and other areas where financial aid is directed,” Zantingh said. “Scholarship resources can be directed [to financial need or first generation status] and still reach the same students.”
Even if there is a ruling on financial aid that’s comparable to the affirmative action decision, Zantingh doesn’t expect a large impact on Drake financial aid from either decision. 
“There may be some implications, but I think the overall general effect on students will be little to none,” Zantingh said. 
Zantingh gave an example of scholarship language offered by legal counsel. If a scholarship is for only minority students, it might become a scholarship that gives preference to students who demonstrate a commitment to Drake’s vision for diversity on campus. 
“If a white student is actively involved in anti-racist leadership here on campus, certainly they would fit that description then, wouldn’t they?” Zantingh said. “Basically, the language would not seek to exclude any particular protected class categorically.”
In some cases, a donor might be unwilling to change the scholarship’s language or be deceased, Zantingh said. If a donor is deceased, a judge might approve changes. He said he doesn’t expect Drake to cut any of the scholarships it is monitoring.
“The scholarship criteria would have to change, or the dollars would have to be repurposed in another way. Per either the donor or a court’s approval,” Zantingh said. 
Race can still play a role in college admissions
The Supreme Court left at least one legal path open for race to play a role in college admissions. 
When admitting students, universities are allowed to consider “an applicant’s discussion of how race affected his or her life, be it through discrimination, inspiration or otherwise,” Chief Justice John Roberts wrote in the Court’s decision. However, “the student must be treated based on his or her experiences as an individual — not on the basis of race.” 
A student’s story can emerge without Drake asking for it, according to Dean of Admissions Joel Johnson. 
“Especially if they’ve overcome a lot, or it’s so key to their identity… it’ll come out on its own,” Johnson said. “I don’t know if I could say the Supreme Court protected it. They couldn’t have stopped it, honestly.”
Johnson said that caring about diversity also means intentionally recruiting a diverse group of students. He said students can’t join Drake if they never apply in the first place.
In the wake of the Supreme Court’s decision on affirmative action, The Times-Delphic is publishing a series. Check next week’s paper for an article about legacy admissions and legacy financial aid with a Drake focus. 

"""

In [6]:
from transformers import pipeline

model_name = "deepset/roberta-base-squad2"

# Build a question-answering pipeline and get a prediction
nlp = pipeline('question-answering', model=model_name, tokenizer=model_name)
QA_input = {
    'question': 'Can colleges take race into account when making admissions decisions?',
    'context': times_delphic_story
}
res = nlp(QA_input)
print(res)


{'score': 0.1444220393896103, 'start': 1416, 'end': 1433, 'answer': 'we don’t know yet'}


In [8]:
print( times_delphic_story[1416:1433] )
print( times_delphic_story[1200:1500] )

we don’t know yet
Court “basically has embraced an idea that it calls colorblindness.”
“If you take their principle of colorblindness and extend it beyond universities, to other places, it could raise some problems,” Kende said. “But we don’t know yet.”
Financial aid programs that prioritize applicants of a particula
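The slicing above works because `start` and `end` are character offsets into the context string. Here is a minimal sketch of a helper that wraps that idea. The function name `show_answer_in_context`, the `window` parameter, and the tiny demo context are my own illustrations, not part of the `transformers` API; the demo dictionary just mimics the shape of the pipeline's output.

```python
def show_answer_in_context(context, result, window=40):
    """Return the answer span with `window` characters of context on each side."""
    start, end = result["start"], result["end"]
    left = max(0, start - window)
    right = min(len(context), end + window)
    # Mark the answer with brackets so it stands out in the surrounding text.
    return context[left:start] + "[" + context[start:end] + "]" + context[end:right]


# Hypothetical mini-example; the score is a made-up placeholder.
demo_context = "Kende said it could raise some problems. But we don't know yet."
demo_answer = "we don't know yet"
demo_start = demo_context.index(demo_answer)
demo_result = {
    "score": 0.14,
    "start": demo_start,
    "end": demo_start + len(demo_answer),
    "answer": demo_answer,
}

snippet = show_answer_in_context(demo_context, demo_result, window=10)
print(snippet)
```

In the notebook, something like `show_answer_in_context(times_delphic_story, res)` should give a similar view without hand-picking slice boundaries.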


### Let's try another question

In [9]:
QA_input2 = {
    'question' : "Which kinds of schools are most affected by the Supreme Court's affirmative action ruling?",
    'context': times_delphic_story
}
res = nlp(QA_input2)
print(res)

{'score': 0.035478729754686356, 'start': 671, 'end': 686, 'answer': 'Harvard and UNC'}


In [10]:
print( times_delphic_story[671:686] )
print( times_delphic_story[500:800] )

Harvard and UNC
 institutions that limit the number of incoming students,” Mattison said. “So that doesn’t apply to Drake and most institutions across the country.”
She said schools like Harvard and UNC have enough applicants that they can pick and choose which applicants fill a certain number of spots.
Drake’s adm


The answer I was hoping for was `"highly selective institutions"`.

### How you ask the question seems to have an impact on the answer it finds

In [11]:
QA_input3 = {
    'question' : "Does Drake consider race when deciding to admit a student?",
    'context': times_delphic_story
}
res = nlp(QA_input3)
print(res)

{'score': 0.1436648666858673, 'start': 1416, 'end': 1433, 'answer': 'we don’t know yet'}


In [12]:
QA_input4 = {
    'question' : "At Drake, does race have an impact on the admissions decision?",
    'context': times_delphic_story
}
res = nlp(QA_input4)
print(res)

{'score': 0.10744316130876541, 'start': 995, 'end': 1048, 'answer': 'it does not have an impact on the admissions decision'}
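To compare phrasings more systematically, one option is to loop over several versions of a question and collect the answers side by side. This is only a sketch: `compare_phrasings` is a hypothetical helper, and `fake_qa` is a stand-in that returns canned values (rounded from the outputs above) so the example runs without downloading a model.

```python
def compare_phrasings(qa, context, questions):
    """Ask each question against the same context.

    Returns a list of (question, answer, score) tuples, highest score first.
    `qa` is any callable that accepts {'question': ..., 'context': ...}
    and returns a dict with 'answer' and 'score' keys.
    """
    rows = []
    for question in questions:
        result = qa({"question": question, "context": context})
        rows.append((question, result["answer"], result["score"]))
    return sorted(rows, key=lambda row: row[2], reverse=True)


def fake_qa(inputs):
    """Hypothetical stand-in for the pipeline, returning canned results."""
    if "impact" in inputs["question"]:
        return {"score": 0.107, "answer": "it does not have an impact on the admissions decision"}
    return {"score": 0.144, "answer": "we don't know yet"}


questions = [
    "Does Drake consider race when deciding to admit a student?",
    "At Drake, does race have an impact on the admissions decision?",
]
ranked = compare_phrasings(fake_qa, "placeholder context", questions)
for question, answer, score in ranked:
    print(f"{score:.3f}  {answer!r}  <-  {question}")
```

In the notebook you would pass the real pipeline instead, e.g. `compare_phrasings(nlp, times_delphic_story, questions)`. Note that in the runs above the higher score went with the less useful answer, so the score reads better as the model's confidence than as a measure of correctness.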


## Discussion question:

What are some ways you can think of for evaluating question answering models?

## Group Exercise

Find a question answering *dataset* on Hugging Face. Test out some of the examples from the dataset using the metrics we decided on.

In [22]:
fairytail = "once upon a time there was a king who went forth into the world and fetched back a beautiful queen . and after they had been married a while god gave them a little daughter . then there was great rejoicing in the city and throughout the country , for the people wished their king all that was good , since he was kind and just . while the child lay in its cradle , a strange - looking old woman entered the room , and no one knew who she was nor whence she came . the old woman spoke a verse over the child , and said that she must not be allowed out under the open sky until she were full fifteen years of age , since otherwise the mountain troll would fetch her . when the king heard this he took her words to heart , and posted guards to watch over the little princess so that she would not get out under the open sky ."	
fairytail_ref = "the people wished their king all that was good ."
fairytail_input = {
    'question' : "why was there great rejoicing in the city and throughout the country ?",
    'context': fairytail

}
fairytail_res = nlp(fairytail_input)
print(fairytail_res)

# find the rouge score
import evaluate

rouge = evaluate.load('rouge')

rouge.compute(predictions=[fairytail_res['answer']],references=[fairytail_ref])

{'score': 0.18159113824367523, 'start': 251, 'end': 297, 'answer': 'the people wished their king all that was good'}


{'rouge1': 1.0, 'rouge2': 1.0, 'rougeL': 1.0, 'rougeLsum': 1.0}

In [27]:
fairytail_2 = "the youth had no great appetite for this food . \" if i were only away and up above again , \" thought he , but he said nothing . \" now i think you must surely want to get home again , \" said the rat . \" i am well aware that you are waiting impatiently for the wedding , and i will hurry all i can . take this linen thread along , and when you get up above , you must not turn around , but must go straight home , and as you go you must keep repeating : ' short before and long behind ! ' \" and with that she laid a linen thread in his hand . \" heaven be praised ! \" said the youth when he was up above once more . \" i 'll not go down there again in a hurry . \" but he held the thread in his hand , and danced and sang as usual . and although he no longer had the rat - hole in mind , he began to hum : \" short before and long behind ! short before and long behind ! \""
fairytail_2_ref = "so he could go home ."	
fairytail_2_input = {
    'question' : "why did the rat give the youth the linen thread?",
    'context': fairytail_2

}
fairytail_2_res = nlp(fairytail_2_input)
print(fairytail_2_res)

# find the rouge score
import evaluate

rouge = evaluate.load('rouge')

rouge.compute(predictions=[fairytail_2_res['answer']],references=[fairytail_2_ref])

{'score': 0.01311157364398241, 'start': 223, 'end': 266, 'answer': 'you are waiting impatiently for the wedding'}


{'rouge1': 0.0, 'rouge2': 0.0, 'rougeL': 0.0, 'rougeLsum': 0.0}
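ROUGE is one option. Two other metrics often reported for extractive question answering on SQuAD-style data are exact match and token-level F1. The sketch below is a plain-Python approximation of that idea, assuming the usual normalization (lowercasing, dropping punctuation and the articles a/an/the); the function names are my own, and it is not the official scoring script. The `evaluate` library also ships `squad` and `squad_v2` metrics if you would rather not roll your own.

```python
import re
import string
from collections import Counter


def normalize_answer(text):
    """Lowercase, strip ASCII punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, reference):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(reference))


def token_f1(prediction, reference):
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize_answer(prediction).split()
    ref_tokens = normalize_answer(reference).split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# The two fairy tale examples from the cells above
em_1 = exact_match("the people wished their king all that was good",
                   "the people wished their king all that was good .")
f1_2 = token_f1("you are waiting impatiently for the wedding",
                "so he could go home .")
print(em_1, f1_2)
```

Both metrics only reward word overlap, so a correct answer phrased differently from the reference (as in the second example) still scores zero.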

In [31]:
len(fairytail)

866

## Applied Exploration

Choose a Question Answering model from Hugging Face (you may use the one we used in class). Set up an experiment to answer the following question: How does the length of the context affect the performance of the model?

Answer the following questions:
* What dataset(s) did you use (provide links)?
* Describe the kinds of questions and answers that appear in this data. How do the lengths of the context vary? Maybe provide a histogram that describes this.
* What metrics did you use? Why did you choose those?
* What were your results? Describe what you found and any additional take-aways.
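One possible way to set up the experiment is to group examples into context-length buckets and average a per-example metric (exact match, ROUGE, or similar) within each bucket. The sketch below assumes you already have two parallel lists, one of context lengths and one of scores; the helper name `mean_score_by_length`, the bucket edges, and the demo numbers are all hypothetical, not results.

```python
from bisect import bisect_right


def mean_score_by_length(lengths, scores, edges):
    """Average `scores` within length buckets defined by sorted `edges`.

    Bucket 0 holds lengths < edges[0]; bucket i holds
    edges[i-1] <= length < edges[i]; the last bucket holds
    lengths >= edges[-1]. Empty buckets are reported as None.
    """
    totals = [0.0] * (len(edges) + 1)
    counts = [0] * (len(edges) + 1)
    for length, score in zip(lengths, scores):
        bucket = bisect_right(edges, length)
        totals[bucket] += score
        counts[bucket] += 1
    return [total / count if count else None
            for total, count in zip(totals, counts)]


# Made-up lengths and scores, only to show the mechanics
demo_lengths = [200, 450, 600, 900, 1500, 3000]
demo_scores = [1.0, 1.0, 0.5, 0.0, 1.0, 0.0]
bucket_means = mean_score_by_length(demo_lengths, demo_scores, edges=[500, 1000])
print(bucket_means)
```

Choosing edges so that each bucket holds a similar number of examples helps keep the averages comparable, since very long contexts are rare.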

In [69]:
from datasets import load_dataset

dataset = load_dataset("squad_v2", split="validation")

In [70]:
dataset

Dataset({
    features: ['id', 'title', 'context', 'question', 'answers'],
    num_rows: 11873
})

In [71]:
# add column with length of context
dataset = dataset.map(lambda example: {'length': len(example['context'])})

In [72]:
dataset[0]


{'id': '56ddde6b9a695914005b9628',
 'title': 'Normans',
 'context': 'The Normans (Norman: Nourmands; French: Normands; Latin: Normanni) were the people who in the 10th and 11th centuries gave their name to Normandy, a region in France. They were descended from Norse ("Norman" comes from "Norseman") raiders and pirates from Denmark, Iceland and Norway who, under their leader Rollo, agreed to swear fealty to King Charles III of West Francia. Through generations of assimilation and mixing with the native Frankish and Roman-Gaulish populations, their descendants would gradually merge with the Carolingian-based cultures of West Francia. The distinct cultural and ethnic identity of the Normans emerged initially in the first half of the 10th century, and it continued to evolve over the succeeding centuries.',
 'question': 'In what country is Normandy located?',
 'answers': {'text': ['France', 'France', 'France', 'France'],
  'answer_start': [159, 159, 159, 159]},
 'length': 742}

In [73]:
print(len(dataset[0]['context']))
print(dataset[0]['length'])

742
742


In [74]:
# sort by length
dataset = dataset.sort("length")

In [75]:
print(len(dataset[0]['context']))
print(len(dataset[-1]['context']))



169
4063


In [77]:
len(dataset)

11873

In [91]:
dataset[100]



{'id': '5ad562525b96ef001a10ad53',
 'title': 'Computational_complexity_theory',
 'context': 'The time and space hierarchy theorems form the basis for most separation results of complexity classes. For instance, the time hierarchy theorem tells us that P is strictly contained in EXPTIME, and the space hierarchy theorem tells us that L is strictly contained in PSPACE.',
 'question': 'What is not strictly contained in PSPACE?',
 'answers': {'text': [], 'answer_start': []},
 'length': 275}

In [92]:
# remove rows with no answers
dataset = dataset.filter(lambda example: len(example['answers']['text']) > 0)

len(dataset)

5928

In [95]:
# sample the context length of every 500th example (the dataset is sorted by length)
lengths = [dataset[i]['length'] for i in range(0, len(dataset), 500)]

print(len(lengths), lengths)


12 [169, 521, 556, 591, 625, 672, 721, 787, 855, 954, 1072, 1277]
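A quick way to see how these lengths are distributed is a rough text histogram. This is a minimal sketch using only the twelve sampled lengths printed above; `length_histogram` and the bin width of 200 characters are my own choices. For the full picture you would pass every length in the dataset (or use a plotting library) rather than this small sample.

```python
from collections import Counter


def length_histogram(lengths, bin_width=200):
    """Count lengths per bin; keys are the lower edge of each bin."""
    counts = Counter((length // bin_width) * bin_width for length in lengths)
    return dict(sorted(counts.items()))


# The every-500th-example sample printed above
sample_lengths = [169, 521, 556, 591, 625, 672, 721, 787, 855, 954, 1072, 1277]
hist = length_histogram(sample_lengths, bin_width=200)

for lower, count in hist.items():
    print(f"{lower:>4}-{lower + 199:<4} {'#' * count}")
```

Because the sample takes every 500th row of a sorted dataset, each value is roughly an 8th-percentile step, which is why most of them cluster between 500 and 1000 characters.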



## What about conversational models?

Some of you have already experimented with the conversational models.

These are more difficult to evaluate than the others we've looked at.

They usually start with a pre-training step such as "predict the next/missing word in this sequence."

They are then fine-tuned with human feedback.

Next time, we'll look at a simple model for predicting the next word in a sequence.