# Lesson notebook 7 - Summarization and Question Answering



### Extractive summarization example

One of the challenges faced by current neural systems is the size of the input they can manage.  As a result most  of these systems end up truncating the input in some fashion.  Can you get a good summary if you only read in the first 500 words of a document?  One solution to this is to use an older approach called extractive summarization.  In this approach the content of the input document(s) is broken into sentences which are scored for their relevance to either the document or to a query.  We'll demonstrate it's use on a wikipedia article.


### Abstractive  summarization example

We'll use T5 again to summarize some input text.  We do this because the text in -> text out interface as well as the multi-task fine tuning makes it a great vehicle for demonstration.


### Span-based question answering example

There are a variety of approaches to question answering.  Here we demonstrate one particular approach to the problem -- span detection -- where we feed a context paragraph and the question to the system and want the machine to identify the answer span within the context paragraph.

<a id = 'returnToTop'></a>

## Notebook Contents
  * 1. [Setup](#setup)
  * 2. [SumBasic Extractive Summarization](#extractiveSummarization)
  * 3. [Abstractive Summarization with T5](#abstractiveSummarization)
  * 4. [Extractive Question Answering with T5](#extractiveQA)
  * [Answers](#answers)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/datasci-w266/2024-summer-main/blob/master/materials/lesson_notebooks/lesson_7_summarization_QA.ipynb)



[Return to Top](#returnToTop)
<a id = 'setup'></a>

## 1. Setup

Let's set up our environment so we can grab the wikipedia page on Natural Language Processing.  You can modify the string to find the Wikipedia page of your choice.  We'll need NLTK to build our extractive summarizer.

We'll also need the HuggingFace Transformers library for our abstractive summarization and question answering examples.

Now let's get a document to summarize.  We'll use Wikipedia since it contains a large number of longer documents.

In [1]:
!pip install -q wikipedia

  Preparing metadata (setup.py) ... [?25l[?25hdone
  Building wheel for wikipedia (setup.py) ... [?25l[?25hdone


In [2]:
!pip install -q sentencepiece

[?25l     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/1.3 MB[0m [31m?[0m eta [36m-:--:--[0m[2K     [91m━━━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.3/1.3 MB[0m [31m9.8 MB/s[0m eta [36m0:00:01[0m[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.3/1.3 MB[0m [31m19.2 MB/s[0m eta [36m0:00:00[0m
[?25h

In [3]:
!pip install -q transformers

[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m7.2/7.2 MB[0m [31m34.4 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m236.8/236.8 kB[0m [31m20.4 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m7.8/7.8 MB[0m [31m64.0 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.3/1.3 MB[0m [31m45.3 MB/s[0m eta [36m0:00:00[0m
[?25h

In [4]:
import nltk
import nltk.corpus
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


True

[Return to Top](#returnToTop)
<a id = 'extractiveSummarization'></a>

## 2. SumBasic Extractive Summarization

Let's run our extractive summarization example.  We'll use NLTK and a simple algorithm that relies on the frequency of words to identify sentences to extract and place in the summary.

The advantage of these older counting approaches is that they can handle documents of arbitrary length and can easily run without a GPU.

In [5]:
import string
from nltk.tokenize import sent_tokenize, regexp_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from nltk.probability import FreqDist

Extractive summarization allows us to specify the size of the summary we want.  We will do it as a percentage of the size of the input.  Let's first grab a document to work with.  We'll grab the [Wikipedia article on natural language processing](https://en.wikipedia.org/wiki/Natural_language_processing) since it is long.  Under the hood the system is breaking the document into sentences and scoring those sentences by their relevance to the document according to the SumBasic algorithm.  As a result the summary is a set of sentences copied directly from the original.  Some algorithms presented the extracted sentences in score order while others present in the order in which they appeared in the original document.  Why do you think that might matter?

In [6]:
import wikipedia
from pprint import pprint


# Get wiki content.
wikisearch = wikipedia.page("Natural Language Processing")
wikicontent = wikisearch.content

First let's implement the SumBasic algorithm using some NLTK functions.  It's a very straightforward approach using probabilities to assign scores to each sentence and word and then pick the highest scoring sentences.  Those highest scoring sentences are extracted from the original and then printed as part of the summary.  The original paper (MSR-TR-2005-101) can be [found here](https://www.cs.bgu.ac.il/~elhadad/nlp09/sumbasic.pdf) as well as a [followup article by the same authors here](https://www.cis.upenn.edu/~nenkova/papers/ipm.pdf).  The idea is to try and score sentences for inclusion in the summary based primarily on word frequency.

The basic algorithm is:

1.   Compute the score of each word in the document by dividng the frequency of the word by the total number of words.
2.   Compute the score of each sentence by computing the average score of its words
3.   Select the highest scoring sentence that contains the highest scoring word and add to summary
4.   For each word in the selected sentence, update its score by squaring the word score.  This makes words already in the summary score lower and sentences without those words score higher.
5.   If summary not long enough, return to step 2 and recalculate sentence scores.

The SumBasic algorithm is a computationally cheap way of creating an extractive summary of an arbitrarily long document.  Let's see what it looks like.




In [7]:
#score the sentences and print the highest scoring sentence with the highest scoring word
#keep repeating (with word score recalulation) until length is reached

def sumbasic(lem_sentences, lem_words, size):

    freq = FreqDist(lem_words)
    total = sum(freq.values())
    probs = {k: v/total for k, v in freq.items()}

    len_summary = int(size * len(lem_sentences))    #calculate number of sentences to put in the summary

    summary = []

    for _ in range(len_summary):

        scores = {k: [] for k in lem_sentences}
        importance = {k: 0 for k in scores}
        for key, value in lem_sentences.items():               #recalulate the sentence scores
            for word in value:
                scores[key].append(probs[word])
            importance[key] = sum(scores[key]) / len(scores[key])

        most_importance_sentence = max(scores, key=scores.get)  #pull out the most important sentence
        summary.append(most_importance_sentence)

        for word in lem_sentences[most_importance_sentence]:    #recalculate word scores
            probs[word] = probs[word] * probs[word]

    for sentence in lem_sentences:
        if sentence in summary:
            pprint(sentence, compact=True)


Now let's run the SumBasic function now with the Wikipedia page on NLP and lets ask for a summary that is 5% of the original.

In [8]:
#get the wiki article and break it first into sentences using NLTK's sent_tokenize
all_sentences = sent_tokenize(wikicontent)

#Let's walk through each of these sentences so we can divide into tokens (e.g. words)
word_tokens = []
sentence_tokens = {sentence: [] for sentence in all_sentences}

for one_sentence in all_sentences:
    for token in regexp_tokenize(one_sentence.lower(), '\w+'):  #divide the sentences into tokens based on the regex for whitespace
        if token not in string.punctuation:                     #ignore punctuation
            if token not in stopwords.words('english'):         #ignore stopwords
                word_tokens.append(token)
                sentence_tokens[one_sentence].append(token)

#A lemmatizer takes conjugated verbs and returns their infinitive form (e.g. conjugating -> conjugate)
#It does the same thing with nouns taking the plural form and returning the singular form.
#We're doing this because we want to count up occurences of word roots to get a tighter distribution
lem = WordNetLemmatizer()
lem_words = [lem.lemmatize(word) for word in word_tokens]
lem_sentences = {sentence: [lem.lemmatize(word) for word in sentence_tokens[sentence]] for sentence in sentence_tokens}

#Now we have a list of lemmatized words and a list of sentences containing lemmatized words
#we pass them to the sumbasic fiunction along with a size parameter
#We'll also pass a summary size as a percentage of the sentences in the original document
sumbasic(lem_sentences, lem_words, 0.05)


('Systems based on automatically learning the rules can be made more accurate '
 'simply by supplying more input data.')
('=== Statistical methods ===\n'
 'Since the so-called "statistical revolution" in the late 1980s and '
 'mid-1990s, much natural language processing research has relied heavily on '
 'machine learning.')
('Some of these tasks have direct real-world applications, while others more '
 'commonly serve as subtasks that are used to aid in solving larger tasks.')
('In natural speech there are hardly any pauses between successive words, and '
 'thus speech segmentation is a necessary subtask of speech recognition (see '
 'below).')
('For a language like English, this is fairly trivial, since words are usually '
 'separated by spaces.')
('Sentence boundaries are often marked by periods or other punctuation marks, '
 'but these same characters can serve other purposes (e.g., marking '
 'abbreviations).')
('Machine translation (MT)\n'
 'Automatically translate text from one h

[Return to Top](#returnToTop)
<a id = 'abstractiveSummarization'></a>

## 3. Abstractive summarization with T5

Let's set up our environment to run the Hugging Face version of T5 and feed it a small snippet of text to see what kind of summary it produces.  Note that we could not feed the entire Wikipedia article we used above into T5.

In [9]:
import tensorflow as tf

In [10]:
from transformers import T5Tokenizer, TFT5Model, TFT5ForConditionalGeneration

Here's the text that we'll summarize.

In [11]:
WARTICLE_TO_SUMMARIZE = ("A neutron star is the collapsed core of a massive supergiant star, which had a total mass of \
            between 10 and 25 solar masses, possibly more if the star was especially metal-rich. Except for black holes, \
            and some hypothetical objects (e.g. white holes, quark stars, and strange stars), neutron stars are the smallest \
            and densest currently known class of stellar objects.")

In [12]:
t5_model = TFT5ForConditionalGeneration.from_pretrained('t5-base') #also t5-small and t5-large
t5_tokenizer = T5Tokenizer.from_pretrained('t5-base')

t5_model.summary()

Downloading (…)lve/main/config.json:   0%|          | 0.00/1.21k [00:00<?, ?B/s]

Downloading model.safetensors:   0%|          | 0.00/892M [00:00<?, ?B/s]

All PyTorch model weights were used when initializing TFT5ForConditionalGeneration.

All the weights of TFT5ForConditionalGeneration were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFT5ForConditionalGeneration for predictions without further training.


Downloading (…)ve/main/spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

Model: "tft5_for_conditional_generation"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 shared (Embedding)          multiple                  24674304  
                                                                 
 encoder (TFT5MainLayer)     multiple                  109628544 
                                                                 
 decoder (TFT5MainLayer)     multiple                  137949312 
                                                                 
Total params: 222,903,552
Trainable params: 222,903,552
Non-trainable params: 0
_________________________________________________________________


For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-base automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.


Don't forget to add the prompt to the begining of the article so T5 knows what we are asking it to do.

In [13]:
t5_input_text = "summarize: " + WARTICLE_TO_SUMMARIZE

In [14]:
t5_inputs = t5_tokenizer([t5_input_text], return_tensors='tf')

Here's the output.  The sentence is quite fluid.  How faithful to you think it is?

In [15]:
t5_summary_ids = t5_model.generate(t5_inputs['input_ids'],
                                    num_beams=3,
                                    no_repeat_ngram_size=3,
                                    min_length=15,
                                    max_length=35)

print([t5_tokenizer.decode(g, skip_special_tokens=True,
                           clean_up_tokenization_spaces=False) for g in t5_summary_ids])

['a neutron star is the collapsed core of a massive supergiant star . neutron stars are the smallest and densest currently known class']


[Return to Top](#returnToTop)
<a id = 'extractiveQA'></a>

## 4. Extractive question answering with T5

Now let's look at an extractive question answering example.  We'll need to feed the model a context paragraph and a question.  The T5 model was pre-trained on the SQUAD dataset so it knows how to identify and extract the answer span. Note that we already have the prompt in the respective texts.

In [16]:
t5_context_text = """context: Hyperbaric (high-pressure) medicine uses special oxygen
chambers to increase the partial pressure of O 2 around the patient and, when needed,
the medical staff. Carbon monoxide poisoning, gas gangrene, and decompression sickness
(the ’bends’) are sometimes treated using these devices. Increased O 2 concentration
in the lungs helps to displace carbon monoxide from the heme group of hemoglobin.
Oxygen gas is poisonous to the anaerobic bacteria that cause gas gangrene, so increasing
its partial pressure helps kill them. Decompression sickness occurs in divers who
decompress too quickly after a dive, resulting in bubbles of inert gas, mostly nitrogen
and helium, forming in their blood. Increasing the pressure of O 2 as soon as possible
is part of the treatment."""

In [17]:
t5_question_text = """question: What does increased oxygen concentrations in the patient’s
lungs displace? """

In [18]:
t5_qa_input_text = t5_question_text + t5_context_text

Now let's run T5 and see how well it answers our question.  What do you think?

In [19]:
t5_inputs = t5_tokenizer([t5_qa_input_text], return_tensors='tf')

t5_summary_ids = t5_model.generate(t5_inputs['input_ids'])

print([t5_tokenizer.decode(g, skip_special_tokens=True,
                           clean_up_tokenization_spaces=False) for g in t5_summary_ids])



['carbon monoxide']
