# Lesson notebook 7 - Summarization and Question Answering



### 1. Extractive summarization example

One of the challenges faced by current neural systems is the size of the input they can manage.  As a result, most of these systems end up truncating the input.  One solution is to use an older approach called extractive summarization.  In this approach the content of the input document(s) is broken into sentences, which are scored for their relevance to either the document or a query.  We'll demonstrate its use on a Wikipedia article.


### 2. Summarization example

We'll use T5 again to summarize some input text.  We do this because its text in -> text out interface and its multi-task fine-tuning make it a great vehicle for demonstration.


### 3. Question answering example

There are a variety of approaches to question answering.  Here we demonstrate one particular approach to the problem -- span detection -- where we feed a context paragraph and the question to the system and want the machine to identify the answer span within the context paragraph.



[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/datasci-w266/2022-summer-main/blob/master/materials/lesson_notebooks/lesson_7_summarization_QA.ipynb)



Let's run our extractive summarization example.  We'll leverage an older version of gensim that has a built-in extractive summarization module.  The module was a community contribution and is not available in gensim 4.0, so we pin version 3.8.3.

In [1]:
!pip install gensim==3.8.3

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting gensim==3.8.3
  Downloading gensim-3.8.3-cp37-cp37m-manylinux1_x86_64.whl (24.2 MB)
Installing collected packages: gensim
  Attempting uninstall: gensim
    Found existing installation: gensim 3.6.0
    Uninstalling gensim-3.6.0:
      Successfully uninstalled gensim-3.6.0
Successfully installed gensim-3.8.3


Now let's get a document to summarize.  We'll use Wikipedia since it contains a large number of longer documents.

In [2]:
!pip install wikipedia

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting wikipedia
  Downloading wikipedia-1.4.0.tar.gz (27 kB)
Building wheels for collected packages: wikipedia
  Building wheel for wikipedia (setup.py) ... done
  Created wheel for wikipedia: filename=wikipedia-1.4.0-py3-none-any.whl size=11695 sha256=79d3a2ecb3dcf3fa0433f8dc5de2c81736ed95402c53d98cfc5540aa75e899c2
  Stored in directory: /root/.cache/pip/wheels/15/93/6d/5b2c68b8a64c7a7a04947b4ed6d89fb557dcc6bc27d1d7f3ba
Successfully built wikipedia
Installing collected packages: wikipedia
Successfully installed wikipedia-1.4.0


Let's set up our environment and grab the Wikipedia page on Natural Language Processing.  You can modify the string to find the Wikipedia page of your choice.

In [3]:
from gensim.summarization.summarizer import summarize
from gensim.summarization import keywords
import wikipedia
from pprint import pprint

 
# Get wiki content.
wikisearch = wikipedia.page("Natural Language Processing")
wikicontent = wikisearch.content

The extractive summarization module lets us specify the size of the summary we want, either as a fraction of the input size or as a number of words.  Let's first grab a small percentage of the original document.  Under the hood, the system breaks the document into sentences and scores each one for its relevance to the document.  A common way to do this is to compare each sentence with a centroid of the document; gensim's summarizer instead uses a variant of TextRank, which ranks sentences in a similarity graph.  Either way, the summary is a set of sentences taken from the input.  They may be presented in score order or in the order in which they appeared in the original document.  Why do you think that might matter?

In [4]:
# Summary (5% of the original content).
summ_per = summarize(wikicontent, ratio = 0.05)
print("Percent summary")
pprint(summ_per, compact=True)

Percent summary
("The premise of symbolic NLP is well-summarized by John Searle's Chinese room "
 'experiment: Given a collection of rules (e.g., a Chinese phrasebook, with '
 'questions and matching answers), the computer emulates natural language '
 'understanding (or other NLP tasks) by applying those rules to the data it '
 'confronts.\n'
 'Focus areas of the time included research on rule-based parsing (e.g., the '
 'development of HPSG as a computational operationalization of generative '
 'grammar), morphology (e.g., two-level morphology), semantics (e.g., Lesk '
 'algorithm), reference (e.g., within Centering Theory) and other areas of '
 'natural language understanding (e.g., in the Rhetorical Structure Theory).\n'
 'Generally, this task is much more difficult than supervised learning, and '
 'typically produces less accurate results for a given amount of input data.\n'
 'However, creating more data to input to machine-learning systems simply '
 'requires a corresponding incre
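
To make the sentence-scoring idea concrete, here is a minimal sketch of centroid-based extractive summarization in plain Python.  It is not gensim's implementation: the helper names (`split_sentences`, `bag_of_words`, `cosine`, `centroid_summary`) and the toy document are assumptions made for illustration, the sentence splitter is deliberately naive, and the "centroid" is simply the word-count vector of the whole document.

```python
import math
import re
from collections import Counter


def split_sentences(text):
    # Naive splitter (assumption): break after ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def bag_of_words(text):
    # Lowercased word counts; no stop-word removal or stemming.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    # Cosine similarity between two word-count vectors.
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def centroid_summary(text, ratio=0.5, document_order=True):
    sentences = split_sentences(text)
    centroid = bag_of_words(text)  # word counts of the whole document
    scores = [cosine(bag_of_words(s), centroid) for s in sentences]
    keep = max(1, round(len(sentences) * ratio))
    # Indices of the highest-scoring sentences, best first.
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:keep]
    if document_order:
        top = sorted(top)  # restore the order in which the sentences appeared
    return [sentences[i] for i in top]


toy_doc = (
    "Neutron stars are dense stellar objects. "
    "My cat likes tuna. "
    "Neutron stars are the collapsed cores of massive stars. "
    "Some stellar objects are denser than neutron stars."
)
print(centroid_summary(toy_doc, ratio=0.5))
```

The `document_order` flag switches between the two presentation orders mentioned above.  Score order puts the most central sentence first, while document order tends to read more coherently because pronouns and connectives stay next to the sentences they refer to.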

Now let's summarize the content again but this time by specifying the size of the summary in words.

In [5]:
# Summary (200 words)
summ_words = summarize(wikicontent, word_count = 200)
print("Word count summary")
pprint(summ_words, compact=True)

Word count summary
("The premise of symbolic NLP is well-summarized by John Searle's Chinese room "
 'experiment: Given a collection of rules (e.g., a Chinese phrasebook, with '
 'questions and matching answers), the computer emulates natural language '
 'understanding (or other NLP tasks) by applying those rules to the data it '
 'confronts.\n'
 'Focus areas of the time included research on rule-based parsing (e.g., the '
 'development of HPSG as a computational operationalization of generative '
 'grammar), morphology (e.g., two-level morphology), semantics (e.g., Lesk '
 'algorithm), reference (e.g., within Centering Theory) and other areas of '
 'natural language understanding (e.g., in the Rhetorical Structure Theory).\n'
 'However, creating more data to input to machine-learning systems simply '
 'requires a corresponding increase in the number of man-hours worked, '
 'generally without significant increases in the complexity of the annotation '
 'process.Despite the popularity o
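
Limiting a summary by word count rather than by ratio only changes the stopping rule.  As a rough sketch, assuming we already have sentences ranked from most to least relevant, we can keep taking sentences until the next one would exceed the budget.  The function name `select_by_word_budget` is hypothetical, and gensim's own stopping rule may differ in its details.

```python
def select_by_word_budget(ranked_sentences, word_count):
    """Take sentences in ranked order until the next one would exceed the budget."""
    chosen, used = [], 0
    for sentence in ranked_sentences:
        n_words = len(sentence.split())
        # Always keep at least one sentence, even if it is over budget.
        if chosen and used + n_words > word_count:
            break
        chosen.append(sentence)
        used += n_words
    return chosen


ranked = ["a b c", "d e", "f g h i"]
print(select_by_word_budget(ranked, word_count=5))
```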

The gensim module can also extract keywords from the document rather than sentences.

In [6]:
from gensim.summarization import keywords
print(keywords(wikicontent, ratio=0.01))

language
languages
words
word
semantics
semantic
nlp
text
generation
generic
generative
generally
generated
general
generate


### Abstractive summarization with T5

Let's set up our environment to run the Hugging Face version of T5 and feed it a small snippet of text to see what kind of summary it produces.  Note that we could not feed the entire Wikipedia article we used above into T5.

In [7]:
!pip install -q sentencepiece


In [8]:
!pip install -q transformers


In [9]:
import tensorflow as tf

In [10]:
from transformers import T5Tokenizer, TFT5Model, TFT5ForConditionalGeneration

Here's the text that we'll summarize.

In [11]:
# Adjacent string literals are concatenated, so no stray indentation ends up in the text.
WARTICLE_TO_SUMMARIZE = (
    "A neutron star is the collapsed core of a massive supergiant star, which had a total mass of "
    "between 10 and 25 solar masses, possibly more if the star was especially metal-rich. "
    "Except for black holes, and some hypothetical objects (e.g. white holes, quark stars, "
    "and strange stars), neutron stars are the smallest and densest currently known class "
    "of stellar objects."
)

In [12]:
t5_model = TFT5ForConditionalGeneration.from_pretrained('t5-base') #also t5-small and t5-large
t5_tokenizer = T5Tokenizer.from_pretrained('t5-base')

t5_model.summary()

Downloading:   0%|          | 0.00/1.17k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/851M [00:00<?, ?B/s]

All model checkpoint layers were used when initializing TFT5ForConditionalGeneration.

All the layers of TFT5ForConditionalGeneration were initialized from the model checkpoint at t5-base.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFT5ForConditionalGeneration for predictions without further training.


Downloading:   0%|          | 0.00/773k [00:00<?, ?B/s]

Model: "tft5_for_conditional_generation"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 shared (TFSharedEmbeddings)  multiple                 24674304  
                                                                 
 encoder (TFT5MainLayer)     multiple                  84954240  
                                                                 
 decoder (TFT5MainLayer)     multiple                  113275008 
                                                                 
Total params: 222,903,552
Trainable params: 222,903,552
Non-trainable params: 0
_________________________________________________________________


For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-base automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.
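
The warning above mentions a 512-token limit, which is why the full Wikipedia article cannot be passed to T5 in one piece.  One common workaround, not used in this notebook, is to split a long document into chunks that fit the limit and summarize each chunk separately.  The sketch below is a rough illustration under stated assumptions: `chunk_by_token_budget` is a hypothetical helper, and it counts whitespace-separated words, which undercounts SentencePiece tokens, so in practice you would use a smaller budget or count tokens with the tokenizer itself.

```python
def chunk_by_token_budget(text, max_tokens=512, prefix="summarize: "):
    """Split text into prefixed chunks of at most max_tokens whitespace tokens."""
    words = text.split()
    budget = max_tokens - len(prefix.split())  # leave room for the task prefix
    if budget <= 0:
        raise ValueError("max_tokens is too small for the prefix")
    return [
        prefix + " ".join(words[start:start + budget])
        for start in range(0, len(words), budget)
    ]


# Synthetic 1200-word "document" used only to exercise the helper.
long_text = " ".join(f"w{i}" for i in range(1200))
chunks = chunk_by_token_budget(long_text, max_tokens=512)
print(len(chunks), [len(c.split()) for c in chunks])
```

Chunking at arbitrary word boundaries can cut sentences in half; splitting on sentence boundaries first is usually a better choice for real documents.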


Don't forget to add the prompt to the beginning of the article so T5 knows what we are asking it to do.

In [13]:
t5_input_text = "summarize: " + WARTICLE_TO_SUMMARIZE

In [14]:
t5_inputs = t5_tokenizer([t5_input_text], return_tensors='tf')

Here's the output.  The sentence is quite fluent.  How faithful do you think it is?

In [15]:
t5_summary_ids = t5_model.generate(t5_inputs['input_ids'],
                                    num_beams=3,
                                    no_repeat_ngram_size=1,
                                    min_length=15,
                                    max_length=35)
                             
print([t5_tokenizer.decode(g, skip_special_tokens=True, 
                           clean_up_tokenization_spaces=False) for g in t5_summary_ids])

['neutron stars are the smallest and densest currently known class of stellar objects, says alexander miller.']
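
One crude way to probe faithfulness is to list the words in the summary that never occur in the source.  The sketch below is only a heuristic under simple assumptions (lowercased word matching, no handling of paraphrase or inflection), and `unsupported_tokens` is a hypothetical helper name.  The `source` string is a shortened copy of the second sentence of the article.

```python
import re


def unsupported_tokens(source, summary):
    """Return summary tokens that never appear in the source, in order of first appearance."""
    source_vocab = set(re.findall(r"[a-z0-9]+", source.lower()))
    seen, missing = set(), []
    for token in re.findall(r"[a-z0-9]+", summary.lower()):
        if token not in source_vocab and token not in seen:
            seen.add(token)
            missing.append(token)
    return missing


source = ("Except for black holes, and some hypothetical objects, neutron stars are the "
          "smallest and densest currently known class of stellar objects.")
summary = ("neutron stars are the smallest and densest currently known class of "
           "stellar objects, says alexander miller.")
print(unsupported_tokens(source, summary))  # ['says', 'alexander', 'miller']
```

A non-empty result does not prove a hallucination, since a good abstractive summary may legitimately use new words, but here it points straight at the attribution the model invented.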


### Extractive question answering with T5

Now let's look at an extractive question answering example.  We'll need to feed the model a context paragraph and a question.  T5's multi-task training included the SQuAD dataset, so it knows how to identify and extract the answer span.  Note that the `question:` and `context:` prompts are already included in the respective texts.

In [17]:
t5_context_text = """context: Hyperbaric (high-pressure) medicine uses special oxygen
chambers to increase the partial pressure of O 2 around the patient and, when needed,
the medical staff. Carbon monoxide poisoning, gas gangrene, and decompression sickness
(the ’bends’) are sometimes treated using these devices. Increased O 2 concentration
in the lungs helps to displace carbon monoxide from the heme group of hemoglobin.
Oxygen gas is poisonous to the anaerobic bacteria that cause gas gangrene, so increasing
its partial pressure helps kill them. Decompression sickness occurs in divers who
decompress too quickly after a dive, resulting in bubbles of inert gas, mostly nitrogen
and helium, forming in their blood. Increasing the pressure of O 2 as soon as possible
is part of the treatment."""

In [16]:
t5_question_text = """question: What does increased oxygen concentrations in the patient’s
lungs displace? """

In [18]:
t5_qa_input_text = t5_question_text + t5_context_text
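
If you want to try your own questions, it can help to build the input with a small function instead of embedding the prefixes in each string.  The following is a minimal sketch: `build_qa_prompt` is a hypothetical helper, not part of the transformers library, and it assumes the question and context are passed without their prefixes.  It also collapses the newlines that triple-quoted strings introduce.

```python
def build_qa_prompt(question, context):
    """Format a question/context pair with T5-style prefixes on a single line."""
    def clean(text):
        return " ".join(text.split())  # collapse newlines and repeated spaces
    return f"question: {clean(question)} context: {clean(context)}"


print(build_qa_prompt("What does increased O 2\nconcentration displace? ",
                      "Increased O 2 concentration\nhelps to displace carbon monoxide."))
```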

Now let's run T5 and see how well it answers our question.  What do you think?

In [19]:
t5_inputs = t5_tokenizer([t5_qa_input_text], return_tensors='tf')

t5_summary_ids = t5_model.generate(t5_inputs['input_ids'])
                             
print([t5_tokenizer.decode(g, skip_special_tokens=True, 
                           clean_up_tokenization_spaces=False) for g in t5_summary_ids])

['carbon monoxide']
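
T5 produces its answer as generated text rather than as start and end positions, so a simple sanity check is to confirm that the answer really occurs in the context and to recover its character offsets.  This is a sketch under simple assumptions: `find_answer_span` is a hypothetical helper, matching is case-insensitive and exact, and only the first occurrence is reported.  The `context` string is one sentence copied from the paragraph above.

```python
def find_answer_span(context, answer):
    """Return (start, end) character offsets of answer in context, or None."""
    # Lowercasing keeps string length for plain ASCII text like this example.
    start = context.lower().find(answer.lower())
    if start == -1:
        return None
    return start, start + len(answer)


context = ("Increased O 2 concentration in the lungs helps to displace carbon monoxide "
           "from the heme group of hemoglobin.")
span = find_answer_span(context, "carbon monoxide")
print(span, context[span[0]:span[1]])
```

If the function returns `None`, the model has produced an answer that is not a literal span of the context, which is worth inspecting by hand.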
