# Summarization

*Updated: 02/18/2022*

This notebook allows you to summarize texts by using two different approaches.

The first approach is called 'Extractive Summarization' (ES). As its name suggests, ES employs algorithms that try to identify and extract the most relevant sentences from a text. ES is a computationally efficient and fast summarization approach.
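
To make the idea of extraction concrete, here is a minimal sketch of a frequency-based sentence scorer. It is not one of the algorithms used in this notebook: the function names, the tiny stop-word list, and the naive sentence splitter are all assumptions made for illustration.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real tools use much longer lists)
STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "are",
             "it", "that", "this", "for", "on", "with", "as", "by"}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def content_words(text):
    """Lower-cased words of a text without stop words."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]


def frequency_summary(text, n_sentences=2):
    """Return the n highest-scoring sentences, kept in their original order."""
    sentences = split_sentences(text)
    freq = Counter(content_words(text))

    def score(sentence):
        tokens = content_words(sentence)
        # Average word frequency, so long sentences are not favoured automatically
        return sum(freq[t] for t in tokens) / len(tokens) if tokens else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    return [sentences[i] for i in sorted(ranked[:n_sentences])]
```

Every sentence in the result is copied verbatim from the input, which is the defining property of an extractive summary.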

The second approach is called 'Abstractive Summarization' (AS). This approach is based on large language models (LLMs), for which the so-called transformer architecture has become the de facto standard. Having 'seen' enough pairs of texts and summaries during training, and leveraging its general language capabilities, an LLM can generate a summary that contains original sentences verbatim alongside paraphrased or newly generated text. In contrast to ES, AS is computationally expensive.

Scientific texts differ strongly from most other texts in structure, form, and style. As LLMs are usually trained on vast troves of diverse texts scraped off the internet, we can't expect them to deal particularly well with scientific literature (unless we fine-tune them for this task).

It is, therefore, strongly recommended to start with the ES algorithms, as they return original sentences from the text. AS, by contrast, might be a good starting point if you have to write a summary, say, for a project proposal.

## Working with Jupyter notebooks

If you are not familiar with Jupyter notebooks, this is how to go about it: To execute a piece of code, click inside a cell (the ones with `[]` to the left) and press Shift+Enter. Wait until the cell is done (that is, until the `*` in `[]` has turned into a number), then move on to the next cell.

If you get incomprehensible error messages or the notebook gets stuck, choose "Restart & Clear Output" in the "Kernel" dropdown menu above and start afresh.

___
**Please help us to improve this tool by [emailing us](mailto:ai4ki.dev@gmail.com?subject=ai4ki-tools:%20Summarization) your update ideas or error reports.**
___

## Preparation: Import libraries

In [None]:
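import transformers
from transformers import GPT2Tokenizer, pipeline

# Helper functions used in the cells below
from summarization_utils import (chunk_text, extractive_summarizers, fetch_article,
                                 gpt2_summarizer, preprocess_data)

# Show only error messages from the transformers library
transformers.logging.set_verbosity_error()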

## Data Preprocessing

### Step 1: Load your text
You can provide your text as a `.pdf` or `.txt` file.

Note that we use Adobe's Document Services API for extracting PDF content. This API is not a free service and thus requires credentials. You can register for a free trial [here](https://www.adobe.io/apis/documentcloud/dcsdk/). If you need support with authenticating your credentials, [email us](mailto:ai4ki.dev@gmail.com?subject=Summarization:%20Authentication%20Issues).

In future updates of this notebook, we will try to implement an open source alternative like [this one](https://github.com/allenai/s2orc-doc2json).

*Run the following cell and enter the name of your file.*

In [None]:
#Enter the name of your file
infile = input('Enter filename: ')

#### Alternatively, fetch an article from the Web 
*Run the following cell, enter the full URL of the article you want to summarize, and **proceed directly to summarization**.*

In [None]:
url = input('Enter full URL: ')
text_to_summarize = [fetch_article(url).strip().replace('\n',' ')]

### Step 2: Convert your text into a machine readable format

In [None]:
print("Preprocessing data...")
fulltext, _, _, headers = preprocess_data(infile, base_path='./')
if headers:
    print(f"==> Found the following chapters in {infile}")
    for i, h in enumerate(headers):
        print(f"{i}: {h['header']}")

### Step 3: Choose chapter(s) for summarization
You can either summarize a single chapter or a selection of chapters. Use the chapter numbering from the previous step (the numbers before the first colon) and set the variable `selected_chapters` in the cell below accordingly.

In [None]:
# Enter your chapter number(s) here: 
selected_chapters = [1,4]

text_to_summarize = []
for chap in selected_chapters:
    chap_data = headers[chap]
    text_chap = fulltext[chap_data['idx_start']:chap_data['idx_end']].strip()
    text_to_summarize.append(text_chap)
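
If you enter a chapter number that does not exist, the cell above stops with a bare `IndexError`. The following sketch wraps the same slicing logic in a function that reports the valid range instead. The function name `select_chapter_texts` is hypothetical and not part of the notebook's helper module; the sketch only assumes that each entry of `headers` has the keys `idx_start` and `idx_end`, as used above.

```python
def select_chapter_texts(fulltext, headers, selected_chapters):
    """Return the text of each selected chapter, rejecting unknown chapter numbers."""
    texts = []
    for chap in selected_chapters:
        # Reject numbers outside the list printed in Step 2 (this also rules out
        # negative numbers, which Python would otherwise count from the end)
        if not 0 <= chap < len(headers):
            raise IndexError(
                f"Chapter {chap} does not exist; valid numbers are 0 to {len(headers) - 1}"
            )
        chap_data = headers[chap]
        texts.append(fulltext[chap_data["idx_start"]:chap_data["idx_end"]].strip())
    return texts
```

With this function, the selection cell would reduce to `text_to_summarize = select_chapter_texts(fulltext, headers, selected_chapters)`.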

## Approach I: Extractive Summarization

In this approach, you can choose between the following summarization algorithms: [LRS](https://pypi.org/project/lexrank/), [LSA](https://www.datacamp.com/community/tutorials/discovering-hidden-topics-python), [LUHN](https://pypi.org/project/sumy/), and [KLS](https://pypi.org/project/sumy/). Follow the links if you want to learn more about how these algorithms work. We found that the LSA algorithm produces good results: it often extracts sentences that capture the core ideas of a text.

*Run the following cell and enter the desired length of your summary as well as the summarization algorithm.*

In [None]:
# Choose length of summary (i.e. number of sentences)
len_sum = input('Number of sentences: ')

# Choose extraction algorithm
ext_alg = input('Extractive algorithm: ')

# Create the summary
ext_summary = ''
for chap in text_to_summarize:
    chap_summ = extractive_summarizers(chap, algorithm=ext_alg, max_len=int(len_sum), lang='english')
    ext_summary += chap_summ + '\n\n'

# Print the summary
print(f"==> {ext_alg} SUMMARY: {ext_summary}")

## Approach II: Abstractive Summarization

You have two options here: first, summarization with a pre-trained T5 transformer model; second, summarization with GPT-2. T5 creates summaries that often contain original or only slightly paraphrased sentences. GPT-2, in contrast, mostly generates new text and is, therefore, more of an abstractive summarizer. We use the [Hugging Face Transformers library](https://huggingface.co/docs/transformers/index) for both options.

Large language models limit the combined length of their input and output. This length is measured in 'tokens'. A token does not always correspond to a word, but, as a rule of thumb, you can think of 100 tokens as corresponding to about 75 words. Typical limits are 512, 1024, or 2048 tokens (see [here](https://beta.openai.com/docs/introduction/key-concepts) for a short introduction to tokenization).
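
The rule of thumb can be turned into a quick back-of-the-envelope check. The sketch below is only an estimate based on whitespace-separated words, not a real tokenizer, and the function names are made up for illustration.

```python
import math


def estimate_tokens(text):
    """Rough token estimate: 100 tokens per 75 whitespace-separated words."""
    n_words = len(text.split())
    return math.ceil(n_words * 100 / 75)


def fits_in_context(text, max_new_tokens, context_limit):
    """Check whether the estimated input plus the requested output stays within the limit."""
    return estimate_tokens(text) + max_new_tokens <= context_limit


# A 300-word text is estimated at 400 tokens
print(estimate_tokens("word " * 300))
# 400 input tokens plus a 100-token summary stay below a 512-token limit
print(fits_in_context("word " * 300, max_new_tokens=100, context_limit=512))
```

For exact counts you would have to run the model's own tokenizer, since the number of tokens per word depends on the vocabulary and on the text.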

To summarize longer texts, we first split the input text into equally sized chunks; we then summarize the first chunk, prepend this summary to the second chunk, summarize the result, and repeat this procedure until the last chunk. We set the maximum chunk size to 400 tokens (~300 words). Together with a carried-over summary of at most 100 tokens, this keeps us safely below the models' limits (512 tokens for T5 and 1024 tokens for GPT-2).
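
The chunk-and-carry-forward procedure can be sketched in a few lines. This is not the notebook's `chunk_text` helper, whose implementation is not shown here: the sketch counts whitespace-separated words instead of model tokens, lets the last chunk be shorter than the others, and uses a placeholder in place of a real model. All names in it are hypothetical.

```python
def chunk_words(text, max_tokens=400):
    """Split text into consecutive chunks of at most max_tokens whitespace-separated words."""
    words = text.split()
    return [" ".join(words[i:i + max_tokens]) for i in range(0, len(words), max_tokens)]


def rolling_summary(text, summarize, max_tokens=400):
    """Summarize chunk by chunk, prepending the previous summary to the next chunk."""
    summary = ""
    for chunk in chunk_words(text, max_tokens):
        prompt = (summary + " " + chunk).strip()
        summary = summarize(prompt)
    return summary


def keep_last_words(prompt, n=5):
    """Placeholder 'summarizer' that simply keeps the last n words of its input."""
    return " ".join(prompt.split()[-n:])


long_text = " ".join(f"word{i}" for i in range(1000))
print(rolling_summary(long_text, keep_last_words))
```

Because `summarize` is just a callable that maps a string to a string, a wrapper around either model could be swapped in for the placeholder.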

*Run the following cell **once** at the beginning of a session to instantiate the models.*

In [None]:
# Instantiate the HuggingFace summarization pipeline
summarizer = pipeline('summarization', model='t5-base', tokenizer='t5-base', framework='tf')

# Set the GPT-2 tokenizer and instantiate GPT-2 model
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
generator = pipeline('text-generation', model='gpt2')

### Option 1: Summarization with T5

*Run the following cell to get a T5 summary of your selected text. Be patient, as completion might take a few minutes.*

In [None]:
t5_summary = ''
for chap in text_to_summarize:
    # Split text into chunks that fit into the transformer context window
    chunks = chunk_text(chap, max_tokens=400)

    # Generate chapter summaries
    chap_summ = ''
    for chunk in chunks:
        prompt = chap_summ + ' ' + chunk
        sum_tmp = summarizer(prompt.strip().replace('\n',' '), min_length=30, max_length=100, do_sample=False)
        chap_summ = sum_tmp[0]['summary_text']

    # Concatenate chapter summaries
    t5_summary += chap_summ + '\n\n'

t5_summary = t5_summary.replace(" .", ".")
print(f'==> T5 SUMMARY: {t5_summary}')

### Option 2: Summarization with GPT-2

The output of GPT-2 can be tuned with different parameters, the most important of which is `temperature`. Roughly speaking, temperature controls the randomness of the output. In our case, values close to zero are more likely to produce extractive summaries, while larger values allow for more abstractive summaries. You can experiment with different values by changing the parameter in the cell below.
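
To see why temperature affects randomness, consider the following sketch. It assumes the common formulation in which the model's raw scores (logits) for the next token are divided by the temperature before a softmax turns them into probabilities. The three logits are made up, and the function is an illustration, not the code path used by the notebook's helper.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits into probabilities after dividing them by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    # Subtract the maximum before exponentiating for numerical stability
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
for t in (0.2, 0.7, 2.0):
    probs = softmax_with_temperature(logits, temperature=t)
    print(t, [round(p, 3) for p in probs])
```

At low temperatures almost all probability mass sits on the highest-scoring token, so sampling becomes nearly deterministic; at higher temperatures the distribution flattens and less likely tokens are picked more often.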

Note that GPT-2 can generate text that is essentially nonsense (at least with respect to the summarization task at hand). Therefore, you might need a couple of attempts before you get something useful. To improve GPT-2's summarization performance, we would have to fine-tune the model on suitable data.

Also note that you can try summarization with GPT-3 in this [notebook](https://github.com/ai4ki/transformer-playground.git).  

*Run the following cell to have GPT-2 summarize your text. Be patient, as completion might take a few minutes.*

In [None]:
gpt2_summary = ''
for chap in text_to_summarize:
    # Split text into chunks that fit into the transformer context window
    chunks = chunk_text(chap, max_tokens=400)

    # Generate chapter summaries
    chap_summ = ''
    for chunk in chunks:
        prompt = chap_summ + ' ' + chunk
        chap_summ = gpt2_summarizer(tokenizer, generator, prompt.strip().replace('\n',' '), max_sum_length=100, temperature=0.7)
        
    # Concatenate chapter summaries
    gpt2_summary += chap_summ + '\n\n'
        
print(f'==> GPT-2 SUMMARY: {gpt2_summary}')