# Summarization

*Updated: 01/27/2022*

This notebook allows you to summarize texts by using two different approaches.

The first approach is called 'Extractive Summarization' (ES). As its name suggests, ES employs algorithms that try to identify and extract the most relevant sentences from a text. ES is a fast and computationally efficient summarization approach.

The second approach is called 'Abstractive Summarization' (AS). This approach is based on (large) language models (LLMs), for which the so-called transformer architecture has become the de facto standard. Having 'seen' enough pairs of text and summary during training, and leveraging its general language capabilities, an LLM can generate a summary that contains original sentences verbatim alongside rewritten or newly generated text. In contrast to ES, AS is computationally expensive.

Scientific texts differ strongly from most other texts in structure, form, and style. As LLMs are usually trained on vast troves of diverse texts scraped from the internet, we can't expect them to deal particularly well with scientific literature (unless we fine-tune them for this task).

It is, therefore, strongly recommended to first use ES algorithms as they return original sentences from the text. AS might, instead, be a good starting point in case you have to write a summary, say, for a project proposal.     

## Working with Jupyter notebooks

If you are not familiar with Jupyter notebooks, here is how to use one: to execute a piece of code, click inside a cell (the ones with `[]` to the left) and press Shift+Enter. Wait until the cell is done, which is when the `*` in `[]` turns into a number, and then move on to the next cell.

If you get incomprehensible error messages or the notebook gets stuck, choose "Restart & Clear Output" in the "Kernel" dropdown menu above and start afresh.

___
**Please help us to improve this tool by [emailing us](mailto:ai4ki.dev@gmail.com?subject=ai4ki-tools:%20Summarization) your update ideas or error reports.**
___

## Preparation: Import libraries

In [None]:
import transformers
from summarization_utils import chunk_text, extractive_summarizers, fetch_article, preprocess_data
from transformers import pipeline
transformers.logging.set_verbosity_error()

## Data Preprocessing

### Step 1: Load your text
You can provide your text as a `.pdf` or `.txt` file.

Note that we use Adobe's Document Services API for extracting PDF content. This API is not a free service and thus requires credentials. You can register for a free trial [here](https://www.adobe.io/apis/documentcloud/dcsdk/). If you need support with authenticating your credentials, [email us](mailto:ai4ki.dev@gmail.com?subject=Summarization:%20Authentication%20Issues).

In future updates of this notebook, we will try to implement an open source alternative like [this one](https://github.com/allenai/s2orc-doc2json).

*Run the following cell, and enter the name of your file.*

In [None]:
#Enter the name of your file
infile = input('Enter filename: ')

#### 1.1: Fetch an article from the Web (optional)
*Run the following cell, enter the full URL of the article you want to summarize, and proceed directly to summarization.*

In [None]:
url = input('Enter full URL: ')
text_to_summarize = fetch_article(url)

### Step 2: Convert your text into a machine-readable format

In [None]:
print("Preprocessing data...")
fulltext, _, _, headers = preprocess_data(infile, base_path='./')
if headers:
    print(f"Found the following section headers in {infile}")
    for i, h in enumerate(headers):
        print(f"{i}: {h['header']}")

### Step 3: Choose text section for summarization

In [None]:
chap_slct = input("Enter section number or -1 for full text: ") 
if chap_slct == '-1':
    text_to_summarize = fulltext
else:
    chap_data = headers[int(chap_slct)]
    text_to_summarize = fulltext[chap_data['idx_start']:chap_data['idx_end']]
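The cell above assumes that you enter a valid section number. As a minimal sketch of a more forgiving variant, the hypothetical helper `select_section` below (not part of `summarization_utils`) checks the input before slicing. It assumes that each entry in `headers` is a dictionary with the keys `idx_start` and `idx_end`, as used in the cell above; the toy data is made up for illustration.

```python
def select_section(fulltext, headers, choice):
    """Return the full text for '-1', otherwise the text of the chosen section."""
    choice = choice.strip()
    if choice == "-1":
        return fulltext
    try:
        idx = int(choice)
    except ValueError:
        raise ValueError(f"'{choice}' is not a valid section number") from None
    if not 0 <= idx < len(headers):
        raise ValueError(f"Section number {idx} is out of range")
    section = headers[idx]
    return fulltext[section["idx_start"]:section["idx_end"]]


# Toy data mimicking the structure used above (hypothetical values)
demo_text = "Intro text. Methods text."
demo_headers = [
    {"header": "Intro", "idx_start": 0, "idx_end": 11},
    {"header": "Methods", "idx_start": 12, "idx_end": 25},
]
print(select_section(demo_text, demo_headers, "1"))
```

In the notebook, you could call such a helper with `fulltext`, `headers`, and the string returned by `input(...)`.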

## Approach I: Extractive Summarization

*Enter the desired length of your summary and the summarization algorithm. You can choose between [LRS](https://pypi.org/project/lexrank/), [LSA](https://www.datacamp.com/community/tutorials/discovering-hidden-topics-python), [LUHN](https://pypi.org/project/sumy/), and [KLS](https://pypi.org/project/sumy/).*

In [None]:
# Choose length of summary (i.e. number of sentences)
len_sum = input('Number of sentences: ')

# Choose extraction algorithm
ext_alg = input('Extractive algorithm: ')

# Create the summary (note: the language is set to German here)
summary = extractive_summarizers(text_to_summarize, method=ext_alg, max_len=int(len_sum), lang='german')

# Print the summary
print("==> SUMMARY:")
for sentence in summary:
    print(sentence)
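To give a rough idea of what an extractive algorithm does, here is a minimal sketch of a simple word-frequency heuristic. It is not the implementation behind `extractive_summarizers`, and it is much cruder than the algorithms listed above (for example, it neither removes stop words nor handles abbreviations when splitting sentences). The function name `frequency_summary` and the demo text are made up for illustration.

```python
import re
from collections import Counter


def frequency_summary(text, max_len=2):
    """Return the max_len sentences with the highest average word frequency."""
    # Naive sentence split: ., ! or ? followed by whitespace
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # Count lower-cased words across the whole text
    word_counts = Counter(re.findall(r"\w+", text.lower()))

    def score(sentence):
        words = re.findall(r"\w+", sentence.lower())
        # Average instead of sum, so long sentences are not favored automatically
        return sum(word_counts[w] for w in words) / len(words) if words else 0.0

    # Rank sentences by score, keep the best ones, then restore the original order
    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:max_len])
    return [sentences[i] for i in keep]


demo = "Cats sleep. Cats eat fish. Dogs bark."
for sentence in frequency_summary(demo, max_len=2):
    print(sentence)
```

Note that every returned sentence is copied verbatim from the input, which is the defining property of extractive summarization.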

## Approach II: Abstractive Summarization

Transformer-based language models set a limit on the combined length of their input and output. This length is measured in 'tokens'. A token does not always correspond to a word, but as a rule of thumb you can think of 100 tokens as corresponding to roughly 75 words. Typical limits are 512, 1024, or 2048 tokens (see [here](https://beta.openai.com/docs/introduction/key-concepts) for a short introduction).
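The rule of thumb can be turned into a quick back-of-the-envelope check. The sketch below only counts whitespace-separated words and applies the 100:75 ratio; it does not run a real tokenizer, so the actual token count of a given model may differ. The function names are hypothetical.

```python
def estimate_tokens(text):
    """Estimate the token count as 100 tokens per 75 words, rounded up."""
    n_words = len(text.split())
    # Integer ceiling division avoids floating-point rounding surprises
    return (n_words * 100 + 74) // 75


def fits_context(text, limit=512):
    """Check whether the estimated token count stays within the given limit."""
    return estimate_tokens(text) <= limit


sample = " ".join(["word"] * 75)
print(estimate_tokens(sample))  # 75 words correspond to an estimated 100 tokens
```

By this estimate, a limit of 512 tokens corresponds to about 384 words.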

In the case of Hugging Face's [Summarization Pipeline](https://huggingface.co/transformers/main_classes/pipelines.html#summarizationpipeline), which we use here, the limit is currently 512 tokens. To be able to summarize longer texts, we first split the input text into chunks smaller than 512 tokens; we then summarize each chunk separately; finally, we concatenate the individual chunk summaries to get the full summary.
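The splitting step could look roughly like the following sketch. It is not the `chunk_text` function from `summarization_utils`, whose internals are not shown here; instead, it assumes a simple strategy of grouping whole sentences until a word budget is reached. The default of 300 words is an assumption that, by the rule of thumb above, stays below 512 tokens with some margin.

```python
import re


def chunk_by_sentences(text, max_words=300):
    """Group whole sentences into chunks of at most max_words words each."""
    # Naive sentence split: ., ! or ? followed by whitespace
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current, current_len = [], [], 0
    for sentence in sentences:
        n_words = len(sentence.split())
        # Start a new chunk if adding this sentence would exceed the budget
        if current and current_len + n_words > max_words:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        # A single sentence longer than max_words still becomes its own chunk
        current.append(sentence)
        current_len += n_words
    if current:
        chunks.append(" ".join(current))
    return chunks


demo = "One two three. Four five. Six seven eight nine."
for chunk in chunk_by_sentences(demo, max_words=5):
    print(chunk)
```

Splitting at sentence boundaries rather than at a fixed character position avoids handing the model half a sentence at the start or end of a chunk.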

In [None]:
# Split text into chunks that fit into the transformer context window
chunks = chunk_text(text_to_summarize)
    
# Choose model (facebook/bart-large-cnn, t5-small, t5-base, t5-large, t5-3b, t5-11b)
model = 't5-base'

# Initialize the HuggingFace summarization pipeline
summarizer = pipeline('summarization', model=model, tokenizer=model, framework='tf')

# Create a summary for each chunk and join the summaries with spaces
chunk_summaries = []
for chunk in chunks:
    sum_tmp = summarizer(chunk, min_length=30, max_length=60)
    chunk_summaries.append(sum_tmp[0]['summary_text'].strip())
summary = ' '.join(chunk_summaries)

# Print summarized text
print('==> SUMMARY:')
print(summary)