<a href="https://colab.research.google.com/github/ammarameenn/Document-Summarisation/blob/main/TextSummarization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Text Summarisation

## Notebook Content

* Loading Dataset
* Preprocessing Dataset
* Extractive Summarization Approach
  * TF-IDF summarizer
* Abstractive Summarization Approach
  * t5-base Transformer

In [1]:
import nltk
nltk.download('inaugural')
nltk.download('punkt')
nltk.download('stopwords')
from nltk.corpus import inaugural
from nltk import tokenize
import numpy as np  
import pandas as pd 
import re           
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from string import punctuation
from nltk.tokenize.treebank import TreebankWordDetokenizer
import torch

[nltk_data] Downloading package inaugural to /root/nltk_data...
[nltk_data]   Unzipping corpora/inaugural.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


### Loading Dataset
NLTK ships with a large collection of natural-language corpora. Here I load the Inaugural Address corpus, which contains the inaugural speeches of American presidents from 1789 to 2021.

In this notebook I summarize Joe Biden's 2021 inaugural address.

In [2]:
print(inaugural.fileids())
print(len(inaugural.fileids()))

['1789-Washington.txt', '1793-Washington.txt', '1797-Adams.txt', '1801-Jefferson.txt', '1805-Jefferson.txt', '1809-Madison.txt', '1813-Madison.txt', '1817-Monroe.txt', '1821-Monroe.txt', '1825-Adams.txt', '1829-Jackson.txt', '1833-Jackson.txt', '1837-VanBuren.txt', '1841-Harrison.txt', '1845-Polk.txt', '1849-Taylor.txt', '1853-Pierce.txt', '1857-Buchanan.txt', '1861-Lincoln.txt', '1865-Lincoln.txt', '1869-Grant.txt', '1873-Grant.txt', '1877-Hayes.txt', '1881-Garfield.txt', '1885-Cleveland.txt', '1889-Harrison.txt', '1893-Cleveland.txt', '1897-McKinley.txt', '1901-McKinley.txt', '1905-Roosevelt.txt', '1909-Taft.txt', '1913-Wilson.txt', '1917-Wilson.txt', '1921-Harding.txt', '1925-Coolidge.txt', '1929-Hoover.txt', '1933-Roosevelt.txt', '1937-Roosevelt.txt', '1941-Roosevelt.txt', '1945-Roosevelt.txt', '1949-Truman.txt', '1953-Eisenhower.txt', '1957-Eisenhower.txt', '1961-Kennedy.txt', '1965-Johnson.txt', '1969-Nixon.txt', '1973-Nixon.txt', '1977-Carter.txt', '1981-Reagan.txt', '1985-Reaga

In [3]:
all_speech = inaugural.raw()

In [4]:
biden_speech = inaugural.raw('2021-Biden.txt')
biden_speech

'Chief Justice Roberts, Vice President Harris, Speaker Pelosi, Leader Schumer, Leader McConnell, Vice President Pence, and my distinguished guests, and my fellow Americans: This is America\'s day. This is democracy\'s day, a day of history and hope, of renewal and resolve. Through a crucible for the ages America has been tested anew, and America has risen to the challenge.\n\nToday we celebrate the triumph not of a candidate, but of a cause, the cause of democracy. The peopleâ\x80\x94the will of the people has been heard, and the will of the people has been heeded. We\'ve learned again that democracy is precious, democracy is fragile. And at this hour, my friends, democracy has prevailed.\n\nSo now, on this hallowed ground where just a few days ago violence sought to shake the Capitol\'s very foundation, we come together as one Nation under God, indivisible, to carry out the peaceful transfer of power as we have for more than two centuries. As we look ahead in our uniquely American way

### Text Preprocessing
This section covers the basic preprocessing: sentence-level and word-level tokenisation, removal of HTML tags, and expansion of common English contractions.
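As an illustration of the contraction step, here is a minimal sketch that expands contractions with a regular expression, so that a contraction followed by punctuation (for example `don't,`) is still matched. It assumes a small stand-in dictionary; `SAMPLE_CONTRACTIONS` and `expand_contractions` are hypothetical names used only for this example, not the notebook's own code.

```python
import re

# Stand-in for a much larger contraction dictionary (assumption for this sketch).
SAMPLE_CONTRACTIONS = {
    "don't": "do not",
    "we've": "we have",
    "can't": "cannot",
    "it's": "it is",
}

# Match words that contain at least one apostrophe, leaving punctuation outside the match.
WORD_RE = re.compile(r"[a-z]+(?:'[a-z]+)+")


def expand_contractions(text, mapping=SAMPLE_CONTRACTIONS):
    """Lowercase `text` and replace every known contraction with its expansion."""
    lowered = text.lower()
    # Unknown apostrophe words (e.g. possessives) are returned unchanged.
    return WORD_RE.sub(lambda m: mapping.get(m.group(0), m.group(0)), lowered)


print(expand_contractions("We've learned again, don't forget, that democracy's fragile."))
```

A plain split on spaces, by contrast, would leave `don't,` untouched, because the trailing comma becomes part of the token and the dictionary lookup fails.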

In [5]:
biden_sents = tokenize.sent_tokenize(biden_speech)

In [6]:
contraction = {"ain't": "is not", "aren't": "are not","can't": "cannot", "'cause": "because", "could've": "could have", "couldn't": "could not",

                "didn't": "did not", "doesn't": "does not", "don't": "do not", "hadn't": "had not", "hasn't": "has not", "haven't": "have not",

                "he'd": "he would","he'll": "he will", "he's": "he is", "how'd": "how did", "how'd'y": "how do you", "how'll": "how will", "how's": "how is",

                "I'd": "I would", "I'd've": "I would have", "I'll": "I will", "I'll've": "I will have","I'm": "I am", "I've": "I have", "i'd": "i would",

                 "i'd've": "i would have", "i'll": "i will",  "i'll've": "i will have","i'm": "i am", "i've": "i have", "isn't": "is not", "it'd": "it would",

                 "it'd've": "it would have", "it'll": "it will", "it'll've": "it will have","it's": "it is", "let's": "let us", "ma'am": "madam",

                  "mayn't": "may not", "might've": "might have","mightn't": "might not","mightn't've": "might not have", "must've": "must have",

                  "mustn't": "must not", "mustn't've": "must not have", "needn't": "need not", "needn't've": "need not have","o'clock": "of the clock",

                  "oughtn't": "ought not", "oughtn't've": "ought not have", "shan't": "shall not", "sha'n't": "shall not", "shan't've": "shall not have",

                  "she'd": "she would", "she'd've": "she would have", "she'll": "she will", "she'll've": "she will have", "she's": "she is",

                  "should've": "should have", "shouldn't": "should not", "shouldn't've": "should not have", "so've": "so have","so's": "so as",

                   "this's": "this is","that'd": "that would", "that'd've": "that would have", "that's": "that is", "there'd": "there would",

                   "there'd've": "there would have", "there's": "there is", "here's": "here is","they'd": "they would", "they'd've": "they would have",

                   "they'll": "they will", "they'll've": "they will have", "they're": "they are", "they've": "they have", "to've": "to have",

                   "wasn't": "was not", "we'd": "we would", "we'd've": "we would have", "we'll": "we will", "we'll've": "we will have", "we're": "we are",

                   "we've": "we have", "weren't": "were not", "what'll": "what will", "what'll've": "what will have", "what're": "what are",

                   "what's": "what is", "what've": "what have", "when's": "when is", "when've": "when have", "where'd": "where did", "where's": "where is",

                   "where've": "where have", "who'll": "who will", "who'll've": "who will have", "who's": "who is", "who've": "who have",

                   "why's": "why is", "why've": "why have", "will've": "will have", "won't": "will not", "won't've": "will not have",

                   "would've": "would have", "wouldn't": "would not", "wouldn't've": "would not have", "y'all": "you all",

                    "y'all'd": "you all would","y'all'd've": "you all would have","y'all're": "you all are","y'all've": "you all have",

                    "you'd": "you would", "you'd've": "you would have", "you'll": "you will", "you'll've": "you will have",

                    "you're": "you are", "you've": "you have"}

In [7]:
def textprocessing(text):
    text = text.lower()  # Convert everything to lowercase
    text = BeautifulSoup(text, "lxml").text  # Strip HTML tags, keeping only the text
    text = re.sub(r'\([^)]*\)', '', text)  # Remove any text inside parentheses
    text = re.sub('"', '', text)  # Remove double quotes
    # Expand contractions token by token (tokens are split on single spaces)
    text = ' '.join([contraction[t] if t in contraction else t for t in text.split(" ")])
    return text


# Clean every sentence of the speech
cleaned_text = [textprocessing(sent) for sent in biden_sents]

cleaned_text[1]

"this is democracy's day, a day of history and hope, of renewal and resolve."

In [8]:
stopword = list(stopwords.words('english'))
add_to_stop = list(punctuation)
stopword.extend(add_to_stop)

In [9]:
# Tokenise each cleaned sentence and drop stop words and punctuation tokens
token_list = []
for sent in cleaned_text:
    tokens = tokenize.word_tokenize(sent)
    token_list.append([tok for tok in tokens if tok not in stopword])

In [10]:
# Rebuild each token list into a plain string
detokenizer = TreebankWordDetokenizer()
final_text = [detokenizer.detokenize(tokens) for tokens in token_list]

final_text

["chief justice roberts vice president harris speaker pelosi leader schumer leader mcconnell vice president pence distinguished guests fellow americans america's day",
 "democracy's day day history hope renewal resolve",
 'crucible ages america tested anew america risen challenge',
 'today celebrate triumph candidate cause cause democracy',
 'peopleâ\x80\x94the people heard people heeded',
 'learned democracy precious democracy fragile',
 'hour friends democracy prevailed',
 "hallowed ground days ago violence sought shake capitol's foundation come together one nation god indivisible carry peaceful transfer power two centuries",
 'look ahead uniquely american wayâ\x80\x94restless bold optimisticâ\x80\x94and set sights nation know must thank predecessors parties presence today',
 'thank bottom heart',
 'know resilience constitution strength strength nation president carter spoke last night us today salute lifetime service',
 'taken sacred oath patriots taken oath first sworn george washi

### Extractive Summarisation Approach

* This method does not create new words or phrases; it only selects words and phrases that already exist in the text. You can think of it as taking a page of text and marking the most important sentences with a highlighter.

* Extractive summarization methods work just like that: they take the text, rank all the sentences by their relevance to it, and present the most important ones.

* There are many techniques for scoring sentences. Here I use TF-IDF, short for term frequency–inverse document frequency, a numerical statistic intended to reflect how important a word is to a document in a collection or corpus. In this notebook each sentence of the speech is treated as a document.
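To make the scoring idea concrete, here is a small self-contained sketch on made-up data. It scores each sentence by its mean TF-IDF over the whole vocabulary (the notebook instead restricts itself to the most frequent terms), and it uses the smoothed form log(N / (1 + df)) for the inverse document frequency. The function name `tfidf_sentence_scores` and the toy sentences are assumptions for illustration only.

```python
import math
from collections import Counter

# Toy input: each sentence is already tokenised and stop-word filtered (made-up data).
sentences = [
    ["democracy", "day", "history", "hope"],
    ["democracy", "precious", "democracy", "fragile"],
    ["hour", "friends", "democracy", "prevailed"],
]


def tfidf_sentence_scores(tokenised_sentences):
    """Return one score per sentence: the mean TF-IDF over the whole vocabulary."""
    n_sentences = len(tokenised_sentences)
    vocabulary = sorted({tok for sent in tokenised_sentences for tok in sent})

    # Document frequency: number of sentences that contain each term.
    doc_freq = Counter(tok for sent in tokenised_sentences for tok in set(sent))
    # Smoothed inverse document frequency.
    idf = {tok: math.log(n_sentences / (1 + doc_freq[tok])) for tok in vocabulary}

    scores = []
    for sent in tokenised_sentences:
        if not sent:
            # An empty sentence carries no terms, so it scores zero.
            scores.append(0.0)
            continue
        counts = Counter(sent)
        # Term frequency is the term count divided by the sentence length.
        total = sum((counts[tok] / len(sent)) * idf[tok] for tok in vocabulary)
        scores.append(total / len(vocabulary))
    return scores


scores = tfidf_sentence_scores(sentences)
for sent, score in zip(sentences, scores):
    print(f"{score:.4f}  {' '.join(sent)}")
```

In this toy example "democracy" appears in every sentence, so its smoothed IDF is negative and the sentence that repeats it scores lowest. This is a side effect of the `1 + df` smoothing on very small inputs.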

In [11]:
wordfreq = {}
for sentence in final_text:
    tokens = nltk.word_tokenize(sentence)
    for token in tokens:
        if token not in wordfreq.keys():
            wordfreq[token] = 1
        else:
            wordfreq[token] += 1

import heapq
most_freq = heapq.nlargest(500, wordfreq, key=wordfreq.get)

In [12]:
word_idf_values = {}
for token in most_freq:
    doc_containing_word = 0
    for document in final_text:
        if token in nltk.word_tokenize(document):
            doc_containing_word += 1
    word_idf_values[token] = np.log(len(final_text)/(1 + doc_containing_word))

In [13]:
word_tf_values = {}
for token in most_freq:
    sent_tf_vector = []
    for document in final_text:
        doc_tokens = nltk.word_tokenize(document)
        # Number of times the token occurs in this sentence
        doc_freq = doc_tokens.count(token)
        # Guard against sentences left empty after stop-word removal
        word_tf = doc_freq / len(doc_tokens) if doc_tokens else 0.0
        sent_tf_vector.append(word_tf)
    word_tf_values[token] = sent_tf_vector

In [14]:
tfidf_values = []
for token in word_tf_values.keys():
    tfidf_sentences = []
    for tf_sentence in word_tf_values[token]:
        tf_idf_score = tf_sentence * word_idf_values[token]
        tfidf_sentences.append(tf_idf_score)
    tfidf_values.append(tfidf_sentences)

In [15]:
tf_idf_model = np.asarray(tfidf_values)
tf_idf_model = np.transpose(tf_idf_model)

In [16]:
model = pd.DataFrame(tf_idf_model)
model

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,490,491,492,493,494,495,496,497,498,499
0,0.00000,0.098525,0.000000,0.0,0.119413,0.000000,0.123368,0.000000,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.00000,0.000000,0.000000,0.0,0.328385,0.339262,0.000000,0.000000,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.00000,0.541887,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.00000,0.000000,0.000000,0.0,0.000000,0.387728,0.000000,0.387728,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.00000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,1.123761,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
161,0.12477,0.069921,0.000000,0.0,0.000000,0.087551,0.000000,0.000000,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
162,0.00000,0.000000,0.424506,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
163,0.00000,0.000000,0.169803,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
164,0.00000,0.270944,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


After computing the TF-IDF scores, I average each sentence's scores across all terms and keep only the 50 highest-scoring sentences from Joe Biden's speech.
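Written out, the score of a sentence $s$ can be summarised as follows. The symbols are introduced here for illustration and are not from the notebook: $N$ is the number of sentences, $V$ the number of retained terms (500 in the cells above), $n_{t,s}$ the count of term $t$ in sentence $s$, $|s|$ the number of tokens in $s$, and $\mathrm{df}(t)$ the number of sentences containing $t$.

```latex
\mathrm{tf}(t, s) = \frac{n_{t,s}}{|s|}, \qquad
\mathrm{idf}(t) = \ln\frac{N}{1 + \mathrm{df}(t)}, \qquad
\mathrm{score}(s) = \frac{1}{V} \sum_{t=1}^{V} \mathrm{tf}(t, s)\,\mathrm{idf}(t)
```

Because every sentence is divided by the same $V$, ranking by the mean is equivalent to ranking by the sum. The division by $|s|$ in the term frequency tends to favour short sentences.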

In [31]:
mean_model = pd.DataFrame(model.mean(axis=1))
mean_model

Unnamed: 0,0
0,0.007663
1,0.006692
2,0.007264
3,0.007467
4,0.007550
...,...
161,0.003625
162,0.001904
163,0.004202
164,0.005159


In [18]:
def extract_summary(dataframe, n_sentences=50):
    # Rank sentences by mean TF-IDF score and keep the top n_sentences
    top = dataframe.sort_values(dataframe.columns[0], ascending=False).head(n_sentences)
    # Always start with the opening sentence of the speech
    summary = cleaned_text[0]
    # Append the selected sentences in score order (no separator is added)
    for i in top.index:
        summary = summary + cleaned_text[i]
    return summary
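The function above concatenates the selected sentences directly and in score order, so adjacent sentences run together and the summary does not follow the order of the speech. A minimal sketch of one alternative, using plain Python lists instead of the notebook's DataFrame; `top_sentences_in_order` and the sample data are hypothetical and only illustrate the idea.

```python
def top_sentences_in_order(sentences, scores, k):
    """Pick the k highest-scoring sentences and join them in their original order."""
    # Rank sentence indices by score, highest first (ties keep document order).
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    # Restore document order so the summary reads like the source text.
    chosen = sorted(ranked[:k])
    return " ".join(sentences[i] for i in chosen)


# Made-up sentences and scores, only to show the behaviour.
sample_sentences = [
    "this is america's day.",
    "the battle is perennial.",
    "can i pay my mortgage?",
    "thank you.",
]
sample_scores = [0.2, 0.9, 0.7, 0.1]

print(top_sentences_in_order(sample_sentences, sample_scores, k=2))
```

Joining with a space keeps sentence boundaries visible, which may also help a downstream abstractive model.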

In [25]:
extract = extract_summary(mean_model)
extract

"chief justice roberts, vice president harris, speaker pelosi, leader schumer, leader mcconnell, vice president pence, and my distinguished guests, and my fellow americans: this is america's day.the battle is perennial.can i pay my mortgage?if we do that, i guarantee you, we will not fail.no progress, only exhausting outrage.a cry for survival comes from the planet itself, a cry that cannot be any more desperate or any more clear.and we must reject the culture in which facts themselves are manipulated and even manufactured.we can teach our children in safe schools.and victory is never assured.do not tell me things cannot change.we can join forces, stop the shouting, and lower the temperature.politics does not have to be a raging fire destroying everything in its path.we can see each other not as adversaries, but as neighbors.here we stand across the potomac from arlington cemetery, where heroes who gave the last full measure of devotion rest in eternal peace.and now, a rise of politica

### Abstractive summarization
* Up until now we have an extractive summary.
* Now I use t5-base, a pretrained transformer model, to generate a short abstract of the extractive summary. The model is used as-is for inference; no further training is done.

In [20]:
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.21.1-py3-none-any.whl (4.7 MB)
Collecting tokenizers!=0.11.3,<0.13,>=0.11.1
  Downloading tokenizers-0.12.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.6 MB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.8.1-py3-none-any.whl (101 kB)
Installing collected packages: pyyaml, tokenizers, huggingface-hub, transformers
  Attempting uninstall: pyyaml
    Found existing installation: PyYAML 3.13
    Uninstal

In [21]:
from transformers import pipeline
import os

In [22]:
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

In [23]:
summarizer = pipeline("summarization")

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.



In [24]:
summarizer = pipeline("summarization", model="t5-base", tokenizer="t5-base", framework="tf")


All model checkpoint layers were used when initializing TFT5ForConditionalGeneration.

All the layers of TFT5ForConditionalGeneration were initialized from the model checkpoint at t5-base.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFT5ForConditionalGeneration for predictions without further training.



For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-base automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.


In [28]:
summary_text = summarizer(extract, max_length=100, min_length=5, do_sample=False)[0]['summary_text']

In [29]:
print(summary_text)

john avlon: this is america's day. the battle is perennial. no progress, only outrage . he asks americans to unite to fight the foes we face: anger, resentment, hatred, extremism, violence . we can reward work and rebuild the middle class, make health care secure for all, he says .
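The tokenizer warning above mentions a 512-token limit for t5-base, and a 50-sentence extract may well exceed it, in which case part of the input can be truncated before the model sees it. One common workaround is to summarise the text in chunks. The sketch below is a rough illustration under two assumptions: a word count is used as a crude stand-in for the token count, and `summarize_fn` is a hypothetical callable wrapping whatever summariser is in use. The names `chunk_words`, `summarize_in_chunks` and `first_five_words` are likewise made up for this example.

```python
def chunk_words(text, max_words=300):
    """Split `text` into consecutive chunks of at most `max_words` words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def summarize_in_chunks(text, summarize_fn, max_words=300):
    """Summarise each chunk with `summarize_fn` and join the partial summaries."""
    return " ".join(summarize_fn(chunk) for chunk in chunk_words(text, max_words))


# Placeholder summariser for demonstration only: keeps the first five words of a chunk.
def first_five_words(chunk):
    return " ".join(chunk.split()[:5])


demo_text = " ".join(f"word{i}" for i in range(12))
print(chunk_words(demo_text, max_words=5))
print(summarize_in_chunks(demo_text, first_five_words, max_words=5))
```

With the pipeline from this notebook, `summarize_fn` could be something like `lambda chunk: summarizer(chunk, max_length=100, min_length=5, do_sample=False)[0]['summary_text']`, reusing the call from the cell above.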
