# Text Summarization

Summarization is the task of condensing a text into a shorter version while preserving its key information and overall meaning. Because manual summarization is time-consuming and laborious, automating it is increasingly popular and is a strong motivation for academic research.

There are two approaches to text summarization: **extractive** and **abstractive**.


*   **Extractive summarization** selects sentences directly from the document, based on a scoring function, and combines them into a coherent summary. It works by identifying the important sections of the text, cropping them out, and stitching them together to produce a condensed version.
*   **Abstractive summarization** interprets the text using advanced natural language techniques and generates a new, shorter text that conveys the most critical information from the original. Parts of the generated summary may not appear in the original document.
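
The extractive idea (score each sentence, keep the best ones) can be sketched in a few lines. This is only a toy illustration, not the method used later in this notebook: the function name `extractive_summary`, the regex-based sentence split, and the average-word-frequency score are all assumptions made for brevity.

```python
import re
from collections import Counter


def extractive_summary(text, n_sentences=2):
    """Toy extractive summarizer: rank sentences by average word frequency."""
    # Naive sentence split on ., ! or ? followed by whitespace (assumption).
    sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]
    # Word frequencies over the whole text serve as a crude importance signal.
    freq = Counter(re.findall(r'\w+', text.lower()))

    def score(sentence):
        tokens = re.findall(r'\w+', sentence.lower())
        return sum(freq[t] for t in tokens) / max(len(tokens), 1)

    # Take the highest-scoring sentences, then restore document order.
    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    return [sentences[i] for i in sorted(ranked[:n_sentences])]


demo = "The dollar rose. The dollar fell against the euro. Cats sleep a lot."
print(extractive_summary(demo, n_sentences=2))
```

Restoring the original sentence order after ranking usually makes the extracted summary easier to read.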

In this notebook we apply the TextRank algorithm to summarize a BBC news article.


In [3]:
# import libraries
import pandas as pd 
import numpy as np

In [4]:
df = pd.read_csv('/content/bbc_text_cls.csv')

# use a single article as the example text
text = df.loc[1, 'text']
text

'Dollar gains on Greenspan speech\n\nThe dollar has hit its highest level against the euro in almost three months after the Federal Reserve head said the US trade deficit is set to stabilise.\n\nAnd Alan Greenspan highlighted the US government\'s willingness to curb spending and rising household savings as factors which may help to reduce it. In late trading in New York, the dollar reached $1.2871 against the euro, from $1.2974 on Thursday. Market concerns about the deficit has hit the greenback in recent months. On Friday, Federal Reserve chairman Mr Greenspan\'s speech in London ahead of the meeting of G7 finance ministers sent the dollar higher after it had earlier tumbled on the back of worse-than-expected US jobs data. "I think the chairman\'s taking a much more sanguine view on the current account deficit than he\'s taken for some time," said Robert Sinche, head of currency strategy at Bank of America in New York. "He\'s taking a longer-term view, laying out a set of conditions u

First, we preprocess the text: expand contractions, replace paragraph breaks with spaces, and split the text into sentences.

In [6]:
# ! pip install contractions

In [7]:
import contractions
import re

import nltk
nltk.download('punkt')

from nltk.tokenize import sent_tokenize

text = contractions.fix(text)
text = re.sub('\n\n', ' ', text)

# sentence tokenization: each element of `tokens` is one sentence
tokens = sent_tokenize(text)
len(tokens)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


15

Next, we convert each sentence into a TF-IDF vector.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
text_tfidf = vectorizer.fit_transform(tokens)
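
To make the result of `fit_transform` concrete, here is a small sketch on three made-up sentences (the `toy_*` names and the sentences are illustrative assumptions, not data from the notebook). It shows that the output has one row per sentence and that, with the default settings of `TfidfVectorizer`, each row is scaled to unit L2 norm.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical sentences, used only to illustrate the output format.
toy_sentences = [
    "The dollar rose against the euro.",
    "The euro fell against the dollar.",
    "Gaming can be addictive.",
]

toy_vectorizer = TfidfVectorizer()
toy_tfidf = toy_vectorizer.fit_transform(toy_sentences)  # sparse matrix

# One row per sentence, one column per vocabulary term.
print(toy_tfidf.shape)

# With the default norm='l2', every row vector has length 1.
row_norms = np.linalg.norm(toy_tfidf.toarray(), axis=1)
print(row_norms)
```

Because the rows are unit vectors, the cosine similarity between two sentences reduces to the dot product of their rows.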

# TextRank

TextRank is a graph-based ranking model for text processing that can be used to find the most relevant sentences in a text, as well as keywords. It is based on the PageRank algorithm.
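
In the variant used in this notebook, sentences are the graph nodes and the edge weights are cosine similarities between TF-IDF vectors. As a sketch, with $S$ denoting the similarity matrix, $n$ the number of sentences, and $G$ and $U$ as in the code, and with $\alpha$ and $\pi$ introduced here as assumed names for `factor` and the score vector, the computation can be written as:

$$
G_{ij} = \frac{S_{ij}}{\sum_{k} S_{ik}}, \qquad
U_{ij} = \frac{1}{n}, \qquad
A = (1 - \alpha)\, G + \alpha\, U,
$$

$$
\pi = \pi A, \qquad \sum_{i} \pi_i = 1 .
$$

The sentence scores are the entries of $\pi$, the stationary distribution of the Markov chain with transition matrix $A$.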


In [None]:
from sklearn.metrics.pairwise import cosine_similarity

# calculate pairwise cosine similarity between sentence vectors
similarity_matrix = cosine_similarity(text_tfidf)

# normalize each row to sum to 1, turning it into the transition matrix G
for i in range(len(similarity_matrix)):
  similarity_matrix[i] = similarity_matrix[i] / np.sum(similarity_matrix[i])

In [None]:
similarity_matrix[0].sum()

1.0
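
The same row normalization can also be written without a loop using NumPy broadcasting. The sketch below uses a small made-up similarity matrix (`S_toy` is an assumption for illustration) and also shows that the normalized matrix is generally no longer symmetric.

```python
import numpy as np

# Hypothetical symmetric similarity matrix for three sentences.
S_toy = np.array([[1.0, 0.5, 0.0],
                  [0.5, 1.0, 0.2],
                  [0.0, 0.2, 1.0]])

# Divide every row by its sum; keepdims=True keeps the shape (3, 1)
# so the division broadcasts across columns.
G_toy = S_toy / S_toy.sum(axis=1, keepdims=True)

print(G_toy.sum(axis=1))          # every row sums to 1
print(np.allclose(G_toy, G_toy.T))  # symmetry is lost after normalization
```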

In [None]:
# uniform matrix used for smoothing: every entry equals 1 / (number of sentences)
U = np.ones_like(similarity_matrix) / len(similarity_matrix)
U

array([[0.00671141, 0.00671141, 0.00671141, ..., 0.00671141, 0.00671141,
        0.00671141],
       [0.00671141, 0.00671141, 0.00671141, ..., 0.00671141, 0.00671141,
        0.00671141],
       [0.00671141, 0.00671141, 0.00671141, ..., 0.00671141, 0.00671141,
        0.00671141],
       ...,
       [0.00671141, 0.00671141, 0.00671141, ..., 0.00671141, 0.00671141,
        0.00671141],
       [0.00671141, 0.00671141, 0.00671141, ..., 0.00671141, 0.00671141,
        0.00671141],
       [0.00671141, 0.00671141, 0.00671141, ..., 0.00671141, 0.00671141,
        0.00671141]])

In [None]:
factor = 0.15  # weight of the uniform smoothing term
# mix G with the uniform matrix; rows of A still sum to 1
A = (1 - factor) * similarity_matrix + factor * U
A[0].sum()

1.0

In [None]:
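# eigendecomposition of A.T (not used below; its leading eigenvector is
# proportional to the stationary distribution)
value, vector = np.linalg.eig(A.T)

# find the limiting (stationary) distribution by power iteration,
# starting from the uniform distribution
limiting_dist = np.ones(len(A)) / len(A)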
threshold = 1e-8
delta = float('inf')
iters = 0

while delta > threshold:
  iters += 1

  p = limiting_dist.dot(A)
  delta = np.abs(p - limiting_dist).sum()
  limiting_dist = p
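
As a sanity check on the power iteration, the result can be compared with the left eigenvector of the transition matrix for eigenvalue 1. The sketch below does this on a small made-up matrix; the function name `stationary_distribution`, the `max_iter` safeguard, and the toy values are assumptions added for illustration.

```python
import numpy as np


def stationary_distribution(A, tol=1e-8, max_iter=1000):
    """Power iteration for a row-stochastic matrix A; returns (pi, n_iter)."""
    pi = np.ones(len(A)) / len(A)  # start from the uniform distribution
    for n_iter in range(1, max_iter + 1):
        nxt = pi @ A
        if np.abs(nxt - pi).sum() < tol:  # L1 change below tolerance
            return nxt, n_iter
        pi = nxt
    return pi, max_iter  # give up after max_iter steps


# Hypothetical row-stochastic matrix, smoothed the same way as A above.
G_small = np.array([[0.6, 0.3, 0.1],
                    [0.2, 0.5, 0.3],
                    [0.1, 0.4, 0.5]])
A_small = 0.85 * G_small + 0.15 * np.ones((3, 3)) / 3

pi_power, n_iter = stationary_distribution(A_small)

# Left eigenvector of A_small for the largest eigenvalue (which is 1),
# rescaled so that its entries sum to 1.
values, vectors = np.linalg.eig(A_small.T)
lead = np.argmax(values.real)
pi_eig = vectors[:, lead].real
pi_eig = pi_eig / pi_eig.sum()

print(pi_power)
print(pi_eig)
```

The two vectors should agree up to the tolerance of the iteration. The `max_iter` cap is a defensive choice so the loop cannot run forever if the input matrix is not a valid transition matrix.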

In [None]:
limiting_dist

array([0.0067357 , 0.00492512, 0.00816282, 0.00743893, 0.01022865,
       0.00856642, 0.00670511, 0.00643171, 0.00898258, 0.00600879,
       0.00735009, 0.00803472, 0.00548668, 0.00717573, 0.0073911 ,
       0.00631625, 0.00651664, 0.00728121, 0.01045734, 0.0069332 ,
       0.00630915, 0.00407364, 0.00754037, 0.0060187 , 0.00306393,
       0.00725798, 0.00392913, 0.00736709, 0.01007403, 0.00575486,
       0.0060993 , 0.00495088, 0.00895693, 0.00997778, 0.00540578,
       0.0062273 , 0.00444116, 0.00482682, 0.00626862, 0.00614313,
       0.00662038, 0.00663935, 0.00766984, 0.00770023, 0.00443793,
       0.00767137, 0.00856417, 0.00761413, 0.00912521, 0.0067839 ,
       0.00677215, 0.00555458, 0.00549138, 0.00358853, 0.01058893,
       0.00615067, 0.00511422, 0.00596411, 0.00612146, 0.00668699,
       0.00585096, 0.00638436, 0.00592216, 0.00863741, 0.00565775,
       0.00971829, 0.00730921, 0.00445467, 0.00570447, 0.00696997,
       0.00783858, 0.00762191, 0.00838896, 0.00724423, 0.00598

In [None]:
scores = limiting_dist

In [None]:
sort_idx = np.argsort(-scores)
[tokens[i] for i in sort_idx[:20]]

['I think I am obsessed with gaming in general, I spend far too much time playing games like Everquest 2 and Football Manager rather than going out and interacting with real people and when I do try to, I am always thinking in the back if my mind that I would rather be in front of the computer winning the league with Cambridge United.',
 'When you stop playing you are at the same point as when you started; all the achievements of your 10 hour session are irretrievably locked in the game and, since you have gained nothing in the real world, you may as well pile on more achievement in the fake one.',
 'He says that in the world of online gaming such behaviour "was not that unusual; lots of people I knew in the game played EQ that much".',
 'Some of the comments were humorous: "This game is so good I am not going to get it, there is no way I could limit the hours I would spend playing it," wrote Charles MacIntyre, from England.',
 'I spend about five hours per day online playing it and I 