# Extractive Text Summarizer using TextRank

## Text Summarization with TF-IDF

+ Split document into sentences
+ Compute TF-IDF Matrix (Sentences x Terms)
+ Score each sentence (average of non-zero TF-IDF components)
+ Take the top scoring sentences
+ TextRank is an alternative way to score each sentence; all other steps remain the same
+ Goal: score each sentence, using cosine similarity to measure how similar the sentences are to one another
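
For comparison with TextRank, the plain TF-IDF scoring step listed above (average of the non-zero TF-IDF components per sentence) could be sketched as follows. This is a minimal sketch, not code from this notebook: `score_sentences` and the toy matrix are hypothetical names and values chosen only for illustration.

```python
import numpy as np

def score_sentences(tfidf):
    """Average the non-zero TF-IDF weights in each row (one row per sentence)."""
    tfidf = np.asarray(tfidf, dtype=float)
    nonzero_counts = (tfidf != 0).sum(axis=1)
    totals = tfidf.sum(axis=1)
    # rows with no non-zero terms get a score of 0 instead of dividing by zero
    return np.divide(totals, nonzero_counts,
                     out=np.zeros_like(totals), where=nonzero_counts > 0)

# hypothetical 3-sentence x 4-term TF-IDF matrix, for illustration only
toy_tfidf = np.array([
    [0.5, 0.0, 0.5, 0.0],
    [0.2, 0.2, 0.2, 0.4],
    [0.0, 0.0, 0.0, 0.0],
])
toy_scores = score_sentences(toy_tfidf)
top_two = np.argsort(-toy_scores)[:2]  # indices of the two highest-scoring sentences
```

TextRank replaces only this scoring function; the sentence splitting and the top-k selection stay the same.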

In [1]:
# importing packages

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import textwrap
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity 

df = pd.read_csv('data/bbc_text.csv') # importing bbc_text dataset
df.head(10) # getting the first ten rows of the dataset

Unnamed: 0,text,labels
0,Ad sales boost Time Warner profit\n\nQuarterly...,business
1,Dollar gains on Greenspan speech\n\nThe dollar...,business
2,Yukos unit buyer faces loan claim\n\nThe owner...,business
3,High fuel prices hit BA's profits\n\nBritish A...,business
4,Pernod takeover talk lifts Domecq\n\nShares in...,business
5,Japan narrowly escapes recession\n\nJapan's ec...,business
6,Jobs growth still slow in the US\n\nThe US cre...,business
7,"India calls for fair trade rules\n\nIndia, whi...",business
8,Ethiopia's crop production up 24%\n\nEthiopia ...,business
9,Court rejects $280bn tobacco case\n\nA US gove...,business


In [2]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2225 entries, 0 to 2224
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    2225 non-null   object
 1   labels  2225 non-null   object
dtypes: object(2)
memory usage: 34.9+ KB


In [6]:
doc = df[df['labels'] == 'business']['text'].sample(random_state = 42) # keeping only business articles and grabbing a random sample

In [7]:
# using textwrap to make the printed text easier to read

def wrap(x):
    return textwrap.fill(x, replace_whitespace = False, fix_sentence_endings = True)

In [13]:
print(wrap(doc.iloc[0])) # printing out the sampled article

Christmas sales worst since 1981

UK retail sales fell in December,
failing to meet expectations and making it by some counts the worst
Christmas since 1981.

Retail sales dropped by 1% on the month in
December, after a 0.6% rise in November, the Office for National
Statistics (ONS) said.  The ONS revised the annual 2004 rate of growth
down from the 5.9% estimated in November to 3.2%. A number of
retailers have already reported poor figures for December.  Clothing
retailers and non-specialist stores were the worst hit with only
internet retailers showing any significant growth, according to the
ONS.

The last time retailers endured a tougher Christmas was 23 years
previously, when sales plunged 1.7%.

The ONS echoed an earlier
caution from Bank of England governor Mervyn King not to read too much
into the poor December figures.  Some analysts put a positive gloss on
the figures, pointing out that the non-seasonally-adjusted figures
showed a performance comparable with 2003. The Novembe

In [14]:
sents = nltk.sent_tokenize(doc.iloc[0].split('\n', 1)[1]) # dropping the title line, then splitting the body into sentences

In [15]:
featurizer = TfidfVectorizer(stop_words = stopwords.words('english'), norm = 'l1') # TF-IDF featurizer that removes English stop words and L1-normalizes each sentence vector

In [16]:
X = featurizer.fit_transform(sents) # learning the vocabulary and building the sentences x terms TF-IDF matrix

In [32]:
S = cosine_similarity(X)

In [33]:
S.shape

(17, 17)

In [34]:
len(sents)

17
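
The shape check confirms that the similarity matrix has one row and one column per sentence. As a rough illustration of what a cosine similarity matrix contains, the same quantity can be computed with plain NumPy. This is a sketch under assumptions: `cosine_similarity_matrix` and the toy vectors are hypothetical, and it is not a claim about how scikit-learn implements `cosine_similarity` internally.

```python
import numpy as np

def cosine_similarity_matrix(X):
    """Pairwise cosine similarity between the rows of X."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # leave all-zero rows as zero vectors
    unit = X / norms         # scale every row to unit length
    return unit @ unit.T     # dot products of unit vectors are cosines

# hypothetical 3-sentence x 2-term matrix, for illustration only
toy = np.array([[1.0, 0.0],
                [1.0, 1.0],
                [0.0, 2.0]])
sim = cosine_similarity_matrix(toy)
```

Because cosine similarity ignores the length of each vector, rescaling a row by a positive constant (as L1 normalization does) leaves its similarities unchanged.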

In [35]:
# normalizing similarity matrix so that each row sums to 1
# keepdims=True keeps the row sums as a 2D column vector, so numpy broadcasting divides each row by its own sum

S /= S.sum(axis= 1, keepdims = True)

In [36]:
S[0].sum()

1.0
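
To see why `keepdims` matters for the normalization above, the following small sketch compares the two broadcasting behaviours on a hypothetical 3 x 3 matrix (the values are made up for illustration).

```python
import numpy as np

A = np.array([[1.0, 1.0, 2.0],
              [0.0, 3.0, 1.0],
              [2.0, 2.0, 4.0]])

row_sums_2d = A.sum(axis=1, keepdims=True)  # shape (3, 1)
row_sums_1d = A.sum(axis=1)                 # shape (3,)

row_normalized = A / row_sums_2d  # each row is divided by its own sum
col_scaled = A / row_sums_1d      # broadcasting lines the sums up with columns instead
```

Only `row_normalized` has rows that sum to 1, which is what a Markov transition matrix requires.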

In [37]:
# uniform transition matrix
# matrix of ones divided by the number of sentences

U = np.ones_like(S) / len(S)

In [38]:
# checking to see if the sum of the first row is one

U[0].sum()

1.0

In [39]:
# smoothed similarity matrix
# compute final matrix - convex combination of S and U
# weight 0.85 on S and 0.15 on U - the weights must add to 1

factor = 0.15
S = (1 - factor) * S + factor * U

In [41]:
# ensuring the first row of S still sums to 1

S[0].sum()

1.0
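
The effect of the smoothing step can be illustrated on a tiny chain. In this sketch, `smooth` is a hypothetical helper and the 2 x 2 matrix is a made-up example of two states that never reach each other. After mixing in the uniform matrix, every transition has positive probability and each row still sums to 1.

```python
import numpy as np

def smooth(transition, factor=0.15):
    """Convex combination of a row-stochastic matrix and the uniform matrix."""
    n = transition.shape[0]
    uniform = np.full((n, n), 1.0 / n)
    return (1 - factor) * transition + factor * uniform

# hypothetical chain with two disconnected states
P = np.array([[1.0, 0.0],
              [0.0, 1.0]])
G = smooth(P)
```

With strictly positive entries the chain has a single stationary distribution, which is what the following cells compute.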

In [42]:
# find the limiting / stationary distribution
# transposing the matrix so that eig returns the left eigenvectors of S
# getting both eigenvalues and eigenvectors

eigenvals, eigenvecs = np.linalg.eig(S.T) 

In [43]:
eigenvals

array([1.        , 0.24245466, 0.72108199, 0.67644122, 0.34790129,
       0.34417302, 0.3866884 , 0.40333562, 0.41608572, 0.44238593,
       0.63909999, 0.62556792, 0.58922572, 0.57452382, 0.48511399,
       0.51329157, 0.52975372])

In [44]:
# eigenvector for the first eigenvalue (1.0)
# if multiplying it by S returns the same vector, then it is a proper eigenvector

eigenvecs[:, 0] 

array([-0.24206557, -0.27051337, -0.2213806 , -0.28613638, -0.25065894,
       -0.2499217 , -0.279622  , -0.21515455, -0.2226665 , -0.22745415,
       -0.2059112 , -0.20959727, -0.23526242, -0.24203809, -0.23663025,
       -0.2940483 , -0.20865607])

In [45]:
eigenvecs[:, 0].dot(S)

array([-0.24206557, -0.27051337, -0.2213806 , -0.28613638, -0.25065894,
       -0.2499217 , -0.279622  , -0.21515455, -0.2226665 , -0.22745415,
       -0.2059112 , -0.20959727, -0.23526242, -0.24203809, -0.23663025,
       -0.2940483 , -0.20865607])

In [46]:
eigenvecs[:, 0] / eigenvecs[:, 0].sum()

array([0.05907327, 0.06601563, 0.05402535, 0.06982824, 0.06117038,
       0.06099047, 0.06823848, 0.05250595, 0.05433915, 0.05550753,
       0.05025022, 0.05114976, 0.05741304, 0.05906657, 0.05774684,
       0.07175905, 0.05092007])
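
In this run the eigenvalue 1 happens to come first, but `np.linalg.eig` does not promise any particular ordering. A slightly more defensive sketch picks the eigenvalue closest to 1 before normalizing. `stationary_from_eig` and the 2 x 2 matrix below are hypothetical and only for illustration.

```python
import numpy as np

def stationary_from_eig(transition):
    """Stationary distribution from the left eigenvector with eigenvalue closest to 1."""
    eigenvals, eigenvecs = np.linalg.eig(transition.T)
    idx = np.argmin(np.abs(eigenvals - 1.0))  # do not assume it is at index 0
    vec = np.real(eigenvecs[:, idx])          # drop any zero imaginary part
    return vec / vec.sum()                    # rescale so the entries sum to 1

# hypothetical row-stochastic matrix
G = np.array([[0.9, 0.1],
              [0.5, 0.5]])
pi = stationary_from_eig(G)
```

Dividing by the sum also removes the arbitrary sign of the eigenvector, which is why the negative values above become positive probabilities.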

In [47]:
limiting_dist = np.ones(len(S)) / len(S) # initializing limiting distribution to be uniform distribution
threshold = 1e-8 # convergence threshold of 10^-8, used to quit the loop
delta = float('inf') # starts at infinity; it will store how much the distribution has changed
iters = 0 # keeping track of the iterations
while delta > threshold: # as long as delta is bigger than the threshold, the loop will continue to run
    iters += 1 # increment iters by 1
    
    # Markov transition
    # computing distribution for the next step stored in p
    p = limiting_dist.dot(S)
    
    # compute change in limiting distribution
    # sum of absolute differences between old and new distribution
    delta = np.abs(p - limiting_dist).sum()
    
    # update limiting distribution
    limiting_dist = p
    
print(iters)

41


### This took only 41 iterations to converge
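
The loop above can also be wrapped in a reusable function. The following is a minimal sketch under assumptions: `power_iteration` and the `max_iters` safety cap are additions for illustration and are not part of the original notebook.

```python
import numpy as np

def power_iteration(transition, threshold=1e-8, max_iters=1000):
    """Repeatedly apply a row-stochastic matrix until the distribution stops changing."""
    n = transition.shape[0]
    dist = np.ones(n) / n  # start from the uniform distribution
    for iters in range(1, max_iters + 1):
        new_dist = dist @ transition              # one Markov transition
        delta = np.abs(new_dist - dist).sum()     # total absolute change
        dist = new_dist
        if delta <= threshold:
            return dist, iters
    raise RuntimeError("power iteration did not converge within max_iters")

# hypothetical row-stochastic matrix
G = np.array([[0.9, 0.1],
              [0.5, 0.5]])
pi, n_iters = power_iteration(G)
```

The cap guards against an endless loop if the matrix passed in is not a valid, smoothed transition matrix.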

In [49]:
# printing out limiting distribution

limiting_dist

array([0.05907327, 0.06601563, 0.05402534, 0.06982824, 0.06117038,
       0.06099047, 0.06823848, 0.05250595, 0.05433915, 0.05550753,
       0.05025022, 0.05114977, 0.05741304, 0.05906657, 0.05774685,
       0.07175905, 0.05092008])

In [50]:
limiting_dist.sum()

0.9999999999999982

In [51]:
np.abs(eigenvecs[:, 0] / eigenvecs[:, 0].sum() - limiting_dist).sum()

1.9964738806610427e-08

In [52]:
scores = limiting_dist

In [53]:
sort_idx = np.argsort(-scores)

In [54]:
# printing out an extractive summary
# since newspapers tend to begin with specifics and generalize afterwards, it is important to keep the text in order
# this lessens the likelihood of any important details being left out
# note: the loop below prints the top five sentences by descending score, not in article order

print('Generated Summary:')
for i in sort_idx[:5]:
    print(wrap("%.2f %s" % (scores[i], sents[i])))

Generated Summary:
0.07 "The retail sales figures are very weak, but as Bank of England
governor Mervyn King indicated last night, you don't really get an
accurate impression of Christmas trading until about Easter," said Mr
Shaw.
0.07 A number of retailers have already reported poor figures for
December.
0.07 The ONS echoed an earlier caution from Bank of England governor
Mervyn King not to read too much into the poor December figures.
0.07 Retail sales dropped by 1% on the month in December, after a 0.6%
rise in November, the Office for National Statistics (ONS) said.
0.06 Clothing retailers and non-specialist stores were the worst hit
with only internet retailers showing any significant growth, according
to the ONS.
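
The summary above is listed from highest to lowest score. If the selected sentences should instead appear in the order they occur in the article, the chosen indices can be sorted before printing. This is a sketch with hypothetical stand-in data (`toy_sents`, `toy_scores`), not the notebook's actual `sents` and `scores`.

```python
import numpy as np

# hypothetical sentences and scores, for illustration only
toy_sents = ["First.", "Second.", "Third.", "Fourth.", "Fifth."]
toy_scores = np.array([0.10, 0.30, 0.15, 0.25, 0.20])

top_k = 3
top_idx = np.argsort(-toy_scores)[:top_k]  # highest-scoring sentences first
ordered_idx = np.sort(top_idx)             # restore their original article order
summary = [toy_sents[i] for i in ordered_idx]
```

The same two lines would apply to `sort_idx[:5]` in the cell above.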


In [56]:
doc.iloc[0].split('\n')[0] # getting the article title

'Christmas sales worst since 1981'