
In [31]:
import pandas as pd
import numpy as np
import textwrap
import nltk
from nltk.corpus import stopwords
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer, PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

In [32]:
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [33]:
# https://www.kaggle.com/shivamkushwaha/bbc-full-text-document-classification
!wget -nc https://lazyprogrammer.me/course_files/nlp/bbc_text_cls.csv

File ‘bbc_text_cls.csv’ already there; not retrieving.



In [34]:
# load data
df = pd.read_csv('bbc_text_cls.csv')

In [35]:
df.head()

Unnamed: 0,text,labels
0,Ad sales boost Time Warner profit\n\nQuarterly...,business
1,Dollar gains on Greenspan speech\n\nThe dollar...,business
2,Yukos unit buyer faces loan claim\n\nThe owner...,business
3,High fuel prices hit BA's profits\n\nBritish A...,business
4,Pernod takeover talk lifts Domecq\n\nShares in...,business


In [36]:
# choose a random document

doc = df[df.labels == 'business']['text'].sample(random_state=42)

In [37]:
# wrap function, which helps us print the results more nicely

def wrap(x):
  return textwrap.fill(x, replace_whitespace=False, fix_sentence_endings=True)

In [38]:
print(wrap(doc.iloc[0]))

Christmas sales worst since 1981

UK retail sales fell in December,
failing to meet expectations and making it by some counts the worst
Christmas since 1981.

Retail sales dropped by 1% on the month in
December, after a 0.6% rise in November, the Office for National
Statistics (ONS) said.  The ONS revised the annual 2004 rate of growth
down from the 5.9% estimated in November to 3.2%. A number of
retailers have already reported poor figures for December.  Clothing
retailers and non-specialist stores were the worst hit with only
internet retailers showing any significant growth, according to the
ONS.

The last time retailers endured a tougher Christmas was 23 years
previously, when sales plunged 1.7%.

The ONS echoed an earlier
caution from Bank of England governor Mervyn King not to read too much
into the poor December figures.  Some analysts put a positive gloss on
the figures, pointing out that the non-seasonally-adjusted figures
showed a performance comparable with 2003. The Novembe

In [39]:
# tokenize the document into sentences
# doc.iloc[0] ==> gives us the document as a string
# .split("\n", 1) ==> split on the first newline only, since we just want to separate the title from the body
# [1] ==> the split returns [title, body], so we take the value at index 1 (the body)

sents = nltk.sent_tokenize(doc.iloc[0].split("\n", 1)[1])
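
To see what `split("\n", 1)` does in isolation, here is a minimal sketch on a made-up string (the text is illustrative and not taken from the dataset):

```python
# Hypothetical document: a title line followed by a multi-line body.
raw = "Some title\n\nFirst sentence. Second sentence.\nThird sentence."

# maxsplit=1 splits only at the first newline, so later newlines stay in the body.
title, body = raw.split("\n", 1)

print(repr(title))  # 'Some title'
print(repr(body))   # '\nFirst sentence. Second sentence.\nThird sentence.'
```

The body keeps its leading newline, which is harmless here because the sentence tokenizer works on the remaining text as a whole.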

In [40]:
# create the TF-IDF vectorizer, using English stopwords and L1 normalization

featurizer = TfidfVectorizer(
    stop_words=stopwords.words('english'),
    norm='l1',
)

In [41]:
# call fit_transform, which gives us the TF-IDF matrix (one row per sentence)

X = featurizer.fit_transform(sents)

In [42]:
# compute similarity matrix
# note : https://docs.google.com/document/d/1dOjFnL4bcBjgtZ6p9E8wdbeySsGjjS4U/edit?usp=share_link&ouid=117635670089266886030&rtpof=true&sd=true
S = cosine_similarity(X)
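
As a rough illustration of what the similarity step computes, the sketch below implements pairwise cosine similarity with plain NumPy on made-up rows. The helper name `cosine_sim` and the toy numbers are assumptions for this example, and the helper assumes no row is all zeros. It also suggests that the choice of L1 versus L2 row normalization should not change the cosine similarities, since cosine similarity ignores the length of each row.

```python
import numpy as np

def cosine_sim(X):
    """Pairwise cosine similarity between the rows of a dense 2-D array.

    Assumes every row has at least one non-zero entry.
    """
    norms = np.linalg.norm(X, axis=1, keepdims=True)  # L2 length of each row
    unit = X / norms                                   # rows scaled to unit length
    return unit @ unit.T

# Toy non-negative "TF-IDF-like" rows (made-up numbers).
X_toy = np.array([[1.0, 2.0, 0.0],
                  [2.0, 4.0, 0.0],
                  [0.0, 0.0, 3.0]])

# L1-normalize each row, mimicking norm='l1'.
X_l1 = X_toy / np.abs(X_toy).sum(axis=1, keepdims=True)

S_toy = cosine_sim(X_toy)
S_l1 = cosine_sim(X_l1)

print(np.round(S_toy, 3))
print(np.allclose(S_toy, S_l1))  # True: rescaling rows leaves the cosine unchanged
```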

In [43]:
# sanity check: if we did this correctly, S should be a small square matrix, with one row and one column per sentence
S.shape

# furthermore, this implies that we have 17 sentences

(17, 17)

In [44]:
# length of sentences list
len(sents)

17

In [45]:
# normalize similarity matrix
# note : https://docs.google.com/document/d/1lqWzSY_0pbp0D1i4RPBApV5OH_nC_KKr/edit?usp=share_link&ouid=117635670089266886030&rtpof=true&sd=true
S /= S.sum(axis=1, keepdims=True)

In [46]:
# check the sum of the first row of matrix, as a sanity check
S[0].sum()

# and it sums to 1, as expected

1.0
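
The row normalization can be checked on a small made-up matrix. This is only a sketch (the values of `S_toy` are invented); it shows that every row of the result sums to 1 and that the result is in general no longer symmetric, which is why the direction of multiplication matters in the later steps.

```python
import numpy as np

# Toy symmetric similarity matrix (made-up values).
S_toy = np.array([[1.0, 0.5, 0.0],
                  [0.5, 1.0, 0.5],
                  [0.0, 0.5, 1.0]])

# keepdims=True keeps the row sums as a (3, 1) column, so broadcasting
# divides each row by its own sum.
P_toy = S_toy / S_toy.sum(axis=1, keepdims=True)

print(P_toy.sum(axis=1))            # every row sums to 1
print(np.allclose(P_toy, P_toy.T))  # False: rows were scaled by different sums
```

Note that a row of all zeros (for example, a sentence containing only stopwords) would cause a division by zero at this step, so such sentences may need to be filtered out first.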

In [47]:
# uniform transition matrix
# note : https://docs.google.com/document/d/1_v6ad-lpW9Wkg4ewsXEJIILVttgV8HpY/edit?usp=share_link&ouid=117635670089266886030&rtpof=true&sd=true
U = np.ones_like(S) / len(S)

In [48]:
# check the sum of the first row of matrix, as a sanity check
U[0].sum()

# and it sums to 1, as expected

1.0

In [49]:
# smoothed similarity matrix
# note : https://docs.google.com/document/d/1gd8fyHqu-h6Fx5SGrTjvMt1yIfNYgtxe/edit?usp=share_link&ouid=117635670089266886030&rtpof=true&sd=true
factor = 0.15
S = (1 - factor) * S + factor * U

In [50]:
# double check that the first row of the result still sums to 1
S[0].sum()

1.0
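
To illustrate why the smoothing step helps, consider a made-up transition matrix with two states that never reach each other. In this sketch (all names with the `_toy` suffix are assumptions for the example), mixing with the uniform matrix makes every entry strictly positive while keeping each row a valid probability distribution.

```python
import numpy as np

# Two disconnected states: each one only transitions to itself.
P_toy = np.array([[1.0, 0.0],
                  [0.0, 1.0]])

# Uniform transition matrix: jump to any state with equal probability.
U_toy = np.ones_like(P_toy) / len(P_toy)

factor = 0.15
G_toy = (1 - factor) * P_toy + factor * U_toy

print(G_toy)               # [[0.925 0.075], [0.075 0.925]]
print(G_toy.sum(axis=1))   # rows still sum to 1
print((G_toy > 0).all())   # True: every state can now reach every other state
```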

In [51]:
# find the limiting / stationary distribution
# note : https://docs.google.com/document/d/1mg-LbNKRXtUrei6rcwBM608_0b-akTia/edit?usp=share_link&ouid=117635670089266886030&rtpof=true&sd=true
eigenvals, eigenvecs = np.linalg.eig(S.T)


In [52]:
# check the eigenvalues
# the theory states that one of these eigenvalues should be one
eigenvals

# as you can see, the first eigenvalue is indeed one.

array([1.        , 0.24245466, 0.72108199, 0.67644122, 0.34790129,
       0.34417302, 0.3866884 , 0.40333562, 0.41608572, 0.44238593,
       0.63909999, 0.62556792, 0.58922572, 0.57452382, 0.48511399,
       0.51329157, 0.52975372])

In [53]:
# check the corresponding eigenvector (column 0 goes with the first eigenvalue)
# note : https://docs.google.com/document/d/1KrKEN6qRYCMRdkbSu80O-S4oWKT1cXJM/edit?usp=share_link&ouid=117635670089266886030&rtpof=true&sd=true
eigenvecs[:,0]

array([-0.24206557, -0.27051337, -0.2213806 , -0.28613638, -0.25065894,
       -0.2499217 , -0.279622  , -0.21515455, -0.2226665 , -0.22745415,
       -0.2059112 , -0.20959727, -0.23526242, -0.24203809, -0.23663025,
       -0.2940483 , -0.20865607])

In [54]:
# sanity check: a left eigenvector with eigenvalue 1 should be unchanged when multiplied by S
# note : https://docs.google.com/document/d/1yCUMQhc8x4Qi0aJTCx5MOuLZbkxaC_7R/edit?usp=share_link&ouid=117635670089266886030&rtpof=true&sd=true
eigenvecs[:,0].dot(S)

array([-0.24206557, -0.27051337, -0.2213806 , -0.28613638, -0.25065894,
       -0.2499217 , -0.279622  , -0.21515455, -0.2226665 , -0.22745415,
       -0.2059112 , -0.20959727, -0.23526242, -0.24203809, -0.23663025,
       -0.2940483 , -0.20865607])

In [55]:
# rescale the eigenvector so that it sums to 1, which turns it into a probability distribution
# note : https://docs.google.com/document/d/1dl6id89ZIuIp2m8HxPppllQNKFpdO39d/edit?usp=share_link&ouid=117635670089266886030&rtpof=true&sd=true
eigenvecs[:,0] / eigenvecs[:,0].sum()

array([0.05907327, 0.06601563, 0.05402535, 0.06982824, 0.06117038,
       0.06099047, 0.06823848, 0.05250595, 0.05433915, 0.05550753,
       0.05025022, 0.05114976, 0.05741304, 0.05906657, 0.05774684,
       0.07175905, 0.05092007])
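
The cells above rely on the eigenvalue 1 appearing at index 0, which was confirmed by inspection. Since `np.linalg.eig` does not guarantee any ordering of the eigenvalues, a slightly more defensive sketch picks the eigenvalue closest to 1 explicitly. The function name `stationary_from_eig` and the toy matrix are assumptions for this example.

```python
import numpy as np

def stationary_from_eig(P):
    """Stationary distribution of a row-stochastic matrix P via eigendecomposition."""
    # Left eigenvectors of P are right eigenvectors of P.T.
    eigenvals, eigenvecs = np.linalg.eig(P.T)

    # Pick the eigenvalue closest to 1 instead of assuming it comes first.
    k = np.argmin(np.abs(eigenvals - 1.0))

    # Drop any zero imaginary part, then rescale so the entries sum to 1.
    v = np.real(eigenvecs[:, k])
    return v / v.sum()

# Toy 2-state chain (made-up values); its stationary distribution is [5/6, 1/6].
P_toy = np.array([[0.9, 0.1],
                  [0.5, 0.5]])

pi = stationary_from_eig(P_toy)
print(pi)
print(np.allclose(pi @ P_toy, pi))  # True: pi is unchanged by one transition
```

Dividing by the sum also takes care of the sign, since an eigenvector returned with all-negative entries becomes all-positive after the division.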

In [56]:
# alternatively, find the limiting distribution by brute force:
# start from a uniform distribution and apply the transition matrix until it stops changing
# note : https://docs.google.com/document/d/1HzDdVFGanlTaSYwwAaqLvIENLr59oF-U/edit?usp=share_link&ouid=117635670089266886030&rtpof=true&sd=true
limiting_dist = np.ones(len(S)) / len(S)
threshold = 1e-8
delta = float('inf')
iters = 0
while delta > threshold:
  iters += 1

  # Markov transition
  p = limiting_dist.dot(S)

  # compute change in limiting distribution
  delta = np.abs(p - limiting_dist).sum()

  # update limiting distribution
  limiting_dist = p

print(iters)
# this only took 41 steps.

41


In [57]:
# print limiting distribution
limiting_dist

array([0.05907327, 0.06601563, 0.05402534, 0.06982824, 0.06117038,
       0.06099047, 0.06823848, 0.05250595, 0.05433915, 0.05550753,
       0.05025022, 0.05114977, 0.05741304, 0.05906657, 0.05774685,
       0.07175905, 0.05092008])

In [58]:
# check sum of the values
limiting_dist.sum()

# as you can see, the distribution sums to one (up to floating-point error)

0.9999999999999982

In [59]:
# compute the sum of absolute differences between the answer we got before and the answer we got by brute force.
np.abs(eigenvecs[:,0] / eigenvecs[:,0].sum() - limiting_dist).sum()

1.9964738834366003e-08
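
The brute-force loop can also be wrapped in a reusable function. The following is a minimal sketch of that idea, not the notebook's own code: the name `power_iteration` and the `max_iters` safeguard are assumptions added here so that the loop cannot run forever if the distribution never settles.

```python
import numpy as np

def power_iteration(P, threshold=1e-8, max_iters=10_000):
    """Repeatedly apply row-stochastic P to a uniform start distribution.

    Returns the final distribution and the number of iterations performed.
    """
    n = len(P)
    dist = np.ones(n) / n
    for iters in range(1, max_iters + 1):
        new_dist = dist @ P                      # one Markov transition
        delta = np.abs(new_dist - dist).sum()    # L1 change since the last step
        dist = new_dist
        if delta <= threshold:
            break
    return dist, iters

# Toy 2-state chain (made-up values); its stationary distribution is [5/6, 1/6].
P_toy = np.array([[0.9, 0.1],
                  [0.5, 0.5]])

dist, n_iters = power_iteration(P_toy)
print(dist, n_iters)
```

For large matrices this approach may be preferable to a full eigendecomposition, since each step is only a vector-matrix product.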

In [30]:
# set our score to the limiting distribution
# as you recall, this distribution corresponds to the TextRank scores, which we will then use to find the top sentences in our document
scores = limiting_dist

In [60]:
# find the sort index for the score
sort_idx = np.argsort(-scores)

In [61]:
# Many options for how to choose which sentences to include:

# 1) top N sentences
# 2) top N words or characters.
# 3) top X% sentences or top X% words
# 4) sentences with scores > average score
# 5) sentences with scores > factor * average score

# You also don't have to sort by score. It may make more sense to show the selected sentences in their original order.

print("Generated summary:")
for i in sort_idx[:5]:
  print(wrap("%.2f: %s" % (scores[i], sents[i])))

Generated summary:
0.07: "The retail sales figures are very weak, but as Bank of England
governor Mervyn King indicated last night, you don't really get an
accurate impression of Christmas trading until about Easter," said Mr
Shaw.
0.07: A number of retailers have already reported poor figures for
December.
0.07: The ONS echoed an earlier caution from Bank of England governor
Mervyn King not to read too much into the poor December figures.
0.07: Retail sales dropped by 1% on the month in December, after a
0.6% rise in November, the Office for National Statistics (ONS) said.
0.06: Clothing retailers and non-specialist stores were the worst hit
with only internet retailers showing any significant growth, according
to the ONS.
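
Two of the selection options listed above can be sketched as follows. The helper name `select_above_average` and the toy scores are assumptions for illustration only. The first variant keeps the top N sentences but restores document order; the second keeps every sentence whose score exceeds `factor` times the average.

```python
import numpy as np

def select_above_average(scores, factor=1.0):
    """Indices (in document order) of scores greater than factor * mean score."""
    scores = np.asarray(scores)
    return np.flatnonzero(scores > factor * scores.mean())

# Made-up scores for five sentences.
toy_scores = np.array([0.10, 0.30, 0.05, 0.40, 0.15])
toy_sents = ["s0", "s1", "s2", "s3", "s4"]

# Top 3 by score, then re-sorted so they appear in their original order.
top_n_in_order = np.sort(np.argsort(-toy_scores)[:3])
print([toy_sents[i] for i in top_n_in_order])  # ['s1', 's3', 's4']

# Sentences scoring above the average.
above_avg = select_above_average(toy_scores)
print([toy_sents[i] for i in above_avg])       # ['s1', 's3']
```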


In [62]:
# one way to check whether this is a good summary is to compare it with the title, which is itself a kind of summary.
# note that this is the same expression as before, except that we take the first element instead of the second, which gives us the title.

doc.iloc[0].split("\n", 1)[0]

# at this point, we should realize that whether or not our summary is good is really a subjective question.
# there is no real notion of accuracy, since many different summaries could be sufficient.
# some summaries would obviously be bad, but again, that is a subjective assessment.

'Christmas sales worst since 1981'

In [64]:
def summarize(text, factor = 0.15):
  # extract sentences
  sents = nltk.sent_tokenize(text)

  # perform tf-idf
  featurizer = TfidfVectorizer(
      stop_words=stopwords.words('english'),
      norm='l1')
  X = featurizer.fit_transform(sents)

  # compute similarity matrix
  S = cosine_similarity(X)

  # normalize similarity matrix
  S /= S.sum(axis=1, keepdims=True)

  # uniform transition matrix
  U = np.ones_like(S) / len(S)

  # smoothed similarity matrix
  S = (1 - factor) * S + factor * U

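  # find the limiting / stationary distribution
  eigenvals, eigenvecs = np.linalg.eig(S.T)

  # compute scores from the eigenvector whose eigenvalue is closest to 1
  # (np.linalg.eig does not guarantee any ordering of the eigenvalues)
  k = np.argmin(np.abs(eigenvals - 1))
  scores = np.real(eigenvecs[:, k] / eigenvecs[:, k].sum())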

  # sort the scores
  sort_idx = np.argsort(-scores)

  # print summary
  for i in sort_idx[:5]:
    print(wrap("%.2f: %s" % (scores[i], sents[i])))

In [65]:
# test our function on a random document
# this time I've chosen one from the entertainment class

doc = df[df.labels == 'entertainment']['text'].sample(random_state=123)
summarize(doc.iloc[0].split("\n", 1)[1])

0.11: Goodrem, Green Day and the Black Eyed Peas took home two awards
each.
0.10: As well as best female, Goodrem also took home the Pepsi Viewers
Choice Award, whilst Green Day bagged the prize for best rock video
for American Idiot.
0.10: Other winners included Green Day, voted best group, and the
Black Eyed Peas.
0.10: The Black Eyed Peas won awards for best R 'n' B video and
sexiest video, both for Hey Mama.
0.10: Local singer and songwriter Missy Higgins took the title of
breakthrough artist of the year, with Australian Idol winner Guy
Sebastian taking the honours for best pop video.


In [66]:
# print out the title of this document, which is kind of like a reference summary for us to compare to

doc.iloc[0].split("\n", 1)[0]

'Goodrem wins top female MTV prize'

In [67]:
print(wrap(doc.iloc[0]))

Goodrem wins top female MTV prize

Pop singer Delta Goodrem has
scooped one of the top individual prizes at the first Australian MTV
Music Awards.

The 21-year-old singer won the award for best female
artist, with Australian Idol runner-up Shannon Noll taking the title
of best male at the ceremony.  Goodrem, known in both Britain and
Australia for her role as Nina Tucker in TV soap Neighbours, also
performed a duet with boyfriend Brian McFadden.  Other winners
included Green Day, voted best group, and the Black Eyed Peas.
Goodrem, Green Day and the Black Eyed Peas took home two awards each.
As well as best female, Goodrem also took home the Pepsi Viewers
Choice Award, whilst Green Day bagged the prize for best rock video
for American Idiot.  The Black Eyed Peas won awards for best R 'n' B
video and sexiest video, both for Hey Mama.  Local singer and
songwriter Missy Higgins took the title of breakthrough artist of the
year, with Australian Idol winner Guy Sebastian taking the honours f