## Text Summarization with TF-IDF

There are two types of text summarization:
- extractive summarization
    - the summary consists of text taken verbatim from the original document
- abstractive summarization
    - the summary can contain novel sequences of text not necessarily taken from the input
    - e.g. transformers, seq2seq models

### Procedure

1. Split the document into sentences (nltk.sent_tokenize(your_text))
2. Compute the TF-IDF matrix from the list of sentences
3. Score each sentence by taking the average of its non-zero TF-IDF values
4. Sort (rank) the sentences by those scores
5. Print the top-scoring sentences as the summary

- Note: this procedure is a form of extractive summarization
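
The procedure above can be sketched as a single function. This is a minimal sketch under stated assumptions, not the notebook's exact code: `summarize` and `top_k` are hypothetical names, the sentences are passed in already split so the example does not depend on downloaded NLTK data, and scikit-learn's built-in English stop-word list stands in for the NLTK one.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer


def summarize(sentences, top_k=2):
    """Return the top_k highest-scoring sentences, in document order."""
    # One TF-IDF row per sentence (assumption: built-in English stop words).
    X = TfidfVectorizer(stop_words="english", norm="l1").fit_transform(sentences)

    # Sentence score: average of the non-zero TF-IDF values in each row.
    row_sums = np.asarray(X.sum(axis=1)).ravel()
    nonzero_counts = np.asarray((X != 0).sum(axis=1)).ravel()
    scores = row_sums / np.maximum(nonzero_counts, 1)  # guard against empty rows

    # Rank by score (highest first) and keep the best top_k indices.
    top_idx = np.argsort(-scores)[:top_k]

    # Return the chosen sentences in their original order.
    return [sentences[i] for i in sorted(top_idx)]
```

In practice the `sentences` argument would come from `nltk.sent_tokenize`, as in the cells that follow.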

In [17]:
import pandas as pd
import numpy as np
import nltk
import textwrap
from nltk.corpus import stopwords
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer, PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

In [3]:
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Sean\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Sean\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [10]:
import os 

os.getcwd()

'd:\\github\\NLP-Projects\\skill_notes\\text_summarization'

In [11]:
df = pd.read_csv('d:/github/NLP-Projects/skill_notes/data/bbc_full_text_cls.csv')

In [15]:
doc = df[df.label == 'business']['text'].sample(random_state=42)

In [23]:
def wrap(x):
    return textwrap.fill(x, replace_whitespace=False, fix_sentence_endings=True)


In [25]:
print(wrap(doc.iloc[0]))

Christmas sales worst since 1981

UK retail sales fell in December,
failing to meet expectations and making it by some counts the worst
Christmas since 1981.

Retail sales dropped by 1% on the month in
December, after a 0.6% rise in November, the Office for National
Statistics (ONS) said.  The ONS revised the annual 2004 rate of growth
down from the 5.9% estimated in November to 3.2%. A number of
retailers have already reported poor figures for December.  Clothing
retailers and non-specialist stores were the worst hit with only
internet retailers showing any significant growth, according to the
ONS.

The last time retailers endured a tougher Christmas was 23 years
previously, when sales plunged 1.7%.

The ONS echoed an earlier
caution from Bank of England governor Mervyn King not to read too much
into the poor December figures.  Some analysts put a positive gloss on
the figures, pointing out that the non-seasonally-adjusted figures
showed a performance comparable with 2003. The Novembe

In [46]:
nltk.sent_tokenize(doc.iloc[0].split("\n\n", maxsplit=0)[0])

['Christmas sales worst since 1981\n\nUK retail sales fell in December, failing to meet expectations and making it by some counts the worst Christmas since 1981.',
 'Retail sales dropped by 1% on the month in December, after a 0.6% rise in November, the Office for National Statistics (ONS) said.',
 'The ONS revised the annual 2004 rate of growth down from the 5.9% estimated in November to 3.2%.',
 'A number of retailers have already reported poor figures for December.',
 'Clothing retailers and non-specialist stores were the worst hit with only internet retailers showing any significant growth, according to the ONS.',
 'The last time retailers endured a tougher Christmas was 23 years previously, when sales plunged 1.7%.',
 'The ONS echoed an earlier caution from Bank of England governor Mervyn King not to read too much into the poor December figures.',
 'Some analysts put a positive gloss on the figures, pointing out that the non-seasonally-adjusted figures showed a performance compara

In [39]:
# the title is removed by splitting once on the first blank line and keeping the remainder: split("\n\n", 1)[1]
sentences = nltk.sent_tokenize(doc.iloc[0].split("\n\n",1)[1])
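
The difference between the two `split` calls comes down to `maxsplit`. A small illustration on a made-up string (the text here is hypothetical, not taken from the dataset):

```python
# Hypothetical document: a title, a blank line, then body paragraphs.
text = "A title\n\nFirst paragraph.\n\nSecond paragraph."

# maxsplit=0 performs no split at all, so element 0 is the whole text,
# title included.
unsplit = text.split("\n\n", maxsplit=0)[0]

# maxsplit=1 splits only at the first blank line: element 0 is the title
# and element 1 is the rest of the document.
title, body = text.split("\n\n", 1)
```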

In [84]:
# With L1 normalization, the relative importance of terms within a sentence is preserved, but each TF-IDF vector is rescaled so that its entries sum to 1.
# This keeps longer sentences (which have more terms and therefore larger raw TF-IDF totals) from dominating the scoring simply because of their length.
featurizer = TfidfVectorizer(
    stop_words=stopwords.words('english'),
    norm='l1'
)

In [85]:
X = featurizer.fit_transform(sentences)

In [86]:
X

<17x154 sparse matrix of type '<class 'numpy.float64'>'
	with 212 stored elements in Compressed Sparse Row format>

In [145]:
print(X[3,  :])

  (0, 10)	0.1808607792791759
  (0, 30)	0.11871441769336029
  (0, 47)	0.10999441338787577
  (0, 89)	0.1808607792791759
  (0, 99)	0.14165079067847455
  (0, 110)	0.15792440629406163
  (0, 113)	0.10999441338787577


In [88]:
def get_sentence_score(tfidf_row):
    """
    Return the average of the non-zero values in the TF-IDF vector of a sentence.
    """
    x = tfidf_row[tfidf_row != 0]
    return x.mean()

In [154]:
scores = np.zeros(len(sentences))
for i in range(len(sentences)):
    scores[i] = get_sentence_score(X[i, :])
scores

array([0.07142857, 0.08333333, 0.125     , 0.14285714, 0.07692308,
       0.09090909, 0.07142857, 0.07692308, 0.06666667, 0.07142857,
       0.125     , 0.08333333, 0.1       , 0.06666667, 0.07142857,
       0.04545455, 0.1       ])

In [163]:
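# Vectorized version of the same score: row sums divided by the number of non-zero entries per row.
scores = np.array(X.sum(axis=1)).flatten() / np.array((X != 0).sum(axis=1)).flatten()
# Shorthand: np.array((X != 0).sum(axis=1)).flatten() gives the same result as (X != 0).sum(axis=1).A1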
scores

array([0.07142857, 0.08333333, 0.125     , 0.14285714, 0.07692308,
       0.09090909, 0.07142857, 0.07692308, 0.06666667, 0.07142857,
       0.125     , 0.08333333, 0.1       , 0.06666667, 0.07142857,
       0.04545455, 0.1       ])

In [165]:
np.array(X.sum(axis=1)).flatten()

array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])
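
Because every row sums to 1 under `norm='l1'` (and TF-IDF values are non-negative), the average of the non-zero entries is simply 1 divided by the number of non-zero entries in the row. The scores above bear this out: 0.14285714 is 1/7, matching the seven stored entries printed for row 3. A minimal NumPy sketch of the identity, using a made-up dense matrix rather than the notebook's `X`:

```python
import numpy as np

# Hypothetical non-negative term weights for three sentences (rows).
raw = np.array([
    [2.0, 0.0, 1.0, 1.0],   # 3 distinct terms
    [0.0, 5.0, 0.0, 0.0],   # 1 distinct term
    [1.0, 1.0, 1.0, 1.0],   # 4 distinct terms
])

# L1-normalize each row so that it sums to 1.
l1 = raw / raw.sum(axis=1, keepdims=True)

nonzero_counts = (l1 != 0).sum(axis=1)
mean_nonzero = l1.sum(axis=1) / nonzero_counts

# The mean of the non-zero entries equals 1 / (number of non-zero entries).
assert np.allclose(mean_nonzero, 1.0 / nonzero_counts)
```

With this normalization the ranking therefore favors sentences with fewer distinct non-stop-word terms; whether that is desirable depends on the use case.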

In [156]:
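# Strictly speaking, the summary need not be presented in score order: the original sentence order carries the story line and the causality.
# argsort on the negated scores ranks the sentences from highest to lowest score.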
sort_idx = np.argsort(-scores)
sort_idx

array([ 3, 10,  2, 16, 12,  5,  1, 11,  4,  7,  6, 14,  0,  9, 13,  8, 15],
      dtype=int64)
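
To turn the ranking into a summary, one option is to pick the `k` best indices and then re-sort those indices so the selected sentences follow the document's original order. The sketch below runs on made-up data; `top_k_in_order` and `k` are hypothetical names, not part of the notebook.

```python
import numpy as np


def top_k_in_order(sentences, scores, k):
    """Pick the k highest-scoring sentences and return them in document order."""
    # kind="stable" keeps earlier sentences first when scores tie.
    ranked = np.argsort(-np.asarray(scores), kind="stable")
    chosen = np.sort(ranked[:k])
    return [sentences[i] for i in chosen]


# Hypothetical sentences and scores, for illustration only.
example_sentences = ["s0", "s1", "s2", "s3", "s4"]
example_scores = [0.07, 0.08, 0.125, 0.14, 0.07]

summary = top_k_in_order(example_sentences, example_scores, k=2)
print("\n".join(summary))
```

With the notebook's variables, the call would be `top_k_in_order(sentences, scores, k)` for some chosen `k`.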