# Extractive Text Summarizer

## Text Summarization with TF-IDF

+ High-level outline
+ Split the document into sentences
+ Score each sentence
+ Rank the sentences by those scores
+ Summary = Top scoring sentences

## Approach

+ Sentence tokenization (splitting the document into sentences) with NLTK
+ Build a TF-IDF matrix, treating each sentence as if it were a document
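
To make "each sentence is a document" concrete, here is a minimal pure-Python sketch of TF-IDF computed over a list of sentences. It assumes whitespace tokenization and the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` that scikit-learn documents as its default. The function name `tfidf_per_sentence` and the toy sentences are illustrative only; the notebook itself relies on `TfidfVectorizer`.

```python
import math
from collections import Counter


def tfidf_per_sentence(sentences):
    """Return one {term: weight} dict per sentence (unnormalized TF-IDF)."""
    tokenized = [s.lower().split() for s in sentences]
    n = len(tokenized)
    # document frequency: in how many sentences does each term appear?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # smoothed inverse document frequency
    idf = {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}
    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)
        rows.append({term: freq * idf[term] for term, freq in tf.items()})
    return rows


toy_sentences = [
    "retail sales fell in december",
    "retail sales rose in november",
    "analysts expect a rate cut",
]
rows = tfidf_per_sentence(toy_sentences)
# terms shared by several sentences ("retail") get a lower weight
# than terms unique to one sentence ("december")
print(rows[0]["retail"] < rows[0]["december"])
```

Because every sentence plays the role of a document, a word that recurs across many sentences of the same article is down-weighted, while a word that appears in only one sentence stands out.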

In [1]:
# importing packages

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import textwrap
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer


df = pd.read_csv('data/bbc_text.csv') # importing the CSV into a pandas DataFrame
df.head(10) # looking at the first ten rows

Unnamed: 0,text,labels
0,Ad sales boost Time Warner profit\n\nQuarterly...,business
1,Dollar gains on Greenspan speech\n\nThe dollar...,business
2,Yukos unit buyer faces loan claim\n\nThe owner...,business
3,High fuel prices hit BA's profits\n\nBritish A...,business
4,Pernod takeover talk lifts Domecq\n\nShares in...,business
5,Japan narrowly escapes recession\n\nJapan's ec...,business
6,Jobs growth still slow in the US\n\nThe US cre...,business
7,"India calls for fair trade rules\n\nIndia, whi...",business
8,Ethiopia's crop production up 24%\n\nEthiopia ...,business
9,Court rejects $280bn tobacco case\n\nA US gove...,business


In [2]:
# understanding the landscape quickly before starting the project

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2225 entries, 0 to 2224
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    2225 non-null   object
 1   labels  2225 non-null   object
dtypes: object(2)
memory usage: 34.9+ KB


In [3]:
doc = df[df['labels'] == 'business']['text'].sample(random_state = 42) # keeping only the business articles and drawing one random sample

In [4]:
# using textwrap to make the text more readable

def wrap(x):
    return textwrap.fill(x, replace_whitespace = False, fix_sentence_endings = True)
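
One caveat, noted in the `textwrap` documentation: with `replace_whitespace=False`, embedded newlines can end up in the middle of a wrapped line and produce oddly placed line breaks. A possible alternative, sketched here on the assumption that paragraphs are separated by blank lines, wraps each paragraph on its own. The name `wrap_paragraphs` is hypothetical and not part of the notebook.

```python
import textwrap


def wrap_paragraphs(text, width=70):
    """Wrap each blank-line-separated paragraph separately."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    wrapped = [
        textwrap.fill(p, width=width, fix_sentence_endings=True)
        for p in paragraphs
    ]
    return "\n\n".join(wrapped)


sample = "A short title\n\n" + "word " * 30 + "\n\n" + "Second paragraph. " * 6
print(wrap_paragraphs(sample, width=40))
```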

In [5]:
print(wrap(doc.iloc[0])) # printing out the sampled article

Christmas sales worst since 1981

UK retail sales fell in December,
failing to meet expectations and making it by some counts the worst
Christmas since 1981.

Retail sales dropped by 1% on the month in
December, after a 0.6% rise in November, the Office for National
Statistics (ONS) said.  The ONS revised the annual 2004 rate of growth
down from the 5.9% estimated in November to 3.2%. A number of
retailers have already reported poor figures for December.  Clothing
retailers and non-specialist stores were the worst hit with only
internet retailers showing any significant growth, according to the
ONS.

The last time retailers endured a tougher Christmas was 23 years
previously, when sales plunged 1.7%.

The ONS echoed an earlier
caution from Bank of England governor Mervyn King not to read too much
into the poor December figures.  Some analysts put a positive gloss on
the figures, pointing out that the non-seasonally-adjusted figures
showed a performance comparable with 2003. The Novembe

In [6]:
sents = nltk.sent_tokenize(doc.iloc[0].split('\n', 1)[1]) # dropping the title line, then splitting the body into sentences
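
The articles shown above store the headline on the first line, so `split('\n', 1)` separates the title from the body before sentence tokenization. A small illustration on an abbreviated article string:

```python
# abbreviated article in the same "title, blank line, body" layout as the dataset
article = "Christmas sales worst since 1981\n\nUK retail sales fell in December."

# maxsplit=1 splits only at the first newline, leaving the body intact
title, body = article.split("\n", 1)

print(repr(title))  # 'Christmas sales worst since 1981'
print(repr(body))   # '\nUK retail sales fell in December.'
```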

In [7]:
featurizer = TfidfVectorizer(stop_words = stopwords.words('english'), norm = 'l1') # creating the featurizer: removes English stop words and L1-normalizes each row

In [8]:
X = featurizer.fit_transform(sents) # fitting to the data and then transforming it

In [9]:
# writing a function to score the sentence given its tfidf representation

def get_sentence_score(tfidf_row):
    # return the average of the non-zero values
    # of the tf-idf vector representation of a sentence
    x = tfidf_row[tfidf_row != 0] 
    return x.mean()
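
A point worth noting about this score: because the vectorizer above uses `norm='l1'`, the non-negative TF-IDF weights in each row sum to 1, so the mean of the non-zero entries reduces to 1 divided by the number of distinct terms kept in the sentence. The score therefore tends to favor sentences with few non-stop-word terms. The sketch below checks this arithmetic on plain Python lists; the helper names and toy weights are illustrative assumptions, not part of the notebook.

```python
def l1_normalize(weights):
    """Scale non-negative weights so they sum to 1 (no-op for an all-zero row)."""
    total = sum(weights)
    return [w / total for w in weights] if total else list(weights)


def mean_nonzero(row):
    """Average of the non-zero entries, mirroring get_sentence_score."""
    nonzero = [v for v in row if v != 0]
    return sum(nonzero) / len(nonzero)


# hypothetical raw TF-IDF weights: 4 distinct terms vs. 8 distinct terms
short_row = l1_normalize([2.1, 0.0, 1.3, 0.0, 1.7, 0.9, 0.0, 0.0, 0.0, 0.0])
long_row = l1_normalize([1.2, 2.0, 1.1, 0.7, 1.9, 0.0, 1.4, 1.6, 0.8, 0.0])

print(round(mean_nonzero(short_row), 3))  # 0.25, i.e. 1 / 4
print(round(mean_nonzero(long_row), 3))   # 0.125, i.e. 1 / 8
```

If sentence length should not dominate the ranking, one option to experiment with is a different `norm` setting or a different aggregation than the mean.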

In [10]:
# computing the scores for each sentence and storing it in a variable

scores = np.zeros(len(sents))

# looping over every sentence index

for i in range(len(sents)):
    score = get_sentence_score(X[i,:]) # scoring the TF-IDF row of the i-th sentence
    scores[i] = score # storing the score for the i-th sentence

In [11]:
# sorting the sentence indices by score
# negating the scores makes argsort return them in descending rather than ascending order

sort_idx = np.argsort(-scores)

## Many options for how to choose which sentences to include:

+ 1. top N sentences
+ 2. top N words or characters
+ 3. top X% of sentences or top X% of words
+ 4. sentences with scores > average score
+ 5. sentences with scores > factor * average score
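
Some of these options can be sketched as small helpers that take the sentence scores (any sequence of floats) and return the indices of the chosen sentences. The function names, the default `factor`, and the toy scores are assumptions for illustration; the notebook itself uses option 1 with N = 5.

```python
def select_top_n(scores, n):
    """Option 1: indices of the n highest-scoring sentences, best first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:n]


def select_above_average(scores, factor=1.0):
    """Options 4 and 5: indices whose score exceeds factor * mean score."""
    threshold = factor * sum(scores) / len(scores)
    return [i for i, s in enumerate(scores) if s > threshold]


toy_scores = [0.05, 0.14, 0.08, 0.12, 0.06]
print(select_top_n(toy_scores, 2))            # [1, 3]
print(select_above_average(toy_scores))       # [1, 3]
print(select_above_average(toy_scores, 1.5))  # [1]
```

A threshold-based rule lets the summary length adapt to the article, whereas a fixed N always yields the same number of sentences.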

In [12]:
# printing out an extractive summary
# note: the sentences are printed in descending score order, not in their original order
# news articles tend to lead with specifics and generalize afterwards, so re-sorting the
# selected indices into document order would make the summary read more naturally

print('Generated Summary:')
for i in sort_idx[:5]:
    print(wrap("%.2f %s" % (scores[i], sents[i])))

Generated Summary:
0.14 A number of retailers have already reported poor figures for
December.
0.13 However, reports from some High Street retailers highlight the
weakness of the sector.
0.12 The ONS revised the annual 2004 rate of growth down from the 5.9%
estimated in November to 3.2%.
0.10 "Our view is the Bank of England will keep its powder dry and
wait to see the big picture."
0.10 And a British Retail Consortium survey found that Christmas 2004
was the worst for 10 years.
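
The loop above prints the sentences in descending score order. If the summary should instead read in the article's original order, the selected indices can be sorted before printing. A minimal sketch, with toy data standing in for the notebook's `sents` and `sort_idx`:

```python
# toy stand-ins for the notebook's `sents` and `sort_idx`
toy_sents = ["First sentence.", "Second sentence.", "Third sentence.", "Fourth sentence."]
toy_sort_idx = [2, 0, 3, 1]  # indices ranked from highest to lowest score

top_k = 2
# keep the top_k best-scoring sentences, then restore document order
chosen = sorted(toy_sort_idx[:top_k])
summary = [toy_sents[i] for i in chosen]

print(summary)  # ['First sentence.', 'Third sentence.']
```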


In [13]:
doc.iloc[0].split('\n', 1)[0] # printing out the title of the article

'Christmas sales worst since 1981'

In [14]:
# creating a function to do all of the steps to perform an extractive summary

def summarize(text):
    sents = nltk.sent_tokenize(text) # extract sentences
    X = featurizer.fit_transform(sents) # perform tf-idf
    scores = np.zeros(len(sents)) # allocate one score per sentence
    for i in range(len(sents)):
        score = get_sentence_score(X[i, :])
        scores[i] = score

    sort_idx = np.argsort(-scores) # sort the sentence indices by descending score
    for i in sort_idx[:5]: # take the five highest-scoring sentences
        print(wrap('%.2f: %s' % (scores[i], sents[i]))) # print the score followed by the sentence

In [15]:
# keeping only the entertainment articles, drawing a random sample, and running the summarize function to see how it performs

doc = df[df['labels'] == 'entertainment']['text'].sample(random_state = 123)
summarize(doc.iloc[0].split('\n', 1)[1])

0.11: The Black Eyed Peas won awards for best R 'n' B video and
sexiest video, both for Hey Mama.
0.10: The ceremony was held at the Luna Park fairground in Sydney
Harbour and was hosted by the Osbourne family.
0.10: Goodrem, Green Day and the Black Eyed Peas took home two awards
each.
0.10: Other winners included Green Day, voted best group, and the
Black Eyed Peas.
0.10: The VH1 First Music Award went to Cher honouring her
achievements within the music industry.


## Observations:

+ 1. The summary does a good job of informing the reader of the purpose of the article
+ 2. However, there are some key details left out, such as: What event was this? Where was it?

In [16]:
# getting the title from the article

doc.iloc[0].split('\n', 1)[0]

'Goodrem wins top female MTV prize'

In [17]:
# getting the whole article

print(wrap(doc.iloc[0])) 

Goodrem wins top female MTV prize

Pop singer Delta Goodrem has
scooped one of the top individual prizes at the first Australian MTV
Music Awards.

The 21-year-old singer won the award for best female
artist, with Australian Idol runner-up Shannon Noll taking the title
of best male at the ceremony.  Goodrem, known in both Britain and
Australia for her role as Nina Tucker in TV soap Neighbours, also
performed a duet with boyfriend Brian McFadden.  Other winners
included Green Day, voted best group, and the Black Eyed Peas.
Goodrem, Green Day and the Black Eyed Peas took home two awards each.
As well as best female, Goodrem also took home the Pepsi Viewers
Choice Award, whilst Green Day bagged the prize for best rock video
for American Idiot.  The Black Eyed Peas won awards for best R 'n' B
video and sexiest video, both for Hey Mama.  Local singer and
songwriter Missy Higgins took the title of breakthrough artist of the
year, with Australian Idol winner Guy Sebastian taking the honours f

### We now know that the event was the first Australian MTV Music Awards

### Next steps: Attempt an extractive summary using TextRank
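
As a starting point for that next step, here is a rough sketch of the TextRank idea: build a graph whose nodes are sentences and whose edge weights are sentence similarities, then run PageRank-style power iteration to score each sentence. This version makes simplifying assumptions that should be revisited: whitespace tokenization, Jaccard word overlap as the similarity measure, a damping factor of 0.85, and a fixed number of iterations. All names are hypothetical.

```python
def jaccard(a, b):
    """Word-overlap similarity between two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def textrank_scores(sentences, damping=0.85, iterations=50):
    """Score sentences by power iteration over a similarity graph."""
    tokens = [set(s.lower().split()) for s in sentences]
    n = len(tokens)
    # similarity matrix with no self-loops
    sim = [[jaccard(tokens[i], tokens[j]) if i != j else 0.0 for j in range(n)]
           for i in range(n)]
    # total outgoing weight of each node, used to normalize edge weights
    out_weight = [sum(row) for row in sim]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            # sentences with no similar neighbors contribute nothing here,
            # so an isolated sentence keeps only the baseline score
            rank = sum(
                sim[j][i] / out_weight[j] * scores[j]
                for j in range(n)
                if out_weight[j] > 0
            )
            new_scores.append((1 - damping) / n + damping * rank)
        scores = new_scores
    return scores


toy = [
    "retail sales fell in december",
    "december retail sales were weak",
    "the weather was mild",
]
scores = textrank_scores(toy)
# the two overlapping sentences outrank the unrelated one
print(scores[0] > scores[2] and scores[1] > scores[2])
```

Unlike the TF-IDF mean used above, this score depends on how strongly a sentence is connected to the rest of the article rather than on the weights of its own terms.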