# Text Summarization with Sumy

#### What is Sumy?

+ A simple library for extracting summaries from HTML pages or plain text
+ The package also contains a simple evaluation framework for text summaries

#### Goal

+ To build an extractive text summarizer that highlights the key points of an article
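
To make "extractive" concrete: an extractive summarizer selects existing sentences from the article instead of writing new ones. The sketch below is a minimal word-frequency scorer in plain Python. It is not how sumy's TextRank or LSA summarizers work. The names `split_sentences`, `extractive_summary`, and `STOPWORDS`, and the sample article, are hypothetical and exist only for illustration.

```python
import re
from collections import Counter

# Tiny, hypothetical stopword list; a real one would be much longer.
STOPWORDS = {"the", "was", "in", "at", "a", "of", "and"}


def tokenize(text):
    """Lowercase the text and keep alphabetic words that are not stopwords."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]


def split_sentences(text):
    """Naive sentence splitter: break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def extractive_summary(text, sentences_count=2):
    """Return the highest-scoring sentences, kept in their original order."""
    sentences = split_sentences(text)
    freq = Counter(tokenize(text))  # word frequencies over the whole text

    def score(sentence):
        words = tokenize(sentence)
        # Average frequency, so long sentences are not favored automatically.
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    # Rank sentence indices by score, keep the top ones, restore document order.
    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:sentences_count])
    return [sentences[i] for i in keep]


article = (
    "The awards ceremony was held in Sydney. "
    "Several artists won awards at the ceremony. "
    "The weather was mild."
)
summary = extractive_summary(article, sentences_count=2)
print(summary)
```

Every sentence in the result is copied verbatim from the input, which is the defining property of an extractive summary.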

#### Approach

+ Import libraries and data
+ Create a function that wraps the text so it is easier to read
+ Build the summarizer and parser using the sumy package
+ Add the summaries to the dataframe

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import textwrap

df = pd.read_csv('data/bbc_text.csv')
df.head(5)

Unnamed: 0,text,labels
0,Ad sales boost Time Warner profit\n\nQuarterly...,business
1,Dollar gains on Greenspan speech\n\nThe dollar...,business
2,Yukos unit buyer faces loan claim\n\nThe owner...,business
3,High fuel prices hit BA's profits\n\nBritish A...,business
4,Pernod takeover talk lifts Domecq\n\nShares in...,business


In [2]:
# looking at the data landscape and the data types

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2225 entries, 0 to 2224
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    2225 non-null   object
 1   labels  2225 non-null   object
dtypes: object(2)
memory usage: 34.9+ KB


In [3]:
# converting df['text'] from object dtype to pandas string dtype

df['text'] = df['text'].astype('string')

In [4]:
# keeping only the entertainment articles and grabbing a random sample

doc = df[df['labels'] == 'entertainment']['text'].sample(random_state = 123)

In [5]:
# using textwrap to make the text more visually appealing

def wrap(x):
    return textwrap.fill(x, replace_whitespace = False, fix_sentence_endings = True)
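
As a quick check of what `wrap` does, the sketch below runs it on a made-up sentence pair (the sample text is hypothetical). `textwrap.fill` breaks lines at the default width of 70 characters, and `fix_sentence_endings = True` puts two spaces after a sentence-ending period that falls mid-line.

```python
import textwrap


def wrap(x):
    return textwrap.fill(x, replace_whitespace=False, fix_sentence_endings=True)


# Hypothetical sample text, long enough to need more than one line.
sample = (
    "The show ended. The crowd went home after a long evening of live "
    "music, awards and speeches at the fairground."
)
wrapped = wrap(sample)
print(wrapped)
```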

In [6]:
from sumy.summarizers.text_rank import TextRankSummarizer
from sumy.summarizers.lsa import LsaSummarizer
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer

In [7]:
summarizer = TextRankSummarizer() # assigning the TextRank summarizer to a variable
parser = PlaintextParser.from_string(doc.iloc[0].split('\n', 1)[1], Tokenizer('english')) # dropping the headline (first line) and passing the rest of the document through the parser
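
The `split('\n', 1)[1]` call above keeps everything after the first newline, which in this dataset appears to be the article body without its headline. It raises an `IndexError` if a document has no newline at all. A slightly more defensive sketch uses `str.partition`; the helper name `strip_headline` and the sample strings are hypothetical.

```python
def strip_headline(raw):
    """Drop the first line (assumed to be the headline) and return the body."""
    headline, sep, body = raw.partition("\n")
    # No newline at all: treat the whole string as the body.
    return body.strip() if sep else raw.strip()


with_headline = "Some headline\n\nFirst sentence of the body."
without_newline = "Only one line of text."

print(strip_headline(with_headline))
print(strip_headline(without_newline))
```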

In [8]:
summary = summarizer(parser.document, sentences_count = 5) # summarizing to 5 sentences

In [9]:
summary # displaying the summary (a tuple of Sentence objects)

(<Sentence: The 21-year-old singer won the award for best female artist, with Australian Idol runner-up Shannon Noll taking the title of best male at the ceremony.>,
 <Sentence: As well as best female, Goodrem also took home the Pepsi Viewers Choice Award, whilst Green Day bagged the prize for best rock video for American Idiot.>,
 <Sentence: The Black Eyed Peas won awards for best R 'n' B video and sexiest video, both for Hey Mama.>,
 <Sentence: Local singer and songwriter Missy Higgins took the title of breakthrough artist of the year, with Australian Idol winner Guy Sebastian taking the honours for best pop video.>,
 <Sentence: The ceremony was held at the Luna Park fairground in Sydney Harbour and was hosted by the Osbourne family.>)

In [10]:
# after applying the wrap function, here is how it looks

for s in summary:
    print(wrap(str(s)))

The 21-year-old singer won the award for best female artist, with
Australian Idol runner-up Shannon Noll taking the title of best male
at the ceremony.
As well as best female, Goodrem also took home the Pepsi Viewers
Choice Award, whilst Green Day bagged the prize for best rock video
for American Idiot.
The Black Eyed Peas won awards for best R 'n' B video and sexiest
video, both for Hey Mama.
Local singer and songwriter Missy Higgins took the title of
breakthrough artist of the year, with Australian Idol winner Guy
Sebastian taking the honours for best pop video.
The ceremony was held at the Luna Park fairground in Sydney Harbour
and was hosted by the Osbourne family.


In [11]:
doc = df[df['labels'] == 'entertainment']['text'].sample(random_state = 123) # getting a random document under the entertainment label
summarizer = LsaSummarizer()
parser = PlaintextParser.from_string(doc.iloc[0].split('\n', 1)[1], Tokenizer('english')) # passing first document through the parser
summary = summarizer(parser.document, sentences_count = 5) # summarizing to 5 sentences

for s in summary:
    print(wrap(str(s))) # printing out the summary

Goodrem, known in both Britain and Australia for her role as Nina
Tucker in TV soap Neighbours, also performed a duet with boyfriend
Brian McFadden.
Other winners included Green Day, voted best group, and the Black Eyed
Peas.
Goodrem, Green Day and the Black Eyed Peas took home two awards each.
As well as best female, Goodrem also took home the Pepsi Viewers
Choice Award, whilst Green Day bagged the prize for best rock video
for American Idiot.
Artists including Carmen Electra, Missy Higgins, Kelly Osbourne, Green
Day, Ja Rule and Natalie Imbruglia gave live performances at the
event.
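
The TextRank and LSA summaries of the same article overlap only partly. One simple way to quantify that, sketched here, is the Jaccard similarity of the two sentence sets. The helper `summary_overlap` is a hypothetical name, and the sentences are copied from the two outputs above.

```python
def summary_overlap(a, b):
    """Return the shared sentences and the Jaccard similarity of two summaries."""
    set_a, set_b = set(a), set(b)
    shared = set_a & set_b
    union = set_a | set_b
    return shared, (len(shared) / len(union) if union else 0.0)


textrank_summary = [
    "The 21-year-old singer won the award for best female artist, with Australian Idol runner-up Shannon Noll taking the title of best male at the ceremony.",
    "As well as best female, Goodrem also took home the Pepsi Viewers Choice Award, whilst Green Day bagged the prize for best rock video for American Idiot.",
    "The Black Eyed Peas won awards for best R 'n' B video and sexiest video, both for Hey Mama.",
    "Local singer and songwriter Missy Higgins took the title of breakthrough artist of the year, with Australian Idol winner Guy Sebastian taking the honours for best pop video.",
    "The ceremony was held at the Luna Park fairground in Sydney Harbour and was hosted by the Osbourne family.",
]

lsa_summary = [
    "Goodrem, known in both Britain and Australia for her role as Nina Tucker in TV soap Neighbours, also performed a duet with boyfriend Brian McFadden.",
    "Other winners included Green Day, voted best group, and the Black Eyed Peas.",
    "Goodrem, Green Day and the Black Eyed Peas took home two awards each.",
    "As well as best female, Goodrem also took home the Pepsi Viewers Choice Award, whilst Green Day bagged the prize for best rock video for American Idiot.",
    "Artists including Carmen Electra, Missy Higgins, Kelly Osbourne, Green Day, Ja Rule and Natalie Imbruglia gave live performances at the event.",
]

shared, jaccard = summary_overlap(textrank_summary, lsa_summary)
print(len(shared), round(jaccard, 3))
```

Only one sentence appears in both summaries, so the two algorithms emphasize noticeably different parts of this article.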


In [12]:
# storing the text column in a variable to pass to the plain-text parser

text = df['text']

In [13]:
# creating a function that summarizes the text and returns the sentences as a list that can be added to the dataframe

def summarize(text):
    summarizer = LsaSummarizer() # assigning the summarizer to a variable
    parser = PlaintextParser.from_string(text, Tokenizer('english')) # parsing and tokenizing the text string
    summary = summarizer(parser.document, sentences_count = 6) # summarizing the document to 6 sentences

    return [wrap(str(s)) for s in summary] # returning the summary as a list of wrapped sentences

In [14]:
# applying the summarize function to every article in the dataset to get a summary for each one

df['summary'] = df['text'].apply(summarize)
df.head(5)

Unnamed: 0,text,labels,summary
0,Ad sales boost Time Warner profit Quarterly p...,business,"[The firm, which is now one of the biggest inv..."
1,Dollar gains on Greenspan speech The dollar h...,business,[And Alan Greenspan highlighted the US governm...
2,Yukos unit buyer faces loan claim The owners ...,business,[The owners of embattled Russian oil giant Yuk...
3,High fuel prices hit BA's profits British Air...,business,[British Airways has blamed high fuel prices f...
4,Pernod takeover talk lifts Domecq Shares in U...,business,[Shares in UK drinks and food firm Allied Dome...


In [15]:
# converting df['summary'] from object dtype (lists) to pandas string dtype

df['summary'] = df['summary'].astype('string')
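
Converting a column of lists to strings appears to store each list's printed representation, with brackets, quotes, and escaped `\n` sequences. That is why the summaries need cleaning afterwards. The sketch below shows the effect with plain `str()` on a made-up list, next to a hypothetical alternative that joins the sentences into a single string.

```python
# Hypothetical wrapped sentences, standing in for one row of df['summary'].
sentences = ["It's the first sentence.", "The second one\nwraps."]

# What stringifying the list produces: brackets, quotes, and a literal backslash-n.
as_repr = str(sentences)
print(as_repr)

# Alternative: flatten the line breaks and join the sentences with spaces.
as_text = " ".join(s.replace("\n", " ") for s in sentences)
print(as_text)
```

Joining the sentences before storing them would avoid most of the character cleanup, at the cost of losing the per-sentence list structure.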

In [17]:
import re
import string

# creating a function to clean the summary data by removing unnecessary characters

def clean_data(text):
    text = text.lower()
    text = text.strip().replace("\\n", " ") # replacing literal "\n" sequences with a space
    text = text.strip().replace("\',", " ") # replacing ', (closing quote plus comma) with a space
    text = text.strip().replace("\''", "") # removing doubled single quotation marks
    text = text.strip().replace("'", "") # removing the remaining single quotation marks
    text = text.strip().replace('"', "") # removing double quotation marks
    return text

df['clean'] = df['summary'].apply(clean_data) # applying the function on the summary column
df.head(5) # taking a look at the first five rows of data

Unnamed: 0,text,labels,summary,clean
0,Ad sales boost Time Warner profit Quarterly p...,business,"['The firm, which is now one of the biggest in...","[the firm, which is now one of the biggest inv..."
1,Dollar gains on Greenspan speech The dollar h...,business,"[""And Alan Greenspan highlighted the US govern...",[and alan greenspan highlighted the us governm...
2,Yukos unit buyer faces loan claim The owners ...,business,['The owners of embattled Russian oil giant Yu...,[the owners of embattled russian oil giant yuk...
3,High fuel prices hit BA's profits British Air...,business,['British Airways has blamed high fuel prices ...,[british airways has blamed high fuel prices f...
4,Pernod takeover talk lifts Domecq Shares in U...,business,"[""Shares in UK drinks and food firm Allied Dom...",[shares in uk drinks and food firm allied dome...
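
To see what each replacement in `clean_data` does, the sketch below runs the same steps on a small made-up stringified list. The sample input and the final `tidy` step are hypothetical additions. The square brackets survive the cleaning, which matches the `clean` column shown above.

```python
import re


def clean_data(text):
    text = text.lower()
    text = text.strip().replace("\\n", " ")  # literal backslash-n -> space
    text = text.strip().replace("',", " ")   # closing quote plus comma -> space
    text = text.strip().replace("''", "")    # doubled single quotes removed
    text = text.strip().replace("'", "")     # remaining single quotes removed
    text = text.strip().replace('"', "")     # double quotes removed
    return text


# Hypothetical input: a list of wrapped sentences converted to a string.
raw = str(["First line\nwraps here.", "Second one."])
cleaned = clean_data(raw)
print(cleaned)

# Optional extra step (an assumption, not in the original): drop the brackets
# and collapse repeated spaces.
tidy = re.sub(r"\s+", " ", cleaned.strip("[]"))
print(tidy)
```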


In [18]:
# saving the dataframe to a csv file

df.to_csv('sumy_summarizer.csv')
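
By default `to_csv` also writes the row index as an unnamed first column, which comes back as `Unnamed: 0` when the file is read again. If that column is not wanted, passing `index=False` is one option. The sketch below uses an in-memory buffer and a tiny made-up dataframe instead of the real file.

```python
import io

import pandas as pd

# Tiny hypothetical dataframe standing in for the real one.
df = pd.DataFrame({"text": ["a", "b"], "labels": ["x", "y"]})

# Default behaviour: the index is written as an extra, unnamed column.
buf_with_index = io.StringIO()
df.to_csv(buf_with_index)
buf_with_index.seek(0)
cols_with = pd.read_csv(buf_with_index).columns.tolist()

# With index=False only the real columns are written.
buf_without_index = io.StringIO()
df.to_csv(buf_without_index, index=False)
buf_without_index.seek(0)
cols_without = pd.read_csv(buf_without_index).columns.tolist()

print(cols_with)
print(cols_without)
```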