# Text Summarization with Sumy

#### What is Sumy?

+ Sumy is a simple library for extracting summaries from HTML pages or plain text
+ The package also includes a simple evaluation framework for text summaries

#### Goal

+ To build an extractive text summarizer highlighting the key points of the article
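
As a rough illustration of what "extractive" means, the sketch below scores each sentence by the average frequency of its words and returns the top-scoring sentences in their original order. It is a simplified stand-in, not how sumy's TextRank or LSA summarizers work. The function name `naive_extractive_summary` and the regex-based sentence splitting are assumptions made for this example.

```python
import re
from collections import Counter


def naive_extractive_summary(text, sentences_count=2):
    """Pick the highest-scoring sentences and return them in document order."""
    # Naive sentence split on ., ! or ? followed by whitespace
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # Word frequencies over the whole text
    freq = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence):
        tokens = re.findall(r"[a-z']+", sentence.lower())
        return sum(freq[t] for t in tokens) / max(len(tokens), 1)

    # Rank sentence indices by score, keep the best ones, restore original order
    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:sentences_count])
    return [sentences[i] for i in keep]


sample = "Cats sleep a lot. Cats and dogs are popular pets. The weather is nice today."
for sentence in naive_extractive_summary(sample, sentences_count=2):
    print(sentence)
```

The key property of an extractive summary is visible here: every returned sentence is copied verbatim from the source text, nothing is rewritten.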

#### Approach

+ Import libraries and data
+ Create a function to wrap the text so it is easier to read
+ Build the summarizer and parser using the sumy package
+ Add the summaries to the dataframe
+ Clean the data by removing unwanted or unnecessary characters
+ Using KeyBERT, extract keywords from the clean summary


# VADER

### What is VADER?

+ VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media. (All credits go to https://github.com/cjhutto/vaderSentiment)

### How is it scored?

+ The compound score is computed by summing the valence scores of each word in the lexicon, adjusted according to the rules, and then normalized to be between -1 (most extreme negative) and +1 (most extreme positive). This is the most useful metric if you want a single unidimensional measure of sentiment for a given sentence. Calling it a 'normalized, weighted composite score' is accurate. (All credits go to https://github.com/cjhutto/vaderSentiment)
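
The normalization step can be sketched as follows. To my understanding, the reference vaderSentiment implementation divides the summed valence `x` by `sqrt(x^2 + alpha)` with `alpha = 15`. Treat the constant and the function name `normalize` here as assumptions to check against the linked repository, and note that the rule-based adjustments applied before this step are not shown.

```python
import math


def normalize(score, alpha=15):
    """Map an unbounded summed valence score into the range [-1, 1]."""
    norm = score / math.sqrt(score * score + alpha)
    # Clamp as a safeguard against floating point overshoot
    return max(-1.0, min(1.0, norm))


for raw in (-8.0, -1.0, 0.0, 1.0, 8.0):
    print(raw, round(normalize(raw), 4))
```

Because the denominator grows with the score, large positive or negative sums approach +1 or -1 without ever exceeding them, while a sum of zero stays at zero.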

# KeyBERT

### What is KeyBERT?

+ KeyBERT is a minimal and easy-to-use keyword extraction technique that leverages BERT embeddings to create keywords and keyphrases that are most similar to a document. (All credits go to https://github.com/MaartenGr/KeyBERT)

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import textwrap

df = pd.read_csv('data/bbc_text.csv')
df.head(5)

Unnamed: 0,text,labels
0,Ad sales boost Time Warner profit\n\nQuarterly...,business
1,Dollar gains on Greenspan speech\n\nThe dollar...,business
2,Yukos unit buyer faces loan claim\n\nThe owner...,business
3,High fuel prices hit BA's profits\n\nBritish A...,business
4,Pernod takeover talk lifts Domecq\n\nShares in...,business


In [2]:
# looking at the data landscape and the data types

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2225 entries, 0 to 2224
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    2225 non-null   object
 1   labels  2225 non-null   object
dtypes: object(2)
memory usage: 34.9+ KB


In [3]:
# converting df['text'] from object to string dtype

df['text'] = df['text'].astype('string')

In [4]:
# keeping only the entertainment articles and grabbing a random sample

doc = df[df['labels'] == 'entertainment']['text'].sample(random_state = 123)

In [5]:
# using textwrap to make the text more visually appealing

def wrap(x):
    return textwrap.fill(x, replace_whitespace = False, fix_sentence_endings = True)
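
A small standalone check of what `wrap` does, using one sentence from the article summarized later in this notebook. The explicit `width` parameter is an addition for illustration; 70 is the default width used by `textwrap.fill`.

```python
import textwrap


def wrap(x, width=70):
    # replace_whitespace=False keeps existing newlines and tabs as they are
    # fix_sentence_endings=True tries to put two spaces after sentence endings
    return textwrap.fill(x, width=width, replace_whitespace=False, fix_sentence_endings=True)


sample = ("The ceremony was held at the Luna Park fairground in Sydney Harbour "
          "and was hosted by the Osbourne family.")
wrapped = wrap(sample)
print(wrapped)
```

Each output line is at most `width` characters long, and the words themselves are unchanged. Only the line breaks differ.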

In [6]:
from sumy.summarizers.text_rank import TextRankSummarizer
from sumy.summarizers.lsa import LsaSummarizer
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer

In [7]:
summarizer = TextRankSummarizer() # assigning Textrank summarizer to variable
parser = PlaintextParser.from_string(doc.iloc[0].split('\n', 1)[1], Tokenizer('english')) # passing the first document through the parser

In [8]:
summary = summarizer(parser.document, sentences_count = 5) # summarizing to 5 sentences

In [9]:
summary # printing out the summary

(<Sentence: The 21-year-old singer won the award for best female artist, with Australian Idol runner-up Shannon Noll taking the title of best male at the ceremony.>,
 <Sentence: As well as best female, Goodrem also took home the Pepsi Viewers Choice Award, whilst Green Day bagged the prize for best rock video for American Idiot.>,
 <Sentence: The Black Eyed Peas won awards for best R 'n' B video and sexiest video, both for Hey Mama.>,
 <Sentence: Local singer and songwriter Missy Higgins took the title of breakthrough artist of the year, with Australian Idol winner Guy Sebastian taking the honours for best pop video.>,
 <Sentence: The ceremony was held at the Luna Park fairground in Sydney Harbour and was hosted by the Osbourne family.>)

In [10]:
# after adding the wrap function, here is how it looks 

for s in summary:
    print(wrap(str(s))) # printing out the same summary as above, but using the wrap definition to make it more readable

The 21-year-old singer won the award for best female artist, with
Australian Idol runner-up Shannon Noll taking the title of best male
at the ceremony.
As well as best female, Goodrem also took home the Pepsi Viewers
Choice Award, whilst Green Day bagged the prize for best rock video
for American Idiot.
The Black Eyed Peas won awards for best R 'n' B video and sexiest
video, both for Hey Mama.
Local singer and songwriter Missy Higgins took the title of
breakthrough artist of the year, with Australian Idol winner Guy
Sebastian taking the honours for best pop video.
The ceremony was held at the Luna Park fairground in Sydney Harbour
and was hosted by the Osbourne family.


In [11]:
doc = df[df['labels'] == 'entertainment']['text'].sample(random_state = 123) # getting a random document under the entertainment label
summarizer = LsaSummarizer()
parser = PlaintextParser.from_string(doc.iloc[0].split('\n', 1)[1], Tokenizer('english')) # passing first document through the parser
summary = summarizer(parser.document, sentences_count = 5) # summarizing to 5 sentences

for s in summary:
    print(wrap(str(s))) # printing out the summary

Goodrem, known in both Britain and Australia for her role as Nina
Tucker in TV soap Neighbours, also performed a duet with boyfriend
Brian McFadden.
Other winners included Green Day, voted best group, and the Black Eyed
Peas.
Goodrem, Green Day and the Black Eyed Peas took home two awards each.
As well as best female, Goodrem also took home the Pepsi Viewers
Choice Award, whilst Green Day bagged the prize for best rock video
for American Idiot.
Artists including Carmen Electra, Missy Higgins, Kelly Osbourne, Green
Day, Ja Rule and Natalie Imbruglia gave live performances at the
event.


In [12]:
# adding the text into a variable to pass in the plain text parser

text = df['text']

In [13]:
# creating a function that summarizes the text, and then stores into a list that can be brought into the dataframe

def summarize(text):
    summarizer = LsaSummarizer() # assigning the summarizer to a variable
    parser = PlaintextParser(text, Tokenizer('english')) # passing the text as string through and tokenizing 
    summary = summarizer(parser.document, sentences_count = 6) # summarizing document to 6 sentences
    return [wrap(str(s)) for s in summary] # return the summary in a list

In [14]:
# now running the summarize function to loop through the dataset and provide a summary on each article

df['summary'] = df['text'].apply(summarize)
df.head(5)

Unnamed: 0,text,labels,summary
0,Ad sales boost Time Warner profit Quarterly p...,business,"[The firm, which is now one of the biggest inv..."
1,Dollar gains on Greenspan speech The dollar h...,business,[And Alan Greenspan highlighted the US governm...
2,Yukos unit buyer faces loan claim The owners ...,business,[The owners of embattled Russian oil giant Yuk...
3,High fuel prices hit BA's profits British Air...,business,[British Airways has blamed high fuel prices f...
4,Pernod takeover talk lifts Domecq Shares in U...,business,[Shares in UK drinks and food firm Allied Dome...


In [15]:
# converting df['summary'] (a list of sentences per article) to string dtype

df['summary'] = pd.Series(df['summary'], dtype = 'string')

In [16]:
# texts = df['summary']

In [17]:
# using a dictionary to map unwanted and unnecessary characters to a space or an empty string

def replace_text(text: str) -> str:
    replacement_map = {
        "\\n": " ",
        "\'": " ",
        " , ": "",
        '",': "",
        "[": "",
        "]": "",
        '"': "",
    }

    text = text.lower() # lowercase once; the replacements do not change letter case
    for find, new in replacement_map.items():
        text = text.replace(find, new) # find and replace with the dictionary above
    return text.strip()

In [18]:
df['clean'] = df['summary'].apply(replace_text) # applying the function

In [19]:
df['clean'][0] # accessing the first clean summary in the column

'the firm, which is now one of the biggest investors in google, benefited from sales of high-speed internet connections and higher advert sales. its profits were buoyed by one-off gains which offset a profit dip at warner bros, and less users for aol. time warner said on friday that it now owns 8% of search-engine google. but its own internet business, aol, had has mixed fortunes.time warner s fourth quarter profits were slightly better than analysts  expectations.  for the full-year, timewarner posted a profit of $3.36bn, up 27% from its 2003 performance, while revenues grew 6.4% to $42.09bn.'
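
The cleaned text above loses apostrophes and some spaces between sentences ("time warner s", "fortunes.time") because the list of sentences was first converted to its string representation and then stripped of quotes and brackets. One possible alternative, sketched below, is to join the sentence list directly. `join_summary` is a hypothetical helper, not part of this notebook, and it assumes the input is still a Python list of sentence strings as returned by `summarize`.

```python
def join_summary(sentences):
    """Join wrapped sentences into one lowercase string with single spaces."""
    # " ".join(s.split()) collapses the newlines inserted by textwrap
    # and any repeated spaces inside each sentence
    flattened = (" ".join(s.split()) for s in sentences)
    return " ".join(flattened).lower()


example = ["Time Warner's fourth quarter profits were\nslightly better.",
           "But its own internet business, AOL,  had mixed fortunes."]
print(join_summary(example))
```

With this approach there are no list brackets or quote characters to remove afterwards, so apostrophes inside words can be left intact.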

In [20]:
df.head(5)

Unnamed: 0,text,labels,summary,clean
0,Ad sales boost Time Warner profit Quarterly p...,business,"['The firm, which is now one of the biggest in...","the firm, which is now one of the biggest inve..."
1,Dollar gains on Greenspan speech The dollar h...,business,"[""And Alan Greenspan highlighted the US govern...",and alan greenspan highlighted the us governme...
2,Yukos unit buyer faces loan claim The owners ...,business,['The owners of embattled Russian oil giant Yu...,the owners of embattled russian oil giant yuko...
3,High fuel prices hit BA's profits British Air...,business,['British Airways has blamed high fuel prices ...,british airways has blamed high fuel prices fo...
4,Pernod takeover talk lifts Domecq Shares in U...,business,"[""Shares in UK drinks and food firm Allied Dom...",shares in uk drinks and food firm allied domec...


## Now bringing VADER into the mix to get the overall sentiment of the summary provided

#### Scoring will be based on the compound score

    sentiment = 'Neutral'
    if compound >= 0.3:
        sentiment = 'Positive'
    elif compound <= -0.3:
        sentiment = 'Negative'

## To end, the dataframe will have two additional columns created

1. The compound score
2. The sentiment based on that score (Positive, Negative, or Neutral)
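
The thresholding rule above can be isolated as a small pure function, which makes the boundary cases easy to check. The name `label_sentiment` and the `threshold` parameter are assumptions for this sketch; the 0.3 cut-off is the one used in this notebook.

```python
def label_sentiment(compound, threshold=0.3):
    """Map a VADER compound score in [-1, 1] to a sentiment label."""
    if compound >= threshold:
        return 'Positive'
    if compound <= -threshold:
        return 'Negative'
    return 'Neutral'


for score in (0.9404, -0.5499, 0.1, 0.3, -0.3):
    print(score, label_sentiment(score))
```

Scores exactly at 0.3 or -0.3 fall into the Positive and Negative classes respectively, because both comparisons are inclusive.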

In [21]:
# importing libraries for VADER

import nltk

# download the VADER lexicon
nltk.download("vader_lexicon")

# import the sentiment analyzer that uses the lexicon
from nltk.sentiment.vader import SentimentIntensityAnalyzer

sid = SentimentIntensityAnalyzer() # storing the sentiment analyzer

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\chaco\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


In [22]:
# pulling the first 250 characters of the first clean summary in the column

text = df['clean'][0][0:250]

In [23]:
# applying vader polarity scores to the sample text

sid.polarity_scores(text)

{'neg': 0.0, 'neu': 0.826, 'pos': 0.174, 'compound': 0.802}

In [24]:
# understanding the data landscape 

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2225 entries, 0 to 2224
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   text     2225 non-null   string
 1   labels   2225 non-null   object
 2   summary  2225 non-null   string
 3   clean    2225 non-null   object
dtypes: object(2), string(2)
memory usage: 69.7+ KB


In [25]:
# converting df['clean'] from object to string dtype
df['clean'] = df['clean'].astype('string')

In [26]:
# writing a function to get the compound score and sentiment of the summary in one row

def get_compound(row):
    scores = sid.polarity_scores(row['clean'])
    compound = scores['compound']

    sentiment = 'Neutral'
    if compound >= 0.3:
        sentiment = 'Positive'
    elif compound <= -0.3:
        sentiment = 'Negative'
    row['compound'] = compound
    row['sentiment'] = sentiment
    return row

df = df.apply(get_compound, axis = 1) # axis = 1 passes each row to the function

In [27]:
df.head(5)

Unnamed: 0,text,labels,summary,clean,compound,sentiment
0,Ad sales boost Time Warner profit\n\nQuarterly...,business,"['The firm, which is now one of the biggest in...","the firm, which is now one of the biggest inve...",0.9404,Positive
1,Dollar gains on Greenspan speech\n\nThe dollar...,business,"[""And Alan Greenspan highlighted the US govern...",and alan greenspan highlighted the us governme...,0.8122,Positive
2,Yukos unit buyer faces loan claim\n\nThe owner...,business,['The owners of embattled Russian oil giant Yu...,the owners of embattled russian oil giant yuko...,-0.5499,Negative
3,High fuel prices hit BA's profits\n\nBritish A...,business,['British Airways has blamed high fuel prices ...,british airways has blamed high fuel prices fo...,0.8625,Positive
4,Pernod takeover talk lifts Domecq\n\nShares in...,business,"[""Shares in UK drinks and food firm Allied Dom...",shares in uk drinks and food firm allied domec...,-0.4497,Negative


In [28]:
# looking at all different values in the sentiment category and the count per each value

df['sentiment'].value_counts()

Positive    1343
Negative     606
Neutral      276
Name: sentiment, dtype: int64

## KeyBERT

### How it works

1. Document embeddings are extracted with BERT to get document-level representation
2. Word embeddings are then extracted for N-gram words/phrases
3. Cosine similarity is then used to find the words that are most similar to the document

(All credits go to https://github.com/MaartenGr/KeyBERT)

In [29]:
from keybert import KeyBERT

bert = KeyBERT()

In [30]:
def keyword_extractor(text):
    # keyphrase_ngram_range : minimum and maximum number of words in an extracted keyphrase
    # use_mmr : use maximal marginal relevance when selecting keyphrases
    # diversity : when use_mmr is True, a higher value means less overlap between the extracted keyphrases
    # stop_words : remove common words that add little to no meaning to the sentence, e.g. the, it, is, are
    # top_n : number of keywords/keyphrases to return
    keywords = bert.extract_keywords(text, keyphrase_ngram_range = (3, 5), use_mmr = True, diversity = 0.6, stop_words = 'english', top_n = 10)
    # extract_keywords returns (keyphrase, score) pairs; keep only the keyphrases
    return [keyword for keyword, score in keywords]
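
To make the `use_mmr` and `diversity` options more concrete, here is a minimal sketch of maximal marginal relevance on precomputed similarities. It follows the general idea (reward similarity to the document, penalize similarity to phrases already chosen) and is not KeyBERT's actual code. The function name `mmr_select` and the toy similarity values are assumptions.

```python
def mmr_select(doc_sims, pair_sims, top_n=2, diversity=0.6):
    """Return indices of candidates chosen by maximal marginal relevance.

    doc_sims[i]     : similarity of candidate i to the document
    pair_sims[i][j] : similarity between candidates i and j
    """
    if not doc_sims:
        return []
    # Start with the candidate most similar to the document
    selected = [max(range(len(doc_sims)), key=lambda i: doc_sims[i])]
    remaining = [i for i in range(len(doc_sims)) if i != selected[0]]

    while remaining and len(selected) < top_n:
        def mmr_score(i):
            # Redundancy is the closest match among already selected candidates
            redundancy = max(pair_sims[i][j] for j in selected)
            return (1 - diversity) * doc_sims[i] - diversity * redundancy

        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected


doc_sims = [0.9, 0.85, 0.5]
pair_sims = [[1.0, 0.95, 0.1],
             [0.95, 1.0, 0.2],
             [0.1, 0.2, 1.0]]
print(mmr_select(doc_sims, pair_sims, diversity=0.6))
print(mmr_select(doc_sims, pair_sims, diversity=0.0))
```

With `diversity = 0.6` the near-duplicate second candidate is skipped in favour of the less similar third one. With `diversity = 0.0` the selection falls back to plain ranking by document similarity.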

In [31]:
df['keywords'] = df['clean'].apply(keyword_extractor)

In [32]:
df.head(5)

Unnamed: 0,text,labels,summary,clean,compound,sentiment,keywords
0,Ad sales boost Time Warner profit\n\nQuarterly...,business,"['The firm, which is now one of the biggest in...","the firm, which is now one of the biggest inve...",0.9404,Positive,"[time warner fourth quarter profits, owns sear..."
1,Dollar gains on Greenspan speech\n\nThe dollar...,business,"[""And Alan Greenspan highlighted the US govern...",and alan greenspan highlighted the us governme...,0.8122,Positive,"[chinese policy meantime federal reserve, high..."
2,Yukos unit buyer faces loan claim\n\nThe owner...,business,['The owners of embattled Russian oil giant Yu...,the owners of embattled russian oil giant yuko...,-0.5499,Negative,"[loan yukos owner menatep, yugansk secured ass..."
3,High fuel prices hit BA's profits\n\nBritish A...,business,['British Airways has blamed high fuel prices ...,british airways has blamed high fuel prices fo...,0.8625,Positive,"[ba estimated annual surcharge revenues, resul..."
4,Pernod takeover talk lifts Domecq\n\nShares in...,business,"[""Shares in UK drinks and food firm Allied Dom...",shares in uk drinks and food firm allied domec...,-0.4497,Negative,"[glenmorangie scotland premier whisky firms, d..."
