<a href="https://colab.research.google.com/github/aadityadamle/News-Summarization/blob/main/News_Summarization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

There are two types of text summarization: extractive and abstractive.

Extractive summarization selects the important parts of the text (here, sentences that contain frequent words) and joins them to form a summary. Abstractive summarization uses deep-learning techniques to identify the crucial information and paraphrase it into a more meaningful summary. We are going to look at the simple extractive approach, based on word frequencies.
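
As a rough orientation, the frequency-based extractive idea can be sketched in a few lines of plain Python. This is a minimal sketch under stated assumptions, not the notebook's pipeline: the function name `summarize`, the `top_n` parameter, and the naive regex sentence split are all illustrative choices, and no stop-word removal or stemming is applied.

```python
import re
from collections import Counter

def summarize(text, top_n=2):
    """Return the top_n highest-scoring sentences, in their original order."""
    # Naive split on ., ! or ? followed by whitespace (an assumption; a real
    # sentence tokenizer handles abbreviations and missing spaces better).
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # Count lowercase word occurrences across the whole text.
    counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    # Score each sentence by the summed frequency of its words.
    scores = [
        sum(counts[w] for w in re.findall(r"[a-z0-9]+", s.lower()))
        for s in sentences
    ]
    # Pick the indices of the best sentences, then restore document order.
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    best = sorted(ranked[:top_n])
    return " ".join(sentences[i] for i in best)

print(summarize("Cats sleep a lot. Cats and dogs play. Birds sing.", top_n=2))
```

Summing raw frequencies favours long sentences; dividing each score by the sentence length is one common adjustment.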

In [1]:
#Import libraries
import nltk
import pandas as pd
import re

In [2]:
#download tokenizer
nltk.download("punkt")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [3]:
#Download dataset
!wget https://raw.githubusercontent.com/aadityadamle/News-Summarization/main/news_summary.csv

--2021-04-14 19:57:00--  https://raw.githubusercontent.com/aadityadamle/News-Summarization/main/news_summary.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 11896415 (11M) [text/plain]
Saving to: ‘news_summary.csv’


2021-04-14 19:57:01 (39.1 MB/s) - ‘news_summary.csv’ saved [11896415/11896415]



Dataset source: https://www.kaggle.com/sunnysai12345/news-summary

In [4]:
#Read dataset
data = pd.read_csv("/content/news_summary.csv", encoding="latin-1")
data.head()

Unnamed: 0,author,date,headlines,read_more,text,ctext
0,Chhavi Tyagi,"03 Aug 2017,Thursday",Daman & Diu revokes mandatory Rakshabandhan in...,http://www.hindustantimes.com/india-news/raksh...,The Administration of Union Territory Daman an...,The Daman and Diu administration on Wednesday ...
1,Daisy Mowke,"03 Aug 2017,Thursday",Malaika slams user who trolled her for 'divorc...,http://www.hindustantimes.com/bollywood/malaik...,Malaika Arora slammed an Instagram user who tr...,"From her special numbers to TV?appearances, Bo..."
2,Arshiya Chopra,"03 Aug 2017,Thursday",'Virgin' now corrected to 'Unmarried' in IGIMS...,http://www.hindustantimes.com/patna/bihar-igim...,The Indira Gandhi Institute of Medical Science...,The Indira Gandhi Institute of Medical Science...
3,Sumedha Sehra,"03 Aug 2017,Thursday",Aaj aapne pakad liya: LeT man Dujana before be...,http://indiatoday.intoday.in/story/abu-dujana-...,Lashkar-e-Taiba's Kashmir commander Abu Dujana...,Lashkar-e-Taiba's Kashmir commander Abu Dujana...
4,Aarushi Maheshwari,"03 Aug 2017,Thursday",Hotel staff to get training to spot signs of s...,http://indiatoday.intoday.in/story/sex-traffic...,Hotels in Maharashtra will train their staff t...,Hotels in Mumbai and other Indian cities are t...


In [5]:
#Examine dataset and columns
print(data.shape)
data.columns

(4514, 6)


Index(['author', 'date', 'headlines', 'read_more', 'text', 'ctext'], dtype='object')

In [6]:
#Check for na values
print(data.isna().sum())

author         0
date           0
headlines      0
read_more      0
text           0
ctext        118
dtype: int64


In [7]:
#Delete rows with na values
data = data.dropna()
data.shape

(4396, 6)

In [8]:
#Select useful columns
data = data[["headlines","text","ctext"]]
print(data)

                                              headlines  ...                                              ctext
0     Daman & Diu revokes mandatory Rakshabandhan in...  ...  The Daman and Diu administration on Wednesday ...
1     Malaika slams user who trolled her for 'divorc...  ...  From her special numbers to TV?appearances, Bo...
2     'Virgin' now corrected to 'Unmarried' in IGIMS...  ...  The Indira Gandhi Institute of Medical Science...
3     Aaj aapne pakad liya: LeT man Dujana before be...  ...  Lashkar-e-Taiba's Kashmir commander Abu Dujana...
4     Hotel staff to get training to spot signs of s...  ...  Hotels in Mumbai and other Indian cities are t...
...                                                 ...  ...                                                ...
4509  Rasna seeking ?250 cr revenue from snack categ...  ...  Mumbai, Feb 23 (PTI) Fruit juice concentrate m...
4510  Sachin attends Rajya Sabha after questions on ...  ...  Former cricketer Sachin Tendulkar was spot

In [9]:
#Prepare series for complete texts
c_texts = data["ctext"]
c_texts[:10]

0    The Daman and Diu administration on Wednesday ...
1    From her special numbers to TV?appearances, Bo...
2    The Indira Gandhi Institute of Medical Science...
3    Lashkar-e-Taiba's Kashmir commander Abu Dujana...
4    Hotels in Mumbai and other Indian cities are t...
5    An alleged suspect in a kidnapping case was fo...
6    In an interesting ruling, the Delhi high court...
7     A 60-year old Dalit woman was allegedly lynch...
8    Two years after a helicopter crash near the Bo...
9    It sounds like satire, but make no mistake: at...
Name: ctext, dtype: object

In [10]:
#Select a complete text for processing
c_text0 = c_texts[0]
print(c_text0)

The Daman and Diu administration on Wednesday withdrew a circular that asked women staff to tie rakhis on male colleagues after the order triggered a backlash from employees and was ripped apart on social media.The union territory?s administration was forced to retreat within 24 hours of issuing the circular that made it compulsory for its staff to celebrate Rakshabandhan at workplace.?It has been decided to celebrate the festival of Rakshabandhan on August 7. In this connection, all offices/ departments shall remain open and celebrate the festival collectively at a suitable time wherein all the lady staff shall tie rakhis to their colleagues,? the order, issued on August 1 by Gurpreet Singh, deputy secretary (personnel), had said.To ensure that no one skipped office, an attendance report was to be sent to the government the next evening.The two notifications ? one mandating the celebration of Rakshabandhan (left) and the other withdrawing the mandate (right) ? were issued by the Daman

In [11]:
#Prepare list of sentences with sentence tokenizer
sents0 = nltk.sent_tokenize(c_text0)
sents0

['The Daman and Diu administration on Wednesday withdrew a circular that asked women staff to tie rakhis on male colleagues after the order triggered a backlash from employees and was ripped apart on social media.The union territory?s administration was forced to retreat within 24 hours of issuing the circular that made it compulsory for its staff to celebrate Rakshabandhan at workplace.',
 '?It has been decided to celebrate the festival of Rakshabandhan on August 7.',
 'In this connection, all offices/ departments shall remain open and celebrate the festival collectively at a suitable time wherein all the lady staff shall tie rakhis to their colleagues,?',
 'the order, issued on August 1 by Gurpreet Singh, deputy secretary (personnel), had said.To ensure that no one skipped office, an attendance report was to be sent to the government the next evening.The two notifications ?',
 'one mandating the celebration of Rakshabandhan (left) and the other withdrawing the mandate (right) ?',
 'w
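
In the output above, several sentences stay glued together (for example `social media.The union`) because the source text has no space after the period, so the tokenizer sees no boundary there. One possible pre-processing step, sketched here as an assumption rather than as part of the original notebook, is to insert the missing space before tokenizing. The helper name `add_missing_sentence_spaces` is hypothetical.

```python
import re

def add_missing_sentence_spaces(text):
    """Insert a space after a period that is directly followed by a capital letter."""
    # Match '.' only when preceded by a lowercase letter or ')' and followed
    # by an uppercase letter, so "U.S." and "3.5" are left untouched.
    return re.sub(r"(?<=[a-z)])\.(?=[A-Z])", ". ", text)

print(add_missing_sentence_spaces("ripped apart on social media.The union territory"))
```

The repaired string could then be passed to `nltk.sent_tokenize` in place of `c_text0`. The pattern is a heuristic and may still miss or over-split unusual cases.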

In [12]:
#Clean sentences of symbols and punctuation
sents0_cleaned = [re.sub(r"[^a-zA-Z0-9]", " ", sentence) for sentence in sents0]
sents0_cleaned

['The Daman and Diu administration on Wednesday withdrew a circular that asked women staff to tie rakhis on male colleagues after the order triggered a backlash from employees and was ripped apart on social media The union territory s administration was forced to retreat within 24 hours of issuing the circular that made it compulsory for its staff to celebrate Rakshabandhan at workplace ',
 ' It has been decided to celebrate the festival of Rakshabandhan on August 7 ',
 'In this connection  all offices  departments shall remain open and celebrate the festival collectively at a suitable time wherein all the lady staff shall tie rakhis to their colleagues  ',
 'the order  issued on August 1 by Gurpreet Singh  deputy secretary  personnel   had said To ensure that no one skipped office  an attendance report was to be sent to the government the next evening The two notifications  ',
 'one mandating the celebration of Rakshabandhan  left  and the other withdrawing the mandate  right   ',
 'w

In [13]:
#download stopwords
nltk.download("stopwords")
# Import and set the English stopwords
from nltk.corpus import stopwords
stop_words = set(stopwords.words("english"))

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [14]:
#Remove stopwords from sentences
sents = []
for sentence in sents0_cleaned:
  sentence = ' '.join([word for word in sentence.split() if word not in stop_words])
  sents.append(sentence)
sents

['The Daman Diu administration Wednesday withdrew circular asked women staff tie rakhis male colleagues order triggered backlash employees ripped apart social media The union territory administration forced retreat within 24 hours issuing circular made compulsory staff celebrate Rakshabandhan workplace',
 'It decided celebrate festival Rakshabandhan August 7',
 'In connection offices departments shall remain open celebrate festival collectively suitable time wherein lady staff shall tie rakhis colleagues',
 'order issued August 1 Gurpreet Singh deputy secretary personnel said To ensure one skipped office attendance report sent government next evening The two notifications',
 'one mandating celebration Rakshabandhan left withdrawing mandate right',
 'issued Daman Diu administration day apart',
 'The circular withdrawn one line order issued late evening UT department personnel administrative reforms',
 'The circular ridiculous',
 'There sensitivities involved',
 'How government dictate I
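
Note that capitalised stop words such as "The", "It", "In" and "To" survive in the output above: the NLTK English list is lowercase and the comparison is case-sensitive. A minimal sketch of a case-insensitive filter follows. The small `STOP_WORDS` set is an illustrative stand-in so the example runs on its own; in the notebook the `stop_words` set built from NLTK could be passed instead.

```python
# A tiny illustrative stop-word set (an assumption for this sketch);
# the notebook's NLTK-based `stop_words` could be used instead.
STOP_WORDS = {"the", "it", "in", "to", "on", "a", "has", "been", "of"}

def remove_stop_words(sentence, stop_words=STOP_WORDS):
    """Drop stop words regardless of capitalisation; keep the casing of the rest."""
    return " ".join(w for w in sentence.split() if w.lower() not in stop_words)

print(remove_stop_words("It has been decided to celebrate the festival"))
```

Comparing `w.lower()` rather than `w` keeps the original casing of the remaining words, which matters if proper nouns should stay recognisable.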

In [15]:
#Tokenize each sentence to get a list of token lists, then merge
# all the lists to form the all_words vocabulary list
tokens = []
for sentence in sents:
  tokens.append(nltk.word_tokenize(sentence.lower()))

all_words = []
for l_tokens in tokens:
  for token in l_tokens:
    all_words.append(token)

all_words

['the',
 'daman',
 'diu',
 'administration',
 'wednesday',
 'withdrew',
 'circular',
 'asked',
 'women',
 'staff',
 'tie',
 'rakhis',
 'male',
 'colleagues',
 'order',
 'triggered',
 'backlash',
 'employees',
 'ripped',
 'apart',
 'social',
 'media',
 'the',
 'union',
 'territory',
 'administration',
 'forced',
 'retreat',
 'within',
 '24',
 'hours',
 'issuing',
 'circular',
 'made',
 'compulsory',
 'staff',
 'celebrate',
 'rakshabandhan',
 'workplace',
 'it',
 'decided',
 'celebrate',
 'festival',
 'rakshabandhan',
 'august',
 '7',
 'in',
 'connection',
 'offices',
 'departments',
 'shall',
 'remain',
 'open',
 'celebrate',
 'festival',
 'collectively',
 'suitable',
 'time',
 'wherein',
 'lady',
 'staff',
 'shall',
 'tie',
 'rakhis',
 'colleagues',
 'order',
 'issued',
 'august',
 '1',
 'gurpreet',
 'singh',
 'deputy',
 'secretary',
 'personnel',
 'said',
 'to',
 'ensure',
 'one',
 'skipped',
 'office',
 'attendance',
 'report',
 'sent',
 'government',
 'next',
 'evening',
 'the',
 't

In [16]:
#import porter stemmer
from nltk import PorterStemmer 
#create porterstemmer object
ps = PorterStemmer()

In [17]:
#Create a dictionary from all_words with stemmed words as keys and their
# frequencies as values
frequency = dict()
for word in all_words:
  word = ps.stem(word)
  if word in frequency:
    frequency[word] += 1
  else:
    frequency[word] = 1
frequency    

{'1': 1,
 '2014': 1,
 '24': 1,
 '7': 1,
 'a': 1,
 'administr': 5,
 'affair': 1,
 'al': 1,
 'apart': 2,
 'area': 1,
 'ask': 2,
 'attend': 1,
 'august': 2,
 'backlash': 1,
 'becom': 1,
 'bhagwat': 1,
 'bjp': 2,
 'bond': 1,
 'border': 1,
 'brother': 1,
 'cabinet': 1,
 'celebr': 7,
 'centr': 1,
 'chief': 1,
 'circular': 4,
 'colleagu': 2,
 'collect': 1,
 'compulsori': 1,
 'confin': 1,
 'connect': 1,
 'constitu': 1,
 'cultur': 1,
 'daman': 3,
 'day': 2,
 'decid': 1,
 'depart': 2,
 'deputi': 1,
 'dictat': 1,
 'direct': 1,
 'diu': 3,
 'earlier': 1,
 'employe': 1,
 'enshrin': 1,
 'ensur': 1,
 'even': 2,
 'famili': 1,
 'festiv': 6,
 'forc': 1,
 'former': 1,
 'go': 1,
 'govern': 3,
 'gujarat': 1,
 'gurpreet': 1,
 'hindu': 2,
 'hindustan': 1,
 'home': 1,
 'hour': 1,
 'how': 1,
 'i': 1,
 'identifi': 1,
 'ideolog': 2,
 'in': 2,
 'involv': 1,
 'issu': 5,
 'it': 1,
 'kodabhai': 1,
 'ladi': 1,
 'last': 1,
 'late': 1,
 'left': 1,
 'line': 1,
 'live': 1,
 'longer': 1,
 'made': 1,
 'maintain': 1,
 'male'
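
The same counting can be expressed with `collections.Counter` from the standard library. The sketch below uses a short hand-made list of stems so that it runs on its own; the list contents are illustrative, not taken from the dataset.

```python
from collections import Counter

# Illustrative stems only; in the notebook these would come from
# stemming each entry of all_words.
stems = ["celebr", "festiv", "celebr", "circular", "festiv", "celebr"]

frequency = Counter(stems)

print(frequency["celebr"])       # count for a single stem
print(frequency.most_common(1))  # most frequent stem with its count
```

`Counter.most_common(1)` also covers the "find the most frequent word" step without a manual loop.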

In [18]:
# Get the most frequent word and its frequency
freq_word = None
most_freq = 0
for key, value in frequency.items():
  if value > most_freq:
    most_freq = value
    freq_word = key
print(most_freq)
print(freq_word)

7
the


In [19]:
# Normalize the word counts in the dictionary to the range 0 to 1
for key, value in frequency.items():
  frequency[key] = round(value/most_freq, 2)
print(frequency)

{'the': 1.0, 'daman': 0.43, 'diu': 0.43, 'administr': 0.71, 'wednesday': 0.14, 'withdrew': 0.14, 'circular': 0.57, 'ask': 0.29, 'women': 0.29, 'staff': 0.43, 'tie': 0.43, 'rakhi': 0.43, 'male': 0.14, 'colleagu': 0.29, 'order': 0.43, 'trigger': 0.14, 'backlash': 0.14, 'employe': 0.14, 'rip': 0.14, 'apart': 0.29, 'social': 0.14, 'media': 0.14, 'union': 0.14, 'territori': 0.14, 'forc': 0.14, 'retreat': 0.14, 'within': 0.14, '24': 0.14, 'hour': 0.14, 'issu': 0.71, 'made': 0.14, 'compulsori': 0.14, 'celebr': 1.0, 'rakshabandhan': 0.57, 'workplac': 0.29, 'it': 0.14, 'decid': 0.14, 'festiv': 0.86, 'august': 0.29, '7': 0.14, 'in': 0.29, 'connect': 0.14, 'offic': 0.29, 'depart': 0.29, 'shall': 0.29, 'remain': 0.14, 'open': 0.14, 'collect': 0.14, 'suitabl': 0.14, 'time': 0.29, 'wherein': 0.14, 'ladi': 0.14, '1': 0.14, 'gurpreet': 0.14, 'singh': 0.14, 'deputi': 0.14, 'secretari': 0.14, 'personnel': 0.29, 'said': 0.43, 'to': 0.14, 'ensur': 0.14, 'one': 0.57, 'skip': 0.14, 'attend': 0.14, 'report':
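
The normalization divides every count by the largest count, so a stem seen 6 times when the maximum is 7 gets 6/7 ≈ 0.86. A minimal sketch that returns a new dictionary instead of overwriting the counts is shown below; the function name `normalize` is hypothetical. Building a new dictionary means the cell can be re-run without dividing already-normalized values a second time.

```python
def normalize(counts):
    """Return a new dict with each count divided by the largest count."""
    if not counts:
        return {}
    top = max(counts.values())
    return {word: round(n / top, 2) for word, n in counts.items()}

# Counts chosen to mirror a few values from the output above.
sample = {"celebr": 7, "festiv": 6, "administr": 5, "circular": 4, "daman": 3}
print(normalize(sample))
```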

In [20]:
# Set a frequency threshold and use it to build the list of important words
imp_words = []
freq_score = 0.5
for key, value in frequency.items():
  if value > freq_score :
    imp_words.append(key)
imp_words

['the',
 'administr',
 'circular',
 'issu',
 'celebr',
 'rakshabandhan',
 'festiv',
 'one']

In [21]:
#Use the imp_words list to select important-sentences into a list 
imp_sents = []
for sentence in sents:
  for word in imp_words:
    if sentence.lower().find(word) != -1 and sentence not in imp_sents:
      imp_sents.append(sentence)
  
imp_sents  

['The Daman Diu administration Wednesday withdrew circular asked women staff tie rakhis male colleagues order triggered backlash employees ripped apart social media The union territory administration forced retreat within 24 hours issuing circular made compulsory staff celebrate Rakshabandhan workplace',
 'It decided celebrate festival Rakshabandhan August 7',
 'In connection offices departments shall remain open celebrate festival collectively suitable time wherein lady staff shall tie rakhis colleagues',
 'order issued August 1 Gurpreet Singh deputy secretary personnel said To ensure one skipped office attendance report sent government next evening The two notifications',
 'one mandating celebration Rakshabandhan left withdrawing mandate right',
 'issued Daman Diu administration day apart',
 'The circular withdrawn one line order issued late evening UT department personnel administrative reforms',
 'The circular ridiculous',
 'There sensitivities involved',
 'She refused identified T
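
Because `str.find` matches substrings, a short keyword such as "the" also matches inside longer words ("other", "their"), which is one reason so many sentences are selected above. A possible alternative, sketched here as an assumption and not as the notebook's method, compares whole tokens. The helper `has_keyword` and its `stem` parameter are hypothetical; by default no stemming is applied.

```python
def has_keyword(sentence, keywords, stem=lambda w: w):
    """True if any whole token of the sentence, after stemming, is in keywords."""
    keywords = set(keywords)
    return any(stem(tok) in keywords for tok in sentence.lower().split())

# Substring search reports a match, whole-token search does not.
print("another matter".find("the") != -1)
print(has_keyword("another matter", ["the"]))
print(has_keyword("The circular ridiculous", ["circular"]))
```

Since `imp_words` holds stems, the notebook's `ps.stem` could be passed as the `stem` argument so that "celebrate" is compared as "celebr".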

In [22]:
#Personalize the imp_words list as per your needs and the above results;
# here 'the' slipped past the case-sensitive stop-word filter, so remove it
imp_words.remove("the")
imp_words

['administr', 'circular', 'issu', 'celebr', 'rakshabandhan', 'festiv', 'one']

In [23]:
#Make the list of important sentences using personalized list of words
imp_sents = []
for sentence in sents:
  for word in imp_words:
    if sentence.lower().find(word) != -1 and sentence not in imp_sents:
      imp_sents.append(sentence)
  
imp_sents  

['The Daman Diu administration Wednesday withdrew circular asked women staff tie rakhis male colleagues order triggered backlash employees ripped apart social media The union territory administration forced retreat within 24 hours issuing circular made compulsory staff celebrate Rakshabandhan workplace',
 'It decided celebrate festival Rakshabandhan August 7',
 'In connection offices departments shall remain open celebrate festival collectively suitable time wherein lady staff shall tie rakhis colleagues',
 'order issued August 1 Gurpreet Singh deputy secretary personnel said To ensure one skipped office attendance report sent government next evening The two notifications',
 'one mandating celebration Rakshabandhan left withdrawing mandate right',
 'issued Daman Diu administration day apart',
 'The circular withdrawn one line order issued late evening UT department personnel administrative reforms',
 'The circular ridiculous',
 'She refused identified The notice issued Daman Diu admini

In [24]:
#Prepare the summary using the sentence list
summarized_news = '. '.join(imp_sents)
summarized_news

'The Daman Diu administration Wednesday withdrew circular asked women staff tie rakhis male colleagues order triggered backlash employees ripped apart social media The union territory administration forced retreat within 24 hours issuing circular made compulsory staff celebrate Rakshabandhan workplace. It decided celebrate festival Rakshabandhan August 7. In connection offices departments shall remain open celebrate festival collectively suitable time wherein lady staff shall tie rakhis colleagues. order issued August 1 Gurpreet Singh deputy secretary personnel said To ensure one skipped office attendance report sent government next evening The two notifications. one mandating celebration Rakshabandhan left withdrawing mandate right. issued Daman Diu administration day apart. The circular withdrawn one line order issued late evening UT department personnel administrative reforms. The circular ridiculous. She refused identified The notice issued Daman Diu administrator former Gujarat ho

In [27]:
#Make the list of important sentences using a single keyword,
# imp_words[4] ('rakshabandhan')
imp_sents = []
for sentence in sents:
  if sentence.lower().find(imp_words[4]) != -1 and sentence not in imp_sents:
    imp_sents.append(sentence)
  
print(imp_sents)

#Prepare the summary using the sentence list
summarized_news = '. '.join(imp_sents)
summarized_news

['The Daman Diu administration Wednesday withdrew circular asked women staff tie rakhis male colleagues order triggered backlash employees ripped apart social media The union territory administration forced retreat within 24 hours issuing circular made compulsory staff celebrate Rakshabandhan workplace', 'It decided celebrate festival Rakshabandhan August 7', 'one mandating celebration Rakshabandhan left withdrawing mandate right', 'She refused identified The notice issued Daman Diu administrator former Gujarat home minister Praful Kodabhai Patel direction sources said Rakshabandhan celebration bond brothers sisters one several Hindu festivities rituals longer confined private family affairs become tools push politic al ideologies In 2014 year BJP stormed power Centre Rashtriya Swayamsevak Sangh RSS chief Mohan Bhagwat said festival national significance']


'The Daman Diu administration Wednesday withdrew circular asked women staff tie rakhis male colleagues order triggered backlash employees ripped apart social media The union territory administration forced retreat within 24 hours issuing circular made compulsory staff celebrate Rakshabandhan workplace. It decided celebrate festival Rakshabandhan August 7. one mandating celebration Rakshabandhan left withdrawing mandate right. She refused identified The notice issued Daman Diu administrator former Gujarat home minister Praful Kodabhai Patel direction sources said Rakshabandhan celebration bond brothers sisters one several Hindu festivities rituals longer confined private family affairs become tools push politic al ideologies In 2014 year BJP stormed power Centre Rashtriya Swayamsevak Sangh RSS chief Mohan Bhagwat said festival national significance'
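
The summary above is assembled from the cleaned sentences, so punctuation and stop words are missing and it reads telegraphically. One way to get a more readable result, sketched here under assumptions, is to search a cleaned copy of each sentence but emit the original wording. The function name `summarize_with_originals` and the sample sentences are illustrative.

```python
import re

def summarize_with_originals(original_sents, keyword):
    """Select sentences by keyword on a cleaned copy, but return the original text."""
    selected = []
    for original in original_sents:
        # Same cleaning rule as in the notebook: non-alphanumerics become spaces.
        cleaned = re.sub(r"[^a-zA-Z0-9]", " ", original).lower()
        if keyword in cleaned.split():
            selected.append(original.strip())
    return " ".join(selected)

sample_sents = [
    "The circular was withdrawn.",
    "Staff were relieved.",
    "A new circular followed.",
]
print(summarize_with_originals(sample_sents, "circular"))
```

In the notebook, `sents0` (the output of `nltk.sent_tokenize`) would play the role of `original_sents`.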