## Text Summarization using NLP - Overview

The main objective of this project is to build a function that can take in any huge piece of text and shorten it to give a gist, or summary of the information. We know that the internet contains large volumes of data, both descriptive as well as quantitiative. While the quantitative data can be analysed in several ways, it becomes cumbersome to read through the huge paragraphs. So, the most effieicnt way to get access to these important parts of the data is too summarize the data in a way that it contains non-redundant and useful information only.  <br> <br> 
Text summarization is a subdomain of Natural Language Processing (NLP) that deals with extracting summaries from huge chunks of texts. There are two main types of techniques used for text summarization: NLP-based techniques and deep learning-based techniques. In this notebook, we will see a simple NLP-based technique for text summarization

### Steps involved in Text Summarization using NLP

- <b> Converting paragraphs to sentences </b>
- <b> Text pre-processing -</b> Removing special characters, stop words, lower casing the text.
- <b> Tokenizing the sentences -</b> Tokenizing the sentences to get all the words that exist in the sentences. 
- <b> Finding the weighted frequency of occurrence </b> - Getting the weighted frequency of each word by finding the frequency of each word divided by the maximum frequency of a particular word. 
- <b> Replace the words by their weighted frequency in the original sentences - </b> Plug the weighted frequencies of the words in the sentences and find the sum.
- <b> Sort sentences in the descending order of sum - </b> The sentences with the highest frequenices summarize the text. The number of sentences that need to be taken depends on what size of summary we need. 

In [1]:
#!pip install newspaper3k

#Library to perform necessary operations involving text

In [2]:
!pip install beautifulsoup4
#A useful library for web scraping



In [3]:
!pip install lxml
#A library to parse XML and HTML



In [4]:
from newspaper import Article
import re
import bs4 as bs
import urllib.request
import numpy as np
import nltk

### Loading the parent text

The url of the article we would summarize is mentioned below

In [5]:
url = "https://en.wikipedia.org/wiki/FIFA_World_Cup"

In [6]:
#Using the urlopen function from the urllib.request library to scrap the data
scraped_data = urllib.request.urlopen('https://en.wikipedia.org/wiki/FIFA_World_Cup')

#Reading the data
article = scraped_data.read()

#To parse the data, we use beautiful soup library
parsed_article = bs.BeautifulSoup(article, 'lxml')
paragraphs = parsed_article.find_all('p')

article_text = ""

for p in paragraphs:
    article_text += p.text

In [7]:
article_text

'\nThe FIFA World Cup, often simply called the World Cup, is an international association football competition contested by the senior men\'s national teams of the members of the Fédération Internationale de Football Association (FIFA), the sport\'s global governing body. The tournament has been held every four years since the inaugural tournament in 1930, except in 1942 and 1946 when it was not held because of the Second World War. The reigning champions are France, who won their second title at the 2018 tournament in Russia.\nThe format involves a qualification phase, which takes place over the preceding three years, to determine which teams qualify for the tournament phase. In the tournament phase, 32 teams compete for the title at venues within the host nation(s) over about a month. The host nation(s) automatically qualify.\nAs of the 2018 FIFA World Cup, twenty-one final tournaments have been held and a total of 79 national teams have competed. The trophy has been won by eight nat

## Pre-Processing the Text

The first preprocessing step is to remove references from the article. Wikipedia, references are enclosed in square brackets. 

In [8]:
#Removing square brackets and extra spaces
article_text = re.sub(r'\[[0-9]*\]', ' ', article_text)
article_text = re.sub(r'\s+', ' ', article_text)

In [9]:
#Removing special characters and digits
processed_text = re.sub('[^a-zA-Z0-9]', ' ', article_text)
processed_text = re.sub(r'\s+',' ', processed_text)

article_text contains the original article and processed_article contains the processed article. In order to create the weighted frequency of the words, we will use the latter. 

## Converting Text to Sentences

In [10]:
sentences = nltk.sent_tokenize(article_text)

In [11]:
len(sentences)
#The number of sentences in the text

242

## Finding the Weighted Frequency of occurrence

In [12]:
from nltk.corpus import stopwords

In [13]:
stopwords = stopwords.words('english')

word_frequencies = {}
for word in nltk.word_tokenize(processed_text):
    if word not in stopwords:
        if word not in word_frequencies.keys():
            word_frequencies[word] = 1
        else:
            word_frequencies[word] += 1

- First, we store all the stopwords in the English language from the nltk library into a variable named <b> stopwords </b>. 
- We then loop through all the sentences and corresponding words to check if they are stop words. If not, we proceed to check whether the words exist in the <b> word_frequency dictionary </b> or not. 
- If the word is encountered for the first time, it is added to the dictionary, that is, its frequency is set to 1.
- If the word is already present, the freqeuency of the word is updated. 
- Finally, to find the weighted frequency, we divide the frequency of each word by the maximum frequency of any word in the text.

In [14]:
max_freq = max(word_frequencies.values())

for word in word_frequencies.keys():
    word_frequencies[word] = (word_frequencies[word]/max_freq)

## Calculating Sentence Scores

In [15]:
sentence_scores = {}

for sent in sentences:
    for word in nltk.sent_tokenize(sent.lower()):
        for word in word_frequencies.keys():
            if len(sent.split(' ')) < 40:
                if sent not in sentence_scores.keys():
                    sentence_scores[sent] = word_frequencies[word]
                else:
                    sentence_scores[sent] += word_frequencies[word]

- We first create an empty <b> sentence_scores </b> dictionary. The keys of this dictionary will be the sentences themselves and the values will be the corresponding scores of the sentences.
- We loop through each sentence in the <b> sentence_list </b> and tokenize the sentence into words. 
- We then check if the word exists in the <b> word_frequencies </b> dictionary. This check is performed since we created the <b> sentence_list </b> list from the article_text object; on the other hand, the word frequencies were calculated using the <b> processed_text </b> object, which doesn't contain any stop words, numbers, etc.
- We do not want very long sentences in the summary. Hence, we look only at those sentences with less than 40 words. 
- We check if the sentence exists in the sentence_scores dictionary or not. If it doesn't, we add it to the sentence_scores dictionary as a key and assign it to the weighted frequency of the first word in the sentence. 
- If it already exists, we add the weighted frequency of the word to the existing value. 

## Getting the summary

In [19]:
import heapq
summary_sent = heapq.nlargest(25, sentence_scores, key = sentence_scores.get)
#returns the top n sentences with the highest scores

summary = ' '.join(summary_sent)
print(summary)

In November 2007, FIFA announced that all members of World Cup-winning squads between 1930 and 1974 were to be retroactively awarded winners' medals. The reigning champions are France, who won their second title at the 2018 tournament in Russia. The format involves a qualification phase, which takes place over the preceding three years, to determine which teams qualify for the tournament phase. In the tournament phase, 32 teams compete for the title at venues within the host nation(s) over about a month. The host nation(s) automatically qualify. As of the 2018 FIFA World Cup, twenty-one final tournaments have been held and a total of 79 national teams have competed. The trophy has been won by eight national teams. Brazil have won five times, and they are the only team to have played in every tournament. The World Cup is the most prestigious association football tournament in the world, as well as the most widely viewed and followed single sporting event in the world. Seventeen countrie