<h1>Auto Summarizing text</h1>

<p1>This technique creates a summary of a given piece of text. It assigns a weight to each word in the text based on how frequently the word occurs. The underlying observation is that authors tend to use the words that describe the theme of their content far more often than other words.</p1>

<p2><b>This means that if we score words by frequency, we can pick out the sentences that contain the highest-valued words and use them as a summary/abstract.</b></p2>
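
As a rough illustration of the idea, the sketch below counts word frequencies in a made-up three-sentence text using only the standard library. The sample text, the tiny stopword set, and the regex tokenizer are assumptions made for this example; the notebook itself uses NLTK for these steps.

```python
import re
from collections import Counter

# Hypothetical sample text, made up for this illustration
sample = ("The river flooded the town. "
          "Rescue teams reached the town by boat. "
          "The river is expected to recede by Friday.")

# A tiny hand-picked stopword set (assumption; the notebook uses NLTK's list)
toy_stopwords = {"the", "by", "is", "to"}

# Lowercase the text and pull out alphabetic tokens
tokens = re.findall(r"[a-z]+", sample.lower())
content_words = [t for t in tokens if t not in toy_stopwords]

# Count how often each remaining word occurs
word_counts = Counter(content_words)
print(word_counts.most_common(2))
```

Here "river" and "town" each occur twice while every other content word occurs once, so sentences that mention them would receive the highest weights.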

In [1]:
import nltk
from nltk.tokenize import word_tokenize,sent_tokenize
from nltk.corpus import stopwords
from string import punctuation

<h3>Steps involved in Auto Summarizing text</h3>
<ol>
<li>Convert the text into sentences and into words</li>
<li>Create a set of stopwords that need to be removed from the list of words</li>
<li>Remove the stopwords from the list of words</li>
<li>Create a frequency distribution using the cleaned list of words</li>    
<li>Update the weightage of each sentence based on the words present in it</li>
<li>Sort the sentences by weightage and select the top ones as the summary</li>
</ol>

In [2]:
# Open the text file and read its content into a variable called text

with open('baba.txt') as file:
    text=file.read()
text=text.lower()    

In [3]:
# Convert the text into sentences and store it in a variable

sentences_in_text=sent_tokenize(text)

In [4]:
# Convert the text into words

words_in_text=word_tokenize(text)


In [5]:
# create a set of stop words to be removed

_stopwords=set(list(punctuation)+stopwords.words('english'))
_stopwords

{'!',
 '"',
 '#',
 '$',
 '%',
 '&',
 "'",
 '(',
 ')',
 '*',
 '+',
 ',',
 '-',
 '.',
 '/',
 ':',
 ';',
 '<',
 '=',
 '>',
 '?',
 '@',
 '[',
 '\\',
 ']',
 '^',
 '_',
 '`',
 'a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 'her',
 'here',
 'hers',
 'herself',
 'him',
 'himself',
 'his',
 'how',
 'i',
 'if',
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it's",
 'its',
 'itself',
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'need

In [6]:
# Remove the stopwords from our word list to speed up processing and to avoid interference from these words

cleaned_words=[a for a in words_in_text if a not in _stopwords]
len(cleaned_words)

316

In [7]:
from nltk.probability import FreqDist

In [8]:
# A frequency distribution is a table that lists each word in one column and its
# frequency of occurrence in another.

# We can use NLTK's built-in FreqDist class to build one from our cleaned list

freq=FreqDist(cleaned_words)
freq

FreqDist({'indian': 9, 'baba': 8, 'army': 7, 'soldiers': 6, '``': 4, "''": 4, 'singh': 4, 'nathu': 4, 'la': 4, 'sikkim': 4, ...})
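
Note that '``' and "''" appear among the most frequent tokens in the output above. `word_tokenize` rewrites double quotation marks into these two-character tokens, and they are not members of `string.punctuation`, so the stopword filter lets them through. One possible fix, sketched here on a hand-written token list (an assumption for this example, so that it runs without NLTK), is to drop every token that contains no letter or digit. The helper name `has_alnum` is hypothetical.

```python
from string import punctuation

# Hand-written tokens imitating word_tokenize output (assumption for this sketch)
tokens = ["captain", "``", "baba", "''", "harbhajan", "singh", ",", "1946", "–"]

# Two-character quote tokens are not single punctuation characters,
# so a membership test against string.punctuation does not catch them
print("``" in set(punctuation))   # False

def has_alnum(token):
    """Return True if the token contains at least one letter or digit."""
    return any(ch.isalnum() for ch in token)

# Keep only tokens that carry some alphanumeric content
kept = [t for t in tokens if has_alnum(t)]
print(kept)   # ['captain', 'baba', 'harbhajan', 'singh', '1946']
```

Applying a filter like this before building the frequency distribution would keep quote tokens from adding weight to sentences that merely contain quotation marks.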

In [9]:
# FreqDist returns a FreqDist object, which is a dictionary-like subclass of collections.Counter

type(freq)

nltk.probability.FreqDist

In [10]:
from heapq import nlargest

In [11]:
# nlargest takes three inputs: the number N of items to return, the collection to pick them from,
# and a key function that gives the value to rank each item by. Iterating over freq yields its
# words, and key=freq.get ranks each word by its count


nlargest(10,freq,key=freq.get)


['indian',
 'baba',
 'army',
 'soldiers',
 '``',
 "''",
 'singh',
 'nathu',
 'la',
 'sikkim']
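
To make the role of the `key` argument concrete, here is a small sketch on a plain dictionary. The counts are copied from the first entries of the FreqDist output above; the variable names are made up for this example.

```python
from heapq import nlargest

# Counts copied from the first entries of the FreqDist output above
counts = {"indian": 9, "baba": 8, "army": 7, "soldiers": 6, "singh": 4}

# Iterating over a dict yields its keys; key=counts.get ranks each key by its count
top_three = nlargest(3, counts, key=counts.get)
print(top_three)   # ['indian', 'baba', 'army']

# Without key=..., the keys themselves are compared (alphabetically for strings)
alphabetical = nlargest(3, counts)
print(alphabetical)   # ['soldiers', 'singh', 'indian']
```

The second call shows why the `key` argument matters: leaving it out ranks the words by spelling rather than by frequency.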

In [12]:
# defaultdict is a special type of dictionary that doesn't raise a KeyError when the key we are
# looking for is absent; it inserts the key with a default value instead (0 for int)

from collections import defaultdict

ranking=defaultdict(int)
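
The difference from a regular dictionary can be seen in a short sketch; the dictionary names and keys here are made up for the illustration.

```python
from collections import defaultdict

plain = {}
try:
    plain["missing"] += 1      # a regular dict has no value to increment
    raised = False
except KeyError:
    raised = True
print(raised)                  # True

scores = defaultdict(int)
scores["missing"] += 1         # int() supplies 0, so this becomes 1
print(dict(scores))            # {'missing': 1}

# Reading an absent key also inserts it with the default value
_ = scores["never_set"]
print("never_set" in scores)   # True
```

This is what lets the scoring loop write `ranking[i]+=...` without first checking whether sentence `i` already has an entry.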

In [13]:
# We loop through each sentence and increase its weightage based on the
# words present in it. At the end, we will have a weightage for the sentences and we can select
# the top sentences based on the weightage

for i,sent in enumerate(sentences_in_text):
    for w in word_tokenize(sent.lower()):
        if w in freq:
            ranking[i]+=freq[w]


# enumerate yields (index, sentence) pairs, where the index of the sentence is the first
# element and the sentence itself is the second element.

# In short, we break each sentence into words and check whether each word is present in the
# frequency distribution of cleaned words. If yes, we add its frequency of occurrence to the
# weightage of this sentence. This process is repeated for all the sentences of the text
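
The arithmetic of this loop can be followed on a toy input. The frequency table and sentences below are hypothetical, and a plain whitespace split stands in for `word_tokenize` so that the sketch runs without NLTK.

```python
from collections import defaultdict

# Hypothetical frequency table and sentences, made up for this example
toy_freq = {"army": 3, "border": 2, "soldier": 1}
toy_sentences = [
    "the army guards the border",
    "a soldier joined the army",
    "it rained all day",
]

toy_ranking = defaultdict(int)
for index, sentence in enumerate(toy_sentences):
    for word in sentence.split():      # simple whitespace split instead of word_tokenize
        if word in toy_freq:
            toy_ranking[index] += toy_freq[word]

print(dict(toy_ranking))   # {0: 5, 1: 4}
```

Sentence 0 scores 3 + 2 = 5 and sentence 1 scores 1 + 3 = 4. Sentence 2 contains no counted word, so it never receives an entry and cannot be selected later.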


In [14]:
# The ranking dictionary has the indices of the sentences as keys and their respective weightage as values.
# We can now rank the keys by their values and get the top 5/10 keys, which denote the indices of the
# sentences that summarize our text

ranking

defaultdict(int,
            {0: 52,
             1: 52,
             2: 26,
             3: 66,
             4: 14,
             5: 7,
             6: 19,
             7: 85,
             8: 60,
             9: 17,
             10: 5,
             11: 7,
             12: 16,
             13: 57,
             14: 21,
             15: 27,
             16: 12,
             17: 21,
             18: 40,
             19: 34,
             20: 29,
             21: 29})

In [15]:
# Again, we use nlargest to do this ranking, which gives us the indices of the sentences with
# the most weightage

to_be_used=nlargest(5,ranking,key=ranking.get)

In [16]:
[sentences_in_text[j] for j in sorted(to_be_used)]

['captain "baba" harbhajan singh (30 august 1946 – 4 october 2019)[1] was an indian army soldier.',
 'many of his faithful - chiefly indian army personnel posted in and around the nathu la and the sino-indian border between the state of sikkim and tibet autonomous region - have come to believe his spirit protects every soldier in the inhospitable high-altitude terrain of the eastern himalayas.',
 "harbhajan singh's early death at the age of 22 is the subject of legend and religious veneration that has become popular among indian army regulars (jawans), the people of his village and apparently soldiers of the chinese people's liberation army (pla) across the border guarding the indo-chinese border between sikkim and tibet.",
 '[3]\n\nthe official version of his death is that he was a martyr of battle at the 14,500 feet (4,400 m) nathu la, a mountain pass between tibet and sikkim where many battles took place between the indian army and the pla during the 1965 sino-indian war.',
 'many i
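
The call to `sorted` in the cell above matters because `nlargest` returns indices ordered by weight, not by position in the text. A small sketch, using a few weights copied from the ranking output above (the variable names are made up):

```python
from heapq import nlargest

# Weights copied from a few entries of the ranking output above
sample_ranking = {0: 52, 3: 66, 7: 85, 8: 60, 13: 57}

by_weight = nlargest(3, sample_ranking, key=sample_ranking.get)
print(by_weight)           # [7, 3, 8]  (highest weight first)

in_text_order = sorted(by_weight)
print(in_text_order)       # [3, 7, 8]  (order of appearance in the text)
```

Sorting the selected indices keeps the summary sentences in the order in which they appear in the original text.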

<h2>Let's put everything we have done into a function</h2>

In [17]:
# This function takes the text content and the number of summary sentences required as inputs and returns the
# summary of the text as a list of sentences

def text_summarize(text,n):
    import nltk
    from nltk.tokenize import word_tokenize,sent_tokenize
    from nltk.corpus import stopwords
    from string import punctuation
    from nltk.probability import FreqDist
    from heapq import nlargest
    from collections import defaultdict
    
    sentences_in_text=sent_tokenize(text)
    assert n<len(sentences_in_text), "The number of summary sentences should be less than the number of sentences in the text"
    words_in_text=word_tokenize(text.lower())
    
    _stopwords=set(stopwords.words('english')+list(punctuation))
    
    cleaned_words=[word for word in words_in_text if word not in _stopwords]
    
    freq=FreqDist(cleaned_words)
    ranking=defaultdict(int)
    
    for i,sent in enumerate(sentences_in_text):
        for w in word_tokenize(sent.lower()):
            if w in freq:
                ranking[i]+=freq[w]
    
    sents_to_returned=nlargest(n,ranking,key=ranking.get)
    return [sentences_in_text[i] for i in sorted(sents_to_returned)]

<b>Let's test this function</b>

In [18]:
a=text_summarize(text,5)

In [19]:
a

['captain "baba" harbhajan singh (30 august 1946 – 4 october 2019)[1] was an indian army soldier.',
 'many of his faithful - chiefly indian army personnel posted in and around the nathu la and the sino-indian border between the state of sikkim and tibet autonomous region - have come to believe his spirit protects every soldier in the inhospitable high-altitude terrain of the eastern himalayas.',
 "harbhajan singh's early death at the age of 22 is the subject of legend and religious veneration that has become popular among indian army regulars (jawans), the people of his village and apparently soldiers of the chinese people's liberation army (pla) across the border guarding the indo-chinese border between sikkim and tibet.",
 '[3]\n\nthe official version of his death is that he was a martyr of battle at the 14,500 feet (4,400 m) nathu la, a mountain pass between tibet and sikkim where many battles took place between the indian army and the pla during the 1965 sino-indian war.',
 'many i

<h1>Thanks for your time!</h1>