## Extractive Text Summarizer using NLTK

### Import the libraries

In [None]:
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize, sent_tokenize

### Here is a short story (the text to be summarized)

In [None]:
text="""
An old man lived in the village. He was one of the most unfortunate people in the world. The whole village was tired of him; he was always gloomy, he constantly complained and was always in a bad mood.
The longer he lived, the more bile he was becoming and the more poisonous were his words. People avoided him, because his misfortune became contagious. It was even unnatural and insulting to be happy next to him.

He created the feeling of unhappiness in others.
But one day, when he turned eighty years old, an incredible thing happened. Instantly everyone started hearing the rumour: “An Old Man is happy today, he doesn’t complain about anything, smiles, and even his face is freshened up.”
The whole village gathered together. The old man was asked:
Villager: What happened to you?
“Nothing special. Eighty years I’ve been chasing happiness, and it was useless. And then I decided to live without happiness and just enjoy life. That’s why I’m happy now.” – An Old Man

"""

### Create a function to calculate frequency of words in the given text

In [None]:
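def frequency_table(text_string):
    """Count how often each stemmed token that is not a stop word occurs."""
    stopWords = set(stopwords.words("english"))
    words = word_tokenize(text_string)
    ps = PorterStemmer()

    freqTable = dict()
    for word in words:
        # Stem first (PorterStemmer also lowercases), then drop stop words.
        # Because the check runs on the stem, a stop word whose stem differs
        # (e.g. "was" -> "wa") is still counted, and so are punctuation tokens.
        word = ps.stem(word)
        if word in stopWords:
            continue
        freqTable[word] = freqTable.get(word, 0) + 1

    return freqTable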
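
As a point of comparison, here is a minimal, dependency-free sketch of the same counting idea. It is not the notebook's method: `STOP_WORDS` is a tiny hand-picked set standing in for `stopwords.words("english")`, a regular expression stands in for `word_tokenize`, no stemming is applied, and the name `simple_frequency_table` is hypothetical. It filters stop words on the raw lowercase token, so punctuation and stop words never reach the table.

```python
import re
from collections import Counter

# Hypothetical, tiny stop-word set standing in for stopwords.words("english").
STOP_WORDS = {"a", "an", "the", "in", "of", "was", "he", "and", "is"}


def simple_frequency_table(text_string):
    """Count lowercase word tokens, skipping stop words and punctuation."""
    # \w+ keeps runs of letters and digits, so punctuation never becomes a token.
    tokens = re.findall(r"\w+", text_string.lower())
    return Counter(tok for tok in tokens if tok not in STOP_WORDS)


sample = "An old man lived in the village. The village was tired of the old man."
table = simple_frequency_table(sample)
print(table.most_common(3))
```

`Counter` replaces the manual "increment or initialize" branch, and `most_common` gives a quick look at the highest-frequency words.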

### Create a score function that calculates score for every sentence.

<p>
The score of a sentence is the sum of the frequencies of the frequency-table words found in it, divided by the number of such words.
</p>

In [None]:
def score_sentences(sentences, freqTable):
    sent_value = dict()

    for sentence in sentences:
        key = sentence[:10]  # first 10 characters identify the sentence
        sentence_lower = sentence.lower()
        word_count_except_sw = 0
        for wordValue in freqTable:
            # Substring test: a stem also matches inside longer words.
            if wordValue in sentence_lower:
                word_count_except_sw += 1
                sent_value[key] = sent_value.get(key, 0) + freqTable[wordValue]

        # Normalize by the number of matched words so long sentences are not favored.
        if key in sent_value and word_count_except_sw:
            sent_value[key] = sent_value[key] / word_count_except_sw

    return sent_value

<p>
    <b>Disadvantage -</b>
    <br> Long sentences would otherwise have an advantage over short ones. To offset this, each sentence score is divided by the number of frequency-table words found in the sentence.<br>
<b>Note -</b><br> Here sentence[:10] is the first 10 characters of a sentence. Using it as the dictionary key saves memory, but sentences that begin with the same 10 characters share a single entry.
</p>
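
One way to avoid shared keys is to key the scores by sentence position instead of by a text prefix. The sketch below is an illustration under stated assumptions, not the notebook's function: `score_sentences_by_index` is a hypothetical name, the frequency table is assumed to hold plain lowercase words rather than stems, and whole tokens are compared instead of substrings.

```python
import re


def score_sentences_by_index(sentences, freq_table):
    """Return {sentence index: average frequency of its known words}."""
    scores = {}
    for index, sentence in enumerate(sentences):
        # Compare whole lowercase tokens instead of searching for substrings.
        tokens = re.findall(r"\w+", sentence.lower())
        known = [tok for tok in tokens if tok in freq_table]
        if known:  # skip sentences with no scored words to avoid dividing by zero
            scores[index] = sum(freq_table[tok] for tok in known) / len(known)
    return scores


# Two sentences that share the same first 10 characters ("The whole ").
sentences = [
    "The whole village was tired of him.",
    "The whole village gathered together.",
]
# Made-up frequencies for illustration only.
freq_table = {"whole": 2, "village": 2, "tired": 1, "gathered": 1, "together": 1}
scores = score_sentences_by_index(sentences, freq_table)
print(scores)
```

Each sentence keeps its own entry here, at the cost of storing one integer key per sentence.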

### Create a function to find the average score

In [None]:
def find_average_score(sent_value):
    sumValues = 0
    for entry in sent_value:
        sumValues += sent_value[entry]

    # Average value of a sentence from original text
    average = (sumValues / len(sent_value))

    return average
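
If no sentence receives a score, `len(sent_value)` is zero and the division above raises `ZeroDivisionError`. A possible guard, sketched with the standard library's `statistics.fmean` (the name `average_score` and the choice to return `0.0` for an empty table are assumptions):

```python
from statistics import fmean


def average_score(sent_value):
    """Mean of the sentence scores, or 0.0 when nothing was scored."""
    return fmean(sent_value.values()) if sent_value else 0.0


print(average_score({"An old man": 2.0, "He was one": 1.0}))  # 1.5
print(average_score({}))  # 0.0
```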

### Create a function to generate the summary

In [None]:
def generate_summary(sentences, sent_value, threshold):
    summary = ''

    for sentence in sentences:
        key = sentence[:10]  # same key that score_sentences uses
        # Keep only the sentences whose score reaches the threshold.
        if key in sent_value and sent_value[key] >= threshold:
            summary += " " + sentence

    return summary

<p>The threshold is based on the average sentence score; run_summarization below passes 1.3 times the average.</p>
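
To see how the threshold controls summary length, the small sketch below applies two multipliers to a made-up score table. The function name `select_keys`, its `multiplier` parameter, and the scores are all hypothetical.

```python
def select_keys(sent_value, multiplier=1.0):
    """Return the keys whose score reaches multiplier * average score."""
    if not sent_value:
        return []
    threshold = multiplier * sum(sent_value.values()) / len(sent_value)
    return [key for key, score in sent_value.items() if score >= threshold]


scores = {"s1": 1.0, "s2": 2.0, "s3": 3.0}  # made-up scores; the average is 2.0
print(select_keys(scores, 1.0))  # ['s2', 's3']
print(select_keys(scores, 1.3))  # ['s3']
```

A larger multiplier keeps fewer sentences, so it acts as a rough knob for how short the summary should be.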

### Create a function to call all the above functions

In [None]:
def run_summarization(text):
    # 1) Create the word frequency table
    freq_table = frequency_table(text)

    # 2) Tokenize the sentences
    sentences = sent_tokenize(text)

    # 3) Important Algorithm: score the sentences
    sentence_scores = score_sentences(sentences, freq_table)

    # 4) Find the threshold
    threshold = find_average_score(sentence_scores)

    # 5) Important Algorithm: Generate the summary
    summary = generate_summary(sentences, sentence_scores, 1.3 * threshold)

    return summary

### Generate the summary

In [None]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [None]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [None]:
if __name__ == '__main__':
    result = run_summarization(text)

In [None]:
print(result)

 
An old man lived in the village. Eighty years I’ve been chasing happiness, and it was useless.
