**Text Summarization using NLTK: TF-IDF Algorithm**

***TF: Term Frequency***

> Term frequency (TF) is how often a word appears in a document, divided by the total number of words in that document.

> **TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document)**





***IDF: Inverse Document Frequency*** 

> Term frequency measures how common a word is within a document; inverse document frequency (IDF) measures how rare the word is across the whole collection of documents.

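> **IDF(t) = log_10(Total number of documents / Number of documents with term t in it)**

The base of the logarithm only rescales the scores. The worked example and the code in this walkthrough use base 10.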



A high weight in TF-IDF is reached by a high term frequency (in the given document) and a low document frequency of the term in the whole collection of documents.

For example, if the word "bug" appears many times in one document but rarely in the others, it is probably very relevant to that document.

Example:

Consider a document containing 100 words in which the word 'python' appears 5 times. The term frequency (TF) for 'python' is then 5 / 100 = 0.05.

Now, assume we have 10 million documents and the word 'python' appears in one thousand of them. Then, using a base-10 logarithm, the inverse document frequency (IDF) is log_10(10,000,000 / 1,000) = log_10(10,000) = 4.

Thus, the TF-IDF weight is the product of these quantities: 0.05 * 4 = 0.20.
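
The arithmetic of this example can be checked with a few lines of Python. This is only a minimal sketch: the helper names `term_frequency` and `inverse_document_frequency` are illustrative (they are not NLTK functions), and the base-10 logarithm is assumed because it is the base that yields 4 in the example.

```python
import math

def term_frequency(term_count, total_terms):
    """TF: occurrences of the term divided by the total terms in the document."""
    return term_count / total_terms

def inverse_document_frequency(total_docs, docs_with_term):
    """IDF with a base-10 logarithm (assumed, to match the worked example)."""
    return math.log10(total_docs / docs_with_term)

tf = term_frequency(5, 100)                           # 5 occurrences in 100 words
idf = inverse_document_frequency(10_000_000, 1_000)   # 10 million docs, 1,000 with the term
tf_idf = tf * idf

print(round(tf, 2), round(idf, 2), round(tf_idf, 2))
```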


**1. Tokenize the sentences**

In [24]:
import math
import textwrap
from nltk import sent_tokenize, word_tokenize, PorterStemmer
from nltk.corpus import stopwords  

# Read the whole file; the context manager closes it afterwards.
with open("/NLPText.txt", "r") as textDoc:
  text = textDoc.read()

print(text)

sentences = sent_tokenize(text) # NLTK function
total_documents = len(sentences)

Those Who Are Resilient Stay In The Game Longer
“On the mountains of truth you can never climb in vain: either you will reach a point higher up today, or you will be training your powers so that you will be able to climb higher tomorrow.” — Friedrich Nietzsche
Challenges and setbacks are not meant to defeat you, but promote you. However, I realise after many years of defeats, it can crush your spirit and it is easier to give up than risk further setbacks and disappointments. Have you experienced this before? To be honest, I don’t have the answers. I can’t tell you what the right course of action is; only you will know. However, it’s important not to be discouraged by failure when pursuing a goal or a dream, since failure itself means different things to different people. To a person with a Fixed Mindset failure is a blow to their self-esteem, yet to a person with a Growth Mindset, it’s an opportunity to improve and find new ways to overcome their obstacles. Same failure, yet different 

Here we split the text into sentences rather than words. Each sentence is treated as a separate document and will later be given a weight.
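
To make the "one sentence = one document" idea concrete without requiring NLTK data, here is a rough sketch. The helper `naive_sent_split` is hypothetical and far cruder than `sent_tokenize` (it will mis-split abbreviations, for instance); it is swapped in only so the snippet runs on its own.

```python
import re

def naive_sent_split(text):
    """Hypothetical stand-in for sent_tokenize: split after '.', '!' or '?'
    when followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

sample = "Commit to it. Nurture your dreams. Have you experienced this before?"

docs = naive_sent_split(sample)   # each sentence plays the role of a document
total_docs = len(docs)            # used later as the document count for IDF

for i, doc in enumerate(docs):
    print(i, doc)
```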


**2. Create the Frequency matrix of the words in each sentence.**

We calculate the frequency of words in each sentence.

In [0]:
def remove_punctuation(corpus):
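    # Punctuation characters; backslashes are escaped explicitly.
    punctuations = ".,\"-\\/#!?$%\\^&\\*;:{}=\\-_'~()"
    # Keep only the items that are not a substring of the punctuation string.
    filtered_corpus = [token for token in corpus if token not in punctuations]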
    return filtered_corpus

def create_frequency_matrix(sentences):
    frequency_matrix = {}
    stopWords = set(stopwords.words("english"))
    ps = PorterStemmer()

    for sent in sentences:
        
        freq_table = {}
        words = word_tokenize(sent)
        
        for word in words:
            
            word = word.lower()
            word = ps.stem(word)
            
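            # The stop-word check runs on the stem, so stop words whose stem
            # differs from the original (e.g. "this" -> "thi") are not filtered.
            if word in stopWords:
                continue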

            if word in freq_table:
                freq_table[word] += 1
            else:
                freq_table[word] = 1

        frequency_matrix[sent[:15]] = freq_table

    return frequency_matrix

In [0]:
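# `sentences` holds whole sentences, so this only drops items that are themselves
# bare punctuation. Punctuation inside a sentence is still tokenized later,
# which is why ',' and '.' appear in the tables that follow.
sentences = remove_punctuation(sentences)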
freqMatrix = create_frequency_matrix(sentences)

Here, the key is the first 15 characters of each sentence and the value is a dictionary of word frequencies for that sentence.
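
One caveat of using `sent[:15]` as the key is that two sentences sharing their first 15 characters would overwrite each other. The sketch below illustrates this with two made-up sentences and shows one possible alternative, keying by position; it is an illustration, not a change to the functions in this walkthrough.

```python
# Two made-up sentences that share the same first 15 characters.
sentences = [
    "However, it is easy to give up.",
    "However, it is easy to keep going too.",
]

# Keyed by prefix: the second sentence silently replaces the first.
by_prefix = {}
for sent in sentences:
    by_prefix[sent[:15]] = sent

# Keyed by position: every sentence keeps its own entry.
by_index = {i: sent for i, sent in enumerate(sentences)}

print(len(by_prefix), len(by_index))
```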

**3. Calculate the Term Frequency and Generate a Matrix**

We’ll find the term frequency of each word within its sentence.

Recall the definition of TF:
TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document)

Here, the document is a sentence and the term is a word in that sentence.

In [0]:
def create_tf_matrix(freq_matrix):
    tf_matrix = {}

    for sent, f_table in freq_matrix.items():
        tf_table = {}
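        # Note: len(f_table) is the number of distinct terms in the sentence, not the
        # total token count, so repeated words make the TF values sum to more than 1.
        count_words_in_sentence = len(f_table)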
        
        for word, count in f_table.items():
            tf_table[word] = count / count_words_in_sentence

        tf_matrix[sent] = tf_table

    return tf_matrix

In [5]:
create_tf_matrix(freqMatrix)

{'And that’s fine': {',': 0.1,
  '.': 0.1,
  'content': 0.1,
  'fine': 0.1,
  'later': 0.1,
  'less': 0.1,
  'long': 0.1,
  'receiv': 0.1,
  'regret': 0.1,
  '’': 0.3},
 'Are you willing': {'?': 0.1,
  'commit': 0.1,
  'failur': 0.1,
  'first': 0.1,
  'jump': 0.1,
  'life': 0.1,
  'ship': 0.1,
  'sign': 0.1,
  'thi': 0.1,
  'way': 0.1},
 'Because I assur': {',': 0.17647058823529413,
  '.': 0.058823529411764705,
  'assur': 0.058823529411764705,
  'becaus': 0.058823529411764705,
  'book': 0.058823529411764705,
  'contest': 0.058823529411764705,
  'dream': 0.058823529411764705,
  'harder': 0.058823529411764705,
  'less': 0.058823529411764705,
  'may': 0.058823529411764705,
  'read': 0.058823529411764705,
  'realis': 0.058823529411764705,
  'right': 0.058823529411764705,
  'sacrif': 0.058823529411764705,
  'sleep': 0.058823529411764705,
  'someon': 0.058823529411764705,
  'work': 0.058823529411764705},
 'Commit to it.': {'.': 0.5, 'commit': 0.5},
 'Consider the ad': {',': 0.083333333333333

Comparing this table with the frequency table from step 2 shows that, within a sentence, words with the same frequency get the same TF score.
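
Note that `create_tf_matrix` divides by `len(f_table)`, the number of distinct terms, whereas the TF definition divides by the total number of terms. The following sketch, using a made-up frequency table, shows how the two denominators differ; which one to prefer is a design choice.

```python
# Made-up frequency table for one sentence: "dream" occurs three times.
freq_table = {"dream": 3, "fine": 1, "regret": 1}

distinct_terms = len(freq_table)          # 3 distinct terms
total_terms = sum(freq_table.values())    # 5 tokens in total

# Denominator used by create_tf_matrix in this walkthrough.
tf_by_distinct = {w: c / distinct_terms for w, c in freq_table.items()}

# Denominator from the TF definition; these values sum to 1.
tf_by_total = {w: c / total_terms for w, c in freq_table.items()}

print(tf_by_distinct)
print(tf_by_total)
```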

**4. Creating a Table for Documents Per Words**

This is another simple table, used to calculate the IDF matrix.

For each word, we count how many sentences contain it. Let’s call the result the documents-per-word table.
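
As a compact illustration of this count, the same idea can be sketched with `collections.Counter` from the standard library. The two-sentence `freq_matrix` here is made up for the example.

```python
from collections import Counter

# Made-up frequency matrix with two sentences.
freq_matrix = {
    "Commit to it.": {"commit": 1, ".": 1},
    "Are you willing": {"commit": 1, "failur": 1, "?": 1},
}

doc_counts = Counter()
for freq_table in freq_matrix.values():
    # Updating with the keys counts each word once per sentence,
    # regardless of how often it occurs inside that sentence.
    doc_counts.update(freq_table.keys())

print(dict(doc_counts))
```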

In [0]:
def create_documents_per_words(freq_matrix):
    word_per_doc_table = {}

    for sent, f_table in freq_matrix.items():
        
        for word, count in f_table.items():
            
            if word in word_per_doc_table:
                word_per_doc_table[word] += 1
            else:
                word_per_doc_table[word] = 1

    return word_per_doc_table

In [7]:
create_documents_per_words(freqMatrix)

{'+': 1,
 ',': 22,
 '.': 45,
 '1990': 1,
 '19th': 1,
 ':': 8,
 ';': 1,
 '?': 6,
 'abl': 1,
 'academ': 1,
 'achiev': 1,
 'action': 3,
 'advic': 2,
 'advis': 1,
 'affirm': 1,
 'american': 1,
 'amus': 1,
 'angela': 1,
 'answer': 2,
 'ask': 2,
 'assur': 2,
 'astonish': 1,
 'author': 1,
 'away': 2,
 'becam': 1,
 'becaus': 3,
 'becom': 2,
 'beecher': 1,
 'befor': 2,
 'best': 2,
 'bigger': 2,
 'biggest': 1,
 'blow': 1,
 'book': 1,
 'breakthrough': 1,
 'bridg': 1,
 'came': 1,
 'capabl': 1,
 'carv': 1,
 'centuri': 1,
 'certain': 1,
 'challeng': 2,
 'chanc': 1,
 'circumst': 2,
 'client': 1,
 'climb': 1,
 'coach': 1,
 'come': 2,
 'commit': 2,
 'condit': 1,
 'consid': 2,
 'constant': 1,
 'content': 1,
 'contest': 1,
 'convey': 1,
 'convinc': 1,
 'could': 1,
 'cours': 1,
 'crush': 1,
 'daili': 2,
 'day': 1,
 'decad': 2,
 'decid': 2,
 'dedic': 1,
 'deeper': 1,
 'defeat': 3,
 'deserv': 2,
 'desir': 1,
 'develop': 6,
 'dice': 1,
 'dictat': 1,
 'differ': 3,
 'disappoint': 2,
 'disappointments.': 1,
 'd

**5. Calculate IDF and Generate a Matrix**

We’ll find the IDF of each word in a sentence.

Recall the definition of IDF:
IDF(t) = log_10(Total number of documents / Number of documents with term t in it)

Here, the document is a sentence and the term is a word in that sentence. The code uses a base-10 logarithm (math.log10).
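
The function in this step calls `math.log10`. The sketch below compares base 10 with the natural logarithm for a few document counts. The total of 52 sentences is an assumption inferred from the IDF values printed for this step (it is not stated in the text), and `idf` is a hypothetical helper.

```python
import math

# Assumption: 52 sentences, inferred from the printed IDF values
# (e.g. 1.716... for a word that occurs in a single sentence).
total_documents = 52

def idf(docs_with_word, log=math.log10):
    """IDF of a word that occurs in `docs_with_word` sentences."""
    return log(total_documents / docs_with_word)

for docs_with_word in (1, 2, 45):
    base10 = idf(docs_with_word)
    natural = idf(docs_with_word, log=math.log)
    print(docs_with_word, round(base10, 4), round(natural, 4))
```

Switching the base multiplies every IDF value by the same constant, so the relative ranking of words does not change.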


In [0]:
def create_idf_matrix(freq_matrix, count_doc_per_words, total_documents):
    idf_matrix = {}

    for sent, f_table in freq_matrix.items():
        idf_table = {}

        for word in f_table.keys():
            idf_table[word] = math.log10(total_documents / float(count_doc_per_words[word]))

        idf_matrix[sent] = idf_table

    return idf_matrix

The resulting matrix looks like this:

In [9]:
create_idf_matrix(freqMatrix, create_documents_per_words(freqMatrix), len(sentences))

{'And that’s fine': {',': 0.37358066281259295,
  '.': 0.06279082985945543,
  'content': 1.7160033436347992,
  'fine': 1.7160033436347992,
  'later': 1.414973347970818,
  'less': 0.9378520932511555,
  'long': 1.2388820889151366,
  'receiv': 1.414973347970818,
  'regret': 1.414973347970818,
  '’': 0.5118833609788744},
 'Are you willing': {'?': 0.9378520932511555,
  'commit': 1.414973347970818,
  'failur': 1.1139433523068367,
  'first': 1.7160033436347992,
  'jump': 1.7160033436347992,
  'life': 1.0170333392987803,
  'ship': 1.7160033436347992,
  'sign': 1.414973347970818,
  'thi': 1.1139433523068367,
  'way': 1.414973347970818},
 'Because I assur': {',': 0.37358066281259295,
  '.': 0.06279082985945543,
  'assur': 1.414973347970818,
  'becaus': 1.2388820889151366,
  'book': 1.7160033436347992,
  'contest': 1.7160033436347992,
  'dream': 0.9378520932511555,
  'harder': 1.7160033436347992,
  'less': 0.9378520932511555,
  'may': 1.414973347970818,
  'read': 1.2388820889151366,
  'realis': 1.

**6. Calculate TF-IDF and Generate a Matrix**

Now that we have both matrices, the next step is straightforward.

TF-IDF is the product of the two measures, TF and IDF.

In simple terms, we multiply the corresponding values from the two matrices to generate a new matrix.
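
As a small illustration of this element-wise product, the sketch below multiplies two made-up tables for a single document, matching entries by word rather than by position so that dictionary order does not matter.

```python
# Made-up TF and IDF tables for one document; the IDF table is
# deliberately listed in a different order.
tf_table = {"python": 0.05, "the": 0.10}
idf_table = {"the": 0.0, "python": 4.0}

# Look up each word's IDF by key and multiply it with the word's TF.
tf_idf_table = {word: tf * idf_table[word] for word, tf in tf_table.items()}

print(tf_idf_table)
```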

In [0]:
def create_tf_idf_matrix(tf_matrix, idf_matrix):
    tf_idf_matrix = {}

    for sent, tf_table in tf_matrix.items():
        # Both matrices are built from the same frequency matrix, so they share keys.
        idf_table = idf_matrix[sent]

        tf_idf_table = {}

        for word, tf_value in tf_table.items():
            # Look up the IDF by word instead of relying on matching dict order.
            tf_idf_table[word] = float(tf_value * idf_table[word])

        tf_idf_matrix[sent] = tf_idf_table

    return tf_idf_matrix

In [11]:
tf_matrix = create_tf_matrix(freqMatrix)
count_doc_per_words = create_documents_per_words(freqMatrix)
idf_matrix = create_idf_matrix(freqMatrix, count_doc_per_words, len(sentences))

create_tf_idf_matrix(tf_matrix, idf_matrix)

{'And that’s fine': {',': 0.037358066281259296,
  '.': 0.006279082985945543,
  'content': 0.17160033436347993,
  'fine': 0.17160033436347993,
  'later': 0.1414973347970818,
  'less': 0.09378520932511555,
  'long': 0.12388820889151367,
  'receiv': 0.1414973347970818,
  'regret': 0.1414973347970818,
  '’': 0.15356500829366232},
 'Are you willing': {'?': 0.09378520932511555,
  'commit': 0.1414973347970818,
  'failur': 0.11139433523068368,
  'first': 0.17160033436347993,
  'jump': 0.17160033436347993,
  'life': 0.10170333392987803,
  'ship': 0.17160033436347993,
  'sign': 0.1414973347970818,
  'thi': 0.11139433523068368,
  'way': 0.1414973347970818},
 'Because I assur': {',': 0.06592599931986935,
  '.': 0.00369357822702679,
  'assur': 0.08323372635122459,
  'becaus': 0.07287541699500803,
  'book': 0.10094137315498819,
  'contest': 0.10094137315498819,
  'dream': 0.05516777019124444,
  'harder': 0.10094137315498819,
  'less': 0.05516777019124444,
  'may': 0.08323372635122459,
  'read': 0.07

**7. Score the sentences**

How a sentence is scored differs between algorithms. Here, we use the TF-IDF scores of the words in a sentence to give the sentence a weight.
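
To see the scoring on a single sentence, the sketch below averages the TF-IDF values for "Commit to it.", using the TF (0.5 each) and IDF values printed in the earlier steps. The result should agree with the score listed for that sentence in this step's output.

```python
# TF-IDF table for "Commit to it.", built from the TF and IDF values
# printed in the earlier steps (TF is 0.5 for both entries).
tf_idf_table = {
    ".": 0.5 * 0.06279082985945543,
    "commit": 0.5 * 1.414973347970818,
}

# Sentence score: the average TF-IDF over the entries of the table.
score = sum(tf_idf_table.values()) / len(tf_idf_table)

print(round(score, 4))
```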


In [0]:
def score_sentences(tf_idf_matrix) -> dict:
    """
    Score a sentence by the TF-IDF values of its words.
    Basic algorithm: add the TF-IDF score of every term in the sentence's table and divide by the number of distinct terms in that table.
    :rtype: dict
    """

    sentenceValue = {}

    for sent, f_table in tf_idf_matrix.items():
        total_score_per_sentence = 0

        count_words_in_sentence = len(f_table)
        for word, score in f_table.items():
            total_score_per_sentence += score

        sentenceValue[sent] = total_score_per_sentence / count_words_in_sentence

    return sentenceValue

This gives a table of sentences and their respective scores:

In [13]:
tf_idf_matrix = create_tf_idf_matrix(tf_matrix, idf_matrix)
score_sentences(tf_idf_matrix)

{'And that’s fine': 0.11825682488957016,
 'Are you willing': 0.13575702211980462,
 'Because I assur': 0.07822442210598175,
 'Commit to it.': 0.36944104445756837,
 'Consider the ad': 0.06407040138894513,
 'Could you be yo': 0.30003341888505486,
 'Don’t leave it ': 0.1076945290795475,
 'Don’t leave you': 0.17869646866557684,
 'Each person has': 0.17094978895160579,
 'Even more than ': 0.050422705224886254,
 'Focus on your d': 0.0910346091504206,
 'For others, at ': 0.13562338250422484,
 'Gnaw away at yo': 0.18068695741527235,
 'Have you experi': 0.3239232585727256,
 'However, I real': 0.09203831532832171,
 'However, it’s i': 0.08732864933977157,
 'I can’t tell yo': 0.12383203821623005,
 'If you have not': 0.11932249535494273,
 'If you leave it': 0.12784187000038838,
 'If you settle f': 0.17196930858836237,
 'It must come fr': 0.28804630433974315,
 'It was at that ': 0.2233540987973747,
 'It was the 19th': 0.04074504992847192,
 'It’s a fact, if': 0.05533828611748099,
 'I’m amused when': 0

**8. Find the Threshold**

As with any summarization algorithm, there are different ways to calculate a threshold value. Here we use the average sentence score.
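
The average can also be computed with `statistics.mean` from the standard library. This is just an equivalent sketch on three rounded, illustrative scores, not a replacement for the function used in this step.

```python
from statistics import mean

# Three illustrative sentence scores (rounded values, for the example only).
sentence_scores = {
    "Commit to it.": 0.369,
    "It was the 19th": 0.041,
    "Have you experi": 0.324,
}

# The threshold is the plain arithmetic mean of the scores.
threshold = mean(list(sentence_scores.values()))

print(round(threshold, 4))
```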

In [0]:
def find_average_score(sentenceValue) -> float:
    """
    Find the average score from the sentence value dictionary
    :rtype: float
    """
    sumValues = 0
    for entry in sentenceValue:
        sumValues += sentenceValue[entry]

    # Average value of a sentence from original summary_text
    average = (sumValues / len(sentenceValue))

    return average

We get the following score as an average:

In [15]:
sentence_scores = score_sentences(tf_idf_matrix)
find_average_score(sentence_scores)

0.15647933364110622

**9. Generate the summary**

Algorithm: Select a sentence for the summary if its score is greater than or equal to the threshold, which is derived from the average score.
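
The selection rule can be sketched on a toy example as follows. The sentences, rounded scores and threshold are illustrative values; the lookup by the first 15 characters mirrors how the scores are keyed in this walkthrough.

```python
sentences = [
    "Commit to it.",
    "It was the 19th century.",
    "Have you experienced this before?",
]

# Scores keyed by the first 15 characters of each sentence (illustrative values).
scores = {
    "Commit to it.": 0.369,
    "It was the 19th": 0.041,
    "Have you experi": 0.324,
}
threshold = 0.2

# Keep the sentences whose score reaches the threshold, in original order.
selected = [s for s in sentences if scores.get(s[:15], 0.0) >= threshold]
summary = " ".join(selected)

print(summary)
```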


In [0]:
def generate_summary(sentences, sentenceValue, threshold):
    summary = ''

    for sentence in sentences:
        # Scores are keyed by the first 15 characters of each sentence.
        key = sentence[:15]
        if key in sentenceValue and sentenceValue[key] >= threshold:
            summary += " " + sentence

    return summary

**Putting Everything Together**

For the threshold, we’ve used 1.3 times the average score. You can adjust this multiplier to make the summary shorter or longer.
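
The effect of the multiplier can be sketched with a handful of made-up scores: the higher the multiplier, the fewer sentences pass the threshold.

```python
# Made-up sentence scores, for illustration only.
scores = [0.37, 0.30, 0.22, 0.12, 0.05]
average = sum(scores) / len(scores)

# Count how many sentences survive for each multiplier.
kept_counts = {}
for multiplier in (1.0, 1.3, 1.5):
    kept = [s for s in scores if s >= multiplier * average]
    kept_counts[multiplier] = len(kept)

print(kept_counts)
```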

In [0]:
    
'''
We already have a sentence tokenizer, so we just need 
to run the sent_tokenize() method to create the array of sentences.
'''
# 1 Sentence Tokenize
sentences = sent_tokenize(text)
total_documents = len(sentences)
#print(sentences)

# 2 Create the Frequency matrix of the words in each sentence.
freq_matrix = create_frequency_matrix(sentences)
#print(freq_matrix)

'''
Term frequency (TF) is how often a word appears in a document, divided by how many words are there in a document.
'''
# 3 Calculate TermFrequency and generate a matrix
tf_matrix = create_tf_matrix(freq_matrix)
#print(tf_matrix)

# 4 creating table for documents per words
count_doc_per_words = create_documents_per_words(freq_matrix)
#print(count_doc_per_words)

'''
Inverse document frequency (IDF) is how unique or rare a word is.
'''
# 5 Calculate IDF and generate a matrix
idf_matrix = create_idf_matrix(freq_matrix, count_doc_per_words, total_documents)
#print(idf_matrix)

# 6 Calculate TF-IDF and generate a matrix
tf_idf_matrix = create_tf_idf_matrix(tf_matrix, idf_matrix)
#print(tf_idf_matrix)

# 7 Important Algorithm: score the sentences
sentence_scores = score_sentences(tf_idf_matrix)
#print(sentence_scores)

# 8 Find the threshold
threshold = find_average_score(sentence_scores)
#print(threshold)

# 9 Important Algorithm: Generate the summary
summary = generate_summary(sentences, sentence_scores, 1.3 * threshold)


**Here is the original text:**


In [18]:
print("\n".join(textwrap.wrap(text,75)))

Those Who Are Resilient Stay In The Game Longer “On the mountains of truth
you can never climb in vain: either you will reach a point higher up today,
or you will be training your powers so that you will be able to climb
higher tomorrow.” — Friedrich Nietzsche Challenges and setbacks are not
meant to defeat you, but promote you. However, I realise after many years
of defeats, it can crush your spirit and it is easier to give up than risk
further setbacks and disappointments. Have you experienced this before? To
be honest, I don’t have the answers. I can’t tell you what the right course
of action is; only you will know. However, it’s important not to be
discouraged by failure when pursuing a goal or a dream, since failure
itself means different things to different people. To a person with a Fixed
Mindset failure is a blow to their self-esteem, yet to a person with a
Growth Mindset, it’s an opportunity to improve and find new ways to
overcome their obstacles. Same failure, yet different 

**Here is the summary of the text:**

In [19]:
print("\n".join(textwrap.wrap(summary,75)))

 Have you experienced this before? Who is right and who is wrong? Neither.
It was at that point their biggest breakthrough came. Perhaps all those
years of perseverance finally paid off. It must come from within you. Where
are you settling in your life right now? Could you be you playing for
bigger stakes than you are? So become intentional on what you want out of
life. Commit to it. Nurture your dreams.
