<a href="https://colab.research.google.com/github/YD140/Extractive-Text-Summarization/blob/main/EXTRACTIVE_TEXT_SUMMARIZATION.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Extractive Summarization is a method of generating a summary by selecting important sentences from the original text based on statistical and semantic approaches without altering the original meaning.**

# **Importing Libraries**

In [106]:
import nltk
nltk.download('punkt')
nltk.download('stopwords')
import numpy as np
import re
import heapq


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


# **Data Pre-Processing**

In [91]:
article_text = """
Maria Sharapova has basically no friends as tennis players on the WTA Tour. The Russian player has no problems in openly speaking about it and in a recent interview she said: ‘I don’t really hide any feelings too much.
I think everyone knows this is my job here. When I’m on the courts or when I’m on the court playing, I’m a competitor and I want to beat every single person whether they’re in the locker room or across the net.
So I’m not the one to strike up a conversation about the weather and know that in the next few minutes I have to go and try to win a tennis match.
I’m a pretty competitive girl. I say my hellos, but I’m not sending any players flowers as well. Uhm, I’m not really friendly or close to many players.
I have not a lot of friends away from the courts.’ When she said she is not really close to a lot of players, is that something strategic that she is doing? Is it different on the men’s tour than the women’s tour? ‘No, not at all.
I think just because you’re in the same sport doesn’t mean that you have to be friends with everyone just because you’re categorized, you’re a tennis player, so you’re going to get along with tennis players.
I think every person has different interests. I have friends that have completely different jobs and interests, and I’ve met them in very different parts of my life.
I think everyone just thinks because we’re tennis players we should be the greatest of friends. But ultimately tennis is just a very small part of what we do.
There are so many other things that we’re interested in, that we do.’
"""

In [92]:
# Split the article into units, one per line of text.
# Note: this splits on newlines, so one unit may contain several sentences.
def split_to_sentences(article_text):
    # Break the text at every newline
    sentences = article_text.split('\n')

    # Remove leading and trailing whitespace from each line
    sentences = [s.strip() for s in sentences]

    # Drop the empty strings ('') left over from blank lines
    sentences = [s for s in sentences if len(s) > 0]

    return sentences

In [93]:
split_to_sentences(article_text)

['Maria Sharapova has basically no friends as tennis players on the WTA Tour. The Russian player has no problems in openly speaking about it and in a recent interview she said: ‘I don’t really hide any feelings too much.',
 'I think everyone knows this is my job here. When I’m on the courts or when I’m on the court playing, I’m a competitor and I want to beat every single person whether they’re in the locker room or across the net.',
 'So I’m not the one to strike up a conversation about the weather and know that in the next few minutes I have to go and try to win a tennis match.',
 'I’m a pretty competitive girl. I say my hellos, but I’m not sending any players flowers as well. Uhm, I’m not really friendly or close to many players.',
 'I have not a lot of friends away from the courts.’ When she said she is not really close to a lot of players, is that something strategic that she is doing? Is it different on the men’s tour than the women’s tour? ‘No, not at all.',
 'I think just becau

In [94]:
# Test the function on a small example
x = " I like to play cricket.\nI also like to watch it.\nI am a big fan of cricket. "
print(x)

split_to_sentences(x)

 I like to play cricket.
I also like to watch it.
I am a big fan of cricket. 


['I like to play cricket.',
 'I also like to watch it.',
 'I am a big fan of cricket.']
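
`split_to_sentences` breaks the text at newlines, so a line that holds several sentences is kept as one unit. As a rough sketch of sentence-level splitting, the hypothetical helper below (`split_on_sentence_ends` is not part of the notebook) splits after `.`, `!` or `?` when whitespace follows. It is only a heuristic: abbreviations such as "Dr." would be split wrongly, and a period followed by a closing quote is not treated as a sentence end. NLTK's `nltk.sent_tokenize` is the more robust option if the `punkt` data is available.

```python
import re

def split_on_sentence_ends(text):
    """Split text after '.', '!' or '?' followed by whitespace (a rough heuristic)."""
    # Collapse newlines and repeated spaces so that line breaks do not matter.
    flat = re.sub(r"\s+", " ", text).strip()
    if not flat:
        return []
    # The lookbehind keeps the punctuation attached to the sentence it ends.
    return [s for s in re.split(r"(?<=[.!?])\s+", flat) if s]

sample = " I like to play cricket. I also like to watch it.\nI am a big fan of cricket. "
print(split_on_sentence_ends(sample))
# ['I like to play cricket.', 'I also like to watch it.', 'I am a big fan of cricket.']
```
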

# **Tokenization**

In [95]:
# Tokenization

def tokenize_sentences(sentences):
    # Each entry of this list will hold the tokens of one sentence
    tokenized_sentences = []

    for sentence in sentences:
        # Lowercase so that 'Tennis' and 'tennis' count as the same word
        sentence = sentence.lower()

        # Split the sentence into word and punctuation tokens
        tokenized = nltk.word_tokenize(sentence)

        # Store this sentence's token list
        tokenized_sentences.append(tokenized)

    return tokenized_sentences

In [96]:
tokenize_sentences(split_to_sentences(article_text))

[['maria',
  'sharapova',
  'has',
  'basically',
  'no',
  'friends',
  'as',
  'tennis',
  'players',
  'on',
  'the',
  'wta',
  'tour',
  '.',
  'the',
  'russian',
  'player',
  'has',
  'no',
  'problems',
  'in',
  'openly',
  'speaking',
  'about',
  'it',
  'and',
  'in',
  'a',
  'recent',
  'interview',
  'she',
  'said',
  ':',
  '‘',
  'i',
  'don',
  '’',
  't',
  'really',
  'hide',
  'any',
  'feelings',
  'too',
  'much',
  '.'],
 ['i',
  'think',
  'everyone',
  'knows',
  'this',
  'is',
  'my',
  'job',
  'here',
  '.',
  'when',
  'i',
  '’',
  'm',
  'on',
  'the',
  'courts',
  'or',
  'when',
  'i',
  '’',
  'm',
  'on',
  'the',
  'court',
  'playing',
  ',',
  'i',
  '’',
  'm',
  'a',
  'competitor',
  'and',
  'i',
  'want',
  'to',
  'beat',
  'every',
  'single',
  'person',
  'whether',
  'they',
  '’',
  're',
  'in',
  'the',
  'locker',
  'room',
  'or',
  'across',
  'the',
  'net',
  '.'],
 ['so',
  'i',
  '’',
  'm',
  'not',
  'the',
  'one',
  'to

In [97]:
# Test the function on a small example that includes special characters
x = "I like to play cricket.\nI also like $ to watch it.\nI am a big &/ fan of cricket. "
print(x)

tokenize_sentences(split_to_sentences(x))

I like to play cricket.
I also like $ to watch it.
I am a big &/ fan of cricket. 


[['i', 'like', 'to', 'play', 'cricket', '.'],
 ['i', 'also', 'like', '$', 'to', 'watch', 'it', '.'],
 ['i', 'am', 'a', 'big', '&', '/', 'fan', 'of', 'cricket', '.']]
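
The token lists above still contain punctuation and very common words, and the notebook downloads the NLTK `stopwords` package without using it. A minimal sketch of a filtering step is shown below, under stated assumptions: `keep_content_tokens` and `EXAMPLE_STOP_WORDS` are hypothetical names, and the tiny stop-word set is for illustration only. In practice the set could come from `nltk.corpus.stopwords.words('english')`.

```python
# Hypothetical, deliberately tiny stop-word list for illustration only.
EXAMPLE_STOP_WORDS = {"i", "to", "a", "an", "of", "it", "the", "and", "also"}

def keep_content_tokens(tokenized_sentences, stop_words=EXAMPLE_STOP_WORDS):
    """Drop stop words and tokens that contain no letter or digit."""
    filtered = []
    for tokens in tokenized_sentences:
        kept = [
            tok for tok in tokens
            # Punctuation such as '.', '$' or a curly quote has no alphanumeric character.
            if tok not in stop_words and any(ch.isalnum() for ch in tok)
        ]
        filtered.append(kept)
    return filtered

tokens = [
    ["i", "like", "to", "play", "cricket", "."],
    ["i", "also", "like", "$", "to", "watch", "it", "."],
]
print(keep_content_tokens(tokens))
# [['like', 'play', 'cricket'], ['like', 'watch']]
```

Filtering before counting would keep tokens such as `.` or `the` from dominating the frequency table.
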

# **Count Word Frequency**

In [98]:
def count_words(tokenized_sentences):
    # Maps each token to the number of times it occurs
    word_counts = {}

    for sentence in tokenized_sentences:
        # Go through each token in the sentence
        for token in sentence:
            if token not in word_counts:
                word_counts[token] = 1
            else:
                word_counts[token] += 1

    return word_counts
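
The same counting can be written with `collections.Counter` from the standard library. This is an equivalent sketch, not the notebook's own code, and `count_words_with_counter` is a hypothetical name.

```python
from collections import Counter

def count_words_with_counter(tokenized_sentences):
    """Count how often each token occurs across all token lists."""
    counts = Counter()
    for sentence in tokenized_sentences:
        # update() adds one to the count of every token in the list.
        counts.update(sentence)
    # Convert back to a plain dict so it can be used like word_counts.
    return dict(counts)

example = [["i", "like", "cricket"], ["i", "like", "it"]]
print(count_words_with_counter(example))
# {'i': 2, 'like': 2, 'cricket': 1, 'it': 1}
```
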

In [99]:
word_counts = count_words(tokenize_sentences(split_to_sentences(article_text)))
word_counts

{'maria': 1,
 'sharapova': 1,
 'has': 3,
 'basically': 1,
 'no': 3,
 'friends': 5,
 'as': 2,
 'tennis': 6,
 'players': 6,
 'on': 4,
 'the': 14,
 'wta': 1,
 'tour': 3,
 '.': 15,
 'russian': 1,
 'player': 2,
 'problems': 1,
 'in': 7,
 'openly': 1,
 'speaking': 1,
 'about': 2,
 'it': 2,
 'and': 6,
 'a': 9,
 'recent': 1,
 'interview': 1,
 'she': 4,
 'said': 2,
 ':': 1,
 '‘': 2,
 'i': 18,
 'don': 1,
 '’': 21,
 't': 2,
 'really': 3,
 'hide': 1,
 'any': 2,
 'feelings': 1,
 'too': 1,
 'much': 1,
 'think': 4,
 'everyone': 3,
 'knows': 1,
 'this': 1,
 'is': 6,
 'my': 3,
 'job': 1,
 'here': 1,
 'when': 3,
 'm': 7,
 'courts': 1,
 'or': 3,
 'court': 1,
 'playing': 1,
 ',': 9,
 'competitor': 1,
 'want': 1,
 'to': 8,
 'beat': 1,
 'every': 2,
 'single': 1,
 'person': 2,
 'whether': 1,
 'they': 1,
 're': 7,
 'locker': 1,
 'room': 1,
 'across': 1,
 'net': 1,
 'so': 3,
 'not': 6,
 'one': 1,
 'strike': 1,
 'up': 1,
 'conversation': 1,
 'weather': 1,
 'know': 1,
 'that': 7,
 'next': 1,
 'few': 1,
 'minutes

In [100]:
# Replace raw counts with weighted (normalized) frequencies
maximum_frequency = max(word_counts.values())

for word in word_counts.keys():
    # Divide each count by the highest count so that every value lies in (0, 1]
    word_counts[word] = word_counts[word] / maximum_frequency

**In the step above, each word's count is divided by the maximum count, which rescales every frequency to the range (0, 1].**

**Consistency:** Texts differ in length and in raw word counts, so normalizing puts their frequencies on the same scale.  
**Relative Importance:** A normalized frequency shows how often a word appears relative to the most frequent token, rather than just its raw count. Here the most frequent token is the apostrophe `’` (21 occurrences), because punctuation and stop words are not filtered out.  
**Input for Algorithms:** Scaled values keep the sentence scores computed from them comparable across texts, and some machine learning models may work better with scaled inputs than with raw counts.
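
As a small worked example of this normalization, the sketch below uses a few of the raw counts shown earlier (`’`: 21, `i`: 18, `tennis`: 6, `maria`: 1). `normalize_counts` is a hypothetical helper that returns a new dictionary instead of overwriting `word_counts` in place.

```python
def normalize_counts(counts):
    """Return a new dict with every count divided by the largest count."""
    if not counts:
        return {}
    highest = max(counts.values())
    return {word: count / highest for word, count in counts.items()}

raw = {"’": 21, "i": 18, "tennis": 6, "maria": 1}
weighted = normalize_counts(raw)
print(weighted["’"])                 # 1.0, the most frequent token
print(round(weighted["i"], 4))       # 0.8571, i.e. 18 / 21
print(round(weighted["tennis"], 4))  # 0.2857, i.e. 6 / 21
print(round(weighted["maria"], 4))   # 0.0476, i.e. 1 / 21
```
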

In [101]:
word_counts

{'maria': 0.047619047619047616,
 'sharapova': 0.047619047619047616,
 'has': 0.14285714285714285,
 'basically': 0.047619047619047616,
 'no': 0.14285714285714285,
 'friends': 0.23809523809523808,
 'as': 0.09523809523809523,
 'tennis': 0.2857142857142857,
 'players': 0.2857142857142857,
 'on': 0.19047619047619047,
 'the': 0.6666666666666666,
 'wta': 0.047619047619047616,
 'tour': 0.14285714285714285,
 '.': 0.7142857142857143,
 'russian': 0.047619047619047616,
 'player': 0.09523809523809523,
 'problems': 0.047619047619047616,
 'in': 0.3333333333333333,
 'openly': 0.047619047619047616,
 'speaking': 0.047619047619047616,
 'about': 0.09523809523809523,
 'it': 0.09523809523809523,
 'and': 0.2857142857142857,
 'a': 0.42857142857142855,
 'recent': 0.047619047619047616,
 'interview': 0.047619047619047616,
 'she': 0.19047619047619047,
 'said': 0.09523809523809523,
 ':': 0.047619047619047616,
 '‘': 0.09523809523809523,
 'i': 0.8571428571428571,
 'don': 0.047619047619047616,
 '’': 1.0,
 't': 0.09523

# **Sentence Score**

In [102]:
# Assign a score to each sentence

sentence_scores = {}
for sent in split_to_sentences(article_text):
    # Only score sentences with fewer than 30 space-separated words
    if len(sent.split(' ')) >= 30:
        continue

    # Lowercase the sentence, then tokenize it
    for word in nltk.word_tokenize(sent.lower()):
        # Add the weighted frequency of every known word to the sentence's score
        if word in word_counts:
            sentence_scores[sent] = sentence_scores.get(sent, 0) + word_counts[word]

In [103]:
sentence_scores

{'I’m a pretty competitive girl. I say my hellos, but I’m not sending any players flowers as well. Uhm, I’m not really friendly or close to many players.': 13.76190476190476,
 'I think every person has different interests. I have friends that have completely different jobs and interests, and I’ve met them in very different parts of my life.': 9.476190476190471,
 'I think everyone just thinks because we’re tennis players we should be the greatest of friends. But ultimately tennis is just a very small part of what we do.': 8.857142857142854,
 'There are so many other things that we’re interested in, that we do.’': 5.523809523809524}
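
The scoring step can be sketched as a standalone function. The version below is an illustration under stated assumptions: `score_sentences` is a hypothetical name, the weights are made up, and a simple regex tokenizer stands in for `nltk.word_tokenize` so that the example runs without NLTK data.

```python
import re

def score_sentences(sentences, weights, max_words=30):
    """Sum the weights of known words for each sufficiently short sentence."""
    scores = {}
    for sent in sentences:
        # Same length filter as the notebook: skip sentences of max_words or more.
        if len(sent.split(" ")) >= max_words:
            continue
        # Assumption: a simple regex tokenizer replaces nltk.word_tokenize here.
        for word in re.findall(r"\w+", sent.lower()):
            if word in weights:
                scores[sent] = scores.get(sent, 0.0) + weights[word]
    return scores

example_weights = {"tennis": 1.0, "players": 0.5, "friends": 0.5}
example_sentences = ["Tennis players are friends.", "Tennis is fun."]
print(score_sentences(example_sentences, example_weights))
# {'Tennis players are friends.': 2.0, 'Tennis is fun.': 1.0}
```

Because a score is a sum over tokens, longer units tend to score higher. In the notebook's table, tokens such as `.` (0.71) and `’` (1.0) also add to every score, since punctuation is not removed before counting.
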

# **Generate Summary**

In [105]:
# Take up to 10 of the highest-scoring sentences; only 4 were scored, so all of them are returned
summary_sentences = heapq.nlargest(10, sentence_scores, key=sentence_scores.get)

summary = ' '.join(summary_sentences)
summary

'I’m a pretty competitive girl. I say my hellos, but I’m not sending any players flowers as well. Uhm, I’m not really friendly or close to many players. I think every person has different interests. I have friends that have completely different jobs and interests, and I’ve met them in very different parts of my life. I think everyone just thinks because we’re tennis players we should be the greatest of friends. But ultimately tennis is just a very small part of what we do. There are so many other things that we’re interested in, that we do.’'
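
`heapq.nlargest` returns the selected sentences in descending score order, which may differ from their order in the article. A minimal sketch of one possible refinement follows: pick the top `n` and then restore document order. `pick_summary` is a hypothetical helper, and it assumes the scores dictionary was filled in document order, as it is in the scoring cell.

```python
import heapq

def pick_summary(sentence_scores, n):
    """Join the n highest-scoring sentences in their original order."""
    top = heapq.nlargest(n, sentence_scores, key=sentence_scores.get)
    # Dicts preserve insertion order, so enumerate() recovers document positions.
    position = {sent: i for i, sent in enumerate(sentence_scores)}
    return " ".join(sorted(top, key=position.get))

example_scores = {"A first.": 2.0, "B second.": 1.0, "C third.": 3.0}
print(pick_summary(example_scores, 2))
# A first. C third.
```

Choosing `n` smaller than the number of scored sentences is what makes the result shorter than the input. With `n = 10` and four scored sentences, every scored sentence ends up in the summary.
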