
## Importing Libraries

In [None]:
import nltk
import string
import re
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords, wordnet
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.util import ngrams
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from nltk.tag import pos_tag

In [None]:
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

## 1: Preprocessing


In [None]:
# Sample document
document = """
Text summarization is the process of distilling the most important information from a source to produce a concise summary.
There are two main approaches to text summarization: extractive and abstractive summarization.
Extractive summarization involves selecting and combining existing sentences or phrases from the source document.
Abstractive summarization, on the other hand, generates new sentences to convey the key points of the source text.
TextRank is a popular algorithm for extractive text summarization, which ranks sentences based on their importance in the document.
"""

In [None]:
def preprocess(text):
    # Convert text to lowercase
    text = text.lower()
    # Remove punctuation
    text = re.sub(r'[^\w\s]', '', text)
    # Tokenize text into words
    words = word_tokenize(text)
    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    words = [word for word in words if word not in stop_words]
    # Lemmatize words
    lemmatizer = WordNetLemmatizer()
    words = [lemmatizer.lemmatize(word) for word in words]
    return ' '.join(words)

preprocessed_text = preprocess(document)
print("1. Preprocessed Text:")
print(preprocessed_text)

1. Preprocessed Text:
text summarization process distilling important information source produce concise summary two main approach text summarization extractive abstractive summarization extractive summarization involves selecting combining existing sentence phrase source document abstractive summarization hand generates new sentence convey key point source text textrank popular algorithm extractive text summarization rank sentence based importance document
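
One detail of `preprocess` worth seeing in isolation: the pattern `[^\w\s]` deletes punctuation without leaving a space, so hyphenated or slash-separated words fuse into a single token. The sketch below uses a made-up sentence to show the effect. The space-replacing variant is an assumption offered for comparison, not the notebook's choice.

```python
import re

text = "state-of-the-art models don't split sentences/phrases."

# Pattern used in preprocess(): punctuation is deleted, so neighbouring words fuse
deleted = re.sub(r'[^\w\s]', '', text.lower())

# Alternative (an assumption, not the notebook's choice): replace with a space,
# then collapse the repeated whitespace this leaves behind
spaced = re.sub(r'[^\w\s]', ' ', text.lower())
spaced = re.sub(r'\s+', ' ', spaced).strip()

print(deleted)
print(spaced)
```

Which variant is preferable depends on the corpus; the spaced form keeps word boundaries but also splits contractions such as "don't" into two fragments.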


## 2: N-gram Modeling

In [None]:
def generate_ngrams(text, n=2):
    words = word_tokenize(text)
    n_grams = list(ngrams(words, n))
    return [' '.join(gram) for gram in n_grams]

ngrams_text = generate_ngrams(preprocessed_text)
print("\n2. N-grams:")
for gram in ngrams_text:
    print(gram)


2. N-grams:
text summarization
summarization process
process distilling
distilling important
important information
information source
source produce
produce concise
concise summary
summary two
two main
main approach
approach text
text summarization
summarization extractive
extractive abstractive
abstractive summarization
summarization extractive
extractive summarization
summarization involves
involves selecting
selecting combining
combining existing
existing sentence
sentence phrase
phrase source
source document
document abstractive
abstractive summarization
summarization hand
hand generates
generates new
new sentence
sentence convey
convey key
key point
point source
source text
text textrank
textrank popular
popular algorithm
algorithm extractive
extractive text
text summarization
summarization rank
rank sentence
sentence based
based importance
importance document
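
The bigram list above contains repeats such as `text summarization`. As a minimal sketch using only the standard library (not part of the original notebook), the same sliding window can be built with `zip` and the repeats tallied with `collections.Counter`. The helper name `bigram_counts` and the shortened sample string are hypothetical.

```python
from collections import Counter

def bigram_counts(text, n=2):
    """Count n-grams over whitespace-separated tokens (hypothetical helper)."""
    words = text.split()
    # Zipping n staggered views of the token list yields each window of n words
    grams = zip(*(words[i:] for i in range(n)))
    return Counter(' '.join(gram) for gram in grams)

# Shortened sample for illustration, not the full preprocessed document
sample = ("text summarization process approach text summarization "
          "extractive abstractive summarization")
counts = bigram_counts(sample)
print(counts.most_common(2))
```

Counting the n-grams instead of printing each one makes frequent phrases easy to spot on longer documents.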


## 3: Porter Stemmer

In [None]:
def porter_stem(text):
    porter = PorterStemmer()
    return ' '.join(porter.stem(word) for word in word_tokenize(text))

stemmed_text = porter_stem(preprocessed_text)
print("\n3. Stemmed Text:")
print(stemmed_text)


3. Stemmed Text:
text summar process distil import inform sourc produc concis summari two main approach text summar extract abstract summar extract summar involv select combin exist sentenc phrase sourc document abstract summar hand gener new sentenc convey key point sourc text textrank popular algorithm extract text summar rank sentenc base import document


## 4: POS Tagging


In [None]:
def pos_tagging(text):
    tagged_words = pos_tag(word_tokenize(text))
    return tagged_words

pos_tagged_text = pos_tagging(preprocessed_text)
print("\n4. POS Tagging:")
print(pos_tagged_text)


4. POS Tagging:
[('text', 'JJ'), ('summarization', 'NN'), ('process', 'NN'), ('distilling', 'VBG'), ('important', 'JJ'), ('information', 'NN'), ('source', 'NN'), ('produce', 'VBP'), ('concise', 'NN'), ('summary', 'JJ'), ('two', 'CD'), ('main', 'JJ'), ('approach', 'NN'), ('text', 'NN'), ('summarization', 'NN'), ('extractive', 'JJ'), ('abstractive', 'JJ'), ('summarization', 'NN'), ('extractive', 'JJ'), ('summarization', 'NN'), ('involves', 'VBZ'), ('selecting', 'VBG'), ('combining', 'VBG'), ('existing', 'VBG'), ('sentence', 'NN'), ('phrase', 'NN'), ('source', 'NN'), ('document', 'NN'), ('abstractive', 'JJ'), ('summarization', 'NN'), ('hand', 'NN'), ('generates', 'VBZ'), ('new', 'JJ'), ('sentence', 'NN'), ('convey', 'NN'), ('key', 'JJ'), ('point', 'NN'), ('source', 'NN'), ('text', 'NN'), ('textrank', 'NN'), ('popular', 'JJ'), ('algorithm', 'NN'), ('extractive', 'JJ'), ('text', 'NN'), ('summarization', 'NN'), ('rank', 'NN'), ('sentence', 'NN'), ('based', 'VBN'), ('importance', 'NN'), ('do

## 5: TF-IDF Vectorization


In [None]:
def tfidf_vectorization(documents):
    tfidf_vectorizer = TfidfVectorizer()
    tfidf_matrix = tfidf_vectorizer.fit_transform(documents)
    return tfidf_vectorizer, tfidf_matrix

tfidf_vectorizer, tfidf_matrix = tfidf_vectorization([preprocessed_text])
print("\n5. TF-IDF Vectorization (Vectorized Document):")
print(tfidf_matrix.toarray())


5. TF-IDF Vectorization (Vectorized Document):
[[0.18731716 0.09365858 0.09365858 0.09365858 0.09365858 0.09365858
  0.09365858 0.09365858 0.18731716 0.09365858 0.28097574 0.09365858
  0.09365858 0.09365858 0.09365858 0.09365858 0.09365858 0.09365858
  0.09365858 0.09365858 0.09365858 0.09365858 0.09365858 0.09365858
  0.09365858 0.09365858 0.09365858 0.28097574 0.28097574 0.56195149
  0.09365858 0.37463432 0.09365858 0.09365858]]
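
Because the vectorizer is fitted on a single document, every term gets the same inverse document frequency, so the values above reduce to L2-normalized term counts. The sketch below reproduces them with the standard library, assuming scikit-learn's default settings (smoothed IDF, L2 normalization). `single_doc_tfidf` is a hypothetical helper, not part of the notebook.

```python
import math
from collections import Counter

def single_doc_tfidf(text):
    """TF-IDF weights for a corpus of one document (hypothetical helper).

    With one document, the smoothed IDF ln((1 + 1) / (1 + 1)) + 1 equals 1
    for every term, so only the L2-normalized counts remain.
    """
    counts = Counter(text.split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {term: c / norm for term, c in sorted(counts.items())}

# Preprocessed text copied from the output of step 1
preprocessed_text = (
    "text summarization process distilling important information source "
    "produce concise summary two main approach text summarization "
    "extractive abstractive summarization extractive summarization "
    "involves selecting combining existing sentence phrase source document "
    "abstractive summarization hand generates new sentence convey key point "
    "source text textrank popular algorithm extractive text summarization "
    "rank sentence based importance document"
)

weights = single_doc_tfidf(preprocessed_text)
for term in ("summarization", "text", "extractive", "abstractive", "process"):
    print(f"{term}: {weights[term]:.8f}")
```

The practical consequence is that a one-document fit cannot down-weight common words. Fitting the vectorizer on several documents (or on individual sentences) would be needed for the IDF part to have any effect.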


## 6: WordNet

In [None]:
def wordnet_disambiguation(word, pos):
    synsets = wordnet.synsets(word, pos=pos)
    if synsets:
        return synsets[0].definition()
    else:
        return "No suitable synset found."

word = input("Enter a word for WordNet disambiguation: ")
pos = input("Enter the part of speech tag for the word (e.g., 'n' for noun, 'v' for verb): ")
print("WordNet Disambiguation:")
print(wordnet_disambiguation(word, pos))

Enter a word for WordNet disambiguation: rank
Enter the part of speech tag for the word (e.g., 'n' for noun, 'v' for verb): v
WordNet Disambiguation:
take or have a position relative to others


## 7: Word Sense Disambiguation

In [None]:
def word_sense_disambiguation(sentence):
    disambiguated_sentence = []
    for word, tag in pos_tag(word_tokenize(sentence)):
        if tag.startswith('N'):
            # Nouns: substitute the definition of the first noun synset.
            # The second argument must be a WordNet POS tag, not the sentence.
            disambiguated_sentence.append(wordnet_disambiguation(word, wordnet.NOUN))
        else:
            disambiguated_sentence.append(word)
    return ' '.join(disambiguated_sentence)
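
The helper `wordnet_disambiguation` returns the first synset it finds, so the surrounding words never influence which sense is chosen. One context-aware alternative is the simplified Lesk heuristic, which picks the sense whose gloss shares the most words with the sentence. The sketch below illustrates the idea with a tiny hand-written sense inventory. The glosses and the `simplified_lesk` helper are hypothetical, not WordNet data or the notebook's method.

```python
def simplified_lesk(word, sentence, senses):
    """Return the sense label whose gloss overlaps most with the sentence."""
    # Context is every word in the sentence except the target word itself
    context = set(sentence.lower().split()) - {word}
    best_label, best_overlap = None, -1
    for label, gloss in senses.items():
        overlap = len(context & set(gloss.lower().split()))
        if overlap > best_overlap:
            best_label, best_overlap = label, overlap
    return best_label

# Hypothetical glosses written for this example (not taken from WordNet)
bank_senses = {
    "financial": "an institution that accepts deposits and lends money",
    "river": "sloping land beside a body of water such as a river",
}

river_case = simplified_lesk(
    "bank", "she sat on the bank of the river and watched the water", bank_senses)
money_case = simplified_lesk(
    "bank", "the bank lends money to small businesses", bank_senses)
print(river_case, money_case)
```

A fuller version would drop stopwords before counting overlaps, since shared function words such as "and" or "of" can otherwise tip the result.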

## 8: Text Summarization

In [None]:
def text_summarization(document):
    try:
        preprocessed_text = preprocess(document)
        # Split into sentences once and reuse the list below
        sentences = sent_tokenize(document)
        # Fit TF-IDF on the whole preprocessed document
        tfidf_vectorizer, tfidf_matrix = tfidf_vectorization([preprocessed_text])
        # Score each sentence by its cosine similarity to the document vector
        sentence_scores = {}
        for i, sentence in enumerate(sentences):
            sentence_vector = tfidf_vectorizer.transform([preprocess(sentence)])
            similarity_score = cosine_similarity(sentence_vector, tfidf_matrix)
            sentence_scores[i] = similarity_score.mean()
        # Sort sentences by importance score, highest first
        ranked_sentences = sorted(((score, i) for i, score in sentence_scores.items()), reverse=True)
        # Select the top 2 sentences as the summary
        summary = [sentences[i] for _, i in ranked_sentences[:2]]
        return ' '.join(summary)
    except (TypeError, ValueError):
        # For example, TfidfVectorizer raises ValueError on an empty vocabulary
        return None

In [None]:
# Test the Text Summarization function
print("Original Document:")
print(document)
print("\nSummary:")
print(text_summarization(document))

Original Document:

Text summarization is the process of distilling the most important information from a source to produce a concise summary.
There are two main approaches to text summarization: extractive and abstractive summarization.
Extractive summarization involves selecting and combining existing sentences or phrases from the source document.
Abstractive summarization, on the other hand, generates new sentences to convey the key points of the source text.
TextRank is a popular algorithm for extractive text summarization, which ranks sentences based on their importance in the document.


Summary:
There are two main approaches to text summarization: extractive and abstractive summarization. TextRank is a popular algorithm for extractive text summarization, which ranks sentences based on their importance in the document.
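
In this example the two selected sentences happen to appear in document order, but ranking by score alone can emit them out of order on other inputs. A minimal sketch of one way to restore the original order follows: choose the top-scoring indices, then sort those indices before joining. The helper `select_top_sentences` and the scores are hypothetical.

```python
def select_top_sentences(sentences, scores, top_n=2):
    """Pick the top_n highest-scoring sentences, returned in document order."""
    # Indices sorted by score, highest first
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    # Re-sort the winners by position so the summary reads in source order
    keep = sorted(ranked[:top_n])
    return [sentences[i] for i in keep]

sentences = ["Intro sentence.", "Background detail.", "Key finding.", "Closing remark."]
scores = [0.40, 0.10, 0.75, 0.20]  # hypothetical importance scores

summary = select_top_sentences(sentences, scores)
print(' '.join(summary))
```

Here "Key finding." has the highest score, yet it is placed after "Intro sentence." because it comes later in the source.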


# Summarization on News Dataset

In [None]:
import pandas as pd
df = pd.read_csv('news.csv')

## Pre-processing the data

In [None]:
df.head()

| | Unnamed: 0 | title | text | label |
|---|---|---|---|---|
| 0 | 8476 | You Can Smell Hillary’s Fear | Daniel Greenfield, a Shillman Journalism Fello... | FAKE |
| 1 | 10294 | Watch The Exact Moment Paul Ryan Committed Pol... | Google Pinterest Digg Linkedin Reddit Stumbleu... | FAKE |
| 2 | 3608 | Kerry to go to Paris in gesture of sympathy | U.S. Secretary of State John F. Kerry said Mon... | REAL |
| 3 | 10142 | Bernie supporters on Twitter erupt in anger ag... | — Kaydee King (@KaydeeKing) November 9, 2016 T... | FAKE |
| 4 | 875 | The Battle of New York: Why This Primary Matters | It's primary day in New York and front-runners... | REAL |


In [None]:
df.columns

Index(['Unnamed: 0', 'title', 'text', 'label'], dtype='object')

In [None]:
df.drop(columns=['Unnamed: 0','title','label'],inplace=True)

In [None]:
df.shape

(6335, 1)

In [None]:
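# Keep the first 500 rows; copy so that adding a column later
# does not write into a slice of the original frame
df = df.iloc[:500].copy()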

In [None]:
df.shape

(500, 1)

In [None]:
df.head()

| | text |
|---|---|
| 0 | Daniel Greenfield, a Shillman Journalism Fello... |
| 1 | Google Pinterest Digg Linkedin Reddit Stumbleu... |
| 2 | U.S. Secretary of State John F. Kerry said Mon... |
| 3 | — Kaydee King (@KaydeeKing) November 9, 2016 T... |
| 4 | It's primary day in New York and front-runners... |


## Text Summarization

In [None]:
df['summarized text'] = df['text'].apply(text_summarization)

In [None]:
df.head()

| | text | summarized text |
|---|---|---|
| 0 | Daniel Greenfield, a Shillman Journalism Fello... | In the final stretch of the election, Hillary ... |
| 1 | Google Pinterest Digg Linkedin Reddit Stumbleu... | In a particularly staggering example of politi... |
| 2 | U.S. Secretary of State John F. Kerry said Mon... | U.S. Secretary of State John F. Kerry said Mon... |
| 3 | — Kaydee King (@KaydeeKing) November 9, 2016 T... | https://t.co/KHyOuUSrFS \n— Teachers for Berni... |
| 4 | It's primary day in New York and front-runners... | While wins in New York for Trump and Clinton a... |


In [None]:
df.to_csv('summarized_news.csv', index=False)

In [None]:
print("Original Document:")
print(df.text[0])
print("\nSummary:")
print(df['summarized text'][0])

Original Document:
Daniel Greenfield, a Shillman Journalism Fellow at the Freedom Center, is a New York writer focusing on radical Islam. 
In the final stretch of the election, Hillary Rodham Clinton has gone to war with the FBI. 
The word “unprecedented” has been thrown around so often this election that it ought to be retired. But it’s still unprecedented for the nominee of a major political party to go war with the FBI. 
But that’s exactly what Hillary and her people have done. Coma patients just waking up now and watching an hour of CNN from their hospital beds would assume that FBI Director James Comey is Hillary’s opponent in this election. 
The FBI is under attack by everyone from Obama to CNN. Hillary’s people have circulated a letter attacking Comey. There are currently more media hit pieces lambasting him than targeting Trump. It wouldn’t be too surprising if the Clintons or their allies were to start running attack ads against the FBI. 
The FBI’s leadership is being warned t