Text Summarization with NLTK in Python

Text summarization is a subdomain of Natural Language Processing (NLP) that deals with extracting summaries from huge chunks of texts. There are two main types of techniques used for text summarization: NLP-based techniques and deep learning-based techniques. In this article, we will see a simple NLP-based technique for text summarization. We will not use any machine learning library in this article. Rather we will simply use Python's NLTK library for summarizing Wikipedia articles.

In [1]:
pip install --upgrade pip


Collecting pip
  Downloading pip-20.1.1-py2.py3-none-any.whl (1.5 MB)
[K     |████████████████████████████████| 1.5 MB 7.0 MB/s eta 0:00:01
[?25hInstalling collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 20.1
    Uninstalling pip-20.1:
      Successfully uninstalled pip-20.1
Successfully installed pip-20.1.1
Note: you may need to restart the kernel to use updated packages.


**** * Fetching Articles from Wikipedia
 Before we could summarize Wikipedia articles, we need to fetch them from the web. To do so we will use a couple of libraries. The first library that we need to download is the beautiful soup which is very useful Python utility for web scraping. Execute the following command at the command prompt to download the Beautiful Soup utility.**********

In [2]:
pip install beautifulsoup4

Note: you may need to restart the kernel to use updated packages.


In [3]:
pip install lxml

Collecting lxml
  Downloading lxml-5.3.0-cp311-cp311-win_amd64.whl.metadata (3.9 kB)
Downloading lxml-5.3.0-cp311-cp311-win_amd64.whl (3.8 MB)
   ---------------------------------------- 0.0/3.8 MB ? eta -:--:--
   --------------------- ------------------ 2.1/3.8 MB 10.7 MB/s eta 0:00:01
   ---------------------------------------- 3.8/3.8 MB 10.8 MB/s eta 0:00:00
Installing collected packages: lxml
Successfully installed lxml-5.3.0
Note: you may need to restart the kernel to use updated packages.


**Another important library that we need to parse XML and HTML is the lxml library. Execute the following command at command prompt to download lxml
******

In [1]:
pip install nltk

Note: you may need to restart the kernel to use updated packages.


**NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries, and an active discussion forum.**

# Preprocessing
The first preprocessing step is to remove references from the article. Wikipedia, references are enclosed in square brackets. The following script removes the square brackets and replaces the resulting multiple spaces by a single space. 

# Removing Square Brackets and Extra Spaces

The article_text object contains text without brackets. However, we do not want to remove anything else from the article since this is the original article. We will not remove other numbers, punctuation marks and special characters from this text since we will use this text to create summaries and weighted word frequencies will be replaced in this article.

To clean the text and calculate weighted frequences, we will create another object. 

# Removing special characters and digits

Now we have two objects article_text, which contains the original article and formatted_article_text which contains the formatted article. We will use formatted_article_text to create weighted frequency histograms for the words and will replace these weighted frequencies with the words in the article_text object.

# Converting Text To Sentences
At this point we have preprocessed the data. Next, we need to tokenize the article into sentences. We will use thearticle_text object for tokenizing the article to sentence since it contains full stops. The formatted_article_text does not contain any punctuation and therefore cannot be converted into sentences using the full stop as a parameter.

# Find Weighted Frequency of Occurrence
To find the frequency of occurrence of each word, we use the formatted_article_text variable. We used this variable to find the frequency of occurrence since it doesn't contain punctuation, digits, or other special characters.

In the script above, we first store all the English stop words from the nltk library into a stopwords variable. Next, we loop through all the sentences and then corresponding words to first check if they are stop words. If not, we proceed to check whether the words exist in word_frequency dictionary i.e. word_frequencies, or not. If the word is encountered for the first time, it is added to the dictionary as a key and its value is set to 1. Otherwise, if the word previously exists in the dictionary, its value is simply updated by 1.

Finally, to find the weighted frequency, we can simply divide the number of occurances of all the words by the frequency of the most occurring word.

# Calculating Sentence Scores
We have now calculated the weighted frequencies for all the words. Now is the time to calculate the scores for each sentence by adding weighted frequencies of the words that occur in that particular sentence. 

n the script above, we first create an empty sentence_scores dictionary. The keys of this dictionary will be the sentences themselves and the values will be the corresponding scores of the sentences. Next, we loop through each sentence in the sentence_list and tokenize the sentence into words.

We then check if the word exists in the word_frequencies dictionary. This check is performed since we created the sentence_list list from the article_text object; on the other hand, the word frequencies were calculated using the formatted_article_text object, which doesn't contain any stop words, numbers, etc.

We do not want very long sentences in the summary, therefore, we calculate the score for only sentences with less than 30 words (although you can tweak this parameter for your own use-case). Next, we check whether the sentence exists in the sentence_scores dictionary or not. If the sentence doesn't exist, we add it to the sentence_scores dictionary as a key and assign it the weighted frequency of the first word in the sentence, as its value. On the contrary, if the sentence exists in the dictionary, we simply add the weighted frequency of the word to the existing value.

# Getting the Summary
Now we have the sentence_scores dictionary that contains sentences with their corresponding score. To summarize the article, we can take top N sentences with the highest scores. The following script retrieves top 7 sentences and prints them on the screen.

In the script above, we use the heapq library and call its nlargest function to retrieve the top 7 sentences with the highest scores.

In [1]:
import bs4 as bs
import urllib.request
import re
import nltk

scraped_data = urllib.request.urlopen('https://en.wikipedia.org/wiki/Severe_acute_respiratory_syndrome_coronavirus_2')
article = scraped_data.read()

parsed_article = bs.BeautifulSoup(article,'lxml')

paragraphs = parsed_article.find_all('p')

article_text = ""

for p in paragraphs:
    article_text += p.text
# Removing Square Brackets and Extra Spaces
article_text = re.sub(r'\[[0-9]*\]', ' ', article_text)
article_text = re.sub(r'\s+', ' ', article_text)
# Removing special characters and digits
formatted_article_text = re.sub('[^a-zA-Z]', ' ', article_text )
formatted_article_text = re.sub(r'\s+', ' ', formatted_article_text)
sentence_list = nltk.sent_tokenize(article_text)
stopwords = nltk.corpus.stopwords.words('english')

word_frequencies = {}
for word in nltk.word_tokenize(formatted_article_text):
    if word not in stopwords:
        if word not in word_frequencies.keys():
            word_frequencies[word] = 1
        else:
            word_frequencies[word] += 1
    maximum_frequncy = max(word_frequencies.values())
for word in word_frequencies.keys():
    word_frequencies[word] = (word_frequencies[word]/maximum_frequncy)
    sentence_scores = {}
for sent in sentence_list:
    for word in nltk.word_tokenize(sent.lower()):
        if word in word_frequencies.keys():
            if len(sent.split(' ')) < 30:
                if sent not in sentence_scores.keys():
                    sentence_scores[sent] = word_frequencies[word]
                else:
                    sentence_scores[sent] += word_frequencies[word]
import heapq
summary_sentences = heapq.nlargest(7, sentence_scores, key=sentence_scores.get)

summary = ' '.join(summary_sentences)
print(summary)

Other studies have suggested that the virus may be airborne as well, with aerosols potentially being able to transmit the virus. The host protein neuropilin 1 (NRP1) may aid the virus in host cell entry using ACE2. During the initial outbreak in Wuhan, China, various names were used for the virus; some names used by different sources included "the coronavirus" or "Wuhan coronavirus". The virus previously had the provisional name 2019 novel coronavirus (2019-nCoV), and has also been called human coronavirus 2019 (HCoV-19 or hCoV-19). Differences between the bat coronavirus and SARS‑CoV‑2 suggest that humans may have been infected via an intermediate host; although the source of introduction into humans remains unknown. The original source of viral transmission to humans remains unclear, as does whether the virus became pathogenic before or after the spillover event. Research into the natural reservoir of the virus that caused the 2002–2004 SARS outbreak has resulted in the discovery of 

#### The below code is a improved version of the above code does the same thing but little optimized

In [1]:
import requests
from bs4 import BeautifulSoup
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
from collections import Counter
import heapq

# Download the stopwords from NLTK if not already downloaded
nltk.download('punkt')
nltk.download('stopwords')

# Fetch article data
url = 'https://en.wikipedia.org/wiki/Severe_acute_respiratory_syndrome_coronavirus_2'
response = requests.get(url)
article = response.text

# Parse the article
soup = BeautifulSoup(article, 'lxml')
paragraphs = soup.find_all('p')
article_text = " ".join([p.text for p in paragraphs])

# Clean the text
article_text = re.sub(r'\[[0-9]*\]', '', article_text)  # Remove reference numbers
article_text = re.sub(r'\s+', ' ', article_text)  # Remove extra spaces

# Remove special characters and digits
formatted_article_text = re.sub('[^a-zA-Z]', ' ', article_text)
formatted_article_text = re.sub(r'\s+', ' ', formatted_article_text)

# Tokenize sentences
sentence_list = sent_tokenize(article_text)

# Stopwords
stop_words = set(stopwords.words('english'))

# Calculate word frequencies
word_frequencies = Counter(word.lower() for word in word_tokenize(formatted_article_text) if word.lower() not in stop_words)

# Normalize frequencies
maximum_frequency = max(word_frequencies.values())
for word in word_frequencies:
    word_frequencies[word] /= maximum_frequency

# Score sentences based on word frequencies
sentence_scores = {}
for sent in sentence_list:
    word_count = len(sent.split(' '))
    if word_count < 30:  # Consider only shorter sentences
        for word in word_tokenize(sent.lower()):
            if word in word_frequencies:
                if sent not in sentence_scores:
                    sentence_scores[sent] = word_frequencies[word]
                else:
                    sentence_scores[sent] += word_frequencies[word]

# Generate summary
summary_sentences = heapq.nlargest(7, sentence_scores, key=sentence_scores.get)
summary = ' '.join(summary_sentences)

print(summary)


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\GOD\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\GOD\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Research into the natural reservoir of the virus that caused the 2002–2004 SARS outbreak has resulted in the discovery of many SARS-like bat coronaviruses, most originating in horseshoe bats. Studies have shown that SARS‑CoV‑2 has a higher affinity to human ACE2 than the original SARS virus. SARS‑CoV‑2 is a strain of the species Betacoronavirus pandemicum (SARSr-CoV), as is SARS-CoV-1, the virus that caused the 2002–2004 SARS outbreak. During the initial outbreak in Wuhan, China, various names were used for the virus; some names used by different sources included "the coronavirus" or "Wuhan coronavirus". Like the SARS-related coronavirus implicated in the 2003 SARS outbreak, SARS‑CoV‑2 is a member of the subgenus Sarbecovirus (beta-CoV lineage B). Other studies have suggested that the virus may be airborne as well, with aerosols potentially being able to transmit the virus. The host protein neuropilin 1 (NRP1) may aid the virus in host cell entry using ACE2.


Leverage TfidfVectorizer from Scikit-Learn: This can replace manual frequency calculations and better weigh words by their importance in the document.
Use NLTK's TextBlob for Sentence Tokenization: This provides a higher level of abstraction for handling text.
Improve Text Cleaning Using re.sub with Better Regex: Reduce redundancy in the cleaning steps.

In [5]:
import requests
from bs4 import BeautifulSoup
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.tokenize import sent_tokenize
import nltk

# Download the necessary NLTK data
#nltk.download('punkt')

# Fetch article data
url = 'https://en.wikipedia.org/wiki/Severe_acute_respiratory_syndrome_coronavirus_2'
response = requests.get(url)
article = response.text

# Parse the article
soup = BeautifulSoup(article, 'lxml')
paragraphs = soup.find_all('p')
article_text = " ".join([p.text for p in paragraphs])

# Clean the text
def clean_text(text):
    clean_text = re.sub(r'\[[0-9]*\]', '', text)  # Remove reference numbers
    clean_text = re.sub(r'\s+', ' ', clean_text)  # Remove extra spaces
    clean_text = re.sub('[^a-zA-Z]', ' ', clean_text)  # Keep only alphabets
    return clean_text.strip()

cleaned_article = clean_text(article_text)

# Tokenize sentences
sentences = sent_tokenize(article_text)

# Generate TF-IDF matrix
vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix = vectorizer.fit_transform(sentences)

# Calculate sentence scores based on TF-IDF
sentence_scores = tfidf_matrix.sum(axis=1).flatten().tolist()[0]

# Get the top N sentences
def summarize(sentences, scores, top_n=7):
    ranked_sentences = sorted(((scores[i], s) for i, s in enumerate(sentences)), reverse=True)
    summary = ' '.join([ranked_sentences[i][1] for i in range(top_n)])
    return summary

# Generate summary
summary = summarize(sentences, sentence_scores, top_n=20)
print(summary)


[130][131]
 A phylogenetic tree based on whole-genome sequences of SARS-CoV-2 and related coronaviruses is:[132][133]
 (Bat) Rc-o319, 81% to SARS-CoV-2, Rhinolophus cornutus, Iwate, Japan[134]
 Bat SL-ZXC21, 88% to SARS-CoV-2, Rhinolophus pusillus, Zhoushan, Zhejiang[135]
 Bat SL-ZC45, 88% to SARS-CoV-2, Rhinolophus pusillus, Zhoushan, Zhejiang[135]
 Pangolin SARSr-CoV-GX, 85.3% to SARS-CoV-2, Manis javanica, smuggled from Southeast Asia[136]
 Pangolin SARSr-CoV-GD, 90.1% to SARS-CoV-2, Manis javanica, smuggled from Southeast Asia[137]
 Bat RshSTT182, 92.6% to SARS-CoV-2, Rhinolophus shameli, Steung Treng, Cambodia[138]
 Bat RshSTT200, 92.6% to SARS-CoV-2, Rhinolophus shameli, Steung Treng, Cambodia[138]
 (Bat) RacCS203, 91.5% to SARS-CoV-2, Rhinolophus acuminatus, Chachoengsao, Thailand[133]
 (Bat) RmYN02, 93.3% to SARS-CoV-2, Rhinolophus malayanus, Mengla, Yunnan[139]
 (Bat) RpYN06, 94.4% to SARS-CoV-2, Rhinolophus pusillus, Xishuangbanna, Yunnan[132]
 (Bat) RaTG13, 96.1% to SARS-CoV

In [9]:
import requests
from bs4 import BeautifulSoup
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.tokenize import sent_tokenize
import nltk

# Download necessary NLTK data
#nltk.download('punkt')

def fetch_text_from_url(url):
    """
    Fetches text from a given URL.

    Args:
        url (str): URL of the webpage to scrape.

    Returns:
        str: Raw text extracted from the webpage.
    """
    response = requests.get(url)
    article = response.text
    soup = BeautifulSoup(article, 'lxml')
    paragraphs = soup.find_all('p')
    return " ".join([p.text for p in paragraphs])

def clean_text(text):
    """
    Cleans the input text by removing references, extra spaces, and non-alphabet characters.

    Args:
        text (str): The raw text to be cleaned.

    Returns:
        str: Cleaned text.
    """
    text = re.sub(r'\[[0-9]*\]', '', text)  # Remove reference numbers
    text = re.sub(r'\s+', ' ', text)  # Remove extra spaces
    text = re.sub('[^a-zA-Z]', ' ', text)  # Keep only alphabets
    return text

def extract_sentences_with_keywords(text, keywords):
    """
    Extracts sentences that contain specific keywords.

    Args:
        text (str): The text to search for keywords.
        keywords (set): A set of keywords to look for in the text.

    Returns:
        list: A list of sentences that contain the keywords.
    """
    sentences = sent_tokenize(text)
    keyword_sentences = [sentence for sentence in sentences if any(keyword.lower() in sentence.lower() for keyword in keywords)]
    return keyword_sentences

def extractive_summary_with_keywords(text, keywords, num_sentences=7):
    """
    Generates an extractive summary from the given text using TF-IDF and keyword boosting.

    Args:
        text (str): The text to summarize.
        keywords (set): A set of keywords to prioritize in the summary.
        num_sentences (int): Number of sentences to include in the summary.

    Returns:
        str: Extractive summary focusing on keywords.
    """
    # Clean and tokenize sentences
    cleaned_text = clean_text(text)
    sentences = sent_tokenize(text)
    
    # Generate TF-IDF matrix for sentences
    vectorizer = TfidfVectorizer(stop_words='english')
    tfidf_matrix = vectorizer.fit_transform(sentences)

    # Calculate sentence scores based on TF-IDF
    sentence_scores = tfidf_matrix.sum(axis=1).flatten().tolist()[0]
    
    # Boost scores for sentences containing keywords
    keyword_sentences = extract_sentences_with_keywords(text, keywords)
    for i, sentence in enumerate(sentences):
        if sentence in keyword_sentences:
            sentence_scores[i] *= 1.5  # Boost score by 50% if the sentence contains a keyword

    # Get top N sentences for the summary
    ranked_sentences = sorted(((sentence_scores[i], s) for i, s in enumerate(sentences)), reverse=True)
    summary = ' '.join([ranked_sentences[i][1] for i in range(min(num_sentences, len(ranked_sentences)))])

    return summary

def summarize_policy_from_url_with_keywords(url, keywords, num_sentences=7):
    """
    Fetches a policy document from a URL and generates an extractive summary focused on specific keywords.

    Args:
        url (str): URL of the policy document.
        keywords (set): A set of keywords to prioritize in the summary.
        num_sentences (int): Number of sentences to include in the summary.

    Returns:
        str: Extractive summary of the policy document focusing on keywords.
    """
    raw_text = fetch_text_from_url(url)
    summary = extractive_summary_with_keywords(raw_text, keywords, num_sentences=num_sentences)
    return summary

# Example usage
policy_url = 'https://www.hdfcbank.com/personal/resources/learning-centre/borrow/everything-you-need-to-know-about-a-personal-loan'
keywords = {'interest', 'ROI', 'principal', 'policy', 'insurance'}
summary = summarize_policy_from_url_with_keywords(policy_url, keywords, num_sentences=20)
print(summary)


In the early period of the loan tenure, the EMI will have a higher interest component and lower principal amount, but this will reverse as you near the end stages.HDFC Bank offers loan amounts upto Rs. HDFC Bank usually disburses loan within 10 seconds if you are a pre-approved customer, while non-HDFC Bank customers can get the loan in 4 days.What other options do I have apart from Personal Loans?If you are not sure about a Personal Loan, then HDFC Bank offers several other options that you can use to generate funds for your needs. Just like most loans, however, it must be repaid in monthly instalments.You can use it to fund any expense including education, a wedding, a trip, home renovation, medical expenses, and even to buy a gadget. You can even use the money to help out with the day-to-day expenses in case of a cash flow crunch.HDFC Bank offers a Personal Loan to pre-approved customers in just 10 seconds. This instalment amount is calculated using the loan amount, the payment tenu