# TEXT SUMMARIZATION PROJECT:

Text summarization is the creation of a short, accurate, and fluent summary of a longer text document. Automatic text summarization methods are greatly needed to address the ever-growing amount of text data available online. This could help to discover relevant information and to consume relevant information faster

# MODEL-1:

1.IMPORT THE LIBRARIES:

In [1]:
import bs4 as bs
import urllib.request
import re
import nltk
import heapq

nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Kumarapk\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Kumarapk\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

2.FUNCTION FOR PROCESSING ARTICLE TEXT:

In [2]:
def get_article_text(url):
    scraped_data = urllib.request.urlopen(url)
    article = scraped_data.read()
    parsed_article = bs.BeautifulSoup(article, 'lxml')
    paragraphs = parsed_article.find_all('p')
    article_text = ""

    for p in paragraphs:
        article_text += p.text

    article_text = re.sub(r'\[[0-9]*\]', ' ', article_text)
    article_text = re.sub(r'\s+', ' ', article_text)
    return article_text

3.FUNCTION FOR SUMMARIZATION:

In [3]:
def summarize_text(article_text):
    # Removing special characters and digits
    formatted_article_text = re.sub('[^a-zA-Z]', ' ', article_text)
    formatted_article_text = re.sub(r'\s+', ' ', formatted_article_text)
    sentence_list = nltk.sent_tokenize(article_text)
    stopwords = nltk.corpus.stopwords.words('english')

    word_frequencies = {}
    for word in nltk.word_tokenize(formatted_article_text):
        if word.lower() not in stopwords:
            if word not in word_frequencies.keys():
                word_frequencies[word] = 1
            else:
                word_frequencies[word] += 1

    maximum_frequency = max(word_frequencies.values())
    for word in word_frequencies.keys():
        word_frequencies[word] = (word_frequencies[word] / maximum_frequency)

    sentence_scores = {}
    for sent in sentence_list:
        for word in nltk.word_tokenize(sent.lower()):
            if word in word_frequencies.keys():
                if len(sent.split(' ')) < 30:
                    if sent not in sentence_scores.keys():
                        sentence_scores[sent] = word_frequencies[word]
                    else:
                        sentence_scores[sent] += word_frequencies[word]

    summary_sentences = heapq.nlargest(7, sentence_scores, key=sentence_scores.get)
    summary = ' '.join(summary_sentences)
    return summary


4.USER-DEFINED INPUT:

In [4]:
if __name__ == "__main__":
    choice = input("Would you like to summarize text from a URL or direct input? (Enter 'URL' or 'Text'): ").strip().lower()

    if choice == 'url':
        url = input("Enter the URL of the article you want to summarize: ").strip()
        article_text = get_article_text(url)
    elif choice == 'text':
        article_text = input("Enter the text you want to summarize: ").strip()
    else:
        print("Invalid choice. Please run the script again and enter either 'URL' or 'Text'.")
        exit()

    summary = summarize_text(article_text)
    print("\nSummary:")
    print(summary)

Would you like to summarize text from a URL or direct input? (Enter 'URL' or 'Text'): URL
Enter the URL of the article you want to summarize: https://www.ibm.com/topics/text-summarization

Summary:
Gupta, “Abstractive summarization: An overview of the state of the art,” Expert Systems With Applications, 2019, https://www.sciencedirect.com/science/article/abs/pii/S0957417418307735 (link resides outside of ibm.com). Researchers also show how text summarization transformers can advance additional tasks, however.News News articles are a common dataset for testing and comparing text summarization techniques. Differences in sentence scoring determines which sentences to extract and which to retain.Abstractive summarization generates original summaries using sentences not found in the original text documents. Build AI applications in a fraction of the time with a fraction of the data.1 Juan-Manuel Torres-Moreno, Automatic Text Summarization, Wiley, 2014.2 Aggarwal, Machine Learning for Text, 

# MODEL-2: EXTRACTIVE SUMMARIZATION:

Extractive summarization extracts unmodified sentences from the original text documents. A key difference between extractive algorithms is how they score sentence importance while reducing topical redundancy. Differences in sentence scoring determines which sentences to extract and which to retain.

1.IMPORTING THE LIBRARIES:

In [6]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
import heapq


nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Kumarapk\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Kumarapk\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

2.NLTK SUMMARIZATION:

In [7]:
def nltk_summarize(text, n):
    sentences = sent_tokenize(text)
    words = word_tokenize(text.lower())
    stop_words = set(stopwords.words('english'))
    words = [word for word in words if word.isalnum() and word not in stop_words]
    
    word_frequencies = {}
    for word in words:
        if word not in word_frequencies:
            word_frequencies[word] = 1
        else:
            word_frequencies[word] += 1
 
    max_frequency = max(word_frequencies.values())
    for word in word_frequencies:
        word_frequencies[word] = word_frequencies[word] / max_frequency

    sentence_scores = {}
    for sentence in sentences:
        for word in word_tokenize(sentence.lower()):
            if word in word_frequencies:
                if sentence not in sentence_scores:
                    sentence_scores[sentence] = word_frequencies[word]
                else:
                    sentence_scores[sentence] += word_frequencies[word]

    summary_sentences = heapq.nlargest(n, sentence_scores, key=sentence_scores.get)
    summary = ' '.join(summary_sentences)
    return summary


3.OUTPUT:

In [9]:
text = """
Abstractive summarization is a technique in natural language processing (NLP) where the goal is to generate a concise and coherent summary of a given text by rephrasing the original content. Unlike extractive summarization, which selects and compiles sentences or phrases directly from the source text, abstractive summarization creates new sentences that convey the main ideas, often using different words and structures.
Key Characteristics of Abstractive Summarization
Paraphrasing: Abstractive summarization rephrases the original text, potentially using synonyms and different grammatical structures to convey the same meaning.
Compression: The summary is typically shorter than the original text, focusing on the most important information.
Understanding Context: This method attempts to understand the context and semantics of the text, allowing for more natural and human-like summaries.
Complexity: Abstractive summarization is more computationally complex and challenging than extractive summarization because it involves natural language generation (NLG).
Techniques for Abstractive Summarization
Abstractive summarization often leverages advanced NLP techniques, such as:
Sequence-to-Sequence Models (Seq2Seq): These models, often based on Recurrent Neural Networks (RNNs) or Transformers, are trained to convert a sequence of words (the input text) into another sequence (the summary).
Attention Mechanisms: Used to focus on different parts of the input text when generating each word of the summary.
Transformers: The architecture used in models like BERT, GPT-3, and T5, which can handle long-range dependencies and context more effectively.
"""
n = 3  
summary = nltk_summarize(text, n)
print(summary)


Techniques for Abstractive Summarization
Abstractive summarization often leverages advanced NLP techniques, such as:

Sequence-to-Sequence Models (Seq2Seq): These models, often based on Recurrent Neural Networks (RNNs) or Transformers, are trained to convert a sequence of words (the input text) into another sequence (the summary). Unlike extractive summarization, which selects and compiles sentences or phrases directly from the source text, abstractive summarization creates new sentences that convey the main ideas, often using different words and structures. Key Characteristics of Abstractive Summarization
Paraphrasing: Abstractive summarization rephrases the original text, potentially using synonyms and different grammatical structures to convey the same meaning.


Thus, extractive summarization has been implemented sucessfully using NLTK module.