
Summarizing Text Data
--
Articles and books are everywhere. Suppose you want to learn an NLP
concept, so you search for it and find an article. You like the content,
but it’s too long to read a second time. You would rather summarize the
article and save the summary somewhere so that you can read it later.

NLP has a solution for that: text summarization. With a summary in hand,
you don’t have to reread the full article or book every time.

Problem
--
Summarize an article or document in Python using different algorithms.

Solution
--
Text summarization is the process of condensing a large document into a
shorter one without losing its key context, which saves readers time. It
can be done using different techniques, such as the following:

• TextRank: A graph-based ranking algorithm <br>
**Video 1** : https://www.youtube.com/watch?v=PNHB6OuFv7I
<br>
**Video 2 :** TextRank is based on PageRank : <b>Watch <a href="https://drive.google.com/open?id=1LIepskEND-1FuvfEkgEj1wFgTV6uHNlv">this</a></b>

• Feature-based text summarization

• LexRank: TF-IDF with a graph-based algorithm

• Topic based

• Using sentence embeddings

• Encoder-Decoder Model: Deep learning techniques
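
To make the graph-based idea behind TextRank concrete, here is a minimal sketch that uses only the standard library. It is not the method used later in this notebook, nor any library's implementation: the naive sentence splitter, the word-overlap similarity, and the function names (`split_sentences`, `similarity`, `textrank`, `summarize`) are assumptions chosen for illustration. Sentences are nodes, similarity values are edge weights, and a PageRank-style power iteration assigns each sentence a score.

```python
import math
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def similarity(a, b):
    """Word overlap between two sentences, normalised by log sentence lengths."""
    words_a = set(re.findall(r"\w+", a.lower()))
    words_b = set(re.findall(r"\w+", b.lower()))
    if len(words_a) < 2 or len(words_b) < 2:
        return 0.0  # avoids log(1) + log(1) == 0 in the denominator
    return len(words_a & words_b) / (math.log(len(words_a)) + math.log(len(words_b)))


def textrank(sentences, damping=0.85, iterations=50):
    """Score sentences with a PageRank-style power iteration."""
    n = len(sentences)
    if n == 0:
        return []
    # Edge weights of the sentence graph (no self-loops).
    weights = [[similarity(sentences[i], sentences[j]) if i != j else 0.0
                for j in range(n)] for i in range(n)]
    out_sum = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) / n + damping * sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n) if out_sum[j] > 0)
            for i in range(n)
        ]
    return scores


def summarize(text, top_n=2):
    """Return the top_n highest-scoring sentences in document order."""
    sentences = split_sentences(text)
    scores = textrank(sentences)
    best = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:top_n]
    return [sentences[i] for i in sorted(best)]
```

Calling `summarize(article_text, top_n=3)` on a longer string returns the three sentences most strongly connected to the rest of the text. A sentence that shares no words with any other sentence only receives the baseline score `(1 - damping) / n`.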

In [None]:
!pip install gensim



In [None]:
!pip show gensim

Name: gensim
Version: 4.3.2
Summary: Python framework for fast Vector Space Modelling
Home-page: https://radimrehurek.com/gensim/
Author: Radim Rehurek
Author-email: me@radimrehurek.com
License: LGPL-2.1-only
Location: /usr/local/lib/python3.10/dist-packages
Requires: numpy, scipy, smart-open
Required-by: 


In [None]:
ls /usr/local/lib/python3.10/dist-packages/gensim

corpora/       _matutils.c                                 models/       scripts/          utils.py
downloader.py  _matutils.cpython-310-x86_64-linux-gnu.so*  nosy.py       similarities/
__init__.py    matutils.py                                 parsing/      test/
interfaces.py  _matutils.pyx                               __pycache__/  topic_coherence/


## What is gensim?

**`Gensim = “Generate Similar”`** is a popular open-source NLP library used for unsupervised topic modeling. It uses established academic models and statistical machine learning to perform tasks such as:

1. Building document or word vectors
2. Building corpora
3. Performing topic identification
4. Performing document comparison (retrieving semantically similar documents)
5. Analysing plain-text documents for semantic structure

Beyond these tasks, Gensim, `implemented in Python and Cython`, is designed to handle large text collections using data streaming and incremental online algorithms. *This sets it apart from machine learning packages that target only in-memory processing.*

> `Note`: One significant advantage of gensim is that it lets you handle large text files **without having to load the entire file into memory**.
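
The streaming idea can be sketched without gensim itself. The example below is an illustration using only the standard library, not gensim code: `StreamedCorpus` and `count_tokens` are hypothetical names, and the three-line demo file stands in for a corpus too large to fit in memory. The corpus object is an iterable that yields one tokenized document at a time, which is the same general shape gensim expects from a streamed corpus.

```python
import os
import tempfile
from collections import Counter


class StreamedCorpus:
    """Yield one tokenized document per line, reading lazily from disk."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:  # only the current line is held in memory
                yield line.lower().split()


def count_tokens(corpus):
    """Consume the stream once, updating token counts incrementally."""
    counts = Counter()
    for tokens in corpus:
        counts.update(tokens)
    return counts


# Tiny demo file standing in for a large corpus (one document per line).
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write("Human machine interface\nGraph of trees\nGraph minors survey\n")
    demo_path = tmp.name

token_counts = count_tokens(StreamedCorpus(demo_path))
print(token_counts["graph"])  # prints 2
os.remove(demo_path)
```

Because `__iter__` reopens the file on every pass, the same object can be iterated several times, which matters for algorithms that need more than one pass over the data.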

In [None]:
# Import BeautifulSoup and urllib to fetch data from Wikipedia.
from bs4 import BeautifulSoup
from urllib.request import urlopen

# Fetch a page and return its title and the text of all <p> elements
def get_only_text(url):
    page = urlopen(url)
    # Name the parser explicitly to avoid the "no parser specified" warning
    soup = BeautifulSoup(page, "html.parser")
    text = ' '.join(p.text for p in soup.find_all('p'))
    return soup.title.text, text

# The Wikipedia URL to summarize
url = "https://en.wikipedia.org/wiki/Natural_language_processing"

# Call the function created above
fulltext = get_only_text(url)

# Count the characters in the title and body text combined
print(len(''.join(fulltext)))

# Print the result. It is a (title, text) tuple.
print(fulltext)

7945
('Natural language processing - Wikipedia', 'Natural language processing (NLP) is an interdisciplinary subfield of computer science and linguistics. It is primarily concerned with giving computers the ability to support and manipulate human language. It involves processing natural language datasets, such as text corpora or speech corpora, using either rule-based or probabilistic (i.e. statistical and, most recently, neural network-based) machine learning approaches. The goal is a computer capable of "understanding" the contents of documents, including the contextual nuances of the language within them. The technology can then accurately extract information and insights contained in the documents as well as categorize and organize the documents themselves.\n Challenges in natural language processing frequently involve speech recognition, natural-language understanding, and natural-language generation.\n Natural language processing has its roots in the 1940s.[1] Already in 1940, Ala

In [None]:
tuple_data = fulltext

## Join the tuple's elements (title and body text) into one string,
## separated by single spaces.
result_sentence = ' '.join(tuple_data)
print(result_sentence)

Natural language processing - Wikipedia Natural language processing (NLP) is an interdisciplinary subfield of computer science and linguistics. It is primarily concerned with giving computers the ability to support and manipulate human language. It involves processing natural language datasets, such as text corpora or speech corpora, using either rule-based or probabilistic (i.e. statistical and, most recently, neural network-based) machine learning approaches. The goal is a computer capable of "understanding" the contents of documents, including the contextual nuances of the language within them. The technology can then accurately extract information and insights contained in the documents as well as categorize and organize the documents themselves.
 Challenges in natural language processing frequently involve speech recognition, natural-language understanding, and natural-language generation.
 Natural language processing has its roots in the 1940s.[1] Already in 1940, Alan Turing pub

In [None]:
import nltk
nltk.download('punkt')
nltk.download('stopwords')

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.probability import FreqDist
from heapq import nlargest

# Sample text
fulltext = result_sentence

# Tokenize the text into sentences
sentences = sent_tokenize(fulltext)

# Tokenize the text into words
words = word_tokenize(fulltext)

# Remove stopwords
stop_words = set(stopwords.words("english"))
filtered_words = [word for word in words if word.lower() not in stop_words]

# Calculate word frequency
word_freq = FreqDist(filtered_words)

# Calculate sentence scores based on word frequency
sent_scores = {}
for sentence in sentences:
    for word in word_tokenize(sentence.lower()):
        if word in word_freq:
            if len(sentence.split(' ')) < 30:  # Consider only sentences less than 30 words long
                if sentence not in sent_scores:
                    sent_scores[sentence] = word_freq[word]
                else:
                    sent_scores[sentence] += word_freq[word]

# Get the top sentences with the highest scores
summary_sentences = nlargest(5, sent_scores, key=sent_scores.get)
summary = ' '.join(summary_sentences)
print(summary)

print(len(summary))
print("ratio of reduction = " , 1 - (len(summary) / len(fulltext)) )
print("-------------------")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


That popularity was due partly to a flurry of results showing that such techniques[11][12] can achieve state-of-the-art results in many natural language tasks, e.g., in language modeling[13] and parsing. Neural machine translation, based on then-newly-invented sequence-to-sequence transformations, made obsolete the intermediate steps, such as word alignment, previously necessary for statistical machine translation. [20][21]
 The earliest decision trees, producing systems of hard if–then rules, were still very similar to the old rule-based approaches. More recently, ideas of cognitive NLP have been revived as an approach to achieve explainability, e.g., under the notion of "cognitive AI". Since 2015,[22] the statistical approach was replaced by the neural networks approach, using word embeddings to capture semantic properties of words.
846
ratio of reduction =  0.8935313365215203
-------------------


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


Let's break down the logic of the code above:

1. **Tokenization:**
   - The text is tokenized into sentences using `sent_tokenize` and into words using `word_tokenize`. This breaks down the text into meaningful units, allowing us to analyze it more effectively.

2. **Stopword Removal:**
   - Stopwords are common words that typically do not carry significant meaning in a sentence. They are removed from the list of words using NLTK's `stopwords` corpus. This step helps in focusing on the most meaningful words in the text.

3. **Word Frequency Calculation:**
   - The frequency of each word is calculated using `FreqDist`, which creates a frequency distribution of words in the text. This step helps in identifying the most important words in the text based on their frequency.

4. **Sentence Score Calculation:**
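   - For each sentence, the code lower-cases and tokenizes it, then adds up the `word_freq` count of every token found in the frequency table. Only sentences shorter than 30 words (counted by splitting on spaces) are scored. Two caveats apply to the code as written: the frequency table is built from case-preserved tokens while the lookup uses lower-cased ones, so the counts of capitalized forms such as `Natural` are never used; and punctuation tokens are not filtered out, so commas and brackets add to a sentence's score.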

5. **Top Sentences Selection:**
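   - The top 5 sentences with the highest scores are selected using `nlargest`. This function returns the n largest elements from an iterable (here, the keys of `sent_scores`) based on a specified key function, which in this case is the score of each sentence. The sentences come back in descending score order, not in the order they appear in the document.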

6. **Summary Generation:**
   - The selected sentences are concatenated into a single summary using `' '.join(summary_sentences)`. This creates a single string containing the top sentences, separated by a space.

7. **Printing the Summary:**
   - Finally, the generated summary is printed to the console.

This code essentially summarizes a given text by identifying the most important sentences based on the frequency of the words they contain. It's a basic approach to automatic text summarization using NLTK.
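
The steps above can be sketched in a self-contained form that runs without NLTK downloads. This is a simplified variant under stated assumptions, not a replacement for the notebook code: the regex tokenizer, the tiny `STOP_WORDS` set, and the names `frequency_summary` and `reduction_ratio` are all illustrative. Unlike the code above, it lower-cases words before counting, ignores punctuation, and returns the chosen sentences in document order.

```python
import re
from collections import Counter
from heapq import nlargest

# Tiny illustrative stop-word list (an assumption; NLTK's list is much longer).
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "it", "of", "on", "the", "to"}


def tokenize(text):
    """Lower-case the text and keep only word-like tokens (no punctuation)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def frequency_summary(text, top_n=2, max_words=30):
    """Pick the top_n sentences whose words are most frequent in the text."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    word_freq = Counter(w for w in tokenize(text) if w not in STOP_WORDS)
    scores = {}
    for index, sentence in enumerate(sentences):
        if len(sentence.split()) >= max_words:
            continue  # skip very long sentences, as in the code above
        # Stop words are absent from word_freq, so they contribute 0.
        scores[index] = sum(word_freq[w] for w in tokenize(sentence))
    best = nlargest(top_n, scores, key=scores.get)
    return " ".join(sentences[i] for i in sorted(best))


def reduction_ratio(original, summary):
    """Fraction of characters removed by summarizing."""
    return 1 - len(summary) / len(original)


sample = ("Python is popular. Python libraries make text processing in "
          "Python easy. The weather is nice.")
print(frequency_summary(sample, top_n=1))
# prints: Python libraries make text processing in Python easy.
```

In the sample, `python` occurs three times, so the second sentence scores highest. `reduction_ratio(sample, summary)` gives the same "ratio of reduction" figure that the notebook prints.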

Method 2: Feature-based text summarization
--
Feature-based text summarization methods extract features from each
sentence and use them to score and rank its importance. Position, length,
term frequency, named entities, and many other features can contribute to
the score.

Luhn’s Algorithm is one of the feature-based algorithms, and we will
see how to apply it using the sumy library.
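
Before turning to sumy, it may help to see the core of Luhn's scoring idea in a few lines. The sketch below is a simplified illustration and is not sumy's implementation: the stop-word set, the `min_count` threshold for "significant" words, the `max_gap` of 4, and all function names are assumptions. A sentence is scored by its best cluster of significant words, where a cluster's score is the square of the number of significant words it contains divided by the cluster's total length in words.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not sumy's list).
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "it", "of", "on", "the", "to"}


def tokenize(text):
    """Lower-case the text and keep only word-like tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def significant_words(text, min_count=2):
    """Non-stop words that occur at least min_count times in the text."""
    counts = Counter(w for w in tokenize(text) if w not in STOP_WORDS)
    return {w for w, c in counts.items() if c >= min_count}


def luhn_score(sentence, significant, max_gap=4):
    """Best cluster score: (significant words in cluster)^2 / cluster length."""
    positions = [i for i, w in enumerate(tokenize(sentence)) if w in significant]
    if not positions:
        return 0.0
    best = 0.0
    start = prev = positions[0]
    count = 1
    for pos in positions[1:]:
        if pos - prev - 1 > max_gap:  # too many other words in between
            best = max(best, count ** 2 / (prev - start + 1))
            start, count = pos, 0     # open a new cluster at this word
        count += 1
        prev = pos
    return max(best, count ** 2 / (prev - start + 1))


def luhn_summary(text, top_n=2):
    """Return the top_n best-scoring sentences in document order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    significant = significant_words(text)
    ranked = sorted(range(len(sentences)),
                    key=lambda i: luhn_score(sentences[i], significant),
                    reverse=True)[:top_n]
    return [sentences[i] for i in sorted(ranked)]
```

For example, with `{"text", "summary"}` as the significant words, the sentence `"text x summary"` forms one cluster of length 3 containing 2 significant words, so its score is 4/3.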

In [None]:
# Install sumy
!pip install sumy

Collecting sumy
  Downloading sumy-0.11.0-py2.py3-none-any.whl (97 kB)
Collecting docopt<0.7,>=0.6.1 (from sumy)
  Downloading docopt-0.6.2.tar.gz (25 kB)
  Preparing metadata (setup.py) ... done
Collecting breadability>=0.1.20 (from sumy)
  Downloading breadability-0.1.20.tar.gz (32 kB)
  Preparing metadata (setup.py) ... done
Collecting pycountry>=18.2.23 (from sumy)
  Downloading pycountry-23.12.11-py3-none-any.whl (6.2 MB)
Building wheels for collected packages: breadability, docopt
  Building wheel for breadability (setup.py) ... done
  Created wheel for breadability: filename=breadability-0.1.20-py2.py3-none-any.whl size=21691 sha256=8f86c23f7913e990376dde0b521e82f453dc021c29fb32b22c732f19e0986c93
  Stor

In [None]:
# Import the packages
from sumy.parsers.html import HtmlParser
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.nlp.stemmers import Stemmer
from sumy.utils import get_stop_words
from sumy.summarizers.luhn import LuhnSummarizer

# Extracting and summarizing
LANGUAGE = "english"
SENTENCES_COUNT = 10

url="https://en.wikipedia.org/wiki/Natural_language_processing"
parser = HtmlParser.from_url(url, Tokenizer(LANGUAGE))

# Get the text content of the document
document_text = ""
for sentence in parser.document.sentences:
    document_text += str(sentence) + " "  # Concatenate each sentence into the document text

# Calculate the length of characters in the document
chars_count = len(document_text)

# Print the length of characters
print("Length of characters in the document:", chars_count)

print("---------------------")

## Summarizing the content
summarizer = LuhnSummarizer(Stemmer(LANGUAGE))
summarizer.stop_words = get_stop_words(LANGUAGE)

countAfterSummary = 0
for sentence in summarizer(parser.document, SENTENCES_COUNT):
    print(sentence)
    countAfterSummary += len(str(sentence))

print("-------------------")
print(countAfterSummary)
print("-------------------")
print("ratio of reduction = ", 1 - (countAfterSummary / chars_count))

Length of characters in the document: 40648
---------------------
Focus areas of the time included research on rule-based parsing (e.g., the development of HPSG as a computational operationalization of generative grammar), morphology (e.g., two-level morphology[5]), semantics (e.g., Lesk algorithm), reference (e.g., within Centering Theory[6]) and other areas of natural language understanding (e.g., in the Rhetorical Structure Theory).
Dependency parsing focuses on the relationships between words in a sentence (marking things like primary objects and predicates), whereas constituency parsing focuses on building out the parse tree using a probabilistic context-free grammar(PCFG) (see also stochastic grammar).
Semantic role labelling(see also implicit semantic role labelling below) Given a single sentence, identify and disambiguate semantic predicates (e.g., verbal frames), then identify and classify the frame elements ( semantic roles).
The more general task of coreference resolution al

In [None]:
# just to check the summarizer object
print(summarizer)

# and see its stop words list
print(summarizer.stop_words)

<sumy.summarizers.luhn.LuhnSummarizer object at 0x7e33af5e58a0>
frozenset({'us', 'perhaps', 'welcome', 'placed', 'him', 'anyways', 'considering', 'brief', 'r', 'sup', 'eight', 'doing', 'most', 'anyway', 'd', 'theirs', 'particularly', "aren't", "wouldn't", 'tell', 'whereas', 'upon', 'immediate', 'therefore', 'th', 'fifth', 'per', 'tends', 'whenever', 'like', 'you', 'everyone', 'ltd', 'onto', 'came', 'certainly', 'among', 'wants', 'indicate', 'above', 'zero', 'respectively', 'everywhere', 'see', 'such', 'these', "they'd", 'somehow', 'everything', 'and', 'regardless', 'meanwhile', 'say', 'thereafter', 'entirely', 'old', "they've", 'taken', 'saying', 'none', 'non', 'those', 'f', 'could', 'went', "it'd", 'into', 'certain', "couldn't", 'second', 'ones', "he's", 'where', 'h', 'insofar', 'would', 'himself', 'do', 'much', 'six', 'herein', 'ex', 'with', 'seeing', "doesn't", 'he', 'especially', 'were', 'keeps', 'happens', 'currently', 'even', "they'll", 'despite', 'specifying', 'myself', 'comes',

Recommended Reading
--

https://towardsdatascience.com/text-summarization-on-the-books-of-harry-potter-5e9f5bf8ca6c