Gil Paolo Adiao (101590566)

# Summarization



### Summarization
- The task of condensing a long piece of text into a short version that keeps its most important points

### Types of Summarization
- 1. **Extractive Summarization**
    - Select sentences from the corpus that best represent the text
    - Arrange them to form a summary
- 2. **Abstractive Summarization**
    - Identifies the most important ideas in the text
    - Generates new sentences that paraphrase them to form a summary
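
To make the extractive idea concrete, here is a minimal sketch that scores sentences by word frequency and keeps the highest-scoring ones. It is not taken from any of the libraries below; the stopword list, the naive sentence splitter, and the function names `split_sentences` and `extractive_summary` are assumptions made for illustration.

```python
import re
from collections import Counter

# Small illustrative stopword list (an assumption, not a library's list).
STOPWORDS = {"the", "a", "an", "of", "to", "and", "is", "in", "it", "that", "for", "on", "are"}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def content_words(text):
    """Lowercase word tokens with stopwords removed."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]


def extractive_summary(text, sentence_count=2):
    """Score each sentence by the summed corpus frequency of its content words,
    then return the top sentences in their original order."""
    sentences = split_sentences(text)
    freq = Counter(content_words(text))
    ranked = sorted(
        range(len(sentences)),
        key=lambda i: sum(freq[w] for w in content_words(sentences[i])),
        reverse=True,
    )
    chosen = sorted(ranked[:sentence_count])  # restore document order
    return [sentences[i] for i in chosen]


sample = (
    "Cats sleep most of the day. "
    "Cats hunt mice at night and cats love fish. "
    "The weather was mild."
)
print(extractive_summary(sample, sentence_count=1))
```

An abstractive summarizer would instead generate new wording, which is why it usually relies on the neural models marked with ** below.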

### Summarization Libraries
- Sumy
- Gensim
- Summa
- BERT **
    - BART **
    - PEGASUS **
    - T5 **

** Will be seen in DL-1

## 1. Sumy

1. Luhn – Heuristic method
2. Latent Semantic Analysis (LSA)
3. LexRank – Unsupervised approach inspired by the PageRank and HITS algorithms
4. TextRank – Graph-based summarization technique with keyword extraction from the document

Documentation Reference [sumy](https://github.com/miso-belica/sumy)
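
The graph-based methods in the list (LexRank and TextRank) can be pictured with a small sketch: treat each sentence as a node, weight edges by word overlap, and run a PageRank-style iteration. This is only an illustration of the idea. The helper names `tokenize`, `similarity`, and `rank_sentences` are assumptions, and Sumy's own implementations may differ in tokenization, similarity measure, and convergence handling.

```python
import math
import re


def tokenize(sentence):
    """Set of lowercase word tokens in a sentence."""
    return set(re.findall(r"[a-z]+", sentence.lower()))


def similarity(a, b):
    """Word-overlap similarity, normalised by the log of each sentence's size."""
    wa, wb = tokenize(a), tokenize(b)
    if len(wa) < 2 or len(wb) < 2:
        return 0.0  # avoids log(1) + log(1) == 0 in the denominator
    return len(wa & wb) / (math.log(len(wa)) + math.log(len(wb)))


def rank_sentences(sentences, damping=0.85, iterations=50):
    """PageRank-style power iteration over the sentence similarity graph."""
    n = len(sentences)
    if n == 0:
        return []
    weights = [
        [similarity(sentences[i], sentences[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    out_sum = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            # Score flowing into sentence i from every connected sentence j.
            incoming = sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n)
                if out_sum[j] > 0
            )
            new_scores.append((1 - damping) / n + damping * incoming)
        scores = new_scores
    return scores


sentences = [
    "Graph algorithms rank nodes by importance.",
    "PageRank is a graph algorithm that ranks web pages.",
    "TextRank applies graph ranking to sentences.",
    "I had pasta for lunch.",
]
scores = rank_sentences(sentences)
for sentence, score in sorted(zip(sentences, scores), key=lambda p: p[1], reverse=True):
    print(f"{score:.3f}  {sentence}")
```

The sentence that shares no words with the others receives only the baseline score, so it ranks last; a summarizer built on this ranking would keep the top few sentences.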

Task: Take a piece of text from a Wikipedia page and summarize it using Sumy
### Steps
- Install the necessary libraries
- Import the libraries
- Scrape the text from a pre-defined webpage
- Summarize

### Install Sumy

In [1]:
!pip install sumy
!pip install lxml_html_clean
!pip install requests
!pip install nltk
!pip install beautifulsoup4



### Import the libraries
- HtmlParser
- Tokenizer
- TextRankSummarizer

In [2]:
# prompt: import the libraries

import nltk
nltk.download('punkt_tab')

from sumy.parsers.html import HtmlParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.text_rank import TextRankSummarizer
import requests
from bs4 import BeautifulSoup


[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\gilad\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


### Scrape the text

In [3]:
# prompt: Scrape the text from the url
url = "https://en.wikipedia.org/wiki/Automatic_summarization"
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")

# Extract the text from the main content area of the page
content_div = soup.find(id="mw-content-text")
if content_div:
  paragraphs = content_div.find_all("p")
  text = ""
  for paragraph in paragraphs:
    text += paragraph.get_text() + "\n"
  print(text)
else:
  print("Could not find the main content area on the page.")


Automatic summarization is the process of shortening a set of data computationally, to create a subset (a summary) that represents the most important or relevant information within the original content. Artificial intelligence algorithms are commonly developed and employed to achieve this, specialized for different types of data.

Text summarization is usually implemented by natural language processing methods, designed to locate the most informative sentences in a given document.[1] On the other hand, visual content can be summarized using computer vision algorithms. Image summarization is the subject of ongoing research; existing approaches typically attempt to display the most representative images from a given image collection, or generate a video that only includes the most important content from the entire collection.[2][3][4] Video summarization algorithms identify and extract from the original video content the most important frames (key-frames), and/or the most important video
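
The scraped text still contains Wikipedia citation markers such as `[1]` and `[2][3][4]`. These sit directly after sentence-ending periods and may confuse sentence tokenizers, so an optional cleaning step can help. The sketch below is one possible approach; the helper name `strip_citation_markers` is an assumption and not part of any library used here.

```python
import re


def strip_citation_markers(text):
    """Remove bracketed numeric references such as [1] or [23]."""
    return re.sub(r"\[\d+\]", "", text)


raw = (
    "designed to locate the most informative sentences in a given document.[1] "
    "On the other hand, visual content can be summarized using computer vision algorithms."
)
print(strip_citation_markers(raw))
```

Applying it once, right after scraping (for example `text = strip_citation_markers(text)`), keeps the summaries free of stray reference numbers.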

### Summarize - TextRankSummarizer




In [4]:
# prompt: Summarize using TextRankSummarizer
from sumy.parsers.plaintext import PlaintextParser

def summarize_text(text, sentence_count=3):
    parser = PlaintextParser.from_string(text, Tokenizer("english"))
    summarizer = TextRankSummarizer()
    summary = summarizer(parser.document, sentence_count)
    return summary


# Example usage (replace with your own text):
text_to_summarize = text

summary = summarize_text(text_to_summarize)
print("TextRankSummary: ")
for sentence in summary:
  print(sentence)


TextRankSummary: 
For text, extraction is analogous to the process of skimming, where the summary (if available), headings and subheadings, figures, the first and last paragraphs of a section, and optionally the first and last sentences in a paragraph are read before one chooses to read the entire document in detail.
Once the graph is constructed, it is used to form a stochastic matrix, combined with a damping factor (as in the "random surfer model"), and the ranking over vertices is obtained by finding the eigenvector corresponding to eigenvalue 1 (i.e., the stationary distribution of the random walk on the graph).
While the goal of a brief summary is to simplify information search and cut the time by pointing to the most relevant source documents, comprehensive multi-document summary should itself contain the required information, hence limiting the need for accessing original files to cases when refinement is required.


### Import the summarizers

### Try different Summarizers
- LexRankSummarizer
- LuhnSummarizer
- LsaSummarizer

In [5]:
# prompt: import LexRankSummarizer, LuhnSummarizer, and LSASummarizer

from sumy.summarizers.lex_rank import LexRankSummarizer
from sumy.summarizers.luhn import LuhnSummarizer
from sumy.summarizers.lsa import LsaSummarizer

# PlaintextParser and Tokenizer are reused from the cells above


def summarize_text_lexrank(text, sentence_count=3):
    parser = PlaintextParser.from_string(text, Tokenizer("english"))
    summarizer = LexRankSummarizer()
    summary = summarizer(parser.document, sentence_count)
    return summary

def summarize_text_luhn(text, sentence_count=3):
    parser = PlaintextParser.from_string(text, Tokenizer("english"))
    summarizer = LuhnSummarizer()
    summary = summarizer(parser.document, sentence_count)
    return summary

def summarize_text_lsa(text, sentence_count=3):
    parser = PlaintextParser.from_string(text, Tokenizer("english"))
    summarizer = LsaSummarizer()
    summary = summarizer(parser.document, sentence_count)
    return summary

# Example usage for LexRankSummarizer:
lexrank_summary = summarize_text_lexrank(text_to_summarize)
print("LexRank Summary:")
for sentence in lexrank_summary:
  print(sentence)

# Example usage for LuhnSummarizer:
luhn_summary = summarize_text_luhn(text_to_summarize)
print("\nLuhn Summary:")
for sentence in luhn_summary:
  print(sentence)

# Example usage for LsaSummarizer:
lsa_summary = summarize_text_lsa(text_to_summarize)
print("\nLSA Summary:")
for sentence in lsa_summary:
  print(sentence)



LexRank Summary:
An example of a summarization problem is document summarization, which attempts to automatically produce an abstract from a given document.
The main difficulty in supervised extractive summarization is that the known summaries must be manually created by extracting sentences so the sentences in an original training document can be labeled as "in summary" or "not in summary".
For example, in document summarization, one would like the summary to cover all important and relevant concepts in the document.

Luhn Summary:
Once the graph is constructed, it is used to form a stochastic matrix, combined with a damping factor (as in the "random surfer model"), and the ranking over vertices is obtained by finding the eigenvector corresponding to eigenvalue 1 (i.e., the stationary distribution of the random walk on the graph).
For example, if we rank unigrams and find that "advanced", "natural", "language", and "processing" all get high ranks, then we would look at the original te
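
When comparing summarizers, it can help to measure how much two summaries agree. In the outputs above, for instance, TextRank and Luhn both pick the sentence beginning "Once the graph is constructed". The sketch below computes a simple Jaccard overlap over sentences; the function name `jaccard_overlap` is an assumption, and it relies only on `str()` of each sentence, so it should work with plain strings as well as the sentence objects that `print` handles above.

```python
def jaccard_overlap(summary_a, summary_b):
    """Fraction of distinct sentences shared by two summaries (0.0 to 1.0)."""
    a = {str(s).strip() for s in summary_a}
    b = {str(s).strip() for s in summary_b}
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


first = ["Sentence A.", "Sentence B.", "Sentence C."]
second = ["Sentence B.", "Sentence C.", "Sentence D."]
print(jaccard_overlap(first, second))
```

A high overlap suggests two methods agree on what matters; a low overlap is a cue to read both summaries and judge which one covers the article better.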

## 2. Gensim

Task: Take a piece of text from a Wikipedia page and summarize it using Gensim
### Steps
- Install the necessary libraries
- Import the libraries
- Scrape the text from a pre-defined webpage
- Summarize

### Install the library

In [6]:
# gensim.summarization was removed in Gensim 4.0, so pin a 3.x release
!pip install gensim==3.8.3



### Import the library

In [7]:
# prompt: import gensim

from gensim.summarization import summarize

### Scrape the text
- Use BeautifulSoup to extract text (from Task 1 of ML-1)

In [8]:
def get_page(url):
    # Make a request to the given URL
    response = requests.get(url)
    
    # Check if the request was successful
    if response.status_code == 200:
        # Parse the HTML content using BeautifulSoup
        soup = BeautifulSoup(response.content, "html.parser")
        return soup
    else:
        print(f"Error: Unable to fetch the page. Status code: {response.status_code}")
        return None

In [9]:
def collect_text(soup):
    # Extract the text from the main content area of the page
    content_div = soup.find(id="mw-content-text")
    if content_div:
        paragraphs = content_div.find_all("p")
        text = ""
        for paragraph in paragraphs:
            text += paragraph.get_text() + "\n"
        return text
    else:
        print("Could not find the main content area on the page.")
        return ""

In [10]:
url = "https://en.wikipedia.org/wiki/Automatic_summarization"

In [11]:
text = collect_text(get_page(url))
text

'Automatic summarization is the process of shortening a set of data computationally, to create a subset (a summary) that represents the most important or relevant information within the original content. Artificial intelligence algorithms are commonly developed and employed to achieve this, specialized for different types of data.\n\nText summarization is usually implemented by natural language processing methods, designed to locate the most informative sentences in a given document.[1] On the other hand, visual content can be summarized using computer vision algorithms. Image summarization is the subject of ongoing research; existing approaches typically attempt to display the most representative images from a given image collection, or generate a video that only includes the most important content from the entire collection.[2][3][4] Video summarization algorithms identify and extract from the original video content the most important frames (key-frames), and/or the most important vi

### Summarize
- **word_count**: maximum number of words we want in the summary
- **ratio**: fraction of sentences in the original text that should be returned as output
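
To see how the two parameters differ, the sketch below applies each rule to a list of sentences that is assumed to be already ranked from most to least important. It is a simplification for intuition only and not Gensim's code; the names `select_by_ratio` and `select_by_word_count` are assumptions, and the library may treat the word budget boundary differently.

```python
import math


def select_by_ratio(ranked_sentences, ratio=0.2):
    """Keep the top fraction of sentences (at least one if any exist)."""
    if not ranked_sentences:
        return []
    keep = max(1, math.floor(len(ranked_sentences) * ratio))
    return ranked_sentences[:keep]


def select_by_word_count(ranked_sentences, word_count=100):
    """Add sentences in rank order while the running total stays within the budget."""
    selected, total = [], 0
    for sentence in ranked_sentences:
        n_words = len(sentence.split())
        if total + n_words > word_count:
            break
        selected.append(sentence)
        total += n_words
    return selected


ranked = ["one two three", "four five", "six seven eight nine", "ten"]
print(select_by_ratio(ranked, ratio=0.5))
print(select_by_word_count(ranked, word_count=5))
```

With `ratio`, the summary length scales with the document; with `word_count`, it is capped regardless of how long the source is.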

In [13]:
# Summarize with word_count parameter (limits summary to specified number of words)
summary_by_word_count = summarize(text, word_count=100)
print("Summary with word_count=100:")
print(summary_by_word_count)
print("-" * 80)

Summary with word_count=100:
Image summarization is the subject of ongoing research; existing approaches typically attempt to display the most representative images from a given image collection, or generate a video that only includes the most important content from the entire collection.[2][3][4] Video summarization algorithms identify and extract from the original video content the most important frames (key-frames), and/or the most important video segments (key-shots), normally in a temporally ordered fashion.[5][6][7][8] Video summaries simply retain a carefully selected subset of the original video frames and, therefore, are not identical to the output of video synopsis algorithms, where new video frames are being synthesized based on the original video content.
--------------------------------------------------------------------------------


In [14]:
# Summarize with ratio parameter (returns a fraction of the original text)
summary_by_ratio = summarize(text, ratio=0.01)  # 1% of the original sentences
print("Summary with ratio=0.01:")
print(summary_by_ratio)
print("-" * 80)

Summary with ratio=0.01:
Image summarization is the subject of ongoing research; existing approaches typically attempt to display the most representative images from a given image collection, or generate a video that only includes the most important content from the entire collection.[2][3][4] Video summarization algorithms identify and extract from the original video content the most important frames (key-frames), and/or the most important video segments (key-shots), normally in a temporally ordered fashion.[5][6][7][8] Video summaries simply retain a carefully selected subset of the original video frames and, therefore, are not identical to the output of video synopsis algorithms, where new video frames are being synthesized based on the original video content.
Some techniques and algorithms which naturally model summarization problems are TextRank and PageRank, Submodular set function, Determinantal point process, maximal marginal relevance (MMR) etc.
-------------------------------

In [15]:
# Try different ratio
summary_by_ratio_larger = summarize(text, ratio=0.05)  # 5% of the original sentences
print("Summary with ratio=0.05:")
print(summary_by_ratio_larger)

Summary with ratio=0.05:
Automatic summarization is the process of shortening a set of data computationally, to create a subset (a summary) that represents the most important or relevant information within the original content.
Text summarization is usually implemented by natural language processing methods, designed to locate the most informative sentences in a given document.[1] On the other hand, visual content can be summarized using computer vision algorithms.
Image summarization is the subject of ongoing research; existing approaches typically attempt to display the most representative images from a given image collection, or generate a video that only includes the most important content from the entire collection.[2][3][4] Video summarization algorithms identify and extract from the original video content the most important frames (key-frames), and/or the most important video segments (key-shots), normally in a temporally ordered fashion.[5][6][7][8] Video summaries simply retai

## 3. Summa

Task: Take a piece of text from a Wikipedia page and summarize it using Summa
### Steps
- Install the necessary libraries
- Import the libraries
- Scrape the text from a pre-defined webpage
- Summarize

### Install the library

In [16]:
!pip install summa

Collecting summa
  Downloading summa-1.2.0.tar.gz (54 kB)
     -------------------------------------- 54.9/54.9 kB 475.2 kB/s eta 0:00:00
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'done'
Building wheels for collected packages: summa
  Building wheel for summa (setup.py): started
  Building wheel for summa (setup.py): finished with status 'done'
  Created wheel for summa: filename=summa-1.2.0-py3-none-any.whl size=54415 sha256=c4b90b4487f602857a30d35a1b0c2c6646a45e6c16b24d0736dd74f52621724e
  Stored in directory: c:\users\gilad\appdata\local\pip\cache\wheels\dc\1c\fd\4a9777b43e1504f70d83e82d7cac97e7caf97e6ed133ac4681
Successfully built summa
Installing collected packages: summa
Successfully installed summa-1.2.0


### Import the library

In [None]:
from summa import summarizer, keywords

### Scrape the text
- Use BeautifulSoup to extract text (from Task 1 of ML-1)

In [18]:
# Use the existing functions to scrape text from the URL
soup = get_page(url)
text = collect_text(soup)

# Display the first 500 characters to check the content
print(text[:500])

Automatic summarization is the process of shortening a set of data computationally, to create a subset (a summary) that represents the most important or relevant information within the original content. Artificial intelligence algorithms are commonly developed and employed to achieve this, specialized for different types of data.

Text summarization is usually implemented by natural language processing methods, designed to locate the most informative sentences in a given document.[1] On the othe


### Summarize

In [19]:
# Summarize using Summa
# Ratio-based summarization (similar to Gensim)
summary_ratio_small = summarizer.summarize(text, ratio=0.01)
print("Summa Summary with ratio=0.01:")
print(summary_ratio_small)
print("-" * 80)

summary_ratio_medium = summarizer.summarize(text, ratio=0.05)
print("Summa Summary with ratio=0.05:")
print(summary_ratio_medium)
print("-" * 80)

# Word count-based summarization
summary_words = summarizer.summarize(text, words=100)
print("Summa Summary with words=100:")
print(summary_words)
print("-" * 80)

# Extract keywords from the text
extracted_keywords = keywords.keywords(text, ratio=0.05)
print("Summa Keywords (ratio=0.05):")
print(extracted_keywords)

Summa Summary with ratio=0.01:
Text summarization is usually implemented by natural language processing methods, designed to locate the most informative sentences in a given document.[1] On the other hand, visual content can be summarized using computer vision algorithms.
The main difficulty in supervised extractive summarization is that the known summaries must be manually created by extracting sentences so the sentences in an original training document can be labeled as "in summary" or "not in summary".
--------------------------------------------------------------------------------
Summa Summary with ratio=0.05:
Automatic summarization is the process of shortening a set of data computationally, to create a subset (a summary) that represents the most important or relevant information within the original content.
Text summarization is usually implemented by natural language processing methods, designed to locate the most informative sentences in a given document.[1] On the other hand,

## ASSIGNMENT: 
Take the same Medium article (the one I wrote) that we used for Task 1 of ML-1, extract the text, and summarize it using all of the above methods. Provide the best summary, with a note explaining why the chosen library is the best.
url = https://medium.com/@subashgandyer/papa-what-is-a-neural-network-c5e5cc427c7

### Submit 2 files
- (notebook) .ipynb
- (summary) .txt
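
For the `.txt` deliverable, one possible way to write the chosen summary and the accompanying note to a file is sketched below. The helper name `save_summary`, the file name, and the "Note:" layout are assumptions; the example writes to a temporary directory only so that it leaves no files behind.

```python
import tempfile
from pathlib import Path


def save_summary(sentences, note, path="summary.txt"):
    """Write one sentence per line, then a blank line and the justification note."""
    lines = [str(s) for s in sentences]
    body = "\n".join(lines) + "\n\nNote: " + note + "\n"
    target = Path(path)
    target.write_text(body, encoding="utf-8")
    return target


with tempfile.TemporaryDirectory() as tmp:
    out = save_summary(
        ["First chosen sentence.", "Second chosen sentence."],
        "Placeholder for the reason this summary was chosen.",
        Path(tmp) / "summary.txt",
    )
    print(out.read_text(encoding="utf-8"))
```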