# Summarization
## This notebook outlines the concepts behind Text Summarization

## Summarization
- The task of capturing the most important points (the gist) of a long piece of text in a shorter form

### Types of Summarization
- 1. **Extractive Summarization**
    - Selects the sentences from the source text that best represent it
    - Arranges them, unchanged, to form a summary
- 2. **Abstractive Summarization**
    - Identifies the most important ideas in the text
    - Paraphrases them in newly generated sentences to form a summary
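
To make the extractive idea concrete, here is a minimal sketch that scores each sentence by the average frequency of its words and keeps the top few in their original order. The function names (`split_sentences`, `extractive_summary`) and the scoring rule are illustrative assumptions, not the method used by any of the libraries in this notebook.

```python
import re
from collections import Counter


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def extractive_summary(text, n_sentences=2):
    """Pick the n highest-scoring sentences and return them in original order."""
    sentences = split_sentences(text)
    freq = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence):
        tokens = re.findall(r"[a-z']+", sentence.lower())
        # Average word frequency, so long sentences are not favoured automatically
        return sum(freq[t] for t in tokens) / max(len(tokens), 1)

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:n_sentences])  # restore document order
    return [sentences[i] for i in chosen]


sample = (
    "Summarization shortens a text. "
    "Extractive summarization selects sentences from the text. "
    "The weather was pleasant yesterday. "
    "Abstractive summarization rewrites the text in new words."
)
demo_summary = extractive_summary(sample, n_sentences=2)
for sentence in demo_summary:
    print(sentence)
```

An abstractive summarizer would instead generate new sentences, which usually requires a trained language model.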

## Summarization Libraries
- Sumy
- Gensim
- Summa
- BERT **
- BART **
- PEGASUS **
- T5 **

** Will be seen in DL-1


## 1. Sumy
1. Luhn: heuristic method
2. Latent Semantic Analysis (LSA)
3. LexRank: unsupervised approach inspired by the PageRank and HITS algorithms
4. TextRank: graph-based summarization technique with keyword extraction from the document
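
The graph-based idea behind LexRank and TextRank can be sketched in a few lines: treat sentences as nodes, weight edges by how similar two sentences are, and run a PageRank-style iteration. The sketch below is a simplified illustration, not Sumy's implementation; the Jaccard word-overlap similarity and the names `tokenize` and `rank_sentences` are assumptions made for this example.

```python
import re


def tokenize(sentence):
    """Lowercased set of words in a sentence."""
    return set(re.findall(r"[a-z']+", sentence.lower()))


def rank_sentences(sentences, damping=0.85, iterations=50):
    """Score sentences with a PageRank-style iteration over a similarity graph."""
    n = len(sentences)
    tokens = [tokenize(s) for s in sentences]

    # Weighted, undirected graph: edge weight = Jaccard overlap of the word sets
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                union = tokens[i] | tokens[j]
                weights[i][j] = len(tokens[i] & tokens[j]) / len(union) if union else 0.0

    out_sum = [sum(row) for row in weights]
    scores = [1.0 / n] * n if n else []
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            # Each neighbour passes on a share of its score, proportional to edge weight
            incoming = sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n)
                if out_sum[j] > 0
            )
            new_scores.append((1 - damping) / n + damping * incoming)
        scores = new_scores
    return scores


demo_sentences = [
    "Graph algorithms rank sentences by similarity.",
    "Similarity between sentences builds the graph.",
    "The graph is ranked with a PageRank style algorithm.",
    "Bananas are yellow.",
]
demo_scores = rank_sentences(demo_sentences)
for sentence, value in zip(demo_sentences, demo_scores):
    print(f"{value:.3f}  {sentence}")
```

The unrelated sentence shares no words with the others, so it receives no score from its neighbours and ranks last.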

Documentation reference: [sumy](https://github.com/miso-belica/sumy)

## Task: Take a piece of text from a Wikipedia page and summarize it using Sumy
### Steps
- Install the necessary libraries
- Import the libraries
- Scrape the text from a pre-defined webpage
- Summarize

### Install Sumy

In [1]:
#! pip install sumy

### Import the libraries
- HtmlParser
- Tokenizer
- TextRankSummarizer

In [5]:
from sumy.parsers.html import HtmlParser
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.text_rank import TextRankSummarizer
from sumy.nlp.stemmers import Stemmer
from sumy.utils import get_stop_words

LANGUAGE = "english"
SENTENCES_COUNT = 5

### Scrape the text

In [6]:
url = "https://en.wikipedia.org/wiki/Automatic_summarization"


In [7]:
# Create parser
parser = HtmlParser.from_url(url, Tokenizer(LANGUAGE))

### Summarize - TextRankSummarizer

In [8]:
# Create summarizer
stemmer = Stemmer(LANGUAGE)
summarizer = TextRankSummarizer(stemmer)
summarizer.stop_words = get_stop_words(LANGUAGE)

In [None]:
# Summarize
summary = summarizer(parser.document, SENTENCES_COUNT)
print("TextRank Summary:")
for sentence in summary:
    print(sentence)

### Try different Summarizers
- LexRankSummarizer
- LuhnSummarizer
- LsaSummarizer

### Import the summarizers

In [18]:
from sumy.summarizers.lsa import LsaSummarizer
from sumy.summarizers.lex_rank import LexRankSummarizer 
from sumy.summarizers.luhn import LuhnSummarizer

### Create Summarizers

In [19]:
# Initialize the summarizers
lsa_summarizer = LsaSummarizer(stemmer)
lsa_summarizer.stop_words = get_stop_words(LANGUAGE)

lexrank_summarizer = LexRankSummarizer(stemmer)
lexrank_summarizer.stop_words = get_stop_words(LANGUAGE)

luhn_summarizer = LuhnSummarizer(stemmer)
luhn_summarizer.stop_words = get_stop_words(LANGUAGE)

### LexRankSummarizer

In [None]:
# Summarize using LexRank
lexrank_summary = lexrank_summarizer(parser.document, SENTENCES_COUNT)
print("LexRank Summary:")
for sentence in lexrank_summary:
    print(sentence)

### LuhnSummarizer

In [None]:
# Summarize using Luhn
luhn_summary = luhn_summarizer(parser.document, SENTENCES_COUNT)
print("Luhn Summary:")
for sentence in luhn_summary:
    print(sentence)

### LsaSummarizer

In [None]:
# Summarize using LSA
lsa_summary = lsa_summarizer(parser.document, SENTENCES_COUNT)
print("LSA Summary:")
for sentence in lsa_summary:
    print(sentence)
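
One simple way to compare the summarizers is to measure how many sentences each pair has in common. The helper below is a hypothetical sketch (the name `sentence_overlap` and the example data are not part of Sumy); it takes plain strings and reports the Jaccard overlap for every pair.

```python
from itertools import combinations


def sentence_overlap(summaries):
    """Return the Jaccard overlap of selected sentences for each pair of summarizers."""
    overlaps = {}
    for (name_a, sents_a), (name_b, sents_b) in combinations(summaries.items(), 2):
        set_a, set_b = set(sents_a), set(sents_b)
        union = set_a | set_b
        overlaps[(name_a, name_b)] = len(set_a & set_b) / len(union) if union else 0.0
    return overlaps


# Hypothetical outputs, standing in for real summaries
example = {
    "LexRank": ["Sentence A.", "Sentence B.", "Sentence C."],
    "Luhn": ["Sentence B.", "Sentence C.", "Sentence D."],
    "LSA": ["Sentence E.", "Sentence F.", "Sentence G."],
}
overlaps = sentence_overlap(example)
for pair, value in overlaps.items():
    print(pair, round(value, 2))
```

With Sumy, each item in a summary is a sentence object, so converting it with `str(sentence)` first should give comparable strings.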

## 2. Gensim

## Task: Take a piece of text from a Wikipedia page and summarize it using Gensim
### Steps
- Install the necessary libraries
- Import the libraries
- Scrape the text from a pre-defined webpage
- Summarize

### Install the library

In [30]:
#!pip install "numpy<=1.16.1" "gensim==3.8.3"

### Import the library

In [28]:
from gensim.summarization import summarize

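ModuleNotFoundError: No module named 'gensim.summarization'

Note: this error appears with Gensim 4.0 and later, where the `summarization` module was removed. Install `gensim==3.8.3` (see the install cell above) to run this section.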

### Scrape the text
- Use BeautifulSoup to extract the text (from Task 1 of ML-1)

In [None]:
import requests
from bs4 import BeautifulSoup

In [15]:
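def get_page(url):
    """Download a webpage and return it as a BeautifulSoup object."""
    # Get the webpage content; fail early on HTTP errors or a stalled connection
    page = requests.get(url, timeout=10)
    page.raise_for_status()

    # Create BeautifulSoup object
    soup = BeautifulSoup(page.content, 'html.parser')

    return soup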

In [16]:
def collect_text(soup):
    # Find all paragraph tags
    paragraphs = soup.find_all('p')
    
    # Extract text from paragraphs
    text = ' '.join([p.get_text() for p in paragraphs])
    
    return text

In [17]:
url = "https://en.wikipedia.org/wiki/Automatic_summarization"

In [None]:
text = collect_text(get_page(url))
text
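
Text scraped from Wikipedia paragraphs typically still contains citation markers such as `[12]` and irregular whitespace, which can end up in the summary. A minimal cleaning sketch follows; the helper name `clean_text` and the sample string are assumptions for illustration, and the pattern only targets purely numeric markers.

```python
import re


def clean_text(text):
    """Remove bracketed numeric citation markers such as [12] and collapse whitespace."""
    text = re.sub(r"\[\d+\]", "", text)
    return re.sub(r"\s+", " ", text).strip()


raw = "Automatic summarization shortens text.[1]  It keeps key points.[2][3]\n"
cleaned = clean_text(raw)
print(cleaned)
```

Applying the same function to the scraped `text` before summarizing is optional, but it tends to give cleaner output.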

### Summarize
- **word_count**: maximum number of words wanted in the summary
- **ratio**: fraction of the sentences in the original text to return as the summary
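
The effect of these two parameters can be illustrated with a small sketch. It assumes the sentences are already sorted from most to least important; `select_by_ratio` and `select_by_word_count` are hypothetical helpers written for this example, not Gensim's code, and Gensim's own rounding and cut-off rules may differ.

```python
def select_by_ratio(ranked_sentences, ratio=0.2):
    """Keep the top fraction of sentences (rounded down)."""
    keep = int(len(ranked_sentences) * ratio)
    return ranked_sentences[:keep]


def select_by_word_count(ranked_sentences, word_count):
    """Add sentences in rank order while the running total stays within word_count."""
    selected, total = [], 0
    for sentence in ranked_sentences:
        n_words = len(sentence.split())
        if total + n_words > word_count:
            break
        selected.append(sentence)
        total += n_words
    return selected


# Ten placeholder sentences of four words each, assumed to be ranked already
ranked = [f"Sentence number {i} here." for i in range(10)]

by_ratio = select_by_ratio(ranked, ratio=0.5)
by_words = select_by_word_count(ranked, word_count=10)
print(len(by_ratio), "sentences kept by ratio")
print(len(by_words), "sentences kept by word count")
```

A ratio scales with the length of the input, while a word limit gives a summary of roughly constant size regardless of how long the source is.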

In [None]:
# Summarize using Gensim
summary = summarize(text, ratio=0.3)
print("Gensim Summary:")
print(summary)

## 3. Summa

## Task: Take a piece of text from a Wikipedia page and summarize it using Summa
### Steps
- Install the necessary libraries
- Import the libraries
- Scrape the text from a pre-defined webpage
- Summarize

### Install the library

In [20]:
# !pip install summa

### Import the library

In [None]:
from summa import summarizer as summa_summarizer
from summa import keywords

### Scrape the text
- Use BeautifulSoup to extract the text (from Task 1 of ML-1)

### Summarize

In [None]:
# Summarize using Summa
summary = summa_summarizer.summarize(text, ratio=0.3)
print("Summa Summary:")
print(summary)

# Extract keywords
key_words = keywords.keywords(text)
print("\nKeywords:")
print(key_words)

## ASSIGNMENT
Take the same Medium article (the one I wrote) that we used for Task 1 of ML-1, extract the text, and summarize it using all of the above methods. Provide the best summary with a note explaining why the chosen library is the best.

url = "https://medium.com/@subashgandyer/papa-what-is-a-neural-network-c5e5cc427c7"

### Submit 2 files
- (notebook) .ipynb
- (summary) .txt
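
For the `.txt` file, one possible approach is to write the chosen summary together with the justification note in a single step. The sketch below is only a suggestion; the function name `save_summary`, the default file name `summary.txt`, and the file layout are assumptions, not assignment requirements.

```python
from pathlib import Path


def save_summary(summary_text, library_name, reason, path="summary.txt"):
    """Write the chosen summary and a short justification to a text file."""
    content = (
        f"Best library: {library_name}\n"
        f"Why: {reason}\n\n"
        f"Summary:\n{summary_text}\n"
    )
    output_path = Path(path)
    output_path.write_text(content, encoding="utf-8")
    return output_path


# Placeholder values; replace them with your own summary and reasoning
saved = save_summary(
    summary_text="<paste the best summary here>",
    library_name="<library name>",
    reason="<one or two sentences of justification>",
)
print(f"Saved to {saved}")
```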