# Text Summarization with Python
The Text Summarization with Python workshop is a gentle introduction to natural language processing. In this short workshop, we'll cover the basics of Python and the types of text summarization, then build a real-world text summarization app that you can deploy and share with friends!

[![Workshop Cover](https://i.imgur.com/LfkpgC7.png)](https://www.canva.com/design/DAFOp9FIr4Q/_Op6f8q2fgRXSmLmufETXw/view?utm_content=DAFOp9FIr4Q&utm_campaign=designshare&utm_medium=link2&utm_source=sharebutton)

### Workshop Resources
- [Slide Deck]()
- [Teaching Plan]()
- [YouTube Video Workshop]()
- [Student Copy of Workshop](https://colab.research.google.com/drive/1h58KbSNvtfBqaGq0Pp1KyZlTYgjbVEq_?usp=sharing)

### Before you begin
1. Make a copy of this Colab file by clicking `Save copy to drive`. This keeps the code in your Google Drive. A new tab should open with the copied file!
2. Start enjoying the workshop below!

### Installing and importing dependencies

In [None]:
!pip install gradio

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting gradio
  Downloading gradio-3.6-py3-none-any.whl (5.3 MB)
Collecting pydub
  Downloading pydub-0.25.1-py2.py3-none-any.whl (32 kB)
Collecting fastapi
  Downloading fastapi-0.85.1-py3-none-any.whl (55 kB)
Collecting ffmpy
  Downloading ffmpy-0.3.0.tar.gz (4.8 kB)
Collecting python-multipart
  Downloading python-multipart-0.0.5.tar.gz (32 kB)
Collecting orjson
  Downloading orjson-3.8.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (270 kB)
Collecting markdown-it-py[linkify,plugins]
  Downloading markdown_it_py-2.1.0-py3-none-any.whl (84 kB)
Collecting h11<0.13,>=0.11
  Downloading h11-0.12.0-py3-none-any.whl (54 kB)

In [None]:
import nltk
import heapq
import gradio as gr
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from string import punctuation

### Finding a piece of text to summarize

In [None]:
text = """
There are broadly two types of extractive summarization tasks depending on what the summarization program focuses on. The first is generic summarization, which focuses on obtaining a generic summary or abstract of the collection (whether documents, or sets of images, or videos, news stories etc.). The second is query relevant summarization, sometimes called query-based summarization, which summarizes objects specific to a query. Summarization systems are able to create both query relevant text summaries and generic machine-generated summaries depending on what the user needs.
An example of a summarization problem is document summarization, which attempts to automatically produce an abstract from a given document. Sometimes one might be interested in generating a summary from a single source document, while others can use multiple source documents (for example, a cluster of articles on the same topic). This problem is called multi-document summarization. A related application is summarizing news articles. Imagine a system, which automatically pulls together news articles on a given topic (from the web), and concisely represents the latest news as a summary.
Image collection summarization is another application example of automatic summarization. It consists in selecting a representative set of images from a larger set of images.[8] A summary in this context is useful to show the most representative images of results in an image collection exploration system. Video summarization is a related domain, where the system automatically creates a trailer of a long video. This also has applications in consumer or personal videos, where one might want to skip the boring or repetitive actions. Similarly, in surveillance videos, one would want to extract important and suspicious activity, while ignoring all the boring and redundant frames captured.
"""

### Adding our stopwords
Stopwords are very common words in the English language, such as "a", "the", or "so". These words add little meaning to a text on their own, so we remove them before counting!

In [None]:
stopwords = list(STOP_WORDS)
stopwords
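
To make the idea concrete, here is a minimal sketch of stopword filtering that runs without spaCy. The tiny `SAMPLE_STOPWORDS` set and the `remove_stopwords` helper are made up for illustration and are not spaCy's actual list. The lookup lowercases each word because stopword lists are typically stored in lowercase, and a set is used because membership tests on a set are faster than on a list.

```python
# Hypothetical, hand-picked stopwords for illustration only (not spaCy's STOP_WORDS).
SAMPLE_STOPWORDS = {"a", "an", "the", "so", "is", "of", "to"}

def remove_stopwords(words, stopwords=SAMPLE_STOPWORDS):
    """Return the words whose lowercase form is not a stopword."""
    return [word for word in words if word.lower() not in stopwords]

words = "The summary of a document is useful".split()
print(remove_stopwords(words))  # ['summary', 'document', 'useful']
```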

### Loading & using our NLP model
Don't worry too much about understanding this for now. An NLP model is basically a tool that processes a big chunk of text!

In [None]:
nlp = spacy.load('en_core_web_sm') # nlp is a pipeline object that can be called like a function
nlp

<spacy.lang.en.English at 0x7f764ef82f90>

In [None]:
doc = nlp(text) # English pipeline optimized for CPU. Components: tok2vec, tagger, parser, senter, ner, attribute_ruler, lemmatizer.
doc


There are broadly two types of extractive summarization tasks depending on what the summarization program focuses on. The first is generic summarization, which focuses on obtaining a generic summary or abstract of the collection (whether documents, or sets of images, or videos, news stories etc.). The second is query relevant summarization, sometimes called query-based summarization, which summarizes objects specific to a query. Summarization systems are able to create both query relevant text summaries and generic machine-generated summaries depending on what the user needs.
An example of a summarization problem is document summarization, which attempts to automatically produce an abstract from a given document. Sometimes one might be interested in generating a summary from a single source document, while others can use multiple source documents (for example, a cluster of articles on the same topic). This problem is called multi-document summarization. A related application is summar

In [None]:
tokens = [token.text for token in doc]
tokens

In [None]:
cleaned = [word for word in tokens if word not in stopwords and word not in punctuation + '\n']
cleaned
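
One detail worth knowing about the cell above: `word not in punctuation + '\n'` is a substring test on a string, not a lookup in a collection of characters. It works for single-character tokens, but a token such as `...` is not a substring of `string.punctuation` and would be kept. A possible alternative, sketched here with a hypothetical helper `is_punctuation`, checks every character of the token instead:

```python
from string import punctuation

# Characters treated as noise: ASCII punctuation plus the newline.
NOISE_CHARS = set(punctuation) | {"\n"}

def is_punctuation(token):
    """True when the token is non-empty and made only of noise characters."""
    return bool(token) and all(char in NOISE_CHARS for char in token)

tokens = ["summary", ".", "...", "\n", "query-based", "(", "etc"]
print([t for t in tokens if not is_punctuation(t)])
# ['summary', 'query-based', 'etc']
```

This is a design choice rather than a required fix: for the sample text in this workshop, both approaches should filter the same single-character tokens.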

### Calculating word frequencies
We're going to create a dictionary. This dictionary will contain a key with each unique word, and a value of the number of times that word occurs in the entire piece of text.

*Note: We're not including stopwords, as they add little to the overall meaning of the text.*

In [None]:
word_frequencies = {}
for word in cleaned:
  if word not in word_frequencies.keys():
    word_frequencies[word] = 1
  else:
    word_frequencies[word] += 1

word_frequencies


{'There': 1,
 'broadly': 1,
 'types': 1,
 'extractive': 1,
 'summarization': 11,
 'tasks': 1,
 'depending': 2,
 'program': 1,
 'focuses': 2,
 'The': 2,
 'generic': 3,
 'obtaining': 1,
 'summary': 4,
 'abstract': 2,
 'collection': 3,
 'documents': 2,
 'sets': 1,
 'images': 3,
 'videos': 3,
 'news': 4,
 'stories': 1,
 'etc': 1,
 'second': 1,
 'query': 4,
 'relevant': 2,
 'called': 2,
 'based': 1,
 'summarizes': 1,
 'objects': 1,
 'specific': 1,
 'Summarization': 1,
 'systems': 1,
 'able': 1,
 'create': 1,
 'text': 1,
 'summaries': 2,
 'machine': 1,
 'generated': 1,
 'user': 1,
 'needs': 1,
 'An': 1,
 'example': 3,
 'problem': 2,
 'document': 4,
 'attempts': 1,
 'automatically': 3,
 'produce': 1,
 'given': 2,
 'Sometimes': 1,
 'interested': 1,
 'generating': 1,
 'single': 1,
 'source': 2,
 'use': 1,
 'multiple': 1,
 'cluster': 1,
 'articles': 3,
 'topic': 2,
 'This': 2,
 'multi': 1,
 'A': 2,
 'related': 2,
 'application': 2,
 'summarizing': 1,
 'Imagine': 1,
 'system': 3,
 'pulls': 1,
 'w
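
The same counts can be produced with `collections.Counter` from the standard library. The sketch below is an alternative, not the notebook's code, and the `count_words` helper is a hypothetical name. It also lowercases each word first, a deliberate change so that `Summarization` and `summarization` are counted together (the output above lists them separately).

```python
from collections import Counter

def count_words(words):
    """Count how often each word occurs, ignoring case."""
    return Counter(word.lower() for word in words)

sample = ["Summarization", "systems", "create", "summaries", "summarization"]
frequencies = count_words(sample)
print(frequencies["summarization"])  # 2
print(frequencies.most_common(1))    # [('summarization', 2)]
```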

In [None]:
max_frequency = max(word_frequencies.values())
max_frequency

11

In [None]:
# normalize frequencies by dividing each count by the highest count

for key in word_frequencies:
  word_frequencies[key] /= max_frequency

word_frequencies

{'There': 0.09090909090909091,
 'broadly': 0.09090909090909091,
 'types': 0.09090909090909091,
 'extractive': 0.09090909090909091,
 'summarization': 1.0,
 'tasks': 0.09090909090909091,
 'depending': 0.18181818181818182,
 'program': 0.09090909090909091,
 'focuses': 0.18181818181818182,
 'The': 0.18181818181818182,
 'generic': 0.2727272727272727,
 'obtaining': 0.09090909090909091,
 'summary': 0.36363636363636365,
 'abstract': 0.18181818181818182,
 'collection': 0.2727272727272727,
 'documents': 0.18181818181818182,
 'sets': 0.09090909090909091,
 'images': 0.2727272727272727,
 'videos': 0.2727272727272727,
 'news': 0.36363636363636365,
 'stories': 0.09090909090909091,
 'etc': 0.09090909090909091,
 'second': 0.09090909090909091,
 'query': 0.36363636363636365,
 'relevant': 0.18181818181818182,
 'called': 0.18181818181818182,
 'based': 0.09090909090909091,
 'summarizes': 0.09090909090909091,
 'objects': 0.09090909090909091,
 'specific': 0.09090909090909091,
 'Summarization': 0.09090909090909
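
Normalization divides every count by the largest count, so the most frequent word scores 1.0 and everything else falls between 0 and 1. In the output above, `summarization` occurs 11 times, so a word seen once scores 1/11 ≈ 0.09. Here is a minimal sketch with a hypothetical `normalize` helper that returns a new dictionary instead of modifying the original in place; the sample counts are taken from the output above.

```python
def normalize(frequencies):
    """Scale counts so the most frequent word has a score of 1.0."""
    if not frequencies:
        return {}  # nothing to scale; avoids max() failing on an empty dict
    max_frequency = max(frequencies.values())
    return {word: count / max_frequency for word, count in frequencies.items()}

counts = {"summarization": 11, "query": 4, "tasks": 1}
scores = normalize(counts)
print(scores["summarization"])    # 1.0
print(round(scores["query"], 3))  # 0.364
print(round(scores["tasks"], 3))  # 0.091
```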

### Extracting sentence tokens

In [None]:
sentence_tokens = [sent for sent in doc.sents]
sentence_tokens

[,
 There are broadly two types of extractive summarization tasks depending on what the summarization program focuses on.,
 The first is generic summarization, which focuses on obtaining a generic summary or abstract of the collection (whether documents, or sets of images, or videos, news stories etc.).,
 The second is query relevant summarization, sometimes called query-based summarization, which summarizes objects specific to a query.,
 Summarization systems are able to create both query relevant text summaries and generic machine-generated summaries depending on what the user needs.,
 An example of a summarization problem is document summarization, which attempts to automatically produce an abstract from a given document.,
 Sometimes one might be interested in generating a summary from a single source document, while others can use multiple source documents (for example, a cluster of articles on the same topic).,
 This problem is called multi-document summarization.,
 A related appl

### Finding sentence scores
We're going to score each sentence to determine how important it is to the overall text!

***How do we do this?***

Great question! We'll be scoring sentences using the word frequencies dictionary we created earlier. We'll loop through each word in the sentence and add up the normalized frequencies of those words. That sum is the sentence's score.

In [None]:
# score each sentence by adding up the normalized frequencies of the words it contains

sentence_scores = {}
for sent in sentence_tokens:
  for word in sent:
    if word.text.lower() in word_frequencies:
      if sent not in sentence_scores.keys():
        sentence_scores[sent] = word_frequencies[word.text.lower()]
      else:
        sentence_scores[sent] += word_frequencies[word.text.lower()]

sentence_scores

{There are broadly two types of extractive summarization tasks depending on what the summarization program focuses on.: 2.818181818181818,
 The first is generic summarization, which focuses on obtaining a generic summary or abstract of the collection (whether documents, or sets of images, or videos, news stories etc.).: 3.9999999999999987,
 The second is query relevant summarization, sometimes called query-based summarization, which summarizes objects specific to a query.: 3.909090909090909,
 Summarization systems are able to create both query relevant text summaries and generic machine-generated summaries depending on what the user needs.: 3.09090909090909,
 An example of a summarization problem is document summarization, which attempts to automatically produce an abstract from a given document.: 3.9999999999999996,
 Sometimes one might be interested in generating a summary from a single source document, while others can use multiple source documents (for example, a cluster of article
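
The scoring loop can also be tried without spaCy. The sketch below is a simplified stand-in, not the workshop's implementation: it works on plain strings, splits words with a regular expression instead of spaCy's tokenizer, and uses made-up frequencies. The helper name `score_sentence` is hypothetical.

```python
import re

def score_sentence(sentence, word_frequencies):
    """Sum the normalized frequency of every known word in the sentence."""
    words = re.findall(r"[a-z]+", sentence.lower())
    return sum(word_frequencies.get(word, 0.0) for word in words)

# Made-up normalized frequencies, keyed by lowercase word.
word_frequencies = {"summarization": 1.0, "query": 0.5, "news": 0.25}

sentences = [
    "Query summarization answers a query.",
    "News summarization is common.",
    "This sentence has no scored words.",
]
for sentence in sentences:
    print(score_sentence(sentence, word_frequencies), sentence)
```

Two things to notice. First, the keys here are lowercase to match the lowercased lookup; in the notebook the keys keep their original capitalization, so a key such as `The` is never matched by the lowercased lookup. Second, because the score is a plain sum, longer sentences tend to score higher simply because they contain more words.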

### Finding the sentences with the highest scores

In [None]:
from heapq import nlargest

select_length = int(len(sentence_tokens) * 0.3) # keep roughly 30% of the sentences

summary = nlargest(select_length, sentence_scores, key=sentence_scores.get)



In [None]:
final_summary = [sent.text for sent in summary]
final_summary

['An example of a summarization problem is document summarization, which attempts to automatically produce an abstract from a given document.',
 'The first is generic summarization, which focuses on obtaining a generic summary or abstract of the collection (whether documents, or sets of images, or videos, news stories etc.).',
 'The second is query relevant summarization, sometimes called query-based summarization, which summarizes objects specific to a query.',
 'Summarization systems are able to create both query relevant text summaries and generic machine-generated summaries depending on what the user needs.\n']

In [None]:
final_summary = " ".join(final_summary)
final_summary

'An example of a summarization problem is document summarization, which attempts to automatically produce an abstract from a given document. The first is generic summarization, which focuses on obtaining a generic summary or abstract of the collection (whether documents, or sets of images, or videos, news stories etc.). The second is query relevant summarization, sometimes called query-based summarization, which summarizes objects specific to a query. Summarization systems are able to create both query relevant text summaries and generic machine-generated summaries depending on what the user needs.\n'
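
`heapq.nlargest` returns the top-scoring keys ordered from highest to lowest score, which is why the summary above starts with a sentence from the second paragraph of the text. If the original reading order matters, one option (an addition, not part of the workshop code) is to sort the selected sentences by their position in the text. The sentences and scores below are made up, and `top_sentences_in_order` is a hypothetical helper.

```python
from heapq import nlargest

def top_sentences_in_order(sentences, scores, n):
    """Pick the n highest-scoring sentences, then restore document order."""
    best = nlargest(n, range(len(sentences)), key=lambda i: scores[i])
    return [sentences[i] for i in sorted(best)]

sentences = ["First point.", "Minor aside.", "Key finding.", "Closing remark."]
scores = [2.0, 0.5, 3.0, 1.0]

print(nlargest(2, zip(scores, sentences)))
# [(3.0, 'Key finding.'), (2.0, 'First point.')]
print(top_sentences_in_order(sentences, scores, 2))
# ['First point.', 'Key finding.']
```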

### Deployment on Gradio

In [None]:
# This next portion wraps what you've built in one function so it can be deployed on Gradio

def main(text, length):
    doc = nlp(text)
    tokens = [token.text for token in doc]
    cleaned = [word for word in tokens if word not in stopwords and word not in punctuation + '\n']
    word_frequencies = {}

    for word in cleaned:
        if word not in word_frequencies.keys():
            word_frequencies[word] = 1
        else:
            word_frequencies[word] += 1

    max_frequency = max(word_frequencies.values())


    for key in word_frequencies:
        word_frequencies[key] /= max_frequency

    sentence_tokens = [sent for sent in doc.sents]

    sentence_scores = {}
    for sent in sentence_tokens:
        for word in sent:
            if word.text.lower() in word_frequencies:
                if sent not in sentence_scores.keys():
                    sentence_scores[sent] = word_frequencies[word.text.lower()]
                else:
                    sentence_scores[sent] += word_frequencies[word.text.lower()]

    select_length = int(length)  # cast in case the slider value arrives as a float
    summary = nlargest(select_length, sentence_scores, key=sentence_scores.get)
    final_summary = [sent.text for sent in summary]
    final_summary = " ".join(final_summary)
    return final_summary
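
As written, `main` raises a `ValueError` when the text is empty, because `max()` is called on an empty collection, and the slider allows a length of 0, which produces an empty summary. A small guard can sit in front of any summarizer. This is a sketch under assumptions: `safe_summarize` and `fake_summarizer` are hypothetical names, and the fallback messages are arbitrary.

```python
def safe_summarize(summarize_fn, text, length):
    """Validate the inputs before handing them to summarize_fn(text, length)."""
    if not text or not text.strip():
        return "Please enter some text to summarize."
    length = int(length)  # sliders may deliver a float such as 3.0
    if length < 1:
        return "Please choose a summary length of at least 1 sentence."
    return summarize_fn(text, length)

# Stand-in summarizer for demonstration: keeps the first `length` sentences.
def fake_summarizer(text, length):
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    return ". ".join(sentences[:length]) + "."

print(safe_summarize(fake_summarizer, "", 3))
print(safe_summarize(fake_summarizer, "One. Two. Three.", 2.0))  # One. Two.
```

In the notebook, the same guard could wrap `main`, for example by passing `lambda text, length: safe_summarize(main, text, length)` as the interface function.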

In [None]:
gr.Interface(
  fn=main,
  inputs=[gr.Textbox(lines=5, placeholder="Enter the entire text here..."),
          gr.Slider(0, 10, step=1)],
  outputs=["text"],
  theme="huggingface").launch(debug=False, share=True)


Colab notebook detected. To show errors in colab notebook, set `debug=True` in `launch()`
Running on public URL: https://267373fc9816b7b2.gradio.app

This share link expires in 72 hours. For free permanent hosting and GPU upgrades (NEW!), check out Spaces: https://huggingface.co/spaces


(<gradio.routes.App at 0x7f764d11f6d0>,
 'http://127.0.0.1:7860/',
 'https://267373fc9816b7b2.gradio.app')