# Text Summarization with Python
The Text Summarization with Python workshop is a gentle introduction to natural language processing (NLP). In this short workshop, we'll cover the basics of Python and the types of text summarization, then build a real-world text summarization app that you can deploy and share with friends!

[![Workshop Cover](https://i.imgur.com/LfkpgC7.png)](https://www.canva.com/design/DAFOp9FIr4Q/_Op6f8q2fgRXSmLmufETXw/view?utm_content=DAFOp9FIr4Q&utm_campaign=designshare&utm_medium=link2&utm_source=sharebutton)

### Workshop Resources
- [Slide Deck]()
- [Teaching Plan]()
- [YouTube Video Workshop]()
- [Completed Copy of Workshop](https://colab.research.google.com/drive/1kkiPgiJsp8qS9xd64v_JCtf0VDmhZ0Om?usp=sharing)

### Before you begin
1. Make a copy of this Colab notebook by clicking `Save copy to drive`. This keeps the code in your Google Drive. The copy should open in a new tab!
2. Start enjoying the workshop below!

### Installing and importing dependencies

In [None]:
# install gradio!
!pip install gradio

In [3]:
import nltk
import heapq
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from string import punctuation

### Finding a piece of text to summarize

In [4]:
text = """
The Boeing 737 MAX is the fourth generation of the Boeing 737, a narrow-body airliner manufactured by Boeing Commercial Airplanes (BCA), a division of American company Boeing. It succeeds the Boeing 737 Next Generation (NG) and competes with the Airbus A320neo family. The new series was announced on August 30, 2011. It took its maiden flight on January 29, 2016 and was certified by the United States Federal Aviation Administration (FAA) in March 2017. The first delivery was a MAX 8 in May 2017 to Malindo Air, with whom it commenced service on May 22, 2017.

The 737 MAX is based on earlier 737 designs, with more efficient CFM International LEAP-1B engines, aerodynamic changes, including distinctive split-tip winglets, and airframe modifications. The 737 MAX series has been offered in four variants, offering 138 to 204 seats in typical two-class configuration, and a range of 3,300 to 3,850 nautical miles (6,110 to 7,130 km). The 737 MAX 7, MAX 8 (including the 200–seat MAX 200), and MAX 9 are intended to replace the 737-700, -800, and -900 respectively, and a further-stretched 737 MAX 10 is available. As of September 2022, the 737 MAX has 4,166 unfilled orders and 926 deliveries.

The 737 MAX suffered a recurring failure in the Maneuvering Characteristics Augmentation System (MCAS), causing two fatal crashes, Lion Air Flight 610 and Ethiopian Airlines Flight 302, in which 346 people died in total. It was subsequently grounded worldwide from March 2019 to November 2020. The FAA garnered criticism for defending the aircraft and was the last major authority to ground it.[6] Investigations faulted a Boeing cover-up of a defect and lapses in the FAA's certification of the aircraft for flight. Boeing paid US$2.5 billion in penalties and compensation to settle the DOJ's fraud conspiracy case against the company. Further investigations also revealed that the FAA and Boeing had colluded on recertification test flights, attempted to cover up important information and that the FAA had retaliated against whistleblowers.[7]

The FAA cleared the return to service on November 18, 2020, subject to mandated design and training changes. Canadian and European authorities only followed in late January 2021, and Chinese authorities in early December, as over 180 countries out of 195 had lifted the grounding. Over 450 MAX aircraft were awaiting delivery in November 2020; 335 remained by January 2022. Boeing estimated that the backlog would be largely cleared by the end of 2023, after its order book was reduced by almost 1000 aircraft due to cancellations from loss of trust in the aircraft.
"""

In [5]:
text

"\nThe Boeing 737 MAX is the fourth generation of the Boeing 737, a narrow-body airliner manufactured by Boeing Commercial Airplanes (BCA), a division of American company Boeing. It succeeds the Boeing 737 Next Generation (NG) and competes with the Airbus A320neo family. The new series was announced on August 30, 2011. It took its maiden flight on January 29, 2016 and was certified by the United States Federal Aviation Administration (FAA) in March 2017. The first delivery was a MAX 8 in May 2017 to Malindo Air, with whom it commenced service on May 22, 2017.\n\nThe 737 MAX is based on earlier 737 designs, with more efficient CFM International LEAP-1B engines, aerodynamic changes, including distinctive split-tip winglets, and airframe modifications. The 737 MAX series has been offered in four variants, offering 138 to 204 seats in typical two-class configuration, and a range of 3,300 to 3,850 nautical miles (6,110 to 7,130 km). The 737 MAX 7, MAX 8 (including the 200–seat MAX 200), and

## Adding our stopwords
Stopwords are very common words in the English language, such as "a", "the", or "so". They carry little meaning on their own, so we remove them before counting words!
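To make the idea concrete, here is a minimal sketch of stopword filtering on a single sentence. It uses a tiny hand-written stopword set (`TOY_STOPWORDS`, a made-up name for illustration) instead of spaCy's full `STOP_WORDS` list, so it runs without any installs.

```python
# A tiny, hand-picked stopword set for illustration only;
# spaCy's STOP_WORDS contains a few hundred entries.
TOY_STOPWORDS = {"a", "the", "so", "is", "of"}

sentence = "The Boeing 737 MAX is the fourth generation of the Boeing 737"

# Lowercase each word first so that "The" matches the stopword "the".
kept = [w.lower() for w in sentence.split() if w.lower() not in TOY_STOPWORDS]
print(kept)
# ['boeing', '737', 'max', 'fourth', 'generation', 'boeing', '737']
```

Only the content-bearing words survive, which is what we want to count later.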

In [6]:
stopwords = list(STOP_WORDS)
stopwords

['hundred',
 'much',
 'ourselves',
 'the',
 '‘ll',
 'among',
 'full',
 'otherwise',
 'herein',
 'get',
 'indeed',
 'these',
 'nor',
 'therein',
 'without',
 'cannot',
 'nobody',
 'always',
 'n’t',
 'then',
 'by',
 'n‘t',
 'see',
 'other',
 'me',
 'whereafter',
 'no',
 'fifteen',
 "'ll",
 'make',
 'itself',
 'her',
 'four',
 'thence',
 'few',
 '’d',
 'own',
 'now',
 'everyone',
 'beside',
 'top',
 'elsewhere',
 're',
 '’m',
 'anyway',
 'such',
 'he',
 'doing',
 'various',
 'therefore',
 'its',
 'because',
 'nine',
 'your',
 'whenever',
 'six',
 'do',
 'nevertheless',
 'several',
 'whence',
 'of',
 'side',
 'others',
 '’ve',
 'being',
 'through',
 'really',
 'latter',
 'might',
 'most',
 'quite',
 '‘d',
 'which',
 'whereby',
 'besides',
 'can',
 'is',
 'within',
 'often',
 'seems',
 'rather',
 'was',
 'whatever',
 'please',
 'someone',
 'after',
 'who',
 'just',
 'take',
 'and',
 '’re',
 'however',
 'behind',
 "'ve",
 'serious',
 'last',
 'three',
 'for',
 'becoming',
 '’s',
 'whole',
 '

### Loading & using our NLP model
Don't worry too much about understanding this for now. An NLP model is basically a tool that processes a big chunk of text, for example by splitting it into words and sentences!

In [9]:
nlp = spacy.load('en_core_web_sm')
nlp

<spacy.lang.en.English at 0x7f5c30c0c050>

In [10]:
doc = nlp(text)
doc


The Boeing 737 MAX is the fourth generation of the Boeing 737, a narrow-body airliner manufactured by Boeing Commercial Airplanes (BCA), a division of American company Boeing. It succeeds the Boeing 737 Next Generation (NG) and competes with the Airbus A320neo family. The new series was announced on August 30, 2011. It took its maiden flight on January 29, 2016 and was certified by the United States Federal Aviation Administration (FAA) in March 2017. The first delivery was a MAX 8 in May 2017 to Malindo Air, with whom it commenced service on May 22, 2017.

The 737 MAX is based on earlier 737 designs, with more efficient CFM International LEAP-1B engines, aerodynamic changes, including distinctive split-tip winglets, and airframe modifications. The 737 MAX series has been offered in four variants, offering 138 to 204 seats in typical two-class configuration, and a range of 3,300 to 3,850 nautical miles (6,110 to 7,130 km). The 737 MAX 7, MAX 8 (including the 200–seat MAX 200), and MAX

In [15]:
tokens = []
for token in doc:
    tokens.append(token.text)
tokens

['\n',
 'The',
 'Boeing',
 '737',
 'MAX',
 'is',
 'the',
 'fourth',
 'generation',
 'of',
 'the',
 'Boeing',
 '737',
 ',',
 'a',
 'narrow',
 '-',
 'body',
 'airliner',
 'manufactured',
 'by',
 'Boeing',
 'Commercial',
 'Airplanes',
 '(',
 'BCA',
 ')',
 ',',
 'a',
 'division',
 'of',
 'American',
 'company',
 'Boeing',
 '.',
 'It',
 'succeeds',
 'the',
 'Boeing',
 '737',
 'Next',
 'Generation',
 '(',
 'NG',
 ')',
 'and',
 'competes',
 'with',
 'the',
 'Airbus',
 'A320neo',
 'family',
 '.',
 'The',
 'new',
 'series',
 'was',
 'announced',
 'on',
 'August',
 '30',
 ',',
 '2011',
 '.',
 'It',
 'took',
 'its',
 'maiden',
 'flight',
 'on',
 'January',
 '29',
 ',',
 '2016',
 'and',
 'was',
 'certified',
 'by',
 'the',
 'United',
 'States',
 'Federal',
 'Aviation',
 'Administration',
 '(',
 'FAA',
 ')',
 'in',
 'March',
 '2017',
 '.',
 'The',
 'first',
 'delivery',
 'was',
 'a',
 'MAX',
 '8',
 'in',
 'May',
 '2017',
 'to',
 'Malindo',
 'Air',
 ',',
 'with',
 'whom',
 'it',
 'commenced',
 'serv

In [21]:
# keep tokens that are neither stopwords nor punctuation/newline characters, then lowercase them
cleaned = [word.lower() for word in tokens if word not in stopwords and word not in punctuation + '\n']
cleaned

['the',
 'boeing',
 '737',
 'max',
 'fourth',
 'generation',
 'boeing',
 '737',
 'narrow',
 'body',
 'airliner',
 'manufactured',
 'boeing',
 'commercial',
 'airplanes',
 'bca',
 'division',
 'american',
 'company',
 'boeing',
 'it',
 'succeeds',
 'boeing',
 '737',
 'next',
 'generation',
 'ng',
 'competes',
 'airbus',
 'a320neo',
 'family',
 'the',
 'new',
 'series',
 'announced',
 'august',
 '30',
 '2011',
 'it',
 'took',
 'maiden',
 'flight',
 'january',
 '29',
 '2016',
 'certified',
 'united',
 'states',
 'federal',
 'aviation',
 'administration',
 'faa',
 'march',
 '2017',
 'the',
 'delivery',
 'max',
 '8',
 'may',
 '2017',
 'malindo',
 'air',
 'commenced',
 'service',
 'may',
 '22',
 '2017',
 '\n\n',
 'the',
 '737',
 'max',
 'based',
 'earlier',
 '737',
 'designs',
 'efficient',
 'cfm',
 'international',
 'leap-1b',
 'engines',
 'aerodynamic',
 'changes',
 'including',
 'distinctive',
 'split',
 'tip',
 'winglets',
 'airframe',
 'modifications',
 'the',
 '737',
 'max',
 'series',
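Notice that `'the'` and `'\n\n'` still appear in the output above. The stopword check runs before lowercasing, so a capitalized `'The'` is not recognized as a stopword, and a double newline is not the same string as a single `'\n'`. A stricter variant might look like the sketch below. The names `toy_tokens` and `toy_stopwords` are hypothetical stand-ins for the notebook's `tokens` and `stopwords`, chosen so the example runs on its own.

```python
from string import punctuation

# Hypothetical stand-ins for the notebook's `tokens` and `stopwords`.
toy_tokens = ["\n", "The", "737", "MAX", "is", "based", ".", "\n\n", "The", "FAA"]
toy_stopwords = {"the", "is"}
PUNCT = set(punctuation)  # individual punctuation characters

cleaned_strict = [
    tok.lower()
    for tok in toy_tokens
    if tok.lower() not in toy_stopwords  # compare in lowercase
    and tok not in PUNCT                 # drop single punctuation marks
    and tok.strip()                      # drop whitespace-only tokens
]
print(cleaned_strict)
# ['737', 'max', 'based', 'faa']
```

Using this variant in the workshop would change the frequency counts shown in the following cells, so treat it as an optional refinement.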

### Calculating word frequencies
We're going to create a dictionary. Each key is a unique word, and its value is the number of times that word occurs in the entire piece of text.

*Note: We're not including stopwords, as they don't add much to the overall meaning of the text.*
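As a quick illustration of this kind of dictionary, the standard library's `collections.Counter` does the same bookkeeping as a manual counting loop. This is only a sketch on a made-up word list (`toy_cleaned`), not the workshop's data.

```python
from collections import Counter

# Hypothetical cleaned word list, standing in for the notebook's `cleaned`.
toy_cleaned = ["boeing", "737", "max", "boeing", "737", "max", "max", "faa"]

toy_frequencies = Counter(toy_cleaned)
print(toy_frequencies["max"])          # 3
print(toy_frequencies.most_common(1))  # [('max', 3)]
```

A `Counter` behaves like a regular dictionary, so it can be used anywhere a plain word-to-count mapping is expected.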

In [22]:
word_frequencies = {}
for word in cleaned:
    if word not in word_frequencies.keys():
        word_frequencies[word] = 1
    else:
        word_frequencies[word] += 1
word_frequencies

{'the': 9,
 'boeing': 9,
 '737': 11,
 'max': 12,
 'fourth': 1,
 'generation': 2,
 'narrow': 1,
 'body': 1,
 'airliner': 1,
 'manufactured': 1,
 'commercial': 1,
 'airplanes': 1,
 'bca': 1,
 'division': 1,
 'american': 1,
 'company': 2,
 'it': 3,
 'succeeds': 1,
 'next': 1,
 'ng': 1,
 'competes': 1,
 'airbus': 1,
 'a320neo': 1,
 'family': 1,
 'new': 1,
 'series': 2,
 'announced': 1,
 'august': 1,
 '30': 1,
 '2011': 1,
 'took': 1,
 'maiden': 1,
 'flight': 4,
 'january': 3,
 '29': 1,
 '2016': 1,
 'certified': 1,
 'united': 1,
 'states': 1,
 'federal': 1,
 'aviation': 1,
 'administration': 1,
 'faa': 6,
 'march': 2,
 '2017': 3,
 'delivery': 2,
 '8': 2,
 'may': 2,
 'malindo': 1,
 'air': 2,
 'commenced': 1,
 'service': 2,
 '22': 1,
 '\n\n': 3,
 'based': 1,
 'earlier': 1,
 'designs': 1,
 'efficient': 1,
 'cfm': 1,
 'international': 1,
 'leap-1b': 1,
 'engines': 1,
 'aerodynamic': 1,
 'changes': 2,
 'including': 2,
 'distinctive': 1,
 'split': 1,
 'tip': 1,
 'winglets': 1,
 'airframe': 1,
 'mo

In [23]:
max_frequency = max(word_frequencies.values())
max_frequency

12

In [24]:
# normalize each frequency by dividing by the maximum frequency
# (other options include z-score normalization or min-max normalization)
for key in word_frequencies:
    word_frequencies[key] /= max_frequency
word_frequencies

{'the': 0.75,
 'boeing': 0.75,
 '737': 0.9166666666666666,
 'max': 1.0,
 'fourth': 0.08333333333333333,
 'generation': 0.16666666666666666,
 'narrow': 0.08333333333333333,
 'body': 0.08333333333333333,
 'airliner': 0.08333333333333333,
 'manufactured': 0.08333333333333333,
 'commercial': 0.08333333333333333,
 'airplanes': 0.08333333333333333,
 'bca': 0.08333333333333333,
 'division': 0.08333333333333333,
 'american': 0.08333333333333333,
 'company': 0.16666666666666666,
 'it': 0.25,
 'succeeds': 0.08333333333333333,
 'next': 0.08333333333333333,
 'ng': 0.08333333333333333,
 'competes': 0.08333333333333333,
 'airbus': 0.08333333333333333,
 'a320neo': 0.08333333333333333,
 'family': 0.08333333333333333,
 'new': 0.08333333333333333,
 'series': 0.16666666666666666,
 'announced': 0.08333333333333333,
 'august': 0.08333333333333333,
 '30': 0.08333333333333333,
 '2011': 0.08333333333333333,
 'took': 0.08333333333333333,
 'maiden': 0.08333333333333333,
 'flight': 0.3333333333333333,
 'january'
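The normalization step can be checked by hand on a few of the counts shown earlier. The sketch below uses a small dictionary (`toy_counts`, a name introduced here) holding four of those counts and divides each by the largest one.

```python
# Four of the raw counts from the output above.
toy_counts = {"max": 12, "737": 11, "boeing": 9, "fourth": 1}

highest = max(toy_counts.values())  # 12
toy_normalized = {word: count / highest for word, count in toy_counts.items()}

print(toy_normalized["max"])     # 1.0
print(toy_normalized["boeing"])  # 0.75
```

After this step every value lies between 0 and 1, with the most frequent word scoring exactly 1.0.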

### Extracting sentence tokens

In [25]:
sentence_tokens = [sent for sent in doc.sents]
sentence_tokens

[
 The Boeing 737 MAX is the fourth generation of the Boeing 737, a narrow-body airliner manufactured by Boeing Commercial Airplanes (BCA), a division of American company Boeing.,
 It succeeds the Boeing 737 Next Generation (NG) and competes with the Airbus A320neo family.,
 The new series was announced on August 30, 2011.,
 It took its maiden flight on January 29, 2016 and was certified by the United States Federal Aviation Administration (FAA) in March 2017.,
 The first delivery was a MAX 8 in May 2017 to Malindo Air, with whom it commenced service on May 22, 2017.
 ,
 The 737 MAX is based on earlier 737 designs, with more efficient CFM International LEAP-1B engines, aerodynamic changes, including distinctive split-tip winglets, and airframe modifications.,
 The 737 MAX series has been offered in four variants, offering 138 to 204 seats in typical two-class configuration, and a range of 3,300 to 3,850 nautical miles (6,110 to 7,130 km).,
 The 737 MAX 7, MAX 8 (including the 200–seat 

### Finding sentence scores
We're going to score each sentence to determine how important it is to the text as a whole!

***How do we do this?***

Great question! We'll be scoring sentences using the word frequencies dictionary we created earlier. We'll loop through each word in the sentence and add up the frequencies of each word as its score.
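Here is a minimal sketch of that scoring idea on toy data. The frequencies and the pre-split sentences below are made up for illustration, and plain Python lists stand in for spaCy sentence objects.

```python
# Hypothetical normalized frequencies and pre-tokenized sentences.
toy_frequencies = {"max": 1.0, "737": 0.5, "boeing": 0.75, "faa": 0.5}
toy_sentences = [
    ["The", "Boeing", "737", "MAX", "is", "an", "airliner"],
    ["The", "FAA", "cleared", "it"],
]

toy_scores = {}
for words in toy_sentences:
    # Words missing from the dictionary contribute 0 to the score.
    score = sum(toy_frequencies.get(w.lower(), 0.0) for w in words)
    toy_scores[" ".join(words)] = score

print(toy_scores)
# {'The Boeing 737 MAX is an airliner': 2.25, 'The FAA cleared it': 0.5}
```

Because scores are sums, longer sentences tend to score higher, which is one reason this approach often favors long sentences.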

In [26]:
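# e.g. "The 737 max is really cool" is scored as the sum of the frequencies
# of every word in it that appears in word_frequencies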
sentence_scores = {}
for sent in sentence_tokens:
    for word in sent:
        if word.text.lower() in word_frequencies:
            if sent not in sentence_scores.keys():
                sentence_scores[sent] = word_frequencies[word.text.lower()]
            else:
                sentence_scores[sent] += word_frequencies[word.text.lower()]

In [27]:
sentence_scores

{
 The Boeing 737 MAX is the fourth generation of the Boeing 737, a narrow-body airliner manufactured by Boeing Commercial Airplanes (BCA), a division of American company Boeing.: 9.25,
 It succeeds the Boeing 737 Next Generation (NG) and competes with the Airbus A320neo family.: 4.166666666666666,
 The new series was announced on August 30, 2011.: 1.333333333333333,
 It took its maiden flight on January 29, 2016 and was certified by the United States Federal Aviation Administration (FAA) in March 2017.: 3.3333333333333335,
 The first delivery was a MAX 8 in May 2017 to Malindo Air, with whom it commenced service on May 22, 2017.
 : 3.9999999999999996,
 The 737 MAX is based on earlier 737 designs, with more efficient CFM International LEAP-1B engines, aerodynamic changes, including distinctive split-tip winglets, and airframe modifications.: 5.166666666666664,
 The 737 MAX series has been offered in four variants, offering 138 to 204 seats in typical two-class configuration, and a rang

### Finding the sentences with the highest scores

In [28]:
from heapq import nlargest

In [33]:
select_length = 2

In [34]:
summary = nlargest(select_length, sentence_scores, key=sentence_scores.get)

In [35]:
final_summary = [sent.text for sent in summary]
final_summary = " ".join(final_summary)
final_summary

'The 737 MAX 7, MAX 8 (including the 200–seat MAX 200), and MAX 9 are intended to replace the 737-700, -800, and -900 respectively, and a further-stretched 737 MAX 10 is available. \nThe Boeing 737 MAX is the fourth generation of the Boeing 737, a narrow-body airliner manufactured by Boeing Commercial Airplanes (BCA), a division of American company Boeing.'
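To see what `nlargest` is doing here, consider this small sketch with made-up sentence labels and scores. It also shows one possible way to put the selected sentences back into their original order, since `nlargest` returns them ranked by score (which is why the summary above starts with a sentence from the second paragraph).

```python
from heapq import nlargest

# Hypothetical scores; dictionaries remember insertion (document) order.
toy_scores = {"sentence A": 9.25, "sentence B": 4.17, "sentence C": 11.5}

# Iterating a dict yields its keys; `key=toy_scores.get` ranks them by value.
top_two = nlargest(2, toy_scores, key=toy_scores.get)
print(top_two)  # ['sentence C', 'sentence A']

# Optional: restore document order for a more readable summary.
order = list(toy_scores)
in_document_order = sorted(top_two, key=order.index)
print(in_document_order)  # ['sentence A', 'sentence C']
```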

### Deployment on Gradio

In [36]:
!pip install gradio


In [43]:
def main(text, length):
    doc = nlp(text)
    tokens = [token.text for token in doc]
    # lowercase the kept words so they match the lowercase lookups used when scoring sentences
    cleaned = [word.lower() for word in tokens if word not in stopwords and word not in punctuation + '\n']
    word_frequencies = {}
    for word in cleaned:
        if word not in word_frequencies.keys():
            word_frequencies[word] = 1
        else:
            word_frequencies[word] += 1
    max_frequency = max(word_frequencies.values())
    for key in word_frequencies:
        word_frequencies[key] /= max_frequency  
    
    sentence_tokens = [sent for sent in doc.sents]

    # using the score of each word, give each sentence a score based on how often its words appeared

    sentence_scores = {}
    for sent in sentence_tokens:
        for word in sent:
            if word.text.lower() in word_frequencies:
                if sent not in sentence_scores.keys():
                    sentence_scores[sent] = word_frequencies[word.text.lower()]
                else:
                    sentence_scores[sent] += word_frequencies[word.text.lower()]

    select_length = int(length)  # number of sentences to keep in the summary

    summary = nlargest(select_length, sentence_scores, key=sentence_scores.get)
    final_summary = [sent.text for sent in summary]
    final_summary = " ".join(final_summary)
    return final_summary
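If you would rather keep a fraction of the sentences than a fixed count, one option is a small helper like the one below. This is only a sketch: `sentences_to_keep` and its `ratio` parameter are hypothetical names, not part of the workshop code, and the 0.3 default is an arbitrary choice.

```python
import math

def sentences_to_keep(num_sentences, ratio=0.3):
    """Return how many sentences to keep for a given ratio (always at least 1)."""
    return max(1, math.ceil(num_sentences * ratio))

print(sentences_to_keep(19))  # 6
print(sentences_to_keep(1))   # 1
```

The result could be passed to `main` as its `length` argument, for example after counting the sentences in `doc.sents`.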

In [44]:
text = """
The Boeing 737 MAX is the fourth generation of the Boeing 737, a narrow-body airliner manufactured by Boeing Commercial Airplanes (BCA), a division of American company Boeing. It succeeds the Boeing 737 Next Generation (NG) and competes with the Airbus A320neo family. The new series was announced on August 30, 2011. It took its maiden flight on January 29, 2016 and was certified by the United States Federal Aviation Administration (FAA) in March 2017. The first delivery was a MAX 8 in May 2017 to Malindo Air, with whom it commenced service on May 22, 2017.

The 737 MAX is based on earlier 737 designs, with more efficient CFM International LEAP-1B engines, aerodynamic changes, including distinctive split-tip winglets, and airframe modifications. The 737 MAX series has been offered in four variants, offering 138 to 204 seats in typical two-class configuration, and a range of 3,300 to 3,850 nautical miles (6,110 to 7,130 km). The 737 MAX 7, MAX 8 (including the 200–seat MAX 200), and MAX 9 are intended to replace the 737-700, -800, and -900 respectively, and a further-stretched 737 MAX 10 is available. As of September 2022, the 737 MAX has 4,166 unfilled orders and 926 deliveries.

The 737 MAX suffered a recurring failure in the Maneuvering Characteristics Augmentation System (MCAS), causing two fatal crashes, Lion Air Flight 610 and Ethiopian Airlines Flight 302, in which 346 people died in total. It was subsequently grounded worldwide from March 2019 to November 2020. The FAA garnered criticism for defending the aircraft and was the last major authority to ground it.[6] Investigations faulted a Boeing cover-up of a defect and lapses in the FAA's certification of the aircraft for flight. Boeing paid US$2.5 billion in penalties and compensation to settle the DOJ's fraud conspiracy case against the company. Further investigations also revealed that the FAA and Boeing had colluded on recertification test flights, attempted to cover up important information and that the FAA had retaliated against whistleblowers.[7]

The FAA cleared the return to service on November 18, 2020, subject to mandated design and training changes. Canadian and European authorities only followed in late January 2021, and Chinese authorities in early December, as over 180 countries out of 195 had lifted the grounding. Over 450 MAX aircraft were awaiting delivery in November 2020; 335 remained by January 2022. Boeing estimated that the backlog would be largely cleared by the end of 2023, after its order book was reduced by almost 1000 aircraft due to cancellations from loss of trust in the aircraft
"""

In [47]:
main(text, 1)

'The 737 MAX 7, MAX 8 (including the 200–seat MAX 200), and MAX 9 are intended to replace the 737-700, -800, and -900 respectively, and a further-stretched 737 MAX 10 is available.'

In [48]:
import gradio as gr

In [49]:
gr.Interface(
    fn=main,
    inputs=[
        # gr.inputs.* is deprecated; components are imported directly from gradio
        gr.Textbox(lines=5, placeholder="Enter the text below..."),
        gr.Slider(0, 10, step=1)
    ],
    outputs=["text"],
    theme="huggingface"
).launch(debug=False, share=True)



Colab notebook detected. To show errors in colab notebook, set `debug=True` in `launch()`
Running on public URL: https://c0950e2dd297a1a9.gradio.app

This share link expires in 72 hours. For free permanent hosting and GPU upgrades (NEW!), check out Spaces: https://huggingface.co/spaces


(<gradio.routes.App at 0x7f5c31f5b450>,
 'http://127.0.0.1:7860/',
 'https://c0950e2dd297a1a9.gradio.app')