# **SUMMER CAMP 2022**
## DAY 4: TEXT SUMMARIZATION

## What we are going to do in this project

- Data collection 
- Text cleaning 

- Extractive text summary
  - Sentence Tokenization
  - Word frequency table
  - Sentence scoring
  - Summarization


- Abstractive text summary
  - Introduction to the Hugging Face Transformers library

## Learning points 
- Learning how to scrape data from the web using requests and Beautiful Soup
- Preprocessing text data
- Using models from the Hugging Face (HF) Transformers library


### Step 0: Import and configure modules

In [7]:
# !pip install transformers
# !pip install beautifulsoup4

# install spacy
# https://spacy.io/usage


In [8]:
from bs4 import BeautifulSoup
import requests

### Step 1: Gather text data through web scraping

In [40]:
url = "https://en.wikipedia.org/wiki/Amazon_(company)"
r = requests.get(url)

In [41]:
r.text



In [42]:
#!pip install lxml
# name the parser explicitly; without it bs4 guesses one and emits a warning
soup = BeautifulSoup(r.text, "html.parser")
results = soup.find_all("p")

results

[<p class="mw-empty-elt">
 </p>,
 <p><b>Amazon.com, Inc.</b><sup class="reference" id="cite_ref-10K_1-1"><a href="#cite_note-10K-1">[1]</a></sup> (<span class="rt-commentedText nowrap"><span class="IPA nopopups noexcerpt" lang="en-fonipa"><a href="/wiki/Help:IPA/English" title="Help:IPA/English">/<span style="border-bottom:1px dotted"><span title="/ˈ/: primary stress follows">ˈ</span><span title="/æ/: 'a' in 'bad'">æ</span><span title="'m' in 'my'">m</span><span title="/ə/: 'a' in 'about'">ə</span><span title="'z' in 'zoom'">z</span><span title="/ɒ/: 'o' in 'body'">ɒ</span><span title="'n' in 'nigh'">n</span></span>/</a></span></span> <a href="/wiki/Help:Pronunciation_respelling_key" title="Help:Pronunciation respelling key"><i title="English pronunciation respelling"><span style="font-size:90%">AM</span>-ə-zon</i></a>) is an American <a href="/wiki/Multinational_corporation" title="Multinational corporation">multinational</a> <a href="/wiki/Technology_company" title="Technology compan

In [43]:
# each item in results is a <p> element; join the text of all paragraphs
text = "".join(paragraph.get_text() for paragraph in results)

text



### Step 2: Text cleaning

In [44]:
import re
# remove numeric reference/citation markers such as [1] from the Wikipedia text
# (raw string, so the backslashes reach the regex engine unchanged)
pattern = r"\[\d*?\]"

In [45]:
text = re.sub(pattern, '', text)

In [46]:
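# str.replace returns a new string, so assign the result back;
# a space (not "") keeps the last word of one paragraph apart from the first word of the next
text = text.replace("\n", " ")
text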
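The pattern above only targets numeric markers such as `[1]`. Tokens like `'common.[citation'` and `'pattern,[a'` in the outputs further down suggest that other bracketed notes survive this step. The following is a minimal sketch of a slightly broader cleaner, not the notebook's method: the helper name `clean_wiki_text`, the `aggressive` flag, and the 20-character limit are all assumptions made for illustration.

```python
import re

# Numeric citation markers only, e.g. [1] or [23] (same idea as the notebook's pattern).
NUMERIC_CITATION = re.compile(r"\[\d+\]")

# Assumption: also drop short bracketed notes such as [a] or [citation needed].
# This is broader and can remove legitimate bracketed text, e.g. "[zhi pai]".
ANY_SHORT_BRACKET = re.compile(r"\[[^\[\]]{1,20}\]")


def clean_wiki_text(raw, aggressive=False):
    """Hypothetical helper: strip citation markers and collapse whitespace."""
    pattern = ANY_SHORT_BRACKET if aggressive else NUMERIC_CITATION
    without_markers = pattern.sub("", raw)
    # Replace newlines and runs of spaces with a single space.
    return re.sub(r"\s+", " ", without_markers).strip()


sample = "Amazon is a company.[1]\nIt sells books.[citation needed]"
print(clean_wiki_text(sample))                   # numeric markers removed only
print(clean_wiki_text(sample, aggressive=True))  # all short bracketed notes removed
```

The aggressive variant trades precision for recall, so it is worth checking a few cleaned paragraphs by eye before using it on a whole article.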



## **PART 1**
### EXTRACTIVE TEXT SUMMARY

### Step 1: Sentence Tokenization 

In [16]:
# text = """
# A car (or automobile) is a wheeled motor vehicle that is used for transportation. Most definitions of cars say that they run primarily on roads, seat one to eight people, have four wheels, and mainly transport people instead of goods.
# The year 1886 is regarded as the birth year of the car when German inventor Carl Benz patented his Benz Patent-Motorwagen. Cars became widely available during the 20th century. One of the first cars affordable by the masses was the 1908 Model T, an American car manufactured by the Ford Motor Company. Cars were rapidly adopted in the US, where they replaced animal-drawn carriages and carts.In Europe and other parts of the world, demand for automobiles did not increase until after World War II. The car is considered an essential part of the developed economy.
# Cars have controls for driving, parking, passenger comfort, and a variety of lights. Over the decades, additional features and controls have been added to vehicles, making them progressively more complex. These include rear-reversing cameras, air conditioning, navigation systems, and in-car entertainment. Most cars in use in the early 2020s are propelled by an internal combustion engine, fueled by the combustion of fossil fuels. Electric cars, which were invented early in the history of the car, became commercially available in the 2000s and are predicted to cost less to buy than gasoline cars before 2025. The transition from fossil fuels to electric cars features prominently in most climate change mitigation scenarios, such as Project Drawdown's 100 actionable solutions for climate change.
# There are costs and benefits to car use. The costs to the individual include acquiring the vehicle, interest payments (if the car is financed), repairs and maintenance, fuel, depreciation, driving time, parking fees, taxes, and insurance. The costs to society include maintaining roads, land use, road congestion, air pollution, public health, healthcare, and disposing of the vehicle at the end of its life. Traffic collisions are the largest cause of injury-related deaths worldwide.
# """

In [17]:
# !pip install -U pip setuptools wheel
# !pip install -U 'spacy[apple]'
# !python -m spacy download en_core_web_sm
import spacy
from string import punctuation
sp = spacy.load('en_core_web_sm')

# What are stopwords? English has many words such as “I”, “the” and “you” that appear very frequently in text but add little information for NLP tasks such as this one.
# These words are called stopwords, and removing them is a common text preprocessing step.

In [18]:
# spaCy's default English stopwords (stored as a set), shown here as a list
all_stopwords = sp.Defaults.stop_words
list(all_stopwords)

['i',
 'of',
 'its',
 'when',
 'take',
 'during',
 'back',
 'everyone',
 'moreover',
 'and',
 'fifteen',
 'put',
 'could',
 'also',
 'show',
 'but',
 'see',
 'almost',
 'thereupon',
 '’m',
 'due',
 "'ll",
 'most',
 'less',
 'never',
 'down',
 'beforehand',
 'being',
 'forty',
 'three',
 "n't",
 'what',
 'elsewhere',
 "'ve",
 'her',
 'am',
 'a',
 'go',
 'about',
 'on',
 '‘d',
 'although',
 '’s',
 'hundred',
 'anyone',
 'call',
 'he',
 'into',
 'unless',
 'hereafter',
 'some',
 'very',
 'top',
 'else',
 'whole',
 're',
 'nor',
 'front',
 'thru',
 'hers',
 'my',
 '’d',
 'none',
 'how',
 '’re',
 'nine',
 'more',
 'herein',
 'thereafter',
 'any',
 'much',
 'someone',
 'many',
 'again',
 'nowhere',
 'themselves',
 'least',
 'are',
 'around',
 'mostly',
 'myself',
 'seem',
 'anyway',
 'enough',
 'sometimes',
 'whither',
 'may',
 'alone',
 'name',
 'have',
 'eight',
 'ever',
 'mine',
 'under',
 'if',
 'empty',
 'before',
 'beyond',
 'whence',
 'last',
 'too',
 'up',
 'serious',
 'fifty',
 'our
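To make the effect of stopword removal concrete, here is a small sketch that does not need spaCy. The tiny `DEMO_STOPWORDS` set and the helper `remove_stopwords` are assumptions for illustration only; the notebook itself relies on the much larger `sp.Defaults.stop_words`.

```python
# Assumption: a tiny hand-written stopword set for illustration only;
# the notebook uses spaCy's much larger sp.Defaults.stop_words.
DEMO_STOPWORDS = {"i", "the", "you", "is", "a", "of", "and"}


def remove_stopwords(words, stopwords=DEMO_STOPWORDS):
    """Keep only the words whose lowercase form is not a stopword."""
    return [w for w in words if w.lower() not in stopwords]


words = "The deck is a set of playing cards".split()
print(remove_stopwords(words))  # ['deck', 'set', 'playing', 'cards']
```

Comparing in lowercase matters because the stopword list is lowercase while sentence-initial words such as "The" are capitalized.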

In [19]:
doc = sp(text)

In [20]:
# tokenizing the original text
# it returns stopwords as well
tokens = [token.text for token in doc]
tokens

['A',
 'playing',
 'card',
 'is',
 'a',
 'piece',
 'of',
 'specially',
 'prepared',
 'card',
 'stock',
 ',',
 'heavy',
 'paper',
 ',',
 'thin',
 'cardboard',
 ',',
 'plastic',
 '-',
 'coated',
 'paper',
 ',',
 'cotton',
 '-',
 'paper',
 'blend',
 ',',
 'or',
 'thin',
 'plastic',
 'that',
 'is',
 'marked',
 'with',
 'distinguishing',
 'motifs',
 '.',
 'Often',
 'the',
 'front',
 '(',
 'face',
 ')',
 'and',
 'back',
 'of',
 'each',
 'card',
 'has',
 'a',
 'finish',
 'to',
 'make',
 'handling',
 'easier',
 '.',
 'They',
 'are',
 'most',
 'commonly',
 'used',
 'for',
 'playing',
 'card',
 'games',
 ',',
 'and',
 'are',
 'also',
 'used',
 'in',
 'magic',
 'tricks',
 ',',
 'cardistry',
 ',',
 'card',
 'throwing',
 ',',
 'and',
 'card',
 'houses',
 ';',
 'cards',
 'may',
 'also',
 'be',
 'collected',
 '.',
 'Some',
 'patterns',
 'of',
 'Tarot',
 'playing',
 'card',
 'are',
 'also',
 'used',
 'for',
 'divination',
 ',',
 'although',
 'bespoke',
 'cards',
 'for',
 'this',
 'use',
 'are',
 'more

In [21]:
# adding a new line character to the punctuation
punctuation = punctuation + "\n"
list(punctuation)

['!',
 '"',
 '#',
 '$',
 '%',
 '&',
 "'",
 '(',
 ')',
 '*',
 '+',
 ',',
 '-',
 '.',
 '/',
 ':',
 ';',
 '<',
 '=',
 '>',
 '?',
 '@',
 '[',
 '\\',
 ']',
 '^',
 '_',
 '`',
 '{',
 '|',
 '}',
 '~',
 '\n']

### Step 2: Word frequency table

In [22]:
word_frequencies = {}
for word in doc:
  lowered = word.text.lower()
  # skip stopwords and punctuation (both compared in lowercase)
  if lowered in all_stopwords or lowered in punctuation:
    continue
  # note: keys keep the original casing, so 'Playing' and 'playing' are counted separately
  word_frequencies[word.text] = word_frequencies.get(word.text, 0) + 1

word_frequencies

{'playing': 30,
 'card': 49,
 'piece': 1,
 'specially': 1,
 'prepared': 1,
 'stock': 2,
 'heavy': 1,
 'paper': 6,
 'thin': 2,
 'cardboard': 1,
 'plastic': 2,
 'coated': 2,
 'cotton': 1,
 'blend': 1,
 'marked': 1,
 'distinguishing': 1,
 'motifs': 2,
 'face': 3,
 'finish': 2,
 'handling': 2,
 'easier': 1,
 'commonly': 5,
 'games': 11,
 'magic': 1,
 'tricks': 1,
 'cardistry': 2,
 'throwing': 1,
 'houses': 1,
 'cards': 95,
 'collected': 3,
 'patterns': 7,
 'Tarot': 2,
 'divination': 1,
 'bespoke': 1,
 'use': 6,
 'common.[citation': 1,
 'needed': 2,
 'Playing': 11,
 'typically': 1,
 'palm': 1,
 'sized': 1,
 'convenient': 1,
 'usually': 6,
 'sold': 4,
 'set': 3,
 'deck': 18,
 'pack': 9,
 'common': 6,
 'type': 1,
 'found': 6,
 'French': 15,
 'suited': 4,
 'standard': 3,
 '52': 7,
 'design': 5,
 'English': 2,
 'pattern,[a': 1,
 'followed': 3,
 'Belgian': 1,
 'Genoese': 1,
 'pattern': 3,
 'countries': 3,
 'traditional': 2,
 'types': 2,
 'including': 5,
 'German': 6,
 'Italian': 4,
 'Spanish': 3
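In the table above, `'Playing'` and `'playing'` appear as separate keys because the keys keep their original casing, while the sentence-scoring step later looks words up in lowercase. One possible alternative, sketched here under stated assumptions, is to count lowercase forms with `collections.Counter`. The regex tokenizer and the small `DEMO_STOPWORDS` set are stand-ins for spaCy so that the example runs on its own; `word_frequency_table` is a hypothetical helper name.

```python
import re
import string
from collections import Counter

# Assumption: a tiny stopword set and a regex tokenizer stand in for spaCy here.
DEMO_STOPWORDS = {"the", "a", "is", "of", "are", "in", "and"}
PUNCTUATION = set(string.punctuation)


def word_frequency_table(text, stopwords=DEMO_STOPWORDS):
    """Count lowercase word forms, skipping stopwords and punctuation."""
    # Words become one token each; every punctuation character is its own token.
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())
    return Counter(
        tok for tok in tokens
        if tok not in stopwords and tok not in PUNCTUATION
    )


freqs = word_frequency_table("Playing cards are cards. The playing card is a card.")
print(freqs["playing"], freqs["cards"], freqs["card"])  # 2 2 2
```

With lowercase keys, a capitalized word at the start of a sentence contributes to the same count as its lowercase form, and every lookup in the scoring step can find it.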

In [23]:
max_freq = max(word_frequencies.values())
max_freq

95

In [24]:
for word in word_frequencies.keys():
  word_frequencies[word] = word_frequencies[word]/max_freq

word_frequencies

{'playing': 0.3157894736842105,
 'card': 0.5157894736842106,
 'piece': 0.010526315789473684,
 'specially': 0.010526315789473684,
 'prepared': 0.010526315789473684,
 'stock': 0.021052631578947368,
 'heavy': 0.010526315789473684,
 'paper': 0.06315789473684211,
 'thin': 0.021052631578947368,
 'cardboard': 0.010526315789473684,
 'plastic': 0.021052631578947368,
 'coated': 0.021052631578947368,
 'cotton': 0.010526315789473684,
 'blend': 0.010526315789473684,
 'marked': 0.010526315789473684,
 'distinguishing': 0.010526315789473684,
 'motifs': 0.021052631578947368,
 'face': 0.031578947368421054,
 'finish': 0.021052631578947368,
 'handling': 0.021052631578947368,
 'easier': 0.010526315789473684,
 'commonly': 0.05263157894736842,
 'games': 0.11578947368421053,
 'magic': 0.010526315789473684,
 'tricks': 0.010526315789473684,
 'cardistry': 0.021052631578947368,
 'throwing': 0.010526315789473684,
 'houses': 0.010526315789473684,
 'cards': 1.0,
 'collected': 0.031578947368421054,
 'patterns': 0.073
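The normalization step divides every count by the largest count, so the most frequent word scores 1.0 and all other words fall between 0 and 1. A short worked check using three counts copied from the output above (`'cards': 95`, `'card': 49`, `'playing': 30`):

```python
# Counts copied from the notebook output above.
counts = {"playing": 30, "card": 49, "cards": 95}

max_freq = max(counts.values())
# A dict comprehension builds a new table instead of updating it in place.
normalized = {word: count / max_freq for word, count in counts.items()}

print(normalized["cards"])              # 1.0
print(round(normalized["card"], 4))     # 0.5158
print(round(normalized["playing"], 4))  # 0.3158
```

These values match the normalized entries shown in the notebook output for the same three words.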

In [25]:
sentence_tokens = [sent for sent in doc.sents]
sentence_tokens

[A playing card is a piece of specially prepared card stock, heavy paper, thin cardboard, plastic-coated paper, cotton-paper blend, or thin plastic that is marked with distinguishing motifs.,
 Often the front (face) and back of each card has a finish to make handling easier.,
 They are most commonly used for playing card games, and are also used in magic tricks, cardistry, card throwing, and card houses; cards may also be collected.,
 Some patterns of Tarot playing card are also used for divination, although bespoke cards for this use are more common.[citation needed],
 Playing cards are typically palm-sized for convenient handling, and usually are sold together in a set as a deck of cards or pack of cards.,
 The most common type of playing card is that found in the French-suited, standard 52-card pack, of which the most common design is the English pattern,[a] followed by the Belgian-Genoese pattern.,
 However, many countries use other, traditional types of playing card, including tho

In [26]:
sentence_scores = {}
for sent in sentence_tokens:
  for word in sent: 
    if word.text.lower() in word_frequencies.keys():
      if sent not in sentence_scores.keys():
        sentence_scores[sent] = word_frequencies[word.text.lower()]
      else: 
        sentence_scores[sent] += word_frequencies[word.text.lower()]

sentence_scores

{A playing card is a piece of specially prepared card stock, heavy paper, thin cardboard, plastic-coated paper, cotton-paper blend, or thin plastic that is marked with distinguishing motifs.: 1.7789473684210535,
 Often the front (face) and back of each card has a finish to make handling easier.: 0.6,
 They are most commonly used for playing card games, and are also used in magic tricks, cardistry, card throwing, and card houses; cards may also be collected.: 3.1263157894736846,
 Some patterns of Tarot playing card are also used for divination, although bespoke cards for this use are more common.[citation needed]: 2.073684210526316,
 Playing cards are typically palm-sized for convenient handling, and usually are sold together in a set as a deck of cards or pack of cards.: 3.8,
 The most common type of playing card is that found in the French-suited, standard 52-card pack, of which the most common design is the English pattern,[a] followed by the Belgian-Genoese pattern.: 1.9157894736842
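Each sentence score above is the sum of the normalized frequencies of its words, which tends to favor long sentences. The sketch below reproduces that sum on plain Python lists and adds an optional length normalization. The `normalize_by_length` flag is an assumption added for comparison and is not part of the notebook. Unlike the notebook's loop, sentences without any known word get a score of 0.0 here instead of being left out.

```python
def score_sentences(sentences, word_scores, normalize_by_length=False):
    """Sum the score of every known word in each sentence.

    sentences: list of sentences, each a list of lowercase words.
    word_scores: dict mapping word -> normalized frequency.
    Returns a dict mapping sentence position -> score.
    """
    scores = {}
    for index, words in enumerate(sentences):
        total = sum(word_scores.get(w, 0.0) for w in words)
        if normalize_by_length and words:
            total /= len(words)  # assumption: average instead of sum
        scores[index] = total
    return scores


# Hypothetical word scores and two toy sentences.
word_scores = {"cards": 1.0, "card": 0.5, "deck": 0.25}
sentences = [
    ["cards", "deck"],
    ["card", "card", "card", "deck", "table"],
]
print(score_sentences(sentences, word_scores))
print(score_sentences(sentences, word_scores, normalize_by_length=True))
```

With plain sums the longer second sentence wins (1.75 against 1.25); with length normalization the shorter first sentence ranks higher.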

In [27]:
from heapq import nlargest

In [28]:
select_length = int(len(sentence_tokens)*0.3)
select_length

49

In [29]:
# pick the highest-scoring sentences (returned in score order, not document order)
summary = nlargest(select_length, sentence_scores, key=sentence_scores.get)
final_summary = [sent.text for sent in summary]
final_text = " ".join(final_summary)
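`nlargest` returns the selected sentences ordered by score, so the summary can jump around the article. A possible refinement, sketched here with sentence positions as keys, is to sort the selected positions so the summary follows the original order. The helper `select_top_sentences`, the placeholder sentence texts, and the `max(1, ...)` guard are assumptions for illustration.

```python
from heapq import nlargest


def select_top_sentences(sentence_scores, ratio=0.3):
    """Return positions of the top-scoring sentences, restored to document order."""
    count = max(1, int(len(sentence_scores) * ratio))  # keep at least one sentence
    top_by_score = nlargest(count, sentence_scores, key=sentence_scores.get)
    return sorted(top_by_score)


# Scores rounded from the notebook output above, keyed by sentence position.
sentence_scores = {0: 1.78, 1: 0.60, 2: 3.13, 3: 2.07, 4: 3.80}
sentences = ["S0.", "S1.", "S2.", "S3.", "S4."]  # placeholder sentence texts

picked = select_top_sentences(sentence_scores, ratio=0.4)
print(picked)                                  # [2, 4]
print(" ".join(sentences[i] for i in picked))  # S2. S4.
```

The guard matters for very short texts, where `int(len(...) * 0.3)` would otherwise be 0 and produce an empty summary.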

### Step 3: Summarization

In [30]:
final_text

'Michael Dummett speculated that Mamluk cards may have descended from an earlier deck which consisted of 48 cards divided into four suits each with ten pip cards and two court cards.\n Playing cards are typically palm-sized for convenient handling, and usually are sold together in a set as a deck of cards or pack of cards.\n The earliest dated instance of a game involving cards occurred on 17 July 1294 when "Yan Sengzhu and Zheng Pig-Dog were caught playing cards [zhi pai] and that wood blocks for printing them had been impounded, together with nine of the actual cards. Later, Unicode 7.0 added the 52 cards of the modern French pack, plus 4 knights, and a character for "Playing Card Back" and black, red and white jokers, in the Playing Cards block (U+1F0A0–1F0FF).\n Every suit contains twelve cards with the top two usually being the court cards of king and vizier and the bottom ten being pip cards. Cards may also be produced for trading card sets or collectible card games, which can co

In [31]:
final_text = final_text.replace("\n", "")
final_text

'Michael Dummett speculated that Mamluk cards may have descended from an earlier deck which consisted of 48 cards divided into four suits each with ten pip cards and two court cards. Playing cards are typically palm-sized for convenient handling, and usually are sold together in a set as a deck of cards or pack of cards. The earliest dated instance of a game involving cards occurred on 17 July 1294 when "Yan Sengzhu and Zheng Pig-Dog were caught playing cards [zhi pai] and that wood blocks for printing them had been impounded, together with nine of the actual cards. Later, Unicode 7.0 added the 52 cards of the modern French pack, plus 4 knights, and a character for "Playing Card Back" and black, red and white jokers, in the Playing Cards block (U+1F0A0–1F0FF). Every suit contains twelve cards with the top two usually being the court cards of king and vizier and the bottom ten being pip cards. Cards may also be produced for trading card sets or collectible card games, which can comprise

In [32]:
doc

A playing card is a piece of specially prepared card stock, heavy paper, thin cardboard, plastic-coated paper, cotton-paper blend, or thin plastic that is marked with distinguishing motifs. Often the front (face) and back of each card has a finish to make handling easier. They are most commonly used for playing card games, and are also used in magic tricks, cardistry, card throwing, and card houses; cards may also be collected. Some patterns of Tarot playing card are also used for divination, although bespoke cards for this use are more common.[citation needed] Playing cards are typically palm-sized for convenient handling, and usually are sold together in a set as a deck of cards or pack of cards.
The most common type of playing card is that found in the French-suited, standard 52-card pack, of which the most common design is the English pattern,[a] followed by the Belgian-Genoese pattern. However, many countries use other, traditional types of playing card, including those that are G

## **PART 2**
### ABSTRACTIVE TEXT SUMMARY USING HUGGING FACE TRANSFORMERS LIBRARY 🤗

In [33]:
from transformers import pipeline 

In [34]:
summarizer = pipeline("summarization")

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 (https://huggingface.co/sshleifer/distilbart-cnn-12-6)


In [47]:
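# truncation=True cuts the input down to the model's maximum input length,
# so for a long article only the beginning is summarized
hf_summary = summarizer(text, max_length=500, min_length=100, do_sample=False, truncation=True)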

In [49]:
hf_summary[0]['summary_text']

' Amazon is one of the Big Five American information technology companies, alongside Alphabet, Apple, Meta, and Microsoft . As of 2021, it is the world\'s largest online retailer and marketplace, smart speaker provider, cloud computing service through AWS, live-streaming service through Twitch, and more . It is the second-largest private employer in the United States . Amazon has been criticized for customer data collection practices, a toxic work culture, tax avoidance, and anti-competitive behavior . It has been referred to as "one of the most influential economic and cultural forces in the world"'
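Because the call uses `truncation=True`, text beyond the model's input limit is dropped, which fits the summary above covering only the opening of the article. One common workaround is to summarize the text in chunks and join the partial summaries. The sketch below shows only the chunking logic: `summarize_fn` is a hypothetical callable, `fake_summarizer` is a stand-in so the code runs without downloading a model, and a word budget is used as a rough proxy for the model's token limit (an assumption, since tokens and words are not the same).

```python
def chunk_by_words(text, max_words=400):
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [
        " ".join(words[start:start + max_words])
        for start in range(0, len(words), max_words)
    ]


def summarize_long_text(text, summarize_fn, max_words=400):
    """Summarize each chunk with summarize_fn and join the partial summaries."""
    return " ".join(summarize_fn(chunk) for chunk in chunk_by_words(text, max_words))


# Stand-in summarizer so the sketch runs without a model:
# it simply keeps the first three words of each chunk.
def fake_summarizer(chunk):
    return " ".join(chunk.split()[:3]) + "."


long_text = " ".join(f"w{i}" for i in range(10))
print(chunk_by_words(long_text, max_words=4))
print(summarize_long_text(long_text, fake_summarizer, max_words=4))
```

With the notebook's pipeline, `summarize_fn` could be something like `lambda chunk: summarizer(chunk, max_length=130, min_length=30, do_sample=False, truncation=True)[0]['summary_text']`, where the length values are illustrative assumptions. Splitting on word counts can cut a sentence in half, so splitting on sentence boundaries would be a natural next step.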