# **SUMMER CAMP 2022**
## DAY 4: TEXT SUMMARIZATION

## What we are going to do in this project

- Data collection 
- Text cleaning 

- Extractive text summary
  - Sentence Tokenization
  - Word frequency table
  - Sentence scoring
  - Summarization


- Abstractive text summary
  - Introduction to the Hugging Face Transformers library

## Learning points 
- Scraping data from the web with requests and Beautiful Soup
- Preprocessing text data
- Using models from the Hugging Face Transformers library


### Step 0: Import and configure modules

In [None]:
!pip install transformers
!pip install beautifulsoup4
!pip install -U pip setuptools wheel
!pip install -U spacy
!python -m spacy download en_core_web_sm

You can now load the package via spacy.load('en_core_web_sm')


In [None]:
from transformers import pipeline 
from bs4 import BeautifulSoup
import requests

### Step 1: Gather text data through web scraping

In [None]:
url = "https://en.wikipedia.org/wiki/Playing_card"
r = requests.get(url)

In [None]:
r.text

'<!DOCTYPE html>\n<html class="client-nojs" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8"/>\n<title>Playing card - Wikipedia</title>\n<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":false,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"cf6491a9-1178-460a-a008-1b72dcd7cd99","wgCSPNonce":false,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"Playing_card","wgTitle":"Playing card","wgCurRevisionId":1105739523,"wgRevisionId":1105739523,"wgArticleId":23083,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Harv and Sfn no-target errors","CS1 Italian-language sources (it)","Articles with short description","Short description is different from Wikidata","All

In [None]:
!pip install lxml
soup = BeautifulSoup(r.text, "lxml")  # name the parser explicitly to avoid the default-parser warning
results = soup.find_all("p")


In [None]:
text = ""
for para in results:  # each result is a <p> tag, i.e. one paragraph
  text += para.get_text()

text

'A playing card is a piece of specially prepared card stock, heavy paper, thin cardboard, plastic-coated paper, cotton-paper blend, or thin plastic that is marked with distinguishing motifs. Often the front (face) and back of each card has a finish to make handling easier. They are most commonly used for playing card games, and are also used in magic tricks, cardistry,[1][2] card throwing,[3] and card houses; cards may also be collected.[4] Some patterns of Tarot playing card are also used for divination, although bespoke cards for this use are more common.[citation needed] Playing cards are typically palm-sized for convenient handling, and usually are sold together in a set as a deck of cards or pack of cards.\nThe most common type of playing card is that found in the French-suited, standard 52-card pack, of which the most common design is the English pattern,[a] followed by the Belgian-Genoese pattern.[5] However, many countries use other, traditional types of playing card, including
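To see what `find_all("p")` followed by `get_text()` is doing, here is a minimal sketch that collects paragraph text using only Python's built-in `html.parser` instead of Beautiful Soup. The `ParagraphExtractor` class and the sample HTML are made up for illustration, and this is not how Beautiful Soup itself is implemented.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Hypothetical helper: collects the text found inside <p> tags."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []   # one string per <p> element
        self._in_p = False     # True while we are inside a <p> element
        self._buf = []         # text pieces of the current paragraph

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_p = True
            self._buf = []

    def handle_endtag(self, tag):
        if tag == "p" and self._in_p:
            self.paragraphs.append("".join(self._buf))
            self._in_p = False

    def handle_data(self, data):
        # Text inside nested tags such as <b> is kept, like get_text() does.
        if self._in_p:
            self._buf.append(data)


# Made-up page fragment, not the real Wikipedia HTML.
sample_html = (
    "<html><body><h1>Playing card</h1>"
    "<p>A <b>playing card</b> is a piece of card stock.</p>"
    "<div>navigation</div>"
    "<p>A standard deck has 52 cards.</p></body></html>"
)

extractor = ParagraphExtractor()
extractor.feed(sample_html)
extracted_text = "\n".join(extractor.paragraphs)
print(extracted_text)
```

Only the two paragraphs are kept; the heading and the `div` are ignored, which is why scraping the `p` tags gives mostly article body text.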

### Step 2: Text cleaning

In [None]:
import re
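# removes the bracketed reference markers from the Wikipedia text:
# numeric citations like [1], footnote letters like [a], and [citation needed]
pattern = r"\[(?:\d+|[a-z]|citation needed)\]"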

In [None]:
text = re.sub(pattern, '', text)

In [None]:
text

'A playing card is a piece of specially prepared card stock, heavy paper, thin cardboard, plastic-coated paper, cotton-paper blend, or thin plastic that is marked with distinguishing motifs. Often the front (face) and back of each card has a finish to make handling easier. They are most commonly used for playing card games, and are also used in magic tricks, cardistry, card throwing, and card houses; cards may also be collected. Some patterns of Tarot playing card are also used for divination, although bespoke cards for this use are more common. Playing cards are typically palm-sized for convenient handling, and usually are sold together in a set as a deck of cards or pack of cards.\nThe most common type of playing card is that found in the French-suited, standard 52-card pack, of which the most common design is the English pattern, followed by the Belgian-Genoese pattern. However, many countries use other, traditional types of playing card, including those that are German, Italian, Sp
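Wikipedia text carries more than one kind of bracketed marker, so it can help to try a pattern on a short string before applying it to the whole article. The sketch below compares a numeric-only pattern with a broader one; the sample sentence and both variable names are invented for illustration.

```python
import re

# Made-up sample in the style of the scraped text.
sample = (
    "Cards may also be collected.[4] This use is more common.[citation needed] "
    "The English pattern,[a] is widespread."
)

# Matches only numeric citations such as [4] or [12].
numeric_only = re.compile(r"\[\d+\]")

# Also matches single-letter footnotes like [a] and the [citation needed] tag.
all_markers = re.compile(r"\[(?:\d+|[a-z]|citation needed)\]")

cleaned_numeric = numeric_only.sub("", sample)
cleaned_all = all_markers.sub("", sample)

print(cleaned_numeric)
print(cleaned_all)
```

The narrower pattern leaves `[citation needed]` and `[a]` in place, while the broader one removes all three marker types without touching other bracketed text such as quoted terms.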

## **PART 1**
### EXTRACTIVE TEXT SUMMARY

### Step 1: Sentence Tokenization 

In [None]:
# text = """
# A car (or automobile) is a wheeled motor vehicle that is used for transportation. Most definitions of cars say that they run primarily on roads, seat one to eight people, have four wheels, and mainly transport people instead of goods.
# The year 1886 is regarded as the birth year of the car when German inventor Carl Benz patented his Benz Patent-Motorwagen. Cars became widely available during the 20th century. One of the first cars affordable by the masses was the 1908 Model T, an American car manufactured by the Ford Motor Company. Cars were rapidly adopted in the US, where they replaced animal-drawn carriages and carts. In Europe and other parts of the world, demand for automobiles did not increase until after World War II. The car is considered an essential part of the developed economy.
# Cars have controls for driving, parking, passenger comfort, and a variety of lights. Over the decades, additional features and controls have been added to vehicles, making them progressively more complex. These include rear-reversing cameras, air conditioning, navigation systems, and in-car entertainment. Most cars in use in the early 2020s are propelled by an internal combustion engine, fueled by the combustion of fossil fuels. Electric cars, which were invented early in the history of the car, became commercially available in the 2000s and are predicted to cost less to buy than gasoline cars before 2025. The transition from fossil fuels to electric cars features prominently in most climate change mitigation scenarios, such as Project Drawdown's 100 actionable solutions for climate change.
# There are costs and benefits to car use. The costs to the individual include acquiring the vehicle, interest payments (if the car is financed), repairs and maintenance, fuel, depreciation, driving time, parking fees, taxes, and insurance. The costs to society include maintaining roads, land use, road congestion, air pollution, public health, healthcare, and disposing of the vehicle at the end of its life. Traffic collisions are the largest cause of injury-related deaths worldwide.
# """

In [None]:
import spacy
from spacy.lang.en.stop_words import STOP_WORDS  # the set of English stopwords
from string import punctuation
sp = spacy.load('en_core_web_sm')

# What are stopwords? English has many words, such as "I", "the" and "you", that appear very frequently in text but add little information for NLP operations and modeling.
# These words are called stopwords, and they are usually removed as part of text preprocessing.
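As a quick illustration of stopword removal without loading spaCy, the sketch below filters a sentence against a tiny hand-written stopword set. The set `TINY_STOPWORDS` and the helper `remove_stopwords` are hypothetical; spaCy's real list is much longer.

```python
# A tiny, hand-picked subset used only for this illustration.
TINY_STOPWORDS = {"i", "the", "you", "a", "is", "of"}


def remove_stopwords(tokens):
    """Keep only the tokens whose lowercase form is not a stopword."""
    return [tok for tok in tokens if tok.lower() not in TINY_STOPWORDS]


tokens = "The deck is a set of playing cards".split()
filtered = remove_stopwords(tokens)
print(filtered)  # ['deck', 'set', 'playing', 'cards']
```

Comparing in lowercase matters here: without it, the capitalized "The" at the start of the sentence would slip through.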

In [None]:
# returning a list of stopwords 
all_stopwords = sp.Defaults.stop_words

list(all_stopwords)

['ourselves',
 'whose',
 'while',
 'amongst',
 'third',
 'just',
 'give',
 'doing',
 'anything',
 'over',
 'used',
 'until',
 'meanwhile',
 'too',
 'move',
 'own',
 'we',
 'even',
 'quite',
 'since',
 'per',
 'whoever',
 'whereas',
 'had',
 'twelve',
 'beforehand',
 'seems',
 'sometimes',
 'some',
 'mine',
 'empty',
 'least',
 'any',
 'someone',
 'themselves',
 'up',
 'who',
 'thereafter',
 'ours',
 'keep',
 'cannot',
 'there',
 'via',
 'somehow',
 '‘re',
 'perhaps',
 'hers',
 'their',
 'else',
 'behind',
 'me',
 'toward',
 'always',
 'thru',
 'whether',
 'of',
 'them',
 'because',
 'yourself',
 'down',
 'thereby',
 'where',
 'throughout',
 'many',
 'would',
 'afterwards',
 'very',
 'take',
 'towards',
 'nine',
 'mostly',
 'much',
 'an',
 'top',
 'everything',
 'among',
 'otherwise',
 'first',
 'other',
 'everywhere',
 'as',
 'which',
 'upon',
 'one',
 'am',
 'for',
 'will',
 'beside',
 'now',
 'does',
 'if',
 'already',
 'less',
 'herself',
 'might',
 'n‘t',
 'go',
 'could',
 'namely'

In [None]:
doc = sp(text)

In [None]:
# tokenizing the original text
# it returns stopwords as well
tokens = [token.text for token in doc]
tokens

['A',
 'playing',
 'card',
 'is',
 'a',
 'piece',
 'of',
 'specially',
 'prepared',
 'card',
 'stock',
 ',',
 'heavy',
 'paper',
 ',',
 'thin',
 'cardboard',
 ',',
 'plastic',
 '-',
 'coated',
 'paper',
 ',',
 'cotton',
 '-',
 'paper',
 'blend',
 ',',
 'or',
 'thin',
 'plastic',
 'that',
 'is',
 'marked',
 'with',
 'distinguishing',
 'motifs',
 '.',
 'Often',
 'the',
 'front',
 '(',
 'face',
 ')',
 'and',
 'back',
 'of',
 'each',
 'card',
 'has',
 'a',
 'finish',
 'to',
 'make',
 'handling',
 'easier',
 '.',
 'They',
 'are',
 'most',
 'commonly',
 'used',
 'for',
 'playing',
 'card',
 'games',
 ',',
 'and',
 'are',
 'also',
 'used',
 'in',
 'magic',
 'tricks',
 ',',
 'cardistry,[1][2',
 ']',
 'card',
 'throwing,[3',
 ']',
 'and',
 'card',
 'houses',
 ';',
 'cards',
 'may',
 'also',
 'be',
 'collected.[4',
 ']',
 'Some',
 'patterns',
 'of',
 'Tarot',
 'playing',
 'card',
 'are',
 'also',
 'used',
 'for',
 'divination',
 ',',
 'although',
 'bespoke',
 'cards',
 'for',
 'this',
 'use',
 '

In [None]:
# adding a new line character to the punctuation
punctuation = punctuation + "\n"
list(punctuation)

['!',
 '"',
 '#',
 '$',
 '%',
 '&',
 "'",
 '(',
 ')',
 '*',
 '+',
 ',',
 '-',
 '.',
 '/',
 ':',
 ';',
 '<',
 '=',
 '>',
 '?',
 '@',
 '[',
 '\\',
 ']',
 '^',
 '_',
 '`',
 '{',
 '|',
 '}',
 '~',
 '\n']

### Step 2: Word frequency table

In [None]:
word_frequencies = {}
for word in doc:
  # skip stopwords and punctuation (both compared in lowercase)
  if word.text.lower() not in all_stopwords and word.text.lower() not in punctuation:
    # keys keep the original casing, so 'Playing' and 'playing' are counted separately
    word_frequencies[word.text] = word_frequencies.get(word.text, 0) + 1

word_frequencies

{'playing': 30,
 'card': 47,
 'piece': 1,
 'specially': 1,
 'prepared': 1,
 'stock': 2,
 'heavy': 1,
 'paper': 6,
 'thin': 2,
 'cardboard': 1,
 'plastic': 2,
 'coated': 2,
 'cotton': 1,
 'blend': 1,
 'marked': 1,
 'distinguishing': 1,
 'motifs': 2,
 'face': 3,
 'finish': 2,
 'handling': 2,
 'easier': 1,
 'commonly': 5,
 'games': 11,
 'magic': 1,
 'tricks': 1,
 'cardistry,[1][2': 1,
 'throwing,[3': 1,
 'houses': 1,
 'cards': 90,
 'collected.[4': 1,
 'patterns': 7,
 'Tarot': 2,
 'divination': 1,
 'bespoke': 1,
 'use': 6,
 'common.[citation': 1,
 'needed': 2,
 'Playing': 11,
 'typically': 1,
 'palm': 1,
 'sized': 1,
 'convenient': 1,
 'usually': 6,
 'sold': 4,
 'set': 3,
 'deck': 17,
 'pack': 9,
 'common': 6,
 'type': 1,
 'found': 6,
 'French': 15,
 'suited': 4,
 'standard': 3,
 '52': 7,
 'design': 5,
 'English': 2,
 'pattern,[a': 1,
 'followed': 3,
 'Belgian': 1,
 'Genoese': 1,
 'pattern.[5': 1,
 'countries': 3,
 'traditional': 2,
 'types': 2,
 'including': 5,
 'German': 6,
 'Italian': 4
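The same kind of table can be built with `collections.Counter`. The sketch below is one possible variant, not the notebook's own method: it lowercases every token so that "Playing" and "playing" share one entry. The token list, `SAMPLE_STOPWORDS`, and `SAMPLE_PUNCT` are small made-up stand-ins for the spaCy tokens and stopword list.

```python
from collections import Counter

# Hypothetical stand-ins for the spaCy stopword set and the punctuation string.
SAMPLE_STOPWORDS = {"a", "is", "the", "of", "are", "in"}
SAMPLE_PUNCT = set(".,;()")

# Hand-tokenized toy input.
tokens = [
    "Playing", "cards", "are", "sold", "in", "a", "deck", "of", "cards", ".",
    "A", "playing", "card", "is", "a", "piece", "of", "card", "stock", ".",
]

# Count the lowercase form of every token that is neither stopword nor punctuation.
word_counts = Counter(
    tok.lower()
    for tok in tokens
    if tok.lower() not in SAMPLE_STOPWORDS and tok not in SAMPLE_PUNCT
)

print(word_counts.most_common(3))
```

Whether to merge casings is a design choice: merging gives more stable counts, while keeping them apart can preserve a distinction between proper nouns and common words.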

In [None]:
max_freq = max(word_frequencies.values())
max_freq

90

In [None]:
for word in word_frequencies.keys():
  word_frequencies[word] = word_frequencies[word]/max_freq

word_frequencies

{'playing': 0.3333333333333333,
 'card': 0.5222222222222223,
 'piece': 0.011111111111111112,
 'specially': 0.011111111111111112,
 'prepared': 0.011111111111111112,
 'stock': 0.022222222222222223,
 'heavy': 0.011111111111111112,
 'paper': 0.06666666666666667,
 'thin': 0.022222222222222223,
 'cardboard': 0.011111111111111112,
 'plastic': 0.022222222222222223,
 'coated': 0.022222222222222223,
 'cotton': 0.011111111111111112,
 'blend': 0.011111111111111112,
 'marked': 0.011111111111111112,
 'distinguishing': 0.011111111111111112,
 'motifs': 0.022222222222222223,
 'face': 0.03333333333333333,
 'finish': 0.022222222222222223,
 'handling': 0.022222222222222223,
 'easier': 0.011111111111111112,
 'commonly': 0.05555555555555555,
 'games': 0.12222222222222222,
 'magic': 0.011111111111111112,
 'tricks': 0.011111111111111112,
 'cardistry,[1][2': 0.011111111111111112,
 'throwing,[3': 0.011111111111111112,
 'houses': 0.011111111111111112,
 'cards': 1.0,
 'collected.[4': 0.011111111111111112,
 'patte
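The normalization step divides every count by the largest count, so the most frequent word gets a weight of 1.0 and all others fall between 0 and 1. A small worked sketch using three of the counts from the frequency table above (the variable names are illustrative):

```python
# Three raw counts taken from the frequency table shown earlier.
counts = {"playing": 30, "card": 47, "cards": 90}

max_count = max(counts.values())  # 90, the count of 'cards'

# Build a new dict instead of overwriting, so the raw counts stay available.
normalized = {word: count / max_count for word, count in counts.items()}

for word, weight in normalized.items():
    print(f"{word}: {weight:.4f}")
```

Building a separate dictionary also makes the cell safe to re-run, since dividing an already normalized table by its maximum a second time would leave it unchanged only by coincidence of the maximum being 1.0.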

In [None]:
sentence_tokens = [sent for sent in doc.sents]
sentence_tokens

[A playing card is a piece of specially prepared card stock, heavy paper, thin cardboard, plastic-coated paper, cotton-paper blend, or thin plastic that is marked with distinguishing motifs.,
 Often the front (face) and back of each card has a finish to make handling easier.,
 They are most commonly used for playing card games, and are also used in magic tricks, cardistry,[1][2] card throwing,[3] and card houses; cards may also be collected.[4],
 Some patterns of Tarot playing card are also used for divination, although bespoke cards for this use are more common.[citation needed],
 Playing cards are typically palm-sized for convenient handling, and usually are sold together in a set as a deck of cards or pack of cards.,
 The most common type of playing card is that found in the French-suited, standard 52-card pack, of which the most common design is the English pattern,[a] followed by the Belgian-Genoese pattern.[5],
 However, many countries use other, traditional types of playing card

In [None]:
sentence_scores = {}
for sent in sentence_tokens:
  for word in sent:
    # tokens are looked up in lowercase, so a word stored in the table
    # only in capitalized form does not contribute to the score
    key = word.text.lower()
    if key in word_frequencies:
      # a sentence's score is the sum of the weights of its words
      sentence_scores[sent] = sentence_scores.get(sent, 0) + word_frequencies[key]

sentence_scores

{A playing card is a piece of specially prepared card stock, heavy paper, thin cardboard, plastic-coated paper, cotton-paper blend, or thin plastic that is marked with distinguishing motifs.: 1.8333333333333324,
 Often the front (face) and back of each card has a finish to make handling easier.: 0.6111111111111112,
 They are most commonly used for playing card games, and are also used in magic tricks, cardistry,[1][2] card throwing,[3] and card houses; cards may also be collected.[4]: 3.144444444444444,
 Some patterns of Tarot playing card are also used for divination, although bespoke cards for this use are more common.[citation needed]: 2.111111111111111,
 Playing cards are typically palm-sized for convenient handling, and usually are sold together in a set as a deck of cards or pack of cards.: 3.8333333333333335,
 The most common type of playing card is that found in the French-suited, standard 52-card pack, of which the most common design is the English pattern,[a] followed by the 
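Because a sentence's score is a sum of word weights, longer sentences tend to score higher simply by containing more words. The sketch below shows the sum-based score next to a length-normalized variant on toy data. The weights, the sentences, and the `score_sentence` helper are all hypothetical, and the averaged score is only one possible adjustment, not something this notebook uses.

```python
# Hypothetical normalized word weights.
word_weights = {"cards": 1.0, "card": 0.5, "deck": 0.2, "games": 0.1}

sentences = [
    "Cards are sold in a deck",
    "A card has a face",
    "Cards and card games use a deck of cards",
]


def score_sentence(sentence, weights):
    """Sum the weights of the words in the sentence (unknown words count as 0)."""
    return sum(weights.get(word.lower(), 0.0) for word in sentence.split())


# Sum-based score, as in the cell above.
raw_scores = {s: score_sentence(s, word_weights) for s in sentences}

# Optional variant: divide by sentence length to reduce the bias toward long sentences.
avg_scores = {s: raw_scores[s] / len(s.split()) for s in sentences}

for s in sentences:
    print(f"{raw_scores[s]:.2f}  {avg_scores[s]:.3f}  {s}")
```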

In [None]:
from heapq import nlargest

In [None]:
select_length = int(len(sentence_tokens)*0.1)
select_length

12

In [None]:
# nlargest returns the top-scoring sentences, ordered from highest to lowest score
summary = nlargest(select_length, sentence_scores, key=sentence_scores.get)
final_summary = [sent.text for sent in summary]
final_text = " ".join(final_summary)
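Two details of the selection step are easy to miss: `int(len(...) * 0.1)` becomes 0 for texts with fewer than ten sentences, and `nlargest` returns sentences by score rather than in reading order. The sketch below addresses both on toy data. The sentences, the scores, and the `pick_count` helper are made up for illustration.

```python
from heapq import nlargest

sentences = [
    "First sentence.",
    "Second sentence.",
    "Third sentence.",
    "Fourth sentence.",
    "Fifth sentence.",
]
scores = [1.8, 0.6, 3.1, 2.1, 3.8]  # hypothetical score for each sentence


def pick_count(n_sentences, ratio=0.1):
    """Number of sentences to keep, never less than one."""
    return max(1, int(n_sentences * ratio))


print(pick_count(len(sentences)))  # 1, where int(5 * 0.1) alone would give 0

# Select the indices of the two best sentences (highest score first).
top_by_score = nlargest(2, range(len(sentences)), key=lambda i: scores[i])

# Sort the indices to put the selected sentences back in document order.
in_document_order = sorted(top_by_score)
ordered_summary = " ".join(sentences[i] for i in in_document_order)
print(ordered_summary)  # Third sentence. Fifth sentence.
```

Keeping the original order usually reads more naturally, because later sentences often refer back to earlier ones.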

### Step 3: Summarization

In [None]:
final_text

'Playing cards are available in a wide variety of styles, as decks may be custom-produced for casinos[6] and magicians[7] (sometimes in the form of trick decks),[8] made as promotional items,[9] or intended as souvenirs,[10][11] artistic works, educational tools,[12][13][14] or branded accessories.[15] Decks of cards or even single cards are also collected as a hobby or for monetary value.[16][17] Cards may also be produced for trading card sets or collectible card games, which can comprise hundreds if not thousands of unique cards, or as supplements for board games.\n Instead, they were printed with instructions or forfeits for whomever drew them.[26]\nThe earliest dated instance of a game involving cards occurred on 17 July 1294 when "Yan Sengzhu and Zheng Pig-Dog were caught playing cards [zhi pai] and that wood blocks for printing them had been impounded, together with nine of the actual cards."[26]\nWilliam Henry Wilkinson suggests that the first cards may have been actual paper c

In [None]:
final_text = final_text.replace("\n", "")
final_text

'Playing cards are available in a wide variety of styles, as decks may be custom-produced for casinos[6] and magicians[7] (sometimes in the form of trick decks),[8] made as promotional items,[9] or intended as souvenirs,[10][11] artistic works, educational tools,[12][13][14] or branded accessories.[15] Decks of cards or even single cards are also collected as a hobby or for monetary value.[16][17] Cards may also be produced for trading card sets or collectible card games, which can comprise hundreds if not thousands of unique cards, or as supplements for board games. Instead, they were printed with instructions or forfeits for whomever drew them.[26]The earliest dated instance of a game involving cards occurred on 17 July 1294 when "Yan Sengzhu and Zheng Pig-Dog were caught playing cards [zhi pai] and that wood blocks for printing them had been impounded, together with nine of the actual cards."[26]William Henry Wilkinson suggests that the first cards may have been actual paper currenc

In [None]:
doc

A playing card is a piece of specially prepared card stock, heavy paper, thin cardboard, plastic-coated paper, cotton-paper blend, or thin plastic that is marked with distinguishing motifs. Often the front (face) and back of each card has a finish to make handling easier. They are most commonly used for playing card games, and are also used in magic tricks, cardistry,[1][2] card throwing,[3] and card houses; cards may also be collected.[4] Some patterns of Tarot playing card are also used for divination, although bespoke cards for this use are more common.[citation needed] Playing cards are typically palm-sized for convenient handling, and usually are sold together in a set as a deck of cards or pack of cards.
The most common type of playing card is that found in the French-suited, standard 52-card pack, of which the most common design is the English pattern,[a] followed by the Belgian-Genoese pattern.[5] However, many countries use other, traditional types of playing card, including t

## **PART 2**
### ABSTRACTIVE TEXT SUMMARY USING HUGGING FACE TRANSFORMERS LIBRARY 🤗

In [None]:
from transformers import pipeline 


In [None]:
summarizer = pipeline("summarization")


No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.



In [None]:
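# the full article is longer than the model's input limit, so truncation=True cuts it to fit
# note: the output shown below comes from the commented-out car text in Part 1, Step 1, not from the playing-card article
hf_summary = summarizer(text, max_length=100, min_length=30, do_sample=False, truncation=True)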

In [None]:
hf_summary[0]['summary_text']

' The year 1886 is regarded as the birth year of the car when German inventor Carl Benz patented his Benz Patent-Motorwagen . Electric cars are predicted to cost less to buy than gasoline cars before 2025 . The transition from fossil fuels to electric cars features prominently in most climate change mitigation scenarios .'
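Summarization models accept only a limited amount of input, so a long article is often split into smaller pieces that are summarized one at a time. The sketch below shows one simple way to do the splitting in plain Python. The `chunk_text` helper and its `max_words` parameter are hypothetical, word counts are only a rough proxy for the model's token limit, and the sentence splitting is a naive regex rather than spaCy's.

```python
import re


def chunk_text(text, max_words=400):
    """Group whole sentences into chunks of at most max_words words each.

    A single sentence longer than max_words still becomes its own chunk.
    """
    # Naive split: break after ., ! or ? when followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, count = [], [], 0
    for sentence in sentences:
        n_words = len(sentence.split())
        # Start a new chunk when adding this sentence would exceed the limit.
        if current and count + n_words > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n_words
    if current:
        chunks.append(" ".join(current))
    return chunks


sample = "Cards are old. They came from China. Decks vary by country. Some have 52 cards."
chunks = chunk_text(sample, max_words=8)
for chunk in chunks:
    print(chunk)

# Each chunk could then be passed to the summarizer defined above, for example:
# summaries = [summarizer(c, max_length=100, min_length=30, do_sample=False)[0]["summary_text"] for c in chunks]
```

Joining the per-chunk summaries gives a longer overall summary that covers the whole article instead of only its beginning.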