# Important Parts of Text Processing

This notebook gives a practical overview of how text processing works in Python. It accompanies a Medium post in which I briefly explain the important steps of text processing.





### How to get text data using a REST API

I will use 'http://quotes.rest/' as an example of fetching data from an API, retrieving a quote and its author.

In [0]:
import requests
import json
import re

In [3]:
# Fetch data 
res = requests.get('http://quotes.rest/qod.json').json()
print(json.dumps(res, indent=4))

{
    "success": {
        "total": 1
    },
    "contents": {
        "quotes": [
            {
                "quote": "A successful man is one who can lay a firm foundation with the bricks that others throw at him.",
                "length": "95",
                "author": "Sidney Greenberg",
                "tags": {
                    "0": "inspire",
                    "1": "success",
                    "3": "tso-life"
                },
                "category": "inspire",
                "language": "en",
                "date": "2020-05-19",
                "permalink": "https://theysaidso.com/quote/sidney-greenberg-a-successful-man-is-one-who-can-lay-a-firm-foundation-with-the",
                "id": "O8OiauUuV2FEq8DZElUNwQeF",
                "background": "https://theysaidso.com/img/qod/qod-inspire.jpg",
                "title": "Inspiring Quote of the day"
            }
        ]
    },
    "baseurl": "https://theysaidso.com",
    "copyright": {
        "year": 2022,

In [4]:
# Extract relevant object
res['contents']['quotes'][0]

{'author': 'Sidney Greenberg',
 'background': 'https://theysaidso.com/img/qod/qod-inspire.jpg',
 'category': 'inspire',
 'date': '2020-05-19',
 'id': 'O8OiauUuV2FEq8DZElUNwQeF',
 'language': 'en',
 'length': '95',
 'permalink': 'https://theysaidso.com/quote/sidney-greenberg-a-successful-man-is-one-who-can-lay-a-firm-foundation-with-the',
 'quote': 'A successful man is one who can lay a firm foundation with the bricks that others throw at him.',
 'tags': {'0': 'inspire', '1': 'success', '3': 'tso-life'},
 'title': 'Inspiring Quote of the day'}

In [5]:
print(res['contents']['quotes'][0]['quote'], '\n--', res['contents']['quotes'][0]['author'])

A successful man is one who can lay a firm foundation with the bricks that others throw at him. 
-- Sidney Greenberg
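
The chained lookups above raise a `KeyError` or `IndexError` if the response has a different shape, for example when the API returns an error payload. The following is a minimal sketch of a more defensive lookup. The helper name `extract_quote` and the `sample` dictionary are my own illustrations, shaped like the response printed earlier; they are not part of the API.

```python
def extract_quote(payload):
    """Return (quote, author) from a quote-of-the-day style payload, or None.

    Assumes the payload is shaped like the response shown above:
    payload['contents']['quotes'] is a list of dictionaries.
    """
    quotes = payload.get("contents", {}).get("quotes", [])
    if not quotes:
        return None
    first = quotes[0]
    return first.get("quote"), first.get("author")


# Hypothetical sample that mirrors the fields used in this notebook
sample = {
    "contents": {
        "quotes": [
            {
                "quote": "A successful man is one who can lay a firm foundation "
                         "with the bricks that others throw at him.",
                "author": "Sidney Greenberg",
            }
        ]
    }
}

result = extract_quote(sample)
print(result[0], "\n--", result[1])

# A payload without quotes returns None instead of raising
print(extract_quote({"error": "something went wrong"}))
```

With the real response, `extract_quote(res)` would replace the repeated `res['contents']['quotes'][0]` indexing.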


### How to get data by web scraping

As an example, I will scrape 'https://news.ycombinator.com/' to fetch all the article summaries listed on the page.

In [6]:
r = requests.get('https://news.ycombinator.com/')
print(r.text)

<html lang="en" op="news"><head><meta name="referrer" content="origin"><meta name="viewport" content="width=device-width, initial-scale=1.0"><link rel="stylesheet" type="text/css" href="news.css?DLnbuyFPBOBfE68cHyiw">
        <link rel="shortcut icon" href="favicon.ico">
          <link rel="alternate" type="application/rss+xml" title="RSS" href="rss">
        <title>Hacker News</title></head><body><center><table id="hnmain" border="0" cellpadding="0" cellspacing="0" width="85%" bgcolor="#f6f6ef">
        <tr><td bgcolor="#ff6600"><table border="0" cellpadding="0" cellspacing="0" width="100%" style="padding:2px"><tr><td style="width:18px;padding-right:4px"><a href="https://news.ycombinator.com"><img src="y18.gif" width="18" height="18" style="border:1px white solid;"></a></td>
                  <td style="line-height:12pt; height:10px;"><span class="pagetop"><b class="hnname"><a href="news">Hacker News</a></b>
              <a href="newest">new</a> | <a href="front">past</a> | <a href=

In [13]:
# Use Beautiful Soup to parse the page content
# (the 'html5lib' parser requires the html5lib package to be installed)
from bs4 import BeautifulSoup
soup = BeautifulSoup(r.text, 'html5lib')
print(soup.get_text())


        
          
        Hacker News
        
                  Hacker News
              new | past | comments | ask | show | jobs | submit            
                              login
                          
              

              
      1.      New York Times phasing out all 3rd-party advertising data (axios.com)
        154 points by jbegley 53 minutes ago  | hide | 60 comments              
      
                
      2.      We built a new GPS receiver engine (coresemi.io)
        76 points by jgarzik 1 hour ago  | hide | 37 comments              
      
                
      3.      How to Center in CSS (howtocenterincss.com)
        37 points by HeinZawHtet 41 minutes ago  | hide | 7 comments              
      
                
      4.      Argon – a clean, responsive, modern template for Dokuwiki (github.com)
        16 points by thunderbong 47 minutes ago  | hide | 1 comment              
      
                
      5.      How Distortion Works in Mus

In [14]:
# Find all articles
summaries = soup.find_all("tr", class_="athing")
print(summaries[0])

<tr class="athing" id="23235141">
      <td align="right" class="title" valign="top"><span class="rank">1.</span></td>      <td class="votelinks" valign="top"><center><a href="vote?id=23235141&amp;how=up&amp;goto=news" id="up_23235141"><div class="votearrow" title="upvote"></div></a></center></td><td class="title"><a class="storylink" href="https://www.axios.com/new-york-times-advertising-792b3cd6-4bdb-47c3-9817-36601211a79d.html">New York Times phasing out all 3rd-party advertising data</a><span class="sitebit comhead"> (<a href="from?site=axios.com"><span class="sitestr">axios.com</span></a>)</span></td></tr>


In [15]:
# Extract article name
summaries[0].find('a', class_='storylink').get_text()

'New York Times phasing out all 3rd-party advertising data'

In [16]:
# Find and extract all article titles
articles = [summary.find('a', class_='storylink').get_text() for summary in summaries]
articles

['New York Times phasing out all 3rd-party advertising data',
 'We built a new GPS receiver engine',
 'How to Center in CSS',
 'Argon – a clean, responsive, modern template for Dokuwiki',
 'How Distortion Works in Music',
 'Defold game engine source now available and free to use for commercial games',
 'Vermont proposes providing broadband internet service to all state residents',
 "German intelligence can't spy on foreigners outside Germany",
 'The unreasonable effectiveness of declarative programming',
 'Walmart says it will discontinue Jet, which it acquired for $3B in 2016',
 'Unnamed SFU – Open Source One-to-many videoconferencing for teaching/conferences',
 'Ask HN: Production Prolog in 2020?',
 'EasyJet admits nine million customers hacked',
 'Ask HN: Production Lisp in 2020?',
 "EasyJet reveals cyber-attack exposed 9m customers' details",
 'Spleeter – Music Source-Separation Engine',
 'Lost and Found: Stopping Bluetooth Finders from Leaking Private Information',
 'Linux Product
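
To make the extraction step easier to follow without a network call, here is a small sketch that pulls the same kind of titles out of an inline HTML string. It deliberately swaps Beautiful Soup for the standard library's `html.parser`, so it is an alternative technique rather than the approach used in this notebook. The class name `StoryTitleParser` and the `sample_html` snippet are hypothetical, and the `storylink` class name is taken from the markup printed above, which the live site may no longer use.

```python
from html.parser import HTMLParser


class StoryTitleParser(HTMLParser):
    """Collect the text of every <a> element whose class includes 'storylink'."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_story = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value can be None
        classes = (dict(attrs).get("class") or "").split()
        if tag == "a" and "storylink" in classes:
            self._in_story = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_story:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_story:
            self.titles.append("".join(self._buffer).strip())
            self._in_story = False


# Hypothetical snippet that imitates the structure of the rows shown above
sample_html = """
<table>
<tr class="athing"><td class="title"><a class="storylink" href="https://example.com/a">How to Center in CSS</a></td></tr>
<tr class="athing"><td class="title"><a class="storylink" href="https://example.com/b">How Distortion Works in Music</a></td></tr>
<tr class="athing"><td class="title"><a href="https://example.com/c">A link without the storylink class</a></td></tr>
</table>
"""

parser = StoryTitleParser()
parser.feed(sample_html)
story_titles = parser.titles
print(story_titles)
```

Links without the `storylink` class are skipped. In the Beautiful Soup version, a row with no such link makes `find` return `None`, and the `.get_text()` call then fails.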

### Normalization Step

In [22]:
# case normalization
normalized_articles = [article.lower() for article in articles]

# punctuation removal: replace every non-alphanumeric character with a space
normalized_articles = [re.sub(r'[^A-Za-z0-9]', ' ', article) for article in normalized_articles]
normalized_articles

['new york times phasing out all 3rd party advertising data',
 'we built a new gps receiver engine',
 'how to center in css',
 'argon   a clean  responsive  modern template for dokuwiki',
 'how distortion works in music',
 'defold game engine source now available and free to use for commercial games',
 'vermont proposes providing broadband internet service to all state residents',
 'german intelligence can t spy on foreigners outside germany',
 'the unreasonable effectiveness of declarative programming',
 'walmart says it will discontinue jet  which it acquired for  3b in 2016',
 'unnamed sfu   open source one to many videoconferencing for teaching conferences',
 'ask hn  production prolog in 2020 ',
 'easyjet admits nine million customers hacked',
 'ask hn  production lisp in 2020 ',
 'easyjet reveals cyber attack exposed 9m customers  details',
 'spleeter   music source separation engine',
 'lost and found  stopping bluetooth finders from leaking private information',
 'linux product
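
Replacing each punctuation character with a single space leaves runs of spaces and trailing spaces, as in `'argon   a clean  responsive ...'` above. Here is a minimal sketch of a variant that collapses those runs as well. The function name `normalize` is my own, and the pattern uses `+` so that a whole run of non-alphanumeric characters becomes one space.

```python
import re


def normalize(text):
    """Lowercase, replace runs of non-alphanumeric characters with one space, and trim."""
    lowered = text.lower()
    return re.sub(r'[^a-z0-9]+', ' ', lowered).strip()


title = 'Argon – a clean, responsive, modern template for Dokuwiki'
normalized_title = normalize(title)
print(normalized_title)
```

This only changes the whitespace. The tokens produced by `split()` in the next step are the same either way.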

### Tokenization -- using plain Python

In [23]:
tokens = [article.split() for article in normalized_articles]
tokens

[['new',
  'york',
  'times',
  'phasing',
  'out',
  'all',
  '3rd',
  'party',
  'advertising',
  'data'],
 ['we', 'built', 'a', 'new', 'gps', 'receiver', 'engine'],
 ['how', 'to', 'center', 'in', 'css'],
 ['argon',
  'a',
  'clean',
  'responsive',
  'modern',
  'template',
  'for',
  'dokuwiki'],
 ['how', 'distortion', 'works', 'in', 'music'],
 ['defold',
  'game',
  'engine',
  'source',
  'now',
  'available',
  'and',
  'free',
  'to',
  'use',
  'for',
  'commercial',
  'games'],
 ['vermont',
  'proposes',
  'providing',
  'broadband',
  'internet',
  'service',
  'to',
  'all',
  'state',
  'residents'],
 ['german',
  'intelligence',
  'can',
  't',
  'spy',
  'on',
  'foreigners',
  'outside',
  'germany'],
 ['the', 'unreasonable', 'effectiveness', 'of', 'declarative', 'programming'],
 ['walmart',
  'says',
  'it',
  'will',
  'discontinue',
  'jet',
  'which',
  'it',
  'acquired',
  'for',
  '3b',
  'in',
  '2016'],
 ['unnamed',
  'sfu',
  'open',
  'source',
  'one',
  'to
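
Calling `split()` with no argument is what makes this work on the normalized text, which still contains repeated and trailing spaces. The short sketch below compares it with `split(' ')` on one of the normalized headlines from above; the variable names are only for illustration.

```python
text = 'ask hn  production prolog in 2020 '

# No argument: split on any run of whitespace and drop empty strings
whitespace_split = text.split()

# Explicit single space: every space is a separator, so empty strings appear
single_space_split = text.split(' ')

print(whitespace_split)
print(single_space_split)
```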

### It is more convenient to use NLTK for this kind of processing

In [43]:
# first, import NLTK and download the resources it needs
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

# import tokenizers
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize

# import stopword dictionary
from nltk.corpus import stopwords

# import stemmers and lemmatizers
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


In [32]:
tokens_list = [word_tokenize(article) for article in normalized_articles]
tokens_list

[['new',
  'york',
  'times',
  'phasing',
  'out',
  'all',
  '3rd',
  'party',
  'advertising',
  'data'],
 ['we', 'built', 'a', 'new', 'gps', 'receiver', 'engine'],
 ['how', 'to', 'center', 'in', 'css'],
 ['argon',
  'a',
  'clean',
  'responsive',
  'modern',
  'template',
  'for',
  'dokuwiki'],
 ['how', 'distortion', 'works', 'in', 'music'],
 ['defold',
  'game',
  'engine',
  'source',
  'now',
  'available',
  'and',
  'free',
  'to',
  'use',
  'for',
  'commercial',
  'games'],
 ['vermont',
  'proposes',
  'providing',
  'broadband',
  'internet',
  'service',
  'to',
  'all',
  'state',
  'residents'],
 ['german',
  'intelligence',
  'can',
  't',
  'spy',
  'on',
  'foreigners',
  'outside',
  'germany'],
 ['the', 'unreasonable', 'effectiveness', 'of', 'declarative', 'programming'],
 ['walmart',
  'says',
  'it',
  'will',
  'discontinue',
  'jet',
  'which',
  'it',
  'acquired',
  'for',
  '3b',
  'in',
  '2016'],
 ['unnamed',
  'sfu',
  'open',
  'source',
  'one',
  'to

### Stop words removal using NLTK

In [38]:
# Build the stop word set once instead of reloading the list for every token
english_stopwords = set(stopwords.words('english'))
rm_stopwords = [[w for w in tokens if w not in english_stopwords] for tokens in tokens_list]
rm_stopwords

[['new', 'york', 'times', 'phasing', '3rd', 'party', 'advertising', 'data'],
 ['built', 'new', 'gps', 'receiver', 'engine'],
 ['center', 'css'],
 ['argon', 'clean', 'responsive', 'modern', 'template', 'dokuwiki'],
 ['distortion', 'works', 'music'],
 ['defold',
  'game',
  'engine',
  'source',
  'available',
  'free',
  'use',
  'commercial',
  'games'],
 ['vermont',
  'proposes',
  'providing',
  'broadband',
  'internet',
  'service',
  'state',
  'residents'],
 ['german', 'intelligence', 'spy', 'foreigners', 'outside', 'germany'],
 ['unreasonable', 'effectiveness', 'declarative', 'programming'],
 ['walmart', 'says', 'discontinue', 'jet', 'acquired', '3b', '2016'],
 ['unnamed',
  'sfu',
  'open',
  'source',
  'one',
  'many',
  'videoconferencing',
  'teaching',
  'conferences'],
 ['ask', 'hn', 'production', 'prolog', '2020'],
 ['easyjet', 'admits', 'nine', 'million', 'customers', 'hacked'],
 ['ask', 'hn', 'production', 'lisp', '2020'],
 ['easyjet',
  'reveals',
  'cyber',
  'attack
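
The filtering itself is just a membership test against a collection of words. The sketch below shows the same idea without NLTK. The `STOP_WORDS` set is a tiny hand-picked list used purely for illustration and is not NLTK's English list, which is much longer.

```python
# Hypothetical, hand-picked stop words for illustration only
STOP_WORDS = {"a", "we", "how", "to", "in", "the", "of", "for", "and"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens that are not in the stop word set."""
    return [token for token in tokens if token not in stop_words]


sample_tokens = [
    ["we", "built", "a", "new", "gps", "receiver", "engine"],
    ["how", "to", "center", "in", "css"],
]

filtered = [remove_stop_words(tokens) for tokens in sample_tokens]
print(filtered)
```

A set is used because membership tests on a set take roughly constant time, while a list is scanned from the start for every token.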

### Stemming and Lemmatization using NLTK

#### Stemming 

In [40]:
# Reduce words to their stems
stemmer = PorterStemmer()
stems_list = [[stemmer.stem(w) for w in seq] for seq in rm_stopwords]
stems_list

[['new', 'york', 'time', 'phase', '3rd', 'parti', 'advertis', 'data'],
 ['built', 'new', 'gp', 'receiv', 'engin'],
 ['center', 'css'],
 ['argon', 'clean', 'respons', 'modern', 'templat', 'dokuwiki'],
 ['distort', 'work', 'music'],
 ['defold',
  'game',
  'engin',
  'sourc',
  'avail',
  'free',
  'use',
  'commerci',
  'game'],
 ['vermont',
  'propos',
  'provid',
  'broadband',
  'internet',
  'servic',
  'state',
  'resid'],
 ['german', 'intellig', 'spi', 'foreign', 'outsid', 'germani'],
 ['unreason', 'effect', 'declar', 'program'],
 ['walmart', 'say', 'discontinu', 'jet', 'acquir', '3b', '2016'],
 ['unnam',
  'sfu',
  'open',
  'sourc',
  'one',
  'mani',
  'videoconferenc',
  'teach',
  'confer'],
 ['ask', 'hn', 'product', 'prolog', '2020'],
 ['mount', 'st', 'helen', 'erupt', 'volcan', 'warn', 'need'],
 ['easyjet', 'admit', 'nine', 'million', 'custom', 'hack'],
 ['ask', 'hn', 'product', 'lisp', '2020'],
 ['easyjet', 'reveal', 'cyber', 'attack', 'expos', '9m', 'custom', 'detail'],
 

#### Lemmatization

In [45]:
# Reduce each word to its root form (lemma) based on WordNet.
# By default the lemmatizer treats words as nouns; pos='v' treats them as verbs instead.
lemmatizer = WordNetLemmatizer()
lemma = [[lemmatizer.lemmatize(w, pos='v') for w in seq] for seq in rm_stopwords]
lemma

[['new', 'york', 'time', 'phase', '3rd', 'party', 'advertise', 'data'],
 ['build', 'new', 'gps', 'receiver', 'engine'],
 ['center', 'css'],
 ['argon', 'clean', 'responsive', 'modern', 'template', 'dokuwiki'],
 ['distortion', 'work', 'music'],
 ['defold',
  'game',
  'engine',
  'source',
  'available',
  'free',
  'use',
  'commercial',
  'game'],
 ['vermont',
  'propose',
  'provide',
  'broadband',
  'internet',
  'service',
  'state',
  'residents'],
 ['german', 'intelligence', 'spy', 'foreigners', 'outside', 'germany'],
 ['unreasonable', 'effectiveness', 'declarative', 'program'],
 ['walmart', 'say', 'discontinue', 'jet', 'acquire', '3b', '2016'],
 ['unnamed',
  'sfu',
  'open',
  'source',
  'one',
  'many',
  'videoconferencing',
  'teach',
  'conferences'],
 ['ask', 'hn', 'production', 'prolog', '2020'],
 ['mount', 'st', 'helens', 'eruption', 'volcanic', 'warn', 'need'],
 ['easyjet', 'admit', 'nine', 'million', 'customers', 'hack'],
 ['ask', 'hn', 'production', 'lisp', '2020'],
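
To see how the two approaches differ, the sketch below lines up the first headline's tokens, stems, and lemmas, copied by hand from the outputs above, and keeps only the positions where the stem and the lemma disagree. The variable names are my own and nothing here calls NLTK.

```python
# First headline after stop word removal, copied from the outputs above
first_tokens = ['new', 'york', 'times', 'phasing', '3rd', 'party', 'advertising', 'data']
first_stems = ['new', 'york', 'time', 'phase', '3rd', 'parti', 'advertis', 'data']
first_lemmas = ['new', 'york', 'time', 'phase', '3rd', 'party', 'advertise', 'data']

# Keep only the tokens whose stem and lemma are different
differences = [
    (token, stem, lem)
    for token, stem, lem in zip(first_tokens, first_stems, first_lemmas)
    if stem != lem
]

for token, stem, lem in differences:
    print(f"{token!r}: stem={stem!r}, lemma={lem!r}")
```

The stems `'parti'` and `'advertis'` are not dictionary words because the stemmer only strips suffixes by rule, while the lemmatizer returns forms it looks up in WordNet.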
