Introduction to NLP. This notebook walks through four basic steps of an NLP workflow:
1. First we will web-scrape the text of a news article
2. We'll tokenize the text, lowercase it, remove stop words and punctuation, and count word frequencies
3. We'll experiment with stemming (various stemmers such as Porter / Lancaster) and lemmatization
4. We'll experiment with POS tagging and Named Entity Recognition

By Anas Puthawala

# Web-scraping data

In [1]:
from bs4 import BeautifulSoup
import requests
import pandas as pd

In [2]:
url = 'https://9to5mac.com/2021/07/19/man-behind-linkedin-scraping/'
html_source = requests.get(url).text
html_source

'<!DOCTYPE html>\n<html lang="en-US">\n\t<head>\n\t\t\t\t<meta charset="UTF-8" />\n\t\t<meta name="viewport" content="width=device-width, initial-scale=1, maximum-scale=1.0, minimal-ui">\n\t\t\t\t<meta name=\'robots\' content=\'index, follow, max-image-preview:large, max-snippet:-1, max-video-preview:-1\' />\n<!-- Jetpack Site Verification Tags -->\n<meta name="google-site-verification" content="DzsUrmZ9ZlyfaPuXpOlNS3LpML0I6aZl89iaB86c9v8" />\n<meta name="msvalidate.01" content="808CE29E29F1230249D24A893DE871FB" />\n<meta name="p:domain_verify" content="e6250102e52a4d42f3964f1974e77e52" />\n<meta name="yandex-verification" content="97441597b112e146" />\n\t\t<script>\n\t\t\twindow.dataLayer = window.dataLayer || [];\n\t\t\t\t\t\tdataLayer.push({"pageType":"post","postCategory":["facebook","linkedin","privacy","security"]});\n\t\t\t\t\t</script>\n\t\t\t\t<script src="https://9to5mac.com/wp-content/themes/9to5-2015/assets/js/adsbygoogle.js"></script>\n\n               <!-- Google Tag Mana

In [3]:
# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(html_source, 'html.parser')

In [4]:
soup.prettify()

'<!DOCTYPE html>\n<html lang="en-US">\n <head>\n  <meta charset="utf-8"/>\n  <meta content="width=device-width, initial-scale=1, maximum-scale=1.0, minimal-ui" name="viewport"/>\n  <meta content="index, follow, max-image-preview:large, max-snippet:-1, max-video-preview:-1" name="robots"/>\n  <!-- Jetpack Site Verification Tags -->\n  <meta content="DzsUrmZ9ZlyfaPuXpOlNS3LpML0I6aZl89iaB86c9v8" name="google-site-verification"/>\n  <meta content="808CE29E29F1230249D24A893DE871FB" name="msvalidate.01"/>\n  <meta content="e6250102e52a4d42f3964f1974e77e52" name="p:domain_verify"/>\n  <meta content="97441597b112e146" name="yandex-verification"/>\n  <script>\n   window.dataLayer = window.dataLayer || [];\n\t\t\t\t\t\tdataLayer.push({"pageType":"post","postCategory":["facebook","linkedin","privacy","security"]});\n  </script>\n  <script src="https://9to5mac.com/wp-content/themes/9to5-2015/assets/js/adsbygoogle.js">\n  </script>\n  <!-- Google Tag Manager -->\n  <script>\n   (function(w,d,s,l,i)

In [5]:
class_name = 'post-body'
match = soup.find('div', class_=class_name)
print(match.text)


Last month we reported a LinkedIn scraping that exposed the data of 700 million users – some 92% of all those on the service. The data included location, phone numbers, and inferred salaries.
The man behind it has now been identified, and says that he did it “for fun” – though he is also selling the data … 


Background
Data scraping is a controversial topic. At its simplest, it means writing a piece of software to visit a webpage, read the data displayed, and then add it to a database.
More commonly, people will use APIs (application programming interfaces) provided by the web service for legitimate purposes, and use it to grab large quantities of data.
It’s controversial because, on the one hand, those doing the scraping can argue that they are only accessing publicly available data – they are simply doing so in an efficient way. Others argue that they are abusing tools not intended for the purpose, and that there is more data available through APIs than is visible on websites, maki
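
The `soup.find('div', class_=...)` step above keeps only the text inside one container and drops navigation, scripts and footers. As a rough sketch of that idea without any network access, the block below uses the standard library's `html.parser` in place of BeautifulSoup (a swapped-in technique, not what the notebook runs). `DivTextExtractor` and `sample_html` are hypothetical names made up for this illustration.

```python
from html.parser import HTMLParser


class DivTextExtractor(HTMLParser):
    """Collect the text found inside the first-level <div> with a given class."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.depth = 0      # > 0 while we are inside the target div
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.depth > 0:
            # A nested div inside the target: track it so we know when we leave
            self.depth += 1
        else:
            classes = (dict(attrs).get("class") or "").split()
            if self.class_name in classes:
                self.depth = 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)

    def text(self):
        return "".join(self.chunks)


# Hypothetical page: only the 'post-body' div holds the article text
sample_html = """
<html><body>
<div class="nav">Home</div>
<div class="post-body"><p>Data scraping is a <b>controversial</b> topic.</p><div class="inner">Nested text.</div></div>
<div class="footer">Footer</div>
</body></html>
"""

extractor = DivTextExtractor("post-body")
extractor.feed(sample_html)
extracted = extractor.text()
print(extracted)
```

With BeautifulSoup itself, `find` returns `None` when nothing matches, so checking `match is not None` before reading `match.text` is a sensible guard if the page layout changes.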

# Tokenize the data

In [6]:
from nltk.tokenize import word_tokenize

In [7]:
tok = word_tokenize(match.text)

In [8]:
len(tok)

796
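
To get a feel for what tokenization produces, here is a rough sketch that uses a regular expression in place of `word_tokenize`. It is only an approximation: NLTK's tokenizer has extra rules (for contractions, for example) that this one-line pattern does not have. `simple_tokenize` and the sample sentence are made up for this illustration.

```python
import re


def simple_tokenize(text):
    # \w+ grabs runs of letters/digits; [^\w\s] grabs one punctuation mark at a time
    return re.findall(r"\w+|[^\w\s]", text)


sample = "He did it “for fun” – though he is also selling the data."
tokens = simple_tokenize(sample)
print(tokens)
print(len(tokens))
```

Punctuation marks come out as tokens of their own, which is why a separate punctuation-removal step is needed later on.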

### Lowercasing and removing stop words

In [9]:
from nltk.corpus import stopwords

In [10]:
# Lowercase every token
tok = [t.lower() for t in tok]

In [11]:
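# Note: this rebinds the name `stopwords` from the corpus reader to a plain set,
# so re-running this cell without re-importing raises an AttributeError
stopwords = set(stopwords.words('english'))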

In [12]:
stopwords

{'a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 'her',
 'here',
 'hers',
 'herself',
 'him',
 'himself',
 'his',
 'how',
 'i',
 'if',
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it's",
 'its',
 'itself',
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'only',
 'or',
 'other',
 'our',
 'ours',
 'ourselves',
 'out',
 'over',
 'own',
 'r

In [13]:
import string

In [14]:
filt_tok = [word for word in tok if word not in stopwords]
print(f'The length of the new filtered token with stopwords removed is: {len(filt_tok)}')

# Remove punctuation tokens. string.punctuation only covers ASCII, so the
# en dash, ellipsis and opening curly quote are added by hand. The closing
# curly quote (”) and apostrophe (’) are not included, so they stay in the list.
punctuations = string.punctuation + '–' + '…' + '“'
filt_tok_2 = [token for token in filt_tok if token not in punctuations]

print(f'The length of the list with stopwords and punctuation removed is:{len(filt_tok_2)}')

The length of the new filtered token with stopwords removed is: 466
The length of the list with stopwords and punctuation removed is:376
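
The filtered list still contains '”' and '’' because those characters are not in the punctuation string. A minimal sketch of one way to close that gap, assuming we want to drop any token made up entirely of punctuation characters. `punct_set`, `is_punct` and the sample tokens are made up for this illustration, and the extra characters listed are an assumption about what the article contains.

```python
import string

# ASCII punctuation plus the typographic characters seen in the article text
extra = "–…“”‘’"
punct_set = set(string.punctuation) | set(extra)


def is_punct(token):
    # True when every character of the token is a punctuation character
    return all(ch in punct_set for ch in token)


tokens = ["fun", "”", "though", "’", "data", "...", "``", "u.s."]
kept = [t for t in tokens if not is_punct(t)]
print(kept)
```

Testing character by character against a set also handles multi-character tokens such as `...`, while leaving tokens like `u.s.` alone because they contain letters.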


In [15]:
filt_tok_2

['last',
 'month',
 'reported',
 'linkedin',
 'scraping',
 'exposed',
 'data',
 '700',
 'million',
 'users',
 '92',
 'service',
 'data',
 'included',
 'location',
 'phone',
 'numbers',
 'inferred',
 'salaries',
 'man',
 'behind',
 'identified',
 'says',
 'fun',
 '”',
 'though',
 'also',
 'selling',
 'data',
 'background',
 'data',
 'scraping',
 'controversial',
 'topic',
 'simplest',
 'means',
 'writing',
 'piece',
 'software',
 'visit',
 'webpage',
 'read',
 'data',
 'displayed',
 'add',
 'database',
 'commonly',
 'people',
 'use',
 'apis',
 'application',
 'programming',
 'interfaces',
 'provided',
 'web',
 'service',
 'legitimate',
 'purposes',
 'use',
 'grab',
 'large',
 'quantities',
 'data',
 '’',
 'controversial',
 'one',
 'hand',
 'scraping',
 'argue',
 'accessing',
 'publicly',
 'available',
 'data',
 'simply',
 'efficient',
 'way',
 'others',
 'argue',
 'abusing',
 'tools',
 'intended',
 'purpose',
 'data',
 'available',
 'apis',
 'visible',
 'websites',
 'making',
 'hard',
 

### Frequency distribution of words

In [16]:
from nltk.probability import FreqDist
fdist = FreqDist()

In [17]:
for word in filt_tok_2:
    fdist[word]+=1
fdist

FreqDist({'data': 14, '’': 11, 'linkedin': 8, 'scraping': 6, '”': 6, 'million': 5, 'says': 5, 'security': 5, 'liner': 5, 'api': 5, ...})

In [18]:
fdist['linkedin']

8

In [19]:
fdist_top10 = fdist.most_common(10)
fdist_top10

[('data', 14),
 ('’', 11),
 ('linkedin', 8),
 ('scraping', 6),
 ('”', 6),
 ('million', 5),
 ('says', 5),
 ('security', 5),
 ('liner', 5),
 ('api', 5)]
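
`FreqDist` is a subclass of Python's `collections.Counter`, so the counting above can be sketched with the standard library alone. The token list below is a made-up toy sample, not the article's tokens.

```python
from collections import Counter

# Toy token list (illustrative only)
tokens = ["data", "scraping", "data", "linkedin", "data", "scraping"]

counts = Counter(tokens)           # one pass over the tokens, no explicit loop
top2 = counts.most_common(2)       # the two most frequent tokens with their counts
share_of_data = counts["data"] / sum(counts.values())   # relative frequency

print(top2)
print(share_of_data)
```

Passing the token list straight to the constructor should work the same way with `FreqDist(filt_tok_2)`, which avoids the manual `+= 1` loop.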

# Stemming / Lemmatization

In [21]:
from nltk.stem import PorterStemmer
pst = PorterStemmer()
pst.stem('testing')

'test'

In [60]:
# Stem every filtered token
stems = [pst.stem(word) for word in filt_tok_2]

# Sanity check: one stem per token
assert len(stems) == len(filt_tok_2)

# Dictionary of word : stem (repeated words collapse to a single key)
d_stems = dict(zip(filt_tok_2, stems))

d_stems

{'last': 'last',
 'month': 'month',
 'reported': 'report',
 'linkedin': 'linkedin',
 'scraping': 'scrape',
 'exposed': 'expos',
 'data': 'data',
 '700': '700',
 'million': 'million',
 'users': 'user',
 '92': '92',
 'service': 'servic',
 'included': 'includ',
 'location': 'locat',
 'phone': 'phone',
 'numbers': 'number',
 'inferred': 'infer',
 'salaries': 'salari',
 'man': 'man',
 'behind': 'behind',
 'identified': 'identifi',
 'says': 'say',
 'fun': 'fun',
 '”': '”',
 'though': 'though',
 'also': 'also',
 'selling': 'sell',
 'background': 'background',
 'controversial': 'controversi',
 'topic': 'topic',
 'simplest': 'simplest',
 'means': 'mean',
 'writing': 'write',
 'piece': 'piec',
 'software': 'softwar',
 'visit': 'visit',
 'webpage': 'webpag',
 'read': 'read',
 'displayed': 'display',
 'add': 'add',
 'database': 'databas',
 'commonly': 'commonli',
 'people': 'peopl',
 'use': 'use',
 'apis': 'api',
 'application': 'applic',
 'programming': 'program',
 'interfaces': 'interfac',
 'pro

In [52]:
# Finding frequency distribution of stems
fdist_stems = FreqDist()
for w in stems:
    fdist_stems[w]+=1
fdist_stems

FreqDist({'data': 14, '’': 11, 'linkedin': 8, 'million': 7, 'use': 7, 'api': 7, 'scrape': 6, 'user': 6, '”': 6, 'say': 5, ...})

In [53]:
fdist

FreqDist({'data': 14, '’': 11, 'linkedin': 8, 'scraping': 6, '”': 6, 'million': 5, 'says': 5, 'security': 5, 'liner': 5, 'api': 5, ...})

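The two distributions differ because stemming merges inflected forms: 'api' rises from 5 to 7 as 'apis' is folded into 'api', 'million' rises from 5 to 7, 'use' enters the top ten with 7, and 'says' and 'scraping' are replaced by their stems 'say' and 'scrape'.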
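
The effect of stemming on the counts can be sketched by regrouping a word-level count through a word-to-stem mapping. The mapping entries for 'apis', 'says' and 'data' match the stems shown above; the count of 2 for 'apis' is an assumption chosen so the totals line up, not a value printed by the notebook.

```python
from collections import Counter

# Word-level counts; the 'apis' count is an illustrative assumption
raw_counts = Counter({"api": 5, "apis": 2, "says": 5, "data": 14})

# Word -> stem mapping, in the spirit of d_stems
stem_of = {"api": "api", "apis": "api", "says": "say", "data": "data"}

# Add each word's count to the bucket of its stem
stem_counts = Counter()
for word, count in raw_counts.items():
    stem_counts[stem_of[word]] += count

print(stem_counts.most_common())
```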

#### Lemmatization using WordNetLemmatizer

In [54]:
from nltk.stem import WordNetLemmatizer

In [59]:
word_lem = WordNetLemmatizer()
word_lem.lemmatize('salaries')

'salary'

In [64]:
# Lemmatize every filtered token (lemmatize() treats each word as a noun by default)
lemmas = [word_lem.lemmatize(word) for word in filt_tok_2]

# Sanity check: one lemma per token
assert len(lemmas) == len(filt_tok_2)

# Dictionary of word : lemma
d_lemmas = dict(zip(filt_tok_2, lemmas))

d_lemmas

{'last': 'last',
 'month': 'month',
 'reported': 'reported',
 'linkedin': 'linkedin',
 'scraping': 'scraping',
 'exposed': 'exposed',
 'data': 'data',
 '700': '700',
 'million': 'million',
 'users': 'user',
 '92': '92',
 'service': 'service',
 'included': 'included',
 'location': 'location',
 'phone': 'phone',
 'numbers': 'number',
 'inferred': 'inferred',
 'salaries': 'salary',
 'man': 'man',
 'behind': 'behind',
 'identified': 'identified',
 'says': 'say',
 'fun': 'fun',
 '”': '”',
 'though': 'though',
 'also': 'also',
 'selling': 'selling',
 'background': 'background',
 'controversial': 'controversial',
 'topic': 'topic',
 'simplest': 'simplest',
 'means': 'mean',
 'writing': 'writing',
 'piece': 'piece',
 'software': 'software',
 'visit': 'visit',
 'webpage': 'webpage',
 'read': 'read',
 'displayed': 'displayed',
 'add': 'add',
 'database': 'database',
 'commonly': 'commonly',
 'people': 'people',
 'use': 'use',
 'apis': 'apis',
 'application': 'application',
 'programming': 'progr

In [65]:
# Finding frequency distribution of lemmas
fdist_lemmas = FreqDist()
for w in lemmas:
    fdist_lemmas[w]+=1
fdist_lemmas

FreqDist({'data': 14, '’': 11, 'linkedin': 8, 'million': 7, 'scraping': 6, 'user': 6, '”': 6, 'say': 5, 'security': 5, 'liner': 5, ...})

In [66]:
fdist_stems

FreqDist({'data': 14, '’': 11, 'linkedin': 8, 'million': 7, 'use': 7, 'api': 7, 'scrape': 6, 'user': 6, '”': 6, 'say': 5, ...})

In [67]:
fdist

FreqDist({'data': 14, '’': 11, 'linkedin': 8, 'scraping': 6, '”': 6, 'million': 5, 'says': 5, 'security': 5, 'liner': 5, 'api': 5, ...})
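
To make the stemmer/lemmatizer comparison concrete, the sketch below takes a handful of word-to-stem and word-to-lemma pairs copied from the two dictionaries printed above and lists where they agree and where they differ. `sample_stems` and `sample_lemmas` are hypothetical names for these hand-copied subsets.

```python
# Pairs copied from the d_stems output above
sample_stems = {
    "reported": "report", "users": "user", "service": "servic",
    "salaries": "salari", "says": "say", "apis": "api", "writing": "write",
}
# Pairs copied from the d_lemmas output above
sample_lemmas = {
    "reported": "reported", "users": "user", "service": "service",
    "salaries": "salary", "says": "say", "apis": "apis", "writing": "writing",
}

agree = [w for w in sample_stems if sample_stems[w] == sample_lemmas[w]]
differ = {w: (sample_stems[w], sample_lemmas[w])
          for w in sample_stems if sample_stems[w] != sample_lemmas[w]}

print("same result:", agree)
for word, (stem, lemma) in differ.items():
    print(f"{word}: stem={stem!r}, lemma={lemma!r}")
```

The stems are often not dictionary words ('servic', 'salari'), while the lemmas are. Verb forms such as 'reported' and 'writing' pass through the lemmatizer unchanged here because `lemmatize` assumes a noun unless a different `pos` argument is given.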

# POS Tagging and NER

In [68]:
import nltk
from nltk import ne_chunk

In [78]:
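# ne_chunk expects POS-tagged input, so tag the tokens first.
# Note: the tokens are lowercased and stop words are gone, so the tags
# are less reliable than they would be on the original sentences.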
ne_tags = nltk.pos_tag(filt_tok_2)

In [79]:
ner = ne_chunk(ne_tags)

In [85]:
print(ner)

(S
  last/JJ
  month/NN
  reported/VBD
  linkedin/JJ
  scraping/NN
  exposed/VBN
  data/NNS
  700/CD
  million/CD
  users/NNS
  92/CD
  service/NN
  data/NNS
  included/VBD
  location/NN
  phone/NN
  numbers/NNS
  inferred/JJ
  salaries/NNS
  man/NN
  behind/IN
  identified/JJ
  says/VBZ
  fun/NN
  ”/NNP
  though/IN
  also/RB
  selling/VBG
  data/NNS
  background/NN
  data/NNS
  scraping/VBG
  controversial/JJ
  topic/NN
  simplest/JJS
  means/VBZ
  writing/VBG
  piece/NN
  software/NN
  visit/NN
  webpage/NN
  read/VBD
  data/NNS
  displayed/VBD
  add/JJ
  database/NN
  commonly/RB
  people/NNS
  use/VBP
  apis/JJ
  application/NN
  programming/VBG
  interfaces/NNS
  provided/VBD
  web/JJ
  service/NN
  legitimate/JJ
  purposes/NNS
  use/VBP
  grab/JJ
  large/JJ
  quantities/NNS
  data/NNS
  ’/RB
  controversial/JJ
  one/CD
  hand/NN
  scraping/VBG
  argue/NN
  accessing/VBG
  publicly/RB
  available/JJ
  data/NNS
  simply/RB
  efficient/JJ
  way/NN
  others/NNS
  argue/VBP
  abusing/

In [84]:
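# Noun-phrase grammar: an optional determiner (DT), any number of
# adjectives (JJ), then a singular noun (NN)
pattern = 'NP: {<DT>?<JJ>*<NN>}'

# Chunking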
cp = nltk.RegexpParser(pattern)
cs = cp.parse(ner)
print(cs)

(S
  (NP last/JJ month/NN)
  reported/VBD
  (NP linkedin/JJ scraping/NN)
  exposed/VBN
  data/NNS
  700/CD
  million/CD
  users/NNS
  92/CD
  (NP service/NN)
  data/NNS
  included/VBD
  (NP location/NN)
  (NP phone/NN)
  numbers/NNS
  inferred/JJ
  salaries/NNS
  (NP man/NN)
  behind/IN
  identified/JJ
  says/VBZ
  (NP fun/NN)
  ”/NNP
  though/IN
  also/RB
  selling/VBG
  data/NNS
  (NP background/NN)
  data/NNS
  scraping/VBG
  (NP controversial/JJ topic/NN)
  simplest/JJS
  means/VBZ
  writing/VBG
  (NP piece/NN)
  (NP software/NN)
  (NP visit/NN)
  (NP webpage/NN)
  read/VBD
  data/NNS
  displayed/VBD
  (NP add/JJ database/NN)
  commonly/RB
  people/NNS
  use/VBP
  (NP apis/JJ application/NN)
  programming/VBG
  interfaces/NNS
  provided/VBD
  (NP web/JJ service/NN)
  legitimate/JJ
  purposes/NNS
  use/VBP
  grab/JJ
  large/JJ
  quantities/NNS
  data/NNS
  ’/RB
  controversial/JJ
  one/CD
  (NP hand/NN)
  scraping/VBG
  (NP argue/NN)
  accessing/VBG
  publicly/RB
  available/JJ
  da
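
The grammar `NP: {<DT>?<JJ>*<NN>}` reads as "optional determiner, any number of adjectives, then one singular noun". The block below is a hand-rolled sketch of that single pattern in plain Python, meant only to show the matching logic; it is not how `RegexpParser` is implemented. `np_chunks` is a hypothetical helper, and the tagged tokens are the first few from the output above.

```python
def np_chunks(tagged):
    """Return the word groups matching DT? JJ* NN in a list of (word, tag) pairs."""
    chunks = []
    i = 0
    n = len(tagged)
    while i < n:
        j = i
        if tagged[j][1] == "DT":                  # optional determiner
            j += 1
        while j < n and tagged[j][1] == "JJ":     # zero or more adjectives
            j += 1
        if j < n and tagged[j][1] == "NN":        # must end on a singular noun
            chunks.append([word for word, _ in tagged[i:j + 1]])
            i = j + 1
        else:
            i += 1                                # no chunk starts here
    return chunks


tagged = [
    ("last", "JJ"), ("month", "NN"), ("reported", "VBD"),
    ("linkedin", "JJ"), ("scraping", "NN"), ("exposed", "VBN"),
    ("data", "NNS"),
]
print(np_chunks(tagged))
```

Note that `data/NNS` is left out: the pattern names the tag `NN` exactly, so plural nouns tagged `NNS` never close a chunk, which matches the parser output above.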

In [87]:
from nltk.chunk import tree2conlltags
from pprint import pprint

# Flatten the chunk tree into (word, POS tag, IOB chunk tag) triples
iob_tagged = tree2conlltags(cs)
pprint(iob_tagged)

[('last', 'JJ', 'B-NP'),
 ('month', 'NN', 'I-NP'),
 ('reported', 'VBD', 'O'),
 ('linkedin', 'JJ', 'B-NP'),
 ('scraping', 'NN', 'I-NP'),
 ('exposed', 'VBN', 'O'),
 ('data', 'NNS', 'O'),
 ('700', 'CD', 'O'),
 ('million', 'CD', 'O'),
 ('users', 'NNS', 'O'),
 ('92', 'CD', 'O'),
 ('service', 'NN', 'B-NP'),
 ('data', 'NNS', 'O'),
 ('included', 'VBD', 'O'),
 ('location', 'NN', 'B-NP'),
 ('phone', 'NN', 'B-NP'),
 ('numbers', 'NNS', 'O'),
 ('inferred', 'JJ', 'O'),
 ('salaries', 'NNS', 'O'),
 ('man', 'NN', 'B-NP'),
 ('behind', 'IN', 'O'),
 ('identified', 'JJ', 'O'),
 ('says', 'VBZ', 'O'),
 ('fun', 'NN', 'B-NP'),
 ('”', 'NNP', 'O'),
 ('though', 'IN', 'O'),
 ('also', 'RB', 'O'),
 ('selling', 'VBG', 'O'),
 ('data', 'NNS', 'O'),
 ('background', 'NN', 'B-NP'),
 ('data', 'NNS', 'O'),
 ('scraping', 'VBG', 'O'),
 ('controversial', 'JJ', 'B-NP'),
 ('topic', 'NN', 'I-NP'),
 ('simplest', 'JJS', 'O'),
 ('means', 'VBZ', 'O'),
 ('writing', 'VBG', 'O'),
 ('piece', 'NN', 'B-NP'),
 ('software', 'NN', 'B-NP'),
 (
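
In the IOB scheme shown above, `B-NP` marks the first token of a noun-phrase chunk, `I-NP` marks the tokens that continue it, and `O` marks tokens outside any chunk. A minimal sketch of that conversion in plain Python, assuming chunks are represented as lists of (word, tag) pairs and unchunked tokens as bare pairs. `to_iob` and this representation are made up for illustration and are not NLTK's tree format.

```python
def to_iob(chunked):
    """Flatten a mix of NP chunks (lists) and loose tokens (tuples) into IOB triples."""
    triples = []
    for item in chunked:
        if isinstance(item, list):
            # Inside a chunk: first token gets B-, the rest get I-
            for position, (word, tag) in enumerate(item):
                prefix = "B" if position == 0 else "I"
                triples.append((word, tag, f"{prefix}-NP"))
        else:
            word, tag = item
            triples.append((word, tag, "O"))
    return triples


# First few items of the chunked output above
chunked = [
    [("last", "JJ"), ("month", "NN")],
    ("reported", "VBD"),
    [("linkedin", "JJ"), ("scraping", "NN")],
    ("exposed", "VBN"),
]
iob = to_iob(chunked)
for triple in iob:
    print(triple)
```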