# Introduction to Natural Language Processing

## Session 1: Basic Text Analysis
 

For the demonstrations in this module, we use the NLTK library. **NLTK** is a leading platform for building Python programs that work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, and wrappers for industrial-strength NLP libraries (source: https://www.nltk.org/).

In [1]:
## Install the NLTK library
!pip install nltk

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/




Let's start with basic NLP operations. These are typically used in text preprocessing
to improve data quality for downstream tasks, and include:

- Stopword removal
- Word/sentence tokenization
- Part-of-speech (POS) tagging
- Stemming
- Lemmatization
- Bag of words
- TF-IDF

## Stopword Removal and Tokenization

- **Stopword Removal**: Removing very frequent words, called *stopwords*, that contribute little to the meaning of a text.
- **Tokenization**: Splitting text into smaller elements (characters, words, sentences, paragraphs).
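
As a rough illustration of the two ideas above, here is a minimal pure-Python sketch. The regular expression and the four-word `TOY_STOPWORDS` set are assumptions made for this example only; the NLTK tokenizer and stopword list used in the cells below are considerably more thorough.

```python
import re

# Hypothetical, deliberately tiny stopword set (NLTK's English list is much longer)
TOY_STOPWORDS = {"this", "is", "the", "and"}

def toy_tokenize(text):
    """Split text into lowercase word tokens and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text.lower())

def remove_stopwords(tokens, stopwords=TOY_STOPWORDS):
    """Keep alphabetic tokens that are not in the stopword set."""
    return [t for t in tokens if t.isalpha() and t not in stopwords]

tokens = toy_tokenize("This is the first sentence, and this is the second sentence.")
content_words = remove_stopwords(tokens)
print(tokens)
print(content_words)
```

Keeping punctuation as separate tokens mirrors what a word tokenizer usually does; the filter step then drops both punctuation and stopwords in one pass.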

In [2]:
## We need to download the stopwords
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to C:\Users\DELL
[nltk_data]     5590\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

Next, we download the *Punkt* sentence tokenizer, which NLTK's `sent_tokenize` and `word_tokenize` use by default. Read more: https://www.nltk.org/_modules/nltk/tokenize/punkt.html

In [3]:
from nltk.corpus import stopwords

from nltk.tokenize import word_tokenize, sent_tokenize
nltk.download('punkt')

[nltk_data] Downloading package punkt to C:\Users\DELL
[nltk_data]     5590\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [4]:
## We have stopwords in multiple languages
stops_en = stopwords.words('english')
stops_ge = stopwords.words('german')
print(stops_en)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [5]:
# customize your stop word list by adding words
stops_en.append('airline')
print(stops_en)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [6]:
# sentence / word tokenization
cnn = 'The Cable News Network is a multinational news-based pay television channel headquartered in Atlanta, Georgia. It is owned by CNN Global, which is part of Warner Bros, Discovery. It was founded in 1980 by American media proprietor Ted Turner and Reese Schonfeld as a 24-hour cable news channel.'

word_tokenize(cnn)
#sent_tokenize(cnn)

['The',
 'Cable',
 'News',
 'Network',
 'is',
 'a',
 'multinational',
 'news-based',
 'pay',
 'television',
 'channel',
 'headquartered',
 'in',
 'Atlanta',
 ',',
 'Georgia',
 '.',
 'It',
 'is',
 'owned',
 'by',
 'CNN',
 'Global',
 ',',
 'which',
 'is',
 'part',
 'of',
 'Warner',
 'Bros',
 ',',
 'Discovery',
 '.',
 'It',
 'was',
 'founded',
 'in',
 '1980',
 'by',
 'American',
 'media',
 'proprietor',
 'Ted',
 'Turner',
 'and',
 'Reese',
 'Schonfeld',
 'as',
 'a',
 '24-hour',
 'cable',
 'news',
 'channel',
 '.']

In [7]:
# a combination of tokenization and stopword removal
sent = 'This is the first sentence, and this is the second sentence.'
words = word_tokenize(sent.lower())

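for word in words:
  # skip tokens of three characters or fewer; this also drops punctuation
  # and short stopwords such as 'is' and 'the'
  if len(word) <= 3:
    continue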
  if word in stops_en:
    print(word,': a stop word.')
  else:
    print(word)


this : a stop word.
first
sentence
this : a stop word.
second
sentence


## POS tagging

- **Part-of-speech (POS) tagging**: Identifying which words are nouns, verbs, adjectives, etc.
- Each word in a text (corpus) is assigned a part-of-speech category based on both its definition and its context.

In [8]:
# POS tagging
from nltk import pos_tag 
nltk.download('averaged_perceptron_tagger')
nltk.download('tagsets')
nltk.help.upenn_tagset()

$: dollar
    $ -$ --$ A$ C$ HK$ M$ NZ$ S$ U.S.$ US$
'': closing quotation mark
    ' ''
(: opening parenthesis
    ( [ {
): closing parenthesis
    ) ] }
,: comma
    ,
--: dash
    --
.: sentence terminator
    . ! ?
:: colon or ellipsis
    : ; ...
CC: conjunction, coordinating
    & 'n and both but either et for less minus neither nor or plus so
    therefore times v. versus vs. whether yet
CD: numeral, cardinal
    mid-1890 nine-thirty forty-two one-tenth ten million 0.5 one forty-
    seven 1987 twenty '79 zero two 78-degrees eighty-four IX '60s .025
    fifteen 271,124 dozen quintillion DM2,000 ...
DT: determiner
    all an another any both del each either every half la many much nary
    neither no some such that the them these this those
EX: existential there
    there
FW: foreign word
    gemeinschaft hund ich jeux habeas Haementeria Herr K'ang-si vous
    lutihaw alai je jour objets salutaris fille quibusdam pas trop Monte
    terram fiche oui corporis ...
IN: preposition or

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\DELL 5590\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package tagsets to C:\Users\DELL
[nltk_data]     5590\AppData\Roaming\nltk_data...
[nltk_data]   Package tagsets is already up-to-date!


In [9]:
sent = 'I like that awesome movie, especially the great director.'
words = word_tokenize(sent)
tagged = pos_tag(words) 
print(tagged)

[('I', 'PRP'), ('like', 'VBP'), ('that', 'DT'), ('awesome', 'JJ'), ('movie', 'NN'), (',', ','), ('especially', 'RB'), ('the', 'DT'), ('great', 'JJ'), ('director', 'NN'), ('.', '.')]
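
Once a sentence is tagged, the tags can be used to pick out word classes. The sketch below reuses the tagged output printed above and relies only on the Penn Treebank convention that noun tags start with `NN` and adjective tags with `JJ`. The helper name `words_with_tag_prefix` is made up for this example.

```python
# Tagged tokens copied from the output of the previous cell
tagged = [('I', 'PRP'), ('like', 'VBP'), ('that', 'DT'), ('awesome', 'JJ'),
          ('movie', 'NN'), (',', ','), ('especially', 'RB'), ('the', 'DT'),
          ('great', 'JJ'), ('director', 'NN'), ('.', '.')]

def words_with_tag_prefix(tagged_tokens, prefix):
    """Return the words whose POS tag starts with the given prefix."""
    return [word for word, tag in tagged_tokens if tag.startswith(prefix)]

nouns = words_with_tag_prefix(tagged, 'NN')       # NN, NNS, NNP, NNPS
adjectives = words_with_tag_prefix(tagged, 'JJ')  # JJ, JJR, JJS
print(nouns)
print(adjectives)
```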


## Stemming and Lemmatization

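- **Stemming**: Removing the last few characters of a word (usually a suffix) to reduce it to a stem, which need not be a dictionary word.
- **Lemmatization**: Mapping inflected forms of the same word to a common dictionary form (the lemma), so that they can be analyzed as a single word.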
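
To make the difference concrete, here is a toy sketch in plain Python. It is not the Porter or Lancaster algorithm and not the WordNet lemmatizer: the suffix list `SUFFIXES` and the lookup table `LEMMAS` are hypothetical and exist only to show that stemming chops characters by rule while lemmatization maps a word to a known dictionary form.

```python
# Hypothetical, tiny rule set: real stemmers apply many ordered rules
SUFFIXES = ("ing", "ed", "ly", "es", "s")

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical lookup table standing in for a real lexical resource
LEMMAS = {"was": "be", "am": "be", "liked": "like", "liking": "like", "states": "state"}

def toy_lemmatize(word):
    """Return the dictionary form if known, otherwise the lowercased word."""
    return LEMMAS.get(word.lower(), word.lower())

for w in ["liked", "liking", "states", "was"]:
    print(w, ",", toy_stem(w), ",", toy_lemmatize(w))
```

The stemmer happily produces non-words such as `lik`, whereas the lookup returns real words but only for entries it knows about.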

In [10]:
# Stemming

porter = nltk.PorterStemmer()

lancaster = nltk.LancasterStemmer()

sent = 'I liked that awesome movie, especially the director was the best guy.'
words = word_tokenize(sent)
for w in words:
    print(w,',',porter.stem(w),',',lancaster.stem(w)) 

I , i , i
liked , like , lik
that , that , that
awesome , awesom , awesom
movie , movi , movy
, , , , ,
especially , especi , espec
the , the , the
director , director , direct
was , wa , was
the , the , the
best , best , best
guy , guy , guy
. , . , .


In [11]:
# Lemmatization
nltk.download('omw-1.4')
nltk.download('wordnet')
wnl = nltk.WordNetLemmatizer()

sent = 'I liked that awesome movie, especially the director now was moving to another states. I am liking the show as well.'
words = word_tokenize(sent)
for t in words:
    # lemmatize() treats every token as a noun (pos='n') unless told otherwise,
    # which is why 'was' and 'as' lose their final 's' in the output below
    print(t,',',wnl.lemmatize(t))

[nltk_data] Downloading package omw-1.4 to C:\Users\DELL
[nltk_data]     5590\AppData\Roaming\nltk_data...
[nltk_data] Downloading package wordnet to C:\Users\DELL
[nltk_data]     5590\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


I , I
liked , liked
that , that
awesome , awesome
movie , movie
, , ,
especially , especially
the , the
director , director
now , now
was , wa
moving , moving
to , to
another , another
states , state
. , .
I , I
am , am
liking , liking
the , the
show , show
as , a
well , well
. , .


## Word Sense Disambiguation

In [12]:
!pip install pywsd

Collecting pywsd
  Downloading pywsd-1.2.5-py3-none-any.whl.metadata (336 bytes)
Collecting wn==0.0.23 (from pywsd)
  Downloading wn-0.0.23.tar.gz (31.6 MB)

In [13]:
# Word Sense Disambiguation

from pywsd.lesk import simple_lesk

bank_sents = ['I went to the bank to deposit my money', 'The river bank was full of dead fishes']

#plant_sents = ['The workers at the industrial plant were overworked', 'The plant was no longer bearing flowers']

answer = simple_lesk(bank_sents[0],'bank')
print("Sense:", answer)
print("Definition:",answer.definition())

print('===')

answer = simple_lesk(bank_sents[1],'bank')
print("Sense:", answer)
print("Definition:",answer.definition())

Warming up PyWSD (takes ~10 secs)... 

Sense: Synset('depository_financial_institution.n.01')
Definition: a financial institution that accepts deposits and channels the money into lending activities
===
Sense: Synset('bank.n.01')
Definition: sloping land (especially the slope beside a body of water)


took 6.410693407058716 secs.
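
`simple_lesk` belongs to the Lesk family of algorithms, which pick the sense whose dictionary signature shares the most words with the context. The following is a much simplified sketch of that overlap idea, not pywsd's implementation. The two definitions are copied from the output above; the example phrases, the stopword set `STOP`, and the function names are assumptions added for illustration.

```python
import re

# Signatures: the definitions printed above plus one assumed example phrase each
SIGNATURES = {
    "depository_financial_institution.n.01": (
        "a financial institution that accepts deposits and channels "
        "the money into lending activities. "
        "he cashed a check at the bank"  # assumed example phrase
    ),
    "bank.n.01": (
        "sloping land (especially the slope beside a body of water). "
        "he sat on the bank of the river"  # assumed example phrase
    ),
}

# Hypothetical minimal stopword set so that function words do not count as overlap
STOP = {"a", "the", "of", "to", "i", "my", "was", "and", "that", "he", "on", "at", "into"}

def content_words(text, target):
    """Lowercased word set without stopwords and without the ambiguous word itself."""
    return {w for w in re.findall(r"[a-z]+", text.lower())
            if w not in STOP and w != target}

def toy_lesk(sentence, target):
    """Return the sense whose signature overlaps most with the sentence."""
    context = content_words(sentence, target)
    overlaps = {sense: len(context & content_words(signature, target))
                for sense, signature in SIGNATURES.items()}
    return max(overlaps, key=overlaps.get)

bank_sents = ['I went to the bank to deposit my money',
              'The river bank was full of dead fishes']
for s in bank_sents:
    print(s, '->', toy_lesk(s, 'bank'))
```

With signatures this small, a single shared word ("money" or "river") decides the outcome, which is why practical implementations enrich the signature with related words.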


## Named Entity Recognition (NER)

In [14]:
nltk.download('maxent_ne_chunker')
nltk.download('words')
nltk.download('tagsets')

[nltk_data] Downloading package maxent_ne_chunker to C:\Users\DELL
[nltk_data]     5590\AppData\Roaming\nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to C:\Users\DELL
[nltk_data]     5590\AppData\Roaming\nltk_data...
[nltk_data]   Package words is already up-to-date!
[nltk_data] Downloading package tagsets to C:\Users\DELL
[nltk_data]     5590\AppData\Roaming\nltk_data...
[nltk_data]   Package tagsets is already up-to-date!


True

In [15]:
#ex = 'European authorities fined Google a record $5.1 billion on Wednesday for abusing its power in the mobile phone market and ordered the company to alter its practices'
ex = 'What is the weather in New York and Chicago today?'
words = word_tokenize(ex)
tags = pos_tag(words)

ne_tree = nltk.ne_chunk(tags)

print(ne_tree)

(S
  What/WP
  is/VBZ
  the/DT
  weather/NN
  in/IN
  (GPE New/NNP York/NNP)
  and/CC
  (GPE Chicago/NNP)
  today/NN
  ?/.)


<a id='lab2'></a>
## Session 2: Document Clustering

The task here is to cluster scientific papers based on their titles, separating them into a few groups. The features we use for this document clustering task are the TF-IDF values of the top unique words across documents.

The clustering algorithm we use for this demonstration is KMeans.

The main objective of this task is to learn how to represent a document using TF-IDF.

### 1. Load the data

In [16]:
import pandas as pd

papers = pd.read_csv('papers.csv',header=0)

papers.head()

Unnamed: 0,id,title
0,1,Self-Organization of Associative Database and ...
1,10,A Mean Field Theory of Layer IV of Visual Cort...
2,100,Storing Covariance by the Associative Long-Ter...
3,1000,Bayesian Query Construction for Neural Network...
4,1001,Neural Network Ensembles Cross Validation an...


### 2. Clean the data

In [17]:
papers['processed_title'] = papers['title'].str.lower()

papers.head()

Unnamed: 0,id,title,processed_title
0,1,Self-Organization of Associative Database and ...,self-organization of associative database and ...
1,10,A Mean Field Theory of Layer IV of Visual Cort...,a mean field theory of layer iv of visual cort...
2,100,Storing Covariance by the Associative Long-Ter...,storing covariance by the associative long-ter...
3,1000,Bayesian Query Construction for Neural Network...,bayesian query construction for neural network...
4,1001,Neural Network Ensembles Cross Validation an...,neural network ensembles cross validation an...


In [18]:
titles = list(papers['processed_title'])
print(len(titles))
print(titles[:10])

999
['self-organization of associative database and its applications', 'a mean field theory of layer iv of visual cortex and its application to artificial neural networks', 'storing covariance by the associative long-term potentiation and depression of synaptic strengths in the hippocampus', 'bayesian query construction for neural network models', 'neural network ensembles  cross validation  and active learning', 'using a neural net to instantiate a deformable model', 'plasticity-mediated competitive learning', 'iceg morphology classification using an analogue vlsi neural network', 'real-time control of a tokamak plasma using neural networks', 'pulsestream synapses with non-volatile analogue amorphous-silicon memories']


### 3. Construct features using TF-IDF
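
Before handing the titles to scikit-learn, it may help to see the idea on a small scale. The sketch below computes a textbook variant of TF-IDF by hand for three of the titles printed above. It is an illustration under stated assumptions: whitespace tokenization, relative term frequency, and an unsmoothed `log(N / df)`. `TfidfVectorizer` by default uses a smoothed IDF and L2-normalizes each row, so its numbers will differ.

```python
import math
from collections import Counter

# Three titles taken from the list printed in the previous step
docs = [
    "bayesian query construction for neural network models",
    "using a neural net to instantiate a deformable model",
    "plasticity-mediated competitive learning",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: number of titles that contain each word
df = Counter(word for tokens in tokenized for word in set(tokens))

def tfidf(tokens):
    """Textbook TF-IDF: relative term frequency times log(N / df)."""
    counts = Counter(tokens)
    return {w: (c / len(tokens)) * math.log(n_docs / df[w])
            for w, c in counts.items()}

weights = [tfidf(tokens) for tokens in tokenized]
print(weights[0]["neural"], weights[0]["bayesian"])
```

In the first title, "neural" occurs in two of the three titles and so receives a lower weight than "bayesian", which occurs in only one.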

In [19]:
from sklearn.feature_extraction.text import TfidfVectorizer

import numpy as np 

tfidf_vectorizer = TfidfVectorizer(max_features=500, stop_words='english',use_idf=True, ngram_range=(1,1))
tfidf_matrix = tfidf_vectorizer.fit_transform(titles).toarray()
print(tfidf_matrix.shape)

print(tfidf_vectorizer.get_feature_names_out())

print(tfidf_matrix[2])

(999, 500)
['account' 'acquisition' 'active' 'activity' 'adaptation' 'adaptive'
 'algorithm' 'algorithms' 'allocation' 'alternative' 'analog' 'analysis'
 'analytic' 'analytical' 'annealing' 'application' 'applications'
 'applied' 'approach' 'approaches' 'approximate' 'approximating'
 'approximation' 'approximations' 'arbitrary' 'architecture' 'artificial'
 'associative' 'attention' 'attentional' 'attractor' 'attractors' 'audio'
 'auditory' 'automatic' 'autonomous' 'average' 'backpropagation' 'based'
 'basis' 'batch' 'bayes' 'bayesian' 'behavior' 'belief' 'better' 'bias'
 'bifurcation' 'binary' 'blind' 'boltzmann' 'boosting' 'bound' 'bounds'
 'brain' 'capacity' 'carlo' 'case' 'cell' 'cells' 'center' 'channel'
 'chip' 'choice' 'circuit' 'circuits' 'class' 'classes' 'classification'
 'classifier' 'classifiers' 'clustering' 'cmos' 'code' 'codes' 'coding'
 'color' 'combined' 'combining' 'committee' 'comparison' 'competition'
 'competitive' 'complex' 'complexity' 'component' 'components'
 'c

### 4. Run clustering and evaluate the model

In [20]:
from sklearn.cluster import KMeans
from sklearn import metrics

k = 30
# set n_init explicitly; 10 matches the default used in the original run
km = KMeans(n_clusters=k, n_init=10, random_state=42)
km.fit(tfidf_matrix)

# evaluation
print(metrics.silhouette_score(tfidf_matrix, km.labels_, sample_size=100, random_state=42))




0.040834639741475914
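
The silhouette score ranges from -1 to 1. Values near 1 indicate compact, well-separated clusters, while values near 0 (as in the run above) indicate overlapping clusters. A minimal pure-Python sketch of the definition on made-up 1-D points follows; the data and the function name `silhouette` are assumptions, it expects at least two clusters, and scikit-learn's `silhouette_score` remains the tool to use in practice.

```python
def silhouette(points, labels):
    """Mean silhouette coefficient for 1-D points using absolute distance."""
    scores = []
    for i, (p, lab) in enumerate(zip(points, labels)):
        # a: mean distance to the other members of the point's own cluster
        same = [abs(p - q) for j, (q, l) in enumerate(zip(points, labels))
                if l == lab and j != i]
        if not same:  # singleton cluster: score defined as 0
            scores.append(0.0)
            continue
        a = sum(same) / len(same)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(abs(p - q) for q, l in zip(points, labels) if l == other)
            / sum(1 for l in labels if l == other)
            for other in set(labels) if other != lab
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

points = [0.0, 0.2, 0.4, 5.0, 5.2]            # made-up data with two obvious groups
good = silhouette(points, [0, 0, 0, 1, 1])    # labels follow the groups
mixed = silhouette(points, [0, 1, 0, 1, 0])   # labels ignore the groups
print(good, mixed)
```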


In [21]:
print(km.labels_.tolist())

[28, 4, 28, 12, 12, 6, 21, 10, 4, 29, 5, 15, 4, 2, 5, 5, 10, 24, 23, 12, 4, 19, 5, 8, 25, 8, 25, 4, 21, 9, 0, 29, 25, 17, 27, 12, 19, 21, 24, 4, 5, 8, 6, 4, 17, 25, 19, 21, 19, 5, 14, 5, 25, 22, 25, 3, 4, 19, 19, 4, 5, 19, 2, 19, 4, 19, 19, 20, 1, 1, 19, 6, 19, 15, 13, 4, 19, 12, 5, 12, 6, 8, 4, 5, 24, 19, 5, 16, 1, 24, 4, 26, 6, 16, 16, 6, 12, 5, 4, 23, 28, 10, 4, 26, 15, 21, 21, 14, 15, 6, 8, 4, 6, 4, 19, 4, 18, 19, 28, 6, 19, 19, 5, 4, 4, 8, 7, 19, 2, 19, 23, 19, 4, 19, 23, 6, 19, 29, 23, 22, 4, 5, 24, 19, 19, 2, 10, 23, 5, 6, 25, 19, 14, 12, 19, 5, 5, 8, 19, 3, 14, 18, 4, 4, 25, 4, 5, 4, 4, 2, 1, 4, 4, 29, 19, 12, 19, 4, 10, 4, 4, 5, 19, 8, 13, 24, 21, 12, 20, 19, 6, 9, 12, 8, 4, 18, 11, 5, 10, 19, 19, 5, 4, 5, 6, 7, 24, 2, 12, 11, 13, 19, 25, 19, 25, 4, 4, 2, 12, 6, 8, 24, 12, 16, 16, 20, 6, 22, 19, 8, 19, 19, 19, 12, 8, 22, 2, 21, 5, 3, 2, 21, 11, 19, 5, 19, 9, 5, 19, 26, 6, 14, 5, 5, 17, 19, 27, 6, 2, 19, 0, 7, 4, 19, 28, 10, 16, 4, 24, 12, 19, 19, 22, 12, 12, 15, 1, 7, 15, 24, 