# Introduction to Natural Language Processing

## Session 1: Basic Text Analysis
 

For the demonstrations in this module, we use the NLTK library. **NLTK** is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, as well as wrappers for industrial-strength NLP libraries (source: https://www.nltk.org/).

In [1]:
##Install the NLTK library
!pip install nltk

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/




Let's start with basic NLP operations. They are typically used to preprocess text and improve data quality for downstream tasks:

- Stopword removal
- Word/sentence tokenization
- Part-of-speech (POS) tagging
- Stemming
- Lemmatization
- Bag of words
- TF-IDF
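As a quick illustration of the bag-of-words item in the list above, here is a minimal sketch using only the standard library. The two toy documents and the helper name `bag_of_words` are assumptions made for this example; real pipelines usually tokenize and remove stopwords first.

```python
from collections import Counter

# Two toy documents (hypothetical, for illustration only).
docs = ["the movie was great", "the director was great and the cast was great"]

# Vocabulary: every distinct word across the documents, in sorted order.
vocab = sorted({word for doc in docs for word in doc.split()})

def bag_of_words(doc):
    """Return one count per vocabulary word; word order in the document is ignored."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

vectors = [bag_of_words(doc) for doc in docs]
print(vocab)
for vector in vectors:
    print(vector)
```

Each document becomes a fixed-length vector of word counts, which is the representation that TF-IDF later reweights.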

## Stopword Removal and Tokenization

- **Stopword Removal**: Very frequent words that carry little meaning on their own, called *stopwords* (for example "the", "is", "and"), can be removed.
- **Tokenization**: Splitting text into smaller elements (characters, words, sentences, paragraphs).

In [2]:
## We need to download the stopwords
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

Next, we download the *Punkt* sentence tokenizer, which `sent_tokenize` and `word_tokenize` rely on. Read more: https://www.nltk.org/_modules/nltk/tokenize/punkt.html

In [3]:
from nltk.corpus import stopwords

from nltk.tokenize import word_tokenize, sent_tokenize
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [4]:
## We have stopwords in multiple languages
stops_en = stopwords.words('english')
stops_ge = stopwords.words('german')
print(stops_en)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [5]:
# customize your stop word list by adding words
stops_en.append('airline')
print(stops_en)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [None]:
# sentence / word tokenization
cnn = 'The Cable News Network is a multinational news-based pay television channel headquartered in Atlanta, Georgia. It is owned by CNN Global, which is part of Warner Bros. Discovery. It was founded in 1980 by American media proprietor Ted Turner and Reese Schonfeld as a 24-hour cable news channel.'

word_tokenize(cnn)
#sent_tokenize(cnn)

In [7]:
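# a combination of tokenization and stopword removal
sent = 'This is the first sentence, and this is the second sentence.'
words = word_tokenize(sent.lower())

for word in words:
  # skip punctuation and very short tokens (3 characters or fewer) without printing them
  if len(word) <= 3:
    continue
  # flag the remaining tokens that appear in the stopword list
  if word in stops_en:
    print(word,': a stop word.')
  else:
    print(word)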


this : a stop word.
first
sentence
this : a stop word.
second
sentence
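The loop above only prints its findings. In practice you usually want the filtered tokens back as a list. The sketch below shows that pattern with the standard library only; the small `stop_set` and the regex tokenizer are stand-ins assumed for this example, not NLTK's stopword list or `word_tokenize`.

```python
import re

# Small hand-picked stopword set (hypothetical; NLTK's English list is much longer).
stop_set = {"this", "is", "the", "and", "a", "of"}

def remove_stopwords(text):
    """Lowercase the text, split it into alphabetic tokens, and drop stopwords."""
    # Runs of letters serve as tokens here, a rough stand-in for word_tokenize.
    tokens = re.findall(r"[a-z]+", text.lower())
    return [tok for tok in tokens if tok not in stop_set]

sent = 'This is the first sentence, and this is the second sentence.'
print(remove_stopwords(sent))
```

Storing the stopwords in a `set` rather than a list keeps each membership check fast, which matters once the text grows beyond a few sentences.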


## POS Tagging

- **Part-of-speech (POS) tagging**: Identifying which words are nouns, verbs, adjectives, etc.
- Each word in a text (corpus) is assigned a part-of-speech category based on both its definition and its context.

In [None]:
# POS tagging
from nltk import pos_tag 
nltk.download('averaged_perceptron_tagger')
nltk.download('tagsets')
nltk.help.upenn_tagset()

In [9]:
sent = 'I like that awesome movie, especially the great director.'
words = word_tokenize(sent)
tagged = pos_tag(words) 
print(tagged)

[('I', 'PRP'), ('like', 'VBP'), ('that', 'DT'), ('awesome', 'JJ'), ('movie', 'NN'), (',', ','), ('especially', 'RB'), ('the', 'DT'), ('great', 'JJ'), ('director', 'NN'), ('.', '.')]


## Stemming and Lemmatization

- **Stemming** - Removing ("stemming") the last few characters of a word to reduce it to a common stem, which need not be a real word.
- **Lemmatization** - Mapping inflected forms of the "same" word to a single base form (lemma) so they can be analyzed as one word.
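To make the idea of stemming concrete, here is a deliberately naive suffix stripper. It is only a sketch: the suffix list and the minimum stem length are assumptions chosen for this example, and real stemmers such as Porter apply a much larger, ordered set of rules.

```python
# Hypothetical suffix list, far smaller than the rule set of a real stemmer.
SUFFIXES = ("ing", "ed", "ly", "s")

def naive_stem(word):
    """Strip the first matching suffix, provided at least three characters remain."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["liked", "moving", "states", "especially", "was", "movie"]:
    print(w, ',', naive_stem(w))
```

Because the rules only look at characters, the result can be a non-word such as `lik` or `mov`. Lemmatization avoids this by looking words up in a dictionary instead.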

In [10]:
# Stemming

porter = nltk.PorterStemmer()

lancaster = nltk.LancasterStemmer()

sent = 'I liked that awesome movie, especially the director was the best guy.'
words = word_tokenize(sent)
for w in words:
    print(w,',',porter.stem(w),',',lancaster.stem(w)) 

I , i , i
liked , like , lik
that , that , that
awesome , awesom , awesom
movie , movi , movy
, , , , ,
especially , especi , espec
the , the , the
director , director , direct
was , wa , was
the , the , the
best , best , best
guy , guy , guy
. , . , .


In [11]:
# Lemmatization
# download the WordNet data before creating the lemmatizer
nltk.download('omw-1.4')
nltk.download('wordnet')
wnl = nltk.WordNetLemmatizer()

sent = 'I liked that awesome movie, especially the director now was moving to another states. I am liking the show as well.'
words = word_tokenize(sent)
for t in words:
    # lemmatize() treats every token as a noun unless a pos argument is given
    print(t,',',wnl.lemmatize(t))

[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data] Downloading package wordnet to /root/nltk_data...


I , I
liked , liked
that , that
awesome , awesome
movie , movie
, , ,
especially , especially
the , the
director , director
now , now
was , wa
moving , moving
to , to
another , another
states , state
. , .
I , I
am , am
liking , liking
the , the
show , show
as , a
well , well
. , .
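In the output above, `was` becomes `wa` and `as` becomes `a`, while `liked` and `moving` are left unchanged. This happens because `lemmatize` assumes a noun when no part of speech is passed. One common remedy is to feed it the POS tags from `pos_tag`. The helper below is a minimal sketch of the tag conversion; the name `penn_to_wordnet` is an assumption for this example, not part of NLTK.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag (e.g. 'VBD') to a WordNet POS letter."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"      # everything else falls back to noun, the lemmatizer's default

# Hypothetical usage with the objects defined in the cells above:
#   for token, tag in pos_tag(words):
#       print(token, ',', wnl.lemmatize(token, pos=penn_to_wordnet(tag)))

for tag in ["VBD", "JJ", "RB", "NNS", "DT"]:
    print(tag, '->', penn_to_wordnet(tag))
```

With a verb tag, a call such as `wnl.lemmatize('was', pos='v')` should return `be` instead of `wa`.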


## Word Sense Disambiguation

In [12]:
!pip install pywsd

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pywsd
  Downloading pywsd-1.2.5-py3-none-any.whl (26.9 MB)
Collecting wn==0.0.23
  Downloading wn-0.0.23.tar.gz (31.6 MB)
Building wheels for collected packages: wn
  Building wheel for wn (setup.py) ... done
  Created wheel for wn: filename=wn-0.0.23-py3-none-any.whl size=31792926 sha256=4996c9f66dc825e98015cd390274ea3d0627482aa500cb2b978fc76ede7c70e1
  Stored in directory: /root/.cache/pip/wheels/ec/47/17/409766c99dd470f34c512000b90b83f34747c2c975769654d7
Successfully built wn
Installing collected packages: wn, pywsd
Successfully installed pywsd-1.2.5 wn-0.0.23


In [13]:
# Word Sense Disambiguation

from pywsd.lesk import simple_lesk

bank_sents = ['I went to the bank to deposit my money', 'The river bank was full of dead fishes']

#plant_sents = ['The workers at the industrial plant were overworked', 'The plant was no longer bearing flowers']

answer = simple_lesk(bank_sents[0],'bank')
print("Sense:", answer)
print("Definition:",answer.definition())

print('===')

answer = simple_lesk(bank_sents[1],'bank')
print("Sense:", answer)
print("Definition:",answer.definition())

Warming up PyWSD (takes ~10 secs)... 

Sense: Synset('depository_financial_institution.n.01')
Definition: a financial institution that accepts deposits and channels the money into lending activities
===
Sense: Synset('bank.n.01')
Definition: sloping land (especially the slope beside a body of water)


took 4.277987241744995 secs.
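The idea behind Lesk-style disambiguation is to pick the sense whose dictionary gloss shares the most words with the sentence. The sketch below is a toy version under stated assumptions: the two glosses are copied from the definitions printed above, while the stopword set, the helper names, and the second test sentence are made up for this example. It is not how `simple_lesk` is implemented.

```python
import re

# Glosses copied from the two WordNet definitions printed above.
SENSES = {
    "depository_financial_institution.n.01":
        "a financial institution that accepts deposits and channels the money into lending activities",
    "bank.n.01":
        "sloping land (especially the slope beside a body of water)",
}

# Small hypothetical stopword set so that function words do not count as overlap.
STOP = {"a", "the", "of", "to", "and", "that", "into", "on", "i", "my", "we"}

def content_words(text):
    """Lowercased alphabetic tokens of the text, minus stopwords, as a set."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP}

def toy_lesk(sentence):
    """Return the sense whose gloss shares the most content words with the sentence."""
    context = content_words(sentence)
    return max(SENSES, key=lambda sense: len(context & content_words(SENSES[sense])))

print(toy_lesk("I went to the bank to deposit my money"))
print(toy_lesk("We sat on the bank beside the water"))
```

The first sentence overlaps with the financial gloss on "money", and the second overlaps with the river-bank gloss on "beside" and "water". The river sentence used earlier shares no words with either bare gloss, so this toy version could not separate the senses there; `simple_lesk` presumably succeeds because it compares against a richer set of words than the definition alone.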


## Named Entity Recognition (NER)

In [16]:
nltk.download('maxent_ne_chunker')
nltk.download('words')
nltk.download('tagsets')

[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Package words is already up-to-date!
[nltk_data] Downloading package tagsets to /root/nltk_data...
[nltk_data]   Package tagsets is already up-to-date!


True

In [17]:
#ex = 'European authorities fined Google a record $5.1 billion on Wednesday for abusing its power in the mobile phone market and ordered the company to alter its practices'
ex = 'What is the weather in New York and Chicago today?'
words = word_tokenize(ex)
tags = pos_tag(words)

ne_tree = nltk.ne_chunk(tags)

print(ne_tree)

(S
  What/WP
  is/VBZ
  the/DT
  weather/NN
  in/IN
  (GPE New/NNP York/NNP)
  and/CC
  (GPE Chicago/NNP)
  today/NN
  ?/.)


<a id='lab2'></a>
## Session 2: Document Clustering

The task here is to cluster scientific papers by their titles, separating them into a few groups. The features for this document clustering task are the TF-IDF values of the top unique words across documents.

The clustering algorithm used in this demonstration is KMeans.

The main objective of this task is to learn how to represent a document using TF-IDF.
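To see what a TF-IDF representation contains, here is a small standard-library sketch. The three toy titles are made up for this example. The weighting uses the smoothed IDF formula and L2 normalization that scikit-learn documents as the defaults for `TfidfVectorizer`; stopword removal and the feature cap are left out for brevity.

```python
import math
from collections import Counter

# Three toy "titles" (hypothetical, for illustration only).
docs = [
    "neural network models",
    "neural network ensembles",
    "bayesian query construction",
]
tokenized = [doc.split() for doc in docs]
vocab = sorted({w for doc in tokenized for w in doc})
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term.
df = Counter(w for doc in tokenized for w in set(doc))

# Smoothed IDF: idf(t) = ln((1 + n) / (1 + df(t))) + 1
idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocab}

def tfidf_vector(tokens):
    """Term count times IDF for each vocabulary word, scaled to unit L2 length."""
    tf = Counter(tokens)
    raw = [tf[w] * idf[w] for w in vocab]
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm for x in raw]

matrix = [tfidf_vector(doc) for doc in tokenized]
for title, row in zip(docs, matrix):
    print(title, [round(x, 3) for x in row])
```

In the first title, "models" receives a higher weight than "neural" because it occurs in only one document, whereas "neural" occurs in two. Rare, distinctive words therefore dominate the document vector.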

### 1. Load the data

In [18]:
import pandas as pd

papers = pd.read_csv('papers.csv',header=0)

papers.head()

| | id | title |
|---|---|---|
| 0 | 1 | Self-Organization of Associative Database and ... |
| 1 | 10 | A Mean Field Theory of Layer IV of Visual Cort... |
| 2 | 100 | Storing Covariance by the Associative Long-Ter... |
| 3 | 1000 | Bayesian Query Construction for Neural Network... |
| 4 | 1001 | Neural Network Ensembles Cross Validation an... |


### 2. Clean the data

In [19]:
papers['processed_title'] = papers['title'].map(lambda x:x.lower())

papers.head()

| | id | title | processed_title |
|---|---|---|---|
| 0 | 1 | Self-Organization of Associative Database and ... | self-organization of associative database and ... |
| 1 | 10 | A Mean Field Theory of Layer IV of Visual Cort... | a mean field theory of layer iv of visual cort... |
| 2 | 100 | Storing Covariance by the Associative Long-Ter... | storing covariance by the associative long-ter... |
| 3 | 1000 | Bayesian Query Construction for Neural Network... | bayesian query construction for neural network... |
| 4 | 1001 | Neural Network Ensembles Cross Validation an... | neural network ensembles cross validation an... |


In [20]:
titles = list(papers['processed_title'])
print(len(titles))
print(titles[:10])

999
['self-organization of associative database and its applications', 'a mean field theory of layer iv of visual cortex and its application to artificial neural networks', 'storing covariance by the associative long-term potentiation and depression of synaptic strengths in the hippocampus', 'bayesian query construction for neural network models', 'neural network ensembles  cross validation  and active learning', 'using a neural net to instantiate a deformable model', 'plasticity-mediated competitive learning', 'iceg morphology classification using an analogue vlsi neural network', 'real-time control of a tokamak plasma using neural networks', 'pulsestream synapses with non-volatile analogue amorphous-silicon memories']


### 3. Construct features using TF-IDF

In [None]:
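from sklearn.feature_extraction.text import TfidfVectorizer

import numpy as np

# Keep the 500 most frequent unigrams, dropping English stopwords
tfidf_vectorizer = TfidfVectorizer(max_features=500, stop_words='english', use_idf=True, ngram_range=(1, 1))
# Dense array of shape (n_documents, n_features)
tfidf_matrix = tfidf_vectorizer.fit_transform(titles).toarray()
print(tfidf_matrix.shape)

# Vocabulary terms, in column order
print(tfidf_vectorizer.get_feature_names_out())

# TF-IDF vector of the third title
print(tfidf_matrix[2])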

### 4. Run clustering and evaluate the model

In [28]:
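from sklearn.cluster import KMeans
from sklearn import metrics

k = 30  # number of clusters
km = KMeans(n_clusters=k, random_state=42)
km.fit(tfidf_matrix)

# Evaluation: the silhouette score ranges from -1 to 1 (higher means better-separated clusters);
# it is estimated here on a random sample of 100 documents
print(metrics.silhouette_score(tfidf_matrix, km.labels_, sample_size=100, random_state=42))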


0.04505905414792544


In [29]:
print(km.labels_.tolist())

[19, 19, 3, 17, 17, 20, 2, 14, 6, 10, 2, 5, 6, 13, 9, 9, 14, 24, 19, 17, 27, 9, 2, 20, 18, 20, 11, 6, 0, 0, 20, 10, 28, 0, 22, 25, 5, 0, 15, 6, 2, 20, 22, 6, 1, 11, 24, 29, 0, 2, 1, 2, 15, 4, 18, 8, 6, 18, 0, 6, 16, 0, 13, 25, 6, 0, 0, 9, 25, 19, 1, 2, 9, 5, 22, 6, 18, 17, 2, 17, 28, 20, 6, 2, 15, 0, 2, 26, 0, 15, 20, 23, 2, 26, 26, 6, 17, 2, 6, 19, 2, 14, 6, 23, 5, 0, 0, 25, 5, 16, 25, 6, 25, 18, 0, 6, 0, 0, 19, 9, 0, 0, 2, 6, 27, 20, 12, 9, 13, 0, 19, 0, 6, 1, 19, 25, 28, 10, 19, 4, 6, 2, 15, 0, 11, 13, 25, 9, 2, 0, 21, 0, 0, 28, 0, 2, 7, 20, 0, 9, 20, 3, 27, 27, 24, 25, 2, 27, 27, 13, 15, 27, 25, 10, 0, 17, 0, 27, 14, 6, 6, 2, 7, 20, 22, 15, 2, 17, 5, 0, 10, 0, 17, 0, 6, 1, 0, 2, 14, 0, 0, 25, 27, 2, 21, 12, 15, 13, 17, 28, 22, 0, 11, 0, 21, 27, 27, 25, 17, 6, 20, 15, 17, 26, 26, 17, 25, 4, 24, 5, 0, 0, 24, 22, 28, 4, 13, 29, 2, 9, 13, 0, 28, 0, 2, 1, 9, 2, 0, 23, 21, 23, 2, 2, 29, 0, 8, 13, 13, 0, 9, 12, 6, 9, 0, 14, 2, 6, 15, 20, 0, 1, 4, 17, 17, 5, 6, 12, 5, 15, 2, 25, 17, 27, 2,