# Introduction to Natural Language Processing

## Session 1:  Basic Text Analysis
 

For the demonstrations in this module, we shall be using the NLTK library. **NLTK** is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, and wrappers for industrial-strength NLP libraries (source: https://www.nltk.org/).

In [None]:
##Install the NLTK library
!pip install nltk



Let's start with basic NLP operations. These are usually applied as text preprocessing steps that clean and normalize the data so that subsequent tasks perform better, such as:

- Stopword removal
- Word/sentence tokenization
- Part-of-speech (POS) tagging
- Stemming
- Lemmatization
- Bag of words
- TF-IDF
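
Of the operations listed above, bag of words is the easiest to sketch without any library: each document becomes a vector of word counts over a shared vocabulary, and word order is discarded. The snippet below is a minimal illustration using only the standard library; the two toy documents are made up for the example and are not part of the course data.

```python
from collections import Counter

# Two toy documents (illustrative, not from the course data)
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

# Tokenize naively on whitespace and count terms per document
bags = [Counter(doc.split()) for doc in docs]

# Shared vocabulary, sorted so every document uses the same column order
vocab = sorted(set(word for bag in bags for word in bag))

# One count vector per document, aligned with the vocabulary
vectors = [[bag[word] for word in vocab] for bag in bags]

print(vocab)
for vec in vectors:
    print(vec)
```

Each position in a vector corresponds to one vocabulary word, so two documents can be compared column by column even though their original word order is lost.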

## Stopword Removal and Tokenization

- **Stopword Removal**: Many frequently occurring words that are not important for understanding semantics, also called *stopwords*, can be removed.
- **Tokenization**: Splitting text into smaller elements (characters, words, sentences, paragraphs).
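
To make the two definitions concrete, here is a rough, library-free sketch of both steps. The regular expression and the six-word stop word set are simplifications chosen for illustration (assumptions, not what NLTK does internally); NLTK's tokenizer and stop word lists, used in the cells that follow, are considerably more thorough.

```python
import re

# A tiny hand-picked stop word set (hypothetical; NLTK's English list is much longer)
STOPWORDS = {"the", "is", "a", "and", "this", "of"}

def simple_word_tokenize(text):
    """Lowercase the text and pull out runs of letters, digits, and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())

def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Keep only the tokens that are not in the stop word set."""
    return [tok for tok in tokens if tok not in stopwords]

text = "This is the first sentence, and this is the second sentence."
tokens = simple_word_tokenize(text)
content_words = remove_stopwords(tokens)
print(tokens)
print(content_words)
```

Note that this regex drops punctuation entirely, whereas `word_tokenize` keeps punctuation marks as separate tokens.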

In [None]:
## We need to download the stopwords
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

Next, we shall be downloading the *Punkt* sentence tokenizer. Read more: https://www.nltk.org/_modules/nltk/tokenize/punkt.html

In [None]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize

nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [None]:
## We have stopwords in multiple languages
stops_en = stopwords.words('english')
stops_ge = stopwords.words('german')
print(stops_en)

In [None]:
# customize your stop word list by adding words
stops_en.append('airline')
print(stops_en)

In [None]:
# sentence / word tokenization
cnn = 'The Cable News Network is a multinational news-based pay television channel headquartered in Atlanta, Georgia. It is owned by CNN Global, which is part of Warner Bros, Discovery. It was founded in 1980 by American media proprietor Ted Turner and Reese Schonfeld as a 24-hour cable news channel.'

#word_tokenize(cnn)
sent_tokenize(cnn)

In [None]:
# a combination of tokenization and stopword removal
sent = 'This is the first sentence, and this is the second sentence.'
words = word_tokenize(sent.lower())

for word in words:
  # skip short tokens (three characters or fewer), which also drops punctuation
  if len(word) <= 3:
    continue
  if word in stops_en:
    print(word,': a stop word.')
  else:
    print(word)

# Optional: frequency distribution of the tokens (uncomment to run)
# text = nltk.Text(words)
# dist = nltk.FreqDist(text)
# print(dist)
# print(dist.most_common(3))

## POS tagging

- **Part-of-speech (POS) tagging**: Identifying which words are nouns, verbs, adjectives, etc.
- It assigns each word in a text (corpus) to a particular part of speech, based on both the definition of the word and its context.
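
As a point of comparison for why context matters, the sketch below builds the simplest possible tagger: it assigns each word the tag it carried most often in a small training set and ignores context entirely. The five training sentences are hand-labelled for this example (hypothetical data in Penn Treebank style), and this is not how NLTK's `pos_tag` works.

```python
from collections import Counter, defaultdict

# Tiny hand-labelled training set using Penn Treebank style tags (hypothetical data)
tagged_sentences = [
    [("the", "DT"), ("movie", "NN"), ("was", "VBD"), ("great", "JJ")],
    [("i", "PRP"), ("like", "VBP"), ("the", "DT"), ("director", "NN")],
    [("they", "PRP"), ("run", "VBP"), ("fast", "RB")],
    [("a", "DT"), ("long", "JJ"), ("run", "NN")],
    [("the", "DT"), ("morning", "NN"), ("run", "NN")],
]

# Count how often each word carries each tag
tag_counts = defaultdict(Counter)
for sentence in tagged_sentences:
    for word, tag in sentence:
        tag_counts[word][tag] += 1

def baseline_tag(words, default="NN"):
    """Assign each word its most frequent training tag; unseen words get `default`."""
    return [
        (w, tag_counts[w.lower()].most_common(1)[0][0] if w.lower() in tag_counts else default)
        for w in words
    ]

print(baseline_tag(["I", "like", "the", "great", "movie"]))
print(baseline_tag(["they", "run"]))
```

In the second call, "run" is tagged as a noun because that is its most frequent tag in the training data, even though it is a verb in "they run". A context-aware tagger exists precisely to resolve cases like this.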

In [None]:
# POS tagging
from nltk import pos_tag 
nltk.download('averaged_perceptron_tagger')
nltk.download('tagsets')
nltk.help.upenn_tagset()

In [None]:
sent = 'I like that awesome movie, especially the great director.'
words = word_tokenize(sent)
tagged = pos_tag(words) 
print(tagged)

## Stemming and Lemmatization

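- **Stemming** - Removing ("stemming") the last few characters of a word, typically a suffix, to reduce it to a root form. The result need not be a dictionary word.
- **Lemmatization** - Mapping inflected forms of the "same" word to a single dictionary form (the *lemma*), so that they are analyzed as one word.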
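
The difference between the two can be sketched with a toy example: a stemmer applies string rules, while a lemmatizer looks words up. Both the suffix list and the lemma dictionary below are hypothetical and heavily simplified; the Porter and Lancaster stemmers use far richer rule sets, and NLTK's lemmatizer relies on WordNet rather than a hand-written table.

```python
# Suffixes tried in order, longest first (a hypothetical, heavily simplified rule set)
SUFFIXES = ["ing", "ed", "es", "s"]

# Hand-written lemma dictionary (hypothetical; WordNet plays this role in NLTK)
LEMMAS = {"was": "be", "am": "be", "better": "good", "movies": "movie", "liked": "like"}

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters of stem."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word):
    """Look the word up in the dictionary; fall back to the word itself."""
    return LEMMAS.get(word, word)

for w in ["liked", "movies", "moving", "was", "better"]:
    print(w, "->", toy_stem(w), "|", toy_lemmatize(w))
```

The stemmer produces non-words such as "lik" and "movi", and it cannot relate "was" to "be". The dictionary lookup handles irregular forms but only for words it already knows.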

In [None]:
# Stemming

porter = nltk.PorterStemmer()

lancaster = nltk.LancasterStemmer()

sent = 'I liked that awesome movie, especially the director was the best guy.'
words = word_tokenize(sent)
for w in words:
    print(w,',',porter.stem(w),',',lancaster.stem(w)) 

In [None]:
# Lemmatization

nltk.download('wordnet')
wnl = nltk.WordNetLemmatizer()

sent = 'I liked that awesome movie, especially the director now was moving to another state. I am liking the show as well.'
words = word_tokenize(sent)
for t in words:
    # lemmatize() treats every word as a noun unless a pos argument is given
    print(t,',',wnl.lemmatize(t))

## Word Sense Disambiguation

In [None]:
from nltk.wsd import lesk

# Alternative example containing both senses of "bank":
# sent = 'I went to the bank to deposit money and then went to the river to swim in the bank.'
sent = 'the river overflowed the bank'
words = word_tokenize(sent)
print(lesk(words,'bank'))

#print(words)
# disambiguate every word, using the whole sentence as context
for w in words:
  print(lesk(words,w))
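
The idea behind Lesk-style disambiguation is to pick the sense whose dictionary definition shares the most words with the surrounding context. The following is a minimal sketch of that idea under simplifying assumptions: the two-sense inventory and its glosses are invented for this example, whereas NLTK's `lesk` draws candidate senses and definitions from WordNet.

```python
# Hypothetical mini sense inventory for "bank" (real glosses come from WordNet)
SENSES = {
    "bank.financial": "an institution that accepts deposits of money and lends money",
    "bank.river": "sloping land beside a body of water such as a river",
}

# Words too common to count as evidence of overlap
IGNORE = {"a", "an", "the", "of", "and", "that", "as", "such", "to", "in"}

def simplified_lesk(context_words, senses=SENSES):
    """Return the sense whose gloss shares the most content words with the context."""
    context = {w.lower() for w in context_words} - IGNORE
    best_sense, best_overlap = None, -1
    for sense, gloss in senses.items():
        overlap = len(context & (set(gloss.split()) - IGNORE))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

print(simplified_lesk("the river overflowed the bank".split()))
print(simplified_lesk("I went to the bank to deposit money".split()))
```

Because the method depends on exact word overlap, short contexts or short glosses can easily lead to a poor choice of sense.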

In [None]:
from nltk.corpus import wordnet as wn

for ss in wn.synsets('business'):
    print(ss, ss.definition())

## Named Entity Recognition (NER)

In [None]:
nltk.download('maxent_ne_chunker')
nltk.download('words')
nltk.download('tagsets')

[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping chunkers/maxent_ne_chunker.zip.
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Unzipping corpora/words.zip.
[nltk_data] Downloading package tagsets to /root/nltk_data...
[nltk_data]   Unzipping help/tagsets.zip.


True

In [None]:

#ex = 'European authorities fined Google a record $5.1 billion on Wednesday for abusing its power in the mobile phone market and ordered the company to alter its practices'
ex = 'What is the weather in New York and Chicago today?'
words = word_tokenize(ex)
tags = pos_tag(words)

ne_tree = nltk.ne_chunk(tags)

print(ne_tree)
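
For intuition about what a chunker does with POS tags, the sketch below groups runs of consecutive proper-noun tags into candidate entities. This is only a crude heuristic, not the statistical model behind `ne_chunk`, and it does not assign entity types such as person or location. The tags are hand-written for illustration and are not actual `pos_tag` output.

```python
# Hand-written (word, tag) pairs in Penn Treebank style; not actual pos_tag output
tagged = [
    ("What", "WP"), ("is", "VBZ"), ("the", "DT"), ("weather", "NN"),
    ("in", "IN"), ("New", "NNP"), ("York", "NNP"), ("and", "CC"),
    ("Chicago", "NNP"), ("today", "NN"), ("?", "."),
]

def proper_noun_chunks(tagged_tokens):
    """Group maximal runs of consecutive NNP/NNPS tokens into candidate entities."""
    chunks, current = [], []
    for word, tag in tagged_tokens:
        if tag in ("NNP", "NNPS"):
            current.append(word)
        elif current:
            chunks.append(" ".join(current))
            current = []
    if current:  # flush a run that ends the sentence
        chunks.append(" ".join(current))
    return chunks

print(proper_noun_chunks(tagged))
```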

<a id='lab2'></a>
## Session 2: Document Clustering

Problem Statement - To be Added

### 1. Load the data from Google Drive

In [None]:
from google.colab import drive

drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
import pandas as pd

papers = pd.read_csv('/content/drive/MyDrive/papers (1).csv') # change this path to wherever you stored the papers CSV file

# View the first 5 rows of papers
papers.head()

Unnamed: 0,id,title
0,1,Self-Organization of Associative Database and ...
1,10,A Mean Field Theory of Layer IV of Visual Cort...
2,100,Storing Covariance by the Associative Long-Ter...
3,1000,Bayesian Query Construction for Neural Network...
4,1001,Neural Network Ensembles Cross Validation an...


### 2. Clean the data

In [None]:
# Convert the titles to lowercase
papers['title_text_processed'] = papers['title'].map(lambda x: x.lower())

titles = list(papers['title_text_processed'])
print(len(titles))

# Print out the first 10 titles
titles[:10]

999


['self-organization of associative database and its applications',
 'a mean field theory of layer iv of visual cortex and its application to artificial neural networks',
 'storing covariance by the associative long-term potentiation and depression of synaptic strengths in the hippocampus',
 'bayesian query construction for neural network models',
 'neural network ensembles  cross validation  and active learning',
 'using a neural net to instantiate a deformable model',
 'plasticity-mediated competitive learning',
 'iceg morphology classification using an analogue vlsi neural network',
 'real-time control of a tokamak plasma using neural networks',
 'pulsestream synapses with non-volatile analogue amorphous-silicon memories']

### 3. Construct features using TF-IDF

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

import numpy as np 

# Specify vectorizer parameters
tfidf_vectorizer = TfidfVectorizer(max_features=500, stop_words='english', use_idf=True, ngram_range=(1,2))

tfidf_matrix = tfidf_vectorizer.fit_transform(titles).toarray()

print(tfidf_vectorizer.get_feature_names_out())
print(tfidf_matrix[2])

print(tfidf_matrix.shape)

['acquisition' 'active' 'active learning' 'activity' 'adaptation'
 'adaptive' 'algorithm' 'algorithms' 'analog' 'analog vlsi' 'analysis'
 'analytical' 'annealing' 'application' 'applications' 'applied'
 'approach' 'approaches' 'approximate' 'approximating' 'approximation'
 'approximations' 'architecture' 'artificial' 'artificial neural'
 'associative' 'attention' 'attentional' 'audio' 'auditory'
 'backpropagation' 'based' 'based reinforcement' 'basis' 'basis function'
 'batch' 'bayes' 'bayesian' 'bayesian learning' 'bayesian model'
 'bayesian networks' 'belief' 'belief networks' 'bifurcation' 'binary'
 'blind' 'blind separation' 'boltzmann' 'boltzmann machine' 'boosting'
 'bounds' 'brain' 'capacity' 'carlo' 'case' 'cell' 'cells' 'center'
 'center surround' 'channel' 'chip' 'choice' 'circuit' 'circuits' 'class'
 'classes' 'classification' 'classifier' 'classifiers' 'clustering' 'code'
 'codes' 'coding' 'color' 'combined' 'combining' 'committee' 'comparison'
 'competition' 'competitive' 
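
To show what a TF-IDF weight is, here is a hand-rolled sketch on three toy titles using only the standard library. It uses raw term counts, a smoothed IDF of the form ln((1 + n) / (1 + df)) + 1, and scaling to unit length. As far as I am aware this mirrors scikit-learn's default settings, but treat that correspondence as an assumption; the toy titles are also made up and no stop word removal or n-grams are applied.

```python
import math
from collections import Counter

# Three toy titles (illustrative only)
docs = [
    "neural network models",
    "neural network ensembles",
    "bayesian query construction",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: number of documents containing each term
df = Counter(term for tokens in tokenized for term in set(tokens))

def idf(term):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + df[term])) + 1

def tfidf_vector(tokens):
    """Raw term counts times IDF, scaled to unit Euclidean length."""
    counts = Counter(tokens)
    weights = {term: count * idf(term) for term, count in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

for tokens in tokenized:
    print({term: round(w, 3) for term, w in tfidf_vector(tokens).items()})
```

Terms that occur in fewer documents, such as "models", receive a higher weight than terms shared across documents, such as "neural".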

### 4. Run clustering and evaluate the model
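
The evaluation in this step relies on the silhouette coefficient. For a single point it is s = (b - a) / max(a, b), where a is the mean distance to the other points in its own cluster and b is the mean distance to the points of the nearest other cluster. The sketch below computes it by hand on four made-up 2-D points with hand-assigned labels; it illustrates the definition and is not a replacement for `metrics.silhouette_score`.

```python
import math

# Toy 2-D points with hand-assigned cluster labels (hypothetical data)
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
labels = [0, 0, 1, 1]

def mean_distance(i, members):
    """Average Euclidean distance from point i to the points indexed by `members`."""
    return sum(math.dist(points[i], points[j]) for j in members) / len(members)

def silhouette(i):
    """s(i) = (b - a) / max(a, b); a = own-cluster cohesion, b = nearest other cluster."""
    own = [j for j, lab in enumerate(labels) if lab == labels[i] and j != i]
    if not own:  # singleton clusters score 0 by convention
        return 0.0
    a = mean_distance(i, own)
    b = min(
        mean_distance(i, [j for j, lab in enumerate(labels) if lab == other])
        for other in set(labels) if other != labels[i]
    )
    return (b - a) / max(a, b)

scores = [silhouette(i) for i in range(len(points))]
print(sum(scores) / len(scores))
```

Scores range from -1 to 1. Values close to 1 indicate compact, well-separated clusters, while values near 0 indicate clusters that overlap.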

In [None]:
from sklearn.cluster import KMeans
from sklearn import metrics

k = 5

km = KMeans(n_clusters = k, random_state=42)
km.fit(tfidf_matrix)

# evaluation
print("Silhouette Coefficient: %0.3f" % metrics.silhouette_score(tfidf_matrix, km.labels_, sample_size=100, random_state=2021))

clusters = km.labels_.tolist()
print(clusters)
#for i in range(len(tfidf_matrix)):
#  print('paper id:',i+1,', cluster id:',clusters[i])

Silhouette Coefficient: 0.034
[0, 2, 0, 2, 2, 3, 1, 2, 2, 0, 1, 0, 2, 0, 1, 1, 0, 3, 0, 0, 0, 0, 4, 3, 0, 3, 0, 2, 0, 0, 3, 3, 0, 0, 1, 0, 0, 0, 3, 2, 1, 3, 0, 2, 0, 0, 0, 0, 0, 4, 2, 1, 0, 0, 0, 0, 2, 0, 0, 2, 1, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 2, 0, 2, 1, 2, 0, 3, 2, 4, 0, 0, 1, 2, 0, 0, 2, 0, 1, 0, 0, 2, 0, 1, 2, 0, 1, 0, 2, 0, 3, 0, 0, 4, 1, 0, 0, 2, 2, 0, 0, 2, 0, 0, 0, 0, 0, 1, 4, 2, 1, 3, 1, 0, 1, 0, 0, 0, 2, 0, 2, 0, 0, 0, 0, 0, 2, 1, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 2, 0, 1, 1, 3, 0, 0, 3, 0, 0, 1, 0, 2, 1, 1, 0, 0, 0, 2, 1, 3, 0, 2, 0, 0, 2, 2, 2, 1, 0, 3, 0, 0, 1, 2, 0, 0, 0, 0, 2, 0, 2, 0, 0, 1, 0, 0, 0, 1, 0, 4, 0, 3, 0, 1, 2, 0, 0, 0, 2, 0, 1, 0, 0, 0, 2, 2, 3, 0, 2, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 2, 3, 0, 0, 0, 4, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 4, 1, 1, 0, 0, 0, 1, 0, 2, 3, 2, 0, 0, 0, 1, 2, 3, 2, 0, 1, 0, 2, 2, 0, 2, 3, 1, 3, 1, 0, 0, 2, 1, 0, 0, 0, 0, 0, 3, 1, 0, 2, 1, 0, 0, 1, 3, 4, 1, 2, 2, 3, 4, 0, 0, 0, 2, 0, 3, 1, 0, 0, 0, 0, 0, 2, 0, 0, 0, 1, 0, 