# **Introduction to Natural Language Processing (NLP)**

By the end of this workshop, you will understand:

*   the basic NLP pipeline
*   the different word embeddings and their pros and cons 
*   how to use custom stop words
*   how to implement NLP in Python 
*   how to use Google Colab (if you are not already familiar with it :) )



## **What is NLP?** 

- Subfield of Computer Science and Artificial Intelligence that focuses on interactions between computers and human (natural) languages 
- Application of machine learning (ML) and deep learning (DL) algorithms to text and speech datasets 
- Applications: speech recognition, machine translation, spam detection, autocomplete/next-word suggestion, chatbots, etc. 


## **NLP Pipeline**

Here, the NLP pipeline refers to the pre-processing steps that should be applied to the text data before moving on to the machine learning part of the model. 

For example, suppose the **objective** of a project is to identify e-mails as spam or non-spam. 

1.   Identification of *type of ML problem*: Classification (using text data)
2.   Candidate ML algorithms: Multinomial Naive Bayes, Logistic Regression, and Support Vector Machine

Great! We have an idea about the type of problem and what possible ML algorithms to use. But before that, how do we process the text data?  

Here is an outline of the steps that we could use for processing the text data:

### **Text Pre-processing** 

1.   Spell check (depending on the context)
2.   Sentence Tokenization
3.   Word Tokenization
4.   Conversion to lower case
5.   Lexicon Normalization: Lemmatization and Stemming 
6.   Removal of punctuation and stop words (and numbers, depending on the context)
7.   Parts-of-speech (POS) tagging 
8.   Creation of n-grams 

### **Exploratory Analysis**

1.   Word Cloud
2.   Distribution of data with respect to each class 

### **Word Embeddings** 

1.   Bag-of-Words (BoW)
2.   Term Frequency (TF)
3.   Term Frequency - Inverse Document Frequency (TF - IDF)
4.   Pre-trained (Neural) Word Embeddings 

> * Word level embeddings: Word2Vec and GloVe 
> * Character level embeddings: ELMo and Flair 

We will explore each of these topics using a dataset. 
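
To make the first three representations in the list concrete, here is a minimal pure-Python sketch on a made-up three-document corpus. The corpus, the helper names (`bag_of_words`, `term_frequency`, `inverse_document_frequency`) and the unsmoothed `log(N / df)` form of IDF are assumptions chosen for illustration; libraries such as scikit-learn use smoothed variants, so their numbers will differ.

```python
import math
from collections import Counter

# Toy corpus (hypothetical): each document is already tokenized and lower-cased.
corpus = [
    ["free", "prize", "click", "now"],
    ["meeting", "agenda", "for", "monday"],
    ["click", "for", "free", "free", "prize"],
]

def bag_of_words(doc):
    """Raw count of each token in one document."""
    return Counter(doc)

def term_frequency(doc):
    """Token counts normalized by the document length."""
    counts = Counter(doc)
    return {term: n / len(doc) for term, n in counts.items()}

def inverse_document_frequency(docs):
    """Unsmoothed IDF: log(N / number of documents containing the term)."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}

idf = inverse_document_frequency(corpus)
tfidf = [
    {term: tf * idf[term] for term, tf in term_frequency(doc).items()}
    for doc in corpus
]

print(bag_of_words(corpus[2]))
print(tfidf[1])
```

In the second document, "meeting" receives a higher TF-IDF weight than "for" because "for" also appears in another document, which is the intuition behind down-weighting common terms.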


## **Resources in Python**

* [NLTK](https://www.nltk.org/): Natural Language Tool Kit 
* [spaCy](https://spacy.io/)
* [Gensim](https://github.com/RaRe-Technologies/gensim)
-----
* [TextBlob](https://textblob.readthedocs.io/en/dev/)
* [CoreNLP](https://stanfordnlp.github.io/CoreNLP/)
* [polyglot](https://polyglot.readthedocs.io/en/latest/index.html)

### **NLP Pipeline**

#### *Spell Check*
 To check for spelling errors and to get possible alternatives for the misspelled words. 


*   Using the `autocorrect` module 
*   Using the `pyspellchecker` module: It uses a *Levenshtein Distance algorithm* to generate every string within an edit distance of 2 of the original word (through insertions, deletions, replacements, and transpositions). It then compares these candidates to the known words in a word frequency list. Candidates that occur more often in the frequency list are ranked as more likely corrections.
*   Using the `textblob` module: `Word.spellcheck()` returns a list of `(word, confidence)` tuples, and `TextBlob.correct()` returns the text with each word replaced by its most likely correction.
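
The candidate-generation idea described for `pyspellchecker` can be sketched in a few lines of plain Python. This is an illustrative sketch, not the library's implementation: the tiny `WORD_FREQ` dictionary and its counts are made up, and the helper names `edits1`, `candidates`, and `correction` are chosen here for the example.

```python
import string

# Hypothetical word-frequency list; real spell checkers load a large corpus-based one.
WORD_FREQ = {"walk": 120, "weak": 80, "flak": 3, "ground": 95, "group": 150, "the": 5000}

def edits1(word):
    """All strings one edit away: deletions, transpositions, replacements, insertions."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:] for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:] for left, right in splits if right for c in letters]
    inserts = [left + c + right for left, right in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def candidates(word):
    """Known words at edit distance 0, then 1, then 2 (the first non-empty set wins)."""
    if word in WORD_FREQ:
        return {word}
    one_away = edits1(word)
    known1 = {w for w in one_away if w in WORD_FREQ}
    if known1:
        return known1
    known2 = {w2 for w1 in one_away for w2 in edits1(w1) if w2 in WORD_FREQ}
    return known2 or {word}

def correction(word):
    """Most frequent candidate; a word with no known candidate is returned unchanged."""
    return max(candidates(word), key=lambda w: WORD_FREQ.get(w, 0))

print(candidates("wlak"))
print(correction("groun"))
```

With these made-up counts, "groun" is corrected to "group" rather than "ground" simply because "group" is more frequent in the list, which shows how frequency-based ranking can pick a plausible but unintended word.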


In [78]:
#!pip install autocorrect
#Using autocorrect module 
from autocorrect import Speller

spell = Speller(lang='en')

print(spell('caaaar'))
print(spell('mussage'))
print(spell('survice'))
print(spell('hte'))
print(spell("Let is check whehter spel check works hree"))  #Correct sentence: Let us check whether spell check works here


caesar
message
service
the
Let is check whether spell check works here


In [88]:
#Using pyspellchecker 
#!pip install pyspellchecker
from spellchecker import SpellChecker

spell = SpellChecker()

# find those words that may be misspelled
misspelled = spell.unknown(['let', 'us', 'wlak','on','the','groun'])

for word in misspelled:
    # Get the one `most likely` answer
    print(spell.correction(word))

    # Get a list of `likely` options
    print(spell.candidates(word))

group
{'grout', 'ground', 'groin', 'grown', 'groan', 'group'}
walk
{'flak', 'walk', 'weak'}


In [93]:
#!pip install textblob
from textblob import Word
from textblob import TextBlob

word = Word('personell')
print(word.spellcheck())

b = TextBlob("I havv goood speling!")
print(b.correct())

[('personal', 0.65), ('personally', 0.2642857142857143), ('peroneal', 0.06428571428571428), ('personnel', 0.014285714285714285), ('personen', 0.007142857142857143)]
I have good spelling!


#### *Sentence Tokenization*

To break paragraphs into sentences

In [94]:
from nltk.tokenize import sent_tokenize

text=""""Oh, Marilla, looking forward to things is half the pleasure of them," exclaimed Anne. 
        "You mayn’t get the things themselves; but nothing can prevent you from having the fun of looking forward to them. 
        Mrs. Lynde says, 'Blessed are they who expect nothing for they shall not be disappointed.' 
        But I think it would be worse to expect nothing than to be disappointed."""

#Text from Anne of Green Gables

tokenized_text=sent_tokenize(text)

print(tokenized_text)

['"Oh, Marilla, looking forward to things is half the pleasure of them," exclaimed Anne.', '"You mayn’t get the things themselves; but nothing can prevent you from having the fun of looking forward to them.', "Mrs. Lynde says, 'Blessed are they who expect nothing for they shall not be disappointed.'", 'But I think it would be worse to expect nothing than to be disappointed.']


#### *Word Tokenization*

To break sentences into words (or tokens)

In [95]:
from nltk.tokenize import word_tokenize

tokenized_word=word_tokenize(text)
print(tokenized_word)

['``', 'Oh', ',', 'Marilla', ',', 'looking', 'forward', 'to', 'things', 'is', 'half', 'the', 'pleasure', 'of', 'them', ',', "''", 'exclaimed', 'Anne', '.', '``', 'You', 'mayn', '’', 't', 'get', 'the', 'things', 'themselves', ';', 'but', 'nothing', 'can', 'prevent', 'you', 'from', 'having', 'the', 'fun', 'of', 'looking', 'forward', 'to', 'them', '.', 'Mrs.', 'Lynde', 'says', ',', "'Blessed", 'are', 'they', 'who', 'expect', 'nothing', 'for', 'they', 'shall', 'not', 'be', 'disappointed', '.', "'", 'But', 'I', 'think', 'it', 'would', 'be', 'worse', 'to', 'expect', 'nothing', 'than', 'to', 'be', 'disappointed', '.']
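
Note that the output above splits `mayn’t` into three tokens because of the typographic apostrophe. As a rough alternative, a regular-expression tokenizer can keep such contractions together. The pattern and the name `simple_tokenize` are assumptions for this sketch; it is not NLTK's algorithm and, unlike `word_tokenize`, it separates the period from abbreviations such as "Mrs.".

```python
import re

# A word may contain internal straight or curly apostrophes; any other
# non-space character is emitted as a single punctuation token.
TOKEN_PATTERN = re.compile(r"\w+(?:['’]\w+)*|[^\w\s]")

def simple_tokenize(text):
    """Split text into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)

print(simple_tokenize("You mayn’t get the things themselves; but nothing can prevent you."))
```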


#### *Conversion to lower case*

Converting the text (in upper case or sentence case) to lower case. 

1.   Helps to maintain consistency of expected output.
2.   Maintains uniformity among different cases, which makes searching easier. For instance, a case-sensitive search for "Canada" will not match the text "canada".
3.   Without it, count-based representations treat "Canada" and "canada" as different tokens, which splits their counts and may hurt performance. 
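
The effect on the vocabulary can be seen with a small, made-up token list (the tokens below are illustrative and not taken from the dataset):

```python
from collections import Counter

tokens = ["Canada", "canada", "CANADA", "is", "Is", "large"]

# Without normalization, every casing is a separate vocabulary entry.
raw_counts = Counter(tokens)

# str.lower() merges them; str.casefold() is a stricter alternative for non-English text.
lower_counts = Counter(token.lower() for token in tokens)

print(len(raw_counts), len(lower_counts))
print(lower_counts["canada"])
```

Lower-casing is not always harmless: it also merges words such as "US" and "us", so whether to apply it depends on the task.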



In [96]:
lower_case_text = text.lower()
print(lower_case_text)

"oh, marilla, looking forward to things is half the pleasure of them," exclaimed anne. 
        "you mayn’t get the things themselves; but nothing can prevent you from having the fun of looking forward to them. 
        mrs. lynde says, 'blessed are they who expect nothing for they shall not be disappointed.' 
        but i think it would be worse to expect nothing than to be disappointed.


#### *Removal of punctuation and stop words*

Stop words such as "a", "the", and "in" occur in almost every text and usually carry little information for the analysis, so they are treated as noise and removed. It is also advisable to remove punctuation and, depending on the context, numbers. 

P.S.: Removing punctuation can sometimes distort a word. For instance, "you're good" becomes "youre good", and the contraction is no longer recognizable. 
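
A standard stop word list rarely fits every project, so it is common to customize it. The sketch below assumes a small stand-in base list (in the notebook this would be NLTK's English list) and hypothetical project-specific additions and exceptions. It also drops only tokens that consist entirely of punctuation, so contractions such as "don't" stay intact.

```python
import string

# Small illustrative base list; in the notebook this would be NLTK's English stop words.
base_stop_words = {"a", "the", "in", "is", "of", "to", "not"}

# Custom stop words: domain terms to drop and words to keep (both hypothetical).
extra_stop_words = {"reuters", "said"}
keep_words = {"not"}
custom_stop_words = (base_stop_words | extra_stop_words) - keep_words

def clean(tokens, stop_words):
    """Lower-case, drop stop words, and drop tokens made up only of punctuation."""
    cleaned = []
    for token in tokens:
        token = token.lower()
        if token in stop_words:
            continue
        if all(ch in string.punctuation for ch in token):
            continue
        cleaned.append(token)
    return cleaned

tokens = ["Reuters", "said", "the", "deal", "is", "not", "final", ",", "don't", "panic", "!"]
print(clean(tokens, custom_stop_words))
```

Keeping "not" is a typical customization: negations are part of most default lists, but removing them can change the meaning of a sentence.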

In [101]:
from nltk.corpus import stopwords
stop_words=set(stopwords.words("english"))
print(stop_words)

#NLTK's stop words are all lower case, so compare the lower-cased token
filtered_sent=[w for w in tokenized_word if w.lower() not in stop_words]

print("Tokenized Sentence:",tokenized_word)
print("Filtered Sentence:",filtered_sent)

{'too', 't', 'were', 'because', 'wouldn', 'can', 'yourselves', 'is', 'hadn', 'through', 'now', "aren't", 'or', 're', "mustn't", "doesn't", 'was', "that'll", 'i', 'down', 'we', 'isn', "weren't", 'some', 'by', 'so', "couldn't", 'very', 'been', 'these', 'how', 'himself', 'o', 'don', "mightn't", "she's", 'from', 'do', 'haven', 'the', 'ain', 'have', "needn't", 'being', 'while', 'until', 'of', 'has', 'themselves', 'at', 'not', 'are', 'myself', 'on', "it's", 'about', "isn't", 'just', 'her', 'their', 'mightn', 'then', 'where', 'than', 'they', 'nor', 'them', 'off', 'against', 'only', 'and', 'y', 've', 'weren', 'above', 'for', 'which', 'your', 'itself', 'ma', 'he', 'd', 'with', 'does', 'ourselves', 'yourself', 'an', 'what', 'below', 'other', 'such', 'aren', 'you', 'under', "hadn't", "you'd", 'once', 'own', "should've", 'between', 'after', 'up', 'be', 'any', 'will', 'in', 'again', 'she', "you're", 'here', 'this', 'when', 'won', "shan't", 'did', 'should', 'me', 'needn', 'as', "didn't", 'its', 'my'

In [108]:
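#Removal of punctuation: keep purely alphabetic tokens (this also drops numbers and tokens such as 'Mrs.' that contain a period)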

words_no_punkt = [word for word in tokenized_word if word.isalpha()]
print(words_no_punkt[:100])

['Oh', 'Marilla', 'looking', 'forward', 'to', 'things', 'is', 'half', 'the', 'pleasure', 'of', 'them', 'exclaimed', 'Anne', 'You', 'mayn', 't', 'get', 'the', 'things', 'themselves', 'but', 'nothing', 'can', 'prevent', 'you', 'from', 'having', 'the', 'fun', 'of', 'looking', 'forward', 'to', 'them', 'Lynde', 'says', 'are', 'they', 'who', 'expect', 'nothing', 'for', 'they', 'shall', 'not', 'be', 'disappointed', 'But', 'I', 'think', 'it', 'would', 'be', 'worse', 'to', 'expect', 'nothing', 'than', 'to', 'be', 'disappointed']


#### *Lexicon normalization: Lemmatization and Stemming*

*Stemming*: Reduces words to their root form by removing derivational affixes. For instance, "connection", "connected", and "connecting" all reduce to the common stem "connect". 

> 1.   Porter Stemmer: Applies suffix-stripping rules in five sequential phases.
> 2.   Lancaster Stemmer: An iterative algorithm with a table of about 120 rules. At each step it looks up an applicable rule by the last character of the word; each rule specifies the deletion or replacement of an ending. The process repeats until no rule applies or until the remaining stem would become too short: a word starting with a vowel must keep at least two characters, and a word starting with a consonant at least three. Because it is aggressive, it can lead to over-stemming. 

*Lemmatization*: Reduces words to their dictionary base form (the lemma). Because it relies on vocabulary and morphological analysis, it is usually more accurate than stemming, which ignores the context and part of speech of the word. For example, the lemma of "better" is "good", which a stemmer misses. 
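
The difference between the two ideas can be illustrated with a toy example. The suffix list and the lemma dictionary below are made up for this sketch; `toy_stem` is not the Porter or Lancaster algorithm, and `toy_lemmatize` only stands in for a real lexicon lookup such as WordNet.

```python
# Toy suffix-stripping stemmer: NOT Porter or Lancaster, just the general idea.
SUFFIXES = ["ions", "ing", "ion", "ed", "s"]  # checked in this order, longest first

def toy_stem(word, min_stem_length=3):
    """Strip the first matching suffix if enough characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_length:
            return word[: -len(suffix)]
    return word

# Toy lemma dictionary (hypothetical); a real lemmatizer consults a full lexicon.
LEMMAS = {"better": "good", "was": "be", "mice": "mouse"}

def toy_lemmatize(word):
    """Look the word up in the lemma dictionary, falling back to the word itself."""
    return LEMMAS.get(word, word)

for word in ["connection", "connected", "connecting", "better", "was", "news"]:
    print(word, toy_stem(word), toy_lemmatize(word))
```

The rule-based stemmer maps the three "connect" variants to one stem but leaves "better" untouched and over-stems "news" to "new", while the dictionary lookup handles irregular forms but only for words it knows.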

In [107]:


from nltk.stem import PorterStemmer, LancasterStemmer  #PorterStemmer is the more commonly used of the two since it is simple and fast
porter = PorterStemmer()
lancaster=LancasterStemmer()

#A list of words to be stemmed
word_list = ["friend", "friendship", "friends", "friendships","stabil","destabilize","misunderstanding","railroad","moonlight","football"]
print("{0:20}{1:20}{2:20}".format("Word","Porter Stemmer","lancaster Stemmer"))
for word in word_list:
    print("{0:20}{1:20}{2:20}".format(word,porter.stem(word),lancaster.stem(word)))

Word                Porter Stemmer      lancaster Stemmer   
friend              friend              friend              
friendship          friendship          friend              
friends             friend              friend              
friendships         friendship          friend              
stabil              stabil              stabl               
destabilize         destabil            dest                
misunderstanding    misunderstand       misunderstand       
railroad            railroad            railroad            
moonlight           moonlight           moonlight           
football            footbal             footbal             


In [109]:
from nltk.stem.wordnet import WordNetLemmatizer
lem = WordNetLemmatizer()

print("{0:20}{1:20}".format("Word","Lemma"))
for word in words_no_punkt:
    print ("{0:20}{1:20}".format(word,lem.lemmatize(word)))

Word                Lemma               
Oh                  Oh                  
Marilla             Marilla             
looking             looking             
forward             forward             
to                  to                  
things              thing               
is                  is                  
half                half                
the                 the                 
pleasure            pleasure            
of                  of                  
them                them                
exclaimed           exclaimed           
Anne                Anne                
You                 You                 
mayn                mayn                
t                   t                   
get                 get                 
the                 the                 
things              thing               
themselves          themselves          
but                 but                 
nothing             nothing             
can             

In [111]:
for word in words_no_punkt:
    print ("{0:20}{1:20}".format(word,lem.lemmatize(word, pos="v")))

Oh                  Oh                  
Marilla             Marilla             
looking             look                
forward             forward             
to                  to                  
things              things              
is                  be                  
half                half                
the                 the                 
pleasure            pleasure            
of                  of                  
them                them                
exclaimed           exclaim             
Anne                Anne                
You                 You                 
mayn                mayn                
t                   t                   
get                 get                 
the                 the                 
things              things              
themselves          themselves          
but                 but                 
nothing             nothing             
can                 can                 
prevent         

#### *Parts-of-speech (POS) tagging*

Assigns a grammatical category (such as noun, verb, or pronoun) to each token in the text.

In [114]:
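import nltk
nltk.download('averaged_perceptron_tagger')
#Tags in the output below: UH - interjection; NNP - proper noun, singular; NN/NNS - common noun, singular/plural; VBG - verb, present participle or gerund;
#VBD - verb, past tense; VBZ - verb, 3rd person singular present; RB - adverb; TO - "to" as preposition or infinitive marker;
#IN - preposition; DT - determiner; PDT - predeterminer; PRP - personal pronoun
nltk.pos_tag(words_no_punkt[:15])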


[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


[('Oh', 'UH'),
 ('Marilla', 'NNP'),
 ('looking', 'VBG'),
 ('forward', 'RB'),
 ('to', 'TO'),
 ('things', 'NNS'),
 ('is', 'VBZ'),
 ('half', 'PDT'),
 ('the', 'DT'),
 ('pleasure', 'NN'),
 ('of', 'IN'),
 ('them', 'PRP'),
 ('exclaimed', 'VBD'),
 ('Anne', 'NNP'),
 ('You', 'PRP')]
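
POS tags are also useful for lemmatization: the earlier outputs show that "looking" only becomes "look" when the lemmatizer is told the word is a verb. A common approach, sketched here with a hypothetical helper named `penn_to_wordnet`, is to map the first letter of each Penn Treebank tag to the single-letter POS codes that WordNet uses (`n`, `v`, `a`, `r`). The mapping is a simplification; for example, it treats every tag starting with "R" as an adverb.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter (noun is the fallback)."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"      # noun and everything else

# A few (word, tag) pairs copied from the tagger output above.
tagged = [("looking", "VBG"), ("things", "NNS"), ("is", "VBZ"), ("exclaimed", "VBD")]
for word, tag in tagged:
    print(word, tag, penn_to_wordnet(tag))
```

The result can then be passed to the lemmatizer, for example `lem.lemmatize(word, pos=penn_to_wordnet(tag))`, so that each word is lemmatized with its own part of speech.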

#### *Creation of n-grams*
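
An n-gram is a contiguous sequence of n tokens: bigrams are pairs of neighbouring words, trigrams are triples, and so on. They let a model capture short phrases that single words miss. The helper below is a minimal pure-Python sketch (the name `make_ngrams` is chosen for this example); NLTK offers `nltk.util.ngrams` for the same purpose.

```python
def make_ngrams(tokens, n):
    """Return the list of contiguous n-token tuples in order of appearance."""
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = ["looking", "forward", "to", "things"]
print(make_ngrams(tokens, 2))
print(make_ngrams(tokens, 3))
```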

## **Classification of Real and Fake News**

To classify whether a news article is fake or real.
 
Data source: [Kaggle](https://www.kaggle.com/clmentbisaillon/fake-and-real-news-dataset?select=Fake.csv)

In [49]:
import warnings 
warnings.simplefilter("ignore")

import os 
from google.colab import files

import pandas as pd 
import numpy as np 

#nltk
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')
from nltk.corpus import stopwords 
from nltk import word_tokenize
stop_words = set(stopwords.words('english'))

#Feature extraction
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer, TfidfTransformer

#Classifiers
from sklearn.linear_model import SGDClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, ExtraTreesClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC,  LinearSVC
from sklearn import svm
from sklearn.naive_bayes import GaussianNB

#Model selection
from sklearn.model_selection import KFold, StratifiedKFold

#Performance metrics
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_curve, auc, confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import precision_recall_fscore_support
from sklearn.metrics import precision_recall_fscore_support as score

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [50]:
uploaded = files.upload()

Saving Fake.csv to Fake.csv
Saving True.csv to True.csv


In [52]:
fake_data = pd.read_csv("Fake.csv")
real_data = pd.read_csv("True.csv")

In [53]:
fake_data.head()

|   | title | text | subject | date |
|---|---|---|---|---|
| 0 | Donald Trump Sends Out Embarrassing New Year’... | Donald Trump just couldn t wish all Americans ... | News | December 31, 2017 |
| 1 | Drunk Bragging Trump Staffer Started Russian ... | House Intelligence Committee Chairman Devin Nu... | News | December 31, 2017 |
| 2 | Sheriff David Clarke Becomes An Internet Joke... | On Friday, it was revealed that former Milwauk... | News | December 30, 2017 |
| 3 | Trump Is So Obsessed He Even Has Obama’s Name... | On Christmas day, Donald Trump announced that ... | News | December 29, 2017 |
| 4 | Pope Francis Just Called Out Donald Trump Dur... | Pope Francis used his annual Christmas Day mes... | News | December 25, 2017 |


In [54]:
real_data.head()

|   | title | text | subject | date |
|---|---|---|---|---|
| 0 | As U.S. budget fight looms, Republicans flip t... | WASHINGTON (Reuters) - The head of a conservat... | politicsNews | December 31, 2017 |
| 1 | U.S. military to accept transgender recruits o... | WASHINGTON (Reuters) - Transgender people will... | politicsNews | December 29, 2017 |
| 2 | Senior U.S. Republican senator: 'Let Mr. Muell... | WASHINGTON (Reuters) - The special counsel inv... | politicsNews | December 31, 2017 |
| 3 | FBI Russia probe helped by Australian diplomat... | WASHINGTON (Reuters) - Trump campaign adviser ... | politicsNews | December 30, 2017 |
| 4 | Trump wants Postal Service to charge 'much mor... | SEATTLE/WASHINGTON (Reuters) - President Donal... | politicsNews | December 29, 2017 |


In [57]:
#Print the number of missing values in both dataframes: real_data and fake_data
print("For fake_data")
print(fake_data.isnull().sum())

print("For real_data")
print(real_data.isnull().sum())

For fake_data
title      0
text       0
subject    0
date       0
dtype: int64
For real_data
title      0
text       0
subject    0
date       0
dtype: int64


In [58]:
fake_data["label"] = 1
real_data["label"] = 0
print(fake_data.head())
print(real_data.head())

                                               title  ... label
0   Donald Trump Sends Out Embarrassing New Year’...  ...     1
1   Drunk Bragging Trump Staffer Started Russian ...  ...     1
2   Sheriff David Clarke Becomes An Internet Joke...  ...     1
3   Trump Is So Obsessed He Even Has Obama’s Name...  ...     1
4   Pope Francis Just Called Out Donald Trump Dur...  ...     1

[5 rows x 5 columns]
                                               title  ... label
0  As U.S. budget fight looms, Republicans flip t...  ...     0
1  U.S. military to accept transgender recruits o...  ...     0
2  Senior U.S. Republican senator: 'Let Mr. Muell...  ...     0
3  FBI Russia probe helped by Australian diplomat...  ...     0
4  Trump wants Postal Service to charge 'much mor...  ...     0

[5 rows x 5 columns]


In [67]:
#Combine fake_data and real_data, then shuffle the rows (the train/test split happens in a later cell)

print("Dimension of Fake News:", fake_data.shape)
print("Dimension of Real News:", real_data.shape)
fake_real_data = pd.concat([fake_data,real_data])
print("Dimension of combined dataset:", fake_real_data.shape)
fake_real_data = fake_real_data.sample(frac=1).reset_index(drop=True)

Dimension of Fake News: (23481, 5)
Dimension of Real News: (21417, 5)
Dimension of combined dataset: (44898, 5)
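
As a quick check of the class balance, the share of each class can be computed from the row counts printed above. This is a small sketch using plain Python; on the DataFrame itself, `fake_real_data["label"].value_counts(normalize=True)` gives the same proportions.

```python
# Row counts taken from the shapes printed above.
class_counts = {"fake (label 1)": 23481, "real (label 0)": 21417}
total = sum(class_counts.values())

for name, count in class_counts.items():
    print(f"{name}: {count} rows, {count / total:.1%}")
```

The two classes are roughly balanced (about 52% fake and 48% real), so accuracy is a reasonable first metric for this dataset.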


In [69]:
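#Combining both headlines and news content to form another column called "content"
#A space is inserted so the last word of the title does not fuse with the first word of the text

fake_real_data["content"] = fake_real_data["title"] + " " + fake_real_data["text"]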
print(fake_real_data.columns)

Index(['title', 'text', 'subject', 'date', 'label', 'content'], dtype='object')


In [71]:
#Splitting dataset into train and test dataset 
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    fake_real_data["content"], fake_real_data['label'], test_size=0.3, random_state=1)

print("Dimension of training data:", X_train.shape)
print("Dimension of test data:", X_test.shape)

Dimension of training data: (31428,)
Dimension of test data: (13470,)


### **Text Pre-processing**

### **Exploratory Analysis**


### **Word Embeddings** 

### **Classifiers**
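
To show what a text classifier such as Multinomial Naive Bayes does with token counts, here is a minimal pure-Python sketch with add-one (Laplace) smoothing. The class name `TinyMultinomialNB` and the four training documents are made up for illustration; this is not the notebook's model, and in practice `sklearn.naive_bayes.MultinomialNB` would be used on the vectorized text.

```python
import math
from collections import Counter, defaultdict

class TinyMultinomialNB:
    """Multinomial naive Bayes over token counts with add-one smoothing."""

    def fit(self, docs, labels):
        self.class_doc_counts = Counter(labels)
        self.token_counts = defaultdict(Counter)
        for doc, label in zip(docs, labels):
            self.token_counts[label].update(doc)
        self.vocab = {tok for counts in self.token_counts.values() for tok in counts}
        return self

    def predict(self, doc):
        n_docs = sum(self.class_doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_label in self.class_doc_counts.items():
            total = sum(self.token_counts[label].values())
            score = math.log(n_label / n_docs)  # log prior of the class
            for tok in doc:
                if tok not in self.vocab:
                    continue  # ignore tokens never seen in training
                count = self.token_counts[label][tok]
                score += math.log((count + 1) / (total + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Hypothetical tokenized headlines (1 = fake, 0 = real).
train_docs = [
    ["shocking", "secret", "exposed"],
    ["you", "wont", "believe", "shocking"],
    ["senate", "passes", "budget", "bill"],
    ["court", "rules", "on", "budget"],
]
train_labels = [1, 1, 0, 0]

model = TinyMultinomialNB().fit(train_docs, train_labels)
print(model.predict(["shocking", "secret"]))
print(model.predict(["budget", "bill"]))
```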

### **Performance Evaluation**
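
The metrics imported earlier (accuracy, precision, recall, and F1) all follow from the counts of true and false positives and negatives. The sketch below computes them by hand for a binary problem; the function name `binary_metrics` and the label lists are hypothetical and are not results from this notebook.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 with respect to one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Hypothetical labels (1 = fake, 0 = real), not results from the notebook.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
print(binary_metrics(y_true, y_pred))
```

Here precision answers "of the articles flagged as fake, how many were fake?", while recall answers "of the fake articles, how many were flagged?".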


## **Future Work**

I have primarily used NLTK for the analysis. The notebook will be continuously updated to compare text pre-processing with NLTK and spaCy. I also intend to extend it by implementing deep learning models for text classification. 

For more information: [GitHub](https://github.com/dhanyajothimani)


### **References** 

Examples and concepts are compiled from various sources, including Python package documentation, DataCamp, and Stack Overflow.