# Fundamentals of NLP
#### By: Farahana, Date: 13/8/2020

I've never tried NLP in my life. This is my first time trying to get into this. We will try to go together from here. 

### First model: Bag of Words

In data science, numerical values are much easier to work with than non-numerical values such as words. Bag of words is the simplest model for converting sentences into numerical vectors. For instance, take these four sentences:

* "I love the book"
* "This is a great book"
* "The fit is great"
* "I love the shoes"

First, we extract the _unique_ words from the sentences above:

* "I love the book this is a great fit shoes"

Then we create one vector per sentence, with one entry for each unique word that records how many times that word occurs in the sentence. We will use scikit-learn's [CountVectorizer](https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction) from here on: 

In [1]:
# Import necessities
from sklearn.feature_extraction.text import CountVectorizer

In [2]:
# let us try with the example as training set
x_train = ["I love the book", "This is a great book", "The fit is great", "I love the shoes"]

In [3]:
# Then let us fit CountVectorizer on our training set (it learns the vocabulary) and turn the sentences into vectors
vectorizer = CountVectorizer()
vectors = vectorizer.fit_transform(x_train)

In [4]:
# let us check our vectors and test something with it.
vectors.shape

(4, 8)

In [5]:
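# get the unique words as features for the training set
# (scikit-learn 1.0 and later provide get_feature_names_out() instead; get_feature_names() was removed in 1.2)
print(vectorizer.get_feature_names())
print(vectors.toarray()) # convert the sparse count matrix into a dense array of word counts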

['book', 'fit', 'great', 'is', 'love', 'shoes', 'the', 'this']
[[1 0 0 0 1 0 1 0]
 [1 0 1 1 0 0 0 1]
 [0 1 1 1 0 0 1 0]
 [0 0 0 0 1 1 1 0]]


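Each row of `vectors` counts how often each feature occurs in one sentence. The first row, for "I love the book", reads `book`: 1, `fit`: 0, `great`: 0, `is`: 0, `love`: 1, `shoes`: 0, `the`: 1, `this`: 0. The word "I" is missing from the features because the default tokenizer only keeps tokens of two or more characters. We can view this as a pandas DataFrame below.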

In [6]:
import pandas as pd
pd.DataFrame(vectors.toarray(), columns = vectorizer.get_feature_names())

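|   | book | fit | great | is | love | shoes | the | this |
|---|------|-----|-------|----|------|-------|-----|------|
| 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
| 1 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 1 |
| 2 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 0 |
| 3 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 |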
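
To make the counting less of a black box, here is a minimal pure-Python sketch of the same idea. It assumes what I understand to be `CountVectorizer`'s default behaviour (lowercasing, and keeping only tokens of two or more word characters); the helper names `tokenize` and `to_vector` are made up for this illustration and are not part of scikit-learn.

```python
import re
from collections import Counter

sentences = ["I love the book", "This is a great book",
             "The fit is great", "I love the shoes"]

def tokenize(text):
    # Lowercase the text, then keep tokens of two or more word characters
    # (this is why "I" and "a" never become features)
    return re.findall(r"\b\w\w+\b", text.lower())

# The vocabulary is the sorted list of unique tokens over all sentences
vocabulary = sorted({token for s in sentences for token in tokenize(s)})

def to_vector(text):
    # Count each token, then read the counts off in vocabulary order
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]

bow_vectors = [to_vector(s) for s in sentences]

print(vocabulary)
for row in bow_vectors:
    print(row)
```

The printed vocabulary and rows should line up with the `CountVectorizer` output above; the real class additionally stores the result as a sparse matrix.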


Now, let us build a proper labelled dataset from the training set above, assigning each sentence to a books or a clothing category. 

In [7]:
class Category:
    BOOKS = "BOOKS"
    CLOTHES = "CLOTHES"
    
X_train = ["I love the book", "This is a great book", "The fit is great", "I love the shoes"]
y_train = [Category.BOOKS, Category.BOOKS, Category.CLOTHES, Category.CLOTHES]

In [8]:
X_vector_train = vectorizer.fit_transform(X_train) # define the x_train as vector for the classification purpose.

Next, we will train a simple machine learning classifier, a linear support vector machine (SVM), on the above dataset. 

In [9]:
from sklearn.svm import SVC

clf = SVC(kernel='linear')
clf.fit(X_vector_train, y_train)

SVC(kernel='linear')

Now we can ask the SVM classifier to predict the `Category` of a simple new sentence.

In [10]:
X_test = ['I like the book']
X_vector_test = vectorizer.transform(X_test) # as before, the sentence must be turned into a count vector, using the vocabulary learned from the training set

In [11]:
clf.predict(X_vector_test)

array(['BOOKS'], dtype='<U7')

The example above is a unigram approach, where each single word is taken as a feature. However, some meaning only shows up in word pairs: tense, as in _"was doing"_ versus _"is doing"_, or sentiment, as in _"very good"_ versus _"very bad"_. To capture this, we can use sequences of more than one word as features. Let us see this in action with the `ngram_range` parameter of the vectorizer.

>a study by Ioannis Kanaris and others revealed that n-grams of size 3 and 4 yield good performances in the anti-spam filtering of email messages (excerpt from [Python Machine Learning Book](https://github.com/rasbt/python-machine-learning-book-3rd-edition), page 264)

In [12]:
vectorizer = CountVectorizer(ngram_range=(1,2)) # With 1 and 2 words.
vectors = vectorizer.fit_transform(X_train)
print (vectorizer.get_feature_names())

['book', 'fit', 'fit is', 'great', 'great book', 'is', 'is great', 'love', 'love the', 'shoes', 'the', 'the book', 'the fit', 'the shoes', 'this', 'this is']


In [13]:
pd.DataFrame(vectors.toarray(), columns = vectorizer.get_feature_names())
# Now, we have 16 features in the dictionary and slightly different vectors.

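|   | book | fit | fit is | great | great book | is | is great | love | love the | shoes | the | the book | the fit | the shoes | this | this is |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 |
| 1 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
| 2 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 0 |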
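
The way the extra features come about can be sketched in a few lines of plain Python. This is only an illustration of the idea behind `ngram_range=(1,2)`, not scikit-learn's actual code, and `word_ngrams` is a hypothetical helper name.

```python
def word_ngrams(tokens, min_n=1, max_n=2):
    """Return every run of n consecutive tokens, for min_n <= n <= max_n."""
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of n tokens from left to right
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

# Tokens of "The fit is great" after lowercasing
tokens = ["the", "fit", "is", "great"]
grams = word_ngrams(tokens)
print(grams)
```

The bigrams `the fit`, `fit is` and `is great` are exactly the two-word features that have a 1 in row 2 of the table above.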


In [None]:
print(vectorizer.vocabulary_)

However, a bag-of-words model knows nothing about words outside the vocabulary it was fitted on, so sentences built from unseen words are hard to classify. Let us check with `books` instead of `book`.

In [14]:
X_test = ['I like the books']
vectorizer = CountVectorizer()
vectors = vectorizer.fit_transform(X_train)
X_vector_test = vectorizer.transform(X_test)

clf.predict(X_vector_test)

array(['CLOTHES'], dtype='<U7')

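The classifier put the test sentence in the `CLOTHES` category even though it differs from the earlier test sentence by a single letter. Since `books` is not in the vocabulary, the only word the model still recognizes is `the`. 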
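
A short sketch makes the problem visible. It reuses the eight-word vocabulary printed earlier and a hand-tokenized version of the test sentence; the variable names are only for illustration.

```python
# Vocabulary learned from the four training sentences (copied from the output above)
train_vocabulary = ['book', 'fit', 'great', 'is', 'love', 'shoes', 'the', 'this']

# "I like the books" after lowercasing and dropping the one-letter token "I"
test_tokens = ["like", "the", "books"]

known = [t for t in test_tokens if t in train_vocabulary]
unknown = [t for t in test_tokens if t not in train_vocabulary]

# Words outside the vocabulary simply contribute nothing to the vector
test_vector = [test_tokens.count(word) for word in train_vocabulary]

print("known:", known)
print("unknown:", unknown)
print(test_vector)
```

All the classifier has to go on is the single count for `the`. In the training table, `the` occurs in one `BOOKS` sentence and in both `CLOTHES` sentences, which may be why the prediction tips towards `CLOTHES`.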

### Second Model: Word Vectors

Another approach is to convert words into numerical vectors that carry semantic meaning. For example, red and white should end up close to each other because both are colours. Let us look at these sentences:

* "Best book I've read in years"
* "great story and characters"
* "no development of characters during the book"

Simplified, the model works in 4 steps:

1. The model looks at a window of text, for example 5 tokens long, such as "Best book I've read in".
2. It then looks at each token in the context window and learns about it from its neighboring tokens.
3. With training, the model develops associations between words such as 'book' and 'read', or 'story' and 'characters'. 
4. Eventually, the model is expected to associate all related words with each other, such as 'book', 'read', 'story' and 'characters'. 
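
The first steps can be sketched as a simple co-occurrence count over sliding windows. This is only an illustration of "words that share a window become associated"; it is not how spaCy's vectors were actually trained, and the window size, lowercasing and whitespace tokenization are assumptions made for the example.

```python
from collections import Counter

corpus = [
    "best book i've read in years",
    "great story and characters",
    "no development of characters during the book",
]
WINDOW = 5  # context window length in tokens

cooccurrence = Counter()
for sentence in corpus:
    tokens = sentence.split()
    # Slide the window over the sentence; short sentences give one window
    for start in range(max(1, len(tokens) - WINDOW + 1)):
        window = tokens[start:start + WINDOW]
        # Count every unordered pair of distinct tokens sharing this window
        for i, left in enumerate(window):
            for right in window[i + 1:]:
                if left != right:
                    cooccurrence[tuple(sorted((left, right)))] += 1

print(cooccurrence[("book", "read")])         # number of windows shared by 'book' and 'read'
print(cooccurrence[("characters", "story")])  # number of windows shared by 'story' and 'characters'
```

Pairs that never share a window, such as 'best' and 'years', keep a count of zero. A real word vector model turns this kind of neighbourhood information into dense vectors instead of raw counts.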

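For this, we can use pretrained word vectors from the spaCy library. Follow [this](https://spacy.io/usage) to install it; the `en_core_web_md` model loaded below also has to be downloaded once with `python -m spacy download en_core_web_md`. You may need to restart the notebook kernel after a fresh install.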

In [15]:
import spacy

nlp = spacy.load('en_core_web_md')

In [16]:
docs = [nlp(text) for text in X_train] # let us use above example.

Let us check what `docs` looks like in spaCy's representation.

In [17]:
docs?

Type:        list
String form: [I love the book, This is a great book, The fit is great, I love the shoes]
Length:      4
Docstring:  
Built-in mutable sequence.

If no argument is given, the constructor creates a new empty list.
The argument must be an iterable if specified.


In [18]:
print("the embedding for \"{}\" is a vector with {} dimensions".format(docs[0], docs[0].vector.shape[0]))

the embedding for "I love the book" is a vector with 300 dimensions


In [19]:
docs[0].vector[:10]

array([ 0.08563001,  0.313255  , -0.2392405 , -0.17215225,  0.1418515 ,
        0.1970548 ,  0.04868999, -0.12744625,  0.05947001,  2.1347    ],
      dtype=float32)

Let us train the SVM classifier on these word vectors for the training set. 

In [20]:
X_vector_train_wv = [x.vector for x in docs] # make new vector training dataset using comprehension

In [21]:
clf = SVC(kernel='linear')
clf.fit(X_vector_train_wv, y_train)

SVC(kernel='linear')

Now, we can try again with the latest test set, which contains `books`.

In [22]:
X_test_wv = [nlp(text) for text in X_test] # make a nlp list from the test set
X_vector_test_wv = [x.vector for x in X_test_wv] # make a list of vectors

clf.predict(X_vector_test_wv)

array(['BOOKS'], dtype='<U7')

This time the plural `books` is handled correctly: spaCy's pretrained vectors place words with similar meanings close together, so the vector for `books` lies close to the one for `book`. Now, let us try a slightly different test sentence for the BOOKS and CLOTHES categories.

In [23]:
X_test = ['This is a beautiful handbag']

X_test_wv = [nlp(text) for text in X_test] # make an nlp list from the test set
X_vector_test_wv = [x.vector for x in X_test_wv] # make a list of vectors from it

clf.predict(X_vector_test_wv)

array(['CLOTHES'], dtype='<U7')

The SVM classifier has classified the test sentence correctly even though `handbag` never appears in the training set, because the word vectors carry semantic information. 
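
To get a feeling for what "semantically close" means, here is a toy sketch with cosine similarity. The 3-dimensional vectors are invented for this illustration (the real spaCy vectors have 300 dimensions and different values), and `mean_vector` mirrors my understanding that spaCy's default `Doc.vector` is the average of its token vectors.

```python
import math

# Hypothetical 3-dimensional word vectors, made up for this example
toy_vectors = {
    "book":    [0.9, 0.1, 0.0],
    "books":   [0.8, 0.2, 0.0],
    "shoes":   [0.1, 0.9, 0.1],
    "handbag": [0.2, 0.8, 0.2],
}

def cosine(a, b):
    # Cosine similarity: dot product divided by the product of the lengths
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def mean_vector(words):
    # Average the word vectors dimension by dimension
    vecs = [toy_vectors[w] for w in words]
    return [sum(dim) / len(vecs) for dim in zip(*vecs)]

print(cosine(toy_vectors["book"], toy_vectors["books"]))     # close to 1
print(cosine(toy_vectors["book"], toy_vectors["shoes"]))     # much smaller
print(cosine(toy_vectors["handbag"], toy_vectors["shoes"]))  # close to 1
```

With vectors like these, a sentence containing `handbag` ends up nearer to the clothing sentences than to the book sentences, which is the kind of geometry a linear classifier can exploit.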

However, word vector models have some drawbacks:

1. With a larger vocabulary, more classes and longer texts, the spaCy embeddings may no longer perform as well.
2. Words with two meanings share a single vector, so some extra work is needed to tell the meanings apart.

### NLP techniques

#### 1.0 Regexes
* Regexes are not a Python-specific concept
* "Regex" is short for regular expression
* They are used for pattern matching on strings 
* Examples: 
    1. checking whether a phone number is in a valid format, such as:
        * 123-123-1234
        * +1-(123)-123-1234
    2. checking that a password contains a special character and an uppercase letter.
    
** A regular expression [cheatsheet](https://cheatography.com/davechild/cheat-sheets/regular-expressions/) comes in handy when writing patterns. 
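
The two examples above could be sketched as follows. The patterns are just one possible reading of "valid": the phone pattern assumes the two formats listed, and the password rule treats any character that is not an ASCII letter or digit as "special". `is_strong_password` is a hypothetical helper name.

```python
import re

# Optional "+<digit>-" prefix, a 3-digit area code with or without
# parentheses, then groups of 3 and 4 digits separated by hyphens
phone_pattern = re.compile(r"^(\+\d-)?(\(\d{3}\)|\d{3})-\d{3}-\d{4}$")

def is_strong_password(password):
    # At least one uppercase letter and at least one character
    # that is not an ASCII letter or digit
    has_upper = re.search(r"[A-Z]", password) is not None
    has_special = re.search(r"[^A-Za-z0-9]", password) is not None
    return has_upper and has_special

for number in ["123-123-1234", "+1-(123)-123-1234", "12-123-1234"]:
    print(number, bool(phone_pattern.match(number)))

for password in ["Secret!1", "secret!", "Secret1"]:
    print(password, is_strong_password(password))
```

Note that the phone pattern is slightly more permissive than the two listed formats: it would also accept `(123)-123-1234` without the country prefix.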

In [24]:
import re # Python's built-in regular expression module

phrase_to_match = [ 'abcd', 'xxx', 'aaa abkl ccc', 'ab cd']

# starts with "ab", then any number of non-whitespace characters, and ends with "cd"
regexp = re.compile(r"^ab[^\s]*cd$")
matches = []
for phrase in phrase_to_match:
    if regexp.search(phrase):
        matches.append(phrase)
        
print(matches)

['abcd']


Let us apply this to the books and clothes example.

In [25]:
regexp = re.compile(r"read|story|book") # hard-coded rule to find these words in the test set

test_phrases = ["I liked that story.", "I like that book.", "This hat is nice."]
matches = []
for phrase in test_phrases:
    if regexp.search(phrase):
        matches.append(phrase)
        
print(matches)

['I liked that story.', 'I like that book.']


However, this rule also matches words such as 'history' and 'threaded', because `story` and `read` are substrings of them. To avoid such false matches, we can add word boundaries to the hard-coded rule.

In [27]:
# let us add word boundaries \b...\b so that only whole words match

regexp = re.compile(r"\bread\b|\bstory\b|\bbook\b")
test_phrases = ["I liked that history.", "the man threaded up the hill", "This hat is nice."] 

matches = []
for phrase in test_phrases:
    if regexp.search(phrase):
        matches.append(phrase)
print(matches)

[]


#### 2.0 Stemming/Lemmatization

* Both are techniques to normalize text
* Stemming reduces a word to its stem by stripping suffixes, so that `books` becomes `book` and `reading` becomes `read`
* However, a stem is not always a real word: stemming reduces `stories` to `stori`. This is where lemmatization comes in.
* Lemmatization maps a word to its dictionary form (lemma), so `stories` is lemmatized to `story`.
* There are many libraries for this, but the easiest to use is probably the [nltk](https://www.nltk.org/) library (Natural Language Toolkit).

** A simple visualization and explanation can be found here: [DataCamp](https://www.datacamp.com/community/tutorials/stemming-lemmatization-python)

<img src="images/stemminglemmatization.jpg" style="width:600px"/>
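
Before turning to nltk, the difference between the two techniques can be sketched with a toy example. The suffix rules below are a hand-picked handful and are not the Porter algorithm; the lemma dictionary is a made-up stand-in for a real lexical database such as WordNet.

```python
def toy_stem(word):
    """Strip a few common English suffixes (toy rules, not Porter)."""
    word = word.lower()
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "i"   # stories -> stori
    if word.endswith("ing") and len(word) > 5:
        return word[:-3]         # reading -> read
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]         # books -> book
    return word

# Hypothetical lemma dictionary: lemmatization looks words up
# instead of chopping off suffixes
toy_lemmas = {"stories": "story", "books": "book"}

def toy_lemmatize_noun(word):
    return toy_lemmas.get(word.lower(), word.lower())

for w in ["books", "reading", "stories"]:
    print(w, "-> stem:", toy_stem(w), "| lemma:", toy_lemmatize_noun(w))
```

The stemmer is cheap but can produce non-words like `stori`, while the dictionary lookup returns the real word `story` at the cost of needing a dictionary.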

In [28]:
import nltk 

nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [29]:
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer

stemmer = PorterStemmer() # algorithm expect a single word 

In [30]:
phrase = "reading the books"
words = word_tokenize(phrase) # separate the phrase by words

stemmed_words = []  # build an empty list for stemmed words
for word in words:
    stemmed_words.append(stemmer.stem(word)) # append each stemmed word to the list
    
# join the stemmed words back into a single phrase for printing
print ("The original phrase -{}- is stemmed to -{}- using NLTK library" .format(phrase, " ".join(stemmed_words)))

The original phrase -reading the books- is stemmed to -read the book- using NLTK library


However, this simple approach has some drawbacks: punctuation is treated as a token of its own, and with stemming alone words like `stories` become `stori`.

In [31]:
phrase = "reading the stories book."
words = word_tokenize(phrase) 

stemmed_words = [] 
for word in words:
    stemmed_words.append(stemmer.stem(word)) 
    
" ".join(stemmed_words)

'read the stori book .'

Now, let us try _lemmatization_ with the nltk library.

In [32]:
from nltk.stem import WordNetLemmatizer

lemmatization = WordNetLemmatizer() # similar to stemmer, the algorithm expects a single word for lemmatization

In [33]:
phrase = "reading the books"
words = word_tokenize(phrase) # separate the phrase for each word

lemmatized_words = []  # build an empty list for lemmatized words
for word in words:
    lemmatized_words.append(lemmatization.lemmatize(word)) # append each lemmatized word to the list
    
# join the lemmatized words back into a single phrase for printing
print ("The original phrase -{}- is lemmatized to -{}- using NLTK library" .format(phrase, " ".join(lemmatized_words)))

The original phrase -reading the books- is lemmatized to -reading the book- using NLTK library


`WordNetLemmatizer.lemmatize` treats every word as a noun by default (`pos='n'`), which is why `reading` was left unchanged above. This can be changed with the `pos` (part-of-speech) argument. 

In [37]:
phrase = "reading the interesting books"
words = word_tokenize(phrase) 

lemmatized_words = []  
for word in words:
    lemmatized_words.append(lemmatization.lemmatize(word, pos='v')) # lemmatize every word as a verb
    
" ".join(lemmatized_words)

'read the interest book'

Here every word, including `interesting` and `books`, was lemmatized as a verb, so `interesting` became `interest`. The lemmatizer does not work out by itself that `interesting` is an adjective and `books` is a noun; the part of speech has to be supplied for each word. 
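
The effect of supplying a part of speech per word can be sketched with a toy lookup table. Everything here is hypothetical: the table stands in for WordNet, and the tags are assigned by hand, where a real pipeline would get them from a POS tagger.

```python
# Hypothetical lemma table keyed by (word, part of speech)
toy_lemma_table = {
    ("reading", "v"): "read",
    ("interesting", "v"): "interest",
    ("interesting", "a"): "interesting",
    ("books", "v"): "book",
    ("books", "n"): "book",
}

def toy_lemmatize(word, pos):
    # Fall back to the word itself when the table has no entry
    return toy_lemma_table.get((word, pos), word)

# Hand-assigned tags: v = verb, a = adjective, n = noun
# ("the" gets the noun default, since there is no determiner category here)
tagged = [("reading", "v"), ("the", "n"), ("interesting", "a"), ("books", "n")]

all_verbs = " ".join(toy_lemmatize(word, "v") for word, _ in tagged)
per_word = " ".join(toy_lemmatize(word, pos) for word, pos in tagged)

print(all_verbs)  # every word treated as a verb, as in the cell above
print(per_word)   # each word lemmatized with its own tag
```

With nltk, the tags could come from `nltk.pos_tag`, whose output would then need to be mapped to the single-letter codes that `WordNetLemmatizer.lemmatize` expects.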

#### 3.0 Stopwords 

Stopwords are the most common words in a language, such as this, that, he, and a in English. They usually get stripped out during or after tokenization. Note that the missing `I` and `a` in the first model are not a case of stopword removal: by default `CountVectorizer` removes no stopwords (`stop_words=None`), and it dropped those two words only because its default token pattern keeps tokens of at least two characters. 

In [38]:
from nltk.corpus import stopwords

stop_words = stopwords.words('english') # the language has to be specified.

In [39]:
print (stop_words)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [41]:
phrase = 'Here is an example sentence demonstrating the removal of stopwords'

words = word_tokenize(phrase) 

stripped_words = [] 
for word in words:
    if word not in stop_words:
        stripped_words.append(word) 
    
" ".join(stripped_words)

'Here example sentence demonstrating removal stopwords'
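
Notice that `Here` survived in the result: the stopword list is all lowercase, so the capitalized token does not match. A minimal sketch of a case-insensitive filter, using a small hand-copied subset of the list above and plain `split()` instead of `word_tokenize` so that it runs without nltk:

```python
# A small subset of the NLTK English stopword list printed above
toy_stop_words = {"here", "is", "an", "the", "of"}

demo_phrase = "Here is an example sentence demonstrating the removal of stopwords"

# Lowercase each token before comparing it against the stopword list
kept = [word for word in demo_phrase.split() if word.lower() not in toy_stop_words]
filtered = " ".join(kept)

print(filtered)
```

Using a `set` for the stopwords also makes each membership check fast, which starts to matter on larger texts.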