# Fundamentals of NLP
#### By: Farahana, Date: 13/8/2020

I have never worked with NLP before, so this is my first attempt at it. Let us work through it together from here.

### First model: Bag of Words

In data science, it is easier to work with numerical values than with non-numerical values such as words. The bag-of-words model is the simplest way to convert sentences into numerical vectors. For instance, take these four sentences:

* "I love the book"
* "This is a great book"
* "The fit is great"
* "I love the shoes"

We can extract the _unique_ words from the sentences above and collect them into a single vocabulary:

* "I love the book this is a great fit shoes"

Then we create a vector for every sentence that records which of these unique words it contains, and how many times. We will use scikit-learn's [CountVectorizer](https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction) from here on:

In [1]:
# Import necessities
from sklearn.feature_extraction.text import CountVectorizer

In [2]:
# Use the example sentences as the training set
x_train = ["I love the book", "This is a great book", "The fit is great", "I love the shoes"]

In [12]:
# Fit CountVectorizer on the training set to learn the vocabulary, then transform the sentences into count vectors
vectorizer = CountVectorizer()
vectors = vectorizer.fit_transform(x_train)

In [4]:
# Check the shape of the result: (number of sentences, number of unique words)
vectors.shape

(4, 8)

In [5]:
print(vectorizer.get_feature_names_out().tolist())  # unique words used as features; on scikit-learn < 1.0 use get_feature_names()
print(vectors.toarray())  # convert the sparse matrix into a dense array of word counts

['book', 'fit', 'great', 'is', 'love', 'shoes', 'the', 'this']
[[1 0 0 0 1 0 1 0]
 [1 0 1 1 0 0 0 1]
 [0 1 1 1 0 0 1 0]
 [0 0 0 0 1 1 1 0]]


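The first row of `vectors` encodes "I love the book": `book` = 1, `fit` = 0, `great` = 0, `is` = 0, `love` = 1, `shoes` = 0, `the` = 1, `this` = 0. Note that the vocabulary has 8 entries rather than the 10 unique words listed earlier. CountVectorizer lowercases the text and, by default, keeps only tokens of two or more characters, so "I" and "a" are dropped.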
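To see where these counts come from, here is a minimal pure-Python sketch of the same idea. It is not how scikit-learn is implemented internally. The helper names `tokenize` and `count_vector` are made up for illustration, and the regular expression is assumed to mirror CountVectorizer's default token pattern.

```python
import re
from collections import Counter

sentences = ["I love the book", "This is a great book", "The fit is great", "I love the shoes"]

def tokenize(text):
    # Hypothetical helper: lowercase the text and keep tokens of two or more
    # word characters (assumed to match CountVectorizer's default behaviour).
    return re.findall(r"\b\w\w+\b", text.lower())

# Sorted list of the unique tokens found across all sentences.
vocabulary = sorted({token for s in sentences for token in tokenize(s)})

def count_vector(text):
    # Hypothetical helper: count how often each vocabulary word occurs in the text.
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]

matrix = [count_vector(s) for s in sentences]

print(vocabulary)
for row in matrix:
    print(row)
```

Running this should print the same eight features and the same four rows as the CountVectorizer output above.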

Now, let us build a labelled dataset from the same training sentences.

In [6]:
class Category:
    BOOKS = "BOOKS"
    CLOTHING = "CLOTHING"
    
X_train = ["I love the book", "This is a great book", "The fit is great", "I love the shoes"]
y_train = [Category.BOOKS, Category.BOOKS, Category.CLOTHING, Category.CLOTHING]

In [7]:
X_vector_train = vectorizer.fit_transform(X_train)  # vectorize X_train so it can be used for classification

Next, we will use a simple machine learning technique, a support vector machine (SVM) with a linear kernel, on the above dataset.

In [8]:
from sklearn.svm import SVC

clf = SVC(kernel='linear')
clf.fit(X_vector_train, y_train)

SVC(kernel='linear')

Now, we can give the SVM classifier a simple sentence and have it classify the sentence into a `Category`:

In [9]:
X_test = ['I like the book']
X_vector_test = vectorizer.transform(X_test)  # reuse the fitted vocabulary (transform, not fit_transform) to vectorize the test sentence

In [10]:
clf.predict(X_vector_test)

array(['BOOKS'], dtype='<U8')
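Note that the word "like" never appears in the training sentences, so the fitted vocabulary has no column for it and `transform` simply ignores it. The sketch below imitates that behaviour in plain Python. The `transform` function here is a hypothetical stand-in, not scikit-learn's own code, and the vocabulary is copied by hand from the output shown earlier.

```python
import re

# Vocabulary learned from the training sentences (copied from the earlier output).
vocabulary = ['book', 'fit', 'great', 'is', 'love', 'shoes', 'the', 'this']

def transform(text, vocabulary):
    # Hypothetical helper: count only the tokens that exist in the vocabulary.
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    index = {word: i for i, word in enumerate(vocabulary)}
    vector = [0] * len(vocabulary)
    for token in tokens:
        if token in index:  # unseen words such as "like" are skipped
            vector[index[token]] += 1
    return vector

test_vector = transform("I like the book", vocabulary)
print(test_vector)
```

Only the columns for `book` and `the` are non-zero, so the classifier's decision rests on those two words alone.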

The example above is a unigram approach, where each single word is taken as a feature. However, when word order carries meaning, as with tenses such as _"was doing"_ and _"is doing"_ or sentiment such as _"very good"_ and _"very bad"_, we should also consider sequences of more than one word (n-grams). Let us see such a vectorizer in action.

In [33]:
vectorizer = CountVectorizer(ngram_range=(1, 2))  # use both unigrams (1 word) and bigrams (2 words)
vectors = vectorizer.fit_transform(X_train)
print(vectorizer.get_feature_names_out().tolist())  # on scikit-learn < 1.0 use get_feature_names()

['book', 'fit', 'fit is', 'great', 'great book', 'is', 'is great', 'love', 'love the', 'shoes', 'the', 'the book', 'the fit', 'the shoes', 'this', 'this is']


In [36]:
vectors.shape

(4, 16)

Now, we have 16 features in the vocabulary: the 8 unigrams from before plus 8 bigrams.
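As a rough illustration of how these features arise, the sketch below builds unigrams and bigrams by hand. The `ngrams` helper and its parameters are hypothetical names chosen for this example, and the tokenization is again assumed to mirror CountVectorizer's defaults (lowercase, tokens of two or more characters).

```python
import re

sentences = ["I love the book", "This is a great book", "The fit is great", "I love the shoes"]

def ngrams(text, n_min=1, n_max=2):
    # Hypothetical helper: tokenize first, then join every run of n consecutive tokens.
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

# Sorted set of all unigrams and bigrams across the training sentences.
features = sorted({gram for s in sentences for gram in ngrams(s)})

print(len(features))
print(features)
```

Because the one-character token "a" is dropped before the pairs are formed, "This is a great book" yields the bigram `is great` rather than `is a` or `a great`.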