### NLP Fundamentals

#### Vectorization

##### Count Vectorizer
*In this section we discuss the count vectorizer, which represents text as numeric vectors. Suppose you have 5 sentences. The vectorizer collects every unique word across the sentences into a vocabulary (a 1D list of words). Then, for each sentence, it places a 1 at the position of every vocabulary word that occurs in the sentence and a 0 everywhere else. The result is a matrix with 5 rows (one vector per sentence) and one column per vocabulary word. The vectorizer can be binary and can work on n-grams instead of single words. In binary mode a word is counted only once, even if it occurs several times in a sentence; otherwise each position holds the number of occurrences.*
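
*To make the idea concrete, here is a minimal hand-written sketch of the same process, using the notebook's five training sentences. The helper names `tokenize` and `vectorize` are made up for this illustration, and the tokenization rule (lowercase, keep tokens of two or more word characters) is an assumption chosen to mirror CountVectorizer's default behavior.*

```python
import re

# The five training sentences used throughout this notebook.
sentences = [
    "I love train journey",
    "I love reading books on train",
    "Train journey is cheap",
    "I like to wear shirt",
    "T-shirt fits best in summer",
]

def tokenize(text):
    # Assumption: lowercase the text and keep tokens of 2+ word characters,
    # so "I" and the "T" of "T-shirt" are dropped.
    return re.findall(r"\b\w\w+\b", text.lower())

# Vocabulary: every unique token across all sentences, sorted alphabetically.
vocabulary = sorted({tok for s in sentences for tok in tokenize(s)})

def vectorize(text, binary=True):
    tokens = tokenize(text)
    if binary:
        # 1 if the word occurs at all, 0 otherwise.
        return [1 if word in tokens else 0 for word in vocabulary]
    # Non-binary mode: number of occurrences of each vocabulary word.
    return [tokens.count(word) for word in vocabulary]

matrix = [vectorize(s) for s in sentences]  # 5 rows, one column per word
print(vocabulary)
for row in matrix:
    print(row)

# A repeated word shows the difference between the two modes.
print(vectorize("train train journey", binary=True))
print(vectorize("train train journey", binary=False))
```

*In the last two lines, the position of "train" holds 1 in binary mode and 2 in count mode.*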

In [1]:
class Category:
    TRAVEL = "TRAVEL"
    CLOTHING = "CLOTHING"

trainX = ["I love train journey","I love reading books on train","Train journey is cheap","I like to wear shirt","T-shirt fits best in summer"]
trainY = [Category.TRAVEL, Category.TRAVEL, Category.TRAVEL, Category.CLOTHING, Category.CLOTHING]

In [2]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(binary=True)       # Unigrams: one feature per single word
#vectorizer = CountVectorizer(binary=True, ngram_range=(1,2))    # Unigrams and bigrams
trainX_vector = vectorizer.fit_transform(trainX)

print(vectorizer.get_feature_names_out().tolist())  # get_feature_names() was removed in scikit-learn 1.2
print(trainX_vector.toarray())

['best', 'books', 'cheap', 'fits', 'in', 'is', 'journey', 'like', 'love', 'on', 'reading', 'shirt', 'summer', 'to', 'train', 'wear']
[[0 0 0 0 0 0 1 0 1 0 0 0 0 0 1 0]
 [0 1 0 0 0 0 0 0 1 1 1 0 0 0 1 0]
 [0 0 1 0 0 1 1 0 0 0 0 0 0 0 1 0]
 [0 0 0 0 0 0 0 1 0 0 0 1 0 1 0 1]
 [1 0 0 1 1 0 0 0 0 0 0 1 1 0 0 0]]
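
*The commented-out line in the cell above switches to `ngram_range=(1,2)`, which adds two-word sequences (bigrams) to the single words. The sketch below shows which features that would produce for one sentence. The function name `word_ngrams` and the tokenization rule are assumptions made for this illustration, not the library's own code.*

```python
import re

def word_ngrams(text, ngram_range=(1, 2)):
    # Assumption: same tokenization as before (lowercase, tokens of 2+ characters).
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    min_n, max_n = ngram_range
    ngrams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of n consecutive tokens over the sentence.
        for i in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[i:i + n]))
    return ngrams

print(word_ngrams("I love train journey"))
# ['love', 'train', 'journey', 'love train', 'train journey']
```

*Bigrams keep a little word order ("train journey" is a single feature), at the cost of a larger vocabulary.*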


In [3]:
# Define a Classifier
from sklearn import svm

linear_svm = svm.SVC(kernel='linear')
linear_svm.fit(trainX_vector,trainY)

SVC(kernel='linear')

In [4]:
# Prediction
testX = vectorizer.transform(["visiting new place is great"])

linear_svm.predict(testX)

array(['TRAVEL'], dtype='<U8')
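
*It is worth checking what the classifier actually saw here. CountVectorizer's `transform` ignores any word that was not in the vocabulary learned during `fit`. The sketch below, which assumes the same tokenization rule as the default vectorizer, lists which words of the test sentence are known.*

```python
import re

# Vocabulary printed by the vectorizer earlier in this notebook.
vocabulary = ['best', 'books', 'cheap', 'fits', 'in', 'is', 'journey', 'like',
              'love', 'on', 'reading', 'shirt', 'summer', 'to', 'train', 'wear']

test_sentence = "visiting new place is great"
tokens = re.findall(r"\b\w\w+\b", test_sentence.lower())

known = [t for t in tokens if t in vocabulary]
unknown = [t for t in tokens if t not in vocabulary]

print(known)    # ['is']
print(unknown)  # ['visiting', 'new', 'place', 'great']
```

*Only "is" survives, and in the training data it appears in a TRAVEL sentence, which may explain the prediction. The words that carry the meaning ("visiting", "place") are invisible to the model.*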

##### Word Vectors
*Word vectors follow a similar idea: text is again turned into vectors. Here, however, each vector encodes the semantic meaning of a word. Because the **Count Vectorizer** only places a 1 for every word that occurs, it cannot capture what a word means or how much it matters in the sentence. Word vectors place words with similar meanings close together, so they can also measure the relationship between words. For word vectors we will use a **word2vec**-style pretrained model.*
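
*The relationship between two word vectors is commonly measured with cosine similarity. The sketch below uses hypothetical 3-dimensional vectors invented for this illustration; real pretrained vectors have hundreds of dimensions and different values.*

```python
import math

# Hypothetical toy vectors, invented for illustration only.
toy_vectors = {
    "shirt":  [0.9, 0.1, 0.0],
    "tshirt": [0.8, 0.2, 0.1],
    "train":  [0.1, 0.9, 0.2],
}

def cosine_similarity(a, b):
    # Dot product divided by the product of the vector lengths.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Vectors pointing in similar directions score close to 1.
print(cosine_similarity(toy_vectors["shirt"], toy_vectors["tshirt"]))
# Vectors pointing in different directions score closer to 0.
print(cosine_similarity(toy_vectors["shirt"], toy_vectors["train"]))
```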

*Here we will use the spaCy library. spaCy ships with pretrained models that we can take advantage of. Before using it, install spaCy if you have not already, and download one of the language models.*

*> Install spaCy:* ***pip install spacy***

*> Download a model:* ***!python -m spacy download "en_core_web_md"***. *This is the medium model; small (sm) and large (lg) models are also available.*

In [6]:
# Download the medium model
#!python -m spacy download "en_core_web_md" 

*Once the model is downloaded, you can comment out the command, because it does not need to be downloaded again. We can simply load the model and use it.*

In [11]:
# Import the spacy library and Load the medium model
import spacy
nlp = spacy.load("en_core_web_md")

*Now we need some text to apply the model to. We can reuse the text from the count vectorizer section.*

In [12]:
# Defining the data
class Category:
    TRAVEL = "TRAVEL"
    CLOTHING = "CLOTHING"

trainX = ["I love train journey","I love reading books on train","Train journey is cheap","I like to wear shirt","T-shirt fits best in summer"]
trainY = [Category.TRAVEL, Category.TRAVEL, Category.TRAVEL, Category.CLOTHING, Category.CLOTHING]

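*Now we will represent trainX in vector form. First we pass each sentence through the model to get a Doc object. A Doc holds the processed tokens of one sentence, and its `.vector` attribute is the average of the vectors of those tokens, so each sentence becomes a single vector.*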

In [13]:
# Making docs of data and vectorize
docs = [nlp(text) for text in trainX]
#print(docs[0].vector)
wv_trainX_vector = [x.vector for x in docs]
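
*By default, spaCy computes a Doc's vector as the average of its token vectors. The sketch below reproduces that averaging by hand with hypothetical 3-dimensional token vectors; the values and the helper name `average_vector` are invented for this illustration.*

```python
# Hypothetical token vectors, invented for illustration only.
toy_vectors = {
    "love":    [1.0, 0.0, 0.5],
    "train":   [0.0, 1.0, 0.25],
    "journey": [0.5, 0.5, 0.0],
}

def average_vector(tokens, vectors):
    # Collect the vectors of the tokens we know, then average them per dimension.
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return []
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]

sentence_vector = average_vector(["love", "train", "journey"], toy_vectors)
print(sentence_vector)  # [0.5, 0.5, 0.25]
```

*Because every sentence is reduced to one fixed-length vector, sentences of different lengths can be fed to the same classifier.*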

In [14]:
# Define the classifier
from sklearn import svm
linear_svm_wv = svm.SVC(kernel='linear')
linear_svm_wv.fit(wv_trainX_vector, trainY)

SVC(kernel='linear')

In [18]:
# Make Prediction
wv_testX = ["I love tshirt but wear less"]
wv_testX_docs = [nlp(text) for text in wv_testX]
wv_testX_vector = [x.vector for x in wv_testX_docs]

linear_svm_wv.predict(wv_testX_vector)

array(['CLOTHING'], dtype='<U8')

### NLP Techniques

#### Regex
*Regular expressions have many uses in natural language processing. One of them is pattern matching. Here we will look at pattern matching of the kind used to check phone numbers, emails, passwords, etc.*
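
*As a rough sketch of those three use cases, the patterns below check a phone number, an email address, and a password. The formats are simplifying assumptions made for this illustration (a 3-3-4 digit phone number, a loose email shape, and an arbitrary password rule); real-world formats vary widely.*

```python
import re

# Assumed phone format: 3 digits, 3 digits, 4 digits, separated by hyphens.
phone_re = re.compile(r"^\d{3}-\d{3}-\d{4}$")

# Simplified email shape: name, "@", then a domain with at least one dot.
email_re = re.compile(r"^[\w.+-]+@[\w-]+(\.[\w-]+)+$")

def is_strong_password(password):
    # Assumed rule: 8+ characters with a lowercase letter, an uppercase letter, and a digit.
    return bool(
        len(password) >= 8
        and re.search(r"[a-z]", password)
        and re.search(r"[A-Z]", password)
        and re.search(r"\d", password)
    )

print(bool(phone_re.match("555-123-4567")))      # True
print(bool(phone_re.match("5551234567")))        # False
print(bool(email_re.match("user@example.com")))  # True
print(bool(email_re.match("user@example")))      # False
print(is_strong_password("Passw0rdX"))           # True
print(is_strong_password("password"))            # False
```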

In [20]:
# Import library
import re

In [34]:
# Define expressions for various scenarios.
regexp0 = re.compile(r"book|accident|concept")  # Literal words, matched anywhere (even inside longer words)
regexp1 = re.compile(r"\bbook\b|\baccident\b|\bconcept\b")  # Same words, whole-word matches only
regexp2 = re.compile(r"^ab[^\s]*cd$")    # start with 'ab' and end with 'cd', anything in the middle except white space.

phrases0 = ["I have this book", "Reebook is a brand name", "The concept isn't perfect", "Roads are barely good"]
phrases1 = ["abcd", "asdf", "xvcd", "ab cd", "abxxxcd"]

In [35]:
# Searching
searched = []
for phrase in phrases0:
    if regexp1.search(phrase):
        searched.append(phrase)
print(searched)

['I have this book', "The concept isn't perfect"]


*We can search for a specific word, or for part of a word, with a regular expression. With regexp0 the search would find book as a separate word in the first sentence and also as part of the word Reebook in the second sentence.*

*If you want only the whole word, not the word as part of another word, wrap it in word boundaries (\b) as shown in regexp1. That is the pattern used above, which is why the Reebook sentence is missing from the output.*
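
*To see the difference side by side, the sketch below runs both kinds of pattern over the same phrases. The variable names `substring_re` and `whole_word_re` are chosen for this illustration and correspond to regexp0 and regexp1.*

```python
import re

phrases = ["I have this book", "Reebook is a brand name",
           "The concept isn't perfect", "Roads are barely good"]

substring_re = re.compile(r"book|accident|concept")               # matches inside longer words too
whole_word_re = re.compile(r"\bbook\b|\baccident\b|\bconcept\b")  # whole words only

substring_hits = [p for p in phrases if substring_re.search(p)]
whole_word_hits = [p for p in phrases if whole_word_re.search(p)]

print(substring_hits)
# ['I have this book', 'Reebook is a brand name', "The concept isn't perfect"]
print(whole_word_hits)
# ['I have this book', "The concept isn't perfect"]
```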

In [24]:
# Find Match
matches = []
for phrase in phrases1:
    if regexp2.match(phrase):
        matches.append(phrase)
print(matches)

['abcd', 'abxxxcd']
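
*The two cells above use `search` and `match`, which behave differently: `search` looks for the pattern anywhere in the string, `match` only at the beginning, and `fullmatch` requires the whole string to fit. The sketch below uses a variant of regexp2 without the ^ and $ anchors (an assumption made so the difference becomes visible).*

```python
import re

# Variant of regexp2 without the ^ and $ anchors.
pattern = re.compile(r"ab\S*cd")

print(bool(pattern.search("xx abcd")))      # True: found somewhere in the string
print(bool(pattern.match("xx abcd")))       # False: the string does not start with 'ab'
print(bool(pattern.match("abcd yy")))       # True: only the beginning has to fit
print(bool(pattern.fullmatch("abcd yy")))   # False: the whole string has to fit
print(bool(pattern.fullmatch("abxxxcd")))   # True
```

*With the ^ and $ anchors in regexp2, `match` already behaves like `fullmatch`, which is why "ab cd" was rejected.*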


#### Stemming and Lemmatization
*Stemming and lemmatization are two techniques for normalizing text. In vectorization we saw that a trained model can recognize the word "word" but not "words", although to us it is straightforward that they are forms of the same word. Stemming and lemmatization address this by reducing each word to its root form.*
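
*Roughly speaking, stemming strips suffixes by rule, while lemmatization looks up the dictionary form of a word. The sketch below is a deliberately tiny toy version of both; the suffix list, the lookup table, and the function names are invented for this illustration. Real projects would use a library instead, for example NLTK's `PorterStemmer` or the `lemma_` attribute of spaCy tokens.*

```python
def toy_stem(word):
    # Strip one of a few common suffixes, keeping at least 3 characters.
    # Real stemmers such as Porter's apply many more rules.
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical lookup table standing in for a real lemmatizer's dictionary.
toy_lemmas = {"went": "go", "better": "good", "books": "book", "reading": "read"}

def toy_lemmatize(word):
    # Return the dictionary form if we know it, otherwise the word itself.
    return toy_lemmas.get(word, word)

for word in ["words", "reading", "loved", "went"]:
    print(word, "->", toy_stem(word), "/", toy_lemmatize(word))
```

*Two things are visible even in this toy: a stem need not be a real word ("loved" becomes "lov"), and an irregular form such as "went" is untouched by suffix stripping but can be resolved by a dictionary lookup.*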

*Abs Sayem*