# Natural Language Processing

Natural Language Processing (NLP) is the application of Machine Learning models to text and speech. Its focus is teaching machines to understand spoken and written language. Whenever you dictate something into your iPhone or Android device and it is converted to text, that is an NLP algorithm in action.

You can also use NLP on a text review to predict whether the review is positive or negative, on an article to predict
which categories it belongs to, or on a book to predict its genre. You can go further and use NLP to build a machine
translator or a speech recognition system. Many classic NLP tasks are framed as classification problems, and the models
commonly used for them include Logistic Regression, Naive Bayes, CART (a model based on decision trees), Maximum Entropy
(closely related to Logistic Regression), and Hidden Markov Models (models based on Markov processes).

Examples:
1. If/else rules (chatbot)
2. Audio frequency component analysis
3. Bag-of-words model
4. CNN for text recognition
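
The bag-of-words model in the list above is the one used in the rest of this notebook, so a minimal sketch of the idea may help. It uses only the standard library; the three toy documents and the helper name `to_vector` are made up for illustration and are not part of the restaurant review dataset.

```python
from collections import Counter

# Toy documents (illustrative only, not taken from the dataset)
docs = ["good food good service", "bad food", "not good"]

# The vocabulary is the sorted set of all distinct words in the corpus
vocabulary = sorted({word for doc in docs for word in doc.split()})

def to_vector(doc, vocabulary):
    """Count how often each vocabulary word occurs in one document."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]

# One count vector per document; word order within a document is discarded
vectors = [to_vector(doc, vocabulary) for doc in docs]

print(vocabulary)
for doc, vec in zip(docs, vectors):
    print(vec, "<-", doc)
```

Each document becomes a fixed-length vector of word counts, which is the kind of numeric input a classifier can be trained on.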

In [89]:
import pandas as pd 
import matplotlib.pyplot as plt
import numpy as np

In [90]:
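# Tab-separated file; quoting=3 (csv.QUOTE_NONE) treats quote characters in the reviews as ordinary text
dataset = pd.read_csv("Restaurant_Reviews.tsv", delimiter="\t", quoting=3)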

In [91]:
import re
import nltk
nltk.download("stopwords") # stop words are very common words ("the", "is", ...) that carry little signal and are filtered out

from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer # stemming reduces a word to its stem, a base or root form that need not be an existing word

# Build the stemmer and the stop word set once instead of on every iteration
ps = PorterStemmer()
allstopwords = set(stopwords.words('english'))
allstopwords.discard("not") # keep "not" so that negations such as "not good" survive

corpus = []
for i in range(0,1000):
    review = re.sub('[^a-zA-Z]',' ',dataset["Review"][i]) # replace every non-letter with a space
    review = review.lower()
    review = review.split()
    review = [ps.stem(word) for word in review if word not in allstopwords]
    review = ' '.join(review)
    corpus.append(review)


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
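
To see what the cleaning loop does to a single review, here is a small sketch that applies the same steps (strip non-letters, lowercase, drop stop words, stem) to two short sentences. The helper `clean_review`, the example sentences, and the tiny hand-picked `STOP_WORDS` set are assumptions made for illustration; the notebook itself uses NLTK's full English stop word list with "not" removed.

```python
import re
from nltk.stem.porter import PorterStemmer

# Tiny hand-picked stop word set (an assumption for this sketch);
# the notebook uses stopwords.words('english') minus "not".
STOP_WORDS = {"this", "the", "is", "a", "was"}

def clean_review(text, stemmer, stop_words):
    """Apply the same cleaning steps as the loop above to one string."""
    text = re.sub('[^a-zA-Z]', ' ', text).lower()
    words = [stemmer.stem(w) for w in text.split() if w not in stop_words]
    return ' '.join(words)

ps_demo = PorterStemmer()
cleaned_positive = clean_review("Wow... Loved this place.", ps_demo, STOP_WORDS)
cleaned_negative = clean_review("Crust is not good.", ps_demo, STOP_WORDS)

print(cleaned_positive)
print(cleaned_negative)
```

Because "not" is not in the stop word set, the second sentence keeps its negation, which is the reason the notebook removes "not" from NLTK's list before filtering.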


In [92]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features = 1500)
X = cv.fit_transform(corpus).toarray()
Y = dataset.iloc[:,-1].values
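
As a rough illustration of what `max_features` does, the sketch below fits `CountVectorizer` on a three-sentence toy corpus (made up for this example) and keeps only the two most frequent terms. The variable names `toy_corpus`, `toy_cv`, and `toy_X` are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (illustrative only): "good" occurs 3 times, "food" twice, the rest once
toy_corpus = ["good food good service", "bad food", "not good"]

# Keep only the 2 terms with the highest total count across the corpus
toy_cv = CountVectorizer(max_features=2)
toy_X = toy_cv.fit_transform(toy_corpus).toarray()

# vocabulary_ maps each kept term to its column index
kept_terms = sorted(toy_cv.vocabulary_)

print(kept_terms)
print(toy_X)
```

In the notebook the same cap is set to 1500, so only the 1500 most frequent stems in the reviews become columns of `X`.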

In [93]:
from sklearn.model_selection import train_test_split
X_train,X_test,Y_train,Y_test = train_test_split(X,Y,test_size=0.2)

In [94]:
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

svc = SVC(kernel="rbf",random_state=0)
svc.fit(X_train,Y_train)
gb = GaussianNB()
gb.fit(X_train,Y_train)

In [95]:
from sklearn.metrics import confusion_matrix,accuracy_score
Y_pred_gb = gb.predict(X_test)
Y_pred_svc = svc.predict(X_test)
# scikit-learn metrics expect (y_true, y_pred): rows are actual classes, columns are predicted classes
print(confusion_matrix(Y_test,Y_pred_gb))
print(accuracy_score(Y_test,Y_pred_gb))
print(confusion_matrix(Y_test,Y_pred_svc))
print(accuracy_score(Y_test,Y_pred_svc))

[[51 49]
 [15 85]]
0.68
[[87 13]
 [21 79]]
0.83
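
When reading a confusion matrix it helps to know its orientation. scikit-learn's `confusion_matrix` takes the true labels first, so rows correspond to actual classes and columns to predicted classes; swapping the arguments returns the transpose, while accuracy stays the same. The sketch below shows this on made-up labels (`y_true_demo` and `y_pred_demo` are illustrative, not results from the models above) and derives accuracy, precision, and recall from the four cells.

```python
from sklearn.metrics import confusion_matrix, accuracy_score

# Made-up labels for illustration only
y_true_demo = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred_demo = [0, 0, 0, 1, 1, 1, 0, 0]

# Rows = actual class, columns = predicted class
cm = confusion_matrix(y_true_demo, y_pred_demo)

# Swapping the arguments transposes the matrix
cm_swapped = confusion_matrix(y_pred_demo, y_true_demo)

# For binary labels, ravel() yields tn, fp, fn, tp in that order
tn, fp, fn, tp = cm.ravel()
accuracy = (tp + tn) / (tn + fp + fn + tp)
precision = tp / (tp + fp)
recall = tp / (tp + fn)

print(cm)
print(cm_swapped)
print(accuracy, precision, recall)
```

Accuracy only uses the diagonal, so it is unaffected by the argument order, but precision and recall trade places if the matrix is read in the wrong orientation.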
