# Natural Language Processing

Prefer TSV (tab-separated) files for text data: the review text itself often contains commas, which makes a comma-delimited CSV file harder to parse correctly.
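To see why the tab is the safer delimiter, here is a minimal sketch using Python's standard `csv` module. The sample line is made up for illustration and is not taken from the dataset.

```python
import csv
import io

# Hypothetical review line: the text itself contains commas.
raw = "Good food, great service, fair prices.\t1\n"

# Splitting on commas breaks the review into several fields.
comma_fields = next(csv.reader(io.StringIO(raw), delimiter=","))

# Splitting on tabs keeps the review text and its label intact.
tab_fields = next(csv.reader(io.StringIO(raw), delimiter="\t"))

print(len(comma_fields))  # 3
print(tab_fields)         # ['Good food, great service, fair prices.', '1']
```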

In this notebook we work with a dataset of reviews given to a restaurant.

In [1]:
#Importing libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [2]:
#Importing the dataset
dataset = pd.read_csv('data/Restaurant_Reviews.tsv', delimiter = '\t', quoting = 3) #quoting = 3 (csv.QUOTE_NONE): quote characters are treated as ordinary text
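
The value `3` passed as `quoting` in the call above is the constant `csv.QUOTE_NONE` from Python's standard `csv` module. As a rough illustration of its effect, the sketch below uses the standard-library reader rather than pandas, on a made-up sample line.

```python
import csv
import io

# The integer 3 is the value of csv.QUOTE_NONE.
print(csv.QUOTE_NONE)  # 3

# Hypothetical review containing double quotes.
raw = 'The "special" was cold.\t0\n'

# With QUOTE_NONE, double quotes are kept as ordinary characters.
row = next(csv.reader(io.StringIO(raw), delimiter="\t", quoting=csv.QUOTE_NONE))
print(row)  # ['The "special" was cold.', '0']
```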

In [3]:
dataset.head()

| | Review | Liked |
|---|---|---|
| 0 | Wow... Loved this place. | 1 |
| 1 | Crust is not good. | 0 |
| 2 | Not tasty and the texture was just nasty. | 0 |
| 3 | Stopped by during the late May bank holiday of... | 1 |
| 4 | The selection on the menu was great and so wer... | 1 |


In [5]:
dataset.shape

(1000, 2)

## Cleaning the texts

Remove punctuation and stopwords, then apply stemming (or lemmatization).

Finally, tokenize the cleaned text and build the Bag of Words model.

We start by cleaning the first review, at index 0.

In [6]:
import re

In [7]:
dataset['Review'][0]

'Wow... Loved this place.'

In [8]:
#keep only letters in our review
review = re.sub('[^a-zA-Z]', ' ', dataset['Review'][0])

In [10]:
#Now check the first review
review

'Wow    Loved this place '
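
The pattern `[^a-zA-Z]` matches every character that is not a letter, so digits and apostrophes are replaced as well as punctuation. A small sketch on a made-up review (not from the dataset) shows the side effects:

```python
import re

# Hypothetical review with an apostrophe, digits and punctuation.
text = "Didn't like it, 2/10!"

# Every character that is not a letter becomes a space.
letters_only = re.sub('[^a-zA-Z]', ' ', text)

# Lowercasing and splitting then discards the extra whitespace.
tokens = letters_only.lower().split()
print(tokens)  # ['didn', 't', 'like', 'it']
```

Note that the contraction is split into two tokens and the numeric rating disappears entirely, which may or may not be acceptable for a given task.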

In [11]:
#Convert the review to lower case
review = review.lower()

In [12]:
review

'wow    loved this place '

In [13]:
#remove the non-significant words.. (Stopwords)
#import the nltk library
import nltk
nltk.download('stopwords') #download stopwords list

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/ankitsharma/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [14]:
#traverse the review and remove the irrelevant words
review = review.split() #convert the string to a list of words so that we can iterate over it

In [15]:
review

['wow', 'loved', 'this', 'place']

In [16]:
from nltk.corpus import stopwords

review = [word for word in review if word not in stopwords.words('english')] #for longer texts, convert the stopword list to a set for faster lookups
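
`stopwords.words('english')` returns a list, so each `in` test scans it from the start. Converting it to a set once gives much faster lookups on longer texts. The sketch below uses a short, hypothetical stopword list as a stand-in for the NLTK one so that it runs without any download.

```python
# Hypothetical, shortened stopword list standing in for
# stopwords.words('english').
stop_list = ['this', 'is', 'the', 'a', 'and']

# Build the set once, outside the comprehension.
stop_set = set(stop_list)

words = ['wow', 'loved', 'this', 'place']
kept = [word for word in words if word not in stop_set]
print(kept)  # ['wow', 'loved', 'place']
```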

In [17]:
review?

Type:        list
String form: ['wow', 'loved', 'place']
Length:      3


## Stemming

Stemming reduces each word to its root (stem).

Loved --> Love

Amusing --> Amus

Wolves --> Wolv

In [21]:
from nltk.stem.porter import PorterStemmer

ps = PorterStemmer()

review = [ps.stem(word) for word in review if not word in set(stopwords.words('english'))]

In [22]:
review?

Type:        list
String form: ['wow', 'love', 'place']
Length:      3


In [23]:
#Joining the words together separated by space

review = ' '.join(review)

In [26]:
review?

Type:        str
String form: wow love place
Length:      14


* Our dataset has 1000 rows, i.e. 1000 different reviews.
* We need to wrap the cleaning steps in a loop that covers the whole dataset.

In [27]:
#We will create a new list (a collection of cleaned texts)
corpus = []

#Build the stemmer and the stopword set once, outside the loop
ps = PorterStemmer()
stop_words = set(stopwords.words('english'))

#Cleaning process for the whole dataset
for i in range(len(dataset)):
    review = re.sub('[^a-zA-Z]', ' ', dataset['Review'][i]) #keep only letters
    review = review.lower().split() #lower case, then split into words
    review = [ps.stem(word) for word in review if word not in stop_words]
    corpus.append(' '.join(review))

In [30]:
corpus?

Type:        list
String form: ['wow love place', 'crust good', 'tasti textur nasti', 'stop late may bank holiday rick steve rec <...> m think go ninja sushi next time', 'wast enough life pour salt wound draw time took bring check']
Length:      1000


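## Creating the Bag of Words model

The Bag of Words model takes every relevant word from the corpus exactly once and makes it a column of a table. Each review becomes a row, and each cell counts how often that word occurs in the review. Limiting the number of columns with `max_features` keeps the most frequent words and drops the rare ones.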
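
As a rough illustration of the idea, the sketch below builds such a count table by hand for the first three cleaned reviews, using only the standard library. It is a simplified stand-in, not how `CountVectorizer` is implemented.

```python
from collections import Counter

# First three cleaned reviews from the corpus.
docs = ['wow love place', 'crust good', 'tasti textur nasti']

# Vocabulary: every distinct word, sorted alphabetically.
vocab = sorted({word for doc in docs for word in doc.split()})

# One row per review, one column per vocabulary word.
matrix = []
for doc in docs:
    counts = Counter(doc.split())
    matrix.append([counts[word] for word in vocab])

print(vocab)
for row in matrix:
    print(row)
```

Most cells are zero because each review uses only a few words of the vocabulary, which is why such a table is usually stored as a sparse matrix.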

In [37]:
from sklearn.feature_extraction.text import CountVectorizer

In [41]:
cv = CountVectorizer(max_features= 1500)

In [42]:
X = cv.fit_transform(corpus).toarray() #word counts per review; toarray() converts the sparse result to a dense array

In [43]:
X?

Type:        ndarray
String form:
[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]
Length:      1000


In [44]:
#dependent variable

y = dataset.iloc[:, 1].values

## Training and testing our model

In [49]:
#Splitting the dataset into the training set and the test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

### Using Naive Bayes

In [50]:
#Fitting a Naive Bayes classifier to the training set
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(X_train, y_train)

GaussianNB(priors=None)

In [52]:
y_pred = classifier.predict(X_test)

In [53]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)

In [54]:
cm

array([[ 67,  50],
       [ 20, 113]])

In [55]:
acc = (67+113)/250

In [56]:
acc

0.72
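
The same figure can be derived from the matrix without typing the numbers by hand. In scikit-learn's layout the rows are the actual classes and the columns the predicted ones, so the correct predictions sit on the diagonal. A minimal sketch with the matrix copied from the output above:

```python
# Confusion matrix from the Naive Bayes run
# (rows: actual class, columns: predicted class).
cm = [[67, 50],
      [20, 113]]

correct = sum(cm[i][i] for i in range(len(cm)))  # diagonal entries
total = sum(sum(row) for row in cm)              # all test samples
accuracy = correct / total
print(accuracy)  # 0.72
```

With the real arrays at hand, `sklearn.metrics.accuracy_score(y_test, y_pred)` computes the same ratio.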

### Using SVM

In [57]:
#Using Support Vector Machine
from sklearn.svm import SVC

In [58]:
classifier2 = SVC(kernel = 'linear', random_state = 0)
classifier2.fit(X_train, y_train)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='linear',
  max_iter=-1, probability=False, random_state=0, shrinking=True,
  tol=0.001, verbose=False)

In [59]:
y_pred2 = classifier2.predict(X_test)

In [60]:
cm2 = confusion_matrix(y_test, y_pred2)

In [61]:
cm2

array([[94, 23],
       [49, 84]])

In [62]:
acc2 = (94+84)/250
acc2

0.712
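
Accuracy alone hides how the two classifiers differ. The sketch below derives precision and recall for the positive class from the two confusion matrices reported above; the helper name `summarize` is made up for this example.

```python
def summarize(cm):
    """Return accuracy, precision and recall for a 2x2 confusion matrix
    laid out as [[TN, FP], [FN, TP]] (scikit-learn's convention)."""
    (tn, fp), (fn, tp) = cm
    accuracy = (tn + tp) / (tn + fp + fn + tp)
    precision = tp / (tp + fp)  # share of predicted positives that are right
    recall = tp / (tp + fn)     # share of actual positives that are found
    return accuracy, precision, recall

nb = summarize([[67, 50], [20, 113]])  # Naive Bayes matrix
svm = summarize([[94, 23], [49, 84]])  # SVM matrix

for name, (acc, prec, rec) in [('Naive Bayes', nb), ('SVM', svm)]:
    print(f'{name}: accuracy={acc:.3f} precision={prec:.3f} recall={rec:.3f}')
```

On these numbers Naive Bayes finds more of the positive reviews (higher recall), while the SVM is more often right when it predicts a positive review (higher precision).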