# Natural Language Processing

`Natural Language Processing (NLP) is the application of Machine Learning models to text and language. Its focus is teaching machines to understand spoken and written words. Whenever you dictate something into your iPhone or Android device and it is converted to text, that’s an NLP algorithm in action.`

`You can also use NLP on a text review to predict whether the review is good or bad. You can use NLP on an article to predict which category it belongs to when you are segmenting a set of articles. You can use NLP on a book to predict its genre. And it can go further: you can use NLP to build a machine translator or a speech recognition system, and in that last example you use classification algorithms to classify language. Speaking of classification algorithms, most NLP algorithms are classification models, and they include Logistic Regression, Naive Bayes, CART (a model based on decision trees), Maximum Entropy (closely related to Logistic Regression), and Hidden Markov Models (models based on Markov processes).`

#### CSV vs TSV:
Comma-separated values vs tab-separated values.<br>

### We use reviews collected for a restaurant and apply NLP to predict whether a new review is positive or negative.
### Importing the data
<b>We can't use the CSV format here: commas can appear inside the reviews themselves, so they can't serve as delimiters. We use the TSV format instead.</b><br>
To read it we still use `pd.read_csv('',delimiter='\t',quoting=3)`, with a tab as the delimiter.<br>
The `quoting=3` parameter (the value of `csv.QUOTE_NONE`) tells pandas to treat the double quotes in the reviews as ordinary characters.<br><p>
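
To see why the tab delimiter and `quoting=3` matter, here is a minimal sketch that uses only Python's standard `csv` module on a made-up two-row sample (the sample text is hypothetical and is not taken from `Restaurant_Reviews.tsv`). The value `3` is `csv.QUOTE_NONE`, so the same constant can be passed to `pd.read_csv` by name.

```python
import csv
import io

# Hypothetical sample in a Review<TAB>Liked layout; not the real dataset.
sample = 'Review\tLiked\n"Great" food, friendly staff\t1\nCrust is not good.\t0\n'

# quoting=3 is the numeric value of csv.QUOTE_NONE:
# double quotes are treated as ordinary characters.
assert csv.QUOTE_NONE == 3

reader = csv.reader(io.StringIO(sample), delimiter='\t', quoting=csv.QUOTE_NONE)
header, *rows = reader

# The comma and the quotes stay inside the review text.
for text, liked in rows:
    print(repr(text), liked)
```

Because only the tab separates fields, the comma and the quotes in the first review survive untouched.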

<b>We can use NLP not just for reviews but also for articles, written speech, or other discourse. Discourse analysis is a part of NLP.</b>

### Cleaning the data:

<b>We clean the text because we need only the relevant words to build the bag of words.</b> <br>
We will get rid of articles, numbers, and punctuation such as `...`, and apply stemming, i.e. reduce words that mean the same thing to a common root. We will also convert everything to lowercase.<br>
Finally we will build <b>a bag of words model</b> from the result through <b>tokenization</b>, i.e. splitting each text into its `relevant words`. We then assign one column to each word, and for each review each column holds the number of times the associated word appears in that review. Hence lots of 0's, some 1's, and fewer 2's and 3's. This produces a <b>sparse matrix</b>; by excluding punctuation, numbers, and stopwords we reduce the number of columns (dimensions) of the sparse matrix we have to deal with. <br><p> 
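
As a rough illustration of the bag-of-words idea described above, the sketch below builds the count matrix by hand with the standard library. The three cleaned reviews are made up for the example and the helper name `bag_of_words` is hypothetical; it is only meant to show the shape of the result, not how any library implements it.

```python
from collections import Counter

def bag_of_words(corpus):
    """Return (vocabulary, matrix): one column per unique word, one row per review."""
    vocabulary = sorted({word for review in corpus for word in review.split()})
    matrix = []
    for review in corpus:
        counts = Counter(review.split())          # word -> occurrences in this review
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix

# Made-up, already-cleaned reviews (assumption for illustration).
corpus = ['wow love place', 'crust good', 'love food love place']
vocabulary, matrix = bag_of_words(corpus)

print(vocabulary)
for row in matrix:
    print(row)
```

Most cells are 0 even in this tiny example, which is the sparsity the text refers to.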


In [1]:
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt

#importing the dataset (The TSV file, Tab separated values)

dataset = pd.read_csv('Restaurant_Reviews.tsv',delimiter='\t',quoting=3)

In [2]:
dataset['Review'][0]

'Wow... Loved this place.'

<b>First we replace everything except letters with a space. For the 1st review, for example:</b>

In [3]:
import re 
review = re.sub('[^a-zA-Z]',' ',dataset['Review'][0])

In [4]:
review

'Wow    Loved this place '

<b>Let's convert the capital letters to lowercase:</b>

In [5]:
review = review.lower()
review

'wow    loved this place '
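
The pattern `[^a-zA-Z]` matches any single character that is not an ASCII letter, and each match is replaced by one space, which is why `'Wow...'` leaves a run of spaces behind. A small sketch on a made-up sentence (the sentence is an assumption, not from the dataset) shows the same effect on digits and apostrophes:

```python
import re

text = "Food was 10/10, I'll be back!"   # hypothetical review text

letters_only = re.sub('[^a-zA-Z]', ' ', text)   # digits and punctuation become spaces
lowered = letters_only.lower()
words = lowered.split()                          # split() collapses runs of whitespace

print(repr(letters_only))
print(words)
```

Note that the apostrophe in `I'll` is replaced too, so the word is split into two tokens.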

<b>Let's remove the words that are not very relevant to the ML algorithm (words like the, a, ...). <i>We use the NLTK library and its tools to achieve this. Its stopwords list contains the irrelevant words; to get it we use `nltk.download('stopwords')`.</i> We then check every review and delete any stopwords we find.</b>

In [6]:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/prashant/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [7]:
review = review.split()
review

['wow', 'loved', 'this', 'place']

<b>Here 'this' is a stopword, so we remove it. We convert the list of stopwords to a set to make the lookup faster.</b>


In [9]:
#review = [word for word in review if not word in set(stopwords.words('english'))]
#review
#['wow', 'loved', 'place']
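
A minimal sketch of the filtering step, using a tiny hand-written stop-word set so that it runs without downloading anything (the set `STOP_WORDS` and the helper `remove_stop_words` are stand-in assumptions; the real NLTK list is much longer). Building the set once, outside the comprehension, avoids rebuilding it for every word.

```python
# Stand-in for set(stopwords.words('english')); NOT the real NLTK list.
STOP_WORDS = {'this', 'the', 'a', 'an', 'is', 'was', 'and'}

def remove_stop_words(words, stop_words=STOP_WORDS):
    """Keep only the words that are not in the stop-word set."""
    return [word for word in words if word not in stop_words]

review = ['wow', 'loved', 'this', 'place']
filtered = remove_stop_words(review)
print(filtered)  # ['wow', 'loved', 'place']
```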

### Stemming (reducing words to their roots so that each word is represented by its root):
`from nltk.stem.porter import PorterStemmer`<br>
is used to stem the words:<br>
`ps = PorterStemmer()`<br>
`review = [ps.stem(word) for word in review if not word in set(stopwords.words('english'))]`

In [8]:
from nltk.stem.porter import PorterStemmer
ps = PorterStemmer()
review = [ps.stem(word) for word in review if not word in set(stopwords.words('english'))]

In [9]:
review

['wow', 'love', 'place']

<b>Let's join the words back into a single string:</b>


In [10]:
review = ' '.join(review)

In [12]:
review

'wow love place'

<b>We perform this cleaning for all the reviews and collect the results in the list `corpus`.</b>

## Creating the bag of words:
We take all the words from all the reviews, without repetition, and build a sparse matrix whose columns are those unique words. This is <b>tokenization</b>.<br>
Each row corresponds to one review, and each cell holds the number of times that word appears in the review (0 if it does not appear).<br>
<p>
    We then train the model on the sparse matrix and the outcome column, which makes this a classification problem with a binary outcome. <p>
        
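sklearn.feature_extraction.text---> CountVectorizer
<br>
`cv = CountVectorizer()`<br>
`X = cv.fit_transform(corpus).toarray()`<br>
toarray()---> converts the SciPy sparse matrix returned by `fit_transform` into a regular (dense) NumPy array.<br>
X is now our matrix of word counts, most of whose entries are 0.<br>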

For `CountVectorizer(max_features=1500)`, this parameter keeps only the 1500 most frequent words. This reduces the sparsity and limits the vocabulary to the words most useful for training our model.<br>
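
One way to picture `max_features` is as keeping only the most frequent words across the whole corpus. The sketch below does that with `collections.Counter`; the corpus and the helper name `top_vocabulary` are made up, and scikit-learn's own tie-breaking between equally frequent words may differ from the alphabetical rule assumed here.

```python
from collections import Counter

def top_vocabulary(corpus, max_features):
    """Return the max_features most frequent words across all reviews."""
    totals = Counter(word for review in corpus for word in review.split())
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(totals.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:max_features]]

# Made-up, already-cleaned reviews (assumption for illustration).
corpus = ['wow love place', 'love food', 'love place good food place']
print(top_vocabulary(corpus, max_features=3))  # ['love', 'place', 'food']
```

Words that appear only once, such as `wow` and `good` here, are the first to be dropped.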

Dimensionality reduction can also be used to simplify it further.<br> 
<p>
Now we use the <b>Naive Bayes classification model</b>, with X and y to train on.
The model predicts the reviews with 73% accuracy.
<br> 
<p>
Again, using the <b>KNN classification model</b> with X and y to train on,
the model predicts the reviews with 78% accuracy.
<br> 
<p>
Again, using the <b>SVM classification model</b> with X and y to train on,
the model predicts the reviews with 89% accuracy.
<br> 
<p>

Again, using the <b>Decision Tree (DT) classification model</b> with X and y to train on,
the model predicts the reviews with 90.5% accuracy.

<br> 
<p>
Again, using the <b>Random Forest classification model</b> with X and y to train on,
the model predicts the reviews with 85% accuracy.


In [None]:
#Natural Language processing

#Importing the libraries

import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt

#importing the dataset (The TSV file, Tab separated values)

dataset = pd.read_csv('Restaurant_Reviews.tsv',delimiter='\t',quoting=3) 

#Cleaning the texts

import re
import nltk
nltk.download('stopwords') 
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

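# Build the stemmer and the stop-word set once instead of on every iteration
ps = PorterStemmer()
stop_words = set(stopwords.words('english'))

corpus = []
for i in range(len(dataset)):
    # Keep letters only, lowercase, and split into words
    review = re.sub('[^a-zA-Z]',' ',dataset['Review'][i])
    review = review.lower()
    review = review.split()
    # Drop stop words and stem what remains
    review = [ps.stem(word) for word in review if word not in stop_words]
    review = ' '.join(review)
    corpus.append(review)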

#Creating the Bag of Words model through the process of Tokenization.
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features=1500)
X = cv.fit_transform(corpus).toarray()
y = dataset.iloc[:,1].values

# Training a Classification model with the independent variable vector and dependent variable vector
# We use Naive Bayes or DT or Random forest for NLP

# Here we use Naive Bayes

# Splitting the dataset into the Training set and Test set
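# train_test_split lives in sklearn.model_selection (sklearn.cross_validation was removed in scikit-learn 0.20)
from sklearn.model_selection import train_test_split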
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 0)


# Fitting Naive Bayes to the Training set
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(X_train, y_train)

# Predicting the Test set results
y_pred = classifier.predict(X_test)

# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
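
The accuracy figures quoted earlier can be read off a confusion matrix like `cm`. A short sketch, assuming scikit-learn's layout for binary labels 0 and 1 (rows are true classes, columns are predicted classes) and using made-up counts rather than results from this dataset:

```python
# Made-up 2x2 confusion matrix for a 200-review test set (not a real result).
# Layout: [[TN, FP],
#          [FN, TP]]
cm_example = [[60, 40],
              [10, 90]]

(tn, fp), (fn, tp) = cm_example
total = tn + fp + fn + tp

accuracy = (tn + tp) / total    # share of all reviews classified correctly
precision = tp / (tp + fp)      # of reviews predicted positive, share truly positive
recall = tp / (tp + fn)         # of truly positive reviews, share found

print(f'accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}')
```

Accuracy alone can hide an imbalance between the two kinds of error, which is why precision and recall are shown alongside it.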