# Natural Language Processing

## Assignment: K Nearest Neighbors Model for the IMDB Movie Review Dataset

For the final project, build a K Nearest Neighbors model to predict the sentiment (positive or negative) of movie reviews. The dataset is originally hosted here: http://ai.stanford.edu/~amaas/data/sentiment/

Use the notebooks from class to implement the model, then train and test it on the corresponding datasets.

You can follow these steps:
1. Read training-test data (Given)
2. Train a KNN classifier (Implement)
3. Make predictions on your test dataset (Implement)

__You can use the KNN Classifier from here: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html__
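
As a rough orientation, the three steps above can be sketched on a handful of made-up reviews. The toy texts and labels below are hypothetical placeholders, not rows from the IMDB dataset, and `n_neighbors=1` is chosen only because the toy set is tiny.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import KNeighborsClassifier

# Step 1: "read" the data (hypothetical toy reviews; 1 = positive, 0 = negative)
train_texts = [
    "great movie loved it",
    "wonderful acting great story",
    "terrible movie hated it",
    "boring plot terrible acting",
]
train_labels = [1, 1, 0, 0]
test_texts = ["great story", "terrible plot"]

# Turn each review into a binary bag-of-words vector
vectorizer = CountVectorizer(binary=True)
x_train = vectorizer.fit_transform(train_texts)
x_test = vectorizer.transform(test_texts)  # reuse the vocabulary learned on the training set

# Step 2: train the KNN classifier
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(x_train, train_labels)

# Step 3: predict the sentiment of the unseen reviews
predictions = knn.predict(x_test)
print(predictions)
```

Each test review is assigned the label of the training review that shares the most words with it, which is the core idea behind the classifier used in this assignment.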

## 1. Reading the dataset

We will use the __pandas__ library to read our dataset. Also import any other libraries needed for the rest of the preprocessing and KNN classification here.

In [None]:



import re

import nltk
import numpy as np
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
from nltk.tokenize import word_tokenize
from sklearn import metrics
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

nltk.download('punkt')  # tokenizer models used by word_tokenize
# nltk.download('stopwords')  # uncomment if the stop word corpus is not installed yet

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


#### __Training data:__
Let's read our training data. Here, we have the text and label fields. Label is 1 for positive reviews and 0 for negative reviews.

In [None]:


train_df = pd.read_csv('https://raw.githubusercontent.com/aws-samples/aws-machine-learning-university-accelerated-nlp/master/data/final_project/imdb_train.csv', header=0)
train_df.head()

Unnamed: 0,text,label
0,This movie makes me want to throw up every tim...,0
1,Listening to the director's commentary confirm...,0
2,One of the best Tarzan films is also one of it...,1
3,Valentine is now one of my favorite slasher fi...,1
4,No mention if Ann Rivers Siddons adapted the m...,0


In [None]:

test_df = pd.read_csv('https://raw.githubusercontent.com/aws-samples/aws-machine-learning-university-accelerated-nlp/master/data/final_project/imdb_test.csv', header=0)
test_df.head()

Unnamed: 0,text,label
0,What I hoped for (or even expected) was the we...,0
1,Garden State must rate amongst the most contri...,0
2,There is a lot wrong with this film. I will no...,1
3,"To qualify my use of ""realistic"" in the summar...",1
4,Dirty War is absolutely one of the best politi...,1


In [None]:
#check for nulls in train&test
print('Train Data Null Values',train_df.isnull().sum())
print('Test Data Null Values',test_df.isnull().sum())

Train Data Null Values text     0
label    0
dtype: int64
Test Data Null Values text     0
label    0
dtype: int64


In [None]:
train_df['label'].value_counts()
test_df['label'].value_counts()

0    12500
1    12500
Name: label, dtype: int64

## 2. Train a KNN Classifier
Here, you will apply the pre-processing operations we covered in class. Then you can split your dataset into training and validation sets. For your first submission, you will use the __K Nearest Neighbors Classifier__. It is available [here](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html).
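
To see what such pre-processing does to a single review, here is a minimal sketch. The `clean_review` helper and the tiny `STOP_WORDS` set are hypothetical stand-ins (the notebook itself uses NLTK's English stop word list), and this sketch replaces HTML tags with a space so that neighboring words are not glued together.

```python
import re

# Hypothetical mini stop word list for illustration only
STOP_WORDS = {"the", "a", "is", "this", "and", "was", "it"}

def clean_review(text):
    """Lowercase, strip HTML tags, collapse whitespace, drop stop words and short tokens."""
    if not isinstance(text, str):  # treat missing values as empty text
        return ""
    text = text.lower().strip()
    text = re.sub(r"<.*?>", " ", text)  # replace HTML tags such as <br /> with a space
    text = re.sub(r"\s+", " ", text)    # collapse runs of whitespace
    kept = [w for w in text.split() if w not in STOP_WORDS and len(w) >= 3]
    return " ".join(kept)

example = clean_review("This movie is <br />NOT good,   and the plot was  dull.")
print(example)
```

Note that "not" survives the filtering because it is not in the stop word list; keeping negations matters for sentiment, since "not good" and "good" should not collapse to the same text. Punctuation stays attached to the words because the text is split on whitespace only.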

In [None]:
import nltk, re
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
from nltk.tokenize import word_tokenize
stop = stopwords.words('english')

# These words are important for our problem. We don't want to remove them.
excluding = ['against', 'not', 'don', "don't",'ain', 'aren', "aren't", 'couldn', "couldn't",
             'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 
             'haven', "haven't", 'isn', "isn't", 'mightn', "mightn't", 'mustn', "mustn't",
             'needn', "needn't",'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', 
             "weren't", 'won', "won't", 'wouldn', "wouldn't"]

# New stop word list
stop_words = [word for word in stop if word not in excluding]

snow = SnowballStemmer('english')

def clean_text(z):
    # Treat missing (non-string) values as empty text
    if not isinstance(z, str):
        z = ""
    z = z.lower()  # Lowercase
    z = z.strip()  # Remove leading/trailing whitespace
    z = re.sub(r'\s+', ' ', z)  # Collapse repeated spaces and tabs
    z = re.compile('<.*?>').sub('', z)  # Remove HTML tags
    # Drop stop words and tokens shorter than 3 characters
    return " ".join([i for i in z.split() if i not in stop_words and len(i) >= 3])

# Apply the same cleaning to both datasets, keeping train and test in separate lists
processed_text = [clean_text(z) for z in train_df['text']]
test_processed_text = [clean_text(z) for z in test_df['text']]

In [None]:
processed_text_df = pd.DataFrame({'text':processed_text,'label':list(train_df['label'])})
test_processed_text_df = pd.DataFrame({'text':test_processed_text,'label':list(test_df['label'])})

In [None]:
processed_text_df

In [None]:
X = processed_text_df['text']
Y = processed_text_df['label']

X_train,X_test,y_train,y_test = train_test_split(X,Y,test_size=0.2)
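
The split above draws a different random partition on every run. If reproducible results are wanted, one option is to fix the seed and stratify on the label so both classes keep the same proportion in each part. The sketch below uses hypothetical placeholder texts, and the seed value 42 is arbitrary.

```python
from sklearn.model_selection import train_test_split

# Hypothetical placeholder data: 10 reviews with balanced labels
texts = [f"review {i}" for i in range(10)]
labels = [0, 1] * 5

X_tr, X_val, y_tr, y_val = train_test_split(
    texts,
    labels,
    test_size=0.2,     # hold out 20% for validation
    random_state=42,   # fixed seed so the split is repeatable
    stratify=labels,   # preserve the class balance in both parts
)
print(len(X_tr), len(X_val), sorted(y_val))
```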

In [None]:
# Bag-of-words features used by the KNN cell below
vectorizer = CountVectorizer()
x_train = vectorizer.fit_transform(X_train)
x_test = vectorizer.transform(X_test)
test_data = vectorizer.transform(test_processed_text_df['text'])

from sklearn.pipeline import Pipeline

### PIPELINE ###
##########################
# Alternative formulation: vectorizer and classifier chained in a single estimator
pipeline = Pipeline([
    # max_features=10 keeps only the 10 most frequent terms; raise it for a richer vocabulary
    # Alternative vectorizer: TfidfVectorizer(use_idf=True, max_features=10)
    ('text_vect', CountVectorizer(binary=True, max_features=10)),
    ('knn', KNeighborsClassifier())
])

# Visualize the pipeline
# This will come in handy especially when building more complex pipelines, stringing together multiple preprocessing steps
from sklearn import set_config
set_config(display='diagram')
pipeline
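
The pipeline above is only displayed, not fitted. As a minimal sketch of how such a pipeline could be used, the example below fits one on raw text and predicts directly from strings. The toy reviews are hypothetical, and `n_neighbors=1` is chosen only because the toy set has four rows.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Hypothetical toy reviews (1 = positive, 0 = negative)
toy_texts = [
    "great movie loved it",
    "wonderful acting great story",
    "terrible movie hated it",
    "boring plot terrible acting",
]
toy_labels = [1, 1, 0, 0]

toy_pipeline = Pipeline([
    ("text_vect", CountVectorizer(binary=True)),
    ("knn", KNeighborsClassifier(n_neighbors=1)),
])

# fit() runs fit_transform on the vectorizer, then fits KNN on the resulting matrix
toy_pipeline.fit(toy_texts, toy_labels)

# predict() applies the already-fitted vectorizer before querying the classifier
toy_predictions = toy_pipeline.predict(["wonderful story", "hated it"])
vocab_size = len(toy_pipeline.named_steps["text_vect"].vocabulary_)
print(toy_predictions, vocab_size)
```

Because the vectorizer lives inside the pipeline, its vocabulary is learned only from the data passed to `fit`, which avoids accidentally fitting it on validation or test text.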

In [None]:
knn = KNeighborsClassifier(n_neighbors=60)
knn.fit(x_train,y_train)
y_pred = knn.predict(test_data)
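
The cell above fixes `n_neighbors=60`. One way to choose this value is to compare a few candidates on the validation split. The sketch below does that on synthetic data from `make_classification`, which is an assumption standing in for the vectorized reviews; the candidate values are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the vectorized reviews (illustration only)
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

candidate_ks = [1, 5, 15, 31]  # arbitrary candidates; odd values avoid ties in binary voting
val_scores = {}
for k in candidate_ks:
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(X_tr, y_tr)
    val_scores[k] = model.score(X_val, y_val)  # mean accuracy on the validation split

# Pick the k with the highest validation accuracy
best_k = max(val_scores, key=val_scores.get)
print(val_scores, "best k:", best_k)
```

The test set is left out of this comparison on purpose, so the final test accuracy remains an unbiased estimate.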

In [None]:
confusion_matrix = metrics.confusion_matrix(test_df['label'], y_pred)
print('Accuracy: ',accuracy_score(test_df['label'], y_pred))
confusion_matrix
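
To make the confusion matrix easier to read, here is a small sketch on hypothetical label arrays (not results from the IMDB data). In scikit-learn, rows correspond to true labels and columns to predicted labels, so for binary labels `ravel()` yields true negatives, false positives, false negatives, and true positives in that order.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels for six reviews (1 = positive, 0 = negative)
y_true = [0, 0, 1, 1, 1, 0]
y_hat = [0, 1, 1, 1, 0, 0]

cm = confusion_matrix(y_true, y_hat)
tn, fp, fn, tp = cm.ravel()

# Accuracy is the share of correct predictions: (TN + TP) / total
acc = accuracy_score(y_true, y_hat)
print(cm)
print("accuracy:", acc, "from matrix:", (tn + tp) / cm.sum())
```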