### Natural Language Processing
We will train a supervised model to predict whether a movie review is positive or negative.

#### **Dataset loading & dev/test splits**

**1.0) Load the movie reviews dataset from the NLTK library**

In [1]:
import re
import string
import warnings

import nltk
import pandas as pd
from nltk.corpus import movie_reviews, stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

warnings.filterwarnings('ignore')

# Fetch the corpus, stop-word list, and tokenizer models used below
nltk.download("movie_reviews")
nltk.download('stopwords')
nltk.download('punkt')

stop = stopwords.words('english')

[nltk_data] Downloading package movie_reviews to
[nltk_data]     C:\Users\avitr\AppData\Roaming\nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\avitr\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\avitr\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [2]:
negative_fileids = movie_reviews.fileids('neg')
positive_fileids = movie_reviews.fileids('pos')

# Lists of positive and negative reviews, each review joined back into one string
pos_list = [' '.join(movie_reviews.words(file_id)) for file_id in movie_reviews.fileids('pos')]
neg_list = [' '.join(movie_reviews.words(file_id)) for file_id in movie_reviews.fileids('neg')]

**1.1) Make a data frame that holds the reviews and their labels**

In [3]:
# Label 1 = positive review, 0 = negative review
df = pd.DataFrame({'Reviews':pos_list+neg_list,'Label':[1]*len(pos_list)+[0]*len(neg_list)})
df

| | Reviews | Label |
|---|---|---|
| 0 | films adapted from comic books have had plenty... | 1 |
| 1 | every now and then a movie comes along from a ... | 1 |
| 2 | you ' ve got mail works alot better than it de... | 1 |
| 3 | " jaws " is a rare film that grabs your attent... | 1 |
| 4 | moviemaking is a lot like being the general ma... | 1 |
| ... | ... | ... |
| 1995 | if anything , " stigmata " should be taken as ... | 0 |
| 1996 | john boorman ' s " zardoz " is a goofy cinemat... | 0 |
| 1997 | the kids in the hall are an acquired taste . i... | 0 |
| 1998 | there was a time when john carpenter was a gre... | 0 |


In [4]:
df.to_csv('movie_review.csv',index=False)

**1.2) Look at the class distribution of the movie reviews**

In [4]:
# code here
df['Label'].value_counts()

1    1000
0    1000
Name: Label, dtype: int64

**1.3) Create a development & test split (80/20 ratio):**

In [5]:
# code here
X = df['Reviews']
y = df['Label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
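
The split above shuffles the rows at random, so the class ratio in each part can drift slightly from 50/50. As a minimal sketch on a toy frame (the `toy` data is made up for illustration, and `stratify` is an option the notebook itself does not use), passing the labels to `stratify` keeps the ratio identical in both parts:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the review frame: 10 positive and 10 negative labels.
toy = pd.DataFrame({
    "Reviews": [f"review {i}" for i in range(20)],
    "Label": [1] * 10 + [0] * 10,
})

X_tr, X_te, y_tr, y_te = train_test_split(
    toy["Reviews"], toy["Label"],
    test_size=0.20, random_state=42,
    stratify=toy["Label"],  # assumption: not part of the notebook's own split
)

# Count how many examples of each class ended up in the test part.
test_counts = y_te.value_counts().to_dict()
print(test_counts)
```

With balanced classes like the ones here the effect is small, but it removes one source of variation when comparing models on a single split.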

#### **Data preprocessing**
We will do some data preprocessing before we tokenize the data: remove the `#` symbol, hyperlinks, stop words, and punctuation. You may use the `re` package for this.

**1.4) Replace the `#` symbol with '' in every review**

In [6]:
# Strip every '#' character; the text that followed it is kept
f = lambda x: re.sub(r"#+", "", x)
X_train = X_train.map(f)
X_test = X_test.map(f)

**1.5) Replace hyperlinks with '' in every review**

In [7]:
# Drop every whitespace-delimited token that starts with 'http'
f = lambda x: re.sub(r"http\S+", "", x)
X_train = X_train.map(f)
X_test = X_test.map(f)
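
To see what the two substitutions in 1.4 and 1.5 do to a single string, here is a small sketch. The helper name `strip_hash_and_links` and the sample sentence are made up for illustration. Note that the `http\S+` pattern only catches links that start with `http`, so a bare `www.` address is left in place.

```python
import re

def strip_hash_and_links(text):
    """Hypothetical helper combining the substitutions from steps 1.4 and 1.5."""
    text = re.sub(r"#+", "", text)        # drop '#' characters, keep the tag word
    text = re.sub(r"http\S+", "", text)   # drop tokens that start with 'http'
    return text

sample = "loved it #mustsee see https://example.com/review or www.example.com"
cleaned = strip_hash_and_links(sample)
print(cleaned)
```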

**1.6) Remove all stop words**

In [8]:
# Tokenize each review, then keep only the tokens that are not English stop words
stop_words = set(stopwords.words("english"))
X_train_tokens = X_train.map(word_tokenize)
X_test_tokens = X_test.map(word_tokenize)

f = lambda tokens: ' '.join([token for token in tokens if token not in stop_words])

X_train = X_train_tokens.map(f)
X_test = X_test_tokens.map(f)
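
The membership test above is case-sensitive, and NLTK's English stop-word list is lowercase. That is fine for this corpus because the review text is already lowercase, but it matters on other data. A minimal sketch, using a tiny made-up stop list and a whitespace split in place of `word_tokenize`:

```python
# Tiny illustrative stop list; the notebook uses NLTK's full English list instead.
toy_stop_words = {"the", "is", "a", "of"}

def drop_stop_words(text, stop_words, lowercase=False):
    """Hypothetical helper: remove stop words from a whitespace-tokenized string."""
    tokens = text.split()  # whitespace split stands in for word_tokenize
    if lowercase:
        tokens = [t.lower() for t in tokens]
    return " ".join(t for t in tokens if t not in stop_words)

as_is = drop_stop_words("The plot is a mess", toy_stop_words)
lowered = drop_stop_words("The plot is a mess", toy_stop_words, lowercase=True)
print(as_is)
print(lowered)
```

Without lowercasing, the capitalized "The" survives the filter; with it, only the content words remain.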

**1.7) Remove all punctuation**

In [9]:
# Replace every non-letter character (punctuation, digits, whitespace) with a space
def remove_punctuation(text):
    return re.sub('[^a-zA-Z]', ' ', str(text))

X_train = X_train.map(remove_punctuation)
X_test = X_test.map(remove_punctuation)
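
The pattern `[^a-zA-Z]` matches anything that is not an ASCII letter, so digits disappear along with punctuation, and each removed character leaves a space behind. The sketch below shows this on a made-up sentence; `collapse_spaces` is an optional extra step that is not part of the notebook.

```python
import re

def remove_non_letters(text):
    # Same substitution as remove_punctuation above.
    return re.sub(r"[^a-zA-Z]", " ", str(text))

def collapse_spaces(text):
    # Optional tidy-up (an addition for illustration): squeeze runs of whitespace.
    return re.sub(r"\s+", " ", text).strip()

raw = "a 10/10 film , truly one - of - a - kind !"
step1 = remove_non_letters(raw)
step2 = collapse_spaces(step1)
print(repr(step1))
print(step2)
```

The extra spaces are harmless for the later steps, since both `word_tokenize` and the vectorizers split on whitespace anyway.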

**1.8) Apply stemming on the development & test datasets using the Porter algorithm**

In [10]:
# Reduce every token to its Porter stem
ps = PorterStemmer()

X_train_tokens = X_train.map(word_tokenize)
X_test_tokens = X_test.map(word_tokenize)

f = lambda tokens: [ps.stem(token) for token in tokens]
X_train = X_train_tokens.map(lambda x: ' '.join(f(x)))
X_test = X_test_tokens.map(lambda x: ' '.join(f(x)))
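
Porter stemming strips suffixes by rule, so the result is often not a dictionary word. A short sketch on a made-up string (the helper `stem_text` is hypothetical, and a whitespace split stands in for `word_tokenize`):

```python
from nltk.stem import PorterStemmer

ps = PorterStemmer()

def stem_text(text):
    """Hypothetical helper: stem each whitespace-separated token and rejoin."""
    return " ".join(ps.stem(tok) for tok in text.split())

stemmed = stem_text("movies acting wonderful running")
print(stemmed)
```

Different surface forms of a word collapse to one stem, which shrinks the vocabulary that the vectorizers in the next section have to learn.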

#### **Model training**

**1.9) Create bag of words features for each review in the development dataset**

In [11]:
#code here
bow_vectorizer = CountVectorizer()
X_train_bow = bow_vectorizer.fit_transform(X_train)
X_train_bow = X_train_bow.toarray()
X_train_bow

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)
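
The matrix is almost entirely zeros because each review uses only a small part of the vocabulary. To make the layout concrete, here is a hand-rolled sketch of the same idea on two made-up documents. It is an illustration of bag-of-words counting, not how `CountVectorizer` is implemented, and it skips that class's defaults such as lowercasing and ignoring single-character tokens.

```python
from collections import Counter

docs = ["good movi good act", "bad movi"]

# Vocabulary: sorted unique tokens; position in the list is the column index.
vocab = sorted({tok for doc in docs for tok in doc.split()})

def bow_row(doc):
    """Count how often each vocabulary term occurs in one document."""
    counts = Counter(doc.split())
    return [counts.get(tok, 0) for tok in vocab]

matrix = [bow_row(d) for d in docs]
print(vocab)
print(matrix)
```

Each row is one document and each column one term, so the first row records that "good" appears twice.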

**1.10) Train a Logistic Regression model on the development dataset**

In [12]:
#code here
bow_clf = LogisticRegression(random_state=42)
bow_clf.fit(X_train_bow,y_train)

**1.11) Create TF-IDF features for each review in the development dataset**

In [13]:
#code here
tfidf_vectorizer = TfidfVectorizer()
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_train_tfidf = X_train_tfidf.toarray()
X_train_tfidf

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])
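
TF-IDF keeps the same columns as bag of words but reweights them: terms that occur in many documents are scaled down, and each row is normalized. The sketch below works this out by hand on two made-up documents, assuming the smoothed formula `idf = ln((1 + n) / (1 + df)) + 1` followed by L2 row normalization, which is my understanding of the `TfidfVectorizer` defaults. Treat it as an illustration rather than the library's implementation.

```python
import math
from collections import Counter

docs = ["good movi good act", "bad movi"]
vocab = sorted({tok for doc in docs for tok in doc.split()})
n_docs = len(docs)

# Document frequency: number of documents that contain each term.
df = {t: sum(t in doc.split() for doc in docs) for t in vocab}
# Smoothed inverse document frequency (assumed default formula, see above).
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

def tfidf_row(doc):
    """Term count times idf, then scale the row to unit L2 length."""
    counts = Counter(doc.split())
    raw = [counts.get(t, 0) * idf[t] for t in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]

rows = [tfidf_row(d) for d in docs]
for row in rows:
    print([round(v, 3) for v in row])
```

In the second document, "bad" gets a larger weight than "movi" even though both occur once, because "movi" appears in every document.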

**1.12) Train the Logistic Regression model on the development dataset with TF-IDF features**

In [14]:
#code here
tfidf_clf = LogisticRegression(random_state=42)
tfidf_clf.fit(X_train_tfidf,y_train)

**1.13) Compare the performance of the two models on the test dataset. Explain the difference in the results obtained.**

In [15]:
# Transform the test set with the vectorizers fitted on the development set, then score

X_test_bow = bow_vectorizer.transform(X_test)
y_pred_bow = bow_clf.predict(X_test_bow)
acc_bow = accuracy_score(y_test,y_pred_bow)

X_test_tfidf = tfidf_vectorizer.transform(X_test)
y_pred_tfidf = tfidf_clf.predict(X_test_tfidf)
acc_tfidf = accuracy_score(y_test,y_pred_tfidf)

In [16]:
results = pd.DataFrame({
    'Model': ['LogisticRegression + BOW', 'LogisticRegression + Tf-Idf'],
    'Accuracy (%)': [acc_bow*100,acc_tfidf*100]})

result_df = results.sort_values(by='Accuracy (%)', ascending=False)
result_df = result_df.set_index('Model')
result_df

| Model | Accuracy (%) |
|---|---|
| LogisticRegression + BOW | 82.25 |
| LogisticRegression + Tf-Idf | 81.0 |


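- The BOW model performs slightly better than the Tf-Idf model on the test data: 82.25% versus 81% accuracy, a gap of only 1.25 percentage points. Both models use the same vocabulary and the same classifier, so the features differ only in how terms are weighted; a gap this small may simply reflect this particular split rather than a real advantage for raw counts.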
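
To put the gap in perspective, the accuracies can be converted back into review counts. This is a back-of-the-envelope sketch that assumes the test set is the 20% split of the 2,000 reviews made in 1.3:

```python
n_test = 2000 * 20 // 100          # 20% of 2,000 reviews held out for testing
acc_bow, acc_tfidf = 0.8225, 0.81  # accuracies reported in the table above

# Number of correctly classified test reviews for each model.
correct_bow = round(acc_bow * n_test)
correct_tfidf = round(acc_tfidf * n_test)
gap_in_reviews = correct_bow - correct_tfidf

print(n_test, correct_bow, correct_tfidf, gap_in_reviews)
```

A difference of this size is a handful of reviews, so repeating the comparison over several random splits or with cross-validation would give a firmer basis for ranking the two feature sets.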