# Classification of Positive & Negative Reviews Using NLP Techniques

### By: Soorya Parthiban

## Problem Statement: We are given review datasets collected from websites such as IMDB, Yelp and Amazon, and must train a machine learning model to predict whether each review is positive or negative.

## Importing Libraries & Dataset

In [2]:
import numpy as np
import pandas as pd 

import warnings
warnings.filterwarnings("ignore")

import re
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\soory\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\soory\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\soory\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [3]:
review_df = pd.read_csv(r"C:\Users\soory\Downloads\review_train_set.csv")

In [4]:
review_df.head()

|   | review | sentiment |
|---|--------|-----------|
| 0 | think robert ryans best film portrayed someone... | 1 |
| 1 | juano hernandez exceptional actor played suppo... | 1 |
| 2 | shocked sign indicate cash | 0 |
| 3 | sat another ten minute finally gave left | 0 |
| 4 | igo charger tip really great | 1 |


In [5]:
review_df.tail()

|   | review | sentiment |
|---|--------|-----------|
| 2055 | food good | 1 |
| 2056 | nicest chinese restaurant ive | 1 |
| 2057 | could believe dirty oyster | 0 |
| 2058 | delicious absolutely back | 1 |
| 2059 | earbud piece break easily | 0 |


In [6]:
review_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2060 entries, 0 to 2059
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   review     2060 non-null   object
 1   sentiment  2060 non-null   int64 
dtypes: int64(1), object(1)
memory usage: 32.3+ KB


In [7]:
review_df.isnull().sum()

review       0
sentiment    0
dtype: int64

In [8]:
print("Length of the dataset: ", len(review_df))

Length of the dataset:  2060


## Data Pre-Processing

### Cleaning & Lemmatizing the Review Text

In [9]:
review_df['review'] = review_df['review'].str.lower()


def remove_special_characters(text, remove_digits=True):
  pattern = r'[^a-zA-Z0-9\s]' if not remove_digits else r'[^a-zA-Z\s]'
  text = re.sub(pattern, '', str(text))
  return text

review_df['review'] = review_df['review'].apply(remove_special_characters, remove_digits=False)

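# Build the stopword set and the lemmatizer once instead of once per word
stop_words = set(stopwords.words('english'))
wnl = WordNetLemmatizer()

def f(r):
  # Tokenize the review, drop English stopwords, and lemmatize the remaining words
  words = nltk.word_tokenize(r)
  lemmatized_words = [wnl.lemmatize(word) for word in words if word not in stop_words]
  return " ".join(lemmatized_words)

review_df['review'] = review_df['review'].apply(f)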
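
To see what the character filter in the cell above does, here is a small standalone sketch that copies `remove_special_characters` and runs it on a made-up review (the sample sentence is an assumption for illustration, not taken from the dataset). Note that the notebook calls the function with `remove_digits=False`, so digits are kept.

```python
import re

def remove_special_characters(text, remove_digits=True):
    # Same logic as the notebook's function: keep letters and whitespace,
    # and keep digits only when remove_digits is False
    pattern = r'[^a-zA-Z0-9\s]' if not remove_digits else r'[^a-zA-Z\s]'
    return re.sub(pattern, '', str(text))

sample = "Great phone!!! 10/10, would buy again."  # hypothetical review text
kept_digits = remove_special_characters(sample, remove_digits=False)
dropped_digits = remove_special_characters(sample, remove_digits=True)

print(kept_digits)     # Great phone 1010 would buy again
print(dropped_digits)  # digits removed too, which leaves a double space
```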

In [10]:
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer()
data_X = cv.fit_transform(review_df['review']).toarray()
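# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() exists from 1.0 onward
data_X = pd.DataFrame(data_X, columns=cv.get_feature_names_out())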
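
`CountVectorizer` turns each review into a row of word counts over the vocabulary learned from the text it was fitted on. The following is a simplified pure-Python sketch of that idea, assuming whitespace tokenization and two made-up reviews; the real class applies its own token pattern and options, so treat this as an illustration rather than a reimplementation.

```python
from collections import Counter

docs = ["food good", "good good service"]  # hypothetical cleaned reviews

# Vocabulary: every distinct word, sorted alphabetically
vocab = sorted({word for doc in docs for word in doc.split()})

# One row per review, one column per vocabulary word, cell = word count
matrix = []
for doc in docs:
    counts = Counter(doc.split())
    matrix.append([counts[word] for word in vocab])

print(vocab)   # ['food', 'good', 'service']
print(matrix)  # [[1, 1, 0], [0, 2, 1]]
```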

In [12]:
X = data_X
y = review_df.sentiment

In [13]:
from sklearn.model_selection import train_test_split
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3)

In [14]:
print(X_train.shape, X_val.shape)

(1442, 3977) (618, 3977)
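
The split above is random, so the validation scores change from run to run. A possible variation, assuming reproducibility and the same class ratio in both parts are wanted, is to pass `random_state` and `stratify`. The toy data and the seed value 42 are arbitrary choices for illustration; using these arguments in the notebook would change the scores it reports.

```python
from sklearn.model_selection import train_test_split

# Toy stand-ins for the feature matrix and labels (20 samples, 10 per class)
X_toy = [[i] for i in range(20)]
y_toy = [0, 1] * 10

# random_state fixes the shuffle; stratify keeps the class ratio in both parts
X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)

print(len(X_tr), len(X_va))  # 14 6
print(sum(y_va))             # 3 positives out of 6 validation samples
```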


## Building the ML Models

In [15]:
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, f1_score

### 1. Decision Tree  

In [16]:
from sklearn.tree import DecisionTreeClassifier

In [17]:
model1 = DecisionTreeClassifier()

In [18]:
model1.fit(X_train, y_train)

DecisionTreeClassifier()

In [19]:
y_preds_model1 = model1.predict(X_val)

In [20]:
print("Decision Tree model's accuracy: ", accuracy_score(y_val, y_preds_model1))

Decision Tree model's accuracy:  0.7055016181229773


In [23]:
print("Decision Tree model's F1 Score: ", f1_score(y_val, y_preds_model1)) 

Decision Tree model's F1 Score:  0.6936026936026936


In [22]:
print("Decision Tree model's Classification Report: \n", classification_report(y_val, y_preds_model1))

Decision Tree model's Classification Report: 
               precision    recall  f1-score   support

           0       0.69      0.75      0.72       308
           1       0.73      0.66      0.69       310

    accuracy                           0.71       618
   macro avg       0.71      0.71      0.71       618
weighted avg       0.71      0.71      0.71       618
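
The F1 score is the harmonic mean of precision and recall. As a quick check, the rounded class-1 values from the report above (precision 0.73, recall 0.66) give roughly the same figure as the reported F1. The helper `f1_from` is a hypothetical name used only for this sketch, and because the inputs are rounded the result is approximate.

```python
def f1_from(precision, recall):
    # Harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)

# Rounded class-1 values copied from the Decision Tree report
f1_class1 = f1_from(0.73, 0.66)
print(round(f1_class1, 2))  # 0.69
```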



### 2. Random Forest  

In [24]:
from sklearn.ensemble import RandomForestClassifier

In [25]:
model2 = RandomForestClassifier(n_estimators=3000)

In [26]:
model2.fit(X_train, y_train)

RandomForestClassifier(n_estimators=3000)

In [27]:
y_preds_model2 = model2.predict(X_val)

In [28]:
print("Random Forest model's accuracy: ", accuracy_score(y_val, y_preds_model2))

Random Forest model's accuracy:  0.7475728155339806


In [29]:
print("Random Forest model's F1 Score: ", f1_score(y_val, y_preds_model2)) 

Random Forest model's F1 Score:  0.7214285714285714


In [30]:
print("Random Forest model's Classification Report: \n", classification_report(y_val, y_preds_model2))

Random Forest model's Classification Report: 
               precision    recall  f1-score   support

           0       0.71      0.84      0.77       308
           1       0.81      0.65      0.72       310

    accuracy                           0.75       618
   macro avg       0.76      0.75      0.75       618
weighted avg       0.76      0.75      0.75       618



### 3. Extra Trees

In [31]:
from sklearn.ensemble import ExtraTreesClassifier

In [32]:
model3 = ExtraTreesClassifier(n_estimators=3000)

In [33]:
model3.fit(X_train, y_train)

ExtraTreesClassifier(n_estimators=3000)

In [34]:
y_preds_model3 = model3.predict(X_val)

In [35]:
print("Extra Tree model's accuracy: ", accuracy_score(y_val, y_preds_model3))

Extra Tree model's accuracy:  0.7572815533980582


In [36]:
print("Extra Tree model's F1 Score: ", f1_score(y_val, y_preds_model3)) 

Extra Tree model's F1 Score:  0.7448979591836735


In [44]:
print("Extra Tree model's Classification Report: \n", classification_report(y_val, y_preds_model3))

Extra Tree model's Classification Report: 
               precision    recall  f1-score   support

           0       0.73      0.81      0.77       308
           1       0.79      0.71      0.74       310

    accuracy                           0.76       618
   macro avg       0.76      0.76      0.76       618
weighted avg       0.76      0.76      0.76       618



## Predicting the Output for the Test Dataset

In [48]:
test_data = pd.read_csv(r"C:\Users\soory\Downloads\review_test_set.csv")

In [49]:
test_data.head()

|   | review |
|---|--------|
| 0 | great service food |
| 1 | pairing iphone could happier far |
| 2 | plot hole pair fishnet stocking direction edit... |
| 3 | surely doesnt know make coherent action movie ... |
| 4 | still quite interesting entertaining follow |


In [50]:
test_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 686 entries, 0 to 685
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   review  686 non-null    object
dtypes: object(1)
memory usage: 5.5+ KB


In [51]:
test_data['review'] = test_data['review'].str.lower()

# Reuse remove_special_characters() and f() from the training pre-processing cell
# so the test reviews are cleaned exactly like the training reviews
test_data['review'] = test_data['review'].apply(remove_special_characters, remove_digits=False)
test_data['review'] = test_data['review'].apply(f)

In [52]:
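test = cv.transform(test_data['review']).toarray()
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() exists from 1.0 onward
test_data = pd.DataFrame(test, columns=cv.get_feature_names_out())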
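
Calling `cv.transform` (not `fit_transform`) on the test reviews reuses the vocabulary learned from the training data, so the test matrix has the same columns and any word never seen in training is ignored. A minimal sketch with made-up reviews, assuming a fresh vectorizer named `toy_cv`:

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_cv = CountVectorizer()
toy_cv.fit(["food good", "good service"])  # hypothetical training reviews

# Column order follows the learned vocabulary: food, good, service
columns = sorted(toy_cv.vocabulary_, key=toy_cv.vocabulary_.get)

# "pizza" was not seen during fit, so it gets no column and no count
row = toy_cv.transform(["good pizza"]).toarray().tolist()

print(columns)  # ['food', 'good', 'service']
print(row)      # [[0, 1, 0]]
```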

In [53]:
target = model3.predict(test_data)

In [55]:
# target holds model3's predictions for the unseen test reviews
res = pd.DataFrame(target)
res.columns = ["prediction"]
# Save the predictions as submission.csv in the notebook's working directory
res.to_csv("submission.csv", index=False)
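
As an alternative design, the vectorizer and the classifier could be bundled in a scikit-learn `Pipeline`, so cleaned text goes in and predictions come out in one call. This is only a sketch on made-up reviews with a small forest (50 trees and `random_state=0` are arbitrary assumptions), not the configuration used above.

```python
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline

# Hypothetical cleaned reviews and labels (1 = positive, 0 = negative)
train_text = ["food good", "great service", "bad food", "terrible service"]
train_labels = [1, 1, 0, 0]

pipe = Pipeline([
    ("vectorizer", CountVectorizer()),  # text -> word-count matrix
    ("model", ExtraTreesClassifier(n_estimators=50, random_state=0)),
])
pipe.fit(train_text, train_labels)

# The pipeline applies the fitted vectorizer before predicting
preds = pipe.predict(["good food", "terrible"])
print(len(preds))
```

Keeping both steps in one object reduces the risk of transforming the test data with a different vocabulary than the one the model was trained on.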