### Project Description

The dataset contains IMDB movie reviews labelled as positive or negative, with 12,500 reviews per class (25,000 in total). The goal of this project is to classify each review as positive or negative.

#### Read the data

In [1]:
import pandas as pd

data = pd.read_csv('IMDB_review.csv')
data.head()

Unnamed: 0,review,sentiment
0,I thought this was a wonderful way to spend ti...,positive
1,"Probably my all-time favorite movie, a story o...",positive
2,I sure would like to see a resurrection of a u...,positive
3,"This show was an amazing, fresh & innovative i...",negative
4,Encouraged by the positive comments about this...,negative


In [2]:
data.shape

(25000, 2)

In [3]:
data['sentiment'].value_counts()

negative    12500
positive    12500
Name: sentiment, dtype: int64

#### Remove punctuation

In [4]:
import string
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [5]:
def remove_punct(text):
    text_nopunct = "".join([char for char in text if char not in string.punctuation])
    return text_nopunct

data['review_nopunct'] = data['review'].apply(lambda x : remove_punct(x))

data.head()

Unnamed: 0,review,sentiment,review_nopunct
0,I thought this was a wonderful way to spend ti...,positive,I thought this was a wonderful way to spend ti...
1,"Probably my all-time favorite movie, a story o...",positive,Probably my alltime favorite movie a story of ...
2,I sure would like to see a resurrection of a u...,positive,I sure would like to see a resurrection of a u...
3,"This show was an amazing, fresh & innovative i...",negative,This show was an amazing fresh innovative ide...
4,Encouraged by the positive comments about this...,negative,Encouraged by the positive comments about this...


A new column, `review_nopunct`, holds each review with the punctuation marks removed.
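As a rough sketch, the same punctuation stripping can be done with `str.translate`, which avoids the per-character Python loop. The name `remove_punct_fast` is an assumption for illustration and is not part of the notebook.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def remove_punct_fast(text):
    """Hypothetical alternative to remove_punct: strip ASCII punctuation in one pass."""
    return text.translate(_PUNCT_TABLE)

sample = "Probably my all-time favorite movie, a story of..."
cleaned = remove_punct_fast(sample)
print(cleaned)
```

It could be applied in the same way as the original function, for example `data['review'].apply(remove_punct_fast)`.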

#### Tokenization

In [6]:
import re

def tokenize(text):
    tokens = re.split(r'\W+', text)  # raw string avoids an invalid-escape warning
    return tokens

data['review_tokens'] = data['review_nopunct'].apply(lambda x: tokenize(x.lower()))

data.head()

Unnamed: 0,review,sentiment,review_nopunct,review_tokens
0,I thought this was a wonderful way to spend ti...,positive,I thought this was a wonderful way to spend ti...,"[i, thought, this, was, a, wonderful, way, to,..."
1,"Probably my all-time favorite movie, a story o...",positive,Probably my alltime favorite movie a story of ...,"[probably, my, alltime, favorite, movie, a, st..."
2,I sure would like to see a resurrection of a u...,positive,I sure would like to see a resurrection of a u...,"[i, sure, would, like, to, see, a, resurrectio..."
3,"This show was an amazing, fresh & innovative i...",negative,This show was an amazing fresh innovative ide...,"[this, show, was, an, amazing, fresh, innovati..."
4,Encouraged by the positive comments about this...,negative,Encouraged by the positive comments about this...,"[encouraged, by, the, positive, comments, abou..."


The `review_tokens` column holds the lowercased word tokens, created from the punctuation-free column added in the previous step.
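One detail worth checking: splitting on `\W+` can leave an empty string in the token list when the text starts or ends with a non-word character. The sketch below shows the effect and one possible filter; `tokenize_nonempty` is a hypothetical helper, not something the notebook defines.

```python
import re

def tokenize_nonempty(text):
    """Hypothetical variant of tokenize: split on non-word runs and drop empty strings."""
    return [tok for tok in re.split(r"\W+", text.lower()) if tok]

# Trailing whitespace produces an empty final token with a plain split.
raw_tokens = re.split(r"\W+", "A fresh idea ")
clean_tokens = tokenize_nonempty("A fresh idea ")
print(raw_tokens)
print(clean_tokens)
```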

#### Remove stopwords

In [7]:
import nltk

stopword = nltk.corpus.stopwords.words('english')

In [8]:
def remove_stopword(token_list):
    text = [word for word in token_list if word not in stopword]
    return text

data['review_nostop'] = data['review_tokens'].apply(lambda x: remove_stopword(x))

data.head()

Unnamed: 0,review,sentiment,review_nopunct,review_tokens,review_nostop
0,I thought this was a wonderful way to spend ti...,positive,I thought this was a wonderful way to spend ti...,"[i, thought, this, was, a, wonderful, way, to,...","[thought, wonderful, way, spend, time, hot, su..."
1,"Probably my all-time favorite movie, a story o...",positive,Probably my alltime favorite movie a story of ...,"[probably, my, alltime, favorite, movie, a, st...","[probably, alltime, favorite, movie, story, se..."
2,I sure would like to see a resurrection of a u...,positive,I sure would like to see a resurrection of a u...,"[i, sure, would, like, to, see, a, resurrectio...","[sure, would, like, see, resurrection, dated, ..."
3,"This show was an amazing, fresh & innovative i...",negative,This show was an amazing fresh innovative ide...,"[this, show, was, an, amazing, fresh, innovati...","[show, amazing, fresh, innovative, idea, 70s, ..."
4,Encouraged by the positive comments about this...,negative,Encouraged by the positive comments about this...,"[encouraged, by, the, positive, comments, abou...","[encouraged, positive, comments, film, looking..."


Stopwords are removed from the tokens and the result is stored in the new `review_nostop` column.
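Membership tests against a list scan the whole list, so filtering 25,000 reviews may be noticeably faster if the stopwords are held in a set. The sketch below uses a tiny made-up stopword list so that it runs without NLTK; `stopword_set` and `remove_stopword_fast` are assumed names.

```python
# Hypothetical miniature stopword list; the notebook uses NLTK's English list instead.
stopword_list = ["i", "this", "was", "a", "to"]
stopword_set = set(stopword_list)  # average O(1) lookup instead of O(n) for a list

def remove_stopword_fast(token_list):
    """Keep only the tokens that are not in the stopword set."""
    return [word for word in token_list if word not in stopword_set]

tokens = ["i", "thought", "this", "was", "a", "wonderful", "way", "to", "spend", "time"]
filtered = remove_stopword_fast(tokens)
print(filtered)
```

With NLTK available, the set could be built once with `set(nltk.corpus.stopwords.words('english'))`.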

#### Stemming

In [9]:
ps = nltk.PorterStemmer()

In [10]:
def stemming(tokenized_text):
    word_stem = [ps.stem(word) for word in tokenized_text]
    return word_stem

data['review_stem'] = data['review_nostop'].apply(lambda x: stemming(x))

data.head()

Unnamed: 0,review,sentiment,review_nopunct,review_tokens,review_nostop,review_stem
0,I thought this was a wonderful way to spend ti...,positive,I thought this was a wonderful way to spend ti...,"[i, thought, this, was, a, wonderful, way, to,...","[thought, wonderful, way, spend, time, hot, su...","[thought, wonder, way, spend, time, hot, summe..."
1,"Probably my all-time favorite movie, a story o...",positive,Probably my alltime favorite movie a story of ...,"[probably, my, alltime, favorite, movie, a, st...","[probably, alltime, favorite, movie, story, se...","[probabl, alltim, favorit, movi, stori, selfle..."
2,I sure would like to see a resurrection of a u...,positive,I sure would like to see a resurrection of a u...,"[i, sure, would, like, to, see, a, resurrectio...","[sure, would, like, see, resurrection, dated, ...","[sure, would, like, see, resurrect, date, seah..."
3,"This show was an amazing, fresh & innovative i...",negative,This show was an amazing fresh innovative ide...,"[this, show, was, an, amazing, fresh, innovati...","[show, amazing, fresh, innovative, idea, 70s, ...","[show, amaz, fresh, innov, idea, 70, first, ai..."
4,Encouraged by the positive comments about this...,negative,Encouraged by the positive comments about this...,"[encouraged, by, the, positive, comments, abou...","[encouraged, positive, comments, film, looking...","[encourag, posit, comment, film, look, forward..."


Each remaining token is reduced to its Porter stem and the result is stored in the new `review_stem` column.

#### Lemmatize Text

In [11]:
wn = nltk.WordNetLemmatizer()

In [12]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\dhamn\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [13]:
def lemmatize(tokenized_text):
    word_lemma = [ wn.lemmatize(word) for word in tokenized_text]
    return word_lemma

data['review_lemma'] = data['review_nostop'].apply(lambda x: lemmatize(x))
data.head()

Unnamed: 0,review,sentiment,review_nopunct,review_tokens,review_nostop,review_stem,review_lemma
0,I thought this was a wonderful way to spend ti...,positive,I thought this was a wonderful way to spend ti...,"[i, thought, this, was, a, wonderful, way, to,...","[thought, wonderful, way, spend, time, hot, su...","[thought, wonder, way, spend, time, hot, summe...","[thought, wonderful, way, spend, time, hot, su..."
1,"Probably my all-time favorite movie, a story o...",positive,Probably my alltime favorite movie a story of ...,"[probably, my, alltime, favorite, movie, a, st...","[probably, alltime, favorite, movie, story, se...","[probabl, alltim, favorit, movi, stori, selfle...","[probably, alltime, favorite, movie, story, se..."
2,I sure would like to see a resurrection of a u...,positive,I sure would like to see a resurrection of a u...,"[i, sure, would, like, to, see, a, resurrectio...","[sure, would, like, see, resurrection, dated, ...","[sure, would, like, see, resurrect, date, seah...","[sure, would, like, see, resurrection, dated, ..."
3,"This show was an amazing, fresh & innovative i...",negative,This show was an amazing fresh innovative ide...,"[this, show, was, an, amazing, fresh, innovati...","[show, amazing, fresh, innovative, idea, 70s, ...","[show, amaz, fresh, innov, idea, 70, first, ai...","[show, amazing, fresh, innovative, idea, 70, f..."
4,Encouraged by the positive comments about this...,negative,Encouraged by the positive comments about this...,"[encouraged, by, the, positive, comments, abou...","[encouraged, positive, comments, film, looking...","[encourag, posit, comment, film, look, forward...","[encouraged, positive, comment, film, looking,..."


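Lemmatization maps each word to its dictionary form (lemma) instead of chopping off suffixes as stemming does. Called without a part-of-speech tag, `wn.lemmatize` treats every word as a noun, which is why `comments` becomes `comment` while `looking` is left unchanged. The lemmatized tokens are stored in the new `review_lemma` column.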

### TF-IDF Vectorization

In [14]:
# Create function to remove punctuation, tokenize, remove stopwords, and stem

def clean_text(text):
    # Drop punctuation characters and lowercase everything else
    text = "".join([char.lower() for char in text if char not in string.punctuation])
    tokens = re.split(r'\W+', text)
    # Stem every token that is not a stopword
    text = [ps.stem(word) for word in tokens if word not in stopword]
    return text

In [15]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vect = TfidfVectorizer(analyzer = clean_text)
X_tfidf = tfidf_vect.fit_transform(data['review'])
print(X_tfidf.shape)

(25000, 93696)


In [16]:
X_tfidf_df = pd.DataFrame(X_tfidf.toarray())
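# get_feature_names() was removed in scikit-learn 1.2; older releases (< 1.0) only have get_feature_names()
X_tfidf_df.columns = tfidf_vect.get_feature_names_out()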


This computes a TF-IDF weight for every stemmed term in every review and converts the result into a dense feature matrix with 25,000 rows and 93,696 columns.
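To make the weighting concrete, here is a minimal pure-Python sketch of TF-IDF on three toy documents. It assumes scikit-learn's default settings (smoothed IDF, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation of each row); the function name `tfidf` and the toy tokens are illustrative only.

```python
import math
from collections import Counter

def tfidf(docs):
    """TF-IDF with smoothed IDF and L2 row normalisation over pre-tokenised docs."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}
    rows = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts within this document
        weights = {term: freq * idf[term] for term, freq in tf.items()}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({term: w / norm for term, w in weights.items()})
    return idf, rows

# Toy, already-stemmed documents (made up for illustration).
docs = [["movi", "wonder", "movi"], ["movi", "bore"], ["movi", "wonder", "stori"]]
idf, rows = tfidf(docs)
print(idf)
print(rows[0])
```

A term that occurs in every document (`movi` here) gets the minimum IDF of 1, while rarer terms are weighted up.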

### Build RF with a manual grid search

In [17]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_recall_fscore_support as score

`GridSearchCV` was tried first, but it could not allocate enough memory even for 6,000 records. Reducing the subset to 5,000 and then 4,000 records produced the same error.

With 3,000 records, training had still not finished after 15 minutes, and 3,000 records is in any case very few compared with the number of features.

So a hand-written grid search is used here instead, as showcased in one of the lecture notebooks, which trains and predicts in seconds rather than minutes.
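A plausible contributor to the memory pressure is the dense conversion of the TF-IDF matrix. The back-of-the-envelope sketch below assumes 8-byte `float64` values (the `TfidfVectorizer` default) and uses the shape reported earlier; `dense_size_gb` is a hypothetical helper.

```python
def dense_size_gb(n_rows, n_cols, bytes_per_value=8):
    """Size of a dense matrix in gigabytes (10**9 bytes), assuming float64 by default."""
    return n_rows * n_cols * bytes_per_value / 1e9

full = dense_size_gb(25000, 93696)   # whole dataset as a dense array
subset = dense_size_gb(6000, 93696)  # the 6,000-row subset
print(round(full, 2), round(subset, 2))
```

Under these assumptions even the 6,000-row subset needs several gigabytes, and cross-validation with parallel workers may hold more than one copy at a time.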

In [18]:
### Taking the subset of data to train the model faster

features = X_tfidf_df[0:6000]
labels = data['sentiment'][0:6000]


In [19]:
X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.2)

In [20]:
def train_RF(n_est, depth):
    rf = RandomForestClassifier(n_estimators=n_est, max_depth=depth, n_jobs=-1)
    rf_model = rf.fit(X_train, y_train)
    y_pred = rf_model.predict(X_test)
    precision, recall, fscore, support = score(y_test, y_pred, pos_label='positive', average='binary')
    print('Est: {} / Depth: {} ---- Precision: {} / Recall: {} / Accuracy: {}'.format(
        n_est, depth, round(precision, 3), round(recall, 3),
        round((y_pred==y_test).sum() / len(y_pred), 3)))

In [21]:
for n_est in [10, 50, 100]:
    for depth in [10, 20, 30, None]:
        train_RF(n_est, depth)

Est: 10 / Depth: 10 ---- Precision: 0.728 / Recall: 0.715 / Accuracy: 0.718
Est: 10 / Depth: 20 ---- Precision: 0.71 / Recall: 0.712 / Accuracy: 0.703
Est: 10 / Depth: 30 ---- Precision: 0.765 / Recall: 0.724 / Accuracy: 0.744
Est: 10 / Depth: None ---- Precision: 0.734 / Recall: 0.634 / Accuracy: 0.695
Est: 50 / Depth: 10 ---- Precision: 0.834 / Recall: 0.725 / Accuracy: 0.785
Est: 50 / Depth: 20 ---- Precision: 0.807 / Recall: 0.81 / Accuracy: 0.803
Est: 50 / Depth: 30 ---- Precision: 0.835 / Recall: 0.767 / Accuracy: 0.803
Est: 50 / Depth: None ---- Precision: 0.852 / Recall: 0.798 / Accuracy: 0.826
Est: 100 / Depth: 10 ---- Precision: 0.848 / Recall: 0.774 / Accuracy: 0.813
Est: 100 / Depth: 20 ---- Precision: 0.853 / Recall: 0.833 / Accuracy: 0.841
Est: 100 / Depth: 30 ---- Precision: 0.834 / Recall: 0.785 / Accuracy: 0.81
Est: 100 / Depth: None ---- Precision: 0.842 / Recall: 0.797 / Accuracy: 0.819


The RF model performed best with **`n_estimators:100`** and **`max_depth:20`**: accuracy is about 84% (0.841), with precision 0.853 and recall 0.833. These parameters are used for the final model evaluation.
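The loop above computes `fscore` but never prints it. As a small sketch, the F1 score can be recovered from the printed precision and recall; the three configurations below are copied from the output, and the helper name `f1` is an assumption.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0

# (n_estimators, max_depth) -> (precision, recall), copied from the printed results
rf_runs = {
    (100, 20): (0.853, 0.833),
    (50, None): (0.852, 0.798),
    (100, 10): (0.848, 0.774),
}
f1_by_params = {params: f1(p, r) for params, (p, r) in rf_runs.items()}
best_params = max(f1_by_params, key=f1_by_params.get)
for params, value in f1_by_params.items():
    print(params, round(value, 3))
```

Ranking by F1 points to the same configuration as ranking by accuracy in this run.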

#### Evaluate model on best hyperparameters

In [23]:
import time

In [24]:
rf = RandomForestClassifier(n_estimators=100, max_depth=20, n_jobs=-1)

start = time.time()
rf_model = rf.fit(X_train, y_train)
end = time.time()
fit_time = (end - start)

start = time.time()
y_pred = rf_model.predict(X_test)
end = time.time()
pred_time = (end - start)

precision, recall, fscore, train_support = score(y_test, y_pred, pos_label='positive', average='binary')
print('Fit time: {} / Predict time: {} ---- Precision: {} / Recall: {} / Accuracy: {}'.format(
    round(fit_time, 3), round(pred_time, 3), round(precision, 3), round(recall, 3), round((y_pred==y_test).sum()/len(y_pred), 3)))

Fit time: 31.017 / Predict time: 2.669 ---- Precision: 0.844 / Recall: 0.793 / Accuracy: 0.819


The model took about 31.0 s to fit and 2.67 s to predict. Accuracy (0.819), precision (0.844) and recall (0.793) are slightly lower than in the grid search run above, because the forest is retrained without a fixed random seed, but overall the model performs well.
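The repeated start/end bookkeeping can be wrapped in a small helper. This is only a sketch: `timed` is a hypothetical name, and it uses `time.perf_counter`, which is generally better suited to measuring intervals than `time.time`.

```python
import time

def timed(func, *args, **kwargs):
    """Call func and return (result, elapsed_seconds) using a monotonic clock."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start

# Stand-in workload so the example runs without the notebook's data.
total, elapsed = timed(sum, range(1_000_000))
print(total, round(elapsed, 3))
```

In the notebook this could be used as `rf_model, fit_time = timed(rf.fit, X_train, y_train)` and `y_pred, pred_time = timed(rf_model.predict, X_test)`.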

### Build Gradient Boosting with a manual grid search

In [25]:
##from sklearn.ensemble import GradientBoostingClassifier
from xgboost import XGBClassifier

In [29]:
labels_new = labels.map({'positive': 1, 'negative': 0})

In [30]:
X_train, X_test, y_train, y_test = train_test_split(features, labels_new, test_size=0.2)

In [35]:
def train_GB(est, max_depth):
    gb = XGBClassifier(n_estimators=est, max_depth=max_depth)
    gb_model = gb.fit(X_train, y_train)
    y_pred = gb_model.predict(X_test)
    precision, recall, fscore, train_support = score(y_test, y_pred, pos_label= 1, average='binary')
    print('Est: {} / Depth: {} ---- Precision: {} / Recall: {} / Accuracy: {}'.format(
        est, max_depth, round(precision, 3), round(recall, 3), 
        round((y_pred==y_test).sum()/len(y_pred), 3)))

In [36]:
for n_est in [10, 50, 100]:
    for depth in [10, 20, 30, None]:
        train_GB(n_est, depth)

Est: 10 / Depth: 10 ---- Precision: 0.77 / Recall: 0.816 / Accuracy: 0.785
Est: 10 / Depth: 20 ---- Precision: 0.784 / Recall: 0.8 / Accuracy: 0.788
Est: 10 / Depth: 30 ---- Precision: 0.783 / Recall: 0.793 / Accuracy: 0.785
Est: 10 / Depth: None ---- Precision: 0.76 / Recall: 0.833 / Accuracy: 0.783
Est: 50 / Depth: 10 ---- Precision: 0.826 / Recall: 0.858 / Accuracy: 0.838
Est: 50 / Depth: 20 ---- Precision: 0.828 / Recall: 0.839 / Accuracy: 0.832
Est: 50 / Depth: 30 ---- Precision: 0.815 / Recall: 0.829 / Accuracy: 0.819
Est: 50 / Depth: None ---- Precision: 0.818 / Recall: 0.856 / Accuracy: 0.832
Est: 100 / Depth: 10 ---- Precision: 0.843 / Recall: 0.863 / Accuracy: 0.85
Est: 100 / Depth: 20 ---- Precision: 0.843 / Recall: 0.854 / Accuracy: 0.847
Est: 100 / Depth: 30 ---- Precision: 0.829 / Recall: 0.836 / Accuracy: 0.831
Est: 100 / Depth: None ---- Precision: 0.851 / Recall: 0.871 / Accuracy: 0.858


With XGBoost the best accuracy is about 85.8%, obtained with **`n_estimators:100`** and **`Depth:None`** (which leaves `max_depth` at the library default). The model is evaluated with these parameters.
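Rather than reading the best row off the printout, the results can be collected and ranked. The sketch below hard-codes the twelve XGBoost rows from the output above; the tuple layout and the name `xgb_results` are assumptions for illustration.

```python
# (n_estimators, max_depth, precision, recall, accuracy), copied from the printed output
xgb_results = [
    (10, 10, 0.77, 0.816, 0.785), (10, 20, 0.784, 0.8, 0.788),
    (10, 30, 0.783, 0.793, 0.785), (10, None, 0.76, 0.833, 0.783),
    (50, 10, 0.826, 0.858, 0.838), (50, 20, 0.828, 0.839, 0.832),
    (50, 30, 0.815, 0.829, 0.819), (50, None, 0.818, 0.856, 0.832),
    (100, 10, 0.843, 0.863, 0.85), (100, 20, 0.843, 0.854, 0.847),
    (100, 30, 0.829, 0.836, 0.831), (100, None, 0.851, 0.871, 0.858),
]

# Pick the configuration with the highest accuracy (last tuple element).
best = max(xgb_results, key=lambda row: row[4])
print("best n_estimators:", best[0], "max_depth:", best[1], "accuracy:", best[4])
```

In practice `train_GB` could return such a tuple instead of printing, so the list is built inside the loop.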

In [37]:
gb = XGBClassifier(n_estimators=100, max_depth=None)

start = time.time()
gb_model = gb.fit(X_train, y_train)
end = time.time()
fit_time = (end - start)

start = time.time()
y_pred = gb_model.predict(X_test)
end = time.time()
pred_time = (end - start)

precision, recall, fscore, train_support = score(y_test, y_pred, pos_label=1, average='binary')
print('Fit time: {} / Predict time: {} ---- Precision: {} / Recall: {} / Accuracy: {}'.format(
    round(fit_time, 3), round(pred_time, 3), round(precision, 3), round(recall, 3), round((y_pred==y_test).sum()/len(y_pred), 3)))

Fit time: 1026.563 / Predict time: 9.086 ---- Precision: 0.851 / Recall: 0.871 / Accuracy: 0.858


Training time is very long at about 1,027 s (roughly **17 mins**), and prediction takes about 9 s. Accuracy is about 85.8%, with precision 0.851 and recall 0.871.

- Comparing RF and XGBoost, the accuracy gain from XGBoost (0.858 versus 0.819) is small relative to its much longer training time and larger memory use. We prefer a model that gives decent accuracy and also trains quickly.

- On that basis RF is the better choice here: its accuracy is decent and its precision and recall are also good.
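The trade-off can be put into numbers with the figures reported in the two final evaluation runs. This is a rough sketch: the two models were evaluated on different random train/test splits, so the accuracy gap should be read as approximate, and the dictionary names are illustrative.

```python
# Fit time (seconds) and accuracy copied from the final evaluation outputs
rf_run = {"fit_s": 31.017, "accuracy": 0.819}
xgb_run = {"fit_s": 1026.563, "accuracy": 0.858}

# How many times longer XGBoost took to train, and the accuracy gain in percentage points.
slowdown = xgb_run["fit_s"] / rf_run["fit_s"]
gain_points = (xgb_run["accuracy"] - rf_run["accuracy"]) * 100
print(round(slowdown, 1), round(gain_points, 1))
```

On these numbers XGBoost trains roughly 33 times slower for a gain of about 3.9 accuracy points.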

