# Bag of Words Meets Bags of Popcorn

- What is the Bag of Words model?
https://www.geeksforgeeks.org/bag-of-words-bow-model-in-nlp/

- What is Word2Vec?
https://code.google.com/archive/p/word2vec/


- Contest Link: https://www.kaggle.com/c/word2vec-nlp-tutorial/overview

## 1) Importing the libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import torch
import torch.nn as nn
import nltk
import scipy
import warnings
import re
warnings.filterwarnings('ignore')


## 2) Reading the text files and observing its features

In [0]:
# For train
df_train = pd.read_csv('drive/My Drive/Pytorch_DataSet/Bag Of Words Meets Bags of popcorn/labeledTrainData.tsv',sep='\t')

# For test
df_test = pd.read_csv('drive/My Drive/Pytorch_DataSet/Bag Of Words Meets Bags of popcorn/testData.tsv',sep='\t')

In [3]:
df_train.head()   

| | id | sentiment | review |
|---|---|---|---|
| 0 | 5814_8 | 1 | With all this stuff going down at the moment w... |
| 1 | 2381_9 | 1 | \The Classic War of the Worlds\" by Timothy Hi... |
| 2 | 7759_3 | 0 | The film starts with a manager (Nicholas Bell)... |
| 3 | 3630_4 | 0 | It must be assumed that those who praised this... |
| 4 | 9495_8 | 1 | Superbly trashy and wondrously unpretentious 8... |


In [4]:
df_test.head()

| | id | review |
|---|---|---|
| 0 | 12311_10 | Naturally in a film who's main themes are of m... |
| 1 | 8348_2 | This movie is a disaster within a disaster fil... |
| 2 | 5828_4 | All in all, this is a movie for kids. We saw i... |
| 3 | 7186_2 | Afraid of the Dark left me with the impression... |
| 4 | 12128_7 | A very accurate depiction of small time mob li... |


In [5]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000 entries, 0 to 24999
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   id         25000 non-null  object
 1   sentiment  25000 non-null  int64 
 2   review     25000 non-null  object
dtypes: int64(1), object(2)
memory usage: 586.1+ KB


In [6]:
df_train.describe()

| | sentiment |
|---|---|
| count | 25000.0 |
| mean | 0.5 |
| std | 0.50001 |
| min | 0.0 |
| 25% | 0.0 |
| 50% | 0.5 |
| 75% | 1.0 |
| max | 1.0 |


Because `sentiment` is a binary (0/1) label, a mean of 0.5 means the training set is evenly split: 50% positive and 50% negative reviews.
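The balance can also be read off directly by counting the labels instead of inferring it from the mean. A minimal sketch on a small made-up frame (`toy` is a hypothetical stand-in for `df_train`; the same calls should work on `df_train['sentiment']`):

```python
import pandas as pd

# Hypothetical stand-in for df_train; only the 'sentiment' column matters here.
toy = pd.DataFrame({"sentiment": [1, 0, 1, 0, 0, 1]})

# Absolute number of reviews per class.
counts = toy["sentiment"].value_counts()

# Relative frequencies; for a 0/1 label the share of 1s equals the column mean.
shares = toy["sentiment"].value_counts(normalize=True)

print(counts.to_dict())
print(shares.to_dict())
```

Counting is also the safer check when the label is not strictly 0/1, since the mean is only interpretable as a class share for binary labels.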

## 3) Data Cleaning

In [7]:
# Let's look at the review column
pd.options.display.max_colwidth = 500 # show up to 500 characters of each review
print(df_train['review'])

0        With all this stuff going down at the moment with MJ i've started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle m...
1        \The Classic War of the Worlds\" by Timothy Hines is a very entertaining film that obviously goes to great effort and lengths to faithfully recreate H. G. Wells' classic book. Mr. Hines succeeds in doing so. I, and those who watched his film with me, appreciated the fact that it was not the standard, predictable Hollywood fare that comes out every year, e.g. the Spielberg version with Tom Cruise that had only the slightest resemblance to the book. Obviously, everyone looks for 

Looking at the text, the reviews need cleaning:
- Lowercase all letters
- Expand contractions (short forms) into their long forms
- Remove HTML tags
- Collapse repeated whitespace
- Strip leading and trailing whitespace
- Remove standalone numbers such as 19 or 20
- Remove punctuation such as `...`


In [8]:
# Removing html tags from text.
# https://stackoverflow.com/questions/9662346/python-code-to-remove-html-tags-from-a-string

txt = " the source.<br /><br />Here's a pretensions:<br /><br />Just when"
cleanr = re.sub('<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});','',txt)
cleanr

" the source.Here's a pretensions:Just when"

In [9]:
# Removing digits from text
# https://stackoverflow.com/questions/817122/delete-digits-in-python-regex

text = " the source.<br /><br />Here's a 20 preten 103s and 120th sions:<br /><br />Just when"
text = re.sub(r"^\d+\s|\s\d+\s|\s\d+$", " ", text) # Removing standalone numbers (digit runs surrounded by whitespace)
text

" the source.<br /><br />Here's a preten 103s and 120th sions:<br /><br />Just when"

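The pattern above only removes digit runs that are surrounded by whitespace, which is why `103s` and `120th` survive. Because each match also consumes the whitespace on both sides, the second of two adjacent numbers is left behind as well. A sketch of a word-boundary variant, which is not what this notebook uses (`remove_standalone_numbers` is a hypothetical helper name):

```python
import re

def remove_standalone_numbers(text):
    """Drop digit-only tokens, then collapse the whitespace left behind."""
    # \b\d+\b matches a run of digits that forms a whole token without
    # consuming the neighbouring spaces, so adjacent numbers are all matched.
    text = re.sub(r"\b\d+\b", " ", text)
    return re.sub(r"\s+", " ", text).strip()

sample = "seen 10 20 times, the 103s and 120th cut"

# The whitespace-delimited pattern from the cell above, for comparison.
original = re.sub(r"^\d+\s|\s\d+\s|\s\d+$", " ", sample)

print(original)
print(remove_standalone_numbers(sample))
```

Tokens that mix digits and letters are kept by both versions; removing them too would need a different rule.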
In [0]:
# Let's clean the text
def clean_text(text):
  text = text.lower()
  text = re.sub(r'<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});','',text)  # remove HTML tags and entities
  # Contraction expansion is disabled: the string literal below is never executed.
  """
  text = re.sub(r"can't","cannot",text)
  text = re.sub(r"shan't","shall not",text)
  text = re.sub(r"won't","will not",text)
  text = re.sub(r"n't"," not",text) # see the space before not. 
  text = re.sub(r"i'm","i am",text)
  text = re.sub(r"what's","what is",text)
  text = re.sub(r"let's","let us",text)
  text = re.sub(r"'re"," are",text)
  text = re.sub(r"'s"," ",text)  # space because we dont know the tense , it can be is/has anything.
  text = re.sub(r"'ve"," have",text)
  text = re.sub(r"\'ll", " will ", text)
  text = re.sub(r"\'scuse", " excuse ", text)
  """
  text = re.sub(r"[^a-zA-Z]"," ",text)  # replace every non-letter (digits, punctuation, apostrophes) with a space
  text = re.sub(r"^\d+\s|\s\d+\s|\s\d+$", " ", text)  # remove standalone numbers (no effect here: digits are already gone)
  text = re.sub(r'\W', ' ', text)  # replace non-word characters (also a no-op after the non-letter step)
  text = re.sub(r'\s+', ' ', text)  # collapse runs of whitespace into a single space
  text = text.strip(' ')  # remove leading and trailing spaces
  return text


Cleaning text for train file


In [0]:
df_train['review'] = df_train['review'].apply(clean_text)

In [12]:
print(df_train['review'])

0        with all this stuff going down at the moment with mj i ve started listening to his music watching the odd documentary here and there watched the wiz and watched moonwalker again maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent moonwalker is part biography part feature film which i remember going to see at the cinema when it was originally released some of it has subtle message...
1        the classic war of the worlds by timothy hines is a very entertaining film that obviously goes to great effort and lengths to faithfully recreate h g wells classic book mr hines succeeds in doing so i and those who watched his film with me appreciated the fact that it was not the standard predictable hollywood fare that comes out every year e g the spielberg version with tom cruise that had only the slightest resemblance to the book obviously everyone looks for different things

Cleaning text for test file


In [0]:
df_test['review'] = df_test['review'].apply(clean_text)

In [14]:
print(df_test['review'])

0        naturally in a film who s main themes are of mortality nostalgia and loss of innocence it is perhaps not surprising that it is rated more highly by older viewers than younger ones however there is a craftsmanship and completeness to the film which anyone can enjoy the pace is steady and constant the characters full and engaging the relationships and interactions natural showing that you do not need floods of tears to show emotion screams to show fear shouting to show dispute or violence to s...
1        this movie is a disaster within a disaster film it is full of great action scenes which are only meaningful if you throw away all sense of reality let s see word to the wise lava burns you steam burns you you can t stand next to lava diverting a minor lava flow is difficult let alone a significant one scares me to think that some might actually believe what they saw in this movie even worse is the significant amount of talent that went into making this film i mean the acting is

Splitting into features (X) and target (y)

In [0]:
# For train file

X = df_train['review']
y = df_train['sentiment']

# For test file

X_test = df_test['review']

## 4) Tokenizing text words

We will use `TfidfVectorizer`, remove the English stop words from the text, and keep only the 20,000 most frequent terms.
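To see what these two options do, here is a minimal sketch on a made-up three-document corpus (`corpus`, `toy_tf` and `X_toy` are hypothetical names, and the vocabulary cap is set to 3 only to keep the output small):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus standing in for the cleaned reviews.
corpus = [
    "the movie was great great fun",
    "the movie was dull",
    "great acting and a dull plot",
]

# Same options as in this notebook, but with a tiny vocabulary cap.
toy_tf = TfidfVectorizer(stop_words="english", max_features=3)
X_toy = toy_tf.fit_transform(corpus)

# Stop words ("the", "was", "and") are dropped, and only the three terms
# with the highest total count across the corpus are kept.
kept_terms = sorted(toy_tf.vocabulary_)
print(kept_terms)

# The result is a sparse matrix: one row per document, one column per term.
print(X_toy.shape)

# With the default norm='l2', every non-empty row has unit length.
row_norms = np.linalg.norm(X_toy.toarray(), axis=1)
print(row_norms)
```

Note that `max_features` ranks terms by their total count over the whole corpus, not by their TF-IDF weight.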

In [0]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix,roc_auc_score,roc_curve

In [17]:
tf = TfidfVectorizer(stop_words='english',max_features=20000)
tf

TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.float64'>, encoding='utf-8',
                input='content', lowercase=True, max_df=1.0, max_features=20000,
                min_df=1, ngram_range=(1, 1), norm='l2', preprocessor=None,
                smooth_idf=True, stop_words='english', strip_accents=None,
                sublinear_tf=False, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, use_idf=True, vocabulary=None)

In [0]:
# For training

X = tf.fit_transform(X)

# For test 

X_test = tf.transform(X_test)

In [19]:
len(tf.get_feature_names()) , 
#tf.get_feature_names()

(20000,)

In [20]:
print(X.toarray().shape)
X.toarray()

(25000, 20000)


array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

Splitting the labeled data into training and validation sets (90% / 10%)

In [0]:
from sklearn.model_selection import train_test_split
X_train, X_val, y_train, y_val = train_test_split(X,y,test_size=0.1,random_state=50)
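With `test_size=0.1`, 2,500 of the 25,000 labeled reviews are held out for validation. The split above is purely random; if one wanted the 50/50 class balance preserved exactly in both parts, `train_test_split` accepts a `stratify` argument. A sketch on made-up data (`X_toy` and `y_toy` are hypothetical, and stratification is an optional addition, not something this notebook does):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical balanced toy data: 100 samples, 50 per class.
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([0, 1] * 50)

# stratify=y_toy keeps the class proportions the same in both parts.
X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.1, random_state=50, stratify=y_toy
)

print(X_tr.shape, X_va.shape)
print(y_va.mean())
```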

## 5) Building the model

### 5.1) Using random forest classifier.

In [0]:
from sklearn.ensemble import RandomForestClassifier
rf_clf = RandomForestClassifier()

In [23]:
rf_clf.fit(X_train,y_train)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

In [0]:
y_val_pred = rf_clf.predict(X_val)

In [25]:
print(accuracy_score(y_val, y_val_pred))

0.8588


In [26]:
# Note: roc_curve expects (y_true, y_score); here the predicted labels are passed first.
print(roc_curve(y_val_pred,y_val))

(array([0.        , 0.14475743, 1.        ]), array([0.        , 0.86252046, 1.        ]), array([2, 1, 0]))


In [27]:
# Note: roc_auc_score expects (y_true, y_score); here the predicted labels are passed first.
print(roc_auc_score(y_val_pred,y_val))

0.8588815123876556
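A caveat on the two ROC cells above: scikit-learn's `roc_curve` and `roc_auc_score` take the true labels first and a score second, and the score is normally the predicted probability of the positive class rather than the hard 0/1 prediction. The sketch below uses made-up labels and probabilities (`y_true` and `y_prob` are hypothetical) to show how the three variants can differ:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical validation labels and positive-class probabilities.
y_true = np.array([0, 0, 0, 0, 0, 1, 1, 1])
y_prob = np.array([0.1, 0.2, 0.3, 0.6, 0.7, 0.8, 0.9, 0.4])

# Ranking-based AUC from probabilities: the signature is (y_true, y_score).
auc_from_probs = roc_auc_score(y_true, y_prob)

# Hard labels at a 0.5 threshold discard the ranking information.
y_pred = (y_prob >= 0.5).astype(int)
auc_from_labels = roc_auc_score(y_true, y_pred)

# Swapping the arguments treats the predictions as the ground truth.
auc_swapped = roc_auc_score(y_pred, y_true)

print(auc_from_probs, auc_from_labels, auc_swapped)
```

For the models in this notebook, the corresponding call would presumably be `roc_auc_score(y_val, rf_clf.predict_proba(X_val)[:, 1])`; its value is not shown here because it was not computed in the original run.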


In [0]:
rfc1 = rf_clf.predict(X_test)
rfc2 = rf_clf.predict_proba(X_test)[:,1]

In [29]:
rfc1

array([1, 0, 1, ..., 1, 1, 1])

In [30]:
rfc2

array([0.87, 0.21, 0.58, ..., 0.55, 0.78, 0.56])

In [31]:
# putting result in submission file

submission_file = pd.read_csv('drive/My Drive/Pytorch_DataSet/Bag Of Words Meets Bags of popcorn/sampleSubmission.csv')
submission_file.head()

| | id | sentiment |
|---|---|---|
| 0 | 12311_10 | 0 |
| 1 | 8348_2 | 0 |
| 2 | 5828_4 | 0 |
| 3 | 7186_2 | 0 |
| 4 | 12128_7 | 0 |


In [0]:
submission_file['sentiment'] = rfc1

In [33]:
submission_file

| | id | sentiment |
|---|---|---|
| 0 | 12311_10 | 1 |
| 1 | 8348_2 | 0 |
| 2 | 5828_4 | 1 |
| 3 | 7186_2 | 1 |
| 4 | 12128_7 | 1 |
| ... | ... | ... |
| 24995 | 2155_10 | 1 |
| 24996 | 59_10 | 1 |
| 24997 | 2531_1 | 1 |
| 24998 | 7772_8 | 1 |


In [0]:
submission_file.to_csv('Using ensemble model.csv',index=False)

Model gave 84.7% accuracy.

### 5.2) Using xgboost

In [0]:
import xgboost as xgb

In [0]:
xg_clf = xgb.XGBClassifier(objective ='binary:logistic',learning_rate=0.2,n_estimators=1000,max_depth=20)

In [37]:
xg_clf.fit(X_train,y_train)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0,
              learning_rate=0.2, max_delta_step=0, max_depth=20,
              min_child_weight=1, missing=None, n_estimators=1000, n_jobs=1,
              nthread=None, objective='binary:logistic', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
              silent=None, subsample=1, verbosity=1)

In [0]:
y_val_pred = xg_clf.predict(X_val)

In [39]:
print(roc_auc_score(y_val_pred,y_val))

0.8676432186662366


In [0]:
xgb1 = xg_clf.predict(X_test)
xgb2 = xg_clf.predict_proba(X_test)[:,1]

In [41]:
xgb1

array([1, 0, 1, ..., 0, 1, 1])

In [42]:
xgb2

array([9.9999225e-01, 7.8459325e-06, 9.8741156e-01, ..., 2.4775542e-01,
       9.9952829e-01, 6.7515337e-01], dtype=float32)

In [43]:
# putting result in submission file

submission_file = pd.read_csv('drive/My Drive/Pytorch_DataSet/Bag Of Words Meets Bags of popcorn/sampleSubmission.csv')
submission_file.head()

| | id | sentiment |
|---|---|---|
| 0 | 12311_10 | 0 |
| 1 | 8348_2 | 0 |
| 2 | 5828_4 | 0 |
| 3 | 7186_2 | 0 |
| 4 | 12128_7 | 0 |


In [0]:
submission_file['sentiment'] = xgb1
submission_file.to_csv('Using xgboost.csv',index=False)

Model gave 86% accuracy.

### 5.3) Using Logistic Regression

In [0]:
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()

In [46]:
lr.fit(X_train,y_train)
y_val_pred = lr.predict(X_val)

print(roc_auc_score(y_val_pred,y_val))

lr1 = lr.predict(X_test)
lr2 = lr.predict_proba(X_test)[:,1]

0.8929576689398318


In [47]:
lr1

array([1, 0, 1, ..., 0, 1, 1])

In [48]:
lr2

array([0.93838547, 0.06921883, 0.68454463, ..., 0.41196431, 0.93211129,
       0.58675025])

In [0]:
# putting result in submission file

submission_file = pd.read_csv('drive/My Drive/Pytorch_DataSet/Bag Of Words Meets Bags of popcorn/sampleSubmission.csv')
submission_file['sentiment'] = lr1
submission_file.to_csv('Using Logistic Regression.csv',index=False)

Model gave 88% accuracy.

## Model Ensembling