Prediction of Customer's Hotel Review.

Methodology followed:
1. Load the training CSV file.
2. Convert the labels happy / not happy into binary indicators 1 / 0.
3. Convert the text into a sparse matrix of TF-IDF vectors.
4. Apply a Naive Bayes classifier.
5. Measure the ROC AUC score.
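
The steps above can be sketched end to end on a few made-up reviews. This is only an illustration: the toy data frame, the names toy_vectorizer and toy_clf, and the use of scikit-learn's built-in 'english' stop-word list (instead of the NLTK list used later in this notebook) are assumptions made to keep the example self-contained.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import roc_auc_score
from sklearn.naive_bayes import MultinomialNB

# Step 1 stand-in: hypothetical toy data instead of train.csv.
toy = pd.DataFrame({
    "Description": [
        "clean room and friendly staff",
        "great location, lovely breakfast",
        "dirty room and rude staff",
        "terrible smell, noisy and dirty",
    ],
    "Is_Response": ["happy", "happy", "not happy", "not happy"],
})

# Step 2: map the text labels to binary indicators.
y = toy["Is_Response"].map({"not happy": 0, "happy": 1})

# Step 3: convert the text into a sparse TF-IDF matrix.
toy_vectorizer = TfidfVectorizer(lowercase=True, stop_words="english")
X = toy_vectorizer.fit_transform(toy["Description"])

# Step 4: fit the Naive Bayes classifier.
toy_clf = MultinomialNB().fit(X, y)

# Step 5: ROC AUC from the predicted probability of the positive class.
# Scoring on the training rows is only a smoke test, not a real evaluation.
auc = roc_auc_score(y, toy_clf.predict_proba(X)[:, 1])
print(X.shape, auc)
```

On real data the score should be computed on rows the classifier has not seen, as done later with a train/test split.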
    

In [326]:
import pandas as pd
import numpy as np
import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn import naive_bayes
from sklearn.metrics import roc_auc_score
from sklearn.svm import SVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfTransformer

In [327]:
df_train = pd.read_csv("/Users/Arati/Documents/UIS/Antworks/train.csv", sep=',')


In [328]:
df_test = pd.read_csv("/Users/Arati/Documents/UIS/Antworks/test.csv", sep=',')

In [331]:
df_train.head(5)


Unnamed: 0,User_ID,Description,Browser_Used,Device_Used,Is_Response
0,id10326,The room was kind of clean but had a VERY stro...,Edge,Mobile,not happy
1,id10327,I stayed at the Crown Plaza April -- - April -...,Internet Explorer,Mobile,not happy
2,id10328,I booked this hotel through Hotwire at the low...,Mozilla,Tablet,not happy
3,id10329,Stayed here with husband and sons on the way t...,InternetExplorer,Desktop,happy
4,id10330,My girlfriends and I stayed here to celebrate ...,Edge,Tablet,not happy


In [332]:
df_test.head(5)

Unnamed: 0,User_ID,Description,Browser_Used,Device_Used
0,id80132,Looking for a motel in close proximity to TV t...,Firefox,Mobile
1,id80133,Walking distance to Madison Square Garden and ...,InternetExplorer,Desktop
2,id80134,Visited Seattle on business. Spent - nights in...,IE,Tablet
3,id80135,This hotel location is excellent and the rooms...,Edge,Mobile
4,id80136,This hotel is awesome I love the service Antho...,Mozilla,Mobile


From the data frame we only need the fields User_ID, Description, and Is_Response, so we create a new data frame without the other columns.

In [333]:
# DROPPING COLUMNS 'Browser_Used' & 'Device_Used'
df_new = df_train.drop(['Browser_Used', 'Device_Used'], axis=1)
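
Keeping only the needed fields can also be written as a selection by column name, which gives the same result as dropping the unwanted columns. A small sketch on a hypothetical two-row frame (the rows and the names toy, kept, and dropped are made up for illustration):

```python
import pandas as pd

# Hypothetical frame with the same column names as the training file.
toy = pd.DataFrame({
    "User_ID": ["id1", "id2"],
    "Description": ["nice stay", "bad stay"],
    "Browser_Used": ["Edge", "Mozilla"],
    "Device_Used": ["Mobile", "Tablet"],
    "Is_Response": ["happy", "not happy"],
})

# Select the columns to keep ...
kept = toy[["User_ID", "Description", "Is_Response"]]
# ... or drop the columns that are not needed.
dropped = toy.drop(columns=["Browser_Used", "Device_Used"])

print(list(kept.columns))
print(kept.equals(dropped))
```

Selecting by name states explicitly which fields the model depends on, which can help if the input file later gains extra columns.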

In [334]:
df_new.head(5)

Unnamed: 0,User_ID,Description,Is_Response
0,id10326,The room was kind of clean but had a VERY stro...,not happy
1,id10327,I stayed at the Crown Plaza April -- - April -...,not happy
2,id10328,I booked this hotel through Hotwire at the low...,not happy
3,id10329,Stayed here with husband and sons on the way t...,happy
4,id10330,My girlfriends and I stayed here to celebrate ...,not happy


In [335]:
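#TFIDF
# NLTK's English stop-word list (requires a one-time nltk.download('stopwords')).
# Recent scikit-learn versions expect a list here rather than a set.
stopset = stopwords.words('english')
vectorizer = TfidfVectorizer(use_idf=True, lowercase=True, strip_accents='ascii', stop_words=stopset)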

In [336]:
# Encode the labels in the Is_Response column only: 'not happy' -> 0, 'happy' -> 1
df_new['Is_Response'] = df_new['Is_Response'].map({'not happy': 0, 'happy': 1})
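
After encoding the labels it is worth checking that every row received a 0 or a 1. A minimal sketch, assuming the only label values are 'happy' and 'not happy' as in the rows shown above; the series labels and the dictionary label_map are hypothetical names used for illustration:

```python
import pandas as pd

# Hypothetical label column.
labels = pd.Series(["not happy", "happy", "happy", "not happy"])

label_map = {"not happy": 0, "happy": 1}
encoded = labels.map(label_map)

# Series.map leaves NaN for any value missing from the dictionary,
# so a NaN here would reveal an unexpected or misspelled label.
assert encoded.notna().all(), "unexpected label value"

encoded = encoded.astype(int)
print(encoded.tolist())
print(encoded.value_counts().to_dict())
```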

In [337]:
df_new.shape

(38932, 3)

In [338]:
#Setting Y, the variable to be predicted, from Is_Response:
#'happy' -> 1, 'not happy' -> 0

Y = df_new.Is_Response

In [339]:
#CONVERTING DESCRIPTION IN THE DATA FRAME TO FEATURES
X = vectorizer.fit_transform(df_new.Description)

In [340]:
print(X.shape)
print(Y.shape)

(38932, 45940)
(38932,)


In [341]:
#WE HAVE 38932 OBSERVATIONS AND A VOCABULARY OF 45940 UNIQUE TERMS (AFTER STOP-WORD REMOVAL)
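
The second dimension of the TF-IDF matrix is the size of the vocabulary learned by the vectorizer: one column per distinct term that survives stop-word removal. A small sketch on three made-up sentences, using scikit-learn's built-in 'english' stop-word list as a stand-in for the NLTK list (docs and toy_vectorizer are hypothetical names):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus.
docs = ["the room was clean", "the room was noisy", "clean and quiet room"]

toy_vectorizer = TfidfVectorizer(stop_words="english")
X_toy = toy_vectorizer.fit_transform(docs)

# One row per document, one column per vocabulary term.
print(X_toy.shape)
# vocabulary_ maps each term to its column index.
print(sorted(toy_vectorizer.vocabulary_))
```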

In [342]:
#USE TRAIN TEST SPLIT FUNCTION TO DIVIDE THE DATASET
X_train, X_test, y_train, y_test= train_test_split(X, Y , random_state=42)
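
With no test_size given, train_test_split holds out 25% of the rows. If the two classes are unbalanced, passing stratify keeps the class proportions similar in both parts. This is an optional variation, not what the cell above does; the arrays below are made up for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 20 rows, 15 positive and 5 negative labels.
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.array([1] * 15 + [0] * 5)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo,
    test_size=0.25,      # same as the default
    stratify=y_demo,     # keep the class ratio in both parts
    random_state=42,
)

print(len(y_tr), len(y_te))
print(y_tr.mean(), y_te.mean())
```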

In [343]:
from sklearn.naive_bayes import MultinomialNB

#TRAIN THE NAIVE BAYES CLASSIFIER
clf = MultinomialNB()
clf.fit(X_train, y_train)



MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
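
The vectorizer and the classifier can also be chained with make_pipeline, so that raw text goes in and predictions come out of a single object. This is an alternative arrangement, not the one used in this notebook; the texts, labels, and the name model are hypothetical:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical training reviews and labels (1 = happy, 0 = not happy).
train_texts = [
    "clean room friendly staff",
    "lovely quiet stay",
    "dirty room rude staff",
    "noisy smelly stay",
]
train_labels = [1, 1, 0, 0]

# The pipeline fits the vectorizer and the classifier together.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

# New text is vectorized with the fitted vocabulary before classification.
probs = model.predict_proba(["friendly staff and lovely room"])
print(probs.shape)
```

A pipeline makes it harder to refit the vectorizer on the test data by accident, because fitting only happens inside model.fit.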

In [344]:
preds = clf.predict(X_test)

In [220]:
#EVALUATING THE MODEL: ROC AUC ON THE HELD-OUT SPLIT

roc_auc_score(y_test,clf.predict_proba(X_test)[:,1])

0.91249169268360486
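
ROC AUC is computed from the predicted probability of the positive class, not from hard 0/1 predictions, which is why predict_proba(X_test)[:, 1] is passed to roc_auc_score. It measures how well the scores rank positives above negatives and is a different quantity from accuracy. A small sketch with made-up labels and scores:

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Hypothetical true labels and predicted probabilities of class 1.
y_true = np.array([0, 0, 0, 1, 1])
scores = np.array([0.2, 0.3, 0.7, 0.6, 0.9])

# AUC: share of (positive, negative) pairs where the positive scores higher.
auc = roc_auc_score(y_true, scores)

# Accuracy needs hard predictions, here thresholded at 0.5.
acc = accuracy_score(y_true, (scores >= 0.5).astype(int))

print(round(auc, 3), acc)
```

Here five of the six positive/negative pairs are ranked correctly, while four of the five thresholded predictions are correct, so the two metrics differ.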

In [345]:
#Predicting responses for Test File

df_test_new = df_test.drop('Device_Used', axis = 1)
df_test_new = df_test_new.drop('Browser_Used', axis = 1)
df_test_new.head(5)
df_test_new.shape


(29404, 2)

In [346]:

X_Test = df_test_new
X_Test.head(5)
X_Test['Is_Response'] = 0
X_Test.shape

(29404, 3)

In [347]:
#Reuse the vectorizer fitted on the training data: transform only, do not refit
X1 = vectorizer.transform(X_Test.Description)

In [348]:
X1

<29404x45940 sparse matrix of type '<class 'numpy.float64'>'
	with 1916481 stored elements in Compressed Sparse Row format>
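
The test matrix has the same 45940 columns as the training matrix because the reviews were passed through transform rather than fit_transform: the vocabulary learned from the training data is reused, and words that never appeared in training are ignored. A small sketch with made-up sentences (train_docs, new_docs, and toy_vectorizer are hypothetical names):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical training and unseen reviews.
train_docs = ["clean room friendly staff", "dirty room rude staff"]
new_docs = ["friendly staff but tiny bathroom"]

toy_vectorizer = TfidfVectorizer()
X_train_toy = toy_vectorizer.fit_transform(train_docs)  # learns the vocabulary
X_new_toy = toy_vectorizer.transform(new_docs)          # reuses it unchanged

# Same number of columns in both matrices.
print(X_train_toy.shape, X_new_toy.shape)
# Only 'friendly' and 'staff' are known terms; the other words are dropped.
print(X_new_toy.nnz)
```

Calling fit_transform on the test data instead would build a different vocabulary, and the classifier's columns would no longer line up with the features.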

In [349]:
X_Test['Is_Response'] = clf.predict(X1)

In [351]:
X_Test.head(5)

Unnamed: 0,User_ID,Description,Is_Response
0,id80132,Looking for a motel in close proximity to TV t...,1
1,id80133,Walking distance to Madison Square Garden and ...,1
2,id80134,Visited Seattle on business. Spent - nights in...,1
3,id80135,This hotel location is excellent and the rooms...,0
4,id80136,This hotel is awesome I love the service Antho...,1


In [355]:

X_Test=X_Test.drop(['Description'],axis=1)

In [356]:
X_Test.head(5)

Unnamed: 0,User_ID,Is_Response
0,id80132,1
1,id80133,1
2,id80134,1
3,id80135,0
4,id80136,1


In [357]:
#Convert the numeric predictions back to text labels
X_Test['Is_Response'] = X_Test['Is_Response'].map({1: 'happy', 0: 'not_happy'})

In [358]:
X_Test.head(5)

Unnamed: 0,User_ID,Is_Response
0,id80132,happy
1,id80133,happy
2,id80134,happy
3,id80135,not_happy
4,id80136,happy


In [359]:
X_Test.to_csv('/Users/Arati/Documents/UIS/Antworks/Submission.csv',sep=',', index= False)