# Capstone Project: Comment Subtopics Analysis for Airbnb Hosts
---

How can a host on Airbnb understand that are their strengths and weaknesses? How can hosts point out the demand trend of their customers from a large scale of comments? This project focuses on using machine learning tools to help hosts understand the underlying trends of the comments on their property.  

---


# Part 5: Use Comments to Predict Star Ratings For Each Host

- After performing exploratory data analysis on the documents, we can see that here is some kind of realtionship between comments and review score.Now, with this finding, in this documentation, we are going to use comments in each listings to predict the rating for this listing. 
---

### Steps 
1. creating individual corpus for each listing 
2. build a baseline model for prediction 
3. improve on model

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import regex as re

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfTransformer, CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn import metrics

from sklearn.linear_model import LinearRegression
from sklearn.naive_bayes import MultinomialNB, GaussianNB
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor, export_graphviz
from sklearn.ensemble import ExtraTreesRegressor, RandomForestRegressor, BaggingRegressor, GradientBoostingRegressor, AdaBoostRegressor
from sklearn import svm

import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

np.random.seed(42)

In [2]:
reviews = pd.read_csv('../data/reviews_sentiment_score.csv', index_col = 0)

In [3]:
reviews.head()

Unnamed: 0,key_0,listing_id,id,date,reviewer_id,reviewer_name,comments,language,overall_rating,compound,neg,neu,pos
0,0,958,5977,2009-07-23,15695,Edmund C,"Our experience was, without a doubt, a five st...",en,97.0,0.959,0.0,0.788,0.212
1,1,958,6660,2009-08-03,26145,Simon,Returning to San Francisco is a rejuvenating t...,en,97.0,0.9819,0.0,0.697,0.303
2,2,958,11519,2009-09-27,25839,Denis,We were very pleased with the accommodations a...,en,97.0,0.76,0.134,0.71,0.156
3,3,958,16282,2009-11-05,33750,Anna,We highly recommend this accomodation and agre...,en,97.0,0.984,0.035,0.646,0.319
4,4,958,26008,2010-02-13,15416,Venetia,Holly's place was great. It was exactly what I...,en,97.0,0.9617,0.0,0.613,0.387


In [4]:
listing = pd.read_csv('../data/listings/2019-03-06_data_listings.csv')

In [5]:
reviews.dtypes

key_0               int64
listing_id          int64
id                  int64
date               object
reviewer_id         int64
reviewer_name      object
comments           object
language           object
overall_rating    float64
compound          float64
neg               float64
neu               float64
pos               float64
dtype: object

## Creating individual Corpus For Each Listings 

---

In [6]:
review_dict = {}
for i in reviews['listing_id'].unique(): 
    comment_list = list(reviews[reviews['listing_id'] == i]['comments'])
    one_list = "".join(comment_list)
    review_dict[i] = one_list

In [7]:
len(review_dict)

4199

In [8]:
len(reviews['listing_id'].unique())

4199

In [9]:
review_prediction_df = pd.DataFrame(data = [review_dict.keys()])

In [10]:
review_prediction_df = review_prediction_df.T

In [11]:
review_prediction_df.columns = ['listing_id']

In [12]:
review_prediction_df['comment'] = review_prediction_df['listing_id'].map(review_dict)

In [13]:
review_prediction_df['overall_score'] = listing['review_scores_rating']

In [14]:
#complete Review Prediction Data Frame 
review_prediction_df.head()

Unnamed: 0,listing_id,comment,overall_score
0,958,"Our experience was, without a doubt, a five st...",97.0
1,5858,We had a fabulous time staying with Philip and...,98.0
2,7918,My stay was fantastic! The neighborhood is gr...,85.0
3,8142,"Excellent! The space is clean and quiet, and t...",93.0
4,8339,My stay was wonderful in many ways; the apartm...,97.0


In [15]:
review_prediction_df.describe()

Unnamed: 0,listing_id,overall_score
count,4199.0,4008.0
mean,9700544.0,95.506487
std,7126105.0,6.109349
min,958.0,20.0
25%,2700651.0,94.0
50%,8817059.0,97.0
75%,16162970.0,99.0
max,21812820.0,100.0


## Cleaning Comments for CountVectorization 

---

In [16]:
def clean_text(text): 
    '''
    This clean_text function will focus on: 
    1. cleaning the content including removing prentecisis, \r\n, : and so on. 
    2. lower case all words.
    3. remove stop words from articles. 
    '''
    text = re.sub(r'\r\n', r' ', text)
    text = re.sub(r'[\\\.\:\*/\,\!]', r' ', text)
    text = re.sub(r'[\(\)]', r' ', text)
    text = re.sub(r'[\"\“\”\—\[\]]', r' ', text)
    text = re.sub(r"'s", r' ', text)
    text = re.sub(r'[\\"]', r' ', text)
    text = text.lower()
    
    return text

In [17]:
clean_comment = []
for comment in review_prediction_df['comment']: 
    new_comment = clean_text(comment)
    clean_comment.append(new_comment)

In [18]:
review_prediction_df['comment'] = clean_comment

In [19]:
review_prediction_df.head()

Unnamed: 0,listing_id,comment,overall_score
0,958,our experience was without a doubt a five st...,97.0
1,5858,we had a fabulous time staying with philip and...,98.0
2,7918,my stay was fantastic the neighborhood is gr...,85.0
3,8142,excellent the space is clean and quiet and t...,93.0
4,8339,my stay was wonderful in many ways; the apartm...,97.0


## Train Test Split 
___

In [41]:
X = review_prediction_df['comment']
y = review_prediction_df['overall_score'].fillna(0)

In [42]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 24)

### with CountVectorizer 

In [43]:
#building baseine model 
cv = CountVectorizer(ngram_range= (1,2), 
                     stop_words= 'english', 
                     min_df = 2,
                     max_features = 100000)
X_train_cv = cv.fit_transform(X_train)

In [44]:
X_test_cv = cv.transform(X_test)

In [45]:
X_train_cv.shape, X_test_cv.shape

((3149, 100000), (1050, 100000))

### with TFIDF 

In [46]:
tfidf = TfidfVectorizer(max_features = 200000, 
                        stop_words = 'english', 
                        ngram_range = (1,2))
X_train_tfidf = tfidf.fit_transform(X_train)

In [47]:
X_test_tfidf = tfidf.transform(X_test)

In [48]:
X_train_tfidf.shape, X_test_tfidf.shape

((3149, 200000), (1050, 200000))

## Dimention Reduction 
---

In [49]:
# svd = TruncatedSVD(n_components= 10000, random_state= 42)
# svd.fit(X_train_cv)

# X_train_svd = svd.transform(X_train_cv)
# X_test_svd = svd.transform(X_test_cv)

# X_train_svd.shape, X_test_svd.shape

## Building Baseline Model 
---

In [50]:
#with countvectorizer 
lr = LinearRegression()
lr.fit(X_train_cv, y_train)

print(lr.score(X_train_cv, y_train))
lr.score(X_test_cv, y_test)

0.9951423012675625


-24.440103481948977

In [51]:
#with tfidf
lr = LinearRegression()
lr.fit(X_train_tfidf, y_train)

print(lr.score(X_train_tfidf, y_train))
lr.score(X_test_tfidf, y_test)

0.9951423015332594


-0.20866166531442754

In [26]:
#Random Forest 
#without dimention Reduction
# rf = RandomForestRegressor(n_estimators= 50, max_depth= 100)

# rf.fit(X_train_cv, y_train)

# print(rf.score(X_train_cv, y_train))
# print(rf.score(X_test_cv, y_test))

## Use LSTM to Predict Stars 
---


In [37]:
from keras.models import Sequential
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Dense, Flatten, LSTM, Conv1D, MaxPooling1D, Dropout, Activation
from keras.layers.embeddings import Embedding
from keras.optimizers import Adam
from keras.callbacks import EarlyStopping

from nltk.corpus import stopwords

In [38]:
X_train.shape

(3149,)

In [33]:
model = Sequential()
model.add(LSTM(1200, input_shape=(1, 356358), return_sequences=True, activation = 'relu'))

model.add(Dropout(0.1))

model.add(LSTM(600, activation ='relu'))

model.add(Dropout(0.5))

model.add(Dense(100, activation = 'relu'))

model.add(Dense(1, activation='linear'))

early_stop = EarlyStopping(monitor='val_loss', min_delta=0, patience=5, verbose=1, mode='auto')

In [35]:
model.compile(loss='mse', optimizer='adam', metrics=['mae'])

history = model.fit_generator(X_train_cv, validation_data = X_test_cv, epochs=20, verbose=1, callbacks = [early_stop])

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all().