# Capstone Project: Comment Subtopics Analysis for Airbnb Hosts
---

How can an Airbnb host understand what their strengths and weaknesses are? How can hosts identify demand trends among their guests from a large volume of comments? This project uses machine learning tools to help hosts understand the underlying trends in the comments on their property.

---

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Capstone-Project:-Comment-Subtopics-Analysis-for-Airbnb-Hosts" data-toc-modified-id="Capstone-Project:-Comment-Subtopics-Analysis-for-Airbnb-Hosts-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Capstone Project: Comment Subtopics Analysis for Airbnb Hosts</a></span></li><li><span><a href="#Part-3.2:-LDA-Analysis-On-Comments-Based-On-Overall-Review" data-toc-modified-id="Part-3.2:-LDA-Analysis-On-Comments-Based-On-Overall-Review-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Part 3.2: LDA Analysis On Comments Based On Overall Review</a></span><ul class="toc-item"><li><span><a href="#Import-Review-Data" data-toc-modified-id="Import-Review-Data-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Import Review Data</a></span><ul class="toc-item"><li><ul class="toc-item"><li><span><a href="#Import-Review-Data-Steps" data-toc-modified-id="Import-Review-Data-Steps-2.1.0.1"><span class="toc-item-num">2.1.0.1&nbsp;&nbsp;</span>Import Review Data Steps</a></span></li></ul></li></ul></li><li><span><a href="#LDA-Prep" data-toc-modified-id="LDA-Prep-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>LDA Prep</a></span><ul class="toc-item"><li><span><a href="#Step1:-CountVectorize" data-toc-modified-id="Step1:-CountVectorize-2.2.1"><span class="toc-item-num">2.2.1&nbsp;&nbsp;</span>Step1: CountVectorize</a></span><ul class="toc-item"><li><span><a href="#Good-Rating-Dictionary" data-toc-modified-id="Good-Rating-Dictionary-2.2.1.1"><span class="toc-item-num">2.2.1.1&nbsp;&nbsp;</span>Good Rating Dictionary</a></span></li></ul></li><li><span><a href="#LDA-On-Good-Rating" data-toc-modified-id="LDA-On-Good-Rating-2.2.2"><span class="toc-item-num">2.2.2&nbsp;&nbsp;</span>LDA On Good Rating</a></span></li><li><span><a href="#Bad-Rating-Dictionary" data-toc-modified-id="Bad-Rating-Dictionary-2.2.3"><span class="toc-item-num">2.2.3&nbsp;&nbsp;</span>Bad Rating Dictionary</a></span></li></ul></li><li><span><a 
href="#LDA-Run" data-toc-modified-id="LDA-Run-2.3"><span class="toc-item-num">2.3&nbsp;&nbsp;</span>LDA Run</a></span><ul class="toc-item"><li><span><a href="#LDA-Bad-Ratings" data-toc-modified-id="LDA-Bad-Ratings-2.3.1"><span class="toc-item-num">2.3.1&nbsp;&nbsp;</span>LDA Bad Ratings</a></span></li></ul></li></ul></li></ul></div>


# Part 3.2: LDA Analysis On Comments Based On Overall Review
---

In [56]:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS
from sklearn.decomposition import LatentDirichletAllocation

from gensim import corpora, models
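import pyLDAvis
import pyLDAvis.gensim  # needed for pyLDAvis.gensim.prepare; in pyLDAvis 3.x the module is pyLDAvis.gensim_models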
from langdetect import detect

pyLDAvis.enable_notebook()

import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

np.random.seed(42)

## Import Review Data 
---

#### Import Review Data Steps 
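1. Import the review data.
2. Identify the two columns of interest: the review text (`comments`) and the rating (`overall_rating`).
3. Split the dataframe in two: good ratings (`overall_rating` of 100) and bad ratings (`overall_rating` below 80).
4. Run LDA on both subsets.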

In [57]:
review_sentiments = pd.read_csv("../data/reviews_sentiment_score.csv", index_col = 0)

In [58]:
review_sentiments.head()

Unnamed: 0,key_0,listing_id,id,date,reviewer_id,reviewer_name,comments,language,overall_rating,compound,neg,neu,pos
0,0,958,5977,2009-07-23,15695,Edmund C,"Our experience was, without a doubt, a five st...",en,97.0,0.959,0.0,0.788,0.212
1,1,958,6660,2009-08-03,26145,Simon,Returning to San Francisco is a rejuvenating t...,en,97.0,0.9819,0.0,0.697,0.303
2,2,958,11519,2009-09-27,25839,Denis,We were very pleased with the accommodations a...,en,97.0,0.76,0.134,0.71,0.156
3,3,958,16282,2009-11-05,33750,Anna,We highly recommend this accomodation and agre...,en,97.0,0.984,0.035,0.646,0.319
4,4,958,26008,2010-02-13,15416,Venetia,Holly's place was great. It was exactly what I...,en,97.0,0.9617,0.0,0.613,0.387


In [59]:
review_sentiments.dtypes

key_0               int64
listing_id          int64
id                  int64
date               object
reviewer_id         int64
reviewer_name      object
comments           object
language           object
overall_rating    float64
compound          float64
neg               float64
neu               float64
pos               float64
dtype: object

In [60]:
good_rating = review_sentiments[review_sentiments['overall_rating'] == 100 ]

In [61]:
good_rating.head()

Unnamed: 0,key_0,listing_id,id,date,reviewer_id,reviewer_name,comments,language,overall_rating,compound,neg,neu,pos
4598,4791,25094,81140853,2016-06-21,76212897,Mattie,Bruce and Alfredo were wonderful hosts. I felt...,en,100.0,0.9853,0.033,0.695,0.272
4599,4792,25094,86021875,2016-07-14,6770350,Patty,I have stayed in Airbnb's all over the US and ...,en,100.0,0.9792,0.027,0.621,0.352
4600,4793,25094,88188179,2016-07-23,83842908,Brian,Bruce and Alfredo were the best hosts I've had...,en,100.0,0.9826,0.019,0.742,0.24
4601,4794,25094,101684726,2016-09-14,6937471,Keely,Bruce and Alfredo are the most wonderful hosts...,en,100.0,0.9565,0.0,0.715,0.285
4602,4795,25094,103330395,2016-09-21,36866721,Hyejin,Bruce and Alfredo are amazing hosts. They welc...,en,100.0,0.9878,0.0,0.463,0.537


In [62]:
good_rating.shape

(13949, 13)

In [63]:
bad_rating = review_sentiments[review_sentiments['overall_rating'] < 80]

In [64]:
bad_rating.shape

(606, 13)

## LDA Prep
---

### Step1: CountVectorize 

#### Good Rating Dictionary 
---

In [65]:
# bigram counts for reviews of listings with good ratings
cv = CountVectorizer(ngram_range= (2,2), 
                     stop_words= 'english', 
                     min_df = 2)
cv_good_rating = cv.fit_transform(good_rating['comments'])
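
To make the vectorizer settings concrete, here is a minimal pure-Python sketch of what `ngram_range=(2,2)`, `stop_words='english'` and `min_df=2` do together. It is an illustration under assumptions, not scikit-learn's implementation: `STOP_WORDS` is a tiny made-up subset of the real list, and `bigrams` and `bigram_vocabulary` are hypothetical helper names. Note that stop words are dropped before adjacent tokens are paired, so a bigram can span a removed word.

```python
import re
from collections import Counter

# Illustrative subset only; scikit-learn's built-in 'english' list is much longer.
STOP_WORDS = {"the", "was", "and", "a", "is", "to", "in", "very", "were"}

def bigrams(text):
    """Lowercase, tokenize, drop stop words, then pair adjacent tokens."""
    tokens = [t for t in re.findall(r"\b\w\w+\b", text.lower())
              if t not in STOP_WORDS]
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]

def bigram_vocabulary(docs, min_df=2):
    """Keep bigrams that occur in at least `min_df` different documents."""
    doc_freq = Counter()
    for doc in docs:
        # set() so that each document counts a bigram at most once
        doc_freq.update(set(bigrams(doc)))
    return sorted(term for term, df in doc_freq.items() if df >= min_df)

docs = [
    "The host was great and the location was perfect",
    "Great location and a very clean room",
    "The room was clean and the host was friendly",
]
vocab = bigram_vocabulary(docs, min_df=2)
print(vocab)
```

With only 606 bad-rating reviews, `min_df = 2` can remove a large share of bigrams, so it may be worth checking the vocabulary size after fitting.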

In [66]:
# On scikit-learn >= 1.0 use cv.get_feature_names_out(); get_feature_names() was removed in 1.2
cv_goodrating_df =  pd.DataFrame(cv_good_rating.toarray(), columns = cv.get_feature_names())

In [67]:
good_texts = [cv_goodrating_df.columns[cv_goodrating_df.loc[index,:].to_numpy().nonzero()] for index in cv_goodrating_df.index] 

In [68]:
dictionary_good = corpora.Dictionary(good_texts)

In [69]:
dictionary_good

<gensim.corpora.dictionary.Dictionary at 0x13de2d908>

In [70]:
corpus_good = [dictionary_good.doc2bow(text) for text in good_texts]
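
The dictionary and corpus steps can be pictured with a small pure-Python sketch. It mimics the shape of a bag-of-words corpus, a list of `(token_id, count)` pairs per document, but the helpers `build_token_ids` and `to_bow` are hypothetical and the id assignment is not guaranteed to match gensim's. Because `good_texts` lists each non-zero bigram column once per review, every count in the resulting corpus is 1: the corpus records which bigrams are present, not how often they occur.

```python
from collections import Counter

def build_token_ids(texts):
    """Assign an integer id to each distinct token, in first-seen order."""
    token2id = {}
    for text in texts:
        for token in text:
            token2id.setdefault(token, len(token2id))
    return token2id

def to_bow(text, token2id):
    """Return sorted (token_id, count) pairs for one document."""
    counts = Counter(token2id[t] for t in text if t in token2id)
    return sorted(counts.items())

# Each inner list plays the role of one review's bigrams.
texts = [
    ["great location", "clean room"],
    ["great location", "friendly host"],
]
token2id = build_token_ids(texts)
corpus = [to_bow(text, token2id) for text in texts]
print(token2id)
print(corpus)
```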

### LDA On Good Rating 

In [71]:
ldamodel_good = models.ldamodel.LdaModel(corpus_good,                   # pass in our corpus
                                         id2word = dictionary_good,     # maps each token id back to its bigram
                                         num_topics = 15,               # number of topics to find
                                         passes = 5,                    # number of passes through the corpus; similar to epochs
                                         minimum_probability = 0.01)    # topics below this probability are filtered out
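
As a rough illustration of the `minimum_probability` setting, the sketch below filters a made-up per-document topic distribution the way the threshold is documented to behave: topics whose probability falls below the cutoff are left out of the reported result. `filter_topics` is a hypothetical helper and the probabilities are invented for the example; they are not output from the model above.

```python
def filter_topics(topic_dist, minimum_probability=0.01):
    """Keep (topic_id, probability) pairs at or above the threshold."""
    return [(topic_id, prob)
            for topic_id, prob in enumerate(topic_dist)
            if prob >= minimum_probability]

# Invented distribution over 5 topics for a single review.
doc_topics = [0.62, 0.005, 0.30, 0.07, 0.005]
kept = filter_topics(doc_topics)
print(kept)
```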

In [72]:
pyLDAvis.gensim.prepare(ldamodel_good, corpus_good, dictionary_good)



### Bad Rating Dictionary 
---

In [73]:
# bigram counts for reviews of listings with bad ratings
cv = CountVectorizer(ngram_range= (2,2), 
                     stop_words= 'english', 
                     min_df = 2)
cv_bad_rating = cv.fit_transform(bad_rating['comments'])

In [75]:
# On scikit-learn >= 1.0 use cv.get_feature_names_out(); get_feature_names() was removed in 1.2
cv_bad_rating_df =  pd.DataFrame(cv_bad_rating.toarray(), columns = cv.get_feature_names())

In [76]:
bad_rating_texts = [cv_bad_rating_df.columns[cv_bad_rating_df.loc[index,:].to_numpy().nonzero()] for index in cv_bad_rating_df.index]

In [77]:
dictionary_bad = corpora.Dictionary(bad_rating_texts)

In [78]:
corpus_bad = [dictionary_bad.doc2bow(text) for text in bad_rating_texts]

## LDA Run
---

### LDA Bad Ratings

In [79]:
ldamodel_bad = models.ldamodel.LdaModel(corpus_bad, 
                                        id2word = dictionary_bad, 
                                        num_topics = 15, 
                                        passes = 5, 
                                        minimum_probability = 0.01)

In [80]:
pyLDAvis.gensim.prepare(ldamodel_bad, corpus_bad, dictionary_bad)