# Capstone Project: Comment Subtopics Analysis for Airbnb Hosts
---

How can an Airbnb host identify their strengths and weaknesses? How can hosts spot trends in guest demand from a large volume of comments? This project uses machine learning tools to help hosts understand the underlying trends in the comments on their properties.  

---

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Capstone-Project:-Comment-Subtopics-Analysis-for-Airbnb-Hosts" data-toc-modified-id="Capstone-Project:-Comment-Subtopics-Analysis-for-Airbnb-Hosts-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Capstone Project: Comment Subtopics Analysis for Airbnb Hosts</a></span></li><li><span><a href="#Part-3.1:-LDA-Analysis-On-Comments-Based-On-Sentiment" data-toc-modified-id="Part-3.1:-LDA-Analysis-On-Comments-Based-On-Sentiment-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Part 3.1: LDA Analysis On Comments Based On Sentiment</a></span><ul class="toc-item"><li><span><a href="#Import-Review-Data" data-toc-modified-id="Import-Review-Data-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Import Review Data</a></span><ul class="toc-item"><li><ul class="toc-item"><li><span><a href="#Import-Review-Data-Steps" data-toc-modified-id="Import-Review-Data-Steps-2.1.0.1"><span class="toc-item-num">2.1.0.1&nbsp;&nbsp;</span>Import Review Data Steps</a></span></li></ul></li></ul></li><li><span><a href="#LDA-Prep" data-toc-modified-id="LDA-Prep-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>LDA Prep</a></span><ul class="toc-item"><li><span><a href="#Step1:-CountVectorize" data-toc-modified-id="Step1:-CountVectorize-2.2.1"><span class="toc-item-num">2.2.1&nbsp;&nbsp;</span>Step1: CountVectorize</a></span></li><li><span><a href="#Positive-Sentiment-Dictionary" data-toc-modified-id="Positive-Sentiment-Dictionary-2.2.2"><span class="toc-item-num">2.2.2&nbsp;&nbsp;</span>Positive Sentiment Dictionary</a></span></li><li><span><a href="#LDA-Positive-Sentiment" data-toc-modified-id="LDA-Positive-Sentiment-2.2.3"><span class="toc-item-num">2.2.3&nbsp;&nbsp;</span>LDA Positive Sentiment</a></span></li><li><span><a href="#Negative-Sentiment-Dictionary" data-toc-modified-id="Negative-Sentiment-Dictionary-2.2.4"><span class="toc-item-num">2.2.4&nbsp;&nbsp;</span>Negative Sentiment Dictionary</a></span></li></ul></li><li><span><a 
href="#LDA-Run" data-toc-modified-id="LDA-Run-2.3"><span class="toc-item-num">2.3&nbsp;&nbsp;</span>LDA Run</a></span><ul class="toc-item"><li><span><a href="#LDA-Negative-Sentiment" data-toc-modified-id="LDA-Negative-Sentiment-2.3.1"><span class="toc-item-num">2.3.1&nbsp;&nbsp;</span>LDA Negative Sentiment</a></span></li></ul></li></ul></li></ul></div>


# Part 3.1: LDA Analysis On Comments Based On Sentiment
---

In [3]:
# !pip install --upgrade pyldavis
# !pip install --upgrade gensim

In [2]:
# !pip install langdetect

In [1]:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS
from sklearn.decomposition import LatentDirichletAllocation

from gensim import corpora, models
import pyLDAvis.gensim  # this module is named pyLDAvis.gensim_models in pyLDAvis 3.x
from langdetect import detect

pyLDAvis.enable_notebook()

import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

np.random.seed(42)

## Import Review Data 
---

#### Import Review Data Steps 
1. Import the data.
2. Select two columns: reviews and sentiment.
3. Split the DataFrame in two: one with positive sentiment, the other with negative.
4. Run LDA on both.
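
The steps above can be sketched on a toy table. The column names `pos` and `compound` and the cut-offs 0.5 and -0.3 follow the cells in this notebook; the helper `split_by_sentiment` and the toy rows are hypothetical and exist only for illustration.

```python
import pandas as pd

# Toy stand-in for the review table; only the columns used for the split.
toy_reviews = pd.DataFrame({
    "comments": ["Great host, great place", "Dirty room and rude host", "It was fine"],
    "pos": [0.62, 0.00, 0.20],
    "compound": [0.85, -0.70, 0.10],
})

def split_by_sentiment(df, pos_min=0.5, compound_max=-0.3):
    """Hypothetical helper: return (positive, negative) subsets of df."""
    positive = df[df["pos"] > pos_min]
    negative = df[df["compound"] < compound_max]
    return positive, negative

toy_positive, toy_negative = split_by_sentiment(toy_reviews)
print(len(toy_positive), len(toy_negative))
```

Note that the two filters use different columns, so a review that satisfies neither (the third toy row) ends up in neither subset.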

In [2]:
review_sentiments = pd.read_csv("../data/reviews_sentiment_score.csv", index_col = 0)

In [3]:
review_sentiments.head()

| | key_0 | listing_id | id | date | reviewer_id | reviewer_name | comments | language | overall_rating | compound | neg | neu | pos |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 958 | 5977 | 2009-07-23 | 15695 | Edmund C | Our experience was, without a doubt, a five st... | en | 97.0 | 0.959 | 0.0 | 0.788 | 0.212 |
| 1 | 1 | 958 | 6660 | 2009-08-03 | 26145 | Simon | Returning to San Francisco is a rejuvenating t... | en | 97.0 | 0.9819 | 0.0 | 0.697 | 0.303 |
| 2 | 2 | 958 | 11519 | 2009-09-27 | 25839 | Denis | We were very pleased with the accommodations a... | en | 97.0 | 0.76 | 0.134 | 0.71 | 0.156 |
| 3 | 3 | 958 | 16282 | 2009-11-05 | 33750 | Anna | We highly recommend this accomodation and agre... | en | 97.0 | 0.984 | 0.035 | 0.646 | 0.319 |
| 4 | 4 | 958 | 26008 | 2010-02-13 | 15416 | Venetia | Holly's place was great. It was exactly what I... | en | 97.0 | 0.9617 | 0.0 | 0.613 | 0.387 |


In [4]:
review_sentiments.dtypes

key_0               int64
listing_id          int64
id                  int64
date               object
reviewer_id         int64
reviewer_name      object
comments           object
language           object
overall_rating    float64
compound          float64
neg               float64
neu               float64
pos               float64
dtype: object

In [5]:
# positive subset: reviews whose `pos` score exceeds 0.5
positive_df = review_sentiments[review_sentiments['pos'] > 0.5]

In [6]:
positive_df.shape

(38588, 13)

In [7]:
# negative subset: reviews whose `compound` score is below -0.3
negative_df = review_sentiments[review_sentiments['compound'] < -0.3]

In [8]:
negative_df.shape

(1610, 13)

## LDA Prep
---

### Step1: CountVectorize 

### Positive Sentiment Dictionary 
---

In [9]:
#for positive sentiment 
cv = CountVectorizer(ngram_range= (2,2), 
                     stop_words= 'english', 
                     min_df = 2)
cv_positive = cv.fit_transform(positive_df['comments'])
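
To make the vectorizer settings concrete, here is a small sketch on a single made-up sentence. It uses the same `ngram_range` and `stop_words` arguments as the cell above but leaves `min_df` at its default so that one document is enough; the sentence and the `demo_` names are illustrative only.

```python
from sklearn.feature_extraction.text import CountVectorizer

sample = ["The apartment was clean and the location was perfect"]

# Same bigram and stop-word settings as the notebook's vectorizer.
demo_cv = CountVectorizer(ngram_range=(2, 2), stop_words="english")
demo_cv.fit(sample)

# vocabulary_ maps each learned bigram to its column index.
bigrams = sorted(demo_cv.vocabulary_)
print(bigrams)
```

Stop words are dropped before the bigrams are built, so a pair such as `clean location` can appear even though those two words were not adjacent in the original sentence. This is worth keeping in mind when reading topic terms later.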

In [10]:
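# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is available from 1.0
cv_positive_df = pd.DataFrame(cv_positive.toarray(), columns = cv.get_feature_names_out())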

In [11]:
cv_positive_df.head()

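| | 00 night | 00 pm | 10 00 | 10 00pm | 10 10 | 10 11pm | 10 12 | 10 15 | 10 15min | 10 20 | ... | zoo close | zoo enjoyed | zoo golden | zoo great | zoo ocean | zoo right | zoo went | über rides | 정말 좋았습니다 | 좋았습니다 호스트 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |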
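
The last two vocabulary columns are Korean bigrams, which suggests that some non-English reviews reached the vectorizer. One possible remedy, sketched below on toy rows, is to filter on the `language` column that the review table already carries. This assumes the column holds a language code such as `en` for every row; the toy frame and variable names are illustrative.

```python
import pandas as pd

# Toy rows with the same two columns the filter relies on.
toy = pd.DataFrame({
    "comments": ["Lovely stay near the zoo", "정말 좋았습니다 호스트"],
    "language": ["en", "ko"],
})

# Keep only rows tagged as English before vectorizing.
english_only = toy[toy["language"] == "en"]
print(english_only["comments"].tolist())
```

Applied to `positive_df` and `negative_df` before `fit_transform`, the same mask would keep the English stop-word list and the bigram vocabulary consistent with the text being analysed.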


In [13]:
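# Series.nonzero() was removed in pandas 1.0; go through NumPy to find each row's non-zero columns
positive_texts = [cv_positive_df.columns[cv_positive_df.loc[index, :].to_numpy().nonzero()[0]]
                  for index in cv_positive_df.index]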
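
Building a dense DataFrame for tens of thousands of reviews and then scanning it row by row can be slow and memory-hungry. As a possible alternative, `CountVectorizer.inverse_transform` returns, for each document, the terms with non-zero counts directly from the sparse matrix. The sketch below uses a toy corpus and hypothetical `demo_` names; it is not the code this notebook ran.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "great location great host",
    "great location clean room",
    "clean room great host",
]

# Bigrams that occur in at least two documents, as in the notebook.
demo_cv = CountVectorizer(ngram_range=(2, 2), min_df=2)
demo_counts = demo_cv.fit_transform(docs)  # stays sparse

# One list of bigrams per document, with no dense conversion.
demo_texts = [list(terms) for terms in demo_cv.inverse_transform(demo_counts)]
for text in demo_texts:
    print(sorted(text))
```

The resulting lists can be handed to `corpora.Dictionary` in the same way as `positive_texts`.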

In [14]:
dictionary_pos = corpora.Dictionary(positive_texts)

In [15]:
dictionary_pos

<gensim.corpora.dictionary.Dictionary at 0x1a23088550>

In [16]:
corpus_pos = [dictionary_pos.doc2bow(text) for text in positive_texts]

### LDA Positive Sentiment

In [17]:
ldamodel_pos = models.ldamodel.LdaModel(corpus_pos,                  # bag-of-words corpus
                                        id2word = dictionary_pos,    # maps each token id back to its bigram
                                        num_topics = 15,             # number of topics to find
                                        passes = 5,                  # passes through the corpus; similar to epochs
                                        minimum_probability = 0.01)  # topics below this probability are filtered out

In [18]:
pyLDAvis.gensim.prepare(ldamodel_pos, corpus_pos, dictionary_pos)



### Negative Sentiment Dictionary 
---

In [19]:
# for negative sentiment
cv = CountVectorizer(ngram_range= (2,2), 
                     stop_words= 'english', 
                     min_df = 2)
cv_negative = cv.fit_transform(negative_df['comments'])

In [20]:
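# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is available from 1.0
cv_negative_df = pd.DataFrame(cv_negative.toarray(), columns = cv.get_feature_names_out())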

In [21]:
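# Series.nonzero() was removed in pandas 1.0; go through NumPy to find each row's non-zero columns
negative_texts = [cv_negative_df.columns[cv_negative_df.loc[index, :].to_numpy().nonzero()[0]]
                  for index in cv_negative_df.index]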

In [22]:
dictionary_neg = corpora.Dictionary(negative_texts)

In [23]:
corpus_neg = [dictionary_neg.doc2bow(text) for text in negative_texts]

## LDA Run
---

### LDA Negative Sentiment 

In [24]:
ldamodel_neg = models.ldamodel.LdaModel(corpus_neg, 
                                        id2word = dictionary_neg, 
                                        num_topics = 12, 
                                        passes = 5, 
                                        minimum_probability = 0.01)

In [25]:
pyLDAvis.gensim.prepare(ldamodel_neg, corpus_neg, dictionary_neg)

