# Capstone Project: Comment Subtopics Analysis for Airbnb Hosts
---

How can an Airbnb host understand their strengths and weaknesses? How can hosts identify demand trends among their guests from a large volume of comments? This project uses machine learning tools to help hosts understand the underlying trends in the comments on their property.

---


# Part 2: LDA Analysis On Comments
---

In [2]:
# !pip install pyldavis
# !pip install --upgrade gensim

In [8]:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS
from sklearn.decomposition import LatentDirichletAllocation

from gensim import corpora, models
import pyLDAvis.gensim

pyLDAvis.enable_notebook()

import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

np.random.seed(42)

## Import Review Data 
---

#### Import Review Data Steps 
1. Import the data.
2. Select two columns: reviews and sentiment.
3. Split the dataframe in two: one with positive sentiment, the other with negative.
4. Run LDA on both.
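
The selection and split steps above can be sketched as follows. This is a minimal illustration on toy data, not the project's actual loading code: the column names `comments` and `sentiment`, and the string labels `"positive"` / `"negative"`, are assumptions made for the example.

```python
import pandas as pd

# Toy stand-in for the review data; column names and labels are hypothetical.
reviews = pd.DataFrame({
    "comments": [
        "Great host and a very clean room",
        "The apartment was noisy and dirty",
        "Lovely location, comfortable bed",
        "Check-in was slow and the wifi was broken",
    ],
    "sentiment": ["positive", "negative", "positive", "negative"],
})

# Step 2: keep only the review text and its sentiment label.
reviews = reviews[["comments", "sentiment"]]

# Step 3: build one dataframe per sentiment.
positive_reviews = reviews[reviews["sentiment"] == "positive"].reset_index(drop=True)
negative_reviews = reviews[reviews["sentiment"] == "negative"].reset_index(drop=True)

print(len(positive_reviews), len(negative_reviews))
```

Each of the two dataframes can then be vectorized and modeled on its own, as in the sections that follow.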

## LDA Prep
---

### Step 1: CountVectorize

#### Positive Sentiment Dictionary 
---

In [None]:
#for positive sentiment 
cv = CountVectorizer(ngram_range= (1,1), 
                     stop_words= 'english', 
                     min_df = 2)
cv_positive = cv.fit_transform(##content)

In [None]:
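# get_feature_names() was removed in scikit-learn 1.2; on versions before 1.0 use cv.get_feature_names() instead
cv_positive_df = pd.DataFrame(cv_positive.toarray(), columns=cv.get_feature_names_out())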

In [None]:
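# Words with a nonzero count in each comment (each word is listed once, so counts are dropped)
positive_texts = [list(cv_positive_df.columns[row.to_numpy().nonzero()[0]]) for _, row in cv_positive_df.iterrows()]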

In [None]:
dictionary_pos = corpora.Dictionary(positive_texts)

In [None]:
corpus_pos = [dictionary_pos.doc2bow(text) for text in positive_texts]
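
To make the dictionary and corpus steps concrete, here is a plain-Python sketch of the bag-of-words idea behind them: each distinct token gets an integer id, and each document becomes a list of `(token_id, count)` pairs. It is only an illustration and does not reproduce how gensim assigns ids; the helper names `build_vocab` and `to_bow` are made up for this example.

```python
from collections import Counter

def build_vocab(texts):
    """Assign an integer id to every distinct token, in order of first appearance."""
    vocab = {}
    for tokens in texts:
        for token in tokens:
            vocab.setdefault(token, len(vocab))
    return vocab

def to_bow(tokens, vocab):
    """Return sorted (token_id, count) pairs, skipping tokens that are not in vocab."""
    counts = Counter(vocab[token] for token in tokens if token in vocab)
    return sorted(counts.items())

# Two toy tokenized comments.
texts = [
    ["clean", "room", "great", "host"],
    ["great", "location", "great", "host"],
]

vocab = build_vocab(texts)
corpus = [to_bow(tokens, vocab) for tokens in texts]

print(vocab)
print(corpus)
```

The second comment contains "great" twice, so its pair for that token carries a count of 2. This sparse per-document representation is the kind of input the LDA model consumes.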

#### Negative Sentiment Dictionary 
---

In [None]:
# for negative sentiment
cv = CountVectorizer(ngram_range= (1,1), 
                     stop_words= 'english', 
                     min_df = 2)
cv_negative = cv.fit_transform(##content)

In [None]:
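# get_feature_names() was removed in scikit-learn 1.2; on versions before 1.0 use cv.get_feature_names() instead
cv_negative_df = pd.DataFrame(cv_negative.toarray(), columns=cv.get_feature_names_out())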

In [None]:
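# Words with a nonzero count in each comment (each word is listed once, so counts are dropped)
negative_texts = [list(cv_negative_df.columns[row.to_numpy().nonzero()[0]]) for _, row in cv_negative_df.iterrows()]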

In [None]:
dictionary_neg = corpora.Dictionary(negative_texts)

In [None]:
corpus_neg = [dictionary_neg.doc2bow(text) for text in negative_texts]

## LDA Run
---

### LDA Positive Sentiment

In [None]:
ldamodel_pos = models.ldamodel.LdaModel(corpus_pos,                     # pass in our corpus
                                    id2word = dictionary_pos,       # matches each word to its "number" or "spot" in the dictionary
                                    num_topics = 20,             # number of topics T to find
                                    passes = 5,                 # number of passes through corpus; similar to number of epochs
                                    minimum_probability = 0.01)

In [None]:
pyLDAvis.gensim.prepare(ldamodel_pos, corpus_pos, dictionary_pos)

### LDA Negative Sentiment 

In [None]:
ldamodel_neg = models.ldamodel.LdaModel(corpus_neg, 
                                        id2word = dictionary_neg, 
                                        num_topics = 20, 
                                        passes = 5, 
                                        minimum_probability = 0.01)

In [None]:
pyLDAvis.gensim.prepare(ldamodel_neg, corpus_neg, dictionary_neg)