# Capstone Project: Comment Subtopics Analysis for Airbnb Hosts
---

How can a host on Airbnb understand what their strengths and weaknesses are? How can hosts identify demand trends among their guests from a large volume of comments? This project uses machine learning tools to help hosts understand the underlying trends in the comments on their property.

---

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Capstone-Project:-Comment-Subtopics-Analysis-for-Airbnb-Hosts" data-toc-modified-id="Capstone-Project:-Comment-Subtopics-Analysis-for-Airbnb-Hosts-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Capstone Project: Comment Subtopics Analysis for Airbnb Hosts</a></span></li><li><span><a href="#Data-Selection-&amp;-Cleaning" data-toc-modified-id="Data-Selection-&amp;-Cleaning-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Data Selection &amp; Cleaning</a></span><ul class="toc-item"><li><ul class="toc-item"><li><span><a href="#Data:" data-toc-modified-id="Data:-2.0.1"><span class="toc-item-num">2.0.1&nbsp;&nbsp;</span>Data:</a></span></li></ul></li><li><span><a href="#Import-The-Most-Recent-Listing-Data" data-toc-modified-id="Import-The-Most-Recent-Listing-Data-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Import The Most Recent Listing Data</a></span></li><li><span><a href="#Import-All-Reviews" data-toc-modified-id="Import-All-Reviews-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Import All Reviews</a></span></li><li><span><a href="#Use-NLTK-and-Google-Sentiment-Analysis-to-Perform-Sentiment-Analysis" data-toc-modified-id="Use-NLTK-and-Google-Sentiment-Analysis-to-Perform-Sentiment-Analysis-2.3"><span class="toc-item-num">2.3&nbsp;&nbsp;</span>Use NLTK and Google Sentiment Analysis to Perform Sentiment Analysis</a></span><ul class="toc-item"><li><span><a href="#NLTK-Sentiment-Analysis" data-toc-modified-id="NLTK-Sentiment-Analysis-2.3.1"><span class="toc-item-num">2.3.1&nbsp;&nbsp;</span>NLTK Sentiment Analysis</a></span></li><li><span><a href="#Google-Language-API" data-toc-modified-id="Google-Language-API-2.3.2"><span class="toc-item-num">2.3.2&nbsp;&nbsp;</span>Google Language API</a></span></li></ul></li></ul></li></ul></div>

---

# Data Selection & Cleaning

---
### Data: 
The data for this project comes from Inside Airbnb. [Inside Airbnb](http://insideairbnb.com/about.html) is an independent, non-commercial set of tools and data that allows you to explore how Airbnb is really being used in cities around the world. For this project, I use the San Francisco dataset published by Inside Airbnb, which covers listing information and comments.

All Libraries Used

In [1]:
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt
import time
from langdetect import detect #language detection 

now = time.time()

## Import The Most Recent Listing Data 

---

In [2]:
listing = pd.read_csv('data/listings/2019-03-06_data_listings.csv')

In [3]:
listing.shape

(7151, 106)

In [4]:
listing.head()

Unnamed: 0,id,listing_url,scrape_id,last_scraped,name,summary,space,description,experiences_offered,neighborhood_overview,...,instant_bookable,is_business_travel_ready,cancellation_policy,require_guest_profile_picture,require_guest_phone_verification,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month
0,958,https://www.airbnb.com/rooms/958,20190306152813,2019-03-06,"Bright, Modern Garden Unit - 1BR/1B",New update: the house next door is under const...,"Newly remodeled, modern, and bright garden uni...",New update: the house next door is under const...,none,*Quiet cul de sac in friendly neighborhood *St...,...,t,f,moderate,f,f,1,1,0,0,1.54
1,5858,https://www.airbnb.com/rooms/5858,20190306152813,2019-03-06,Creative Sanctuary,,We live in a large Victorian house on a quiet ...,We live in a large Victorian house on a quiet ...,none,I love how our neighborhood feels quiet but is...,...,f,f,strict_14_with_grace_period,f,f,1,1,0,0,0.93
2,7918,https://www.airbnb.com/rooms/7918,20190306152813,2019-03-06,A Friendly Room - UCSF/USF - San Francisco,Nice and good public transportation. 7 minute...,Room rental-sunny view room/sink/Wi Fi (inner ...,Nice and good public transportation. 7 minute...,none,"Shopping old town, restaurants, McDonald, Whol...",...,f,f,strict_14_with_grace_period,f,f,9,0,9,0,0.15
3,8142,https://www.airbnb.com/rooms/8142,20190306152813,2019-03-06,Friendly Room Apt. Style -UCSF/USF - San Franc...,Nice and good public transportation. 7 minute...,Room rental Sunny view Rm/Wi-Fi/TV/sink/large ...,Nice and good public transportation. 7 minute...,none,,...,f,f,strict_14_with_grace_period,f,f,9,0,9,0,0.15
4,8339,https://www.airbnb.com/rooms/8339,20190306152813,2019-03-06,Historic Alamo Square Victorian,Pls email before booking. Interior featured i...,Please send us a quick message before booking ...,Pls email before booking. Interior featured i...,none,,...,f,f,strict_14_with_grace_period,t,t,2,2,0,0,0.23


In [8]:
#Extract all review information for analysis purpose 
review_df = listing[['host_id', 'last_scraped', 'review_scores_accuracy', 'review_scores_checkin', 'review_scores_cleanliness', 
             'review_scores_communication', 'review_scores_location', 'review_scores_rating', 'review_scores_value']]

In [9]:
review_df.head()

Unnamed: 0,host_id,last_scraped,review_scores_accuracy,review_scores_checkin,review_scores_cleanliness,review_scores_communication,review_scores_location,review_scores_rating,review_scores_value
0,1169,2019-03-06,10.0,10.0,10.0,10.0,10.0,97.0,10.0
1,8904,2019-03-06,10.0,10.0,10.0,10.0,10.0,98.0,9.0
2,21994,2019-03-06,8.0,9.0,8.0,9.0,9.0,85.0,8.0
3,21994,2019-03-06,9.0,10.0,9.0,10.0,9.0,93.0,9.0
4,24215,2019-03-06,10.0,10.0,10.0,10.0,10.0,97.0,9.0


In [13]:
review_df = review_df.sort_values(by = ['host_id'])

In [14]:
review_df.reset_index(drop = True, inplace= True)

In [15]:
review_df.head()

Unnamed: 0,host_id,last_scraped,review_scores_accuracy,review_scores_checkin,review_scores_cleanliness,review_scores_communication,review_scores_location,review_scores_rating,review_scores_value
0,46,2019-03-06,10.0,10.0,10.0,10.0,10.0,98.0,9.0
1,470,2019-03-06,9.0,10.0,10.0,10.0,10.0,90.0,9.0
2,1169,2019-03-06,10.0,10.0,10.0,10.0,10.0,97.0,10.0
3,4921,2019-03-06,10.0,10.0,10.0,10.0,10.0,97.0,10.0
4,4921,2019-03-06,10.0,10.0,9.0,10.0,9.0,97.0,10.0


In [16]:
review_df.dtypes

host_id                          int64
last_scraped                    object
review_scores_accuracy         float64
review_scores_checkin          float64
review_scores_cleanliness      float64
review_scores_communication    float64
review_scores_location         float64
review_scores_rating           float64
review_scores_value            float64
dtype: object

In [17]:
review_df.isnull().sum()

host_id                           0
last_scraped                      0
review_scores_accuracy         1425
review_scores_checkin          1427
review_scores_cleanliness      1424
review_scores_communication    1423
review_scores_location         1427
review_scores_rating           1421
review_scores_value            1428
dtype: int64

In [18]:
#Since every NA means the listing has no reviews, I will fill all the NA's with 0 
review_df.fillna(0, inplace = True)
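
Filling missing scores with 0 keeps every listing in the table, but it also pulls down any average taken afterwards. A small illustration with made-up numbers (the values below are hypothetical, not taken from the dataset):

```python
from statistics import mean

# Hypothetical rating column: two reviewed listings and one with no reviews
ratings = [97.0, 85.0, None]

# Treat "no reviews" as 0, as in the cell above
zero_filled = [0.0 if r is None else r for r in ratings]

# Alternative: leave listings without reviews out of the average
reviewed_only = [r for r in ratings if r is not None]

print(mean(zero_filled))    # average including the unreviewed listing
print(mean(reviewed_only))  # average over reviewed listings only
```

Which of the two averages is appropriate depends on whether a listing without reviews should count as a low-scoring listing in later analysis.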

In [19]:
review_df.isnull().sum()

host_id                        0
last_scraped                   0
review_scores_accuracy         0
review_scores_checkin          0
review_scores_cleanliness      0
review_scores_communication    0
review_scores_location         0
review_scores_rating           0
review_scores_value            0
dtype: int64

In [116]:
#save the dataframe to csv 
review_df.to_csv('data/reviewsscore2019.csv')

## Import All Reviews 
---

In [82]:
reviews = pd.read_csv('data/reviews/2019-03-06_data_reviews.csv')

In [85]:
reviews.head()

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments
0,958,5977,2009-07-23,15695,Edmund C,"Our experience was, without a doubt, a five st..."
1,958,6660,2009-08-03,26145,Simon,Returning to San Francisco is a rejuvenating t...
2,958,11519,2009-09-27,25839,Denis,We were very pleased with the accommodations a...
3,958,16282,2009-11-05,33750,Anna,We highly recommend this accomodation and agre...
4,958,26008,2010-02-13,15416,Venetia,Holly's place was great. It was exactly what I...


In [83]:
reviews.shape

(311277, 6)

In [84]:
reviews.isnull().sum()

listing_id        0
id                0
date              0
reviewer_id       0
reviewer_name     1
comments         79
dtype: int64

In [61]:
#look at the rows with no comments 
reviews[reviews['comments'].isnull()].head()

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments
9989,63405,369208104,2019-01-09,64031749,Conor,
29324,377569,404349422,2019-01-23,233758492,Branden,
30591,423063,338134411,2018-10-18,25368593,Veronica,
45406,719431,233055588,2018-02-07,166605374,Otto,
65390,1206184,154689490,2017-05-25,131728793,Werner,


In [87]:
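#Review clean up function 
from langdetect.lang_detect_exception import LangDetectException

def review_cleanup(df, reviews):
    '''
    This function removes reviews with no content, reviews too short to classify,
    and reviews that are not detected as English.
    The output of the function is a clean dataframe with English-only comments. 
    Note: This function will take some time to run if there are a lot of comments 
    '''
    # Drop rows with a missing comment; copy so the new column is not set on a slice 
    df = df.dropna(subset=[reviews]).copy()
    
    def detect_language(comment):
        # Comments of 10 characters or fewer are too short to classify reliably 
        if len(comment) <= 10:
            return None
        try:
            return detect(comment)
        except LangDetectException:
            # Raised when the text has no usable features (e.g. only emoji or digits) 
            return None
    
    #add new variable to dataframe 
    df['language'] = [detect_language(comment) for comment in df[reviews]]
    
    #dataframe filter: keep English comments only 
    df = df[df['language'] == 'en']
    
    return df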
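
Because the full run is slow, it can help to check the filtering rule on a handful of strings first. The following is a minimal sketch, not the notebook's code: `filter_english` and `fake_detect` are hypothetical names, and `fake_detect` is a stand-in for `langdetect.detect` so the example runs without the library.

```python
def filter_english(comments, detect_fn, min_length=10):
    """Return the comments longer than min_length that detect_fn labels 'en'."""
    kept = []
    for comment in comments:
        # Skip missing values and comments too short to classify
        if not isinstance(comment, str) or len(comment) <= min_length:
            continue
        if detect_fn(comment) == 'en':
            kept.append(comment)
    return kept


def fake_detect(text):
    # Hypothetical stand-in for langdetect.detect: flags one French word, else English
    return 'fr' if 'bonjour' in text.lower() else 'en'


sample = ['Great place, would stay again!', 'Bonjour, tres bel appartement', 'Nice', None]
english_only = filter_english(sample, fake_detect)
print(english_only)
```

Passing the detector in as an argument keeps the rule itself testable; with the real library, `detect_fn` would be `langdetect.detect`.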

In [90]:
%%time
#run the function: 
english_reveiw = review_cleanup(df = reviews, reviews= 'comments')

CPU times: user 20min 47s, sys: 1min 11s, total: 21min 58s
Wall time: 22min 19s


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  from ipykernel import kernelapp as app


In [117]:
english_reveiw.to_csv('data/english_comments_only.csv')

## Use NLTK and Google Sentiment Analysis to Perform Sentiment Analysis 
---

### NLTK Sentiment Analysis

---

In [94]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer

In [96]:
sid = SentimentIntensityAnalyzer()
reviews_dic = []
for review in english_reveiw['comments']:
    ss = sid.polarity_scores(review)
    reviews_dic.append(ss)

In [113]:
pd.DataFrame(reviews_dic).head()

Unnamed: 0,compound,neg,neu,pos
0,0.959,0.0,0.788,0.212
1,0.9819,0.0,0.697,0.303
2,0.76,0.134,0.71,0.156
3,0.984,0.035,0.646,0.319
4,0.9617,0.0,0.613,0.387
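
The `compound` column is VADER's normalized overall score, ranging from -1 to 1. A minimal sketch of turning it into a label, assuming the commonly used cutoffs of ±0.05 (a convention from the VADER documentation, not something this notebook sets); `label_compound` is a hypothetical helper:

```python
def label_compound(compound, threshold=0.05):
    """Map a VADER compound score to 'positive', 'negative', or 'neutral'."""
    if compound >= threshold:
        return 'positive'
    if compound <= -threshold:
        return 'negative'
    return 'neutral'


# Compound scores from the first five rows of the output above
scores = [0.959, 0.9819, 0.76, 0.984, 0.9617]
labels = [label_compound(score) for score in scores]
print(labels)
```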


### Google Language API
---

In [183]:
from google.oauth2 import service_account
from google.cloud import language
from google.cloud.language import enums
from google.cloud.language import types

In [193]:
# Adapted from the Google Cloud Platform sample code 
def language_sentiment_analysis(text):
    '''
    This function takes in text and returns its sentiment score and magnitude. 
    '''
    #define credentials for google cloud 
    credentials = service_account.Credentials.from_service_account_file('/Users/evelyn/Documents/DSI/capstone_project/googleapikey.json')
    
    #instantiate the google client 
    client = language.LanguageServiceClient(credentials= credentials)
    
    #build the document and perform sentiment analysis 
    
    document = types.Document(content=text, type=enums.Document.Type.PLAIN_TEXT)
    sent_analysis = client.analyze_sentiment(document = document)
    sentiment = sent_analysis.document_sentiment
    
    return {'sent_score' : sentiment.score, 'sent_magnitude': sentiment.magnitude}

In [194]:
english_reveiw['comments'][1]

"Returning to San Francisco is a rejuvenating thrill but this time it was enhanced by our stay at Holly and David's beautifully renovated and perfectly located apartment. You do not need a car to enjoy the City as everything is within walking distance - great restaurants, bars and local stores. With such amenable hosts and a place to stay that enhances one's holiday, we will be returning again and again."

In [195]:
sent = language_sentiment_analysis(english_reveiw['comments'][1])

In [196]:
sent

{'sent_score': 0.5, 'sent_magnitude': 1.5}

In [None]:
%%time
#apply function to comments 
sent_analysis_ggl = english_reveiw['comments'].apply(language_sentiment_analysis)
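
Scoring every comment means one network request per row, so a single transient error would abort the whole `apply`. A minimal sketch of a retry wrapper with exponential backoff, under the assumption that failures surface as ordinary exceptions; `with_retries` and `flaky_score` are hypothetical names, and `flaky_score` only simulates a failing call rather than contacting Google:

```python
import time


def with_retries(func, max_attempts=3, base_delay=0.01):
    """Wrap func so failed calls are retried, doubling the wait each time."""
    def wrapper(*args, **kwargs):
        for attempt in range(max_attempts):
            try:
                return func(*args, **kwargs)
            except Exception:
                # Give up and re-raise after the last allowed attempt
                if attempt == max_attempts - 1:
                    raise
                time.sleep(base_delay * 2 ** attempt)
    return wrapper


calls = {'count': 0}


def flaky_score(text):
    # Simulated API call: fails twice, then returns a fixed result
    calls['count'] += 1
    if calls['count'] < 3:
        raise ConnectionError('simulated transient failure')
    return {'sent_score': 0.5, 'sent_magnitude': 1.5}


result = with_retries(flaky_score)('Lovely stay')
print(result, calls['count'])
```

In the notebook, the wrapper would go around the scoring function, for example `english_reveiw['comments'].apply(with_retries(language_sentiment_analysis))`, with a longer `base_delay` than the one used here.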

In [181]:
#Entity sentiment analysis 
from google.oauth2 import service_account
from google.cloud import language
from google.cloud.language import enums
from google.cloud.language import types
import six
import sys

def entity_sentiment_text(text):
    """Detects entity sentiment in the provided text."""
    credentials = service_account.Credentials.from_service_account_file('/Users/evelyn/Documents/DSI/capstone_project/googleapikey.json')
    client = language.LanguageServiceClient(credentials= credentials)

    if isinstance(text, six.binary_type):
        text = text.decode('utf-8')

    document = types.Document(
        content=text.encode('utf-8'),
        type=enums.Document.Type.PLAIN_TEXT)

    # Detect and send native Python encoding to receive correct word offsets.
    encoding = enums.EncodingType.UTF32
    if sys.maxunicode == 65535:
        encoding = enums.EncodingType.UTF16

    result = client.analyze_entity_sentiment(document, encoding)

    for entity in result.entities:
        print('Mentions: ')
        print(u'Name: "{}"'.format(entity.name))
        for mention in entity.mentions:
            print(u'  Begin Offset : {}'.format(mention.text.begin_offset))
            print(u'  Content : {}'.format(mention.text.content))
            print(u'  Magnitude : {}'.format(mention.sentiment.magnitude))
            print(u'  Sentiment : {}'.format(mention.sentiment.score))
            print(u'  Type : {}'.format(mention.type))
        print(u'Salience: {}'.format(entity.salience))
        print(u'Sentiment: {}\n'.format(entity.sentiment))