# Airbnb Reviews: Categorize Reviews and Sample Results

In [4]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns; sns.set()

from gensim.models.ldamulticore import LdaMulticore

### Notebook Overview

In this notebook I will be using the topic groupings selected in the "Model and Topic Selection" Notebook to tag AirBnb Reviews.  There were two models that seemed to have fairly good groupings which were the "Full Review" model and the "Non-Name Entities Plus Adjectvies" model.  The "Full Review" topics seemed to have a broader variety of topics with less groupings per topic while the "Non-Name Entities Plus Adjectives" groupings were more in-depth on fewer topics.  To select the best model I hope to try sampling reviews and seeing which topics seem to correspond best with actual reviews.

### Outline
<strong>Step 1:</strong> Tag all reviews with ALL groupings (Topic Groups) made by each model.<br>
<strong>Step 2:</strong> Group the Tags selected in the previous notebook to get Topics for each Review.<br>
<strong>Step 3:</strong> Sample a few reviews with the topic categorizations to compare the models.

### Results

After Taggings, Grouping, and Sampling the topic categorizations for the Airbnb Reviews, I found that the "Full Review" model had the categorizations with the best results and seemed to represent the reviews fairly well.  In the next notebook I will use the topic categorizations to conduct analysis on Airbnb reviews hopefully finding some interesting insights on how guests review different listings.

## Load Data

In [13]:
# Select City
country = 'united-states'
city = 'san-francisco'

# Directory
directory = '../data/' + country + '/' + city + '/'

# Load Data
reviews_df = pd.read_csv(directory + 'interim/review_wrangled.csv', sep=';', lineterminator='\n').drop(columns=['Unnamed: 0'])

## Clean Data

In [14]:
"""Pull Important Columns"""
# Columns
columns = ['listing_id','id','date','comments','tokens','tokens_count']

# Filter
reviews_df = reviews_df[columns]

# No Null Tokens
reviews_df = reviews_df[reviews_df.tokens_count > 0].reset_index(drop=True)

# Remove Automated Posts
reviews_df = reviews_df[~reviews_df['comments'].str.contains('This is an automated posting.')]

In [15]:
reviews_df.head()

Unnamed: 0,listing_id,id,date,comments,tokens,tokens_count
0,958,5977,2009-07-23,"Our experience was, without a doubt, a five st...","['experience', 'without', 'doubt', 'five', 'st...",47
1,958,6660,2009-08-03,Returning to San Francisco is a rejuvenating t...,"['returning', 'san', 'francisco', 'rejuvenatin...",36
2,958,11519,2009-09-27,We were very pleased with the accommodations a...,"['pleased', u'accommodation', 'friendly', 'nei...",67
3,958,16282,2009-11-05,We highly recommend this accomodation and agre...,"['highly', 'recommend', 'accomodation', 'agree...",43
4,958,26008,2010-02-13,Holly's place was great. It was exactly what I...,"['holly', 'place', 'great', 'exactly', 'needed...",23


# 1. Tag Reviews

In [18]:
def get_review_tags(data, model):
    """Get Review Tags"""
    # Blank Topic List
    full_tokens = pd.DataFrame()
    
    # Iterate Through Reviews
    for index in data.index:
        
        # Get Review Info (Tokens & ID)
        tokens = data['tokens'][index]
        index_num = data['id'][index]
        
        try:
            # Iterate Tokens
            topic_list = []
            for word in tokens:
                # Get Topic of Token
                topic_list = topic_list + model.get_term_topics(str(word), minimum_probability=min_prob)
                
            # Count Topics
            topic_counts = pd.DataFrame(topic_list).groupby(0).count().reset_index()
            topic_counts['id'] = index_num
        except:
            topic_counts = pd.DataFrame()
            topic_counts['id'] = index_num
        
        # Concat Topic Counts to Full Topic List
        full_tokens = pd.concat([full_tokens, topic_counts], sort=False)
        
    return full_tokens
    
    # Pivot To Show Topics by Review
    #full_tokens_df = pd.pivot(index='id', columns=0, values=1, data=full_tokens).fillna(0)
    
    # Return Review Topics
    #return full_tokens_df

# a) Full Reviews Topics

In [19]:
# Load LDA
ldamodel = LdaMulticore.load('../models/ldam_reviews_50topics_10words_50passes_full.model')

# Get Review Tags
review_topic_tags = get_review_tags(reviews_df, ldamodel)

In [26]:
review_topic_tags.shape

(305589, 51)

In [27]:
review_topic_tags.head()

Unnamed: 0,id,0.0,1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,...,40.0,41.0,42.0,43.0,44.0,45.0,46.0,47.0,48.0,49.0
0,1981,0.0,0.0,1.0,1.0,2.0,0.0,0.0,5.0,5.0,...,0.0,6.0,0.0,1.0,2.0,0.0,0.0,5.0,0.0,4.0
1,2993,0.0,0.0,3.0,1.0,4.0,0.0,0.0,7.0,2.0,...,2.0,5.0,1.0,0.0,2.0,0.0,0.0,5.0,0.0,4.0
2,3905,0.0,0.0,1.0,1.0,1.0,0.0,0.0,4.0,2.0,...,0.0,2.0,1.0,1.0,1.0,0.0,0.0,3.0,0.0,2.0
3,5566,0.0,0.0,1.0,2.0,0.0,0.0,1.0,2.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0
4,5977,0.0,0.0,1.0,1.0,4.0,0.0,0.0,2.0,0.0,...,1.0,4.0,5.0,1.0,3.0,0.0,0.0,2.0,0.0,1.0


In [327]:
review_topic_tags.reset_index().to_csv(directory + 'interim/review_tags_count_full.csv')

# b) No-Name Entities Plus Adjective Topics

In [259]:
# Load LDA
ldamodel_no_ner = LdaMulticore.load('../models/ldam_reviews_50topics_10words_50passes_no_ner_plus_adj.model')

# Get All Review Tags for each Review
review_topic_tags_no_ner = get_review_tags(reviews_df, ldamodel_no_ner)

In [297]:
review_topic_tags_no_ner.shape

(292053, 50)

In [288]:
review_topic_tags_no_ner.head()

Unnamed: 0_level_0,0.0,1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,9.0,...,40.0,41.0,42.0,43.0,44.0,45.0,46.0,47.0,48.0,49.0
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1981,1.0,0.0,0.0,2.0,3.0,3.0,0.0,1.0,0.0,0.0,...,1.0,1.0,1.0,4.0,1.0,0.0,2.0,0.0,4.0,0.0
2993,2.0,2.0,0.0,2.0,3.0,4.0,2.0,0.0,0.0,0.0,...,0.0,1.0,0.0,4.0,5.0,0.0,0.0,0.0,2.0,0.0
3905,1.0,0.0,0.0,1.0,0.0,2.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,2.0,1.0,0.0,0.0,1.0,2.0,0.0
5566,1.0,1.0,0.0,2.0,0.0,3.0,0.0,1.0,0.0,0.0,...,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,4.0,0.0
6502,1.0,0.0,0.0,2.0,3.0,7.0,1.0,0.0,0.0,0.0,...,1.0,1.0,0.0,7.0,2.0,0.0,0.0,0.0,2.0,0.0


In [316]:
review_topic_tags_no_ner.columns

Float64Index([ 0.0,  1.0,  2.0,  3.0,  4.0,  5.0,  6.0,  7.0,  8.0,  9.0, 10.0,
              11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0, 21.0,
              22.0, 23.0, 24.0, 25.0, 26.0, 27.0, 28.0, 29.0, 30.0, 31.0, 32.0,
              33.0, 34.0, 35.0, 36.0, 37.0, 38.0, 39.0, 40.0, 41.0, 42.0, 43.0,
              44.0, 45.0, 46.0, 47.0, 48.0, 49.0],
             dtype='float64', name=0)

In [262]:
review_topic_tags_no_ner.reset_index().to_csv(directory + 'interim/review_tags_no_ne_plus_adj.csv')

# 2. Group Topics and Sample Results

In [349]:
def group_topics(tags, topics, data):
    data = data.copy()
    """Take all topic values and group them into selected topic groups"""
    for topic, id_list in topics.items():
        # Create Topic Column
        data.loc[:, topic] = 0

        for id_num in id_list:
            # Add Values to Topic
            data.loc[:, topic] += tags[str(float(id_num))]
    return data

In [311]:
def print_review(data, topic_list):
    """Print Reviews with Corresponding Topics"""
    for index in data.index:
        review = data.loc[index]
        
        print('Review ID: ' + str(review['id']))
        print('Review:')
        print(str(review['comments']))
        print('\nTopics')
        for topic in review[topic_list].sort_values(ascending=False).index:
            if review[topic] > 0:
                print(topic + ': ' + str(review[topic]))        
        print('\n\n')

# a) Full Review Tags

In [304]:
full_review_topics = {'Host': [7,8,38],
                      'Noise': [2],
                      'Location': [22,34,37,44],
                      'House': [10,14,39],
                      'Checkin & Communication': [17,23,27],
                      'Neighborhood': [4],
                      'Parking': [35],
                      'Value':[40],
                      'Come Again': [41,49],
                      'Cleanliness':[47],
                      'Accuracy': [33]}

In [353]:
# Load CSV
review_topic_tags = pd.read_csv(directory + 'interim/review_tags_full.csv').drop(columns=['Unnamed: 0'])

In [367]:
# Group Topics
review_topic_tags_full = group_topics(review_topic_tags, full_review_topics, reviews_df)

In [368]:
review_topic_tags_full.head(3)

Unnamed: 0,listing_id,id,date,comments,tokens_count,Checkin & Communication,Cleanliness,Value,Location,Accuracy,Noise,Neighborhood,House,Host,Come Again,Parking
0,958,5977,2009-07-23,"Our experience was, without a doubt, a five st...",47,3.0,5.0,0.0,5.0,2.0,1.0,2.0,3.0,14.0,10.0,1.0
1,958,6660,2009-08-03,Returning to San Francisco is a rejuvenating t...,36,0.0,5.0,2.0,10.0,3.0,3.0,4.0,2.0,14.0,9.0,1.0
2,958,11519,2009-09-27,We were very pleased with the accommodations a...,67,3.0,3.0,0.0,5.0,1.0,1.0,1.0,0.0,10.0,4.0,0.0


## Export Review Topics

In [372]:
# Export Review Topics
review_topic_tags_full.to_csv(directory + 'processed/review_topics_final.csv')

# Sample (Best Model)

In [336]:
TOPIC_COLS = ['Checkin & Communication', 'Cleanliness', 'Value', 'Location',\
              'Accuracy', 'Noise', 'Neighborhood', 'House', 'Host', 'Come Again',\
              'Parking']

In [337]:
print_review(review_topic_tags_full.head(3), TOPIC_COLS)

Review ID: 5977
Review:
Our experience was, without a doubt, a five star experience. Holly and her husband, David, were the consummate hosts; friendly and accomodating while still honoring our privacy. The apartment was a charming layout with a full view and access to the home's garden The location is perfect for full engagement with the city; close to mass transit with walking proximity to the Haight, the Mission, the Castro and Golden Gate Park. I can't wait for our next visit.  Ted and Karen Wingerd

Topics
Host: 14.0
Come Again: 10.0
Location: 5.0
Cleanliness: 5.0
House: 3.0
Checkin & Communication: 3.0
Neighborhood: 2.0
Accuracy: 2.0
Parking: 1.0
Noise: 1.0



Review ID: 6660
Review:
Returning to San Francisco is a rejuvenating thrill but this time it was enhanced by our stay at Holly and David's beautifully renovated and perfectly located apartment. You do not need a car to enjoy the City as everything is within walking distance - great restaurants, bars and local stores. With 

# b) Non-Name Entities Plus Adj Tags

In [344]:
# Selected Topics
no_ne_plus_adj_topics = {'Accuracy': [26,49],
                         'Cleanliness': [18,43],
                         'Checkin': [11,33,34],
                         'Communication': [8,19,32,37],
                         'Location': [20,21,30,35,36],
                         'Transport': [6,41,44],
                         'Value': [22,27,28,46]}

In [340]:
# Load Tags
review_tags_no_ne_adj = pd.read_csv(directory + 'interim/review_tags_no_ne_plus_adj.csv').drop(columns=['Unnamed: 0'])

In [370]:
# Group Topics
review_topic_tags_no_ner_adj = group_topics(review_tags_no_ne_adj, no_ne_plus_adj_topics, reviews_df)

In [371]:
review_topic_tags_no_ner_adj.head(3)

Unnamed: 0,listing_id,id,date,comments,tokens_count,Cleanliness,Communication,Checkin,Value,Location,Transport,Accuracy
0,958,5977,2009-07-23,"Our experience was, without a doubt, a five st...",47,5.0,4.0,4.0,16.0,11.0,2.0,1.0
1,958,6660,2009-08-03,Returning to San Francisco is a rejuvenating t...,36,7.0,4.0,3.0,12.0,13.0,8.0,1.0
2,958,11519,2009-09-27,We were very pleased with the accommodations a...,67,3.0,4.0,2.0,7.0,8.0,1.0,1.0


## Sample

In [360]:
NO_NE_TOPICS = ['Cleanliness', 'Communication', 'Checkin', 'Value', 'Location','Transport']

In [361]:
print_review(review_topic_tags_no_ner_adj.head(), NO_NE_TOPICS)

Review ID: 5977
Review:
Our experience was, without a doubt, a five star experience. Holly and her husband, David, were the consummate hosts; friendly and accomodating while still honoring our privacy. The apartment was a charming layout with a full view and access to the home's garden The location is perfect for full engagement with the city; close to mass transit with walking proximity to the Haight, the Mission, the Castro and Golden Gate Park. I can't wait for our next visit.  Ted and Karen Wingerd

Topics
Value: 16.0
Location: 11.0
Cleanliness: 5.0
Checkin: 4.0
Communication: 4.0
Transport: 2.0



Review ID: 6660
Review:
Returning to San Francisco is a rejuvenating thrill but this time it was enhanced by our stay at Holly and David's beautifully renovated and perfectly located apartment. You do not need a car to enjoy the City as everything is within walking distance - great restaurants, bars and local stores. With such amenable hosts and a place to stay that enhances one's holi