## Review Stuffing Services - Taking a Look

In this project, I will use some data from the AirBnB Seattle dataset for 2016, which contains various metrics for host listings and reviews. The dataset is part of the repository, but full rights remain with the original owners. 

### Section 1 : Business Understanding

The target is to check whether there is any monetary benefit to be gained for an AirBnB host from having:

1.) More Positive Reviews
2.) A Superhost Badge

The main metric we will use to judge this "benefit" would be the ability to command higher prices. 

I have worked in Python in the past, but that was quite a while ago. Hence, there are sections where I have tried alternative approaches, just to get refamiliarized with the language. Pandas was relatively new to me, so it was fun getting to learn it.

Questions I will ask :

1.) Does achieving the SuperHost status enable you to charge more for your properties?
2.) What about Neighbourhoods? Do they affect pricing ?
3.) What about review boosting services? Are they worth your time ? Do having higher ratings help you charge more ?

### Section 2 : Data Understanding

The primary data used was from a freely available Kaggle AirBnB dataset [here](https://www.kaggle.com/airbnb/seattle/data).

There were primarily three datasets available for the AirBnB data I was looking at. 

1.) A calendar of listings with their dates and prices in a file called Calendar.csv
2.) The listings themselves in a file called listings.csv, with all the pertinent listing details, including price. This included review scores.
3.) And finally, the actual reviews for the listings, in reviews.csv

The primary key used for all three was a Listing Id, and the datasets could be joined into one on that.

Since my primary goal did not extend to parsing the actual reviews or take into account date/month variability, I used only the listings.csv for my analysis. 

### Section 3 : Preparing Data

In [13]:
import pandas as pd
import numpy as np
import plotly.graph_objects as go
import plotly.express as px

#Load the entire csv file into a dataframe
df_loadcsv = pd.read_csv("Seattle Listings Airbnb.csv")

#for easier previews
pd.options.display.max_rows=100

#listing all columns we have for analysis
df_loadcsv.columns

Index(['id', 'listing_url', 'scrape_id', 'last_scraped', 'name', 'summary',
       'space', 'description', 'experiences_offered', 'neighborhood_overview',
       'notes', 'transit', 'thumbnail_url', 'medium_url', 'picture_url',
       'xl_picture_url', 'host_id', 'host_url', 'host_name', 'host_since',
       'host_location', 'host_about', 'host_response_time',
       'host_response_rate', 'host_acceptance_rate', 'host_is_superhost',
       'host_thumbnail_url', 'host_picture_url', 'host_neighbourhood',
       'host_listings_count', 'host_total_listings_count',
       'host_verifications', 'host_has_profile_pic', 'host_identity_verified',
       'street', 'neighbourhood', 'neighbourhood_cleansed',
       'neighbourhood_group_cleansed', 'city', 'state', 'zipcode', 'market',
       'smart_location', 'country_code', 'country', 'latitude', 'longitude',
       'is_location_exact', 'property_type', 'room_type', 'accommodates',
       'bathrooms', 'bedrooms', 'beds', 'bed_type', 'amenities', '

In [14]:
#Checking for classifiers
df_loadcsv['room_type'].unique()

array(['Entire home/apt', 'Private room', 'Shared room'], dtype=object)

In [15]:
#Checking for classifiers for Q1
df_loadcsv['host_is_superhost'].unique()

array(['f', 't', nan], dtype=object)

In [16]:
#Checking for classifiers for Q2
df_loadcsv['neighbourhood_cleansed'].unique()

array(['West Queen Anne', 'Adams', 'West Woodland', 'East Queen Anne',
       'Wallingford', 'North Queen Anne', 'Green Lake', 'Westlake',
       'Mann', 'Madrona', 'University District', 'Harrison/Denny-Blaine',
       'Minor', 'Leschi', 'Atlantic', 'Pike-Market', 'Eastlake',
       'South Lake Union', 'Lawton Park', 'Briarcliff', 'Belltown',
       'International District', 'Central Business District',
       'First Hill', 'Yesler Terrace', 'Pioneer Square', 'Gatewood',
       'Arbor Heights', 'Alki', 'North Admiral', 'Crown Hill',
       'Fairmount Park', 'Genesee', 'Interbay', 'Industrial District',
       'Mid-Beacon Hill', 'South Beacon Hill', 'Greenwood', 'Holly Park',
       'Fauntleroy', 'North Beacon Hill', 'Mount Baker', 'Brighton',
       'South Delridge', 'View Ridge', 'Dunlap', 'Rainier Beach',
       'Columbia City', 'Seward Park', 'North Delridge', 'Maple Leaf',
       'Ravenna', 'Riverview', 'Portage Bay', 'Bryant', 'Montlake',
       'Broadway', 'Loyal Heights', 'Vict

In [17]:
#checking for null values
df_loadcsv.isnull().sum()

id                                     0
listing_url                            0
scrape_id                              0
last_scraped                           0
name                                   0
summary                              177
space                                569
description                            0
experiences_offered                    0
neighborhood_overview               1032
notes                               1606
transit                              934
thumbnail_url                        320
medium_url                           320
picture_url                            0
xl_picture_url                       320
host_id                                0
host_url                               0
host_name                              2
host_since                             2
host_location                          8
host_about                           859
host_response_time                   523
host_response_rate                   523
host_acceptance_

In [18]:
#pulling aside interestiong columns, removing nulls for SuperHost as there are only two in the dataset
df_target = df_loadcsv.filter(['host_is_superhost', 'room_type','neighbourhood_cleansed','price']).dropna()
df_target.dtypes


host_is_superhost         object
room_type                 object
neighbourhood_cleansed    object
price                     object
dtype: object

In [19]:
# Cleaning data, as the price object is un USD, string 
df_target['price'] = df_target['price'].str.replace('$','')
df_target['price'] = df_target['price'].str.replace(',','')
df_target['price'] = df_target['price'].apply(pd.to_numeric)
df_target.dtypes

host_is_superhost          object
room_type                  object
neighbourhood_cleansed     object
price                     float64
dtype: object

In [20]:
# plotting function
def plotsuper(xset, yset):

    '''
    To plot a graph showing two lines - one with Superhost Status, one without.

    INPUT
    xset - A dataframe with room and price variables, for Superhosts
    yset - A dataframe with room and price variables,for Non - Superhosts

    '''
    
    
    x1 = xset['room_type']
    y1 = xset['price']
    y2 = yset['price']


    fig = go.Figure()

    fig.add_trace(go.Scatter(
        x= x1,
        y=y1,
        name="Superhost"
    ))

    fig.add_trace(go.Scatter(
        x= x1,
        y=y2,
        name="Not Superhost"
    ))


    fig.update_layout(
        title="",
        xaxis_title="Apartment Types",
        yaxis_title="Average Price",
        legend_title="Status",
    )


    fig.show()

### Section 3 : Preparing Data   

#### Question 1 : Is being a superhost Quantifyably Valuable ?

In [21]:
#Q1 selecting dataframes with Superhost and not superhost status
df_question1_s = df_target[df_target['host_is_superhost']=='t'].groupby(['room_type']).mean().reset_index()
df_question1_ns = df_target[df_target['host_is_superhost']=='f'].groupby(['room_type']).mean().reset_index()
df_question1_s  # checking data 

Unnamed: 0,room_type,price
0,Entire home/apt,157.001927
1,Private room,77.108871
2,Shared room,58.363636


#### Question 2 : What about Neighbourhoods? Lets take a look at Neighborhood clusters and prices

In [22]:
df_neighbour = df_target.groupby(['room_type', 'neighbourhood_cleansed']).mean().reset_index()
df_neighbour.head()

#renaming columns for easier graphing
df_neighbour.columns = ['Room Type', 'Neighbourhood', 'Avg Prices']
df_neighbour.head()

Unnamed: 0,Room Type,Neighbourhood,Avg Prices
0,Entire home/apt,Adams,146.755102
1,Entire home/apt,Alki,182.212121
2,Entire home/apt,Arbor Heights,175.0
3,Entire home/apt,Atlantic,133.4
4,Entire home/apt,Belltown,168.893519


#### Q 1 + 2 : Lets look at Superhost "value" for a specific neighbourhood? Lets check with One : West Queen Anne Neighbourhood

In [23]:
df_wqa_s = df_target[(df_target['host_is_superhost']=='t')&(df_target['neighbourhood_cleansed'] =='West Queen Anne')].groupby(['room_type']).mean().reset_index()
df_wqa_ns = df_target[(df_target['host_is_superhost']=='f')&(df_target['neighbourhood_cleansed'] =='West Queen Anne')].groupby(['room_type']).mean().reset_index()
df_wqa_s  # checking data 
#almost immediately we see that there are less options

Unnamed: 0,room_type,price
0,Entire home/apt,136.666667
1,Private room,68.8


#### Question 3 : What about reviews ? Do ratings enable someone to charge a higher price ? 

In [24]:
df_target2 = df_loadcsv.filter(['room_type','neighbourhood_cleansed','price', 'review_scores_rating'])
df_target2.head()

Unnamed: 0,room_type,neighbourhood_cleansed,price,review_scores_rating
0,Entire home/apt,West Queen Anne,$85.00,95.0
1,Entire home/apt,West Queen Anne,$150.00,96.0
2,Entire home/apt,West Queen Anne,$975.00,97.0
3,Entire home/apt,West Queen Anne,$100.00,
4,Entire home/apt,West Queen Anne,$450.00,92.0


In [25]:
#count the nulls
df_target2.isnull().sum()

room_type                   0
neighbourhood_cleansed      0
price                       0
review_scores_rating      647
dtype: int64

In [26]:
# drop the nulls as imputing them would not be logical, lacking any real basis to do so. We also have enough entries of actual data, 
df_target2.dropna()

Unnamed: 0,room_type,neighbourhood_cleansed,price,review_scores_rating
0,Entire home/apt,West Queen Anne,$85.00,95.0
1,Entire home/apt,West Queen Anne,$150.00,96.0
2,Entire home/apt,West Queen Anne,$975.00,97.0
4,Entire home/apt,West Queen Anne,$450.00,92.0
5,Private room,West Queen Anne,$120.00,95.0
...,...,...,...,...
3810,Entire home/apt,Fremont,$154.00,92.0
3811,Entire home/apt,Fremont,$65.00,100.0
3812,Entire home/apt,Fremont,$95.00,96.0
3813,Entire home/apt,Fremont,$359.00,80.0


In [27]:
#we clean price again. ToDo: do for main data while refactoring

df_target2['price'] = df_target2['price'].str.replace('$','')
df_target2['price'] = df_target2['price'].str.replace(',','')
df_target2['price'] = df_target2['price'].apply(pd.to_numeric)
df_target2.dtypes

room_type                  object
neighbourhood_cleansed     object
price                     float64
review_scores_rating      float64
dtype: object

In [28]:
#rename columns
df_target2.columns = ['Room Type', 'Neighbourhood', 'Price', 'Review Score']

### Section 4 : Evaluating the Results

#### Question 1 : Is being a superhost Quantifyably Valuable ?

In [29]:
# plotting the graph

plotsuper(df_question1_s, df_question1_ns)

It seems that we do have a small, positive correlation between being a superhost and commanding higher prices. That assumes, though, that all other things remain equal

#### Result : Yes, there is some price benefit to being a Superhost, assuming all other things equal

***

#### Question 2 : What about Neighbourhoods? Lets take a look at Neighborhood clusters and prices

In [30]:
# Q2 plot it - trying out plotly express instead of using an iterative for loop to add traces

px.scatter(df_neighbour, x = 'Room Type', y = 'Avg Prices', color='Neighbourhood')

What a range of prices ! The first question assumed that neighbourhoods were not a pricing factor, but this data claims otherwise. This can easily overwhelm any signals of the first question !

#### Result : Yes, prices vary greatly as per neighbourhood. 

***

#### Q 1 + 2 : Lets look at Superhost "value" for a specific neighbourhood? Lets check with One : West Queen Anne Neighbourhood

In [31]:
# Lets do a plot

plotsuper(df_wqa_s, df_wqa_ns)


It seems our results from the first question are now "questionable". Its clear that we cannot generalize that superhosts can command better pricing, as it is clearly not the case in the very first neighbourhood we tested. 

#### Result : We cannot say for certainity that Superhost status will allow for higher earning, as the slight positove signal in Q1 is likely due to other factors. Thus, Superhosts are not inherently valuable as per our metrics. 

***

#### Question 3: What about reviews ? Do ratings enable someone to charge a higher price? 

In [32]:
#plot it via px

px.scatter(df_target2, x='Review Score', y='Price', color='Room Type')

There seems to be a small positive trend, but let us see if we can clean it up a bit with some outlier analysis

In [37]:
# Outliers may be affecting the result. Lets evaluate

# using the tukey method

def outlier_trim(df_large):
    
    '''
    Function for filtering out outliers using the Tukey Metric
    
    INPUT
    df_large : Dataframe with outliers
    
    OUTPUT
    df_large : Dataframe with outliers trimmed
    
    '''
    rat_lw = df_large['Review Score'].quantile(0.25)
    rat_hi = df_large['Review Score'].quantile(0.75)
    
    IQR = rat_hi - rat_lw
    rat_max_value = rat_hi + 1.5 * IQR
    rat_min_value = rat_lw - 1.5 * IQR

    #filter dataset 

    df_large = df_large[(df_large['Review Score'] < rat_max_value) | (df_large['Review Score']> rat_min_value )]
    return df_large



In [38]:
#trim the ratings outliers

df_target2r = outlier_trim(df_target2)
df_target2r.shape


(3171, 4)

In [39]:
# repeat plot
px.scatter(df_target2r, x='Review Score', y='Price', color='Room Type')

The Tukey method did not catch many outliers. But the trend is clear -  More positive reviews meant better pricing. Be aware that this still assumes parity between amenities for the room types, a generalization we had to make.

#### Result : Yes, the reviews do make a difference, with better reviews allowing for better prices to be realized.

___

To be continued with more analysis and more metrics, hopefully ... 