#Project: Write a Data Science Blog Post

#Section 1: Business Understanding

This contains the code for homework 1 in the Udacity Data Scientist Nanodegree. The motivation behind this is to pass the course. As such, I decided to look at the provided data sets of Airbnb listing data in Boston and Seattle to answer the following questions:

**Question 1**: What is the correlation between the following values: 'price', 'cleaning_fee', 'review_scores_location', 'review_scores_value', 'accommodates', 'bathrooms', 'bedrooms'?

**Question 2**: Which of the following variables has the strongest correlation with the listing price: 'room_type', 'accommodates', 'bathrooms', 'bedrooms', 'host_is_superhost', 'beds', 'bed_type', 'cleaning_fee', 'cancellation_policy', 'review_scores_value', 'review_scores_location', 'host_identity_verified'?

**Question 3**: Compare the results of questions 1 and 2 between Boston and Seattle to look for interesting similarities/differences?

#Section 2: Data Understanding

Import the needed packages for this code

In [None]:
# read in needed packages
# import warnings
# warnings.simplefilter(action='ignore')

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error
import AllTogether as t
import seaborn as sns
%matplotlib inline

Load the datasets and take a look at their shape to verify.

In [None]:
df_boston = pd.read_csv('./boston/listings.csv')
df_seattle = pd.read_csv('./seattle/listings.csv')

print(df_boston.shape)  # print explicitly: a cell only displays its last expression
print(df_seattle.shape)

#Section 3: Data Preparation

The following function extracts the columns to be used for this project. Furthermore, it cleans up certain columns so that they can be converted to a numeric or boolean data type.

In [None]:
# function to extract the columns used in this project and convert their data types
def reduce_and_convert(df):

    # extract the columns by column name (copy so the original frame is not modified)
    df = df[['room_type', 'accommodates', 'bathrooms', 'bedrooms', 'host_is_superhost',
             'beds', 'bed_type', 'price', 'cleaning_fee', 'cancellation_policy',
             'review_scores_value', 'review_scores_location', 'host_identity_verified']].copy()

    # strip currency formatting to allow for data type conversion
    # (regex=False because '$' is a regex metacharacter)
    for col in ['price', 'cleaning_fee']:
        df[col] = df[col].str.replace('$', '', regex=False).str.replace(',', '', regex=False)

    # convert the 't'/'f' flags to booleans; casting the raw strings with astype('bool')
    # would turn both 't' and 'f' into True. Missing values become False here.
    for col in ['host_is_superhost', 'host_identity_verified']:
        df[col] = df[col] == 't'

    # convert the cleaned price columns so they are not categorical anymore
    df = df.astype({'price': 'float', 'cleaning_fee': 'float'})

    return df

df_boston = reduce_and_convert(df_boston)
df_seattle = reduce_and_convert(df_seattle)
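
As a quick sanity check of the string clean-up, here is a minimal sketch on a few made-up values in the same format as the listings files. The sample values are assumptions for illustration only; the point is that `'$'` has to be treated as a literal character and that the `'t'`/`'f'` flags need an explicit comparison rather than a plain cast to `bool`.

```python
import pandas as pd

# made-up sample values in the same string format as the listings files
raw = pd.DataFrame({
    'price': ['$1,250.00', '$85.00', '$140.00'],
    'host_is_superhost': ['t', 'f', None],
})

# regex=False treats '$' as a literal character instead of a regex anchor
price = (raw['price']
         .str.replace('$', '', regex=False)
         .str.replace(',', '', regex=False)
         .astype(float))

# comparing against 't' yields a proper boolean column; missing values become False
superhost = raw['host_is_superhost'] == 't'

# any non-empty string is truthy in Python, so a plain bool cast cannot tell 't' from 'f'
both_truthy = bool('t') and bool('f')

print(price.tolist())
print(superhost.tolist())
print(both_truthy)
```

Treating a missing flag as `False` is a design choice; dropping those rows instead would be equally defensible.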

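Further data preparation is needed, so the following function drops rows with NaN values and converts categorical columns into dummy variables that can be used by our model. The one exception is the cleaning fee, where an empty entry is assumed to mean no fee and is filled with 0. As an alternative to dropping rows, we could fill NaN values with, for example, the column mean; however, for this project we decided against this.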

In [None]:
# convert categorical columns and drop rows containing NaN
def drop_nan_and_add_dummies(df):

    # work on a copy so the frame passed in is not modified
    df = df.copy()

    # empty cleaning fee assumed to mean no cleaning fee
    df['cleaning_fee'] = df['cleaning_fee'].fillna(0.0)

    # drop any rows with NaN entries
    df = df.dropna(axis=0)

    # get column names of categorical columns
    cat_vars = df.select_dtypes(include=['object']).columns

    # convert categorical columns column by column
    for var in cat_vars:
        df = pd.concat([df.drop(var, axis=1), pd.get_dummies(df[var], prefix=var, prefix_sep='_', drop_first=True)],
                       axis=1)

    return df

df_boston = drop_nan_and_add_dummies(df_boston)
df_seattle = drop_nan_and_add_dummies(df_seattle)
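
To illustrate what the dummy conversion does to a single categorical column, here is a small sketch on made-up rows (the prices are invented). With `drop_first=True` the alphabetically first category gets no column of its own and acts as the baseline, which is why only two `room_type_...` columns remain.

```python
import pandas as pd

# made-up sample with one categorical column
toy = pd.DataFrame({
    'room_type': ['Entire home/apt', 'Private room', 'Shared room', 'Private room'],
    'price': [200.0, 80.0, 40.0, 95.0],
})

# one indicator column per category, minus the first (baseline) category
dummies = pd.get_dummies(toy['room_type'], prefix='room_type', prefix_sep='_', drop_first=True)
encoded = pd.concat([toy.drop('room_type', axis=1), dummies], axis=1)

print(encoded.columns.tolist())
```

A row with zeros in both remaining indicator columns therefore belongs to the baseline category, and the coefficients of the other categories are measured relative to it.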

#Section 4: Data Modeling 

Here we fit a linear regression model using Udacity-provided code.

In [None]:
def train_model(df):
    # split data into X and y
    X = df.drop(['price'], axis=1)
    y = df['price']
    
    # use udacity data scientist provided code to train the model and search for optimal model
    cutoffs = [5000, 3500, 2500, 1000, 100, 50]
    r2_scores_test, r2_scores_train, lm_model, X_train, X_test, y_train, y_test = t.find_optimal_lm_mod(X, y, cutoffs)

    return r2_scores_test, r2_scores_train, lm_model, X_train, X_test, y_train, y_test

r2_scores_test_boston, r2_scores_train_boston, lm_model_boston, X_train_boston, X_test_boston, y_train_boston, y_test_boston = train_model(df_boston)
r2_scores_test_seattle, r2_scores_train_seattle, lm_model_seattle, X_train_seattle, X_test_seattle, y_train_seattle, y_test_seattle = train_model(df_seattle)
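
The internals of the Udacity helper are not shown here. As a rough stand-in for the general idea (split the data, fit a linear model, score it on both parts), the following sketch uses scikit-learn directly on synthetic data. The data, the 30% test share and the random seed are assumptions for illustration; this is not the helper's actual implementation.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# synthetic stand-in data: price depends linearly on two features plus noise
rng = np.random.default_rng(42)
X = pd.DataFrame({
    'bedrooms': rng.integers(1, 5, size=200),
    'bathrooms': rng.integers(1, 3, size=200),
})
y = 50 + 40 * X['bedrooms'] + 30 * X['bathrooms'] + rng.normal(0, 10, size=200)

# hold out 30% of the rows for testing (the ratio is an assumption)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# fit on the training part only
lm_model = LinearRegression()
lm_model.fit(X_train, y_train)

# compare the fit on seen and unseen data
r2_train = r2_score(y_train, lm_model.predict(X_train))
r2_test = r2_score(y_test, lm_model.predict(X_test))
print(f'train r2: {r2_train:.3f}, test r2: {r2_test:.3f}')
```

A test score that is much lower than the training score would hint at overfitting.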

#Section 5: Evaluation

**Question 1**: What is the correlation between the following values: 'price', 'cleaning_fee', 'review_scores_location', 'review_scores_value', 'accommodates', 'bathrooms', 'bedrooms'?

This function allows us to plot a heatmap visualizing the correlation between a list of columns. The numbers in the resulting heatmap are correlation coefficients with a theoretical range of -1 to 1 and indicate how two variables correlate. A value of 1 indicates a perfect positive linear relationship, meaning that variable y increases along a straight line as variable x increases; a value of -1 indicates a perfect negative linear relationship. A value close to 0 indicates little to no linear correlation.

In [None]:
# show heatmap of correlation between a subset of columns 
def plot_hist_and_heatmap(df, col_list):
    # select subset of data
    df = df[col_list]
    # create heatmap
    sns.heatmap(df.corr(), annot=True, fmt=".2f")
    
# use the columns from question 1, all of which were kept by reduce_and_convert
plot_hist_and_heatmap(df_boston, ['price', 'cleaning_fee', 'review_scores_location', 'review_scores_value', 'accommodates', 'bathrooms', 'bedrooms'])

![Heatmap showing the correlation between different variables for the Boston dataset](./images/Figure_1.png)
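
To make the scale of the correlation coefficient concrete, here is a small sketch on made-up data. A value of 1 or -1 corresponds to a perfectly straight-line relationship of any slope and offset, while a clearly structured but non-linear relationship can still yield a value of 0. The column names are purely illustrative.

```python
import numpy as np
import pandas as pd

x = np.arange(1, 11, dtype=float)
toy = pd.DataFrame({
    'x': x,
    'linear_up': 3 * x + 20,      # increasing straight line, not proportional to x
    'linear_down': -2 * x + 5,    # decreasing straight line
    'quadratic': (x - 5.5) ** 2,  # symmetric curve without a linear trend
})

# Pearson correlation of every column with x
corr_with_x = toy.corr()['x']
print(corr_with_x.round(2))
```

This is the same `DataFrame.corr()` call that feeds the heatmap, so the numbers in the plot can be read in the same way.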

There seems to be a clear positive correlation between a listing's price and its cleaning fee. Furthermore, the number of people a listing accommodates as well as the number of bathrooms and bedrooms show a positive correlation with the price as well as with each other. This is not a great surprise, as you would expect listings that accommodate more people to be more expensive and to require more bathrooms and bedrooms. The review score of a listing shows a positive correlation with its location score. This again is not too surprising, since the location score is a major factor of the overall listing score. However, there seems to be no clear correlation between the location or overall review score and any of the other values, in particular not the price of the listing. This is somewhat surprising, as you could have imagined more expensive listings to receive better review scores or at the very least better location scores.

**Question 2**: Which of the following variables has the strongest correlation with the listing price: 'room_type', 'accommodates', 'bathrooms', 'bedrooms', 'host_is_superhost', 'beds', 'bed_type', 'cleaning_fee', 'cancellation_policy', 'review_scores_value', 'review_scores_location', 'host_identity_verified'?

This function lists the top x coefficients of a given linear model. Since the data for our model was normalized (as part of the Udacity-provided find_optimal_lm_mod function used earlier), the coefficients with the largest absolute values point to the input variables with the strongest relationship to the predicted variable (in our case the listing price). The sign of the coefs column indicates a positive or negative relationship.

In [None]:
# extract coefs from lm model and show top results
def print_coeffs(lm_model, X_train, top_x_results=10):

    # look at model coefficients
    coefs_df = pd.DataFrame()
    coefs_df['est_int'] = X_train.columns
    coefs_df['coefs'] = lm_model.coef_
    coefs_df['abs_coefs'] = np.abs(lm_model.coef_)
    coefs_df = coefs_df.sort_values('abs_coefs', ascending=False)

    # print explicitly; a bare expression inside a function displays nothing
    print(coefs_df.head(top_x_results))

print_coeffs(lm_model_boston, X_train_boston)

**BOSTON**

|    | est_int                                 | coefs      | abs_coefs |
|----|-----------------------------------------|------------|-----------|
| 9  | **room_type_Private room**              | -80.361940 | 80.361940 |
| 10 | **room_type_Shared room**               | -73.584597 | 73.584597 |
| 14 | **cancellation_policy_super_strict_30** | 72.254277  | 72.254277 |
| 2  | **bedrooms**                            | 37.644407  | 37.644407 |
| 1  | **bathrooms**                           | 35.593966  | 35.593966 |
| 11 | **bed_type_Real Bed**                   | 18.405890  | 18.405890 |
| 7  | **review_scores_location**              | 18.043700  | 18.043700 |
| 12 | **cancellation_policy_moderate**        | 13.164053  | 13.164053 |
| 6  | **review_scores_value**                 | -6.267840  | 6.267840  |
| 4  | **beds**                                | 5.188895   | 5.188895  |

A strong negative correlation can be seen with the room type, where anything other than the whole house/apt option pulls the price down significantly. Somewhat surprisingly, a super strict cancellation policy is the first non-room-type variable on the list. Extremely high-priced listings in Boston seem to be very strict with their cancellation policy, possibly to avoid hoax bookings or an unreasonably high cancellation rate for this kind of rental. The rest of the list supports the earlier findings from the heatmap: the numbers of bedrooms, bathrooms and beds all correlate positively with price and are among the top ten, although the number of people accommodated does not make this list.
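
Coefficient magnitudes can only be compared directly when the inputs share a common scale. The following sketch illustrates this on made-up data with a hypothetical `square_feet` column and a plain least-squares fit via NumPy (not the Udacity helper): the ranking by absolute coefficient changes once each feature is standardized.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
# made-up features on very different scales
X = pd.DataFrame({
    'bedrooms': rng.integers(1, 5, size=n).astype(float),
    'square_feet': rng.normal(900, 300, size=n),
})
y = 30 * X['bedrooms'] + 0.2 * X['square_feet'] + rng.normal(0, 5, size=n)

def fit_coefs(features, target):
    # ordinary least squares with an intercept column; returns the slope coefficients
    A = np.column_stack([np.ones(len(features)), features.to_numpy()])
    beta, *_ = np.linalg.lstsq(A, target.to_numpy(), rcond=None)
    return pd.Series(beta[1:], index=features.columns)

# coefficients in the original units of each feature
raw_coefs = fit_coefs(X, y)

# z-score each feature so every coefficient means "price change per standard deviation"
X_std = (X - X.mean()) / X.std()
std_coefs = fit_coefs(X_std, y)

print(raw_coefs.abs().sort_values(ascending=False))
print(std_coefs.abs().sort_values(ascending=False))
```

On the raw scale `bedrooms` has by far the larger coefficient, while on the standardized scale `square_feet` comes out on top, so a ranking like the one above should always be read together with the scaling that was applied.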

**Question 3**: Compare the results of questions 1 and 2 between Boston and Seattle to look for interesting similarities/differences?

Plot the heatmap for both cities to compare:

In [None]:
cols = ['price', 'cleaning_fee', 'review_scores_location', 'review_scores_value', 'accommodates', 'bathrooms', 'bedrooms']
plot_hist_and_heatmap(df_boston, cols)
plt.show()  # finish the first figure so the second heatmap is not drawn on top of it
plot_hist_and_heatmap(df_seattle, cols)


**Heatmap showing the correlation between different variables for the Boston dataset**

![Heatmap showing the correlation between different variables for the Boston dataset](./images/Figure_1.png)

**Heatmap showing the correlation between different variables for the Seattle dataset**

![Heatmap showing the correlation between different variables for the Seattle dataset](./images/Figure_2.png)

Comparing the heatmaps for Boston and Seattle, one can see an overall similar trend. There does, however, seem to be a significantly stronger correlation between the location rating and the price in Boston than in Seattle.
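
Besides comparing the two heatmaps by eye, the correlation matrices can also be subtracted from each other. The following is a minimal sketch under stated assumptions: `corr_difference` is a hypothetical helper name, and the two data frames are made-up stand-ins rather than the real listings.

```python
import numpy as np
import pandas as pd

def corr_difference(df_a, df_b, cols):
    # element-wise difference of the two correlation matrices (a minus b)
    return df_a[cols].corr() - df_b[cols].corr()

# made-up stand-ins for two cities: same relationship, different amount of noise
rng = np.random.default_rng(1)
size = rng.integers(1, 6, size=300).astype(float)
city_a = pd.DataFrame({'accommodates': size,
                       'price': 50 * size + rng.normal(0, 10, size=300)})
city_b = pd.DataFrame({'accommodates': size,
                       'price': 50 * size + rng.normal(0, 150, size=300)})

diff = corr_difference(city_a, city_b, ['price', 'accommodates'])
print(diff.round(2))
```

Positive entries mark pairs that are more strongly positively correlated in the first data frame; the result could be passed to `sns.heatmap` in the same way as a regular correlation matrix.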


Print the top 10 coefficients for both cities to compare:

In [None]:
print('BOSTON')
print_coeffs(lm_model_boston, X_train_boston)

print('SEATTLE')
print_coeffs(lm_model_seattle, X_train_seattle)

**BOSTON**

|    | est_int                                 | coefs      | abs_coefs |
|----|-----------------------------------------|------------|-----------|
| 9  | **room_type_Private room**              | -80.361940 | 80.361940 |
| 10 | **room_type_Shared room**               | -73.584597 | 73.584597 |
| 14 | **cancellation_policy_super_strict_30** | 72.254277  | 72.254277 |
| 2  | **bedrooms**                            | 37.644407  | 37.644407 |
| 1  | **bathrooms**                           | 35.593966  | 35.593966 |
| 11 | **bed_type_Real Bed**                   | 18.405890  | 18.405890 |
| 7  | **review_scores_location**              | 18.043700  | 18.043700 |
| 12 | **cancellation_policy_moderate**        | 13.164053  | 13.164053 |
| 6  | **review_scores_value**                 | -6.267840  | 6.267840  |
| 4  | **beds**                                | 5.188895   | 5.188895  |

**SEATTLE**

|    | est_int                          | coefs      | abs_coefs |
|----|----------------------------------|------------|-----------|
| 10 | **room_type_Shared room**        | -69.368033 | 69.368033 |
| 9  | **room_type_Private room**       | -38.917639 | 38.917639 |
| 2  | **bedrooms**                     | 28.415771  | 28.415771 |
| 1  | **bathrooms**                    | 26.916930  | 26.916930 |
| 7  | **review_scores_location**       | 14.234777  | 14.234777 |
| 13 | **cancellation_policy_moderate** | -6.406348  | 6.406348  |
| 0  | **accommodates**                 | 6.334792   | 6.334792  |
| 6  | **review_scores_value**          | -4.474419  | 4.474419  |
| 11 | **bed_type_Futon**               | 3.574835   | 3.574835  |
| 14 | **cancellation_policy_strict**   | -2.415319  | 2.415319  |

In the top ten variables correlating with price we again see similarities: any room type other than whole house/apt shows the strongest negative correlation with price in both cities. Somewhat surprisingly, a super strict cancellation policy is the first non-room-type variable on the list in Boston, while it does not show up in Seattle at all. Extremely high-priced listings in Boston seem to be very strict with their cancellation policy, possibly to avoid hoax bookings or an unreasonably high cancellation rate for this kind of rental. Why this is not the case in Seattle is not clear to me. The rest of the lists supports the earlier findings from the heatmaps: the numbers of bedrooms and bathrooms correlate positively with price and are among the top ten in both cities, joined by the number of beds in Boston and the number of people accommodated in Seattle.
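
For a more direct comparison, the two coefficient lists can be placed side by side. The sketch below copies a handful of coefficients from the two tables above into pandas Series and joins them on the variable name; a variable that is missing from one city's top ten shows up as NaN for that city, which only means it is not in that list, not that it has no coefficient.

```python
import pandas as pd

# a few coefficients copied from the BOSTON and SEATTLE tables
boston = pd.Series({
    'room_type_Private room': -80.361940,
    'room_type_Shared room': -73.584597,
    'cancellation_policy_super_strict_30': 72.254277,
    'bedrooms': 37.644407,
    'bathrooms': 35.593966,
    'review_scores_location': 18.043700,
}, name='boston')
seattle = pd.Series({
    'room_type_Shared room': -69.368033,
    'room_type_Private room': -38.917639,
    'bedrooms': 28.415771,
    'bathrooms': 26.916930,
    'review_scores_location': 14.234777,
    'accommodates': 6.334792,
}, name='seattle')

# concat along axis=1 aligns on the variable name and keeps variables present in only one list
side_by_side = pd.concat([boston, seattle], axis=1)
side_by_side['difference'] = side_by_side['boston'] - side_by_side['seattle']

# sort by the absolute Boston coefficient; NaN rows end up last
print(side_by_side.sort_values('boston', key=abs, ascending=False))
```

The `difference` column makes it easy to spot, for example, that a private room lowers the price considerably more in Boston than in Seattle.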