#Project: Write a Data Science Blog Post

#Section 1: Business Understanding

This notebook contains the code for homework 1 in the Udacity Data Scientist Nanodegree. The motivation behind this is to pass the course. As such, I decided to look at the provided Airbnb listing data sets for Boston and Seattle to answer the following questions:

**Question 1**: What is the correlation between the following variables: 'price', 'cleaning_fee', 'review_scores_location', 'review_scores_value', 'accommodates', 'bathrooms', 'bedrooms'?

**Question 2**: Which of the following variables has the strongest correlation with the listing price: 'room_type', 'accommodates', 'bathrooms', 'bedrooms', 'host_is_superhost', 'beds', 'bed_type', 'cleaning_fee', 'cancellation_policy', 'review_scores_value', 'review_scores_location', 'host_identity_verified'?

**Question 3**: Compare the results of questions 1 and 2 between Boston and Seattle to look for interesting similarities/differences.

#Section 2: Data Understanding

Import the packages needed for this code.

In [None]:
# read in needed packages
# import warnings
# warnings.simplefilter(action='ignore')

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error
import AllTogether as t
import seaborn as sns
%matplotlib inline

Load the datasets and take a look at their shape to verify.

In [None]:
df_boston = pd.read_csv('./boston/listings.csv')
df_seattle = pd.read_csv('./seattle/listings.csv')

# print both shapes; a notebook cell only displays the value of its last expression
print(df_boston.shape)
print(df_seattle.shape)

#Section 3: Data Preparation

The following function extracts the columns to be used for this project and converts them to suitable data types.

In [None]:
# function to extract the columns used in this project and convert their data types
def reduce_and_convert(df):
    # extract the columns by column name; copy so we do not write into a view of the original
    df = df[['room_type', 'accommodates', 'bathrooms', 'bedrooms', 'host_is_superhost',
             'beds', 'bed_type', 'price', 'cleaning_fee', 'cancellation_policy',
             'review_scores_value', 'review_scores_location', 'host_identity_verified']].copy()

    # strip currency formatting and convert to float
    # (regex=False treats '$' as a literal character)
    for col in ['price', 'cleaning_fee']:
        df[col] = (df[col].str.replace('$', '', regex=False)
                          .str.replace(',', '', regex=False)
                          .astype(float))

    # a missing cleaning fee is treated as no fee
    df['cleaning_fee'] = df['cleaning_fee'].fillna(0.0)

    # convert the 't'/'f' flags to booleans; calling astype('bool') on the raw strings
    # would turn both 't' and 'f' into True. Missing flags become False here.
    for col in ['host_is_superhost', 'host_identity_verified']:
        df[col] = df[col] == 't'

    return df

df_boston = reduce_and_convert(df_boston)
df_seattle = reduce_and_convert(df_seattle)
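
The two conversions in this step (currency strings to floats, 't'/'f' flags to booleans) can be checked on a few hand-made values. This is a minimal sketch; the sample values are made up for illustration and are not taken from the listing data.

```python
import pandas as pd

# Hypothetical sample values, not taken from the real datasets
raw = pd.DataFrame({
    'price': ['$1,250.00', '$85.00', '$120.00'],
    'host_is_superhost': ['t', 'f', 't'],
})

# regex=False makes pandas treat '$' as a literal character
price = (raw['price'].str.replace('$', '', regex=False)
                     .str.replace(',', '', regex=False)
                     .astype(float))

# Map the flags explicitly instead of relying on the truthiness of the strings,
# since any non-empty Python string (including 'f') is truthy
superhost = raw['host_is_superhost'].map({'t': True, 'f': False})

print(price.tolist())
print(superhost.tolist())
```

Passing `regex=False` keeps the behaviour the same across pandas versions, where the default handling of single-character patterns has changed over time.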

Further data preparation is needed: the following function drops rows with NaN values and converts the categorical columns into dummy variables that can be used by our model. As an alternative to dropping rows, we could fill NaN values (for example with the column mean); for this project we decided against this.

In [None]:
# drop rows containing NaN and convert categorical columns into dummy variables
def drop_nan_and_dummy(df):
    # drop any rows with NaN entries
    df = df.dropna(axis=0)

    # get column names of categorical columns
    cat_vars = df.select_dtypes(include=['object']).columns

    # convert categorical columns column by column
    for var in cat_vars:
        df = pd.concat([df.drop(var, axis=1), pd.get_dummies(df[var], prefix=var, prefix_sep='_', drop_first=True)],
                       axis=1)

    return df

df_boston = drop_nan_and_dummy(df_boston)
df_seattle = drop_nan_and_dummy(df_seattle)
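
To illustrate what the dummy encoding above does to a categorical column, here is a small sketch on a made-up frame. The values are hypothetical; only `pd.get_dummies` with `drop_first=True` mirrors the call used in the notebook.

```python
import pandas as pd

# Hypothetical listings, for illustration only
toy = pd.DataFrame({
    'price': [150.0, 60.0, 35.0, 70.0],
    'room_type': ['Entire home/apt', 'Private room', 'Shared room', 'Private room'],
})

# drop_first=True removes the first category (alphabetically) so the
# remaining dummy columns are not perfectly collinear
encoded = pd.get_dummies(toy, columns=['room_type'], prefix_sep='_', drop_first=True)

print(list(encoded.columns))
```

The dropped category ('Entire home/apt' in this sketch) becomes the baseline: a coefficient on `room_type_Private room` is then read relative to that baseline.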

#Section 4: Data Modeling 

Here we fit a linear regression model using Udacity-provided code.

In [None]:
def train_model(df):
    # split data into X and y
    X = df.drop(['price'], axis=1)
    y = df['price']
    
    # use udacity data scientist provided code to train the model and search for optimal model
    cutoffs = [5000, 3500, 2500, 1000, 100, 50]
    r2_scores_test, r2_scores_train, lm_model, X_train, X_test, y_train, y_test = t.find_optimal_lm_mod(X, y, cutoffs)

    return r2_scores_test, r2_scores_train, lm_model, X_train, X_test, y_train, y_test

r2_scores_test_boston, r2_scores_train_boston, lm_model_boston, X_train_boston, X_test_boston, y_train_boston, y_test_boston = train_model(df_boston)
r2_scores_test_seattle, r2_scores_train_seattle, lm_model_seattle, X_train_seattle, X_test_seattle, y_train_seattle, y_test_seattle = train_model(df_seattle)
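
For readers without access to the Udacity helper module, the core of this step (split the data, fit a linear model, score it with R²) can be sketched with scikit-learn alone. This is only an illustration on synthetic data and is not a reproduction of `find_optimal_lm_mod`; the toy columns and coefficients are assumptions made up for the example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic data: price depends linearly on two features plus noise (assumed for illustration)
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    'accommodates': rng.integers(1, 9, size=300),
    'bedrooms': rng.integers(1, 5, size=300),
})
toy['price'] = 40 * toy['accommodates'] + 25 * toy['bedrooms'] + rng.normal(0, 10, size=300)

# split into features and target, then into train and test sets
X = toy.drop(columns='price')
y = toy['price']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# fit on the training set and score on both sets
model = LinearRegression().fit(X_train, y_train)
r2_train = r2_score(y_train, model.predict(X_train))
r2_test = r2_score(y_test, model.predict(X_test))

print(f'train R2: {r2_train:.3f}, test R2: {r2_test:.3f}')
```

Comparing the train and test scores is a quick check for overfitting: a test score far below the train score suggests the model does not generalize.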

#Section 5: Evaluation

This function allows us to plot a heatmap visualizing the correlation between a list of columns. **This is used to answer questions 1 and 3**. Correlation coefficients range from -1 to 1. A value of 1 indicates a perfect positive linear relationship: y increases by a fixed amount for every unit increase in x. A value of -1 indicates a perfect negative linear relationship, and a value close to 0 indicates little to no linear correlation.

In [None]:
# show heatmap of correlation between a subset of columns
def plot_hist_and_heatmap(df, col_list):
    # select subset of data
    df = df[col_list]
    # draw each heatmap on its own figure so repeated calls do not overlap
    plt.figure()
    sns.heatmap(df.corr(), annot=True, fmt=".2f")
    plt.show()
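
How to read the values in the heatmap can be checked on a tiny example. The sketch below uses made-up numbers and NumPy's `corrcoef` to show that any exact linear relationship yields a correlation of 1 or -1, regardless of slope or offset.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# exact linear relationships with positive and negative slope
y_up = 3 * x + 10
y_down = -0.5 * x + 2

# np.corrcoef returns the 2x2 correlation matrix; [0, 1] is the x-y entry
r_up = np.corrcoef(x, y_up)[0, 1]
r_down = np.corrcoef(x, y_down)[0, 1]

print(round(r_up, 6), round(r_down, 6))
```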

This function is used to list the top x coefficients of a given linear model. **This is used to answer questions 2 and 3**. Since the data for our model was normalized, the coefficients are comparable across variables: the larger the absolute value of a coefficient, the stronger the estimated linear influence of that input variable on the predicted variable (in our case the listing price). The sign of the coefs column indicates whether the relationship is positive or negative.

In [None]:
# extract coefs from lm model and show top results
def print_coeffs(lm_model, X_train, top_x_results = 10):

    # look at model coefficients
    coefs_df = pd.DataFrame()
    coefs_df['est_int'] = X_train.columns
    coefs_df['coefs'] = lm_model.coef_
    coefs_df['abs_coefs'] = np.abs(lm_model.coef_)
    coefs_df = coefs_df.sort_values('abs_coefs', ascending=False)

    # print explicitly; a bare expression inside a function displays nothing
    print(coefs_df.head(top_x_results))
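
Ranking coefficients by absolute value is only meaningful when the features share a common scale. The following sketch shows one way to get there, using scikit-learn's `StandardScaler` on synthetic data. The scaler, the toy features, and the coefficients used to generate the data are assumptions for illustration and are not necessarily what the Udacity helper does internally.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales (assumed for illustration)
rng = np.random.default_rng(0)
X = pd.DataFrame({
    'accommodates': rng.integers(1, 9, size=200),
    'cleaning_fee': rng.uniform(0, 150, size=200),
})
y = 30 * X['accommodates'] + 0.5 * X['cleaning_fee'] + rng.normal(0, 5, size=200)

# standardize so each coefficient is expressed per standard deviation of its feature
X_scaled = StandardScaler().fit_transform(X)
model = LinearRegression().fit(X_scaled, y)

# rank features by the absolute value of their coefficient
ranking = pd.DataFrame({'feature': X.columns, 'coef': model.coef_})
ranking['abs_coef'] = ranking['coef'].abs()
ranking = ranking.sort_values('abs_coef', ascending=False)

print(ranking)
```

On unscaled data the raw coefficient of `cleaning_fee` would be small simply because the column spans a wide range of values; after standardization the ranking reflects each feature's contribution per standard deviation.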

**Question 1:** How do the variables 'price', 'cleaning_fee', 'review_scores_location', 'review_scores_value', 'accommodates', 'bathrooms' and 'bedrooms' correlate in the Boston dataset?

In [None]:
plot_hist_and_heatmap(df_boston, ['price', 'cleaning_fee', 'review_scores_location', 'review_scores_value', 'accommodates', 'bathrooms', 'bedrooms'])

**Question 2:** Which variables show the highest correlation with the price of a listing for Boston?

In [None]:
print_coeffs(lm_model_boston, X_train_boston)

**Question 3:** How do Boston and Seattle differ when it comes to the questions posed before?

Plot the heatmap for both cities to compare:

In [None]:
# use the columns named in question 1, all of which were kept during data preparation
cols = ['price', 'cleaning_fee', 'review_scores_location', 'review_scores_value', 'accommodates', 'bathrooms', 'bedrooms']
plot_hist_and_heatmap(df_boston, cols)
plot_hist_and_heatmap(df_seattle, cols)

Print the top 10 coefficients for both cities to compare:

In [None]:
print('BOSTON')
print_coeffs(lm_model_boston, X_train_boston)

print('SEATTLE')
print_coeffs(lm_model_seattle, X_train_seattle)