## Seattle AirBnB
This notebook explores the Seattle Airbnb Open Data from Kaggle:
https://www.kaggle.com/airbnb/seattle/data

First, let's import some packages that will likely come in handy.

Non-standard installations:
`!conda install basemap`

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from IPython import display
from mpl_toolkits.basemap import Basemap

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error

%matplotlib inline

In [None]:
calendar_df = pd.read_csv('../data/calendar.csv') 
listings_df = pd.read_csv('../data/listings.csv')
reviews_df = pd.read_csv('../data/reviews.csv')

In [None]:
calendar_df.head(3)

In [None]:
print(calendar_df.count())
print(calendar_df.count()/calendar_df.shape[0])

Some of the price data (~33%) is missing.
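
One way to investigate the gap is to check whether the missing prices coincide with dates on which a listing is unavailable. The sketch below assumes the calendar data has an `available` column holding `'t'`/`'f'` strings alongside `price`; the helper name `missing_price_by_availability` and the toy values are hypothetical.

```python
import pandas as pd

def missing_price_by_availability(df):
    """Fraction of missing `price` values within each `available` group."""
    return df.groupby('available')['price'].apply(lambda s: s.isnull().mean())

# Toy frame standing in for calendar_df (values are made up for illustration)
toy_calendar = pd.DataFrame({
    'listing_id': [1, 1, 2, 2],
    'available': ['t', 'f', 't', 'f'],
    'price': ['$85.00', None, '$120.00', None],
})
missing_by_group = missing_price_by_availability(toy_calendar)
print(missing_by_group)
```

If the same pattern holds on `calendar_df`, the missing prices would reflect unavailable dates rather than a data-quality problem.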

In [None]:
listings_df.shape

In [None]:
print(listings_df.columns)
listings_df.describe()

In [None]:
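# Restrict to numeric columns; recent pandas versions raise on text columns in corr()
sns.heatmap(listings_df.select_dtypes(include='number').corr());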

It looks like some of these variables are highly correlated, mostly as we would expect -- for example, if one review score is high, the others are likely high as well (e.g. rating, accuracy, cleanliness, checkin, communication, value); the location review is the least correlated with the others, which also makes sense.

Likewise, the availabilities are highly correlated (30, 60, 90, 365), as are the variables related to the house size (accommodates, bathrooms, bedrooms, beds, square feet). When choosing predictor variables, we won't want to use all of the highly correlated ones, but perhaps only the single most representative variable from each group.

Let's look at a correlation matrix with a reduced number of variables. First, let's find a list of the most highly correlated variables in a systematic way. Here I'm following the steps from this guide:
https://chrisalbon.com/machine_learning/feature_selection/drop_highly_correlated_features/

In [None]:
# Create correlation matrix from the numeric columns
corr_matrix = listings_df.select_dtypes(include='number').corr().abs()

# Select upper triangle of correlation matrix
# (the np.bool alias was removed from NumPy; the builtin bool works everywhere)
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))

# Find feature columns with any correlation greater than 0.95
to_drop = [column for column in upper.columns if any(upper[column] > 0.95)]
to_drop

Here we can see that we found the two most highly correlated features. Let's turn this into a function and test some different thresholds to see if we can get some of the other variables we saw that look like they have high correlations. Looking at the heatmap, let's try a correlation threshold of 0.8 to see what that gives us:

In [None]:
def find_correlated_features(df,threshold):
    # Create correlation matrix from the numeric columns
    corr_matrix = df.select_dtypes(include='number').corr().abs()

    # Select upper triangle of correlation matrix
    upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))

    # Find feature columns with any correlation greater than the threshold
    to_drop = [column for column in upper.columns if any(upper[column] > threshold)]
    
    return to_drop

In [None]:
corr_cols = find_correlated_features(listings_df,0.8)
print(corr_cols)

Ok, this gave us a few more. Let's loop through a number of different thresholds from 0.5 to 0.9 and list the correlated columns for each threshold. We can then select the set that looks the best, upon comparison:

In [None]:
thresholds = [0.5,0.6,0.7,0.8,0.9]
for t in thresholds:
    cols = find_correlated_features(listings_df,t)
    print(str(t) + ':' + ','.join(cols))

In [None]:
# Drop features whose absolute correlation with an earlier column exceeds 0.5
cols_to_drop = find_correlated_features(listings_df,0.5)
listings_df2 = listings_df.drop(columns=cols_to_drop)

In [None]:
# Preview the reduced dataframe before dealing with columns that have missing values
listings_df2.head(5)

In [None]:
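listings_df2 = listings_df2.drop(labels=['id','scrape_id','license'],axis=1)
# Restrict to numeric columns; recent pandas versions raise on text columns in corr()
sns.heatmap(listings_df2.select_dtypes(include='number').corr());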

In [None]:
fig = plt.figure(figsize = (15,15))
listings_df.hist(ax = fig.gca());

In [None]:
most_missing_cols = set(listings_df.columns[listings_df.isnull().mean() > 0.5])
most_missing_cols

In [None]:
print(listings_df.shape)
print(listings_df.dropna(how='all',axis=1).shape)
listings_df['scrape_id'].describe()

In [None]:
reviews_df.columns

In [None]:
listings_df['neighbourhood'].unique()

In [None]:
listings_df['neighbourhood'].value_counts().head()

From this initial look at the dataset, some questions to consider include:

    1) which variables are the best predictors of the overall review score `review_scores_rating`?
    2) do review scores vary significantly between different neighborhoods?
    3) which neighborhoods provide the best value (rating/price ratio)?
    4) if we control for `review_scores_location`, does the neighborhood rating change?
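
As a rough sketch of how question 3 could be approached, the helper below computes a rating-to-price ratio per listing and averages it by neighbourhood. It assumes `price` is stored as text such as `"$1,250.00"`; the helper name `neighbourhood_value`, the `min_listings` parameter, and the toy values are hypothetical.

```python
import pandas as pd

def neighbourhood_value(df, min_listings=1):
    """Mean rating/price ratio per neighbourhood, best value first."""
    out = df.copy()
    # Assumption: price is text like "$1,250.00"; strip "$" and "," before converting
    cleaned = out['price'].astype(str).str.replace(r'[$,]', '', regex=True)
    out['price_num'] = pd.to_numeric(cleaned, errors='coerce')
    out['value'] = out['review_scores_rating'] / out['price_num']
    grouped = out.groupby('neighbourhood')['value'].agg(['mean', 'count'])
    # Keep neighbourhoods with enough listings that have both a rating and a price
    grouped = grouped[grouped['count'] >= min_listings]
    return grouped.sort_values('mean', ascending=False)

# Toy frame standing in for listings_df (values are made up for illustration)
toy_listings = pd.DataFrame({
    'neighbourhood': ['A', 'A', 'B'],
    'review_scores_rating': [90.0, 80.0, 95.0],
    'price': ['$100.00', '$200.00', '$1,900.00'],
})
value_table = neighbourhood_value(toy_listings)
print(value_table)
```

A plain ratio favours cheap listings heavily, so it is only one of several possible definitions of "value".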

In [None]:
# Select the rating column before averaging so text columns are never passed to mean()
hood_ratings = listings_df.groupby('neighbourhood')['review_scores_rating'].mean().reset_index()

In [None]:
hood_ratings.sort_values(by='review_scores_rating').head().style.format({'review_scores_rating': '{:.2f}'})

In [None]:
hood_ratings.sort_values(by='review_scores_rating').tail().style.format({'review_scores_rating': '{:.2f}'})
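
The extremes of a ranking like this can be dominated by neighbourhoods with only a handful of rated listings. A minimal sketch that reports the number of rated listings next to each mean is shown below; the helper name `rating_by_neighbourhood` and the `min_listings` cutoff of 5 are hypothetical choices, not values taken from the data.

```python
import pandas as pd

def rating_by_neighbourhood(df, min_listings=5):
    """Mean rating and number of rated listings per neighbourhood."""
    stats = (df.groupby('neighbourhood')['review_scores_rating']
               .agg(['mean', 'count'])
               .rename(columns={'mean': 'mean_rating', 'count': 'n_rated'}))
    # 'count' ignores missing ratings, so n_rated is the number of rated listings
    return stats[stats['n_rated'] >= min_listings].sort_values('mean_rating')

# Toy frame standing in for listings_df (values are made up for illustration)
toy_ratings = pd.DataFrame({
    'neighbourhood': ['A', 'A', 'A', 'B', 'C', 'C'],
    'review_scores_rating': [90.0, 100.0, None, 60.0, 80.0, 70.0],
})
rating_table = rating_by_neighbourhood(toy_ratings, min_listings=2)
print(rating_table)
```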

Let's look at a map, so we can more easily see if there are any trends:

In [None]:
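# Draw a base map centred on Seattle (approximately 47.61 N, 122.33 W)
fig, ax = plt.subplots(figsize=(8, 10))
# resolution='i' ships with the default Basemap data; 'h' needs the high-resolution data package
m = Basemap(projection='gnom', lat_0=47.61, lon_0=-122.33,
            width=30000, height=40000, resolution='i', ax=ax)
m.fillcontinents(color="#FFDDCC", lake_color='#DDEEFF')
m.drawmapboundary(fill_color="#DDEEFF")
m.drawcoastlines()
ax.set_title("Seattle");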
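
To see spatial trends in the ratings themselves, the listings can be plotted by coordinate and coloured by score. The sketch below uses plain matplotlib rather than Basemap and assumes the listings data has `latitude` and `longitude` columns; the helper name `plot_listing_ratings` and the toy coordinates are hypothetical.

```python
import matplotlib.pyplot as plt
import pandas as pd

def plot_listing_ratings(df, ax=None):
    """Scatter listings by longitude/latitude, coloured by overall review score."""
    if ax is None:
        _, ax = plt.subplots(figsize=(6, 8))
    # Listings without coordinates or a rating cannot be placed or coloured
    rated = df.dropna(subset=['latitude', 'longitude', 'review_scores_rating'])
    points = ax.scatter(rated['longitude'], rated['latitude'],
                        c=rated['review_scores_rating'], cmap='viridis', s=8)
    ax.figure.colorbar(points, ax=ax, label='review_scores_rating')
    ax.set_xlabel('longitude')
    ax.set_ylabel('latitude')
    return ax

# Toy frame standing in for listings_df (coordinates and scores are made up)
toy_points = pd.DataFrame({
    'latitude': [47.60, 47.65, 47.70],
    'longitude': [-122.33, -122.35, -122.30],
    'review_scores_rating': [95.0, None, 88.0],
})
ratings_ax = plot_listing_ratings(toy_points)
```

On the real data this would be called as `plot_listing_ratings(listings_df)`.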

In [None]:
def clean_fit_linear_mod(df, response_col, cat_cols, dummy_na, test_size=.3, rand_state=42):
    '''
    INPUT:
    df - a dataframe holding all the variables of interest
    response_col - a string holding the name of the column 
    cat_cols - list of strings that are associated with names of the categorical columns
    dummy_na - Bool holding whether you want to dummy NA vals of categorical columns or not
    test_size - a float between [0,1] about what proportion of data should be in the test dataset
    rand_state - an int that is provided as the random state for splitting the data into training and test 
    
    OUTPUT:
    test_score - float - r2 score on the test data
    train_score - float - r2 score on the training data
    lm_model - model object from sklearn
    X_train, X_test, y_train, y_test - output from sklearn train test split used for optimal model
    
    Your function should:
    1. Drop the rows with missing response values
    2. Drop columns with NaN for all the values
    3. Use create_dummy_df to dummy categorical columns
    4. Fill the mean of the column for any missing values 
    5. Split your data into an X matrix and a response vector y
    6. Create training and test sets of data
    7. Instantiate a LinearRegression model with normalized data
    8. Fit your model to the training data
    9. Predict the response for the training data and the test data
    10. Obtain an rsquared value for both the training and test data
    '''
    #Drop the rows with missing response values
    df  = df.dropna(subset=[response_col], axis=0)

    #Drop columns with all NaN values
    df = df.dropna(how='all', axis=1)

    #Dummy categorical variables
    df = create_dummy_df(df, cat_cols, dummy_na)

    # Mean function
    fill_mean = lambda col: col.fillna(col.mean())
    # Fill the mean
    df = df.apply(fill_mean, axis=0)

    #Split into explanatory and response variables
    X = df.drop(response_col, axis=1)
    y = df[response_col]

    #Split into train and test
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=rand_state)

    lm_model = LinearRegression() # Instantiate (the normalize argument was removed in scikit-learn 1.2)
    lm_model.fit(X_train, y_train) #Fit

    #Predict using your model
    y_test_preds = lm_model.predict(X_test)
    y_train_preds = lm_model.predict(X_train)

    #Score using your model
    test_score = r2_score(y_test, y_test_preds)
    train_score = r2_score(y_train, y_train_preds)

    return test_score, train_score, lm_model, X_train, X_test, y_train, y_test
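
The function above calls `create_dummy_df`, which is not defined in this notebook. A minimal sketch of what such a helper might look like is given below, assuming it should one-hot encode each listed categorical column with `pd.get_dummies` and drop the original column; the exact behaviour of the intended helper is an assumption.

```python
import pandas as pd

def create_dummy_df(df, cat_cols, dummy_na):
    """Replace each column in cat_cols with dummy (one-hot) columns."""
    for col in cat_cols:
        if col not in df.columns:
            continue  # skip names that were dropped earlier in the pipeline
        dummies = pd.get_dummies(df[col], prefix=col, prefix_sep='_',
                                 drop_first=True, dummy_na=dummy_na)
        # Build a new frame rather than modifying the caller's frame in place
        df = pd.concat([df.drop(col, axis=1), dummies], axis=1)
    return df

# Toy frame (values are made up for illustration)
toy_cat = pd.DataFrame({'room_type': ['a', 'b', 'a'], 'x': [1, 2, 3]})
dummied = create_dummy_df(toy_cat, ['room_type'], dummy_na=False)
print(dummied.columns.tolist())
```

With `drop_first=True` one level per category is dropped, which avoids perfectly collinear dummy columns in the linear model.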

In [None]:
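# Fit on the numeric listing columns, using the overall review score as the response
num_df = listings_df.select_dtypes(include='number')
test_score, train_score, lm_model, X_train, X_test, y_train, y_test = clean_fit_linear_mod(num_df, 'review_scores_rating', cat_cols=[], dummy_na=False)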