**Final Project Submission**

Please fill out:
* Student name: Brian Tracy
* Student pace: self paced
* Scheduled project review date/time: 
* Instructor name: 
* Blog post URL:


# Business Problem

A local home renovation company is looking for advice on 

# EDA

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# forward slash avoids the invalid '\k' escape sequence and works on every OS
raw_data = pd.read_csv('data/kc_house_data.csv')
raw_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21597 entries, 0 to 21596
Data columns (total 21 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   id             21597 non-null  int64  
 1   date           21597 non-null  object 
 2   price          21597 non-null  float64
 3   bedrooms       21597 non-null  int64  
 4   bathrooms      21597 non-null  float64
 5   sqft_living    21597 non-null  int64  
 6   sqft_lot       21597 non-null  int64  
 7   floors         21597 non-null  float64
 8   waterfront     19221 non-null  object 
 9   view           21534 non-null  object 
 10  condition      21597 non-null  object 
 11  grade          21597 non-null  object 
 12  sqft_above     21597 non-null  int64  
 13  sqft_basement  21597 non-null  object 
 14  yr_built       21597 non-null  int64  
 15  yr_renovated   17755 non-null  float64
 16  zipcode        21597 non-null  int64  
 17  lat            21597 non-null  float64
 18  long  

In [None]:
raw_data.waterfront.value_counts()

In [3]:
raw_data.isna().sum()

id                  0
date                0
price               0
bedrooms            0
bathrooms           0
sqft_living         0
sqft_lot            0
floors              0
waterfront       2376
view               63
condition           0
grade               0
sqft_above          0
sqft_basement       0
yr_built            0
yr_renovated     3842
zipcode             0
lat                 0
long                0
sqft_living15       0
sqft_lot15          0
dtype: int64

The NaN values need to be handled before modeling.

- For 'yr_renovated', we assume a missing value means the house has never been renovated and set it to 0.0, which is how the other non-renovated homes are recorded.

- For 'view', we assume a missing value means there is no view. The column already contains many 'NONE' values, so we set the nulls to 'NONE'.

- For 'waterfront', we likewise assume a missing value means the property is not on the waterfront and set it to 'NO'.

Consider leaving every column in for now.

Look at the relationship between year built, year renovated, and year sold.

Don't drop a column without a good reason grounded in the assumptions of linear modeling.
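One possible way to look at the relationship between year built, year renovated, and year sold is to turn them into ages. The sketch below uses a few made-up rows. It assumes 'date' is the sale date in month/day/year format, and the names `yr_sold`, `age_at_sale`, and `yrs_since_update` are hypothetical, not columns in the dataset.

```python
import pandas as pd

# Toy rows mimicking the dataset's columns; the values are made up.
toy = pd.DataFrame({
    "date": ["10/13/2014", "2/25/2015", "6/1/2014"],
    "yr_built": [1955, 1933, 2005],
    "yr_renovated": [1991.0, 0.0, float("nan")],
})

# Assumption: 'date' is the sale date in month/day/year format.
toy["yr_sold"] = pd.to_datetime(toy["date"], format="%m/%d/%Y").dt.year

# Age of the house when it sold.
toy["age_at_sale"] = toy["yr_sold"] - toy["yr_built"]

# Year of the last major work: the renovation year if there is one,
# otherwise the build year (0 and NaN are both treated as "never renovated").
last_update = toy["yr_renovated"].where(toy["yr_renovated"] > 0, toy["yr_built"])
toy["yrs_since_update"] = toy["yr_sold"] - last_update

print(toy[["yr_sold", "age_at_sale", "yrs_since_update"]])
```

Whether features like these help the model would still need to be checked against the real data.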

In [None]:
df = raw_data.copy()
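# assign the result back; calling fillna(inplace=True) on a selected column
# is deprecated in recent pandas and may not update df
df['yr_renovated'] = df['yr_renovated'].fillna(0.0)
df['view'] = df['view'].fillna('NONE')
df['waterfront'] = df['waterfront'].fillna('NO')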

# Candidate columns to drop: id, lat, long, sqft_living15, sqft_lot15,
# and zipcode.
# sqft_above is also a candidate (just use sqft_living).
# yr_renovated would be replaced by a new engineered renovation feature.
# The drop is commented out for now, so every column is kept.
# df.drop(['id', 'lat', 'long', 'sqft_living15', 'sqft_lot15', 'zipcode',
#          'sqft_above', 'yr_renovated'], axis=1, inplace=True)


In [None]:
df

## Baseline model

In [None]:
numerical_data = df.select_dtypes(include='number').copy()

In [None]:
numerical_data.corr()['price'].map(abs).sort_values(ascending=False)

For the baseline model, we also drop sqft_lot and yr_built.

In [None]:
X = numerical_data.drop(['price', 'sqft_lot', 'yr_built'], axis=1)
y = numerical_data['price']

In [None]:
# R^2 of a model fit and scored on the full dataset (no train/test split yet)
baseline_score = LinearRegression().fit(X, y).score(X, y)
baseline_score

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                   random_state=0)

In [None]:
model_v1 = LinearRegression().fit(X_train, y_train)

In [None]:
model_v1.score(X_test,y_test)

In [None]:
X.columns

In [None]:
model_v1.coef_

In [None]:
model_v1.intercept_

In [None]:
from sklearn.model_selection import cross_validate, ShuffleSplit

splitter = ShuffleSplit(n_splits=3, test_size=0.25, random_state=0)

baseline_scores = cross_validate(
    estimator=model_v1,
    X=X_train,
    y=y_train,
    return_train_score=True,
    cv=splitter
)

print("Train score:", baseline_scores["train_score"].mean())
print("Test score:", baseline_scores["test_score"].mean())

The baseline model has an R² of about 0.50. We hope to improve on this, starting by converting the categorical data and adding it back in.
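For reference, the score reported by `LinearRegression.score` is the coefficient of determination R², the share of the variance in the target that the model explains. A minimal sketch on synthetic data (not the housing data; the `_demo` names are only for illustration) shows that computing it by hand gives the same number:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: two features with a known linear signal plus noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=2.0, size=200)

demo_model = LinearRegression().fit(X_demo, y_demo)
y_pred = demo_model.predict(X_demo)

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y_demo - y_pred) ** 2)
ss_tot = np.sum((y_demo - y_demo.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

# The two printed numbers should agree.
print(r2_manual, demo_model.score(X_demo, y_demo))
```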

## Convert categorical data

In [None]:
categorical_data = df.select_dtypes(exclude='number').copy()
categorical_data.info()

We have 6 categorical features to evaluate
(technically, 2 of these are numerical features that we are adapting):
- 'waterfront': engineer new boolean 'is_waterfront'
- 'grade': change to numeric
- 'sqft_basement': engineer new boolean 'has_basement'
- 'yr_renovated': engineer new boolean 'been_renovated'
    *Note: this one uses raw_data from the original import.
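As an illustration of the conversions listed above, here is a minimal vectorized sketch on a few made-up rows. It is an alternative formulation, not the notebook's own code, and it assumes grade labels start with an integer (such as '7 Average') and that an unknown basement size ('?') can be treated as no basement. The name `grade_num` is hypothetical.

```python
import pandas as pd

# Toy rows; the values are made up for illustration.
toy = pd.DataFrame({
    "grade": ["7 Average", "11 Excellent", "5 Fair"],
    "sqft_basement": ["0.0", "?", "400.0"],
    "waterfront": ["NO", "YES", "NO"],
})

# Leading integer of the grade label, e.g. '7 Average' -> 7.
toy["grade_num"] = toy["grade"].str.split().str[0].astype(int)

# errors='coerce' turns unparseable entries such as '?' into NaN;
# filling with 0 assumes an unknown basement means no basement.
basement = pd.to_numeric(toy["sqft_basement"], errors="coerce").fillna(0)
toy["has_basement"] = (basement > 0).astype(int)

# Boolean flag stored as 0/1.
toy["is_waterfront"] = (toy["waterfront"] == "YES").astype(int)

print(toy)
```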

In [None]:
# set up new dataframe to concat later with numerical dataframe
converted_features = pd.DataFrame([])

# new 'is_waterfront' feature (boolean)
converted_features['is_waterfront'] = categorical_data.waterfront\
                                                     .map({'NO': 0, 'YES': 1})

# updated 'grade' feature
# updated 'grade' feature: keep the leading number and cast it to int
converted_features['grade'] = categorical_data.grade\
                                             .map(lambda x: int(x.split()[0]))

# 'sqft_basement' has some values of '?'; before engineering the new feature
# these must be converted to 0.0 (assumes unknown size means no basement)
categorical_data['sqft_basement'] = categorical_data.sqft_basement\
                                                    .replace('?', '0.0')
# then convert the whole column to float
categorical_data['sqft_basement'] = categorical_data.sqft_basement\
                                                    .astype('float')
# new 'has_basement' feature (boolean): 1 if the basement area is positive
converted_features['has_basement'] = categorical_data.sqft_basement\
                                           .apply(lambda x: 1 if x > 0 else 0)

# new 'been_renovated' feature (boolean)
converted_features['been_renovated'] = raw_data.yr_renovated\
                                           .apply(lambda x: 1 if x > 0 else 0)

In [None]:
converted_features

- 'view': one-hot encode (NONE, AVERAGE, GOOD, FAIR, EXCELLENT)
- 'condition': one-hot encode (Average, Good, Very Good, Fair, Poor)

In [None]:
from sklearn.preprocessing import OneHotEncoder

categories = ['view', 'condition']
temp_df = categorical_data[categories].copy()

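# sparse_output replaces the older sparse argument (scikit-learn >= 1.2)
ohe = OneHotEncoder(sparse_output=False, drop='first')
ohe.fit(temp_df)

# get_feature_names_out() replaces the removed get_feature_names()
column_names = ohe.get_feature_names_out()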

ohe_encoded = ohe.transform(temp_df)

temp_df_encoded = pd.DataFrame(ohe_encoded, columns=column_names)

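With drop='first', the first category of each feature (in sorted order) is dropped and becomes the reference category: 'AVERAGE' for view and 'Average' for condition.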
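To make the reference category concrete, the toy sketch below encodes a made-up 'view' column and reports which category was dropped. It calls `.toarray()` on the default sparse result so it does not depend on the `sparse`/`sparse_output` argument name.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy column with three of the 'view' categories; the rows are made up.
toy = pd.DataFrame({"view": ["NONE", "GOOD", "AVERAGE", "NONE"]})

enc = OneHotEncoder(drop="first")
encoded = enc.fit_transform(toy).toarray()

# categories_ holds the sorted categories; drop_idx_ marks the dropped one.
categories = list(enc.categories_[0])
dropped = categories[enc.drop_idx_[0]]
kept = [c for c in categories if c != dropped]

print("dropped (reference) category:", dropped)
print(pd.DataFrame(encoded, columns=kept))
```

A row of all zeros therefore means the reference category, and each coefficient on a kept column is read relative to it.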

In [None]:
temp_df.view.value_counts()

In [None]:
temp_df.condition.value_counts()

In [None]:
temp_df_encoded

In [None]:
merged_df = pd.concat([numerical_data, converted_features, temp_df_encoded],
                     axis=1)

In [None]:
merged_df

## Model Version 2

Model version 2 (bringing in converted categoricals)

In [None]:
X = merged_df.drop('price', axis=1)
y = merged_df['price']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                   random_state=0)

In [None]:
model_v2 = LinearRegression().fit(X_train,y_train)

In [None]:
model_v2.score(X_test,y_test)

This brings the test R² up to about 0.65, from about 0.50 for the baseline.

In [None]:
merged_df.corr()['price'].abs().sort_values(ascending=False)

After looking at these results, we decided to replace the one-hot encoded view columns with a single engineered boolean feature that indicates whether the property has a view.

In [None]:
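# drop the last 8 columns (the one-hot encoded view and condition columns);
# .copy() avoids a SettingWithCopyWarning when adding 'has_view' later
trimmed_df = merged_df.iloc[:, :-8].copy()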

In [None]:
trimmed_df

In [None]:
# new 'has_view' feature (boolean)
trimmed_df['has_view'] = temp_df.view.apply(lambda x: 1 if x != 'NONE' else 0)

In [None]:
trimmed_df.has_view.value_counts()

We already have the grade feature, so for now we leave the condition category out and see what results we get.

In [None]:
X = trimmed_df.drop('price', axis=1)
y = trimmed_df['price']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                   random_state=0)

model_v3 = LinearRegression().fit(X_train,y_train)

model_v3.score(X_test,y_test)

This is slightly worse than the last model, but the R² is still around 0.65. Let's see what we get from feature ranking with RFE.

In [None]:
from sklearn.feature_selection import RFE

model_v4 = LinearRegression()

selector = RFE(model_v4, n_features_to_select=5)
selector = selector.fit(X_train,y_train.values.ravel())
selector.support_

In [None]:
selected_columns = X_train.columns[selector.support_]
model_v4.fit(X_train[selected_columns], y_train)

In [None]:
model_v4.score(X_test[selected_columns],y_test)

The R² drops by about 0.10, to roughly 0.55, when we limit ourselves to 5 features.
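One thing that may be worth checking here: with a linear model, RFE ranks features by the size of their coefficients, and coefficient size depends on each feature's units. The sketch below uses synthetic data (not the housing data) to show how standardizing the features first can change which feature RFE keeps. Using `StandardScaler` is an assumption made for illustration, not a step from the notebook above.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500
big = rng.normal(scale=1000.0, size=n)   # large units, small coefficient
small = rng.normal(scale=1.0, size=n)    # small units, large coefficient
noise_col = rng.normal(size=n)           # unrelated to the target
X_demo = np.column_stack([big, small, noise_col])

# 'big' drives most of the variation in y despite its small coefficient.
y_demo = 0.5 * big + 10.0 * small + rng.normal(size=n)

# RFE on the raw features keeps the one with the largest raw coefficient.
raw_rfe = RFE(LinearRegression(), n_features_to_select=1).fit(X_demo, y_demo)

# After standardizing, coefficients are comparable across features.
X_scaled = StandardScaler().fit_transform(X_demo)
scaled_rfe = RFE(LinearRegression(), n_features_to_select=1).fit(X_scaled, y_demo)

print("raw features kept:   ", raw_rfe.support_)
print("scaled features kept:", scaled_rfe.support_)
```

The feature matrix used above mixes 0/1 flags with square-footage columns, so comparing the RFE selection with and without scaling could show whether the ranking partly reflects units.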

To do: use folium for map-based visualization.