# Background

The following notebook presents the steps involved in and the thought process we used in predicting house prices based on multiple features using regression analysis. We were presented with a dataset preprocessed for instructional purposes and derived from the dataset provided in the former Kaggle competition to predict housing sale price using regression.

King County is home to the largest and fifth largest city in Washington State, namely Seattle and Bellevue, which in conjunction with the third largest city Tacoma forms the Seattle metropolitan area.

If you would like to explore the original dataset on Kaggle, please follow the link below:
https://www.kaggle.com/harlfoxem/housesalesprediction/discussion/92376

We have provided the names and descriptions of the columns in the provided King County dataset:
* **id** - unique ID for a house
* **date** - Date day house was sold
* **price** - Price is prediction target
* **bedrooms** - Number of bedrooms
* **bathrooms** - Number of bathrooms
* **sqft_living** - square footage of the home
* **sqft_lot** - square footage of the lot
* **floors** - Total floors (levels) in house
* **waterfront** - Whether house has a view to a waterfront
* **view** - Number of times house has been viewed
* **condition** - How good the condition is (overall)
* **grade** - overall grade given to the housing unit, based on King County grading system
* **sqft_above** - square footage of house (apart from basement)
* **sqft_basement** - square footage of the basement
* **yr_built** - Year when house was built
* **yr_renovated** - Year when house was renovated
* **zipcode** - zip code in which house is located
* **lat** - Latitude coordinate
* **long** - Longitude coordinate
* **sqft_living15** - The square footage of interior housing living space for the nearest 15 neighbors
* **sqft_lot15** - The square footage of the land lots of the nearest 15 neighbors


# **Business Questions**

Our client representing a cohort of foreign investors has expressed interest in becoming involved in the Seattle area housing market.  By gaining better insight into the prediction models for housing prices, they hope to become major players in the market.  They have partnered with us to apply machine learning in enhancing the prediction of housing prices in King County.

We set out to answer a few questions for our client:

1. Do renovated properties have a higher selling price than unrenovated properties?
2. Does the number of times a property is viewed have any effect on selling price?
3. Does the grade given to the housing unit have an overall effect on the selling price?

Through the use of statistical tests during our EDA process, we will be able to provide the essential information needed for our clients in their new business venture.

# **Exploratory Data Analysis**

The following notebook presents the steps in predicting house pries based on multiple features using regression analysis. We used a dataset of house sales in King County, which includes the city of Seattle and the metropolitan area, processed for instructional purposes from the original Kaggle dataset. We will apply the techniques of exploratory data analysis (EDA) to familiarize ourselves with the dataset.

By performing an EDA, we are able to explore the relationship(s), or lack thereof, between the features and the target and amongst the feature variables themselve. We are better equipped through this process to identify features for analysis and filter out those without any correlation with our target variable. This process is also integral to identifying outliers, missing values, or anomalous values due to human error due to data visualization.

In [23]:
# import packages for data cleaning and processing  
import pandas as pd
import numpy as np
from datetime import datetime
import itertools
import geohash2
import warnings

# import visualization modules
import seaborn as sns
import matplotlib.pyplot as plt
from bokeh.plotting import figure, show, ColumnDataSource
from bokeh.io import output_notebook
from bokeh.palettes import Turbo256, Category10_10
from bokeh.transform import linear_cmap
from bokeh.models import HoverTool
warnings.filterwarnings('ignore')
%matplotlib inline
plt.style.use("fivethirtyeight")
import branca.colormap as cm
import json

# import packages for geolocation
import folium
from folium import plugins
import descartes
import geopandas as gpd
from shapely.geometry import Point, Polygon
from folium.plugins import MarkerCluster
from folium.plugins import HeatMap
from bokeh.palettes import RdYlBu11
from bokeh.models import LogColorMapper
from bokeh.io import show
from bokeh.models import (CDSView, ColorBar, ColumnDataSource, CustomJS, CustomJSFilter, GeoJSONDataSource, HoverTool, LinearColorMapper, Slider)
from bokeh.layouts import column, row, widgetbox
from bokeh.palettes import brewer
from bokeh.plotting import figure

# import packages and modules for statistical analysis
from scipy import stats
import scipy.stats as scs
import statsmodels.api as sm
import statsmodels.formula.api as smf
from sklearn import linear_model
from sklearn import metrics

# import modules for preprocessing
from statsmodels.formula.api import ols
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.feature_selection import RFE, SelectKBest, f_regression, RFECV, mutual_info_regression
from sklearn.preprocessing import PolynomialFeatures, StandardScaler, LabelBinarizer, LabelEncoder, minmax_scale

# import module for object serialization
import pickle

# set display options for Pandas dataframes to allow view of a maximal number of columns and rows
pd.set_option('display.max_rows', 200)
pd.set_option('display.max_columns', 200)


In [2]:
# Read CSV file into notebook
df = pd.read_csv('data/kc_house_data_train.csv', index_col=0)

In [3]:
# get dimensions of the dataframe
df.shape

(17290, 21)

In [4]:
# Display first 5 rows of dataset
df.head()

Unnamed: 0,id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
0,2591820310,20141006T000000,365000.0,4,2.25,2070,8893,2.0,0,0,4,8,2070,0,1986,0,98058,47.4388,-122.162,2390,7700
1,7974200820,20140821T000000,865000.0,5,3.0,2900,6730,1.0,0,0,5,8,1830,1070,1977,0,98115,47.6784,-122.285,2370,6283
2,7701450110,20140815T000000,1038000.0,4,2.5,3770,10893,2.0,0,2,3,11,3770,0,1997,0,98006,47.5646,-122.129,3710,9685
3,9522300010,20150331T000000,1490000.0,3,3.5,4560,14608,2.0,0,2,3,12,4560,0,1990,0,98034,47.6995,-122.228,4050,14226
4,9510861140,20140714T000000,711000.0,3,2.5,2550,5376,2.0,0,0,3,9,2550,0,2004,0,98052,47.6647,-122.083,2250,4050


In [5]:
# Display last 5 rows of dataset
df.tail()

Unnamed: 0,id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
17285,627300195,20150303T000000,750000.0,5,2.5,3240,9960,1.0,0,1,3,8,2020,1220,1958,0,98008,47.5858,-122.112,2730,10400
17286,8819900270,20140520T000000,440000.0,2,1.75,1300,4000,2.0,0,0,3,7,1300,0,1948,0,98105,47.6687,-122.288,1350,4013
17287,3816300095,20140514T000000,310000.0,3,1.0,1050,9876,1.0,0,0,3,7,1050,0,1953,0,98028,47.7635,-122.262,1760,9403
17288,122069107,20141204T000000,427500.0,3,1.5,1900,43186,1.5,0,0,4,7,1300,600,1971,0,98038,47.4199,-121.99,2080,108028
17289,6703100135,20150116T000000,348000.0,3,1.5,1330,6768,1.0,0,0,4,7,1330,0,1952,0,98155,47.7366,-122.319,1320,6910


In [6]:
# Get descriptive analytics of dataset
df.describe()

Unnamed: 0,id,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
count,17290.0,17290.0,17290.0,17290.0,17290.0,17290.0,17290.0,17290.0,17290.0,17290.0,17290.0,17290.0,17290.0,17290.0,17290.0,17290.0,17290.0,17290.0,17290.0,17290.0
mean,4565502000.0,540739.5,3.37247,2.111943,2081.464604,15243.4,1.490312,0.007981,0.238519,3.408502,7.654425,1789.306015,292.158589,1970.792019,83.806304,98078.193175,47.560058,-122.214258,1987.986698,12873.475824
std,2874656000.0,373319.0,0.939346,0.770476,920.018539,42304.62,0.538909,0.088985,0.775229,0.651296,1.174718,829.265107,443.151874,29.343516,400.329376,53.607949,0.138412,0.140857,684.802635,27227.437583
min,1000102.0,75000.0,0.0,0.0,290.0,572.0,1.0,0.0,0.0,1.0,1.0,290.0,0.0,1900.0,0.0,98001.0,47.1559,-122.519,399.0,659.0
25%,2114701000.0,321000.0,3.0,1.5,1430.0,5081.25,1.0,0.0,0.0,3.0,7.0,1200.0,0.0,1951.0,0.0,98033.0,47.4712,-122.329,1490.0,5111.25
50%,3903650000.0,450000.0,3.0,2.25,1920.0,7642.0,1.5,0.0,0.0,3.0,7.0,1560.0,0.0,1974.0,0.0,98065.0,47.5716,-122.23,1840.0,7622.5
75%,7301150000.0,645000.0,4.0,2.5,2550.0,10725.75,2.0,0.0,0.0,4.0,8.0,2214.5,560.0,1996.0,0.0,98118.0,47.6779,-122.126,2360.0,10101.75
max,9900000000.0,7700000.0,33.0,8.0,13540.0,1651359.0,3.5,1.0,4.0,5.0,13.0,9410.0,4820.0,2015.0,2015.0,98199.0,47.7776,-121.315,6210.0,858132.0


**Initial Observations:**

- *waterfront* is a binary discrete variable (0 = not waterfront, 1 = waterfront)
- *sqft_above* + *sqft_basement* = *sqft_living*
- *sqft_basement*, *view*, and *yr_renovated* have many zero values, potentially express them as binary variables
- the oldest home was built in 1970 and the newest in 2015

We can assign our categorical and continuous variables:

- **categorical variables:**  *floors, view, grade, zipcode, bathrooms, bedrooms, condition*
- **continuous variables:** *price, sqft_living, sqft_lot, sqft_above, sqft_basement, yr_built, yr_renovated, lat, long, sqft_living15, sqft_lot15*


In [7]:
# Look for any column types that need conversion
df.dtypes

id                 int64
date              object
price            float64
bedrooms           int64
bathrooms        float64
sqft_living        int64
sqft_lot           int64
floors           float64
waterfront         int64
view               int64
condition          int64
grade              int64
sqft_above         int64
sqft_basement      int64
yr_built           int64
yr_renovated       int64
zipcode            int64
lat              float64
long             float64
sqft_living15      int64
sqft_lot15         int64
dtype: object

## **Initial Data Cleaning**

In [8]:
# Check for any null values in the dataset
df.isna().sum()

id               0
date             0
price            0
bedrooms         0
bathrooms        0
sqft_living      0
sqft_lot         0
floors           0
waterfront       0
view             0
condition        0
grade            0
sqft_above       0
sqft_basement    0
yr_built         0
yr_renovated     0
zipcode          0
lat              0
long             0
sqft_living15    0
sqft_lot15       0
dtype: int64

In [9]:
# Convert 'date' column to datetime format, rename to 'sale_date', and drop original column
df['sale_date'] = [x[:8] for x in df.date]
df.sale_date = df.sale_date.apply(lambda x: datetime.strptime(x, '%Y%m%d'))
df.drop(columns='date', inplace=True)
df.head()

Unnamed: 0,id,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15,sale_date
0,2591820310,365000.0,4,2.25,2070,8893,2.0,0,0,4,8,2070,0,1986,0,98058,47.4388,-122.162,2390,7700,2014-10-06
1,7974200820,865000.0,5,3.0,2900,6730,1.0,0,0,5,8,1830,1070,1977,0,98115,47.6784,-122.285,2370,6283,2014-08-21
2,7701450110,1038000.0,4,2.5,3770,10893,2.0,0,2,3,11,3770,0,1997,0,98006,47.5646,-122.129,3710,9685,2014-08-15
3,9522300010,1490000.0,3,3.5,4560,14608,2.0,0,2,3,12,4560,0,1990,0,98034,47.6995,-122.228,4050,14226,2015-03-31
4,9510861140,711000.0,3,2.5,2550,5376,2.0,0,0,3,9,2550,0,2004,0,98052,47.6647,-122.083,2250,4050,2014-07-14


In [16]:
geodata = gpd.read_file("mapping/Zipcodes_for_King_County_and_Surrounding_Area___zipcode_area.shp")
geodata.head()

Unnamed: 0,OBJECTID,ZIP,ZIPCODE,COUNTY,ZIP_TYPE,Shape_Leng,Shape_Area,geometry
0,1,98031,98031,33,Standard,117508.232813,228012900.0,"POLYGON ((-122.21842 47.43750, -122.21935 47.4..."
1,2,98032,98032,33,Standard,166737.665152,482675400.0,"MULTIPOLYGON (((-122.24187 47.44122, -122.2411..."
2,3,98030,98030,33,Standard,94409.538568,200095400.0,"POLYGON ((-122.21006 47.38692, -122.21007 47.3..."
3,4,98029,98029,33,Standard,111093.715481,277424700.0,"POLYGON ((-121.97642 47.58430, -121.97645 47.5..."
4,5,98028,98028,33,Standard,71488.230747,199653100.0,"POLYGON ((-122.22788 47.76909, -122.22790 47.7..."


In [17]:
def getXYCoords(geometry, coord_type):
    """ Returns either x or y coordinates from  geometry coordinate sequence. Used with LineString and Polygon geometries."""
    if coord_type == 'x':
        return geometry.coords.xy[0]
    elif coord_type == 'y':
        return geometry.coords.xy[1]

def getPolyCoords(geometry, coord_type):
    """ Returns Coordinates of Polygon using the Exterior of the Polygon."""
    ext = geometry.exterior
    return getXYCoords(ext, coord_type)

def getLineCoords(geometry, coord_type):
    """ Returns Coordinates of Linestring object."""
    return getXYCoords(geometry, coord_type)

def getPointCoords(geometry, coord_type):
    """ Returns Coordinates of Point object."""
    if coord_type == 'x':
        return geometry.x
    elif coord_type == 'y':
        return geometry.y

def multiGeomHandler(multi_geometry, coord_type, geom_type):
    """
    Function for handling multi-geometries. Can be MultiPoint, MultiLineString or MultiPolygon.
    Returns a list of coordinates where all parts of Multi-geometries are merged into a single list.
    Individual geometries are separated with np.nan which is how Bokeh wants them.
    # Bokeh documentation regarding the Multi-geometry issues can be found here (it is an open issue)
    # https://github.com/bokeh/bokeh/issues/2321
    """
    for i, part in enumerate(multi_geometry):
        # On the first part of the Multi-geometry initialize the coord_array (np.array)
        if i == 0:
            if geom_type == "MultiPoint":
                coord_arrays = np.append(getPointCoords(part, coord_type), np.nan)
            elif geom_type == "MultiLineString":
                coord_arrays = np.append(getLineCoords(part, coord_type), np.nan)
            elif geom_type == "MultiPolygon":
                coord_arrays = np.append(getPolyCoords(part, coord_type), np.nan)
        else:
            if geom_type == "MultiPoint":
                coord_arrays = np.concatenate([coord_arrays, np.append(getPointCoords(part, coord_type), np.nan)])
            elif geom_type == "MultiLineString":
                coord_arrays = np.concatenate([coord_arrays, np.append(getLineCoords(part, coord_type), np.nan)])
            elif geom_type == "MultiPolygon":
                coord_arrays = np.concatenate([coord_arrays, np.append(getPolyCoords(part, coord_type), np.nan)])
    return coord_arrays


def getCoords(row, geom_col, coord_type):
    """
    Returns coordinates ('x' or 'y') of a geometry (Point, LineString or Polygon) as a list (if geometry is LineString or Polygon).
    Can handle also MultiGeometries.
    """
    # Get geometry and check the geometry type
    geom = row[geom_col]
    gtype = geom.geom_type

    # "Normal" geometries
    if gtype == "Point":
        return getPointCoords(geom, coord_type)
    elif gtype == "LineString":
        return list(getLineCoords(geom, coord_type))
    elif gtype == "Polygon":
        return list(getPolyCoords(geom, coord_type))

    # Multi geometries
    else:
        return list(multiGeomHandler(geom, coord_type, gtype))

# Calculate coordinates
geodata['x'] = geodata.apply(getCoords, geom_col='geometry', coord_type='x', axis=1)
geodata['y'] = geodata.apply(getCoords, geom_col='geometry', coord_type='y', axis=1)
geodata.head()

Unnamed: 0,OBJECTID,ZIP,ZIPCODE,COUNTY,ZIP_TYPE,Shape_Leng,Shape_Area,geometry,x,y
0,1,98031,98031,33,Standard,117508.232813,228012900.0,"POLYGON ((-122.21842 47.43750, -122.21935 47.4...","[-122.21842289814009, -122.2193499600688, -122...","[47.43750364721223, 47.433871635524056, 47.430..."
1,2,98032,98032,33,Standard,166737.665152,482675400.0,"MULTIPOLYGON (((-122.24187 47.44122, -122.2411...","[-122.24186949687606, -122.24112642701647, -12...","[47.44121579929307, 47.44121048428079, 47.4412..."
2,3,98030,98030,33,Standard,94409.538568,200095400.0,"POLYGON ((-122.21006 47.38692, -122.21007 47.3...","[-122.21005827504885, -122.21006983973646, -12...","[47.38691615903413, 47.38600499052471, 47.3851..."
3,4,98029,98029,33,Standard,111093.715481,277424700.0,"POLYGON ((-121.97642 47.58430, -121.97645 47.5...","[-121.97642224920017, -121.97644534284184, -12...","[47.58429567921104, 47.58417490690478, 47.5842..."
4,5,98028,98028,33,Standard,71488.230747,199653100.0,"POLYGON ((-122.22788 47.76909, -122.22790 47.7...","[-122.22787972551288, -122.22789689450347, -12...","[47.76909367603944, 47.76861505057568, 47.7670..."


In [18]:
# geo = geodata.drop('geometry', axis=1).copy()
# # geo_cds = GeoJSONDataSource(geojson=geodata.tojson())
# geo = geo.fillna('')
# df_cds = ColumnDataSource(data=df)
# TOOLS = "pan, wheel_zoom, box_zoom, reset, save"
# color_mapper = LogColorMapper(palette=RdYlBu11)
# output_notebook()
# plt = figure(title = "Title",
#              tools = TOOLS, 
#              plot_width = 950, 
#              plot_height = 600, 
#              active_scroll = "wheel_zoom")
# plt.xgrid.grid_line_color = None
# plt.ygrid.grid_line_color = None
# plt.patches('x', 'y', source = geo,
#                         fill_color = None,
#                         fill_alpha = 0.8,
#                         line_color = "gray",
#                         line_width = 0.03)
# show(plt)

In [19]:
# from bokeh.palettes import RdYlBu11 as palette
# from bokeh.models import LogColorMapper
# color_mapper = LogColorMapper(palette=palette)

# p = figure(title="Title")
# p.patches('x', 'y', source=geo,
#           fill_color={'field': 'pt_r_tt_ud', 'transform': color_mapper},
#           fill_alpha=1.0, line_color='black', line_width=0.05)
# p.circle('long', 'lat', source=df_cds, size=5)
# show(p)

In [20]:
zip_df = df.groupby('zipcode').agg(np.mean)
zip_df.reset_index(inplace=True)
zip_df.head()

Unnamed: 0,zipcode,id,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,sqft_above,sqft_basement,yr_built,yr_renovated,lat,long,sqft_living15,sqft_lot15
0,98001,4626220000.0,281998.8,3.387324,1.990317,1902.71831,15564.359155,1.396127,0.0,0.109155,3.323944,7.302817,1705.077465,197.640845,1979.90493,41.869718,47.309579,-122.270891,1820.056338,11337.359155
1,98002,4997224000.0,232286.5,3.305732,1.828025,1618.038217,7536.22293,1.318471,0.0,0.012739,3.738854,6.66879,1518.700637,99.33758,1966.33758,75.910828,47.307609,-122.213299,1460.649682,7758.611465
2,98003,4631569000.0,290762.7,3.375,2.066964,1939.125,10777.0625,1.310268,0.0,0.183036,3.361607,7.526786,1667.732143,271.392857,1977.125,26.700893,47.316625,-122.309804,1864.035714,9717.857143
3,98004,4291230000.0,1396883.0,3.85654,2.527426,2969.409283,13679.042194,1.432489,0.004219,0.312236,3.50211,8.776371,2458.14346,511.265823,1971.810127,202.594937,47.615932,-122.205561,2742.236287,13220.194093
4,98005,5096151000.0,808847.6,3.835714,2.417857,2679.235714,19172.15,1.264286,0.0,0.114286,3.7,8.5,2161.807143,517.428571,1969.492857,57.157143,47.611205,-122.167536,2559.578571,18203.107143


In [21]:
df['count'] = 1
count_zip = df.groupby('zipcode').sum()
count_zip.reset_index(inplace=True)
count_zip = count_zip[['zipcode', 'count']]
df.drop(['count'], axis = 1, inplace=True)
count_zip.head()

Unnamed: 0,zipcode,count
0,98001,284
1,98002,157
2,98003,224
3,98004,237
4,98005,140


In [29]:
zip_df = pd.merge(zip_df, count_zip, how='left', on=['zipcode'])
zip_df.head()

Unnamed: 0,zipcode,id,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,sqft_above,sqft_basement,yr_built,yr_renovated,lat,long,sqft_living15,sqft_lot15,count
0,98001,4626220000.0,281998.8,3.387324,1.990317,1902.71831,15564.359155,1.396127,0.0,0.109155,3.323944,7.302817,1705.077465,197.640845,1979.90493,41.869718,47.309579,-122.270891,1820.056338,11337.359155,284
1,98002,4997224000.0,232286.5,3.305732,1.828025,1618.038217,7536.22293,1.318471,0.0,0.012739,3.738854,6.66879,1518.700637,99.33758,1966.33758,75.910828,47.307609,-122.213299,1460.649682,7758.611465,157
2,98003,4631569000.0,290762.7,3.375,2.066964,1939.125,10777.0625,1.310268,0.0,0.183036,3.361607,7.526786,1667.732143,271.392857,1977.125,26.700893,47.316625,-122.309804,1864.035714,9717.857143,224
3,98004,4291230000.0,1396883.0,3.85654,2.527426,2969.409283,13679.042194,1.432489,0.004219,0.312236,3.50211,8.776371,2458.14346,511.265823,1971.810127,202.594937,47.615932,-122.205561,2742.236287,13220.194093,237
4,98005,5096151000.0,808847.6,3.835714,2.417857,2679.235714,19172.15,1.264286,0.0,0.114286,3.7,8.5,2161.807143,517.428571,1969.492857,57.157143,47.611205,-122.167536,2559.578571,18203.107143,140


In [25]:
# cmap = cm.LinearColormap(colors=['blue', 'yellow', 'red'], vmin=100000, vmax=1500000)
# m = folium.Map(location=[df.lat.mean(), df.long.mean()], zoom_start=10, tiles='stamenterrain')
# for i in range(len(df)):
#     folium.Circle(
#         location=[df.iloc[i]['lat'], df.iloc[i]['long']],
#         radius=10,
#         fill=True,
#         color=cmap(df.iloc[i]['price']),
#         fill_opacity=0.2
#     ).add_to(m)
# m.add_child(cmap)
# m.save('price_cmap.html')
# m

<img src="images/folium_circles.png">

In [26]:
# df['zipcode'] = df['zipcode'].astype('str')
# boundary_file = "mapping/Zipcodes_for_King_County_and_Surrounding_Area___zipcode_area.geojson"
# with open(boundary_file, 'r') as f:
#     zipcode_boundary = json.load(f)
# m = folium.Map(location=[df.lat.mean(), df.long.mean()], zoom_start=10, tiles='openstreetmap')
# zipcode_data = df.groupby('zipcode').aggregate(np.mean)
# zipcode_data.reset_index(inplace = True) 
# bins = list(zipcode_data['price'].quantile([0, 0.2, 0.4, 0.6, 0.8, 1]))
# folium.Choropleth(
#     geo_data=zipcode_boundary,
#     name='choropleth',
#     data=zipcode_data,
#     columns=['zipcode', 'price'],
#     key_on='feature.properties.ZIPCODE',
#     fill_color='Spectral',
#     fill_opacity=0.6,
#     nan_fill_opacity=0,
#     line_opacity=1,
#     bins=bins,
#     legend_name='Mean Price'
# ).add_to(m)
# m.save('zip_choropleth.html')
# m

<img src="images/folium_choropleth.png">

Since the data was preprocessed, this steps are more pro forma.  We did not expect to produce any duplicate or missing values from this dataset, as would be expected from more raw data, i.e. directly from Kaggle.  These would be necessary steps in the preprocessing stage of the data.

In [29]:
# Put features into categorical and continuous subsets
feat_cat = df[['view', 'condition', 'grade', 'waterfront', 'floors', 'bedrooms', 'bathrooms', 'zipcode']]
feat_con = df[['sqft_living', 'sqft_lot', 'sqft_above', 'sqft_basement', 'sqft_living15', 'sqft_lot15', 'yr_built', 'yr_renovated', 'lat', 'long']]

In [30]:
# Get indices of the subsetted columns to prepare for Seaborn visualizations
col_con = feat_con.columns
col_cat = feat_cat.columns

## Target Variable Visualization

In [30]:
# fig, ax = plt.subplots(figsize=(12, 4))
# ax = sns.boxplot(df.price)
# plt.savefig("df_target_2.png")

<img src="images/df_target_2.png">

**Observations:**

- We could use this boxplot to identify outliers, but there are potentially different ways we could approach outliers for our target variable.  But this is just to get a lay of the land.

## **Continuous Variable Visualizations**

In [31]:
## Display distribution plots of continuous variables using FacetGrid and distplot

# con_1 = pd.melt(df, value_vars = col_con)
# g = sns.FacetGrid(con_1, col='variable', col_wrap=3, sharex=False, sharey=False, height=4)
# g = g.map(sns.distplot, 'value', color='r')
# g.set_xticklabels(rotation=45)
# plt.savefig("images/df_distplot.png")


<img src="images/df_distplot.png">

**Observations:**

- `sqft_living`, `sqft_above`, and `sqft_living15` are skewed to the right, potentially use log transformation with skewed data to conform to normality
- `sqft_lot`, `sqft_lot15`, `sqft_basement`, and `yr_renovated` have a lot of zero values, maybe create a discrete binary variable for some of them

In [32]:
# Create scatterplots with regression line with regplot() of continuous variables

# con_2 = pd.melt(df, id_vars='price', value_vars=col_con)
# g = sns.FacetGrid(con_2, col='variable', col_wrap=3, sharex=False, sharey=False, height=4)
# g = g.map(sns.regplot, 'value', 'price', color='darkorange')
# g.set_xticklabels(rotation=45)
# plt.savefig('images/df_scatter.png')


<img src="images/df_regplot.png">

**Observations:**

- in the case of `yr_renovated`, with such disparate values between no renovations as 0 values and the years having values around 2000, best to consider this as a discrete variable rather than continuous
- `sqft_living` and `sqft_above` show the strongest correlation with `price`
- scatterplots allow you to identify outliers
- hard to see relationship of `lat`, `long`, and `yr_built` to `price`


## Categorical Variable Visualizations

In [33]:
# Use bar graphs of the distribution of data for categorical variables

# cat_1 = pd.melt(df, value_vars=col_cat)
# g = sns.FacetGrid(cat_1, col='variable', col_wrap=3, sharex=False, sharey=False, height=4)
# g = g.map(sns.countplot, 'value', color='g')
# g.set_xticklabels(rotation=90)
# plt.savefig("images/df_countplot.png")

<img src="images/df_countplot.png">

**Observations:**

- large number of zero values for `waterfont` and `view`
- `bedrooms` and `bathrooms` have right-skewed data

In [34]:
# Create scatterplots for categorical variables to observe any relationships

# cat_2 = pd.melt(df,id_vars='price', value_vars=col_cat)
# g = sns.FacetGrid(cat_2, col='variable', col_wrap=3, sharex=False, sharey=False, height=4)
# g = g.map(sns.regplot, 'value', 'price', color='dodgerblue')
# g.set_xticklabels(rotation=90)
# plt.savefig("images/df_regplot.png")


<img src="images/df_regplot.png">

**Observations:**

- stronger relationship: `bedrooms` vs. `price`, `grade` vs. `price`
- `waterfront` and `view` have correlation with `price`
- little relationship between `zipcode`, `condition`, and `floor`

In [35]:
# Display boxplots of categorical variables to observe any trends in the mean values of each category

# cat_3 = pd.melt(df, id_vars='price', value_vars=col_cat)
# g = sns.FacetGrid(cat_3, col='variable', col_wrap=3, sharex=False, sharey=False, height=4)
# g = g.map(sns.boxplot, 'value', 'price', color='mediumslateblue')
# g.set_xticklabels(rotation=90)
# plt.savefig("images/df_boxplot.png")

<img src="images/df_boxplot.png">

**Observations:**

- 33 bedrooms is an outlier, potentially replace with 3 bedrooms
- strong exponential relationship in the mean values for number of bathrooms and grade
- little correlation with mean views, perhaps express as binary variable
- potentially replace outliers with the mean for grade and bathrooms

In [38]:
df_zipcode = df.groupby(['zipcode']).price.median().sort_values(ascending=False)
# plt.figure(figsize=(20,10))
# plt.ylabel('median price')
# df_zipcode.plot(kind='bar')
# plt.savefig("images/zipcode.png")

<img src="images/zipcode.png">

In [47]:
output_notebook()
source = ColumnDataSource(df_zipcode)
TOOLTIPS = [
            ("Zip Code", "@zipcode"),
            ("Median Price", "@price{($0.00 a)"),
           ]
p = figure(plot_height=400, sizing_mode='scale_width', tooltips=TOOLTIPS)
p.vbar(x='zipcode', top='price', source=source, bottom=0, fill_color=linear_cmap('price', 'Category10_10', 0, max(df_zipcode.price)))
show(p)

## Looking at Correlations

In [36]:
# Correlation Matrix between all variables

# corr_matrix = df.corr()
# plt.figure(figsize=(16,12))
# sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', linecolor='black', linewidths=1.0, xticklabels=True, yticklabels=True)
# plt.show()


<img src="images/df_corr.png">

**Final Observations from EDA:**

1. Price has a strong correlation with `sqft_living` and `grade`
2. Price has medium correlation with `bedrooms`, `sqft_above`, `sqft_living15`
3. Price has low correlation with `bedrooms`, `floors`, `sqft_basement`, `latitude`
4. Price has no significant relationship with `sqft_lot`, `yr_built`, `long`, `sqft_lot15`

This will help in guiding our decision for initial feature selection in developing different models.


## **Data Cleaning**

**As data scientists, we undergo the different facets of the data cleaning process to ensure that our "dirty" data does not lead to any false conclusions. To ensure the validity, completeness, and consistency of the data, we make any necessary type conversions, remove any duplicate values and outliers, impute in any missing or anomalous values, perform any scaling or transformations to reduce skewness.**

In [37]:
# Reset view options
pd.set_option('display.max_rows', 20)

In [38]:
# Drop 'id' column and check dataframe
df.drop(['id'], inplace=True, axis=1)
df.head()

Unnamed: 0,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15,sale_date
0,365000.0,4,2.25,2070,8893,2.0,0,0,4,8,2070,0,1986,0,98058,47.4388,-122.162,2390,7700,2014-10-06
1,865000.0,5,3.0,2900,6730,1.0,0,0,5,8,1830,1070,1977,0,98115,47.6784,-122.285,2370,6283,2014-08-21
2,1038000.0,4,2.5,3770,10893,2.0,0,2,3,11,3770,0,1997,0,98006,47.5646,-122.129,3710,9685,2014-08-15
3,1490000.0,3,3.5,4560,14608,2.0,0,2,3,12,4560,0,1990,0,98034,47.6995,-122.228,4050,14226,2015-03-31
4,711000.0,3,2.5,2550,5376,2.0,0,0,3,9,2550,0,2004,0,98052,47.6647,-122.083,2250,4050,2014-07-14


**The `id` column is used to uniquely identify each property, but can be used in regression analysis.**

In [39]:
# Replace anomalous bedroom values and check values in column
df.replace({'bedrooms': {33: 3}}, inplace=True)
df.bedrooms.value_counts()

3     7865
4     5488
2     2204
5     1283
6      229
1      160
7       30
0       12
8       10
9        5
10       3
11       1
Name: bedrooms, dtype: int64

**Upon investigation, it was highly likely that the value was recorded incorrectly and would seem more in alignment with properties with 3 bedrooms rather than 33 bedrooms, which itself is extremely anomalous.**

In [40]:
df[df.bathrooms==0].sort_values('price', ascending=False)

Unnamed: 0,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15,sale_date
9286,1295650.0,0,0.0,4810,28008,2.0,0,0,3,12,4810,0,1990,0,98053,47.6642,-122.069,4740,35061,2014-06-24
1120,1095000.0,0,0.0,3064,4764,3.5,0,2,3,7,3064,0,1990,0,98102,47.6362,-122.322,2360,4000,2014-06-12
12982,484000.0,1,0.0,690,23244,1.0,0,0,4,7,690,0,1948,0,98053,47.6429,-121.955,1690,19290,2014-09-18
5424,380000.0,0,0.0,1470,979,3.0,0,2,3,8,1470,0,2006,0,98133,47.7145,-122.356,1470,1399,2015-02-05
483,355000.0,0,0.0,2460,8049,2.0,0,0,3,8,2460,0,1990,0,98031,47.4095,-122.168,2520,8050,2015-04-29
3032,235000.0,0,0.0,1470,4800,2.0,0,0,3,7,1470,0,1996,0,98065,47.5265,-121.828,1060,7200,2014-12-23
10067,142000.0,0,0.0,290,20875,1.0,0,0,1,1,290,0,1963,0,98024,47.5308,-121.888,1620,22850,2014-09-26
9060,75000.0,1,0.0,670,43377,1.0,0,0,3,3,670,0,1966,0,98022,47.2638,-121.906,1160,42882,2015-02-17


**Since we cannot remove any data values from the dataset, I would replace the zero bathroom values with 0.25 since that is the bare minimum for any property.  While properties can have zero bedrooms like studios, but properties must have a bathroom.  Since we have already demonstrated that it has a strong correlation with price, we need to impute some value for the zero values.  What constitutes a quarter bathroom is a bathroom with either a sink, a shower, toilet, or bathtub.  For example, a hallway shower or fresh-up room with single sink would be a 0.25 bathroom.**

In [41]:
df.replace({'bathrooms': {0: 0.25}}, inplace=True)
df.bathrooms.value_counts()

2.50    4322
1.00    3100
1.75    2431
2.25    1666
2.00    1549
        ... 
6.75       2
6.25       2
7.50       1
6.50       1
7.75       1
Name: bathrooms, Length: 30, dtype: int64

In [42]:
df[df.grade==11].sort_values('price', ascending=False)

Unnamed: 0,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15,sale_date
6903,7062500.0,5,4.50,10040,37325,2.0,1,2,3,11,7680,2360,1940,2001,98004,47.6500,-122.214,3930,25449,2014-06-11
3903,3850000.0,4,4.25,5770,21300,2.0,1,4,4,11,5770,0,1980,0,98040,47.5850,-122.222,4620,22748,2014-11-14
260,3650000.0,6,4.75,5480,19401,1.5,1,4,5,11,3910,1570,1936,0,98105,47.6515,-122.277,3510,15810,2015-04-21
1020,3640900.0,4,3.25,4830,22257,2.0,1,4,4,11,4830,0,1990,0,98039,47.6409,-122.241,3820,25582,2014-09-11
10286,3418800.0,5,5.00,5450,20412,2.0,0,0,3,11,5450,0,2014,0,98039,47.6209,-122.237,3160,17825,2014-10-07
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8344,635000.0,5,3.50,4150,13232,2.0,0,0,3,11,4150,0,2006,0,98003,47.3417,-122.182,3840,15121,2015-02-06
13984,633000.0,5,2.75,3630,30570,2.0,0,0,3,11,3630,0,2000,0,98058,47.4243,-122.097,3620,41965,2014-12-19
2905,575000.0,4,2.50,4620,20793,2.0,0,0,4,11,4620,0,1991,0,98023,47.2929,-122.342,3640,20793,2014-06-24
5668,556000.0,5,2.50,3840,16905,2.0,0,0,3,11,3840,0,1991,0,98023,47.2996,-122.342,3270,12133,2014-05-23


**If we investigate this top outlier that we identified through our data visualization, it immediately becomes apparent that the increase in `price` can be explained by the combination of any of the following: increase in `sqft_living`, `sqft_lot`, and `sqft_basement`, and the recent renovation.  Based on this information, it would not be considered an outlier based on that level of `grade`.**

In [43]:
df[df.bathrooms==4.5].sort_values('price', ascending=False)

Unnamed: 0,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15,sale_date
6903,7062500.0,5,4.5,10040,37325,2.0,1,2,3,11,7680,2360,1940,2001,98004,47.6500,-122.214,3930,25449,2014-06-11
7823,3567000.0,5,4.5,4850,10584,2.0,1,4,3,10,3540,1310,2007,0,98008,47.5943,-122.110,3470,18270,2015-01-07
5293,3200000.0,7,4.5,6210,8856,2.5,0,2,5,11,4760,1450,1910,0,98109,47.6307,-122.354,2940,5400,2014-05-07
16000,2945000.0,5,4.5,4340,5722,3.0,0,4,3,10,4340,0,2010,0,98107,47.6715,-122.406,1770,5250,2015-03-04
3181,2600000.0,4,4.5,5270,12195,2.0,1,4,3,11,3400,1870,1979,0,98027,47.5696,-122.090,3390,9905,2014-12-16
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1835,482500.0,6,4.5,2940,7500,1.5,0,0,4,8,2940,0,1966,0,98034,47.7208,-122.182,2010,7500,2014-09-05
14681,460000.0,5,4.5,3100,7260,2.0,0,0,3,8,3100,0,1963,2000,98059,47.5004,-122.162,1650,7700,2015-02-25
13472,389000.0,6,4.5,3560,14010,2.0,0,0,3,7,3560,0,1989,0,98002,47.3244,-122.217,1710,11116,2015-05-06
9118,350000.0,6,4.5,3500,8504,2.0,0,0,3,7,3500,0,1980,0,98155,47.7351,-122.295,1550,8460,2014-09-17


**If we pick another outlier to explore like the apparent high housing price value in the category of 4.5 bathrooms, once again, there is a substantive increase in `sqft_living` and `sqft_lot`, plus `renovated`. Based on analyzing these two values, I have decided not to change any apparent outliers.**

# Feature Engineering

The goal of feature engineering is to prepare the data for the machine learning algorithms and to improve the performance of the models. The different techniques involved include creating dummy variables for our categorical variables, perform any transformations or scaling, binning, or any new features based on your EDA analysis.

Log transformations consist of taking the log of each observation, where the back transformation is raise 10 or 3 to the power of the number. In the instance where different indepedent factors relate in the following fashion, (factor)*(factor)*(factor)*(factor), i.e. a multiplicative process, the function is log-normal. In biology, size data would be representative of such a dataset, i.e. height of trees.  It is used to handle skewed data in creating a more normal distribution, as well as decreasing the effect of outliers.

Square root transformations consists of taking the sqaure root of each observation, where the back transformation is to square the number.  Count data, like bacteria in a petri dish or mutations in a population, can be transformed for a more normal distributions.  Reciprocal transformations.... 

Scaling techniques normalize the scale of various indepedent variables, which include mean normalization, standardization, robust scaling (median and IQR), min-max scaling, for certain algorithms that are particularly sensitive to feature magnitude. Variables with a greater magnitude often dominate over those with a smaller magnitude range, and the scale directly influences the regression coefficient.

In [40]:
# Create new feature to incorporate built and renovation year
df['sale_age'] = df.sale_date.dt.year - df[['yr_built', 'yr_renovated']].max(axis=1)
# Reset display options set in the beginning
pd.set_option('display.max_rows', 20)
# Look for anomalous values
df.sale_age.value_counts(ascending=False)

 0      418
 9      385
 11     375
 10     372
 8      370
       ... 
 112     22
 115     17
 81      15
 80      12
-1       10
Name: sale_age, Length: 117, dtype: int64

In [41]:
# Replace anomalous values
df.replace({'sale_age': {-1: 0}}, inplace=True)
df.sale_age.value_counts()

0      428
9      385
11     375
10     372
8      370
      ... 
113     23
112     22
115     17
81      15
80      12
Name: sale_age, Length: 116, dtype: int64

In [42]:
# Create binary variable for whether there has been a renovation, has a bathroom, and has been viewed
df['renovated'] = df.yr_renovated.apply(lambda x: x if x==0 else 1)
df['basement'] = df.sqft_basement.apply(lambda x: x if x==0 else 1)
df['viewed'] = df.view.apply(lambda x: x if x==0 else 1)
# Drop original columms as well as the sale_date columns since it is in datetime format
df.drop(['yr_built', 'yr_renovated', 'sale_date', 'sqft_basement', 'view'], inplace=True, axis=1)

In [43]:
# Check for any anomalous values
print(df.basement.value_counts())
print(df.viewed.value_counts())
print(df.renovated.value_counts())

0    10484
1     6806
Name: basement, dtype: int64
0    15571
1     1719
Name: viewed, dtype: int64
0    16564
1      726
Name: renovated, dtype: int64


# Statistical Tests

We conduct our statistical tests to provide answers for our preliminary and exploratory questions about what features affect housing values.  The goal for conducting the regression models is to find what features best predict housing values.

### Question 1: Do renovated properties have a higher selling price than unrenovated properties?

To answer this question, we conduct a Welch's T-test which does not assume equal population variance to compare the means of two independent sample populations, which in this case is the mean selling price for renovated vs. unrenovated properties.

**Difference of Two Means**

$$H_o: \mu_1 = \mu_2$$  

The null hypothesis is that there is no statistically significant difference between the housing price means of the two groups, renovated and not renovated. 

$$H_a: \mu_1 \neq \mu_2$$  

The alternate hypothesis is that there is statistically significant difference between the housing price means of the two groups.



In [44]:
renovated = df[df.renovated==1]
not_renovated = df[df.renovated==0]
p_value = stats.ttest_ind(renovated.price, not_renovated.price, equal_var=False)[1]
print("P-value for T-Test: ", p_value)
if p_value < 0.05:
    print('We reject the null hypothesis, and the sample populations are statistical different. Price is correlated with whether the property is renovated or not.')
else:
    print('We do not reject the null hypothesis')

P-value for T-Test:  6.478917377975333e-20
We reject the null hypothesis, and the sample populations are statistical different. Price is correlated with whether the property is renovated or not.


### Question 2: Does whether or not a property has been viewed have any effect on selling price?

To answer this question, we also conduct a Welch's T-test to compare the means of the sample populations of viewed and not viewed.

**Difference of Two Means**

$$H_o: \mu_1 = \mu_2$$  

The null hypothesis is that there is no statistically significant difference between the housing price means of the two groups, viewed and not viewed. 

$$H_a: \mu_1 \neq \mu_2$$  

The alternate hypothesis is that there is statistically significant difference between the housing price means of the two groups.

In [45]:
viewed = df[df.viewed==1]
not_viewed = df[df.viewed==0]
p_value = stats.ttest_ind(viewed.price, not_viewed.price, equal_var=False)[1]
print("P-value for T-Test: ", p_value)
if p_value < 0.05:
    print('We reject the null hypothesis, and the sample populations are statistical different. Price is correlated with whether the property has been viewed or not.')
else:
    print('Do Not Reject Null Hypothesis')

P-value for T-Test:  2.784996317762731e-131
We reject the null hypothesis, and the sample populations are statistical different. Price is correlated with whether the property has been viewed or not.


### Question 3: Does the grade given to the housing unit have an overall effect on the selling price?

**One-Way ANOVA for Variance Between Multiple Means**

$$H_o : \mu_1 = \mu_2 = \mu_3 = \mu_4 = \mu_5$$

Null Hypothesis is that there is no statistically significant difference between the housing price means between the different grades.

$$H_a : \mu_1 \neq \mu_2 \text{ or } \mu_2 \neq \mu_3 \text{ or } \mu_1 \neq \mu_3...$$

Alternative Hypothesis is that there is statistically significant difference of the housing price means of at least one of the grades.

In [46]:
formula = 'price~grade'
lm_condition = ols(formula, df).fit()
anova_condition = sm.stats.anova_lm(lm_condition, type=2)
print('F-stat Probability: ', anova_condition["PR(>F)"][0])
if anova_condition['PR(>F)'][0] < 0.05:
    print("We reject the null hypothesis, and at least one of the sample populations is statistically different. Price is correlated with at least one of the grade categories.")
else:
    print("Do Not Reject Null Hypothesis")

F-stat Probability:  0.0
We reject the null hypothesis, and at least one of the sample populations is statistically different. Price is correlated with at least one of the grade categories.


# Feature Engineering Continued

## One-Hot Encoding/Dummy Variables

Creating dummy variables allow us to input categorical variables into the Machine Learning models, which require that all input data be numerical. Here, they are numerical, but take on discrete numerical values, so we consider them as categorical.  Dummy variables only take on the value of 0 or 1 for the absence or presence of some aspect of the category that is expected to effect the outcome. We did not create polynomial and interaction features for dummy variables since the values are only 0 and 1.


In [47]:
# Get index of the columns
df.columns

Index(['id', 'price', 'bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot',
       'floors', 'waterfront', 'condition', 'grade', 'sqft_above', 'zipcode',
       'lat', 'long', 'sqft_living15', 'sqft_lot15', 'sale_age', 'renovated',
       'basement', 'viewed'],
      dtype='object')

In [48]:
# Grab indices of columns for creating dummy variables and create dataframe with dummy variables
dum_feat = df[['bedrooms', 'bathrooms', 'floors', 'condition', 'grade', 'zipcode']]
dum_index = dum_feat.columns
# To prevent what they call the dummy variable trap (related to multicollinearity), drop one of the dummy variable, as well as  the original categorical variable used in creating the dummy variables
df_dum = pd.get_dummies(data=dum_feat, columns=dum_index, drop_first=True, prefix=['bdr', 'bth', 'flr', 'cnd', 'grd', 'zip'])
df_dum.head()

Unnamed: 0,bdr_1,bdr_2,bdr_3,bdr_4,bdr_5,bdr_6,bdr_7,bdr_8,bdr_9,bdr_10,bdr_11,bdr_33,bth_0.5,bth_0.75,bth_1.0,bth_1.25,bth_1.5,bth_1.75,bth_2.0,bth_2.25,bth_2.5,bth_2.75,bth_3.0,bth_3.25,bth_3.5,bth_3.75,bth_4.0,bth_4.25,bth_4.5,bth_4.75,bth_5.0,bth_5.25,bth_5.5,bth_5.75,bth_6.0,bth_6.25,bth_6.5,bth_6.75,bth_7.5,bth_7.75,bth_8.0,flr_1.5,flr_2.0,flr_2.5,flr_3.0,flr_3.5,cnd_2,cnd_3,cnd_4,cnd_5,grd_3,grd_4,grd_5,grd_6,grd_7,grd_8,grd_9,grd_10,grd_11,grd_12,grd_13,zip_98002,zip_98003,zip_98004,zip_98005,zip_98006,zip_98007,zip_98008,zip_98010,zip_98011,zip_98014,zip_98019,zip_98022,zip_98023,zip_98024,zip_98027,zip_98028,zip_98029,zip_98030,zip_98031,zip_98032,zip_98033,zip_98034,zip_98038,zip_98039,zip_98040,zip_98042,zip_98045,zip_98052,zip_98053,zip_98055,zip_98056,zip_98058,zip_98059,zip_98065,zip_98070,zip_98072,zip_98074,zip_98075,zip_98077,zip_98092,zip_98102,zip_98103,zip_98105,zip_98106,zip_98107,zip_98108,zip_98109,zip_98112,zip_98115,zip_98116,zip_98117,zip_98118,zip_98119,zip_98122,zip_98125,zip_98126,zip_98133,zip_98136,zip_98144,zip_98146,zip_98148,zip_98155,zip_98166,zip_98168,zip_98177,zip_98178,zip_98188,zip_98198,zip_98199
0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


## Polynomial and Interaction Features

Polynomial features are created by raising our exisitng features by some exponent, generally not greater than 3 or 4.  Adding polynomial features helps the regression models to recognize nonlinear patterns. For instance, age is related to price in more of a parabolic function due to the higher premium placed on brand new constructions vs. vintage or historic homes, which are on opposite ends of the age spectrum.

Interaction features, however, are represented by one variable or feature multipled by another feature. The idea here is that feature A's effect on C depend on the differing values of feature B.  Let's say C is plant growth, feature A is the amount of bacteria and feature B is the amount of sunlight.  In low amounts of sunlight, a high amount of bacteria in the soil creates tall plants, let's say, but in high amounts of sunlight, that same amount of bacteria creates short plants.  Only an interaction feature would be able to express that relationship.


In [49]:
# Grab columns for polynominal and interaction features from the original dataframe without dummy variables
poly_feat = df.drop('price', axis=1)
target = df['price']
# Use PolynomialFeatures to create binomial and interaction features
poly = PolynomialFeatures(degree=2, include_bias=False)
poly_data = poly.fit_transform(poly_feat)
poly_columns = poly.get_feature_names(poly_feat.columns)
df_poly = pd.DataFrame(poly_data, columns=poly_columns)
df_poly.head()


Unnamed: 0,id,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,condition,grade,sqft_above,zipcode,lat,long,sqft_living15,sqft_lot15,sale_age,renovated,basement,viewed,id^2,id bedrooms,id bathrooms,id sqft_living,id sqft_lot,id floors,id waterfront,id condition,id grade,id sqft_above,id zipcode,id lat,id long,id sqft_living15,id sqft_lot15,id sale_age,id renovated,id basement,id viewed,bedrooms^2,bedrooms bathrooms,bedrooms sqft_living,bedrooms sqft_lot,bedrooms floors,bedrooms waterfront,bedrooms condition,bedrooms grade,bedrooms sqft_above,bedrooms zipcode,bedrooms lat,bedrooms long,bedrooms sqft_living15,bedrooms sqft_lot15,bedrooms sale_age,bedrooms renovated,bedrooms basement,bedrooms viewed,bathrooms^2,bathrooms sqft_living,bathrooms sqft_lot,bathrooms floors,bathrooms waterfront,bathrooms condition,bathrooms grade,bathrooms sqft_above,bathrooms zipcode,bathrooms lat,bathrooms long,bathrooms sqft_living15,bathrooms sqft_lot15,bathrooms sale_age,bathrooms renovated,bathrooms basement,bathrooms viewed,sqft_living^2,sqft_living sqft_lot,sqft_living floors,sqft_living waterfront,sqft_living condition,sqft_living grade,sqft_living sqft_above,sqft_living zipcode,sqft_living lat,sqft_living long,sqft_living sqft_living15,sqft_living sqft_lot15,sqft_living sale_age,sqft_living renovated,sqft_living basement,sqft_living viewed,sqft_lot^2,sqft_lot floors,sqft_lot waterfront,sqft_lot condition,sqft_lot grade,sqft_lot sqft_above,sqft_lot zipcode,sqft_lot lat,sqft_lot long,sqft_lot sqft_living15,sqft_lot sqft_lot15,...,floors zipcode,floors lat,floors long,floors sqft_living15,floors sqft_lot15,floors sale_age,floors renovated,floors basement,floors viewed,waterfront^2,waterfront condition,waterfront grade,waterfront sqft_above,waterfront zipcode,waterfront lat,waterfront long,waterfront sqft_living15,waterfront sqft_lot15,waterfront sale_age,waterfront renovated,waterfront basement,waterfront viewed,condition^2,condition grade,condition sqft_above,condition zipcode,condition lat,condition long,condition sqft_living15,condition sqft_lot15,condition sale_age,condition renovated,condition basement,condition viewed,grade^2,grade sqft_above,grade zipcode,grade lat,grade long,grade sqft_living15,grade sqft_lot15,grade sale_age,grade renovated,grade basement,grade viewed,sqft_above^2,sqft_above zipcode,sqft_above lat,sqft_above long,sqft_above sqft_living15,sqft_above sqft_lot15,sqft_above sale_age,sqft_above renovated,sqft_above basement,sqft_above viewed,zipcode^2,zipcode lat,zipcode long,zipcode sqft_living15,zipcode sqft_lot15,zipcode sale_age,zipcode renovated,zipcode basement,zipcode viewed,lat^2,lat long,lat sqft_living15,lat sqft_lot15,lat sale_age,lat renovated,lat basement,lat viewed,long^2,long sqft_living15,long sqft_lot15,long sale_age,long renovated,long basement,long viewed,sqft_living15^2,sqft_living15 sqft_lot15,sqft_living15 sale_age,sqft_living15 renovated,sqft_living15 basement,sqft_living15 viewed,sqft_lot15^2,sqft_lot15 sale_age,sqft_lot15 renovated,sqft_lot15 basement,sqft_lot15 viewed,sale_age^2,sale_age renovated,sale_age basement,sale_age viewed,renovated^2,renovated basement,renovated viewed,basement^2,basement viewed,viewed^2
0,2591820000.0,4.0,2.25,2070.0,8893.0,2.0,0.0,4.0,8.0,2070.0,98058.0,47.4388,-122.162,2390.0,7700.0,28.0,0.0,0.0,0.0,6.717533e+18,10367280000.0,5831596000.0,5365068000000.0,23049060000000.0,5183641000.0,0.0,10367280000.0,20734560000.0,5365068000000.0,254148700000000.0,122952800000.0,-316622000000.0,6194451000000.0,19957020000000.0,72570970000.0,0.0,0.0,0.0,16.0,9.0,8280.0,35572.0,8.0,0.0,16.0,32.0,8280.0,392232.0,189.7552,-488.648,9560.0,30800.0,112.0,0.0,0.0,0.0,5.0625,4657.5,20009.25,4.5,0.0,9.0,18.0,4657.5,220630.5,106.7373,-274.8645,5377.5,17325.0,63.0,0.0,0.0,0.0,4284900.0,18408510.0,4140.0,0.0,8280.0,16560.0,4284900.0,202980060.0,98198.316,-252875.34,4947300.0,15939000.0,57960.0,0.0,0.0,0.0,79085449.0,17786.0,0.0,35572.0,71144.0,18408510.0,872029800.0,421873.2484,-1086386.666,21254270.0,68476100.0,...,196116.0,94.8776,-244.324,4780.0,15400.0,56.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.0,0.0,0.0,0.0,0.0,0.0,0.0,16.0,32.0,8280.0,392232.0,189.7552,-488.648,9560.0,30800.0,112.0,0.0,0.0,0.0,64.0,16560.0,784464.0,379.5104,-977.296,19120.0,61600.0,224.0,0.0,0.0,0.0,4284900.0,202980060.0,98198.316,-252875.34,4947300.0,15939000.0,57960.0,0.0,0.0,0.0,9615371000.0,4651754.0,-11978960.0,234358620.0,755046600.0,2745624.0,0.0,0.0,0.0,2250.439745,-5795.218686,113378.732,365278.76,1328.2864,0.0,0.0,0.0,14923.554244,-291967.18,-940647.4,-3420.536,-0.0,-0.0,-0.0,5712100.0,18403000.0,66920.0,0.0,0.0,0.0,59290000.0,215600.0,0.0,0.0,0.0,784.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,7974201000.0,5.0,3.0,2900.0,6730.0,1.0,0.0,5.0,8.0,1830.0,98115.0,47.6784,-122.285,2370.0,6283.0,37.0,0.0,1.0,0.0,6.358788e+19,39871000000.0,23922600000.0,23125180000000.0,53666370000000.0,7974201000.0,0.0,39871000000.0,63793610000.0,14592790000000.0,782388700000000.0,380197100000.0,-975125100000.0,18898860000000.0,50101900000000.0,295045400000.0,0.0,7974201000.0,0.0,25.0,15.0,14500.0,33650.0,5.0,0.0,25.0,40.0,9150.0,490575.0,238.392,-611.425,11850.0,31415.0,185.0,0.0,5.0,0.0,9.0,8700.0,20190.0,3.0,0.0,15.0,24.0,5490.0,294345.0,143.0352,-366.855,7110.0,18849.0,111.0,0.0,3.0,0.0,8410000.0,19517000.0,2900.0,0.0,14500.0,23200.0,5307000.0,284533500.0,138267.36,-354626.5,6873000.0,18220700.0,107300.0,0.0,2900.0,0.0,45292900.0,6730.0,0.0,33650.0,53840.0,12315900.0,660314000.0,320875.632,-822978.05,15950100.0,42284590.0,...,98115.0,47.6784,-122.285,2370.0,6283.0,37.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.0,0.0,0.0,0.0,0.0,0.0,0.0,25.0,40.0,9150.0,490575.0,238.392,-611.425,11850.0,31415.0,185.0,0.0,5.0,0.0,64.0,14640.0,784920.0,381.4272,-978.28,18960.0,50264.0,296.0,0.0,8.0,0.0,3348900.0,179550450.0,87251.472,-223781.55,4337100.0,11497890.0,67710.0,0.0,1830.0,0.0,9626553000.0,4677966.0,-11997990.0,232532550.0,616456500.0,3630255.0,0.0,98115.0,0.0,2273.229827,-5830.353144,112997.808,299563.3872,1764.1008,0.0,47.6784,0.0,14953.621225,-289815.45,-768316.655,-4524.545,-0.0,-122.285,-0.0,5616900.0,14890710.0,87690.0,0.0,2370.0,0.0,39476089.0,232471.0,0.0,6283.0,0.0,1369.0,0.0,37.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
2,7701450000.0,4.0,2.5,3770.0,10893.0,2.0,0.0,3.0,11.0,3770.0,98006.0,47.5646,-122.129,3710.0,9685.0,17.0,0.0,0.0,1.0,5.931233e+19,30805800000.0,19253630000.0,29034470000000.0,83891900000000.0,15402900000.0,0.0,23104350000.0,84715950000.0,29034470000000.0,754788300000000.0,366316400000.0,-940570400000.0,28572380000000.0,74588540000000.0,130924700000.0,0.0,0.0,7701450000.0,16.0,10.0,15080.0,43572.0,8.0,0.0,12.0,44.0,15080.0,392024.0,190.2584,-488.516,14840.0,38740.0,68.0,0.0,0.0,4.0,6.25,9425.0,27232.5,5.0,0.0,7.5,27.5,9425.0,245015.0,118.9115,-305.3225,9275.0,24212.5,42.5,0.0,0.0,2.5,14212900.0,41066610.0,7540.0,0.0,11310.0,41470.0,14212900.0,369482620.0,179318.542,-460426.33,13986700.0,36512450.0,64090.0,0.0,0.0,3770.0,118657449.0,21786.0,0.0,32679.0,119823.0,41066610.0,1067579000.0,518121.1878,-1330351.197,40413030.0,105498705.0,...,196012.0,95.1292,-244.258,7420.0,19370.0,34.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.0,0.0,0.0,0.0,0.0,0.0,0.0,9.0,33.0,11310.0,294018.0,142.6938,-366.387,11130.0,29055.0,51.0,0.0,0.0,3.0,121.0,41470.0,1078066.0,523.2106,-1343.419,40810.0,106535.0,187.0,0.0,0.0,11.0,14212900.0,369482620.0,179318.542,-460426.33,13986700.0,36512450.0,64090.0,0.0,0.0,3770.0,9605176000.0,4661616.0,-11969370.0,363602260.0,949188100.0,1666102.0,0.0,0.0,98006.0,2262.391173,-5809.017033,176464.666,460663.151,808.5982,0.0,0.0,47.5646,14915.492641,-453098.59,-1182819.365,-2076.193,-0.0,-0.0,-122.129,13764100.0,35931350.0,63070.0,0.0,0.0,3710.0,93799225.0,164645.0,0.0,0.0,9685.0,289.0,0.0,0.0,17.0,0.0,0.0,0.0,0.0,0.0,1.0
3,9522300000.0,3.0,3.5,4560.0,14608.0,2.0,0.0,3.0,12.0,4560.0,98034.0,47.6995,-122.228,4050.0,14226.0,25.0,0.0,0.0,1.0,9.06742e+19,28566900000.0,33328050000.0,43421690000000.0,139101800000000.0,19044600000.0,0.0,28566900000.0,114267600000.0,43421690000000.0,933509200000000.0,454208900000.0,-1163892000000.0,38565320000000.0,135464200000000.0,238057500000.0,0.0,0.0,9522300000.0,9.0,10.5,13680.0,43824.0,6.0,0.0,9.0,36.0,13680.0,294102.0,143.0985,-366.684,12150.0,42678.0,75.0,0.0,0.0,3.0,12.25,15960.0,51128.0,7.0,0.0,10.5,42.0,15960.0,343119.0,166.94825,-427.798,14175.0,49791.0,87.5,0.0,0.0,3.5,20793600.0,66612480.0,9120.0,0.0,13680.0,54720.0,20793600.0,447035040.0,217509.72,-557359.68,18468000.0,64870560.0,114000.0,0.0,0.0,4560.0,213393664.0,29216.0,0.0,43824.0,175296.0,66612480.0,1432081000.0,696794.296,-1785506.624,59162400.0,207813408.0,...,196068.0,95.399,-244.456,8100.0,28452.0,50.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.0,0.0,0.0,0.0,0.0,0.0,0.0,9.0,36.0,13680.0,294102.0,143.0985,-366.684,12150.0,42678.0,75.0,0.0,0.0,3.0,144.0,54720.0,1176408.0,572.394,-1466.736,48600.0,170712.0,300.0,0.0,0.0,12.0,20793600.0,447035040.0,217509.72,-557359.68,18468000.0,64870560.0,114000.0,0.0,0.0,4560.0,9610665000.0,4676173.0,-11982500.0,397037700.0,1394632000.0,2450850.0,0.0,0.0,98034.0,2275.2423,-5830.214486,193182.975,678573.087,1192.4875,0.0,0.0,47.6995,14939.683984,-495023.4,-1738815.528,-3055.7,-0.0,-0.0,-122.228,16402500.0,57615300.0,101250.0,0.0,0.0,4050.0,202379076.0,355650.0,0.0,0.0,14226.0,625.0,0.0,0.0,25.0,0.0,0.0,0.0,0.0,0.0,1.0
4,9510861000.0,3.0,2.5,2550.0,5376.0,2.0,0.0,3.0,9.0,2550.0,98052.0,47.6647,-122.083,2250.0,4050.0,10.0,0.0,0.0,0.0,9.045648e+19,28532580000.0,23777150000.0,24252700000000.0,51130390000000.0,19021720000.0,0.0,28532580000.0,85597750000.0,24252700000000.0,932559000000000.0,453332300000.0,-1161114000000.0,21399440000000.0,38518990000000.0,95108610000.0,0.0,0.0,0.0,9.0,7.5,7650.0,16128.0,6.0,0.0,9.0,27.0,7650.0,294156.0,142.9941,-366.249,6750.0,12150.0,30.0,0.0,0.0,0.0,6.25,6375.0,13440.0,5.0,0.0,7.5,22.5,6375.0,245130.0,119.16175,-305.2075,5625.0,10125.0,25.0,0.0,0.0,0.0,6502500.0,13708800.0,5100.0,0.0,7650.0,22950.0,6502500.0,250032600.0,121544.985,-311311.65,5737500.0,10327500.0,25500.0,0.0,0.0,0.0,28901376.0,10752.0,0.0,16128.0,48384.0,13708800.0,527127600.0,256245.4272,-656318.208,12096000.0,21772800.0,...,196104.0,95.3294,-244.166,4500.0,8100.0,20.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.0,0.0,0.0,0.0,0.0,0.0,0.0,9.0,27.0,7650.0,294156.0,142.9941,-366.249,6750.0,12150.0,30.0,0.0,0.0,0.0,81.0,22950.0,882468.0,428.9823,-1098.747,20250.0,36450.0,90.0,0.0,0.0,0.0,6502500.0,250032600.0,121544.985,-311311.65,5737500.0,10327500.0,25500.0,0.0,0.0,0.0,9614195000.0,4673619.0,-11970480.0,220617000.0,397110600.0,980520.0,0.0,0.0,0.0,2271.923626,-5819.04957,107245.575,193042.035,476.647,0.0,0.0,0.0,14904.258889,-274686.75,-494436.15,-1220.83,-0.0,-0.0,-0.0,5062500.0,9112500.0,22500.0,0.0,0.0,0.0,16402500.0,40500.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [50]:
# Concatenating two dataframes together for input into linear regression model
df_model = pd.concat([df_poly, df_dum], axis=1)
df_model

Unnamed: 0,id,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,condition,grade,sqft_above,zipcode,lat,long,sqft_living15,sqft_lot15,sale_age,renovated,basement,viewed,id^2,id bedrooms,id bathrooms,id sqft_living,id sqft_lot,id floors,id waterfront,id condition,id grade,id sqft_above,id zipcode,id lat,id long,id sqft_living15,id sqft_lot15,id sale_age,id renovated,id basement,id viewed,bedrooms^2,bedrooms bathrooms,bedrooms sqft_living,bedrooms sqft_lot,bedrooms floors,bedrooms waterfront,bedrooms condition,bedrooms grade,bedrooms sqft_above,bedrooms zipcode,bedrooms lat,bedrooms long,bedrooms sqft_living15,bedrooms sqft_lot15,bedrooms sale_age,bedrooms renovated,bedrooms basement,bedrooms viewed,bathrooms^2,bathrooms sqft_living,bathrooms sqft_lot,bathrooms floors,bathrooms waterfront,bathrooms condition,bathrooms grade,bathrooms sqft_above,bathrooms zipcode,bathrooms lat,bathrooms long,bathrooms sqft_living15,bathrooms sqft_lot15,bathrooms sale_age,bathrooms renovated,bathrooms basement,bathrooms viewed,sqft_living^2,sqft_living sqft_lot,sqft_living floors,sqft_living waterfront,sqft_living condition,sqft_living grade,sqft_living sqft_above,sqft_living zipcode,sqft_living lat,sqft_living long,sqft_living sqft_living15,sqft_living sqft_lot15,sqft_living sale_age,sqft_living renovated,sqft_living basement,sqft_living viewed,sqft_lot^2,sqft_lot floors,sqft_lot waterfront,sqft_lot condition,sqft_lot grade,sqft_lot sqft_above,sqft_lot zipcode,sqft_lot lat,sqft_lot long,sqft_lot sqft_living15,sqft_lot sqft_lot15,...,bth_5.0,bth_5.25,bth_5.5,bth_5.75,bth_6.0,bth_6.25,bth_6.5,bth_6.75,bth_7.5,bth_7.75,bth_8.0,flr_1.5,flr_2.0,flr_2.5,flr_3.0,flr_3.5,cnd_2,cnd_3,cnd_4,cnd_5,grd_3,grd_4,grd_5,grd_6,grd_7,grd_8,grd_9,grd_10,grd_11,grd_12,grd_13,zip_98002,zip_98003,zip_98004,zip_98005,zip_98006,zip_98007,zip_98008,zip_98010,zip_98011,zip_98014,zip_98019,zip_98022,zip_98023,zip_98024,zip_98027,zip_98028,zip_98029,zip_98030,zip_98031,zip_98032,zip_98033,zip_98034,zip_98038,zip_98039,zip_98040,zip_98042,zip_98045,zip_98052,zip_98053,zip_98055,zip_98056,zip_98058,zip_98059,zip_98065,zip_98070,zip_98072,zip_98074,zip_98075,zip_98077,zip_98092,zip_98102,zip_98103,zip_98105,zip_98106,zip_98107,zip_98108,zip_98109,zip_98112,zip_98115,zip_98116,zip_98117,zip_98118,zip_98119,zip_98122,zip_98125,zip_98126,zip_98133,zip_98136,zip_98144,zip_98146,zip_98148,zip_98155,zip_98166,zip_98168,zip_98177,zip_98178,zip_98188,zip_98198,zip_98199
0,2.591820e+09,4.0,2.25,2070.0,8893.0,2.0,0.0,4.0,8.0,2070.0,98058.0,47.4388,-122.162,2390.0,7700.0,28.0,0.0,0.0,0.0,6.717533e+18,1.036728e+10,5.831596e+09,5.365068e+12,2.304906e+13,5.183641e+09,0.0,1.036728e+10,2.073456e+10,5.365068e+12,2.541487e+14,1.229528e+11,-3.166220e+11,6.194451e+12,1.995702e+13,7.257097e+10,0.0,0.000000e+00,0.000000e+00,16.0,9.0,8280.0,35572.0,8.0,0.0,16.0,32.0,8280.0,392232.0,189.7552,-488.648,9560.0,30800.0,112.0,0.0,0.0,0.0,5.0625,4657.5,20009.25,4.50,0.0,9.00,18.00,4657.5,220630.50,106.737300,-274.8645,5377.5,17325.00,63.0,0.0,0.0,0.0,4284900.0,18408510.0,4140.0,0.0,8280.0,16560.0,4284900.0,202980060.0,98198.316,-252875.34,4947300.0,15939000.0,57960.0,0.0,0.0,0.0,7.908545e+07,17786.0,0.0,35572.0,71144.0,18408510.0,8.720298e+08,4.218732e+05,-1086386.666,21254270.0,6.847610e+07,...,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,7.974201e+09,5.0,3.00,2900.0,6730.0,1.0,0.0,5.0,8.0,1830.0,98115.0,47.6784,-122.285,2370.0,6283.0,37.0,0.0,1.0,0.0,6.358788e+19,3.987100e+10,2.392260e+10,2.312518e+13,5.366637e+13,7.974201e+09,0.0,3.987100e+10,6.379361e+10,1.459279e+13,7.823887e+14,3.801971e+11,-9.751251e+11,1.889886e+13,5.010190e+13,2.950454e+11,0.0,7.974201e+09,0.000000e+00,25.0,15.0,14500.0,33650.0,5.0,0.0,25.0,40.0,9150.0,490575.0,238.3920,-611.425,11850.0,31415.0,185.0,0.0,5.0,0.0,9.0000,8700.0,20190.00,3.00,0.0,15.00,24.00,5490.0,294345.00,143.035200,-366.8550,7110.0,18849.00,111.0,0.0,3.0,0.0,8410000.0,19517000.0,2900.0,0.0,14500.0,23200.0,5307000.0,284533500.0,138267.360,-354626.50,6873000.0,18220700.0,107300.0,0.0,2900.0,0.0,4.529290e+07,6730.0,0.0,33650.0,53840.0,12315900.0,6.603140e+08,3.208756e+05,-822978.050,15950100.0,4.228459e+07,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,7.701450e+09,4.0,2.50,3770.0,10893.0,2.0,0.0,3.0,11.0,3770.0,98006.0,47.5646,-122.129,3710.0,9685.0,17.0,0.0,0.0,1.0,5.931233e+19,3.080580e+10,1.925363e+10,2.903447e+13,8.389190e+13,1.540290e+10,0.0,2.310435e+10,8.471595e+10,2.903447e+13,7.547883e+14,3.663164e+11,-9.405704e+11,2.857238e+13,7.458854e+13,1.309247e+11,0.0,0.000000e+00,7.701450e+09,16.0,10.0,15080.0,43572.0,8.0,0.0,12.0,44.0,15080.0,392024.0,190.2584,-488.516,14840.0,38740.0,68.0,0.0,0.0,4.0,6.2500,9425.0,27232.50,5.00,0.0,7.50,27.50,9425.0,245015.00,118.911500,-305.3225,9275.0,24212.50,42.5,0.0,0.0,2.5,14212900.0,41066610.0,7540.0,0.0,11310.0,41470.0,14212900.0,369482620.0,179318.542,-460426.33,13986700.0,36512450.0,64090.0,0.0,0.0,3770.0,1.186574e+08,21786.0,0.0,32679.0,119823.0,41066610.0,1.067579e+09,5.181212e+05,-1330351.197,40413030.0,1.054987e+08,...,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,9.522300e+09,3.0,3.50,4560.0,14608.0,2.0,0.0,3.0,12.0,4560.0,98034.0,47.6995,-122.228,4050.0,14226.0,25.0,0.0,0.0,1.0,9.067420e+19,2.856690e+10,3.332805e+10,4.342169e+13,1.391018e+14,1.904460e+10,0.0,2.856690e+10,1.142676e+11,4.342169e+13,9.335092e+14,4.542089e+11,-1.163892e+12,3.856532e+13,1.354642e+14,2.380575e+11,0.0,0.000000e+00,9.522300e+09,9.0,10.5,13680.0,43824.0,6.0,0.0,9.0,36.0,13680.0,294102.0,143.0985,-366.684,12150.0,42678.0,75.0,0.0,0.0,3.0,12.2500,15960.0,51128.00,7.00,0.0,10.50,42.00,15960.0,343119.00,166.948250,-427.7980,14175.0,49791.00,87.5,0.0,0.0,3.5,20793600.0,66612480.0,9120.0,0.0,13680.0,54720.0,20793600.0,447035040.0,217509.720,-557359.68,18468000.0,64870560.0,114000.0,0.0,0.0,4560.0,2.133937e+08,29216.0,0.0,43824.0,175296.0,66612480.0,1.432081e+09,6.967943e+05,-1785506.624,59162400.0,2.078134e+08,...,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,9.510861e+09,3.0,2.50,2550.0,5376.0,2.0,0.0,3.0,9.0,2550.0,98052.0,47.6647,-122.083,2250.0,4050.0,10.0,0.0,0.0,0.0,9.045648e+19,2.853258e+10,2.377715e+10,2.425270e+13,5.113039e+13,1.902172e+10,0.0,2.853258e+10,8.559775e+10,2.425270e+13,9.325590e+14,4.533323e+11,-1.161114e+12,2.139944e+13,3.851899e+13,9.510861e+10,0.0,0.000000e+00,0.000000e+00,9.0,7.5,7650.0,16128.0,6.0,0.0,9.0,27.0,7650.0,294156.0,142.9941,-366.249,6750.0,12150.0,30.0,0.0,0.0,0.0,6.2500,6375.0,13440.00,5.00,0.0,7.50,22.50,6375.0,245130.00,119.161750,-305.2075,5625.0,10125.00,25.0,0.0,0.0,0.0,6502500.0,13708800.0,5100.0,0.0,7650.0,22950.0,6502500.0,250032600.0,121544.985,-311311.65,5737500.0,10327500.0,25500.0,0.0,0.0,0.0,2.890138e+07,10752.0,0.0,16128.0,48384.0,13708800.0,5.271276e+08,2.562454e+05,-656318.208,12096000.0,2.177280e+07,...,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
17285,6.273002e+08,5.0,2.50,3240.0,9960.0,1.0,0.0,3.0,8.0,2020.0,98008.0,47.5858,-122.112,2730.0,10400.0,57.0,0.0,1.0,1.0,3.935055e+17,3.136501e+09,1.568250e+09,2.032453e+12,6.247910e+12,6.273002e+08,0.0,1.881901e+09,5.018402e+09,1.267146e+12,6.148044e+13,2.985058e+10,-7.660088e+10,1.712530e+12,6.523922e+12,3.575611e+10,0.0,6.273002e+08,6.273002e+08,25.0,12.5,16200.0,49800.0,5.0,0.0,15.0,40.0,10100.0,490040.0,237.9290,-610.560,13650.0,52000.0,285.0,0.0,5.0,5.0,6.2500,8100.0,24900.00,2.50,0.0,7.50,20.00,5050.0,245020.00,118.964500,-305.2800,6825.0,26000.00,142.5,0.0,2.5,2.5,10497600.0,32270400.0,3240.0,0.0,9720.0,25920.0,6544800.0,317545920.0,154177.992,-395642.88,8845200.0,33696000.0,184680.0,0.0,3240.0,3240.0,9.920160e+07,9960.0,0.0,29880.0,79680.0,20119200.0,9.761597e+08,4.739546e+05,-1216235.520,27190800.0,1.035840e+08,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
17286,8.819900e+09,2.0,1.75,1300.0,4000.0,2.0,0.0,3.0,7.0,1300.0,98105.0,47.6687,-122.288,1350.0,4013.0,66.0,0.0,0.0,0.0,7.779064e+19,1.763980e+10,1.543483e+10,1.146587e+13,3.527960e+13,1.763980e+10,0.0,2.645970e+10,6.173930e+10,1.146587e+13,8.652763e+14,4.204332e+11,-1.078568e+12,1.190687e+13,3.539426e+13,5.821134e+11,0.0,0.000000e+00,0.000000e+00,4.0,3.5,2600.0,8000.0,4.0,0.0,6.0,14.0,2600.0,196210.0,95.3374,-244.576,2700.0,8026.0,132.0,0.0,0.0,0.0,3.0625,2275.0,7000.00,3.50,0.0,5.25,12.25,2275.0,171683.75,83.420225,-214.0040,2362.5,7022.75,115.5,0.0,0.0,0.0,1690000.0,5200000.0,2600.0,0.0,3900.0,9100.0,1690000.0,127536500.0,61969.310,-158974.40,1755000.0,5216900.0,85800.0,0.0,0.0,0.0,1.600000e+07,8000.0,0.0,12000.0,28000.0,5200000.0,3.924200e+08,1.906748e+05,-489152.000,5400000.0,1.605200e+07,...,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
17287,3.816300e+09,3.0,1.00,1050.0,9876.0,1.0,0.0,3.0,7.0,1050.0,98028.0,47.7635,-122.262,1760.0,9403.0,61.0,0.0,0.0,0.0,1.456415e+19,1.144890e+10,3.816300e+09,4.007115e+12,3.768978e+13,3.816300e+09,0.0,1.144890e+10,2.671410e+10,4.007115e+12,3.741043e+14,1.822798e+11,-4.665885e+11,6.716688e+12,3.588467e+13,2.327943e+11,0.0,0.000000e+00,0.000000e+00,9.0,3.0,3150.0,29628.0,3.0,0.0,9.0,21.0,3150.0,294084.0,143.2905,-366.786,5280.0,28209.0,183.0,0.0,0.0,0.0,1.0000,1050.0,9876.00,1.00,0.0,3.00,7.00,1050.0,98028.00,47.763500,-122.2620,1760.0,9403.00,61.0,0.0,0.0,0.0,1102500.0,10369800.0,1050.0,0.0,3150.0,7350.0,1102500.0,102929400.0,50151.675,-128375.10,1848000.0,9873150.0,64050.0,0.0,0.0,0.0,9.753538e+07,9876.0,0.0,29628.0,69132.0,10369800.0,9.681245e+08,4.717123e+05,-1207459.512,17381760.0,9.286403e+07,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
17288,1.220691e+08,3.0,1.50,1900.0,43186.0,1.5,0.0,4.0,7.0,1300.0,98038.0,47.4199,-121.990,2080.0,108028.0,43.0,0.0,1.0,0.0,1.490087e+16,3.662073e+08,1.831037e+08,2.319313e+11,5.271676e+12,1.831037e+08,0.0,4.882764e+08,8.544837e+08,1.586898e+11,1.196741e+13,5.788505e+09,-1.489121e+10,2.539037e+11,1.318688e+13,5.248972e+09,0.0,1.220691e+08,0.000000e+00,9.0,4.5,5700.0,129558.0,4.5,0.0,12.0,21.0,3900.0,294114.0,142.2597,-365.970,6240.0,324084.0,129.0,0.0,3.0,0.0,2.2500,2850.0,64779.00,2.25,0.0,6.00,10.50,1950.0,147057.00,71.129850,-182.9850,3120.0,162042.00,64.5,0.0,1.5,0.0,3610000.0,82053400.0,2850.0,0.0,7600.0,13300.0,2470000.0,186272200.0,90097.810,-231781.00,3952000.0,205253200.0,81700.0,0.0,1900.0,0.0,1.865031e+09,64779.0,0.0,172744.0,302302.0,56141800.0,4.233869e+09,2.047876e+06,-5268260.140,89826880.0,4.665297e+09,...,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


# Instantiating and Fitting a Supervised Learning Model 

There are several techniques we could use to prepare a linear regression model, including taking it to pen and paper to calculate means, standard deviations, correlations, and covariance.  Here we employ OLS method or ordinary least squares method.

R^2 or the coefficient of determination is a measure to assess the goodness of fit of a regression model:

$$ R^2 = 1 - \frac{SS_{RES}}{SS_{TOT}} = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \hat{y}_i)^2} $$

- The R-squared value is more specifically the amount of variance in the dependent variable that can be explained by the independent variables, also the covariance of X and Y (SSxy) divided by variance in X (SSxx).
- The intercept is the mean of the dependent variable when all the indepedent variables are zero. 
- The coefficients are estimates of the actual population parameters, where the increase

- R-values range from 0 to 1, and higher values of R^2 would be indicative of a good fit, but if the R^2 value is too high, that often can be indicative of over-fitting. 
- The model starts becoming attuned to fit the noise in the sample rather than reflecting the entire population, which decreases its capability to make precise predictions.
- 



In [59]:
# Use scikit-learn to instantiate a linear regression object and fit the model to the data
lm = LinearRegression()
lm = lm.fit(df_model, target)
# We use the value of R_squared as an indication of the fit 
print('Intercept :', lm.intercept_)
print('Coefficients :', lm.coef_)
print('R^2 Score : ', lm.score(df_model, target))

Intercept : -44330279382.61959

Coefficients :
 [ 8.70892026e+06 -5.40238724e+06 -7.94856211e+04  6.38387889e+01
 -2.66344839e+06  1.52600990e+05 -4.04268365e+06 -2.84377321e+07
  8.66564199e+04  1.28933113e+05  5.30304654e+08 -4.16702472e+08
 -1.16236385e+04 -3.36995575e+02 -8.79145597e+05  3.49955186e+06
 -6.40768866e+06  3.12461103e+05 -1.47291064e+05  3.05286677e+03
 -3.45944291e+01  6.16842871e-02  2.08653595e+04  2.62878401e+00
 -2.12985419e+01  3.77639401e+03  8.05707023e+00  1.20433761e+01
 -4.45932391e+03  6.83359715e+04 -1.69280802e+00  3.83724513e-02
  1.65765843e+02 -2.65424066e+03  2.16920814e+04  8.21058909e+03
 -1.11161557e+04  2.39195362e+01 -3.47086294e-01 -1.53536080e+04
  9.64899935e+04 -7.03952222e+03  1.23701635e+04 -1.74338815e+01
 -5.16069630e+01  1.51646633e+04 -7.98098986e+04  1.87180484e+01
  3.57242242e-01  2.16972893e+01 -2.48528015e+04 -3.44765114e+04
 -2.42231136e+03 -3.15419436e-02 -4.49508429e-04 -4.84426499e-02
  1.25072021e+02  3.10254626e+01  3.426703

## **Train-Test Split**

The train-test split is a technique for evaluating the performance of a machine learning algorithm, which can be used for classification or regression problems or any supervised learning algorithm.

The procedure involves taking a dataset and dividing it into two subsets. The first subset is used to fit the model, the training dataset, and for the second subset, the test dataset, the input element of the dataset is provided to the model, then predictions are made and compared to the expected values.  The objective is to estimate the performance of the machine learning model on new data: data not used to train the model.

This is how we expect to use the model in practice. Namely, to fit it on available data with known inputs and outputs, then make predictions on new examples in the future where we do not have the expected output or target values.

The train-test procedure is appropriate when there is a sufficiently large dataset available. When the dataset available is small, we can consider using a k-fold cross-validation procedure to evaluate the model performance.

### Identify features and target variable

In [51]:
features = df_model
target = df['price']
df_model.columns

Index(['id', 'bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors',
       'waterfront', 'condition', 'grade', 'sqft_above',
       ...
       'zip_98146', 'zip_98148', 'zip_98155', 'zip_98166', 'zip_98168',
       'zip_98177', 'zip_98178', 'zip_98188', 'zip_98198', 'zip_98199'],
      dtype='object', length=339)

### Create Train and Test Split

In [72]:
X_train, X_test, y_train, y_test = train_test_split(features, target, random_state=34, test_size=0.2)
lm = LinearRegression().fit(X_train, y_train)
print("R^2 Score: ", lm.score(X_train, y_train))
print('Intercept: ', lm.intercept_)

R^2 Score:  0.6482472145485566
Intercept:  -3228665.1816111896


### Assessing Training Model Performance, Predicting on Testing Set, and Comparing Model Performance

As for what we use for regression metrics, 

1. MAE describes the typical magnitude of the residuals, where small MAE suggests that the model is good for prediction.
2. MSE is the square of the difference between actual and predicted values, and will always be larger than MAE.  The presence of outliers will contribute quadratically to the error such that large differences between actual and predicted values are punished to a greater degree.
3. RMSE is the square root of the variance of the residuals, which indicates the best absolute fit of the model to the data, having the same units as the target variable, where lower values indicates a better fit.  

RMSE is the most important criterion for fit when we are working within prediction models and is the metric we most often use to compare between the training and testing model performance.

In [73]:
def bias(y, y_hat):
	return np.mean(y_hat - y)
def variance(y_hat):
	return np.mean([yi**2 for yi in y_hat]) - np.mean(y_hat)**2

In [74]:
y_tr_pred = lm.predict(X_train)
y_tt_pred = lm.predict(X_test)

print('Training R^2 Score: ', lm.score(X_train, y_train))
print('Training MAE: ', metrics.mean_absolute_error(y_train, y_tr_pred))
print('Training MSE: ', metrics.mean_squared_error(y_train, y_tr_pred))
print('Training RMSE: ', np.sqrt(metrics.mean_squared_error(y_train, y_tr_pred)))
print('Training Bias: ', bias(y_train, y_tr_pred))
print('Training Variance: ', variance(y_tr_pred))
print("")
print('Testing R^2 Score: ', lm.score(X_test, y_test))
print('Testing MAE: ', metrics.mean_absolute_error(y_test, y_tt_pred))
print('Testing MSE: ', metrics.mean_squared_error(y_test, y_tt_pred))
print('Testing RMSE: ', np.sqrt(metrics.mean_squared_error(y_test, y_tt_pred)))
print('Testing Bias: ', bias(y_test, y_tt_pred))
print('Testing Variance: ', variance(y_tt_pred))

Training R^2 Score:  0.6482472145485566
Training MAE:  138535.621281305
Training MSE:  48706536503.8643
Training RMSE:  220695.57427339657
Training Bias:  1.7581474167408456e-09
Training Variance:  110344836922.5603

Testing R^2 Score:  0.6499042995934861
Testing MAE:  140227.98842986746
Testing MSE:  50035082235.40746
Testing RMSE:  223685.23025762668
Testing Bias:  -3389.862806085368
Testing Variance:  96872319905.39313


**Initial thoughts:**

**Comparing the training and testing RMSE, there is expected increase in RMSE from the training to the testing set.  The model still includes all the polynomial and interaction features, so it is definitely still overfit.**



### Check for the Linear Regression assumptions:  Normal distribution of residuals and homoscedasticity


In [75]:
# this defines residuals as the sample estimate of the error for each observation
residuals = (y_test - y_tt_pred)

In [78]:
# This checks for the normal distribution of the residuals or error term.  By satisfying this assumption, you are able to generate more reliable confidence and prediction intervals.
# plt.hist(residuals)
# plt.savefig('images/residuals.png')

<img src='images/residuals.png'>

In [79]:
# We use residplot to check for heteroscedasticity, which is the case where the residuals have a non-constant variance
# sns.residplot(y_tt_pred, y_test, lowess=True, color='g')
# plt.savefig('images/residplot.png')

<img src='images/residplot.png'>

## Feature Selection

There are three types of feature selection methods:  filter, wrapper, and embedded.  Filtering methods, like K-Best, approach the problem by estimating the validities of features through statistical tests (i.e. correlation coefficient, information gain, chi-squared test, f-test) to assign scoring to each feature, which are subsequently ranked and selected.  Here K-Best uses the f-test (`f_regression`) to compare the least square errors between the two models and checks if the difference is significant and returns the top 20 features.

Filtering methods are computationally less expensive than wrapper methods since we are not training an actual model.

In [81]:
# Instantiate SelectKBest object and fit training data where k is the number of features you want to select
X_train, X_test, y_train, y_test = train_test_split(features, target, random_state=9, test_size=0.2)
sel = SelectKBest(f_regression, k=20)
sel.fit(X_train, y_train)
sel_columns = X_train.columns[sel.get_support()]
rem_columns = X_train.columns[~sel.get_support()]

In [82]:
list(sel_columns)

['sqft_living',
 'grade',
 'bathrooms sqft_living',
 'bathrooms grade',
 'bathrooms sqft_above',
 'bathrooms sqft_living15',
 'sqft_living^2',
 'sqft_living condition',
 'sqft_living grade',
 'sqft_living sqft_above',
 'sqft_living zipcode',
 'sqft_living lat',
 'sqft_living long',
 'sqft_living sqft_living15',
 'grade^2',
 'grade sqft_above',
 'grade zipcode',
 'grade lat',
 'grade long',
 'grade sqft_living15']

In [83]:
# Instantiate linear regression object and fit the linear regression to the data
kbest = LinearRegression().fit(X_train[sel_columns], y_train)
y_tr_pred = kbest.predict(X_train[sel_columns])
kb_tr_rmse = np.sqrt(metrics.mean_squared_error(y_train, y_tr_pred))
y_tt_pred = kbest.predict(X_test[sel_columns])
kb_tt_rmse = np.sqrt(metrics.mean_squared_error(y_test, y_tt_pred))


print('Training R^2 Score: ', kbest.score(X_train, y_train))
print('Training MAE: ', metrics.mean_absolute_error(y_train, y_tr_pred))
print('Training MSE: ', metrics.mean_squared_error(y_train, y_tr_pred))
print('Training RMSE: ', kb_tr_rmse)
print('Training Bias: ', bias(y_train, y_tr_pred))
print('Training Variance: ', variance(y_tr_pred))
print("")
print('Testing RMSE: ', kb_tt_rmse)
print('Testing Bias: ', bias(y_test, y_tt_pred))
print('Testing Variance: ', variance(y_tt_pred))

ValueError: matmul: Input operand 1 has a mismatch in its core dimension 0, with gufunc signature (n?,k),(k,m?)->(n?,m?) (size 20 is different from 339)

**Initial thoughts:**

This would indicate that `sqft_living`, `grade`, `bathrooms` are strongly correlated with the `price`.  It would make sense that `grade` and `sqft_living` would also be strongly correlated, but `bathrooms` seem correlated through an interaction feature, but not `bathrooms` alone.  `sqft_above`, `sqft_living15`, `zipcode`, `lat`, `long`, and `condition` only appear through interaction features, but could be worth exploring.

### RFECV

Wrapper algorithms like RFECV returns a best set of features with an extensive greedy search, where different combinations are prepared, evaluated, and compared to other combinations.  Recursive Feature Elimination and Cross-Validation Selection begins with a model with the complete set of predictors and a score is assigned to each predictor, and the least important are removed.  The model is then rebuilt, and importance scores are computed again.  It is usually best practice to identify multicollinearity first, as it will select relevant and redundant features alike. 

In [51]:
ols = LinearRegression()
rfecv = RFECV(estimator=ols, step=1, cv=5, scoring='neg_mean_squared_error', n_jobs=-1)
rfecv.fit(X_train, y_train)
selected = X_train.columns[selector.support_]
removed = X_train.columns[~selector.support_]

In [52]:
print(list(removed))
len(list(selected))

['bedrooms sqft_lot', 'bedrooms zipcode', 'bedrooms sqft_lot15', 'bathrooms sqft_lot', 'bathrooms sqft_lot15', 'sqft_living^2', 'sqft_living sqft_lot', 'sqft_living sqft_above', 'sqft_living sqft_living15', 'sqft_living sqft_lot15', 'sqft_lot^2', 'sqft_lot floors', 'sqft_lot condition', 'sqft_lot grade', 'sqft_lot sqft_above', 'sqft_lot zipcode', 'sqft_lot lat', 'sqft_lot sqft_living15', 'sqft_lot sqft_lot15', 'sqft_lot sale_age', 'sqft_lot basement', 'sqft_lot viewed', 'floors sqft_lot15', 'waterfront sqft_lot15', 'condition sqft_lot15', 'grade sqft_lot15', 'sqft_above^2', 'sqft_above zipcode', 'sqft_above sqft_living15', 'sqft_above sqft_lot15', 'sqft_above sale_age', 'zipcode sqft_living15', 'zipcode sqft_lot15', 'long sqft_lot15', 'sqft_living15^2', 'sqft_living15 sqft_lot15', 'sqft_lot15^2', 'sqft_lot15 sale_age', 'sqft_lot15 renovated', 'sqft_lot15 basement', 'sqft_lot15 viewed', 'bdr_11', 'bth_7.75']


206

In [53]:
lm = LinearRegression()
lm = lm.fit(X_train[selected], y_train)
y_pred = lm.predict(X_train[selected])
train = np.sqrt(metrics.mean_squared_error(y_train, y_pred))
print('Training RMSE: ', train)
y_pred = lm.predict(X_test[selected])
test = np.sqrt(metrics.mean_squared_error(y_test, y_pred))
print('Testing RMSE: ', test)

Training RMSE:  152424.20363793467
Testing RMSE:  72940789.52495904


With such a high testing RMSE, it is indicative of overfitting

### Ridge and Lasso Regression

Embedded methods learn which features best contribute to the accuracy of the model while the model is being created, the most common being the regularization methods.  They are also called penalization methods that introduce additional constraints into the optimization of a predictive algorithm that bias the model toward lower complexity, i.e. fewer coefficients.

Ridge regression optimizes the RSS by adding a penalty equivalent to the square of the magnitude of the coefficients, while Lasso adds a penalty equivalent to the absolute value of the magnitude of the coefficients.

In [172]:
df.head()

Unnamed: 0,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,condition,grade,sqft_above,zipcode,lat,long,sqft_living15,sqft_lot15,sale_age,renovated,basement,viewed
0,365000.0,4,2.25,2070,8893,2.0,0,4,8,2070,98058,47.4388,-122.162,2390,7700,28,0,0,0
1,865000.0,5,3.0,2900,6730,1.0,0,5,8,1830,98115,47.6784,-122.285,2370,6283,37,0,1,0
2,1038000.0,4,2.5,3770,10893,2.0,0,3,11,3770,98006,47.5646,-122.129,3710,9685,17,0,0,1
3,1490000.0,3,3.5,4560,14608,2.0,0,3,12,4560,98034,47.6995,-122.228,4050,14226,25,0,0,1
4,711000.0,3,2.5,2550,5376,2.0,0,3,9,2550,98052,47.6647,-122.083,2250,4050,10,0,0,0


In [173]:
y = df['price']
X = df.drop(columns=['price'], axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=10, test_size=0.2)
feat_cat = df[['condition', 'grade', 'waterfront', 'floors', 'bedrooms', 'bathrooms', 'zipcode', 'renovated', 'basement', 'viewed']]
col_cat = feat_cat.columns
feat_cont = df[['sqft_living', 'sqft_lot', 'sqft_above', 'sqft_living15', 'sqft_lot15', 'sale_age', 'lat', 'long']]
col_cont = feat_cont.columns
X_train_cont = X_train.loc[:, col_cont]
X_test_cont = X_test.loc[:, col_cont]
lm = LinearRegression()
lm.fit(X_train_cont, y_train)
y_pred = lm.predict(X_train_cont)
print('Training R^2: ', lm.score(X_train_cont, y_train))
print('Training RMSE: ', np.sqrt(metrics.mean_squared_error(y_train, y_pred)))
y_pred = lm.predict(X_test_cont)
print('Testing R^2: ', lm.score(X_test_cont, y_test))
print('Testing RMSE: ', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

Training R^2:  0.6014378607289312
Training RMSE:  236030.5476209643
Testing R^2:  0.5799486700394496
Testing RMSE:  240484.36704981446


In [174]:
ss = StandardScaler()
X_train_scaled = ss.fit_transform(X_train_cont)
X_test_scaled = ss.fit_transform(X_test_cont)
lm = LinearRegression()
lm.fit(X_train_scaled, y_train)
y_pred = lm.predict(X_train_scaled)
print('Training R^2: ', lm.score(X_train_scaled, y_train))
print('Training RMSE: ', np.sqrt(metrics.mean_squared_error(y_train, y_pred)))
y_pred = lm.predict(X_test_scaled)
print('Testing R^2: ', lm.score(X_test_scaled, y_test))
print('Testing RMSE: ', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

Training R^2:  0.6014378607289309
Training RMSE:  236030.5476209644
Testing R^2:  0.5802911593094102
Testing RMSE:  240386.30747348102


In [175]:
from sklearn.preprocessing import OneHotEncoder
X_train_cat = X_train.loc[:, col_cat]
X_test_cat = X_test.loc[:, col_cat]
ohe = OneHotEncoder(handle_unknown='ignore')
X_train_ohe = ohe.fit_transform(X_train_cat)
X_test_ohe = ohe.transform(X_test_cat)
columns = ohe.get_feature_names(input_features=X_train_cat.columns)
cat_train_df = pd.DataFrame(X_train_ohe.todense(), columns=columns)
cat_test_df = pd.DataFrame(X_test_ohe.todense(), columns=columns)

In [176]:
X_train_all = pd.concat([pd.DataFrame(X_train_scaled), cat_train_df], axis=1)
X_test_all = pd.concat([pd.DataFrame(X_test_scaled), cat_test_df], axis=1)
lm = LinearRegression()
lm.fit(X_train_all, y_train)
y_pred = lm.predict(X_train_all)
print('Training R^2: ', lm.score(X_train_all, y_train))
print('Training RMSE: ', np.sqrt(metrics.mean_squared_error(y_train, y_pred)))
y_pred = lm.predict(X_test_all)
print('Testing R^2: ', lm.score(X_test_all, y_test))
print('Testing RMSE: ', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

Training R^2:  0.84131960907618
Training RMSE:  148929.8747415769
Testing R^2:  -2.7380967217723377e+18
Testing RMSE:  613988250532612.6


In [177]:
lasso = Lasso()
lasso.fit(X_train_all, y_train)
y_train_pred = lasso.predict(X_train_all)
print('Training R^2: ', lasso.score(X_train_all, y_train))
print('Training RMSE: ', np.sqrt(metrics.mean_squared_error(y_train, y_train_pred)))
y_test_pred = lasso.predict(X_test_all)
print('Testing R^2: ', lasso.score(X_test_all, y_test))
print('Testing RMSE: ', np.sqrt(metrics.mean_squared_error(y_test, y_test_pred)))

Training R^2:  0.841318007412382
Training RMSE:  148930.62636244667
Testing R^2:  0.8138574472175385
Testing RMSE:  160087.89738523302


In [222]:
lasso_coef01 = pd.DataFrame(data=lasso.coef_).T
lasso_coef01.columns = X_train_all.columns
lasso_coef01 = lasso_coef01.T.sort_values(by=0).T 
# lasso_coef01.plot(kind='bar', title='Modal Coefficients', legend=False, figsize=(16,8))
# plt.savefig('lasso_coef_1.png')

<img src='images/lasso_coeff_1.png'>

In [179]:
coeff_df = lasso_coef01.T
coeff_df[coeff_df[0]==0].count()

0    5
dtype: int64

In [180]:
lasso_1 = coeff_df[coeff_df[0]==0]
lasso_1

Unnamed: 0,0
renovated_1,0.0
basement_1,-0.0
viewed_1,0.0
waterfront_1,0.0
grade_9,0.0


In [181]:
lasso = Lasso(alpha=10)
lasso.fit(X_train_all, y_train)
y_train_pred = lasso.predict(X_train_all)
print('Training R^2: ', lasso.score(X_train_all, y_train))
print('Training RMSE: ', np.sqrt(metrics.mean_squared_error(y_train, y_train_pred)))
y_test_pred = lasso.predict(X_test_all)
print('Testing R^2: ', lasso.score(X_test_all, y_test))
print('Testing RMSE: ', np.sqrt(metrics.mean_squared_error(y_test, y_test_pred)))


Training R^2:  0.841194377236266
Training RMSE:  148988.63147812762
Testing R^2:  0.8152591755201664
Testing RMSE:  159483.99519699477


In [214]:
lasso_coef01 = pd.DataFrame(data=lasso.coef_).T
lasso_coef01.columns = X_train_all.columns
lasso_coef01 = lasso_coef01.T.sort_values(by=0).T 
# lasso_coef01.plot(kind='bar', title='Modal Coefficients', legend=False, figsize=(16,8))
# plt.savefig('lasso_coeff_2.png')

<img src='images/lasso_coeff_2.png'>

In [183]:
coeff_df = lasso_coef01.T
coeff_df[abs(coeff_df[0]==0)].count()

0    13
dtype: int64

In [184]:
lasso_2 = coeff_df[coeff_df[0]==0]
lasso_2

Unnamed: 0,0
zipcode_98034,0.0
zipcode_98022,-0.0
renovated_1,0.0
basement_1,-0.0
viewed_1,0.0
waterfront_1,0.0
bathrooms_3.5,0.0
grade_9,0.0
bedrooms_11,-0.0
condition_3,-0.0


In [215]:
train_rmse = []
test_rmse = []
alphas = []

for alpha in np.linspace(0, 200, num=50):
    lasso = Lasso(alpha=alpha)
    lasso.fit(X_train_all, y_train)
    train_preds = lasso.predict(X_train_all)
    train_rmse.append(np.sqrt(metrics.mean_squared_error(y_train, train_preds)))
    test_preds = lasso.predict(X_test_all)
    test_rmse.append(np.sqrt(metrics.mean_squared_error(y_test, test_preds)))
    alphas.append(alpha)

# fig,ax = plt.subplots()
# ax.plot(alphas, train_rmse, label='Train')
# ax.plot(alphas, test_rmse, label='Test')
# ax.set_xlabel('Alpha')
# ax.set_ylabel('RMSE')
optimal_alpha = alphas[np.argmin(test_rmse)]
# ax.axvline(optimal_alpha, color='black', linestyle='--')
print(f'Optimal Alpha Value: {int(optimal_alpha)}')
# plt.savefig('optimal_lasso.png')

Optimal Alpha Value: 200


<img src='images/optimal_lasso.png'>

In [186]:
lasso = Lasso(alpha=163)
lasso.fit(X_train_all, y_train)
y_train_pred = lasso.predict(X_train_all)
print('Training R^2: ', lasso.score(X_train_all, y_train))
print('Training RMSE: ', np.sqrt(metrics.mean_squared_error(y_train, y_train_pred)))
y_test_pred = lasso.predict(X_test_all)
print('Testing R^2: ', lasso.score(X_test_all, y_test))
print('Testing RMSE: ', np.sqrt(metrics.mean_squared_error(y_test, y_test_pred)))

Training R^2:  0.8344565483829849
Training RMSE:  152116.4552052251
Testing R^2:  0.822469127430404
Testing RMSE:  156340.9014041161


In [216]:
lasso_coef01 = pd.DataFrame(data=lasso.coef_).T
lasso_coef01.columns = X_train_all.columns
lasso_coef01 = lasso_coef01.T.sort_values(by=0).T 
# lasso_coef01.plot(kind='bar', title='Modal Coefficients', legend=False, figsize=(16,8))
# plt.savefig('lasso_coeff_3.png')

<img src='images/lasso_coeff_3.png'>

In [188]:
coeff_df = lasso_coef01.T
coeff_df[abs(coeff_df[0]==0)].count()

0    41
dtype: int64

In [189]:
lasso_3 = coeff_df[coeff_df[0]==0]
lasso_3

Unnamed: 0,0
zipcode_98003,-0.0
zipcode_98001,0.0
bathrooms_8.0,0.0
bathrooms_7.5,-0.0
bathrooms_6.75,-0.0
...,...
bathrooms_1.0,-0.0
bathrooms_5.5,0.0
bathrooms_5.25,0.0
bathrooms_1.25,-0.0


In [190]:
ridge = Ridge()
ridge.fit(X_train_all, y_train)
y_train_pred = ridge.predict(X_train_all)
print('Training R^2: ', ridge.score(X_train_all, y_train))
print('Training RMSE: ', np.sqrt(metrics.mean_squared_error(y_train, y_train_pred)))
y_test_pred = ridge.predict(X_test_all)
print('Testing R^2: ', ridge.score(X_test_all, y_test))
print('Testing RMSE: ', np.sqrt(metrics.mean_squared_error(y_test, y_test_pred)))

Training R^2:  0.8407886210026972
Training RMSE:  149178.84672054244
Testing R^2:  0.8182525899976507
Testing RMSE:  158186.633331917


In [217]:
ridge_coef01 = pd.DataFrame(data=ridge.coef_).T
ridge_coef01.columns = X_train_all.columns
ridge_coef01 = ridge_coef01.T.sort_values(by=0).T 
# ridge_coef01.plot(kind='bar', title='Modal Coefficients', legend=False, figsize=(16,8))
# plt.savefig('ridge_coeff_1.png')

<img src='images/ridge_coeff_1.png'>

In [192]:
coeff_df = ridge_coef01.T
coeff_df[coeff_df[0]==0].count()

0    0
dtype: int64

In [193]:
ridge = Ridge(alpha=10)
ridge.fit(X_train_all, y_train)
y_train_pred = ridge.predict(X_train_all)
print('Training R^2: ', ridge.score(X_train_all, y_train))
print('Training RMSE: ', np.sqrt(metrics.mean_squared_error(y_train, y_train_pred)))
y_test_pred = ridge.predict(X_test_all)
print('Testing R^2: ', ridge.score(X_test_all, y_test))
print('Testing RMSE: ', np.sqrt(metrics.mean_squared_error(y_test, y_test_pred)))

Training R^2:  0.8336522149592601
Training RMSE:  152485.55494390102
Testing R^2:  0.8140886012655171
Testing RMSE:  159988.46697450607


In [218]:
ridge_coef01 = pd.DataFrame(data=ridge.coef_).T
ridge_coef01.columns = X_train_all.columns
ridge_coef01 = ridge_coef01.T.sort_values(by=0).T 
# ridge_coef01.plot(kind='bar', title='Modal Coefficients', legend=False, figsize=(16,8))
# plt.savefig('ridge_coeff_2.png')

<img src='images/ridge_coeff_2.png'>

In [195]:
coeff_df = ridge_coef01.T
coeff_df[coeff_df[0]==0].count()

0    0
dtype: int64

In [221]:
train_rmse = []
test_rmse = []
alphas = []

for alpha in np.linspace(0, 200, num=50):
    ridge = Ridge(alpha=alpha)
    ridge.fit(X_train_all, y_train)
    train_preds = ridge.predict(X_train_all)
    train_rmse.append(np.sqrt(metrics.mean_squared_error(y_train, train_preds)))
    test_preds = ridge.predict(X_test_all)
    test_rmse.append(np.sqrt(metrics.mean_squared_error(y_test, test_preds)))
    alphas.append(alpha)

# fig, ax = plt.subplots()
# ax.plot(alphas, train_rmse, label='Train')
# ax.plot(alphas, test_rmse, label='Test')
# ax.set_xlabel('Alpha')
# ax.set_ylabel('RMSE')
optimal_alpha = alphas[np.argmin(test_rmse)]
# ax.axvline(optimal_alpha, color='black', linestyle='--')

print(f'Optimal Alpha Value: {int(optimal_alpha)}')
# plt.savefig('optimal_ridge.png')

Optimal Alpha Value: 200


<img src='images/optimal_ridge.png'>

# Second Run

In [197]:
df2 = df.copy()

In [198]:
df2.drop('sqft_above', axis=1, inplace=True)
df2.drop(['floors', 'condition'], inplace=True, axis=1)

In [199]:
index_dum = df2[['bedrooms', 'bathrooms', 'grade']].columns
df2_dum = pd.get_dummies(data=df2, columns=index_dum, drop_first=True, prefix=['bdr', 'bth', 'grd'])
df2_dum.head()

Unnamed: 0,price,sqft_living,sqft_lot,waterfront,zipcode,lat,long,sqft_living15,sqft_lot15,sale_age,renovated,basement,viewed,bdr_1,bdr_2,bdr_3,bdr_4,bdr_5,bdr_6,bdr_7,bdr_8,bdr_9,bdr_10,bdr_11,bth_0.5,bth_0.75,bth_1.0,bth_1.25,bth_1.5,bth_1.75,bth_2.0,bth_2.25,bth_2.5,bth_2.75,bth_3.0,bth_3.25,bth_3.5,bth_3.75,bth_4.0,bth_4.25,bth_4.5,bth_4.75,bth_5.0,bth_5.25,bth_5.5,bth_5.75,bth_6.0,bth_6.25,bth_6.5,bth_6.75,bth_7.5,bth_7.75,bth_8.0,grd_3,grd_4,grd_5,grd_6,grd_7,grd_8,grd_9,grd_10,grd_11,grd_12,grd_13
0,365000.0,2070,8893,0,98058,47.4388,-122.162,2390,7700,28,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0
1,865000.0,2900,6730,0,98115,47.6784,-122.285,2370,6283,37,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0
2,1038000.0,3770,10893,0,98006,47.5646,-122.129,3710,9685,17,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
3,1490000.0,4560,14608,0,98034,47.6995,-122.228,4050,14226,25,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0
4,711000.0,2550,5376,0,98052,47.6647,-122.083,2250,4050,10,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0


In [200]:
# Grab columns for polynominal and interaction features from the original dataframe without dummy variables
poly_feat = df2[['sale_age', 'sqft_living', 'sqft_living15', 'grade']]
y = df2['price']
# Use SKlearn to create binomial features
poly = PolynomialFeatures(degree=2, include_bias=False)
poly_data = poly.fit_transform(poly_feat)
poly_columns = poly.get_feature_names(poly_feat.columns)
df2_poly = pd.DataFrame(poly_data, columns=poly_columns)
df2_poly.head()

Unnamed: 0,sale_age,sqft_living,sqft_living15,grade,sale_age^2,sale_age sqft_living,sale_age sqft_living15,sale_age grade,sqft_living^2,sqft_living sqft_living15,sqft_living grade,sqft_living15^2,sqft_living15 grade,grade^2
0,28.0,2070.0,2390.0,8.0,784.0,57960.0,66920.0,224.0,4284900.0,4947300.0,16560.0,5712100.0,19120.0,64.0
1,37.0,2900.0,2370.0,8.0,1369.0,107300.0,87690.0,296.0,8410000.0,6873000.0,23200.0,5616900.0,18960.0,64.0
2,17.0,3770.0,3710.0,11.0,289.0,64090.0,63070.0,187.0,14212900.0,13986700.0,41470.0,13764100.0,40810.0,121.0
3,25.0,4560.0,4050.0,12.0,625.0,114000.0,101250.0,300.0,20793600.0,18468000.0,54720.0,16402500.0,48600.0,144.0
4,10.0,2550.0,2250.0,9.0,100.0,25500.0,22500.0,90.0,6502500.0,5737500.0,22950.0,5062500.0,20250.0,81.0


In [201]:
# Concatenating two dataframes together
X = pd.concat([df2_poly, df2_dum], axis=1)
X = X.drop(columns=['price'], axis=1)
X

Unnamed: 0,sale_age,sqft_living,sqft_living15,grade,sale_age^2,sale_age sqft_living,sale_age sqft_living15,sale_age grade,sqft_living^2,sqft_living sqft_living15,sqft_living grade,sqft_living15^2,sqft_living15 grade,grade^2,sqft_living.1,sqft_lot,waterfront,zipcode,lat,long,sqft_living15.1,sqft_lot15,sale_age.1,renovated,basement,viewed,bdr_1,bdr_2,bdr_3,bdr_4,bdr_5,bdr_6,bdr_7,bdr_8,bdr_9,bdr_10,bdr_11,bth_0.5,bth_0.75,bth_1.0,bth_1.25,bth_1.5,bth_1.75,bth_2.0,bth_2.25,bth_2.5,bth_2.75,bth_3.0,bth_3.25,bth_3.5,bth_3.75,bth_4.0,bth_4.25,bth_4.5,bth_4.75,bth_5.0,bth_5.25,bth_5.5,bth_5.75,bth_6.0,bth_6.25,bth_6.5,bth_6.75,bth_7.5,bth_7.75,bth_8.0,grd_3,grd_4,grd_5,grd_6,grd_7,grd_8,grd_9,grd_10,grd_11,grd_12,grd_13
0,28.0,2070.0,2390.0,8.0,784.0,57960.0,66920.0,224.0,4284900.0,4947300.0,16560.0,5712100.0,19120.0,64.0,2070,8893,0,98058,47.4388,-122.162,2390,7700,28,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0
1,37.0,2900.0,2370.0,8.0,1369.0,107300.0,87690.0,296.0,8410000.0,6873000.0,23200.0,5616900.0,18960.0,64.0,2900,6730,0,98115,47.6784,-122.285,2370,6283,37,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0
2,17.0,3770.0,3710.0,11.0,289.0,64090.0,63070.0,187.0,14212900.0,13986700.0,41470.0,13764100.0,40810.0,121.0,3770,10893,0,98006,47.5646,-122.129,3710,9685,17,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
3,25.0,4560.0,4050.0,12.0,625.0,114000.0,101250.0,300.0,20793600.0,18468000.0,54720.0,16402500.0,48600.0,144.0,4560,14608,0,98034,47.6995,-122.228,4050,14226,25,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0
4,10.0,2550.0,2250.0,9.0,100.0,25500.0,22500.0,90.0,6502500.0,5737500.0,22950.0,5062500.0,20250.0,81.0,2550,5376,0,98052,47.6647,-122.083,2250,4050,10,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
17285,57.0,3240.0,2730.0,8.0,3249.0,184680.0,155610.0,456.0,10497600.0,8845200.0,25920.0,7452900.0,21840.0,64.0,3240,9960,0,98008,47.5858,-122.112,2730,10400,57,0,1,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0
17286,66.0,1300.0,1350.0,7.0,4356.0,85800.0,89100.0,462.0,1690000.0,1755000.0,9100.0,1822500.0,9450.0,49.0,1300,4000,0,98105,47.6687,-122.288,1350,4013,66,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0
17287,61.0,1050.0,1760.0,7.0,3721.0,64050.0,107360.0,427.0,1102500.0,1848000.0,7350.0,3097600.0,12320.0,49.0,1050,9876,0,98028,47.7635,-122.262,1760,9403,61,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0
17288,43.0,1900.0,2080.0,7.0,1849.0,81700.0,89440.0,301.0,3610000.0,3952000.0,13300.0,4326400.0,14560.0,49.0,1900,43186,0,98038,47.4199,-121.990,2080,108028,43,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0


In [202]:
# Use scikit-learn to fit and assess new model
lm_2 = LinearRegression()
lm_2 = lm_2.fit(X, y)
print('Intercept: ', lm_2.intercept_)
print('\nCoefficients:\n', lm_2.coef_)
print("\nR^2 Score: ", lm_2.score(X, y))

Intercept:  12327068.201419381

Coefficients:
 [-4.83378870e+03 -1.39362579e+02  2.84916968e+01 -3.92223807e+04
  3.65568474e+01  1.84937738e-03  1.91340031e+00  6.04602613e+02
  7.64164553e-03 -1.54109884e-02  4.85073072e+01  2.31485782e-02
 -1.83255651e+01  4.63637029e+03 -1.39362579e+02  7.73326942e-03
  6.43195014e+05 -6.25610645e+02  6.02930569e+05 -1.69672844e+05
  2.84916963e+01 -2.46962598e-01 -4.83378870e+03  1.53364386e+05
  7.67914442e+03  9.26702017e+04  1.09425638e+04  5.59841976e+04
  6.06808991e+04  5.35891542e+04  4.48873806e+04 -1.26735897e+04
 -1.06577845e+05  7.81945452e+04 -1.81182179e+05 -7.53670347e+04
 -9.10399986e+04 -9.78188286e+04 -9.46191541e+04 -2.08867203e+04
 -1.76235372e+04 -1.12213666e+04 -2.85294635e+03  1.42746910e+03
  8.18248194e+03  2.47168390e+03  1.37628598e+04  3.93076802e+04
  6.82629299e+04  4.75071520e+04  1.36224795e+05  1.05901902e+05
  1.48396251e+05  1.20831362e+05  2.17390838e+05  3.26321809e+05
  3.96771812e+05  2.73578535e+05 -1.2490916

In [203]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=34, test_size=0.2)
lm = linear_model.LinearRegression()
lm = lm.fit(X_train, y_train)
print('Intercept: ', lm.intercept_)
print('\nCoefficients:\n', lm.coef_)
print('\nR^2 Score: ', lm.score(X_train, y_train))

Intercept:  13991081.149470026

Coefficients:
 [-4.89448469e+03 -9.19913525e+01  2.11567071e+01 -4.81428782e+04
  3.69147586e+01 -1.63638377e-01  1.97853144e+00  6.38777437e+02
  2.50284414e-02 -2.91717863e-02  3.00715477e+01  2.92786123e-02
 -1.63574862e+01  6.74522818e+03 -9.19913530e+01  6.46007914e-02
  6.18912599e+05 -6.42387835e+02  6.08655091e+05 -1.67223946e+05
  2.11567079e+01 -2.23952869e-01 -4.89448469e+03  1.35708933e+05
  8.79199332e+03  9.14287934e+04  6.27745094e+03  6.28726639e+04
  6.85465103e+04  6.35618354e+04  5.89091208e+04 -9.03947792e+03
 -2.37894042e+04  4.13851803e+04 -1.72419891e+05 -5.04399966e+04
 -1.02661174e+05 -1.09187843e+05 -9.51685745e+04 -3.84224257e+04
 -5.75036444e+04 -2.81077711e+04 -1.82612887e+04 -1.25601732e+04
 -9.86637621e+03 -1.08628480e+04 -1.60238578e+03  3.07908622e+04
  3.48814170e+04  2.89463452e+04  1.23610451e+05  4.80553567e+04
  1.57974939e+05  1.05664259e+05  2.08045696e+05  3.30259646e+05
  2.57386317e+05  4.70774752e+05 -1.0278811

In [204]:
y_pred = lm.predict(X_train)
tr_rmse = np.sqrt(metrics.mean_squared_error(y_train, y_pred))
print('Training RMSE: ', tr_rmse)
y_pred = lm.predict(X_test)
tt_rmse = np.sqrt(metrics.mean_squared_error(y_test, y_pred))
print('Testing RMSE: ', tt_rmse)


Training RMSE:  179049.1131564559
Testing RMSE:  215427.14956097506


In [227]:
residuals = (y_test - y_pred)
# sns.residplot(y_pred, y_test, lowess=True, color='g')
# plt.savefig('residplot2.png')

<img src='images/residplot2.png'>

In [226]:
# plt.hist(residuals)
# plt.savefig('residuals2.png')

<img src='images/residuals2.png'>

In [207]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=9, test_size=0.2)
selector = SelectKBest(f_regression, k=20)
selector.fit(X_train, y_train)
selected_columns = X_train.columns[selector.get_support()]
removed_columns = X_train.columns[~selector.get_support()]

In [208]:
list(selected_columns)

['sqft_living',
 'sqft_living15',
 'grade',
 'sale_age sqft_living',
 'sqft_living^2',
 'sqft_living sqft_living15',
 'sqft_living grade',
 'sqft_living15^2',
 'sqft_living15 grade',
 'grade^2',
 'sqft_living',
 'waterfront',
 'lat',
 'sqft_living15',
 'viewed',
 'bth_1.0',
 'grd_7',
 'grd_10',
 'grd_11',
 'grd_12']

In [209]:
#instantiate a linear regression object, #fit the linear regression to the data
kbest = LinearRegression()
kbest.fit(X_train[selected_columns], y_train)
y_pred = kbest.predict(X_train[selected_columns])
tr_rmse = np.sqrt(metrics.mean_squared_error(y_train, y_pred))
print('Training RMSE: ', tr_rmse)
y_pred = kbest.predict(X_test[selected_columns])
tt_rmse = np.sqrt(metrics.mean_squared_error(y_test, y_pred))
print('Testing RMSE: ', tt_rmse)

Training RMSE:  195043.444857314
Testing RMSE:  191651.24161643485


In [210]:
ols = linear_model.LinearRegression()
selector = RFECV(estimator=ols, step=1, cv=5, scoring='neg_mean_squared_error', n_jobs=-1)
selector.fit(X_train, y_train)
selected = X_train.columns[selector.support_]
removed = X_train.columns[~selector.support_]
lm = LinearRegression()
lm = lm.fit(X_train[selected], y_train)
y_pred = lm.predict(X_train[selected])
tr_rmse = np.sqrt(metrics.mean_squared_error(y_train, y_pred))
print('Training RMSE: ', tr_rmse)
y_pred = lm.predict(x_test[selected])
tt_rmse = np.sqrt(metrics.mean_squared_error(y_test, y_pred))
print('Testing RMSE: ', tt_rmse)


Training RMSE:  183374.4866926505
Testing RMSE:  182879.86788913174


In [212]:
print(selected)
len(selected)

Index(['sale_age', 'sqft_living', 'sqft_living15', 'grade', 'sale_age^2',
       'sale_age sqft_living15', 'sale_age grade', 'sqft_living grade',
       'sqft_living15 grade', 'grade^2', 'sqft_living', 'waterfront',
       'zipcode', 'lat', 'long', 'sqft_living15', 'sale_age', 'renovated',
       'basement', 'viewed', 'bdr_1', 'bdr_2', 'bdr_3', 'bdr_4', 'bdr_5',
       'bdr_6', 'bdr_7', 'bdr_8', 'bdr_9', 'bdr_10', 'bth_0.5', 'bth_0.75',
       'bth_1.0', 'bth_1.25', 'bth_1.5', 'bth_1.75', 'bth_2.0', 'bth_2.25',
       'bth_2.5', 'bth_2.75', 'bth_3.0', 'bth_3.25', 'bth_3.5', 'bth_3.75',
       'bth_4.0', 'bth_4.25', 'bth_4.5', 'bth_4.75', 'bth_5.0', 'bth_5.25',
       'bth_5.5', 'bth_5.75', 'bth_6.0', 'bth_6.25', 'bth_6.5', 'bth_6.75',
       'bth_7.5', 'bth_8.0', 'grd_3', 'grd_4', 'grd_5', 'grd_6', 'grd_7',
       'grd_8', 'grd_9', 'grd_10', 'grd_11', 'grd_12', 'grd_13'],
      dtype='object')


69

In [213]:
X.head()

Unnamed: 0,sale_age,sqft_living,sqft_living15,grade,sale_age^2,sale_age sqft_living,sale_age sqft_living15,sale_age grade,sqft_living^2,sqft_living sqft_living15,sqft_living grade,sqft_living15^2,sqft_living15 grade,grade^2,sqft_living.1,sqft_lot,waterfront,zipcode,lat,long,sqft_living15.1,sqft_lot15,sale_age.1,renovated,basement,viewed,bdr_1,bdr_2,bdr_3,bdr_4,bdr_5,bdr_6,bdr_7,bdr_8,bdr_9,bdr_10,bdr_11,bth_0.5,bth_0.75,bth_1.0,bth_1.25,bth_1.5,bth_1.75,bth_2.0,bth_2.25,bth_2.5,bth_2.75,bth_3.0,bth_3.25,bth_3.5,bth_3.75,bth_4.0,bth_4.25,bth_4.5,bth_4.75,bth_5.0,bth_5.25,bth_5.5,bth_5.75,bth_6.0,bth_6.25,bth_6.5,bth_6.75,bth_7.5,bth_7.75,bth_8.0,grd_3,grd_4,grd_5,grd_6,grd_7,grd_8,grd_9,grd_10,grd_11,grd_12,grd_13
0,28.0,2070.0,2390.0,8.0,784.0,57960.0,66920.0,224.0,4284900.0,4947300.0,16560.0,5712100.0,19120.0,64.0,2070,8893,0,98058,47.4388,-122.162,2390,7700,28,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0
1,37.0,2900.0,2370.0,8.0,1369.0,107300.0,87690.0,296.0,8410000.0,6873000.0,23200.0,5616900.0,18960.0,64.0,2900,6730,0,98115,47.6784,-122.285,2370,6283,37,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0
2,17.0,3770.0,3710.0,11.0,289.0,64090.0,63070.0,187.0,14212900.0,13986700.0,41470.0,13764100.0,40810.0,121.0,3770,10893,0,98006,47.5646,-122.129,3710,9685,17,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
3,25.0,4560.0,4050.0,12.0,625.0,114000.0,101250.0,300.0,20793600.0,18468000.0,54720.0,16402500.0,48600.0,144.0,4560,14608,0,98034,47.6995,-122.228,4050,14226,25,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0
4,10.0,2550.0,2250.0,9.0,100.0,25500.0,22500.0,90.0,6502500.0,5737500.0,22950.0,5062500.0,20250.0,81.0,2550,5376,0,98052,47.6647,-122.083,2250,4050,10,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0


In [None]:
y = df['price']
X = df.drop(columns=['price'], axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=10, test_size=0.2)
feat_cat = df[['condition', 'grade', 'waterfront', 'floors', 'bedrooms', 'bathrooms', 'zipcode', 'renovated', 'basement', 'viewed']]
col_cat = feat_cat.columns
feat_cont = df[['sqft_living', 'sqft_lot', 'sqft_above', 'sqft_living15', 'sqft_lot15', 'sale_age', 'lat', 'long']]
col_cont = feat_cont.columns
X_train_cont = X_train.loc[:, col_cont]
X_test_cont = X_test.loc[:, col_cont]
lm = LinearRegression()
lm.fit(X_train_cont, y_train)
y_pred = lm.predict(X_train_cont)
print('Training R^2: ', lm.score(X_train_cont, y_train))
print('Training RMSE: ', np.sqrt(metrics.mean_squared_error(y_train, y_pred)))
y_pred = lm.predict(X_test_cont)
print('Testing R^2: ', lm.score(X_test_cont, y_test))
print('Testing RMSE: ', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

In [None]:
lasso = Lasso()
lasso.fit(X_train_all, y_train)
y_train_pred = lasso.predict(X_train_all)
print('Training R^2: ', lasso.score(X_train_all, y_train))
print('Training RMSE: ', np.sqrt(metrics.mean_squared_error(y_train, y_train_pred)))
y_test_pred = lasso.predict(X_test_all)
print('Testing R^2: ', lasso.score(X_test_all, y_test))
print('Testing RMSE: ', np.sqrt(metrics.mean_squared_error(y_test, y_test_pred)))

# Third Model

In [None]:
ss = StandardScaler()
X_train

In [167]:
df3 = df.copy()
df3

Unnamed: 0,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,condition,grade,sqft_above,zipcode,lat,long,sqft_living15,sqft_lot15,sale_age,renovated,basement,viewed
0,365000.0,4,2.25,2070,8893,2.0,0,4,8,2070,98058,47.4388,-122.162,2390,7700,28,0,0,0
1,865000.0,5,3.00,2900,6730,1.0,0,5,8,1830,98115,47.6784,-122.285,2370,6283,37,0,1,0
2,1038000.0,4,2.50,3770,10893,2.0,0,3,11,3770,98006,47.5646,-122.129,3710,9685,17,0,0,1
3,1490000.0,3,3.50,4560,14608,2.0,0,3,12,4560,98034,47.6995,-122.228,4050,14226,25,0,0,1
4,711000.0,3,2.50,2550,5376,2.0,0,3,9,2550,98052,47.6647,-122.083,2250,4050,10,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
17285,750000.0,5,2.50,3240,9960,1.0,0,3,8,2020,98008,47.5858,-122.112,2730,10400,57,0,1,1
17286,440000.0,2,1.75,1300,4000,2.0,0,3,7,1300,98105,47.6687,-122.288,1350,4013,66,0,0,0
17287,310000.0,3,1.00,1050,9876,1.0,0,3,7,1050,98028,47.7635,-122.262,1760,9403,61,0,0,0
17288,427500.0,3,1.50,1900,43186,1.5,0,4,7,1300,98038,47.4199,-121.990,2080,108028,43,0,1,0


In [168]:
df3['log_sqft_liv'] = np.log(df3.sqft_living)
df3['log_sqft_liv15'] = np.log(df3.sqft_living15)
df3['log_sqft_lot'] = np.log(df3.sqft_lot)
df3['log_sqft_lot15'] = np.log(df3.sqft_lot15)

In [169]:
df3.drop('sqft_above', axis=1, inplace=True)
df3.drop(['floors', 'condition'], inplace=True, axis=1)

In [None]:
features = df3[['sqft_above', 'sale_year', ]]
features

In [None]:
target = df['price']
poly = PolynomialFeatures(degree=2, include_bias=False)
poly_data = poly.fit_transform(features)
poly_columns = poly.get_feature_names(features.columns)
df_poly = pd.DataFrame(poly_data, columns=poly_columns)
df_poly.head()

# Final Model

In [107]:
#instantiate a linear regression object
lm_final = LinearRegression()
#fit the linear regression to the data
lm_final = lm_final.fit(features[selected_columns], target)

KeyError: "['lat price', 'log_sqft_liv^2', 'price bathrooms', 'log_sqft_liv15 price', 'lat grade', 'log_sqft_liv', 'lat log_sqft_liv15', 'lat log_sqft_liv', 'price grade', 'log_sqft_liv bathrooms', 'log_sqft_liv15 grade', 'log_sqft_liv log_sqft_liv15', 'grade bathrooms', 'log_sqft_liv grade', 'price^2', 'log_sqft_liv price'] not in index"

In [79]:
lm_final.coef_

array([-2.36956282e+04,  1.73511005e+04, -1.12617809e+04, -3.78350078e-02,
        1.18160091e-01, -7.48147471e-01,  3.04086466e+02, -6.75290751e+02,
        8.37253482e-02, -9.55613607e-01, -3.29756252e-02,  7.60779907e-02,
        4.51008813e+01,  2.19510615e+02, -1.16575356e-01,  8.71207944e+01,
        4.44540510e-01,  8.44635978e+00,  2.67506561e+02,  2.09351280e-02])

## **Pickle**

In [80]:
pickle_out = open("model.pickle","wb")
pickle.dump(lm_final, pickle_out)
pickle_out.close()

In [83]:
pickle_out = open("scaler.pickle", "wb")
pickle.dump(scaler, pickle_out)
pickle_out.close

<function BufferedWriter.close>

# Prediction with Holdout Set

In [None]:
# read csv file
df = pd.read_csv('data/kc_house_data_test_features.csv', index_col=0)


In [None]:
# data preprocessing
df['sale_date'] = [x[:8] for x in df.date]
df.sale_date = df.sale_date.apply(lambda x: datetime.strptime(x, '%Y%m%d'))
df.drop(columns='date', inplace=True)
df.drop(['id'], inplace=True, axis=1)
df.replace({'bedrooms': {33: 3}}, inplace=True)
df.replace({'bedrooms': {11: 1}}, inplace=True)
df['sale_age'] = df.sale_date.dt.year - df[['yr_built', 'yr_renovated']].max(axis=1)
df.replace({'sale_age': {-1: 0}}, inplace=True)
df['renovated'] = df.yr_renovated.apply(lambda x: x if x==0 else 1)
df['basement'] = df.sqft_basement.apply(lambda x: x if x==0 else 1)
df['viewed'] = df.view.apply(lambda x: x if x==0 else 1)
df.drop(['yr_built', 'yr_renovated', 'sale_date', 'sqft_basement', 'view'], inplace=True, axis=1)

In [None]:
# dummy variables
index_dum = df[['bedrooms', 'bathrooms', 'floors', 'condition', 'grade']].columns
df_dum = pd.get_dummies(data=df, columns=index_dum, drop_first=True, prefix=['bdr', 'bth', 'flr', 'cnd', 'grd'])
# polynomial and interaction features
poly = PolynomialFeatures(degree=2, include_bias=False)
poly_data = poly.fit_transform(df_dum)
poly_columns = poly.get_feature_names(df_dum.columns)
df_poly = pd.DataFrame(poly_data, columns=poly_columns)

In [None]:
# subset identified by K-Best
features = df_poly[['sqft_living', 'sqft_above', 'sqft_living15', 'sqft_living^2',
       'sqft_living sqft_above', 'sqft_living zipcode', 'sqft_living lat',
       'sqft_living long', 'sqft_living sqft_living15', 'sqft_living viewed',
       'sqft_above^2', 'sqft_above zipcode', 'sqft_above lat',
       'sqft_above long', 'sqft_above sqft_living15', 'sqft_above viewed',
       'zipcode sqft_living15', 'lat sqft_living15', 'long sqft_living15',
       'sqft_living15^2']]

In [None]:
# Scaling
scaler = StandardScaler()
features = pd.DataFrame(data=scaler.fit_transform(features), columns=features.columns)

In [None]:
# Load pickle
with open('data/model.pickle', 'rb') as file:
    final_answer = pickle.load(file)
final_answers = final_answer.predict(features)

In [None]:
# Write prediction to CSV file
pd.DataFrame(final_answers, columns=['predictions']).to_csv('housing_preds_Steven_Yan.csv')