1. Business Understanding
2. Data Mining
3. Data Cleaning
4. Data Exploration
5. Feature Engineering
6. Predictive Modelling
7. Data Visualisation

# King County House Prices
An analysis by Vivika Wilde (wilde.vivika@gmail.com).

## Objective

## Set up

In [1]:
%reset -fs
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
import statsmodels.formula.api as smf
import datetime
from scipy import stats
import collections


%matplotlib inline
#%matplotlib notebook

kc = pd.read_csv('/Users/vivika/nf-may-20/hh-2020-ds1-Project-EDA/King_County_House_prices_dataset.csv')

## Variable Names and Descriptions
From the project description:

* **id** - unique identifier for a house
* **date** - Date house was sold
* **price** - Price is prediction target
* **bedrooms** - Number of Bedrooms/House
* **bathrooms** - Number of bathrooms/bedrooms
* **sqft_living** - square footage of the home
* **sqft_lot** - square footage of the lot
* **floors** - Total floors (levels) in house
* **waterfront** - House which has a view to a waterfront
* **view** - Has been viewed
* **condition** - How good the condition is ( Overall )
* **grade** - overall grade given to the housing unit, based on King County grading system
* **sqft_above** - square footage of house apart from basement
* **sqft_basement** - square footage of the basement
* **yr_built** - Built Year
* **yr_renovated** - Year when house was renovated
* **zipcode** - zip
* **lat** - Latitude coordinate
* **long** - Longitude coordinate
* **sqft_living15** - The square footage of interior housing living space for the nearest 15 neighbors
* **sqft_lot15** - The square footage of the land lots of the nearest 15 neighbors

## Data Types & Missings

In [2]:
kc.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21597 entries, 0 to 21596
Data columns (total 21 columns):
id               21597 non-null int64
date             21597 non-null object
price            21597 non-null float64
bedrooms         21597 non-null int64
bathrooms        21597 non-null float64
sqft_living      21597 non-null int64
sqft_lot         21597 non-null int64
floors           21597 non-null float64
waterfront       19221 non-null float64
view             21534 non-null float64
condition        21597 non-null int64
grade            21597 non-null int64
sqft_above       21597 non-null int64
sqft_basement    21597 non-null object
yr_built         21597 non-null int64
yr_renovated     17755 non-null float64
zipcode          21597 non-null int64
lat              21597 non-null float64
long             21597 non-null float64
sqft_living15    21597 non-null int64
sqft_lot15       21597 non-null int64
dtypes: float64(8), int64(11), object(2)
memory usage: 3.5+ MB


In [3]:
kc.head(10)

Unnamed: 0,id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,...,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
0,7129300520,10/13/2014,221900.0,3,1.0,1180,5650,1.0,,0.0,...,7,1180,0.0,1955,0.0,98178,47.5112,-122.257,1340,5650
1,6414100192,12/9/2014,538000.0,3,2.25,2570,7242,2.0,0.0,0.0,...,7,2170,400.0,1951,1991.0,98125,47.721,-122.319,1690,7639
2,5631500400,2/25/2015,180000.0,2,1.0,770,10000,1.0,0.0,0.0,...,6,770,0.0,1933,,98028,47.7379,-122.233,2720,8062
3,2487200875,12/9/2014,604000.0,4,3.0,1960,5000,1.0,0.0,0.0,...,7,1050,910.0,1965,0.0,98136,47.5208,-122.393,1360,5000
4,1954400510,2/18/2015,510000.0,3,2.0,1680,8080,1.0,0.0,0.0,...,8,1680,0.0,1987,0.0,98074,47.6168,-122.045,1800,7503
5,7237550310,5/12/2014,1230000.0,4,4.5,5420,101930,1.0,0.0,0.0,...,11,3890,1530.0,2001,0.0,98053,47.6561,-122.005,4760,101930
6,1321400060,6/27/2014,257500.0,3,2.25,1715,6819,2.0,0.0,0.0,...,7,1715,?,1995,0.0,98003,47.3097,-122.327,2238,6819
7,2008000270,1/15/2015,291850.0,3,1.5,1060,9711,1.0,0.0,,...,7,1060,0.0,1963,0.0,98198,47.4095,-122.315,1650,9711
8,2414600126,4/15/2015,229500.0,3,1.0,1780,7470,1.0,0.0,0.0,...,7,1050,730.0,1960,0.0,98146,47.5123,-122.337,1780,8113
9,3793500160,3/12/2015,323000.0,3,2.5,1890,6560,2.0,0.0,0.0,...,7,1890,0.0,2003,0.0,98038,47.3684,-122.031,2390,7570


In [4]:
kc.tail()

Unnamed: 0,id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,...,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
21592,263000018,5/21/2014,360000.0,3,2.5,1530,1131,3.0,0.0,0.0,...,8,1530,0.0,2009,0.0,98103,47.6993,-122.346,1530,1509
21593,6600060120,2/23/2015,400000.0,4,2.5,2310,5813,2.0,0.0,0.0,...,8,2310,0.0,2014,0.0,98146,47.5107,-122.362,1830,7200
21594,1523300141,6/23/2014,402101.0,2,0.75,1020,1350,2.0,0.0,0.0,...,7,1020,0.0,2009,0.0,98144,47.5944,-122.299,1020,2007
21595,291310100,1/16/2015,400000.0,3,2.5,1600,2388,2.0,,0.0,...,8,1600,0.0,2004,0.0,98027,47.5345,-122.069,1410,1287
21596,1523300157,10/15/2014,325000.0,2,0.75,1020,1076,2.0,0.0,0.0,...,7,1020,0.0,2008,0.0,98144,47.5941,-122.299,1020,1357


In [5]:
missing = pd.DataFrame(kc.isnull().sum(),columns=['Number'])
missing['Percentage'] = round(missing.Number/kc.shape[0]*100,1)
missing[missing.Number!=0]

Unnamed: 0,Number,Percentage
waterfront,2376,11.0
view,63,0.3
yr_renovated,3842,17.8


Of the 21 variables, three show incomplete data:
* waterfront is missing 11% of the values
* view is missing 0.3% of the values
* yr_renovated is missing 17.8% of the values

In [6]:
# treat missing values as 0; assigning back avoids in-place fills on a column copy
fill_cols = ['waterfront', 'view', 'yr_renovated']
kc[fill_cols] = kc[fill_cols].fillna(0.0)

In [7]:
# years since last modernisation
kc['modernised'] = datetime.datetime.now().year - kc[['yr_built','yr_renovated']].max(axis=1)
kc = kc.drop(['yr_built','yr_renovated'], axis=1)
kc.head()

Unnamed: 0,id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,sqft_above,sqft_basement,zipcode,lat,long,sqft_living15,sqft_lot15,modernised
0,7129300520,10/13/2014,221900.0,3,1.0,1180,5650,1.0,0.0,0.0,3,7,1180,0.0,98178,47.5112,-122.257,1340,5650,65.0
1,6414100192,12/9/2014,538000.0,3,2.25,2570,7242,2.0,0.0,0.0,3,7,2170,400.0,98125,47.721,-122.319,1690,7639,29.0
2,5631500400,2/25/2015,180000.0,2,1.0,770,10000,1.0,0.0,0.0,3,6,770,0.0,98028,47.7379,-122.233,2720,8062,87.0
3,2487200875,12/9/2014,604000.0,4,3.0,1960,5000,1.0,0.0,0.0,5,7,1050,910.0,98136,47.5208,-122.393,1360,5000,55.0
4,1954400510,2/18/2015,510000.0,3,2.0,1680,8080,1.0,0.0,0.0,3,8,1680,0.0,98074,47.6168,-122.045,1800,7503,33.0
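
Because `datetime.datetime.now().year` changes with every rerun, the `modernised` values above only reproduce when the notebook is run in 2020. A minimal sketch of the same calculation with a fixed reference year follows; the toy frame and the name `REFERENCE_YEAR` are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for the kc frame; yr_renovated == 0 means "never renovated".
toy = pd.DataFrame({
    "yr_built": [1955, 1951, 1933],
    "yr_renovated": [0.0, 1991.0, 0.0],
})

REFERENCE_YEAR = 2020  # assumption: the year the notebook was originally run

# Row-wise max picks the renovation year when there is one, else the build year.
last_update = toy[["yr_built", "yr_renovated"]].max(axis=1)
toy["modernised"] = REFERENCE_YEAR - last_update
print(toy)
```

With these three rows the sketch gives 65, 29 and 87 years, matching the first three rows of the output above.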


In [8]:
# completing basement data
kc['sqft_basement'] = kc['sqft_living'] - kc['sqft_above']
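
Overwriting `sqft_basement` assumes that living area is always the sum of above-ground and basement area. A hedged way to check that assumption before overwriting is sketched below on toy rows that mimic the raw file, where the column is text and uses `?` for unknown values; the variable names are illustrative only.

```python
import pandas as pd

# Toy rows mimicking the raw file: sqft_basement is text and uses "?" for unknown.
toy = pd.DataFrame({
    "sqft_living": [1180, 2570, 1715],
    "sqft_above": [1180, 2170, 1715],
    "sqft_basement": ["0.0", "400.0", "?"],
})

# errors="coerce" turns unparseable entries such as "?" into NaN.
reported = pd.to_numeric(toy["sqft_basement"], errors="coerce")
derived = toy["sqft_living"] - toy["sqft_above"]

# Compare only the rows where a reported value exists.
known = reported.notna()
mismatches = (reported[known] != derived[known]).sum()
print(f"unknown: {(~known).sum()}, mismatches among known rows: {mismatches}")
```

If the mismatch count on the real data is zero, deriving the column as done above loses nothing and fills the unknown entries.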

In [9]:
#datetime.strptime(date, '%mm-%dd-%YYYY')
#datetime.fromisoformat('10/13/2014').timestamp()

kc['view'].unique()

array([0., 3., 4., 2., 1.])
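
The commented-out lines in the cell above hint at parsing the `date` column, which the notebook later drops instead. As a sketch only, the month/day/year strings could be parsed with `pd.to_datetime` and an explicit format; the sample strings are taken from the `head()` output and the derived names are hypothetical.

```python
import pandas as pd

# Sample date strings in the file's month/day/year layout.
dates = pd.Series(["10/13/2014", "12/9/2014", "2/25/2015"])

# %m/%d/%Y reads month, day and four-digit year separated by slashes.
parsed = pd.to_datetime(dates, format="%m/%d/%Y")

# Year and month of sale could then serve as numeric features.
sale_year = parsed.dt.year
sale_month = parsed.dt.month
print(pd.DataFrame({"sale_year": sale_year, "sale_month": sale_month}))
```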

In [10]:
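# drop the identifier, the unparsed date, and sqft_living (sum of sqft_above and sqft_basement)
kc = kc.drop(['id', 'date', 'sqft_living'], axis=1)

# yr_built and yr_renovated were already dropped in cell 7, so they are not listed here
cat = kc.filter(['waterfront',
         'view',
         'condition',
         'grade',
         'zipcode'], axis=1).astype("category")

kc = kc.drop(columns=cat.columns) # categorical columns are re-added as dummies later
kc = kc.apply(pd.to_numeric)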

In [11]:
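# removing outliers: rows where any column lies more than 10 standard deviations above its mean
outlier_mask = (np.asarray(stats.zscore(kc)) > 10).any(axis=1)
outlier_indices = kc.index[outlier_mask]
kc.drop(outlier_indices, inplace=True)
cat.drop(outlier_indices, inplace=True)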
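
The filter in this cell is one-sided: it flags a row only when some column lies far above its mean. A small self-contained sketch of that rule on toy data is shown below. It computes z-scores by hand with the population standard deviation (`ddof=0`, which is also the default of `scipy.stats.zscore`); the toy values and variable names are assumptions.

```python
import numpy as np
import pandas as pd

# Toy frame: 200 ordinary rows plus one row with an extreme lot size.
toy = pd.DataFrame({
    "sqft_lot": [5000 + 10 * (i % 10) for i in range(200)] + [5_000_000],
    "bedrooms": [2 + (i % 3) for i in range(201)],
})

# Column-wise z-scores using the population standard deviation (ddof=0).
z = (toy - toy.mean()) / toy.std(ddof=0)

# Flag a row when any of its columns exceeds the threshold.
mask = (z > 10).any(axis=1)
cleaned = toy.loc[~mask]
print(mask.sum(), len(cleaned))
```

Using `np.abs(z) > 10` instead would also catch extreme low values; the notebook does not do this, so the sketch keeps the one-sided test.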


In [12]:
#sns.pairplot(kc)

In [13]:
kc.describe().round()

Unnamed: 0,price,bedrooms,bathrooms,sqft_lot,floors,sqft_above,sqft_basement,lat,long,sqft_living15,sqft_lot15,modernised
count,21531.0,21531.0,21531.0,21531.0,21531.0,21531.0,21531.0,21531.0,21531.0,21531.0,21531.0,21531.0
mean,537512.0,3.0,2.0,13865.0,1.0,1784.0,291.0,48.0,-122.0,1985.0,12102.0,47.0
std,349499.0,1.0,1.0,29337.0,1.0,819.0,440.0,0.0,0.0,684.0,22386.0,29.0
min,78000.0,1.0,0.0,520.0,1.0,370.0,0.0,47.0,-123.0,399.0,651.0,5.0
25%,321000.0,3.0,2.0,5040.0,1.0,1190.0,0.0,47.0,-122.0,1490.0,5100.0,21.0
50%,450000.0,3.0,2.0,7600.0,2.0,1560.0,0.0,48.0,-122.0,1840.0,7619.0,43.0
75%,643000.0,4.0,2.0,10614.0,2.0,2210.0,560.0,48.0,-122.0,2360.0,10050.0,66.0
max,4210000.0,11.0,8.0,426450.0,4.0,9410.0,4130.0,48.0,-121.0,6210.0,275299.0,120.0


In [14]:
#creating dummies for categorical data
dummies = pd.DataFrame()
i_dummies = pd.DataFrame()

for i in cat:
    i_dummies = pd.get_dummies(cat[i], prefix=i, drop_first=True)
    dummies = pd.concat([dummies, i_dummies], axis=1)

kc_dum = pd.concat([kc, dummies], axis=1)
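
The loop above encodes one categorical column at a time. As a sketch, `pd.get_dummies` can also take the whole frame of category-typed columns in a single call, using each column name as the prefix. The toy frame `cat_toy` below is a stand-in for `cat`.

```python
import pandas as pd

# Toy stand-in for the cat frame: every column has category dtype.
cat_toy = pd.DataFrame({
    "waterfront": [0.0, 1.0, 0.0],
    "condition": [3, 5, 4],
}).astype("category")

# Column-by-column encoding, as in the notebook's loop.
looped = pd.concat(
    [pd.get_dummies(cat_toy[c], prefix=c, drop_first=True) for c in cat_toy],
    axis=1,
)

# One call on the whole frame; prefixes default to the column names.
direct = pd.get_dummies(cat_toy, drop_first=True)
print(direct.columns.tolist())
```

Both approaches drop the first level of each variable, so that level acts as the baseline in the regression.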


In [15]:
exp = list(kc_dum)
exp.remove('price')

In [16]:
kc_dum

Unnamed: 0,price,bedrooms,bathrooms,sqft_lot,floors,sqft_above,sqft_basement,lat,long,sqft_living15,...,zipcode_98146,zipcode_98148,zipcode_98155,zipcode_98166,zipcode_98168,zipcode_98177,zipcode_98178,zipcode_98188,zipcode_98198,zipcode_98199
0,221900.0,3,1.00,5650,1.0,1180,0,47.5112,-122.257,1340,...,0,0,0,0,0,0,1,0,0,0
1,538000.0,3,2.25,7242,2.0,2170,400,47.7210,-122.319,1690,...,0,0,0,0,0,0,0,0,0,0
2,180000.0,2,1.00,10000,1.0,770,0,47.7379,-122.233,2720,...,0,0,0,0,0,0,0,0,0,0
3,604000.0,4,3.00,5000,1.0,1050,910,47.5208,-122.393,1360,...,0,0,0,0,0,0,0,0,0,0
4,510000.0,3,2.00,8080,1.0,1680,0,47.6168,-122.045,1800,...,0,0,0,0,0,0,0,0,0,0
5,1230000.0,4,4.50,101930,1.0,3890,1530,47.6561,-122.005,4760,...,0,0,0,0,0,0,0,0,0,0
6,257500.0,3,2.25,6819,2.0,1715,0,47.3097,-122.327,2238,...,0,0,0,0,0,0,0,0,0,0
7,291850.0,3,1.50,9711,1.0,1060,0,47.4095,-122.315,1650,...,0,0,0,0,0,0,0,0,1,0
8,229500.0,3,1.00,7470,1.0,1050,730,47.5123,-122.337,1780,...,1,0,0,0,0,0,0,0,0,0
9,323000.0,3,2.50,6560,2.0,1890,0,47.3684,-122.031,2390,...,0,0,0,0,0,0,0,0,0,0


In [17]:
rs = []
for i in exp:
    X = kc_dum[i]
    X = sm.add_constant(X)
    y = kc_dum.price
    rs.append(sm.OLS(y,X).fit().rsquared)

rs_df = pd.DataFrame()
rs_df['explanatory_variable'] = exp
rs_df['r_squared'] = rs
rs_df.sort_values('r_squared', ascending=False).head(10)

     



Unnamed: 0,explanatory_variable,r_squared
4,sqft_above,0.358666
8,sqft_living15,0.357537
1,bathrooms,0.272148
27,grade_11,0.136389
26,grade_10,0.129875
23,grade_7,0.106324
6,lat,0.102579
0,bedrooms,0.101442
5,sqft_basement,0.098993
15,view_4.0,0.087638
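
For a regression with a single explanatory variable and an intercept, R-squared equals the squared Pearson correlation between that variable and the target. The ranking above can therefore be cross-checked without fitting one model per column. The sketch below uses a toy frame with a few rows copied from the `head()` output; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "price": [221900.0, 538000.0, 180000.0, 604000.0, 510000.0],
    "sqft_above": [1180, 2170, 770, 1050, 1680],
    "bedrooms": [3, 3, 2, 4, 3],
})

# Squared Pearson correlation of every explanatory column with price.
r2_from_corr = toy.drop(columns="price").corrwith(toy["price"]) ** 2

# Cross-check one column against an explicit least-squares fit with an intercept.
x = toy["sqft_above"].to_numpy(dtype=float)
y = toy["price"].to_numpy()
slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (slope * x + intercept)
r2_from_fit = 1 - (residuals ** 2).sum() / ((y - y.mean()) ** 2).sum()

print(r2_from_corr.sort_values(ascending=False))
```

The identity holds only for single-variable fits; it does not carry over to the multi-variable models fitted later.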


In [18]:
kc_dum.columns

Index(['price', 'bedrooms', 'bathrooms', 'sqft_lot', 'floors', 'sqft_above',
       'sqft_basement', 'lat', 'long', 'sqft_living15', 'sqft_lot15',
       'modernised', 'waterfront_1.0', 'view_1.0', 'view_2.0', 'view_3.0',
       'view_4.0', 'condition_2', 'condition_3', 'condition_4', 'condition_5',
       'grade_4', 'grade_5', 'grade_6', 'grade_7', 'grade_8', 'grade_9',
       'grade_10', 'grade_11', 'grade_12', 'grade_13', 'zipcode_98002',
       'zipcode_98003', 'zipcode_98004', 'zipcode_98005', 'zipcode_98006',
       'zipcode_98007', 'zipcode_98008', 'zipcode_98010', 'zipcode_98011',
       'zipcode_98014', 'zipcode_98019', 'zipcode_98022', 'zipcode_98023',
       'zipcode_98024', 'zipcode_98027', 'zipcode_98028', 'zipcode_98029',
       'zipcode_98030', 'zipcode_98031', 'zipcode_98032', 'zipcode_98033',
       'zipcode_98034', 'zipcode_98038', 'zipcode_98039', 'zipcode_98040',
       'zipcode_98042', 'zipcode_98045', 'zipcode_98052', 'zipcode_98053',
       'zipcode_98055', 'zipc

In [19]:
X = kc_dum[['sqft_above',
           'sqft_living15',
           'bathrooms',
           'grade_4', 'grade_5', 'grade_6', 'grade_7', 'grade_8', 'grade_9', 'grade_10', 'grade_11', 'grade_12', 'grade_13',
           'bedrooms',
           'sqft_basement']]
X = sm.add_constant(X)
y = kc_dum.price
sm.OLS(y,X).fit().summary()


0,1,2,3
Dep. Variable:,price,R-squared:,0.585
Model:,OLS,Adj. R-squared:,0.584
Method:,Least Squares,F-statistic:,2019.0
Date:,"Mon, 08 Jun 2020",Prob (F-statistic):,0.0
Time:,23:25:30,Log-Likelihood:,-295920.0
No. Observations:,21531,AIC:,591900.0
Df Residuals:,21515,BIC:,592000.0
Df Model:,15,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,1.801e+05,2.25e+05,0.799,0.424,-2.62e+05,6.22e+05
sqft_above,113.0901,4.217,26.820,0.000,104.825,121.355
sqft_living15,33.7286,3.726,9.051,0.000,26.424,41.033
bathrooms,-1871.2129,3256.564,-0.575,0.566,-8254.321,4511.895
grade_4,-6.132e+04,2.29e+05,-0.267,0.789,-5.11e+05,3.88e+05
grade_5,-4.166e+04,2.26e+05,-0.185,0.854,-4.84e+05,4.01e+05
grade_6,-1.433e+04,2.25e+05,-0.064,0.949,-4.56e+05,4.27e+05
grade_7,1.719e+04,2.25e+05,0.076,0.939,-4.25e+05,4.59e+05
grade_8,9.048e+04,2.25e+05,0.401,0.688,-3.51e+05,5.32e+05

0,1,2,3
Omnibus:,10873.677,Durbin-Watson:,1.978
Prob(Omnibus):,0.0,Jarque-Bera (JB):,151094.491
Skew:,2.094,Prob(JB):,0.0
Kurtosis:,15.283,Cond. No.,1390000.0


In [20]:
# drop returns a new frame, so kc_dum keeps its price column
X3 = kc_dum.drop('price', axis=1)
X3 = sm.add_constant(X3)
y3 = kc.price
sm.OLS(y3,X3).fit().summary()

0,1,2,3
Dep. Variable:,price,R-squared:,0.837
Model:,OLS,Adj. R-squared:,0.837
Method:,Least Squares,F-statistic:,1116.0
Date:,"Mon, 08 Jun 2020",Prob (F-statistic):,0.0
Time:,23:25:30,Log-Likelihood:,-285820.0
No. Observations:,21531,AIC:,571800.0
Df Residuals:,21431,BIC:,572600.0
Df Model:,99,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,-3.324e+07,5.43e+06,-6.117,0.000,-4.39e+07,-2.26e+07
bedrooms,-9625.7049,1448.579,-6.645,0.000,-1.25e+04,-6786.382
bathrooms,2.322e+04,2334.788,9.944,0.000,1.86e+04,2.78e+04
sqft_lot,0.3880,0.055,7.058,0.000,0.280,0.496
floors,-3.1e+04,2804.914,-11.052,0.000,-3.65e+04,-2.55e+04
sqft_above,153.3779,2.803,54.723,0.000,147.884,158.872
sqft_basement,102.9439,3.171,32.468,0.000,96.729,109.159
lat,1.699e+05,5.6e+04,3.031,0.002,6e+04,2.8e+05
long,-2.061e+05,4.04e+04,-5.105,0.000,-2.85e+05,-1.27e+05

0,1,2,3
Omnibus:,11857.072,Durbin-Watson:,1.992
Prob(Omnibus):,0.0,Jarque-Bera (JB):,407972.097
Skew:,2.047,Prob(JB):,0.0
Kurtosis:,23.928,Cond. No.,224000000.0
