# Machine Learning Example Case: House Sale Price Prediction (like Zillow's "Zestimate")

When you see a line starting with "TASK", do that task!

### TASK: Click on the next cell and press Shift-Enter
This executes the code in the cell.   
The result of the last command, or a representation of the last variable in the cell, is displayed.

In [154]:
import pandas as pd
housing = pd.read_csv('data/housing_processed.csv')
housing.head()

FileNotFoundError: [Errno 2] No such file or directory: 'data/housing_processed.csv'

### Filtering Columns
Some columns were not removed when equivalent coded ones were created

In [None]:
housing[["ExterQual","ExterQual_Coded"]].head()

### Filtering in a series
`dtypes` returns a Series.   
Filtering a Series works much like filtering a DataFrame: index it with a boolean mask.
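
A small sketch of the same idea on made-up data (the column names and values here are hypothetical, not taken from the housing file): a boolean Series acts as a mask, and the same indexing syntax filters a Series or the rows of a DataFrame.

```python
import pandas as pd

# Hypothetical toy frame; not the housing data.
toy = pd.DataFrame({
    "quality": pd.Series(["Gd", "TA", "Ex"], dtype=object),  # text column
    "area": [1500, 900, 2100],                               # numeric column
})

# Filtering a Series: dtypes is a Series indexed by column name.
is_text = toy.dtypes == object        # boolean Series (the mask)
text_columns = toy.dtypes[is_text]    # keeps entries where the mask is True

# Filtering a DataFrame uses the same syntax, applied to rows.
large = toy[toy["area"] > 1000]

print(list(text_columns.index))
print(len(large))
```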

In [None]:
type(housing.dtypes==object)

In [None]:
housing.dtypes[housing.dtypes==object]

In [None]:
housing.dtypes[housing.dtypes==object].shape

In [None]:
"SalePrice" in housing.columns 

### Removing Undesired Columns
In my case, a colleague had left the non-numeric columns above in the preprocessed file after creating the corresponding coded versions.

In [None]:
len(housing.columns)

In [None]:
# We could drop columns by name:
housing_ml = housing.drop(columns=["ExterQual"])

In [None]:
# or wholesale, keeping only numeric:
housing_ml = housing.loc[:,housing.dtypes != object]

In [None]:
len(housing_ml.columns)

# Separate Target into new Variable
- "SalePrice" is the target.    
 - The value we want to predict from other values (features) for a house.  
- Currently it is a column like the other features.   
- Scikit-learn needs two variables: the features (X) and the target (y), with the target to be predicted in its own 1-D array

# NumPy
- Both Pandas and scikit-learn are built on top of NumPy
- scikit-learn works on NumPy arrays internally; DataFrames passed to it are converted
- Here we build X and y explicitly as NumPy "ndarrays"
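
As a rough illustration of the shapes involved, here is a sketch on a made-up three-row frame (the column names `area`, `rooms` and `price` are hypothetical stand-ins, not columns from the housing file):

```python
import pandas as pd

# Hypothetical toy data standing in for the real feature table.
toy = pd.DataFrame({
    "area": [1500, 900, 2100],
    "rooms": [3, 2, 4],
    "price": [200000, 120000, 310000],
})

target = toy.pop("price")   # removes the column from toy and returns it as a Series
y_toy = target.values       # 1-D ndarray: one target value per house
X_toy = toy.values          # 2-D ndarray: one row per house, one column per feature

print(X_toy.shape, y_toy.shape)
```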

In [None]:
housing_ml.shape

In [None]:
# Split data as features and target
# take "SalePrice" values into its own 1-D array 
sale_price = housing_ml.pop('SalePrice')
type(sale_price)

In [None]:
# pop removes the column
# "in place" operation
# now housing_ml has one less column
housing_ml.shape

In [None]:
y = sale_price.values
type(y)

# See what other methods are available for ndarray

In [None]:
# press tab after putting cursor after dot "."
#y. #uncomment, press tab after . 

In [None]:
y.shape
# (1460,)
# a one-element tuple: the array has a single dimension of length 1460
# i.e. it is a 1-D array
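
A short sketch of how to read such a shape, using a made-up three-element array: the trailing comma is what makes `(3,)` a tuple, and `reshape` can turn the same data into a 2-D single-column array when one is needed.

```python
import numpy as np

a = np.array([10, 20, 30])          # made-up values

print(a.shape)                      # shape of a 1-D array of length 3
print(type((3,)), type((3)))        # tuple vs. plain int: the comma matters

column = a.reshape(-1, 1)           # same data as a 2-D array with one column
print(column.shape)
```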

### TASK: get the ndarray version of the feature dataframe and put it into variable X

In [None]:
X = housing_ml.values

### TASK: check the shape of X

In [None]:
X.shape

### TASK: programmatically check that X and y have a matching number of rows
You

In [None]:
X.shape[0] == y.shape[0]

# First Model
Q: What would you do if you had no features?

A: You would always estimate the average house price.

We will have to do much better than that.  
We have so much data to base our decision on.   
It can still serve us as a baseline to compare.   
An inferior baseline could be a random value between the minimum and maximum prices in the training data. 
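
To get a feel for the two baselines, here is a sketch on synthetic prices (the numbers are generated for illustration only and are not the housing data). It compares the root mean squared error of always guessing the mean with that of guessing uniformly at random between the minimum and maximum.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic "prices", purely illustrative.
prices = rng.normal(loc=180_000, scale=50_000, size=1_000)

def rmse(y_true, y_guess):
    """Root mean squared error between true values and guesses."""
    return float(np.sqrt(np.mean((y_true - y_guess) ** 2)))

mean_guess = np.full_like(prices, prices.mean())    # same guess for every house
random_guess = rng.uniform(prices.min(), prices.max(), size=prices.size)

mean_rmse = rmse(prices, mean_guess)
random_rmse = rmse(prices, random_guess)
print(mean_rmse < random_rmse)   # the mean baseline has the smaller error here
```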

In [None]:
# Import estimator
import sklearn
from sklearn.dummy import DummyRegressor
# Instantiate estimator
# guess the mean every single time
mean_reg = DummyRegressor(strategy='mean')
# fit estimator
mean_reg.fit(X, y)

In [None]:
# predict
mean_reg.predict(X)

## Evaluating The Model
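scikit-learn regressors have a `score` method.   
It tells you how much better your model does than a baseline that always predicts the mean.   
Technically, it is the coefficient of determination (R²): the fraction of the target's variance that the model explains.

The "mean" model *is* that baseline, so its score will be 0. A model that does worse than the mean gets a negative score.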

In [None]:
mean_reg.score(X, y)
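
As a sketch of what this score measures, assuming the usual definition R² = 1 - SS_res / SS_tot, the helper below (`r2` is a hypothetical name, and the four target values are made up) computes it by hand. Predicting the mean gives 0, a perfect prediction gives 1, and predictions worse than the mean give a negative value.

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)           # squared error of the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # squared error of the mean
    return float(1.0 - ss_res / ss_tot)

y_true = np.array([100.0, 150.0, 200.0, 250.0])       # made-up targets
mean_pred = np.full_like(y_true, y_true.mean())       # always guess the mean
reversed_pred = y_true[::-1].copy()                   # deliberately bad guesses

print(r2(y_true, mean_pred))       # mean baseline
print(r2(y_true, y_true))          # perfect prediction
print(r2(y_true, reversed_pred))   # worse than the mean
```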

## Fitting a linear model 
First, let's use only one feature 

In [None]:
from sklearn.linear_model import LinearRegression
linear_model = LinearRegression()

In [None]:
X_lf = housing_ml[['LotFrontage']]

In [None]:
linear_model.fit(X_lf, y)

Above, you see that it used defaults to create the estimator.   
You could google "LinearRegression sklearn" and find the documentation:
http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html
to see the options for the other parameters.
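
The defaults can also be inspected without leaving the notebook. A minimal sketch using `get_params`, which every scikit-learn estimator provides (the exact set of parameters listed depends on the installed version):

```python
from sklearn.linear_model import LinearRegression

model = LinearRegression()
params = model.get_params()   # dict of constructor parameters and current values

for name, value in sorted(params.items()):
    print(name, "=", value)
```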

In [None]:
y_pred = linear_model.predict(X_lf)

In [None]:
linear_model.score(X_lf, y)

### Chart Showing the Linear Fit
matplotlib is the most common visualization library

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
plt.figure(figsize=(12, 5))
plt.scatter(y, y_pred);

In [None]:
plt.scatter(X_lf,y)
plt.plot(X_lf,y_pred,'r--')

### TASK: add labels to these charts
search label:
https://matplotlib.org/tutorials/introductory/pyplot.html#sphx-glr-tutorials-introductory-pyplot-py


### TASK: try replacing scatter with plot
Do you see why scatter is needed for data rows?   
Also try replacing plot with scatter. 

# Effect of Using a Better Predictor 
Above-ground living area (GrLivArea) should be a better predictor than lot frontage!

In [None]:
X_area = housing_ml[['GrLivArea']]

In [None]:
linear_model.fit(X_area, y)

Calling `fit` again replaces the previous fit: `linear_model` now holds the model trained on GrLivArea.
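
A tiny sketch of that behavior on made-up arrays (not the housing data): fitting the same estimator object a second time overwrites its learned coefficients.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data: y is exactly 2 * X_a and exactly 0.2 * X_b.
X_a = np.array([[1.0], [2.0], [3.0]])
X_b = np.array([[10.0], [20.0], [30.0]])
y_toy = np.array([2.0, 4.0, 6.0])

model = LinearRegression()

model.fit(X_a, y_toy)
slope_a = model.coef_[0]    # slope learned from X_a

model.fit(X_b, y_toy)       # refit: the earlier coefficients are discarded
slope_b = model.coef_[0]    # slope learned from X_b

print(slope_a, slope_b)
```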

In [None]:
y_pred2 = linear_model.predict(X_area)
linear_model.score(X_area, y)

In [None]:
plt.figure(figsize=(12, 5))
plt.scatter(y, y_pred2); # blue obviously better
plt.scatter(y, y_pred); # orange

### TASK: add legend
Which color shows the prediction based on which feature?

# Using all predictors!

In [None]:
# We had 81 columns (80 features) in original dataset,
# coded as 221 features!
X.shape

In [None]:
linear_model.fit(X, y)

In [None]:
y_pred3 = linear_model.predict(X)

In [None]:
linear_model.score(X, y)

In [None]:
plt.figure(figsize=(12, 5))
plt.scatter(y_pred3, y);

In [None]:
import numpy as np
from sklearn.model_selection import train_test_split
random_state = 21
train_size = .8
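
The cells that follow evaluate each model on held-out rows rather than on the rows it was trained on. A minimal sketch of what `train_test_split` does with these two settings, using made-up arrays in place of the housing data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_toy = np.arange(20).reshape(10, 2)   # 10 rows, 2 features (made-up values)
y_toy = np.arange(10)

X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, train_size=0.8, random_state=21
)
print(X_train.shape, X_test.shape)     # 8 training rows, 2 test rows

# The same random_state reproduces the same split.
again = train_test_split(X_toy, y_toy, train_size=0.8, random_state=21)
print(np.array_equal(again[1], X_test))
```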


In [155]:
linear_model = LinearRegression()
X_lf = housing_ml[['YearBuilt']]
linear_model.fit(X_lf, y)
y_pred = linear_model.predict(X_lf)
linear_model.score(X_lf, y)

0.27342162073249154

In [156]:
# Fit one single-feature model per column and keep its training-set score
train_scores = {}
for feature in housing_ml.columns:
    linear_model = LinearRegression()
    X_lf = housing_ml[[feature]]
    linear_model.fit(X_lf, y)
    train_scores[feature] = linear_model.score(X_lf, y)
    

In [157]:
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.neighbors import KNeighborsRegressor

# For every single feature: fit on 80% of the rows, evaluate on the held-out 20%
mseDict = {}    # feature -> RMSE on the test set
scoreDict = {}  # feature -> R^2 score on the test set
for feature in housing_ml.columns:
    linear_model = LinearRegression()
    X_lf = housing_ml[[feature]]
    X_train, X_test, y_train, y_test = train_test_split(X_lf, y, train_size = .8, random_state = 21)
    linear_model.fit(X_train, y_train)
    y_pred = linear_model.predict(X_test)
    # RMSE: same value as squared = False, which is deprecated in recent scikit-learn releases
    mseDict[feature] = np.sqrt(mean_squared_error(y_test, y_pred))
    scoreDict[feature] = linear_model.score(X_test, y_test)

# Ascending RMSE and descending score give the same order on a shared test set (apart from ties)
mseDict = sorted(mseDict.items(), key=lambda x: x[1])
scoreDict = sorted(scoreDict.items(), key=lambda x: x[1], reverse = True)

top10 = mseDict[:10]
top10score = scoreDict[:10]
print('The best prediction is from', top10score[0][0], 'giving an R^2 score of', top10score[0][1])
print('Top 10 error:')

for i, ((name, rmse), (_, score)) in enumerate(zip(top10, top10score), start=1):
    print(str(i), ' ', name, ' ', rmse, ', score: ', score, sep = '')

The best prediction is from OverallQual giving an R^2 score of 0.6454631197278842
Top 10 error:
1 OverallQual 49018.435788124996, score: 0.6454631197278842
2 ExterQual_Coded 58260.14865535066, score: 0.4991753471267154
3 GrLivArea 61369.55515927838, score: 0.444289666781406
4 KitchenQual_Coded 61622.30113827634, score: 0.43970293723342646
5 TotalBsmtSF 62430.331675814006, score: 0.42491266111423975
6 1stFlrSF 62876.2562386846, score: 0.4166679048480356
7 GarageCars 63131.022100901646, score: 0.41193116659665807
8 GarageArea 63377.71018105261, score: 0.4073263621599067
9 BsmtQual_Coded 66831.5181312807, score: 0.3409700148262619
10 GarageFinish_Coded 68071.65468394081, score: 0.3162849544224505


In [158]:
# Names of the ten best single features, in ranking order
top10head = [name for name, _ in top10]

# All 45 unordered pairs of those ten features
group45 = []
for c, predictor in enumerate(top10head, start=1):
    for index in range(c, 10):
        group45.append([predictor, top10head[index]])

mseDict2 = {}    # 'Model: n' -> RMSE on the test set
scoreDict2 = {}  # 'Model: n' -> R^2 score on the test set

# 'Model: n' is the n-th pair in group45, counting from 1
for n, pair in enumerate(group45, start=1):
    linear_model = LinearRegression()
    X_lf = housing_ml[[pair[0], pair[1]]]
    X_train, X_test, y_train, y_test = train_test_split(X_lf, y, train_size = .8, random_state = 21)
    linear_model.fit(X_train, y_train)
    y_pred = linear_model.predict(X_test)
    # RMSE: same value as squared = False, which is deprecated in recent scikit-learn releases
    mseDict2['Model: ' + str(n)] = np.sqrt(mean_squared_error(y_test, y_pred))
    scoreDict2['Model: ' + str(n)] = linear_model.score(X_test, y_test)

mseDict2 = sorted(mseDict2.items(), key=lambda x: x[1])
scoreDict2 = sorted(scoreDict2.items(), key=lambda x: x[1], reverse = True)

top10_2 = mseDict2[:10]
top10score2 = scoreDict2[:10]
print()
print('The best prediction is from', top10score2[0][0], 'giving an R^2 score of', top10score2[0][1])
print('Top 10 error:')

for i, ((name, rmse), (_, score)) in enumerate(zip(top10_2, top10score2), start=1):
    print(str(i), ' ', name, ' ', rmse, ', score: ', score, sep = '')


The best prediction is from Model: 5 giving an R^2 score of 0.7173882583039418
Top 10 error:
1 Model: 5 43764.68831784155, score: 0.7173882583039418
2 Model: 4 44577.66409611639, score: 0.706791108152437
3 Model: 2 45673.46700931651, score: 0.6921986779040229
4 Model: 7 46468.13490082075, score: 0.6813946935673894
5 Model: 3 47070.07230442364, score: 0.6730869536074611
6 Model: 6 47124.910051666164, score: 0.6723247870365775
7 Model: 1 47433.82463056837, score: 0.6680147343521529
8 Model: 8 48237.754631121876, score: 0.6566660974647345
9 Model: 9 48571.27166300009, score: 0.6519020465859777
10 Model: 13 49640.643170742085, score: 0.6364054863411375
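
The labels "Model: n" do not say which two features each model used. As a sketch, assuming the pairs are numbered in the order the nested loops in the cell above generate them, the pairing can be recovered with `itertools.combinations`, which yields the pairs in that same order. The feature list is copied from the single-feature ranking printed earlier; `top10_features` and `labels` are names introduced here for illustration.

```python
from itertools import combinations

# Feature order copied from the single-feature ranking printed earlier.
top10_features = [
    "OverallQual", "ExterQual_Coded", "GrLivArea", "KitchenQual_Coded",
    "TotalBsmtSF", "1stFlrSF", "GarageCars", "GarageArea",
    "BsmtQual_Coded", "GarageFinish_Coded",
]

pairs = list(combinations(top10_features, 2))   # every unordered pair, once
print(len(pairs))

# Map each label to its pair, counting from 1 as the cell above does.
labels = {f"Model: {n}": pair for n, pair in enumerate(pairs, start=1)}
print(labels["Model: 5"])
```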


In [159]:
# Linear regression on all features, evaluated on the held-out 20%
linear_model = LinearRegression()
X_lf = housing_ml
X_train, X_test, y_train, y_test = train_test_split(X_lf, y, train_size = .8, random_state = 21)
linear_model.fit(X_train, y_train)
y_pred = linear_model.predict(X_test)
# RMSE (square root of the mean squared error), in the same units as SalePrice
allMSE = np.sqrt(mean_squared_error(y_test, y_pred))
allScore = linear_model.score(X_test, y_test)
print('Linear model all features:')
print('Root Mean Squared Error:', allMSE)
print('Score:', allScore)

Linear model all features:
Root Mean Squared Error: 35623.303450008585
Score: 0.8127547098527779


In [160]:
# k-nearest-neighbors regression with k = 5 on all features
KNR_model = KNeighborsRegressor(n_neighbors=5)
X_lf = housing_ml
X_train, X_test, y_train, y_test = train_test_split(X_lf, y, train_size = .8, random_state = 21)
KNR_model.fit(X_train, y_train)
y_pred = KNR_model.predict(X_test)
# RMSE (square root of the mean squared error)
knr5MSE = np.sqrt(mean_squared_error(y_test, y_pred))
knr5Score = KNR_model.score(X_test, y_test)
print('KNR5:')
print('Root Mean Squared Error:', knr5MSE)
print('Score:', knr5Score)

KNR5:
Root Mean Squared Error: 51966.07983368139
Score: 0.6015421069563245


In [161]:
# k-nearest-neighbors regression with k = 10 on all features
KNR_model = KNeighborsRegressor(n_neighbors=10)
X_lf = housing_ml
X_train, X_test, y_train, y_test = train_test_split(X_lf, y, train_size = .8, random_state = 21)
KNR_model.fit(X_train, y_train)
y_pred = KNR_model.predict(X_test)
# RMSE (square root of the mean squared error)
knr10MSE = np.sqrt(mean_squared_error(y_test, y_pred))
knr10Score = KNR_model.score(X_test, y_test)
print('KNR10:')
print('Root Mean Squared Error:', knr10MSE)
print('Score:', knr10Score)

KNR10:
Root Mean Squared Error: 52392.16317256475
Score: 0.5949812006956909


In [162]:
print('Linear regression using all features seems to perform best')

Linear regression using all features seems to perform best
