30 Days of Kaggle

This is [Day 9: Selecting Data For Modeling](https://www.kaggle.com/dansbecker/your-first-machine-learning-model).

In [1]:
import pandas as pd
import datetime as dt

melbourne_file_path = '../datasets/kaggle/melbourne-house-prices/melb_data.csv'
melbourne_data = pd.read_csv(melbourne_file_path)
melbourne_data.describe()

,Rooms,Price,Distance,Postcode,Bedroom2,Bathroom,Car,Landsize,BuildingArea,YearBuilt,Lattitude,Longtitude,Propertycount
count,13580.0,13580.0,13580.0,13580.0,13580.0,13580.0,13518.0,13580.0,7130.0,8205.0,13580.0,13580.0,13580.0
mean,2.937997,1075684.0,10.137776,3105.301915,2.914728,1.534242,1.610075,558.416127,151.96765,1964.684217,-37.809203,144.995216,7454.417378
std,0.955748,639310.7,5.868725,90.676964,0.965921,0.691712,0.962634,3990.669241,541.014538,37.273762,0.07926,0.103916,4378.581772
min,1.0,85000.0,0.0,3000.0,0.0,0.0,0.0,0.0,0.0,1196.0,-38.18255,144.43181,249.0
25%,2.0,650000.0,6.1,3044.0,2.0,1.0,1.0,177.0,93.0,1940.0,-37.856822,144.9296,4380.0
50%,3.0,903000.0,9.2,3084.0,3.0,1.0,2.0,440.0,126.0,1970.0,-37.802355,145.0001,6555.0
75%,3.0,1330000.0,13.0,3148.0,3.0,2.0,2.0,651.0,174.0,1999.0,-37.7564,145.058305,10331.0
max,10.0,9000000.0,48.1,3977.0,20.0,8.0,10.0,433014.0,44515.0,2018.0,-37.40853,145.52635,21650.0


In [2]:
melbourne_data.columns

Index(['Suburb', 'Address', 'Rooms', 'Type', 'Price', 'Method', 'SellerG',
       'Date', 'Distance', 'Postcode', 'Bedroom2', 'Bathroom', 'Car',
       'Landsize', 'BuildingArea', 'YearBuilt', 'CouncilArea', 'Lattitude',
       'Longtitude', 'Regionname', 'Propertycount'],
      dtype='object')

In [3]:
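# dropna(axis=0) drops every row that contains at least one missing value
melbourne_data = melbourne_data.dropna(axis=0)
# Prediction target
y = melbourne_data.Price
# Feature columns, spelled exactly as they appear in the data set
melbourne_features = ['Rooms', 'Bathroom', 'Landsize', 'Lattitude', 'Longtitude']
x = melbourne_data[melbourne_features]
x.describe()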

,Rooms,Bathroom,Landsize,Lattitude,Longtitude
count,6196.0,6196.0,6196.0,6196.0,6196.0
mean,2.931407,1.57634,471.00694,-37.807904,144.990201
std,0.971079,0.711362,897.449881,0.07585,0.099165
min,1.0,1.0,0.0,-38.16492,144.54237
25%,2.0,1.0,152.0,-37.855438,144.926198
50%,3.0,1.0,373.0,-37.80225,144.9958
75%,4.0,2.0,628.0,-37.7582,145.0527
max,8.0,8.0,37000.0,-37.45709,145.52635
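
The row count falls from 13580 to 6196 because `dropna(axis=0)` removes any row with a missing value in any column, and columns such as BuildingArea and YearBuilt are only partly filled. Here is a minimal sketch of that behaviour on a made-up three-row frame (the values are illustrative, not taken from the Melbourne file):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: two of the three rows have a missing value somewhere
toy = pd.DataFrame({
    "Rooms": [2, 3, 4],
    "BuildingArea": [79.0, np.nan, 142.0],
    "YearBuilt": [1900.0, 1990.0, np.nan],
})

# Count of missing values per column
print(toy.isna().sum())

# axis=0 drops rows (not columns) that contain at least one NaN
cleaned = toy.dropna(axis=0)
print(len(toy), "rows before,", len(cleaned), "rows after")
```

Only the first row survives, even though every row has a valid Rooms value. Running `melbourne_data.isna().sum()` before the `dropna` call would show which columns are responsible for the dropped rows.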


In [4]:
x.head()

,Rooms,Bathroom,Landsize,Lattitude,Longtitude
1,2,1.0,156.0,-37.8079,144.9934
2,3,2.0,134.0,-37.8093,144.9944
4,4,1.0,120.0,-37.8072,144.9941
6,3,2.0,245.0,-37.8024,144.9993
7,2,1.0,256.0,-37.806,144.9954


In [5]:
from sklearn.tree import DecisionTreeRegressor

melbourne_model = DecisionTreeRegressor(random_state=1)
# Fit the model
melbourne_model.fit(x, y)

DecisionTreeRegressor(random_state=1)

Now we have a decision tree model trained on five features. Let's make some predictions:

In [6]:
print('Making predictions for the following 5 houses')
print(x.head())
print('Predictions')
print(melbourne_model.predict(x.head()))

Making predictions for the following 5 houses
   Rooms  Bathroom  Landsize  Lattitude  Longtitude
1      2       1.0     156.0   -37.8079    144.9934
2      3       2.0     134.0   -37.8093    144.9944
4      4       1.0     120.0   -37.8072    144.9941
6      3       2.0     245.0   -37.8024    144.9993
7      2       1.0     256.0   -37.8060    144.9954
Predictions
[1035000. 1465000. 1600000. 1876000. 1636000.]


That's enough lecture.  Time for exercises.

Part 2 of Day 9: how do we validate a model?

Mean Absolute Error

In [7]:
from sklearn.metrics import mean_absolute_error

predicted_home_prices = melbourne_model.predict(x)
mean_absolute_error(y, predicted_home_prices)

1115.7467183128902

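My in-sample MAE isn't the same as the 434.71594577146544 reported in the lesson. Why not? One thing to check is whether the lesson uses the same feature list, since the error depends on which columns the tree sees.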
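
A related point: a fully grown decision tree typically has non-zero training error only when some rows share identical feature values but different targets, so the in-sample MAE depends on which feature columns are chosen. This is a minimal sketch with made-up numbers (not the Melbourne data), and all the `toy_` and `wide_` names are illustrative:

```python
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

# Made-up data: rows 0 and 1 have identical features but different prices
toy_X = [[2, 1.0], [2, 1.0], [3, 2.0]]
toy_y = [100.0, 200.0, 300.0]

toy_model = DecisionTreeRegressor(random_state=1)
toy_model.fit(toy_X, toy_y)
toy_pred = toy_model.predict(toy_X)

# The tree cannot separate rows 0 and 1, so it predicts their mean for both
toy_mae = mean_absolute_error(toy_y, toy_pred)
print(toy_pred, toy_mae)

# Adding a column that tells the two rows apart lets the tree memorise them
wide_X = [[2, 1.0, 10], [2, 1.0, 20], [3, 2.0, 30]]
wide_model = DecisionTreeRegressor(random_state=1)
wide_model.fit(wide_X, toy_y)
wide_mae = mean_absolute_error(toy_y, wide_model.predict(wide_X))
print(wide_mae)
```

With two features the in-sample MAE is non-zero; with the extra column it drops to zero. On the notebook's data, `x.duplicated().sum()` would count how many rows repeat an earlier row's feature values, which may help explain the gap.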

The train-test split is next.


In [8]:
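from sklearn.model_selection import train_test_split

# Hold out part of the data for validation (default split: 75% train, 25% validation)
train_X, val_X, train_y, val_y = train_test_split(x, y, random_state=0)
# No random_state is set on the model here, so the fitted tree can vary between runs
melbourne_model = DecisionTreeRegressor()
melbourne_model.fit(train_X, train_y)

# Score on the held-out rows only
val_predictions = melbourne_model.predict(val_X)
print(mean_absolute_error(val_y, val_predictions))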

270926.3249408221
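
To see what `train_test_split` does with its default arguments, here is a small sketch on synthetic arrays (the `demo_` names and data are made up for illustration). It shows the default 75/25 proportions and that a fixed `random_state` reproduces the same split:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 rows, 2 feature columns
demo_X = np.arange(200).reshape(100, 2)
demo_y = np.arange(100)

# With no test_size or train_size given, 25% of the rows are held out
a_train, a_val, ya_train, ya_val = train_test_split(demo_X, demo_y, random_state=0)
print(len(a_train), len(a_val))

# The same random_state gives the same shuffle, hence the same split
b_train, b_val, yb_train, yb_val = train_test_split(demo_X, demo_y, random_state=0)
same_split = np.array_equal(a_val, b_val)
print(same_split)
```

The split is therefore reproducible, but the tree fitted on it may not be: scikit-learn permutes features at each split, so a `DecisionTreeRegressor()` created without `random_state` can break ties differently from run to run.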


My value differs from the 261425.9928986443 reported in the lesson. Were there values in the data set that needed to be removed?

The lesson points out that the training error was ~$500, but the validation error is ~$250K. Since the average home value is $1.1M, the validation error is roughly 25% of the average price. We'll need a better model.
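
The 25% figure can be checked from the numbers printed earlier. This is only a back-of-the-envelope sketch. It divides the validation MAE from cell 8 by the mean Price from the `describe()` output in cell 1, which was computed before `dropna`, so the mean of the filtered rows may differ somewhat. The variable names are illustrative.

```python
# Values copied from the notebook outputs above
val_mae = 270926.3249408221   # validation MAE printed in cell 8
mean_price = 1075684.0        # mean Price from describe(), before dropna

# Error as a fraction of the typical sale price
relative_error = val_mae / mean_price
print(f"Validation MAE is about {relative_error:.0%} of the mean price")
```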

More exercises: the Iowa home prices data set from Day 8.

In [9]:
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

iowa_file_path = '../datasets/kaggle/iowa-house-prices/train.csv'
home_data = pd.read_csv(iowa_file_path)
y = home_data.SalePrice
print(home_data.columns)


Index(['Order', 'PID', 'MS SubClass', 'MS Zoning', 'Lot Frontage', 'Lot Area',
       'Street', 'Alley', 'Lot Shape', 'Land Contour', 'Utilities',
       'Lot Config', 'Land Slope', 'Neighborhood', 'Condition 1',
       'Condition 2', 'Bldg Type', 'House Style', 'Overall Qual',
       'Overall Cond', 'Year Built', 'Year Remod/Add', 'Roof Style',
       'Roof Matl', 'Exterior 1st', 'Exterior 2nd', 'Mas Vnr Type',
       'Mas Vnr Area', 'Exter Qual', 'Exter Cond', 'Foundation', 'Bsmt Qual',
       'Bsmt Cond', 'Bsmt Exposure', 'BsmtFin Type 1', 'BsmtFin SF 1',
       'BsmtFin Type 2', 'BsmtFin SF 2', 'Bsmt Unf SF', 'Total Bsmt SF',
       'Heating', 'Heating QC', 'Central Air', 'Electrical', '1st Flr SF',
       '2nd Flr SF', 'Low Qual Fin SF', 'Gr Liv Area', 'Bsmt Full Bath',
       'Bsmt Half Bath', 'Full Bath', 'Half Bath', 'Bedroom AbvGr',
       'Kitchen AbvGr', 'Kitchen Qual', 'TotRms AbvGrd', 'Functional',
       'Fireplaces', 'Fireplace Qu', 'Garage Type', 'Garage Yr Blt',
      

In [12]:
feature_columns = ['Lot Area', 'Year Built', '1st Flr SF', '2nd Flr SF', 'Bedroom AbvGr', 'TotRms AbvGrd']
X = home_data[feature_columns]
iowa_model = DecisionTreeRegressor(random_state=1)
iowa_model.fit(X, y)
print('First in-sample predictions: ', iowa_model.predict(X.head()))
print('Actual target values       : ', y.head().tolist())


First in-sample predictions:  [159000. 271900. 137500. 248500. 167000.]
Actual target values       :  [159000, 271900, 137500, 248500, 167000]


In [17]:
from sklearn.model_selection import train_test_split
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state = 1)
iowa_model.fit(train_X, train_y)
iowa_prediction = iowa_model.predict(val_X)
print(val_X.head())
print('predicted: ', iowa_model.predict(val_X.head()))
print('actual   : ', val_y.head().tolist())

      Lot Area  Year Built  1st Flr SF  2nd Flr SF  Bedroom AbvGr  \
2145     10266        1952         768         768              4   
306       5150        1910         671         378              2   
2167      9060        1957         967         671              4   
854      17671        1882         916         826              4   
439      12929        1960        1081           0              3   

      TotRms AbvGrd  
2145              7  
306               5  
2167              6  
854               8  
439               5  
predicted:  [165150.  37900. 140200. 152000. 155000.]
actual   :  [136000, 80900, 139000, 168000, 148000]


In [18]:
from sklearn.metrics import mean_absolute_error
val_mae = mean_absolute_error(val_y, iowa_prediction)
print(val_mae)

29366.936363636363


Not a bad value. Finished with Day 9; on to Day 10: overfitting and underfitting.
