# DNN Regressor

California Housing Data

This data set contains information about all the block groups in California from the 1990 Census. In this sample a block group on average includes 1425.5 individuals living in a geographically compact area. 

The task is to aproximate the median house value of each block from the values of the rest of the variables. 

 It has been obtained from the LIACC repository. The original page where the data set can be found is: http://www.liaad.up.pt/~ltorgo/Regression/DataSets.html.
 

The Features:
 
* housingMedianAge: continuous. 
* totalRooms: continuous. 
* totalBedrooms: continuous. 
* population: continuous. 
* households: continuous. 
* medianIncome: continuous. 
* medianHouseValue: continuous. 

## The Data

** Import the cal_housing_clean.csv file with pandas. Separate it into a training (70%) and testing set(30%).**

In [1]:
# Import Dependencies
import numpy as np
import pandas as pd
import tensorflow as tf
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
# Import Data
data = pd.read_csv('cal_housing_clean.csv')

In [3]:
data.head()

Unnamed: 0,housingMedianAge,totalRooms,totalBedrooms,population,households,medianIncome,medianHouseValue
0,41.0,880.0,129.0,322.0,126.0,8.3252,452600.0
1,21.0,7099.0,1106.0,2401.0,1138.0,8.3014,358500.0
2,52.0,1467.0,190.0,496.0,177.0,7.2574,352100.0
3,52.0,1274.0,235.0,558.0,219.0,5.6431,341300.0
4,52.0,1627.0,280.0,565.0,259.0,3.8462,342200.0


In [4]:
data.describe()

Unnamed: 0,housingMedianAge,totalRooms,totalBedrooms,population,households,medianIncome,medianHouseValue
count,20640.0,20640.0,20640.0,20640.0,20640.0,20640.0,20640.0
mean,28.639486,2635.763081,537.898014,1425.476744,499.53968,3.870671,206855.816909
std,12.585558,2181.615252,421.247906,1132.462122,382.329753,1.899822,115395.615874
min,1.0,2.0,1.0,3.0,1.0,0.4999,14999.0
25%,18.0,1447.75,295.0,787.0,280.0,2.5634,119600.0
50%,29.0,2127.0,435.0,1166.0,409.0,3.5348,179700.0
75%,37.0,3148.0,647.0,1725.0,605.0,4.74325,264725.0
max,52.0,39320.0,6445.0,35682.0,6082.0,15.0001,500001.0


In [5]:
# Features
X = data.drop('medianHouseValue', axis=1)

# Labels 
y = data['medianHouseValue']

In [6]:
# Train Test Split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.30)

In [7]:
print(X_train)

       housingMedianAge  totalRooms  totalBedrooms  population  households  \
4926               29.0      1419.0          363.0      1696.0       317.0   
19178              14.0      5315.0         1037.0      2228.0       950.0   
20451               5.0     25187.0         3521.0     11956.0      3478.0   
5182               37.0      1002.0          270.0      1092.0       273.0   
15879              52.0      2164.0          606.0      2034.0       513.0   
10720              17.0      4477.0          610.0      1798.0       612.0   
4214               47.0      1375.0          359.0      1512.0       418.0   
6154               21.0      2215.0          484.0      1792.0       419.0   
12291              29.0      2793.0          722.0      1583.0       626.0   
16821              25.0      7778.0         1493.0      4674.0      1451.0   
18453              18.0      2127.0          387.0      1547.0       402.0   
20426              11.0      1177.0          138.0       415.0  

In [8]:
print(y_train)

4926     101300.0
19178    208400.0
20451    321300.0
5182      94500.0
15879    178100.0
10720    410400.0
4214     208900.0
6154     166500.0
12291     73200.0
16821    272400.0
18453    217100.0
20426    500001.0
11549    264900.0
3713     226400.0
14561    237300.0
18446    291500.0
15771    500001.0
19900     64600.0
12383    121300.0
9961     339200.0
1857     105900.0
19741     55600.0
6131     139400.0
1667     198500.0
1821     288200.0
11817    201900.0
11541    214800.0
17278    340400.0
3799     322900.0
17695    123500.0
           ...   
14048    154400.0
11119    254100.0
12451     73400.0
19860     64900.0
14647    143800.0
9491     118800.0
13138    103600.0
16480    244200.0
14713    243400.0
11318    160900.0
16662    247000.0
15449    182600.0
4081     376100.0
11641    386700.0
19316    296800.0
11366    225200.0
2902      63800.0
5586     268300.0
16161    500001.0
2072      72000.0
8807     500001.0
4504     117200.0
14078    190300.0
1242      67500.0
2711      

### Scale the Feature Data

** Use sklearn preprocessing to create a MinMaxScaler for the feature data. Fit this scaler only to the training data. Then use it to transform X_test and X_train. Then use the scaled X_test and X_train along with pd.Dataframe to re-create two dataframes of scaled data.**

In [9]:
# Scale the Feature Data
# Do scaling only on Training Data

# MinMaxScaler: Transforms features by scaling each feature to a given range.

# Transformation is Given by:
# X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
# X_scaled = X_std * (max - min) + min

from sklearn.preprocessing import MinMaxScaler

In [10]:
scaler = MinMaxScaler()

scaler.fit(X_train)

MinMaxScaler(copy=True, feature_range=(0, 1))

In [11]:
# Make X_train to be the Scaled Version of Data
# X_train has 6 columns
# This process scales all the values in all 6 columns and replaces them with the new values
X_train = pd.DataFrame(data=scaler.transform(X_train), columns=X_train.columns, index=X_train.index)

In [12]:
print(X_train)

       housingMedianAge  totalRooms  totalBedrooms  population  households  \
4926           0.549020    0.035941       0.056176    0.047451    0.051965   
19178          0.254902    0.135041       0.160770    0.062362    0.156060   
20451          0.078431    0.640510       0.546245    0.335015    0.571781   
5182           0.705882    0.025334       0.041744    0.030522    0.044729   
15879          1.000000    0.054891       0.093886    0.056924    0.084197   
10720          0.313725    0.113725       0.094507    0.050310    0.100477   
4214           0.901961    0.034822       0.055556    0.042294    0.068574   
6154           0.392157    0.056189       0.074953    0.050142    0.068739   
12291          0.549020    0.070891       0.111887    0.044284    0.102779   
16821          0.470588    0.197690       0.231533    0.130917    0.238448   
18453          0.333333    0.053950       0.059901    0.043275    0.065943   
20426          0.196078    0.029786       0.021260    0.011547  

In [13]:
# Do same Scaling for Test Features
scal = MinMaxScaler()
scal.fit(X_test)
X_test = pd.DataFrame(data=scal.transform(X_test), columns=X_test.columns, index=X_test.index)

In [14]:
print(X_test)

       housingMedianAge  totalRooms  totalBedrooms  population  households  \
19191          0.901961    0.028142       0.028744    0.019612    0.028130   
1687           0.294118    0.153469       0.138994    0.148765    0.143225   
13806          0.686275    0.067983       0.104576    0.090188    0.097068   
9555           0.470588    0.053944       0.058434    0.051868    0.052298   
3732           0.568627    0.062586       0.096634    0.067480    0.097861   
19511          0.568627    0.053632       0.075076    0.120766    0.075277   
19708          0.411765    0.070573       0.073374    0.067738    0.079437   
15452          0.058824    0.182329       0.168684    0.146829    0.158281   
6168           0.666667    0.058935       0.066377    0.070125    0.075277   
15056          0.431373    0.179614       0.237519    0.200116    0.237718   
17284          0.568627    0.073724       0.078480    0.067544    0.080626   
11019          0.588235    0.144203       0.117625    0.119089  

### Create Feature Columns

** Create the necessary tf.feature_column objects for the estimator. They should all be trated as continuous numeric_columns. **

In [15]:
data.columns

Index(['housingMedianAge', 'totalRooms', 'totalBedrooms', 'population',
       'households', 'medianIncome', 'medianHouseValue'],
      dtype='object')

In [16]:
# Make Feature Columns
age = tf.feature_column.numeric_column('housingMedianAge')
rooms = tf.feature_column.numeric_column('totalRooms')
bedrooms = tf.feature_column.numeric_column('totalBedrooms')
population = tf.feature_column.numeric_column('population')
households = tf.feature_column.numeric_column('households')
med_income = tf.feature_column.numeric_column('medianIncome')

In [17]:
feat_cols = [age, rooms, bedrooms, population, households, med_income]

** Create the input function for the estimator object. (play around with batch_size and num_epochs)**

In [18]:
input_func = tf.estimator.inputs.pandas_input_fn(X_train, y_train, batch_size=10, num_epochs=1000, shuffle=True)

** Create the estimator model. Use a DNNRegressor. Play around with the hidden units! **

In [19]:
model = tf.estimator.DNNRegressor(hidden_units=[10,20,20,10], feature_columns=feat_cols)

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_keep_checkpoint_max': 5, '_save_summary_steps': 100, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_save_checkpoints_secs': 600, '_model_dir': '/tmp/tmpc4sfhpmk', '_session_config': None, '_save_checkpoints_steps': None, '_tf_random_seed': 1}


##### ** Train the model for ~1,000 steps. (Later come back to this and train it for more and check for improvement) **

In [20]:
model.train(input_fn=input_func, steps=2000)

INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Saving checkpoints for 1 into /tmp/tmpc4sfhpmk/model.ckpt.
INFO:tensorflow:loss = 672655740000.0, step = 1
INFO:tensorflow:global_step/sec: 584.677
INFO:tensorflow:loss = 61623013000.0, step = 101 (0.175 sec)
INFO:tensorflow:global_step/sec: 558.522
INFO:tensorflow:loss = 220911830000.0, step = 201 (0.176 sec)
INFO:tensorflow:global_step/sec: 516.058
INFO:tensorflow:loss = 52306625000.0, step = 301 (0.194 sec)
INFO:tensorflow:global_step/sec: 519.836
INFO:tensorflow:loss = 214431450000.0, step = 401 (0.192 sec)
INFO:tensorflow:global_step/sec: 504.118
INFO:tensorflow:loss = 36806762000.0, step = 501 (0.198 sec)
INFO:tensorflow:global_step/sec: 580.941
INFO:tensorflow:loss = 107792350000.0, step = 601 (0.174 sec)
INFO:tensorflow:global_step/sec: 606.291
INFO:tensorflow:loss = 39627125000.0, step = 701 (0.163 sec)
INFO:tensorflow:global_step/sec: 506.632
INFO:tensorflow:loss = 87271100000.0, step = 801 (0.197 sec)
INFO:tensorflo

<tensorflow.python.estimator.canned.dnn.DNNRegressor at 0x7f8d70b92ef0>

** Create a prediction input function and then use the .predict method off your estimator model to create a list or predictions on your test data. **

In [21]:
predict_input_func = tf.estimator.inputs.pandas_input_fn(X_test, batch_size=10, num_epochs=1, shuffle=False)

In [22]:
pred = model.predict(input_fn=predict_input_func)

In [23]:
predictions = list(pred)

INFO:tensorflow:Restoring parameters from /tmp/tmpc4sfhpmk/model.ckpt-2000


In [24]:
predictions

[{'predictions': array([232485.97], dtype=float32)},
 {'predictions': array([260271.47], dtype=float32)},
 {'predictions': array([183691.69], dtype=float32)},
 {'predictions': array([209671.27], dtype=float32)},
 {'predictions': array([193558.5], dtype=float32)},
 {'predictions': array([176303.56], dtype=float32)},
 {'predictions': array([196009.77], dtype=float32)},
 {'predictions': array([240633.64], dtype=float32)},
 {'predictions': array([223350.11], dtype=float32)},
 {'predictions': array([209539.38], dtype=float32)},
 {'predictions': array([260261.78], dtype=float32)},
 {'predictions': array([314375.6], dtype=float32)},
 {'predictions': array([191280.19], dtype=float32)},
 {'predictions': array([239329.94], dtype=float32)},
 {'predictions': array([284828.75], dtype=float32)},
 {'predictions': array([179006.53], dtype=float32)},
 {'predictions': array([189953.33], dtype=float32)},
 {'predictions': array([253768.8], dtype=float32)},
 {'predictions': array([233490.03], dtype=float32

** Calculate the RMSE. You should be able to get around 100,000 RMSE (remember that this is in the same units as the label.) Do this manually or use [sklearn.metrics](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html) **

In [25]:
final_preds = []

for pred in predictions:
    final_preds.append(pred['predictions'])

print(len(final_preds))

6192


In [26]:
from sklearn.metrics import mean_squared_error, classification_report

In [27]:
# Root Mean Squared Error
mean_squared_error(y_test,final_preds)**0.5

86314.39290117052