<a href="https://colab.research.google.com/github/cezzzanne/HousingPredictionsKeras/blob/master/HousingPricePrediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [0]:
# These are the tools which will help us build our algorithms much more efficiently
import pandas as pd
import numpy as np
import tensorflow as tf

# Curate the Data

We use read_csv() to read the data we have loaded into our Colab space. This creates a dataframe (a table of rows and named columns).

We then use the .head() function on our dataframe to view its first five rows.

In [3]:
train_df = pd.read_csv('housing.csv')
train_df.head()

Unnamed: 0,Suburb,Address,Rooms,Type,Price,Method,SellerG,Date,Postcode,Regionname,Propertycount,Distance,CouncilArea
0,Abbotsford,49 Lithgow St,3,h,1490000.0,S,Jellis,1/04/2017,3067,Northern Metropolitan,4019,3.0,Yarra City Council
1,Abbotsford,59A Turner St,3,h,1220000.0,S,Marshall,1/04/2017,3067,Northern Metropolitan,4019,3.0,Yarra City Council
2,Abbotsford,119B Yarra St,3,h,1420000.0,S,Nelson,1/04/2017,3067,Northern Metropolitan,4019,3.0,Yarra City Council
3,Aberfeldie,68 Vida St,3,h,1515000.0,S,Barry,1/04/2017,3040,Western Metropolitan,1543,7.5,Moonee Valley City Council
4,Airport West,92 Clydesdale Rd,2,h,670000.0,S,Nelson,1/04/2017,3042,Western Metropolitan,3464,10.4,Moonee Valley City Council
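As a minimal sketch of what read_csv() and .head() do, the cell below loads a tiny in-memory CSV instead of housing.csv. The three rows are copied from the output above; the names csv_text and sample_df are assumptions made for this illustration only.

```python
import io
import pandas as pd

# Hypothetical stand-in for housing.csv: a few columns from three of the rows shown above
csv_text = """Suburb,Rooms,Type,Price
Abbotsford,3,h,1490000.0
Aberfeldie,3,h,1515000.0
Airport West,2,h,670000.0
"""

# read_csv accepts a file path or any file-like object
sample_df = pd.read_csv(io.StringIO(csv_text))

print(sample_df.shape)    # (number of rows, number of columns)
print(sample_df.dtypes)   # the type pandas inferred for each column
print(sample_df.head(2))  # head(n) returns the first n rows (5 by default)
```

Checking .shape and .dtypes right after loading is a cheap way to confirm that the file was parsed the way you expected.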


Now, we don't want all these variables because we don't think they all influence the price that much. For example, the seller's name ("SellerG") is of no interest to us.


In [0]:
new_train = train_df.drop(["Address", "SellerG", "Date", "CouncilArea", "Regionname", "Suburb", "Method"], axis=1)

We can now see that we have dropped these columns from our dataframe


In [5]:
new_train.head()

Unnamed: 0,Rooms,Type,Price,Postcode,Propertycount,Distance
0,3,h,1490000.0,3067,4019,3.0
1,3,h,1220000.0,3067,4019,3.0
2,3,h,1420000.0,3067,4019,3.0
3,3,h,1515000.0,3040,1543,7.5
4,2,h,670000.0,3042,3464,10.4
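One detail worth knowing about drop(): it returns a new dataframe and leaves the original untouched, which is why the result is assigned to new_train. The sketch below shows this on a small hypothetical dataframe (demo_df and its values are assumptions for illustration), together with the equivalent approach of naming the columns to keep.

```python
import pandas as pd

# Hypothetical two-row sample with a subset of the housing columns
demo_df = pd.DataFrame({
    "Suburb": ["Abbotsford", "Aberfeldie"],
    "Rooms": [3, 3],
    "SellerG": ["Jellis", "Barry"],
    "Price": [1490000.0, 1515000.0],
})

# drop() returns a new dataframe; demo_df itself keeps all four columns
dropped = demo_df.drop(["Suburb", "SellerG"], axis=1)

# Equivalent result: select the columns to keep instead of listing the ones to drop
kept = demo_df[["Rooms", "Price"]]

print(list(dropped.columns))
print(list(demo_df.columns))
print(dropped.equals(kept))
```

Selecting the columns to keep can be easier to read when most columns are being discarded.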


But what are we going to do with "Type"? Our algorithm cannot learn from letters; it can only work with numbers.


So we want to encode it as numbers. First, get all the values in our "Type" column so we can encode them.


In [0]:
features_to_encode = new_train['Type'].values

Then delete the column from our dataframe, as the values are of no use to us as letters.

In [0]:
new_train = new_train.drop('Type', axis=1)

In [0]:
from sklearn.preprocessing import OneHotEncoder

In [0]:
# sklearn's OneHotEncoder can do this for us. It turns a categorical variable into binary arrays, with one column per category
oh = OneHotEncoder()

In [0]:
encoded_type_features = oh.fit_transform(features_to_encode.reshape(-1, 1))

In [11]:
# Check out the array and how it looks
encoded_type_features.toarray()

array([[1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       ...,
       [1., 0., 0.],
       [0., 0., 1.],
       [1., 0., 0.]])

In [12]:
# We can also see what categories were present in our "Type" column
oh.categories_

[array(['h', 't', 'u'], dtype=object)]
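To make the link between the encoded columns and the categories explicit, here is a small sketch on a hypothetical four-value sample (the names types, demo_encoder and the "Type_" prefix are assumptions for illustration). Column i of the encoded array corresponds to the i-th entry of categories_[0], so descriptive column names can be built from it rather than typed by hand.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical sample of "Type" values, using the three categories seen above
types = np.array(["h", "u", "t", "h"])

demo_encoder = OneHotEncoder()
# reshape(-1, 1) turns the 1-D array into one column, the 2-D shape the encoder expects
encoded = demo_encoder.fit_transform(types.reshape(-1, 1)).toarray()

# Categories are stored in sorted order; column i matches categories_[0][i]
type_columns = ["Type_" + str(c) for c in demo_encoder.categories_[0]]

print(type_columns)
print(encoded)
```

With this ordering, "Type1", "Type2" and "Type3" in the notebook stand for h, t and u respectively.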

In [0]:
# Column names for our new dataframe: the remaining numeric columns plus one column per "Type" category (h, t, u)
columns = ["Rooms", "Price", "Postcode", "Propertycount", "Distance", "Type1", "Type2", "Type3"]

In [0]:
# Transform into a matrix (NumPy array) so we can add the columns
# to_numpy() replaces as_matrix(), which was removed in pandas 1.0
new_matrix = new_train.to_numpy()

In [15]:
# Notice we do not have the "Type" column. So we want to add it
new_matrix[0]

array([3.000e+00, 1.490e+06, 3.067e+03, 4.019e+03, 3.000e+00])

In [17]:
new_matrix.shape

(63023, 5)

In [0]:
# np.c_ joins arrays column by column. Note the square brackets: it is an indexing helper, not a function. We use it to add columns to our new_matrix
matrix_with_columns = np.c_[new_matrix, encoded_type_features.toarray()]

In [18]:
# We can see we have 3 more columns now! Meaning we were able to add our "Type" variables
matrix_with_columns.shape

(63023, 8)

In [19]:
matrix_with_columns[0]

array([3.000e+00, 1.490e+06, 3.067e+03, 4.019e+03, 3.000e+00, 1.000e+00,
       0.000e+00, 0.000e+00])
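The following sketch shows np.c_ on two small hypothetical arrays (numeric and one_hot are illustrative names and values). It also checks that np.hstack gives the same result for 2-D inputs, in case the bracket syntax feels unfamiliar.

```python
import numpy as np

# Hypothetical numeric columns (rooms, price) and one-hot columns for two houses
numeric = np.array([[3.0, 1490000.0],
                    [2.0, 670000.0]])
one_hot = np.array([[1.0, 0.0, 0.0],
                    [0.0, 0.0, 1.0]])

# np.c_ is indexed with square brackets and stacks its arguments side by side
combined = np.c_[numeric, one_hot]

# For 2-D arrays, np.hstack produces the same result
same = np.hstack([numeric, one_hot])

print(combined.shape)
print(np.array_equal(combined, same))
```

Both arrays must have the same number of rows, otherwise NumPy raises a ValueError.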

Now, we don't want our algorithm to think certain features are more important just because they have larger values. That is, we don't want it to think Postcode is more important than Rooms just because 3067 > 3. So we scale the columns accordingly, so that they all take values in [0, 1].

In [0]:
# We once again use sklearn
from sklearn.preprocessing import MinMaxScaler

In [0]:
scaler = MinMaxScaler(feature_range=(0, 1))


In [0]:
# Turn our matrix into a dataframe again so we can use the scaler
df = pd.DataFrame(matrix_with_columns)

In [24]:
df.head()

Unnamed: 0,0,1,2,3,4,5,6,7
0,3.0,1490000.0,3067.0,4019.0,3.0,1.0,0.0,0.0
1,3.0,1220000.0,3067.0,4019.0,3.0,1.0,0.0,0.0
2,3.0,1420000.0,3067.0,4019.0,3.0,1.0,0.0,0.0
3,3.0,1515000.0,3040.0,1543.0,7.5,1.0,0.0,0.0
4,2.0,670000.0,3042.0,3464.0,10.4,1.0,0.0,0.0


In [0]:
scaled_train = scaler.fit_transform(df)

In [26]:
# See our first row
scaled_train[0]

array([0.06666667, 0.12640576, 0.06836735, 0.18416547, 0.04680187,
       1.        , 0.        , 0.        ])
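For each column, MinMaxScaler with feature_range=(0, 1) computes (value - column minimum) / (column maximum - column minimum). The sketch below checks this by hand on a small hypothetical array; demo_data and demo_scaler are names chosen for the example so that they do not overwrite the notebook's own scaler.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical sample: column 0 is rooms, column 1 is postcode
demo_data = np.array([[1.0, 3000.0],
                      [3.0, 3067.0],
                      [5.0, 3100.0]])

demo_scaler = MinMaxScaler(feature_range=(0, 1))
demo_scaled = demo_scaler.fit_transform(demo_data)

# The same computation by hand, applied column by column
col_min = demo_data.min(axis=0)
col_max = demo_data.max(axis=0)
manual = (demo_data - col_min) / (col_max - col_min)

print(demo_scaled)
print(np.allclose(demo_scaled, manual))
```

After scaling, the smallest value in each column becomes 0 and the largest becomes 1, regardless of the original units.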

In [0]:
# The scaler returns a plain NumPy array, so we wrap it in a dataframe and add names to the columns to see what we are doing
scaled_train_df = pd.DataFrame(scaled_train, columns=columns)

In [28]:
scaled_train_df.head()

Unnamed: 0,Rooms,Price,Postcode,Propertycount,Distance,Type1,Type2,Type3
0,0.066667,0.126406,0.068367,0.184165,0.046802,1.0,0.0,0.0
1,0.066667,0.102114,0.068367,0.184165,0.046802,1.0,0.0,0.0
2,0.066667,0.120108,0.068367,0.184165,0.046802,1.0,0.0,0.0
3,0.066667,0.128655,0.040816,0.069594,0.117005,1.0,0.0,0.0
4,0.033333,0.052632,0.042857,0.158484,0.162246,1.0,0.0,0.0


In [0]:
# You might not see it above, but we have NaN values in our data. This matters because NaN values heavily affect the learning
# So we use a function that fills every NaN with the mean value of its column (including any missing values in "Price"). This is better, but not optimal
scaled_train_df = scaled_train_df.fillna(scaled_train_df.mean())
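Since the NaN values are not visible in the first rows, it can help to count them before filling. A minimal sketch on a hypothetical dataframe follows (nan_demo and its values are assumptions for illustration).

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one missing value in each column
nan_demo = pd.DataFrame({
    "Rooms": [2.0, np.nan, 4.0],
    "Distance": [3.0, 7.5, np.nan],
})

# Count the missing values per column
print(nan_demo.isna().sum())

# mean() skips NaN by default, so each gap is filled with the mean of the known values
filled = nan_demo.fillna(nan_demo.mean())
print(filled)
```

Here the missing Rooms value becomes 3.0 and the missing Distance value becomes 5.25, the means of the two known values in each column.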

In [0]:
# We want our training data to be everything except the prices (which we will try to predict)
X = scaled_train_df.drop("Price", axis=1).values
Y = scaled_train_df[["Price"]].values

# Build the model

In [0]:
# Declare the model
model = tf.keras.Sequential()

In [0]:
model.add(tf.keras.layers.Dense(50, activation='relu'))
model.add(tf.keras.layers.Dense(100, activation='relu'))
model.add(tf.keras.layers.Dense(50, activation='relu'))
model.add(tf.keras.layers.Dense(1))
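To get a feel for the size of this network, the sketch below counts its trainable parameters in plain Python. It assumes the input has 7 features, which is the number of columns left in X after "Price" is dropped; each Dense layer has one weight per input-output pair plus one bias per unit.

```python
# Layer widths: 7 input features (an assumption based on the columns of X),
# followed by the four Dense layers defined above
layer_sizes = [7, 50, 100, 50, 1]

total_params = 0
for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
    layer_params = n_in * n_out + n_out  # weights plus one bias per unit
    print("{} -> {}: {} parameters".format(n_in, n_out, layer_params))
    total_params += layer_params

print("Total:", total_params)
```

Once the model has been fitted, model.summary() should report the same per-layer counts.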

In [34]:
model.compile(loss='mean_squared_error', optimizer='adam')



In [0]:
# Train on every row except the first 10, which we keep aside for making predictions
model.fit(
    X[10:],
    Y[10:],
    epochs=50,
    shuffle=True,
    verbose=2
)

# Make Predictions

In [0]:
prediction = model.predict(X[:1])

In [166]:
prediction

array([[0.10254882]], dtype=float32)

In [0]:
# Index 1 is the "Price" column. MinMaxScaler computes scaled = value * scale_ + min_
multiplier = scaler.scale_[1]
adder = scaler.min_[1]

In [0]:
# Scale back to dollars by inverting the scaler's formula: value = (scaled - min_) / scale_
pred = prediction[0][0]
print('Prediction with scaling = {}'.format(pred))
pred -= adder
pred /= multiplier
print("Housing Price Prediction  - ${}".format(pred))
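The manual scale-back above can be checked against the scaler's own inverse_transform. The sketch below does this on a small hypothetical array; inv_demo and inv_scaler are names chosen for the example so that the notebook's scaler is left untouched.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical sample: column 0 is rooms, column 1 is price
inv_demo = np.array([[2.0, 670000.0],
                     [3.0, 1220000.0],
                     [3.0, 1490000.0]])

inv_scaler = MinMaxScaler(feature_range=(0, 1))
inv_scaled = inv_scaler.fit_transform(inv_demo)

# Manual inverse for the price column: (scaled - min_) / scale_
price_scaled = inv_scaled[1, 1]
price_manual = (price_scaled - inv_scaler.min_[1]) / inv_scaler.scale_[1]

# Built-in inverse, which expects an array with all of the original columns
price_builtin = inv_scaler.inverse_transform(inv_scaled)[1, 1]

print(price_manual, price_builtin)
```

The manual version is convenient here because the model predicts only the price, while inverse_transform needs every column the scaler was fitted on.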