**CALIFORNIA HOUSING DATASET**

In [1]:
# importing basic libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
# Reading the data
df=pd.read_csv('housing.csv')

In [3]:
df.head()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity
0,-122.23,37.88,41.0,880.0,129.0,322.0,126.0,8.3252,452600.0,NEAR BAY
1,-122.22,37.86,21.0,7099.0,1106.0,2401.0,1138.0,8.3014,358500.0,NEAR BAY
2,-122.24,37.85,52.0,1467.0,190.0,496.0,177.0,7.2574,352100.0,NEAR BAY
3,-122.25,37.85,52.0,1274.0,235.0,558.0,219.0,5.6431,341300.0,NEAR BAY
4,-122.25,37.85,52.0,1627.0,280.0,565.0,259.0,3.8462,342200.0,NEAR BAY


In [4]:
df.shape

(20640, 10)

OBSERVATIONS: there are 20,640 records and 10 columns (9 predictors plus the target, median_house_value)

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   longitude           20640 non-null  float64
 1   latitude            20640 non-null  float64
 2   housing_median_age  20640 non-null  float64
 3   total_rooms         20640 non-null  float64
 4   total_bedrooms      20433 non-null  float64
 5   population          20640 non-null  float64
 6   households          20640 non-null  float64
 7   median_income       20640 non-null  float64
 8   median_house_value  20640 non-null  float64
 9   ocean_proximity     20640 non-null  object 
dtypes: float64(9), object(1)
memory usage: 1.6+ MB


**Checking Null or Missing Values**

In [6]:
df.isna().sum()

longitude               0
latitude                0
housing_median_age      0
total_rooms             0
total_bedrooms        207
population              0
households              0
median_income           0
median_house_value      0
ocean_proximity         0
dtype: int64

Only 207 of the 20,640 rows (about 1%) have a missing value, all in the total_bedrooms column, so we can drop those rows
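To make the choice concrete, here is a minimal sketch on a made-up five-row frame (the values are illustrative, not a sample of housing.csv). It compares dropping incomplete rows with median imputation, an alternative this notebook does not use.

```python
import numpy as np
import pandas as pd

# Toy data with hypothetical values: two rows lack total_bedrooms
toy = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0, 1627.0],
    "total_bedrooms": [129.0, np.nan, 190.0, np.nan, 280.0],
})

# Share of rows with at least one missing value
missing_share = toy.isna().any(axis=1).mean()

# Option 1 (used in this notebook): drop incomplete rows
dropped = toy.dropna()

# Option 2 (alternative, not used here): fill with the column median
imputed = toy.fillna({"total_bedrooms": toy["total_bedrooms"].median()})

print(missing_share, dropped.shape, imputed.shape)
```

When the missing share is as small as it is in this dataset, dropping loses little data; imputation becomes more attractive as that share grows.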

In [7]:
df.dropna(inplace=True)

In [8]:
df.shape

(20433, 10)

After dropping, 20,433 records and 10 columns remain

In [9]:
# Checking the number of different/unique values in Ocean Proximity column 
df['ocean_proximity'].value_counts()

<1H OCEAN     9034
INLAND        6496
NEAR OCEAN    2628
NEAR BAY      2270
ISLAND           5
Name: ocean_proximity, dtype: int64

There are 5 categories in ocean_proximity: <1H OCEAN, INLAND, NEAR OCEAN, NEAR BAY and ISLAND. Note that ISLAND has only 5 records

In [10]:
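# One-hot encode the categorical column; drop_first=True drops the first
# category (<1H OCEAN), which then acts as the all-zeros baseline
Ocean_Proximity=pd.get_dummies(df['ocean_proximity'],drop_first=True)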

In [11]:
Ocean_Proximity

Unnamed: 0,INLAND,ISLAND,NEAR BAY,NEAR OCEAN
0,0,0,1,0
1,0,0,1,0
2,0,0,1,0
3,0,0,1,0
4,0,0,1,0
...,...,...,...,...
20635,1,0,0,0
20636,1,0,0,0
20637,1,0,0,0
20638,1,0,0,0
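The effect of `drop_first=True` can be seen on a tiny made-up Series. This is only a sketch: the four values are hypothetical, and `dtype=int` is passed so the output is 0/1 regardless of the pandas version.

```python
import pandas as pd

# Toy column with three of the categories seen above
s = pd.Series(["NEAR BAY", "<1H OCEAN", "INLAND", "NEAR BAY"])

# One indicator column per category
full = pd.get_dummies(s, dtype=int)

# Same encoding with the first (alphabetically sorted) category removed
reduced = pd.get_dummies(s, drop_first=True, dtype=int)

print(list(full.columns))
print(list(reduced.columns))
print(reduced)
```

In the reduced frame, a row of all zeros stands for the dropped category, so no information is lost while one redundant column is avoided.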


In [12]:
# Drop the original ocean_proximity column (columns= already implies axis=1)
df.drop(columns=['ocean_proximity'],inplace=True)

In [13]:
# Append the one-hot encoded columns to the dataset
df=pd.concat([df,Ocean_Proximity],axis=1)

In [14]:
df.head()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,INLAND,ISLAND,NEAR BAY,NEAR OCEAN
0,-122.23,37.88,41.0,880.0,129.0,322.0,126.0,8.3252,452600.0,0,0,1,0
1,-122.22,37.86,21.0,7099.0,1106.0,2401.0,1138.0,8.3014,358500.0,0,0,1,0
2,-122.24,37.85,52.0,1467.0,190.0,496.0,177.0,7.2574,352100.0,0,0,1,0
3,-122.25,37.85,52.0,1274.0,235.0,558.0,219.0,5.6431,341300.0,0,0,1,0
4,-122.25,37.85,52.0,1627.0,280.0,565.0,259.0,3.8462,342200.0,0,0,1,0


In [15]:
df.shape

(20433, 13)

Now there are 13 columns and 20,433 records

In [16]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 20433 entries, 0 to 20639
Data columns (total 13 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   longitude           20433 non-null  float64
 1   latitude            20433 non-null  float64
 2   housing_median_age  20433 non-null  float64
 3   total_rooms         20433 non-null  float64
 4   total_bedrooms      20433 non-null  float64
 5   population          20433 non-null  float64
 6   households          20433 non-null  float64
 7   median_income       20433 non-null  float64
 8   median_house_value  20433 non-null  float64
 9   INLAND              20433 non-null  uint8  
 10  ISLAND              20433 non-null  uint8  
 11  NEAR BAY            20433 non-null  uint8  
 12  NEAR OCEAN          20433 non-null  uint8  
dtypes: float64(9), uint8(4)
memory usage: 1.6 MB


Now all the columns are numerical and we can use this dataset for Model Training

**Separating dependent and independent columns**

In [17]:
X=df.drop('median_house_value',axis=1)
y=df['median_house_value']

In [18]:
X.head()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,INLAND,ISLAND,NEAR BAY,NEAR OCEAN
0,-122.23,37.88,41.0,880.0,129.0,322.0,126.0,8.3252,0,0,1,0
1,-122.22,37.86,21.0,7099.0,1106.0,2401.0,1138.0,8.3014,0,0,1,0
2,-122.24,37.85,52.0,1467.0,190.0,496.0,177.0,7.2574,0,0,1,0
3,-122.25,37.85,52.0,1274.0,235.0,558.0,219.0,5.6431,0,0,1,0
4,-122.25,37.85,52.0,1627.0,280.0,565.0,259.0,3.8462,0,0,1,0


In [19]:
X.shape

(20433, 12)

In [20]:
y.head()

0    452600.0
1    358500.0
2    352100.0
3    341300.0
4    342200.0
Name: median_house_value, dtype: float64

In [21]:
y.shape

(20433,)

**Splitting the dataset into train and test sets**

In [22]:
from sklearn.model_selection import train_test_split
# Hold out 25% of the rows for testing; random_state makes the split reproducible
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.25,random_state=42)

In [23]:
print(X_train.shape,X_test.shape,y_train.shape,y_test.shape)

(15324, 12) (5109, 12) (15324,) (5109,)
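The shapes above follow from simple arithmetic. The sketch below assumes that `train_test_split` rounds the test share up and gives the remainder to training, which is consistent with the output shown.

```python
import math

n_samples = 20433   # rows left after dropping missing values
test_size = 0.25

# Assumption: the test share is rounded up, the rest goes to training
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```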


**Feature Scaling**

In [24]:
from sklearn.preprocessing import StandardScaler
scaler=StandardScaler()
X_train=scaler.fit_transform(X_train)
X_test=scaler.transform(X_test)
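The scaler is fitted on the training set only and then reused on the test set, so no test-set statistics leak into training. Below is a NumPy-only sketch of the same idea on hypothetical values; it mirrors the standardization that `StandardScaler` performs but is not its implementation.

```python
import numpy as np

# Hypothetical train/test values for a single feature
train = np.array([[1.0], [2.0], [3.0], [4.0]])
test = np.array([[2.5], [5.0]])

# Statistics come from the training data only
mean = train.mean(axis=0)
std = train.std(axis=0)  # population standard deviation (ddof=0)

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std  # reuse the training statistics

print(train_scaled.mean(), train_scaled.std())
print(test_scaled.ravel())
```

The scaled training data has mean 0 and standard deviation 1; the test data generally does not, because it is measured against the training statistics.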

**Now let's create the ANN**

- We will build the ANN with 4 Dense layers
1. Input layer (the first Dense layer, which receives the 12 features)
2. Hidden layer 1
3. Hidden layer 2
4. Output layer

- The input layer and hidden layers 1 and 2 will each have between 32 and 512 units, searched with a step size of 32
- The learning rate is also a hyperparameter; the tuner picks it from 0.01, 0.001 and 0.0001
- Since this is a regression problem, we'll use the 'linear' activation function in the output layer
- For weight updates we'll use the Adam optimizer, and the loss function will be mean squared logarithmic error (MSLE)
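For intuition about the loss, here is a NumPy sketch of the mean squared logarithmic error formula. The helper name `msle_numpy` and the sample values are made up for illustration, and this is not the Keras implementation.

```python
import numpy as np

def msle_numpy(y_true, y_pred):
    """Mean of squared differences between log(1 + y_true) and log(1 + y_pred)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((np.log1p(y_true) - np.log1p(y_pred)) ** 2))

# Predicting 100,000 for a true value of 200,000 ...
a = msle_numpy([200000.0], [100000.0])
# ... and predicting 10,000 for a true value of 20,000
b = msle_numpy([20000.0], [10000.0])

print(a, b)
```

Both cases are off by a factor of two and receive almost the same loss, which shows that MSLE penalizes relative rather than absolute error. That suits house values, which span a wide range.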





In [25]:
import tensorflow as tf 
import keras_tuner as kt
from tensorflow.keras import Model
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import LeakyReLU,PReLU,ELU,ReLU
from tensorflow.keras.layers import Dropout
from tensorflow.keras.losses import MeanSquaredLogarithmicError

In [26]:
msle = MeanSquaredLogarithmicError()


def build_annmodel(hp):
  # Build a Sequential regressor whose layer widths and learning rate are tunable
  model = Sequential()

  # Units per layer: 32 to 512 in steps of 32
  units1 = hp.Int('units1', min_value=32, max_value=512, step=32)
  units2 = hp.Int('units2', min_value=32, max_value=512, step=32)
  units3 = hp.Int('units3', min_value=32, max_value=512, step=32)
  model.add(Dense(units=units1, activation='relu'))
  model.add(Dense(units=units2, activation='relu'))
  model.add(Dense(units=units3, activation='relu'))
  # Single linear output unit for regression
  model.add(Dense(1, kernel_initializer='normal', activation='linear'))

  # Candidate learning rates for the Adam optimizer
  hp_learning_rate = hp.Choice('learning_rate', values=[1e-2, 1e-3, 1e-4])

  model.compile(
      optimizer=tf.keras.optimizers.Adam(learning_rate=hp_learning_rate),
      loss=msle,
      metrics=[msle]
  )

  return model
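To get a feel for the size of the search space defined above, the number of possible configurations can be counted directly. This sketch only enumerates the values; it does not call Keras Tuner.

```python
# Unit choices for each of the three tunable layers: 32, 64, ..., 512
unit_choices = list(range(32, 512 + 1, 32))
learning_rates = [1e-2, 1e-3, 1e-4]

# Three independent unit choices times the learning-rate choices
n_combinations = len(unit_choices) ** 3 * len(learning_rates)

print(len(unit_choices), n_combinations)
```

With this many combinations, an exhaustive grid search would be expensive, which is one reason to use a budgeted search strategy instead.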

In [27]:
# HyperBand algorithm from keras tuner
tuner = kt.Hyperband(
    build_annmodel,
    objective='val_mean_squared_logarithmic_error',
    max_epochs=10,
    directory='keras_tuner_dir',
    project_name='house price with keras'
)

In [28]:
tuner.search(X_train, y_train, epochs=10, validation_split=0.2)

**Best Parameters**

In [29]:
for p in ['units1','units2','units3','learning_rate']:
  print(p, tuner.get_best_hyperparameters()[0].get(p))

units1 384
units2 384
units3 512
learning_rate 0.01


**Model Evaluation**

In [30]:
best_hps=tuner.get_best_hyperparameters(num_trials=1)[0]
best_hps

<keras_tuner.engine.hyperparameters.HyperParameters at 0x7f964f8d2a90>

In [31]:
model = tuner.hypermodel.build(best_hps)

In [32]:
history = model.fit(X_train, y_train, epochs=50, validation_split=0.2)

Epoch 1/50
...
Epoch 50/50


In [33]:
eval_result = model.evaluate(X_test, y_test)
print(eval_result)

[0.06754999607801437, 0.06758880615234375]
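The MSLE value is hard to read on its own. As a rough aid, the sketch below takes its square root (RMSLE) and exponentiates it to get an approximate multiplicative error factor. This reading is an approximation, not an exact error bound.

```python
import math

msle_test = 0.06754999607801437  # test loss reported by model.evaluate above

# Root of the MSLE: a typical error on the log(1 + y) scale
rmsle = math.sqrt(msle_test)

# exp(rmsle) is a rough ratio between prediction and target; the +1 inside
# the logarithm is negligible for house values in the hundreds of thousands
typical_ratio = math.exp(rmsle)

print(round(rmsle, 3), round(typical_ratio, 2))
```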


In [34]:
model.summary()

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense_4 (Dense)             (None, 384)               4992      
                                                                 
 dense_5 (Dense)             (None, 384)               147840    
                                                                 
 dense_6 (Dense)             (None, 512)               197120    
                                                                 
 dense_7 (Dense)             (None, 1)                 513       
                                                                 
Total params: 350,465
Trainable params: 350,465
Non-trainable params: 0
_________________________________________________________________
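The parameter counts in the summary can be checked by hand: each Dense layer has one weight per input-output pair plus one bias per output unit. The helper `dense_params` below is a hypothetical name used only for this check.

```python
def dense_params(n_in, n_out):
    # Weights (n_in * n_out) plus one bias per output unit
    return n_in * n_out + n_out

# 12 input features, then the tuned layer widths, then the single output
layer_sizes = [12, 384, 384, 512, 1]
per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]

print(per_layer, sum(per_layer))
```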
