<h1>Machine Learning - Project</h1>
<h3>California Housing Price Prediction</h3>
<h5>By - Anishka Patel email: anishka.vpatel@gmail.com</h5>

<p>DESCRIPTION <br>
<br>
Background of Problem Statement :<br>
<br>
The US Census Bureau has published California Census Data which has 10 types of metrics <br>
such as the population, median income, median housing price, and so on for each block <br>
group in California. The dataset also serves as an input for project scoping and helps <br>
specify the project's functional and nonfunctional requirements.<br>
<br>
Problem Objective :<br>
<br>
The project aims at building a model of housing prices to predict median house values <br>
in California using the provided dataset. This model should learn from the data and be <br>
able to predict the median housing price in any district, given all the other metrics.<br>
<br>
Districts or block groups are the smallest geographical units for which the US Census <br>
Bureau publishes sample data (a block group typically has a population of 600 to 3,000 <br>
people). There are 20,640 districts in the project dataset.<br></p>
<p>
Analysis Tasks to be performed:
<ol>
<li>Build a model of housing prices to predict median house values in California using <br>the provided dataset.

<li>Train the model to learn from the data to predict the median housing price in any <br>district, given all the other metrics.

<li>Predict housing prices based on median_income and plot the regression chart for it.</ol></p>

In [1]:
# IMPORTS

import pandas as pd

<h4>1. Load the data :</h4>
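<li>Read the housing data file from the folder into the program (the task statement names “housing.csv”; the copy used here is the Excel file “1553768847_housing.xlsx”).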

In [2]:
df = pd.read_excel('1553768847_housing.xlsx')

<li>Print first few rows of this data.

In [3]:
df.head()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,ocean_proximity,median_house_value
0,-122.23,37.88,41,880,129.0,322,126,8.3252,NEAR BAY,452600
1,-122.22,37.86,21,7099,1106.0,2401,1138,8.3014,NEAR BAY,358500
2,-122.24,37.85,52,1467,190.0,496,177,7.2574,NEAR BAY,352100
3,-122.25,37.85,52,1274,235.0,558,219,5.6431,NEAR BAY,341300
4,-122.25,37.85,52,1627,280.0,565,259,3.8462,NEAR BAY,342200


<li>Extract input (X) and output (Y) data from the dataset.

In [4]:
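# Columns 0-1 (longitude, latitude) are skipped; the last column is the target.
# .copy() makes X independent of df, so later edits to X do not touch a view of df.
X = df.iloc[:, 2:-1].copy()
Y = df.iloc[:, -1]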
print(X.head())
print(Y.head())

   housing_median_age  total_rooms  total_bedrooms  population  households  \
0                  41          880           129.0         322         126   
1                  21         7099          1106.0        2401        1138   
2                  52         1467           190.0         496         177   
3                  52         1274           235.0         558         219   
4                  52         1627           280.0         565         259   

   median_income ocean_proximity  
0         8.3252        NEAR BAY  
1         8.3014        NEAR BAY  
2         7.2574        NEAR BAY  
3         5.6431        NEAR BAY  
4         3.8462        NEAR BAY  
0    452600
1    358500
2    352100
3    341300
4    342200
Name: median_house_value, dtype: int64


<h4>2. Handle missing values :</h4>
<li>Fill the missing values with the mean of the respective column.

In [5]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 7 columns):
housing_median_age    20640 non-null int64
total_rooms           20640 non-null int64
total_bedrooms        20433 non-null float64
population            20640 non-null int64
households            20640 non-null int64
median_income         20640 non-null float64
ocean_proximity       20640 non-null object
dtypes: float64(2), int64(4), object(1)
memory usage: 1.1+ MB


<p>The total_bedrooms column has 207 null values (20,433 non-null out of 20,640 entries).</p>

In [6]:
# Handling missing data
mean_total_bedrooms = X['total_bedrooms'].mean()
mean_total_bedrooms

537.8705525375618

In [7]:
# Assign the filled column back; inplace=True on a column selection may not update X in newer pandas
X['total_bedrooms'] = X['total_bedrooms'].fillna(mean_total_bedrooms)
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 7 columns):
housing_median_age    20640 non-null int64
total_rooms           20640 non-null int64
total_bedrooms        20640 non-null float64
population            20640 non-null int64
households            20640 non-null int64
median_income         20640 non-null float64
ocean_proximity       20640 non-null object
dtypes: float64(2), int64(4), object(1)
memory usage: 1.1+ MB


<p> No null values in X </p>
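<p>The two cells above count the gaps, compute the column mean, and fill the gaps with it. A minimal sketch of the same steps on a toy frame follows; the values and the names <code>toy</code>, <code>n_missing</code>, <code>col_mean</code>, and <code>filled</code> are illustrative assumptions, not part of the project data.</p>

```python
import numpy as np
import pandas as pd

# Toy column standing in for total_bedrooms (hypothetical values).
toy = pd.DataFrame({"total_bedrooms": [100.0, np.nan, 300.0, np.nan]})

# Count the missing entries before imputing.
n_missing = int(toy["total_bedrooms"].isna().sum())

# Series.mean() skips NaN by default, so only the observed values are averaged.
col_mean = toy["total_bedrooms"].mean()

# fillna returns a new Series; assign() puts it back without mutating toy.
filled = toy.assign(total_bedrooms=toy["total_bedrooms"].fillna(col_mean))

print(n_missing, col_mean)
print(filled)
```

<p>If the mean were computed from the training rows only (after the split in step 4), no information from the test rows would enter the imputed values.</p>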

<h4>3. Encode categorical data :</h4>
<li>Convert categorical column in the dataset to numerical data.

In [8]:
X = pd.get_dummies(X, columns=['ocean_proximity'])
X.head()

Unnamed: 0,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,ocean_proximity_<1H OCEAN,ocean_proximity_INLAND,ocean_proximity_ISLAND,ocean_proximity_NEAR BAY,ocean_proximity_NEAR OCEAN
0,41,880,129.0,322,126,8.3252,0,0,0,1,0
1,21,7099,1106.0,2401,1138,8.3014,0,0,0,1,0
2,52,1467,190.0,496,177,7.2574,0,0,0,1,0
3,52,1274,235.0,558,219,5.6431,0,0,0,1,0
4,52,1627,280.0,565,259,3.8462,0,0,0,1,0
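<p>To make the encoding step concrete, here is a small sketch of <code>pd.get_dummies</code> on a three-row toy frame. The values are made up for illustration; <code>dtype=int</code> is passed so that the indicator columns show 0/1 as in the table above, since recent pandas versions default to booleans.</p>

```python
import pandas as pd

# Toy frame with one numeric and one categorical column (hypothetical values).
toy = pd.DataFrame({
    "median_income": [8.3, 3.8, 5.6],
    "ocean_proximity": ["NEAR BAY", "INLAND", "NEAR BAY"],
})

# Each distinct category becomes its own 0/1 indicator column.
encoded = pd.get_dummies(toy, columns=["ocean_proximity"], dtype=int)

print(encoded.columns.tolist())
print(encoded)
```

<p>Only categories present in the data get a column, so the toy frame yields two indicator columns while the full dataset yields five.</p>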


<h4>4. Split the dataset : </h4>
<li>Split the data into 80% training dataset and 20% test dataset.

In [9]:
from sklearn.model_selection import train_test_split

Xtrain, Xtest, Ytrain, Ytest = train_test_split(X, Y, test_size=.2, random_state=15)
print(Xtrain.shape)
print(Xtest.shape)
print(Ytrain.shape)
print(Ytest.shape)

(16512, 11)
(4128, 11)
(16512,)
(4128,)


<h4>5. Standardize data :</h4>
<li>Standardize training and test datasets.

In [10]:
from sklearn.preprocessing import StandardScaler
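scaler = StandardScaler()
Xtrain = pd.DataFrame(data=scaler.fit_transform(Xtrain))
# NOTE: fit_transform refits the scaler on the test set, so the test data is scaled with
# its own mean and variance; scaler.transform(Xtest) would reuse the training statistics.
Xtest = pd.DataFrame(data=scaler.fit_transform(Xtest))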
print(Xtrain.head())
print(Xtest.head())

         0         1         2         3         4         5         6   \
0 -1.479975 -0.989472 -1.012499 -1.060311 -1.058113 -0.892478 -0.891506   
1  0.261625  0.185907  0.674387  0.250759  0.585480 -0.350094 -0.891506   
2 -0.213356 -0.685503 -0.346942  0.407092 -0.260252 -0.839358  1.121698   
3 -0.767502  1.920812  2.194278  2.472293  2.385986  0.046856  1.121698   
4  1.053262 -0.682710 -0.659149 -0.335492 -0.683118 -0.168301  1.121698   

         7         8         9         10  
0 -0.682289 -0.017404  2.828138 -0.383283  
1 -0.682289 -0.017404  2.828138 -0.383283  
2 -0.682289 -0.017404 -0.353590 -0.383283  
3 -0.682289 -0.017404 -0.353590 -0.383283  
4 -0.682289 -0.017404 -0.353590 -0.383283  
         0         1         2         3         4         5         6   \
0  1.902821 -0.517923 -0.408778 -0.492543 -0.391786 -0.481546 -0.889757   
1  0.611848 -0.665987 -0.634690 -0.827466 -0.876374 -1.039469 -0.889757   
2 -1.485984  0.105071  0.034011  0.207515  0.060823 -0.00926
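<p>A common variant of this step fits the scaler on the training set only and applies the same transformation to the test set. The sketch below shows that variant on toy data and also keeps the column names; it is an alternative under these assumptions, not the code that produced the outputs in this notebook.</p>

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy train/test frames (hypothetical values).
train = pd.DataFrame({"median_income": [1.0, 2.0, 3.0]})
test = pd.DataFrame({"median_income": [2.0, 4.0]})

scaler = StandardScaler()

# fit_transform learns mean and standard deviation from the training data.
train_std = pd.DataFrame(scaler.fit_transform(train), columns=train.columns)

# transform reuses the training mean and standard deviation for the test data.
test_std = pd.DataFrame(scaler.transform(test), columns=test.columns)

print(scaler.mean_)
print(test_std)
```

<p>Passing <code>columns=</code> keeps the feature names, so a column can later be selected as <code>train_std[["median_income"]]</code> rather than by position.</p>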



<h4>6. Perform Linear Regression : </h4>
<li>Perform Linear Regression on training data.

In [11]:
from sklearn.linear_model import LinearRegression
model_lr = LinearRegression()
model_lr.fit(Xtrain, Ytrain)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

<li>Predict output for test dataset using the fitted model.

In [12]:
Ypred_lr = model_lr.predict(Xtest)

In [13]:
model_lr.score(Xtest, Ytest)

0.6279244012499515

<li>Print root mean squared error (RMSE) from Linear Regression.

In [14]:
from sklearn.metrics import mean_squared_error
from math import sqrt
# Argument order is (y_true, y_pred); the value is the same either way
print(sqrt(mean_squared_error(Ytest, Ypred_lr)))

70338.55535453843
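<p>The two numbers reported above are the coefficient of determination R² (what <code>score</code> returns for a regressor) and the RMSE, which is expressed in the units of the target. The following worked example computes both by hand on three toy values; the numbers are made up for illustration.</p>

```python
from math import sqrt

y_true = [3.0, 5.0, 7.0]   # toy "actual" values (hypothetical)
y_pred = [2.0, 5.0, 9.0]   # toy predictions (hypothetical)

n = len(y_true)

# Residual sum of squares: squared distance between prediction and truth.
ss_res = sum((p - t) ** 2 for t, p in zip(y_true, y_pred))

# Total sum of squares: squared distance between truth and its own mean.
mean_true = sum(y_true) / n
ss_tot = sum((t - mean_true) ** 2 for t in y_true)

rmse = sqrt(ss_res / n)       # root of the mean squared error
r2 = 1.0 - ss_res / ss_tot    # 1 is a perfect fit, 0 matches predicting the mean

print(rmse, r2)
```

<p>Here the residuals are -1, 0, and 2, so the residual sum of squares is 5, the total sum of squares is 8, and R² is 1 - 5/8 = 0.375.</p>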


In [15]:
df_out_lr = pd.DataFrame(data=Ytest.values, columns=['Actual'])
df_out_lr['Prediction'] = Ypred_lr
df_out_lr['Error'] = df_out_lr['Prediction'] - df_out_lr['Actual']
df_out_lr

Unnamed: 0,Actual,Prediction,Error
0,220800,225001.787303,4201.787303
1,82800,155615.110498,72815.110498
2,141000,127842.841652,-13157.158348
3,340900,281204.634820,-59695.365180
4,56100,59454.426655,3354.426655
5,120700,127797.063211,7097.063211
6,125000,150067.308400,25067.308400
7,67500,181072.264384,113572.264384
8,236800,183188.388279,-53611.611721
9,178300,209663.209705,31363.209705


<h4>7. Perform Decision Tree Regression :</h4>
<li>Perform Decision Tree Regression on training data.

In [16]:
from sklearn.tree import DecisionTreeRegressor
model_dtr = DecisionTreeRegressor()
model_dtr.fit(Xtrain, Ytrain)

DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=None, splitter='best')

<li>Predict output for test dataset using the fitted model.

In [17]:
Ypred_dtr = model_dtr.predict(Xtest)
model_dtr.score(Xtest, Ytest)

0.4294857310962491

<li>Print root mean squared error from Decision Tree Regression.

In [18]:
# Argument order is (y_true, y_pred); the value is the same either way
print(sqrt(mean_squared_error(Ytest, Ypred_dtr)))

87098.58611814289


In [19]:
df_out_dtr = pd.DataFrame(data=Ytest.values, columns=['Actual'])
df_out_dtr['Prediction'] = Ypred_dtr
df_out_dtr['Error'] = df_out_dtr['Prediction'] - df_out_dtr['Actual']
df_out_dtr

Unnamed: 0,Actual,Prediction,Error
0,220800,122800.0,-98000.0
1,82800,76100.0,-6700.0
2,141000,193000.0,52000.0
3,340900,353100.0,12200.0
4,56100,57700.0,1600.0
5,120700,91700.0,-29000.0
6,125000,151700.0,26700.0
7,67500,260100.0,192600.0
8,236800,150000.0,-86800.0
9,178300,198000.0,19700.0


<h4>8. Perform Random Forest Regression :</h4>
<li>Perform Random Forest Regression on training data.

In [20]:
from sklearn.ensemble import RandomForestRegressor
model_rfr = RandomForestRegressor()
model_rfr.fit(Xtrain, Ytrain)



RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=None,
           oob_score=False, random_state=None, verbose=0, warm_start=False)

<li>Predict output for test dataset using the fitted model.

In [21]:
Ypred_rfr = model_rfr.predict(Xtest)
model_rfr.score(Xtest, Ytest)

0.6718491022217677

<li>Print RMSE (root mean squared error) from Random Forest Regression.

In [22]:
# Argument order is (y_true, y_pred); the value is the same either way
print(sqrt(mean_squared_error(Ytest, Ypred_rfr)))

66056.36146233698


In [23]:
df_out_rfr = pd.DataFrame(data=Ytest.values, columns=['Actual'])
df_out_rfr['Prediction'] = Ypred_rfr
df_out_rfr['Error'] = df_out_rfr['Prediction'] - df_out_rfr['Actual']
df_out_rfr

Unnamed: 0,Actual,Prediction,Error
0,220800,204300.0,-16500.0
1,82800,119750.0,36950.0
2,141000,130750.0,-10250.0
3,340900,311110.0,-29790.0
4,56100,63160.0,7060.0
5,120700,144150.0,23450.0
6,125000,188880.0,63880.0
7,67500,172990.0,105490.0
8,236800,172150.0,-64650.0
9,178300,164020.0,-14280.0
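<p>Sections 6 to 8 repeat the same fit, predict, score, and RMSE steps for each model. One way to express that pattern once is a small helper, sketched below on synthetic data from <code>make_regression</code>. The helper name <code>evaluate</code>, the synthetic data, and the fixed <code>random_state</code> values are assumptions made for illustration, so the printed numbers are unrelated to the housing results above.</p>

```python
from math import sqrt

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor


def evaluate(model, X_train, X_test, y_train, y_test):
    """Fit the model, then return (R^2, RMSE) on the test split."""
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    return model.score(X_test, y_test), sqrt(mean_squared_error(y_test, y_pred))


# Synthetic regression problem standing in for the housing data.
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=15)
parts = train_test_split(X, y, test_size=0.2, random_state=15)

models = {
    "linear": LinearRegression(),
    "tree": DecisionTreeRegressor(random_state=15),
    "forest": RandomForestRegressor(n_estimators=10, random_state=15),
}

# train_test_split returns X_train, X_test, y_train, y_test in that order.
results = {name: evaluate(model, *parts) for name, model in models.items()}

for name, (r2, rmse) in results.items():
    print(f"{name}: R^2={r2:.3f}, RMSE={rmse:.1f}")
```

<p>Fixing <code>random_state</code> on the tree and forest makes repeated runs comparable; the cells above leave it unset, so their scores can vary from run to run.</p>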


<h4>9. Bonus exercise: Perform Linear Regression with one independent variable :</h4>
<li>Extract just the median_income column from the independent variables (from Xtrain and Xtest).

In [24]:
# The scaled frames have no column names; position 5 is median_income
Xtrainb = Xtrain[[5]]
Xtestb = Xtest[[5]]
print(Xtrainb.head())
print(Xtestb.head())

          5
0 -0.892478
1 -0.350094
2 -0.839358
3  0.046856
4 -0.168301
          5
0 -0.481546
1 -1.039469
2 -0.009265
3  1.605853
4 -1.175225


<li>Perform Linear Regression to predict housing values based on median_income.

In [25]:
model_lrb = LinearRegression()
model_lrb.fit(Xtrainb, Ytrain)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

<li>Predict output for test dataset using the fitted model.

In [26]:
Ypred_lrb = model_lrb.predict(Xtestb)
model_lrb.score(Xtestb, Ytest)

0.4593220819728651

In [27]:
# Argument order is (y_true, y_pred); the value is the same either way
print(sqrt(mean_squared_error(Ytest, Ypred_lrb)))

84790.49410859433


In [28]:
df_out_lrb = pd.DataFrame(data=Ytest.values, columns=['Actual'])
df_out_lrb['Prediction'] = Ypred_lrb
df_out_lrb['Error'] = df_out_lrb['Prediction'] - df_out_lrb['Actual']
df_out_lrb

Unnamed: 0,Actual,Prediction,Error
0,220800,168337.761939,-52462.238061
1,82800,123869.203503,41069.203503
2,141000,205980.284649,64980.284649
3,340900,334711.287812,-6188.712188
4,56100,113048.985882,56948.985882
5,120700,171418.987973,50718.987973
6,125000,186064.321589,61064.321589
7,67500,158075.461789,90575.461789
8,236800,157272.398762,-79527.601238
9,178300,175818.928029,-2481.071971


<li>Plot the fitted model for training data as well as for test data to check if the fitted model satisfies the test data.
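<p>A minimal sketch of such a plot, assuming matplotlib is available. It uses synthetic one-feature data so that it runs on its own; in this notebook the corresponding inputs would be <code>Xtrainb</code>, <code>Xtestb</code>, <code>Ytrain</code>, <code>Ytest</code>, and <code>model_lrb</code>. The axis labels and the synthetic slope are illustrative assumptions.</p>

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for standardized median_income and house value.
rng = np.random.default_rng(15)
x = rng.normal(size=(200, 1))
y = 3.0 * x[:, 0] + rng.normal(scale=0.5, size=200)
x_train, x_test, y_train, y_test = x[:160], x[160:], y[:160], y[160:]

model = LinearRegression().fit(x_train, y_train)

# One panel per split, sharing the y-axis so the two fits are comparable.
fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
panels = [(x_train, y_train, "Training data"), (x_test, y_test, "Test data")]
for ax, (xs, ys, title) in zip(axes, panels):
    ax.scatter(xs[:, 0], ys, s=8, alpha=0.5, label="observed")
    order = np.argsort(xs[:, 0])  # sort so the fitted line is drawn left to right
    ax.plot(xs[order, 0], model.predict(xs[order]), color="red", label="fitted line")
    ax.set_title(title)
    ax.set_xlabel("median_income (standardized)")
    ax.legend()
axes[0].set_ylabel("median_house_value")
fig.tight_layout()
```

<p>In a notebook the figure is displayed by ending the cell with <code>plt.show()</code>. If the same line runs through the bulk of the points in both panels, the fit learned on the training data carries over to the test data.</p>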