<h1>Machine Learning - Project</h1>
<h3>California Housing Price Prediction</h3>
<h5>By - Anishka Patel email: anishka.vpatel@gmail.com</h5>

<p>DESCRIPTION <br>
<br>
Background of Problem Statement :<br>
<br>
The US Census Bureau has published California census data with 10 types of metrics, <br>
such as population, median income, and median housing price, for each block <br>
group in California. The dataset also serves as an input for project scoping and helps <br>
specify the functional and nonfunctional requirements for the project.<br>
<br>
Problem Objective :<br>
<br>
The project aims at building a model of housing prices to predict median house values <br>
in California using the provided dataset. This model should learn from the data and be <br>
able to predict the median housing price in any district, given all the other metrics.<br>
<br>
Districts or block groups are the smallest geographical units for which the US Census <br>
Bureau publishes sample data (a block group typically has a population of 600 to 3,000 <br>
people). There are 20,640 districts in the project dataset.<br></p>
<p>
Analysis Tasks to be performed:
<ol>
<li>Build a model of housing prices to predict median house values in California using <br>the provided dataset.

<li>Train the model to learn from the data to predict the median housing price in any <br>district, given all the other metrics.

<li>Predict housing prices based on median_income and plot the regression chart for it.</ol></p>

In [1]:
# IMPORTS

import pandas as pd

<h4>1. Load the data :</h4>
<li>Read the “housing.csv” file from the folder into the program.

In [2]:
df = pd.read_excel('1553768847_housing.xlsx')

<li>Print first few rows of this data.

In [3]:
df.head()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,ocean_proximity,median_house_value
0,-122.23,37.88,41,880,129.0,322,126,8.3252,NEAR BAY,452600
1,-122.22,37.86,21,7099,1106.0,2401,1138,8.3014,NEAR BAY,358500
2,-122.24,37.85,52,1467,190.0,496,177,7.2574,NEAR BAY,352100
3,-122.25,37.85,52,1274,235.0,558,219,5.6431,NEAR BAY,341300
4,-122.25,37.85,52,1627,280.0,565,259,3.8462,NEAR BAY,342200


<li>Extract input (X) and output (Y) data from the dataset.

In [4]:
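# Features: every column from index 2 up to (not including) the last one,
# which leaves out longitude and latitude; .copy() makes X independent of df
X = df.iloc[:, 2:-1].copy()
# Target: the last column, median_house_value
Y = df.iloc[:, -1]
print(X.head())
print(Y.head())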

   housing_median_age  total_rooms  total_bedrooms  population  households  \
0                  41          880           129.0         322         126   
1                  21         7099          1106.0        2401        1138   
2                  52         1467           190.0         496         177   
3                  52         1274           235.0         558         219   
4                  52         1627           280.0         565         259   

   median_income ocean_proximity  
0         8.3252        NEAR BAY  
1         8.3014        NEAR BAY  
2         7.2574        NEAR BAY  
3         5.6431        NEAR BAY  
4         3.8462        NEAR BAY  
0    452600
1    358500
2    352100
3    341300
4    342200
Name: median_house_value, dtype: int64
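<p>The positional slice above keeps the columns from index 2 up to the second-to-last one, so longitude and latitude are left out of X. As a minimal sketch, the same selection can be written by column name, which makes that choice explicit. The two-row frame demo and the names X_demo and Y_demo are made up for illustration and only mimic the column layout shown by df.head().</p>

```python
import pandas as pd

# Hypothetical two-row frame with the same column order as the housing data.
demo = pd.DataFrame({
    "longitude": [-122.23, -122.22],
    "latitude": [37.88, 37.86],
    "housing_median_age": [41, 21],
    "total_rooms": [880, 7099],
    "total_bedrooms": [129.0, 1106.0],
    "population": [322, 2401],
    "households": [126, 1138],
    "median_income": [8.3252, 8.3014],
    "ocean_proximity": ["NEAR BAY", "NEAR BAY"],
    "median_house_value": [452600, 358500],
})

target = "median_house_value"

# Drop the target and the two coordinate columns by name.
X_demo = demo.drop(columns=["longitude", "latitude", target])
Y_demo = demo[target]

# Check that the name-based selection matches the positional slice.
same = X_demo.equals(demo.iloc[:, 2:-1])
print(list(X_demo.columns))
print(same)
```

<p>Selecting by name keeps working if the column order of the input file changes, whereas the positional slice would silently pick different columns.</p>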


<h4>2. Handle missing values :</h4>
<li>Fill the missing values with the mean of the respective column.

In [5]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 7 columns):
housing_median_age    20640 non-null int64
total_rooms           20640 non-null int64
total_bedrooms        20433 non-null float64
population            20640 non-null int64
households            20640 non-null int64
median_income         20640 non-null float64
ocean_proximity       20640 non-null object
dtypes: float64(2), int64(4), object(1)
memory usage: 1.1+ MB


<p>The total_bedrooms column has 207 null values (20,433 non-null entries out of 20,640).</p>

In [6]:
# Handling missing data
mean_total_bedrooms = X['total_bedrooms'].mean()
mean_total_bedrooms

537.8705525375618

In [7]:
# Assign the filled column back instead of using inplace=True on a column selection
X['total_bedrooms'] = X['total_bedrooms'].fillna(mean_total_bedrooms)
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 7 columns):
housing_median_age    20640 non-null int64
total_rooms           20640 non-null int64
total_bedrooms        20640 non-null float64
population            20640 non-null int64
households            20640 non-null int64
median_income         20640 non-null float64
ocean_proximity       20640 non-null object
dtypes: float64(2), int64(4), object(1)
memory usage: 1.1+ MB


<p>X no longer contains any null values.</p>
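<p>The mean used for filling is computed over the non-null entries only, because pandas skips NaN by default. The following is a small sketch with made-up numbers (the Series bedrooms is not taken from the dataset) that shows the fill value and the result.</p>

```python
import numpy as np
import pandas as pd

# Made-up values; one entry is missing.
bedrooms = pd.Series([100.0, 200.0, np.nan, 300.0], name="total_bedrooms")

# Series.mean() skips NaN by default, so this is (100 + 200 + 300) / 3.
fill_value = bedrooms.mean()

# fillna returns a new Series; assign it to a name rather than relying on inplace.
filled = bedrooms.fillna(fill_value)

print(fill_value)
print(filled.isna().sum())
```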

<h4>3. Encode categorical data :</h4>
<li>Convert categorical column in the dataset to numerical data.

In [8]:
X = pd.get_dummies(X, columns=['ocean_proximity'])
X.head()

Unnamed: 0,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,ocean_proximity_<1H OCEAN,ocean_proximity_INLAND,ocean_proximity_ISLAND,ocean_proximity_NEAR BAY,ocean_proximity_NEAR OCEAN
0,41,880,129.0,322,126,8.3252,0,0,0,1,0
1,21,7099,1106.0,2401,1138,8.3014,0,0,0,1,0
2,52,1467,190.0,496,177,7.2574,0,0,0,1,0
3,52,1274,235.0,558,219,5.6431,0,0,0,1,0
4,52,1627,280.0,565,259,3.8462,0,0,0,1,0
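<p>To illustrate what get_dummies does, here is a minimal sketch on a made-up three-row frame. It also shows the optional drop_first argument, which omits one indicator column so that the remaining ones are not perfectly collinear with a model intercept. Whether that is worth doing for this project is a judgment call; the notebook keeps all five indicator columns.</p>

```python
import pandas as pd

# Made-up rows using category labels that appear in the dataset.
demo = pd.DataFrame({
    "median_income": [8.3, 3.1, 4.7],
    "ocean_proximity": ["NEAR BAY", "INLAND", "NEAR OCEAN"],
})

# One indicator column per category; dtype=int gives 0/1 instead of booleans.
encoded = pd.get_dummies(demo, columns=["ocean_proximity"], dtype=int)

# drop_first=True omits the first category in sorted order, which becomes the baseline.
encoded_ref = pd.get_dummies(
    demo, columns=["ocean_proximity"], dtype=int, drop_first=True
)

print(list(encoded.columns))
print(list(encoded_ref.columns))
```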


<h4>4. Split the dataset : </h4>
<li>Split the data into 80% training dataset and 20% test dataset.

In [10]:
from sklearn.model_selection import train_test_split

Xtrain, Xtest, Ytrain, Ytest = train_test_split(X, Y, test_size=.2, random_state=15)
print(Xtrain.shape)
print(Ytrain.shape)
print(Xtest.shape)
print(Ytest.shape)

(16512, 11)
(16512,)
(4128, 11)
(4128,)


<h4>5. Standardize data :</h4>
<li>Standardize training and test datasets.

In [11]:
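from sklearn.preprocessing import MinMaxScaler
# MinMaxScaler rescales each feature to the [0, 1] range
scaler = MinMaxScaler()
Xtrain = pd.DataFrame(data=scaler.fit_transform(Xtrain))
# Note: fit_transform refits the scaler on the test set, so the test features are
# scaled by the test set's own min/max; scaler.transform(Xtest) would reuse the training fit
Xtest = pd.DataFrame(data=scaler.fit_transform(Xtest))
print(Xtrain.head())
print(Xtest.head())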

         0         1         2         3         4         5    6    7    8   \
0  0.176471  0.015236  0.018156  0.006306  0.016280  0.115212  0.0  0.0  0.0   
1  0.607843  0.092640  0.126319  0.047675  0.117908  0.186404  0.0  0.0  0.0   
2  0.490196  0.035253  0.060832  0.052608  0.065614  0.122185  1.0  0.0  0.0   
3  0.352941  0.206891  0.223774  0.117772  0.229239  0.238507  1.0  0.0  0.0   
4  0.803922  0.035437  0.040813  0.029177  0.039467  0.210266  1.0  0.0  0.0   

    9    10  
0  1.0  0.0  
1  1.0  0.0  
2  0.0  0.0  
3  0.0  0.0  
4  0.0  0.0  
         0         1         2         3         4         5    6    7    8   \
0  1.000000  0.037286  0.058151  0.053067  0.064414  0.170267  0.0  0.0  0.0   
1  0.686275  0.028587  0.042043  0.029264  0.027633  0.097709  0.0  0.0  0.0   
2  0.176471  0.073885  0.089723  0.102822  0.098768  0.231686  0.0  1.0  0.0   
3  0.078431  0.373417  0.362436  0.262025  0.321135  0.441732  0.0  1.0  0.0   
4  0.568627  0.019075  0.030928  0.
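<p>The cell above calls fit_transform on both the training and the test set, so each set is scaled by its own minimum and maximum. A common alternative is to fit the scaler on the training data only and apply that same transformation to the test data. The sketch below uses made-up single-feature arrays to show the difference; it is not a rerun of the notebook, and the scores reported later would likely change if the notebook were modified this way.</p>

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up single-feature data; the test set has a wider range than the training set.
train = np.array([[10.0], [20.0], [30.0]])
test = np.array([[20.0], [40.0]])

scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train)  # learns min=10, max=30 from the training data
test_scaled = scaler.transform(test)        # reuses the training min and max

# For comparison: refitting on the test set maps its own min to 0 and max to 1.
test_refit = MinMaxScaler().fit_transform(test)

print(train_scaled.ravel())
print(test_scaled.ravel())
print(test_refit.ravel())
```

<p>With the training fit reused, the value 20 maps to 0.5 in both sets, and test values outside the training range can fall outside [0, 1]. With a separate fit, the same raw value 20 maps to 0.5 in training but to 0.0 in test, so the model sees inconsistent inputs.</p>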



<h4>6. Perform Linear Regression : </h4>
<li>Perform Linear Regression on training data.

In [12]:
from sklearn.linear_model import LinearRegression
model_lr = LinearRegression()
model_lr.fit(Xtrain, Ytrain)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

In [14]:
print(model_lr.coef_)
print(model_lr.intercept_)

[   62250.00092443  -223380.87409685   418972.39600965 -1299214.62112495
   550168.54698153   584937.81422458   -25087.23071944   -92440.51180502
   150925.12217593   -21973.71446979   -11423.66518167]
71192.55328171773


<li>Predict output for test dataset using the fitted model.

In [13]:
# score() returns the coefficient of determination (R^2) on the test set
model_lr.score(Xtest, Ytest)

0.30077082963536794
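<p>For scikit-learn regressors, score() reports the coefficient of determination, R^2 = 1 - SS_res / SS_tot. As a minimal sketch, it can be computed by hand with NumPy. The helper r2 and the sample values are illustrative only.</p>

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return 1.0 - ss_res / ss_tot

y_true = [1.0, 2.0, 3.0, 4.0]
perfect = r2(y_true, y_true)       # exact predictions
baseline = r2(y_true, [2.5] * 4)   # always predicting the mean of y_true

print(perfect, baseline)
```

<p>A value of 1 means perfect predictions, 0 means the model does no better than always predicting the mean, and negative values are possible when it does worse than that.</p>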

<li>Print root mean squared error (RMSE) from Linear Regression.

In [30]:
from sklearn.metrics import mean_squared_error
from math import sqrt

print(sqrt(mean_squared_error(Ytrain, model_lr.predict(Xtrain))))
print(sqrt(mean_squared_error(Ytest, model_lr.predict(Xtest))))

69738.99783909012
96424.54952449775
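<p>The two numbers above are the RMSE on the training set and on the test set, in the same units as median_house_value. As a small sketch of what the sqrt(mean_squared_error(...)) call computes, the helper rmse below reproduces it with NumPy on made-up values.</p>

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y_true - y_pred)^2))."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Made-up values: errors of +10, -10 and 0.
y_true = [100.0, 200.0, 300.0]
y_pred = [110.0, 190.0, 300.0]

# Squared errors are 100, 100 and 0, so the RMSE is sqrt(200 / 3).
error = rmse(y_true, y_pred)
print(error)
```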


<h4>7. Perform Decision Tree Regression :</h4>
<li>Perform Decision Tree Regression on training data.

In [16]:
from sklearn.tree import DecisionTreeRegressor
model_dtr = DecisionTreeRegressor()
model_dtr.fit(Xtrain, Ytrain)

DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=None, splitter='best')

<li>Predict output for test dataset using the fitted model.

In [17]:
model_dtr.score(Xtest, Ytest)

0.3094755846074423

<li>Print root mean squared error from Decision Tree Regression.

In [28]:
print(sqrt(mean_squared_error(Ytrain, model_dtr.predict(Xtrain))))
print(sqrt(mean_squared_error(Ytest, model_dtr.predict(Xtest))))

0.0
95822.47170128874
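<p>A training RMSE of 0.0 next to a much larger test RMSE suggests that the unrestricted tree memorizes the training data. The sketch below reproduces that effect on synthetic data and compares it with a depth-limited tree. The data, the seed and max_depth=3 are assumptions chosen for illustration, not tuned settings for the housing data.</p>

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: a linear signal plus noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 3.0 * X[:, 0] + rng.normal(0, 2.0, size=200)

# An unrestricted tree can grow until every training sample has its own leaf.
full = DecisionTreeRegressor(random_state=0).fit(X, y)

# Limiting the depth forces the tree to average over groups of samples.
shallow = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)

rmse_full = np.sqrt(mean_squared_error(y, full.predict(X)))
rmse_shallow = np.sqrt(mean_squared_error(y, shallow.predict(X)))

print(rmse_full, rmse_shallow)
```

<p>The depth-limited tree has a nonzero training error because it can no longer fit the noise; whether it generalizes better on the housing data would have to be checked on the test set or with cross-validation.</p>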


<h4>8. Perform Random Forest Regression :</h4>
<li>Perform Random Forest Regression on training data.

In [19]:
from sklearn.ensemble import RandomForestRegressor
model_rfr = RandomForestRegressor()
model_rfr.fit(Xtrain, Ytrain)



RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=None,
           oob_score=False, random_state=None, verbose=0, warm_start=False)

<li>Predict output for test dataset using the fitted model.

In [20]:
model_rfr.score(Xtest, Ytest)

0.5464456250926315

<li>Print RMSE (root mean squared error) from Random Forest Regression.

In [27]:
print(sqrt(mean_squared_error(Ytrain, model_rfr.predict(Xtrain))))
print(sqrt(mean_squared_error(Ytest, model_rfr.predict(Xtest))))

26893.866596069893
77659.13160503219
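<p>The printed model above uses n_estimators=10 and random_state=None, so rerunning the notebook can give slightly different scores, and the default number of trees differs between scikit-learn versions. A minimal sketch of setting both explicitly follows; the synthetic data and the chosen values are illustrative assumptions, not tuned settings.</p>

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data with two features.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 2))
y = 2.0 * X[:, 0] + X[:, 1] + rng.normal(0, 0.5, size=100)

# Explicit n_estimators and random_state make the fit repeatable.
rf_a = RandomForestRegressor(n_estimators=20, random_state=42).fit(X, y)
rf_b = RandomForestRegressor(n_estimators=20, random_state=42).fit(X, y)

same_predictions = np.allclose(rf_a.predict(X), rf_b.predict(X))
print(same_predictions)
```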


<h4>9. Bonus exercise: Perform Linear Regression with one independent variable :</h4>
<li>Extract just the median_income column from the independent variables (from Xtrain and Xtest).

In [22]:
# Scaling replaced the column names with integer positions; column 5 is median_income
Xtrainb = Xtrain[[5]]
Xtestb = Xtest[[5]]
print(Xtrainb.head())
print(Xtestb.head())

          5
0  0.115212
1  0.186404
2  0.122185
3  0.238507
4  0.210266
          5
0  0.170267
1  0.097709
2  0.231686
3  0.441732
4  0.080054


<li>Perform Linear Regression to predict housing values based on median_income.

In [23]:
model_lrb = LinearRegression()
model_lrb.fit(Xtrainb, Ytrain)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

<li>Predict output for test dataset using the fitted model.

In [24]:
model_lrb.score(Xtestb, Ytest)

0.4594764520934151

In [26]:
print(sqrt(mean_squared_error(Ytrain, model_lrb.predict(Xtrainb))))
print(sqrt(mean_squared_error(Ytest, model_lrb.predict(Xtestb))))

83470.55503439721
84778.38888590109


<li>Plot the fitted model for training data as well as for test data to check if the fitted model satisfies the test data.
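<p>One possible way to draw this plot is sketched below with matplotlib. It uses synthetic stand-ins for the scaled median_income column and the house values, so the arrays x_train, y_train, x_test and y_test and the numbers used to generate them are assumptions; in the notebook they would correspond to Xtrainb, Ytrain, Xtestb and Ytest, with model_lrb as the fitted model.</p>

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Synthetic stand-ins for the scaled median_income feature and the house values.
rng = np.random.default_rng(0)
x_train = rng.uniform(0, 1, size=(80, 1))
y_train = 50000 + 400000 * x_train[:, 0] + rng.normal(0, 40000, size=80)
x_test = rng.uniform(0, 1, size=(20, 1))
y_test = 50000 + 400000 * x_test[:, 0] + rng.normal(0, 40000, size=20)

model = LinearRegression().fit(x_train, y_train)

# One panel per data set: scatter of the points plus the fitted regression line.
fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
panels = [(axes[0], x_train, y_train, "Training data"),
          (axes[1], x_test, y_test, "Test data")]
for ax, x, y, title in panels:
    order = np.argsort(x[:, 0])  # sort by x so the line is drawn left to right
    ax.scatter(x[:, 0], y, s=10, alpha=0.6)
    ax.plot(x[order, 0], model.predict(x[order]), color="red")
    ax.set_title(title)
    ax.set_xlabel("median_income (scaled)")
axes[0].set_ylabel("median_house_value")
fig.tight_layout()
```

<p>Both panels show the same line, fitted on the training data only. If the test points scatter around it in roughly the same way as the training points, the fit carries over to the test data.</p>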