<h1>Machine Learning - Project</h1>
<h3>California Housing Price Prediction</h3>
<h5>By - Anishka Patel email: anishka.vpatel@gmail.com</h5>

<p>DESCRIPTION <br>
<br>
Background of Problem Statement :<br>
<br>
The US Census Bureau has published California Census Data which has 10 types of metrics <br>
such as the population, median income, median housing price, and so on for each block <br>
group in California. The dataset also serves as an input for project scoping and tries to <br>
specify the functional and nonfunctional requirements for it.<br>
<br>
Problem Objective :<br>
<br>
The project aims at building a model of housing prices to predict median house values <br>
in California using the provided dataset. This model should learn from the data and be <br>
able to predict the median housing price in any district, given all the other metrics.<br>
<br>
Districts or block groups are the smallest geographical units for which the US Census <br>
Bureau publishes sample data (a block group typically has a population of 600 to 3,000 <br>
people). There are 20,640 districts in the project dataset.<br></p>
<p>
Analysis Tasks to be performed:
<ol>
<li>Build a model of housing prices to predict median house values in California using <br>the provided dataset.

<li>Train the model to learn from the data to predict the median housing price in any <br>district, given all the other metrics.

<li>Predict housing prices based on median_income and plot the regression chart for it.</ol></p>

In [1]:
# IMPORTS

import pandas as pd

<h4>1. Load the data :</h4>
<li>Read the “housing.csv” file from the folder into the program.

In [2]:
df = pd.read_excel('1553768847_housing.xlsx')

<li>Print first few rows of this data.

In [3]:
df.head()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,ocean_proximity,median_house_value
0,-122.23,37.88,41,880,129.0,322,126,8.3252,NEAR BAY,452600
1,-122.22,37.86,21,7099,1106.0,2401,1138,8.3014,NEAR BAY,358500
2,-122.24,37.85,52,1467,190.0,496,177,7.2574,NEAR BAY,352100
3,-122.25,37.85,52,1274,235.0,558,219,5.6431,NEAR BAY,341300
4,-122.25,37.85,52,1627,280.0,565,259,3.8462,NEAR BAY,342200


<li>Extract input (X) and output (Y) data from the dataset.

In [4]:
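# Skip longitude and latitude (columns 0-1); the last column is the target.
# .copy() makes X independent of df so later edits to X are safe.
X = df.iloc[:, 2:-1].copy()
Y = df.iloc[:, -1]
print(X.head())
print(Y.head())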

   housing_median_age  total_rooms  total_bedrooms  population  households  \
0                  41          880           129.0         322         126   
1                  21         7099          1106.0        2401        1138   
2                  52         1467           190.0         496         177   
3                  52         1274           235.0         558         219   
4                  52         1627           280.0         565         259   

   median_income ocean_proximity  
0         8.3252        NEAR BAY  
1         8.3014        NEAR BAY  
2         7.2574        NEAR BAY  
3         5.6431        NEAR BAY  
4         3.8462        NEAR BAY  
0    452600
1    358500
2    352100
3    341300
4    342200
Name: median_house_value, dtype: int64


<h4>2. Handle missing values :</h4>
<li>Fill the missing values with the mean of the respective column.

In [5]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 7 columns):
housing_median_age    20640 non-null int64
total_rooms           20640 non-null int64
total_bedrooms        20433 non-null float64
population            20640 non-null int64
households            20640 non-null int64
median_income         20640 non-null float64
ocean_proximity       20640 non-null object
dtypes: float64(2), int64(4), object(1)
memory usage: 1.1+ MB


<p>The total_bedrooms column has 207 null values (20,433 non-null entries out of 20,640).</p>

In [6]:
# Handling missing data
mean_total_bedrooms = X['total_bedrooms'].mean()
mean_total_bedrooms

537.8705525375618

In [7]:
# Assign the filled column back instead of relying on chained inplace=True
X['total_bedrooms'] = X['total_bedrooms'].fillna(mean_total_bedrooms)
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 7 columns):
housing_median_age    20640 non-null int64
total_rooms           20640 non-null int64
total_bedrooms        20640 non-null float64
population            20640 non-null int64
households            20640 non-null int64
median_income         20640 non-null float64
ocean_proximity       20640 non-null object
dtypes: float64(2), int64(4), object(1)
memory usage: 1.1+ MB


<p>X no longer contains any null values.</p>
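The same mean-filling step can also be expressed with scikit-learn's `SimpleImputer`, which stores the learned mean so it can be reapplied to new data. The following is only a minimal sketch on a made-up toy frame (the `toy` values are illustrative, not taken from the housing file):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame standing in for the housing features (values are made up).
toy = pd.DataFrame({
    "total_rooms": [880, 7099, 1467, 1274],
    "total_bedrooms": [129.0, np.nan, 190.0, 235.0],
})

# strategy="mean" replaces each NaN with the mean of its own column.
imputer = SimpleImputer(strategy="mean")
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)

print(filled)
```

Because the imputer keeps the column means after `fit`, calling `imputer.transform` on a later batch would reuse them rather than recomputing.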

<h4>3. Encode categorical data :</h4>
<li>Convert categorical column in the dataset to numerical data.

In [8]:
X = pd.get_dummies(X, columns=['ocean_proximity'])
X.head()

Unnamed: 0,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,ocean_proximity_<1H OCEAN,ocean_proximity_INLAND,ocean_proximity_ISLAND,ocean_proximity_NEAR BAY,ocean_proximity_NEAR OCEAN
0,41,880,129.0,322,126,8.3252,0,0,0,1,0
1,21,7099,1106.0,2401,1138,8.3014,0,0,0,1,0
2,52,1467,190.0,496,177,7.2574,0,0,0,1,0
3,52,1274,235.0,558,219,5.6431,0,0,0,1,0
4,52,1627,280.0,565,259,3.8462,0,0,0,1,0
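To make the encoding step easier to follow, here is a small sketch of `pd.get_dummies` on a made-up three-row frame. Passing `dtype=int` is an assumption added here so that recent pandas versions print 0/1 columns like the output above instead of True/False:

```python
import pandas as pd

# Made-up rows; only the column names mirror the housing data.
toy = pd.DataFrame({
    "median_income": [8.3252, 3.8462, 2.5],
    "ocean_proximity": ["NEAR BAY", "INLAND", "NEAR BAY"],
})

# One indicator column is created per distinct category value.
encoded = pd.get_dummies(toy, columns=["ocean_proximity"], dtype=int)

print(encoded)
```

Each row has exactly one indicator set to 1, so the indicator columns always sum to 1 per row.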


<h4>4. Split the dataset : </h4>
<li>Split the data into 80% training dataset and 20% test dataset.

In [9]:
from sklearn.model_selection import train_test_split

Xtrain, Xtest, Ytrain, Ytest = train_test_split(X, Y, test_size=.2, random_state=15)
print(Xtrain.shape)
print(Xtest.shape)
print(Ytrain.shape)
print(Ytest.shape)

(16512, 11)
(4128, 11)
(16512,)
(4128,)


<h4>5. Standardize data :</h4>
<li>Standardize training and test datasets.

In [10]:
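from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# Note: the scaler is fit on all of X, so test-set statistics influence the
# scaling; fitting on Xtrain only would avoid this leakage.
scaler.fit(X)
Xtrain = pd.DataFrame(data=scaler.transform(Xtrain))
Xtest = pd.DataFrame(data=scaler.transform(Xtest))
print(Xtrain.head())
print(Xtest.head())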

         0         1         2         3         4         5         6   \
0 -1.481058 -0.977631 -1.001465 -1.057436 -1.045039 -0.894932 -0.891156   
1  0.267020  0.179796  0.661002  0.245951  0.571406 -0.351553 -0.891156   
2 -0.209729 -0.678304 -0.345542  0.401368 -0.260357 -0.841716  1.122138   
3 -0.765935  1.888205  2.158892  2.454467  2.342172  0.046126  1.122138   
4  1.061601 -0.675554 -0.653229 -0.336864 -0.676238 -0.169426  1.122138   

         7         8         9         10  
0 -0.681889 -0.015566  2.830742 -0.384466  
1 -0.681889 -0.015566  2.830742 -0.384466  
2 -0.681889 -0.015566 -0.353264 -0.384466  
3 -0.681889 -0.015566 -0.353264 -0.384466  
4 -0.681889 -0.015566 -0.353264 -0.384466  
         0         1         2         3         4         5         6   \
0  1.856182 -0.535287 -0.417097 -0.490515 -0.398984 -0.474725 -0.891156   
1  0.584852 -0.692056 -0.655614 -0.833140 -0.914258 -1.028527 -0.891156   
2 -1.481058  0.124331  0.050397  0.225641  0.082288 -0.00593
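A common variant of this step fits the scaler on the training split only and then applies the same transform to both splits, so that no test-set statistics enter the preprocessing. The sketch below uses synthetic data; `X_demo`, `X_tr`, and `X_te` are hypothetical names, not variables from this notebook:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic features standing in for the housing columns.
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(200, 3))

X_tr, X_te = train_test_split(X_demo, test_size=0.2, random_state=15)

# Learn mean and standard deviation from the training rows only.
scaler_demo = StandardScaler().fit(X_tr)
X_tr_std = scaler_demo.transform(X_tr)
X_te_std = scaler_demo.transform(X_te)  # reuses the training statistics

print(X_tr_std.mean(axis=0), X_tr_std.std(axis=0))
```

With this arrangement the training columns have mean 0 and unit variance, while the test columns are only approximately standardized.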



<h4> Applying PCA with 2 principal components </h4>

In [11]:
from sklearn.decomposition import PCA
pca = PCA(n_components=2)

In [12]:
pca.fit(Xtrain)

PCA(copy=True, iterated_power='auto', n_components=2, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)

In [13]:
x_pca = pca.transform(Xtrain)

In [14]:
pca.components_

array([[-0.23057552,  0.48485286,  0.48986607,  0.47510984,  0.49097558,
         0.04632552,  0.02707199,  0.00721981, -0.00963923, -0.0435184 ,
        -0.00892459],
       [ 0.19553646,  0.00644481, -0.00893356,  0.02330384,  0.0196778 ,
         0.33198537,  0.62648489, -0.67335457, -0.00430953,  0.05273933,
        -0.04254266]])

In [15]:
pca.explained_variance_ratio_

array([0.33985358, 0.1576118 ])
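The two ratios printed above add up to roughly 0.50, meaning the two components keep about half of the variance. One way to pick the number of components more deliberately is to look at the cumulative explained variance. The sketch below does this on synthetic data; the 0.95 threshold and all `*_demo` names are assumptions for illustration:

```python
import numpy as np
from sklearn.decomposition import PCA

# Ratios reported in the notebook output above.
reported = np.array([0.33985358, 0.1576118])
print(reported.sum())  # share of variance kept by the two components

# Synthetic data: 8 observed columns driven by 3 latent factors plus noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 3))
mixing = rng.normal(size=(3, 8))
X_demo = latent @ mixing + 0.01 * rng.normal(size=(500, 8))

# Fit all components, then find how many are needed to reach 95%.
pca_full = PCA().fit(X_demo)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)
n_needed = int(np.searchsorted(cumulative, 0.95) + 1)

print(cumulative, n_needed)
```

scikit-learn also accepts a float such as `PCA(n_components=0.95)`, which selects the component count by the same criterion.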

In [16]:
Xtrain = pca.transform(Xtrain)
Xtest = pca.transform(Xtest)

<h4>6. Perform Linear Regression : </h4>
<li>Perform Linear Regression on training data.

In [17]:
from sklearn.linear_model import LinearRegression
model_lr = LinearRegression()
model_lr.fit(Xtrain, Ytrain)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

<li>Predict output for test dataset using the fitted model.

In [18]:
# score() returns the coefficient of determination (R^2) on the test set
model_lr.score(Xtest, Ytest)

0.30793100109309013

<li>Print root mean squared error (RMSE) from Linear Regression.

In [19]:
from sklearn.metrics import mean_squared_error
from math import sqrt
print(sqrt(mean_squared_error(Ytrain, model_lr.predict(Xtrain))))
print(sqrt(mean_squared_error(Ytest, model_lr.predict(Xtest))))

95384.45403525533
95929.58097461327
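For reference, both metrics used in this notebook can be computed by hand. The sketch below is a plain NumPy version on three made-up values; `rmse` and `r2` are hypothetical helper names, not part of scikit-learn:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y - y_hat)^2))."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

y_true = [3.0, 5.0, 7.0]
y_pred = [2.0, 5.0, 8.0]
print(rmse(y_true, y_pred), r2(y_true, y_pred))
```

An R^2 below zero, as can happen on a test set, means the predictions have a larger squared error than simply predicting the mean of the true values.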


<h4>7. Perform Decision Tree Regression :</h4>
<li>Perform Decision Tree Regression on training data.

In [20]:
from sklearn.tree import DecisionTreeRegressor
model_dtr = DecisionTreeRegressor()
model_dtr.fit(Xtrain, Ytrain)

DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=None, splitter='best')

<li>Predict output for test dataset using the fitted model.

In [21]:
# score() returns R^2 on the test set; a negative value is worse than predicting the mean
model_dtr.score(Xtest, Ytest)

-0.007953400314982817

<li>Print root mean squared error from Decision Tree Regression.

In [22]:
print(sqrt(mean_squared_error(Ytrain, model_dtr.predict(Xtrain))))
print(sqrt(mean_squared_error(Ytest, model_dtr.predict(Xtest))))

0.0
115770.54474579399
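A training RMSE of 0.0 next to a much larger test RMSE indicates that the unconstrained tree has memorized the training rows. Limiting the tree depth is one common way to reduce this. The sketch below compares the two settings on synthetic data; `max_depth=4` and the `*_demo` names are arbitrary choices for illustration, not tuned values:

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem: a linear signal plus noise.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(400, 2))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(scale=2.0, size=400)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=15
)

def train_test_rmse(model):
    """Fit the model and return (train RMSE, test RMSE)."""
    model.fit(X_tr, y_tr)
    train_err = np.sqrt(mean_squared_error(y_tr, model.predict(X_tr)))
    test_err = np.sqrt(mean_squared_error(y_te, model.predict(X_te)))
    return train_err, test_err

deep = train_test_rmse(DecisionTreeRegressor(random_state=0))
shallow = train_test_rmse(DecisionTreeRegressor(max_depth=4, random_state=0))

print("unconstrained:", deep)
print("max_depth=4:  ", shallow)
```

The unconstrained tree reproduces the training targets exactly, while the shallow tree accepts some training error in exchange for a simpler model.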


<h4>8. Perform Random Forest Regression :</h4>
<li>Perform Random Forest Regression on training data.

In [23]:
from sklearn.ensemble import RandomForestRegressor
model_rfr = RandomForestRegressor()
model_rfr.fit(Xtrain, Ytrain)



RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=None,
           oob_score=False, random_state=None, verbose=0, warm_start=False)

<li>Predict output for test dataset using the fitted model.

In [24]:
# score() returns R^2 on the test set
model_rfr.score(Xtest, Ytest)

0.3491747117962258

<li>Print RMSE (root mean squared error) from Random Forest Regression.

In [25]:
print(sqrt(mean_squared_error(Ytrain, model_rfr.predict(Xtrain))))
print(sqrt(mean_squared_error(Ytest, model_rfr.predict(Xtest))))

38448.07618658941
93027.22354520868


<h4>9. Bonus exercise: Perform Linear Regression with one independent variable :</h4>
<li>Extract just the median_income column from the independent variables (from Xtrain and Xtest).

<li>Perform Linear Regression to predict housing values based on median_income.

<li>Predict output for test dataset using the fitted model.

<li>Plot the fitted model for training data as well as for test data to check if the fitted model satisfies the test data.
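The bonus steps above could be approached along the following lines. This is only a sketch on synthetic data, not the notebook's result: `fit_income_model` and `plot_fit` are hypothetical helpers, and the synthetic slope and intercept are invented for the demonstration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def fit_income_model(income_train, value_train):
    """Fit house value against a single feature (median income)."""
    x = np.asarray(income_train, dtype=float).reshape(-1, 1)
    return LinearRegression().fit(x, value_train)

def plot_fit(ax, income, value, model, title):
    """Scatter the data on a matplotlib Axes and overlay the fitted line."""
    income = np.asarray(income, dtype=float)
    ax.scatter(income, value, s=5, alpha=0.3)
    grid = np.linspace(income.min(), income.max(), 100).reshape(-1, 1)
    ax.plot(grid.ravel(), model.predict(grid), color="red")
    ax.set_xlabel("median_income")
    ax.set_ylabel("median_house_value")
    ax.set_title(title)

# Synthetic stand-in for the real columns (values are made up).
rng = np.random.default_rng(0)
income = rng.uniform(0.5, 15.0, size=300)
value = 40000.0 * income + 50000.0 + rng.normal(scale=1000.0, size=300)

# Simple 80/20 split of the synthetic rows.
model_income = fit_income_model(income[:240], value[:240])
test_r2 = model_income.score(income[240:].reshape(-1, 1), value[240:])
print(model_income.coef_[0], model_income.intercept_, test_r2)
```

To draw the charts, one would create axes with `fig, (ax1, ax2) = plt.subplots(1, 2)` after `import matplotlib.pyplot as plt` and call `plot_fit` once for the training rows and once for the test rows. In this notebook the median_income column would have to be taken before the PCA step, since that step overwrites Xtrain and Xtest with the two components.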

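<h3>Conclusion: PCA with 2 components doesn't provide a better result</h3>
<p>The two components explain only about 50% of the variance (0.34 + 0.16), and the best test R^2 obtained on them is about 0.35 (Random Forest).</p>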