# California Housing Price Prediction

### Description

**Background of Problem Statement:**
The US Census Bureau has published California Census Data with 10 types of metrics, such as population, median income, and median housing price, for each block group in California. The dataset also serves as an input for project scoping and helps specify the project's functional and nonfunctional requirements.


**Problem Objective :**
The project aims to build a model that predicts median house values in California from the provided dataset. The model should learn from the data and be able to predict the median housing price in any district, given all the other metrics.
Districts, or block groups, are the smallest geographical units for which the US Census Bureau publishes sample data (a block group typically has a population of 600 to 3,000 people). There are 20,640 districts in the project dataset.


In [1]:
#import libraries
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error
from math import sqrt

%matplotlib inline

**1. Load the data :** 
Read the “housing.csv” file from the folder into the program. Print the first few rows of the data.
Extract input (X) and output (Y) data from the dataset.

In [2]:
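#Task 1 - read in the data and print first 5 rows
# Note: index_col=0 makes the first CSV column (longitude) the index,
# so longitude is not available as a regular feature column below.
data = pd.read_csv("housing.csv", index_col=0)
data.head()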


| longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | ocean_proximity | median_house_value |
|---|---|---|---|---|---|---|---|---|---|
| -122.23 | 37.88 | 41 | 880 | 129.0 | 322 | 126 | 8.3252 | NEAR BAY | 452600 |
| -122.22 | 37.86 | 21 | 7099 | 1106.0 | 2401 | 1138 | 8.3014 | NEAR BAY | 358500 |
| -122.24 | 37.85 | 52 | 1467 | 190.0 | 496 | 177 | 7.2574 | NEAR BAY | 352100 |
| -122.25 | 37.85 | 52 | 1274 | 235.0 | 558 | 219 | 5.6431 | NEAR BAY | 341300 |
| -122.25 | 37.85 | 52 | 1627 | 280.0 | 565 | 259 | 3.8462 | NEAR BAY | 342200 |
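
The output above shows `longitude` as the index because the file was read with `index_col=0`. The following is a minimal sketch of the difference, using an in-memory sample built from the first two rows shown above. The `sample_csv` string and the names `indexed` and `plain` are illustrative only, and the column order is assumed to match the displayed output.

```python
import io

import pandas as pd

# Two rows copied from the head() output; the column order is an assumption
# based on that output, not a check against the real housing.csv.
sample_csv = """longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,ocean_proximity,median_house_value
-122.23,37.88,41,880,129.0,322,126,8.3252,NEAR BAY,452600
-122.22,37.86,21,7099,1106.0,2401,1138,8.3014,NEAR BAY,358500
"""

# index_col=0 moves the first column (longitude) into the index
indexed = pd.read_csv(io.StringIO(sample_csv), index_col=0)

# Without index_col, longitude stays a regular column and pandas
# creates a default integer index
plain = pd.read_csv(io.StringIO(sample_csv))

print("longitude" in indexed.columns)
print("longitude" in plain.columns)
```

If longitude should be used as a model input, reading the file without `index_col` (or calling `reset_index()`) is one way to keep it among the columns.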


In [3]:
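#Task 1 extract x & y
# longitude is the index (index_col=0 above), so it is not listed here
feature_cols=['latitude', 'housing_median_age', 'total_rooms', 'total_bedrooms',
       'population', 'households', 'median_income', 'ocean_proximity']
x=data[feature_cols]
y=data.median_house_value  # x and y are extracted again in step 3, after imputation and encoding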

**2. Handle missing values :**
Fill the missing values with the mean of the respective column.

In [4]:
data.isnull().sum()

latitude                0
housing_median_age      0
total_rooms             0
total_bedrooms        207
population              0
households              0
median_income           0
ocean_proximity         0
median_house_value      0
dtype: int64

The null counts show that the "total_bedrooms" column has 207 missing values. As the project instructs, we fill them with the mean of the column.

In [5]:
data["total_bedrooms"] = data["total_bedrooms"].fillna(data["total_bedrooms"].mean())

In [6]:
data.isnull().sum()

latitude              0
housing_median_age    0
total_rooms           0
total_bedrooms        0
population            0
households            0
median_income         0
ocean_proximity       0
median_house_value    0
dtype: int64

We have now confirmed that there are no missing values in our dataset and we can continue with our analysis.
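
As a small illustration of what mean imputation does, the sketch below applies the same `fillna` pattern to a toy column. The `toy` frame and its values are made up for the example and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy column with two missing entries (illustrative values only)
toy = pd.DataFrame({"total_bedrooms": [129.0, np.nan, 190.0, 235.0, np.nan]})

# Series.mean() skips NaN by default, so this is the mean of the 3 known values
col_mean = toy["total_bedrooms"].mean()

# Replace every NaN with that mean
toy["total_bedrooms"] = toy["total_bedrooms"].fillna(col_mean)

print(toy["total_bedrooms"].isnull().sum())
print(toy["total_bedrooms"].mean())
```

Filling with the mean leaves the column mean unchanged, but it reduces the column's variance, since every filled row sits exactly at the center.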

**3. Encode categorical data :**
Convert the categorical column in the dataset to numerical data.

In [7]:
# 3. Encode categorical data
le = LabelEncoder()
data["ocean_proximity"] = le.fit_transform(data["ocean_proximity"])
data.head()
x=data[feature_cols]
y=data.median_house_value
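
`LabelEncoder` numbers the categories in sorted order, so a linear model will treat the codes as if they were ordered quantities. One commonly used alternative is one-hot encoding. The sketch below compares the two on a toy column; the category names other than `NEAR BAY` are assumptions for illustration, and `pd.get_dummies` is a swapped-in technique, not what the notebook uses.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy column; only "NEAR BAY" appears in the output shown earlier,
# the other category names are assumed for the example.
toy = pd.DataFrame(
    {"ocean_proximity": ["NEAR BAY", "INLAND", "<1H OCEAN", "NEAR BAY"]}
)

# LabelEncoder assigns integer codes following the sorted class labels
le = LabelEncoder()
codes = le.fit_transform(toy["ocean_proximity"])
mapping = dict(zip(le.classes_, range(len(le.classes_))))

# One-hot encoding: one indicator column per category, no implied order
one_hot = pd.get_dummies(toy["ocean_proximity"], prefix="ocean")

print(mapping)
print(one_hot.columns.tolist())
```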

**4. Split the dataset :**
Split the data into 80% training dataset and 20% test dataset.

In [8]:
# 4. Split the dataset :
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=.2,random_state=1)
print(x_train.shape)
print(x_test.shape)
print(y_train.shape)
print(y_test.shape)
x_train.head()

(16512, 8)
(4128, 8)
(16512,)
(4128,)


| longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | ocean_proximity |
|---|---|---|---|---|---|---|---|---|
| -122.43 | 37.71 | 52 | 1410 | 286.0 | 879 | 282 | 3.1908 | 3 |
| -122.35 | 37.95 | 42 | 1485 | 290.0 | 971 | 303 | 3.6094 | 3 |
| -121.24 | 37.9 | 16 | 50 | 10.0 | 20 | 6 | 2.625 | 1 |
| -118.35 | 34.02 | 34 | 5218 | 1576.0 | 3538 | 1371 | 1.5143 | 0 |
| -118.39 | 33.89 | 38 | 1851 | 332.0 | 750 | 314 | 7.3356 | 0 |


**5. Standardize data :**
Standardize training and test datasets.

In [9]:
# 5. Standardize data :
from sklearn import preprocessing
standard_x_train = preprocessing.scale(x_train)
standard_x_test = preprocessing.scale(x_test)
standard_y_train = preprocessing.scale(y_train)
standard_y_test = preprocessing.scale(y_test)
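
The cell above scales each array with its own mean and standard deviation. A common alternative is to learn the scaling statistics from the training set only and reuse them on the test set, so both sets go through the same transformation. The sketch below shows that pattern with `StandardScaler` on synthetic data; the array names and the random data are assumptions for the example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for x_train and x_test (illustrative data only)
rng = np.random.default_rng(0)
x_train_toy = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
x_test_toy = rng.normal(loc=50.0, scale=10.0, size=(50, 3))

scaler = StandardScaler()
# fit_transform learns the per-column mean and std from the training data
x_train_scaled = scaler.fit_transform(x_train_toy)
# transform reuses those training statistics on the test data
x_test_scaled = scaler.transform(x_test_toy)

print(x_train_scaled.mean(axis=0).round(6))
print(x_test_scaled.mean(axis=0).round(3))
```

With this approach the scaled test columns will generally not have a mean of exactly zero, which is expected: they are measured against the training distribution.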


**6. Perform Linear Regression :**
Perform Linear Regression on training data.
Predict output for test dataset using the fitted model.
Print root mean squared error (RMSE) from Linear Regression.
[ HINT: Import mean_squared_error from sklearn.metrics ]

In [10]:
#6. Perform Linear Regression :
lm = LinearRegression()
lm.fit(standard_x_train,standard_y_train)
#print(lm.intercept_)
#print(lm.coef_)
predictions = lm.predict(standard_x_test)
print(sqrt(mean_squared_error(standard_y_test,predictions)))

0.6555816683062126


In [27]:
print(r2_score(standard_y_test,predictions))

0.5702126761808431


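The linear regression model is far from perfect: an R^2 of about 0.57 means it explains roughly 57% of the variance in the test target. The RMSE of about 0.66 is expressed in standardized units (standard deviations of the test target), not in the original units of median_house_value.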
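
Because the test target was standardized with `preprocessing.scale` (mean 0, variance 1), the two reported metrics are tied together: R^2 equals 1 minus the squared RMSE. The sketch below checks this identity on synthetic data; `y_true` and `y_pred` are made-up stand-ins, not the notebook's values.

```python
from math import sqrt

import numpy as np
from sklearn import preprocessing
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic target and stand-in predictions (illustrative only)
rng = np.random.default_rng(1)
y_true = rng.normal(200000.0, 100000.0, size=300)
y_scaled = preprocessing.scale(y_true)  # mean 0, variance 1
y_pred = 0.7 * y_scaled + rng.normal(0.0, 0.5, size=300)

rmse = sqrt(mean_squared_error(y_scaled, y_pred))
r2 = r2_score(y_scaled, y_pred)

# With a unit-variance target, R^2 = 1 - MSE = 1 - RMSE^2
print(r2, 1 - rmse ** 2)
```

The figures reported above are consistent with this: 1 - 0.65558^2 is approximately 0.5702.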

**7. Perform Decision Tree Regression :**
Perform Decision Tree Regression on training data.
Predict output for test dataset using the fitted model.
Print root mean squared error from Decision Tree Regression.

In [15]:
# 7. Perform Decision Tree Regression :
from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier()

In [18]:
classifier.fit(x_train,y_train)

DecisionTreeClassifier()

In [19]:
y_predict = classifier.predict(x_test)

In [20]:
print(sqrt(mean_squared_error(y_test,y_predict)))

89123.65358174406


In [21]:
from sklearn.metrics import accuracy_score
acc = accuracy_score(y_predict,y_test)
print(acc)

0.023498062015503876


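The Decision Tree model shows extremely poor accuracy (about 2.3%). Note that the cells above fit a DecisionTreeClassifier, which treats every distinct house value as a separate class, and accuracy_score counts only exact matches, so a low accuracy is expected for a continuous target. The RMSE of about 89,124, in the units of median_house_value, is the more meaningful figure here. A regressor such as DecisionTreeRegressor is the estimator that matches the task description.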
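
For comparison, here is a minimal sketch of decision tree regression with `DecisionTreeRegressor`, evaluated with RMSE and R^2 instead of accuracy. It runs on synthetic data, so the numbers it prints say nothing about the housing dataset; the data-generating formula and variable names are assumptions for the example.

```python
from math import sqrt

import numpy as np
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem: two informative features, one noise feature
rng = np.random.default_rng(2)
X = rng.uniform(0.0, 1.0, size=(500, 3))
y = 100.0 * X[:, 0] + 50.0 * X[:, 1] + rng.normal(0.0, 5.0, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=1)

# A regressor predicts continuous values instead of class labels
tree = DecisionTreeRegressor(random_state=1)
tree.fit(X_tr, y_tr)
tree_pred = tree.predict(X_te)

tree_rmse = sqrt(mean_squared_error(y_te, tree_pred))
tree_r2 = r2_score(y_te, tree_pred)
print(tree_rmse, tree_r2)
```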

**8. Perform Random Forest Regression :**

In [28]:
#8. Random Forest Regression
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier()
classifier.fit(x_train,y_train)
y_predict = classifier.predict(x_test)


In [29]:
print(sqrt(mean_squared_error(y_test,y_predict)))
acc = accuracy_score(y_predict,y_test)
print(acc)

79420.23076744254
0.03972868217054264


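The Random Forest model also shows very low exact-match accuracy (about 4.0%), but its RMSE of about 79,420 is lower than the Decision Tree's 89,124, so it performs somewhat better. As with the Decision Tree, the cell uses a classifier (RandomForestClassifier) on a continuous target; RandomForestRegressor is the regression counterpart.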
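
A minimal sketch of the regression counterpart, `RandomForestRegressor`, is shown below on synthetic data. The data, the choice of 100 trees, and the variable names are assumptions for illustration, and the printed scores do not reflect the housing dataset.

```python
from math import sqrt

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression problem (illustrative only)
rng = np.random.default_rng(3)
X_rf = rng.uniform(0.0, 1.0, size=(500, 3))
y_rf = 100.0 * X_rf[:, 0] + 50.0 * X_rf[:, 1] + rng.normal(0.0, 5.0, size=500)

Xr_tr, Xr_te, yr_tr, yr_te = train_test_split(
    X_rf, y_rf, test_size=0.2, random_state=1
)

# Averages the predictions of many trees fitted on bootstrap samples
forest = RandomForestRegressor(n_estimators=100, random_state=1)
forest.fit(Xr_tr, yr_tr)
forest_pred = forest.predict(Xr_te)

forest_rmse = sqrt(mean_squared_error(yr_te, forest_pred))
forest_r2 = r2_score(yr_te, forest_pred)
print(forest_rmse, forest_r2)
```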