## ***Housing Price Prediction***

### **Overview**

This project notebook walks through the steps needed to complete the Machine Learning task of predicting housing prices on the California Housing dataset available in scikit-learn.

---

We will perform the following steps to build a model for house price prediction:


---

[**1. Data Extraction**](#scrollTo=nlL0k3pO4XM9&line=1&uniqifier=1)


* [Import libraries](#scrollTo=hJb_1cMg5eCo&line=1&uniqifier=1)
* [Import Dataset from scikit-learn](#scrollTo=kzexflx_6QUw&line=2&uniqifier=1)
* [Understanding the given Description of Data and the problem Statement](#scrollTo=7cnoPXyS61jG&line=2&uniqifier=1)
* [Take a look at different Inputs and details available with dataset](#scrollTo=nNt1_Vt1_RnY&line=2&uniqifier=1)
* [Storing the obtained dataset into a Pandas Dataframe](#scrollTo=bGtCEhK3FIZ1&line=3&uniqifier=1)


---


**2. Preprocessing, EDA (Exploratory Data Analysis) and Visualization**

* Getting a closer Look at obtained Data
* Exploring different Statistics of the Data (Summary and Distributions)
* Dealing with Duplicate and Null (NaN) values
* Looking at Correlations (between individual features and between Input features and Target)
* Dealing with Outlier values
* Data Normalization (Plots and Tests)
* Feature Scaling (Feature Transformation)
* Feature Engineering (Feature Design)


---


**3. Modeling**

* Specifying Evaluation Metrics such as MAE, MSE, RMSE, R squared and adjusted R squared (using Cross-Validation and train test split)
* Baseline Models - trying multiple hyperparameters and models such as:
    * Linear Regression
    * Ridge Regression
    * Decision Tree Regressor
    * Random Forests Regressor
    * Gradient Boosted Regressor
    * XGBoost Regressor
    * Support Vector Regressor
* Model Selection (by comparing evaluation metrics)
* Prediction
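
As a rough sketch of the evaluation metrics named in the list above, the following computes them with NumPy on made-up values. The helper name `regression_metrics`, the predictions and the choice of `n_features=2` are illustrative assumptions and not part of this notebook; scikit-learn's `sklearn.metrics` module offers ready-made functions for most of these metrics.

```python
import numpy as np

def regression_metrics(y_true, y_pred, n_features):
    """Hypothetical helper: MAE, MSE, RMSE, R squared and adjusted R squared."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    n = y_true.size
    residuals = y_true - y_pred

    mae = np.mean(np.abs(residuals))           # mean absolute error
    mse = np.mean(residuals ** 2)              # mean squared error
    rmse = np.sqrt(mse)                        # root mean squared error
    ss_res = np.sum(residuals ** 2)            # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    # Adjusted R squared penalizes R squared for the number of input features.
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - n_features - 1)

    return {"MAE": float(mae), "MSE": float(mse), "RMSE": float(rmse),
            "R2": float(r2), "adjusted_R2": float(adj_r2)}

y_true = [4.526, 3.585, 3.521, 3.413, 3.422]   # target values in units of $100,000
y_pred = [4.1, 3.6, 3.4, 3.5, 3.3]             # made-up predictions for illustration
metrics = regression_metrics(y_true, y_pred, n_features=2)
print(metrics)
```

Adjusted R squared needs more records than features plus one, otherwise the denominator is zero or negative.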

---

**4. Gaining Insights about the Model**

* Learn Feature Importance and Relations with Prediction

---

**5. Deployment**

* Exporting the trained model so it can be used for later predictions (by storing the model object as a byte file)
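
A minimal sketch of the export step, assuming Python's built-in `pickle` module is used (the outline only says the model object is stored as a byte file). The dictionary below is a stand-in for a trained model, and the file name `model.pkl` is hypothetical.

```python
import os
import pickle
import tempfile

# Stand-in for a trained model object; a fitted scikit-learn estimator
# can be written and read back with the same two calls.
model = {"coef": [0.44, 0.01], "intercept": -0.1}

path = os.path.join(tempfile.mkdtemp(), "model.pkl")  # hypothetical file name

# Serialize the object to a file opened in binary write mode.
with open(path, "wb") as f:
    pickle.dump(model, f)

# Later: read the bytes back into an equivalent object.
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == model)
```

Unpickling can execute arbitrary code, so only load files that come from a source you trust.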

### **1. Data Extraction**

##### **Importing all Libraries needed for extracting and representing (visualizing) data** 

In [2]:
#IMPORTING LIBRARIES

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

---

##### **Importing / Loading the California Housing Price Data from scikit-learn (sklearn)**

Passing the parameter as_frame = True returns the data in the form of a Pandas DataFrame.

In [9]:
#IMPORTING DATA

from sklearn.datasets import fetch_california_housing  
cal_housing_dataset = fetch_california_housing(as_frame = True)

---

##### **Understanding the given Description of Data and the problem Statement.**


When we import a dataset using sklearn, we obtain a Bunch object, which is similar to a dictionary and contains both information about the dataset and the actual data that we can use.

> We can access the available keys in the Bunch object using the keys() method.


In [10]:
#LIST OF KEYS AVAILABLE WITH DATASET BUNCH OBJECT

cal_housing_dataset.keys()

dict_keys(['data', 'target', 'frame', 'target_names', 'feature_names', 'DESCR'])

We have the following keys available in the Bunch (data) obtained from sklearn:

*   data - It contains the data rows; each row holds the 8 input feature values of one record.
*   target - It contains the target values; each value is the median house value in units of 100,000 US Dollars.
*   frame - Only present when as_frame = True. Pandas DataFrame with data and target.
*   target_names - Name of the target feature.
*   feature_names - Array of ordered feature names used in the dataset.
*   DESCR - Description of the California housing dataset. This is important to understand the meaning of features that will be used to predict the housing Prices.
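
To illustrate the dictionary-like behaviour, here is a small sketch that builds a toy `Bunch` by hand with `sklearn.utils.Bunch`. The values are placeholders rather than the real dataset, and the variable name `toy` is an assumption made for this example.

```python
from sklearn.utils import Bunch

# Toy Bunch built by hand; placeholder values, not the real housing data.
toy = Bunch(
    data=[[8.3252, 41.0]],
    target=[4.526],
    feature_names=["MedInc", "HouseAge"],
)

print(list(toy.keys()))             # keys, as with a regular dict
print(toy["target"] is toy.target)  # key access and attribute access give the same object
print("DESCR" in toy)               # membership test works like a dict
```

The same two access styles apply to `cal_housing_dataset`, for example `cal_housing_dataset["DESCR"]` and `cal_housing_dataset.DESCR`.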



---

##### **Take a look at different Inputs and details available with dataset**

We can look at the information available in the DESCR key to get an understanding of the data, such as the shape of our dataset and the different features that we can use for predicting house prices.

In [14]:
#DESCRIPTION OF DATASET

print(cal_housing_dataset.DESCR)

.. _california_housing_dataset:

California Housing dataset
--------------------------

**Data Set Characteristics:**

    :Number of Instances: 20640

    :Number of Attributes: 8 numeric, predictive attributes and the target

    :Attribute Information:
        - MedInc        median income in block group
        - HouseAge      median house age in block group
        - AveRooms      average number of rooms per household
        - AveBedrms     average number of bedrooms per household
        - Population    block group population
        - AveOccup      average number of household members
        - Latitude      block group latitude
        - Longitude     block group longitude

    :Missing Attribute Values: None

This dataset was obtained from the StatLib repository.
https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html

The target variable is the median house value for California districts,
expressed in hundreds of thousands of dollars ($100,000).

This dataset was derived

From the dataset description above we can see that we have 20640 housing data points (records), and each record describes the houses of one block group in the form of 8 input features:

* MedInc,  HouseAge, AveRooms, AveBedrms, Population, AveOccup, Latitude and Longitude

This information about the block in which a house is located can be used to create a model that predicts the expected house price for a new set of characteristics.

We can also separately get a list of all the input features and the target feature available.

In [15]:
#PREDICTIVE (INPUT) FEATURES AVAILABLE 

print(cal_housing_dataset.feature_names)

['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup', 'Latitude', 'Longitude']


In [16]:
#TARGET (OUTPUT) FEATURE

print(cal_housing_dataset.target_names)

['MedHouseVal']
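
These two name lists are handy for selecting columns. The sketch below applies them to a small DataFrame holding the first three records, typed in by hand so that it runs without downloading anything; the variable names `sample`, `X` and `y` are illustrative assumptions.

```python
import pandas as pd

feature_names = ["MedInc", "HouseAge", "AveRooms", "AveBedrms",
                 "Population", "AveOccup", "Latitude", "Longitude"]
target_names = ["MedHouseVal"]

# First three records of the dataset, entered by hand for illustration.
sample = pd.DataFrame(
    [[8.3252, 41.0, 6.984127, 1.023810, 322.0, 2.555556, 37.88, -122.23, 4.526],
     [8.3014, 21.0, 6.238137, 0.971880, 2401.0, 2.109842, 37.86, -122.22, 3.585],
     [7.2574, 52.0, 8.288136, 1.073446, 496.0, 2.802260, 37.85, -122.24, 3.521]],
    columns=feature_names + target_names,
)

X = sample[feature_names]    # DataFrame with the 8 input features
y = sample[target_names[0]]  # Series with the target values

print(X.shape, y.shape)
```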


The data key of the Bunch object contains the input feature values for the housing records.

In [17]:
#DATA AVAILABLE CORRESPONDING TO INPUT FEATURES FOR EACH HOUSING RECORD

print(cal_housing_dataset.data)

       MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  \
0      8.3252      41.0  6.984127   1.023810       322.0  2.555556     37.88   
1      8.3014      21.0  6.238137   0.971880      2401.0  2.109842     37.86   
2      7.2574      52.0  8.288136   1.073446       496.0  2.802260     37.85   
3      5.6431      52.0  5.817352   1.073059       558.0  2.547945     37.85   
4      3.8462      52.0  6.281853   1.081081       565.0  2.181467     37.85   
...       ...       ...       ...        ...         ...       ...       ...   
20635  1.5603      25.0  5.045455   1.133333       845.0  2.560606     39.48   
20636  2.5568      18.0  6.114035   1.315789       356.0  3.122807     39.49   
20637  1.7000      17.0  5.205543   1.120092      1007.0  2.325635     39.43   
20638  1.8672      18.0  5.329513   1.171920       741.0  2.123209     39.43   
20639  2.3886      16.0  5.254717   1.162264      1387.0  2.616981     39.37   

       Longitude  
0        -122.23  
1

We also have the corresponding housing price for each of the records.

In [18]:
#RESULTING TARGET / OUTPUT FEATURE VALUES AVAILABLE

print(cal_housing_dataset.target)

0        4.526
1        3.585
2        3.521
3        3.413
4        3.422
         ...  
20635    0.781
20636    0.771
20637    0.923
20638    0.847
20639    0.894
Name: MedHouseVal, Length: 20640, dtype: float64
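
Because the target is expressed in units of $100,000, a value such as 4.526 stands for $452,600. A short sketch of the conversion on three values taken from the output above (the name `target_usd` is an assumption for this example):

```python
import pandas as pd

# Three target values copied from the output above (units of $100,000).
target = pd.Series([4.526, 3.585, 0.894], name="MedHouseVal")

# Convert to US Dollars; rounding removes tiny floating point residue.
target_usd = (target * 100_000).round(2)

print(target_usd.tolist())
```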


We can see the complete dataset (input features and targets) using the frame key of our dataset object. It stores the complete housing dataset (all records and features) as a Pandas DataFrame, and it is available because we imported our data using the parameter as_frame = True.

In [19]:
#THE COMPLETE DATASET AVAILABLE FOR USE 

cal_housing_dataset.frame

| | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | MedHouseVal |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.023810 | 322.0 | 2.555556 | 37.88 | -122.23 | 4.526 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.971880 | 2401.0 | 2.109842 | 37.86 | -122.22 | 3.585 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.802260 | 37.85 | -122.24 | 3.521 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 | 3.413 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 | 3.422 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 20635 | 1.5603 | 25.0 | 5.045455 | 1.133333 | 845.0 | 2.560606 | 39.48 | -121.09 | 0.781 |
| 20636 | 2.5568 | 18.0 | 6.114035 | 1.315789 | 356.0 | 3.122807 | 39.49 | -121.21 | 0.771 |
| 20637 | 1.7000 | 17.0 | 5.205543 | 1.120092 | 1007.0 | 2.325635 | 39.43 | -121.22 | 0.923 |
| 20638 | 1.8672 | 18.0 | 5.329513 | 1.171920 | 741.0 | 2.123209 | 39.43 | -121.32 | 0.847 |
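
The frame can be thought of as the data table with the target appended as the last column. The sketch below rebuilds that layout from two tiny hand-typed pieces using `pd.concat`; it only illustrates the resulting shape and is not a claim about how scikit-learn assembles the frame internally.

```python
import pandas as pd

# Two input features and the target for the first two records (hand-typed).
data = pd.DataFrame({"MedInc": [8.3252, 8.3014], "HouseAge": [41.0, 21.0]})
target = pd.Series([4.526, 3.585], name="MedHouseVal")

# Join column-wise: rows are aligned on the shared index 0, 1.
frame = pd.concat([data, target], axis=1)

print(frame.columns.tolist())
print(frame.shape)
```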


---

##### **We can store our cal_housing_dataset.frame in a separate new pandas variable (as a DataFrame) for easy reference later on.**

In [20]:
#STORING THE HOUSING DATA IN A SEPARATE PANDAS DATAFRAME VARIABLE

dataset = cal_housing_dataset.frame
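
Note that a plain assignment like this gives the same DataFrame a second name rather than creating a copy. If later steps modify the data in place and the original frame should stay untouched, `.copy()` is one option. A small sketch on a toy frame (the names `alias` and `independent` are assumptions for this example):

```python
import pandas as pd

frame = pd.DataFrame({"MedInc": [8.3252, 8.3014], "MedHouseVal": [4.526, 3.585]})

alias = frame               # second name for the same object
independent = frame.copy()  # separate object with its own data

# Changing the copy leaves the original frame as it was.
independent.loc[0, "MedHouseVal"] = 0.0

print(alias is frame)
print(frame.loc[0, "MedHouseVal"])
```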

We can view the top 5 rows of the dataset using the .head() method of the DataFrame.

In [21]:
#VIEW TOP 5 ROWS OF DATASET

dataset.head()

| | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | MedHouseVal |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.02381 | 322.0 | 2.555556 | 37.88 | -122.23 | 4.526 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.97188 | 2401.0 | 2.109842 | 37.86 | -122.22 | 3.585 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.80226 | 37.85 | -122.24 | 3.521 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 | 3.413 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 | 3.422 |
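
`head()` returns 5 rows by default and accepts a different row count as its argument; `tail()` does the same from the end of the table. A quick sketch on a toy DataFrame with placeholder values (not housing data):

```python
import pandas as pd

# Seven placeholder rows, just enough to show the row counts.
df = pd.DataFrame({"value": range(7)})

print(len(df.head()))             # 5 rows by default
print(len(df.head(3)))            # first 3 rows
print(df.tail(2).index.tolist())  # index labels of the last 2 rows
```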
