# California Housing Price Prediction

> ##  Business Problem: 

To predict the prices of houses in California based on their specifications and locations

> ##   Description : 

The Dataset is built using the 1990 California census data. It contains one row per census block group. A block group is the smallest geographical unit for which the U.S. Census Bureau publishes sample data (a block group typically has a population of 600 to 3,000 people).

The information was collected on the variables using all the block groups in California from the 1990 Census. In this sample a block group on average includes 1425.5 individuals living in a geographically compact area. Naturally, the geographical area included varies inversely with the population density. Distances were computed among the centroids of each block group as measured in latitude and longitude and all the districts reporting zero entries for the independent and dependent variables were excluded. The final data contained 20,640 observations on 9 variables. The dependent variable is ln(median house value). The other variables are as follows: 

    1. longitude: A measure of how far west a house is; a more negative value is farther west (California longitudes are negative)
    2. latitude: A measure of how far north a house is; a higher value is farther north
    3. housingMedianAge: Median age of a house within a block; a lower number is a newer building
    4. totalRooms: Total number of rooms within a block
    5. totalBedrooms: Total number of bedrooms within a block
    6. population: Total number of people residing within a block
    7. households: Total number of households, a group of people residing within a home unit, for a block
    8. medianIncome: Median income for households within a block of houses (measured in tens of thousands of US Dollars)
    9. medianHouseValue: Median house value for households within a block (measured in US Dollars)
    10. oceanProximity: Location of the house w.r.t ocean/sea

> ## Approach : 

You will understand the data and make the best model in the following stages : 

- **Data Description**

- **Exploratory Data Analysis**
         
      - Uni-Variate Analysis : Boxplot , Histogram , Barplot
      - Correlation Analysis : Correlation Matrix
      - Bi-Variate Analysis : Scatter Matrix and Plot
      - Multi-Variate Analysis : Scatter Plot
      
- **Data Cleaning and Manipulation**
    
      - Attribute Combination
      - OneHotEncoding Categorical Attributes
      - Missing Values Handling

- **Data Sampling and Splitting**
     
      - Stratified Sampling : Implementation and Comparison
      - Train-Validation-Test Splitting

- **Modelling and CrossValidating**

      - Linear Regression 
      - Decision Tree 
      - Random Forest 

- **HyperTuning : GridSearch** 
- **Test Set Evaluation**
- **Conclusion** 

## `1` Data Description

In [1]:
#importing Libraries
import numpy as np
import pandas as pd

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# loading data
data_path = 'C:/Users/kusht/OneDrive/Desktop/Excel-csv/housing.csv'
housing = pd.read_csv(data_path)


**TASK : Get info about the dataset using `info()`**

In [3]:
### START CODE HERE (~ 1 Line of code)

### END CODE

**TASK : Fill the information we get from `info()`**
- total observations: `_`  (Each observation is the data about a block group)
- total columns (features): `_`
- data type of each feature: `_` numerical and `_` object 
- features with null values : `____` 

**TASK : Output the first five instances of the dataset and analyse**

In [5]:
### START CODE HERE (~1 Line of code)

### END CODE

`describe()` shows a summary of **Numerical Features** , which can be visualized using boxplots and histograms. `value_counts()` can be used to generate a summary of **Categorical Attributes**
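
As a minimal sketch of the two summaries mentioned above, the following uses a small toy frame rather than the housing data; the column names and values are hypothetical and only serve to show what each method returns.

```python
import pandas as pd

# Toy frame with one numerical and one categorical column (hypothetical values)
toy = pd.DataFrame({
    "rooms": [3, 5, 4, 6, 2],
    "region": ["coast", "inland", "coast", "coast", "inland"],
})

# describe() summarises numerical columns: count, mean, std, min, quartiles, max
summary = toy.describe()
print(summary)

# value_counts() counts how often each category occurs, most frequent first
counts = toy["region"].value_counts()
print(counts)
```

Note that `describe()` skips the object column by default, which is why a separate `value_counts()` call is useful for categorical attributes.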

**TASK : Describe the dataset**

In [6]:
### START CODE HERE (~ 1 Line of code)

### END CODE

## `2`  Exploratory Data Analysis

### `2.1` Uni-variate Analysis

**TASK : Make a `Boxplot` of `median_house_value`**

In [7]:
### Create BOXPLOT using boxplot() and keep figsize=(6,6)

### START CODE HERE : (~ 1 Line of code)

### END CODE HERE

**TASK : Now make `histograms` of all of the Numerical Attributes**

In [8]:
# Create histograms of all attributes in one line using hist() function and keep figsize=(15,15)

### START CODE HERE : (~1 Line of code)

### END CODE


`.boxplot()` and `.hist()` only handle numerical features, so you need another technique to visualise categorical attributes such as `ocean_proximity`, which has the object dtype.

One idea is to plot a bar graph of each category label against its count. The counts can be obtained with the `value_counts()` method, and the bar plot can then be drawn from the index (labels) and the values (counts) of the result.

**Task : Find value counts of the categorical attribute and store it in a variable `op_count`**

In [9]:
### START CODE HERE (~ 1 Line of code)

### END CODE

**TASK : Plot a `bar graph` between op_count indexes and op_count values**

In [10]:
### Parameters: figsize=(10,5) , alpha= 0.7 , fontsize=12 for x and y labels

### START CODE HERE (FULL CODE)

### END CODE

#### Understand and Analyse The Data
1. Make sense of the data

      - **Why are total rooms and bedrooms in hundreds or thousands?** 
      - **Is population in thousands or millions, or just the number of people living in that block?** 
      - **Why is the median income value so low? Is it already scaled?** 
 

2. **Feature Scaling** : Is it required? 
 
3. **Distribution** : What can you infer from the histograms? Is the data skewed or approximately normal? 

### `2.2`  Correlation Analysis
Explore the data further to look for correlations between attributes. The correlation coefficient lies between -1 and 1; negative and positive values represent negative and positive correlations, and 0 means there is no linear correlation. A relationship is linear if the **ratio of change** between the two attributes is constant; otherwise it is non-linear. 
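
To make the point about linear correlation concrete, here is a small sketch on synthetic data (not the housing data; all column names are made up). It shows that a perfectly linear relationship gives a coefficient of about 1, while a strong but non-linear, symmetric relationship gives a coefficient of about 0.

```python
import numpy as np
import pandas as pd

x = np.linspace(-1.0, 1.0, 101)          # symmetric around zero
toy = pd.DataFrame({
    "x": x,
    "linear": 3.0 * x + 2.0,             # constant ratio of change
    "quadratic": x ** 2,                 # strong but non-linear dependence
    "label": ["a"] * 101,                # non-numeric column, excluded below
})

# Keep numeric columns only, then compute pairwise Pearson correlations
corr = toy.select_dtypes(include="number").corr()
print(corr["x"])
```

Selecting the numeric columns first is a defensive choice: recent pandas versions raise an error when `corr()` meets an object column such as `ocean_proximity`.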

**TASK : Print the `Correlation Matrix` and make it more visually appealing using `sns.heatmap`**

In [11]:
### START CODE HERE (FULL CODE)

### END CODE

### `2.3` Bi-Variate Analysis

**Scatter matrix gives a good idea about histograms of each attribute and their dependencies on each other through scatter plots between each pair**

**TASK : Plot a `scatter matrix`**

In [12]:
from pandas.plotting import scatter_matrix
### Plot scatter matrix only of attributes : 'median_house_value', 'median_income', 'total_rooms', 'housing_median_age'
### Keep figsize = (12,12)

### START CODE HERE 

### END CODE

Identify the promising attributes by checking which ones form the most nearly linear plot against `median_house_value` and by comparing their values in the correlation matrix.

 **Note** that in some scatter plots, such as the one between `median_income` and `median_house_value`, there are many horizontal lines that interrupt the linear trend between the two attributes. 
**Think of a reason for this. Removing them might lead to a better model.**

**TASK : Plot an individual scatter plot between `median_income` and `median_house_value` to understand the problem better**

In [13]:
### Keep figsize=(10,10) and alpha=0.2

### START CODE HERE (FULL CODE)

### END CODE

**TASK (Optional but effective) : Remove the heavily duplicated values of `median_house_value`** 

> **HINT 1** : Set a criterion for removal, such as removing instances whose `median_house_value` is duplicated more than 20 times (or any threshold of your choice). You can use `value_counts()` to see which threshold works better. Values at or below the threshold are kept, so most of the information is retained, while the plot becomes more linear because the heavily repeated values are gone. 

> **HINT 2 (Basic Approach)** : You can find all the values of `median_house_value` that have more than 20 duplicates using `duplicated()` and `value_counts()`, and then remove those instances
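
One possible shape for such a filter is sketched below on a toy frame. The column names, the data and the threshold of 3 are all hypothetical; this is one reading of the hints, not the only valid solution.

```python
import pandas as pd

# Toy frame: the value 500 repeats 4 times, everything else at most twice
toy = pd.DataFrame({"value": [100, 200, 200, 300, 500, 500, 500, 500],
                    "income": [1.0, 2.0, 2.1, 3.0, 4.0, 6.0, 8.0, 9.0]})

threshold = 3  # hypothetical cut-off; the hints suggest about 20 for the real data

# Count occurrences of each value and collect those above the threshold
counts = toy["value"].value_counts()
too_frequent = counts[counts > threshold].index

# Keep only rows whose value is not in the over-represented set
filtered = toy[~toy["value"].isin(too_frequent)].reset_index(drop=True)
print(filtered)
```

Resetting the index after dropping rows keeps labels and positions aligned, which matters later if rows are selected by the positional indices that a splitter returns.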

In [14]:
### You can use as many functions, variables and for loops as you want, but try to find the most efficient code

### START CODE HERE ( FULL CODE )




### END CODE

> **TASK (Optional) : Plot the scatter plot and compare the results with previous plot** 

In [21]:
### START CODE HERE (~2 Lines of code)

### END CODE HERE

**Analyse the Scatter plots and decide whether it has been a useful exercise or not**

### `2.4` Multi-Variate Analysis

To see the shape of the data and check that it covers most of California, plot latitude against longitude. Such a scatter plot also shows how densely the block groups are concentrated in each region of California

**TASK : Plot a scatter plot between `Latitude` and `Longitude` and analyse**

In [36]:
### Use alpha = 0.1 

### START CODE HERE : (~ 1 Line of code)

### END CODE HERE

**Does it look like the map of California?**

<img src="https://ca-at.org/wp-content/uploads/2016/04/County-Map-Colored.jpg">

To extract more information you can also plot a `Multi-Variate Scatter plot` with the marker size given by `population` and the color given by `median_house_value`

**TASK : Extract more information from scatter plot by plotting `latitude` , `longitude` , `median_house_value` and `population` in the same scatter plot**

In [None]:
#  s: radius of each circle represents housing.population/100
#  c: use median_house_value for the color scheme

### START CODE HERE : (WRITE CODE WHERE '#' IS GIVEN)

housing.plot(kind='#' , x='#', y='#', alpha=0.4, 
    s='#', label='population', figsize=(10,7), 
    c='#' , cmap=plt.get_cmap('jet'), colorbar=True)

### END CODE HERE

You can extract more information in the same way by passing other attributes, such as `ocean_proximity`, to the `s` and `c` parameters to gain more insight into the data. 

**This completes your visualisation and one should always analyse the graphs to get a better understanding of how and what type of model to build or data to manipulate to get best results**

## `3` Data Cleaning and Manipulation 

In this section you will clean the data and make changes/manipulations before models are built on it 

### `3.1` Attribute Combinations
The total numbers of rooms and bedrooms in a block might not be very useful on their own; however, combining attributes can produce much more meaningful ones, e.g. : 

- rooms per household
- bedroom/total room ratio
- population per household

These are just a few , you can think of more such attributes

**TASK : Create new attributes by combining previous ones and check out correlation again**

In [None]:
### You can create any new attribute of your choice, but it should make sense
### For now let's stick to the new attribute combinations that are mentioned above

### START CODE HERE : (WRITE CODE WHERE '#' IS GIVEN)

# Calculated attributes : 

housing['rooms_per_household'] = housing['#']/housing['#']
    # Write attribute name which forms the combination

# Write code for bedroom/total room and population per household

# Checkout the correlations again

### END CODE

**TASK : Analyse the new attributes' correlations with the others and judge whether forming these attributes was a good decision**

In [124]:
### START CODE HERE (FULL CODE)

### END CODE

### `3.2` Text and Categorical Attributes
Most ML algorithms and visualisations work better with numbers. Therefore, you need to convert text attributes into numerical attributes. 
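
The difference between the two common encodings can be sketched on a toy column as follows. The category values here are illustrative; this is a sketch of the idea using pandas only, not a prescribed solution.

```python
import pandas as pd

toy = pd.DataFrame({"proximity": ["INLAND", "NEAR BAY", "INLAND", "ISLAND"]})

# Label encoding: one integer per category (alphabetical order here).
# The integers imply an ordering (0 < 1 < 2) that the categories may not have.
labels = toy["proximity"].astype("category").cat.codes
print(labels.tolist())

# One-hot encoding: one indicator column per category, no implied ordering
one_hot = pd.get_dummies(toy["proximity"], dtype=int)
print(one_hot)
```

Whether the implied ordering of label encoding is harmful depends on the model: distance-based and linear models tend to be sensitive to it, while tree-based models are usually less so.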


**TASK : Convert categorical attribute to numerical attribute**

You can use `LabelEncoder()`, `LabelBinarizer()`, `OneHotEncoder()` or `get_dummies()`, whichever seems most convenient to you. First decide whether it is safe to simply label encode the attribute or whether it requires one-hot encoding, then encode it with one of the functions above

In [27]:
### START CODE HERE : (FULL CODE)

### END CODE HERE

### `3.3` Missing Values 

Now missing values in some attributes need to be handled
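
A minimal sketch of mean imputation on a toy column is shown below (the column name and values are hypothetical). It relies on `mean()` ignoring missing values by default.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"bedrooms": [2.0, np.nan, 4.0, 6.0]})

# mean() skips NaN by default, so the mean is (2 + 4 + 6) / 3 = 4
mean_value = toy["bedrooms"].mean()

# Assign the result back instead of relying on in-place modification
toy["bedrooms"] = toy["bedrooms"].fillna(mean_value)
print(toy["bedrooms"].tolist())
```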

**TASK : Use the `info()` method to find out which attributes have missing values**

In [29]:
### START CODE HERE (~1 Line of code)

### END CODE

**TASK : Use `fillna()` method to fill the missing values with `mean` and check `info()` again for confirmation**

In [126]:
### START CODE HERE (~ 2 Lines of code)


# END CODE HERE 

## `4` Data Sampling and Splitting

This section guides you in choosing the best sampling option for the dataset and then implementing it by creating training, validation and test sets

### `4.1` Stratified Sampling

In statistics, stratified sampling is a method of sampling from a population which can be partitioned into subpopulations.
As an example, assume that we need to estimate the average number of votes for each candidate in an election. Assume that a country has 3 towns: Town A has 1 million factory workers, Town B has 2 million office workers and Town C has 3 million retirees. We can choose to get a random sample of size 60 over the entire population, but there is some chance that the resulting random sample is poorly balanced across these towns and hence is biased, causing a significant error in estimation. If we instead take a random sample of 10, 20 and 30 from Town A, B and C respectively, we can produce a smaller error in estimation for the same total sample size. This method is generally used when a population is not a homogeneous group.
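
The town example can be reproduced in miniature. The sketch below assumes a scaled-down population (1,000, 2,000 and 3,000 people instead of millions) and uses `train_test_split` with its `stratify` argument as a convenient way to draw a stratified sample of 60.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Scaled-down population with the same 1 : 2 : 3 proportions as the towns above
towns = np.array(["A"] * 1000 + ["B"] * 2000 + ["C"] * 3000)

# Stratified draw of 60 people: each town is represented proportionally
sample, _ = train_test_split(towns, train_size=60, stratify=towns, random_state=42)

labels, counts = np.unique(sample, return_counts=True)
print(dict(zip(labels, counts)))
```

Dropping the `stratify` argument gives a plain random sample, whose town counts vary from one `random_state` to the next.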

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/f/fa/Stratified_sampling.PNG/220px-Stratified_sampling.PNG">

<img src="https://www.researchgate.net/profile/Stefano_Ferilli/publication/216799623/figure/fig3/AS:650409278468108@1532081064594/Stratified-sampling-methodology.png">

First understand why stratified sampling is crucial here and then analyse which attribute you can base your sampling on.

> A good option would be to stratify the sample on the basis of `income`. Let's try it out 

**TASK : Stratify sample on basis of `median_income`** 

You can see the documentation and examples of stratified sampling [here](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedShuffleSplit.html)

Limit the number of categories by dividing the median income by 1.5 and rounding up, then merge all categories greater than 5 into category 5. This creates a small number of bins to stratify on, after which you can use stratified sampling.
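
The binning arithmetic can be checked on a few made-up income values before applying it to the real column. This sketch uses `clip` for the merge step; other approaches (for example `where`) would work equally well.

```python
import numpy as np
import pandas as pd

incomes = pd.Series([0.5, 1.5, 2.9, 4.6, 7.5, 15.0])  # hypothetical median incomes

# Divide by 1.5 and round up: 0.33 -> 1, 1.0 -> 1, 1.93 -> 2, 3.07 -> 4, 5.0 -> 5, 10.0 -> 10
income_cat = np.ceil(incomes / 1.5)

# Merge every category above 5 into category 5
income_cat = income_cat.clip(upper=5)
print(income_cat.tolist())
```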

> **Task : Use `np.ceil` to create bins and divide `median_income` by 1.5 to create small convenient range**

In [None]:
### START CODE HERE : (WRITE CODE WHERE '#' IS GIVEN)

housing['income_cat'] = np.ceil(housing['#']/1.5) 
# Write the attributes name

# Merge value greater than 5 to 5 in the same variable 'income_cat'
# Print a histogram of 'income_cat' to see the distribution 

### END CODE

> **Task : Create an object `split` of `StratifiedShuffleSplit()` and use it to create Stratified train and test sets**

In [None]:
# stratified sampling based on income categories
from sklearn.model_selection import StratifiedShuffleSplit

### START CODE HERE : (WRITE CODE IN PLACE OF '#')

split = StratifiedShuffleSplit('#')
# Use n_splits as 1 , test size as 0.2 and random state = 42

## Create Stratified Train and Test Sets

## The following function will split into stratified samples and return the train_index and test_index of samples
for train_index, test_index in split.split(housing, housing['#']):
    # Write the attribute name on basis of which sampling is done
    
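    # train_index and test_index are positional; if rows were dropped earlier without
    # resetting the index, use .iloc[] instead of .loc[]
    strat_train_set = # Fill this using .loc[] and train_index on the dataset housing
    strat_test_set = #  Fill this using .loc[] and test_index on the dataset housing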

# Print the first five elements of strat_train_set

### END CODE 


Now you can check that the stratified sample's distribution is much closer to the original distribution than the random sample's

> **TASK : Check distributions of stratified sampling and random sampling and compare it with original distribution**

In [44]:
### Stratified sampling

### START CODE HERE (WRITE CODE WHERE '#' IS GIVEN)

#['income_cat'].value_counts() / len(#)

## Replace # with strat_train_set to know the distribution of stratified sampling

### END CODE

In [43]:
### Original sampling

### START CODE HERE (WRITE CODE WHERE '#' IS GIVEN)

#['income_cat'].value_counts() / len(#)

## Replace # with the original dataframe to know the original distribution 

### END CODE

In [42]:
### Random sampling 

### START CODE HERE (FULL CODE)
from sklearn.model_selection import train_test_split

## Write code to get random split samples from train_test_split() and store it in train_set and test_set

#['income_cat'].value_counts() / len(#)

## Replace # with train_set to know the distribution of random sampling

### END CODE

**TASK : Remove the income_cat variable**

In [63]:
### START CODE HERE (FULL CODE)

### END CODE

**Analyse the results and decide which sampling method to use. The training set from that sampling is your `current training set` from here on. Keep the chosen sampling's `test set` aside: no changes or modelling should be done on it, and it should only be used in the final evaluation section once the best model has been built** 

### `4.2`  Create training and validation sets

**TASK : Split your `current training set` into dependent and independent variables and name them y and X respectively**

In [127]:
### START CODE HERE (~ 2 Lines of code)

### END CODE

**TASK : Use `train_test_split` to split the X and y into X_train , X_val , y_train , y_val**

In [78]:
### Keep the test_size=0.2 and random_state=42

### START CODE HERE (FULL CODE)


### END CODE

**NOTE** : You create a train set and a test set from the sampling you chose. You then keep the test set aside and split the train set into the actual training set and a validation set. All models are trained on this training set and evaluated on the validation set. Once you're convinced that you have the best model, you finally test it on the initial test set that was kept aside. 
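
The two-stage split described in the note can be sketched on toy arrays as follows. For brevity both stages use a plain random split here, whereas the notebook's first stage is the stratified split; the variable names `X_rest` and `y_rest` are made up for this sketch.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(1000).reshape(-1, 1)   # 1000 toy samples with one feature
y = np.arange(1000)

# Stage 1: hold out a test set that is not touched until the final evaluation
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Stage 2: split the remainder into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.2, random_state=42)

print(len(X_train), len(X_val), len(X_test))
```

With two 20% splits the data ends up divided roughly 64% / 16% / 20% between training, validation and test sets.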

## `5` Modelling and Cross-Validating
You are going to try Linear Regression, Decision Tree, Random Forest models and cross validate them to analyse the best model

### `5.1` Linear Regression : 

Simple linear regression is useful for finding the relationship between two continuous variables: one is the predictor (independent variable) and the other is the response (dependent variable). It models a statistical relationship rather than a deterministic one. A relationship between two variables is deterministic if one variable can be expressed exactly in terms of the other; for example, a temperature in degrees Celsius determines the temperature in Fahrenheit exactly. A statistical relationship does not determine one variable exactly from the other; an example is the relationship between height and weight.

<img src="https://miro.medium.com/max/4328/1*KwdVLH5e_P9h8hEzeIPnTg.png">

The core idea is to obtain a line that best fits the data. The best-fit line is the one for which the total prediction error over all data points is as small as possible. The error is the vertical distance between a point and the regression line. 
Any line can be represented as `y = (theta)X + b`, where theta is the slope or weight, b is a constant, y is the predicted value of the target variable and X is the training set. The goal is to minimize the difference between `y_pred` and `y` (the original values of the target variable) by finding the best `weights`. They are determined by the formula : 

<img src="https://patientopinioncorpus.files.wordpress.com/2013/12/normal-equation.jpg">

where X and y are represented in matrix and vector form. This can be implemented manually, but you can use built-in libraries and functions
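
For reference, a manual implementation could look like the sketch below. It assumes the pictured formula is the standard normal equation, theta = (XᵀX)⁻¹Xᵀy, and uses noiseless synthetic data so the recovered weights can be checked by eye.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 10.0, size=50)
y = 3.0 * x + 2.0                      # noiseless line: slope 3, intercept 2

# Add a column of ones so the intercept b is learned as an extra weight
X = np.column_stack([np.ones_like(x), x])

# Normal equation: theta = (X^T X)^(-1) X^T y, solved without forming the inverse
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(theta)                           # approximately [2., 3.] (intercept, slope)
```

Using `np.linalg.solve` instead of an explicit matrix inverse is a numerical-stability choice; the result is the same in exact arithmetic.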

**TASK : Build a linear regression model,fit on the training set and print the rmse (`Root Mean Squared Error`)**

In [129]:
# Linear Regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

### START CODE (FULL CODE)

### END CODE

**TASK : Describe the training set and get its `25%` and `75%`**

Analyse its `25%` and `75%` quantile and judge whether the RMSE is good or not
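
One way to put an RMSE into context is to compare it with the spread between the 25% and 75% quantiles of the target. The sketch below uses made-up target values and predictions purely to show the arithmetic.

```python
import numpy as np
import pandas as pd

y_true = pd.Series([100.0, 200.0, 300.0, 400.0])   # hypothetical target values
y_pred = pd.Series([110.0, 190.0, 310.0, 390.0])   # hypothetical predictions

# RMSE: square root of the mean squared difference (every error is 10 here)
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))

# Spread of the middle half of the targets, as reported by describe()
q25, q75 = y_true.quantile(0.25), y_true.quantile(0.75)
print(rmse, q25, q75, rmse / (q75 - q25))
```

An RMSE that is small relative to the interquartile range suggests the typical error is small compared with the typical variation in the target; an RMSE of similar size to that range suggests the model is not very informative.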

### `5.2` Decision Tree :

A decision tree can be used to visually and explicitly represent decisions and decision making. As the name goes, it uses a tree-like model of decisions. It chooses a path based on the decisions and ultimately ends in a classification or regression figure. It can be understood better with a diagram : 

<img src="https://dimensionless.in/wp-content/uploads/2018/11/Picture1-1.png">

Regression trees are represented in the same manner; they just predict continuous values, such as the price of a house.

**TASK : Use built-in libraries to implement `Decision Trees`, fit the model on the training set and print the `RMSE` value of the training set predictions**

In [83]:
# Try Decision Tree
from sklearn.tree import DecisionTreeRegressor

### START CODE HERE (FULL CODE)

### END CODE

**Does this mean `Decision Trees` gives the best model?**

Analyse this and convince yourself about the reason for such a value. Check whether it's really a great model using `cross validation`
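
The effect can be reproduced on synthetic data. The sketch below (toy data, not the housing set) fits an unrestricted `DecisionTreeRegressor` to a noisy linear signal and compares the error on the data it was trained on with the error on held-out data.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = np.linspace(0.0, 10.0, 400).reshape(-1, 1)          # 400 distinct inputs
y = 2.0 * X[:, 0] + rng.normal(0.0, 1.0, size=400)      # linear signal plus noise

X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.25, random_state=0)

# With no depth limit the tree can give every training point its own leaf
tree = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr)

def rmse(y_true, y_hat):
    return float(np.sqrt(np.mean((y_true - y_hat) ** 2)))

train_rmse = rmse(y_tr, tree.predict(X_tr))
val_rmse = rmse(y_va, tree.predict(X_va))
print(train_rmse, val_rmse)
```

A training error near zero combined with a clearly larger held-out error is the usual signature of a model that has memorised the noise in its training data.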

### Cross Validation : 

Cross-validation is primarily used in applied machine learning to estimate the skill of a machine learning model on unseen data. That is, to use a limited sample in order to estimate how the model is expected to perform in general when used to make predictions on data not used during the training of the model.

The general procedure is as follows:

    1. Shuffle the dataset randomly.
    2. Split the dataset into k groups
    3. For each unique group:
        
        - Take the group as a hold out or test data set
        - Take the remaining groups as a training data set
        - Fit a model on the training set and evaluate it on the test set
        - Retain the evaluation score and discard the model

    4. Summarize the skill of the model using the sample of model evaluation scores

Importantly, each observation in the data sample is assigned to an individual group and stays in that group for the duration of the procedure. This means that each sample is given the opportunity to be used in the hold out set 1 time and used to train the model k-1 times.
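
The procedure above can be sketched by hand with `KFold`, here on synthetic data with a linear model. In practice `cross_val_score` wraps the same loop; with `scoring="neg_mean_squared_error"` it returns negated MSE values, so the sign must be flipped and a square root taken to obtain RMSE.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 10.0, size=(100, 1))
y = 3.0 * X[:, 0] + rng.normal(0.0, 1.0, size=100)

# Steps 1-2: shuffle, then split into k = 5 groups
kfold = KFold(n_splits=5, shuffle=True, random_state=42)

scores = []
held_out_count = np.zeros(len(X), dtype=int)
for train_idx, test_idx in kfold.split(X):
    # Step 3: fit on k-1 groups, evaluate on the held-out group, discard the model
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    errors = y[test_idx] - model.predict(X[test_idx])
    scores.append(float(np.sqrt(np.mean(errors ** 2))))
    held_out_count[test_idx] += 1

# Step 4: summarise the per-fold RMSE values
print(np.mean(scores), np.std(scores))
```

The `held_out_count` array is only there to confirm that every sample lands in the hold-out set exactly once.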

**TASK : Get `Mean RMSE` cross validation scores for `Decision Tree` and `Linear Regression Model`**

In [91]:
# Do a 5-fold cross validation
from sklearn.model_selection import cross_val_score

### START CODE HERE : (FULL CODE)

# for decision tree


# for linear regression


### END CODE

**Is the Decision Tree model still better?** 
Analyse whether the `RMSE` values are good enough for either the Decision Tree or Linear Regression

**To further improve the `RMSE` , you should now try different algorithms like `Random Forest`** 

### `5.3` Random Forest

Since you already have an idea about Decision Trees, it's easier to understand `Random Forest`. A random forest, as its name implies, consists of a large number of individual decision trees that operate as an [ensemble](https://en.wikipedia.org/wiki/Ensemble_learning). For classification, each individual tree outputs a class prediction and the class with the most votes becomes the model’s prediction; for regression, as in this notebook, the predictions of the individual trees are averaged. 

<img src="https://miro.medium.com/max/1000/1*VHDtVaDPNepRglIAv72BFg.jpeg"> 

The fundamental concept is that a large number of relatively uncorrelated models (trees) operating as a committee will outperform any of the individual constituent models. Since this dataset has very low correlation between attributes, random forest can be a good option. 
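
For regression, the committee idea amounts to averaging. The sketch below, on synthetic data, checks that the output of scikit-learn's `RandomForestRegressor` matches the mean of the predictions of its individual trees (available through the `estimators_` attribute).

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 10.0, size=(200, 3))
y = X[:, 0] * 2.0 + X[:, 1] + rng.normal(0.0, 0.5, size=200)

forest = RandomForestRegressor(n_estimators=10, random_state=42).fit(X, y)

# Predictions of each individual tree, shape (n_trees, n_samples)
per_tree = np.array([tree.predict(X) for tree in forest.estimators_])

# The forest's regression output is the mean over its trees
manual_average = per_tree.mean(axis=0)
print(np.allclose(manual_average, forest.predict(X)))
```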

**TASK : Implement `Random Forest` from built in Libraries , fit on training set and print `Mean RMSE Cross Validation` score**

In [93]:
# Try Random Forest, which is an Ensemble Learning model
# Use CrossValidation K-fold = 5
from sklearn.ensemble import RandomForestRegressor

### START CODE HERE (FULL CODE)

### END CODE

Analyse and compare this model with the previous two models

Now your modelling is done and it's time to **hypertune** the best model's parameters to further decrease the `RMSE` 

## `6` HyperTuning

Once you've chosen the best model , look for its documentation , see its parameters and make an appropriate `param_grid` and then do a GridSearch to find the best combinations of parameters.

In this section you'll learn how to implement hypertuning of random forest model's parameters and you can hypertune other models based on that  

#### Random Forest Hypertuning : GridSearch

Grid search is used to find the hyperparameters of a model that result in the most ‘accurate’ predictions. It builds a model for every combination of the specified hyperparameters and evaluates each one. Randomized search, where random combinations of the hyperparameters are tried, is a more efficient technique for hyperparameter tuning. However, for a small dataset like the current one, grid search is also fine. 
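
To get a feel for the cost of an exhaustive search, the number of combinations can be counted before any model is trained. The sketch below uses scikit-learn's `ParameterGrid` on a commonly used starting grid for random forests; the grid itself is an assumption and can be adjusted freely.

```python
from sklearn.model_selection import ParameterGrid

# A commonly used starting grid for random forests (an assumption, adjust freely)
param_grid = [
    {"n_estimators": [3, 10, 30], "max_features": [2, 4, 6, 8]},
    {"bootstrap": [False], "n_estimators": [3, 10], "max_features": [2, 3, 4]},
]

n_combinations = len(ParameterGrid(param_grid))   # 3*4 + 1*2*3 = 18
n_folds = 5
print(n_combinations, n_combinations * n_folds)   # 18 combinations, 90 model fits
```

`GridSearchCV` fits one model per combination and fold, plus one final refit on the whole training set when `refit=True` (the default), so the cost grows quickly as more values are added to the grid.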

**TASK : Form param_grid and do a `GridSearch` to print the best parameters for `Random Forest` model** 

In [96]:
# use GridSearch to find best hyperparameter combinations
from sklearn.model_selection import GridSearchCV

## You can use a standard param_grid such as : 
## [{'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
##  {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]}]
## you can also use your own param_grid 
## cross validation k-fold =5 

###  START CODE HERE : (FULL CODE)



### END CODE

**TASK : Print the best estimator** 

In [98]:
### START CODE HERE (~ 1 Line of code)

### END CODE

**TASK : Print the feature importance** 

In [100]:
### START CODE HERE (~ 1 Line of code)

### END CODE

**TASK : Print the attributes along with their feature importance**

**HINT** : Try using zip() method

Based on the feature importance, you can choose to drop some features, such as the four least important ones, to simplify the model
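
The `zip()` hint can be illustrated on a small synthetic problem in which only one feature carries signal. The feature names below are hypothetical; with the tuned model, the importances would typically come from `grid_search.best_estimator_.feature_importances_` and the names from the columns of the training frame.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
feature_names = ["signal", "noise_a", "noise_b"]       # hypothetical names
X = rng.uniform(0.0, 1.0, size=(300, 3))
y = 10.0 * X[:, 0] + rng.normal(0.0, 0.1, size=300)    # only the first column matters

forest = RandomForestRegressor(n_estimators=20, random_state=42).fit(X, y)

# Pair each importance with its name and sort from most to least important
ranked = sorted(zip(forest.feature_importances_, feature_names), reverse=True)
for importance, name in ranked:
    print(f"{name}: {importance:.3f}")
```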

**TASK : Predict the values of the validation set which we saved as X_val , y_val for all the three models and analyse which is the best** 

In [110]:
RF_HyperTuned_model = grid_search.best_estimator_

### START CODE HERE (FULL CODE)

### END CODE

**Analyse the scores and pick out the best one**

Modelling and Hypertuning is done and the only thing left to do is to try it on the **Test Set**

## `7` Evaluation via the Test Set
This step shows how the model performs on unseen data. As long as the result is not far off from the validation result, you can go ahead and launch the model.

**TASK : Split the earlier calculated `strat_test_set` into dependent and independent variables and name them y_test and X_test respectively**

In [112]:
### START CODE HERE (FULL CODE)

### END CODE

**NOTE**  : Check for missing values and remove them if any

**TASK: Calculate the `RMSE` Values for the best model on the test set** 

In [128]:
### START CODE HERE (FULL CODE)

### END CODE

### Discussion and Conclusion:  

Find the `25 %` and `75 %` quantiles and see whether the RMSE obtained from the best model is adequate. Also try out other models, settle on a conclusive best model and write down its characteristics :

- **Model Algorithm** : `____`
- **Model Parameters** : `____`
- **Root Mean Squared Error** : `___`