# **Predict Median House Values in Californian Districts**

*This project uses California census data to build a model of housing prices in the state. This data includes metrics such as the population, median income, and median housing price for each block group in California. The model should learn from this data and be able to predict the median housing price in any district, given all the other metrics.*

*This is the project from Chapter 2 of the book "Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow".*

## **Import Libraries** 

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.model_selection import StratifiedShuffleSplit, GridSearchCV, cross_val_score 
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline 
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor 
from scipy import stats 
import joblib

## **Get The Data**

In [2]:
df = pd.read_csv("../input/handson-ml-book-housing/housing.csv")

## **Data Exploration**

In [3]:
df.head()

In [4]:
df.info()

In [5]:
# Explore the numerical variables
df.describe()

In [6]:
# Explore the categorical variable
df["ocean_proximity"].value_counts()

In [7]:
# Numerical values distributions
df.hist(bins=50,figsize=(20,15));

## **Train Test Split**

This step splits off a test set and keeps it aside until the final evaluation.

In this dataset, median income is a very important attribute for predicting median housing prices, so the test set should be representative of the income categories in the whole dataset. Stratified sampling on an income category is therefore used to create the split.

In [8]:
df["income_categ"] = pd.cut(df["median_income"], bins=[0., 1.5, 3.0, 4.5, 6., np.inf], labels=[1, 2, 3, 4, 5])
df["income_categ"].hist();

In [9]:
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42) 
for train_index, test_index in split.split(df, df["income_categ"]):
    train_set = df.loc[train_index]
    test_set = df.loc[test_index] 

In [10]:
# difference between income category proportions in overall dataset and in test set with stratified sampling
(test_set["income_categ"].value_counts() / len(test_set)) - (df["income_categ"].value_counts() / len(df))
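
To see what stratification buys, the same comparison can be run against a purely random split. The sketch below uses synthetic log-normal incomes as a stand-in for `median_income` (an assumption, not the real housing data) and `train_test_split` with its `stratify` argument rather than `StratifiedShuffleSplit`; the names `toy`, `proportions`, and `comparison` are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for median_income (assumption: not the real housing data)
rng = np.random.default_rng(42)
toy = pd.DataFrame({"median_income": rng.lognormal(mean=1.2, sigma=0.5, size=5000)})
toy["income_categ"] = pd.cut(toy["median_income"],
                             bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
                             labels=[1, 2, 3, 4, 5])

# One purely random split and one split stratified on the income category
_, random_test = train_test_split(toy, test_size=0.2, random_state=42)
_, strat_test = train_test_split(toy, test_size=0.2, random_state=42,
                                 stratify=toy["income_categ"])

def proportions(frame):
    """Share of rows falling in each income category."""
    return frame["income_categ"].value_counts(normalize=True).sort_index()

comparison = pd.DataFrame({
    "overall": proportions(toy),
    "random": proportions(random_test),
    "stratified": proportions(strat_test),
})
comparison["random_abs_err"] = (comparison["random"] - comparison["overall"]).abs()
comparison["strat_abs_err"] = (comparison["stratified"] - comparison["overall"]).abs()
print(comparison)
```

The `strat_abs_err` column should stay very close to zero, while the error of the random split depends on the particular draw.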

In [11]:
train_set.drop("income_categ", axis=1, inplace=True)
test_set.drop("income_categ", axis=1, inplace=True)

## **Discover and Visualize the Data**

In [12]:
temp = train_set.copy()

In [13]:
# The radius of each circle represents the population, and the color represents the price
temp.plot(kind="scatter", x="longitude", y="latitude", alpha=0.5, s=temp["population"]/100,
         label="Population", figsize=(12,7), c="median_house_value",  cmap=plt.get_cmap("jet"),
         colorbar=True)
plt.legend();

## **Looking For Correlations**

In [14]:
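# Correlate numerical columns only; recent pandas versions raise an error
# when corr() meets the text column "ocean_proximity"
corr_matrix = temp.select_dtypes(include="number").corr()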
corr_matrix["median_house_value"].sort_values(ascending=False) 

In [15]:
attributes = ["median_house_value", "median_income", "total_rooms", "housing_median_age"]
pd.plotting.scatter_matrix(temp[attributes], figsize=(12, 8));

In [16]:
temp.plot(kind="scatter", x="median_income", y="median_house_value", alpha=0.3);

## **Data Preparation**

In [17]:
housing = train_set.drop("median_house_value", axis=1) 
housing_labels = train_set["median_house_value"].copy()
housing_num = housing.drop("ocean_proximity", axis=1)
housing_cat = housing[["ocean_proximity"]] 

In [18]:
num_pipeline = Pipeline([        
    ('imputer', SimpleImputer(strategy="median")),              
    ('std_scaler', StandardScaler())
])

num_attribs = list(housing_num)
cat_attribs = ["ocean_proximity"]

full_pipeline = ColumnTransformer([        
    ("num", num_pipeline, num_attribs),        
    ("cat", OneHotEncoder(), cat_attribs)  
])

housing_prepared = full_pipeline.fit_transform(housing) 
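
To make the effect of the preparation pipeline easier to inspect, here is a minimal sketch of the same structure applied to a four-row toy table. The values and the names `toy`, `toy_num_pipeline`, and `toy_pipeline` are made up for illustration; only the column names are borrowed from the housing data.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data with one missing value (illustrative values only)
toy = pd.DataFrame({
    "total_bedrooms": [100.0, np.nan, 300.0, 200.0],
    "median_income": [2.0, 4.0, 6.0, 8.0],
    "ocean_proximity": ["INLAND", "NEAR BAY", "INLAND", "NEAR OCEAN"],
})

toy_num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),  # fill gaps with the column median
    ("std_scaler", StandardScaler()),               # zero mean, unit variance
])
toy_pipeline = ColumnTransformer([
    ("num", toy_num_pipeline, ["total_bedrooms", "median_income"]),
    ("cat", OneHotEncoder(), ["ocean_proximity"]),
])

toy_prepared = toy_pipeline.fit_transform(toy)

# Medians learned by the imputer and categories learned by the encoder
medians = toy_pipeline.named_transformers_["num"].named_steps["imputer"].statistics_
categories = toy_pipeline.named_transformers_["cat"].categories_[0]

print(toy_prepared.shape)  # 2 scaled numeric columns + 3 one-hot columns -> (4, 5)
print(medians)
print(categories)
```

The medians and categories are learned during `fit_transform` and reused by `transform`, which is why the test set is later passed through `full_pipeline.transform` rather than being fitted again.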

## **Select and Train a Model**

In [19]:
linReg = LinearRegression()
linReg.fit(housing_prepared, housing_labels)
housing_predictions = linReg.predict(housing_prepared) 
linMSE = mean_squared_error(housing_labels, housing_predictions) 
linRMSE = np.sqrt(linMSE) 
print(f"Linear Regression Root Mean Square Error: {linRMSE}")
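
For reference, the quantity computed by `np.sqrt(mean_squared_error(...))` is the root mean square error. With $m$ districts, labels $y_i$, and predictions $\hat{y}_i$ (symbols introduced here for illustration):

$$
\mathrm{RMSE} = \sqrt{\frac{1}{m}\sum_{i=1}^{m}\left(\hat{y}_i - y_i\right)^2}
$$

Because of the square root, the RMSE is expressed in the same unit as the target, here the median house value.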

**The Linear Regression model is underfitting the data, so a more powerful model should be tried.**

In [20]:
treeReg = DecisionTreeRegressor() 
treeReg.fit(housing_prepared, housing_labels) 
housing_predictions = treeReg.predict(housing_prepared)
treeMSE = mean_squared_error(housing_labels, housing_predictions)
treeRMSE = np.sqrt(treeMSE)
print(f"Decision Tree Root Mean Square Error: {treeRMSE}")

**The Decision Tree model is probably overfitting the data, so the next step evaluates it with cross-validation.**

In [21]:
scores = cross_val_score(treeReg, housing_prepared, housing_labels, 
                         scoring="neg_mean_squared_error", cv=10)
treeRMSE_scores = np.sqrt(-scores)
print("Scores:", treeRMSE_scores) 
print("Mean:", treeRMSE_scores.mean())
print("Standard deviation:", treeRMSE_scores.std())
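
The scores are negated before the square root because scikit-learn scorers follow a "greater is better" convention, so `"neg_mean_squared_error"` returns the MSE with a minus sign. Since the same three print statements recur for each model, they could be wrapped in a small helper. The function name `display_scores` and the sample values below are illustrative assumptions, not results from this notebook.

```python
import numpy as np

def display_scores(neg_mse_scores):
    """Convert negated MSE scores to RMSE, print a summary, and return them."""
    rmse_scores = np.sqrt(-np.asarray(neg_mse_scores, dtype=float))
    print("Scores:", rmse_scores)
    print("Mean:", rmse_scores.mean())
    print("Standard deviation:", rmse_scores.std())
    return rmse_scores

# Made-up scores, in the form cross_val_score would return them
example_scores = display_scores([-4.0, -9.0, -16.0])
```

With real results, the call would look like `display_scores(scores)` right after `cross_val_score`.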

**Cross-validation confirms that the Decision Tree model was overfitting the data: its cross-validated RMSE is worse than the RMSE of the Linear Regression model.**

In [22]:
forestReg = RandomForestRegressor() 
forestReg.fit(housing_prepared, housing_labels) 
housing_predictions = forestReg.predict(housing_prepared)
forestMSE = mean_squared_error(housing_labels, housing_predictions)
forestRMSE = np.sqrt(forestMSE)
print(f"Random Forest Root Mean Square Error on training set: {forestRMSE}")
scores = cross_val_score(forestReg, housing_prepared, housing_labels, 
                         scoring="neg_mean_squared_error", cv=10)
forestRMSE_scores = np.sqrt(-scores)
print("Scores:", forestRMSE_scores) 
print("Mean:", forestRMSE_scores.mean())
print("Standard deviation:", forestRMSE_scores.std())

**Random Forest performs better than the previous models, but its error on the training set is still lower than its cross-validation error, so some overfitting remains.**

## **Grid Search**

In [23]:
param_grid = [    
    {'n_estimators': [30, 40], 'max_features': [8, 12]},    
    {'bootstrap': [False], 'n_estimators': [25, 35], 'max_features': [6, 10]}
]
forestRegGrid = RandomForestRegressor() 
grid_search = GridSearchCV(forestRegGrid, param_grid, cv=4,
                           scoring='neg_mean_squared_error', 
                           return_train_score=True)
grid_search.fit(housing_prepared, housing_labels)

In [24]:
GSres = grid_search.cv_results_ 
for mean_score, params in zip(GSres["mean_test_score"], GSres["params"]): 
    print(np.sqrt(-mean_score), params) 
print("\n")
print(f"Best Parameters are: {grid_search.best_params_}")

**None of the parameter combinations tried in the grid search outperformed the random forest with its default parameter values, so the default model is kept.**
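
When a grid search does find a better combination, the winning model can be taken directly from the search object. The sketch below illustrates this on synthetic regression data from `make_regression`; the data, the small grid, and the names `search`, `best_rmse`, and `best_model` are assumptions for illustration and are unrelated to the housing results.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic data, used only to keep the example self-contained
X, y = make_regression(n_samples=200, n_features=6, noise=10.0, random_state=42)

search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    {"n_estimators": [10, 20], "max_features": [2, 4]},
    cv=3,
    scoring="neg_mean_squared_error",
)
search.fit(X, y)

# best_score_ is a negated MSE, so flip the sign before taking the root
best_rmse = np.sqrt(-search.best_score_)

# With the default refit=True, best_estimator_ is retrained on all of X, y
best_model = search.best_estimator_

print(search.best_params_, best_rmse)
```

In this notebook, the equivalent step would be `final_model = grid_search.best_estimator_` if a grid combination beat the default forest.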

## **Evaluate System on Test Set**

In [25]:
final_model = forestReg
X_test = test_set.drop("median_house_value", axis=1) 
y_test = test_set["median_house_value"].copy()
X_test_prepared = full_pipeline.transform(X_test)
final_predictions = final_model.predict(X_test_prepared)
final_mse = mean_squared_error(y_test, final_predictions)
final_rmse = np.sqrt(final_mse)
print(f"Final Root Mean Square Error on Test set: {final_rmse}")

**Next, compute a 95% confidence interval for the RMSE.**

In [26]:
confidence = 0.95 
squared_errors = (final_predictions - y_test) ** 2
np.sqrt(stats.t.interval(confidence, len(squared_errors) - 1, 
                         loc=squared_errors.mean(),
                         scale=stats.sem(squared_errors))) 
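
Written out, the cell above builds a Student's t interval around the mean squared error and then takes the square root of both endpoints. Let $e_i = (\hat{y}_i - y_i)^2$ be the squared error of test district $i$, $\bar{e}$ their mean, $s$ their sample standard deviation, and $n$ the number of test districts (symbols introduced here for illustration; `stats.sem` returns $s/\sqrt{n}$):

$$
\bar{e} \pm t_{0.975,\,n-1}\,\frac{s}{\sqrt{n}}
\quad\Longrightarrow\quad
\left[\sqrt{\bar{e} - t_{0.975,\,n-1}\,\frac{s}{\sqrt{n}}},\ \sqrt{\bar{e} + t_{0.975,\,n-1}\,\frac{s}{\sqrt{n}}}\right]
$$

Here $t_{0.975,\,n-1}$ is the 97.5th percentile of the t distribution with $n-1$ degrees of freedom, which leaves 2.5% in each tail for a 95% interval.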

## **A full pipeline with both preparation and prediction**

In [27]:
full_pipeline_with_predictor = Pipeline([
        ("preparation", full_pipeline),
        ("randomForest", RandomForestRegressor())
    ])

full_pipeline_with_predictor.fit(housing, housing_labels)

## **Save the Model**

In [28]:
my_model = full_pipeline_with_predictor
joblib.dump(my_model, "my_model.pkl")
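
A saved pipeline can be restored later with `joblib.load` and used for prediction without refitting. The following is a minimal round-trip sketch on a toy model; the data, the file name `toy_model.pkl`, and the variable names are assumptions for illustration, not part of this project.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Tiny toy problem, just to have a fitted pipeline to save
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])
toy_model = Pipeline([
    ("scale", StandardScaler()),
    ("reg", LinearRegression()),
]).fit(X, y)

# Save to a temporary directory, then load the model back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_model.pkl")
    joblib.dump(toy_model, path)
    loaded_model = joblib.load(path)

# The restored pipeline should reproduce the original predictions
same = np.allclose(toy_model.predict(X), loaded_model.predict(X))
print(same)
```

For this project, the analogous call would be `joblib.load("my_model.pkl")`, after which `predict` can be given a raw DataFrame with the same columns as `housing`, since the preparation step is part of the saved pipeline. Loading is generally only reliable with the same scikit-learn version that was used to save the model.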