# 15.5 Case Study: Multiple Linear Regression with the California Housing Dataset
## 15.5.1 Loading the Dataset
### Loading the Data
**We added `%matplotlib inline` to enable Matplotlib in this notebook.**

In [None]:
%matplotlib inline
from sklearn.datasets import fetch_california_housing

In [None]:
california = fetch_california_housing()

### Displaying the Dataset’s Description

In [None]:
print(california.DESCR)

In [None]:
california.data.shape

In [None]:
california.target.shape

In [None]:
california.feature_names

## 15.5.2 Exploring the Data with Pandas

In [None]:
import pandas as pd

In [None]:
pd.set_option('display.precision', 4)

In [None]:
pd.set_option('display.max_columns', 9)

In [None]:
pd.set_option('display.width', None)

In [None]:
california_df = pd.DataFrame(california.data, 
                              columns=california.feature_names)
 

In [None]:
california_df['MedHouseValue'] = pd.Series(california.target)

In [None]:
california_df.head()

In [None]:
california_df.describe()

![Self Check Exercises check mark image](files/art/check.png)
## 15.5.2 Self Check
**1. _(Discussion)_** Based on the `DataFrame`’s summary statistics, what was the average median household income across all block groups for California in 1990?

**Answer:** $38,707 (`3.8707 * 10000`; recall that the dataset’s median income is expressed in tens of thousands of dollars).
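The unit conversion in this answer can be sketched in a few lines. This is a minimal illustration on a hypothetical three-row `toy_df` (not the real dataset), assuming a `MedInc` column expressed in tens of thousands of dollars as in `california_df`.

```python
import pandas as pd

# Hypothetical stand-in for california_df; the values are made up.
# MedInc is assumed to be in tens of thousands of dollars.
toy_df = pd.DataFrame({'MedInc': [2.5, 3.0, 5.5]})

# The same statistic that describe() reports in its 'mean' row
mean_medinc = toy_df['MedInc'].mean()

# Convert from tens of thousands of dollars to dollars
mean_dollars = mean_medinc * 10_000

print(f'${mean_dollars:,.0f}')
```

With the real `california_df`, the same two steps applied to its `MedInc` column give the figure quoted in the answer.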

## 15.5.3 Visualizing the Features 

In [None]:
sample_df = california_df.sample(frac=0.1, random_state=17)

In [None]:
import matplotlib.pyplot as plt

In [None]:
import seaborn as sns

In [None]:
sns.set(font_scale=2)

In [None]:
sns.set_style('whitegrid')                                    

In [None]:
for feature in california.feature_names:
     plt.figure(figsize=(16, 9))
     sns.scatterplot(data=sample_df, x=feature, 
                     y='MedHouseValue', hue='MedHouseValue', 
                     palette='cool', legend=False)
         

![Self Check Exercises check mark image](files/art/check.png)
## 15.5.3 Self Check
**1. _(Fill-In)_** `DataFrame` method `________` returns a randomly selected subset of the `DataFrame`’s rows.

**Answer:** `sample`.

**2. _(Discussion)_** Why would it be useful in a scatter plot to plot a randomly selected subset of a dataset’s samples?

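**Answer:** When you are getting to know a large dataset, plotting every sample can produce so many overlapping points that it is hard to see how the data is truly distributed. A randomly selected subset keeps the plot readable while still reflecting the overall distribution.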
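As a small illustration of the `sample` method discussed above, the following sketch uses a hypothetical 100-row `toy_df` rather than the housing data. It shows that `frac=0.1` selects 10% of the rows and that a fixed `random_state` makes the selection reproducible.

```python
import pandas as pd

# Hypothetical 100-row DataFrame used only for illustration
toy_df = pd.DataFrame({'x': range(100), 'y': [v * 2 for v in range(100)]})

# Select 10% of the rows at random; random_state seeds the selection
subset = toy_df.sample(frac=0.1, random_state=17)

# Repeating the call with the same seed selects the same rows
same_subset = toy_df.sample(frac=0.1, random_state=17)

print(len(subset))
print(subset.equals(same_subset))
```

By default, `sample` draws without replacement, so no row appears twice in the subset.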

## 15.5.4 Splitting the Data for Training and Testing 

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
     california.data, california.target, random_state=11)

In [None]:
X_train.shape

In [None]:
X_test.shape

## 15.5.5 Training the Model 

In [None]:
from sklearn.linear_model import LinearRegression

In [None]:
linear_regression = LinearRegression()

In [None]:
linear_regression.fit(X=X_train, y=y_train)

In [None]:
for i, name in enumerate(california.feature_names):
     print(f'{name:>10}: {linear_regression.coef_[i]}')

In [None]:
linear_regression.intercept_

![Self Check Exercises check mark image](files/art/check.png)
## 15.5.5 Self Check
**1. _(True/False)_** By default, a `LinearRegression` estimator uses all the features in the dataset to perform a multiple linear regression. 

**Answer:** False. By default, a `LinearRegression` estimator uses all the numeric features in the dataset to perform a multiple linear regression. An error occurs if any of the features are categorical rather than numeric. Categorical features must be preprocessed into numerical ones or must be excluded from the training process.
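One common way to preprocess a categorical feature into numeric ones is one-hot encoding. The sketch below is only an illustration on a hypothetical `toy_df` with a made-up `region` column; the California Housing features are already numeric, so the case study itself needs no such step. It uses pandas’ `get_dummies` function, which is one of several possible approaches.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical toy data; 'region' is a categorical (string) feature
toy_df = pd.DataFrame({
    'rooms': [3, 4, 5, 6, 4, 5],
    'region': ['coast', 'inland', 'coast', 'inland', 'coast', 'inland'],
    'price': [3.0, 2.0, 5.0, 4.0, 4.0, 3.0],
})

# One-hot encode 'region' so that every feature column is numeric
X = pd.get_dummies(toy_df[['rooms', 'region']], columns=['region'],
                   dtype=float)
y = toy_df['price']

# With only numeric columns, the estimator can be trained as usual
model = LinearRegression().fit(X, y)

print(list(X.columns))
print(model.coef_.shape)
```

After encoding, the estimator learns one coefficient per numeric column, including one for each category of `region`.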

## 15.5.6 Testing the Model 

In [None]:
predicted = linear_regression.predict(X_test)

In [None]:
expected = y_test

In [None]:
predicted[:5]

In [None]:
expected[:5]

## 15.5.7 Visualizing the Expected vs. Predicted Prices 

In [None]:
df = pd.DataFrame()

In [None]:
df['Expected'] = pd.Series(expected)

In [None]:
df['Predicted'] = pd.Series(predicted)

In [None]:
figure = plt.figure(figsize=(9, 9))

axes = sns.scatterplot(data=df, x='Expected', y='Predicted', 
     hue='Predicted', palette='cool', legend=False)

start = min(expected.min(), predicted.min())

end = max(expected.max(), predicted.max())

axes.set_xlim(start, end)

axes.set_ylim(start, end)

line = plt.plot([start, end], [start, end], 'k--')

In [None]:
# This placeholder cell was added because we had to combine
# the section's snippets 37-43 for the visualization to work in Jupyter
# and want the subsequent snippet numbers to match the book

## 15.5.8 Regression Model Metrics 
 

In [None]:
from sklearn import metrics

In [None]:
metrics.r2_score(expected, predicted)

In [None]:
metrics.mean_squared_error(expected, predicted)

![Self Check Exercises check mark image](files/art/check.png)
## 15.5.8 Self Check
**1. _(Fill-In)_** An R² score of `________` indicates that an estimator perfectly predicts the dependent variable’s value, given the independent variables’ values.

**Answer:** 1.0.

**2. _(True/False)_** When comparing estimators, the one with the mean squared error value closest to 0 is the estimator that best fits your data. 

**Answer:** True. 
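Both self-check answers can be verified on a tiny example. The sketch below uses hypothetical expected and predicted values (not the model’s actual output) to show that perfect predictions yield an R² score of 1.0 and a mean squared error of 0.0, while imperfect predictions move both metrics away from those ideals.

```python
from sklearn import metrics

# Hypothetical values chosen only for illustration
expected_toy = [1.0, 2.0, 3.0, 4.0]
perfect = [1.0, 2.0, 3.0, 4.0]      # matches expected_toy exactly
imperfect = [1.5, 2.0, 2.5, 4.0]    # two predictions are off by 0.5

r2_perfect = metrics.r2_score(expected_toy, perfect)
mse_perfect = metrics.mean_squared_error(expected_toy, perfect)

r2_imperfect = metrics.r2_score(expected_toy, imperfect)
mse_imperfect = metrics.mean_squared_error(expected_toy, imperfect)

print(f'perfect:   r2={r2_perfect:.3f}, mse={mse_perfect:.3f}')
print(f'imperfect: r2={r2_imperfect:.3f}, mse={mse_imperfect:.3f}')
```

For the imperfect predictions, the squared errors sum to 0.5 over four samples, so the mean squared error is 0.125, and the R² score is 1 − 0.5/5 = 0.9.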

## 15.5.9 Choosing the Best Model

In [None]:
from sklearn.linear_model import ElasticNet, Lasso, Ridge

In [None]:
estimators = {
    'LinearRegression': linear_regression,
    'ElasticNet': ElasticNet(),
    'Lasso': Lasso(),
    'Ridge': Ridge()
}

In [None]:
from sklearn.model_selection import KFold, cross_val_score

In [None]:
for estimator_name, estimator_object in estimators.items():
     kfold = KFold(n_splits=10, random_state=11, shuffle=True)
     scores = cross_val_score(estimator=estimator_object, 
         X=california.data, y=california.target, cv=kfold,
         scoring='r2')
     print(f'{estimator_name:>16}: ' + 
           f'mean of r2 scores={scores.mean():.3f}')

In [None]:
##########################################################################
# (C) Copyright 2019 by Deitel & Associates, Inc. and                    #
# Pearson Education, Inc. All Rights Reserved.                           #
#                                                                        #
# DISCLAIMER: The authors and publisher of this book have used their     #
# best efforts in preparing the book. These efforts include the          #
# development, research, and testing of the theories and programs        #
# to determine their effectiveness. The authors and publisher make       #
# no warranty of any kind, expressed or implied, with regard to these    #
# programs or to the documentation contained in these books. The authors #
# and publisher shall not be liable in any event for incidental or       #
# consequential damages in connection with, or arising out of, the       #
# furnishing, performance, or use of these programs.                     #
##########################################################################
