### Self-Study Colab Activity 8.4: The “Best” Model.

This module focused on regression and on using Python's scikit-learn library to build regression models.  Below, a dataset of real estate prices in California is given.  In earlier assignments you built and evaluated different models; it is just as important to spend some time interpreting the resulting "best" model.


Your goal is to build a regression model to predict the price of a house in California.  After doing so, you are to *interpret* the model.  There are many strategies for doing so, including some built-in methods from scikit-learn.  One example is `permutation_importance`.  Permutation feature importance measures how much a fitted model's score drops when the values of a single feature are randomly shuffled; the larger the drop, the more the model relies on that feature.

Take a look at the user guide for `permutation_importance` [here](https://scikit-learn.org/stable/modules/permutation_importance.html).  Use the `sklearn.inspection` module's implementation of `permutation_importance` to investigate the importance of different features to your regression models.  Share these results on the discussion board.
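
To make the idea concrete, here is a minimal sketch of what permutation importance computes, using synthetic data rather than the housing data. The names `X_demo`, `y_demo`, and `manual_drops` are made up for illustration, and for brevity the scores are computed on the same data the model was fit on; in practice a held-out set is usually preferable.

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LinearRegression

# Synthetic data (hypothetical): y depends strongly on column 0,
# weakly on column 1, and not at all on column 2.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 3))
y_demo = 5.0 * X_demo[:, 0] + 0.5 * X_demo[:, 1] + rng.normal(scale=0.1, size=500)

model = LinearRegression().fit(X_demo, y_demo)
baseline = model.score(X_demo, y_demo)  # R^2 on the unshuffled data

# Manual version: shuffle one column at a time and record the drop in R^2.
manual_drops = []
for j in range(X_demo.shape[1]):
    X_shuffled = X_demo.copy()
    X_shuffled[:, j] = rng.permutation(X_shuffled[:, j])
    manual_drops.append(baseline - model.score(X_shuffled, y_demo))

# Library version: repeats each shuffle n_repeats times and averages the drops.
result = permutation_importance(model, X_demo, y_demo, n_repeats=10, random_state=0)

print("manual drops:", np.round(manual_drops, 3))
print("library mean:", np.round(result.importances_mean, 3))
```

Both versions should rank column 0 well above the others, since shuffling it destroys most of the signal the model relies on.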

In [40]:
import pandas as pd
from sklearn.inspection import permutation_importance
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder, PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

In [41]:
import numpy as np
import matplotlib.pyplot as plt

In [42]:
cali = pd.read_csv('data/housing.csv')

In [43]:
cali.head()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity
0,-122.23,37.88,41.0,880.0,129.0,322.0,126.0,8.3252,452600.0,NEAR BAY
1,-122.22,37.86,21.0,7099.0,1106.0,2401.0,1138.0,8.3014,358500.0,NEAR BAY
2,-122.24,37.85,52.0,1467.0,190.0,496.0,177.0,7.2574,352100.0,NEAR BAY
3,-122.25,37.85,52.0,1274.0,235.0,558.0,219.0,5.6431,341300.0,NEAR BAY
4,-122.25,37.85,52.0,1627.0,280.0,565.0,259.0,3.8462,342200.0,NEAR BAY


In [44]:
cali.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   longitude           20640 non-null  float64
 1   latitude            20640 non-null  float64
 2   housing_median_age  20640 non-null  float64
 3   total_rooms         20640 non-null  float64
 4   total_bedrooms      20433 non-null  float64
 5   population          20640 non-null  float64
 6   households          20640 non-null  float64
 7   median_income       20640 non-null  float64
 8   median_house_value  20640 non-null  float64
 9   ocean_proximity     20640 non-null  object 
dtypes: float64(9), object(1)
memory usage: 1.6+ MB


In [None]:
# Cleaning and preprocessing the data
categorical_features = ['ocean_proximity']
# The target, median_house_value, is left out of the input features
numerical_features = ['longitude', 'latitude', 'housing_median_age', 'total_rooms', 'total_bedrooms', 'population', 'households', 'median_income']

# Impute missing values and expand the numeric columns; one-hot encode ocean_proximity
prepros = ColumnTransformer(transformers=[
    ('NumFeat', make_pipeline(SimpleImputer(), PolynomialFeatures(degree = 3, include_bias=False)), numerical_features),
    ('Catfeat', OneHotEncoder(), categorical_features)
])
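
As a rough illustration of what a preprocessing step like this does, the sketch below applies an imputer and a one-hot encoder to a three-row toy frame. The frame `toy` and the transformer name `toy_prep` are hypothetical, and the polynomial expansion is left out to keep the output small.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Toy frame (hypothetical) with one missing numeric value.
toy = pd.DataFrame({
    "total_bedrooms": [129.0, np.nan, 190.0],
    "ocean_proximity": ["NEAR BAY", "INLAND", "NEAR BAY"],
})

# Each entry is (name, transformer, columns); the transformer is an estimator,
# and the column list is passed as the third element of the tuple.
toy_prep = ColumnTransformer(transformers=[
    ("num", SimpleImputer(strategy="mean"), ["total_bedrooms"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["ocean_proximity"]),
])

toy_out = toy_prep.fit_transform(toy)
# The result may be sparse depending on its density; convert if so.
toy_out = toy_out.toarray() if hasattr(toy_out, "toarray") else toy_out

print(toy_prep.named_transformers_["cat"].categories_)
print(toy_out)  # the missing value is replaced by the column mean, 159.5
```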


In [48]:
pipe = Pipeline([
    ('Preprocessor', prepros),
    ('Linreg', LinearRegression())
])

In [50]:
# One-hot encode ocean_proximity and fill the missing values (total_bedrooms) with 0
dummies = pd.get_dummies(cali, columns= ['ocean_proximity'], dtype= int).fillna(0)
dummies.sample(5)

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity_<1H OCEAN,ocean_proximity_INLAND,ocean_proximity_ISLAND,ocean_proximity_NEAR BAY,ocean_proximity_NEAR OCEAN
7312,-118.2,33.98,43.0,1091.0,320.0,1418.0,316.0,2.1522,159400.0,1,0,0,0,0
8377,-118.35,33.95,30.0,2661.0,765.0,2324.0,724.0,3.0519,137500.0,1,0,0,0,0
4380,-118.27,34.09,52.0,2327.0,555.0,1048.0,491.0,3.7847,252300.0,1,0,0,0,0
15891,-122.38,37.73,38.0,1388.0,276.0,871.0,265.0,2.1667,193800.0,0,0,0,1,0
8175,-118.11,33.79,36.0,2223.0,370.0,1039.0,370.0,5.7942,257000.0,1,0,0,0,0


In [67]:
# Creating the train and test sets
X = dummies.drop(['median_house_value', 'longitude', 'latitude'], axis=1)

X_train, X_test, y_train, y_test = train_test_split(X, cali['median_house_value'], test_size = 0.4, random_state= 0)


In [68]:
#Train and predict
pipe = Pipeline([
    ('pfeat', PolynomialFeatures(degree = 3, include_bias=False)), 
    ('linreg', LinearRegression())
])

pipe.fit(X_train, y_train)
train_preds = pipe.predict(X_train)
test_preds = pipe.predict(X_test)



In [69]:
#Get the mean squared error for both
train_mses = mean_squared_error(y_train, train_preds)
test_mses = mean_squared_error(y_test, test_preds)
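
Because the target is a price in dollars, the MSE values above are in squared dollars, which is hard to read directly. One common option, sketched here on made-up numbers (`y_true_demo` and `y_pred_demo` are hypothetical), is to take the square root so the error is back in dollars.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up prices and predictions, for illustration only.
y_true_demo = np.array([200000.0, 350000.0, 150000.0])
y_pred_demo = np.array([210000.0, 330000.0, 160000.0])

mse_demo = mean_squared_error(y_true_demo, y_pred_demo)  # squared dollars
rmse_demo = np.sqrt(mse_demo)                            # dollars, about 14142

print(f"MSE:  {mse_demo:,.0f}")
print(f"RMSE: {rmse_demo:,.2f}")
```

Comparing the train and test RMSE in this form can make a gap between the two easier to spot.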


In [70]:
#Getting the permutation importance
per_impor = permutation_importance(pipe, X_test, y_test)


In [71]:
# Collect the mean permutation importances in a DataFrame, sorted from most to least important
pd.DataFrame(data= per_impor.importances_mean, index = X.columns, columns=['Permutation Importance']).sort_values(by = 'Permutation Importance', ascending = False)

Unnamed: 0,Permutation Importance
households,792.089207
ocean_proximity_<1H OCEAN,28.007861
ocean_proximity_INLAND,27.593176
total_rooms,19.019041
total_bedrooms,15.729693
ocean_proximity_NEAR OCEAN,13.27413
population,13.041862
ocean_proximity_NEAR BAY,7.83659
median_income,1.425237
ocean_proximity_ISLAND,0.252041
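
When sharing results like these, it can help to report the spread across repeats and to choose a score in interpretable units. The sketch below does this on synthetic data; the column names `signal`, `weak`, and `noise`, and the choices `n_repeats=30` and `scoring="neg_root_mean_squared_error"`, are assumptions for illustration and not settings used in the cells above.

```python
import numpy as np
import pandas as pd
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data (hypothetical): one strong feature, one weak, one pure noise.
rng = np.random.default_rng(42)
X_syn = pd.DataFrame({
    "signal": rng.normal(size=400),
    "weak": rng.normal(size=400),
    "noise": rng.normal(size=400),
})
y_syn = 3.0 * X_syn["signal"] + 0.3 * X_syn["weak"] + rng.normal(scale=0.5, size=400)

Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_syn, y_syn, test_size=0.4, random_state=0
)
syn_model = LinearRegression().fit(Xs_train, ys_train)

# With this scorer, each importance is the increase in RMSE caused by shuffling
# that feature; random_state makes the shuffles reproducible.
syn_imp = permutation_importance(
    syn_model, Xs_test, ys_test,
    n_repeats=30, random_state=0, scoring="neg_root_mean_squared_error",
)

summary = pd.DataFrame(
    {"mean": syn_imp.importances_mean, "std": syn_imp.importances_std},
    index=X_syn.columns,
).sort_values("mean", ascending=False)
print(summary)
```

A feature whose mean importance is small relative to its standard deviation is hard to distinguish from noise, which is worth keeping in mind when ranking features.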
