# Regression Coefficients - Revisited

## Lesson Objectives

By the end of this lesson, students will be able to:
- Use scikit-learn v1.1's simplified toolkit.
- Extract and visualize coefficients from a scikit-learn regression model.
- Control pandas' display options to facilitate interpretation.


## Introduction

- At the end of the last stack, we dove deep into linear regression models and their assumptions. We introduced a new package called statsmodels, which produced a linear regression model using Ordinary Least Squares (OLS).
- The model included a robust statistical summary that was incredibly informative as we critically diagnosed our regression model and checked whether we met the assumptions of linear regression.
- This stack, we will focus on extracting insights from our models: both by examining parameters/aspects of the model itself, like the coefficients it calculated, and by applying some additional tools and packages specifically designed to explain models.

- Most of these tools are compatible with the scikit-learn ecosystem but are not yet available for statsmodels.

Since we are not focusing on regression diagnostics this week, we will shift back to using scikit-learn models. Scikit-learn recently released version 1.1.1, which added several helpful tools that will simplify our workflow. 

Let's review some of these key updates as we rebuild our housing regression model from week 16.


# Confirming Package Versions

- All packages have a version number that indicates which iteration of the package is currently being used.
    - If you import an entire package, you can check the special attribute `package.__version__` (replace `package` with the name of the package you want to check).
- This is important because, as of the writing of this stack, Google Colab is still using a version of Python that is too old to support the newest scikit-learn.
    - You can check which version of Python you are using by running the following command in a Jupyter notebook:
        - `!python --version`
        - Note: if you remove the `!`, you can run this command in your terminal.

- If you run the following code on Google Colab and on your local computer, you can compare the version numbers. 
        
<img src="colab_versions.png" width=400px>

- Now, run the following block of code in a jupyter notebook on your local machine to confirm that you have Python 3.8.13 and sklearn v1.1.1.


In [1]:
# Run this cell on your local computer to confirm your package versions
import sklearn
print(f"sklearn version: {sklearn.__version__}")
!python --version

sklearn version: 1.1.1
Python 3.8.13
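
The same check can also be automated. The following is a minimal sketch rather than part of the original workflow: `version_tuple` is a hypothetical helper, and the minimum versions (`MIN_PYTHON`, `MIN_SKLEARN`) are assumptions based on the versions this lesson expects.

```python
import re
import sys
from importlib import metadata

# Assumed minimums, based on the versions this lesson expects
MIN_PYTHON = (3, 8)
MIN_SKLEARN = (1, 1)


def version_tuple(version):
    """Hypothetical helper: turn '1.1.1' into (1, 1, 1) so versions compare correctly."""
    parts = []
    for part in version.split("."):
        match = re.match(r"\d+", part)
        if match is None:  # stop at non-numeric pieces such as 'dev0'
            break
        parts.append(int(match.group()))
    return tuple(parts)


print("Python new enough:", sys.version_info[:2] >= MIN_PYTHON)

try:
    installed = metadata.version("scikit-learn")
    print("scikit-learn new enough:", version_tuple(installed) >= MIN_SKLEARN)
except metadata.PackageNotFoundError:
    print("scikit-learn is not installed in this environment")
```

Comparing tuples of integers avoids the trap of comparing version strings directly, where `"1.10"` would sort before `"1.9"`.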



>- If you have Python 3.7 or earlier, or an older version of scikit-learn, please revisit the "`<Insert the name of the "week" of content on the LP for installation>`". 
    - See the "`Updating Your Dojo-Env Lesson`" for how to remove your current dojo-env and replace it with the new one.

# Extracting Coefficients from LinearRegression in scikit-learn

## Highlighted Changes  - scikit-learn v1.1

- The single biggest change in the updated sklearn is a fully-functional `.get_feature_names_out()` method in the `ColumnTransformer`.
    - This will make it MUCH easier for us to extract our transformed data as dataframes and to match up the feature names to our models' coefficients.
- There are some additional updates that are not pertinent to this stack, but if you are curious, you can find the [details on the new release here](https://scikit-learn.org/stable/auto_examples/release_highlights/plot_release_highlights_1_1_0.html).

## New and Improved `ColumnTransformer` 

In [2]:
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

## Customization Options
plt.style.use(['fivethirtyeight','seaborn-talk'])
mpl.rcParams['figure.facecolor']='white'

## additional required imports
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.compose import make_column_transformer, make_column_selector, ColumnTransformer
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn import metrics

SEED = 321
np.random.seed(SEED)

In [3]:
## Load in the King County housing dataset and display the head and info
df = pd.read_csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vSEZQEzxja7Hmj5tr5nc52QqBvFQdCAGb52e1FRK1PDT2_TQrS6rY_TR9tjZjKaMbCy1m5217sVmI5q/pub?output=csv")

## Dropping some features for time
df = df.drop(columns=['date'])
display(df.head(),df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21613 entries, 0 to 21612
Data columns (total 20 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   id             21613 non-null  int64  
 1   price          21613 non-null  float64
 2   bedrooms       21613 non-null  int64  
 3   bathrooms      21613 non-null  float64
 4   sqft_living    21613 non-null  int64  
 5   sqft_lot       21613 non-null  int64  
 6   floors         21613 non-null  float64
 7   waterfront     21613 non-null  int64  
 8   view           21613 non-null  int64  
 9   condition      21613 non-null  int64  
 10  grade          21613 non-null  int64  
 11  sqft_above     21613 non-null  int64  
 12  sqft_basement  21613 non-null  int64  
 13  yr_built       21613 non-null  int64  
 14  yr_renovated   21613 non-null  int64  
 15  zipcode        21613 non-null  int64  
 16  lat            21613 non-null  float64
 17  long           21613 non-null  float64
 18  sqft_l

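|  | id | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | condition | grade | sqft_above | sqft_basement | yr_built | yr_renovated | zipcode | lat | long | sqft_living15 | sqft_lot15 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 7129300520 | 221900.0 | 3 | 1.0 | 1180 | 5650 | 1.0 | 0 | 0 | 3 | 7 | 1180 | 0 | 1955 | 0 | 98178 | 47.5112 | -122.257 | 1340 | 5650 |
| 1 | 6414100192 | 538000.0 | 3 | 2.25 | 2570 | 7242 | 2.0 | 0 | 0 | 3 | 7 | 2170 | 400 | 1951 | 1991 | 98125 | 47.721 | -122.319 | 1690 | 7639 |
| 2 | 5631500400 | 180000.0 | 2 | 1.0 | 770 | 10000 | 1.0 | 0 | 0 | 3 | 6 | 770 | 0 | 1933 | 0 | 98028 | 47.7379 | -122.233 | 2720 | 8062 |
| 3 | 2487200875 | 604000.0 | 4 | 3.0 | 1960 | 5000 | 1.0 | 0 | 0 | 5 | 7 | 1050 | 910 | 1965 | 0 | 98136 | 47.5208 | -122.393 | 1360 | 5000 |
| 4 | 1954400510 | 510000.0 | 3 | 2.0 | 1680 | 8080 | 1.0 | 0 | 0 | 3 | 8 | 1680 | 0 | 1987 | 0 | 98074 | 47.6168 | -122.045 | 1800 | 7503 |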


None

### Dropping Irrelevant Features 

- If we wanted to make recommendations to homeowners about changes they can make to their home to increase its sale price, we would want to think about what features make the most sense to include.

- The `id` column is a unique identifier, so it should not be included in the model as a feature. We could drop it, but better yet, we can make the `id` column the index of our dataframe. This allows us to keep track of homes across our df and our X_train/X_test data.
- We want to include some representation of location as well. We all know that when it comes to real estate, it's "Location. Location. Location."
    - Latitude and longitude would be too simplified a representation of location, as they would miss the nuance of some neighborhoods being more expensive than others (as opposed to simply East/West or North/South).
    - Zipcode may be best, but we need to treat it as a categorical variable, so we will convert it to a string.
    

In [4]:
## Make the house ids the index
df = df.set_index('id')

In [5]:
## drop lat/long
df = df.drop(columns=['lat','long'])
## Treating zipcode as a category
df['zipcode'] = df['zipcode'].astype(str)

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 21613 entries, 7129300520 to 1523300157
Data columns (total 17 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   price          21613 non-null  float64
 1   bedrooms       21613 non-null  int64  
 2   bathrooms      21613 non-null  float64
 3   sqft_living    21613 non-null  int64  
 4   sqft_lot       21613 non-null  int64  
 5   floors         21613 non-null  float64
 6   waterfront     21613 non-null  int64  
 7   view           21613 non-null  int64  
 8   condition      21613 non-null  int64  
 9   grade          21613 non-null  int64  
 10  sqft_above     21613 non-null  int64  
 11  sqft_basement  21613 non-null  int64  
 12  yr_built       21613 non-null  int64  
 13  yr_renovated   21613 non-null  int64  
 14  zipcode        21613 non-null  object 
 15  sqft_living15  21613 non-null  int64  
 16  sqft_lot15     21613 non-null  int64  
dtypes: float64(3), int64(13), object(1)


### Train Test Split

In [7]:
## Make x and y variables
y = df['price'].copy()
X = df.drop(columns=['price']).copy()

## train-test-split with random state for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X,y, random_state=SEED)
X_train.head()

| id | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | condition | grade | sqft_above | sqft_basement | yr_built | yr_renovated | zipcode | sqft_living15 | sqft_lot15 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1795900120 | 3 | 2.5 | 2250 | 9235 | 2.0 | 0 | 0 | 3 | 8 | 2250 | 0 | 1985 | 0 | 98052 | 2290 | 8187 |
| 6788201240 | 4 | 2.75 | 1590 | 6000 | 1.5 | 0 | 0 | 4 | 8 | 1590 | 0 | 1925 | 0 | 98112 | 1590 | 4000 |
| 2461900550 | 4 | 1.75 | 2040 | 6000 | 1.0 | 0 | 0 | 5 | 7 | 1020 | 1020 | 1943 | 0 | 98136 | 1440 | 6000 |
| 1775920210 | 3 | 1.0 | 1200 | 9800 | 1.0 | 0 | 0 | 4 | 7 | 1200 | 0 | 1971 | 0 | 98072 | 1220 | 10220 |
| 2310010050 | 3 | 2.25 | 1570 | 8767 | 1.0 | 0 | 0 | 3 | 7 | 1180 | 390 | 1990 | 0 | 98038 | 1570 | 7434 |
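
To illustrate why keeping `id` as the index is handy, here is a small sketch on made-up data. The `toy` dataframe and its values are assumptions for illustration only; they are not rows from the housing dataset.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up homes; the index plays the role of the house `id`
toy = pd.DataFrame(
    {"price": [200_000, 310_000, 420_000, 260_000],
     "bedrooms": [2, 3, 4, 3]},
    index=pd.Index([101, 102, 103, 104], name="id"),
)

y_toy = toy["price"]
X_toy = toy.drop(columns=["price"])
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=321)

# The split keeps the original index, so X and y rows still line up by id
print(X_te.index.equals(y_te.index))

# Any test row can be looked up again in the original dataframe
first_id = X_te.index[0]
print(toy.loc[first_id])
```

Because the index travels with each row, a surprising prediction can later be traced back to the original home.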


### Preprocessing + ColumnTransformer

In [8]:
## make a categorical selector and verify that it works
cat_sel = make_column_selector(dtype_include='object')
cat_sel(X_train)

['zipcode']

In [9]:
## make a numeric selector and verify that it works
num_sel = make_column_selector(dtype_include='number')
num_sel(X_train)

['bedrooms',
 'bathrooms',
 'sqft_living',
 'sqft_lot',
 'floors',
 'waterfront',
 'view',
 'condition',
 'grade',
 'sqft_above',
 'sqft_basement',
 'yr_built',
 'yr_renovated',
 'sqft_living15',
 'sqft_lot15']

In [10]:
## make pipelines for categorical vs numeric data
cat_pipe = make_pipeline(SimpleImputer(strategy='constant',
                                       fill_value='MISSING'),
                         OneHotEncoder(handle_unknown='ignore', sparse=False))

num_pipe = make_pipeline(SimpleImputer(strategy='mean'))

> Nothing we have done yet should be new code. The changes we will make will be when we create our ColumnTransformer with `make_column_transformer`.
- From now on, you should add `verbose_feature_names_out=False` to `make_column_transformer`

In [11]:
## make the preprocessing column transformer
preprocessor = make_column_transformer((num_pipe, num_sel),
                                       (cat_pipe,cat_sel),
                                      verbose_feature_names_out=False)
preprocessor

>- In order to extract the feature names from the preprocessor, we first have to fit it on the data.
- Next, we can use the `preprocessor.get_feature_names_out()` method and save the output as something like "feature_names" or "final_features".

In [12]:
## fit column transformer and run get_feature_names_out
preprocessor.fit(X_train)
feature_names = preprocessor.get_feature_names_out()
feature_names

array(['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors',
       'waterfront', 'view', 'condition', 'grade', 'sqft_above',
       'sqft_basement', 'yr_built', 'yr_renovated', 'sqft_living15',
       'sqft_lot15', 'zipcode_98001', 'zipcode_98002', 'zipcode_98003',
       'zipcode_98004', 'zipcode_98005', 'zipcode_98006', 'zipcode_98007',
       'zipcode_98008', 'zipcode_98010', 'zipcode_98011', 'zipcode_98014',
       'zipcode_98019', 'zipcode_98022', 'zipcode_98023', 'zipcode_98024',
       'zipcode_98027', 'zipcode_98028', 'zipcode_98029', 'zipcode_98030',
       'zipcode_98031', 'zipcode_98032', 'zipcode_98033', 'zipcode_98034',
       'zipcode_98038', 'zipcode_98039', 'zipcode_98040', 'zipcode_98042',
       'zipcode_98045', 'zipcode_98052', 'zipcode_98053', 'zipcode_98055',
       'zipcode_98056', 'zipcode_98058', 'zipcode_98059', 'zipcode_98065',
       'zipcode_98070', 'zipcode_98072', 'zipcode_98074', 'zipcode_98075',
       'zipcode_98077', 'zipcode_98092', 'zipcode_

- Notice how we were able to get the complete list of feature names, including the One Hot Encoded features with their proper "zipcode" prefix. 
- Quick note: if you forgot to add `verbose_feature_names_out=False` when you made your preprocessor, you would get something like this:


In [13]:
## make the preprocessing column transformer
preprocessor_oops = make_column_transformer((num_pipe, num_sel),
                                       (cat_pipe,cat_sel)
                                           ) # forgot verbose_feature_names_out=False
## fit column transformer and run get_feature_names_out
preprocessor_oops.fit(X_train)
feature_names_oops = preprocessor_oops.get_feature_names_out()
feature_names_oops

array(['pipeline-1__bedrooms', 'pipeline-1__bathrooms',
       'pipeline-1__sqft_living', 'pipeline-1__sqft_lot',
       'pipeline-1__floors', 'pipeline-1__waterfront', 'pipeline-1__view',
       'pipeline-1__condition', 'pipeline-1__grade',
       'pipeline-1__sqft_above', 'pipeline-1__sqft_basement',
       'pipeline-1__yr_built', 'pipeline-1__yr_renovated',
       'pipeline-1__sqft_living15', 'pipeline-1__sqft_lot15',
       'pipeline-2__zipcode_98001', 'pipeline-2__zipcode_98002',
       'pipeline-2__zipcode_98003', 'pipeline-2__zipcode_98004',
       'pipeline-2__zipcode_98005', 'pipeline-2__zipcode_98006',
       'pipeline-2__zipcode_98007', 'pipeline-2__zipcode_98008',
       'pipeline-2__zipcode_98010', 'pipeline-2__zipcode_98011',
       'pipeline-2__zipcode_98014', 'pipeline-2__zipcode_98019',
       'pipeline-2__zipcode_98022', 'pipeline-2__zipcode_98023',
       'pipeline-2__zipcode_98024', 'pipeline-2__zipcode_98027',
       'pipeline-2__zipcode_98028', 'pipeline-2__zipcod

### Remaking Our X_train and X_test as DataFrames

- Now that we have our list of feature names, we can very easily transform our X_train and X_test into preprocessed dataframes. 
- We can immediately turn the output of our preprocessor into a dataframe and do not need to save it as a separate variable first.
    - Therefore, in our `pd.DataFrame`, we will provide `preprocessor.transform(X_train)` as the first argument, followed by `columns=feature_names` (the list we extracted from our preprocessor).
    - Pro Tip: you can also use the same index as your X_train or X_test variable, if you want to match up one of the transformed rows with the original dataframe.

In [14]:
X_train_df = pd.DataFrame(preprocessor.transform(X_train), 
                          columns = feature_names, index = X_train.index)
X_train_df.head(3)

| id | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | condition | grade | sqft_above | ... | zipcode_98146 | zipcode_98148 | zipcode_98155 | zipcode_98166 | zipcode_98168 | zipcode_98177 | zipcode_98178 | zipcode_98188 | zipcode_98198 | zipcode_98199 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1795900120 | 3.0 | 2.5 | 2250.0 | 9235.0 | 2.0 | 0.0 | 0.0 | 3.0 | 8.0 | 2250.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 6788201240 | 4.0 | 2.75 | 1590.0 | 6000.0 | 1.5 | 0.0 | 0.0 | 4.0 | 8.0 | 1590.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2461900550 | 4.0 | 1.75 | 2040.0 | 6000.0 | 1.0 | 0.0 | 0.0 | 5.0 | 7.0 | 1020.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


In [15]:
X_test_df = pd.DataFrame(preprocessor.transform(X_test), 
                          columns = feature_names, index = X_test.index)
X_test_df.head(3)

| id | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | condition | grade | sqft_above | ... | zipcode_98146 | zipcode_98148 | zipcode_98155 | zipcode_98166 | zipcode_98168 | zipcode_98177 | zipcode_98178 | zipcode_98188 | zipcode_98198 | zipcode_98199 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 3835500005 | 2.0 | 1.75 | 2050.0 | 11900.0 | 1.0 | 0.0 | 0.0 | 4.0 | 8.0 | 2050.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2202500110 | 3.0 | 1.5 | 1690.0 | 9708.0 | 1.5 | 0.0 | 0.0 | 5.0 | 7.0 | 1690.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3761700053 | 3.0 | 2.75 | 3470.0 | 9610.0 | 3.0 | 1.0 | 4.0 | 3.0 | 11.0 | 3470.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


In [16]:
## confirm the first 3 rows index in y_test matches X_test_df
y_test.head(3)

id
3835500005    1100000.0
2202500110     430000.0
3761700053    2150000.0
Name: price, dtype: float64

- Notice that we cannot see all of our features after OneHotEncoding. Pandas truncates the display in the middle and displays `...` instead. 
- We can get around this by changing the settings in Pandas using `pd.set_option`
    - In this case, we want to change `max_columns` to a number larger than our number of final features. Since we have 85 features, setting `max_columns` to 100 is sufficient.
- For more information on pandas options, see their [documentation on Options and Settings](https://pandas.pydata.org/docs/user_guide/options.html)
- Final note: in your project notebooks, you should add this setting to the top of your notebook, right after your imports.

In [17]:
## Using pd.set_option to display more columns
pd.set_option('display.max_columns',100)
X_train_df.head(3)

id,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,sqft_above,sqft_basement,yr_built,yr_renovated,sqft_living15,sqft_lot15,zipcode_98001,zipcode_98002,zipcode_98003,zipcode_98004,zipcode_98005,zipcode_98006,zipcode_98007,zipcode_98008,zipcode_98010,zipcode_98011,zipcode_98014,zipcode_98019,zipcode_98022,zipcode_98023,zipcode_98024,zipcode_98027,zipcode_98028,zipcode_98029,zipcode_98030,zipcode_98031,zipcode_98032,zipcode_98033,zipcode_98034,zipcode_98038,zipcode_98039,zipcode_98040,zipcode_98042,zipcode_98045,zipcode_98052,zipcode_98053,zipcode_98055,zipcode_98056,zipcode_98058,zipcode_98059,zipcode_98065,zipcode_98070,zipcode_98072,zipcode_98074,zipcode_98075,zipcode_98077,zipcode_98092,zipcode_98102,zipcode_98103,zipcode_98105,zipcode_98106,zipcode_98107,zipcode_98108,zipcode_98109,zipcode_98112,zipcode_98115,zipcode_98116,zipcode_98117,zipcode_98118,zipcode_98119,zipcode_98122,zipcode_98125,zipcode_98126,zipcode_98133,zipcode_98136,zipcode_98144,zipcode_98146,zipcode_98148,zipcode_98155,zipcode_98166,zipcode_98168,zipcode_98177,zipcode_98178,zipcode_98188,zipcode_98198,zipcode_98199
1795900120,3.0,2.5,2250.0,9235.0,2.0,0.0,0.0,3.0,8.0,2250.0,0.0,1985.0,0.0,2290.0,8187.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6788201240,4.0,2.75,1590.0,6000.0,1.5,0.0,0.0,4.0,8.0,1590.0,0.0,1925.0,0.0,1590.0,4000.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2461900550,4.0,1.75,2040.0,6000.0,1.0,0.0,0.0,5.0,7.0,1020.0,1020.0,1943.0,0.0,1440.0,6000.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## Extracting Coefficients and Intercept from Scikit-Learn Linear Regression

In [18]:
from sklearn.linear_model import LinearRegression

## fitting a linear regression model
lin_reg = LinearRegression()
lin_reg.fit(X_train_df, y_train)
print(f'Training R^2: {lin_reg.score(X_train_df, y_train):.3f}')
print(f'Test R^2: {lin_reg.score(X_test_df, y_test):.3f}')

Training R^2: 0.811
Test R^2: 0.798


In [19]:
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
def evaluate_linreg(model, X_train,y_train, X_test,y_test):
    results = []
    y_hat_train = model.predict(X_train)
    r2_train = r2_score(y_train,y_hat_train)
    rmse_train = mean_squared_error(y_train,y_hat_train, squared=False)
    results.append({'Data':'Train', 'R^2':r2_train, "RMSE": rmse_train})
    
    y_hat_test = model.predict(X_test)
    r2_test = r2_score(y_test,y_hat_test)
    rmse_test = mean_squared_error(y_test,y_hat_test, squared=False)
    results.append({'Data':'Test', 'R^2':r2_test, "RMSE": rmse_test})
    
    results_df = pd.DataFrame(results).round(3).set_index('Data')
    results_df.loc['Delta'] = results_df.loc['Test'] - results_df.loc['Train']
    return results_df

In [20]:
## fitting a linear regression model
lin_reg = LinearRegression()
lin_reg.fit(X_train_df, y_train)
evaluate_linreg(lin_reg, X_train_df, y_train, X_test_df,y_test)

| Data | R^2 | RMSE |
| --- | --- | --- |
| Train | 0.811 | 160231.707 |
| Test | 0.798 | 163653.598 |
| Delta | -0.013 | 3421.891 |
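
For reference, both metrics in the table can be computed directly from their definitions. The sketch below is an illustration with made-up numbers, and the helper names `rmse` and `r2` are hypothetical. It may also be useful if the `squared` argument is unavailable in your installed version of scikit-learn.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error: the square root of the average squared residual."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def r2(y_true, y_pred):
    """R^2: 1 minus (residual sum of squares / total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1 - ss_res / ss_tot)


# Made-up values for illustration
y_true = [3.0, 5.0, 7.0]
y_pred = [1.0, 5.0, 7.0]
print(rmse(y_true, y_pred))
print(r2(y_true, y_pred))
```

RMSE is in the same units as the target (USD for our housing model), which is why the values in the table are so much larger than the R^2 values.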


- For scikit-learn linear regressions, the coefficients for the features in our X data are stored in the `.coef_` attribute. 
- `.coef_` is a NumPy array that should have the same number of values as the number of columns in X_train_df.

In [21]:
lin_reg.coef_

array([-2.39605871e+04,  2.38720736e+04,  1.40195287e+12,  2.10604488e-01,
       -4.68383897e+04,  6.63684178e+05,  5.88332016e+04,  2.40529455e+04,
        5.94223860e+04, -1.40195287e+12, -1.40195287e+12, -6.33718178e+02,
        1.53258262e+01,  1.17719643e+01, -8.69358917e-02, -1.23604903e+07,
       -1.23364921e+07, -1.23809445e+07, -1.15719993e+07, -1.20521747e+07,
       -1.21044175e+07, -1.21172134e+07, -1.21091811e+07, -1.23006235e+07,
       -1.22433725e+07, -1.22668322e+07, -1.22745999e+07, -1.23777679e+07,
       -1.24000930e+07, -1.21991019e+07, -1.21835166e+07, -1.22451491e+07,
       -1.21571737e+07, -1.23574769e+07, -1.23503261e+07, -1.23644310e+07,
       -1.19932171e+07, -1.21634026e+07, -1.23311415e+07, -1.10029369e+07,
       -1.18572731e+07, -1.23601593e+07, -1.22754067e+07, -1.21370406e+07,
       -1.21762412e+07, -1.23162739e+07, -1.22661179e+07, -1.23385786e+07,
       -1.22781834e+07, -1.22835303e+07, -1.23676740e+07, -1.22052280e+07,
       -1.21894810e+07, -

In [22]:
## Checking the number of coeffs matches the # of feature names
print(len(lin_reg.coef_))
len(feature_names)

85


85

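> Note: if `len()` of your `coef_` is 1, it is a 2-D array (for example, shape `(1, 85)`). This happens when `y` is passed as a 2-D column, such as a single-column DataFrame. In that case, add the `.flatten()` method to convert `coef_` into a simple 1-D array.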
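
A minimal sketch of this situation, using made-up data: fitting the same model with a 1-D `y` and with a 2-D single-column `y` produces `coef_` arrays of different shapes.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data where y = 3*x0 - 2*x1 + 10 exactly
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 5.0], [4.0, 3.0]])
y = 3 * X[:, 0] - 2 * X[:, 1] + 10

coef_1d = LinearRegression().fit(X, y).coef_                 # y is 1-D
coef_2d = LinearRegression().fit(X, y.reshape(-1, 1)).coef_  # y is a 2-D column

print(coef_1d.shape)
print(coef_2d.shape, len(coef_2d))

# .flatten() turns the 2-D result into the 1-D array we expect
flat = coef_2d.flatten()
print(flat.shape)
```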

### Saving the coefficients as a pandas Series

- We can immediately turn the model's `.coef_` into a pd.Series, as well.
    - Therefore, in our pd.Series, we will provide `lin_reg.coef_` as the first argument, followed by `index=feature_names` (pandas Series are 1D and do not have columns).

In [23]:
# feature_names = [f for f in feature_names if f not in zip_cols]

In [24]:
## Saving the coefficients
coeffs = pd.Series(lin_reg.coef_, index= feature_names)
coeffs

bedrooms        -2.396059e+04
bathrooms        2.387207e+04
sqft_living      1.401953e+12
sqft_lot         2.106045e-01
floors          -4.683839e+04
                     ...     
zipcode_98177   -1.214372e+07
zipcode_98178   -1.232492e+07
zipcode_98188   -1.232647e+07
zipcode_98198   -1.236204e+07
zipcode_98199   -1.197201e+07
Length: 85, dtype: float64

- The constant/intercept is not included in the `.coef_` attribute (when we use the default setting for LinearRegression, `fit_intercept=True`).
- The intercept is stored in the `.intercept_` attribute 
- We can add this as a new value to our coeffs series.
- Note: it is up to you what you name your intercept/constant. If you wanted to keep the naming convention of statsmodels, you could use "const" or just "intercept" for simplicity.

In [25]:
# use .loc to add the intercept to the series
coeffs.loc['intercept'] = lin_reg.intercept_
coeffs

bedrooms        -2.396059e+04
bathrooms        2.387207e+04
sqft_living      1.401953e+12
sqft_lot         2.106045e-01
floors          -4.683839e+04
                     ...     
zipcode_98178   -1.232492e+07
zipcode_98188   -1.232647e+07
zipcode_98198   -1.236204e+07
zipcode_98199   -1.197201e+07
intercept        1.308675e+07
Length: 86, dtype: float64
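
To see how the coefficients and the intercept combine into a prediction, here is a minimal sketch on made-up data (`X_toy`, `y_toy`, and their values are assumptions for illustration). It rebuilds the model's predictions by hand from a coefficient Series like the one above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up homes for illustration
X_toy = pd.DataFrame({"bedrooms": [2, 3, 4, 3, 5],
                      "sqft_living": [900, 1500, 2100, 1200, 3000]})
y_toy = pd.Series([200_000, 310_000, 420_000, 260_000, 590_000], name="price")

model = LinearRegression().fit(X_toy, y_toy)

# Same pattern as above: coefficients as a Series, intercept added with .loc
coeffs_toy = pd.Series(model.coef_, index=X_toy.columns)
coeffs_toy.loc["intercept"] = model.intercept_

# prediction = sum(feature value * coefficient) + intercept
manual = X_toy @ coeffs_toy.drop("intercept") + coeffs_toy["intercept"]
print(np.allclose(manual, model.predict(X_toy)))

# Adding one bedroom to a home changes its prediction by the bedrooms coefficient
home = X_toy.iloc[[0]]
bigger = home.assign(bedrooms=home["bedrooms"] + 1)
diff = model.predict(bigger)[0] - model.predict(home)[0]
print(diff, coeffs_toy["bedrooms"])
```

This is the sense in which each coefficient is the change in the predicted price for a 1-unit increase in that feature, holding the other features fixed.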

### Displaying the Coefficients

- Just like we increased the number of columns displayed by pandas, we can also increase the number of rows displayed by pandas.
- CAUTION: DO NOT SET THE MAX ROWS TO 0 (or to `None`, which means unlimited)!! If you try to display a dataframe that has 1,000,000 rows, pandas will try to display ALL 1,000,000 rows and will crash your kernel.

In [26]:
pd.set_option('display.max_rows',100)
coeffs

bedrooms        -2.396059e+04
bathrooms        2.387207e+04
sqft_living      1.401953e+12
sqft_lot         2.106045e-01
floors          -4.683839e+04
waterfront       6.636842e+05
view             5.883320e+04
condition        2.405295e+04
grade            5.942239e+04
sqft_above      -1.401953e+12
sqft_basement   -1.401953e+12
yr_built        -6.337182e+02
yr_renovated     1.532583e+01
sqft_living15    1.177196e+01
sqft_lot15      -8.693589e-02
zipcode_98001   -1.236049e+07
zipcode_98002   -1.233649e+07
zipcode_98003   -1.238094e+07
zipcode_98004   -1.157200e+07
zipcode_98005   -1.205217e+07
zipcode_98006   -1.210442e+07
zipcode_98007   -1.211721e+07
zipcode_98008   -1.210918e+07
zipcode_98010   -1.230062e+07
zipcode_98011   -1.224337e+07
zipcode_98014   -1.226683e+07
zipcode_98019   -1.227460e+07
zipcode_98022   -1.237777e+07
zipcode_98023   -1.240009e+07
zipcode_98024   -1.219910e+07
zipcode_98027   -1.218352e+07
zipcode_98028   -1.224515e+07
zipcode_98029   -1.215717e+07
zipcode_98

### Suppressing Scientific Notation in Pandas

> We can ALSO use pandas' options to change how it displays numeric values.
- If we want to add a `,` separator for thousands and round to 2 decimal places, we would use the format code ",.2f". 
- In order for pandas to use this, we pass it a function: here, a `lambda x` that returns an f-string (`x` represents any float value being displayed by pandas).

In [27]:
pd.set_option('display.float_format', lambda x: f"{x:,.2f}")
coeffs

bedrooms                   -23,960.59
bathrooms                   23,872.07
sqft_living      1,401,952,867,768.46
sqft_lot                         0.21
floors                     -46,838.39
waterfront                 663,684.18
view                        58,833.20
condition                   24,052.95
grade                       59,422.39
sqft_above      -1,401,952,867,567.36
sqft_basement   -1,401,952,867,643.39
yr_built                      -633.72
yr_renovated                    15.33
sqft_living15                   11.77
sqft_lot15                      -0.09
zipcode_98001          -12,360,490.30
zipcode_98002          -12,336,492.11
zipcode_98003          -12,380,944.51
zipcode_98004          -11,571,999.26
zipcode_98005          -12,052,174.67
zipcode_98006          -12,104,417.47
zipcode_98007          -12,117,213.41
zipcode_98008          -12,109,181.11
zipcode_98010          -12,300,623.48
zipcode_98011          -12,243,372.54
zipcode_98014          -12,266,832.25
zipcode_9801
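
The format code itself is plain Python and can be tried outside of pandas. A short sketch, using a few of the coefficient values shown above:

```python
# Values copied from the coefficient output above
bedrooms = -23960.5871
sqft_lot = 0.210604488
sqft_living = 1401952867768.46

# ",.2f" = thousands separator + 2 decimal places
print(f"{bedrooms:,.2f}")      # -23,960.59
print(f"{sqft_lot:,.2f}")      # 0.21
print(f"{sqft_living:,.2f}")   # 1,401,952,867,768.46

# For comparison, ".2e" gives scientific notation with 2 decimal places
print(f"{sqft_living:.2e}")    # 1.40e+12

# The lambda given to pandas simply applies the same format code to each float
formatter = lambda x: f"{x:,.2f}"
print(formatter(bedrooms))     # -23,960.59
```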

## Inspecting Our Coefficients - Sanity Check

- Remember that we are currently using the raw numeric data, we have not applied any scaling. Therefore, our coefficients represent the actual cost in USD (\$) that is added to/subtracted from the home for each additional 1 unit of that feature.

- We have a large number of coefficients, which makes it trickier to visualize on one graph. 

- If we save the list of feature names that are the One Hot Encoded zipcodes, we can easily slice them out into a separate graph



In [28]:
## Saving list of zipcode features and other features
# Method A) For Loop to save zipcode column names
zip_cols = []
nonzip_cols = []
for col in X_train_df.columns:
    if col.startswith('zipcode'):
        zip_cols.append(col)
    else:
        nonzip_cols.append(col)

## Preview first 5 zipcols and all nonzip cols
print(zip_cols[:5])        
nonzip_cols

['zipcode_98001', 'zipcode_98002', 'zipcode_98003', 'zipcode_98004', 'zipcode_98005']


['bedrooms',
 'bathrooms',
 'sqft_living',
 'sqft_lot',
 'floors',
 'waterfront',
 'view',
 'condition',
 'grade',
 'sqft_above',
 'sqft_basement',
 'yr_built',
 'yr_renovated',
 'sqft_living15',
 'sqft_lot15']

In [29]:
## Saving list of zipcode features and other features
# Method B) List Comprehension Way
zip_cols = [c for c in X_train_df.columns if c.startswith('zipcode')]
nonzip_cols = [c for c in X_train_df.columns if not c.startswith('zipcode')]


## Preview first 5 zipcols and all nonzip cols
print(zip_cols[:5])        
nonzip_cols

['zipcode_98001', 'zipcode_98002', 'zipcode_98003', 'zipcode_98004', 'zipcode_98005']


['bedrooms',
 'bathrooms',
 'sqft_living',
 'sqft_lot',
 'floors',
 'waterfront',
 'view',
 'condition',
 'grade',
 'sqft_above',
 'sqft_basement',
 'yr_built',
 'yr_renovated',
 'sqft_living15',
 'sqft_lot15']

### Visualizing Coefficients

- Now, let's examine the coefficients below and see if they make sense, based on our knowledge about houses. 

In [30]:
## Drop the zipcode coefficients to inspect the remaining coefficients (and intercept)
nonzip_coeffs = coeffs.drop(zip_cols)
nonzip_coeffs

bedrooms                   -23,960.59
bathrooms                   23,872.07
sqft_living      1,401,952,867,768.46
sqft_lot                         0.21
floors                     -46,838.39
waterfront                 663,684.18
view                        58,833.20
condition                   24,052.95
grade                       59,422.39
sqft_above      -1,401,952,867,567.36
sqft_basement   -1,401,952,867,643.39
yr_built                      -633.72
yr_renovated                    15.33
sqft_living15                   11.77
sqft_lot15                      -0.09
intercept               13,086,752.23
dtype: float64
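
One possible way to plot these values is a sorted horizontal bar chart. The sketch below is only an illustration: `plot_coeffs` is a hypothetical helper, and `example_coeffs` holds a few of the values copied from the output above. The three `sqft_*` features and the intercept are left out, since their scale would dwarf every other bar.

```python
import matplotlib.pyplot as plt
import pandas as pd

# A few non-zipcode coefficients copied from the output above
example_coeffs = pd.Series({
    "bedrooms": -23960.59,
    "bathrooms": 23872.07,
    "floors": -46838.39,
    "waterfront": 663684.18,
    "view": 58833.20,
    "condition": 24052.95,
    "grade": 59422.39,
})


def plot_coeffs(coeffs, title="Coefficients"):
    """Hypothetical helper: horizontal bar chart of coefficients, sorted by value."""
    fig, ax = plt.subplots(figsize=(8, 4))
    coeffs.sort_values().plot(kind="barh", ax=ax)
    ax.axvline(0, color="black", linewidth=1)  # reference line at zero
    ax.set(xlabel="Change in price (USD) per 1-unit increase",
           ylabel="Feature", title=title)
    fig.tight_layout()
    return ax


ax = plot_coeffs(example_coeffs, title="Non-zipcode coefficients (subset)")
```

Sorting before plotting puts the most negative coefficients at the bottom and the most positive at the top, which makes the direction and relative size of each effect easier to compare.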

- Hmmm... for each additional:
    - 1 bedroom, subtract \$23,960.59 from the price.
        - Hmm, seems a little odd, but we can investigate bedrooms further with EDA. 
    - 1 bathroom, add \$23,872.07.
        - Ok, that sounds reasonable. 
    - 1 sqft of living space, add ... \$1,401,952,867,768.46 ?!?!?!?
        - Hmm... \$1.4 trillion?!?!?! For 1 sqft of space? Something seems wrong here... 

Indeed, if we examine our other coefficients, we have several that seem like impractical/unrealistic values (e.g., sqft_above and sqft_basement, which each subtract roughly \$1.4 trillion per sqft).

If we inspect the coefficients for zipcodes, we will find some additional unrealistic values. (Why would a zipcode subtract \$12 million from a home's value??)

In [31]:
zip_coeffs = coeffs.loc[zip_cols]
zip_coeffs

zipcode_98001   -12,360,490.30
zipcode_98002   -12,336,492.11
zipcode_98003   -12,380,944.51
zipcode_98004   -11,571,999.26
zipcode_98005   -12,052,174.67
zipcode_98006   -12,104,417.47
zipcode_98007   -12,117,213.41
zipcode_98008   -12,109,181.11
zipcode_98010   -12,300,623.48
zipcode_98011   -12,243,372.54
zipcode_98014   -12,266,832.25
zipcode_98019   -12,274,599.90
zipcode_98022   -12,377,767.91
zipcode_98023   -12,400,093.05
zipcode_98024   -12,199,101.92
zipcode_98027   -12,183,516.57
zipcode_98028   -12,245,149.10
zipcode_98029   -12,157,173.67
zipcode_98030   -12,357,476.90
zipcode_98031   -12,350,326.13
zipcode_98032   -12,364,431.03
zipcode_98033   -11,993,217.14
zipcode_98034   -12,163,402.56
zipcode_98038   -12,331,141.49
zipcode_98039   -11,002,936.86
zipcode_98040   -11,857,273.08
zipcode_98042   -12,360,159.31
zipcode_98045   -12,275,406.71
zipcode_98052   -12,137,040.61
zipcode_98053   -12,176,241.16
zipcode_98055   -12,316,273.93
zipcode_98056   -12,266,117.86
zipcode_

- We will iterate upon this model and discuss alternative choices we can make to optimize the model for providing insights for our stakeholder.

## Summary

- In this lesson, we revisited linear regression with scikit-learn. We introduced some simplifications to our workflow and discussed extracting coefficients from our LinearRegression model. 

- Next lesson we will iterate on our current model to find more intuitive coefficients that we can use to extract insight for our stakeholders.

### Recap - Sklearn v1.1

- We added the argument `verbose_feature_names_out=False` to `make_column_transformer`, which let us extract our feature names (after fitting the preprocessor) using `.get_feature_names_out()`

- We then used this list of features when reconstructing our transformed X_train and X_test as dataframes and when extracting coefficients from our model.

### Recap - Pandas Options

- We used the following options in our notebook. Ideally, we should group these together and move them to the top of our notebook, immediately after our imports.

In [32]:
## Reviewing the options used
pd.set_option('display.max_columns',100)
pd.set_option('display.max_rows',100)
pd.set_option('display.float_format', lambda x: f"{x:,.2f}")
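
If you would rather not change these settings for the whole notebook, pandas also provides `pd.get_option`, `pd.reset_option`, and the `pd.option_context` context manager. The sketch below is an optional alternative, not something the lesson requires.

```python
import pandas as pd

# Check the current values before changing anything
rows_before = pd.get_option("display.max_rows")
cols_before = pd.get_option("display.max_columns")

# option_context applies the settings only inside the `with` block
with pd.option_context("display.max_columns", 100,
                       "display.float_format", lambda x: f"{x:,.2f}"):
    cols_inside = pd.get_option("display.max_columns")
    text_inside = repr(pd.Series([1234567.891]))
    print(text_inside)

# Outside the block, the previous setting is back
cols_after = pd.get_option("display.max_columns")
print(cols_inside, cols_after)

# reset_option restores an option to its pandas default
pd.set_option("display.max_rows", 100)
pd.reset_option("display.max_rows")
print(pd.get_option("display.max_rows"))
```

`option_context` is handy for a single wide dataframe or one formatted printout, while `set_option` at the top of the notebook is simpler when you want the same display everywhere.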