# Extracting Insights from a Modeling Pipeline

In stack 2, we always combined our ColumnTransformer and model together in a final modeling pipeline. Let's create and fit a Linear Regression pipeline and a DecisionTreeRegressor pipeline using this approach.

### Preprocessing

In [1]:
## Typical Imports
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

## Modeling & preprocessing import
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder,StandardScaler
from sklearn.compose import ColumnTransformer,make_column_transformer,make_column_selector
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.impute import SimpleImputer

In [2]:
## Load dataset from published web view link
fpath = "https://docs.google.com/spreadsheets/d/e/2PACX-1vS6Sn9LaMSc_E1EHQpuRK6BTpKp6h27obTP_dTpAVu_xtoqsge30jBGh9vYlO4DYe-utRKMgMqYChU_/pub?output=csv"
df = pd.read_csv(fpath)
df.head(3)

| | Item_Identifier | Item_Weight | Item_Fat_Content | Item_Visibility | Item_Type | Item_MRP | Outlet_Identifier | Outlet_Establishment_Year | Outlet_Size | Outlet_Location_Type | Outlet_Type | Item_Outlet_Sales |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | FDA15 | 9.3 | Low Fat | 0.016047 | Dairy | 249.8092 | OUT049 | 1999 | Medium | Tier 1 | Supermarket Type1 | 3735.138 |
| 1 | DRC01 | 5.92 | Regular | 0.019278 | Soft Drinks | 48.2692 | OUT018 | 2009 | Medium | Tier 3 | Supermarket Type2 | 443.4228 |
| 2 | FDN15 | 17.5 | Low Fat | 0.01676 | Meat | 141.618 | OUT049 | 1999 | Medium | Tier 1 | Supermarket Type1 | 2097.27 |


In [3]:
## replace inconsistent categories
fat_content_map = {'LF':'Low Fat',
                   'reg':'Regular',
                   'low fat':'Low Fat'}

df['Item_Fat_Content'] = df['Item_Fat_Content'].replace(fat_content_map)

## Verify 
df['Item_Fat_Content'].value_counts()

Low Fat    5517
Regular    3006
Name: Item_Fat_Content, dtype: int64
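
As a quick illustration of the cleanup step above, the sketch below applies the same mapping to a small made-up Series (the sample values are assumptions, not rows from the dataset). Values that are not keys in the mapping pass through unchanged.

```python
import pandas as pd

# Toy values (made up for illustration) with the same inconsistencies
fat_content = pd.Series(['Low Fat', 'LF', 'reg', 'Regular', 'low fat'])

fat_content_map = {'LF': 'Low Fat',
                   'reg': 'Regular',
                   'low fat': 'Low Fat'}

# replace() only touches values that appear as keys in the mapping
cleaned = fat_content.replace(fat_content_map)
print(cleaned.value_counts())
```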

In [4]:
## Define X and y
target = 'Item_Outlet_Sales'

X = df.drop(columns=target).copy()
y = df[target].copy()
X

| | Item_Identifier | Item_Weight | Item_Fat_Content | Item_Visibility | Item_Type | Item_MRP | Outlet_Identifier | Outlet_Establishment_Year | Outlet_Size | Outlet_Location_Type | Outlet_Type |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | FDA15 | 9.300 | Low Fat | 0.016047 | Dairy | 249.8092 | OUT049 | 1999 | Medium | Tier 1 | Supermarket Type1 |
| 1 | DRC01 | 5.920 | Regular | 0.019278 | Soft Drinks | 48.2692 | OUT018 | 2009 | Medium | Tier 3 | Supermarket Type2 |
| 2 | FDN15 | 17.500 | Low Fat | 0.016760 | Meat | 141.6180 | OUT049 | 1999 | Medium | Tier 1 | Supermarket Type1 |
| 3 | FDX07 | 19.200 | Regular | 0.000000 | Fruits and Vegetables | 182.0950 | OUT010 | 1998 | NaN | Tier 3 | Grocery Store |
| 4 | NCD19 | 8.930 | Low Fat | 0.000000 | Household | 53.8614 | OUT013 | 1987 | High | Tier 3 | Supermarket Type1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 8518 | FDF22 | 6.865 | Low Fat | 0.056783 | Snack Foods | 214.5218 | OUT013 | 1987 | High | Tier 3 | Supermarket Type1 |
| 8519 | FDS36 | 8.380 | Regular | 0.046982 | Baking Goods | 108.1570 | OUT045 | 2002 | NaN | Tier 2 | Supermarket Type1 |
| 8520 | NCJ29 | 10.600 | Low Fat | 0.035186 | Health and Hygiene | 85.1224 | OUT035 | 2004 | Small | Tier 2 | Supermarket Type1 |
| 8521 | FDN46 | 7.210 | Regular | 0.145221 | Snack Foods | 103.1332 | OUT018 | 2009 | Medium | Tier 3 | Supermarket Type2 |


In [5]:
## Drop unwanted/inappropriate columns 
bad_cols = ['Item_Identifier','Outlet_Identifier','Outlet_Establishment_Year']
X = X.drop(columns=bad_cols)
X

| | Item_Weight | Item_Fat_Content | Item_Visibility | Item_Type | Item_MRP | Outlet_Size | Outlet_Location_Type | Outlet_Type |
|---|---|---|---|---|---|---|---|---|
| 0 | 9.300 | Low Fat | 0.016047 | Dairy | 249.8092 | Medium | Tier 1 | Supermarket Type1 |
| 1 | 5.920 | Regular | 0.019278 | Soft Drinks | 48.2692 | Medium | Tier 3 | Supermarket Type2 |
| 2 | 17.500 | Low Fat | 0.016760 | Meat | 141.6180 | Medium | Tier 1 | Supermarket Type1 |
| 3 | 19.200 | Regular | 0.000000 | Fruits and Vegetables | 182.0950 | NaN | Tier 3 | Grocery Store |
| 4 | 8.930 | Low Fat | 0.000000 | Household | 53.8614 | High | Tier 3 | Supermarket Type1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 8518 | 6.865 | Low Fat | 0.056783 | Snack Foods | 214.5218 | High | Tier 3 | Supermarket Type1 |
| 8519 | 8.380 | Regular | 0.046982 | Baking Goods | 108.1570 | NaN | Tier 2 | Supermarket Type1 |
| 8520 | 10.600 | Low Fat | 0.035186 | Health and Hygiene | 85.1224 | Small | Tier 2 | Supermarket Type1 |
| 8521 | 7.210 | Regular | 0.145221 | Snack Foods | 103.1332 | Medium | Tier 3 | Supermarket Type2 |


In [6]:
## Perform a train-test-split
X_train, X_test, y_train, y_test = train_test_split(X,y, random_state=42)
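
`train_test_split` is called here without a `test_size`, so scikit-learn's default of holding out 25% of the rows applies. A minimal sketch on made-up data (the toy frame below is an assumption, not the sales dataset) to confirm the resulting sizes and that X and y stay aligned:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 100 rows, invented purely to check the split sizes
X_toy = pd.DataFrame({'a': np.arange(100), 'b': np.arange(100) * 2.0})
y_toy = pd.Series(np.arange(100), name='target')

# No test_size given, so the default 75% / 25% split is used
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=42)
print(X_tr.shape, X_te.shape)

# The same rows end up in the X and y halves of each split
print((X_tr.index == y_tr.index).all())
```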

### Making a Preprocessing Pipeline

In [7]:
## Create categorical pipeline
cat_selector = make_column_selector(dtype_include='object')

# create pipeline for handling categorical data
impute_most_freq = SimpleImputer(strategy='most_frequent')
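# NOTE: scikit-learn 1.2 renamed `sparse` to `sparse_output` (and 1.4 removed `sparse`);
# on newer versions use OneHotEncoder(handle_unknown='ignore', sparse_output=False)
encoder = OneHotEncoder(handle_unknown='ignore',sparse=False)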

cat_pipe = make_pipeline(impute_most_freq,encoder)


## Create numeric pipeline
num_selector = make_column_selector(dtype_include='number')
num_selector(X_train)

# create pipeline for handling numeric data
impute_mean = SimpleImputer(strategy='mean')
scaler = StandardScaler()

num_pipe = make_pipeline(impute_mean, scaler)


## Combine into 1 column transformer
preprocessor = make_column_transformer( (cat_pipe,cat_selector),
                                       (num_pipe,num_selector),
                                      verbose_feature_names_out=False)

```python
preprocessor
```

<img src="preprocessor.jpg" width=300px>
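
The selectors built with `make_column_selector` are callables: given a DataFrame, they return the names of the columns whose dtype matches. The sketch below shows this on a small made-up frame (the values are assumptions for illustration; the string columns are forced to `object` dtype so the `dtype_include='object'` selector matches them regardless of pandas version).

```python
import pandas as pd
from sklearn.compose import make_column_selector

# Toy frame (made up) with two object columns and two numeric columns
toy = pd.DataFrame({
    'Item_Type': pd.Series(['Dairy', 'Meat'], dtype=object),
    'Item_MRP': [249.8, 141.6],
    'Outlet_Size': pd.Series(['Medium', None], dtype=object),
    'Item_Weight': [9.3, 17.5],
})

cat_selector = make_column_selector(dtype_include='object')
num_selector = make_column_selector(dtype_include='number')

# Calling a selector on a DataFrame returns the matching column names
cat_cols = cat_selector(toy)
num_cols = num_selector(toy)
print(cat_cols)
print(num_cols)
```

Because the ColumnTransformer calls the selectors at fit time, the same preprocessor can be reused if columns are added or dropped before fitting.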

## Modeling

### Model 1 -  LinearRegression

In [8]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score,mean_squared_error

In [9]:
## Make & Fit the modeling pipeline
pipe = make_pipeline(preprocessor, LinearRegression())
pipe.fit(X_train, y_train)

## Quick peek at the R^2 values for training and test data
print(f"Training R2: {pipe.score(X_train,y_train) :.3f}")
print(f"Test R2: {pipe.score(X_test,y_test): .3f}")

Training R2: 0.558
Test R2:  0.563


### Model 2 - Decision Tree Regressor

In [10]:
from sklearn.tree import DecisionTreeRegressor

In [11]:
## Make and fit model
tree_pipe = make_pipeline(preprocessor,DecisionTreeRegressor())
tree_pipe.fit(X_train, y_train)

## Quick peek at the R^2 values for training and test data
print(f"Training R2: {tree_pipe.score(X_train,y_train) :.3f}")
print(f"Test R2: {tree_pipe.score(X_test,y_test): .3f}")

Training R2: 1.000
Test R2:  0.133
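
The `r2_score` and `mean_squared_error` imports above go unused in the cells shown. One way they might be put to work is a small helper that reports both R² and RMSE for any fitted pipeline. The helper name `evaluate_regression` and the synthetic data are hypothetical, invented for this sketch. Putting the train and test rows side by side makes a gap like the decision tree's 1.000 vs 0.133 easy to spot.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def evaluate_regression(model, X_train, y_train, X_test, y_test):
    """Return R^2 and RMSE for the train and test splits (hypothetical helper)."""
    rows = {}
    for label, X_part, y_part in [('train', X_train, y_train),
                                  ('test', X_test, y_test)]:
        preds = model.predict(X_part)
        rows[label] = {'R2': r2_score(y_part, preds),
                       # square root of the MSE gives the RMSE
                       'RMSE': np.sqrt(mean_squared_error(y_part, preds))}
    # One row per split, one column per metric
    return pd.DataFrame(rows).T


# Synthetic data, made up for the example: y depends linearly on x1 plus noise
rng = np.random.default_rng(42)
X_demo = pd.DataFrame({'x1': rng.normal(size=200), 'x2': rng.normal(size=200)})
y_demo = 3 * X_demo['x1'] + rng.normal(scale=0.5, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)
demo_pipe = make_pipeline(StandardScaler(), LinearRegression()).fit(X_tr, y_tr)

results = evaluate_regression(demo_pipe, X_tr, y_tr, X_te, y_te)
print(results.round(3))
```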


## Extracting Values from a Modeling Pipeline

If you want to extract the feature importances or coefficients from your model, you can do so without first converting your processed X_train and X_test data into DataFrames.

Instead, we slice the relevant object out of the modeling pipeline for each piece of information we need: the preprocessor for the feature names, and the model for the coefficients or importances.

### Extracting Feature Names and Model Parameters from a Pipeline

- First, make sure you have already fit your modeling pipeline and are working below that point in your notebook. 

- Pipelines can be sliced as if they were a list - with square brackets and a numeric index.

- Right now, we have a fitted modeling pipeline with 2 components:
    - the preprocessing ColumnTransformer (with the method to get our feature names: `.get_feature_names_out()`).
    - your model (with its `.feature_importances_` or `.coef_` attribute).
    

```python 
pipe
```


<img src="modeling_pipeline.png" width=350px>

#### Extracting the feature names
- The ColumnTransformer is the first item in the pipeline, so it is index 0.
- Therefore, to slice the ColumnTransformer out of the pipeline and run its `get_feature_names_out` method, we would use:

In [12]:
# Extracting the feature names from the pipeline
feature_names = pipe[0].get_feature_names_out()
feature_names

array(['Item_Fat_Content_Low Fat', 'Item_Fat_Content_Regular',
       'Item_Type_Baking Goods', 'Item_Type_Breads',
       'Item_Type_Breakfast', 'Item_Type_Canned', 'Item_Type_Dairy',
       'Item_Type_Frozen Foods', 'Item_Type_Fruits and Vegetables',
       'Item_Type_Hard Drinks', 'Item_Type_Health and Hygiene',
       'Item_Type_Household', 'Item_Type_Meat', 'Item_Type_Others',
       'Item_Type_Seafood', 'Item_Type_Snack Foods',
       'Item_Type_Soft Drinks', 'Item_Type_Starchy Foods',
       'Outlet_Size_High', 'Outlet_Size_Medium', 'Outlet_Size_Small',
       'Outlet_Location_Type_Tier 1', 'Outlet_Location_Type_Tier 2',
       'Outlet_Location_Type_Tier 3', 'Outlet_Type_Grocery Store',
       'Outlet_Type_Supermarket Type1', 'Outlet_Type_Supermarket Type2',
       'Outlet_Type_Supermarket Type3', 'Item_Weight', 'Item_Visibility',
       'Item_MRP'], dtype=object)

- Note: if you are seeing `pipeline-1__` and `pipeline-2__` in your feature names, you must scroll back up to where you ran the `make_column_transformer` function and add `verbose_feature_names_out=False`.
    - Make sure to Restart and Run All once you've added this argument.

#### Extracting the coefficients
- The model is the last item in the pipeline, so it is index -1. Once we've sliced out the model, we can access its `.coef_` or `.feature_importances_` attribute.

- Therefore, to extract the coefficients of the Linear Regression model from our pipeline, we would use:

In [13]:
# Extracting the coefficients from the pipeline
pipe[-1].coef_

array([ 1.49883856e+15,  1.49883856e+15,  1.30987094e+15,  1.30987094e+15,
        1.30987094e+15,  1.30987094e+15,  1.30987094e+15,  1.30987094e+15,
        1.30987094e+15,  1.30987094e+15,  1.30987094e+15,  1.30987094e+15,
        1.30987094e+15,  1.30987094e+15,  1.30987094e+15,  1.30987094e+15,
        1.30987094e+15,  1.30987094e+15,  1.87183719e+15,  1.87183719e+15,
        1.87183719e+15, -1.92663604e+16, -1.92663604e+16, -1.92663604e+16,
        6.04396096e+15,  6.04396096e+15,  6.04396096e+15,  6.04396096e+15,
       -5.17095682e+00, -2.17398823e+01,  9.85913680e+02])

### Putting it all together

- Now, we just need to create our pandas Series using the values extracted from our pipeline.

#### Extracting Regression Coefficients

In [14]:
feature_names = pipe[0].get_feature_names_out()
coeffs = pd.Series(pipe[-1].coef_, index=feature_names)
coeffs

Item_Fat_Content_Low Fat           1.498839e+15
Item_Fat_Content_Regular           1.498839e+15
Item_Type_Baking Goods             1.309871e+15
Item_Type_Breads                   1.309871e+15
Item_Type_Breakfast                1.309871e+15
Item_Type_Canned                   1.309871e+15
Item_Type_Dairy                    1.309871e+15
Item_Type_Frozen Foods             1.309871e+15
Item_Type_Fruits and Vegetables    1.309871e+15
Item_Type_Hard Drinks              1.309871e+15
Item_Type_Health and Hygiene       1.309871e+15
Item_Type_Household                1.309871e+15
Item_Type_Meat                     1.309871e+15
Item_Type_Others                   1.309871e+15
Item_Type_Seafood                  1.309871e+15
Item_Type_Snack Foods              1.309871e+15
Item_Type_Soft Drinks              1.309871e+15
Item_Type_Starchy Foods            1.309871e+15
Outlet_Size_High                   1.871837e+15
Outlet_Size_Medium                 1.871837e+15
Outlet_Size_Small                  1.871

#### Extracting Feature Importances

In [15]:
feature_names = tree_pipe[0].get_feature_names_out()
importances = pd.Series(tree_pipe[-1].feature_importances_, index=feature_names)
importances

Item_Fat_Content_Low Fat           0.002982
Item_Fat_Content_Regular           0.004952
Item_Type_Baking Goods             0.003353
Item_Type_Breads                   0.002925
Item_Type_Breakfast                0.003106
Item_Type_Canned                   0.004082
Item_Type_Dairy                    0.006096
Item_Type_Frozen Foods             0.005868
Item_Type_Fruits and Vegetables    0.007729
Item_Type_Hard Drinks              0.002428
Item_Type_Health and Hygiene       0.004270
Item_Type_Household                0.004999
Item_Type_Meat                     0.002254
Item_Type_Others                   0.002065
Item_Type_Seafood                  0.002022
Item_Type_Snack Foods              0.007568
Item_Type_Soft Drinks              0.005715
Item_Type_Starchy Foods            0.002190
Outlet_Size_High                   0.005090
Outlet_Size_Medium                 0.008672
Outlet_Size_Small                  0.008639
Outlet_Location_Type_Tier 1        0.004943
Outlet_Location_Type_Tier 2     

### Summary

In this short optional lesson, we reviewed how to extract our feature names, coefficients, and feature importances from a modeling pipeline.