In [20]:
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
import pandas as pd

import seaborn as sns
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

from sklearn.model_selection import train_test_split
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.feature_selection import RFECV, SelectKBest

# Feature Selection with Column Transformer

Categorical and numerical variables need to be treated differently some times. We can loop back to the pipeline and column transformer stuff to incorporate the newer feature selection. 

In [21]:
df = pd.read_csv("data/mtcars.csv")
df.head()

Unnamed: 0,model,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
0,Mazda RX4,21.0,6,160.0,110,3.9,2.62,16.46,0,1,4,4
1,Mazda RX4 Wag,21.0,6,160.0,110,3.9,2.875,17.02,0,1,4,4
2,Datsun 710,22.8,4,108.0,93,3.85,2.32,18.61,1,1,4,1
3,Hornet 4 Drive,21.4,6,258.0,110,3.08,3.215,19.44,1,0,3,1
4,Hornet Sportabout,18.7,8,360.0,175,3.15,3.44,17.02,0,0,3,2


### Feature Types

For this, I am going to treat # of cylinders, vs, am, # of gears, and carb as categorical. This is a reasonable, but not necissarily correct, interpretation of the scenario - a domain knowledge decision.

The data doesn't matter much here, and the example is tiny, but the feature selection stuff transfers as is to other datasets. 

In [22]:
# Manually set categories as categories
df["cyl"] = df["cyl"].astype("category")
df["vs"] = df["vs"].astype("category")
df["am"] = df["am"].astype("category")
df["gear"] = df["gear"].astype("category")
df["carb"] = df["carb"].astype("category")
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32 entries, 0 to 31
Data columns (total 12 columns):
 #   Column  Non-Null Count  Dtype   
---  ------  --------------  -----   
 0   model   32 non-null     object  
 1   mpg     32 non-null     float64 
 2   cyl     32 non-null     category
 3   disp    32 non-null     float64 
 4   hp      32 non-null     int64   
 5   drat    32 non-null     float64 
 6   wt      32 non-null     float64 
 7   qsec    32 non-null     float64 
 8   vs      32 non-null     category
 9   am      32 non-null     category
 10  gear    32 non-null     category
 11  carb    32 non-null     category
dtypes: category(5), float64(5), int64(1), object(1)
memory usage: 2.7+ KB


### Setup Pipeline

We have two pipelines that are then combined in our column transformer. The numerical one includes some rfecv to do feature selection on those variables. The categorical one includes k-best to do feature selection there. Each subset is feature selected, then the two subsets are combined. 

In [23]:
#Data Split
cat_feat = ["cyl", "vs", "am", "gear", "carb"]
num_feat = ["disp", "hp", "drat", "wt", "qsec"]
y = df["mpg"]
X = df.drop(columns={"mpg", "model"})
X_train, X_test, y_train, y_test = train_test_split(X, y)

#estimators
model = LinearRegression()
selector = Lasso()

# RFECV on Numerical Data
min_features_to_select = 1  # Minimum number of features to consider
rfecv = RFECV(
    estimator=selector,
    step=1,
    cv=3,
    min_features_to_select=min_features_to_select,
)
num_pipe = Pipeline([ 
                ("rfecc", rfecv)                          
])

# KBest on Categorical Data
kbest = SelectKBest(k=2)

cat_pipe = Pipeline([
                ("kbest",kbest)
])

#Pre-processing and Column Transformer
prepro = ColumnTransformer([
                ("cat", cat_pipe, cat_feat),
                ("num", num_pipe, num_feat)
])

pipe = Pipeline([("prepro", prepro), 
                ("model", model)    
])

pipe.fit(X_train, y_train)
print("Score:", pipe.score(X_test, y_test))
#print(pipe.get_params("steps"))

Score: -0.16868065403596688


  f = msb / msw


### Conclusion and Caveats

This evaluates the features in two separate groups, and selects the strongest in each group. There is a possibility that there's some categorical-numerical combo of variables that is a great predictor filtered out, but that's pretty unlikely. 

You can also put in any transformations to the categorical and numerical pipes, like we did previously. I left them with only one step for clarity, but if you need to encode, scale, etc... all that stuff can just be layered in to those pipes. 