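Feature-engine is a Python library for feature engineering and feature selection in machine learning pipelines. Its transformers follow the scikit-learn `fit` / `transform` convention, so they can be combined with scikit-learn transformers and estimators inside a `Pipeline`. **Feature-engine can handle both categorical and numerical features, though different transformers are used for each type.** This notebook uses two of its selection transformers, `DropConstantFeatures` and `DropDuplicateFeatures`, on the KDD Cup 99 dataset.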


In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from feature_engine.selection import DropConstantFeatures, DropDuplicateFeatures
from sklearn.pipeline import Pipeline

In [2]:
from sklearn.datasets import fetch_kddcup99

df = fetch_kddcup99(as_frame=True)
df = df.frame
df.head()

,duration,protocol_type,service,flag,src_bytes,dst_bytes,land,wrong_fragment,urgent,hot,...,dst_host_srv_count,dst_host_same_srv_rate,dst_host_diff_srv_rate,dst_host_same_src_port_rate,dst_host_srv_diff_host_rate,dst_host_serror_rate,dst_host_srv_serror_rate,dst_host_rerror_rate,dst_host_srv_rerror_rate,labels
0,0,b'tcp',b'http',b'SF',181,5450,0,0,0,0,...,9,1.0,0.0,0.11,0.0,0.0,0.0,0.0,0.0,b'normal.'
1,0,b'tcp',b'http',b'SF',239,486,0,0,0,0,...,19,1.0,0.0,0.05,0.0,0.0,0.0,0.0,0.0,b'normal.'
2,0,b'tcp',b'http',b'SF',235,1337,0,0,0,0,...,29,1.0,0.0,0.03,0.0,0.0,0.0,0.0,0.0,b'normal.'
3,0,b'tcp',b'http',b'SF',219,1337,0,0,0,0,...,39,1.0,0.0,0.03,0.0,0.0,0.0,0.0,0.0,b'normal.'
4,0,b'tcp',b'http',b'SF',217,2032,0,0,0,0,...,49,1.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0,b'normal.'


In [3]:
X = df.drop('labels', axis=1)
y = df['labels']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

In [4]:
# tol=1 → drop only features that have the same value in every row.
# variables=None → check all features in X_train for constant values.
# missing_values='raise' → raise an error if missing values are present,
#                           so NaNs cannot affect the constant-value check.
drop_constant = DropConstantFeatures(
    tol=1,
    variables=None,
    missing_values='raise'
    )

drop_constant.fit(X_train)

In [5]:
X_train_clean = drop_constant.transform(X_train)
X_test_clean = drop_constant.transform(X_test)

### Using `tol` to drop constant and quasi-constant features

- `tol=1`: drops constant features only (every row holds the same value).
- `tol < 1`: also drops quasi-constant features, where a single value dominates the column.

For example, with `tol=0.9`, any feature in which a single value accounts for 90% or more of the rows is dropped.
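
The effect of `tol` can be illustrated with plain pandas. The sketch below is not Feature-engine's implementation: `dominant_share` and `quasi_constant_columns` are hypothetical helper names, the toy data is made up, and the `>=` comparison against `tol` is an assumption chosen to match the "90% or more" wording above.

```python
import pandas as pd


def dominant_share(column: pd.Series) -> float:
    """Return the fraction of rows taken by the most frequent value."""
    # value_counts sorts by frequency (descending), so the first entry is the mode.
    return column.value_counts(normalize=True, dropna=False).iloc[0]


def quasi_constant_columns(df: pd.DataFrame, tol: float = 1.0) -> list:
    """List columns whose most frequent value covers at least `tol` of the rows."""
    # Assumption: the boundary is inclusive (share >= tol).
    return [name for name in df.columns if dominant_share(df[name]) >= tol]


toy = pd.DataFrame({
    "all_zero": [0] * 20,            # one value in 100% of rows
    "mostly_zero": [0] * 19 + [1],   # one value in 95% of rows
    "balanced": [0, 1] * 10,         # one value in 50% of rows
})

constant_only = quasi_constant_columns(toy, tol=1.0)
quasi_constant = quasi_constant_columns(toy, tol=0.9)

print(constant_only)
print(quasi_constant)
```

With `tol=1.0` only the fully constant column is flagged; lowering the threshold to `0.9` also flags the column in which one value fills 95% of the rows.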


In [6]:
drop_constant = DropConstantFeatures(
    tol=0.95,
    variables=None,
    missing_values='raise'
    )

drop_constant.fit(X_train)

X_train_clean = drop_constant.transform(X_train)
X_test_clean = drop_constant.transform(X_test)

In [7]:
drop_constant.features_to_drop_

['duration',
 'land',
 'wrong_fragment',
 'urgent',
 'hot',
 'num_failed_logins',
 'num_compromised',
 'root_shell',
 'su_attempted',
 'num_root',
 'num_file_creations',
 'num_shells',
 'num_access_files',
 'num_outbound_cmds',
 'is_host_login',
 'is_guest_login']

In [8]:
drop_duplicates = DropDuplicateFeatures(
    variables=None,
    missing_values='raise'
    )

drop_duplicates.fit(X_train)

In [9]:
X_train_clean = drop_duplicates.transform(X_train)
X_test_clean = drop_duplicates.transform(X_test)

In [10]:
drop_duplicates.features_to_drop_

{'is_host_login'}

In [11]:
drop_duplicates.duplicated_feature_sets_

[{'is_host_login', 'num_outbound_cmds'}]
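
To make the two attributes above concrete, here is a minimal pandas sketch of one way identical columns could be grouped. It is an illustration rather than the library's code: `duplicated_column_groups` is a hypothetical helper, the toy data is made up, and the sketch assumes there are no missing values (in line with `missing_values='raise'`). Keeping the first column of each group is also an assumption; it appears consistent with the output above, where only `is_host_login` is dropped from the pair.

```python
import pandas as pd


def duplicated_column_groups(df: pd.DataFrame) -> list:
    """Group column names whose values are identical in every row."""
    groups = {}
    for name in df.columns:
        # Use the full column content as a dictionary key.
        # Assumption: no NaNs, since NaN never compares equal to NaN.
        key = tuple(df[name].tolist())
        groups.setdefault(key, []).append(name)
    # Only groups with more than one column are duplicates.
    return [names for names in groups.values() if len(names) > 1]


toy = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "a_copy": [1, 2, 3, 4],   # identical to "a"
    "b": [0, 0, 1, 1],
    "c": [4, 3, 2, 1],
})

groups = duplicated_column_groups(toy)

# Keep the first column of each group (in column order) and drop the rest.
to_drop = {name for names in groups for name in names[1:]}

print(groups)
print(to_drop)
```

Here `groups` plays the role of `duplicated_feature_sets_` and `to_drop` the role of `features_to_drop_`.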

### Pipeline


In [12]:
pipeline = Pipeline([
    ('constant', DropConstantFeatures(tol=0.95)),
    ('duplicated', DropDuplicateFeatures()),
])

pipeline.fit(X_train)

In [13]:
X_train_clean = pipeline.transform(X_train)
X_test_clean = pipeline.transform(X_test)

In [14]:
# Access a fitted transformer inside the pipeline by its step name

pipeline.named_steps['constant'].features_to_drop_

['duration',
 'land',
 'wrong_fragment',
 'urgent',
 'hot',
 'num_failed_logins',
 'num_compromised',
 'root_shell',
 'su_attempted',
 'num_root',
 'num_file_creations',
 'num_shells',
 'num_access_files',
 'num_outbound_cmds',
 'is_host_login',
 'is_guest_login']
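
Because pipeline steps run in sequence, the `'duplicated'` step only sees the columns that survive the `'constant'` step. In the outputs above, both members of the duplicated set (`num_outbound_cmds` and `is_host_login`) also appear in the quasi-constant list, so the second step likely has nothing left to remove. The sketch below mimics that ordering with plain pandas as a stand-in for the two transformers; the toy data and variable names are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    "zeros_a": [0] * 10,         # constant
    "zeros_b": [0] * 10,         # constant and identical to zeros_a
    "signal": list(range(10)),   # different value in every row
})

# Duplicate check on the raw data: transpose so each column becomes a row,
# then flag rows that repeat an earlier row.
mask_before = toy.T.duplicated()
duplicates_before = mask_before[mask_before].index.tolist()

# Step 1: drop columns whose most frequent value covers >= 95% of rows.
tol = 0.95
shares = {c: toy[c].value_counts(normalize=True).iloc[0] for c in toy.columns}
constant_like = [c for c, share in shares.items() if share >= tol]
after_step_1 = toy.drop(columns=constant_like)

# Step 2: repeat the duplicate check on what is left.
mask_after = after_step_1.T.duplicated()
duplicates_left = mask_after[mask_after].index.tolist()

print(duplicates_before)
print(constant_like)
print(duplicates_left)
```

On the raw toy data the duplicate check flags `zeros_b`, but after the constant filter has run there is nothing left for it to flag.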