Feature-engine is a Python library for feature engineering and feature selection in machine learning pipelines. Its transformers follow the scikit-learn `fit`/`transform` API, so they can be combined with scikit-learn transformers and estimators inside a `Pipeline`. **Feature-engine handles both categorical and numerical features, with different transformers for each type.**


In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from feature_engine.selection import DropConstantFeatures, DropDuplicateFeatures
from sklearn.pipeline import Pipeline

In [None]:
from sklearn.datasets import fetch_kddcup99

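# Load KDD Cup 99; with as_frame=True the returned object exposes
# the features and the target together as one DataFrame in `.frame`
data = fetch_kddcup99(as_frame=True)
df = data.frame
df.head()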

In [None]:
X = df.drop('labels', axis=1)
y = df['labels']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

In [None]:
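# tol=1 drops only features that hold a single value in every row
drop_constant = DropConstantFeatures(
    tol=1,
    variables=None,  # None: examine every column in the dataframe
    missing_values='raise',  # raise an error if the data contain NaN
)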

drop_constant.fit(X_train)

In [None]:
X_train_clean = drop_constant.transform(X_train)
X_test_clean = drop_constant.transform(X_test)

### Using `tol` in Feature-engine for constant and quasi-constant features

- `tol=1`: drops constant features only (every value is the same).
- `tol < 1`: drops quasi-constant features, where a single value dominates the column.

For example, `tol=0.9` removes features in which one value accounts for 90% or more of the observations.
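
To make the threshold concrete, the cell below is a minimal pandas sketch of the idea behind `tol`, not Feature-engine's own implementation. The `toy` dataframe and the helpers `dominant_fraction` and `quasi_constant` are hypothetical names introduced for illustration, and the toy data contain no missing values.

```python
import pandas as pd


def dominant_fraction(column: pd.Series) -> float:
    """Share of rows taken by the most frequent value in the column."""
    # value_counts sorts by frequency (descending), so the first entry is the largest
    return column.value_counts(normalize=True).iloc[0]


def quasi_constant(frame: pd.DataFrame, tol: float) -> list:
    """Columns whose most frequent value fills at least `tol` of the rows."""
    return [name for name in frame.columns
            if dominant_fraction(frame[name]) >= tol]


# Hypothetical toy data: one constant, one quasi-constant, one varied column
toy = pd.DataFrame({
    "all_same": [1] * 10,
    "mostly_same": [0] * 9 + [1],
    "varied": list(range(10)),
})

fractions = {name: dominant_fraction(toy[name]) for name in toy.columns}

print(quasi_constant(toy, tol=1.0))  # only the constant column
print(quasi_constant(toy, tol=0.9))  # constant and quasi-constant columns
```

With `tol=1.0` only `all_same` is flagged; lowering the threshold to `0.9` also flags `mostly_same`, where one value fills 9 of the 10 rows.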


In [None]:
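# tol=0.95 also drops quasi-constant features:
# columns where one value fills 95% or more of the rows
drop_constant = DropConstantFeatures(
    tol=0.95,
    variables=None,
    missing_values='raise',
)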

drop_constant.fit(X_train)

X_train_clean = drop_constant.transform(X_train)
X_test_clean = drop_constant.transform(X_test)

In [None]:
drop_constant.features_to_drop_

In [None]:
# Flag features whose values are identical, row by row, to another feature
drop_duplicates = DropDuplicateFeatures(
    variables=None,
    missing_values='raise',
)

drop_duplicates.fit(X_train)

In [None]:
X_train_clean = drop_duplicates.transform(X_train)
X_test_clean = drop_duplicates.transform(X_test)

In [None]:
drop_duplicates.features_to_drop_
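
As a rough illustration of what counts as a duplicated feature, the sketch below finds columns with identical values using plain pandas. It is an assumption-laden stand-in for the idea, not the way `DropDuplicateFeatures` is implemented; the `toy` dataframe and variable names are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: "b" repeats the values of "a" exactly
toy = pd.DataFrame({
    "a": [1, 2, 3],
    "b": [1, 2, 3],
    "c": [3, 2, 1],
})

# Transposing turns columns into rows, so DataFrame.duplicated
# can mark every column that repeats an earlier one
is_duplicate = toy.T.duplicated(keep="first")

duplicate_columns = is_duplicate[is_duplicate].index.tolist()
deduplicated = toy.loc[:, ~is_duplicate.values]

print(duplicate_columns)           # columns that repeat an earlier column
print(list(deduplicated.columns))  # columns kept after removing the repeats
```

Transposing is convenient for small frames like this one, but it can be slow and memory-hungry on wide or mixed-type data.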

### Pipeline


In [None]:
pipeline = Pipeline([
    ('constant', DropConstantFeatures(tol=0.95)),
    ('duplicated', DropDuplicateFeatures()),
])

pipeline.fit(X_train)

In [None]:
X_train_clean = pipeline.transform(X_train)
X_test_clean = pipeline.transform(X_test)