Basic Methods:

- **Constant feature**: A constant feature has exactly the same value for all samples.

  - Action: Always remove — it carries no information.

- **Quasi-constant feature**: A quasi-constant feature has very little variability: one value dominates almost all observations, and the remaining values occur very rarely.

  - Action: If 95% or more of the values in a column are identical, the feature is considered quasi-constant and is removed.

- **Duplicate feature**: A duplicate feature is a column that is identical (or near-identical) to another column.

  - Action: Drop one of them — they provide redundant information.

**Note**: For feature selection, identify constant (or quasi-constant) features on the training set only, not the whole dataset, and then drop the same columns from the test set.
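
As a minimal sketch of the three checks above, the following uses a small made-up training frame. The column names and the 0.95 threshold are illustrative assumptions, not part of the datasets used later in this notebook.

```python
import pandas as pd

# Toy training frame; all column names are hypothetical.
X_train = pd.DataFrame({
    "const": [1] * 40,                 # same value in every row
    "quasi": [0] * 39 + [1],           # one value covers 97.5% of rows
    "useful": list(range(40)),         # varies freely
    "useful_copy": list(range(40)),    # exact duplicate of "useful"
})

# Constant: a single distinct value (NaN counted as a value).
constant = [c for c in X_train.columns if X_train[c].nunique(dropna=False) == 1]

# Quasi-constant: share of the most frequent value >= threshold, but not constant.
threshold = 0.95
quasi_constant = [
    c for c in X_train.columns
    if c not in constant
    and X_train[c].value_counts(normalize=True, dropna=False).iloc[0] >= threshold
]

# Duplicates: columns identical to an earlier column.
duplicates = X_train.columns[X_train.T.duplicated()].to_list()

print(constant, quasi_constant, duplicates)
```

Each list is computed on the training frame only; the same columns would then be dropped from the test frame.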

---


### How to remove constant features

| Method                              | Variable type           | Criterion                                                                                   | Drops automatically? | Notes                                                                                                                          |
| ----------------------------------- | ----------------------- | ------------------------------------------------------------------------------------------- | -------------------- | ------------------------------------------------------------------------------------------------------------------------------ |
| `VarianceThreshold` (scikit-learn)  | Numerical               | Variance ≤ threshold (default 0 = constant)                                                 | ✅ Yes               | Only numerical; can set threshold > 0 for quasi-constant; returns a NumPy array by default, so use `get_support()` to see which columns were kept |
| `df.std()` (pandas)                 | Numerical               | std = 0 or < threshold                                                                      | ❌ No                | Manual filtering needed; only numerical                                                                                        |
| `df.nunique()` (pandas)             | Numerical / Categorical | Constant: nunique = 1; Quasi-constant: max_freq ≥ τ (from `value_counts(normalize=True)`)   | ❌ No                | Works for categorical; manual drop; threshold τ can be 0.90–0.95 for quasi-constant                                            |
| `df.nunique(dropna=False)` (pandas) | Numerical / Categorical | Same as above, but NaN counts as a distinct value                                           | ❌ No                | Use when missing values should be treated as their own category                                                                |


### Handling Constant and Quasi-Constant Features in Pipelines

When preprocessing data, it is important to handle **constant and quasi-constant features**: constant features carry no information and quasi-constant features carry very little, so removing them reduces dimensionality at small cost. Numerical and categorical features should be processed **separately**.

#### Why not use ordinal encoding for categorical features?

One might think to convert categorical features into numbers using an **ordinal encoder** and then apply numerical methods like `VarianceThreshold`. However:

- The numeric values assigned to categories are **arbitrary** and do **not represent true quantities**.
- The variance of an ordinal-encoded column depends on which integer each category happens to receive, so using it for feature selection can be **misleading**.

To avoid this, it is better to **handle categorical features directly** using **frequency-based checks**, such as `df.nunique()`.
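
To illustrate why the variance of ordinal codes is unreliable, the sketch below encodes the same hypothetical categorical column with two equally valid integer mappings. The column values and both mappings are made up for this example.

```python
import numpy as np

# Hypothetical categorical column: 8 x "a", 1 x "b", 1 x "c".
values = ["a"] * 8 + ["b", "c"]

# Two equally valid ordinal encodings of the same categories.
encoding_1 = {"a": 0, "b": 1, "c": 2}
encoding_2 = {"a": 1, "b": 0, "c": 2}

var_1 = np.var([encoding_1[v] for v in values])  # population variance
var_2 = np.var([encoding_2[v] for v in values])

# The data is unchanged, yet the two variances differ.
print(var_1, var_2)
```

A fixed variance threshold could therefore keep or drop the same column depending only on the encoding order, whereas the share of the most frequent category is identical in both cases.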

---

#### Pipeline Approach

- **Categorical variables:**  
  Remove features with **nunique = 1** (constant) or a **dominant category ≥ threshold** (quasi-constant, e.g., 0.95).

- **Numerical variables:**  
  Use **`VarianceThreshold`** to remove features with zero or very low variance.

- **Pipeline benefits:**
  - Automates preprocessing
  - Handles categorical and numerical features separately
  - Ensures **reproducibility** for training and future datasets

---

#### Python Pipeline Example

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

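# Function to drop constant/quasi-constant categorical features.
# Note: FunctionTransformer is stateless, so this check is re-run on whatever
# frame is passed to transform(); train and test may end up with different columns.
def drop_quasi_constant_categorical(df, threshold=0.95):
    # Share of the most frequent value per column (NaN counted as a value).
    # A constant column has a share of 1.0, so one test covers both cases.
    to_drop = [col for col in df.columns
               if df[col].value_counts(normalize=True, dropna=False).iloc[0] >= threshold]
    return df.drop(columns=to_drop)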

# Transformers
categorical_transformer = FunctionTransformer(drop_quasi_constant_categorical)
numerical_transformer = VarianceThreshold(threshold=0.0)  # threshold>0 for quasi-constant

# Columns (replace with your dataset's columns)
categorical_cols = ['cat_col1', 'cat_col2']
numerical_cols = ['num_col1', 'num_col2']

# Column transformer
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', categorical_transformer, categorical_cols),
        ('num', numerical_transformer, numerical_cols)
    ]
)

# Full pipeline
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor)
])
```
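
Because `FunctionTransformer` simply calls the wrapped function on every frame it receives, the columns dropped at `transform` time are decided from that frame rather than from the training data. One possible way around this is a small transformer that learns the columns to drop in `fit` and reuses them in `transform`. The sketch below is an assumption-laden illustration: `QuasiConstantDropper` is a hypothetical name, and the toy frames are made up.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class QuasiConstantDropper(BaseEstimator, TransformerMixin):
    """Hypothetical helper: learn which columns to drop in fit, reuse in transform."""

    def __init__(self, threshold=0.95):
        self.threshold = threshold

    def fit(self, X, y=None):
        X = pd.DataFrame(X)
        # Share of the most frequent value per column (NaN counted as a value).
        self.to_drop_ = [
            col for col in X.columns
            if X[col].value_counts(normalize=True, dropna=False).iloc[0]
            >= self.threshold
        ]
        return self

    def transform(self, X):
        # Always drop the columns learned on the training data.
        return pd.DataFrame(X).drop(columns=self.to_drop_)


train = pd.DataFrame({"cat_col1": ["a"] * 99 + ["b"], "cat_col2": ["x", "y"] * 50})
test = pd.DataFrame({"cat_col1": ["a", "b"], "cat_col2": ["x", "x"]})

dropper = QuasiConstantDropper(threshold=0.95).fit(train)
print(dropper.to_drop_)
print(dropper.transform(test).columns.to_list())
```

In this toy case `cat_col2` happens to be constant in the test frame but is still kept, because the decision was made on the training frame only.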


---

### Handling NaN Values in Feature Selection

When performing feature selection, handling **NaN values** is important because selection methods differ in how they treat missing data: some ignore NaNs, some count them as a value, and some may raise an error.

#### Behavior of common methods with NaNs

- **VarianceThreshold (scikit-learn)**

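  - Does **not impute NaNs**. Recent scikit-learn versions accept NaNs in the input and ignore them when computing each column's variance; older versions raise an error instead.
  - `transform` only selects columns, so any NaNs in the kept columns are still present afterwards.
  - **Implication:** Check the behavior of your installed version, and handle NaNs before applying VarianceThreshold if the variance should reflect the values the model will actually see.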

- **Pandas methods (`df.nunique()`, `value_counts()`)**
  - `nunique(dropna=True)` **ignores NaNs**.
  - `nunique(dropna=False)` **counts NaN as a unique value**, useful if missing values are meaningful.
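
A short sketch of the difference, using a made-up column with one observed value and some missing entries:

```python
import numpy as np
import pandas as pd

# Hypothetical column: a single observed value plus missing entries.
s = pd.Series(["yes", "yes", np.nan, "yes", np.nan])

n_ignoring_nan = s.nunique()               # NaN ignored: the column looks constant
n_counting_nan = s.nunique(dropna=False)   # NaN counted as its own value

print(n_ignoring_nan, n_counting_nan)
print(s.value_counts(normalize=True, dropna=False))
```

Whether such a column should count as constant depends on whether missingness is itself informative for the problem at hand.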

#### Best practices for handling NaNs

1. **Imputation before feature selection**

   - Replace NaNs with a value appropriate for the feature type:
     - **Numerical:** mean, median, or a constant within the range of the data.
     - **Categorical:** mode or a special category (e.g., `"Missing"`).
   - Using zero for NaNs is **not always recommended**, especially if it is **outside the natural range** of the feature, because it can **artificially inflate variance or mislead selection**.
   - The choice of imputation should **preserve the scale and distribution** of the original feature.

2. **Dropping features with excessive NaNs**

   - If a feature has too many missing values, consider dropping it **before variance or frequency-based selection**.

3. **Pipeline integration**
   - Include imputation as a preprocessing step **before feature selection** in pipelines to ensure **reproducibility and safety**.
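
A minimal sketch of the pipeline integration for numerical columns, assuming median imputation suits the data. The frame and column names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

# Hypothetical numerical training data with missing values.
X_train = pd.DataFrame({
    "varies": [1.0, 2.0, np.nan, 4.0, 5.0],
    "constant_with_nan": [3.0, 3.0, np.nan, 3.0, 3.0],
})

numeric_selection = Pipeline(steps=[
    ("impute", SimpleImputer(strategy="median")),    # fill NaNs first
    ("variance", VarianceThreshold(threshold=0.0)),  # then drop constant columns
])

X_clean = numeric_selection.fit_transform(X_train)

# Map the boolean support mask back to the original column names.
kept = X_train.columns[numeric_selection.named_steps["variance"].get_support()]
print(kept.to_list())
```

Because both steps are fitted together, the imputation values and the list of kept columns learned on the training set are reused unchanged when the pipeline transforms new data.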

---


In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import VarianceThreshold

In [2]:
df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data', header=None)
df.columns = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'class']
df

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,class
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,Iris-virginica
146,6.3,2.5,5.0,1.9,Iris-virginica
147,6.5,3.0,5.2,2.0,Iris-virginica
148,6.2,3.4,5.4,2.3,Iris-virginica


In [3]:
# For feature selection, you should remove constant (or quasi-constant) features
# based on the training set only, not the whole dataset.

X = df.drop('class', axis=1)
y = df['class']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

#### Sklearn


In [4]:
selector = VarianceThreshold(threshold=1e-6)  # uses population variance (ddof=0), not sample variance
selector.fit(X_train)

In [5]:
selector.get_support() # True for features that are kept

array([ True,  True,  True,  True])

In [6]:
constant_features = X_train.columns[~selector.get_support()]
constant_features

Index([], dtype='object')

In [7]:
# Drop the flagged columns from the training and test sets.
# Note: transform() returns NumPy arrays by default, so column names are lost.
X_train = selector.transform(X_train)
X_test = selector.transform(X_test)
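
If the column names are needed downstream, one option is to index the DataFrames with the fitted support mask instead of calling `transform`. The sketch below uses small hypothetical frames in place of the iris split:

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Hypothetical frames standing in for X_train / X_test.
X_train = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [7.0, 7.0, 7.0]})
X_test = pd.DataFrame({"a": [4.0, 5.0], "b": [7.0, 8.0]})

selector = VarianceThreshold(threshold=0.0).fit(X_train)
kept_columns = X_train.columns[selector.get_support()]

# Select by name so both results stay DataFrames with their original labels.
X_train_sel = X_train[kept_columns]
X_test_sel = X_test[kept_columns]

print(X_train_sel.columns.to_list())
```
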

#### Manual code


In [8]:
X = df.drop('class', axis=1)
y = df['class']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

In [9]:
constant_features = [
    feature for feature in X_train.columns
    if X_train[feature].var(ddof=0) <= 1e-6  # ddof=0 matches VarianceThreshold; pandas defaults to ddof=1
]

constant_features

[]

In [10]:
X_train = X_train.drop(columns=constant_features)
X_test = X_test.drop(columns=constant_features)

#### Population variance or sample variance?

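Either population or sample variance works for feature selection: the two differ only by a factor of n / (n - 1), which is negligible unless the sample is very small, and you are comparing **relative variances across features** against a threshold, not estimating parameters for inference.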

- `VarianceThreshold`: computes **population variance** (`ddof=0`)
- `df.var()`: computes **sample variance** by default (`ddof=1`)

For consistency, you can explicitly set the delta degrees of freedom (`ddof`) in pandas:

```python
X_train[feature].var(ddof=0)  # population variance
# or
X_train[feature].var(ddof=1)  # sample variance
```
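
As a quick numerical check of how the two definitions relate, the sketch below uses a made-up feature with five observations:

```python
import numpy as np
import pandas as pd

# Hypothetical feature with n = 5 observations.
x = pd.Series([2.0, 4.0, 4.0, 5.0, 10.0])
n = len(x)

population_var = x.var(ddof=0)  # divides by n
sample_var = x.var(ddof=1)      # divides by n - 1

# The two differ only by the factor n / (n - 1).
print(np.isclose(sample_var, population_var * n / (n - 1)))
```

With thousands of rows the factor is very close to 1, so the choice rarely changes which features pass a threshold.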


In [11]:
X_train.var()

sepal_length    0.694396
sepal_width     0.172189
petal_length    2.958407
petal_width     0.553727
dtype: float64

### Handling Categorical Features

Categorical features require **special handling** because most feature selection methods (like `VarianceThreshold`) work on numerical data.

#### Strategy

**Identify categorical features**

- They may be strings or numbers representing categories.
- If numerical features represent categories, convert them to **object type** to treat them as categorical:

```python
# Example: convert numerical columns to categorical
categorical_cols = ['num_as_cat1', 'num_as_cat2']
X_train[categorical_cols] = X_train[categorical_cols].astype('object')
```


In [12]:
from sklearn.datasets import fetch_kddcup99

df = fetch_kddcup99(as_frame=True)
df = df.frame
df.head()

Unnamed: 0,duration,protocol_type,service,flag,src_bytes,dst_bytes,land,wrong_fragment,urgent,hot,...,dst_host_srv_count,dst_host_same_srv_rate,dst_host_diff_srv_rate,dst_host_same_src_port_rate,dst_host_srv_diff_host_rate,dst_host_serror_rate,dst_host_srv_serror_rate,dst_host_rerror_rate,dst_host_srv_rerror_rate,labels
0,0,b'tcp',b'http',b'SF',181,5450,0,0,0,0,...,9,1.0,0.0,0.11,0.0,0.0,0.0,0.0,0.0,b'normal.'
1,0,b'tcp',b'http',b'SF',239,486,0,0,0,0,...,19,1.0,0.0,0.05,0.0,0.0,0.0,0.0,0.0,b'normal.'
2,0,b'tcp',b'http',b'SF',235,1337,0,0,0,0,...,29,1.0,0.0,0.03,0.0,0.0,0.0,0.0,0.0,b'normal.'
3,0,b'tcp',b'http',b'SF',219,1337,0,0,0,0,...,39,1.0,0.0,0.03,0.0,0.0,0.0,0.0,0.0,b'normal.'
4,0,b'tcp',b'http',b'SF',217,2032,0,0,0,0,...,49,1.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0,b'normal.'


In [13]:
X = df.drop('labels', axis=1)
y = df['labels']

In [14]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

In [15]:
cat_columns = X_train.select_dtypes(include=['object']).columns

In [16]:
constant_features = [
    feature for feature in cat_columns if X_train[feature].nunique() == 1
]

constant_features

['num_outbound_cmds', 'is_host_login']

### Duplicate columns


In [17]:
duplicate_features = X_train.columns[X_train.T.duplicated()].to_list()
duplicate_features

['is_host_login']

### Pipeline


In [31]:
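# df = df.loc[:, ~df.T.duplicated()]  # exact, but computationally expensive

import hashlib

def drop_global_duplicates_fast(df: pd.DataFrame) -> pd.DataFrame:
    """
    Drop duplicate columns by comparing one digest per column.
    Works for both numeric and categorical columns.
    """
    # Hash each value, then digest the ordered hashes. Unlike summing the
    # per-value hashes, this distinguishes columns that hold the same values
    # in a different row order.
    digests = df.apply(
        lambda col: hashlib.sha256(
            pd.util.hash_pandas_object(col, index=False).to_numpy().tobytes()
        ).hexdigest(),
        axis=0
    )
    return df.loc[:, ~digests.duplicated()]


df = drop_global_duplicates_fast(df)

# Rebuild X and y so the split below uses the de-duplicated frame
X = df.drop('labels', axis=1)
y = df['labels']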

In [32]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [33]:
import pandas as pd
from sklearn.feature_selection import VarianceThreshold
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer


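def drop_quasi_constant_categorical(
    df: pd.DataFrame,
    threshold: float = 0.95
) -> pd.DataFrame:
    """Drop columns whose most frequent value (NaN included) covers >= threshold of rows."""
    # Note: FunctionTransformer is stateless, so this check is re-run on every
    # frame passed to transform(); train and test may keep different columns.
    to_drop = []
    for col in df.columns:
        value_ratio = df[col].value_counts(
            normalize=True,
            dropna=False
        )
        if len(value_ratio) == 1 or value_ratio.iloc[0] >= threshold:
            to_drop.append(col)

    return df.drop(columns=to_drop)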


def build_feature_selection_pipeline(
    cat_threshold: float = 0.95,
    num_variance: float = 0.05
) -> Pipeline:
    """
    Feature-selection pipeline (global duplicate removal is done beforehand,
    outside this pipeline):
    - Categorical: constant & quasi-constant
    - Numerical: low-variance filtering
    """

    categorical_transformer = FunctionTransformer(
        drop_quasi_constant_categorical,
        kw_args={'threshold': cat_threshold},
        validate=False
    )

    numerical_transformer = VarianceThreshold(
        threshold=num_variance
    )

    preprocessor = ColumnTransformer(
        transformers=[
            ('cat', categorical_transformer,
            make_column_selector(dtype_include=['object', 'category'])),

            ('num', numerical_transformer,
            make_column_selector(dtype_include=['number']))
        ],
        remainder='drop'
    )

    pipeline = Pipeline(steps=[
        ('preprocessor', preprocessor)
    ])

    return pipeline


In [34]:
PIPELINE = build_feature_selection_pipeline(
    cat_threshold=0.95,
    num_variance=0.05
)

PIPELINE.fit(X_train)

X_train_clean = PIPELINE.transform(X_train)
X_test_clean = PIPELINE.transform(X_test)

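`make_column_selector` returns a callable; when `ColumnTransformer` calls it on the input DataFrame, it returns the list of column names whose dtype matches.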
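
A small sketch of how the selector behaves when called directly, using a hypothetical mixed-type frame:

```python
import pandas as pd
from sklearn.compose import make_column_selector

# Hypothetical mixed-type frame.
df_demo = pd.DataFrame({
    "city": pd.Series(["Paris", "Rome"], dtype="object"),
    "age": [31, 45],
    "income": [52000.0, 61000.0],
})

select_categorical = make_column_selector(dtype_include=["object", "category"])
select_numerical = make_column_selector(dtype_include=["number"])

print(callable(select_categorical))   # the selector itself is a callable
print(select_categorical(df_demo))    # calling it yields the matching column names
print(select_numerical(df_demo))
```

Since the selection happens at fit time, the pipeline does not need hard-coded column lists and adapts to whichever columns the training frame contains.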
