# 04 - High cardinality management: Target encoding

To handle high-cardinality categorical features in a dataset, one effective technique is target encoding, which replaces each category with a statistic of the target variable computed for that category (most commonly the mean).

*High-cardinality categorical features are categorical variables that contain a large number of unique categories or levels.*
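
To make the idea concrete, here is a minimal sketch of mean target encoding done by hand with pandas. The toy frame, the `city` column and its values are made up for illustration and are not part of the bank dataset. Library encoders usually also smooth each category mean toward the global mean, which this sketch leaves out.

```python
import pandas as pd

# Toy data (hypothetical): one categorical feature and a binary target
toy = pd.DataFrame({
    "city": ["A", "A", "B", "B", "B", "C"],
    "deposit": [1, 0, 1, 1, 0, 1],
})

# Mean of the target for each category
category_means = toy.groupby("city")["deposit"].mean()

# Replace each category with the mean target value of that category
toy["city_encoded"] = toy["city"].map(category_means)

print(category_means)
print(toy)
```

In practice the category means should be computed on the training split only and then mapped onto the test split, otherwise the target leaks into the features.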

In [4]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from category_encoders import TargetEncoder

# pip install category_encoders


In [5]:
# Load dataset
df = pd.read_csv("bank_numeric.csv")

# Define features and target
target_column = "deposit"
X = df.drop(columns=[target_column])
y = df[target_column]

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)


In [6]:
# Identify categorical (object-dtype) columns as candidates for target encoding.
# This selects by dtype only; it does not check how many unique values each column has.
categorical_columns = X.select_dtypes(include=['object']).columns.tolist()
print("\nCategorical Columns for Target Encoding:", categorical_columns)


Categorical Columns for Target Encoding: []
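
The empty list means that no column in this dataset has the object dtype. When text columns are present, one simple way to judge their cardinality is to count the distinct values per column. The sketch below uses a made-up frame; the column names and the threshold of 3 are assumptions chosen only for illustration.

```python
import pandas as pd

# Toy data (hypothetical): two text columns and one numeric column
toy = pd.DataFrame({
    "job": ["admin", "technician", "admin", "services"],
    "zip_code": ["10001", "94105", "60601", "73301"],
    "balance": [1200, 300, 450, 80],
})

# Columns treated as categorical in this example (named explicitly)
candidate_columns = ["job", "zip_code"]

# Number of distinct values per candidate column
cardinality = toy[candidate_columns].nunique()

# Flag columns whose distinct-value count exceeds an illustrative threshold
threshold = 3
high_cardinality = cardinality[cardinality > threshold].index.tolist()

print(cardinality)
print("High-cardinality columns:", high_cardinality)
```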


In [7]:
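# Apply target encoding to the categorical columns.
# The encoder is fitted on the training split only, so test-set targets do not leak into the encoding.
encoder = TargetEncoder(cols=list(categorical_columns))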
X_train_encoded = encoder.fit_transform(X_train, y_train)
X_test_encoded = encoder.transform(X_test)

print("\nDataset Shape After Target Encoding:", X_train_encoded.shape)



Dataset Shape After Target Encoding: (3679, 16)


In [8]:
# Train Logistic Regression Model
log_reg = LogisticRegression(max_iter=1000, random_state=42)
log_reg.fit(X_train_encoded, y_train)

# Predict and evaluate metrics
y_pred = log_reg.predict(X_test_encoded)
print("\nMetrics After Target Encoding:")
print(classification_report(y_test, y_pred))
acc = accuracy_score(y_test, y_pred)
print("\nModel overall accuracy: {:.2f}%".format(acc * 100))


Metrics After Target Encoding:
              precision    recall  f1-score   support

           0       0.80      0.89      0.84       915
           1       0.82      0.68      0.75       663

    accuracy                           0.81      1578
   macro avg       0.81      0.79      0.79      1578
weighted avg       0.81      0.81      0.80      1578


Model overall accuracy: 80.54%


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [None]:
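# Insights:

# No change in metrics.
# Most likely because the dataset is already fully numeric: the list of
# object-dtype columns printed above is empty, so TargetEncoder had nothing
# to encode and the features passed through unchanged.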