# **Dataset Selection & Preparation**

# **Dataset Description**

I've selected the "**Credit Card Customers**" dataset from Kaggle, which contains information about 10,127 credit card customers and whether they churned (closed their account). This is a classification problem where we predict customer churn (binary target variable: Attrition_Flag).

Dataset Source: Kaggle Credit Card Customers Dataset

**Prediction Task:** Binary Classification (predicting one of two possible outcomes).

**Features**:

**Numerical**: Customer_Age, Dependent_count, Months_on_book, Total_Relationship_Count, etc.

**Categorical:** Gender, Education_Level, Marital_Status, Income_Category, Card_Category

**Target Variable**: Attrition_Flag (1 if customer attrited, 0 if existing customer)

In [2]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score, classification_report
from scipy.stats import randint, uniform
import warnings
warnings.filterwarnings('ignore')

# Load the dataset
df = pd.read_csv('BankChurners.csv')

# Clean the dataset - remove columns we won't use and clean target variable
df = df.drop(['CLIENTNUM', 'Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_1',
             'Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_2'], axis=1)

# Convert target to binary
df['Attrition_Flag'] = df['Attrition_Flag'].apply(lambda x: 1 if x == 'Attrited Customer' else 0)

# Display dataset info
print(f"Dataset shape: {df.shape}")
print("\nFirst 5 rows:")
display(df.head())
print("\nData types:")
print(df.dtypes)
print("\nMissing values:")
print(df.isna().sum())

Dataset shape: (10127, 20)

First 5 rows:


Unnamed: 0,Attrition_Flag,Customer_Age,Gender,Dependent_count,Education_Level,Marital_Status,Income_Category,Card_Category,Months_on_book,Total_Relationship_Count,Months_Inactive_12_mon,Contacts_Count_12_mon,Credit_Limit,Total_Revolving_Bal,Avg_Open_To_Buy,Total_Amt_Chng_Q4_Q1,Total_Trans_Amt,Total_Trans_Ct,Total_Ct_Chng_Q4_Q1,Avg_Utilization_Ratio
0,0,45,M,3,High School,Married,$60K - $80K,Blue,39,5,1,3,12691.0,777,11914.0,1.335,1144,42,1.625,0.061
1,0,49,F,5,Graduate,Single,Less than $40K,Blue,44,6,1,2,8256.0,864,7392.0,1.541,1291,33,3.714,0.105
2,0,51,M,3,Graduate,Married,$80K - $120K,Blue,36,4,1,0,3418.0,0,3418.0,2.594,1887,20,2.333,0.0
3,0,40,F,4,High School,Unknown,Less than $40K,Blue,34,3,4,1,3313.0,2517,796.0,1.405,1171,20,2.333,0.76
4,0,40,M,3,Uneducated,Married,$60K - $80K,Blue,21,5,1,0,4716.0,0,4716.0,2.175,816,28,2.5,0.0



Data types:
Attrition_Flag                int64
Customer_Age                  int64
Gender                       object
Dependent_count               int64
Education_Level              object
Marital_Status               object
Income_Category              object
Card_Category                object
Months_on_book                int64
Total_Relationship_Count      int64
Months_Inactive_12_mon        int64
Contacts_Count_12_mon         int64
Credit_Limit                float64
Total_Revolving_Bal           int64
Avg_Open_To_Buy             float64
Total_Amt_Chng_Q4_Q1        float64
Total_Trans_Amt               int64
Total_Trans_Ct                int64
Total_Ct_Chng_Q4_Q1         float64
Avg_Utilization_Ratio       float64
dtype: object

Missing values:
Attrition_Flag              0
Customer_Age                0
Gender                      0
Dependent_count             0
Education_Level             0
Marital_Status              0
Income_Category             0
Card_Category             

# **Train-Test Split**
We'll split the data into 75% training and 25% testing sets before any preprocessing or model training.

In [3]:
# Split into features and target
X = df.drop('Attrition_Flag', axis=1)
y = df['Attrition_Flag']

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42, stratify=y)

print(f"Training set size: {X_train.shape[0]}")
print(f"Test set size: {X_test.shape[0]}")
print("\nClass distribution in training set:")
print(y_train.value_counts(normalize=True))
print("\nClass distribution in test set:")
print(y_test.value_counts(normalize=True))

Training set size: 7595
Test set size: 2532

Class distribution in training set:
Attrition_Flag
0    0.839368
1    0.160632
Name: proportion, dtype: float64

Class distribution in test set:
Attrition_Flag
0    0.839258
1    0.160742
Name: proportion, dtype: float64
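
As a quick sanity check on the stratified split, the proportions printed above can be reproduced from raw class counts. The sketch below is illustrative only: the test-set counts (2125 and 407) are the support values from the classification report later in this notebook, while the training-set counts are an assumption, back-calculated from the printed proportions rather than read from the data.

```python
# Class counts per split.
# Test counts: "support" column of the classification report.
# Train counts: ASSUMED, back-calculated from the printed proportions.
train_counts = {0: 6375, 1: 1220}
test_counts = {0: 2125, 1: 407}


def proportions(counts):
    """Return each class's share of the split, rounded to 6 decimals."""
    total = sum(counts.values())
    return {label: round(n / total, 6) for label, n in counts.items()}


train_prop = proportions(train_counts)
test_prop = proportions(test_counts)

print("train:", train_prop)
print("test: ", test_prop)

# Stratification keeps the churn rate nearly identical in both splits.
churn_rate_gap = abs(train_prop[1] - test_prop[1])
print("difference in churn rate:", churn_rate_gap)
```

The two churn rates differ only in the fourth decimal place, which is what `stratify=y` is meant to guarantee.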


# **2. Pipeline Construction**

Our pipeline will include:

**Preprocessing:**

Numerical features: Standard scaling after mean imputation

Categorical features: One-hot encoding after constant imputation

**RandomForestClassifier:** Our main model for prediction

In [4]:
# Identify numerical and categorical columns
numerical_cols = X_train.select_dtypes(include=['int64', 'float64']).columns
categorical_cols = X_train.select_dtypes(include=['object']).columns

print("Numerical columns:", numerical_cols.tolist())
print("Categorical columns:", categorical_cols.tolist())

# Preprocessing for numerical data
numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())])

# Preprocessing for categorical data
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

# Bundle preprocessing for numerical and categorical data
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)])

# Create the full pipeline
pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                          ('classifier', RandomForestClassifier(random_state=42))])



Numerical columns: ['Customer_Age', 'Dependent_count', 'Months_on_book', 'Total_Relationship_Count', 'Months_Inactive_12_mon', 'Contacts_Count_12_mon', 'Credit_Limit', 'Total_Revolving_Bal', 'Avg_Open_To_Buy', 'Total_Amt_Chng_Q4_Q1', 'Total_Trans_Amt', 'Total_Trans_Ct', 'Total_Ct_Chng_Q4_Q1', 'Avg_Utilization_Ratio']
Categorical columns: ['Gender', 'Education_Level', 'Marital_Status', 'Income_Category', 'Card_Category']


The dataset contains both numerical and categorical features, so different preprocessing strategies are required for each type:

**Numerical Features:**

**SimpleImputer(strategy='mean')**: Missing numerical values are imputed using the mean of each column. This is a simple and effective method when data is missing at random and doesn't introduce extreme bias.

**StandardScaler():**
Although Random Forests do not require feature scaling, it is still useful when the pipeline might be reused for models that are sensitive to feature scales (e.g., Logistic Regression, SVMs). Tree splits depend only on the ordering of feature values, so scaling should not materially change the Random Forest's predictions or feature importances.
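
To make the two numerical steps concrete, here is a minimal pure-Python sketch of what mean imputation followed by standard scaling computes. It is not the scikit-learn implementation; the toy values and the helper names `impute` and `standardize` are hypothetical. The point it illustrates is that both the fill value and the scaling statistics are learned from the training data and then reused on the test data.

```python
from statistics import fmean, pstdev

# Toy column with missing entries (None). Values are illustrative only.
train_age = [45, 49, None, 40, 40]
test_age = [51, None]

# "Fit" step: learn statistics from the training data only.
observed = [v for v in train_age if v is not None]
fill_value = fmean(observed)  # the mean that a mean-imputer would learn


def impute(values, fill):
    """Replace missing entries with the learned fill value."""
    return [fill if v is None else v for v in values]


train_filled = impute(train_age, fill_value)
mu = fmean(train_filled)
sigma = pstdev(train_filled)  # population std (ddof=0), as StandardScaler uses


def standardize(values, mean, std):
    """Apply z = (x - mean) / std with the training statistics."""
    return [(v - mean) / std for v in values]


# "Transform" step: the same statistics are applied to both splits.
train_scaled = standardize(train_filled, mu, sigma)
test_scaled = standardize(impute(test_age, fill_value), mu, sigma)

print("fill value:", fill_value)
print("train scaled:", [round(z, 3) for z in train_scaled])
print("test scaled: ", [round(z, 3) for z in test_scaled])
```

Inside the pipeline this happens automatically: during cross-validation the statistics are recomputed on each training fold, so no information from the validation fold leaks into preprocessing.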

**Categorical Features:**

**SimpleImputer(strategy='constant', fill_value='missing'):**

Missing categorical values are filled with a placeholder string ('missing'), which allows the model to treat missing values as a separate category.

**OneHotEncoder(handle_unknown='ignore'):**
One-hot encoding is used to convert categorical variables into numerical format. The handle_unknown='ignore' setting ensures the pipeline doesn't break when it encounters new categories in the test set.
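
The effect of `handle_unknown='ignore'` can be seen on a toy column. In this sketch the encoder is fitted on three card categories and then asked to transform one it has never seen; the category names other than `Blue` are used purely for illustration and are not taken from the output above.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Categories seen during fitting (toy subset, for illustration only).
train_cards = np.array([["Blue"], ["Silver"], ["Gold"]], dtype=object)

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train_cards)

# A known category gets a single 1 in its column.
known = encoder.transform(np.array([["Gold"]], dtype=object)).toarray()

# A category that was not seen during fit is encoded as all zeros
# instead of raising an error.
unknown = encoder.transform(np.array([["Platinum"]], dtype=object)).toarray()

print("learned categories:", list(encoder.categories_[0]))
print("known   ->", known)
print("unknown ->", unknown)
```

An all-zero row means the model receives no signal for that feature, which is usually an acceptable fallback for rare categories that first appear at prediction time.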

**ColumnTransformer:**

The use of ColumnTransformer allows us to apply the appropriate preprocessing to each column type in a clean and modular way.

**RandomForestClassifier:**

The final estimator is a RandomForestClassifier, which is robust to outliers, handles both numerical and categorical data (once encoded), and performs well even with minimal hyperparameter tuning.

# **3. Hyperparameter Tuning Setup**
We'll use RandomizedSearchCV to tune the RandomForest hyperparameters.

In [5]:
# Parameter distributions for RandomizedSearchCV
param_dist = {
    'classifier__n_estimators': randint(50, 500),
    'classifier__max_depth': [None] + list(np.arange(5, 50, 5)),
    'classifier__min_samples_split': randint(2, 20),
    'classifier__min_samples_leaf': randint(1, 20),
    'classifier__max_features': ['sqrt', 'log2', None],
    'classifier__bootstrap': [True, False],
    'classifier__class_weight': [None, 'balanced']
}

# Create RandomizedSearchCV object
random_search = RandomizedSearchCV(
    pipeline,
    param_distributions=param_dist,
    n_iter=50,
    cv=5,
    scoring='roc_auc',
    random_state=42,
    n_jobs=-1,
    verbose=1
)


# **Hyperparameter Tuning:**
To optimize the performance of the **RandomForestClassifier** within the pipeline, **RandomizedSearchCV** was used. This method samples random combinations of hyperparameters from defined distributions, which is much more computationally efficient than an exhaustive grid search, especially when tuning models with many hyperparameters such as Random Forests.
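
To put the efficiency argument in numbers, the sketch below counts how many combinations an exhaustive grid over the same value ranges as `param_dist` would contain. It relies on the fact that `scipy.stats.randint(low, high)` samples integers from `low` up to but not including `high`; the dictionary `n_values` is a hypothetical helper introduced only for this calculation.

```python
import math

# Number of distinct values each hyperparameter can take in param_dist.
# randint(low, high) excludes the upper bound, hence range(low, high).
n_values = {
    "n_estimators": len(range(50, 500)),      # 50 .. 499
    "max_depth": 1 + len(range(5, 50, 5)),    # None plus 5, 10, ..., 45
    "min_samples_split": len(range(2, 20)),   # 2 .. 19
    "min_samples_leaf": len(range(1, 20)),    # 1 .. 19
    "max_features": 3,                        # 'sqrt', 'log2', None
    "bootstrap": 2,                           # True, False
    "class_weight": 2,                        # None, 'balanced'
}

cv_folds = 5
n_iter = 50

grid_size = math.prod(n_values.values())
grid_fits = grid_size * cv_folds      # exhaustive search
random_fits = n_iter * cv_folds       # randomized search as configured

print(f"combinations in full grid: {grid_size:,}")
print(f"fits for exhaustive search: {grid_fits:,}")
print(f"fits for randomized search: {random_fits:,}")
```

The randomized search therefore evaluates only a tiny fraction of the space, which is why it finishes in 250 fits while a full grid over these ranges would be impractical.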


**Parameter Distributions**

Key hyperparameters of the Random Forest model were chosen for tuning:

**n_estimators** controls the number of trees in the forest. A wide range (between 50 and 500) was selected to balance model accuracy and training time.

**max_depth** limits how deep each tree can grow. Shallower trees help reduce overfitting, while deeper ones capture more complexity. Including None allows unlimited depth when needed.

**min_samples_split** and **min_samples_leaf** determine how the trees grow. These help control overfitting by forcing splits and leaf nodes to have a minimum number of samples.

**max_features** determines how many features to consider at each split, adding randomness and diversity to the forest.

**bootstrap** controls whether the model uses bootstrap samples when building trees, which affects the variance and bias trade-off.

**class_weight** is tuned to handle class imbalance. Setting it to 'balanced' helps improve the model's sensitivity to underrepresented classes.
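
As a worked example of what `class_weight='balanced'` does, the sketch below applies the heuristic `n_samples / (n_classes * class_count)` to the training split. The class counts are an assumption, back-calculated from the class proportions printed for the training set, and `balanced_class_weights` is a hypothetical helper rather than a scikit-learn function.

```python
def balanced_class_weights(class_counts):
    """Weight each class by n_samples / (n_classes * count)."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in class_counts.items()
    }


# ASSUMED training counts (about 83.9% existing, 16.1% attrited of 7595 rows).
train_counts = {0: 6375, 1: 1220}

weights = balanced_class_weights(train_counts)

# After weighting, both classes carry the same total weight.
total_weight = {label: weights[label] * n for label, n in train_counts.items()}

print("weights:", {k: round(v, 4) for k, v in weights.items()})
print("total weight per class:", total_weight)
```

Each attrited customer ends up counting roughly five times as much as an existing customer, so the two classes contribute equally to the splitting criterion. During cross-validation the weights are recomputed on each training fold, so the exact values differ slightly from those shown here.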

**RandomizedSearchCV Settings**

**n_iter=50** was chosen to strike a balance between exploration of the parameter space and computational efficiency.

**cv=5 (5-fold cross-validation)** ensures that the model is validated across multiple subsets of the data, which leads to more reliable performance estimates.

**scoring='roc_auc'** was selected as the evaluation metric. It is well suited to imbalanced classification problems because ROC AUC measures how well the model ranks attrited customers above existing ones across all decision thresholds, rather than rewarding predictions of the majority class.

**random_state=42** ensures that the search is reproducible.

**n_jobs=-1** allows the search to use all available CPU cores, significantly speeding up the process.
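
The ranking interpretation behind `scoring='roc_auc'` can be illustrated without any library: ROC AUC equals the fraction of (positive, negative) pairs in which the positive example receives the higher score. The sketch below uses made-up labels and scores, and `roc_auc_pairwise` is a hypothetical helper written for clarity, not for speed.

```python
from itertools import product


def roc_auc_pairwise(labels, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p, n in product(pos, neg)
    )
    return wins / (len(pos) * len(neg))


# Made-up example: 4 existing customers (0) and 2 attrited customers (1).
labels = [0, 0, 0, 0, 1, 1]
scores = [0.10, 0.20, 0.35, 0.60, 0.55, 0.90]

auc = roc_auc_pairwise(labels, scores)

# Rescaling the scores keeps their order, so the AUC is unchanged.
auc_rescaled = roc_auc_pairwise(labels, [s / 10 for s in scores])

print("AUC:", auc)
print("AUC after rescaling scores:", auc_rescaled)
```

Seven of the eight pairs are ordered correctly here. Because only the ordering of scores matters, the metric does not depend on the 0.5 threshold that accuracy and F1 rely on.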

**Summary**

This hyperparameter tuning setup is designed to explore a wide range of meaningful Random Forest configurations while remaining computationally efficient. The use of ROC AUC as the scoring metric is particularly important given the class imbalance in the dataset (about 84% existing customers versus 16% attrited customers). Overall, this approach provides a practical and scalable way to improve model performance.





# **4. Execution and Results Reporting**

In [6]:
# Fit the random search to the training data
random_search.fit(X_train, y_train)

# Report best parameters and score
print("Best parameters found:")
print(random_search.best_params_)
print("\nBest cross-validation score (AUC):")
print(random_search.best_score_)

Fitting 5 folds for each of 50 candidates, totalling 250 fits
Best parameters found:
{'classifier__bootstrap': False, 'classifier__class_weight': 'balanced', 'classifier__max_depth': np.int64(30), 'classifier__max_features': 'sqrt', 'classifier__min_samples_leaf': 1, 'classifier__min_samples_split': 2, 'classifier__n_estimators': 452}

Best cross-validation score (AUC):
0.9889016393442622


# **5. Baseline Comparison & Final Evaluation**

In [7]:
# Baseline model with default parameters
baseline_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                                  ('classifier', RandomForestClassifier(random_state=42))])

# Fit baseline model
baseline_pipeline.fit(X_train, y_train)

# Evaluate both models on test set
def evaluate_model(model, X_test, y_test):
    y_pred = model.predict(X_test)
    y_proba = model.predict_proba(X_test)[:, 1]

    print("Accuracy:", accuracy_score(y_test, y_pred))
    print("F1 Score:", f1_score(y_test, y_pred))
    print("ROC AUC:", roc_auc_score(y_test, y_proba))
    print("\nClassification Report:")
    print(classification_report(y_test, y_pred))

print("=== Baseline Model Performance ===")
evaluate_model(baseline_pipeline, X_test, y_test)

print("\n=== Tuned Model Performance ===")
evaluate_model(random_search.best_estimator_, X_test, y_test)

=== Baseline Model Performance ===
Accuracy: 0.9541864139020537
F1 Score: 0.8428184281842819
ROC AUC: 0.9850573782338489

Classification Report:
              precision    recall  f1-score   support

           0       0.96      0.99      0.97      2125
           1       0.94      0.76      0.84       407

    accuracy                           0.95      2532
   macro avg       0.95      0.88      0.91      2532
weighted avg       0.95      0.95      0.95      2532


=== Tuned Model Performance ===
Accuracy: 0.9518167456556083
F1 Score: 0.8351351351351352
ROC AUC: 0.986650672062437

Classification Report:
              precision    recall  f1-score   support

           0       0.96      0.99      0.97      2125
           1       0.93      0.76      0.84       407

    accuracy                           0.95      2532
   macro avg       0.94      0.87      0.90      2532
weighted avg       0.95      0.95      0.95      2532



# **6. Analysis and Discussion:**
**Compare Performance: Baseline vs Tuned Pipeline:**

The tuned model showed a slightly improved ROC AUC score (0.9867 vs. 0.9851), indicating better ranking of positive vs. negative predictions, which is especially useful in imbalanced classification tasks. However, the overall accuracy (0.9518 vs. 0.9542) and F1 score (0.8351 vs. 0.8428) were slightly lower after tuning.

This suggests that while tuning slightly improved how the model ranks customers by predicted churn probability, it didn't improve classification performance at the default decision threshold on the test set. The tuned configuration may in fact have slightly overfitted to the training data or cross-validation folds.
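
To see how small the accuracy and F1 differences are in absolute terms, the reported metrics can be translated back into error counts. The sketch below is a reconstruction, not output from the notebook: the true-positive, false-positive and false-negative counts were chosen so that the derived metrics match the reported accuracy, F1 score and class support of 407, and `metrics_from_counts` is a hypothetical helper.

```python
def metrics_from_counts(tp, fp, fn, n_total):
    """Accuracy, precision, recall and F1 for the positive (attrited) class."""
    tn = n_total - tp - fp - fn
    return {
        "accuracy": (tp + tn) / n_total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "f1": 2 * tp / (2 * tp + fp + fn),
    }


N_TEST = 2532  # test-set size reported above

# RECONSTRUCTED counts (assumption): consistent with the reported metrics.
baseline = metrics_from_counts(tp=311, fp=20, fn=96, n_total=N_TEST)
tuned = metrics_from_counts(tp=309, fp=24, fn=98, n_total=N_TEST)

baseline_errors = 20 + 96
tuned_errors = 24 + 98

for name, m in [("baseline", baseline), ("tuned", tuned)]:
    print(name, {k: round(v, 4) for k, v in m.items()})
print("extra misclassified customers after tuning:", tuned_errors - baseline_errors)
```

Under this reconstruction the tuned model misclassifies 122 test customers against 116 for the baseline, so the drop in accuracy and F1 corresponds to only six customers out of 2,532.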

**Discussion of Findings:**

The best hyperparameters found during tuning included a high number of estimators (n_estimators=452), a deep tree limit (max_depth=30), and class balancing enabled via class_weight='balanced'. Notably, bootstrap=False was selected, which is less common for Random Forests; it means every tree is trained on the full training set rather than on a resampled copy. These adjustments, especially class weighting and deep trees, likely contributed to the improvement in the ROC AUC score.

Using max_features='sqrt' and min_samples_leaf=1 with min_samples_split=2 allowed the model to explore fine-grained splits, increasing sensitivity to subtle patterns in the data, which is especially important when dealing with imbalanced classes.

The tuned model achieved a best cross-validation ROC AUC of 0.9889, slightly higher than both the baseline test ROC AUC (0.9851) and the tuned test ROC AUC (0.9867). Because the cross-validation score is the best of 50 candidates, it is somewhat optimistic; the fairer comparison is on the test set, where the tuned model still ranks positive and negative instances slightly better than the baseline.

**Surprises:**

One surprising outcome was how well the baseline model performed without tuning. The accuracy and F1 scores were already high, and the improvements from tuning were relatively minor in those metrics. This reinforces the known robustness of Random Forests with default parameters, especially on structured tabular data.

Another interesting finding was the choice of bootstrap=False, which goes against typical expectations for ensemble methods but worked well in this context. It suggests that for this dataset, more deterministic sampling may have been beneficial.
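
To illustrate what changes when bootstrapping is switched off: a bootstrap sample of size n drawn with replacement contains, on average, only about 63% of the distinct rows, whereas without bootstrapping every tree sees all of them. The sketch below checks this for the training-set size used here; the helper names are hypothetical and the simulation uses Python's `random` module rather than scikit-learn's internal sampler.

```python
import random


def expected_unique_fraction(n):
    """Expected share of distinct rows in a bootstrap sample of size n."""
    return 1 - (1 - 1 / n) ** n


def simulated_unique_fraction(n, seed=0):
    """Draw one bootstrap sample with replacement and measure its distinct share."""
    rng = random.Random(seed)
    sample = [rng.randrange(n) for _ in range(n)]
    return len(set(sample)) / n


n_train = 7595  # training-set size reported above

analytic = expected_unique_fraction(n_train)
simulated = simulated_unique_fraction(n_train)

print(f"expected share of distinct rows per tree: {analytic:.4f}")
print(f"share in one simulated bootstrap sample:  {simulated:.4f}")
```

With `bootstrap=False`, each tree trains on all 7,595 rows and the only source of diversity between trees is the random feature subset chosen at each split (`max_features='sqrt'`).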

**Challenges Encountered:**

The main challenge was computational cost: tuning required 250 fits (50 candidate configurations × 5 folds), which was intensive given the dataset size. Class imbalance also required careful handling via class weights and the use of roc_auc as a more reliable evaluation metric.