
In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
# Load the CSV dataset
file_path = "/content/Bank_Marketing_Dataset.csv"

In [4]:
df = pd.read_csv(file_path)

In [5]:
df

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,deposit
0,59,admin.,married,secondary,no,2343,yes,no,unknown,5,may,1042,1,-1,0,unknown,yes
1,56,admin.,married,secondary,no,45,no,no,unknown,5,may,1467,1,-1,0,unknown,yes
2,41,technician,married,secondary,no,1270,yes,no,unknown,5,may,1389,1,-1,0,unknown,yes
3,55,services,married,secondary,no,2476,yes,no,unknown,5,may,579,1,-1,0,unknown,yes
4,54,admin.,married,tertiary,no,184,no,no,unknown,5,may,673,2,-1,0,unknown,yes
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11157,33,blue-collar,single,primary,no,1,yes,no,cellular,20,apr,257,1,-1,0,unknown,no
11158,39,services,married,secondary,no,733,no,no,unknown,16,jun,83,4,-1,0,unknown,no
11159,32,technician,single,secondary,no,29,no,no,cellular,19,aug,156,2,-1,0,unknown,no
11160,43,technician,married,secondary,no,0,no,yes,cellular,8,may,9,2,172,5,failure,no


In [6]:
# Convert numeric columns to appropriate types
numeric_columns = ["age", "balance", "day", "duration", "campaign", "pdays", "previous"]
df[numeric_columns] = df[numeric_columns].apply(pd.to_numeric, errors="coerce")

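# Handle 'unknown' values in categorical columns
categorical_columns_with_unknown = ["job", "education", "contact", "poutcome"]

for col in categorical_columns_with_unknown:
    # Replace 'unknown' with the mode (most frequent value).
    # Note: if 'unknown' is itself the most frequent value in a column,
    # this replacement is a no-op for that column (poutcome still shows
    # 'unknown' in the output of df.head() after this cell).
    most_frequent = df[col].mode()[0]
    df[col] = df[col].replace("unknown", most_frequent)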

## Reasons for Replacing "unknown" with Mode

### 1. Preserves Data Integrity
- Unlike dropping rows, replacing `"unknown"` allows us to **retain all data points**.
- Avoids losing valuable information when the dataset is **small or imbalanced**.

### 2. Ensures Consistency for Machine Learning Models
- Most ML models **cannot handle missing or ambiguous values directly**.
- If `"unknown"` is treated as a separate category, it might introduce **bias** or **mislead the model**.

### 3. Mode is the Most Likely Value
- Absent other information, the **most frequent value (mode)** is the single most likely category for a row whose true value is unknown.
- **Example:** If `"secondary"` is the most common education level, replacing `"unknown"` with `"secondary"` is a reasonable assumption.

### 4. Alternative Approaches (Why Not Use Them?)
- **Remove Rows with "unknown"** → This **reduces the dataset size** and may remove important data.
- **Replace with a New Category ("Other")** → This **creates a new artificial class**, which may not be meaningful.
- **Predict "unknown" using another model** → This requires **additional effort** and may not always be accurate.
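One caveat with mode imputation is that the placeholder itself can be the most frequent value, in which case replacing `"unknown"` with the mode changes nothing. The sketch below, on made-up toy data, shows one possible way to compute the mode over known values only. The helper name `impute_placeholder_with_mode` is hypothetical and is not part of this notebook or of pandas.

```python
import pandas as pd

def impute_placeholder_with_mode(series: pd.Series, placeholder: str = "unknown") -> pd.Series:
    """Replace `placeholder` with the most frequent value among the known entries."""
    known = series[series != placeholder]
    if known.empty:
        # Nothing to learn the mode from; return the data unchanged.
        return series.copy()
    most_frequent = known.mode()[0]
    return series.replace(placeholder, most_frequent)

# Toy column (illustrative values only) where the placeholder is the mode
toy = pd.Series(["unknown", "unknown", "unknown", "failure", "failure", "success"])

naive = toy.replace("unknown", toy.mode()[0])   # mode is "unknown", so nothing changes
fixed = impute_placeholder_with_mode(toy)       # "unknown" becomes "failure"

print(naive.tolist())
print(fixed.tolist())
```

Whether overwriting a dominant `"unknown"` category is desirable depends on the feature; when most rows are unknown, keeping it as its own category may be the more honest choice.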


In [7]:
df.head()

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,deposit
0,59,admin.,married,secondary,no,2343,yes,no,cellular,5,may,1042,1,-1,0,unknown,yes
1,56,admin.,married,secondary,no,45,no,no,cellular,5,may,1467,1,-1,0,unknown,yes
2,41,technician,married,secondary,no,1270,yes,no,cellular,5,may,1389,1,-1,0,unknown,yes
3,55,services,married,secondary,no,2476,yes,no,cellular,5,may,579,1,-1,0,unknown,yes
4,54,admin.,married,tertiary,no,184,no,no,cellular,5,may,673,2,-1,0,unknown,yes


## Handling `pdays`

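Since `pdays = -1` means the customer was never contacted before, we will:
1. Replace `-1` with `NaN` so the sentinel value is excluded from the transformation
2. Apply a log transformation to the remaining (positive) values
3. Fill the `NaN` values with `0` after the transformation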

In [10]:
from sklearn.preprocessing import MinMaxScaler

# Highly skewed features that will be log-transformed
skewed_features = ["balance", "duration", "campaign", "pdays"]

# pdays == -1 means "never contacted before"; mark it as missing
# so the sentinel is not log-transformed
df["pdays"] = df["pdays"].replace(-1, np.nan)

In [11]:
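# Apply log1p(x) = log(1 + x), which handles zeros and propagates NaN.
# Caveats: log1p returns NaN for x < -1 and -inf for x == -1, so any such
# value (for example a negative balance) is not transformed meaningfully.
# Re-running this cell applies the log a second time.
for col in skewed_features:
    df[col] = np.log1p(df[col])

# Fill NaN values in pdays with 0 after the transformation
# (they represent "never contacted before")
df["pdays"] = df["pdays"].fillna(0)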

# Normalize numerical features using Min-Max Scaling
scaler = MinMaxScaler()
df[numeric_columns] = scaler.fit_transform(df[numeric_columns])

  result = getattr(ufunc, method)(*inputs, **kwargs)
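A warning like the one above is what `np.log1p` emits when it receives values below `-1`, for which the result is `NaN`. As a hedged alternative, not what this notebook does, a signed log transform keeps the sign and compresses the magnitude. The sketch below uses made-up balances, and `signed_log1p` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

def signed_log1p(values: pd.Series) -> pd.Series:
    """sign(x) * log(1 + |x|): defined for all real x, symmetric around zero."""
    return np.sign(values) * np.log1p(np.abs(values))

# Illustrative balances, including a negative one
balance = pd.Series([-500.0, 0.0, 1.0, 2343.0])

with np.errstate(invalid="ignore"):   # silence the warning for this demo only
    plain = np.log1p(balance)         # the negative entry becomes NaN

signed = signed_log1p(balance)        # the negative entry stays finite

print(plain.isna().tolist())
print(signed.round(4).tolist())
```

If negative values carry meaning (an overdrawn account, for instance), a transform like this preserves them; the plain `log1p` silently turns them into missing values that later steps must handle.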


In [12]:
df

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,deposit
0,0.532468,admin.,married,secondary,no,0.864608,yes,no,cellular,0.133333,may,0.896965,0.000000,0.000000,0.000000,unknown,yes
1,0.493506,admin.,married,secondary,no,0.627321,no,no,cellular,0.133333,may,0.925315,0.000000,0.000000,0.000000,unknown,yes
2,0.298701,technician,married,secondary,no,0.835750,yes,no,cellular,0.133333,may,0.920866,0.000000,0.000000,0.000000,unknown,yes
3,0.480519,services,married,secondary,no,0.867110,yes,no,cellular,0.133333,may,0.845323,0.000000,0.000000,0.000000,unknown,yes
4,0.467532,admin.,married,tertiary,no,0.728223,no,no,cellular,0.133333,may,0.858923,0.192695,0.000000,0.000000,unknown,yes
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11157,0.194805,blue-collar,single,primary,no,0.209798,yes,no,cellular,0.633333,apr,0.766828,0.000000,0.000000,0.000000,unknown,no
11158,0.272727,services,married,secondary,no,0.807954,no,no,cellular,0.500000,jun,0.640334,0.388236,0.000000,0.000000,unknown,no
11159,0.181818,technician,single,secondary,no,0.590393,no,no,cellular,0.600000,aug,0.713741,0.192695,0.000000,0.000000,unknown,no
11160,0.324675,technician,married,secondary,no,0.000000,no,yes,cellular,0.233333,may,0.305366,0.192695,0.887272,0.086207,failure,no
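Min-Max scaling maps each value to `(x - min) / (max - min)`, so the column minimum becomes 0 and the maximum becomes 1. The sketch below reproduces that formula by hand on a toy `age` column. The bounds 18 and 95 are assumptions chosen for illustration, and `min_max_scale` is a hypothetical helper rather than the scikit-learn API.

```python
import numpy as np
import pandas as pd

def min_max_scale(values: pd.Series) -> pd.Series:
    """Rescale to [0, 1] via (x - min) / (max - min); NaN entries stay NaN."""
    lo, hi = values.min(), values.max()   # pandas skips NaN by default
    return (values - lo) / (hi - lo)

# Toy ages; 18 and 95 are assumed bounds, NaN shows how missing values behave
ages = pd.Series([18.0, 41.0, 59.0, 95.0, np.nan])
scaled_ages = min_max_scale(ages)

print(scaled_ages.round(6).tolist())
```

Under these assumed bounds, an age of 59 maps to about 0.5325 and an age of 41 to about 0.2987. Missing values are not filled by scaling, so any `NaN` left in a column before this step is still `NaN` afterwards.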


## Class Balancing

Now, I will:
- **Check class imbalance** in the `deposit` target variable.
- **Apply resampling techniques** if needed:
  - **SMOTE (Synthetic Minority Over-sampling Technique):** Oversample the minority class.
  - **Undersampling:** Further balance the dataset if necessary.

In [13]:
from collections import Counter

class_distribution = Counter(df["deposit"])

class_distribution_df = pd.DataFrame(class_distribution.items(), columns=["Deposit Status", "Count"])


In [14]:
class_distribution_df

Unnamed: 0,Deposit Status,Count
0,yes,5289
1,no,5873


## Class Balancing - Analysis

The target variable `deposit` distribution is:

- **Yes (Deposited):** 5289 instances  
- **No (Did Not Deposit):** 5873 instances

This indicates a slight class imbalance (about 47% vs. 53%; the majority class is roughly 11% larger than the minority), which suggests:
- **SMOTE (Oversampling):** May not be necessary.
- **Undersampling:** Could lead to potential data loss.

Given the minimal imbalance, I will proceed without resampling.
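The size of the imbalance can be quantified directly from the two counts reported above. This is a minimal sketch that hard-codes those counts; the variable names are illustrative.

```python
import pandas as pd

# Class counts taken from the distribution table above
counts = pd.Series({"yes": 5289, "no": 5873}, name="count")

shares = counts / counts.sum()                    # fraction of rows per class
imbalance_ratio = counts.max() / counts.min()     # majority size / minority size

print(shares.round(3).to_dict())
print(round(imbalance_ratio, 3))
```

A ratio close to 1 means the classes are nearly balanced, which supports the decision to skip resampling here.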

In [15]:
# Save the transformed DataFrame as CSV for model training
csv_file_path = "/content/Bank_Marketing_Dataset_for_model_trainning.csv"
df.to_csv(csv_file_path, index=False)

# Confirm the file is saved
print(f"Dataset saved as CSV: {csv_file_path}")

Dataset saved as CSV: /content/Bank_Marketing_Dataset_for_model_trainning.csv
