### Step 1: Data Understanding and Preparation
By using *df.shape* we can see the full size of the dataset: 45211 rows and 17 columns.

By using *df.dtypes* we can see the data type of each variable.

All variables are either strings (str) or integers (int64). The day, duration, campaign, pdays and previous columns are integers, while month is stored as text.

In [11]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("bank-full.csv", sep=";")

print(df.shape)
print(df.dtypes)

(45211, 17)
age          int64
job            str
marital        str
education      str
default        str
balance      int64
housing        str
loan           str
contact        str
day          int64
month          str
duration     int64
campaign     int64
pdays        int64
previous     int64
poutcome       str
y              str
dtype: object
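
As a small illustration of the two checks above, the sketch below runs them on a made-up frame. The frame `toy` and its values are hypothetical stand-ins for bank-full.csv, and `select_dtypes` is used here only to split the columns by type.

```python
import pandas as pd

# Hypothetical toy frame standing in for bank-full.csv
toy = pd.DataFrame({
    "age": [58, 44, 33],
    "job": ["management", "technician", "entrepreneur"],
    "balance": [2143, 29, 2],
    "y": ["no", "no", "yes"],
})

# shape is a (rows, columns) tuple
n_rows, n_cols = toy.shape

# Split the column names into numeric and non-numeric groups
numeric = toy.select_dtypes(include="number").columns.tolist()
non_numeric = toy.select_dtypes(exclude="number").columns.tolist()

print(n_rows, n_cols)
print("Numeric:", numeric)
print("Non-numeric:", non_numeric)
```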


### Step 1.2: Data Cleaning

We create helper variables to count special values inside the dataframe.
By using *df.isnull()* we see that there are no blank values in the dataset.

By using *df=="unknown"* we see that the "education" column has 1857 unknown values and the "job" column has 288. The "contact" (13020) and "poutcome" (36959) columns contain many more.
We are choosing to keep these rows and treat "unknown" as its own category.

By using *df==0* we see that 36954 cells in the "previous" column are 0. This shows that the customer was not contacted previously.
There are also 3514 cells in the "balance" column that are 0. These rows are not removed in this step (see Step 1.3).

By using *df==-1* we check the "pdays" column, where -1 means the client was not previously contacted. We will remove the "pdays" column, as we do not think it is as good a predictor as the "previous" column.



In [12]:
df_null = (df.isnull()).sum()
print("This is how many values are empty:")
print(df_null)

df_unknown = (df == "unknown").sum()
print("This is how many values are unknown:")
print(df_unknown)

df_0 = (df == 0).sum()
print("This is how many values are 0:")
print(df_0)

# Compare against the integer -1: pdays is int64, so the string "-1" never matches
df_1 = (df == -1).sum()
print("This is how many values are -1:")
print(df_1)

This is how many values are empty:
age          0
job          0
marital      0
education    0
default      0
balance      0
housing      0
loan         0
contact      0
day          0
month        0
duration     0
campaign     0
pdays        0
previous     0
poutcome     0
y            0
dtype: int64
This is how many values are unknown:
age              0
job            288
marital          0
education     1857
default          0
balance          0
housing          0
loan             0
contact      13020
day              0
month            0
duration         0
campaign         0
pdays            0
previous         0
poutcome     36959
y                0
dtype: int64
This is how many values are 0:
age              0
job              0
marital          0
education        0
default          0
balance       3514
housing          0
loan             0
contact          0
day              0
month            0
duration         3
campaign         0
pdays            0
previous     36954
poutcome
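
The four checks above can also be collected into one table. The following is a minimal sketch on made-up data; the helper `sentinel_counts` and the frame `toy` are hypothetical, and the -1 comparison uses the integer because pdays is stored as int64.

```python
import pandas as pd

# Hypothetical toy frame with the same kinds of placeholder values
toy = pd.DataFrame({
    "job": ["admin.", "unknown", "services", "unknown"],
    "balance": [0, 150, -20, 0],
    "pdays": [-1, -1, 90, -1],
    "previous": [0, 0, 2, 0],
})

def sentinel_counts(frame):
    """Count missing values and common placeholder values per column."""
    return pd.DataFrame({
        "null": frame.isnull().sum(),
        "unknown": (frame == "unknown").sum(),
        "zero": (frame == 0).sum(),
        "minus_one": (frame == -1).sum(),
    })

counts = sentinel_counts(toy)
print(counts)
```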

Step 1.3: We are removing the "poutcome", "pdays", "contact", and "duration" columns.
We are not removing any rows for now, since there are no blank values.


In [13]:
df.drop(columns=['poutcome', 'pdays', 'contact', 'duration'], inplace=True)
print(df.shape)

(45211, 13)


The categorical columns are: job, marital, education, default, housing, loan, month, and y. We are going to convert these columns to binary indicator columns so that they are easier to use in calculations.

In [14]:

print("Numerical columns:" , df.select_dtypes(include=np.number).columns)
print(" ")
# Treat every non-numeric column as categorical; this works the same way
# whether pandas reports text columns as 'object' (pandas 2) or 'str' (pandas 3).
print("Categorical columns:" , df.select_dtypes(exclude=np.number).columns)


Numerical columns: Index(['age', 'balance', 'day', 'campaign', 'previous'], dtype='str')
 
Categorical columns: Index(['job', 'marital', 'education', 'default', 'housing', 'loan', 'month',
       'y'],
      dtype='str')


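--One-Hot Encoding: Categorical variables are converted to indicator columns using one-hot encoding, and the target variable is mapped to 0/1. This makes the dataset fully machine-readable for modeling. Note that this cell reloads bank-full.csv, so the columns dropped in Step 1.3 (poutcome, pdays, contact, duration) are present again in df_encoded.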

In [15]:
import pandas as pd

df = pd.read_csv("bank-full.csv", sep=';')
df['y'] = df['y'].map({'no': 0, 'yes': 1})
df_encoded = pd.get_dummies(df.drop('y', axis=1), drop_first=True)
df_encoded['y'] = df['y']
print(df_encoded.head())


   age  balance  day  duration  campaign  pdays  previous  job_blue-collar  \
0   58     2143    5       261         1     -1         0            False   
1   44       29    5       151         1     -1         0            False   
2   33        2    5        76         1     -1         0            False   
3   47     1506    5        92         1     -1         0             True   
4   33        1    5       198         1     -1         0            False   

   job_entrepreneur  job_housemaid  job_management  job_retired  \
0             False          False            True        False   
1             False          False           False        False   
2              True          False           False        False   
3             False          False           False        False   
4             False          False           False        False   

   job_self-employed  job_services  job_student  job_technician  \
0              False         False        False           Fal
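
To make the effect of drop_first=True easier to see, here is a minimal sketch on a made-up frame. The frame `toy` is hypothetical; the calls mirror the cell above. With three marital categories, only two indicator columns are created, and the dropped category ("divorced", the first alphabetically) is represented by False in both.

```python
import pandas as pd

# Hypothetical toy frame with one categorical feature and the target
toy = pd.DataFrame({
    "age": [58, 44, 33, 47],
    "marital": ["married", "single", "married", "divorced"],
    "y": ["no", "no", "yes", "no"],
})

# Map the target to 0/1, then one-hot encode the remaining columns
target = toy["y"].map({"no": 0, "yes": 1})
encoded = pd.get_dummies(toy.drop(columns="y"), drop_first=True)
encoded["y"] = target

print(encoded)
```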

After encoding, every feature is either numeric or boolean (True/False), there are no missing values, and the dataset can be passed to machine learning models. We now verify that the encoded dataset is machine-readable.

In [None]:
df_encoded.dtypes.unique()
df_encoded.isnull().sum().sum()
df_encoded.shape
pd.set_option('display.max_columns', None) 
df_encoded.head()

Unnamed: 0,age,balance,day,duration,campaign,pdays,previous,job_blue-collar,job_entrepreneur,job_housemaid,job_management,job_retired,job_self-employed,job_services,job_student,job_technician,job_unemployed,job_unknown,marital_married,marital_single,education_secondary,education_tertiary,education_unknown,default_yes,housing_yes,loan_yes,contact_telephone,contact_unknown,month_aug,month_dec,month_feb,month_jan,month_jul,month_jun,month_mar,month_may,month_nov,month_oct,month_sep,poutcome_other,poutcome_success,poutcome_unknown,y
0,58,2143,5,261,1,-1,0,False,False,False,True,False,False,False,False,False,False,False,True,False,False,True,False,False,True,False,False,True,False,False,False,False,False,False,False,True,False,False,False,False,False,True,0
1,44,29,5,151,1,-1,0,False,False,False,False,False,False,False,False,True,False,False,False,True,True,False,False,False,True,False,False,True,False,False,False,False,False,False,False,True,False,False,False,False,False,True,0
2,33,2,5,76,1,-1,0,False,True,False,False,False,False,False,False,False,False,False,True,False,True,False,False,False,True,True,False,True,False,False,False,False,False,False,False,True,False,False,False,False,False,True,0
3,47,1506,5,92,1,-1,0,True,False,False,False,False,False,False,False,False,False,False,True,False,False,False,True,False,True,False,False,True,False,False,False,False,False,False,False,True,False,False,False,False,False,True,0
4,33,1,5,198,1,-1,0,False,False,False,False,False,False,False,False,False,False,True,False,True,False,False,True,False,False,False,False,True,False,False,False,False,False,False,False,True,False,False,False,False,False,True,0
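
The same verification could be written as a reusable check. This is only a sketch: the helper `check_model_ready` and both toy frames are hypothetical, and the check assumes that "machine-readable" means numeric or boolean columns with no missing values.

```python
import pandas as pd
from pandas.api.types import is_bool_dtype, is_numeric_dtype

def check_model_ready(frame):
    """Return (columns that are neither numeric nor boolean, total missing values)."""
    bad_cols = [
        col for col in frame.columns
        if not (is_numeric_dtype(frame[col]) or is_bool_dtype(frame[col]))
    ]
    n_missing = int(frame.isnull().sum().sum())
    return bad_cols, n_missing

# Hypothetical frames: one already encoded, one with a text column and a gap
toy_ok = pd.DataFrame({"age": [58, 44], "job_management": [True, False], "y": [0, 0]})
toy_bad = pd.DataFrame({"age": [58, None], "job": ["management", "technician"]})

print(check_model_ready(toy_ok))
print(check_model_ready(toy_bad))
```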


Feature Scaling: Identify numerical variables (age, balance, etc.)
In our dataset, the numeric columns are the ones that hold meaningful numbers for the model.
The function select_dtypes(include=['int64', 'float64']) is used to select all the numeric columns. The one-hot columns are boolean, so they are not selected.
We then used the list method remove to take out the target variable 'y', since it should not be scaled.

In [17]:

numeric_cols = df_encoded.select_dtypes(include=['int64', 'float64']).columns.tolist()
numeric_cols.remove('y')

print("Numeric features for scaling:", numeric_cols)


Numeric features for scaling: ['age', 'balance', 'day', 'duration', 'campaign', 'pdays', 'previous']


Standardize numeric features using: StandardScaler
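We split the data into train and test sets (70/30, stratified on y), then scaled the numeric columns using StandardScaler. The scaler is fitted on the training set only and then applied to the test set, so the test set does not influence the scaling. This puts all numbers on the same scale so the model doesn’t favor features just because they have bigger values.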

In [21]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = df_encoded.drop('y', axis=1)
y = df_encoded['y']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

scaler = StandardScaler()

X_train[numeric_cols] = scaler.fit_transform(X_train[numeric_cols])
X_test[numeric_cols] = scaler.transform(X_test[numeric_cols])

pd.set_option('display.max_columns', None)
print(X_train.head())


            age   balance       day  duration  campaign     pdays  previous  \
13382 -0.930461 -0.443813 -0.817039 -0.680152 -0.566850 -0.410038 -0.234774   
32641 -0.553871 -0.380625  0.143236 -0.214545 -0.566850 -0.410038 -0.234774   
3991  -1.589493 -0.419059  0.023201 -0.087562 -0.244783 -0.410038 -0.234774   
8068  -0.553871 -0.418082 -1.657280  1.232298 -0.244783 -0.410038 -0.234774   
27484 -0.365576 -0.409939  0.623373 -0.237633 -0.244783  1.175153  1.371081   

       job_blue-collar  job_entrepreneur  job_housemaid  job_management  \
13382            False             False          False           False   
32641            False             False          False           False   
3991              True             False          False           False   
8068              True             False          False           False   
27484            False             False          False           False   

       job_retired  job_self-employed  job_services  job_student  \
13382 
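
To show what fitting on the training set and transforming the test set means, the sketch below standardizes one made-up column by hand. It is an illustration, not the notebook's code: the frames are hypothetical, and the mean and population standard deviation (ddof=0, which is what StandardScaler uses) are computed from the training rows only.

```python
import pandas as pd

# Hypothetical train/test values for a single numeric column
train = pd.DataFrame({"balance": [0.0, 100.0, 200.0, 300.0]})
test = pd.DataFrame({"balance": [150.0, 600.0]})

# "Fit": learn the statistics from the training rows only
mean = train["balance"].mean()
std = train["balance"].std(ddof=0)  # population standard deviation

# "Transform": apply the same statistics to both sets
train_scaled = (train["balance"] - mean) / std
test_scaled = (test["balance"] - mean) / std

print(train_scaled.tolist())
print(test_scaled.tolist())
```

The scaled training column has mean 0 and standard deviation 1, while the test column does not have to, because it is measured against the training statistics.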

Outlier Handling. Detect outliers using: IQR method
We calculate the IQR (Interquartile Range) and define lower/upper bounds. Values outside these bounds are considered outliers.

In [22]:
Q1 = X_train[numeric_cols].quantile(0.25)
Q3 = X_train[numeric_cols].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

outliers = ((X_train[numeric_cols] < lower_bound) | (X_train[numeric_cols] > upper_bound)).sum()
print("Number of outliers per column:\n", outliers)


Number of outliers per column:
 age          344
balance     3318
day            0
duration    2323
campaign    3031
pdays       5735
previous    5735
dtype: int64
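
As a worked example of the IQR rule, the sketch below applies the same bounds to ten made-up values. The series `values` is hypothetical; the formulas are the ones used in the cell above.

```python
import pandas as pd

# Hypothetical contact counts with one extreme value
values = pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 5, 40], name="campaign")

q1 = values.quantile(0.25)   # 2.25
q3 = values.quantile(0.75)   # 4.0
iqr = q3 - q1                # 1.75

lower = q1 - 1.5 * iqr       # -0.375
upper = q3 + 1.5 * iqr       # 6.625

# A value is flagged when it falls outside [lower, upper]
is_outlier = (values < lower) | (values > upper)
print(values[is_outlier].tolist())
```

Only the value 40 lies outside the bounds, so it is the single flagged outlier.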


--Decide strategy (Keep, Cap, Remove): We can keep, cap, or remove outliers depending on their importance and effect on the model.
We decided to keep 'age' and 'day', cap the extremes of 'balance', 'duration', and 'campaign', and keep 'previous' as-is because its zeros are meaningful. Although 'duration' and 'pdays' were dropped in Step 1.3, both are present in df_encoded because the encoding cell reloaded the raw file; 'duration' is capped and 'pdays' is left untouched.

***For column AGE there are 344 outliers. We keep them because they represent very young or very old clients, which are real and meaningful data. No need to remove or cap.

***For column BALANCE there are 3318 outliers. We cap these extreme values using Winsorization. Some clients have extremely high or very low balances, which could skew the model.

***For column DAY there are 0 outliers. We keep all values since all days are valid and there’s nothing extreme.

***For column DURATION there are 2323 outliers. We cap these extreme values with the same Winsorization.

***For column CAMPAIGN there are 3031 outliers. We cap the extreme values because some clients were contacted many times during the campaign, and these extremes could skew the model.

Applying outlier treatment consistently

In [25]:
from scipy.stats.mstats import winsorize
X_train['balance'] = winsorize(X_train['balance'], limits=[0.05, 0.05])
X_test['balance'] = winsorize(X_test['balance'], limits=[0.05, 0.05])

X_train['duration'] = winsorize(X_train['duration'], limits=[0.05, 0.05])
X_test['duration'] = winsorize(X_test['duration'], limits=[0.05, 0.05])

X_train['campaign'] = winsorize(X_train['campaign'], limits=[0.05, 0.05])
X_test['campaign'] = winsorize(X_test['campaign'], limits=[0.05, 0.05])

pd.set_option('display.max_columns', None)
print(X_train.head())

            age   balance       day  duration  campaign     pdays  previous  \
13382 -0.930461 -0.443813 -0.817039 -0.680152 -0.566850 -0.410038 -0.234774   
32641 -0.553871 -0.380625  0.143236 -0.214545 -0.566850 -0.410038 -0.234774   
3991  -1.589493 -0.419059  0.023201 -0.087562 -0.244783 -0.410038 -0.234774   
8068  -0.553871 -0.418082 -1.657280  1.232298 -0.244783 -0.410038 -0.234774   
27484 -0.365576 -0.409939  0.623373 -0.237633 -0.244783  1.175153  1.371081   

       job_blue-collar  job_entrepreneur  job_housemaid  job_management  \
13382            False             False          False           False   
32641            False             False          False           False   
3991              True             False          False           False   
8068              True             False          False           False   
27484            False             False          False           False   

       job_retired  job_self-employed  job_services  job_student  \
13382 
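
The cell above winsorizes the training and test sets separately, so each set is capped at its own percentiles. One possible alternative, sketched here on made-up numbers, is to learn the 5th and 95th percentile caps from the training set and apply the same caps to both sets. The helper `fit_caps` and the toy data are hypothetical, and pandas' `clip` is swapped in for scipy's `winsorize`, so the capped values can differ slightly from the ones shown above.

```python
import pandas as pd

def fit_caps(train_col, lower_q=0.05, upper_q=0.95):
    """Learn lower and upper caps from a training column."""
    return train_col.quantile(lower_q), train_col.quantile(upper_q)

# Hypothetical data: training values 0..100, test values with two extremes
train = pd.DataFrame({"balance": [float(v) for v in range(0, 101)]})
test = pd.DataFrame({"balance": [-50.0, 50.0, 500.0]})

# Caps come from the training set only
low, high = fit_caps(train["balance"])

# The same caps are applied to both sets
train["balance"] = train["balance"].clip(lower=low, upper=high)
test["balance"] = test["balance"].clip(lower=low, upper=high)

print(low, high)
print(test["balance"].tolist())
```

With this approach a test value is capped at the same threshold as a training value, which keeps the treatment consistent across the two sets.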

---------------------------------- STEP 2 : Exploratory Data Analysis -----------------------------------------