# Peer-Reviewed Midterm Project | ML-ZOOMCAMP | predict-term deposit

List of contents:

1. Problem description
2. Getting the dataset
3. Reading the dataset with pandas

## 1. Problem Description 

## 2. Getting the dataset

- link to dataset: [https://www.kaggle.com/datasets/aslanahmedov/predict-term-deposit](https://www.kaggle.com/datasets/aslanahmedov/predict-term-deposit)

or,

[https://raw.githubusercontent.com/bhasarma/mlcoursezoom-camp/main/WK08-09-midterm-project/predict-term-deposit-data.csv](https://raw.githubusercontent.com/bhasarma/mlcoursezoom-camp/main/WK08-09-midterm-project/predict-term-deposit-data.csv)

You can download the dataset into your local directory with `wget` 

In [1]:
data = 'https://raw.githubusercontent.com/bhasarma/mlcoursezoom-camp/main/WK08-09-midterm-project/predict-term-deposit-data.csv'

In [2]:
#!wget $data  # uncomment if you haven't downloaded the data already

In [3]:
ls

notebook.ipynb  predict-term-deposit-data.csv  README.md  report.md
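
For readers without `wget`, the same download step can be sketched in Python with the standard-library `urllib.request.urlretrieve`. The helper name `download_if_missing` and its `fetch` parameter are hypothetical; they are there so the function can be tried without network access.

```python
import os
from urllib.request import urlretrieve

DATA_URL = (
    "https://raw.githubusercontent.com/bhasarma/mlcoursezoom-camp/main/"
    "WK08-09-midterm-project/predict-term-deposit-data.csv"
)


def download_if_missing(url, path, fetch=urlretrieve):
    """Download `url` to `path` unless the file already exists.

    Returns True if a download was attempted, False if the file was already there.
    """
    if os.path.exists(path):
        return False
    fetch(url, path)
    return True


# Usage (needs network access), left commented out like the wget cell:
# download_if_missing(DATA_URL, "predict-term-deposit-data.csv")
```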


## 3. Reading the dataset with pandas 

In [4]:
%matplotlib inline

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [5]:
df = pd.read_csv('predict-term-deposit-data.csv')

In [6]:
df.head().T

| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| Id | 1001 | 1002 | 1003 | 1004 | 1005 |
| age | 999.0 | 44.0 | 33.0 | 47.0 | 33.0 |
| job | management | technician | entrepreneur | blue-collar | unknown |
| marital | married | single | married | married | single |
| education | tertiary | secondary | secondary | unknown | unknown |
| default | no | no | no | no | no |
| balance | 2143.0 | 29.0 | 2.0 | 1506.0 | 1.0 |
| housing | yes | yes | yes | yes | no |
| loan | no | no | yes | no | no |
| contact | unknown | unknown | unknown | unknown | unknown |


In [7]:
df.shape

(45211, 18)

In [8]:
df.columns

Index(['Id', 'age', 'job', 'marital', 'education', 'default', 'balance',
       'housing', 'loan', 'contact', 'day', 'month', 'duration', 'campaign',
       'pdays', 'previous', 'poutcome', 'y'],
      dtype='object')

Checklist:
1. Make all feature names and values consistent, i.e. all lowercase, with underscores between words.
2. Check that the data type of each feature makes sense, e.g. age should be an integer and not a string.
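
The first checklist item can be sketched on a toy frame. This assumes pandas; the helper `normalize_names` and the sample column names are hypothetical and do not come from the dataset.

```python
import pandas as pd


def normalize_names(df):
    """Return a copy with lowercase, underscore-separated column names."""
    out = df.copy()
    out.columns = (
        out.columns.str.strip()
        .str.lower()
        .str.replace(" ", "_", regex=False)
    )
    return out


toy = pd.DataFrame({"Id": [1001], "Marital Status": ["married"]})
clean = normalize_names(toy)
print(list(clean.columns))  # ['id', 'marital_status']
```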

In [9]:
df.columns = df.columns.str.lower()
df.head().T

| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| id | 1001 | 1002 | 1003 | 1004 | 1005 |
| age | 999.0 | 44.0 | 33.0 | 47.0 | 33.0 |
| job | management | technician | entrepreneur | blue-collar | unknown |
| marital | married | single | married | married | single |
| education | tertiary | secondary | secondary | unknown | unknown |
| default | no | no | no | no | no |
| balance | 2143.0 | 29.0 | 2.0 | 1506.0 | 1.0 |
| housing | yes | yes | yes | yes | no |
| loan | no | no | yes | no | no |
| contact | unknown | unknown | unknown | unknown | unknown |


* The column names are now consistent.

In [10]:
df.dtypes

id             int64
age          float64
job           object
marital       object
education     object
default       object
balance      float64
housing       object
loan          object
contact       object
day            int64
month         object
duration       int64
campaign       int64
pdays          int64
previous       int64
poutcome      object
y             object
dtype: object

**To do:**

- Remove the `id` feature: a customer ID assigned by the bank has *logically* no influence on the outcome.
- Convert the categorical variables with one-hot encoding (fit only on the train part, after splitting? Check notes to be sure).
- Convert the `yes`/`no` columns into 1s and 0s.
- Find a way to handle the `day` and `month` columns.
- Decide what to do with the `-1` values in the `pdays` feature.
- Understand the difference between the `poutcome` and `y` features.

In [11]:
df.describe()

| | id | age | balance | day | duration | campaign | pdays | previous |
|---|---|---|---|---|---|---|---|---|
| count | 45211.0 | 45202.0 | 45208.0 | 45211.0 | 45211.0 | 45211.0 | 45211.0 | 45211.0 |
| mean | 23606.0 | 40.954714 | 1362.34662 | 15.806419 | 258.16308 | 2.763841 | 40.197828 | 0.580323 |
| std | 13051.435847 | 11.539144 | 3044.852387 | 8.322476 | 257.527812 | 3.098021 | 100.128746 | 2.303441 |
| min | 1001.0 | -1.0 | -8019.0 | 1.0 | 0.0 | 1.0 | -1.0 | 0.0 |
| 25% | 12303.5 | 33.0 | 72.0 | 8.0 | 103.0 | 1.0 | -1.0 | 0.0 |
| 50% | 23606.0 | 39.0 | 448.0 | 16.0 | 180.0 | 2.0 | -1.0 | 0.0 |
| 75% | 34908.5 | 48.0 | 1428.0 | 21.0 | 319.0 | 3.0 | -1.0 | 0.0 |
| max | 46211.0 | 999.0 | 102127.0 | 31.0 | 4918.0 | 63.0 | 871.0 | 275.0 |


In [12]:
df.isnull().sum()

id           0
age          9
job          0
marital      0
education    0
default      0
balance      3
housing      0
loan         0
contact      0
day          0
month        0
duration     0
campaign     0
pdays        0
previous     0
poutcome     0
y            0
dtype: int64

In [13]:
df.nunique()

id           45211
age             79
job             12
marital          3
education        4
default          2
balance       7168
housing          2
loan             2
contact          3
day             31
month           12
duration      1573
campaign        48
pdays          559
previous        41
poutcome         4
y                2
dtype: int64

In [14]:
df['poutcome'].value_counts()

unknown    36959
failure     4901
other       1840
success     1511
Name: poutcome, dtype: int64

In [15]:
df['y'].value_counts()

no     39922
yes     5289
Name: y, dtype: int64

A summary about the data:

1. We have 18 columns (features or variables, including `id` and the target `y`) and 45211 rows.
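
The `y` value counts shown earlier also say something about class balance. A small arithmetic sketch using only those two counts; the variable names are illustrative.

```python
# Counts taken from the df['y'].value_counts() output above.
n_no, n_yes = 39922, 5289
n_rows = n_no + n_yes

share_yes = n_yes / n_rows
share_no = n_no / n_rows

print(n_rows)               # 45211
print(round(share_yes, 3))  # 0.117
print(round(share_no, 3))   # 0.883
```

Roughly 12% of the rows are `yes`, so the target is imbalanced.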

**converting days and months into day of year**

In [16]:
df['day'] = df['day'].map(str)

type(df['day'][10])

str

In [17]:
df['month'].unique()

array(['may', 'jun', 'jul', 'aug', 'oct', 'nov', 'dec', 'jan', 'feb',
       'mar', 'apr', 'sep'], dtype=object)

In [18]:
month_mapping = {
    'jan': '1',
    'feb': '2',
    'mar': '3',
    'apr': '4',
    'may': '5',
    'jun': '6',
    'jul': '7', 
    'aug': '8', 
    'sep': '9',
    'oct': '10', 
    'nov': '11', 
    'dec': '12' 
}
df['month'] = df['month'].map(month_mapping)
df.month.unique()

array(['5', '6', '7', '8', '10', '11', '12', '1', '2', '3', '4', '9'],
      dtype=object)

In [19]:
type(df.month[100])

str

In [20]:
df['date_formatted'] = pd.to_datetime(
    dict(         
        year='2025',
        month=df['month'], 
        day=df['day']
    )
)
df.head()

Unnamed: 0,id,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y,date_formatted
0,1001,999.0,management,married,tertiary,no,2143.0,yes,no,unknown,5,5,261,1,-1,0,unknown,no,2025-05-05
1,1002,44.0,technician,single,secondary,no,29.0,yes,no,unknown,5,5,151,1,-1,0,unknown,no,2025-05-05
2,1003,33.0,entrepreneur,married,secondary,no,2.0,yes,yes,unknown,5,5,76,1,-1,0,unknown,no,2025-05-05
3,1004,47.0,blue-collar,married,unknown,no,1506.0,yes,no,unknown,5,5,92,1,-1,0,unknown,no,2025-05-05
4,1005,33.0,unknown,single,unknown,no,1.0,no,no,unknown,5,5,198,1,-1,0,unknown,no,2025-05-05


In [21]:
df['day_of_year']=df['date_formatted'].dt.dayofyear
df.head()

Unnamed: 0,id,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y,date_formatted,day_of_year
0,1001,999.0,management,married,tertiary,no,2143.0,yes,no,unknown,5,5,261,1,-1,0,unknown,no,2025-05-05,125
1,1002,44.0,technician,single,secondary,no,29.0,yes,no,unknown,5,5,151,1,-1,0,unknown,no,2025-05-05,125
2,1003,33.0,entrepreneur,married,secondary,no,2.0,yes,yes,unknown,5,5,76,1,-1,0,unknown,no,2025-05-05,125
3,1004,47.0,blue-collar,married,unknown,no,1506.0,yes,no,unknown,5,5,92,1,-1,0,unknown,no,2025-05-05,125
4,1005,33.0,unknown,single,unknown,no,1.0,no,no,unknown,5,5,198,1,-1,0,unknown,no,2025-05-05,125
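
The conversion above goes through string columns and `pd.to_datetime`. The same mapping can be sketched with the standard library alone, which makes the arithmetic easy to check by hand. The function name `day_of_year` and the integer month table are illustrative; the default year 2025 is the placeholder year used in the cell above.

```python
from datetime import date

MONTH_NUMBER = {
    "jan": 1, "feb": 2, "mar": 3, "apr": 4, "may": 5, "jun": 6,
    "jul": 7, "aug": 8, "sep": 9, "oct": 10, "nov": 11, "dec": 12,
}


def day_of_year(month_abbr, day, year=2025):
    """Day number within `year` (1-based) for a month abbreviation and a day."""
    return date(year, MONTH_NUMBER[month_abbr], day).timetuple().tm_yday


print(day_of_year("may", 5))             # 125
print(day_of_year("nov", 17))            # 321
print(day_of_year("may", 5, year=2024))  # 126 (leap year shifts later dates)
```

Because the dataset has no year column, the chosen year is arbitrary; picking a leap year would shift every date after February by one.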


In [22]:
df.day_of_year.unique()

array([125, 126, 127, 128, 129, 132, 133, 134, 135, 136, 139, 140, 141,
       143, 146, 147, 148, 149, 150, 153, 154, 155, 156, 157, 160, 162,
       163, 167, 168, 169, 170, 171, 174, 175, 176, 177, 178, 181, 182,
       183, 184, 185, 188, 189, 190, 191, 192, 195, 196, 197, 198, 199,
       202, 203, 204, 205, 206, 209, 210, 211, 212, 216, 217, 218, 219,
       220, 223, 224, 225, 226, 230, 231, 232, 233, 234, 237, 238, 239,
       240, 241, 290, 291, 292, 293, 294, 295, 296, 297, 298, 299, 300,
       301, 302, 303, 304, 308, 309, 310, 312, 313, 314, 315, 316, 317,
       318, 321, 322, 323, 324, 325, 326, 331, 338, 339, 341, 342, 343,
       345, 346, 347, 356, 361,  28,  29,  30,  33,  34,  35,  36,  37,
        40,  41,  42,  43,  44,  47,  48,  49,  50,  57,  58,  61,  62,
        63,  64,  65,  68,  69,  70,  71,  72,  75,  76,  77,  78,  79,
        82,  83,  84,  85,  86,  89,  90,  91,  92,  93,  96,  97,  98,
        99, 103, 104, 105, 106, 107, 110, 111, 112, 113, 114, 11

In [23]:
df.dtypes

id                         int64
age                      float64
job                       object
marital                   object
education                 object
default                   object
balance                  float64
housing                   object
loan                      object
contact                   object
day                       object
month                     object
duration                   int64
campaign                   int64
pdays                      int64
previous                   int64
poutcome                  object
y                         object
date_formatted    datetime64[ns]
day_of_year                int64
dtype: object

In [24]:
df = df.drop(columns = ['id','day','month','date_formatted'])
df.dtypes

age            float64
job             object
marital         object
education       object
default         object
balance        float64
housing         object
loan            object
contact         object
duration         int64
campaign         int64
pdays            int64
previous         int64
poutcome        object
y               object
day_of_year      int64
dtype: object

In [25]:
df

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,duration,campaign,pdays,previous,poutcome,y,day_of_year
0,999.0,management,married,tertiary,no,2143.0,yes,no,unknown,261,1,-1,0,unknown,no,125
1,44.0,technician,single,secondary,no,29.0,yes,no,unknown,151,1,-1,0,unknown,no,125
2,33.0,entrepreneur,married,secondary,no,2.0,yes,yes,unknown,76,1,-1,0,unknown,no,125
3,47.0,blue-collar,married,unknown,no,1506.0,yes,no,unknown,92,1,-1,0,unknown,no,125
4,33.0,unknown,single,unknown,no,1.0,no,no,unknown,198,1,-1,0,unknown,no,125
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
45206,51.0,technician,married,tertiary,no,825.0,no,no,cellular,977,3,-1,0,unknown,yes,321
45207,71.0,retired,divorced,primary,no,1729.0,no,no,cellular,456,2,-1,0,unknown,yes,321
45208,72.0,retired,married,secondary,no,5715.0,no,no,cellular,1127,5,184,3,success,yes,321
45209,57.0,blue-collar,married,secondary,no,668.0,no,no,telephone,508,4,-1,0,unknown,no,321


Before splitting, we still need to convert the yes/no columns (`default`, `housing` and `loan`) into 1s and 0s.

In [26]:
df.dtypes

age            float64
job             object
marital         object
education       object
default         object
balance        float64
housing         object
loan            object
contact         object
duration         int64
campaign         int64
pdays            int64
previous         int64
poutcome        object
y               object
day_of_year      int64
dtype: object

I think all yes/no columns will be converted into numerical ones by one-hot encoding (OHE), so I don't have to do it now. I only have to do it for the target variable.
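
If an explicit 0/1 encoding is preferred over leaving these columns to one-hot encoding, a minimal sketch with pandas `Series.map` could look like this. The toy rows are made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "default": ["no", "no", "yes"],
    "housing": ["yes", "no", "yes"],
    "loan": ["no", "yes", "no"],
})

yes_no_cols = ["default", "housing", "loan"]
encoded = toy.copy()
for col in yes_no_cols:
    # Values other than 'yes'/'no' would become NaN under this mapping.
    encoded[col] = toy[col].map({"yes": 1, "no": 0})

print(encoded["housing"].tolist())  # [1, 0, 1]
```

One-hot encoding a yes/no column instead produces two complementary indicator columns, which carry the same information as a single 0/1 column.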

In [27]:
(df.y == 'no').head()

0    True
1    True
2    True
3    True
4    True
Name: y, dtype: bool

In [28]:
df.y.unique()

array(['no', 'yes'], dtype=object)

In [29]:
df.y = (df.y == 'yes').astype(int)
df.y

0        0
1        0
2        0
3        0
4        0
        ..
45206    1
45207    1
45208    1
45209    0
45210    0
Name: y, Length: 45211, dtype: int64

In [30]:
df.y.unique()

array([0, 1])

In [31]:
df.dtypes

age            float64
job             object
marital         object
education       object
default         object
balance        float64
housing         object
loan            object
contact         object
duration         int64
campaign         int64
pdays            int64
previous         int64
poutcome        object
y                int64
day_of_year      int64
dtype: object

In [32]:
df

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,duration,campaign,pdays,previous,poutcome,y,day_of_year
0,999.0,management,married,tertiary,no,2143.0,yes,no,unknown,261,1,-1,0,unknown,0,125
1,44.0,technician,single,secondary,no,29.0,yes,no,unknown,151,1,-1,0,unknown,0,125
2,33.0,entrepreneur,married,secondary,no,2.0,yes,yes,unknown,76,1,-1,0,unknown,0,125
3,47.0,blue-collar,married,unknown,no,1506.0,yes,no,unknown,92,1,-1,0,unknown,0,125
4,33.0,unknown,single,unknown,no,1.0,no,no,unknown,198,1,-1,0,unknown,0,125
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
45206,51.0,technician,married,tertiary,no,825.0,no,no,cellular,977,3,-1,0,unknown,1,321
45207,71.0,retired,divorced,primary,no,1729.0,no,no,cellular,456,2,-1,0,unknown,1,321
45208,72.0,retired,married,secondary,no,5715.0,no,no,cellular,1127,5,184,3,success,1,321
45209,57.0,blue-collar,married,secondary,no,668.0,no,no,telephone,508,4,-1,0,unknown,0,321


## Splitting the data (creating the validation framework)

In [33]:
from sklearn.model_selection import train_test_split

In [34]:
df_full_train, df_test = train_test_split(df, test_size = 0.2, random_state = 11)
df_train, df_val = train_test_split(df_full_train, test_size = 0.25, random_state = 11)
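
The two calls give a roughly 60/20/20 split: 20% goes to test, then 25% of the remaining 80% (that is, 20% of the total) goes to validation. The expected sizes can be checked with plain arithmetic. The rounding rule in the comment is an assumption about how scikit-learn handles a fractional `test_size`.

```python
import math

n_rows = 45211  # from df.shape

# Assumption: when only test_size is given as a fraction, scikit-learn rounds
# the test part up and gives the remainder to the train part.
n_test = math.ceil(0.2 * n_rows)
n_full_train = n_rows - n_test
n_val = math.ceil(0.25 * n_full_train)
n_train = n_full_train - n_val

print(n_train, n_val, n_test)       # 27126 9042 9043
print(round(n_train / n_rows, 2))   # 0.6
```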

In [35]:
df_train = df_train.reset_index(drop = True)
df_val   = df_val.reset_index(drop = True)
df_test = df_test.reset_index(drop = True)

In [36]:
y_train = df_train.y.values
y_val = df_val.y.values
y_test = df_test.y.values
y_full_train = df_full_train.y.values

In [37]:
del df_train['y']
del df_val['y']
del df_test['y']
del df_full_train['y']

In [38]:
df_train

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,duration,campaign,pdays,previous,poutcome,day_of_year
0,39.0,unemployed,married,primary,no,590.0,yes,no,cellular,190,2,-1,0,unknown,35
1,53.0,management,divorced,secondary,no,1355.0,no,yes,cellular,447,2,196,8,other,35
2,53.0,services,divorced,primary,no,0.0,no,yes,cellular,206,1,-1,0,unknown,189
3,35.0,technician,married,tertiary,no,1473.0,yes,no,unknown,84,3,-1,0,unknown,132
4,53.0,unemployed,divorced,tertiary,no,0.0,yes,yes,cellular,140,2,-1,0,unknown,220
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
27121,31.0,student,single,unknown,no,2882.0,yes,no,cellular,158,2,-1,0,unknown,126
27122,39.0,management,married,tertiary,no,5060.0,yes,no,cellular,157,4,-1,0,unknown,232
27123,30.0,blue-collar,married,primary,no,935.0,yes,no,cellular,96,2,-1,0,unknown,36
27124,35.0,management,married,tertiary,no,2123.0,yes,no,cellular,249,3,-1,0,unknown,202


In [39]:
df_train.pdays.value_counts()

-1      22096
 182      113
 92        82
 183       78
 91        74
        ...  
 492        1
 465        1
 541        1
 54         1
 485        1
Name: pdays, Length: 498, dtype: int64

In [40]:
df_train.shape

(27126, 15)

In [41]:
22096/27126

0.8145690481456905

**About 81% of the values in `pdays` are `-1`. I have to do something about it, and I can't simply remove those rows.**

Options for `-1` in `pdays`:

- Replace `-1` with a very large number, e.g. 999999, which has the same effect as if those customers had not been contacted for a very long time.
- Remove the column entirely.

Choice 1: replace `-1` in `pdays` with `999999999`.

In [42]:
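# Note: '999999999' is a string here, so `pdays` becomes a mixed-type (object)
# column; DictVectorizer later encodes the string as a separate indicator feature.
df_train['pdays'] = df_train['pdays'].replace([-1], '999999999')
df_test['pdays'] = df_test['pdays'].replace([-1], '999999999')
df_val['pdays'] = df_val['pdays'].replace([-1], '999999999')
df_full_train['pdays'] = df_full_train['pdays'].replace([-1], '999999999')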
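
The replacement value in the cell above is written as a string. A small sketch on a toy Series (values made up) of how that differs from an integer sentinel:

```python
import pandas as pd

pdays = pd.Series([-1, 182, -1, 92], name="pdays")

as_string = pdays.replace([-1], "999999999")  # sentinel written as text
as_int = pdays.replace([-1], 999999999)       # sentinel written as a number

print(as_string.tolist())  # ['999999999', 182, '999999999', 92]
print(as_int.tolist())     # [999999999, 182, 999999999, 92]
print(as_string.dtype, as_int.dtype)
```

With the string, the column holds a mix of text and integers; with the integer, it stays purely numeric. Which of the two is wanted depends on whether "never contacted" should act as a category or as a very large number of days.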

## Dealing with missing values (before training)

Options:
- fill with zero, mean, or median
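
The cells below fill with zero. As a sketch of the median option, assuming the statistic is computed on the train part only and then reused for the other parts (the toy ages are made up):

```python
import pandas as pd

train_age = pd.Series([39.0, 53.0, None, 35.0, 53.0])
val_age = pd.Series([31.0, None, 44.0])

# Compute the statistic on the training part only, then reuse it everywhere,
# so that nothing from validation or test leaks into preprocessing.
age_median = train_age.median()

train_filled = train_age.fillna(age_median)
val_filled = val_age.fillna(age_median)

print(age_median)           # 46.0
print(val_filled.tolist())  # [31.0, 46.0, 44.0]
```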

In [43]:
df_train.isnull().sum()

age            4
job            0
marital        0
education      0
default        0
balance        1
housing        0
loan           0
contact        0
duration       0
campaign       0
pdays          0
previous       0
poutcome       0
day_of_year    0
dtype: int64

In [44]:
df_train['age'] = df_train['age'].fillna(0)
df_val['age'] = df_val['age'].fillna(0)
df_test['age'] = df_test['age'].fillna(0)
df_full_train['age'] = df_full_train['age'].fillna(0)

df_train['balance'] = df_train['balance'].fillna(0)
df_val['balance'] = df_val['balance'].fillna(0)
df_test['balance'] = df_test['balance'].fillna(0)
df_full_train['balance'] = df_full_train['balance'].fillna(0)

In [45]:
df_train.isnull().sum()

age            0
job            0
marital        0
education      0
default        0
balance        0
housing        0
loan           0
contact        0
duration       0
campaign       0
pdays          0
previous       0
poutcome       0
day_of_year    0
dtype: int64

## Training the first model: logistic regression

In [46]:
y_train

array([0, 0, 0, ..., 0, 0, 0])

In [47]:
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

In [48]:
train_dicts = df_train.to_dict(orient = 'records')  # list of dicts, one per row
dv = DictVectorizer(sparse = False)
X_train = dv.fit_transform(train_dicts)

In [49]:
X_train.shape

(27126, 40)

In [50]:
df_train.shape

(27126, 15)
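
A toy sketch of why 15 columns turn into 40: `DictVectorizer` passes numeric values through as one column each and gives every distinct string value its own indicator column. The two records are made up; `dv_toy` and `X_toy` are illustrative names.

```python
from sklearn.feature_extraction import DictVectorizer

records = [
    {"marital": "married", "duration": 190, "pdays": "999999999"},
    {"marital": "single", "duration": 447, "pdays": 196},
]

dv_toy = DictVectorizer(sparse=False)
X_toy = dv_toy.fit_transform(records)

print(dv_toy.feature_names_)
# ['duration', 'marital=married', 'marital=single', 'pdays', 'pdays=999999999']
print(X_toy.shape)  # (2, 5)
```

Counting from the `nunique` output, the eight categorical columns have 32 distinct values in total and there are seven numeric columns, which gives 39. The string sentinel in `pdays` would presumably account for the 40th column, assuming every category occurs in the train part.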

In [51]:
type(X_train)

numpy.ndarray

In [52]:
type(y_train)

numpy.ndarray

In [53]:
model = LogisticRegression(solver="liblinear", C=1.0, max_iter=1000, random_state=42)
model.fit(X_train, y_train)

In [54]:
val_dicts = df_val.to_dict(orient='records')
X_val = dv.transform(val_dicts)

In [57]:
y_pred = model.predict_proba(X_val)[:,1]
y_pred

array([0.06636627, 0.22304209, 0.02359922, ..., 0.11971133, 0.02255976,
       0.01398064])
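
For a binary logistic regression, these probabilities are the sigmoid of a linear score `w . x + b`. A minimal NumPy sketch with hypothetical weights, bias and feature rows; in the fitted model, `model.coef_[0]` and `model.intercept_[0]` would play the roles of `w` and `b`.

```python
import numpy as np


def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


# Hypothetical weights, bias and feature rows, for illustration only.
w = np.array([0.5, -1.0])
b = -0.5
X = np.array([[1.0, 0.0],
              [1.0, 2.0]])

scores = X @ w + b           # linear part: w . x + b
proba_yes = sigmoid(scores)  # probability of class 1
print(proba_yes.round(3).tolist())  # [0.5, 0.119]
```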

In [58]:
above_average_predict = (y_pred >= 0.5).astype(int)
above_average_predict

array([0, 0, 0, ..., 0, 0, 0])

In [59]:
y_pred_binary = above_average_predict

In [60]:
y_pred_binary

array([0, 0, 0, ..., 0, 0, 0])

In [61]:
y_val

array([0, 1, 0, ..., 0, 0, 0])

In [64]:
accuracy = (y_pred_binary == y_val).mean()
round(accuracy,3)

0.901
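
To put an accuracy figure in context, it helps to compare it with a majority-class baseline. A sketch on toy arrays (not taken from the notebook), using the same 0.5 cut-off:

```python
import numpy as np

# Toy probabilities and labels, for illustration only.
y_prob = np.array([0.07, 0.22, 0.61, 0.12, 0.90])
y_true = np.array([0, 1, 1, 0, 1])

y_hat = (y_prob >= 0.5).astype(int)  # same 0.5 cut-off as above
accuracy = float((y_hat == y_true).mean())

# Baseline: always predict the more frequent class.
majority_class = int(np.bincount(y_true).argmax())
baseline = float((y_true == majority_class).mean())

print(y_hat.tolist())      # [0, 0, 1, 0, 1]
print(round(accuracy, 3))  # 0.8
print(round(baseline, 3))  # 0.6
```

In the full dataset, 39922 of 45211 rows (about 88%) are `no`, so the 0.901 accuracy is best judged against a baseline near 0.88 rather than against 0.5. The share in the validation part may differ slightly.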