# Mid-term-project | ML-ZOOM-CAMP | predict-term deposit

List of contents:

1. Problem description
2. Getting the dataset
3. Reading the dataset with pandas

## 1. Problem Description 

## 2. Getting the dataset

- link to dataset: [https://www.kaggle.com/datasets/aslanahmedov/predict-term-deposit](https://www.kaggle.com/datasets/aslanahmedov/predict-term-deposit)

or,

[https://raw.githubusercontent.com/bhasarma/mlcoursezoom-camp/main/WK08-09-midterm-project/predict-term-deposit-data.csv](https://raw.githubusercontent.com/bhasarma/mlcoursezoom-camp/main/WK08-09-midterm-project/predict-term-deposit-data.csv)

You can download the dataset into your local directory with `wget` 

In [1]:
data = 'https://raw.githubusercontent.com/bhasarma/mlcoursezoom-camp/main/WK08-09-midterm-project/predict-term-deposit-data.csv'

In [2]:
#!wget $data  # uncomment if you haven't downloaded the data already

In [3]:
ls

notebook.ipynb  predict-term-deposit-data.csv  README.md  report.md
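
If `wget` is not available, the same download can be done from Python. The following is a minimal sketch using only the standard library; the helper name `download_if_missing` is hypothetical and not part of the original notebook.

```python
from pathlib import Path
from urllib.request import urlretrieve


def download_if_missing(url, filename):
    """Download `url` to `filename` unless the file already exists.

    Returns the local path in both cases.
    """
    path = Path(filename)
    if not path.exists():
        # urlretrieve writes the response body straight to disk.
        urlretrieve(url, str(path))
    return path
```

With the `data` variable defined above, calling `download_if_missing(data, 'predict-term-deposit-data.csv')` should leave the CSV in the working directory and skip the download on later runs.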


## 3. Reading the dataset with pandas 

In [4]:
%matplotlib inline

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [5]:
df = pd.read_csv('predict-term-deposit-data.csv')

In [6]:
df.head().T

Unnamed: 0,0,1,2,3,4
Id,1001,1002,1003,1004,1005
age,999.0,44.0,33.0,47.0,33.0
job,management,technician,entrepreneur,blue-collar,unknown
marital,married,single,married,married,single
education,tertiary,secondary,secondary,unknown,unknown
default,no,no,no,no,no
balance,2143.0,29.0,2.0,1506.0,1.0
housing,yes,yes,yes,yes,no
loan,no,no,yes,no,no
contact,unknown,unknown,unknown,unknown,unknown


In [7]:
df.shape

(45211, 18)

In [8]:
df.columns

Index(['Id', 'age', 'job', 'marital', 'education', 'default', 'balance',
       'housing', 'loan', 'contact', 'day', 'month', 'duration', 'campaign',
       'pdays', 'previous', 'poutcome', 'y'],
      dtype='object')

Checklist:
1. Make feature names and values consistent: all lowercase, with underscores separating words.
2. Check that each feature's data type makes sense, e.g. age should be an integer, not a string.

In [9]:
df.columns = df.columns.str.lower()
df.head().T

Unnamed: 0,0,1,2,3,4
id,1001,1002,1003,1004,1005
age,999.0,44.0,33.0,47.0,33.0
job,management,technician,entrepreneur,blue-collar,unknown
marital,married,single,married,married,single
education,tertiary,secondary,secondary,unknown,unknown
default,no,no,no,no,no
balance,2143.0,29.0,2.0,1506.0,1.0
housing,yes,yes,yes,yes,no
loan,no,no,yes,no,no
contact,unknown,unknown,unknown,unknown,unknown


* Column names are now consistently lowercase.
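
The checklist also asks for consistent values, not just consistent column names. A possible way to do both is sketched below on a toy frame; the column names and values in `toy` are made up for illustration and do not come from the dataset.

```python
import pandas as pd

# Toy frame with inconsistent naming (hypothetical values).
toy = pd.DataFrame({
    "Job Type": ["Blue Collar", "self employed"],
    "Age": [44, 33],
})

# Lowercase the column names and replace spaces with underscores.
toy.columns = toy.columns.str.lower().str.replace(" ", "_", regex=False)

# Apply the same normalization to the values of every text column.
text_columns = [c for c in toy.columns if pd.api.types.is_string_dtype(toy[c])]
for column in text_columns:
    toy[column] = toy[column].str.lower().str.replace(" ", "_", regex=False)

print(toy)
```

Numeric columns such as `age` are left untouched because only text columns are selected.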

In [10]:
df.dtypes

id             int64
age          float64
job           object
marital       object
education     object
default       object
balance      float64
housing       object
loan          object
contact       object
day            int64
month         object
duration       int64
campaign       int64
pdays          int64
previous       int64
poutcome      object
y             object
dtype: object

**to do:**

- Remove the `id` feature: a customer id assigned by the bank *logically* has no influence on the outcome.
- Convert categorical variables with one-hot encoding (fit only on the train part, after splitting? check notes to be sure).
- Convert `yes`/`no` columns to 1s and 0s.
- Find a way to handle the `day` and `month` columns.
- Decide what to do with the `-1` values in the `pdays` feature.
- Understand the difference between the `poutcome` and `y` features.
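
Some of these items can be sketched on a toy frame. The rows below are made up, and treating `-1` in `pdays` as "not contacted before" is an assumption to check against the dataset description; the `previously_contacted` column name is hypothetical.

```python
import pandas as pd

# Toy rows that mimic a few of the columns listed above (values are made up).
toy = pd.DataFrame({
    "id": [1001, 1002, 1003],
    "default": ["no", "yes", "no"],
    "housing": ["yes", "yes", "no"],
    "loan": ["no", "no", "yes"],
    "pdays": [-1, 184, -1],
    "y": ["no", "yes", "no"],
})

# Drop the bank-assigned identifier.
toy = toy.drop(columns=["id"])

# Map the yes/no columns to 1/0.
binary_columns = ["default", "housing", "loan", "y"]
for column in binary_columns:
    toy[column] = toy[column].map({"yes": 1, "no": 0})

# Assumption: -1 in pdays is a sentinel for "never contacted before",
# so keep that information as a separate 0/1 flag.
toy["previously_contacted"] = (toy["pdays"] != -1).astype(int)

print(toy)
```

Keeping the flag next to `pdays` is one option among several; the raw `-1` could also be replaced or binned once its meaning is confirmed.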

In [11]:
df.describe()

Unnamed: 0,id,age,balance,day,duration,campaign,pdays,previous
count,45211.0,45202.0,45208.0,45211.0,45211.0,45211.0,45211.0,45211.0
mean,23606.0,40.954714,1362.34662,15.806419,258.16308,2.763841,40.197828,0.580323
std,13051.435847,11.539144,3044.852387,8.322476,257.527812,3.098021,100.128746,2.303441
min,1001.0,-1.0,-8019.0,1.0,0.0,1.0,-1.0,0.0
25%,12303.5,33.0,72.0,8.0,103.0,1.0,-1.0,0.0
50%,23606.0,39.0,448.0,16.0,180.0,2.0,-1.0,0.0
75%,34908.5,48.0,1428.0,21.0,319.0,3.0,-1.0,0.0
max,46211.0,999.0,102127.0,31.0,4918.0,63.0,871.0,275.0


In [12]:
df.isnull().sum()

id           0
age          9
job          0
marital      0
education    0
default      0
balance      3
housing      0
loan         0
contact      0
day          0
month        0
duration     0
campaign     0
pdays        0
previous     0
poutcome     0
y            0
dtype: int64

In [13]:
df.nunique()

id           45211
age             79
job             12
marital          3
education        4
default          2
balance       7168
housing          2
loan             2
contact          3
day             31
month           12
duration      1573
campaign        48
pdays          559
previous        41
poutcome         4
y                2
dtype: int64

In [14]:
df['poutcome'].value_counts()

unknown    36959
failure     4901
other       1840
success     1511
Name: poutcome, dtype: int64

In [15]:
df['y'].value_counts()

no     39922
yes     5289
Name: y, dtype: int64

A summary about the data:

1. We have 18 features (or variables, or columns) and 45211 rows.

**Converting day and month into day of year**

In [16]:
df

Unnamed: 0,id,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y
0,1001,999.0,management,married,tertiary,no,2143.0,yes,no,unknown,5,may,261,1,-1,0,unknown,no
1,1002,44.0,technician,single,secondary,no,29.0,yes,no,unknown,5,may,151,1,-1,0,unknown,no
2,1003,33.0,entrepreneur,married,secondary,no,2.0,yes,yes,unknown,5,may,76,1,-1,0,unknown,no
3,1004,47.0,blue-collar,married,unknown,no,1506.0,yes,no,unknown,5,may,92,1,-1,0,unknown,no
4,1005,33.0,unknown,single,unknown,no,1.0,no,no,unknown,5,may,198,1,-1,0,unknown,no
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
45206,46207,51.0,technician,married,tertiary,no,825.0,no,no,cellular,17,nov,977,3,-1,0,unknown,yes
45207,46208,71.0,retired,divorced,primary,no,1729.0,no,no,cellular,17,nov,456,2,-1,0,unknown,yes
45208,46209,72.0,retired,married,secondary,no,5715.0,no,no,cellular,17,nov,1127,5,184,3,success,yes
45209,46210,57.0,blue-collar,married,secondary,no,668.0,no,no,telephone,17,nov,508,4,-1,0,unknown,no


In [17]:
df['day'] = df['day'].map(str)

type(df['day'][10])

str

In [18]:
df['month'].unique()

array(['may', 'jun', 'jul', 'aug', 'oct', 'nov', 'dec', 'jan', 'feb',
       'mar', 'apr', 'sep'], dtype=object)

In [19]:
month_mapping = {
    'jan': '1',
    'feb': '2',
    'mar': '3',
    'apr': '4',
    'may': '5',
    'jun': '6',
    'jul': '7', 
    'aug': '8', 
    'sep': '9',
    'oct': '10', 
    'nov': '11', 
    'dec': '12' 
}
df['month'] = df['month'].map(month_mapping)
df.month.unique()

array(['5', '6', '7', '8', '10', '11', '12', '1', '2', '3', '4', '9'],
      dtype=object)

In [20]:
type(df.month[100])

str

In [None]:
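# Build a date from the day and month columns of df.
# The year is a placeholder, since the dataset has no year column.
df['date_formatted'] = pd.to_datetime(
    dict(
        year='2025',
        month=df['month'],
        day=df['day']
    )
)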

In [21]:
df_new = df[['day','month']].iloc[0:15].copy()  # copy, so later assignments do not touch a view of df

In [22]:
df_new['day']=df_new['day'].map(str)

In [23]:
df_new['month'] = '5'

In [24]:
df_new['date_formatted'] = pd.to_datetime(
    dict(
        year='2019',  # placeholder year; the dataset has no year column
        month=df_new['month'], 
        day=df_new['day']
    )
)

In [25]:
df_new['day_of_year']=df_new['date_formatted'].dt.dayofyear
df_new

Unnamed: 0,day,month,date_formatted,day_of_year
0,5,5,2019-05-05,125
1,5,5,2019-05-05,125
2,5,5,2019-05-05,125
3,5,5,2019-05-05,125
4,5,5,2019-05-05,125
5,5,5,2019-05-05,125
6,5,5,2019-05-05,125
7,5,5,2019-05-05,125
8,5,5,2019-05-05,125
9,5,5,2019-05-05,125
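
The steps above can be condensed into a single conversion. This is a sketch on a toy frame, assuming a placeholder non-leap year (2019, as in the cells above) because the dataset has no year column; `errors="coerce"` is an added safeguard that turns impossible dates into `NaT` instead of raising.

```python
import pandas as pd

# Toy rows in the same shape as the original day/month columns.
toy = pd.DataFrame({"day": [5, 17, 31], "month": ["may", "nov", "dec"]})

# Map month abbreviations to month numbers 1..12.
month_names = ["jan", "feb", "mar", "apr", "may", "jun",
               "jul", "aug", "sep", "oct", "nov", "dec"]
month_to_number = {name: number for number, name in enumerate(month_names, start=1)}

# Assemble dates from year/month/day columns; 2019 is a placeholder year.
parts = pd.DataFrame({
    "year": 2019,
    "month": toy["month"].map(month_to_number),
    "day": toy["day"],
})
dates = pd.to_datetime(parts, errors="coerce")

toy["day_of_year"] = dates.dt.dayofyear
print(toy)
```

For 5 May this gives 125, matching the output above. Working with integers avoids the intermediate string conversions of `day` and `month`.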


## Splitting the data (creating the validation framework)