### Step 1: Data Understanding and Preparation
By using *df.shape* we can see the full size of the dataset.

By using *df.dtypes* we can see what data type the variables are.

Every variable is stored either as an object (string) or as an int64. For example, "day" and "duration" are integers, while "month" is an object.

In [8]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("bank-full.csv", sep=";")

print(df.shape)
print(df.dtypes)

(45211, 17)
age           int64
job          object
marital      object
education    object
default      object
balance       int64
housing      object
loan         object
contact      object
day           int64
month        object
duration      int64
campaign      int64
pdays         int64
previous      int64
poutcome     object
y            object
dtype: object
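
To make the object/integer split explicit, the column names can be grouped by dtype with `select_dtypes`. The sketch below is only an illustration: `sample` is a hypothetical hand-made frame that borrows a few column names from the dataset, so it runs without the CSV file.

```python
import pandas as pd

# Hypothetical stand-in for df, with a few of the same column names.
sample = pd.DataFrame({
    "age": [58, 44, 33],
    "job": ["management", "technician", "unknown"],
    "balance": [2143, 29, 0],
    "month": ["may", "may", "jun"],
})

# Numeric columns (int64 here) versus everything else (the text columns).
numeric_cols = sample.select_dtypes(include="number").columns.tolist()
categorical_cols = sample.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)
print(categorical_cols)
```

On the real data, replacing `sample` with `df` should give the same kind of split as the `df.dtypes` listing above.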


### Step 2: Data Cleaning

We create a few helper variables to count special values in the dataframe.
By using *df.isnull()* we see that there are no missing (null) values in the dataset.

By using *df=="unknown"* we see that the "education" column has 1857 unknown values and the "job" column has 288. The "contact" and "poutcome" columns have 13020 and 36959 unknown values respectively.
We are choosing to keep these rows and treat "unknown" as its own category.
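
As a minimal sketch of what treating "unknown" as its own category can look like, the example below converts a column to pandas' categorical dtype. The `education` series is hypothetical sample data, not values taken from the dataset.

```python
import pandas as pd

# Hypothetical sample of an "education"-like column, not the real data.
education = pd.Series(
    ["tertiary", "secondary", "unknown", "secondary", "unknown", "primary"],
    name="education",
)

# Converting to a categorical dtype keeps "unknown" as a regular level.
education_cat = education.astype("category")
levels = list(education_cat.cat.categories)

# value_counts then reports "unknown" alongside the other levels.
counts = education_cat.value_counts()
print(levels)
print(counts["unknown"])
```

Nothing is imputed or dropped here: "unknown" is simply one more level that a later encoding step can handle like any other.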

By using *df==0* we see that 36954 cells in the "previous" column are 0, meaning those customers had not been contacted before this campaign.
There are also 3514 cells in the "balance" column that are 0. We keep these rows for now, since no rows are dropped at this stage.

By using *df==-1* we look for the value -1, which in the "pdays" column marks clients who were not previously contacted. We will remove the "pdays" column, as we do not think it is a good predictor compared with the "previous" column.
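
One way to check how much "pdays" overlaps with "previous" is to compare the two "not contacted" flags directly. This is a sketch under assumptions: `sample` holds hypothetical rows, and the result on the real dataframe may differ.

```python
import pandas as pd

# Hypothetical rows; on the real data, use df in place of sample.
sample = pd.DataFrame({
    "pdays": [-1, -1, 151, 91, -1],
    "previous": [0, 0, 3, 1, 0],
})

# Compare against the integer -1, since "pdays" is an int64 column.
never_contacted = sample["pdays"] == -1
no_previous = sample["previous"] == 0

# Cross-tabulate the flags; off-diagonal cells are rows where they disagree.
agreement = pd.crosstab(never_contacted, no_previous)
flags_match = bool((never_contacted == no_previous).all())

print(agreement)
print(flags_match)
```

If the two flags agree on nearly every row, the "not contacted" signal in "pdays" is already captured by "previous".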



In [9]:
df_null = (df.isnull()).sum()
print("This is how many values are empty:")
print(df_null)

df_unknown = (df == "unknown").sum()
print("This is how many values are unknown:")
print(df_unknown)

df_0 = (df == 0).sum()
print("This is how many values are 0:")
print(df_0)

df_1 = (df == -1).sum()  # compare with the integer -1; "pdays" is int64, so the string "-1" never matches
print("This is how many values are -1:")
print(df_1)

This is how many values are empty:
age          0
job          0
marital      0
education    0
default      0
balance      0
housing      0
loan         0
contact      0
day          0
month        0
duration     0
campaign     0
pdays        0
previous     0
poutcome     0
y            0
dtype: int64
This is how many values are unknown:
age              0
job            288
marital          0
education     1857
default          0
balance          0
housing          0
loan             0
contact      13020
day              0
month            0
duration         0
campaign         0
pdays            0
previous         0
poutcome     36959
y                0
dtype: int64
This is how many values are 0:
age              0
job              0
marital          0
education        0
default          0
balance       3514
housing          0
loan             0
contact          0
day              0
month            0
duration         3
campaign         0
pdays            0
previous     36954
poutcome

We are removing the "poutcome", "pdays", "contact", and "duration" columns.
We are not removing any rows for now, since there are no blank values.
We will deal with the outliers later. 😼

In [7]:
df.drop(columns=['poutcome', 'pdays', 'contact', 'duration'], inplace=True)
print(df.shape)

(45211, 13)
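
Because the drop above uses `inplace=True`, running the cell a second time raises a `KeyError` once the columns are gone. A possible alternative, sketched on a hypothetical `sample` frame, passes `errors="ignore"` and assigns the result to a new variable.

```python
import pandas as pd

# Hypothetical frame with some of the columns named above.
sample = pd.DataFrame({
    "age": [58, 44],
    "pdays": [-1, -1],
    "duration": [261, 151],
    "y": ["no", "no"],
})

to_drop = ["poutcome", "pdays", "contact", "duration"]

# errors="ignore" skips labels that are not present, so re-running is safe.
trimmed = sample.drop(columns=to_drop, errors="ignore")
trimmed_again = trimmed.drop(columns=to_drop, errors="ignore")

print(trimmed.shape)
print(list(trimmed_again.columns))
```

Keeping the original frame untouched also makes it easier to revisit the dropped columns later if needed.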
