# Data cleaning: bank.csv
In this exercise, we will clean a dataset using the following techniques:
- Converting columns to appropriate data types
- Dealing with outliers (values that deviate from the rest of the data distribution)
- Discarding unnecessary features (columns that are not useful)
- Dealing with missing values in both rows and columns
- Remapping categories

In [2]:
# Importing the necessary modules
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from datetime import datetime as dt

In [3]:
# Reading the dataset
df = pd.read_csv("bank-full.csv", sep = ';')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45211 entries, 0 to 45210
Data columns (total 17 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   age        45211 non-null  int64 
 1   job        45211 non-null  object
 2   marital    45211 non-null  object
 3   education  45211 non-null  object
 4   default    45211 non-null  object
 5   balance    45211 non-null  int64 
 6   housing    45211 non-null  object
 7   loan       45211 non-null  object
 8   contact    45211 non-null  object
 9   day        45211 non-null  int64 
 10  month      45211 non-null  object
 11  duration   45211 non-null  int64 
 12  campaign   45211 non-null  int64 
 13  pdays      45211 non-null  int64 
 14  previous   45211 non-null  int64 
 15  poutcome   45211 non-null  object
 16  y          45211 non-null  object
dtypes: int64(7), object(10)
memory usage: 5.9+ MB


## Convert data to the right data types
The following columns have to be converted from object to category:
- job
- marital
- education
- default
- housing
- loan
- contact
- poutcome
- y
- month

In [8]:
# List the columns to convert to the category dtype
# (each column is listed once)
cols = ['job', 'marital', 'education', 'default', 'housing',
        'loan', 'contact', 'poutcome', 'y', 'month']

# Convert each column to a category
for col in cols:
    df[col] = df[col].astype('category')

df[cols].dtypes

job          category
marital      category
education    category
default      category
housing      category
loan         category
contact      category
poutcome     category
y            category
month        category
dtype: object
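
One reason to prefer the category dtype for columns with few distinct values is memory use. The sketch below is not part of the exercise: it uses a small made-up frame (`toy`, with invented values) to compare the footprint before and after conversion.

```python
import pandas as pd

# Made-up data standing in for low-cardinality columns such as 'marital'
toy = pd.DataFrame({
    "marital": ["married", "single", "divorced", "married"] * 2500,
    "housing": ["yes", "no", "no", "yes"] * 2500,
})

# deep=True also counts the memory used by the Python strings
before = toy.memory_usage(deep=True).sum()

toy_cat = toy.astype("category")
after = toy_cat.memory_usage(deep=True).sum()

print(toy_cat.dtypes)
print(f"bytes before: {before}, bytes after: {after}")
```

The exact numbers depend on the pandas version, but the categorical frame should be noticeably smaller because each value is stored as a small integer code.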

## Dealing with outliers
The following columns may contain outliers that need to be dealt with:
- age
- balance
- day
- campaign
- pdays
- previous

In [None]:
cols = ['age', 'balance', 'day', 'campaign', 'pdays', 'previous']

for col in cols:
    print(df[col].describe())
# campaign, pdays, previous

Based on the summary statistics, the columns campaign, pdays and previous contain extreme outliers that need to be dealt with.
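
Reading `describe()` output by eye works, but the judgment can also be made explicit. The sketch below uses the common 1.5 × IQR rule, which is an assumption here and not the method used in this exercise; `iqr_outlier_share` and the sample values are hypothetical.

```python
import pandas as pd

def iqr_outlier_share(series: pd.Series, k: float = 1.5) -> float:
    """Return the fraction of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    is_outlier = (series < lower) | (series > upper)
    return float(is_outlier.mean())

# Made-up values with one obvious extreme
toy = pd.Series([1, 1, 2, 2, 2, 3, 3, 4, 5, 63])
share = iqr_outlier_share(toy)
print(f"share of outliers: {share:.0%}")
```

Applied per column, a function like this gives a single number to compare columns by, instead of scanning min, max and quartiles manually.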

### Campaign

In [None]:
# Dealing with campaign column
# Plot a histogram
sns.histplot(df['campaign'], bins = 10)

# Most campaign values lie between 0 and 10. Some outliers are visible between 10 and 30, and values beyond that
# are extreme. To fix this, rows with a campaign value higher than 10 will be removed.

In [None]:
# Fixing the campaign column
# Keep only the rows where campaign is 10 or lower
# .copy() makes df2 an independent DataFrame, so later assignments
# do not trigger a SettingWithCopyWarning
campaign = df['campaign']
df2 = df.loc[campaign <= 10].copy()

sns.histplot(df2['campaign'], bins = 10)
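
Dropping rows costs data, so it can help to report how many rows a cut-off removes. This is a minimal sketch on made-up values; `threshold` mirrors the cut-off of 10 used above and the other names are hypothetical.

```python
import pandas as pd

# Made-up campaign counts
toy = pd.DataFrame({"campaign": [1, 2, 3, 8, 10, 11, 25, 63]})
threshold = 10

# Boolean mask of the rows to keep
keep = toy["campaign"] <= threshold
filtered = toy.loc[keep].copy()

removed = int((~keep).sum())
print(f"removed {removed} of {len(toy)} rows ({removed / len(toy):.1%})")
```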

### Pdays

In [None]:
# pdays is the number of days that passed since the client was last contacted in a previous campaign
# A value of -1 means the client was not previously contacted
# This means that the value -1 needs to be replaced with NaN

print(df2['pdays'].describe())

# Replacing the values with NaN
# Cast to float first, because NaN cannot be stored in an int64 column
df2['pdays'] = df2['pdays'].astype('float64')
df2.loc[df2['pdays'] < 0, 'pdays'] = np.nan

print(df2['pdays'].describe())
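
An equivalent way to turn a sentinel value into a missing value is `Series.where`, which keeps values where a condition holds and inserts NaN elsewhere. The sketch below runs on made-up data and is only an alternative formulation, not the code used in this exercise.

```python
import pandas as pd

# Made-up pdays values; -1 marks "not previously contacted"
toy = pd.DataFrame({"pdays": [-1, 151, -1, 92, -1]})

# Keep non-negative values, replace everything else with NaN
# (the column is upcast from int64 to float64 in the process)
toy["pdays"] = toy["pdays"].where(toy["pdays"] >= 0)

contacted_share = toy["pdays"].notna().mean()
print(toy["pdays"])
print(f"previously contacted: {contacted_share:.0%}")
```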

## Discarding unnecessary features/columns
In this section, we check whether both values of each binary column are sufficiently represented. If one of the two values barely occurs, the column carries little information and will be removed. Each binary column is plotted as a histogram to show how often each value occurs.

In [None]:
binaries = [x for x in df.columns if len(df[x].unique()) == 2]

print(binaries)

In [None]:
sns.histplot(df[binaries[0]])

In [None]:
sns.histplot(df[binaries[1]])

In [None]:
sns.histplot(df[binaries[2]])

In [None]:
sns.histplot(df[binaries[3]])

Note: based on the plots, we can see that the value yes is barely represented in the 'default' column. That's why the column will be removed.

In [None]:
# Remove default column
df2 = df2.drop(['default'], axis = 1)
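
The same check can be done numerically instead of visually. The sketch below computes the share of the rarer value per binary column on made-up data; the 5% cut-off `min_share` is a hypothetical threshold chosen for illustration, not one stated in this exercise.

```python
import pandas as pd

# Made-up binary columns: 'default' is heavily imbalanced, 'housing' is not
toy = pd.DataFrame({
    "default": ["no"] * 98 + ["yes"] * 2,
    "housing": ["yes"] * 55 + ["no"] * 45,
})
min_share = 0.05  # hypothetical threshold

# Share of the least frequent value in each column
minority_share = {
    col: toy[col].value_counts(normalize=True).min() for col in toy.columns
}
to_drop = [col for col, share in minority_share.items() if share < min_share]

print(minority_share)
print("columns to drop:", to_drop)
```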

## Dealing with missing values in both rows and columns 

In [None]:
df2.info()

# Based on the info output, we can see that the pdays column now only has 8233 non-null values.
# The best option is to remove the pdays column.
# dropna(axis = 1) drops every column that contains missing values, which here is only pdays.

df2 = df2.dropna(axis = 1)
df2.info()
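
`dropna(axis = 1)` removes a column as soon as it has a single missing value. A more selective variant, sketched below on made-up data, drops only columns whose share of missing values exceeds a limit; `max_missing` is a hypothetical parameter.

```python
import numpy as np
import pandas as pd

# Made-up frame: 'pdays' is mostly missing, 'age' is complete
toy = pd.DataFrame({
    "age": [30, 41, 52, 28, 39],
    "pdays": [np.nan, 151.0, np.nan, np.nan, np.nan],
})
max_missing = 0.5  # hypothetical limit

# Fraction of missing values per column
missing_share = toy.isna().mean()

# Keep only the columns at or below the limit
cleaned = toy.loc[:, missing_share <= max_missing]

print(missing_share)
print(cleaned.columns.tolist())
```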

## Remapping categories
The following columns need to be remapped:
- job: rename admin. to administrator and blue-collar to manual-labor
- binary columns: remap yes to True and no to False
- month: remap from str to num (1, 2, 3 instead of jan, feb, mar etc.)

### Job 

In [None]:
# Change admin. to administrator and blue-collar to manual-labor
# Change the datatype to str in order to change the value of job
df2['job'] = df2['job'].astype('str')
df2.loc[df2['job'] == 'admin.', 'job'] = 'administrator'
df2.loc[df2['job'] == 'blue-collar', 'job'] = 'manual-labor'

df2['job'] = df2['job'].astype('category')

# Get the unique jobs after remapping and print them to verify the result
jobs = df2['job'].unique()
for job in jobs:
    print(job)
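
When a column already has the category dtype, labels can also be renamed without the round trip through str. The sketch below uses `Series.cat.rename_categories` on made-up data; the two renames are illustrative values and categories missing from the dict are left unchanged.

```python
import pandas as pd

# Made-up job column stored as a category
jobs_toy = pd.Series(
    ["admin.", "blue-collar", "services", "admin."], dtype="category"
)

# Rename selected categories; the dtype stays categorical
renamed = jobs_toy.cat.rename_categories(
    {"admin.": "administrator", "blue-collar": "manual-labor"}
)

print(renamed.tolist())
print(renamed.dtype)
```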

### Binary columns
In the dataset, each binary column contains the string values "yes" and "no". Those values will be replaced
with True and False, and the data type will be changed to boolean.
The binary columns are:
- housing
- loan
- y

In [None]:
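# Columns to remap
columns = ['housing', 'loan', 'y']

# Create the mapping dict to change the values with
mapping = {"yes": True, "no": False}

# Remap the columns
# astype(str) followed by map avoids calling replace() on a category column,
# which can emit deprecation warnings in recent pandas versions
for col in columns:
    df2[col] = df2[col].astype(str).map(mapping)
    df2[col] = df2[col].astype('boolean')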

df2.head()
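
A remapping silently produces missing values if a column contains a label that is not in the dict (for example "unknown"). The sketch below, on made-up data, adds a guard that fails loudly in that case; the guard is an optional addition and not part of the exercise.

```python
import pandas as pd

# Made-up yes/no columns
toy = pd.DataFrame({
    "housing": ["yes", "no", "yes"],
    "loan": ["no", "no", "yes"],
})
yes_no = {"yes": True, "no": False}

for col in ["housing", "loan"]:
    mapped = toy[col].map(yes_no)
    # map() yields NaN for labels missing from the dict
    if mapped.isna().any():
        raise ValueError(f"unexpected value in column {col!r}")
    toy[col] = mapped.astype("boolean")

print(toy.dtypes)
```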

### Month
The month column is stored as abbreviations (jan, feb, mar etc.). It needs to be remapped to the month numbers (1 to 12).

In [None]:
# Create the dict to use for remapping
mapping = {"jan": 1, "feb": 2, "mar": 3, 'apr': 4, 'may': 5, 'jun': 6, 'jul': 7, 'aug': 8,
          'sep': 9, 'oct': 10, 'nov': 11, 'dec': 12}

# Remap the month column and store the result as integers
df2['month'] = df2['month'].astype(str).map(mapping).astype('int64')

df2.info()
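
The month dict can also be generated from an ordered list, which avoids typing the numbers by hand. The sketch below is an illustration on made-up data; `abbreviations`, `month_to_number` and `unmapped` are hypothetical names.

```python
import pandas as pd

abbreviations = ["jan", "feb", "mar", "apr", "may", "jun",
                 "jul", "aug", "sep", "oct", "nov", "dec"]

# Position in the list (starting at 1) is the month number
month_to_number = {name: i for i, name in enumerate(abbreviations, start=1)}

# Made-up month values
months = pd.Series(["may", "jun", "nov", "may"])
numbers = months.map(month_to_number)

# Labels that did not match any abbreviation (empty if all matched)
unmapped = months[numbers.isna()].unique()

print(numbers.tolist())
print("unmapped labels:", list(unmapped))
```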

## Exporting the data
The last step is to export the dataframe to a csv file.

In [None]:
# Create the filename based on the current date
date = dt.now().strftime('%d-%m-%Y')
filename = f'bank-data-{date}.csv'

# Export the file
df2.to_csv(filename)
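
By default, `to_csv` also writes the index as an unnamed first column. After rows have been filtered out, that index has gaps and may not be wanted in the file. The sketch below shows the difference on a made-up frame, writing to in-memory buffers instead of disk.

```python
import io
import pandas as pd

# Made-up frame with a gapped index, as left behind by row filtering
toy = pd.DataFrame({"age": [30, 41], "y": [True, False]}, index=[0, 5])

with_index = io.StringIO()
toy.to_csv(with_index)  # default: index is written

without_index = io.StringIO()
toy.to_csv(without_index, index=False)  # index is omitted

header_with = with_index.getvalue().splitlines()[0]
header_without = without_index.getvalue().splitlines()[0]

print(repr(header_with))
print(repr(header_without))
```

Whether to keep the index is a choice: passing `index=False` gives a file with only the data columns, while the default preserves the original row numbers.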