# <center> Bank Marketing Data Cleaning </center>

### Overview
This notebook walks through the cleaning of a bank marketing campaign dataset. The aim is to prepare the data for further analysis in SQL.

The data contains 45,211 marketing records and 17 columns, including columns such as:
- y: whether the client subscribed to a term deposit (yes/no)
- poutcome: outcome of the previous marketing campaign
- pdays: number of days since the client was last contacted in a previous campaign (-1 means the client was not previously contacted)
- month: last contact month of year

In short, the cleaning process includes:
- formatting the column names and data values
- populating a new dataframe with the formatted data
- checking for missing data
- checking for duplicate records
---

### What is the source of the data?

#### This data was acquired at: https://archive.ics.uci.edu/dataset/222/bank+marketing 

### Import libraries

In [2]:
import pandas as pd
import os
import numpy as np

### Import bank data 

In [3]:
#build the path to the CSV in the current working directory (works on any OS)
pwd = os.path.join(os.getcwd(), "bank-full.csv")

In [4]:
data = pd.read_csv(pwd, header=None)
data.head()

Unnamed: 0,0
0,"age;""job"";""marital"";""education"";""default"";""bal..."
1,"58;""management"";""married"";""tertiary"";""no"";2143..."
2,"44;""technician"";""single"";""secondary"";""no"";29;""..."
3,"33;""entrepreneur"";""married"";""secondary"";""no"";2..."
4,"47;""blue-collar"";""married"";""unknown"";""no"";1506..."


#### The column names and values are separated by semi-colons, so each record was read as a single string in the first (and only) column
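
For comparison, here is a minimal sketch of how pandas could parse this layout directly by passing the delimiter to `read_csv`. It is not the approach this notebook takes. The two sample rows are rebuilt from the preview and outputs in this notebook, the quoting of the truncated fields is an assumption, and the names `sample` and `parsed` are hypothetical.

```python
import io
import pandas as pd

# Stand-in for bank-full.csv: header plus two rows in the same semicolon-delimited layout.
# Assumption: text fields are double-quoted throughout, as in the visible part of the preview.
sample = io.StringIO(
    '"age";"job";"marital";"education";"default";"balance";"housing";"loan";'
    '"contact";"day";"month";"duration";"campaign";"pdays";"previous";"poutcome";"y"\n'
    '58;"management";"married";"tertiary";"no";2143;"yes";"no";'
    '"unknown";5;"may";261;1;-1;0;"unknown";"no"\n'
    '44;"technician";"single";"secondary";"no";29;"yes";"no";'
    '"unknown";5;"may";151;1;-1;0;"unknown";"no"\n'
)

# sep=';' splits on semi-colons; quotes are stripped by the default quoting rules
parsed = pd.read_csv(sample, sep=';')

# Rename 'y' for clarity
parsed = parsed.rename(columns={'y': 'subscribed'})
print(parsed.dtypes)
```

With the real file, the same call would take the file path instead of the in-memory sample. The rest of this notebook performs the equivalent steps manually.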

#### Let's extract the column names by:
- removing unnecessary quotation marks
- removing the semi-colons 
- placing each column name in a list as a string
- renaming 'y' as 'subscribed' for clarity purposes

In [5]:
#get the header string from the first cell
col_names = data.iloc[0, 0]
#format column names
col_names = col_names.replace('"','')
col_names = col_names.split(';')
#change name of y for clarity
col_names[-1] = 'subscribed'
col_names

['age',
 'job',
 'marital',
 'education',
 'default',
 'balance',
 'housing',
 'loan',
 'contact',
 'day',
 'month',
 'duration',
 'campaign',
 'pdays',
 'previous',
 'poutcome',
 'subscribed']

#### Let's format and extract the rest of the data values by:
- removing unnecessary quotation marks 
- removing the semi-colons
- placing each value in a list as a string
- converting the values under numerical columns into integers

In [6]:
row_data = data.iloc[1:].values 

for row in range(len(row_data)):
    row_data[row,0] = row_data[row,0].replace('"','')
    row_data[row,0] = row_data[row,0].split(';')
    #Change the data type of each numerical value to integer
    for num in [0,5,9,11,12,13,14]:
        row_data[row,0][num] = int(row_data[row,0][num])
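
The per-row steps in the loop above can be illustrated on a single record. This is a sketch only: `parse_row` and `NUMERIC_COLS` are hypothetical names, and the example line is rebuilt from the first record shown in this notebook. It also derives the integer positions from the column names, to show that they match the hard-coded list `[0,5,9,11,12,13,14]`.

```python
col_names = ['age', 'job', 'marital', 'education', 'default', 'balance',
             'housing', 'loan', 'contact', 'day', 'month', 'duration',
             'campaign', 'pdays', 'previous', 'poutcome', 'subscribed']

# Hypothetical constant: the columns this notebook converts to integers
NUMERIC_COLS = ['age', 'balance', 'day', 'duration', 'campaign', 'pdays', 'previous']
int_positions = [col_names.index(name) for name in NUMERIC_COLS]

def parse_row(raw_line, positions):
    """Strip quotes, split on ';', and convert the given positions to int."""
    values = raw_line.replace('"', '').split(';')
    for pos in positions:
        values[pos] = int(values[pos])
    return values

example = ('58;"management";"married";"tertiary";"no";2143;"yes";"no";'
           '"unknown";5;"may";261;1;-1;0;"unknown";"no"')
parsed_row = parse_row(example, int_positions)
print(int_positions)   # [0, 5, 9, 11, 12, 13, 14]
print(parsed_row[:6])  # [58, 'management', 'married', 'tertiary', 'no', 2143]
```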

### Let's extract the values from the data we've just cleaned and populate a new dataframe

In [7]:
#size the placeholder from the data instead of hard-coding 45211 x 17
df = pd.DataFrame(np.zeros((len(row_data), len(col_names))))
for row in range(len(row_data)):
    for value in range(len(col_names)):
        df.iloc[row,value] = row_data[row,0][value]
df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16
0,58.0,management,married,tertiary,no,2143.0,yes,no,unknown,5.0,may,261.0,1.0,-1.0,0.0,unknown,no
1,44.0,technician,single,secondary,no,29.0,yes,no,unknown,5.0,may,151.0,1.0,-1.0,0.0,unknown,no
2,33.0,entrepreneur,married,secondary,no,2.0,yes,yes,unknown,5.0,may,76.0,1.0,-1.0,0.0,unknown,no
3,47.0,blue-collar,married,unknown,no,1506.0,yes,no,unknown,5.0,may,92.0,1.0,-1.0,0.0,unknown,no
4,33.0,unknown,single,unknown,no,1.0,no,no,unknown,5.0,may,198.0,1.0,-1.0,0.0,unknown,no


### Let's assign the column names to this new dataframe

In [8]:
df.columns = col_names
df.head()

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,subscribed
0,58.0,management,married,tertiary,no,2143.0,yes,no,unknown,5.0,may,261.0,1.0,-1.0,0.0,unknown,no
1,44.0,technician,single,secondary,no,29.0,yes,no,unknown,5.0,may,151.0,1.0,-1.0,0.0,unknown,no
2,33.0,entrepreneur,married,secondary,no,2.0,yes,yes,unknown,5.0,may,76.0,1.0,-1.0,0.0,unknown,no
3,47.0,blue-collar,married,unknown,no,1506.0,yes,no,unknown,5.0,may,92.0,1.0,-1.0,0.0,unknown,no
4,33.0,unknown,single,unknown,no,1.0,no,no,unknown,5.0,may,198.0,1.0,-1.0,0.0,unknown,no
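
The populate-and-rename steps above can also be done in one call. A minimal sketch, using two already-parsed rows as a stand-in for the full data (`rows` and `direct_df` are hypothetical names):

```python
import pandas as pd

col_names = ['age', 'job', 'marital', 'education', 'default', 'balance',
             'housing', 'loan', 'contact', 'day', 'month', 'duration',
             'campaign', 'pdays', 'previous', 'poutcome', 'subscribed']

# Stand-in for the parsed records: one list of 17 values per row
rows = [
    [58, 'management', 'married', 'tertiary', 'no', 2143, 'yes', 'no',
     'unknown', 5, 'may', 261, 1, -1, 0, 'unknown', 'no'],
    [44, 'technician', 'single', 'secondary', 'no', 29, 'yes', 'no',
     'unknown', 5, 'may', 151, 1, -1, 0, 'unknown', 'no'],
]

# Build the frame and name the columns in one step
direct_df = pd.DataFrame(rows, columns=col_names)
print(direct_df.dtypes)
```

Built this way, the numeric columns keep an integer dtype. The `58.0`-style values in the outputs above most likely come from pre-filling the frame with `np.zeros`, which creates float columns. In this notebook, the equivalent call would presumably be `pd.DataFrame([r[0] for r in row_data], columns=col_names)`.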


### Checking for missing values

In [9]:
df.isnull().sum()

age           0
job           0
marital       0
education     0
default       0
balance       0
housing       0
loan          0
contact       0
day           0
month         0
duration      0
campaign      0
pdays         0
previous      0
poutcome      0
subscribed    0
dtype: int64
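
`isnull()` only counts true missing values (NaN/None). The outputs above show that this dataset also uses the string `unknown` as a placeholder in some columns, and the check does not count those. A small sketch of how such placeholders could be counted, using a hypothetical three-row frame (`sample_df`) rather than the full data:

```python
import pandas as pd

# Hypothetical miniature frame with the same kind of 'unknown' placeholders
sample_df = pd.DataFrame({
    'age': [58, 47, 33],
    'job': ['management', 'blue-collar', 'unknown'],
    'education': ['tertiary', 'unknown', 'unknown'],
})

# isin() marks cells equal to the placeholder; sum() counts them per column
unknown_counts = sample_df.isin(['unknown']).sum()
print(unknown_counts)
```

Whether `unknown` should be treated as missing depends on the analysis. Here it is only counted, not replaced.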

### Duplicate records check

In [10]:
df[df.duplicated()]

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,subscribed
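
The empty result above means no fully identical rows were found. For reference, a brief sketch of how `duplicated()` behaves on a hypothetical frame (`dupes`) that does contain a repeated row:

```python
import pandas as pd

# Hypothetical frame in which the third row repeats the first
dupes = pd.DataFrame({
    'age': [58, 44, 58],
    'job': ['management', 'technician', 'management'],
})

# duplicated() flags every repeat of an earlier row (the first occurrence is not flagged)
n_duplicates = dupes.duplicated().sum()

# drop_duplicates() keeps the first occurrence of each row
deduped = dupes.drop_duplicates()
print(n_duplicates, len(deduped))
```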


### The data is now cleaned and ready to be queried with SQL

In [11]:
df.sample(10)

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,subscribed
20849,53.0,services,married,secondary,no,6170.0,no,no,cellular,13.0,aug,838.0,4.0,-1.0,0.0,unknown,yes
26710,49.0,services,single,unknown,no,5095.0,yes,no,cellular,20.0,nov,127.0,2.0,157.0,1.0,failure,no
6612,33.0,admin.,single,secondary,no,598.0,yes,no,unknown,28.0,may,189.0,1.0,-1.0,0.0,unknown,no
6338,31.0,management,single,tertiary,no,-429.0,yes,yes,unknown,27.0,may,207.0,1.0,-1.0,0.0,unknown,no
8938,27.0,blue-collar,single,primary,no,0.0,yes,no,unknown,4.0,jun,299.0,2.0,-1.0,0.0,unknown,no
12602,34.0,technician,married,secondary,no,173.0,no,no,unknown,4.0,jul,31.0,1.0,-1.0,0.0,unknown,no
37074,50.0,management,divorced,tertiary,no,100.0,yes,yes,cellular,13.0,may,157.0,1.0,299.0,3.0,failure,no
33736,41.0,unknown,single,unknown,no,942.0,no,no,cellular,22.0,apr,219.0,1.0,-1.0,0.0,unknown,yes
17191,25.0,management,single,tertiary,no,155.0,yes,no,telephone,28.0,jul,193.0,3.0,-1.0,0.0,unknown,no
11654,41.0,technician,divorced,unknown,no,143.0,no,no,unknown,20.0,jun,141.0,4.0,-1.0,0.0,unknown,no


### Export data

In [12]:
df.to_excel('clean bank data.xlsx', index=False)

### Summary and Next Steps
After cleaning the data, I added a new column called 'ID' and set it as the primary key for this table in Microsoft Access.
Further analysis was performed in DBeaver, using SQL queries to gain insights and answer important business questions.
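
As an aside, a cleaned dataframe can also be loaded into a SQL database straight from pandas. The sketch below is not the workflow used for this project (which went through Excel, Microsoft Access and DBeaver). It assumes an in-memory SQLite database, a hypothetical table name `bank_marketing`, and a two-row stand-in frame. The dataframe index is written out as an `ID` column, although it is not declared as a primary key here.

```python
import sqlite3
import pandas as pd

# Two-row stand-in for the cleaned dataframe
sample_df = pd.DataFrame({
    'age': [58, 44],
    'job': ['management', 'technician'],
    'subscribed': ['no', 'no'],
})

# Assumption: an in-memory SQLite database is used purely for illustration
conn = sqlite3.connect(':memory:')

# Write the frame to a table, storing the index as an 'ID' column
sample_df.to_sql('bank_marketing', conn, index=True, index_label='ID',
                 if_exists='replace')

# Query it back with plain SQL
rows = conn.execute(
    'SELECT ID, age, subscribed FROM bank_marketing ORDER BY ID'
).fetchall()
conn.close()
print(rows)
```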

The SQL analysis can be found [here](https://github.com/danxchap/Personal-Projects/blob/master/Marketing%20SQL%20Data%20Analysis/bank%20marketing%20queries.sql).

Thank you for reviewing this notebook! Please feel free to reach out if you have any further inquiries or feedback.