# 1.0 An end-to-end classification problem (ETL)



## 1.1 Dataset description

Bank Marketing Data - A Decision Tree Approach

The data is related to the direct marketing campaigns of a Portuguese banking institution. The campaigns were based on phone calls. Often, more than one contact with the same client was required in order to assess whether the product (a bank term deposit) would be subscribed ('yes') or not ('no').

You can download the data from the [University of California, Irvine's website](http://archive.ics.uci.edu/ml/datasets/Bank+Marketing).

The classification goal is to predict whether the client will subscribe (yes/no) to a term deposit (variable y).

Let's take the following steps:

1. Load Libraries
2. Fetch Data, including EDA
3. Pre-processing
4. Data Segregation
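As a rough illustration of the last step, data segregation can be sketched as a stratified split that keeps the yes/no ratio of `y` similar in both parts. This is only a minimal sketch under stated assumptions: the toy frame, the 50% fraction, and the use of pandas `groupby(...).sample` are illustrative choices, not necessarily the splitter used later in this notebook.

```python
import pandas as pd

# Toy stand-in for the bank data: 8 "no" rows and 2 "yes" rows (hypothetical values).
toy = pd.DataFrame({
    "age": list(range(30, 40)),
    "y": ["no"] * 8 + ["yes"] * 2,
})

# Draw the same fraction from each class so the class ratio is preserved.
test_set = toy.groupby("y").sample(frac=0.5, random_state=42)

# Everything that was not drawn for the test set becomes the training set.
train_set = toy.drop(test_set.index)

print(train_set["y"].value_counts())
print(test_set["y"].value_counts())
```

Sampling per class matters here because the target may be imbalanced; a plain random split could leave very few 'yes' rows in one of the parts.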

<center><img width="600" src="https://drive.google.com/uc?export=view&id=1a-nyAPNPiVh-Xb2Pu2t2p-BhSvHJS0pO"></center>

## 1.2 Install and load libraries

In [1]:
!pip install wandb

Collecting wandb
  Downloading wandb-0.12.16-py2.py3-none-any.whl (1.8 MB)
Collecting sentry-sdk>=1.0.0
  Downloading sentry_sdk-1.5.12-py2.py3-none-any.whl (145 kB)
Collecting docker-pycreds>=0.4.0
  Downloading docker_pycreds-0.4.0-py2.py3-none-any.whl (9.0 kB)
Collecting setproctitle
  Downloading setproctitle-1.2.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (29 kB)
Collecting GitPython>=1.0.0
  Downloading GitPython-3.1.27-py3-none-any.whl (181 kB)
Collecting pathtools
  Downloading pathtools-0.1.2.tar.gz (11 kB)
Collecting shortuuid>=0.5.0
  Downloading shortuuid-1.0.9-py3-none-any.whl (9.4 kB)
Collecting gitdb<5,>=4.0.1
  Downloading gitdb-4.0.9-py3-none-any.whl (63 kB)
Collecti

In [2]:
import wandb
import pandas as pd

import zipfile
from urllib.request import urlopen
import io

## 1.3 Fetch Data

### 1.3.1 Create the raw_data artifact

In [3]:

# Download the zipped dataset from the UCI repository and open it in memory
# (local alternative: bank = pd.read_csv('../input/bank.csv'))
url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/00222/bank.zip'
zf = zipfile.ZipFile(io.BytesIO(urlopen(url).read()))
zf.filelist

[<ZipInfo filename='bank-full.csv' compress_type=deflate filemode='-rw-r--r--' external_attr=0x4000 file_size=4610348 compress_size=514234>,
 <ZipInfo filename='bank-names.txt' compress_type=deflate filemode='-rw-r--r--' external_attr=0x4000 file_size=3864 compress_size=1665>,
 <ZipInfo filename='bank.csv' compress_type=deflate filemode='-rw-r--r--' external_attr=0x4000 file_size=461474 compress_size=62692>]
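The cell above wraps the downloaded bytes in `io.BytesIO` so that `zipfile` can read the archive without writing it to disk. The sketch below reproduces that pattern offline using only the standard library. The archive it builds and the member names `toy-bank.csv` and `toy-names.txt` are hypothetical stand-ins for the real download.

```python
import csv
import io
import zipfile

# Build a small zip archive in memory so the example runs without network access.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("toy-bank.csv", "age;job;y\n58;management;no\n44;technician;no\n")
    archive.writestr("toy-names.txt", "toy description file")

# Reopen the raw bytes, as the notebook does with the downloaded payload.
zf_demo = zipfile.ZipFile(io.BytesIO(buffer.getvalue()))
member_names = zf_demo.namelist()

# Members are opened as binary streams; wrap one in a text layer to parse it.
with zf_demo.open("toy-bank.csv") as raw:
    text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
    rows = list(csv.DictReader(text, delimiter=";"))

print(member_names)
print(rows[0])
```

Note the `delimiter=";"` argument: the bank files are semicolon-separated, which is why the notebook passes the same delimiter to `pd.read_csv`.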

In [4]:
bank = pd.read_csv(zf.open('bank-full.csv'), delimiter=";")
bank.head()

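| | age | job | marital | education | default | balance | housing | loan | contact | day | month | duration | campaign | pdays | previous | poutcome | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 58 | management | married | tertiary | no | 2143 | yes | no | unknown | 5 | may | 261 | 1 | -1 | 0 | unknown | no |
| 1 | 44 | technician | single | secondary | no | 29 | yes | no | unknown | 5 | may | 151 | 1 | -1 | 0 | unknown | no |
| 2 | 33 | entrepreneur | married | secondary | no | 2 | yes | yes | unknown | 5 | may | 76 | 1 | -1 | 0 | unknown | no |
| 3 | 47 | blue-collar | married | unknown | no | 1506 | yes | no | unknown | 5 | may | 92 | 1 | -1 | 0 | unknown | no |
| 4 | 33 | unknown | single | unknown | no | 1 | no | no | unknown | 5 | may | 198 | 1 | -1 | 0 | unknown | no |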
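The preview above shows several categorical columns holding the placeholder string `unknown` rather than a missing value, so `isna()` would not flag them. As a quick EDA check, one might count these placeholders per column. The sketch below is a minimal example that rebuilds only four string columns from the five previewed rows; running the same comparison on the full `bank` frame is left as an assumption.

```python
import pandas as pd

# The four string columns from the five previewed rows (values copied from the preview).
preview = pd.DataFrame({
    "job": ["management", "technician", "entrepreneur", "blue-collar", "unknown"],
    "education": ["tertiary", "secondary", "secondary", "unknown", "unknown"],
    "contact": ["unknown"] * 5,
    "poutcome": ["unknown"] * 5,
})

# Element-wise comparison gives a boolean frame; summing counts the True values per column.
unknown_counts = preview.eq("unknown").sum()

print(unknown_counts)
```

Columns such as `contact` and `poutcome` are entirely `unknown` in this preview, which is worth confirming on the full data before deciding how to handle them during pre-processing.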


In [5]:
bank.to_csv("raw_data.csv",index=False)
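The `index=False` argument keeps the row index out of the saved file. Without it, pandas writes the index as an extra unnamed column, which reappears as `Unnamed: 0` when the file is read back. A minimal sketch with a hypothetical two-row frame, using in-memory buffers instead of files:

```python
import io

import pandas as pd

frame = pd.DataFrame({"age": [58, 44], "y": ["no", "no"]})

# Default behaviour: the index is written as a first column with an empty header.
with_index = io.StringIO()
frame.to_csv(with_index)
with_index.seek(0)
cols_with_index = list(pd.read_csv(with_index).columns)

# With index=False only the data columns are written.
without_index = io.StringIO()
frame.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = list(pd.read_csv(without_index).columns)

print(cols_with_index)
print(cols_without_index)
```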

In [6]:
# Log in to Weights & Biases
!wandb login --relogin

[34m[1mwandb[0m: You can find your API key in your browser here: https://wandb.ai/authorize
[34m[1mwandb[0m: Paste an API key from your profile and hit enter, or press ctrl+c to quit: 
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc


In [7]:
# Upload raw_data.csv to Weights & Biases, storing it as an artifact
!wandb artifact put \
      --name decision_tree_bank/raw_data.csv \
      --type raw_data \
      --description "The raw data bank marketing data" raw_data.csv

[34m[1mwandb[0m: Uploading file raw_data.csv to: "francisvalfgs/decision_tree_bank/raw_data.csv:latest" (raw_data)
[34m[1mwandb[0m: Currently logged in as: [33mfrancisvalfgs[0m. Use [1m`wandb login --relogin`[0m to force relogin
[34m[1mwandb[0m: Tracking run with wandb version 0.12.16
[34m[1mwandb[0m: Run data is saved locally in [35m[1m/content/wandb/run-20220523_012020-cx7gzzbb[0m
[34m[1mwandb[0m: Run [1m`wandb offline`[0m to turn off syncing.
[34m[1mwandb[0m: Syncing run [33mgenerous-plant-1[0m
[34m[1mwandb[0m: ⭐️ View project at [34m[4mhttps://wandb.ai/francisvalfgs/decision_tree_bank[0m
[34m[1mwandb[0m: 🚀 View run at [34m[4mhttps://wandb.ai/francisvalfgs/decision_tree_bank/runs/cx7gzzbb[0m
Artifact uploaded, use this artifact in a run by adding:

    artifact = run.use_artifact("francisvalfgs/decision_tree_bank/raw_data.csv:latest")

[34m[1mwandb[0m: Waiting for W&B process to finish... [32m(success).[0m
[34m[1mwandb[0m:            