# Data Manipulation



# Explanation of Features in the Data Set


1. age (numeric)


2. job : type of job (categorical: 'admin.','blue-collar','entrepreneur','housemaid','management','retired','self-employed','services','student','technician','unemployed','unknown')


3. marital : marital status (categorical: 'divorced','married','single'; note: 'divorced' means divorced or widowed)


4. education (categorical: 'primary','secondary','tertiary','unknown')


5. default: has credit in default? (categorical: 'no','yes','unknown')


6. housing: has housing loan? (categorical: 'no','yes')


7. loan: has personal loan? (categorical: 'no','yes')



### Related to the last contact of the current campaign:

8. contact: contact communication type (categorical: 'cellular','telephone','unknown')


9. month: last contact month of year (categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec')


10. day: last contact day of the month (numeric: 1 to 31)


11. duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y='no'). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.
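
The note on `duration` can be acted on by keeping two feature sets: one that includes `duration` for benchmarking and one that excludes it for a realistic model. The sketch below is only an illustration on a tiny made-up frame; the names `sample`, `benchmark_features`, and `realistic_features` are hypothetical and do not appear elsewhere in this notebook.

```python
import pandas as pd

# Tiny illustrative frame: the values are made up, only the column names
# follow the data set described above.
sample = pd.DataFrame({
    "age": [58, 44, 33],
    "duration": [261, 151, 76],
    "campaign": [1, 1, 2],
    "y": ["no", "no", "yes"],
})

# Benchmark feature set: every column except the target, duration included.
benchmark_features = sample.drop(columns=["y"])

# Realistic feature set: duration is also dropped, because it is not known
# before the call is made.
realistic_features = sample.drop(columns=["y", "duration"])

print(list(benchmark_features.columns))
print(list(realistic_features.columns))
```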


### Other attributes:

12. campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)


13. pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; -1 means client was not previously contacted)


14. previous: number of contacts performed before this campaign and for this client (numeric)


15. poutcome: outcome of the previous marketing campaign (categorical: 'failure','other','success','unknown')
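
Because `pdays` uses -1 as a sentinel for "not previously contacted", treating it as an ordinary number can distort averages and distance-based models. One possible way to handle this, shown here on made-up values, is to split the column into an indicator and a cleaned numeric series. The names `previously_contacted` and `pdays_clean` are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Made-up values: -1 marks clients with no previous contact.
pdays = pd.Series([-1, 92, -1, 180], name="pdays")

# 1 if the client was contacted in a previous campaign, 0 otherwise.
previously_contacted = (pdays != -1).astype(int)

# Replace the sentinel with NaN so it does not enter numeric summaries.
pdays_clean = pdays.where(pdays != -1, np.nan)

print(previously_contacted.tolist())
print(pdays_clean.mean())
```

Whether to keep the NaN values, impute them, or rely on the indicator alone depends on the model that consumes the data.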

In [36]:
import pandas as pd
import numpy as np

In [37]:
# The raw file is semicolon-delimited
bank = pd.read_csv('bank-full.csv', sep=';')
# Keep an untouched copy of the raw data
bank1 = bank.copy()

In [38]:
bank.head()

,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y
0,58,management,married,tertiary,no,2143,yes,no,unknown,5,may,261,1,-1,0,unknown,no
1,44,technician,single,secondary,no,29,yes,no,unknown,5,may,151,1,-1,0,unknown,no
2,33,entrepreneur,married,secondary,no,2,yes,yes,unknown,5,may,76,1,-1,0,unknown,no
3,47,blue-collar,married,unknown,no,1506,yes,no,unknown,5,may,92,1,-1,0,unknown,no
4,33,unknown,single,unknown,no,1,no,no,unknown,5,may,198,1,-1,0,unknown,no


In [39]:
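# pdays is an integer column, so compare against the integer 999
# (this file marks "not previously contacted" with -1 instead)
(bank.pdays == 999).sum()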

0

In [40]:
# Share of clients in the positive class (y == 'yes'):
# about 11.7% of all observations.

# The classes are imbalanced: 'no' is far more frequent than 'yes'.
# Any classifier built on this data will need to account for that imbalance.

bank.y.value_counts()['yes'] / bank.shape[0] * 100

11.698480458295547
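
One common way to account for an imbalance like this is to weight each class inversely to its frequency. The sketch below uses made-up labels rather than the bank data, and computes weights with the formula `n_samples / (n_classes * count_c)`, which is the heuristic scikit-learn documents for `class_weight='balanced'`. It is an illustration of the idea, not necessarily the approach taken later in this analysis.

```python
import pandas as pd

# Made-up labels with 20% positives, standing in for an imbalanced target.
y = pd.Series(["no"] * 8 + ["yes"] * 2)

counts = y.value_counts()

# weight_c = n_samples / (n_classes * count_c): rarer classes get larger weights.
class_weight = (len(y) / (len(counts) * counts)).to_dict()

print(class_weight)
```

Resampling (oversampling the minority class or undersampling the majority class) is an alternative to weighting.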

In [41]:
bank.day.value_counts()

20    2752
18    2308
21    2026
17    1939
6     1932
5     1910
14    1848
8     1842
28    1830
7     1817
19    1757
29    1745
15    1703
12    1603
13    1585
30    1566
9     1561
11    1479
4     1445
16    1415
2     1293
27    1121
3     1079
26    1035
23     939
22     905
25     840
31     643
10     524
24     447
1      322
Name: day, dtype: int64

In [42]:
yesno = {'yes': 1, 'no': 0}
# Binary-encode the target
bank['Target'] = bank.y.map(yesno)
# Binary-encode the credit-default flag
bank['default_n'] = bank.default.map(yesno)
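
`Series.map` with a dict returns NaN for any value that is not a key in the dict, which matters for columns the feature list describes as possibly containing 'unknown'. The following sketch, on a made-up series, shows one way to detect such unmapped values; `raw`, `mapped`, and `unmapped` are illustrative names.

```python
import pandas as pd

yesno = {"yes": 1, "no": 0}

# Made-up values including a category that is missing from the mapping.
raw = pd.Series(["yes", "no", "unknown"])

# Values without a key in the dict become NaN.
mapped = raw.map(yesno)

# List the original categories that were not mapped.
unmapped = raw[mapped.isna()].unique().tolist()

print(mapped.tolist())
print(unmapped)
```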

In [43]:
bank = pd.get_dummies(bank, columns=['job','marital','education',
                                'housing','loan','contact','month','poutcome',], drop_first=True)
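
The effect of `drop_first=True` is easier to see on a small example. The toy frame below is made up; it shows that the first category in sorted order is dropped, so a column with k categories yields k-1 dummy columns and the dropped category becomes the implicit baseline.

```python
import pandas as pd

# Made-up frame with a single three-level categorical column.
toy = pd.DataFrame({"marital": ["divorced", "married", "single", "married"]})

# One dummy column per category.
full = pd.get_dummies(toy, columns=["marital"])

# The first category ('divorced') is dropped and acts as the baseline.
reduced = pd.get_dummies(toy, columns=["marital"], drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```

The dtype of the dummy columns depends on the pandas version (`uint8` in older releases, `bool` in newer ones), so the `info()` output may differ from the one shown in this notebook.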

In [44]:
bank.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45211 entries, 0 to 45210
Data columns (total 45 columns):
age                    45211 non-null int64
default                45211 non-null object
balance                45211 non-null int64
day                    45211 non-null int64
duration               45211 non-null int64
campaign               45211 non-null int64
pdays                  45211 non-null int64
previous               45211 non-null int64
y                      45211 non-null object
Target                 45211 non-null int64
default_n              45211 non-null int64
job_blue-collar        45211 non-null uint8
job_entrepreneur       45211 non-null uint8
job_housemaid          45211 non-null uint8
job_management         45211 non-null uint8
job_retired            45211 non-null uint8
job_self-employed      45211 non-null uint8
job_services           45211 non-null uint8
job_student            45211 non-null uint8
job_technician         45211 non-null uint8
job_unemp

In [46]:
# Drop the original string columns now that numeric versions exist
bank.drop(columns=['y', 'default'], inplace=True)

In [47]:
bank.head()

,age,balance,day,duration,campaign,pdays,previous,Target,default_n,job_blue-collar,...,month_jul,month_jun,month_mar,month_may,month_nov,month_oct,month_sep,poutcome_other,poutcome_success,poutcome_unknown
0,58,2143,5,261,1,-1,0,0,0,0,...,0,0,0,1,0,0,0,0,0,1
1,44,29,5,151,1,-1,0,0,0,0,...,0,0,0,1,0,0,0,0,0,1
2,33,2,5,76,1,-1,0,0,0,0,...,0,0,0,1,0,0,0,0,0,1
3,47,1506,5,92,1,-1,0,0,0,1,...,0,0,0,1,0,0,0,0,0,1
4,33,1,5,198,1,-1,0,0,0,0,...,0,0,0,1,0,0,0,0,0,1


In [48]:
# All dtypes are now numerical and ready for model building
bank.dtypes

age                    int64
balance                int64
day                    int64
duration               int64
campaign               int64
pdays                  int64
previous               int64
Target                 int64
default_n              int64
job_blue-collar        uint8
job_entrepreneur       uint8
job_housemaid          uint8
job_management         uint8
job_retired            uint8
job_self-employed      uint8
job_services           uint8
job_student            uint8
job_technician         uint8
job_unemployed         uint8
job_unknown            uint8
marital_married        uint8
marital_single         uint8
education_secondary    uint8
education_tertiary     uint8
education_unknown      uint8
housing_yes            uint8
loan_yes               uint8
contact_telephone      uint8
contact_unknown        uint8
month_aug              uint8
month_dec              uint8
month_feb              uint8
month_jan              uint8
month_jul              uint8
month_jun     

In [49]:
bank.to_csv('bank-numerical.csv')
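
Whether the row index is written to the file matters when the CSV is read back. The sketch below round-trips a tiny made-up frame through an in-memory buffer (so no file is created) and compares a plain `read_csv` with one that passes `index_col=0`. The frame and variable names are illustrative only.

```python
import io
import pandas as pd

# Made-up frame standing in for the processed data.
frame = pd.DataFrame({"age": [58, 44], "Target": [0, 0]})

# By default to_csv writes the index as an unnamed first column.
buffer = io.StringIO()
frame.to_csv(buffer)

# A plain read turns that index into an extra 'Unnamed: 0' column.
buffer.seek(0)
naive = pd.read_csv(buffer)

# Passing index_col=0 restores the original columns.
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)

print(list(naive.columns))
print(list(restored.columns))
```

Writing with `index=False` is another option when the index carries no information.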