# FCA

Technical Challenge for Data Science Candidates

This workbook loads the data for the Bank example, re-codes it and writes it out.

The next notebook will do some visualisation.

The models are then developed in the final notebook.

In [1]:
import numpy as np
import pandas as pd
import math
import json

from os import path

import scipy.stats as st
import statsmodels as sm
import statsmodels.api as smi

from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import StandardScaler

from pandas.api.types import CategoricalDtype

pd.__version__

'0.24.2'

In [2]:
# If you turn this feature on, you can display each result as it happens.
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [3]:
# This is the local Utility module. I add functions to it and rely on autoreload to make them appear.
from fca import Utility

In [4]:
%reload_ext autoreload
%autoreload 1
%aimport fca

In [5]:
# My utility singleton.
i0 = Utility.instance()
dir(i0);

In [6]:
# Load the data in its original form.
df0 = pd.read_csv("in.csv", sep=";")

## Data manipulation

### Incidence rate

The share of positive outcomes (y equal to yes) is small. This suggests we should use balanced_accuracy.

In [7]:
print("{:4.2f}%".format(100 * (df0['y'] == 'yes').values.sum() / df0.shape[0]))

# This suggests we should use balanced_accuracy and not accuracy as the summary metric.

11.27%
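
To make the point concrete, here is a minimal sketch on synthetic labels, not the bank data. It assumes balanced accuracy means the average of the per-class recalls, which is the quantity scikit-learn's `balanced_accuracy_score` reports. The helper name `balanced_accuracy` is hypothetical.

```python
import numpy as np


def balanced_accuracy(y_true, y_pred):
    """Mean of the per-class recalls (hypothetical helper, for illustration)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    recalls = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
    return float(np.mean(recalls))


# Synthetic labels with roughly the incidence seen above: 113 'yes' in 1000.
y_true = np.array(["yes"] * 113 + ["no"] * 887)
# A trivial classifier that always predicts the majority class.
y_pred = np.array(["no"] * 1000)

accuracy = float(np.mean(y_true == y_pred))
balanced = balanced_accuracy(y_true, y_pred)
print(accuracy, balanced)
```

A classifier that always answers "no" scores about 0.89 on plain accuracy but only 0.5 on balanced accuracy, which is no better than chance.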


### Manipulations

We convert strings to categories, then change most of those to ordered categories whose ordering aligns with other distributions, such as income and age.

We add a boolean field for the NA-like classes (unknown and nonexistent). The pdays column uses 999 as its NA value.

In [8]:
df0.info()
df0.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41188 entries, 0 to 41187
Data columns (total 21 columns):
age               41188 non-null int64
job               41188 non-null object
marital           41188 non-null object
education         41188 non-null object
default           41188 non-null object
housing           41188 non-null object
loan              41188 non-null object
contact           41188 non-null object
month             41188 non-null object
day_of_week       41188 non-null object
duration          41188 non-null int64
campaign          41188 non-null int64
pdays             41188 non-null int64
previous          41188 non-null int64
poutcome          41188 non-null object
emp.var.rate      41188 non-null float64
cons.price.idx    41188 non-null float64
cons.conf.idx     41188 non-null float64
euribor3m         41188 non-null float64
nr.employed       41188 non-null float64
y                 41188 non-null object
dtypes: float64(5), int64(5), object(11)
memory usa

| | age | job | marital | education | default | housing | loan | contact | month | day_of_week | ... | campaign | pdays | previous | poutcome | emp.var.rate | cons.price.idx | cons.conf.idx | euribor3m | nr.employed | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 56 | housemaid | married | basic.4y | no | no | no | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
| 1 | 57 | services | married | high.school | unknown | no | no | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
| 2 | 37 | services | married | high.school | no | yes | no | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
| 3 | 40 | admin. | married | basic.6y | no | no | no | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
| 4 | 56 | services | married | high.school | no | no | yes | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |


In [9]:
## Heuristic steps
# Having looked ahead, the correlations and density plots suggest
# that nr.employed at 5099 is a good split.
# df0 = df0[df0['nr.employed'] >= 5099]

In [10]:
# Convert strings to categories
df1 = i0.str2cat(df0)
df1.info()
df1.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41188 entries, 0 to 41187
Data columns (total 21 columns):
age               41188 non-null int64
job               41188 non-null category
marital           41188 non-null category
education         41188 non-null category
default           41188 non-null category
housing           41188 non-null category
loan              41188 non-null category
contact           41188 non-null category
month             41188 non-null category
day_of_week       41188 non-null category
duration          41188 non-null int64
campaign          41188 non-null int64
pdays             41188 non-null int64
previous          41188 non-null int64
poutcome          41188 non-null category
emp.var.rate      41188 non-null float64
cons.price.idx    41188 non-null float64
cons.conf.idx     41188 non-null float64
euribor3m         41188 non-null float64
nr.employed       41188 non-null float64
y                 41188 non-null category
dtypes: category(11), float64

| | age | job | marital | education | default | housing | loan | contact | month | day_of_week | ... | campaign | pdays | previous | poutcome | emp.var.rate | cons.price.idx | cons.conf.idx | euribor3m | nr.employed | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 56 | housemaid | married | basic.4y | no | no | no | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
| 1 | 57 | services | married | high.school | unknown | no | no | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
| 2 | 37 | services | married | high.school | no | yes | no | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
| 3 | 40 | admin. | married | basic.6y | no | no | no | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
| 4 | 56 | services | married | high.school | no | no | yes | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
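
`i0.str2cat` lives in the local `fca` module and its source is not shown in this notebook. The following is a minimal sketch of an equivalent step, assuming it simply casts every non-numeric column to the pandas `category` dtype. The function name `strings_to_categories` and the toy frame are hypothetical.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def strings_to_categories(df):
    """Return a copy of df with every non-numeric column cast to category.

    Assumption: in this dataset the non-numeric columns are exactly the
    string columns, so this is equivalent to converting strings.
    """
    out = df.copy()
    for col in out.columns:
        if not is_numeric_dtype(out[col]):
            out[col] = out[col].astype("category")
    return out


toy = pd.DataFrame({
    "age": [56, 57, 37],
    "job": ["housemaid", "services", "services"],
    "y": ["no", "no", "yes"],
})
toy_cat = strings_to_categories(toy)
print(toy_cat.dtypes)
```

Working on a copy keeps the input frame untouched, which makes it easier to compare the frame before and after the conversion.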


In [11]:
# Look at the categories. Because we will be scaling, it's best to order the categories, split off the
# unknowns and nonexistents, and mark those with a boolean.

# This shows each categorical column and its classes.
cats0 = {x: tuple(df1[x].cat.categories) for x in df1.select_dtypes(['category']).columns}
cats0

{'job': ('admin.',
  'blue-collar',
  'entrepreneur',
  'housemaid',
  'management',
  'retired',
  'self-employed',
  'services',
  'student',
  'technician',
  'unemployed',
  'unknown'),
 'marital': ('divorced', 'married', 'single', 'unknown'),
 'education': ('basic.4y',
  'basic.6y',
  'basic.9y',
  'high.school',
  'illiterate',
  'professional.course',
  'university.degree',
  'unknown'),
 'default': ('no', 'unknown', 'yes'),
 'housing': ('no', 'unknown', 'yes'),
 'loan': ('no', 'unknown', 'yes'),
 'contact': ('cellular', 'telephone'),
 'month': ('apr',
  'aug',
  'dec',
  'jul',
  'jun',
  'mar',
  'may',
  'nov',
  'oct',
  'sep'),
 'day_of_week': ('fri', 'mon', 'thu', 'tue', 'wed'),
 'poutcome': ('failure', 'nonexistent', 'success'),
 'y': ('no', 'yes')}
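
The listing above shows that the default categories are sorted lexically, so their integer codes would carry no meaning once scaled. A small sketch of the difference an explicit ordered `CategoricalDtype` makes, using a made-up four-value series:

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

s = pd.Series(["high.school", "illiterate", "basic.4y", "unknown"])

# Default: categories are sorted lexically, so the codes follow the alphabet.
lexical = s.astype("category")
print(list(lexical.cat.categories))
print(list(lexical.cat.codes))

# Explicit ordered dtype: the codes now follow the chosen ranking.
ranked = CategoricalDtype(
    categories=["unknown", "illiterate", "basic.4y", "high.school"],
    ordered=True,
)
ordered = s.astype(ranked)
print(list(ordered.cat.codes))

# Ordered categories also support comparisons against a category label.
at_least_basic = ordered >= "basic.4y"
print(list(at_least_basic))
```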

In [12]:
## Reclassify jobs to be ordered
# Add an extra indicator when unknown.
# The ordering I've chosen is by income/wealth, which should correlate with age as well.

tag='job'
ctype0 = CategoricalDtype(categories=['unknown', 
                                      'unemployed', 'housemaid', 'student', 
                                      'blue-collar', 'services', 'technician', 'admin.', 
                                      'retired', 
                                      'self-employed', 'management', 'entrepreneur'], 
                          ordered=True)
df1 = i0.categorize0(df1, tag=tag, ctype0=ctype0, class0='unknown')
# And a check to see there are no NAs.
any(df1[tag].isna())

False
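
The source of `i0.categorize0` is not shown. Judging only from how it is called and from the `job0`-style columns that appear later, a plausible sketch is a function that casts the column to the ordered dtype and adds a boolean column named `tag + "0"` marking the NA-like class. The name `categorize_with_flag` and its exact behaviour are assumptions.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype


def categorize_with_flag(df, tag, ctype, na_class):
    """Cast df[tag] to ctype and add a boolean column tag + '0'.

    The flag is True where the value equals na_class. Values that are
    missing from ctype.categories become NaN, hence the isna() check.
    """
    out = df.copy()
    out[tag + "0"] = out[tag] == na_class
    out[tag] = out[tag].astype(ctype)
    return out


job_type = CategoricalDtype(
    categories=["unknown",
                "unemployed", "housemaid", "student",
                "blue-collar", "services", "technician", "admin.",
                "retired",
                "self-employed", "management", "entrepreneur"],
    ordered=True,
)
toy = pd.DataFrame({"job": ["housemaid", "unknown", "services"]})
res = categorize_with_flag(toy, tag="job", ctype=job_type, class0_placeholder=None) \
    if False else categorize_with_flag(toy, "job", job_type, "unknown")
print(res)
print(res["job"].isna().any())
```

The last line mirrors the NA check used in the cells: a misspelt category would show up there as `True`.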

In [13]:
## Reclassify marital status to be ordered
# Divorced will be similar to single.
tag='marital'
ctype0 = CategoricalDtype(categories=['unknown', 
                                      'single', 'married', 'divorced' ], 
                          ordered=True)
df1 = i0.categorize0(df1, tag=tag, ctype0=ctype0, class0='unknown')
any(df1['marital'].isna())

False

In [14]:
## Reclassify education to be ordered
# Again, this should be ordered by age and income.
tag='education'
ctype0 = CategoricalDtype(categories=['unknown',
                                      'illiterate',
                                      'basic.4y',
                                      'basic.6y',
                                      'basic.9y',
                                      'high.school',
                                      'university.degree',
                                      'professional.course'], 
                          ordered=True)
df1 = i0.categorize0(df1, tag=tag, ctype0=ctype0, class0='unknown')
any(df1[tag].isna())

False

In [15]:
## Reclassify the yes/no/unknown
# Add the boolean field for unknown
tags=['default', 'housing', 'loan']
ctype0 = CategoricalDtype(categories=['unknown',
                                      'no',
                                      'yes'], 
                          ordered=True)

for tag in tags:
    df1 = i0.categorize0(df1, tag=tag, ctype0=ctype0, class0='unknown')
    any(df1[tag].isna())

False

False

False

In [16]:
## Reclassify poutcome
# Add the boolean field for nonexistent
tag='poutcome'
ctype0 = CategoricalDtype(categories=['nonexistent',
                                      'failure',
                                      'success'], 
                          ordered=True)
df1 = i0.categorize0(df1, tag=tag, ctype0=ctype0, class0='nonexistent')
any(df1[tag].isna())

False

In [17]:
# There is no "subscribed", we have "y"

## Other minor changes

Only pdays is changed. It uses 999 as its NA value. We replace that with a smaller value and add a boolean flag.

In [18]:
# campaign is number of contacts in this campaign.

df0[['campaign', 'month']].head()

| | campaign | month |
|---|---|---|
| 0 | 1 | may |
| 1 | 1 | may |
| 2 | 1 | may |
| 3 | 1 | may |
| 4 | 1 | may |


In [19]:
# campaign is the number of contacts made with the client during this campaign.
set(df0.campaign);
# pdays - it might be better to put these into classes.
set(df0.pdays);
# marital and education could be ordered categories;
# similarly for job.

In [20]:
## pdays has an NA value of 999. This skews the distribution too much.
# I reduce it to about 1.5 times the max value and add a boolean.
tag = 'pdays'
tag0 = tag+"0"
df1[tag0] = df1[tag].values == 999

In [21]:
# Make the pdays infinity not quite so large.
x0 = max((df1[tag][df1[tag] != 999]).values)
x0 = math.floor(1.5 * x0)
x0

s0 = df1[tag].copy()
s0[s0.values == 999] = x0
df1[tag] = s0

40
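
The two cells above can be folded into one step. This is a minimal sketch on a made-up series, assuming the same rule as the cells: flag the 999 sentinel, then replace it with the floor of 1.5 times the largest genuine value. The function name `cap_sentinel` is hypothetical.

```python
import math

import pandas as pd

SENTINEL = 999  # the NA marker used by pdays


def cap_sentinel(s, sentinel=SENTINEL, factor=1.5):
    """Return (capped series, boolean flag, cap value) for a sentinel-coded series."""
    is_na = s == sentinel
    cap = math.floor(factor * s[~is_na].max())
    # mask() replaces the values where the condition is True.
    return s.mask(is_na, cap), is_na, cap


toy = pd.Series([3, 999, 27, 999, 6])
capped, flag, cap = cap_sentinel(toy)
print(cap)
print(list(capped))
```

With a largest genuine value of 27, as in the toy series, the cap is floor(40.5) = 40, the same value the notebook cell prints.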

In [22]:
df0.head()

| | age | job | marital | education | default | housing | loan | contact | month | day_of_week | ... | nr.employed | y | job0 | marital0 | education0 | default0 | housing0 | loan0 | poutcome0 | pdays0 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 56 | housemaid | married | basic.4y | no | no | no | telephone | may | mon | ... | 5191.0 | no | False | False | False | False | False | False | True | True |
| 1 | 57 | services | married | high.school | unknown | no | no | telephone | may | mon | ... | 5191.0 | no | False | False | False | True | False | False | True | True |
| 2 | 37 | services | married | high.school | no | yes | no | telephone | may | mon | ... | 5191.0 | no | False | False | False | False | False | False | True | True |
| 3 | 40 | admin. | married | basic.6y | no | no | no | telephone | may | mon | ... | 5191.0 | no | False | False | False | False | False | False | True | True |
| 4 | 56 | services | married | high.school | no | no | yes | telephone | may | mon | ... | 5191.0 | no | False | False | False | False | False | False | True | True |


## Visualisation output

The following converts the dataset to fully numeric form. With that, it can be correlated and viewed in another notebook.

In [23]:
# Convert categories to integers and scale
df2 = i0.cat2code(df1)

df2.info()
df2.head()

# Let me look at it with R and do some visualisation
df1.to_csv("catted.csv", index=False)
df1.to_pickle("catted.pickle")
df2.to_csv("coded.csv", index=False)
df2.to_pickle("coded.pickle")

# df3 = i0.code2scale(df2, scaler0=StandardScaler(with_std=False))
df3 = i0.code2scale(df2, scaler0=StandardScaler(with_std=True))
df3.to_pickle("scaled.pickle")

# Visualisation and models are in other notebooks.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41188 entries, 0 to 41187
Data columns (total 29 columns):
age               41188 non-null int64
job               41188 non-null int8
marital           41188 non-null int8
education         41188 non-null int8
default           41188 non-null int8
housing           41188 non-null int8
loan              41188 non-null int8
contact           41188 non-null int8
month             41188 non-null int8
day_of_week       41188 non-null int8
duration          41188 non-null int64
campaign          41188 non-null int64
pdays             41188 non-null int64
previous          41188 non-null int64
poutcome          41188 non-null int8
emp.var.rate      41188 non-null float64
cons.price.idx    41188 non-null float64
cons.conf.idx     41188 non-null float64
euribor3m         41188 non-null float64
nr.employed       41188 non-null float64
y                 41188 non-null int8
job0              41188 non-null bool
marital0          41188 non-null bo

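| | age | job | marital | education | default | housing | loan | contact | month | day_of_week | ... | nr.employed | y | job0 | marital0 | education0 | default0 | housing0 | loan0 | poutcome0 | pdays0 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 56 | 2 | 2 | 2 | 1 | 1 | 1 | 1 | 6 | 1 | ... | 5191.0 | 0 | False | False | False | False | False | False | True | True |
| 1 | 57 | 5 | 2 | 5 | 0 | 1 | 1 | 1 | 6 | 1 | ... | 5191.0 | 0 | False | False | False | True | False | False | True | True |
| 2 | 37 | 5 | 2 | 5 | 1 | 2 | 1 | 1 | 6 | 1 | ... | 5191.0 | 0 | False | False | False | False | False | False | True | True |
| 3 | 40 | 7 | 2 | 3 | 1 | 1 | 1 | 1 | 6 | 1 | ... | 5191.0 | 0 | False | False | False | False | False | False | True | True |
| 4 | 56 | 5 | 2 | 5 | 1 | 1 | 2 | 1 | 6 | 1 | ... | 5191.0 | 0 | False | False | False | False | False | False | True | True |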
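
The integer values in the table match the positions of each label in the ordered dtypes defined earlier (for example `housemaid` is 2 and `services` is 5). A minimal sketch of the two steps, assuming `i0.cat2code` takes the category codes and `i0.code2scale` centres and scales every column. The helper names are hypothetical, and the scaling is written in plain pandas with the population standard deviation (`ddof=0`), which is what `StandardScaler(with_std=True)` uses.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype


def categories_to_codes(df):
    """Return a copy of df with each category column replaced by its integer codes."""
    out = df.copy()
    for col in out.select_dtypes(include=["category"]).columns:
        out[col] = out[col].cat.codes
    return out


def standardise(df):
    """Centre each column on zero and scale it to unit variance (ddof=0)."""
    numeric = df.astype("float64")
    return (numeric - numeric.mean()) / numeric.std(ddof=0)


yes_no = CategoricalDtype(categories=["unknown", "no", "yes"], ordered=True)
toy = pd.DataFrame({
    "age": [56, 57, 37, 40],
    "default": pd.Series(["no", "unknown", "no", "no"]).astype(yes_no),
    "default0": [False, True, False, False],
})

coded = categories_to_codes(toy)
scaled = standardise(coded)
print(coded)
print(scaled.round(3))
```

Because the codes are later scaled like any other number, the ordering chosen for each dtype directly shapes the distances a model sees.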


In [24]:
df2.head();
df3.head();

## Further Checks

I have a global describe method that looks for Near-Zero Variance features.
I also use an R method which is not available in scikit-learn.

In [25]:
# Check the statistics
ds = i0.df2describe(df2)
ds;

In [26]:
list(df2.columns)

['age',
 'job',
 'marital',
 'education',
 'default',
 'housing',
 'loan',
 'contact',
 'month',
 'day_of_week',
 'duration',
 'campaign',
 'pdays',
 'previous',
 'poutcome',
 'emp.var.rate',
 'cons.price.idx',
 'cons.conf.idx',
 'euribor3m',
 'nr.employed',
 'y',
 'job0',
 'marital0',
 'education0',
 'default0',
 'housing0',
 'loan0',
 'poutcome0',
 'pdays0']

In [27]:
## Some investigation of the Near-Zero Variance features
thresh0 = 0.3
thresh1 = thresh0 * (1 - thresh0)
nzv0 = i0.nzv(df3, thresh=thresh1)

In [28]:
ds = i0.df2describe(df3)
ds[ds['name'].isin(nzv0)]

| | name | q | v |
|---|---|---|---|
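
The threshold `thresh0 * (1 - thresh0)` has the form p(1 - p), the variance of a 0/1 feature that is 1 with probability p, so 0.3 gives 0.21. The source of `i0.nzv` is not shown. The sketch below assumes it is a plain variance filter, and the name `near_zero_variance` and the toy columns are hypothetical.

```python
import pandas as pd


def near_zero_variance(df, thresh):
    """Return the names of the columns whose population variance is below thresh."""
    variances = df.astype("float64").var(ddof=0)
    return list(variances[variances < thresh].index)


p = 0.3
thresh = p * (1 - p)  # variance of a 0/1 feature that is 1 with probability 0.3

toy = pd.DataFrame({
    "rare": [True] * 1 + [False] * 9,  # p = 0.1, variance 0.09
    "even": [True] * 5 + [False] * 5,  # p = 0.5, variance 0.25
    "constant": [1.0] * 10,            # variance 0
})
flagged = near_zero_variance(toy, thresh)
print(flagged)
```

If the filter really works on raw variance, then applying it to data already scaled to unit variance would only flag constant columns, which may be why the table above is empty. Running it on the unscaled codes would be the more informative check.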
