# Overview 

## Objective

This notebook provides an example of how to train Tensorflow classifiers using the HMEQ dataset

The goal is to predict whether a customer is a BAD (default) borrower, which in this dataset is a binary classification task.

## Assumption

We are working in big data context. 

Then, I'm going to work with HMEQ dataset as it is so large that it would not fit in RAM. 

Then we use the Tensorflow framework to deal with that.

## Imports and setup

In [1]:
#General
import os
import pprint
import tempfile

#Analysis
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow.keras import layers

## Define variables

In [2]:
BASE_DIR_PATH = os.getcwd()
DATA_DIR_PATH = os.path.join(BASE_DIR_PATH, '../data')

# Data directories paths
TRAIN_DIR_PATH = os.path.join(DATA_DIR_PATH, 'train')
TEST_DIR_PATH = os.path.join(DATA_DIR_PATH, 'test')
VAL_DIR_PATH = os.path.join(DATA_DIR_PATH, 'val')

# Data file paths
TRAIN_DATA_PATH = os.path.join(TRAIN_DIR_PATH, 'train.csv')
TEST_DATA_PATH = os.path.join(TEST_DIR_PATH, 'test.csv')
VAL_DATA_PATH = os.path.join(VAL_DIR_PATH, 'val.csv')

## Define Helpers

In [3]:
## Preprocessing data
def _set_categorical_type(dataframe: pd.DataFrame) -> pd.DataFrame:
    '''Set the categorical type as string if needed'''
    for column in CATEGORICAL_VARIABLES:
        if (dataframe[column].dtype == 'O'):
            dataframe[column] = dataframe[column].astype('string')
    return dataframe

def _set_categorical_empty(dataframe: pd.DataFrame) -> pd.DataFrame:
    '''Change object type for categorical variable to avoid TF issue '''
    for column in CATEGORICAL_VARIABLES:
        if any(dataframe[column].isna()):
            dataframe[column] = dataframe[column].fillna('')
    return dataframe

def _set_numerical_type(dataframe: pd.DataFrame) -> pd.DataFrame:
    '''Set the numerical type as float64 if needed'''
    for column in NUMERICAL_VARIABLES:
        if (dataframe[column].dtype == 'int64'):
            dataframe[column] = dataframe[column].astype('float64')
    return dataframe

def _get_impute_parameters_cat(categorical_variables: list) -> dict:
    '''For each column in the numerical features, calculate mean.'''
    impute_parameters = {}
    for column in categorical_variables:
        impute_parameters[column] = 'Missing'
    return impute_parameters
    
def _impute_missing_categorical(inputs: dict, target) -> dict:
    impute_parameters = _get_impute_parameters_cat(CATEGORICAL_VARIABLES)
    # Since we modify just some features, 
    # we need to start by setting `outputs` to a copy of `inputs.
    output = inputs.copy()
    for key, value in impute_parameters.items():
        is_blank = tf.math.equal('', inputs[key])
        tf_other = tf.constant(value, dtype=np.string_)
        output[key] = tf.where(is_blank, tf_other, inputs[key])
    return output, target

def _get_mean_parameter(dataframe: pd.DataFrame, column: str) -> float:
    ''' Given a column, calculate mean'''
    mean = dataframe[column].mean()
    return mean

def _get_impute_parameters_num(dataframe: pd.DataFrame, numerical_variables: list) -> dict:
    '''For each column in the numerical features, calculate mean.'''
    impute_parameters = {}
    for column in numerical_variables:
        impute_parameters[column] = _get_mean_parameter(dataframe, column)
    return impute_parameters

def _impute_missing_numerical(inputs: dict, target) -> dict:
    '''Impute missing based on training mean'''
    impute_parameters = _get_impute_parameters_num(data_train, NUMERICAL_VARIABLES) ## Here we have data_train
    # Since we modify just some features, 
    # we need to start by setting `outputs` to a copy of `inputs.
    output = inputs.copy()
    for key, value in impute_parameters.items():
        is_miss = tf.math.is_nan(inputs[key])
        tf_mean = tf.constant(value, dtype=np.float64)
        output[key] = tf.where(is_miss, tf_mean, inputs[key])
    return output, target
            
# A utility method to create a feature column
# and to transform a batch of data
def check_feature(feature_column):
    feature_layer = layers.DenseFeatures(feature_column)
    print(feature_layer(example_batch).numpy())

# Data

## Preview data

In [4]:
!head -n 5 ../data/train/train.csv

BAD,LOAN,MORTDUE,VALUE,REASON,JOB,YOJ,DEROG,DELINQ,CLAGE,NINQ,CLNO,DEBTINC
0,34400,97971.0,145124.0,DebtCon,Other,13.0,0.0,0.0,67.8320416646805,1.0,36.0,40.4027058419691
0,13600,89937.0,110986.0,DebtCon,Sales,14.0,,2.0,146.718742448452,1.0,17.0,33.7471158335903
1,10800,75000.0,87400.0,HomeImp,Other,7.0,1.0,0.0,101.46666666666701,2.0,19.0,
0,14900,87167.0,114219.0,DebtCon,ProfExe,8.0,0.0,0.0,194.113173533089,2.0,36.0,41.3297639035293


## Load Data

In [5]:
data_train = pd.read_csv(TRAIN_DATA_PATH, sep=',')
data_test = pd.read_csv(TEST_DATA_PATH, sep=',')
data_val = pd.read_csv(VAL_DATA_PATH, sep=',')

data_train.head(5)                                                  

Unnamed: 0,BAD,LOAN,MORTDUE,VALUE,REASON,JOB,YOJ,DEROG,DELINQ,CLAGE,NINQ,CLNO,DEBTINC
0,0,34400,97971.0,145124.0,DebtCon,Other,13.0,0.0,0.0,67.832042,1.0,36.0,40.402706
1,0,13600,89937.0,110986.0,DebtCon,Sales,14.0,,2.0,146.718742,1.0,17.0,33.747116
2,1,10800,75000.0,87400.0,HomeImp,Other,7.0,1.0,0.0,101.466667,2.0,19.0,
3,0,14900,87167.0,114219.0,DebtCon,ProfExe,8.0,0.0,0.0,194.113174,2.0,36.0,41.329764
4,0,7200,98691.0,115750.0,HomeImp,Office,22.0,0.0,0.0,118.000142,0.0,11.0,37.720359


In [6]:
data_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4827 entries, 0 to 4826
Data columns (total 13 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   BAD      4827 non-null   int64  
 1   LOAN     4827 non-null   int64  
 2   MORTDUE  4405 non-null   float64
 3   VALUE    4738 non-null   float64
 4   REASON   4618 non-null   object 
 5   JOB      4601 non-null   object 
 6   YOJ      4419 non-null   float64
 7   DEROG    4262 non-null   float64
 8   DELINQ   4362 non-null   float64
 9   CLAGE    4578 non-null   float64
 10  NINQ     4421 non-null   float64
 11  CLNO     4651 non-null   float64
 12  DEBTINC  3804 non-null   float64
dtypes: float64(9), int64(2), object(2)
memory usage: 490.4+ KB


In [7]:
data_train.describe().transpose()

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
BAD,4827.0,0.198674,0.399043,0.0,0.0,0.0,0.0,1.0
LOAN,4827.0,18617.112078,11231.974061,1100.0,11100.0,16300.0,23300.0,89900.0
MORTDUE,4405.0,73796.339364,43741.460123,2063.0,46884.0,65206.0,91491.0,399412.0
VALUE,4738.0,101633.021895,56564.609914,8000.0,66260.25,89407.5,119732.5,855909.0
YOJ,4419.0,8.957377,7.6045,0.0,3.0,7.0,13.0,41.0
DEROG,4262.0,0.244486,0.823733,0.0,0.0,0.0,0.0,10.0
DELINQ,4362.0,0.446813,1.138853,0.0,0.0,0.0,0.0,15.0
CLAGE,4578.0,179.902913,86.744368,0.0,114.858318,173.497696,231.876088,1168.233561
NINQ,4421.0,1.199276,1.745287,0.0,0.0,1.0,2.0,17.0
CLNO,4651.0,21.322081,10.111162,0.0,15.0,20.0,26.0,71.0


**Comment**: We notice that several variables (numerical and categorical) have missing values.

## Import Data in Tensorflow

Based on what I understood when you import data in Tensorflow you need two elements:

**1. input_fn**: specifies how data is converted to a tf.data.Dataset that feeds the input pipeline.

**2. feature column**: a construct that indicates a feature's data type.

In our case, we notice that variables have missing. Then we need to impute them. Also we need to normalize data. 

And because we want to use Tensorflow framework, we can implement data preprocessing and transformation operations in the TensorFlow model itself. In this way, **it becomes an integral part of the model when the model is exported and deployed for predictions.**

TensorFlow transformations can be accomplished in one of the following ways:

1. Extending your base feature_columns (using crossed_column, embedding_column, bucketized_column, and so on).

2. Implementing all of the instance-level transformation logic in a function that you call in all three input functions: train_input_fn, eval_input_fn, and serving_input_fn.

3. If you are creating custom estimators, putting the code in the model_fn function.

Then, we have two approaches to inputs:

**1. Inside the input_fn**

**2. While creating feature_column**

Personally I prefer 

1. Preprocess data in the input_fn 

2. Do feature engineering while creating feature_column.

About **the Data preprocessing strategy of impute missings**, 

- numerical variables: impute with mean

- categorical variables: create 'other' class

## Define input_fn 

In [8]:
TARGET = ['BAD']
CATEGORICAL_VARIABLES = ['REASON', 'JOB']
NUMERICAL_VARIABLES = ['LOAN', 'MORTDUE', 'VALUE', 'YOJ', 'DEROG', 'DELINQ', 'CLAGE', 'NINQ', 'CLNO', 'DEBTINC']

In [9]:
## Create the input function to load data into dataset

def get_dataset(dataframe:pd.DataFrame, target:str, num_epochs=2, shuffle=True, batch_size=5, prefetch=True):
    
    def input_fn():
        '''input_fn to read the data and impute missings'''
        
        # Extract
        df = _set_categorical_type(dataframe)
        df = _set_categorical_empty(df)
        df = _set_numerical_type(df)
        predictors = dict(df)
        label = predictors.pop(target)
        dataset = tf.data.Dataset.from_tensor_slices((predictors, label))
        
        #Transform
        dataset = dataset.map(_impute_missing_categorical)
        dataset = dataset.map(_impute_missing_numerical)
        dataset = dataset.repeat(num_epochs) # repeat the original dataset 3 times 
        if shuffle:
            dataset = dataset.shuffle(buffer_size=1000, seed=8) # shuffle with a buffer of 1000 element
        dataset = dataset.batch(5, drop_remainder=True) # small batch size to print result
        
        #Load
        if prefetch:
            dataset = dataset.prefetch(1) #just to use it. It optimize training parallelizing batch loading over CPU and GPU
            
        #Load: to check
        return dataset
    
    return input_fn

In [10]:
train_input_fn = get_dataset(data_train, 'BAD')
test_input_fn = get_dataset(data_test, 'BAD')

In [11]:
for feature_batch, label_batch in train_input_fn().take(1):
    print('Feature keys:', list(feature_batch.keys()))
    print('A batch of REASON:', feature_batch['REASON'].numpy())
    print('A batch of Labels:', label_batch.numpy())

Feature keys: ['LOAN', 'MORTDUE', 'VALUE', 'REASON', 'JOB', 'YOJ', 'DEROG', 'DELINQ', 'CLAGE', 'NINQ', 'CLNO', 'DEBTINC']
A batch of REASON: [b'DebtCon' b'DebtCon' b'DebtCon' b'DebtCon' b'DebtCon']
A batch of Labels: [0 0 0 0 0]


## Define features and configures feature_columns

In order to import our training data into TensorFlow, we need to specify what type of data each feature contains. 

In our case, we have:

1. **Categorical Data**: 'REASON', 'JOB'

2. **Numerical Data**: 'LOAN', 'MORTDUE', 'VALUE', 'YOJ', 'DEROG', 'DELINQ', 'CLAGE', 'NINQ', 'CLNO', 'DEBTINC'

In TensorFlow, we indicate a feature's data type using a construct called a **feature column**. 

Feature columns store only a description of the feature data; they do not contain the feature data itself.

In [12]:
feature_columns = []

In [13]:
train_dataset = train_input_fn()            
# We will use this batch to demonstrate several types of feature columns
example_batch = next(iter(train_dataset))[0]

In [14]:
# https://www.tensorflow.org/api_docs/python/tf/feature_column/numeric_column

# Numerical variables

for col_name in NUMERICAL_VARIABLES:
    num_feature = tf.feature_column.numeric_column(col_name, dtype=tf.float64)
    feature_columns.append(num_feature)
    
check_feature(feature_columns[0])



To change all layers to have dtype float64 by default, call `tf.keras.backend.set_floatx('float64')`. To change just this layer, pass dtype='float64' to the layer constructor. If you are the author of this layer, you can disable autocasting by passing autocast=False to the base Layer constructor.

[[27700.]
 [ 7700.]
 [19000.]
 [33000.]
 [14900.]]


In [15]:
# Categorical variables

labels_dict= {'REASON': ['DebtCon', 'HomeImp', 'Missing'],
              'JOB' : ['Other', 'Sales', 'ProfExe', 'Office', 'Mgr', 'Self', 'Missing']}

for col_name in CATEGORICAL_VARIABLES:
    cat_feature = tf.feature_column.categorical_column_with_vocabulary_list(col_name, labels_dict[col_name])
    indicator_column = tf.feature_column.indicator_column(cat_feature)
    feature_columns.append(indicator_column)

check_feature(feature_columns[-1])



To change all layers to have dtype float64 by default, call `tf.keras.backend.set_floatx('float64')`. To change just this layer, pass dtype='float64' to the layer constructor. If you are the author of this layer, you can disable autocasting by passing autocast=False to the base Layer constructor.

[[1. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 1. 0. 0. 0.]
 [1. 0. 0. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0. 0. 0.]]


In [16]:
feature_columns

[NumericColumn(key='LOAN', shape=(1,), default_value=None, dtype=tf.float64, normalizer_fn=None),
 NumericColumn(key='MORTDUE', shape=(1,), default_value=None, dtype=tf.float64, normalizer_fn=None),
 NumericColumn(key='VALUE', shape=(1,), default_value=None, dtype=tf.float64, normalizer_fn=None),
 NumericColumn(key='YOJ', shape=(1,), default_value=None, dtype=tf.float64, normalizer_fn=None),
 NumericColumn(key='DEROG', shape=(1,), default_value=None, dtype=tf.float64, normalizer_fn=None),
 NumericColumn(key='DELINQ', shape=(1,), default_value=None, dtype=tf.float64, normalizer_fn=None),
 NumericColumn(key='CLAGE', shape=(1,), default_value=None, dtype=tf.float64, normalizer_fn=None),
 NumericColumn(key='NINQ', shape=(1,), default_value=None, dtype=tf.float64, normalizer_fn=None),
 NumericColumn(key='CLNO', shape=(1,), default_value=None, dtype=tf.float64, normalizer_fn=None),
 NumericColumn(key='DEBTINC', shape=(1,), default_value=None, dtype=tf.float64, normalizer_fn=None),
 Indicator

In [17]:
## Create a get_features function
def get_features(num_features: list, cat_features: list, labels_dict:dict) -> list:
    
    # Create an empty list for feature
    feature_columns = []
    
    #Get numerical features
    for col_name in num_features:
        num_feature = tf.feature_column.numeric_column(col_name, dtype=tf.float64)
        feature_columns.append(num_feature)
    
    #Get categorical features
    for col_name in cat_features:
        cat_feature = tf.feature_column.categorical_column_with_vocabulary_list(col_name, labels_dict[col_name])
        indicator_column = tf.feature_column.indicator_column(cat_feature)
        feature_columns.append(indicator_column)
        
    return feature_columns

In [18]:
feature_columns = get_features(NUMERICAL_VARIABLES, CATEGORICAL_VARIABLES, labels_dict)

## Create Estimator

In [19]:
# Create Feature Layer
feature_layer = tf.keras.layers.DenseFeatures(feature_columns)

In [20]:
# Use Base Estimator classifier
model_dir = tempfile.mkdtemp()
linear_classifier_base = tf.estimator.LinearClassifier(
    model_dir=model_dir, 
    feature_columns=feature_columns,
    n_classes=2
)

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_model_dir': '/tmp/tmppshfzeqb', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': ClusterSpec({}), '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}


In [21]:
train_input_fn = get_dataset(data_train, 'BAD', batch_size=500)
test_input_fn = get_dataset(data_test, 'BAD', batch_size=500)

linear_classifier_base = linear_classifier_base.train(input_fn=train_input_fn, steps=10)

Instructions for updating:
Use Variable.read_value. Variables in 2.X are initialized automatically both in eager and graph (inside tf.defun) contexts.
INFO:tensorflow:Calling model_fn.


To change all layers to have dtype float64 by default, call `tf.keras.backend.set_floatx('float64')`. To change just this layer, pass dtype='float64' to the layer constructor. If you are the author of this layer, you can disable autocasting by passing autocast=False to the base Layer constructor.

Instructions for updating:
Please use `layer.add_weight` method instead.
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Calling checkpoint listeners before saving checkpoint 0...
INFO:tensorflow:Saving checkpoints for 0 into

In [22]:
result = linear_classifier_base.evaluate(input_fn=train_input_fn, steps=20)

for key, value in result.items():
    print(key, ":", value)

INFO:tensorflow:Calling model_fn.


To change all layers to have dtype float64 by default, call `tf.keras.backend.set_floatx('float64')`. To change just this layer, pass dtype='float64' to the layer constructor. If you are the author of this layer, you can disable autocasting by passing autocast=False to the base Layer constructor.

INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2020-09-20T08:13:39Z
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/tmppshfzeqb/model.ckpt-10
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Evaluation [2/20]
INFO:tensorflow:Evaluation [4/20]
INFO:tensorflow:Evaluation [6/20]
INFO:tensorflow:Evaluation [8/20]
INFO:tensorflow:Evaluation [10/20]
INFO:tensorflow:Evaluation [12/20]
INFO:tensorflow:Evaluation [14/20]
INFO:tensorflow:Evaluation [16/20]
INFO:tensorflow:Evaluation [18/20]
INFO:tensorflow:Evaluation [20/20]
INFO:tensorflow:Inference T

In [23]:
for pred in linear_classifier_base.predict(test_input_fn):
    for key, value in pred.items():
        print(key, ":", value)
    break

INFO:tensorflow:Calling model_fn.


To change all layers to have dtype float64 by default, call `tf.keras.backend.set_floatx('float64')`. To change just this layer, pass dtype='float64' to the layer constructor. If you are the author of this layer, you can disable autocasting by passing autocast=False to the base Layer constructor.

INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/tmppshfzeqb/model.ckpt-10
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
logits : [-7812.947]
logistic : [0.]
probabilities : [1. 0.]
class_ids : [0]
classes : [b'0']
all_class_ids : [0 1]
all_classes : [b'0' b'1']
