# Overview 

## Objective

This notebook shows how to train TensorFlow classifiers on the HMEQ dataset.

The goal is to predict whether a customer is a BAD borrower (one who defaults), which makes this a binary classification task.

## Assumption

We assume a big data context.

Accordingly, I treat the HMEQ dataset as if it were too large to fit in RAM.

We use the TensorFlow framework to handle that case.
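
To make the out-of-memory assumption concrete, here is a minimal sketch that computes a column mean by streaming a CSV row by row with the standard library, so the full table is never held in memory. The helper name `streaming_mean` and the tiny inline sample are illustrative assumptions, not part of the notebook's pipeline.

```python
import csv
import io

def streaming_mean(lines, column):
    """Mean of `column` over a CSV stream, skipping empty cells."""
    total, count = 0.0, 0
    for row in csv.DictReader(lines):
        cell = row[column]
        if cell != "":
            total += float(cell)
            count += 1
    return total / count if count else float("nan")

# Made-up sample with one missing VALUE; a real run would pass an open file handle
sample = io.StringIO("BAD,LOAN,VALUE\n1,1100,39025\n0,1300,\n0,1500,68400\n")
print(streaming_mean(sample, "VALUE"))
```

The same pattern would give the training means needed for imputation without loading the whole file.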

## Imports and setup

In [None]:
#General
import os
import functools
import pprint
import tempfile

#Analysis
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow.keras import layers

## Define variables

In [None]:
BASE_DIR_PATH = os.getcwd()
DATA_DIR_PATH = os.path.join(BASE_DIR_PATH, '../data')

# Data directories paths
TRAIN_DIR_PATH = os.path.join(DATA_DIR_PATH, 'train')
TEST_DIR_PATH = os.path.join(DATA_DIR_PATH, 'test')
VAL_DIR_PATH = os.path.join(DATA_DIR_PATH, 'val')

# Data file paths
TRAIN_DATA_PATH = os.path.join(TRAIN_DIR_PATH, 'train.csv')
TEST_DATA_PATH = os.path.join(TEST_DIR_PATH, 'test.csv')
VAL_DATA_PATH = os.path.join(VAL_DIR_PATH, 'val.csv')

# Modeldir
MODEL_DIR = os.path.join(BASE_DIR_PATH, '../models')

## Define Helpers

In [None]:
## Preprocessing data
def _set_categorical_type(dataframe: pd.DataFrame) -> pd.DataFrame:
    '''Set the categorical type as string if needed'''
    for column in CATEGORICAL_VARIABLES:
        if (dataframe[column].dtype == 'O'):
            dataframe[column] = dataframe[column].astype('string')
    return dataframe

def _set_categorical_empty(dataframe: pd.DataFrame) -> pd.DataFrame:
    '''Replace missing categorical values with empty strings to avoid TF issues'''
    for column in CATEGORICAL_VARIABLES:
        if any(dataframe[column].isna()):
            dataframe[column] = dataframe[column].fillna('')
    return dataframe

def _set_numerical_type(dataframe: pd.DataFrame) -> pd.DataFrame:
    '''Set the numerical type as float64 if needed'''
    for column in NUMERICAL_VARIABLES:
        if (dataframe[column].dtype == 'int64'):
            dataframe[column] = dataframe[column].astype('float64')
    return dataframe

def _get_impute_parameters_cat(categorical_variables: list) -> dict:
    '''For each categorical column, use the constant label "Missing" as fill value.'''
    impute_parameters = {}
    for column in categorical_variables:
        impute_parameters[column] = 'Missing'
    return impute_parameters

def _impute_missing_categorical(inputs: dict, target) -> tuple:
    '''Replace blank categorical values with the "Missing" label'''
    impute_parameters = _get_impute_parameters_cat(CATEGORICAL_VARIABLES)
    # Since we modify just some features,
    # we need to start by setting `output` to a copy of `inputs`.
    output = inputs.copy()
    for key, value in impute_parameters.items():
        is_blank = tf.math.equal('', inputs[key])
        # tf.string avoids np.string_, which newer NumPy releases no longer provide
        tf_other = tf.constant(value, dtype=tf.string)
        output[key] = tf.where(is_blank, tf_other, inputs[key])
    return output, target

def _get_mean_parameter(dataframe: pd.DataFrame, column: str) -> float:
    ''' Given a column, calculate mean'''
    mean = dataframe[column].mean()
    return mean

def _get_impute_parameters_num(dataframe: pd.DataFrame, numerical_variables: list) -> dict:
    '''For each column in the numerical features, calculate mean.'''
    impute_parameters = {}
    for column in numerical_variables:
        impute_parameters[column] = _get_mean_parameter(dataframe, column)
    return impute_parameters

def _impute_missing_numerical(inputs: dict, target) -> tuple:
    '''Impute missing values with the training mean'''
    # Relies on the global `data_train`, which must be loaded before the dataset is built
    impute_parameters = _get_impute_parameters_num(data_train, NUMERICAL_VARIABLES)
    # Since we modify just some features,
    # we need to start by setting `output` to a copy of `inputs`.
    output = inputs.copy()
    for key, value in impute_parameters.items():
        is_miss = tf.math.is_nan(inputs[key])
        tf_mean = tf.constant(value, dtype=tf.float64)
        output[key] = tf.where(is_miss, tf_mean, inputs[key])
    return output, target
            
# A utility method to create a feature column
# and to transform a batch of data
def check_feature(feature_column):
    feature_layer = layers.DenseFeatures(feature_column)
    print(feature_layer(example_batch).numpy())

# Data

## Preview data

In [None]:
!head -n 5 ../data/train/train.csv

## Load Data

In [None]:
data_train = pd.read_csv(TRAIN_DATA_PATH, sep=',')
data_test = pd.read_csv(TEST_DATA_PATH, sep=',')
data_val = pd.read_csv(VAL_DATA_PATH, sep=',')

data_train.head(5)                                                  

In [None]:
data_train.info()

In [None]:
data_train.describe().transpose()

**Comment**: We notice that several variables (numerical and categorical) have missing values.
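
The observation above can be checked directly by counting missing cells per column. This is a small sketch on a made-up frame standing in for `data_train`; the values are invented for illustration.

```python
import pandas as pd

# Toy frame standing in for data_train (values are made up)
toy = pd.DataFrame({
    "LOAN": [1100, 1300, 1500],
    "VALUE": [39025.0, float("nan"), 68400.0],
    "JOB": ["Other", None, "Office"],
})

missing_counts = toy.isna().sum()   # number of missing cells per column
missing_share = toy.isna().mean()   # fraction of missing cells per column
print(missing_counts[missing_counts > 0])
```

On the real data, `data_train.isna().sum()` should give the same kind of summary.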

## Import Data in Tensorflow

As far as I understand, importing data into TensorFlow requires two elements:

**1. input_fn**: specifies how data is converted to a tf.data.Dataset that feeds the input pipeline.

**2. feature column**: a construct that indicates a feature's data type.

In our case, several variables have missing values, so we need to impute them. We also need to normalize the data.

Because we are using TensorFlow, we can implement data preprocessing and transformation operations in the TensorFlow model itself. That way, **they become an integral part of the model when the model is exported and deployed for predictions.**

TensorFlow transformations can be accomplished in one of the following ways:

1. Extending your base feature_columns (using crossed_column, embedding_column, bucketized_column, and so on).

2. Implementing all of the instance-level transformation logic in a function that you call in all three input functions: train_input_fn, eval_input_fn, and serving_input_fn.

3. If you are creating custom estimators, putting the code in the model_fn function.

That leaves two places where inputs can be transformed:

**1. Inside the input_fn**

**2. While creating feature_column**

Personally, I prefer to:

1. Preprocess data in the input_fn.

2. Do feature engineering while creating feature_column.

The **preprocessing strategy for imputing missing values** is:

- numerical variables: impute with the training mean

- categorical variables: create a 'Missing' class
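
As a rough pandas sketch of this strategy (the notebook itself applies it inside the `tf.data` pipeline): the fill values are learned on the training split only and then reused on the test split. The two toy frames are made up, and `'Missing'` is the label the helper functions above use.

```python
import pandas as pd

# Made-up splits with one numerical and one categorical column
train = pd.DataFrame({"YOJ": [10.0, float("nan"), 4.0], "JOB": ["Mgr", None, "Sales"]})
test = pd.DataFrame({"YOJ": [float("nan"), 2.0], "JOB": [None, "Office"]})

# Parameters come from the training split only, to avoid leaking test information
fill_values = {"YOJ": train["YOJ"].mean(), "JOB": "Missing"}

train_imputed = train.fillna(fill_values)
test_imputed = test.fillna(fill_values)
print(test_imputed)
```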

## Define input_fn 

In [None]:
TARGET = ['BAD']
CATEGORICAL_VARIABLES = ['REASON', 'JOB']
NUMERICAL_VARIABLES = ['LOAN', 'MORTDUE', 'VALUE', 'YOJ', 'DEROG', 'DELINQ', 'CLAGE', 'NINQ', 'CLNO', 'DEBTINC']

In [None]:
## Create the input function to load data into dataset

def get_dataset(dataframe:pd.DataFrame, target:str, num_epochs=2, repeat=True, shuffle=True, batch_size=5, prefetch=True):
    
    def input_fn():
        '''input_fn to read the data and impute missings'''
        
        # Extract
        df = _set_categorical_type(dataframe)
        df = _set_categorical_empty(df)
        df = _set_numerical_type(df)
        predictors = dict(df)
        label = predictors.pop(target)
        dataset = tf.data.Dataset.from_tensor_slices((predictors, label))
        
        #Transform
        dataset = dataset.map(_impute_missing_categorical)
        dataset = dataset.map(_impute_missing_numerical)
        if repeat:
            dataset = dataset.repeat(num_epochs) # repeat the original dataset num_epochs times
        if shuffle:
            dataset = dataset.shuffle(buffer_size=1000, seed=8) # shuffle with a buffer of 1000 elements
        dataset = dataset.batch(batch_size, drop_remainder=True) # the default batch_size=5 is small enough to print results
        
        #Load
        if prefetch:
            dataset = dataset.prefetch(1) # prepare the next batch while the current one is being consumed
            
        #Load: to check
        return dataset
    
    return input_fn
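
How many batches one pass of this pipeline yields follows from simple arithmetic, since `repeat` runs before `batch` and `drop_remainder=True` discards a final partial batch. The helper `batches_per_run` below is a hypothetical illustration, and the row count of 1192 is made up.

```python
def batches_per_run(num_rows, num_epochs, batch_size, drop_remainder=True):
    """Number of batches when the data is repeated first and batched afterwards."""
    total = num_rows * num_epochs
    full, rest = divmod(total, batch_size)
    # A trailing partial batch only counts when it is not dropped
    return full if drop_remainder or rest == 0 else full + 1

# With 1192 rows, one epoch and batches of 500, 192 rows are silently dropped
print(batches_per_run(1192, 1, 500))
print(batches_per_run(1192, 1, 500, drop_remainder=False))
```

This matters mostly for evaluation: with a large batch size and `drop_remainder=True`, part of the test set never reaches the model.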

In [None]:
train_input_fn = get_dataset(data_train, 'BAD')
test_input_fn = get_dataset(data_test, 'BAD', shuffle=False)

In [None]:
for feature_batch, label_batch in train_input_fn().take(1):
    print('Feature keys:', list(feature_batch.keys()))
    print('A batch of REASON:', feature_batch['REASON'].numpy())
    print('A batch of Labels:', label_batch.numpy())

## Define features and configure feature_columns

In order to import our training data into TensorFlow, we need to specify what type of data each feature contains. 

In our case, we have:

1. **Categorical Data**: 'REASON', 'JOB'

2. **Numerical Data**: 'LOAN', 'MORTDUE', 'VALUE', 'YOJ', 'DEROG', 'DELINQ', 'CLAGE', 'NINQ', 'CLNO', 'DEBTINC'

In TensorFlow, we indicate a feature's data type using a construct called a **feature column**. 

Feature columns store only a description of the feature data; they do not contain the feature data itself.
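
To see what such a description turns into once data flows through it, here is a framework-free sketch of a one-hot (indicator) encoding over a fixed vocabulary, which is how the categorical variables are represented in this notebook. `one_hot` is a hypothetical helper, not a TensorFlow API, and the all-zeros result for unknown values reflects my understanding of the default out-of-vocabulary handling.

```python
def one_hot(value, vocabulary):
    """Indicator vector for `value`; out-of-vocabulary values give all zeros."""
    return [1.0 if value == term else 0.0 for term in vocabulary]

reason_vocab = ["DebtCon", "HomeImp", "Missing"]
print(one_hot("HomeImp", reason_vocab))
print(one_hot("Unknown", reason_vocab))
```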

In [None]:
feature_columns = []

In [None]:
train_dataset = train_input_fn()            
# We will use this batch to demonstrate several types of feature columns
example_batch = next(iter(train_dataset))[0]

In [None]:
# https://www.tensorflow.org/api_docs/python/tf/feature_column/numeric_column

# Numerical variables

for col_name in NUMERICAL_VARIABLES:
    num_feature = tf.feature_column.numeric_column(col_name, dtype=tf.float64)
    feature_columns.append(num_feature)
    
check_feature(feature_columns[0])

In [None]:
# Categorical variables

labels_dict= {'REASON': ['DebtCon', 'HomeImp', 'Missing'],
              'JOB' : ['Other', 'Sales', 'ProfExe', 'Office', 'Mgr', 'Self', 'Missing']}

for col_name in CATEGORICAL_VARIABLES:
    cat_feature = tf.feature_column.categorical_column_with_vocabulary_list(col_name, labels_dict[col_name])
    indicator_column = tf.feature_column.indicator_column(cat_feature)
    feature_columns.append(indicator_column)

check_feature(feature_columns[-1])

In [None]:
feature_columns

In [None]:
## Create a get_features function
def get_features(num_features: list, cat_features: list, labels_dict:dict) -> list:
    
    # Create an empty list for feature
    feature_columns = []
    
    #Get numerical features
    for col_name in num_features:
        num_feature = tf.feature_column.numeric_column(col_name, dtype=tf.float64)
        feature_columns.append(num_feature)
    
    #Get categorical features
    for col_name in cat_features:
        cat_feature = tf.feature_column.categorical_column_with_vocabulary_list(col_name, labels_dict[col_name])
        indicator_column = tf.feature_column.indicator_column(cat_feature)
        feature_columns.append(indicator_column)
        
    return feature_columns

In [None]:
feature_columns = get_features(NUMERICAL_VARIABLES, CATEGORICAL_VARIABLES, labels_dict)
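
As a quick sanity check on these columns, the width of the dense vector fed to the model can be worked out by hand, assuming each numeric column contributes one value (the default shape) and each indicator column contributes one slot per vocabulary entry. The lists are copied from the cells above so the sketch runs on its own.

```python
numerical = ['LOAN', 'MORTDUE', 'VALUE', 'YOJ', 'DEROG',
             'DELINQ', 'CLAGE', 'NINQ', 'CLNO', 'DEBTINC']
vocabularies = {
    'REASON': ['DebtCon', 'HomeImp', 'Missing'],
    'JOB': ['Other', 'Sales', 'ProfExe', 'Office', 'Mgr', 'Self', 'Missing'],
}

# One slot per numeric feature plus one slot per vocabulary entry
input_width = len(numerical) + sum(len(terms) for terms in vocabularies.values())
print(input_width)
```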

## Create Estimator

In [None]:
# Create Feature Layer
feature_layer = tf.keras.layers.DenseFeatures(feature_columns)

In [None]:
# Use Base Estimator classifier
# model_dir = tempfile.mkdtemp()
modeldir = './test'
os.makedirs(modeldir, exist_ok=True)  # does not fail if the directory already exists
linear_classifier_base = tf.estimator.LinearClassifier(
    model_dir=modeldir, 
    feature_columns=feature_columns,
    n_classes=2
)

In [None]:
import shutil

def build_estimator(feature_columns, learning_rate=0.1):
    """
    Build a LinearClassifier estimator that writes to a fresh ./test directory.
    """
    modeldir = './test'
    # os.rmdir fails on a non-empty directory, so remove the whole tree instead
    shutil.rmtree(modeldir, ignore_errors=True)
    os.makedirs(modeldir)
    runconfig = tf.estimator.RunConfig(tf_random_seed=8)
    
    linear_classifier_base = tf.estimator.LinearClassifier(
        model_dir=modeldir,
        feature_columns=feature_columns,
        n_classes=2,
        optimizer=tf.keras.optimizers.SGD(learning_rate=learning_rate),
        config=runconfig,  # pass the run config so the random seed is actually used
    )
    
    return linear_classifier_base

In [None]:
estimator = build_estimator(feature_columns)

## Train and Evaluate model

In [None]:
train_input_fn = get_dataset(data_train, 'BAD', batch_size=500)
test_input_fn = get_dataset(data_test, 'BAD', batch_size=500, repeat=False, shuffle=False)

linear_classifier_base = linear_classifier_base.train(input_fn=train_input_fn, steps=10)

In [None]:
metrics = linear_classifier_base.evaluate(input_fn=test_input_fn, steps=10)

for key, value in metrics.items():
    print(key, ":", value)
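
For reference, the headline classification metrics can be reproduced by hand from labels and predicted classes. This is a plain-Python sketch on made-up values; `binary_metrics` is a hypothetical helper, and it covers only accuracy, precision and recall rather than everything the estimator reports.

```python
def binary_metrics(labels, predictions):
    """Accuracy, precision and recall for 0/1 labels and predictions."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

# Made-up labels and predictions, not results from this notebook
example_labels = [1, 0, 0, 1, 0, 1, 0, 0]
example_preds = [1, 0, 0, 0, 0, 1, 1, 0]
print(binary_metrics(example_labels, example_preds))
```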

In [None]:
def train_and_evaluate():
    '''Train and evaluate a fresh estimator. TODO: parametrize paths, batch size and steps.'''
    # Get dataset
    train_input_fn = get_dataset(data_train, 'BAD', batch_size=500)
    test_input_fn = get_dataset(data_test, 'BAD', batch_size=500, repeat=False, shuffle=False)
    # Get Features
    feature_columns = get_features(NUMERICAL_VARIABLES, CATEGORICAL_VARIABLES, labels_dict)
    # Get estimator
    estimator = build_estimator(feature_columns)
    # Train the estimator
    estimator_train = estimator.train(input_fn=train_input_fn, steps=10)
    # Evaluate 
    metrics = estimator_train.evaluate(input_fn=test_input_fn, steps=10)
    return estimator_train, metrics

In [None]:
model, metrics = train_and_evaluate()

In [None]:
# # Create estimator train and evaluate function
# def train_and_evaluate(args):
#     tf.compat.v1.summary.FileWriterCache.clear() # ensure filewriter cache is clear for TensorBoard events file
#     estimator = build_estimator(args['output_dir'], args['nbuckets'], args['hidden_units'].split(' '))
#     train_spec = tf.estimator.TrainSpec(
#         input_fn = read_dataset(
#             filename = args['train_data_paths'],
#             mode = tf.estimator.ModeKeys.TRAIN,
#             batch_size = args['train_batch_size']),
#         max_steps = args['train_steps'])
#     exporter = tf.estimator.LatestExporter('exporter', serving_input_fn)
#     eval_spec = tf.estimator.EvalSpec(
#         input_fn = read_dataset(
#             filename = args['eval_data_paths'],
#             mode = tf.estimator.ModeKeys.EVAL,
#             batch_size = args['eval_batch_size']),
#         steps = 100,
#         exporters = exporter)
#     tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)

## Test Model Predict

In [None]:
# an example of predictions
for pred in linear_classifier_base.predict(test_input_fn):
    for key, value in pred.items():
        print(key, ":", value)
    break
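
To read such a prediction, it helps to recall how a binary linear classifier turns a raw score into a probability and a class. The sketch below assumes the usual sigmoid link and a 0.5 threshold; the dictionary keys mirror the names I would expect in the printed output, but treat them as an assumption, and `logit_to_prediction` is a hypothetical helper.

```python
import math

def logit_to_prediction(logit, threshold=0.5):
    """Map a raw score to a probability of class 1 and a predicted class id."""
    probability = 1.0 / (1.0 + math.exp(-logit))
    return {
        "logistic": probability,
        "probabilities": [1.0 - probability, probability],
        "class_ids": int(probability > threshold),
    }

print(logit_to_prediction(2.0))
print(logit_to_prediction(-2.0))
```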

## Save your model

In [None]:
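import shutil

version = 1
export_path = os.path.join(MODEL_DIR, str(version))
# Start from a clean export directory; rmtree also copes with a missing or non-empty one
shutil.rmtree(export_path, ignore_errors=True)
os.makedirs(export_path, mode=0o777)  # permissions must be octal: 0o777, not 777
print('export_path = {}\n'.format(export_path))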

In [None]:
feature_columns

In [None]:
# Build one parsing spec covering every feature column, then export
feature_spec = tf.feature_column.make_parse_example_spec(feature_columns)
serving_input_fn = tf.estimator.export.build_parsing_serving_input_receiver_fn(feature_spec)
estimator_path = estimator.export_saved_model(export_path, serving_input_fn)

In [None]:
# Alternative: export to a temporary directory instead of MODEL_DIR
serving_input_fn = tf.estimator.export.build_parsing_serving_input_receiver_fn(
  tf.feature_column.make_parse_example_spec(feature_columns))
estimator_base_path = os.path.join(tempfile.mkdtemp(), 'from_estimator')
estimator_path = estimator.export_saved_model(estimator_base_path, serving_input_fn)