# Python Machine Learning Practice - Part I

## Overview of the Predictive Modeling Case
A large financial institution has created a new product for its customers. The institution marketed the product to a random sample of customers and gathered demographic and financial information about them. Your goal is to build a model that predicts which customers are most likely to purchase the new product.

## Data
The **BANK_PROMO** data set contains information about account holders at a large financial services firm. The accounts represent consumers of home equity lines of credit, automobile loans, and other short- to medium-term credit instruments.

| Name          | Model Role | Measurement Level | Description                                                            |
|:--------------|:-----------|:------------------|:-----------------------------------------------------------------------|
| B_TGT         | Target     | Binary            | 1 = customer purchased new product, 0 = customer did not purchase      |
| CAT_INPUT1    | Input      | Nominal           | Account activity level                                                 |
| CAT_INPUT2    | Input      | Nominal           | Customer value level                                                   |
| DEMOG_AGE     | Input      | Interval          | Customer age                                                           |
| DEMOG_GEN     | Input      | Binary            | Customer gender                                                        |
| DEMOG_HOS     | Input      | Binary            | Homeowner status                                                       |
| DEMOG_HOMEVAL | Input      | Interval          | Home value                                                             |
| DEMOG_INC     | Input      | Interval          | Income                                                                 |
| RFM5          | Input      | Interval          | Purchase count past three years                                        |
| RFM6          | Input      | Interval          | Purchase count lifetime                                                |
| RFM7          | Input      | Interval          | Purchase count past three years direct promotion response              |
| RFM8          | Input      | Interval          | Purchase count lifetime direct promotion response                      |
| RFM9          | Input      | Interval          | Months since last purchase                                             |
| RFM10         | Input      | Interval          | Total count promos past year                                           |
| RFM11         | Input      | Interval          | Direct promos count past year                                          |
| RFM12         | Input      | Interval          | Customer tenure                                                        |

# Load Packages

In [None]:
import os
import sys
import swat
import pandas as pd
from matplotlib import pyplot as plt
%matplotlib inline
swat.options.cas.print_messages = True

# Connect to CAS

In [None]:
# Positional arguments: host, port, username (None here), and password (here the token)
conn = swat.CAS(os.environ.get("CASHOST"), os.environ.get("CASPORT"), None, os.environ.get("SAS_VIYA_TOKEN"))

# CAS Session

In [None]:
# Change the session timeout to 12 hours (the value is in seconds)
mytime = 60*60*12
conn.session.timeout(time=mytime)
conn.session.sessionStatus()

# Load Data onto the Server

In [None]:
# Read in the bank_promo CSV to an in-memory data table and create a CAS table object reference
castbl = conn.read_csv(os.environ.get("HOME")+"/Courses/EVMLOPRC/DATA/bank_promo.csv", casout = dict(name="bank", replace=True))

# Create variable for the in-memory data set name
indata = 'bank'

# Explore the Data

In [None]:
castbl.head()

In [None]:
display(castbl.shape)
list(castbl)

In [None]:
castbl.mean()

In [None]:
castbl.describe(include=['numeric', 'character'])

# Transform the demog_age variable

In [None]:
conn.loadActionSet('dataStep')
actions = conn.builtins.help(actionSet='dataStep')

In [None]:
conn.dataStep.runCode(code=
    '''
    data bank;
        set bank;
        if demog_age < 18 and demog_age ^= . then demog_age = 18;
    run;
    '''
)

In [None]:
castbl.describe()['demog_age']
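
The DATA step above runs on the CAS server. As a rough local illustration of the same rule, the sketch below floors ages at 18 and leaves missing values untouched. The function name `floor_age` and the sample values are made up for this example, and `None` stands in for the SAS missing value (`.`). In SAS, a missing numeric value compares as smaller than any number, which is why the DATA step needs the explicit `demog_age ^= .` guard.

```python
# Hypothetical local stand-in for the DATA step rule; not part of the course code.
def floor_age(age, minimum=18):
    """Raise ages below `minimum` to `minimum`; leave missing values (None) alone."""
    if age is not None and age < minimum:
        return minimum
    return age

ages = [45, 12, None, 18, 67]            # made-up sample values
floored = [floor_age(a) for a in ages]
print(floored)
```

Only the value below 18 changes; the missing entry stays missing so that it can still be imputed later.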

# Explore the Data using CAS Actions

In [None]:
conn.loadActionSet('simple')
actions = conn.builtins.help(actionSet='simple')

In [None]:
conn.simple.correlation(
    table = indata,
    inputs = ['rfm5','rfm6','rfm7','rfm8','rfm9','rfm10','rfm11','rfm12']
)['Correlation']
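
For intuition about the numbers in the correlation table, here is a minimal pure-Python sketch of the Pearson correlation coefficient, assuming that is the statistic being reported. The helper `pearson` and the three short lists are invented for illustration; they are not values from BANK_PROMO.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squares of the deviations from the means
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Made-up columns: one rises with the first, one falls, one is loosely related
col_a = [1, 2, 3, 4, 5]
col_b = [2, 4, 6, 8, 10]
col_c = [10, 8, 6, 4, 2]
col_d = [2, 1, 4, 3, 5]

print(pearson(col_a, col_b), pearson(col_a, col_c), pearson(col_a, col_d))
```

Pairs of RFM inputs with coefficients near 1 or -1 carry largely the same information, which is worth knowing before modeling.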

In [None]:
conn.simple.distinct(
    table = indata,
    inputs = ['b_tgt','cat_input1','cat_input2','demog_gen','demog_hos']
)

In [None]:
conn.simple.freq(
    table = indata,
    inputs = ['b_tgt','cat_input1','cat_input2','demog_gen','demog_hos']
)

In [None]:
conn.simple.crossTab(
    table = indata,
    row = "b_tgt", col = "cat_input2"
)
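
A cross-tabulation counts how often each combination of row level and column level occurs. The sketch below reproduces that idea locally with `collections.Counter`; the six (target, category) pairs are made up and the layout of the result is an assumption, not the format returned by the CAS action.

```python
from collections import Counter

# Made-up (b_tgt, cat_input2) pairs for illustration only
pairs = [(1, "A"), (0, "A"), (0, "B"), (1, "B"), (0, "B"), (0, "C")]

counts = Counter(pairs)
row_levels = sorted({r for r, _ in pairs})
col_levels = sorted({c for _, c in pairs})

# Nested dict: one entry per target level, holding counts for every category
table = {r: {c: counts[(r, c)] for c in col_levels} for r in row_levels}

for r in row_levels:
    print(r, table[r])
```

Combinations that never occur get a count of zero, because a `Counter` returns 0 for missing keys.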

In [None]:
conn.loadActionSet('cardinality')
actions = conn.builtins.help(actionSet='cardinality')

In [None]:
conn.cardinality.summarize(
    table = indata,
    cardinality = dict(name='card', replace=True)
)

In [None]:
# Create a CASTable reference for the card data table
card = conn.CASTable(name = "card")
display(card.head())
card.shape

# Visualize Numeric Variables Locally

In [None]:
conn.loadActionSet('sampling')
actions = conn.builtins.help(actionSet='sampling')

In [None]:
conn.sampling.srs(
    table   = indata,
    samppct = 25,
    seed = 123,
    partind = False,
    output  = dict(casOut = dict(name = 'mysam', replace = True),  copyVars = 'ALL')
)

In [None]:
# Create connection object
mysam = conn.CASTable(name = "mysam")

# Bring data locally
df = mysam.to_frame()

# Create histograms of the numeric columns
df.hist(bins=20, figsize=(10,10))
plt.show()
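
Sampling on the server keeps the download small: only about a quarter of the rows travel to the client for plotting. The sketch below shows simple random sampling without replacement using the standard library. The helper `simple_random_sample` is hypothetical, and a given seed here will not select the same rows as the same seed in CAS, since the two random number generators are unrelated.

```python
import random

def simple_random_sample(rows, pct, seed):
    """Draw pct percent of rows without replacement, reproducibly."""
    rng = random.Random(seed)              # private generator, so the seed is isolated
    k = round(len(rows) * pct / 100)
    return rng.sample(rows, k)

population = list(range(1000))             # stand-in for row identifiers
sample = simple_random_sample(population, pct=25, seed=123)
print(len(sample))
```

Fixing the seed makes the sample repeatable, so the histograms look the same each time the notebook is rerun.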

# Check Data for Missing Values

In [None]:
# Create a SASDataFrame containing the number of missing values for each variable
tbl = castbl.distinct()['Distinct'][['Column', 'NMiss']]
tbl

In [None]:
# Plot the percent of missing values locally
nr = castbl.shape[0]
tbl['PctMiss'] = 100 * tbl['NMiss'] / nr  # percentage, to match the axis label
MissPlot = tbl.plot(x='Column', y='PctMiss', kind='bar', figsize=(8,8), fontsize=15)
MissPlot.set_xlabel('Variable', fontsize=15)
MissPlot.set_ylabel('Percent Missing', fontsize=15)
MissPlot.legend_.remove()
plt.show()
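
The quantity plotted for each column is its number of missing values divided by the number of rows. A small local sketch of that calculation, expressed as a percentage, follows; the four records are invented and `None` marks a missing value.

```python
# Made-up records; None marks a missing value
rows = [
    {"demog_age": 34,   "demog_inc": 52000, "demog_homeval": None},
    {"demog_age": None, "demog_inc": 61000, "demog_homeval": 150000},
    {"demog_age": 51,   "demog_inc": None,  "demog_homeval": None},
    {"demog_age": 29,   "demog_inc": 48000, "demog_homeval": 210000},
]

n_rows = len(rows)
# For each column: count of missing values divided by row count, times 100
pct_missing = {
    col: 100 * sum(r[col] is None for r in rows) / n_rows
    for col in rows[0]
}
print(pct_missing)
```

Columns with a high percentage are the ones where the choice of imputation method matters most.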

# Impute Missing Values

In [None]:
conn.loadActionSet('dataPreprocess')
conn.dataPreprocess.impute(
    table = indata,
    methodContinuous = 'MEDIAN',
    methodNominal    = 'MODE',
    inputs           = list(castbl)[1:],
    copyAllVars      = True,
    casOut           = dict(name = indata, replace = True)
)
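
The impute call fills missing interval values with the median and missing nominal values with the mode, writing the results to new columns whose names start with `IMP_` (the later cells rely on that prefix). The following is a minimal local sketch of the same idea, not the CAS implementation; `impute_column` and the tiny table are hypothetical.

```python
import statistics

def impute_column(values, method):
    """Replace None with the median or mode of the observed values."""
    observed = [v for v in values if v is not None]
    if method == "MEDIAN":
        fill = statistics.median(observed)
    elif method == "MODE":
        fill = statistics.mode(observed)
    else:
        raise ValueError("unsupported method: " + method)
    return [fill if v is None else v for v in values]

# Made-up columns; the originals are kept and the imputed copies get an IMP_ prefix
table = {
    "demog_inc": [40000, None, 60000, 80000],
    "demog_gen": ["F", "M", None, "F"],
}
table["IMP_demog_inc"] = impute_column(table["demog_inc"], "MEDIAN")
table["IMP_demog_gen"] = impute_column(table["demog_gen"], "MODE")
print(table["IMP_demog_inc"], table["IMP_demog_gen"])
```

Keeping the original columns alongside the imputed ones mirrors `copyAllVars = True` and makes it easy to check which values were filled.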

# Create Variable Shortcuts

In [None]:
# Get variable info and types
colinfo = conn.table.columninfo(table=indata)['ColumnInfo']
colinfo

In [None]:
# Target variable is the first variable
target = colinfo['Column'][0]

# Get all variables
inputs = list(colinfo['Column'][1:])
nominals = list(colinfo.query('Type=="varchar"')['Column'])

# Get only imputed variables
inputs = [k for k in inputs if 'IMP_' in k]
nominals = [k for k in nominals if 'IMP_' in k]
nominals = [target] + nominals

display(target)
display(inputs)
display(nominals)

# Python Machine Learning Practice - Part II

# Split the Data into Training and Validation

In [None]:
conn.sampling.srs(
    table   = indata,
    samppct = 70,
    seed = 919,
    partind = True,
    output  = dict(casOut = dict(name = indata, replace = True),  copyVars = 'ALL')
)

# View the partition

In [None]:
# Refresh the castbl object
castbl = conn.CASTable(name=indata)

# Make sure the partition worked correctly using Python code
castbl['_PartInd_'].mean()
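
Because `_PartInd_` is a 0/1 indicator, its mean equals the share of rows flagged 1, so a value close to 0.70 confirms the 70/30 split. The sketch below illustrates this with a locally generated indicator; `add_partition_indicator` is a hypothetical helper and does not reproduce how CAS assigns rows.

```python
import random

def add_partition_indicator(n_rows, train_pct, seed):
    """Return a shuffled list of 1s (training) and 0s (validation)."""
    rng = random.Random(seed)
    n_train = round(n_rows * train_pct / 100)
    flags = [1] * n_train + [0] * (n_rows - n_train)
    rng.shuffle(flags)                     # random assignment of rows to partitions
    return flags

part_ind = add_partition_indicator(n_rows=1000, train_pct=70, seed=919)
train_share = sum(part_ind) / len(part_ind)   # mean of a 0/1 column
print(train_share)
```

Rows with indicator 1 play the role of the `_PartInd_ = 1` training filter, and rows with 0 are held out for validation.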

# Decision Tree

In [None]:
conn.loadActionSet('decisionTree')
actions = conn.builtins.help(actionSet='decisionTree')

In [None]:
conn.decisionTree.dtreeTrain(
    table    = dict(name = indata, where = '_PartInd_ = 1'),
    target   = target, 
    inputs   = inputs, 
    nominals = nominals,
    casOut   = dict(name = 'dt_model', replace = True)
)

# Random Forest

In [None]:
conn.decisionTree.forestTrain(
    table    = dict(name = indata, where = '_PartInd_ = 1'),
    target   = target, 
    inputs   = inputs, 
    nominals = nominals,
    nTree    = 1000,
    casOut   = dict(name = 'rf_model', replace = True)
)

# Gradient Boosting

In [None]:
conn.decisionTree.gbtreeTrain(
    table    = dict(name = indata, where = '_PartInd_ = 1'),
    target   = target, 
    inputs   = inputs, 
    nominals = nominals,
    nTree    = 1000,
    casOut   = dict(name = 'gbt_model', replace = True)
)

# Score the Models

In [None]:
#Score the decision tree model
dt_score_obj = conn.decisionTree.dtreeScore(
    table    = dict(name = indata, where = '_PartInd_ = 0'),
    model = "dt_model",
    casout = dict(name="dt_scored",replace=True),
    copyVars = target,
    encodename = True,
    assessonerow = True
)

#Score the random forest model
rf_score_obj = conn.decisionTree.forestScore(
    table    = dict(name = indata, where = '_PartInd_ = 0'),
    model = "rf_model",
    casout = dict(name="rf_scored",replace=True),
    copyVars = target,
    encodename = True,
    assessonerow = True
)

#Score the gradient boosting model
gb_score_obj = conn.decisionTree.gbtreeScore(
    table    = dict(name = indata, where = '_PartInd_ = 0'),
    model = "gbt_model",
    casout = dict(name="gbt_scored",replace=True),
    copyVars = target,
    encodename = True,
    assessonerow = True
)

# Assess the Models

In [None]:
conn.loadActionSet('percentile')
actions = conn.builtins.help(actionSet='percentile')

In [None]:
# Create prediction variable name
assess_input = 'P_'+target+'1'

# Assess the decision tree model
dt_assess_obj = conn.percentile.assess(
   table = "dt_scored",
   inputs = assess_input,
   casout = dict(name="dt_assess",replace=True),
   response = target,
   event = "1"
)

# Assess the random forest model
rf_assess_obj = conn.percentile.assess(
   table = "rf_scored",
   inputs = assess_input,
   casout = dict(name="rf_assess",replace=True),
   response = target,
   event = "1"
)

# Assess the gradient boosting model
gb_assess_obj = conn.percentile.assess(
   table = "gbt_scored",
   inputs = assess_input,
   casout = dict(name="gbt_assess",replace=True),
   response = target,
   event = "1"
)

# Bring Results to the Client

In [None]:
# Create table objects from the assess output, 
# bring data to the client, 
# and add new variable to data frame indicating model name

dt_assess_ROC = conn.CASTable(name = "dt_assess_ROC")
dt_assess_ROC = dt_assess_ROC.to_frame()
dt_assess_ROC['Model']= 'Decision Tree'

rf_assess_ROC = conn.CASTable(name = "rf_assess_ROC")
rf_assess_ROC = rf_assess_ROC.to_frame()
rf_assess_ROC['Model'] = 'Random Forest'

gbt_assess_ROC = conn.CASTable(name = "gbt_assess_ROC")
gbt_assess_ROC = gbt_assess_ROC.to_frame()
gbt_assess_ROC['Model'] = 'Gradient Boosting'

# Compare Confusion Matrix

In [None]:
df_assess = pd.concat([dt_assess_ROC,rf_assess_ROC,gbt_assess_ROC])
cutoff_index = round(df_assess['_Cutoff_'],2)==0.5
compare = df_assess[cutoff_index].reset_index(drop=True)
compare[['Model','_TP_','_FP_','_FN_','_TN_']]

# Compare Misclassification

In [None]:
compare['Misclassification'] = 1-compare['_ACC_']
miss = compare[['Model','Misclassification']]  # compare already holds only the 0.5 cutoff rows
miss.sort_values('Misclassification')
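
The `_TP_`, `_FP_`, `_FN_`, and `_TN_` columns are the confusion matrix counts at a given cutoff, and misclassification is one minus accuracy. Here is a minimal sketch of how those numbers relate, using made-up labels and predicted probabilities. Treating a probability equal to the cutoff as a predicted event is an assumption of this sketch, not a documented detail of the assess action.

```python
def confusion_counts(actual, prob_event, cutoff=0.5):
    """Count TP, FP, FN, TN when probabilities >= cutoff are predicted as events."""
    tp = fp = fn = tn = 0
    for y, p in zip(actual, prob_event):
        predicted = 1 if p >= cutoff else 0   # assumption: ties go to the event
        if predicted == 1 and y == 1:
            tp += 1
        elif predicted == 1 and y == 0:
            fp += 1
        elif predicted == 0 and y == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

# Made-up validation labels and predicted event probabilities
actual = [1, 0, 1, 1, 0, 0, 1, 0]
prob = [0.9, 0.4, 0.6, 0.3, 0.2, 0.7, 0.8, 0.1]

tp, fp, fn, tn = confusion_counts(actual, prob)
accuracy = (tp + tn) / (tp + fp + fn + tn)
misclassification = 1 - accuracy
print(tp, fp, fn, tn, misclassification)
```

A lower misclassification rate at the 0.5 cutoff is the criterion the comparison table sorts by.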

# Compare ROC Curves

In [None]:
plt.figure(figsize=(8,8))
models = list(df_assess.Model.unique())

# Iteratively add curves to the plot
for X in models:
    tmp = df_assess[df_assess['Model']==X]
    plt.plot(tmp['_FPR_'],tmp['_Sensitivity_'], label=X+' (C=%0.2f)'%tmp['_C_'].mean())

plt.xlabel('False Positive Rate', fontsize=15)
plt.ylabel('True Positive Rate', fontsize=15)
plt.legend(loc='lower right')
plt.show()
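
Each ROC curve traces the true positive rate against the false positive rate as the cutoff moves from 1 down to 0, and the C statistic shown in the legend is the area under that curve. The sketch below computes both for a small made-up example. The function names, the 0.01 cutoff grid, and the trapezoid rule are choices made for this illustration and are not taken from the assess action.

```python
def roc_points(actual, prob_event, cutoffs):
    """Return sorted (FPR, TPR) pairs, one per cutoff."""
    n_pos = sum(actual)
    n_neg = len(actual) - n_pos
    points = []
    for c in cutoffs:
        tp = sum(1 for y, p in zip(actual, prob_event) if p >= c and y == 1)
        fp = sum(1 for y, p in zip(actual, prob_event) if p >= c and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return sorted(points)

def auc_trapezoid(points):
    """Area under a curve given as sorted (x, y) points, by the trapezoid rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Made-up validation labels and predicted event probabilities
actual = [1, 0, 1, 1, 0, 0, 1, 0]
prob = [0.9, 0.4, 0.6, 0.3, 0.2, 0.7, 0.8, 0.1]

cutoffs = [i / 100 for i in range(101)]    # 0.00, 0.01, ..., 1.00
points = roc_points(actual, prob, cutoffs)
auc = auc_trapezoid(points)
print(auc)
```

A curve that hugs the upper-left corner yields an area near 1, while a model no better than chance yields an area near 0.5.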

# Python Machine Learning Practice - Part III

# Efficient Scoring - Looping

In [None]:
models = ['dt','rf','gbt']
# Hold the scoring actions as callables so that no string needs to be passed to eval
actions = [conn.decisionTree.dtreeScore, conn.decisionTree.forestScore, conn.decisionTree.gbtreeScore]

# Create function to build the scoring parameters for a given model
def score_func(model):
    tmp_dict = dict(
        table    = dict(name = indata, where = '_PartInd_ = 0'),
        model = model+'_model',
        casout = dict(name=model+'_scored', replace=True),
        copyVars = target,
        encodename = True,
        assessonerow = True
    )
    return tmp_dict

# Loop over the models and their matching scoring actions
for model, action in zip(models, actions):
    params = score_func(model)
    obj = action(**params)
    print(model)
    print(obj['ScoreInfo'].iloc[[2]])

# Efficient Assessment - Looping

In [None]:
# Create function to assess a given model
def assess_func(model):
    tmp_dict = dict(
        table = model+'_scored',
        inputs = 'P_'+target+'1',
        casout = dict(name=model+'_assess' ,replace=True),
        response = target,
        event = "1"
    )
    return tmp_dict

# Loop over the models
for model in models:
    params = assess_func(model)
    obj = conn.percentile.assess(**params)
    print(obj['OutputCasTables'][['Name','Rows','Columns']])

# Create Confusion Matrix

In [None]:
# Create function to bring assess tables to the client
def assess_local_roc(model):
    castbl_obj = conn.CASTable(name = model+'_assess_ROC')
    local_tbl = castbl_obj.to_frame()
    local_tbl['Model'] = model
    return local_tbl

# Bring result tables to the client and stack them into one data frame
df_assess = pd.concat([assess_local_roc(model) for model in models])

cutoff_index = round(df_assess['_Cutoff_'],2)==0.5
compare = df_assess[cutoff_index].reset_index(drop=True)
compare[['Model','_TP_','_FP_','_FN_','_TN_']]

# Compare Misclassification

In [None]:
compare['Misclassification'] = 1-compare['_ACC_']
miss = compare[['Model','Misclassification']]  # compare already holds only the 0.5 cutoff rows
miss.sort_values('Misclassification')

# Add caslib to specify a data source

In [None]:
conn.table.addCaslib(name="mycl", path=os.environ.get("HOME"), dataSource="PATH", activeOnAdd = False)

# Save the Best Model

In [None]:
conn.table.save(caslib = 'mycl', table = dict(name = 'gbt_model'), name = 'best_model_gbt', replace = True)

In [None]:
conn.table.attribute(caslib = 'CASUSER', table = 'gbt_model_attr', name = 'gbt_model', task='convert')
conn.table.save(caslib = 'mycl', table = 'gbt_model_attr', name = 'attr', replace = True)

# Promote Data Table to Global Scope

In [None]:
conn.table.promote(caslib="casuser", name=indata)
conn.table.tableInfo()

# End the Session

In [None]:
conn.session.endSession()

# Python Machine Learning Practice - Part IV
