##### To demonstrate the performance and productivity improvements that SAS Viya Workbench can provide, the training data set created in the SAS notebook (annuity_advisors_prep) was run through the SAS Data Maker application to generate a very large sample of synthetic data.  This larger table is used to compare the speed of the SAS Viya Workbench Python APIs with that of native Python libraries.

##### Using the gradient boosting algorithm from the scikit-learn library and the SAS Viya Workbench gradient boosting Python API, a speed test compares how long each takes to train a classifier model.  In the recorded run, the scikit-learn fit took about 33.4 seconds and the SAS Viya Workbench fit took about 1.0 second, as shown in the timing output of each cell.

In [None]:
########################
### Create Dataframe ###
########################

import pandas as pd
from pathlib import Path

workspace_dir = "/workspaces/chris_parrish/sas_viya/poc/1_data_management/annuity_advisors"
data_table = "annuity_advisors_prep_synthetic.csv"

dm_inputdf = pd.read_csv(Path(workspace_dir) / data_table, header=0)
print(dm_inputdf.dtypes)
print("Dimensions of Synthetic Data:", dm_inputdf.shape)

advisor                            int64
advisor_event_indicator            int64
sf_face_2_face                     int64
sf_call_outbound                   int64
sf_call_inbound                    int64
sf_email_inbound                   int64
channel_bank                       int64
channel_wirehouse                  int64
channel_ria                        int64
primary_prod_sold_fixed            int64
primary_prod_sold_va               int64
sf_email_campaigns                 int64
advisor_hh_children                int64
annuity_mkt_opp                  float64
advisor_advising_years           float64
advisor_aum                      float64
advisor_annuity_selling_years    float64
advisor_age                      float64
advisor_net_worth                float64
advisor_credit_hist_mos          float64
advisor_firm_changes               int64
advisor_credit_score             float64
wholesaler                         int64
region_ca                          int64
region_ny       

In [2]:
########################
### Model Parameters ###
########################

### import python libraries
import numpy as np
from sklearn.utils import shuffle

### model manager information
metadata_output_dir = 'outputs'
model_name = 'logit_python_annuity_workbench'
project_name = 'Annuity Advisors'
description = 'Logistic Regression'
model_type = 'logistic_regression'
model_function = 'Classification'
predict_syntax = 'predict_proba'

### define macro variables for model
dm_dec_target = 'advisor_event_indicator'
dm_partitionvar = 'analytic_partition'
create_new_partition = 'no' # 'yes', 'no'
dm_key = 'advisor' 
dm_classtarget_level = ['0', '1']
dm_partition_validate_val, dm_partition_train_val, dm_partition_test_val = [0, 1, 2]
dm_partition_validate_perc, dm_partition_train_perc, dm_partition_test_perc = [0.3, 0.6, 0.1]

### create list of regressors
keep_predictors = [
    ]
rejected_predictors = [
    'channel_ria',
    'region_we',
    'primary_prod_sold_fixed'
    ] 

### mlflow
use_mlflow = 'no' # 'yes', 'no'
mlflow_run_to_use = 0
mlflow_class_labels =['TENSOR']
mlflow_predict_syntax = 'predict'

### var to consider in bias assessment
bias_vars = []

### var to consider in partial dependency
pd_var1 = ''
pd_var2 = ''

### create partition column, if not already in dataset
if create_new_partition == 'yes':
    dm_inputdf = shuffle(dm_inputdf)
    dm_inputdf.reset_index(inplace=True, drop=True)
    validate_rows = round(len(dm_inputdf)*dm_partition_validate_perc)
    train_rows = round(len(dm_inputdf)*dm_partition_train_perc) + validate_rows
    ### .loc label slices include both endpoints, so subtract 1 to keep the row ranges from overlapping
    dm_inputdf.loc[0:validate_rows-1,dm_partitionvar] = dm_partition_validate_val
    dm_inputdf.loc[validate_rows:train_rows-1,dm_partitionvar] = dm_partition_train_val
    dm_inputdf.loc[train_rows:,dm_partitionvar] = dm_partition_test_val
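
##### Because label-based `.loc` slices include both endpoints, boundary rows need care when partitions are assigned by row range.  An alternative, shown here only as a minimal sketch, is to build the full list of partition labels first and shuffle the labels instead of the rows.  The helper name `assign_partitions` and the row count of 1000 are illustrative assumptions and are not part of the original notebook.

```python
import random

def assign_partitions(n_rows, fractions, seed=None):
    """Return a shuffled list of partition labels, one per row.

    `fractions` maps a partition code to its share of rows. The last code
    absorbs any rounding remainder so the labels always total n_rows.
    """
    items = list(fractions.items())
    labels = []
    for code, share in items[:-1]:
        labels.extend([code] * round(n_rows * share))
    last_code = items[-1][0]
    labels.extend([last_code] * (n_rows - len(labels)))
    # shuffle the labels so partition membership is random per row
    random.Random(seed).shuffle(labels)
    return labels

# same codes and shares as the parameters cell: 0 = validate, 1 = train, 2 = test
labels = assign_partitions(1000, {0: 0.3, 1: 0.6, 2: 0.1}, seed=42)
counts = {code: labels.count(code) for code in (0, 1, 2)}
print(counts)  # {0: 300, 1: 600, 2: 100}
```

##### With a DataFrame, the returned list could be assigned directly, for example `dm_inputdf[dm_partitionvar] = assign_partitions(len(dm_inputdf), {0: 0.3, 1: 0.6, 2: 0.1})`.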

In [3]:
##############################
### Final Modeling Columns ###
##############################

### create list of model variables
dm_input = list(dm_inputdf.columns.values)
macro_vars = (dm_dec_target + ' ' + dm_partitionvar + ' ' + dm_key).split()
#rejected_predictors = [i for i in dm_input if i not in keep_predictors]
rejected_vars = rejected_predictors + macro_vars #(include macro_vars if rejected_predictors are explicitly listed - not contra keep_predictors)
for i in rejected_vars:
    dm_input.remove(i)
print(dm_input)

### create prediction variables
dm_predictionvar = ['P_' + dm_dec_target + level for level in dm_classtarget_level]
dm_classtarget_intovar = 'I_' + dm_dec_target

##################
### Data Split ###
##################

### create train, test, validate datasets using existing partition column
dm_traindf = dm_inputdf[dm_inputdf[dm_partitionvar] == dm_partition_train_val]
X_train = dm_traindf.loc[:, dm_input]
y_train = dm_traindf[dm_dec_target]
dm_testdf = dm_inputdf.loc[(dm_inputdf[dm_partitionvar] == dm_partition_test_val)]
X_test = dm_testdf.loc[:, dm_input]
y_test = dm_testdf[dm_dec_target]
dm_validdf = dm_inputdf.loc[(dm_inputdf[dm_partitionvar] == dm_partition_validate_val)]
X_valid = dm_validdf.loc[:, dm_input]
y_valid = dm_validdf[dm_dec_target]
fullX = dm_inputdf.loc[:, dm_input]
fully = dm_inputdf[dm_dec_target]

['sf_face_2_face', 'sf_call_outbound', 'sf_call_inbound', 'sf_email_inbound', 'channel_bank', 'channel_wirehouse', 'primary_prod_sold_va', 'sf_email_campaigns', 'advisor_hh_children', 'annuity_mkt_opp', 'advisor_advising_years', 'advisor_aum', 'advisor_annuity_selling_years', 'advisor_age', 'advisor_net_worth', 'advisor_credit_hist_mos', 'advisor_firm_changes', 'advisor_credit_score', 'wholesaler', 'region_ca', 'region_ny', 'region_fl', 'region_tx', 'region_ne', 'region_so', 'region_mw', 'sf_email_responses']
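
##### The list above is what remains after the rejected predictors and the target, partition, and key columns are removed.  Note that `list.remove` raises a `ValueError` when a name is not present in the column list.  A minimal sketch of a more forgiving approach follows; the helper name `select_inputs` and the shortened toy column list are illustrative assumptions, with `region_we` deliberately left out of the toy list to show how a missing name is reported.

```python
def select_inputs(columns, excluded):
    """Return (inputs, missing).

    inputs  : columns not in `excluded`, in their original order
    missing : names in `excluded` that do not appear in `columns`
    """
    excluded_set = set(excluded)
    column_set = set(columns)
    inputs = [c for c in columns if c not in excluded_set]
    missing = [name for name in excluded if name not in column_set]
    return inputs, missing

# toy subset of column names (assumption: not the full table)
columns = ['advisor', 'advisor_event_indicator', 'sf_face_2_face',
           'channel_ria', 'region_ca', 'analytic_partition']
excluded = ['channel_ria', 'region_we',
            'advisor_event_indicator', 'analytic_partition', 'advisor']

inputs, missing = select_inputs(columns, excluded)
print(inputs)   # ['sf_face_2_face', 'region_ca']
print(missing)  # ['region_we']
```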


In [5]:
from time import time

##### This cell trains the model using the open-source scikit-learn (sklearn) library.

In [8]:
##############
### Python ###
##############

from sklearn.ensemble import GradientBoostingClassifier

### estimate & fit model
dm_model = GradientBoostingClassifier()

start = time()
dm_model.fit(X_train, y_train)
finish = time()

print('score_train:', dm_model.score(X_train, y_train))
print('score_test:', dm_model.score(X_test, y_test))
print('score_valid:', dm_model.score(X_valid, y_valid))

time_to_complete = finish-start
print("Time to complete model fit with Python:", time_to_complete)

score_train: 0.9927112570585429
score_test: 0.9858264124914474
score_valid: 0.9838816882685277
Time to complete model fit with Python: 33.37234878540039


##### This cell trains the model with the SAS Viya Workbench Python API available within the sasviya.ml library.  The Python APIs run SAS technology and are not subject to library version conflicts, which helps keep results accurate, consistent, and dependable now and in the future.  The APIs have been reported to be 30x faster than competing platforms and algorithms in third-party studies.  Note the difference in the "Time to complete model..." output from each cell.

In [10]:
##################
### Python API ###
##################

from sasviya.ml.tree import GradientBoostingClassifier

### estimate & fit model
dm_model = GradientBoostingClassifier()

start = time()
dm_model.fit(X_train, y_train)
finish = time()

print('score_train:', dm_model.score(X_train, y_train))
print('score_test:', dm_model.score(X_test, y_test))
print('score_valid:', dm_model.score(X_valid, y_valid))

time_to_complete = finish-start
print("Time to complete model fit with Python API:", time_to_complete)

score_train: 0.9960838260908152
score_test: 0.9867664973772188
score_valid: 0.9848786972416085
Time to complete model fit with Python API: 0.9744877815246582
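
##### The two fit times printed in this notebook can be turned into a single speedup ratio.  The following is a minimal sketch: `timed_fit` and `speedup` are hypothetical helper names, and `perf_counter` is swapped in for `time` because it is a monotonic, high-resolution clock.  Each figure comes from a single run with default hyperparameters, and the defaults of the two libraries are not necessarily the same, so the ratio should be read as indicative only.

```python
from time import perf_counter

def timed_fit(model, X, y):
    """Fit any estimator exposing .fit(X, y) and return elapsed seconds."""
    start = perf_counter()
    model.fit(X, y)
    return perf_counter() - start

def speedup(baseline_seconds, candidate_seconds):
    """How many times faster the candidate ran than the baseline."""
    return baseline_seconds / candidate_seconds

# fit times copied from the printed output of the two training cells
sklearn_seconds = 33.37234878540039
sasviya_seconds = 0.9744877815246582

ratio = speedup(sklearn_seconds, sasviya_seconds)
print(f"speedup: {ratio:.1f}x")  # speedup: 34.2x
```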


In [11]:
dm_model.describe()

Key
58595A685DC188763AA50EC5B297348FFE72C883

Attribute,Value
Analytic Engine,tree-based models
Time Created,24Oct2024:20:50:19

Name,Length,Role,Type,RawType,FormatName
sf_face_2_face,8.0,Input,Interval,Num,
sf_call_outbound,8.0,Input,Interval,Num,
sf_call_inbound,8.0,Input,Interval,Num,
sf_email_inbound,8.0,Input,Interval,Num,
channel_bank,8.0,Input,Interval,Num,
channel_wirehouse,8.0,Input,Interval,Num,
primary_prod_sold_va,8.0,Input,Interval,Num,
sf_email_campaigns,8.0,Input,Interval,Num,
advisor_hh_children,8.0,Input,Interval,Num,
annuity_mkt_opp,8.0,Input,Interval,Num,

Name,Length,Type,Label
P_advisor_event_indicator1,8.0,Num,Predicted: advisor_event_indicator=1
P_advisor_event_indicator0,8.0,Num,Predicted: advisor_event_indicator=0
I_advisor_event_indicator,12.0,Character,Into: advisor_event_indicator
_WARN_,4.0,Character,Warnings
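
##### The output variable names listed by `describe()` follow the same pattern that the notebook uses for `dm_predictionvar` and `dm_classtarget_intovar`: a `P_` prefix plus the target name and level for each probability column, and an `I_` prefix plus the target name for the classification column.  A minimal sketch of that naming rule follows; the helper name `score_column_names` is an illustrative assumption.

```python
def score_column_names(target, levels):
    """Build the probability and 'into' column names for a class target."""
    prob_cols = [f"P_{target}{level}" for level in levels]
    into_col = f"I_{target}"
    return prob_cols, into_col

prob_cols, into_col = score_column_names('advisor_event_indicator', ['0', '1'])
print(prob_cols)  # ['P_advisor_event_indicator0', 'P_advisor_event_indicator1']
print(into_col)   # I_advisor_event_indicator
```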
