## Introduction

**Prism score prediction example:**  
___
In this example we will show how to:
- Set up the required environment for accessing the ecosystem prediction server.
- Set up access to the mongo database.
- Enrich feature stores.
- Build and test a prediction model for prism scores.

## Setup

**Setting up import path:**  
___
Add the path of the ecosystem notebook wrappers to the Python import path.

In [None]:
# Set path for accessing ecosystem python wrappers
import sys
sys.path.append("/path of ecosystem server python wrappers")

**Set up prediction server access:**  
___
Create an access token for the prediction server.
- **url:** URL of the prediction server to access.
- **username:** Username for prediction server.
- **password:** Password for prediction server.

In [1]:
#Access the server
from prediction import jwt_access

url = "http://demo.ecosystem.ai:3001/api"
username = "user@ecosystem.ai"
password = "cd486be3-9955-4364-8ccc-a9ab3ffbc168"

auth = jwt_access.Authenticate(url, username, password)

Login Successful.


**Import required packages:**  
___
Import and load all packages required for the following use case.

In [2]:
#Load packages
import pymongo
from bson.son import SON
import pprint
import pandas as p
import json
import numpy
import operator
import datetime
import time

from prediction.apis import functions as uf
from prediction.apis import data_management_engine as d
from prediction.apis import data_munging_engine as dm
from prediction.apis import worker_h2o as hw
from prediction.apis import prediction_engine as pe

## Mongo Database

**Setting up mongo connection string:**  
___
Create a connection string to allow access to the mongo database.

In [4]:
client = pymongo.MongoClient(
   "mongodb://ecosystem_user:EcoEco321@demo.ecosystem.ai:54445/?authSource=admin"
)
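The connection string above embeds the credentials directly. A minimal sketch of one alternative is to assemble the URI from environment variables and percent-encode the credentials so that characters such as `@` or `:` do not break the URI. The helper `build_mongo_uri` and the variable names `MONGO_USER` and `MONGO_PASSWORD` are hypothetical and not part of the ecosystem wrappers.

```python
import os
from urllib.parse import quote_plus


def build_mongo_uri(user, password, host, port, auth_source="admin"):
    """Return a MongoDB URI with percent-encoded credentials."""
    return "mongodb://{}:{}@{}:{}/?authSource={}".format(
        quote_plus(user), quote_plus(password), host, port, auth_source
    )


# MONGO_USER and MONGO_PASSWORD are hypothetical environment variable names;
# the fallback values are placeholders for illustration only.
uri = build_mongo_uri(
    os.environ.get("MONGO_USER", "example_user"),
    os.environ.get("MONGO_PASSWORD", "p@ss:word"),
    "demo.ecosystem.ai",
    54445,
)
print(uri)
```

The resulting string can be passed to `pymongo.MongoClient` in place of the literal URI.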

**Connect to mongo database:**  
___
Connect to specified mongo database.
- **database:** Name of database to access.

In [5]:
database = "notebook_algorithms"
db = client[database]

**Show mongo collections:**  
___
Show all collections in the specified mongo database.

In [29]:
collections = db.list_collection_names()
print(collections)

['bank_transactions_MAR2019', 'lags_bank_transactions_JAN2019', 'bank', 'test_sample', 'bank_transactions_FEB2019', 'bank_transaction', 'lags_bank_transactions_FEB2019', 'bank_transactions_JAN2019', 'test_sample2']


## Feature Store Enrichment

**List fields:**  
___
List all fields in the specified collection.
- **collection:** Name of the collection. (See list above for available collections.)

In [27]:
collection = "bank_transaction"
list_of_fields = uf.get_list_of_fields(db, collection)
print(list_of_fields)

['effReformatted', 'account_type', 'MCC', 'eff_date', 'customer', 'effYearMonth', 'trns_amt', 'intl_ind', 'trns_type']


**List of feature stores:**  
___
Create list of feature stores to enrich.
- **list_of_fs:** Names of the collections to enrich. (See list above for available collections.)

In [24]:
list_of_fs = [
    "bank_transactions_JAN2019",
    "bank_transactions_FEB2019",
    "bank_transactions_MAR2019"
]
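The lag step that follows pairs each store with the one before it in **list_of_fs**, so the order of the list determines which months are compared. A minimal sketch, assuming the lag is meant to compare consecutive months and that every store name ends in a `MONYYYY` token, sorts the names chronologically before pairing them. The helper `store_month_key` is illustrative only.

```python
MONTHS = ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
          "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"]


def store_month_key(name):
    """Turn a trailing token such as 'FEB2019' into a sortable (year, month) key."""
    suffix = name.rsplit("_", 1)[-1]
    month, year = suffix[:3].upper(), int(suffix[3:])
    return (year, MONTHS.index(month))


stores = [
    "bank_transactions_FEB2019",
    "bank_transactions_JAN2019",
    "bank_transactions_MAR2019",
]

# Sort chronologically, then pair each store with its successor.
ordered = sorted(stores, key=store_month_key)
pairs = list(zip(ordered, ordered[1:]))  # (previous, current) tuples
print(pairs)
```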

**Add lag to feature stores:**  
___
Add a single-step lag to the feature stores listed in **list_of_fs** by joining each store to the one that precedes it in the list.
- **lag_prefix:** Prefix to add to new feature stores created with added lag.

In [None]:
lag_prefix = "lags_"

In [28]:
# ---- Uneditable ----
for j in range(len(list_of_fs)-1):
    print(j)
    current_fs = list_of_fs[j+1]
    previous_fs = list_of_fs[j]
    write_fs = lag_prefix + list_of_fs[j]
    ratio_pipeline = [
                        {
                        "$lookup":{
                                "from":previous_fs
                                ,"localField":"_id"
                                ,"foreignField":"_id"
                                ,"as":"subs"
                                }
                        }
                        ,{"$unwind":"$subs"}
                        ,{
                        "$addFields":{
                                    }
                        }
                        ,{"$unset":"subs"}
                        ,{"$out":write_fs}
                    ]
    
    for i in list_of_fields:
        add_field = i+"Ratio"
        add_field_appear = i+"Appear"
        current_value = "$"+ i
        previous_value = "$subs." + i
        ratio_pipeline[2]["$addFields"][add_field]={"$switch":{"branches":[
                         {"case":{"$and":[{"$ne":[{"$type":current_value}, "missing"]},{"$ne":[{"$type":previous_value}, "missing"]}]}, "then":{"$divide":[current_value,previous_value]}}
                        ,{"case":{"$and":[{"$ne":[{"$type":current_value}, "missing"]},{"$eq":[{"$type":previous_value}, "missing"]}]}, "then":0}
                        ], "default":None}}
        ratio_pipeline[2]["$addFields"][add_field_appear]={"$cond":[{"$and":[{"$eq":[{"$type":current_value}, "missing"]},{"$ne":[{"$type":previous_value}, "missing"]}]},1,0]}
    
    db[current_fs].aggregate(ratio_pipeline)
# ---- Uneditable ----

0
1
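
The `$switch` and `$cond` expressions in the pipeline above can be hard to read. A minimal pure-Python sketch of the same per-field rules follows, using plain dicts in place of MongoDB documents and treating an absent key as a missing field. The names `ratio_and_appear` and `MISSING` are illustrative and not part of the ecosystem wrappers.

```python
MISSING = object()  # sentinel standing in for MongoDB's "missing" type


def ratio_and_appear(current_doc, previous_doc, field):
    """Mirror the <field>Ratio and <field>Appear rules for one field."""
    cur = current_doc.get(field, MISSING)
    prev = previous_doc.get(field, MISSING)

    if cur is not MISSING and prev is not MISSING:
        ratio = cur / prev      # both present: current divided by previous
    elif cur is not MISSING:
        ratio = 0               # present now, missing before
    else:
        ratio = None            # default branch of the $switch

    # The flag is 1 only when the field is missing now but was present before.
    appear = 1 if (cur is MISSING and prev is not MISSING) else 0
    return ratio, appear


print(ratio_and_appear({"trns_amt": 50}, {"trns_amt": 100}, "trns_amt"))
print(ratio_and_appear({"trns_amt": 50}, {}, "trns_amt"))
print(ratio_and_appear({}, {"trns_amt": 100}, "trns_amt"))
```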


**Add behaviour change indicator:**  
___
Add a behaviour change indicator showing the number of categories appearing or disappearing.
- **prism_lag_prefix:** Prefix to add to new feature stores created with added behaviour change indicators.

In [None]:
prism_lag_prefix = "prism_lags"

In [None]:
# ---- Uneditable ----
for j in range(len(list_of_fs)-1):
    # "appear" counts fields whose <field>Appear flag is 1;
    # "disappear" counts fields whose <field>Ratio is 0.
    add_dict = {"$addFields":
                       {
                        "appear":{"$add":[]}
                        ,"disappear":{"$add":[]}
                       }
           }
    for i in list_of_fields:
        field_disappear = "$"+i+"Ratio"
        field_appear = "$"+i+"Appear"
        add_dict["$addFields"]["appear"]["$add"].append({"$cond":[{"$eq":[field_appear,1]},1,0]})
        add_dict["$addFields"]["disappear"]["$add"].append({"$cond":[{"$eq":[field_disappear,0]},1,0]})

    # Read from the lagged store created in the previous step (it holds the
    # Ratio and Appear fields) and write the result to a new store.
    read_fs = lag_prefix + list_of_fs[j]
    write_fs = prism_lag_prefix + list_of_fs[j]
    print(write_fs)
    behav_change_pipeline = [
        add_dict
        ,{"$out":write_fs}
    ]

    db[read_fs].aggregate(behav_change_pipeline)
# ---- Uneditable ----
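
The two counters built by the `$addFields` stage above can be sketched in plain Python for a single document. This is an illustration of the counting rule only, with dicts standing in for MongoDB documents; the function name `behaviour_change_counts` and the sample values are hypothetical.

```python
def behaviour_change_counts(doc, fields):
    """Count fields whose Appear flag is 1 and fields whose Ratio is 0."""
    appear = sum(1 for f in fields if doc.get(f + "Appear") == 1)
    disappear = sum(1 for f in fields if doc.get(f + "Ratio") == 0)
    return {"appear": appear, "disappear": disappear}


# Illustrative document carrying Ratio/Appear fields for three base fields.
sample_doc = {
    "MCCRatio": 0, "MCCAppear": 0,
    "trns_amtRatio": 1.2, "trns_amtAppear": 0,
    "intl_indRatio": None, "intl_indAppear": 1,
}

counts = behaviour_change_counts(sample_doc, ["MCC", "trns_amt", "intl_ind"])
print(counts)
```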

## Build model

Build an H2O model using the ecosystem.ai packages.

**Export training data:**  
___
Export the training data for the prediction model, then read the exported file into a data frame.
- **fs:** Name for training data feature store.
- **db_name:** Name of database to access.
- **record_count:** Number of records to export.

In [3]:
fs = "prism_data"
db_name = "notebook_algorithms"
record_count = 75000

In [4]:
# ---- Uneditable ----
export_store = fs
export_store_file = export_store + ".csv"
# Fetch a small sample to discover the column names of the feature store
example_data = d.get_data(auth, db_name, fs, "{}", 10, "{}", 0)
example_data_frame = p.DataFrame(example_data)
listOfColumnNames = list(example_data_frame.columns)
# Comma-separated list of every column, used as the export projection
export_projection = ",".join(listOfColumnNames)
# ---- Uneditable ----

get /getMongoDBFind?database=notebook_algorithms&collection=prism_data&field={}&limit=10&projections={}&skip=0&


In [31]:
# ---- Uneditable ----
d.export_documents(auth, export_store, "csv", db_name, export_store, "{}", "{}", export_projection, record_count)
parsed = hw.file_to_frame(auth, export_store_file, 1, "comma")
# ---- Uneditable ----

get /getMongoDBFind?database=notebook_algorithms&collection=prism_data&field={}&limit=10&projections={}&skip=0&
get /exportMongoDocuments?file_name=prism_data&file_type=csv&database=notebook_algorithms&collection=prism_data&field={}&sort={}&projection=NO_ZFN_ACCT,Experiential,15427Spend,15712Spend,WedSpend,Introvert,5541Frequency,CUST_CREDIT_LIMIT,DiscretionaryFrequency,CIF_ADDR_VERIFY,CUST_TOT_DR_BAL,NO_POST_ADDR,15309Frequency,RGN_CDE,15314Spend,Enthusiastic,ExtrovertFrequency,MKT_POST,15309Spend,CashSpend,EssentialSpend,15749Frequency,KYC_IND,CUST_AGE,badIndicatorBrs,15466Frequency,GRAD_IND,LANG_CDE,ID_ISSUER,SAL_IND,RSK_RSN_CDE,TOT_NO_SUBPROD,SCHEME_IND,BOND_IND,6919Spend,PRIM_BUS,SIC_CDE,CUST_CNTCT_TEL_NO,TURNOVER,15314Frequency,15427Frequency,FriSpend,PRI_SEG,HIGH_EDU_LVL,GRAD_TYPE,4820Spend,NO_BANK_SERV,15616Frequency,15318Spend,NO_DDA_ACCT,DEBT_COUNSEL_IND,CNTRY_ESTBLSHMNT,15696Spend,IndustriousSpend,Intentional,6544Frequency,4820Frequency,15466Spend,15303Spend,ACCT_LINK_IND,CU

In [5]:
# ---- Uneditable ----
hexframename = uf.save_userframe(auth, fs, username)
print(hexframename)
# ---- Uneditable ----

get /processFileToFrameImport?file_name=prism_data.csv&first_row_column_names=1&separator=,&
delete /deleteFrame?frame=prism_data.hex&
post /saveUserFrame
	{'timestamp_parsed': '2021-07-22T08:50:37.881170', 'parser': 'csv', 'import': '{"import": {"path": "file:///data/prism_data.csv", "files": ["/data/prism_data.csv"], "destination_frames": ["nfs://data/prism_data.csv"], "fails": [], "dels": [], "_exclude_fields": ""}, "parseSetup": {"source_frames": [{"name": "nfs://data/prism_data.csv", "type": "Key<Frame>", "URL": "/3/Frames/nfs://data/prism_data.csv"}], "parse_type": "CSV", "separator": 44, "single_quotes": false, "check_header": 1, "column_names": ["NO_ZFN_ACCT", "Experiential", "15427Spend", "15712Spend", "WedSpend", "Introvert", "5541Frequency", "CUST_CREDIT_LIMIT", "DiscretionaryFrequency", "CIF_ADDR_VERIFY", "CUST_TOT_DR_BAL", "NO_POST_ADDR", "15309Frequency", "RGN_CDE", "15314Spend", "Enthusiastic", "ExtrovertFrequency", "MKT_POST", "15309Spend", "CashSpend", "EssentialSpend"

post /processToFrameParse
	{'timestamp_parsed': '2021-07-22T08:50:37.881170', 'parser': 'csv', 'import': '{"import": {"path": "file:///data/prism_data.csv", "files": ["/data/prism_data.csv"], "destination_frames": ["nfs://data/prism_data.csv"], "fails": [], "dels": [], "_exclude_fields": ""}, "parseSetup": {"source_frames": [{"name": "nfs://data/prism_data.csv", "type": "Key<Frame>", "URL": "/3/Frames/nfs://data/prism_data.csv"}], "parse_type": "CSV", "separator": 44, "single_quotes": false, "check_header": 1, "column_names": ["NO_ZFN_ACCT", "Experiential", "15427Spend", "15712Spend", "WedSpend", "Introvert", "5541Frequency", "CUST_CREDIT_LIMIT", "DiscretionaryFrequency", "CIF_ADDR_VERIFY", "CUST_TOT_DR_BAL", "NO_POST_ADDR", "15309Frequency", "RGN_CDE", "15314Spend", "Enthusiastic", "ExtrovertFrequency", "MKT_POST", "15309Spend", "CashSpend", "EssentialSpend", "15749Frequency", "KYC_IND", "CUST_AGE", "badIndicatorBrs", "15466Frequency", "GRAD_IND", "LANG_CDE", "ID_ISSUER", "SAL_IND", "

In [7]:
# Re-import the exported file and inspect the returned frame description
user_frame = hw.file_to_frame(auth, "prism_data.csv", 1, ",")
print(type(user_frame))
print(user_frame)
print(user_frame.keys())

get /processFileToFrameImport?file_name=prism_data.csv&first_row_column_names=1&separator=,&
<class 'dict'>
{'import': {'path': 'file:///data/prism_data.csv', 'files': ['/data/prism_data.csv'], 'destination_frames': ['nfs://data/prism_data.csv'], 'fails': [], 'dels': ['nfs://data/prism_data.csv'], '_exclude_fields': ''}, 'parseSetup': {'source_frames': [{'name': 'nfs://data/prism_data.csv', 'type': 'Key<Frame>', 'URL': '/3/Frames/nfs://data/prism_data.csv'}], 'parse_type': 'CSV', 'separator': 44, 'single_quotes': False, 'check_header': 1, 'column_names': ['NO_ZFN_ACCT', 'Experiential', '15427Spend', '15712Spend', 'WedSpend', 'Introvert', '5541Frequency', 'CUST_CREDIT_LIMIT', 'DiscretionaryFrequency', 'CIF_ADDR_VERIFY', 'CUST_TOT_DR_BAL', 'NO_POST_ADDR', '15309Frequency', 'RGN_CDE', '15314Spend', 'Enthusiastic', 'ExtrovertFrequency', 'MKT_POST', '15309Spend', 'CashSpend', 'EssentialSpend', '15749Frequency', 'KYC_IND', 'CUST_AGE', 'badIndicatorBrs', '15466Frequency', 'GRAD_IND', 'LANG_CD

In [25]:
user_frame = hw.file_to_frame(auth, "prism_data.csv", 1, ",")
uf.save_userframe(auth, user_frame)
print(user_frame)

get /processFileToFrameImport?file_name=prism_data.csv&first_row_column_names=1&separator=,&
{'import': {'path': 'file:///data/prism_data.csv', 'files': ['/data/prism_data.csv'], 'destination_frames': ['nfs://data/prism_data.csv'], 'fails': [], 'dels': ['nfs://data/prism_data.csv'], '_exclude_fields': ''}, 'parseSetup': {'source_frames': [{'name': 'nfs://data/prism_data.csv', 'type': 'Key<Frame>', 'URL': '/3/Frames/nfs://data/prism_data.csv'}], 'parse_type': 'CSV', 'separator': 44, 'single_quotes': False, 'check_header': 1, 'column_names': ['NO_ZFN_ACCT', 'Experiential', '15427Spend', '15712Spend', 'WedSpend', 'Introvert', '5541Frequency', 'CUST_CREDIT_LIMIT', 'DiscretionaryFrequency', 'CIF_ADDR_VERIFY', 'CUST_TOT_DR_BAL', 'NO_POST_ADDR', '15309Frequency', 'RGN_CDE', '15314Spend', 'Enthusiastic', 'ExtrovertFrequency', 'MKT_POST', '15309Spend', 'CashSpend', 'EssentialSpend', '15749Frequency', 'KYC_IND', 'CUST_AGE', 'badIndicatorBrs', '15466Frequency', 'GRAD_IND', 'LANG_CDE', 'ID_I

**Train Model:**
___
Set training parameters for model and train.
- **predict_id:** Id for the prediction (for logging). 
- **description:** Description of model (for logging).
- **model_id:** Id for the model (for logging).
- **model_type:** Type of model to build (for logging). 
- **frame_name:** Name of frame used (for logging).
- **frame_name_desc:** Description of frame used (for logging).
- **model_purpose:** Purpose of model (for logging).
- **version:** Model version (for logging).

The following parameters depend on the algorithm selected in the **algo** parameter.

- **algo:** Algorithm used to train the model. (Available algorithms: "H2O-AUTOML")
- **training_frame:** Data frame to use for training the model.
- **validation_frame:** Data frame to use for validating the model.
- **max_models:** Maximum number of models to build.
- **stopping_tolerance:** In H2O AutoML, the relative tolerance for the metric-based stopping criterion.
- **max_runtime_secs:** Maximum number of seconds to spend on training.
- **stopping_rounds:** In H2O AutoML, the number of scoring rounds without improvement in the stopping metric before training stops early.
- **stopping_metric:** Metric used for early stopping ("AUTO" lets H2O choose).
- **nfolds:** Number of cross-validation folds. Setting it to 0 disables the stacked ensemble creation process.
- **response_column:** The column or field in the dataset to predict.
- **ignored_columns:** List of columns to exclude from model training.
- **hidden:** (TODO)
- **exclude_algos:** Algorithms to exclude from the AutoML run.

In [7]:
version = "1.0"
model_id = fs + version
model_purpose = "Prediction of whether nonbehavioural prism model is correct"
description = "Automated features store generated for " + fs
model_params = { 
        "predict_id": fs,
        "description": description,
        "model_id": model_id,
        "model_type": "AUTOML",
        "frame_name": hexframename,
        "frame_name_desc": description,
        "model_purpose": model_purpose,
        "version": version,
        "model_parms": {
              "algo": "H2O-AUTOML",
              "training_frame": hexframename,
              "validation_frame": hexframename,
              "max_models": 10,
              "stopping_tolerance": 0.005,
              "note_stop": "stopping_tolerance of 0.001 for 1m rows and 0.004 for 100k rows",
              "max_runtime_secs": 3600,
              "stopping_rounds": 15,
              "stopping_metric": "AUTO",
              "nfolds": 4,
              "note_folds": "nfolds=0 will disable the stacked ensemble creation process",
              "response_column": "prismResponse",
              "ignored_columns": [            
                  "prismResponse",
                  "other columns in feature store you don't want to be included in the model"
              ],
              "hidden": [
                "1"
              ],
              "exclude_algos": [
                "GLM",
                "StackedEnsemble",
                "XGBoost",
                "DeepLearning",
                "GBM",
                "Any algorithms that you don't want to be included in the automl run"
              ]
            }
    }
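
The **ignored_columns** and **exclude_algos** lists above contain descriptive placeholder strings next to real values. A minimal sketch of one way to drop such placeholders before training is to keep only entries that match a known set. The helper `keep_known` is hypothetical, the algorithm names are assumed to be the ones H2O AutoML accepts in `exclude_algos`, and `frame_columns` is an illustrative subset of the exported columns.

```python
def keep_known(values, known):
    """Return the entries of `values` that appear in `known`, preserving order."""
    known = set(known)
    return [v for v in values if v in known]


# Assumption: algorithm names accepted by H2O AutoML's exclude_algos option.
H2O_AUTOML_ALGOS = ["DRF", "GLM", "XGBoost", "GBM", "DeepLearning", "StackedEnsemble"]

# Illustrative subset of frame columns; in practice use the real column list.
frame_columns = ["prismResponse", "CUST_AGE", "TURNOVER"]

exclude_algos = keep_known(
    ["GLM", "StackedEnsemble", "XGBoost", "DeepLearning", "GBM",
     "Any algorithms that you don't want to be included in the automl run"],
    H2O_AUTOML_ALGOS,
)
ignored_columns = keep_known(
    ["prismResponse",
     "other columns in feature store you don't want to be included in the model"],
    frame_columns,
)
print(exclude_algos)
print(ignored_columns)
```

The cleaned lists could then be assigned back into `model_params["model_parms"]` before the training call.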


In [8]:
# ---- Uneditable ----
hw.train_model(auth, model_id, "automl", json.dumps(model_params["model_parms"]))
# ---- Uneditable ----

get /buildModel?model_id=prism_data1.0&model_type=automl&model_parms={"algo": "H2O-AUTOML", "training_frame": "prism_data.hex", "validation_frame": "prism_data.hex", "max_models": 10, "stopping_tolerance": 0.005, "note_stop": "stopping_tolerance of 0.001 for 1m rows and 0.004 for 100k rows", "max_runtime_secs": 3600, "stopping_rounds": 15, "stopping_metric": "AUTO", "nfolds": 4, "note_folds": "nfolds=0 will disable the stacked ensemble creation process", "response_column": "prismResponse", "ignored_columns": ["prismResponse", "other columns in feature store you don't want to be included in the model"], "hidden": ["1"], "exclude_algos": ["GLM", "StackedEnsemble", "XGBoost", "DeepLearning", "GBM", "Any algorithms that you don't want to be included in the automl run"]}&


<Response [200]>

In [None]:
# Save the model
# Replace the placeholder strings and user ids below with your own values.
h2o_name = "name of the best model"
zip_name = h2o_name + ".zip"
hw.download_model_mojo(auth, h2o_name)
high_level_mojo = hw.get_train_model(auth, h2o_name, "eric")
model_to_save = high_level_mojo["models"][0]
model_to_save["model_identity"] = h2o_name
model_to_save["userid"] = "jayvanzyl"
model_to_save["timestamp"] = "time_stamp"
pe.save_model(auth, model_to_save)

# See some statistics from the saved model
pe.get_user_model(auth, h2o_name)
stats = hw.get_model_stats(auth, h2o_name, "ecosystem", "variable_importances")