# Import and Initialise Explorer
Import the Explorer SDK, and initialise it as `e`.

In [2]:
from explorer import Explorer
e = Explorer()

2024-07-16 15:26:42 [INFO] No repo provided. Will be using algo_explorer_dev as playground repo.


# Create a branch based on `template/binary/lightgbm`
1. Start using Explorer by calling `e.start()`, which creates a branch from the lightgbm binary template.
2. At this step, you will be prompted to set some configs. Set the objective to `binary` and the framework to `lightgbm`.
3. Answer the remaining prompts accordingly, but leave `UNIQUE_ID`, `TIMESTAMP`, `TARGET`, `FEATURES` and `CATEGORICAL_FEATURES` empty for now.

In [5]:
e.start(branch="UAT4/dchin")

2024-07-16 15:27:23 [INFO] Branch UAT4/dchin already exists. Switching to branch.


You should be able to see that the `algo_explorer_dev` repository is imported locally, and the right branch is checked out. The answered prompts are stored in `algo_explorer_dev/explorer_config.json`. 

# Data loading
1. Load your data by editing the `algo_explorer_dev` repo directly. Navigate to `algo_explorer_dev/src/data/loader.py` to tell Explorer how to load your dataset.
2. Edit the `load()` method of the `DataLoader` class to load the data. This data will be used to train the model and to test it during model validation.
3. In this case, we load from `sample_dataset.parquet`. Use pandas to read the parquet file from its full path.
4. Once this is done, inspect the data:

In [6]:
df = e.load()
df.head(3)

Unnamed: 0,category,primary_account_type,is_payfreq_undefined,is_digital_bank,txn_count,outflow_moving_average_std_28d,inflow100_size_std_14d,inflow100_timegap_std,days_from_last_inflow100,outflow150below_count_56d,paid_off_dpd14,revenue,ID,timestamp
0,_missing,checking,False,True,86.0,60.730772,0.0,0.236277,172.0,2,1,5.636913,e36ebdd7c7fa443cb1358faa888c03e4,2023-09-22 20:00:00
1,Service Photography,checking,False,False,2.0,562.084887,0.0,4.471382,26.0,6,1,18.928834,f37f9aade48e4ec5abd6eb4228199e93,2023-09-17 15:00:00
2,Gas stations/Fuel,checking,False,True,38.0,625.148381,391.222382,5.401739,0.0,36,0,0.0,ebd1071e8d1c47e69bd187718c9d6ff9,2023-08-08 21:00:00


In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
 #   Column                          Non-Null Count  Dtype         
---  ------                          --------------  -----         
 0   category                        10000 non-null  object        
 1   primary_account_type            10000 non-null  object        
 2   is_payfreq_undefined            10000 non-null  bool          
 3   is_digital_bank                 10000 non-null  bool          
 4   txn_count                       9866 non-null   float64       
 5   outflow_moving_average_std_28d  9995 non-null   float64       
 6   inflow100_size_std_14d          7370 non-null   float64       
 7   inflow100_timegap_std           9242 non-null   float64       
 8   days_from_last_inflow100        9696 non-null   float64       
 9   outflow150below_count_56d       10000 non-null  int64         
 10  paid_off_dpd14                  10000 non-null  int64         
 11  rev

# Preliminary choice of features and target
1. After inspection, select the appropriate features, categorical features and target from your data, and update the config at `algo_explorer_dev/explorer_config.json`.
2. You can update this again later if you decide to use a different set of features.
3. In this case, `paid_off_dpd14` is set as the target, and all other columns except `ID` and `timestamp` are chosen as features.
4. Remember to set the `UNIQUE_ID` and `TIMESTAMP` values to the respective column names.
5. To see the data filtered down to the feature and target columns, try:

In [8]:
data_loader = e.data_loader()
df.loc[:, data_loader.features + [data_loader.target]].head(3)

Unnamed: 0,category,primary_account_type,is_payfreq_undefined,is_digital_bank,txn_count,outflow_moving_average_std_28d,inflow100_size_std_14d,inflow100_timegap_std,days_from_last_inflow100,outflow150below_count_56d,revenue,paid_off_dpd14
0,_missing,checking,False,True,86.0,60.730772,0.0,0.236277,172.0,2,5.636913,1
1,Service Photography,checking,False,False,2.0,562.084887,0.0,4.471382,26.0,6,18.928834,1
2,Gas stations/Fuel,checking,False,True,38.0,625.148381,391.222382,5.401739,0.0,36,0.0,0


# Preprocessing logic for data
1. To implement your preprocessing logic, go to `algo_explorer_dev/src/preprocess.py` and add it to the `online_preprocess` and `offline_preprocess` functions.
2. The logic in `online_preprocess` is called during both training and inference, while the logic in `offline_preprocess` is called only during training.
3. In this case, the only preprocessing step converts the `category` and `primary_account_type` columns to `dtype='category'`.
4. To inspect this change, run the following:

In [6]:
e.offline_preprocess(df).info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
 #   Column                          Non-Null Count  Dtype         
---  ------                          --------------  -----         
 0   category                        10000 non-null  category      
 1   primary_account_type            10000 non-null  category      
 2   is_payfreq_undefined            10000 non-null  bool          
 3   is_digital_bank                 10000 non-null  bool          
 4   txn_count                       9866 non-null   float64       
 5   outflow_moving_average_std_28d  9995 non-null   float64       
 6   inflow100_size_std_14d          7370 non-null   float64       
 7   inflow100_timegap_std           9242 non-null   float64       
 8   days_from_last_inflow100        9696 non-null   float64       
 9   outflow150below_count_56d       10000 non-null  int64         
 10  paid_off_dpd14                  10000 non-null  int64         
 11  rev

The output shows that the first two columns are converted to the `category` datatype. Note that we are not storing the preprocessed dataframe in a variable yet, because only the train and test splits will be preprocessed for training and validation.
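
As a rough illustration of the conversion step, here is a minimal sketch using plain pandas. The helper name `convert_to_category` and the `CATEGORICAL_COLUMNS` constant are hypothetical; which of the two template functions holds this logic, and their exact signatures, are not shown in this guide.

```python
import pandas as pd

# Hypothetical constant: the two columns converted in this walkthrough.
CATEGORICAL_COLUMNS = ["category", "primary_account_type"]


def convert_to_category(df: pd.DataFrame, columns=CATEGORICAL_COLUMNS) -> pd.DataFrame:
    """Return a copy of df with the given columns cast to the category dtype."""
    out = df.copy()  # avoid mutating the caller's dataframe
    for col in columns:
        if col in out.columns:
            out[col] = out[col].astype("category")
    return out


# Tiny hand-made sample, not the real dataset.
sample = pd.DataFrame(
    {
        "category": ["_missing", "Groceries"],
        "primary_account_type": ["checking", "checking"],
        "txn_count": [86.0, 2.0],
    }
)
processed = convert_to_category(sample)
print(processed.dtypes)
```

Returning a copy keeps the raw dataframe untouched, which matches the walkthrough's habit of keeping `df` unprocessed until the split is done.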

# Train-test split
1. Define how the data is split by navigating to `algo_explorer_dev/src/data/loader.py` and editing the logic under `def train_test_split()`.
2. Inspect the size of the training and testing data in your notebook:

In [9]:
train_df, test_df = e.train_test_split()
train_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7000 entries, 6252 to 6219
Data columns (total 14 columns):
 #   Column                          Non-Null Count  Dtype         
---  ------                          --------------  -----         
 0   category                        7000 non-null   object        
 1   primary_account_type            7000 non-null   object        
 2   is_payfreq_undefined            7000 non-null   bool          
 3   is_digital_bank                 7000 non-null   bool          
 4   txn_count                       6897 non-null   float64       
 5   outflow_moving_average_std_28d  6997 non-null   float64       
 6   inflow100_size_std_14d          5132 non-null   float64       
 7   inflow100_timegap_std           6463 non-null   float64       
 8   days_from_last_inflow100        6794 non-null   float64       
 9   outflow150below_count_56d       7000 non-null   int64         
 10  paid_off_dpd14                  7000 non-null   int64         
 11  r

After checking the train and test data, apply the preprocessing logic to them and inspect the data again:

In [10]:
train_df_preprocessed = e.offline_preprocess(train_df)
test_df_preprocessed = e.offline_preprocess(test_df)
train_df_preprocessed.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7000 entries, 6252 to 6219
Data columns (total 14 columns):
 #   Column                          Non-Null Count  Dtype         
---  ------                          --------------  -----         
 0   category                        7000 non-null   category      
 1   primary_account_type            7000 non-null   category      
 2   is_payfreq_undefined            7000 non-null   bool          
 3   is_digital_bank                 7000 non-null   bool          
 4   txn_count                       6897 non-null   float64       
 5   outflow_moving_average_std_28d  6997 non-null   float64       
 6   inflow100_size_std_14d          5132 non-null   float64       
 7   inflow100_timegap_std           6463 non-null   float64       
 8   days_from_last_inflow100        6794 non-null   float64       
 9   outflow150below_count_56d       7000 non-null   int64         
 10  paid_off_dpd14                  7000 non-null   int64         
 11  r
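
The 7000-row training set with a shuffled index is consistent with a random 70/30 split. A minimal sketch of such a split with plain pandas follows. The function name `random_split` and its `train_frac` and `seed` parameters are hypothetical, and the template's `train_test_split()` may well be implemented differently (for example, split by time).

```python
import pandas as pd


def random_split(df: pd.DataFrame, train_frac: float = 0.7, seed: int = 42):
    """Shuffle rows and split them into train and test dataframes.

    Assumes the index of df is unique, so dropping by index is safe.
    """
    train_df = df.sample(frac=train_frac, random_state=seed)
    test_df = df.drop(index=train_df.index)
    return train_df, test_df


# Toy data standing in for the real dataset.
toy = pd.DataFrame({"x": range(10), "paid_off_dpd14": [0, 1] * 5})
toy_train, toy_test = random_split(toy)
print(len(toy_train), len(toy_test))
```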

# Finalise choice of features and target
1. After inspecting the data and gaining a better intuition for it, finalise the choice of features and target to be used for training the model.
2. Go back to `algo_explorer_dev/explorer_config.json` and update it with the latest choice of features, categorical features and target.
3. In this case, we are sticking with our initial choice of features and target.
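
For reference, the choices made in this walkthrough could be written out as below. This is only a sketch: the key names are taken from the prompts mentioned earlier, but the full schema of `explorer_config.json` is not shown in this guide, so treat the structure as illustrative. Listing `category` and `primary_account_type` as the categorical features is also an assumption, based on the preprocessing step.

```python
import json

# Column names as they appear in the sample dataset.
all_columns = [
    "category", "primary_account_type", "is_payfreq_undefined",
    "is_digital_bank", "txn_count", "outflow_moving_average_std_28d",
    "inflow100_size_std_14d", "inflow100_timegap_std",
    "days_from_last_inflow100", "outflow150below_count_56d",
    "paid_off_dpd14", "revenue", "ID", "timestamp",
]

config = {
    "UNIQUE_ID": "ID",
    "TIMESTAMP": "timestamp",
    "TARGET": "paid_off_dpd14",
}

# Features are every column except the ID, timestamp and target.
excluded = {config["UNIQUE_ID"], config["TIMESTAMP"], config["TARGET"]}
config["FEATURES"] = [c for c in all_columns if c not in excluded]
config["CATEGORICAL_FEATURES"] = ["category", "primary_account_type"]  # assumption

print(json.dumps(config, indent=2))
```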

# Data validation
1. Once the data has gone through train-test splitting and preprocessing, pass it to the data validation function for checks before training.
2. Go to `algo_explorer_dev/src/data/validation` and implement your data validation checks.
3. In this case, uncomment the `check_intra_group_leakage`, `check_inter_group_leakage` and `check_label_distribution` checks.
4. Test this validation implementation:

In [9]:
data_validation_output = e.validate_data(df, train_df_preprocessed, test_df_preprocessed)
data_validation_output["status"]

True
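
What the three template checks do internally is not shown here. As an illustration of the kind of check involved, the sketch below tests whether any ID appears in both splits and compares the positive-label rate between them. The helper names `has_id_overlap` and `label_rate_gap` are hypothetical and are not part of the Explorer SDK.

```python
import pandas as pd


def has_id_overlap(train_df: pd.DataFrame, test_df: pd.DataFrame, id_col: str = "ID") -> bool:
    """Return True if any ID value occurs in both dataframes."""
    return not set(train_df[id_col]).isdisjoint(test_df[id_col])


def label_rate_gap(train_df: pd.DataFrame, test_df: pd.DataFrame,
                   target: str = "paid_off_dpd14") -> float:
    """Absolute difference in the mean of a 0/1 target between the splits."""
    return float(abs(train_df[target].mean() - test_df[target].mean()))


# Toy splits, not the real data.
toy_train = pd.DataFrame({"ID": ["a", "b", "c", "d"], "paid_off_dpd14": [1, 0, 1, 1]})
toy_test = pd.DataFrame({"ID": ["e", "f"], "paid_off_dpd14": [1, 0]})

print(has_id_overlap(toy_train, toy_test))
print(label_rate_gap(toy_train, toy_test))
```

A large gap in label rate, or any shared ID, would be a reason to revisit the split before training.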

# Model Training
1. Once the data is validated and ready for training, go to `algo_explorer_dev/src/model/trainer.py` and implement your model training logic under `def train()`.
2. In this case, the implementation is already written in the template, which uses `lightgbm` to train a binary classification model.
3. To train the model, run:

In [None]:
model = e.train(train_df_preprocessed)

# Model Validation
1. To validate the model, implement your model validation and metrics under `algo_explorer_dev/src/model/validation`.
2. In this case, the implementation has already been done in the template.
3. To see it in action, first run inference on the train and test data using the trained model object:

In [11]:
X_train, y_train = train_df_preprocessed[data_loader.features], train_df_preprocessed[data_loader.target]
X_test, y_test = test_df_preprocessed[data_loader.features], test_df_preprocessed[data_loader.target]
y_hat_train = model.predict(X_train)
y_hat_test = model.predict(X_test)

4. After running inference, run the model validation function using the SDK and inspect the output. For example, if the metrics are stored under the `metrics` key, run:

In [12]:
output = e.model_metric_validation(y_train, y_hat_train, y_test, y_hat_test)
for metric, value in output["metrics"].items():
    print(metric, value)

test accuracy 0.841
test average_precision 0.9606698475440976
train accuracy 0.887
train average_precision 0.9700203602982358


5. You can also view the ROC curve, precision-recall curve and confusion matrix plots, with a slider for the threshold. Run the following in your notebook:

In [32]:
output["figures"]["model_metrics.html"].show()
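
For reference, the two reported metrics can be computed by hand. The sketch below is a plain-Python illustration, not the template's implementation. It assumes the predictions are scores in [0, 1], uses a hypothetical 0.5 threshold for accuracy, and computes average precision as the mean of the precision values at each positive example when ranked by score (ties are not handled specially).

```python
def accuracy(y_true, y_score, threshold=0.5):
    """Fraction of rows where the thresholded score matches the label."""
    y_pred = [1 if s >= threshold else 0 for s in y_score]
    correct = sum(int(p == t) for p, t in zip(y_pred, y_true))
    return correct / len(y_true)


def average_precision(y_true, y_score):
    """Mean of precision@k taken at the rank k of every positive example."""
    ranked = sorted(zip(y_score, y_true), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precisions = []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


# Toy labels and scores, not the model's real output.
toy_true = [1, 0, 1, 1]
toy_score = [0.9, 0.8, 0.7, 0.1]
print(accuracy(toy_true, toy_score))
print(average_precision(toy_true, toy_score))
```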

# Run pipeline
1. Once all the testing is done and the model is relatively stable, use the `run_pipeline` function to log your data, model and metrics to mlflow.
2. This function comprises data validation, data preprocessing, model training, hyperparameter tuning and model validation.
3. This step logs the training metrics and model metrics to mlflow and stores a snapshot of the unprocessed data.
4. Run the following in your notebook:

In [None]:
e.run_pipeline(run_name="dev/dchin-run", hyperparameter_tuning=True)

5. This run is now registered in mlflow. Head to [mlflow](https://staging-mlflow.moneylion.io) and look for the experiment that shares its name with the git branch. In this case it's under `algo_explorer_dev/UAT4/dchin`. The relevant run is `dev/dchin-run`. Click into it and explore the metrics and artifacts stored.

# Inference
1. To use the trained model logged on mlflow, edit `algo_explorer_dev/src/constants.py` and set `MLFLOW_MODEL_URI` to the model URI from the run above. The URI will look like `runs:/<RUN_ID>/model`.
2. Head to `algo_explorer_dev/src/data/preprocessing.py` and inspect the `online_preprocessing` method, as it will be used to preprocess unseen data before running inference.
3. Navigate to `algo_explorer_dev/src/model/postprocess.py` and implement your model output postprocessing logic there. This determines what the output of the predict method looks like.
4. Head to `algo_explorer_dev/src/model/predictor.py` and implement the sequence of preprocessing, inference and postprocessing under the `predict()` method.
5. In this case, the output of `predict()` is a dataframe with the feature, prediction and label columns.
6. Load the unseen data, which is stored in `unseen_data.json`:

In [29]:
import pandas as pd
import json
with open('unseen_data.json', 'r') as f:
    unseen_dict = json.load(f)
unseen_df = pd.DataFrame([unseen_dict])

7. Run inference on the unseen data.

In [30]:
e.predict(unseen_df)

Unnamed: 0,category,primary_account_type,is_payfreq_undefined,is_digital_bank,txn_count,outflow_moving_average_std_28d,inflow100_size_std_14d,inflow100_timegap_std,days_from_last_inflow100,outflow150below_count_56d,revenue,predictions,paid_off_dpd14
0,Groceries,checking,False,True,86.0,60.730772,0.0,0.236277,172.0,2,5.636913,0.896432,1
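
The preprocessing, inference and postprocessing sequence described in step 4 can be sketched as follows. Everything here is illustrative: `StubModel`, the shortened `FEATURES` list, the `preprocess`, `postprocess` and `predict` functions, and the 0.5 threshold used to derive the label column are assumptions standing in for the template's own code.

```python
import pandas as pd

FEATURES = ["txn_count", "revenue"]  # shortened, hypothetical feature list


class StubModel:
    """Stand-in for a trained model; returns a fixed score per row."""

    def predict(self, X: pd.DataFrame):
        return [0.9] * len(X)


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Keep only the feature columns (placeholder for online preprocessing)."""
    return df[FEATURES].copy()


def postprocess(features: pd.DataFrame, scores, threshold: float = 0.5) -> pd.DataFrame:
    """Attach scores and a thresholded label to the feature columns."""
    out = features.copy()
    out["predictions"] = scores
    out["paid_off_dpd14"] = (out["predictions"] >= threshold).astype(int)
    return out


def predict(df: pd.DataFrame, model) -> pd.DataFrame:
    """Run preprocessing, inference and postprocessing in order."""
    features = preprocess(df)
    scores = model.predict(features)
    return postprocess(features, scores)


toy_unseen = pd.DataFrame([{"txn_count": 86.0, "revenue": 5.636913}])
result = predict(toy_unseen, StubModel())
print(result)
```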


# Deploy locally
1. To deploy a local model endpoint, head to `algo_explorer_dev/src/schema.py` and configure the input schema based on your selected features and the output schema based on the desired output.
2. Then head to `algo_explorer_dev/src/server.py` to configure the endpoint behaviour. Most changes will be under `async def batch_handler()`, which holds the logic for running inference when the `/predict` endpoint is called.
3. To test your implementation, run the following in your notebook, which deploys the endpoint to `localhost:8000`:

In [None]:
e.deploy()

4. To test this endpoint, open a terminal in this directory and use `curl` to make a POST request. Run the following command:
```
curl -X POST localhost:8000/predict \
-H 'Content-Type:application/json' \
--data-binary @"unseen_data.json"
```
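
The same request can be built from Python using only the standard library. This is a sketch that mirrors the `curl` command; the helper name `build_predict_request` is hypothetical, and the payload shown is a shortened stand-in for the contents of `unseen_data.json`.

```python
import json
import urllib.request


def build_predict_request(payload: dict, url: str = "http://localhost:8000/predict"):
    """Build (but do not send) a JSON POST request for the local endpoint."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_predict_request({"txn_count": 86.0, "revenue": 5.636913})
print(req.get_method(), req.full_url)

# To actually send it, the endpoint started by e.deploy() must be running.
# The response format is not documented here, so it is read as raw bytes:
# with urllib.request.urlopen(req) as response:
#     print(response.read())
```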

# Appendix
## Feature Ranking and Selection
It is also possible to use Explorer for feature ranking by running `e.features_ranking()`. This returns a feature importance dataframe and a plot of AUC against the number of top features.

In [None]:
df_feature_importance, figure_feature_auc = e.features_ranking(train_df_preprocessed, steps=1)

Inspect the feature importance by running:

In [12]:
df_feature_importance

Unnamed: 0,feature_name,importance_gain,importance_split
0,revenue,17227.508344,81
1,inflow100_timegap_std,337.982199,63
2,txn_count,357.882759,58
3,outflow_moving_average_std_28d,338.58742,56
4,outflow150below_count_56d,319.566699,55
5,inflow100_size_std_14d,331.544271,53
6,days_from_last_inflow100,318.959111,53
7,is_digital_bank,73.417829,12
8,is_payfreq_undefined,58.52457,10
9,category,48.81949,9
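
The table above appears to be ordered by `importance_split`. To rank by `importance_gain` instead, the dataframe can be re-sorted with pandas. The sketch below rebuilds the first three rows of the output by hand so that it runs on its own; with the real object you would call `sort_values` on `df_feature_importance` directly.

```python
import pandas as pd

# First three rows of the output above, copied by hand for illustration.
importance_sample = pd.DataFrame(
    {
        "feature_name": ["revenue", "inflow100_timegap_std", "txn_count"],
        "importance_gain": [17227.508344, 337.982199, 357.882759],
        "importance_split": [81, 63, 58],
    }
)

# Re-rank by gain, highest first, and renumber the rows.
by_gain = (
    importance_sample
    .sort_values("importance_gain", ascending=False)
    .reset_index(drop=True)
)
print(by_gain)
```

Gain and split counts can disagree, as they do here for `txn_count` and `inflow100_timegap_std`, so it is worth checking both before dropping features.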


To see the plot of AUC against the number of top features, run:

In [13]:
figure_feature_auc.show()

## Logging an experimental model
If you're attempting some experiments and have yet to reach a stable model, but you'd like to store the model in mlflow for future reference without running `e.run_pipeline()`, you can tell `e.train()` to log the experimental model.

In [None]:
model = e.train(train_df_preprocessed, log_mlflow=True, run_name='test/dchin-run', autolog=True)