# Citibike ML
In this example we use the [Citibike dataset](https://ride.citibikenyc.com/system-data). Citibike is a bicycle sharing system in New York City. Every day, users choose from 20,000 bicycles at 1,300 stations around the city.

To ensure customer satisfaction, Citibike needs to predict how many bicycles will be needed at each station. Citibike maintenance teams check each station and repair or replace bicycles. Additionally, the teams relocate bicycles between stations based on predicted demand. The business needs to be able to run reports of how many bicycles will be needed at a given station on a given day.

## ML Monitoring and Evaluation
In this section of the demo, we will use Snowpark's Python client-side DataFrame API and server-side runtime to build an **ML ops monitoring process**.  We will start from the automated pipeline that has been built for ingest, feature engineering, training, and inference, and add evaluation and monitoring steps.

The ML engineer must create a pipeline step to evaluate the ML model's performance over time. Because we retrain with each inference run, we will evaluate performance over the final 30 days of the predictions.  Additionally, since data scientists may use many different model frameworks, we want a standard evaluation framework instead of each model's built-in evaluation, which differs from framework to framework.  We will deploy the evaluation functions to the Snowpark Python server-side runtime as a UDF so that all projects have a **standard, centralized framework for evaluation and monitoring**.  We will save the model performance metrics in tables for historical analysis and drift detection, as well as for full reproducibility to support the company's GDPR policies.
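
The 30-day evaluation window described above can be sketched locally with pandas. This is only an illustration, not the pipeline's code: the `DATE` column name, the sample values, and the helper `last_n_days` are assumptions made for the example.

```python
import pandas as pd

def last_n_days(df: pd.DataFrame, date_col: str = "DATE", n_days: int = 30) -> pd.DataFrame:
    """Keep only rows whose date falls in the final n_days of the data."""
    dates = pd.to_datetime(df[date_col])
    cutoff = dates.max() - pd.Timedelta(days=n_days)
    # Strictly after the cutoff, so daily data keeps exactly n_days of rows.
    return df.loc[dates > cutoff].reset_index(drop=True)

# Hypothetical predictions: 45 daily rows for a single station.
preds = pd.DataFrame({
    "DATE": pd.date_range("2022-01-01", periods=45, freq="D"),
    "STATION_ID": "426",
    "COUNT": range(45),
    "PRED": range(45),
})
window = last_n_days(preds)
```

In the pipeline itself the same filter would presumably be pushed down to Snowflake rather than run on a local DataFrame, in keeping with the data governance policy below.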

For this demo flow we will assume that the organization has the following **policies and processes**:
- **Dev Tools**: The ML engineer can develop in their tool of choice (e.g., VS Code, IntelliJ, PyCharm, Eclipse).  Snowpark Python makes it possible to use any environment that has a Python kernel.  For the sake of the demo we will use Jupyter.
- **Data Governance**: To preserve customer privacy, no data can be stored locally.  The ingest system may store data temporarily, but it must be assumed that, in production, the ingest system will not preserve intermediate data products between runs.  Snowpark Python allows the user to push down all operations to Snowflake and bring the code to the data.
- **Automation**: Although the ML engineer can use any IDE or notebook for development, the final product at the end of the work stream must be Python code.  Well-documented, modularized code is necessary for good ML operations and to interface with the company's CI/CD and orchestration tools.
- **Compliance**: Any ML model must be traceable back to the original data set used for training.  The business needs to be able to easily remove specific user data from training datasets and retrain models.

Input: Predictions in the `PREDICTIONS_<model_id>` table, identified by a unique model ID.  
Output: Evaluation metrics in `EVAL_<model_id>` table. 

### 1. Load credentials and connect to Snowflake

In [None]:
import snowflake.snowpark as snp
from snowflake.snowpark import functions as F 

from datetime import datetime
import json
import getpass

with open('creds.json') as f:
    data = json.load(f)
    connection_parameters = {
      'account': data['account'],
      'user': data['username'],
      'password': data['password'], #getpass.getpass(),
      'role': data['role'],
      'warehouse': data['warehouse']}

session = snp.Session.builder.configs(connection_parameters).create()
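
The commented-out `getpass.getpass()` above hints at prompting for the password instead of storing it in `creds.json`. A minimal sketch of that fallback follows; the helper name `load_connection_parameters` and its `prompt` parameter are hypothetical and only serve the illustration.

```python
import getpass
import json

def load_connection_parameters(path: str = "creds.json", prompt=getpass.getpass) -> dict:
    """Read connection settings from a JSON file, prompting for a missing password."""
    with open(path) as f:
        data = json.load(f)
    # Fall back to an interactive prompt when the file holds no password.
    password = data.get("password") or prompt("Snowflake password: ")
    return {
        "account": data["account"],
        "user": data["username"],
        "password": password,
        "role": data["role"],
        "warehouse": data["warehouse"],
    }
```

The returned dictionary has the same keys as `connection_parameters` above, so it could be passed to `snp.Session.builder.configs(...)` in the same way.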

In [None]:
project_db_name = 'CITIBIKEML'
project_schema_name = 'DEMO'
project_db_schema = str(project_db_name)+'.'+str(project_schema_name)

###For testing we will hard code an existing model_id
model_id = '56CDBD02_7C61_11EC_B130_ACDE48001122'
###


pred_table_name = str(project_db_schema)+'.'+'PREDICTIONS_'+str(model_id)

eval_table_name = str(project_db_schema)+'.'+'EVAL_'+str(model_id)
_ = session.sql('DROP TABLE IF EXISTS '+eval_table_name).collect()

_ = session.sql('USE DATABASE ' + str(project_db_name)).collect()
_ = session.sql('USE SCHEMA ' + str(project_schema_name)).collect()
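
Because the table names above are concatenated directly into SQL strings, it may be worth validating each part first. The sketch below is one way to do that under the assumption that all names are plain unquoted identifiers; `qualified_table_name` is a hypothetical helper, not part of Snowpark.

```python
import re

# Plain unquoted identifier: a letter or underscore, then letters, digits, underscores.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

def qualified_table_name(db: str, schema: str, prefix: str, model_id: str) -> str:
    """Build DB.SCHEMA.PREFIX_MODELID, rejecting parts that are not plain identifiers."""
    table = f"{prefix}_{model_id}"
    for part in (db, schema, table):
        if not _IDENTIFIER.fullmatch(part):
            raise ValueError(f"unsafe identifier: {part!r}")
    return f"{db}.{schema}.{table}"

model_id = "56CDBD02_7C61_11EC_B130_ACDE48001122"
pred_table_name = qualified_table_name("CITIBIKEML", "DEMO", "PREDICTIONS", model_id)
eval_table_name = qualified_table_name("CITIBIKEML", "DEMO", "EVAL", model_id)
```

A model ID containing spaces, quotes, or semicolons would raise `ValueError` before any SQL is issued.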

### Evaluation
We will use [rexmex](https://rexmex.readthedocs.io/en/latest/index.html) for consistent evaluation rather than the models' built-in evaluation metrics.  Evaluation metrics will be saved as a table tagged with the model_id.  
  
First we will create a UDF for the evaluation with rexmex.

In [None]:
def eval_model_output_func(input_data: list, 
                           y_true_name: str, 
                           y_score_name: str,
                           group_id_name: str) -> str:
    import pandas as pd
    from rexmex import RatingMetricSet, ScoreCard
    
    metric_set = RatingMetricSet()
    score_card = ScoreCard(metric_set)
    
    input_column_names = [y_true_name, y_score_name, group_id_name]
    df = pd.DataFrame(input_data, columns = input_column_names)
    df.rename(columns={y_true_name: 'y_true', y_score_name:'y_score'}, inplace=True)
    
    df = score_card.generate_report(df,grouping=[group_id_name]).reset_index()
    df.drop('level_1', axis=1, inplace=True)
    
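    # Serialize to a string to match the declared return type;
    # the caller parses it back with ast.literal_eval.
    return str([df.values.tolist(), df.columns.tolist()])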

Deploying the UDF to Snowflake makes it available to all users.  This UDF evaluates regression output.  We will likely want to deploy a classification counterpart as well, or add branching logic to this single function.
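
The branching logic mentioned above could take the shape of a small dispatch table keyed by task type. The sketch below uses plain Python metrics as stand-ins rather than rexmex, and every name in it (`METRICS_BY_TASK`, `evaluate`, the metric functions) is hypothetical.

```python
import math

def regression_metrics(y_true, y_score):
    """Mean absolute error and root mean squared error."""
    errors = [t - s for t, s in zip(y_true, y_score)]
    n = len(errors)
    return {
        "mae": sum(abs(e) for e in errors) / n,
        "rmse": math.sqrt(sum(e * e for e in errors) / n),
    }

def classification_metrics(y_true, y_score, threshold=0.5):
    """Accuracy after thresholding scores into 0/1 labels."""
    hits = sum(int(s >= threshold) == int(t) for t, s in zip(y_true, y_score))
    return {"accuracy": hits / len(y_true)}

METRICS_BY_TASK = {
    "regression": regression_metrics,
    "classification": classification_metrics,
}

def evaluate(task, y_true, y_score):
    """Dispatch to the metric function registered for the task type."""
    try:
        metric_func = METRICS_BY_TASK[task]
    except KeyError:
        raise ValueError(f"unknown task type: {task!r}") from None
    return metric_func(y_true, y_score)

reg = evaluate("regression", [10, 12, 8], [11, 12, 6])
clf = evaluate("classification", [1, 0, 1, 1], [0.9, 0.2, 0.4, 0.7])
```

A dispatch table keeps a single deployed entry point while letting new task types be registered without touching the existing branches.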

In [None]:
#from citibike_ml.model_eval import eval_model_output_func
#Deploy the model eval UDF

dep = 'rexmex.zip'
source_dir = './dependencies/'

session.clearImports()
session.addImport(source_dir+dep)
session.addImport('citibike_ml')

model_stage_name = str(project_db_schema)+'.'+'model_stage'
_ = session.sql('CREATE STAGE IF NOT EXISTS ' + model_stage_name).collect()

eval_model_output_udf = session.udf.register(eval_model_output_func, 
                                              name="eval_model_output_udf",
                                              is_permanent=True,
                                              stage_location='@'+str(model_stage_name), 
                                              replace=True)

eval_model_output_udf.name

In [None]:
#Test model eval output
import ast
import pandas as pd

eval_df = session.table(pred_table_name)\
                 .select(F.array_agg(F.array_construct('COUNT', 'PRED', 'STATION_ID')).alias('input_data'))

output_df = eval_df.select(F.call_udf('eval_model_output_udf',
                                      'INPUT_DATA',
                                      F.lit('COUNT'), 
                                      F.lit('PRED'),
                                      F.lit('STATION_ID'))).collect()

# The UDF returns a string holding [rows, column_names]; parse it once.
rows, columns = ast.literal_eval(output_df[0][0])
df = pd.DataFrame(data=rows, columns=columns)

# saveAsTable returns None, so there is nothing to assign.
session.createDataFrame(df).write.saveAsTable(eval_table_name)

df = session.table(eval_table_name).toPandas()
df
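
The parsing step above relies on the UDF output being a string that `ast.literal_eval` can read. The local sketch below shows that round trip on made-up data and one caveat: a bare `nan` is not a Python literal, so a metric that evaluates to NaN would fail to parse. The rows, the column names, and the JSON alternative are assumptions for illustration only.

```python
import ast
import json

# Stand-in for the string the UDF returns: [rows, column_names].
rows = [["426", 3.5, 1.2], ["519", 2.0, 0.8]]
columns = ["STATION_ID", "metric_a", "metric_b"]  # hypothetical metric columns
payload = str([rows, columns])

parsed_rows, parsed_columns = ast.literal_eval(payload)

# literal_eval only accepts Python literals, so a bare `nan` fails to parse.
try:
    ast.literal_eval(str([[float("nan")]]))
    nan_parses = True
except ValueError:
    nan_parses = False

# Python's json module is one alternative encoding that round-trips NaN.
json_rows = json.loads(json.dumps([[float("nan")]]))
```

If NaN metrics turn out to be possible in practice, switching both sides of the exchange to `json.dumps` and `json.loads` would be one option to consider.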

Consolidate all functions for orchestration.

In [None]:
%%writefile citibike_ml/model_eval.py

def eval_model_output_func(input_data: list, 
                           y_true_name: str, 
                           y_score_name: str,
                           group_id_name: str) -> str:
    import pandas as pd
    from rexmex import RatingMetricSet, ScoreCard
    
    metric_set = RatingMetricSet()
    score_card = ScoreCard(metric_set)
    
    input_column_names = [y_true_name, y_score_name, group_id_name]
    df = pd.DataFrame(input_data, columns = input_column_names)
    df.rename(columns={y_true_name: 'y_true', y_score_name:'y_score'}, inplace=True)
    
    df = score_card.generate_report(df,grouping=[group_id_name]).reset_index()
    df.drop('level_1', axis=1, inplace=True)
    
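    # Serialize to a string to match the declared return type;
    # the caller parses it back with ast.literal_eval.
    return str([df.values.tolist(), df.columns.tolist()])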

def deploy_eval_udf(session, function_name, model_stage_name) -> str:
    from citibike_ml.model_eval import eval_model_output_func

    dep = 'rexmex.zip'
    source_dir = './dependencies/'

    session.clearImports()
    session.addImport(source_dir+dep)
    session.addImport('citibike_ml')

    eval_model_output_udf = session.udf.register(eval_model_output_func, 
                                                  name=function_name,
                                                  is_permanent=True,
                                                  stage_location='@'+str(model_stage_name), 
                                                  replace=True)

    return eval_model_output_udf.name

def evaluate_station_predictions(session, pred_table_name, eval_model_udf_name, eval_table_name) -> str:
    from snowflake.snowpark import functions as F
    import pandas as pd
    import ast
    
    eval_df = session.table(pred_table_name)\
                     .select(F.array_agg(F.array_construct('COUNT', 'PRED', 'STATION_ID')).alias('input_data'))

    output_df = eval_df.select(F.call_udf(eval_model_udf_name,
                                          'INPUT_DATA',
                                          F.lit('COUNT'), 
                                          F.lit('PRED'),
                                          F.lit('STATION_ID'))).collect()
    
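    # The UDF returns a string holding [rows, column_names]; parse it once.
    rows, columns = ast.literal_eval(output_df[0][0])
    df = pd.DataFrame(data=rows, columns=columns)

    # saveAsTable returns None, so there is nothing to assign.
    session.createDataFrame(df).write.saveAsTable(eval_table_name)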


    return eval_table_name
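
For orchestration, the two consolidated functions would be called in sequence: deploy the UDF, then evaluate a model's predictions. The sketch below wires them together through parameters so it can be exercised without a Snowflake connection. `run_evaluation_step` and its default values are hypothetical, with the defaults simply mirroring the names used earlier in this notebook.

```python
def run_evaluation_step(session, model_id, deploy_udf, evaluate,
                        db_schema="CITIBIKEML.DEMO",
                        udf_name="eval_model_output_udf"):
    """Deploy the evaluation UDF, then evaluate one model's predictions."""
    stage_name = f"{db_schema}.model_stage"
    pred_table = f"{db_schema}.PREDICTIONS_{model_id}"
    eval_table = f"{db_schema}.EVAL_{model_id}"

    # deploy_udf returns the registered UDF name, which evaluate then calls.
    deployed_name = deploy_udf(session, udf_name, stage_name)
    return evaluate(session, pred_table, deployed_name, eval_table)
```

With a live session, the call would look like `run_evaluation_step(session, model_id, deploy_eval_udf, evaluate_station_predictions)` and would return the name of the evaluation table.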

In [None]:
#vectorize: eval_df = eval_df.groupBy('STATION_ID')\
#                  .agg(F.array_agg(F.array_construct(F.col('COUNT'), F.col('PRED'))).alias('INPUT_DATA'))\
#                  #.filter(F.col('STATION_ID') == F.lit('426'))
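
The commented-out group-by above would produce one row per station, each holding that station's list of `[COUNT, PRED]` pairs. The pandas sketch below shows the same shape on made-up data; it is a local illustration of the intended layout, not the Snowpark implementation, and `pairs_per_station` is a hypothetical helper.

```python
import pandas as pd

# Hypothetical predictions for two stations.
preds = pd.DataFrame({
    "STATION_ID": ["426", "426", "519"],
    "COUNT": [10, 12, 7],
    "PRED": [11, 12, 5],
})

def pairs_per_station(df: pd.DataFrame) -> pd.DataFrame:
    """One row per station holding its list of [COUNT, PRED] pairs."""
    paired = df.assign(PAIR=df[["COUNT", "PRED"]].values.tolist())
    return (paired.groupby("STATION_ID")["PAIR"]
                  .apply(list)
                  .rename("INPUT_DATA")
                  .reset_index())

grouped = pairs_per_station(preds)
```

With this layout the evaluation function would receive one station's data per call, so it would no longer need the `group_id_name` argument to split the input itself.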