# Citibike ML
In this example we use the [Citibike dataset](https://ride.citibikenyc.com/system-data). Citibike is a bicycle sharing system in New York City. Every day, users choose from 20,000 bicycles at 1,300 stations around the city.

To ensure customer satisfaction, Citibike needs to predict how many bicycles will be needed at each station. Maintenance teams check each station and repair or replace bicycles. They also relocate bicycles between stations based on predicted demand. The business needs to be able to run reports showing how many bicycles will be needed at a given station on a given day.

## ML Ops
In this section of the demo, we will use Snowpark's Python client-side DataFrame API as well as the Snowpark server-side runtime to create an **ML ops pipeline**.  We will take the functions created by the ML Engineer and turn them into a set of functions that can be easily automated with the company's orchestration tools.

The ML Engineer must create a pipeline to **automate deployment** of models and batch predictions where the business users can consume them easily from dashboards and analytics tools like Tableau or Power BI.  Predictions will be made for the top 10 busiest stations.  The predictions must be accompanied by an explanation of which features were most impactful for the prediction.  

For this demo flow we will assume that the organization has the following **policies and processes**:
- **Dev Tools**: The ML engineer can develop in their tool of choice (e.g., VS Code, IntelliJ, PyCharm, Eclipse).  Snowpark Python makes it possible to use any environment that has a Python kernel.  For this demo we will use Jupyter.
- **Data Governance**: To preserve customer privacy, no data can be stored locally.  The ingest system may store data temporarily, but it must be assumed that, in production, the ingest system will not preserve intermediate data products between runs.  Snowpark Python allows the user to push down all operations to Snowflake and bring the code to the data.
- **Automation**: Although the ML engineer can use any IDE or notebook for development, the final product at the end of the work stream must be Python code.  Well-documented, modularized code is necessary for good ML operations and for interfacing with the company's CI/CD and orchestration tools.
- **Compliance**: Any ML model must be traceable back to the original data set used for training.  The business needs to be able to easily remove specific user data from training datasets and retrain models.

Input: Data in `trips` table.  Feature engineering, train, predict functions from data scientist.  
Output: Automatable pipeline of feature engineering, train, predict.

### 1. Load credentials and connect to Snowflake

In [None]:
%%writefile dag/snowpark_connection.py

def snowpark_connect(creds_file='creds.json'):
    import snowflake.snowpark as snp
    import os, json 
    
    with open(os.path.join(creds_file)) as f:
        data = json.load(f)
        connection_parameters = {
            'account': data['account'],
            'user': data['username'],
            'password': data['password'],
            'role': data['role'],
            'warehouse': data['task_warehouse'],
            'database': data['database'],
            'schema': data['schema']
        }
        compute_parameters = {
            'load_warehouse': data['load_warehouse'],
            'fe_warehouse': data['fe_warehouse'],
            'train_warehouse': data['train_warehouse'],
        }
    
    session = snp.Session.builder.configs(connection_parameters).create()
    return session, compute_parameters
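
The connection function above expects `creds.json` to contain a specific set of keys. As a minimal sketch (the helper name `missing_cred_keys` is hypothetical and not part of the demo code), the check below reports which of those keys are absent before a connection is attempted. It works on the JSON text only and does not contact Snowflake.

```python
import json

# Keys read by snowpark_connect(); the helper below is a hypothetical addition.
REQUIRED_KEYS = (
    'account', 'username', 'password', 'role', 'database', 'schema',
    'task_warehouse', 'load_warehouse', 'fe_warehouse', 'train_warehouse',
)

def missing_cred_keys(creds_text):
    """Return the sorted list of required keys absent from a JSON credentials string."""
    data = json.loads(creds_text)
    return sorted(key for key in REQUIRED_KEYS if key not in data)

# Placeholder values for illustration only.
example = json.dumps({'account': 'my_account', 'username': 'my_user'})
print(missing_cred_keys(example))
```

Running a check like this first can turn a `KeyError` deep inside the connection code into a clear list of what is missing.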


In [1]:
from dag.snowpark_connection import snowpark_connect
session, compute_parameters = snowpark_connect('creds.json')

### 2. Set up the pipeline

In [3]:
from snowflake.snowpark import functions as F
import uuid
state_dict = {
        "trips_table_name":"TRIPS",
        "load_stage_name":"LOAD_STAGE",
        "model_stage_name":"MODEL_STAGE",
        "model_id": str(uuid.uuid1()).replace('-', '_')
    }

start_date, end_date = session.table(state_dict['trips_table_name']) \
                              .select(F.min('STARTTIME'), F.max('STARTTIME')).collect()[0][0:2]
state_dict.update({"start_date":start_date})
state_dict.update({"end_date":end_date})

The business doesn't actively maintain bicycle stock at EVERY station.  We only need predictions for the `top_n` number of stations.  Initially that is 10 but it might change.

In [4]:
top_n = 10
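
To illustrate what "top `n` stations" means here, the following is a local, pure-Python sketch of the same ranking logic: count trips per start station, ignore missing IDs, and keep the `n` busiest. The function name and sample data are assumptions for illustration; in the pipeline itself this ranking is pushed down to Snowflake rather than run locally.

```python
from collections import Counter

def top_n_stations(start_station_ids, n):
    """Return the n most frequent non-null station IDs, busiest first."""
    counts = Counter(s for s in start_station_ids if s is not None)
    return [station for station, _ in counts.most_common(n)]

# Made-up sample of start station IDs, one entry per trip.
trips = ['519', '497', '519', None, '402', '497', '519']
print(top_n_stations(trips, 2))  # ['519', '497']
```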

We will deploy the model training and inference as a permanent [Python Snowpark User-Defined Function (UDF)](https://docs.snowflake.com/en/LIMITEDACCESS/snowpark-python.html#creating-user-defined-functions-udfs-for-dataframes). This will make the function available to not only our automated training/inference pipeline but also to any users needing the function for manually generated predictions.  
  
As a permanent function we will need a staging area.

In [5]:
_ = session.sql('CREATE STAGE IF NOT EXISTS ' + state_dict['model_stage_name']).collect()

For production we need to be able to reproduce results.  The `trips` table will change as new data is loaded each month so we need a point-in-time snapshot.  Snowflake [Zero-Copy Cloning](https://docs.snowflake.com/en/sql-reference/sql/create-clone.html) allows us to do this with copy-on-write features so we don't have multiple copies of the same data.  We will create a unique ID to identify each training/inference run as well as the features and predictions generated.  We can use [object tagging](https://docs.snowflake.com/en/user-guide/object-tagging.html) to tag each object with the `model_id` as well.

In [7]:
clone_table_name = 'TRIPS_CLONE_'+state_dict["model_id"]
state_dict.update({"clone_table_name":clone_table_name})

_ = session.sql('CREATE OR REPLACE TABLE '+clone_table_name+" CLONE "+state_dict["trips_table_name"]).collect()
_ = session.sql('CREATE TAG IF NOT EXISTS model_id_tag').collect()
_ = session.sql("ALTER TABLE "+clone_table_name+" SET TAG model_id_tag = '"+state_dict["model_id"]+"'").collect()
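
The clone-and-tag step can also be expressed as a small helper that only builds the SQL strings, which makes it easy to review or log the statements before running them. This is a sketch under assumptions: the helper name `clone_statements` is hypothetical, and the identifier check is an added safeguard that is not in the original cell.

```python
import re
import uuid

def clone_statements(source_table, model_id):
    """Build the clone and tag statements for one training run."""
    # Guard against characters that are not valid in an unquoted identifier.
    if not re.fullmatch(r'[A-Za-z0-9_]+', model_id):
        raise ValueError('model_id must contain only letters, digits, and underscores')
    clone_table = f'{source_table}_CLONE_{model_id}'
    return clone_table, [
        f'CREATE OR REPLACE TABLE {clone_table} CLONE {source_table}',
        'CREATE TAG IF NOT EXISTS model_id_tag',
        f"ALTER TABLE {clone_table} SET TAG model_id_tag = '{model_id}'",
    ]

model_id = str(uuid.uuid1()).replace('-', '_')
clone_table, statements = clone_statements('TRIPS', model_id)
for statement in statements:
    print(statement)  # each string could be passed to session.sql(...).collect()
```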

We will start by importing the functions created by the ML Engineer.

In [8]:
from dag.mlops_pipeline import deploy_pred_train_udf
from dag.mlops_pipeline import materialize_holiday_table
from dag.mlops_pipeline import materialize_precip_table
#from dag.mlops_pipeline import generate_feature_views
#from dag.mlops_pipeline import train_predict_feature_views

The pipeline will be orchestrated by our company's orchestration framework, but we will test the steps here.

In [10]:
model_udf_name = deploy_pred_train_udf(session=session, 
                                       model_stage_name=state_dict['model_stage_name']
                                      )
                
state_dict.update({"model_udf_name":model_udf_name})

ModuleNotFoundError: No module named 'station_train_predict'

In [None]:
holiday_table_name = materialize_holiday_table(session=session,
                                               start_date=state_dict['start_date'], 
                                               end_date=state_dict['end_date'], 
                                               holiday_table_name='holidays'
                                              )
        
state_dict.update({"holiday_table_name":holiday_table_name})

In [None]:
precip_table_name = materialize_precip_table(session=session,
                                             start_date=state_dict['start_date'], 
                                             end_date=state_dict['end_date'], 
                                             precip_table_name='weather'
                                             )
state_dict.update({"precip_table_name":precip_table_name})

In [None]:
from snowflake.snowpark import functions as F
# Use the clone created for this run rather than a table name hard-coded from an earlier run.
testdf = session.table(state_dict['clone_table_name'])

In [None]:
import snowflake.snowpark as snp  # needed here: snp was only imported inside snowpark_connect()

agg_period = 'DAY'
date_win = snp.Window.orderBy('DATE')
holiday_df = session.table(holiday_table_name)
precip_df = session.table(precip_table_name)

In [None]:
testdf.select('STARTTIME', 'START_STATION_ID')\
      .withColumn('DATE', F.call_builtin('DATE_TRUNC', (agg_period, F.col('STARTTIME'))))\
      .join(holiday_df, 'DATE', join_type='left').na.fill({'HOLIDAY':0})\
      .join(precip_df, 'DATE', 'inner')\
      .groupBy(F.col('DATE'), F.col('START_STATION_ID'))\
      .count()\
      .filter(F.col('START_STATION_ID') == '519')\
      .sort('DATE')\
      .show()


In [None]:
testdf.select('STARTTIME', 'START_STATION_ID')\
      .withColumn('DATE', F.call_builtin('DATE_TRUNC', (agg_period, F.col('STARTTIME'))))\
      .join(holiday_df, 'DATE', join_type='left').na.fill({'HOLIDAY':0})\
      .groupBy(F.col('DATE'), F.col('START_STATION_ID'))\
      .count()\
      .sort('DATE')\
      .show()
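
The two test queries above truncate each trip's start time to the day and count trips per date and station. A local sketch of that aggregation, using only the standard library and made-up sample trips, might look like the following. It leaves out the holiday and precipitation joins and is meant only to show the shape of the result.

```python
from collections import Counter
from datetime import datetime

def daily_trip_counts(trips):
    """Count trips per (date, station); trips are (start_time, station_id) pairs."""
    counts = Counter((start.date(), station) for start, station in trips)
    return sorted(counts.items())

# Made-up sample trips for illustration.
trips = [
    (datetime(2020, 1, 1, 8, 15), '519'),
    (datetime(2020, 1, 1, 17, 40), '519'),
    (datetime(2020, 1, 2, 9, 5), '519'),
    (datetime(2020, 1, 1, 12, 0), '497'),
]
for (day, station), count in daily_trip_counts(trips):
    print(day, station, count)
```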

In [None]:
%%time 
feature_view_names = generate_feature_views(session=session, 
                                            clone_table_name=clone_table_name, 
                                            feature_view_name=feature_view_name, 
                                            holiday_table_name=holiday_table_name, 
                                            precip_table_name=precip_table_name,
                                            target_column='COUNT', 
                                            top_n=top_n)
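
Each feature view created by `generate_feature_views` holds a single row per station: all feature rows packed into one array, plus the column names and target column as strings. The sketch below mimics that packing locally so the layout is easier to picture. The function name, the sample rows, and the `STATION_<station_id>_VIEW` template are assumptions for illustration; only the `<station_id>` placeholder comes from the pipeline code.

```python
def pack_features(rows, columns, station_id, target_column, view_template):
    """Collapse one station's feature rows into a single record, as the views do."""
    return {
        'view_name': view_template.replace('<station_id>', str(station_id)),
        'input_data': [list(row) for row in rows],  # array of per-row arrays
        'station_id': station_id,
        # Column names are joined with spaces and stripped of double quotes.
        'input_column_names': ' '.join(columns).replace('"', ''),
        'target_column': target_column,
    }

record = pack_features(
    rows=[('2020-01-01', 0, 120), ('2020-01-02', 0, 95)],
    columns=['"DATE"', '"HOLIDAY"', '"COUNT"'],
    station_id='519',
    target_column='COUNT',
    view_template='STATION_<station_id>_VIEW',  # hypothetical naming template
)
print(record['view_name'], record['input_column_names'])
```

Packing everything into one row lets the training UDF receive a station's full history in a single call.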

In [None]:
%%time 
pred_table_name = train_predict_feature_views(session=session, 
                                               station_train_pred_udf_name=model_udf_name, 
                                               feature_view_names=feature_view_names, 
                                               pred_table_name=pred_table_name)

In [None]:
session.table(pred_table_name).select('STATION_ID').distinct().count() #.show()

In [None]:
#%%writefile dag/mlops_pipeline.py

def materialize_holiday_table(session, start_date, end_date, holiday_table_name) -> str:
    from feature_engineering import generate_holiday_df
    from datetime import datetime

    holiday_df = generate_holiday_df(session=session, start_date=start_date, end_date=datetime.now())
    holiday_df.write.mode('overwrite').saveAsTable(holiday_table_name)
    
    return holiday_table_name

def materialize_precip_table(session, start_date, end_date, precip_table_name) -> str:
    from feature_engineering import generate_precip_df
    from datetime import datetime

    precip_df = generate_precip_df(session=session, start_date=start_date, end_date=datetime.now())
    precip_df.write.mode('overwrite').saveAsTable(precip_table_name)
    
    return precip_table_name


def deploy_pred_train_udf(session, model_stage_name) -> str:
    from station_train_predict import station_train_predict_func

    session.clearImports()
    session.addImport('pytorch_tabnet.zip')
    session.addImport('station_train_predict.py')

    station_train_predict_udf = session.udf.register(station_train_predict_func, 
                                                  name="station_train_predict_udf",
                                                  is_permanent=True,
                                                  stage_location='@'+str(model_stage_name), 
                                                  replace=True)
    return station_train_predict_udf.name


def generate_feature_views(session, 
                           clone_table_name, 
                           feature_view_name, 
                           holiday_table_name, 
                           precip_table_name, 
                           target_column, 
                           top_n) -> list:
    from feature_engineering import generate_features
    from snowflake.snowpark import functions as F

    feature_view_names = list()
    
    top_n_station_ids = session.table(clone_table_name).filter(F.col('START_STATION_ID').is_not_null()) \
                                                       .groupBy('START_STATION_ID') \
                                                       .count() \
                                                       .sort('COUNT', ascending=False) \
                                                       .limit(top_n) \
                                                       .collect()
    top_n_station_ids = [stations['START_STATION_ID'] for stations in top_n_station_ids]

    for station in top_n_station_ids:
        feature_df = generate_features(session=session, 
                                       input_df=session.table(clone_table_name)\
                                                       .filter(F.col('START_STATION_ID') == station)\
                                                       .sort('DATE', ascending=True), 
                                       holiday_table_name=holiday_table_name, 
                                       precip_table_name=precip_table_name)

        input_columns_str = ' '.join(feature_df.columns).replace('"', '')

        feature_df = feature_df.select(F.array_agg(F.array_construct(F.col('*'))).alias('input_data'), 
                                       F.lit(station).alias('station_id'),
                                       F.lit(input_columns_str).alias('input_column_names'),
                                       F.lit(target_column).alias('target_column'))  

        station_feature_view_name = feature_view_name.replace('<station_id>', station)
        feature_df.createOrReplaceView(station_feature_view_name)
        feature_view_names.append(station_feature_view_name)

    return feature_view_names


def train_predict_feature_views(session, station_train_pred_udf_name, feature_view_names, pred_table_name) -> str:
    from snowflake.snowpark import functions as F
    import pandas as pd
    import ast
    
    cutpoint=365
    max_epochs=1000
    
    for view in feature_view_names:
        feature_df = session.table(view)
        output_df = feature_df.select(F.call_udf(station_train_pred_udf_name, 
                                                 'INPUT_DATA', 
                                                 'INPUT_COLUMN_NAMES', 
                                                 'TARGET_COLUMN', 
                                                 F.lit(cutpoint), 
                                                 F.lit(max_epochs))).collect()

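        # Parse the UDF's string output once: element 0 holds the rows, element 1 the column names.
        pred_data, pred_columns = ast.literal_eval(output_df[0][0])[0:2]
        df = pd.DataFrame(data=pred_data, columns=pred_columns)

        df['DATE'] = pd.to_datetime(df['DATE']).dt.date
        df['STATION_ID'] = feature_df.select('STATION_ID').collect()[0][0]

        # Append so that each station's predictions accumulate in the same table.
        session.createDataFrame(df).write.mode('append').saveAsTable(pred_table_name)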
    
    return pred_table_name
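
The parsing step in `train_predict_feature_views` treats the UDF result as a string that evaluates to a two-element literal, with the rows first and the column names second. The sketch below shows that parsing on a made-up payload without pandas or Snowflake. The function name, the `PRED` column, and the sample values are assumptions; the actual columns depend on what the data scientist's UDF returns.

```python
import ast

def parse_udf_output(raw, station_id):
    """Turn the UDF's string result into a list of row dictionaries."""
    rows, columns = ast.literal_eval(raw)  # parse the payload once
    records = [dict(zip(columns, row)) for row in rows]
    for record in records:
        record['STATION_ID'] = station_id
    return records

# Made-up payload in the assumed [rows, column_names] layout.
raw = "[[['2020-01-01', 120.0], ['2020-01-02', 95.5]], ['DATE', 'PRED']]"
for record in parse_udf_output(raw, '519'):
    print(record)
```

`ast.literal_eval` only accepts Python literals, so it is a safer choice than `eval` for turning the returned string back into data.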
