# Citibike ML
In this example we use the [Citibike dataset](https://ride.citibikenyc.com/system-data). Citibike is a bicycle-sharing system in New York City. Every day, users choose from 20,000 bicycles at 1,300 stations around the city.

To ensure customer satisfaction, Citibike needs to predict how many bicycles will be needed at each station. Maintenance teams check each station and repair or replace bicycles, and they relocate bicycles between stations based on predicted demand. The business needs to be able to run reports showing how many bicycles will be needed at a given station on a given day.

## ML Engineering Development
In this section of the demo, we will use Snowpark's Python client-side DataFrame API to build and develop code for the **MLOps pipeline**. We will take the functions and the model training/inference definitions from the data scientist and put them into production using the Snowpark server-side runtime and Snowpark Python user-defined functions (UDFs) for ML model training and inference.

The ML engineer will start by exploring the deployment options and testing the deployed model before building a pipeline.

For this demo flow we will assume that the organization has the following **policies and processes**:
- **Dev Tools**: The ML engineer can develop in their tool of choice (e.g., VS Code, IntelliJ, PyCharm, Eclipse). Snowpark Python makes it possible to use any environment with a Python kernel. For this demo we will use Jupyter.
- **Data Governance**: To preserve customer privacy, no data can be stored locally. The ingest system may store data temporarily, but it must be assumed that, in production, the ingest system will not preserve intermediate data products between runs. Snowpark Python allows the user to push down all operations to Snowflake and bring the code to the data.
- **Automation**: Although the ML engineer can use any IDE or notebook for development, the final product at the end of the work stream must be Python code. Well-documented, modularized code is necessary for good ML operations and to interface with the company's CI/CD and orchestration tools.
- **Compliance**: Any ML model must be traceable back to the original dataset used for training. The business needs to be able to easily remove specific user data from training datasets and retrain models.

Input: Data in `trips` table.  Feature engineering, train, predict functions from data scientist.  
Output: Prediction models available to business users in SQL. Evaluation reports for monitoring.

In [None]:
import snowflake.snowpark as snp
from snowflake.snowpark import functions as F 

### 1. Load credentials and connect to Snowflake

In [None]:
from datetime import datetime
import json
import getpass

with open('creds.json') as f:
    data = json.load(f)
    connection_parameters = {
      'account': data['account'],
      'user': data['username'],
      'password': data['password'], #getpass.getpass(),
      'role': data['role'],
      'warehouse': data['warehouse']}

session = snp.Session.builder.configs(connection_parameters).create()

### 2. Create Feature Pipelines


In [None]:
project_db_name = 'CITIBIKEML'
project_schema_name = 'DEMO'
project_db_schema = str(project_db_name)+'.'+str(project_schema_name)

trips_table_name = str(project_db_schema)+'.'+'TRIPS'
holiday_table_name = str(project_db_schema)+'.'+'HOLIDAYS'
precip_table_name = str(project_db_schema)+'.'+'WEATHER'

_ = session.sql('USE DATABASE ' + str(project_db_name)).collect()
_ = session.sql('USE SCHEMA ' + str(project_schema_name)).collect()

We will materialize the holiday and weather datasets as tables instead of recalculating them each time the inference and training pipelines run.

In [None]:
from citibike_ml.feature_engineering import generate_holiday_df, generate_precip_df

start_date, end_date = session.table(trips_table_name) \
                              .select(F.min('STARTTIME'), F.max('STARTTIME')).collect()[0][0:2]

holiday_df = generate_holiday_df(session=session, start_date=start_date, end_date=datetime.now())
holiday_df.write.mode('overwrite').saveAsTable(holiday_table_name)

precip_df = generate_precip_df(session=session, start_date=start_date, end_date=datetime.now())
precip_df.write.mode('overwrite').saveAsTable(precip_table_name)

### 3. Create UDF for Training and Inference

Since this is a time series prediction, we will retrain a model each time we do inference. We don't need to save the model artifacts, but we will save the predictions in a predictions table.  
  
Here we can use Snowpark User Defined Functions for training as well as inference without having to pull data out of Snowflake.
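
The function in the next cell holds out the most recent `cutpoint` rows for validation and trains on everything before them. A minimal sketch of that slicing with plain Python lists follows. The helper name `split_by_cutpoint` and the sample counts are hypothetical, and the real function slices pandas columns instead. The sketch assumes the rows are already sorted by date and does not check that.

```python
from typing import List, Sequence, Tuple


def split_by_cutpoint(rows: Sequence, cutpoint: int) -> Tuple[List, List]:
    """Split time-ordered rows into (train, valid); the last `cutpoint` rows are held out."""
    if not 0 < cutpoint < len(rows):
        raise ValueError("cutpoint must be between 1 and len(rows) - 1")
    rows = list(rows)
    return rows[:-cutpoint], rows[-cutpoint:]


# Illustrative daily trip counts, oldest first.
daily_counts = [12, 15, 9, 20, 18, 22, 25]
train, valid = split_by_cutpoint(daily_counts, cutpoint=2)
print(train)  # [12, 15, 9, 20, 18]
print(valid)  # [22, 25]
```

Holding out the tail rather than a random sample keeps the validation period strictly after the training period, which is usually what a time series evaluation needs.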

In [None]:
%%writefile citibike_ml/station_train_predict.py

def station_train_predict_func(input_data: list, 
                               input_columns_str: str, 
                               target_column: str,
                               cutpoint: int, 
                               max_epochs: int) -> str:
    
    input_columns = input_columns_str.split(' ')
    feature_columns = input_columns.copy()
    feature_columns.remove('DATE')
    feature_columns.remove(target_column)
    
    from torch import tensor
    import pandas as pd
    from pytorch_tabnet.tab_model import TabNetRegressor
    
    model = TabNetRegressor()

    df = pd.DataFrame(input_data, columns = input_columns)
    
    y_valid = df[target_column][-cutpoint:].values.reshape(-1, 1)
    X_valid = df[feature_columns][-cutpoint:].values
    y_train = df[target_column][:-cutpoint].values.reshape(-1, 1)
    X_train = df[feature_columns][:-cutpoint].values

    model.fit(
        X_train, y_train,
        eval_set=[(X_valid, y_valid)],
        max_epochs=max_epochs,
        patience=100,
        batch_size=1024, 
        virtual_batch_size=128,
        num_workers=0,
        drop_last=False)
    
    
    df['PRED'] = model.predict(tensor(df[feature_columns].values))
    df = pd.concat([df, pd.DataFrame(model.explain(df[feature_columns].values)[0], 
                           columns = feature_columns).add_prefix('EXPL_')], axis=1)
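    # Serialize to text so the value matches the declared `str` return type;
    # the client parses it back with ast.literal_eval.
    return str([df.values.tolist(), df.columns.tolist()])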

The Snowpark server-side Anaconda runtime has a large [list of Python modules included](https://docs.snowflake.com/en/LIMITEDACCESS/udf-python-packages.html#list-of-the-third-party-packages-from-anaconda) for our UDF. However, the data scientist built this code on pytorch-tabnet, which is not currently in the Snowpark distribution.

We can simply add [pytorch_tabnet](https://github.com/dreamquark-ai/tabnet), as well as our own team's Python code, as import dependencies.

In [None]:
from citibike_ml.station_train_predict import station_train_predict_func
import os 

dep = 'pytorch_tabnet'
#source_dir = os.environ['CONDA_PREFIX']+'/lib/python3.8/site-packages/'
source_dir = './dependencies/'

model_stage_name = str(project_db_schema)+'.'+'model_stage'
_ = session.sql('CREATE STAGE IF NOT EXISTS ' + model_stage_name).collect()

session.clearImports()
session.addImport(source_dir+dep)
session.addImport('citibike_ml')

station_train_predict_udf = session.udf.register(station_train_predict_func, 
                                              name="station_train_predict_udf",
                                              is_permanent=True,
                                              stage_location='@'+str(model_stage_name), 
                                              replace=True)

### 4. Test the training/inference pipeline and prediction output.

We will aggregate the feature rows into a single array of arrays so that the entire training set for a station reaches our UDF in one call.
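
The aggregation packs the whole feature table into a single row: one array of row-arrays plus a space-separated string of column names. The UDF rebuilds the columns from that string and hands its result back as text, which the client parses with `ast.literal_eval`. A rough, Snowflake-free sketch of that round trip follows. `fake_udf`, the `HOLIDAY` column, and the sample rows are hypothetical stand-ins for illustration only, and no model is trained.

```python
import ast
from typing import List


def fake_udf(input_data: List[list], input_columns_str: str, target_column: str) -> str:
    """Hypothetical stand-in for the UDF: echoes rows plus a dummy PRED column as text."""
    input_columns = input_columns_str.split(' ')
    target_idx = input_columns.index(target_column)
    # Dummy "prediction": copy the target value so the sketch stays self-contained.
    rows = [row + [row[target_idx]] for row in input_data]
    return str([rows, input_columns + ['PRED']])


# Illustrative rows only, listed in the same order as the column-name string.
input_columns_str = 'DATE COUNT HOLIDAY'
input_data = [['2020-01-01', 10, 1], ['2020-01-02', 42, 0]]

raw = fake_udf(input_data, input_columns_str, 'COUNT')
rows, columns = ast.literal_eval(raw)
print(columns)  # ['DATE', 'COUNT', 'HOLIDAY', 'PRED']
print(rows[1])  # ['2020-01-02', 42, 0, 42]
```

Column order matters in this contract: the name string and every row-array must list the columns in the same order, because the names are the only schema the UDF receives.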

In [None]:
%%time

from citibike_ml.feature_engineering import generate_features

station_id = '519'
target_column = 'COUNT'

snowdf = session.table(trips_table_name).filter(F.col('START_STATION_ID') == station_id)

feature_df = generate_features(session=session, 
                               input_df=snowdf, 
                               holiday_table_name=holiday_table_name, 
                               precip_table_name=precip_table_name)

input_columns_str = str(' ').join(feature_df.columns).replace('\"', "")

feature_df = feature_df.select(F.array_agg(F.array_construct(F.col('*'))).alias('input_data'), 
                              F.lit(station_id).alias('station_id'),
                              F.lit(input_columns_str).alias('input_column_names'),
                              F.lit(target_column).alias('target_column'))

cutpoint=365
max_epochs = 100

output_df = feature_df.select(F.call_udf('station_train_predict_udf', 
                                       'INPUT_DATA', 
                                       'INPUT_COLUMN_NAMES', 
                                       'TARGET_COLUMN', 
                                       F.lit(cutpoint), 
                                       F.lit(max_epochs))).collect()

In [None]:
import pandas as pd
import ast
# Parse the UDF's text output once into rows and column names.
rows, columns = ast.literal_eval(output_df[0][0])
df = pd.DataFrame(data=rows, columns=columns)

df['DATE'] = pd.to_datetime(df['DATE']).dt.date
df.head()

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

def plot(df, x_lab: str, y_true_lab: str, y_pred_lab: str):
    # Reshape to long format so actual and predicted values are drawn as separate lines.
    plt.figure(figsize=(15, 8))
    df = pd.melt(df, id_vars=[x_lab], value_vars=[y_true_lab, y_pred_lab])
    ax = sns.lineplot(x=x_lab, y='value', hue='variable', data=df)

In [None]:
plot(df, 'DATE', 'COUNT', 'PRED')

We will end by consolidating the functions we created.
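
One detail of the consolidated module is that it creates one feature view per station by substituting the station ID into a name template containing the `<station_id>` placeholder. A small sketch of that naming step follows. The template value and the helper name `station_view_names` are assumptions made for illustration and are not part of the module.

```python
from typing import Iterable, List


def station_view_names(template: str, station_ids: Iterable) -> List[str]:
    """Fill the '<station_id>' placeholder once per station."""
    if '<station_id>' not in template:
        raise ValueError("template must contain the '<station_id>' placeholder")
    # str() guards against station IDs that arrive as numbers rather than text.
    return [template.replace('<station_id>', str(sid)) for sid in station_ids]


# Hypothetical template and station IDs.
names = station_view_names('STATION_<station_id>_VIEW_DEMO', ['519', 497])
print(names)  # ['STATION_519_VIEW_DEMO', 'STATION_497_VIEW_DEMO']
```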

In [None]:
%%writefile citibike_ml/mlops_pipeline.py
from typing import Tuple

def materialize_holiday_weather(session, trips_table_name, holiday_table_name, precip_table_name) -> Tuple[str, str]:
    from citibike_ml.feature_engineering import generate_holiday_df, generate_precip_df
    from snowflake.snowpark import functions as F
    from datetime import datetime

    start_date, end_date = session.table(trips_table_name) \
                                  .select(F.min('STARTTIME'), F.max('STARTTIME')).collect()[0][0:2]

    holiday_df = generate_holiday_df(session=session, start_date=start_date, end_date=datetime.now())
    holiday_df.write.mode('overwrite').saveAsTable(holiday_table_name)

    precip_df = generate_precip_df(session=session, start_date=start_date, end_date=datetime.now())
    precip_df.write.mode('overwrite').saveAsTable(precip_table_name)
    
    return holiday_table_name, precip_table_name


def deploy_pred_train_udf(session, function_name, model_stage_name) -> str:
    from citibike_ml.station_train_predict import station_train_predict_func

    dep = 'pytorch_tabnet'
    source_dir = './dependencies/'

    session.clearImports()
    session.addImport(source_dir+dep)
    session.addImport('citibike_ml')

    # Register under the caller-supplied name instead of a hard-coded one.
    station_train_predict_udf = session.udf.register(station_train_predict_func, 
                                                  name=function_name,
                                                  is_permanent=True,
                                                  stage_location='@'+str(model_stage_name), 
                                                  replace=True)
    return station_train_predict_udf.name


def generate_feature_views(session, 
                           clone_table_name, 
                           feature_view_name, 
                           holiday_table_name, 
                           precip_table_name, 
                           target_column, 
                           top_n) -> list:
    from citibike_ml.feature_engineering import generate_features
    from snowflake.snowpark import functions as F

    feature_view_names = list()
    
    top_n_station_ids = session.table(clone_table_name).filter(F.col('START_STATION_ID').is_not_null()) \
                                                       .groupBy('START_STATION_ID') \
                                                       .count() \
                                                       .sort('COUNT', ascending=False) \
                                                       .limit(top_n) \
                                                       .collect()
    top_n_station_ids = [stations['START_STATION_ID'] for stations in top_n_station_ids]

    for station in top_n_station_ids:
        feature_df = generate_features(session=session, 
                                       input_df=session.table(clone_table_name)\
                                                       .filter(F.col('START_STATION_ID') == station), 
                                       holiday_table_name=holiday_table_name, 
                                       precip_table_name=precip_table_name)

        input_columns_str = str(' ').join(feature_df.columns).replace('\"', "")

        feature_df = feature_df.select(F.array_agg(F.array_construct(F.col('*'))).alias('input_data'), 
                                       F.lit(station).alias('station_id'),
                                       F.lit(input_columns_str).alias('input_column_names'),
                                       F.lit(target_column).alias('target_column'))  

        # str() keeps the substitution working if station IDs are not stored as text.
        station_feature_view_name = feature_view_name.replace('<station_id>', str(station))
        feature_df.createOrReplaceView(station_feature_view_name)
        feature_view_names.append(station_feature_view_name)

    return feature_view_names


def train_predict_feature_views(session, station_train_pred_udf_name, feature_view_names, pred_table_name) -> str:
    from snowflake.snowpark import functions as F
    import pandas as pd
    import ast
    
    cutpoint=365
    max_epochs=1000
    
    for view in feature_view_names:
        feature_df = session.table(view)
        output_df = feature_df.select(F.call_udf(station_train_pred_udf_name, 
                                                 'INPUT_DATA', 
                                                 'INPUT_COLUMN_NAMES', 
                                                 'TARGET_COLUMN', 
                                                 F.lit(cutpoint), 
                                                 F.lit(max_epochs))).collect()

        # Parse the UDF's text output once into rows and column names.
        rows, columns = ast.literal_eval(output_df[0][0])
        df = pd.DataFrame(data=rows, columns=columns)

        df['DATE'] = pd.to_datetime(df['DATE']).dt.date
        df['STATION_ID'] = feature_df.select('STATION_ID').collect()[0][0]

        # Append so each station's predictions accumulate in the same table.
        session.createDataFrame(df).write.mode('append').saveAsTable(pred_table_name)
    
    return pred_table_name

In [None]:
#vectorize: eval_df = eval_df.groupBy('STATION_ID')\
#                  .agg(F.array_agg(F.array_construct(F.col('COUNT'), F.col('PRED'))).alias('INPUT_DATA'))\
#                  #.filter(F.col('STATION_ID') == F.lit('426'))