# Citibike ML
In this example we use the [Citibike dataset](https://ride.citibikenyc.com/system-data). Citibike is a bicycle sharing system in New York City. Every day, users choose from 20,000 bicycles at 1,300 stations around the city.

To ensure customer satisfaction, Citibike needs to predict how many bicycles will be needed at each station. Maintenance teams check each station and repair or replace bicycles. Additionally, the teams relocate bicycles between stations based on predicted demand. The business needs to be able to run reports on how many bicycles will be needed at a given station on a given day.

## End-to-End Pipeline
In this section of the demo, we consolidate all previous steps into a full, end-to-end pipeline for incremental ingest, feature engineering, training, prediction, and evaluation.

This will be integrated into **our company's orchestration framework**, but showing it all in one place will allow our DevOps team to implement it.

For this demo flow we will assume that the organization has the following **policies and processes**:
- **Dev Tools**: The ML engineer can develop in their tool of choice (e.g. VS Code, IntelliJ, PyCharm, Eclipse, etc.). Snowpark Python makes it possible to use any environment that has a Python kernel. For the sake of the demo we will use Jupyter.
- **Data Governance**: To preserve customer privacy, no data can be stored locally. The ingest system may store data temporarily, but it must be assumed that, in production, the ingest system will not preserve intermediate data products between runs. Snowpark Python allows the user to push down all operations to Snowflake and bring the code to the data.
- **Automation**: Although the ML engineer can use any IDE or notebook for development, the final product at the end of the work stream must be Python code. Well-documented, modularized code is necessary for good ML operations and to interface with the company's CI/CD and orchestration tools.
- **Compliance**: Any ML model must be traceable back to the original data set used for training. The business needs to be able to easily remove specific user data from training datasets and retrain models.
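
The compliance policy is what motivates the naming scheme used in the pipeline code of this notebook: a single `model_id` is stamped onto the cloned training table, the feature table, and the prediction table. A minimal sketch of that convention, using only the standard library, might look like this. The helper name `derive_artifact_names` is hypothetical and not part of the demo's modules; the notebook builds these strings inline.

```python
import uuid


def derive_artifact_names(model_id: str, trips_table_name: str = "TRIPS") -> dict:
    """Build the per-run object names and tagging SQL from one model id.

    Hypothetical helper for illustration only; the notebook builds these
    strings inline inside its task functions.
    """
    clone_table_name = "TRIPS_CLONE_" + model_id
    return {
        "clone_table_name": clone_table_name,
        "feature_table_name": "TRIPS_FEATURES_" + model_id,
        "pred_table_name": "PRED_" + model_id,
        "eval_table_name": "EVAL_" + model_id,
        "clone_sql": f"CREATE OR REPLACE TABLE {clone_table_name} CLONE {trips_table_name}",
        "tag_sql": f"ALTER TABLE {clone_table_name} SET TAG model_id_tag = '{model_id}'",
    }


# Same id format as the pipeline: a UUID with hyphens replaced by underscores,
# so it can be embedded in an unquoted table name.
model_id = str(uuid.uuid1()).replace("-", "_")
names = derive_artifact_names(model_id)
for key, value in names.items():
    print(f"{key}: {value}")
```

Because every object name contains the same id, a model can be traced back to the exact clone of the trips table it was trained on.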

Input: Set of Python functions from the Data Engineer, Data Scientist, and ML Engineer.  
Output: N/A
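
Each task in the pipeline takes the shared `state_dict`, adds its outputs, and returns it, so the task order is simply a chain of function calls. The following sketch shows that pattern with stub tasks and no Snowflake connection. The stub names, the placeholder dates, and the `run_pipeline` helper are assumptions for illustration only.

```python
from functools import reduce
from typing import Callable, Dict, List

State = Dict[str, str]


def setup_task(state: State) -> State:
    # Stand-in for snowpark_database_setup; the dates are placeholders.
    state.update({"start_date": "<start>", "end_date": "<end>"})
    return state


def feature_task(state: State) -> State:
    # Stand-in for generate_feature_table_task: derive a name from model_id.
    state["feature_table_name"] = "TRIPS_FEATURES_" + state["model_id"]
    return state


def predict_task(state: State) -> State:
    # Depends on the output of feature_task, so it must run after it.
    if "feature_table_name" not in state:
        raise KeyError("feature_task must run before predict_task")
    state["pred_table_name"] = "PRED_" + state["model_id"]
    return state


def run_pipeline(tasks: List[Callable[[State], State]], state: State) -> State:
    # Apply each task in order, threading the state through.
    return reduce(lambda acc, task: task(acc), tasks, state)


final_state = run_pipeline(
    [setup_task, feature_task, predict_task],
    {"model_id": "demo_id"},
)
print(final_state)
```

An orchestration framework could map each function to one task and pass the dictionary between them, which is why every task has the same signature.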

In [None]:
files_to_download = ['202003-citibike-tripdata.csv.zip']

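def snowpark_citibike_ml_taskflow(files_to_download:list):
    """Run the end-to-end pipeline and return the final state dictionary.

    Each nested task opens its own Snowpark session, records its outputs in
    state_dict, and closes the session before returning.
    """
    from snowpark_connection import snowpark_connect

    import uuid
    
    # model_id ties every table created in this run back to its training data.
    state_dict = {
        "download_base_url":"https://s3.amazonaws.com/tripdata/",
        "load_table_name":"RAW_",
        "trips_table_name":"TRIPS",
        "load_stage_name":"LOAD_STAGE",
        "model_stage_name":"MODEL_STAGE",
        "model_id": str(uuid.uuid1()).replace('-', '_')
    }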
    
    def snowpark_database_setup(state_dict:dict)-> dict: 
        import snowflake.snowpark.functions as F
        session, compute_parameters = snowpark_connect()
        start_date, end_date = session.table(state_dict['trips_table_name']) \
                              .select(F.min('STARTTIME'), F.max('STARTTIME')).collect()[0][0:2]
        state_dict.update({"start_date":start_date})
        state_dict.update({"end_date":end_date})
        
        _ = session.sql('CREATE STAGE IF NOT EXISTS ' + state_dict['model_stage_name']).collect()
        _ = session.sql('CREATE STAGE IF NOT EXISTS ' + state_dict['load_stage_name']).collect()
        
        session.close()

        return state_dict
    
    def incremental_elt_task(state_dict: dict, files_to_download:list)-> dict:
        from ingest import incremental_elt

        session, compute_parameters = snowpark_connect()
        _ = session.sql('USE WAREHOUSE '+compute_parameters['load_warehouse']).collect()

        print('Ingesting '+str(files_to_download))
        _ = incremental_elt(session=session, 
                            load_stage_name=state_dict['load_stage_name'], 
                            files_to_download=files_to_download, 
                            download_base_url=state_dict['download_base_url'], 
                            load_table_name=state_dict['load_table_name'], 
                            trips_table_name=state_dict['trips_table_name']
                            )
        
        session.close()
        return state_dict
    
    def deploy_model_udf_task(state_dict:dict)-> dict:
        from mlops_pipeline import deploy_pred_train_udf
        print('Deploying station model')
        session, compute_parameters = snowpark_connect()
        model_udf_name = deploy_pred_train_udf(session=session, 
                                               model_stage_name=state_dict['model_stage_name']
                                              )
                
        state_dict.update({"model_udf_name":model_udf_name})

        session.close()
        return state_dict

    def materialize_holiday_task(state_dict: dict)-> dict:
        from mlops_pipeline import materialize_holiday_table
        print('Materializing holiday table')
        session, compute_parameters = snowpark_connect()
        
        holiday_table_name = materialize_holiday_table(session=session,
                                                       start_date=state_dict['start_date'], 
                                                       end_date=state_dict['end_date'], 
                                                       holiday_table_name='holidays'
                                                      )
        
        state_dict.update({"holiday_table_name":holiday_table_name})

        session.close()
        return state_dict

    def materialize_precip_task(state_dict: dict)-> dict:
        from mlops_pipeline import materialize_precip_table
        print('Materializing weather table')
        
        session, compute_parameters = snowpark_connect()
        
        precip_table_name = materialize_precip_table(session=session,
                                                     start_date=state_dict['start_date'], 
                                                     end_date=state_dict['end_date'], 
                                                     precip_table_name='weather'
                                                    )
        
        state_dict.update({"precip_table_name":precip_table_name})

        session.close()
        return state_dict

    def generate_feature_table_task(state_dict:dict)-> dict: 
        from parallel_udf import generate_feature_table
        print('Generating feature table for all stations.')
        session, compute_parameters = snowpark_connect()
        
        _ = session.sql('USE WAREHOUSE '+compute_parameters['fe_warehouse']).collect()

        clone_table_name = 'TRIPS_CLONE_'+state_dict["model_id"]
        state_dict.update({"clone_table_name":clone_table_name})
        
        _ = session.sql('CREATE OR REPLACE TABLE '+clone_table_name+" CLONE "+state_dict["trips_table_name"]).collect()
        _ = session.sql('CREATE TAG IF NOT EXISTS model_id_tag').collect()
        _ = session.sql("ALTER TABLE "+clone_table_name+" SET TAG model_id_tag = '"+state_dict["model_id"]+"'").collect()
        
        feature_table_name = generate_feature_table(session=session, 
                                                    clone_table_name=state_dict["clone_table_name"], 
                                                    feature_table_name='TRIPS_FEATURES_'+state_dict["model_id"], 
                                                    holiday_table_name=state_dict["holiday_table_name"],
                                                    precip_table_name=state_dict["precip_table_name"]
                                                   )
        state_dict.update({"feature_table_name":feature_table_name})

        session.close()
        return state_dict
    
    def bulk_train_predict_task(state_dict:dict)-> dict: 
        from parallel_udf import train_predict_feature_table
        print('Running bulk training and inference on feature table')
        session, compute_parameters = snowpark_connect()
        
        _ = session.sql('USE WAREHOUSE '+compute_parameters['train_warehouse']).collect()
        
        pred_table_name = train_predict_feature_table(session=session, 
                                                      station_train_pred_udf_name=state_dict["model_udf_name"], 
                                                      feature_table_name=state_dict["feature_table_name"], 
                                                      pred_table_name='PRED_'+state_dict["model_id"]
                                                     )
        
        state_dict.update({"pred_table_name":pred_table_name})
        session.close()
        return state_dict
    
    def deploy_eval_udf_task(state_dict:dict)-> dict:
        from model_eval import deploy_eval_udf
        print('Deploying udf for model evaluation.')
        session, compute_parameters = snowpark_connect()
        eval_model_udf_name = deploy_eval_udf(session=session, 
                                              model_stage_name=state_dict['model_stage_name']
                                              )
                
        state_dict.update({"eval_model_udf_name":eval_model_udf_name})

        session.close()
        return state_dict

    def eval_station_preds_task(state_dict:dict)-> dict:
        from model_eval import evaluate_station_predictions
        print('Running eval UDF for model output')
        session, compute_parameters = snowpark_connect()
        
        _ = session.sql('USE WAREHOUSE '+compute_parameters['fe_warehouse']).collect()

        eval_table_name = evaluate_station_predictions(session=session, 
                                                       pred_table_name=state_dict['pred_table_name'],
                                                       eval_model_udf_name=state_dict['eval_model_udf_name'],
                                                       eval_table_name='EVAL_'+state_dict["model_id"]
                                                       )
        state_dict.update({"eval_table_name":eval_table_name})

        session.close()
        return state_dict
    
    #Task order
    state_dict = snowpark_database_setup(state_dict)
    #state_dict = incremental_elt_task(state_dict, files_to_download)
    
    state_dict = deploy_model_udf_task(state_dict)
    #for testing
    #state_dict.update({"model_udf_name":'station_train_predict_udf'})
    
    state_dict = materialize_holiday_task(state_dict)
    state_dict = materialize_precip_task(state_dict)
    #for testing: these overrides replace the table names returned by the two tasks above
    state_dict.update({"holiday_table_name":'HOLIDAYS'})
    state_dict.update({"precip_table_name":'WEATHER'})
    
    state_dict = generate_feature_table_task(state_dict) 
    #for testing
    #state_dict.update({"feature_table_name":'TRIPS_FEATURES_6BFB8E62_811A_11EC_8C7C_ACDE48001122'})
    #state_dict.update({"model_id":'6BFB8E62_811A_11EC_8C7C_ACDE48001122'})
    
    state_dict = bulk_train_predict_task(state_dict)
    #for testing
    #state_dict.update({"pred_table_name":'PRED_6BFB8E62_811A_11EC_8C7C_ACDE48001122'})

    #state_dict = deploy_eval_udf_task(state_dict)
    #state_dict.update({"eval_model_udf_name":'eval_model_output_udf'})

    #state_dict = eval_station_preds_task(state_dict)        

    return state_dict
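
The task functions build SQL by string concatenation, which is fine for names generated inside the pipeline but worth guarding if names ever come from configuration. Below is a small sketch of a check for unquoted Snowflake identifiers (a letter or underscore followed by letters, digits, underscores, or dollar signs). `safe_identifier` is a hypothetical helper, not part of Snowpark or of this demo's modules.

```python
import re

# Unquoted identifier: starts with a letter or underscore, then
# letters, digits, underscores, or dollar signs.
_UNQUOTED_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_$]*")


def safe_identifier(name: str) -> str:
    """Return name unchanged if it is a plain unquoted identifier, else raise."""
    if not _UNQUOTED_IDENTIFIER.fullmatch(name):
        raise ValueError(f"Unsafe identifier: {name!r}")
    return name


stage_sql = "CREATE STAGE IF NOT EXISTS " + safe_identifier("MODEL_STAGE")
print(stage_sql)
```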


In [None]:
from bulk_load_internal import bulk_load
#bulk_load()

In [None]:
%%time 
state_dict = snowpark_citibike_ml_taskflow(files_to_download)

In [None]:
state_dict

In [None]:
from snowpark_connection import snowpark_connect

session, compute_parameters = snowpark_connect('creds.json')
#session.table(state_dict['eval_table_name']).count()

In [None]:
#session.table(state_dict['eval_table_name']).show()

In [None]:
session.close()

In [None]:
from snowpark_connection import snowpark_connect
from snowflake.snowpark import Window
from snowflake.snowpark import functions as F
from snowflake.snowpark import udf
import ast
session, compute_parameters = snowpark_connect('creds.json')

In [None]:
clone_df = session.table(state_dict['clone_table_name']) #.filter(F.col('START_STATION_ID') == '3631')
feature_df = session.table(state_dict['feature_table_name']) #.filter(F.col('STATION_ID') == '519')
pred_df = session.table(state_dict['pred_table_name']) #.filter(F.col('START_STATION_ID') == '3631')
holiday_df = session.table(state_dict['holiday_table_name'])
precip_df = session.table(state_dict['precip_table_name'])

In [None]:
pred_df.filter(F.col('PRED') == 'NULL').select('STATION_ID').distinct().show()

In [None]:
pred_df.filter(F.col('STATION_ID') == '3668').count()

In [None]:
# output_list = feature_df.select('STATION_ID', F.call_udf('station_train_predict_udf', 
#                                                          'INPUT_DATA', 
#                                                           'INPUT_COLUMN_LIST', 
#                                                           'TARGET_COLUMN', 
#                                                           F.lit(10)).alias('OUTPUT_DATA')).collect()

In [None]:
# import ast
# import pandas as pd

# for row in range(len(output_list)):
#     tempdf = pd.DataFrame(data = ast.literal_eval(output_list[row]['OUTPUT_DATA'])[0], 
#                                 columns=ast.literal_eval(output_list[row]['OUTPUT_DATA'])[1]
#                                 )
#     tempdf['STATION_ID'] = str(output_list[row]['STATION_ID'])
#     print(tempdf.head())
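
The scratch cell above assumes each UDF result is a string holding a two-element list: the row data followed by the column names. A standard-library sketch of that parsing step follows. The payload is made up for illustration (the values and the column subset are assumptions, not real output).

```python
import ast

# Hypothetical OUTPUT_DATA string in the form [rows, column_names].
output_data = (
    "[[['2020-03-01', 42, 40.5], ['2020-03-02', 37, 38.0]], "
    "['DATE', 'COUNT', 'PRED']]"
)

# literal_eval parses Python literals only, so it will not execute code
# embedded in the string.
rows, columns = ast.literal_eval(output_data)

# Pair each row with the column names to get one record per day.
records = [dict(zip(columns, row)) for row in rows]
for record in records:
    print(record)
```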

In [None]:
# window = Window.partitionBy(F.col('STATION_ID')).orderBy(F.col('DATE').asc())

# feature_df = clone_df.select(F.to_date(F.col('STARTTIME')).alias('DATE'),
#                              F.col('START_STATION_ID').alias('STATION_ID'))\
#                      .groupBy(F.col('STATION_ID'), F.col('DATE'))\
#                         .count()\
#                      .withColumn('LAG_1', F.lag(F.col('COUNT'), offset=1, default_value=None).over(window))\
#                      .withColumn('LAG_7', F.lag(F.col('COUNT'), offset=7, default_value=None).over(window))\
#                         .na.drop()\
#                      .join(holiday_df, 'DATE', join_type='left').na.fill({'HOLIDAY':0})\
#                      .join(precip_df, 'DATE', 'inner')

# feature_column_list = feature_df.columns
# feature_column_list.remove('\"STATION_ID\"')
# feature_column_list = [f.replace('\"', "") for f in feature_column_list]
# feature_column_array = F.array_construct(*[F.lit(x) for x in feature_column_list])

# feature_df_stuffed = feature_df.groupBy(F.col('STATION_ID'))\
#                                .agg(F.array_agg(F.array_construct(*feature_column_list)).alias('INPUT_DATA'))\
#                                .withColumn('INPUT_COLUMN_LIST', feature_column_array)\
#                                .withColumn('TARGET_COLUMN', F.lit('COUNT'))

In [None]:
# feature_df.count()

In [None]:
# feature_df_stuffed.show()

In [None]:
# input_data = ast.literal_eval(feature_df_stuffed.select('INPUT_DATA').collect()[0][0])
# len(input_data)

In [None]:
# input_data2 = ast.literal_eval(session.table(state_dict['feature_table_name']).filter(F.col('STATION_ID') == '3631').select('INPUT_DATA').collect()[0][0])
# len(input_data2)

In [None]:
# feature_df = session.table(state_dict['feature_table_name']).filter(F.col('STATION_ID') == '290')

# import ast
# import pandas as pd
# input_data = ast.literal_eval(feature_df.select('INPUT_DATA').collect()[0][0])
# input_columns = ast.literal_eval(feature_df.select('INPUT_COLUMN_LIST').collect()[0][0])
# target_column = feature_df.select('TARGET_COLUMN').collect()[0][0]
# station_id = feature_df.select('STATION_ID').collect()[0][0]
# max_epochs=10

# df = pd.DataFrame(input_data, columns = input_columns)

# if len(df) < 365*2:
#         df['PRED'] = 'NULL'
# else:
#     print('big')
#     feature_columns = input_columns.copy()
#     feature_columns.remove('DATE')
#     feature_columns.remove(target_column)
#     print(feature_columns)
    
#     from torch import tensor
#     from pytorch_tabnet.tab_model import TabNetRegressor

#     model = TabNetRegressor()

#     #cutpoint = round(len(df)*(train_valid_split/100))
#     cutpoint = 365

#     ##NOTE: in order to do train/valid split on time-based portion the input data must be sorted by date    
#     df['DATE'] = pd.to_datetime(df['DATE'])
#     df = df.sort_values(by='DATE', ascending=True)

#     y_valid = df[target_column][-cutpoint:].values.reshape(-1, 1)
#     X_valid = df[feature_columns][-cutpoint:].values
#     y_train = df[target_column][:-cutpoint].values.reshape(-1, 1)
#     X_train = df[feature_columns][:-cutpoint].values
#     print(station_id, y_valid.shape, X_valid.shape, y_train.shape, X_train.shape)

#     model.fit(
#         X_train, y_train,
#         eval_set=[(X_valid, y_valid)],
#         max_epochs=max_epochs,
#         patience=100,
#         batch_size=1024, 
#         virtual_batch_size=128,
#         num_workers=0,
#         drop_last=False)


#     df['PRED'] = model.predict(tensor(df[feature_columns].values))
#     df['DATE'] = df['DATE'].dt.strftime('%Y-%m-%d')
#     df = pd.concat([df, pd.DataFrame(model.explain(df[feature_columns].values)[0], 
#                            columns = feature_columns).add_prefix('EXPL_').round(2)], axis=1)
    
# from station_train_predict import station_train_predict_func as stpf
# output_list = stpf(station_id=station_id,
#                                input_data=input_data,
#                                input_columns_str=input_columns_str,
#                                target_column=target_column,
#                                train_valid_split=train_valid_split,
#                                max_epochs=max_epochs)

# output_list

# print(df.head())
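
The commented-out experiment above holds out the most recent rows for validation, which only works if the rows are sorted by date first. A dependency-free sketch of that split is shown here. The function name `time_based_split` and the toy data are assumptions for illustration.

```python
from datetime import date, timedelta
from typing import List, Tuple

Row = Tuple[date, int]


def time_based_split(rows: List[Row], cutpoint: int) -> Tuple[List[Row], List[Row]]:
    """Sort rows by date and hold out the last `cutpoint` rows for validation."""
    if not 0 < cutpoint < len(rows):
        raise ValueError("cutpoint must be between 1 and len(rows) - 1")
    ordered = sorted(rows, key=lambda row: row[0])
    return ordered[:-cutpoint], ordered[-cutpoint:]


# Toy data: 10 daily counts supplied in reverse chronological order.
start = date(2020, 1, 1)
rows = [(start + timedelta(days=i), 100 + i) for i in reversed(range(10))]

train, valid = time_based_split(rows, cutpoint=3)
print(len(train), len(valid))
print(valid[0][0], valid[-1][0])
```

Sorting before slicing guarantees that every validation row is later than every training row, so the model is never evaluated on days that precede its training data.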

In [None]:
# #show how many rows are stuffed
# import ast

# feature_df2 = session.table(state_dict['feature_table_name'])

# station_list = list(feature_df2.select('STATION_ID').toPandas()['STATION_ID'].values)
# for station in station_list:
#     input_data = ast.literal_eval(feature_df2.filter(F.col('STATION_ID') == station).select('INPUT_DATA').collect()[0][0])
#     print(station, len(input_data))

In [None]:
# clone_df.select('STARTTIME', 'START_STATION_ID')\
#       .withColumn('DATE', F.call_builtin('DATE_TRUNC', (agg_period, F.col('STARTTIME'))))\
#       .join(holiday_df, 'DATE', join_type='left').na.fill({'HOLIDAY':0})\
#       .join(precip_df, 'DATE', 'inner')\
#       .groupBy(F.col('DATE'), F.col('START_STATION_ID'))\
#         .count()\
#       .groupBy(F.col('START_STATION_ID'))\
#         .count()\
#       .sort('COUNT', ascending=False)\
#       .show()

In [None]:
# output_list = feature_df.select(F.call_udf('station_train_predict_udf', 
#                                           'STATION_ID',
#                                           'INPUT_DATA', 
#                                           'INPUT_COLUMN_NAMES', 
#                                           'TARGET_COLUMN', 
#                                           F.lit(1), 
#                                           F.lit(10)).alias('OUTPUT_DATA')).collect()

In [None]:
#session.sql('USE WAREHOUSE XXXX4L').collect()

In [None]:
# output_list = feature_df.select(F.call_udf('station_train_predict_udf', 
#                                           'STATION_ID',
#                                           'INPUT_DATA', 
#                                           'INPUT_COLUMN_NAMES', 
#                                           'TARGET_COLUMN', 
#                                           F.lit(1), 
#                                           F.lit(10)).alias('OUTPUT_DATA')).collect()

In [None]:
#session.sql('USE WAREHOUSE load_wh').collect()

In [None]:
#feature_df.select(F.col('STATION_ID').alias('START_STATION_ID')).sort('START_STATION_ID').show()

In [None]:
# clone_df.select('STARTTIME', 'START_STATION_ID')\
#       .withColumn('DATE', F.call_builtin('DATE_TRUNC', (agg_period, F.col('STARTTIME'))))\
#       .join(holiday_df, 'DATE', join_type='left').na.fill({'HOLIDAY':0})\
#       .join(precip_df, 'DATE', 'inner')\
#       .groupBy(F.col('DATE'), F.col('START_STATION_ID'))\
#         .count()\
#       .groupBy(F.col('START_STATION_ID'))\
#         .count()\
#       .join(feature_df.select(F.col('STATION_ID').alias('START_STATION_ID')), 'START_STATION_ID')\
#       .sort('START_STATION_ID', ascending=True)\
#       .show(100)

In [None]:
#Test predict func
# import ast
# import pandas as pd
# input_data = ast.literal_eval(feature_df.limit(1).select('INPUT_DATA').collect()[0][0])
# station_id = ast.literal_eval(feature_df.limit(1).select('STATION_ID').collect()[0][0])
# input_columns_str = feature_df.limit(1).select('INPUT_COLUMN_NAMES').collect()[0][0]
# target_column = feature_df.limit(1).select('TARGET_COLUMN').collect()[0][0]
# train_valid_split=20
# max_epochs=10

# input_columns = input_columns_str.split(' ')
# feature_columns = input_columns.copy()
# feature_columns.remove('DATE')
# feature_columns.remove(target_column)

# df = pd.DataFrame(input_data, columns = input_columns)
# df['DATE'] = pd.to_datetime(df['DATE'])
# df = df.sort_values(by='DATE', ascending=True)
# df['DATE'] = df['DATE'].dt.strftime('%Y-%m-%d')
# cutpoint = round(len(df)*(train_valid_split/100))
# y_valid = df[target_column][-cutpoint:].values.reshape(-1, 1)
# X_valid = df[feature_columns][-cutpoint:].values
# y_train = df[target_column][:-cutpoint].values.reshape(-1, 1)
# X_train = df[feature_columns][:-cutpoint].values

# from station_train_predict import station_train_predict_func as stpf
# output_list = stpf(station_id=station_id,
#                                input_data=input_data,
#                                input_columns_str=input_columns_str,
#                                target_column=target_column,
#                                train_valid_split=train_valid_split,
#                                max_epochs=max_epochs)

# output_list

In [None]:
# session.sql('USE WAREHOUSE '+compute_parameters['fe_warehouse']).collect()

# output_list = session.table('TRIPS_FEATURES_6BFB8E62_811A_11EC_8C7C_ACDE48001122')\
#                        .select(F.call_udf('station_train_predict_udf', 
#                                           'STATION_ID',
#                                           'INPUT_DATA', 
#                                           'INPUT_COLUMN_NAMES', 
#                                           'TARGET_COLUMN', 
#                                           F.lit(train_valid_split), 
#                                           F.lit(max_epochs)).alias('OUTPUT_DATA')).collect()

# output_list

In [None]:
# ast.literal_eval(output_list[row]['OUTPUT_DATA'])

In [None]:
# df = pd.DataFrame()

# for row in range(len(output_list)):
#     df = pd.concat([df, 
#                     pd.DataFrame(data = ast.literal_eval(output_list[row]['OUTPUT_DATA'])[0], 
#                                 columns=ast.literal_eval(output_list[row]['OUTPUT_DATA'])[1]
#                                 )
#                    ], 
#                    axis=0)
# df

In [None]:
# station_list = list(feature_df.select('STATION_ID').toPandas()['STATION_ID'].values)
# for station in station_list:
#     input_data = ast.literal_eval(feature_df.filter(F.col('STATION_ID') == station).select('INPUT_DATA').collect()[0][0])
#     print(stpf(station_id=station,
#                                input_data=input_data,
#                                input_columns_str=input_columns_str,
#                                target_column=target_column,
#                                train_valid_split=train_valid_split,
#                                max_epochs=max_epochs))

In [None]:
# station_list = list(feature_df.select('STATION_ID').toPandas()['STATION_ID'].values)
# for station in station_list:
#     input_data = ast.literal_eval(feature_df.filter(F.col('STATION_ID') == station).select('INPUT_DATA').collect()[0][0])

#     print(station+' '+str(len(input_data)))
    