# Citibike ML
In this example we use the [Citibike dataset](https://ride.citibikenyc.com/system-data). Citibike is a bicycle-sharing system in New York City. Every day, users choose from 20,000 bicycles at 1,300 stations around the city.

To ensure customer satisfaction, Citibike needs to predict how many bicycles will be needed at each station. Citibike maintenance teams check each station and repair or replace bicycles. They also relocate bicycles between stations based on predicted demand. The business needs to be able to report how many bicycles will be needed at a given station on a given day.

## End-to-End Pipeline
In this section of the demo, we consolidate all previous steps into a full, end-to-end pipeline for incremental ingest, feature engineering, training, prediction, and evaluation.

This will be integrated into **our company's orchestration framework**, but showing it all in one place will allow our DevOps team to implement it.

For this demo flow we will assume that the organization has the following **policies and processes**:
- **Dev Tools**: The ML engineer can develop in their tool of choice (e.g., VS Code, IntelliJ, PyCharm, Eclipse). Snowpark Python makes it possible to use any environment that has a Python kernel. For the sake of this demo we will use Jupyter.
- **Data Governance**: To preserve customer privacy, no data can be stored locally. The ingest system may store data temporarily, but it must be assumed that, in production, the ingest system will not preserve intermediate data products between runs. Snowpark Python allows the user to push down all operations to Snowflake and bring the code to the data.
- **Automation**: Although the ML engineer can use any IDE or notebook for development, the final product at the end of the work stream must be Python code. Well-documented, modularized code is necessary for good ML operations and to interface with the company's CI/CD and orchestration tools.
- **Compliance**: Any ML model must be traceable back to the original data set used for training. The business needs to be able to easily remove specific user data from training datasets and retrain models.

Input: Set of python functions from the Data Engineer, Data Scientist, and ML Engineer.  
Output: N/A

### 1. Load credentials and connect to Snowflake

In [4]:
!ls /code

Dockerfile	       dags			    packages.txt
README.md	       dependencies		    plugins
airflow_settings.yaml  docker-compose.override.yml  requirements.txt
citibike_ml	       include			    weather.csv
conda-env.yml	       k8s_yaml_files		    xray
creds.json	       notebooks


In [1]:
!pip3 freeze

anyio==3.5.0
argon2-cffi==21.3.0
argon2-cffi-bindings==21.2.0
asn1crypto==1.4.0
asttokens==2.0.5
attrs==21.4.0
Babel==2.9.1
backcall==0.2.0
black==21.12b0
bleach==4.1.0
certifi==2021.10.8
cffi==1.15.0
charset-normalizer==2.0.10
click==8.0.3
cloudpickle==2.0.0
cryptography==36.0.1
cycler==0.11.0
debugpy==1.5.1
decorator==5.1.1
defusedxml==0.7.1
entrypoints==0.3
executing==0.8.2
fonttools==4.29.0
idna==3.3
importlib-resources==5.4.0
ipykernel==6.7.0
ipython==8.0.1
ipython-genutils==0.2.0
ipywidgets==7.6.5
jedi==0.18.1
Jinja2==3.0.3
joblib==1.1.0
json5==0.9.6
jsonschema==4.4.0
jupyter==1.0.0
jupyter-client==7.1.2
jupyter-console==6.4.0
jupyter-core==4.9.1
jupyter-server==1.13.4
jupyterlab==3.2.8
jupyterlab-pygments==0.1.2
jupyterlab-server==2.10.3
jupyterlab-widgets==1.0.2
kiwisolver==1.3.2
MarkupSafe==2.0.1
matplotlib==3.5.1
matplotlib-inline==0.1.3
mistune==0.8.4
mypy-extensions==0.4.3
nbclassic==0.3.5
nbclient==0.5.10
nbconvert==6.4.0

In [2]:
import snowflake.snowpark as snp

from datetime import datetime
import json
import getpass
import uuid

# Read the connection settings from the credentials file in the project.
with open('/code/include/creds.json') as f:
    data = json.load(f)
    connection_parameters = {
      'account': data['account'],
      'user': data['username'],
      'password': data['password'], #getpass.getpass(),
      'role': data['role'],
      'warehouse': data['warehouse']}

# Open the Snowpark session that every later step in the pipeline uses.
session = snp.Session.builder.configs(connection_parameters).create()
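
The credential loading above can fail late if `creds.json` lacks a key. As a minimal sketch, assuming the same five keys the cell reads, the mapping could be wrapped in a helper that checks the file before a session is created. The helper name `load_connection_parameters` and the placeholder values are hypothetical and not part of the `citibike_ml` package.

```python
import json
import os
import tempfile

# Keys the notebook cell reads from creds.json.
REQUIRED_KEYS = ("account", "username", "password", "role", "warehouse")


def load_connection_parameters(path):
    """Read a creds.json-style file and return a connection parameter dict."""
    with open(path) as f:
        data = json.load(f)

    missing = [key for key in REQUIRED_KEYS if key not in data]
    if missing:
        raise KeyError("creds file is missing: " + ", ".join(missing))

    # The file stores 'username', while the connection parameters use 'user'.
    params = {key: data[key] for key in REQUIRED_KEYS if key != "username"}
    params["user"] = data["username"]
    return params


# Demonstration with a throwaway file and placeholder values (hypothetical).
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "creds.json")
    with open(demo_path, "w") as f:
        json.dump({"account": "my_account", "username": "my_user",
                   "password": "not-a-real-password", "role": "my_role",
                   "warehouse": "my_wh"}, f)
    demo_params = load_connection_parameters(demo_path)

print(sorted(demo_params))
```

The returned dictionary has the same shape as `connection_parameters` in the cell above, so it could be passed to `snp.Session.builder.configs(...)` unchanged.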

### 2. Set up the pipeline

In [3]:
project_db_name = 'CITIBIKEML_JF'
project_schema_name = 'DEMO'
project_db_schema = str(project_db_name)+'.'+str(project_schema_name)

top_n = 5

model_id = str(uuid.uuid1()).replace('-', '_')

download_base_url = 'https://s3.amazonaws.com/tripdata/'

load_table_name = str(project_db_schema)+'.'+'RAW_'
trips_table_name = str(project_db_schema)+'.'+'TRIPS'
holiday_table_name = str(project_db_schema)+'.'+'HOLIDAYS'
precip_table_name = str(project_db_schema)+'.'+'WEATHER'
model_stage_name = str(project_db_schema)+'.'+'model_stage'
clone_table_name = str(project_db_schema)+'.'+'CLONE_'+str(model_id)
feature_view_name = str(project_db_schema)+'.'+'STATION_<station_id>_VIEW_'+str(model_id)
pred_table_name = str(project_db_schema)+'.'+'PREDICTIONS_'+str(model_id)
eval_table_name = str(project_db_schema)+'.'+'EVAL_'+str(model_id)
load_stage_name = 'load_stage'

_ = session.sql('USE DATABASE ' + str(project_db_name)).collect()
_ = session.sql('USE SCHEMA ' + str(project_schema_name)).collect()

_ = session.sql('CREATE STAGE IF NOT EXISTS ' + str(model_stage_name)).collect()
_ = session.sql('CREATE OR REPLACE TEMPORARY STAGE ' + str(load_stage_name)).collect()

_ = session.sql('CREATE OR REPLACE TABLE '+str(clone_table_name)+" CLONE "+str(trips_table_name)).collect()
_ = session.sql('CREATE TAG IF NOT EXISTS model_id_tag').collect()
_ = session.sql("ALTER TABLE "+str(clone_table_name)+" SET TAG model_id_tag = '"+str(model_id)+"'").collect()
_ = session.sql('DROP TABLE IF EXISTS '+pred_table_name).collect()
_ = session.sql('DROP TABLE IF EXISTS '+eval_table_name).collect()
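
The setup cell supports the compliance policy by cloning the trips table under a name that contains the model ID and tagging the clone with that ID. The sketch below isolates that naming and tagging logic as plain string-building functions so it can be checked without a Snowflake connection. The function names are hypothetical; the SQL text mirrors the statements in the cell above.

```python
import uuid


def new_model_id():
    """Generate a model ID that is safe to embed in an unquoted table name."""
    return str(uuid.uuid1()).replace("-", "_")


def qualified_name(database, schema, object_name):
    """Build a DATABASE.SCHEMA.OBJECT name."""
    return ".".join([database, schema, object_name])


def clone_and_tag_statements(trips_table, clone_table, model_id):
    """Return the SQL that snapshots the training data and tags it with the model ID."""
    return [
        f"CREATE OR REPLACE TABLE {clone_table} CLONE {trips_table}",
        "CREATE TAG IF NOT EXISTS model_id_tag",
        f"ALTER TABLE {clone_table} SET TAG model_id_tag = '{model_id}'",
    ]


model_id = new_model_id()
trips = qualified_name("CITIBIKEML_JF", "DEMO", "TRIPS")
clone = qualified_name("CITIBIKEML_JF", "DEMO", "CLONE_" + model_id)

for statement in clone_and_tag_statements(trips, clone, model_id):
    print(statement)
```

Because the clone, prediction, and evaluation tables all carry the same ID in their names, a model can be traced back to the snapshot of the data it was trained on.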


In [4]:
import sys
sys.path.append("/code/dags/")

In [5]:
sys.path

['/code/notebooks',
 '/usr/lib/python38.zip',
 '/usr/lib/python3.8',
 '/usr/lib/python3.8/lib-dynload',
 '',
 '/usr/local/lib/python3.8/dist-packages',
 '/usr/lib/python3/dist-packages',
 '/code/dags/']

In [6]:
from citibike_ml.ingest import incremental_elt
from citibike_ml.mlops_pipeline import deploy_pred_train_udf
from citibike_ml.mlops_pipeline import materialize_holiday_weather
from citibike_ml.mlops_pipeline import generate_feature_views
from citibike_ml.mlops_pipeline import train_predict_feature_views
from citibike_ml.model_eval import deploy_eval_udf
from citibike_ml.model_eval import evaluate_station_predictions

### Incremental ELT

In [7]:
%%time 
file_name_end2 = '202102-citibike-tripdata.csv.zip'
file_name_end1 = '201402-citibike-tripdata.zip'

files_to_download = [file_name_end1, file_name_end2]

trips_table_name = incremental_elt(session=session, 
                                   load_stage_name=load_stage_name, 
                                   files_to_download=files_to_download, 
                                   download_base_url=download_base_url, 
                                   load_table_name=load_table_name, 
                                   trips_table_name=trips_table_name)

Downloading file https://s3.amazonaws.com/tripdata/201402-citibike-tripdata.zip
Gzipping file 2014-02 - Citi Bike trip data.csv
Putting file 201402-citibike-tripdata.gz to stage load_stage
Downloading file https://s3.amazonaws.com/tripdata/202102-citibike-tripdata.csv.zip
Gzipping file 202102-citibike-tripdata.csv
Putting file 202102-citibike-tripdata.csv.gz to stage load_stage
CPU times: user 11.6 s, sys: 487 ms, total: 12.1 s
Wall time: 1min 49s
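
The log above shows each file being fetched from the base URL followed by the file name. As a small sketch of that mapping (the helper `build_download_urls` is hypothetical and is not the code inside `incremental_elt`), the URLs can be derived with the standard library:

```python
from urllib.parse import urljoin


def build_download_urls(base_url, file_names):
    """Join each file name onto the base URL (which must end with '/')."""
    return [urljoin(base_url, name) for name in file_names]


download_base_url = "https://s3.amazonaws.com/tripdata/"
files_to_download = ["201402-citibike-tripdata.zip",
                     "202102-citibike-tripdata.csv.zip"]

for url in build_download_urls(download_base_url, files_to_download):
    print(url)
```

Note that the two archives follow different naming patterns (`.zip` and `.csv.zip`), so the file names are listed explicitly rather than generated from a date.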


In [10]:
%%time 
model_udf_name = deploy_pred_train_udf(session=session, 
                                       function_name='station_train_predict_func', 
                                       model_stage_name=model_stage_name,
                                       path_pytorch_tabnet="/code/include/",
                                       path_citibike_ml="/code/dags/")

CPU times: user 246 ms, sys: 0 ns, total: 246 ms
Wall time: 45.7 s


In [13]:
!pip3 install pandas



In [14]:
%%time 
holiday_table_name, precip_table_name = materialize_holiday_weather(session=session, 
                                                                   trips_table_name=trips_table_name, 
                                                                   holiday_table_name=holiday_table_name, 
                                                                   precip_table_name=precip_table_name,
                                                                   path="/code/include/")

MissingDependencyError: Missing optional dependency: pandas
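
The error above indicates that pandas was not available to the session when this step ran, even though the previous cell installs it. One likely explanation, stated here as an assumption, is that the kernel had already imported the Snowflake libraries before pandas was installed, in which case restarting the kernel after the install may be needed. A minimal sketch for checking optional dependencies up front, using only the standard library (the helper name `missing_packages` is hypothetical):

```python
import importlib.util


def missing_packages(names):
    """Return the top-level packages from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# 'pandas' is the dependency named in the error message above.
not_installed = missing_packages(["pandas"])
if not_installed:
    print("Install before running the pipeline:", ", ".join(not_installed))
else:
    print("All optional dependencies found.")
```

Running a check like this at the top of the notebook surfaces a missing package before any long-running step starts.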

In [None]:
%%time 
feature_view_names = generate_feature_views(session=session, 
                                            clone_table_name=clone_table_name, 
                                            feature_view_name=feature_view_name, 
                                            holiday_table_name=holiday_table_name, 
                                            precip_table_name=precip_table_name,
                                            target_column='COUNT', 
                                            top_n=top_n)

In [None]:
%%time 
pred_table_name = train_predict_feature_views(session=session, 
                                              station_train_pred_udf_name=model_udf_name, 
                                              feature_view_names=feature_view_names, 
                                              pred_table_name=pred_table_name)

In [None]:
%%time
eval_model_udf_name = deploy_eval_udf(session=session, 
                                      function_name='eval_model_output_func', 
                                      model_stage_name=model_stage_name)

In [None]:
%%time
eval_table_name = evaluate_station_predictions(session=session, 
                                               pred_table_name=pred_table_name, 
                                               eval_model_udf_name=eval_model_udf_name, 
                                               eval_table_name=eval_table_name)

In [None]:
session.table(feature_view_names[0]).show()

In [None]:
session.table(pred_table_name).show()

In [None]:
session.table(eval_table_name).show()