## Sacred for Repeated Experiments
- [Sacred GitHub](https://github.com/IDSIA/sacred)
- What is Sacred?
  - Sacred is a tool developed at IDSIA to help configure, organize, log, and reproduce experiments
  - It saves and manages settings as machine learning modeling proceeds
  - Why do we need it?
    - Questions that come up constantly, for example in Kaggle competitions
      - Which features were used?
      - Which parameters were used?
      - What was the result?
    - We need a tool that lets us run many experiments quickly and records them automatically, so nothing has to be logged by hand.
  - What if we implemented this from scratch?
    - Define a Logger and store features, parameters, etc. in it whenever necessary.
    - Parse the Logger output to view the values.
    - Sacred provides this workflow through decorators, so it is easy to adopt.
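
The from-scratch approach described above might look like the following minimal sketch. The ExperimentLogger class, its method names, and the example values are all hypothetical; they only illustrate the store-then-parse pattern and say nothing about how Sacred is implemented internally.

```python
import json
import tempfile
from pathlib import Path


class ExperimentLogger:
    """Hypothetical hand-rolled logger: one JSON file per experiment."""

    def __init__(self, log_dir):
        self.log_dir = Path(log_dir)
        self.log_dir.mkdir(parents=True, exist_ok=True)

    def save(self, exp_id, features, params, result):
        # Store what is needed to answer: which features, which parameters, what result?
        record = {"features": features, "params": params, "result": result}
        path = self.log_dir / f"{exp_id}.json"
        path.write_text(json.dumps(record, indent=2))
        return path

    def load(self, exp_id):
        # Parse the stored record back into a dict.
        return json.loads((self.log_dir / f"{exp_id}.json").read_text())


# Made-up values, written to a temporary directory for illustration only.
logger = ExperimentLogger(tempfile.mkdtemp())
logger.save(1, features=["hour", "weekday"], params={"C": 1.0}, result={"mae": 12.3})
record = logger.load(1)
print(record["params"])
```

Every experiment has to call save by hand in this version, which is exactly the bookkeeping that a decorator-based tool can take over.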

- Main mechanisms of Sacred
  - ConfigScopes: the @ex.config decorator turns a function's local variables into the experiment configuration
  - Config Injection: every captured function can access the configuration values
  - Command-line interface: parameters can be changed from the command line when running an experiment
  - Observers: all information about an experiment is passed to observers and stored (MongoDB, S3, etc.). This project simply saves it locally.
  - Automatic seeding: helps control randomness in an experiment

In [35]:
!pip3 install sacred

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


### Example of Sacred on the official website

In [37]:
from IPython.display import set_matplotlib_formats

set_matplotlib_formats('retina')

In [69]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn.linear_model import LinearRegression
import seaborn as sns
import warnings
import numpy as np
import matplotlib.pyplot as plt
from ipywidgets import interact
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error
import os
import json
from numpy.random import permutation
from sklearn import svm, datasets
from sacred import Experiment
from sacred.observers import FileStorageObserver
from IPython.display import set_matplotlib_formats

warnings.filterwarnings('ignore')

set_matplotlib_formats('retina')

In [None]:
from google.colab import auth
auth.authenticate_user()

In [None]:
# Replace the project value below with your own BigQuery project ID
from google.cloud import bigquery
client = bigquery.Client(project='nyctaxi-demand-forecast')

In [None]:
# In a notebook, interactive=True is required; it is not needed when running as a script.
ex = Experiment('iris_rbf_svm', interactive=True)
ex.observers.append(FileStorageObserver.create('my_runs'))

# Define the config with a decorator
@ex.config
def cfg():
    C = 1.0
    gamma = 0.7

# Register the entry point with ex.main; the values defined in cfg are injected automatically as arguments.
# In a notebook, use ex.main; in a script file, use ex.automain.

@ex.main
def run(C, gamma):
    iris = datasets.load_iris()
    per = permutation(iris.target.size)
    iris.data = iris.data[per]
    iris.target = iris.target[per]
    clf = svm.SVC(C=C, kernel='rbf', gamma=gamma)
    clf.fit(iris.data[:90],
            iris.target[:90])
    return clf.score(iris.data[90:],
                     iris.target[90:])

In [None]:
run_result = ex.run()

In [None]:
run_result.config

{'C': 1.0, 'gamma': 0.7, 'seed': 968624907}

In [None]:
run_result.result

0.9833333333333333

In [None]:
run_result.experiment_info

{'name': 'iris_rbf_svm',
 'base_dir': '/content',
 'sources': [],
 'dependencies': ['google-colab==1.0.0',
  'ipython==7.9.0',
  'ipywidgets==7.7.1',
  'matplotlib==3.7.1',
  'numpy==1.22.4',
  'pandas==1.4.4',
  'scikit-learn==1.2.2'],
 'repositories': [],
 'mainfile': None}

### Integrate into existing projects

In [None]:
ex = Experiment('nyc-demand-prediction', interactive=True)

# Create the experiment_dir folder if it does not exist, then store runs there with FileStorageObserver
experiment_dir = os.path.join('./', 'experiments')
if not os.path.isdir(experiment_dir): 
    os.makedirs(experiment_dir)
ex.observers.append(FileStorageObserver.create(experiment_dir))

### Pre-Processing


In [None]:
%%time
base_query = """
WITH base_data AS 
(
  SELECT nyc_taxi.*, gis.* EXCEPT (zip_code_geom)
  FROM (
    SELECT *
    FROM `bigquery-public-data.new_york.tlc_yellow_trips_2015`
    WHERE 
        EXTRACT(MONTH from pickup_datetime) = 1
        and pickup_latitude  <= 90 and pickup_latitude >= -90
    ) AS nyc_taxi
  JOIN (
    SELECT zip_code, state_code, state_name, city, county, zip_code_geom
    FROM `bigquery-public-data.geo_us_boundaries.zip_codes`
    WHERE state_code='NY'
    ) AS gis 
  ON ST_CONTAINS(zip_code_geom, st_geogpoint(pickup_longitude, pickup_latitude))
)

SELECT 
    zip_code,
    DATETIME_TRUNC(pickup_datetime, hour) as pickup_hour,
    EXTRACT(MONTH FROM pickup_datetime) AS month,
    EXTRACT(DAY FROM pickup_datetime) AS day,
    CAST(format_datetime('%u', pickup_datetime) AS INT64) -1 AS weekday,
    EXTRACT(HOUR FROM pickup_datetime) AS hour,
    CASE WHEN CAST(FORMAT_DATETIME('%u', pickup_datetime) AS INT64) IN (6, 7) THEN 1 ELSE 0 END AS is_weekend,
    COUNT(*) AS cnt
FROM base_data 
GROUP BY zip_code, pickup_hour, month, day, weekday, hour, is_weekend
ORDER BY pickup_hour
"""

base_df = client.query(base_query).to_dataframe()

CPU times: user 88.9 ms, sys: 32.4 ms, total: 121 ms
Wall time: 1.79 s
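
The weekday and is_weekend columns in the query rely on the '%u' format, which numbers days from 1 (Monday) to 7 (Sunday). As a rough cross-check, the same logic can be sketched in plain Python. The helper name weekday_features is hypothetical and is not part of the notebook's pipeline.

```python
from datetime import datetime


def weekday_features(ts):
    """Mirror the weekday and is_weekend columns of the query for one timestamp."""
    iso_day = ts.isoweekday()                    # 1 = Monday ... 7 = Sunday, like '%u'
    weekday = iso_day - 1                        # 0 = Monday ... 6 = Sunday
    is_weekend = 1 if iso_day in (6, 7) else 0   # Saturday or Sunday
    return weekday, is_weekend


print(weekday_features(datetime(2015, 1, 1, 0)))   # Thursday -> (3, 0)
print(weekday_features(datetime(2015, 1, 3, 12)))  # Saturday -> (5, 1)
```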


### Feature Engineering

In [42]:
enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(base_df[['zip_code']])
ohe_output = enc.transform(base_df[['zip_code']]).toarray()
ohe_df = pd.concat([base_df, pd.DataFrame(ohe_output, columns='zip_code_'+enc.categories_[0])], axis=1)
ohe_df['log_cnt'] = np.log10(ohe_df['cnt'])
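
The log_cnt column is a base-10 log of the count. If a model were trained on it, predictions would need to be mapped back to the count scale with a power of 10. The short sketch below uses illustrative counts only and assumes every count is at least 1, which holds for COUNT(*) over grouped rows.

```python
import numpy as np

cnt = np.array([1, 10, 209, 1274])    # illustrative counts, all >= 1
log_cnt = np.log10(cnt)               # same transform as ohe_df['log_cnt']
restored = np.power(10.0, log_cnt)    # inverse transform back to the count scale

print(np.round(log_cnt, 3))
print(np.allclose(restored, cnt))
```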

In [43]:
def split_train_and_test(df, date):
    """
    Split a DataFrame into train_df and test_df.

    df : time series DataFrame
    date : cutoff date; rows before it go to train, the rest go to test
    """
    train_df = df[df['pickup_hour'] < date]
    test_df = df[df['pickup_hour'] >= date]
    return train_df, test_df
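
To see where the boundary falls, the same rule can be applied to a toy frame. This is a sketch with made-up rows; the names split_by_date and toy_df are hypothetical. Note that a row stamped exactly at midnight of the cutoff date lands in the test set.

```python
import pandas as pd


def split_by_date(df, date, column="pickup_hour"):
    # Same rule as split_train_and_test: rows before `date` train, the rest test.
    return df[df[column] < date], df[df[column] >= date]


toy_df = pd.DataFrame({
    "pickup_hour": pd.to_datetime(
        ["2015-01-23 22:00", "2015-01-23 23:00", "2015-01-24 00:00", "2015-01-24 01:00"]
    ),
    "cnt": [5, 7, 3, 9],  # made-up counts
})
toy_train, toy_test = split_by_date(toy_df, "2015-01-24")
print(len(toy_train), len(toy_test))
```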

### Split Train / Test 

In [44]:
train_df, test_df = split_train_and_test(ohe_df, '2015-01-24')

In [45]:
train_df.tail()

Unnamed: 0,zip_code,pickup_hour,month,day,weekday,hour,is_weekend,cnt,zip_code_10001,zip_code_10002,...,zip_code_12729,zip_code_12771,zip_code_13029,zip_code_13118,zip_code_13656,zip_code_13691,zip_code_14072,zip_code_14527,zip_code_14801,log_cnt
65113,10038,2015-01-23 23:00:00+00:00,1,23,4,23,0,209,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.320146
65114,11231,2015-01-23 23:00:00+00:00,1,23,4,23,0,32,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.50515
65115,11371,2015-01-23 23:00:00+00:00,1,23,4,23,0,305,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.4843
65116,10173,2015-01-23 23:00:00+00:00,1,23,4,23,0,5,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.69897
65117,10001,2015-01-23 23:00:00+00:00,1,23,4,23,0,1274,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.105169


- Delete the columns that will not be used as features

In [46]:
del train_df['zip_code']
del train_df['pickup_hour']
del test_df['zip_code']
del test_df['pickup_hour']

In [None]:
train_df.head(2)

Unnamed: 0,month,day,weekday,hour,is_weekend,cnt,zip_code_10001,zip_code_10002,zip_code_10003,zip_code_10004,...,zip_code_12729,zip_code_12771,zip_code_13029,zip_code_13118,zip_code_13656,zip_code_13691,zip_code_14072,zip_code_14527,zip_code_14801,log_cnt
0,1,1,3,0,0,97,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.986772
1,1,1,3,0,0,99,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.995635


In [47]:
y_train_raw = train_df.pop('cnt')
y_train_log = train_df.pop('log_cnt')
y_test_raw = test_df.pop('cnt')
y_test_log = test_df.pop('log_cnt')

In [48]:
y_true = y_test_raw.values.copy()

In [49]:
x_train = train_df.copy()
x_test = test_df.copy()

In [50]:
def evaluation(y_true, y_pred): 
    y_true, y_pred = np.array(y_true), np.array(y_pred)
    mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100
    mae = mean_absolute_error(y_true, y_pred)
    mse = mean_squared_error(y_true, y_pred)
    score = pd.DataFrame([mape, mae, mse], index=['mape', 'mae', 'mse'], columns=['score']).T
    return score
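
As a sanity check on the three metrics, the arithmetic can be worked through by hand on two made-up points. The sketch uses only NumPy and hypothetical values. MAPE divides by the true value, so it assumes no true value is zero.

```python
import numpy as np

y_true_demo = np.array([100.0, 200.0])   # made-up true counts
y_pred_demo = np.array([110.0, 180.0])   # made-up predictions

abs_err = np.abs(y_true_demo - y_pred_demo)        # [10, 20]
mape = np.mean(abs_err / y_true_demo) * 100        # (0.10 + 0.10) / 2 * 100, about 10
mae = np.mean(abs_err)                             # (10 + 20) / 2 = 15
mse = np.mean((y_true_demo - y_pred_demo) ** 2)    # (100 + 400) / 2 = 250

print(mape, mae, mse)
```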

### Experimental Settings
- ex = Experiment('nyc-demand-prediction', interactive=True) was created above; here the settings are saved with @ex.config
- @ex.capture injects the matching settings into the decorated function's arguments
- @ex.main contains what to do when the experiment runs
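
The idea behind config injection can be illustrated with a toy decorator in plain Python. This is only a conceptual sketch, not Sacred's actual implementation: the CONFIG dict, the capture decorator, and describe_model are hypothetical stand-ins for the values saved with ex.config and a function wrapped with ex.capture.

```python
import functools
import inspect

CONFIG = {"fit_intercept": True, "normalize": False}  # stand-in for saved settings


def capture(func):
    """Toy config injection: fill arguments the caller left out from CONFIG."""
    sig = inspect.signature(func)

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        bound = sig.bind_partial(*args, **kwargs)
        # Explicitly passed arguments win; only missing ones come from CONFIG.
        missing = {name: CONFIG[name] for name in sig.parameters
                   if name not in bound.arguments and name in CONFIG}
        return func(*args, **kwargs, **missing)

    return wrapper


@capture
def describe_model(fit_intercept, normalize):
    return f"fit_intercept={fit_intercept}, normalize={normalize}"


print(describe_model())                     # both values come from CONFIG
print(describe_model(fit_intercept=False))  # explicit argument overrides CONFIG
```

This is why get_model() can be called with no arguments inside the main function: the missing parameters are looked up by name in the configuration.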

In [51]:
@ex.config
def config():
    fit_intercept=True
    normalize=False

In [62]:
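@ex.capture
def get_model(fit_intercept, normalize):
    # normalize stays in the config but is not passed on:
    # LinearRegression no longer accepts it as of scikit-learn 1.2.
    return LinearRegression(fit_intercept=fit_intercept)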

In [63]:
# _log and _run are special arguments that Sacred injects automatically; they do not need to be defined separately
@ex.main
def run(_log, _run):
    lr_reg = get_model()
    lr_reg.fit(x_train, y_train_raw)
    pred = lr_reg.predict(x_test)
    # Write a message to the experiment log
    _log.info("Predict End")
    score = evaluation(y_test_raw, pred)
    _run.log_scalar('model_name', lr_reg.__class__.__name__)
    
    # To store the scores on the metrics side, log them as below.
    _run.log_scalar('metrics', score)
    
    # To store the scores on the result side, return them as below.
    return score.to_dict()


In [64]:
experiment_result = ex.run()

In [65]:
experiment_result.config

{'fit_intercept': True, 'normalize': False, 'seed': 756697530}

### Parser to Check Experiment Results
- The parsing function to use depends on how the experiment stored its output
  - 1) If you store metrics with \_run.log\_scalar: recommended
  - 2) If you return the result from the @ex.main function

In [77]:
# 1) If you store metrics with _run.log_scalar
def parsing_output(ex_id):
    with open(f'./experiments/{ex_id}/metrics.json') as json_file:
        json_data = json.load(json_file)
    with open(f'./experiments/{ex_id}/config.json') as config_file:
        config_data = json.load(config_file)
    
    output_df = pd.DataFrame(json_data['model_name']['values'], columns=['model_name'], index=['score'])
    output_df['experiment_num'] = ex_id
    output_df['config'] = str(config_data)
    metric_df = pd.DataFrame(json_data['metrics']['values'][0]['values'])
    
    output_df = pd.concat([output_df, metric_df], axis=1)
    output_df = output_df.round(2)
    return output_df

In [78]:
# 2) If you return the result from the @ex.main function
# Note: this redefines parsing_output and replaces version 1) above.
def parsing_output(ex_id):
    with open(f'./experiments/{ex_id}/run.json') as json_file:
        json_data = json.load(json_file)
    output = pd.DataFrame(json_data['result'])
    return output

In [80]:
parsing_output(2)
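
To compare many runs at once instead of one ID at a time, the run folders can be scanned in a loop. The sketch below is a hedged example: collect_runs is a hypothetical helper, and it assumes each run folder holds a config.json and a run.json with status and result keys, as the FileStorageObserver output used above does. It builds a tiny fake folder so it runs without any real experiments.

```python
import json
import tempfile
from pathlib import Path


def collect_runs(base_dir):
    """Return one dict per run folder that contains config.json and run.json."""
    rows = []
    for run_dir in sorted(Path(base_dir).iterdir()):
        config_path = run_dir / "config.json"
        run_path = run_dir / "run.json"
        if not (config_path.is_file() and run_path.is_file()):
            continue  # skip helper folders such as _sources
        config = json.loads(config_path.read_text())
        run_info = json.loads(run_path.read_text())
        rows.append({
            "ex_id": run_dir.name,
            "config": config,
            "status": run_info.get("status"),
            "result": run_info.get("result"),
        })
    return rows


# Fake layout with made-up contents, for illustration only.
demo_dir = Path(tempfile.mkdtemp())
(demo_dir / "1").mkdir()
(demo_dir / "1" / "config.json").write_text(json.dumps({"fit_intercept": True, "seed": 1}))
(demo_dir / "1" / "run.json").write_text(json.dumps({"status": "COMPLETED", "result": None}))
(demo_dir / "_sources").mkdir()

runs = collect_runs(demo_dir)
print(runs)
```

Folder names are sorted as strings here, so run 10 would come before run 2; converting the names to integers first would give numeric order.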

### If you're curious about more details
- [Sacred GitHub](https://github.com/IDSIA/sacred)
- [Introducing Python Sacred to help with machine learning experiments](https://zzsza.github.io/mlops/2019/07/21/python-sacred/)
- [Experiment and log monitoring using Sacred and Omniboard](https://zzsza.github.io/mlops/2019/07/22/sacred-with-omniboard/)