**Narrowing down DonorsChoose for a quick triage demo** 

In [1]:
import yaml
import pandas as pd
import psycopg2
from triage.experiments import MultiCoreExperiment
from triage import create_engine
from sqlalchemy.engine.url import URL

#### Database connection

In [2]:
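with open('database.yaml', 'r') as f:
    config = yaml.safe_load(f)

# 'postgresql' is the dialect name SQLAlchemy expects; the 'postgres' alias is deprecated
db_url = URL(
    'postgresql',
    host=config['host'],
    username=config['user'],
    database=config['db'],
    password=config['pass'],
    port=config['port'],
)

# Despite the name, `conn` is a SQLAlchemy engine, not a single connection
conn = create_engine(db_url)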

Steps:
- Create a sampled-down version of the DonorsChoose data.
- Create a handful of features to demonstrate triage's feature engineering capabilities.
- Run a small model grid with three models (Logit, DT, RF).
- The target is for the triage run to finish in a few minutes.
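To get a feel for how small the grid in the third step is, here is a minimal sketch of how a grid of estimator class paths and hyperparameter lists expands into individual model specifications. The actual grid lives in `demo_config.yaml`, which is not shown here, so the class paths and hyperparameter values below are hypothetical placeholders, and `expand_grid` is an illustrative helper rather than part of triage.

```python
from itertools import product

# Hypothetical grid: one entry per estimator, each mapping
# hyperparameter names to the list of values to try.
grid_config = {
    "sklearn.linear_model.LogisticRegression": {"C": [0.01, 1.0], "penalty": ["l2"]},
    "sklearn.tree.DecisionTreeClassifier": {"max_depth": [3, 10]},
    "sklearn.ensemble.RandomForestClassifier": {"n_estimators": [100], "max_depth": [5, 10]},
}


def expand_grid(grid):
    """Yield (class_path, params) for every hyperparameter combination."""
    for class_path, param_lists in grid.items():
        names = sorted(param_lists)
        for values in product(*(param_lists[name] for name in names)):
            yield class_path, dict(zip(names, values))


model_specs = list(expand_grid(grid_config))
for class_path, params in model_specs:
    print(class_path, params)
```

Each of these specifications is trained once per temporal split, so keeping the value lists short is the main lever for keeping the run to a few minutes.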

#### Creating the sampled-down version of DonorsChoose

For testing, I'm creating a new projects table that contains projects from ~10% of the schools in the dataset, and changing the cohort query to read from the "new" projects table.

Note: there are about 57,000 different schools. We can change how we sample.
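The sampling idea can be sketched in plain Python: pick a reproducible ~10% subset of the distinct schools, then keep every project that belongs to one of them. This is only an illustration on toy data. The real sampling happens in SQL (in `create_sampled_schema.sql`, not shown here), and the `schoolid` and `projectid` field names, the seed, and both helper functions are assumptions made for the example.

```python
import random


def sample_school_ids(school_ids, fraction=0.10, seed=42):
    """Return a reproducible random subset of the distinct school ids."""
    distinct = sorted(set(school_ids))
    n_keep = max(1, round(len(distinct) * fraction))
    rng = random.Random(seed)  # fixed seed so the sample is repeatable
    return set(rng.sample(distinct, n_keep))


def filter_projects(projects, kept_schools):
    """Keep only the projects whose school is in the sampled set."""
    return [p for p in projects if p["schoolid"] in kept_schools]


# Toy data: 50 schools with 3 projects each
projects = [
    {"projectid": f"p{s}_{i}", "schoolid": f"s{s}"}
    for s in range(50)
    for i in range(3)
]

kept = sample_school_ids(p["schoolid"] for p in projects)
sampled_projects = filter_projects(projects, kept)
print(len(kept), "schools,", len(sampled_projects), "projects")
```

Sampling by school rather than by project keeps every project of a sampled school together, so school-level history stays complete in the smaller table.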

#### Schema

We create a schema that holds a sampled data set drawn from the projects table.

In [41]:
sampled_schema_script = 'create_sampled_schema.sql'

with open(sampled_schema_script, 'r') as script:
    conn.execute(script.read())

#### Triage config

The config contains six features: two static, two time-series features that are precomputed, and two time-series features that are calculated during the triage run.
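To make "time-series feature calculated during the run" concrete, here is a minimal sketch of the kind of aggregation involved: counting events in a fixed window that ends at an as-of date. The feature itself (a teacher's recent project count), the 180-day window, and the half-open interval convention are assumptions for illustration, not the definitions used in `demo_config.yaml`.

```python
from datetime import date, timedelta


def count_in_window(event_dates, as_of_date, window_days):
    """Count events in the half-open window [as_of_date - window_days, as_of_date)."""
    start = as_of_date - timedelta(days=window_days)
    return sum(1 for d in event_dates if start <= d < as_of_date)


# Hypothetical posting dates for one teacher's projects
teacher_projects = [
    date(2012, 1, 15),
    date(2012, 6, 1),
    date(2012, 10, 20),
    date(2012, 11, 5),
]

as_of = date(2012, 11, 1)
recent = count_in_window(teacher_projects, as_of, window_days=180)
print(recent)
```

Events on or after the as-of date are excluded so that a feature never uses information from the period being predicted.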

In [44]:
with open('demo_config.yaml', 'r') as f:
    triage_config = yaml.safe_load(f)
    
# TODO -- replace with an S3 bucket
project_folder = '/mnt/data/experiment_data/donors/demo/'

#### Experiment

In [50]:
experiment = MultiCoreExperiment(
    config=triage_config,
    db_engine=conn,
    n_processes=2,
    n_db_processes=2,
    project_path=project_folder,
    replace=True,
    save_predictions=True
)

2021-09-20 22:47:54 - VERBOSE Matrices and trained models will be saved in /mnt/data/experiment_data/donors/demo/
2021-09-20 22:47:54 -  NOTICE Replace flag is set to true. Matrices, models, evaluations and predictions (if exist) will be replaced
2021-09-20 22:47:54 -  NOTICE Random seed not specified. A random seed will be provided. This could have interesting side effects, e.g. new models per model group are trained, tested and evaluated everytime that you run this experiment configuration
2021-09-20 22:47:54 - VERBOSE Using random seed [3711149] for running the experiment
2021-09-20 22:47:54 -  NOTICE bias_audit_config missing in the configuration file or unrecognized. Without protected groups, you will not audit your models for bias and fairness.
2021-09-20 22:47:54 -  NOTICE scoring.subsets missing in the configur

In [51]:
%%time
experiment.run()

2021-09-20 22:47:57 - SUCCESS Experiment validation ran to completion with no errors
2021-09-20 22:47:57 - VERBOSE Computed and stored temporal split definitions
2021-09-20 22:47:57 -    INFO Setting up cohort
2021-09-20 22:48:00 - SUCCESS Cohort setted up in the table cohort_all_entities_c86920bbaf9b0aefd0005b5c6773a88a successfully
2021-09-20 22:48:00 -    INFO Setting up labels
2021-09-20 22:48:11 - SUCCESS Labels setted up in the table labels_quickstart_label_b4877d7091c3e743a36e324c945d1a97 successfully
2021-09-20 22:48:11 -    INFO Creating features tables (before imputation)
2021-09-20 22:48:11 -    INFO Creating collate aggregations
2021-09-20 22:48:11 - VERBOSE Starting Feature aggregation


2021-09-20 22:48:12 -  NOTICE Imputed feature table project_features_aggregation_imputed looks good, skipping feature building!
2021-09-20 22:48:12 -  NOTICE Imputed feature table teachr_funding_aggregation_imputed looks good, skipping feature building!
2021-09-20 22:48:13 -  NOTICE Imputed feature table donation_features_aggregation_imputed looks good, skipping feature building!
2021-09-20 22:48:13 -    INFO Processing query tasks with 2 processes
2021-09-20 22:48:13 -    INFO Processing features for project_features_entity_id
2021-09-20 22:48:13 -    INFO Beginning insert batch
2021-09-20 22:48:13 -    INFO Beginning insert batch
2021-09-20 22:48:13 -    INFO Beginning insert batch
2021-09-20 22:48:13 -    INFO Beginning insert batch
2021-09-20 22:48:13 -    INFO

#### Model evaluations

In [47]:
q = "select run_hash from triage_metadata.triage_runs order by start_time desc limit 1;"

experiment_hash = pd.read_sql(q, conn)['run_hash'].iloc[0]
experiment_hash

'4b137fe4406b4d943768d3047dbcc0b1'

In [48]:
q = """
    select 
        to_char(max(train_end_time), 'YYYY-MM-DD') as last_time
    from triage_metadata.experiment_models
    join triage_metadata.models using(model_hash)
    where experiment_hash = '{experiment_hash}'
""".format(experiment_hash=experiment_hash)

last_train_end_time = pd.read_sql(q, conn)['last_time'].iloc[0]
last_train_end_time

'2012-11-01'

In [49]:
q = """
    select 
        model_id, model_type, metric, parameter, best_value, worst_value, stochastic_value
    from triage_metadata.experiment_models
    join triage_metadata.models using(model_hash)
    join test_results.evaluations using(model_id)
    where experiment_hash = '{experiment_hash}' and train_end_time='{last_split}' 
    and metric='precision@' and parameter='15_pct'
""".format(
    experiment_hash = experiment_hash,
    last_split=last_train_end_time
)

evals = pd.read_sql(q, conn)
evals

| | model_id | model_type | metric | parameter | best_value | worst_value | stochastic_value |
|---|---|---|---|---|---|---|---|
| 0 | 85 | sklearn.tree.DecisionTreeClassifier | precision@ | 15_pct | 0.858586 | 0.070707 | 0.4367 |
| 1 | 86 | triage.component.catwalk.estimators.classifier... | precision@ | 15_pct | 0.525253 | 0.363636 | 0.450505 |
| 2 | 87 | triage.component.catwalk.baselines.rankers.Per... | precision@ | 15_pct | 0.474747 | 0.474747 | 0.474747 |
| 3 | 90 | sklearn.ensemble.RandomForestClassifier | precision@ | 15_pct | 0.444444 | 0.444444 | 0.444444 |
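The gap between `best_value` and `worst_value` for the decision tree is worth a note. As I understand it, the two columns differ only in how tied scores are ordered before the top-k cut, and a decision tree outputs few distinct scores, so it produces many ties. The sketch below illustrates that idea on toy data; `precision_at_k` is a hypothetical helper written for this example, not triage's implementation.

```python
def precision_at_k(scores, labels, k, tie_break):
    """Precision in the top k after sorting by score (descending).

    tie_break='best' ranks positives first among tied scores,
    tie_break='worst' ranks them last.
    """
    sign = -1 if tie_break == "best" else 1
    ranked = sorted(zip(scores, labels), key=lambda sl: (-sl[0], sign * sl[1]))
    top = ranked[:k]
    return sum(label for _, label in top) / k


# Few distinct scores (tree-like): the top-2 cut falls inside a four-way tie
tied_scores = [0.9, 0.9, 0.9, 0.9, 0.2, 0.2]
tied_labels = [1, 0, 1, 0, 1, 0]
best = precision_at_k(tied_scores, tied_labels, k=2, tie_break="best")
worst = precision_at_k(tied_scores, tied_labels, k=2, tie_break="worst")

# All scores distinct: tie-breaking has nothing to reorder
distinct_scores = [0.9, 0.8, 0.7, 0.6]
distinct_labels = [1, 0, 1, 0]
same_best = precision_at_k(distinct_scores, distinct_labels, k=2, tie_break="best")
same_worst = precision_at_k(distinct_scores, distinct_labels, k=2, tie_break="worst")

print(best, worst, same_best, same_worst)
```

When the best and worst values coincide, as they do for the last two rows of the table, the top-k list is not affected by tie ordering and the single number can be read at face value.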
