**Narrowing down DonorsChoose for a quick triage demo** 

In [1]:
import yaml
import pandas as pd
import psycopg2
from triage.experiments import MultiCoreExperiment
from triage import create_engine
from sqlalchemy.engine.url import URL

#### DB Conn

In [2]:
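# Read connection settings from a local YAML file so credentials stay out of the notebook.
with open('database.yaml', 'r') as f:
    config = yaml.safe_load(f)

# 'postgresql' is the canonical dialect name; the 'postgres' alias is deprecated
# and was removed in SQLAlchemy 1.4.
db_url = URL(
    'postgresql',
    host=config['host'],
    username=config['user'],
    database=config['db'],
    password=config['password'],
    port=config['port'],
)

conn = create_engine(db_url)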

Steps:
- Create a sampled-down version of the DonorsChoose data.
- Create a handful of features to demonstrate the feature engineering capabilities of triage.
- Run a small model grid with three model types: logistic regression (Logit), decision tree (DT), and random forest (RF).
- The target is for the triage run to finish in a few minutes.
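
To get a feel for how small the grid in the steps above needs to be, here is a minimal sketch that expands a grid into individual model specifications. The class paths are real scikit-learn estimators, but the hyperparameter values are made up for illustration and are not taken from demo_config.yaml; `expand_grid` is a hypothetical helper, not part of triage.

```python
from itertools import product

# Hypothetical grid: hyperparameter values are illustrative assumptions.
grid_config = {
    "sklearn.linear_model.LogisticRegression": {"C": [0.1, 1.0], "penalty": ["l2"]},
    "sklearn.tree.DecisionTreeClassifier": {"max_depth": [3, 5], "min_samples_split": [10]},
    "sklearn.ensemble.RandomForestClassifier": {"n_estimators": [100], "max_depth": [5, 10]},
}


def expand_grid(grid):
    """Return one (class_path, params) pair per hyperparameter combination."""
    combos = []
    for class_path, params in grid.items():
        names = sorted(params)
        for values in product(*(params[name] for name in names)):
            combos.append((class_path, dict(zip(names, values))))
    return combos


model_specs = expand_grid(grid_config)
print(f"{len(model_specs)} model specifications")
for class_path, params in model_specs:
    print(class_path, params)
```

Each specification is trained once per train/test split, so the total run time grows with the number of combinations times the number of splits. Keeping both small is what keeps the demo within a few minutes.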

#### Creating the sampled-down version of DonorsChoose

For testing, I'm creating a new projects table that contains projects from ~10% of the schools in the dataset, and changing the cohort query to read from the new projects table.

Note: there are about 57,000 distinct schools, so the limit of 5700 in the query below samples roughly 10% of them. We can change how we sample.

In [None]:
q = """
    drop table if exists optimized.projects_sampled_temp;
    drop table if exists optimized.donations_sampled_temp;

    -- Keep every project that belongs to a random sample of 5700 schools (~10%).
    create table optimized.projects_sampled_temp as (
        with schools as (
            select distinct schoolid
            from optimized.projects
        ),
        sampled_schools as (
            select * from schools order by random() limit 5700
        )
        select *
        from sampled_schools
        join optimized.projects using (schoolid)
    );

    -- Keep only the donations made to the sampled projects.
    create table optimized.donations_sampled_temp as (
        select b.*
        from optimized.projects_sampled_temp a
        join optimized.donations b using (entity_id)
    );
"""

conn.execute(q)
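
The query above draws a different set of schools on every run because of `order by random()`. One way we could change the sampling is to make it deterministic by hashing each school id. The sketch below is only an illustration in plain Python: `in_sample`, the salt, and the placeholder school ids are all hypothetical and not part of the notebook's pipeline.

```python
import hashlib


def in_sample(school_id, fraction=0.10, salt="demo"):
    """Deterministically decide whether a school falls in the sample."""
    digest = hashlib.md5(f"{salt}:{school_id}".encode("utf-8")).hexdigest()
    # Map the first 32 bits of the hash to a number in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return bucket < fraction


# Placeholder ids standing in for the ~57,000 real schoolid values.
school_ids = [f"school_{i}" for i in range(57000)]
sampled = [s for s in school_ids if in_sample(s)]
print(f"sampled {len(sampled)} of {len(school_ids)} schools")
```

The same school always lands in the same bucket, so re-running the notebook reproduces the same sampled tables, and changing the salt gives a fresh sample.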

#### Triage config

The config currently contains four features: two static and two dynamic. Because the dynamic features are precomputed, they appear static from a demo perspective. We could measure how long it takes to compute the features on the fly, perhaps with indexed tables (the optimized donations table currently has no indexes).

In [None]:
with open('demo_config.yaml', 'r') as f:
    triage_config = yaml.safe_load(f)
    
# TODO -- replace with an S3 bucket
project_folder = '/mnt/data/experiment_data/donors/demo/'
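
On the indexing idea mentioned above: a minimal sketch of how index statements for the sampled tables could be generated. The table and column names come from the sampling query in this notebook, but which columns are actually worth indexing is an assumption, and `index_statements` is a hypothetical helper.

```python
def index_statements(table_columns):
    """Build one 'create index' statement per (table, column) pair."""
    statements = []
    for table, columns in table_columns.items():
        for column in columns:
            # Index name derived from the unqualified table name and the column.
            index_name = f"{table.split('.')[-1]}_{column}_idx"
            statements.append(
                f"create index if not exists {index_name} on {table} ({column});"
            )
    return statements


# Assumption: the join keys used earlier are the columns worth indexing.
statements = index_statements({
    "optimized.projects_sampled_temp": ["entity_id", "schoolid"],
    "optimized.donations_sampled_temp": ["entity_id"],
})
for stmt in statements:
    print(stmt)

# With a live engine, the statements could be applied one by one, for example:
#     for stmt in statements:
#         conn.execute(stmt)
```

Timing the feature queries before and after applying the indexes would show whether on-the-fly computation fits within the demo's time budget.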

#### Experiment

In [32]:
experiment = MultiCoreExperiment(
    config=triage_config,
    db_engine=conn,
    n_processes=2,
    n_db_processes=4,
    project_path=project_folder,
    replace=False,
    save_predictions=False
)

2021-09-18 20:09:23 - VERBOSE Matrices and trained models will be saved in /mnt/data/experiment_data/donors/demo/
2021-09-18 20:09:23 - NOTICE Save predictions flag is set to false. Predictions won't be stored in the predictions table. This will decrease both the running time of an experiment and also decrease the space needed in the db


ProgrammingError: (psycopg2.errors.InsufficientPrivilege) permission denied for table results_schema_versions

[SQL: select version_num from results_schema_versions limit 1]
(Background on this error at: http://sqlalche.me/e/13/f405)
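
The error above says the database role from database.yaml cannot read the results_schema_versions table. A quick way to confirm which privileges are missing is PostgreSQL's `has_table_privilege` function. The sketch below only builds the check queries; `privilege_check_queries` is a hypothetical helper, and the set of privileges worth checking is an assumption.

```python
def privilege_check_queries(table, privileges=("SELECT", "INSERT", "UPDATE", "DELETE")):
    """Map each privilege to a query that returns true/false for the current role."""
    # The table name is interpolated directly, so only pass trusted, hard-coded names.
    return {
        privilege: f"select has_table_privilege('{table}', '{privilege}')"
        for privilege in privileges
    }


queries = privilege_check_queries("results_schema_versions")
for privilege, sql in queries.items():
    print(privilege, "->", sql)

# With a live engine, each check could be run as, for example:
#     allowed = conn.execute(sql).scalar()
```

If a check comes back false, someone who owns the table (or a superuser) would need to grant that privilege to the role before the experiment can be created.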