# ORM Benchmark

This notebook compares MIT's original cohort and feature generator classes against IBC's ORM-based versions of those classes. We use the EOL cohort definition for the comparison because the original and ORM-based versions should implement identical logic.

This notebook will loosely follow this structure:
1. Compare cohorts, and
2. Compare features.

Both cohort tables are written to the same schema, `eol_cohort_comparison`, which is defined in `config.py`. MIT's table is named `original_eol_cohort`, whereas IBC's is named `eol_cohort`; the latter name is hard-coded in the definition of `EolCohortTable`, as the ORM functionality requires.

## MIT's Cohort Creation

### Imports

In [1]:
# Third party imports
import numpy as np
from getpass import getpass

# Project imports
import config
from config import user_schema
import Utils.dbutils as dbutils
import Generators.CohortGenerator as CohortGenerator

### 1. Set up a connection to the OMOP CDM database

The connection credentials are set in the cell below; the schema name is read from `./config.py`.

In [2]:
# database connection
username = 'ibx8568'
password = getpass()
database_name = 'pgsql01/ds_omop_cdm'

config_path = 'postgresql://{username}:{password}@{database_name}'.format(
    username = username,
    password = password,
    database_name = database_name
)

# schemas 
schema_name = user_schema # all created tables will be created using this schema

# caching
reset_schema = True # if true, rebuild all data from scratch

# set up database, reset schemas as needed
db = dbutils.Database(config_path, schema_name)
if reset_schema:
    db.execute(
        'drop schema if exists {} cascade'.format(schema_name)
    )
db.execute(
    'create schema if not exists {}'.format(schema_name)
)

········


  "expression-based index %s" % idx_name


Executed 1 SQLs
Executed 1 SQLs
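
One caveat with the connection string built in the cell above: a password containing characters such as `@`, `/` or `:` would break the URL if it is interpolated verbatim. Below is a minimal sketch of percent-encoding the credentials with the standard library first. `build_postgres_url` is a hypothetical helper, not part of `dbutils`, and the sample credentials are made up.

```python
from urllib.parse import quote_plus


def build_postgres_url(username, password, database_name):
    """Return a postgresql:// URL with percent-encoded credentials.

    Hypothetical helper for illustration; `database_name` is expected in the
    same 'host/dbname' form used in the notebook.
    """
    return 'postgresql://{}:{}@{}'.format(
        quote_plus(username),
        quote_plus(password),
        database_name,
    )


# Made-up credentials containing characters that are special in URLs
example_url = build_postgres_url('some_user', 'p@ss/word:1', 'pgsql01/ds_omop_cdm')
print(example_url)
```

For a password made only of letters and digits the result is identical to the plain `format` call, so the encoding step costs nothing in the common case.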


### 2. Generate the Cohort as per the given SQL file

In [3]:
cohort_name = 'original_eol_cohort'
cohort_script_path = config.SQL_PATH_COHORTS + '/gen_EOL_cohort.sql'

# cohort parameters  
params = {
          'cohort_table_name'     : cohort_name,
          'schema_name'           : schema_name,
          'aux_data_schema'       : config.CDM_AUX_SCHEMA,
          'training_start_date'   : '2016-01-01',
          'training_end_date'     : '2017-01-01',
          'gap'                   : '3 months',
          'outcome_window'        : '6 months'
         }

original_eol_cohort = CohortGenerator.Cohort(
    schema_name=schema_name,
    cohort_table_name=cohort_name,
    cohort_generation_script=cohort_script_path,
    cohort_generation_kwargs=params,
    outcome_col_name='y'
)
original_eol_cohort.build(db, replace=True)

Regenerating Table (replace=True)
Regenerated Cohort in 14.98386812210083 seconds


## IBC's Cohort Creation

### Imports

In [4]:
# Standard imports
from datetime import datetime
import time

# Project imports
from Utils.ORM.postgres_db import PostgresDatabase
from Generators.ORM.EolCohortGenerator import EolCohortTable

In [5]:
db_orm = PostgresDatabase(username, password, database_name, schema_name)
remake_cohort = True

In [6]:
cohort_args = {
    'training_start_date':datetime.strptime('2016-01-01', '%Y-%m-%d'),
    'training_end_date':datetime.strptime('2017-01-01', '%Y-%m-%d'),
    'gap_months':3,
    'outcome_months':6,
    'min_enroll_proportion':0.95
}
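
The ORM arguments are meant to describe the same cohort as the `params` dict passed to the SQL template earlier, only with typed values: `datetime` objects instead of date strings and integer month counts instead of interval strings such as `'3 months'`. As a rough sketch, the translation could be written as below. Both function names are hypothetical and are not part of either code base; `min_enroll_proportion` has no counterpart in `params`, so it is passed in separately.

```python
from datetime import datetime


def months_from_interval(interval):
    """Parse an interval string such as '3 months' into the integer 3."""
    count, unit = interval.split()
    if unit not in ('month', 'months'):
        raise ValueError('expected a month interval, got {!r}'.format(interval))
    return int(count)


def orm_args_from_sql_params(params, min_enroll_proportion):
    """Translate SQL-template parameters into ORM-style keyword arguments."""
    return {
        'training_start_date': datetime.strptime(params['training_start_date'], '%Y-%m-%d'),
        'training_end_date': datetime.strptime(params['training_end_date'], '%Y-%m-%d'),
        'gap_months': months_from_interval(params['gap']),
        'outcome_months': months_from_interval(params['outcome_window']),
        'min_enroll_proportion': min_enroll_proportion,
    }


# Same values as the SQL-template parameters used for the original cohort
sql_params = {
    'training_start_date': '2016-01-01',
    'training_end_date': '2017-01-01',
    'gap': '3 months',
    'outcome_window': '6 months',
}
derived_args = orm_args_from_sql_params(sql_params, min_enroll_proportion=0.95)
print(derived_args)
```

Deriving one set of arguments from the other, rather than typing both by hand, would guard against the two cohorts silently drifting apart.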

In [7]:
orm_cohort = EolCohortTable(**cohort_args)
tic = time.perf_counter()
orm_cohort.build(db_orm, remake_cohort)
toc = time.perf_counter()
print('Cohort build took {:,.2f} seconds'.format(toc - tic))

Cohort build took 19.51 seconds
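
The original cohort class prints its own build time, whereas the ORM build is timed by hand with `time.perf_counter`. To time both paths the same way, one option is a small context manager like the sketch below. `timed` is a hypothetical helper, and the `time.sleep` call merely stands in for a real build.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results=None):
    """Time the enclosed block with time.perf_counter and report the duration.

    If `results` is a dict, the elapsed seconds are also stored under `label`.
    """
    tic = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - tic
        if results is not None:
            results[label] = elapsed
        print('{} took {:,.2f} seconds'.format(label, elapsed))


timings = {}
with timed('Placeholder step', timings):
    time.sleep(0.01)  # stand-in for a call such as a cohort build
```

Collecting the durations in a dict makes it easy to compare the two implementations at the end of the notebook instead of reading them off scattered cell outputs.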


## Compare Cohorts

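We assume that each `person_id` appears only once in each dataframe and that the start and end dates are the same. The `person_id` and the outcome column `y` are therefore our primary basis for comparison: each cohort is left-merged onto the other on those two columns, and a null `start_date` after the merge marks a row with no match.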

In [8]:
key_cols = ['person_id', 'y']
orm_check = orm_cohort._cohort[key_cols].merge(
    original_eol_cohort._cohort[key_cols + ['start_date']], on=key_cols, how='left')
ori_check = original_eol_cohort._cohort[key_cols].merge(
    orm_cohort._cohort[key_cols + ['start_date']], on=key_cols, how='left')

In [9]:
print('ORM rows not in original: {}'.format(orm_check.start_date.isnull().sum()))
print('Original rows not in ORM: {}'.format(ori_check.start_date.isnull().sum()))

ORM rows not in original: 0
Original rows not in ORM: 0
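
The check above relies on two left merges and on the assumption that `person_id` is unique within each cohort. An alternative sketch that tests that assumption explicitly and gets both directions from a single outer merge, using the `indicator` option of pandas `merge`, is shown below. `compare_cohorts` is a hypothetical helper and the two small frames are made-up toy data, not the real cohorts.

```python
import pandas as pd


def compare_cohorts(left, right, keys=('person_id', 'y')):
    """Count rows found only in `left`, only in `right`, or in both frames."""
    keys = list(keys)
    for name, frame in (('left', left), ('right', right)):
        # The comparison assumes one row per person in each cohort
        if not frame['person_id'].is_unique:
            raise ValueError('{} cohort has duplicate person_id values'.format(name))
    # indicator=True adds a '_merge' column: left_only, right_only or both
    merged = left[keys].merge(right[keys], on=keys, how='outer', indicator=True)
    return {
        label: int((merged['_merge'] == label).sum())
        for label in ('left_only', 'right_only', 'both')
    }


# Toy cohorts: person 2 has a different outcome, persons 3 and 4 are unmatched
toy_left = pd.DataFrame({'person_id': [1, 2, 3], 'y': [0, 1, 0]})
toy_right = pd.DataFrame({'person_id': [1, 2, 4], 'y': [0, 0, 0]})
toy_result = compare_cohorts(toy_left, toy_right)
print(toy_result)
```

Two identical cohorts give zero for both `left_only` and `right_only`, which corresponds to the two zeros printed above.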


## MIT Feature Creation

In [10]:
import Generators.FeatureGenerator as FeatureGenerator

In [11]:
featureSet = FeatureGenerator.FeatureSet(db)
featureSet.add_default_features(
    ['drugs','conditions','procedures','specialty'],
    schema_name,
    cohort_name
)

In [12]:
%%time
# Build the Feature Set by executing SQL queries and reading into sparse matrices
cache_data_path = '/tmp/cache_data_eol_test'
featureSet.build(original_eol_cohort, from_cached=False, cache_file=cache_data_path)

Data loaded to buffer in 1233.04 seconds
Got Unique Concepts and Timestamps in 127.29 seconds
Created Index Mappings in 0.04 seconds
124191
Generated Sparse Representation of Data in 247.83 seconds
CPU times: user 6min 33s, sys: 1min 8s, total: 7min 41s
Wall time: 26min 48s


## IBC Feature Creation

In [13]:
from Generators.ORM.FeatureGenerator import FeatureSet as OrmFeatureSet

In [14]:
ormFeatureSet = OrmFeatureSet(db_orm, EolCohortTable)
ormFeatureSet.add_default_features(['Drugs','Conditions','Procedures','Specialty'])

In [15]:
%%time
ormFeatureSet.build(orm_cohort, from_cached=False)

Data loaded to buffer in 1,868.17 seconds


0it [00:00, ?it/s]

Got Unique Concepts and Timestamps in 115.01 seconds
Created Index Mappings in 0.04 seconds


56it [03:12,  3.44s/it, 17,472.89 MB]


Generated Sparse Representation of Data in 256.06 seconds
CPU times: user 19min 43s, sys: 1min 45s, total: 21min 28s
Wall time: 37min 24s
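
To put the two feature builds side by side, the wall times reported by the `%%time` cells can be converted to seconds and compared. The sketch below does this arithmetic. `wall_seconds` is a hypothetical helper that only handles the `Nmin Ns` form seen in this notebook, and the two strings are copied from the outputs above.

```python
import re


def wall_seconds(text):
    """Convert a wall-time string such as '26min 48s' into seconds."""
    match = re.fullmatch(r'(?:(\d+)min\s+)?(\d+(?:\.\d+)?)s', text.strip())
    if match is None:
        raise ValueError('unrecognised wall time: {!r}'.format(text))
    minutes = int(match.group(1) or 0)
    return minutes * 60 + float(match.group(2))


# Wall times copied from the two %%time outputs in this notebook
original_wall = wall_seconds('26min 48s')
orm_wall = wall_seconds('37min 24s')
ratio = orm_wall / original_wall
print('Original: {:.0f} s, ORM: {:.0f} s, ratio: {:.2f}'.format(
    original_wall, orm_wall, ratio))
# prints "Original: 1608 s, ORM: 2244 s, ratio: 1.40"
```

Each build was run once here, so the ratio is only indicative; repeated runs would be needed before drawing firm conclusions about relative speed.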


## Compare Features

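We run a simple element-wise equality test on the two sparse arrays. Their summaries below show the same shape and number of non-zero entries (`nnz`) but different data types (`float64` for the original, `int64` for the ORM version), so the test compares stored values rather than dtypes.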

In [16]:
featureSet._spm_arr

| Property | Value |
| --- | --- |
| Format | coo |
| Data Type | float64 |
| Shape | (37257, 4313, 124191) |
| nnz | 81539196 |
| Density | 4.085911548588132e-06 |
| Read-only | True |
| Size | 2.4G |
| Storage ratio | 0.0 |


In [17]:
ormFeatureSet._spm_arr

| Property | Value |
| --- | --- |
| Format | coo |
| Data Type | int64 |
| Shape | (37257, 4313, 124191) |
| nnz | 81539196 |
| Density | 4.085911548588132e-06 |
| Read-only | True |
| Size | 2.4G |
| Storage ratio | 0.0 |


In [19]:
np.all(featureSet._spm_arr == ormFeatureSet._spm_arr)

True
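
Because the two arrays have different data types, the `True` above means the stored values agree element by element; it says nothing about the dtypes matching. The sketch below illustrates this with small dense NumPy arrays, which stand in for the sparse COO arrays. The toy values are made up.

```python
import numpy as np

# Toy stand-ins for the two feature arrays: same values, different dtypes
float_arr = np.array([[0.0, 1.0], [2.0, 0.0]], dtype=np.float64)
int_arr = np.array([[0, 1], [2, 0]], dtype=np.int64)

same_shape = float_arr.shape == int_arr.shape
same_values = bool(np.all(float_arr == int_arr))  # element-wise, ignores dtype
same_dtype = float_arr.dtype == int_arr.dtype

print(same_shape, same_values, same_dtype)
# prints "True True False"
```

If downstream code depends on the dtype, casting one array to the other's type before use would be a separate step from this equality check.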