# Replicating Results


### What is this?
This notebook provides step-by-step instructions for replicating the results for multi-adjustment.

In [1]:
import pandas as pd
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf

import os
import yaml
import sqlalchemy

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
pd.options.display.max_columns = None

from IPython.display import display 
from itertools import permutations, combinations
from jinja2 import Template
import dateparser


from ohio.ext.numpy import pg_copy_to_table

Specify the database, the `demo_col` of interest, the `demo_values` of interest, and the procedure you wish to use for adjustment.

In [2]:
database="san_jose_housing_triage"

In [3]:
working_schema, results_schema = "kit_bias_adj", "bias_results"

In [4]:
demo_col = "median_income"

In [5]:
demo_values = ["over55k", "under55k"]

## Getting Set Up

You will need a database with the following schemas:
- **public**: The raw data from DonorsChoose, as well as some tables with calculated features and intermediate modeling tables
- **model_metadata**: Information about the models we ran, such as model types and hyperparameters (models were run with `triage`, which generates this schema. In `triage` a "model group" specifies a type of model and associated hyperparameter values, while a "model" is an instantiation of a given model group on a specific temporal validation split). Note that this schema contains information on other model runs with this dataset, in addition to the run used for the current study of fairness-accuracy trade-offs.
- **test_results**: Validation set statistics and predictions for the models. Here, `test_results.predictions` contains project-level predicted scores from each model in the grid, while `test_results.evaluations` contains aggregated summary statistics for each model.
- **train_results**: Training set statistics for the models, including feature importances.
- **features**: Intermediate tables containing calculated features from the `triage` run.
- **bias_working**: Intermediate tables from the bias analysis, as well as the mapping table between projects and school poverty levels, `bias_working.entity_demos`.
- **bias_results_submitted**: Results of the fairness-accuracy trade-offs from the study as submitted (see below to use these to replicate the figures from the study).
- **bias_results**: Empty bias analysis results tables that will be populated by re-running the fairness adjustments (see below for instructions).


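Finally, to connect to the database, you'll need a YAML file with your connection info. The connection cell below reads it from `../../config/db_default_profile.yaml`, so place the file there or update that path. The database name comes from the `database` variable set earlier, not from the `db` key: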
```yaml
host: {POSTGRES_HOST}
user: {POSTGRES_USER}
db: education_crowdfunding
pass: {POSTGRES_PASSWORD}
port: {POSTGRES_PORT}
```
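
A missing key in the profile only surfaces as a `KeyError` when the connection cell runs. As a minimal sketch, assuming the profile has already been loaded into a dict, the check below lists the keys that the connection cell reads but that are absent or empty. The helper name `missing_profile_keys` and the example values are hypothetical and not part of the notebook.

```python
# Keys the connection cell reads from the profile (the database name is set separately).
REQUIRED_KEYS = ("host", "user", "pass", "port")


def missing_profile_keys(config):
    """Return the required keys that are absent or empty in a loaded profile dict."""
    return [key for key in REQUIRED_KEYS if config.get(key) in (None, "")]


# Illustrative profile with the password left out.
example = {"host": "localhost", "user": "analyst", "port": 5432}
print(missing_profile_keys(example))  # ['pass']
```

Running a check like this right after loading the file gives a clearer message than a failure deep inside the engine setup.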


In [6]:
with open('../../config/db_default_profile.yaml') as fd:
    config = yaml.full_load(fd)

# URL.create() replaces calling URL(...) directly, which is deprecated as of SQLAlchemy 1.4
dburl = sqlalchemy.engine.url.URL.create(
    "postgresql",
    host=config["host"],
    username=config["user"],
    database=database,
    password=config["pass"],
    port=config["port"],
)
engine = sqlalchemy.create_engine(dburl, poolclass=sqlalchemy.pool.QueuePool)


In [7]:
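def get_frac_demos(model_group_id=None, train_end_time=None):
    """Count distinct selected entities in the first demo group (demo_values[0]),
    grouped by model group, train_end_time, and weight, with optional filters."""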
    if model_group_id is None and train_end_time is None:
        sel_string = ""
    elif model_group_id is None:
        sel_string = f"and train_end_time = '{train_end_time}'"
    elif train_end_time is None:
        sel_string = f"and model_group_id = {model_group_id}"
    else:
        sel_string = f"and train_end_time = '{train_end_time}' and model_group_id = {model_group_id}"
    sql = f"""
    select model_group_id, train_end_time, weight, count(distinct entity_id) 
    from {results_schema}.selected_entities se natural join {working_schema}.entity_demos ed
    where ed.{demo_col} = '{demo_values[0]}' {sel_string}
    group by 1, 2, 3 
    """
    ts_df = pd.read_sql(sql, engine)
    return ts_df
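
The four-way branch that assembles the optional filter can be collapsed into one small helper. The following is a sketch only, not the notebook's own code: `build_filter` is a hypothetical name, and the `prefix` argument is an assumption added so the same helper could emit either an `and ...` fragment or a `WHERE ...` fragment.

```python
def build_filter(model_group_id=None, train_end_time=None, prefix="and"):
    """Build an optional SQL filter fragment; return "" when no filter is requested."""
    conditions = []
    if model_group_id is not None:
        # int() guards against non-numeric input being pasted into the query
        conditions.append(f"model_group_id = {int(model_group_id)}")
    if train_end_time is not None:
        conditions.append(f"train_end_time = '{train_end_time}'")
    if not conditions:
        return ""
    return f"{prefix} " + " and ".join(conditions)


print(build_filter(67))
print(build_filter(67, "2014-06-01", prefix="WHERE"))
```

The conditions come out in a fixed order, which is logically equivalent to the hand-written branches.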

In [8]:
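def get_difference(weight1, weight2, model_group_id=None, train_end_time=None):
    """Return entities selected under exactly one of weight1 or weight2
    (the symmetric difference of the two selections), with optional filters."""
    if model_group_id is None and train_end_time is None:
        sel_string = ""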
    elif model_group_id is None:
        sel_string = f"WHERE train_end_time = '{train_end_time}'"
    elif train_end_time is None:
        sel_string = f"WHERE model_group_id = {model_group_id}"
    else:
        sel_string = f"WHERE model_group_id = {model_group_id} and train_end_time = '{train_end_time}'"
    
    sql = f"""
        with all_entities as (select * 
        from {results_schema}.selected_entities se natural join {working_schema}.entity_demos ed)
        , e1 as (select * from all_entities where weight = {weight1})
        , e2 as (select * from all_entities where weight = {weight2})
        , e3 as (select * from e1 where (model_group_id , train_end_time , entity_id) not in (select model_group_id , train_end_time , entity_id  from e2))
        , e4 as (select * from e2 where (model_group_id , train_end_time , entity_id) not in (select model_group_id , train_end_time , entity_id  from e1))
        SELECT distinct * from e3 {sel_string} UNION ALL SELECT distinct * FROM e4 {sel_string}
        """    
    ts_df = pd.read_sql(sql, engine)
    return ts_df
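
The query above keeps the rows selected under one weight but not the other, matching on `(model_group_id, train_end_time, entity_id)`. The same idea can be illustrated in plain Python with sets. The keys below are made-up examples, not rows from the database.

```python
# Hypothetical selections: (model_group_id, train_end_time, entity_id) keys per weight.
selected_w1 = {(67, "2014-06-01", 644), (67, "2014-06-01", 701)}
selected_w2 = {(67, "2014-06-01", 701), (67, "2014-06-01", 912)}

only_w1 = selected_w1 - selected_w2  # plays the role of CTE e3
only_w2 = selected_w2 - selected_w1  # plays the role of CTE e4
difference = only_w1 | only_w2       # identical to selected_w1 ^ selected_w2

print(sorted(difference))
```

An entity that both weights select, such as the shared key here, drops out of the result, so `diff_df` shows only where the two adjustments disagree.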

In [10]:
diff_df = get_difference(weight1=0.9, weight2=0.8)

In [11]:
print("Model Group IDs Used")
print(diff_df['model_group_id'].unique())
print("Train End Times Used")
print(diff_df['train_end_time'].unique())

Model Group IDs Used
[67]
Train End Times Used
['2014-06-01T00:00:00.000000000' '2015-09-01T00:00:00.000000000'
 '2014-12-01T00:00:00.000000000' '2015-06-01T00:00:00.000000000'
 '2014-09-01T00:00:00.000000000' '2015-12-01T00:00:00.000000000'
 '2015-03-01T00:00:00.000000000' '2016-03-01T00:00:00.000000000']


In [25]:
model_group_id = 67

In [26]:
train_end_time = None

In [27]:
diff_df = get_difference(weight1=0.1, weight2=0.9, model_group_id=model_group_id, train_end_time=train_end_time)

In [28]:
diff_df

| | entity_id | model_group_id | train_end_time | score | label_value | model_rank | rn_demo | weight | as_of_date | median_income | poverty_level | majority_white |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 644.0 | 67 | 2014-06-01 | 0.72932 | | 3 | 2 | 0.1 | 2011-03-01 | under55k | high | nonwhite |
| 1 | 644.0 | 67 | 2014-06-01 | 0.72932 | | 3 | 2 | 0.1 | 2011-06-01 | under55k | high | nonwhite |
| 2 | 644.0 | 67 | 2014-06-01 | 0.72932 | | 3 | 2 | 0.1 | 2011-09-01 | under55k | high | nonwhite |
| 3 | 644.0 | 67 | 2014-06-01 | 0.72932 | | 3 | 2 | 0.1 | 2011-12-01 | under55k | high | nonwhite |
| 4 | 644.0 | 67 | 2014-06-01 | 0.72932 | | 3 | 2 | 0.1 | 2012-03-01 | under55k | low | nonwhite |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 83177 | 323242.0 | 67 | 2015-06-01 | 0.73394 | | 240 | 107 | 0.9 | 2015-06-01 | over55k | high | white |
| 83178 | 323242.0 | 67 | 2015-06-01 | 0.73394 | | 240 | 107 | 0.9 | 2015-09-01 | over55k | high | white |
| 83179 | 323242.0 | 67 | 2015-06-01 | 0.73394 | | 240 | 107 | 0.9 | 2015-12-01 | over55k | high | white |
| 83180 | 323242.0 | 67 | 2015-06-01 | 0.73394 | | 240 | 107 | 0.9 | 2016-03-01 | over55k | high | white |


In [30]:
diff_df.groupby(["weight", demo_col, "label_value"]).count()["entity_id"]

weight  median_income  label_value
0.1     over55k        0.0            2736
                       1.0            9194
        under55k       0.0            1048
                       1.0            4666
0.9     over55k        0.0            2826
                       1.0            8876
        under55k       0.0            1420
                       1.0            4984
Name: entity_id, dtype: int64
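
The raw counts are easier to compare as the share of rows with a positive label in each weight and income group. The sketch below copies the counts from the output above into a plain dict (the helper name `positive_rate` is hypothetical). Note that these are row counts from `diff_df`, which appears to repeat an entity once per `as_of_date`, so they may not equal distinct-entity counts.

```python
# Counts copied from the groupby output: {(weight, group): {label_value: row count}}
counts = {
    (0.1, "over55k"): {0.0: 2736, 1.0: 9194},
    (0.1, "under55k"): {0.0: 1048, 1.0: 4666},
    (0.9, "over55k"): {0.0: 2826, 1.0: 8876},
    (0.9, "under55k"): {0.0: 1420, 1.0: 4984},
}


def positive_rate(label_counts):
    """Share of labeled rows whose label_value is 1.0."""
    return label_counts[1.0] / sum(label_counts.values())


rates = {key: round(positive_rate(c), 3) for key, c in counts.items()}
for (weight, group), rate in rates.items():
    print(f"weight={weight} {group}: {rate}")
```

Rows with a missing `label_value` are excluded by the `groupby` and therefore do not enter these shares.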

# Performing Entity Analysis

This section assumes that the `model_adjustment_results_{demo_col}` table is already populated correctly for model selection.

In [1]:
from DJRecallAdjuster import ra_procedure

In [2]:
ra_procedure(weights=None, pause_phases=False, entity_selection=True)

0.99
[('2014-06-01', '2014-06-01'), ('2014-06-01', '2014-09-01')]
[('2014-09-01', '2014-09-01'), ('2014-09-01', '2014-12-01')]
[('2014-12-01', '2014-12-01'), ('2014-12-01', '2015-03-01')]
[('2015-03-01', '2015-03-01'), ('2015-03-01', '2015-06-01')]
[('2015-06-01', '2015-06-01'), ('2015-06-01', '2015-09-01')]
[('2015-09-01', '2015-09-01'), ('2015-09-01', '2015-12-01')]
[('2015-12-01', '2015-12-01'), ('2015-12-01', '2016-03-01')]
[('2016-03-01', '2016-03-01'), ('2016-03-01', '2016-04-01')]
0.9
[('2014-06-01', '2014-06-01'), ('2014-06-01', '2014-09-01')]
[('2014-09-01', '2014-09-01'), ('2014-09-01', '2014-12-01')]
[('2014-12-01', '2014-12-01'), ('2014-12-01', '2015-03-01')]
[('2015-03-01', '2015-03-01'), ('2015-03-01', '2015-06-01')]
[('2015-06-01', '2015-06-01'), ('2015-06-01', '2015-09-01')]
[('2015-09-01', '2015-09-01'), ('2015-09-01', '2015-12-01')]
[('2015-12-01', '2015-12-01'), ('2015-12-01', '2016-03-01')]
[('2016-03-01', '2016-03-01'), ('2016-03-01', '2016-04-01')]
0.8
[('2014-06-