# MODEL 2: Chicago Women at Risk of Developing Breast Cancer (2017)
-----------------------------------------------------------------------------------------------------------

This is the second model Civis trained for our work with the Chicago Department of Public Health. This notebook goes through each step of Civis's process for predicting the population of Chicago women who are at risk of developing breast cancer. We follow a 2-step modeling process to estimate the percentage of women in each Census tract in Chicago who are at risk of developing breast cancer. In the first step, we'll train a model on individual-level data, where each row is a person. In the second step, we'll train a model on geographic-level data, where we use the same features as the individual-level model, but the values are aggregated by our geographic level of interest (in this case, Census tracts). We'll then use our model to predict the proportion of at-risk women in each Census tract in Chicago.

This 2-step modeling process helps us avoid a reverse ecological fallacy (also called an exception fallacy), that is, drawing conclusions about a whole group from a handful of individuals. This is a potential problem when modeling from survey response data.
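
To make the second step concrete, here is a minimal sketch of rolling individual-level rows up to a geographic level. The tract labels, the column choices, and the use of a simple mean as the aggregation are all assumptions made for illustration; they are not taken from the Civis pipeline.

```python
from collections import defaultdict
from statistics import mean

# Made-up individual-level rows: (tract label, age, at-risk flag).
# The tract labels are placeholders, not real Census tract IDs.
people = [
    ("tract_A", 34, 0),
    ("tract_A", 61, 1),
    ("tract_B", 47, 0),
    ("tract_B", 52, 0),
]


def aggregate_by_tract(rows):
    """Average each individual-level value within its tract (assumed aggregation)."""
    by_tract = defaultdict(list)
    for tract, age, at_risk in rows:
        by_tract[tract].append((age, at_risk))
    return {
        tract: {
            "mean_age": mean(age for age, _ in values),
            "share_at_risk": mean(flag for _, flag in values),
        }
        for tract, values in by_tract.items()
    }


tract_table = aggregate_by_tract(people)
print(tract_table)
```

Each row of the resulting table describes a tract rather than a person, which is the shape of data the geographic-level model is trained on.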

To conduct our analyses, we'll be using the Civis Analytics platform API to connect to our data, create tables, and query our data. Through the Civis API, we'll also be able to use CivisML, our machine learning package. We'll use this to train and test our models, as well as make predictions.

First, we'll set up our workspace.

To learn more about Civis Analytics and understand the data science platform we use to build this model, check out our website at the following link: https://www.civisanalytics.com/

#### NOTE: Some variable names and functions have been changed to protect proprietary information.

In [None]:
import civis

import pandas as pd
import numpy as np
from civis.ml import ModelPipeline  # we'll be using Civis's model pipeline to create and run our models

client = civis.APIClient()

------------------------------------------------------------------------------------------

# STEP 1: Create Individual-Level Training Table (female-only)

1.  Grab data for modeling
2.  Append responses from the survey we ran in 2017 (Oct - Nov 2017)
3.  Recode the survey responses to be the relative risk of having a specific breast cancer risk factor, using information from published studies.
4.  Create a variable for the baseline risk for breast cancer based on race and age, using race/age data from our modeling data (source: https://seer.cancer.gov/archive/csr/1975_2012/results_merged/topic_lifetime_risk.pdf; table 5.17). We used the relative risk value for developing cancer in the next 10 years for each age and race group. Groups with a relative risk of zero were assigned a relative risk of 0.0001 so that a zero baseline wouldn't force their overall relative risk to zero.
5.  Coalesce the breast cancer risk factors into one variable ("bc_risk") by multiplying them together to get the overall relative risk, using the baseline risk as the starting value.
6.  Split the breast cancer risk variable into a binary variable ("bc_risk_2cat"). We use 0.05 as the cut-off, as it is approximately double the median baseline risk for all races/ages.
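
The arithmetic behind the last two steps can be sketched in a few lines of Python. The respondent below is hypothetical; the baseline and relative risk values are the ones used in the SQL query in this notebook, and the helper names `overall_risk` and `is_at_risk` are ours, made up for illustration.

```python
from math import prod

# Hypothetical respondent: a white woman aged 40-49 (baseline 0.0146) whose
# first pregnancy was after 30 (RR 1.8) and who has a first-degree relative
# with breast cancer (RR 2.9). Every other factor stays at the baseline of 1.0.
baseline_risk = 0.0146
relative_risks = {
    "cdph_pregnant": 1.0,
    "cdph_pregnant_age": 1.8,
    "cdph_diagnosis_breastcancer": 1.0,
    "cdph_family_breastcancer": 2.9,
    "cdph_brca_test": 1.0,
    "cdph_oral_contraceptives": 1.0,
    "cdph_female_hormones": 1.0,
}

CUTOFF = 0.05  # roughly double the median baseline risk


def overall_risk(baseline, factors):
    """Multiply the baseline risk by every relative risk factor."""
    return baseline * prod(factors.values())


def is_at_risk(risk, cutoff=CUTOFF):
    """Return 1 when the overall risk reaches the cut-off, else 0."""
    return 1 if risk >= cutoff else 0


bc_risk = overall_risk(baseline_risk, relative_risks)
bc_risk_2cat = is_at_risk(bc_risk)
print(round(bc_risk, 4), bc_risk_2cat)
```

Here 0.0146 × 1.8 × 2.9 ≈ 0.0762, which is above the 0.05 cut-off, so this respondent would be labeled at risk. With no risk factors, the same woman would stay at her baseline of 0.0146 and be labeled 0.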



### Sources of relative risk values for each breast cancer risk factor:
- https://www.ncbi.nlm.nih.gov/books/NBK1247/ (Table 2)
- https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1514477/pdf/20030400s00007p474.pdf (Table 1)
- http://roserobixby.com/RoseroBixby/Publicaciones_files/61.pdf (Figure 2a)

In [None]:
%%civisquery -- this allows us to query our data using SQL in this cell (only when using Notebooks on the Civis Platform)


DROP VIEW IF EXISTS cdph_bcrisk_train_all_female;
DROP VIEW IF EXISTS cdph_bcrisk_train_basefile_female;
CREATE VIEW cdph_bcrisk_train_basefile_female AS
SELECT 
*,
-- CREATE BINARY VARIABLE USING RELATIVE RISK; CUT-OFF == DOUBLE THE APPROX. MEDIAN BASELINE RISK
CASE WHEN bc_risk >= 0.05 THEN 1 ELSE 0 END AS bc_risk_2cat
FROM
(
    SELECT *,
    -- MULTIPLY RELATIVE RISK VALUES TO GET OVERALL RELATIVE RISK VALUE
    (race_age_baseline_bcrisk 
     * cdph_pregnant::float 
     * cdph_pregnant_age::float 
     * cdph_diagnosis_breastcancer::float 
     * cdph_family_breastcancer::float 
     * cdph_brca_test::float 
     * cdph_oral_contraceptives::float 
     * cdph_female_hormones::float
    ) AS bc_risk
    FROM
    (
        SELECT data.*, 
        DECODE(s.cdph_insured,
               'Yes - through the government (Medicare, Medicaid)', 0,
               'Yes - through my employer or spouse/partner\'s employer, or I purchase it myself', 0,
               'No - I don\'t have health insurance', 1,
               NULL) AS cdph_uninsured,
        -- NEVER PREGNANT RR = 1.8 (baseline = first child before 20)
        DECODE(s.cdph_pregnant,
               'Yes', 1.0000,
               'No', 1.8000,
               1) AS cdph_pregnant,
        -- PREGNANT AFTER 30 RR = 1.8 (baseline = first child before 20)
        DECODE(s.cdph_pregnant_age,
               'less than 15 years old', 1.0000,
               '15-19 years old', 1.0000,
               '20-24 years old', 1.0000,
               '25-29 years old', 1.0000,
               '30-34 years old', 1.8000,
               '35-39 years old', 1.8000,
               '40+ years old', 1.8000,
               1.0000) AS cdph_pregnant_age,
        -- HISTORY OF BREAST CANCER = 6.8 (baseline = no history breast cancer)
        DECODE(s.cdph_diagnosis_breastcancer,
               'Yes', 6.8000,
               'No', 1.0000,
               1.0000) AS cdph_diagnosis_breastcancer,
        -- FIRST DEGREE RELATIVE WITH BREAST CANCER = 2.9 (baseline = no first degree relative with breast cancer)
        DECODE(s.cdph_family_breastcancer,
               'Yes', 2.9000,
               'No', 1.0000,
               1.0000) AS cdph_family_breastcancer,
        -- BRCA GENE TEST POSITIVE = 0.64/0.12 -- i.e. risk/general risk (baseline = no BRCA gene mutation)
        DECODE(s.cdph_brca_test,
               'No - I\'ve never had the BRCA gene blood test', 1.0000,
               'Yes - and it was negative', 1.0000, 
               'Yes - and it was positive', (0.6400 / 0.12),
               1.0000) AS cdph_brca_test,
        -- ORAL CONTRACEPTIVE LENGTH OF USE; ranges from 1.07 - 1.12 based on length of time (baseline = no oral contraceptive use)
        CASE 
            WHEN s.cdph_oral_contraceptives = 'Yes' OR s.cdph_oral_contraceptives = 'No - but I took birth control pills or oral contraceptives in the past' THEN
                CASE
                    WHEN s.cdph_oral_contraceptives_time = '1 year or less' THEN 1.0700
                    WHEN s.cdph_oral_contraceptives_time = '2-3 years' OR s.cdph_oral_contraceptives_time = '4-5 years' THEN 1.0500
                    WHEN s.cdph_oral_contraceptives_time = '6-9 years' THEN 1.0900
                    WHEN s.cdph_oral_contraceptives_time = '10+ years' THEN 1.1200
                    ELSE 1.0000
                END
            ELSE 1.0000
        END AS cdph_oral_contraceptives,
        -- USE FEMALE HORMONES FOR AT LEAST 5 YEARS = 1.35 (baseline = no use female hormones)
        CASE
            WHEN s.cdph_female_hormones = 'No - but I took female hormones or combined hormone therapy in the past' OR s.cdph_female_hormones = 'Yes' THEN
                CASE
                    WHEN s.cdph_female_hormones_time = '6-9 years' OR s.cdph_female_hormones_time = '10+ years' THEN 1.3500
                    ELSE 1.0000
                END
            ELSE 1.0000
        END AS cdph_female_hormones
        FROM
        (
            SELECT *
            FROM
            (
                SELECT *,
                ROW_NUMBER() OVER (PARTITION BY id) AS dupes -- remove duplicate survey respondents
                FROM cdph_survey
                WHERE id IS NOT NULL
            ) WHERE dupes = 1
        ) AS s 
        LEFT JOIN
        (
            SELECT *,
            -- https://seer.cancer.gov/archive/csr/1975_2012/results_merged/topic_lifetime_risk.pdf
            CASE
                WHEN race_white = 1 THEN
                    CASE
                        WHEN age < 10 THEN 0.0001
                        WHEN age >= 10 AND age < 20 THEN 0.0001
                        WHEN age >= 20 AND age < 30 THEN 0.0006
                        WHEN age >= 30 AND age < 40 THEN 0.0044
                        WHEN age >= 40 AND age < 50 THEN 0.0146
                        WHEN age >= 50 AND age < 60 THEN 0.0231
                        WHEN age >= 60 AND age < 70 THEN 0.0356
                        WHEN age >= 70 AND age < 80 THEN 0.0407
                        WHEN age >= 80 THEN 0.0312
                        ELSE 0.0001
                    END
                WHEN race_afam = 1 THEN
                    CASE
                        WHEN age < 10 THEN 0.0001
                        WHEN age >= 10 AND age < 20 THEN 0.0001
                        WHEN age >= 20 AND age < 30 THEN 0.0008
                        WHEN age >= 30 AND age < 40 THEN 0.0050
                        WHEN age >= 40 AND age < 50 THEN 0.0142
                        WHEN age >= 50 AND age < 60 THEN 0.0229
                        WHEN age >= 60 AND age < 70 THEN 0.0332
                        WHEN age >= 70 AND age < 80 THEN 0.0355
                        WHEN age >= 80 THEN 0.0283
                        ELSE 0.0001
                    END
                WHEN race_asian = 1 THEN
                    CASE
                        WHEN age < 10 THEN 0.0001
                        WHEN age >= 10 AND age < 20 THEN 0.0001
                        WHEN age >= 20 AND age < 30 THEN 0.0005
                        WHEN age >= 30 AND age < 40 THEN 0.0041
                        WHEN age >= 40 AND age < 50 THEN 0.0137
                        WHEN age >= 50 AND age < 60 THEN 0.0200
                        WHEN age >= 60 AND age < 70 THEN 0.0268
                        WHEN age >= 70 AND age < 80 THEN 0.0255
                        WHEN age >= 80 THEN 0.0190
                        ELSE 0.0001
                    END
                WHEN race_native = 1 THEN
                    CASE
                        WHEN age < 10 THEN 0.0001
                        WHEN age >= 10 AND age < 20 THEN 0.0001
                        WHEN age >= 20 AND age < 30 THEN 0.0003
                        WHEN age >= 30 AND age < 40 THEN 0.0028
                        WHEN age >= 40 AND age < 50 THEN 0.0083
                        WHEN age >= 50 AND age < 60 THEN 0.0148
                        WHEN age >= 60 AND age < 70 THEN 0.0250
                        WHEN age >= 70 AND age < 80 THEN 0.0272
                        WHEN age >= 80 THEN 0.0173
                        ELSE 0.0001
                    END
                WHEN race_hispanic = 1 THEN
                    CASE
                        WHEN age < 10 THEN 0.0001
                        WHEN age >= 10 AND age < 20 THEN 0.0001
                        WHEN age >= 20 AND age < 30 THEN 0.0005
                        WHEN age >= 30 AND age < 40 THEN 0.0034
                        WHEN age >= 40 AND age < 50 THEN 0.0111
                        WHEN age >= 50 AND age < 60 THEN 0.0181
                        WHEN age >= 60 AND age < 70 THEN 0.0261
                        WHEN age >= 70 AND age < 80 THEN 0.0271
                        WHEN age >= 80 THEN 0.0208
                        ELSE 0.0001
                    END
                ELSE 0.0001
            END AS race_age_baseline_bcrisk
            FROM modeling_data AS md
            LEFT JOIN
            (
                SELECT join_key, uninsured_2016
                FROM uninsured2016_score
            ) AS uninsured
            ON md.id = uninsured.join_key
            LEFT JOIN
            (
                SELECT join_key2, 
                gender,
                race
                FROM basic
            ) AS basic
            ON md.id = basic.join_key2
        ) AS data
        ON data.id = s.id
    )
    WHERE (cdph_pregnant IS NOT NULL 
         OR cdph_pregnant_age IS NOT NULL
         OR cdph_diagnosis_breastcancer IS NOT NULL
         OR cdph_family_breastcancer IS NOT NULL
         OR cdph_brca_test IS NOT NULL
         OR cdph_oral_contraceptives IS NOT NULL
         OR cdph_female_hormones IS NOT NULL)
         AND gender = 'Female' -- ONLY SELECT FEMALES
)
;


### Append BRFSS 2016 Data
We also appended Behavioral Risk Factor Surveillance System 2016 data (BRFSS). The data is available at the following link: https://www.cdc.gov/brfss/annual_data/annual_2016.html. 

Before joining on the BRFSS 2016 data, we first selected specific columns of interest and recoded the data. The script for this ETL was written in R and is available on GitHub. The name of the R script is "BRFSS2016_ETL.R". 

To append the BRFSS 2016 data to our original modeling data set, we grouped BRFSS 2016 participants by state, gender, race, and age. We then took the average values for each grouping, and joined them onto our modeling data set by their groups.
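
As a rough illustration of this grouping and join, the sketch below packs state, gender, race, and age into a single integer key (state FIPS code × 100000, plus the gender, race, and age codes used in the query that follows), averages a BRFSS-style column within each key, and looks up one person's group. The sample rows and the function name `demographics_key` are made up, and mapping unknown values to 0 is a simplification for this sketch.

```python
from collections import defaultdict
from statistics import mean

GENDER_CODES = {"Male": 10000, "Female": 20000}
RACE_CODES = {"White": 1000, "AfAm": 2000, "Asian": 3000,
              "Native": 4000, "Hispanic": 5000}


def demographics_key(state_fips, gender, race, age):
    """Pack state, gender, race, and age into one integer join key."""
    capped_age = min(age, 80)  # ages above 80 are capped at 80
    return (state_fips * 100000
            + GENDER_CODES.get(gender, 0)   # unknown gender -> 0 (sketch only)
            + RACE_CODES.get(race, 0)       # unknown race -> 0 (sketch only)
            + capped_age)


# Made-up BRFSS-style rows: (state FIPS, gender, race, age, obese flag)
brfss_rows = [
    (17, "Female", "Hispanic", 45, 1),
    (17, "Female", "Hispanic", 45, 0),
    (17, "Female", "White", 45, 0),
]

# Group by key and average, like GROUP BY demographics_join_key2 with AVG(...)
groups = defaultdict(list)
for state, gender, race, age, obese in brfss_rows:
    groups[demographics_key(state, gender, race, age)].append(obese)
group_means = {key: mean(values) for key, values in groups.items()}

# Look up one person's group; None means no matching group (as in a LEFT JOIN)
person_key = demographics_key(17, "Female", "Hispanic", 45)
person_obese = group_means.get(person_key)
print(person_key, person_obese)
```

Because each component occupies its own range of digits, two people share a key only when their state, gender, race, and (capped) age all match.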

In [None]:
%%civisquery


DROP VIEW IF EXISTS cdph_bcrisk_train_all_female;
CREATE VIEW cdph_bcrisk_train_all_female AS
SELECT *
FROM
(
    SELECT
    *
    FROM
    (
        SELECT *,
        (state_og + gender_og + race_og + age_og) AS demographics_join_key1
        FROM
        (
            SELECT 
            *,
            CASE
                WHEN state = 'AL' THEN 100000 
                WHEN state = 'AK' THEN 200000 
                WHEN state = 'AZ' THEN 400000 
                WHEN state = 'AR' THEN 500000 
                WHEN state = 'CA' THEN 600000 
                WHEN state = 'CO' THEN 800000 
                WHEN state = 'CT' THEN 900000 
                WHEN state = 'DE' THEN 1000000 
                WHEN state = 'DC' THEN 1100000 
                WHEN state = 'FL' THEN 1200000 
                WHEN state = 'GA' THEN 1300000 
                WHEN state = 'HI' THEN 1500000 
                WHEN state = 'ID' THEN 1600000 
                WHEN state = 'IL' THEN 1700000 
                WHEN state = 'IN' THEN 1800000 
                WHEN state = 'IA' THEN 1900000 
                WHEN state = 'KS' THEN 2000000 
                WHEN state = 'KY' THEN 2100000 
                WHEN state = 'LA' THEN 2200000 
                WHEN state = 'ME' THEN 2300000 
                WHEN state = 'MD' THEN 2400000 
                WHEN state = 'MA' THEN 2500000 
                WHEN state = 'MI' THEN 2600000 
                WHEN state = 'MN' THEN 2700000 
                WHEN state = 'MS' THEN 2800000 
                WHEN state = 'MO' THEN 2900000 
                WHEN state = 'MT' THEN 3000000 
                WHEN state = 'NE' THEN 3100000 
                WHEN state = 'NV' THEN 3200000 
                WHEN state = 'NH' THEN 3300000 
                WHEN state = 'NJ' THEN 3400000
                WHEN state = 'NM' THEN 3500000 
                WHEN state = 'NY' THEN 3600000
                WHEN state = 'NC' THEN 3700000 
                WHEN state = 'ND' THEN 3800000 
                WHEN state = 'OH' THEN 3900000 
                WHEN state = 'OK' THEN 4000000 
                WHEN state = 'OR' THEN 4100000 
                WHEN state = 'PA' THEN 4200000 
                WHEN state = 'RI' THEN 4400000 
                WHEN state = 'SC' THEN 4500000 
                WHEN state = 'SD' THEN 4600000 
                WHEN state = 'TN' THEN 4700000 
                WHEN state = 'TX' THEN 4800000 
                WHEN state = 'UT' THEN 4900000 
                WHEN state = 'VT' THEN 5000000
                WHEN state = 'VA' THEN 5100000 
                WHEN state = 'WA' THEN 5300000 
                WHEN state = 'WV' THEN 5400000 
                WHEN state = 'WI' THEN 5500000 
                WHEN state = 'WY' THEN 5600000 
                WHEN state = 'GU' THEN 6600000 
                WHEN state = 'PR' THEN 7200000 
                WHEN state = 'VI' THEN 7800000 
                ELSE 0
                END AS state_og,
            CASE
                WHEN age > 80 THEN 80
                ELSE age
                END AS age_og,
            CASE
                WHEN gender = 'Female' THEN 20000
                WHEN gender = 'Male' THEN 10000
                ELSE 0
                END AS gender_og,
            CASE
                WHEN race = 'White' THEN 1000
                WHEN race = 'AfAm' THEN 2000
                WHEN race = 'Asian' THEN 3000
                WHEN race = 'Native' THEN 4000
                WHEN race = 'Hispanic' THEN 5000
                END AS race_og
            FROM cdph_bcrisk_train_basefile_female
        )
    ) AS og
    LEFT JOIN
    (
        SELECT
        demographics_join_key2, 
        COALESCE(AVG(breastcncr_nulls), 0) AS breastcncr, COALESCE(AVG(obese), 0) AS obese,
        COALESCE(AVG(pvtresd), 0) AS pvtresd, COALESCE(AVG(colg_hous), 0) AS colg_hous,
        COALESCE(AVG(gen_hlth), 0) AS gen_hlth, COALESCE(AVG(phys_hlth), 0) AS phys_hlth,
        COALESCE(AVG(ment_hlth), 0) AS ment_hlth, COALESCE(AVG(poor_hlth), 0) AS poor_hlth,
        COALESCE(AVG(hlth_pln), 0) AS hlth_pln, COALESCE(AVG(persdoc), 0) AS persdoc,
        COALESCE(AVG(doc_toocostly), 0) AS doc_toocostly, COALESCE(AVG(checkup), 0) AS checkup,
        COALESCE(AVG(exercise), 0) AS exercise, COALESCE(AVG(sleephrs), 0) AS sleephrs,
        COALESCE(AVG(heartattack), 0) AS heartattack, COALESCE(AVG(coronaryheartdisease), 0) AS coronaryheartdisease,
        COALESCE(AVG(stroke), 0) AS stroke, COALESCE(AVG(asthma), 0) AS asthma,
        COALESCE(AVG(asthma_now), 0) AS asthma_now, COALESCE(AVG(skin_cancer), 0) AS skin_cancer,
        COALESCE(AVG(cancer), 0) AS cancer, COALESCE(AVG(copd), 0) AS copd,
        COALESCE(AVG(arthritis), 0) AS arthritis, COALESCE(AVG(depression), 0) AS depression,
        COALESCE(AVG(kidneydisease), 0) AS kidneydisease, COALESCE(AVG(diabetes), 0) AS diabetes,
        COALESCE(AVG(borderline_diab), 0) AS borderline_diab, COALESCE(AVG(preg_diab), 0) AS preg_diab,
        COALESCE(AVG(age_diabetes), 0) AS age_diabetes, COALESCE(AVG(student), 0) AS student,
        COALESCE(AVG(numchildren), 0) AS numchildren, COALESCE(AVG(internet_use), 0) AS internet_use,
        COALESCE(AVG(weight_kg), 0) AS weight_kg, COALESCE(AVG(height_in), 0) AS height_in,
        COALESCE(AVG(pregnant_now), 0) AS pregnant_now, COALESCE(AVG(hard_hearing), 0) AS hard_hearing,
        COALESCE(AVG(hard_seeing), 0) AS hard_seeing, COALESCE(AVG(diff_decide), 0) AS diff_decide,
        COALESCE(AVG(diff_walk), 0) AS diff_walk, COALESCE(AVG(diff_dress), 0) AS diff_dress,
        COALESCE(AVG(diff_alone), 0) AS diff_alone, COALESCE(AVG(smoked), 0) AS smoked,
        COALESCE(AVG(smoke_freq), 0) AS smoke_freq, COALESCE(AVG(snuff_freq), 0) AS snuff_freq,
        COALESCE(AVG(ecig_freq), 0) AS ecig_freq, COALESCE(AVG(last_smoke), 0) AS last_smoke,
        COALESCE(AVG(alc_past_week), 0) AS alc_past_week, COALESCE(AVG(alc_past_month), 0) AS alc_past_month,
        COALESCE(AVG(avg_drinks), 0) AS avg_drinks, COALESCE(AVG(flu_vacc), 0) AS flu_vacc,
        COALESCE(AVG(pnem_vacc), 0) AS pnem_vacc, COALESCE(AVG(tetanus_vacc), 0) AS tetanus_vacc,
        COALESCE(AVG(num_fall), 0) AS num_fall, COALESCE(AVG(num_bad_falls), 0) AS num_bad_falls,
        COALESCE(AVG(seatbelt_use), 0) AS seatbelt_use, COALESCE(AVG(drunk_drive), 0) AS drunk_drive,
        COALESCE(AVG(mammogram), 0) AS mammogram, COALESCE(AVG(last_mammogram), 0) AS last_mammogram,
        COALESCE(AVG(pap), 0) AS pap, COALESCE(AVG(last_pap), 0) AS last_pap,
        COALESCE(AVG(hpv_test), 0) AS hpv_test, COALESCE(AVG(last_hpvtest), 0) AS last_hpvtest,
        COALESCE(AVG(hpv_vacc), 0) AS hpv_vacc, COALESCE(AVG(hysterectomy), 0) AS hysterectomy,
        COALESCE(AVG(psa_test_discussion), 0) AS psa_test_discussion, COALESCE(AVG(psa_test_suggest), 0) AS psa_test_suggest,
        COALESCE(AVG(psa_test), 0) AS psa_test, COALESCE(AVG(last_psatest), 0) AS last_psatest,
        COALESCE(AVG(fam_history_prostatecncr), 0) AS fam_history_prostatecncr, COALESCE(AVG(blood_stool_test), 0) AS blood_stool_test,
        COALESCE(AVG(last_bloodstooltest), 0) AS last_bloodstooltest, COALESCE(AVG(had_colsig), 0) AS had_colsig,
        COALESCE(AVG(sigmoidoscopy), 0) AS sigmoidoscopy, COALESCE(AVG(colonoscopy), 0) AS colonoscopy,
        COALESCE(AVG(last_colsig), 0) AS last_colsig, COALESCE(AVG(hiv_test), 0) AS hiv_test,
        COALESCE(AVG(hiv_risk), 0) AS hiv_risk, COALESCE(AVG(diabetes_test), 0) AS diabetes_test,
        COALESCE(AVG(insulin_now), 0) AS insulin_now, COALESCE(AVG(diabetes_consult), 0) AS diabetes_consult,
        COALESCE(AVG(feet_check), 0) AS feet_check, COALESCE(AVG(last_eyeexam), 0) AS last_eyeexam,
        COALESCE(AVG(retinopathy), 0) AS retinopathy, COALESCE(AVG(diab_edu), 0) AS diab_edu,
        COALESCE(AVG(pain_days), 0) AS pain_days, COALESCE(AVG(sad_days), 0) AS sad_days,
        COALESCE(AVG(anxious_days), 0) AS anxious_days, COALESCE(AVG(energized_days), 0) AS energized_days,
        COALESCE(AVG(medicare_now), 0) AS medicare_now, COALESCE(AVG(hc_employer), 0) AS hc_employer,
        COALESCE(AVG(hc_personal), 0) AS hc_personal, COALESCE(AVG(hc_medicare), 0) AS hc_medicare,
        COALESCE(AVG(hc_medicaid), 0) AS hc_medicaid, COALESCE(AVG(hc_tricare), 0) AS hc_tricare,
        COALESCE(AVG(hc_native), 0) AS hc_native, COALESCE(AVG(hc_none_current), 0) AS hc_none_current,
        COALESCE(AVG(delay_appt), 0) AS delay_appt, COALESCE(AVG(delay_wait), 0) AS delay_wait,
        COALESCE(AVG(delay_transport), 0) AS delay_transport, COALESCE(AVG(hc_none_thisyr), 0) AS hc_none_thisyr,
        COALESCE(AVG(drvisit), 0) AS drvisit, COALESCE(AVG(med_toocostly), 0) AS med_toocostly,
        COALESCE(AVG(satisfied_care), 0) AS satisfied_care, COALESCE(AVG(med_bills), 0) AS med_bills,
        COALESCE(AVG(get_medadvic), 0) AS get_medadvic, COALESCE(AVG(undrstnd_medadvic), 0) AS undrstnd_medadvic,
        COALESCE(AVG(understand_writtenmedadvic), 0) AS understand_writtenmedadvic, COALESCE(AVG(caregiver), 0) AS caregiver,
        COALESCE(AVG(mem_loss), 0) AS mem_loss, COALESCE(AVG(mem_loss_assist), 0) AS mem_loss_assist,
        COALESCE(AVG(mem_loss_gethelp), 0) AS mem_loss_gethelp, COALESCE(AVG(mem_loss_inhibit), 0) AS mem_loss_inhibit,
        COALESCE(AVG(mem_loss_doctor), 0) AS mem_loss_doctor, COALESCE(AVG(soda_day), 0) AS soda_day,
        COALESCE(AVG(soda_week), 0) AS soda_week, COALESCE(AVG(soda_month), 0) AS soda_month,
        COALESCE(AVG(sugarbev_day), 0) AS sugarbev_day, COALESCE(AVG(subarbev_week), 0) AS sugarbev_week,
        COALESCE(AVG(sugarbev_month), 0) AS sugarbev_month, COALESCE(AVG(calorie), 0) AS calorie,
        COALESCE(AVG(marijuana), 0) AS marijuana, COALESCE(AVG(shingles_vacc), 0) AS shingles_vacc,
        COALESCE(AVG(num_types_cncr), 0) AS num_types_cncr, COALESCE(AVG(age_diagnosis_cncr), 0) AS age_diagnosis_cncr,
        COALESCE(AVG(cncr_insurance), 0) AS cncr_insurance, COALESCE(AVG(breast_exam), 0) AS breast_exam,
        COALESCE(AVG(last_breastexam), 0) AS last_breastexam, COALESCE(AVG(hetero), 0) AS hetero,
        COALESCE(AVG(homo), 0) AS homo, COALESCE(AVG(bisexual), 0) AS bisexual,
        COALESCE(AVG(transgender), 0) AS transgender, COALESCE(AVG(emo_support), 0) AS emo_support,
        COALESCE(AVG(life_dissatisfaction), 0) AS life_dissatisfaction, COALESCE(AVG(life_limited), 0) AS life_limited,
        COALESCE(AVG(special_equip), 0) AS special_equip, COALESCE(AVG(english), 0) AS english,
        COALESCE(AVG(spanish), 0) AS spanish, COALESCE(AVG(bmi), 0) AS bmi
        FROM
        (
            SELECT *,
            (state_brfss + gender_brfss + race_brfss + age_brfss) AS demographics_join_key2
            FROM
            (
                SELECT *,
                DECODE(state,
                      1, 100000, -- AL
                      2, 200000, -- AK
                      4, 400000, -- AZ
                      5, 500000, -- AR
                      6, 600000, -- CA
                      8, 800000, -- CO
                      9, 900000, -- CT
                      10, 1000000, -- DE
                      11, 1100000, -- DC
                      12, 1200000, -- FL
                      13, 1300000, -- GA
                      15, 1500000, -- HI
                      16, 1600000, -- ID
                      17, 1700000, -- IL
                      18, 1800000, -- IN
                      19, 1900000, -- IA
                      20, 2000000, -- KS
                      21, 2100000, -- KY
                      22, 2200000, -- LA
                      23, 2300000, -- ME
                      24, 2400000, -- MD
                      25, 2500000, -- MA
                      26, 2600000, -- MI
                      27, 2700000, -- MN
                      28, 2800000, -- MS
                      29, 2900000, -- MO
                      30, 3000000, -- MT
                      31, 3100000, -- NE
                      32, 3200000, -- NV
                      33, 3300000, -- NH
                      34, 3400000, -- NJ 
                      35, 3500000, -- NM
                      36, 3600000, -- NY 
                      37, 3700000, -- NC
                      38, 3800000, -- ND
                      39, 3900000, -- OH
                      40, 4000000, -- OK
                      41, 4100000, -- OR
                      42, 4200000, -- PA
                      44, 4400000, -- RI
                      45, 4500000, -- SC
                      46, 4600000, -- SD
                      47, 4700000, -- TN
                      48, 4800000, -- TX
                      49, 4900000, -- UT
                      50, 5000000, -- VT 
                      51, 5100000, -- VA
                      53, 5300000, -- WA
                      54, 5400000, -- WV
                      55, 5500000, -- WI
                      56, 5600000, -- WY
                      66, 6600000, -- GU
                      72, 7200000, -- PR
                      78, 7800000, -- VI
                       NULL) AS state_brfss,
                DECODE(female,
                       1, 20000,
                       0, 10000,
                       0) AS gender_brfss,
                CASE
                    WHEN white = 1 THEN 1000
                    WHEN black = 1 THEN 2000
                    WHEN asian = 1 THEN 3000
                    WHEN native = 1 THEN 4000
                    WHEN hispanic = 1 THEN 5000
                    ELSE 0
                    END AS race_brfss,
                age AS age_brfss
                FROM brfss2016
            )
        )
        GROUP BY demographics_join_key2
    ) AS brfss
    ON og.demographics_join_key1 = brfss.demographics_join_key2
)
;


As our training table has over 500 columns, we'll reduce our training data to only the most predictive features. We use a script written by a member of Civis's Data Science Research & Development (DSRD) team to identify the most predictive features for our dependent variable.

Then, we'll create a training table with just these features. This will improve the computational performance of our modeling process, as well as remove redundant features that capture similar patterns in the data.
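
The DSRD script itself is not shown here. As a stand-in only, the sketch below ranks features by the absolute value of their correlation with a binary dependent variable and keeps the top `k`. This is a much simpler technique than whatever the real script does, and the data and the function names `pearson` and `top_features` are made up for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either column has no variance."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        return 0.0
    return cov / (spread_x * spread_y)


def top_features(columns, target, k):
    """Return the k feature names most correlated (in absolute value) with target."""
    scores = {name: abs(pearson(values, target)) for name, values in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Made-up feature columns and dependent variable
columns = {
    "age": [25, 38, 52, 67, 71],
    "children_in_hh": [2, 0, 1, 0, 1],
    "constant": [1, 1, 1, 1, 1],
}
target = [0, 0, 1, 1, 1]

selected = top_features(columns, target, 2)
print(selected)
```

A univariate ranking like this cannot detect redundant features on its own, so treat it as a first pass rather than a replacement for a proper feature selection routine.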

In [None]:
%%civisquery

DROP VIEW IF EXISTS cdph_bcrisk_train_topfeatures_female;
CREATE VIEW cdph_bcrisk_train_topfeatures_female AS
SELECT 
id, -- primary key
bc_risk_2cat, -- dependent variable
weight,
parent,
age,
age_65_plus,
age_35_49,
age_18_34,
age_50_64,
last_smoke,
borderline_diab,
children_in_hh,
retired,
last_colsig,
poor_hlth,
last_bloodstooltest,
age_diabetes
FROM cdph_bcrisk_train_all_female
;

------------------------------------------------------------------------

# STEP 2: Train Individual-level Model
To train our model, we'll first specify a few classifier models offered in CivisML, the modeling package we use at Civis.

In [None]:
models = {
    'sparse_logistic': {},
    'extra_trees_classifier': "hyperband",
    'gradient_boosting_classifier': "hyperband",
    'random_forest_classifier': "hyperband"
} 

futures = []



Next, we'll loop through each of the models we specified. CivisML allows us to tune the hyperparameters of our models using "hyperband", which is a more efficient approach to hyperparameter optimization than an exhaustive grid search.
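
To give a feel for why this is efficient, the sketch below shows successive halving, the subroutine that Hyperband repeats with different starting budgets. It is a toy illustration and not CivisML's implementation; the scoring function, the candidate values, and the function names are all made up.

```python
def successive_halving(configs, evaluate, min_budget=1, eta=3):
    """Return the best config, dropping all but the top 1/eta each round."""
    survivors = list(configs)
    budget = min_budget
    while len(survivors) > 1:
        # Score every surviving config using the current (small) budget.
        ranked = sorted(survivors, key=lambda cfg: evaluate(cfg, budget), reverse=True)
        # Keep the top 1/eta and give them eta times more budget next round.
        survivors = ranked[:max(1, len(ranked) // eta)]
        budget *= eta
    return survivors[0]


def toy_score(learning_rate, budget):
    """Hypothetical validation score that peaks at 0.3.

    A real scorer would train for `budget` iterations; this toy ignores it.
    """
    return -abs(learning_rate - 0.3)


candidates = [i / 10 for i in range(9)]  # 0.0, 0.1, ..., 0.8
best = successive_halving(candidates, toy_score)
print(best)
```

Most candidates are discarded after only a cheap evaluation, so the bulk of the compute budget goes to the few configurations that look promising.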

For each model, we'll create a pipeline where we provide the dependent variable and a primary key (i.e. a column with a unique identifier for each row or observation). We can also choose to exclude specific columns. We will then train the model using the training table we created.

The CivisML modeling pipeline automatically uses stratified k-fold cross validation to cross validate the model.
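
For readers unfamiliar with the term, stratified k-fold cross validation splits the rows into k folds so that each fold keeps roughly the same share of each class. The sketch below shows one simple way to build such folds by dealing out each class's rows in turn. It illustrates the idea only; it is not how CivisML assigns folds, and the function name `stratified_folds` is made up.

```python
from collections import defaultdict


def stratified_folds(labels, k):
    """Assign row indices to k folds, spreading each class evenly across folds."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        # Deal this class's rows out to the folds one at a time.
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


# Made-up labels: 2 positives out of 8 rows (25% positive)
labels = [1, 0, 0, 0, 1, 0, 0, 0]
folds = stratified_folds(labels, 2)
print(folds)
```

Each of the two folds ends up with one positive and three negatives, so both preserve the 25% positive rate of the full data set. This matters for a dependent variable like bc_risk_2cat when one class is much rarer than the other.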


###  Create Pipeline for Each Model and Train 

In [None]:
## Iterate through each model, create its pipeline, and kick off training

for model_type, cv_params in models.items():

    print(f"Currently training model: {model_type}")
    print(f"Params: {cv_params}")
    print("--------------------------")

    m = ModelPipeline(model=model_type,
                      model_name=f"{model_type}, params: {cv_params}, DV : bc_risk_2cat",
                      dependent_variable='bc_risk_2cat',
                      primary_key='id',
                      cross_validation_parameters=cv_params,
                      memory_requested=5000,
                      cpu_requested=1024)

    # train() returns a future right away, so the loop does not wait for each model to finish
    train = m.train(table_name='cdph_bcrisk_train_topfeatures_female',
                    fit_params={'sample_weight': 'weight'},
                    database_name='database')

    futures.append(train)

Check to see if the models are running in the Civis Platform:

In [None]:
# check to see that model jobs have been kicked off in the Civis Platform
for f in futures:
    print("Model running?  " + str(f.running()))
    print("Job ID: " + str(f.job_id))
    print("Train Job ID: " + str(f.train_job_id))
    print("Run ID: " + str(f.train_run_id))
    print("------------------------------------------")

----------------------------------------------------------------------------------------
# STEP 3: Compare Individual-Level Model Performance

After our models have finished running, we'll print out the metrics for each one and compare them. We'll select the best performing model to score our dataset.

#### Among the models we tested, the random forest classifier model ended up having the best performance.
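
The two metrics we compare are the AUC and the confusion matrix. As a reminder of what they measure, here is a small from-scratch sketch on made-up scores. The 0.5 threshold and the matrix layout (rows are actual classes, columns are predicted classes) are choices made for this sketch and may differ from what CivisML reports.

```python
def roc_auc(y_true, scores):
    """AUC as the share of positive/negative pairs the model ranks correctly.

    Ties count as half a correct pair.
    """
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


def confusion_matrix(y_true, scores, threshold=0.5):
    """Rows are actual classes (0, 1); columns are predicted classes (0, 1)."""
    matrix = [[0, 0], [0, 0]]
    for y, s in zip(y_true, scores):
        matrix[y][int(s >= threshold)] += 1
    return matrix


# Made-up labels and predicted probabilities
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

auc = roc_auc(y_true, scores)
matrix = confusion_matrix(y_true, scores)
print(auc, matrix)
```

An AUC of 0.5 means the ranking is no better than chance and 1.0 means every positive outranks every negative. In this example three of the four positive/negative pairs are ordered correctly, giving 0.75.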

In [None]:
# print model metrics once they have finished training
for f in futures:  # the individual-level model futures created in STEP 2
    print("\n************************************\n")
    if not f.running() and f.metadata['run']['status'] != "exception":
        print("MODEL: " + f.metadata['model']['model'])
        print("DV: " + f.metadata['run']['configuration']['data']['y'][0])
        print("TRAINING TABLE: " + f.metadata['data_platform']['table_source']['tablename'])
        print("Job ID: " + str(f.job_id))
        try:
            print("\n-----------------------\n")
            print("AUC: " + str(f.metrics['roc_auc']))
            print("\n------------------------\n")
            print("CONFUSION MATRIX:  " + str(f.metrics['confusion_matrix']))
            print("\n------------------------------------\n")
            print("BEST PARAMS:")
            print(f.metadata['model']['cv_best_params'])
        except Exception as err:
            # report missing metrics instead of silently skipping them
            print("Metrics unavailable: " + str(err))
        print("\n************************************\n")
    else:
        print("Model still running or ended with an exception")



----------------------------------------------------------------------------------------
# STEP 4: Create Individual-Level Scoring Table and Score

We'll create a scoring table with the same features as our training set. This scoring table only has data on females.

In [None]:
%%civisquery


DROP VIEW IF EXISTS score_table_bcrisk_female;
CREATE VIEW score_table_bcrisk_female AS
SELECT *
FROM
(
    SELECT * 
    FROM
    (
        SELECT *,
        (D.state_og + D.gender_og + D.race_og + D.age_og) AS demographics_join_key1
        FROM
        (
            SELECT 
            *,
            CASE
                WHEN C.state = 'AL' THEN 100000 
                WHEN C.state = 'AK' THEN 200000 
                WHEN C.state = 'AZ' THEN 400000 
                WHEN C.state = 'AR' THEN 500000 
                WHEN C.state = 'CA' THEN 600000 
                WHEN C.state = 'CO' THEN 800000 
                WHEN C.state = 'CT' THEN 900000 
                WHEN C.state = 'DE' THEN 1000000 
                WHEN C.state = 'DC' THEN 1100000 
                WHEN C.state = 'FL' THEN 1200000 
                WHEN C.state = 'GA' THEN 1300000 
                WHEN C.state = 'HI' THEN 1500000 
                WHEN C.state = 'ID' THEN 1600000 
                WHEN C.state = 'IL' THEN 1700000 
                WHEN C.state = 'IN' THEN 1800000 
                WHEN C.state = 'IA' THEN 1900000 
                WHEN C.state = 'KS' THEN 2000000 
                WHEN C.state = 'KY' THEN 2100000 
                WHEN C.state = 'LA' THEN 2200000 
                WHEN C.state = 'ME' THEN 2300000 
                WHEN C.state = 'MD' THEN 2400000 
                WHEN C.state = 'MA' THEN 2500000 
                WHEN C.state = 'MI' THEN 2600000 
                WHEN C.state = 'MN' THEN 2700000 
                WHEN C.state = 'MS' THEN 2800000 
                WHEN C.state = 'MO' THEN 2900000 
                WHEN C.state = 'MT' THEN 3000000 
                WHEN C.state = 'NE' THEN 3100000 
                WHEN C.state = 'NV' THEN 3200000 
                WHEN C.state = 'NH' THEN 3300000 
                WHEN C.state = 'NJ' THEN 3400000
                WHEN C.state = 'NM' THEN 3500000 
                WHEN C.state = 'NY' THEN 3600000
                WHEN C.state = 'NC' THEN 3700000 
                WHEN C.state = 'ND' THEN 3800000 
                WHEN C.state = 'OH' THEN 3900000 
                WHEN C.state = 'OK' THEN 4000000 
                WHEN C.state = 'OR' THEN 4100000 
                WHEN C.state = 'PA' THEN 4200000 
                WHEN C.state = 'RI' THEN 4400000 
                WHEN C.state = 'SC' THEN 4500000 
                WHEN C.state = 'SD' THEN 4600000 
                WHEN C.state = 'TN' THEN 4700000 
                WHEN C.state = 'TX' THEN 4800000 
                WHEN C.state = 'UT' THEN 4900000 
                WHEN C.state = 'VT' THEN 5000000
                WHEN C.state = 'VA' THEN 5100000 
                WHEN C.state = 'WA' THEN 5300000 
                WHEN C.state = 'WV' THEN 5400000 
                WHEN C.state = 'WI' THEN 5500000 
                WHEN C.state = 'WY' THEN 5600000 
                WHEN C.state = 'GU' THEN 6600000 
                WHEN C.state = 'PR' THEN 7200000 
                WHEN C.state = 'VI' THEN 7800000 
                ELSE 0
                END AS state_og,
            CASE
                WHEN C.age > 80 THEN 80
                ELSE C.age
                END AS age_og,
            CASE
                WHEN C.gender = 'Female' THEN 20000
                WHEN C.gender = 'Male' THEN 10000
                ELSE 0
                END AS gender_og,
            CASE
                WHEN C.race = 'White' THEN 1000
                WHEN C.race = 'AfAm' THEN 2000
                WHEN C.race = 'Asian' THEN 3000
                WHEN C.race = 'Native' THEN 4000
                WHEN C.race = 'Hispanic' THEN 5000
                ELSE 0
                END AS race_og
            FROM
            (
                SELECT bridge.tract, bridge.ca, A.*, B.score AS uninsured_2016, basic.*
                FROM
                (
                    SELECT id AS join_key, tract, ca
                    FROM area_id_bridge
                    WHERE ca IS NOT NULL
                ) AS bridge
                LEFT JOIN
                (
                    SELECT *
                    FROM modeling_data
                ) AS A
                ON bridge.join_key = A.id
                LEFT JOIN
                (
                    SELECT score, id AS join_key2
                    FROM uninsured2016_score 
                ) AS B
                ON A.id = B.join_key2
                LEFT JOIN
                (
                    SELECT id AS join_key3,
                    gender,
                    race
                    FROM basic
                ) AS basic
                ON A.id = basic.join_key3
            ) AS C
            WHERE C.uninsured_2016 IS NOT NULL
            AND C.gender = 'Female'    -- only select women
            AND C.ca IS NOT NULL       -- only select Chicago residents
        ) AS D
    ) AS F
    LEFT JOIN
    (
        SELECT
        demographics_join_key2, 
        COALESCE(AVG(breastcncr_nulls), 0) AS breastcncr, COALESCE(AVG(obese), 0) AS obese,
        COALESCE(AVG(pvtresd), 0) AS pvtresd, COALESCE(AVG(colg_hous), 0) AS colg_hous,
        COALESCE(AVG(gen_hlth), 0) AS gen_hlth, COALESCE(AVG(phys_hlth), 0) AS phys_hlth,
        COALESCE(AVG(ment_hlth), 0) AS ment_hlth, COALESCE(AVG(poor_hlth), 0) AS poor_hlth,
        COALESCE(AVG(hlth_pln), 0) AS hlth_pln, COALESCE(AVG(persdoc), 0) AS persdoc,
        COALESCE(AVG(doc_toocostly), 0) AS doc_toocostly, COALESCE(AVG(checkup), 0) AS checkup,
        COALESCE(AVG(exercise), 0) AS exercise, COALESCE(AVG(sleephrs), 0) AS sleephrs,
        COALESCE(AVG(heartattack), 0) AS heartattack, COALESCE(AVG(coronaryheartdisease), 0) AS coronaryheartdisease,
        COALESCE(AVG(stroke), 0) AS stroke, COALESCE(AVG(asthma), 0) AS asthma,
        COALESCE(AVG(asthma_now), 0) AS asthma_now, COALESCE(AVG(skin_cancer), 0) AS skin_cancer,
        COALESCE(AVG(cancer), 0) AS cancer, COALESCE(AVG(copd), 0) AS copd,
        COALESCE(AVG(arthritis), 0) AS arthritis, COALESCE(AVG(depression), 0) AS depression,
        COALESCE(AVG(kidneydisease), 0) AS kidneydisease, COALESCE(AVG(diabetes), 0) AS diabetes,
        COALESCE(AVG(borderline_diab), 0) AS borderline_diab, COALESCE(AVG(preg_diab), 0) AS preg_diab,
        COALESCE(AVG(age_diabetes), 0) AS age_diabetes, COALESCE(AVG(student), 0) AS student,
        COALESCE(AVG(numchildren), 0) AS numchildren, COALESCE(AVG(internet_use), 0) AS internet_use,
        COALESCE(AVG(weight_kg), 0) AS weight_kg, COALESCE(AVG(height_in), 0) AS height_in,
        COALESCE(AVG(pregnant_now), 0) AS pregnant_now, COALESCE(AVG(hard_hearing), 0) AS hard_hearing,
        COALESCE(AVG(hard_seeing), 0) AS hard_seeing, COALESCE(AVG(diff_decide), 0) AS diff_decide,
        COALESCE(AVG(diff_walk), 0) AS diff_walk, COALESCE(AVG(diff_dress), 0) AS diff_dress,
        COALESCE(AVG(diff_alone), 0) AS diff_alone, COALESCE(AVG(smoked), 0) AS smoked,
        COALESCE(AVG(smoke_freq), 0) AS smoke_freq, COALESCE(AVG(snuff_freq), 0) AS snuff_freq,
        COALESCE(AVG(ecig_freq), 0) AS ecig_freq, COALESCE(AVG(last_smoke), 0) AS last_smoke,
        COALESCE(AVG(alc_past_week), 0) AS alc_past_week, COALESCE(AVG(alc_past_month), 0) AS alc_past_month,
        COALESCE(AVG(avg_drinks), 0) AS avg_drinks, COALESCE(AVG(flu_vacc), 0) AS flu_vacc,
        COALESCE(AVG(pnem_vacc), 0) AS pnem_vacc, COALESCE(AVG(tetanus_vacc), 0) AS tetanus_vacc,
        COALESCE(AVG(num_fall), 0) AS num_fall, COALESCE(AVG(num_bad_falls), 0) AS num_bad_falls,
        COALESCE(AVG(seatbelt_use), 0) AS seatbelt_use, COALESCE(AVG(drunk_drive), 0) AS drunk_drive,
        COALESCE(AVG(mammogram), 0) AS mammogram, COALESCE(AVG(last_mammogram), 0) AS last_mammogram,
        COALESCE(AVG(pap), 0) AS pap, COALESCE(AVG(last_pap), 0) AS last_pap,
        COALESCE(AVG(hpv_test), 0) AS hpv_test, COALESCE(AVG(last_hpvtest), 0) AS last_hpvtest,
        COALESCE(AVG(hpv_vacc), 0) AS hpv_vacc, COALESCE(AVG(hysterectomy), 0) AS hysterectomy,
        COALESCE(AVG(psa_test_discussion), 0) AS psa_test_discussion, COALESCE(AVG(psa_test_suggest), 0) AS psa_test_suggest,
        COALESCE(AVG(psa_test), 0) AS psa_test, COALESCE(AVG(last_psatest), 0) AS last_psatest,
        COALESCE(AVG(fam_history_prostatecncr), 0) AS fam_history_prostatecncr, COALESCE(AVG(blood_stool_test), 0) AS blood_stool_test,
        COALESCE(AVG(last_bloodstooltest), 0) AS last_bloodstooltest, COALESCE(AVG(had_colsig), 0) AS had_colsig,
        COALESCE(AVG(sigmoidoscopy), 0) AS sigmoidoscopy, COALESCE(AVG(colonoscopy), 0) AS colonoscopy,
        COALESCE(AVG(last_colsig), 0) AS last_colsig, COALESCE(AVG(hiv_test), 0) AS hiv_test,
        COALESCE(AVG(hiv_risk), 0) AS hiv_risk, COALESCE(AVG(diabetes_test), 0) AS diabetes_test,
        COALESCE(AVG(insulin_now), 0) AS insulin_now, COALESCE(AVG(diabetes_consult), 0) AS diabetes_consult,
        COALESCE(AVG(feet_check), 0) AS feet_check, COALESCE(AVG(last_eyeexam), 0) AS last_eyeexam,
        COALESCE(AVG(retinopathy), 0) AS retinopathy, COALESCE(AVG(diab_edu), 0) AS diab_edu,
        COALESCE(AVG(pain_days), 0) AS pain_days, COALESCE(AVG(sad_days), 0) AS sad_days,
        COALESCE(AVG(anxious_days), 0) AS anxious_days, COALESCE(AVG(energized_days), 0) AS energized_days,
        COALESCE(AVG(medicare_now), 0) AS medicare_now, COALESCE(AVG(hc_employer), 0) AS hc_employer,
        COALESCE(AVG(hc_personal), 0) AS hc_personal, COALESCE(AVG(hc_medicare), 0) AS hc_medicare,
        COALESCE(AVG(hc_medicaid), 0) AS hc_medicaid, COALESCE(AVG(hc_tricare), 0) AS hc_tricare,
        COALESCE(AVG(hc_native), 0) AS hc_native, COALESCE(AVG(hc_none_current), 0) AS hc_none_current,
        COALESCE(AVG(delay_appt), 0) AS delay_appt, COALESCE(AVG(delay_wait), 0) AS delay_wait,
        COALESCE(AVG(delay_transport), 0) AS delay_transport, COALESCE(AVG(hc_none_thisyr), 0) AS hc_none_thisyr,
        COALESCE(AVG(drvisit), 0) AS drvisit, COALESCE(AVG(med_toocostly), 0) AS med_toocostly,
        COALESCE(AVG(satisfied_care), 0) AS satisfied_care, COALESCE(AVG(med_bills), 0) AS med_bills,
        COALESCE(AVG(get_medadvic), 0) AS get_medadvic, COALESCE(AVG(undrstnd_medadvic), 0) AS undrstnd_medadvic,
        COALESCE(AVG(understand_writtenmedadvic), 0) AS understand_writtenmedadvic, COALESCE(AVG(caregiver), 0) AS caregiver,
        COALESCE(AVG(mem_loss), 0) AS mem_loss, COALESCE(AVG(mem_loss_assist), 0) AS mem_loss_assist,
        COALESCE(AVG(mem_loss_gethelp), 0) AS mem_loss_gethelp, COALESCE(AVG(mem_loss_inhibit), 0) AS mem_loss_inhibit,
        COALESCE(AVG(mem_loss_doctor), 0) AS mem_loss_doctor, COALESCE(AVG(soda_day), 0) AS soda_day,
        COALESCE(AVG(soda_week), 0) AS soda_week, COALESCE(AVG(soda_month), 0) AS soda_month,
        COALESCE(AVG(sugarbev_day), 0) AS sugarbev_day, COALESCE(AVG(sugarbev_week), 0) AS sugarbev_week,
        COALESCE(AVG(sugarbev_month), 0) AS sugarbev_month, COALESCE(AVG(calorie), 0) AS calorie,
        COALESCE(AVG(marijuana), 0) AS marijuana, COALESCE(AVG(shingles_vacc), 0) AS shingles_vacc,
        COALESCE(AVG(num_types_cncr), 0) AS num_types_cncr, COALESCE(AVG(age_diagnosis_cncr), 0) AS age_diagnosis_cncr,
        COALESCE(AVG(cncr_insurance), 0) AS cncr_insurance, COALESCE(AVG(breast_exam), 0) AS breast_exam,
        COALESCE(AVG(last_breastexam), 0) AS last_breastexam, COALESCE(AVG(hetero), 0) AS hetero,
        COALESCE(AVG(homo), 0) AS homo, COALESCE(AVG(bisexual), 0) AS bisexual,
        COALESCE(AVG(transgender), 0) AS transgender, COALESCE(AVG(emo_support), 0) AS emo_support,
        COALESCE(AVG(life_dissatisfaction), 0) AS life_dissatisfaction, COALESCE(AVG(life_limited), 0) AS life_limited,
        COALESCE(AVG(special_equip), 0) AS special_equip, COALESCE(AVG(english), 0) AS english,
        COALESCE(AVG(spanish), 0) AS spanish, COALESCE(AVG(bmi), 0) AS bmi
        FROM
        (
            SELECT *,
            (state_brfss + gender_brfss + race_brfss + age_brfss) AS demographics_join_key2
            FROM
            (
                SELECT *,
                DECODE(state,
                       1, 100000, -- AL
                       2, 200000, -- AK
                       4, 400000, -- AZ
                       5, 500000, -- AR
                       6, 600000, -- CA
                       8, 800000, -- CO
                       9, 900000, -- CT
                       10, 1000000, -- DE
                       11, 1100000, -- DC
                       12, 1200000, -- FL
                       13, 1300000, -- GA
                       15, 1500000, -- HI
                       16, 1600000, -- ID
                       17, 1700000, -- IL
                       18, 1800000, -- IN
                       19, 1900000, -- IA
                       20, 2000000, -- KS
                       21, 2100000, -- KY
                       22, 2200000, -- LA
                       23, 2300000, -- ME
                       24, 2400000, -- MD
                       25, 2500000, -- MA
                       26, 2600000, -- MI
                       27, 2700000, -- MN
                       28, 2800000, -- MS
                       29, 2900000, -- MO
                       30, 3000000, -- MT
                       31, 3100000, -- NE
                       32, 3200000, -- NV
                       33, 3300000, -- NH
                       34, 3400000, -- NJ 
                       35, 3500000, -- NM
                       36, 3600000, -- NY 
                       37, 3700000, -- NC
                       38, 3800000, -- ND
                       39, 3900000, -- OH
                       40, 4000000, -- OK
                       41, 4100000, -- OR
                       42, 4200000, -- PA
                       44, 4400000, -- RI
                       45, 4500000, -- SC
                       46, 4600000, -- SD
                       47, 4700000, -- TN
                       48, 4800000, -- TX
                       49, 4900000, -- UT
                       50, 5000000, -- VT 
                       51, 5100000, -- VA
                       53, 5300000, -- WA
                       54, 5400000, -- WV
                       55, 5500000, -- WI
                       56, 5600000, -- WY
                       66, 6600000, -- GU
                       72, 7200000, -- PR
                       78, 7800000, -- VI
                       NULL) AS state_brfss,
                DECODE(female,
                       1, 20000,
                       0, 10000,
                       0) AS gender_brfss,
                CASE
                    WHEN white = 1 THEN 1000
                    WHEN black = 1 THEN 2000
                    WHEN asian = 1 THEN 3000
                    WHEN native = 1 THEN 4000
                    WHEN hispanic = 1 THEN 5000
                    ELSE 0
                    END AS race_brfss,
                age AS age_brfss
                FROM brfss2016
            )
        )
        GROUP BY demographics_join_key2
    ) AS brfss
    ON F.demographics_join_key1 = brfss.demographics_join_key2
)
;
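
The join key in the query above packs four demographic fields into a single integer: a state code in the hundred-thousands place and above, then gender, race, and age (top-coded at 80 on the scoring side). The sketch below reproduces that arithmetic in Python to make the encoding easier to follow. The function name `demographics_join_key` and the lookup dictionaries are illustrative assumptions, only a few states are listed, and unmatched values fall back to 0 here for simplicity.

```python
# Codes mirror the CASE expressions in the scoring query (subset of states only)
STATE_CODES = {"IL": 1700000, "IN": 1800000, "WI": 5500000}
GENDER_CODES = {"Female": 20000, "Male": 10000}
RACE_CODES = {"White": 1000, "AfAm": 2000, "Asian": 3000, "Native": 4000, "Hispanic": 5000}


def demographics_join_key(state, gender, race, age):
    """Combine state, gender, race, and age into one integer key."""
    age_capped = min(age, 80)  # ages above 80 are top-coded at 80
    # Unmatched categories contribute 0 (an assumption made for this sketch)
    return (STATE_CODES.get(state, 0)
            + GENDER_CODES.get(gender, 0)
            + RACE_CODES.get(race, 0)
            + age_capped)


key = demographics_join_key("IL", "Female", "AfAm", 85)
print(key)  # 1722080
```

Because each field occupies its own range of digits, two people share a key only when they match on all four fields, which is what lets the scoring rows pick up the averaged survey features for their demographic group.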


Using the Civis API, we'll grab the model with the best performance, and use this model to score our data set.

In [None]:
# Grab Job ID from output or Civis Platform UI
job_id = "ID NUMBER"

# look up the job once and reuse the response
job = client.jobs.get(job_id)
run_id = job['last_run']['id']
name = job['name']

print("NAME: " + name)
print("RUN ID: " + str(run_id))
print("--------------------------------------")

# reload the trained model from its job and run IDs
loaded_model = ModelPipeline.from_existing(job_id, run_id)

# score the female scoring table and write predictions to the output table
scoring = loaded_model.predict(table_name="score_table_bcrisk_female",
                               database_name="database",
                               output_table="bcrisk")

-----------------------------------------------------------------------------------------------------------------------------------------------
# STEP 5: Aggregate Scores by Geographic Level; Create Aggregate Training & Scoring Tables

Now, we'll create a new training table and a new scoring table, where the features are aggregated by the geographic level of interest (i.e. Census tract). In the training table, however, the dependent variable still comes from our survey response data.

We'll also use the scores output by our individual-level model as a feature in our geographic-level training and scoring tables. We'll average the scores within each geographic unit (i.e. each Census tract), and then append those averages to our training and scoring tables.
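
To make the aggregation concrete, here is a small sketch in plain Python of averaging individual scores by Census tract. It assumes each person carries a 15-digit Census block ID whose first 11 digits identify the tract, which is the same `LEFT(census_block, 11)` logic used in the query that follows. The helper name `mean_score_by_tract`, the block IDs, and the scores are made up for illustration.

```python
from collections import defaultdict


def mean_score_by_tract(rows):
    """Average scores within each tract; rows are (census_block, score) pairs."""
    totals = defaultdict(lambda: [0.0, 0])  # tract -> [sum of scores, count]
    for census_block, score in rows:
        tract = census_block[:11]  # first 11 digits of the block ID
        totals[tract][0] += score
        totals[tract][1] += 1
    return {tract: total / n for tract, (total, n) in totals.items()}


# Hypothetical individual-level scores (made-up values)
rows = [
    ("170310101001000", 0.25),
    ("170310101001001", 0.75),
    ("170310102002000", 0.5),
]
tract_scores = mean_score_by_tract(rows)
print(tract_scores)
```

In the notebook itself this step runs in SQL on the platform; the Python version is only meant to show what the grouping and averaging do.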

### Aggregate features at the Census tract level

In [None]:
%%civisquery

-- Dependent views -- 
DROP VIEW IF EXISTS cdph_train_bcrisk_topfeatures_agg;
DROP VIEW IF EXISTS cdph_train_bcrisk_all_agg;
DROP VIEW IF EXISTS agg_score_table_bcrisk;

-- Create aggregated table by census tract --
DROP VIEW IF EXISTS tract_aggregate_bcrisk_basefile;
CREATE VIEW tract_aggregate_bcrisk_basefile AS
SELECT
  LEFT(census_block, 11) AS census_tract -- ID Census tracts
  , AVG(head_hh_flag::float) AS head_hh_flag
  , AVG(hh_count::float) AS hh_count
  , AVG(age::float) AS coalesced_age
  , AVG(age_18_34::float) AS age_18_34
  , AVG(age_35_49::float) AS age_35_49
  , AVG(age_50_64::float) AS age_50_64
  , AVG(age_65_plus::float) AS age_65_plus
  , AVG(hh_avg_age::float) AS hh_avg_age
  , AVG(race_afam::float) AS race_afam
  , AVG(race_hispanic::float) AS race_hispanic
  , AVG(race_asian::float) AS race_asian
  , AVG(race_white::float) AS race_white
  , AVG(race_native::float) AS race_native
  , AVG(hh_race_afam::float) AS hh_race_afam
  , AVG(hh_race_hispanic::float) AS hh_race_hispanic
  , AVG(hh_race_asian::float) AS hh_race_asian
  , AVG(hh_race_white::float) AS hh_race_white
  , AVG(hh_race_native::float) AS hh_race_native
  , AVG(hh_all_afam::float) AS hh_all_afam
  , AVG(hh_all_hispanic::float) AS hh_all_hispanic
  , AVG(hh_all_asian::float) AS hh_all_asian
  , AVG(hh_all_white::float) AS hh_all_white
  , AVG(hh_all_native::float) AS hh_all_native
  , AVG(subeth_african_american::float) AS subeth_african_american
  , AVG(subeth_west_indian::float) AS subeth_west_indian
  , AVG(subeth_haitian::float) AS subeth_haitian
  , AVG(subeth_african::float) AS subeth_african
  , AVG(subeth_other_black::float) AS subeth_other_black
  , AVG(subeth_mexican::float) AS subeth_mexican
  , AVG(subeth_cuban::float) AS subeth_cuban
  , AVG(subeth_puerto_rican::float) AS subeth_puerto_rican
  , AVG(subeth_dominican::float) AS subeth_dominican
  , AVG(subeth_other_latin_american::float) AS subeth_other_latin_american
  , AVG(subeth_other_hispanic::float) AS subeth_other_hispanic
  , AVG(subeth_chinese::float) AS subeth_chinese
  , AVG(subeth_indian::float) AS subeth_indian
  , AVG(subeth_filipino::float) AS subeth_filipino
  , AVG(subeth_japanese::float) AS subeth_japanese
  , AVG(subeth_vietnamese::float) AS subeth_vietnamese
  , AVG(subeth_korean::float) AS subeth_korean
  , AVG(subeth_other_asian::float) AS subeth_other_asian
  , AVG(subeth_hmong::float) AS subeth_hmong
  , AVG(spanish::float) AS spanish
  , AVG(gender::float) AS gender
  , AVG(hh_gender::float) AS hh_gender
  , AVG(phone_presence::float) AS phone_presence
  , AVG(phone_listed::float) AS phone_listed
  , AVG(phone_unlisted::float) AS phone_unlisted
  , AVG(email_append_ind::float) AS email_append_ind
  , AVG(email_append_hh::float) AS email_append_hh
  , AVG(deceased::float) AS deceased
  , AVG(elections_pres_2012::float) AS elections_pres_2012
  , AVG(elections_pres_2008::float) AS elections_pres_2008
  , AVG(elections_pres_2004::float) AS elections_pres_2004
  , AVG(elections_gov_recent::float) AS elections_gov_recent
  , AVG(elections_state_avg::float) AS elections_state_avg
  , AVG(elections_fedstate_diff::float) AS elections_fedstate_diff
  , AVG(married::float) AS married
  , AVG(unmarried::float) AS unmarried
  , AVG(children_in_hh::float) AS children_in_hh
  , AVG(num_children_in_hh::float) AS num_children_in_hh
  , AVG(young_children_in_hh::float) AS young_children_in_hh
  , AVG(teenage_children_in_hh::float) AS teenage_children_in_hh
  , AVG(education_highschool::float) AS education_highschool
  , AVG(education_collegegrad::float) AS education_collegegrad
  , AVG(num_in_hh::float) AS num_in_hh
  , AVG(religion_jewish::float) AS religion_jewish
  , AVG(religion_mormon::float) AS religion_mormon
  , AVG(religion_muslim::float) AS religion_muslim
  , AVG(religion_catholic::float) AS religion_catholic
  , AVG(religion_evangelical_protestant::float) AS religion_evangelical_protestant
  , AVG(religion_mainline_protestant::float) AS religion_mainline_protestant
  , AVG(religion_orthodox_christian::float) AS religion_orthodox_christian
  , AVG(religion_jehovah_witness::float) AS religion_jehovah_witness
  , AVG(religion_hindu::float) AS religion_hindu
  , AVG(religion_buddhist::float) AS religion_buddhist
  , AVG(vehicleowner::float) AS vehicleowner
  , AVG(homeowner::float) AS homeowner
  , AVG(property_type_singlefamily::float) AS property_type_singlefamily
  , AVG(property_type_multifamily::float) AS property_type_multifamily
  , AVG(address_type_firm::float) AS address_type_firm
  , AVG(address_type_general_delivery::float) AS address_type_general_delivery
  , AVG(address_type_high_rise::float) AS address_type_high_rise
  , AVG(address_type_po_box::float) AS address_type_po_box
  , AVG(address_type_rural::float) AS address_type_rural
  , AVG(address_type_street::float) AS address_type_street
  , AVG(has_ncoa_return_code::float) AS has_ncoa_return_code
  , AVG(length_of_residence::float) AS length_of_residence
  , AVG(length_of_residence_is_null::float) AS length_of_residence_is_null
  , AVG(home_sqft::float) AS home_sqft
  , AVG(home_sqft_is_null::float) AS home_sqft_is_null
  , AVG(home_total_value::float) AS home_total_value
  , AVG(home_total_value_is_unknown::float) AS home_total_value_is_unknown
  , AVG(log_home_value_bucket::float) AS log_home_value_bucket
  , AVG(home_value_is_null::float) AS home_value_is_null
  , AVG(home_purchase_year::float) AS home_purchase_year
  , AVG(home_purchase_year_is_null::float) AS home_purchase_year_is_null
  , AVG(log_household_net_worth_bucket::float) AS log_household_net_worth_bucket
  , AVG(household_net_worth_is_null::float) AS household_net_worth_is_null
  , AVG(head_hh_salary_amt::float) AS head_hh_salary_amt
  , AVG(head_hh_salary_amt_is_null::float) AS head_hh_salary_amt_is_null
  , AVG(log_household_income_bucket::float) AS log_household_income_bucket
  , AVG(household_income_is_null::float) AS household_income_is_null
  , AVG(donor_charity::float) AS donor_charity
  , AVG(donor_enviro_causes::float) AS donor_enviro_causes
  , AVG(hh_donor_charity_pctile::float) AS hh_donor_charity_pctile
  , AVG(hh_donor_charity::float) AS hh_donor_charity
  , AVG(hh_donor_enviro_causes::float) AS hh_donor_enviro_causes
  , AVG(dog_enthusiast::float) AS dog_enthusiast
  , AVG(cat_enthusiast::float) AS cat_enthusiast
  , AVG(pet_enthusiast::float) AS pet_enthusiast
  , AVG(travel_domestic_foreign::float) AS travel_domestic_foreign
  , AVG(license_hunt_fish::float) AS license_hunt_fish
  , AVG(religious_purchase::float) AS religious_purchase
  , AVG(political_purchase::float) AS political_purchase
  , AVG(health_institution_purchase::float) AS health_institution_purchase
  , AVG(general_purchase::float) AS general_purchase
  , AVG(hh_license_hunt_fish::float) AS hh_license_hunt_fish
  , AVG(hh_religious_purchase::float) AS hh_religious_purchase
  , AVG(hh_political_purchase::float) AS hh_political_purchase
  , AVG(hh_health_institution_purchase::float) AS hh_health_institution_purchase
  , AVG(hh_general_purchase::float) AS hh_general_purchase
  , AVG(presence_of_cell_phone::float) AS presence_of_cell_phone
  , AVG(presence_of_cell_phone_modeled::float) AS presence_of_cell_phone_modeled
  , AVG(hh_has_credit_card::float) AS hh_has_credit_card
  , AVG(online_is_online::float) AS online_is_online
  , AVG(online_facebook::float) AS online_facebook
  , AVG(online_purchaser::float) AS online_purchaser
  , AVG(hh_online_social_network_twentile::float) AS hh_online_social_network_twentile
  , AVG(hh_online_is_online::float) AS hh_online_is_online
  , AVG(hh_online_facebook::float) AS hh_online_facebook
  , AVG(employed::float) AS employed
  , AVG(employed_unknown::float) AS employed_unknown
  , AVG(retired::float) AS retired
  , AVG(hh_business_owner::float) AS hh_business_owner
  , AVG(emp_nurse::float) AS emp_nurse
  , AVG(emp_beauty::float) AS emp_beauty
  , AVG(emp_self_employed::float) AS emp_self_employed
  , AVG(emp_healthcare::float) AS emp_healthcare
  , AVG(emp_nursing_provider::float) AS emp_nursing_provider
  , AVG(emp_dental_provider::float) AS emp_dental_provider
  , AVG(emp_healthcare_provider::float) AS emp_healthcare_provider
  , AVG(emp_realestate::float) AS emp_realestate
  , AVG(emp_educator::float) AS emp_educator
  , AVG(emp_pilot::float) AS emp_pilot
  , AVG(emp_aviation_industry::float) AS emp_aviation_industry
  , AVG(emp_postal::float) AS emp_postal
  , AVG(emp_federal::float) AS emp_federal
  , AVG(emp_military::float) AS emp_military
  , AVG(hh_emp_active_military::float) AS hh_emp_active_military
  , AVG(hh_emp_active_military_modeled::float) AS hh_emp_active_military_modeled
  , AVG(hh_emp_inactive_military::float) AS hh_emp_inactive_military
  , AVG(hh_emp_inactive_military_modeled::float) AS hh_emp_inactive_military_modeled
  , AVG(hh_employed::float) AS hh_employed
  , AVG(hh_retired::float) AS hh_retired
  , AVG(hh_emp_nurse::float) AS hh_emp_nurse
  , AVG(hh_emp_beauty::float) AS hh_emp_beauty
  , AVG(hh_emp_healthcare::float) AS hh_emp_healthcare
  , AVG(hh_emp_realestate::float) AS hh_emp_realestate
  , AVG(hh_emp_educator::float) AS hh_emp_educator
  , AVG(zip5_pct_catholic::float) AS zip5_pct_catholic
  , AVG(zip5_pct_jewish::float) AS zip5_pct_jewish
  , AVG(zip5_pct_dems_per_reg::float) AS zip5_pct_dems_per_reg
  , AVG(zip5_pct_registered::float) AS zip5_pct_registered
  , AVG(zip5_pct_feccontributions_dem_2way::float) AS zip5_pct_feccontributions_dem_2way
  , AVG(urban::float) AS urban
  , AVG(suburban::float) AS suburban
  , AVG(rural::float) AS rural
  , AVG(county_is_in_msa::float) AS county_is_in_msa
  , AVG(is_in_place::float) AS is_in_place
  , AVG(place_is_in_principal_city::float) AS place_is_in_principal_city
  , AVG(pct_under18::float) AS pct_under18
  , AVG(pct_18plus::float) AS pct_18plus
  , AVG(pct_race5way_hispanic::float) AS pct_race5way_hispanic
  , AVG(pct_race5way_black::float) AS pct_race5way_black
  , AVG(pct_race5way_asian::float) AS pct_race5way_asian
  , AVG(pct_race5way_native::float) AS pct_race5way_native
  , AVG(total_hh::float) AS total_hh
  , AVG(pct_hh_owned_with_loan::float) AS pct_hh_owned_with_loan
  , AVG(pct_hh_owned_no_loan::float) AS pct_hh_owned_no_loan
  , AVG(pct_hh_renter::float) AS pct_hh_renter
  , AVG(pct_hh_single_family_head::float) AS pct_hh_single_family_head
  , AVG(pct_hh_with_own_children_under18::float) AS pct_hh_with_own_children_under18
  , AVG(pct_hh_1_person::float) AS pct_hh_1_person
  , AVG(pct_hh_husband_wife_family::float) AS pct_hh_husband_wife_family
  , AVG(fpl_under138::float) AS fpl_under138
  , AVG(fpl_139to400::float) AS fpl_139to400
  , AVG(civismodel_incomeunder40k::float) AS civismodel_incomeunder40k
  , AVG(civismodel_incomeover80k::float) AS civismodel_incomeover80k
  , AVG(total_pop::float) AS total_pop
  , AVG(log_pop_density::float) AS log_pop_density
  , AVG(log_total_hispanic_density::float) AS log_total_hispanic_density
  , AVG(log_total_white_density::float) AS log_total_white_density
  , AVG(log_total_black_density::float) AS log_total_black_density
  , AVG(log_total_native_american_density::float) AS log_total_native_american_density
  , AVG(log_total_asian_density::float) AS log_total_asian_density
  , AVG(total_hawaiian_pac_islander_pct::float) AS total_hawaiian_pac_islander_pct
  , AVG(log_total_other_race_density::float) AS log_total_other_race_density
  , AVG(total_multi_race_pct::float) AS total_multi_race_pct
  , AVG(log_total_multi_race_density::float) AS log_total_multi_race_density
  , AVG(adult_pct::float) AS adult_pct
  , AVG(adult_hispanic_pct::float) AS adult_hispanic_pct
  , AVG(log_adult_hispanic_density::float) AS log_adult_hispanic_density
  , AVG(adult_white_pct::float) AS adult_white_pct
  , AVG(log_adult_white_density::float) AS log_adult_white_density
  , AVG(adult_black_pct::float) AS adult_black_pct
  , AVG(log_adult_black_density::float) AS log_adult_black_density
  , AVG(adult_native_american_pct::float) AS adult_native_american_pct
  , AVG(log_adult_native_american_density::float) AS log_adult_native_american_density
  , AVG(adult_asian_pct::float) AS adult_asian_pct
  , AVG(log_adult_asian_density::float) AS log_adult_asian_density
  , AVG(adult_hawaiian_pac_islander_pct::float) AS adult_hawaiian_pac_islander_pct
  , AVG(adult_other_race_pct::float) AS adult_other_race_pct
  , AVG(log_adult_other_race_density::float) AS log_adult_other_race_density
  , AVG(adult_multi_race_pct::float) AS adult_multi_race_pct
  , AVG(log_adult_multi_race_density::float) AS log_adult_multi_race_density
  , AVG(minor_pct::float) AS minor_pct
  , AVG(minor_hispanic_pct::float) AS minor_hispanic_pct
  , AVG(log_minor_hispanic_density::float) AS log_minor_hispanic_density
  , AVG(minor_white_pct::float) AS minor_white_pct
  , AVG(log_minor_white_density::float) AS log_minor_white_density
  , AVG(minor_black_pct::float) AS minor_black_pct
  , AVG(log_minor_black_density::float) AS log_minor_black_density
  , AVG(minor_native_american_pct::float) AS minor_native_american_pct
  , AVG(minor_asian_pct::float) AS minor_asian_pct
  , AVG(log_minor_multi_race_density::float) AS log_minor_multi_race_density
  , AVG(median_age::float) AS median_age
  , AVG(median_age_male::float) AS median_age_male
  , AVG(median_age_female::float) AS median_age_female
  , AVG(univacant_density::float) AS univacant_density
  , AVG(univacant_rented_pct::float) AS univacant_rented_pct
  , AVG(univacant_for_sale_pct::float) AS univacant_for_sale_pct
  , AVG(univacant_sold_pct::float) AS univacant_sold_pct
  , AVG(univacant_seasonal_etc_use_pct::float) AS univacant_seasonal_etc_use_pct
  , AVG(univacant_migrant_workers_pct::float) AS univacant_migrant_workers_pct
  , AVG(univacant_other_pct::float) AS univacant_other_pct
  , AVG(pct_employment_construction::float) AS pct_employment_construction
  , AVG(pct_employment_manufacturing::float) AS pct_employment_manufacturing
  , AVG(pct_employment_wholesale::float) AS pct_employment_wholesale
  , AVG(pct_employment_retail::float) AS pct_employment_retail
  , AVG(pct_employment_finance::float) AS pct_employment_finance
  , AVG(pct_employment_management::float) AS pct_employment_management
  , AVG(pct_employment_educational::float) AS pct_employment_educational
  , AVG(pct_employment_health::float) AS pct_employment_health
  , AVG(pct_employment_arts::float) AS pct_employment_arts
  , AVG(indian_reservation::float) AS indian_reservation
  , AVG(military_base::float) AS military_base
  , AVG(pct_pop_corrections_facility::float) AS pct_pop_corrections_facility
  , AVG(pct_pop_nursing_facility::float) AS pct_pop_nursing_facility
  , AVG(pct_pop_military_quarters::float) AS pct_pop_military_quarters
  , AVG(pct_pop_student_housing::float) AS pct_pop_student_housing
  , AVG(pct_pop_evangelical::float) AS pct_pop_evangelical
  , AVG(pct_pop_catholic::float) AS pct_pop_catholic
  , AVG(pct_pop_black_protestant::float) AS pct_pop_black_protestant
  , AVG(pct_pop_muslim::float) AS pct_pop_muslim
  , AVG(pct_pop_mormon::float) AS pct_pop_mormon
  , AVG(pct_pop_jewish::float) AS pct_pop_jewish
  , AVG(pct_pop_buddhist::float) AS pct_pop_buddhist
  , AVG(pct_pop_hindu::float) AS pct_pop_hindu
  , AVG(median_age_white_male::float) AS median_age_white_male
  , AVG(pct_white_male_less_high_school::float) AS pct_white_male_less_high_school
  , AVG(pct_white_female_less_high_school::float) AS pct_white_female_less_high_school
  , AVG(median_hh_income::float) AS median_hh_income
  , AVG(pct_noncitizen::float) AS pct_noncitizen
  , AVG(pct_moved_past_year_within_county::float) AS pct_moved_past_year_within_county
  , AVG(pct_moved_past_year_all::float) AS pct_moved_past_year_all
  , AVG(pct_leave_for_work_4pmto5am::float) AS pct_leave_for_work_4pmto5am
  , AVG(pct_commute_over90min::float) AS pct_commute_over90min
  , AVG(pct_vehicle_available::float) AS pct_vehicle_available
  , AVG(pct_enrolled_in_higher_ed::float) AS pct_enrolled_in_higher_ed
  , AVG(pct_educ_no_hs::float) AS pct_educ_no_hs
  , AVG(pct_educ_bachelors::float) AS pct_educ_bachelors
  , AVG(pct_in_labor_force::float) AS pct_in_labor_force
  , AVG(pct_disabled::float) AS pct_disabled
  , AVG(outflow_same_state::float) AS outflow_same_state
  , AVG(outflow_foreign::float) AS outflow_foreign
  , AVG(outflow_total::float) AS outflow_total
  , AVG(inflow_same_state::float) AS inflow_same_state
  , AVG(inflow_diff_state::float) AS inflow_diff_state
  , AVG(inflow_foreign::float) AS inflow_foreign
  , AVG(inflow_total::float) AS inflow_total
  , AVG(exemptions_per_return::float) AS exemptions_per_return
  , AVG(dependenper_return::float) AS dependenper_return
  , AVG(pct_single_returns::float) AS pct_single_returns
  , AVG(pct_joint_returns::float) AS pct_joint_returns
  , AVG(pct_headofhousehold_returns::float) AS pct_headofhousehold_returns
  , AVG(avg_adjusted_gross_income::float) AS avg_adjusted_gross_income
  , AVG(pct_farm_returns::float) AS pct_farm_returns
  , AVG(pct_retiree_returns::float) AS pct_retiree_returns
  , AVG(pct_salarywages_returns::float) AS pct_salarywages_returns
  , AVG(avg_amnt_salarywages::float) AS avg_amnt_salarywages
  , AVG(pct_businessincome_returns::float) AS pct_businessincome_returns
  , AVG(avg_amnt_businessincome::float) AS avg_amnt_businessincome
  , AVG(avg_amnt_unemployed_benefits::float) AS avg_amnt_unemployed_benefits
  , AVG(pct_socsecurity_returns::float) AS pct_socsecurity_returns
  , AVG(avg_amnt_socsecurity_benefits::float) AS avg_amnt_socsecurity_benefits
  , AVG(avg_amnt_contributions::float) AS avg_amnt_contributions
  , AVG(pct_energy_taxcredit_returns::float) AS pct_energy_taxcredit_returns
  , AVG(avg_amnt_energy_taxcredit_per_return::float) AS avg_amnt_energy_taxcredit_per_return
  , AVG(avg_amnt_child_taxcredit_per_return::float) AS avg_amnt_child_taxcredit_per_return
  , AVG(pct_eitc_returns::float) AS pct_eitc_returns
  , AVG(avg_amnt_eitc_per_return::float) AS avg_amnt_eitc_per_return
  , AVG(veteran14_pct::float) AS veteran14_pct
  , AVG(current_rate::float) AS current_rate
  , AVG(ten_month_low::float) AS ten_month_low
  , AVG(retired_workers::float) AS retired_workers
  , AVG(disabled_workers::float) AS disabled_workers
  , AVG(widowers::float) AS widowers
  , AVG(children::float) AS children
  , AVG(retired_workers_benefits::float) AS retired_workers_benefits
  , AVG(widower_benefits::float) AS widower_benefits
  , AVG(total_crimes::float) AS total_crimes
  , AVG(rape::float) AS rape
  , AVG(agg_assault::float) AS agg_assault
  , AVG(prostitution::float) AS prostitution
  , AVG(sex_offense::float) AS sex_offense
  , AVG(vehicle_theft::float) AS vehicle_theft
  , AVG(disorderly_conduct::float) AS disorderly_conduct
  , AVG(drug_total::float) AS drug_total
  , AVG(median_home_list_price::float) AS median_home_list_price
  , AVG(median_rental_price::float) AS median_rental_price
  , AVG(median_listing_price_sqfoot::float) AS median_listing_price_sqfoot
  , AVG(pct_homes_price_reduction::float) AS pct_homes_price_reduction
  , AVG(median_price_reduction::float) AS median_price_reduction
  , AVG(log_distance_to_starbucks::float) AS log_distance_to_starbucks
  , AVG(log_distance_to_elementary_school::float) AS log_distance_to_elementary_school
  , AVG(log_distance_to_high_school::float) AS log_distance_to_high_school
  , AVG(log_distance_to_homeless_shelter::float) AS log_distance_to_homeless_shelter
  , AVG(log_distance_to_hospital::float) AS log_distance_to_hospital
  , AVG(log_distance_to_assisted_living::float) AS log_distance_to_assisted_living
  , AVG(log_distance_to_independent_living::float) AS log_distance_to_independent_living
  , AVG(log_distance_to_kindergarten::float) AS log_distance_to_kindergarten
  , AVG(log_distance_to_middle_school::float) AS log_distance_to_middle_school
  , AVG(log_distance_to_nursing_home::float) AS log_distance_to_nursing_home
  , AVG(log_distance_to_synagogue::float) AS log_distance_to_synagogue
  , AVG(log_distance_to_university::float) AS log_distance_to_university
  , AVG(housing_temp_address::float) AS housing_temp_address
  , AVG(amenities99_amenity_rank::float) AS amenities99_amenity_rank
  , AVG(fast_food_restaurants::float) AS fast_food_restaurants
  , AVG(full_service_restaurants::float) AS full_service_restaurants
  , AVG(grocery_stores_per_cap::float) AS grocery_stores_per_cap
  , AVG(rec_facilities::float) AS rec_facilities
  , AVG(snap_benefiper_cap::float) AS snap_benefiper_cap
  , AVG(wic_redemptions_per_cap::float) AS wic_redemptions_per_cap
  , AVG(sfree_lunch_eligible_pct::float) AS sfree_lunch_eligible_pct
  , AVG(sreduced_lunch_eligible_pct::float) AS sreduced_lunch_eligible_pct
  , AVG(low_access_to_store_pct::float) AS low_access_to_store_pct
  , AVG(no_car_low_access_to_store_pct::float) AS no_car_low_access_to_store_pct
  , AVG(mail_return_pct::float) AS mail_return_pct
  , AVG(parent::float) AS parent
  , AVG(uninsured_2015::float) AS uninsured_2015
  , AVG(marriage::float) AS marriage
  , MAX(incidence_rate::float) AS incidence_rate  -- only take the max, as this is already aggregated at a higher geographic level (state)
  , AVG(breastcncr::float) AS breastcncr
  , AVG(obese::float) AS obese
  , AVG(pvtresd::float) AS pvtresd
  , AVG(colg_hous::float) AS colg_hous
  , AVG(gen_hlth::float) AS gen_hlth
  , AVG(phys_hlth::float) AS phys_hlth
  , AVG(ment_hlth::float) AS ment_hlth
  , AVG(poor_hlth::float) AS poor_hlth
  , AVG(hlth_pln::float) AS hlth_pln
  , AVG(persdoc::float) AS persdoc
  , AVG(doc_toocostly::float) AS doc_toocostly
  , AVG(checkup::float) AS checkup
  , AVG(exercise::float) AS exercise
  , AVG(sleephrs::float) AS sleephrs
  , AVG(heartattack::float) AS heartattack
  , AVG(coronaryheartdisease::float) AS coronaryheartdisease
  , AVG(stroke::float) AS stroke
  , AVG(asthma::float) AS asthma
  , AVG(asthma_now::float) AS asthma_now
  , AVG(skin_cancer::float) AS skin_cancer
  , AVG(cancer::float) AS cancer
  , AVG(copd::float) AS copd
  , AVG(arthritis::float) AS arthritis
  , AVG(depression::float) AS depression
  , AVG(kidneydisease::float) AS kidneydisease
  , AVG(diabetes::float) AS diabetes
  , AVG(borderline_diab::float) AS borderline_diab
  , AVG(preg_diab::float) AS preg_diab
  , AVG(age_diabetes::float) AS age_diabetes
  , AVG(student::float) AS student
  , AVG(numchildren::float) AS numchildren
  , AVG(internet_use::float) AS internet_use
  , AVG(weight_kg::float) AS weight_kg
  , AVG(height_in::float) AS height_in
  , AVG(pregnant_now::float) AS pregnant_now
  , AVG(hard_hearing::float) AS hard_hearing
  , AVG(hard_seeing::float) AS hard_seeing
  , AVG(diff_decide::float) AS diff_decide
  , AVG(diff_walk::float) AS diff_walk
  , AVG(diff_dress::float) AS diff_dress
  , AVG(diff_alone::float) AS diff_alone
  , AVG(smoked::float) AS smoked
  , AVG(smoke_freq::float) AS smoke_freq
  , AVG(snuff_freq::float) AS snuff_freq
  , AVG(ecig_freq::float) AS ecig_freq
  , AVG(last_smoke::float) AS last_smoke
  , AVG(alc_past_week::float) AS alc_past_week
  , AVG(alc_past_month::float) AS alc_past_month
  , AVG(avg_drinks::float) AS avg_drinks
  , AVG(flu_vacc::float) AS flu_vacc
  , AVG(pnem_vacc::float) AS pnem_vacc
  , AVG(tetanus_vacc::float) AS tetanus_vacc
  , AVG(num_fall::float) AS num_fall
  , AVG(num_bad_falls::float) AS num_bad_falls
  , AVG(seatbelt_use::float) AS seatbelt_use
  , AVG(drunk_drive::float) AS drunk_drive
  , AVG(mammogram::float) AS mammogram
  , AVG(last_mammogram::float) AS last_mammogram
  , AVG(pap::float) AS pap
  , AVG(last_pap::float) AS last_pap
  , AVG(hpv_test::float) AS hpv_test
  , AVG(last_hpvtest::float) AS last_hpvtest
  , AVG(hpv_vacc::float) AS hpv_vacc
  , AVG(hysterectomy::float) AS hysterectomy
  , AVG(psa_test_discussion::float) AS psa_test_discussion
  , AVG(psa_test_suggest::float) AS psa_test_suggest
  , AVG(psa_test::float) AS psa_test
  , AVG(last_psatest::float) AS last_psatest
  , AVG(fam_history_prostatecncr::float) AS fam_history_prostatecncr
  , AVG(blood_stool_test::float) AS blood_stool_test
  , AVG(last_bloodstooltest::float) AS last_bloodstooltest
  , AVG(had_colsig::float) AS had_colsig
  , AVG(sigmoidoscopy::float) AS sigmoidoscopy
  , AVG(colonoscopy::float) AS colonoscopy
  , AVG(last_colsig::float) AS last_colsig
  , AVG(hiv_test::float) AS hiv_test
  , AVG(hiv_risk::float) AS hiv_risk
  , AVG(diabetes_test::float) AS diabetes_test
  , AVG(insulin_now::float) AS insulin_now
  , AVG(diabetes_consult::float) AS diabetes_consult
  , AVG(feet_check::float) AS feet_check
  , AVG(last_eyeexam::float) AS last_eyeexam
  , AVG(retinopathy::float) AS retinopathy
  , AVG(diab_edu::float) AS diab_edu
  , AVG(pain_days::float) AS pain_days
  , AVG(sad_days::float) AS sad_days
  , AVG(anxious_days::float) AS anxious_days
  , AVG(energized_days::float) AS energized_days
  , AVG(medicare_now::float) AS medicare_now
  , AVG(hc_employer::float) AS hc_employer
  , AVG(hc_personal::float) AS hc_personal
  , AVG(hc_medicare::float) AS hc_medicare
  , AVG(hc_medicaid::float) AS hc_medicaid
  , AVG(hc_tricare::float) AS hc_tricare
  , AVG(hc_native::float) AS hc_native
  , AVG(hc_none_current::float) AS hc_none_current
  , AVG(delay_appt::float) AS delay_appt
  , AVG(delay_wait::float) AS delay_wait
  , AVG(delay_transport::float) AS delay_transport
  , AVG(hc_none_thisyr::float) AS hc_none_thisyr
  , AVG(drvisit::float) AS drvisit
  , AVG(med_toocostly::float) AS med_toocostly
  , AVG(satisfied_care::float) AS satisfied_care
  , AVG(med_bills::float) AS med_bills
  , AVG(get_medadvic::float) AS get_medadvic
  , AVG(undrstnd_medadvic::float) AS undrstnd_medadvic
  , AVG(understand_writtenmedadvic::float) AS understand_writtenmedadvic
  , AVG(caregiver::float) AS caregiver
  , AVG(mem_loss::float) AS mem_loss
  , AVG(mem_loss_assist::float) AS mem_loss_assist
  , AVG(mem_loss_gethelp::float) AS mem_loss_gethelp
  , AVG(mem_loss_inhibit::float) AS mem_loss_inhibit
  , AVG(mem_loss_doctor::float) AS mem_loss_doctor
  , AVG(soda_day::float) AS soda_day
  , AVG(soda_week::float) AS soda_week
  , AVG(soda_month::float) AS soda_month
  , AVG(sugarbev_day::float) AS sugarbev_day
  , AVG(sugarbev_week::float) AS sugarbev_week
  , AVG(sugarbev_month::float) AS sugarbev_month
  , AVG(calorie::float) AS calorie
  , AVG(marijuana::float) AS marijuana
  , AVG(shingles_vacc::float) AS shingles_vacc
  , AVG(num_types_cncr::float) AS num_types_cncr
  , AVG(age_diagnosis_cncr::float) AS age_diagnosis_cncr
  , AVG(cncr_insurance::float) AS cncr_insurance
  , AVG(breast_exam::float) AS breast_exam
  , AVG(last_breastexam::float) AS last_breastexam
  , AVG(hetero::float) AS hetero
  , AVG(homo::float) AS homo
  , AVG(bisexual::float) AS bisexual
  , AVG(transgender::float) AS transgender
  , AVG(emo_support::float) AS emo_support
  , AVG(life_dissatisfaction::float) AS life_dissatisfaction
  , AVG(life_limited::float) AS life_limited
  , AVG(special_equip::float) AS special_equip
  , AVG(english::float) AS english
  , AVG(spanish::float) AS spanish
  , AVG(bmi::float) AS bmi
  FROM score_table_bcrisk_female_all  -- use individual-level scoring table for model trained on individual-level data
  WHERE census_block IS NOT NULL  
  GROUP BY 1  -- group by Census tracts
;
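To make the aggregation concrete, here is a minimal Python sketch of what the query above does. It assumes the grouping key is the first 11 characters of the 15-character Census block GEOID (state + county + tract), as in the later queries that use `LEFT(census_block, 11)`. The helper name `aggregate_by_tract` and the rows are hypothetical toy values, not project data.

```python
from collections import defaultdict
from statistics import mean


def aggregate_by_tract(rows, avg_cols, max_cols):
    """Average avg_cols and take the max of max_cols within each Census tract."""
    groups = defaultdict(list)
    for row in rows:
        block = row.get("census_block")
        if block is None:  # mirrors WHERE census_block IS NOT NULL
            continue
        groups[block[:11]].append(row)  # mirrors LEFT(census_block, 11)

    result = {}
    for tract, members in groups.items():
        out = {}
        for col in avg_cols:
            # Like SQL AVG, ignore missing values
            values = [m[col] for m in members if m[col] is not None]
            out[col] = mean(values) if values else None
        for col in max_cols:
            values = [m[col] for m in members if m[col] is not None]
            out[col] = max(values) if values else None
        result[tract] = out
    return result


# Toy rows (made-up values for illustration only)
rows = [
    {"census_block": "170310101001000", "bmi": 24.0, "incidence_rate": 130.0},
    {"census_block": "170310101001001", "bmi": 30.0, "incidence_rate": 130.0},
    {"census_block": "170310102002000", "bmi": 27.0, "incidence_rate": 130.0},
    {"census_block": None, "bmi": 22.0, "incidence_rate": 130.0},
]
tract_features = aggregate_by_tract(rows, avg_cols=["bmi"], max_cols=["incidence_rate"])
```

Because a state-level value such as `incidence_rate` is constant for every individual in a tract, taking the maximum simply carries that value through unchanged.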

### Aggregate individual-level model scores at the Census tract level

In [None]:
%%civisquery

-- drop dependent views
DROP VIEW IF EXISTS cdph_train_bcrisk_all_agg;
DROP VIEW IF EXISTS agg_score_table_bcrisk;

-- drop if already exists
DROP VIEW IF EXISTS agg_bcrisk_scores;

CREATE VIEW agg_bcrisk_scores AS
SELECT 
LEFT(modeling.census_block, 11) AS census_tract
, AVG(scores.bc_risk_2cat) AS bcrisk_scores
FROM bcrisk AS scores
LEFT JOIN modeling_data AS modeling
ON modeling.id = scores.id
GROUP BY 1
;

### Create new scoring table with data aggregated at the census tract level; aggregated individual-level scores appended on as a column.

Each row is a Census tract.

In [None]:
%%civisquery


DROP VIEW IF EXISTS agg_score_table_bcrisk;

CREATE VIEW agg_score_table_bcrisk AS
SELECT 
bridge.ca,
agg_basefile.*,
agg_scores.bcrisk_scores
FROM tract_aggregate_bcrisk_basefile AS agg_basefile
LEFT JOIN
agg_bcrisk_scores AS agg_scores
ON agg_basefile.census_tract = agg_scores.census_tract
LEFT JOIN
ca_tract_bridge AS bridge  -- this table connects Census tracts to Chicago community areas
ON bridge.tract = agg_basefile.census_tract
WHERE bridge.tract IS NOT NULL  -- subsetting scoring table to only include Chicago Census tracts
;
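Note that filtering on `bridge.tract IS NOT NULL` after the `LEFT JOIN` keeps only tracts that appear in the bridge table, so the result matches an inner join. The sketch below checks this on toy tables using Python's built-in `sqlite3` module. SQLite stands in for the Civis database here, and the table contents are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE agg_basefile (census_tract TEXT, bmi REAL);
    CREATE TABLE bridge (tract TEXT, ca INTEGER);
    -- One tract that is in the bridge table and one that is not (toy values)
    INSERT INTO agg_basefile VALUES ('17031010100', 27.0), ('17043840000', 25.5);
    INSERT INTO bridge VALUES ('17031010100', 1);
""")

# Same pattern as the view above: LEFT JOIN, then drop unmatched rows
left_filtered = conn.execute("""
    SELECT b.ca, a.census_tract
    FROM agg_basefile AS a
    LEFT JOIN bridge AS b ON b.tract = a.census_tract
    WHERE b.tract IS NOT NULL
    ORDER BY a.census_tract
""").fetchall()

# Equivalent INNER JOIN
inner = conn.execute("""
    SELECT b.ca, a.census_tract
    FROM agg_basefile AS a
    INNER JOIN bridge AS b ON b.tract = a.census_tract
    ORDER BY a.census_tract
""").fetchall()

conn.close()
```

Both queries return only the tract that has a community area, which is how the scoring table ends up restricted to Chicago.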

### Create new training table with aggregated data

Each row is an individual, but the features/columns are aggregated by Census tract. The aggregated individual-level scores are appended as a column.

In [None]:
%%civisquery

-- drop dependent views
DROP VIEW IF EXISTS cdph_train_bcrisk_topfeatures_agg;

-- drop view if exists
DROP VIEW IF EXISTS cdph_train_bcrisk_all_agg;

CREATE VIEW cdph_train_bcrisk_all_agg AS
SELECT 
A.id,
A.bc_risk_2cat,
A.weight,
B.*
FROM
(
    SELECT 
    train.id,
    train.bc_risk_2cat,
    train.weight,
    LEFT(modeling.census_block, 11) AS census_tract 
    FROM cdph_bcrisk_train_all_female AS train
    LEFT JOIN modeling_data AS modeling
    ON train.id = modeling.id
) AS A
LEFT JOIN
(
    SELECT 
    agg_basefile.*, 
    agg_scores.bcrisk_scores
    FROM tract_aggregate_bcrisk_basefile AS agg_basefile
    LEFT JOIN
    agg_bcrisk_scores AS agg_scores
    ON agg_basefile.census_tract = agg_scores.census_tract
) AS B
ON A.census_tract = B.census_tract
WHERE A.bc_risk_2cat IS NOT NULL
;
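The effect of this join is that every individual in the same tract receives identical feature values, so the second model only sees variation between tracts. Here is a minimal sketch of that join in plain Python. The helper name `attach_tract_features`, the reduced set of columns, and all values are hypothetical.

```python
def attach_tract_features(individuals, tract_features):
    """Left-join tract-level features onto individual training rows."""
    training_rows = []
    for person in individuals:
        if person["bc_risk_2cat"] is None:  # mirrors WHERE A.bc_risk_2cat IS NOT NULL
            continue
        block = person["census_block"]
        tract = block[:11] if block is not None else None
        # LEFT JOIN: keep the individual even when the tract has no features
        features = tract_features.get(tract, {})
        training_rows.append({
            "id": person["id"],
            "bc_risk_2cat": person["bc_risk_2cat"],
            "weight": person["weight"],
            "census_tract": tract,
            "bmi": features.get("bmi"),
            "bcrisk_scores": features.get("bcrisk_scores"),
        })
    return training_rows


# Toy inputs (made-up values for illustration only)
individuals = [
    {"id": 1, "bc_risk_2cat": 1, "weight": 1.2, "census_block": "170310101001000"},
    {"id": 2, "bc_risk_2cat": 0, "weight": 0.8, "census_block": "170310101001001"},
    {"id": 3, "bc_risk_2cat": None, "weight": 1.0, "census_block": "170310102002000"},
    {"id": 4, "bc_risk_2cat": 0, "weight": 1.0, "census_block": None},
]
tract_features = {"17031010100": {"bmi": 27.0, "bcrisk_scores": 0.4}}

training_rows = attach_tract_features(individuals, tract_features)
```

Individuals 1 and 2 share a tract and therefore share the same `bmi` and `bcrisk_scores`. Individual 3 is dropped for lacking a label, and individual 4 is kept with empty features.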

### Select top features for the aggregated training table

Again, we'll reduce the number of features in our training table so that only the most predictive features are present. We use the same script produced by a member of Civis's DSRD team to identify the variables that have the most predictive value for our dependent variable.

In [None]:
%%civisquery

DROP VIEW IF EXISTS cdph_train_bcrisk_topfeatures_agg;

CREATE VIEW cdph_train_bcrisk_topfeatures_agg AS
SELECT
id,
bc_risk_2cat,
weight,
retired,
household_income_is_null,
hh_general_purchase,
emp_dental_provider,
pct_race5way_black,
property_type_singlefamily,
pct_employment_finance,
email_append_ind,
emp_postal,
age,
adult_other_race_pct,
housing_temp_address,
median_age_white_male,
dog_enthusiast,
bcrisk_scores
FROM cdph_train_bcrisk_all_agg
;

------------------------------------------------------------------------------------------------
# STEP 6: Train Geographic-level Model

Set up the modeling pipeline by specifying which models to train and the cross-validation parameters to use for each one.

In [None]:
agg_models = {
    'sparse_logistic': {},
    'extra_trees_classifier': 'hyperband',
    'gradient_boosting_classifier': 'hyperband',
    'random_forest_classifier': 'hyperband'
} 


agg_futures = []

In [None]:
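for model_type, cv_params in agg_models.items():

    print("Currently testing model: " + model_type)
    print("Params: " + str(cv_params))
    print("--------------------------")

    # One pipeline per algorithm, all predicting the same dependent variable
    m = ModelPipeline(model = model_type, 
                      model_name = 'AGG DATA, WEIGHTED -- ' + model_type + ' DV is bc_risk_2cat',
                      dependent_variable = 'bc_risk_2cat',
                      primary_key = 'id',
                      cross_validation_parameters = cv_params,
                      memory_requested = 5000,  # in MiB
                      cpu_requested = 1024)     # 1024 shares = 1 CPU

    # Train on the tract-aggregated table, weighting rows by the `weight` column
    train = m.train(table_name = 'cdph_train_bcrisk_topfeatures_agg',
                    fit_params = {'sample_weight': 'weight'},
                    database_name = 'database'
                    )
    
    # Keep the future so we can check status and metrics later
    agg_futures.append(train)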

Check that jobs are running in the Civis platform:

In [None]:
for f in agg_futures:
    print("Model running?  " + str(f.running()))
    print("Job ID: " + str(f.job_id))
    print("Train Job ID: " + str(f.train_job_id))
    print("Run ID: " + str(f.train_run_id))
    print("------------------------------------------")
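Rather than re-running the cell above until every job has stopped, one option is to block until all training jobs finish. This sketch assumes the objects returned by `m.train()` behave like standard `concurrent.futures.Future` objects, which is how the Civis Python client describes its futures. The thread-pool futures and the `fake_training_job` function below are stand-ins so the example runs without platform access.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait


def fake_training_job(name, seconds):
    """Stand-in for a remote training job: sleep briefly, then return a label."""
    time.sleep(seconds)
    return name


with ThreadPoolExecutor(max_workers=2) as pool:
    stand_in_futures = [
        pool.submit(fake_training_job, "model_a", 0.05),
        pool.submit(fake_training_job, "model_b", 0.10),
    ]
    # Block until every future has finished, or until the timeout expires
    done, not_done = wait(stand_in_futures, timeout=5)

finished = sorted(f.result() for f in done)
```

Under that assumption, the same pattern applied to our jobs would be `wait(agg_futures)`.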

------------------------------------------------------------------------------------------
# STEP 7: Compare Geographic-level Model Performance

After our models have finished running, we'll print out the metrics for each one and compare them. We'll select the best-performing model to score our dataset.

#### Among the models we tested, the sparse logistic model performed best.

In [None]:
for f in agg_futures: 
    print("\n************************************\n")
    # Only report on jobs that have stopped running and did not fail
    if not f.running() and f.metadata['run']['status'] != "exception":
        print("MODEL: " + f.metadata['model']['model'])
        print("DV: " + f.metadata['run']['configuration']['data']['y'][0])
        print("TRAINING TABLE: " + f.metadata['data_platform']['table_source']['tablename'])
        print("Job ID: " + str(f.job_id))
        try:
            print("\n-----------------------\n")
            print("AUC: " + str(f.metrics['roc_auc']))
            print("\n------------------------\n")
            print("CONFUSION MATRIX:  " + str(f.metrics['confusion_matrix']))
            print("\n------------------------------------\n")
            print("BEST PARAMS:")
            print(f.metadata['model']['cv_best_params'])
        except Exception as err:
            # Report missing metrics instead of silently skipping them
            print("Could not retrieve metrics: " + repr(err))
        print("\n************************************\n")
    else:
        print("Model still running or ended with an exception")
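If we want to pick the winner programmatically instead of reading the printed output, a small helper can do the comparison. This is a sketch only: `best_by_auc` is a hypothetical helper, and the AUC values below are placeholders, not results from this project. With real jobs, the input could be built from the same fields printed above, for example `{f.metadata['model']['model']: f.metrics['roc_auc'] for f in agg_futures}`.

```python
def best_by_auc(results):
    """Return the (name, auc) pair with the highest AUC.

    Models whose AUC is None (for example, failed jobs) are skipped.
    """
    scored = [(name, auc) for name, auc in results.items() if auc is not None]
    if not scored:
        raise ValueError("no finished models with an AUC to compare")
    return max(scored, key=lambda pair: pair[1])


# Placeholder values for illustration only
placeholder_results = {"model_a": 0.61, "model_b": 0.67, "model_c": None}
best_name, best_auc = best_by_auc(placeholder_results)
```

AUC is only one criterion. The confusion matrix printed above is still worth checking before committing to a model.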

-----------------------------------------------------------------------------
# STEP 8: Score Aggregated Scoring Table

### Load model trained on aggregated data for scoring

In [None]:
job_id = "ID NUMBER"  # placeholder: replace with the job ID of the selected model

# Look up the job once and reuse the response
job = client.jobs.get(job_id)
run_id = job['last_run']['id']
name = job['name']

# Print Model Info
print("NAME: " + name)
print("JOB ID: " + str(job_id))
print("RUN ID: " + str(run_id))

# Load model
loaded_agg_model = ModelPipeline.from_existing(job_id, run_id)

model_type = loaded_agg_model.model
print(model_type)

### Score scoring table with aggregated data

In [None]:
# Score table using model
scoring = loaded_agg_model.predict(table_name = "agg_score_table_bcrisk", 
                               database_name = "database",
                               output_table = "bcrisk_aggscores",
                               if_exists = "drop",
                               primary_key = "census_tract"