# Modeling the Uninsured Population (2017)
This notebook walks through the steps of building a model that predicts the uninsured population in America in 2017.

First, we'll set up our workspace. We'll be using the Civis Analytics platform API to connect to our data, create tables, and query our data. Through the Civis API, we'll also be able to use CivisML, our machine learning package. We'll use this to train and test our models, as well as make predictions.

To learn more about Civis Analytics and the data science platform we use to build this model, see https://www.civisanalytics.com/

#### NOTE: Some variable names and functions have been changed to protect proprietary information.

In [None]:
import civis

import pandas as pd
import numpy as np
from civis.ml import ModelPipeline  # we'll be using Civis's model pipeline to create and run our models

client = civis.APIClient()

# STEP 1: Create Training Table

1.  Grab the data for modeling
2.  Append the 2016 uninsured scores
3.  Append responses from the survey we ran in 2017 (Oct 23, 2017 - Nov 24, 2017)
4.  Recode the survey question about insurance status as a binary variable where uninsured = 1 and insured = 0
5.  Rebalance the table so that the split is approximately 80/20 insured/uninsured; without this step, insured respondents would dominate the training data.
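To make steps 3 and 4 concrete, here is a minimal pure-Python sketch of the deduplication and recoding logic. The answer strings are copied from the query in this notebook; the sample rows and the function name `recode_responses` are made up for illustration. Note that the SQL keeps an arbitrary row per `id` (the window has no `ORDER BY`), whereas this sketch keeps the first one it sees.

```python
# Answer strings copied from the survey query in this notebook.
RECODE = {
    "Yes - through the government (Medicare, Medicaid)": 0,
    "Yes - through my employer or spouse/partner's employer, or I purchase it myself": 0,
    "No - I don't have health insurance": 1,
}


def recode_responses(rows):
    """Keep one response per id and map the answer to 1 (uninsured), 0 (insured), or None."""
    seen = set()
    recoded = []
    for respondent_id, answer in rows:
        if respondent_id in seen:
            continue  # drop duplicate survey respondents
        seen.add(respondent_id)
        # Any answer not listed above becomes None, like the NULL default in DECODE
        recoded.append((respondent_id, RECODE.get(answer)))
    return recoded


# Hypothetical survey rows (id, answer), for illustration only.
sample = [
    (1, "No - I don't have health insurance"),
    (2, "Yes - through the government (Medicare, Medicaid)"),
    (2, "No - I don't have health insurance"),
    (3, "Prefer not to say"),
]
recoded_sample = recode_responses(sample)
print(recoded_sample)
```

Rows recoded to `None` are dropped later, because the query only keeps rows where `cdph_uninsured` is 1 or 0.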


In [None]:
%%civisquery redshift-general  -- this allows us to query our data using SQL in this cell


SET SEED TO .42;

DROP VIEW IF EXISTS cdph_train_uninsured;
CREATE VIEW cdph_train_uninsured AS
(
SELECT *
FROM
(
    SELECT full_modeling_data.*, 
    s.cdph_uninsured
    FROM
    (
        SELECT *
        FROM
        (
            SELECT 
            id,
            DECODE(cdph_insured,
                   'Yes - through the government (Medicare, Medicaid)', 0,
                   'Yes - through my employer or spouse/partner\'s employer, or I purchase it myself', 0,
                   'No - I don\'t have health insurance', 1,
                   NULL) AS cdph_uninsured,
            ROW_NUMBER() OVER (PARTITION BY id) AS dupes -- remove duplicate survey respondents
            FROM health_care.cdph_survey
       ) WHERE dupes = 1
    ) AS s 
    LEFT JOIN
    (
      SELECT *
      FROM modeling_data AS md
      LEFT JOIN
      (
          SELECT join_key, uninsured_2016
          FROM uninsured2016_data
      ) AS data_2016  -- alias must not start with a digit
      ON md.id = data_2016.join_key
    ) AS full_modeling_data
    ON full_modeling_data.id = s.id
)
WHERE cdph_uninsured = 1
)
UNION ALL
(
SELECT *
FROM
(
    SELECT full_modeling_data.*, 
    s.cdph_uninsured
    FROM
    (
        SELECT *
        FROM
        (
            SELECT 
            id,
            DECODE(cdph_insured,
                   'Yes - through the government (Medicare, Medicaid)', 0,
                   'Yes - through my employer or spouse/partner\'s employer, or I purchase it myself', 0,
                   'No - I don\'t have health insurance', 1,
                   NULL) AS cdph_uninsured,
            ROW_NUMBER() OVER (PARTITION BY id) AS dupes -- remove duplicate survey respondents
            FROM health_care.cdph_survey
       ) WHERE dupes = 1
    ) AS s 
    LEFT JOIN
    (
      SELECT *
      FROM modeling_data AS md
      LEFT JOIN
      (
          SELECT join_key, uninsured_2016
          FROM uninsured2016_data
      ) AS data_2016  -- alias must not start with a digit
      ON md.id = data_2016.join_key
    ) AS full_modeling_data
    ON full_modeling_data.id = s.id
)
WHERE cdph_uninsured = 0
ORDER BY RANDOM()  -- take a random subset of insured people
LIMIT 1940
)
;
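The rebalancing in the query above keeps every uninsured respondent and a random subset of insured respondents. As a rough sketch of the arithmetic, assuming an 80/20 target, the number of insured rows to keep is four times the number of uninsured rows. If `LIMIT 1940` reflects that ratio, it would correspond to roughly 485 uninsured respondents; that count is inferred from the limit and is not stated in this notebook. The function name, seed, and sample labels below are hypothetical.

```python
import random


def downsample_insured(labels, insured_share=0.8, seed=42):
    """Return row indices: all uninsured (1) plus a random sample of insured (0).

    The sample size is chosen so insured rows make up about insured_share of the result.
    Rows with a missing label (None) are left out, as in the SQL WHERE clauses.
    """
    uninsured = [i for i, y in enumerate(labels) if y == 1]
    insured = [i for i, y in enumerate(labels) if y == 0]
    target = round(len(uninsured) * insured_share / (1 - insured_share))
    rng = random.Random(seed)
    kept = rng.sample(insured, min(target, len(insured)))
    return sorted(uninsured + kept)


# Hypothetical labels: 50 uninsured, 1000 insured, 5 unusable responses.
labels = [1] * 50 + [0] * 1000 + [None] * 5
kept_idx = downsample_insured(labels)
n_uninsured = sum(labels[i] == 1 for i in kept_idx)
n_insured = sum(labels[i] == 0 for i in kept_idx)
print(n_insured, n_uninsured)
```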


# STEP 2: Train Model
To train our model, we'll first specify the classifiers we want to try from CivisML, Civis's machine learning package.

In [None]:
# Map each CivisML classifier to its tuning setting:
# 'hyperband' where we tune with hyperband, None where no tuning parameters are passed.
models = {
    'sparse_logistic': None,
    'extra_trees_classifier': 'hyperband',
    'gradient_boosting_classifier': 'hyperband',
    'random_forest_classifier': 'hyperband',
    'multilayer_perceptron_classifier': 'hyperband',
    'stacking_classifier': None,
}

futures = []



Next, we'll loop through each of the models we specified. As of December 2017, we are using CivisML 2.0, which includes new models and more sophisticated tuning (i.e. "hyperband").
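For readers unfamiliar with hyperband: it builds on successive halving, which evaluates many configurations on a small budget, keeps the best fraction, and gives the survivors a larger budget. The sketch below shows a single round-by-round successive halving loop in plain Python. It is an illustration of the idea only, not CivisML's implementation; the function names, the toy `toy_score` function, and the `eta=3` default are assumptions.

```python
def successive_halving(configs, evaluate, min_budget=1, eta=3):
    """Keep the best 1/eta of the configurations each round.

    Survivors are re-evaluated with eta times more budget until one remains.
    """
    budget = min_budget
    survivors = list(configs)
    while len(survivors) > 1:
        scored = sorted(survivors, key=lambda c: evaluate(c, budget), reverse=True)
        keep = max(1, len(scored) // eta)
        survivors = scored[:keep]
        budget *= eta
    return survivors[0]


def toy_score(config, budget):
    """Toy stand-in for 'train with this budget and return a validation score'.

    A real evaluation would use the budget (e.g. number of trees or epochs);
    this one ignores it so that the example stays deterministic.
    """
    return -(config - 0.3) ** 2


candidate_configs = [i / 10 for i in range(10)]  # hypothetical values of one hyperparameter
best_config = successive_halving(candidate_configs, toy_score)
print(best_config)
```

Hyperband runs several such loops with different trade-offs between the number of configurations and the starting budget.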

For each model, we'll create a model pipeline where we provide the dependent variable and a primary key (i.e. a column with a unique identifier for each row or observation). We can also choose to exclude specific columns. We will then train the model using the training table we created.

The CivisML modeling pipeline will automatically split the data into a train and test set, as well as cross-validate the model.
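To illustrate what cross-validation does with the rows of a table, here is a small sketch of k-fold index splitting in plain Python. It only shows the general technique; the number of folds CivisML uses and how it shuffles or stratifies the data are not covered here, and the function name is hypothetical.

```python
def k_fold_indices(n_rows, k):
    """Yield (train_idx, test_idx) pairs; each row lands in exactly one test fold."""
    # Spread the remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_rows))
        yield train_idx, test_idx
        start += size


folds = list(k_fold_indices(10, 4))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

The model is fit once per fold on the training indices and evaluated on the held-out indices, and the metrics are then combined across folds.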

In [None]:
for i, params in models.items():

    # DEFAULT TO HYPERBAND WITH CivisML 2.0, except for the two models we leave untuned
    if i not in ("sparse_logistic", "stacking_classifier"):
        params = "hyperband"

    print("Currently testing model: " + i)
    print("Params: " + str(params))
    print("--------------------------")

    m = ModelPipeline(model = i, 
                      model_name = 'Uninsured model ' + i + ' DV is: cdph_uninsured',
                      dependent_variable = 'cdph_uninsured',
                      primary_key = 'id',
                      excluded_columns = ['state', 
                                          'join_key',
                                          'cdph_pregnant',
                                          'cdph_pregnant_age',
                                          'cdph_diagnosis_breastcancer',
                                          'cdph_family_breastcancer',
                                          'cdph_brca_test',
                                          'cdph_oral_contraceptives',
                                          'cdph_female_hormones',
                                         ],
                      cross_validation_parameters = params,
                      memory_requested = 3000,
                      cpu_requested = 800)

    train = m.train(table_name = 'cdph_train_uninsured',
                    database_name = 'database')
    futures.append(train)
        


Check whether the models are still running.

In [None]:
for f in futures:
    print("Model running?  " + str(f.running()))
    print("Job ID: " + str(f.job_id))
    print("Train Job ID: " + str(f.train_job_id))
    print("Run ID: " + str(f.train_run_id))
    print("------------------------------------------")

# STEP 3: Compare Model Performance
After our models have finished running, we'll print out the cross-validation metrics for each one and compare them. We'll select the best-performing model to score our dataset.

#### The sparse logistic model ended up having the best performance among the 6 models we tested.
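The two metrics we compare are the AUC and the confusion matrix. As a reference for how to read them, here is a minimal from-scratch sketch of both in plain Python. The labels and scores are made-up toy values, not results from our models, and the function names are hypothetical; the metrics reported by CivisML are computed by the platform, not by this code.

```python
def roc_auc(y_true, scores):
    """Probability that a random positive outranks a random negative (ties count as half)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def confusion_matrix(y_true, scores, threshold=0.5):
    """Return [[TN, FP], [FN, TP]] after thresholding the scores."""
    matrix = [[0, 0], [0, 0]]
    for y, s in zip(y_true, scores):
        matrix[y][int(s >= threshold)] += 1
    return matrix


# Toy data: 1 = uninsured, 0 = insured
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = roc_auc(toy_labels, toy_scores)
toy_matrix = confusion_matrix(toy_labels, toy_scores)
print(toy_auc)     # 0.75
print(toy_matrix)  # [[2, 0], [1, 1]]
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, so higher values indicate a model that separates insured from uninsured respondents better.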

In [None]:
for f in futures:
    print("\n************************************\n")
    # Only report on models that have finished without raising an exception
    if not f.running() and f.metadata['run']['status'] != "exception":
        print("MODEL: " + f.metadata['model']['model'])
        print("DV: " + f.metadata['run']['configuration']['data']['y'][0])
        print("TRAINING TABLE: " + f.metadata['data_platform']['table_source']['tablename'])
        try:
            print("\n-----------------------\n")
            print("AUC: " + str(f.metrics['roc_auc']))
            print("\n------------------------\n")
            print("CONFUSION MATRIX:  " + str(f.metrics['confusion_matrix']))
            print("\n------------------------------------\n")
            print("BEST PARAMS:")
            print(f.metadata['model']['cv_best_params'])
        except Exception as e:
            # Some metrics may be missing for a given model; report it and move on
            print("Could not print metrics: " + repr(e))
        print("\n************************************\n")
    else:
        print("Model not finished running, or the run failed")



# STEP 4: Create Scoring Table and Score
We'll create a scoring table with the same features as our training set. This scoring table has data on Chicago residents, so once we're done scoring, we can use it to create a heat map showing where the uninsured population in Chicago lives.

In [None]:
%%civisquery redshift-general

DROP VIEW IF EXISTS score_table_uninsured2017;
CREATE VIEW score_table_uninsured2017 AS
SELECT A.*, B.score AS uninsured_2016
FROM
(
  (
    SELECT *
    FROM modeling_data
  ) AS A
  LEFT JOIN
  (
    SELECT score, join_key
    FROM "2016scores"  -- quoted because the name starts with a digit
  ) AS B
  ON A.id = B.join_key
)
WHERE B.score IS NOT NULL;
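The scoring query above combines a `LEFT JOIN` with `WHERE B.score IS NOT NULL`, which keeps only people who have a 2016 score. As a small check of that behavior, the sketch below runs the same pattern against an in-memory SQLite database and compares it with an `INNER JOIN`. The table `scores_2016`, the `age` column, and all values are hypothetical, and SQLite stands in for Redshift here. The two queries match as long as the stored scores themselves are never NULL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE modeling_data (id INTEGER, age INTEGER);
    CREATE TABLE scores_2016 (join_key INTEGER, score REAL);
    INSERT INTO modeling_data VALUES (1, 34), (2, 51), (3, 27);
    INSERT INTO scores_2016 VALUES (1, 0.12), (3, 0.67);
""")

# Same pattern as the scoring view: left join, then drop rows without a score
left_filtered = conn.execute("""
    SELECT A.id, B.score AS uninsured_2016
    FROM modeling_data AS A
    LEFT JOIN scores_2016 AS B ON A.id = B.join_key
    WHERE B.score IS NOT NULL
    ORDER BY A.id
""").fetchall()

# Equivalent inner join for comparison
inner = conn.execute("""
    SELECT A.id, B.score AS uninsured_2016
    FROM modeling_data AS A
    INNER JOIN scores_2016 AS B ON A.id = B.join_key
    ORDER BY A.id
""").fetchall()

conn.close()
print(left_filtered)
print(left_filtered == inner)
```

Person 2 has no 2016 score in this toy data, so they are excluded from the scoring table in both versions.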



Using the Civis API, we'll grab the model with the best performance (sparse logistic), and use this model to score our data set. 

In [None]:
# Grab Job ID, Run ID, name from output or Civis Platform UI
job_id = ########
job = client.jobs.get(job_id)  # fetch the job once and reuse the response
run_id = job['last_run']['id']
name = job['name']

# Print Model Info
print("NAME: " + name)
print("JOB ID: " + str(job_id))
print("RUN ID: " + str(run_id))

# Load model
loaded_model = ModelPipeline.from_existing(job_id, run_id)

model_type = loaded_model.model
print(model_type)  # model type is sparse logistic

# Score table using model
scoring = loaded_model.predict(table_name = "score_table_uninsured2017",
                               database_name = "database",
                               output_table = "uninsured2017")

# STEP 5: Create Table for Plotting Map
We'll grab the output scores and join them to a table with geographic information, which we will then use to create maps.

In [None]:
%%civisquery redshift-general

DROP TABLE IF EXISTS insurance_mapping;
CREATE TABLE insurance_mapping 
AS
SELECT 
insured_scores.uninsured2017, 
CASE WHEN insured_scores.uninsured2017 >= 0.5 THEN 1 ELSE 0 END AS uninsured2017_2cat,
ds.* FROM
(
    (
        SELECT id, cdph_uninsured_1 AS uninsured2017
        FROM health_care.uninsured2017
    ) AS insured_scores
    LEFT JOIN
    (
        SELECT id, state, census_block, county, county_name
        FROM dataset
        WHERE gender = 'Female'
    ) AS ds
    ON insured_scores.id = ds.id
);
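Once the mapping table exists, a heat map typically needs one value per geographic unit. The sketch below shows one way to get there in plain Python: apply the same 0.5 threshold used for `uninsured2017_2cat` and compute the share of people classified as uninsured in each census block. The block names, scores, and function name are hypothetical, and aggregating by census block is an assumption about how the map is built.

```python
from collections import defaultdict


def block_uninsured_rates(rows, threshold=0.5):
    """Return {census_block: share of people with score >= threshold}."""
    totals = defaultdict(lambda: [0, 0])  # census_block -> [uninsured count, total count]
    for census_block, score in rows:
        totals[census_block][0] += int(score >= threshold)
        totals[census_block][1] += 1
    return {block: uninsured / total for block, (uninsured, total) in totals.items()}


# Hypothetical (census_block, uninsured2017 score) pairs
scored_rows = [
    ("block_a", 0.81),
    ("block_a", 0.22),
    ("block_b", 0.55),
    ("block_b", 0.64),
]
rates = block_uninsured_rates(scored_rows)
print(rates)
```

Averaging the raw scores per block instead of thresholding first is another reasonable choice; it keeps more information from the model but is harder to read as a head count.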