The big-picture idea here is that I want to build a machine learning model that can predict when a county will have a high (>75th percentile) rate of COVID fatalities. Then I will pull out the most important factors of that model. We have a lot of data that is moderately correlated with COVID fatality rates, and many of those fields are highly correlated with each other. So the machine learning model will help tease out what is really important within the raw data, in a way that minimizes my own biases.

First, a bunch of imports:

In [15]:
import sys
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import RidgeClassifierCV
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.svm import SVC
import pandas as pd
from sklearn.metrics import roc_auc_score
from COVID_data import prepare_model_data

from sklearn.linear_model import Lasso, LogisticRegressionCV, LassoLarsCV, ElasticNetCV
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier
from sklearn.svm import LinearSVC
from sklearn.metrics import make_scorer
from sklearn.model_selection import cross_val_score
import pprint

class config:
    USE_CACHE = True
    CACHE_DIR = "/Users/caseydurfee/msds/data_mining_final_project/cache"

from COVID_data import all_data
data = all_data.get_all_data(config)

Let's look at correlations in year 1 of the pandemic versus year 2 and see if there were changes.  Note that we are looking at all factors, including ones with large rates of missing data, which will artificially inflate those correlation scores. Some of these factors won't be used by the model, because I am throwing out any fields where more than 5% of counties have missing data.

I will get the top 20 factors for each year, filtering out correlations with other death rates.

In [38]:
death_rate_corr = data.corr()['DEATH_RATE_FIRST_YEAR']

# omicron, alpha, delta, etc. death rates are not interesting here
death_cols = list(data.filter(regex = 'DEATH'))

y1_corr = death_rate_corr.reindex(death_rate_corr.abs().sort_values().index) \
    .drop(death_cols)

print(y1_corr[-20:].to_string())  # top 20 by absolute correlation

% Some College                      -0.256085
% Free or Reduced Lunch              0.261195
% Uninsured (SVI)                    0.262803
Per Capita Income (SVI)             -0.264968
% Fair/Poor                          0.272457
Life Expectancy                     -0.284754
Infant Mortality Rate                0.285151
MEDIAN_FAMILY                       -0.285368
% Uninsured                          0.291818
% Children in Poverty                0.292585
% Physically Inactive                0.298508
Age-Adjusted Mortality               0.304026
Child Mortality Rate                 0.305280
Years of Potential Life Lost Rate    0.311401
Homicide Rate                        0.319756
% No HS Diploma (SVI)                0.321150
MV Mortality Rate                    0.336547
% Disconnected Youth                 0.337717
Teen Birth Rate                      0.350611
Age-Adjusted Mortality (Hispanic)    0.376489


In [39]:
death_rate_corr = data.corr()['DEATH_RATE_SECOND_YEAR']

death_cols = list(data.filter(regex = 'DEATH'))

y2_corr = death_rate_corr.reindex(death_rate_corr.abs().sort_values().index) \
    .drop(death_cols)

print(y2_corr[-20:].to_string())  # top 20 by absolute correlation

% Physically Inactive                0.404044
% Excessive Drinking                -0.414861
Mentally Unhealthy Days              0.418606
REPUB_PARTISAN                       0.419108
Partial Coverage                    -0.420090
Per Capita Income (SVI)             -0.424035
Firearm Fatalities Rate              0.427629
% Frequent Mental Distress           0.431077
% Frequent Physical Distress         0.431779
% Diabetic                           0.432851
Booster Coverage                    -0.433765
Complete Coverage                   -0.439520
% Some College                      -0.442925
Teen Birth Rate                      0.450424
Physically Unhealthy Days            0.452853
MEDIAN_FAMILY                       -0.457310
Years of Potential Life Lost Rate    0.466963
% Disabled (SVI)                     0.482125
Life Expectancy                     -0.490274
Age-Adjusted Mortality               0.495016
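
The two cells above differ only in the target column, so the lookup could be wrapped in a small helper. This is a minimal sketch on made-up data; `top_correlations` and the toy columns are hypothetical and not part of the project code.

```python
import numpy as np
import pandas as pd


def top_correlations(frame, target, k=20, drop_regex="DEATH"):
    """Return the k columns most correlated (by absolute value) with target."""
    corr = frame.corr()[target]
    # drop the target itself and any other death-rate columns
    corr = corr.drop(list(frame.filter(regex=drop_regex)))
    # order by magnitude, but keep the sign in the result
    order = corr.abs().sort_values().index
    return corr.reindex(order).iloc[-k:]


# toy data standing in for the county table (all values are synthetic)
rng = np.random.default_rng(0)
income = rng.normal(size=200)
toy = pd.DataFrame({
    "Per Capita Income": income,
    "Noise": rng.normal(size=200),
    "DEATH_RATE_FIRST_YEAR": -income + rng.normal(scale=0.5, size=200),
    "DEATH_RATE_SECOND_YEAR": -income + rng.normal(scale=0.5, size=200),
})

top = top_correlations(toy, "DEATH_RATE_FIRST_YEAR", k=2)
print(top.to_string())
```

The strongest factor ends up last, matching the ordering of the printed lists above.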


Some things to note here:
* There are some metrics that are extremely similar (Age-Adjusted Mortality is basically the inverse of Life Expectancy). That's fine, since we're going to let the model decide which one is the best for predicting high COVID fatality rates.

* Vaccination rates don't predict year one fatality rate, which makes sense, given time only flows in one direction. They do show up as a major factor in year two, as expected.

* The correlations in year 2 are all much stronger than in year 1. This suggests year 2 death rates were more predictable based on data we had before the pandemic (and hence that the deaths were more preventable).

We'll start by evaluating models on the data over the entire pandemic, then see if we can do better by splitting up year 1 and year 2.

I am going to test a wide range of models and we'll go with whatever performs best. I'm using ROC AUC as the scoring metric since we have unbalanced classes.

The data is going to be limited to counties with a population over 50,000, so that I can use the largest number of factors without doing too much imputation.
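
`prepare_model_data.make_train_test` isn't shown here, so what follows is only a sketch of the kind of preparation described so far, not the project's actual code. The column names (`POPULATION`, `DEATH_RATE`) and the `make_xy` helper are assumptions for illustration: keep counties over a population floor, drop fields missing in more than 5% of the remaining counties, and label the top quartile of death rates as the positive class.

```python
import numpy as np
import pandas as pd


def make_xy(frame, target="DEATH_RATE", pop_col="POPULATION",
            min_pop=50_000, max_missing=0.05, quantile=0.75):
    """Hypothetical stand-in for the data prep step (target assumed complete)."""
    big = frame[frame[pop_col] > min_pop]
    # keep only fields missing in at most max_missing of the counties
    keep = big.columns[big.isna().mean() <= max_missing]
    big = big[keep]
    # positive class = counties above the 75th percentile of the target
    y = (big[target] > big[target].quantile(quantile)).astype(int)
    X = big.drop(columns=[target, pop_col])
    # simple median imputation for the few remaining gaps
    X = X.fillna(X.median())
    return X, y


# synthetic counties: one is too small, one field is mostly missing
toy = pd.DataFrame({
    "POPULATION": [10_000, 60_000, 75_000, 90_000, 120_000,
                   200_000, 300_000, 500_000, 800_000],
    "DEATH_RATE": [9.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    "% Uninsured": [20.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0],
    "Mostly Missing": [1.0, np.nan, np.nan, np.nan, 2.0, 3.0, np.nan, np.nan, 4.0],
})

X, y = make_xy(toy)
print(X.columns.tolist(), int(y.sum()), "positive of", len(y))
```

Labeling by the 75th percentile means roughly a quarter of counties are positive, which is the class imbalance mentioned above.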

In [29]:
SEED = 2718
ITERS = 20000

# test_models = [ RandomForestClassifier(random_state = seed), 
            #     SVC(random_state = seed), 
            #     LogisticRegressionCV(max_iter=20000, random_state = seed), 
            #     RidgeClassifierCV(), 
            #     AdaBoostClassifier(random_state = seed), 
            #     BaggingClassifier(random_state = seed),
            #     GradientBoostingClassifier(random_state = seed),
            #     LinearSVC(max_iter=20000, random_state = seed),
            #     LassoLarsCV(max_iter=20000, normalize=False),
            #     ElasticNetCV(random_state = seed, max_iter=20000),
            # ]
test_model_classes = [ 
    RandomForestClassifier,
    SVC, 
    LogisticRegressionCV, 
    RidgeClassifierCV, 
    AdaBoostClassifier, 
    BaggingClassifier,
    GradientBoostingClassifier,
    LinearSVC,
    LassoLarsCV,
    ElasticNetCV,
]

test_models = []

## a little voodoo to pass arguments to the classes that take them.
import inspect
for k in test_model_classes:
    f_args = {}
    sig = inspect.signature(k.__init__)
    if 'random_state' in sig.parameters:
        f_args['random_state'] = SEED
    if 'max_iter' in sig.parameters:
        f_args['max_iter'] = ITERS
    test_models.append(k(**f_args))

roc_auc_scorer = make_scorer(roc_auc_score)

In [None]:
df, X, y = prepare_model_data.make_train_test(data, year=False, min_pop=50000, split=False)

roc_scores = {}
max_roc = 0.0
best_model = None
for model in test_models:
    model_name = repr(model.__class__)

    mean_score = cross_val_score(model, X, y, scoring=roc_auc_scorer).mean()

    if model_name not in roc_scores:
        roc_scores[model_name] = 0.0
    roc_scores[model_name] += mean_score
    if mean_score > max_roc:
        best_model = model
        max_roc = mean_score

sz = pd.Series(roc_scores)
print(sz.sort_values())

Now, let's try training models on year 1 and year 2 individually.

In [10]:
ITERS = 1  # number of repeats to average over (reuses the name of the max_iter constant above; the models are already built)

for year in [1,2]:
    df, X, y = prepare_model_data.make_train_test(data, year=year, min_pop=50000, split=False)

    roc_scores = {}
    max_roc = 0.0
    best_model = None
    for i in range(ITERS):
        for model in test_models:
            model_name = repr(model.__class__)
            
            mean_score = cross_val_score(model, X, y, scoring=roc_auc_scorer).mean()

            if model_name not in roc_scores:
                roc_scores[model_name] = 0.0
            roc_scores[model_name] += mean_score
            if mean_score > max_roc:
                best_model = model
                max_roc = mean_score

    sz = pd.Series(roc_scores)
    print(f">>>> mean scores for year {year}")
    print(sz / ITERS)




>>>> mean scores for year 1
<class 'sklearn.ensemble._forest.RandomForestClassifier'>          0.595732
<class 'sklearn.svm._classes.SVC'>                                 0.588568
<class 'sklearn.linear_model._logistic.LogisticRegressionCV'>      0.584277
<class 'sklearn.linear_model._ridge.RidgeClassifierCV'>            0.613578
<class 'sklearn.ensemble._weight_boosting.AdaBoostClassifier'>     0.644440
<class 'sklearn.ensemble._bagging.BaggingClassifier'>              0.585939
<class 'sklearn.ensemble._gb.GradientBoostingClassifier'>          0.610915
<class 'sklearn.svm._classes.LinearSVC'>                           0.636336
<class 'sklearn.linear_model._least_angle.LassoLarsCV'>            0.740688
<class 'sklearn.linear_model._coordinate_descent.ElasticNetCV'>    0.739316
dtype: float64


  model = cd_fast.enet_coordinate_descent_gram(


>>>> mean scores for year 2
<class 'sklearn.ensemble._forest.RandomForestClassifier'>          0.687783
<class 'sklearn.svm._classes.SVC'>                                 0.680794
<class 'sklearn.linear_model._logistic.LogisticRegressionCV'>      0.714251
<class 'sklearn.linear_model._ridge.RidgeClassifierCV'>            0.693006
<class 'sklearn.ensemble._weight_boosting.AdaBoostClassifier'>     0.696235
<class 'sklearn.ensemble._bagging.BaggingClassifier'>              0.678718
<class 'sklearn.ensemble._gb.GradientBoostingClassifier'>          0.716683
<class 'sklearn.svm._classes.LinearSVC'>                           0.715096
<class 'sklearn.linear_model._least_angle.LassoLarsCV'>            0.865199
<class 'sklearn.linear_model._coordinate_descent.ElasticNetCV'>    0.865058
dtype: float64


Looks like ElasticNetCV and LassoLarsCV are our big winners. They both apply regularization, which punishes models for being too complex. This is perfect for our purposes, since we're trying to pull out the top factors that matter the most.

Because there's not a huge difference between them, I selected LassoLarsCV for the final models. It produces simpler models than ElasticNetCV and doesn't have trouble converging on a solution.
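
One caveat worth checking before reading too much into the gap: `make_scorer(roc_auc_score)` scores whatever `predict` returns. For the classifiers that is a hard 0/1 label, while `LassoLarsCV` and `ElasticNetCV` are regressors whose `predict` returns a continuous value, so the two groups may not be ranked on quite the same footing. The sketch below uses synthetic data (not the county data) and a plain `LogisticRegression` to show how the same fitted model can get two different AUCs depending on whether labels or scores are passed in.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# synthetic, unbalanced two-class problem (about 25% positives)
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           weights=[0.75, 0.25], random_state=2718)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=2718)

clf = LogisticRegression(max_iter=20000).fit(X_train, y_train)

hard_labels = clf.predict(X_test)          # what make_scorer(roc_auc_score) sees
scores = clf.predict_proba(X_test)[:, 1]   # continuous scores for the positive class

auc_from_labels = roc_auc_score(y_test, hard_labels)
auc_from_scores = roc_auc_score(y_test, scores)
print(f"AUC from hard labels:   {auc_from_labels:.3f}")
print(f"AUC from probabilities: {auc_from_scores:.3f}")
```

For the classifiers, passing `scoring="roc_auc"` to `cross_val_score` is one way to score on continuous outputs instead of labels. If their scores move closer to the Lasso-type models under that metric, the ranking above may need revisiting.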

Now, let's use permutation importance to figure out what really matters to these models that we've built.
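
As a reminder of what permutation importance measures: shuffle one column at a time, re-score the fitted model, and record how much the score drops. The hand-rolled sketch below uses synthetic data and a plain `LinearRegression` for illustration; it is not the project's code. Note that when no `scoring` argument is given, `permutation_importance` uses the estimator's own `score` method, which for a regressor such as `LassoLarsCV` is R², so the importances are drops in R² rather than in AUC.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))
# column 0 matters a lot, column 1 a little, column 2 not at all
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=400)

model = LinearRegression().fit(X, y)
baseline = model.score(X, y)  # R^2 on the unshuffled data

importances = []
for col in range(X.shape[1]):
    drops = []
    for _ in range(20):
        X_perm = X.copy()
        # shuffling one column breaks its link to y but keeps its distribution
        X_perm[:, col] = rng.permutation(X_perm[:, col])
        drops.append(baseline - model.score(X_perm, y))
    importances.append(float(np.mean(drops)))

print(importances)
```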

In [45]:
from sklearn.inspection import permutation_importance

df, X, y = prepare_model_data.make_train_test(data, year=1, split=False)

best_model = LassoLarsCV(normalize=False).fit(X, y)

result = permutation_importance(best_model, X, y, 
            n_repeats=100)

y1_importance = pd.Series(result.importances_mean, index=df.columns)

disp = y1_importance.reindex(y1_importance.abs().sort_values().index)

print("Year one")
print(disp[disp > 0])

Year one
% Insufficient Sleep     0.000886
% Female                 0.000934
% Rural                  0.000949
MV Mortality Rate        0.001100
% Alcohol-Impaired       0.008845
% No Vehicle (SVI)       0.016103
Income Ratio             0.016110
% No HS Diploma (SVI)    0.021963
% Over 65 (SVI)          0.028022
% Hispanic               0.029458
% Physically Inactive    0.121710
dtype: float64


In [48]:
df, X, y = prepare_model_data.make_train_test(data, year=2, split=False)

best_model = LassoLarsCV(normalize=False).fit(X, y)

result = permutation_importance(best_model, X, y, 
            n_repeats=100)

perm_importances = pd.Series(result.importances_mean, index=df.columns)

y2_importance = perm_importances.reindex(perm_importances.abs().sort_values().index)

print("Year two")
print(y2_importance[y2_importance > 0][-20:])

Year two
% Physically Inactive               0.000861
% Smokers                           0.000921
% African American                  0.001317
Preventable Hosp. Rate              0.001364
Physically Unhealthy Days           0.001840
MV Mortality Rate                   0.001899
% Rural                             0.002279
% Asian                             0.002709
Complete Coverage                   0.003746
% Vaccinated                        0.004402
% Female                            0.004477
Injury Death Rate                   0.008716
Age-Adjusted Mortality              0.010363
Partial Coverage                    0.012298
% Over 65 (SVI)                     0.020325
% Uninsured (SVI)                   0.023296
Teen Birth Rate                     0.033336
% Single Parent Households (SVI)    0.035485
% Disabled (SVI)                    0.039334
REPUB_PARTISAN                      0.071253
dtype: float64


These importances don't tell us the direction of the correlation (whether a factor increases or decreases the likelihood of a county having a high COVID fatality rate). We can join this data with the correlation data to show all factors that mattered during the pandemic.
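
Since the final models are linear, another way to recover direction is the sign of each fitted coefficient. A minimal sketch on synthetic data (the feature names and values are invented for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LassoLarsCV

rng = np.random.default_rng(2718)
n = 500
X = pd.DataFrame({
    "risk_factor": rng.normal(size=n),
    "protective_factor": rng.normal(size=n),
    "irrelevant": rng.normal(size=n),
})
# binary target pushed up by one feature and down by another
signal = 2.0 * X["risk_factor"] - 2.0 * X["protective_factor"]
y = (signal + rng.normal(scale=0.5, size=n) > 1.0).astype(int)

model = LassoLarsCV().fit(X, y)
coefs = pd.Series(model.coef_, index=X.columns)
print(coefs)
```

A coefficient's sign is conditional on the other features in the model, so it can disagree with the raw correlation for the same factor.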

In [57]:
summary = y1_importance.rename('Y1_IMPORTANCE').to_frame()\
    .join(y1_corr.rename("Y1_CORR"))\
    .join(y2_importance.rename('Y2_IMPORTANCE'))\
    .join(y2_corr.rename("Y2_CORR"))

## todo: put ranks in for importance of factors (most important=1, etc.)


summary[(summary.Y1_IMPORTANCE > 0) | (summary.Y2_IMPORTANCE > 0)].sort_values(by='Y2_IMPORTANCE', ascending=False)

                                  Y1_IMPORTANCE    Y1_CORR  Y2_IMPORTANCE    Y2_CORR
REPUB_PARTISAN                              0.0   0.136991       0.071253   0.419108
% Disabled (SVI)                            0.0   0.090178       0.039334   0.482125
% Single Parent Households (SVI)            0.0    0.19183       0.035485   0.025415
Teen Birth Rate                             0.0   0.350611       0.033336   0.450424
% Uninsured (SVI)                           0.0   0.262803       0.023296   0.253985
% Over 65 (SVI)                        0.028022   0.061256       0.020325   0.239974
Partial Coverage                            0.0  -0.146361       0.012298   -0.42009
Age-Adjusted Mortality                      0.0   0.304026       0.010363   0.495016
Injury Death Rate                           0.0   0.126617       0.008716    0.36114
% Female                               0.000934   0.008388       0.004477   0.031939
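
The todo in the cell above could be handled with `Series.rank`. A minimal sketch using a few of the year-two importance values printed in this notebook; `Y2_RANK` is a made-up column name.

```python
import pandas as pd

summary = pd.DataFrame(
    {"Y2_IMPORTANCE": [0.071253, 0.039334, 0.035485, 0.0]},
    index=["REPUB_PARTISAN", "% Disabled (SVI)",
           "% Single Parent Households (SVI)", "Mentally Unhealthy Days"],
)

# rank 1 = most important; factors with zero importance are left unranked (NaN)
nonzero = summary["Y2_IMPORTANCE"].where(summary["Y2_IMPORTANCE"] > 0)
summary["Y2_RANK"] = nonzero.rank(ascending=False, method="min")
print(summary)
```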


Another way to look at what's changed is the difference in correlation between year one and year two. In some cases the correlation went up, and in others it went down.

In [59]:
summary['CORR_DIFF'] = abs(summary['Y2_CORR'] - summary['Y1_CORR'])

summary.sort_values(by='CORR_DIFF', ascending=False)[:10]

                            Y1_IMPORTANCE    Y1_CORR  Y2_IMPORTANCE    Y2_CORR  CORR_DIFF
Mentally Unhealthy Days               0.0   0.015366            0.0   0.418606    0.40324
% Disabled (SVI)                      0.0   0.090178       0.039334   0.482125   0.391947
Physically Unhealthy Days             0.0   0.114035        0.00184   0.452853   0.338818
Firearm Fatalities Rate               0.0   0.103636       0.000367   0.427629   0.323993
% Frequent Mental Distress            0.0   0.109787            0.0   0.431077    0.32129
% Minority (SVI)                      0.0   0.230978            0.0  -0.065145   0.296123
% Non-Hispanic White                  0.0  -0.230316            0.0   0.063103   0.293418
HIV Prevalence Rate                   0.0   0.200919            0.0  -0.086887   0.287806
REPUB_PARTISAN                        0.0   0.136991       0.071253   0.419108   0.282117
Partial Coverage                      0.0  -0.146361       0.012298   -0.42009   0.273729
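
Taking the absolute value of the difference hides whether a correlation strengthened, weakened, or changed sign. A sketch that keeps that information, using three rows copied from the table above; `CORR_CHANGE` and `SIGN_FLIPPED` are hypothetical column names.

```python
import numpy as np
import pandas as pd

summary = pd.DataFrame(
    {
        "Y1_CORR": [0.015366, 0.230978, -0.146361],
        "Y2_CORR": [0.418606, -0.065145, -0.42009],
    },
    index=["Mentally Unhealthy Days", "% Minority (SVI)", "Partial Coverage"],
)

# change in strength: positive means the relationship got stronger in year 2
summary["CORR_CHANGE"] = summary["Y2_CORR"].abs() - summary["Y1_CORR"].abs()
# True where the direction of the relationship reversed between years
summary["SIGN_FLIPPED"] = np.sign(summary["Y1_CORR"]) != np.sign(summary["Y2_CORR"])
print(summary)
```

In these rows, Mentally Unhealthy Days and Partial Coverage both strengthened without changing direction, while % Minority (SVI) weakened and reversed sign.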
