# Exploration of Socioeconomic Influences on Cancer Mortality:
# Coefficients of Best Performing Machine Learning Model

Now that hyperparameter tuning has identified the best performing regressor (unscaled Ridge Regression with an alpha of 0.001), this notebook examines its regression coefficients in an effort to discover further significant contributors to cancer mortality that were not identified by the predictive features' Pearson's correlation coefficients in the Visual EDA notebook or by the Hypothesis Testing notebook.

In [1]:
import csv
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn
from sklearn import linear_model
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import MinMaxScaler

In [2]:
df = pd.read_csv('cancer_ml7.csv', index_col=['Geography'])

## Top Performing Regressor

The target variable is set as 'TARGET_deathRate', the cancer mortality rate per 100,000 people.

In [3]:
y = df['TARGET_deathRate']

The predictive feature set X is defined as the rest of the columns in the DataFrame.

In [4]:
target_name = ['TARGET_deathRate']
X = df[[cn for cn in df.columns if cn not in target_name]]

The best performing algorithm is Ridge Regression fit on unscaled data with the automatic solver and an alpha of 0.001. On the training set it has an R^2 of 0.6465 and a Root Mean Squared Error (RMSE) of 16.6. On the test set it has an R^2 of 0.6408 and an RMSE of 16.2.

In [5]:
lr_3 = linear_model.Ridge(alpha=0.001)
lr_3

Ridge(alpha=0.001, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, random_state=None, solver='auto', tol=0.001)

In [6]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [7]:
lr_3.fit(X_train, y_train)


Ridge(alpha=0.001, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, random_state=None, solver='auto', tol=0.001)

In [8]:
y_pred_3_train = lr_3.predict(X_train)
y_pred_3_train[0:20]

array([193.6814711 , 190.81999526, 216.23272617, 139.13920381,
       225.16635654, 195.66524864, 172.79393977, 170.42103655,
       196.48290334, 230.26460023, 197.25887577, 175.61009965,
       202.61717549, 187.70993108, 155.03822506, 188.09679189,
       201.12734623, 156.97725576, 174.21388088, 179.27723478])

In [9]:
print("Training Set R^2: {}".format(lr_3.score(X_train, y_train)))
rmse_3_train = np.sqrt(mean_squared_error(y_train, y_pred_3_train))
print("Training Set Root Mean Squared Error: {}".format(rmse_3_train))

Training Set R^2: 0.6465369736893312
Training Set Root Mean Squared Error: 16.58992192614389


In [10]:
y_pred_3_test = lr_3.predict(X_test)
y_pred_3_test[0:20]

array([177.35025576, 175.37003439, 162.28118045, 175.61565049,
       178.79949234, 195.82574314, 173.08925325, 164.09948603,
       174.86180399, 172.19905272, 177.03470199, 206.9132008 ,
       157.80007659, 157.37575802, 220.00185197, 108.9025308 ,
       188.29362207, 205.9232775 , 208.2225993 , 183.78457323])

In [11]:
print("Test Set R^2: {}".format(lr_3.score(X_test, y_test)))
rmse_3_test = np.sqrt(mean_squared_error(y_test, y_pred_3_test))
print("Test Set Root Mean Squared Error: {}".format(rmse_3_test))

Test Set R^2: 0.6407644965711742
Test Set Root Mean Squared Error: 16.196424394992476
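
The two metrics reported above can also be reproduced by hand, which makes explicit what they measure. The following is a minimal sketch, assuming NumPy only. The helper name `r2_and_rmse` and the toy values are hypothetical and are not taken from the cancer data set.

```python
import numpy as np

def r2_and_rmse(y_true, y_pred):
    """Return (R^2, RMSE) for two equal-length 1-D sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    ss_res = np.sum(residuals ** 2)                  # variation left unexplained
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
    r2 = 1.0 - ss_res / ss_tot
    rmse = np.sqrt(np.mean(residuals ** 2))
    return r2, rmse

# Hypothetical toy values, used only to exercise the helper.
toy_r2, toy_rmse = r2_and_rmse([3.0, -0.5, 2.0, 7.0], [2.5, 0.0, 2.0, 8.0])
print(toy_r2, toy_rmse)
```

RMSE is expressed in the units of the target (deaths per 100,000 people here), whereas R^2 is unitless, which is why the two are reported side by side.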


In [12]:
pd.set_option('display.max_rows', 350)

In [13]:
lr_3_coeffs = pd.Series(lr_3.coef_, index=X.columns) 
lr_3_coeffs

avgAnnCount                      -3.311788e-03
incidenceRate                     1.579581e-01
medIncome                         3.962400e-05
popEst2015                       -2.945975e-07
povertyPercent                   -2.999942e+00
studyPerCap                       1.610786e-04
MedianAge                         1.048151e+00
MedianAgeMale                    -5.903435e-01
MedianAgeFemale                   1.346875e+00
AvgHouseholdSize                 -6.493734e+00
PercentMarried                    2.234602e+00
PctNoHS18_24                     -1.135149e-02
PctHS18_24                        2.003942e-01
PctSomeCol18_24                  -2.494795e-01
PctBachDeg18_24                  -1.619434e-01
PctHS25_Over                      1.478055e+00
PctBachDeg25_Over                -4.930227e-01
PctEmployed16_Over                6.498003e+00
PctUnemployed16_Over              1.192618e-01
PctPrivateCoverage               -4.395087e+00
PctPrivateCoverageAlone           1.323289e-01
PctEmpPrivCov

The Ridge Regression coefficients often differ markedly from the Pearson's correlation coefficients, and frequently do not even share the same sign (positive or negative). The magnitudes are not directly comparable in any case, because a regression coefficient carries the units of its feature while a correlation coefficient is unitless. The sign disagreement is the more interesting finding: it suggests that the relationship between a predictive feature and the target within a whole model, where each coefficient is estimated alongside every other feature, can be quite different from the one-on-one relationship between that feature and the target.
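
One way such a sign disagreement can arise is when two features are strongly correlated with each other. The sketch below uses synthetic data (not the cancer data set) and a closed-form ridge solution written directly in NumPy. All variable names and the data-generating process are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
x1 = rng.normal(size=n)
x2 = x1 + 0.3 * rng.normal(size=n)      # x2 is strongly correlated with x1
y = 2.0 * x1 - 1.0 * x2 + 0.5 * rng.normal(size=n)

X = np.column_stack([x1, x2])

# Pairwise Pearson correlation of each feature with y.
pearson = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])

# Ridge solution in closed form on centered data
# (centering plays the role of fitting an intercept).
alpha = 0.001
Xc = X - X.mean(axis=0)
yc = y - y.mean()
ridge_coef = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)

print("Pearson r: ", pearson)
print("Ridge coef:", ridge_coef)
```

In this construction `x2` correlates positively with `y` on its own, because it mostly tracks `x1`, yet its coefficient in the joint model is negative. Whether the same mechanism explains the disagreements in the cancer feature set is not established by this sketch.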

The correlation coefficients between each predictive feature and cancer mortality are computed below with the .corrwith() method.

In [14]:
X_train_corrwith = X_train.corrwith(y_train)
X_train_corrwith

avgAnnCount                      -0.139998
incidenceRate                     0.429436
medIncome                        -0.432210
popEst2015                       -0.115986
povertyPercent                    0.435167
studyPerCap                      -0.024890
MedianAge                         0.001862
MedianAgeMale                    -0.017983
MedianAgeFemale                   0.019013
AvgHouseholdSize                 -0.041219
PercentMarried                   -0.263055
PctNoHS18_24                      0.090024
PctHS18_24                        0.276894
PctSomeCol18_24                  -0.104714
PctBachDeg18_24                  -0.287782
PctHS25_Over                      0.408759
PctBachDeg25_Over                -0.493734
PctEmployed16_Over               -0.413245
PctUnemployed16_Over              0.375829
PctPrivateCoverage               -0.394586
PctPrivateCoverageAlone          -0.341052
PctEmpPrivCoverage               -0.280062
PctPublicCoverage                 0.413010
PctPublicCo

The following cell returns the proportion of features whose correlation coefficient and Ridge Regression coefficient share the same sign, positive or negative. The proportion is 0.54, or roughly half of the feature set.

In [15]:
same_sign = ((X_train_corrwith >= 0) & (lr_3_coeffs >= 0) | (X_train_corrwith < 0) & (lr_3_coeffs < 0))
same_sign.sum()/len(same_sign)

0.5426829268292683

In [16]:
same_sign.value_counts()

True     178
False    150
dtype: int64
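
The sign comparison above can also be written with `np.sign`, as sketched here on four made-up features. The Series names and values are hypothetical. One difference from the notebook's expression is that `np.sign` returns 0 for an exactly-zero value, so a zero only agrees with another zero, whereas the notebook's version counts zero as non-negative.

```python
import numpy as np
import pandas as pd

# Hypothetical values for four made-up features.
corr = pd.Series({"a": 0.4, "b": -0.2, "c": 0.1, "d": -0.3})
coef = pd.Series({"d": -0.1, "c": -0.7, "b": 0.5, "a": 1.2})  # different order on purpose

# Align on feature name first, so the comparison never depends on row order.
corr_aligned, coef_aligned = corr.align(coef, join="inner")
agree = np.sign(corr_aligned) == np.sign(coef_aligned)

share = agree.mean()   # fraction of features whose signs agree
print(agree)
print(share)
```

Aligning explicitly is a defensive choice: it guards against the two Series being indexed in a different order or over slightly different feature sets.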

This difference between the two types of coefficients is explored more deeply below, to determine if there is something about the feature set that is causing this counterintuitive model behavior.

## Running the top performing regressor after removing logarithmic and exponential features

First, the logarithmic and squared (referred to here as exponential) transformations of features, which contributed to the model's performance, are removed, as the nonlinear statistical relationships they capture may be causing volatility in the higher dimensional feature set. The best performing Ridge Regression model is then re-run and the 'same_sign' object is recomputed to see whether there is any change in the proportion of features whose Pearson's correlation coefficient and Ridge Regression coefficient share the same sign.

In [17]:
log_exp_features = ['povertyPercent_log', 'povertyPercent_sqrd', 'MedianAge_log', 'MedianAgeFemale_sqrd', 
                    'AvgHouseholdSize_log', 'PercentMarried_log', 'PercentMarried_sqrd', 'PctSomeCol18_24_log', 
                    'PctSomeCol18_24_sqrd', 'PctHS25_Over_sqrd', 'PctBachDeg25_Over_log', 
                    'PctEmployed16_Over_log', 'PctEmployed16_Over_sqrd', 'PctPrivateCoverage_log', 
                    'PctEmpPrivCoverage_log', 'PctPublicCoverage_log', 'PctPublicCoverageAlone_log', 
                    'PctPublicCoverageAlone_sqrd', 'PctWhite_sqrd', 'PctBlack_sqrd', 'INTPTLONG_sqrd', 
                    'mskcc_l1_log', 'mayo_l1_log', 'mayo_l1_sqrd', 'dfb_l1_log', 'dfb_l1_sqrd', 
                    'cleveland_l1_log', 'cleveland_l1_sqrd', 'upmcps_l1_log', 'mgs_l1_log', 'atlanta_l1_log', 
                    'denver_l1_sqrd', 'los_ang_l1_sqrd', 'seattle_l1_log', 'hopkins_l2_log', 'dfb_l2_log', 
                    'cleveland_l2_log', 'upmcps_l2_log', 'mgs_l2_log', 'atlanta_l2_log', 'city_min_distsl1_sqrd', 
                    'sc_min_dists_l1_log', 'PCT_LACCESS_CHILD10_sqrd', 'PCT_LACCESS_HHNV10_sqrd', 
                    'PC_DIRSALES07_sqrd', 'FMRKT13_sqrd', 'PCH_FMRKT_09_13_sqrd', 'PCT_OBESE_ADULTS13_log', 
                    'PCT_OBESE_ADULTS13_sqrd', 'CHILDPOVRATE10_log']

In [18]:
df_no_log_exp = df.drop(columns = log_exp_features)
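
The explicit list above could also be derived from the column names. The sketch below assumes that every transformed column ends in `_log` or `_sqrd` and that no original column does, which the listed names suggest but which should be checked against the real DataFrame. The miniature frame `demo` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical miniature frame that reuses a few of the notebook's column names.
demo = pd.DataFrame({
    "TARGET_deathRate": [180.0, 165.0],
    "povertyPercent": [15.2, 11.8],
    "povertyPercent_log": [2.72, 2.47],
    "povertyPercent_sqrd": [231.04, 139.24],
    "MedianAge": [41.0, 38.5],
})

# str.endswith accepts a tuple of suffixes.
transformed_suffixes = ("_log", "_sqrd")
derived_cols = [c for c in demo.columns if c.endswith(transformed_suffixes)]
demo_no_log_exp = demo.drop(columns=derived_cols)

print(derived_cols)
print(list(demo_no_log_exp.columns))
```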

In [19]:
y_no_log_exp = df_no_log_exp['TARGET_deathRate']

The predictive feature set X is defined as the rest of the columns in the DataFrame.

In [20]:
target_name = ['TARGET_deathRate']
X_no_log_exp = df_no_log_exp[[cn for cn in df_no_log_exp.columns if cn not in target_name]]

In [21]:
lr_3_no_log_exp = linear_model.Ridge(alpha=0.001)
lr_3_no_log_exp

Ridge(alpha=0.001, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, random_state=None, solver='auto', tol=0.001)

In [22]:
X_train_no_log_exp, X_test_no_log_exp, y_train_no_log_exp, y_test_no_log_exp = train_test_split(X_no_log_exp, y_no_log_exp, test_size=0.2, random_state=42)

In [23]:
lr_3_no_log_exp.fit(X_train_no_log_exp, y_train_no_log_exp)


Ridge(alpha=0.001, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, random_state=None, solver='auto', tol=0.001)

In [24]:
y_pred_3_train_no_log_exp = lr_3_no_log_exp.predict(X_train_no_log_exp)
y_pred_3_train_no_log_exp[0:20]

array([193.1354574 , 189.47407918, 212.79050444, 137.93672939,
       222.25173963, 195.78307285, 175.23207322, 169.86445022,
       199.02298553, 233.44128926, 205.14056248, 177.36911884,
       204.47908468, 189.90139084, 148.67438434, 187.29930083,
       200.86680266, 155.70901687, 178.42238663, 175.65795097])

In [25]:
print("Training Set R^2: {}".format(lr_3_no_log_exp.score(X_train_no_log_exp, y_train_no_log_exp)))
rmse_3_train_no_log_exp = np.sqrt(mean_squared_error(y_train_no_log_exp, y_pred_3_train_no_log_exp))
print("Training Set Root Mean Squared Error: {}".format(rmse_3_train_no_log_exp))

Training Set R^2: 0.6287484338824065
Training Set Root Mean Squared Error: 17.002253754655197


In [26]:
y_pred_3_test_no_log_exp = lr_3_no_log_exp.predict(X_test_no_log_exp)
y_pred_3_test_no_log_exp[0:20]

array([180.61053715, 171.1252763 , 163.65259791, 171.67521823,
       177.44061123, 193.77795347, 172.86294423, 168.37960704,
       172.93836371, 172.84704751, 173.18651543, 204.29125291,
       156.82398779, 158.26004046, 220.13155275, 114.9226515 ,
       191.53689625, 205.50567657, 203.73335524, 182.28232812])

In [27]:
print("Test Set R^2: {}".format(lr_3_no_log_exp.score(X_test_no_log_exp, y_test_no_log_exp)))
rmse_3_test_no_log_exp = np.sqrt(mean_squared_error(y_test_no_log_exp, y_pred_3_test_no_log_exp))
print("Test Set Root Mean Squared Error: {}".format(rmse_3_test_no_log_exp))

Test Set R^2: 0.5944108066503957
Test Set Root Mean Squared Error: 17.20967661915464


In [28]:
lr_3_coeffs_no_log_exp = pd.Series(lr_3_no_log_exp.coef_, index=X_no_log_exp.columns) 
lr_3_coeffs_no_log_exp

avgAnnCount                      -3.276022e-03
incidenceRate                     1.574450e-01
medIncome                         5.252100e-05
popEst2015                        3.553815e-06
povertyPercent                   -3.821746e-01
studyPerCap                       7.936552e-05
MedianAge                         2.003663e-01
MedianAgeMale                    -5.520945e-01
MedianAgeFemale                  -2.177048e-01
AvgHouseholdSize                  2.464381e-01
PercentMarried                    2.636885e-01
PctNoHS18_24                      6.359269e-03
PctHS18_24                        2.198701e-01
PctSomeCol18_24                   6.376414e-02
PctBachDeg18_24                  -2.460465e-01
PctHS25_Over                      2.122190e-01
PctBachDeg25_Over                -4.779373e-01
PctEmployed16_Over               -1.594904e-01
PctUnemployed16_Over              8.542233e-02
PctPrivateCoverage                5.623283e-02
PctPrivateCoverageAlone           8.815375e-02
PctEmpPrivCov

In [29]:
X_train_no_log_exp_corrwith = X_train_no_log_exp.corrwith(y_train_no_log_exp)
X_train_no_log_exp_corrwith

avgAnnCount                      -0.139998
incidenceRate                     0.429436
medIncome                        -0.432210
popEst2015                       -0.115986
povertyPercent                    0.435167
studyPerCap                      -0.024890
MedianAge                         0.001862
MedianAgeMale                    -0.017983
MedianAgeFemale                   0.019013
AvgHouseholdSize                 -0.041219
PercentMarried                   -0.263055
PctNoHS18_24                      0.090024
PctHS18_24                        0.276894
PctSomeCol18_24                  -0.104714
PctBachDeg18_24                  -0.287782
PctHS25_Over                      0.408759
PctBachDeg25_Over                -0.493734
PctEmployed16_Over               -0.413245
PctUnemployed16_Over              0.375829
PctPrivateCoverage               -0.394586
PctPrivateCoverageAlone          -0.341052
PctEmpPrivCoverage               -0.280062
PctPublicCoverage                 0.413010
PctPublicCo

In [30]:
same_sign_no_log_exp = ((X_train_no_log_exp_corrwith >= 0) & (lr_3_coeffs_no_log_exp >= 0) | (X_train_no_log_exp_corrwith < 0) & (lr_3_coeffs_no_log_exp < 0))
same_sign_no_log_exp.sum()/len(same_sign_no_log_exp)

0.5251798561151079

In [31]:
same_sign_no_log_exp.value_counts()

True     146
False    132
dtype: int64

The 'same_sign' proportion actually drops slightly, from 0.543 to 0.525, with the removal of the logarithmic and exponential features, while the test set R^2 falls from 0.641 to 0.594.

## Running the top performing regressor after keeping only the features with the strongest correlations with cancer mortality

The Ridge Regression algorithm is re-run using only the features with the strongest correlations with the target feature. To do this, a Boolean mask is created that assigns 'True' to those features whose Pearson's correlation coefficient has an absolute value greater than 0.3.

In [32]:
is_strong_feature = X_train_corrwith.abs() > 0.3
is_strong_feature.head()

avgAnnCount       False
incidenceRate      True
medIncome          True
popEst2015        False
povertyPercent     True
dtype: bool

In [33]:
strong_feature_names = X_train.columns[is_strong_feature]
strong_feature_names

Index(['incidenceRate', 'medIncome', 'povertyPercent', 'PctHS25_Over',
       'PctBachDeg25_Over', 'PctEmployed16_Over', 'PctUnemployed16_Over',
       'PctPrivateCoverage', 'PctPrivateCoverageAlone', 'PctPublicCoverage',
       'PctPublicCoverageAlone', 'hlmcc_l1', 'atlanta_l1', 'seattle_l1',
       'cleveland_l2', 'upmcps_l2', 'hlmcc_l2', 'atlanta_l2', 'los_ang_l2',
       'seattle_l2', 'san_fran_l2', 'PCT_LACCESS_HHNV10',
       'PCT_DIABETES_ADULTS09', 'PCT_DIABETES_ADULTS10', 'PCT_OBESE_ADULTS09',
       'PCT_OBESE_ADULTS10', 'PCT_OBESE_ADULTS13', 'CHILDPOVRATE10',
       'PERCHLDPOV10', 'povertyPercent_log', 'povertyPercent_sqrd',
       'PctHS25_Over_sqrd', 'PctBachDeg25_Over_log', 'PctEmployed16_Over_log',
       'PctEmployed16_Over_sqrd', 'PctPrivateCoverage_log',
       'PctPublicCoverage_log', 'PctPublicCoverageAlone_log',
       'PctPublicCoverageAlone_sqrd', 'atlanta_l1_log', 'cleveland_l2_log',
       'atlanta_l2_log', 'PCT_OBESE_ADULTS13_log', 'PCT_OBESE_ADULTS13_sqrd',


In [34]:
strong_feature_names_list = list(strong_feature_names)
strong_feature_names_list

['incidenceRate',
 'medIncome',
 'povertyPercent',
 'PctHS25_Over',
 'PctBachDeg25_Over',
 'PctEmployed16_Over',
 'PctUnemployed16_Over',
 'PctPrivateCoverage',
 'PctPrivateCoverageAlone',
 'PctPublicCoverage',
 'PctPublicCoverageAlone',
 'hlmcc_l1',
 'atlanta_l1',
 'seattle_l1',
 'cleveland_l2',
 'upmcps_l2',
 'hlmcc_l2',
 'atlanta_l2',
 'los_ang_l2',
 'seattle_l2',
 'san_fran_l2',
 'PCT_LACCESS_HHNV10',
 'PCT_DIABETES_ADULTS09',
 'PCT_DIABETES_ADULTS10',
 'PCT_OBESE_ADULTS09',
 'PCT_OBESE_ADULTS10',
 'PCT_OBESE_ADULTS13',
 'CHILDPOVRATE10',
 'PERCHLDPOV10',
 'povertyPercent_log',
 'povertyPercent_sqrd',
 'PctHS25_Over_sqrd',
 'PctBachDeg25_Over_log',
 'PctEmployed16_Over_log',
 'PctEmployed16_Over_sqrd',
 'PctPrivateCoverage_log',
 'PctPublicCoverage_log',
 'PctPublicCoverageAlone_log',
 'PctPublicCoverageAlone_sqrd',
 'atlanta_l1_log',
 'cleveland_l2_log',
 'atlanta_l2_log',
 'PCT_OBESE_ADULTS13_log',
 'PCT_OBESE_ADULTS13_sqrd',
 'CHILDPOVRATE10_log']

In [35]:
df_weak = df.drop(columns = strong_feature_names_list)

In [36]:
df_weak.columns

Index(['TARGET_deathRate', 'avgAnnCount', 'popEst2015', 'studyPerCap',
       'MedianAge', 'MedianAgeMale', 'MedianAgeFemale', 'AvgHouseholdSize',
       'PercentMarried', 'PctNoHS18_24',
       ...
       'dfb_l2_log', 'upmcps_l2_log', 'mgs_l2_log', 'city_min_distsl1_sqrd',
       'sc_min_dists_l1_log', 'PCT_LACCESS_CHILD10_sqrd',
       'PCT_LACCESS_HHNV10_sqrd', 'PC_DIRSALES07_sqrd', 'FMRKT13_sqrd',
       'PCH_FMRKT_09_13_sqrd'],
      dtype='object', length=284)

In [37]:
df_weak = df_weak.drop(columns = 'TARGET_deathRate')

In [38]:
weak_columns_list = list(df_weak.columns)

In [39]:
df_strong = df.drop(columns = weak_columns_list)
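
The drop-the-strong, then drop-the-weak sequence above reaches the strong subset in two passes. A more direct route is to select columns with the Boolean mask itself, as sketched below. The toy data and column names are hypothetical placeholders, and the sketch selects from the feature matrix only, so the target would still need to be kept separately.

```python
import pandas as pd

# Hypothetical toy data; the column names are placeholders.
toy_y = pd.Series([1, 2, 3, 4, 5, 6], name="target")
toy_X = pd.DataFrame({
    "rises": [1, 2, 3, 4, 5, 6],        # r = +1.0
    "flips": [1, -1, 1, -1, 1, -1],     # r is about -0.29
    "falls": [6, 5, 4, 3, 2, 1],        # r = -1.0
})

toy_corr = toy_X.corrwith(toy_y)
toy_mask = toy_corr.abs() > 0.3          # same threshold as in the notebook
toy_X_strong = toy_X.loc[:, toy_mask]    # Boolean column selection

print(list(toy_X_strong.columns))
```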

In [40]:
df_strong.info(verbose=True)

<class 'pandas.core.frame.DataFrame'>
Index: 3047 entries, Abbeville County, South Carolina to Zavala County, Texas
Data columns (total 46 columns):
TARGET_deathRate               3047 non-null float64
incidenceRate                  3047 non-null float64
medIncome                      3047 non-null int64
povertyPercent                 3047 non-null float64
PctHS25_Over                   3047 non-null float64
PctBachDeg25_Over              3047 non-null float64
PctEmployed16_Over             3047 non-null float64
PctUnemployed16_Over           3047 non-null float64
PctPrivateCoverage             3047 non-null float64
PctPrivateCoverageAlone        3047 non-null float64
PctPublicCoverage              3047 non-null float64
PctPublicCoverageAlone         3047 non-null float64
hlmcc_l1                       3047 non-null float64
atlanta_l1                     3047 non-null float64
seattle_l1                     3047 non-null float64
cleveland_l2                   3047 non-null float64
upmcp

In [41]:
y_strong = df_strong['TARGET_deathRate']

The predictive feature set X is defined as the rest of the columns in the DataFrame.

In [42]:
target_name = ['TARGET_deathRate']
X_strong = df_strong[[cn for cn in df_strong.columns if cn not in target_name]]

In [43]:
lr_3_strong = linear_model.Ridge(alpha=0.001)
lr_3_strong

Ridge(alpha=0.001, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, random_state=None, solver='auto', tol=0.001)

In [44]:
X_train_strong, X_test_strong, y_train_strong, y_test_strong = train_test_split(X_strong, y_strong, test_size=0.2, random_state=42)

In [45]:
lr_3_strong.fit(X_train_strong, y_train_strong)

Ridge(alpha=0.001, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, random_state=None, solver='auto', tol=0.001)

In [46]:
y_pred_3_train_strong = lr_3_strong.predict(X_train_strong)
y_pred_3_train_strong[0:20]

array([200.86521904, 193.77917402, 204.06070836, 144.87941304,
       224.44338897, 194.9364148 , 176.24435359, 157.65484388,
       185.00606335, 227.57661833, 206.80605854, 173.39462449,
       196.20679521, 194.33243619, 163.80309224, 178.42196292,
       203.16755069, 162.235815  , 167.70482752, 183.1082433 ])

In [47]:
print("Training Set R^2: {}".format(lr_3_strong.score(X_train_strong, y_train_strong)))
rmse_3_train_strong = np.sqrt(mean_squared_error(y_train_strong, y_pred_3_train_strong))
print("Training Set Root Mean Squared Error: {}".format(rmse_3_train_strong))

Training Set R^2: 0.5624285643339696
Training Set Root Mean Squared Error: 18.4585179557503


In [48]:
y_pred_3_test_strong = lr_3_strong.predict(X_test_strong)
y_pred_3_test_strong[0:20]

array([190.87409952, 176.66895925, 162.0335521 , 170.64802007,
       169.19942938, 195.31789543, 170.88186319, 169.17013045,
       166.32963449, 167.93619522, 190.13936157, 195.40973955,
       171.6354447 , 157.5461496 , 222.16316141, 115.3633002 ,
       189.87085075, 208.46775794, 202.12178461, 186.32402645])

In [49]:
print("Test Set R^2: {}".format(lr_3_strong.score(X_test_strong, y_test_strong)))
rmse_3_test_strong = np.sqrt(mean_squared_error(y_test_strong, y_pred_3_test_strong))
print("Test Set Root Mean Squared Error: {}".format(rmse_3_test_strong))

Test Set R^2: 0.5991849598412256
Test Set Root Mean Squared Error: 17.10809003019704


In [50]:
lr_3_coeffs_strong = pd.Series(lr_3_strong.coef_, index=X_strong.columns) 
lr_3_coeffs_strong

incidenceRate                    0.176537
medIncome                       -0.000041
povertyPercent                  -3.025795
PctHS25_Over                     1.397358
PctBachDeg25_Over                0.352229
PctEmployed16_Over               2.918746
PctUnemployed16_Over             0.099523
PctPrivateCoverage              -4.090137
PctPrivateCoverageAlone          0.122191
PctPublicCoverage               -0.786887
PctPublicCoverageAlone          -1.076725
hlmcc_l1                         0.579080
atlanta_l1                      -1.275515
seattle_l1                       0.469379
cleveland_l2                     1.424848
upmcps_l2                        1.351819
hlmcc_l2                         2.987665
atlanta_l2                      -5.569238
los_ang_l2                       3.893287
seattle_l2                      -0.286151
san_fran_l2                     -2.900157
PCT_LACCESS_HHNV10               0.201656
PCT_DIABETES_ADULTS09            1.176338
PCT_DIABETES_ADULTS10           -0

In [51]:
X_train_strong_corrwith = X_train_strong.corrwith(y_train_strong)
X_train_strong_corrwith

incidenceRate                  0.429436
medIncome                     -0.432210
povertyPercent                 0.435167
PctHS25_Over                   0.408759
PctBachDeg25_Over             -0.493734
PctEmployed16_Over            -0.413245
PctUnemployed16_Over           0.375829
PctPrivateCoverage            -0.394586
PctPrivateCoverageAlone       -0.341052
PctPublicCoverage              0.413010
PctPublicCoverageAlone         0.456392
hlmcc_l1                      -0.340472
atlanta_l1                    -0.357372
seattle_l1                     0.354138
cleveland_l2                  -0.300735
upmcps_l2                     -0.304242
hlmcc_l2                      -0.337991
atlanta_l2                    -0.354811
los_ang_l2                     0.308879
seattle_l2                     0.325172
san_fran_l2                    0.316731
PCT_LACCESS_HHNV10             0.339486
PCT_DIABETES_ADULTS09          0.535441
PCT_DIABETES_ADULTS10          0.539451
PCT_OBESE_ADULTS09             0.517969


In [52]:
same_sign_strong = ((X_train_strong_corrwith >= 0) & (lr_3_coeffs_strong >= 0) | (X_train_strong_corrwith < 0) & (lr_3_coeffs_strong < 0))
same_sign_strong.sum()/len(same_sign_strong)

0.5555555555555556

In [53]:
same_sign_strong.value_counts()

True     25
False    20
dtype: int64

The 'same_sign' proportion improves by only about one percentage point (from 0.543 to 0.556) after keeping only the features whose correlation coefficient has an absolute value greater than 0.3.

## Full DataFrame using top performing regressor and MinMax scaler

Next, the full DataFrame with all features is scaled using the MinMax Scaler to see if this changes the 'same_sign' proportion.

In [54]:
scaler = MinMaxScaler()

In [55]:
df_scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns, index=df.index)
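
With its default feature range of 0 to 1, MinMaxScaler rescales each column as (x - min) / (max - min). The sketch below applies that formula by hand to synthetic data (the names and values are hypothetical) and checks that Pearson's r is unchanged by it.

```python
import numpy as np

rng = np.random.default_rng(1)
feature = rng.normal(loc=50.0, scale=10.0, size=200)       # hypothetical raw feature
target = 0.5 * feature + rng.normal(scale=5.0, size=200)   # hypothetical target

def min_max(values):
    """Rescale a 1-D array linearly onto the interval [0, 1]."""
    return (values - values.min()) / (values.max() - values.min())

feature_scaled = min_max(feature)
target_scaled = min_max(target)

r_raw = np.corrcoef(feature, target)[0, 1]
r_scaled = np.corrcoef(feature_scaled, target_scaled)[0, 1]
print(r_raw, r_scaled)
```

Because Pearson's correlation is invariant to this kind of positive linear rescaling, any change in the 'same_sign' proportion after scaling has to come from the Ridge Regression coefficients rather than from the correlation coefficients.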


In [56]:
df_scaled.head()

Geography,TARGET_deathRate,avgAnnCount,incidenceRate,medIncome,popEst2015,povertyPercent,studyPerCap,MedianAge,MedianAgeMale,MedianAgeFemale,...,city_min_distsl1_sqrd,sc_min_dists_l1_log,PCT_LACCESS_CHILD10_sqrd,PCT_LACCESS_HHNV10_sqrd,PC_DIRSALES07_sqrd,FMRKT13_sqrd,PCH_FMRKT_09_13_sqrd,PCT_OBESE_ADULTS13_log,PCT_OBESE_ADULTS13_sqrd,CHILDPOVRATE10_log
"Abbeville County, South Carolina",0.409106,0.003592,0.228321,0.125103,0.00237,0.411765,0.0,0.488372,0.432624,0.520737,...,0.001672,0.580436,0.042716,0.009878,0.000438,0.000233,0.0,0.796025,0.708192,0.733398
"Acadia Parish, Louisiana",0.56351,0.008311,0.289777,0.171164,0.006072,0.425339,0.0,0.311628,0.29078,0.343318,...,0.013547,0.494716,0.00021,0.000872,0.001855,0.0,0.0,0.882545,0.824751,0.767657
"Accomack County, Virginia",0.516331,0.005637,0.276551,0.15292,0.003161,0.366516,0.0,0.534884,0.479905,0.576037,...,0.006335,0.680184,0.000447,0.016028,9.3e-05,0.000233,0.005102,0.489515,0.367651,0.757758
"Ada County, Idaho",0.3032,0.045905,0.266209,0.342424,0.042616,0.190045,0.042464,0.313953,0.297872,0.329493,...,0.030102,0.632221,0.021139,9.1e-05,0.000165,0.005827,6.3e-05,0.658801,0.542797,0.572433
"Adair County, Iowa",0.39327,0.00118,0.238067,0.248323,0.000629,0.160633,0.014172,0.548837,0.534279,0.585253,...,0.015704,0.598702,0.002836,0.00095,0.001519,0.000233,0.0,0.770602,0.675815,0.529909


In [57]:
y_scaled = df_scaled['TARGET_deathRate']

In [58]:
X_scaled = df_scaled[[cn for cn in df_scaled.columns if cn not in target_name]]

In [59]:
X_train_scaled, X_test_scaled, y_train_scaled, y_test_scaled = train_test_split(X_scaled, y_scaled, test_size=0.2, random_state=42)

In [60]:
lr_3.fit(X_train_scaled, y_train_scaled)

Ridge(alpha=0.001, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, random_state=None, solver='auto', tol=0.001)

In [61]:
y_pred_3_train_scaled = lr_3.predict(X_train_scaled)
y_pred_3_train_scaled[0:20]

array([0.44493762, 0.43109871, 0.51299069, 0.26711805, 0.54446537,
       0.44679615, 0.37691202, 0.36009024, 0.44867764, 0.55845893,
       0.46169088, 0.38143237, 0.47327573, 0.42451011, 0.31053857,
       0.4234604 , 0.4672356 , 0.32277313, 0.37531404, 0.3891214 ])

In [62]:
print("Training Set R^2: {}".format(lr_3.score(X_train_scaled, y_train_scaled)))
rmse_3_train_scaled = np.sqrt(mean_squared_error(y_train_scaled, y_pred_3_train_scaled))
print("Training Set Root Mean Squared Error: {}".format(rmse_3_train_scaled))

Training Set R^2: 0.6443999375373357
Training Set Root Mean Squared Error: 0.054899365404627955


In [63]:
y_pred_3_test_scaled = lr_3.predict(X_test_scaled)
y_pred_3_test_scaled[0:20]

array([0.38988462, 0.38239639, 0.33830041, 0.38303357, 0.3921521 ,
       0.45145717, 0.37606612, 0.34454399, 0.38523108, 0.372456  ,
       0.38591119, 0.48510493, 0.32386365, 0.31947295, 0.53130749,
       0.17173156, 0.42484876, 0.48238201, 0.49212788, 0.40509016])

In [64]:
print("Test Set R^2: {}".format(lr_3.score(X_test_scaled, y_test_scaled)))
rmse_3_test_scaled = np.sqrt(mean_squared_error(y_test_scaled, y_pred_3_test_scaled))
print("Test Set Root Mean Squared Error: {}".format(rmse_3_test_scaled))

Test Set R^2: 0.6305999938088809
Test Set Root Mean Squared Error: 0.05418661665592595


In [65]:
lr_3_coeffs_scaled = pd.Series(lr_3.coef_, index=X_scaled.columns) 
lr_3_coeffs_scaled

avgAnnCount                      -0.412942
incidenceRate                     0.526562
medIncome                         0.025883
popEst2015                        0.024980
povertyPercent                   -0.467214
studyPerCap                       0.005271
MedianAge                         0.118236
MedianAgeMale                    -0.079948
MedianAgeFemale                   0.176237
AvgHouseholdSize                 -0.031973
PercentMarried                    0.323118
PctNoHS18_24                     -0.003259
PctHS18_24                        0.047229
PctSomeCol18_24                  -0.024950
PctBachDeg18_24                  -0.028001
PctHS25_Over                      0.231353
PctBachDeg25_Over                -0.064317
PctEmployed16_Over                1.027827
PctUnemployed16_Over              0.012912
PctPrivateCoverage               -0.954980
PctPrivateCoverageAlone           0.026920
PctEmpPrivCoverage                0.181727
PctPublicCoverage                -0.211387
PctPublicCo

In [66]:
X_train_corrwith_scaled = X_train_scaled.corrwith(y_train_scaled)

In [67]:
same_sign_scaled = ((X_train_corrwith_scaled >= 0) & (lr_3_coeffs_scaled >= 0) | (X_train_corrwith_scaled < 0) & (lr_3_coeffs_scaled < 0))
same_sign_scaled.sum()/len(same_sign_scaled)

0.5213414634146342

In [68]:
same_sign_scaled.value_counts()

True     171
False    157
dtype: int64

The 'same_sign' proportion decreases slightly, from 0.543 to 0.521, showing that MinMax scaling does not change this proportion significantly.

## Full DataFrame with Normalized Ridge Regression

Next, Ridge Regression is fit with its normalize parameter set to True.

In [69]:
lr_3n = linear_model.Ridge(alpha=0.001, normalize = True)
lr_3n

Ridge(alpha=0.001, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=True, random_state=None, solver='auto', tol=0.001)
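
The `normalize` argument used above was deprecated in scikit-learn 1.0 and removed in 1.2, so this cell only runs on older releases. In the releases that accepted it, the documentation described `normalize=True` as subtracting each column's mean and dividing by its l2 norm before fitting. The sketch below reproduces that preprocessing by hand with NumPy on synthetic data. It is an illustration under those assumptions, not scikit-learn's own code, and every name in it is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical features on very different scales.
X_demo = rng.normal(size=(100, 3)) * np.array([1.0, 50.0, 0.01])
y_demo = X_demo @ np.array([2.0, -0.03, 40.0]) + rng.normal(size=100)

alpha = 0.001

# Center each column, then divide it by its l2 norm.
X_mean = X_demo.mean(axis=0)
X_centered = X_demo - X_mean
col_norms = np.linalg.norm(X_centered, axis=0)
X_normed = X_centered / col_norms
y_centered = y_demo - y_demo.mean()

# Ridge in closed form on the normalized columns.
w_normed = np.linalg.solve(X_normed.T @ X_normed + alpha * np.eye(3),
                           X_normed.T @ y_centered)

# Convert the coefficients back to the original feature units.
coef_original_units = w_normed / col_norms
intercept = y_demo.mean() - X_mean @ coef_original_units
print(coef_original_units, intercept)
```

Because the penalty is applied to the normalized columns, the same alpha penalizes each feature differently than it does in the unnormalized model, which is one reason the two sets of coefficients differ.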

In [70]:
X_train_n, X_test_n, y_train_n, y_test_n = train_test_split(X, y, test_size=0.2, random_state=42)

In [71]:
lr_3n.fit(X_train_n, y_train_n)

Ridge(alpha=0.001, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=True, random_state=None, solver='auto', tol=0.001)

In [72]:
y_pred_3n_train = lr_3n.predict(X_train_n)
y_pred_3n_train[0:20]

array([193.52076956, 190.00029232, 213.90371792, 140.56854298,
       224.65419743, 195.58593097, 175.70206504, 169.70481956,
       194.99969583, 225.98053407, 203.1442749 , 177.49882336,
       203.24392608, 188.89962477, 151.39027409, 188.64776982,
       201.29350085, 155.69004109, 172.0786913 , 175.65970383])

In [73]:
print("Training Set R^2: {}".format(lr_3n.score(X_train_n, y_train_n)))
rmse_3n_train = np.sqrt(mean_squared_error(y_train_n, y_pred_3n_train))
print("Training Set Root Mean Squared Error: {}".format(rmse_3n_train))

Training Set R^2: 0.6391005116938605
Training Set Root Mean Squared Error: 16.76353013876298


In [74]:
y_pred_3n_test = lr_3n.predict(X_test_n)
y_pred_3n_test[0:20]

array([177.56281272, 174.82601959, 160.78204496, 176.20152971,
       178.64130635, 196.50829496, 172.58157318, 165.22272956,
       176.15359247, 172.07162774, 175.67700837, 205.92641467,
       158.40841898, 155.76765757, 220.92569225, 113.14000725,
       188.61185021, 206.78276153, 208.94959069, 181.93953815])

In [75]:
print("Test Set R^2: {}".format(lr_3n.score(X_test_n, y_test_n)))
rmse_3n_test = np.sqrt(mean_squared_error(y_test_n, y_pred_3n_test))
print("Test Set Root Mean Squared Error: {}".format(rmse_3n_test))

Test Set R^2: 0.613717280323078
Test Set Root Mean Squared Error: 16.795083310509035


In [76]:
lr_3n_coeffs = pd.Series(lr_3n.coef_, index=X.columns) 
lr_3n_coeffs

avgAnnCount                      -3.185257e-03
incidenceRate                     1.588717e-01
medIncome                         6.086941e-05
popEst2015                       -5.386750e-08
povertyPercent                   -1.186347e+00
studyPerCap                       1.854982e-04
MedianAge                        -1.917716e-01
MedianAgeMale                    -4.936142e-01
MedianAgeFemale                   3.834989e-01
AvgHouseholdSize                 -1.382834e+01
PercentMarried                    4.241531e-01
PctNoHS18_24                     -1.313290e-02
PctHS18_24                        2.106072e-01
PctSomeCol18_24                   1.408073e-01
PctBachDeg18_24                  -2.069110e-01
PctHS25_Over                      1.215832e+00
PctBachDeg25_Over                -6.908604e-01
PctEmployed16_Over                5.519394e-01
PctUnemployed16_Over              1.270967e-01
PctPrivateCoverage               -2.135922e+00
PctPrivateCoverageAlone           9.591723e-02
PctEmpPrivCov

In [77]:
X_train_corrwith_n = X_train_n.corrwith(y_train_n)
X_train_corrwith_n

avgAnnCount                      -0.139998
incidenceRate                     0.429436
medIncome                        -0.432210
popEst2015                       -0.115986
povertyPercent                    0.435167
studyPerCap                      -0.024890
MedianAge                         0.001862
MedianAgeMale                    -0.017983
MedianAgeFemale                   0.019013
AvgHouseholdSize                 -0.041219
PercentMarried                   -0.263055
PctNoHS18_24                      0.090024
PctHS18_24                        0.276894
PctSomeCol18_24                  -0.104714
PctBachDeg18_24                  -0.287782
PctHS25_Over                      0.408759
PctBachDeg25_Over                -0.493734
PctEmployed16_Over               -0.413245
PctUnemployed16_Over              0.375829
PctPrivateCoverage               -0.394586
PctPrivateCoverageAlone          -0.341052
PctEmpPrivCoverage               -0.280062
PctPublicCoverage                 0.413010
PctPublicCo

In [78]:
same_sign_n = ((X_train_corrwith_n >= 0) & (lr_3n_coeffs >= 0) | (X_train_corrwith_n < 0) & (lr_3n_coeffs < 0))
same_sign_n.sum()/len(same_sign_n)

0.5274390243902439

In [79]:
same_sign_n.value_counts()

True     173
False    155
dtype: int64

Again, the 'same_sign' proportion does not change much.
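
The sign comparison used in these cells can be wrapped in a small helper so that each model variant is checked the same way. The sketch below uses made-up values; `sign_agreement` and the toy Series are hypothetical names. It treats zero as non-negative, as the cells above do, and counts a feature as disagreeing when either value is missing.

```python
import pandas as pd

def sign_agreement(corr, coefs):
    """Boolean Series: True where a correlation and a coefficient share a sign."""
    corr, coefs = corr.align(coefs, join="inner")   # match features by label
    valid = corr.notna() & coefs.notna()            # NaN on either side counts as False
    return ((corr >= 0) == (coefs >= 0)) & valid

# Toy correlations and regression coefficients for four hypothetical features
corr_demo = pd.Series({"a": 0.4, "b": -0.2, "c": 0.1, "d": -0.3})
coefs_demo = pd.Series({"a": 1.5, "b": 0.7, "c": -2.0, "d": -0.1})

same_demo = sign_agreement(corr_demo, coefs_demo)
proportion_demo = same_demo.mean()   # share of features whose signs agree
```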

## Full DataFrame with svd Ridge Regression

Next, two different solvers of the Ridge Regression algorithm are tried to see if they make any difference in the 'same_sign' proportion. First, the 'svd' solver is used.

In [80]:
lr_3svd = linear_model.Ridge(alpha=0.001, solver = 'svd')
lr_3svd

Ridge(alpha=0.001, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, random_state=None, solver='svd', tol=0.001)

In [81]:
X_train_svd, X_test_svd, y_train_svd, y_test_svd = train_test_split(X, y, test_size=0.2, random_state=42)

In [82]:
lr_3svd.fit(X_train_svd, y_train_svd)

Ridge(alpha=0.001, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, random_state=None, solver='svd', tol=0.001)

In [83]:
y_pred_3svd_train = lr_3svd.predict(X_train_svd)
y_pred_3svd_train[0:20]

array([193.68147108, 190.81999526, 216.23272616, 139.13920382,
       225.16635656, 195.66524868, 172.79393979, 170.42103656,
       196.48290337, 230.26460023, 197.25887576, 175.61009964,
       202.61717548, 187.70993108, 155.03822504, 188.09679189,
       201.12734623, 156.97725576, 174.21388092, 179.27723481])

In [84]:
print("Training Set R^2: {}".format(lr_3svd.score(X_train_svd, y_train_svd)))
rmse_3svd_train = np.sqrt(mean_squared_error(y_train_svd, y_pred_3svd_train))
print("Training Set Root Mean Squared Error: {}".format(rmse_3svd_train))

Training Set R^2: 0.6465369737006035
Training Set Root Mean Squared Error: 16.589921925879356


In [85]:
y_pred_3svd_test = lr_3svd.predict(X_test_svd)
y_pred_3svd_test[0:20]

array([177.35025577, 175.37003436, 162.28118044, 175.61565048,
       178.79949234, 195.82574314, 173.08925326, 164.09948605,
       174.861804  , 172.19905271, 177.03470198, 206.9132008 ,
       157.8000766 , 157.37575802, 220.00185196, 108.90253078,
       188.29362207, 205.9232775 , 208.22259928, 183.78457325])

In [86]:
print("Test Set R^2: {}".format(lr_3svd.score(X_test_svd, y_test_svd)))
rmse_3svd_test = np.sqrt(mean_squared_error(y_test_svd, y_pred_3svd_test))
print("Test Set Root Mean Squared Error: {}".format(rmse_3svd_test))

Test Set R^2: 0.6407645009808149
Test Set Root Mean Squared Error: 16.196424295586358


In [87]:
lr_3svd_coeffs = pd.Series(lr_3svd.coef_, index=X.columns) 
lr_3svd_coeffs

avgAnnCount                      -3.311788e-03
incidenceRate                     1.579581e-01
medIncome                         3.962400e-05
popEst2015                       -2.945975e-07
povertyPercent                   -2.999942e+00
studyPerCap                       1.610786e-04
MedianAge                         1.048151e+00
MedianAgeMale                    -5.903435e-01
MedianAgeFemale                   1.346875e+00
AvgHouseholdSize                 -6.493734e+00
PercentMarried                    2.234602e+00
PctNoHS18_24                     -1.135149e-02
PctHS18_24                        2.003942e-01
PctSomeCol18_24                  -2.494795e-01
PctBachDeg18_24                  -1.619434e-01
PctHS25_Over                      1.478055e+00
PctBachDeg25_Over                -4.930227e-01
PctEmployed16_Over                6.498003e+00
PctUnemployed16_Over              1.192618e-01
PctPrivateCoverage               -4.395087e+00
PctPrivateCoverageAlone           1.323289e-01
PctEmpPrivCov

In [88]:
X_train_corrwith_svd = X_train_svd.corrwith(y_train_svd)
X_train_corrwith_svd

avgAnnCount                      -0.139998
incidenceRate                     0.429436
medIncome                        -0.432210
popEst2015                       -0.115986
povertyPercent                    0.435167
studyPerCap                      -0.024890
MedianAge                         0.001862
MedianAgeMale                    -0.017983
MedianAgeFemale                   0.019013
AvgHouseholdSize                 -0.041219
PercentMarried                   -0.263055
PctNoHS18_24                      0.090024
PctHS18_24                        0.276894
PctSomeCol18_24                  -0.104714
PctBachDeg18_24                  -0.287782
PctHS25_Over                      0.408759
PctBachDeg25_Over                -0.493734
PctEmployed16_Over               -0.413245
PctUnemployed16_Over              0.375829
PctPrivateCoverage               -0.394586
PctPrivateCoverageAlone          -0.341052
PctEmpPrivCoverage               -0.280062
PctPublicCoverage                 0.413010
PctPublicCo

In [89]:
same_sign_svd = ((X_train_corrwith_svd >= 0) & (lr_3svd_coeffs >= 0) | (X_train_corrwith_svd < 0) & (lr_3svd_coeffs < 0))
same_sign_svd.sum()/len(same_sign_svd)

0.5426829268292683

In [90]:
same_sign_svd.value_counts()

True     178
False    150
dtype: int64

There is no significant change in the 'same_sign' proportion after using the 'svd' solver.

## Full DataFrame with cholesky Ridge Regression

Next, the 'cholesky' solver is tried.

In [91]:
lr_3chol = linear_model.Ridge(alpha=0.001, solver = 'cholesky')
lr_3chol

Ridge(alpha=0.001, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, random_state=None, solver='cholesky', tol=0.001)

In [92]:
X_train_chol, X_test_chol, y_train_chol, y_test_chol = train_test_split(X, y, test_size=0.2, random_state=42)

In [93]:
lr_3chol.fit(X_train_chol, y_train_chol)


Ridge(alpha=0.001, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, random_state=None, solver='cholesky', tol=0.001)

In [94]:
y_pred_3chol_train = lr_3chol.predict(X_train_chol)
y_pred_3chol_train[0:20]

array([193.6814711 , 190.81999526, 216.23272617, 139.13920381,
       225.16635654, 195.66524864, 172.79393977, 170.42103655,
       196.48290334, 230.26460023, 197.25887577, 175.61009965,
       202.61717549, 187.70993108, 155.03822506, 188.09679189,
       201.12734623, 156.97725576, 174.21388088, 179.27723478])

In [95]:
print("Training Set R^2: {}".format(lr_3chol.score(X_train_chol, y_train_chol)))
rmse_3chol_train = np.sqrt(mean_squared_error(y_train_chol, y_pred_3chol_train))
print("Training Set Root Mean Squared Error: {}".format(rmse_3chol_train))

Training Set R^2: 0.6465369736893312
Training Set Root Mean Squared Error: 16.58992192614389


In [96]:
y_pred_3chol_test = lr_3chol.predict(X_test_chol)
y_pred_3chol_test[0:20]

array([177.35025576, 175.37003439, 162.28118045, 175.61565049,
       178.79949234, 195.82574314, 173.08925325, 164.09948603,
       174.86180399, 172.19905272, 177.03470199, 206.9132008 ,
       157.80007659, 157.37575802, 220.00185197, 108.9025308 ,
       188.29362207, 205.9232775 , 208.2225993 , 183.78457323])

In [97]:
print("Test Set R^2: {}".format(lr_3chol.score(X_test_chol, y_test_chol)))
rmse_3chol_test = np.sqrt(mean_squared_error(y_test_chol, y_pred_3chol_test))
print("Test Set Root Mean Squared Error: {}".format(rmse_3chol_test))

Test Set R^2: 0.6407644965711742
Test Set Root Mean Squared Error: 16.196424394992476


In [98]:
lr_3chol_coeffs = pd.Series(lr_3chol.coef_, index=X.columns) 
lr_3chol_coeffs

avgAnnCount                      -3.311788e-03
incidenceRate                     1.579581e-01
medIncome                         3.962400e-05
popEst2015                       -2.945975e-07
povertyPercent                   -2.999942e+00
studyPerCap                       1.610786e-04
MedianAge                         1.048151e+00
MedianAgeMale                    -5.903435e-01
MedianAgeFemale                   1.346875e+00
AvgHouseholdSize                 -6.493734e+00
PercentMarried                    2.234602e+00
PctNoHS18_24                     -1.135149e-02
PctHS18_24                        2.003942e-01
PctSomeCol18_24                  -2.494795e-01
PctBachDeg18_24                  -1.619434e-01
PctHS25_Over                      1.478055e+00
PctBachDeg25_Over                -4.930227e-01
PctEmployed16_Over                6.498003e+00
PctUnemployed16_Over              1.192618e-01
PctPrivateCoverage               -4.395087e+00
PctPrivateCoverageAlone           1.323289e-01
PctEmpPrivCov

In [99]:
X_train_corrwith_chol = X_train_chol.corrwith(y_train_chol)
X_train_corrwith_chol

avgAnnCount                      -0.139998
incidenceRate                     0.429436
medIncome                        -0.432210
popEst2015                       -0.115986
povertyPercent                    0.435167
studyPerCap                      -0.024890
MedianAge                         0.001862
MedianAgeMale                    -0.017983
MedianAgeFemale                   0.019013
AvgHouseholdSize                 -0.041219
PercentMarried                   -0.263055
PctNoHS18_24                      0.090024
PctHS18_24                        0.276894
PctSomeCol18_24                  -0.104714
PctBachDeg18_24                  -0.287782
PctHS25_Over                      0.408759
PctBachDeg25_Over                -0.493734
PctEmployed16_Over               -0.413245
PctUnemployed16_Over              0.375829
PctPrivateCoverage               -0.394586
PctPrivateCoverageAlone          -0.341052
PctEmpPrivCoverage               -0.280062
PctPublicCoverage                 0.413010
PctPublicCo

In [100]:
same_sign_chol = ((X_train_corrwith_chol >= 0) & (lr_3chol_coeffs >= 0) | (X_train_corrwith_chol < 0) & (lr_3chol_coeffs < 0))
same_sign_chol.sum()/len(same_sign_chol)

0.5426829268292683

In [101]:
same_sign_chol.value_counts()

True     178
False    150
dtype: int64

There is no change in the 'same_sign' proportion using the 'cholesky' solver.
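
The 'svd' and 'cholesky' runs give matching coefficients because both solve the same ridge problem by different numerical routes. The sketch below is a plain NumPy illustration on synthetic data, not scikit-learn's internal code. It centers the data to stand in for the intercept and compares a solution built from the singular value decomposition with one built from a Cholesky factorization of the normal equations.

```python
import numpy as np

rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 5))
true_w = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y_demo = X_demo @ true_w + rng.normal(scale=0.1, size=200)
alpha = 0.001  # same penalty strength as in the notebook

# Centering stands in for fitting an intercept
Xc = X_demo - X_demo.mean(axis=0)
yc = y_demo - y_demo.mean()

# Route 1: SVD, w = V diag(s / (s^2 + alpha)) U^T y
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
w_svd = Vt.T @ ((s / (s ** 2 + alpha)) * (U.T @ yc))

# Route 2: Cholesky factor of (X^T X + alpha I), then two triangular systems
A = Xc.T @ Xc + alpha * np.eye(Xc.shape[1])
L = np.linalg.cholesky(A)
z = np.linalg.solve(L, Xc.T @ yc)
w_chol = np.linalg.solve(L.T, z)
```

Any difference between `w_svd` and `w_chol` is at the level of floating-point rounding, so the sign comparison cannot depend on which route is used.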

## Trying the basic OLS Linear Regression regressor

Next, the basic OLS Linear Regression algorithm is tried. This algorithm performed only slightly worse than Ridge Regression, and it is fit here to check whether something specific to the Ridge Regression algorithm altered the relationship between the correlation coefficients and the regression coefficients.

In [102]:
lr_2 = linear_model.LinearRegression()
lr_2

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

In [103]:
lr_2.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

In [104]:
y_pred_2_train = lr_2.predict(X_train)
y_pred_2_train[0:20]

array([193.43076022, 190.79549479, 216.14766197, 139.15685695,
       225.40659532, 196.12564798, 173.12573478, 170.47723738,
       196.51814602, 230.39423481, 197.22011038, 175.54912298,
       202.47658953, 187.68025628, 154.69078527, 188.18074488,
       201.16786053, 156.98379718, 174.44087904, 179.39307531])

In [105]:
print("Training Set R^2: {}".format(lr_2.score(X_train, y_train)))
rmse_2_train = np.sqrt(mean_squared_error(y_train, y_pred_2_train))
print("Training Set Root Mean Squared Error: {}".format(rmse_2_train))

Training Set R^2: 0.6466260171457876
Training Set Root Mean Squared Error: 16.587832150231034


In [106]:
y_pred_2_test = lr_2.predict(X_test)
y_pred_2_test[0:20]

array([177.32331018, 174.71390586, 162.20675843, 175.78673799,
       178.88472736, 195.88785672, 173.28360857, 164.24285469,
       175.00771172, 171.91171712, 177.00965019, 206.91234961,
       158.15670143, 157.54265212, 219.88087446, 108.69274019,
       188.3489999 , 205.9336845 , 208.31387718, 183.87287741])

In [107]:
print("Test Set R^2: {}".format(lr_2.score(X_test, y_test)))
rmse_2_test = np.sqrt(mean_squared_error(y_test, y_pred_2_test))
print("Test Set Root Mean Squared Error: {}".format(rmse_2_test))

Test Set R^2: 0.6406253865183126
Test Set Root Mean Squared Error: 16.199560036310984


In [108]:
lr_2_coeffs = pd.Series(lr_2.coef_, index=X.columns) 
lr_2_coeffs

avgAnnCount                      -3.338686e-03
incidenceRate                     1.579467e-01
medIncome                         3.944710e-05
popEst2015                       -2.944186e-07
povertyPercent                   -2.958029e+00
studyPerCap                       1.737154e-04
MedianAge                         1.132936e+00
MedianAgeMale                    -5.998807e-01
MedianAgeFemale                   1.407140e+00
AvgHouseholdSize                 -6.469404e+00
PercentMarried                    2.327768e+00
PctNoHS18_24                     -1.062508e-02
PctHS18_24                        1.991986e-01
PctSomeCol18_24                  -2.701692e-01
PctBachDeg18_24                  -1.615002e-01
PctHS25_Over                      1.481759e+00
PctBachDeg25_Over                -4.917822e-01
PctEmployed16_Over                6.686158e+00
PctUnemployed16_Over              1.184963e-01
PctPrivateCoverage               -4.421976e+00
PctPrivateCoverageAlone           1.308252e-01
PctEmpPrivCov

In [109]:
X_train_corrwith = X_train.corrwith(y_train)

In [110]:
same_sign = ((X_train_corrwith >= 0) & (lr_2_coeffs >= 0) | (X_train_corrwith < 0) & (lr_2_coeffs < 0))
same_sign.sum()/len(same_sign)

0.5426829268292683

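Again, there is no significant change in the 'same_sign' proportion. Although the disagreement in sign between the features' individual correlation coefficients and their regression coefficients is unexpected, there does not seem to be anything fundamentally flawed with the model. A regression coefficient describes a feature's association with the target after adjusting for every other feature, so its sign can differ from that of the feature's pairwise correlation when features are correlated with one another. Therefore, the features with the strongest ridge regression coefficients are explored further below.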
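
One way a coefficient and a pairwise correlation can end up with opposite signs is when two features are strongly correlated with each other. The sketch below is synthetic data, not the cancer dataset, and the names `x1`, `x2`, `X_demo`, and `y_demo` are hypothetical. It builds such a pair and fits a closed-form ridge solution in NumPy with the same alpha of 0.001.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 2000
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.3, size=n)               # x2 closely tracks x1
y = 3.0 * x1 - 2.0 * x2 + rng.normal(scale=0.1, size=n)

X_demo = pd.DataFrame({"x1": x1, "x2": x2})
y_demo = pd.Series(y)

# Pairwise Pearson correlation of each feature with the target
corr_demo = X_demo.corrwith(y_demo)

# Closed-form ridge on centered data: (X^T X + alpha I) w = X^T y
alpha = 0.001
Xc = X_demo.to_numpy() - X_demo.to_numpy().mean(axis=0)
yc = y - y.mean()
w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(2), Xc.T @ yc)
coefs_demo = pd.Series(w, index=X_demo.columns)
```

Here `x2` is positively correlated with the target because it moves with `x1`, yet its coefficient is negative once `x1` is in the model.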

## Working with the Ridge Regression Coefficients to Identify Salient Predictive Features of Cancer Mortality

First, the ridge regression coefficients are sorted in descending order, with the strongest positive coefficients at the top of the series and the strongest negative coefficients at the bottom of the series. Although there surely is a complex web of interconnections and relationships between the features, one can look at the coefficients of each feature individually to see what impact they have on cancer mortality. Identifying the features with the strongest relationship with cancer mortality can help inform policy interventions that could help reduce cancer mortality.

Binary features' ridge regression coefficients show that if a county's value is true for that binary feature, one can expect a change in the cancer mortality rate per 100,000 people equal to the ridge regression coefficient of that feature, with all other features held constant. For example, the 'State_Nevada' feature stores whether a county is in Nevada (1) or not (0). If a county is in Nevada, one can expect an increase of 43 cancer deaths per 100,000 people.

For a continuous feature, an increase of one unit in that feature corresponds to a change in the predicted cancer mortality rate equal to that feature's ridge regression coefficient. For example, the 'PCT_OBESE_ADULTS13' feature stores the percentage of each county's adults in 2013 who qualify as obese. For every percentage point increase in the value of this feature for any given county, one can expect an increase of three cancer deaths per 100,000 people in 2015.
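
The two interpretations above follow directly from the linear form of the model. The sketch below is a two-feature toy: the coefficient values are rounded from the text, and the intercept and baseline county are invented for illustration. It shows that switching a binary feature on, or raising a continuous feature by one unit, moves the prediction by exactly that feature's coefficient.

```python
import pandas as pd

# Rounded coefficients from the text; the intercept is a made-up placeholder
coefs_demo = pd.Series({"State_Nevada": 42.8, "PCT_OBESE_ADULTS13": 3.0})
intercept_demo = 100.0

def predict(row):
    """Linear prediction: intercept plus the sum of coefficient * feature value."""
    return intercept_demo + float((coefs_demo * row).sum())

# Hypothetical baseline county: not in Nevada, 30% adult obesity
base = pd.Series({"State_Nevada": 0.0, "PCT_OBESE_ADULTS13": 30.0})

in_nevada = base.copy()
in_nevada["State_Nevada"] = 1.0            # switch the binary flag on

more_obese = base.copy()
more_obese["PCT_OBESE_ADULTS13"] += 1.0    # one percentage point higher

effect_binary = predict(in_nevada) - predict(base)
effect_continuous = predict(more_obese) - predict(base)
```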

The non-normalized coefficients are displayed below, sorted in descending order.

In [111]:
lr_3_coeffs_sorted = lr_3_coeffs.sort_values(ascending=False)
lr_3_coeffs_sorted

nw_mem_l2                         3.408147e+02
PctPrivateCoverage_log            2.525173e+02
mskcc_l2                          1.215137e+02
nw_mem_l1                         5.175472e+01
mgs_l2                            4.815532e+01
State_Nevada                      4.279767e+01
mgs_l1                            4.001968e+01
PctPublicCoverage_log             3.712718e+01
mgs_l2_log                        3.692593e+01
povertyPercent_log                3.561534e+01
State_Alaska                      3.512214e+01
State_California                  2.992599e+01
AvgHouseholdSize_log              2.910125e+01
atlanta_l1_log                    2.207967e+01
State_Florida                     1.996096e+01
FMRKT09_isnull                    1.868850e+01
FMRKTPTH09_isnull                 1.868850e+01
upmcps_l2                         1.802796e+01
State_Arizona                     1.802644e+01
cleveland_l1_log                  1.762623e+01
State_Utah                        1.741215e+01
mskcc_l1_log 

Features in the feature set are on different scales: they are measured in percentages, per capita rates (per 100,000 people), binary flags, and real numbers. Because of this, the coefficients are rescaled by each feature's range (maximum minus minimum) in the training set so that the relative effect of each feature on the target variable can be compared. This is done in the next cell.

In [112]:
normalized_lr_3_coeffs = lr_3_coeffs/(X_train.max() - X_train.min())
normalized_lr_3_coeffs

avgAnnCount                      -8.682330e-08
incidenceRate                     1.966855e-04
medIncome                         3.978552e-10
popEst2015                       -2.896883e-14
povertyPercent                   -6.864855e-02
studyPerCap                       1.650005e-08
MedianAge                         2.437560e-02
MedianAgeMale                    -1.395611e-02
MedianAgeFemale                   3.103400e-02
AvgHouseholdSize                 -3.077599e+00
PercentMarried                    4.523485e-02
PctNoHS18_24                     -1.810445e-04
PctHS18_24                        2.779393e-03
PctSomeCol18_24                  -3.469812e-03
PctBachDeg18_24                  -3.126321e-03
PctHS25_Over                      3.178612e-02
PctBachDeg25_Over                -1.241871e-02
PctEmployed16_Over                1.103226e-01
PctUnemployed16_Over              4.112475e-03
PctPrivateCoverage               -6.530590e-02
PctPrivateCoverageAlone           2.093812e-03
PctEmpPrivCov
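
For comparison with the range-based rescaling in the cell above, the sketch below uses synthetic data and ordinary least squares (not the notebook's Ridge model) to show how coefficients change when the features themselves are min-max scaled before fitting. In that setting each coefficient equals the raw coefficient multiplied by the feature's range. That differs from dividing by the range, so the two approaches can rank features differently, and which one suits this analysis is a judgment call. All names here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
# Two features on very different scales (0 to 100 and 0 to 1)
X_demo = np.column_stack([rng.uniform(0, 100, n), rng.uniform(0, 1, n)])
y_demo = 0.5 * X_demo[:, 0] + 20.0 * X_demo[:, 1] + rng.normal(0, 1, n)

def ols_coefs(X, y):
    """Least-squares slopes with an intercept column; the intercept is dropped."""
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta[1:]

ranges = X_demo.max(axis=0) - X_demo.min(axis=0)
X_scaled = (X_demo - X_demo.min(axis=0)) / ranges   # min-max scaling to [0, 1]

raw_coefs = ols_coefs(X_demo, y_demo)
scaled_coefs = ols_coefs(X_scaled, y_demo)
```

A coefficient fitted on min-max-scaled features can be read as the change in the target when a feature moves across its whole observed range.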

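A quality control check compares the normalized Ridge Regression coefficients with the non-normalized coefficients divided by the difference between the maximum and minimum values of each feature in the training set. The two agree for all but seven features. Because both sides of the comparison are computed in the same way, a mismatch can only occur where the result is NaN, since NaN never compares equal to itself. That happens when a zero coefficient is divided by a zero range.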

In [113]:
normalized_lr_3_coeffs_check = (normalized_lr_3_coeffs == lr_3_coeffs/(X_train.max() - X_train.min()))
type(normalized_lr_3_coeffs_check)

pandas.core.series.Series

In [114]:
normalized_lr_3_coeffs_check.values

array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,

In [115]:
len(normalized_lr_3_coeffs_check)

328

In [116]:
normalize_falses = normalized_lr_3_coeffs_check == False
normalize_falses

avgAnnCount                       False
incidenceRate                     False
medIncome                         False
popEst2015                        False
povertyPercent                    False
studyPerCap                       False
MedianAge                         False
MedianAgeMale                     False
MedianAgeFemale                   False
AvgHouseholdSize                  False
PercentMarried                    False
PctNoHS18_24                      False
PctHS18_24                        False
PctSomeCol18_24                   False
PctBachDeg18_24                   False
PctHS25_Over                      False
PctBachDeg25_Over                 False
PctEmployed16_Over                False
PctUnemployed16_Over              False
PctPrivateCoverage                False
PctPrivateCoverageAlone           False
PctEmpPrivCoverage                False
PctPublicCoverage                 False
PctPublicCoverageAlone            False
PctWhite                          False


In [117]:
normalized_lr_3_coeffs_check.value_counts()

True     321
False      7
dtype: int64

In [118]:
X_train.max()

avgAnnCount                       3.815000e+04
incidenceRate                     1.014200e+03
medIncome                         1.226410e+05
popEst2015                        1.017029e+07
povertyPercent                    4.740000e+01
studyPerCap                       9.762309e+03
MedianAge                         6.530000e+01
MedianAgeMale                     6.470000e+01
MedianAgeFemale                   6.570000e+01
AvgHouseholdSize                  3.970000e+00
PercentMarried                    7.250000e+01
PctNoHS18_24                      6.270000e+01
PctHS18_24                        7.210000e+01
PctSomeCol18_24                   7.900000e+01
PctBachDeg18_24                   5.180000e+01
PctHS25_Over                      5.480000e+01
PctBachDeg25_Over                 4.220000e+01
PctEmployed16_Over                7.650000e+01
PctUnemployed16_Over              2.940000e+01
PctPrivateCoverage                8.960000e+01
PctPrivateCoverageAlone           7.890000e+01
PctEmpPrivCov

In [119]:
X_train.min()

avgAnnCount                           6.000000
incidenceRate                       211.100000
medIncome                         23047.000000
popEst2015                          827.000000
povertyPercent                        3.700000
studyPerCap                           0.000000
MedianAge                            22.300000
MedianAgeMale                        22.400000
MedianAgeFemale                      22.300000
AvgHouseholdSize                      1.860000
PercentMarried                       23.100000
PctNoHS18_24                          0.000000
PctHS18_24                            0.000000
PctSomeCol18_24                       7.100000
PctBachDeg18_24                       0.000000
PctHS25_Over                          8.300000
PctBachDeg25_Over                     2.500000
PctEmployed16_Over                   17.600000
PctUnemployed16_Over                  0.400000
PctPrivateCoverage                   22.300000
PctPrivateCoverageAlone              15.700000
PctEmpPrivCov

In [120]:
df['FMRKT13_isnull'].loc['Franklin city, Virginia']

1

In [121]:
type(df['FMRKT13_isnull'].loc['Franklin city, Virginia'])

numpy.int64

The following seven '_isnull' features have a value of 1 only in the test set, and only for the county equivalent Franklin city, Virginia. That is why their minimum and maximum values are both zero in the training set.

- FMRKT13_isnull
- FMRKTPTH13_isnull
- SLHOUSE07_isnull
- FOODHUB12_isnull
- CSA07_isnull
- AGRITRSM_OPS07_isnull
- FARM_TO_SCHOOL_isnull

These features will be removed from the positive and negative feature family proportions explored below.
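
Features that are constant in the training set can be found and dropped programmatically instead of by inspection. The sketch below uses a three-column toy frame; `feat_a`, `feat_b`, and `flag_const` are made-up names, not columns of the dataset. It shows the zero range producing NaN, the self-comparison flagging it, and the cleanup.

```python
import pandas as pd

# Toy coefficients and training data; flag_const never varies in training
coefs_demo = pd.Series({"feat_a": 2.0, "feat_b": -1.5, "flag_const": 0.0})
X_train_demo = pd.DataFrame({
    "feat_a": [1.0, 3.0, 5.0],
    "feat_b": [10.0, 20.0, 40.0],
    "flag_const": [0, 0, 0],
})

ranges = X_train_demo.max() - X_train_demo.min()
normalized = coefs_demo / ranges           # 0.0 / 0.0 becomes NaN for flag_const

# NaN is the only value that is not equal to itself
self_equal = normalized == normalized

# Identify zero-range features directly and drop them
zero_range = ranges[ranges == 0].index
normalized_clean = normalized.drop(zero_range)
```

`normalized.isna()` gives the same mask as `~self_equal` and states the intent more directly.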

In [122]:
X_train['FMRKT13_isnull'].loc['Franklin city, Virginia']

KeyError: 'Franklin city, Virginia'

In [123]:
X_test['FMRKT13_isnull'].loc['Franklin city, Virginia']

1

In [124]:
X_train['FMRKTPTH13_isnull'].loc['Franklin city, Virginia']

KeyError: 'Franklin city, Virginia'

In [125]:
X_test['FMRKTPTH13_isnull'].loc['Franklin city, Virginia']

1

In [126]:
X_train['SLHOUSE07_isnull'].loc['Franklin city, Virginia']

KeyError: 'Franklin city, Virginia'

In [127]:
X_test['SLHOUSE07_isnull'].loc['Franklin city, Virginia']

1

In [128]:
X_train['FOODHUB12_isnull'].loc['Franklin city, Virginia']

KeyError: 'Franklin city, Virginia'

In [129]:
X_test['FOODHUB12_isnull'].loc['Franklin city, Virginia']

1

In [130]:
X_train['CSA07_isnull'].loc['Franklin city, Virginia']

KeyError: 'Franklin city, Virginia'

In [131]:
X_test['CSA07_isnull'].loc['Franklin city, Virginia']

1

In [132]:
X_train['AGRITRSM_OPS07_isnull'].loc['Franklin city, Virginia']

KeyError: 'Franklin city, Virginia'

In [133]:
X_test['AGRITRSM_OPS07_isnull'].loc['Franklin city, Virginia']

1

In [134]:
X_train['FARM_TO_SCHOOL_isnull'].loc['Franklin city, Virginia']

KeyError: 'Franklin city, Virginia'

In [135]:
X_test['FARM_TO_SCHOOL_isnull'].loc['Franklin city, Virginia']

1
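
The paired train and test lookups above could be condensed into a single summary table. The sketch below uses small stand-in frames (`X_train_demo` and `X_test_demo` are hypothetical, with two of the seven flag columns) and checks index membership first, which avoids the KeyError raised when a county is absent from the training split.

```python
import pandas as pd

flag_cols = ["FMRKT13_isnull", "SLHOUSE07_isnull"]
county = "Franklin city, Virginia"

# Stand-in splits: the county appears only in the test frame
X_train_demo = pd.DataFrame({c: [0, 0] for c in flag_cols},
                            index=["County A", "County B"])
X_test_demo = pd.DataFrame({c: [0, 1] for c in flag_cols},
                           index=["County C", county])

in_train = county in X_train_demo.index          # membership test, no KeyError
train_max = X_train_demo[flag_cols].max()        # all zero if the flag is never set
test_value = X_test_demo.loc[county, flag_cols]  # flag values for the county

summary = pd.DataFrame({"train_max": train_max, "test_value": test_value})
```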

In [136]:
type(normalized_lr_3_coeffs)

pandas.core.series.Series

The normalized Ridge Regression coefficients are displayed below, sorted in descending order.

In [137]:
normalized_lr_3_coeffs_sorted = normalized_lr_3_coeffs.sort_values(ascending=False)
normalized_lr_3_coeffs_sorted

PctPrivateCoverage_log            1.815667e+02
State_Nevada                      4.279767e+01
AvgHouseholdSize_log              3.838255e+01
State_Alaska                      3.512214e+01
State_California                  2.992599e+01
PctPublicCoverage_log             2.173945e+01
RECFACPTH12                       2.121466e+01
State_Florida                     1.996096e+01
FMRKT09_isnull                    1.868850e+01
FMRKTPTH09_isnull                 1.868850e+01
State_Arizona                     1.802644e+01
State_Utah                        1.741215e+01
State_District of Columbia        1.431427e+01
povertyPercent_log                1.396522e+01
NATAMEN_isnull                    1.075218e+01
State_Oklahoma                    1.065815e+01
VEG_ACRESPTH07_isnull             9.896747e+00
State_South Carolina              9.442584e+00
State_Mississippi                 9.357084e+00
PCT_OBESE_CHILD08_isnull          8.649288e+00
PCT_LOCLFARM07_isnull             8.253386e+00
FMRKTPTH09   

## The features which have a positive coefficient greater than 10 are detailed below:

In [138]:
top_pos_normalized_lr_3_coeffs = normalized_lr_3_coeffs_sorted[normalized_lr_3_coeffs_sorted > 10]
top_pos_normalized_lr_3_coeffs

PctPrivateCoverage_log        181.566710
State_Nevada                   42.797673
AvgHouseholdSize_log           38.382552
State_Alaska                   35.122136
State_California               29.925988
PctPublicCoverage_log          21.739450
RECFACPTH12                    21.214659
State_Florida                  19.960958
FMRKT09_isnull                 18.688499
FMRKTPTH09_isnull              18.688499
State_Arizona                  18.026440
State_Utah                     17.412152
State_District of Columbia     14.314275
povertyPercent_log             13.965216
NATAMEN_isnull                 10.752182
State_Oklahoma                 10.658145
dtype: float64

In [139]:
top_neg_normalized_lr_3_coeffs = normalized_lr_3_coeffs_sorted[normalized_lr_3_coeffs_sorted < -10]
top_neg_normalized_lr_3_coeffs

CHILDPOVRATE10_log        -10.022924
State_New Hampshire       -10.340521
State_Michigan            -10.651165
State_Ohio                -12.605970
State_North Carolina      -13.709772
State_Illinois            -13.837224
State_Iowa                -15.199227
State_Connecticut         -17.194789
State_Rhode Island        -19.000810
MedianAge_log             -22.160897
PercentMarried_log        -24.820850
RECFACPTH07               -24.859940
PctEmpPrivCoverage_log    -26.617080
State_Hawaii              -43.058453
PctEmployed16_Over_log   -121.830764
dtype: float64
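
The two threshold filters above can also be expressed as a single selection on absolute value. This is a minimal sketch on a made-up Series; the feature names and values are hypothetical. Like the cells above, it uses a strict inequality, so a coefficient of exactly 10 is excluded.

```python
import pandas as pd

coeffs_demo = pd.Series({
    "feat_a": 42.8,
    "feat_b": 3.2,
    "feat_c": -12.6,
    "feat_d": -0.4,
    "feat_e": 10.0,   # exactly at the threshold, so not selected
})

# Keep coefficients whose magnitude exceeds 10, largest first
strong_demo = coeffs_demo[coeffs_demo.abs() > 10].sort_values(ascending=False)
```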

'PctPrivateCoverage_log', the logarithmic transformation of the percent of county residents with private health coverage, has a Ridge Regression coefficient of 182 (rounded up). This means that for each whole number increase for this feature, there was an increase in cancer mortality of 182 people per capita (100,000 people) in 2015. This seems to conflict with the negative correlation coefficient that the non-transformed version of this feature has, which is an area for future research.

'State_Nevada', the feature which stores whether a county is in Nevada or not, has a Ridge Regression coefficient of 43 (rounded up). This means that if a county is in the state of Nevada, there was an increase in cancer mortality of 43 people per capita (100,000) in 2015.

'AvgHouseholdSize_log', the feature which stores the logarithmic transformation of the average household size, has a Ridge Regression coefficient of 38 (rounded down). This means that for each one-person increase in average household size, there was an increase in cancer mortality of 38 deaths per capita (100,000) in 2015.

'State_Alaska', the feature which stores whether a county is in Alaska or not, has a Ridge Regression coefficient of 35 (rounded down). This means that if a county is in the state of Alaska, there was an increase in cancer mortality of 35 deaths per capita (100,000) in 2015.

'State_California', the feature which stores whether a county is in California or not, has a Ridge Regression coefficient of 30 (rounded up). This means that if a county is in the state of California, there was an increase in cancer mortality of 30 deaths per capita (100,000) in 2015.

'PctPublicCoverage_log', the logarithmic transformation of the percent of county residents with public health coverage, has a Ridge Regression coefficient of 22 (rounded up). This means that for each whole number increase in this feature, there was an increase in cancer mortality of 22 deaths per capita (100,000 people) in 2015.

'RECFACPTH12', the feature holding the number of recreation and fitness facilities per 1,000 people in 2012, has a Ridge Regression coefficient of 21 (rounded down). This means that for each new facility per 1,000 people, there was an increase in cancer mortality of 21 deaths per capita (100,000 people) in 2015. This is a counterintuitive result that warrants further research. 

'State_Florida', the feature which stores whether a county is in Florida or not, has a Ridge Regression coefficient of 20 (rounded up). This means that if a county is in the state of Florida, there was an increase in cancer mortality of 20 deaths per capita (100,000) in 2015.

'FMRKT09_isnull', the feature which stores whether a county has a missing value for the number of farmer's markets it had in 2009, has a Ridge Regression coefficient of 19 (rounded up). This means that if a county has a missing value for the 'FMRKT09' feature, there was an increase in cancer mortality of 19 deaths per capita (100,000) in 2015.

'State_Arizona', the feature which stores whether a county is in Arizona or not, has a Ridge Regression coefficient of 18 (rounded down). This means that if a county is in the state of Arizona, there was an increase in cancer mortality of 18 deaths per capita (100,000) in 2015. 

'State_Utah', the feature which stores whether a county is in Utah or not, has a Ridge Regression coefficient of 17 (rounded down). This means that if a county is in the state of Utah, there was an increase in cancer mortality of 17 deaths per capita (100,000) in 2015.

'State_District of Columbia', the feature which stores whether a county is Washington, DC or not, has a Ridge Regression coefficient of 14 (rounded down). This means that in Washington, DC, there was an increase in cancer mortality of 14 deaths per capita (100,000) in 2015.

'povertyPercent_log', the logarithmic transformation of the feature that stores the percentage of each county's populace that lives in poverty, has a Ridge Regression coefficient of 14 (rounded up). This means that for each whole number increase in this log-transformed feature, there was an increase in cancer mortality of 14 deaths per capita (100,000) in 2015.

'NATAMEN_isnull', the feature which stores whether there are missing values in the 'NATAMEN' feature which stores the quality of life 'ERS Natural Amenity Index' (for 1999), has a Ridge Regression coefficient of 11 (rounded up). This means that if a county has a missing value for the 'NATAMEN' feature, there was an increase in cancer mortality of 11 deaths per capita (100,000) in 2015.

'State_Oklahoma', the feature which stores whether a county is in the state of Oklahoma or not, has a Ridge Regression coefficient of 11 (rounded up). This means that if a county is in the state of Oklahoma, there was an increase in cancer mortality of 11 deaths per capita (100,000) in 2015.

As can be seen above, the features with the largest positive Ridge Regression coefficients for cancer mortality fall into the following categories:

- State that the county is in

- Recreation facility-related feature

Logarithmic transformations of features:
- Health insurance features (private and public)
- Poverty-related feature
- Average household size

Missing value features:
- Farmer’s market related feature
- ERS Natural Amenity Index ('NATAMEN')

## The features which have a negative coefficient of -10 or lower are detailed below:

'PctEmployed16_Over_log', the logarithmic transformation of the feature which stores the percentage of county residents ages 16 and over who were employed, has a Ridge Regression coefficient of -122 (rounded up). This means that for each whole number increase in this feature, there would be a decrease in cancer mortality of 122 deaths per capita (100,000) in 2015.

'State_Hawaii', the feature which stores whether a county is in the state of Hawaii or not, has a Ridge Regression coefficient of -43 (rounded up). This means that if a county is in the state of Hawaii, there would be a decrease in cancer mortality of 43 deaths per capita (100,000) in 2015.

'PctEmpPrivCoverage_log', the logarithmic transformation of the percentage of county residents with employee-provided private health coverage, has a Ridge Regression coefficient of -27 (rounded down). This means that for every whole number increase in this feature, there would be a decrease in cancer mortality of 27 deaths per capita (100,000) in 2015.

'RECFACPTH07', which stores the number of recreation and fitness facilities per 1,000 people in 2007, has a Ridge Regression coefficient of -25 (rounded down). This means that for each whole number increase of recreation and fitness facilities per 1,000 people in 2007, there would be a decrease in cancer mortality of 25 deaths per capita (100,000) in 2015.

'PercentMarried_log', the logarithmic transformation of the feature which stores the percentage of county residents who are married, has a Ridge Regression coefficient of -25 (rounded up). This means that for each whole number increase in this log-transformed feature, there would be a decrease in cancer mortality of 25 deaths per capita (100,000) in 2015.

'MedianAge_log', the logarithmic transformation of the feature which stores the median age of county residents, has a Ridge Regression coefficient of -22 (rounded down). This means that for each whole number increase of the logarithmic transformation of the median age of county residents, there would be a decrease in cancer mortality of 22 deaths per capita (100,000) in 2015.

'State_Rhode Island', the feature which stores whether a county is in the state of Rhode Island or not, has a Ridge Regression coefficient of -19 (rounded up). This means that if a county is in the state of Rhode Island, there would be a decrease in cancer mortality of 19 deaths per capita (100,000) in 2015.

'State_Connecticut', the feature which stores whether a county is in the state of Connecticut or not, has a Ridge Regression coefficient of -17 (rounded up). This means that if a county is in the state of Connecticut, there would be a decrease in cancer mortality of 17 deaths per capita (100,000) in 2015.

'State_Iowa', the feature which stores whether a county is in the state of Iowa or not, has a Ridge Regression coefficient of -15 (rounded up). This means that if a county is in the state of Iowa, there would be a decrease in cancer mortality of 15 deaths per capita (100,000) in 2015.

'State_Illinois', the feature which stores whether a county is in the state of Illinois or not, has a Ridge Regression coefficient of -14 (rounded down). This means that if a county is in the state of Illinois, there would be a decrease in cancer mortality of 14 deaths per capita (100,000) in 2015.

'State_North Carolina', the feature which stores whether a county is in the state of North Carolina or not, has a Ridge Regression coefficient of -14 (rounded down). This means that if a county is in the state of North Carolina, there would be a decrease in cancer mortality of 14 deaths per capita (100,000) in 2015.

'State_Ohio', the feature which stores whether a county is in the state of Ohio or not, has a Ridge Regression coefficient of -13 (rounded down). This means that if a county is in the state of Ohio, there would be a decrease in cancer mortality of 13 deaths per capita (100,000) in 2015.

'State_Michigan', the feature which stores whether a county is in the state of Michigan or not, has a Ridge Regression coefficient of -11 (rounded down). This means that if a county is in the state of Michigan, there would be a decrease in cancer mortality of 11 deaths per capita (100,000) in 2015.

'State_New Hampshire', the feature which stores whether a county is in the state of New Hampshire or not, has a Ridge Regression coefficient of -10 (rounded up). This means that if a county is in the state of New Hampshire, there would be a decrease in cancer mortality of 10 deaths per capita (100,000) in 2015.

'CHILDPOVRATE10_log', the logarithmic transformation of the feature which stores the percentage of children living in poverty in 2010, has a Ridge Regression coefficient of -10 (rounded up). This means that for each whole number increase in this feature's value, there would be a decrease in cancer mortality of 10 deaths per capita (100,000) in 2015. This contradicts other poverty-related features' positive Pearson's correlations, which warrants further research.

As can be seen above, the features with the largest negative Ridge Regression coefficients for cancer mortality fall into the following categories:

- State that the county is in

- Recreation facility-related feature

Logarithmic transformations of features:
- Health insurance feature (private)
- Median Age
- Percentage of populace who are married
- Percentage of children in poverty
- Percentage of populace 16 years and older who are employed

# Relationship of "Feature Families" to the Target Feature of Per-Capita Cancer Mortality

Because the feature set is so large, its Ridge Regression coefficients are easier to interpret when the features are grouped into "feature families" (e.g. distance from major urban centers, healthcare-related features, etc.). The grouping strategy has three steps: take the sum of all positive coefficients in the feature set, sum the positive coefficients within each "feature family", and divide each family's sum by the total positive sum. The result is the proportion of the total positive coefficient weight that each "feature family" contributes toward increasing predicted cancer mortality. The same strategy is then repeated for the negative coefficients to find the proportion that each "feature family" contributes toward decreasing predicted cancer mortality.
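
The grouping strategy can be sketched compactly with a pandas `groupby`. This is an illustration only, not the cell-by-cell approach used below: the miniature `coeffs` Series, the `family_of` mapping, and the helper `family_proportions` are all hypothetical stand-ins, and the coefficient values are rounded placeholders.

```python
import pandas as pd

# Hypothetical miniature stand-in for the notebook's coefficient Series.
coeffs = pd.Series({
    'State_Nevada': 42.8,
    'State_Hawaii': -43.0,
    'PctPrivateCoverage_log': 181.6,
    'PctEmpPrivCoverage_log': -27.0,
    'MedianAge': 0.02,
    'MedianAge_log': -22.0,
})

# Hypothetical mapping from feature name to "feature family".
family_of = {
    'State_Nevada': 'State',
    'State_Hawaii': 'State',
    'PctPrivateCoverage_log': 'Health Insurance',
    'PctEmpPrivCoverage_log': 'Health Insurance',
    'MedianAge': 'Age',
    'MedianAge_log': 'Age',
}

def family_proportions(signed_coeffs, mapping):
    """Each family's share of the total of same-signed coefficients."""
    family_sums = signed_coeffs.groupby(mapping).sum()
    return (family_sums / signed_coeffs.sum()).sort_values(ascending=False)

# Positive and negative coefficients are handled separately, as in the text.
pos_props = family_proportions(coeffs[coeffs > 0], family_of)
neg_props = family_proportions(coeffs[coeffs < 0], family_of)

print(pos_props)
print(neg_props)
```

Because each family sum is divided by a total of the same sign, both sets of proportions are positive and each set sums to 1.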

# Positive Coefficients

In [142]:
sum_of_all_pos_coeffs = np.sum(normalized_lr_3_coeffs_sorted[:161].values)
sum_of_all_pos_coeffs

671.8753208319986

In [144]:
lr_3_positive_coeffs = normalized_lr_3_coeffs_sorted[:161]
lr_3_positive_coeffs

PctPrivateCoverage_log            1.815667e+02
State_Nevada                      4.279767e+01
AvgHouseholdSize_log              3.838255e+01
State_Alaska                      3.512214e+01
State_California                  2.992599e+01
PctPublicCoverage_log             2.173945e+01
RECFACPTH12                       2.121466e+01
State_Florida                     1.996096e+01
FMRKT09_isnull                    1.868850e+01
FMRKTPTH09_isnull                 1.868850e+01
State_Arizona                     1.802644e+01
State_Utah                        1.741215e+01
State_District of Columbia        1.431427e+01
povertyPercent_log                1.396522e+01
NATAMEN_isnull                    1.075218e+01
State_Oklahoma                    1.065815e+01
VEG_ACRESPTH07_isnull             9.896747e+00
State_South Carolina              9.442584e+00
State_Mississippi                 9.357084e+00
PCT_OBESE_CHILD08_isnull          8.649288e+00
PCT_LOCLFARM07_isnull             8.253386e+00
FMRKTPTH09   

## Age Features with Positive Coefficients

In [145]:
is_age_pos_coeffs = []
for name in lr_3_positive_coeffs.index.values:
    is_age_pos_coeffs.append(name.startswith('MedianAge'))
is_age_pos_coeffs = np.array(is_age_pos_coeffs)

In [146]:
age_pos_coeffs = lr_3_positive_coeffs.loc[is_age_pos_coeffs]

In [147]:
sum_of_age_pos_coeffs = np.sum(age_pos_coeffs.values)
sum_of_age_pos_coeffs

0.05540959213243588

In [148]:
age_proportion_of_pos_total = sum_of_age_pos_coeffs / sum_of_all_pos_coeffs
age_proportion_of_pos_total

8.247005123483464e-05
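
The boolean mask built with an explicit loop in the cells above can also be produced with the vectorized string accessor on the index. The sketch below compares the two on a hypothetical three-row Series (`example_coeffs`) whose values are taken from the summary table later in this notebook; it is an alternative shown for illustration, not the code that generated the outputs above.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for lr_3_positive_coeffs, using three values from the summary table.
example_coeffs = pd.Series({
    'MedianAgeFemale': 0.031034,
    'MedianAge': 0.0243756,
    'AvgHouseholdSize_log': 38.38255,
})

# Loop-style mask, equivalent to the cells above.
loop_mask = np.array([name.startswith('MedianAge') for name in example_coeffs.index.values])

# Vectorized mask: Index.str.startswith returns a boolean array directly.
vector_mask = example_coeffs.index.str.startswith('MedianAge')

# Sum of the age-related coefficients selected by the mask.
age_sum = example_coeffs.loc[vector_mask].sum()
print(age_sum)
```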

## Average Household Size (Positive Coefficient)

In [149]:
is_avg_household_pos_coeffs = []
for name in lr_3_positive_coeffs.index.values:
    is_avg_household_pos_coeffs.append(name.startswith('AvgHousehold'))
is_avg_household_pos_coeffs = np.array(is_avg_household_pos_coeffs)

In [150]:
avg_household_pos_coeffs = lr_3_positive_coeffs.loc[is_avg_household_pos_coeffs]

In [151]:
sum_of_avg_household_pos_coeffs = np.sum(avg_household_pos_coeffs.values)
sum_of_avg_household_pos_coeffs

38.382551589290685

In [152]:
avg_household_proportion_of_pos_total = sum_of_avg_household_pos_coeffs / sum_of_all_pos_coeffs
avg_household_proportion_of_pos_total

0.05712749136515491

## Cancer Diagnoses (Positive Coefficient)

In [153]:
is_cancer_diag_pos_coeffs = []
for name in lr_3_positive_coeffs.index.values:
    is_cancer_diag_pos_coeffs.append(name.startswith('incidenceRate'))
is_cancer_diag_pos_coeffs = np.array(is_cancer_diag_pos_coeffs)

In [154]:
cancer_diag_pos_coeffs = lr_3_positive_coeffs.loc[is_cancer_diag_pos_coeffs]

In [155]:
sum_of_cancer_diag_pos_coeffs = np.sum(cancer_diag_pos_coeffs.values)
sum_of_cancer_diag_pos_coeffs

0.00019668546965366667

In [156]:
cancer_diag_proportion_of_pos_total = sum_of_cancer_diag_pos_coeffs / sum_of_all_pos_coeffs
cancer_diag_proportion_of_pos_total

2.927410243467591e-07

## Clinical Cancer Trials (Positive Coefficient)

In [157]:
is_cancer_trials_pos_coeffs = []
for name in lr_3_positive_coeffs.index.values:
    is_cancer_trials_pos_coeffs.append(name.startswith('studyPer'))
is_cancer_trials_pos_coeffs = np.array(is_cancer_trials_pos_coeffs)

In [158]:
cancer_trials_pos_coeffs = lr_3_positive_coeffs.loc[is_cancer_trials_pos_coeffs]

In [159]:
sum_of_cancer_trials_pos_coeffs = np.sum(cancer_trials_pos_coeffs.values)
sum_of_cancer_trials_pos_coeffs

1.650005193961696e-08

In [160]:
cancer_trials_proportion_of_pos_total = sum_of_cancer_trials_pos_coeffs / sum_of_all_pos_coeffs
cancer_trials_proportion_of_pos_total

2.4558205113390036e-11

## Comorbid Health Conditions Features with Positive Coefficients

In [161]:
comorbidities_pos = ['PCT_OBESE_ADULTS13', 'PCT_OBESE_ADULTS10', 'PCT_OBESE_CHILD11', 'PCT_DIABETES_ADULTS09', 
                     'PCT_OBESE_ADULTS13_log']

In [162]:
is_comorbid_pos_coeffs = []
for name in lr_3_positive_coeffs.index.values:
    is_comorbid_pos_coeffs.append(name in comorbidities_pos)
is_comorbid_pos_coeffs = np.array(is_comorbid_pos_coeffs)

In [163]:
comorbid_pos_coeffs = lr_3_positive_coeffs.loc[is_comorbid_pos_coeffs]

In [164]:
sum_of_comorbid_pos_coeffs = np.sum(comorbid_pos_coeffs.values)
sum_of_comorbid_pos_coeffs

1.0895626996791608

In [165]:
comorbid_proportion_of_pos_total = sum_of_comorbid_pos_coeffs / sum_of_all_pos_coeffs
comorbid_proportion_of_pos_total

0.0016216739414240285
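
For families defined by an explicit list of feature names, `Index.isin` gives the same mask as the `name in list` loop. The sketch below uses a hypothetical small Series (`example_coeffs`) filled with the comorbidity coefficients shown in the summary table later in this notebook, plus one unrelated feature, to show that the family sum is reproduced.

```python
import pandas as pd

# Family definition copied from the cell above.
comorbidities_pos = ['PCT_OBESE_ADULTS13', 'PCT_OBESE_ADULTS10', 'PCT_OBESE_CHILD11',
                     'PCT_DIABETES_ADULTS09', 'PCT_OBESE_ADULTS13_log']

# Hypothetical stand-in for lr_3_positive_coeffs (values from the summary table).
example_coeffs = pd.Series({
    'State_Nevada': 42.79767,
    'PCT_OBESE_ADULTS13_log': 0.8070711,
    'PCT_OBESE_ADULTS13': 0.219683,
    'PCT_DIABETES_ADULTS09': 0.0251723,
    'PCT_OBESE_ADULTS10': 0.02220356,
    'PCT_OBESE_CHILD11': 0.01543272,
})

# Boolean mask: True where the feature name belongs to the family list.
is_comorbid = example_coeffs.index.isin(comorbidities_pos)

comorbid_sum = example_coeffs.loc[is_comorbid].sum()
print(comorbid_sum)
```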

## Distance to Major Urban Centers Features with Positive Coefficients

In [166]:
dists_to_urban_centers_pos = ['atlanta_l1_log', 'seattle_l1_log', 'seattle_l2', 'los_ang_l2', 'denver_l2', 
                         'san_fran_l2', 'atlanta_l2', 'dallas_l1', 'city_min_distsl2', 'city_min_distsl1_sqrd', 
                         'denver_l1_sqrd']

In [167]:
is_urban_pos_coeffs = []
for name in lr_3_positive_coeffs.index.values:
    is_urban_pos_coeffs.append(name in dists_to_urban_centers_pos)
is_urban_pos_coeffs = np.array(is_urban_pos_coeffs)

In [168]:
urban_pos_coeffs = lr_3_positive_coeffs.loc[is_urban_pos_coeffs]

In [169]:
sum_of_urban_pos_coeffs = np.sum(urban_pos_coeffs.values)
sum_of_urban_pos_coeffs

4.623497402010471

In [170]:
urban_proportion_of_pos_total = sum_of_urban_pos_coeffs / sum_of_all_pos_coeffs
urban_proportion_of_pos_total

0.006881481219290936

## Distances to Top 10 Oncology Hospitals Features with Positive Coefficients

In [171]:
dists_to_oncology_hosps_pos = ['nw_mem_l2', 'mskcc_l2', 'nw_mem_l1', 'mgs_l2', 'mgs_l1', 'mgs_l2_log', 'upmcps_l2', 
                          'cleveland_l1_log', 'mskcc_l1_log', 'dfb_l2_log', 'upmcps_l1_log', 'hopkins_l1', 
                          'hlmcc_l1', 'mayo_l1', 'upmcps_l1', 'onc_min_distsl1', 'cleveland_l2_log', 
                          'hopkins_l2', 'utmda_l2', 'cleveland_l1_sqrd', 'mskcc_l1']

In [172]:
is_onc_pos_coeffs = []
for name in lr_3_positive_coeffs.index.values:
    is_onc_pos_coeffs.append(name in dists_to_oncology_hosps_pos)
is_onc_pos_coeffs = np.array(is_onc_pos_coeffs)

In [173]:
onc_pos_coeffs = lr_3_positive_coeffs.loc[is_onc_pos_coeffs]

In [174]:
sum_of_onc_pos_coeffs = np.sum(onc_pos_coeffs.values)
sum_of_onc_pos_coeffs

22.143848504200935

In [175]:
onc_proportion_of_pos_total = sum_of_onc_pos_coeffs / sum_of_all_pos_coeffs
onc_proportion_of_pos_total

0.032958270407640065

## Education Features with Positive Coefficients

In [176]:
education_pos = ['PctSomeCol18_24_log', 'PctHS25_Over', 'PctBachDeg25_Over_log', 'PctHS18_24', 'PctSomeCol18_24_sqrd']

In [177]:
is_edu_pos_coeffs = []
for name in lr_3_positive_coeffs.index.values:
    is_edu_pos_coeffs.append(name in education_pos)
is_edu_pos_coeffs = np.array(is_edu_pos_coeffs)

In [178]:
edu_pos_coeffs = lr_3_positive_coeffs.loc[is_edu_pos_coeffs]

In [179]:
sum_of_edu_pos_coeffs = np.sum(edu_pos_coeffs.values)
sum_of_edu_pos_coeffs

5.502439506649943

In [180]:
edu_proportion_of_pos_total = sum_of_edu_pos_coeffs / sum_of_all_pos_coeffs
edu_proportion_of_pos_total

0.008189673494534888

## Environmental Health (Positive Coefficient)

In [181]:
is_env_health_pos_coeffs = []
for name in lr_3_positive_coeffs.index.values:
    is_env_health_pos_coeffs.append(name.startswith('sc_min_'))
is_env_health_pos_coeffs = np.array(is_env_health_pos_coeffs)

In [182]:
env_health_pos_coeffs = lr_3_positive_coeffs.loc[is_env_health_pos_coeffs]

In [183]:
sum_of_env_health_pos_coeffs = np.sum(env_health_pos_coeffs.values)
sum_of_env_health_pos_coeffs

0.11839272011279472

In [184]:
env_health_proportion_of_pos_total = sum_of_env_health_pos_coeffs / sum_of_all_pos_coeffs
env_health_proportion_of_pos_total

0.0001762123364885412

## Erroneous Data Indicator (Positive Coefficient)

In [185]:
is_err_pos_coeffs = []
for name in lr_3_positive_coeffs.index.values:
    is_err_pos_coeffs.append(name.startswith('age_gt_'))
is_err_pos_coeffs = np.array(is_err_pos_coeffs)

In [186]:
err_pos_coeffs = lr_3_positive_coeffs.loc[is_err_pos_coeffs]

In [187]:
sum_of_err_pos_coeffs = np.sum(err_pos_coeffs.values)
sum_of_err_pos_coeffs

3.9409651172033926

In [188]:
err_proportion_of_pos_total = sum_of_err_pos_coeffs / sum_of_all_pos_coeffs
err_proportion_of_pos_total

0.0058656196990882255

## Food Environment Features with Positive Coefficients

In [189]:
food_env_pos_features = ['FMRKTPTH09', 'FOODINSEC_CHILD_01_07', 'CH_VLFOODSEC_02_12', 'VLFOODSEC_07_09', 
                    'CH_FOODINSEC_09_12', 'FOODINSEC_10_12', 'FOODINSEC_00_02', 'PCT_LOCLFARM07', 'FMRKT09', 
                    'PCT_LOCLSALE07', 'PCT_LACCESS_HHNV10', 'PCH_FMRKTPTH_09_13', 'VLFOODSEC_10_12', 
                    'GHVEG_FARMS07', 'PCT_LACCESS_POP10', 'BERRY_ACRESPTH07', 'PCT_FMRKT_WIC13', 
                    'PCT_FMRKT_WICCASH13', 'BERRY_FARMS07', 'PCT_LACCESS_CHILD10_sqrd', 'CH_FOODINSEC_02_12', 
                    'AGRITRSM_OPS07', 'PCT_FRMKT_ANMLPROD13', 'FRESHVEG_FARMS07', 'FRESHVEG_ACRESPTH07', 
                    'PCT_FMRKT_OTHER13', 'FMRKT13_sqrd', 'PC_DIRSALES07_sqrd', 'ORCHARD_FARMS07', 'VEG_ACRES07', 
                    'GHVEG_SQFT07', 'AGRITRSM_RCT07']

In [190]:
is_food_env_pos_coeffs = []
for name in lr_3_positive_coeffs.index.values:
    is_food_env_pos_coeffs.append(name in food_env_pos_features)
is_food_env_pos_coeffs = np.array(is_food_env_pos_coeffs)

In [191]:
food_env_pos_coeffs = lr_3_positive_coeffs.loc[is_food_env_pos_coeffs]

In [192]:
sum_of_food_env_pos_coeffs = np.sum(food_env_pos_coeffs.values)
sum_of_food_env_pos_coeffs

8.233453454826199

In [193]:
food_env_proportion_of_pos_total = sum_of_food_env_pos_coeffs / sum_of_all_pos_coeffs
food_env_proportion_of_pos_total

0.012254436499662654

## Geography (Positive Coefficient)

In [194]:
is_geography_pos_coeffs = []
for name in lr_3_positive_coeffs.index.values:
    is_geography_pos_coeffs.append(name.startswith('ALAND_'))
is_geography_pos_coeffs = np.array(is_geography_pos_coeffs)

In [195]:
geography_pos_coeffs = lr_3_positive_coeffs.loc[is_geography_pos_coeffs]

In [196]:
sum_of_geography_pos_coeffs = np.sum(geography_pos_coeffs.values)
sum_of_geography_pos_coeffs

6.970139383106363e-11

In [197]:
geography_proportion_of_pos_total = sum_of_geography_pos_coeffs / sum_of_all_pos_coeffs
geography_proportion_of_pos_total

1.0374155988457918e-13

## Health Insurance Features with Positive Coefficients

In [198]:
health_ins_pos_features = ['PctPrivateCoverage_log', 'PctPublicCoverage_log', 'PctEmpPrivCoverage', 
                          'PctPrivateCoverageAlone', 'PctPublicCoverageAlone_sqrd']

In [199]:
is_health_ins_pos_coeffs = []
for name in lr_3_positive_coeffs.index.values:
    is_health_ins_pos_coeffs.append(name in health_ins_pos_features)
is_health_ins_pos_coeffs = np.array(is_health_ins_pos_coeffs)

In [200]:
health_ins_pos_coeffs = lr_3_positive_coeffs.loc[is_health_ins_pos_coeffs]

In [201]:
sum_of_health_ins_pos_coeffs = np.sum(health_ins_pos_coeffs.values)
sum_of_health_ins_pos_coeffs

203.326843229276

In [202]:
health_ins_proportion_of_pos_total = sum_of_health_ins_pos_coeffs / sum_of_all_pos_coeffs
health_ins_proportion_of_pos_total

0.3026258547158597

## Income Features with Positive Coefficients

In [203]:
income_pos_features = ['binnedInc_[22640, 34218.1]', 'binnedInc_(34218.1, 37413.8]', 
                      'binnedInc_(37413.8, 40362.7]', 'medIncome']

In [204]:
is_income_pos_coeffs = []
for name in lr_3_positive_coeffs.index.values:
    is_income_pos_coeffs.append(name in income_pos_features)
is_income_pos_coeffs = np.array(is_income_pos_coeffs)

In [205]:
income_pos_coeffs = lr_3_positive_coeffs.loc[is_income_pos_coeffs]

In [206]:
sum_of_income_pos_coeffs = np.sum(income_pos_coeffs.values)
sum_of_income_pos_coeffs

8.101907573502661

In [207]:
income_proportion_of_pos_total = sum_of_income_pos_coeffs / sum_of_all_pos_coeffs
income_proportion_of_pos_total

0.012058647374441263

## Percentage Married (Positive Coefficient)

In [208]:
is_married_pos_coeffs = []
for name in lr_3_positive_coeffs.index.values:
    is_married_pos_coeffs.append(name.startswith('PercentMarried'))
is_married_pos_coeffs = np.array(is_married_pos_coeffs)

In [209]:
married_pos_coeffs = lr_3_positive_coeffs.loc[is_married_pos_coeffs]

In [210]:
sum_of_married_pos_coeffs = np.sum(married_pos_coeffs.values)
sum_of_married_pos_coeffs

0.04523485330783404

In [211]:
married_proportion_of_pos_total = sum_of_married_pos_coeffs / sum_of_all_pos_coeffs
married_proportion_of_pos_total

6.732626114592018e-05

## Metropolitan Indicator, 2013 (Positive Coefficient)

In [212]:
is_metro_pos_coeffs = []
for name in lr_3_positive_coeffs.index.values:
    is_metro_pos_coeffs.append(name.startswith('METRO13'))
is_metro_pos_coeffs = np.array(is_metro_pos_coeffs)

In [213]:
metro_pos_coeffs = lr_3_positive_coeffs.loc[is_metro_pos_coeffs]

In [214]:
sum_of_metro_pos_coeffs = np.sum(metro_pos_coeffs.values)
sum_of_metro_pos_coeffs

0.7971816514964501

In [215]:
metro_proportion_of_pos_total = sum_of_metro_pos_coeffs / sum_of_all_pos_coeffs
metro_proportion_of_pos_total

0.001186502356581995

## Missing Value Features with Positive Coefficients

In [216]:
is_missing_pos_coeffs = []
for name in lr_3_positive_coeffs.index.values:
    is_missing_pos_coeffs.append(name.endswith('_isnull'))
is_missing_pos_coeffs = np.array(is_missing_pos_coeffs)

In [217]:
missing_pos_coeffs = lr_3_positive_coeffs.loc[is_missing_pos_coeffs]

In [218]:
sum_of_missing_pos_coeffs = np.sum(missing_pos_coeffs.values)
sum_of_missing_pos_coeffs

89.90259364533352

In [219]:
missing_proportion_of_pos_total = sum_of_missing_pos_coeffs / sum_of_all_pos_coeffs
missing_proportion_of_pos_total

0.13380844757626323

## Population Loss (Positive Coefficient)

In [220]:
is_poploss_pos_coeffs = []
for name in lr_3_positive_coeffs.index.values:
    is_poploss_pos_coeffs.append(name.startswith('POPLOSS'))
is_poploss_pos_coeffs = np.array(is_poploss_pos_coeffs)

In [221]:
poploss_pos_coeffs = lr_3_positive_coeffs.loc[is_poploss_pos_coeffs]

In [222]:
sum_of_poploss_pos_coeffs = np.sum(poploss_pos_coeffs.values)
sum_of_poploss_pos_coeffs

4.002963784932161

In [223]:
poploss_proportion_of_pos_total = sum_of_poploss_pos_coeffs / sum_of_all_pos_coeffs
poploss_proportion_of_pos_total

0.005957896741876453

## Poverty-related Features with Positive Coefficients

In [224]:
poverty_pos_features = ['povertyPercent_log', 'CHILDPOVRATE10', 'PERPOV10', 'PERCHLDPOV10', 'povertyPercent_sqrd']

In [225]:
is_poverty_pos_coeffs = []
for name in lr_3_positive_coeffs.index.values:
    is_poverty_pos_coeffs.append(name in poverty_pos_features)
is_poverty_pos_coeffs = np.array(is_poverty_pos_coeffs)

In [226]:
poverty_pos_coeffs = lr_3_positive_coeffs.loc[is_poverty_pos_coeffs]

In [227]:
sum_of_poverty_pos_coeffs = np.sum(poverty_pos_coeffs.values)
sum_of_poverty_pos_coeffs

15.516064858184661

In [228]:
poverty_proportion_of_pos_total = sum_of_poverty_pos_coeffs / sum_of_all_pos_coeffs
poverty_proportion_of_pos_total

0.02309366690083327

## Race-related Features with Positive Coefficients

In [229]:
race_pos_features = ['PctBlack', 'PctWhite_sqrd']

In [230]:
is_race_pos_coeffs = []
for name in lr_3_positive_coeffs.index.values:
    is_race_pos_coeffs.append(name in race_pos_features)
is_race_pos_coeffs = np.array(is_race_pos_coeffs)

In [231]:
race_pos_coeffs = lr_3_positive_coeffs.loc[is_race_pos_coeffs]

In [232]:
sum_of_race_pos_coeffs = np.sum(race_pos_coeffs.values)
sum_of_race_pos_coeffs

0.003219190856511429

In [233]:
race_proportion_of_pos_total = sum_of_race_pos_coeffs / sum_of_all_pos_coeffs
race_proportion_of_pos_total

4.791351544993543e-06

## Recreation and Fitness Features with Positive Coefficients

In [234]:
recreation_pos_features = ['RECFACPTH12', 'PCT_HSPA09', 'PCH_RECFACPTH_07_12', 'RECFAC12']

In [235]:
is_recreation_pos_coeffs = []
for name in lr_3_positive_coeffs.index.values:
    is_recreation_pos_coeffs.append(name in recreation_pos_features)
is_recreation_pos_coeffs = np.array(is_recreation_pos_coeffs)

In [236]:
recreation_pos_coeffs = lr_3_positive_coeffs.loc[is_recreation_pos_coeffs]

In [237]:
sum_of_recreation_pos_coeffs = np.sum(recreation_pos_coeffs.values)
sum_of_recreation_pos_coeffs

21.296806473677506

In [238]:
recreation_proportion_of_pos_total = sum_of_recreation_pos_coeffs / sum_of_all_pos_coeffs
recreation_proportion_of_pos_total

0.03169755728980368

## State Features with Positive Coefficients

In [239]:
is_state_pos_coeffs = []
for name in lr_3_positive_coeffs.index.values:
    is_state_pos_coeffs.append(name.startswith('State_'))
is_state_pos_coeffs = np.array(is_state_pos_coeffs)

In [240]:
state_pos_coeffs = lr_3_positive_coeffs.loc[is_state_pos_coeffs]

In [241]:
sum_of_state_pos_coeffs = np.sum(state_pos_coeffs.values)
sum_of_state_pos_coeffs

244.677753171326

In [242]:
state_proportion_of_pos_total = sum_of_state_pos_coeffs / sum_of_all_pos_coeffs
state_proportion_of_pos_total

0.3641713657056728

## Summary Table of Feature Family Positive Coefficients

In [243]:
positive_proportions_table = pd.read_excel('cancer_feature_family_proportions_normalized.xls', sheet_name='Positive')
positive_proportions_table

Unnamed: 0,Features,Coefficients,Feature Family,Feature Family Sums,Feature Family Proportion,Total Positive Sum,Total Positive Feature Family Proportion Sum
0,MedianAgeFemale,0.031034,Age,0.0554096,8.247006e-05,671.875317,1.0
1,MedianAge,0.0243756,Age,,,,
2,AvgHouseholdSize_log,38.38255,Average Household Size,38.38255,0.05712749,,
3,incidenceRate,0.0001966855,Cancer Diagnoses,0.0001966855,2.927411e-07,,
4,studyPerCap,1.650005e-08,Clinical Cancer Trials,1.650005e-08,2.45582e-11,,
5,PCT_OBESE_ADULTS13_log,0.8070711,Comorbid Health Conditions,1.089563,0.001621674,,
6,PCT_OBESE_ADULTS13,0.219683,Comorbid Health Conditions,,,,
7,PCT_DIABETES_ADULTS09,0.0251723,Comorbid Health Conditions,,,,
8,PCT_OBESE_ADULTS10,0.02220356,Comorbid Health Conditions,,,,
9,PCT_OBESE_CHILD11,0.01543272,Comorbid Health Conditions,,,,


The proportion of the total positive coefficient sum that each "feature family" contributes toward increasing cancer mortality, for the 97% of United States counties covered in 2015, is as follows (in descending order):

- United States state each county is in ("State"): 0.3642
- Types of health insurance for each county's populace ("Health Insurance"): 0.3026
- Missing values ("Missing Value Feature"): 0.1338
- Average household size of each county ("Average Household Size"): 0.0571
- L1 and L2 distances from county centroids to top 10 oncology hospitals ("Distances to Top 10 Oncology Hospitals"): 0.033
- Recreation and fitness facilities in each county ("Recreation and Fitness"): 0.0317
- Poverty ("Poverty-related"): 0.0231
- Food environment of each county ("Food Environment"): 0.0123
- Financial income of each county's populace ("Income"): 0.0121
- Education levels of each county's populace ("Education"): 0.0082
- L1 and L2 distances from county centroids to major cities ("Distance to Major Urban Centers"): 0.0069
- Counties' significant population loss as of the year 2000 ("Population Loss"): 0.006
- Erroneous data indicator referencing average household size ("Erroneous data indicator"): 0.0059
- Health conditions comorbid with cancer ("Comorbid Health Conditions"): 0.0016
- Indicator of whether a county is in a metropolitan area or not ("Metropolitan indicator"): 0.0012
- L2 distance to closest EPA Superfund Cleanup site ("Environmental Health"): 0.0002
- Employment status of each county's populace ("Employment"): 0.0002
- Age of each county's populace ("Age"): 0.0001
- Marital status of county's populace ("Marital feature"): 0.0001
- Race of each county's populace ("Race"): 0.000005
- Rate of cancer diagnoses in each county ("Cancer Diagnoses"): 0.0000003
- Per capita number of cancer-related clinical trials per county ("Clinical Cancer Trials"): 0.00000000002
- Square mileage of land mass for each county ("Geography"): 0.0000000000001
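
As a quick consistency check on this ranking, the proportions printed by the cells above can be collected into one Series, sorted, and summed. This is a sketch for illustration: the Series name `positive_family_proportions` is hypothetical, and the "Employment" family is left out because its cell does not appear in this part of the notebook, so the total is expected to fall slightly short of 1.

```python
import pandas as pd

# Proportions copied from the cell outputs above (Employment omitted; see lead-in).
positive_family_proportions = pd.Series({
    'Age': 8.247005123483464e-05,
    'Average Household Size': 0.05712749136515491,
    'Cancer Diagnoses': 2.927410243467591e-07,
    'Clinical Cancer Trials': 2.4558205113390036e-11,
    'Comorbid Health Conditions': 0.0016216739414240285,
    'Distance to Major Urban Centers': 0.006881481219290936,
    'Distances to Top 10 Oncology Hospitals': 0.032958270407640065,
    'Education': 0.008189673494534888,
    'Environmental Health': 0.0001762123364885412,
    'Erroneous data indicator': 0.0058656196990882255,
    'Food Environment': 0.012254436499662654,
    'Geography': 1.0374155988457918e-13,
    'Health Insurance': 0.3026258547158597,
    'Income': 0.012058647374441263,
    'Marital feature': 6.732626114592018e-05,
    'Metropolitan indicator': 0.001186502356581995,
    'Missing Value Feature': 0.13380844757626323,
    'Population Loss': 0.005957896741876453,
    'Poverty-related': 0.02309366690083327,
    'Race': 4.791351544993543e-06,
    'Recreation and Fitness': 0.03169755728980368,
    'State': 0.3641713657056728,
})

# Rank the families from largest to smallest share.
ranked = positive_family_proportions.sort_values(ascending=False)

# The shares should add up to just under 1 because one family is omitted.
total_share = ranked.sum()

print(ranked.round(4))
print(total_share)
```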

# Negative Coefficients

For features with negative coefficients, the sum of all negative coefficients in the feature set is first taken, then the negative coefficients for each "feature family" are summed, and then the sum of each negative "feature family" is divided by the total negative coefficient sum to uncover the proportion of the total predictive value that each "feature family" has on decreasing cancer mortality.
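
Two details of this step are easy to miss: a negative family sum divided by a negative total yields a positive proportion, and the negative coefficients can be selected by sign rather than by position. The sketch below illustrates both on a hypothetical five-row Series (`example`), then repeats the division with the age-family and total sums that the cells below report. It is an illustration under those assumptions, not the code that produced the notebook's outputs.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature coefficient Series, sorted in descending order.
example = pd.Series({
    'State_Nevada': 42.79767,
    'MedianAge': 0.0243756,
    'popEst2015': -2.896883e-14,
    'MedianAge_log': -22.0,
    'State_Hawaii': -43.0,
})

# Select negative coefficients by sign instead of by a fixed slice.
negatives = example[example < 0]
total_neg = negatives.sum()

# Family sum and its share of the negative total (negative / negative > 0).
age_neg = negatives[negatives.index.str.startswith('MedianAge')].sum()
age_share = age_neg / total_neg

# Same division with the sums reported in the cells below.
reported_share = -22.17485846504236 / -563.674182468134

print(age_share, reported_share)
```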

In [258]:
sum_of_all_neg_coeffs = np.sum(normalized_lr_3_coeffs_sorted[161:321].values)
sum_of_all_neg_coeffs

-563.674182468134

In [259]:
lr_3_negative_coeffs = normalized_lr_3_coeffs_sorted[161:321]
lr_3_negative_coeffs

popEst2015                     -2.896883e-14
PCH_FMRKT_09_13_sqrd           -7.726261e-13
ORCHARD_ACRES07                -2.753200e-11
FRESHVEG_ACRES07               -6.250144e-10
BERRY_ACRES07                  -1.030044e-08
GHVEG_SQFTPTH07                -2.499031e-08
avgAnnCount                    -8.682330e-08
PCT_LACCESS_HHNV10_sqrd        -2.358438e-07
AWATER_SQMI                    -2.797501e-07
VEG_ACRESPTH07                 -1.256540e-06
PctBlack_sqrd                  -1.501211e-06
ORCHARD_ACRESPTH07             -1.626748e-06
PercentMarried_sqrd            -2.577258e-06
INTPTLONG_sqrd                 -3.368235e-06
MedianAgeFemale_sqrd           -5.176086e-06
PctEmployed16_Over_sqrd        -5.405276e-06
mayo_l1_sqrd                   -5.810551e-06
los_ang_l1_sqrd                -5.990811e-06
PctHS25_Over_sqrd              -6.372648e-06
PCT_FMRKT_SFMNP13              -2.055403e-05
RECFAC07                       -2.439636e-05
dfb_l1_sqrd                    -2.728609e-05
VEG_FARMS0

## Age Features with Negative Coefficients

In [260]:
is_age_neg_coeffs = []
for name in lr_3_negative_coeffs.index.values:
    is_age_neg_coeffs.append(name.startswith('MedianAge'))
is_age_neg_coeffs = np.array(is_age_neg_coeffs)

In [261]:
age_neg_coeffs = lr_3_negative_coeffs.loc[is_age_neg_coeffs]

In [262]:
sum_of_age_neg_coeffs = np.sum(age_neg_coeffs.values)
sum_of_age_neg_coeffs

-22.17485846504236

In [263]:
age_proportion_of_neg_total = sum_of_age_neg_coeffs / sum_of_all_neg_coeffs
age_proportion_of_neg_total

0.039339851202597816

## Average Household Size (Negative Coefficient)

In [264]:
is_avg_household_neg_coeffs = []
for name in lr_3_negative_coeffs.index.values:
    is_avg_household_neg_coeffs.append(name.startswith('AvgHousehold'))
is_avg_household_neg_coeffs = np.array(is_avg_household_neg_coeffs)

In [265]:
avg_household_neg_coeffs = lr_3_negative_coeffs.loc[is_avg_household_neg_coeffs]

In [266]:
sum_of_avg_household_neg_coeffs = np.sum(avg_household_neg_coeffs.values)
sum_of_avg_household_neg_coeffs

-3.077598923908508

In [267]:
avg_household_proportion_of_neg_total = sum_of_avg_household_neg_coeffs / sum_of_all_neg_coeffs
avg_household_proportion_of_neg_total

0.005459889808031953

## Birth Rate (Negative Coefficient)

In [268]:
is_birthrate_neg_coeffs = []
for name in lr_3_negative_coeffs.index.values:
    is_birthrate_neg_coeffs.append(name.startswith('BirthRate'))
is_birthrate_neg_coeffs = np.array(is_birthrate_neg_coeffs)

In [269]:
birthrate_neg_coeffs = lr_3_negative_coeffs.loc[is_birthrate_neg_coeffs]

In [270]:
sum_of_birthrate_neg_coeffs = np.sum(birthrate_neg_coeffs.values)
sum_of_birthrate_neg_coeffs

-0.02902136417679431

In [271]:
birthrate_proportion_of_neg_total = sum_of_birthrate_neg_coeffs / sum_of_all_neg_coeffs
birthrate_proportion_of_neg_total

5.148606247978187e-05

## Cancer Diagnoses (Negative Coefficient)

In [272]:
is_cancer_diag_neg_coeffs = []
for name in lr_3_negative_coeffs.index.values:
    is_cancer_diag_neg_coeffs.append(name.startswith('avgAnnCount'))
is_cancer_diag_neg_coeffs = np.array(is_cancer_diag_neg_coeffs)

In [273]:
cancer_diag_neg_coeffs = lr_3_negative_coeffs.loc[is_cancer_diag_neg_coeffs]

In [274]:
sum_of_cancer_diag_neg_coeffs = np.sum(cancer_diag_neg_coeffs.values)
sum_of_cancer_diag_neg_coeffs

-8.68233008540539e-08

In [275]:
cancer_diag_proportion_of_neg_total = sum_of_cancer_diag_neg_coeffs / sum_of_all_neg_coeffs
cancer_diag_proportion_of_neg_total

1.5403100506374933e-10

## Comorbid Health Conditions Features with Negative Coefficients

In [276]:
comorbidities_neg = ['PCT_OBESE_ADULTS13_sqrd', 'PCT_OBESE_ADULTS09', 'PCT_OBESE_CHILD08', 'PCT_DIABETES_ADULTS10', 
                    'PCH_OBESE_CHILD_08_11']

In [277]:
is_comorbid_neg_coeffs = []
for name in lr_3_negative_coeffs.index.values:
    is_comorbid_neg_coeffs.append(name in comorbidities_neg)
is_comorbid_neg_coeffs = np.array(is_comorbid_neg_coeffs)

In [278]:
comorbid_neg_coeffs = lr_3_negative_coeffs.loc[is_comorbid_neg_coeffs]

In [279]:
sum_of_comorbid_neg_coeffs = np.sum(comorbid_neg_coeffs.values)
sum_of_comorbid_neg_coeffs

-0.06952445729055479

In [280]:
comorbid_proportion_of_neg_total = sum_of_comorbid_neg_coeffs / sum_of_all_neg_coeffs
comorbid_proportion_of_neg_total

0.00012334156761647532

## Distance to Major Urban Centers Features with Negative Coefficients

In [281]:
dists_to_urban_centers_neg = ['los_ang_l1_sqrd', 'los_ang_l1', 'dallas_l2', 'atlanta_l1', 'nyc_l1', 'denver_l1',  
                         'san_fran_l1', 'seattle_l1', 'city_min_distsl1', 'chi_l1', 'nyc_l2', 'atlanta_l2_log', 
                         'chi_l2']

In [282]:
is_urban_neg_coeffs = []
for name in lr_3_negative_coeffs.index.values:
    is_urban_neg_coeffs.append(name in dists_to_urban_centers_neg)
is_urban_neg_coeffs = np.array(is_urban_neg_coeffs)

In [283]:
urban_neg_coeffs = lr_3_negative_coeffs.loc[is_urban_neg_coeffs]

In [284]:
sum_of_urban_neg_coeffs = np.sum(urban_neg_coeffs.values)
sum_of_urban_neg_coeffs

-8.21588800717538

In [285]:
urban_proportion_of_neg_total = sum_of_urban_neg_coeffs / sum_of_all_neg_coeffs
urban_proportion_of_neg_total

0.014575597504219287

## Distances to Top 10 Oncology Hospitals Features with Negative Coefficients

In [286]:
dists_to_oncology_hosps_neg = ['mayo_l1_sqrd', 'dfb_l1_sqrd', 'mayo_l2', 'hlmcc_l2', 'utmda_l1', 'onc_min_distsl2', 
                               'cleveland_l1', 'cleveland_l2', 'hopkins_l2_log', 'dfb_l1', 'mayo_l1_log', 'dfb_l2', 
                               'dfb_l1_log', 'upmcps_l2_log', 'mgs_l1_log']

In [287]:
is_onc_neg_coeffs = []
for name in lr_3_negative_coeffs.index.values:
    is_onc_neg_coeffs.append(name in dists_to_oncology_hosps_neg)
is_onc_neg_coeffs = np.array(is_onc_neg_coeffs)

In [288]:
onc_neg_coeffs = lr_3_negative_coeffs.loc[is_onc_neg_coeffs]

In [289]:
sum_of_onc_neg_coeffs = np.sum(onc_neg_coeffs.values)
sum_of_onc_neg_coeffs

-12.129363619464069

In [290]:
onc_proportion_of_neg_total = sum_of_onc_neg_coeffs / sum_of_all_neg_coeffs
onc_proportion_of_neg_total

0.02151839484000099

## Education Features with Negative Coefficients

In [291]:
education_neg = ['PctHS25_Over_sqrd', 'PctNoHS18_24', 'PctBachDeg18_24', 'PctSomeCol18_24', 'PctBachDeg25_Over']

In [292]:
is_edu_neg_coeffs = []
for name in lr_3_negative_coeffs.index.values:
    is_edu_neg_coeffs.append(name in education_neg)
is_edu_neg_coeffs = np.array(is_edu_neg_coeffs)

In [293]:
edu_neg_coeffs = lr_3_negative_coeffs.loc[is_edu_neg_coeffs]

In [294]:
sum_of_edu_neg_coeffs = np.sum(edu_neg_coeffs.values)
sum_of_edu_neg_coeffs

-0.019202257574940845

In [295]:
edu_proportion_of_neg_total = sum_of_edu_neg_coeffs / sum_of_all_neg_coeffs
edu_proportion_of_neg_total

3.4066235730117726e-05

## Employment Features with Negative Coefficients

In [296]:
employment_neg = ['PctEmployed16_Over_sqrd', 'PctEmployed16_Over_log']

In [297]:
is_employment_neg_coeffs = []
for name in lr_3_negative_coeffs.index.values:
    is_employment_neg_coeffs.append(name in employment_neg)
is_employment_neg_coeffs = np.array(is_employment_neg_coeffs)

In [298]:
employment_neg_coeffs = lr_3_negative_coeffs.loc[is_employment_neg_coeffs]

In [299]:
sum_of_employment_neg_coeffs = np.sum(employment_neg_coeffs.values)
sum_of_employment_neg_coeffs

-121.83076930316506

In [300]:
employment_proportion_of_neg_total = sum_of_employment_neg_coeffs / sum_of_all_neg_coeffs
employment_proportion_of_neg_total

0.21613686255721404

## Environmental Health (Negative Coefficient)

In [301]:
is_env_health_neg_coeffs = []
for name in lr_3_negative_coeffs.index.values:
    is_env_health_neg_coeffs.append(name.startswith('sc_min_'))
is_env_health_neg_coeffs = np.array(is_env_health_neg_coeffs)

In [302]:
env_health_neg_coeffs = lr_3_negative_coeffs.loc[is_env_health_neg_coeffs]

In [303]:
sum_of_env_health_neg_coeffs = np.sum(env_health_neg_coeffs.values)
sum_of_env_health_neg_coeffs

-0.2096557976086054

In [304]:
env_health_proportion_of_neg_total = sum_of_env_health_neg_coeffs / sum_of_all_neg_coeffs
env_health_proportion_of_neg_total

0.0003719450067601026

## Erroneous Data Indicator (Negative Coefficient)

In [305]:
is_err_neg_coeffs = []
for name in lr_3_negative_coeffs.index.values:
    is_err_neg_coeffs.append(name.startswith('household_lt'))
is_err_neg_coeffs = np.array(is_err_neg_coeffs)

In [306]:
err_neg_coeffs = lr_3_negative_coeffs.loc[is_err_neg_coeffs]

In [307]:
sum_of_err_neg_coeffs = np.sum(err_neg_coeffs.values)
sum_of_err_neg_coeffs

-0.5152421205203302

In [308]:
err_proportion_of_neg_total = sum_of_err_neg_coeffs / sum_of_all_neg_coeffs
err_proportion_of_neg_total

0.0009140779133510489

## Food Environment Features with Negative Coefficients

In [309]:
food_env_neg_features = ['PCH_FMRKT_09_13_sqrd', 'ORCHARD_ACRES07', 'FRESHVEG_ACRES07', 'BERRY_ACRES07', 
                        'GHVEG_SQFTPTH07', 'PCT_LACCESS_HHNV10_sqrd', 'VEG_ACRESPTH07', 'ORCHARD_ACRESPTH07', 
                        'PCT_FMRKT_SFMNP13', 'VEG_FARMS07', 'PCH_FMRKT_09_13', 'PCT_FRMKT_FRVEG13', 
                        'PCT_FMRKT_SNAP13', 'CSA07', 'PCT_LACCESS_LOWI10', 'PC_DIRSALES07', 
                        'PCT_LACCESS_SENIORS10', 'FMRKT13', 'SLHOUSE07', 'PCT_LACCESS_CHILD10', 'FOODINSEC_07_09', 
                        'FOODHUB12', 'FOODINSEC_CHILD_03_11', 'VLFOODSEC_00_02', 'CH_VLFOODSEC_09_12', 
                        'FARM_TO_SCHOOL', 'FMRKTPTH13']

In [310]:
is_food_env_neg_coeffs = []
for name in lr_3_negative_coeffs.index.values:
    is_food_env_neg_coeffs.append(name in food_env_neg_features)
is_food_env_neg_coeffs = np.array(is_food_env_neg_coeffs)

In [311]:
food_env_neg_coeffs = lr_3_negative_coeffs.loc[is_food_env_neg_coeffs]

In [312]:
sum_of_food_env_neg_coeffs = np.sum(food_env_neg_coeffs.values)
sum_of_food_env_neg_coeffs

-10.247582106988403

In [313]:
food_env_proportion_of_neg_total = sum_of_food_env_neg_coeffs / sum_of_all_neg_coeffs
food_env_proportion_of_neg_total

0.018179974222196575

## Geography (Negative Coefficient)

In [314]:
is_geography_neg_coeffs = []
for name in lr_3_negative_coeffs.index.values:
    is_geography_neg_coeffs.append(name.startswith('AWATER_'))
is_geography_neg_coeffs = np.array(is_geography_neg_coeffs)

In [315]:
geography_neg_coeffs = lr_3_negative_coeffs.loc[is_geography_neg_coeffs]

In [316]:
sum_of_geography_neg_coeffs = np.sum(geography_neg_coeffs.values)
sum_of_geography_neg_coeffs

-2.797500741086893e-07

In [317]:
geography_proportion_of_neg_total = sum_of_geography_neg_coeffs / sum_of_all_neg_coeffs
geography_proportion_of_neg_total

4.962974761124604e-10

## Health Insurance Features with Negative Coefficients

In [318]:
health_ins_neg_features = ['PctPublicCoverageAlone', 'PctPublicCoverage', 'PctPrivateCoverage', 
                           'PctPublicCoverageAlone_log', 'PctEmpPrivCoverage_log']

In [319]:
is_health_ins_neg_coeffs = []
for name in lr_3_negative_coeffs.index.values:
    is_health_ins_neg_coeffs.append(name in health_ins_neg_features)
is_health_ins_neg_coeffs = np.array(is_health_ins_neg_coeffs)

In [320]:
health_ins_neg_coeffs = lr_3_negative_coeffs.loc[is_health_ins_neg_coeffs]

In [321]:
sum_of_health_ins_neg_coeffs = np.sum(health_ins_neg_coeffs.values)
sum_of_health_ins_neg_coeffs

-33.33597933277334

In [322]:
health_ins_proportion_of_neg_total = sum_of_health_ins_neg_coeffs / sum_of_all_neg_coeffs
health_ins_proportion_of_neg_total

0.0591405112556453

## Income Features with Negative Coefficients

In [323]:
is_income_neg_coeffs = []
for name in lr_3_negative_coeffs.index.values:
    is_income_neg_coeffs.append(name.startswith('binnedInc_'))
is_income_neg_coeffs = np.array(is_income_neg_coeffs)

In [324]:
income_neg_coeffs = lr_3_negative_coeffs.loc[is_income_neg_coeffs]

In [325]:
sum_of_income_neg_coeffs = np.sum(income_neg_coeffs.values)
sum_of_income_neg_coeffs

-8.10190724435568

In [326]:
income_proportion_of_neg_total = sum_of_income_neg_coeffs / sum_of_all_neg_coeffs
income_proportion_of_neg_total

0.014373387138080788

## Latitude/Longitude Features with Negative Coefficients

In [327]:
is_latlong_neg_coeffs = []
for name in lr_3_negative_coeffs.index.values:
    is_latlong_neg_coeffs.append(name.startswith('INTPT'))
is_latlong_neg_coeffs = np.array(is_latlong_neg_coeffs)

In [328]:
latlong_neg_coeffs = lr_3_negative_coeffs.loc[is_latlong_neg_coeffs]

In [329]:
sum_of_latlong_neg_coeffs = np.sum(latlong_neg_coeffs.values)
sum_of_latlong_neg_coeffs

-0.35256861765303005

In [330]:
latlong_proportion_of_neg_total = sum_of_latlong_neg_coeffs / sum_of_all_neg_coeffs
latlong_proportion_of_neg_total

0.0006254829981910014

## Marital Features with Negative Coefficients

In [331]:
marital_features = ['PercentMarried_sqrd', 'PctMarriedHouseholds', 'PercentMarried_log']

In [332]:
is_marital_neg_coeffs = []
for name in lr_3_negative_coeffs.index.values:
    is_marital_neg_coeffs.append(name in marital_features)
is_marital_neg_coeffs = np.array(is_marital_neg_coeffs)

In [333]:
marital_neg_coeffs = lr_3_negative_coeffs.loc[is_marital_neg_coeffs]

In [334]:
sum_of_marital_neg_coeffs = np.sum(marital_neg_coeffs.values)
sum_of_marital_neg_coeffs

-24.83570214085488

In [335]:
marital_proportion_of_neg_total = sum_of_marital_neg_coeffs / sum_of_all_neg_coeffs
marital_proportion_of_neg_total

0.044060386147380295

## Missing Value Features with Negative Coefficients

In [336]:
is_missing_neg_coeffs = []
for name in lr_3_negative_coeffs.index.values:
    is_missing_neg_coeffs.append(name.endswith('_isnull'))
is_missing_neg_coeffs = np.array(is_missing_neg_coeffs)

In [337]:
missing_neg_coeffs = lr_3_negative_coeffs.loc[is_missing_neg_coeffs]

In [338]:
sum_of_missing_neg_coeffs = np.sum(missing_neg_coeffs.values)
sum_of_missing_neg_coeffs

-38.73179462541644

In [339]:
missing_proportion_of_neg_total = sum_of_missing_neg_coeffs / sum_of_all_neg_coeffs
missing_proportion_of_neg_total

0.06871308963597256

## Population Feature (Negative Coefficient)

In [342]:
pop_neg_coeffs = lr_3_negative_coeffs.loc['popEst2015']
pop_neg_coeffs

-2.8968826466838384e-14

In [345]:
pop_proportion_of_neg_total = pop_neg_coeffs / sum_of_all_neg_coeffs
pop_proportion_of_neg_total

5.139285666764784e-17

## Poverty-related Features with Negative Coefficients

In [346]:
poverty_neg_features = ['povertyPercent', 'CHILDPOVRATE10_log']

In [347]:
is_poverty_neg_coeffs = []
for name in lr_3_negative_coeffs.index.values:
    is_poverty_neg_coeffs.append(name in poverty_neg_features)
is_poverty_neg_coeffs = np.array(is_poverty_neg_coeffs)

In [348]:
poverty_neg_coeffs = lr_3_negative_coeffs.loc[is_poverty_neg_coeffs]

In [349]:
sum_of_poverty_neg_coeffs = np.sum(poverty_neg_coeffs.values)
sum_of_poverty_neg_coeffs

-10.091572216062811

In [350]:
poverty_proportion_of_neg_total = sum_of_poverty_neg_coeffs / sum_of_all_neg_coeffs
poverty_proportion_of_neg_total

0.017903201051137935

## Quality of Life (Negative Coefficient)

In [351]:
is_quality_life_neg_coeffs = []
for name in lr_3_negative_coeffs.index.values:
    is_quality_life_neg_coeffs.append(name.startswith('NATAMEN'))
is_quality_life_neg_coeffs = np.array(is_quality_life_neg_coeffs)

In [352]:
quality_life_neg_coeffs = lr_3_negative_coeffs.loc[is_quality_life_neg_coeffs]

In [353]:
sum_of_quality_life_neg_coeffs = np.sum(quality_life_neg_coeffs.values)
sum_of_quality_life_neg_coeffs

-0.13732615840561696

In [354]:
quality_life_proportion_of_neg_total = sum_of_quality_life_neg_coeffs / sum_of_all_neg_coeffs
quality_life_proportion_of_neg_total

0.0002436268374121967

## Race-related Features with Negative Coefficients

In [355]:
race_neg_features = ['PctBlack_sqrd', 'PctOtherRace', 'PctAsian', 'PctWhite']

In [356]:
is_race_neg_coeffs = []
for name in lr_3_negative_coeffs.index.values:
    is_race_neg_coeffs.append(name in race_neg_features)
is_race_neg_coeffs = np.array(is_race_neg_coeffs)

In [357]:
race_neg_coeffs = lr_3_negative_coeffs.loc[is_race_neg_coeffs]

In [358]:
sum_of_race_neg_coeffs = np.sum(race_neg_coeffs.values)
sum_of_race_neg_coeffs

-0.030305162668392376

In [359]:
race_proportion_of_neg_total = sum_of_race_neg_coeffs / sum_of_all_neg_coeffs
race_proportion_of_neg_total

5.3763616661839234e-05

## Recreation and Fitness Features with Negative Coefficients

In [360]:
recreation_neg_features = ['RECFAC07', 'PCH_RECFAC_07_12', 'RECFACPTH07']

In [361]:
is_recreation_neg_coeffs = []
for name in lr_3_negative_coeffs.index.values:
    is_recreation_neg_coeffs.append(name in recreation_neg_features)
is_recreation_neg_coeffs = np.array(is_recreation_neg_coeffs)

In [362]:
recreation_neg_coeffs = lr_3_negative_coeffs.loc[is_recreation_neg_coeffs]

In [363]:
sum_of_recreation_neg_coeffs = np.sum(recreation_neg_coeffs.values)
sum_of_recreation_neg_coeffs

-24.860566830680877

In [364]:
recreation_proportion_of_neg_total = sum_of_recreation_neg_coeffs / sum_of_all_neg_coeffs
recreation_proportion_of_neg_total

0.04410449795984103

## State Features with Negative Coefficients

In [365]:
is_state_neg_coeffs = []
for name in lr_3_negative_coeffs.index.values:
    is_state_neg_coeffs.append(name.startswith('State_'))
is_state_neg_coeffs = np.array(is_state_neg_coeffs)

In [366]:
state_neg_coeffs = lr_3_negative_coeffs.loc[is_state_neg_coeffs]

In [367]:
sum_of_state_neg_coeffs = np.sum(state_neg_coeffs.values)
sum_of_state_neg_coeffs

-244.67775334977455

In [368]:
state_proportion_of_neg_total = sum_of_state_neg_coeffs / sum_of_all_neg_coeffs
state_proportion_of_neg_total

0.43407656578915044

## Summary Table of Feature Family Negative Coefficients

In [369]:
negative_proportions_table = pd.read_excel('cancer_feature_family_proportions_normalized.xls', sheet_name='Negative')
negative_proportions_table

Unnamed: 0,Features,Negative coefficients,Feature Family,Feature Family Sums,Feature Family Proportion,Total Negative Sum,Total Negative Feature Family Proportion Sum
0,MedianAgeFemale_sqrd,5.176086e-06,Age,22.17486,0.03933985,563.674203,1.0
1,MedianAgeMale,0.01395611,Age,,,,
2,MedianAge_log,22.1609,Age,,,,
3,AvgHouseholdSize,3.077599,Average Household Size,3.077599,0.00545989,,
4,BirthRate,0.02902136,Birth Rate,0.02902136,5.148605e-05,,
5,avgAnnCount,8.68233e-08,Cancer Diagnoses,8.68233e-08,1.54031e-10,,
6,PCT_OBESE_ADULTS13_sqrd,7.044184e-05,Comorbid Health Conditions,0.06952446,0.0001233416,,
7,PCT_OBESE_ADULTS09,0.003336906,Comorbid Health Conditions,,,,
8,PCT_OBESE_CHILD08,0.01986464,Comorbid Health Conditions,,,,
9,PCT_DIABETES_ADULTS10,0.0218032,Comorbid Health Conditions,,,,
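
The family-level columns in the table above can be cross-checked directly from the per-feature rows. The sketch below is a minimal illustration, not the way the spreadsheet was necessarily built: it re-enters the first six displayed rows by hand (rounded as shown), takes the total of 563.674203 from the "Total Negative Sum" column, and uses the illustrative names `rows`, `TOTAL_NEGATIVE_SUM`, and `summary`, which do not appear elsewhere in the notebook.

```python
import pandas as pd

# First six rows of the summary table, copied as displayed (rounded values).
rows = pd.DataFrame({
    "Features": ["MedianAgeFemale_sqrd", "MedianAgeMale", "MedianAge_log",
                 "AvgHouseholdSize", "BirthRate", "avgAnnCount"],
    "Negative coefficients": [5.176086e-06, 0.01395611, 22.1609,
                              3.077599, 0.02902136, 8.68233e-08],
    "Feature Family": ["Age", "Age", "Age",
                       "Average Household Size", "Birth Rate", "Cancer Diagnoses"],
})

# Taken from the "Total Negative Sum" column; it covers all families,
# including those not re-entered here.
TOTAL_NEGATIVE_SUM = 563.674203

# Sum the coefficient magnitudes within each family, keeping table order.
family_sums = rows.groupby("Feature Family", sort=False)["Negative coefficients"].sum()

# Each family's share of the overall negative total.
summary = pd.DataFrame({
    "Feature Family Sums": family_sums,
    "Feature Family Proportion": family_sums / TOTAL_NEGATIVE_SUM,
})
print(summary)
```

Because the inputs are the rounded values shown in the table, the results should agree with the "Feature Family Sums" and "Feature Family Proportion" columns only to about the displayed precision.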


The proportion of the summed negative coefficients (those associated with lower predicted cancer mortality) attributable to each "feature family", across the 97% of United States counties covered in 2015, is as follows:

- United States state each county is in ("State"): 0.4341
- Employment status of each county's populace ("Employment"): 0.2161
- Missing values ("Missing Value Feature"): 0.0687
- Types of health insurance for each county's populace ("Health Insurance"): 0.0591
- Recreation and fitness facilities in each county ("Recreation and Fitness"): 0.0441
- Marital status of each county's populace ("Marital feature"): 0.0441
- Age of each county's populace ("Age"): 0.0393
- L1 and L2 distances from county centroids to top 10 oncology hospitals ("Distances to Top 10 Oncology Hospitals"): 0.0215
- Food environment of each county ("Food Environment"): 0.0182
- Poverty ("Poverty-related"): 0.0179
- L1 and L2 distances from county centroids to major cities ("Distance to Major Urban Center"): 0.0146
- Financial income of each county's populace ("Income"): 0.0144
- Average household size of each county ("Average Household Size"): 0.0055
- Erroneous data indicator referencing household size, per its 'household_lt' feature-name prefix ("Erroneous data indicator"): 0.0009
- Latitude and Longitude ("Latitude Longitude"): 0.0006
- L1 distance to closest EPA Superfund Cleanup site ("Environmental Health"): 0.0004
- Quality of life index ("Quality of Life"): 0.0002
- Health conditions comorbid with cancer ("Comorbid Health Conditions"): 0.0001
- Race of each county's populace ("Race"): 0.00005
- Birth Rate ("Birth Rate"): 0.00005
- Education levels of each county's populace ("Education"): 0.00003
- Square mileage of water for each county ("Geography"): 0.0000000005
- Rate of cancer diagnoses in each county ("Cancer Diagnoses"): 0.0000000002
- Population of county ("Population"): 0.00000000000000005
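
The family proportions listed above were each computed with the same pattern: select the negative coefficients belonging to a family, sum them, and divide by the sum of all negative coefficients. As a hedged sketch of how that pattern could be expressed in one pass, the example below uses a small made-up Series in place of `lr_3_negative_coeffs`; the values, the `family_rules` dictionary, and the `family_proportions` helper are all hypothetical and only illustrate the arithmetic.

```python
import pandas as pd

# Toy stand-in for lr_3_negative_coeffs (made-up values, not the notebook's).
neg_coeffs = pd.Series({
    "State_A": -6.0,
    "State_B": -2.0,
    "MedianAge_log": -1.5,
    "PctWhite": -0.5,
})

# Hypothetical family rules: each maps a feature name to True/False,
# mirroring the startswith(...) and "name in list" tests used above.
family_rules = {
    "State": lambda name: name.startswith("State_"),
    "Age": lambda name: name.startswith("MedianAge"),
    "Race": lambda name: name in {"PctWhite", "PctAsian"},
}

def family_proportions(coeffs, rules):
    """Return each family's share of the total coefficient sum, largest first."""
    total = coeffs.sum()
    shares = {}
    for family, matches in rules.items():
        mask = [matches(name) for name in coeffs.index]  # boolean mask per feature
        shares[family] = coeffs.loc[mask].sum() / total
    return pd.Series(shares).sort_values(ascending=False)

proportions = family_proportions(neg_coeffs, family_rules)
print(proportions)
```

Since both the family sum and the total are negative, each ratio is positive, and the shares add up to 1 whenever every feature falls into exactly one family.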