# Causal Modeling of Covid-19 Severity by County
### Authors: Chloe Larkin, Ryan Douglas, Srinidhi Gopalakrishnan



This project identifies candidate causes of COVID-19 outbreak severity in US counties and validates each variable's causal effect by conditioning and intervening on it.

This IPython notebook cleans and processes the dataset containing all the variables and defines the outcome variable, COVID-19 severity. Continuous variables are binned into categories to make it easier to build the causal DAG model and run interventions on it.

### Import required packages
We use pandas DataFrames for all processing steps.

In [4]:
import pandas as pd

### Function Definitions

get_voter_percentage: Computes the fraction of votes cast for the Democratic and Republican candidates in a county and returns a pandas Series with the two values.

bin_voter_percentage: Takes the absolute difference between the Democratic and Republican vote shares in a county and categorizes the county as evenly split (Even), leaning towards one party, or heavily leaning towards one party. Returns one of the categories 'Heavily Republican', 'Republican', 'Leaning Republican', 'Even', 'Leaning Democrat', 'Democrat', 'Heavily Democrat'.

In [5]:
def get_voter_percentage(county):
    republican_votes = county[county["party"] == 'republican']
    percent_republican = republican_votes["candidatevotes"] / republican_votes["totalvotes"]
    
    democrat_votes = county[county["party"] == 'democrat']
    percent_democrat = democrat_votes["candidatevotes"] / democrat_votes["totalvotes"]
    
    return pd.Series({"Republican": percent_republican.values[0], "Democrat": percent_democrat.values[0]})

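def bin_voter_percentage(row, bins={"Even": 0.05, "Leaning ": 0.15, "": 0.25, "Heavily ": 1}):
    # bins maps a label prefix to the upper bound of voter_diff; relies on ascending insertion order
    for label, upper_bound in bins.items():
        if row["voter_diff"] <= upper_bound:
            if label == "Even":
                return label
            # row is a Series, so idxmax() takes no axis argument (axis=1 is not valid for a Series)
            return label + row[["Republican", "Democrat"]].idxmax()
    return "No Label Found"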
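
As a quick sanity check of the thresholds described above, the sketch below applies a simplified re-implementation of the binning logic to three made-up counties. The vote shares are invented for illustration, and `label_leaning` and `BINS` are hypothetical names that do not appear elsewhere in this notebook.

```python
import pandas as pd

# (label prefix, upper bound on the absolute vote-share gap), checked in ascending order
BINS = [("Even", 0.05), ("Leaning ", 0.15), ("", 0.25), ("Heavily ", 1.0)]

def label_leaning(row):
    """Hypothetical helper mirroring bin_voter_percentage for a single row."""
    for prefix, upper in BINS:
        if row["voter_diff"] <= upper:
            if prefix == "Even":
                return prefix
            # Name of the party with the larger vote share
            winner = row[["Republican", "Democrat"]].astype(float).idxmax()
            return prefix + winner
    return "No Label Found"

# Invented vote shares for three toy counties
toy = pd.DataFrame(
    {"Republican": [0.51, 0.40, 0.80], "Democrat": [0.48, 0.58, 0.15]},
    index=["A", "B", "C"],
)
toy["voter_diff"] = (toy["Republican"] - toy["Democrat"]).abs()
toy["CitizenPoliticalLeaning"] = toy.apply(label_leaning, axis=1)
print(toy["CitizenPoliticalLeaning"].tolist())
```

County A has a gap of about 0.03 and is labelled Even, county B has a gap of about 0.18 in favour of the Democratic candidate, and county C has a gap of 0.65 in favour of the Republican candidate.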

## Read and Clean data

The dataset has one row per county and holds the raw features (all except the 2016 voter data, which is read separately) that must be transformed to produce the input to the DAG.

In [6]:
# Read Data
df = pd.read_csv("Distilled_Dataset.csv")
print('Dataset Dimensions: ', df.shape)

# Rename columns
df.rename(
    columns={
        "%adults less than HS diploma, 2014-18": "no_hs_diploma",
        "%adults HS diploma only, 2014-18": "hs_diploma",
        "%adults some college 1-3 yrs, 2014-18": "some_college",
        "%adults >4 yrs college, 2014-18": "college_graduate",
        "%NeverWearMasks": "NeverWearMasks",
        "%RarelyWearMasks": "RarelyWearMasks",
        "%SometimesWearMasks": "SometimesWearMasks",
        "%FrequentlyWearMasks": "FrequentlyWearMasks",
        "%AlwaysWearMasks": "AlwaysWearMasks"
        
    },
    inplace=True
)

# Strip stray carriage returns and the 'County' suffix from string columns
df["Gov_Leaning"] = df["Gov_Leaning"].str.replace('\r', '')
df["Emergency_Declaration"] = df["Emergency_Declaration"].str.replace('\r', '')
df["UrbanInfluence"] = df["UrbanInfluence"].str.replace('\r', '')
df["CountyName"] = df["CountyName"].str.replace('County', '').str.strip()
df.head()

Dataset Dimensions:  (2980, 23)


Unnamed: 0,CountyName,StateName,2019Population,ICUBedsPerThousandHabitants,ConfirmedCases,UrbanInfluence,MedianHouseholdIncome,Gov_Leaning,Reopening_Status,Stay_At_Home_Order,...,Emergency_Declaration,no_hs_diploma,hs_diploma,some_college,college_graduate,NeverWearMasks,RarelyWearMasks,SometimesWearMasks,FrequentlyWearMasks,AlwaysWearMasks
0,Autauga,Alabama,55869,0.109,2103,metropolitan,59338,Republican,Paused,Lifted,...,Yes,11.3,32.6,28.4,27.7,0.053,0.074,0.134,0.295,0.444
1,Baldwin,Alabama,223234,0.251,6743,metropolitan,57588,Republican,Paused,Lifted,...,Yes,9.7,27.6,31.3,31.3,0.083,0.059,0.098,0.323,0.436
2,Barbour,Alabama,24686,0.191,1045,noncore,34382,Republican,Paused,Lifted,...,Yes,27.0,35.7,25.1,12.2,0.067,0.121,0.12,0.201,0.491
3,Bibb,Alabama,22394,0.0,856,metropolitan,46064,Republican,Paused,Lifted,...,Yes,16.8,47.3,24.4,11.5,0.02,0.034,0.096,0.278,0.572
4,Blount,Alabama,57826,0.694,1988,metropolitan,50412,Republican,Paused,Lifted,...,Yes,19.8,34.0,33.5,12.6,0.053,0.114,0.18,0.194,0.459


## Read and Clean Voter data

In [7]:
# Read voter data
voter_df = pd.read_csv("countypres_2000-2016.csv")

# Only use data from last election
voter_df = voter_df[voter_df["year"] == 2016]

# Get percentage of democratic and republican voters
voter_data = voter_df.groupby(["state", "county"]).apply(lambda county: get_voter_percentage(county))

# Bin by difference in voter percentage into new column - Citizen Political Leaning
voter_data["voter_diff"] = abs(voter_data['Republican'] - voter_data['Democrat'])
voter_data["CitizenPoliticalLeaning"] = voter_data.apply(lambda row: bin_voter_percentage(row), axis=1)

# Merge voter data with the other variables
df = pd.merge(df, voter_data.reset_index(), how='left', left_on=["StateName", "CountyName"], right_on=["state", "county"])

# Drop duplicate columns
df.drop(["state", "county"], axis=1, inplace=True)

df.head()

Unnamed: 0,CountyName,StateName,2019Population,ICUBedsPerThousandHabitants,ConfirmedCases,UrbanInfluence,MedianHouseholdIncome,Gov_Leaning,Reopening_Status,Stay_At_Home_Order,...,college_graduate,NeverWearMasks,RarelyWearMasks,SometimesWearMasks,FrequentlyWearMasks,AlwaysWearMasks,Republican,Democrat,voter_diff,CitizenPoliticalLeaning
0,Autauga,Alabama,55869,0.109,2103,metropolitan,59338,Republican,Paused,Lifted,...,27.7,0.053,0.074,0.134,0.295,0.444,0.727666,0.237697,0.489969,Heavily Republican
1,Baldwin,Alabama,223234,0.251,6743,metropolitan,57588,Republican,Paused,Lifted,...,31.3,0.083,0.059,0.098,0.323,0.436,0.765457,0.193856,0.571601,Heavily Republican
2,Barbour,Alabama,24686,0.191,1045,noncore,34382,Republican,Paused,Lifted,...,12.2,0.067,0.121,0.12,0.201,0.491,0.520967,0.465278,0.055688,Leaning Republican
3,Bibb,Alabama,22394,0.0,856,metropolitan,46064,Republican,Paused,Lifted,...,11.5,0.02,0.034,0.096,0.278,0.572,0.764032,0.212496,0.551536,Heavily Republican
4,Blount,Alabama,57826,0.694,1988,metropolitan,50412,Republican,Paused,Lifted,...,12.6,0.053,0.114,0.18,0.194,0.459,0.893348,0.084258,0.80909,Heavily Republican
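
Because the merge is a left join on state and county names, any county whose name does not match exactly in the voter file ends up with missing voter columns. A minimal sketch of how such rows could be surfaced is shown here, using the `indicator` option of `pd.merge`. The two small frames are invented for illustration and "Made-Up Borough" is not a real entry in either dataset.

```python
import pandas as pd

# Toy stand-ins for the county table and the voter table (values invented)
counties = pd.DataFrame({
    "StateName": ["Alabama", "Alabama", "Alaska"],
    "CountyName": ["Autauga", "Baldwin", "Made-Up Borough"],
})
votes = pd.DataFrame({
    "state": ["Alabama", "Alabama"],
    "county": ["Autauga", "Baldwin"],
    "Republican": [0.73, 0.77],
})

# indicator=True adds a "_merge" column telling where each row came from
merged = pd.merge(
    counties, votes, how="left",
    left_on=["StateName", "CountyName"],
    right_on=["state", "county"],
    indicator=True,
)

# Rows present only in the left table have no voter data
unmatched = merged[merged["_merge"] == "left_only"]
print(unmatched[["StateName", "CountyName"]])
```

Inspecting the unmatched rows before dropping them can show whether the mismatch comes from naming differences or from counties that are genuinely absent from the voter file.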


## Bin Continuous Columns

Here, we convert the continuous columns into categorical variables. We first used the pandas qcut function to find equal-frequency bin edges and then rounded those edges to more intuitive values for use with the pandas cut function. The dictionary bin_dict below lists the resulting bin edges.

In [8]:
bin_dict = {
    'ICUBedsPerThousandHabitants': [-0.001, 0.14, 0.28, 1.2, 31],
    'MedianHouseholdIncome': [-0.001, 42000, 48000, 53000, 60000, 141000],
    'no_hs_diploma': [0, 8, 11, 14, 19, 70],
    'hs_diploma': [8, 29, 33, 37, 40, 56],
    'some_college': [5, 26, 30, 32, 35, 50],
    'college_graduate': [5, 14, 17, 21, 28, 80],
    'ProportionConfirmedCases': [-0.01, 0.013, 0.0202, 0.0276, 0.0378, 0.18],
    '2019Population': [0, 10000, 20000, 37000, 100000, 11000000],
    'AlwaysWearMasks': [0, 0.42, 0.57, 1]
}
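
The two-step procedure described above (quantile edges first, cleaner edges second) can be sketched as follows. This is an illustration on synthetic data, not the exact steps used to obtain `bin_dict`: the log-normal sample, the rounding to the nearest 1000, and the variable names are all assumptions.

```python
import numpy as np
import pandas as pd

# Synthetic, income-like data (assumption: not the real county data)
rng = np.random.default_rng(0)
income = pd.Series(rng.lognormal(mean=10.8, sigma=0.25, size=1000))

# Step 1: ask qcut for five equal-frequency bins and keep the edges it found
_, raw_edges = pd.qcut(income, q=5, retbins=True)

# Step 2: round the edges to the nearest 1000 and widen the outer edges
# so that the minimum and maximum values still fall inside a bin
edges = np.round(raw_edges, -3)
edges[0] = -0.001
edges[-1] = np.ceil(income.max() / 1000) * 1000

labels = ["Very Low", "Low", "Medium", "High", "Very High"]
binned = pd.cut(income, bins=edges, labels=labels)

# Shares stay close to 20% each, but the edges are now easier to read
print(edges)
print(binned.value_counts(normalize=True).sort_index())
```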

### Defining Labels

Each binned variable is assigned ordered category labels, inspired by the Likert scale, that convey the relative magnitude of its value.

In [9]:
size_labels = ["Very Small", "Small", "Medium", "Large", "Very Large"]
amount_labels = ["Very Low", "Low", "Medium", "High", "Very High"]
icu_bed_labels = ["Low", "Medium", "High", "Very High"]
cdc_labels = ["Low", "Medium", "High"]

### Binning of columns using Pandas cut function

In [15]:
df_binned_updated = df.copy()

# Binning features - pd.cut applies the cleaner bin edges from bin_dict (originally derived with pd.qcut)
df_binned_updated["EmergencyPreparedness"] = pd.cut(x = df_binned_updated["ICUBedsPerThousandHabitants"], 
                                                    labels = icu_bed_labels, 
                                                    bins = bin_dict['ICUBedsPerThousandHabitants'])
df_binned_updated["MedianHouseholdIncome"] = pd.cut(x = df_binned_updated["MedianHouseholdIncome"], 
                                                    labels = amount_labels, 
                                                    bins = bin_dict['MedianHouseholdIncome'])
df_binned_updated["CDCCompliance"] = pd.cut(x = df_binned_updated["AlwaysWearMasks"], 
                                                    labels = cdc_labels, 
                                                    bins = bin_dict['AlwaysWearMasks'])

# binning all educational columns in case we need them later
df_binned_updated["NoHSDiploma"] = pd.cut(x = df_binned_updated["no_hs_diploma"], 
                                                    labels = amount_labels, 
                                                    bins = bin_dict['no_hs_diploma'])
df_binned_updated["HSDiploma"] = pd.cut(x = df_binned_updated["hs_diploma"], 
                                                    labels = amount_labels, 
                                                    bins = bin_dict['hs_diploma'])
df_binned_updated["SomeCollege"] = pd.cut(x = df_binned_updated["some_college"], 
                                                    labels = amount_labels, 
                                                    bins = bin_dict['some_college'])
df_binned_updated["CollegeGraduate"] = pd.cut(x = df_binned_updated["college_graduate"], 
                                                    labels = amount_labels, 
                                                    bins = bin_dict['college_graduate'])
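
One detail worth keeping in mind when reading the bin edges: by default `pd.cut` builds intervals that are open on the left and closed on the right, so a value equal to the lowest edge falls outside every bin and becomes missing. This is presumably why some lower edges in `bin_dict` are slightly negative. The sketch below uses invented values to show the effect.

```python
import pandas as pd

values = pd.Series([0.0, 0.14, 0.5])
labels = ["Low", "Medium", "High"]

# Lowest edge exactly 0: the first interval is (0, 0.14], so 0.0 is not covered
strict = pd.cut(values, bins=[0, 0.14, 0.28, 1.2], labels=labels)

# Lowest edge slightly below 0: the first interval is (-0.001, 0.14]
padded = pd.cut(values, bins=[-0.001, 0.14, 0.28, 1.2], labels=labels)

# Alternative: keep the edge at 0 and close the first interval on the left
closed = pd.cut(values, bins=[0, 0.14, 0.28, 1.2], labels=labels, include_lowest=True)

print(strict.tolist())
print(padded.tolist())
print(closed.tolist())
```

For columns whose lowest edge is exactly 0 or equal to the observed minimum, a value sitting on that edge would turn into a missing value after binning unless `include_lowest=True` is passed or the edge is padded.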

### Grouping Ban on Large Gatherings categories

In [16]:
# Reducing number of categories from 6 to 4
df_binned_updated.loc[df_binned_updated['Ban_Large_Gatherings'].str.contains('Expanded Limit'), 
                      'Ban_Large_Gatherings'] = 'Expanded Limit'
df_binned_updated.loc[df_binned_updated['Ban_Large_Gatherings'].str.contains('Prohibited'), 
                      'Ban_Large_Gatherings'] = 'Gatherings Prohibited'
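
The grouping above relies on substring matching. A small sketch of the same idea on a toy Series follows. The raw category strings here are hypothetical, since the original six labels are not listed in this notebook, and `na=False` is added as an assumption to keep the mask boolean if a value is missing.

```python
import pandas as pd

# Hypothetical raw labels, invented for illustration
s = pd.Series([
    "Expanded Limit to 25",
    "Expanded Limit to 50",
    "All Gatherings Prohibited",
    "Lifted",
    None,
])

# regex=False treats the pattern as a plain substring;
# na=False maps missing values to False so the mask can be used with .loc
s.loc[s.str.contains("Expanded Limit", na=False, regex=False)] = "Expanded Limit"
s.loc[s.str.contains("Prohibited", na=False, regex=False)] = "Gatherings Prohibited"

print(s.value_counts(dropna=False))
```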

### Outcome Variable Definition - Proportion of Confirmed Cases

In [17]:
# Create ProportionConfirmedCases = number of confirmed cases divided by county population
df_binned_updated["ProportionConfirmedCases"] = df_binned_updated["ConfirmedCases"] / df_binned_updated["2019Population"]

df_binned_updated["ProportionConfirmedCases"] = pd.cut(x = df_binned_updated["ProportionConfirmedCases"], 
                                                    labels = amount_labels, 
                                                    bins = bin_dict['ProportionConfirmedCases'])
df_binned_updated.head()

Unnamed: 0,CountyName,StateName,2019Population,ICUBedsPerThousandHabitants,ConfirmedCases,UrbanInfluence,MedianHouseholdIncome,Gov_Leaning,Reopening_Status,Stay_At_Home_Order,...,Democrat,voter_diff,CitizenPoliticalLeaning,EmergencyPreparedness,CDCCompliance,NoHSDiploma,HSDiploma,SomeCollege,CollegeGraduate,ProportionConfirmedCases
0,Autauga,Alabama,55869,0.109,2103,metropolitan,High,Republican,Paused,Lifted,...,0.237697,0.489969,Heavily Republican,Low,Medium,Medium,Low,Low,High,High
1,Baldwin,Alabama,223234,0.251,6743,metropolitan,High,Republican,Paused,Lifted,...,0.193856,0.571601,Heavily Republican,Medium,Medium,Low,Very Low,Medium,Very High,High
2,Barbour,Alabama,24686,0.191,1045,noncore,Very Low,Republican,Paused,Lifted,...,0.465278,0.055688,Leaning Republican,Medium,Medium,Very High,Medium,Very Low,Very Low,Very High
3,Bibb,Alabama,22394,0.0,856,metropolitan,Low,Republican,Paused,Lifted,...,0.212496,0.551536,Heavily Republican,Low,High,High,Very High,Very Low,Very Low,Very High
4,Blount,Alabama,57826,0.694,1988,metropolitan,Medium,Republican,Paused,Lifted,...,0.084258,0.80909,Heavily Republican,High,Medium,Very High,Medium,High,Very Low,High


### Remove rows with missing values in data

In [18]:
print('Number of Missing values')
print(df_binned_updated.isnull().sum()[df_binned_updated.isnull().sum() > 0])
df_binned_updated.dropna(inplace= True)
print('\n')

print('Data Dimensions before transformations: ', df.shape)
print('Data Dimensions after transformations: ', df_binned_updated.shape)

Number of Missing values
UrbanInfluence             2
no_hs_diploma              1
hs_diploma                 1
some_college               1
college_graduate           1
Republican                 6
Democrat                   6
voter_diff                 6
CitizenPoliticalLeaning    6
NoHSDiploma                1
HSDiploma                  1
SomeCollege                1
CollegeGraduate            1
dtype: int64


Data Dimensions before transformations:  (2980, 27)
Data Dimensions after transformations:  (2972, 34)
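
Note that `dropna` without arguments removes a row if any column is missing, including columns that are not used in the DAG. If only the DAG columns should decide which rows are kept, `dropna(subset=...)` is one option. The sketch below uses an invented three-row frame to show the difference, and `dag_cols` is a hypothetical name for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "NoHSDiploma": ["Low", "High", None],
    "UrbanInfluence": ["noncore", "metropolitan", "noncore"],
    "some_college": [28.4, np.nan, 25.1],  # stands in for a column not used in the DAG
})
dag_cols = ["NoHSDiploma", "UrbanInfluence"]

# Drops every row with a missing value in any column
all_dropped = toy.dropna()

# Drops only rows with a missing value in the listed columns
subset_dropped = toy.dropna(subset=dag_cols)

print(len(all_dropped), len(subset_dropped))
```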


## Select columns relevant to DAG

In [20]:
cols = [
    "CitizenPoliticalLeaning",
    "MedianHouseholdIncome",
    "NoHSDiploma",
    "CDCCompliance",
    "UrbanInfluence",
    "Gov_Leaning",
    "EmergencyPreparedness",
    "Ban_Large_Gatherings",
    "ProportionConfirmedCases",
]

dag_df = df_binned_updated.loc[:, cols].copy()
dag_df.head()

Unnamed: 0,CitizenPoliticalLeaning,MedianHouseholdIncome,NoHSDiploma,CDCCompliance,UrbanInfluence,Gov_Leaning,EmergencyPreparedness,Ban_Large_Gatherings,ProportionConfirmedCases
0,Heavily Republican,High,Medium,Medium,metropolitan,Republican,Low,Lifted,High
1,Heavily Republican,High,Low,Medium,metropolitan,Republican,Medium,Lifted,High
2,Leaning Republican,Very Low,Very High,Medium,noncore,Republican,Medium,Lifted,Very High
3,Heavily Republican,Low,High,High,metropolitan,Republican,Low,Lifted,Very High
4,Heavily Republican,Medium,Very High,Medium,metropolitan,Republican,High,Lifted,High


### Check Univariate Likelihoods
Percentage of counties falling into each category, shown separately for each variable.

In [21]:
# Check univariate likelihoods 
for col in dag_df.columns:
    print(col)
    print(dag_df[col].value_counts(normalize = True)*100)
    print('\n')

CitizenPoliticalLeaning
Heavily Republican    65.746972
Republican             9.690444
Leaning Republican     6.729475
Heavily Democrat       5.652759
Even                   5.215343
Leaning Democrat       4.306864
Democrat               2.658143
Name: CitizenPoliticalLeaning, dtype: float64


MedianHouseholdIncome
Very High    22.039031
Low          21.029610
Very Low     19.380888
High         18.909825
Medium       18.640646
Name: MedianHouseholdIncome, dtype: float64


NoHSDiploma
Low          22.678331
High         20.726783
Very Low     20.423957
Very High    18.135935
Medium       18.034993
Name: NoHSDiploma, dtype: float64


CDCCompliance
Medium    34.825034
High      33.075370
Low       32.099596
Name: CDCCompliance, dtype: float64


UrbanInfluence
noncore         42.597577
metropolitan    36.507402
micropolitan    20.895020
Name: UrbanInfluence, dtype: float64


Gov_Leaning
Republican    58.041723
Democratic    41.958277
Name: Gov_Leaning, dtype: float64


EmergencyPreparedn

## Write transformed data with dag node aliases to fit the causal model

In [23]:
csv_df = dag_df.copy()
csv_df.columns = ["CPL", "MI", "HS", "CDC", "UI", "SPL", "ICU", "BLG", "CC"]
csv_df.to_csv("dag_data_v3.csv", index=False)
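
Assigning `csv_df.columns` positionally works as long as the column order matches the order in `cols`. A slightly more defensive sketch uses an explicit mapping, so that each alias stays attached to its column even if the order changes. The mapping below is read off the positional correspondence between `cols` and the alias list above, and the one-row frame is a placeholder for illustration.

```python
import pandas as pd

# Alias for each DAG node, taken from the positional order used above
alias = {
    "CitizenPoliticalLeaning": "CPL",
    "MedianHouseholdIncome": "MI",
    "NoHSDiploma": "HS",
    "CDCCompliance": "CDC",
    "UrbanInfluence": "UI",
    "Gov_Leaning": "SPL",
    "EmergencyPreparedness": "ICU",
    "Ban_Large_Gatherings": "BLG",
    "ProportionConfirmedCases": "CC",
}

# Placeholder frame with the same column names, deliberately reordered
dag_df = pd.DataFrame({col: ["x"] for col in alias})
dag_df = dag_df[sorted(dag_df.columns)]

# rename matches by name, so the reordering does not affect the aliases
csv_df = dag_df.rename(columns=alias)
print(list(csv_df.columns))
```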