# **White Male Effect - CDC**
## **Ekim Luo**
*Last updated: July 16, 2021*

# **Data**
- [COVID-19 Case Surveillance Public Use Data](https://data.cdc.gov/Case-Surveillance/COVID-19-Case-Surveillance-Public-Use-Data/vbim-akqf), CDC
    - January 1, 2020 - July 7, 2021

# **Setting up**

In [1]:
# import libraries
import pandas as pd
import statistics
import math

In [2]:
ls

 COVID-19_Case_Surveillance_Public_Use_Data.csv
 COVID-19_Case_Surveillance_Public_Use_Data.csv.zip
 covidpanel_us_stata_jun_23_2021.zip
 NOTEBOOK_Python.ipynb
 uas_cleaned_full.csv
 uas.csv
 uas_morethan1wave.csv
 uas_only1wave.csv
 uas.zip
'White Male Effect_Cleaning_CDC.ipynb'
'White Male Effect_Cleaning_UAS_ekim.ipynb'
'White Male Effect_Cleaning_UAS.ipynb'
'WME plot_16.7.png'
 wme_uas.py


In [3]:
# import data
cdc = pd.read_csv('COVID-19_Case_Surveillance_Public_Use_Data.csv', low_memory = False)
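The file holds roughly 27 million rows, so parsing only the columns that are actually used can reduce memory. A minimal sketch, assuming just the four columns inspected in this notebook are needed. The in-memory `sample_csv` and its two rows are hypothetical stand-ins for the real file; only the column names come from the dataset.

```python
import io
import pandas as pd

# Hypothetical stand-in for the real CSV (two made-up rows).
sample_csv = io.StringIO(
    "cdc_case_earliest_dt,sex,age_group,race_ethnicity_combined,death_yn\n"
    '2020/03/01,Male,30 - 39 Years,"White, Non-Hispanic",No\n'
    "2020/03/02,Female,80+ Years,Hispanic/Latino,Yes\n"
)

wanted = ["sex", "age_group", "race_ethnicity_combined", "death_yn"]

# usecols tells read_csv to parse only the listed columns.
sample = pd.read_csv(sample_csv, usecols=wanted)
print(sample.columns.tolist())
```

Note that the export at the end of the notebook would then contain only these columns, so this is only appropriate if the date and hospitalization fields are not needed later.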

# **Getting to know the data**

In [4]:
# count N of recorded cases before cleaning
len(cdc)

27145726

In [5]:
# print unique values for race
cdc['race_ethnicity_combined'].unique() 

array(['Black, Non-Hispanic', 'Unknown', 'Hispanic/Latino',
       'White, Non-Hispanic', 'Multiple/Other, Non-Hispanic',
       'Native Hawaiian/Other Pacific Islander, Non-Hispanic',
       'Asian, Non-Hispanic',
       'American Indian/Alaska Native, Non-Hispanic', 'Missing', nan],
      dtype=object)

In [6]:
# print unique values for sex
cdc['sex'].unique()

array(['Male', 'Missing', 'Other', 'Unknown', 'Female', nan], dtype=object)

In [7]:
# print unique values for COVID death
cdc['death_yn'].unique()

array(['No', 'Missing', 'Unknown', 'Yes'], dtype=object)

In [8]:
# print unique values for age
cdc['age_group'].unique()

array(['10 - 19 Years', 'Missing', '30 - 39 Years', '20 - 29 Years',
       '40 - 49 Years', '80+ Years', '0 - 9 Years', '70 - 79 Years', nan,
       '60 - 69 Years', '50 - 59 Years'], dtype=object)

# **Cleaning data**

In [9]:
cdc = cdc.dropna(subset=['sex','race_ethnicity_combined']) # drop na

# sex
cdc = cdc[cdc.sex != 'Unknown'] 
cdc = cdc[cdc.sex != 'Missing'] 
cdc = cdc[cdc.sex != 'Other']

# race_ethnicity_combined
cdc = cdc[cdc.race_ethnicity_combined != 'Missing']

# death_yn
cdc = cdc[cdc['death_yn'] != 'Missing'] 
cdc = cdc[cdc['death_yn'] != 'Unknown']
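The filters above can also be expressed as one boolean mask. A minimal sketch on a made-up `toy` frame (the rows and the `keep` name are illustrative, not from the notebook), assuming the value sets printed in the previous section. As in the cell above, rows whose race is `'Unknown'` are retained; only `'Missing'` and NaN are dropped.

```python
import pandas as pd

# Made-up rows covering each filtering rule.
toy = pd.DataFrame({
    "sex": ["Male", "Female", "Other", None, "Male", "Female"],
    "race_ethnicity_combined": [
        "White, Non-Hispanic", "Unknown", "Hispanic/Latino",
        "Asian, Non-Hispanic", "Missing", "Black, Non-Hispanic",
    ],
    "death_yn": ["No", "Yes", "No", "No", "Yes", "Unknown"],
})

keep = (
    toy["sex"].isin(["Male", "Female"])                # drops Other/Unknown/Missing/NaN
    & toy["race_ethnicity_combined"].notna()           # drops NaN race
    & (toy["race_ethnicity_combined"] != "Missing")    # drops 'Missing' race
    & toy["death_yn"].isin(["Yes", "No"])              # drops Missing/Unknown deaths
)
cleaned = toy[keep]
print(len(cleaned))
```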

In [10]:
# count N of recorded cases after cleaning
len(cdc)

13855750

# **Recoding variables**

In [11]:
# sex
# male = 1, female = 0
# assign the result back; an in-place replace on a selected column
# may not update the DataFrame in newer pandas versions
cdc['sex'] = cdc['sex'].replace({'Female': 0, 'Male': 1})

In [12]:
# age
# recode the nine age bands as ordinal levels 1 to 9
age_levels = {
    '0 - 9 Years': 1,
    '10 - 19 Years': 2,
    '20 - 29 Years': 3,
    '30 - 39 Years': 4,
    '40 - 49 Years': 5,
    '50 - 59 Years': 6,
    '60 - 69 Years': 7,
    '70 - 79 Years': 8,
    '80+ Years': 9,
    'Unknown': float('nan'),  # a true missing value, not the string 'NaN'
}
# note: 'Missing' and NaN values in age_group are left unchanged here
cdc['age_group'] = cdc['age_group'].replace(age_levels)

# **Scoring data**

In [13]:
cdc['race'] = cdc['race_ethnicity_combined'].str.contains('White', na=False, regex=False).astype(int)
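To make the effect of the `race` flag concrete, here is a small sketch applying the same `str.contains` call to a handful of the category labels printed earlier (`race_values` is an illustrative name). Only `'White, Non-Hispanic'` is coded 1; every other label, including `'Unknown'`, is coded 0.

```python
import pandas as pd

# A few of the labels from race_ethnicity_combined.
race_values = pd.Series([
    "White, Non-Hispanic",
    "Black, Non-Hispanic",
    "Hispanic/Latino",
    "Multiple/Other, Non-Hispanic",
    "Unknown",
])

# Literal substring match; NaN would be treated as "no match".
race_flag = race_values.str.contains("White", na=False, regex=False).astype(int)
print(race_flag.tolist())
```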

## **group**
- White men, White women, non-White men, non-White women

In [15]:
group = [] # create empty list for appending group values 

for i in cdc[['sex', 'race']].values.tolist(): # loop through the gender and race columns and append group values
    if i[0] == 1 and i[1] == 1: # white male
        group.append('wm') # append value to list 
    elif i[0] == 1 and i[1] != 1: # non-white male
        group.append('nm')
    elif i[0] == 0 and i[1] == 1: # white female
        group.append('ww')
    elif i[0] == 0 and i[1] != 1: # non-white female
        group.append('nw')
    else:
        group.append('NaN') # if the gender or race cells are empty, write in "NaN"

cdc['group'] = group # append list as a column named group
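On about 14 million rows, a Python-level loop can be slow. The same labelling can be sketched with `numpy.select`, which evaluates the four conditions column-wise. This is an alternative under the same coding (sex: 1 = male, 0 = female; race: 1 = White), shown on a made-up `toy` frame, not the code the notebook ran.

```python
import numpy as np
import pandas as pd

# Made-up rows: one per group, plus one with missing sex.
toy = pd.DataFrame({
    "sex": [1, 1, 0, 0, None],
    "race": [1, 0, 1, 0, 1],
})

conditions = [
    (toy["sex"] == 1) & (toy["race"] == 1),   # white male
    (toy["sex"] == 1) & (toy["race"] != 1),   # non-white male
    (toy["sex"] == 0) & (toy["race"] == 1),   # white female
    (toy["sex"] == 0) & (toy["race"] != 1),   # non-white female
]
labels = ["wm", "nm", "ww", "nw"]

# Rows matching none of the conditions get the string "NaN", as in the loop.
toy["group"] = np.select(conditions, labels, default="NaN")
print(toy["group"].tolist())
```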

## **group2**
- White men (1) v. others (0)

In [17]:
# White men = 1, everyone else = 0
cdc['group2'] = (cdc['group'] == 'wm').astype(int)

# **Describing the data**

In [18]:
# count N of deaths total
cdc['death_yn'].value_counts()

No     13375930
Yes      479820
Name: death_yn, dtype: int64

In [19]:
# count N of deaths by group
cdc[cdc['death_yn'] == 'Yes'].groupby(['group']).count()

| group | cdc_case_earliest_dt | cdc_report_dt | pos_spec_dt | onset_dt | current_status | sex | age_group | race_ethnicity_combined | hosp_yn | icu_yn | death_yn | medcond_yn | race | group2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| nm | 134077 | 116084 | 40695 | 53317 | 134077 | 134077 | 134077 | 134077 | 134077 | 134077 | 134077 | 134077 | 134077 | 134077 |
| nw | 106053 | 93790 | 33099 | 40761 | 106053 | 106053 | 106053 | 106053 | 106053 | 106053 | 106053 | 106053 | 106053 | 106053 |
| wm | 126203 | 119440 | 34880 | 60014 | 126203 | 126203 | 126203 | 126203 | 126203 | 126203 | 126203 | 126203 | 126203 | 126203 |
| ww | 113487 | 107707 | 32060 | 51591 | 113487 | 113487 | 113487 | 113487 | 113487 | 113487 | 113487 | 113487 | 113487 | 113487 |
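`count()` on the grouped frame returns the number of non-null values in every column, which is why the date columns show smaller numbers. If only the number of deaths per group is wanted, `size()` gives one row count per group. A minimal sketch on a made-up `toy` frame:

```python
import pandas as pd

# Made-up rows; onset_dt is partly missing to mimic the real data.
toy = pd.DataFrame({
    "group": ["wm", "wm", "nw", "ww", "nm", "nm"],
    "death_yn": ["Yes", "No", "Yes", "Yes", "Yes", "Yes"],
    "onset_dt": [None, "2020/04/01", None, "2020/05/01", None, None],
})

# size() counts rows per group regardless of missing values in other columns.
deaths_by_group = toy[toy["death_yn"] == "Yes"].groupby("group").size()
print(deaths_by_group)
```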


In [20]:
# N of cases by group
cdc.groupby(['group'])['death_yn'].count()

group
nm    4380096
nw    4735947
wm    2228681
ww    2511026
Name: death_yn, dtype: int64
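The death and case counts reported in this section can be combined into a crude share of recorded cases that ended in death, per group. This is a minimal sketch using the counts printed above; the figures are unadjusted (for age, for example), and the names `deaths`, `cases` and `crude_death_pct` are illustrative.

```python
import pandas as pd

# Counts copied from the outputs of In [19] and In [20].
deaths = pd.Series({"nm": 134077, "nw": 106053, "wm": 126203, "ww": 113487})
cases = pd.Series({"nm": 4380096, "nw": 4735947, "wm": 2228681, "ww": 2511026})

# Deaths as a percentage of recorded cases in each group.
crude_death_pct = (deaths / cases * 100).round(2)
print(crude_death_pct)
```

The two series sum to the totals reported earlier (479,820 deaths and 13,855,750 cases), which is a quick consistency check on the grouping.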

# **Exporting data**

In [21]:
cdc.to_csv('cdc_16.7.2021.csv')