# EEB125 Homework 4: Working with Booleans and Functions

## Logistics

**Due date**: The homework is due 17:00 (5:00pm) on Tuesday, February 6.

You will submit your work on [MarkUs](https://markus-ds.teach.cs.toronto.edu).
To submit your work:

1. Download this file (`Homework_4.ipynb`) from JupyterHub. (See [our JupyterHub Guide](../../../../guides/jupyterhub_guide.ipynb) for detailed instructions.)
2. Submit this file to MarkUs under the **hw4** assignment. (See [our MarkUs Guide](../../../../guides/markus_guide.ipynb) for detailed instructions.)

All homework assignments take place in a Jupyter notebook (like this one). When you are done, you will download this notebook and submit it to MarkUs.
We've included submission instructions at the end of this notebook.

## Overview

This week, you will be practicing a couple of programming techniques we examined in lecture to answer data science questions about sex-based differences in COVID infection and mortality across the United States. 

We will be using data from <https://www.genderscilab.org/gender-and-sex-in-covid19> for the week of 11/1/2021, which is available in `covid_sex.csv`. To inform your exploration, please read the article "What’s Really Behind the Gender Gap in Covid-19 Deaths?", printed in the New York Times and available in `nytimes_covid_sex.pdf` (go to File -> Open... to open this PDF). We will explore this dataset to assess 1) whether we can observe sex-based differences in the risk of death among those infected with COVID and 2) whether we can identify any sociopolitical, as opposed to biological, explanations for any observed differences.

## Task 1: Read in the data file

### Problem 1a. Prep our data

Open the file `covid_sex.csv` in Python and read in the lines. Assign the header (the first line) to the variable `header` and the rest of the data to the variable `data`.

In [2]:
# Write your code here

# The with block closes the file automatically once the lines are read.
with open("covid_sex.csv") as file:
    lines = file.readlines()
header = lines[0]
data = lines[1:]
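
As a hedged aside, the header/data split can be tried out on a tiny stand-in file before touching the real one. The sketch below writes a three-line file named `sample_covid.csv` (a hypothetical name, with abbreviated illustrative rows) so that it runs on its own. The `utf-8-sig` encoding is an assumption: it strips a byte-order mark if the file starts with one, and reads the file normally otherwise.

```python
import os
import tempfile

# Build a small stand-in file; the rows are abbreviated and illustrative only.
sample_text = "State,Date,Total_cases\nAlaska,30-Oct,132645\nArizona,30-Oct,1166060\n"
sample_path = os.path.join(tempfile.mkdtemp(), "sample_covid.csv")
with open(sample_path, "w", encoding="utf-8") as out:
    out.write(sample_text)

# Read the file back; the with block closes it when the indented code ends.
with open(sample_path, encoding="utf-8-sig") as sample_file:
    sample_lines = sample_file.readlines()

sample_header = sample_lines[0]   # first line: column names
sample_data = sample_lines[1:]    # every line after the header

print(sample_header.strip())
print(len(sample_data), "data lines")
```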

### Problem 1b. Interpret the data file

Examine the header by printing it to the screen. We will be interested in the following data columns: `State`, `Male_cases`, `Female_cases`, `Male_deaths`, and `Female_deaths`. Please indicate which index of the header each of these data columns corresponds to. *(1pt)* For example, `State` is at index zero. Please start your indexing at zero.

HINT: You might find it easier to read and count the columns indicated in the header if you split it up according to commas and interpret the resulting list. 

In [3]:
# Write your code here

print(header.strip().split(","))

['\ufeffState', 'Date', 'Total_cases', 'Male_cases', 'Female_cases', 'Male_cases_pct', 'Female_cases_pct', 'Male_cases_rate', 'Female_cases_rate', 'Total_deaths', 'Male_deaths', 'Female_deaths', 'Male_deaths_pct', 'Female_deaths_pct', 'Male_deaths_rate', 'Female_deaths_rate', 'source']


Solution: Male_cases is at index 3, Female_cases is at index 4, Male_deaths is at index 10, and Female_deaths is at index 11.
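
One way to double-check these indices without counting by hand is to let `enumerate` number the columns. This is only a sketch: `header_text` is the header copied from the printed output above (including the leading `\ufeff` byte-order mark), and the names `columns` and `wanted` are introduced here for illustration.

```python
# Header text copied from the printed output above, including the leading
# byte-order mark ("\ufeff") that shows up in front of "State".
header_text = (
    "\ufeffState,Date,Total_cases,Male_cases,Female_cases,"
    "Male_cases_pct,Female_cases_pct,Male_cases_rate,Female_cases_rate,"
    "Total_deaths,Male_deaths,Female_deaths,Male_deaths_pct,"
    "Female_deaths_pct,Male_deaths_rate,Female_deaths_rate,source\n"
)

# Drop the byte-order mark and the newline, then split on commas.
columns = header_text.lstrip("\ufeff").strip().split(",")
wanted = ["State", "Male_cases", "Female_cases", "Male_deaths", "Female_deaths"]

# enumerate pairs each column name with its zero-based index.
for index, name in enumerate(columns):
    if name in wanted:
        print(index, name)
```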

### Problem 1c. Examine the data

Loop over `data` and print each line to the terminal. You may notice that some of the lines include multiple commas beside one another, with no text in between. This is one way of representing missing data in a .csv file. We will need to find some way to deal with this during the subsequent problems.

In [4]:
# Write your code here

for line in data:
    print(line)

Alabama,30-Oct,831653,,,,,,,15573,,,,,,,https://alpublichealth.maps.arcgis.com/apps/opsdashboard/index.html#/6d2771faa9da4a2786a509d82c8cf0f7

Alaska,30-Oct,132645,67287,64852,50.73,48.89,17450.90,18374.95,699,425,274,60.8,39.2,110.22,77.63,https://coronavirus-response-alaska-dhss.hub.arcgis.com/

Arizona,30-Oct,1166060,561976,599451,48.19,51.41,16272.94,17160.29,21153,12392,8748,58.58,41.36,358.83,250.43,https://www.azdhs.gov/preparedness/epidemiology-disease-control/infectious-disease-epidemiology/index.php#novel-coronavirus-home

Arkansas,30-Oct,512994,239568,268296,46.7,52.3,16314.78,17624.85,8370,,,,,,,https://www.healthy.arkansas.gov/programs-services/topics/novel-coronavirus

California,30-Oct,4647587,2221547,2356327,47.8,50.7,11419.62,11964.09,71519,41696,29537,58.3,41.3,214.33,149.97,https://update.covid19.ca.gov/

Colorado,30-Oct,740461,360012,370008,48.62,49.97,12946.21,13453.33,8186,4508,3666,55.07,44.78,162.11,133.28,https://covid19.colorado.gov/case-data

Connecticut,30-O
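
To see how the missing values look once a line is split, here is a small sketch using the Alabama line from the output above. The long source URL is replaced with the placeholder `source_url` for readability; everything else is copied from that line.

```python
# Alabama's line from the output above, with the URL shortened to a placeholder.
alabama = "Alabama,30-Oct,831653" + ",,,,,,," + "15573" + ",,,,,,," + "source_url\n"
fields = alabama.strip().split(",")

# Consecutive commas become empty strings, so the column positions are preserved.
print(len(fields))
print(repr(fields[3]))

# Collect the indices of every missing (empty) field.
missing = [i for i, value in enumerate(fields) if value == ""]
print(missing)
```

Because the empty strings keep their positions, index 3 still refers to `Male_cases` even when the value is missing.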

## Problem 2: Examining sex-wise differences in COVID death risk




### Problem 2a.

Create two empty lists and assign them, respectively, to the variables `risk_m` and `risk_f`.

In [5]:
# Write your code here

risk_m = []
risk_f = []

### Problem 2b. Calculate risk of death given COVID infection for each sex

Loop through the lines of our data file. Create a metric for risk of death given COVID infection by dividing, for each state, the number of deaths for each sex by the number of cases for that same sex. In other words, use the following formulae:

``` covid_risk_m = deaths_m / infections_m```

for males, and 

``` covid_risk_f = deaths_f / infections_f```

for females.

Append the results for each sex to each of the lists created in the previous step. 

This step will require that we convert the values from strings to floating point numbers. However, some of the columns have missing values, which Python will read as an empty string (`""`). Use exception handling to skip over lines where the type conversion fails.

In [6]:
# Write your code here

for line in data:
    line_dat = line.strip().split(",")
    state = line_dat[0]
    try:
        inf_m  = float(line_dat[3])
        inf_f = float(line_dat[4])
        death_m = float(line_dat[10])
        death_f = float(line_dat[11])
        covid_risk_m = death_m / inf_m
        covid_risk_f = death_f / inf_f
        risk_m.append(covid_risk_m)
        risk_f.append(covid_risk_f)
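    except (ValueError, IndexError, ZeroDivisionError):
        # Skip lines with missing or malformed values, or with zero cases.
        continue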
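
As a minimal sketch of why the `try` block is needed: `float("")` raises a `ValueError`, which is what happens on lines with missing values. The helper name `to_float_or_none` is hypothetical and is not part of the assignment; the Alaska counts are taken from the output printed in Problem 1c.

```python
def to_float_or_none(text):
    """Return text as a float, or None when the conversion fails."""
    try:
        return float(text)
    except ValueError:
        # float("") and float("abc") both end up here.
        return None

print(to_float_or_none("425"))
print(to_float_or_none(""))

# Alaska: 425 male deaths among 67287 male cases.
alaska_risk_m = to_float_or_none("425") / to_float_or_none("67287")
print(round(alaska_risk_m, 4))
```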

## Problem 3. Examining sex-specific differences in COVID risk

### Problem 3a: Create a function to calculate the statistical mean from a list of floating point numbers

In this assignment, we will want to calculate the average risk of death given COVID infection across states for both males and females. Recall from this week's lecture that the statistical mean of a list is the sum of all of its elements divided by its length. Please finish the function below so that it takes a list of values and returns their mean.

In [8]:
def calc_mean(values):
    # Your last line of code should be of the form return <value>,
    # where <value> is the computed mean of your data.

    return sum(values) / len(values)

In [9]:
# This cell is provided to you to help check your work
calc_mean(risk_m)

0.01693750057836708
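
A quick way to convince yourself the function behaves as expected is to try it on a list whose mean is easy to work out by hand. The sketch below repeats the one-line definition so it runs on its own; the list `example` is made up. It also shows one limitation worth knowing about: an empty list has length zero, so the division fails.

```python
def calc_mean(values):
    # Sum of the elements divided by the number of elements.
    return sum(values) / len(values)

# (1 + 2 + 3 + 6) / 4 = 12 / 4 = 3
example = [1.0, 2.0, 3.0, 6.0]
print(calc_mean(example))

# An empty list has no mean: len([]) is 0, so the division raises an error.
try:
    calc_mean([])
except ZeroDivisionError:
    print("calc_mean needs at least one value")
```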

### Problem 3b. Calculate mean differences across the sexes in risk of death given COVID infection

Estimate the mean risk of death from COVID for each sex using the function created in the previous step. Assign the results for each sex to the variables `mean_risk_m` and `mean_risk_f`.

In [11]:
# Write your code here

mean_risk_m = calc_mean(risk_m)
mean_risk_f = calc_mean(risk_f)

In [12]:
# This cell is provided to you to help check your work
print(mean_risk_m,mean_risk_f)

0.01693750057836708 0.01336029006238279


### Problem 3c. Interpret your results

Which sex appears to be at greater risk of death, on average, if they are infected with COVID *(2pt)*? Do you think any difference might stem from biological or behavioural differences, and why *(2pt)*? Feel free to speculate.

*Solution:* The average risk for males (`mean_risk_m`) is higher than the average risk for females (`mean_risk_f`), so males appear to be at greater risk of death, on average, once infected. One possible behavioural explanation is that men may be more exposed to COVID through occupations with higher rates of exposure and death, such as construction, transportation, and factory work.
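
One hedged way to express the size of this gap is the ratio of the two means. The sketch below hard-codes the two values printed in Problem 3b; the name `risk_ratio` is introduced here for illustration.

```python
# Mean risks copied from the output printed in Problem 3b.
mean_risk_m = 0.01693750057836708
mean_risk_f = 0.01336029006238279

# A ratio above 1 means the male mean risk is larger than the female mean risk.
risk_ratio = mean_risk_m / mean_risk_f
print(round(risk_ratio, 2))
```

A ratio of roughly 1.27 says that, averaged across states with complete data, the male risk of death given infection is about 27% higher than the female risk. It does not by itself say why.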

## Problem 4: Politics and epidemiological risk

In this section, we will combine data on the political affiliation of each state's governor with the COVID data to ask the question: **do Democrat-run or Republican-run states tend to have greater sexual disparity in risk of COVID death?**

### Problem 4a. Read in the second dataset

Open the file `state_governors.csv`. This contains information on the political party affiliation of each state's governor. Read in the lines and assign the header to the variable `gov_header` and the rest of the dataset to the variable `gov_data`.

In [13]:
# Write your code here

# The with block closes the file automatically once the lines are read.
with open("state_governors.csv", "r") as gov_file:
    gov_lines = gov_file.readlines()
gov_header = gov_lines[0]
gov_data = gov_lines[1:]

### Problem 4b. Examine the dataset

Print `gov_header`. Examine the columns and the first few lines of the data. Explain what information you think each column contains. *(1pt)*

In [14]:
## Sample print statement to inform a short answer.

print(gov_header)
print(gov_data[1:4])

state_name,party

['Alaska,republican\n', 'Arizona,republican\n', 'Arkansas,republican\n']


*Solution:* The `state_name` column contains the name of each state, and the `party` column contains the political party of that state's governor.

### Problem 4c. Read in the political party data

Create an empty dictionary and assign it to the variable `state_govs`. Loop over the lines of `gov_data` and store the name of each state as a key in the dictionary, with the political party of its governor as the value.

In [15]:
# Write your code here

state_govs = {}
for line in gov_data:
    line_dat = line.strip().split(",")
    state = line_dat[0]
    party = line_dat[1]
    state_govs[state] = party
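
The same pattern can be tried on the three lines printed in Problem 4b. In this sketch, `sample_gov_lines` and `sample_govs` are names introduced for illustration. Lowercasing the party and stripping whitespace while building the dictionary is an optional design choice, not a requirement of the assignment. `"Puerto Rico"` is used only as an example of a key that is absent from these three lines.

```python
# The three lines printed in Problem 4b.
sample_gov_lines = ["Alaska,republican\n", "Arizona,republican\n", "Arkansas,republican\n"]

sample_govs = {}
for line in sample_gov_lines:
    state, party = line.strip().split(",")
    # Normalising here means later comparisons do not need to repeat it.
    sample_govs[state.strip()] = party.strip().lower()

print(sample_govs["Alaska"])

# .get returns a fallback instead of raising KeyError for a missing state.
print(sample_govs.get("Puerto Rico", "unknown"))
```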

### Problem 4d. Sexual disparity in risk of COVID death by political party

Create two empty lists and assign them, respectively, to the variables `democrat_disp` and `repub_disp`. 

Then, loop over the original COVID data, similar to problem 2b, but this time computing the *disparity* between male and female risk of death and adding the value to either the democratic or republican list based on the state governor. Referencing the formula given in question 2b, please estimate the sex-based disparity in risk of death given COVID infection using the following formula:

```risk_disp = covid_risk_m - covid_risk_f```

For each row of COVID data, you will want to determine the state and look up the political affiliation of the governor of the state in the `state_govs` dictionary. Then append the risk disparity value to either `democrat_disp` or `repub_disp`, depending on whether the state in the current line has a Democrat or Republican governor. You will want to use an if statement to accomplish this.

HINT: when comparing strings, use `.lower()` or `.upper()` to make sure the cases match. You may also wish to check for and remove any extraneous whitespace.

In [16]:
# Write your code here

democrat_disp = []
repub_disp = []
for line in data:
    line_dat = line.strip().split(",")
    state = line_dat[0]
    party =  state_govs[state]
    try:
        inf_m  = float(line_dat[3])
        inf_f = float(line_dat[4])
        death_m = float(line_dat[10])
        death_f = float(line_dat[11])
        state_risk_m = (death_m/inf_m)
        state_risk_f = (death_f/inf_f)
        risk_disp = state_risk_m - state_risk_f
        if party.lower() == "republican":
            repub_disp.append(risk_disp)
        elif party.lower() == "democrat":
            democrat_disp.append(risk_disp)
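    except (ValueError, IndexError, ZeroDivisionError):
        # Skip lines with missing or malformed values, or with zero cases.
        continue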
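
The grouping logic can be checked on a few made-up rows where the answer is easy to compute by hand. Everything in this sketch is hypothetical: the state names, the counts, and the simplified five-column layout (state, male cases, female cases, male deaths, female deaths) are for illustration only and do not match the real file's column indices.

```python
# Hypothetical rows: state, male cases, female cases, male deaths, female deaths.
rows = [
    "State A,1000,1000,20,10",
    "State B,2000,2000,30,20",
    "State C,1000,1000,,",   # missing deaths, so this row is skipped
]
# Party labels with inconsistent case and whitespace, to exercise the hint above.
govs = {"State A": "Democrat", "State B": " republican ", "State C": "democrat"}

disp_by_party = {"democrat": [], "republican": []}
for row in rows:
    fields = row.split(",")
    party = govs[fields[0]].strip().lower()
    try:
        row_risk_m = float(fields[3]) / float(fields[1])
        row_risk_f = float(fields[4]) / float(fields[2])
    except (ValueError, ZeroDivisionError):
        continue
    if party in disp_by_party:
        disp_by_party[party].append(row_risk_m - row_risk_f)

print(disp_by_party)
```

Here State A contributes 20/1000 - 10/1000 to the Democrat list and State B contributes 30/2000 - 20/2000 to the Republican list, while State C is dropped because its death counts are missing.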

### Problem 4e. Average sex-based disparity for states run by each political party

Please use the `calc_mean` function you defined above to calculate the average disparity in COVID death risk across states controlled by each major political party in the US. Assign the means from `repub_disp` and `democrat_disp` to the variables `repub_mean_risk` and `democrat_mean_risk`, respectively.

In [17]:
# Write your code here

repub_mean_risk = calc_mean(repub_disp)
democrat_mean_risk = calc_mean(democrat_disp)

In [18]:
# This cell is provided to you to help check your work
print(repub_mean_risk,democrat_mean_risk)

0.003419422241464342 0.0037199713357880535


### Problem 4f. Interpret your results

1) Please interpret the disparity metric from problem 4d. What does a value above zero indicate? What would a value below zero indicate? *(4 pts)*


2) Do Democrat-run or Republican-run states display, on average, greater disparity between risk of COVID death for males vs for females *(2pts)*? 

3) Do you think the observed difference is significant or meaningful? Explain why or why not *(2pts)*. In future weeks, we will learn how to evaluate this latter question statistically, but for now, see if you can create your own argument. **Speculate on any possible reasons for the observed difference**. Does our analysis of COVID death rates by dominant political party provide any possible sociopolitical causes for disparity between the sexes? *(4pts)*



*Solution*

1) A value above zero means that males are more likely to die given infection with COVID. A value below zero indicates that females are more likely to die given COVID infection.

2) Democrat-run states have a slightly higher disparity, on average, than Republican-run states. While males are more likely to die given COVID infection across all states, this difference is slightly greater in states with a Democratic governor.

3) Please accept any well-reasoned answer. Use your own discretion in evaluating thoughtfulness and engagement with the question.
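
When weighing whether the difference is meaningful, it may help to look at its size. The sketch below hard-codes the two means printed in Problem 4e; `gap` and `relative_gap` are names introduced here. It only describes the magnitude of the difference and is not a statistical test.

```python
# Mean disparities copied from the output printed in Problem 4e.
repub_mean_risk = 0.003419422241464342
democrat_mean_risk = 0.0037199713357880535

# Absolute gap between the two party averages.
gap = democrat_mean_risk - repub_mean_risk

# The same gap expressed relative to the Republican average.
relative_gap = gap / repub_mean_risk

print(round(gap, 6))
print(round(relative_gap * 100, 1), "percent")
```

The absolute gap is about 0.0003, which is small compared with either mean. Whether it reflects a real pattern depends on how much the disparity varies from state to state, which this calculation does not address.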