In [7]:
import pandas as pd
covid_raw_data = pd.read_csv('covid_raw_w2.csv')
covid_raw_data.head()

Unnamed: 0.1,Unnamed: 0,State,Date,Total_cases,Male_cases,Female_cases,Male_cases_pct,Female_cases_pct,Male_cases_rate,Female_cases_rate,Total_deaths,Male_deaths,Female_deaths,Male_deaths_pct,Female_deaths_pct,Male_deaths_rate,Female_deaths_rate,pop,source
0,1,Alaska,30-Oct,132645.0,67287.0,64852.0,50.73,48.89,17450.9,18374.95,699.0,425.0,274.0,60.8,39.2,110.22,77.63,731545,https://coronavirus-response-alaska-dhss.hub.a...
1,2,Arizona,30-Oct,1166060.0,561976.0,599451.0,48.19,51.41,16272.94,17160.29,21153.0,12392.0,8748.0,58.58,41.36,358.83,250.43,7278717,https://www.azdhs.gov/preparedness/epidemiolog...
2,4,California,30-Oct,4647587.0,2221547.0,2356327.0,47.8,50.7,11419.62,11964.09,71519.0,41696.0,29537.0,58.3,41.3,214.33,149.97,39512223,https://update.covid19.ca.gov/
3,5,Colorado,30-Oct,740461.0,360012.0,370008.0,48.62,49.97,12946.21,13453.33,8186.0,4508.0,3666.0,55.07,44.78,162.11,133.28,5758736,https://covid19.colorado.gov/case-data
4,6,Connecticut,30-Oct,402583.0,192749.0,208210.0,47.88,51.72,11032.32,11350.47,8764.0,4338.0,4416.0,49.5,50.39,248.29,240.74,3565287,https://portal.ct.gov/Coronavirus


# Extract a `DataFrame` with only the columns we care about

In [8]:
important_columns = ['State', 'Total_cases', 'pop']
covid_data = covid_raw_data[important_columns]
covid_data.head()

,State,Total_cases,pop
0,Alaska,132645.0,731545
1,Arizona,1166060.0,7278717
2,California,4647587.0,39512223
3,Colorado,740461.0,5758736
4,Connecticut,402583.0,3565287


In [9]:
covid_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41 entries, 0 to 40
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   State        41 non-null     object 
 1   Total_cases  41 non-null     float64
 2   pop          41 non-null     int64  
dtypes: float64(1), int64(1), object(1)
memory usage: 1.1+ KB


## Large states: population at least 15 million

When we retrieved the data, some values were missing, and we omitted every state with missing data. For example, the source had no `Total_cases` value for Florida, so Florida is not in the dataset. Other states are missing as well: the cleaned dataset has 41 rows in total, including the District of Columbia.
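
The cleaning step itself is not shown in this notebook. As a minimal sketch of how rows with a missing case count could be dropped, assuming pandas `DataFrame.dropna` and a toy frame with made-up values (not rows from the real dataset):

```python
import numpy as np
import pandas as pd

# Toy data for illustration only; these are not values from the real dataset.
toy = pd.DataFrame({
    'State': ['A', 'B', 'C'],
    'Total_cases': [100.0, np.nan, 300.0],
    'pop': [1_000, 2_000, 3_000],
})

# Keep only the rows where Total_cases is present, then renumber the index.
toy_clean = toy.dropna(subset=['Total_cases']).reset_index(drop=True)
print(toy_clean)
```

Here state `B` is removed because its `Total_cases` value is missing, which mirrors how Florida ended up outside the dataset.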

In [10]:
large_population = covid_data['pop'] > 15_000_000
large_states = covid_data[large_population]
large_states.head()

,State,Total_cases,pop
2,California,4647587.0,39512223
25,New York,2537145.0,19453561
35,Texas,3511739.0,28995881


In [11]:
# Pooled rate: total cases in the large states divided by their total population
large_pop_col = large_states['pop']
large_pop_sum = large_pop_col.sum()
large_case_col = large_states['Total_cases']
large_case_sum = large_case_col.sum()
large_case_avg = large_case_sum / large_pop_sum
print(f'Large case avg: {large_case_avg}')

Large case avg: 0.1216037804650469


## Small states: population less than 1 million

In [12]:
small_population = covid_data['pop'] < 1_000_000
small_states = covid_data[small_population]
small_states.head()

,State,Total_cases,pop
0,Alaska,132645.0,731545
5,Delaware,143950.0,973764
6,District of Columbia,64240.0,705749
33,South Dakota,154482.0,884659
37,Vermont,40191.0,623989


In [13]:
# Pooled rate: total cases in the small states divided by their total population
small_pop_col = small_states['pop']
small_pop_sum = small_pop_col.sum()
small_case_col = small_states['Total_cases']
small_case_sum = small_case_col.sum()
small_case_avg = small_case_sum / small_pop_sum
print(f'Small case avg: {small_case_avg}')

Small case avg: 0.14192263360946455
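
Cells 11 and 13 repeat the same computation: total cases for a group divided by the group's total population. A sketch of a reusable helper is shown below. The function name `pooled_case_rate` is hypothetical, and the frame is rebuilt from the three large-state rows printed above so that the example runs on its own.

```python
import pandas as pd

def pooled_case_rate(states: pd.DataFrame) -> float:
    """Total cases divided by total population across all rows."""
    return states['Total_cases'].sum() / states['pop'].sum()

# The three large-state rows from the output of cell 10.
large_states = pd.DataFrame({
    'State': ['California', 'New York', 'Texas'],
    'Total_cases': [4647587.0, 2537145.0, 3511739.0],
    'pop': [39512223, 19453561, 28995881],
})

large_rate = pooled_case_rate(large_states)
print(f'Large case avg: {large_rate}')
```

This pooled rate weights each state by its population, so in general it is not equal to the mean of the per-state rates.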


In [19]:
print('covid_data.head():')
print(covid_data.head())
print()
print('large_pop_sum:')
print(large_pop_sum)
print('large_case_sum:')
print(large_case_sum)
print('large_case_avg:')
print(large_case_avg)
print()
print('small_pop_sum:')
print(small_pop_sum)
print('small_case_sum:')
print(small_case_sum)
print('small_case_avg:')
print(small_case_avg)

covid_data.head():
         State  Total_cases       pop
0       Alaska     132645.0    731545
1      Arizona    1166060.0   7278717
2   California    4647587.0  39512223
3     Colorado     740461.0   5758736
4  Connecticut     402583.0   3565287

large_pop_sum:
87961665
large_case_sum:
10696471.0
large_case_avg:
0.1216037804650469

small_pop_sum:
4498465
small_case_sum:
638434.0
small_case_avg:
0.14192263360946455


# Old instructions

## Notebook format to submit for homework

Use the following format in your notebook, with at least one cell (and usually several) per section. Use these titles as Markdown headings so that each section is easily identifiable. Any heading level is fine as long as it is the same for every section. You are welcome to add subheadings as well. (NEED TO ADJUST)

Use the following format in your notebook. Each section (except Intro) will have multiple blocks, and some sections will have a mix of markdown and code blocks. The type of block you'll need to create and fill out to answer the question/task will be denoted by either `Code` or `Markdown` in parentheses. Follow the questions in the order below to complete your data story.

Complete each of the following tasks in the type of cell indicated.

1. **Intro:** In a Markdown cell, write a brief introduction (4-6 sentences) that describes the data, methods, computations, and overall conclusions of your notebook.


2. **Data:** This section will have both Markdown and Code cells. The type of cell in which to complete each task is given in parentheses at the end of the question.
    1. Using pandas, read in the `covid_sex.csv` file and assign it to the variable named `covid_data`. `(Code block)`
    1. Call the `.head()` method on the DataFrame `covid_data`. `(Code block)`
    1. Use your result in 2B. to describe your data in a `Markdown block`. Make sure to discuss the following:
        1. The number of rows (observations) and columns in the data set. *Note: You might want to reference the `pd.DataFrame.shape` attribute.*
        1. What each observation ('row') represents.
        1. A description of the following columns: `Total_cases`, `Male_cases`, `Female_cases`, `Male_cases_pct`, `Female_cases_pct`, `Total_deaths`, `Male_deaths`, `Female_deaths`, `Male_deaths_pct`, and `Female_deaths_pct`. Make sure to talk about the type of the variable, and any relationships/patterns you notice between variables.
        1. What column names are important? (*might be specific to the data story*)
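
Tasks A to C might look like the following sketch. It reads from an in-memory string rather than `covid_sex.csv` so that it runs without the file. The two sample rows reuse the Alaska and Arizona values printed earlier in this notebook, restricted to three columns; the real file is assumed to have many more columns.

```python
import io
import pandas as pd

# Stand-in for covid_sex.csv (assumption: the real file has more rows and columns).
csv_text = """State,Total_cases,pop
Alaska,132645.0,731545
Arizona,1166060.0,7278717
"""

covid_data = pd.read_csv(io.StringIO(csv_text))
print(covid_data.head())

# shape is an attribute (no parentheses): (number of rows, number of columns).
n_rows, n_cols = covid_data.shape
print(n_rows, n_cols)

# dtypes reports the type of each column, which helps when describing variables.
print(covid_data.dtypes)
```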
        
        
3. **Methods:** As part of your data story, you'll be completing the following tasks. For the methods section, use a Markdown cell to explain how you'd complete each of the tasks and their sub-tasks. Feel free to divide your discussion by task.
    1. Data Cleaning
        1. Checking column names.
        1. Renaming ('cleaning') columns. 
    1. Selecting Variables and Summary Statistics
        1. Extracting multiple columns: `State`, `Total Cases`, and `pop`
        1. Computing the states with the highest and lowest total case counts.
    1. Subsetting Observations
        1. Subsetting the first 5 observations and all columns (Note: do not use .head()) 
        1. Subsetting the first 5 observations, and drilling down on columns of interest (total case counts, male case percentage, female case percentage, total deaths, male death percentage, female death percentage).
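
The methods listed above could be sketched as follows. This is one possible approach, not the required one: the frame holds made-up values, and the new column names (`state`, `total_cases`) are an arbitrary naming scheme chosen for illustration.

```python
import pandas as pd

# Toy frame with deliberately inconsistent column names (hypothetical values).
df = pd.DataFrame({
    'State': ['A', 'B', 'C', 'D', 'E', 'F'],
    'Total Cases': [10, 60, 30, 40, 50, 20],
    'pop': [100, 200, 300, 400, 500, 600],
})

# Data cleaning: inspect the column names, then rename them with a mapping.
print(list(df.columns))
df = df.rename(columns={'State': 'state', 'Total Cases': 'total_cases'})

# Selecting variables: indexing with a list of names returns a DataFrame.
subset = df[['state', 'total_cases', 'pop']]

# Summary statistics: idxmin/idxmax return the row labels of the extremes.
lowest_state = df.loc[df['total_cases'].idxmin(), 'state']
highest_state = df.loc[df['total_cases'].idxmax(), 'state']

# Subsetting observations without .head(): positional slicing with iloc.
first_five = df.iloc[:5]
first_five_two_cols = df.iloc[:5][['total_cases', 'pop']]
print(lowest_state, highest_state)
```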
    
    
4. **Computation:** This section will have both Markdown and Code cells. Complete the following tasks in the type of cell specified in parentheses.
    * Data Cleaning:         
        1. Get the column names of `covid_data` and assign it to the variable `column_names`. `(Code block)`
        1. What do you notice about how our column names are formatted, i.e. are they formatted similarly, differently, consistently, etc.? Why might standardization of column names be important for our future computations? Answer in 2 lines for each question. `(Markdown)`
        1. <s> We decide we want to standardize our column names, and thus rename columns. 'Standardizing' column names means all column names follow a standard, consistent convention. Create a naming scheme of your choice, and create a dictionary mapping the old column names to your new ones. Assign the dictionary to the variable `new_column_names`. *Note: we won't touch setting these new column names just yet, you just have to create the dictionary.* `(Code)` </s>
        1. <s> In a markdown cell, explain your choice for the naming convention used in 4D. in 2 lines. `(Markdown)` </s>
    * Selecting Variables and Summary Statistics
        1. Select the variables `State`, `Total Cases`, and `pop`. Keep all the observations. Assign your result to the variable `columns_of_interest`. `(code)`
        1. Using implicit indexing, subset the first five observations in `covid_data`. Next, select the `case_count` and `pop` variables. Assign your result to the variable `first_five_states`. `(Code)`
        1. In a markdown block, describe in 1 line a potential relationship between `case_count` and `pop`. In 1-2 lines, reason why this relationship makes sense. In 1-2 lines, suggest a new computation we might want to perform that would allow us to compare covid cases between states, accounting or 'adjusting' for population. `(Markdown)`
        1. Compute the minimum and maximum of the column `Total Cases` and assign the variables to `min_cases` and `max_cases`. Refer to the **Helpful Documentation** section for documentation on functions that help you compute these aggregates. `(code)`
        1. Compute the minimum and maximum of the column `pop` and assign them to the variables `smallest_population` and `largest_population`. `(code)`
        1. Compute the minimum and maximum of the column `Total Cases` and assign them to the variables `smallest_case_count` and `largest_case_count`. `(code)`
        1. Would the state with the greatest population necessarily have the largest case count? Give your reasoning in 1-2 sentences. `(markdown)`.
   * Subsetting Observations (this should be unindented)
        1. Let's verify whether the state with the largest population has the largest case count. We will complete this in multiple steps. First, create a boolean expression that checks whether each value in the `pop` column is the maximum value (in that column). Assign this to the variable `max_population_mask`. `(Code)`
        1. Now, let's subset the observation that has the largest value in the column `pop`. Use the mask from above (A.) and assign your result to the variable `state_with_largest_population_stats`. `(Code)`
        1. Finally, we want to know what state has the largest population. Select a single column from your variable `state_with_largest_population_stats` that gives you *just* the name of the state with the largest population. You should return a 1x1 dataframe with just one value, and assign to the variable `state_with_largest_population`. `(Code)`
        1. In a markdown block, explain each of the steps above in one line each (A.-C.). Start by explaining the overarching goal of this analysis: what are we trying to find? Then, state what you accomplish with each step, and explain the meaning of the code (i.e. 'translate' the code syntax to plain language). Feel free to reference code enclosed by \` \` where needed. `(Markdown)`
        1. Call the function [.squeeze()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.squeeze.html) on `state_with_largest_population`, and assign it to the variable name `state_scalar`. (need a better var name here). `(Code)`
       
       \*FYI: `pd.DataFrame.squeeze()` is useful for converting a 1x1 DataFrame into a 'scalar' value (an individual int, float, str, etc.), which is handy for further filtering.
        1. Compare the case count for the state with the largest population to the case count of the state with the largest case count. Use a boolean expression to compare `state_scalar` to `highest_case_count`. *Hint*: Our result in the question above is a `str` that we can pass as a condition in subsetting.
        1. In one full sentence, state your result from the question above. In a second line, interpret this result (what does it mean in terms of the variables that you use?). In a third line, give a short explanation or justification for your result. `(Markdown)`
        1. Find the state with the largest case count. *Hint: The method follows that of A.-C. with a minor modification.*
        1. In 2 lines, state the result of the analysis of the above question, and a potential interpretation. `Markdown`
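
The mask, subset, select, and squeeze steps from the Subsetting Observations tasks can be sketched as below. To keep the example self-contained, the frame contains only the three large-state rows printed earlier in this notebook, so the result applies to that small sample rather than to the full dataset.

```python
import pandas as pd

# Three rows taken from the large-state output earlier in this notebook.
covid_data = pd.DataFrame({
    'State': ['California', 'New York', 'Texas'],
    'Total_cases': [4647587.0, 2537145.0, 3511739.0],
    'pop': [39512223, 19453561, 28995881],
})

# A. Boolean mask: True only where pop equals the column maximum.
max_population_mask = covid_data['pop'] == covid_data['pop'].max()

# B. Keep the rows where the mask is True (a one-row DataFrame here).
state_with_largest_population_stats = covid_data[max_population_mask]

# C. Double brackets keep the result a 1x1 DataFrame instead of a Series.
state_with_largest_population = state_with_largest_population_stats[['State']]

# squeeze() collapses the 1x1 DataFrame to a plain scalar.
state_scalar = state_with_largest_population.squeeze()
print(state_scalar)
```

The same pattern finds the state with the largest case count if `Total_cases` replaces `pop` in step A.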
        
        
5. **Discussion and Conclusion:** Wow! We did a lot. Let's try to summarize our main findings in this section, using a Markdown block. 
    1. Data - 2 lines reiterating a general description of the data.
    2. Data Description, Summary Statistics, and Subsetting: 
        * Mention the states with the largest and smallest population and case counts, respectively. 
        * Interpret and discuss the relationships between the population and case counts. Mention your answer to X) and why this might be a better metric for comparison.
        * Mention your procedure for determining the state (name) with the highest population. State whether or not the largest state had the largest case count. Discuss briefly why this might or might not have been the case, drawing from anecdotal experience or empirical evidence.
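
One population-adjusted metric that the discussion could draw on is cases per capita. The sketch below uses the five rows from the `covid_data.head()` output earlier in this notebook. The column name `cases_per_capita` is an illustrative choice, not something the assignment prescribes.

```python
import pandas as pd

# The five rows shown by covid_data.head() earlier in this notebook.
covid_head = pd.DataFrame({
    'State': ['Alaska', 'Arizona', 'California', 'Colorado', 'Connecticut'],
    'Total_cases': [132645.0, 1166060.0, 4647587.0, 740461.0, 402583.0],
    'pop': [731545, 7278717, 39512223, 5758736, 3565287],
})

# assign returns a new DataFrame and leaves the original untouched.
with_rate = covid_head.assign(
    cases_per_capita=covid_head['Total_cases'] / covid_head['pop']
)

# Compare the state with the most cases to the state with the highest rate.
most_cases = with_rate.loc[with_rate['Total_cases'].idxmax(), 'State']
highest_rate = with_rate.loc[with_rate['cases_per_capita'].idxmax(), 'State']
print(most_cases, highest_rate)
```

Among these five rows, California has the most cases while Alaska has the highest rate, which shows how raw counts and population-adjusted rates can rank states differently.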