# Lab Assignment 4 -- Data and Pandas
In this lab, you will complete a series of exercises related to the lecture material on data and Pandas using real datasets. Each exercise focuses on a single dataset and contains multiple steps.

In [None]:
# We import the libraries you will need
import math
import pandas as pd

## Exercise 1 -- Unemployment Data
Below, we load data on unemployment in the United States at the state level.

In [None]:
# Do Not Edit
url = "https://datascience.quantecon.org/assets/data/state_unemployment.csv"
unemp = pd.read_csv(url, parse_dates=["Date"])

## Exercise 1a -- Displaying Data
Complete the following steps:
- In the two cells below, display the top 7 rows and the bottom 3 rows of `unemp`.
- In the third cell, change `unemp` so its column names are strictly lowercase.
- Display the resulting DataFrame by calling `unemp` at the bottom of the third cell.

From a "tidy" perspective, what is an observation in this data? Explain why. Answer in the Markdown cell below.
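
As a reminder of the methods involved, here is a minimal sketch on a made-up DataFrame. The name `toy` and its columns are hypothetical and are not part of the lab data.

```python
import pandas as pd

# Hypothetical toy data, not the lab's dataset
toy = pd.DataFrame({"City": ["A", "B", "C", "D"], "Pop": [10, 20, 30, 40]})

first_two = toy.head(2)  # first 2 rows
last_one = toy.tail(1)   # last row

# Lowercase every column name using the string accessor on the column Index
toy.columns = toy.columns.str.lower()
```

In a notebook, writing `toy` on the last line of a cell displays the DataFrame.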

In [None]:
## Exercise 1a -- Top 7


In [None]:
## Exercise 1a -- Bottom 3


In [None]:
# Exercise 1a -- Rename Columns


### Response to Exercise 1a

## Exercise 1b -- Creating Variables & Index Setting
Complete the following steps:
- Create a column in `unemp` called "year" that is equal to the year of the date. 
- Change the index of `unemp` so that the index (or indices) reflect the observational units. 
- Display the indices.  

In the Markdown cell below, address the following prompts:
1. How many indices are there?
2. Why are there this many indices? Write an equation that explains it.
3. Is one of your indices a `DatetimeIndex` object?
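
The sketch below shows the general mechanics on a made-up DataFrame: extracting a year from a datetime column, setting a multi-level index, and inspecting it. The names `toy`, `store`, and `sales` are hypothetical, and the choice of index here is only for illustration.

```python
import pandas as pd

# Hypothetical toy data, not the lab's dataset
toy = pd.DataFrame({
    "store": ["X", "X", "Y", "Y"],
    "date": pd.to_datetime(["2020-01-01", "2021-01-01", "2020-01-01", "2021-01-01"]),
    "sales": [1.0, 2.0, 3.0, 4.0],
})

# The .dt accessor exposes datetime components such as the year
toy["year"] = toy["date"].dt.year

# Passing a list of columns to set_index builds a MultiIndex
toy = toy.set_index(["store", "date"])

n_levels = toy.index.nlevels                      # number of index levels
date_level = toy.index.get_level_values("date")   # values of one level
is_datetime = isinstance(date_level, pd.DatetimeIndex)
```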

In [None]:
# Exercise 1b Code


### Response to Exercise 1b

## Exercise 1c -- Plotting Annual Averages
Complete the following steps:
1. Using `tiny_unemp`, find the year-state average of the unemployment rate and save it to `yearly_state_unemp`.
2. Reshape `yearly_state_unemp` by using `unstack()`. Ensure that your row indices are years and your column variables are states.  
3. Display `yearly_state_unemp` by making it the last line in the cell.
4. Note that the `unemploymentrate` level is not very useful because all of the numbers are unemployment rates. Remove it by using `.droplevel()` on `yearly_state_unemp`. Call this new DataFrame `clean_state_unemp`. You may want to reference the [documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.droplevel.html) to see which argument you will need.
5. Use the `.plot()` method on `clean_state_unemp`.
6. In the next cell, use the `.plot()` method on `yearly_state_unemp`.

In the Markdown cell below **answer the following questions**: 
1. What is the most salient real-world phenomenon that is visible in the plot?
2. Compare the two plots. How did removing the `unemploymentrate` level change the plot?

**Hints**
- You can use `df.drop()` ([documentation here](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html)) to get rid of the `laborforce` column when creating `yearly_state_unemp`.
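
The group, reshape, and drop-level pattern can be sketched on made-up data as follows. The names `toy`, `store`, and `sales` are hypothetical, and the toy data keeps its keys in ordinary columns, so your own code may need to group on index levels instead.

```python
import pandas as pd

# Hypothetical toy data, not the lab's dataset
toy = pd.DataFrame({
    "store": ["X", "X", "X", "Y", "Y", "Y"],
    "year": [2020, 2020, 2021, 2020, 2020, 2021],
    "sales": [1.0, 3.0, 5.0, 2.0, 4.0, 6.0],
})

# Average within each (year, store) pair; the result has a MultiIndex on rows
yearly = toy.groupby(["year", "store"]).mean()

# Move the "store" level from the rows to the columns
wide = yearly.unstack(level="store")   # columns: ("sales", "X"), ("sales", "Y")

# Drop the outer column level, leaving one column per store
flat = wide.droplevel(0, axis=1)
```

Comparing `wide.columns` with `flat.columns` shows what the extra level looked like before it was dropped.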

In [None]:
# Do not edit this code
tiny_unemp = unemp.loc[["Colorado", "California", "Alabama", "New York", "Florida"]]

In [None]:
# Exercise 1c -- Steps 1-3


In [None]:
# Exercise 1c -- Steps 4 & 5


In [None]:
# Exercise 1c -- Step 6


### Response to Exercise 1c

## Exercise 1d -- Using Lags to Plot Difference
Instead of plotting the average unemployment rate over time, we will plot the annual change in the unemployment rate, relative to the previous year, for each state. To do this, complete the following steps:
1. Create a DataFrame called `shifted_unemp` equal to `clean_state_unemp` shifted by 1.
2. Create a DataFrame called `change_unemp` equal to the difference between `clean_state_unemp` and `shifted_unemp`, divided by `shifted_unemp`.
3. Call `.plot()` on `change_unemp`.

Answer the following questions in the Markdown cell below.
1. Why is the year 2000 no longer being plotted? Look at the `change_unemp` DataFrame if you are unsure.
2. Which state saw the largest annual increase in unemployment during this period?
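
Here is a minimal sketch of the lag-and-divide pattern on a made-up single-column DataFrame. The names `levels`, `lagged`, and `rel_change` are hypothetical.

```python
import pandas as pd

# Hypothetical toy data indexed by year, not the lab's dataset
levels = pd.DataFrame({"X": [2.0, 4.0, 3.0]}, index=[2020, 2021, 2022])

# shift(1) moves every value down one row, so each row holds the previous year's value
lagged = levels.shift(1)

# Change relative to the previous year
rel_change = (levels - lagged) / lagged
```

Displaying `lagged` and `rel_change` side by side is a quick way to see what happens to the first row.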

In [None]:
# Exercise 1d code




### Response to Exercise 1d

## Exercise 2
In this question, we're going to look at data on daily Covid cases in British Columbia from the [COVID-19 Canada Open Data Working Group](https://github.com/ccodwg/Covid19Canada). This data is broken down into five health regions:
- Fraser Health (Fraser)
- Interior Health (Interior)
- Northern Health (Northern)
- Vancouver Coastal Health (Vancouver Coastal)
- Vancouver Island Health Authority (Island)

You can see the geography of these regions below (image from gov.bc.ca).

<img src = "https://www2.gov.bc.ca/assets/gov/health/managing-your-health/mental-health-substance-use/find-services-map-large.jpg"/>

In [None]:
# Load the data -- don't edit this cell
# Forward slashes keep the relative path valid on Windows, macOS, and Linux
cases_bc = pd.read_csv("../data/covid_cases_bc.csv")
cases_bc.head()

## Exercise 2a -- Checking the Data
`cases_bc` contains daily reported COVID-19 cases per hundred thousand people in BC for the year 2021. The data is broken down by health region. Before working with the data, you should learn how it is structured and check it for potential errors. Answer the following questions in the first Markdown cell below.
1. What is an observation (or row) in the dataset?
2. What is the index? Could we turn one of our variables into an index? Which one?
3. Does it make sense for any of the values in the table to be negative? Why or why not?

Now complete the following steps:
- Make the date the index and check whether it is a `DatetimeIndex`. Afterwards, display `cases_bc` by making it the last line of the cell below.
- In the second cell, call `.dropna()` on `cases_bc`.
- In the third cell, use `.any()` and comparison operators **applied to `cases_bc`** to display all dates on which at least one of the health regions has a negative value. Call this DataFrame `neg_values`. Display `neg_values` by making it the last line of the cell. (**Hint**: Observe what happens when you use a comparison operator on a DataFrame. You can use the output to easily form an index for your DataFrame.)
- If any dates had negative values, use comparison operators to set all negative values in `cases_bc` to `NaN`.


Answer the following questions in the second Markdown cell.

4. Did dropping missing values change the DataFrame at all?
5. Did you find any negative values? If so, they have been turned into `NaN` values and will not be used in future analysis. What would be a possible alternative to this approach?
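
The boolean-mask pattern behind these steps can be sketched on made-up data as follows. The names `toy`, `mask`, and `rows_with_neg` are hypothetical, and the toy data uses a plain integer index rather than dates.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the lab's dataset
toy = pd.DataFrame({"a": [1.0, -2.0, 3.0], "b": [0.5, 1.5, -0.1]})

# A comparison on a DataFrame returns a DataFrame of booleans with the same shape
mask = toy < 0

# any(axis=1) is True for each row with at least one True; use it to select rows
rows_with_neg = toy[mask.any(axis=1)]

# Assigning through the boolean DataFrame replaces only the flagged cells
toy[mask] = np.nan
```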


### Response to Exercise 2a Questions 1-3


In [None]:
# Exercise 2a -- Step 1


In [None]:
# Exercise 2a -- Step 2


In [None]:
# Exercise 2a -- Step 3


In [None]:
# Exercise 2a -- Step 4


### Response to Exercise 2a Questions 4-5

## Exercise 2b -- Aggregations
Using aggregators with the axis argument, complete the following steps:
1. At each date, find the minimum number of cases per 100,000 across health regions. Print the top 3 rows of the resulting Series.
2. For each health region, what was the median number of daily cases per 100,000 in 2021? Print the resulting Series. (**Hint:** Think about how long this Series should be.)
3. What was the maximum number of daily cases per 100,000 across all health regions? In which health region did this maximum occur? On what day was the maximum attained?

**Hint:** For the last step, you might need to aggregate twice. You may also want to use the aggregator `.idxmax()` which returns the index of the maximum value. 
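
To illustrate how the `axis` argument and `.idxmax()` behave, here is a small sketch on made-up data. The names `toy`, `row_min`, and the other variables are hypothetical.

```python
import pandas as pd

# Hypothetical toy data, not the lab's dataset
toy = pd.DataFrame(
    {"a": [1.0, 5.0, 3.0], "b": [4.0, 2.0, 6.0]},
    index=["d1", "d2", "d3"],
)

row_min = toy.min(axis=1)        # one value per row (aggregates across columns)
col_median = toy.median(axis=0)  # one value per column (aggregates down rows)

# Aggregating twice collapses the whole table to a single number
col_max = toy.max(axis=0)
overall_max = col_max.max()

# idxmax returns the label at which the maximum occurs
max_col = col_max.idxmax()       # column label holding the overall maximum
max_row = toy[max_col].idxmax()  # row label within that column
```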

In [None]:
# Exercise 2b -- Step 1


In [None]:
# Exercise 2b -- Step 2


## Exercise 2c -- Classifying Variance
Averages and medians communicate some notion of the statistical center of data. Similarly, the sample variance of data gives us some sense of how dispersed the data is around that center. A low variance means the data is relatively concentrated whereas a high variance means the data is relatively dispersed. The sample variance of a column of data $x$ can be given by the following equation
$$
var(x) = \frac{1}{n-1} \sum_{i=1}^{n} \left(x_i - \bar{x}\right)^2
$$
where $n$ is the number of elements in $x$, $\bar{x}$ is the average of $x$, and $x_i$ is a single element in $x$. Complete the following steps:
1. The DataFrame and Series method `.var()` automatically calculates the variance of each column or row in a DataFrame. Use `.var()` to calculate the variance in the daily cases per 100,000 for each health region in 2021. Save the result as a DataFrame named `hr_var` (`.var()` returns a Series, which `.to_frame()` converts to a DataFrame).
2. Currently, `hr_var` has a single column indexed by the integer 0. Rename this column so it is called "s_var".
3. Using a list comprehension, create a new column in `hr_var` called "classification" which is equal to
    - "High" if the variance for a health region is strictly greater than 300,  
    - "Medium" if the variance for a health region is strictly greater than 150 and less than or equal to 300
    - "Low" if the variance for a health region is less than or equal to 150. 
    
4. Display the resulting DataFrame by having `hr_var` as your last line in the cell. 
5. In the second cell, use `cases_bc`, indexing, and `.plot()` to plot the daily cases per 100,000 for the Northern and Island health regions.

Finally, in the Markdown cell below **answer the following questions**: What are the classifications for the Northern and Island regions? What features of the two lines you plotted reflect these classifications?


**Hint:** For step 3, you will want to use nested `if`/`else` expressions within the list comprehension.
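
The sketch below checks the variance formula by hand on made-up data and shows the shape of a list comprehension with nested conditional expressions. The thresholds, labels, and column names here are hypothetical and differ on purpose from those in the exercise.

```python
import pandas as pd

# Hypothetical toy data, not the lab's dataset
toy = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [0.0, 10.0, 20.0]})

# Sample variance of column "a" computed directly from the formula
x = toy["a"]
manual_var = ((x - x.mean()) ** 2).sum() / (len(x) - 1)

# .var() returns a Series; to_frame() turns it into a one-column DataFrame named 0
spread = toy.var().to_frame()
spread = spread.rename(columns={0: "sample_var"})

# Nested conditional expressions inside a list comprehension (made-up cutoffs)
spread["label"] = [
    "Wide" if v > 50 else "Moderate" if v > 5 else "Narrow"
    for v in spread["sample_var"]
]
```

By default `.var()` uses the same $n-1$ denominator as the formula above, so `manual_var` matches the entry for column `a`.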

In [None]:
# Exercise 2c -- Steps 1-4

In [None]:
# Exercise 2c -- Step 5


### Response to Exercise 2c 

## Exercise 2d -- More Classifying & `.applymap()`
Now, we want to determine whether the number of cases per 100,000 on a given day was "High" ($> 10$), "Low" ($\leq 10$ and $>0$), or "None" ($=0$) for each region-day. To do this, complete the following steps:
1. Define a function called `classify_cases` that takes a **single number** and returns "High", "Low", or "None" according to the criteria above.
2. `.applymap()` takes a function of a single value and applies that function to each cell in a DataFrame individually (in pandas 2.1 and later, `.applymap()` is deprecated in favor of the equivalent `.map()`). Using `classify_cases` and `.applymap()`, create a DataFrame that has a classification for each region-day. Call this DataFrame `cases_bins`.
3. Print the top 5 rows of that DataFrame.
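
Here is a minimal sketch of element-wise application on made-up data. The function `sign_label` and the DataFrame `toy` are hypothetical. Because `DataFrame.applymap` is deprecated from pandas 2.1 onward in favor of `DataFrame.map`, the sketch picks whichever method the installed version provides.

```python
import pandas as pd

def sign_label(value):
    """Return a text label for a single number (made-up criteria)."""
    if value > 0:
        return "positive"
    if value < 0:
        return "negative"
    return "zero"

# Hypothetical toy data, not the lab's dataset
toy = pd.DataFrame({"a": [1.0, 0.0], "b": [-3.0, 2.0]})

# Use DataFrame.map when it exists (pandas >= 2.1), else fall back to applymap
apply_elementwise = toy.map if hasattr(toy, "map") else toy.applymap
labels = apply_elementwise(sign_label)
```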

In [None]:
# Exercise 2d code


## Exercise 2e -- Classification Count
Next, we want to count how many days of each type ("High", "Low", and "None") each health region had in 2021. Complete the following steps:
1. Using `pd.value_counts` and `.apply()`, create a DataFrame called `class_counts` where the row indices are the three classes and the columns are health regions.
2. Using the DataFrame plotting method `.plot.barh()`, create a horizontal bar plot where there are five groups of three bars.

**Hint:** You may have to use `.T` to get the right bar chart. 
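
Column-wise counting can be sketched on made-up labels as follows. The names `toy` and `counts` are hypothetical. The sketch calls the Series method `.value_counts()` inside a lambda, which counts the same way as the top-level `pd.value_counts` function; recent pandas versions emit a deprecation warning for the top-level function.

```python
import pandas as pd

# Hypothetical toy labels, not the lab's dataset
toy = pd.DataFrame({"a": ["hi", "lo", "hi"], "b": ["lo", "lo", "hi"]})

# apply runs the function on each column; the results are aligned on the labels
counts = toy.apply(lambda col: col.value_counts())

# Transposing swaps rows and columns, which changes how bars would be grouped
counts_by_column = counts.T
```

Calling `.plot.barh()` on `counts` or on `counts_by_column` groups the bars by whatever is in the row index.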

In [None]:
# Exercise 2e code


## Exercise 2f -- Choose Your Health Region
Choose one of the five health regions. Complete the following steps to find the average number of cases per 100,000 when case loads are "High" and when case loads are "Low".
1. Using `pd.concat()`, `cases_bc`, and `cases_bins`, create a two-column DataFrame called `my_health_region`. The first column should be titled "cases" and include the number of cases per 100,000 in your chosen health region for each day. The second column should be titled "class" and include the classification for that health region on each day.
2. Using `.groupby()` and an aggregator, find the average number of cases per 100,000 when case loads are "High" and when case loads are "Low".
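
The concatenate-then-group pattern can be sketched on made-up Series as follows. The names `values`, `sizes`, and `combined` are hypothetical.

```python
import pandas as pd

# Hypothetical toy Series sharing the same index, not the lab's dataset
values = pd.Series([1.0, 12.0, 3.0, 20.0], name="value")
sizes = pd.Series(["small", "big", "small", "big"], name="size")

# axis=1 places the Series side by side; their names become the column names
combined = pd.concat([values, sizes], axis=1)

# Group rows by the label column and average the numeric column within each group
group_means = combined.groupby("size")["value"].mean()
```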

In [None]:
# Exercise 2f code
