## Assignment 1: Data Analysis using Pandas

This assignment contains 13 questions, detailed below. The due date is Friday, November 8th, at 23:59. Each late day results in a loss of 20% of the total points.

The 'Daily reports (csse_covid_19_daily_reports)' file contains the daily case report for 01-01-2023 (MM-DD-YYYY). All timestamps are in UTC (GMT+0). A fuller description can be found in the [COVID-19 Data Repository by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University](https://github.com/CSSEGISandData/COVID-19).

References:

- Dong E, Du H, Gardner L. An interactive web-based dashboard to track COVID-19 in real time. Lancet Inf Dis. 20(5):533-534. doi: 10.1016/S1473-3099(20)30120-1


The field/feature/column descriptions are listed as follows:

- FIPS: US only. Federal Information Processing Standards code that uniquely identifies counties within the USA.

- Admin2: County name. US only.

- Province_State: Province, state or dependency name.

- Country_Region: Country, region or sovereignty name. The names of locations included on the Website correspond with the official designations used by the U.S. Department of State.

- Last_Update: MM/DD/YYYY HH:mm:ss (24-hour format, in UTC).

- Lat and Long: Dot locations on the dashboard. All points (except for Australia) shown on the map are based on geographic centroids, and are not representative of a specific address, building or any location at a spatial scale finer than a province/state. Australian dots are located at the centroid of the largest city in each state.

- Confirmed: Counts include confirmed and probable (where reported).

- Deaths: Counts include confirmed and probable (where reported).

- Recovered: Recovered cases are estimates based on local media reports, and state and local reporting when available, and therefore may be substantially lower than the true number. US state-level recovered cases are from the COVID Tracking Project. We have stopped maintaining the recovered cases.

- Active: Active cases = total cases - total recovered - total deaths. Since we stopped reporting the recovered cases, this value is for reference only.

- Incident_Rate: Incidence Rate = cases per 100,000 persons.

- Case_Fatality_Ratio (%): Case-Fatality Ratio (%) = 100 * Number of recorded deaths / Number of cases.

- All cases, deaths, and recoveries reported are based on the date of initial report.
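
As a quick illustration of the `Active` and `Case_Fatality_Ratio` formulas above, here is a minimal sketch on made-up numbers. The counts are hypothetical and are not taken from the dataset.

```python
# Hypothetical counts for a single region (not real data).
confirmed = 2000
deaths = 50
recovered = 1200

# Active cases = total cases - total recovered - total deaths
active = confirmed - recovered - deaths

# Case-Fatality Ratio (%) = 100 * deaths / cases
case_fatality_ratio = 100 * deaths / confirmed

print(active)               # 750
print(case_fatality_ratio)  # 2.5
```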


Note: Please download the dataset "01-01-2023.csv" from Moodle to a local path before performing the analysis, as the original data has been modified to suit the needs of this assignment.

In [None]:
import pandas as pd
import numpy as np

**Question 1 (2 points)**

Use ```pandas``` to read the downloaded file from your local path. Print the column names, and also print a general description of the data by using the ```.describe()``` function.

In [None]:
### Q1


**Question 2  (2 points)**

The data contains a few errors that need to be resolved:

- the ```Long``` column is mistakenly named ```Long_``` and needs to be renamed
- the ```Recovered``` column contains mostly missing values and needs to be deleted
- the ```Active``` column contains mostly missing values and needs to be deleted
- the ```Incident_Rate``` column was mistakenly multiplied by 100, so the original values need to be restored
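
The three kinds of fix listed above (rename, drop, rescale) might look roughly like this on a toy frame. The frame `toy` and all its values are made up for illustration; only the column names come from the assignment.

```python
import pandas as pd

# Toy frame mimicking the problems listed above (values are made up).
toy = pd.DataFrame({
    "Lat": [33.9, 52.9],
    "Long_": [67.7, -73.5],
    "Recovered": [None, None],
    "Active": [None, None],
    "Incident_Rate": [53500.0, 1200.0],  # assumed to be 100x too large
})

fixed = (
    toy.rename(columns={"Long_": "Long"})      # restore the column name
       .drop(columns=["Recovered", "Active"])  # remove the mostly-empty columns
)
fixed["Incident_Rate"] = fixed["Incident_Rate"] / 100  # undo the x100 scaling

print(fixed.columns.tolist())  # ['Lat', 'Long', 'Incident_Rate']
```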

In [None]:
### Q2


**Question 3  (2 points)**

Some timestamps in the ```Last_Update``` column are not in the year 2023. Find those rows and delete them.

**The updated dataframe should have only rows with timestamp in 2023.**

Hint: use value_counts() to count unique values first.
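
One possible way to follow the hint, sketched on a toy frame. The timestamp strings and the `Province_State` labels below are invented, and the exact string format in the real file is an assumption; `pd.to_datetime` may need a `format=` argument on the actual data.

```python
import pandas as pd

# Made-up rows; the timestamp format is assumed, not confirmed by the dataset.
toy = pd.DataFrame({
    "Province_State": ["A", "B", "C", "D"],
    "Last_Update": [
        "2023-01-02 04:20:57",
        "2022-12-31 23:10:00",
        "2023-01-02 04:20:57",
        "2021-04-02 15:13:53",
    ],
})

# Parse the strings into datetimes so the year can be read directly.
toy["Last_Update"] = pd.to_datetime(toy["Last_Update"])

# Count how many rows fall in each year before filtering.
print(toy["Last_Update"].dt.year.value_counts())

# Keep only the rows whose timestamp is in 2023.
toy_2023 = toy[toy["Last_Update"].dt.year == 2023].reset_index(drop=True)
print(len(toy_2023))  # 2
```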

In [None]:
### Q3


**Question 4  (2 points)**

There are two provinces/states that have the same latitude (```Lat```) 52.939900. Print out these two provinces/states.

In [None]:
### Q4


**Question 5  (2 points)**

Calculate and display the average Confirmed cases across all regions (report one overall average).

Calculate and display the median Deaths for U.S. counties (report one overall median).

In [None]:
### Q5


**Question 6 (2 points)**

Show the difference in the average number of ```Deaths``` between Alabama (US) and Wyoming (US).
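
A per-state average and the difference between two states could be sketched as follows. The rows are invented; only the column names and the two state names come from the assignment.

```python
import pandas as pd

# Made-up rows: two counties per state plus one non-US row.
toy = pd.DataFrame({
    "Country_Region": ["US", "US", "US", "US", "Canada"],
    "Province_State": ["Alabama", "Alabama", "Wyoming", "Wyoming", "Alberta"],
    "Deaths": [10, 30, 4, 6, 100],
})

# Restrict to US rows, then average Deaths within each state.
us = toy[toy["Country_Region"] == "US"]
state_means = us.groupby("Province_State")["Deaths"].mean()

diff = state_means["Alabama"] - state_means["Wyoming"]
print(diff)  # 15.0
```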

In [None]:
### Q6

**Question 7 (4 points)**

Create a subset of the DataFrame containing only samples collected in the U.S. where Admin2 is not "Unassigned".

Extract the State Name: Using the values in the Combined_Key column, create a new column called State_recovered that contains only the name of the province, state, or dependency. Exclude any county names and country/region information.

**Note: For all remaining data curation tasks, use this U.S. subset DataFrame.**
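
A minimal sketch of the subsetting and string extraction described above, on a toy frame. It assumes that `Combined_Key` follows the pattern "County, State, Country" for US rows; check this against the real data before relying on it.

```python
import pandas as pd

# Made-up rows; the "County, State, Country" layout is an assumption.
toy = pd.DataFrame({
    "Admin2": ["Autauga", "Unassigned", "Albany"],
    "Country_Region": ["US", "US", "US"],
    "Combined_Key": [
        "Autauga, Alabama, US",
        "Unassigned, Alabama, US",
        "Albany, Wyoming, US",
    ],
})

# Keep US rows whose county is not "Unassigned"; copy so that adding a
# column later does not trigger a chained-assignment warning.
us = toy[(toy["Country_Region"] == "US") & (toy["Admin2"] != "Unassigned")].copy()

# Split on ", " and take the middle piece, which is the state name.
us["State_recovered"] = us["Combined_Key"].str.split(", ").str[1]

print(us["State_recovered"].tolist())  # ['Alabama', 'Wyoming']
```

Note that a missing `Admin2` value also passes the `!= "Unassigned"` test, so rows without a county name may need separate handling.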

In [None]:
### Q7


**Question 8 (2 points)**

Compute the pairwise correlations between ```Confirmed```, ```Deaths```, ```Incident_Rate```, and ```Case_Fatality_Ratio```. What do you observe?
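
For reference, `DataFrame.corr()` returns a pairwise correlation matrix and uses the Pearson coefficient by default. A small sketch on invented values:

```python
import pandas as pd

# Made-up values chosen so that the correlations are easy to read.
toy = pd.DataFrame({
    "Confirmed": [100, 200, 300, 400],
    "Deaths": [1, 2, 3, 4],
    "Incident_Rate": [10.0, 8.0, 6.0, 4.0],
})

# Pairwise Pearson correlations between the selected columns.
corr = toy[["Confirmed", "Deaths", "Incident_Rate"]].corr()
print(corr)
```

In this toy frame `Deaths` rises linearly with `Confirmed` (correlation close to 1) and `Incident_Rate` falls linearly (correlation close to -1).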

In [None]:
### Q8


**Question 9 (2 points)**

Count the samples whose ```Case_Fatality_Ratio``` (%) is miscalculated, that is, not equal to 100 * ```Deaths``` / ```Confirmed```. Make sure that the denominator, ```Confirmed```, is not zero.
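
The comparison might be sketched as below on made-up rows. Using `np.isclose` instead of exact equality is a choice made here to avoid flagging rows that differ only by floating-point noise; the assignment does not specify a tolerance, so exact comparison is an equally valid reading.

```python
import numpy as np
import pandas as pd

# Made-up rows: one correct, one wrong, one with a zero denominator, one correct.
toy = pd.DataFrame({
    "Confirmed": [200, 1000, 0, 50],
    "Deaths": [4, 25, 0, 1],
    "Case_Fatality_Ratio": [2.0, 3.0, np.nan, 2.0],
})

# Rows where the ratio is actually defined.
nonzero = toy["Confirmed"] != 0

# Recompute the ratio; 0 / 0 yields NaN rather than raising an error.
expected = 100 * toy["Deaths"] / toy["Confirmed"]

# Tolerance-based comparison (an assumption, see the note above).
matches = np.isclose(toy["Case_Fatality_Ratio"], expected)
mismatch = nonzero & ~matches

print(int(mismatch.sum()))  # 1
```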

In [None]:
### Q9


**Question 10 (2 points)**

Create a new column ```Case_Fatality_Ratio_short``` to extract and store the first three digits of the original values.

Create a new column ```Case_Fatality_Ratio_calculated``` and compute the Case-Fatality Ratio (%) yourself. Store the first three digits of the computed values as well.

Note that Case-Fatality Ratio (%) = 100 * Number of recorded deaths / Number of cases.

In [None]:
### Q10


**Question 11 (2 points)**

Find the number of samples where ```Case_Fatality_Ratio_short``` is not equal to ```Case_Fatality_Ratio_calculated```. Remember to drop the missing values that appear in these two columns before counting the samples.

In [None]:
### Q11


**Question 12 (4 points)**

We define a new concept, acceptable percentage error, to measure the magnitude of error. It is calculated as the absolute percentage difference between the calculated value and the original stored value (rounded to three decimal places), using the formula:
![image.png](attachment:image.png)

Compute the acceptable percentage error and add it as a new column to the DataFrame.

Group this continuous acceptable percentage error into the following discrete bins to generate a new categorical object: [0, 0.5], (0.5, 1], (1, 10], and (10, 50]. Note that 0 is included in the first bin.

Use the ```value_counts()``` method to check the distribution of samples across these bins.
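
The binning step can be sketched with `pd.cut`, as below. The error values are invented. `include_lowest=True` is what makes the first bin closed on the left, so that 0 is kept; pandas then typically displays that first interval with a slightly lowered left edge rather than as [0, 0.5].

```python
import pandas as pd

# Made-up error values, including the edge cases 0 and 0.5.
errors = pd.Series([0.0, 0.3, 0.5, 0.8, 4.0, 25.0])

bins = [0, 0.5, 1, 10, 50]

# Right-closed intervals by default; include_lowest keeps 0 in the first bin.
binned = pd.cut(errors, bins=bins, include_lowest=True)

# Counts per bin, listed in bin order rather than by frequency.
counts = binned.value_counts().sort_index()
print(counts)
```

Values outside the outermost edges (here, anything above 50) become missing values in the result.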


In [None]:
### Q12


**Question 13 (2 points)**

Use the ```map()``` method to perform an element-wise transformation on the generated categorical object and create a new series, according to the following rules:

- if error is in range [0, 0.5] or (0.5, 1], transform as 'Accept'
- if error is in range (1, 10] or (10, 50], transform as 'Reject'

Use ```value_counts()``` to check the counts for each type.
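
A sketch of mapping bins to labels with a dictionary. The string bin labels passed to `pd.cut` below are an assumption made for readability; without `labels=`, the categories are `Interval` objects and the dictionary keys would need to match those instead.

```python
import pandas as pd

# Made-up error values, one per bin.
errors = pd.Series([0.2, 0.7, 3.0, 20.0])

# Hypothetical string labels for the four bins.
bin_labels = ["[0, 0.5]", "(0.5, 1]", "(1, 10]", "(10, 50]"]
binned = pd.cut(errors, bins=[0, 0.5, 1, 10, 50],
                labels=bin_labels, include_lowest=True)

# Element-wise lookup: each bin label is replaced by its verdict.
rule = {
    "[0, 0.5]": "Accept",
    "(0.5, 1]": "Accept",
    "(1, 10]": "Reject",
    "(10, 50]": "Reject",
}
verdict = binned.map(rule)

print(verdict.value_counts())
```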

In [None]:
### Q13
