# Coronavirus World Data Analysis

KATE expects your code to define variables with specific names that correspond to certain things we are interested in.

KATE will run your notebook from top to bottom and check the latest value of those variables, so make sure you don't overwrite them.

* Remember to uncomment the line assigning the variable to your answer and don't change the variable or function names.
* Use copies of the original or previous DataFrames to make sure you do not overwrite them by mistake.
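
As a rough illustration of the second point, the sketch below uses a small made-up DataFrame (the names `original` and `working` and all values are hypothetical, not part of the assignment) to show that filtering a copy leaves the source DataFrame intact:

```python
import pandas as pd

# Toy data standing in for the real dataset (values are made up).
original = pd.DataFrame({
    "location": ["A", "B", "World"],
    "total_cases": [10, 20, 30],
})

# Work on an independent copy so the original is never overwritten.
working = original.copy()
working = working[working["location"] != "World"]

# The original still has all three rows; only the copy was filtered.
print(len(original), len(working))
```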

You will find instructions below about how to define each variable.

Once you're happy with your code, upload your notebook to KATE to check your feedback.

First of all, run the following cell to:

- import `pandas` with an alias of `pd`
- read a CSV containing the data to work with
- convert the `date` column to the `datetime` format
- create a DataFrame `df` containing the data for only 1st July 2020
- take a look at the first few rows of the DataFrame


In [1]:
import pandas as pd

data = pd.read_csv('data/owid-covid-data.csv')
data['date'] = pd.to_datetime(data['date'])
df = data[data['date'] == '2020-07-01']

df.head()

Unnamed: 0,iso_code,continent,location,date,total_cases,new_cases,total_deaths,new_deaths,total_cases_per_million,new_cases_per_million,...,aged_70_older,gdp_per_capita,extreme_poverty,cvd_death_rate,diabetes_prevalence,female_smokers,male_smokers,handwashing_facilities,hospital_beds_per_thousand,life_expectancy
173,AFG,Asia,Afghanistan,2020-07-01,31517.0,279.0,746.0,13.0,809.616,7.167,...,1.337,1803.987,,597.029,9.59,,,37.746,0.5,64.83
300,ALB,Europe,Albania,2020-07-01,2535.0,69.0,62.0,4.0,880.881,23.977,...,8.643,11803.431,1.1,304.195,10.08,7.1,51.2,,2.89,78.57
491,DZA,Africa,Algeria,2020-07-01,13907.0,336.0,912.0,7.0,317.142,7.662,...,3.857,13913.839,0.5,278.364,6.73,0.7,30.4,83.741,1.9,76.88
613,AND,Europe,Andorra,2020-07-01,855.0,0.0,52.0,0.0,11065.812,0.0,...,,,,109.135,7.97,29.0,37.8,,,83.73
727,AGO,Africa,Angola,2020-07-01,284.0,8.0,13.0,2.0,8.641,0.243,...,1.362,5819.495,,276.045,3.94,,,26.664,,61.15


`df` now has one row of data for each country with data present for July 1st 2020. However, it also has a row with a `location` of `World` which contains aggregated values for all countries. 

**Q1. Create a new DataFrame which is the same as `df` but with the `World` row removed.**

Assign this new DataFrame to the variable `countries`; do not modify `df`.

In [2]:
# Filter by label rather than by position, so the result does not depend on row order
countries = df[df['location'] != 'World'].copy()

**Q2. Check the shape of your DataFrame to confirm that `countries` has one row fewer than `df`:**

In [3]:
print(df.shape, countries.shape)

(211, 34) (210, 34)


In [4]:
cols = ['continent', 'location', 'total_deaths_per_million']

**Q3. Define a DataFrame based on `countries` that contains only the columns in `cols` (defined above), and assign it to a variable called `countries_dr`.**

Order this DataFrame by `total_deaths_per_million`, with the highest numbers at the top.
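
A minimal sketch of column selection followed by a descending sort, using a made-up DataFrame (`toy`, `subset` and the values are assumptions for illustration only):

```python
import pandas as pd

# Toy data; the "extra" column stands in for columns we do not want to keep.
toy = pd.DataFrame({
    "location": ["A", "B", "C"],
    "total_deaths_per_million": [5.0, 50.0, 0.5],
    "extra": [1, 2, 3],
})

# Indexing with a list of names keeps only those columns (and returns a new DataFrame);
# ascending=False puts the largest values at the top.
subset = toy[["location", "total_deaths_per_million"]].sort_values(
    by="total_deaths_per_million", ascending=False
)

print(subset["location"].tolist())  # ['B', 'A', 'C']
```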

In [5]:
# Select only the columns in `cols`, then sort with the highest death rate first
countries_dr = countries[cols].sort_values(by='total_deaths_per_million', ascending=False)
countries_dr

Unnamed: 0,continent,location,total_deaths_per_million
23306,Europe,San Marino,1237.551
2917,Europe,Belgium,841.615
613,Europe,Andorra,673.008
28347,Europe,United Kingdom,644.168
25362,Europe,Spain,606.633
...,...,...,...
23111,North America,Saint Vincent and the Grenadines,0.000
23926,Africa,Seychelles,0.000
15734,Africa,Lesotho,0.000
10808,Europe,Gibraltar,0.000


**Q4. Using the `countries` DataFrame we created earlier, find the sum of `total_tests` for countries in `Africa`, assigning the result, *as an integer*, to `africa_tests`.**

In [6]:
# Sum total_tests over the African rows; int() converts the result to a plain Python integer
africa_tests = int(countries[countries['continent'] == 'Africa']['total_tests'].sum())
africa_tests

3445134

**Q5. How many countries in Africa have no value recorded for the number of `total_tests`? Assign the result to `africa_missing_test_data`.**

*You may find the pandas `.isna()` method useful.*
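
One way to count missing values, sketched on made-up data (the DataFrame `toy` and its values are hypothetical): `.isna()` returns a boolean Series, and summing a boolean Series counts its `True` entries.

```python
import numpy as np
import pandas as pd

# Toy data: two of the three African rows have no total_tests value.
toy = pd.DataFrame({
    "continent": ["Africa", "Africa", "Europe", "Africa"],
    "total_tests": [100.0, np.nan, np.nan, np.nan],
})

# True only where the row is in Africa AND total_tests is missing.
mask = (toy["continent"] == "Africa") & toy["total_tests"].isna()

# Summing the mask counts the True values.
missing_count = int(mask.sum())
print(missing_count)  # 2
```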

In [7]:
# Count African rows whose total_tests is missing (summing a boolean mask counts the True values)
africa_missing_test_data = int(((countries['continent'] == 'Africa') & countries['total_tests'].isna()).sum())

In [8]:
africa_missing_test_data

45

**Q6. How many countries have a higher value for `total_tests` than the `United Kingdom`? Assign your answer to a variable called `countries_more_tests`.**

Remember to work from the `countries` DataFrame rather than `df`. You should avoid modifying any existing DataFrames. 

In [9]:
# Look up the United Kingdom's total_tests once, then count the countries above it
uk_tests = countries.loc[countries['location'] == 'United Kingdom', 'total_tests'].iloc[0]
countries_more_tests = int((countries['total_tests'] > uk_tests).sum())
countries_more_tests

3

**Q7. Create a DataFrame called `beds_dr` which is based on the `countries` DataFrame, but contains only the columns `hospital_beds_per_thousand` and `total_deaths_per_million`.**

Your answer should include only rows where values are present in both of these columns. *You may find the `.dropna()` method useful.*
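
The sketch below shows `.dropna()` on a made-up DataFrame (`toy`, `wanted` and `complete` are hypothetical names): by default it drops every row that has a missing value in any of the remaining columns.

```python
import numpy as np
import pandas as pd

# Toy data: only row "A" has values in both numeric columns.
toy = pd.DataFrame({
    "location": ["A", "B", "C"],
    "hospital_beds_per_thousand": [1.5, np.nan, 3.0],
    "total_deaths_per_million": [10.0, 20.0, np.nan],
})

wanted = ["hospital_beds_per_thousand", "total_deaths_per_million"]

# Select the two columns, then drop rows with a missing value in either one.
complete = toy[wanted].dropna()
print(complete)
```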

In [10]:
col = ['hospital_beds_per_thousand', 'total_deaths_per_million']

In [11]:
beds_dr = countries.copy()
beds_dr = beds_dr.drop(beds_dr.columns.difference(col), axis=1).dropna()
beds_dr

Unnamed: 0,total_deaths_per_million,hospital_beds_per_thousand
173,19.163,0.50
300,21.544,2.89
491,20.798,1.90
952,30.635,3.80
1081,28.919,5.00
...,...,...
29136,1.794,0.80
29332,0.000,2.60
29506,10.461,0.70
29623,1.305,2.00


**Q8. What is the average `total_deaths_per_million` for entries in `beds_dr` where `hospital_beds_per_thousand` is greater than the mean?**

Assign the answer to `dr_high_bed_ratio`.

In [12]:
dr_high_bed_ratio = beds_dr[beds_dr['hospital_beds_per_thousand'] > (beds_dr.hospital_beds_per_thousand.mean())].total_deaths_per_million.mean()
dr_high_bed_ratio

98.18423728813558

**Q9. What is the average `total_deaths_per_million` for entries in `beds_dr` where `hospital_beds_per_thousand` is less than the mean?**

Assign the answer to `dr_low_bed_ratio`.

In [14]:
dr_low_bed_ratio = beds_dr[beds_dr['hospital_beds_per_thousand'] < (beds_dr.hospital_beds_per_thousand.mean())].total_deaths_per_million.mean()
dr_low_bed_ratio


56.29405714285714

**Q10. Create a DataFrame called `no_new_cases` which contains only rows from `countries` with zero `new_cases`.**

In [15]:
no_new_cases = countries[countries['new_cases'] == 0]
no_new_cases

Unnamed: 0,iso_code,continent,location,date,total_cases,new_cases,total_deaths,new_deaths,total_cases_per_million,new_cases_per_million,...,aged_70_older,gdp_per_capita,extreme_poverty,cvd_death_rate,diabetes_prevalence,female_smokers,male_smokers,handwashing_facilities,hospital_beds_per_thousand,life_expectancy
613,AND,Europe,Andorra,2020-07-01,855.0,0.0,52.0,0.0,11065.812,0.0,...,,,,109.135,7.97,29.0,37.8,,,83.73
836,AIA,North America,Anguilla,2020-07-01,3.0,0.0,0.0,0.0,199.973,0.0,...,,,,,,,,,,81.88
952,ATG,North America,Antigua and Barbuda,2020-07-01,66.0,0.0,3.0,0.0,673.965,0.0,...,4.631,21490.943,,191.511,13.17,,,,3.8,77.02
1381,ABW,North America,Aruba,2020-07-01,103.0,0.0,3.0,0.0,964.727,0.0,...,7.452,35973.781,,,11.62,,,,,76.29
2080,BHS,North America,Bahamas,2020-07-01,104.0,0.0,11.0,0.0,264.464,0.0,...,5.200,27717.847,,235.954,13.17,3.1,20.4,,2.9,73.92
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
27103,TLS,Asia,Timor,2020-07-01,24.0,0.0,0.0,0.0,18.203,0.0,...,1.897,6570.102,30.3,335.346,6.86,6.3,78.1,28.178,5.9,69.50
27725,TCA,North America,Turks and Caicos Islands,2020-07-01,41.0,0.0,2.0,1.0,1058.939,0.0,...,,,,,,,,,,80.22
28654,VIR,North America,United States Virgin Islands,2020-07-01,84.0,0.0,6.0,0.0,804.420,0.0,...,10.799,,,273.670,12.26,,,,,80.58
29016,VAT,Europe,Vatican,2020-07-01,12.0,0.0,0.0,0.0,14833.127,0.0,...,,,,,,,,,,75.12


**Q11. Which country in `no_new_cases` has had the highest number of `total_cases`? Assign your answer to `highest_no_new`.**

In [16]:
# idxmax() gives the index label of the row with the largest total_cases; .loc then reads its location
highest_no_new = no_new_cases.loc[no_new_cases['total_cases'].idxmax(), 'location']


In [17]:
highest_no_new

'Cameroon'

**Q12. What is the sum of the `population` of all countries which have had zero `total_deaths`?**

Assign your answer to `sum_populations_no_deaths`. Your answer should be in millions, rounded to the nearest whole number, and converted to an integer.

In [18]:
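# Keep only countries with zero total_deaths, then express the population sum in whole millions
no_deaths = countries[countries['total_deaths'] == 0]
sum_populations_no_deaths = int(round(no_deaths['population'].sum() / 1_000_000))
sum_populations_no_deaths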

**Q13. Create a function called `country_metric` which accepts the following three parameters:**

- a DataFrame (which can be assumed to be of a similar format to `countries`)
- a location (i.e. a string which will be found in the `location` column of the DataFrame)
- a metric (i.e. a string naming any column, other than `location`, in the DataFrame)

The function should return only the value from the first row for the given `location` and `metric`. *You may find `.iloc[]` useful.*
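
As a small illustration of positional indexing (made-up data; `toy`, `rows_for_a` and `first_value` are hypothetical names), `.iloc[0]` picks the first element of a filtered Series regardless of its index labels:

```python
import pandas as pd

# Toy data in which location "A" appears twice.
toy = pd.DataFrame({
    "location": ["A", "B", "A"],
    "aged_70_older": [4.7, 8.6, 5.1],
})

# Filter to the matching rows, pick the column, then take the first value by position.
rows_for_a = toy[toy["location"] == "A"]
first_value = rows_for_a["aged_70_older"].iloc[0]
print(first_value)  # 4.7
```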

In [19]:
def country_metric(data, place, metric):
    # Filter to the rows for this location and return the metric value from the first one
    return data[data['location'] == place][metric].iloc[0]

**Q14. Use your function to collect the value for `Vietnam` for the metric `aged_70_older`, assigning the result to `vietnam_older_70`.**

In [20]:
vietnam_older_70 = country_metric(countries, 'Vietnam', 'aged_70_older')
vietnam_older_70

4.718

**Q15. Create another function called `countries_average`, which accepts the following three parameters:**

- a DataFrame (which can be assumed to be of a similar format to `countries`)
- a list of countries (all of which can be assumed to appear in the `location` column of the DataFrame)
- a string (the name of a column, other than `location`, which can be assumed to exist in the DataFrame)

The function should return the average value for the given metric for the given list of countries.

In [26]:
def countries_average(data, places, col_name):
    new_data = data[(data.location.isin(places))]
    col = new_data[col_name]
    ave = col.mean()
    return ave

In [27]:
g7 = ['United States', 'Italy', 'Canada', 'Japan', 'United Kingdom', 'Germany', 'France']

**Q16. Use your `countries_average` function to find out the average `life_expectancy` of countries in the `g7` list defined above. Assign the result to the variable `g7_avg_life_expectancy`.**

In [28]:
g7_avg_life_expectancy = countries_average(countries, g7, 'life_expectancy')
g7_avg_life_expectancy

82.10571428571428

**Q17. Find the country with the lowest value for `life_expectancy` in the `countries` DataFrame, and create a string which is formatted as follows:**

'{country} has a life expectancy of {diff} years lower than the G7 average.'
    
Assign your string to the variable `headline` and ensure it is formatted exactly as above, with:

- {country} being replaced by the value in the `location` column of the DataFrame
- {diff} being replaced by a float **rounded to one decimal place**, of the value from the `life_expectancy` column subtracted from `g7_avg_life_expectancy` 
- Please note that {diff} should be a positive value
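
A minimal sketch of the formatting step on its own, using invented numbers and a made-up country name (`toy_country`, `toy_average`, `toy_lowest` and `toy_headline` are all hypothetical): the difference is rounded first and then interpolated into an f-string, and the string itself is assigned rather than the result of `print()`.

```python
# Invented values for illustration only.
toy_country = "Exampleland"
toy_average = 80.0
toy_lowest = 55.44

# Subtract the lower value from the average so the difference is positive,
# then round to one decimal place.
toy_diff = round(toy_average - toy_lowest, 1)

# Build the string; print() returns None, so assign the f-string directly.
toy_headline = f"{toy_country} has a life expectancy of {toy_diff} years lower than the G7 average."
print(toy_headline)
```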

In [32]:
# Row of the country with the lowest life expectancy (idxmin() ignores missing values)
lowest = countries.loc[countries['life_expectancy'].idxmin()]
country = lowest['location']
# Positive difference from the G7 average, rounded to one decimal place
diff = round(float(g7_avg_life_expectancy - lowest['life_expectancy']), 1)

In [33]:
# Assign the string itself; print() returns None, so its result must not be assigned
headline = f'{country} has a life expectancy of {diff} years lower than the G7 average.'
headline
