# Coronavirus World Data Analysis

First of all, we will:
- import `pandas` with an alias of `pd`
- read a CSV containing the data to work with
- convert the `date` column to the `datetime` format
- create a DataFrame `df` containing the data for only **1st July 2020**
- take a look at the first few rows of the DataFrame


In [1]:
import pandas as pd

data = pd.read_csv('data/owid-covid-data.csv')
data['date'] = pd.to_datetime(data['date'])
df = data[data['date'] == '2020-07-01']

df.head()
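The loading steps can be tried without the CSV file. The following is a minimal sketch, assuming a tiny in-memory table (`sample_csv`, with made-up locations and values) in place of the real dataset, and using an explicit `pd.Timestamp` for the date comparison.

```python
import io
import pandas as pd

# Hypothetical stand-in for data/owid-covid-data.csv; all values are made up.
sample_csv = io.StringIO(
    "location,date,new_cases\n"
    "Country A,2020-06-30,5\n"
    "Country A,2020-07-01,7\n"
    "World,2020-07-01,100\n"
)

sample = pd.read_csv(sample_csv)
sample['date'] = pd.to_datetime(sample['date'])  # strings -> datetime64

# Keep only the rows for a single day.
one_day = sample[sample['date'] == pd.Timestamp('2020-07-01')]
print(one_day)
```

Two rows survive the filter here: one country row and the aggregate `World` row.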

- the `df` DataFrame now has one row of data for each country with data present for **July 1st 2020**
- however, it also has a row with a `location` of `World` which contains aggregated values for all countries
- `df.tail()`, `df.info()` and `df.shape` will allow for further exploration of the structure of the DataFrame

In [2]:
df.tail()

In [3]:
df.info()

In [4]:
df.shape

**Q1. Create a new DataFrame called `countries` which is the same as `df` but with the `World` row removed.**

- Use the `.copy()` method to ensure you have a distinct DataFrame in memory
- Assign this new DataFrame to the variable `countries`; do not modify `df`


In [44]:
# Filter out the aggregate 'World' row; .copy() makes the result independent of df
countries = df[df['location'] != 'World'].copy()
countries.tail()
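To see why `.copy()` matters, here is a small sketch on a hypothetical three-row frame (`toy_df`, made-up values): the aggregate row is filtered out with a boolean mask, and a later edit to the copy does not reach the original.

```python
import pandas as pd

# Hypothetical miniature of `df`; values are made up.
toy_df = pd.DataFrame({
    'location': ['Country A', 'Country B', 'World'],
    'total_cases': [10, 20, 30],
})

# Keep every row except the aggregate, then copy so the result is independent.
toy_countries = toy_df[toy_df['location'] != 'World'].copy()

# Editing the copy leaves the original untouched.
toy_countries.loc[toy_countries['location'] == 'Country A', 'total_cases'] = 99

print(toy_df.shape, toy_countries.shape)
```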

**Q2. Check the shape of your DataFrame to confirm that `countries` has one row fewer than `df`:**

In [45]:
print(df.shape, countries.shape)

**Q3. Define a DataFrame based on the `countries` DataFrame, but which only contains the columns in `cols` (defined below) and assign this to a variable called `countries_dr`**

- Order this DataFrame by `'total_deaths_per_million'`, with the highest numbers at the top.

In [116]:
cols = ['continent', 'location', 'total_deaths_per_million']

countries_dr = countries[cols].sort_values(by = 'total_deaths_per_million', ascending = False)
countries_dr.head()
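One detail worth knowing when ordering by a column with gaps: `sort_values` places missing values last by default, whichever direction is used. A small sketch with a hypothetical frame (`toy`, made-up values):

```python
import numpy as np
import pandas as pd

# Hypothetical data; Country B has no recorded value.
toy = pd.DataFrame({
    'location': ['Country A', 'Country B', 'Country C'],
    'total_deaths_per_million': [5.0, np.nan, 12.5],
})

# Highest values first; the NaN row still ends up at the bottom.
ranked = toy.sort_values(by='total_deaths_per_million', ascending=False)
print(ranked)
```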

**Q4. Using the `countries` DataFrame we created earlier, find the sum of `total_tests` for countries in `Africa`, assigning the result, *as an integer*, to `africa_tests`.**

- Use the `.sum()` method to calculate the sum of the `total_tests` column
- Use the `.astype(int)` method or the `int()` function to convert the result to an integer

In [27]:
africa_tests = countries[countries['continent'] == 'Africa'].total_tests.sum().astype(int)
africa_tests
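The sum comes back as a float because the column contains missing values. A minimal sketch with a hypothetical Series (`toy_tests`, made-up values) shows that `.sum()` skips `NaN` and that `int()` gives a plain Python integer:

```python
import numpy as np
import pandas as pd

# Hypothetical test counts with one missing entry.
toy_tests = pd.Series([100.0, np.nan, 250.0])

total = toy_tests.sum()   # NaN is skipped, so this is 350.0
as_int = int(total)       # plain Python int

print(total, as_int)
```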

**Q5. How many countries in Africa have no value recorded in the `total_tests` column? Assign the result to `africa_missing_test_data`.**

In [126]:
# Select total_tests for African countries, then count the missing (NaN) values
africa = countries[countries['continent'] == 'Africa'].total_tests
africa_missing_test_data = int(africa.isna().sum())
africa_missing_test_data
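A common pitfall when counting missing values is taking `len()` of the boolean mask, which counts every row rather than only the missing ones. A short sketch on a hypothetical Series (`toy_tests`, made-up values) contrasts the two:

```python
import numpy as np
import pandas as pd

# Hypothetical column with two missing entries.
toy_tests = pd.Series([100.0, np.nan, np.nan, 250.0])

mask = toy_tests.isna()       # True where the value is missing

n_rows = len(mask)            # length of the mask: every row is counted
n_missing = int(mask.sum())   # True counts as 1, so this counts missing rows

print(n_rows, n_missing)
```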

**Q6. How many countries have a higher value for `total_tests` than the `United Kingdom`? Assign your answer to a variable called `countries_more_tests`.**

In [127]:
def more_tests(country):
    # total_tests value for the reference country (first matching row)
    tests = countries.loc[countries['location'] == country, 'total_tests'].iloc[0]

    # Vectorized comparison; rows with a missing total_tests compare as False
    return int((countries['total_tests'] > tests).sum())

countries_more_tests = more_tests('United Kingdom')
countries_more_tests
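The same count can be checked on a small example. This sketch assumes a hypothetical frame (`toy`, made-up values) and uses `Country A` as the reference; note that a row with a missing value never counts as "higher":

```python
import numpy as np
import pandas as pd

# Hypothetical data; Country B has no recorded total_tests.
toy = pd.DataFrame({
    'location': ['Country A', 'Country B', 'Country C', 'Country D'],
    'total_tests': [500.0, np.nan, 900.0, 100.0],
})

# Reference value for the chosen country.
ref = toy.loc[toy['location'] == 'Country A', 'total_tests'].iloc[0]

# Comparisons against NaN are False, so Country B is not counted.
more = int((toy['total_tests'] > ref).sum())
print(more)
```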

**Q7. Create a DataFrame called `beds_dr` which is based on the `countries` DataFrame, but contains only the columns `hospital_beds_per_thousand` and `total_deaths_per_million`.**

In [85]:
beds_dr = countries[['hospital_beds_per_thousand', 'total_deaths_per_million']].dropna()
beds_dr

**Q8. Refer to the `beds_dr` DataFrame. What is the average `total_deaths_per_million` for entries in `beds_dr` where `hospital_beds_per_thousand` is greater than the mean?**

- Save the result to a new variable called `dr_high_bed_ratio`

In [87]:
# Rows whose bed ratio is above the mean bed ratio
above_av = beds_dr[beds_dr['hospital_beds_per_thousand'] > beds_dr['hospital_beds_per_thousand'].mean()]

dr_high_bed_ratio = above_av['total_deaths_per_million'].mean()
dr_high_bed_ratio

**Q9. Refer to the `beds_dr` DataFrame. What is the average `total_deaths_per_million` for entries in `beds_dr` where `hospital_beds_per_thousand` is less than the mean?**

- Save the result to a new variable called `dr_low_bed_ratio`

In [88]:
# Rows whose bed ratio is below the mean bed ratio
below_av = beds_dr[beds_dr['hospital_beds_per_thousand'] < beds_dr['hospital_beds_per_thousand'].mean()]

dr_low_bed_ratio = below_av['total_deaths_per_million'].mean()
dr_low_bed_ratio
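The split used in Q8 and Q9 can be illustrated on a hypothetical four-row frame (`toy_beds`, made-up values). Because both comparisons are strict, a row sitting exactly on the mean falls into neither group:

```python
import pandas as pd

# Hypothetical data; the mean bed ratio is (1 + 2 + 3 + 6) / 4 = 3.0.
toy_beds = pd.DataFrame({
    'hospital_beds_per_thousand': [1.0, 2.0, 3.0, 6.0],
    'total_deaths_per_million': [10.0, 20.0, 30.0, 40.0],
})

mean_beds = toy_beds['hospital_beds_per_thousand'].mean()

high = toy_beds[toy_beds['hospital_beds_per_thousand'] > mean_beds]
low = toy_beds[toy_beds['hospital_beds_per_thousand'] < mean_beds]

high_avg = high['total_deaths_per_million'].mean()  # only the 6.0 row -> 40.0
low_avg = low['total_deaths_per_million'].mean()    # the 1.0 and 2.0 rows -> 15.0

print(high_avg, low_avg)
```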

**Q10. Refer to the `countries` DataFrame. Create a new DataFrame called `no_new_cases` which contains only rows from `countries` with zero `new_cases`.**

In [89]:
no_new_cases = countries[countries['new_cases'] == 0]

no_new_cases.head()

**Q11. Refer to the `no_new_cases` DataFrame. Which country in the `no_new_cases` DataFrame has had the highest number of `total_cases`?**

- Save the result to a new variable called `highest_no_new`

In [135]:
highest_no_new = no_new_cases[no_new_cases['total_cases'] == no_new_cases['total_cases'].max()]
highest_no_new = highest_no_new['location'].iloc[0]
highest_no_new
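An alternative way to pick out the row holding a maximum is `idxmax`, which returns the index label of the first largest value. This is only a sketch on a hypothetical frame (`toy`, made-up values), not a replacement for the approach above:

```python
import pandas as pd

# Hypothetical data with a non-default index to show that a label is returned.
toy = pd.DataFrame(
    {'location': ['Country A', 'Country B', 'Country C'],
     'total_cases': [50, 700, 300]},
    index=[10, 20, 30],
)

top_label = toy['total_cases'].idxmax()   # index label of the largest value
top_location = toy.loc[top_label, 'location']

print(top_label, top_location)
```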

**Q12. Refer to the `countries` DataFrame. What is the sum of the `population` of all countries which have had zero `total_deaths`?**

In [131]:
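# Total population of countries with zero recorded deaths
sum_populations_no_deaths = countries[countries['total_deaths'] == 0]['population'].sum()
# Express the total in millions, rounded to the nearest whole number
sum_populations_no_deaths = round(int(sum_populations_no_deaths)/1000000)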
sum_populations_no_deaths

**Q13. Create a function called `country_metric` which accepts the following three parameters:**

- a DataFrame (which can be assumed to be of a similar format to `countries`)
- a location (i.e. a string which will be found in the `location` column of the DataFrame)
- a metric (i.e. a string naming any column of the DataFrame other than `location`)

The function should return only the value from the first row for the given `location` and `metric`.

In [99]:
def country_metric(df, location, metric):
    # Take the first row matching the location, then read the requested column
    return df[df['location'] == location].iloc[0][metric]

country_metric(countries, 'France', 'new_cases')
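To check the "first row only" behaviour, the function can be run on a hypothetical frame (`toy`, made-up values) in which one location appears twice:

```python
import pandas as pd

def country_metric(df, location, metric):
    # First row matching the location, then the requested column
    return df[df['location'] == location].iloc[0][metric]

# Hypothetical data; Country B appears on two rows.
toy = pd.DataFrame({
    'location': ['Country A', 'Country B', 'Country B'],
    'new_cases': [5, 3, 8],
})

value = country_metric(toy, 'Country B', 'new_cases')
print(value)  # value from the first Country B row
```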

**Q14. Use your function to collect the value for `Vietnam` for the metric `aged_70_older`, assigning the result to `vietnam_older_70`.**

In [100]:
vietnam_older_70 = country_metric(countries, 'Vietnam', 'aged_70_older')
vietnam_older_70

**Q15. Create another function called `countries_average`, which accepts the following three parameters:**

- a DataFrame `df` (which can be assumed to be of a similar format to `countries`)
- a list of countries `countries` (all of which can be assumed to be found in the `location` column of the DataFrame)
- a string `metric` (which can be assumed to name a column of the DataFrame other than `location`), for instance `life_expectancy`

The function should return the average value for the given metric for the given list of countries.

In [107]:
def countries_average(df, countries, metric):
    frame = df[df['location'].isin(countries)]
    avg = frame[metric].mean()
    return avg
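A quick way to sanity-check the function is to call it on a hypothetical frame (`toy`, made-up values) where the expected average is easy to work out by hand:

```python
import pandas as pd

def countries_average(df, countries, metric):
    # Keep only the rows whose location is in the given list
    frame = df[df['location'].isin(countries)]
    return frame[metric].mean()

# Hypothetical data.
toy = pd.DataFrame({
    'location': ['Country A', 'Country B', 'Country C'],
    'life_expectancy': [80.0, 70.0, 60.0],
})

avg = countries_average(toy, ['Country A', 'Country B'], 'life_expectancy')
print(avg)  # (80.0 + 70.0) / 2 = 75.0
```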

**Q16. Use your `countries_average` function to find out the average `life_expectancy` of countries in the `g7` list defined below. Assign the result to the variable `g7_avg_life_expectancy`.**

In [108]:
g7 = ['United States', 'Italy', 'Canada', 'Japan', 'United Kingdom', 'Germany', 'France']
g7_avg_life_expectancy = countries_average(countries, g7, 'life_expectancy')
g7_avg_life_expectancy

**Q17. Refer to the `countries` DataFrame. Find the country with the lowest value for `life_expectancy` in the `countries` DataFrame, and create a string which is formatted as follows:**

'{country} has a life expectancy of {diff} years lower than the G7 average.'

- use an f-string to format the string
- {country} is replaced by the value in the `location` column of the DataFrame
- {diff} is replaced by a float **rounded to one decimal place**: the value from the `life_expectancy` column subtracted from `g7_avg_life_expectancy`. Note that {diff} should be a positive value


In [115]:
# Row of the country with the lowest life expectancy (idxmin skips missing values)
lowest = countries.loc[countries['life_expectancy'].idxmin()]
country = lowest['location']
life_exp = lowest['life_expectancy']
diff = round(g7_avg_life_expectancy - life_exp, 1)
headline = f'{country} has a life expectancy of {diff} years lower than the G7 average.'
headline
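The string formatting step can be tried in isolation. This sketch uses hypothetical stand-in values (`toy_g7_avg`, `toy_country`, `toy_life_exp`), not figures from the dataset, and also shows the `:.1f` format specifier as an alternative to `round()`:

```python
# Hypothetical stand-in values; not taken from the dataset.
toy_g7_avg = 80.0
toy_country = 'Country C'
toy_life_exp = 52.5

# Subtract the lower value from the G7 average so the difference is positive.
toy_diff = round(toy_g7_avg - toy_life_exp, 1)
toy_headline = f'{toy_country} has a life expectancy of {toy_diff} years lower than the G7 average.'

# Alternative: let the f-string do the rounding with a format specifier.
toy_headline_alt = f'{toy_country} has a life expectancy of {toy_g7_avg - toy_life_exp:.1f} years lower than the G7 average.'

print(toy_headline)
```

The format specifier always prints one decimal place, whereas `round()` returns a float whose printed form depends on its value.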