# Homework pandas

<table align="left">
    <tr>
    <td><a href="https://colab.research.google.com/github/airnandez/numpandas/blob/master/exam/2020-exam.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a></td>
    <td><a href="https://mybinder.org/v2/gh/airnandez/numpandas/master?filepath=exam%2F2020-exam.ipynb">
  <img src="https://mybinder.org/badge_logo.svg" alt="Launch Binder"/>
</a></td>
  </tr>
</table>

*Author: Fabio Hernandez*

*Last updated: 2021-04-01*

*Location:* https://github.com/airnandez/numpandas/exam

--------------------
## Instructions

For this exercise we will use a public dataset curated and made available by [Our World in Data](https://ourworldindata.org), located in [this repository](https://github.com/owid/covid-19-data/tree/master/public/data). We will use a snapshot of the dataset as of 2021-04-02.

For your convenience, this notebook already contains code that downloads the snapshot from its source and loads it into memory as a **pandas** dataframe, along with some cleaning and helper functions. Your mission is to execute the provided cells and to write the code that answers the questions below.

You must not modify the code provided. You must provide code for answering the questions, following the instructions for each one of them.

When you have finished, please save your notebook in the form of a `.ipynb` file and send it to your instructor according to the instructions you received by e-mail.

---------------------
## Dependencies

In [1]:
import datetime
import os
import glob

In [2]:
import pandas as pd
pd.set_option('display.max_columns', None)
pd.__version__

'1.0.1'

In [3]:
import numpy as np
np.__version__

'1.18.1'

------
## Download the dataset

Define a helper function for downloading data to a local file:

In [4]:
import requests

def download(url, path):
    """Download file at url and save it locally at path"""
    with requests.get(url, stream=True) as resp:
        mode, data = 'wb', resp.content
        if 'text/plain' in resp.headers['Content-Type']:
            mode, data = 'wt', resp.text
        with open(path, mode) as f:
            f.write(data)

Download the data files. We store the downloaded data in the directory `../data` relative to the location of this notebook. If a file has already been downloaded, we don't download it again.

In [8]:
# Download files
data_sources = (
    "https://raw.githubusercontent.com/airnandez/numpandas/master/data/2021-04-02-owid-covid-data.csv",
)

# Create destination directory
os.makedirs(os.path.join('..', 'data'), exist_ok=True)

for url in data_sources:
    # Build the destination file path
    path = os.path.join('..', 'data', os.path.basename(url))

    # If the file already exists, don't download it again
    if not os.path.isfile(path):
        print(f'downloading {url} to {path}')
        download(url, path)

downloading https://raw.githubusercontent.com/airnandez/numpandas/master/data/2021-04-02-owid-covid-data.csv to ../data/2021-04-02-owid-covid-data.csv


Check what files we have for our analysis:

In [9]:
file_paths = glob.glob(os.path.join('..', 'data', '2021-*-owid-*'))
print('\n'.join(file_paths))

../data/2021-04-02-owid-covid-data.csv


---------------------
## Load the data

Load the file `2021-04-02-owid-covid-data.csv` to a **pandas** dataframe.

⚠️ **Make sure you get familiar with the contents of that file, by reading the [codebook](https://github.com/owid/covid-19-data/blob/master/public/data/owid-covid-codebook.csv), which describes the meaning of each column.**

In [10]:
path = os.path.join('..', 'data', '2021-04-02-owid-covid-data.csv')
df = pd.read_csv(path, parse_dates=['date'])
df.sample(5)

Unnamed: 0,iso_code,continent,location,date,total_cases,new_cases,new_cases_smoothed,total_deaths,new_deaths,new_deaths_smoothed,total_cases_per_million,new_cases_per_million,new_cases_smoothed_per_million,total_deaths_per_million,new_deaths_per_million,new_deaths_smoothed_per_million,reproduction_rate,icu_patients,icu_patients_per_million,hosp_patients,hosp_patients_per_million,weekly_icu_admissions,weekly_icu_admissions_per_million,weekly_hosp_admissions,weekly_hosp_admissions_per_million,new_tests,total_tests,total_tests_per_thousand,new_tests_per_thousand,new_tests_smoothed,new_tests_smoothed_per_thousand,positive_rate,tests_per_case,tests_units,total_vaccinations,people_vaccinated,people_fully_vaccinated,new_vaccinations,new_vaccinations_smoothed,total_vaccinations_per_hundred,people_vaccinated_per_hundred,people_fully_vaccinated_per_hundred,new_vaccinations_smoothed_per_million,stringency_index,population,population_density,median_age,aged_65_older,aged_70_older,gdp_per_capita,extreme_poverty,cardiovasc_death_rate,diabetes_prevalence,female_smokers,male_smokers,handwashing_facilities,hospital_beds_per_thousand,life_expectancy,human_development_index
39349,LVA,Europe,Latvia,2020-03-21,124.0,13.0,14.0,,,0.0,65.741,6.892,7.422,,,0.0,1.27,,,,,,,,,1389.0,5762.0,3.055,0.736,678.0,0.359,0.021,48.4,tests performed,,,,,,,,,,60.19,1886202.0,31.212,43.9,19.754,14.136,25063.846,0.7,350.06,4.91,25.6,51.0,,5.57,75.29,0.866
28352,GRC,Europe,Greece,2020-05-28,2906.0,3.0,7.571,175.0,2.0,1.0,278.805,0.288,0.726,16.79,0.192,0.096,0.88,,,,,,,,,4222.0,170467.0,16.355,0.405,3770.0,0.362,0.002,498.0,samples tested,,,,,,,,,,68.52,10423056.0,83.479,45.3,20.396,14.524,24574.382,1.5,175.695,4.55,35.3,52.0,,4.21,82.24,0.888
14325,TCD,Africa,Chad,2020-09-10,1051.0,3.0,4.714,79.0,0.0,0.286,63.984,0.183,0.287,4.809,0.0,0.017,1.26,,,,,,,,,,,,,,,,,,,,,,,,,,,64.81,16425859.0,11.833,16.7,2.486,1.446,1768.153,38.4,280.995,6.1,,,5.818,,54.24,0.398
41431,LIE,Europe,Liechtenstein,2020-09-19,113.0,1.0,0.286,1.0,0.0,0.0,2963.002,26.221,7.492,26.221,0.0,0.0,0.74,,,,,,,,,,,,,,,,,,,,,,,,,,,,38137.0,237.012,,,,,,,7.77,,,,2.397,82.49,0.919
26051,FRA,Europe,France,2020-11-11,1917577.0,36406.0,46345.429,42609.0,328.0,553.571,28138.549,534.222,680.073,625.245,4.813,8.123,0.9,4789.0,70.274,31918.0,468.365,,,,,50587.0,,,0.742,245382.0,3.601,0.121,8.3,people tested,,,,,,,,,,78.7,68147687.0,122.578,42.0,19.718,13.079,38605.671,,86.06,4.77,30.1,35.6,,5.98,82.66,0.901
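The `parse_dates=['date']` argument used above is what makes the `date` column a proper datetime column rather than plain text. The following minimal sketch illustrates the effect on a tiny in-memory CSV; the rows and the name `toy` are made up for illustration and are not taken from the dataset.

```python
import io

import pandas as pd

# A tiny, made-up CSV with the same kind of 'date' column as the dataset
csv_text = (
    "location,date,total_cases\n"
    "France,2021-04-01,110\n"
    "France,2021-04-02,125\n"
)

# parse_dates converts the listed columns to datetime64 while loading
toy = pd.read_csv(io.StringIO(csv_text), parse_dates=['date'])

print(toy.dtypes)

# Datetime columns expose the .dt accessor, e.g. to extract the month
print(toy['date'].dt.month.tolist())
```

With a datetime column, comparisons such as `toy['date'] <= pd.Timestamp('2021-04-01')` behave as date comparisons instead of string comparisons.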


------------------------
## Question 1: number of cases, incidence and fatality ratio

We want to compute the total number of cases and deaths, the incidence and the fatality ratio in France and in the world as of 2021-04-01.

The fatality ratio is the fraction of deaths over the total number of confirmed COVID-19 cases. The incidence is the ratio of the total number of confirmed cases over the population.
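As a quick illustration of these two definitions, here is a minimal sketch using made-up round numbers (they are not values from the dataset), with both ratios expressed as percentages:

```python
# Hypothetical figures, chosen only to make the arithmetic easy to follow
population = 1_000_000
total_cases = 50_000
total_deaths = 1_000

# Incidence: confirmed cases over population, as a percentage
incidence = total_cases / population * 100

# Fatality ratio: deaths over confirmed cases, as a percentage
fatality = total_deaths / total_cases * 100

print(f'Incidence:      {incidence:.2f}%')
print(f'Fatality ratio: {fatality:.2f}%')
```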

### Question 1a (3 points)

Compute the total number of cases, deaths, incidence and fatality ratio for France. You must write code to extract the relevant information from the dataframe and assign the appropriate values to the variables defined in the cell below.

In [11]:
# Total confirmed cases of COVID-19 in France
df_france = df[df['location'] == 'France']
total_cases_fr = df_france['total_cases'].max()

# Population in France
population_fr = df_france['population'].max()

# Total number of deaths attributed to COVID-19 in France
total_deaths_fr = df_france['total_deaths'].max()

# Incidence in France
incidence_fr = (total_cases_fr / population_fr) * 100

# Fatality ratio: deaths vs confirmed cases
fatality_fr = (total_deaths_fr / total_cases_fr) * 100

print(f'Population in France:             {population_fr:>12,.0f}')
print(f'Total number of cases in France:  {total_cases_fr:>12,.0f}')
print(f'Total number of deaths in France: {total_deaths_fr:>12,.0f}')
print(f'Incidence in France:              {incidence_fr:>12,.2f}%')
print(f'Fatality ratio in France:         {fatality_fr:>12.2f}%')

Population in France:               68,147,687
Total number of cases in France:     4,755,779
Total number of deaths in France:       96,106
Incidence in France:                      6.98%
Fatality ratio in France:                 2.02%
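The cell above takes the maximum of each cumulative column, which gives the most recent value present in the snapshot. If the values are needed as of one specific date, one possible alternative is to select the latest row at or before that date. The sketch below shows the idea on a made-up dataframe; the name `toy` and its numbers are illustrative assumptions, not data from the snapshot.

```python
import pandas as pd

# Made-up cumulative counts for three consecutive days
toy = pd.DataFrame({
    'location': ['France', 'France', 'France'],
    'date': pd.to_datetime(['2021-03-31', '2021-04-01', '2021-04-02']),
    'total_cases': [100.0, 110.0, 125.0],
})

as_of = pd.Timestamp('2021-04-01')

# Keep only the rows for the location at or before the reference date
rows = toy[(toy['location'] == 'France') & (toy['date'] <= as_of)]

# The last row in chronological order holds the value as of that date
latest = rows.sort_values('date').iloc[-1]

print(latest['date'].date(), latest['total_cases'])
```

On this toy data the selected value is 110.0, whereas `toy['total_cases'].max()` would return 125.0 because it also includes the row for 2021-04-02.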


### Question 1b (3 points)

As done for France in the previous question, here you need to compute the total number of cases, deaths, incidence and fatality ratio for the entire world:

In [12]:
# Select data for the whole world
df_world = df[df['location'] == 'World']
population_world = df_world['population'].max()
total_cases_world = df_world['total_cases'].max()
total_deaths_world = df_world['total_deaths'].max()
incidence_world = (total_cases_world / population_world) * 100
fatality_world = (total_deaths_world / total_cases_world) * 100

print(f'World population                     {population_world:>14,.0f}')
print(f'Total number of cases in the world:  {total_cases_world:>14,.0f}')
print(f'Total number of deaths in the world: {total_deaths_world:>14,.0f}')
print(f'Incidence in the world:              {incidence_world:>14,.2f}%')
print(f'Fatality ratio in the world:         {fatality_world:>14.2f}%')

World population                      7,794,798,729
Total number of cases in the world:     129,607,542
Total number of deaths in the world:      2,827,520
Incidence in the world:                        1.66%
Fatality ratio in the world:                   2.18%


------------------
## Question 2 (7 points)

Compute and print the list of **countries** that, taken together and starting from the country with the most doses, account for 80% of the vaccination doses administered worldwide.

⚠️ Please note that in this dataframe there are rows that contain information about a region (e.g. Europe, Asia, World), in addition to information about individual countries.
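One way to separate countries from regions is to look at the `iso_code` column, assuming (as the solution below does) that aggregate rows carry a code starting with `OWID_`. The sketch below demonstrates this on a made-up dataframe named `toy`; passing `na=False` is an extra precaution, in case some rows have a missing code.

```python
import pandas as pd

# Made-up rows mixing countries and aggregate regions
toy = pd.DataFrame({
    'iso_code': ['FRA', 'OWID_EUR', 'CHL', 'OWID_WRL'],
    'location': ['France', 'Europe', 'Chile', 'World'],
})

# True for rows whose code starts with the 'OWID_' prefix;
# na=False treats a missing code as "not a region"
is_region = toy['iso_code'].str.startswith('OWID_', na=False)

# Keep only the rows that are not regions
countries_only = toy[~is_region]

print(countries_only['location'].tolist())
```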

In [13]:
# Select only the rows which contain information about a country (as opposed to a region)
# Regions are encoded with an 'iso_code' of the form 'OWID_XXXXX'
is_country = ~df['iso_code'].str.contains('^OWID_.+$', regex=True)
df_countries = df[is_country]

# Sum the vaccination doses administered by all countries
df_grouped_by_country = df_countries.groupby('location')
total_vaccinations = df_grouped_by_country['total_vaccinations'].max().sum()

# Sort the countries by their value of vaccination doses administered
vaccinations_per_country = df_grouped_by_country['total_vaccinations'].max().sort_values(ascending=False)

# Build the list of countries which have administered 80% of the 
# doses administered around the world
cumulated_doses = 0
countries = []
for country, doses in vaccinations_per_country.items():
    if cumulated_doses >= 0.8 * total_vaccinations:
        break
    cumulated_doses += doses
    countries.append(country)

print(f'{len(countries)} out of {len(vaccinations_per_country)} countries have administered 80% of the {total_vaccinations:,.0f}'
      ' doses administered around the world. Those countries are:')

for country in countries:
    print(f"{country:>16}: {vaccinations_per_country[country]:>12,.0f}")

13 out of 204 countries have administered 80% of the 617,027,408 doses administered around the world. Those countries are:
   United States:  153,631,404
           China:  126,616,000
           India:   68,789,138
  United Kingdom:   35,660,902
          Brazil:   20,068,856
          Turkey:   16,148,683
         Germany:   13,772,656
       Indonesia:   12,226,028
          Russia:   11,642,295
          France:   11,386,807
           Chile:   10,760,851
           Italy:   10,501,841
          Israel:   10,055,840
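The explicit loop above can also be expressed with a cumulative sum. The sketch below is one possible vectorized variant, shown on a made-up series named `doses` with hypothetical country labels; it keeps a country as long as the running total *before* that country is still below the 80% threshold, which mirrors the stopping rule of the loop.

```python
import pandas as pd

# Made-up doses per country, sorted from largest to smallest
doses = pd.Series({'A': 60, 'B': 30, 'C': 7, 'D': 3}).sort_values(ascending=False)

threshold = 0.8 * doses.sum()

# Running total after each country: 60, 90, 97, 100
cumulated = doses.cumsum()

# Running total before each country: 0, 60, 90, 97
cumulated_before = cumulated.shift(fill_value=0)

# Keep the countries needed to reach the threshold
top = doses[cumulated_before < threshold]

print(top.index.tolist())
```

On real data, countries without any reported value would appear as `NaN` after the `groupby`, so calling `dropna()` before the cumulative sum may be needed.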


_____
## Question 3 (7 points)

Compute an ordered list of the top 10 countries with a population of more than 1 million, ranked by the fraction of their population that has already received **all the doses** prescribed by the vaccination protocol.

In [14]:
# Build a dataframe with one row per country and two columns: 'people_fully_vaccinated' and 'population'
is_country = ~df['iso_code'].str.contains('^OWID_.+$', regex=True)
df_countries = df[is_country]

# Extract the number of people fully vaccinated and the population for each country
df_grouped_by_country = df_countries.groupby('location')
fully_vaccinated_per_country = df_grouped_by_country['people_fully_vaccinated'].max()
population_per_country = df_grouped_by_country['population'].max()

# Build a new dataframe with those two columns
data = {}
for country in fully_vaccinated_per_country.index:
    data[country] = (fully_vaccinated_per_country[country], population_per_country[country])

df_vaccinated = pd.DataFrame.from_dict(data, orient='index', columns=['people_fully_vaccinated', 'population'])

# Add a column with the fraction of people fully vaccinated
df_vaccinated['percent_fully_vaccinated'] = df_vaccinated['people_fully_vaccinated'] / df_vaccinated['population']

# Among the countries with at least 1M people, select the top 10
# ranked by fraction of fully vaccinated population
is_bigger_than_1m = df_vaccinated['population'] >= 1_000_000
df_vaccinated[is_bigger_than_1m]['percent_fully_vaccinated'].nlargest(10)

Israel                  0.554981
United Arab Emirates    0.221209
Chile                   0.200475
United States           0.169454
Serbia                  0.152501
Bahrain                 0.151384
Morocco                 0.102063
Hungary                 0.089354
Turkey                  0.082313
Denmark                 0.067261
Name: percent_fully_vaccinated, dtype: float64
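The per-country dataframe can also be built without the intermediate dictionary, by selecting both columns on the grouped object. The following sketch shows this alternative on made-up data; the country names, the name `toy` and the column `fraction_fully_vaccinated` are hypothetical and chosen for illustration.

```python
import pandas as pd

# Made-up rows: several dates per country, some values missing
toy = pd.DataFrame({
    'location': ['Xland', 'Xland', 'Yland', 'Yland', 'Zland'],
    'people_fully_vaccinated': [100.0, 300.0, None, 50.0, 10.0],
    'population': [1_000.0, 1_000.0, 2_000_000.0, 2_000_000.0, 5_000_000.0],
})

# One row per country, keeping the maximum of each column
per_country = toy.groupby('location')[['people_fully_vaccinated', 'population']].max()

# Fraction of the population that is fully vaccinated
per_country['fraction_fully_vaccinated'] = (
    per_country['people_fully_vaccinated'] / per_country['population']
)

# Restrict to countries with at least 1M people, then rank
big = per_country[per_country['population'] >= 1_000_000]
top = big['fraction_fully_vaccinated'].nlargest(2)

print(top)
```

Here the hypothetical `Xland` has the highest fraction but is excluded by the population filter, so the ranking contains only `Yland` and `Zland`.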

----------------
## Bonus question (3 points)

The function `plot` below generates and displays a figure for visualizing a set of countries and the percentage of their population which is fully vaccinated. You need to provide the information to visualize the top 10 countries with a population of at least 1 million people which have the largest fraction of their population fully vaccinated (see Question 3).

To use this function, you must compute two Python lists:

* the list `countries`, which contains the names of the top 10 countries with a population of at least 1 million people that have the largest fraction of their population fully vaccinated,
* the list `percents`, which contains the fraction of the population that is fully vaccinated in each of those 10 countries, in the same order.

After computing those two lists call the function `plot` to visualize the figure, as shown below:

```python
    countries = [ 'France', 'Germany', 'Italy', ... ]
    percents = [ 0.3, 0.2, 0.1, ... ]
    plot(countries, percents)
```

In [15]:
import bokeh
import bokeh.plotting
bokeh.plotting.output_notebook()

def plot(countries, percents):
    """Generates and displays a Bokeh plot with horizontal bars, one bar per country"""
    figure = bokeh.plotting.figure(
        title = 'Percentage of population fully vaccinated (countries with population ≥ 1M)',
        x_axis_label = 'percentage',
        x_range = (0, 1),
        y_range = countries,
        plot_width = 800,
        plot_height = 400,
        background_fill_color = 'whitesmoke',
        background_fill_alpha = 0.8
    )
    figure.xaxis.formatter = bokeh.models.formatters.NumeralTickFormatter(format='0%')
    figure.ygrid.grid_line_color = None

    figure.hbar(right=percents, y=countries, height=0.5, color='coral')
    bokeh.plotting.show(figure)

In [16]:
countries, percents = [], []
for country, percent in df_vaccinated[is_bigger_than_1m]['percent_fully_vaccinated'].nlargest(10).items():
    countries.insert(0, country)
    percents.insert(0, percent)
    
plot(countries, percents)
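The two lists can also be obtained without an explicit loop. The sketch below reverses a series that is already sorted in descending order (as `nlargest` returns it) and converts its index and values to lists, which yields the same ordering as the repeated `insert(0, ...)` calls above. The series `top` and its labels are made up, and `plot` is not called here so that the block stays self-contained.

```python
import pandas as pd

# Made-up ranking, in descending order as returned by nlargest()
top = pd.Series({'A': 0.5, 'B': 0.3, 'C': 0.1}, name='percent_fully_vaccinated')

# Reverse the order so that the smallest value comes first
ascending = top.iloc[::-1]

# Labels and values as plain Python lists, in matching order
countries = ascending.index.tolist()
percents = ascending.tolist()

print(countries)
print(percents)
```

These lists could then be passed to the notebook's `plot(countries, percents)` in the same way as the loop-based version.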