# Homework pandas

<table align="left">
    <tr>
    <td><a href="https://colab.research.google.com/github/airnandez/numpandas/blob/master/exam/2020-exam.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a></td>
    <td><a href="https://mybinder.org/v2/gh/airnandez/numpandas/master?filepath=exam%2F2020-exam.ipynb">
  <img src="https://mybinder.org/badge_logo.svg" alt="Launch Binder"/>
</a></td>
  </tr>
</table>

*Author: Fabio Hernandez*

*Last updated: 2021-04-02*

*Location:* https://github.com/airnandez/numpandas/exam

--------------------
## Instructions

For this exercise we will use a public dataset curated and made available by [Our World in Data](https://ourworldindata.org) located in [this repository](https://github.com/owid/covid-19-data/tree/master/public/data). We will use a snapshot of the dataset as of 2021-04-02.

For your convenience, this notebook already contains code for downloading the snapshot dataset from its source and loading it into memory as a **pandas** dataframe, as well as some cleaning and helper functions. Your mission is to execute the provided cells and to write the code that answers the questions below.

You must not modify the code provided. You must provide code for answering the questions, following the instructions for each one of them.

When you have finished, please save your notebook in the form of a `.ipynb` file and send it to your instructor according to the instructions you received by e-mail.

---------------------
## Dependencies

In [None]:
import datetime
import os
import glob

In [None]:
import pandas as pd
pd.set_option('display.max_columns', None)
pd.__version__

In [None]:
import numpy as np
np.__version__

------
## Download the dataset

Define a helper function for downloading data to a local file:

In [None]:
import requests

def download(url: str, path: str):
    """Download file at url and save it locally at path."""
    with requests.get(url, stream=True) as resp:
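        if not resp.ok:
            raise RuntimeError(f'Could not download file at URL {url} (HTTP status {resp.status_code})')

        # Save as text when the server says so, as raw bytes otherwise
        mode, data = 'wb', resp.content
        if 'text/plain' in resp.headers.get('Content-Type', ''):
            mode, data = 'wt', resp.text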
        with open(path, mode) as f:
            f.write(data)

Download the data files. We store the downloaded data in the directory `../data` relative to the location of this notebook. If a file has already been downloaded, we don't download it again.

In [None]:
# Download files
data_sources = (
    "https://raw.githubusercontent.com/airnandez/numpandas/master/data/2021-04-02-owid-covid-data.csv",
)

# Create destination directory
os.makedirs(os.path.join('..', 'data'), exist_ok=True)

for url in data_sources:
    # Build the destination file path from the last component of the URL
    path = os.path.join('..', 'data', os.path.basename(url))

    # If file already exists don't download it again
    if not os.path.isfile(path):
        print(f'downloading {url} to {path}')
        download(url, path)

Check what files we have for our analysis:

In [None]:
file_paths = glob.glob(os.path.join('..', 'data', '2021-*-owid-*'))
print('\n'.join(file_paths))

---------------------
## Load the data

Load the file `2021-04-02-owid-covid-data.csv` into a **pandas** dataframe.

⚠️ **Make sure you become familiar with the contents of that file by reading the [codebook](https://github.com/owid/covid-19-data/blob/master/public/data/owid-covid-codebook.csv), which describes the meaning of each column.**

In [None]:
path = os.path.join('..', 'data', '2021-04-02-owid-covid-data.csv')
df = pd.read_csv(path, parse_dates=['date'])
df.sample(5)

------------------------
## Question 1: number of cases, incidence and fatality ratio

We want to compute the total number of cases, the total number of deaths, the incidence and the fatality ratio in France and in the world as of 2021-04-01.

The fatality ratio is the number of deaths divided by the total number of confirmed COVID-19 cases. The incidence is the total number of confirmed cases divided by the population. Both are expressed as percentages in the cells below.
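As a quick sanity check of these two definitions, here is a minimal sketch using made-up round numbers (they are not taken from the dataset). The helper names `incidence_pct` and `fatality_pct` and the `example_*` variables are hypothetical and are not part of the provided code.

```python
def incidence_pct(total_cases: float, population: float) -> float:
    """Confirmed cases as a percentage of the population."""
    return total_cases / population * 100


def fatality_pct(total_deaths: float, total_cases: float) -> float:
    """Deaths as a percentage of confirmed cases."""
    return total_deaths / total_cases * 100


# Made-up figures for an imaginary country (NOT values from the dataset)
example_population = 1_000_000
example_cases = 50_000
example_deaths = 1_000

example_incidence = incidence_pct(example_cases, example_population)
example_fatality = fatality_pct(example_deaths, example_cases)

print(f'Incidence:      {example_incidence:.2f}%')   # 5.00%
print(f'Fatality ratio: {example_fatality:.2f}%')    # 2.00%
```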

### Question 1a (3 points)

Compute the total number of cases, deaths, incidence and fatality ratio for France. You must write code to extract the relevant information from the dataframe and assign the appropriate values to the variables defined in the cell below.

In [None]:
# Total confirmed cases of COVID-19 in France
...
total_cases_fr = ...

# Population in France
population_fr = ...

# Total number of deaths attributed to COVID-19 in France
total_deaths_fr = ...

# Incidence in France
incidence_fr = (total_cases_fr / population_fr) * 100

# Fatality ratio: deaths vs confirmed cases
fatality_fr = (total_deaths_fr / total_cases_fr) * 100

print(f'Population in France:             {population_fr:>12,.0f}')
print(f'Total number of cases in France:  {total_cases_fr:>12,.0f}')
print(f'Total number of deaths in France: {total_deaths_fr:>12,.0f}')
print(f'Incidence in France:              {incidence_fr:>12,.2f}%')
print(f'Fatality ratio in France:         {fatality_fr:>12.2f}%')

### Question 1b (3 points)

As done for France in the previous question, here you need to compute the total number of cases, deaths, incidence and fatality ratio for the entire world:

In [None]:
# Select data for the whole world
...
population_world = ...
total_cases_world = ...
total_deaths_world = ...
incidence_world = (total_cases_world / population_world) * 100
fatality_world = (total_deaths_world / total_cases_world) * 100

print(f'World population:                    {population_world:>14,.0f}')
print(f'Total number of cases in the world:  {total_cases_world:>14,.0f}')
print(f'Total number of deaths in the world: {total_deaths_world:>14,.0f}')
print(f'Incidence in the world:              {incidence_world:>14,.2f}%')
print(f'Fatality ratio in the world:         {fatality_world:>14.2f}%')

------------------
## Question 2 (7 points)

Compute and print a list with the names of the **countries** that, taken together, have administered 80% of the vaccination doses given worldwide.

⚠️ Please note that in this dataframe there are rows that contain information about a region (e.g. Europe, Asia, World), in addition to information about individual countries.
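To illustrate the kind of operations involved, here is a small sketch on a made-up dataframe. The locations and dose counts are invented, and the column `doses` is a hypothetical stand-in rather than a column of the dataset. The sketch drops the rows whose `iso_code` starts with `OWID_`, sorts the remaining rows by decreasing number of doses and uses a cumulative sum to find the smallest leading group of countries that reaches 80% of the total. This is only one possible reading of the question and one possible approach.

```python
import pandas as pd

# Made-up data: four imaginary countries plus one aggregate (region) row
toy = pd.DataFrame({
    'iso_code': ['AAA', 'BBB', 'CCC', 'DDD', 'OWID_XXX'],
    'location': ['Country A', 'Country B', 'Country C', 'Country D', 'Some region'],
    'doses':    [50, 25, 15, 10, 100],
})

# Keep only the rows that describe a country
is_region = toy['iso_code'].str.startswith('OWID_')
toy_countries = toy[~is_region]

# Sort by doses, largest first, and compute the cumulative share of the total
ranked = toy_countries.sort_values('doses', ascending=False)
cumulative_share = ranked['doses'].cumsum() / ranked['doses'].sum()

# Share accumulated *before* each country: a country is needed as long as
# the countries ranked above it have not yet reached the 80% threshold
share_before = cumulative_share.shift(fill_value=0)
top_countries = ranked.loc[share_before < 0.8, 'location'].tolist()
print(top_countries)
```

With these invented numbers the first two countries only reach 75% of the total, so the third one is also needed to pass the threshold.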

In [None]:
# Select only the rows which contain information about a country (as opposed to a region)
# Regions are encoded with a 'iso_code' of the form 'OWID_XXXXX'
...

# Compute the vaccination doses administered by all countries
...

# Sort the countries by their value of vaccination doses administered
...

# Build the list of countries which have administered 80% of the 
# doses administered around the world
...

_____
## Question 3 (7 points)

Compute an ordered list of the top 10 countries with a population of more than 1 million, ranked by the fraction of their population that has already received **all the doses** prescribed by the vaccination protocol.
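The dataset holds one row per location and per date, and a cumulative column may be empty on days without a report. Below is a minimal sketch, on made-up data with invented locations and numbers, of one way to collapse such rows to one row per country and rank the result. It assumes that the cumulative counts never decrease, so that the per-country maximum is also the latest reported value; that assumption should be checked against the real data.

```python
import numpy as np
import pandas as pd

# Made-up daily records for three imaginary countries (NOT real data)
toy = pd.DataFrame({
    'location': ['Country A', 'Country A', 'Country B', 'Country B', 'Country C'],
    'date': pd.to_datetime(['2021-03-31', '2021-04-01'] * 2 + ['2021-04-01']),
    'people_fully_vaccinated': [100_000, np.nan, 300_000, 400_000, 50_000],
    'population': [2_000_000, 2_000_000, 4_000_000, 4_000_000, 500_000],
})

# One row per country: max() skips missing values, so it returns the
# highest (here also the latest) reported cumulative count
per_country = toy.groupby('location')[['people_fully_vaccinated', 'population']].max()

# Keep the countries above the population threshold, then rank by fraction
large = per_country[per_country['population'] > 1_000_000]
fraction = large['people_fully_vaccinated'] / large['population']
ranking = fraction.sort_values(ascending=False).head(10)
print(ranking)
```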

In [None]:
# Build a dataframe with one row per country and two columns: 'people_fully_vaccinated' and 'population'
...

# Extract the number of people fully vaccinated and the population for each country
...

# Among the countries with more than 1M people, select the top 10
# ranked by percentage of fully vaccinated population
...

----------------
## Bonus question (3 points)

The function `plot` below generates and displays a figure showing a set of countries and the percentage of their population that is fully vaccinated. You need to provide the data to visualize the top 10 countries with a population of at least 1 million people that have the largest fraction of their population fully vaccinated (see Question 3).

To use this function, you must compute two Python lists:

* the list `countries`, which contains the names of the top 10 countries with a population of at least 1 million people that have the largest fraction of their population fully vaccinated,
* the list `percents`, which contains the fully vaccinated share of the population of each of those 10 countries, in the same order and expressed as a fraction between 0 and 1 (the x axis of the plot spans 0 to 1 and is labeled in percent).

After computing those two lists call the function `plot` to visualize the figure, as shown below:

```python
countries = ['France', 'Germany', 'Italy', ...]
percents = [0.3, 0.2, 0.1, ...]
plot(countries, percents)
```
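If the ranking is held in a pandas `Series` indexed by country name, the two lists can be obtained as sketched here. The series `toy_ranking` and its values are made up for illustration, and the `toy_*` names are hypothetical.

```python
import pandas as pd

# Made-up ranking: fraction of the population fully vaccinated, by country
toy_ranking = pd.Series({'Country A': 0.30, 'Country B': 0.20, 'Country C': 0.10})

toy_countries = toy_ranking.index.tolist()   # country names, in ranking order
toy_percents = toy_ranking.tolist()          # matching fractions, same order
print(toy_countries, toy_percents)
```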

In [None]:
import bokeh
import bokeh.models
import bokeh.plotting
bokeh.plotting.output_notebook()

def plot(countries, percents):
    """Generates and displays a Bokeh plot with horizontal bars, one bar per country"""
    figure = bokeh.plotting.figure(
        title = 'Percentage of population fully vaccinated (countries with population ≥ 1M)',
        x_axis_label = 'percentage',
        x_range = (0, 1),
        y_range = countries,
        plot_width = 800,
        plot_height = 400,
        background_fill_color = 'whitesmoke',
        background_fill_alpha = 0.8
    )
    figure.xaxis.formatter = bokeh.models.formatters.NumeralTickFormatter(format='0%')
    figure.ygrid.grid_line_color = None

    figure.hbar(right=percents, y=countries, height=0.5, color='coral')
    bokeh.plotting.show(figure)

In [None]:
countries = ...
percents = ...

plot(countries, percents)