## Why pandas?

- pandas is a multitool - it covers the basics you want when exploring a dataset;
- pandas is quick to code - loading, transformation, summarisation and visualisation all come built in;
- pandas is Python - and its gravity pulls much of the Python data ecosystem towards compatibility and seamless integration.

In [None]:
import pandas as pd
import numpy as np
pd.options.display.max_rows = 2

### 1. loading data

In [None]:
!ls *.csv

In [None]:
v = pd.read_csv("country_vaccinations.csv", parse_dates=['date'])
v

In [None]:
c = pd.read_csv("continents2.csv")
c

In [None]:
p = pd.read_csv("population_by_country_2020.csv")
p
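Before going further, it can help to confirm that a file loaded the way you expect. The sketch below uses a small in-memory CSV with made-up rows (not the real `country_vaccinations.csv`) to show what `parse_dates` does and how to count missing values; the names `sample_csv`, `sample` and `missing` are illustrative only.

```python
import io
import pandas as pd

# Made-up sample standing in for country_vaccinations.csv (illustrative values only).
sample_csv = io.StringIO(
    "country,date,daily_vaccinations\n"
    "Atlantis,2021-01-01,10\n"
    "Atlantis,2021-01-02,\n"
    "Borduria,2021-01-01,5\n"
)
sample = pd.read_csv(sample_csv, parse_dates=["date"])

# parse_dates turns the text column into datetime64, so .max() and plotting work on real dates.
print(sample.dtypes)

# Count missing values per column; empty CSV fields are read as NaN.
missing = sample.isna().sum()
print(missing)
```

Running the same two checks (`.dtypes` and `.isna().sum()`) on `v`, `c` and `p` gives a quick picture of what each file contains.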

## Now we have vaccination data. How many people are already vaccinated?

In [None]:
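# Sum of the daily_vaccinations column over all countries and dates.
# This column appears to count doses given per day, so the total may exceed the number of distinct people.
v.daily_vaccinations.sum()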

In [None]:
v.groupby(['country']).daily_vaccinations.sum().sort_values(ascending=False).to_frame().T.astype(int)

In [None]:
v.date.max()

## by region?

In [None]:
v.merge(c, left_on='country', right_on='name').groupby('sub-region').daily_vaccinations.sum().sort_values(ascending=False).to_frame().T.astype(int)

In [None]:
(
    v
    .merge(c, left_on='country', right_on='name')
    .groupby('sub-region')
    .daily_vaccinations
    .sum()
    .sort_values(ascending=False)
    .to_frame()
    .T
    .astype(int)
)
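The merge above is an inner join by default, so any country whose spelling differs between the two files silently drops out of the regional totals. A minimal sketch of how one might check for that, using hypothetical toy frames (`v_toy`, `c_toy`) rather than the real data:

```python
import pandas as pd

# Hypothetical toy frames; the real v and c have many more rows and columns.
v_toy = pd.DataFrame({
    "country": ["Atlantis", "Atlantis", "Borduria", "Syldavia"],
    "daily_vaccinations": [10, 20, 5, 7],
})
c_toy = pd.DataFrame({
    "name": ["Atlantis", "Borduria"],
    "sub-region": ["North", "South"],
})

# how="left" keeps every vaccination row; indicator=True adds a "_merge" column
# recording whether a match was found in c_toy.
checked = v_toy.merge(c_toy, left_on="country", right_on="name", how="left", indicator=True)

# Countries present in the vaccination data but absent from the continents table.
unmatched = sorted(checked.loc[checked["_merge"] == "left_only", "country"].unique())
print(unmatched)
```

If the same check on the real frames returns a non-empty list, those names need mapping before the regional sums can be trusted.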

## how are they trending?

In [None]:
(
    v
    .merge(c, left_on='country', right_on='name')
    .groupby(['date','sub-region'])
    .daily_vaccinations
    .sum()                      # doses per date and sub-region
    .unstack('sub-region')      # one column per sub-region, rows ordered by date
    .cumsum()                   # running total over time
    .plot()
    .legend(loc='center left', bbox_to_anchor=(1.0, 0.5))
)
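One detail of the unstack-then-cumsum pattern is worth seeing on a small case: when a group has no row for some date, `unstack` leaves a NaN there, and `cumsum` keeps that cell as NaN while continuing the running total afterwards. The toy data below is invented for illustration; passing `fill_value=0` to `unstack` is one possible way to treat a missing day as zero doses.

```python
import pandas as pd

# Invented toy data: South has no row on 2021-01-02.
toy = pd.DataFrame({
    "date": pd.to_datetime(
        ["2021-01-01", "2021-01-01", "2021-01-02", "2021-01-03", "2021-01-03"]
    ),
    "sub-region": ["North", "South", "North", "North", "South"],
    "daily_vaccinations": [10, 5, 20, 30, 7],
})

daily = toy.groupby(["date", "sub-region"]).daily_vaccinations.sum()

# Default unstack: the missing South cell becomes NaN and stays NaN after cumsum.
with_gaps = daily.unstack("sub-region").cumsum()

# fill_value=0 treats a missing day as zero doses, so the running total has no holes.
no_gaps = daily.unstack("sub-region", fill_value=0).cumsum()

print(with_gaps)
print(no_gaps)
```

Whether a gap should count as zero or as "not reported" depends on the dataset, so the choice between the two variants is a judgement call.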

## 9 things floating in the air
Time to pin some things down: store the cumulative per-country totals in a variable so that later cells can reuse them.

In [None]:
csum = (
    v
    .merge(c, left_on='country', right_on='name')
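    .groupby(['date','country'])
    .daily_vaccinations
    .sum()                               # doses per date and country
    .unstack('country')                  # wide: one column per country, rows ordered by date
    .cumsum()                            # running total per country
    .stack('country')                    # back to long format
    .to_frame('vaccination_progress')
    .reset_index()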
)
csum

## percentage of population?

In [None]:
cp = csum.merge(p, left_on='country', right_on='Country (or dependency)')
cp
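This merge assumes the population table has exactly one row per country; a duplicated country would quietly duplicate every matching vaccination row and inflate later sums. As a hedged sketch, `merge` accepts a `validate` argument that raises instead. The toy frames below (`csum_toy`, `p_dup`) are hypothetical and include a deliberate duplicate.

```python
import pandas as pd

csum_toy = pd.DataFrame({
    "country": ["Atlantis", "Atlantis", "Borduria"],
    "vaccination_progress": [10, 30, 5],
})
# Hypothetical population table with an accidental duplicate row for Atlantis.
p_dup = pd.DataFrame({
    "Country (or dependency)": ["Atlantis", "Atlantis", "Borduria"],
    "Population (2020)": [1000, 1000, 500],
})

# validate="many_to_one" requires the right-hand keys to be unique.
try:
    csum_toy.merge(
        p_dup, left_on="country", right_on="Country (or dependency)",
        validate="many_to_one",
    )
    merge_ok = True
except pd.errors.MergeError as err:
    merge_ok = False
    print("merge rejected:", err)

# After removing the duplicate, the same merge goes through.
p_clean = p_dup.drop_duplicates("Country (or dependency)")
cp_toy = csum_toy.merge(
    p_clean, left_on="country", right_on="Country (or dependency)",
    validate="many_to_one",
)
print(cp_toy)
```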

In [None]:
cp['perc_vac'] = cp.vaccination_progress / cp['Population (2020)'] * 100
cpg = cp.groupby(['date','country']).perc_vac.sum().unstack('country')
cpg

In [None]:
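# Carry each country's last reported value forward, then rank countries by the latest row.
# (.ffill() replaces fillna(method='ffill'), which is deprecated in recent pandas.)
cpg.ffill().iloc[-1].sort_values(ascending=False)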
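The forward fill matters because countries report up to different dates, so the last row of the wide table is NaN for anyone who has not reported that day. A small sketch with invented percentages (`cpg_toy` is a stand-in, not the real `cpg`):

```python
import numpy as np
import pandas as pd

# Toy wide table of percentages: rows are dates, columns are countries.
cpg_toy = pd.DataFrame(
    {"Atlantis": [1.0, 2.5, np.nan], "Borduria": [0.5, np.nan, np.nan]},
    index=pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-03"]),
)

# Without forward-filling, the last row is NaN for countries with no report that day.
raw_last = cpg_toy.iloc[-1]

# ffill carries each country's most recent value down to the last row.
latest = cpg_toy.ffill().iloc[-1].sort_values(ascending=False)

print(raw_last)
print(latest)
```

Since the figures are built from dose counts, a value above 100 would not necessarily indicate an error; it may simply reflect more than one dose per person.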