# Practical 2 - MultiIndex, Stack, Unstack

In [None]:
import pandas as pd
import numpy as np

**Exercise 1**
Create a list of country names called `countries` with the values `['US','GB']`.
Create a list of sales types called `types` with the values `['in_store','online']`.

In [None]:
countries = ['US','GB']
types = ['in_store','online']


**Exercise 2** We would now like to have a MultiIndex where each country has a subindex of `in_store` and `online`.

Using `pd.MultiIndex.from_product`, create this MultiIndex called `cols`. For two lists, `list1, list2`, in this case `countries, types`, the syntax of `from_product` is as follows:

```python
pd.MultiIndex.from_product([list1, list2], names=[name1, name2])
```

This takes the Cartesian product of the two lists, which means you get every possible pair with one item from each list. `names=[...]` sets the name of each level of the MultiIndex. We would like the first level of the MultiIndex to be named `country` and the second level to be named `type`.
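
To make the pairing concrete, here is a minimal sketch on two hypothetical lists (`letters` and `numbers` are illustrative names, not part of the exercise). It only shows how many entries `from_product` produces and in which order.

```python
import pandas as pd

# Hypothetical lists, chosen only to make the pairing easy to see
letters = ['a', 'b']
numbers = [1, 2, 3]

mi = pd.MultiIndex.from_product([letters, numbers], names=['letter', 'number'])

# 2 x 3 = 6 entries; the first list varies slowest, the second fastest
pairs = list(mi)
print(pairs)
print(list(mi.names))
```

The same pattern applies to `countries` and `types`, which gives 2 x 2 = 4 column labels.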

In [None]:
cols = pd.MultiIndex.from_product([countries, types],names=['country', 'type'])


Now we can generate some random data, and create a DataFrame with the columns set by your MultiIndex:

In [None]:
a = np.random.randn(12,4)
df = pd.DataFrame(a, columns=cols)
df
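
Once the columns are a MultiIndex, they can be selected by outer label, by full tuple, or by a cross-section on the inner level. The sketch below rebuilds a frame of the same shape; the seeded generator is an assumption made only so the example is reproducible, and the names `us`, `us_online` and `online` are illustrative.

```python
import numpy as np
import pandas as pd

cols = pd.MultiIndex.from_product([['US', 'GB'], ['in_store', 'online']],
                                  names=['country', 'type'])
rng = np.random.default_rng(0)  # seeded only so the sketch is reproducible
df = pd.DataFrame(rng.standard_normal((12, 4)), columns=cols)

us = df['US']                      # outer label: DataFrame with the inner level as columns
us_online = df[('US', 'online')]   # full tuple: a single Series
online = df.xs('online', axis=1, level='type')  # cross-section on the inner level
```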

**Exercise 3**: Let's use the index to denote quarterly reports. Using `from_arrays`, create a row MultiIndex that groups the twelve rows into three years of four quarters each. Name the outer level `year` and the inner level `quarter`. For example, the first year needs to encapsulate rows `[0,1,2,3]`.

Here's an example of `from_arrays` from the pandas documentation:

```python
>>> arrays = [[1, 1, 2, 2], ['red', 'blue', 'red', 'blue']]
>>> pd.MultiIndex.from_arrays(arrays, names=('number', 'color'))
MultiIndex([(1,  'red'),
            (1, 'blue'),
            (2,  'red'),
            (2, 'blue')],
           names=['number', 'color'])
```


Use `df.index = index` to set the new index.

In [None]:
index = pd.MultiIndex.from_arrays([[0,0,0,0,1,1,1,1,2,2,2,2],df.index], names=['year','quarter'])

df.index = index
df
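
The solution above reuses the row numbers 0 to 11 as the inner level. If quarter labels that repeat within each year are preferred, one possible variant (an assumption about the desired labels, not a requirement of the exercise) builds the index with `from_product` instead:

```python
import numpy as np
import pandas as pd

years = [0, 1, 2]
quarters = [1, 2, 3, 4]   # assumed labelling: quarters numbered 1 to 4

index = pd.MultiIndex.from_product([years, quarters], names=['year', 'quarter'])
rng = np.random.default_rng(0)  # seeded only so the sketch is reproducible
df = pd.DataFrame(rng.standard_normal((12, 4)), index=index)

first_year = df.loc[0]          # the four rows belonging to year 0
q2 = df.xs(2, level='quarter')  # the second quarter of every year
```

With repeating labels, `xs` can pull the same quarter out of every year, which is not possible when the inner level is just the row number.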


**Exercise 4**: In the last practical we worked with the countries dataset. Import the dataset again and use `df.head()` to get a view of the first few rows. 

In [None]:
df = pd.read_csv('data/countries.csv', decimal=",")
df.head()


Remember that in pandas we can create a new DataFrame column from a Boolean expression, as follows:

In [None]:
df['High GDP'] = df['GDP ($ per capita)'] > df['GDP ($ per capita)'].mean()

df['High GDP']

**Exercise 5**: Create a new column `Net inward migration` containing Boolean values, indicating where `Net migration` is positive.

Using `df.dropna(subset=[<cols>])`, create a new DataFrame called `dfbd` in which any country with a missing value for `Net migration`, `Birthrate`, or `Deathrate` has been removed.
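
It is worth seeing why the missing values need to be dropped: a comparison against `NaN` evaluates to `False`, so a country with no migration figure would otherwise be counted as "not positive". A minimal sketch on hypothetical toy values (the frame `toy` and its numbers are made up, not taken from the countries dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy values, not taken from the countries dataset
toy = pd.DataFrame({
    'Country': ['A', 'B', 'C', 'D'],
    'Net migration': [1.5, -0.3, np.nan, 0.0],
    'Birthrate': [10.0, 20.0, 30.0, np.nan],
    'Deathrate': [8.0, 9.0, 7.0, 6.0],
})

# NaN > 0 is False, so country C looks the same as a country with no inward migration
toy['Net inward migration'] = toy['Net migration'] > 0

# A row is dropped if ANY of the listed columns is missing (C and D here)
clean = toy.dropna(subset=['Net migration', 'Birthrate', 'Deathrate'])
```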

Next, using `.groupby`, create a DataFrame with a MultiIndex whose levels are `Region` and `Net inward migration`, and whose columns are the `mean` of `Birthrate` and `Deathrate`.

Here's an example of `groupby` on a MultiIndex:

```python
>>> arrays = [['Falcon', 'Falcon', 'Parrot', 'Parrot'],
...           ['Captive', 'Wild', 'Captive', 'Wild']]
>>> index = pd.MultiIndex.from_arrays(arrays, names=('Animal', 'Type'))
>>> df = pd.DataFrame({'Max Speed': [390., 350., 30., 20.]},
...                   index=index)
>>> df
                Max Speed
Animal Type
Falcon Captive      390.0
       Wild         350.0
Parrot Captive       30.0
       Wild          20.0
>>> df.groupby(level=0).mean()
        Max Speed
Animal
Falcon      370.0
Parrot       25.0
>>> df.groupby(level="Type").mean()
         Max Speed
Type
Captive      210.0
Wild         185.0
```

In [None]:
df['Net inward migration'] = df['Net migration'] > 0
dfbd = df.dropna(subset = ['Net migration', 'Birthrate', 'Deathrate'])
gb = dfbd.groupby(['Region', 'Net inward migration'])[['Birthrate', 'Deathrate']].mean()
gb


Remember to consider what we have actually calculated here, and how it may differ from what someone might expect at first glance: each value is an unweighted mean of country-level rates, so countries with small populations contribute to the mean to the same degree as large ones.
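
The difference can be large. The sketch below compares the plain mean with a population-weighted mean on hypothetical toy data; the `Population` column and all numbers are assumptions for illustration, and weighting by population is only one possible alternative.

```python
import pandas as pd

# Hypothetical toy data: one large and one small country in the same region
toy = pd.DataFrame({
    'Region': ['X', 'X'],
    'Population': [99_000_000, 1_000_000],
    'Birthrate': [10.0, 30.0],
})

# Unweighted: each country counts once
unweighted = toy.groupby('Region')['Birthrate'].mean()

# Weighted: each country counts in proportion to its population
weighted_sum = (toy['Birthrate'] * toy['Population']).groupby(toy['Region']).sum()
weighted = weighted_sum / toy.groupby('Region')['Population'].sum()

print(unweighted['X'], weighted['X'])
```

Here the unweighted mean sits halfway between the two countries, while the weighted mean stays close to the large country's rate.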

We would like to re-format the output so that we have a single index but an additional level of column labels instead, and only show mean values for `Birthrate` and `Deathrate` where `Net inward migration` is `False`.

We are going to do this over multiple steps:

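**Exercise 6**: First, `unstack` the grouped DataFrame `gb` and name the result `emig`. By default, `unstack` moves the innermost index level (here `Net inward migration`) into the columns.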

In [None]:
emig = gb.unstack()


Now the DataFrame looks like this:

In [None]:
emig
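
To see what `unstack` does to the labels without loading the dataset, here is a small sketch on hypothetical group means (the regions `X`, `Y` and all values are made up). It also shows `stack`, which moves the innermost column level back into the index.

```python
import pandas as pd

# Hypothetical toy means, indexed by (Region, Net inward migration)
idx = pd.MultiIndex.from_product([['X', 'Y'], [False, True]],
                                 names=['Region', 'Net inward migration'])
gb = pd.DataFrame({'Birthrate': [30.0, 12.0, 25.0, 10.0],
                   'Deathrate': [9.0, 8.0, 7.0, 10.0]}, index=idx)

wide = gb.unstack()   # innermost index level becomes an inner column level
back = wide.stack()   # innermost column level returns to the index

print(list(wide.columns))
```

After `unstack`, each row is a single region and each original column is split into a `False` and a `True` sub-column.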

**Exercise 7**: Drop the `True` columns so that the means for each region cover only countries without net inward migration (that is, where `Net migration` is zero or negative). Use `df.drop`. You will have to specify the columns (in this case we just want to drop the `True` column), and also the `level=?` of the MultiIndex. Remember levels are indexed from 0!

In [None]:
emig = emig.drop(columns=True, level=1)
emig


**Exercise 8**: Sort the output by the values for mean `Birthrate` using [`sort_values`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sort_values.html). Because the columns are still a MultiIndex, the `by` key has to be the full column tuple.

In [None]:
emig = emig.sort_values(by=('Birthrate', False))
emig


**Exercise 9**: The DataFrame still has the `Net inward migration` level even though we only have the `False` columns. Since we don't need it any more, use `droplevel(i)`, where `i` is the position of the level, to remove the `Net inward migration` level of the MultiIndex.

Remember that the MultiIndex is over the columns and not the index! So you will have to do `df.columns = df.columns.droplevel(i)`. 

In [None]:
emig.columns = emig.columns.droplevel(1)
emig
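
As a cross-check, the same table can be reached by filtering first and grouping by `Region` alone. The sketch below runs both routes on hypothetical toy rows standing in for `dfbd` (all values are made up); the filter-first route is an alternative for comparison, not the approach the exercises ask for.

```python
import pandas as pd

# Hypothetical toy rows standing in for dfbd
dfbd = pd.DataFrame({
    'Region': ['X', 'X', 'X', 'Y', 'Y', 'Y'],
    'Net inward migration': [False, False, True, False, True, True],
    'Birthrate': [30.0, 34.0, 12.0, 20.0, 10.0, 14.0],
    'Deathrate': [9.0, 11.0, 8.0, 7.0, 10.0, 6.0],
})

# Route used in Exercises 5 to 9: group, unstack, drop True, sort, drop the level
gb = dfbd.groupby(['Region', 'Net inward migration'])[['Birthrate', 'Deathrate']].mean()
emig = gb.unstack().drop(columns=True, level=1).sort_values(by=('Birthrate', False))
emig.columns = emig.columns.droplevel(1)

# Filter-first route: keep only the False rows, then group by Region alone
direct = (dfbd[~dfbd['Net inward migration']]
          .groupby('Region')[['Birthrate', 'Deathrate']]
          .mean()
          .sort_values('Birthrate'))
```

One difference to keep in mind: a region with no `False` countries disappears in the filter-first route, whereas the `unstack` route keeps it as a row of `NaN` values.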
