# Pandas Intermediate Analysis

We have covered a number of features within Pandas, such as the basics of **Series** and **DataFrame**, reading files, checking for missing data, querying/selection, aggregation, sorting/ranking.

In [None]:
import numpy as np
import pandas as pd
import sys
sys.path.insert(0, "../Scripts")
import misc_fig
import appendage_figs

## Hierarchical Indexing

Up to this point, we've mostly focused on one-dimensional and two-dimensional data, stored in Pandas `Series` and `DataFrame` objects, respectively. Often it is useful to go beyond this and store higher-dimensional data - that is, data indexed by more than one or two keys. A common practice is to make use of *hierarchical indexing* or *multi-indexing* to incorporate multiple index levels within a single index. In this way, higher-dimensional data can be compactly represented within the familiar 1D or 2D objects.

Pandas supports this through the `MultiIndex` object. In this section we look at indexing, slicing and computing statistics across multiply indexed data. A naive way of representing such data is to use Python tuples as keys:

In [None]:
index = [('California', 2000), ('California', 2010),
         ('New York', 2000), ('New York', 2010),
         ('Texas', 2000), ('Texas', 2010)]
populations = [33871648, 37253956,
               18976457, 19378102,
               20851820, 25145561]
pop = pd.Series(populations, index=index)
pop

With this indexing scheme, you can index or slice the series based on this multiple index:

In [None]:
pop[("California", 2010):("Texas",2000)]

But if we need to do something more complicated (e.g. select all values from 2010), we have to resort to slow and clumsy pure-Python munging:

In [None]:
pop[[i for i in pop.index if i[1]==2010]]

### Improving on this: Pandas MultiIndex

Fortunately, Pandas has the solution. We can create a multi-index from the tuples as follows:

In [None]:
pd.MultiIndex.from_tuples(index)

Notice the multiple *levels* of indexing - in this case, the state names and the years, as well as multiple *codes* (called *labels* in older versions of Pandas) for each data point, which **encode** these levels. You could think of this as using categorical labels (see the later section).

If we re-index our series with a `MultiIndex`, we see a hierarchical representation:

In [None]:
pop=pop.reindex(pd.MultiIndex.from_tuples(index))
pop

Here the first two columns show the multiindex, and the final column shows the data. Now to access all data for `year == 2010`, we simply use the familiar Pandas slicing notation:

In [None]:
pop[:, 2010]
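
Selection also works on the outer level, and by level position. The following is a minimal sketch that rebuilds the same population data under the assumed name `pop_demo` so that it runs on its own; `xs()` is the standard Pandas cross-section method.

```python
import pandas as pd

# Rebuild the population Series from the cells above so this block is self-contained.
demo_index = pd.MultiIndex.from_tuples(
    [("California", 2000), ("California", 2010),
     ("New York", 2000), ("New York", 2010),
     ("Texas", 2000), ("Texas", 2010)]
)
pop_demo = pd.Series(
    [33871648, 37253956, 18976457, 19378102, 20851820, 25145561],
    index=demo_index,
)

# Partial indexing on the outer level returns a Series indexed by year.
california = pop_demo.loc["California"]

# xs() takes a cross-section at a given level (level 1 is the year here)
# and drops that level from the result.
year_2010 = pop_demo.xs(2010, level=1)

print(california)
print(year_2010)
```

Both results are ordinary `Series` with a single-level index, so everything covered earlier applies to them unchanged.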

### MultiIndex as extra dimension

We could have instead stored the same data using a `DataFrame` with the year as a column-axis instead of a `MultiIndex`. The `unstack()` method will convert a multiply-indexed `Series` into a conventionally-indexed `DataFrame`:

In [None]:
pop_df = pop.unstack()
pop_df

Likewise this can be reversed with the `stack()` method:

In [None]:
pop_df.stack()

Now we can take it a step further and use *hierarchical indexing* within a `DataFrame` context. Here is an example:

In [None]:
pop_df = pd.DataFrame({"total": pop, "under18": [9267089, 9284094,
                                   4687374, 4318033,
                                   5906301, 6879014]})

In [None]:
pop_df

Often it is convenient to name the levels of the `MultiIndex`. This can be done by passing the `names` argument to any MultiIndex constructor, or by setting the `names` attribute of the index after creation:

In [None]:
pop.index.names=["state","year"]
pop
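
The other route mentioned above, passing `names` at construction time, might look like the sketch below. The variable names `named_index` and `pop_named` are illustrative only.

```python
import pandas as pd

# Name the levels while building the index.
named_index = pd.MultiIndex.from_product(
    [["California", "New York", "Texas"], [2000, 2010]],
    names=["state", "year"],
)
pop_named = pd.Series(
    [33871648, 37253956, 18976457, 19378102, 20851820, 25145561],
    index=named_index,
)

# Named levels can be referred to by name rather than by position.
texas = pop_named.xs("Texas", level="state")
print(texas)
```

Referring to levels by name tends to make later selections and aggregations easier to read than positional level numbers.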

### Explicit MultiIndex constructors

Several class methods construct a `MultiIndex` directly. The most common are `from_arrays` and `from_tuples`:

In [None]:
pd.MultiIndex.from_arrays([["a","a","b","b"],[1,2,1,2]])

In [None]:
pd.MultiIndex.from_tuples([("a",1), ("a",2), ("b",1), ("b",2)])

In addition, we could construct using the Cartesian product of single indices:

In [None]:
pd.MultiIndex.from_product([["a","b"],[1, 2]])
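
As a quick check, the three constructors shown above describe the same index. This sketch compares them with `MultiIndex.equals`; the variable names are illustrative.

```python
import pandas as pd

from_arrays = pd.MultiIndex.from_arrays([["a", "a", "b", "b"], [1, 2, 1, 2]])
from_tuples = pd.MultiIndex.from_tuples([("a", 1), ("a", 2), ("b", 1), ("b", 2)])
from_product = pd.MultiIndex.from_product([["a", "b"], [1, 2]])

# equals() compares the entries of two indexes element by element.
print(from_arrays.equals(from_tuples))
print(from_tuples.equals(from_product))
```

Which constructor to use is mostly a matter of which shape the raw keys already have.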

### MultiIndex for Index and Columns

In a `DataFrame`, the rows and columns are symmetric, and just as the rows can have multiple levels, so can the columns! 

In [None]:
def mock_health_data():
    idx=pd.MultiIndex.from_product([[2013,2014],[1,2]], names=["year","visit"])
    col=pd.MultiIndex.from_product([["Bob","Guido","Sue"],["HR","Temp"]],names=["subject","type"])
    # data
    data = np.round(np.random.randn(4, 6), 1)
    data[:,::2]*=10
    data+=37
    return data, idx, col

data, idx, col = mock_health_data()

hc_data = pd.DataFrame(data=data, index=idx, columns=col)
hc_data

This essentially represents 4-dimensional data, where dimensions include subject, measurement type, the year and visit number. We can do top-level column selection by the person's name and get a full `DataFrame` containing a subject's information:

In [None]:
hc_data["Guido"]

In [None]:
hc_data["Guido","HR"]

As with the single-index case, we can use the `loc` and `iloc` indexers introduced previously:

In [None]:
hc_data.iloc[:2,:2]

In [None]:
hc_data.loc[:, ("Bob","HR")]
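
Slicing inside the tuples passed to `loc` gets awkward quickly. One option is `pd.IndexSlice`, sketched below on a stand-in frame called `hc_demo`, which is built like `hc_data` but with its own random values (an assumption made so the block runs on its own).

```python
import numpy as np
import pandas as pd

# Stand-in for hc_data with the same row and column structure.
rng = np.random.default_rng(0)
rows = pd.MultiIndex.from_product([[2013, 2014], [1, 2]], names=["year", "visit"])
cols = pd.MultiIndex.from_product(
    [["Bob", "Guido", "Sue"], ["HR", "Temp"]], names=["subject", "type"]
)
hc_demo = pd.DataFrame(rng.normal(37, 1, size=(4, 6)).round(1), index=rows, columns=cols)

ix = pd.IndexSlice
# Every subject's heart rate, restricted to the first visit of each year.
first_visit_hr = hc_demo.loc[ix[:, 1], ix[:, "HR"]]
print(first_visit_hr)
```

This kind of slicing expects the index to be sorted; indexes built with `from_product` from sorted inputs, as here, already are.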

### Index setting and resetting

Another way to rearrange hierarchical data is to turn the index labels into columns using the `reset_index` method. Calling this on the single Series will result in a `DataFrame`. For clarity, we can specify the name of the data for the new column(s) formed:

In [None]:
pop_flat = pop.reset_index(name="population")
pop_flat

And likewise an index can be directly set using the columns already existing in the DataFrame:

In [None]:
pop_flat.set_index(["state","year"])

### Aggregation on Multi-Indices

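As previously seen, Pandas has built-in aggregation methods, such as `mean()`, `sum()` and `max()`. For hierarchically indexed data, we can aggregate over a single level by grouping on that level first. (Older versions of Pandas accepted a `level` parameter on these methods directly, e.g. `hc_data.mean(level="year")`; it was deprecated in Pandas 1.3 and removed in 2.0.)

In [None]:
hc_data.groupby(level="year").mean()

To aggregate over a level of the columns instead, transpose so that the column levels become row levels, group, and transpose back:

In [None]:
hc_data.T.groupby(level="subject").mean().T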

## Combining Datasets: Concat and Append

Interesting studies often draw on several different data sources. Combining these data sources can involve anything from a straightforward and simple concatenation to a more complicated database-style function such as `join` or `merge` that handle any overlaps between datasets. Pandas includes functions and methods that make this sort of data wrangling fast and easy.

In [None]:
def make_df(cols, ind):
    data = {c: [str(c) + str(i) for i in ind] for c in cols}
    return pd.DataFrame(data, ind)

make_df("ABC",range(3))

### Simple concatenation with `pd.concat`

Pandas has a function, `pd.concat`, which has a similar syntax to `np.concatenate` but offers a number of additional options:

```python
pd.concat(objs, axis=0, join="outer", ignore_index=False, keys=None,
          levels=None, names=None, verify_integrity=False, sort=False, copy=True)
```

The signature above lists the available parameters (the `join_axes` argument found in older releases was removed in Pandas 1.0). The simplest use is to concatenate two `Series`:

In [None]:
ser1 = pd.Series(["A","B","C"],index=range(3))
ser2 = pd.Series(["D","E","F"],index=range(3,6))
pd.concat([ser1,ser2])

In [None]:
class display(object):
    """Display HTML representation of multiple objects"""
    template = """<div style="float: left; padding: 10px;">
    <p style='font-family:"Courier New", Courier, monospace'>{0}</p>{1}
    </div>"""
    def __init__(self, *args):
        self.args = args
        
    def _repr_html_(self):
        return '\n'.join(self.template.format(a, eval(a)._repr_html_())
                         for a in self.args)
    
    def __repr__(self):
        return '\n\n'.join(a + '\n' + repr(eval(a))
                           for a in self.args)

In [None]:
df1=make_df("AB",[1,2])
df2=make_df("AB",[3,4])
# misc_fig.Display("df1","df2", "pd.concat([df1, df2])")

In [None]:
display("df1","df2","pd.concat([df1, df2])")

By default, the concatenation happens row-wise within the DataFrame (i.e. `axis=0`). We can specify the axis explicitly:

In [None]:
df3=make_df("AB",[1,2])
df4=make_df("CD",[1,2])
display("df3","df4","pd.concat([df3, df4], axis=1)")

#### Duplicate indices

One important difference between `np.concatenate` and `pd.concat` is that Pandas concatenation preserves indices, even if the result will have duplicates. Consider this example:

In [None]:
x = make_df('AB', [0, 1])
y = make_df('AB', [2, 3])
y.index = x.index  # make duplicate indices!
display('x', 'y', 'pd.concat([x, y])')

#### Catching the repeats as an error

If you want to verify that the indices in the result of `pd.concat` do not overlap, specify the `verify_integrity` flag. With this set to `True`, the concatenation will raise an exception if there are duplicate indices.

In [None]:
try:
    pd.concat([x, y], verify_integrity=True)
except ValueError as e:
    print("ValueError:",e)

#### Ignoring the index

Sometimes the index itself does not matter, and you would prefer to ignore it. This can be done with the `ignore_index` flag, which replaces the index with a fresh integer index:

In [None]:
display('x', 'y', 'pd.concat([x, y], ignore_index=True)')

#### Adding MultiIndex keys

Another option is to use the `keys` argument to specify a label for each data source; the result is then hierarchically indexed:

In [None]:
display('x', 'y', 'pd.concat([x, y], keys=["x","y"])')
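
Once the sources are labelled, each one can be recovered from the combined object. A small sketch follows, using stand-in frames `x_demo` and `y_demo` (hypothetical names) and the `names` argument to label the two index levels.

```python
import pandas as pd

x_demo = pd.DataFrame({"A": ["A0", "A1"], "B": ["B0", "B1"]})
y_demo = pd.DataFrame({"A": ["A2", "A3"], "B": ["B2", "B3"]})

# keys labels each source; names labels the resulting index levels.
combined = pd.concat([x_demo, y_demo], keys=["x", "y"], names=["source", "row"])

# Selecting on the outer level recovers the rows that came from y_demo.
from_y = combined.loc["y"]
print(combined)
print(from_y)
```

This keeps the duplicate row labels harmless, because the outer level disambiguates them.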

### Concatenation with joins

In the examples so far we looked at `DataFrame` objects with *shared column names*. In practice, data from different sources might have different sets of column names, and `pd.concat` offers several options in this case. Consider the following concatenation:

In [None]:
df5 = make_df('ABC', [1, 2])
df6 = make_df('BCD', [3, 4])
#sort=false to remove a warning
display('df5', 'df6', 'pd.concat([df5, df6],sort=False)')

By default, entries for which no data is available are filled with `np.nan`. To change this, we can use the `join` parameter of the concat function. By default, the join is a *union* of the input columns (`join="outer"`), but we can change this to an *intersection* of the columns (`join="inner"`):

In [None]:
display('df5', 'df6', 'pd.concat([df5, df6],join="inner",sort=False)')
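
Older releases of Pandas also had a `join_axes` argument for keeping exactly the columns of one chosen input; it has since been removed. A minimal sketch of one way to get the same effect, assuming stand-in frames `df5_demo` and `df6_demo` with the same layout as above, is to reindex the columns after concatenating:

```python
import pandas as pd

df5_demo = pd.DataFrame(
    {"A": ["A1", "A2"], "B": ["B1", "B2"], "C": ["C1", "C2"]}, index=[1, 2]
)
df6_demo = pd.DataFrame(
    {"B": ["B3", "B4"], "C": ["C3", "C4"], "D": ["D3", "D4"]}, index=[3, 4]
)

# Outer-join as usual, then keep only the columns of the first frame.
left_cols = pd.concat([df5_demo, df6_demo], sort=False).reindex(columns=df5_demo.columns)
print(left_cols)
```

Column `D` is dropped, while column `A` is kept and filled with `NaN` for the rows that came from the second frame.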

### The `append()` method

Because direct concatenation is so common, older versions of Pandas gave `Series` and `DataFrame` objects an `append()` method that did the same thing in fewer keystrokes. The method was deprecated in Pandas 1.4 and removed in 2.0, so on current versions the equivalent call is `pd.concat([df1, df2])`:

In [None]:
# On Pandas < 2.0 this could be written as df1.append(df2)
display("df1", "df2", "pd.concat([df1, df2])")

Keep in mind that unlike the `append()` and `extend()` methods of Python lists, `append()` in Pandas did **not** modify the original object; it created a new object with the combined data, as `pd.concat` does. Building up a result this way is not very efficient either, because every call creates a new index and data buffer. Thus if you plan to combine many pieces, collect the DataFrames in a list and pass them all at once to the `concat` function.
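
The advice above can be sketched as follows. The `batches` list is hypothetical and stands in for data that arrives piece by piece.

```python
import pandas as pd

# Hypothetical batches standing in for data arriving piece by piece.
batches = [
    pd.DataFrame({"batch": [i, i], "value": [i * 10, i * 10 + 1]})
    for i in range(3)
]

# Collect the pieces in a plain Python list, then concatenate once at the end.
combined_batches = pd.concat(batches, ignore_index=True)
print(combined_batches)
```

A single `concat` call allocates the combined index and data once, instead of once per piece.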

## Combining Datasets: Merge and Join

One essential feature of Pandas is its high-performance, in-memory join and merge operations. If you have worked with databases, you will be familiar with this type of data interaction. The main interface for this is the `pd.merge` function, with some examples below.

### Categories of Joins 

The `pd.merge` function implements a number of types of joins:

- *One-to-one*
- *Many-to-one*
- *Many-to-many*

All three types of joins are accessed via an identical call to the `pd.merge` interface; the type of join performed depends on the form of the input data. Here we will show simple examples of the three types of merges.

#### One-to-one

Perhaps the simplest type of merge is the one-to-one join, which is very similar to the column-wise concatenation seen previously.

In [None]:
df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})
df2 = pd.DataFrame({'employee': ['Lisa', 'Bob', 'Jake', 'Sue'],
                    'hire_date': [2004, 2008, 2012, 2014]})
display('df1', 'df2', "pd.merge(df1, df2)")

Here the `pd.merge` function recognizes that each `DataFrame` shares the *employee* column, and automatically joins using this column as the key. The result of the merge is a new DataFrame that combines the information from the two inputs. Notice that the order of entries in each column is not necessarily maintained. Additionally keep in mind that merges in general *discard the index*, except in a few special cases of merges by index.

#### Many-to-one

Many-to-one joins are where one of the two key columns contains duplicate entries. For the many-to-one case, the resulting `DataFrame` will preserve those duplicate entries as appropriate. Consider the following example of a *many-to-one* join:

In [None]:
df3 = pd.merge(df1, df2)
df3

In [None]:
df4 = pd.DataFrame({'group': ['Accounting', 'Engineering', 'HR'],
                    'supervisor': ['Carly', 'Guido', 'Steve']})
display('df3', 'df4', 'pd.merge(df3, df4)')

Here the resulting `DataFrame` has an additional column with 'supervisor' information, where the information is repeated in one or more locations as required.

#### Many-to-many

As an extension of *many-to-one*, the key columns in a many-to-many relationship have duplicates in both the left and right inputs. Consider the following example:

In [None]:
df5 = pd.DataFrame({'group': ['Accounting', 'Accounting',
                              'Engineering', 'Engineering', 'HR', 'HR'],
                    'skills': ['math', 'spreadsheets', 'coding', 'linux',
                               'spreadsheets', 'organization']})
display('df1', 'df5', "pd.merge(df1, df5)")

It's worth remembering that in practice, the datasets are rarely as clean as the one we're working with here. In the following section we'll consider some of the options provided by `pd.merge` that enable you to tune how the join operations work.
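
Because the kind of join is inferred from the data, it can help to state the expectation explicitly. `pd.merge` accepts a `validate` argument for this; the sketch below uses stand-in frames `employees` and `supervisors` (hypothetical names) that mirror the examples above.

```python
import pandas as pd

employees = pd.DataFrame({
    "employee": ["Bob", "Jake", "Lisa", "Sue"],
    "group": ["Accounting", "Engineering", "Engineering", "HR"],
})
supervisors = pd.DataFrame({
    "group": ["Accounting", "Engineering", "HR"],
    "supervisor": ["Carly", "Guido", "Steve"],
})

# Each group appears once on the right, so the many-to-one check passes.
checked = pd.merge(employees, supervisors, validate="many_to_one")

# Claiming the same merge is one-to-one raises, because "Engineering"
# appears twice in the left key column.
try:
    pd.merge(employees, supervisors, validate="one_to_one")
    one_to_one_ok = True
except pd.errors.MergeError:
    one_to_one_ok = False

print(checked)
print("one-to-one check passed:", one_to_one_ok)
```

With messy data, a failed check like this is often the first sign of unexpected duplicate keys.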

### Specifying the Merge Key

We can explicitly specify which column to merge on using the `on` keyword:

In [None]:
display('df1', 'df2', "pd.merge(df1, df2, on='employee')")

This option only works if the left and right `DataFrame` both have the column `employee`.

At times however, you may wish to merge two datasets with different column names; for example we may have a dataset in which the employee name is labelled as *name* rather than *employee*. In this case we can specify those names using `left_on` and `right_on` keywords:

In [None]:
df3 = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'salary': [70000, 80000, 120000, 90000]})
display('df1', 'df3', 'pd.merge(df1, df3, left_on="employee", right_on="name")')

The result has a redundant column that we can optionally drop using `drop("name", axis=1)`.
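
Put together, the merge and the clean-up might look like this sketch. The frames `staff` and `pay` are stand-ins for `df1` and `df3` above.

```python
import pandas as pd

staff = pd.DataFrame({
    "employee": ["Bob", "Jake", "Lisa", "Sue"],
    "group": ["Accounting", "Engineering", "Engineering", "HR"],
})
pay = pd.DataFrame({
    "name": ["Bob", "Jake", "Lisa", "Sue"],
    "salary": [70000, 80000, 120000, 90000],
})

# Merge on differently named key columns, then drop the duplicate key.
merged_pay = pd.merge(staff, pay, left_on="employee", right_on="name").drop("name", axis=1)
print(merged_pay)
```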

### The `left_index` and `right_index` keywords

Sometimes, rather than merging on a column, we want to merge on the index itself.

In [None]:
df1a = df1.set_index('employee')
df2a = df2.set_index('employee')
display('df1a', 'df2a')

You can use the index as a key for merging as before:

In [None]:
display('df1a', 'df2a',"pd.merge(df1a, df2a, left_index=True, right_index=True)")

For convenience the `join()` method, also implemented by `DataFrame`, does exactly this.
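
A short sketch of `join()` next to the equivalent `pd.merge` call, using stand-in frames `groups_by_emp` and `hires_by_emp` (hypothetical names). Note that `join()` defaults to a left join whereas `pd.merge` defaults to an inner join; the two agree here only because both frames contain the same employees.

```python
import pandas as pd

groups_by_emp = pd.DataFrame({
    "employee": ["Bob", "Jake", "Lisa", "Sue"],
    "group": ["Accounting", "Engineering", "Engineering", "HR"],
}).set_index("employee")
hires_by_emp = pd.DataFrame({
    "employee": ["Lisa", "Bob", "Jake", "Sue"],
    "hire_date": [2004, 2008, 2012, 2014],
}).set_index("employee")

# join() aligns on the index of both frames.
joined = groups_by_emp.join(hires_by_emp)
merged_on_index = pd.merge(groups_by_emp, hires_by_emp, left_index=True, right_index=True)
print(joined)
```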

If you'd like to mix indices and columns, you can combine `left_index` with `right_on` or vice versa to get the desired behaviour:

In [None]:
display('df1a', 'df3', "pd.merge(df1a, df3, left_index=True, right_on='name')")

By default, `pd.merge` results in an *inner-join* or the intersection of the two sets of inputs. We can specify the type of join explicitly using the `how` keyword:

In [None]:
df6 = pd.DataFrame({'name': ['Peter', 'Paul', 'Mary'],
                    'food': ['fish', 'beans', 'bread']},
                   columns=['name', 'food'])
df7 = pd.DataFrame({'name': ['Mary', 'Joseph'],
                    'drink': ['wine', 'beer']},
                   columns=['name', 'drink'])
display('df6', 'df7', 'pd.merge(df6, df7)')

In [None]:
display('df6', 'df7', 'pd.merge(df6, df7, how="outer")')
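
The `how` keyword also accepts `"left"` and `"right"`, which keep every row of one input. In addition, `indicator=True` adds a `_merge` column recording where each row came from. This sketch uses stand-in frames `foods` and `drinks` that mirror `df6` and `df7`.

```python
import pandas as pd

foods = pd.DataFrame({"name": ["Peter", "Paul", "Mary"],
                      "food": ["fish", "beans", "bread"]})
drinks = pd.DataFrame({"name": ["Mary", "Joseph"],
                       "drink": ["wine", "beer"]})

# Keep every row of the left input; unmatched drinks become NaN.
left_join = pd.merge(foods, drinks, how="left")

# An outer join with an audit column: left_only, right_only or both.
audit = pd.merge(foods, drinks, how="outer", indicator=True)
print(left_join)
print(audit)
```

The audit column is a convenient way to see which keys failed to match before deciding on the join type.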

#### Overlapping column names: `suffixes`

You may end up in a case where your two input `DataFrame` objects have conflicting column names. Consider this case:

In [None]:
df8 = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'rank': [1, 2, 3, 4]})
df9 = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'rank': [3, 1, 4, 2]})
display('df8', 'df9', 'pd.merge(df8, df9, on="name")')

Because there is a conflict in the `rank` column, the merge function automatically appends a suffix to each column output. These can be specified using the `suffixes` keyword:

In [None]:
display('df8', 'df9', 'pd.merge(df8, df9, on="name", suffixes=["_L","_R"])')

## Groupby: Split, Apply, Combine

Simple aggregations give you a flavour of your dataset, but often it would be nice to aggregate *conditionally* on some label or index; this is known as a *groupby* operation. The name 'groupby' comes from the `GROUP BY` clause in SQL, the database query language.

In [None]:
appendage_figs.demo_aggregate_df()

Let's go through what each step in `groupby` accomplishes:

- *Split*: Involves breaking up and grouping a `DataFrame` based on the value of the specified column/key.
- *Apply*: Computing some function, usually aggregate, transformation or filter, within individual groups.
- *Combine*: Merge the results of these operations back together into an output array.

Here's a concrete example:

In [None]:
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data': range(6)}, columns=['key', 'data'])
df

The most basic operation can be computed with the `groupby()` method, passing the name of the desired key column(s):

In [None]:
df.groupby("key")

Pandas here returns a `DataFrameGroupBy` object, which is essentially a special view that computes nothing until an aggregation function has been given. This "lazy evaluation" approach means that common aggregates can be implemented efficiently in a way that is almost transparent to the user. To produce a result, we perform an appropriate aggregation function, for example `sum()`:

In [None]:
df.groupby("key").sum()
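
To make the three steps concrete, here is a hand-rolled sketch of the same sum. It is only an illustration of split, apply and combine, not how Pandas implements `groupby` internally; `df_demo` is a stand-in for the `df` above.

```python
import pandas as pd

df_demo = pd.DataFrame({"key": ["A", "B", "C", "A", "B", "C"],
                        "data": range(6)})

# Split: iterating over a GroupBy yields (group label, sub-DataFrame) pairs.
# Apply: sum the data column within each group.
partial_sums = {label: group["data"].sum() for label, group in df_demo.groupby("key")}

# Combine: gather the per-group results into a single Series.
manual = pd.Series(partial_sums, name="data")
builtin = df_demo.groupby("key")["data"].sum()
print(manual)
```

The built-in aggregation gives the same numbers without materialising the intermediate pieces in Python.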

### Groupby object: Real world with cdystonia

Here we'll use a real-world example of aggregation using `groupby`:

In [None]:
cdystonia = pd.read_csv("datasets/cdystonia.csv")
cdystonia.head(3)

Below we select a particular column (a `Series`) from the grouped `DataFrame` and compute its mean for each patient:

In [None]:
cdystonia.groupby("patient")["twstrs"].mean().head()

### Aggregate, filter, transform, apply

The preceding discussion focused on aggregation alone for the *apply* part of `groupby`, but there are even more options available. In particular, `groupby` objects have `aggregate()`, `filter()`, `transform()` and generic `apply()` methods that efficiently implement a variety of useful operations before combining the grouped data.

In [None]:
rng = np.random.RandomState(0)
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data1': range(6),
                   'data2': rng.randint(0, 10, 6)},
                   columns = ['key', 'data1', 'data2'])
df

#### Aggregate

In [None]:
df.groupby("key").aggregate(["min",np.median, max])

Another useful pattern is to pass a dictionary mapping column names to operations to be applied:

In [None]:
df.groupby("key").aggregate({"data1": "min", "data2": np.max})
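
A related pattern is *named aggregation*, where keyword arguments name the output columns. In this sketch `df_demo` is a stand-in for `df`, with `data2` fixed to illustrative values rather than drawn at random.

```python
import pandas as pd

df_demo = pd.DataFrame({"key": ["A", "B", "C", "A", "B", "C"],
                        "data1": range(6),
                        "data2": [5, 0, 3, 3, 7, 9]})  # fixed illustrative values

# Each keyword names an output column: (input column, aggregation).
summary = df_demo.groupby("key").agg(
    data1_min=("data1", "min"),
    data2_max=("data2", "max"),
)
print(summary)
```

This avoids the hierarchical column labels that a list of functions produces.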

#### Filtering

A filtering operation allows you to drop data based on the group properties. For example, we might want to keep all groups where the standard deviation is larger than some critical threshold:

In [None]:
def filter_func(x):
    return x["data2"].std() > 4

display("df", "df.groupby('key').std()", "df.groupby('key').filter(filter_func)")

The `filter` function should return a Boolean value specifying whether the group passes the filtering. Here because group A does not have `std() > 4`, it is dropped.

#### Transformation

While aggregation must return a reduced version of the data, transformation must return some transformed version of the full data to recombine. In other words, transformation output is the same shape as the input. A common example is to center the data by subtracting the group-wise mean:

In [None]:
display("df", "df.groupby('key').transform(lambda x: (x - x.mean()))")
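
A fuller standardization also divides by the group-wise standard deviation. A minimal sketch, again on a stand-in `df_demo` with fixed illustrative values:

```python
import pandas as pd

df_demo = pd.DataFrame({"key": ["A", "B", "C", "A", "B", "C"],
                        "data1": range(6),
                        "data2": [5, 0, 3, 3, 7, 9]})  # fixed illustrative values

# Z-score within each group: subtract the group mean, divide by the group std.
zscores = df_demo.groupby("key")[["data1", "data2"]].transform(
    lambda x: (x - x.mean()) / x.std()
)
print(zscores)
```

The output has one row per input row, and within every group each column now has mean 0 and standard deviation 1.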

### Specifying the split key

The most common case is to split the `DataFrame` on a column, however the key itself can be manually specified in a number of ways.

For example, a dictionary can map the index values to some output value, which is used in the aggregation function:

In [None]:
df2 = df.set_index('key')
mapping = {'A': 'vowel', 'B': 'consonant', 'C': 'consonant'}
display('df2', 'df2.groupby(mapping).sum()')

The key can also be any Python function that takes an index value and returns the group:

In [None]:
display('df2', 'df2.groupby(str.lower).sum()')

Furthermore, any of the preceding key choices can be combined to group on a multi-index:

In [None]:
display('df2', 'df2.groupby([str.lower, mapping]).sum()')

## Reshaping DataFrame objects: Real world example

In the context of a single DataFrame, we are often interested in re-arranging the layout of our data; particularly for machine learning where algorithms for prediction rely on strict criteria for $X$ and $y$ inputs.

To illustrate this, we will work with a dataset from "*Statistical Methods for the Analysis of Repeated Measurements by Charles S. Davis, pp. 161-163 (Springer, 2002)*", which handles some data from a controlled trial of botulinum toxin type B (BoTB) in patients with cervical dystonia.
* Response variable: (twstrs), measuring severity, pain, and disability caused by cervical dystonia.
* Measured multiple times per patient in weeks 0, 2, 4, 8, 12 and 16.

In [None]:
cdystonia = pd.read_csv("datasets/cdystonia.csv")
cdystonia.head()

We could use the `stack()` method to rotate the dataframe, so that columns are represented as rows:

In [None]:
cdystonia.stack().head(12)

We could create a **hierarchical index** with this to make the data more understandable:

In [None]:
cdystonia2 = cdystonia.set_index(['patient','obs']).drop("id",axis=1)
cdystonia2.head(8)

We could disregard most of the table, and unstack the response variable in columns to make it 'per patient':

In [None]:
twstrs_wide = cdystonia2.twstrs.unstack("obs")
twstrs_wide.head()

Now we could re-merge this back into the original dataset with observation and twstrs as column-data, making a wide-format:

In [None]:
cdystonia_wide = (cdystonia[['patient','site','treat','age','sex']]
     .drop_duplicates()
     .merge(twstrs_wide, right_index=True, left_on="patient", how='inner'))
cdystonia_wide.head()
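
The same long-to-wide steps can be tried without the data file. The sketch below uses a tiny made-up frame, `long_df`, whose column names and values are hypothetical.

```python
import pandas as pd

# Made-up long-format data: one row per patient per observation.
long_df = pd.DataFrame({
    "patient": [1, 1, 1, 2, 2, 2],
    "obs": [1, 2, 3, 1, 2, 3],
    "treat": ["A", "A", "A", "B", "B", "B"],
    "score": [32, 30, 24, 60, 26, 27],
})

# Move the observation number from the row index into the columns.
wide_scores = long_df.set_index(["patient", "obs"])["score"].unstack("obs")

# One row of per-patient attributes, merged with the wide scores.
per_patient = long_df[["patient", "treat"]].drop_duplicates()
wide = per_patient.merge(wide_scores, left_on="patient", right_index=True)
print(wide)
```

Each patient now occupies a single row, with one column per observation.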

## Pivot Tables

Now that we've explored `groupby` and reshaping `DataFrame` objects, let's explore *pivots*. A pivot table is a similar operation, commonly seen in spreadsheets and other programs that operate on tabular data. The pivot table takes simple column-wise data as input, and groups the entries into a two-dimensional table that provides a multidimensional summary of the data. It helps to think of pivot tables as a *multidimensional* version of `groupby` aggregation.

We're going to be using the *Titanic* dataset again to illustrate some of their uses.

In [None]:
titanic = pd.read_excel("datasets/titanic.xlsx")
titanic.head()

In [None]:
titanic.pivot_table("Survived", index="Sex", columns="Pclass")

### Multi-level pivots

Just as in the GroupBy, the grouping in pivot tables can be specified with multiple levels, and via a number of options. For example, we might be interested in looking at age as a third dimension. We'll bin the age using the `pd.cut` function:

In [None]:
tit_age = pd.cut(titanic["Age"], [0, 18, 80])
titanic.pivot_table("Survived", ["Sex", tit_age], "Pclass")

We can apply the same strategy when working with the columns as well; let's add info on the fare paid using `pd.qcut` to automatically compute quantiles:

In [None]:
fare = pd.qcut(titanic['Fare'], 2)
titanic.pivot_table('Survived', ['Sex', tit_age], [fare, 'Pclass'])
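
The difference between the two binning functions is easiest to see on a small skewed sample. In this sketch the `values` series is made up for illustration.

```python
import pandas as pd

values = pd.Series([1, 2, 3, 4, 5, 6, 7, 100])

# pd.cut uses the bin edges you give it, so bins can be very uneven.
fixed = pd.cut(values, [0, 50, 100])

# pd.qcut chooses edges from the data so that bins hold equal counts.
by_quantile = pd.qcut(values, 2)

print(fixed.value_counts())
print(by_quantile.value_counts())
```

With fixed edges the outlier sits alone in its bin, whereas the quantile-based bins split the sample in half.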

### Additional pivot table options

The full call signature of `pivot_table` has a huge roster of options:

```python
DataFrame.pivot_table(values=None, index=None, columns=None,
                      aggfunc='mean', fill_value=None, margins=False,
                      dropna=True, margins_name='All')
```
We've already seen examples of the first three arguments; here we'll take a quick look at the remaining ones. Two of the options, `fill_value` and `dropna`, have to do with missing data and are fairly straightforward; we will not show examples of them here.

The aggfunc keyword controls what type of aggregation is applied, which is a mean by default. As in the GroupBy, the aggregation specification can be a string representing one of several common choices (e.g., `sum`, `mean`, `count`, `min`, `max`, etc.) or a function that implements an aggregation (e.g., `np.sum()`, `min()`, `sum()`, etc.). Additionally, it can be specified as a dictionary mapping a column to any of the above desired options:

In [None]:
titanic.pivot_table(index="Sex", columns="Pclass", aggfunc={"Survived": sum, "Fare": "mean"})

Note that we've omitted the `values` keyword: when `aggfunc` is given as a dictionary, the value columns are taken from its keys.
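
The `margins` and `margins_name` options from the signature add totals along each grouping. A minimal sketch on a made-up frame, `toy`, whose values are hypothetical and not taken from the Titanic data:

```python
import pandas as pd

# Made-up passenger records for illustration only.
toy = pd.DataFrame({
    "Sex": ["female", "female", "male", "male", "male", "female"],
    "Pclass": [1, 3, 1, 3, 3, 1],
    "Survived": [1, 1, 0, 0, 1, 1],
})

# margins=True appends a totals row and column, labelled by margins_name.
table = toy.pivot_table("Survived", index="Sex", columns="Pclass",
                        aggfunc="sum", margins=True, margins_name="Total")
print(table)
```

The margin cells are computed with the same `aggfunc` as the body of the table, here a sum of survivors.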

We **highly recommend** you check out the [cookbook](https://pandas.pydata.org/pandas-docs/stable/cookbook.html) for efficient ways of handling and processing your pandas.DataFrames.

## Tasks

You're going to be working with several different datasets including:

- **Birth rates in the US**
- **Discovered Planets**
- **fMRI data**
- **World Bank data**

They all exist within the `datasets/` directory, so there is no need to download them.

### Task 1

We'll be working with the births dataset, which is a record from the 1960s of the number of yearly/monthly births across the US.

1. Import the `births` dataset. 
2. Create a new column called `decade` that groups years together. 
3. Create a pivot with total number of births, `decade` on the x-axis and `gender` as the columns
4. (Optional) Plot the pivoted M and F values using `plt.plot`.
5. (Optional) Create a pivot but using `year` on the x-axis instead of `decade`, and re-plot

In [None]:
# your code here

### Task 2

We're working with the planets dataset, which records the exoplanets discovered in each year, along with their orbital period, mass and distance from Earth.

1. Import the `planets` dataset.
2. Display the descriptions of all of the major aggregations using `df.describe`.
3. Calculate the median orbital period for each method.
4. Calculate the decade that each sample is in.
5. Count how many planets were discovered by each method in each decade.
6. (Optional) Plot the scatterplot between the `mass` and the `orbital_period`, colouring by the `distance`.

In [None]:
# your code here

### Task 3

We're working with an fMRI dataset; taken from www.openfmri.org, we have a number of subjects with time-series data, including the signal strength $y$, and regional descriptions. 

1. Import the fMRI dataset.
2. Create a multi-index using `timepoint` and the `event`.
3. Calculate the median `signal` across each `timepoint`. (Optionally plot)
4. Recalculate the mean `signal` across each `timepoint`, filtering out signals near $0 \pm 0.01$.
5. Convert `event` and `region` into categorical variables
6. (Optional) Using `seaborn`, produce a `catplot` with $x$ being `timepoint`, $y$ being `signal`, hue coloured by `subject`, columns being `event` and rows being `region`. Use pointplot. 

In [None]:
# your code here

### Task 4

We'll be working with the *World Bank data* on country population, fertility rate and life expectancy, taken from the [World Bank website](www.data.worldbank.org), with thanks to [Kaggle](https://www.kaggle.com/gemartin/world-bank-data-1960-to-2016).

This data is broken into three files:

- `life_expectancy.csv`: Number of years a newborn would live if the patterns of mortality at the time of birth remain the same throughout life.
- `fertility_rate.csv`: Number of children a woman would give birth to during her childbearing years.
- `country_population.csv`: Total number of residents regardless of legal status or citizenship (midyear estimates).

The subtasks are as follows:

1. Import all three datasets.
2. Remove rows that are not actual countries (aggregates such as World, or HDECs).
3. Drop any countries that have missing values.
4. Merge together the three datasets using the country code as an index. Create a MultiColumn Index for the Time-series for each data type.
5. Calculate the change in population for each year as a percentage.
6. (Optional) Plot the yearly percent change in population across time for 5 countries from the following groups: *Western Europe*, *East Asia*, *Sub-Saharan Africa* and *Middle East*.

In [None]:
# your code here

### Task 5 (Optional)

We will be continuing with the World Bank datasets provided, with some in-depth technical tasks which mostly revolve around plots. Knowledge of Matplotlib is recommended here, but you can at least do the `pandas` part of the exercises without knowing too much plotting material.

The subtasks are as follows:

1. Plot the scatterplot between **fertility rate** and **yearly percent change in population** for the *first and last* 5 years, using the life expectancy as the colour.
2. Create a column 'half-decade' that takes the average every 5 years (say 1960-64, 1965-69, etc.)
3. Calculate an *adjusted fertility rate*, whereby if two children are needed to maintain the population, $f-2$ represents the adjusted fertility rate. Negative values indicate that the population cannot be sustainably maintained by fertility alone.
4. Plot the *adjusted fertility rate* for 5 countries from the following groups: *Western Europe*, *East Asia*, *Sub-Saharan Africa* and *Middle East*.

In [None]:
# your code here

### Task 6 (Optional)

What would be nice would be to predict how the populations will grow in the **future**, given the data we have. In order to do this, we should first off make some assumptions about the data:

1. Roughly 46-48% of the population are women - women that influence the **fertility rate**. We can be fairly confident that this metric remains consistent.
2. The distribution of women's age follows some sort of distribution: in this instance we recommend you choose the *Gamma distribution*. This can be obtained from the `scipy.stats` package. We can use parameters such as the population size and life expectancy per year to help determine the shape of the distribution.
3. Women between the ages of 15 to 35 are in their optimal fertility period.

Thus we can draw a sample of women from any given population given their appropriate distribution $\mathcal{G}$, and using the fertility rate, calculate within their 20-year window the probability of producing a child that year. Using this number, we can simulate those children generated and see how they compare to the actual increase in population. We make some assumptions here; primarily that any age within this window there is a uniform chance of choosing to produce a child - realistically this is not the case.

Your tasks are thus as follows:

1. Write a function `new_babies` that, given the country, year and the best $\sigma$, estimates the number of babies born that year. This can be achieved by drawing from the **Gamma distribution** with $\gamma \in [2, 4]$ and $\sigma \in [2, 12]$. Optimize for the best $\sigma$. Assume 47% of the population is female, with breeding ages between 15 and 35 (the 20-year window described above). Let $n$ be the total number of women in the population, $p$ be the estimated breeding proportion of that population, $\delta$ be the age difference between the top and bottom breeding age, and $f$ be the fertility rate; then the estimated number of new babies is:
$$
\hat{\beta}=\frac{np}{\delta}f
$$
note that $\hat{\beta} \in \mathbb{Z}$. 
2. Calculate the estimated births using a range of $\sigma \in [2, 12]$, for each year, using the life expectancy as a maximum parameter for the size of the sigma interval for the **Gamma distribution**. What do you notice about the values produced?
3. Factor in the *mortality rate* to your estimated population, which varies from country to country, but usually lies in the range 5-15 deaths annually per 1000 people. Global averages have decreased from 19.1 in 1950-55 to 8.1 in 2015-2020. Does the model improve?
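
As a sanity check on the formula for $\hat{\beta}$ in subtask 1, here is the arithmetic alone, with made-up inputs. The helper name `estimated_births` and all of the numbers are hypothetical; estimating $p$ from the Gamma distribution is left to the task.

```python
def estimated_births(n_women, breeding_share, age_window, fertility_rate):
    """Evaluate beta_hat = n * p / delta * f, rounded to a whole number."""
    return round(n_women * breeding_share / age_window * fertility_rate)


# Hypothetical inputs: 1,000,000 people, 47% of them women, 30% of the women
# within the breeding window, a 20-year window, and a fertility rate of 2.0.
n_women = round(0.47 * 1_000_000)
print(estimated_births(n_women, breeding_share=0.3, age_window=20, fertility_rate=2.0))
```

Dividing by $\delta$ spreads each woman's expected number of children evenly over the years of the window, which is the uniform-chance assumption stated earlier.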

In [None]:
# your code here

## Solutions

**WARNING**: _Please attempt to solve the problems before fetching the solutions!_

See the solutions to all of the problems here:

In [None]:
%load solutions/02_solutions.py