# Practical Python exercise: Merging and aggregating

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
%matplotlib inline

In [None]:
population=pd.read_json('population.json')
economy=pd.read_json('economy.json')

In [None]:
population

In [None]:
economy

In [None]:
population['Regions'].value_counts()

# Your Task

- Use methods like `.head()` (to show the first rows), `.describe()` (to get descriptive statistics of a dataframe or a column), and/or `.value_counts()` (to get a frequency table of a column) to get a sense of both datasets.
- What do the datasets have in common, and how do they differ?

In [None]:
population.describe()

In [None]:
population['Regions'].value_counts()

In [None]:
population.head()

In [None]:
# now the economy...

# Discuss: What type of join?
Discuss with your neighbor:
- what type of join (inner, outer, left, right) you want; and
- which column(s) to join on.

Then, create a combined dataframe with a command along the lines of

```
df = population.merge(economy, on='columnname', how='left/right/inner/outer')
```
or if you have multiple columns to join on:
```
df = population.merge(economy, on=['columnname','columnname'], how='left/right/inner/outer')
```



Then, give some information about the resulting dataframe.
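To get a feeling for how the join type changes the result, here is a minimal sketch on two made-up toy dataframes. The data and the columns `inhabitants` and `gdp_change` are invented for illustration and are not taken from `population.json` or `economy.json`; only `.merge()` and its `on`, `how` and `indicator` arguments are real pandas.

```python
import pandas as pd

# Toy stand-ins for the two datasets; values and value columns are hypothetical.
left = pd.DataFrame({
    "Regions": ["A", "A", "B"],
    "Periods": [2000, 2001, 2000],
    "inhabitants": [10, 11, 20],
})
right = pd.DataFrame({
    "Regions": ["A", "B", "B"],
    "Periods": [2000, 2000, 2001],
    "gdp_change": [1.5, 2.0, 2.5],
})

# Count how many rows each join type keeps when joining on two columns.
rows_per_join = {
    how: len(left.merge(right, on=["Regions", "Periods"], how=how))
    for how in ["inner", "left", "right", "outer"]
}
print(rows_per_join)

# indicator=True adds a '_merge' column telling where each row came from.
outer = left.merge(right, on=["Regions", "Periods"], how="outer", indicator=True)
print(outer["_merge"].value_counts())
```

Comparing the row counts, and checking `_merge` for rows that occur in only one of the two dataframes, is one way to "give some information" about a merged dataframe.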

In [None]:
# your code here

## Setting an index
While our columns have descriptive names (headers), our rows currently don't: they are just numbered. However, we can give them *meaningful* names. A nice side effect is that you will get better plots, with meaningful axis labels, later on.
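As a small illustration on made-up data (the dataframe `toy` and its `value` column are hypothetical), the sketch below shows what changes when a column becomes the index. It uses `.set_index()`, which is an alternative to assigning to `df.index` directly: by default it also removes the column from the regular columns.

```python
import pandas as pd

# Hypothetical toy data, not the exercise dataset.
toy = pd.DataFrame({
    "Periods": [2000, 2001, 2002],
    "value": [1.0, 2.0, 3.0],
})

# Default row labels are just consecutive integers.
print(toy.index.tolist())

# set_index returns a new dataframe whose row labels are the 'Periods' values.
indexed = toy.set_index("Periods")
print(indexed.index.tolist())

# Rows can now be looked up by their meaningful label.
print(indexed.loc[2001, "value"])
```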

In [None]:
df.index=df['Periods']

See the difference?

In [None]:
df.head()

## Analyze the data

Let's practice a bit with `.groupby()` and `.agg()`.
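If the split-apply-combine idea behind these two methods is new to you, the following minimal sketch on invented data may help. The dataframe `toy` and its columns `group` and `score` are hypothetical; passing function names as strings to `.agg()` is standard pandas.

```python
import pandas as pd

# Hypothetical toy data: two groups with two observations each.
toy = pd.DataFrame({
    "group": ["a", "a", "b", "b"],
    "score": [1.0, 3.0, 10.0, 20.0],
})

# Split the rows by 'group', apply one function per group,
# and combine the results into one value per group.
means = toy.groupby("group")["score"].agg("mean")
print(means)

# Passing a list of functions yields one column per function.
summary = toy.groupby("group")["score"].agg(["mean", "sum", "max"])
print(summary)
```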

In [None]:
df.plot()

In [None]:
df['GDPVolumeChanges_1'].plot(kind='bar')

## Discuss: Why does the above not work?

OK, got it?

Let's try this instead:

In [None]:
mysubset = df[['GDPVolumeChanges_1','Regions']]
mygroups = mysubset.groupby('Regions')
meanspergroup = mygroups.agg(np.mean)
meanspergroup.plot(kind='bar')

In [None]:
meanspergroup

In [None]:
df[['GDPVolumeChanges_1','Regions']].groupby("Regions").agg(lambda x: max(x) - min(x)).plot(kind='bar')

In [None]:
df[['GDPVolumeChanges_1','Regions']].groupby(
    'Regions').agg(np.mean).plot(kind='bar')

In [None]:
df.index = df['Periods']

In [None]:
df.drop('Periods', inplace=True, axis=1)

In [None]:
df

In [None]:
df['LiveBornChildren_2'].groupby("Periods").agg(sum).plot()

## Discuss: which aggregation function?

- Why did we choose `np.mean`?
- What function should we choose for analyzing `df['LiveBornChildren_2']`? Why?



## Some more possibilities

Just take a look and try to play around a bit...

In [None]:
df.groupby(df.index)['LiveBornChildren_2'].agg(sum)

In [None]:
df.groupby('Regions')['NetMigrationExcludingAdministrative_19'].plot(legend=True, figsize=(10, 10))

In [None]:
df[df['Regions'] == 'Flevoland']['NetMigrationExcludingAdministrative_19'].plot(legend=False, figsize=(4, 4))
df[df['Regions'] == 'Zuid-Holland']['NetMigrationExcludingAdministrative_19'].plot(legend=False)

In [None]:
df['Regions']=='Flevoland'

In [None]:
df.groupby(df.index)['NetMigrationExcludingAdministrative_19'].agg(sum).plot(legend=True)
df.groupby(df.index)['GDPVolumeChanges_1'].agg(np.mean).plot(legend=True, secondary_y=True)

### Discuss
I personally find this last plot a pretty cool one. Do you agree?

In [None]:
df[['NetMigrationExcludingAdministrative_19','GDPVolumeChanges_1']].corr() # we probably should have lagged one of the variables by a year or so for this.

## Correlational analysis

We could also look at some bivariate plots.
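The comment on the correlation table above suggests lagging one of the variables by a year. One possible way to do that is sketched below on invented data; the dataframe `toy` and all its column names are hypothetical. It assumes one row per region per year and sorts the rows first, because `.shift()` works on row positions, not on the year values themselves.

```python
import pandas as pd

# Hypothetical toy panel: two regions, four years each.
toy = pd.DataFrame({
    "region": ["A"] * 4 + ["B"] * 4,
    "year": [2000, 2001, 2002, 2003] * 2,
    "gdp_change": [1.0, 2.0, 3.0, 4.0, 2.0, 1.0, 3.0, 2.0],
    "migration": [5.0, 10.0, 20.0, 30.0, 8.0, 20.0, 10.0, 30.0],
})
toy = toy.sort_values(["region", "year"])

# Shift within each region, so that the last year of one region
# never ends up as the "previous year" of the next region.
toy["gdp_change_lag1"] = toy.groupby("region")["gdp_change"].shift(1)

# The first year of every region has no previous year and becomes NaN;
# .corr() ignores those rows pairwise.
print(toy[["migration", "gdp_change", "gdp_change_lag1"]].corr())
```

Shifting per group rather than over the whole column is a deliberate choice here: with several regions stacked in one dataframe, a plain `.shift(1)` would mix values across regions.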

In [None]:
df.plot(y='LiveBornChildren_2', x='GDPVolumeChanges_1', kind='scatter')

In [None]:
sns.lmplot(y='LiveBornChildren_2', x='GDPVolumeChanges_1', data=df,
           fit_reg=True, lowess=False, robust=True) 