# Pandas Groupby

We first need to read the files `2017_german_election_overall.csv` and `2017_german_election_party.csv` from the german-election-2017 dataset.

In [None]:
import pandas as pd
import numpy as np

In [None]:
german_party = pd.read_csv('ex-data/german-election-2017/2017_german_election_party.csv')
german_overall = pd.read_csv('ex-data/german-election-2017/2017_german_election_overall.csv')

In [None]:
german_party.head()

In [None]:
german_overall.head()

## For each area, compute the valid first votes as a percentage of the registered voters

In [None]:
german_overall['perc'] = german_overall['valid_first_votes'] / german_overall['registered.voters'] * 100
german_overall.head()

## For each state, compute the total number of registered voters

We can avoid dictionaries and explicit loops by using the `groupby` functionality that Pandas provides.

In [None]:
registered_voters = german_overall.groupby('state')['registered.voters'].sum()
registered_voters

In [None]:
type(registered_voters)
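To make clear what the `groupby` call replaces, here is a minimal sketch on a small made-up DataFrame. The column names mirror the dataset, but the values and the names `toy`, `totals` and `grouped` are invented for illustration. It computes the per-state totals once with a loop and a dictionary and once with `groupby`.

```python
import pandas as pd

# Toy data: column names mirror the dataset, values are invented.
toy = pd.DataFrame({
    'state': ['A', 'A', 'B'],
    'registered.voters': [100, 150, 200],
})

# Loop-and-dictionary version: accumulate one total per state.
totals = {}
for _, row in toy.iterrows():
    totals[row['state']] = totals.get(row['state'], 0) + row['registered.voters']

# groupby version: returns a Series indexed by state.
grouped = toy.groupby('state')['registered.voters'].sum()

print(totals)
print(grouped.to_dict())
```

Both approaches should give the same totals; the `groupby` version is shorter and returns a Series that can be indexed by state name.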

## How many registered voters are there in Bayern or Saarland? Compute the voters in each state and the sum of the two numbers

Because `registered_voters` is a Series indexed by state, this reduces to two label lookups and an addition.

In [None]:
bayern_voters = registered_voters['Bayern']
saarland_voters = registered_voters['Saarland']
# Show each state's total as well as the combined figure
print(bayern_voters, saarland_voters, bayern_voters + saarland_voters)
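As an alternative, several labels can be selected at once by passing a list to `.loc`. The sketch below uses a toy Series with invented values standing in for `registered_voters`; the names `voters`, `selected` and `combined` are hypothetical.

```python
import pandas as pd

# Toy Series standing in for `registered_voters`; values are invented.
voters = pd.Series({'Bayern': 300, 'Saarland': 50, 'Hessen': 120})

# Select both labels with a list, then sum the selection.
selected = voters.loc[['Bayern', 'Saarland']]
combined = selected.sum()

print(selected)
print(combined)
```

This form scales more easily if the list of states grows beyond two.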

## For each state, compute the number of votes (first vote) for each party

In [None]:
# Select the column before aggregating so that only it is summed
state_party_votes = german_party.groupby(['state', 'party'])['votes_first_vote'].sum()
state_party_votes
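Grouping by two keys yields a Series with a two-level index (state, party). The following sketch, on invented toy data with hypothetical names (`toy`, `votes`, `votes_in_a`, `wide`), shows two common ways to work with such a result: selecting one state, and pivoting the party level into columns with `unstack`.

```python
import pandas as pd

# Toy data: column names mirror the dataset, values are invented.
toy = pd.DataFrame({
    'state': ['A', 'A', 'A', 'B'],
    'party': ['X', 'X', 'Y', 'X'],
    'votes_first_vote': [10, 20, 5, 7],
})
votes = toy.groupby(['state', 'party'])['votes_first_vote'].sum()

# Select all parties of one state via the first index level.
votes_in_a = votes.loc['A']

# Pivot the party level into columns; missing combinations become NaN.
wide = votes.unstack('party')

print(votes_in_a)
print(wide)
```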

## For each state and each party, find the area where the party received the most first votes

Recall that calling `max()` after a `groupby` returns the maximum **value** in each group, not the row that contains that value.
To recover the row, we use `idxmax()`, which returns the index label of the maximum, and pass the result to `loc`.

In [None]:
german_party.loc[german_party.groupby(['state', 'party'])['votes_first_vote'].idxmax()]
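The difference between `max()` and `idxmax()` is easiest to see on a tiny example. This is a sketch on invented data; the names `toy`, `grouped`, `max_values`, `max_labels` and `best_rows` are hypothetical.

```python
import pandas as pd

# Toy data: column names mirror the dataset, values are invented.
toy = pd.DataFrame({
    'state': ['A', 'A', 'B', 'B'],
    'party': ['X', 'X', 'X', 'X'],
    'area_name': ['a1', 'a2', 'b1', 'b2'],
    'votes_first_vote': [10, 30, 25, 5],
})

grouped = toy.groupby(['state', 'party'])['votes_first_vote']

max_values = grouped.max()       # largest value per group, nothing else
max_labels = grouped.idxmax()    # index label of the row holding that value
best_rows = toy.loc[max_labels]  # full rows, including area_name

print(max_values)
print(best_rows)
```

`max_values` carries only the numbers, while `best_rows` keeps every column of the winning rows, which is what the exercise asks for.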

As a manual check, let's list every row for `Christlich.Demokratische.Union.Deutschlands` in Baden-Württemberg and see whether the maximum number of first votes is 7750.

In [None]:
german_party[(german_party['state'] == 'Baden-Württemberg') &
             (german_party['party'] == 'Christlich.Demokratische.Union.Deutschlands')]

## For each party, find the areas where the party received the most and the fewest first votes, as a percentage of the registered voters in the state

To compute the required column, we can look up each row's state in the `registered_voters` Series computed earlier, using `apply` with `axis=1`.

In [None]:
german_party['percentage'] = german_party.apply(
    lambda row: row['votes_first_vote'] / registered_voters[row['state']] * 100,
    axis=1,
)
german_party.head()
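Row-wise `apply` calls a Python function once per row, which can be slow on large frames. A possible vectorized alternative is `Series.map`, which translates each state into its total in one step. The sketch below uses invented toy data and hypothetical names (`toy_party`, `toy_registered`, `by_apply`, `by_map`) and compares the two approaches.

```python
import pandas as pd

# Toy data: column names mirror the dataset, values are invented.
toy_party = pd.DataFrame({
    'state': ['A', 'A', 'B'],
    'votes_first_vote': [25, 50, 50],
})
# Stand-in for the `registered_voters` Series (state -> total).
toy_registered = pd.Series({'A': 100, 'B': 200})

# Row-wise version, one Python call per row.
by_apply = toy_party.apply(
    lambda row: row['votes_first_vote'] / toy_registered[row['state']] * 100,
    axis=1,
)

# Vectorized version: map each state to its total, then divide.
by_map = toy_party['votes_first_vote'] / toy_party['state'].map(toy_registered) * 100

print(by_apply.tolist())
print(by_map.tolist())
```

Both versions should produce the same percentages; `map` returns `NaN` for a state missing from the lookup Series, whereas the `apply` version would raise a `KeyError`.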

Now we can use `idxmin` and `idxmax` to find, for each party, the rows with the lowest and highest percentage:

In [None]:
least_perc_index = german_party.groupby('party')['percentage'].idxmin()
german_party.loc[least_perc_index, ['party', 'state', 'area_name', 'percentage']]

In [None]:
max_perc_index = german_party.groupby('party')['percentage'].idxmax()
german_party.loc[max_perc_index, ['party', 'state', 'area_name', 'percentage']]

## For each area, compute the difference between the valid first votes and the valid second votes

In [None]:
german_party['difference'] = german_party['votes_first_vote'] - german_party['votes_second_vote']
# Select the column before aggregating so that only it is summed
german_party.groupby('area_name')['difference'].sum()

## For each state, compute the difference between the valid first votes and the valid second votes

In [None]:
german_party.groupby('state')['difference'].sum()

## For each party, compute the difference between the valid first votes and the valid second votes

In [None]:
german_party.groupby('party')['difference'].sum()

## For each area and each party, compute the difference between the valid first votes and the valid second votes

In [None]:
german_party.groupby(['area_name', 'party'])['difference'].sum()
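The last four exercises share one pattern: compute a row-level `difference` column once, then sum it under different groupings. The sketch below reproduces that pattern on invented toy data (the names `toy`, `per_area` and `per_area_party` are hypothetical). Selecting the column before calling `sum()` restricts the aggregation to that column.

```python
import pandas as pd

# Toy data: column names mirror the dataset, values are invented.
toy = pd.DataFrame({
    'area_name': ['a1', 'a1', 'a2'],
    'party': ['X', 'Y', 'X'],
    'votes_first_vote': [10, 4, 8],
    'votes_second_vote': [7, 6, 8],
})

# Row-level difference, computed once.
toy['difference'] = toy['votes_first_vote'] - toy['votes_second_vote']

# Same column, two different groupings.
per_area = toy.groupby('area_name')['difference'].sum()
per_area_party = toy.groupby(['area_name', 'party'])['difference'].sum()

print(per_area)
print(per_area_party)
```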