# Python 101
## Part X.
---
## Dataframes and visualization
### Act I: Use the pandas, Luke!

<img src="pics/pandas.png" align="left">
<br style="clear:left;"/>

In [None]:
import pandas as pd

#### Part I. Basic pandas operations
- read csv data into a pandas dataframe

In [None]:
data = pd.read_csv('./data/vote2022.csv')

- show the first 5 rows

In [None]:
data.head()

- get a dataframe's column names

In [None]:
data.columns

- select a subset of columns

In [None]:
data[['name', 'votes']].head()

- drop a column by selecting all the others

In [None]:
cols_i_want = [col for col in data.columns if col != 'winner']
cols_i_want

In [None]:
data[cols_i_want].head()

__Caution!__ `data[cols]` returns a new dataframe and leaves `data` itself unchanged!  
Use `data = data[cols]` if you want to keep working on the subset.

In [None]:
data.head()

In [None]:
data = data[cols_i_want]
data.head()
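
To see why the reassignment matters, here is a minimal sketch on a toy dataframe. The names `toy` and `subset` and all the values are made up for illustration; they are not part of the election data.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    'name': ['Anna', 'Bela'],
    'votes': [120, 80],
    'winner': [True, False],
})

# Column selection returns a new dataframe ...
subset = toy[['name', 'votes']]

# ... so the original still has all three columns
print(list(toy.columns))     # ['name', 'votes', 'winner']
print(list(subset.columns))  # ['name', 'votes']
```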

- use aggregation functions  
_How many people voted?_

In [None]:
data['votes'].sum()

- group values to get more insight  
_Let's get the sum of the votes for each party!_

In [None]:
data[['party', 'votes']].groupby('party').sum().head(10)

- replacing values  
_How about renaming parties to a shorter name?_

In [None]:
party_mapping = {
    ('FIDESZ - MAGYAR POLGÁRI SZÖVETSÉG'
     '-KERESZTÉNYDEMOKRATA NÉPPÁRT'): 'FIDESZ-KDNP',

    ('DEMOKRATIKUS KOALÍCIÓ'
     '-JOBBIK MAGYARORSZÁGÉRT MOZGALOM'
     '-MOMENTUM MOZGALOM'
     '-MAGYAR SZOCIALISTA PÁRT'
     '-LMP - MAGYARORSZÁG ZÖLD PÁRTJA'
     '-PÁRBESZÉD MAGYARORSZÁGÉRT PÁRT'): 'OSSZEFOGAS',
     
    ('MAGYAR MUNKÁSPÁRT'
     '-IGEN SZOLIDARITÁS MAGYARORSZÁGÉRT MOZGALOM'): 'MUNKÁSPÁRT'
}

data['party'] = data['party'].replace(party_mapping)
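
A quick sketch of how a dictionary-based replacement behaves, using a toy Series (the party names here are hypothetical). It also contrasts `.replace()` with `.map()`, which is a common alternative but treats unmapped values differently.

```python
import pandas as pd

# Toy party names, invented for illustration only
parties = pd.Series(['LONG PARTY NAME', 'OTHER', 'LONG PARTY NAME'])
mapping = {'LONG PARTY NAME': 'LPN'}

# .replace() swaps only the values found in the mapping
replaced = parties.replace(mapping)
print(replaced.tolist())  # ['LPN', 'OTHER', 'LPN']

# .map() turns values missing from the mapping into NaN
mapped = parties.map(mapping)
print(mapped.isna().tolist())  # [False, True, False]
```

With a partial mapping like `party_mapping`, `.replace()` is the safer choice because the remaining party names stay as they are.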

- ordering dataframes  
_Order results by the number of votes!_

In [None]:
party_votes = (
    data
    [['party', 'votes']]
    .groupby('party')
    .sum()
    .sort_values('votes', ascending=False)
)
party_votes.head()

In [None]:
len(data['party'].unique())

In [None]:
data.party.nunique()

- create a new column

In [None]:
data['opposition'] = data['party'] != 'FIDESZ-KDNP'
data.head()

#### Part II. Plotting results

Use a Jupyter "magic" command to draw the plots inside the notebook.  
Also load plotting libraries `matplotlib` and `seaborn`.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

- simple barplot

In [None]:
party_votes.plot(kind='bar');

- filtering and plotting  
_Only plot parties with more than 10000 votes!_

We can filter a dataframe with `dataframe.loc[condition]`, where `condition` is a boolean Series built from a logical expression on one or more columns.

In [None]:
condition = party_votes['votes'] > 10000
condition.head(15)

In [None]:
vote10k = party_votes.loc[condition]
vote10k.plot(kind='bar');
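
Conditions can also be combined. The sketch below uses a toy dataframe (`toy_votes` and its values are invented for illustration) to show two comparisons joined with `&`.

```python
import pandas as pd

# Toy vote totals, invented for illustration only
toy_votes = pd.DataFrame(
    {'votes': [50000, 12000, 9000, 300]},
    index=['A', 'B', 'C', 'D'],
)

# Wrap each comparison in parentheses and combine with & (and) or | (or)
both = (toy_votes['votes'] > 1000) & (toy_votes['votes'] < 20000)
mid_sized = toy_votes.loc[both]
print(mid_sized.index.tolist())  # ['B', 'C']
```

The parentheses are needed because `&` and `|` bind more tightly than the comparison operators in Python.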

_Plot the top 6 parties!_

In [None]:
top6 = party_votes.head(6)
top6.plot(kind='bar');

---
### Act III: The devil lies in the details!

<img src="pics/evil-panda.png" width=300 height=300 align="left">
<br style="clear:left;"/>

- Nested grouping operations  
_Consider the regional data too._

In [None]:
regional = (
    data
    [['party', 'region', 'votes']]
    .groupby(['region', 'party'])
    .sum()
)
regional.head(10)

_Let's keep only the ones with more than 5000 votes!_

In [None]:
regional5k = regional.loc[regional.votes > 5000]
regional5k.head(10)

- Pivot  
To pivot this dataframe, we first need to flatten the nested index:

In [None]:
regional5k.reset_index().head()

Now we can pivot this flattened dataframe:

In [None]:
(
    regional5k
    .reset_index()
    .pivot(index='region', columns='party', values='votes')
)

Plot the results:

In [None]:
(
    regional5k
    .reset_index()
    .pivot(index='region', columns='party', values='votes')
    .plot(kind='barh', figsize=(12, 15))
);
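
If the pivot step feels abstract, here is a minimal sketch on toy data (`long_df`, `wide` and all values are made up for illustration). It shows the reshaping from one row per (region, party) pair to one row per region.

```python
import pandas as pd

# Toy long-format data, invented for illustration only
long_df = pd.DataFrame({
    'region': ['North', 'North', 'South'],
    'party': ['A', 'B', 'A'],
    'votes': [10, 20, 30],
})

# One row per region, one column per party
wide = long_df.pivot(index='region', columns='party', values='votes')
print(wide)

# The (South, B) pair is missing from the input, so that cell is NaN
print(pd.isna(wide.loc['South', 'B']))  # True
```

The same thing happens in the plot above: region and party pairs that were filtered out earlier end up as NaN and are drawn as missing bars.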

---

## Let's do some...

<img align="left" width=150 src="pics/magic.gif">
<br style="clear:left;"/>

### Act IV: Cool library of the week: <a href="https://docs.python.org/2/library/collections.html">python's collections lib</a>

#### High-performance container datatypes
There are five awesome data types implemented in this library. My two favourites are:
- Counter

In [None]:
from collections import Counter

# use a new name here, so the `data` dataframe stays available for the exercises below
tokens = ("Please, listen to me. The archbishop Lazarus, "
          "he led us down here to find the lost prince. "
          "The bastard led us into a trap! Now everyone is dead... "
          "Killed by a demon he called The Butcher. Avenge us! "
          "Find this butcher and slay him so that "
          "our souls may finally rest...").lower().split()

wordcount = Counter(tokens)
wordcount.most_common(10)

- defaultdict

In [None]:
from collections import defaultdict

# group the words by their first letter
words = defaultdict(set)

for word in tokens:
    words[word[0]].add(word)

words
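
What makes both types handy is how they deal with missing keys. A small sketch on a toy word list (the list `sample` is invented for illustration):

```python
from collections import Counter, defaultdict

# Toy word list, invented for illustration only
sample = ['the', 'butcher', 'the', 'trap', 'butcher', 'the']

# Counter: a missing key counts as zero instead of raising KeyError
counts = Counter(sample)
print(counts['the'])    # 3
print(counts['demon'])  # 0

# defaultdict: a missing key gets a fresh empty set on first access
by_letter = defaultdict(set)
for word in sample:
    by_letter[word[0]].add(word)

print(sorted(by_letter['t']))  # ['the', 'trap']
print(sorted(by_letter['b']))  # ['butcher']
```

With a plain `dict` you would have to check for the key (or use `setdefault`) before every update.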

---
## Final Act: The pandas is strong with this one!

<img src="pics/darth_panda.jpg" align="left"/>

<br style="clear:left;" />

    
## It's your turn - write the missing code snippets!

#### 1. Plot the number of voters in each region!

#### 2. Who would win if Fidesz didn't participate in the election?

Hint: You can build filters from equality checks and negate them with `~`, e.g. `~(data['party'] == 'FIDESZ-KDNP')`.
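
A minimal sketch of the negation trick from the hint, on toy data (the dataframe `toy` and its party labels are hypothetical):

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    'party': ['A', 'B', 'A', 'C'],
    'votes': [10, 20, 30, 40],
})

# ~ flips a boolean mask; here it is the same as toy['party'] != 'A'
not_a = toy.loc[~(toy['party'] == 'A')]
print(not_a['party'].tolist())  # ['B', 'C']
```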

#### 3. Who would win by regions if Fidesz didn't participate in the election?

#### 4. Who were the most successful candidates? (top10)

#### 5. List the number of subregions each party participated in!
Hint: the groupby aggregation `.count()` returns the number of non-null items in each group.
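
Here is how `.count()` behaves on a toy dataframe. The column name `subregion` is an assumption made for this sketch; check `data.columns` for the real name in the dataset.

```python
import pandas as pd

# Toy data, invented for illustration only ('subregion' is a hypothetical column name)
toy = pd.DataFrame({
    'party': ['A', 'A', 'B'],
    'subregion': ['s1', 's2', 's1'],
})

# number of rows with a non-null subregion for each party
rows_per_party = toy.groupby('party')['subregion'].count()
print(rows_per_party.to_dict())  # {'A': 2, 'B': 1}
```

If a party could appear more than once in the same subregion, `.nunique()` would count the distinct subregions instead.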

#### 6. How many wins could the opposition get in case of perfect cooperation?

Solution steps:
1. Create a new boolean column marking whether a row belongs to the opposition (i.e. not Fidesz)
2. Group by region and the column from step 1 and sum up the votes
3. Pivot the result (regions as rows, opposition / Fidesz as columns) and save it to a variable
4. Create a new column in the pivoted dataframe to store if the opposition got more votes than Fidesz
5. Sum up this column to get the number of regions where the "perfect opposition" wins
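
The steps above can be sketched on toy data as follows. All values, and the column labels `fidesz` and `opposition_wins`, are invented for this sketch; applying it to the real `data` is still up to you.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    'region': ['R1', 'R1', 'R1', 'R2', 'R2'],
    'party': ['FIDESZ-KDNP', 'X', 'Y', 'FIDESZ-KDNP', 'X'],
    'votes': [50, 30, 30, 70, 40],
})

# Step 1: flag the opposition rows
toy['opposition'] = toy['party'] != 'FIDESZ-KDNP'

# Steps 2-3: sum the votes per region and flag, then pivot
pivoted = (
    toy
    .groupby(['region', 'opposition'])['votes']
    .sum()
    .reset_index()
    .pivot(index='region', columns='opposition', values='votes')
    # the pivoted columns are labelled False / True, give them readable names
    .rename(columns={False: 'fidesz', True: 'opposition'})
)

# Step 4: did the united opposition get more votes in the region?
pivoted['opposition_wins'] = pivoted['opposition'] > pivoted['fidesz']

# Step 5: True counts as 1 when summed
print(pivoted['opposition_wins'].sum())  # 1
```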

#### 7. List the most successful regions for each party!

Solution steps:
1. Group the data by region and party and sum up the votes
2. Sort the result by votes
3. Reset the index
4. Group by party and select the first row with `.first()`
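
The steps above can be sketched on toy data like this. The dataframe `toy` and its values are made up for illustration; the sort is assumed to be descending so that `.first()` picks the largest vote count.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    'region': ['R1', 'R1', 'R2', 'R2'],
    'party': ['A', 'B', 'A', 'B'],
    'votes': [10, 40, 30, 20],
})

best_region = (
    toy
    .groupby(['region', 'party'])['votes']
    .sum()                          # step 1: votes per region and party
    .sort_values(ascending=False)   # step 2: biggest results first
    .reset_index()                  # step 3: back to flat columns
    .groupby('party')
    .first()                        # step 4: top row for each party
)
print(best_region['region'].to_dict())  # {'A': 'R2', 'B': 'R1'}
```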