# Python 101
## Part X.
---
## Dataframes and visualization
### Act I: Use the pandas, Luke!

<img src="pics/pandas.png" align="left">
<br style="clear:left;"/>

In [None]:
import pandas as pd

#### Part I. Basic pandas operations
- read csv data into a pandas dataframe

In [None]:
data = pd.read_csv('./data/vote2022.csv')

- show the first 5 rows

In [None]:
data.head()

- get a dataframe's column names

In [None]:
data.columns

- select a subset of columns

In [None]:
data[['name', 'votes']].head()

- filter out columns by building the list of columns to keep

In [None]:
cols_i_want = [col for col in data.columns if not col == 'winner']
cols_i_want

In [None]:
data[cols_i_want].head()

__Caution!__ `data[cols]` returns a new dataframe and leaves `data` itself unchanged!  
Use `data = data[cols]` if you want to keep working on the subset.
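
A minimal sketch of this behaviour on a small made-up dataframe (the name `toy` and its values are illustrative assumptions, not part of the vote data): the original keeps all of its columns until the name is reassigned.

```python
import pandas as pd

# Hypothetical toy data, only for illustration
toy = pd.DataFrame({
    'name': ['A', 'B'],
    'votes': [10, 20],
    'winner': [False, True],
})

subset = toy[['name', 'votes']]   # a separate dataframe with two columns
print(list(subset.columns))       # ['name', 'votes']
print(list(toy.columns))          # ['name', 'votes', 'winner']

toy = toy[['name', 'votes']]      # reassign the name to drop 'winner'
print(list(toy.columns))          # ['name', 'votes']
```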

In [None]:
data.head()

In [None]:
data = data[cols_i_want]
data.head()

- use aggregation functions  
_How many people voted?_

In [None]:
data['votes'].sum()

- group values to get more insight  
_Let's get the sum of the votes for each party!_

In [None]:
data[['party', 'votes']].groupby('party').sum().head(10)
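
To see what the grouping does, here is a small sketch on made-up data (the `toy` dataframe and its party names are assumptions for illustration). It selects the column after grouping, which gives a Series instead of a dataframe.

```python
import pandas as pd

# Hypothetical toy data, only for illustration
toy = pd.DataFrame({
    'party': ['X', 'Y', 'X', 'Y', 'Z'],
    'votes': [100, 50, 30, 20, 5],
})

# One entry per party, with the votes summed within each group
totals = toy.groupby('party')['votes'].sum()
print(totals)
```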

- replacing values  
_How about renaming parties to a shorter name?_

In [None]:
party_mapping = {
    ('FIDESZ - MAGYAR POLGÁRI SZÖVETSÉG'
     '-KERESZTÉNYDEMOKRATA NÉPPÁRT'): 'FIDESZ-KDNP',

    ('DEMOKRATIKUS KOALÍCIÓ'
     '-JOBBIK MAGYARORSZÁGÉRT MOZGALOM'
     '-MOMENTUM MOZGALOM'
     '-MAGYAR SZOCIALISTA PÁRT'
     '-LMP - MAGYARORSZÁG ZÖLD PÁRTJA'
     '-PÁRBESZÉD MAGYARORSZÁGÉRT PÁRT'): 'OSSZEFOGAS',
     
    ('MAGYAR MUNKÁSPÁRT'
     '-IGEN SZOLIDARITÁS MAGYARORSZÁGÉRT MOZGALOM'): 'MUNKÁSPÁRT'
}

data['party'] = data['party'].replace(party_mapping)
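
The keys above rely on a Python feature: adjacent string literals inside parentheses are joined into one string. A minimal sketch with made-up party names (all names and values here are illustrative assumptions) shows this and how `replace` treats values that are missing from the mapping.

```python
import pandas as pd

# Adjacent string literals inside parentheses are joined into one string
long_name = ('FIRST PARTY'
             '-SECOND PARTY')
print(long_name)  # FIRST PARTY-SECOND PARTY

# Hypothetical toy data and mapping, only for illustration
toy = pd.Series(['FIRST PARTY-SECOND PARTY', 'THIRD PARTY'])
mapping = {long_name: 'COALITION'}

# Values found in the mapping are replaced, the rest stay as they are
renamed = toy.replace(mapping)
print(renamed.tolist())  # ['COALITION', 'THIRD PARTY']
```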

- ordering dataframes  
_Order results by the number of votes!_

In [None]:
party_votes = (
    data
    [['party', 'votes']]
    .groupby('party')
    .sum()
    .sort_values('votes', ascending=False)
)
party_votes.head()

In [None]:
len(data['party'].unique())

In [None]:
data.party.nunique()

- create new column

In [None]:
data['opposition'] = data['party'] != 'FIDESZ-KDNP'
data.head()

#### Part II. Plotting results

Use a Jupyter "magic" command to draw the plots inside the notebook.  
Also load plotting libraries `matplotlib` and `seaborn`.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

- simple barplot

In [None]:
party_votes.plot(kind='bar');

- filtering and plotting  
_Only plot parties with more than 10000 votes!_

We can filter dataframes with `dataframe.loc[condition]`, where `condition` is a boolean expression on one or more columns.

In [None]:
condition = party_votes['votes'] > 10000
condition.head(15)

In [None]:
vote10k = party_votes.loc[condition]
vote10k.plot(kind='bar');
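
A short sketch of how such a condition works, on made-up numbers (the `toy` dataframe is an assumption for illustration). The condition is a boolean Series, and several conditions can be combined with `&`, `|` and `~` as long as each one is wrapped in parentheses.

```python
import pandas as pd

# Hypothetical toy data, only for illustration
toy = pd.DataFrame(
    {'votes': [15000, 800, 12000]},
    index=['X', 'Y', 'Z'],
)

mask = toy['votes'] > 10000   # boolean Series aligned with toy's index
print(mask.tolist())          # [True, False, True]

# Combine conditions with & (and), | (or), ~ (not)
big_but_not_x = toy.loc[(toy['votes'] > 10000) & (toy.index != 'X')]
print(big_but_not_x.index.tolist())  # ['Z']
```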

_Plot the top 6 parties!_

In [None]:
top6 = party_votes.head(6)
top6.plot(kind='bar');

---
### Act II: The devil lies in the details!

<img src="pics/evil-panda.png" width=300 height=300 align="left">
<br style="clear:left;"/>

- Nested grouping operations  
_Consider the regional data too._

In [None]:
regional = (
    data
    [['party', 'region', 'votes']]
    .groupby(['region', 'party'])
    .sum()
)
regional.head(10)

_Let's keep only the rows with more than 5000 votes!_

In [None]:
regional5k = regional.loc[regional.votes > 5000]
regional5k.head(10)

- Pivot  
To pivot this dataframe, we first need to remove the nested index:

In [None]:
regional5k.reset_index().head()

Now we can pivot this flattened dataframe:

In [None]:
(
    regional5k
    .reset_index()
    .pivot(index='region', columns='party', values='votes')
)
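
A minimal sketch of the reshaping on made-up data (region names, party names and numbers are assumptions for illustration): each `region` becomes a row, each `party` a column, and combinations that are missing from the long table show up as `NaN`.

```python
import pandas as pd

# Hypothetical toy data in "long" format, only for illustration
long_format = pd.DataFrame({
    'region': ['North', 'North', 'South'],
    'party': ['X', 'Y', 'X'],
    'votes': [7000, 6000, 9000],
})

wide = long_format.pivot(index='region', columns='party', values='votes')
print(wide)

# South has no row for party Y, so that cell is NaN
print(wide.loc['South', 'Y'])  # nan
```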

Plot the results:

In [None]:
(
    regional5k
    .reset_index()
    .pivot(index='region', columns='party', values='votes')
    .plot(kind='barh', figsize=(12, 15))
);

---

## Let's do some...

<img align="left" width=150 src="http://www.reactiongifs.com/r/mgc.gif">

<br style="clear:left;"/>

### Act III: Cool "library" of the week: caching function results
#### Speed up your computation heavy functions

Suppose you have a function that takes a long time to compute and is called many times with the same input. In this case, saving the input-output pairs in a dictionary will speed up your program considerably.

We can measure how long a computation takes with the `time` module:

In [None]:
import time

start_time = time.time()
time.sleep(5)   # wait 5 sec to simulate computation
elapsed = time.time() - start_time   # duration, not an end timestamp

print(f"Computation took {elapsed:.2f} seconds.")
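
`time.time()` follows the system clock, which can be adjusted while the program runs. For timing code, the standard library also offers `time.perf_counter()`. A minimal sketch of the same measurement, with a 0.1 second sleep as a stand-in for the real work:

```python
import time

start = time.perf_counter()
time.sleep(0.1)                      # stand-in for the real computation
elapsed = time.perf_counter() - start

print(f"Computation took {elapsed:.2f} seconds.")
```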

Or use a magic command that is only available in Jupyter notebooks:
```python
%time command
```
to measure a single-line expression, and
```python
%%time

command1()
command2()
command3()
```
to measure the whole cell.

In [None]:
%%time

time.sleep(5)

This measures a single execution, which is often misleading because run times vary from run to run. We can do better: let's measure multiple executions and get the average execution time with:
```python
%timeit command
```
and 
```python
%%timeit

command1()
command2()
command3()
```
commands.

In [None]:
%%timeit

time.sleep(5)

Now that we can measure computation time, let's speed up a lengthy function with caching.
We will use the [`lru_cache`](https://docs.python.org/3/library/functools.html#functools.lru_cache) decorator from `functools`, a module in Python's standard library.

First, let's create and measure our dummy function:

In [None]:
def lengthy_function(n):
    time.sleep(n)
    return n

In [None]:
%time lengthy_function(5)

In [None]:
%time lengthy_function(5)

Now let's create the same function, but cached this time.

In [None]:
from functools import lru_cache

@lru_cache()
def cached_lengthy_function(n):
    time.sleep(n)
    return n

The first run takes the same amount of time, but its result is now cached:

In [None]:
%time cached_lengthy_function(5)

From now on, calls with the same argument return almost instantly:

In [None]:
%time cached_lengthy_function(5)
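
To check that the cache is really being used, a function decorated with `lru_cache` exposes `cache_info()` and `cache_clear()`. A small sketch follows, where `cached_sleep` is a hypothetical stand-in with short sleeps. Note that the arguments of a cached function must be hashable, so lists or dataframes cannot be passed directly.

```python
import time
from functools import lru_cache

@lru_cache(maxsize=128)   # 128 is also the default size
def cached_sleep(n):
    time.sleep(n)
    return n

cached_sleep(0.1)         # miss: computed and stored
cached_sleep(0.1)         # hit: returned from the cache
cached_sleep(0.2)         # miss: different argument

info = cached_sleep.cache_info()
print(info)  # CacheInfo(hits=1, misses=2, maxsize=128, currsize=2)

cached_sleep.cache_clear()                 # empty the cache
print(cached_sleep.cache_info().currsize)  # 0
```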

---
## Final Act: The pandas is strong with this one!

<img src="pics/darth_panda.jpg" align="left"/>

<br style="clear:left;" />

    
## It's your turn - write the missing code snippets!

#### 1.  Plot the number of voters in each region!

#### 2. Who would win if Fidesz didn't participate in the election?

Hint: You can negate an equality filter with `~`, e.g. `~(data['party'] == 'FIDESZ-KDNP')`.

#### 3. Who would win in each region if Fidesz didn't participate in the election?

#### 4. Who were the most successful candidates? (top 10)

#### 5. List the number of subregions each party participated in!
Hint: groupby aggregation function `.count()` returns the number of items in a group.

#### 6. How many wins could the opposition get in case of perfect cooperation?

Solution steps:
1. Create a new boolean column that marks whether a row belongs to the opposition (i.e. not Fidesz)
2. Group by region and the column from step 1 and sum up the votes
3. Save the pivoted (regions are the records and opposition / Fidesz are the columns) dataframe to a variable
4. Create a new column in the pivoted dataframe to store if the opposition got more votes than Fidesz
5. Sum up this column to get the number of regions where the "perfect opposition" wins

#### 7. List the most successful regions for each party!

Solution steps:
1. Group the data by region and party and sum up the votes
2. Sort the result by votes in descending order
3. Reset the index
4. Group by party and select the first row with `.first()`