# Python 101
## Part IX.
---
## Dataframes and visualization

### Act I: Get the data!

<img src="http://www.london24.com/polopoly_fs/1.3024317.1385128334!/image/4183113330.jpg_gen/derivatives/landscape_630/4183113330.jpg" width="360" align="left"></img>
<br style="clear:left;"/>

Scrape the 2014 Hungarian election results!
- import required libraries

In [None]:
import requests
from bs4 import BeautifulSoup

- set up basic URIs

In [None]:
VOTE_BASE = 'http://valasztas.hu/dyn/pv14/szavossz/hu/'
OVERALL = 'oevker.html'
BASE_URI = './data/'

- download document

In [None]:
vote_response = requests.get(VOTE_BASE + OVERALL)
print(vote_response.status_code)  # 200 means the download succeeded

- extract data with BeautifulSoup

In [None]:
vote_soup = BeautifulSoup(vote_response.content, "lxml")
containers = vote_soup.find('table', {'border': '1'}).findAll('tr')
print(len(containers))
containers[:5]

- get the items out of the table rows

In [None]:
rows = [row.findAll('td') for row in containers]
rows[:5]

- we've got an unneeded first row, remove it!

In [None]:
rows = rows[1:]
rows[:5]

- "transform" the data into a table-like format

In [None]:
for row in rows[:5]:
    print([r.getText() for r in row])

- for our analysis, we need the region, the subregion and the links

In [None]:
REGIONS = []
for row in rows:
    REGIONS.append([row[0].getText(), row[2].getText(), row[1].find('a').get('href')])
REGIONS[:5]
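
The row-by-row extraction above can be tried offline on a small inline table. This is only a minimal sketch: the HTML snippet, the link paths and the names `SAMPLE_HTML` and `sample_regions` are made up for illustration, and `html.parser` is used instead of `lxml` just so that no extra parser is required. It also assumes Python 3 for the `urljoin` import.

```python
from bs4 import BeautifulSoup
from urllib.parse import urljoin  # Python 3 location of urljoin

# Hypothetical stand-in for the downloaded page (not the real site content)
SAMPLE_HTML = u"""
<table border="1">
  <tr><th>Region</th><th>Id</th><th>Subregion</th></tr>
  <tr><td>BUDAPEST</td><td><a href="M01/E01/evkjkv.html">01</a></td><td>Budapest I.</td></tr>
  <tr><td>BARANYA</td><td><a href="M02/E01/evkjkv.html">01</a></td><td>Pecs</td></tr>
</table>
"""
BASE = 'http://valasztas.hu/dyn/pv14/szavossz/hu/'

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
table_rows = soup.find('table', {'border': '1'}).find_all('tr')

# The header row uses <th>, so it has no <td> cells: keep only rows with data
cells = [tr.find_all('td') for tr in table_rows]
cells = [row for row in cells if row]

# region text, subregion text, absolute link built from the relative href
sample_regions = [
    [row[0].get_text(), row[2].get_text(), urljoin(BASE, row[1].find('a').get('href'))]
    for row in cells
]
print(sample_regions)
```

Filtering out empty rows is an alternative to slicing off the first row with `rows[1:]`; it keeps working if the table has more than one header row.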

In [None]:
print('Number of regions: {}'.format(len(REGIONS)))

- get the detailed information for each region

In [None]:
results = []

for city, region, sub_url in REGIONS:
    print(u"Downloading and processing data for {} - {} ...".format(city, region))
    region_response = requests.get(VOTE_BASE + sub_url)
    region_soup = BeautifulSoup(region_response.content, "lxml")
    # find the table that follows the heading
    # 'A szavazatok száma jelöltenként' ("Number of votes per candidate")
    region_container = (region_soup
                        .find(text='A szavazatok száma jelöltenként')
                        .findNext('table')
                        .findAll('tr'))
    region_rows = [row.findAll('td') for row in region_container][1:] # remove empty header
    # every candidate will go to a new row
    for row in region_rows:
        results.append([city, region] + [r.getText() for r in row][:-1]) # remove the last 'tick column'
    print("Done.")

- let's look at the detailed information

In [None]:
print(results[:5])
print('-' * 79)
print('Number of candidates: {}'.format(len(results)))

- transform the items

In [None]:
cleaned_results = []

for row in results:
    cleaned_results.append(
        [item.replace(u'\xa0', u'').replace(u'%', u'').strip() # replace the unneeded characters
         for item in row]
    )
cleaned_results[:5]
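
To see what the cleaning step does to a single row, here is a small sketch on made-up values. The helper name `clean_item` and the sample row are hypothetical; only the replaced characters (`\xa0` and `%`) come from the cell above.

```python
def clean_item(item):
    """Strip non-breaking spaces, percent signs and surrounding whitespace."""
    return item.replace(u'\xa0', u'').replace(u'%', u'').strip()

# Hypothetical raw row, shaped like the scraped ones (values are made up)
raw_row = [u'BUDAPEST', u' 1 ', u'Example Candidate', u'12\xa0345', u'45.67%']
clean_row = [clean_item(item) for item in raw_row]
print(clean_row)

# Once the thousands separator is gone, the vote count converts to an integer
votes = int(clean_row[3])
print(votes)
```

The non-breaking space is the reason the raw vote counts would not parse as numbers, so removing it here is what lets pandas treat the column as numeric later.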

Now we can finally save it!

In [None]:
import codecs
header = [u'region', u'subregion', u'subid', u'name', u'party', u'votes', u'votes %']
filename = 'vote2014.csv'
# the ./data/ directory must already exist
with codecs.open(BASE_URI + filename, 'w', 'utf-8') as csv_file:
    # every field is wrapped in double quotes and separated by semicolons
    csv_file.write(u';'.join([u'"' + item + u'"' for item in header]) + u'\n')
    for row in cleaned_results:
        csv_file.write(u';'.join([u'"' + item + u'"' for item in row]) + u'\n')

# check if it is successfully created
import os
os.path.isfile(BASE_URI + filename)
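
Joining the fields by hand works as long as no field contains a double quote. As an alternative (not what the cell above does), the standard library `csv` module can produce the same semicolon-separated, fully quoted layout and takes care of escaping. The sketch below assumes Python 3 and writes to an in-memory buffer; the rows are made up.

```python
import csv
import io

header = ['region', 'name', 'votes']
# Hypothetical rows; the second name contains a quote and a semicolon on purpose
rows = [
    ['BUDAPEST', 'Example Candidate', '12345'],
    ['BARANYA', 'Tricky "Nick"; Name', '678'],
]

buffer = io.StringIO()  # in-memory stand-in for a real file
writer = csv.writer(buffer, delimiter=';', quoting=csv.QUOTE_ALL, lineterminator='\n')
writer.writerow(header)
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)

# Reading it back returns the original fields, embedded quote included
read_back = list(csv.reader(io.StringIO(csv_text), delimiter=';'))
```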

---
### Act II: Use the pandas, Luke!

<img src="pics/pandas.png" align="left">
<br style="clear:left;"/>

In [None]:
import pandas as pd

#### Part I. Basic pandas operations
- read csv data into a pandas dataframe

In [None]:
data = pd.read_csv(BASE_URI + filename, quotechar='"', delimiter=';', encoding='utf-8')
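
The same `read_csv` options can be tried without the scraped file. In this sketch the CSV text is made up and read from an in-memory buffer; it only mimics the layout of `vote2014.csv`.

```python
import io
import pandas as pd

# Hypothetical file content in the same format as vote2014.csv (values are made up)
sample_csv = u'''"region";"name";"party";"votes";"votes %"
"BUDAPEST";"Candidate A";"PARTY X";"12345";"45.6"
"BUDAPEST";"Candidate B";"PARTY Y";"6789";"25.1"
'''

sample = pd.read_csv(io.StringIO(sample_csv), quotechar='"', delimiter=';')
print(sample.dtypes)          # quoted numbers are still parsed as numeric columns
print(sample['votes'].sum())
```

Checking `dtypes` right after loading is a cheap way to catch a column that stayed text, for example because a stray character survived the cleaning step.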

- show the first 5 rows

In [None]:
data.head()

- get a dataframe's column names

In [None]:
data.columns

- select a subset of columns

In [None]:
data[['name', 'votes']].head()

- filter columns by building the list of column names you want to keep

In [None]:
cols_i_want = [col for col in data.columns if col != 'subid']
cols_i_want

In [None]:
data[cols_i_want].head()

__Caution!__ `data[cols]` returns a new dataframe and leaves `data` itself unchanged!  
Use `data = data[cols]` if you want to keep working on the subset.

In [None]:
data.head()

In [None]:
data = data[cols_i_want]
data.head()

- use aggregation functions  
_How many people voted?_

In [None]:
data['votes'].sum()

- group values to get more insight  
_Let's get the sum of the votes for each party!_

In [None]:
data[['party', 'votes']].groupby('party').sum().head(10)

- ordering dataframes  
_Order results by the number of votes!_

In [None]:
party_votes = (
    data
    [['party', 'votes']]
    .groupby('party')
    .sum()
    .sort_values('votes', ascending=False)
)
party_votes.head()

In [None]:
len(data['party'].unique())
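
The group, sum and sort chain is easier to follow on a frame small enough to check by hand. The sketch below uses made-up party names and vote counts; `toy` and `toy_totals` are hypothetical names.

```python
import pandas as pd

# Hypothetical toy data (party names and vote counts are made up)
toy = pd.DataFrame({
    'party': ['A', 'B', 'A', 'C', 'B'],
    'votes': [100, 250, 50, 30, 20],
})

toy_totals = (
    toy
    .groupby('party')                        # one group per party
    .sum()                                   # add up the votes inside each group
    .sort_values('votes', ascending=False)   # biggest party first
)
print(toy_totals)

# Number of distinct parties; same as len(toy['party'].unique()) when there are no missing values
n_parties = toy['party'].nunique()
```

After `groupby(...).sum()` the party names are the index of the result, not a regular column, which is why `sort_values` only needs the `'votes'` column name.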

#### Part II. Plotting results

Use a Jupyter "magic" function to draw the plots into the notebook.  
Also load plotting libraries `matplotlib` and `seaborn`.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
try:
    import seaborn as sns
except ImportError:
    # seaborn is not installed: install it, then re-run this cell
    !conda install --yes seaborn

- simple barplot

In [None]:
party_votes.plot(kind='bar')

- filtering and plotting  
_Only plot parties with at least 10000 votes!_

We can filter dataframes with `dataframe[condition]`, where `condition` is a boolean Series produced by a logical expression on one or more columns.

In [None]:
condition = party_votes['votes'] > 10000
condition.head(15)

In [None]:
vote10k = party_votes.loc[condition]
vote10k.plot(kind='bar')
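
Boolean filtering can be checked on a tiny frame before trusting it on the real data. This is a sketch with made-up totals; it also shows how two conditions might be combined, which the cells above do not need.

```python
import pandas as pd

# Hypothetical party totals (values are made up)
toy_totals = pd.DataFrame(
    {'votes': [270, 150, 30]},
    index=pd.Index(['B', 'A', 'C'], name='party'),
)

mask = toy_totals['votes'] > 100   # boolean Series, one True/False per row
big = toy_totals.loc[mask]         # keeps only the rows where the mask is True

# Conditions combine with & (and) or | (or); each one needs its own parentheses
middle = toy_totals.loc[(toy_totals['votes'] > 100) & (toy_totals['votes'] < 200)]
print(big)
print(middle)
```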

_Plot the top 6 parties!_

In [None]:
top6 = party_votes.head(6)
top6.plot(kind='bar')

---
### Act III: The devil lies in the details!

<img src="pics/evil-panda.png" width=300 height=300 align="left">
<br style="clear:left;"/>

- Nested grouping operations  
_Consider the regional data too._

In [None]:
regional = (
    data
    [['party', 'region', 'votes']]
    .groupby(['region', 'party'])
    .sum()
)
regional.head(10)
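
Grouping by two columns produces a nested (multi-level) index. The sketch below, on made-up data, shows what that index looks like and how a single value can be looked up in it.

```python
import pandas as pd

# Hypothetical toy data (regions, parties and votes are made up)
toy = pd.DataFrame({
    'region': ['North', 'North', 'North', 'South', 'South'],
    'party':  ['A', 'B', 'A', 'A', 'C'],
    'votes':  [100, 250, 50, 30, 20],
})

toy_regional = toy.groupby(['region', 'party']).sum()
print(toy_regional)

# Each row is addressed by a (region, party) tuple
north_a = toy_regional.loc[('North', 'A'), 'votes']
print(north_a)
```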

_Let's only have the ones with more than 5000 votes!_

In [None]:
regional5k = regional.loc[regional.votes > 5000]
regional5k.head(10)

- Pivot  
To pivot this dataframe, we first need to remove the nested index:

In [None]:
regional5k.reset_index().head()

Now we can pivot this flattened dataframe:

In [None]:
(regional5k
 .reset_index()
 .pivot(index='region', columns='party', values='votes'))
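
A toy version of the pivot makes the reshaping visible: every region becomes a row, every party a column, and combinations that are absent from the data come out as `NaN`. The frame below is made up for illustration.

```python
import pandas as pd

# Hypothetical grouped result with a (region, party) index (values are made up)
toy_regional = pd.DataFrame({
    'region': ['North', 'North', 'South', 'South'],
    'party':  ['A', 'B', 'A', 'C'],
    'votes':  [150, 250, 30, 20],
}).set_index(['region', 'party'])

toy_wide = (
    toy_regional
    .reset_index()   # turn the two index levels back into ordinary columns
    .pivot(index='region', columns='party', values='votes')
)
print(toy_wide)
```

If the gaps should count as zero votes, one option is to append `.fillna(0)` to the chain.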

Set the resulting figure size:

In [None]:
plt.rcParams['figure.figsize'] = 8, 6

Plot the results:

In [None]:
(
    regional5k
    .reset_index()
    .pivot(index='region', columns='party', values='votes')
    .plot(kind='barh')
)

---

## Let's do some...

<img align="left" width=150 src="http://www.reactiongifs.com/r/mgc.gif">

<br style="clear:left;"/>

### Act IV: Cool library of the week: <a href="https://mzucker.github.io/2016/09/20/noteshrink.html">noteshrink</a>
#### Export your notes into readable pdfs!

To install:
- install pillow (in your shell execute: `conda install pillow`)
- download and unzip the library (from <a href="https://github.com/mzucker/noteshrink/archive/master.zip">here</a>)
- change line #578 in noteshrink/noteshrink.py: comment out the line
- optional: install with pip: `pip install -e noteshrink`

Then use with:  
`python noteshrink/noteshrink.py filename(s) -b output_file_prefix`  
example:

In [None]:
!python ./noteshrink/noteshrink.py noteshrink/examples/notesA1.jpg noteshrink/examples/notesA2.jpg -b example

<img src="noteshrink/examples/notesA1.jpg" width="300" align="left"><img src="example0000.png" width="300" align="left">

---
## Final Act: The pandas is strong with this one!

<img src="http://2.bp.blogspot.com/-pgK8KdMmSn8/TsFTOwrGk9I/AAAAAAAABAk/5ondVGyw6w8/s320/Darth+Panda.jpg" align="left">

<br style="clear:left;"/>

## It's your turn - write the missing code snippets!

#### 1.  Plot the number of voters in each region!

#### 2. Who would win if Fidesz didn't participate in the election?

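Hint: You can create filters based on (in)equality: `data['party'] != 'FIDESZ-KDNP'`, or equivalently `~(data['party'] == 'FIDESZ-KDNP')`. The parentheses matter, because `~` binds more tightly than `==`.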
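
A quick way to convince yourself how such a filter behaves is to try it on a toy frame. In this sketch only the party name `FIDESZ-KDNP` comes from the hint; the other names and all vote counts are made up. Since `~` is applied before `==` in Python, the comparison is wrapped in parentheses before it is negated.

```python
import pandas as pd

toy = pd.DataFrame({
    'party': ['FIDESZ-KDNP', 'PARTY X', 'PARTY Y'],  # last two names are hypothetical
    'votes': [300, 200, 100],
})

not_equal = toy['party'] != 'FIDESZ-KDNP'
negated = ~(toy['party'] == 'FIDESZ-KDNP')   # compare first, then negate

others = toy.loc[not_equal]
print(others)
```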

#### 3. Who would win in each region if Fidesz didn't participate in the election?