# Python 101
## Part IX. - Vote results scraping
---

<img src="http://www.london24.com/polopoly_fs/1.3024317.1385128334!/image/4183113330.jpg_gen/derivatives/landscape_630/4183113330.jpg" width="360" align="left"></img>
<br style="clear:left;"/>

Scrape the 2018 Hungarian election results!
- import required libraries

In [None]:
import requests
from bs4 import BeautifulSoup

- set up basic URIs

In [None]:
VOTE_BASE = 'http://valasztas.hu/dyn/pv18/szavossz/hu/'
OVERALL = 'oevker.html'
BASE_URI = './data/'
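
The later cells build request URLs by plain string concatenation (`VOTE_BASE + OVERALL`). As a possible alternative, the standard library's `urllib.parse.urljoin` resolves a relative link against a base URL. A minimal sketch; `sub_url` is a made-up relative link used only for illustration:

```python
from urllib.parse import urljoin

VOTE_BASE = 'http://valasztas.hu/dyn/pv18/szavossz/hu/'
OVERALL = 'oevker.html'

# A base ending in "/" joined with a relative path simply appends the path.
overall_url = urljoin(VOTE_BASE, OVERALL)
print(overall_url)

# Hypothetical relative link, not taken from the real page.
sub_url = 'M01/E01/example.html'
detail_url = urljoin(VOTE_BASE, sub_url)
print(detail_url)
```

For links like these the result is the same as concatenation; `urljoin` mainly helps if a page ever uses links starting with `/` or `../`.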

- download document

In [None]:
vote_response = requests.get(VOTE_BASE + OVERALL)
print(vote_response.status_code)
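
The cell above only prints the status code, so a failed download would go unnoticed by the following cells. One way to make failures explicit is a small helper that sets a timeout and raises on HTTP errors. This is a sketch, not part of the original notebook: the name `fetch`, its `session` parameter and the 10 second default are assumptions.

```python
import requests


def fetch(url, session=None, timeout=10):
    """Download url and return the response body as bytes.

    Hypothetical helper: raises requests.HTTPError on 4xx/5xx answers
    instead of handing an error page to the parser.
    """
    # Use the given session if there is one, otherwise plain requests.get.
    getter = session.get if session is not None else requests.get
    response = getter(url, timeout=timeout)
    response.raise_for_status()
    return response.content
```

It could then be used as `vote_content = fetch(VOTE_BASE + OVERALL)` in place of the bare `requests.get` call.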

- extract the data with BeautifulSoup

In [None]:
vote_soup = BeautifulSoup(vote_response.content, "html.parser")
# find the first table with border="1" and collect its rows
containers = vote_soup.find('table', {'border': '1'}).find_all('tr')
print(len(containers))
containers[:5]

- get the cells out of the table rows

In [None]:
rows = [row.find_all('td') for row in containers]
rows[:5]

- "transform" the data into a table-like format

In [None]:
for row in rows[:5]:
    print([r.get_text() for r in row])

- for our analysis, we need the region, the subregion and the links

In [None]:
REGIONS = []
for row in rows:
    # column 0: region, column 2: subregion, column 1: link to the detail page
    REGIONS.append([row[0].get_text(), row[2].get_text(), row[1].find('a').get('href')])
REGIONS[:5]

In [None]:
print('Number of regions:', len(REGIONS))
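
The steps above (find the table, split it into rows, read the cell text and the link) can be tried offline on a small hand-written table. The HTML below is invented for illustration and does not reproduce the real page's markup:

```python
from bs4 import BeautifulSoup

# Invented sample markup; the real page will differ.
SAMPLE_HTML = """
<table border="1">
  <tr><td>Region A</td><td><a href="a/01.html">01</a></td><td>Town X</td></tr>
  <tr><td>Region B</td><td><a href="b/01.html">01</a></td><td>Town Y</td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# 1. locate the table by its attribute and collect its rows
table_rows = soup.find('table', {'border': '1'}).find_all('tr')

# 2. split every row into its cells
cells = [tr.find_all('td') for tr in table_rows]

# 3. keep the text of columns 0 and 2 and the link target from column 1
parsed = [[c[0].get_text(), c[2].get_text(), c[1].find('a').get('href')]
          for c in cells]
print(parsed)
```

Working on a fixed snippet like this makes it easier to check the column indices before running the code against the live site.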

- get the detailed information for each region

In [None]:
results = []

for city, region, sub_url in REGIONS:
    print("Downloading and processing data for {} - {} ...".format(city, region), end='', flush=True)
    region_response = requests.get(VOTE_BASE + sub_url)
    region_soup = BeautifulSoup(region_response.content, "html.parser")
    # find the heading text ("number of votes per candidate"), then the first table after it
    region_container = (region_soup
                        .find(string='A szavazatok száma jelöltenként')
                        .find_next('table')
                        .find_all('tr'))
    region_rows = [row.find_all('td') for row in region_container][1:]  # remove empty header
    # every candidate will go to a new row
    for row in region_rows:
        results.append([city, region] + [r.get_text() for r in row][:-1])  # remove the last 'tick column'
    print("Done.")
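
The loop above locates the candidate table by first finding a piece of heading text and then taking the next table after it. A self-contained sketch of that technique on invented markup follows; the element layout, names and numbers are assumptions, only the heading text comes from the notebook:

```python
from bs4 import BeautifulSoup

# Invented sample markup; only the heading text matches the notebook.
SAMPLE_HTML = """
<table><tr><td>unrelated table</td></tr></table>
<p>A szavazatok száma jelöltenként</p>
<table>
  <tr><th>#</th><th>name</th><th>party</th><th>votes</th><th>%</th><th></th></tr>
  <tr><td>1</td><td>Candidate One</td><td>Party P</td><td>12&nbsp;345</td><td>45.67 %</td><td>x</td></tr>
  <tr><td>2</td><td>Candidate Two</td><td>Party Q</td><td>9&nbsp;876</td><td>36.54 %</td><td></td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Find the text node with exactly this content, then the first table after it.
heading = soup.find(string='A szavazatok száma jelöltenként')
candidate_rows = heading.find_next('table').find_all('tr')

# The header row has no <td> cells, so it yields an empty list; [1:] drops it.
# [:-1] drops the last column of every remaining row.
candidates = [[td.get_text() for td in tr.find_all('td')][:-1]
              for tr in candidate_rows][1:]
print(candidates)
```

Note that the vote counts still contain a non-breaking space (`\xa0`) at this point, which is why a cleaning step is needed afterwards.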

- let's look at the detailed information

In [None]:
print(results[:5])
print('-' * 79)
print('Number of candidates:', len(results))

- clean the items: drop non-breaking spaces and percent signs, trim whitespace

In [None]:
cleaned_results = []

for row in results:
    cleaned_results.append(
        [item.replace('\xa0', '').replace('%', '').strip()  # replace the unneeded characters
         for item in row]
    )
cleaned_results[:5]
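
To see what the cleaning does to a single row, here is a small sketch on an invented row in the same shape as the scraped ones. The helper name `clean_item` and the numeric conversion at the end are assumptions added for illustration; the notebook itself keeps everything as strings.

```python
def clean_item(item):
    """Drop non-breaking spaces and percent signs, then trim whitespace."""
    return item.replace('\xa0', '').replace('%', '').strip()


# Invented sample row: region, subregion, id, name, party, votes, share.
raw_row = ['Region A', 'Town X', '1', 'Candidate One', 'Party P',
           '12\xa0345', '45.67 %']
cleaned_row = [clean_item(item) for item in raw_row]
print(cleaned_row)

# Hypothetical follow-up: turn the numeric columns into numbers.
# The replace(',', '.') guards against a decimal comma, which the
# real data may or may not use.
votes = int(cleaned_row[5])
share = float(cleaned_row[6].replace(',', '.'))
print(votes, share)
```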

Now we can finally save it!

In [None]:
import pandas as pd

header = ['region', 'subregion', 'subid', 'name', 'party', 'votes', 'votes %']
filename = 'vote2018.csv'
pd.DataFrame(cleaned_results, columns=header).to_csv(filename, index=False)
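
A quick way to check a saved file is to write a few rows and read them back. The sketch below uses only the standard library's `csv` module instead of pandas, with invented rows and a temporary folder; `save_rows` and `load_rows` are hypothetical helper names.

```python
import csv
import tempfile
from pathlib import Path

header = ['region', 'subregion', 'subid', 'name', 'party', 'votes', 'votes %']

# Invented sample rows standing in for cleaned_results.
sample_rows = [
    ['Region A', 'Town X', '1', 'Candidate One', 'Party P', '12345', '45.67'],
    ['Region A', 'Town X', '2', 'Candidate Two', 'Party Q', '9876', '36.54'],
]


def save_rows(path, header, rows):
    """Write the header and the rows to path as a UTF-8 CSV file."""
    with open(path, 'w', newline='', encoding='utf-8') as handle:
        writer = csv.writer(handle)
        writer.writerow(header)
        writer.writerows(rows)


def load_rows(path):
    """Read the CSV back as a list of dicts keyed by the header."""
    with open(path, newline='', encoding='utf-8') as handle:
        return list(csv.DictReader(handle))


# Round trip in a temporary folder so nothing is left on disk.
with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = Path(tmp_dir) / 'vote2018_sample.csv'
    save_rows(csv_path, header, sample_rows)
    loaded = load_rows(csv_path)

print(len(loaded), loaded[0]['name'], loaded[0]['votes'])
```

Writing with an explicit UTF-8 encoding matters here because the real names contain accented Hungarian characters.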