# Scrape and download - Introduction
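This notebook scrapes the list of deputies from https://www.sejm.gov.pl/Sejm9.nsf/poslowie.xsp and then visits each deputy's profile page to collect further details.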

# Environment setup

## Google Drive mount
I'm using Google Colaboratory as my default platform, therefore I need to set up my environment to integrate it with Google Drive. You can skip this bit if you're working locally.

1. Mount Google Drive on the runtime to be able to read and write files. This will ask you to log in to your Google Account and provide an authorization code.
2. Create a symbolic link to the working directory.
3. Change into that working directory, which is where the repository is cloned.
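
If the same notebook has to run both on Colab and locally, the mount step can be guarded. The following is a minimal sketch, assuming that the `google.colab` package is importable only on a Colab runtime; `running_in_colab` is a helper name made up for this example.

```python
def running_in_colab() -> bool:
    """Return True if the google.colab package can be imported."""
    try:
        import google.colab  # noqa: F401  (assumed to exist only on Colab runtimes)
    except ImportError:
        return False
    return True


if running_in_colab():
    # same mount call as in the cell below
    from google.colab import drive
    drive.mount('/content/gdrive', force_remount=True)
else:
    print('Not running on Colab, skipping the Google Drive mount.')
```

Run locally, the sketch only prints the message and leaves the file system untouched.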


In [1]:
# mount Google Drive on the runtime
from google.colab import drive
drive.mount('/content/gdrive', force_remount=True)

Mounted at /content/gdrive


In [2]:
# create a symbolic link to a working directory
!ln -s /content/gdrive/My\ Drive/Colab\ Notebooks/SEJMograf /mydrive

# navigate to the working directory
%cd /mydrive

ln: failed to create symbolic link '/mydrive/SEJMograf': File exists
/content/gdrive/My Drive/Colab Notebooks/SEJMograf


## Libraries & functions
Let's now import the libraries and functions we're going to use in this notebook.

- `selenium` - browser automation (clicking page elements before parsing)
- `requests` - HTTP handling
- `bs4` / `BeautifulSoup` - HTML parsing & web scraping
- `tqdm.notebook` - loop progress bar for notebooks
- `timeit` - cell runtime check
- `time` - short pauses between browser actions
- `re` - regular expressions
- `numpy` - numerical arrays
- `pandas` - data manipulation & analysis
- `urllib.request`, `sys`, `os`, `os.path`, `json` - URL opening, system parameters, operating system interfaces, pathname manipulation and JSON handling; these imports are currently commented out in the import cell

In [3]:
!apt install chromium-chromedriver
!cp /usr/lib/chromium-browser/chromedriver /usr/bin
!pip install selenium

Reading package lists... Done
Building dependency tree       
Reading state information... Done
chromium-chromedriver is already the newest version (87.0.4280.66-0ubuntu0.18.04.1).
0 upgraded, 0 newly installed, 0 to remove and 16 not upgraded.
cp: '/usr/lib/chromium-browser/chromedriver' and '/usr/bin/chromedriver' are the same file


In [4]:
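from selenium import webdriver

# run Chrome without a visible window; the two remaining flags are commonly
# needed when Chrome runs inside a container such as the Colab runtime
options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')
driver = webdriver.Chrome("/usr/lib/chromium-browser/chromedriver", options=options)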

In [5]:
# wd = webdriver.Chrome('chromedriver',options=options)
# wd.get('https://www.jeju.studio')
# print(wd.page_source) # results

In [6]:
import requests
import bs4
from bs4 import BeautifulSoup
# from selenium import webdriver
# from selenium.webdriver.common.keys import Keys
import tqdm.notebook as tq
import timeit
import numpy as np
import pandas as pd
import time
import re

# import urllib.request
# import sys
# import os
# from os.path import basename
# import json

# Scraping
Let's first retrieve the name and profile URL of every deputy from the index page, and then visit each profile to collect the details.

In [7]:
# start the timer and print the information
start = timeit.default_timer()
print('\nStarting. This might take a few seconds to complete...\n')

# initiate the containers
deputy_names = []
deputy_urls = []

# perform a http request
url = 'https://www.sejm.gov.pl/Sejm9.nsf/poslowie.xsp'
response = requests.get(url)

# initiate BeautifulSoup and find objects of our interest
soup = BeautifulSoup(response.content, 'html.parser')
letters = soup.find_all('ul', attrs={'class': 'deputies'})

for letter in letters:
  deputies = letter.find_all('a')

  for deputy in deputies:

    url = deputy.get('href')
    deputy_urls.append(f'https://www.sejm.gov.pl{url}')

    name = deputy.find('div', attrs={'class': 'deputyName'}).contents[0]
    deputy_names.append(name)


print(f'{len(deputy_names)} deputies found.')

# stop the timer and print runtime duration
stop = timeit.default_timer() 
print('\nRuntime: {} seconds.'.format(int(stop-start)))


Starting. This might take a few seconds to complete...

465 deputies found.

Runtime: 2 seconds.
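
The cell above builds absolute URLs by prefixing each `href` with the host name, which works as long as every `href` starts with `/`. A slightly more defensive variant, sketched here with the standard library's `urljoin`, also copes with relative and already absolute links. The `href` values below are made up for illustration, and `absolute_url` is a hypothetical helper name.

```python
from urllib.parse import urljoin

# URL of the index page, as used in this notebook
INDEX_URL = 'https://www.sejm.gov.pl/Sejm9.nsf/poslowie.xsp'


def absolute_url(href: str, base: str = INDEX_URL) -> str:
    """Resolve a link found on the index page against the page's own URL."""
    return urljoin(base, href)


# made-up href values, only to show the three cases
print(absolute_url('/Sejm9.nsf/posel.xsp?id=001'))    # root-relative link
print(absolute_url('posel.xsp?id=001'))               # link relative to the page
print(absolute_url('https://example.org/other.xsp'))  # already absolute
```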


In [8]:
df = pd.DataFrame(
    {'name': deputy_names,
     'url': deputy_urls}
    )
df

Unnamed: 0,name,url
0,Adamczyk Andrzej,https://www.sejm.gov.pl/Sejm9.nsf/posel.xsp?id...
1,Adamczyk Rafał,https://www.sejm.gov.pl/Sejm9.nsf/posel.xsp?id...
2,Adamowicz Piotr,https://www.sejm.gov.pl/Sejm9.nsf/posel.xsp?id...
3,Ajchler Romuald,https://www.sejm.gov.pl/Sejm9.nsf/posel.xsp?id...
4,Andruszkiewicz Adam,https://www.sejm.gov.pl/Sejm9.nsf/posel.xsp?id...
...,...,...
460,Zyska Ireneusz,https://www.sejm.gov.pl/Sejm9.nsf/posel.xsp?id...
461,Żalek Jacek,https://www.sejm.gov.pl/Sejm9.nsf/posel.xsp?id...
462,Żelazowska Bożena,https://www.sejm.gov.pl/Sejm9.nsf/posel.xsp?id...
463,Żuk Stanisław,https://www.sejm.gov.pl/Sejm9.nsf/posel.xsp?id...


In [9]:
# start the timer and print the information
start = timeit.default_timer()
print('\nStarting. This might take a few minutes to complete...\n')

# initiate the containers
deputy_pic = []
election_date = []
election_list = []
election_constituency = []
election_votes = []
oath_date = []
service_history = []
party = []

for url in tq.tqdm(df['url']):

  # perform a http request
  response = requests.get(url)
  
  # initiate BeautifulSoup and find objects of our interest
  soup = BeautifulSoup(response.content, 'html.parser')

  # print name of deputy currently scanned
  deputy_name = soup.find('div', attrs={'id': 'title_content'}).h1.contents[0]
  print(f'Retrieving party data of {deputy_name}')

  # find party data container and its items
  party_data = soup.find('div', attrs={'class': 'partia'})
  party_items = party_data.find_all('li')

  # deputy's picture
  deputy_pic.append(party_data.find('img').get('src'))


  # go through the items of party data
  for item in party_items:
    left = item.find('p', attrs={'class': 'left'}).contents[0]
    right = item.find('p', attrs={'class': 'right'}).contents[0]

    if re.search('Wybran. dnia:', left):
      election_date.append(right)
    elif left == 'Lista:':
      election_list.append(right)
    elif left == 'Okręg wyborczy:':
      election_constituency.append(re.search(r'\D+', right).group().strip())
    elif left == 'Liczba głosów:':
      election_votes.append(right)
    elif left == 'Ślubowanie:':
      oath_date.append(right)
    elif left == 'Staż parlamentarny:':
      service_history.append(right)
    elif left == 'Klub/koło:':
      if isinstance(right, bs4.element.NavigableString):
        party.append(right)
      else:
        party.append(right.contents[0])


# stop the timer and print runtime duration
stop = timeit.default_timer() 
print('\nRuntime: {} seconds.'.format(int(stop-start)))


Starting. This might take a few minutes to complete...




Retrieving party data of Andrzej Adamczyk
Retrieving party data of Rafał Adamczyk
Retrieving party data of Piotr Adamowicz
Retrieving party data of Romuald Ajchler
Retrieving party data of Adam Andruszkiewicz
Retrieving party data of Waldemar Andzel
Retrieving party data of Tomasz Aniśko
Retrieving party data of Jan Krzysztof Ardanowski
Retrieving party data of Iwona Arent
Retrieving party data of Marek Ast
Retrieving party data of Urszula Augustyn
Retrieving party data of Tadeusz Aziewicz
Retrieving party data of Zbigniew Babalski
Retrieving party data of Piotr Babinetz
Retrieving party data of Ryszard Bartosik
Retrieving party data of Władysław Teofil Bartoszewski
Retrieving party data of Barbara Bartuś
Retrieving party data of Mieczysław Baszko
Retrieving party data of Dariusz Bąk
Retrieving party data of Paweł Bejda
Retrieving party data of Konrad Berkowicz
Retrieving party data of Magdalena Biejat
Retrieving party data of Jerzy Bielecki
Retrieving party data of Marek Biernacki
Ret
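
One caveat of the loop above: each list only grows when its label is present on a profile page, so a single missing field would shift every later value in that list by one row. A possible alternative, sketched below, is to build one record per deputy with a default of `None` for each field. The label strings are the ones used in the loop; `parse_party_items`, the `LABELS` mapping and the raw constituency format `'13 Kraków'` are assumptions made for this example.

```python
import re

# Polish labels from the loop above mapped to the column names used in this notebook
LABELS = {
    'Lista:': 'election_list',
    'Okręg wyborczy:': 'election_constituency',
    'Liczba głosów:': 'election_votes',
    'Ślubowanie:': 'oath_date',
    'Staż parlamentarny:': 'service_history',
    'Klub/koło:': 'party',
}


def parse_party_items(pairs):
    """Turn (label, value) pairs into one record; missing fields stay None."""
    record = {'election_date': None}
    record.update({column: None for column in LABELS.values()})
    for left, right in pairs:
        if re.search('Wybran. dnia:', left):
            record['election_date'] = right
        elif left in LABELS:
            record[LABELS[left]] = right
    # keep only the non-digit part of the constituency, as in the loop above
    constituency = record['election_constituency']
    if constituency is not None:
        match = re.search(r'\D+', constituency)
        if match:
            record['election_constituency'] = match.group().strip()
    return record


# the raw constituency format below is an assumption, not copied from the site
sample = [('Wybrany dnia:', '13-10-2019'), ('Okręg wyborczy:', '13 Kraków')]
print(parse_party_items(sample))
```

A list of such records can be passed to `pd.DataFrame` directly, which keeps every row aligned even when a page lacks a field.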

In [10]:
df['picture'] = deputy_pic
df['election_date'] = election_date
df['election_list'] = election_list
df['election_constituency'] = election_constituency
df['election_votes'] = election_votes
df['oath_date'] = oath_date
df['service_history'] = service_history
df['party'] = party
df

Unnamed: 0,name,url,picture,election_date,election_list,election_constituency,election_votes,oath_date,service_history,party
0,Adamczyk Andrzej,https://www.sejm.gov.pl/Sejm9.nsf/posel.xsp?id...,https://orka.sejm.gov.pl/Poslowie9.nsf/0/8A510...,13-10-2019,Prawo i Sprawiedliwość,Kraków,29686,12-11-2019,"poseł V kadencji, poseł VI kadencji, poseł VII...",Klub Parlamentarny Prawo i Sprawiedliwość
1,Adamczyk Rafał,https://www.sejm.gov.pl/Sejm9.nsf/posel.xsp?id...,https://orka.sejm.gov.pl/Poslowie9.nsf/0/8ACA6...,13-10-2019,Sojusz Lewicy Demokratycznej,Katowice,12148,12-11-2019,brak,"Koalicyjny Klub Parlamentarny Lewicy (Razem, S..."
2,Adamowicz Piotr,https://www.sejm.gov.pl/Sejm9.nsf/posel.xsp?id...,https://orka.sejm.gov.pl/Poslowie9.nsf/0/3CA57...,13-10-2019,Koalicja Obywatelska,Gdańsk,41795,12-11-2019,brak,Klub Parlamentarny Koalicja Obywatelska - Plat...
3,Ajchler Romuald,https://www.sejm.gov.pl/Sejm9.nsf/posel.xsp?id...,https://orka.sejm.gov.pl/Poslowie9.nsf/0/5E88F...,13-10-2019,Sojusz Lewicy Demokratycznej,Piła,14438,12-11-2019,"poseł II kadencji, poseł III kadencji, poseł I...","Koalicyjny Klub Parlamentarny Lewicy (Razem, S..."
4,Andruszkiewicz Adam,https://www.sejm.gov.pl/Sejm9.nsf/posel.xsp?id...,https://orka.sejm.gov.pl/Poslowie9.nsf/0/82D86...,13-10-2019,Prawo i Sprawiedliwość,Białystok,29829,12-11-2019,poseł VIII kadencji,Klub Parlamentarny Prawo i Sprawiedliwość
...,...,...,...,...,...,...,...,...,...,...
460,Zyska Ireneusz,https://www.sejm.gov.pl/Sejm9.nsf/posel.xsp?id...,https://orka.sejm.gov.pl/Poslowie9.nsf/0/B258B...,13-10-2019,Prawo i Sprawiedliwość,Wałbrzych,10688,12-11-2019,poseł VIII kadencji,Klub Parlamentarny Prawo i Sprawiedliwość
461,Żalek Jacek,https://www.sejm.gov.pl/Sejm9.nsf/posel.xsp?id...,https://orka.sejm.gov.pl/Poslowie9.nsf/0/34603...,13-10-2019,Prawo i Sprawiedliwość,Białystok,12141,12-11-2019,"poseł VI kadencji, poseł VII kadencji, poseł V...",Klub Parlamentarny Prawo i Sprawiedliwość
462,Żelazowska Bożena,https://www.sejm.gov.pl/Sejm9.nsf/posel.xsp?id...,https://orka.sejm.gov.pl/Poslowie9.nsf/0/A52DB...,13-10-2019,Polskie Stronnictwo Ludowe,Warszawa,8665,12-11-2019,brak,"Klub Parlamentarny Koalicja Polska - PSL, UED,..."
463,Żuk Stanisław,https://www.sejm.gov.pl/Sejm9.nsf/posel.xsp?id...,https://orka.sejm.gov.pl/Poslowie9.nsf/0/66BC5...,13-10-2019,Polskie Stronnictwo Ludowe,Legnica,7694,12-11-2019,brak,Poseł niezrzeszony


In [12]:
# start the timer and print the information
start = timeit.default_timer()
print('\nStarting. This might take a few minutes to complete...\n')

# initiate the containers
deputy_pic = []
election_date = []
election_list = []
election_constituency = []
election_votes = []
oath_date = []
service_length = []
party = []
birth_date = []
birth_place = []
education = []
schools = []
occupation = []

for url in tq.tqdm(df['url']):

  # perform a http request
  response = requests.get(url)
  
  # initiate BeautifulSoup and find objects of our interest
  soup = BeautifulSoup(response.content, 'html.parser')

  deputy_name = soup.find('div', attrs={'id': 'title_content'}).h1.contents[0]
  print(f'Scanning deputy: {deputy_name}')

  # party data
  # (the container is named party_data so that it does not overwrite the party list)
  party_data = soup.find('div', attrs={'class': 'partia'})
  party_txt = party_data.find_all('p', attrs={'class': 'right'})
  
  # append party data to containers
  deputy_pic.append(party_data.find('img').get('src'))
  election_date.append(party_txt[0].contents[0])
  election_list.append(party_txt[1].contents[0])
  election_constituency.append(party_txt[2].contents[0])
  election_votes.append(party_txt[3].contents[0])
  oath_date.append(party_txt[4].contents[0])
  service_length.append(party_txt[5].contents[0])
  party.append(party_txt[6].contents[0])

  # cv data
  cv = soup.find('div', attrs={'class': 'cv'})
  cv_txt = cv.find_all('li') #, attrs={'class': 'right'}

  for item in cv_txt:
    left = item.find('p', attrs={'class': 'left'}).contents[0]
    right = item.find('p', attrs={'class': 'right'}).contents[0]

    if left == 'Data i miejsce urodzenia:':
      birth_date.append(right.split(', ')[0])
      birth_place.append(right.split(', ')[1])
    elif left == 'Wykształcenie:':
      education.append(right)
    elif left == 'Ukończona szkoła:':
      schools.append(right)

  # # append cv data to container
  # birth_date.append(cv_txt[0].contents[0].split(', ')[0])
  # birth_place.append(cv_txt[0].contents[0].split(', ')[1])
  # education.append(cv_txt[1].contents[0].split(', ')[0])
  # schools.append(cv_txt[2].contents[0].split(', ')[0])
  # occupation.append(cv_txt[3].contents[0].split(', ')[0])

# stop the timer and print runtime duration
stop = timeit.default_timer() 
print('\nRuntime: {} seconds.'.format(int(stop-start)))


Starting. This might take a few minutes to complete...




Scanning deputy: Andrzej Adamczyk
Scanning deputy: Rafał Adamczyk
Scanning deputy: Piotr Adamowicz
Scanning deputy: Romuald Ajchler
Scanning deputy: Adam Andruszkiewicz
Scanning deputy: Waldemar Andzel
Scanning deputy: Tomasz Aniśko
Scanning deputy: Jan Krzysztof Ardanowski
Scanning deputy: Iwona Arent
Scanning deputy: Marek Ast
Scanning deputy: Urszula Augustyn
Scanning deputy: Tadeusz Aziewicz
Scanning deputy: Zbigniew Babalski
Scanning deputy: Piotr Babinetz
Scanning deputy: Ryszard Bartosik
Scanning deputy: Władysław Teofil Bartoszewski
Scanning deputy: Barbara Bartuś
Scanning deputy: Mieczysław Baszko
Scanning deputy: Dariusz Bąk
Scanning deputy: Paweł Bejda
Scanning deputy: Konrad Berkowicz
Scanning deputy: Magdalena Biejat
Scanning deputy: Jerzy Bielecki
Scanning deputy: Marek Biernacki
Scanning deputy: Mariusz Błaszczak
Scanning deputy: Mateusz Bochenek
Scanning deputy: Rafał Bochenek
Scanning deputy: Jerzy Borowczak
Scanning deputy: Joanna Borowiak
Scanning deputy: Kamil Bortn

KeyboardInterrupt: ignored
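
The birth field in the cell above is split with `right.split(', ')[1]`, which raises an `IndexError` when a profile lists a date without a place. A small sketch of a more forgiving split, using `str.partition`, is shown below. `split_birth_field` is a hypothetical helper, and the sample strings are made up.

```python
def split_birth_field(text):
    """Split 'date, place' into (date, place); place is None when it is absent."""
    date, separator, place = text.partition(', ')
    return date.strip(), (place.strip() if separator else None)


# made-up sample values
print(split_birth_field('01-01-1970, Warszawa'))
print(split_birth_field('01-01-1970'))
```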

In [None]:
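# attach the scraped lists to the data frame as new columns;
# every list must hold exactly one value per row of df, so a list that
# was never filled (such as occupation) raises a ValueError here
new_columns = {
    'picture': deputy_pic,
    'election_date': election_date,
    'election_list': election_list,
    'election_constituency': election_constituency,
    'election_votes': election_votes,
    'oath_date': oath_date,
    'service_length': service_length,
    'party': party,
    'birth_date': birth_date,
    'birth_place': birth_place,
    'education': education,
    'schools': schools,
    'occupation': occupation,
    }

for column, values in new_columns.items():
  df[column] = values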


In [None]:
# start the timer and print the information
start = timeit.default_timer()
print('\nStarting. This might take a few minutes to complete...\n')

# initiate the containers
deputy_pic = []
election_date = []
election_list = []
election_constituency = []
election_votes = []
oath_date = []
service_length = []
party = []
birth_date = []
birth_place = []
education = []
schools = []
occupation = []
number_of_speeches = []
interpellations = []
voting_frequency = []
committee = []
delegations = []
groups = []

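for url in tq.tqdm(df['url']):
  # load the page in the headless browser instead of a plain http request
  # response = requests.get(url)

  driver.get(url)
  # click every activity tab before reading the page source
  # (find_element_by_id is the Selenium 3 API; Selenium 4 uses driver.find_element(By.ID, a))
  activities = ['wystapienia', 'int', 'glosowania', 'komisje', 'delegacje', 'zespoly']
  for a in activities:
    driver.find_element_by_id(a).click()
  time.sleep(.1)
  page_source = driver.page_source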
  
  # initiate BeautifulSoup and find objects of our interest
  soup = BeautifulSoup(page_source, 'html.parser')

  # party data
  # (the container is named party_data so that it does not overwrite the party list)
  party_data = soup.find('div', attrs={'class': 'partia'})
  party_txt = party_data.find_all('p', attrs={'class': 'right'})
  
  deputy_pic.append(party_data.find('img').get('src'))
  election_date.append(party_txt[0].contents[0])
  election_list.append(party_txt[1].contents[0])
  election_constituency.append(party_txt[2].contents[0])
  election_votes.append(party_txt[3].contents[0])
  oath_date.append(party_txt[4].contents[0])
  service_length.append(party_txt[5].contents[0])
  party.append(party_txt[6].contents[0])

  # cv data
  cv = soup.find('div', attrs={'class': 'cv'})
  cv_txt = cv.find_all('p', attrs={'class': 'right'})

  birth_date.append(cv_txt[0].contents[0].split(', ')[0])
  birth_place.append(cv_txt[0].contents[0].split(', ')[1])
  education.append(cv_txt[1].contents[0].split(', ')[0])
  schools.append(cv_txt[2].contents[0].split(', ')[0])
  occupation.append(cv_txt[3].contents[0].split(', ')[0])

  # activity data
  activity = soup.find('div', attrs={'class': 'aktywnosc'})
  activity_txt = activity.find_all('div', attrs={'class': 'kluby-kola'})


  speeches = soup.find('td', attrs={'class': 'left'}).contents
  if len(speeches) > 1:
    number = int(re.search(r'\d+', speeches[1]).group())
    number_of_speeches.append(number)
  else:
    number_of_speeches.append(0)




  # print(activity_li)
  # number_of_speeches.append(activity_txt)
  # interpellations = []
  # voting_frequency = []
  # committee = []
  # delegations = []
  # groups = [] 

# stop the timer and print runtime duration
stop = timeit.default_timer() 
print('\nRuntime: {} seconds.'.format(int(stop-start)))
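
The speech count in the cell above is pulled out of the cell text with a regular expression, and the code falls back to 0 when the element has no second child. The same idea can be wrapped in a small helper that also tolerates empty or missing text. This is only a sketch: `count_from_text` is a hypothetical name, and the sample strings are made up rather than copied from the site.

```python
import re


def count_from_text(text):
    """Return the first integer found in text, or 0 if there is none."""
    match = re.search(r'\d+', text or '')
    return int(match.group()) if match else 0


# made-up sample strings
print(count_from_text('Wystąpienia: 12'))
print(count_from_text(''))
print(count_from_text(None))
```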