# Scrape and download - Introduction
Source page: https://www.sejm.gov.pl/Sejm9.nsf/poslowie.xsp

# Environment setup

## Google Drive mount
I'm using Google Colaboratory as my default platform, so I need to set up the environment to work with Google Drive. You can skip this section if you're working locally.

1. Mount Google Drive on the runtime to be able to read and write files. This will ask you to log in to your Google Account and provide an authorization code.
2. Create a symbolic link to the working directory.
3. Change into the directory where the repository is cloned.


In [1]:
# mount Google Drive on the runtime
from google.colab import drive
drive.mount('/content/gdrive', force_remount=True)

Mounted at /content/gdrive


In [2]:
# create a symbolic link to a working directory
!ln -s /content/gdrive/My\ Drive/Colab\ Notebooks/SEJMograf /mydrive

# navigate to the working directory
%cd /mydrive

/content/gdrive/My Drive/Colab Notebooks/SEJMograf
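
If the same notebook should also run locally, the Colab-specific steps above can be guarded. The snippet below is a minimal sketch, not part of the original setup: `running_in_colab` and `resolve_workdir` are hypothetical helper names, and the check simply assumes that the `google.colab` package is importable only on a Colab runtime.

```python
import os


def running_in_colab() -> bool:
    """Return True when the google.colab package can be imported."""
    try:
        import google.colab  # noqa: F401  (assumed to exist only on Colab)
    except ImportError:
        return False
    return True


def resolve_workdir(colab_dir: str = '/mydrive') -> str:
    """Pick the symlinked Drive folder on Colab, the current directory otherwise."""
    return colab_dir if running_in_colab() else os.getcwd()


workdir = resolve_workdir()
print('Working directory:', workdir)
```

On Colab the mount and symlink cells still have to run first; locally the helper just returns the current directory, so the rest of the notebook can use `workdir` either way.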


## Libraries & functions
Let's now import the libraries and functions we're going to use in this notebook.

- `requests` - HTTP handling
- `BeautifulSoup` - HTML parsing & web scraping
- `urllib.request` - URL opening
- `tqdm.notebook` - loop progress bar for notebooks
- `timeit` - cell runtime check
- `numpy` - linear algebra
- `pandas` - data manipulation & analysis
- `sys` - system-specific parameters & functions
- `os` - operating system interfaces
- `os.path` - pathname manipulation
- `json` - JSON files handling

In [3]:
import requests
from bs4 import BeautifulSoup
import urllib.request
import tqdm.notebook as tq
import timeit
import numpy as np
import pandas as pd
import sys
import os
from os.path import basename
import json

# Scraping
Let's now request the deputies list page and collect every element that holds a deputy's name.
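
Before requesting the live page, the parsing step can be tried offline. The snippet below is a sketch under stated assumptions: `extract_deputy_names` is a hypothetical helper, and the surrounding `<ul>`/`<li>` markup in `sample_html` is invented for illustration. Only the `div` elements with class `deputyName` mirror what the real page returns.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment; only the deputyName divs mirror the real page
sample_html = """
<ul>
  <li><div class="deputyName">Adamczyk Andrzej</div></li>
  <li><div class="deputyName">Adamczyk Rafał</div></li>
  <li><div class="other">not a deputy</div></li>
</ul>
"""


def extract_deputy_names(html: str) -> list:
    """Return the stripped text of every <div class="deputyName"> in html."""
    soup = BeautifulSoup(html, 'html.parser')
    divs = soup.find_all('div', class_='deputyName')
    return [div.get_text(strip=True) for div in divs]


print(extract_deputy_names(sample_html))
```

The same function should work on `response.content` from the real request, since `find_all` with `class_='deputyName'` is equivalent to the `attrs={'class': 'deputyName'}` lookup used in the cell that follows.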

In [4]:
# # start the timer and print the information
# start = timeit.default_timer()
# print('\nStarting. This might take a few minutes to complete...\n')

# initialize the containers
deputy_names = []
deputy_urls = []

# perform an HTTP request and fail early on an error status
url = 'https://www.sejm.gov.pl/Sejm9.nsf/poslowie.xsp'
response = requests.get(url, timeout=30)
response.raise_for_status()

# parse the HTML and find every <div> with class 'deputyName'
soup = BeautifulSoup(response.content, 'html.parser')
names = soup.find_all('div', attrs={'class': 'deputyName'})

names

# # stop the timer and print runtime duration
# stop = timeit.default_timer() 
# print('Runtime: {} seconds.'.format(int(stop-start)))

[<div class="deputyName">Adamczyk Andrzej</div>,
 <div class="deputyName">Adamczyk Rafał</div>,
 <div class="deputyName">Adamowicz Piotr</div>,
 <div class="deputyName">Ajchler Romuald</div>,
 <div class="deputyName">Andruszkiewicz Adam</div>,
 <div class="deputyName">Andzel Waldemar</div>,
 <div class="deputyName">Aniśko Tomasz</div>,
 <div class="deputyName">Ardanowski Jan Krzysztof</div>,
 <div class="deputyName">Arent Iwona</div>,
 <div class="deputyName">Ast Marek</div>,
 <div class="deputyName">Augustyn Urszula</div>,
 <div class="deputyName">Aziewicz Tadeusz</div>,
 <div class="deputyName">Babalski Zbigniew</div>,
 <div class="deputyName">Babinetz Piotr</div>,
 <div class="deputyName">Bartosik Ryszard</div>,
 <div class="deputyName">Bartoszewski Władysław Teofil</div>,
 <div class="deputyName">Bartuś Barbara</div>,
 <div class="deputyName">Baszko Mieczysław</div>,
 <div class="deputyName">Bąk Dariusz</div>,
 <div class="deputyName">Bejda Paweł</div>,
 <div class="deputyName">Ber