# HW02: Data Collection 


## Preamble

### Guidelines

All problem sets must be submitted to Moodle as a compiled *notebook* file in PDF format. You can produce it by exporting the notebook to PDF.


**In this problem set, you have a lot of freedom.** You should propose an original data collection task: <span style='color:green'> choose a website that you want to harvest, briefly justifying why you think it is interesting</span>.

The <span style='color:green'> following guidelines </span> are indications that you can adapt to your specific collection task. Nonetheless, <span style='color:green'> your code should contain:</span>

   1. A list of `url`s that you are interested in and whose HTML content shares a common structure (so that you can loop over them)
   2. A `loop` over the `url`s: the **automation** part that we care about
   3. A **structured** output (`dataframe` or `json`)
    

## 1. <span style='color:green'>Choose a website that you want to harvest, briefly justifying why you think it is interesting</span>
It should be a static website, with some interesting underlying structure that you could loop over. 

If you have no ideas, think about an area that interests you:
- newspaper articles
- speeches from policymakers
- Wikipedia articles

I want to scrape the names and party affiliations of the members of the Italian "Camera dei deputati" for a social data science project that I am working on. The project needs a list of names and party memberships for all members of the chamber, and so far I could not find one in a structured format (e.g. a .csv file). This is why I want to scrape the official page of the "Camera dei deputati".

## 2. <span style='color:green'> Build the list of the `url` (or `query parameters`) that you will scrape</span>
Before building the scraper, explain (and code) how you will generate the list of URLs to scrape. It can be based on query parameters or on a list of URLs that already appears on a webpage.

In [1]:
base_url = "https://www.camera.it/leg18/28"
letters = ["A", "B", "C", "D", "E", "F", "G", "I", "L", "M", "N", "O", "P", "Q", "R", "S", "T", "U", "V", "Z"]
query_str = "?lettera="
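The three variables above are the building blocks; the explicit list of URLs can be assembled from them in one step. A minimal sketch, assuming that the page accepts a single `lettera` query parameter as in the cell above (the `urls` name is introduced here for illustration):

```python
from urllib.parse import urlencode

base_url = "https://www.camera.it/leg18/28"
letters = ["A", "B", "C", "D", "E", "F", "G", "I", "L", "M",
           "N", "O", "P", "Q", "R", "S", "T", "U", "V", "Z"]

# One URL per initial letter; urlencode takes care of escaping the query string.
urls = [f"{base_url}?{urlencode({'lettera': letter})}" for letter in letters]

print(len(urls))
print(urls[0])
```

Note that the hand-typed list skips H, J, K, W, X and Y. Whether the site serves pages for those letters is not checked here.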

## 3. <span style='color:green'> First build the scraper for 1 webpage</span>
- query the website using `requests`
- extract the relevant information using `beautifulsoup`

In [2]:
import requests
import re
import json
from bs4 import BeautifulSoup
from unidecode import unidecode

In [3]:
response = requests.get(base_url + query_str + letters[0])
soup = BeautifulSoup(response.text, "html.parser")

In [4]:
vcards = soup.find_all("div" ,class_='vcard')

In [5]:
def get_deputati(vcards):
    """Turn a list of vcard <div> tags into a list of dicts, one per deputy."""
    deputati = []
    for card in vcards:
        vcard = {}
        name = card.find("div", class_="fn").a
        vcard["name"] = unidecode(name.text)
        vcard["url"] = name.get("href")
        # Party affiliation: the first link is the group, any further link is a
        # sub-group (needed for the MISTO group, which has sub-groups).
        org = card.find("div", class_="org")
        if org:
            for i, link in enumerate(org.find_all("a")):
                if i == 0:
                    vcard["gruppo"] = unidecode(link.text)
                    vcard["url_gruppo"] = link.get("href")
                else:
                    vcard["sub_gruppo"] = unidecode(link.text)
                    vcard["sub_gruppo_url"] = link.get("href")
        # Deputies whose mandate ended early: strip the fixed prefix, keep the date.
        cessato = card.find("div", class_="")
        if cessato:
            vcard["cessato"] = re.sub(r"\s*Cessato dal mandato parlamentare  il ", "", unidecode(cessato.text))
        deputati.append(vcard)
    return deputati

In [6]:
get_deputati(vcards)[:5]

[{'name': 'ACQUAROLI Francesco',
  'url': 'https://documenti.camera.it/apps/commonServices/getDocumento.ashx?sezione=deputati&tipoDoc=schedaDeputato&idlegislatura=18&idPersona=307171&tipoAttivita=&tipoVisAtt=&tipoPersona=',
  'cessato': '22 ottobre 2020'},
 {'name': 'ACUNZO Nicola',
  'url': 'https://documenti.camera.it/apps/commonServices/getDocumento.ashx?sezione=deputati&tipoDoc=schedaDeputato&idlegislatura=18&idPersona=307538&tipoAttivita=&tipoVisAtt=&tipoPersona=',
  'gruppo': 'MISTO',
  'url_gruppo': 'https://documenti.camera.it/apps/commonServices/getDocumento.ashx?sezione=Deputati&tipoDoc=gruppo&idlegislatura=18&shadow_gruppi_parlamentari=3033',
  'sub_gruppo': 'CENTRO DEMOCRATICO-ITALIANI IN EUROPA',
  'sub_gruppo_url': 'https://documenti.camera.it/apps/commonServices/getDocumento.ashx?sezione=Deputati&tipoDoc=gruppoMisto&idlegislatura=18&shadow_gruppi_parlamentari=3033&shadow_gruppi_misti=3411&tipoVis=2#misto3411'},
 {'name': 'ADELIZZI Cosimo',
  'url': 'https://documenti.cam

## 4. <span style='color:green'> Scraper: `loop` over the `url` list using the scraper from step 3</span>

This is the key step. You should find a way to collect all the info, for example in a `DataFrame` or in a `json`.

In a real-world project, you might want to:
- save the `html` on disk before cooking the soup. 
- save each output separately instead of concatenating them together in a `DataFrame` or in a `json`
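Saving the raw HTML before parsing can be sketched as a small cache helper. This is only an illustration: `cached_html`, its `fetch` argument and the cache layout are hypothetical names, not part of the assignment.

```python
from pathlib import Path
from typing import Callable


def cached_html(url: str, cache_path: Path, fetch: Callable[[str], str]) -> str:
    """Return the HTML for `url`, downloading it only if it is not on disk yet."""
    if cache_path.exists():
        return cache_path.read_text(encoding="utf-8")
    html = fetch(url)  # `fetch` is any function that maps a URL to its HTML text
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    cache_path.write_text(html, encoding="utf-8")
    return html
```

In this notebook, `fetch` could be something like `lambda u: requests.get(u, timeout=30).text`, with one cache file per letter, e.g. `Path("html") / f"{letter}.html"`. Re-running the parser then no longer hits the website.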

In [7]:
# Scrape one page per initial letter and collect all records in a single list.
deputati = []
for letter in letters:
    url = base_url + query_str + letter
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "html.parser")
    vcards = soup.find_all("div", class_="vcard")
    deputati += get_deputati(vcards)
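The loop above stops at the first network error and sends its requests back to back. A slightly more defensive variant could look like the sketch below. The names `scrape_all`, `fetch`, `parse` and `pause` are hypothetical, and the one-second default pause is an arbitrary choice.

```python
import time
from typing import Callable, Iterable


def scrape_all(urls: Iterable[str],
               fetch: Callable[[str], str],
               parse: Callable[[str], list],
               pause: float = 1.0):
    """Fetch and parse each URL, pausing between requests and recording failures."""
    records, failed = [], []
    for url in urls:
        try:
            records.extend(parse(fetch(url)))
        except Exception as exc:  # keep going; the failed URL can be retried later
            failed.append((url, repr(exc)))
        time.sleep(pause)
    return records, failed
```

Here `fetch` would wrap `requests.get`, and `parse` would build the soup and call `get_deputati`. The second return value lists the pages that need another attempt.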

In [13]:
deputati[30:35]

[{'name': 'BARTOLOZZI Giusi',
  'url': 'https://documenti.camera.it/apps/commonServices/getDocumento.ashx?sezione=deputati&tipoDoc=schedaDeputato&idlegislatura=18&idPersona=307404&tipoAttivita=&tipoVisAtt=&tipoPersona=',
  'gruppo': 'FORZA ITALIA - BERLUSCONI PRESIDENTE',
  'url_gruppo': 'https://documenti.camera.it/apps/commonServices/getDocumento.ashx?sezione=Deputati&tipoDoc=gruppo&idlegislatura=18&shadow_gruppi_parlamentari=3053'},
 {'name': 'BARZOTTI Valentina',
  'url': 'https://documenti.camera.it/apps/commonServices/getDocumento.ashx?sezione=deputati&tipoDoc=schedaDeputato&idlegislatura=18&idPersona=308001&tipoAttivita=&tipoVisAtt=&tipoPersona=',
  'gruppo': 'MOVIMENTO 5 STELLE',
  'url_gruppo': 'https://documenti.camera.it/apps/commonServices/getDocumento.ashx?sezione=Deputati&tipoDoc=gruppo&idlegislatura=18&shadow_gruppi_parlamentari=3051'},
 {'name': 'BASINI Giuseppe',
  'url': 'https://documenti.camera.it/apps/commonServices/getDocumento.ashx?sezione=deputati&tipoDoc=scheda

In [14]:
print( "--- number of vcard's --- \n")
print(len(deputati))
# Count the records that carry a "cessato" (end-of-mandate) date.
cessati = sum("cessato" in dep for dep in deputati)
print("\n--- number of abandonments --- \n")
print(cessati)

--- number of vcard's --- 

650

--- number of abandonments --- 

21
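The same list of dicts also supports a quick breakdown by group. The sketch below uses four records copied from the outputs above (URLs left out). The placeholder label for records without a `gruppo` key is an assumption made for illustration.

```python
from collections import Counter

# Four records copied from the outputs above (URLs omitted for brevity).
sample = [
    {"name": "ACQUAROLI Francesco", "cessato": "22 ottobre 2020"},
    {"name": "ACUNZO Nicola", "gruppo": "MISTO",
     "sub_gruppo": "CENTRO DEMOCRATICO-ITALIANI IN EUROPA"},
    {"name": "BARTOLOZZI Giusi", "gruppo": "FORZA ITALIA - BERLUSCONI PRESIDENTE"},
    {"name": "BARZOTTI Valentina", "gruppo": "MOVIMENTO 5 STELLE"},
]

# Records without a "gruppo" key are counted under a placeholder label.
per_group = Counter(dep.get("gruppo", "(no group listed)") for dep in sample)
n_cessati = sum("cessato" in dep for dep in sample)

print(per_group.most_common())
print(n_cessati)
```

Replacing `sample` with the full `deputati` list would give the count per parliamentary group.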


## 5. <span style='color:green'> Save your file on disk</span> (Does not have to be included in the submission)

In [15]:
# write the list of dicts to disk as JSON; the context manager closes the file
with open("deputati.json", "w") as out_file:
    json.dump(deputati, fp=out_file)
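A quick round trip confirms that the file on disk holds exactly what was scraped. This is a minimal sketch: `records` stands in for `deputati`, and a temporary directory is used so that nothing is overwritten. The `ensure_ascii=False` and `indent=2` options are optional choices that keep the file human-readable.

```python
import json
import tempfile
from pathlib import Path

records = [{"name": "ACUNZO Nicola", "gruppo": "MISTO"}]  # stand-in for `deputati`

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "deputati.json"
    with path.open("w", encoding="utf-8") as fh:
        json.dump(records, fh, ensure_ascii=False, indent=2)
    with path.open(encoding="utf-8") as fh:
        reloaded = json.load(fh)

# True when the data read back matches what was written.
print(reloaded == records)
```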

In [None]:
import pandas as pd
import numpy as np

df = pd.read_json("deputati.json")

In [6]:
df.head()

| | name | url | cessato | gruppo | url_gruppo | sub_gruppo | sub_gruppo_url |
|---|---|---|---|---|---|---|---|
| 0 | ACQUAROLI Francesco | https://documenti.camera.it/apps/commonService... | 22 ottobre 2020 | | | | |
| 1 | ACUNZO Nicola | https://documenti.camera.it/apps/commonService... | | MISTO | https://documenti.camera.it/apps/commonService... | CENTRO DEMOCRATICO-ITALIANI IN EUROPA | https://documenti.camera.it/apps/commonService... |
| 2 | ADELIZZI Cosimo | https://documenti.camera.it/apps/commonService... | | MOVIMENTO 5 STELLE | https://documenti.camera.it/apps/commonService... | | |
| 3 | AIELLO Davide | https://documenti.camera.it/apps/commonService... | | MOVIMENTO 5 STELLE | https://documenti.camera.it/apps/commonService... | | |
| 4 | AIELLO Piera | https://documenti.camera.it/apps/commonService... | | MISTO | https://documenti.camera.it/apps/commonService... | CENTRO DEMOCRATICO-ITALIANI IN EUROPA | https://documenti.camera.it/apps/commonService... |


In [8]:
df["name"] = df["name"].str.lower()

In [33]:
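from dateparser import parse
# parse the Italian date strings; rows without a date are marked False
df["cessato"] = df["cessato"].apply(lambda x: parse(str(x)) if str(x) != "nan" else False)

# keep only the deputies whose mandate ended early
df[df["cessato"] != False]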

| | name | url | cessato | gruppo | url_gruppo | sub_gruppo | sub_gruppo_url |
|---|---|---|---|---|---|---|---|
| 0 | acquaroli francesco | https://documenti.camera.it/apps/commonService... | 2020-10-22 00:00:00 | | | | |
| 160 | crosetto guido | https://documenti.camera.it/apps/commonService... | 2019-03-13 00:00:00 | FRATELLI D'ITALIA | https://documenti.camera.it/apps/commonService... | | |
| 174 | de carlo luca | https://documenti.camera.it/apps/commonService... | 2020-08-05 00:00:00 | | | | |
| 223 | ermini david | https://documenti.camera.it/apps/commonService... | 2018-09-25 00:00:00 | | | | |
| 233 | fedriga massimiliano | https://documenti.camera.it/apps/commonService... | 2018-05-08 00:00:00 | | | | |
| 242 | fidanza carlo | https://documenti.camera.it/apps/commonService... | 2019-06-27 00:00:00 | | | | |
| 266 | fugatti maurizio | https://documenti.camera.it/apps/commonService... | 2019-01-09 00:00:00 | | | | |
| 284 | gentiloni silveri paolo | https://documenti.camera.it/apps/commonService... | 2019-12-02 00:00:00 | | | | |
| 289 | giacomelli antonello | https://documenti.camera.it/apps/commonService... | 2020-09-30 00:00:00 | | | | |
| 293 | giannetta domenico | https://documenti.camera.it/apps/commonService... | 2020-05-07 00:00:00 | | | | |
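The date conversion in this notebook relies on `dateparser`. Since the scraped strings all follow the pattern "day month-name year" (e.g. `22 ottobre 2020`), a dependency-free alternative can be sketched with the standard library. `parse_italian_date` and `MESI` are hypothetical names, and the sketch assumes lower- or mixed-case Italian month names with no ordinal markers.

```python
import re
from datetime import date

# Italian month names mapped to month numbers.
MESI = {"gennaio": 1, "febbraio": 2, "marzo": 3, "aprile": 4,
        "maggio": 5, "giugno": 6, "luglio": 7, "agosto": 8,
        "settembre": 9, "ottobre": 10, "novembre": 11, "dicembre": 12}


def parse_italian_date(text):
    """Parse strings such as '22 ottobre 2020'; return None if parsing fails."""
    match = re.fullmatch(r"\s*(\d{1,2})\s+([a-z]+)\s+(\d{4})\s*", text.lower())
    if match is None or match.group(2) not in MESI:
        return None
    day, month, year = match.groups()
    try:
        return date(int(year), MESI[month], int(day))
    except ValueError:  # e.g. an impossible day such as 31 febbraio
        return None


print(parse_italian_date("22 ottobre 2020"))
```

Returning `None` for anything that does not match keeps missing values distinguishable from real dates without mixing booleans into the column.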


In [34]:
df.to_csv("deputati.csv")