Use dati.camera.it and dati.senato.it #1

briatte · 2016-01-03T17:50:21Z

http://dati.camera.it/ (on which http://storia.camera.it/ is based)
http://dati.senato.it/ (on which http://www.senato.it/sitostorico/home is probably based)

It is possible to scrape http://storia.camera.it/, but the data are large (~800 MB of HTML files for l. 9-16) and do not include l. 17, so scraping cannot provide a unified method to get more and better data for all legislatures.

SPARQL endpoints

Example Camera query, which returns some missing values and multiple rows per MP because of multiple party affiliations and committee memberships (replace 00 with the legislature number):

SELECT DISTINCT
  ?url ?name ?surname ?born ?sex ?constituency
  ?party ?start ?end ?committee ?photo
WHERE {
  ?url ocd:rif_mandatoCamera ?mandato; a foaf:Person.
  ?d a ocd:deputato; ocd:aderisce ?aderisce;
  ocd:rif_leg <http://dati.camera.it/ocd/legislatura.rdf/repubblica_00>;
  ocd:rif_mandatoCamera ?mandato.
  ?d foaf:firstName ?name; foaf:surname ?surname.
  OPTIONAL { ?d foaf:gender ?sex. }
  OPTIONAL { ?d foaf:depiction ?photo. }
  OPTIONAL {
    ?url <http://purl.org/vocab/bio/0.1/Birth> ?nascita.
    ?nascita <http://purl.org/vocab/bio/0.1/date> ?born.
  }
  OPTIONAL {
    ?mandato ocd:rif_elezione ?elezione.
    ?elezione dc:coverage ?constituency.
  }
  OPTIONAL {
    ?aderisce ocd:startDate ?start.
  }
  OPTIONAL {
    ?aderisce ocd:endDate ?end.
  }
  OPTIONAL {
    ?aderisce ocd:rif_gruppoParlamentare ?gruppo.
    ?gruppo dc:title ?party.
  }
  OPTIONAL {
    ?d ocd:membro ?membro.
    ?membro ocd:rif_organo ?organo.
    ?organo dc:title ?committee.
  }
}

Example Senato query (all senators from l. 9, with the birth date set to 'optional' to avoid filtering out senators with missing data):

PREFIX osr: <http://dati.senato.it/osr/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT DISTINCT ?senatore ?nome ?cognome ?legislatura ?mandato ?dataNascita
WHERE {
    ?senatore a osr:Senatore.
    ?senatore foaf:firstName ?nome.
    ?senatore foaf:lastName ?cognome.
    # the mandate must be tied to the senator, otherwise every senator
    # is paired with every mandate of the legislature
    ?senatore osr:mandato ?mandato.
    ?mandato osr:legislatura ?legislatura.
    OPTIONAL { ?senatore osr:dataNascita ?dataNascita. }
    FILTER(?legislatura=9)
}

Example Camera query to get all bills sponsored by a particular MP (replace 00 with the legislature number and uid with the unique identifier that contains both the MP uid and the legislature; the query is limited to 10,000 signatures, but that should not be an issue):

SELECT DISTINCT
  ?role ?ref ?date ?title
WHERE {
  {
    ?atto ?ruolo ?deputato;
      dc:date ?date;
      dc:identifier ?ref;
      dc:title ?title;
      dc:type ?tipo.
    FILTER(?ruolo = ocd:primo_firmatario)
  }
  UNION {
    ?atto ?ruolo ?deputato;
      dc:date ?date;
      dc:identifier ?ref;
      dc:title ?title;
      dc:type ?tipo.
    FILTER(?ruolo = ocd:altro_firmatario)
  }
  ## filter bills
  FILTER(?tipo = 'Progetto di Legge')
  ?ruolo rdfs:label ?role.
  ## filter sponsor
  ?deputato ocd:rif_leg <http://dati.camera.it/ocd/legislatura.rdf/repubblica_00>.
  ## STR() turns the IRI into a string before the regex is applied
  FILTER(REGEX(STR(?deputato), 'uid', 'i'))
}

All queries can be passed to the endpoint by using httr to set the query element of the request. The results are RDF files that xml2 can parse as if they were HTML.
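For readers working in Python rather than R, the same request-and-parse cycle might look roughly like the sketch below. It is not the code used in this repository. It assumes that the Camera endpoint lives at http://dati.camera.it/sparql, that legislature numbers are zero-padded to two digits, and that the endpoint can return the standard W3C SPARQL Query Results XML format; none of these is confirmed above. The function names are made up for illustration.

```python
import urllib.parse
import xml.etree.ElementTree as ET

# Assumption: location of the Camera endpoint; check the portal before use.
CAMERA_ENDPOINT = "http://dati.camera.it/sparql"

# Namespace of the W3C SPARQL Query Results XML Format.
SPARQL_NS = {"s": "http://www.w3.org/2005/sparql-results#"}


def build_query_url(endpoint, query, legislature):
    """Fill in the legislature placeholder and URL-encode the query."""
    # Assumption: legislature numbers are zero-padded to two digits.
    filled = query.replace("repubblica_00", "repubblica_%02d" % legislature)
    return endpoint + "?" + urllib.parse.urlencode({"query": filled})


def parse_results(xml_text):
    """Turn a SPARQL XML result document into a list of dicts."""
    root = ET.fromstring(xml_text)
    rows = []
    for result in root.findall("s:results/s:result", SPARQL_NS):
        row = {}
        for binding in result.findall("s:binding", SPARQL_NS):
            # Each binding wraps a single <uri>, <literal> or <bnode> child.
            row[binding.get("name")] = binding[0].text
        rows.append(row)
    return rows
```

The URL returned by build_query_url can be fetched with any HTTP client. Variables left unbound by an OPTIONAL clause are simply absent from the corresponding row, so downstream code should use row.get("sex") rather than row["sex"].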

RDF data dumps

Camera

http://dati.camera.it/it/download/atti-e-votazioni.html (bills — note: dump for l. 15 is corrupted)
http://dati.camera.it/it/download/anagrafica.html (sponsors)

The dumps should be parsable with xml2, but coercing the RDF to an HTML structure will create errors that make the dumps unusable without extra software like Apache Jena, which is available from within R only through the Java-dependent rrdf package.

Senato

http://dati.senato.it/DatiSenato/browse/7?testo_generico=12&legislatura=17 (bills, l. 13-17)
http://dati.senato.it/DatiSenato/browse/6?testo_generico=11&legislatura=13 (sponsors, l. 13-17)

The dumps return very few senators for past legislatures (e.g. 7).

briatte · 2016-01-06T20:53:55Z

Done. The only open data resource that worked as expected was the SPARQL endpoint of the Camera.

cschwem2er · 2016-03-21T09:40:56Z

Hi,

sorry to hijack this issue to contact you, but I was not able to find your email address on GitHub. I'm a researcher from the University of Bamberg working on an international research project analysing the representation of citizens of immigrant origin in eight European democracies.
For this we need to extract textual data on speeches and, first of all, written questions in Italy.

I do not speak Italian but was able to find an RDF dataset on the dati.camera.it page, "Atti Camera (leg XVI)", which should contain the content. Can you give me a hint on how to open the RDF file? I'm using Python and cannot parse the file without errors, either with rdflib or with lxml.

Unfortunately they also do not respond to my contact requests.

briatte · 2016-03-21T11:16:34Z

Hi @methodds

Don't worry about the hijacking.

As far as I understand, written questions are likely to be in the Interrogazioni, Interpellanze e mozioni file.

Re: parsing, there are many issues with the RDF dumps of the Camera. For starters, some files are simply not there. And then, there's the parser errors issue that you have also encountered.

My solution, as briefly described in the top post of this issue, has been to get the data from the SPARQL endpoint (not from the dumps), and then to parse them as if they were X(HT)ML, using (a port of) libxml2. This works fine in R, and there are Python bindings for libxml2, so I guess it might also work on your end.

Hope this helps!

cschwem2er · 2016-03-22T16:45:16Z

Thanks a lot for your recommendations!

I started to use the SPARQL endpoint, and parsing the HTML should not be a problem at all once I manage to express the correct query.
However, while I'm very close to getting all I need, there are still some small issues.

I want to get all written questions like this one.

This is my query:

SELECT DISTINCT
  ?role ?name ?contribs ?ref ?date ?title ?question
WHERE {
  {
    ?atto ?ruolo ?deputato;
      dc:date ?date;
      dc:identifier ?ref;
      dc:title ?title;
      dc:type ?type;
      dc:creator ?name;
      dc:contributor ?contribs;
      dc:description ?question.
  }

  ## Only select written questions
  FILTER(REGEX(?type, 'INTERROGAZIONE A RISPOSTA SCRITTA', 'i'))
  ?ruolo rdfs:label ?role.

  ## Legislative period 16
  ?deputato ocd:rif_leg <http://dati.camera.it/ocd/legislatura.rdf/repubblica_16>
}

While this already looks pretty good, I have not yet been able to fix two related problems:

Questions are asked by one primary MP and sometimes, but not always, by additional MPs. I need to get every MP related to a question. With the query above, every question is repeated once for each contributor. Is there a way to change the output so that each question appears only once, with a contributors column that holds multiple names?
In this version I always hit the maximum of 10,000 results because the duplicated questions inflate the row count. Do you know an elegant way of getting around the limit? Of course I could write many, many queries and filter for a single month each time, but maybe there is a better way.

In case you have any idea about these issues any help is very much appreciated :)

briatte · 2016-03-22T18:06:28Z

My guess is that you could use the same strategy as I do in my first post: get the list of all MPs, and then ask for all written questions by this MP as primary author. This will create duplicates, but you can filter these out later on, and it should get you around the 10,000-results limit.
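Filtering out the duplicates afterwards could be done along the lines of the sketch below, which also folds the contributors of each question into a single column. It assumes the results have already been parsed into a list of dicts keyed by the query variables (ref, contribs, and so on); the function name and the "; " separator are arbitrary choices, not something prescribed by the endpoint.

```python
from collections import OrderedDict


def collapse_contributors(rows, key="ref", multi="contribs", sep="; "):
    """Merge rows sharing `key`, joining distinct `multi` values in order."""
    merged = OrderedDict()
    for row in rows:
        ref = row[key]
        if ref not in merged:
            # Keep the first copy of the row and start an empty name list.
            merged[ref] = dict(row)
            merged[ref][multi] = []
        value = row.get(multi)
        # Skip missing contributors and names already seen for this question.
        if value and value not in merged[ref][multi]:
            merged[ref][multi].append(value)
    # Replace each list of names by a single separator-joined string.
    return [dict(r, **{multi: sep.join(r[multi])}) for r in merged.values()]
```

Rows are merged on the identifier only, so the other columns are taken from the first row seen for each question. If the endpoint supports SPARQL 1.1 aggregates, GROUP_CONCAT might do the same job on the server side, but that has not been tested here.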

Another option, which should work just as well, would be to split legislature 16 into several time periods, and then to query each time period.
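The time-period split could be sketched as follows. The helper names are hypothetical, and the filter assumes that dc:date values are strings that sort chronologically in a yyyymmdd layout, which should be checked against actual results before relying on it.

```python
from datetime import date, timedelta


def month_windows(start, end):
    """Yield (first_day, last_day) pairs covering start..end month by month."""
    current = start
    while current <= end:
        # First day of the following month.
        if current.month == 12:
            next_month = date(current.year + 1, 1, 1)
        else:
            next_month = date(current.year, current.month + 1, 1)
        # The last window is cut off at `end`.
        yield current, min(next_month - timedelta(days=1), end)
        current = next_month


def date_filter(first, last, fmt="%Y%m%d"):
    """Build a SPARQL FILTER clause restricting ?date to one window."""
    # Assumption: ?date holds strings that sort chronologically in `fmt`.
    return "FILTER(STR(?date) >= '%s' && STR(?date) <= '%s')" % (
        first.strftime(fmt),
        last.strftime(fmt),
    )
```

Each clause returned by date_filter would be added to the WHERE block of the query, and the query run once per window; if a single month still hits the limit, the window can be made shorter.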

Sorry I cannot submit a SPARQL query to help; I have not studied the docs for written questions. Both strategies above, however, should work.

Viktoriiqa · 2023-10-18T13:55:29Z

Dear Briatte, I was trying to extract data from senato.it using SPARQL, but the code suggested by the site does not include the profession column that is available for camera.it. I was wondering if you have any suggestions on how to extract the profession information as well. Thank you.

briatte added enhancement help wanted labels Jan 3, 2016

briatte self-assigned this Jan 3, 2016

briatte mentioned this issue Jan 3, 2016

Use more open data portals briatte/parlnet#8

Open

3 tasks

briatte closed this as completed Jan 6, 2016
