Use dati.camera.it and dati.senato.it #1

Closed
briatte opened this issue Jan 3, 2016 · 6 comments

Comments

@briatte
Owner

briatte commented Jan 3, 2016

It is possible to scrape http://storia.camera.it/, but the data are pretty large (~800 MB for l. 9-16 as HTML files) and do not include l. 17, so scraping won't provide a unified method to get more and better data from all legislatures.

SPARQL endpoints

Example Camera query, with some missing values and multiple rows caused by multiple party affiliations and committee memberships (replace 00 by the legislature number):

SELECT DISTINCT
  ?url ?name ?surname ?born ?sex ?constituency
  ?party ?start ?end ?committee ?photo
WHERE {
  ?url ocd:rif_mandatoCamera ?mandato; a foaf:Person.
  ?d a ocd:deputato; ocd:aderisce ?aderisce;
  ocd:rif_leg <http://dati.camera.it/ocd/legislatura.rdf/repubblica_00>;
  ocd:rif_mandatoCamera ?mandato.
  ?d foaf:firstName ?name; foaf:surname ?surname.
  OPTIONAL { ?d foaf:gender ?sex. }
  OPTIONAL { ?d foaf:depiction ?photo. }
  OPTIONAL {
    ?url <http://purl.org/vocab/bio/0.1/Birth> ?nascita.
    ?nascita <http://purl.org/vocab/bio/0.1/date> ?born.
  }
  OPTIONAL {
    ?mandato ocd:rif_elezione ?elezione.
    ?elezione dc:coverage ?constituency.
  }
  OPTIONAL {
    ?aderisce ocd:startDate ?start.
  }
  OPTIONAL {
    ?aderisce ocd:endDate ?end.
  }
  OPTIONAL {
    ?aderisce ocd:rif_gruppoParlamentare ?gruppo.
    ?gruppo dc:title ?party.
  }
  OPTIONAL {
    ?d ocd:membro ?membro.
    ?membro ocd:rif_organo ?organo.
    ?organo dc:title ?committee.
  }
}

Example Senato query (all senators from l. 9; the birth date is left 'optional' to avoid filtering out senators with missing data):

PREFIX osr: <http://dati.senato.it/osr/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT DISTINCT ?senatore ?nome ?cognome ?legislatura ?mandato ?dataNascita
WHERE {
    ?senatore a osr:Senatore.
    ?senatore foaf:firstName ?nome.
    ?senatore foaf:lastName ?cognome.
    # The mandate link has to be a required pattern: as an OPTIONAL, every
    # senator would be paired with every mandate of the legislature.
    ?senatore osr:mandato ?mandato.
    ?mandato osr:legislatura ?legislatura.
    OPTIONAL { ?senatore osr:dataNascita ?dataNascita. }
    FILTER(?legislatura = 9)
}

Example Camera query to get all bills sponsored by a particular MP (replace 00 by legislature and uid by the unique identifier that contains both the MP uid and the legislature; the query is limited to 10,000 signatures but that should not be an issue):

SELECT DISTINCT
  ?role ?ref ?date ?title
WHERE {
  {
    ?atto ?ruolo ?deputato;
      dc:date ?date;
      dc:identifier ?ref;
      dc:title ?title;
      dc:type ?tipo.
    FILTER(?ruolo = ocd:primo_firmatario)
  }
  UNION {
    ?atto ?ruolo ?deputato;
      dc:date ?date;
      dc:identifier ?ref;
      dc:title ?title;
      dc:type ?tipo.
    FILTER(?ruolo = ocd:altro_firmatario)
  }
  ## filter bills
  FILTER(?tipo = 'Progetto di Legge')
  ?ruolo rdfs:label ?role.
  ## filter sponsor (STR() turns the URI into a string before matching)
  ?deputato ocd:rif_leg <http://dati.camera.it/ocd/legislatura.rdf/repubblica_00>.
  FILTER(REGEX(STR(?deputato), 'uid', 'i'))
}

All queries can be sent to the endpoint with httr, by passing the query text as the query element of the request. The results are RDF files that xml2 can parse as if they were HTML.
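For readers working in Python rather than R, the same two steps (send the query text as the query element, then parse the XML that comes back) might look roughly like the sketch below. It is only an illustration under several assumptions that are not taken from this thread: that the Camera endpoint lives at http://dati.camera.it/sparql, that it accepts a `format` parameter, and that it can return the standard SPARQL XML results format. The helper names are made up for the example.

```python
import urllib.parse
import xml.etree.ElementTree as ET

# Assumption: endpoint URL of the Camera SPARQL service.
CAMERA_ENDPOINT = "http://dati.camera.it/sparql"

# Namespace of the W3C SPARQL Query Results XML Format.
SPARQL_NS = {"sr": "http://www.w3.org/2005/sparql-results#"}


def build_query_url(endpoint, query, legislature):
    """Fill in the legislature number and URL-encode the query (hypothetical helper)."""
    # The example queries use 'repubblica_00' as a placeholder.
    filled = query.replace("repubblica_00", "repubblica_" + str(legislature))
    # Assumption: the server honours a 'format' parameter for the result type.
    params = {"query": filled, "format": "application/sparql-results+xml"}
    return endpoint + "?" + urllib.parse.urlencode(params)


def parse_results(xml_text):
    """Turn SPARQL XML results into a list of {variable: value} dicts."""
    root = ET.fromstring(xml_text)
    rows = []
    for result in root.findall("sr:results/sr:result", SPARQL_NS):
        row = {}
        for binding in result.findall("sr:binding", SPARQL_NS):
            # Each binding wraps a single <uri>, <literal> or <bnode> element.
            if len(binding):
                row[binding.get("name")] = binding[0].text
        rows.append(row)
    return rows
```

The URL can then be fetched with any HTTP client (for instance `urllib.request.urlopen(url).read()`), and the response body handed to `parse_results`. Variables left unbound by an OPTIONAL clause are simply absent from the row.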

RDF data dumps

Camera

The dumps should be parsable with xml2, but coercing the RDF to an HTML structure will create errors that make the dumps unusable without extra software like Apache Jena, which is available from within R only through the Java-dependent rrdf package.

Senato

The dumps return very few senators on past legislatures (e.g. 7).

@briatte
Owner Author

briatte commented Jan 6, 2016

Done. The only open data resource that worked as expected was the SPARQL endpoint of the Camera.

@briatte briatte closed this as completed Jan 6, 2016
@cschwem2er

Hi,

sorry to hijack this issue to contact you, but I was not able to find your email address on GitHub. I'm a researcher from the University of Bamberg working on an international research project analysing the representation of citizens of immigrant origin in eight European democracies.
For this we need to extract textual data on speeches and, first of all, written questions in Italy.

I do not speak Italian but was able to find an RDF dataset on the dati.camera.it page, "Atti Camera (leg XVI)", which should contain the content. Can you give me a hint on how to open the RDF file? I'm using Python and cannot parse the file without errors, neither with rdflib nor with lxml.

Unfortunately they also do not respond to my contact requests.

@briatte
Owner Author

briatte commented Mar 21, 2016

Hi @methodds

Don't worry about the hijacking.

As far as I understand, written questions are likely to be in the Interrogazioni, Interpellanze e mozioni file.

Re: parsing, there are many issues with the RDF dumps of the Camera. For starters, some files are simply not there. And then, there's the parser errors issue that you have also encountered.

My solution, as briefly described in the top post of this issue, has been to get the data from the SPARQL endpoint (not from the dumps), and then to parse them as if they were X(HT)ML, using (a port of) libxml2. This works fine in R, and there are Python bindings for libxml2, so I guess it might also work on your end.

Hope this helps!

@cschwem2er

Thanks a lot for your recommendations!

I started to use the SPARQL endpoint, and parsing the HTML should not be a problem at all once I manage to write the correct query.
However, while I'm very close to getting all I need, there are still some small issues.

I want to get all written questions like this one.

This is my query:

SELECT DISTINCT
  ?role ?name ?contribs ?ref ?date ?title ?question
WHERE {
  {
    ?atto ?ruolo ?deputato;
      dc:date ?date;
      dc:identifier ?ref;
      dc:title ?title;
      dc:type ?type;
      dc:creator ?name;
      dc:contributor ?contribs;
      dc:description ?question.
  }

  ## Select only written questions
  FILTER(REGEX(?type, 'INTERROGAZIONE A RISPOSTA SCRITTA', 'i'))
  ?ruolo rdfs:label ?role.

  ## Legislative period 16
  ?deputato ocd:rif_leg <http://dati.camera.it/ocd/legislatura.rdf/repubblica_16>.
}

While this already looks pretty good, I have not yet been able to fix two related problems:

  • Questions are asked by one primary MP and sometimes, but not always, by additional MPs. I need to get every related MP for a question. With the query above, every question is repeated once per contributor. Is there a way to change the output so that each question appears only once, with a contributors column that holds multiple names?
  • In this version I always hit the maximum of 10,000 results because the duplicated questions inflate the numbers. Do you know an elegant way of getting around the limit? Of course I could write maaaany queries and filter for single months each time, but maybe there is a better way.

In case you have any idea about these issues any help is very much appreciated :)

@briatte
Owner Author

briatte commented Mar 22, 2016

My guess is that you could use the same strategy as I do in my first post: get the list of all MPs, and then ask for all written questions by this MP as primary author. This will create duplicates, but you can filter these out later on, and it should get you around the 10,000-results limit.
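The "filter these out later on" step could be sketched in Python along these lines. It also folds the one-row-per-contributor output into a single record per question. This is a rough illustration, not a tested pipeline: it assumes the parsed rows are dicts keyed by the variable names of the query above, and that the dc:identifier value (`ref`) identifies a question uniquely. The function name is made up for the example.

```python
def merge_questions(rows):
    """Collapse duplicated rows into one record per question (hypothetical helper)."""
    merged = {}
    for row in rows:
        # Assumption: 'ref' (dc:identifier) is unique per written question.
        record = merged.setdefault(row["ref"], {
            "ref": row["ref"],
            "date": row.get("date"),
            "title": row.get("title"),
            "name": row.get("name"),
            "contribs": [],
        })
        contributor = row.get("contribs")
        # Keep each contributor once, in order of first appearance.
        if contributor and contributor not in record["contribs"]:
            record["contribs"].append(contributor)
    return list(merged.values())
```

Because the merge is keyed on the identifier, it does not matter whether the duplicates come from multiple contributors within one query or from the same question being returned by several per-MP queries.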

Another option, which should work just as well, would be to split legislature 16 into several time periods, and then to query each time period.
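A rough Python sketch of the time-period strategy: generate one window per month, and build a FILTER clause to paste into the query for each window. Two things here are assumptions rather than facts from this thread: that dc:date values are 'YYYYMMDD' strings (so that plain string comparison orders them chronologically), and the helper names themselves.

```python
from datetime import date


def month_windows(start, end):
    """Yield (first_day, first_day_of_next_month) pairs covering start..end."""
    current = date(start.year, start.month, 1)
    while current <= end:
        if current.month == 12:
            following = date(current.year + 1, 1, 1)
        else:
            following = date(current.year, current.month + 1, 1)
        yield current, following
        current = following


def date_filter(window_start, window_end):
    """Build a SPARQL FILTER for one half-open window [start, end)."""
    # Assumption: ?date holds 'YYYYMMDD' strings, compared as text.
    return "FILTER(STR(?date) >= '%s' && STR(?date) < '%s')" % (
        window_start.strftime("%Y%m%d"),
        window_end.strftime("%Y%m%d"),
    )
```

Each generated clause would be added to the WHERE block of a copy of the query, and the results of all copies concatenated. The start and end dates of legislature 16 have to be supplied by the caller; if a single month still exceeds the limit, the same idea works with shorter windows.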

Sorry I cannot submit a SPARQL query to help, I have not studied the docs for written questions. Both strategies above, however, should work.

@Viktoriiqa

Dear Briatte, I was trying to extract data from senato.it using SPARQL, but the code suggested by the site does not include a profession column, unlike camera.it. I was wondering if you have any suggestions on how to extract the profession information as well. Thank you.
