Use dati.camera.it and dati.senato.it #1
Comments
Done. The only open data resource that worked as expected was the SPARQL endpoint of the Camera.
Hi, sorry to hijack this issue to contact you, but I was not able to find your email address on GitHub. I'm a researcher from the University of Bamberg working on an international research project analysing the representation of citizens of immigrant origin in eight European democracies. I do not speak Italian but was able to find an RDF dataset on the dati.camera.it page, "Atti Camera (leg XVI)", which should contain the content. Can you give me a hint on how to open the RDF file? I'm using Python and cannot parse the file without errors, neither with [...]. Unfortunately, they also do not respond to my contact requests.
Hi @methodds, don't worry about the hijacking. As far as I understand, written questions are likely to be in the Interrogazioni, Interpellanze e mozioni file. Re: parsing, there are many issues with the RDF dumps of the Camera. For starters, some files are simply not there. Then there is the parser errors issue that you have also encountered. My solution, as briefly described in the top post of this issue, has been to get the data from the SPARQL endpoint (not from the dumps), and then to parse them as if they were X(HT)ML, using (a port of) [...]. Hope this helps!
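For readers working in Python, the endpoint-then-XML approach described above might look roughly like the following. This is a minimal sketch, not the code used in this repository: the endpoint URL, the `Accept` header, and the helper names `run_query` and `parse_results` are assumptions, and it presumes the endpoint answers SELECT queries in the W3C SPARQL Query Results XML format.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# Assumed endpoint URL; the thread does not spell it out.
ENDPOINT = "http://dati.camera.it/sparql"


def local(tag):
    """Strip the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def parse_results(xml_data):
    """Turn a SPARQL XML result document into a list of {variable: value} dicts.

    The document is treated as plain XML and matched on local tag names,
    so no RDF-specific parser is needed.
    """
    root = ET.fromstring(xml_data)
    rows = []
    for result in root.iter():
        if local(result.tag) != "result":
            continue
        row = {}
        for binding in result:
            if local(binding.tag) != "binding":
                continue
            # Each binding wraps one <uri>, <literal> or <bnode> element.
            value = next(iter(binding), None)
            row[binding.get("name")] = (value.text or "") if value is not None else ""
        rows.append(row)
    return rows


def run_query(query, endpoint=ENDPOINT):
    """Send a query as the 'query' URL parameter and parse the XML response."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    request = urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+xml"}
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        return parse_results(response.read())
```

Variables that are unbound in a given row are simply absent from that row's dict, so downstream code should use `row.get("name")` rather than `row["name"]` when a field may be missing.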
Thanks a lot for your recommendations! I have started using the SPARQL endpoint, and parsing the HTML should not be a problem at all once I manage to write the correct query. I want to get all written questions like this one. This is my query:

    SELECT DISTINCT
      ?role ?name ?contribs ?ref ?date ?title ?question
    WHERE {
      {
        ?atto ?ruolo ?deputato;
              dc:date ?date;
              dc:identifier ?ref;
              dc:title ?title;
              dc:type ?type;
              dc:creator ?name;
              dc:contributor ?contribs;
              dc:description ?question.
      }
      ## Select only written questions
      FILTER(REGEX(?type, 'INTERROGAZIONE A RISPOSTA SCRITTA', 'i'))
      ?ruolo rdfs:label ?role.
      ## Legislative period 16
      ?deputato ocd:rif_leg <http://dati.camera.it/ocd/legislatura.rdf/repubblica_16>
    }

While this already looks pretty good, I'm not yet able to fix two related problems:
If you have any ideas about these issues, any help is very much appreciated :)
My guess is that you could use the same strategy as I do in my first post: get the list of all MPs, and then ask for all written questions by this MP as primary author. This will create duplicates, but you can filter these out later on, and it should get you around the 10,000-results limit. Another option, which should work just as well, would be to split legislature 16 into several time periods, and then to query each time period. Sorry, I cannot submit a SPARQL query to help, as I have not studied the docs for written questions. Both strategies above, however, should work.
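The time-period strategy can be sketched in Python as follows. Everything here is illustrative rather than tested against the endpoint: the helper names, the monthly window size, and the assumption that `dc:date` values compare correctly as `YYYYMMDD` strings are all hypothetical, and the query template reuses the unprefixed `dc:` terms from the query above.

```python
from datetime import date

# Hypothetical template; the doubled braces survive str.format() as single braces.
TEMPLATE = """
SELECT DISTINCT ?ref ?date ?title WHERE {{
  ?atto dc:identifier ?ref; dc:date ?date; dc:title ?title; dc:type ?type.
  FILTER(REGEX(?type, 'INTERROGAZIONE A RISPOSTA SCRITTA', 'i'))
  FILTER(STR(?date) >= '{lo}' && STR(?date) < '{hi}')
}}
"""


def month_windows(start, end):
    """Yield (first_of_month, first_of_next_month) pairs covering start..end."""
    current = date(start.year, start.month, 1)
    while current < end:
        if current.month == 12:
            following = date(current.year + 1, 1, 1)
        else:
            following = date(current.year, current.month + 1, 1)
        yield current, following
        current = following


def windowed_query(template, lo, hi):
    """Fill the template with half-open date bounds formatted as YYYYMMDD."""
    return template.format(lo=lo.strftime("%Y%m%d"), hi=hi.strftime("%Y%m%d"))


def deduplicate(rows, key="ref"):
    """Keep the first row seen for each value of `key`, preserving order."""
    seen = set()
    unique = []
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            unique.append(row)
    return unique


# Illustrative bounds only; adjust to the actual dates of the legislature.
queries = [
    windowed_query(TEMPLATE, lo, hi)
    for lo, hi in month_windows(date(2008, 4, 1), date(2013, 4, 1))
]
```

Each query in `queries` should return far fewer than 10,000 rows if the monthly windows are small enough. The `deduplicate` helper is mainly useful for the per-MP strategy, where the same question can come back once for every MP query that matches it.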
Dear Briatte, I was trying to extract data from senato.it using SPARQL, but the code suggested by the site does not include the profession column, as the one for camera.it does. I was wondering if you have any suggestions on how to extract the profession information as well. Thank you.
It is possible to scrape http://storia.camera.it/, but the data are pretty large (~ 800MB for l. 9-16 as HTML files) and do not include l. 17, so that won't provide a unified method to get more and better data from all possible legislatures.
SPARQL endpoints
Example Camera query, with some missing values and multiple rows caused by multiple party affiliations and committee memberships (replace 00 by the legislature number):
Example Senato query (all senators from l. 9, with many fields set to 'optional' to avoid filtering out senators with missing data):
Example Camera query to get all bills sponsored by a particular MP (replace 00 by the legislature and uid by the unique identifier that contains both the MP uid and the legislature; the query is limited to 10,000 signatures, but that should not be an issue):
All queries can be passed to the endpoint with httr to set the query element. The results are RDF files that can be parsed by xml2 as if they were HTML.
RDF data dumps
Camera
The dumps should be parsable with xml2, but coercing the RDF to an HTML structure will create errors that make the dumps unusable without extra software like Apache Jena, which is available from within R only through the Java-dependent rrdf package.
Senato
The dumps return very few senators on past legislatures (e.g. 7).