Thanks @FOIMonkey !
Compare with: http://voixlibres.blogspot.co.uk/2014/02/blog-post_20.html
Hi guys. I'm able to scrape the page, but not the Google-translated one: it only has about 270 nodes, as opposed to 660 on the original. I guess that's because the translated page isn't fully loaded before it's being scraped. Do you just use Google Translate and configure your scraper to wait, or do you use another method?
I've been thinking about this and it would seem to make as much sense to make a browser extension so that a native language speaker could specify a schema for the page, according to the data you want. So first you'd select a politician's name, then any other information for that politician (e.g. gender), then the whole set of politicians. Then it would scrape the screen and ask if it got it right, and then the scraper would be set to carry on as usual. No need then for translation, except for any subsequent use by non-natives.
Hi @willnwhite -- thanks for this, but I might be able to make things simpler for you: it's best if you scrape the data in its native form (we do appreciate fields like gender being converted to male or female, but that's more a data translation than a linguistic one).
We don't need the names transliterated into English from the page, and perhaps more to the point, Google Translate, although often astonishingly good with semantics, can be very unreliable with names. So instead we'd prefer to receive the Arabic names, and subsequently get any transliterations if and when another source can provide them. It's possible we'd get some or all of these automatically from Wikidata (I haven't checked Sahrawi Arab Democratic Republic explicitly), because we already have this mechanism for transliterations in place on EveryPolitician, which does work a little like magic sometimes :-)
So please don't stall on the Google translation because I don't think you need it :-) I hope this helps!
Of course! I was confusing my needing Google Translate (to see which bits are the names) with the scraper needing to translate, which it doesn't. Thanks.
@willnwhite no problem -- also it really isn't obvious that the transliterations might come from another place entirely :-)
ID and name data at https://morph.io/willnwhite/same. It's not cleansed and I need to add the other fields.
https://github.com/willnwhite/Every-Politician-Sahrawi-Arab-Democratic-Republic/blob/master/same.csv Cleansed and added state.
I guess this doesn't need to be re-scraped as it's a blog post of an election result, not a live reference. Is that correct?
Hi @willnwhite — thanks for this! Yes, a one-off scrape of this is all that's needed, as the information on this page shouldn't change. It would be good if we could find a source that's kept up to date with changes; but in the absence of that, this is certainly better than nothing.
I think there's still a little more tidying required, though. Some of the names of people and/or areas still have leading punctuation ('-' or '*'), and the people in the final two sections have their names combined with the type of membership (e.g. which organisation they represent).
Is this actually running on morph.io? Our workflow is a little easier if we can just point at a scraper there.
@tmtmtmtm I've taken the '-'s and '*'s out (I wasn't sure whether they really were just punctuation or not), and put the organisation as group where applicable. There are two lines with incorrect brackets, and I can't fix them even by copying and pasting the original string from the source. I think it's to do with the right-to-left text.
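For anyone else tidying this source, here's a minimal sketch of the two fixes discussed above, assuming Python (the helper names `clean_name` and `isolate_rtl` are mine, not from the scraper). The bracket point is an assumption on my part: in right-to-left text the Unicode bidi algorithm mirrors parentheses on display, so the underlying characters can be correct even when they look flipped.

```python
def clean_name(raw):
    """Strip leading list markers ('-', '*') and surrounding whitespace,
    including non-breaking spaces, from a scraped name."""
    return raw.lstrip("-* \u00a0").strip()

def isolate_rtl(text):
    """Wrap text in RIGHT-TO-LEFT ISOLATE ... POP DIRECTIONAL ISOLATE.
    This keeps mixed Arabic/Latin strings displaying consistently
    without changing the characters the scraper stores."""
    return "\u2067" + text + "\u2069"

print(clean_name("- Example Name"))  # leading marker removed
```

So the "incorrect brackets" may not need fixing in the data at all; it may just be how the bidi algorithm renders them.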
The scraper running on morph gets the lines and makes the UUIDs only. I've done the work by hand from there.
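If the by-hand step ever has to be redone, one way to avoid ID churn between runs is to derive the UUIDs deterministically from the scraped lines with `uuid5` instead of generating random ones. This is only a sketch of that idea, not how the scraper actually works; the namespace choice below is illustrative.

```python
import uuid

# A fixed namespace makes the IDs reproducible: the same input line
# always yields the same UUID across runs. Deriving the namespace from
# the source URL is just one reasonable choice.
NAMESPACE = uuid.uuid5(
    uuid.NAMESPACE_URL,
    "http://voixlibres.blogspot.co.uk/2014/02/blog-post_20.html",
)

def line_id(line):
    """Derive a stable UUID (as a string) from a scraped line of text."""
    return str(uuid.uuid5(NAMESPACE, line.strip()))
```

With this, re-running the scrape against the same blog post reproduces the same IDs, so hand-edited rows stay matched up.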
Suggest adding label 3 - WIP to this issue, just to make http://everypolitician.org/needed.html show the same number of countries as http://everypolitician.org/countries.html :)
3 - WIP
It looks like the GitHub code for this scraper disappeared :( Morph.io thinks it was here:
What happened, @willnwhite? Where did you get to on this one? Is the code still available somewhere?
I've restored that repo from my machine now. Will leave it up and rename it from "same".