Sahrawi Arab Democratic Republic #670

Open
tmtmtmtm opened this Issue Sep 3, 2015 · 16 comments

@tmtmtmtm
Member

Thanks @FOIMonkey !

@tmtmtmtm tmtmtmtm added To Scrape and removed To Find labels Sep 13, 2015
@tmtmtmtm tmtmtmtm self-assigned this Oct 12, 2015
@tmtmtmtm tmtmtmtm added 3 - WIP and removed To Scrape labels Oct 12, 2015
@tmtmtmtm tmtmtmtm removed their assignment Oct 26, 2015
@tmtmtmtm tmtmtmtm added To Scrape and removed 3 - WIP labels Apr 28, 2016
@willnwhite

Hi guys. I'm able to scrape the original page but not the Google-translated one: the translated version yields only about 270 nodes, as opposed to 660 on the original. I guess that's because the translated page hasn't finished loading when it's scraped. Do you just use Google Translate and configure your scraper to wait, or do you use another method?

@willnwhite
willnwhite commented May 22, 2016 edited

I've been thinking about this and it would seem to make as much sense to make a browser extension so that a native language speaker could specify a schema for the page, according to the data you want. So first you'd select a politician's name, then any other information for that politician (e.g. gender), then the whole set of politicians. Then it would scrape the screen and ask if it got it right, and then the scraper would be set to carry on as usual. No need then for translation, except for any subsequent use by non-natives.

@davewhiteland
Contributor

Hi @willnwhite -- thanks for this, but I might be able to make things simpler for you: it's best if you scrape the data in its native form (although we appreciate fields like gender being converted into male or female, that's more a data translation than a linguistic one).

We don't need the names transliterated into English from the page and, perhaps more to the point, Google Translate, although often astonishingly good with semantics, can be very unreliable with names. So instead, we'd prefer to receive the Arabic names, and subsequently get any transliterations if and when another source can provide them. It's possible we'd get some or all of these automatically (I haven't checked Sahrawi Arab Democratic Republic explicitly) from Wikidata, because we already have this mechanism for transliterations in place on EveryPolitician, which does work a little like magic sometimes :-)

So please don't stall on the Google translation because I don't think you need it :-) I hope this helps!

@willnwhite

Of course! I was confusing me needing Google Translate, to see which bits are the names, with the scraper needing to translate, which it doesn't. Thanks.

@davewhiteland
Contributor

@willnwhite no problem -- also it really isn't obvious that the transliterations might come from another place entirely :-)

@willnwhite

ID and name data at https://morph.io/willnwhite/same. It's not cleansed and I need to add the other fields.

@willnwhite

I guess this doesn't need to be re-scraped as it's a blog post of an election result, not a live reference. Is that correct?

@tmtmtmtm
Member

Hi @willnwhite, thanks for this! Yes, a one-off scrape of this is all that's needed, as the information on this page shouldn't change. It would be good if we could find a source that's kept up to date with changes, but in the absence of that, this is certainly better than nothing.

I think there's still a little more tidying required, though. Some of the names of people and/or areas still have the leading punctuation ('-' or '*'), and the people in the final two sections have their names combined with the type of membership (e.g. which organisation they represent).

Is this actually running on morph.io? Our workflow is a little easier if we can just point at a scraper there.
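
A minimal sketch of the tidying described above, in Python. It assumes each scraped line is a plain string, that the leading marker is a '-' or '*' with optional whitespace around it, and that the membership type in the final two sections sits in trailing parentheses. The real layout of the source page may differ, and `clean_line` is a hypothetical helper, not part of the existing scraper.

```python
import re

# Leading list markers ('-' or '*') plus any surrounding whitespace.
LEADING_MARKER = re.compile(r"^\s*[-*]+\s*")
# ASSUMPTION: the organisation follows the name in trailing parentheses.
TRAILING_GROUP = re.compile(r"^(?P<name>.*?)\s*\((?P<group>[^()]*)\)\s*$")


def clean_line(line):
    """Return {'name': ..., 'group': ...} for one scraped line (hypothetical helper)."""
    text = LEADING_MARKER.sub("", line).strip()
    match = TRAILING_GROUP.match(text)
    if match:
        # Split the combined "name (organisation)" form into two fields.
        return {
            "name": match.group("name").strip(),
            "group": match.group("group").strip(),
        }
    # No trailing parentheses: keep the whole text as the name.
    return {"name": text, "group": None}
```

Only a marker at the very start of the line is stripped, so a hyphen inside a name is left alone.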

@willnwhite
willnwhite commented May 31, 2016 edited

@tmtmtmtm I've taken '-' and '*'s out (wasn't sure if they really were just punctuation or not), and put the organisation as group where applicable. There are two lines with incorrect brackets and I can't fix them even by copying and pasting the original string from the source. I think it's to do with the right-to-left text.

The scraper running on morph gets the lines and makes the UUIDs only. I've done the work by hand from there.
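
For illustration only: one way a scraper could attach IDs to the scraped lines is with name-based UUIDs, so that re-running the scrape gives the same ID for the same line. How the scraper in this thread actually builds its UUIDs is not shown here; the namespace URL and the `make_id` helper below are assumptions.

```python
import uuid

# ASSUMPTION: a hypothetical namespace for this one-off scrape; any fixed UUID would do.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.org/sadr-scrape")


def make_id(line):
    """Derive a deterministic version-5 UUID from the text of a scraped line."""
    return str(uuid.uuid5(NAMESPACE, line.strip()))
```

A random `uuid.uuid4()` would also work for a one-off scrape, but the IDs would then change on every run.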

@tmtmtmtm tmtmtmtm added To Merge and removed To Scrape labels Aug 2, 2016
@andylolz
Contributor
andylolz commented Jan 4, 2017

Suggest adding label 3 - WIP to this issue, just to make http://everypolitician.org/needed.html show the same number of countries as http://everypolitician.org/countries.html :)

@andylolz
Contributor
andylolz commented Jan 4, 2017 edited

It looks like the GitHub code for this scraper disappeared :( Morph.io thinks it was here:
https://github.com/willnwhite/same

What happened, @willnwhite? Where did you get to on this one? Is the code still available somewhere?

@willnwhite

I've restored that repo from my machine now. Will leave it up and rename it from "same".

@tmtmtmtm tmtmtmtm added the 3 - WIP label Jan 5, 2017