Sahrawi Arab Democratic Republic #670

Open
tmtmtmtm opened this Issue Sep 3, 2015 · 17 comments

Member

tmtmtmtm commented Sep 3, 2015

FOIMonkey commented Sep 13, 2015

Member

tmtmtmtm commented Sep 13, 2015

Thanks @FOIMonkey !

@tmtmtmtm tmtmtmtm added To Scrape and removed To Find labels Sep 13, 2015

@tmtmtmtm tmtmtmtm self-assigned this Oct 12, 2015

@tmtmtmtm tmtmtmtm added 3 - WIP and removed To Scrape labels Oct 12, 2015

Member

tmtmtmtm commented Oct 25, 2015

@tmtmtmtm tmtmtmtm removed their assignment Oct 26, 2015

@tmtmtmtm tmtmtmtm added To Scrape and removed 3 - WIP labels Apr 28, 2016

willnwhite commented May 21, 2016

Hi guys. I'm able to scrape the page, but not the Google-translated version: that one yields only about 270 nodes, as opposed to 660 on the original. I guess it's because the translated page hasn't finished loading when it's scraped. Do you just use Google Translate and configure your scraper to wait, or do you use another method?

willnwhite commented May 22, 2016

I've been thinking about this and it would seem to make as much sense to make a browser extension so that a native language speaker could specify a schema for the page, according to the data you want. So first you'd select a politician's name, then any other information for that politician (e.g. gender), then the whole set of politicians. Then it would scrape the screen and ask if it got it right, and then the scraper would be set to carry on as usual. No need then for translation, except for any subsequent use by non-natives.

Contributor

davewhiteland commented May 23, 2016

Hi @willnwhite -- thanks for this but I might be able to make things simpler for you: it's best if you scrape the data in its native form (although we appreciate fields like gender being converted into male or female but that's more a data translation than a linguistic one).

We don't need the names transliterated into English from the page and, perhaps more to the point, Google Translate, although often astonishingly good with semantics, can be very unreliable with names. So instead, we'd prefer to receive the Arabic names, and subsequently get any transliterations if and when another source can provide them. It's possible we'd get some or all of these automatically (I haven't checked Sahrawi Arab Democratic Republic explicitly) from Wikidata, because we already have this mechanism for transliterations in place on EveryPolitician. It does work a little like magic sometimes :-)

So please don't stall on the Google translation because I don't think you need it :-) I hope this helps!
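
As a rough illustration of the "data translation" point above, here is a minimal Python sketch that leaves names in their native form and only maps a gender label to male or female. The Arabic labels in `GENDER_MAP`, the function names, and the record layout are assumptions for illustration; nothing in this thread says the source page carries a gender field or what the scraper looks like.

```python
# Hypothetical mapping from native-language gender labels to the two values
# mentioned above. The Arabic keys are illustrative assumptions about what a
# source page might contain; extend them to match the real data.
GENDER_MAP = {
    "ذكر": "male",
    "أنثى": "female",
    "male": "male",
    "female": "female",
}


def normalise_gender(raw):
    """Return 'male' or 'female', or None when the label is missing or unknown."""
    if raw is None:
        return None
    return GENDER_MAP.get(raw.strip().lower())


def build_record(name, raw_gender=None):
    """Keep the name exactly as scraped; only the gender field is converted."""
    return {"name": name.strip(), "gender": normalise_gender(raw_gender)}
```

Unknown labels come back as `None` rather than a guess, so they can be reviewed by hand later.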

willnwhite commented May 23, 2016

Of course! I was confusing me needing Google Translate, to see which bits are the names, with the scraper needing to translate, which it doesn't. Thanks.

Contributor

davewhiteland commented May 23, 2016

@willnwhite no problem -- also it really isn't obvious that the transliterations might come from another place entirely :-)

willnwhite commented May 24, 2016

ID and name data at https://morph.io/willnwhite/same. It's not cleansed and I need to add the other fields.

willnwhite commented May 31, 2016

I guess this doesn't need to be re-scraped as it's a blog post of an election result, not a live reference. Is that correct?

Member

tmtmtmtm commented May 31, 2016

Hi @willnwhite, thanks for this! Yes, a one-off scrape of this is all that's needed, as the information on this page shouldn't change. It would be good if we could find a source that's kept up to date with changes; but in the absence of that, this is certainly better than nothing.

I think there's still a little more tidying required, though. Some of the names of people and/or areas still have the leading punctuation ('-' or '*'), and the people in the final two sections have their names combined with the type of membership (e.g. which organisation they represent).

Is this actually running on morph.io? Our workflow is a little easier if we can just point at a scraper there.
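
A minimal sketch of the leading-punctuation tidy-up mentioned above, in Python. The function name and the regular expression are assumptions for illustration, not code from the actual scraper. It only touches markers at the logical start of the string, so hyphens inside a name are left alone.

```python
import re

# One or more leading list markers ('-' or '*'), with any whitespace around
# them. Anchored at the start so punctuation inside a name is untouched.
LEADING_MARKER = re.compile(r"^\s*[-*]+\s*")


def strip_leading_marker(text):
    """Remove a leading '-' or '*' marker and surrounding whitespace."""
    return LEADING_MARKER.sub("", text).strip()
```

For right-to-left text the "leading" character is the first one in the stored string, even though it is displayed on the right.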

willnwhite commented May 31, 2016

@tmtmtmtm I've taken '-' and '*'s out (wasn't sure if they really were just punctuation or not), and put the organisation as group where applicable. There are two lines with incorrect brackets and I can't fix them even by copying and pasting the original string from the source. I think it's to do with the right-to-left text.

The scraper running on morph gets the lines and makes the UUIDs only. I've done the work by hand from there.
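
For anyone repeating the manual steps, a hedged Python sketch of how the name/group split and the ID assignment might be automated. It assumes the organisation appears in trailing parentheses and that random version-4 UUIDs are acceptable; the thread confirms neither, and all names below are hypothetical. On the bracket problem: parentheses are mirrored characters under the Unicode bidirectional algorithm, so in right-to-left text a "(" can be drawn as ")". A string that looks wrong on screen may still be stored in the correct logical order, which `has_balanced_brackets` checks.

```python
import re
import uuid

# A trailing "(...)" at the end of the line in logical (stored) order.
TRAILING_GROUP = re.compile(r"^(?P<name>.*?)\s*\((?P<group>[^()]*)\)\s*$")


def split_name_and_group(line):
    """Split 'name (group)' into its parts; group is None if no brackets."""
    match = TRAILING_GROUP.match(line.strip())
    if match:
        return match.group("name").strip(), match.group("group").strip()
    return line.strip(), None


def has_balanced_brackets(text):
    """Check bracket pairing in stored order, independent of how it renders."""
    depth = 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0


def make_record(line):
    """Build one row with a random UUID (version 4 is an assumption)."""
    name, group = split_name_and_group(line)
    return {"id": str(uuid.uuid4()), "name": name, "group": group}
```

Lines that fail `has_balanced_brackets` are the ones worth inspecting character by character rather than by eye.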

@tmtmtmtm tmtmtmtm added To Merge and removed To Scrape labels Aug 2, 2016

Contributor

andylolz commented Jan 4, 2017

Suggest adding label 3 - WIP to this issue, just to make http://everypolitician.org/needed.html show the same number of countries as http://everypolitician.org/countries.html :)

Contributor

andylolz commented Jan 4, 2017

It looks like the GitHub code for this scraper disappeared :( Morph.io thinks it was here:
https://github.com/willnwhite/same

What happened, @willnwhite? Where did you get to on this one? Is the code still available somewhere?

willnwhite commented Jan 4, 2017

I've restored that repo from my machine now. Will leave it up and rename it from "same".

@tmtmtmtm tmtmtmtm added the 3 - WIP label Jan 5, 2017

Contributor

andylolz commented Jan 31, 2017

Great – thanks @willnwhite!
