Bulgaria #451

Closed
briatte opened this issue Jul 30, 2015 · 3 comments

@briatte

briatte commented Jul 30, 2015

Here's my own scraper.

Links to MP pages are of the form

http://www.parliament.bg/bg/MP/222

The index is at

http://www.parliament.bg/bg/MP/

@tmtmtmtm
Contributor

Also in English from http://www.parliament.bg/en/MP/

@tmtmtmtm
Contributor

And in XML at http://www.parliament.bg/export.php/en/xml/MP/1000 (etc)
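The three URL forms mentioned so far can be collected in one small helper. This is only a sketch: `mp_urls` is a hypothetical name, and it assumes (the thread does not confirm this) that the English per-MP pages follow the same `/en/MP/<id>` pattern as the Bulgarian ones and that the XML export uses the same numeric id.

```python
BASE = "http://www.parliament.bg"


def mp_urls(mp_id):
    """Build the per-MP URLs discussed in this thread for one numeric id.

    Assumption: the English page mirrors the Bulgarian URL pattern, and
    the XML export is keyed by the same id.
    """
    return {
        "bg": f"{BASE}/bg/MP/{mp_id}",
        "en": f"{BASE}/en/MP/{mp_id}",
        "xml": f"{BASE}/export.php/en/xml/MP/{mp_id}",
    }


if __name__ == "__main__":
    for kind, url in mp_urls(222).items():
        print(kind, url)
```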

tmtmtmtm mentioned this issue Aug 22, 2015
@briatte
Author

briatte commented Aug 22, 2015

Damn, I missed the XML version! That might have made things simpler: my scraper collects both Bulgarian and English pages, because they do not show exactly the same information (time in office can only be computed from the Bulgarian pages, IIRC).

Note—I cannot find gender information in the XML records.

tmtmtmtm added the 3 - WIP label and removed the To Scrape label Sep 5, 2015
tmtmtmtm removed the 3 - WIP label Sep 12, 2015
mhl added a commit that referenced this issue Jul 20, 2016
The countries.json file includes the URLs of files via
cdn.rawgit.com, and those URLs include the commit's object name (also
known as its hash or SHA-1) so that a known version of every file is
referred to in each version of countries.json.

The commit object names used in these URLs (for both the Popolo JSON
and term CSV files) would always be the same for all files in a
particular country, since the commit and last modification time would be
found from a command like:

  git --no-pager log --format='%h|%at' -1 data/Australia/

... and the results would be used for all files for that country.

This meant that if any file for that country was updated, then all files
for that country would have a new commit object name in their URL. This
was bad, because a consumer of that data may be using changes in these
URLs to detect if there is new data to process, and this would mean
they'd have to process more data than necessary.

This commit changes that: now when rebuilding the countries.json file a
mapping is first found from each filename under data/ to the most recent
(based on committer date) non-merge commit that changed that file. This
mapping is then used to find the commit by which a file should be
referred to on a per file basis.  This fixes #451.

Note that both this and the previous version of the code are incorrect,
strictly speaking. There is no guarantee that the commit found by
either method has the same version of the file as in HEAD. For example,
both methods ignore merge commits, and it's possible that a merge commit
has a different version of the file than the most recent non-merge
commit. Also, the order by committer date can be completely different
from topological order. Because of the workflow by which merges are done
in this repository, however, this isn't likely to cause a problem in
practice, but a more correct way to do this would be:

  - Find the object name of the file's blob in HEAD

  - Find the earliest commit (by whatever ordering) which has that
    blob's object name at that path.

That's rather more awkward to implement, however, so this version should
do for the moment, and, as I said, it's no *more* incorrect than the
previous version.

In terms of performance, this is only a few seconds slower on my laptop
than the previous version; that cost is dwarfed by the time taken to
reclone the repository each time - see
everypolitician/everypolitician#359
mhl self-assigned this Jul 20, 2016
mhl added a commit that referenced this issue Jul 21, 2016
The countries.json file includes the URLs of files via
cdn.rawgit.com, and those URLs include the commit's object name (also
known as its hash or SHA-1) so that a known version of every file is
referred to in each version of countries.json.

The commit object names used in these URLs (for both the Popolo JSON
and term CSV files) would always be the same for all files in a
particular country, since the commit and last modification time would be
found from a command like:

  git --no-pager log --format='%h|%at' -1 data/Australia/

... and the results would be used for all files for that country.

This meant that if any file for that country was updated, then all files
for that country would have a new commit object name in their URL. This
was bad, because a consumer of that data may be using changes in these
URLs to detect if there is new data to process, and this would mean
they'd have to process more data than necessary.

This commit changes that: now when rebuilding the countries.json file a
mapping is first found from each filename under data/ to the most recent
(based on committer date) non-merge commit that changed that file. This
mapping is then used to find the commit by which a file should be
referred to on a per file basis.  This fixes #451.
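
The per-file mapping described above might be built roughly as follows. This is a sketch under stated assumptions, not the code from the commit: the function names, the `\x01` header marker, and the exact `git log` invocation are all illustrative choices. The parser takes the log text as an argument so it can be tried without a repository.

```python
import subprocess

# Hypothetical format: \x01 marks a header line, followed by
# "<short hash>|<committer timestamp>".
LOG_FORMAT = "%x01%h|%ct"


def latest_commit_per_file(log_text):
    """Map each path to (hash, committer_time) of its newest listed commit.

    Expects the output of `git log --no-merges --name-only` with
    LOG_FORMAT: one header line per commit, then the paths it changed.
    """
    latest = {}
    current = None
    for line in log_text.split("\n"):
        if line.startswith("\x01"):
            sha, timestamp = line[1:].split("|")
            current = (sha, int(timestamp))
        elif line and current is not None:
            best = latest.get(line)
            # Compare by committer date, not by position in the log.
            if best is None or current[1] > best[1]:
                latest[line] = current
    return latest


def git_log_text(directory="data/"):
    """Run git log for one directory (assumes the cwd is the repository)."""
    result = subprocess.run(
        ["git", "--no-pager", "log", "--no-merges",
         f"--format={LOG_FORMAT}", "--name-only", "--", directory],
        check=True, capture_output=True, text=True,
    )
    return result.stdout
```

In a checkout, `latest_commit_per_file(git_log_text("data/"))` would give one entry per file, so a change to one file no longer alters the commit recorded for its siblings. Paths with unusual characters may be quoted by git and would need extra handling.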

Note that both this and the previous version of the code are incorrect,
strictly speaking. There is no guarantee that the commit found by
either method has the same version of the file as in HEAD. For example,
both methods ignore merge commits, and it's possible that a merge commit
has a different version of the file than the most recent non-merge
commit. Also, the order by committer date can be completely different
from topological order. Because of the workflow by which merges are done
in this repository, however, this isn't likely to cause a problem in
practice, but a more correct way to do this would be:

  - Find the object name of the file's blob in HEAD

  - Find the earliest commit (by whatever ordering) which has that
    blob's object name at that path.
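
The two steps above could be sketched like this. It is an illustration only, with hypothetical function names; it takes "earliest" to mean first in `git rev-list --reverse` order, and the blob lookup is passed in as a parameter so the search logic can be exercised without a repository.

```python
import subprocess


def blob_at(commit, path):
    """Return the blob object name at commit:path, or None if absent."""
    result = subprocess.run(
        ["git", "rev-parse", "--verify", "--quiet", f"{commit}:{path}"],
        capture_output=True, text=True,
    )
    return result.stdout.strip() or None


def earliest_commit_with_blob(commits_oldest_first, path, target_blob,
                              blob_lookup=blob_at):
    """Return the first commit whose blob at `path` equals `target_blob`."""
    for commit in commits_oldest_first:
        if blob_lookup(commit, path) == target_blob:
            return commit
    return None


def earliest_commit_matching_head(path):
    """Find the earliest commit holding HEAD's version of `path`.

    Assumes the current directory is inside the repository.
    """
    target = blob_at("HEAD", path)
    commits = subprocess.run(
        ["git", "rev-list", "--reverse", "HEAD", "--", path],
        check=True, capture_output=True, text=True,
    ).stdout.split()
    return earliest_commit_with_blob(commits, path, target)
```

This runs one `git rev-parse` per candidate commit, which is part of why the approach is more awkward than the mapping built from a single `git log` call.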

That's rather more awkward to implement, however, so this version should
do for the moment, and, as I said, it's no *more* incorrect than the
previous version.

In terms of performance, for a rebuild of every country, this is only a
few seconds slower on my laptop than the previous version; that cost is
dwarfed by the time taken to reclone the repository each time - see
everypolitician/everypolitician#359

Thanks to Tony Bowden (@tmtmtmtm) for suggesting many improvements to
this commit.