Bulgaria #451
Comments
Also in English from http://www.parliament.bg/en/MP/
And in XML at http://www.parliament.bg/export.php/en/xml/MP/1000 (etc)
Damn, I missed the XML version! That might have made things simpler: my scraper collects both the Bulgarian and the English pages, because they do not show exactly the same information (time in office can only be computed from the Bulgarian pages, IIRC). Note: I cannot find gender information in the XML records.
The countries.json file includes the URLs of files via cdn.rawgit.com, and those URLs include the commit's object name (also known as its hash or SHA-1) so that a known version of every file is referred to in each version of countries.json.

The commit object names used in these URLs (for both the Popolo JSON and term CSV files) would always be the same for all files in a particular country, since the commit and last modification time would be found from a command like:

    git --no-pager log --format='%h|%at' -1 data/Australia/

... and the results would be used for all files for that country. This meant that if any file for that country was updated, then all files for that country would have a new commit object name in their URL. This was bad, because a consumer of that data may be using changes in these URLs to detect if there is new data to process, and this would mean they'd have to process more data than necessary.

This commit changes that: now when rebuilding the countries.json file a mapping is first found from each filename under data/ to the most recent (based on committer date) non-merge commit that changed that file. This mapping is then used to find the commit by which a file should be referred to on a per file basis. This fixes #451.

Note that both this and the previous version of the code are incorrect, strictly speaking. There is no guarantee that the commit found by either method has the same version of the file as in HEAD. For example, both methods ignore merge commits, and it's possible that a merge commit has a different version of the file than the most recent non-merge commit. Also, the order by committer date can be completely different from topological order.

Because of the workflow by which merges are done in this repository, however, this isn't likely to cause a problem in practice, but a more correct way to do this would be:

- Find the object name of the file's blob in HEAD
- Find the earliest commit (by whatever ordering) which has that blob's object name at that path.

That's rather more awkward to implement, however, so this version should do for the moment, and, as I said, it's no *more* incorrect than the previous version.

In terms of performance, for a rebuild of every country, this is only a few seconds slower on my laptop than the previous version; that cost is dwarfed by the time taken to reclone the repository each time - see everypolitician/everypolitician#359

Thanks to Tony Bowden (@tmtmtmtm) for suggesting many improvements to this commit.
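The two approaches described in the commit message can be sketched roughly as follows. This is an illustration only, not the repository's actual code: the function names, the `commit <hash>|<timestamp>` header format, and the injected `blob_at` lookup are all assumptions made for the example. The first function parses the text output of a `git log --no-merges --name-only` run and keeps, per file, the commit with the latest committer date; the second implements the "more correct" idea of finding the earliest commit that has the same blob as HEAD at a given path.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Assumed (hypothetical) command whose output the parser below expects:
#   git --no-pager log --no-merges --format='commit %h|%ct' --name-only -- data/
# %h is the abbreviated commit name and %ct the committer date (Unix time).


def latest_commit_per_file(log_output: str) -> Dict[str, Tuple[str, int]]:
    """Map each filename to the (commit, committer time) that last changed it."""
    latest: Dict[str, Tuple[str, int]] = {}
    current: Optional[Tuple[str, int]] = None
    for raw in log_output.splitlines():
        line = raw.strip()
        if not line:
            continue  # blank separator lines between header and file names
        if line.startswith("commit "):
            # Header line for the next commit; a path that itself starts
            # with "commit " would be misread (ignored in this sketch).
            sha, timestamp = line[len("commit "):].split("|")
            current = (sha, int(timestamp))
        elif current is not None:
            # A file touched by the current commit; keep the newest commit
            # by committer date, whatever order the log arrives in.
            previous = latest.get(line)
            if previous is None or current[1] > previous[1]:
                latest[line] = current
    return latest


def earliest_commit_with_head_blob(
    path: str,
    commits_oldest_first: List[str],
    blob_at: Callable[[str, str], Optional[str]],
) -> Optional[str]:
    """Return the earliest commit whose blob at `path` matches the one in HEAD.

    `commits_oldest_first` must end with HEAD's commit. `blob_at(commit, path)`
    is a hypothetical lookup returning the blob's object name, or None if the
    path does not exist in that commit; in practice it might wrap
    `git rev-parse <commit>:<path>`.
    """
    if not commits_oldest_first:
        return None
    head_blob = blob_at(commits_oldest_first[-1], path)
    if head_blob is None:
        return None
    for commit in commits_oldest_first:
        if blob_at(commit, path) == head_blob:
            return commit
    return None
```

Keeping the git invocation out of both functions means the logic can be exercised on canned input without cloning the repository, which is the expensive step mentioned above.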
Here's my own scraper.
Links to MP pages are of the form
http://www.parliament.bg/bg/MP/222
The index is at
http://www.parliament.bg/bg/MP/
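For anyone scripting against these pages, the URL patterns mentioned in this thread could be collected in a small helper like the one below. This is only a sketch: the function name is made up, and the per-MP English URL is assumed by analogy with the Bulgarian one (the thread only gives the English index).

```python
BASE = "http://www.parliament.bg"


def mp_urls(mp_id: int) -> dict:
    """Build the Bulgarian, English, and XML export URLs for one MP id."""
    return {
        "bg": f"{BASE}/bg/MP/{mp_id}",
        # Assumption: English MP pages mirror the Bulgarian URL scheme.
        "en": f"{BASE}/en/MP/{mp_id}",
        "xml": f"{BASE}/export.php/en/xml/MP/{mp_id}",
    }
```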