Bulgaria #451

briatte · 2015-07-30T12:24:57Z

Here's my own scraper.

Links to MP pages are of the form

http://www.parliament.bg/bg/MP/222

The index is at

http://www.parliament.bg/bg/MP/

tmtmtmtm · 2015-08-22T10:13:38Z

Also in English from http://www.parliament.bg/en/MP/

tmtmtmtm · 2015-08-22T10:24:21Z

And in XML at http://www.parliament.bg/export.php/en/xml/MP/1000 (etc)
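For reference, the URL patterns mentioned so far in this thread can be collected in a small helper. This is only a sketch: the function names are hypothetical, and it assumes the trailing number identifies the same MP in the HTML and XML versions, which nothing above confirms. Only the `en` form of the XML export appears in this thread.

```python
BASE = "http://www.parliament.bg"


def mp_page_url(mp_id, lang="bg"):
    """HTML page for one MP; lang is "bg" or "en" as seen in the thread."""
    return f"{BASE}/{lang}/MP/{mp_id}"


def mp_xml_url(mp_id, lang="en"):
    """XML export for one MP; only lang="en" is shown in the thread."""
    return f"{BASE}/export.php/{lang}/xml/MP/{mp_id}"


if __name__ == "__main__":
    print(mp_page_url(222))
    print(mp_xml_url(1000))
```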

briatte · 2015-08-22T12:12:56Z

Damn, I missed the XML version! That might have made things simpler: my scraper collects both Bulgarian and English pages, because they do not show exactly the same information (time in office can only be computed from the Bulgarian pages, IIRC).

Note—I cannot find gender information in the XML records.
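One way to double-check which fields a record exposes is to list every element name it contains. The sketch below is generic; the sample document is invented for illustration and is not the real schema of the parliament.bg export, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET


def element_names(xml_text):
    """Return the sorted, de-duplicated tag names found in an XML document."""
    root = ET.fromstring(xml_text)
    return sorted({element.tag for element in root.iter()})


if __name__ == "__main__":
    # Invented sample, NOT the actual structure of the XML records.
    sample = "<MP><Name>Example</Name><Region>Example</Region></MP>"
    names = element_names(sample)
    print(names)
    print("Gender" in names)
```

Running this over a handful of downloaded records would show whether any gender-like element ever appears.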

The countries.json file includes the URLs of files via cdn.rawgit.com, and those URLs include the commit's object name (also known as its hash or SHA-1) so that a known version of every file is referred to in each version of countries.json.

The commit object names used in these URLs (for both the Popolo JSON and term CSV files) would always be the same for all files in a particular country, since the commit and last modification time would be found from a command like:

    git --no-pager log --format='%h|%at' -1 data/Australia/

... and the results would be used for all files for that country. This meant that if any file for that country was updated, then all files for that country would have a new commit object name in their URL. This was bad, because a consumer of that data may be using changes in these URLs to detect if there is new data to process, and this would mean they'd have to process more data than necessary.

This commit changes that: now when rebuilding the countries.json file a mapping is first found from each filename under data/ to the most recent (based on committer date) non-merge commit that changed that file. This mapping is then used to find the commit by which a file should be referred to on a per file basis.

This fixes #451.

Note that both this and the previous version of the code are incorrect, strictly speaking. There is no guarantee that the commit found by either method has the same version of the file as in HEAD. For example, both methods ignore merge commits, and it's possible that a merge commit has a different version of the file than the most recent non-merge commit. Also, the order by committer date can be completely different from topological order. Because of the workflow by which merges are done in this repository, however, this isn't likely to cause a problem in practice, but a more correct way to do this would be:

- Find the object name of the file's blob in HEAD
- Find the earliest commit (by whatever ordering) which has that blob's object name at that path.

That's rather more awkward to implement, however, so this version should do for the moment, and, as I said, it's no *more* incorrect than the previous version.

In terms of performance, for a rebuild of every country, this is only a few seconds slower on my laptop than the previous version; that cost is dwarfed by the time taken to reclone the repository each time - see everypolitician/everypolitician#359

Thanks to Tony Bowden (@tmtmtmtm) for suggesting many improvements to this commit.
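The per-file mapping described in this commit message can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from the commit: the function names and the `@@` header marker are hypothetical, and the parser expects text produced by something like `git log --no-merges --format='@@%h|%ct' --name-only -- data/`, where `%ct` is the committer timestamp.

```python
import subprocess

MARKER = "@@"  # hypothetical prefix used to spot commit header lines


def git_log_text(repo_dir, path="data/"):
    """Run git log for non-merge commits touching `path`; return its output."""
    result = subprocess.run(
        ["git", "--no-pager", "log", "--no-merges",
         "--format=" + MARKER + "%h|%ct", "--name-only", "--", path],
        cwd=repo_dir, check=True, capture_output=True, text=True,
    )
    return result.stdout


def latest_commit_per_file(log_text):
    """Map each filename to (short hash, committer timestamp) of the
    newest commit, by committer date, that lists the file."""
    latest = {}
    current = None
    for raw_line in log_text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # blank separator between header and file list
        if line.startswith(MARKER):
            sha, timestamp = line[len(MARKER):].split("|")
            current = (sha, int(timestamp))
        elif current is not None:
            best = latest.get(line)
            # Compare timestamps explicitly: log order need not match them.
            if best is None or current[1] > best[1]:
                latest[line] = current
    return latest
```

Calling `latest_commit_per_file(git_log_text("."))` inside a clone would give one entry per file under data/. The sketch shares the caveat stated above: the commit it picks is not guaranteed to hold the same blob as HEAD. It also assumes filenames that git does not need to quote and that do not begin with the marker.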

tmtmtmtm added the To Scrape label Aug 13, 2015

tmtmtmtm added the New Country label Aug 22, 2015

tmtmtmtm mentioned this issue Aug 22, 2015

Buglaria #119

Closed

briatte mentioned this issue Aug 22, 2015

Try using the XML records briatte/bgparl#1

Open

tmtmtmtm added 3 - WIP and removed To Scrape labels Sep 5, 2015

tmtmtmtm mentioned this issue Sep 12, 2015

Bulgaria: initial data #816

Merged

tmtmtmtm closed this as completed in #816 Sep 12, 2015

tmtmtmtm removed the 3 - WIP label Sep 12, 2015

mhl mentioned this issue Jul 20, 2016

Use finer-grained git commit object names in file URLs #14273

Merged

mhl self-assigned this Jul 20, 2016
