
Try using the slug part of the id from the querystring.

People's names change, so slugifying them doesn't really cut it
for the id field. However, there is nothing else which looks
particularly useful.

The 'id' in the URLs is certainly not a person ID - each person
can have more than one. This is true whether we work with the
numerical part alone, or the slug part, or both parts together.

The best compromise for now seems to be to take the slug part of
this querystring parameter, strip the term suffix (e.g. '-5th')
from the end, and use that. Downstream users will still need to
match up people who have more than one ID - sorry!
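
The extraction described above could be sketched roughly as follows. This is an illustration, not the code in scraper.py: the function name member_id_from_url and the example URL are made up, the 'number:slug' shape of the parameter is inferred from the diff, and the imports assume Python 3. Unlike the committed regex, this version anchors the suffix pattern to the end of the slug and returns None when the parameter is missing or malformed.

```python
import re
from urllib.parse import parse_qs, urlsplit

# Ordinal term suffixes such as "-5th", matched only at the end of the slug.
TERM_SUFFIX = re.compile(r'-(?:1st|2nd|3rd|4th|5th|6th)$')


def member_id_from_url(details_url):
    """Return the slug part of the 'id' parameter, minus any term suffix.

    Hypothetical helper: returns None if 'id' is absent or is not of the
    assumed 'number:slug' form.
    """
    ids = parse_qs(urlsplit(details_url).query).get('id')
    if not ids:
        return None
    _number, sep, slug = ids[0].partition(':')
    if not sep or not slug:
        return None
    return TERM_SUFFIX.sub('', slug)


# Made-up URL, for illustration only.
example = ('http://example.org/index.php?option=com_contact'
           '&view=contact&id=123:jane-doe-5th&catid=104')
print(member_id_from_url(example))  # jane-doe
```

Two URLs for the same person in different terms would then yield the same ID, provided the slug itself has not changed between terms.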
Duncan Parkes committed Mar 22, 2016
1 parent 5f39b3c commit 7eedc0a04e10d24748f5aa5fc8fd1134e6793951
Showing with 3 additions and 3 deletions.
  1. +3 −3 scraper.py
@@ -8,8 +8,6 @@
import lxml.html
import execjs
from slugify import slugify_unicode
sources = (
('National Council', 'http://www.parliament.na/index.php?option=com_contact&view=category&id=108&Itemid=1483'),
('National Assembly', 'http://www.parliament.gov.na/index.php?option=com_contact&view=category&id=104&Itemid=1479'),
@@ -73,9 +71,11 @@ def handle_chamber(chamber_name, source_url, data, term_data):
  name_link = tr.cssselect('.jsn-table-column-name')[0].find('a')
  member['name'] = name_link.text.strip()
- member['id'] = slugify_unicode(member['name'])
  details_url = member['details_url'] = urljoin(source_url, name_link.get('href'))
+ possible_id = parse_qs(urlsplit(details_url).query).get('id')[0].split(':')[1]
+ member['id'] = re.sub(r'-1st|-2nd|-3rd|-4th|-5th|-6th', '', possible_id)
  try:
      member['party'] = tr.cssselect('.jsn-table-column-country')[0].text.strip()
  except AttributeError:
