Detect start and end dates
If the name is followed by a date string, try to work out whether it's a
start or end date.

Log any we can't understand to an "unexpected_date_type" column.
tmtmtmtm committed Jun 22, 2018
1 parent 9292f5d commit 4260e2d
Showing 1 changed file with 26 additions and 0 deletions.
26 changes: 26 additions & 0 deletions scraper.rb
@@ -12,6 +12,13 @@
require 'open-uri/cached'
OpenURI::Cache.cache_path = '.cache'

class String
  # Returns an ISO 8601 date string, or nil if the string is empty or
  # cannot be parsed as a date.
  def to_date
    return if empty?
    Date.parse(self).to_s rescue nil
  end
end

class MembersPage < Scraped::HTML
  decorator RemoveNotes
  decorator WikidataIdsDecorator::Links
@@ -29,6 +36,9 @@ def members_tables
  end

class MemberRow < Scraped::HTML
  START_INDICATORS = %w[elected].to_set
  END_INDICATORS = %w[resigned died].to_set

  def vacant?
    tds[2].text == 'Vacant'
  end
@@ -57,11 +67,27 @@ def vacant?
    tds[1].text.tidy
  end

  field :start_date do
    included_date[:when].to_date if START_INDICATORS.include? included_date[:what].to_s.downcase
  end

  field :end_date do
    included_date[:when].to_date if END_INDICATORS.include? included_date[:what].to_s.downcase
  end

  field :unexpected_date_type do
    # Set#| returns a new set. Set#merge would add the end indicators to
    # START_INDICATORS in place, so later rows would treat them as start dates.
    ([included_date[:what].to_s.downcase] - (START_INDICATORS | END_INDICATORS).to_a).join(', ')
  end

  private

  def tds
    noko.css('td,th')
  end

  # Matches a parenthesised note such as "(<what> on <day> <month> <year>)".
  # Returns an empty Hash when there is no such note, so [:what] and [:when] are nil.
  def included_date
    tds[2].text.match(/\((?<what>.*) on (?<when>\d+ \w+ \d+)\)/) || {}
  end
end

url = 'https://en.wikipedia.org/wiki/List_of_members_of_the_16th_Lok_Sabha'
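
The classification described in the commit message can be tried outside the scraper with the minimal sketch below. It is not the commit's code: `START_WORDS`, `END_WORDS`, `DATE_NOTE`, `iso_date` and `classify` are hypothetical names, and the cell texts in the usage note are invented samples. Only the regular expression and the indicator words are taken from the diff.

```ruby
require 'date'
require 'set'

# Hypothetical constants; the words and the pattern are copied from the diff.
START_WORDS = %w[elected].to_set.freeze
END_WORDS   = %w[resigned died].to_set.freeze
DATE_NOTE   = /\((?<what>.*) on (?<when>\d+ \w+ \d+)\)/

# Returns an ISO 8601 date string, or nil if the text cannot be parsed.
def iso_date(text)
  return if text.to_s.empty?
  Date.parse(text).to_s
rescue ArgumentError
  nil
end

# Classifies the parenthesised note in a cell as a start date, an end date,
# or an unexpected date type. Returns an empty Hash when there is no note.
def classify(cell_text)
  found = cell_text.match(DATE_NOTE)
  return {} unless found

  what = found[:what].downcase
  date = iso_date(found[:when])
  if START_WORDS.include?(what)
    { start_date: date }
  elsif END_WORDS.include?(what)
    { end_date: date }
  else
    { unexpected_date_type: what }
  end
end
```

For example, `classify('Jane Doe (Elected on 15 September 2014)')` yields a `start_date`, a note beginning "Resigned" or "Died" yields an `end_date`, and any other word before " on " is reported under `unexpected_date_type`. Freezing the two sets is a design choice in this sketch: it makes an accidental in-place change, such as `Set#merge`, raise an error instead of silently altering the indicators.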