Revises to scrape Trolley Watch #1

Merged: 1 commit, Oct 4, 2017
scraper.py: 18 changes (9 additions, 9 deletions)

@@ -10,20 +10,20 @@

 # scrape_table function: gets passed an individual page to scrape
 def scrape_table(root): # root variable defined in scrape_and_look_for_next_link: parses xml from url
-    rows = root.cssselect("table.data tr") # selects all <tr> blocks within <table class="data">
+    rows = root.cssselect("table.Trolley.table tr") # selects all <tr> blocks within <table class="Trolley table">
     for row in rows:
         # Set up our data record - we'll need it later
         record = {}
         table_cells = row.cssselect("td") # extract each cell in the table as you loop through it
         if table_cells: # if there are any cells
-            record['Artist'] = table_cells[0].text # put the text between each tag in a variable called record, unique key is artist
-            record['Album'] = table_cells[1].text
-            record['Released'] = table_cells[2].text
-            record['Sales m'] = table_cells[4].text
+            record['Date'] = table_cells[0].text # put the text between each tag in a variable called record
+            record['Hospital'] = table_cells[1].text
+            record['Region'] = table_cells[2].text
+            record['Trolley total'] = table_cells[4].text
             # Print out the data we've gathered
             print record, '------------'
             # Finally, save the record to the datastore - 'Hospital' is our unique key
-            scraperwiki.sqlite.save(["Artist"], record)
+            scraperwiki.sqlite.save(["Hospital"], record)
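
To sanity-check the new field mapping, here is a minimal standalone sketch of the same row-to-record loop. It substitutes Python 3's stdlib `xml.etree` for lxml's `cssselect`, and the table snippet is a made-up example, not real Trolley Watch markup:

```python
# Minimal sketch of the row-to-record mapping, standard library only
# (the HTML below is a hypothetical example).
import xml.etree.ElementTree as ET

HTML = """<table class="Trolley table">
  <tr><th>Date</th><th>Hospital</th><th>Region</th><th>Ward</th><th>Total</th></tr>
  <tr><td>04/10/2017</td><td>Beaumont</td><td>Dublin</td><td>3</td><td>23</td></tr>
</table>"""

def scrape_table(root):
    records = []
    for row in root.iter("tr"):          # every <tr> in the table
        table_cells = row.findall("td")  # header rows use <th>, so this is empty for them
        if table_cells:
            records.append({
                "Date": table_cells[0].text,
                "Hospital": table_cells[1].text,
                "Region": table_cells[2].text,
                "Trolley total": table_cells[4].text,  # column 3 is skipped, as in the scraper
            })
    return records

records = scrape_table(ET.fromstring(HTML))
print(records)
```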

 # scrape_and_look_for_next_link function: calls the scrape_table
 # function, then hunts for a 'next' link: if one is found, calls itself again
@@ -48,8 +48,8 @@ def scrape_and_look_for_next_link(url):

 # ---------------------------------------------------------------------------
 # START HERE: define your starting URL - then
-# call a function to scrape the first page in the series.
+# call a function to scrape it
 # ---------------------------------------------------------------------------
-base_url = 'https://paulbradshaw.github.io/'
-starting_url = urlparse.urljoin(base_url, 'scraping-for-everyone/webpages/example_table_1.html')
+starting_url = 'http://inmo.ie/6022'
 # urlparse breaks up the url by /. urlparse.urljoin combines two urls together.
 scrape_and_look_for_next_link(starting_url)
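
The retained comment about `urlparse.urljoin` is easy to verify in isolation. A quick sketch using the example URLs from the deleted lines; note Python 3 moved this function to `urllib.parse`:

```python
# urljoin resolves a relative path against a base URL, which is how the
# old scraper built its starting_url.
from urllib.parse import urljoin

base_url = 'https://paulbradshaw.github.io/'
joined = urljoin(base_url, 'scraping-for-everyone/webpages/example_table_1.html')
print(joined)
```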