Episode on scraping UNSC data with requests/lxml #47

Merged

merged 3 commits into data-lessons:gh-pages on Jul 9, 2017

Conversation

@jnothman (Contributor) commented Jul 4, 2017

Closes #6, #11, #12.

This episode is untested in real life.

The motivation (as far as I'm concerned) for presenting requests/lxml instead of scrapy is that:

  • it is paradigmatically closer to what beginner Python programmers are familiar with (i.e. procedural code, loops, etc.)
  • it motivates the reasons to prefer scrapy or a similar framework

Issues:

  • currently moving @timtomch's scrapy lesson to _extras, perhaps to be updated to use the UNSC site.
  • Advanced Topics section could perhaps be moved to conclusion.
  • Code makes use of dicts, which I later found out are not part of the SWC Python Beginners course.
  • I use "robust", "resilient" and similar terms interchangeably. Perhaps the entire lesson should adopt a term like "defensive web scraping" in parallel to "defensive programming". I personally think this is an important part of web scraping in practice, and is emphasised by the UNSC site's quirks.
  • Should put more specific version requirements in setup.md.

@timtomch (Collaborator) left a comment

I think this is a beautiful episode! Thanks Joel for contributing. I have a few minor comments, for consideration.

- "How can I download web pages' HTML in Python?"
- "How can I evaluate XPath or CSS selectors in Python?"
- "How can I format scraped data as a spreadsheet?"
- "How do I build a scraper that is resilient to change and aberration?"

Suggest using a different word than "aberration" in the spirit of reducing the use of jargon, especially in the episode questions. Or just something like "How do I build a scraper that will keep working even if the page structure changes?"
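To make the questions above concrete, a minimal sketch of the workflow they describe might look like the following (the URL and selector are illustrative placeholders, not the episode's actual values):

```python
import csv

import requests
import lxml.html

# Download a web page's HTML in Python
response = requests.get('https://www.un.org/securitycouncil/')  # placeholder URL
tree = lxml.html.fromstring(response.content)

# Evaluate CSS selectors (or XPath, via tree.xpath(...)) on the parsed tree
records = []
for link in tree.cssselect('table a'):  # placeholder selector
    records.append({'title': link.text_content().strip(),
                    'url': link.get('href')})

# Format the scraped data as a spreadsheet (CSV)
with open('resolutions.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=['title', 'url'])
    writer.writeheader()
    writer.writerows(records)
```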

- "An element tree's `cssseelct` and `xpath` methods extract elements of interest."
- "A scraper can be divided into: identifying the set of URLs to scrape; extracting some elements from a page; and transforming them into a useful output format."
- "It is important but challenging to be resilient to variation in page structure: one should automatically validate and manually inspect their extractions."
- "A framework like [Scrapy](http://scrapy.org) may help to build robust scrapers, but may be harder to learn."
Link to the scrapy lesson if moved to _extras?
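On the "resilient to variation" key point above, here is a hedged illustration of what automatic validation could look like (the field names and checks are hypothetical, not taken from the episode):

```python
import logging

def extract_record(row):
    """Extract one record from a table row element; return None if it looks malformed."""
    cells = row.cssselect('td')
    if len(cells) < 2:
        logging.warning('Skipping row with unexpected structure: %r',
                        row.text_content()[:80])
        return None
    record = {
        'symbol': cells[0].text_content().strip(),
        'title': cells[1].text_content().strip(),
    }
    if not record['symbol']:
        logging.warning('Row is missing a symbol: %r', record)
        return None
    return record
```

Records that fail validation are logged for manual inspection rather than silently dropped or allowed to crash the scraper.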

* We can look at the HTML source code of a page to find how target elements are structured and
how to select them.
* We can use the browser console to try out XPath or CSS selectors on a live site.
* We can use visual scrapers to handle some basic scraping tasks. These help determine an appropriate selector, and may also perform spidering.
Was spidering defined in the previous episode? If not, I suggest explaining the term.

It is mentioned in the intro, but in passing. Might be worthwhile expanding on that definition here.
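If that definition is expanded, a tiny example may help make "spidering" concrete, e.g. something along these lines (the seed URL, selector, and page limit are placeholders):

```python
from urllib.parse import urljoin

import requests
import lxml.html

def spider(seed_url, max_pages=10):
    """Visit pages reachable from seed_url by following links, breadth-first."""
    to_visit = [seed_url]
    seen = set()
    while to_visit and len(seen) < max_pages:
        url = to_visit.pop(0)
        if url in seen:
            continue
        seen.add(url)
        tree = lxml.html.fromstring(requests.get(url).content)
        for link in tree.cssselect('a'):
            href = link.get('href')
            if href:
                to_visit.append(urljoin(url, href))
    return seen
```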


# Introducing Requests and lxml

We make use of two tools that are not specifically developed for scraping, but are very useful for that purpose (among others).
I suggest moving up the paragraph about Python before introducing the libraries. First, clarify that we will be coding the scraper in Python (link to the base Python lesson) and that the following tools are Python libraries that we will need.
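Concretely, the opening could then show the two libraries up front; a minimal version might be (exact install instructions belong in setup.md):

```python
# Both libraries are Python packages, installable with e.g.
# `pip install requests lxml cssselect` (cssselect enables the .cssselect() method).
import requests   # downloads pages over HTTP
import lxml.html  # parses HTML into an element tree with .xpath() and .cssselect()
```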

> its HTML (using the `html.parser` backend). In some ways, Beautiful
> Soup may have a more friendly design for web scraping (e.g. its handling
> of text).
{: .callout}
I like the comments about alternatives (also in the visual scraper episode). But I wonder if it wouldn't be better to group them into a discussion section. I think one of the principles of the Carpentries is that we try to focus on one set of tools, teach them, and then mention alternatives, rather than risk confusing people while they are learning?

The bs4 comments are not really necessary here at all.
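If a discussion section on alternatives is added, the Beautiful Soup equivalent of the parse step could be shown there; a rough sketch (the URL and selector are illustrative, not from the lesson):

```python
import requests
from bs4 import BeautifulSoup

html = requests.get('https://example.org').text    # placeholder URL
soup = BeautifulSoup(html, 'html.parser')          # html.parser backend, as in the callout
for link in soup.select('a'):                      # CSS selectors via .select()
    print(link.get_text(strip=True), link.get('href'))
```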
@timtomch timtomch merged commit 5c594c0 into data-lessons:gh-pages Jul 9, 2017
@jnothman (Contributor, Author) commented Jul 9, 2017 via email

Development

Successfully merging this pull request may close these issues:

  • Include Beautiful Soup instead of Scrapy (or as an add-on)
2 participants