Episode on scraping UNSC data with requests/lxml #47

Merged

merged 3 commits into data-lessons:gh-pages on Jul 9, 2017

Conversation

@jnothman (Contributor) commented Jul 4, 2017

Closes #6, #11, #12.

This episode is untested in real life.

The motivation (as far as I'm concerned) for presenting requests/lxml instead of scrapy is that:

  • it is paradigmatically closer to what beginner Python programmers are familiar with (i.e. procedural code, loops, etc.)
  • it motivates the reasons to prefer scrapy or a similar framework

Issues:

  • currently moving @timtomch's scrapy lesson to _extras, perhaps to be updated to use the UNSC site.
  • Advanced Topics section could perhaps be moved to conclusion.
  • Code makes use of dicts, which I later found out are not part of the SWC Python Beginners course.
  • I use "robust", "resilient" and similar terms interchangeably. Perhaps the entire lesson should adopt a term like "defensive web scraping" in parallel to "defensive programming". I personally think this is an important part of web scraping in practice, and is emphasised by the UNSC site's quirks.
  • Should put more specific version requirements in setup.md.

@timtomch (Collaborator) left a comment

I think this is a beautiful episode! Thanks Joel for contributing. I have a few minor comments, for consideration.

- "How can I download web pages' HTML in Python?"
- "How can I evaluate XPath or CSS selectors in Python?"
- "How can I format scraped data as a spreadsheet?"
- "How do I build a scraper that is resilient to change and aberration?"

Suggest using a different word than "aberration" in the spirit of reducing the use of jargon, especially in the episode questions. Or just something like "How do I build a scraper that will keep working even if the page structure changes?"
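To make the questions above concrete, a minimal sketch of the workflow they describe might look like the following (the URL and selector are illustrative placeholders, not the episode's actual values):

```python
import csv

import requests
import lxml.html

# Download a web page's HTML in Python
response = requests.get('https://www.un.org/securitycouncil/')  # placeholder URL
tree = lxml.html.fromstring(response.content)

# Evaluate CSS selectors (or XPath, via tree.xpath(...)) on the parsed tree
records = []
for link in tree.cssselect('table a'):  # placeholder selector
    records.append({'title': link.text_content().strip(),
                    'url': link.get('href')})

# Format the scraped data as a spreadsheet (CSV)
with open('resolutions.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=['title', 'url'])
    writer.writeheader()
    writer.writerows(records)
```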

- "An element tree's `cssseelct` and `xpath` methods extract elements of interest."
- "A scraper can be divided into: identifying the set of URLs to scrape; extracting some elements from a page; and transforming them into a useful output format."
- "It is important but challenging to be resilient to variation in page structure: one should automatically validate and manually inspect their extractions."
- "A framework like [Scrapy](http://scrapy.org) may help to build robust scrapers, but may be harder to learn."
Link to the scrapy lesson if moved to _extras?
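On the "resilient to variation" key point above, here is a hedged illustration of what automatic validation could look like (the field names and checks are hypothetical, not taken from the episode):

```python
import logging

def extract_record(row):
    """Extract one record from a table row element; return None if it looks malformed."""
    cells = row.cssselect('td')
    if len(cells) < 2:
        logging.warning('Skipping row with unexpected structure: %r',
                        row.text_content()[:80])
        return None
    record = {
        'symbol': cells[0].text_content().strip(),
        'title': cells[1].text_content().strip(),
    }
    if not record['symbol']:
        logging.warning('Row is missing a symbol: %r', record)
        return None
    return record
```

Records that fail validation are logged for manual inspection rather than silently dropped or allowed to crash the scraper.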

* We can look at the HTML source code of a page to find how target elements are structured and
how to select them.
* We can use the browser console to try out XPath or CSS selectors on a live site.
* We can use visual scrapers to handle some basic scraping tasks. These help determine an appropriate selector, and may also perform spidering.
Was spidering defined in the previous episode? If not, I suggest explaining the term.

It is mentioned in the intro, but in passing. Might be worthwhile expanding on that definition here.
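If that definition is expanded, a tiny example may help make "spidering" concrete, e.g. something along these lines (the seed URL, selector, and page limit are placeholders):

```python
from urllib.parse import urljoin

import requests
import lxml.html

def spider(seed_url, max_pages=10):
    """Visit pages reachable from seed_url by following links, breadth-first."""
    to_visit = [seed_url]
    seen = set()
    while to_visit and len(seen) < max_pages:
        url = to_visit.pop(0)
        if url in seen:
            continue
        seen.add(url)
        tree = lxml.html.fromstring(requests.get(url).content)
        for link in tree.cssselect('a'):
            href = link.get('href')
            if href:
                to_visit.append(urljoin(url, href))
    return seen
```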


# Introducing Requests and lxml

We make use of two tools that are not specifically developed for scraping, but are very useful for that purpose (among others).
I suggest moving up the paragraph about Python before introducing the libraries. First, clarify that we will be coding the scraper in Python (link to the base Python lesson) and that the following tools are Python libraries that we will need.
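Concretely, the opening could then show the two libraries up front; a minimal version might be (exact install instructions belong in setup.md):

```python
# Both libraries are Python packages, installable with e.g.
# `pip install requests lxml cssselect` (cssselect enables the .cssselect() method).
import requests   # downloads pages over HTTP
import lxml.html  # parses HTML into an element tree with .xpath() and .cssselect()
```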

> its HTML (using the `html.parser` backend). In some ways, Beautiful
> Soup may have a more friendly design for web scraping (e.g. its handling
> of text).
{: .callout}
I like the comments about alternatives (also in the visual scraper episode). But I wonder if it wouldn't be better to group them into a discussion section. I think one of the principles of the Carpentries is that we try to focus on one set of tools, teach them, and then mention alternatives, rather than risk confusing people while they are learning?

The bs4 comments are not really necessary here at all.
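If a discussion section on alternatives is added, the Beautiful Soup equivalent of the parse step could be shown there; a rough sketch (the URL and selector are illustrative, not from the lesson):

```python
import requests
from bs4 import BeautifulSoup

html = requests.get('https://example.org').text    # placeholder URL
soup = BeautifulSoup(html, 'html.parser')          # html.parser backend, as in the callout
for link in soup.select('a'):                      # CSS selectors via .select()
    print(link.get_text(strip=True), link.get('href'))
```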
@timtomch timtomch merged commit 5c594c0 into data-lessons:gh-pages Jul 9, 2017
@jnothman (Contributor, Author) commented Jul 9, 2017 via email

Development

Successfully merging this pull request may close these issues:

  • Include Beautiful Soup instead of Scrapy (or as an add-on)
2 participants