
Introspection after ResBaz Sydney 2017 lesson #41

Closed
jnothman opened this issue Jul 3, 2017 · 3 comments

jnothman (Contributor) commented Jul 3, 2017

This afternoon, I had 3h (including a 10 min break) to present web scraping. I presented from https://ctds-usyd.github.io/2017-07-03-resbaz-webscraping/. I am not a trained SWC instructor, and I am not used to the narrative format of SWC lessons. I am also an experienced software engineer, so while I am used to some amount of teaching, it was hard for me to recall how much groundwork there is to this topic. In the context of ResBaz, I was presenting to a group of research students, librarians, possibly academics, etc. from Sydney universities. I did not get anything in the way of a survey, but I hope to ask the ResBaz organisers to email students for their comments.

There were about 22 students, though 40 had signed up. Despite the Library Carpentry resolutions of a few weeks ago to focus on coding scrapers, I had decided to make something accessible to non-coders. In the end, we did not cover the coding part at all. I don't think we suffered greatly for this.

What we managed to cover

We covered, perhaps, half the material:

  • basically all of the introduction (about 20 mins)
  • almost all of CSS selectors (about 60 mins?)
  • (coffee break)
  • visual scraping with Web Scraper extension (75 mins)
  • did not do Python scraping at all
  • conclusion / ethical discussion (5 mins)

Good points

  • The UNSC resolutions web site nicely highlights the need to adjust your scraper to quirks and variation in the site, and why you should not always rely on the extraction patterns chosen by the visual scraper (though I'm not sure how clearly the latter came across to some students).
  • Successfully scraping data, after spending a while talking about and learning selectors, yielded quite a sense of accomplishment.
  • I think most students got the idea of the nested scraper design used in Web Scraper, and generally that episode worked quite well.
  • I think most students got the idea of CSS selectors matching elements and how this fits into a scraper.
  • The selecting challenge, performed in pairs/triples, was enjoyed.
  • The conclusion was a bit rushed, but quite clear (if perhaps a little repetitive).

Things deserving attention

Overall

  • There is far too much narrative before getting our hands dirty. Even so, students seemed to appreciate the "what web scraping is not" material, at least to some extent; it could probably be moved to the conclusion.

  • Students who were not well grounded in the structure of web pages struggled.

  • I had two projector screens. Even so, it is challenging to set up a visual projection that covers: the lesson, the page being scraped, source code or element inspector for a page being scraped, the scraping tool or code...

  • I think it would be good to focus on a visual scraper, but then have a number of scripts in several scraping frameworks and languages available as supplementary material to the lesson (see the sketch after this list). A discussion of the nuances of coding these things by hand can be kept brief, or given more description in an extended lesson.

  • I feel that visual scrapers are a good way to demonstrate what we're up to with little coding competence required, and are in practice a useful technology to grok.

  • The key thing we need to consider is to what extent we make this available with a "choose your own adventure: CSS vs XPath; visual vs requests/lxml vs scrapy" approach, or as a single well-honed curriculum that works for most people.
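
For concreteness, here is a minimal sketch of the kind of supplementary script meant above, using requests + lxml (one of the framework options named); the URL and selectors are placeholders, not the actual UNSC page structure:

```python
# Minimal sketch of a supplementary scraping script using requests + lxml.
# The URL and selectors are hypothetical placeholders, not the real UNSC
# resolutions page. lxml needs the `cssselect` package installed in order
# to evaluate CSS selectors.
import requests
import lxml.html

response = requests.get("https://example.org/resolutions")  # placeholder URL
response.raise_for_status()
tree = lxml.html.fromstring(response.text)

# The same CSS selectors taught in the lesson are evaluated here.
for row in tree.cssselect("table.resolutions tr"):  # hypothetical selector
    cells = row.cssselect("td")
    if cells:
        print(cells[0].text_content().strip())
```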

CSS selectors

  • The episode should somehow start with an exercise and be more bottom-up. Perhaps it should start with the HTML element inspector over the target web site and observations like "similar structures in the page share the same tag names, class names, etc.", and use this to introduce topics like markup structure and tree structure / terminology (which is very useful). It would be better, for instance, to have seen class attributes long before we describe how to select them.
  • It should possibly use the UNSC example instead of analysing the lesson page itself (which would also make the lesson easier to maintain).
  • We could even consider evaluating CSS selectors in the console before we go into the details of a CSS selector and how it works.
  • "View source" can probably be replaced by "Element Inspector". There are advantages to both, but I don't feel like it's worth complicating the matter.
  • The <catfood> example is poorer for only having one of each tag name; see the sketch after this list for the kind of repetition that makes selectors meaningful.
  • Some of the introductory material here is awkward because it tries to mirror the XPath content. For example, in CSS selection, we don't really think of attributes as part of the tree, and only sometimes think of text as such.
  • I borrowed from the XPath lesson the idea that evaluation follows the path from the root to the target. But it seems easier to me to describe the way the selector places conditions on the target, on its parent, and so on. Remnants of the path paradigm linger in the lesson.
  • ("Extensions to CSS selectors" was relevant to a draft of the visual scraping lesson, but is no longer and can be removed.)

Visual scraping

  • There are some annoying interface issues in the Web Scraper extension, particularly in navigation: e.g. clicking rows vs clicking the buttons on their right; where to run a data preview, what to expect in it, and how that relates to the scrape results; the tool for selecting a parent selector is difficult for a beginner; and the data preview shows with too-large margins, so the box is very small.
  • We had other issues with Web Scraper including that its Selection tool can behave a bit quirkily.
  • Web Scraper introduces some confusing technology: its "selectors" need to be distinguished from CSS "selectors"; its "Type" needs to be distinguished from CSS selectors' :nth-of-type which refers to tag name.
  • The Web Scraper extension opens up a browser window to do the scraping in. I don't think I made it clear to students that this is not a necessary part of scraping, i.e. that scraping in the background, often automatically and periodically, is the norm.
  • I ad-libbed some content on why we might want an Element Attribute other than href, and spoke of machine-readable publication dates (with microdata) on news sites. I could also have mentioned a's title attribute. What else? It is perhaps worth writing a paragraph on this in the lesson; see the sketch below.
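
As a sketch of the sort of example such a paragraph could include (the markup is invented), here are attributes other than href that a scraper might want:

```python
# Invented markup illustrating element attributes other than href that a
# scraper might want: a link's title attribute and a machine-readable
# publication date on a <time> element.
import lxml.html

html = """
<article>
  <a href="/res/2334" title="Resolution 2334 (2016)">Resolution 2334</a>
  <time datetime="2016-12-23">23 December 2016</time>
</article>
"""
tree = lxml.html.fromstring(html)

link = tree.cssselect("a")[0]
date = tree.cssselect("time")[0]
print(link.get("title"))     # the title attribute, not the link text
print(date.get("datetime"))  # machine-readable date, unlike the display text
```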

I'll bring my lessons across to this repo shortly.

Anything to add, @nikzadb, @Anushi, @RichardPBerry?

ostephens (Collaborator) commented:

Thanks so much for this write-up, @jnothman.

It feels like there could be room for different web scraping lessons here: an 'intro to web scraping with tools' that focuses on a tool and includes an introduction to HTML/CSS, and a more advanced lesson, possibly 'web scraping with Python'.

I could see this being multiple episodes within a single lesson, but it would have to be clear that the intention isn't to use all the episodes in one teaching session. (@drjwbaker has previously suggested a similar approach to me for the OpenRefine lesson.)

I feel that any tool introduced should follow the selectors we are teaching, so if we are teaching CSS selectors it seems odd to use a tool with a similar but different selector syntax.

Was there any feedback from the participants in terms of how useful they found it and whether it met with their expectations?

jnothman (Contributor, Author) commented Jul 4, 2017 via email

RichardPBerry commented:

Great workshop, and great summary, @jnothman; I think you picked out all the key points. I will add that, having attended with absolutely zero web scraping experience, I got a lot out of it!

I agree that this could be broken down into basic (visual) and advanced (code-based) lessons. The mechanics of how best to do that I would leave to you... :)

The only things I would add are:
a) some diagrams, or perhaps looking at the structure of a very simple webpage using the element inspector, might be a good way to give those not familiar with HTML a more solid grounding (looking at the structure of the course material page can be a bit overwhelming)

b) maybe after introducing the concept of one or two selectors it would be good to jump straight into the visual scraper tool and try this out on the simple webpage. This could be followed up by the more in-depth discussion of various CSS selectors and the UNSC example. I think this would help cement the concept and break up the theoretical discussion at the start.

Last point: personally, I think the UNSC example is really good. The quirks of this site show how difficult good scraping can be.
