
Review web scraping lesson structure and content #7

Closed
ostephens opened this issue Jun 1, 2017 · 44 comments

Comments

@ostephens
Collaborator

Need to do an overall review of the structure & content of this session and decide if it is the right stuff for a Library Carpentry lesson on web scraping. I suggest we use an Etherpad to agree on a syllabus for a lesson and review it against this existing lesson.

@weaverbel
Collaborator

Ok @ostephens - shall I create the pad or will you? Easy for me to do.

@ostephens
Collaborator Author

Pls go ahead @weaverbel

@weaverbel
Collaborator

Here you go @ostephens
http://pad.software-carpentry.org/scrape

@ostephens
Collaborator Author

OK - looking at this and thinking about it this morning, my view is that we should remove the Python part of the lesson (either completely, or separate it into its own 'advanced' lesson).

There is really useful and relevant stuff in the lesson without leaping to programming tools - and I think probably enough content for 3 hrs anyway. The risk of the additional overhead and barriers introduced by going to programming is that we potentially lose people who would otherwise take some really good concepts from the lesson. Removing it would also give more time to spend on XPath (and potentially CSS or jQuery selectors).

So I'm proposing we take out Episode 04 entirely. Any views?

@kimpham54
Contributor

kimpham54 commented Jun 1, 2017

When we've taught this lesson in the past we've requested that people come with some knowledge of programming (not necessarily Python), but that didn't always happen and people were still able to go through the exercises. I agree that the web scraping Python lesson is advanced - it makes a lot of assumptions that people understand the basic concepts of programming, including knowledge of object-oriented programming.

Although we've been able to deliver the full lesson in 3 hours, it is definitely possible to go more in-depth and provide more instruction in each of the sections. The original XPath lesson I developed was more generic and was not aimed at web scraping - it addressed using XPath for XML documents, and also provided a brief introduction to XQuery.

Here are some suggested structures for a web scraping syllabus:

Option 1

  • XPath/XSLT/XQuery on documents (XML, HTML)
  • Intro to Web scraping (prereq: XPath lesson, maybe mention that they only need a particular portion to get started?)
  • Web Scraping with Python (prereq: XPath lesson, Intro to Web Scraping)

Option 2

  • Intro to Web scraping + XPath Intro
  • Web Scraping with Python (prereq: XPath lesson, Intro to Web Scraping)

@kimpham54
Contributor

pinging @timtomch for your thoughts as well!

@ostephens
Collaborator Author

I think a Library Carpentry intro to web scraping shouldn't require programming knowledge as a pre-requisite.

I think XPath is a really useful thing to learn and makes complete sense in this context and so I'm in favour of keeping this in.

I really like all the surrounding material - what is web scraping/HTML DOM/ethics etc. all really good IMO

@runderwood
Collaborator

From a scholarly point of view, I would say that a non-programmatic approach to web scraping is not likely to be all that useful, at least in my experience. If something can be scraped without programming, typically folks just do it through brute force.

As for XPath, while useful, it is a rather narrow DSL. I'd be more interested in making sure people are introduced to the ideas of document structures/trees, markup syntax/semantics, and selectors, broadly. For munging HTML, I've yet to do any heavy lifting in web scraping without the use of a parsing library like BeautifulSoup, so, for my part, I'd like us to really consider retaining something along those lines in the lesson.

@alixk

alixk commented Jun 1, 2017

I agree that Python/programming knowledge shouldn't be a pre-req for an LC workshop. I like the idea of the bulk of the lesson being non-programming oriented, although it might be useful for learners to see/play around with BeautifulSoup at the end. I think it's pretty approachable and exciting for people to see how it works. (I added this on the etherpad, but I've taught the ProgHist lesson on BeautifulSoup and it worked very well: http://programminghistorian.org/lessons/intro-to-beautiful-soup)

I also agree that having a conversation about ethics in the introduction of the lesson is really important.

@ostephens
Collaborator Author

@runderwood My experience differs in terms of the usefulness of a non-programmatic approach to web scraping - but it's about having the right tool.

I agree that the more general concepts are important, but XPath and XSLT are something we've been asked for repeatedly as part of Library Carpentry and this seems like a good place to introduce that syntax.

@ostephens
Collaborator Author

I should be clear that I'm not against a separate lesson that has programming as a pre-requisite - I just think that it doesn't belong in the 'intro' lesson.

@runderwood
Collaborator

runderwood commented Jun 1, 2017

I have never used XSLT in web scraping. I've written mountains of the stuff, but I just can't imagine it being something we'd wade into here, especially if we're avoiding programming.

@ostephens
Collaborator Author

@runderwood - sorry, I wasn't being clear - I meant 'introducing the XPath syntax', not 'introducing XSLT' - I'm not suggesting adding XSLT into this lesson

@runderwood
Collaborator

@ostephens OK! Phew.

I like XPath, but one could argue that CSS-like selectors are more relevant in the web world. And I think those point directly toward XPath.

BeautifulSoup's document traversal facilities are arguably more comprehensible than XPath proper. Alternatively, lxml's HTML parsers/traversers are more XPath-oriented.

Whether we teach XPath or not, I still think some basic programming is a necessity if this is to be maximally useful to researchers. Also, the non-programmatic resources mentioned so far are all proprietary and severely limiting.
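To make the XPath-versus-CSS-selector comparison concrete, here is a small sketch of the same selection written both ways, via lxml and via BeautifulSoup's .select(). The HTML fragment, the "members" class name, and both selectors are invented for illustration.

```python
# Illustrative only: the same selection as XPath (lxml) and as a CSS
# selector (BeautifulSoup). Fragment and selectors are made up.
from bs4 import BeautifulSoup
from lxml import html

fragment = """
<ul class="members">
  <li><a href="/mp/1">Alice</a></li>
  <li><a href="/mp/2">Bob</a></li>
</ul>
"""

# XPath: text of every anchor inside a list item of the ul with class "members"
names_xpath = html.fromstring(fragment).xpath('//ul[@class="members"]/li/a/text()')

# CSS selector: the same elements, via BeautifulSoup's .select()
soup = BeautifulSoup(fragment, "html.parser")
names_css = [a.get_text() for a in soup.select("ul.members > li > a")]

print(names_xpath)  # ['Alice', 'Bob']
print(names_css)    # ['Alice', 'Bob']
```

Both expressions encode the same structural idea (element, class, descent), which is why selectors and XPath can be taught side by side rather than as separate topics.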

@ostephens
Collaborator Author

ostephens commented Jun 1, 2017

@runderwood
Agreed that CSS selectors are relevant in the web world and I think we should consider adding these to the lesson - but alongside XPath IMO

> I still think some basic programming is a necessity if this is to be maximally useful to researchers.

But this is not aimed at researchers, but at librarians.

> Also, the non-programmatic resources mentioned so far are all proprietary and severely limiting.

I don't disagree particularly.
I know better tools for doing scraping (without programming), but I think you end up teaching the tool - which I think is a bad idea.
I think the tools used in this lesson are lightweight enough that the focus becomes teaching the concepts, not the tool.

I think the question for me is whether introducing the basics here is possible and useful without breaking out into programming. I'm inclined to think it is, but I think I'm currently outnumbered in that regard on here...

@ostephens
Collaborator Author

@alixk when you taught http://programminghistorian.org/lessons/intro-to-beautiful-soup - how long did you have available and what experience did people coming to the lesson have?

@alixk

alixk commented Jun 1, 2017

Yeah, in my (limited) experience, it seems as though tools like browser extensions or proprietary tools tend not to have a very long lifespan, whereas Python/BeautifulSoup is more dependable. And it builds upon the shell lesson.

What about a two-part lesson similar in style to spreadsheets + OpenRefine? 1.5 hours for What is Web Scraping, Ethics, Landscape of resources; and then 1.5 hours for Python, BeautifulSoup.

@runderwood
Collaborator

@ostephens I definitely like your idea of working XPATH into the lesson. As for non-programmatic approaches, I think there is definitely a place for that. I'm just not sure this lesson is it. The LC ethos, as I understand it, doesn't seem compatible with purely conceptual lessons nor does it seem in line with proprietary, non-programmatic approaches to scraping. The notion here, I think, is that these lessons knit together all these different more-or-less UNIX-ish tools to do interesting things and empower librarians/researchers. If this is our end, I'd think we'd want to move this lesson in a direction not too radically different from its present form, generally speaking -- programmatic and very concrete and hands-on.

@ostephens
Collaborator Author

@runderwood I don't entirely agree with that summation of the LC ethos, but I am in favour of concrete rather than abstract

@ostephens
Collaborator Author

@alixk @runderwood @kimpham54
OK from this discussion, and having reviewed http://programminghistorian.org/lessons/intro-to-beautiful-soup you are starting to convince me.

Rather than a 1.5/1.5 split I think I might go for a 1 hr/2 hr split - which reflects the current structure I think?

@ostephens
Collaborator Author

Are we all agreed that BS is a better starting point than Scrapy for this lesson?

@ostephens
Collaborator Author

OK - so if we are saying we are going to use Python with BS here, do you think that we should dive straight in? That is, make the Python/BS install the starting point rather than using the Chrome extension as a starting point?

@ostephens
Collaborator Author

ostephens commented Jun 1, 2017

BTW, just to show I'm not entirely making up the idea that you can do useful scraping without Python (IMO), here is a tutorial I wrote using Google Sheets to scrape data from the Early English Short Title Catalogue - based on a real-world use case: http://www.meanboyfriend.com/overdue_ideas/2015/06/using-google-sheets-with-estc/

@runderwood
Collaborator

@ostephens That is really, really interesting, both the target (the short title catalogue) and the approach.

But I would note that, as effective as this seems to have been, it is a) using a proprietary tool b) in a way that amounts to scripting/programming.

Something like this:

=importXml(concat("http://estc.bl.uk/",A2),"//td[@class='td1']//a[contains(@href,'&set_entry')]/@href")

...while technically mostly declarative, isn't necessarily more comprehensible than Python code (in this case, actually, I think it's less so).

These same techniques, in any case, can be applied in a Python environment, and with more potential for broader application -- you will hit a hard limit on the utility of the Google Sheets approach long before you find something you can't script in Python.
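A rough Python equivalent of the spreadsheet formula above can be written with lxml's XPath support. The HTML fragment below is invented to mirror the structure the expression targets; the real ESTC pages may differ.

```python
# Sketch of the IMPORTXML formula's XPath in Python via lxml.
# The page markup here is an assumption, not real ESTC output.
from lxml import html

page = """
<table>
  <tr><td class="td1"><a href="/search?foo=1&amp;set_entry=000123">Record 1</a></td></tr>
  <tr><td class="td1"><a href="/about">About</a></td></tr>
</table>
"""

tree = html.fromstring(page)
# Same selector idea as the formula: hrefs of links inside td.td1 cells
# whose href contains '&set_entry'.
hrefs = tree.xpath("//td[@class='td1']//a[contains(@href, '&set_entry')]/@href")
print(hrefs)  # only the first link matches
```

The XPath expression carries over unchanged, which is part of the argument for teaching the syntax independently of any one tool.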

@copystar

copystar commented Jun 1, 2017

Just a small suggestion. While I think the Short Title Catalogue is a great example for librarians to work on scraping, I would also like to suggest that we use Wikipedia as a potential source of scraping material.

@ostephens
Collaborator Author

@copystar why Wikipedia?

@runderwood
Collaborator

runderwood commented Jun 1, 2017

@copystar Are there applications for scraping of Wikipedia not covered by its API?

@ostephens
Collaborator Author

@runderwood
I agree - the approach has definite shortcomings!
It's just that I've found this approach a way of introducing key concepts in a concrete, hands-on way, without having to get people started with Python and without having to install any software locally.

@runderwood
Collaborator

@ostephens Understood. People familiar with Software Carpentry and LC have indicated to me that Bash and Python are generally taken as a given, so I feel like we wouldn't be breaking with precedent there.

But I'm definitely sharing your blog post with colleagues.

@ostephens
Collaborator Author

Re API vs scraping - how important is this differentiation?
Using a simple API would be easier than doing a difficult scrape.
Could we start with an API example - well-structured data - and then move on to scraping - more difficult HTML parsing?
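The progression being proposed here can be illustrated with invented payloads: an API typically hands back structured JSON, while scraping has to recover the same facts from markup. Both payloads below are made up for the example.

```python
# Contrast sketch: structured API data vs. data recovered from HTML.
# Both payloads are invented; no real API or page is queried.
import json

from bs4 import BeautifulSoup

api_payload = '{"resolutions": [{"year": 2017, "number": 2341}]}'
html_payload = "<table><tr><td>2017</td><td>2341</td></tr></table>"

# API route: the structure is already there
from_api = json.loads(api_payload)["resolutions"][0]

# Scraping route: the same record has to be picked out of the markup
cells = BeautifulSoup(html_payload, "html.parser").find_all("td")
from_html = {"year": int(cells[0].get_text()), "number": int(cells[1].get_text())}

print(from_api == from_html)  # True: same record, two routes
```

Starting with the JSON case and then showing the HTML case makes the extra work that scraping involves visible in a few lines.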

@ostephens
Collaborator Author

@runderwood in case it's of interest, I also did a similar one for introducing APIs http://www.meanboyfriend.com/overdue_ideas/2016/06/introduction-to-apis-using-iiif/

@runderwood
Collaborator

@ostephens I think they're very different, in practice. An API implies that the data is being made easy to obtain, with documentation, a nod to standards, etc. Web scraping is usually working around in-built or incidental barriers to aggregating data.

I think the approach in the existing lesson makes a lot of sense. I'd be interested in taking its approach, and even its use case, as a core for reworking.

@copystar

copystar commented Jun 1, 2017

@ostephens For an introduction for beginners, using Wikipedia as an example would allow them to use a source that they are already familiar with. When I first tried working with XPath, I found the structure of Wikipedia very straightforward. And once you can master web scraping a column of data from a long list on Wikipedia (or Wikidata), you have the ability to draw on every subject matter.

That being said, I understand what you mean about not needing web scraping because Wikipedia already provides an API. But beginners likely can't make use of that option yet.

Again, this is just a suggestion. I'm more than happy to work with the examples already selected.

@ostephens
Collaborator Author

@copystar no examples selected so far. The existing lesson uses members of parliament - but I'm not convinced this is a good set of examples for LC.

Wikipedia/wikidata seem reasonable examples to me

@ostephens
Collaborator Author

I've got to go now - will be back online either later this evening or tomorrow

@copystar

copystar commented Jun 1, 2017

Ok. I'm going to start working on a version of https://github.com/qut-dmrc/web-scraping-intro-workshop. While this example does make use of Python, I think its approach of using Requests and Beautiful Soup is less intimidating than Scrapy.
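The Requests + Beautiful Soup pattern that workshop relies on can be sketched roughly as follows; the URL, the `<h2>` selector, and the function names here are placeholders, not taken from the qut-dmrc material.

```python
# Minimal sketch of the Requests + Beautiful Soup fetch-then-parse pattern.
# Selector and function names are placeholders for illustration.
import requests
from bs4 import BeautifulSoup

def extract_titles(html_text):
    """Return the text of every <h2> in a page."""
    soup = BeautifulSoup(html_text, "html.parser")
    return [h2.get_text(strip=True) for h2 in soup.find_all("h2")]

def scrape_titles(url):
    """Fetch a page and extract its <h2> headings."""
    response = requests.get(url)
    response.raise_for_status()  # fail loudly on HTTP errors
    return extract_titles(response.text)

# The parsing half can be exercised without touching the network:
print(extract_titles("<h2>First</h2><p>body</p><h2> Second </h2>"))  # ['First', 'Second']
```

Splitting the fetch from the parse like this also makes the lesson's exercises testable offline, which helps in workshop settings with unreliable wifi.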

@runderwood
Collaborator

@copystar Requests and BS are super important packages, so I'm all for using them here. 👍

Maybe before we start hacking away, we could nail down the structure. I think if we can get that hammered out, since we have a general sense of the toolset we'd like to use, we can then talk about what use case we'd like to pursue.

@copystar

copystar commented Jun 1, 2017

@runderwood Good idea! Will hold off. Thanks

@runderwood
Collaborator

@copystar Sweet. When we're ready, I can fork so we can do pull requests there. No need to start from scratch!

@ldko
Collaborator

ldko commented Jun 1, 2017

@copystar we will go ahead and lay out the structure on this etherpad http://pad.software-carpentry.org/scrape and if that structure is agreeable, we can claim parts to work on -- also, the etherpad has a chat built in if we want to discuss anything.

@ldko
Collaborator

ldko commented Jun 1, 2017

@copystar we have outlined a tentative new structure on the etherpad - do you have any comments or changes you want to add?

@alixk

alixk commented Jun 1, 2017

Sorry for being MIA--back now and checking out the proposed structure!

@copystar

copystar commented Jun 1, 2017

@ldko It looks good to me!

@ldko
Collaborator

ldko commented Jun 5, 2017

This was the structure proposed in the etherpad during the Sprint:

Proposed structure:
1 Intro: What is web scraping? & brief intro to data ethics
2 Document structure & selectors
  2.1 XPath (content looks fine; after developing the BeautifulSoup lessons, we should revisit this to align with them as needed)
  2.2 CSS selectors
3 Introduction to scraping with BeautifulSoup (based on the existing OU lesson plan, but using the Security Council Resolutions - http://www.un.org/en/sc/documents/resolutions/)
4 Advanced web scraping using Python and BeautifulSoup (UN Security Council Resolutions)
  A. Count the total number of Security Council resolutions per year and print the totals
  B. Generate a CSV with a row for each resolution, including year, resolution #, description, and link to PDF
5 Conclusion, including a group conversation on the ethics of web scraping
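Exercise A in the proposed structure could be sketched along these lines. The UN site's real markup is not reproduced here, so the HTML fragment and the "S/RES/nnnn (year)" label format are assumptions standing in for a resolutions listing page.

```python
# Rough sketch of exercise A (count resolutions per year).
# The listing markup and label format below are assumptions.
from collections import Counter

from bs4 import BeautifulSoup

listing = """
<table>
  <tr><td><a href="res1.pdf">S/RES/2341 (2017)</a></td></tr>
  <tr><td><a href="res2.pdf">S/RES/2342 (2017)</a></td></tr>
  <tr><td><a href="res3.pdf">S/RES/2268 (2016)</a></td></tr>
</table>
"""

soup = BeautifulSoup(listing, "html.parser")
# Pull the year out of each label, e.g. "S/RES/2341 (2017)" -> "2017"
years = [a.get_text().split("(")[1].rstrip(")") for a in soup.find_all("a")]
totals = Counter(years)
print(totals)  # 2017 -> 2, 2016 -> 1
```

Exercise B would extend the same loop, writing one row per anchor to a CSV with the csv module instead of counting.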
