Adapting this for Resbaz Sydney #30

Closed
jnothman opened this issue Jun 15, 2017 · 14 comments

Comments

@jnothman
Contributor

jnothman commented Jun 15, 2017

I thought I should shout out that Sydney University has asked me to present a web scraping introduction to researchers at https://2017.resbaz.com/sydney in early July. This is a great place to start, but I was hoping to make the following changes (time permitting):

  • use toscrape.com as a less context-biased test site, and to highlight some of the problems that scrapers need to work around.
  • incorporate XPath exercises using the XPath Diner
  • mention CSS selectors as an alternative to XPath (I had originally wanted to do it all with CSS selectors due to their familiarity, but given the existing XPath content, that it more likely teaches everyone something new, and that it is more expressive, I'm now likely to go with XPath). The main problem with XPath is the pain of selecting by class name (see the comparison sketch after this list).
  • ?perhaps start by building a point-and-click scraper with grepsr.com, which I found to be the most user-friendly of the available tools, requiring no knowledge of XPath, etc.
  • The Scraper Chrome extension seems very limited, in that it only does single-page scrapes, and I've found it buggy (I get error popups whose message is [object Object]). I would rather introduce users to Portia, which can then be converted to Scrapy; the main problem with that is that it's not so easy for most researchers to install themselves, although the docker run is trivial for a techie.
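
To illustrate that class-name pain point, here is a minimal sketch using lxml on a made-up fragment (the element and class names are hypothetical). CSS gets by with `div.quote`; robust XPath has to normalise the class attribute itself:

```python
# A made-up fragment; CSS selects it with just "div.quote", but the naive
# XPath //div[@class="quote"] misses it because of the second class.
from lxml import html

fragment = html.fromstring('<div class="quote special"><span>hi</span></div>')

# The robust XPath idiom normalises @class before matching one token:
matches = fragment.xpath(
    "//div[contains(concat(' ', normalize-space(@class), ' '), ' quote ')]"
)
print(len(matches))  # 1
```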

Now that this has been rehomed to data-lessons and is being actively maintained, I will try to make my revisions more structured.

@jnothman
Contributor Author

I would also love to hear critique of this lesson plan variation.

@jnothman
Contributor Author

jnothman commented Jun 15, 2017

What I would like to know is whether these changes would be merged in here, or whether I should maintain a fork and give back piece by piece after the fact.

For the sake of rapid development, I intend to work with a fork at https://github.com/ctds-usyd/library-webscraping

@weaverbel
Collaborator

Hi @jnothman. Please work in your fork for now. The lesson maintainers will need time to look at your changes and to decide whether they want to incorporate them. The key for us is to work with open source tools that work across all platforms. Glad that you are doing the workshop, and I look forward to hearing how it goes.

@jnothman
Contributor Author

jnothman commented Jun 15, 2017

Yes, that's my main uncertainty regarding grepsr. I'm certainly sensitive to open-source as a priority. Still, try it out. It's very neat!

Scrapy is currently included in the lesson, hence some of the motivation for considering Portia, which is an open-source visual interface for building Scrapy-like scrapers, i.e. a lot more powerful than the Chrome Scraper extension, which is limited to single-page scrapes. Unlike many online tools, you own the software, the server and the scraper definition.

But in reality it's possible (read: likely) I'll not have time to rewrite for Portia.

In that case, my focus will be cleaning up what's here, and moving it to work with toscrape.com.

@ostephens
Collaborator

Hi @jnothman

At the moment this web scraping lesson is in development. The basic lesson was copied from work by @timtomch, but during the recent global sprint it was agreed to make some substantive changes to what we have here.

This work led to a new proposed structure:

1. Intro: What is web scraping? & a brief intro to data ethics
2. Document structure & selectors
   2.1 XPath (content looks fine; after developing the BeautifulSoup lessons, we should revisit this to align with them as needed)
   2.2 CSS selectors
3. Introduction to scraping with BeautifulSoup (based on the existing OU lesson plan http://ouinformatics.github.io/swc_beautiful_soup/ but using the Security Council Resolutions: http://www.un.org/en/sc/documents/resolutions/)
4. Advanced web scraping using Python and BeautifulSoup (UN Security Council Resolutions)
   A. Count the total number of Security Council resolutions per year and print the totals
   B. Generate a CSV with a row for each resolution, including year, resolution number, description, and link to the PDF (a sketch follows this outline)
5. Conclusion, including a group conversation on the ethics of web scraping
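
To make 4A and 4B concrete, here is a rough sketch of the kind of code the episode could build up to. It assumes each year has its own index page listing resolutions in an HTML table; the URL pattern and markup here are assumptions, not the UN site's actual structure:

```python
# Hypothetical sketch of 4A/4B: count resolutions for a year, then write a CSV.
import csv

import requests
from bs4 import BeautifulSoup

year = 2016
url = f"http://www.un.org/en/sc/documents/resolutions/{year}.shtml"  # assumed pattern
soup = BeautifulSoup(requests.get(url).text, "html.parser")

rows = soup.find_all("tr")  # assumed: one table row per resolution
print(f"{year}: {len(rows)} resolutions")  # 4A: per-year totals

# 4B: one CSV row per resolution
with open("resolutions.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["year", "resolution", "description", "pdf_link"])
    for row in rows:
        link = row.find("a")
        cells = row.find_all("td")
        if link is None or len(cells) < 2:
            continue  # skip header rows or rows without a PDF link
        writer.writerow([year, link.get_text(strip=True),
                         cells[-1].get_text(strip=True), link.get("href")])
```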

More info at:

It looks like there is some overlap with your ideas - in particular the introduction of CSS selectors and the decision not to use the Chrome extension.

However, as you can see, there was a general feeling that Beautiful Soup was a better place to start than Scrapy, and a move away from visual scraping tools to focus on using BS/Python to do the scraping. There was extensive discussion about this point (see #7), so while I like the idea of Portia (which I hadn't seen before - thanks!) I'm reluctant to revisit this discussion again so soon :)

I think your approach of forking and doing your own version is definitely the right thing to do, but if there is anything you can offer back, especially improvements to the XPath content and anything on CSS selectors, that would be very welcome.

@jnothman
Contributor Author

jnothman commented Jun 15, 2017 via email

@jnothman
Contributor Author

jnothman commented Jun 15, 2017

To be clear, if mine is the soonest application of this course, I would much rather make something that will take you towards the decided goals.

To show you where I'm at (not far), these are the content changes I'd made up to today (when I noticed your recent efforts): ctds-usyd@df9f141. Note that I've not yet merged in the changes from here.

Also, for the record, here are topics I thought one could cover before I looked for prior art:

  • What we are trying to do:
    • Example applications
    • The idea of DBs + generated pages (and inverting that process)
  • What we are NOT trying to do:
    • Retrieving sets of data from arbitrary locations/formats on the web
    • Information extraction from free text
  • Introduction to steps: spidering, scraping, structuring/storing
  • Spidering:
    • Lists of URLs / hyperlinks
    • URL structure and URL hacking
    • Recursive crawling + URL patterns (a sketch follows this outline)
    • Paginated lists of URLs (a scraping task in itself)
    • Periodic scheduling
  • A UI for designing scrapers (???? Find one)
  • Scraping:
    • The structure of a web page
    • CSS selectors
    • designing and refining CSS selectors
    • Regular expressions (brief)
  • Structuring/storing:
    • Relational DBs
    • JSON
  • Harder cases (brief)
    • Nested scrapers: Each page is a record, and in each page are multiple sub-records
    • Conditional logic
    • ?Hidden content: microdata
  • Legalities:
    • Terms and conditions
    • Rate limiting
  • Alternatives to writing your own:
    • Automatic visual scrapers
    • Boilerplate removal tools
    • microdata extraction
    • APIs (brief! Otherwise need to discuss input formats, output formats, authentication)
      • e.g. Facebook’s open graph
    • existing scrapers
  • Difficult input formats, such as PDFs
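
As a concrete (hypothetical) illustration of the spidering step above, here is a sketch that combines recursive crawling with a URL pattern over paginated lists, assuming toscrape.com's /page/N/ pagination scheme:

```python
# A hypothetical spidering sketch: follow only links whose URL matches a
# pagination pattern, collecting every page URL to hand to the scraping step.
import re
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

start = "http://quotes.toscrape.com/"
pattern = re.compile(r"/page/\d+/$")  # assumed pagination scheme

seen, frontier = set(), [start]
while frontier:
    url = frontier.pop()
    if url in seen:
        continue
    seen.add(url)
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    for a in soup.find_all("a", href=True):
        target = urljoin(url, a["href"])  # resolve relative hrefs
        if pattern.search(target):
            frontier.append(target)

print(sorted(seen))  # every page discovered by the crawl
```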

@jnothman
Contributor Author

I see some immediate benefits to BeautifulSoup rather than Scrapy, notably not requiring spidering strategies and class definitions. BeautifulSoup fits more into a procedural paradigm, and while the declarative Scrapy approach may be appropriate paradigmatically, it's a bigger jump for people who have been taught roughly-procedural programming, which is, I think, the assumption in Data Carpentry.

The biggest disadvantage of BeautifulSoup is that it does not support XPath.
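
For comparison, here is a minimal sketch of the same selection done both ways, assuming the markup of quotes.toscrape.com (div.quote containing span.text):

```python
import requests
from bs4 import BeautifulSoup
from lxml import html

page = requests.get("http://quotes.toscrape.com/").text

# BeautifulSoup offers CSS selectors, but no XPath
soup = BeautifulSoup(page, "html.parser")
css_quotes = [el.get_text() for el in soup.select("div.quote span.text")]

# lxml offers XPath
tree = html.fromstring(page)
xpath_quotes = tree.xpath('//div[@class="quote"]//span[@class="text"]/text()')

print(css_quotes == xpath_quotes)  # True if the assumed markup is right
```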

I will try to work towards the model you've designed, if that's okay. I'll need to revise some bs4!

@jnothman
Contributor Author

jnothman commented Jun 15, 2017

I am a little concerned that the current proposed plan does not include a visual scraper or a tool for visual experimentation with selectors. I want to be able to present this to people who aren't such confident programmers.

@ostephens
Collaborator

Thanks @jnothman

I agree with your analysis of the benefits of BS. However, I hadn't realised that it doesn't support XPath (I use Ruby with Nokogiri in the real world...).

I personally agree with your concern about the lack of a visual tool of some kind, but I didn't win that argument on #7.

It might be worth jumping on the Gitter channel for Library Carpentry to see whether any of the people who contributed to this lesson during the sprint are on there; it would be good to get their input and views.

Generally, actions speak louder than words, so if you do something and show it works, I think that is a strong argument for considering bringing it into this lesson!

I think the biggest difference in syllabus between what was outlined during the sprint and what you have suggested is the work around the storage of scraped data; I don't think this was considered at all in discussions around this lesson.

@jnothman
Contributor Author

I also think raising "Look for an API before you scrape" is important. I'll try Gitter when I'm able to focus on this.

@jnothman
Contributor Author

(Btw, these days lxml apparently has as robust an HTML parser as bs4. I'll have to think a little more.)
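
A quick way to check that claim (a sketch; the exact repaired output depends on the lxml version installed):

```python
# lxml's HTML parser recovers from broken markup much as bs4's parsers do:
from lxml import html

broken = "<p>unclosed <b>tags and a stray </i> closer"
print(html.tostring(html.fromstring(broken)))  # prints the repaired tree
```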

@timtomch
Collaborator

Just chiming in to say that I'm glad everyone is working on improving this lesson. I don't currently have the capacity to devote much time to this, but feel free to ping me if there's anything specific I can help with.

@ldko
Collaborator

ldko commented Jun 15, 2017

I agree on an emphasis on "look for an API before you scrape" being important.

I think it would make sense to use lxml if we want people to be able to use XPath. Is there content that would be impossible to get without XPath? If not, we might not need to teach XPath. If so, we could switch to lxml. I remember using BeautifulSoup when I was a new programmer, and it made sense. I also remember working with XPath and lxml and being confused by XPath syntax.

I think keeping an episode about using a visual tool for scraping would be OK if the tool is free (preferably open source) and can be run on Windows/Mac/Linux. If it has been around for a while and has community support behind it, that would be nice too.

> Generally, actions speak louder than words, so if you do something and show it works, I think that is a strong argument for considering bringing it into this lesson!

I agree with that, being someone who has talked about restructuring the lesson but has not actually done any of the work to add the new components we talked about (I would like to if I can find the time)...
