Code snippets for a workshop on web scraping.
.gitignore
00-basics.py
01-dataset.py
02-caching.py
03-multithreading.py
README.md
requirements.txt


Scrape the Gibson

These code snippets are the core of a post I wrote about web scraping in Python. It's aimed at people who have already done a bit of coding and want to explore scraping in Python in more depth. The workshop will be much easier if you have a Mac or Linux-based computer.

Dependencies

  1. Download repo: https://github.com/abelsonlive/scrape-the-gibson

  2. Install dependencies

  • If you don't have pip installed, type:
sudo easy_install pip
  • Change into the repository directory:
cd scrape-the-gibson
  • Now run:
sudo pip install -r requirements.txt

Topics

Introduction

  • Getting started with scraping in Python using requests
  • Exploring HTML documents and extracting data with BeautifulSoup
  • Saving scraped data to a database with dataset
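The three introductory topics fit together roughly as in the sketch below. It is a minimal illustration and not the code from the workshop files: the function names, the `links` table, and the `scrape.db` file are all hypothetical, and it assumes `requests`, `beautifulsoup4`, and `dataset` are installed from requirements.txt.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url):
    """Download a page and return its HTML as text."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # fail loudly on 4xx/5xx responses
    return response.text


def extract_links(html):
    """Return one dict per <a> tag that has an href attribute."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        {"text": a.get_text(strip=True), "href": a["href"]}
        for a in soup.find_all("a", href=True)
    ]


def save_rows(rows, db_url="sqlite:///scrape.db", table_name="links"):
    """Store the rows with dataset (hypothetical database and table names)."""
    import dataset  # imported here so the parsing code runs without it

    db = dataset.connect(db_url)
    table = db[table_name]  # dataset creates the table on first insert
    for row in rows:
        table.insert(row)
```

A typical run would chain the steps, for example `save_rows(extract_links(fetch_html(url)))`, where `url` is whatever page you want to scrape.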

Advanced

  • Thinking about ETL (Extract, Transform, Load)
  • Keeping your source data around
  • Running multiple requests in parallel to scrape faster
  • Using regular expressions to extract more data
  • Crawling entire sites programmatically
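Several of these ideas can be sketched together: keep the raw source on disk (the extract step), fetch pages in parallel threads, and pull values out of the saved HTML with a regular expression (the transform step). This is a hedged sketch, not the workshop's code. All names are hypothetical, the dollar-amount pattern is only an example target, and it uses the standard library's `urllib` in place of `requests` so that it runs without extra packages. Crawling is left out.

```python
import hashlib
import re
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from urllib.request import urlopen


def download(url):
    """Extract step: fetch the raw page over the network."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def cached_fetch(url, cache_dir="cache", fetch=download):
    """Return the page source, hitting the network only on a cache miss."""
    folder = Path(cache_dir)
    folder.mkdir(parents=True, exist_ok=True)
    # Hash the URL so any address maps to a safe file name.
    path = folder / (hashlib.sha1(url.encode("utf-8")).hexdigest() + ".html")
    if path.exists():
        return path.read_text(encoding="utf-8")
    html = fetch(url)
    path.write_text(html, encoding="utf-8")  # keep the source data around
    return html


def fetch_all(urls, workers=4, **kwargs):
    """Fetch several pages in parallel threads; results keep input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda u: cached_fetch(u, **kwargs), urls))


# Example pattern only: matches amounts such as $5 or $19.99.
PRICE = re.compile(r"\$(\d+(?:\.\d{2})?)")


def extract_prices(html):
    """Transform step: pull dollar amounts out of the saved source."""
    return [float(match) for match in PRICE.findall(html)]
```

Because the raw pages stay in the cache directory, the regular expression can be changed and re-run against the saved files without requesting the site again.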

Links

There are plenty of existing resources on scraping. A few links: