No description, website, or topics provided.
Switch branches/tags
Nothing to show
Latest commit fe49aef Jun 22, 2013
edraves Fixed description of Driver.process_page
Updated Driver.process_page docstring to reflect what it actually does.
Failed to load latest commit information.
podscrape Fixed description of Driver.process_page Jun 22, 2013
tests Hooked File based output into Driver Jun 22, 2013
.gitignore Added method to FileOutput for lookup info Jun 21, 2013
README.rst Added FileOutput class Jun 21, 2013 Hooked File based output into Driver Jun 22, 2013



A Python module for crawling the iTunes Podcast directory.

The reason for scraping the web pages is to collect each of the podcast urls, which contains the podcast's id number. This id number can be used in Apple's Lookup API in order to get the podcast's actual url (Apple doesn't host any of these podcasts, just puts up a fancy landing page for them on iTunes). Since I use Linux, and don't use iTunes, I'd rather have the original feed urls.

A sample script is included in the root directory of the project. It starts from near the end of the list. If you want to run it from the beginning, the very first url is in a comment in the script. Keep Control+C handy, it will probably run for a while if you let it. I haven't tested to see what kind of rate limiting Apple does, or if they have cutoffs. So consider that before running from the beginning.

Running the Unit Tests:

# Once nose is installed
$ nosetests
  • beautifulsoup4: for parsing and extracting content from web pages
  • nose: for testing (though some of the tests are usable with unittest)
  • requests: for making http requests

Set up the Python environment

Install Virtualenv

Always use a virtualenv when you install Python package requirements. It saves a ton of pain with library versioning issues, and prevents clutter in the system wide package directory. It also lets you install Python packages through its own internal pip, which doesn't require superuser permissions.

If pip isn't installed, nor available in the package manager repositories, then install distribute first. I use distribute instead of setup-tools because distribute is more forward compatible with Python 3.

Install Distribute:

$ curl -O
$ python

Install Pip:

$ curl -O
$ [sudo] python

Install Virtualenv:

$ [sudo] pip install virtualenv

Set up the Virtualenv

Make a virtualenv for us to run in:

#This can go wherever. e.g. /usr/local/
#Creates a new folder './podscrape-env' to hold the environment
#If you're using setup-tools instead of distribute, leave out the --distribute flag
$ cd /usr/local
$ virtualenv podscrape-env --distribute --no-site-packages

Activate the virtualenv:

$ source /usr/local/podscrape-env/bin/activate

Install the requirements from the project:

$ cd /path/to/the/code
$ pip install -r requirements.txt