Merge pull request coursera-dl#34 from rbrito/fixes/code-quality
Fixes/code quality
jplehmann committed Dec 22, 2012
2 parents 3dd088c + 30e7900 commit da47faa
Showing 2 changed files with 96 additions and 37 deletions.
51 changes: 37 additions & 14 deletions README.md
@@ -1,16 +1,26 @@
Coursera Downloader
===================

[Coursera] is creating some fantastic, free educational classes (e.g.,
algorithms, machine learning, natural language processing, SaaS). This
script allows one to batch download lecture resources (e.g., videos, ppt,
etc) for a Coursera class. Given a class name and related cookie file, it
scrapes the course listing page to get the week and class names, and then
downloads the related materials into appropriately named files and
directories.

Why is this helpful? Before I was using *wget*, but I had the following problems:

1. Video names have a number in them, but this does not correspond to the
actual order. Manually renaming them is a pain.
2. Using names from the syllabus page provides more informative names.
3. Using a wget in a for loop picks up extra videos which are not
posted/linked, and these are sometimes duplicates.

*DownloadThemAll* can also work, but this provides better names.

Inspired in part by [youtube-dl] by which I've downloaded many other good
videos such as those from Khan Academy.


Features
@@ -28,13 +38,16 @@ Features
Directions
----------

Requires Python 2.x (where x >= 5) and a free Coursera account enrolled in
the class of interest.

1\. Install any missing dependencies.

* [Beautiful Soup 3] or [Beautiful Soup 4]
Ubuntu/Debian for BS3: `sudo apt-get install python-beautifulsoup`
Ubuntu/Debian for BS4: `sudo apt-get install python-bs4`
Mac OSX: `bs4` may be required instead.
When using `bs4`, be sure to modify the import at the top of the script.
Other: `easy_install BeautifulSoup`
* [Argparse] (Not necessary if Python version >= 2.7)
Ubuntu/Debian: `sudo apt-get install python-argparse`
@@ -63,13 +76,22 @@ username, password (or a `~/.netrc` file) and the class name.
Specify download path: coursera-dl progfun-2012-001 -n --path=C:\Coursera\Classes\
Download multiple classes: coursera-dl progfun-2012-001 -n --add-class=hetero-2012-001 --add-class=thinkagain-2012-001

On \*nix platforms\*, the use of a `~/.netrc` file is a good alternative to
specifying both your username and password every time on the command
line. To use it, simply add a line like the one below to a file named
`.netrc` in your home directory (or the [equivalent], if you are using
Windows) with contents like:

machine coursera-dl login <user> password <pass>

Create the file if it doesn't exist yet. From then on, you can switch from
using `-u` and `-p` to simply call `coursera-dl` with the option `-n`
instead. This is especially convenient, as typing usernames and passwords
directly on the command line can get tiresome (even more if you happened to
choose a "strong" password).

\* if this works on Windows, please add additional instructions for it if
any are needed.
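As a sketch of how the `~/.netrc` entry above could be read programmatically, using Python's standard `netrc` module (the function name and the explicit-path parameter are illustrative, not part of the script):

```python
import netrc

def read_netrc_credentials(path=None):
    # Look up the 'machine coursera-dl' entry; with path=None the
    # module reads the default ~/.netrc.
    auth = netrc.netrc(path).authenticators("coursera-dl")
    if auth is None:
        raise ValueError("no 'machine coursera-dl' entry found")
    login, _account, password = auth
    return login, password
```

This is the same lookup `coursera-dl -n` performs conceptually: machine name first, then the login/password pair from that entry.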

Troubleshooting
---------------
@@ -90,16 +112,17 @@ Troubleshooting

Contact
-------

Post bugs and issues on [github]. Send other comments to John Lehmann:
first last at geemail dotcom or [@jplehmann]

[@jplehmann]: http://www.twitter.com/jplehmann
[1]: https://chrome.google.com/webstore/detail/lopabhfecdfhgogdbojmaicoicjekelh
[2]: https://addons.mozilla.org/en-US/firefox/addon/export-cookies
[youtube-dl]: http://rg3.github.com/youtube-dl
[Coursera]: http://www.coursera.org
[Beautiful Soup 3]: http://www.crummy.com/software/BeautifulSoup/bs3
[Beautiful Soup 4]: http://www.crummy.com/software/BeautifulSoup
[Argparse]: http://pypi.python.org/pypi/argparse
[wget]: http://sourceforge.net/projects/gnuwin32/files/wget/1.11.4-1/wget-1.11.4-1-setup.exe
[easy_install]: http://pypi.python.org/pypi/setuptools
82 changes: 59 additions & 23 deletions coursera-dl
@@ -1,6 +1,9 @@
#!/usr/bin/env python
"""
For downloading lecture resources such as videos for Coursera classes. Given
a class name, username and password, it scrapes the course listing page to
get the section (week) and lecture names, and then downloads the related
materials into appropriately named files and directories.
Examples:
coursera-dl -u <user> -p <passwd> saas
@@ -9,31 +12,42 @@ Examples:
Author:
John Lehmann (first last at geemail dotcom or @jplehmann)
Contributions are welcome, but please try to make them platform independent
and backward compatible.
"""

import argparse
import cookielib
import errno
import netrc
import os
import re
import string
import StringIO
import subprocess
import sys
import tempfile
import urllib
import urllib2

from BeautifulSoup import BeautifulSoup
# for OSX, bs4 is recommended
#from bs4 import BeautifulSoup

def get_syllabus_url(className):
"""
Return the Coursera index/syllabus URL.
"""
return "http://class.coursera.org/%s/lecture/index" % className

def get_auth_url(className):
return "http://class.coursera.org/%s/auth/auth_redirector?type=login&subtype=normal&email=&visiting=&minimal=true" % className

def write_cookie_file(className, username, password):
"""
Automatically generate a cookie file for the coursera site.
"""
try:
(hn,fn) = tempfile.mkstemp()
cj = cookielib.MozillaCookieJar(fn)
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj), urllib2.HTTPHandler())
@@ -57,9 +71,11 @@ def write_cookie_file(className, username, password):
return fn

def load_cookies_file(cookies_file):
"""
Loads the cookies file. I am pre-pending the file with the special
Netscape header because the cookie loader is being very particular about
this string.
"""
cookies = StringIO.StringIO()
NETSCAPE_HEADER = "# Netscape HTTP Cookie File"
cookies.write(NETSCAPE_HEADER)
@@ -69,7 +85,9 @@ def load_cookies_file(cookies_file):
return cookies
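The same header-prepending trick can be sketched in Python 3 with `http.cookiejar` (an illustrative re-statement, not the script's code; since `cj.load()` wants a filename rather than a file object, the patched contents go through a temporary file):

```python
import http.cookiejar
import os
import tempfile

NETSCAPE_HEADER = "# Netscape HTTP Cookie File\n"

def load_cookies(path):
    # MozillaCookieJar refuses files that do not start with the
    # magic Netscape header, so prepend it if it is missing.
    with open(path) as f:
        data = f.read()
    if not data.startswith("# Netscape"):
        data = NETSCAPE_HEADER + data
    # cj.load() requires a filename, not a file object, so write
    # the patched contents to a temporary file first.
    fd, tmp = tempfile.mkstemp(suffix=".txt")
    try:
        with os.fdopen(fd, "w") as f:
            f.write(data)
        cj = http.cookiejar.MozillaCookieJar()
        cj.load(tmp)
    finally:
        os.remove(tmp)
    return cj
```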

def get_opener(cookies_file):
"""
Use cookie file to create a url opener.
"""
cj = cookielib.MozillaCookieJar()
cookies = load_cookies_file(cookies_file)
# nasty hack: cj.load() requires a filename not a file, but if
@@ -79,7 +97,9 @@ def get_opener(cookies_file):
return urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))

def get_page(url, cookies_file):
"""
Download an HTML page using the cookiejar.
"""
opener = get_opener(cookies_file)
#return opener.open(url).read()
ret = opener.open(url).read()
@@ -97,7 +117,9 @@ def grab_hidden_video_url(href, cookies_file):
return l[0]['src']

def get_syllabus(class_name, cookies_file, local_page=False):
"""
Get the course listing webpage.
"""
if (not (local_page and os.path.exists(local_page))):
url = get_syllabus_url(class_name)
page = get_page(url, cookies_file)
@@ -110,23 +132,29 @@ def get_syllabus(class_name, cookies_file, local_page=False):
return page

def clean_filename(s):
"""
Sanitize a string to be used as a filename.
"""
# strip paren portions which contain trailing time length (...)
s = re.sub("\([^\(]*$", "", s)
s = s.strip().replace(':','-').replace(' ', '_')
valid_chars = "-_.()%s%s" % (string.ascii_letters, string.digits)
return ''.join(c for c in s if c in valid_chars)
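For illustration, here is what the sanitization above does to a typical lecture title (the same logic restated for Python 3; the sample title is invented):

```python
import re
import string

def clean_filename(s):
    # Strip a trailing parenthesized portion, e.g. a "(12:34)" time length.
    s = re.sub(r"\([^\(]*$", "", s)
    s = s.strip().replace(':', '-').replace(' ', '_')
    valid_chars = "-_.()%s%s" % (string.ascii_letters, string.digits)
    return ''.join(c for c in s if c in valid_chars)

print(clean_filename("1.2 Intro: Basics (12:34)"))  # 1.2_Intro-_Basics
```

Note that any character outside the whitelist (slashes, quotes, etc.) is simply dropped rather than replaced.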

def get_anchor_format(a):
"""
Extract the resource file-type format from the anchor.
"""
# (. or format=) then (file_extension) then (? or $)
# e.g. "...format=txt" or "...download.mp4?..."
format = re.search("(?:\.|format=)(\w+)(?:\?.*)?$", a)
return format.group(1) if format else None
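To make the regex concrete, here it is against a couple of illustrative URLs (the sample URLs are invented for the example):

```python
import re

def get_anchor_format(a):
    # (. or format=) then (file_extension) then (? or $),
    # e.g. "...format=txt" or "...download.mp4?...".
    m = re.search(r"(?:\.|format=)(\w+)(?:\?.*)?$", a)
    return m.group(1) if m else None

print(get_anchor_format("lecture/download.mp4?lecture_id=42"))  # mp4
print(get_anchor_format("lecture/subtitles?q=42&format=srt"))   # srt
```

An anchor with neither a file extension nor a `format=` parameter yields `None`, and such links are skipped.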

def parse_syllabus(page, cookies_file):
"""
Parses a Coursera course listing/syllabus page. Each section is a week of
classes.
"""
sections = []
soup = BeautifulSoup(page)
# traverse sections
@@ -186,7 +214,9 @@ def download_lectures(
path='',
verbose_dirs=False
):
"""
Downloads lecture resources described by sections.
"""

def format_section(num, section):
sec = "%02d_%s" % (num, section)
@@ -218,7 +248,9 @@ def download_lectures(
open(lecfn, 'w').close() # touch

def download_file(url, fn, cookies_file, wget_bin):
"""
Downloads file and removes current file if aborted by user.
"""
try:
if wget_bin:
download_file_wget(wget_bin, url, fn, cookies_file)
@@ -230,14 +262,18 @@ def download_file(url, fn, cookies_file, wget_bin):
sys.exit()

def download_file_wget(wget_bin, url, fn, cookies_file):
"""
Downloads a file using wget. Could possibly use python to stream files to
disk, but wget is robust and gives nice visual feedback.
"""
cmd = [wget_bin, url, "-O", fn, "--load-cookies", cookies_file, "--no-check-certificate"]
print "Executing wget:", cmd
retcode = subprocess.call(cmd)

def download_file_nowget(url, fn, cookies_file):
"""
'Native' python downloader -- slower than wget.
"""
print "Downloading %s -> %s" % (url, fn)
urlfile = get_opener(cookies_file).open(url)
chunk_sz = 1048576