scrape old lectopia lectures
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Type Name Latest commit message Commit time
Failed to load latest commit information.
static
templates
.gitattributes
.gitignore
LICENSE.txt
README.md
backup.py
getallmeta.py
getlist.py
getmedia.py
getmeta.py
nginx.conf
redo.py
requirements.txt
scalpel.sh
schema.sql
setup.py
web.py

README.md

Scrapetopia

Preparation

Install Python 2.7, pip and a C compiler. Windows users would be far better off using cygwin, though it may work with native Python and exactly MSVC++ 2008.

If you don't have pip, install it with easy_install pip.

Then install dependencies with pip install -r requirements.txt.

Finally, run setup.py to create the database and other directories.

Fetching lecture metadata

Run getallmeta.py to fetch lecture information.

There is no official list of units, and no indication of the highest ID, so by default it tries 0 to 9999, which appears to be sufficient because the highest ID that exists within this range is 5151.

Extracting the media file list

Run getlist.py to dump all media file URLs from the database, except for those which have a corresponding file saved in data/media.

The list of files will be at data/list.txt. Don't try to feed it into a typical download manager like JDownloader, it won't handle it nicely because there can be over 100000 files.

Downloading media files

Run getmedia.py to download media files.

It supports retrying and continuing partial downloads, so feel free to ^C at any time. If you restart the downloader, it'll go through and ping all of the files to make sure they exist and are the right size.

Files that are definitely done will automatically go into data/done.txt, and you can remove these from the main list by running scalpel.sh. Running this is optional but makes restarts faster because the list is shortened.

If you have lost data/done.txt and want to remove completed files in data/media from your current data/list.txt, run redo.py and a new list.txt will be generated without the existing media files.

Backing up media files

You can run backup.py with the main and backup directories supplied as arguments to copy the media files to another location. The need to copy a particular file is determined by a quick check of file sizes, as matching file contents exactly would be very slow.

Hosting the web-based browser

Run web.py to start the application.

Install nginx and use the provided nginx.conf, changing the alias path on line 15 to point to the data/media directory. nginx proxies to the application but is still needed because it is responsible for serving the media files.

Start nginx and the interface will be available on port 8653.

Screenshots

statistics

units

lectures

files

player