MapReduce jobs to run on a corpus of a million course syllabi.

MapReduce Jobs for CHNM's Million Syllabi Database


This repository contains a series of MapReduce jobs that run on a sample of 100 syllabi from CHNM's million syllabi database. The jobs can also be run on the entire dataset of more than a million syllabi; however, only a subset of the data has been cleaned and reformatted at this time. The jobs are written in Python, using MRJob.


  • /data/ - Includes syllabi_sample.tsv, which contains the first 100 records from the CHNM syllabi database.
  • - Calculate the average number of words per syllabus text.
  • - Count the number of syllabi in the dataset. This is the most basic example of MapReduce with MRJob that I could write.
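
The two jobs described above (counting syllabi and averaging words per syllabus) can be sketched roughly as follows. This is not the repository's code: it is a plain-Python illustration of the map and reduce steps, written so it runs without MRJob installed. In an actual MRJob job, the same logic would live in the `mapper` and `reducer` methods of a class that subclasses `MRJob`. The names `TEXT_COLUMN`, `run`, and `average_words` are hypothetical, and the assumption that the syllabus text is the last tab-separated column of syllabi_sample.tsv is a guess about the file layout.

```python
from collections import defaultdict

# Assumption: each line is one tab-separated record and the syllabus text
# sits in the last column. The real layout of syllabi_sample.tsv may differ.
TEXT_COLUMN = -1


def mapper(line):
    """Map phase: emit (key, value) pairs for one input record."""
    fields = line.rstrip("\n").split("\t")
    words = fields[TEXT_COLUMN].split()
    yield "syllabi", 1          # one record seen
    yield "words", len(words)   # number of words in this record's text


def reducer(key, values):
    """Reduce phase: sum every value emitted under the same key."""
    yield key, sum(values)


def run(lines):
    """In-memory stand-in for the grouping step a MapReduce framework performs."""
    grouped = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            grouped[key].append(value)
    totals = {}
    for key, values in grouped.items():
        for out_key, total in reducer(key, values):
            totals[out_key] = total
    return totals


def average_words(totals):
    """Average words per syllabus, computed from the two reduced totals."""
    return totals["words"] / totals["syllabi"] if totals.get("syllabi") else 0.0
```

Emitting both counters from a single mapper means one pass over the data yields the syllabus count and the word total, and the average is a single division at the end.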