MapReduce Jobs for CHNM's Million Syllabi Database


This repository contains a series of MapReduce jobs that run on a sample of 100 syllabi from CHNM's million syllabi database. They can also be run on the entire million+ dataset; however, only a subset of the data has been cleaned and reformatted at this time. The jobs are written in Python, using MRJob.


  • /data/ - Includes syllabi_sample.tsv, which is the first 100 records from the CHNM syllabi database.
  • - Calculate the average number of words per syllabus text.
  • - Count the number of syllabi in the dataset. This is the most basic example of MapReduce with MRJob I could write.
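
As a rough illustration of the two jobs listed above, here is a minimal sketch of the map and reduce steps in plain Python. It is not the code from this repository: it does not import MRJob, the function names and the tiny `run_job` driver are hypothetical, the sample rows are made up, and it assumes the syllabus text sits in the last tab-separated column of each record. In an actual MRJob job, the mapper and reducer functions would roughly correspond to the `mapper` and `reducer` methods of an `MRJob` subclass.

```python
from collections import defaultdict

# Assumption: the syllabus text is the last tab-separated field of each record.
TEXT_COLUMN = -1


def count_mapper(line):
    """Emit one ("syllabi", 1) pair for every record."""
    yield "syllabi", 1


def count_reducer(key, values):
    """Sum the ones emitted by the mapper to get the record count."""
    yield key, sum(values)


def words_mapper(line):
    """Emit the number of whitespace-separated words in the syllabus text."""
    text = line.rstrip("\n").split("\t")[TEXT_COLUMN]
    yield "words", len(text.split())


def average_reducer(key, values):
    """Average the per-syllabus word counts."""
    counts = list(values)
    yield key, sum(counts) / len(counts)


def run_job(lines, mapper, reducer):
    """Hypothetical local driver: map every line, group by key, then reduce."""
    grouped = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            grouped[key].append(value)
    results = {}
    for key, values in grouped.items():
        for out_key, out_value in reducer(key, values):
            results[out_key] = out_value
    return results


# Made-up rows for illustration only; not taken from syllabi_sample.tsv.
sample = [
    "1\tIntro to History\tWeek one readings and essays",
    "2\tBiology 101\tLab safety",
]

print(run_job(sample, count_mapper, count_reducer))
print(run_job(sample, words_mapper, average_reducer))
```

The counting job is the simplest case because every record maps to the same key and value. The averaging job only changes what the mapper emits and how the reducer combines the values.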