Media Cloud is an open source, open data platform that allows researchers to answer quantitative questions about the content of online media.


This is the source code for the Media Cloud core system. Media Cloud, a joint project of the Berkman Center for Internet & Society at Harvard University and the Center for Civic Media at MIT, is an open source, open data platform that allows researchers to answer complex quantitative and qualitative questions about the content of online media.

For more information on Media Cloud, go to

NOTE: Most users prefer to use Media Cloud's API and public tools to query our data instead of running their own Media Cloud instance.

The code in this repository will be of interest to those users who wish to run their own Media Cloud instance and users of the public tools who want to understand how Media Cloud is implemented.

The Media Cloud code here does three things:

  • Runs a web app that allows you to manage a set of media sources and their feeds.

  • Periodically crawls the feeds set up within the web app and downloads any new stories found within the downloaded feeds.

  • Extracts the substantive text from the downloaded story content (minus the ads, navigation, comments, etc.) and associates a set of tags with each story based on that extracted text.
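
To make the crawl, extract, and tag steps above more concrete, here is a minimal, standard-library-only Python sketch of that flow. It is an illustration under stated assumptions, not Media Cloud's actual implementation: the names `Feed`, `new_story_urls`, `extract_text`, and `tag_story` are hypothetical, the extractor simply drops a fixed set of HTML elements, and the tagger is a plain keyword lookup.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from html.parser import HTMLParser


@dataclass
class Feed:
    """Hypothetical record for one feed of a media source."""
    url: str
    seen_story_urls: set = field(default_factory=set)


def new_story_urls(feed, rss_xml):
    """Return the story links in an RSS document that this feed has not seen before."""
    root = ET.fromstring(rss_xml)
    links = [item.findtext("link") for item in root.iter("item")]
    fresh = [u for u in links if u and u not in feed.seen_story_urls]
    feed.seen_story_urls.update(fresh)
    return fresh


class _TextExtractor(HTMLParser):
    """Collects text while skipping elements assumed to hold non-story content."""
    SKIPPED = {"script", "style", "nav", "aside", "footer"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(html):
    """Very rough stand-in for extracting the substantive story text."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def tag_story(text, keyword_tags):
    """Associate tags with a story via a keyword-to-tag mapping (an assumption)."""
    words = set(text.lower().split())
    return sorted(tag for keyword, tag in keyword_tags.items() if keyword in words)
```

In this sketch, a periodic job would call `new_story_urls` for each feed, download each returned URL, pass the HTML to `extract_text`, and store the result of `tag_story` alongside the story.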

For very brief installation instructions, see INSTALL.markdown.

Please send us a note at if you are using any of this code or if you have any questions. We are very interested in knowing who's using the code and for what.

For a brief roadmap of the code contained in this release, see repo-map.markdown.


History of the Project

Print newspapers are declaring bankruptcy nationwide. High-profile blogs are proliferating. Media companies are exploring new production techniques and business models in a landscape that is increasingly dominated by the Internet. In the midst of this upheaval, it is difficult to know what is actually happening to the shape of our news. Beyond one-off anecdotes or painstaking manual content analysis, there are few ways to examine the emerging news ecosystem.

The idea for Media Cloud emerged through a series of discussions between faculty and friends of the Berkman Center. The conversations would follow a predictable pattern: one person would ask a provocative question about what was happening in the media landscape, someone else would suggest interesting follow-on inquiries, and everyone would realize that a good answer would require heavy number crunching. Nobody had the time to develop a huge infrastructure and download all the news just to answer a single question. However, there were eventually enough of these questions that we decided to build a tool for everyone to use.

Some of the early driving questions included:

  • Do bloggers introduce storylines into mainstream media or the other way around?
  • What parts of the world are being covered or ignored by different media sources?
  • Where do stories begin?
  • How are competing terms for the same event used in different publications?
  • Can we characterize the overall mix of coverage for a given source?
  • How do patterns differ between local and national news coverage?
  • Can we track news cycles for specific issues?
  • Do online comments shape the news?

Media Cloud offers a way to quantitatively examine all of these challenging questions by collecting and analyzing the news stream of tens of thousands of online sources.

Using Media Cloud, academic researchers, journalism critics, policy advocates, media scholars, and others can examine which media sources cover which stories, what language different media outlets use in conjunction with different stories, and how stories spread from one media outlet to another.


Media Cloud is made possible by the generous support of the Ford Foundation, the Open Society Foundations, and the John D. and Catherine T. MacArthur Foundation.


Past and present collaborators include Morningside Analytics, Betaworks, and


Media Cloud is free software: you can redistribute it and/or modify it under the terms of the GNU Affero General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

Media Cloud is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Affero General Public License for more details.

You should have received a copy of the GNU Affero General Public License along with Media Cloud. If not, see <>.