Discovering Discourse: The Relationship between Media and NGOs in Egypt between 2011--13
The data for this project consists of all articles published by Al-Ahram English, Daily News Egypt, and Egypt Independent between November 24, 2011 and April 25, 2013. We used multiple virtual private servers to download local mirrors of their sites with wget, and then used BeautifulSoup in Python to extract the data into an SQLite database.
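As a rough illustration of the extraction step, the sketch below parses one mirrored HTML page with BeautifulSoup and stores the result in SQLite. It is not the code in parse_raw_html/: the function names, the `articles` table, its columns, and the assumption that the headline sits in an `<h1>` and the body in `<p>` tags are all hypothetical, and each real site needed its own selectors and manual corrections.

```python
import sqlite3

from bs4 import BeautifulSoup


def clean(text):
    """Collapse runs of whitespace into single spaces."""
    return " ".join(text.split())


def parse_article(html):
    """Pull a title and body out of one page (hypothetical page layout)."""
    soup = BeautifulSoup(html, "html.parser")
    title_tag = soup.find("h1")  # assumption: the headline is in <h1>
    title = clean(title_tag.get_text()) if title_tag else ""
    # assumption: every non-empty <p> is part of the article body
    paragraphs = [clean(p.get_text()) for p in soup.find_all("p")]
    body = "\n\n".join(p for p in paragraphs if p)
    return {"title": title, "body": body}


def save_article(conn, url, article):
    """Insert one parsed article into a hypothetical `articles` table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS articles "
        "(url TEXT PRIMARY KEY, title TEXT, body TEXT)"
    )
    conn.execute(
        "INSERT OR REPLACE INTO articles (url, title, body) VALUES (?, ?, ?)",
        (url, article["title"], article["body"]),
    )
    conn.commit()
```

Keying the table on the URL means that re-running the parser over the same mirror overwrites rows instead of duplicating them, which helps when a scrape has to be repeated.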
For the sake of transparency, the files for scraping and parsing are in parse_raw_html/. However, because the whole process took weeks (and a lot of manual corrections), none of those files are included in the Makefile. Instead, the Makefile assumes you have copies of the complete, clean SQLite corpora.
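Before running the Makefile, it may help to confirm that a corpus file is present and readable. The following is a minimal sketch using only the Python standard library; the function name and any file path you pass in are placeholders, since the actual corpus filenames and table layout are not documented here.

```python
import sqlite3
from contextlib import closing
from pathlib import Path


def list_tables(db_path):
    """Return the table names in an SQLite file, or raise if it is missing."""
    path = Path(db_path)
    if not path.is_file():
        # sqlite3.connect() would silently create an empty database here
        raise FileNotFoundError(f"corpus not found: {path}")
    with closing(sqlite3.connect(str(path))) as conn:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    return [name for (name,) in rows]
```

A file that is not a valid SQLite database makes the query raise `sqlite3.DatabaseError`, so a corrupt or partial download shows up before the build starts.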
Because the corpora are fairly large (160–500 MB), and because of potentially murky intellectual property issues, we have not included them in this repository. If you are interested in replicating, extending, or playing around with this project, contact Andrew Heiss to get access to the corpora.
OS X and Linux