Code related to a working paper that was first presented at the AFSP Annual Meeting in Paris, 2013. See Section 1 of this paper and its appendix, or read the HOWTO below for a technical summary.
- June 2014 – Major update
  * Updated working paper
  * Added new appendix
  * Added five media scrapers
  * Updated Google Trends data
- June 2013 – First release
The scraper currently collects slightly over 6,300 articles from
- ecrans.fr (including articles from liberation.fr)
- lemonde.fr (first lines only for paid content)
- lesechos.fr (left-censored: no articles before December 2011)
- lefigaro.fr (first lines only for paid content)
- numerama.com (including old articles from ratiatum.com)
- zdnet.fr
The entry point is `make.r` (a usage sketch follows the list):

- `get_articles` will scrape the news sources (adjust the page counters to the current website search results to update the data)
- `get_corpus` will extract all entities and list the most common ones (set the minimum frequency with `threshold`; defaults to 10)
- `get_ranking` will export the top 15 central nodes of the co-occurrence network to the `tables` folder, in Markdown format
- `get_network` returns the co-occurrence network, optionally trimmed to its top quantile of weighted edges (set with `threshold`; defaults to 0)
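A minimal usage sketch, assuming that sourcing `make.r` loads the four routines into the workspace and that each accepts the arguments described above; the exact call signatures and return values in the scripts may differ.

```r
# Minimal usage sketch: function names come from the list above, but the
# exact arguments and return values are assumptions about the scripts.
source("make.r")

get_articles()                     # scrape the news sources (long-running)
get_corpus(threshold = 10)         # keep entities appearing at least 10 times
net <- get_network(threshold = 0)  # full weighted co-occurrence network
get_ranking()                      # export top 15 central nodes to tables/
```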
The corpus-building routines save three datasets (see the loading sketch below):

- `corpus.terms.csv` – a list of all entities, ordered by their raw counts
- `corpus.freqs.csv` – a list of the entities found in each article
- `corpus.edges.csv` – a list of undirected weighted network ties
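A hedged sketch of reading these exports back into R; the column layouts noted in the comments are assumptions about the CSV files, not documented formats.

```r
# Read the exported datasets; the column descriptions are assumptions.
terms <- read.csv("corpus.terms.csv")  # entity and raw count
freqs <- read.csv("corpus.freqs.csv")  # article identifier and entity
edges <- read.csv("corpus.edges.csv")  # entity pair and tie weight

# Rebuild the undirected weighted network with igraph; if the edge list has
# a 'weight' column, igraph stores it as an edge attribute automatically.
library(igraph)
net <- graph_from_data_frame(edges, directed = FALSE)
```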
- The weighting scheme is inversely proportional to the number of entity pairs in each article (e.g. an article mentioning three entities forms three pairs, so each pair gets a weight of 1/3).
- The weighted degree formula is by Tore Opsahl and uses an alpha parameter of 1, which makes the measure equal to node strength (the sum of tie weights); a short sketch of both computations follows.
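A short sketch of both computations on toy data, assuming the inverse weighting assigns each pair 1 divided by the number of pairs in its article; this is an illustration, not the actual code in the repository.

```r
library(igraph)

# Toy data: article 1 mentions three entities, article 2 mentions two.
freqs <- data.frame(article = c(1, 1, 1, 2, 2),
                    entity  = c("A", "B", "C", "A", "D"))

# (1) Inverse weighting: each co-occurring pair gets 1 / (number of pairs in
#     the article), e.g. 3 entities -> 3 pairs -> weight 1/3 each.
#     (Articles mentioning a single entity would have to be skipped.)
pairs <- by(freqs, freqs$article, function(a) {
  p <- t(combn(sort(unique(a$entity)), 2))
  data.frame(i = p[, 1], j = p[, 2], weight = 1 / nrow(p))
})
edges <- aggregate(weight ~ i + j, data = do.call(rbind, pairs), FUN = sum)

# (2) Opsahl's generalised degree: k_i * (s_i / k_i)^alpha, where k_i is the
#     number of ties of node i and s_i the sum of its tie weights; with
#     alpha = 1 the measure reduces to node strength s_i.
net   <- graph_from_data_frame(edges, directed = FALSE)
alpha <- 1
k <- degree(net)
s <- strength(net)
weighted_degree <- k * (s / k)^alpha
```

With alpha = 1 the last line is equivalent to `strength(net)`; values of alpha between 0 and 1 would interpolate between the number of ties and the sum of tie weights.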