Kasper Welbers & Emma Diel 2024-06
Media Cloud is an open source platform for media analysis that allows you to search news from a huge range of international news outlets. Although you will not get the full articles (which also wouldn’t be legal), you will get headlines and useful metadata. Furthermore, since you also receive the URLs, you can look up the articles and scrape more information if (legally) available.
In this brief tutorial we first show you how to use Media Cloud, and then use the paperboy R package to (try to) scrape more data.
Media Cloud is very straightforward to use. To get started, visit the website and create a (free) user account.
Once you’re logged in, click the big “search now” button (or this link).
On the search page you can enter search phrases using the simplified form, but you can also use advanced mode if you understand boolean queries (remember practical 3!). There is also a detailed guide for what query features are supported.
One very important thing to notice is the select collections button in step 2. By default you’ll search in the United States - National collection, but there are many different collections, including other countries (e.g., Netherlands - National).
Once you’ve conducted a search, you’ll see several results below, like Attention over Time, Total Attention and Sample Content. The one we’re most interested in right now is Total Attention, where you can also find the DOWNLOAD ALL URLS button. Click this button to start the download (it might take a few seconds or more if you have a lot of results).
You should now have a new CSV file in your downloads folder.
When setting up the search query, you can include or exclude certain words, as well as decide on the time frame. After you click on search, the website will generate several results. Under the heading of “Total Attention” you will find a button to download a CSV with the URLs. The website will also give you a selection of random articles from the search. Check that these articles really reflect what you are looking for; if they don’t, the search query needs to be refined.
Reading the Media Cloud data into R is very straightforward. We’ll again use the read_csv function (like we did in the practicals). We’ll use the tidyverse package for some quick data cleaning (this package also includes the read_csv function).
library(tidyverse)
d <- read_csv("~/Downloads/mc-onlinenews-mediacloud-20240604134941-content.csv")
(Make sure to replace the file name with the location and filename on your computer, or use the data import wizard, as we did in the practicals.)
If you did everything right, you should now have the data in R. Remember that you can view the data by clicking on the name in the Environment tab (top-right) or by using View(d).
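Before scraping, it can help to take a quick look at what the download contains. The sketch below uses a tiny hand-made table in place of the real file. The column names (title, url, media_name, publish_date) are assumptions about the Media Cloud export, so check names(d) on your own data first.

```r
library(dplyr)

# Hypothetical stand-in for the Media Cloud CSV; the column names are assumed
d_example <- tibble(
  title = c("Headline A", "Headline B", "Headline A", "Headline C"),
  url = c("https://example.com/a", "https://example.org/b",
          "https://example.com/a", "https://example.com/c"),
  media_name = c("example.com", "example.org", "example.com", "example.com"),
  publish_date = as.Date(c("2024-05-01", "2024-05-02", "2024-05-01", "2024-05-03"))
)

# Drop rows that point to the same URL, so each article is scraped only once
d_unique <- distinct(d_example, url, .keep_all = TRUE)

# Count how many articles each outlet contributes
per_outlet <- count(d_unique, media_name, sort = TRUE)
per_outlet
```

On your own data you would replace d_example with d. If one outlet dominates the results, that is worth knowing before you try to scrape it.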
The paperboy package provides an easy function to try to scrape a news article from a URL. We say try, because whether it works depends on two things:
- The data might be behind a paywall, in which case we can’t get it (without paying for it).
- Paperboy doesn’t know every news website on the planet. If you’re trying to scrape data from a website that it doesn’t know, it will try a generic approach. Whether this works depends on whether the website uses a common, standard layout.
If you cannot scrape more information for the articles that you want to use for your project assignment, don’t worry! You could perhaps do the analysis using only the headlines. If this really doesn’t work for your RQ, consult your supervisor. And make sure to write down your experiences for the ‘reflection’ part of the assignment. Figuring out what data is available and clean is also a part of doing research!
Luckily, checking whether you can scrape the data doesn’t take much time. You just need to install the paperboy package and use one function. Note that to install paperboy you need the remotes package. So if you get an error saying “there is no package called ‘remotes’”, make sure to install it first.
remotes::install_github('JBGruber/paperboy')
library(paperboy)
We’ll first try scraping just a sample of the data. We’ll pick the first 10 rows from our data, and then scrape the URLs for these rows.
samp = head(d, 10)
scraped_data = pb_deliver(samp$url)
If you’re lucky, scraped_data is a data frame with 10 news articles that now has both a headline and a text column. It could be that you have a few articles fewer, if some of them could not be found. Also, it could be that you don’t have the text for all articles.
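To keep the Media Cloud metadata next to the scraped text, you could join the two tables on the URL. This is a minimal sketch with hand-made stand-ins for samp and scraped_data. It assumes that the scraped table has url, headline and text columns, and that the url column still holds the address you passed in, so verify this with names(scraped_data) on your own results.

```r
library(dplyr)

# Hypothetical stand-ins: samp_example mimics the Media Cloud rows,
# scraped_example mimics the scraper output (assumed columns: url, headline, text)
samp_example <- tibble(
  url = c("https://example.com/a", "https://example.com/b", "https://example.com/c"),
  media_name = c("example.com", "example.com", "example.com")
)
scraped_example <- tibble(
  url = c("https://example.com/a", "https://example.com/c"),
  headline = c("Headline A", "Headline C"),
  text = c("Full text of article A", NA)
)

# left_join keeps every Media Cloud row; articles that were not scraped get NA
combined <- left_join(samp_example, scraped_example, by = "url")
combined
```

Because left_join keeps all rows of the first table, the articles that could not be scraped remain visible as rows with missing headline and text.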
Once you have everything figured out, you can try downloading all the data. In case you run into any errors, you might try setting the ignore_fails argument to TRUE to be a bit more lenient.
all_data = pb_deliver(d$url, ignore_fails = TRUE)
When you have your results, it’s always good to quickly check how much you’re missing compared to your original data. First, you can just look at the total number of articles in your scraped data compared to the Media Cloud URL data.
nrow(all_data)
nrow(d)
Or directly as a percentage:
100 * nrow(all_data) / nrow(d)
But don’t forget to also check whether the scraped data actually has the text! To do this, we first check whether the value in the text column is missing (NA) or whether it’s just an empty string ("").
text_missing = is.na(all_data$text) | all_data$text == ""
Now we can count how often it’s NOT missing (! means NOT)
sum(!text_missing)
And we can again compare this to the total number of URLs we tried to scrape
100 * sum(!text_missing) / nrow(d)
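If you expect to repeat this check (for instance after every batch), the same steps could be wrapped in a small function. The helper name text_coverage and the toy input are made up for illustration; this is only a sketch of the calculation shown above.

```r
# Hypothetical helper: summarise how much usable text a scrape returned
text_coverage <- function(n_urls, text) {
  # Usable text is neither NA nor an empty string
  has_text <- !is.na(text) & text != ""
  c(with_text = sum(has_text),
    percent = 100 * sum(has_text) / n_urls)
}

# Toy example: 5 URLs requested, 4 rows came back, 2 of them with text
toy_text <- c("Some article text", NA, "", "Another article")
text_coverage(n_urls = 5, text = toy_text)
```

In the toy example, 2 of the 5 requested URLs have text, so the coverage is 40 percent. On real data the call would be text_coverage(nrow(d), all_data$text).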
If you have a LOT of data, you might want to do the scraping in batches. A simple approach would be to directly slice the data.
batch1 = d$url[1:500]
batch2 = d$url[501:1000]
## etc.
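Slicing by hand gets tedious with many batches, so one option is to let R create the batches and loop over them. The sketch below is an illustration, not part of paperboy: scrape_in_batches and fake_scraper are hypothetical names, and the toy run uses a fake scraper so that it works without network access.

```r
library(dplyr)

# Hypothetical helper: split URLs into batches and scrape them one by one.
# scrape_fun is whatever function does the scraping (e.g. pb_deliver).
scrape_in_batches <- function(urls, scrape_fun, batch_size = 500) {
  # Batch index per URL: 1,1,...,2,2,... so the last batch may be smaller
  batch_id <- ceiling(seq_along(urls) / batch_size)
  batches <- split(urls, batch_id)
  results <- vector("list", length(batches))
  for (i in seq_along(batches)) {
    # If one batch fails, report it and continue with the next one
    results[[i]] <- tryCatch(
      scrape_fun(batches[[i]]),
      error = function(e) {
        message("Batch ", i, " failed: ", conditionMessage(e))
        NULL
      }
    )
  }
  # Stack the batches that succeeded (bind_rows skips the NULL entries)
  bind_rows(results)
}

# Toy run with a fake scraper, so no network access is needed
fake_scraper <- function(urls) data.frame(url = urls, text = "placeholder")
toy_urls <- paste0("https://example.com/", 1:12)
toy_result <- scrape_in_batches(toy_urls, fake_scraper, batch_size = 5)
nrow(toy_result)
```

With real data you might call scrape_in_batches(d$url, pb_deliver). Because a failed batch only produces a message, it is worth comparing nrow() of the result with the number of URLs afterwards.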