---
title: "Scraping the Guardian from R"
author: "Wouter van Atteveldt & Kasper Welbers"
date: "April 2019"
output:
  github_document:
    toc: yes
editor_options:
  chunk_output_type: console
---
```{r, echo=F, message=F}
knitr::opts_chunk$set(warning=FALSE, message=FALSE, results=F, fig.keep='none')
library(printr)
```
# The Guardian API
The Guardian (newspaper) has a very nice API that lets you download news content, and the R package `GuardianR` gives you easy access to it.
First, install GuardianR from CRAN as normal and get an API key from https://open-platform.theguardian.com/access/. Then replace the 'test' key below with your actual key:
```{r}
key = "test"
```
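Rather than hardcoding the key in the script, you may prefer to read it from an environment variable. The sketch below assumes a variable named `GUARDIAN_API_KEY` (a name chosen here for illustration, not something `GuardianR` requires) and falls back to the 'test' key if it is not set:

```{r}
# GUARDIAN_API_KEY is a hypothetical variable name; set it e.g. in your ~/.Renviron
key = Sys.getenv("GUARDIAN_API_KEY", unset = "test")
if (key == "test") message("No API key found in the environment, using the 'test' key")
```

This keeps the key out of the document itself, which is useful if you share the file or put it under version control.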
# Querying the API
Sending a single query to the API is simple. Let's query the news on the 'backstop' in the week before Brexit was supposed to happen.
Note that by default the body (full text) is returned as a factor, so we convert it to character and also create a `text` column with the HTML tags stripped:
```{r}
library(GuardianR)
library(tidyverse)
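# query the API and convert the result to a tibble
news = get_guardian("backstop", from.date="2019-03-23", to.date="2019-03-29", api.key = key) %>%
  as_tibble %>%
  # body: factor -> character; date: publication date; text: body without HTML tags
  mutate(body = as.character(body), date=as.Date(webPublicationDate),
         text=str_remove_all(body, "<[^>]+>"))
# show the metadata and the length of each article body
news %>% select(webUrl, publication, headline, date, body) %>% mutate(len=str_length(body))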
```
As you can see, this includes the metadata (url, headline, date) and full body of the article.
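To see what the tag-stripping step does, here is a small sketch on a made-up HTML snippet (the example string is invented for illustration, not taken from the API):

```{r}
library(stringr)
# invented example of an article body containing HTML markup
html = "<p>The <strong>backstop</strong> remains <a href=\"https://example.com\">contested</a>.</p>"
# remove everything from a < up to the next >, i.e. all tags, keeping only the text
plain = str_remove_all(html, "<[^>]+>")
plain
```

Note that this simple pattern only removes tags; HTML entities such as `&amp;` are left as they are.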
Just because we can, here is a wordcloud:
```{r}
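library(quanteda)
library(quanteda.textplots)
# in quanteda version 3 and later, tokenize explicitly and load textplot_wordcloud() from quanteda.textplots
news %>% corpus() %>% tokens(remove_punct=TRUE) %>% tokens_remove(stopwords("english")) %>%
  dfm() %>% textplot_wordcloud(max_words=50)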
```
For more information on the query syntax and possibilities, see https://open-platform.theguardian.com/documentation/.
Note that (at the time of writing) the `GuardianR` package is somewhat limited, so to search within a specific tag you need to add it to the query string manually, like so:
```{r}
news_pol = get_guardian("backstop&tag=politics/politics", from.date="2019-03-23", to.date="2019-03-29", api.key = key)
print(str_c("# of articles total: ", nrow(news), "; # of articles in politics tag: ", nrow(news_pol)))
```
# Querying longer time spans
With `GuardianR`, you can only query one month at a time.
To query a longer time frame, you can use a for loop like so:
```{r}
library(lubridate)
results = list()
# first day of each month to query
months = seq(as.Date("2019-01-01"), by = "month", length.out = 3)
for (i in seq_along(months)) {
  from = months[i]
  message(str_c("Querying: ", i, ": ", as.character(from)))
  # calculate last day of month: first day of the next month, minus one day
  to = from
  month(to) = month(to) + 1
  to = to - 1
  # query
  news = get_guardian("backstop", from.date=from, to.date=to, api.key = key)
  # store in list
  results[[as.character(from)]] = news
}
# combine all results into a single data frame
results = bind_rows(results)
print(str_c("Retrieved ", nrow(results), " articles"))
```
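The month boundaries used in the loop above can also be computed up front, without any API calls, which makes it easy to check them. A minimal sketch using only base R (the helper name `month_end` is made up for this example):

```{r}
# first day of three consecutive months
months = seq(as.Date("2019-01-01"), by = "month", length.out = 3)
# last day of a month = first day of the next month, minus one day
month_end = function(from) seq(from, by = "month", length.out = 2)[2] - 1
to = do.call(c, lapply(months, month_end))
# one row per query window
data.frame(from = months, to = to)
```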
Note that because `GuardianR` returns factors for most fields, combining the results produces warnings: when the factor levels differ between data frames, `bind_rows` converts those factors into character columns. These warnings are generally safe to ignore.
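If you would rather avoid the warnings, one option is to convert factor columns to character before combining. A minimal sketch on made-up data frames that stand in for two monthly results (the column names and values are invented for illustration):

```{r}
library(dplyr)
# two made-up batches whose factor levels differ, mimicking two monthly results
jan = data.frame(id = factor(c("a", "b")), headline = factor(c("First", "Second")))
feb = data.frame(id = factor("c"), headline = factor("Third"))
# convert all factor columns to character before combining
batches = lapply(list(jan, feb), function(d) mutate_if(d, is.factor, as.character))
combined = bind_rows(batches)
str(combined)
```

The same conversion could be applied to each `news` result inside the loop before storing it in the list.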