
Open knowledge in R with Wikimedia APIs

By Mikhail Popov, Wikimedia Foundation

Introduction

The Wikimedia Foundation is a non-profit that operates free and open projects such as Wikipedia, Wiktionary, and Wikidata, which anyone can contribute to.

No time to talk about me (plus that's always the boring part)[^1]

A Markdown copy of this deck is at git.io/vSi6a for following along

R packages required to follow along:

install.packages(
  c("magrittr", "rvest", "xml2",
    "pageviews", "WikipediR", "WikidataR",
    "WikidataQueryServiceR"),
  repos = c(CRAN = "https://cran.rstudio.com")
)
# For data visualization:
install.packages("ggplot2", repos = c(CRAN = "https://cran.rstudio.com"))
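
Before installing, it can help to see which of these packages are already present. A minimal sketch using only base R; the helper name `missing_packages` is made up for illustration:

```r
# Hypothetical helper (not part of any package): report which of the
# requested packages are not yet installed in the current library paths.
missing_packages <- function(pkgs) {
  installed <- rownames(installed.packages())
  setdiff(pkgs, installed)
}

required <- c("magrittr", "rvest", "xml2", "pageviews", "WikipediR",
              "WikidataR", "WikidataQueryServiceR", "ggplot2")
to_install <- missing_packages(required)

# Install only what is missing (left commented out so that running this
# sketch has no side effects):
# if (length(to_install) > 0) install.packages(to_install)
print(to_install)
```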

Session Info

suppressPackageStartupMessages({
  library(magrittr)
  library(ggplot2)
  library(dplyr)
  library(knitr)
})
  • Running R 3.4.0 on macOS Sierra 10.12.5
  • Rendered with rmarkdown 1.5 and knitr 1.16
  • The pipe (%>%) from magrittr is occasionally used
  • Using the following versions of packages for demos:
Package                Version  Imports
pageviews              0.3.0    jsonlite, httr, curl
WikipediR              1.5.0    httr, jsonlite
WikidataR              1.3.0    httr, jsonlite, WikipediR, utils
WikidataQueryServiceR  0.1.1    httr, dplyr, jsonlite

Wikipedia

Example of using Wikipedia.org portal with Russian set as the primary browser language.

  • Wikipedia is a free encyclopedia that anyone can edit
  • You may have heard of it
  • It is available in 296 languages
  • English Wikipedia has over 5.3 million articles
  • Wikipedia is powered by MediaWiki, which includes an API that makes it fast and easy to fetch content

WikipediR

WikipediR is a wrapper for the MediaWiki API, aimed at Wikimedia's wikis such as Wikipedia. It can retrieve page text, information about users, the history of pages, and elements of the category tree.

library(WikipediR); library(magrittr)
r_wiki <- page_content(
  language = "en",
  project = "wikipedia",
  page_name = "R (programming language)"
)
r_releases <- r_wiki$parse$text$`*` %>%
  xml2::read_html() %>%
  xml2::xml_find_first(".//table[@class='wikitable']") %>%
  rvest::html_table()

Release  Date        Description
0.16                 This is the last alpha version developed...
0.49     1997-04-23  This is the oldest source release which ...
0.60     1997-12-05  R becomes an official part of the GNU Pr...
0.65.1   1999-10-07  First versions of update.packages and in...
1.0      2000-02-29  Considered by its developers stable enou...
1.4      2001-12-19  S4 methods are introduced and the first ...
2.0      2004-10-04  Introduced lazy loading, which enables f...
2.1      2005-04-18  Support for UTF-8 encoding, and the begi...
2.11     2010-04-22  Support for Windows 64 bit systems....
2.13     2011-04-14  Adding a new compiler function that allo...
2.14     2011-10-31  Added mandatory namespaces for packages....
2.15     2012-03-30  New load balancing functions. Improved s...
3.0      2013-04-03  Support for numeric index values 2^31 and...
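
A scraped table like this usually needs some tidying before analysis. The sketch below assumes the Date column comes back as character (the first row has no date) and uses a small hand-typed stand-in, `releases`, instead of the real `r_releases` so that it runs offline:

```r
# Stand-in for the first rows of `r_releases`, typed in by hand from the
# output above; the Date column is assumed to arrive as character.
releases <- data.frame(
  Release = c("0.16", "0.49", "0.60"),
  Date = c("", "1997-04-23", "1997-12-05"),
  stringsAsFactors = FALSE
)

# Mark blank dates as missing explicitly, then parse the rest
releases$Date[releases$Date == ""] <- NA
releases$Date <- as.Date(releases$Date, format = "%Y-%m-%d")

# Days between consecutive dated releases
release_gaps <- diff(releases$Date[!is.na(releases$Date)])
print(release_gaps)
```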

MediaWiki-powered sites' APIs

  • Use language and project arguments for Wikimedia's wikis[^2]
  • Use domain for everything else, such as:
    • Project Gutenberg's wiki: domain = "www.gutenberg.org/w/api.php"
    • Mozilla Foundation's wiki: domain = "wiki.mozilla.org/api.php"
    • Geek Feminism wiki: domain = "geekfeminism.wikia.com/api.php"
    • A Wiki of Ice and Fire: domain = "awoiaf.westeros.org/api.php"
  • Tip: if using random_page, specify namespaces = 0 to only get articles
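
To make the difference between the two styles of argument concrete, here is a rough sketch of which endpoint each one addresses. `api_endpoint` is a hypothetical helper written for illustration, not WikipediR's internal code, and the `https://` scheme for third-party wikis is an assumption:

```r
# Hypothetical helper, not WikipediR's internals: show which API endpoint a
# given combination of arguments is meant to address.
api_endpoint <- function(language = NULL, project = NULL, domain = NULL) {
  if (!is.null(domain)) {
    # Third-party wikis: the domain already includes the path to api.php
    # (https is assumed here)
    return(paste0("https://", domain))
  }
  if (is.null(project)) stop("supply either `domain` or `project`")
  # Wikimedia wikis: api.php lives under /w/ on <language>.<project>.org
  host <- paste(c(language, project, "org"), collapse = ".")
  paste0("https://", host, "/w/api.php")
}

print(api_endpoint(language = "en", project = "wikipedia"))
print(api_endpoint(domain = "wiki.mozilla.org/api.php"))
```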

Pageviews

The Wikimedia Foundation (WMF) provides an API for daily and monthly pageview counts of any article on any project, with data from 2015 onwards.[^3] The pageviews package retrieves those counts in R:

library(pageviews)
r_pageviews <- article_pageviews(
  project = "en.wikipedia",
  article = "R (programming language)",
  user_type = "user", start = "2015100100",
  end = format(Sys.Date() - 1, "%Y%m%d00")
)

r_pageviews$date %<>% as.Date()
p <- ggplot(r_pageviews, aes(x = date, y = views)) +
  geom_line(color = rgb(0, 102, 153, maxColorValue = 255)) +
  geom_text(data = dplyr::top_n(r_pageviews, 1, views),
            aes(x = date, y = views, label = format(date, "%d %B %Y"),
                hjust = "left"), nudge_x = 10, size = 6) +
  geom_point(data = dplyr::top_n(r_pageviews, 1, views),
             aes(x = date, y = views), color = rgb(153/255, 0, 0)) +
  scale_y_continuous(
    breaks = seq(2e3, 10e3, 1e3),
    labels = function(x) { return(sprintf("%.0fK", x/1e3)) }
  ) +
  scale_x_date(date_breaks = "2 months", date_labels = "%b\n%Y") +
  labs(x = NULL, y = "Pageviews",
       title = "Daily pageviews of R's entry on English Wikipedia",
       subtitle = "Desktop and mobile traffic, excluding known bots") +
  theme_minimal(base_size = 18, base_family = "Gill Sans")
plot(p)
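
The `start` and `end` arguments used above are timestamps in YYYYMMDDHH form. The sketch below shows one way to produce them from dates, and what a per-article request to the public Wikimedia REST API looks like when assembled by hand. `pv_timestamp` and `pv_url` are hypothetical helpers, not functions from the pageviews package:

```r
# Format a date as the YYYYMMDDHH timestamp used for `start` and `end`;
# the hour is fixed at 00.
pv_timestamp <- function(date) format(as.Date(date), "%Y%m%d00")

# Hypothetical helper: assemble the per-article REST URL by hand.
pv_url <- function(project, article, start, end,
                   access = "all-access", agent = "user",
                   granularity = "daily") {
  # Titles use underscores instead of spaces; percent-encode the rest
  title <- utils::URLencode(gsub(" ", "_", article), reserved = TRUE)
  paste("https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article",
        project, access, agent, title, granularity,
        pv_timestamp(start), pv_timestamp(end), sep = "/")
}

print(pv_timestamp("2015-10-01"))
url <- pv_url("en.wikipedia", "R (programming language)",
              "2015-10-01", "2015-10-31")
print(url)
```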

Wikidata

  • Wikidata is a language-agnostic open knowledge base
  • Facts are expressed as 3-part statements:
    • Subject (resource)
    • Predicate (property type)
    • Object (property value, can be another resource)
  • Examples:
  • Resources and properties have unique numeric identifiers but can have human-friendly labels in any language
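
As a small illustration of the subject-predicate-object idea, two statements about R can be written down as rows of a data frame. The identifiers are real Wikidata IDs; the data frame layout and the `labels_en` lookup are just one convenient way to sketch it:

```r
# Two statements about R written as subject-predicate-object triples,
# using Wikidata identifiers.
triples <- data.frame(
  subject   = c("Q206904", "Q206904"),  # R
  predicate = c("P31", "P31"),          # instance of
  object    = c("Q9143", "Q341"),       # programming language, free software
  stringsAsFactors = FALSE
)

# English labels kept separately, since the identifiers themselves are
# language-agnostic
labels_en <- c(Q206904 = "R", P31 = "instance of",
               Q9143 = "programming language", Q341 = "free software")

# Render each triple as a readable phrase
readable <- paste(labels_en[triples$subject], labels_en[triples$predicate],
                  labels_en[triples$object])
print(readable)
```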

WikidataR

library(WikidataR)
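# The R language was the 8th search hit when this was run;
# the position may differ as Wikidata changes
r_search <- find_item("R")[[8]]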
r_search[c("id", "description")] # check the results
## $id
## [1] "Q206904"
## 
## $description
## [1] "programming language for statistical computing"

property <- get_property("P31")[[1]]
property$labels$`en`$value # check that we want P31
## [1] "instance of"
r_item <- get_item(r_search$id)[[1]]
r_item$claims$P31$mainsnak$datavalue$value$id
## [1] "Q9143"     "Q341"      "Q20825628" "Q28920142" "Q3839507"  "Q12772052"
## [7] "Q1993334"  "Q24529812"

This tells us that R is an instance of Q9143, Q341, Q20825628, Q28920142, Q3839507, Q12772052, Q1993334, Q24529812. Great?
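
Bare identifiers are not very readable, so a natural next step is to look up a label for each one. The sketch below assumes that items expose labels through the same `$labels$<lang>$value` path used for the property above; `item_label` is a hypothetical helper, and `stub_item` is a hand-built stand-in so the example runs without a network connection:

```r
# Pull the label in a given language out of an item, following the same
# $labels$<lang>$value path used for the property above.
item_label <- function(item, lang = "en") {
  label <- item$labels[[lang]]$value
  if (is.null(label)) NA_character_ else label
}

# Hand-built stand-in with the same nesting, so the sketch runs offline
stub_item <- list(labels = list(en = list(language = "en",
                                          value = "programming language")))
print(item_label(stub_item))
print(item_label(stub_item, lang = "fr"))  # no French label in the stub

# With WikidataR loaded and a network connection, the identifiers could be
# resolved along these lines (not run here):
# ids <- r_item$claims$P31$mainsnak$datavalue$value$id
# vapply(ids, function(id) item_label(get_item(id)[[1]]), character(1))
```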

Wikidata Query Service (WDQS)

  • Allows querying Wikidata with SPARQL
  • Provides a public SPARQL endpoint usable via:
  • For useful reference links, see help("WDQS", package = "WikidataQueryServiceR")

Basic SPARQL Example

# PREFIXes are optional when using WDQS
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX wikibase: <http://wikiba.se/ontology#>
PREFIX bd: <http://www.bigdata.com/rdf#>

SELECT DISTINCT ?instanceOfLabel
WHERE {
  wd:Q206904 wdt:P31 ?instanceOf .
  SERVICE wikibase:label {
    bd:serviceParam wikibase:language "en"
  }
}

library(WikidataQueryServiceR)
query_wikidata('SELECT DISTINCT ?instanceOfLabel
WHERE {
  wd:Q206904 wdt:P31 ?instanceOf .
  SERVICE wikibase:label {
    bd:serviceParam wikibase:language "en"
  }
}') %>% head(5)
##                       instanceOfLabel
## 1                programming language
## 2                       free software
## 3 multi-paradigm programming language
## 4                interpreted language
## 5     functional programming language

query_wikidata('SELECT DISTINCT ?instanceOfLabel
WHERE {
  wd:Q206904 wdt:P31 ?instanceOf .
  SERVICE wikibase:label {
    bd:serviceParam wikibase:language "fr"
  }
}') %>% head(5)
##            instanceOfLabel
## 1                Q28920142
## 2 langage de programmation
## 3           logiciel libre
## 4 logiciel de statistiques
## 5       langage interprété
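
The bare Q28920142 in the French output appears because that item had no French label when the query ran. The label service accepts a comma-separated list of language codes and falls back from left to right, so asking for "fr,en" would show an English label where the French one is missing. A sketch that builds such a query; `instance_of_query` is a made-up helper name:

```r
# Build the "instance of" query for a given label-language preference.
# Several codes are joined with commas and act as a fallback chain.
instance_of_query <- function(languages = "en") {
  sprintf('SELECT DISTINCT ?instanceOfLabel
WHERE {
  wd:Q206904 wdt:P31 ?instanceOf .
  SERVICE wikibase:label {
    bd:serviceParam wikibase:language "%s"
  }
}', paste(languages, collapse = ","))
}

fr_en_query <- instance_of_query(c("fr", "en"))
cat(fr_en_query)

# With WikidataQueryServiceR loaded (not run here):
# query_wikidata(fr_en_query)
```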

Advanced SPARQL Example

  • Prefix wd: points to an entity
  • Prefix p: points not to the object, but to a statement node
  • Prefix ps: within the statement node retrieves the object (value)
  • Prefix pq: within the statement node retrieves the qualifier info
r_versions_query <- "SELECT DISTINCT
  ?softwareVersion ?publicationDate
WHERE {
  BIND(wd:Q206904 AS ?R)
  ?R p:P348 [
    ps:P348 ?softwareVersion;
    pq:P577 ?publicationDate
  ] .
}"

r_versions_results <- query_wikidata(
  r_versions_query, format = "smart"
)
# "smart" mode formats the datetime columns
head(r_versions_results, 3)
##   softwareVersion publicationDate
## 1           3.3.3      2017-03-06
## 2           3.1.0      2014-04-10
## 3           3.1.2      2014-10-31
range(r_versions_results$publicationDate)
## [1] "2000-02-29 GMT" "2017-04-21 GMT"
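
The results come back in no particular order. One way to sort them by release is base R's `numeric_version()`, sketched here on a hand-copied stand-in for the first three rows. This assumes every version string is a well-formed dotted number; anything else would make `numeric_version()` raise an error:

```r
# Stand-in for the first rows of `r_versions_results`, copied from above
versions <- data.frame(
  softwareVersion = c("3.3.3", "3.1.0", "3.1.2"),
  publicationDate = as.Date(c("2017-03-06", "2014-04-10", "2014-10-31")),
  stringsAsFactors = FALSE
)

# numeric_version() compares component by component, so "3.10.0" sorts
# after "3.9.0", which plain string sorting gets wrong
versions <- versions[order(numeric_version(versions$softwareVersion)), ]
print(versions$softwareVersion)
```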

set.seed(20170603)
r_versions_results$publicationDate %<>% as.Date
r_versions_results %<>% mutate(position = 8e3 + runif(nrow(.), -2e3, 2e3))
p +
  geom_smooth(formula = y ~ s(x, k = 9),
              method = "gam", se = FALSE,
              color = rgb(51, 153, 102, maxColorValue = 255)) +
  geom_vline(data = filter(r_versions_results, publicationDate >= "2015-10-01"),
             aes(xintercept = as.numeric(publicationDate)),
             color = "gray40", linetype = "dashed") +
  geom_text(data = filter(r_versions_results, publicationDate >= "2015-10-01"),
            aes(x = publicationDate, label = softwareVersion, y = position),
            color = "gray20", size = 6, angle = 30) +
  geom_point(data = filter(r_versions_results, publicationDate >= "2015-10-01"),
             aes(x = publicationDate, y = position - 7.5e2),
             color = "gray20", size = 3, shape = 17) +
  geom_point(data = dplyr::top_n(r_pageviews, 1, views),
             aes(x = date, y = views), color = rgb(153/255, 0, 0))

Final Remarks

Source for the whole shebang is up on GitHub: bearloga/wmf,[^4] available under CC BY-SA 4.0

Sorry for not leaving time for questions! If you have any, here's my

Contact Info

[^1]: If you're really curious, just search for User:MPopov (WMF) on Meta-Wiki

[^2]: Currently: Commons, Wikivoyage, Wikiquote, Wikisource, Wikibooks, Wikinews, Wikiversity, Wikispecies, MediaWiki, Meta-Wiki, Wiktionary

[^3]: The wikipediatrend package wraps the stats.grok.se API, which has historical Wikipedia pageview data from 2008 up to 2016, derived from the pageview count dumps

[^4]: Specifically: wmf/presentations/talks/Cascadia R Conference 2017/