Commit
0 parents
commit 1558b74
Showing 14 changed files with 584 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,30 @@ | ||
Package: ukds | ||
Type: Package | ||
Title: Reproducible Data Retrieval from the UK Data Service | ||
Version: 0.1.0 | ||
Date: 2017-03-22 | ||
Authors@R: c( | ||
person("Frederick", "Solt", email = "frederick-solt@uiowa.edu", role = c("aut", "cre"))) | ||
URL: https://github.com/fsolt/ukds | ||
BugReports: https://github.com/fsolt/ukds/issues | ||
Description: Reproducible, programmatic retrieval of datasets from the | ||
UK Data Service <https://www.ukdataservice.ac.uk>. The UKDS is "the | ||
UK's largest collection of social, economic and population data resources," | ||
but researchers taking advantage of these datasets are caught in a bind. | ||
The UKDS terms and conditions sharply limit redistribution of downloaded | ||
datasets, but to ensure that one's work can be reproduced, assessed, and | ||
built upon by others, one must provide access to the raw data one employed. | ||
The ukds package cuts this knot by providing programmatic, reproducible | ||
access to the UKDS datasets from within R. | ||
License: MIT + file LICENSE | ||
LazyData: TRUE | ||
Suggests: knitr, rmarkdown | ||
VignetteBuilder: knitr | ||
Imports: magrittr, rio, RSelenium (>= 1.7.1), stringr, tools, utils | ||
RoxygenNote: 6.0.1 | ||
NeedsCompilation: no | ||
Packaged: 2017-03-22 17:11:05 UTC; fredsolt | ||
Author: Frederick Solt [aut, cre] | ||
Maintainer: Frederick Solt <frederick-solt@uiowa.edu> | ||
Repository: CRAN | ||
Date/Publication: 2017-03-22 19:20:58 UTC |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,2 @@ | ||
YEAR: 2017 | ||
COPYRIGHT HOLDER: Frederick Solt |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,13 @@ | ||
f5bbcfa8733cf609e79fcb86f551922a *DESCRIPTION | ||
29f3ce3a8d4e2bb7d24fc6e35389082f *LICENSE | ||
15d44eacee2d76689224524c7c00ff60 *NAMESPACE | ||
96e1dff4835bb1105072e6f90269b77c *NEWS.md | ||
5ea4aef32963a3e99379fbf461d83ca7 *R/ukds_download.R | ||
e06bbbc0b8b9ac5903a8844e6e94a398 *README.md | ||
9c21121da1dc187a51f48acc9db8201d *build/vignette.rds | ||
bfe75c0b0cd3c64eb967569ed227307f *inst/doc/ukds-vignette.R | ||
df62405f50a175e74b69d08552d574ab *inst/doc/ukds-vignette.Rmd | ||
a1a750ac518339e8847eb89a168e2ca2 *inst/doc/ukds-vignette.html | ||
99ce54b95347db3fd2d1a2c3f707e507 *man/ukds_download.Rd | ||
df62405f50a175e74b69d08552d574ab *vignettes/ukds-vignette.Rmd | ||
f7bdf0f655a3449bd76d5839f9a54034 *vignettes/ukds_bsa2015.png |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,10 @@ | ||
# Generated by roxygen2: do not edit by hand | ||
|
||
export(ukds_download) | ||
import(RSelenium) | ||
importFrom(magrittr,'%>%') | ||
importFrom(rio,convert) | ||
importFrom(stringr,str_detect) | ||
importFrom(stringr,str_subset) | ||
importFrom(tools,file_path_sans_ext) | ||
importFrom(utils,unzip) |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,2 @@ | ||
## Version 0.1.0 | ||
First release. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,192 @@ | ||
#' Download datasets from the UK Data Service | ||
#' | ||
#' \code{ukds_download} provides a programmatic and reproducible means to download datasets | ||
#' from the UK Data Service's data archive. | ||
#' | ||
#' @param file_id The unique identifier (or optionally a vector of these identifiers) | ||
#' for the dataset(s) to be downloaded (see details). | ||
#' @param org,user,password Your UK Data Service organization, username, and password (see details). | ||
#' @param use The number of a 'use of data' you have registered with the UK Data Service (see details). | ||
#' @param reset If TRUE, you will be asked to re-enter your organization, username, and password. | ||
#' @param download_dir The directory (relative to your working directory) to | ||
#' which files from the UK Data Service will be downloaded. | ||
#' @param msg If TRUE, outputs a message showing which data set is being downloaded. | ||
#' @param convert If TRUE, converts downloaded file(s) to .RData format. | ||
#' @param delay If the speed of your connection to the UK Data Service is particularly slow, | ||
#' \code{ukds_download} may encounter problems. Increasing the \code{delay} parameter | ||
#' may help. | ||
#' | ||
#' @details | ||
#' To avoid requiring others to edit your scripts to insert their own organization, username, | ||
#' password, and use or to force them to do so interactively, the default is set to fetch | ||
#' this information from the user's .Rprofile. Before running \code{ukds_download}, | ||
#' then, you should be sure to add these options to your .Rprofile substituting your | ||
#' info for the example below: | ||
#' | ||
#' \code{ | ||
#' options("ukds_org" = "UK Data Archive", | ||
#' "ukds_user" = "ukf0000000000", | ||
#' "ukds_password" = "password123!", | ||
#' "ukds_use" = "111111") | ||
#' } | ||
#' | ||
#' @return The function returns downloaded files. | ||
#' | ||
#' @examples | ||
#' \dontrun{ | ||
#' ukds_download(file_id = c("8116", "7809")) | ||
#' } | ||
#' | ||
#' @import RSelenium | ||
#' @importFrom stringr str_detect str_subset | ||
#' @importFrom magrittr '%>%' | ||
#' @importFrom rio convert | ||
#' @importFrom tools file_path_sans_ext | ||
#' @importFrom utils unzip | ||
#' | ||
#' @export | ||
ukds_download <- function(file_id, | ||
org = getOption("ukds_org"), | ||
user = getOption("ukds_user"), | ||
password = getOption("ukds_password"), | ||
use = getOption("ukds_use"), | ||
reset = FALSE, | ||
download_dir = "ukds_data", | ||
msg = TRUE, | ||
convert = TRUE, | ||
delay = 5) { | ||
|
||
# detect login info | ||
if (reset){ | ||
org <- user <- password <- NULL | ||
} | ||
|
||
if (is.null(org)){ | ||
ukds_org <- readline(prompt = "The UK Data Service requires your user account information. Please enter your organization: \n") | ||
options("ukds_org" = ukds_org) | ||
org <- getOption("ukds_org") | ||
} | ||
|
||
if (is.null(user)){ | ||
ukds_user <- readline(prompt = "Please enter your UK Data Service username: \n") | ||
options("ukds_user" = ukds_user) | ||
user <- getOption("ukds_user") | ||
} | ||
|
||
if (is.null(password)){ | ||
ukds_password <- readline(prompt = "Please enter your UK Data Service password: \n") | ||
options("ukds_password" = ukds_password) | ||
password <- getOption("ukds_password") | ||
} | ||
|
||
if (is.null(use)) { | ||
ukds_use <- readline(prompt = "Please enter the ID number of a use of data registered with the UK Data Service: \n") | ||
options("ukds_use" = ukds_use) | ||
use <- getOption("ukds_use") | ||
} | ||
|
||
# build path to chrome's default download directory | ||
if (Sys.info()[["sysname"]]=="Linux") { | ||
default_dir <- file.path("", "home", Sys.info()[["user"]], "Downloads") | ||
} else { | ||
default_dir <- file.path("", "Users", Sys.info()[["user"]], "Downloads") | ||
} | ||
|
||
# create specified download directory if necessary | ||
if (!dir.exists(download_dir)) dir.create(download_dir, recursive = TRUE) | ||
|
||
# initialize driver | ||
if(msg) message("Initializing RSelenium driver") | ||
rD <- rsDriver(browser = "chrome", verbose = TRUE) | ||
remDr <- rD[["client"]] | ||
|
||
# sign in | ||
signin <- "https://qa.esds.ac.uk/secure/UKDSRegister_start.asp" | ||
remDr$navigate(signin) | ||
Sys.sleep(delay) | ||
remDr$findElement(using = "partial link text", "Let me choose")$clickElement() | ||
Sys.sleep(delay/2) | ||
remDr$findElement(using = "class", "as-selections")$sendKeysToElement(list(org)) | ||
remDr$findElement(using = "class", "btn-enabled")$clickElement() | ||
Sys.sleep(delay/2) | ||
remDr$findElement(using = "id", "j_username")$sendKeysToElement(list(user)) | ||
remDr$findElement(using = "id", "j_password")$sendKeysToElement(list(password)) | ||
remDr$findElement(using = "class", "input-submit")$clickElement() | ||
Sys.sleep(delay) | ||
|
||
# loop through files | ||
for (i in seq_along(file_id)) { | ||
item <- file_id[[i]] | ||
if(msg) message("Downloading UK Data Service file: ", item, sprintf(" (%s)", Sys.time())) | ||
|
||
# get list of current default download directory contents | ||
dd_old <- list.files(default_dir) | ||
|
||
# navigate to download page | ||
url <- paste0("https://discover.ukdataservice.ac.uk/catalogue/?sn=", item, "&type=Data%20catalogue") | ||
|
||
remDr$navigate(url) | ||
remDr$findElement(using = "partial link text", "Download")$clickElement() | ||
Sys.sleep(delay/2) | ||
remDr$findElement(using = "partial link text", "Login")$clickElement() | ||
Sys.sleep(delay/2) | ||
|
||
# select use | ||
remDr$findElement(using = "xpath", paste0("//input[@value=", use,"]"))$clickElement() # choose project | ||
Sys.sleep(delay) | ||
remDr$findElement(using = "xpath", "//input[@value='Add Datasets']")$clickElement() # add datasets | ||
Sys.sleep(delay/2) | ||
try(remDr$findElement(using = "xpath", "//input[@value='Add Datasets']")$clickElement()) # add datasets | ||
Sys.sleep(delay) | ||
|
||
# accept special terms, if any | ||
if (length(remDr$findElements(using = "partial link text", "Accept"))!=0) { | ||
remDr$findElement(using = "partial link text", "Accept")$clickElement() | ||
Sys.sleep(delay) | ||
remDr$findElement(using = "xpath", "//input[@value='I accept']")$clickElement() | ||
Sys.sleep(delay) | ||
} | ||
|
||
remDr$findElement(using = "xpath", paste0('//input[contains(@onclick,', item,')]'))$clickElement() # "Download" | ||
remDr$findElement(using = "xpath", "//input[@value='I accept']")$clickElement() # End User License | ||
|
||
remDr$findElement(using = "xpath", "//input[@value='STATA']")$clickElement() # Stata | ||
|
||
# check that download has completed | ||
dd_new <- list.files(default_dir)[!list.files(default_dir) %in% dd_old] | ||
wait <- TRUE | ||
tryCatch( | ||
while(all.equal(stringr::str_detect(dd_new, "\\.part$"), logical(0))) { | ||
Sys.sleep(1) | ||
dd_new <- list.files(default_dir)[!list.files(default_dir) %in% dd_old] | ||
}, error = function(e) 1 ) | ||
while(any(stringr::str_detect(dd_new, "\\.crdownload$"))) { | ||
Sys.sleep(1) | ||
dd_new <- list.files(default_dir)[!list.files(default_dir) %in% dd_old] | ||
} | ||
|
||
# unzip into specified directory and convert to .RData | ||
dld_old <- list.files(download_dir) | ||
unzip(file.path(default_dir, dd_new), exdir = download_dir) | ||
unlink(file.path(default_dir, dd_new)) | ||
dld_new <- list.files(download_dir)[!list.files(download_dir) %in% dld_old] | ||
file.rename(file.path(download_dir, dld_new), file.path(download_dir, item)) | ||
|
||
data_files <- list.files(path = file.path(download_dir, item), recursive = TRUE) %>% | ||
str_subset("\\.dta") | ||
if (convert == TRUE) { | ||
for (j in seq_along(data_files)) { | ||
data_file <- data_files[j] | ||
rio::convert(file.path(download_dir, item, data_file), | ||
paste0(tools::file_path_sans_ext(file.path(download_dir, | ||
item, | ||
basename(data_file))), ".RData")) | ||
} | ||
} | ||
} | ||
|
||
# Close driver | ||
remDr$close() | ||
rD[["server"]]$stop() | ||
} | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,22 @@ | ||
[![CRAN version](http://www.r-pkg.org/badges/version/ukds)](https://cran.r-project.org/package=ukds) ![](http://cranlogs.r-pkg.org/badges/grand-total/ukds) [![Travis-CI Build Status](https://travis-ci.org/fsolt/ukds.svg?branch=master)](https://travis-ci.org/fsolt/ukds) | ||
------------------------------------------------------------------------ | ||
|
||
ukds | ||
========= | ||
|
||
`ukds` is an R package that provides reproducible, programmatic access to datasets stored in the [UK Data Service](https://www.ukdataservice.ac.uk) for [registered users](http://esds.ac.uk/newRegistration/newLogin.asp). | ||
|
||
|
||
To install: | ||
|
||
* the latest released version: `install.packages("ukds")` | ||
* the latest development version: | ||
|
||
```R | ||
if (!require(ghit)) install.packages("ghit") | ||
ghit::install_github("fsolt/ukds") | ||
``` | ||
|
||
For more details, check out [the vignette](https://cran.r-project.org/package=ukds/vignettes/ukds-vignette.html). | ||
|
||
* Note that, on Windows systems, `ukds` requires that [RTools](https://cran.r-project.org/bin/windows/Rtools/index.html) is installed. |
Binary file not shown.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,15 @@ | ||
## ----eval = FALSE-------------------------------------------------------- | ||
# options("ukds_org" = "UK Data Archive", | ||
# "ukds_user" = "ukf0000000000", | ||
# "ukds_password" = "password123!", | ||
# "ukds_use" = "111111") | ||
|
||
## ----eval=FALSE---------------------------------------------------------- | ||
# ukds_download(file_id = "8116") | ||
|
||
## ----eval=FALSE---------------------------------------------------------- | ||
# ukds_download(file_id = c("8116", "7809", "7500")) | ||
|
||
## ----eval=FALSE---------------------------------------------------------- | ||
# bsa2015 <- rio::import("ukds_data/8116/bsa15_to_ukds_final.RData") | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,59 @@ | ||
--- | ||
title: "ukds: Reproducible Retrieval of UK Data Service Datasets" | ||
author: "Frederick Solt" | ||
date: "`r Sys.Date()`" | ||
output: rmarkdown::html_vignette | ||
vignette: > | ||
%\VignetteIndexEntry{ukds: Reproducible Retrieval of UK Data Service Datasets} | ||
%\VignetteEngine{knitr::rmarkdown} | ||
%\VignetteEncoding{UTF-8} | ||
--- | ||
|
||
|
||
The [UK Data Service](https://www.ukdataservice.ac.uk) (UKDS) is "the UK’s largest collection of social, economic and population data resources." Researchers taking advantage of these datasets, however, are caught in a bind. The [UK Data Service terms and conditions](https://www.ukdataservice.ac.uk/get-data/how-to-access/conditions) require users "to give access to the data collections only to registered users with a registered use."[^1] But to ensure that one's work can be reproduced, assessed, and built upon by others, one must provide access to the raw data one employed. The `ukds` package cuts this knot by providing programmatic, reproducible access to the UK Data Service's datasets from within R. | ||
|
||
## Setup | ||
To use `ukds`, you must first be a registered user of the UKDS, and you must have already registered your 'use of data' for any dataset you will download. | ||
|
||
When used interactively, the `ukds_download` function will ask for the login information required by the UK Data Service: the user's organization, username, and password, as well as the 'use of data' for the datasets to be downloaded. | ||
After that information is input once, it will be entered automatically for any other download requests made in the same session. To change this contact information within a session, one may set the argument `reset` to `TRUE` when running `ukds_download` again, and the function will again request the required information. | ||
|
||
An optional, but highly recommended, setup step is to add the information the UK Data Service requires to your [.Rprofile](http://www.statmethods.net/interface/customizing.html) as in the following example: | ||
|
||
```{r eval = FALSE} | ||
options("ukds_org" = "UK Data Archive", | ||
"ukds_user" = "ukf0000000000", | ||
"ukds_password" = "password123!", | ||
"ukds_use" = "111111") | ||
``` | ||
|
||
The `ukds_download` function will then access the information it needs to pass on to the UKDS by default. This means that researchers will not have to expose their info in their R scripts and that others reproducing their results later will be able to execute those R scripts without modification. (They will, however, need to enter their own information into their own .Rprofiles, a detail that should be noted in the reproducibility materials to avoid confusion.) | ||
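
A script can check up front whether these options are present, rather than discovering a missing one partway through a download. The following is a minimal sketch in base R; the helper name `ukds_missing_options` is hypothetical and is not part of the `ukds` package:

```r
# Hypothetical helper (not part of ukds): report which of the four
# options read by ukds_download() are not yet set in this session.
ukds_missing_options <- function() {
  needed <- c("ukds_org", "ukds_user", "ukds_password", "ukds_use")
  is_set <- vapply(needed, function(x) !is.null(getOption(x)), logical(1))
  needed[!is_set]
}

missing_opts <- ukds_missing_options()
if (length(missing_opts) > 0) {
  message("Not set in .Rprofile: ", paste(missing_opts, collapse = ", "))
}
```

If any options are reported as missing, `ukds_download` would prompt for them interactively, which may not be what an unattended script expects.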
|
||
|
||
## Use | ||
|
||
The `ukds_download` function (1) opens a Chrome browser and navigates to the UKDS's sign-in page, (2) enters the required information to sign in, (3) navigates to a specified dataset, (4) adds the dataset to the specified registered 'use of data', (5) downloads the dataset's files, and, optionally but by default, (6) converts the dataset's files to `.RData` format. | ||
|
||
Datasets are specified using the `file_id` argument. The UKDS uses a unique SN number to identify each of its datasets. For the [2015 British Social Attitudes Survey](https://discover.ukdataservice.ac.uk/catalogue/?sn=8116&type=Data%20catalogue), for example, the file id is 8116: | ||
|
||
<img src="ukds_bsa2015.png" style="width: 100%;"/> | ||
|
||
To reproducibly download this dataset: | ||
|
||
```{r eval=FALSE} | ||
ukds_download(file_id = "8116") | ||
``` | ||
|
||
Multiple datasets may be downloaded from the same research area in a single command by passing a vector of ids to `file_id`. The following downloads the above-described 2015 BSA along with those for 2014 and 2013: | ||
|
||
```{r eval=FALSE} | ||
ukds_download(file_id = c("8116", "7809", "7500")) | ||
``` | ||
|
||
After the needed datasets are downloaded, they are, by default, converted to `.RData` format (via `rio::convert()`) and ready to be loaded into R using `load()` or `rio::import()`. | ||
|
||
```{r eval=FALSE} | ||
bsa2015 <- rio::import("ukds_data/8116/bsa15_to_ukds_final.RData") | ||
``` | ||
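
When using base R's `load()` instead, note that it restores objects under whatever names they were saved with, so loading into a separate environment keeps the workspace tidy. A minimal sketch using a toy file in a temporary directory (the file and object names here are illustrative assumptions, not output of `ukds`):

```r
# Toy stand-in for a converted dataset; a real file would live under
# ukds_data/<file_id>/ after ukds_download() has run.
toy_path <- file.path(tempdir(), "toy_dataset.RData")
toy <- data.frame(id = 1:3, answer = c("yes", "no", "yes"))
save(toy, file = toy_path)

# load() returns the names of the restored objects; loading into a
# fresh environment avoids overwriting anything in the workspace.
loaded <- new.env()
object_names <- load(toy_path, envir = loaded)
dataset <- get(object_names[1], envir = loaded)
str(dataset)
```

The same pattern should apply to a file produced by the conversion step, with `toy_path` replaced by the path to the converted `.RData` file.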
|
||
[^1]: The terms _do_ include exceptions "for teaching and the use of data collections for Commercial purposes set out in an additional Commercial Licence," but these clearly do not apply to the public provision of materials for reproducibility purposes. |