Commit

version 0.1.0
fsolt authored and cran-robot committed Mar 22, 2017
0 parents commit 1558b74
Showing 14 changed files with 584 additions and 0 deletions.
30 changes: 30 additions & 0 deletions DESCRIPTION
@@ -0,0 +1,30 @@
Package: ukds
Type: Package
Title: Reproducible Data Retrieval from the UK Data Service
Version: 0.1.0
Date: 2017-03-22
Authors@R: c(
person("Frederick", "Solt", email = "frederick-solt@uiowa.edu", role = c("aut", "cre")))
URL: https://github.com/fsolt/ukds
BugReports: https://github.com/fsolt/ukds/issues
Description: Reproducible, programmatic retrieval of datasets from the
UK Data Service <https://www.ukdataservice.ac.uk>. The UKDS is "the
UK's largest collection of social, economic and population data resources,"
but researchers taking advantage of these datasets are caught in a bind.
The UKDS terms and conditions sharply limit redistribution of downloaded
datasets, but to ensure that one's work can be reproduced, assessed, and
built upon by others, one must provide access to the raw data one employed.
The ukds package cuts this knot by providing programmatic, reproducible
access to the UKDS datasets from within R.
License: MIT + file LICENSE
LazyData: TRUE
Suggests: knitr, rmarkdown
VignetteBuilder: knitr
Imports: magrittr, rio, RSelenium (>= 1.7.1), stringr, tools, utils
RoxygenNote: 6.0.1
NeedsCompilation: no
Packaged: 2017-03-22 17:11:05 UTC; fredsolt
Author: Frederick Solt [aut, cre]
Maintainer: Frederick Solt <frederick-solt@uiowa.edu>
Repository: CRAN
Date/Publication: 2017-03-22 19:20:58 UTC
2 changes: 2 additions & 0 deletions LICENSE
@@ -0,0 +1,2 @@
YEAR: 2017
COPYRIGHT HOLDER: Frederick Solt
13 changes: 13 additions & 0 deletions MD5
@@ -0,0 +1,13 @@
f5bbcfa8733cf609e79fcb86f551922a *DESCRIPTION
29f3ce3a8d4e2bb7d24fc6e35389082f *LICENSE
15d44eacee2d76689224524c7c00ff60 *NAMESPACE
96e1dff4835bb1105072e6f90269b77c *NEWS.md
5ea4aef32963a3e99379fbf461d83ca7 *R/ukds_download.R
e06bbbc0b8b9ac5903a8844e6e94a398 *README.md
9c21121da1dc187a51f48acc9db8201d *build/vignette.rds
bfe75c0b0cd3c64eb967569ed227307f *inst/doc/ukds-vignette.R
df62405f50a175e74b69d08552d574ab *inst/doc/ukds-vignette.Rmd
a1a750ac518339e8847eb89a168e2ca2 *inst/doc/ukds-vignette.html
99ce54b95347db3fd2d1a2c3f707e507 *man/ukds_download.Rd
df62405f50a175e74b69d08552d574ab *vignettes/ukds-vignette.Rmd
f7bdf0f655a3449bd76d5839f9a54034 *vignettes/ukds_bsa2015.png
10 changes: 10 additions & 0 deletions NAMESPACE
@@ -0,0 +1,10 @@
# Generated by roxygen2: do not edit by hand

export(ukds_download)
import(RSelenium)
importFrom(magrittr,'%>%')
importFrom(rio,convert)
importFrom(stringr,str_detect)
importFrom(stringr,str_subset)
importFrom(tools,file_path_sans_ext)
importFrom(utils,unzip)
2 changes: 2 additions & 0 deletions NEWS.md
@@ -0,0 +1,2 @@
## Version 0.1.0
First release.
192 changes: 192 additions & 0 deletions R/ukds_download.R
@@ -0,0 +1,192 @@
#' Download datasets from the UK Data Service
#'
#' \code{ukds_download} provides a programmatic and reproducible means to download datasets
#' from the UK Data Service's data archive
#'
#' @param file_id The unique identifier (or optionally a vector of these identifiers)
#' for the dataset(s) to be downloaded (see details).
#' @param org,user,password Your UK Data Service organization, username, and password (see details).
#' @param use The number of a 'use of data' you have registered with the UK Data Service (see details).
#' @param reset If TRUE, you will be asked to re-enter your organization, username, and password.
#' @param download_dir The directory (relative to your working directory) to
#' which files from the UK Data Service will be downloaded.
#' @param msg If TRUE, outputs a message showing which data set is being downloaded.
#' @param convert If TRUE, converts downloaded file(s) to .RData format.
#' @param delay If the speed of your connection to the UK Data Service is particularly slow,
#' \code{ukds_download} may encounter problems. Increasing the \code{delay} parameter
#' may help.
#'
#' @details
#' To avoid requiring others to edit your scripts to insert their own organization, username,
#' password, and use or to force them to do so interactively, the default is set to fetch
#' this information from the user's .Rprofile. Before running \code{ukds_download},
#' then, you should be sure to add these options to your .Rprofile substituting your
#' info for the example below:
#'
#' \code{
#' options("ukds_org" = "UK Data Archive",
#' "ukds_user" = "ukf0000000000",
#' "ukds_password" = "password123!",
#' "ukds_use" = "111111")
#' }
#'
#' @return The function is called for its side effect: the requested files are downloaded to \code{download_dir}.
#'
#' @examples
#' \dontrun{
#' ukds_download(file_id = c("8116", "7809", "7500"))
#' }
#'
#' @import RSelenium
#' @importFrom stringr str_detect str_subset
#' @importFrom magrittr '%>%'
#' @importFrom rio convert
#' @importFrom tools file_path_sans_ext
#' @importFrom utils unzip
#'
#' @export
ukds_download <- function(file_id,
org = getOption("ukds_org"),
user = getOption("ukds_user"),
password = getOption("ukds_password"),
use = getOption("ukds_use"),
reset = FALSE,
download_dir = "ukds_data",
msg = TRUE,
convert = TRUE,
delay = 5) {

# detect login info
if (reset){
org <- user <- password <- NULL
}

if (is.null(org)){
ukds_org <- readline(prompt = "The UK Data Service requires your user account information. Please enter your organization: \n")
options("ukds_org" = ukds_org)
org <- getOption("ukds_org")
}

if (is.null(user)){
ukds_user <- readline(prompt = "Please enter your UK Data Service username: \n")
options("ukds_user" = ukds_user)
user <- getOption("ukds_user")
}

if (is.null(password)){
ukds_password <- readline(prompt = "Please enter your UK Data Service password: \n")
options("ukds_password" = ukds_password)
password <- getOption("ukds_password")
}

if (is.null(use)) {
ukds_use <- readline(prompt = "Please enter the ID number of a use of data registered with the UK Data Service: \n")
options("ukds_use" = ukds_use)
use <- getOption("ukds_use")
}

# build path to chrome's default download directory
if (Sys.info()[["sysname"]]=="Linux") {
default_dir <- file.path("", "home", Sys.info()[["user"]], "Downloads") # leading "" makes the path absolute
} else {
default_dir <- file.path("", "Users", Sys.info()[["user"]], "Downloads")
}

# create specified download directory if necessary
if (!dir.exists(download_dir)) dir.create(download_dir, recursive = TRUE)

# initialize driver
if(msg) message("Initializing RSelenium driver")
rD <- rsDriver(browser = "chrome", verbose = TRUE)
remDr <- rD[["client"]]

# sign in
signin <- "https://qa.esds.ac.uk/secure/UKDSRegister_start.asp"
remDr$navigate(signin)
Sys.sleep(delay)
remDr$findElement(using = "partial link text", "Let me choose")$clickElement()
Sys.sleep(delay/2)
remDr$findElement(using = "class", "as-selections")$sendKeysToElement(list(org))
remDr$findElement(using = "class", "btn-enabled")$clickElement()
Sys.sleep(delay/2)
remDr$findElement(using = "id", "j_username")$sendKeysToElement(list(user))
remDr$findElement(using = "id", "j_password")$sendKeysToElement(list(password))
remDr$findElement(using = "class", "input-submit")$clickElement()
Sys.sleep(delay)

# loop through files
for (i in seq_along(file_id)) {
item <- file_id[[i]]
if(msg) message("Downloading UK Data Service file: ", item, sprintf(" (%s)", Sys.time()))

# get list of current default download directory contents
dd_old <- list.files(default_dir)

# navigate to download page
url <- paste0("https://discover.ukdataservice.ac.uk/catalogue/?sn=", item, "&type=Data%20catalogue")

remDr$navigate(url)
remDr$findElement(using = "partial link text", "Download")$clickElement()
Sys.sleep(delay/2)
remDr$findElement(using = "partial link text", "Login")$clickElement()
Sys.sleep(delay/2)

# select use
remDr$findElement(using = "xpath", paste0("//input[@value=", use,"]"))$clickElement() # choose project
Sys.sleep(delay)
remDr$findElement(using = "xpath", "//input[@value='Add Datasets']")$clickElement() # add datasets
Sys.sleep(delay/2)
try(remDr$findElement(using = "xpath", "//input[@value='Add Datasets']")$clickElement()) # add datasets
Sys.sleep(delay)

# accept special terms, if any
if (length(remDr$findElements(using = "partial link text", "Accept"))!=0) {
remDr$findElement(using = "partial link text", "Accept")$clickElement()
Sys.sleep(delay)
remDr$findElement(using = "xpath", "//input[@value='I accept']")$clickElement()
Sys.sleep(delay)
}

remDr$findElement(using = "xpath", paste0('//input[contains(@onclick,', item,')]'))$clickElement() # "Download"
remDr$findElement(using = "xpath", "//input[@value='I accept']")$clickElement() # End User License

remDr$findElement(using = "xpath", "//input[@value='STATA']")$clickElement() # Stata

# check that download has completed
dd_new <- setdiff(list.files(default_dir), dd_old)
# wait until a new file appears in the default download directory
while (length(dd_new) == 0) {
Sys.sleep(1)
dd_new <- setdiff(list.files(default_dir), dd_old)
}
# wait until Chrome has finished writing the file (no .crdownload suffix left)
while (any(stringr::str_detect(dd_new, "\\.crdownload$"))) {
Sys.sleep(1)
dd_new <- setdiff(list.files(default_dir), dd_old)
}

# unzip into specified directory and convert to .RData
dld_old <- list.files(download_dir)
unzip(file.path(default_dir, dd_new), exdir = download_dir)
unlink(file.path(default_dir, dd_new))
dld_new <- list.files(download_dir)[!list.files(download_dir) %in% dld_old]
file.rename(file.path(download_dir, dld_new), file.path(download_dir, item))

data_files <- list.files(path = file.path(download_dir, item), recursive = TRUE) %>%
str_subset("\\.dta$")
if (convert) {
# use a separate index so the outer loop's `i` is not reused
for (j in seq_along(data_files)) {
data_file <- data_files[j]
rio::convert(file.path(download_dir, item, data_file),
paste0(tools::file_path_sans_ext(file.path(download_dir,
item,
basename(data_file))), ".RData"))
}
}
}

# Close driver
remDr$close()
rD[["server"]]$stop()
}

22 changes: 22 additions & 0 deletions README.md
@@ -0,0 +1,22 @@
[![CRAN version](http://www.r-pkg.org/badges/version/ukds)](https://cran.r-project.org/package=ukds) ![](http://cranlogs.r-pkg.org/badges/grand-total/ukds) [![Travis-CI Build Status](https://travis-ci.org/fsolt/ukds.svg?branch=master)](https://travis-ci.org/fsolt/ukds)
------------------------------------------------------------------------

ukds
=========

`ukds` is an R package that provides reproducible, programmatic access to datasets stored in the [UK Data Service](https://www.ukdataservice.ac.uk) for [registered users](http://esds.ac.uk/newRegistration/newLogin.asp).


To install:

* the latest released version: `install.packages("ukds")`
* the latest development version:

```R
if (!require(ghit)) install.packages("ghit")
ghit::install_github("fsolt/ukds")
```

For more details, check out [the vignette](https://cran.r-project.org/package=ukds/vignettes/ukds-vignette.html).

* Note that, on Windows systems, `ukds` requires that [RTools](https://cran.r-project.org/bin/windows/Rtools/index.html) is installed.
Binary file added build/vignette.rds
Binary file not shown.
15 changes: 15 additions & 0 deletions inst/doc/ukds-vignette.R
@@ -0,0 +1,15 @@
## ----eval = FALSE--------------------------------------------------------
# options("ukds_org" = "UK Data Archive",
# "ukds_user" = "ukf0000000000",
# "ukds_password" = "password123!",
# "ukds_use" = "111111")

## ----eval=FALSE----------------------------------------------------------
# ukds_download(file_id = "8116")

## ----eval=FALSE----------------------------------------------------------
# ukds_download(file_id = c("8116", "7809", "7500"))

## ----eval=FALSE----------------------------------------------------------
# bsa2015 <- rio::import("ukds_data/8116/bsa15_to_ukds_final.RData")

59 changes: 59 additions & 0 deletions inst/doc/ukds-vignette.Rmd
@@ -0,0 +1,59 @@
---
title: "ukds: Reproducible Retrieval of UK Data Service Datasets"
author: "Frederick Solt"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{ukds: Reproducible Retrieval of UK Data Service Datasets}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---


The [UK Data Service](https://www.ukdataservice.ac.uk) (UKDS) is "the UK’s largest collection of social, economic and population data resources." Researchers taking advantage of these datasets, however, are caught in a bind. The [UK Data Service terms and conditions](https://www.ukdataservice.ac.uk/get-data/how-to-access/conditions) require users "to give access to the data collections only to registered users with a registered use."[^1] But to ensure that one's work can be reproduced, assessed, and built upon by others, one must provide access to the raw data one employed. The `ukds` package cuts this knot by providing programmatic, reproducible access to the UK Data Service's datasets from within R.

## Setup
To use `ukds`, you must first be a registered user of the UKDS, and you must have already registered your 'use of data' for any dataset you will download.

When used interactively, the `ukds_download` function will ask for the login information required by the UK Data Service: the user's organization, username, and password, as well as the 'use of data' for the datasets to be downloaded.
After that information is input once, it will be entered automatically for any other download requests made in the same session. To change this contact information within a session, one may set the argument `reset` to `TRUE` when running `ukds_download` again, and the function will again request the required information.
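
The prompt-once behaviour described above can be sketched roughly as follows. This is a simplified illustration rather than the package's own code: `get_credential()` is a hypothetical helper, and its `ask` argument is an assumption added so that `readline()` can be swapped out when running non-interactively.

```r
# Hypothetical helper (not part of ukds): return the stored option if there
# is one; otherwise ask for a value and store it for the rest of the session.
get_credential <- function(name, prompt, reset = FALSE, ask = readline) {
  value <- if (reset) NULL else getOption(name)
  if (is.null(value)) {
    value <- ask(prompt)
    # same effect as options(<name> = value), with the name given as a string
    do.call(options, stats::setNames(list(value), name))
  }
  value
}

# Non-interactive demonstration with a stand-in for readline()
options(ukds_user = NULL)
fake_ask <- function(prompt) "ukf0000000000"
first  <- get_credential("ukds_user", "Username: ", ask = fake_ask)
# The second call finds the stored option, so `ask` is never invoked
second <- get_credential("ukds_user", "Username: ",
                         ask = function(prompt) stop("should not be called"))
```

Passing `reset = TRUE` to this helper would discard the stored value and ask again, which mirrors the role of the `reset` argument described above.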

An optional, but highly recommended, setup step is to add the information the UK Data Service requires to your [.Rprofile](http://www.statmethods.net/interface/customizing.html) as in the following example:

```{r eval = FALSE}
options("ukds_org" = "UK Data Archive",
"ukds_user" = "ukf0000000000",
"ukds_password" = "password123!",
"ukds_use" = "111111")
```

The `ukds_download` function will then access the information it needs to pass on to the UKDS by default. This means that researchers will not have to expose their info in their R scripts and that others reproducing their results later will be able to execute those R scripts without modification. (They will, however, need to enter their own information into their own .Rprofiles, a detail that should be noted in the reproducibility materials to avoid confusion.)
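
Before running a script, it may be useful to confirm that all four options are actually set in the current session. The following is a minimal sketch of such a check; it is not something `ukds` does for you, and the variable names `needed` and `is_set` are purely illustrative.

```r
# Option names as used in the .Rprofile example above
needed <- c("ukds_org", "ukds_user", "ukds_password", "ukds_use")

# TRUE for each option that currently has a value
is_set <- vapply(needed, function(x) !is.null(getOption(x)), logical(1))

if (!all(is_set)) {
  message("Not set in this session: ", paste(needed[!is_set], collapse = ", "))
}
```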


## Use

The `ukds_download` function (1) opens a Chrome browser and navigates to the UKDS's sign-in page, (2) enters the required information to sign in, (3) navigates to a specified dataset, (4) adds the dataset to the specified registered 'use of data', (5) downloads the dataset's files, and, optionally but by default, (6) converts the dataset's files to `.RData` format.
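
Step (5) depends on knowing when the browser has finished writing a file. One way to detect this, sketched below, is to poll the download directory until a new file appears that no longer carries Chrome's `.crdownload` suffix. `wait_for_download()` is a hypothetical helper, and the `timeout` and `poll` arguments are assumptions of this sketch rather than features of the package.

```r
# Hypothetical helper (not part of ukds): wait until a new, fully downloaded
# file appears in `dir`. `old` holds the directory contents from before the
# download began; `timeout` and `poll` are in seconds.
wait_for_download <- function(dir, old, timeout = 60, poll = 1) {
  deadline <- Sys.time() + timeout
  repeat {
    new <- setdiff(list.files(dir), old)
    # Chrome marks unfinished downloads with a .crdownload suffix
    done <- length(new) > 0 && !any(grepl("\\.crdownload$", new))
    if (done) return(new)
    if (Sys.time() > deadline) {
      stop("Download did not finish within ", timeout, " seconds")
    }
    Sys.sleep(poll)
  }
}

# Demonstration in a temporary directory, with an empty file standing in
# for a finished download
demo_dir <- tempfile("downloads")
dir.create(demo_dir)
before <- list.files(demo_dir)
file.create(file.path(demo_dir, "8116.zip"))
finished <- wait_for_download(demo_dir, before, timeout = 5)
```

The timeout guards against waiting forever if a download never starts, at the cost of having to pick a sensible limit for slow connections.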

Datasets are specified using the `file_id` argument. The UKDS uses a unique SN number to identify each of its datasets. For the [2015 British Social Attitudes Survey](https://discover.ukdataservice.ac.uk/catalogue/?sn=8116&type=Data%20catalogue), for example, the file id is 8116:

<img src="ukds_bsa2015.png" style="width: 100%;"/>
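
The SN number is also what appears in the catalogue address, so a dataset's catalogue page can be reconstructed from its id. The sketch below follows the pattern of the link given above; the variable names are illustrative only.

```r
# SN number of the 2015 British Social Attitudes Survey
sn <- "8116"

# SN numbers are expected to consist of digits only
stopifnot(grepl("^[0-9]+$", sn))

# Catalogue URL, following the pattern of the link above
catalogue_url <- paste0("https://discover.ukdataservice.ac.uk/catalogue/?sn=",
                        sn, "&type=Data%20catalogue")
```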

To reproducibly download this dataset:

```{r eval=FALSE}
ukds_download(file_id = "8116")
```

Multiple datasets may be downloaded from the same research area in a single command by passing a vector of ids to `file_id`. The following downloads the above-described 2015 BSA along with those for 2014 and 2013:

```{r eval=FALSE}
ukds_download(file_id = c("8116", "7809", "7500"))
```

After the needed datasets are downloaded, they are, by default, converted to `.RData` format (via `rio::convert()`) and ready to be loaded into R using `load()` or `rio::import()`.

```{r eval=FALSE}
bsa2015 <- rio::import("ukds_data/8116/bsa15_to_ukds_final.RData")
```
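
When several datasets have been downloaded, it can be handy to find and load every converted file at once. The following sketch assumes the default `download_dir` of `ukds_data` and that `rio` is installed; the names `rdata_files` and `datasets` are illustrative. If the directory does not exist yet, the result is simply an empty list.

```r
# Find every converted file beneath the default download directory
rdata_files <- list.files("ukds_data", pattern = "\\.RData$",
                          recursive = TRUE, full.names = TRUE)

# Import each file into a list, named after the file without its extension
datasets <- lapply(rdata_files, function(f) rio::import(f))
names(datasets) <- tools::file_path_sans_ext(basename(rdata_files))
```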

[^1]: The terms _do_ include exceptions "for teaching and the use of data collections for Commercial purposes set out in an additional Commercial Licence," but these clearly do not apply to the public provision of materials for reproducibility purposes.
