### R experiments with Dataverse client API

Ben Johnson

3/23/2019

start time: 7:00pm

break start: 8:40pm

break end: 9:40pm

end time: 11:40pm

The purpose of this script is to access the ICEWS Dataverse API through the R client. I would prefer to use Python, but I keep running into "authorization" errors even though it's an open dataset. Let's see whether using R instead gets us the results we want.

Tasks:

1. **✔** access ICEWS files through [Dataverse API client for R](https://github.com/IQSS/dataverse-client-r)

Results:
1. Partial success, but promising. I followed the examples [here](https://github.com/IQSS/dataverse-client-r) and [here](https://www.rdocumentation.org/packages/dataverse/versions/0.2.0/topics/get_file), and skimmed the docs in relevant places. The pipeline should follow something like this:
    - access ICEWS *dataverse* class
    - access the most recent ICEWS dataset from the *dataverse_dataset* list within
    - get the *dataFile.id* attribute, which returns a numerical handle for each file in the dataset
    - make get_file() calls on handles
    - unzip the results to .tab files
I can pretty much finish this up with built-in utilities at any time, but I spent a lot of time trying to use API functions directly. Giving up on that for now.
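
The steps listed above can be sketched as a single function. This is a rough sketch under assumptions, not a tested solution: `fetch_icews_zips`, `list_fn`, and `fetch_fn` are hypothetical names, and the defaults assume that `dataset_files()` returns entries with `label` and `dataFile$id` fields and that `get_file()` accepts a numeric id and returns a raw vector. The two functions are passed in as arguments so the logic can be tried offline with stand-ins.

```r
# Sketch only. Field names (label, dataFile$id) are assumptions based on
# the printed output in this notebook, not confirmed against the package docs.
fetch_icews_zips <- function(doi, dest_dir,
                             list_fn  = function(d) dataverse::dataset_files(d),
                             fetch_fn = function(id) dataverse::get_file(id),
                             extract  = TRUE) {
  dir.create(dest_dir, showWarnings = FALSE, recursive = TRUE)
  files <- list_fn(doi)                    # one entry per file in the dataset
  paths <- character(0)
  for (f in files) {
    id       <- f$dataFile$id              # numeric handle for this file
    zip_path <- file.path(dest_dir, f$label)
    writeBin(as.vector(fetch_fn(id)), zip_path)   # save the raw bytes to disk
    if (extract) utils::unzip(zip_path, exdir = dest_dir)
    paths <- c(paths, zip_path)
  }
  paths                                    # paths of the saved .zip files
}
```

With the real client loaded, the call would look like `fetch_icews_zips('doi:10.7910/DVN/QI2T9A', 'icews_raw')`. It is untested against the live server, so treat it as a starting point.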

In [295]:
# load the client and set system environment variables
# not sure the key is strictly needed for an open dataset, since we're not posting anything.
library(dataverse)

# Sys.setenv('DATAVERSE_KEY' = key...)
Sys.setenv('DATAVERSE_SERVER' = 'dataverse.harvard.edu')

In [284]:
# example download from https://www.rdocumentation.org/packages/dataverse/versions/0.2.0/topics/get_file
monogan <- get_dataverse("monogan")
monogan_data <- dataverse_contents(monogan)
d1 <- get_dataset("doi:10.7910/DVN/ARKOTI")

# Fails every time, even though it is copy-pasted from the example.
f <- get_file(d1$files$datafile$id[3], "doi:10.7910/DVN/ARKOTI")

ERROR: Error in get_fileid.character(dataset, file, key = key, server = server, : File not found


In [301]:
# here's the problem... attributes aren't easy to get?
d1$files

ERROR while rich displaying an object: Error in vapply(part, format, character(nrow(part))): values must be length 32,
 but FUN(X[[10]]) result is length 2

Traceback:
1. FUN(X[[i]], ...)
2. tryCatch(withCallingHandlers({
 .     rpr <- mime2repr[[mime]](obj)
 .     if (is.null(rpr)) 
 .         return(NULL)
 .     prepare_content(is.raw(rpr), rpr)
 . }, error = error_handler), error = outer_handler)
3. tryCatchList(expr, classes, parentenv, handlers)
4. tryCatchOne(expr, names, parentenv, handlers[[1L]])
5. doTryCatch(return(expr), name, parentenv, handler)
6. withCallingHandlers({
 .     rpr <- mime2repr[[mime]](obj)
 .     if (is.null(rpr)) 
 .         return(NULL)
 .     prepare_content(is.raw(rpr), rpr)
 . }, error = error_handler)
7. mime2repr[[mime]](obj)
8. repr_text.data.frame(obj)
9. ellip_limit_arr(obj, ...)
10. arr_parts_format(parts)
11. structure(lapply(parts, arr_part_format), omit = attr(parts, 
  .     "omit"))
12. lapply(parts, arr_part_format)
13. FUN(X[[i]], ...)
14.
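
The traceback above passes through `repr_text.data.frame`, so the failure appears to be in the notebook's rich display of the data frame rather than in the dataset object itself. A minimal sketch of a workaround, assuming the `files` element is a data frame with `label` and `id` columns (as the printed dataset summaries suggest): inspect it with plain-text tools and look ids up by label. `file_id_for()` is a hypothetical helper, and `mock_files` stands in for `d1$files`.

```r
# Hypothetical helper: return the numeric file id for a given file label.
file_id_for <- function(files, label) {
  hit <- files$id[files$label == label]
  if (length(hit) != 1) stop("expected exactly one file labelled ", label)
  hit
}

# Mock stand-in for d1$files, using labels and ids from the printed summary.
mock_files <- data.frame(
  label = c("alpl2013.tab", "BPchap7.tab", "chapter01.R"),
  id    = c(2692294, 2692295, 2692202),
  stringsAsFactors = FALSE
)

# str() and names() print plain text, so they avoid the rich-display path.
str(mock_files, max.level = 1)
names(mock_files)

file_id_for(mock_files, "chapter01.R")
```

If the real `d1$files` has the same columns, `file_id_for(d1$files, "chapter01.R")` would give a handle to pass to `get_file()`.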

In [324]:
# another example. Once we've gotten the file name/numerical id, everything's pretty smooth from there.
get_dataset("doi:10.7910/DVN/ARKOTI")

f <- get_file("constructionData.tab", "doi:10.7910/DVN/ARKOTI")

# load into memory. These are stata files, I believe; not sure how to adapt this to ICEWS .zip files, or if I should even worry about that this early.
tmp <- tempfile(fileext = ".dta")
writeBin(as.vector(f), tmp)
dat <- foreign::read.dta(tmp)

Dataset (75170): 
Version: 1.0, RELEASED
Release Date: 2015-07-07T02:57:02Z
License: CC0
21 Files:
                          label version      id                  contentType
1                  alpl2013.tab       2 2692294    text/tab-separated-values
2                   BPchap7.tab       2 2692295    text/tab-separated-values
3                   chapter01.R       2 2692202 text/plain; charset=US-ASCII
4                   chapter02.R       2 2692206 text/plain; charset=US-ASCII
5                   chapter03.R       2 2692210 text/plain; charset=US-ASCII
6                   chapter04.R       2 2692204 text/plain; charset=US-ASCII
7                   chapter05.R       2 2692205 text/plain; charset=US-ASCII
8                   chapter06.R       2 2692212 text/plain; charset=US-ASCII
9                   chapter07.R       2 2692209 text/plain; charset=US-ASCII
10                  chapter08.R       2 2692208 text/plain; charset=US-ASCII
11                  chapter09.R       2 2692211 text/p
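
On the open question of adapting the in-memory load to the ICEWS .zip files: one possible approach is to write the raw bytes to a temporary .zip, unpack it with `utils::unzip()`, and read the .tab file inside. This is a sketch under assumptions: `read_tab_from_zip_bytes()` is a hypothetical helper, and it assumes each archive holds a tab-separated file with a header row and a `.tab` extension.

```r
# Hypothetical helper: unpack a zip held in memory as a raw vector and
# read the first .tab file found inside it.
read_tab_from_zip_bytes <- function(zip_bytes) {
  zip_path <- tempfile(fileext = ".zip")
  writeBin(as.vector(zip_bytes), zip_path)
  out_dir   <- tempfile("unzipped_")
  extracted <- utils::unzip(zip_path, exdir = out_dir)  # paths of extracted files
  tabs <- extracted[grepl("\\.tab$", extracted)]
  if (length(tabs) == 0) stop("no .tab file found in archive")
  # quote = "" keeps stray quote characters in text fields from merging rows
  utils::read.delim(tabs[1], quote = "", stringsAsFactors = FALSE)
}

# Not run (needs network access and the dataverse package):
# dat <- read_tab_from_zip_bytes(get_file(3234868))
```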

In [302]:
# ICEWS dataverse main handle
icews_doi = 'doi:10.7910/DVN/QI2T9A'

In [325]:
# this printed table is pretty much exactly what we want, but I can't figure out how to access it programmatically.
(icews_dataset = get_dataset(icews_doi))

Dataset (151508): 
Version: 151.0, RELEASED
Release Date: 2019-03-24T04:50:57Z
License: CC0
16 Files:
                        label version      id     contentType
1   20181004-icews-events.zip       1 3234868 application/zip
2   20181005-icews-events.zip       1 3235021 application/zip
3   20181006-icews-events.zip       1 3238491 application/zip
4   20181007-icews-events.zip       1 3238493 application/zip
5   20181008-icews-events.zip       1 3238584 application/zip
6   20181009-icews-events.zip       1 3238918 application/zip
7   20181010-icews-events.zip       1 3239379 application/zip
8   20181011-icews-events.zip       1 3239478 application/zip
9   20181012-icews-events.zip       1 3239713 application/zip
10  20181013-icews-events.zip       1 3239773 application/zip
11  20181014-icews-events.zip       1 3240169 application/zip
12  20181015-icews-events.zip       1 3241107 application/zip
13  20181016-icews-events.zip       1 3241682 application/zip
14  20181017-icews-events.zip 

In [294]:
# 'files' attribute clearly visible...
attributes(icews_dataset)

In [328]:
# top-level look at the ICEWS dataset

# object containing all data about the ICEWS dataverse account
(icews = get_dataverse('icews'))

Dataverse (2900): icews
Created:     2014-12-08T16:01:26Z
Creator:     @jlautens

In [327]:
# list of all ICEWS dataSETS, not FILES of the dataset that concerns us
(icews_data = dataverse_contents(icews))

[[1]]
Dataset (3234391): https://doi.org/10.7910/DVN/QI2T9A
Publisher: Harvard Dataverse
publicationDate: 2018-10-04


[[2]]
Dataset (65871): https://doi.org/10.7910/DVN/28118
Publisher: Harvard Dataverse
publicationDate: 2015-03-27


[[3]]
Dataset (65872): https://doi.org/10.7910/DVN/28119
Publisher: Harvard Dataverse
publicationDate: 2015-03-27


[[4]]
Dataset (65873): https://doi.org/10.7910/DVN/28117
Publisher: Harvard Dataverse
publicationDate: 2015-03-27


[[5]]
Dataset (65874): https://doi.org/10.7910/DVN/28075
Publisher: Harvard Dataverse
publicationDate: 2015-03-27



In [330]:
# all .zips
(data_list = dataset_files(icews_doi))

[[1]]
File (3234868): 20181004-icews-events.zip
Dataset version: 151508
MD5: 9a9a3bcca7df7c6eef0f161fd3cdcd7a
Description: 

[[2]]
File (3235021): 20181005-icews-events.zip
Dataset version: 151508
MD5: 8123db0735e7b9bb85b24e36b19103b9
Description: 

[[3]]
File (3238491): 20181006-icews-events.zip
Dataset version: 151508
MD5: 97fa48cf7c903273a5cc22220b288571
Description: 

[[4]]
File (3238493): 20181007-icews-events.zip
Dataset version: 151508
MD5: b7e3e1725a2c43a0a2f079ff02fe5af2
Description: 

[[5]]
File (3238584): 20181008-icews-events.zip
Dataset version: 151508
MD5: ae100a6f60570c10a6773870340006fc
Description: 

[[6]]
File (3238918): 20181009-icews-events.zip
Dataset version: 151508
MD5: 0c1d3761f0942510b0c79aa485c4bbb2
Description: 

[[7]]
File (3239379): 20181010-icews-events.zip
Dataset version: 151508
MD5: 9167c7f3752432db37998c19fdb7d7cf
Description: 

[[8]]
File (3239478): 20181011-icews-events.zip
Dataset version: 151508
MD5: eba7ab008b520200043b2c373f440031
Description: 



In [331]:
data_list[[1]]

File (3234868): 20181004-icews-events.zip
Dataset version: 151508
MD5: 9a9a3bcca7df7c6eef0f161fd3cdcd7a
Description: 

In [332]:
data_list[[1]]['dataFile']

In [333]:
# this is the easiest form in which to deal with this at the moment. Still, there must be more elegant ways. I'm going to put off looking for them until later in the project.
unlist(data_list[[1]])
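
A possibly tidier alternative to `unlist()` is to flatten the whole list into one data frame. This is a sketch, assuming each entry carries a `label` plus a nested `dataFile` list with `id` and `md5` (the fields the printed output appears to draw on). `files_to_df()` is a hypothetical helper, and `mock_list` stands in for `data_list`.

```r
# Hypothetical helper: flatten a list of file entries into one data frame.
files_to_df <- function(file_list) {
  data.frame(
    label = vapply(file_list, function(f) f$label, character(1)),
    id    = vapply(file_list, function(f) as.numeric(f$dataFile$id), numeric(1)),
    md5   = vapply(file_list, function(f) f$dataFile$md5, character(1)),
    stringsAsFactors = FALSE
  )
}

# Mock entries built from the first two files printed above.
mock_list <- list(
  list(label = "20181004-icews-events.zip",
       dataFile = list(id = 3234868, md5 = "9a9a3bcca7df7c6eef0f161fd3cdcd7a")),
  list(label = "20181005-icews-events.zip",
       dataFile = list(id = 3235021, md5 = "8123db0735e7b9bb85b24e36b19103b9"))
)

files_to_df(mock_list)
```

`vapply()` fails loudly if an entry is missing a field, which is useful while the real structure is still uncertain.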

In [247]:
# test handle
url = 'https://dataverse.harvard.edu/api/access/datafile/3382959' 

In [251]:
# Downloading straight from the URL. The docs warn against this for large datasets. That isn't a concern yet, since all our files are small and zipped at present (<200kb zipped), but it should be avoided in production.
download.file(url, destfile = 'test.zip', method = 'wget')
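
If the direct-URL route is kept for now, the URL can be built from a file id, and the download checked against the MD5 that the file listing reports. This is a minimal sketch: `datafile_url()` and `md5_matches()` are hypothetical helpers, and the URL pattern is simply copied from the test handle above rather than taken from the API docs.

```r
# Hypothetical helper: build the access URL for a numeric file id.
datafile_url <- function(id, server = "dataverse.harvard.edu") {
  sprintf("https://%s/api/access/datafile/%s",
          server, format(id, scientific = FALSE))
}

# Hypothetical helper: compare a local file's MD5 with an expected hex string.
md5_matches <- function(path, expected_md5) {
  unname(tools::md5sum(path)) == tolower(expected_md5)
}

# Not run (needs network access); id and MD5 are from the listing above:
# dest <- "20181004-icews-events.zip"
# download.file(datafile_url(3234868), destfile = dest, mode = "wb")
# md5_matches(dest, "9a9a3bcca7df7c6eef0f161fd3cdcd7a")
```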