## Access files in Object Storage with R

This notebook shows you how to access data files stored in Object Storage by using the R programming language and SparkR, a lightweight front end for using Apache Spark from R.


## Table of contents

1. [Load data](#load_data)
1. [Access data](#access_data)
    1. [Access data by using R](#access_data_using_R)
    1. [Access data by using SparkR](#access_data_using_SparkR)
1. [Summary](#summary)

<a id="load_data"></a>
## Load data

Before you can analyze the data in your files, you must add the files to the notebook. Files that you add to the notebook are stored in Object Storage.

To add files that you want to use in a notebook to Object Storage, click the **Data** icon on the notebook action bar. You can either drag the file that you want to add to the `Data` pane or click **Add Source** and browse to the file. The data files are listed on the `Data` pane. 

<a id="access_data"></a>
## Access data

To access data in a file in Object Storage, you need the Object Storage authentication credentials. 

Click the next code cell to set the focus on it. To insert the credentials for the data file into this cell, select **Insert to code > Credentials** on the data file that you loaded in the `Data` pane.

This action returns an R `list` object with the credentials required to access the file in Object Storage. 

<div class="alert alert-block alert-info">Note: If you decide to share this notebook with other users, consider removing the credentials from the notebook.</div>
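As a minimal sketch of how such a credentials list can be handled, the following cell reads a field with `[[ ]]` and masks the password on a copy before printing it. The variable names `credentials_example` and `redacted` and all values are hypothetical placeholders, not output of the **Insert to code** function.

```r
# Hypothetical placeholder credentials; every value here is made up.
credentials_example <- list(
    auth_url  = "https://identity.example.com",
    container = "mycontainer",
    filename  = "data.csv",
    password  = "not-a-real-password"
)

# Individual fields are read with [[ ]].
credentials_example[["container"]]

# Copy the list and mask the secret before printing or sharing it.
redacted <- credentials_example
redacted[["password"]] <- "<redacted>"
str(redacted)
```

The original list is left unchanged, so it can still be passed to the helper functions in this notebook.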


# @hidden_cell
credentials_2 <- list(auth_url = "https://identity.open.softlayer.com", project = "object_storage_5c7078fd_f364_4973_afef_3bee81426d6c", project_id = "dd2f1d02110e4fbc93745a2bacc2e46b", region = "dallas", user_id = "744b43a666284f738fe52775672732d2", domain_id = "e2f9e3757229466c82bfd5bf022efddc", domain_name = "1007041", username = "member_cc7b1049bc284c38be98f6a31787dec2e10cd4c1", password = "Tk(GPH6)cC#O2ry1", container = "dsxdemo", tenantId = "undefined", filename = "GoSales_Tx_NaiveBayes.csv")

<a id="access_data_using_R"></a>
### Access data by using R

Because the data file is located in Object Storage, you need a helper function to access the file that you loaded.

Run the following cell to define the function called `getObjectStorageFile`. This function takes the credentials list for the data file as input. It authenticates with Object Storage by using your credentials, downloads the data file, and returns a text-mode connection that you can read from in the notebook.

In [2]:
getObjectStorageFile <- function(credentials) {
    # Install httr and RCurl if they are missing, then load them.
    if(!require(httr)) install.packages('httr')
    if(!require(RCurl)) install.packages('RCurl')
    library(httr)
    library(RCurl)
    # Request an authentication token from the identity service.
    auth_url <- paste(credentials[['auth_url']],'/v3/auth/tokens', sep= '')
    # sep='"' puts a double quote between the pieces, which wraps each credential value in quotes.
    auth_args <- paste('{"auth": {"identity": {"password": {"user": {"domain": {"id": ', credentials[['domain_id']],'},"password": ',
                   credentials[['password']],',"name": ', credentials[['username']],'}},"methods": ["password"]}}}', sep='"')
    auth_response <- httr::POST(url = auth_url, body = auth_args)
    x_subject_token <- headers(auth_response)[['x-subject-token']]
    auth_body <- content(auth_response)
    # Find the public object-store endpoint for the region, then append the container and file name.
    access_url <- unlist(lapply(auth_body[['token']][['catalog']], function(catalog){
        if((catalog[['type']] == 'object-store')){
            lapply(catalog[['endpoints']], function(endpoints){
                if(endpoints[['interface']] == 'public' && endpoints[['region_id']] == credentials[['region']]) {
                   paste(endpoints[['url']], credentials[['container']], credentials[['filename']], sep='/')}
            })
        }
    }))
    # Download the file and return its contents as a text-mode connection.
    data <- content(httr::GET(url = access_url, add_headers("Content-Type" = "application/json", "X-Auth-Token" = x_subject_token)), as="text")
    textConnection(data)
}
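
Two steps in this helper are easy to misread: the `paste(..., sep='"')` call that builds the JSON request body, and the nested `lapply` that picks the file URL out of the service catalog. The following offline sketch walks through both without contacting any service. The `creds` and `catalog` objects are hypothetical stand-ins; the catalog structure is assumed only from the fields that the helper reads (`type`, `endpoints`, `interface`, `region_id`, `url`).

```r
# Hypothetical placeholder credentials; not real values.
creds <- list(domain_id = "my-domain", username = "my-user", password = "my-pass",
              region = "dallas", container = "mycontainer", filename = "data.csv")

# Step 1: sep = '"' inserts a double quote between every piece, so each
# credential value ends up wrapped in quotes inside the JSON text.
auth_args <- paste('{"auth": {"identity": {"password": {"user": {"domain": {"id": ',
                   creds[["domain_id"]], '},"password": ', creds[["password"]],
                   ',"name": ', creds[["username"]],
                   '}},"methods": ["password"]}}}', sep = '"')
cat(auth_args, "\n")

# Step 2: a mock service catalog with made-up URLs (an assumption for illustration).
catalog <- list(
    list(type = "compute", endpoints = list(
        list(interface = "public", region_id = "dallas", url = "https://compute.example.com/v2"))),
    list(type = "object-store", endpoints = list(
        list(interface = "internal", region_id = "dallas", url = "https://internal.example.com/v1/AUTH_x"),
        list(interface = "public", region_id = "london", url = "https://lon.example.com/v1/AUTH_x"),
        list(interface = "public", region_id = "dallas", url = "https://dal.example.com/v1/AUTH_x")))
)

# Non-matching entries return NULL, and unlist() drops them, which leaves
# only the URL of the public object-store endpoint in the requested region.
access_url <- unlist(lapply(catalog, function(entry) {
    if (entry[["type"]] == "object-store") {
        lapply(entry[["endpoints"]], function(endpoint) {
            if (endpoint[["interface"]] == "public" && endpoint[["region_id"]] == creds[["region"]]) {
                paste(endpoint[["url"]], creds[["container"]], creds[["filename"]], sep = "/")
            }
        })
    }
}))
print(access_url)
```

Because the credential values are pasted into the JSON text unescaped, a value that itself contains a double quote would produce an invalid request body.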

In [1]:
# The code was removed by DSX for sharing.

Loading required package: httr
Loading required package: RCurl
Loading required package: bitops

Attaching package: ‘RCurl’

The following object is masked from ‘package:SparkR’:

    base64



PRODUCT_LINE,GENDER,AGE,MARITAL_STATUS,PROFESSION
Personal Accessories,M,27,Single,Professional
Personal Accessories,F,39,Married,Other
Mountaineering Equipment,F,39,Married,Other
Personal Accessories,F,56,Unspecified,Hospitality
Golf Equipment,M,45,Married,Retired
Golf Equipment,M,45,Married,Retired


You can pass the text-mode connection that the helper function returns to any standard R data import function.
For example, run the next cell to read a `.csv` file into an R data frame by using the `read.csv()` function:

In [4]:
R.data.frame <- read.csv(file = getObjectStorageFile(credentials_1))
head(R.data.frame)

Unnamed: 0,DATE,TIME,BOROUGH,ZIP.CODE,LATITUDE,LONGITUDE,LOCATION,ON.STREET.NAME,CROSS.STREET.NAME,OFF.STREET.NAME,...,CONTRIBUTING.FACTOR.VEHICLE.2,CONTRIBUTING.FACTOR.VEHICLE.3,CONTRIBUTING.FACTOR.VEHICLE.4,CONTRIBUTING.FACTOR.VEHICLE.5,UNIQUE.KEY,VEHICLE.TYPE.CODE.1,VEHICLE.TYPE.CODE.2,VEHICLE.TYPE.CODE.3,VEHICLE.TYPE.CODE.4,VEHICLE.TYPE.CODE.5
1,03/11/2015,23:15,BROOKLYN,11207.0,40.65781,-73.89612,"(40.6578144, -73.8961242)",LINDEN BOULEVARD,WILLIAMS AVENUE,,...,Unspecified,,,,3185005,PASSENGER VEHICLE,SPORT UTILITY / STATION WAGON,,,
2,03/11/2015,23:15,,,,,,,,,...,Unspecified,,,,3184851,PICK-UP TRUCK,PASSENGER VEHICLE,,,
3,03/11/2015,23:25,STATEN ISLAND,10304.0,40.6246,-74.0796,"(40.6246026, -74.0795982)",BROAD STREET,TOMPKINS AVENUE,,...,Driver Inattention/Distraction,,,,3185247,PICK-UP TRUCK,SPORT UTILITY / STATION WAGON,,,
4,03/11/2015,23:25,BRONX,10465.0,40.87829,-73.87006,"(40.8782895, -73.8700582)",EAST GUN HILL ROAD,BRONX RIVER PARKWAY,,...,Unspecified,,,,3184867,SPORT UTILITY / STATION WAGON,PASSENGER VEHICLE,,,
5,03/11/2015,23:40,QUEENS,11420.0,40.6773,-73.80456,"(40.677304, -73.8045606)",135 STREET,FOCH BOULEVARD,,...,Unspecified,,,,3185149,PASSENGER VEHICLE,PASSENGER VEHICLE,,,
6,03/11/2015,23:53,MANHATTAN,10019.0,40.76417,-73.98473,"(40.7641704, -73.9847336)",WEST 53 STREET,8 AVENUE,,...,Unspecified,,,,3185906,PASSENGER VEHICLE,PASSENGER VEHICLE,,,
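
To see how a text-mode connection works with `read.csv()` without contacting Object Storage, the following sketch builds a connection from an in-memory string. The variable names `csv_text`, `con`, and `sample_df` are illustrative, and the two rows are copied from the sample output shown earlier in this notebook.

```r
# CSV content held in memory; in the notebook this text comes from Object Storage.
csv_text <- "PRODUCT_LINE,GENDER,AGE,MARITAL_STATUS,PROFESSION
Personal Accessories,M,27,Single,Professional
Golf Equipment,M,45,Married,Retired"

# textConnection() wraps the string so that it can be read like a file.
con <- textConnection(csv_text)
sample_df <- read.csv(file = con, stringsAsFactors = FALSE)
close(con)  # release the connection after reading

str(sample_df)
```

Closing the connection after the import avoids a warning about unused connections later in the session.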


<a id="access_data_using_SparkR"></a>
### Access data by using SparkR

Before you can access data in the data file in Object Storage by using the [`SQLContext`](https://spark.apache.org/docs/latest/sparkr.html#starting-up-sparkcontext-sqlcontext) object, you must set the Hadoop configuration by using the following configuration function. Run the following cell to create the helper function:

In [5]:
setHadoopConfig <- function(credentials) {
    # All settings for this configuration share the prefix fs.swift.service.<name>.
    prefix <- paste("fs.swift.service", credentials[['name']], sep = ".")
    # sc is the notebook's existing SparkContext.
    hConf <- SparkR:::callJMethod(sc, "hadoopConfiguration")
    SparkR:::callJMethod(hConf, "set", paste(prefix, "auth.url", sep='.'), paste(credentials[["auth_url"]],"/v3/auth/tokens",sep=""))
    SparkR:::callJMethod(hConf, "set", paste(prefix, "auth.endpoint.prefix", sep='.'), "endpoints")
    SparkR:::callJMethod(hConf, "set", paste(prefix, "tenant", sep='.'), credentials[["project_id"]])
    SparkR:::callJMethod(hConf, "set", paste(prefix, "username", sep='.'), credentials[["user_id"]])
    SparkR:::callJMethod(hConf, "set", paste(prefix, "password", sep='.'), credentials[["password"]])
    SparkR:::callJMethod(hConf, "set", paste(prefix, "region", sep='.'), credentials[["region"]])
    # invisible() suppresses the return value of the last call in the cell output.
    invisible(SparkR:::callJMethod(hConf, "setBoolean", paste(prefix, "public", sep='.'), TRUE))
}

Set the Hadoop configuration and give it a name, for example, `keystone`:

In [6]:
credentials_1[["name"]] <- "keystone"
setHadoopConfig(credentials_1)
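
The name matters because it becomes part of every Hadoop property key that the helper sets. As a rough illustration that needs no Spark session, the following sketch lists the keys produced for the name `keystone`. The variables `config_name`, `settings`, and `keys` are introduced here for illustration only.

```r
# The configuration name chosen above.
config_name <- "keystone"
prefix <- paste("fs.swift.service", config_name, sep = ".")

# The setting suffixes that setHadoopConfig appends to the prefix.
settings <- c("auth.url", "auth.endpoint.prefix", "tenant",
              "username", "password", "region", "public")

# paste() is vectorized, so this yields one full property key per setting.
keys <- paste(prefix, settings, sep = ".")
print(keys)
```

Using a different name for each credentials list keeps the settings of several Object Storage instances apart in the same Hadoop configuration.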

You can now use the `read.df` function from the SparkR API to load the data file as a Spark DataFrame. For example, run the next cell to read a `.csv` file into a Spark DataFrame. The variable `filePath` is the location of the data file in Object Storage.
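
To make the shape of that location concrete, the following sketch assembles a Swift URL of the form `swift://<container>.<name>/<filename>` from plain strings. The container and file name are hypothetical placeholders; `keystone` is the configuration name used in this notebook.

```r
# Hypothetical container and file name; "keystone" is the configuration name.
container <- "mycontainer"
config_name <- "keystone"
filename <- "data.csv"

# Same construction as filePath: swift://<container>.<name>/<filename>
file_path <- paste("swift://", container, ".", config_name, "/", filename, sep = "")
print(file_path)
```

The name after the dot has to match the name used in the Hadoop configuration, because that is how the `fs.swift.service.<name>` settings are looked up for the URL.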

In [7]:
filePath <- paste("swift://" , credentials_1[['container']] , "." , credentials_1[['name']] , "/" , credentials_1[['filename']], sep="")
SparkR.DataFrame <- read.df(sqlContext, filePath, source = "com.databricks.spark.csv", header = "true")
head(SparkR.DataFrame)

Unnamed: 0,DATE,TIME,BOROUGH,ZIP CODE,LATITUDE,LONGITUDE,LOCATION,ON STREET NAME,CROSS STREET NAME,OFF STREET NAME,...,CONTRIBUTING FACTOR VEHICLE 2,CONTRIBUTING FACTOR VEHICLE 3,CONTRIBUTING FACTOR VEHICLE 4,CONTRIBUTING FACTOR VEHICLE 5,UNIQUE KEY,VEHICLE TYPE CODE 1,VEHICLE TYPE CODE 2,VEHICLE TYPE CODE 3,VEHICLE TYPE CODE 4,VEHICLE TYPE CODE 5
1,03/11/2015,23:15,BROOKLYN,11207.0,40.6578144,-73.8961242,"(40.6578144, -73.8961242)",LINDEN BOULEVARD,WILLIAMS AVENUE,,...,Unspecified,,,,3185005,PASSENGER VEHICLE,SPORT UTILITY / STATION WAGON,,,
2,03/11/2015,23:15,,,,,,,,,...,Unspecified,,,,3184851,PICK-UP TRUCK,PASSENGER VEHICLE,,,
3,03/11/2015,23:25,STATEN ISLAND,10304.0,40.6246026,-74.0795982,"(40.6246026, -74.0795982)",BROAD STREET,TOMPKINS AVENUE,,...,Driver Inattention/Distraction,,,,3185247,PICK-UP TRUCK,SPORT UTILITY / STATION WAGON,,,
4,03/11/2015,23:25,BRONX,10465.0,40.8782895,-73.8700582,"(40.8782895, -73.8700582)",EAST GUN HILL ROAD,BRONX RIVER PARKWAY,,...,Unspecified,,,,3184867,SPORT UTILITY / STATION WAGON,PASSENGER VEHICLE,,,
5,03/11/2015,23:40,QUEENS,11420.0,40.677304,-73.8045606,"(40.677304, -73.8045606)",135 STREET,FOCH BOULEVARD,,...,Unspecified,,,,3185149,PASSENGER VEHICLE,PASSENGER VEHICLE,,,
6,03/11/2015,23:53,MANHATTAN,10019.0,40.7641704,-73.9847336,"(40.7641704, -73.9847336)",WEST 53 STREET,8 AVENUE,,...,Unspecified,,,,3185906,PASSENGER VEHICLE,PASSENGER VEHICLE,,,


Now your data is in a `Spark DataFrame` and you can begin analyzing it. 

<div class="alert alert-block alert-info">Note: To access CSV files in Object Storage and load data to use in the notebook, you can use the code generation functions on the `Insert to code` list below each data file in the `Data` pane in the notebook.</div>

<a id="summary"></a>
## Summary

This notebook demonstrated how to access files stored in Object Storage by using both R and SparkR. You can use and adapt these code snippets in a notebook you are developing if you want to load data to and access data from Object Storage.


### Author

Sumit Goyal is a Software Developer at IBM in Germany. He is a data science enthusiast and is passionate about IBM's Data Science Experience. He holds a degree in Automation and Industrial IT. Find him on Twitter at @imSumitGoyal.