# Python API for iRODS

**Authors**
- Arthur Newton (SURFsara)
- Christine Staiger (SURFsara)

**License**
Copyright 2018 SURFsara BV

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.



## Goal
You will learn how to perform a simple computation on data in iRODS using the Python API covered in the introduction. We will take into account that such a computation could be performed on an HPC cluster. You will:

- set up the environment for computation
- perform a simple word count calculation with existing data in iRODS
- use iRODS metadata to keep provenance

## Connect to iRODS
Note that this is a recap of the previous chapter.
To connect to iRODS we need to authenticate with a username and password (for this course username='mara' and password='NdxJvJQujF7dq5TyBms2Xp'). Passing sensitive information over the internet can be insecure, and a password should never be displayed or kept around in readable form. Therefore, it is good practice to encode the password as soon as it is entered.
The module `getpass` asks for a password without printing the input on screen. By encoding the input we prevent the variable from containing the plain-text password. Note that encoding only obscures the password; it is not encryption.


In [1]:
import getpass
pw = getpass.getpass().encode('utf-16')

········


Now we can create an iRODS session. The iRODS instance you will be connecting to is hosted on the SURFsara HPC Cloud. To connect to iRODS you will need the following information:

In [2]:
hostname='irodscourse.irodspoc-sara.surf-hosted.nl'
port=1247
username='mara'
zonename='tempZone'

With this information we can create a session object:

In [3]:
from irods.session import iRODSSession
session = iRODSSession(host=hostname, port=port, user=username, password=pw.decode('utf-16'), zone=zonename)

and test whether we have access:

In [5]:
coll = session.collections.get('/tempZone/home/mara')
print(coll.data_objects)
print(coll.subcollections)

[]
[]


In [6]:
iHome = coll.path

## Setup computation
We will now prepare the compute workflow as it should be executed on any worker node in a cluster. For now we still do this interactively on the user interface node (remember that it behaves like any other node in the cluster).

In [None]:
import os
from helperFunctions import *

print('Creating directories for analysis and results')
dataDir = os.environ['TMPDIR']+'/wordcountData' + '<your id>'
ensure_dir(dataDir)
print(dataDir)
resultsDir = os.environ['TMPDIR']+'/wordcountResults' + '<your id>'
ensure_dir(resultsDir)
print(resultsDir)
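
The helper `ensure_dir` comes from the course's `helperFunctions` module, whose source is not shown here. As a rough sketch of what such a helper might do (the name `ensure_dir_sketch` and the demo directory are assumptions, not the course code), the standard library can create a directory only when it is missing:

```python
import os
import tempfile

def ensure_dir_sketch(path):
    """Create path (and any missing parent directories) if it does not exist."""
    # exist_ok=True makes the call safe to repeat, e.g. when a job is rerun.
    os.makedirs(path, exist_ok=True)
    return path

# Demo in a throwaway base directory instead of the real $TMPDIR.
base = tempfile.mkdtemp()
demo_dir = ensure_dir_sketch(os.path.join(base, 'wordcountData', 'demo'))
print(os.path.isdir(demo_dir))
```

Calling the function a second time on the same path does nothing, which is convenient when the same setup code runs on every worker node.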

## Searching for the dataset
We need to perform a query to obtain the dataset we are interested in. For this use case, we are interested in a simple analysis of word frequency in Lewis Carroll's books.

In [None]:
from irods.models import Collection, DataObject, CollectionMeta, DataObjectMeta

print('Searching for files')
ATTR_NAME = 'author'
ATTR_VALUE = 'Lewis Carroll'

query = session.query(Collection.name, DataObject.name)
filteredQuery = query.filter(DataObjectMeta.name == ATTR_NAME).\
                          filter(DataObjectMeta.value == ATTR_VALUE)
print(filteredQuery.all())
iPaths = iParseQuery(filteredQuery)
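
`iParseQuery` is also part of `helperFunctions` and its implementation is not shown. A minimal sketch of the idea, assuming each result row can be indexed by the queried columns (with python-irodsclient that would be `row[Collection.name]` and `row[DataObject.name]`): join the collection name and the data object name into a full logical path. The function name, the dictionary keys and the example rows below are made up for illustration.

```python
def parse_query_sketch(rows, coll_key, name_key):
    """Build full logical paths from (collection name, object name) rows."""
    return ['{}/{}'.format(row[coll_key], row[name_key]) for row in rows]

# Stand-in rows: plain dicts with placeholder values, not real query results.
rows = [
    {'coll': '/zone/home/user/books', 'name': 'book1.txt'},
    {'coll': '/zone/home/user/books', 'name': 'book2.txt'},
]
paths = parse_query_sketch(rows, 'coll', 'name')
print('\n'.join(paths))
```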

## Downloading files from iRODS to scratch
Remember from the first part of the tutorial that, to download data from iRODS, you first have to read the data object into memory and then write it to a file on your local filesystem. For convenience we provide a function which does that for a list of data objects.

In [None]:
print('Downloading: ')
print('\n'.join(iPaths))
iGetList(session, iPaths, dataDir)
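
The source of `iGetList` is not shown either. The read-then-write pattern described above could look roughly like the sketch below. It assumes the session offers `data_objects.get(path)` and that the returned object has an `open('r')` method yielding bytes, as in python-irodsclient. `get_list_sketch`, `FakeObject` and the placeholder path are hypothetical; the fake session only exists so the example runs without an iRODS server.

```python
import io
import os
import tempfile
from types import SimpleNamespace

def get_list_sketch(session, irods_paths, target_dir):
    """Download each data object in irods_paths into target_dir."""
    local_paths = []
    for irods_path in irods_paths:
        obj = session.data_objects.get(irods_path)
        local_path = os.path.join(target_dir, os.path.basename(irods_path))
        # Read the whole object into memory, then write it to a local file.
        with obj.open('r') as src, open(local_path, 'wb') as dst:
            dst.write(src.read())
        local_paths.append(local_path)
    return local_paths

# Stand-in for a real data object so the sketch runs without a server.
class FakeObject:
    def __init__(self, content):
        self.content = content

    def open(self, mode):
        return io.BytesIO(self.content)

store = {'/zone/home/user/book1.txt': b'example text'}
fake_session = SimpleNamespace(
    data_objects=SimpleNamespace(get=lambda path: FakeObject(store[path])))

target = tempfile.mkdtemp()
downloaded = get_list_sketch(fake_session, list(store), target)
print(downloaded)
```

Reading an object fully into memory is fine for small text files; for large objects, copying in chunks (for example with `shutil.copyfileobj`) would be the safer choice.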

## Execute compute workflow
With everything set up, we can finally do our analysis:

In [None]:
dataFiles = [dataDir+'/'+f for f in os.listdir(dataDir)]
resFile = wordcount(dataFiles,resultsDir)
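
`wordcount` is provided by `helperFunctions`; how it tokenizes text and what its output file looks like is not documented here. The following is only a sketch of one way such a function could work, assuming a word is a run of letters or apostrophes and that the result is a tab-separated file. The name `wordcount_sketch` and the file name `wordcount.tsv` are hypothetical.

```python
import os
import re
import tempfile
from collections import Counter

def wordcount_sketch(data_files, results_dir):
    """Count words over all files and write 'word<TAB>count' lines to a file."""
    counts = Counter()
    for path in data_files:
        with open(path, encoding='utf-8', errors='replace') as handle:
            for line in handle:
                # Lower-case the line so 'The' and 'the' are counted together.
                counts.update(re.findall(r"[a-z']+", line.lower()))
    res_file = os.path.join(results_dir, 'wordcount.tsv')
    with open(res_file, 'w', encoding='utf-8') as out:
        # Most frequent words first.
        for word, number in counts.most_common():
            out.write('{}\t{}\n'.format(word, number))
    return res_file

# Small demo on a throwaway file.
workdir = tempfile.mkdtemp()
sample = os.path.join(workdir, 'sample.txt')
with open(sample, 'w', encoding='utf-8') as handle:
    handle.write('The cat and the hat.\n')
result_file = wordcount_sketch([sample], workdir)
with open(result_file, encoding='utf-8') as handle:
    print(handle.read())
```

Like the call above, the sketch returns the path of the results file, so the upload step can pick it up from there.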

## Upload results to iRODS
After the computation has finished, we can make the upload of the results into iRODS part of our jobscript.

In [None]:
# Pick a name for the results object that does not yet exist in our iRODS home collection
coll = session.collections.get(iHome)
objNames = [obj.name for obj in coll.data_objects]
f = os.path.basename(resFile)
count = 0

while f in objNames:
    f = os.path.basename(resFile) + '_' +str(count)
    count = count + 1

#upload
print('Upload results to: ', coll.path + '/' + f)
session.data_objects.put(resFile, coll.path + '/' + f)
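
In a jobscript it can be handy to wrap the name check in a small function. The sketch below follows the same numbering scheme as the loop above (`name`, then `name_0`, `name_1`, and so on); `unique_name_sketch` and the example names are hypothetical and work on plain Python collections, so no iRODS connection is needed to try it.

```python
import itertools

def unique_name_sketch(base, existing):
    """Return base, or base with the first free numeric suffix appended."""
    if base not in existing:
        return base
    for count in itertools.count():
        candidate = '{}_{}'.format(base, count)
        if candidate not in existing:
            return candidate

taken = {'results.dat', 'results.dat_0'}
print(unique_name_sketch('results.dat', taken))  # results.dat_1
```

Note that another job could still create the same name between the check and the upload, so this only reduces the chance of overwriting results; it does not rule it out.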

## Introduce some metadata for provenance
Now that the results are stored in iRODS, we want to link our new data to the computation and to the original data. Here we add some generic metadata that links the new results to the input data and to the search that produced it. This is by no means exhaustive; in practice a much more detailed description would be appropriate.

In [None]:
import datetime
obj = session.data_objects.get(coll.path + '/' + f)
for iPath in iPaths:
    obj.metadata.add('INPUTDAT', iPath)

obj.metadata.add('ISEARCH', ATTR_NAME + '==' + ATTR_VALUE)
obj.metadata.add('ISEARCHDATE', str(datetime.date.today()))
print('\n'.join([item.name +' \t'+ item.value for item in obj.metadata.items()]))

## Summarize
We have performed a simple calculation on data from iRODS. The data was first downloaded (staged) to our current location, as this typically gives the best performance. Staging can also be done well before the computation, and is generally advisable as long as the scratch space on the HPC cluster you are working on is large enough to hold the dataset. In principle, a more complex computation will not differ much from what we have done here.