# Wordcount - an example of how to work on a compute cluster with data stored in iRODS

## Imports
- Standard Python modules for file operations and for generating timestamps
- Our own library of helper functions (`helperFunctions`)
- The iRODS modules needed to connect to iRODS and to work with data objects, collections and metadata inside iRODS

In [None]:
import os
import datetime
from shutil import rmtree
from helperFunctions import *

from irods.session import iRODSSession
from irods.models import Collection, DataObject, CollectionMeta, DataObjectMeta

## Connecting to iRODS

In [None]:
#PARAMETERS
# iRODS connection
host='<FILL IN>'
port=1247
user='<FILL IN>'
zone='<FILL IN>'

# Create a file named 'passwd' that contains only your password; it is read from there
with open('passwd', 'r') as f:
    passwd = f.readline().strip()

## Parameters for our computational pipeline
- Keywords and their values to search for the correct data in iRODS
- Setting up the folder structure on fast storage of the compute cluster.
  The data stored here is **not backed up** and not stored safely. This storage is only used to allow very fast calculations on the data.

In [None]:
# data search
ATTR_NAME = 'AUTHOR'
ATTR_VALUE = 'Lewis Carroll'

In [None]:
print('Creating local directories for analysis and results')
dataDir = '<lustre path>/dataDir'
ensure_dir(dataDir)
resultsDir = '<lustre path>/resultsDir'
ensure_dir(resultsDir)
print('<lustre path>')
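
`ensure_dir` comes from the `helperFunctions` library, which is not shown in this notebook. The following is a minimal sketch of what such a helper might look like, assuming it only needs to create the directory when it is missing. The name `ensure_dir_sketch` is hypothetical and chosen so that it does not shadow the real helper.

```python
import os
import tempfile

def ensure_dir_sketch(path):
    """Create `path` (and any missing parents) if it does not exist yet; return the path."""
    # exist_ok=True makes the call safe to repeat
    os.makedirs(path, exist_ok=True)
    return path

# Usage example on a throwaway temporary directory (stand-in for the lustre path)
base = tempfile.mkdtemp()
target = ensure_dir_sketch(os.path.join(base, 'dataDir'))
ensure_dir_sketch(target)  # calling it a second time is harmless
print(os.path.isdir(target))
```

Because the call is idempotent, the cell can be re-run without errors when the directories already exist.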

In [None]:
print('Connect to iRODS '+ zone)
session = iRODSSession(host=host, port=port, user=user, password=passwd, zone=zone)
print('You have access to: ')
colls = [coll.path for coll in session.collections.get('/'+zone+'/'+'home').subcollections]
print(colls)


## Search for your input data
User-defined metadata is stored in iRODS as key-value-unit triples (AVUs: attribute, value, unit). In this iRODS instance we are looking for books that carry the key "AUTHOR" with the value "Lewis Carroll".

In [None]:
print('Searching for files')
query = session.query(Collection.name, DataObject.name)
# Filtering for AUTHOR == Lewis Carroll
filteredQuery = (query.filter(DataObjectMeta.name == ATTR_NAME)
                      .filter(DataObjectMeta.value == ATTR_VALUE))
print(filteredQuery.all())
iPaths = iParseQuery(filteredQuery)
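
`iParseQuery` is part of `helperFunctions` and is not shown here. In python-irodsclient, each result row can be indexed by the queried column, for example `row[Collection.name]`. The sketch below shows how such rows might be turned into full iRODS paths. Plain strings stand in for the column objects so that the example runs without an iRODS connection; the function name `rows_to_paths` and the sample rows are hypothetical.

```python
def rows_to_paths(rows, coll_key, name_key):
    """Join the collection name and data object name of each row into a logical path."""
    # iRODS logical paths always use '/' as separator, independent of the local OS
    return [row[coll_key].rstrip('/') + '/' + row[name_key] for row in rows]

# Stand-in rows; with a real query one would presumably pass
# coll_key=Collection.name and name_key=DataObject.name (assumption).
sample_rows = [
    {'coll': '/exampleZone/home/alice/books', 'name': 'book1.txt'},
    {'coll': '/exampleZone/home/alice/books/', 'name': 'book2.txt'},
]
paths = rows_to_paths(sample_rows, 'coll', 'name')
print('\n'.join(paths))
```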


## Prepare data for analysis
To look inside the data, we have two options in iRODS:
1. We download the data to our fast storage system, so that it is available and ready to be read from there.
2. In some cases single files are too large to be downloaded quickly, or even too large to fit into the memory of the machine you are working on. In that case we can stream files into memory, i.e. read a file bit by bit or read only the interesting parts.

In this tutorial we will continue with option 1:

In [None]:
print('Downloading: ')
print('\n'.join(iPaths))
iGetList(session, iPaths, dataDir)
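
For comparison, option 2 (streaming) can be sketched as follows. The function reads any binary file-like object in fixed-size chunks and never holds the whole file in memory. As far as we know, python-irodsclient data objects provide such a stream via `obj.open('r')`; here an in-memory `io.BytesIO` stands in for it, and the function name `count_words_streaming` is hypothetical.

```python
import io
from collections import Counter

def count_words_streaming(stream, chunk_size=4096):
    """Count whitespace-separated words in a binary stream, one chunk at a time."""
    counts = Counter()
    leftover = b''  # possibly incomplete word carried over from the previous chunk
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        data = leftover + chunk
        words = data.split()
        # If the chunk does not end in whitespace, its last word may continue
        # in the next chunk, so hold it back.
        if words and not data[-1:].isspace():
            leftover = words.pop()
        else:
            leftover = b''
        counts.update(w.decode('utf-8', errors='replace').lower() for w in words)
    if leftover:
        counts[leftover.decode('utf-8', errors='replace').lower()] += 1
    return counts

# Stand-in for an iRODS stream (assumption: obj.open('r') behaves like this)
fake_stream = io.BytesIO(b'Alice was beginning to get very tired. Alice sighed.')
counts = count_words_streaming(fake_stream, chunk_size=5)
print(counts['alice'])
```

A deliberately tiny `chunk_size` is used in the example to exercise the case where a word is split across two chunks.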

## Start your computational pipeline

In [None]:
print('Start wordcount')
dataFiles = [os.path.join(dataDir, f) for f in os.listdir(dataDir)]
resFile = wordcount(dataFiles, resultsDir)
print('Results of calculations:', resFile)
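
The `wordcount` function is provided by the helper library and its implementation is not shown in this notebook. A minimal sketch of what such a function might do, assuming it counts whitespace-separated words over all input files and writes one `word<TAB>count` line per word to a results file. The name `wordcount_sketch`, the output file name and the output format are assumptions.

```python
import os
import tempfile
from collections import Counter

def wordcount_sketch(data_files, results_dir, out_name='results.txt'):
    """Count words over all files and write them, most frequent first, to a results file."""
    counts = Counter()
    for path in data_files:
        with open(path, 'r', encoding='utf-8', errors='replace') as fh:
            for line in fh:  # line by line, so large files are not loaded at once
                counts.update(line.lower().split())
    res_file = os.path.join(results_dir, out_name)
    with open(res_file, 'w', encoding='utf-8') as out:
        for word, n in counts.most_common():
            out.write(word + '\t' + str(n) + '\n')
    return res_file

# Usage example with a small temporary input file
work = tempfile.mkdtemp()
sample = os.path.join(work, 'sample.txt')
with open(sample, 'w', encoding='utf-8') as fh:
    fh.write('the cat and the hat\n')
result_path = wordcount_sketch([sample], work)
with open(result_path, encoding='utf-8') as fh:
    first_line = fh.readline()
print(first_line.strip())
```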


What have we actually calculated?

In [None]:
with open(resFile, 'r') as f:
    print(f.readlines())

**Note:** our results are stored on the fast but unsafe storage. We need to upload them to iRODS quickly!

## Uploading your data to safe storage through iRODS and annotating the results

In [None]:
coll = session.collections.get('/' + zone + '/home/' + user)
objNames = [obj.name for obj in coll.data_objects]
f = os.path.basename(resFile)
# Small trick to avoid overwriting data: if the filename already exists in iRODS, we extend it with a number
count = 0
while f in objNames:
    f = os.path.basename(resFile) + '_' + str(count)
    count = count + 1
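
The loop above appends the counter after the file extension, so a file named `results.txt` (a hypothetical name) would become `results.txt_0`. The following variant is not part of the original pipeline; it is a sketch that keeps the extension at the end, with the hypothetical function name `unique_name`.

```python
import os

def unique_name(filename, existing):
    """Return `filename`, or a numbered variant of it that is not in `existing`."""
    if filename not in existing:
        return filename
    stem, ext = os.path.splitext(filename)
    count = 0
    while True:
        # Insert the counter before the extension, e.g. results_0.txt
        candidate = stem + '_' + str(count) + ext
        if candidate not in existing:
            return candidate
        count += 1

# Stand-in for the names of the data objects already in the collection
taken = {'results.txt', 'results_0.txt'}
print(unique_name('results.txt', taken))
print(unique_name('new.txt', taken))
```

In the notebook, `existing` would correspond to `objNames` and `filename` to `os.path.basename(resFile)`.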

In [None]:
print('Upload results to: ', coll.path + '/' + f)
session.data_objects.put(resFile, coll.path + '/' + f)

Now we can annotate the data in iRODS, so that we know later where it came from:

In [None]:
print('Adding metadata to', coll.path + '/' + f)
obj = session.data_objects.get(coll.path + '/' + f)
for iPath in iPaths:
    obj.metadata.add('prov:wasDerivedFrom', iPath)

obj.metadata.add('ISEARCH', ATTR_NAME + '==' + ATTR_VALUE)
obj.metadata.add('ISEARCHDATE', str(datetime.date.today()))
obj.metadata.add('prov:SoftwareAgent', 'wordcount.py')


## Last check: How is the file annotated in iRODS?

In [None]:
print('Metadata for: ', coll.path + '/' + f)
print('\n'.join([item.name +' \t'+ item.value for item in obj.metadata.items()]))

## Remove temporary data from scratch space

In [None]:
print("Removing local data in", dataDir)
rmtree(dataDir)
print("Removing local data in", resultsDir)
rmtree(resultsDir)