# Wordcount - an example of how to work on a compute cluster with data stored in iRODS

## Imports
- Standard Python modules for file operations and for generating timestamps
- Our own library of helper functions (`helperFunctions`)
- The iRODS modules needed to connect to iRODS and to work with data objects, collections and metadata

In [None]:
import os
import json
import string
import datetime
from collections import Counter
from shutil import rmtree
from pathlib import Path
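# Helper functions used in this notebook, imported by name instead of with a wildcard
from helperFunctions import ensure_dir, files_to_text, get_data, parse_query, put_file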

from irods.session import iRODSSession
from irods.models import Collection, DataObject, CollectionMeta, DataObjectMeta

## Connecting to iRODS
### Standard login
In a standard iRODS environment you can log in with your username and password. You also have to provide the host address, the port and the zone name.

In [None]:
#PARAMETERS
# iRODS connection
host='<FILL IN>'
port=1247
user='<FILL IN>'
zone='<FILL IN>'

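# Store your password in a file called 'passwd' and read it from there
with open('passwd', 'r') as f:
    passwd = f.readline().strip()

# Open the session with the parameters above
session = iRODSSession(host=host, port=port, user=user, password=passwd, zone=zone)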

### Login to SSL enabled iRODS
Some iRODS servers use SSL encryption to secure the messages exchanged between you and the server. In that case your iRODS sysadmin will give you a file `irods_environment.json` which contains all parameters needed to connect to the server. Please store that file in your home directory as `~/.irods/irods_environment.json`.
We will use this file to connect to the iRODS server like this:

In [None]:
with open('passwd', 'r') as f:
    passwd = f.readline().strip()
    

with open(os.path.expanduser("~/.irods/irods_environment.json"), "r") as f:
    ienv = json.load(f)
session = iRODSSession(**ienv, password=passwd)

Now we can check which iRODS collections we have access to. The home collections of users and groups are usually found in the collection '/zonename/home'. In some cases this collection is not open to users and you need to look directly into your personal home collection '/zonename/home/username'.

In [None]:
# Path to general home
homeCollPath = '/' + session.zone + '/home/'
homeColl = session.collections.get(homeCollPath)
print("Data objects in", homeCollPath, homeColl.data_objects)
print("Subcollections in", homeCollPath, homeColl.subcollections)

In [None]:
# Path to personal home
#homeCollPath = '/' + session.zone + '/home/' + session.username
#homeColl = session.collections.get(homeCollPath)
#print("Data objects in", homeCollPath, homeColl.data_objects)
#print("Subcollections in", homeCollPath, homeColl.subcollections)

## Some general remarks on data objects and collections
The output above shows that a collection is not represented by a simple string. It is a Python object with some useful attributes:

In [None]:
print("Path:", homeColl.path)
print("Name:", homeColl.name)

The same is true for data objects. Let's inspect one of the data objects that we will also use in the pipeline below:

In [None]:
objPath = "/nluu12p/home/research-test-christine/books/AdventuresSherlockHolmes.txt"
obj = session.data_objects.get(objPath)
print("Path:", obj.path)
print("Name:", obj.name)

For data objects we also have some system metadata available:

In [None]:
print("Size:", obj.size)
print("Checksum:", obj.checksum)

Both collections and data objects can be annotated with metadata directly. We can retrieve the metadata as a Python object. However, printing that object does not show the metadata in a human-readable format:

In [None]:
print("Coll metadata:", homeColl.metadata)
print("Obj metadata:", obj.metadata)

Here is a short example of how to get the keys, values and units of the metadata from our object:

In [None]:
for item in obj.metadata.items():
    # Retrieve key(name), value, units
    print("Key:", item.name, ", Value:", item.value, ", Units:", item.units)

We are going to use this metadata in the following pipeline. We will search for books written by a specific author and analyse their contents.

## Parameters for our computational pipeline
- Keywords and their values to search for the correct data in iRODS
- Setting up the folder structure on the fast storage of the compute cluster.
  The data stored here is **not backed up** and not stored safely. This storage is only used to allow very quick calculations on the data.

In [None]:
# data search
ATTR_NAME = 'AUTHOR'
ATTR_VALUE = 'Lewis Carroll'

## Search for your input data
User-defined metadata is stored as Key-Value-Unit triples. In this iRODS instance we are looking for books which carry the key "AUTHOR" with the value "Lewis Carroll".

In [None]:
print('Searching for files')
query = session.query(Collection.name, DataObject.name)
# Filtering for AUTHOR == Lewis Carroll
filteredQuery = query.filter(DataObjectMeta.name == ATTR_NAME).\
                          filter(DataObjectMeta.value == ATTR_VALUE)
print(filteredQuery.all())
irods_paths = parse_query(filteredQuery)
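
The helper `parse_query` comes from `helperFunctions` and is not shown in this notebook. Since the query selects a collection name and a data object name for each hit, one plausible reading is that it joins the two into a full logical path. The sketch below illustrates that idea with plain tuples; the function `join_irods_paths` and the example paths are hypothetical and only serve as illustration.

```python
import posixpath


def join_irods_paths(rows):
    """Build full logical paths from (collection_name, object_name) pairs.

    Hypothetical stand-in for what parse_query might do; iRODS logical
    paths always use '/' as separator, hence posixpath.
    """
    return [posixpath.join(coll_name, obj_name) for coll_name, obj_name in rows]


# Made-up example rows, not real query output
example_rows = [
    ("/zone/home/books", "Alice.txt"),
    ("/zone/home/books", "Snark.txt"),
]
example_paths = join_irods_paths(example_rows)
print(example_paths)
```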

## Prepare data for analysis
To look inside the data we have two options in iRODS:
1. We download the data to our fast storage system, so that it is available and ready to be read from there.
2. In some cases single files are too large to be downloaded quickly, or even too large to fit into the memory of the machine you are working on. In that case we can stream files into memory, i.e. read a file piece by piece or read only the interesting parts.
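
Option 2 can be sketched as follows. This is a minimal example, assuming that the handle you stream from behaves like an ordinary binary file object with a `read(size)` method. Here an `io.BytesIO` buffer stands in for such a handle, and the function name `count_words_streaming` is our own choice for illustration.

```python
import io
from collections import Counter


def count_words_streaming(stream, chunk_size=4096):
    """Count whitespace-separated words from a binary stream, chunk by chunk."""
    counts = Counter()
    leftover = b""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        data = leftover + chunk
        parts = data.split()
        # If the data does not end on whitespace, the last word may be cut
        # off by the chunk boundary, so we hold it back for the next round.
        if parts and not data[-1:].isspace():
            leftover = parts.pop()
        else:
            leftover = b""
        counts.update(part.decode("utf-8", errors="replace") for part in parts)
    # The final held-back word is complete once the stream is exhausted
    if leftover:
        counts[leftover.decode("utf-8", errors="replace")] += 1
    return counts


demo = io.BytesIO(b"the cat and the hat and the bat")
streamed = count_words_streaming(demo, chunk_size=5)
print(streamed)
```

Only one chunk plus at most one partial word is held in memory at a time, so the memory use does not grow with the file size (apart from the counter itself).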

### Code example for option 1: downloading the data

In [None]:
print('Downloading: ')
data_dir = os.path.expanduser("~/wordcount_data")
ensure_dir(data_dir)
print('\n'.join(irods_paths))
get_data(session, irods_paths, data_dir)

### Reading data into memory
In our example the data is relatively small and we have enough memory available. Moreover, it is textual data which we need to parse word by word as a string anyway. Hence, we can directly load the content of the files into memory:

In [None]:
text = ''
for path in irods_paths:
    obj = session.data_objects.get(path)
    with obj.open('r') as objRead:
        text = text + objRead.read().decode()

## Start your computational pipeline

### Reading in data
If you did not load the iRODS data into memory but downloaded it instead, you now have to read the downloaded files into one large string:


In [None]:
text = files_to_text(data_dir)

### The pipeline

In [None]:
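def wordcount(text):
    # Strip punctuation from each whitespace-separated token
    words = [''.join(char for char in word
             if char not in string.punctuation) for word in text.split()]
    # Drop tokens that consisted only of punctuation (e.g. a lone "--")
    words = [word for word in words if word]
    print("Number of words:", len(words))
    # Counter maps each distinct word (case-sensitive) to its number of occurrences
    unique_words_count = Counter(words)
    return unique_words_count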

result = wordcount(text)

The result is a `Counter`, a dictionary that maps each word to its number of occurrences across all three books:

In [None]:
print("Alice:", result["Alice"])
print("Rabbit:", result["Rabbit"])
print("Queen:", result["Queen"])
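
Note that the lookup is case-sensitive, so "Rabbit" and "rabbit" are counted separately. The following small sketch, using a made-up sample sentence, shows how lowercasing the tokens merges such variants and how `Counter.most_common` returns the most frequent words.

```python
import string
from collections import Counter

sample = "Alice saw the Rabbit. The rabbit ran and the rabbit hid; the end."

# Remove punctuation and lowercase each token, then drop empty tokens
table = str.maketrans("", "", string.punctuation)
tokens = [word.translate(table).lower() for word in sample.split()]
tokens = [word for word in tokens if word]

sample_counts = Counter(tokens)
top_two = sample_counts.most_common(2)
print(top_two)
```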

We can write the results to a local file as JSON and then upload that file to iRODS.

In [None]:
res_dir = os.path.expanduser("~/wordcount_results")
ensure_dir(res_dir)
res_file = res_dir + "/wordcount_res.txt"
with open(res_file, 'w') as file:
    file.write(json.dumps(result))
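
When such a results file is read back later, `json.load` returns a plain dictionary. A minimal sketch of the round trip, using a temporary directory and made-up counts instead of the real results:

```python
import json
import os
import tempfile
from collections import Counter

# Made-up counts for illustration only
counts = Counter({"Alice": 3, "Rabbit": 1})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "wordcount_res.txt")
    # A Counter is a dict subclass, so it serialises like a normal dictionary
    with open(path, "w") as fh:
        json.dump(counts, fh)
    # Wrap the loaded dictionary in Counter to get the counting behaviour back
    with open(path) as fh:
        restored = Counter(json.load(fh))

print(restored["Alice"], restored["Queen"])
```

Wrapping the loaded dictionary in `Counter` restores the convenient behaviour that missing words count as zero.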

**Note:** our results are stored on the fast storage, which is not backed up. We should upload them to iRODS promptly.

## Uploading your data to safe storage through iRODS and annotating the results

In [None]:
coll = session.collections.get('/' + session.zone + '/home/' + 'research-test-christine')
objs_names = [obj.name for obj in coll.data_objects]
f = os.path.basename(res_file)
# Little trick to prevent overwriting data: if the object name already
# exists in iRODS, we extend it with a number
count = 0
while f in objs_names:
    f = os.path.basename(res_file) + '_' + str(count)
    count = count + 1
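
The same renaming logic can be wrapped in a small reusable function. This is a sketch in plain Python; the function name `unique_name` and the example names are our own and not part of `helperFunctions`.

```python
def unique_name(base, existing):
    """Return base, or base plus a numeric suffix, so that it is not in existing."""
    existing = set(existing)
    candidate = base
    count = 0
    while candidate in existing:
        # Always build the candidate from the original base name,
        # so suffixes do not pile up (name_0, name_1, ... not name_0_1)
        candidate = f"{base}_{count}"
        count += 1
    return candidate


taken = ["wordcount_res.txt", "wordcount_res.txt_0"]
free_name = unique_name("wordcount_res.txt", taken)
print(free_name)
```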

In [None]:
print('Upload results to: ', coll.path + '/' + f)
res_obj = put_file(session, res_file, coll.path + '/' + f)

Now we can annotate the data in iRODS so that we know later where it came from:

In [None]:
print('Adding metadata to', coll.path + '/' + f)
obj = session.data_objects.get(coll.path + '/' + f)
for path in irods_paths:
    obj.metadata.add('prov:wasDerivedFrom', path)

obj.metadata.add('ISEARCH', ATTR_NAME + '==' + ATTR_VALUE)
obj.metadata.add('ISEARCHDATE', str(datetime.date.today()))
obj.metadata.add('prov:SoftwareAgent', 'wordcount.py')

## Streaming the results to iRODS
To avoid creating yet another file on our system, we can also directly stream the data to iRODS.

In [None]:
coll = session.collections.get('/' + session.zone + '/home/research-test-christine')
obj_names = [obj.name for obj in coll.data_objects]
base_name = "wordcount_result.txt"
new_obj_name = base_name

# Ensuring that we do not overwrite a previous results file.
# The suffix is always appended to the base name, so it does not pile up.
count = 0
while new_obj_name in obj_names:
    new_obj_name = base_name + '_' + str(count)
    count = count + 1

Now that we have a valid name for our new object, we can create it and stream the content into it:

In [None]:
obj = session.data_objects.create(coll.path + "/" + new_obj_name)
with obj.open('w') as obj_write:
    obj_write.write(json.dumps(result).encode())

Now we can annotate the data in iRODS so that we know later where it came from:

In [None]:
print('Adding metadata to', obj.path)
for path in irods_paths:
    obj.metadata.add('prov:wasDerivedFrom', path)

obj.metadata.add('ISEARCH', ATTR_NAME + '==' + ATTR_VALUE)
obj.metadata.add('ISEARCHDATE', str(datetime.date.today()))
obj.metadata.add('prov:SoftwareAgent', 'wordcount.py')

## Last check: How is the file annotated in iRODS?

In [None]:
print('Metadata for: ', obj.path)
print('\n'.join([item.name +' \t'+ item.value for item in obj.metadata.items()]))

## Remove temporary data from scratch space

In [None]:
print("Removing local data in", data_dir)
rmtree(data_dir)
print("Removing local data in", res_dir)
rmtree(res_dir)