# Wordcount - an example of how to work with data stored in iRODS

## Imports
- Standard Python modules for file operations and for generating timestamps
- Our own library of helper functions (`helperFunctions`)
- The iRODS modules needed to connect to iRODS and to work with data objects, collections and metadata inside iRODS

In [None]:
import os
import json
import string
import datetime
from collections import Counter
from shutil import rmtree
from pathlib import Path
from helperFunctions import *

from irods.session import iRODSSession
from irods.models import Collection, DataObject, CollectionMeta, DataObjectMeta

## Connecting to iRODS

### Login to Yoda instances
To connect to Yoda you will need an environment file. These files can be found [here (Step 2. Configuring iCommands)](https://www.uu.nl/en/research/yoda/guide-to-yoda/i-am-using-yoda/using-icommands-for-large-datasets). Please adjust the username in the file for your Yoda instance and store the file in your home directory as `~/.irods/irods_environment.json`.

You will also need a [Data Access Password](https://www.uu.nl/en/research/yoda/using-data-access-passwords). 

**Step 1**:
To connect to Yoda via Python you will need to create an environment file. First create a hidden folder called `.irods` in your home directory:

In [None]:
env_dir = str(Path.home()) + '/.irods/'
if not os.path.exists(env_dir):
  os.mkdir(env_dir)
  print("Folder %s created!" % env_dir)
else:
  print("Folder %s already exists" % env_dir)

**Step 2**: Now create the environment file. The contents of the file are specific to your faculty and can be found [here (Step 2. Configuring iCommands)](https://www.uu.nl/en/research/yoda/guide-to-yoda/i-am-using-yoda/using-icommands-for-large-datasets). You can create the file manually or adapt the cell below. Copy and paste the info relevant to your faculty into the file (or into the code cell below) and set `irods_user_name` to your email address.

In [None]:
env_file = env_dir + 'irods_environment.json'

if os.path.exists(env_file):
    print("File %s already exists" % env_file)
else:
    # REPLACE THE EMPTY DICTIONARY BELOW WITH THE INFO FOR YOUR FACULTY, FOR EXAMPLE:
    #    {
    #    "irods_host": "science.data.uu.nl",
    #    "irods_port": 1247,
    #    "irods_home": "/nluu6p/home",
    #    "irods_user_name": "exampleuser@uu.nl",
    #    "irods_zone_name": "nluu6p",
    #    "irods_authentication_scheme": "pam",
    #    "irods_encryption_algorithm": "AES-256-CBC",
    #    "irods_encryption_key_size": 32,
    #    "irods_encryption_num_hash_rounds": 16,
    #    "irods_encryption_salt_size": 8,
    #    "irods_client_server_policy": "CS_NEG_REQUIRE",
    #    "irods_client_server_negotiation": "request_server_negotiation"
    #    }
    dictionary = {}
    with open(env_file, 'w') as outfile:  
        json.dump(dictionary, outfile)
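
As a minimal sketch of the step above, the helper below checks that a settings dictionary contains a few keys from the commented template before writing it as JSON. The function name `write_env_file`, the list `REQUIRED_KEYS` and the temporary target path are assumptions for illustration; the values are the placeholders from the template, not real credentials.

```python
import json
import tempfile
from pathlib import Path

# Keys taken from the commented template; treating them as required is an assumption.
REQUIRED_KEYS = ["irods_host", "irods_port", "irods_user_name", "irods_zone_name"]

def write_env_file(path, settings):
    """Write settings as JSON to path, refusing incomplete dictionaries."""
    missing = [key for key in REQUIRED_KEYS if key not in settings]
    if missing:
        raise ValueError("Missing settings: " + ", ".join(missing))
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, "w") as outfile:
        json.dump(settings, outfile, indent=4)
    return path

# Demo in a temporary directory so that no real environment file is touched.
demo_settings = {
    "irods_host": "science.data.uu.nl",
    "irods_port": 1247,
    "irods_user_name": "exampleuser@uu.nl",
    "irods_zone_name": "nluu6p",
}
with tempfile.TemporaryDirectory() as tmp:
    written = write_env_file(Path(tmp) / ".irods" / "irods_environment.json", demo_settings)
    with open(written) as infile:
        reloaded = json.load(infile)
print(reloaded["irods_user_name"])
```

Checking the keys first means an empty dictionary raises an error instead of silently producing an environment file that cannot be used to log in.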

**Step 3**: You will also need to create a [Data Access Password](https://www.uu.nl/en/research/yoda/using-data-access-passwords). Once you have a data access password, run the cell below and enter the password in the prompt that appears:

In [None]:
import getpass
passwd = getpass.getpass("Yoda data access password")

In [None]:
with open(os.path.expanduser("~/.irods/irods_environment.json"), "r") as f:
    ienv = json.load(f)
session = iRODSSession(**ienv, password=passwd)
print(ienv)

Now we can check which iRODS collections we have access to. The usual home directory for users and groups is the collection '/zonename/home'. In some cases this collection is not open to users and you need to look directly into your personal home collection '/zonename/home/username'.

In [None]:
# Path to general home (YODA)
path = '/' + session.zone + '/home'
print(path)
collection = session.collections.get(path)
print("Data objects in", path, collection.data_objects)
print("Subcollections in", path, collection.subcollections)

Now we get our group's collection:

In [None]:
# Path to the group collection (in default iRODS: the personal home)
homeCollPath = '/' + session.zone + '/home/research-test-christine'
homeColl = session.collections.get(homeCollPath)
print("Data objects in", homeCollPath, homeColl.data_objects)
print("Subcollections in", homeCollPath, homeColl.subcollections)

## Some general remarks on data objects and collections
**Terminology**: In Yoda we do not simply work with files and folders but data objects and collections.

We see in the output above that a collection is not represented by a simple string; it is a Python object with some useful attributes:

In [None]:
print("Path:", homeColl.path)
print("Name:", homeColl.name)

The same is true for data objects. Let's inspect one of the data objects that we will also use in the pipeline below.

In [None]:
objPath = "/nluu12p/home/research-test-christine/books/AdventuresOfSherlockHolmes_1.txt"
obj = session.data_objects.get(objPath)
print("Path:", obj.path)
print("Name:", obj.name)

For data objects we also have some system metadata available:

In [None]:
print("Size:", obj.size)
print("Checksum:", obj.checksum)

Both collections and data objects can be annotated with user metadata, i.e. tags that are meaningful to you. This metadata is formatted as a key (name), value, units triple. The data object or collection cannot be separated from its metadata, which gives us a powerful new feature and an improvement over simple files and folders.

**Terminology**: User-defined metadata in Yoda is called *attributes*.

Let's dive into the technical aspects a bit.
We can retrieve the metadata as a Python object. However, we cannot access that metadata in human-readable format right away:

In [None]:
print("Coll metadata:", homeColl.metadata)
print("Obj metadata:", obj.metadata)

Here is a small showcase of how to get the keys and values of the metadata from our object:

In [None]:
for item in obj.metadata.items():
    # Retrieve key(name), value, units
    print("Key:", item.name, ", Value:", item.value, ", Units:", item.units)
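
To make the triple structure concrete without a Yoda connection, here is a small sketch that mimics metadata items with a `namedtuple`. The `AVU` type, the function `group_by_key` and the sample values are made up for illustration; only the attribute names `name`, `value` and `units` follow the loop above.

```python
from collections import namedtuple, defaultdict

# Hypothetical stand-in for the items returned by obj.metadata.items()
AVU = namedtuple("AVU", ["name", "value", "units"])

sample_items = [
    AVU("AUTHOR", "Lewis Carroll", None),
    AVU("TITLE", "An Example Title", None),
    AVU("LENGTH", "120", "pages"),
    AVU("KEYWORD", "fiction", None),
    AVU("KEYWORD", "classic", None),
]

def group_by_key(items):
    """Collect all values per key; the same key may occur several times."""
    grouped = defaultdict(list)
    for item in items:
        grouped[item.name].append(item.value)
    return dict(grouped)

metadata = group_by_key(sample_items)
print(metadata["AUTHOR"])
print(metadata["KEYWORD"])
```

Values are collected in lists because one object can carry the same key more than once, so a plain key-to-value dictionary would lose entries.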

We are going to use this metadata in the following pipeline. We will search for books written by a specific author and analyse their contents.

## Wordcount pipeline

1. Search for data in Yoda
2. Get the data
    1. Make a copy of that data on our local storage
    2. Stream the content of the data directly into a python variable
3. Analyse the data
4. Upload the result to Yoda
    1. Store the result in a file and upload the file to Yoda
    2. Stream the result directly to Yoda
5. Annotate the result in Yoda with meaningful metadata, employing the [Prov-O Ontology](https://www.w3.org/TR/2013/REC-prov-o-20130430/#prov-o-at-a-glance)

## Search for your input data
User-defined metadata is stored as key-value-units triples. Someone has already curated some data for us and stored it in Yoda. We will look for books that carry the key "AUTHOR" with the value "Lewis Carroll".

In [None]:
ATTR_NAME = "AUTHOR" #case sensitive!!!
ATTR_VALUE = "Lewis Carroll"

print('Searching for files')
query = session.query(Collection.name, DataObject.name)
# Filtering for AUTHOR == Lewis Carroll
filteredQuery = query.filter(DataObjectMeta.name == ATTR_NAME).\
                          filter(DataObjectMeta.value == ATTR_VALUE)
print(filteredQuery.all())
irods_paths = parse_query(filteredQuery)
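
The helper `parse_query` comes from `helperFunctions` and is not shown here. A plausible sketch of what it has to do, namely join each collection name and data object name into a full logical path, is given below. The function `rows_to_paths` and the sample rows are assumptions; in python-irodsclient the result rows are indexed by columns such as `row[Collection.name]`, and plain tuples are used here only to keep the sketch self-contained.

```python
import posixpath

def rows_to_paths(rows):
    """Join (collection name, data object name) pairs into logical paths."""
    return [posixpath.join(coll_name, obj_name) for coll_name, obj_name in rows]

# Made-up rows standing in for the result of filteredQuery
sample_rows = [
    ("/zonename/home/research-example/books", "book_one.txt"),
    ("/zonename/home/research-example/books", "book_two.txt"),
]
example_paths = rows_to_paths(sample_rows)
print("\n".join(example_paths))
```

`posixpath` is used instead of `os.path` because iRODS logical paths always use forward slashes, whatever the local operating system.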

## Prepare data for analysis
To look inside the data we have two options in iRODS:
1. We download the data to our fast storage system and have the data available and ready for being read from there.
2. In some cases single files can become too large to be downloaded quickly or even too large to fit into the memory of the machine you are working on. In that case we can stream files into memory, i.e. reading a file bit by bit or just the interesting parts.

### Option A: Create a local working copy of the files stored in the data objects

In [None]:
print('Downloading: ')
data_dir = os.path.expanduser("~/wordcount_data")
ensure_dir(data_dir)
print('\n'.join(irods_paths))
get_data(session, irods_paths, data_dir)
# Read the data from the files into a Python variable
text = files_to_text(data_dir)

### Option B: Read the content of the file in the data object into memory
In our example the data is relatively small and we have enough memory available. Moreover, it is textual data which we need to parse word by word as string anyway. Hence, we can directly load the content of the files into memory:

In [None]:
text = ''
for path in irods_paths:
    obj = session.data_objects.get(path)
    with obj.open('r') as objRead:
        text = text + objRead.read().decode()
print(text[:100])
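
The cell above reads each object in one call. For objects that do not fit into memory, the same idea can be applied chunk by chunk. The sketch below uses `io.BytesIO` as a stand-in for an opened data object (an assumption, so that it runs without a Yoda connection) and an incremental decoder, so that multi-byte characters split across chunk borders are still decoded correctly. The function name `iter_text_chunks` and the chunk size are illustrative.

```python
import codecs
import io

def iter_text_chunks(stream, chunk_size=8):
    """Yield decoded text from a binary stream, reading chunk_size bytes at a time."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        piece = decoder.decode(chunk)
        if piece:
            yield piece
    # Flush the decoder; this raises an error if the data ended mid-character
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail

# io.BytesIO stands in for the file-like object returned by obj.open('r')
fake_object = io.BytesIO("Curiouser and curiouser, café!".encode("utf-8"))
streamed = "".join(iter_text_chunks(fake_object))
print(streamed)
```

In a real pipeline each piece would be processed and discarded instead of being joined, so that only one chunk is held in memory at a time.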

## Start your computational pipeline

### The pipeline

In [None]:
def wordcount(text):
    # Convert to list of words, without punctuation
    words = [''.join(char for char in word
             if char not in string.punctuation) for word in text.split()]
    print("Number of words:", len(words))
    unique_words_count = Counter(words)
    return unique_words_count

result = wordcount(text)
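
One detail of `wordcount` is worth knowing: tokens that consist only of punctuation, such as a stand-alone `--`, end up as empty strings and are counted as a word. The variant below is a sketch, not a replacement for the pipeline above; it drops such empty tokens and optionally folds case. The name `wordcount_clean` and the `lowercase` parameter are assumptions for illustration.

```python
import string
from collections import Counter

def wordcount_clean(text, lowercase=False):
    """Count words after stripping punctuation and dropping empty tokens."""
    table = str.maketrans("", "", string.punctuation)
    words = [word.translate(table) for word in text.split()]
    words = [word for word in words if word]  # remove tokens that were only punctuation
    if lowercase:
        words = [word.lower() for word in words]
    return Counter(words)

sample = "Alice! said the Queen -- alice, Alice."
counts = wordcount_clean(sample)
print(counts["Alice"])
print(wordcount_clean(sample, lowercase=True)["alice"])
```

Without case folding, "Alice" and "alice" are counted as different words, which is also how the pipeline above behaves.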

We receive a dictionary that maps each word to the number of its occurrences across all three books:

In [None]:
print("Alice:", result["Alice"])
print("Rabbit:", result["Rabbit"])
print("Queen:", result["Queen"])

## Option A: Uploading your data to safe storage through iRODS and annotating the results

**Write results to a file.** Of course we can write the results to a file like this and then upload it to Yoda again.

In [None]:
res_dir = os.path.expanduser("~/wordcount_results")
ensure_dir(res_dir)
res_file = res_dir + "/wordcount_res.txt"
with open(res_file, 'w') as file:
    file.write(json.dumps(result))
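
Because `Counter` is a subclass of `dict`, `json.dumps` serializes it as a plain JSON object. When the result is read back later, it arrives as an ordinary dictionary and can be wrapped in a `Counter` again. The sketch below shows this round trip with made-up counts.

```python
import json
from collections import Counter

example_result = Counter({"Alice": 3, "Rabbit": 1})

serialized = json.dumps(example_result)      # a JSON object, as written to the results file
restored = Counter(json.loads(serialized))   # plain dict turned back into a Counter

print(restored.most_common(1))
print(restored["Queen"])  # words that never occurred count as 0 in a Counter
```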

**Note:** So far our results are only stored locally, without any proper backup. We should upload them to Yoda soon.

In [None]:
coll = session.collections.get('/' + session.zone + '/home/' + 'research-test-christine')
objs_names = [obj.name for obj in coll.data_objects]
f = os.path.basename(res_file)
# little trick to prevent overwriting of data, 
# if the object name already exists in iRODS we extend it with a number
count = 0
while f in objs_names:
    f = os.path.basename(res_file) + '_' + str(count)
    count = count + 1

In [None]:
print('Upload results to: ', coll.path + '/' + f)
res_obj = put_file(session, res_file, coll.path + '/' + f)

## Option B: Streaming the results to Yoda
To avoid creating yet another file on our system, we can also directly stream the data to iRODS.

In [None]:
coll = session.collections.get('/' + session.zone + '/home/research-test-christine')
obj_names = [obj.name for obj in coll.data_objects]
new_obj_name = "wordcount_result.txt"

# Ensuring that we do not overwrite a previous results file:
# if the name is taken, append a counter to the original name
base_name = new_obj_name
count = 0
while new_obj_name in obj_names:
    new_obj_name = base_name + '_' + str(count)
    count = count + 1
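
The renaming logic can be wrapped in a small function and tried out without a connection to Yoda. This is a sketch under assumptions: the name `unique_name` is hypothetical, and the numeric suffix is placed before the file extension, which differs from the cell above, where it is appended after `.txt`.

```python
import os

def unique_name(base_name, existing_names):
    """Return base_name, or base_name with a numeric suffix if it is already taken."""
    stem, ext = os.path.splitext(base_name)
    candidate = base_name
    count = 0
    while candidate in existing_names:
        candidate = "{}_{}{}".format(stem, count, ext)
        count += 1
    return candidate

taken = ["wordcount_result.txt", "wordcount_result_0.txt"]
print(unique_name("wordcount_result.txt", taken))
print(unique_name("other.txt", taken))
```

Keeping the extension at the end means the result still opens as a text file in most tools.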

Now that we have a valid object name for our new object we can create it and stream the content into the object:

In [None]:
obj = session.data_objects.create(coll.path + "/" + new_obj_name)
with obj.open('w') as obj_write:
    obj_write.write(json.dumps(result).encode())

### Adding metadata to the results in Yoda
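Now we can annotate the data object in Yoda so that we know later how we created the data. The cells below annotate `obj`, the object created in Option B; if you uploaded the file with Option A, use the data object you uploaded there instead.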

In [None]:
print('Adding metadata to', obj.path)
for path in irods_paths:
    obj.metadata.add('prov:wasDerivedFrom', path)

obj.metadata.add('ISEARCH', ATTR_NAME + '==' + ATTR_VALUE)
obj.metadata.add('ISEARCHDATE', str(datetime.date.today()))
obj.metadata.add('prov:SoftwareAgent', 'wordcount.py')
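
To review the annotations before writing them to Yoda, the name/value pairs can first be collected in a plain list. The sketch below does that; `build_provenance`, its parameters and the sample path are assumptions for illustration, while the keys are the ones used in the cell above.

```python
import datetime

def build_provenance(source_paths, attr_name, attr_value, agent, today=None):
    """Return (key, value) pairs describing how a result was derived."""
    today = today or datetime.date.today()
    pairs = [("prov:wasDerivedFrom", path) for path in source_paths]
    pairs.append(("ISEARCH", attr_name + "==" + attr_value))
    pairs.append(("ISEARCHDATE", str(today)))
    pairs.append(("prov:SoftwareAgent", agent))
    return pairs

example_pairs = build_provenance(
    ["/zonename/home/research-example/books/book_one.txt"],
    "AUTHOR", "Lewis Carroll", "wordcount.py",
    today=datetime.date(2020, 1, 31),  # fixed date so the demo is reproducible
)
for key, value in example_pairs:
    print(key, "\t", value)
```

Each pair could then be passed to `obj.metadata.add(key, value)` as in the cell above.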

## Last check: How is the file annotated in iRODS?

In [None]:
print('Metadata for: ', obj.path)
print('\n'.join([item.name +' \t'+ item.value for item in obj.metadata.items()]))

## Remove temporary data from scratch space
If you chose Option A and created temporary copies of the data, **do not forget to clean up and free the space** for new data and computations.

In [None]:
print("Removing local data in", data_dir)
rmtree(data_dir)
print("Removing local data in", res_dir)
rmtree(res_dir)
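
An alternative that makes the cleanup hard to forget is a temporary directory that is removed automatically. This is a sketch using only the standard library; the file name and its content are made up, and the notebook above does not work this way.

```python
import os
import tempfile

with tempfile.TemporaryDirectory(prefix="wordcount_") as scratch_dir:
    # Work with local copies inside the block
    local_copy = os.path.join(scratch_dir, "book_one.txt")
    with open(local_copy, "w") as handle:
        handle.write("Down the rabbit hole")
    existed_inside = os.path.exists(local_copy)

# The directory and its contents are gone once the block is left
exists_after = os.path.exists(scratch_dir)
print(existed_inside, exists_after)
```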