## To Do

- Ask Marius about jupyter hub rwth
- figure out best application profile / create one?
- use fcs files to extract metadata
- email Amin and Petar with dates for demos
- fix how we're adding metadata to the folder

# With metadata to better data! (Meta)Data transfer from and to Coscine

The [Coscine research data platform](https://www.coscine.de) provides an API for transferring metadata-annotated data to Coscine in automated processes. In this workshop, we show step by step how to move data to Coscine using a Jupyter notebook (Python) and your personal Coscine access token, and how to specify the metadata using the application profile assigned to the resource. Prior knowledge of Python is helpful.

First things first, you need a Coscine project. You can use your own if you have access, or we will add you to this one: https://coscine.rwth-aachen.de/p/fdmwerkstatt/ 

Next, head to your [user profile](https://coscine.rwth-aachen.de/user/) and get your Access Token. Copy this into your config file under `token`.


Create a resource... 

Go to the resource settings and copy all relevant information into your config file. This includes the resource name and ID and the project name.
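If you do not have a config file yet, the following minimal sketch creates a template for one. The keys `token`, `projectName`, and `resourceName` are the ones this notebook reads later on; the helper name `write_config_template` and the placeholder values are only illustrative.

```python
import json
from pathlib import Path

def write_config_template(path="config.json"):
    """Write a config file with placeholder values (hypothetical helper)."""
    template = {
        "token": "PASTE_YOUR_ACCESS_TOKEN_HERE",
        "projectName": "YOUR_PROJECT_NAME",
        "resourceName": "YOUR_RESOURCE_NAME",
    }
    target = Path(path)
    # Refuse to overwrite an existing config so a real token is never lost.
    if target.exists():
        raise FileExistsError(f"{target} already exists")
    target.write_text(json.dumps(template, indent=4), encoding="utf-8")
    return target
```

After running it once, replace the placeholders with your own values. Keep the file out of version control, since it contains your token.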

Now let's load all dependencies and configurations into our Jupyter notebook.

In [1]:
import coscine
import json
from datetime import datetime
from pathlib import Path
import re 



If you have any errors loading the packages, run the code below with the associated package name:

In [None]:
pip install PACKAGE_NAME

Load the configuration:

In [2]:
# Read the project name, resource name, and token from config.json
with open("config.json") as f:
    cfg = json.load(f)

RESOURCE: str = cfg['resourceName']
PROJECT: str = cfg['projectName']
TOKEN: str = cfg['token']

We use the Coscine package to connect to the Coscine REST API, which enables us to interact with our project and resource.

For more information and other examples: [Coscine Python SDK](https://git.rwth-aachen.de/coscine/community-features/coscine-python-sdk)

In [3]:
client = coscine.Client(TOKEN)

project = client.project(PROJECT)
resource = project.resource(RESOURCE)


In [4]:
print(project)

+------------------------------------------------------------------------------+
|                            Project FDM_Werkstatt                             |
+-------------------------+----------------------------------------------------+
|         Property        |                       Value                        |
+-------------------------+----------------------------------------------------+
|            ID           |        ae6aa507-27fb-4e1f-84ab-16f1f4c6d320        |
|           Name          |                   FDM_Werkstatt                    |
|       Display Name      |                   FDM_Werkstatt                    |
|       Description       | With metadata to better data! (Meta)Data transfer  |
|                         |                from and to Coscine                 |
|                         |                                                    |
|                         | The Coscine research data platform provides an API |
|                         | 

In [5]:
print(resource)

+---------------------------------------------------------------------+
|                          Resource Prep_Test                         |
+---------------------+-----------------------------------------------+
|       Property      |                     Value                     |
+---------------------+-----------------------------------------------+
|          ID         |      dc774194-5bb8-4cf0-b84e-93027adec0ea     |
|    Resource Name    |                   Prep_Test                   |
|     Display Name    |                   Prep_Test                   |
|     Description     |               prep fdm_werkstatt              |
|         PID         | 21.11102/dc774194-5bb8-4cf0-b84e-93027adec0ea |
|         Type        |                   rdss3rwth                   |
|     Disciplines     |              Computer Science 409             |
|       License       |                                               |
| Application Profile |       https://purl.org/coscine/ap/base/ 

Get the metadata form and take a look at it:

In [6]:
metadata = resource.metadata_form()
print(metadata)

+---+----------+----------------+-------+
| C | Type     | Property       | Value |
+---+----------+----------------+-------+
|   | [str]    | Title*         |       |
|   | [str]    | Creator*       |       |
|   | datetime | Creation Date* |       |
| V | [str]    | Subject Area   |       |
| V | [str]    | Type           |       |
+---+----------+----------------+-------+


This form is a dictionary-like data structure, so you can interact with it like a python dictionary:

In [7]:
metadata['Title'] = 'My fun title'

In [8]:
metadata['Type'] = 'this thing'

ValueError: Invalid value 'this thing' for vocabulary controlled key 'Type'! Perhaps you meant ?

The error is because the field is a controlled vocabulary. Let's see what is allowed:

In [9]:
metadata.vocabulary('Type').keys()

['Moving Image',
 'Sound',
 'Collection',
 'Dataset',
 'Event',
 'Image',
 'Interactive Resource',
 'Service',
 'Software',
 'Text',
 'Physical Object',
 'Still Image']

In [10]:
allowed_vals = metadata.vocabulary('Type').keys()


In [11]:
metadata['Type'] = allowed_vals[9]

In [12]:
print(metadata['Type'])

Text
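To catch typos before assigning to a controlled field, the allowed values can be checked first. The sketch below works on a plain Python list copied from the output above and does not call the Coscine package; the helper name `check_vocabulary_value` is made up for this example, and the suggestions come from the standard library's `difflib`.

```python
from difflib import get_close_matches

def check_vocabulary_value(value, allowed_values):
    """Return value if it is allowed, else raise ValueError with close matches."""
    allowed = list(allowed_values)
    if value in allowed:
        return value
    suggestions = get_close_matches(value, allowed, n=3, cutoff=0.5)
    hint = f" Did you mean one of {suggestions}?" if suggestions else ""
    raise ValueError(f"{value!r} is not an allowed value.{hint}")

# Plain copy of a few of the 'Type' values printed above.
type_values = ['Dataset', 'Image', 'Software', 'Text', 'Still Image']
print(check_vocabulary_value('Text', type_values))
```

Passing a misspelled value such as `'Sofware'` raises a `ValueError` whose message suggests `'Software'`.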


In [13]:
allowed_vals = metadata.vocabulary('Subject Area').keys()


In [14]:
controlled_vocab = {}
for i, val in enumerate(allowed_vals):
    controlled_vocab[i] = val

print(controlled_vocab)

{0: 'Humanities and Social Sciences', 1: 'Ancient Cultures', 2: 'Prehistory', 3: 'Classical Philology', 4: 'Ancient History', 5: 'Classical Archaeology', 6: 'Egyptology and Ancient Near Eastern Studies', 7: 'History', 8: 'Medieval History', 9: 'Early Modern History', 10: 'Modern and Current History', 11: 'History of Science', 12: 'Fine Arts, Music, Theatre and Media Studies', 13: 'Art History', 14: 'Musicology', 15: 'Theatre and Media Studies', 16: 'Linguistics', 17: 'General and Comparative Linguistics, Typology, Non-European Languages', 18: 'Individual Linguistics', 19: 'Historical Linguistics', 20: 'Applied Linguistics, Experimental Linguistics, Computational Linguistics', 21: 'Literary Studies', 22: 'Medieval German Literature', 23: 'Modern German Literature', 24: 'European and American Literature', 25: 'General and Comparative Literature and Cultural Studies', 26: 'Social and Cultural Anthropology, Non-European Cultures, Jewish Studies and Religious Studies', 27: 'Social and Cultu

In [15]:
selection = input(f'enter value by selecting the respective number {controlled_vocab}')
metadata['Subject Area'] = controlled_vocab[int(selection)]
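The dictionary printed above is hard to read, and `int(selection)` fails on anything that is not a number. A possible alternative is sketched here with two illustrative helpers, `format_menu` and `parse_selection` (both names are made up for this example), working on a plain list of values.

```python
def format_menu(options):
    """Return one 'index: value' line per option."""
    return "\n".join(f"{i}: {value}" for i, value in enumerate(options))

def parse_selection(raw, options):
    """Turn the typed text into the chosen option, with a clear error if invalid."""
    try:
        index = int(raw.strip())
    except ValueError:
        raise ValueError(f"{raw!r} is not a number") from None
    if not 0 <= index < len(options):
        raise ValueError(f"choose a number between 0 and {len(options) - 1}")
    return options[index]

# First three 'Subject Area' values from the output above.
subject_areas = ['Humanities and Social Sciences', 'Ancient Cultures', 'Prehistory']
print(format_menu(subject_areas))
```

In the notebook, the text returned by `input()` would be passed to `parse_selection` together with the full list of allowed values.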

In [16]:
print(metadata)

+---+----------+----------------+--------------+
| C | Type     | Property       | Value        |
+---+----------+----------------+--------------+
|   | [str]    | Title*         | My fun title |
|   | [str]    | Creator*       |              |
|   | datetime | Creation Date* |              |
| V | [str]    | Subject Area   | Prehistory   |
| V | [str]    | Type           | Text         |
+---+----------+----------------+--------------+


Let's deal with the date. It needs to be formatted as a datetime object:

In [17]:
date = '2023-06-09'
type(date)

str

In [18]:
metadata['Creation Date'] = date

TypeError: Value of type <class 'str'> specified for key Creation Date does not match expected type <class 'datetime.datetime'>!

In [19]:
metadata['Creation Date'] = datetime.strptime(date, '%Y-%m-%d')

In [20]:
print(metadata)

+---+----------+----------------+--------------+
| C | Type     | Property       | Value        |
+---+----------+----------------+--------------+
|   | [str]    | Title*         | My fun title |
|   | [str]    | Creator*       |              |
|   | datetime | Creation Date* | 2023-06-09   |
| V | [str]    | Subject Area   | Prehistory   |
| V | [str]    | Type           | Text         |
+---+----------+----------------+--------------+
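Dates rarely arrive in a single format. As a small sketch, the helper below tries several `strptime` formats in turn; the name `parse_date` and the default format list are assumptions for this example, so adjust the formats to your own data.

```python
from datetime import datetime

def parse_date(text, formats=("%Y-%m-%d", "%d.%m.%Y")):
    """Try each format in turn and return the first successful datetime."""
    for fmt in formats:
        try:
            return datetime.strptime(text.strip(), fmt)
        except ValueError:
            continue
    raise ValueError(f"{text!r} matches none of the formats {formats}")

print(parse_date('2023-06-09'))
```

The result is a `datetime` object and can be assigned to `metadata['Creation Date']` directly.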


We're still missing `Creator`. Depending on the properties set for each field in the application profile, we can enter multiple values. For example, the `Creator` field takes a value of type `[str]`, indicating that a list of strings may be entered. Let's add multiple creators:

In [21]:
metadata['Creator'] = ['Cat', 'Nikki']

In [22]:
print(metadata)

+---+----------+----------------+--------------+
| C | Type     | Property       | Value        |
+---+----------+----------------+--------------+
|   | [str]    | Title*         | My fun title |
|   | [str]    | Creator*       | Cat,Nikki    |
|   | datetime | Creation Date* | 2023-06-09   |
| V | [str]    | Subject Area   | Prehistory   |
| V | [str]    | Type           | Text         |
+---+----------+----------------+--------------+


Now we upload the metadata and the data. We'll just use a dummy text file here called `myData.txt`.

In [24]:
file_loc = "data/myData.txt"  # local file path
file_name = "myData.txt"  # file name in Coscine
resource.upload(file_name, file_loc, metadata)

myData.txt: 100%|##########| 231/231 [00:00<00:00, 272B/s]  


If you were working with a nested directory structure or larger data, you'd want to use the S3 credentials to interact with the resource via the S3 protocol.

We can get these using the API:

In [27]:
access_key: str = resource.s3.write_access_key
secret_key: str = resource.s3.write_secret_key
endpoint: str = resource.s3.endpoint
bucket: str = resource.s3.bucket

This lets us make directories:

In [41]:
resource.s3.mkdir("myDir/")

And upload files to a directory:

In [43]:
resource.s3.upload("myDir/s3_test.txt", "s3_test.txt")

myDir/s3_test.txt:   0%|          | 0.00/19.0 [00:00<?, ?B/s]

We cannot add metadata via S3, so we use the API to update the metadata:

In [28]:
obj = resource.object('myDir/s3_test.txt')

In [29]:
print(obj)

+------------------------------+
|      Object s3_test.txt      |
+----------+-------------------+
| Property |       Value       |
+----------+-------------------+
|   Name   |    s3_test.txt    |
|   Size   |       19.0 B      |
|   Type   |        file       |
|   Path   | myDir/s3_test.txt |
|  Folder  |       False       |
+----------+-------------------+


In [30]:
obj.update(metadata)

Let's add metadata to the folder as well:

In [40]:
folder = resource.object('myDir/')
print(folder)

+-------------------+
|    Object myDir   |
+----------+--------+
| Property | Value  |
+----------+--------+
|   Name   | myDir  |
|   Size   | 0.0 B  |
|   Type   | folder |
|   Path   | myDir/ |
|  Folder  |  True  |
+----------+--------+


In [32]:
folder.update(metadata)

In [37]:
if folder.has_metadata:
    print(folder.form())

+---+----------+----------------+--------------+
| C | Type     | Property       | Value        |
+---+----------+----------------+--------------+
|   | [str]    | Title*         | My fun title |
|   | [str]    | Creator*       | Cat,Nikki    |
|   | datetime | Creation Date* | 2023-06-09   |
| V | [str]    | Subject Area   | Prehistory   |
| V | [str]    | Type           | Text         |
+---+----------+----------------+--------------+


## Let's get fancier and extract metadata from a file

Let's try getting some metadata out of a TIFF image file. For this we use PIL (the Pillow package):

In [None]:
from PIL import Image
from PIL.TiffTags import TAGS

with Image.open('image.tif') as img:
    # tag_v2 maps numeric TIFF tag IDs to values; unknown IDs are kept as numbers
    meta_dict = {TAGS.get(key, key): img.tag_v2[key] for key in img.tag_v2}

We'll work with flow cytometry - Fluorescence Activated Cell Sorting (FACS) data. 

For this we need the `FlowKit` module:

In [45]:
pip install FlowKit

Defaulting to user installation because normal site-packages is not writeable
Collecting FlowKit
  Using cached FlowKit-1.0.1.tar.gz (88 kB)
  Installing build dependencies: started
  Installing build dependencies: finished with status 'done'
  Getting requirements to build wheel: started
  Getting requirements to build wheel: finished with status 'done'
  Preparing metadata (pyproject.toml): started
  Preparing metadata (pyproject.toml): finished with status 'done'
Collecting anytree>=2.6 (from FlowKit)
  Using cached anytree-2.8.0-py2.py3-none-any.whl (41 kB)
Collecting bokeh<3,>=2.4 (from FlowKit)
  Using cached bokeh-2.4.3-py3-none-any.whl (18.5 MB)
Collecting flowio==1.1.1 (from FlowKit)
  Using cached FlowIO-1.1.1.tar.gz (14 kB)
  Installing build dependencies: started
  Installing build dependencies: finished with status 'done'
  Getting requirements to build wheel: started
  Getting requirements to build wheel: finished with status 'done'
  Preparing metadata (pyproject.toml): 

  error: subprocess-exited-with-error
  
  × Getting requirements to build wheel did not run successfully.
  │ exit code: 1
  ╰─> [60 lines of output]
      <string>:16: _DeprecatedInstaller: setuptools.installer and fetch_build_eggs are deprecated.
      !!
      
              ********************************************************************************
              Requirements should be satisfied by a PEP 517 installer.
              If you are using pip, you can try `pip install --use-pep517`.
              ********************************************************************************
      
      !!
      C:\Program Files\Python311\python.exe: No module named pip
      Traceback (most recent call last):
        File "C:\Users\pj430626\AppData\Local\Temp\pip-build-env-cm8r62ec\overlay\Lib\site-packages\setuptools\installer.py", line 96, in _fetch_build_egg_no_warn
          subprocess.check_call(cmd)
        File "C:\Program Files\Python311\Lib\subprocess.py", line 413, in c

In [43]:
import flowkit as fk # installation has errors... 
from coscine.client import User


ModuleNotFoundError: No module named 'flowkit'

We'll use another resource here with a fitting application profile. Since everything else stays the same, let's just change the resource to 'FACS_data':

In [5]:
resource = project.resource('FACS_data')
print(resource)

NameError: name 'project' is not defined

We can use regex patterns to validate some of this data. This can also be done when creating the profile and setting field properties, but it wasn't here. However, you can play around with this on the [application profile generator](https://coscine.rwth-aachen.de/coscine/apps/aimsfrontend/).

In [79]:
metadata = resource.metadata_form()
assert metadata is not None
print("loaded metadata form")
print(metadata)

loaded metadata form
+---+----------+--------------------+-------+
| C | Type     | Property           | Value |
+---+----------+--------------------+-------+
|   | [str]    | Hypothesis         |       |
| V | str      | Instrument*        |       |
|   | [str]    | Organism           |       |
|   | [str]    | Cell source        |       |
|   | [str]    | Markers analysed*  |       |
|   | int      | Number of Samples* |       |
|   | [str]    | Reference          |       |
|   | str      | Creator*           |       |
|   | str      | Affiliation        |       |
|   | str      | ORCID              |       |
|   | datetime | Date of Analysis*  |       |
|   | str      | Rights             |       |
+---+----------+--------------------+-------+
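As an example of regex validation on our side, the `ORCID` field of this form can be checked before it is filled in. The sketch below only tests the usual written shape of an ORCID iD (four hyphen-separated groups of four characters, where the last character may be `X`). It does not verify the check character or whether the iD is registered, and the helper name `looks_like_orcid` is made up for this example.

```python
import re

# Four groups of four digits; the last character may be the check character X.
ORCID_PATTERN = re.compile(r"\d{4}-\d{4}-\d{4}-\d{3}[\dX]")

def looks_like_orcid(text):
    """Check only the shape of the identifier, not whether it is registered."""
    return ORCID_PATTERN.fullmatch(text.strip()) is not None

print(looks_like_orcid('0000-0002-1825-0097'))
```

Only values that pass such a check would then be assigned to `metadata['ORCID']`.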


Get User info (based on token) to fill out creator:

In [83]:
# create an instance of the User class
user = User(client)

# print the user's name
print(user.name)

Nicole Parks


Specify a path for the research data. Here, we use the path our notebook is located in:

In [57]:
files = Path('fcs_files').glob('*')
for file in files:
    print(file)

fcs_files\d5_1_Col_1to16_012.fcs
fcs_files\d5_1_Col_1to32_013.fcs
fcs_files\d5_1_Col_1to4_010.fcs
fcs_files\d5_1_Col_1to8_011.fcs


In [56]:
file = Path('fcs_files') / 'd5_1_Col_1to16_012.fcs'

[]

In [63]:
with open(file) as f:
    lines = f.readlines()

UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 3731: character maps to <undefined>

In [62]:
for line in lines:
    print(line)

TypeError: 'builtin_function_or_method' object is not iterable
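The `UnicodeDecodeError` above is what you get when a binary file is opened in text mode: FCS files are binary, so they need `open(path, 'rb')` or a dedicated reader. As a rough sketch, and assuming the fixed-width ASCII header layout of FCS 3.x (a 6-character version string, four spaces, then six 8-character byte offsets), the header can be decoded by hand. The function name `parse_fcs_header` and the example bytes are made up for illustration; only the data offsets are borrowed from the file metadata printed later in this notebook.

```python
def parse_fcs_header(header: bytes) -> dict:
    """Parse the first 58 bytes of an FCS file (layout assumed from FCS 3.x)."""
    if len(header) < 58:
        raise ValueError("an FCS header needs at least 58 bytes")
    names = ("text_start", "text_end", "data_start", "data_end",
             "analysis_start", "analysis_end")
    result = {"version": header[0:6].decode("ascii")}
    for i, name in enumerate(names):
        field = header[10 + 8 * i : 18 + 8 * i].decode("ascii").strip()
        # Blank fields are treated as 0, meaning the segment is not present.
        result[name] = int(field) if field else 0
    return result

# Hand-made header for illustration only; the text offsets are invented.
example = b"FCS3.0    " + b"".join(
    str(n).rjust(8).encode("ascii") for n in (58, 3601, 3602, 9321781, 0, 0)
)
print(parse_fcs_header(example))
```

With a real file, the same function would be given `f.read(58)` from a file opened with `open(file, 'rb')`. For the actual metadata we rely on FlowKit below rather than parsing the file ourselves.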

Write a function that finds all '.fcs' or '.LMD' files in a given folder and returns the metadata of each file, keyed by file path:

In [1]:
def get_all_files_from_directory(folder_path):
    # Map each file path (as a POSIX string) to the metadata FlowKit reads from it.
    return_value = {}
    for pattern in ('*.fcs', '*.LMD'):
        for file_name in Path(folder_path).glob(pattern):
            sample = fk.Sample(file_name)
            return_value[file_name.as_posix()] = sample.get_metadata()

    return return_value

Write a function that extracts the metadata from a single file:

In [100]:
def extract_metadata_from_file(file_path: str) -> dict:
    sample = fk.Sample(file_path)
    return sample.get_metadata()

In [101]:
folder_path = '.'  # assumption: the data files sit in the notebook's folder
metadata_files = get_all_files_from_directory(folder_path)

Define `fcs_metadata` outside of the function so that later cells can use it. After the loop, it holds the metadata of the last file found:

In [102]:
for pattern in ('*.fcs', '*.LMD'):
    for file_name in Path(folder_path).glob(pattern):
        sample = fk.Sample(file_name)
        fcs_metadata = sample.get_metadata()

Count the files that carry an experiment name in order to obtain the number of samples.

Write a function that counts the files per experiment name. Different instruments store this name under different metadata keys, so we check several of them.
The total number of files will be listed on the metadata form as the 'Number of Samples'.

In [103]:
def count_exp_name_files(metadata_files):
    # Different instruments store the experiment name under different keys.
    name_keys = ('experiment name', 'groupname', 'experiment_name', '@testname')
    exp_name_count = {}
    for file_name, metadata in metadata_files.items():
        # Use the first key that has a value in this file's metadata.
        exp_name = next(
            (metadata[key] for key in name_keys if metadata.get(key) is not None),
            None,
        )
        if exp_name is not None:
            exp_name_count[exp_name] = exp_name_count.get(exp_name, 0) + 1
    # Total over all experiment names found.
    return sum(exp_name_count.values())

In [104]:
count_exp_name_files(metadata_files)

1
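Note that `count_exp_name_files` returns one total across all experiment names. If a folder mixes files from several experiments, a per-experiment count may be more useful. The following sketch uses `collections.Counter` for that; the function name `count_files_per_experiment` and the example dictionary are made up for illustration.

```python
from collections import Counter

def count_files_per_experiment(
    metadata_files,
    name_keys=('experiment name', 'groupname', 'experiment_name', '@testname'),
):
    """Count files per experiment name, using the first key found in each file."""
    counts = Counter()
    for metadata in metadata_files.values():
        for key in name_keys:
            if metadata.get(key) is not None:
                counts[metadata[key]] += 1
                break
    return counts

# Made-up metadata for three files, for illustration only.
example_files = {
    'a.fcs': {'experiment name': 'exp_1'},
    'b.fcs': {'groupname': 'exp_1'},
    'c.fcs': {'experiment_name': 'exp_2'},
}
print(count_files_per_experiment(example_files))
```

The count for one experiment is then available as `counts['exp_1']`, and `sum(counts.values())` gives the same total as the function above.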

Map the instrument name stored in the file metadata to the name used in Coscine:

In [111]:
def set_instrument_name(name):
    if name == 'Aurora':
        return 'Cytek Aurora'
    if name == "FACSAriaII":
        return "BD Aria"
    if name == 'LSRFortessa':
        return 'BD LSRFortessa'        
    if name == 'Gallios':
        return 'andere'
    raise NotImplementedError

In [112]:
fcs_metadata['cyt']

'LSRFortessa'

In [113]:
fcs_metadata["cyt"] = set_instrument_name(fcs_metadata["cyt"])
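The chain of `if` statements in `set_instrument_name` can also be written as a dictionary lookup, which keeps the mapping in one place and lets the error message name the unknown value. This is only a sketch of that alternative. The mapping repeats the four pairs from the function above, and the names `INSTRUMENT_NAMES` and `lookup_instrument_name` are made up for this example.

```python
# Names as stored in the file metadata, mapped to the values used in this notebook.
INSTRUMENT_NAMES = {
    'Aurora': 'Cytek Aurora',
    'FACSAriaII': 'BD Aria',
    'LSRFortessa': 'BD LSRFortessa',
    'Gallios': 'andere',
}

def lookup_instrument_name(name):
    """Translate a cytometer name, with an error that names the unknown value."""
    try:
        return INSTRUMENT_NAMES[name.strip()]
    except KeyError:
        raise ValueError(f"no mapping for instrument {name!r}") from None

print(lookup_instrument_name('LSRFortessa'))
```

Either version fails if the cell that overwrites `fcs_metadata["cyt"]` is run twice, because the already translated name is not a key of the mapping.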

Format the date to match the datetime field required by the Coscine resource:

In [114]:
fcs_metadata['date'] = datetime.strptime(fcs_metadata['date'],'%d-%b-%Y')

Define a function to obtain the markers analysed:

In [115]:
def cd4_values(metadata_files):
    pattern = re.compile(r'^p\d+s$')    
    metadata_values = []
    for file_path in metadata_files:
        with open(file_path, 'rb') as file:
            sample = fk.Sample(file)
            metadata = sample.get_metadata()
            metadata_values += [metadata[key] for key in metadata if pattern.match(key)]
    return metadata_values

metadata_files = get_all_files_from_directory(folder_path)
markers = cd4_values(metadata_files)

num_markers = len(markers)

In [116]:
print(fcs_metadata)

{'beginanalysis': '0', 'endanalysis': '0', 'beginstext': '0', 'endstext': '0', 'begindata': '3602', 'enddata': '9321781            ', 'fil': 'd5_1_Col_1to4_010.fcs', 'sys': 'Windows 7 6.1', 'tot': '155303             ', 'par': '15', 'mode': 'L', 'byteord': '4,3,2,1', 'datatype': 'F', 'nextdata': '0', 'creator': 'BD FACSDiva Software Version 8.0.1', 'tube name': '1_Col_1to4', 'src': 'd5', 'experiment name': '20210209_invitro_DCs-Tcells_OVApulse_d5', 'guid': '029b30bb-2354-4420-bb6e-6507aa9d9210', 'date': datetime.datetime(2021, 2, 9, 0, 0), 'btim': '10:23:15', 'etim': '10:26:45', 'cyt': 'BD LSRFortessa', 'settings': 'Cytometer', 'cytnum': 'H647794E6045', 'window extension': '10.00', 'export user name': 'UserIMM', 'export time': '09-FEB-2021-11:55:06', 'op': 'UserIMM', 'fsc asf': '0.68', 'autobs': 'TRUE', 'inst': ' ', 'laser1name': 'Blue', 'laser1delay': '0.00', 'laser1asf': '0.80', 'laser2name': 'Red', 'laser2delay': '71.15', 'laser2asf': '0.75', 'laser3name': 'Violet', 'laser3delay': '

In [117]:
print(cd4_values(metadata_files))

['MHCII', '7AAD', 'hCD2', 'CD4', 'eF450', 'CD45 or CD62L', 'FMO b7', 'CD11c or CCR9']
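To see what the pattern `^p\d+s$` in `cd4_values` selects, it helps to run it on a small dictionary. The sketch below uses a made-up metadata excerpt. The assumption is that keys of the form `p<number>s` hold the stain or marker label of a channel, while `p<number>n` holds the channel name.

```python
import re

marker_key = re.compile(r'^p\d+s$')

# Made-up excerpt of a metadata dictionary, for illustration only.
example_metadata = {
    'p1n': 'FSC-A',
    'p7n': 'B530-A',
    'p7s': 'CD4',
    'p8s': 'MHCII',
    'par': '15',
}

# Keep only the values whose key looks like 'p<number>s'.
markers = [value for key, value in example_metadata.items() if marker_key.match(key)]
print(markers)
```

Only the values stored under `p7s` and `p8s` are kept; channel names and other keywords are ignored.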


Enter the metadata from the file into the Coscine form:

In [118]:
metadata["Number of Samples"] = count_exp_name_files(metadata_files) 
metadata['Date of Analysis'] = fcs_metadata['date']
metadata['Creator'] = user.name
metadata['Instrument'] = fcs_metadata['cyt']
metadata['Markers analysed'] = cd4_values(metadata_files)

# Different instruments store the experiment name under different keys.
# Assign the value of the first key that is present to 'Hypothesis' in Coscine.
for key in ('experiment name', 'groupname', 'experiment_name', 'proj'):
    if key in fcs_metadata:
        metadata['Hypothesis'] = fcs_metadata[key]
        break

In [119]:
print(metadata)

+---+----------+--------------------+----------------------------------------------------+
| C | Type     | Property           | Value                                              |
+---+----------+--------------------+----------------------------------------------------+
|   | [str]    | Hypothesis         | 20210209_invitro_DCs-Tcells_OVApulse_d5            |
| V | str      | Instrument*        | BD LSRFortessa                                     |
|   | [str]    | Organism           |                                                    |
|   | [str]    | Cell source        |                                                    |
|   | [str]    | Markers analysed*  | MHCII,7AAD,hCD2,CD4,eF450,CD45 or CD62L,FMO        |
|   |          |                    | b7,CD11c or CCR9                                   |
|   | int      | Number of Samples* | 1                                                  |
|   | [str]    | Reference          |                                                    |

In [120]:
file_name

PosixPath('d5_1_Col_1to4_010.fcs')

In [121]:
resource.upload(file_name.name, str(file_name.absolute()), metadata)

d5_1_Col_1to4_010.fcs:   0%|          | 0.00/9.32M [00:00<?, ?B/s]