# Create Raw and Derived Datasets
## DMSC Summer School
  
This notebook shows how to create a raw and a derived dataset locally, each with scientific metadata and related files.   
Once a local dataset is complete, the notebook uses Scitacean to upload the associated files and create the corresponding entry in SciCat.


URL of the SciCat instance containing the data.

In [None]:
scicat_instance = "https://staging.scicat.ess.eu/api/v3"

A valid authentication token  
(also called access token or SciCat token).  
_Follow the steps below to obtain the token:_
- visit the [ESS SciCat staging environment](https://staging.scicat.ess.eu),
- log in using the credentials provided,
- go to the User -> Settings page,
- click the __copy to clipboard__ icon at the end of the __SciCat Token__ field.

![SciCat User Settings](scicat_user_settings.png)

Access token example:  
`eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJfaWQiOiI2MzliMmE1MWI0MTU0OWY1M2RmOWVjMzYiLCJyZWFsbSI6ImxvY2FsaG9zdCIsInVzZXJuYW1lIjoiaW5nZXN0b3IiLCJlbWFpbCI6InNjaWNhdGluZ2VzdG9yQHlvdXIuc2l0ZSIsImVtYWlsVmVyaWZpZWQiOnRydWUsImF1dGhTdHJhdGVneSI6ImxvY2FsIiwiaWQiOiI2MzliMmE1MWI0MTU0OWY1M2RmOWVjMzYiLCJpYXQiOjE2OTIwODc0ODUsImV4cCI6MTY5MjA5MTA4NX0.Phca4UF7WKY367-10Whgwd5jaFjiPku6WsgiPeDh_-o`

In [None]:
scicat_token = "<YOUR_SCICAT_TOKEN>"
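
The example token above has the three dot-separated segments of a JSON Web Token, so its payload can be inspected locally before use. The following is a minimal sketch that assumes the token carries an `exp` claim, as the example does. The helper names `decode_token_payload` and `token_expiry` are hypothetical and not part of Scitacean or SciCat, and the signature is not verified.

```python
import base64
import json
from datetime import datetime, timezone


def decode_token_payload(token: str) -> dict:
    """Return the (unverified) payload of a JWT-style token as a dict."""
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("expected three dot-separated segments")
    payload_b64 = parts[1]
    # base64url segments omit padding; restore it before decoding.
    payload_b64 += "=" * (-len(payload_b64) % 4)
    return json.loads(base64.urlsafe_b64decode(payload_b64))


def token_expiry(token: str) -> datetime:
    """Return the time stored in the token's 'exp' claim, in UTC."""
    exp = decode_token_payload(token)["exp"]
    return datetime.fromtimestamp(exp, tz=timezone.utc)
```

Calling `token_expiry(scicat_token)` with a real token should show when it stops being valid, which helps explain an authentication error later in the notebook.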

Username and SSH key used to access the files.
The SSH key file is provided at the beginning of the session.
Note that the key filename below is only valid on the School's JupyterHub.

In [None]:
sftp_username = "dss2024"
sftp_key_filename = "/home/jovyan/.ssh/id_summerschool2024"
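
Because the key path is only valid on the School's JupyterHub, it can be useful to fail early with a clear message when running elsewhere. This is a small sketch using only the standard library; `check_key_file` is a hypothetical helper, not a Scitacean function.

```python
from pathlib import Path


def check_key_file(key_filename: str) -> Path:
    """Return the key path if it points to an existing file, else raise."""
    path = Path(key_filename).expanduser()
    if not path.is_file():
        raise FileNotFoundError(f"SSH key file not found: {path}")
    return path
```

In the notebook, this would be called as `check_key_file(sftp_key_filename)` before opening the connection.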

Import Scitacean.  
For more information, please check the official [repository](https://github.com/SciCatProject/scitacean) and [documentation](https://scicatproject.github.io/scitacean/).

In [None]:
from scitacean import Client, Dataset
from scitacean.transfer.sftp import SFTPFileTransfer

Helper function that opens an SFTP connection to the data repository, authenticating with the username and SSH key defined above.

In [None]:
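def connect(host, port):
    # Custom connection hook passed to SFTPFileTransfer.
    # Note: `port` is accepted but not used, so paramiko falls back to its default port (22).
    from paramiko import SSHClient, AutoAddPolicy

    client = SSHClient()
    client.load_system_host_keys()
    # Accept host keys that are not yet known (convenient for the school, not for production).
    client.set_missing_host_key_policy(AutoAddPolicy())
    client.connect(
        hostname=host,
        username=sftp_username,
        key_filename=sftp_key_filename,
        timeout=1,  # seconds to wait for the TCP connection
    )
    return client.open_sftp()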

Instantiate the Scitacean client.

In [None]:
client = Client.from_token(
    url=scicat_instance,
    token=scicat_token,
    file_transfer=SFTPFileTransfer(
        host="sftpserver2.esss.dk",
        connect=connect,
    ))

Before we start creating a new dataset,  
we need to define the reference person for it.  

To keep things simple, the reference person will be used as principal investigator, owner, and contact person.

In [None]:
reference_person_name = "<YOUR_NAME>"
reference_person_email = "<YOUR_EMAIL>"

We need a unique name for the folder that the data will be uploaded to.  
We use a UUID to achieve that.

In [None]:
import uuid
run_uuid = str(uuid.uuid4())
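
To make the role of the UUID concrete, the sketch below builds the upload folder path used later for `source_folder` and parses the string back to confirm it is a well-formed version 4 UUID. The variable names `upload_folder` and `parsed` are introduced here for illustration only.

```python
import uuid

# Same call as in the notebook: a random (version 4) UUID rendered as text.
run_uuid = str(uuid.uuid4())

# Folder layout used later in the notebook for `source_folder`.
upload_folder = f"/ess/data/dmsc_summer_school/2024/upload/{run_uuid}"

# Parsing the string back confirms that it is a well-formed UUID.
parsed = uuid.UUID(run_uuid)

print(upload_folder)
print(parsed.version)
```

Since the value is random, every run of the notebook gets its own folder, so repeated uploads should not overwrite each other.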

First, we are going to create a local copy of a raw dataset.  

In [None]:
raw_dataset = Dataset(
    type='raw',
    contact_email=reference_person_email,
    principal_investigator=reference_person_name,
    owner=reference_person_name,
    owner_email=reference_person_email,
    creation_location='/ESS/DMSC/Summer_School',
    data_format='Random binary file',
    is_published=False,
    owner_group='dss2024',
    access_groups=['ess','dram','swap'],
    instrument_id=None,
    techniques=[],
    keywords=[
        'DMSC Summer School', 
        '2025', 
        'DMSC Summer School 2025',
        'Upload Test', 
        'Raw Upload Test',
    ],
    license='unknown',
    proposal_id=None,
    source_folder=f'/ess/data/dmsc_summer_school/2024/upload/{run_uuid}',
    source_folder_host='SpectrumScale.esss.dk',
    name='This is a DMSC Summer School test raw dataset',
    description=f'This is a DMSC Summer School test raw dataset. Run {run_uuid}',
)

Now we add scientific metadata

In [None]:
raw_dataset.meta = {
    'wavelength' : {
        'value' : 1.5,
        'unit' : 'angstrom'
    },
    'detector' : {
        'value' : 3,
        'unit' : 'm'
    },
    'sample_weight' : {
        'value' : 4,
        'unit' : 'kg'
    },
    'number_of_pulses' : {
        'value' : 1,
        'unit' : ''
    },
}
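
Every entry in the metadata above follows the same `value`/`unit` layout. As a sketch, a small check like the one below can catch entries that deviate from it before uploading. The helper `find_invalid_entries` and the `temperature` entry are hypothetical, and the check reflects the convention used in this notebook rather than a rule enforced by SciCat.

```python
def find_invalid_entries(meta: dict) -> list:
    """Return the names of entries that are not {'value': ..., 'unit': str} dicts."""
    invalid = []
    for name, entry in meta.items():
        if not isinstance(entry, dict):
            invalid.append(name)
        elif set(entry) != {"value", "unit"}:
            invalid.append(name)
        elif not isinstance(entry["unit"], str):
            invalid.append(name)
    return invalid


example_meta = {
    "wavelength": {"value": 1.5, "unit": "angstrom"},
    "number_of_pulses": {"value": 1, "unit": ""},
    "temperature": {"value": 300},  # hypothetical entry with a missing unit
}
print(find_invalid_entries(example_meta))  # prints ['temperature']
```

In the notebook, `find_invalid_entries(raw_dataset.meta)` would be expected to return an empty list for the metadata defined above.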

As the last step before uploading the dataset, we add the related files.

In [None]:
raw_dataset.add_local_files("sample_data/dmsc_summer_school_test_data_file_1.dat", base_path="sample_data")
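
The `base_path` argument controls which leading part of the local path is dropped when the file is placed under `source_folder`. The sketch below mimics that mapping with `pathlib` only. It reflects our understanding of the behaviour, it is not Scitacean's implementation, and `remote_path` is a hypothetical helper.

```python
from pathlib import PurePosixPath


def remote_path(local_path: str, base_path: str, source_folder: str) -> str:
    """Strip base_path from local_path and append the remainder to source_folder."""
    relative = PurePosixPath(local_path).relative_to(base_path)
    return str(PurePosixPath(source_folder) / relative)


print(
    remote_path(
        "sample_data/dmsc_summer_school_test_data_file_1.dat",
        base_path="sample_data",
        source_folder="/ess/data/dmsc_summer_school/2024/upload/<run_uuid>",
    )
)
```

With `base_path="sample_data"`, the file would land directly in the upload folder rather than in a `sample_data` subfolder.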

Before we proceed and upload the dataset,  
let's view it and visually verify that everything is there 

In [None]:
raw_dataset

Now we are ready to create the dataset in SciCat and upload the files

In [None]:
uploaded_raw_dataset = client.upload_new_dataset_now(raw_dataset)

`uploaded_raw_dataset` is an almost exact copy of `raw_dataset`, except for the pid, which is the unique identifier that SciCat has assigned to this dataset.

In [None]:
raw_dataset_pid = str(uploaded_raw_dataset.pid)

In [None]:
print(f"The dataset has been created and has been assigned pid {raw_dataset_pid}")
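
SciCat PIDs are commonly of the form `prefix/identifier`. Assuming that layout, the sketch below splits a PID string into its parts and percent-encodes it, which is typically needed when a PID is placed in a URL path. The PID value and both helper names are hypothetical.

```python
from typing import Optional, Tuple
from urllib.parse import quote


def split_pid(pid: str) -> Tuple[Optional[str], str]:
    """Split 'prefix/identifier' at the first slash; prefix is None if absent."""
    prefix, sep, identifier = pid.partition("/")
    if not sep:
        return None, pid
    return prefix, identifier


def encode_pid_for_url(pid: str) -> str:
    """Percent-encode the PID so the slash does not act as a path separator."""
    return quote(pid, safe="")


# Hypothetical PID, for illustration only.
example_pid = "my.prefix/7fc0ec3b-2ab6-4d35-a8c8-4a5d9b2d0a47"
print(split_pid(example_pid))
print(encode_pid_for_url(example_pid))
```

In the notebook, the same helpers could be applied to `raw_dataset_pid`, which is already a plain string.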

We can verify the pid by inspecting the returned dataset.  
_Important_: This dataset has a valid PID, which was assigned by SciCat, whereas in the dataset that we created locally above, the PID field was empty.

In [None]:
uploaded_raw_dataset

Now that we have created a raw dataset, we can move on and create a derived dataset.

In [None]:
derived_dataset = Dataset(
    type='derived',
    contact_email=reference_person_email,
    investigator=reference_person_name,
    owner=reference_person_name,
    owner_email=reference_person_email,
    is_published=False,
    owner_group='dss2024',
    access_groups=['ess','dram','swap'],
    instrument_id=None,
    techniques=[],
    keywords=[
        'DMSC Summer School', 
        '2025',
        'DMSC Summer School 2025',
        'Upload Test', 
        'Derived Upload Test',
    ],
    license='unknown',
    proposal_id=None,
    source_folder=f'/ess/data/dmsc_summer_school/2024/upload/{run_uuid}',
    source_folder_host='SpectrumScale.esss.dk',
    input_datasets=[raw_dataset_pid],
    used_software=['magic and fantastic software'],
    name='This is a DMSC Summer School test derived dataset',
    description=f'This is a DMSC Summer School test derived dataset. Run {run_uuid}',
)

Now we add scientific metadata

In [None]:
derived_dataset.meta = {
    'estimated_wavelength' : {
        'value' : 1.5,
        'unit' : 'angstrom'
    },
    'estimated_detector' : {
        'value' : 3,
        'unit' : 'm'
    },
    'estimated_sample_weight' : {
        'value' : 4,
        'unit' : 'kg'
    },
    'number_of_pulses' : {
        'value' : 1,
        'unit' : ''
    },
    'secret_algorithm_parameter_1' : {
        'value' : 0.0034,
        'unit' : ''
    }
}

As the last step before uploading the dataset, we add the related files.

In [None]:
derived_dataset.add_local_files("sample_data/dmsc_summer_school_test_data_file_2.dat", base_path="sample_data")

Before we proceed and upload the dataset,  
let's view it and visually verify that everything is there 

In [None]:
derived_dataset

Now we are ready to create the dataset in SciCat and upload the files

In [None]:
uploaded_derived_dataset = client.upload_new_dataset_now(derived_dataset)

As with the raw dataset, `uploaded_derived_dataset` is an almost exact copy of `derived_dataset`, except for the pid, which is the unique identifier that SciCat has assigned to this dataset.

In [None]:
derived_dataset_pid = str(uploaded_derived_dataset.pid)

In [None]:
print(f"The dataset has been created and has been assigned pid {derived_dataset_pid}")

We can verify the pid by inspecting the returned dataset.

In [None]:
uploaded_derived_dataset