# Access Multiple Datasets
## DMSC Summer School
  
This notebook shows how to load an arbitrary number of datasets from SciCat, access their information, and programmatically download the first file of each dataset.

Load standard libraries

In [None]:
import sys
import os

URL of the SciCat instance containing the data

In [None]:
scicat_instance = "https://staging.scicat.ess.eu/api/v3"

Valid authentication token  
(also called access token or SciCat token)  
_Follow the steps below to obtain the token:_
- visit the [ESS SciCat staging environment](https://staging.scicat.ess.eu)
- log in using the credentials provided
- go to the User->Settings page
- click on the __copy to clipboard__ icon next to the __SciCat Token__.

![SciCat User Settings](scicat_user_settings.png)

Access token example:  
`eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJfaWQiOiI2MzliMmE1MWI0MTU0OWY1M2RmOWVjMzYiLCJyZWFsbSI6ImxvY2FsaG9zdCIsInVzZXJuYW1lIjoiaW5nZXN0b3IiLCJlbWFpbCI6InNjaWNhdGluZ2VzdG9yQHlvdXIuc2l0ZSIsImVtYWlsVmVyaWZpZWQiOnRydWUsImF1dGhTdHJhdGVneSI6ImxvY2FsIiwiaWQiOiI2MzliMmE1MWI0MTU0OWY1M2RmOWVjMzYiLCJpYXQiOjE2OTIwODc0ODUsImV4cCI6MTY5MjA5MTA4NX0.Phca4UF7WKY367-10Whgwd5jaFjiPku6WsgiPeDh_-o`
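
The example above is a JSON Web Token (JWT): three base64url-encoded parts separated by dots, the second of which carries fields such as `username` and the expiry time `exp`. As a minimal sketch, assuming your own token has the same shape, the payload can be decoded locally to see when the token expires. The helper names `decode_token_payload` and `token_expiry` are illustrative and not part of SciCat or Scitacean, and the signature is not verified here.

```python
import base64
import json
from datetime import datetime, timezone


def decode_token_payload(jwt_token):
    """Return the payload (second part) of a JWT as a dict, without verifying it."""
    payload_b64 = jwt_token.split(".")[1]
    # JWTs strip the base64 padding; restore it before decoding
    payload_b64 += "=" * (-len(payload_b64) % 4)
    return json.loads(base64.urlsafe_b64decode(payload_b64))


def token_expiry(jwt_token):
    """Return the expiry time of the token as a timezone-aware UTC datetime."""
    expiry_seconds = decode_token_payload(jwt_token)["exp"]
    return datetime.fromtimestamp(expiry_seconds, tz=timezone.utc)
```

Comparing `token_expiry(token)` with `datetime.now(timezone.utc)` shows whether a new token needs to be copied from the settings page.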

In [None]:
token = "<YOUR_SCICAT_TOKEN>"

User name and SSH key used to access the files.  
The SSH key file is provided at the beginning of the session.

In [None]:
sftp_username = "dss2023"
sftp_key_filename = "<PATH_OF_THE_SSH_KEY_FILE>"

We want to work with all the datasets that have been prepared for this course and are available in SciCat.  
The list of dataset pids is in the following cell.  
  
If you are curious how this list was obtained, here is the Linux command line:
```bash
curl \
  -X 'GET' \
  'https://staging.scicat.ess.eu/api/v3/datasets/fullquery?limits=%7B%20%22skip%22%3A%200%2C%20%22limit%22%3A%2025%2C%20%22order%22%3A%20%22creationTime%3Adesc%22%20%7D&fields=%7B%22mode%22%3A%7B%7D%2C%22keywords%22%3A%5B%22DMSC%20Summer%20School%202023%22%5D%7D' \
  -H 'accept: application/json' \
  -H 'Authorization: Bearer <YOUR_SCICAT_TOKEN>' | \
  jq . | \
  grep pid | \
  cut -d\" -f4
```
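
The long query string in the `curl` command is two URL-encoded JSON objects, `limits` and `fields`. As a sketch, the same URL can be assembled in Python with the standard library alone. The function names `build_fullquery_url` and `extract_pids` are illustrative, and the response is assumed to be a JSON list of dataset records, each with a `pid` key, as the `grep pid` step suggests.

```python
import json
from urllib.parse import parse_qs, quote, urlencode, urlsplit


def build_fullquery_url(base_url, keyword, skip=0, limit=25):
    """Build the datasets/fullquery URL used by the curl command above."""
    limits = {"skip": skip, "limit": limit, "order": "creationTime:desc"}
    fields = {"mode": {}, "keywords": [keyword]}
    query = urlencode(
        {"limits": json.dumps(limits), "fields": json.dumps(fields)},
        quote_via=quote,  # encode spaces as %20, as in the curl command
    )
    return f"{base_url}/datasets/fullquery?{query}"


def extract_pids(records):
    """Collect the pid of every dataset record in a decoded JSON response."""
    return [record["pid"] for record in records]


url = build_fullquery_url(
    "https://staging.scicat.ess.eu/api/v3", "DMSC Summer School 2023"
)
# Decode the query string again to check what would be sent
for name, values in parse_qs(urlsplit(url).query).items():
    print(name, "=", json.loads(values[0]))
```

Sending the request still requires the `Authorization: Bearer` header shown in the `curl` command.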

In [None]:
dataset_pids = [
    "20.500.12269/a0b187e6-3a48-42db-9c01-7c55db41dca5",
    "20.500.12269/260ac20b-8354-40b8-847a-0fbda3ecc5ae",
    "20.500.12269/75feee64-9795-4430-af2b-f84028bf9f17",
    "20.500.12269/b9ef58e5-48f9-4ee8-95a0-dc12b0f20638",
    "20.500.12269/0e586f04-5413-4dfb-a78d-5768de924ac6",
    "20.500.12269/06d85a15-6979-4b42-9eb1-3080fbdffc10",
    "20.500.12269/3027e6df-36fd-498c-b143-4a42f0e8e06d",
    "20.500.12269/48b7d7af-8ef7-4993-8a33-671299230d76",
    "20.500.12269/0cb7099c-fee1-4e3a-8dc9-a699c32b6f98",
    "20.500.12269/b7657501-29a0-4e1f-9ffc-32881a5bd09a",
    "20.500.12269/53ec1287-b0fe-4171-bf71-80673a54262e",
    "20.500.12269/488681d6-73cf-477e-8a30-1d625354cc85",
    "20.500.12269/f2947f0e-97e6-470b-a914-9dc8ac03c893",
    "20.500.12269/c566043f-f37c-417f-8dc7-d9d17b25c8ef",
    "20.500.12269/087a0844-d0d8-4f3d-88ba-e6505eea8c7a",
    "20.500.12269/d84012fe-679d-4608-82a8-8e39ad092f40",
    "20.500.12269/249a7405-8ab9-4859-8ea5-e691b80e4007",
    "20.500.12269/17dbda39-0ce7-493c-82fc-24c09b35e0c9",
    "20.500.12269/bdfa6765-1479-4b59-a095-86b75f3ae295",
    "20.500.12269/035d4cbd-e2a2-45a4-a919-d66216ccb29a",
    "20.500.12269/7a3cb15d-992d-4409-b62e-024b509d570c",
    "20.500.12269/25f58b6c-8f45-454f-bd22-ca9a398ab24b",
    "20.500.12269/0445cf2d-53a3-4f3a-8714-be6ea2aeccf2"
]

Local folder where the downloaded data should be saved

In [None]:
local_data_folder = "./data"

Import Scitacean.  
For more information, please check the official [repository](https://github.com/SciCatProject/scitacean) and [documentation](https://scicatproject.github.io/scitacean/).

In [None]:
from scitacean import Client
from scitacean.transfer.sftp import SFTPFileTransfer

Function that opens an SFTP connection to the data repository, authenticating with the user name and SSH key defined above

In [None]:
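def connect(host, port):
    from paramiko import SSHClient, AutoAddPolicy

    client = SSHClient()
    client.load_system_host_keys()
    # Accept unknown host keys automatically: convenient here, unsafe in production
    client.set_missing_host_key_policy(AutoAddPolicy())
    client.connect(
        hostname=host,
        port=port or 22,  # fall back to the default SSH port if none is given
        username=sftp_username,
        key_filename=sftp_key_filename,
        timeout=1,  # seconds
    )
    return client.open_sftp()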

Instantiate the Scitacean client

In [None]:
client = Client.from_token(
    url=scicat_instance,
    token=token,
    file_transfer=SFTPFileTransfer(
        host="sftpserver2.esss.dk",
        connect=connect
    ))

Load all the datasets whose pids are listed above.  
We loop over all the pids and load each dataset through the Scitacean client.  
This could be refactored in a more Pythonic way, but we chose an explicit loop to provide visual feedback while loading.

In [None]:
datasets = []
for pid in dataset_pids:
    datasets.append(client.get_dataset(pid))
    print(".",end="")
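
For comparison, the more compact form mentioned above is a list comprehension. The sketch below uses a stand-in `fake_fetch` function and made-up pids so that it runs without a SciCat connection; in the notebook, `client.get_dataset` and `dataset_pids` would take their place. The name `load_datasets` is illustrative.

```python
def load_datasets(fetch, pids):
    """Fetch every pid in order and return the results as a list."""
    return [fetch(pid) for pid in pids]


def fake_fetch(pid):
    # Stand-in for client.get_dataset: returns a small dict instead of a dataset
    return {"pid": pid}


# Hypothetical pids, for illustration only
example_pids = ["example/pid-1", "example/pid-2"]
example_datasets = load_datasets(fake_fetch, example_pids)
print(len(example_datasets))
```

The comprehension is shorter, but nothing is printed until every dataset has been loaded, which is why the explicit loop is used here.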

Let's validate that we retrieved all the datasets that we requested

In [None]:
print(f"Number of pids provided ......: {len(dataset_pids)}\nNumber of datasets retrieved .: {len(datasets)}")

Let's explore all the metadata of the first dataset

In [None]:
datasets[0]

As we already saw in the [single dataset notebook](./access_individual_dataset.ipynb),  
we can expand __Files__ and __Scientific Metadata__ to explore further the dataset information

Let's download the first file of each dataset.  
We decided to use an explicit loop, instead of a comprehension, to provide more visual feedback during the download

In [None]:
downloaded_datasets = []
for dataset in datasets:
    try: 
        temp_dataset = client.download_files(
            dataset,
            target=local_data_folder,
            select=dataset.files[0].remote_path.name
        )
    except Exception:  # a bare except would also swallow KeyboardInterrupt
        print("x",end="")
        continue
    
    downloaded_datasets.append(temp_dataset)
    print(".",end="")
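
The loop above prints `x` for a failed download but does not record which dataset failed or why. One possible variation, sketched here with a stand-in `fake_download` callable instead of `client.download_files`, keeps each failure together with the exception that caused it. The names `download_all` and `fake_download` are illustrative.

```python
def download_all(items, download):
    """Apply download to each item and return (successes, failures)."""
    succeeded, failed = [], []
    for item in items:
        try:
            succeeded.append(download(item))
        except Exception as error:  # keep going, but remember what went wrong
            failed.append((item, error))
    return succeeded, failed


def fake_download(name):
    # Stand-in for client.download_files: fails for one hypothetical item
    if name == "broken":
        raise OSError("transfer failed")
    return name.upper()


done, errors = download_all(["a", "broken", "b"], fake_download)
for item, error in errors:
    print(f"{item}: {error!r}")
```

Inspecting the collected exceptions afterwards makes it easier to tell an expired token from an SFTP connection problem.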

Now we can check whether the files have been downloaded.  
Let's look at the last dataset in the list.

In [None]:
downloaded_datasets[-1]
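
Another way to confirm the downloads, independent of Scitacean, is to look at the local folder directly. This is a minimal sketch using only the standard library; `list_downloaded_files` is an illustrative helper, and `./data` is the value of `local_data_folder` set earlier.

```python
from pathlib import Path


def list_downloaded_files(folder):
    """Return (relative path, size in bytes) for every file under folder, sorted by path."""
    root = Path(folder)
    if not root.is_dir():
        return []  # nothing has been downloaded yet
    return sorted(
        (str(path.relative_to(root)), path.stat().st_size)
        for path in root.rglob("*")
        if path.is_file()
    )


for name, size in list_downloaded_files("./data"):
    print(f"{size:>12}  {name}")
```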