# Data Curation,<br> Data Catalog,<br> and SciCat
## Day 1: Introduction

Max Novelli  
DMSC Summer School 2024  
Copenhagen, Denmark

## Dataset

> A dataset is a collection of multi-modal items which:
> - share a common source,
> - were collected for a specific purpose,
> - describe an individual event, a series of events, or an experiment,
> - pertain to the same experimental data collection.

A dataset might contain one or more files and must have a set of descriptors, called _metadata_ in current standards, which help users find the dataset.  

## Data Curation
> __Data curation__ is the effort to organize multi-modal data (measurements and observations) collected during an experimental data acquisition into a structured set, called a dataset, with the purpose of making the data more logical and easier to find, understand, and reproduce.

Data curation refers to the complete process of collecting the files with all the data measured during the experiment.
It also covers the information available in lab notebooks or held in the memory of the scientists and other actors involved in collecting, reducing, and analyzing the data.


## FAIR Data

The FAIR principles can be summarized in the following four points:
- Findable
- Accessible
- Interoperable
- Reusable

![FAIR logo](images/fair.png)

## TRUSTed Repositories

The TRUST principles are geared more towards the repositories holding the data and how users interact with them.  
They are:
- Transparency
- Responsibility
- User Focus
- Sustainability
- Technology

<img src="images/trust.png" width="500">

## Data
> __Data__ are every measurement, time series and piece of information, quantitative and qualitative, including log book entries, that can be digitally acquired and recorded during the experimental time. Such information is saved in one or multiple files which are generated during the experimental data acquisition.

## Metadata
> __Metadata__ are any information saved in the data catalog and available to the user to search for an individual dataset or a group of datasets and their linked data.  
> Metadata can be:
> 1) a duplicate of any piece of data contained in the linked data files. It must be small in both size and dimensionality.
> 2) any derived information that results from any type of data aggregation (like average, min, max or something more complex) performed on information contained in the linked data files.
> 3) any qualitative or quantitative information that has been collected or discovered at a later stage, which is relevant to describe and facilitate finding the dataset itself.
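
The three kinds of metadata listed above can be illustrated with a minimal Python sketch. It uses only the standard library and does not use the SciCat or Scitacean APIs; the function names, field names, and the temperature readings are hypothetical and chosen purely for illustration.

```python
from statistics import mean


def derive_metadata(readings: list[float], sample_name: str) -> dict:
    """Build a small metadata record from the content of a (hypothetical) data file."""
    return {
        # (1) a small value duplicated from the data file
        "sample_name": sample_name,
        # (2) values derived by aggregating the measurements
        "temperature_min": min(readings),
        "temperature_max": max(readings),
        "temperature_mean": mean(readings),
        "number_of_points": len(readings),
    }


def add_later_information(metadata: dict, **later_info) -> dict:
    """(3) Return a copy of the record extended with information found at a later stage."""
    return {**metadata, **later_info}


# Hypothetical temperature readings standing in for the content of a data file
readings = [290.0, 295.0, 300.0]
metadata = derive_metadata(readings, sample_name="demo sample")
metadata = add_later_information(metadata, data_quality="good")
print(metadata)
```

The resulting record is small regardless of how large the data file is, which is what makes it suitable for storing in a catalog and searching.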


## Data catalog
> A __data catalog__ is the tool that allows for a detailed inventory of all the scientific and experimental data produced at the facility and related to its scientific process. It is designed and configured to increase data FAIRness, to improve findability, accessibility, interoperability and reusability.
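
As a rough illustration of the idea of an inventory searchable by metadata, the sketch below implements a toy in-memory catalog in plain Python. It is not how SciCat is implemented; the class names `CatalogEntry` and `ToyCatalog`, the identifiers, and the metadata keys are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    pid: str                       # persistent identifier of the dataset
    files: list[str]               # paths of the linked data files
    metadata: dict = field(default_factory=dict)


class ToyCatalog:
    """Toy inventory: stores entries and finds them by exact metadata match."""

    def __init__(self) -> None:
        self._entries: dict[str, CatalogEntry] = {}

    def register(self, entry: CatalogEntry) -> None:
        self._entries[entry.pid] = entry

    def search(self, **criteria) -> list[CatalogEntry]:
        # Keep only the entries whose metadata match every given criterion
        return [
            entry
            for entry in self._entries.values()
            if all(entry.metadata.get(key) == value for key, value in criteria.items())
        ]


catalog = ToyCatalog()
catalog.register(CatalogEntry("pid-1", ["run1.h5"], {"instrument": "A", "sample": "water"}))
catalog.register(CatalogEntry("pid-2", ["run2.h5"], {"instrument": "B", "sample": "water"}))
catalog.register(CatalogEntry("pid-3", ["run3.h5"], {"instrument": "A", "sample": "ice"}))

matches = catalog.search(instrument="A", sample="water")
print([entry.pid for entry in matches])
```

The search only inspects the metadata, never the data files themselves, so a dataset can be found only through the descriptors that were recorded for it.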

## SciCat

SciCat is the data catalog of choice at ESS.  
It has been developed as an in-kind contribution and through a collaboration between ESS, [PSI](https://www.psi.ch/en), and [MAX IV](https://www.maxiv.lu.se/).  

The current version is _4.x_, and it is based on the following technologies:
- backend
  - MongoDB
  - Mongoose
  - TypeScript
  - Node.js
  - NestJS
- frontend
  - Node.js
  - Angular

## Datasets List

![SciCat Datasets List](images/scicat_datasets_list.png)

## Dataset Details
### General Information
![SciCat Dataset General Information](images/scicat_dataset_general_information.png)

## Dataset Details
### Scientific Metadata
![SciCat Dataset Scientific Metadata](images/scicat_dataset_scientific_metadata.png)

## Python Libraries

- Pyscicat https://github.com/SciCatProject/pyscicat  
  <img src="images/pyscicat.png" width="100">
  
- Scitacean https://github.com/SciCatProject/scitacean  
  <img src="images/scitacean.png" width="400">  


## Exercise 1

Discuss and list a few properties of the objects shown below.

<img src="images/juggling_balls_1_s.png" width="200"><img src="images/juggling_balls_2_s.png" width="200"><img src="images/juggling_balls_3_s.png" width="200">


## Exercise 2

How could we find all the balls that are intact and of the same color?  

Which minimum set of metadata could we use?
  
![Juggling balls 4](images/juggling_balls_4_s.png)

## Thank you
### Data Curation, Data Catalog, and SciCat
### Day 1: Introduction
  
Max (Massimiliano) Novelli  
max.novelli@ess.eu

# Data Curation,<br> Data Catalog,<br> and SciCat
## Day 2: Hands-on Exercises

Max Novelli  
DMSC Summer School 2024  
Copenhagen, Denmark

## Examples

The material provided contains the document [8-example.md](8-example.md), which explains in detail how to use Scitacean to interact with SciCat.

There are also three additional working notebooks which illustrate further how to retrieve a single dataset, multiple datasets and create a raw and a derived dataset:
- [access individual dataset](./notebooks/access_individual_dataset.ipynb)
- [access multiple datasets](./notebooks/access_multiple_datasets.ipynb)
- [upload and create dataset](./notebooks/create_single_dataset.ipynb)

## Hands-on Exercises 

Please open the notebook [9-exercise.ipynb](./9-exercise.ipynb).  
This notebook contains the tasks that you should be working on in the SciCat section of the DMSC Summer School.  
The instructions and all the necessary information are contained in the notebook.

### Good Luck

## Thank you
### Data Curation, Data Catalog, and SciCat
### Day 2: Hands-on Exercises
  
Max (Massimiliano) Novelli  
max.novelli@ess.eu