<a href="https://colab.research.google.com/github/gdcc/easyDataverse/blob/main/examples/EasyDataverseBasics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# EasyDataverse Basics

Welcome to this introductory notebook that demonstrates the basic features of EasyDataverse. EasyDataverse is a user-friendly tool that aims to simplify the process of managing data with Dataverse. The tool utilizes Dataverse's API-first philosophy and offers a GUI-like experience when interacting with datasets. It is backed by Dataverse's powerful Native API, providing an efficient, streamlined, and simplified data management solution.

Within this notebook you will learn more about the following topics:

* Connecting to a Dataverse instance
* Gathering metadata block information
* Creating a dataset and adding metadata
* File uploads (S3 and native)
* Fetching and updating datasets

In [None]:
# Install easyDataverse
!pip install easyDataverse

# Download the example data into ./data, where the cells below expect it
!wget -P data https://github.com/gdcc/easyDataverse/raw/main/examples/data/README.md
!wget -P data https://github.com/gdcc/easyDataverse/raw/main/examples/data/anscomb.json
!wget -P data https://github.com/gdcc/easyDataverse/raw/main/examples/data/mnist_test.csv
!wget -P data https://github.com/gdcc/easyDataverse/raw/main/examples/data/mnist_train_small.csv

In [1]:
import rich # For printing

from easyDataverse import Dataverse

### Connecting to Dataverse

To begin, establish a connection to a Dataverse instance using the `Dataverse` class. While connecting, EasyDataverse gathers all available metadata blocks and converts them into Pydantic data types. These data types follow a strict typing system to ensure that any data passed in complies with the expectations of your instance. They will come in handy later when creating datasets.

In [4]:
# Connect to a Dataverse instance of choice
dataverse = Dataverse(
    server_url='https://demo.dataverse.org', # @param {type:"string"}
    api_token='<API_TOKEN>' # @param {type:"string"}
)
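
To keep the token out of the notebook itself, it can be read from the environment instead of being typed into the cell. The sketch below assumes a hypothetical variable name `DATAVERSE_API_TOKEN` and a hypothetical helper `read_api_token`; neither is part of EasyDataverse.

```python
import os


def read_api_token(env_var: str = "DATAVERSE_API_TOKEN") -> str:
    """Return the API token stored in an environment variable.

    The variable name is an assumption made for this sketch.
    """
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"Set the {env_var} environment variable first.")
    return token
```

The result can then be passed to the `Dataverse` class as `api_token=read_api_token()`.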

### Metadata block fields

It's possible that you might not be familiar with all the metadata fields that are available, but you don't want to keep switching back and forth between the Graphical User Interface (GUI) and the notebook. Fortunately, EasyDataverse offers a solution to this problem by providing a `list_metadatablocks` method that allows you to check what metadata blocks are available. This method has two modes of functionality:

* `detailed=False` - Lists the names of the available metadata blocks
* `detailed=True` - Lists each metadata block, its fields and available add-methods

In [5]:
# Mode 1: Only names
dataverse.list_metadatablocks(detailed=False)

# Mode 2: Names, fields and methods 
# dataverse.list_metadatablocks(detailed=True)

geospatial
socialscience
astrophysics
biomedical
journal
customMRA
customGSD
customARCS
customPSRI
customPSI
customCHIA
customDigaai
citation
customSAEF
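
The block names listed above are specific to the connected instance. For reference, Dataverse's Native API exposes metadata blocks under `/api/metadatablocks`. The helper below is hypothetical and not part of EasyDataverse; it only builds the URLs and sends no request.

```python
from typing import Optional


def metadatablocks_url(server_url: str, block: Optional[str] = None) -> str:
    """Build the Native API URL that lists metadata blocks.

    Without `block`, the URL addresses the overview of all blocks; with a
    block name such as "citation", it addresses that single block.
    """
    base = f"{server_url.rstrip('/')}/api/metadatablocks"
    return f"{base}/{block}" if block else base


print(metadatablocks_url("https://demo.dataverse.org"))
# https://demo.dataverse.org/api/metadatablocks
print(metadatablocks_url("https://demo.dataverse.org", "geospatial"))
# https://demo.dataverse.org/api/metadatablocks/geospatial
```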


----

### Dataset creation and population

Our next step is to create a new dataset and add metadata and files to it. With the connection established, we can use the `dataverse` object created earlier to generate new datasets. The `create_dataset` method returns a `Dataset` object that comes with all the available metadata blocks from the connected instance.

In [63]:
dataset = dataverse.create_dataset()

print(dataset) # Should be empty for now

metadatablocks: {}



Before we populate the metadata of this dataset, we should check what information is available. Each metadata block can be accessed as an attribute of the `dataset` instance. Just like we can check the content of all metadata blocks, we can also inspect the content of individual instances. Specifically, we will inspect the content of the `geospatial` metadata block.

In [64]:
dataset.geospatial.info()

Once we have a clear understanding of the fields available, we can add values to the corresponding attributes/fields in the metadata block. For lists of compound objects or sub-objects, we can add them using dedicated add-methods that will be demonstrated below.

In [65]:
# Singular fields
dataset.geospatial.unit = ["unit"]

# Adding to list of compounds
dataset.geospatial.add_coverage(
    country="Germany",
    state="BW",
    city="Stuttgart",
)

print(dataset)

metadatablocks:
  geospatial:
    geographicUnit:
      - unit
    geographicCoverage:
      - country: Germany
        state: BW
        city: Stuttgart



### Data validation

You may have noticed that instead of setting a single unit, we have provided a list of units to the `unit` field. This is because the metadata block defines this field as multi-valued text, so it expects a list of strings. As the input we supplied meets this requirement, we did not encounter any problems. The `country` field of the added `coverage` compound is constrained as well: it only accepts values from a controlled vocabulary.

EasyDataverse provides data types that follow a strict typing system. This system is enforced during initialization and assignment, which helps to prevent errors during upload and enables developers to identify non-compliant data. Let's attempt to intentionally provide incorrect input types and see how the system responds.

In [66]:
# Using try-except to not break the flow
from pydantic import ValidationError

try:
  # Adding a non-list element
  dataset.geospatial.unit = "unit"
except ValidationError as e:
  rich.print(e)

try:
  # Adding a value that is not part of a controlled vocab
  dataset.geospatial.add_coverage(
      country="germany",
      state="BW",
      city="Stuttgart",
  )
except ValidationError as e:
  # ... you see the world is quite large :')
  rich.print(e)
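
The two kinds of checks exercised above can be imitated with the standard library alone. This is only a sketch of the idea, not how Pydantic or EasyDataverse implement validation; the `Country` vocabulary is hypothetical and heavily shortened, and both helper functions are made up for illustration.

```python
from enum import Enum
from typing import List


class Country(str, Enum):
    """Hypothetical, heavily shortened controlled vocabulary."""

    GERMANY = "Germany"
    FRANCE = "France"


def check_units(value: object) -> List[str]:
    """Accept only a list of strings, mirroring a multi-valued text field."""
    if not isinstance(value, list) or not all(isinstance(v, str) for v in value):
        raise TypeError("expected a list of strings, e.g. ['unit']")
    return value


def check_country(value: str) -> Country:
    """Accept only exact, case-sensitive members of the vocabulary."""
    return Country(value)  # raises ValueError for unknown values


check_units(["unit"])     # passes: a list of strings
check_country("Germany")  # passes: exact vocabulary member

for bad_call in (lambda: check_units("unit"), lambda: check_country("germany")):
    try:
        bad_call()
    except (TypeError, ValueError) as error:
        print(f"rejected: {error}")
```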

To proceed, we can add more information to our dataset by including metadata. For now, we will only focus on filling out the required fields, but don't hesitate to try out other metadata options. It's worth noting that EasyDataverse will check for the mandatory fields during the upload process.

In [67]:
dataset.citation.title = "My Dataset"
dataset.citation.add_author(name="John Doe")
dataset.citation.add_dataset_contact(name="John Doe", email="john@doe.com")
dataset.citation.add_ds_description(value="This is a description")
dataset.citation.subject = ["Other"]

### File uploads

To add files to a dataset, you can use dedicated methods to add either a single file or a directory of files. Both methods are available on the `Dataset` instance and register your files for upload; nothing is transferred until the dataset itself is uploaded. You can also choose the directory in which a file is stored on Dataverse and rename it according to your preferences.

In [68]:
# Register a single file for upload
dataset.add_file(
    local_path="./data/anscomb.json", # Path to the file on your system
    dv_dir="some/sub/dir", # Optional sub directory on Dataverse
    file_name="different_name.json" # Optional renaming of the file
)


# Register an entire directory for upload
dataset.add_directory(
    dirpath="./data",
    ignores=[
        r"anscomb\.json", # Ignore the previously added file
        r"^\..*",         # Ignore hidden files and dirs (raw string avoids an invalid escape)
    ]
)

# Check if all the files have been added
rich.print(f"Added {len(dataset.files)} files.")
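
To get a feeling for what the `ignores` patterns exclude, the sketch below applies regular expressions to a list of file names. It assumes each pattern is matched against the bare file name with `re.match`; how EasyDataverse applies the patterns internally is not confirmed here, and `filter_ignored` is a hypothetical helper.

```python
import re
from typing import Iterable, List


def filter_ignored(names: Iterable[str], ignores: Iterable[str]) -> List[str]:
    """Keep the names that match none of the ignore patterns."""
    patterns = [re.compile(p) for p in ignores]
    return [n for n in names if not any(p.match(n) for p in patterns)]


names = ["anscomb.json", ".DS_Store", "README.md", "mnist_test.csv"]
print(filter_ignored(names, [r"anscomb\.json", r"^\..*"]))
# ['README.md', 'mnist_test.csv']
```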

### Uploading a dataset

Before we upload a dataset, let's take a look at what goes on behind the scenes. EasyDataverse builds the Dataverse JSON that describes your dataset's metadata, which can be viewed by calling the `dataverse_dict` method. You normally don't need to work with this JSON directly, as EasyDataverse takes care of it for you.

In [69]:
rich.print(dataset.dataverse_dict())
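
For orientation, Dataverse's dataset JSON nests fields under `datasetVersion` and `metadataBlocks`, with each field carrying a `typeName`, `typeClass`, `multiple` flag and `value`. The fragment below is written by hand for the title only; the real output of `dataverse_dict` contains more fields and may differ in detail, and `primitive_field` is a hypothetical helper.

```python
import json


def primitive_field(type_name: str, value, multiple: bool = False) -> dict:
    """Build one primitive field entry in Dataverse's dataset JSON layout."""
    return {
        "typeName": type_name,
        "typeClass": "primitive",
        "multiple": multiple,
        "value": value,
    }


payload = {
    "datasetVersion": {
        "metadataBlocks": {
            "citation": {
                "fields": [primitive_field("title", "My Dataset")],
            }
        }
    }
}

print(json.dumps(payload, indent=2))
```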

With metadata and files in place, the final step is to upload the dataset to Demo Dataverse.

> Please note that this collection is private and you may want to replace it with one of your own. Also, this is an S3-enabled collection that demonstrates how you can directly upload to S3!

In [70]:
pid = dataset.upload(dataverse_name="ed_test", n_parallel=4)

Dataset with pid 'doi:10.70122/FK2/2ZNL2G' created.
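
The returned persistent identifier can also be turned into a link to the dataset page in the web interface. The sketch below assumes the usual Dataverse landing page pattern `dataset.xhtml?persistentId=...`; `dataset_landing_url` is a hypothetical helper and not part of EasyDataverse.

```python
from urllib.parse import urlencode


def dataset_landing_url(server_url: str, pid: str) -> str:
    """Build the web URL of a dataset page from its persistent identifier."""
    # Keep ':' and '/' unescaped so the DOI stays readable in the link
    query = urlencode({"persistentId": pid}, safe=":/")
    return f"{server_url.rstrip('/')}/dataset.xhtml?{query}"


print(dataset_landing_url("https://demo.dataverse.org", "doi:10.70122/FK2/2ZNL2G"))
# https://demo.dataverse.org/dataset.xhtml?persistentId=doi:10.70122/FK2/2ZNL2G
```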





---

## Fetching and updating datasets

To fetch datasets, you can use the `load_dataset` method of a connected `Dataverse` instance. However, you must provide the persistent identifier (such as a DOI) of the dataset for this to work. Once you have provided the identifier, the method will automatically download all files of the dataset and restore the metadata block objects, just like the previous dataset we have created.

In [72]:
dataset = dataverse.load_dataset(pid)

print(dataset)

metadatablocks:
  geospatial:
    geographicUnit:
      - unit
    geographicCoverage:
      - country: Germany
        state: BW
        city: Stuttgart
  citation:
    title: My Dataset
    subject:
      - Other
    author:
      - authorName: John Doe
    datasetContact:
      - datasetContactName: John Doe
        datasetContactEmail: john@doe.com
    dsDescription:
      - dsDescriptionValue: This is a description
dataset_id: doi:10.70122/FK2/2ZNL2G



### Updating a dataset

Once you have retrieved a dataset and wish to make changes to it, you can modify the metadata or add further files to the dataset. You can also alter the existing files that were downloaded during the initial retrieval. The uploader will detect any changes you make and trigger a replacement upload. If a file remains unchanged, it will not be uploaded again.

In [73]:
dataset.citation.title = "My updated dataset"

dataset.update()
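
One plausible way to detect changed files is to compare a checksum of the local file against a previously recorded one. The sketch below illustrates that idea with MD5; it is an assumption about the general technique, not a description of how EasyDataverse decides what to re-upload, and both function names are hypothetical.

```python
import hashlib
from pathlib import Path


def md5_of(path: Path) -> str:
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()


def needs_upload(path: Path, known_checksum: str) -> bool:
    """True if the local file no longer matches the recorded checksum."""
    return md5_of(path) != known_checksum.lower()
```

A file whose digest still equals the recorded value would be skipped, while any edit to its content changes the digest and marks it for replacement.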




