# Live inspection of provenance from an AiiDA archive

This notebook allows you to explore and analyze AiiDA archives interactively. You can examine the provenance graph, query calculations, and extract data from computational workflows.

For more information see [the AiiDA documentation](https://aiida.readthedocs.io/projects/aiida-core/en/latest/howto/data.html).

## Getting Started

Since no archive was pre-configured for this session, you have two options to get started:

### Option 1: Use Materials Cloud Archive
**Follow the steps below** to automatically download and set up an archive from Materials Cloud Archive:

1. **Provide the archive URL** in the cell below (from Materials Cloud Archive)
2. **Run the setup cells** to download and configure the archive
3. **Start exploring** using the AiiDA commands and examples

### Option 2: Manual Setup with Local Archive
If you have a local `.aiida` file, you first need to upload it to this Renku session.
Then, you can set up your profile manually in a terminal:

```shell
# Create a read-only AiiDA profile directly from the archive file
❯ verdi profile setup core.sqlite_zip --filepath your-archive.aiida
```

Alternatively, you can import it into an existing profile:

```shell
❯ verdi presto  # Create a profile
❯ verdi archive import your-archive.aiida
```

**For the automated setup, continue with the cell below ⬇️**

## Step 0: Provide the Materials Cloud Archive URL
Set the `archive_url` variable in the cell below.

In [None]:
# CONFIGURE YOUR ARCHIVE URL HERE:
archive_url = ""  # Paste your Materials Cloud Archive URL here

# Example URLs:
# archive_url = "https://archive.materialscloud.org/record/2023.81"
# archive_url = "https://archive.materialscloud.org/records/yf0rj-w3r97/files/acwf-verification_unaries-verification-PBE-v1-results_quantum_espresso-SSSP-1.3-PBE-precision.aiida"

In [None]:
# Execute to fetch the archive metadata

import os
import subprocess

if not archive_url.strip():
    print("❌ Please set the archive_url variable above and run this cell again.")
else:
    print(f"🔗 Processing URL: {archive_url}")

    try:
        # Call the metadata fetching script
        result = subprocess.run(
            [
                "python3",
                "/home/jovyan/work/renku2-aiida-integration/.scripts/fetch_mca_metadata.py",
                "--archive-url",
                archive_url,
            ],
            capture_output=True,
            text=True,
            check=True,
        )

        # Print the script output
        if result.stdout:
            print(result.stdout)

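        # Load the generated metadata to display a summary
        metadata_file = "/tmp/mca_metadata.json"
        if os.path.exists(metadata_file):
            import json

            with open(metadata_file, "r") as f:
                metadata = json.load(f)

            # Summarize the fields that the following cells rely on
            print(f"📦 Dataset: {metadata.get('title', 'Unknown Dataset')}")
            print(f"📁 Archive file: {metadata.get('archive_filename')}")

        else:
            print("⚠️ Metadata file not found after script execution")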

    except subprocess.CalledProcessError as e:
        print(f"❌ Failed to fetch metadata: {e}")
        if e.stderr:
            print(f"Error output: {e.stderr}")
        print("Please check the URL and try again")

    except Exception as e:
        print(f"❌ Error running metadata script: {e}")
        print("Please check the URL format and try again")
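
The error messages in the cell above ask you to check the URL format. As a rough illustration of what such a check could look like, here is a minimal sketch using only the standard library. The helper name `looks_like_mca_url` is hypothetical, and the accepted patterns are inferred solely from the two example URLs in this notebook; the actual metadata script may accept other forms.

```python
from urllib.parse import urlparse


def looks_like_mca_url(url: str) -> bool:
    """Rough plausibility check for a Materials Cloud Archive URL (hypothetical helper)."""
    parsed = urlparse(url.strip())
    # Require HTTPS and the host that appears in the example URLs of this notebook
    if parsed.scheme != "https" or parsed.netloc != "archive.materialscloud.org":
        return False
    # Assumption: record pages start with /record/ and direct file links with /records/
    return parsed.path.startswith(("/record/", "/records/"))


print(looks_like_mca_url("https://archive.materialscloud.org/record/2023.81"))
print(looks_like_mca_url("archive.materialscloud.org/record/2023.81"))
```

The second call fails the check because the scheme is missing, which is a common copy-and-paste mistake.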

## Dataset: {{ title }}
* DOI of the data: [{{ doi_url }}]({{ doi_url }})
* Materials Cloud Archive entry: `{{ mca_entry }}`
* Archive file: `{{ archive_filename }}`
* AiiDA profile name: `{{ aiida_profile }}`

## Instructions
This session is configured to work with the archive file mentioned above. The archive has not been downloaded yet to keep startup fast.

**Follow these steps:**
1. **Run the cells below** to download the archive and set up AiiDA
2. **Start exploring** using the AiiDA commands and examples

**NOTE**: *If you were expecting a different archive or file, you probably already have an open Renku session. Each
Renku user can only have one session at a time. To see the new file, close the current session by clicking the
trash button in the top-left corner of this browser window, then click the file in Materials Cloud Archive again
to open a new session pointing to the file you want.*

The notebook below contains a simple template with basic AiiDA commands for inspecting the archive content. See also
[the AiiDA documentation](https://aiida.readthedocs.io/projects/aiida-core/en/latest/howto/data.html) to learn how to interact with the data.

## Step 1: Download the Archive
This cell will download the archive file from Materials Cloud Archive.

In [None]:
# Execute to download the AiiDA archive from Materials Cloud

import os
import json
import urllib.request
from urllib.error import URLError
from pathlib import Path
from tqdm import tqdm

# Load metadata from JSON file
metadata_file = "/tmp/mca_metadata.json"

if not os.path.exists(metadata_file):
    print("❌ No metadata file found. Please run the session setup first.")
else:
    with open(metadata_file, "r") as f:
        metadata = json.load(f)

    # Extract information from metadata
    archive_url = metadata.get("archive_url", os.environ.get("archive_url"))
    archive_filename = metadata.get("archive_filename")
    archive_title = metadata.get("title", "Unknown Dataset")
    doi = metadata.get("doi")
    mca_entry = metadata.get("mca_entry")

    if not archive_url:
        print("❌ No archive URL found. This cell is only for pre-configured archives.")
    elif not archive_filename:
        print("❌ No archive filename found in metadata.")
    else:
        print(f"📦 Dataset: {archive_title}")
        print(f"📁 Archive file: {archive_filename}")
        print(f"🔗 Source URL: {archive_url}")
        if doi:
            print(f"📄 DOI: {doi}")
        if mca_entry:
            print(f"🏷️ MCA Entry: {mca_entry}")
        print("")

        # Create data directory
        data_dir = Path.cwd().parent / "data" / "aiida"
        data_dir.mkdir(parents=True, exist_ok=True)

        archive_path = data_dir / archive_filename

        if archive_path.exists():
            size_mb = archive_path.stat().st_size / (1024 * 1024)
            print(f"✅ Archive already exists ({size_mb:.1f} MB)")
        else:
            print("⬇️ Downloading archive...")

            try:

                class TqdmUpTo(tqdm):
                    def update_to(self, b=1, bsize=1, tsize=None):
                        if tsize is not None:
                            self.total = tsize
                        return self.update(b * bsize - self.n)

                # Update every 2 seconds to avoid IOPub rate limit
                with TqdmUpTo(unit="B", unit_scale=True, miniters=1, mininterval=2.0, desc=archive_filename) as t:
                    urllib.request.urlretrieve(archive_url, archive_path, reporthook=t.update_to)

                size_mb = archive_path.stat().st_size / (1024 * 1024)
                print(f"✅ Archive downloaded successfully ({size_mb:.1f} MB)")

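            except URLError as e:
                print(f"❌ Download failed: {e}")
                # Remove any partial file so that a re-run does not mistake it for a complete archive
                archive_path.unlink(missing_ok=True)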

        print(f"\n📍 Archive location: {archive_path.absolute()}")
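
Because the cell above writes directly to the final file name, an interrupted download can leave a truncated file that a later run would treat as complete. One possible way to guard against this is to download to a temporary name and rename only on success. The sketch below is not part of this notebook's setup; `download_atomically` and the `.part` suffix are hypothetical choices, and the demo uses a local `file://` URL so that it runs without network access.

```python
import tempfile
import urllib.request
from pathlib import Path


def download_atomically(url: str, target: Path) -> Path:
    """Download `url` to `target` via a temporary `.part` file (hypothetical helper)."""
    target.parent.mkdir(parents=True, exist_ok=True)
    partial = target.with_name(target.name + ".part")
    try:
        urllib.request.urlretrieve(url, partial)
        # Rename only after the download has finished without raising
        partial.replace(target)
    finally:
        # Remove leftovers from an interrupted or failed download
        partial.unlink(missing_ok=True)
    return target


# Demo with a local file:// URL standing in for the real archive URL
workdir = Path(tempfile.mkdtemp())
source = workdir / "source.aiida"
source.write_bytes(b"not a real archive")
result = download_atomically(source.as_uri(), workdir / "data" / "copy.aiida")
print(result.name, result.stat().st_size, "bytes")
```

With this pattern, the presence of the final file implies that the transfer completed, so an "already exists" check becomes more trustworthy.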

## Step 2: Set up the AiiDA Profile
This cell creates a read-only AiiDA profile from the downloaded archive.

In [None]:
# Execute to set up the AiiDA profile from the downloaded archive

import subprocess
import os
from pathlib import Path

# Get profile information
profile_name = metadata.get("aiida_profile", "aiida-renku")
archive_filename = metadata.get("archive_filename")
data_dir = Path.cwd().parent / "data" / "aiida"
archive_path = data_dir / archive_filename

if not archive_path.exists():
    print("❌ Archive file not found. Please run the download cell above first.")
else:
    print(f"🔧 Setting up AiiDA profile: {profile_name}")
    print(f"📁 Using archive: {archive_path}")
    print("")

    # Check if profile already exists
    try:
        result = subprocess.run(["verdi", "profile", "show", profile_name], capture_output=True, text=True, check=True)
        print(f"✅ Profile '{profile_name}' already exists")

    except subprocess.CalledProcessError:
        # Profile doesn't exist, create it
        print("⚙️ Creating AiiDA profile...")
        print("(This may take a few minutes if a database migration is required)\n")

        try:
            result = subprocess.run(
                [
                    "verdi",
                    "profile",
                    "setup",
                    "core.sqlite_zip",
                    "--profile-name",
                    profile_name,
                    "--first-name",
                    "AiiDA",
                    "--last-name",
                    "User",
                    "--email",
                    "aiida@renku",
                    "--institution",
                    "RenkuLab",
                    "--set-as-default",
                    "--non-interactive",
                    "--no-use-rabbitmq",
                    "--filepath",
                    str(archive_path.absolute()),
                ],
                capture_output=True,
                text=True,
                check=True,
            )

            print(f"✅ Profile '{profile_name}' created successfully!")

        except subprocess.CalledProcessError as e:
            print(f"❌ Failed to create profile: {e}")
            print(f"Error output: {e.stderr}")
            print("\nYou can try creating the profile manually with:")
            print(f"verdi profile setup core.sqlite_zip --filepath {archive_path}")

    # Set as default profile
    try:
        subprocess.run(["verdi", "profile", "setdefault", profile_name], capture_output=True, check=True)
        print(f"🎯 Profile '{profile_name}' set as default")
    except subprocess.CalledProcessError:
        print("⚠️ Could not set as default profile, but it should still work")

    print("\n🎉 AiiDA setup complete! You can now explore the archive data below.")

## Step 3: Start Exploring
Now you can start exploring the AiiDA database!

In [None]:
from aiida import orm, load_profile

# Load the default AiiDA profile
profile = load_profile()
print(f"✅ Loaded AiiDA profile: {profile.name}")

In [None]:
# Start querying the database using AiiDA's QueryBuilder
qb = orm.QueryBuilder()
qb.append(orm.Node)
print(f"Number of nodes in the loaded AiiDA archive: {qb.count()}")

print("List of groups in the AiiDA archive:")
for group in orm.Group.collection.all():
    # Keep only process nodes (calculations and workflows)
    processes = [node for node in group.nodes if node.node_type.startswith("process.")]
    num = len(processes)
    print(f"* {group.label} [containing {num} calculations or workflows]")
    if num:
        print("  UUIDs of the calculations or workflows in the group:")
        # Show at most six UUIDs per group to keep the output short
        for node in processes[:6]:
            print(f"    - {node.uuid}")
        if num > 6:
            print(f"  ... (run `verdi group show {group.label}` in a terminal to see all of them)")
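
The cell above identifies calculations and workflows by the `process.` prefix of each node's `node_type` string. To get a quick overview of what kinds of nodes an archive contains, those strings can also be tallied per class. The following is a minimal sketch: `count_node_classes` is a hypothetical helper, and the example strings are stand-ins written in the usual AiiDA `node_type` format rather than values read from this archive.

```python
from collections import Counter


def count_node_classes(node_types):
    """Count nodes per class name, given AiiDA `node_type` strings (hypothetical helper)."""
    # A node_type looks like "process.calculation.calcjob.CalcJobNode.";
    # the class name is the last non-empty dot-separated component.
    return Counter(t.rstrip(".").split(".")[-1] for t in node_types)


# Stand-in values; with a loaded profile, the list could instead come from
# orm.QueryBuilder().append(orm.Node, project="node_type").all(flat=True)
example_types = [
    "process.calculation.calcjob.CalcJobNode.",
    "process.calculation.calcjob.CalcJobNode.",
    "process.workflow.workchain.WorkChainNode.",
    "data.core.dict.Dict.",
]
counts = count_node_classes(example_types)
for name, count in counts.most_common():
    print(f"{name}: {count}")
```

Projecting only `node_type` in the query avoids loading full node objects, which tends to matter for large archives.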

### Exporting the provenance graph

AiiDA tracks the full provenance of your simulations, including inputs, outputs, and process metadata in its provenance
graph.

In [None]:
# Uncomment the following line to generate the provenance graph of a given calculation or workflow:
# !verdi node graph generate {node.uuid}
# Here, we are using the last `node` from the cell above, so make sure to run it, otherwise it might be undefined.
# Alternatively, you can use any process UUID in the command.

# This will generate a PDF file with the provenance graph in the current working directory, which you can open in the
# file explorer on the left (the file name will be printed by the command).
# NOTE: Depending on the complexity of the workflow, the generated PDF file might be very large and not open correctly
# in the browser. In that case, right-click the file name in the file explorer on the left and download it to view it
# on your computer.

### Export raw calculation inputs and outputs

You can also export data to disk for analysis with other tools (e.g., Linux command line utilities like `grep`):

In [None]:
# NOTE: AiiDA uses three types of identifiers:
# * UUIDs: Universally unique identifiers (already used in the cells above), which are guaranteed to be globally
#   unique, even between different AiiDA databases and profiles, so they are generally the safest to use.
# * PKs: Primary keys, which are simpler integer identifiers, and are only unique within a single AiiDA database
# * Labels: Human-readable strings that can be assigned to AiiDA entities for easier reference

# Uncomment and modify these commands to export specific data:

# * Export inputs/outputs of a given process (calculation or workflow)
#   !verdi process dump <process-(uuid|pk|label)>

# * Export raw data of all calculations and workflows in a given group
#   !verdi group dump <group-(uuid|pk|label)>

# * Export raw data of all calculations and workflows in the AiiDA profile
#   !verdi profile dump --all

# NOTE: Dumping all profile data might take a considerable amount of time if your database is large.
# Therefore, the command provides various options to filter by groups, creation time of nodes, etc.
# You can see all the available options using: `!verdi profile dump --help`
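
The three identifier types described above can usually be told apart by their shape alone. The sketch below illustrates the idea with a hypothetical helper, `classify_identifier`. It is a simplification for illustration only: AiiDA's own resolution logic may differ, for instance in how it treats ambiguous strings such as a label made up entirely of digits.

```python
import uuid


def classify_identifier(identifier: str) -> str:
    """Guess whether a string is a PK, a UUID, or a label (hypothetical helper)."""
    # PKs are plain integers
    if identifier.isdigit():
        return "pk"
    # Full UUIDs parse with the standard library; anything else is treated as a label
    try:
        uuid.UUID(identifier)
        return "uuid"
    except ValueError:
        return "label"


for value in ["42", str(uuid.uuid4()), "my-group"]:
    print(f"{value!r} looks like a {classify_identifier(value)}")
```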