# Data Discovery AI – Non-Technical Tutorial

**Author**: Yuxuan Hu

---

## Introduction

This tutorial introduces an AI framework developed to support the [new portal](https://portal.staging.aodn.org.au/). Its current focus is on **metadata record inference**, which helps improve the quality and consistency of metadata across datasets. This notebook is intended for **non-technical** audiences.

The framework is designed as a **Multi-Agent System (MAS)**, where different agents collaborate to perform specific post-processing tasks. It uses a **supervisor–worker architecture**:

- The **Supervisor Agent** manages and coordinates the overall process.
- Specialised **Worker Agents (Task Agents)** handle individual tasks and return their results to the supervisor.

This modular architecture supports scalability and provides centralized access to all agents through a single API.
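The supervisor–worker pattern described above can be pictured with a small toy example. This is only a sketch under stated assumptions: `ToySupervisor`, `format_description`, and `group_links` are hypothetical stand-ins written for illustration, not the framework's actual classes, and the real Task Agents rely on trained models rather than the trivial logic shown here.

```python
from typing import Callable, Dict

# Hypothetical stand-ins for Task Agents: each receives the request and
# returns its own section of the response.
def format_description(request: dict) -> dict:
    return {"formatted_abstract": f"# {request['title']}\n\n{request['abstract']}"}

def group_links(request: dict) -> dict:
    # Placeholder logic: every link is labelled "Other" in this sketch.
    return {"grouped_links": [dict(link, group="Other") for link in request.get("links", [])]}

class ToySupervisor:
    """Illustrative supervisor: dispatches to the selected workers and merges results."""

    def __init__(self, workers: Dict[str, Callable[[dict], dict]]):
        self.workers = workers
        self.response: dict = {}

    def execute(self, request: dict) -> None:
        result: dict = {}
        for name in request["selected_model"]:
            worker = self.workers.get(name)
            if worker is None:
                continue  # unknown agent names are simply skipped in this sketch
            result.update(worker(request))
        self.response = {"result": result}

toy = ToySupervisor({
    "description_formatting": format_description,
    "link_grouping": group_links,
})
toy.execute({
    "selected_model": ["description_formatting"],
    "title": "Example",
    "abstract": "Some text.",
})
print(toy.response)
```

The point of the design is that the caller only ever talks to the supervisor; adding another worker means registering one more entry in the dictionary.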

---

## System Overview

The diagram below shows a high-level view of the system architecture:

![System Design](data-discovery-ai-system-AI%20project%20simple.png)

The framework includes four types of Task Agents, each with a specific role:

- **Description Formatting Agent**
  Converts plain-text metadata descriptions into structured Markdown for improved readability.

- **Keyword Classification Agent**
  Predicts relevant AODN vocabulary keywords for records missing metadata parameters.

- **Delivery Mode Classification Agent**
  Classifies ongoing datasets as either real-time or delayed delivery mode.

- **Link Grouping Agent**
  Categorizes external links in metadata into one of four types:
  *Data Access*, *Documentation*, *Python Notebook*, or *Other*.

---

## Usage

The framework is implemented as a FastAPI application. The main endpoint is `/process_record`, which initializes a Supervisor Agent to coordinate the requested tasks.

In this notebook, we simulate a call to that endpoint with an example request, so that users can explore the system without running the full backend.
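For comparison, a real call to the running service would package the same request as JSON and send it over HTTP. The sketch below only builds such a request without sending it. The base URL, the use of `POST`, and the absence of any path prefix or authentication are assumptions made for illustration; check your deployment for the actual values.

```python
import json
import urllib.request

# Assumptions: the FastAPI app runs locally on port 8000 and exposes the
# endpoint at this path without a prefix. Adjust both for your deployment.
BASE_URL = "http://localhost:8000"
ENDPOINT = "/process_record"

def build_post(payload: dict) -> urllib.request.Request:
    """Package a request dictionary as a JSON POST; nothing is sent yet."""
    return urllib.request.Request(
        BASE_URL + ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

http_request = build_post({
    "selected_model": ["description_formatting"],
    "title": "Example",
    "abstract": "Some text.",
})
print(http_request.get_method(), http_request.full_url)

# To actually send it (requires a running backend, and possibly credentials):
# with urllib.request.urlopen(http_request) as reply:
#     print(json.loads(reply.read()))
```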

In [1]:
request = {
    "selected_model": ["description_formatting", "link_grouping"],
    "title": "IMOS - National Reef Monitoring Network Sub-Facility - Global cryptobenthic fish abundance",
    "abstract": "The National Reef Monitoring Network brings together shallow reef surveys conducted around Australia into a centralised database. The IMOS National Reef Monitoring Network sub-Facility collates, cleans, stores and makes this data rapidly available from contributors including: Reef Life Survey, Parks Australia, Department of Biodiversity, Conservation and Attractions (Western Australia), Department of Environment, Water and Natural Resources (South Australia), Department of Primary Industries (New South Wales), Tasmanian Parks and Wildlife Service and Parks Victoria. \n The data provided by the National Reef Monitoring Network contributes to establishing and supporting national marine baselines, and assisting with the management of Commonwealth and State marine reserves. \n Reef Life Survey (RLS) and the Australian Temperate Reef Network (ATRC) aims to improve biodiversity conservation and the sustainable management of marine resources by coordinating surveys of rocky and coral reefs using scientific methods, with the ultimate goal to improve coastal stewardship. Our activities depend on the skills of marine scientists, experienced and motivated recreational SCUBA divers, partnerships with management agencies and university researchers, and active input from the ATRC partners and RLS Advisory Committee.\n RLS and ATRC data are freely available to the public for non-profit purposes, so not only managers, but also groups such as local dive clubs or schools may use these data to look at changes over time in their own local reefs. By making data freely available and through public outputs, RLS and ATRC aims to raise broader community awareness of the status of Australia's marine biodiversity and associated conservation issues. \n This dataset contains records of cryptobenthic fishes collected by RLS and ATRC divers and partners along 50m transects on shallow rocky and coral reefs using standard methods. \nAbundance information is available for all species recorded within quantitative survey limits (50 x 1 m swathes either side of the transect line, each distinguished as a 'Block'), with divers searching the reef surface (including cracks) carefully for hidden fishes. These observations are recorded concurrently with the macroinvertebrate observations and together make up the 'Method 2' component of the surveys. For this method, typically one 'Block' is completed per 50 m transect for the program ATRC and 2 blocks are completed for RLS - although exceptions to this rule exist. \n This dataset supersedes the RLS specific \"Reef Life Survey (RLS): Cryptic Fish\" collection that was available at https://catalogue-rls.imas.utas.edu.au/geonetwork/srv/en/metadata.show?uuid=6a56db3f-d1b2-438d-98c6-bd7dd540a4d5 (provision of data was stopped in June 2021).",
    "lineage": "",
    "links": [{
        "href": "https://registry.opendata.aws/aodn_mooring_ctd_delayed_qc/",
        "rel": "related",
        "type": "text/html",
        "title": "Access To AWS Open Data Program registry for the Cloud Optimised version of this dataset"
      }]
}

In [2]:
from data_discovery_ai.agents.supervisorAgent import SupervisorAgent

supervisor = SupervisorAgent()

2025-05-12 11:10:53.263866: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2025-05-12 11:10:53.966716: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
  from .autonotebook import tqdm as notebook_tqdm
Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFBertModel: ['cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform

In [3]:
supervisor.execute(request)

In [4]:
supervisor.response

{'result': {'formatted_abstract': '# IMOS - National Reef Monitoring Network Sub-Facility - Global cryptobenthic fish abundance\n\nThe National Reef Monitoring Network brings together shallow reef surveys conducted around Australia into a centralised database. The IMOS National Reef Monitoring Network sub-Facility collates, cleans, stores and makes this data rapidly available from contributors including:\n\n- Reef Life Survey\n- Parks Australia\n- Department of Biodiversity, Conservation and Attractions (Western Australia)\n- Department of Environment, Water and Natural Resources (South Australia)\n- Department of Primary Industries (New South Wales)\n- Tasmanian Parks and Wildlife Service\n- Parks Victoria\n\nThe data provided by the National Reef Monitoring Network contributes to establishing and supporting national marine baselines, and assisting with the management of Commonwealth and State marine reserves.\n\nReef Life Survey (RLS) and the Australian Temperate Reef Network (ATRC) 

### Understanding the Output

The output is returned by the Supervisor Agent under the `"result"` key. Each Task Agent generates a distinct section of the response, identified by a unique key.

For example:
- `"formatted_abstract"` corresponds to the output from the **Description Formatting Agent**.
- `"grouped_links"` corresponds to the output from the **Link Grouping Agent**, which shows how external links are categorized.

These keys help you understand which part of the output comes from which agent.
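As a small illustration of reading the output, the sketch below looks up which agent produced each key under `"result"`. Only the two keys named in this tutorial are included; `summarise_response` and `sample_response` are hypothetical names, and the sample is a shortened, made-up response used purely for demonstration.

```python
# The two keys below are taken from this tutorial; the other agents add
# their own keys, which are not covered in this sketch.
KEY_TO_AGENT = {
    "formatted_abstract": "Description Formatting Agent",
    "grouped_links": "Link Grouping Agent",
}

def summarise_response(response: dict) -> dict:
    """Map each key found under "result" to the agent that produced it."""
    result = response.get("result", {})
    return {key: KEY_TO_AGENT.get(key, "unknown agent") for key in result}

# Hypothetical, shortened response used only for illustration.
sample_response = {"result": {"formatted_abstract": "# Title\n\nText", "grouped_links": []}}

for key, agent in summarise_response(sample_response).items():
    print(f"{key!r} was produced by the {agent}")
```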

---

### Try It Yourself

You can customise the request to suit your needs. Every request must include the required field `"selected_model"`, a list of strings naming the Task Agents to run.
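Before sending a customised request, it can help to check the `"selected_model"` field yourself. The helper below is a minimal sketch, not part of the framework: `check_selected_model` is a hypothetical name, and how the real service reacts to a missing or misspelled agent name is not covered here. It uses the four agent names supported by the framework.

```python
SUPPORTED_AGENTS = {
    "keyword_classification",
    "delivery_classification",
    "description_formatting",
    "link_grouping",
}

def check_selected_model(request: dict) -> list:
    """Return a list of problems found in the request (empty if it looks fine)."""
    selected = request.get("selected_model")
    if not isinstance(selected, list) or not selected:
        return ['"selected_model" must be a non-empty list of agent names']
    problems = []
    for name in selected:
        if not isinstance(name, str) or name not in SUPPORTED_AGENTS:
            problems.append(f"unsupported agent name: {name!r}")
    return problems

print(check_selected_model({"selected_model": ["link_grouping", "link_groupping"]}))
```

An empty list from the helper means the field is well formed; any entries describe what to fix before running the request.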

The following four agent types are currently supported (you can select one or more):

```json
["keyword_classification", "delivery_classification", "description_formatting", "link_grouping"]