### Elysia Tutorial: From Basics to Advanced (with Students use case)

This tutorial notebook walks you through Elysia — an agentic framework that chooses and runs tools — from the very basics to advanced customization, ending with a practical use case using a Students dataset.

What you'll do:
- Install and configure Elysia
- Create and run a minimal custom tool
- Connect to a Weaviate cluster and preprocess a collection
- Ask natural-language questions over data
- Build an advanced custom tool (linear regression)
- Apply it to a Students dataset to answer a real question

Dataset context: we'll reference a `Student` collection with fields like `Student_ID`, `Gender`, `Study_Hours`, `Attendance`, `Past_Exam`, `Parental_Education`, `Internet_Access`, `Extracurricular`, `Final_Exam`, `Pass_Fail`. Adjust field names if your schema differs.


### 0. Setup and Installation

- The cell below installs `elysia-ai` and, optionally, `weaviate-client`.
- If you plan to use Weaviate Cloud, you will need `WCD_URL` and `WCD_API_KEY`.
- Keep your keys secret. In Colab, you can use `google.colab.userdata`.

If you're not using Weaviate, you can still run the basic examples.


In [None]:
# Install core package (quiet to reduce noise)
%pip install -U elysia-ai --quiet

# Optional: client for interacting with Weaviate directly
%pip install -U weaviate-client --quiet


### 1. Configure Elysia

You must tell Elysia which models to use and how to access them. Elysia calls models through DSPy, which uses LiteLLM, so many providers are supported.

- Set your `OPENAI_API_KEY` (or other provider keys) here.
- If using Weaviate Cloud, also set `WCD_URL` and `WCD_API_KEY`.

You can leave Weaviate variables unset to run only the basic examples.
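One way to leave unset variables out is to pass only the settings that have values. The sketch below uses a hypothetical helper, `build_config_kwargs`, which is not part of Elysia. It only assembles a dictionary using the keyword names from the configuration cell in this tutorial.

```python
import os


def build_config_kwargs(env=None):
    """Collect configure() keyword arguments, skipping unset optional values.

    Hypothetical helper for illustration; not part of Elysia.
    """
    env = os.environ if env is None else env
    kwargs = {
        "base_model": "gpt-4o-mini",
        "base_provider": "openai",
        "complex_model": "gpt-4o",
        "complex_provider": "openai",
    }
    # Map environment variable names to the keyword names used in this tutorial.
    optional = {
        "OPENAI_API_KEY": "openai_api_key",
        "WCD_URL": "wcd_url",
        "WCD_API_KEY": "wcd_api_key",
    }
    for env_name, kwarg_name in optional.items():
        value = env.get(env_name, "").strip()
        if value:  # empty or missing values are left out entirely
            kwargs[kwarg_name] = value
    return kwargs


example = build_config_kwargs({"OPENAI_API_KEY": "sk-example", "WCD_URL": ""})
print(sorted(example))
```

With this in place, the call could be written as `configure(**build_config_kwargs())`, so the Weaviate settings appear only when both the variable and its value exist.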


In [None]:
# Configure models and (optionally) Weaviate
import os

# Colab users: uncomment to pull from Colab secrets
# from google.colab import userdata
# OPENAI_API_KEY = userdata.get("OPENAI_API_KEY")

OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY", "")
WCD_URL = os.environ.get("WCD_URL", "")
WCD_API_KEY = os.environ.get("WCD_API_KEY", "")

from elysia import configure

configure(
    base_model="gpt-4o-mini",
    base_provider="openai",
    complex_model="gpt-4o",
    complex_provider="openai",
    openai_api_key=OPENAI_API_KEY,
    wcd_url=WCD_URL or None,
    wcd_api_key=WCD_API_KEY or None,
)


### 2. Basic: Minimal Tool and Tree

A tool is just an async function decorated with `@tool`. The docstring becomes the tool's description.

We'll create `add(x, y)` and call the tree with a simple math question.
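To build intuition for what a decorator like `@tool` can capture, here is a toy registry written in plain Python. It is an illustration only: `toy_tool` and `TOY_REGISTRY` are hypothetical names, and this is not how Elysia implements `@tool`.

```python
import asyncio
import inspect

TOY_REGISTRY = {}  # hypothetical registry: tool name -> metadata


def toy_tool(func):
    """Register an async function, recording its docstring and parameters."""
    if not inspect.iscoroutinefunction(func):
        raise TypeError("tools must be declared with 'async def'")
    TOY_REGISTRY[func.__name__] = {
        # The docstring doubles as the human-readable description
        "description": inspect.getdoc(func) or "",
        "inputs": {
            name: param.annotation.__name__
            for name, param in inspect.signature(func).parameters.items()
        },
        "func": func,
    }
    return func


@toy_tool
async def add(x: int, y: int) -> int:
    """Return the sum of two integers x and y."""
    return x + y


entry = TOY_REGISTRY["add"]
# In a script, asyncio.run drives the coroutine; inside a notebook cell,
# use `await entry["func"](9009, 6006)` instead.
result = asyncio.run(entry["func"](9009, 6006))
print(entry["description"], entry["inputs"], result)
```

The point of the sketch is that a clear docstring and type hints give a decision agent enough information to decide when and how to call the function.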


In [None]:
from elysia import tool, Tree

# Create a decision tree
basic_tree = Tree()

@tool(tree=basic_tree)
async def add(x: int, y: int) -> int:
    """Return the sum of two integers x and y."""
    return x + y

# Run the tree with a simple prompt
basic_response = basic_tree("What is the sum of 9009 and 6006?")
basic_response


### 3. Connecting to Weaviate (Optional)

If you have a Weaviate Cloud instance, set `WCD_URL` and `WCD_API_KEY` above. Elysia can then query your collections.

We must preprocess a collection so Elysia understands field names, types, and summary context.


In [None]:
# Preprocess collections only if Weaviate credentials are provided
from elysia import preprocess as preprocess_sync

if WCD_URL and WCD_API_KEY:
    try:
        preprocess_sync("Student")
        print("Preprocessed 'Student' collection.")
    except Exception as e:
        print("Preprocess skipped or failed:", e)
else:
    print("Weaviate not configured; skipping preprocessing.")


### 4. Querying with Natural Language

Once preprocessed, you can ask questions in natural language. Elysia chooses the right tools (query, aggregate, summarization) to answer.


In [None]:
import elysia

qa_tree = elysia.Tree()

# If Weaviate is configured, pass the collection name(s)
if WCD_URL and WCD_API_KEY:
    qa_response = qa_tree(
        "What is this dataset about?",
        collection_names=["Student"],
    )
else:
    # Without Weaviate, this will still run but won't retrieve data
    qa_response = qa_tree("Describe the goal of this tutorial.")

qa_response


### 5. Advanced: Custom Analysis Tool (Linear Regression)

Sometimes you need custom analysis beyond built-in tools. Below we implement a `BasicLinearRegression` tool as a class that Elysia can call. It:
- Extracts numeric fields from retrieved objects in the environment
- Fits a simple linear regression (with intercept) using NumPy
- Returns coefficients to the environment and plots the fit
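The first step, extracting fields from the environment, can be sketched on its own with a mock dictionary. The nesting used here (`environment[key][inner_key]` holding a list of dicts with an `"objects"` list) mirrors what the tool code in this section indexes into. The helper name `collect_pairs` and the sample values are made up for illustration.

```python
def collect_pairs(environment, environment_key, x_field, y_field):
    """Gather (x, y) pairs from every object that has both fields."""
    pairs = []
    for result_lists in environment.get(environment_key, {}).values():
        for result in result_lists:
            for obj in result["objects"]:
                if x_field in obj and y_field in obj:
                    pairs.append((obj[x_field], obj[y_field]))
    return pairs


# Mock environment with the assumed nesting:
# environment[tool_key][inner_key] -> list of {"objects": [...]}
mock_environment = {
    "query": {
        "Student": [
            {
                "objects": [
                    {"Study_Hours": 2.0, "Final_Exam": 55.0},
                    {"Study_Hours": 5.0, "Final_Exam": 70.0},
                    {"Study_Hours": 4.0},  # skipped: no Final_Exam
                ]
            }
        ]
    }
}

pairs = collect_pairs(mock_environment, "query", "Study_Hours", "Final_Exam")
print(pairs)
```

Collecting x and y in a single pass keeps the two columns aligned by construction, which is worth checking whenever the fields are gathered separately.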

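Theory (very short): for a single feature \(x\) and target \(y\), least squares chooses the coefficients \(\beta = (\beta_0, \beta_1)\) that minimize \(\sum_i (y_i - (\beta_0 + \beta_1 x_i))^2\). In matrix form, \(\beta = (X^T X)^{-1} X^T y\), where \(X\) has a leading column of ones for the intercept. The implementation below adds a tiny ridge term, \(10^{-10} I\), to \(X^T X\) so the inverse stays numerically stable.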
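As a quick check of the matrix formula, the sketch below fits synthetic points that lie exactly on the line y = 1 + 2x. The data are made up, and `np.linalg.solve` is used here instead of an explicit inverse, which is a common and numerically safer way to solve the same normal equations.

```python
import numpy as np

# Synthetic data lying exactly on y = 1 + 2x
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x

# Design matrix with a leading column of ones for the intercept
X = np.column_stack([np.ones_like(x), x])

# Solve (X^T X) beta = X^T y rather than forming the inverse explicitly
beta = np.linalg.solve(X.T @ X, X.T @ y)
intercept, slope = beta
print(intercept, slope)
```

The recovered intercept and slope should match 1 and 2 up to floating-point rounding.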


In [None]:
from elysia import Error, Tool, Result
import numpy as np
import matplotlib.pyplot as plt

class BasicLinearRegression(Tool):
    def __init__(self, logger=None, **kwargs):
        super().__init__(
            name="basic_linear_regression_tool",
            description=(
                "Use this tool to perform linear regression between two numeric variables "
                "found in retrieved objects in the environment."
            ),
            status="Running linear regression...",
            inputs={
                "environment_key": {
                    "description": (
                        "A key of the environment to use (e.g., 'query'). "
                        "All objects under that key will be used."
                    ),
                    "required": True,
                    "type": str,
                    "default": None,
                },
                "x_variable_field": {
                    "description": "Independent variable field name.",
                    "required": True,
                    "type": str,
                    "default": None,
                },
                "y_variable_field": {
                    "description": "Dependent variable field name.",
                    "required": True,
                    "type": str,
                    "default": None,
                },
            },
            end=False,
        )

    async def __call__(
        self,
        tree_data,
        inputs,
        base_lm,
        complex_lm,
        client_manager,
        **kwargs,
    ):
        environment = tree_data.environment.environment
        environment_key = inputs["environment_key"]
        x_variable_field = inputs["x_variable_field"]
        y_variable_field = inputs["y_variable_field"]

        try:
            X = np.empty((0, 2))
            y = np.empty((0, 1))

            for inner_key in environment.get(environment_key, {}):
                inner_X = np.array(
                    [
                        [obj[x_variable_field]]
                        for environment_list in environment[environment_key][inner_key]
                        for obj in environment_list["objects"]
                        if x_variable_field in obj and y_variable_field in obj
                    ]
                )
                if inner_X.size == 0:
                    continue
                inner_X = np.hstack([np.ones((inner_X.shape[0], 1)), inner_X])
                X = np.vstack([X, inner_X])

                inner_y = np.array(
                    [
                        [obj[y_variable_field]]
                        for environment_list in environment[environment_key][inner_key]
                        for obj in environment_list["objects"]
                        if x_variable_field in obj and y_variable_field in obj
                    ]
                )
                y = np.vstack([y, inner_y])

            if X.shape[0] == 0:
                yield Error(
                    "No rows with both fields present. Check field names or query step."
                )
                return

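            # Normal equations; the tiny ridge term keeps X^T X invertible
            beta_hat = np.linalg.inv(X.T @ X + 1e-10 * np.eye(X.shape[1])) @ X.T @ y
            # beta_hat has shape (2, 1), so index both axes to get plain scalars
            beta_hat_dict = {
                "intercept": float(beta_hat[0, 0]),
                "slope": float(beta_hat[1, 0]),
            }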
            pred_y = X @ beta_hat

            fig, ax = plt.subplots()
            ax.scatter(X[:, 1], y)
            ax.plot(X[:, 1], pred_y, color="red")
            ax.set_title(
                f"Linear regression between {x_variable_field} and {y_variable_field}"
            )
            ax.set_xlabel(x_variable_field)
            ax.set_ylabel(y_variable_field)
            fig.show()

            yield Result(
                objects=[beta_hat_dict],
                metadata={
                    "x_variable_field": x_variable_field,
                    "y_variable_field": y_variable_field,
                },
                llm_message=(
                    "Completed linear regression analysis where: "
                    f"x={x_variable_field}, y={y_variable_field}."
                ),
            )
        except Exception as e:
            yield Error(str(e))

    async def is_tool_available(self, tree_data, base_lm, complex_lm, client_manager):
        return (
            "query" in tree_data.environment.environment
            and len(tree_data.environment.environment["query"]) > 0
        )


### 6. Use case: Students dataset — Do more study hours correlate with final exam score?

We'll:
1) Query the `Student` collection to bring objects into the environment
2) Run `BasicLinearRegression` with `x=Study_Hours` and `y=Final_Exam`

If your schema uses different field names, adjust the strings in the cell below.


In [None]:
students_tree = elysia.Tree()

# Add our custom tool to the tree
students_tree.add_tool(BasicLinearRegression)

if WCD_URL and WCD_API_KEY:
    # Step 1: bring data into the environment
    response1 = students_tree(
        "Retrieve student records with study hours and final exam scores",
        collection_names=["Student"],
    )

    # Step 2: run the regression tool
    response2 = students_tree(
        "Run linear regression with x=Study_Hours and y=Final_Exam",
        # Hints to the decision agent via the prompt; the tool will still validate fields
    )

    # A bare expression inside an `if` block is not displayed, so print explicitly
    print(response1)
    print(response2)
else:
    print("Weaviate not configured; skipping Students use case execution.")


### 7. Interpreting Results

- The tool returns the intercept and slope as a result object (it does not print them). A positive slope means that higher `Study_Hours` is associated with higher `Final_Exam`.
- The scatterplot displays data points with a red best-fit line.
- Always validate with domain knowledge; correlation does not imply causation.
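To make the link between slope and correlation concrete, the sketch below fits a line to a handful of made-up values (not taken from the Students dataset) and compares the slope with the Pearson correlation. It uses `np.polyfit` and `np.corrcoef` rather than the tool above.

```python
import numpy as np

# Made-up illustrative values, not taken from the Students dataset
hours = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
scores = np.array([52.0, 58.0, 61.0, 70.0, 74.0])

slope, intercept = np.polyfit(hours, scores, deg=1)  # least-squares line
r = np.corrcoef(hours, scores)[0, 1]                 # Pearson correlation

# For a single feature, slope = r * std(y) / std(x), so the signs always match
slope_from_r = r * scores.std() / hours.std()

# Using the fitted line for a prediction at 6 study hours
predicted_at_6h = intercept + slope * 6.0
print(f"slope={slope:.2f}, intercept={intercept:.2f}, r={r:.3f}")
print(f"predicted score at 6 hours: {predicted_at_6h:.1f}")
```

The slope reports how many exam points are associated with one extra study hour, while `r` reports how tightly the points follow the line. A steep slope with a weak correlation deserves more caution than the slope alone suggests.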

Next steps:
- Try different x/y fields (e.g., `Attendance` vs `Final_Exam`)
- Add categorical handling or multi-feature regression
- Build more tools for diagnostics (R², residual analysis)
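As a starting point for the multi-feature and diagnostics ideas, here is a minimal NumPy sketch. The function name `fit_with_diagnostics` and the sample rows are hypothetical. It is a standalone function, not an Elysia tool, and would still need wrapping in a `Tool` class like the one above.

```python
import numpy as np


def fit_with_diagnostics(features, target):
    """Least-squares fit with intercept; returns coefficients, R^2, residuals."""
    features = np.asarray(features, dtype=float)
    target = np.asarray(target, dtype=float)
    if features.ndim == 1:
        features = features.reshape(-1, 1)  # allow a single feature
    X = np.column_stack([np.ones(len(features)), features])
    # lstsq copes with rank-deficient designs better than an explicit inverse
    beta, *_ = np.linalg.lstsq(X, target, rcond=None)
    residuals = target - X @ beta
    ss_res = float(residuals @ residuals)
    ss_tot = float(((target - target.mean()) ** 2).sum())
    r_squared = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")
    return beta, r_squared, residuals


# Made-up rows: the columns stand in for Study_Hours and Attendance
features = [[1.0, 60.0], [2.0, 80.0], [3.0, 70.0], [4.0, 90.0], [5.0, 85.0]]
target = [50.0, 62.0, 63.0, 76.0, 79.0]
beta, r_squared, residuals = fit_with_diagnostics(features, target)
print(beta, r_squared)
```

Plotting `residuals` against the fitted values is a simple next check: a visible curve or funnel shape suggests that a straight-line model is missing something.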


In [None]:
!pip install -U elysia-ai --quiet


In [2]:
import os
from google.colab import userdata

OPENAI_API_KEY = userdata.get("OPENAI_API_KEY")

from elysia import configure

configure(
    base_model="gpt-4.1-mini",
    base_provider="openai",
    complex_model="gpt-4.1",
    complex_provider="openai",
    openai_api_key=OPENAI_API_KEY,
)

In [3]:
from elysia import tool, Tree

tree = Tree()

@tool(tree=tree)
async def add(x: int, y: int) -> int:
    return x + y

tree("What is the sum of 9009 and 6006?")


('I will calculate the sum of 9009 and 6006 using the addition function. The sum of 9009 and 6006 is 15015.',
 [[{'tool_result': 15015, '_REF_ID': 'add_default_0_0'}]])
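The value shown above is a two-element tuple: the response text and a nested list of result objects. The sketch below copies that output verbatim and unpacks it. Whether other calls return exactly this nesting is an assumption based on this single output.

```python
# The value shown in the output above, copied verbatim
returned = (
    "I will calculate the sum of 9009 and 6006 using the addition function. "
    "The sum of 9009 and 6006 is 15015.",
    [[{"tool_result": 15015, "_REF_ID": "add_default_0_0"}]],
)

text, retrieved = returned  # (response text, nested list of result objects)

# Flatten the nested lists and read each tool result
tool_results = [obj["tool_result"] for group in retrieved for obj in group]
print(text)
print(tool_results)
```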

### Data ingestion to Weaviate (short, with advanced options)

The following cell creates a collection and imports JSON data with vectorization. Adjust provider and schema per your data.
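A minimal sketch of such an ingestion step, assuming `weaviate-client` v4, a JSON array of student rows, and OpenAI vectorization. The helper names `load_student_records` and `import_students` are hypothetical, the sample row is made up, and property types are left to Weaviate's auto-schema. Newer client releases may prefer a different parameter than `vectorizer_config`, so check the client documentation for your version.

```python
import json


def load_student_records(json_text):
    """Parse a JSON array of student rows and coerce numeric fields to float."""
    numeric_fields = ("Study_Hours", "Attendance", "Past_Exam", "Final_Exam")
    records = []
    for row in json.loads(json_text):
        record = dict(row)
        for field in numeric_fields:
            if field in record and record[field] is not None:
                record[field] = float(record[field])
        records.append(record)
    return records


def import_students(records, cluster_url, api_key, openai_api_key):
    """Create a 'Student' collection and insert records (needs weaviate-client v4)."""
    # Imported here so the parsing helper above works without the client installed
    import weaviate
    from weaviate.classes.config import Configure
    from weaviate.classes.init import Auth

    client = weaviate.connect_to_weaviate_cloud(
        cluster_url=cluster_url,
        auth_credentials=Auth.api_key(api_key),
        headers={"X-OpenAI-Api-Key": openai_api_key},
    )
    try:
        students = client.collections.create(
            name="Student",
            # Assumption: OpenAI vectorizer; swap in your own provider
            vectorizer_config=Configure.Vectorizer.text2vec_openai(),
        )
        result = students.data.insert_many(records)
        if result.has_errors:
            print("Some objects failed:", result.errors)
    finally:
        client.close()


# Made-up sample row to exercise the parsing step only
sample_json = '[{"Student_ID": "S001", "Study_Hours": "12", "Final_Exam": 71}]'
records = load_student_records(sample_json)
print(records)
```

Calling `import_students(records, WCD_URL, WCD_API_KEY, OPENAI_API_KEY)` would then perform the upload. Coercing numeric fields beforehand matters because the regression tool later expects numbers, not strings.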


In [4]:
!pip install weaviate-client --quiet

In [9]:
from elysia import configure

configure(
    base_model="gpt-4.1-mini",
    base_provider="openai",
    complex_model="gpt-4.1",
    complex_provider="openai",
    openai_api_key=OPENAI_API_KEY,
    # Read Weaviate credentials from Colab secrets instead of hard-coding them
    wcd_url=userdata.get("WCD_URL"),
    wcd_api_key=userdata.get("WCD_API_KEY"),
)


In [7]:
from elysia import preprocess

In [10]:
from elysia.preprocess.collection import preprocess
preprocess("Student")


In [13]:
import elysia
tree = elysia.Tree()
response, objects = tree(
    "What is this data about?",
    collection_names=["Student"],
)
