# Weaviate Transformation Agent - Workshop

<a target="_blank" href="https://colab.research.google.com/github/weaviate-tutorials/intro-to-weaviate-agents/blob/main/transformation-agent-workshop.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

### Prerequisites

1. Log in to your [Weaviate Cloud](https://console.weaviate.cloud) account (sign up if you don't have one yet)
1. Create a Weaviate Cloud [Sandbox](https://weaviate.io/developers/wcs/manage-clusters/create#sandbox-clusters) instance
1. Go to the 'Embedding' tab (on the left column) and enable `Weaviate Embeddings`
1. Take note of the `REST Endpoint` and an `Admin` `API Key`.
1. Set `WEAVIATE_CLOUD_URL` to the `REST Endpoint` and `WEAVIATE_CLOUD_API_KEY` to the `Admin` `API Key`.
    - (Option 1): **If using Google Colab**
        - Set the values in the "Secrets" tab in the left column.
        <img src="./img/colab_secrets.png" alt="Set the values in the Colab Secrets tab" width="400"/>
    - (Option 2): **Using an environment with a copy of the repository**
        - Update the values in the `.env` file in the root directory of this repository.
        - Make sure to restart the Jupyter notebook after updating & saving the `.env` file.
        

Load our secrets (Weaviate URL & API key)

In [None]:
def is_colab():
    """Check if the current notebook is running in Google Colab."""
    try:
        import google.colab
        return True
    except ImportError:
        return False


if is_colab():
    from google.colab import userdata
    weaviate_url = userdata.get('WEAVIATE_CLOUD_URL')
    weaviate_api_key = userdata.get('WEAVIATE_CLOUD_API_KEY')
    print("Running in Colab, secrets retrieved. URL:", weaviate_url)

else:
    import os
    import dotenv

    dotenv.load_dotenv()

    # Remember to Update the .env file & RESTART the kernel (if running a local environment)
    weaviate_url = os.getenv("WEAVIATE_CLOUD_URL")
    weaviate_api_key = os.getenv("WEAVIATE_CLOUD_API_KEY")
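
In the local branch, `os.getenv` returns `None` when a variable is not set, so a misnamed entry in `.env` only surfaces later as a connection error. A minimal sketch of an early check follows; `missing_settings` is a hypothetical helper (not part of the Weaviate client), and the values shown are stand-ins for `weaviate_url` and `weaviate_api_key`.

```python
def missing_settings(settings):
    """Return the names of settings that are unset or blank (hypothetical helper)."""
    return [name for name, value in settings.items() if not (value and value.strip())]


# Stand-in values for illustration; in the notebook, pass weaviate_url and weaviate_api_key.
example = {
    "WEAVIATE_CLOUD_URL": "https://example.weaviate.cloud",
    "WEAVIATE_CLOUD_API_KEY": None,
}
missing = missing_settings(example)
if missing:
    print("Missing settings:", ", ".join(missing))
```

Running a check like this right after loading the secrets makes it clearer whether a later failure comes from configuration or from the cluster itself.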

In [None]:
# If in Colab, install the packages:

!pip install weaviate-client[agents] datasets

## Introduction

### Agenda

Let's talk about:
- What the Transformation Agent is
- What you can do with the Transformation Agent
- Some tips & tricks
- How to get started

### About the Transformation Agent

The *Weaviate Transformation Agent* is 

- A cloud-based service 
- for transforming your data in a Weaviate instance
- available for Weaviate Cloud users

It is also in **technical preview** (do **not** use it in production).

<center><img src="img/agents_tech_preview.png" width="60%"></center>

> ⚠️ The Weaviate Transformation Agent modifies data objects in Weaviate. **While the Agent is in technical preview, do not use it in a production environment.** 
> 
> The Agent may not work as expected, and the data in your Weaviate instance may be affected in unexpected ways.

**What the Transformation Agent is**

<center><img src="img/ta_obj.png" width="60%"></center>

The `TransformationAgent` can modify objects in a Weaviate collection to add new properties or update existing properties.

**What you can do with the Transformation Agent**

<center><img src="img/ta_overview.png" width="60%"></center>

Provide instructions to the `TransformationAgent` in natural language, along with the other required parameters (such as the property name, its data type, and the existing properties to view).

## Preparation

Here, we are going to use the [**Weaviate/ArxivPapers**](https://huggingface.co/datasets/weaviate/agents/viewer/transformation-agent-papers) dataset.

It includes titles and abstracts of a few research papers.

First, we load the dataset & add it to Weaviate.

### Load dataset

In [None]:
from datasets import load_dataset

papers_dataset = load_dataset("weaviate/agents", "transformation-agent-papers", split="train")

In [None]:
print(papers_dataset.shape)
print(papers_dataset[0]["properties"].keys())

In [None]:
for k, v in papers_dataset[0]["properties"].items():
    if len(v) > 100:
        v = v[:100] + "..."
    print(f"{k}: {v}")

Iterate through the data

In [None]:
columns = papers_dataset[0]["properties"].keys()

for i, item in enumerate(papers_dataset):
    if i < 2:
        properties = {
            col: item["properties"][col] for col in columns
        }
        print(properties)

### Ingest data into Weaviate

#### Connect to Weaviate

In [None]:
weaviate_url

In [None]:
import weaviate
from weaviate.classes.init import Auth

client = weaviate.connect_to_weaviate_cloud(
    cluster_url=weaviate_url, auth_credentials=Auth.api_key(weaviate_api_key)
)

assert client.is_ready()

#### Set up a collection

**Important:** Make sure to enable 'Embeddings' in the Weaviate Cloud console. 

[See above](#prerequisites)

In [None]:
from weaviate.classes.config import Configure, Property, DataType

collection_name = "ArxivPapersDemo"

# Delete the collection to (re)start fresh; comment this line out to keep existing data
client.collections.delete(collection_name)

if client.collections.exists(collection_name):
    # For re-running this tutorial, do nothing
    pass
else:
    client.collections.create(
        collection_name,
        description="A dataset that lists research paper titles and abstracts",
        properties=[
            Property(name="title", data_type=DataType.TEXT),
            Property(name="abstract", data_type=DataType.TEXT),
        ],
        vectorizer_config=[
            Configure.NamedVectors.text2vec_weaviate(
                name="default",
                source_properties=["title", "abstract"],
            )
        ]
    )

#### Add data to Weaviate

We loop through the data and add it to Weaviate. 

For the demo/workshop, we add only a few rows for speed and simplicity.

In [None]:
papers_collection = client.collections.get(collection_name)
columns = papers_dataset[0]["properties"].keys()

with papers_collection.batch.fixed_size(100) as batch:
    for i, item in enumerate(papers_dataset):
        if i < 50:
            properties = {col: item["properties"][col] for col in columns}
            batch.add_object(properties=properties)


if papers_collection.batch.failed_objects:
    for fo in papers_collection.batch.failed_objects[:3]:
        print(fo.message)
        print(fo.object_)
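
The loop above enumerates every row of the dataset and skips those past the first 50. As a sketch of an alternative, `itertools.islice` stops iterating once the requested number of rows has been read. The helper name `first_n_properties` is hypothetical, and the rows below are stand-ins with the same shape as the items in `papers_dataset`.

```python
from itertools import islice


def first_n_properties(rows, n):
    """Yield the 'properties' dict of the first n rows, then stop iterating."""
    for row in islice(rows, n):
        yield dict(row["properties"])


# Stand-in rows with the same shape as the dataset items used above.
rows = [{"properties": {"title": f"Paper {i}", "abstract": f"Abstract {i}"}} for i in range(200)]

selected = list(first_n_properties(rows, 50))
print(len(selected))
print(selected[0])
```

Each yielded dict could then be passed to `batch.add_object(properties=...)` in the same way as in the cell above.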

In [None]:
len(papers_collection)

#### Inspect the collection 



In [None]:
response = papers_collection.query.fetch_objects(
    limit=3,
    include_vector=True
)

for o in response.objects:
    for k, v in o.properties.items():
        print(f"{k}: {v[:50]}")
    print()
    print(o.vector["default"][:10])  # No need to print the entire vector

**Alternative: Use the `Explorer` cloud tool**

On Weaviate Cloud Console, click on the `Explorer` tab on the left column.

When you click on each object, you should see 2 properties:
- `title`
- `abstract`

As well as its `vectors`

## Using the original dataset:


### Can you find what you need?

Can you find papers about a specific topic (e.g. machine learning)?

In [None]:
response = papers_collection.query.near_text(
    query="machine learning",
    limit=5
)

for o in response.objects:
    print(o.properties["title"])

Can you filter only for papers with a particular main topic? (e.g. classification)

In [None]:
## ???
## Not sure if we actually can do this with the current data

### Does your data meet your needs?

What if: 
- The data is in the wrong language?
- Each abstract is too long?

Would you want to perform a RAG query each time?




## Try the Weaviate Transformation Agent 

### Task 1: Create a `topics` property

Define the operation(s) that you want to perform on the data.

In [None]:
prompt_create_topics = """
Create a list of topic tags based on the abstract.
Topics should be distinct from each other. Provide a maximum of 5 topics.
Group similar topics under one topic tag.
"""

In [None]:
from weaviate.agents.classes import Operations

add_topics = Operations.append_property(
    property_name="topics",             # Property to create
    data_type=DataType.TEXT_ARRAY,      # Data type of the property
    view_properties=["abstract"],       # Existing properties to view for the operation
    instruction=prompt_create_topics,   # Instruction to the Transformation Agent
)

Instantiate the agent & start the operations

In [None]:
from weaviate.agents.transformation import TransformationAgent

ta = TransformationAgent(
    client=client,              # Weaviate client object
    collection=collection_name, # Collection name
    operations=[add_topics]     # List of transform operations
)

ta_response = ta.update_all()

What does the response look like?

In [None]:
ta_response

The response contains the unique `workflow_id` of the operations. 

This does not mean that the operations are finished!

**The Transformation Agent is asynchronous**. You can check the status of the operation using the `workflow_id`.

In [None]:
ta.get_status(workflow_id=ta_response.workflow_id)

We can periodically check if the operation is done

In [None]:
def get_ta_status(agent_instance, workflow_id):
    # Rough code to check the status of the TA workflow
    import time
    from datetime import datetime

    while True:
        status = agent_instance.get_status(workflow_id=workflow_id)

        if status["status"]["state"] != "running":
            break

        # Calculate elapsed time from start_time
        start = datetime.strptime(status["status"]["start_time"], "%Y-%m-%d %H:%M:%S")
        elapsed = (datetime.now() - start).total_seconds()

        print(f"Waiting... Elapsed time: {elapsed:.2f} seconds")
        time.sleep(10)

    # Calculate total time
    if status["status"]["total_duration"]:
        total = status["status"]["total_duration"]
    else:
        start = datetime.strptime(status["status"]["start_time"], "%Y-%m-%d %H:%M:%S")
        end = datetime.now() if not status["status"]["end_time"] else datetime.strptime(status["status"]["end_time"], "%Y-%m-%d %H:%M:%S")
        total = (end - start).total_seconds()

    print(f"Total time: {total:.2f} seconds")
    print(status)

In [None]:
get_ta_status(agent_instance=ta, workflow_id=ta_response.workflow_id)
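
The loop in `get_ta_status` runs until the workflow leaves the `running` state, so a stalled workflow would block the notebook indefinitely. Below is a sketch of a bounded variant, assuming the same `status["status"]["state"]` shape used above. The name `wait_for_workflow`, its parameters, and the `completed` state in the stand-in data are assumptions for illustration; `fetch_status` can be any zero-argument callable, for example `lambda: ta.get_status(workflow_id=ta_response.workflow_id)`.

```python
import time


def wait_for_workflow(fetch_status, timeout_s=600.0, poll_s=10.0,
                      sleep=time.sleep, clock=time.monotonic):
    """Poll fetch_status() until the state is not 'running' or the timeout passes.

    Returns the last status dict; raises TimeoutError if still running at the deadline.
    """
    deadline = clock() + timeout_s
    while True:
        status = fetch_status()
        if status["status"]["state"] != "running":
            return status
        if clock() >= deadline:
            raise TimeoutError(f"Workflow still running after {timeout_s} seconds")
        sleep(poll_s)


# Stand-in status source for illustration: 'running' twice, then an assumed final state.
fake_states = iter(["running", "running", "completed"])
final = wait_for_workflow(
    lambda: {"status": {"state": next(fake_states)}},
    poll_s=0.0,
)
print(final["status"]["state"])
```

Passing `sleep` and `clock` as parameters keeps the function easy to exercise without real waiting.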

**How the Transformation Agent works**

<center><img src="img/ta_schematic.png" width="60%"></center>

The `TransformationAgent` connects to your Weaviate Cloud instance, and uses LLMs to follow your instructions.

When the operation is complete - let's see what we can do with the data:

In [None]:
from weaviate.classes.query import Metrics

response = papers_collection.aggregate.over_all(
    return_metrics=Metrics("topics").text(
        top_occurrences_count=True,
        top_occurrences_value=True,
        min_occurrences=10
    )
)

for t in response.properties["topics"].top_occurrences:
    print(t)

Try to filter for papers with particular topics:

In [None]:
from weaviate.classes.query import Filter

response = papers_collection.query.fetch_objects(
    limit=3,
    filters=Filter.by_property("topics").like("*machine*")
)

for o in response.objects:
    print(o.properties["title"])

Inspect an object again:

In [None]:
response = papers_collection.query.fetch_objects(
    limit=3,
)

for o in response.objects:
    for k, v in o.properties.items():
        print(f"{k}: {v[:50]}")
    print()
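
Since the new `topics` values are generated by an LLM, it can be worth checking them against the constraints stated in the prompt (at most 5 topics, distinct from each other). The following is a minimal sketch under those assumptions; `check_topics` is a hypothetical helper and the lists are stand-in values rather than real agent output.

```python
def check_topics(topics, max_topics=5):
    """Return a list of problems found in one object's topics (empty if none)."""
    problems = []
    if len(topics) > max_topics:
        problems.append(f"{len(topics)} topics (max {max_topics})")
    # Compare case-insensitively and ignore surrounding whitespace
    normalized = [t.strip().lower() for t in topics]
    if len(set(normalized)) != len(normalized):
        problems.append("duplicate topics")
    return problems


# Stand-in values; in the notebook, pass o.properties["topics"] for each fetched object.
print(check_topics(["Machine Learning", "Classification"]))
print(check_topics(["NLP", "nlp ", "Vision"]))
```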

### Task 2: Perform multiple operations

- Update the `title` property to append a French translation
- Add a `paper_type` property (e.g. `survey`, `model`, `resource`)
- Add a boolean property `about_classification` (True/False)

In [None]:
prompt_paper_type = """
Determine the primary type of paper based on the abstract. Assign exactly one of the following categories that best represents the paper's main contribution:

'survey':   Comprehensive review or meta-analysis of existing work in a field
'model':    Introduction of a new predictive model, statistical method, or algorithmic approach
'system':   Description of a new data pipeline, workflow, framework, or system architecture
'analysis': Focused on insights derived from analyzing data
'resource': Introduction of a new dataset, benchmark, or tool for data science
'other':    None of the above
"""

add_paper_type = Operations.append_property(
      property_name="paper_type",
      data_type=DataType.TEXT,
      view_properties=["abstract"],
      instruction=prompt_paper_type,
)

In [None]:
prompt_about_classification = """
Based on the abstract, determine whether the paper is
primarily about the machine learning field of classification.

Do not include papers that are only obliquely or vaguely about classification.
"""

add_about_classification_bool = Operations.append_property(
    property_name="about_classification",
    data_type=DataType.BOOL,
    view_properties=["abstract"],
    instruction=prompt_about_classification,
)

In [None]:
prompt_add_french_title_suffix = """
Update the title to ensure that it contains the French translation of itself in parentheses, after the original title.
"""

update_title = Operations.update_property(
    property_name="title",
    view_properties=["title"],
    instruction=prompt_add_french_title_suffix,
)

In [None]:
from weaviate.agents.transformation import TransformationAgent

ta = TransformationAgent(
    client=client,
    collection=collection_name,
    operations=[
        update_title,
        add_paper_type,
        add_about_classification_bool
    ],
)

ta_response = ta.update_all()

Note that this still returns one object, with one workflow ID, even though we are performing multiple operations.

In [None]:
ta.get_status(workflow_id=ta_response.workflow_id)

Let's monitor the operation as before:

In [None]:
get_ta_status(agent_instance=ta, workflow_id=ta_response.workflow_id)

And again, inspect a few transformed objects:

In [None]:
response = papers_collection.query.fetch_objects(
    limit=3,
)

for o in response.objects:
    for k, v in o.properties.items():
        if isinstance(v, str) and len(v) > 50:
            v = v[:50] + "..."
        print(f"{k}: {v}")
    print()

We see it did, in fact, perform all the specified transformation operations.

We can now use these improved properties to perform new queries. 

- e.g. what paper types do we have?

In [None]:
from weaviate.classes.query import Metrics

response = papers_collection.aggregate.over_all(
    return_metrics=Metrics("paper_type").text(
        top_occurrences_count=True,
        top_occurrences_value=True,
        min_occurrences=10
    )
)

for t in response.properties["paper_type"].top_occurrences:
    print(t)
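
The prompt asks for exactly one of six categories, but `paper_type` is a plain text property, so nothing in the schema enforces that list. Here is a sketch that tallies the values and flags any that fall outside it, using `collections.Counter`. The helper `summarize_paper_types` is hypothetical and the input is a stand-in list, not real agent output.

```python
from collections import Counter

# Categories listed in the prompt above
ALLOWED_PAPER_TYPES = {"survey", "model", "system", "analysis", "resource", "other"}


def summarize_paper_types(values):
    """Count each paper_type and collect values outside the allowed categories."""
    counts = Counter(values)
    unexpected = sorted(v for v in counts if v not in ALLOWED_PAPER_TYPES)
    return counts, unexpected


# Stand-in values; in the notebook, collect o.properties["paper_type"] from fetched objects.
counts, unexpected = summarize_paper_types(["model", "survey", "model", "Model", "dataset"])
print(counts.most_common())
print("Unexpected:", unexpected)
```

The comparison is deliberately case-sensitive, because an exact-match filter such as `equal("model")` would also treat `Model` as a different value.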

How many objects are about classification?

In [None]:
from weaviate.classes.query import Filter

response = papers_collection.aggregate.over_all(
    filters=Filter.by_property("about_classification").equal(True),
)

response.total_count

In [None]:
from weaviate.classes.query import Filter

response = papers_collection.query.fetch_objects(
    filters=Filter.by_property("about_classification").equal(True),
    limit=10
)

for o in response.objects:
    print(o.properties["title"])

What about intersections of multiple properties?

In [None]:
from weaviate.classes.query import Filter

response = papers_collection.aggregate.over_all(
    filters=(
        Filter.by_property("paper_type").equal("model") &
        Filter.by_property("about_classification").equal(True)
    )
)

response.total_count

Let's take a look at a few:

In [None]:
from weaviate.classes.query import Filter

response = papers_collection.query.near_text(
    query="vector",
    filters=(
        Filter.by_property("paper_type").equal("model") &
        Filter.by_property("about_classification").equal(True)
    )
)

for o in response.objects:
    print(o.properties["title"])

## Bonus: Use the Query Agent

The Weaviate [Query Agent](https://weaviate.io/developers/agents/query) is another agentic service on Weaviate Cloud. The Query Agent allows you to query your Weaviate instance using natural language.

In [None]:
from weaviate.agents.query import QueryAgent

qa = QueryAgent(
    client=client, collections=[collection_name]
)

Now, we can just tell the Query Agent to do the hard & boring stuff (syntax lookup!) for us.

In [None]:
# Perform a query
response = qa.run(
    """
    Find papers that are about classification. Tell me about some of them.
    Hint: There is a property called 'about_classification' that you can use.
    """,
)

# Print the response
response.display()

In [None]:
# Perform a query
response = qa.run(
    """
    How many papers are primarily about models?

    Hint: There is a property called 'paper_type' where the available values are: 'survey', 'model', 'system', 'analysis', 'resource', 'other'.
    """
)

# Print the response
response.display()

We can even ask it follow-up queries:

In [None]:
followup_response = qa.run(
    query="Can you select one or two of these papers and explain them in simple terms? I am not a data scientist.", context=response
)

followup_response.display()

Read more about the [Query Agent](https://weaviate.io/blog/query-agent) on our blog.

## Bonus: Current limitations

Remember that the Transformation Agent is being asked to update data objects for us. So, be very careful with the instructions you provide.

And currently, it is in technical preview. Do not use it in a production environment (*yet* 😉).

- Do not run multiple agents at the same time - this can cause conflicts (race conditions).
- There is a limit of 10,000 operations per day per Weaviate Cloud organization.

In [None]:
from weaviate.classes.config import Configure, Property, DataType

collection_name = "ArxivPapersDemo"

# Can delete the collection if you would like to (re)start fresh
client.collections.delete(collection_name)

client.collections.create(
    collection_name,
    description="A dataset that lists research paper titles and abstracts",
    properties=[
        Property(name="title", data_type=DataType.TEXT),
        Property(name="abstract", data_type=DataType.TEXT),
    ],
    vectorizer_config=[
        Configure.NamedVectors.text2vec_weaviate(
            name="default",
            source_properties=["title", "abstract"],
        )
    ]
)

papers_collection = client.collections.get(collection_name)
columns = papers_dataset[0]["properties"].keys()

with papers_collection.batch.fixed_size(100) as batch:
    for i, item in enumerate(papers_dataset):
        if i < 5:
            properties = {col: item["properties"][col] for col in columns}
            batch.add_object(properties=properties)


if papers_collection.batch.failed_objects:
    for fo in papers_collection.batch.failed_objects[:3]:
        print(fo.message)
        print(fo.object_)

len(papers_collection)

In [None]:
from weaviate.agents.transformation import TransformationAgent

responses = []
new_languages = ["spanish", "german", "italian"]

for lang in new_languages:

    prompt_task = f"""
    Create a {lang} version of the abstract
    """

    task = Operations.append_property(
        property_name=f"test_{lang}_abstract",
        data_type=DataType.TEXT,
        view_properties=["abstract"],
        instruction=prompt_task,
    )

    ta = TransformationAgent(
        client=client,
        collection=collection_name,
        operations=[task],
    )

    ta_response = ta.update_all()
    responses.append(ta_response)

print(responses)

In [None]:
for r in responses:
    get_ta_status(agent_instance=ta, workflow_id=r.workflow_id)

If these operations worked perfectly, all objects should have all new properties (`test_spanish_abstract`, `test_german_abstract`, `test_italian_abstract`). 

In [None]:
response = papers_collection.query.fetch_objects(
    limit=50
)

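# The properties that the three operations above were asked to create
properties = [f"test_{lang}_abstract" for lang in new_languages]
for o in response.objects:
    for p in properties:
        # A property that was never written may be absent from the object altogether
        if not o.properties.get(p):
            print(f"Property {p} is empty in object UUID: {o.uuid}")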

But since we have very few objects, multiple agents worked on the same object at the same time, which can leave some of these properties missing.

This shouldn't happen much in a real-world scenario, but it's something to keep in mind.
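
One way to avoid such conflicts is to start each agent only after the previous workflow has finished. The sketch below assumes agents that expose `update_all()` and `get_status(workflow_id=...)` as used in this notebook, with the same `status["status"]["state"]` shape. `run_sequentially` is a hypothetical helper, and `FakeAgent` (including its `completed` state) is a stand-in that exists only to make the example runnable without a cluster.

```python
import time
from types import SimpleNamespace


def run_sequentially(agents, poll_s=10.0, sleep=time.sleep):
    """Start each agent only after the previous agent's workflow has stopped running."""
    final_statuses = []
    for agent in agents:
        response = agent.update_all()
        while True:
            status = agent.get_status(workflow_id=response.workflow_id)
            if status["status"]["state"] != "running":
                break
            sleep(poll_s)
        final_statuses.append(status)
    return final_statuses


class FakeAgent:
    """Stand-in for TransformationAgent, used only to make this sketch runnable."""

    def __init__(self, name, log):
        self.name, self.log, self.polls = name, log, 0

    def update_all(self):
        self.log.append(f"start {self.name}")
        return SimpleNamespace(workflow_id=self.name)

    def get_status(self, workflow_id):
        # Report 'running' on the first poll, then an assumed final state
        self.polls += 1
        state = "running" if self.polls < 2 else "completed"
        if state != "running":
            self.log.append(f"done {workflow_id}")
        return {"status": {"state": state}}


log = []
run_sequentially([FakeAgent("spanish", log), FakeAgent("german", log)], poll_s=0.0)
print(log)
```

Alternatively, as in Task 2, several operations can be passed to a single `TransformationAgent`, which returns one workflow for all of them.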

## Further resources

- Blog: ["Introducing the Weaviate Transformation Agent"](https://weaviate.io/blog/transformation-agent)
- Documentation: [Weaviate Transformation Agent](https://weaviate.io/developers/agents/transformation)