@codeflash-ai codeflash-ai bot commented Oct 23, 2025

📄 185% speedup (about 2.85x as fast) for BatchReference._to_internal in weaviate/collections/classes/batch.py

⏱️ Runtime : 2.35 milliseconds → 825 microseconds (best of 162 runs)
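
The headline percentage follows from the two runtimes if speedup is defined as (old - new) / new. That convention is an assumption inferred from the numbers in this report, not something the report states. A small worked check, with hypothetical helper names:

```python
def speedup_percent(old_us: float, new_us: float) -> float:
    """Relative speedup in percent, assuming the (old - new) / new convention."""
    return (old_us - new_us) / new_us * 100.0


def speed_ratio(old_us: float, new_us: float) -> float:
    """How many times faster the new runtime is compared with the old one."""
    return old_us / new_us


old_us = 2350.0  # 2.35 milliseconds, from the report
new_us = 825.0   # 825 microseconds, from the report

print(f"{speedup_percent(old_us, new_us):.0f}% faster")
print(f"{speed_ratio(old_us, new_us):.2f}x the original speed")
```

Under that assumption the runtimes give roughly 185% and a ratio of about 2.85, so "185% faster" and "2.85x as fast" describe the same measurement.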

📝 Explanation and details

The optimized code achieves a 185% speedup by eliminating three key performance bottlenecks:

What was optimized:

  1. Eliminated object mutation: The original code modified self.to_object_collection directly, which is expensive in Pydantic models due to validation overhead. The optimization uses a local variable toc_str instead, avoiding the mutation entirely.

  2. Cached UUID string conversions: The original code called str(self.from_object_uuid) and str(self.to_object_uuid) multiple times. The optimization computes these once and reuses the cached strings, eliminating redundant conversions.

  3. Optimized string concatenation: Replaced the slower self.to_object_collection + "/" concatenation with f-string formatting f"{toc}/", which is more efficient in Python.
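
The three changes above can be sketched on a simplified stand-in. This is not the actual weaviate source: `Ref` and `InternalRef` are hypothetical plain dataclasses (not Pydantic models), the method names are invented for illustration, and the `BEACON` value is taken from the test stub further down in this report.

```python
from dataclasses import dataclass
from typing import Optional
from uuid import UUID

BEACON = "weaviate://localhost/"  # assumed value, copied from the test stub


@dataclass
class InternalRef:  # hypothetical stand-in for _BatchReference
    from_uuid: str
    from_: str
    to: str
    to_uuid: str
    tenant: Optional[str] = None


@dataclass
class Ref:  # hypothetical stand-in for BatchReference
    from_object_collection: str
    from_object_uuid: UUID
    from_property_name: str
    to_object_uuid: UUID
    to_object_collection: Optional[str] = None
    tenant: Optional[str] = None

    def to_internal_mutating(self) -> InternalRef:
        # "Before" shape: rewrites a field on self and converts UUIDs repeatedly.
        if self.to_object_collection is None:
            self.to_object_collection = ""
        else:
            self.to_object_collection = self.to_object_collection + "/"
        return InternalRef(
            from_uuid=str(self.from_object_uuid),
            from_=f"{BEACON}{self.from_object_collection}/{str(self.from_object_uuid)}/{self.from_property_name}",
            to=f"{BEACON}{self.to_object_collection}{str(self.to_object_uuid)}",
            to_uuid=str(self.to_object_uuid),
            tenant=self.tenant,
        )

    def to_internal_local(self) -> InternalRef:
        # "After" shape: local variable for the prefix, each UUID converted once.
        toc = self.to_object_collection
        toc_prefix = "" if toc is None else f"{toc}/"
        from_uuid = str(self.from_object_uuid)
        to_uuid = str(self.to_object_uuid)
        return InternalRef(
            from_uuid=from_uuid,
            from_=f"{BEACON}{self.from_object_collection}/{from_uuid}/{self.from_property_name}",
            to=f"{BEACON}{toc_prefix}{to_uuid}",
            to_uuid=to_uuid,
            tenant=self.tenant,
        )
```

In this sketch both methods return equal results on a fresh instance, but only the local-variable version leaves the instance unchanged, so it can be called repeatedly without appending another "/" to the collection name.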

Why this leads to speedup:

  • Assigning to an attribute of a Pydantic model goes through the model's own attribute-setting logic (and through validation when assignment validation is enabled), so it is slower than binding a local variable
  • UUID string conversion is computationally expensive, so caching these results eliminates redundant work
  • F-string formatting is generally faster than string concatenation operators in Python
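
The last two claims are easy to measure in isolation. The following is a minimal micro-benchmark sketch using the standard `timeit` module; the `measure` helper and its labels are hypothetical, and absolute timings will vary by machine and Python version, so no expected numbers are given.

```python
import timeit
from uuid import uuid4


def measure(number: int = 200_000) -> dict:
    """Time each pattern `number` times; returns total seconds per pattern."""
    env = {"u": uuid4(), "name": "OtherClass"}
    statements = {
        "str(uuid) twice": "a = str(u); b = str(u)",
        "str(uuid) cached": "s = str(u); a = s; b = s",
        "concat with +": 'x = name + "/"',
        "f-string": 'x = f"{name}/"',
    }
    return {
        label: timeit.timeit(stmt, globals=env, number=number)
        for label, stmt in statements.items()
    }


if __name__ == "__main__":
    for label, seconds in measure().items():
        print(f"{label:>18}: {seconds:.4f} s")
```

Comparing the first pair of rows shows how much the cached UUID string saves, and the second pair shows how large the f-string versus `+` difference is on a given interpreter.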

Performance characteristics:
The generated tests show speedups of roughly 184-241% across all scenarios, for example:

  • Basic references with collections (232% faster)
  • References without to_object_collection (236% faster)
  • A batch of 1,000 references (184% faster)
  • Cases with long names and special characters (186-215% faster)

The optimization produces the same output in every generated regression test while reducing the cost of _to_internal, which benefits workloads that convert many batch references.
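
A claim of identical behavior is typically checked by running the original and the optimized function on the same inputs and comparing results, including raised exceptions. The sketch below illustrates that idea on two toy functions; `outcome`, `assert_equivalent`, `prefix_concat`, and `prefix_fstring` are hypothetical names and are not part of Codeflash or weaviate.

```python
from typing import Any, Callable, Iterable, Tuple


def outcome(fn: Callable[[Any], Any], arg: Any) -> Tuple[str, Any]:
    """Run fn(arg) and capture either its return value or its exception type."""
    try:
        return ("ok", fn(arg))
    except Exception as exc:
        return ("raised", type(exc))


def assert_equivalent(
    original: Callable[[Any], Any],
    optimized: Callable[[Any], Any],
    inputs: Iterable[Any],
) -> int:
    """Assert both functions agree on every input; return how many were checked."""
    checked = 0
    for arg in inputs:
        expected = outcome(original, arg)
        actual = outcome(optimized, arg)
        assert expected == actual, f"mismatch for {arg!r}: {expected} != {actual}"
        checked += 1
    return checked


def prefix_concat(collection):
    # Toy "original": builds the prefix with the + operator.
    return "" if collection is None else collection + "/"


def prefix_fstring(collection):
    # Toy "optimized": builds the prefix with an f-string.
    return "" if collection is None else f"{collection}/"


if __name__ == "__main__":
    samples = [None, "A", "Oth$r", "X" * 1000]
    print(assert_equivalent(prefix_concat, prefix_fstring, samples), "inputs agree")
```

For string or None inputs the two toy functions agree. For a non-string input such as 5 they would differ, because `+` raises TypeError while the f-string formats the value; that is the kind of divergence a check like this is meant to surface.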

Correctness verification report:

| Test | Status |
|---|---|
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | 3134 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 100.0% |
🌀 Generated Regression Tests and Runtime
from typing import Optional
from uuid import UUID as _UUID
from uuid import uuid4

# imports
import pytest
from weaviate.collections.classes.batch import BatchReference

# Simulate BEACON constant and _BatchReference type for testing
BEACON = "weaviate://localhost/"
class _BatchReference:
    def __init__(self, from_uuid, from_, to, to_uuid, tenant):
        self.from_uuid = from_uuid
        self.from_ = from_
        self.to = to
        self.to_uuid = to_uuid
        self.tenant = tenant

    def __eq__(self, other):
        if not isinstance(other, _BatchReference):
            return False
        return (
            self.from_uuid == other.from_uuid and
            self.from_ == other.from_ and
            self.to == other.to and
            self.to_uuid == other.to_uuid and
            self.tenant == other.tenant
        )

# -------------------- UNIT TESTS --------------------

# Basic Test Cases

def test_basic_reference_with_collection():
    """Test standard reference with all fields provided."""
    from_uuid = uuid4()
    to_uuid = uuid4()
    ref = BatchReference(
        from_object_collection="MyClass",
        from_object_uuid=from_uuid,
        from_property_name="myProp",
        to_object_uuid=to_uuid,
        to_object_collection="OtherClass",
        tenant="tenantA"
    )
    codeflash_output = ref._to_internal(); internal = codeflash_output # 6.33μs -> 1.91μs (232% faster)

def test_basic_reference_without_to_object_collection():
    """Test reference when to_object_collection is None."""
    from_uuid = uuid4()
    to_uuid = uuid4()
    ref = BatchReference(
        from_object_collection="A",
        from_object_uuid=from_uuid,
        from_property_name="p",
        to_object_uuid=to_uuid,
        to_object_collection=None,
        tenant=None
    )
    codeflash_output = ref._to_internal(); internal = codeflash_output # 5.80μs -> 1.73μs (236% faster)

def test_basic_reference_empty_tenant():
    """Test reference when tenant is not provided."""
    from_uuid = uuid4()
    to_uuid = uuid4()
    ref = BatchReference(
        from_object_collection="A",
        from_object_uuid=from_uuid,
        from_property_name="p",
        to_object_uuid=to_uuid,
        to_object_collection="B"
    )
    codeflash_output = ref._to_internal(); internal = codeflash_output # 5.99μs -> 1.83μs (228% faster)

# Edge Test Cases

def test_edge_reference_empty_collection_name():
    """Test with minimal length collection name (should be allowed)."""
    from_uuid = uuid4()
    to_uuid = uuid4()
    ref = BatchReference(
        from_object_collection="X",
        from_object_uuid=from_uuid,
        from_property_name="prop",
        to_object_uuid=to_uuid,
        to_object_collection="Y"
    )
    codeflash_output = ref._to_internal(); internal = codeflash_output # 6.10μs -> 1.85μs (231% faster)

def test_edge_reference_empty_property_name():
    """Test with empty property name (should be allowed)."""
    from_uuid = uuid4()
    to_uuid = uuid4()
    ref = BatchReference(
        from_object_collection="Class",
        from_object_uuid=from_uuid,
        from_property_name="",
        to_object_uuid=to_uuid,
        to_object_collection="Other"
    )
    codeflash_output = ref._to_internal(); internal = codeflash_output # 6.30μs -> 1.89μs (233% faster)

def test_edge_reference_none_tenant():
    """Test with tenant explicitly set to None."""
    from_uuid = uuid4()
    to_uuid = uuid4()
    ref = BatchReference(
        from_object_collection="Class",
        from_object_uuid=from_uuid,
        from_property_name="prop",
        to_object_uuid=to_uuid,
        to_object_collection="Other",
        tenant=None
    )
    codeflash_output = ref._to_internal(); internal = codeflash_output # 6.41μs -> 1.99μs (221% faster)


def test_edge_reference_special_characters():
    """Test with special characters in collection and property names."""
    from_uuid = uuid4()
    to_uuid = uuid4()
    ref = BatchReference(
        from_object_collection="Cl@ss#1",
        from_object_uuid=from_uuid,
        from_property_name="pr$p",
        to_object_uuid=to_uuid,
        to_object_collection="Oth$r"
    )
    codeflash_output = ref._to_internal(); internal = codeflash_output # 9.31μs -> 3.05μs (205% faster)

def test_edge_reference_uuid_as_string():
    """Test with UUIDs passed as strings (should fail type checking)."""
    from_uuid = str(uuid4())
    to_uuid = str(uuid4())
    # Should raise AttributeError or TypeError when trying to use string as UUID
    with pytest.raises((AttributeError, TypeError)):
        ref = BatchReference(
            from_object_collection="Class",
            from_object_uuid=from_uuid,  # Not a UUID object
            from_property_name="prop",
            to_object_uuid=to_uuid,      # Not a UUID object
            to_object_collection="Other"
        )
        ref._to_internal()


def test_large_scale_many_references():
    """Test creating and converting a large batch of references."""
    N = 500  # Large but under 1000
    refs = []
    from_uuids = [uuid4() for _ in range(N)]
    to_uuids = [uuid4() for _ in range(N)]
    for i in range(N):
        refs.append(BatchReference(
            from_object_collection=f"Class{i}",
            from_object_uuid=from_uuids[i],
            from_property_name=f"prop{i}",
            to_object_uuid=to_uuids[i],
            to_object_collection=f"Other{i}",
            tenant=f"tenant{i}"
        ))
    internals = [ref._to_internal() for ref in refs]
    # Check a few random samples
    for i in [0, N//2, N-1]:
        pass

def test_large_scale_long_collection_names():
    """Test with very long collection and property names."""
    long_name = "A" * 250
    from_uuid = uuid4()
    to_uuid = uuid4()
    ref = BatchReference(
        from_object_collection=long_name,
        from_object_uuid=from_uuid,
        from_property_name=long_name,
        to_object_uuid=to_uuid,
        to_object_collection=long_name,
        tenant=long_name
    )
    codeflash_output = ref._to_internal(); internal = codeflash_output # 9.48μs -> 3.25μs (192% faster)

def test_large_scale_reference_with_none_tenant():
    """Test large batch with tenant=None."""
    N = 300
    refs = []
    from_uuids = [uuid4() for _ in range(N)]
    to_uuids = [uuid4() for _ in range(N)]
    for i in range(N):
        refs.append(BatchReference(
            from_object_collection=f"Class{i}",
            from_object_uuid=from_uuids[i],
            from_property_name=f"prop{i}",
            to_object_uuid=to_uuids[i],
            to_object_collection=f"Other{i}",
            tenant=None
        ))
    internals = [ref._to_internal() for ref in refs]
    for i in [0, N//2, N-1]:
        pass

def test_large_scale_unique_uuids_and_collections():
    """Test that each reference is uniquely mapped."""
    N = 100
    refs = []
    from_uuids = [uuid4() for _ in range(N)]
    to_uuids = [uuid4() for _ in range(N)]
    for i in range(N):
        refs.append(BatchReference(
            from_object_collection=f"Class{i}",
            from_object_uuid=from_uuids[i],
            from_property_name=f"prop{i}",
            to_object_uuid=to_uuids[i],
            to_object_collection=f"Other{i}"
        ))
    internals = [ref._to_internal() for ref in refs]
    # Ensure uniqueness
    froms = set(i.from_ for i in internals)
    tos = set(i.to for i in internals)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
#------------------------------------------------
from dataclasses import dataclass
from typing import Optional

# imports
import pytest  # used for our unit tests
# --- BatchReference class from prompt ---
from pydantic import BaseModel, Field
from weaviate.collections.classes.batch import BatchReference

# --- Minimal stubs for BEACON, UUID, and _BatchReference ---
BEACON = "weaviate://"
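import re

# Canonical 8-4-4-4-12 hexadecimal UUID layout, compiled once at import time
_UUID_RE = re.compile(r"^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$")

def is_valid_uuid(u):
    # Simple check for UUID string format
    return bool(_UUID_RE.match(u))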

class UUID(str):
    def __new__(cls, value):
        if not is_valid_uuid(value):
            raise ValueError("Invalid UUID")
        return str.__new__(cls, value)

@dataclass
class _BatchReference:
    from_uuid: str
    from_: str
    to: str
    to_uuid: str
    tenant: Optional[str] = None

# --- Unit tests ---

# Basic Test Cases

def test_basic_reference_with_all_fields():
    # Test with all fields provided
    ref = BatchReference(
        from_object_collection="MyClass",
        from_object_uuid=UUID("12345678-1234-1234-1234-123456789abc"),
        from_property_name="refProp",
        to_object_uuid=UUID("abcdefab-cdef-abcd-efab-cdefabcdefab"),
        to_object_collection="OtherClass",
        tenant="tenant1"
    )
    codeflash_output = ref._to_internal(); internal = codeflash_output # 10.9μs -> 3.66μs (198% faster)

def test_basic_reference_without_optional_fields():
    # Test with minimal required fields, omitting to_object_collection and tenant
    ref = BatchReference(
        from_object_collection="ClassA",
        from_object_uuid=UUID("11111111-2222-3333-4444-555555555555"),
        from_property_name="propA",
        to_object_uuid=UUID("99999999-8888-7777-6666-555555555555")
    )
    codeflash_output = ref._to_internal(); internal = codeflash_output # 7.41μs -> 2.18μs (241% faster)

def test_basic_reference_with_empty_tenant():
    # Tenant is explicitly set to None
    ref = BatchReference(
        from_object_collection="ClassB",
        from_object_uuid=UUID("aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee"),
        from_property_name="propB",
        to_object_uuid=UUID("bbbbbbbb-cccc-dddd-eeee-ffffffffffff"),
        tenant=None
    )
    codeflash_output = ref._to_internal(); internal = codeflash_output # 6.76μs -> 2.03μs (233% faster)

# Edge Test Cases

def test_edge_empty_collection_name_raises():
    # from_object_collection must be non-empty due to min_length=1
    with pytest.raises(ValueError):
        BatchReference(
            from_object_collection="",
            from_object_uuid=UUID("12345678-1234-1234-1234-123456789abc"),
            from_property_name="refProp",
            to_object_uuid=UUID("abcdefab-cdef-abcd-efab-cdefabcdefab"),
        )

def test_edge_invalid_uuid_raises():
    # from_object_uuid is not a valid UUID
    with pytest.raises(ValueError):
        BatchReference(
            from_object_collection="ClassX",
            from_object_uuid=UUID("not-a-uuid"),
            from_property_name="propX",
            to_object_uuid=UUID("abcdefab-cdef-abcd-efab-cdefabcdefab"),
        )


def test_edge_property_name_with_special_characters():
    # from_property_name with special characters
    ref = BatchReference(
        from_object_collection="ClassD",
        from_object_uuid=UUID("eeeeeeee-eeee-eeee-eeee-eeeeeeeeeeee"),
        from_property_name="prop!@# $%^&*()_+",
        to_object_uuid=UUID("ffffffff-ffff-ffff-ffff-ffffffffffff")
    )
    codeflash_output = ref._to_internal(); internal = codeflash_output # 9.56μs -> 3.03μs (215% faster)

def test_edge_long_collection_and_property_names():
    # Very long collection and property names
    long_name = "A" * 255
    ref = BatchReference(
        from_object_collection=long_name,
        from_object_uuid=UUID("12345678-1234-1234-1234-123456789abc"),
        from_property_name=long_name,
        to_object_uuid=UUID("abcdefab-cdef-abcd-efab-cdefabcdefab"),
        to_object_collection=long_name
    )
    codeflash_output = ref._to_internal(); internal = codeflash_output # 7.50μs -> 2.44μs (207% faster)

def test_edge_none_tenant_is_preserved():
    # Explicitly set tenant to None
    ref = BatchReference(
        from_object_collection="ClassE",
        from_object_uuid=UUID("11111111-2222-3333-4444-555555555555"),
        from_property_name="propE",
        to_object_uuid=UUID("99999999-8888-7777-6666-555555555555"),
        tenant=None
    )
    codeflash_output = ref._to_internal(); internal = codeflash_output # 6.71μs -> 2.06μs (226% faster)

# Large Scale Test Cases

def test_large_scale_many_references():
    # Test performance and correctness with 1000 references
    refs = []
    for i in range(1000):
        ref = BatchReference(
            from_object_collection=f"Class{i}",
            from_object_uuid=UUID(f"{i:08x}-1234-1234-1234-123456789abc"),
            from_property_name=f"prop{i}",
            to_object_uuid=UUID(f"{(i+1)%1000:08x}-cdef-abcd-efab-cdefabcdefab"),
            to_object_collection=f"OtherClass{i}"
        )
        refs.append(ref)
    # Test that all internal objects are correct
    for i, ref in enumerate(refs):
        codeflash_output = ref._to_internal(); internal = codeflash_output # 2.03ms -> 715μs (184% faster)

def test_large_scale_long_names():
    # Test with long collection and property names, 1000 chars
    long_name = "X" * 1000
    ref = BatchReference(
        from_object_collection=long_name,
        from_object_uuid=UUID("12345678-1234-1234-1234-123456789abc"),
        from_property_name=long_name,
        to_object_uuid=UUID("abcdefab-cdef-abcd-efab-cdefabcdefab"),
        to_object_collection=long_name
    )
    codeflash_output = ref._to_internal(); internal = codeflash_output # 8.35μs -> 2.92μs (186% faster)

def test_large_scale_unique_tenants():
    # Test that tenant field is preserved across many references
    for i in range(100):
        tenant = f"tenant_{i}"
        ref = BatchReference(
            from_object_collection="Class",
            from_object_uuid=UUID(f"{i:08x}-1234-1234-1234-123456789abc"),
            from_property_name="prop",
            to_object_uuid=UUID(f"{(i+1)%100:08x}-cdef-abcd-efab-cdefabcdefab"),
            tenant=tenant
        )
        codeflash_output = ref._to_internal(); internal = codeflash_output # 210μs -> 73.6μs (186% faster)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
#------------------------------------------------


To edit these changes, check out the branch with git checkout codeflash/optimize-BatchReference._to_internal-mh38jgyi and push.

codeflash-ai bot requested a review from mashraf-222 on October 23, 2025 09:44
codeflash-ai bot added the "⚡️ codeflash" label (Optimization PR opened by Codeflash AI) on Oct 23, 2025