@codeflash-ai codeflash-ai bot commented Oct 24, 2025

📄 238% (2.38x) speedup for unmarshal_json in src/mistralai/utils/serializers.py

⏱️ Runtime : 24.3 milliseconds → 7.19 milliseconds (best of 110 runs)

📝 Explanation and details

The optimization introduces LRU caching for Pydantic model creation, which eliminates the expensive overhead of repeatedly creating the same unmarshaller models.

Key changes:

  • Extracted model creation into _get_unmarshaller() function decorated with @lru_cache(maxsize=64)
  • The create_model() call, which was taking 93.8% of execution time in the original code, is now cached and reused for identical types

Why this optimization works:

  • create_model() is computationally expensive as it dynamically creates new Pydantic model classes with validation logic
  • The line profiler shows the original create_model() call took ~55.8ms out of 59.5ms total (93.8% of time)
  • With caching, subsequent calls for the same typ retrieve the pre-built model in ~0.44ms instead of recreating it
  • The cache hit ratio is high since applications typically unmarshal the same types repeatedly
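The claim that `create_model()` dominates runtime is easy to reproduce with a standalone micro-benchmark. This sketch is not code from the PR; it simply times a model-building helper with and without an `lru_cache` wrapper, mirroring the optimization described above:

```python
import timeit
from functools import lru_cache

from pydantic import create_model

def build_unmarshaller(typ):
    # Dynamically creates a fresh Pydantic model class on every call
    return create_model("unmarshaller", body=(typ, ...))

# The same builder behind a 64-entry LRU cache, as in the optimized code
cached_build = lru_cache(maxsize=64)(build_unmarshaller)

uncached_time = timeit.timeit(lambda: build_unmarshaller(list[int]), number=200)
cached_time = timeit.timeit(lambda: cached_build(list[int]), number=200)
print(f"uncached: {uncached_time:.4f}s, cached: {cached_time:.4f}s")
```

After the first miss, every cached call is a dictionary lookup, which is why the per-call numbers in the tests below drop from hundreds of microseconds to single digits.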

Performance benefits:

  • 238% speedup overall (24.3ms → 7.19ms)
  • Individual test cases show 4000-10000% improvements for simple types that benefit most from caching
  • Large data structures (1000-item lists/dicts) show more modest but still significant gains (300-1000% faster)

This optimization is particularly effective for workloads that repeatedly deserialize the same data types, which is common in API clients, data processing pipelines, and serialization-heavy applications.

Correctness verification report:

Test                          | Status
⚙️ Existing Unit Tests        | 🔘 None Found
🌀 Generated Regression Tests | 53 Passed
⏪ Replay Tests               | 🔘 None Found
🔎 Concolic Coverage Tests    | 🔘 None Found
📊 Tests Coverage             | 100.0%
🌀 Generated Regression Tests and Runtime
import json
from typing import Dict, List, Optional, Union

# imports
import pytest  # used for our unit tests
from mistralai.utils.serializers import unmarshal_json
from pydantic import ValidationError

# unit tests

# --- Basic Test Cases ---

def test_basic_int():
    # Test unmarshalling a simple integer
    result = unmarshal_json(json.dumps(42), int)  # 318μs -> 7.13μs (4360% faster)
    assert result == 42

def test_basic_float():
    # Test unmarshalling a simple float
    result = unmarshal_json(json.dumps(3.14), float)  # 290μs -> 5.55μs (5123% faster)
    assert result == 3.14

def test_basic_str():
    # Test unmarshalling a simple string
    result = unmarshal_json(json.dumps("hello"), str)  # 285μs -> 5.10μs (5493% faster)
    assert result == "hello"

def test_basic_bool_true():
    # Test unmarshalling a boolean True
    result = unmarshal_json(json.dumps(True), bool)  # 284μs -> 4.71μs (5941% faster)
    assert result is True

def test_basic_bool_false():
    # Test unmarshalling a boolean False
    result = unmarshal_json(json.dumps(False), bool)  # 279μs -> 4.14μs (6643% faster)
    assert result is False

def test_basic_list_of_ints():
    # Test unmarshalling a list of integers
    result = unmarshal_json(json.dumps([1, 2, 3]), List[int])  # 339μs -> 6.44μs (5176% faster)
    assert result == [1, 2, 3]

def test_basic_dict_str_int():
    # Test unmarshalling a dict with str keys and int values
    result = unmarshal_json(json.dumps({"a": 1, "b": 2}), Dict[str, int])  # 369μs -> 6.66μs (5453% faster)
    assert result == {"a": 1, "b": 2}

def test_basic_optional_present():
    # Test unmarshalling an Optional[int] with a value present
    result = unmarshal_json(json.dumps(5), Optional[int])  # 362μs -> 5.99μs (5962% faster)
    assert result == 5

def test_basic_optional_none():
    # Test unmarshalling an Optional[int] with value None
    result = unmarshal_json(json.dumps(None), Optional[int])  # 344μs -> 4.87μs (6973% faster)
    assert result is None

def test_basic_union():
    # Test unmarshalling a Union[int, str]
    result = unmarshal_json(json.dumps(7), Union[int, str])  # 364μs -> 5.30μs (6769% faster)
    assert result == 7
    result = unmarshal_json(json.dumps("foo"), Union[int, str])  # 311μs -> 2.84μs (10867% faster)
    assert result == "foo"

def test_basic_nested_list_dict():
    # Test unmarshalling nested lists and dicts
    data = {"a": [1, 2], "b": [3, 4]}
    result = unmarshal_json(json.dumps(data), Dict[str, List[int]])  # 411μs -> 7.20μs (5610% faster)
    assert result == data

# --- Edge Test Cases ---

def test_edge_empty_list():
    # Test unmarshalling an empty list
    result = unmarshal_json(json.dumps([]), List[int])  # 325μs -> 4.81μs (6667% faster)
    assert result == []

def test_edge_empty_dict():
    # Test unmarshalling an empty dict
    result = unmarshal_json(json.dumps({}), Dict[str, int])  # 365μs -> 4.85μs (7432% faster)
    assert result == {}

def test_edge_null_value():
    # Test unmarshalling a null value for an Optional type
    result = unmarshal_json(json.dumps(None), Optional[str])  # 352μs -> 5.47μs (6346% faster)
    assert result is None

def test_edge_invalid_type():
    # Test unmarshalling with mismatched types (should raise ValidationError)
    with pytest.raises(ValidationError):
        unmarshal_json(json.dumps("not an int"), int)  # 289μs -> 5.56μs (5098% faster)

def test_edge_union_with_none():
    # Test unmarshalling Union[str, None] with None value
    result = unmarshal_json(json.dumps(None), Union[str, None])  # 351μs -> 8.52μs (4024% faster)
    assert result is None

def test_edge_union_with_valid_value():
    # Test unmarshalling Union[str, None] with a valid string
    result = unmarshal_json(json.dumps("abc"), Union[str, None])  # 349μs -> 6.57μs (5224% faster)
    assert result == "abc"

def test_edge_list_of_optional():
    # Test unmarshalling a list of Optional[int]
    data = [1, None, 2]
    result = unmarshal_json(json.dumps(data), List[Optional[int]])  # 390μs -> 7.21μs (5316% faster)
    assert result == data

def test_edge_dict_with_optional_value():
    # Test unmarshalling a dict with Optional[int] values
    data = {"a": 1, "b": None}
    result = unmarshal_json(json.dumps(data), Dict[str, Optional[int]])  # 403μs -> 6.92μs (5724% faster)
    assert result == data

def test_edge_empty_string():
    # Test unmarshalling an empty string
    result = unmarshal_json(json.dumps(""), str)  # 300μs -> 4.80μs (6149% faster)
    assert result == ""

def test_edge_large_integer():
    # Test unmarshalling a very large integer
    large_int = 10**18
    result = unmarshal_json(json.dumps(large_int), int)  # 284μs -> 5.60μs (4974% faster)
    assert result == large_int

def test_edge_float_precision():
    # Test unmarshalling a float with high precision
    value = 1.1234567890123456
    result = unmarshal_json(json.dumps(value), float)  # 278μs -> 4.52μs (6072% faster)
    assert result == value

def test_edge_invalid_json():
    # Test passing invalid JSON (from_json raises an error before validation)
    with pytest.raises(Exception):
        unmarshal_json("{invalid json}", int)  # 3.44μs -> 3.36μs (2.26% faster)

def test_edge_wrong_type_in_list():
    # Test unmarshalling a list with wrong-typed elements
    with pytest.raises(ValidationError):
        unmarshal_json(json.dumps([1, "two", 3]), List[int])  # 340μs -> 8.15μs (4078% faster)

def test_edge_wrong_type_in_dict():
    # Test unmarshalling a dict with wrong-typed values
    with pytest.raises(ValidationError):
        unmarshal_json(json.dumps({"a": 1, "b": "two"}), Dict[str, int])  # 369μs -> 7.66μs (4723% faster)

def test_large_list_of_ints():
    # Test unmarshalling a large list of integers
    large_list = list(range(1000))
    result = unmarshal_json(json.dumps(large_list), List[int])  # 381μs -> 33.3μs (1043% faster)
    assert result == large_list

def test_large_dict_of_str_int():
    # Test unmarshalling a large dict of str -> int
    large_dict = {str(i): i for i in range(1000)}
    result = unmarshal_json(json.dumps(large_dict), Dict[str, int])  # 508μs -> 117μs (333% faster)
    assert result == large_dict

#------------------------------------------------
from typing import Dict, List, Optional, Tuple, Union

# imports
import pytest
from mistralai.utils.serializers import unmarshal_json
from pydantic import BaseModel, ValidationError


# Helper pydantic models for testing
class SimpleModel(BaseModel):
    a: int
    b: str

class NestedModel(BaseModel):
    x: int
    y: SimpleModel

class OptionalModel(BaseModel):
    a: Optional[int]
    b: Optional[str]

class UnionModel(BaseModel):
    a: Union[int, str]
    b: str

# ==============================
# Basic Test Cases
# ==============================

def test_basic_int():
    """Test unmarshalling a simple integer."""
    raw = b'123'
    result = unmarshal_json(raw, int)  # 292μs -> 5.51μs (5206% faster)
    assert result == 123

def test_basic_str():
    """Test unmarshalling a simple string."""
    raw = b'"hello"'
    result = unmarshal_json(raw, str)  # 283μs -> 5.12μs (5444% faster)
    assert result == "hello"

def test_basic_float():
    """Test unmarshalling a simple float."""
    raw = b'3.1415'
    result = unmarshal_json(raw, float)  # 278μs -> 5.02μs (5451% faster)
    assert result == 3.1415

def test_basic_bool_true():
    """Test unmarshalling a boolean True."""
    raw = b'true'
    result = unmarshal_json(raw, bool)  # 273μs -> 4.54μs (5925% faster)
    assert result is True

def test_basic_bool_false():
    """Test unmarshalling a boolean False."""
    raw = b'false'
    result = unmarshal_json(raw, bool)  # 274μs -> 4.22μs (6404% faster)
    assert result is False

def test_basic_list_of_ints():
    """Test unmarshalling a list of integers."""
    raw = b'[1, 2, 3]'
    result = unmarshal_json(raw, List[int])  # 353μs -> 6.04μs (5742% faster)
    assert result == [1, 2, 3]

def test_basic_dict_str_int():
    """Test unmarshalling a dict with str keys and int values."""
    raw = b'{"a": 1, "b": 2}'
    result = unmarshal_json(raw, Dict[str, int])  # 370μs -> 6.55μs (5557% faster)
    assert result == {"a": 1, "b": 2}

def test_basic_pydantic_model():
    """Test unmarshalling a simple pydantic model."""
    raw = b'{"a": 10, "b": "foo"}'
    result = unmarshal_json(raw, SimpleModel)  # 319μs -> 7.54μs (4136% faster)
    assert result == SimpleModel(a=10, b="foo")

# ==============================
# Edge Test Cases
# ==============================

def test_empty_list():
    """Test unmarshalling an empty list."""
    raw = b'[]'
    result = unmarshal_json(raw, List[int])  # 343μs -> 4.93μs (6879% faster)
    assert result == []

def test_empty_dict():
    """Test unmarshalling an empty dict."""
    raw = b'{}'
    result = unmarshal_json(raw, Dict[str, int])  # 356μs -> 4.84μs (7259% faster)
    assert result == {}

def test_null_optional_field():
    """Test unmarshalling a model with explicit null optional fields."""
    raw = b'{"a": null, "b": null}'
    result = unmarshal_json(raw, OptionalModel)  # 359μs -> 10.8μs (3236% faster)
    assert result.a is None
    assert result.b is None

def test_union_field_int():
    """Test unmarshalling a union field with an int."""
    raw = b'{"a": 42, "b": "bar"}'
    result = unmarshal_json(raw, UnionModel)  # 327μs -> 8.59μs (3720% faster)
    assert result.a == 42
    assert result.b == "bar"

def test_union_field_str():
    """Test unmarshalling a union field with a str."""
    raw = b'{"a": "baz", "b": "bar"}'
    result = unmarshal_json(raw, UnionModel)  # 317μs -> 7.16μs (4339% faster)
    assert result.a == "baz"

def test_nested_model():
    """Test unmarshalling a nested pydantic model."""
    raw = b'{"x": 1, "y": {"a": 2, "b": "hi"}}'
    result = unmarshal_json(raw, NestedModel)  # 337μs -> 8.75μs (3757% faster)
    assert result.x == 1
    assert result.y == SimpleModel(a=2, b="hi")

def test_list_of_models():
    """Test unmarshalling a list of pydantic models."""
    raw = b'[{"a": 1, "b": "x"}, {"a": 2, "b": "y"}]'
    result = unmarshal_json(raw, List[SimpleModel])  # 366μs -> 9.22μs (3879% faster)
    assert result == [SimpleModel(a=1, b="x"), SimpleModel(a=2, b="y")]

def test_dict_of_models():
    """Test unmarshalling a dict of pydantic models."""
    raw = b'{"foo": {"a": 3, "b": "bar"}, "bar": {"a": 4, "b": "baz"}}'
    result = unmarshal_json(raw, Dict[str, SimpleModel])  # 396μs -> 9.36μs (4134% faster)
    assert result["foo"] == SimpleModel(a=3, b="bar")
    assert result["bar"] == SimpleModel(a=4, b="baz")

def test_tuple_of_mixed_types():
    """Test unmarshalling a tuple of mixed types."""
    raw = b'[1, "two", 3.0]'
    result = unmarshal_json(raw, Tuple[int, str, float])  # 395μs -> 7.28μs (5330% faster)
    assert result == (1, "two", 3.0)

def test_invalid_type_raises():
    """Test error handling for a type mismatch."""
    raw = b'"not an int"'
    with pytest.raises(ValidationError):
        unmarshal_json(raw, int)  # 288μs -> 5.71μs (4949% faster)

def test_invalid_json_raises():
    """Test error handling for invalid JSON."""
    raw = b'{invalid json}'
    with pytest.raises(ValueError):
        unmarshal_json(raw, SimpleModel)  # 3.53μs -> 3.30μs (6.95% faster)

def test_extra_fields_ignored():
    """Test that extra fields in the JSON are ignored by the model."""
    raw = b'{"a": 5, "b": "ok", "extra": 999}'
    result = unmarshal_json(raw, SimpleModel)  # 323μs -> 8.37μs (3768% faster)
    assert result == SimpleModel(a=5, b="ok")

def test_missing_required_field():
    """Test that a missing required field raises ValidationError."""
    raw = b'{"a": 5}'
    with pytest.raises(ValidationError):
        unmarshal_json(raw, SimpleModel)  # 308μs -> 6.25μs (4843% faster)

def test_none_for_non_optional_field():
    """Test that None for a non-optional field raises ValidationError."""
    raw = b'{"a": null, "b": "ok"}'
    with pytest.raises(ValidationError):
        unmarshal_json(raw, SimpleModel)  # 319μs -> 8.52μs (3654% faster)


def test_large_list_of_ints():
    """Test unmarshalling a large list of integers."""
    import json
    data = list(range(1000))
    raw = json.dumps(data).encode()
    result = unmarshal_json(raw, List[int])  # 381μs -> 34.0μs (1022% faster)
    assert result == data

def test_large_dict_of_strings():
    """Test unmarshalling a large dict with string keys and values."""
    import json
    data = {f"key{i}": f"value{i}" for i in range(1000)}
    raw = json.dumps(data).encode()
    result = unmarshal_json(raw, Dict[str, str])  # 533μs -> 127μs (318% faster)
    assert result == data

To edit these changes, run `git checkout codeflash/optimize-unmarshal_json-mh4jmtfq` and push.

@codeflash-ai codeflash-ai bot requested a review from mashraf-222 October 24, 2025 07:42
@codeflash-ai codeflash-ai bot added the ⚡️ codeflash Optimization PR opened by Codeflash AI label Oct 24, 2025