@codeflash-ai codeflash-ai bot commented Oct 10, 2025

📄 45% (0.45x) speedup for extract_project_and_location_from_parent in google/cloud/aiplatform/utils/__init__.py

⏱️ Runtime : 3.40 milliseconds → 2.34 milliseconds (best of 208 runs)

📝 Explanation and details

The optimized code achieves a 45% speedup by compiling the regular expression once at module load time and reusing the compiled pattern, instead of passing the pattern string to re.match() on every function call.

Key optimization:

  • Pre-compiled regex pattern: The regex pattern r"^projects/(?P<project>.+?)/locations/(?P<location>.+?)(/|$)" is compiled once at module load time and stored in _PROJECT_LOCATION_RE, rather than being passed as a string to re.match() on every function invocation.
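
The change described above can be sketched as follows. This is a minimal illustration, not the actual diff: the function bodies are assumed from the pattern quoted in this comment and from the empty-dict results the generated tests expect, and the names `extract_before` and `extract_after` are hypothetical stand-ins for the two versions of `extract_project_and_location_from_parent`.

```python
import re
from typing import Dict

# Pattern string as quoted in the explanation above.
_PATTERN = r"^projects/(?P<project>.+?)/locations/(?P<location>.+?)(/|$)"

# Compiled once at import time; the name is taken from the explanation above.
_PROJECT_LOCATION_RE = re.compile(_PATTERN)


def extract_before(parent: str) -> Dict[str, str]:
    # Assumed "before" shape: the pattern string goes through re.match() on every call.
    match = re.match(_PATTERN, parent)
    return match.groupdict() if match else {}


def extract_after(parent: str) -> Dict[str, str]:
    # Assumed "after" shape: match() is called on the precompiled pattern object.
    match = _PROJECT_LOCATION_RE.match(parent)
    return match.groupdict() if match else {}


print(extract_after("projects/123/locations/us-central1/datasets/456"))
# {'project': '123', 'location': 'us-central1'}
```

Both versions return the same dictionaries; only the route to the compiled pattern differs.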

Why this improves performance:

  • Eliminates per-call pattern lookup: Python's re module caches compiled patterns, so re.match() does not recompile the pattern on every call, but it still has to look the pattern string up in that cache each time. Compiling once with re.compile() and reusing the pattern object skips the lookup.
  • Reduces function call overhead: The compiled pattern object's match() method is called directly, removing the extra module-level function call that re.match() makes before it reaches the same method.
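
A rough way to observe the per-call difference locally is a small `timeit` comparison. This is a sketch under assumptions: the helper names `via_module_function`, `via_compiled_pattern`, and `measure` are hypothetical, and absolute timings depend on the machine and Python version. Note that CPython's `re` module keeps a cache of compiled patterns, so what this measures is mostly per-call lookup and dispatch cost.

```python
import re
import timeit
from typing import Dict

PATTERN = r"^projects/(?P<project>.+?)/locations/(?P<location>.+?)(/|$)"
COMPILED = re.compile(PATTERN)
PARENT = "projects/123/locations/us-central1/datasets/456"


def via_module_function() -> bool:
    # Passes the pattern string to the module-level function on each call.
    return re.match(PATTERN, PARENT) is not None


def via_compiled_pattern() -> bool:
    # Calls match() directly on the precompiled pattern object.
    return COMPILED.match(PARENT) is not None


def measure(number: int = 20_000) -> Dict[str, float]:
    # Take the best of five repeats to reduce noise from other processes.
    return {
        "re.match": min(timeit.repeat(via_module_function, number=number, repeat=5)),
        "compiled.match": min(timeit.repeat(via_compiled_pattern, number=number, repeat=5)),
    }


if __name__ == "__main__":
    for label, seconds in measure().items():
        print(f"{label}: {seconds:.4f} s")
```

Taking the minimum over repeats, as above, is the usual way to read `timeit` results, since higher values mostly reflect interference from the rest of the system.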

Performance benefits across test cases:

  • Significant gains on simple cases: Basic valid inputs run about 30-50% faster (e.g., the case with no trailing slash improves from 2.36μs to 1.68μs)
  • Largest relative gains on quickly rejected inputs: Inputs that fail within the first few characters run about 90-115% faster (e.g., the empty string improves from 1.37μs to 689ns). Matching itself takes very little time there, so the fixed per-call overhead that was removed makes up a larger share of the runtime
  • Consistent improvements at scale: Loops over 1000 inputs run about 40-50% faster. The gain shrinks to about 5-6% for project and location names that are several hundred characters long, where the matching itself dominates
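
The "X% faster" figures in this report appear to be consistent with `old / new - 1`. That is an inference from the numbers shown here, not something the report states, and the helper name `percent_faster` is hypothetical. A short check against the headline runtime and one edge case:

```python
def percent_faster(old: float, new: float) -> float:
    # Assumed definition: relative speedup expressed as a percentage.
    # Both arguments must use the same time unit.
    return (old / new - 1.0) * 100.0


# Headline figure: 3.40 ms -> 2.34 ms
print(f"{percent_faster(3.40, 2.34):.1f}%")  # 45.3%

# Edge case: 1.31 μs -> 610 ns, both expressed in nanoseconds
print(f"{percent_faster(1310, 610):.1f}%")  # 114.8%
```

Under this definition, a value above 100% means the optimized call takes less than half the original time. Small differences from the reported percentages are expected because the displayed runtimes are rounded.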

The optimization is most effective for functions called frequently with the same regex pattern, which is typical for utility functions like this one used throughout a codebase.

Correctness verification report:

| Test | Status |
|---|---|
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | 4056 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 100.0% |
🌀 Generated Regression Tests and Runtime
import re
from typing import Dict

# imports
import pytest  # used for our unit tests
from aiplatform.utils.__init__ import extract_project_and_location_from_parent

# unit tests

# -------------------------
# Basic Test Cases
# -------------------------

def test_basic_standard_case():
    # Standard input with dataset at the end
    parent = "projects/123/locations/us-central1/datasets/456"
    codeflash_output = extract_project_and_location_from_parent(parent); result = codeflash_output # 2.88μs -> 2.21μs (30.4% faster)

def test_basic_with_trailing_slash():
    # Input ends right after location with a slash
    parent = "projects/abc/locations/europe-west1/"
    codeflash_output = extract_project_and_location_from_parent(parent); result = codeflash_output # 2.62μs -> 2.02μs (29.3% faster)

def test_basic_with_no_trailing_slash():
    # Input ends right after location with no slash
    parent = "projects/p1/locations/l1"
    codeflash_output = extract_project_and_location_from_parent(parent); result = codeflash_output # 2.36μs -> 1.68μs (40.6% faster)

def test_basic_with_additional_path():
    # Input has more than just dataset after location
    parent = "projects/myproj/locations/loc123/datasets/789/something/else"
    codeflash_output = extract_project_and_location_from_parent(parent); result = codeflash_output # 2.47μs -> 1.80μs (37.3% faster)

def test_basic_with_numeric_project_and_location():
    # Project and location are numbers
    parent = "projects/9999/locations/8888/datasets/7777"
    codeflash_output = extract_project_and_location_from_parent(parent); result = codeflash_output # 2.40μs -> 1.58μs (51.4% faster)

# -------------------------
# Edge Test Cases
# -------------------------

def test_edge_empty_string():
    # Empty input string should return empty dict
    parent = ""
    codeflash_output = extract_project_and_location_from_parent(parent); result = codeflash_output # 1.37μs -> 689ns (99.0% faster)

def test_edge_missing_projects_prefix():
    # Missing 'projects/' prefix should return empty dict
    parent = "123/locations/us-central1/datasets/456"
    codeflash_output = extract_project_and_location_from_parent(parent); result = codeflash_output # 1.33μs -> 699ns (91.0% faster)

def test_edge_missing_locations_prefix():
    # Missing 'locations/' prefix should return empty dict
    parent = "projects/123/us-central1/datasets/456"
    codeflash_output = extract_project_and_location_from_parent(parent); result = codeflash_output # 2.08μs -> 1.40μs (48.7% faster)

def test_edge_missing_project_value():
    # No value after 'projects/'
    parent = "projects//locations/us-central1/datasets/456"
    codeflash_output = extract_project_and_location_from_parent(parent); result = codeflash_output # 2.03μs -> 1.30μs (56.0% faster)

def test_edge_missing_location_value():
    # No value after 'locations/'
    parent = "projects/123/locations//datasets/456"
    codeflash_output = extract_project_and_location_from_parent(parent); result = codeflash_output # 2.71μs -> 2.17μs (24.9% faster)

def test_edge_project_and_location_with_special_characters():
    # Project and location contain special characters
    parent = "projects/proj-!@#/locations/loc$%^/datasets/456"
    codeflash_output = extract_project_and_location_from_parent(parent); result = codeflash_output # 2.50μs -> 1.83μs (36.5% faster)

def test_edge_multiple_projects_locations():
    # Multiple 'projects' and 'locations' in path, should match first occurrence
    parent = "projects/first/locations/one/projects/second/locations/two"
    codeflash_output = extract_project_and_location_from_parent(parent); result = codeflash_output # 2.48μs -> 1.75μs (41.1% faster)

def test_edge_project_and_location_with_slashes():
    # Project or location contain slashes (should not be possible, but test anyway)
    parent = "projects/a/b/locations/c/d"
    codeflash_output = extract_project_and_location_from_parent(parent); result = codeflash_output # 2.23μs -> 1.54μs (44.9% faster)

def test_edge_project_and_location_with_unicode():
    # Unicode characters in project and location
    parent = "projects/项目/locations/位置/datasets/456"
    codeflash_output = extract_project_and_location_from_parent(parent); result = codeflash_output # 3.16μs -> 2.47μs (28.3% faster)

def test_edge_only_projects_and_locations():
    # Only projects and locations, nothing after location
    parent = "projects/proj123/locations/loc456"
    codeflash_output = extract_project_and_location_from_parent(parent); result = codeflash_output # 2.76μs -> 1.99μs (38.4% faster)

def test_edge_projects_and_locations_at_end_of_string():
    # Input ends exactly after location, no slash
    parent = "projects/p/locations/l"
    codeflash_output = extract_project_and_location_from_parent(parent); result = codeflash_output # 2.19μs -> 1.46μs (50.2% faster)

def test_edge_location_with_slash_in_value():
    # Extra segment after the location: the lazy location group stops at the first slash ("loc")
    parent = "projects/proj/locations/loc/extra"
    codeflash_output = extract_project_and_location_from_parent(parent); result = codeflash_output # 2.40μs -> 1.67μs (43.6% faster)

def test_edge_project_with_slash_in_value():
    # Project value contains a slash: the lazy project group extends up to "/locations/" ("proj/extra")
    parent = "projects/proj/extra/locations/loc"
    codeflash_output = extract_project_and_location_from_parent(parent); result = codeflash_output # 2.59μs -> 1.91μs (35.4% faster)

def test_edge_only_projects_prefix():
    # Only 'projects/' prefix, nothing else
    parent = "projects/"
    codeflash_output = extract_project_and_location_from_parent(parent); result = codeflash_output # 1.37μs -> 651ns (110% faster)

def test_edge_only_locations_prefix():
    # Only 'locations/' prefix, nothing else
    parent = "locations/"
    codeflash_output = extract_project_and_location_from_parent(parent); result = codeflash_output # 1.31μs -> 610ns (115% faster)

def test_edge_non_string_input():
    # Non-string input should raise TypeError
    with pytest.raises(TypeError):
        extract_project_and_location_from_parent(None) # 2.06μs -> 1.22μs (68.6% faster)
    with pytest.raises(TypeError):
        extract_project_and_location_from_parent(12345) # 1.41μs -> 706ns (99.2% faster)
    with pytest.raises(TypeError):
        extract_project_and_location_from_parent(["projects/1/locations/2"]) # 941ns -> 572ns (64.5% faster)

# -------------------------
# Large Scale Test Cases
# -------------------------

def test_large_scale_many_valid_inputs():
    # Test with 1000 valid parent strings
    for i in range(1, 1001):
        parent = f"projects/proj{i}/locations/loc{i}/datasets/dataset{i}"
        codeflash_output = extract_project_and_location_from_parent(parent); result = codeflash_output # 820μs -> 565μs (45.1% faster)

def test_large_scale_long_project_and_location_names():
    # Test with very long project and location names
    long_project = "p" * 500
    long_location = "l" * 500
    parent = f"projects/{long_project}/locations/{long_location}/datasets/1"
    codeflash_output = extract_project_and_location_from_parent(parent); result = codeflash_output # 14.8μs -> 14.0μs (6.06% faster)

def test_large_scale_mixed_valid_and_invalid_inputs():
    # Test a mix of valid and invalid parent strings
    valid = [f"projects/p{i}/locations/l{i}/datasets/d{i}" for i in range(1, 501)]
    invalid = [
        f"projects/p{i}/locs/l{i}/datasets/d{i}" for i in range(1, 251)
    ] + [
        f"proj/p{i}/locations/l{i}/datasets/d{i}" for i in range(1, 251)
    ]
    for parent in valid:
        codeflash_output = extract_project_and_location_from_parent(parent); result = codeflash_output # 377μs -> 248μs (51.7% faster)
        idx = int(parent.split('/')[1][1:])
    for parent in invalid:
        codeflash_output = extract_project_and_location_from_parent(parent); result = codeflash_output # 298μs -> 179μs (66.8% faster)

def test_large_scale_project_and_location_with_numbers_and_letters():
    # Test with project and location names mixed with numbers and letters
    for i in range(1, 1001):
        project = f"proj{i}abc"
        location = f"loc{i}xyz"
        parent = f"projects/{project}/locations/{location}/datasets/123"
        codeflash_output = extract_project_and_location_from_parent(parent); result = codeflash_output # 937μs -> 667μs (40.5% faster)

def test_large_scale_with_extra_long_path():
    # Test with extra long path after location
    project = "p"
    location = "l"
    extra = "/".join([f"segment{i}" for i in range(1, 100)])
    parent = f"projects/{project}/locations/{location}/{extra}"
    codeflash_output = extract_project_and_location_from_parent(parent); result = codeflash_output # 3.00μs -> 2.19μs (36.8% faster)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
#------------------------------------------------
import re
from typing import Dict

# imports
import pytest  # used for our unit tests
from aiplatform.utils.__init__ import extract_project_and_location_from_parent

# unit tests

# ---------------- BASIC TEST CASES ----------------

@pytest.mark.parametrize(
    "input_str,expected",
    [
        # Standard case with numeric project and named location
        ("projects/123/locations/us-central1/datasets/456", {"project": "123", "location": "us-central1"}),
        # Standard case with alphanumeric project and location
        ("projects/proj-abc/locations/europe-west2", {"project": "proj-abc", "location": "europe-west2"}),
        # Standard case with trailing slash
        ("projects/abc123/locations/loc-1/", {"project": "abc123", "location": "loc-1"}),
        # Standard case with extra path after location
        ("projects/myproj/locations/my-loc/other/resource", {"project": "myproj", "location": "my-loc"}),
        # Only up to location (no trailing slash)
        ("projects/p/locations/l", {"project": "p", "location": "l"}),
        # Only up to location (with trailing slash)
        ("projects/p/locations/l/", {"project": "p", "location": "l"}),
    ]
)
def test_basic_extraction(input_str, expected):
    """Test extraction of project and location in basic, valid cases."""
    codeflash_output = extract_project_and_location_from_parent(input_str); result = codeflash_output # 14.9μs -> 11.1μs (34.7% faster)

# ---------------- EDGE TEST CASES ----------------

@pytest.mark.parametrize(
    "input_str,expected",
    [
        # Missing 'projects/' prefix
        ("proj/123/locations/us-central1", {}),
        # Missing 'locations/' prefix
        ("projects/123/location/us-central1", {}),
        # Empty string
        ("", {}),
        # Only project, no location
        ("projects/123", {}),
        # Only location, no project
        ("locations/us-central1", {}),
        # Project and location are empty: ".+?" needs at least one character, so there is no match
        ("projects//locations//", {}),
        # Project is empty, location present: no match
        ("projects//locations/loc", {}),
        # Project present, location is empty: no match
        ("projects/abc/locations/", {}),
        # Slashes inside the project are captured; the location stops at the first slash
        ("projects/foo/bar/locations/baz", {"project": "foo/bar", "location": "baz"}),
        ("projects/foo/locations/bar/baz", {"project": "foo", "location": "bar"}),
        # Project or location contains special characters
        ("projects/pr@j!ct/locations/loc#1", {"project": "pr@j!ct", "location": "loc#1"}),
        # Project or location contains spaces
        ("projects/my project/locations/my location", {"project": "my project", "location": "my location"}),
        # Project or location contains unicode characters
        ("projects/项目/locations/位置", {"project": "项目", "location": "位置"}),
        # Project or location contains URL-encoded characters
        ("projects/proj%2Fid/locations/loc%2Fid", {"project": "proj%2Fid", "location": "loc%2Fid"}),
        # Project and location at the end of string (no trailing slash)
        ("projects/abc/locations/def", {"project": "abc", "location": "def"}),
        # Project and location at the end of string (with trailing slash)
        ("projects/abc/locations/def/", {"project": "abc", "location": "def"}),
        # Path with query string (no "/" follows, so it is captured as part of the location)
        ("projects/abc/locations/def?foo=bar", {"project": "abc", "location": "def?foo=bar"}),
        # Path with fragment (likewise captured as part of the location)
        ("projects/abc/locations/def#section", {"project": "abc", "location": "def#section"}),
        # Project or location is 'None' string
        ("projects/None/locations/None", {"project": "None", "location": "None"}),
    ]
)
def test_edge_cases(input_str, expected):
    """Test extraction with edge and unusual cases."""
    codeflash_output = extract_project_and_location_from_parent(input_str); result = codeflash_output # 43.9μs -> 30.7μs (43.1% faster)

def test_non_string_input_raises():
    """Test that non-string input raises TypeError (since re.match expects str)."""
    with pytest.raises(TypeError):
        extract_project_and_location_from_parent(None) # 2.03μs -> 1.29μs (57.4% faster)
    with pytest.raises(TypeError):
        extract_project_and_location_from_parent(123) # 1.41μs -> 716ns (96.5% faster)
    with pytest.raises(TypeError):
        extract_project_and_location_from_parent(["projects/1/locations/2"]) # 1.03μs -> 595ns (72.9% faster)

# ---------------- LARGE SCALE TEST CASES ----------------

def test_large_project_and_location_names():
    """Test extraction with very large project and location names."""
    large_project = "p" * 500
    large_location = "l" * 400
    s = f"projects/{large_project}/locations/{large_location}/datasets/foo"
    expected = {"project": large_project, "location": large_location}
    codeflash_output = extract_project_and_location_from_parent(s); result = codeflash_output # 12.8μs -> 12.2μs (5.33% faster)

def test_many_resource_paths():
    """Test extraction over a large number of valid resource paths."""
    # Generate 1000 resource paths with unique project/location values
    for i in range(1, 1001):
        project = f"proj{i}"
        location = f"loc{i}"
        s = f"projects/{project}/locations/{location}/datasets/{i}"
        expected = {"project": project, "location": location}
        codeflash_output = extract_project_and_location_from_parent(s); result = codeflash_output # 817μs -> 565μs (44.5% faster)

def test_large_number_of_slashes():
    """Test extraction when project/location contains many slashes."""
    project = "/".join([f"p{i}" for i in range(50)])
    location = "/".join([f"l{i}" for i in range(50)])
    s = f"projects/{project}/locations/{location}/datasets/123"
    expected = {"project": project, "location": "l0"}  # the lazy location group stops at the first slash
    codeflash_output = extract_project_and_location_from_parent(s); result = codeflash_output # 5.16μs -> 4.30μs (20.0% faster)

def test_performance_large_inputs(benchmark):
    """Benchmark extraction with a large input string."""
    large_project = "x" * 500
    large_location = "y" * 500
    s = f"projects/{large_project}/locations/{large_location}/datasets/foobar"
    # Benchmark the function to ensure reasonable performance
    result = benchmark(extract_project_and_location_from_parent, s)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
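
The generated tests above carry no explicit assertions; per the closing comment, outputs of the original and optimized code are compared instead. One way such a comparison could look is sketched here. It is an assumption about the approach, not Codeflash's actual harness: `original`, `optimized`, `outcome`, and `same_behavior` are hypothetical names, and the two function bodies are inferred from the pattern quoted in this comment.

```python
import re
from typing import Any, Callable, Dict, Iterable

PATTERN = r"^projects/(?P<project>.+?)/locations/(?P<location>.+?)(/|$)"
_PROJECT_LOCATION_RE = re.compile(PATTERN)


def original(parent: str) -> Dict[str, str]:
    # Assumed pre-optimization shape.
    match = re.match(PATTERN, parent)
    return match.groupdict() if match else {}


def optimized(parent: str) -> Dict[str, str]:
    # Assumed post-optimization shape.
    match = _PROJECT_LOCATION_RE.match(parent)
    return match.groupdict() if match else {}


def outcome(func: Callable[[Any], Dict[str, str]], arg: Any) -> Any:
    """Return the function's result, or the type of the exception it raises."""
    try:
        return func(arg)
    except Exception as exc:
        return type(exc)


def same_behavior(inputs: Iterable[Any]) -> bool:
    # Both versions must agree on return values and on raised exception types.
    return all(outcome(original, x) == outcome(optimized, x) for x in inputs)


SAMPLES = [
    "projects/123/locations/us-central1/datasets/456",
    "projects//locations//",
    "",
    None,
    12345,
]
print(same_behavior(SAMPLES))  # True
```

Comparing exception types as well as return values covers the non-string inputs, where both versions raise `TypeError`.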

To edit these changes git checkout codeflash/optimize-extract_project_and_location_from_parent-mgkl03l9 and push.

@codeflash-ai codeflash-ai bot requested a review from mashraf-222 October 10, 2025 08:25
@codeflash-ai codeflash-ai bot added the ⚡️ codeflash Optimization PR opened by Codeflash AI label Oct 10, 2025