Cannot create multiple model endpoints in local mode #2020

@pwerth

Description

Describe the bug
In local mode, two model endpoints cannot be created in the same session. When the second endpoint is created, it tries to use the same port (8080) that is still being occupied by the first endpoint, which results in any calls to predict being routed to the first container.

This is breaking my unit tests, because I have training-and-prediction tests that span multiple models. My hunch is that the problem is not related to pytest itself.

If I run either unit test individually, it passes. Run together, whichever test runs second always fails: its response has the wrong format, because the predict call in the second test ends up hitting the /invocations endpoint on the first container.
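
The only workaround I can see for now is to tear the first endpoint down before the second test deploys. A minimal sketch of what I mean, assuming predictor.delete_endpoint() is what stops the local serving container and frees port 8080:

predictor = estimator.deploy(initial_instance_count=1, instance_type='local')
try:
    response = json.loads(predictor.predict(json.dumps(data)))
    assert response
finally:
    # Assumed to stop the local container so port 8080 is released
    # before the next test deploys its own endpoint.
    predictor.delete_endpoint()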

To reproduce
I have two images corresponding to two different containers; let's call them image_1 and image_2.

In test_train_model1.py:

import json

import pandas as pd
from sagemaker.estimator import Estimator
from sagemaker.local import LocalSession

# `settings` is my project configuration object (omitted here)

def test_train_and_predict(tmp_path):
    sagemaker_session = LocalSession()
    sagemaker_session.config = {'local': {'local_code': True}}

    training_channel = tmp_path / "training"
    model_output = tmp_path / "model"
    training_output = tmp_path / "output"

    training_channel.mkdir(exist_ok=True)
    model_output.mkdir(exist_ok=True)
    training_output.mkdir(exist_ok=True)

    # Move the test dataset into the location expected by train container
    train = pd.read_csv(<path_to_local_csv>)
    train.to_csv(str(training_channel) + "/data.csv", index=False)

    estimator = Estimator(
        settings.get("image"), # equals `image_1`
        settings.get("iam_role_arn"),
        settings.getint('instance_count'),
        settings.get("instance_type"),
        base_job_name='model-1-train',
        volume_size=512,
        model_uri="file://" + str(model_output),
        output_path="file://" + str(training_output),
        sagemaker_session=sagemaker_session,
        hyperparameters=settings.get("hyperparameters")
    )

    estimator.fit({
        "training": f"file://{training_channel}"
    }, logs="All", wait=True)

    assert estimator

    # Check that model got saved to correct location
    assert estimator.model_data == f'file://{training_output}/model.tar.gz'

    # Deploy the model locally
    predictor = estimator.deploy(initial_instance_count=1, instance_type='local')

    # Check that we got a valid prediction
    data = ... # some test input
    response = json.loads(predictor.predict(json.dumps(data)))
    
    assert response 
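
One variation I considered is giving each test its own serving port, since the local config appears to accept a serving_port key alongside local_code (I have not confirmed that local mode honors it everywhere, so treat this as an assumption):

sagemaker_session = LocalSession()
# Hypothetical tweak: ask local mode to publish the endpoint on 8081
# instead of the default 8080 so the two containers cannot collide.
sagemaker_session.config = {'local': {'local_code': True, 'serving_port': 8081}}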

In test_model_2.py (identical logic, different image):

def test_train_and_predict(tmp_path):
    sagemaker_session = LocalSession()
    sagemaker_session.config = {'local': {'local_code': True}}

    training_channel = tmp_path / "training"
    model_output = tmp_path / "model"
    training_output = tmp_path / "output"

    training_channel.mkdir(exist_ok=True)
    model_output.mkdir(exist_ok=True)
    training_output.mkdir(exist_ok=True)

    # Move the test dataset into the location expected by train container
    train = pd.read_csv(<path_to_local_csv>)
    train.to_csv(str(training_channel) + "/data.csv", index=False)

    estimator = Estimator(
        settings.get("image"), # equals `image_2`
        settings.get("iam_role_arn"),
        settings.getint('instance_count'),
        settings.get("instance_type"),
        base_job_name='model-1-train',
        volume_size=512,
        model_uri="file://" + str(model_output),
        output_path="file://" + str(training_output),
        sagemaker_session=sagemaker_session,
        hyperparameters=settings.get("hyperparameters")
    )

    estimator.fit({
        "training": f"file://{training_channel}"
    }, logs="All", wait=True)

    assert estimator

    # Check that model got saved to correct location
    assert estimator.model_data == f'file://{training_output}/model.tar.gz'

    # Deploy the model locally
    predictor = estimator.deploy(initial_instance_count=1, instance_type='local')

    # Check that we got a valid prediction
    data = ... # some test input
    response = json.loads(predictor.predict(json.dumps(data)))
    
    assert response 
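
Since both tests need the same cleanup, the deploy/teardown could also live in a shared fixture so the container is stopped even when the prediction assertion fails. A rough sketch (the fixture is my own invention and again assumes delete_endpoint() works in local mode):

import pytest

@pytest.fixture
def deploy_locally():
    # Hands the test a deploy helper and always tears the endpoints down
    # afterwards, so port 8080 is free for the next test.
    predictors = []

    def _deploy(estimator):
        predictor = estimator.deploy(initial_instance_count=1, instance_type='local')
        predictors.append(predictor)
        return predictor

    yield _deploy

    for predictor in predictors:
        predictor.delete_endpoint()

Each test would then call predictor = deploy_locally(estimator) instead of estimator.deploy(...) directly.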

Expected behavior
I expect both containers to be built, both models to train, and both predictions to be served. When I call predict on the second estimator, it should hit the /invocations endpoint on the second container.

Screenshots or logs
Logs from the second test starting to run:

INFO     sagemaker.local.image:image.py:508 docker command: docker-compose -f /private/var/folders/yq/nt_pyt5112b702866l2l4zrm0000gn/T/tmphvpfolv1/docker-compose.yaml up --build --abort-on-container-exit
Creating tmphvpfolv1_algo-1-n4k0x_1 ... 
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/Users/me/Documents/code/my-repo/venv/lib/python3.7/site-packages/sagemaker-2.18.0-py3.7.egg/sagemaker/local/image.py", line 627, in run
    _stream_output(self.process)
  File "/Users/me/Documents/code/my-repo/venv/lib/python3.7/site-packages/sagemaker-2.18.0-py3.7.egg/sagemaker/local/image.py", line 687, in _stream_output
    raise RuntimeError("Process exited with code: %s" % exit_code)
RuntimeError: Process exited with code: 1

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.7/lib/python3.7/threading.py", line 917, in _bootstrap_inner
    self.run()
  File "/Users/me/Documents/code/my-repo/venv/lib/python3.7/site-packages/sagemaker-2.18.0-py3.7.egg/sagemaker/local/image.py", line 632, in run
    raise RuntimeError(msg)
RuntimeError: Failed to run: ['docker-compose', '-f', '/private/var/folders/yq/nt_pyt5112b702866l2l4zrm0000gn/T/tmpkq9_j1e0/docker-compose.yaml', 'up', '--build', '--abort-on-container-exit'], Process exited with code: 1
Creating tmphvpfolv1_algo-1-n4k0x_1 ... done

I tried running the command manually in my terminal and got the following:

docker-compose -f '/private/var/folders/yq/nt_pyt5112b702866l2l4zrm0000gn/T/tmpkq9_j1e0/docker-compose.yaml' up --build --abort-on-container-exit
Starting tmpkq9_j1eo_algo-1-6cyan_1 ...
Starting tmpkq9_j1eo_algo-1-6cyan_1 ... error

ERROR: for tmpkq9_j1eo_algo-1-6cyan_1 Cannot start service algo-1-6cyan: driver failed programming external connectivity on endpoint tmpq9_j1e0_algo-1-6cyan_1 (7c68cf8e050f0c06aa99532dc52335a6961b298ec19c328068944d3504ed98): Bind for 0.0.0.0:8080 failed: port is already allocated

ERROR: for algo-1-6cyan Cannot start service algo-1-6cyan: driver failed programming external connectivity on endpoint tmpkq9_j1e0_algo-1-6cyan_1 (7c68cf8e050f0c06aa99532dc52335a6961b298ec19c328068944d3504ed98): Bind for 0.0.0.0:8080 failed: port is already allocated
ERROR: Encountered errors while bringing up the project.

Note: this output is more useful than what comes out of the SDK.
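
When a run dies in this state I have to stop the lingering container by hand before anything can bind 8080 again; roughly the equivalent of this helper (my own, it just shells out to the docker CLI):

import subprocess

def free_port_8080():
    # Find whichever container is still publishing port 8080 and stop it,
    # so the next local deploy can bind the port again.
    container_ids = subprocess.run(
        ["docker", "ps", "--filter", "publish=8080", "-q"],
        capture_output=True, text=True, check=True,
    ).stdout.split()
    for container_id in container_ids:
        subprocess.run(["docker", "stop", container_id], check=True)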

System information

  • SageMaker Python SDK version: 2.18.0
  • Framework name (e.g. PyTorch) or algorithm (e.g. KMeans): LightGBM
  • Framework version: 3.1.0
  • Python version: 3.7
  • Custom Docker image (Y/N): Y

