Cannot create multiple model endpoints in local mode #2020

@pwerth

Description

Describe the bug
In local mode, two model endpoints cannot be created in the same session. When the second endpoint is created, it tries to use the same port (8080) that is still being occupied by the first endpoint, which results in any calls to predict being routed to the first container.

This is breaking my unit tests, because I have training-and-prediction tests that span multiple models. My hunch is that the problem is not related to pytest itself.

If I run either unit test individually, it passes. Run together, whichever test runs second always fails: its response has the wrong format, because the predict call in the second test ends up hitting the /invocations endpoint on the first container.
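
The only workaround I can see for now is to tear the first endpoint down before the second test deploys. A minimal sketch of what I mean, assuming predictor.delete_endpoint() is what stops the local serving container and frees port 8080:

predictor = estimator.deploy(initial_instance_count=1, instance_type='local')
try:
    response = json.loads(predictor.predict(json.dumps(data)))
    assert response
finally:
    # Assumed to stop the local container so port 8080 is released
    # before the next test deploys its own endpoint.
    predictor.delete_endpoint()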

To reproduce
I have two images corresponding to two different containers; let's call them image_1 and image_2.

In test_train_model1.py:

import json

import pandas as pd
from sagemaker.estimator import Estimator
from sagemaker.local import LocalSession

# `settings` is my project configuration object (omitted here)

def test_train_and_predict(tmp_path):
    sagemaker_session = LocalSession()
    sagemaker_session.config = {'local': {'local_code': True}}

    training_channel = tmp_path / "training"
    model_output = tmp_path / "model"
    training_output = tmp_path / "output"

    training_channel.mkdir(exist_ok=True)
    model_output.mkdir(exist_ok=True)
    training_output.mkdir(exist_ok=True)

    # Move the test dataset into the location expected by train container
    train = pd.read_csv(<path_to_local_csv>)
    train.to_csv(str(training_channel) + "/data.csv", index=False)

    estimator = Estimator(
        settings.get("image"), # equals `image_1`
        settings.get("iam_role_arn"),
        settings.getint('instance_count'),
        settings.get("instance_type"),
        base_job_name='model-1-train',
        volume_size=512,
        model_uri="file://" + str(model_output),
        output_path="file://" + str(training_output),
        sagemaker_session=sagemaker_session,
        hyperparameters=settings.get("hyperparameters")
    )

    estimator.fit({
        "training": f"file://{training_channel}"
    }, logs="All", wait=True)

    assert estimator

    # Check that model got saved to correct location
    assert estimator.model_data == f'file://{training_output}/model.tar.gz'

    # Deploy the model locally
    predictor = estimator.deploy(initial_instance_count=1, instance_type='local')

    # Check that we got a valid prediction
    data = ... # some test input
    response = json.loads(predictor.predict(json.dumps(data)))
    
    assert response 
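
One variation I considered is giving each test its own serving port, since the local config appears to accept a serving_port key alongside local_code (I have not confirmed that local mode honors it everywhere, so treat this as an assumption):

sagemaker_session = LocalSession()
# Hypothetical tweak: ask local mode to publish the endpoint on 8081
# instead of the default 8080 so the two containers cannot collide.
sagemaker_session.config = {'local': {'local_code': True, 'serving_port': 8081}}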

In test_model_2.py (identical logic, different image):

def test_train_and_predict(tmp_path):
    sagemaker_session = LocalSession()
    sagemaker_session.config = {'local': {'local_code': True}}

    training_channel = tmp_path / "training"
    model_output = tmp_path / "model"
    training_output = tmp_path / "output"

    training_channel.mkdir(exist_ok=True)
    model_output.mkdir(exist_ok=True)
    training_output.mkdir(exist_ok=True)

    # Move the test dataset into the location expected by train container
    train = pd.read_csv(<path_to_local_csv>)
    train.to_csv(str(training_channel) + "/data.csv", index=False)

    estimator = Estimator(
        settings.get("image"), # equals `image_2`
        settings.get("iam_role_arn"),
        settings.getint('instance_count'),
        settings.get("instance_type"),
        base_job_name='model-1-train',
        volume_size=512,
        model_uri="file://" + str(model_output),
        output_path="file://" + str(training_output),
        sagemaker_session=sagemaker_session,
        hyperparameters=settings.get("hyperparameters")
    )

    estimator.fit({
        "training": f"file://{training_channel}"
    }, logs="All", wait=True)

    assert estimator

    # Check that model got saved to correct location
    assert estimator.model_data == f'file://{training_output}/model.tar.gz'

    # Deploy the model locally
    predictor = estimator.deploy(initial_instance_count=1, instance_type='local')

    # Check that we got a valid prediction
    data = ... # some test input
    response = json.loads(predictor.predict(json.dumps(data)))
    
    assert response 
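
Since both tests need the same cleanup, the deploy/teardown could also live in a shared fixture so the container is stopped even when the prediction assertion fails. A rough sketch (the fixture is my own invention and again assumes delete_endpoint() works in local mode):

import pytest

@pytest.fixture
def deploy_locally():
    # Hands the test a deploy helper and always tears the endpoints down
    # afterwards, so port 8080 is free for the next test.
    predictors = []

    def _deploy(estimator):
        predictor = estimator.deploy(initial_instance_count=1, instance_type='local')
        predictors.append(predictor)
        return predictor

    yield _deploy

    for predictor in predictors:
        predictor.delete_endpoint()

Each test would then call predictor = deploy_locally(estimator) instead of estimator.deploy(...) directly.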

Expected behavior
I expect both containers to be built, both models to train, and both predictions to be served. When I call predict on the second estimator, it should hit the /invocations endpoint on the second container.

Screenshots or logs
Logs from the second test starting to run:

INFO     sagemaker.local.image:image.py:508 docker command: docker-compose -f /private/var/folders/yq/nt_pyt5112b702866l2l4zrm0000gn/T/tmphvpfolv1/docker-compose.yaml up --build --abort-on-container-exit
Creating tmphvpfolv1_algo-1-n4k0x_1 ... 
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/Users/me/Documents/code/my-repo/venv/lib/python3.7/site-packages/sagemaker-2.18.0-py3.7.egg/sagemaker/local/image.py", line 627, in run
    _stream_output(self.process)
  File "/Users/me/Documents/code/my-repo/venv/lib/python3.7/site-packages/sagemaker-2.18.0-py3.7.egg/sagemaker/local/image.py", line 687, in _stream_output
    raise RuntimeError("Process exited with code: %s" % exit_code)
RuntimeError: Process exited with code: 1

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.7/lib/python3.7/threading.py", line 917, in _bootstrap_inner
    self.run()
  File "/Users/me/Documents/code/my-repo/venv/lib/python3.7/site-packages/sagemaker-2.18.0-py3.7.egg/sagemaker/local/image.py", line 632, in run
    raise RuntimeError(msg)
RuntimeError: Failed to run: ['docker-compose', '-f', '/private/var/folders/yq/nt_pyt5112b702866l2l4zrm0000gn/T/tmpkq9_j1e0/docker-compose.yaml', 'up', '--build', '--abort-on-container-exit'], Process exited with code: 1
Creating tmphvpfolv1_algo-1-n4k0x_1 ... done

I tried running the command manually in my terminal and got the following:

docker-compose -f '/private/var/folders/yq/nt_pyt5112b702866l2l4zrm0000gn/T/tmpkq9_j1e0/docker-compose.yaml' up --build --abort-on-container-exit
Starting tmpkq9_j1eo_algo-1-6cyan_1 ...
Starting tmpkq9_j1eo_algo-1-6cyan_1 ... error

ERROR: for tmpkq9_j1eo_algo-1-6cyan_1 Cannot start service algo-1-6cyan: driver failed programming external connectivity on endpoint tmpq9_j1e0_algo-1-6cyan_1 (7c68cf8e050f0c06aa99532dc52335a6961b298ec19c328068944d3504ed98): Bind for 0.0.0.0:8080 failed: port is already allocated

ERROR: for algo-1-6cyan Cannot start service algo-1-6cyan: driver failed programming external connectivity on endpoint tmpkq9_j1e0_algo-1-6cyan_1 (7c68cf8e050f0c06aa99532dc52335a6961b298ec19c328068944d3504ed98): Bind for 0.0.0.0:8080 failed: port is already allocated
ERROR: Encountered errors while bringing up the project.

Note: this output is more useful than what comes out of the SDK.
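
When a run dies in this state I have to stop the lingering container by hand before anything can bind 8080 again; roughly the equivalent of this helper (my own, it just shells out to the docker CLI):

import subprocess

def free_port_8080():
    # Find whichever container is still publishing port 8080 and stop it,
    # so the next local deploy can bind the port again.
    container_ids = subprocess.run(
        ["docker", "ps", "--filter", "publish=8080", "-q"],
        capture_output=True, text=True, check=True,
    ).stdout.split()
    for container_id in container_ids:
        subprocess.run(["docker", "stop", container_id], check=True)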

System information

  • SageMaker Python SDK version: 2.18.0
  • Framework name (e.g. PyTorch) or algorithm (e.g. KMeans): LightGBM
  • Framework version: 3.1.0
  • Python version: 3.7
  • Custom Docker image (Y/N): Y

