Describe the bug
In local mode, two model endpoints cannot be created in the same session. When the second endpoint is created, it tries to use the same port (8080) that is still being occupied by the first endpoint, which results in any calls to predict being routed to the first container.
This breaks my unit tests, since I have training-and-prediction tests that span multiple models. My hunch is that this is not related to pytest, though.
Either test passes when run on its own, but when they run together, whichever test runs second always fails: its response has the wrong format because the call to predict ends up hitting the /invocations endpoint on the first container.
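To illustrate the underlying failure mode (this is a plain-socket sketch of the OS-level behavior, not SageMaker-specific): once one process holds a port, a second bind attempt on the same port fails, which is exactly what happens when the second serving container tries to bind 8080.

```python
import socket

# First socket takes a port and listens, like the first serving container.
first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
first.bind(("127.0.0.1", 0))          # let the OS pick a free port
port = first.getsockname()[1]
first.listen(1)

# Second socket tries to bind the same port, like the second container on 8080.
second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    second.bind(("127.0.0.1", port))  # port is still held by `first`
    conflict = False
except OSError:                       # "Address already in use"
    conflict = True
finally:
    second.close()
    first.close()

print(conflict)
```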
To reproduce
I have two images corresponding to two different containers, let's call them image-1 and image-2.
In test_train_model1.py:
def test_train_and_predict(tmp_path):
    sagemaker_session = LocalSession()
    sagemaker_session.config = {'local': {'local_code': True}}

    training_channel = tmp_path / "training"
    model_output = tmp_path / "model"
    training_output = tmp_path / "output"
    training_channel.mkdir(exist_ok=True)
    model_output.mkdir(exist_ok=True)
    training_output.mkdir(exist_ok=True)

    # Move the test dataset into the location expected by the training container
    train = pd.read_csv(<path_to_local_csv>)
    train.to_csv(str(training_channel) + "/data.csv", index=False)

    estimator = Estimator(
        settings.get("image"),  # equals `image_1`
        settings.get("iam_role_arn"),
        settings.getint('instance_count'),
        settings.get("instance_type"),
        base_job_name='model-1-train',
        volume_size=512,
        model_uri="file://" + str(model_output),
        output_path="file://" + str(training_output),
        sagemaker_session=sagemaker_session,
        hyperparameters=settings.get("hyperparameters")
    )
    estimator.fit({
        "training": f"file://{training_channel}"
    }, logs="All", wait=True)
    assert estimator

    # Check that the model got saved to the correct location
    assert estimator.model_data == f'file://{training_output}/model.tar.gz'

    # Deploy the model locally
    predictor = estimator.deploy(initial_instance_count=1, instance_type='local')

    # Check that we got a valid prediction
    data = ...  # some test input
    response = json.loads(predictor.predict(json.dumps(data)))
    assert response
In test_model_2.py (identical logic, different image):
def test_train_and_predict(tmp_path):
    sagemaker_session = LocalSession()
    sagemaker_session.config = {'local': {'local_code': True}}

    training_channel = tmp_path / "training"
    model_output = tmp_path / "model"
    training_output = tmp_path / "output"
    training_channel.mkdir(exist_ok=True)
    model_output.mkdir(exist_ok=True)
    training_output.mkdir(exist_ok=True)

    # Move the test dataset into the location expected by the training container
    train = pd.read_csv(<path_to_local_csv>)
    train.to_csv(str(training_channel) + "/data.csv", index=False)

    estimator = Estimator(
        settings.get("image"),  # equals `image_2`
        settings.get("iam_role_arn"),
        settings.getint('instance_count'),
        settings.get("instance_type"),
        base_job_name='model-1-train',
        volume_size=512,
        model_uri="file://" + str(model_output),
        output_path="file://" + str(training_output),
        sagemaker_session=sagemaker_session,
        hyperparameters=settings.get("hyperparameters")
    )
    estimator.fit({
        "training": f"file://{training_channel}"
    }, logs="All", wait=True)
    assert estimator

    # Check that the model got saved to the correct location
    assert estimator.model_data == f'file://{training_output}/model.tar.gz'

    # Deploy the model locally
    predictor = estimator.deploy(initial_instance_count=1, instance_type='local')

    # Check that we got a valid prediction
    data = ...  # some test input
    response = json.loads(predictor.predict(json.dumps(data)))
    assert response
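A possible stopgap until this is fixed (an untested sketch, based on my reading of the SDK): local mode reads a serving_port key from the session's local config, so giving each test a distinct port should avoid the clash.

```python
# Workaround sketch (untested): give each test its own serving port so the
# two local endpoints don't both try to bind 8080. The port value 8081 here
# is arbitrary; vary it per test.
sagemaker_session.config = {'local': {'local_code': True, 'serving_port': 8081}}
```

Alternatively, calling `predictor.delete_endpoint()` at the end of each test should stop the first serving container, releasing 8080 before the second deploy.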
Expected behavior
I expect both containers to be built, both models to train, and both predictions to be served. When I call predict on the second estimator, it should hit the /invocations endpoint on the second container.
Screenshots or logs
Logs from the second test starting to run:
INFO sagemaker.local.image:image.py:508 docker command: docker-compose -f /private/var/folders/yq/nt_pyt5112b702866l2l4zrm0000gn/T/tmphvpfolv1/docker-compose.yaml up --build --abort-on-container-exit
Creating tmphvpfolv1_algo-1-n4k0x_1 ...
Exception in thread Thread-1:
Traceback (most recent call last):
File "/Users/me/Documents/code/my-repo/venv/lib/python3.7/site-packages/sagemaker-2.18.0-py3.7.egg/sagemaker/local/image.py", line 627, in run
_stream_output(self.process)
File "/Users/me/Documents/code/my-repo/venv/lib/python3.7/site-packages/sagemaker-2.18.0-py3.7.egg/sagemaker/local/image.py", line 687, in _stream_output
raise RuntimeError("Process exited with code: %s" % exit_code)
RuntimeError: Process exited with code: 1
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.7/lib/python3.7/threading.py", line 917, in _bootstrap_inner
self.run()
File "/Users/me/Documents/code/my-repo/venv/lib/python3.7/site-packages/sagemaker-2.18.0-py3.7.egg/sagemaker/local/image.py", line 632, in run
raise RuntimeError(msg)
RuntimeError: Failed to run: ['docker-compose', '-f', '/private/var/folders/yq/nt_pyt5112b702866l2l4zrm0000gn/T/tmpkq9_j1e0/docker-compose.yaml', 'up', '--build', '--abort-on-container-exit'], Process exited with code: 1
Creating tmphvpfolv1_algo-1-n4k0x_1 ... done
I tried running the command manually in my terminal and got the following:
docker-compose -f '/private/var/folders/yq/nt_pyt5112b702866l2l4zrm0000gn/T/tmpkq9_j1e0/docker-compose.yaml' up --build --abort-on-container-exit
Starting tmpkq9_j1eo_algo-1-6cyan_1 ...
Starting tmpkq9_j1eo_algo-1-6cyan_1 ... error
ERROR: for tmpkq9_j1eo_algo-1-6cyan_1 Cannot start service algo-1-6cyan: driver failed programming external connectivity on endpoint tmpq9_j1e0_algo-1-6cyan_1 (7c68cf8e050f0c06aa99532dc52335a6961b298ec19c328068944d3504ed98): Bind for 0.0.0.0:8080 failed: port is already allocated
ERROR: for algo-1-6cyan Cannot start service algo-1-6cyan: driver failed programming external connectivity on endpoint tmpkq9_j1e0_algo-1-6cyan_1 (7c68cf8e050f0c06aa99532dc52335a6961b298ec19c328068944d3504ed98): Bind for 0.0.0.0:8080 failed: port is already allocated
ERROR: Encountered errors while bringing up the project.
Note: this output is more useful than what comes out of the SDK.
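For anyone else hitting this, the stale container can also be cleared by hand between runs; a sketch, assuming the Docker CLI is available (the `publish` filter matches containers publishing that port):

```shell
# Find whichever container is still publishing port 8080 and force-remove it,
# so the next local deploy can bind the port.
docker ps --filter "publish=8080" --format "{{.ID}}" | xargs -r docker rm -f
```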
System information
A description of your system. Please provide:
- SageMaker Python SDK version: 2.18.0
- Framework name (eg. PyTorch) or algorithm (eg. KMeans): LightGBM
- Framework version: 3.1.0
- Python version: 3.7
- Custom Docker image (Y/N): Y