[SPARK-43122][CONNECT][PYTHON][ML][TESTS] Reenable TorchDistributorLocalUnitTestsOnConnect and TorchDistributorLocalUnitTestsIIOnConnect #40793
Conversation
Let me try merging commits from master several times, to see whether this fix is stable enough.

This PR actually only changes tests. In all the 6 runs, the PyTorch-related tests all passed, so I think it is ready for review.
python/pyspark/ml/tests/connect/test_parity_torch_distributor.py (Outdated)
gpu_discovery_script_file = tempfile.NamedTemporaryFile(delete=False)
gpu_discovery_script_file_name = gpu_discovery_script_file.name
gpu_discovery_script_file.write(
    b'echo {\\"name\\": \\"gpu\\", \\"addresses\\": [\\"0\\",\\"1\\",\\"2\\"]}'
No biggie, but I would convert a dict to JSON and write it after encoding it.
I didn't find an easy way to add the `\`, let me just follow the existing test for now.
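For reference, a minimal sketch of the suggested dict-to-JSON approach (the `payload` and `script` names are illustrative, not part of the PR):

import json
import tempfile

# Build the discovery output as a dict and serialize it, rather than
# hand-escaping JSON inside a bytes literal.
payload = json.dumps({"name": "gpu", "addresses": ["0", "1", "2"]})
# The script must echo the JSON with literal double quotes, so each quote
# is escaped as \" before being written into the shell script.
script = "echo " + payload.replace('"', '\\"')

gpu_discovery_script_file = tempfile.NamedTemporaryFile(delete=False)
gpu_discovery_script_file.write(script.encode("utf-8"))
gpu_discovery_script_file.close()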
gpu_discovery_script_file.write(
    b'echo {\\"name\\": \\"gpu\\", \\"addresses\\": [\\"0\\",\\"1\\",\\"2\\"]}'
)
gpu_discovery_script_file.close()
Got it, will update.
python/pyspark/ml/tests/connect/test_parity_torch_distributor.py (Outdated)
# create temporary directory for Worker resources coordination
tempdir = tempfile.NamedTemporaryFile(delete=False)
os.unlink(tempdir.name)
os.chmod(
Mind adding a comment on what this means?
Actually, this `set_up_test_dirs` method just follows the initial logic.
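For context, this is what the unlink trick above accomplishes (a sketch with a hypothetical `make_coordination_dir` helper, not the PR's code):

import os
import tempfile


def make_coordination_dir() -> str:
    # NamedTemporaryFile(delete=False) reserves a unique filesystem name;
    # unlinking the file frees that name so it can be reused as a fresh
    # directory for Worker resources coordination.
    f = tempfile.NamedTemporaryFile(delete=False)
    os.unlink(f.name)
    os.makedirs(f.name)
    return f.name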
    return train_fn


def set_up_test_dirs():
Ideally we should define another mixin (like `SQLTestUtils`) and inherit it to be consistent, but I am fine as-is for now since it's just a few methods, and they do not assume state.
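Something along these lines, presumably (a hypothetical sketch; the mixin name is illustrative and not part of the PR):

import os
import tempfile
import unittest


class TorchDistributorTestUtilsMixin:
    # Stateless helpers shared across the TorchDistributor test classes,
    # analogous to how SQLTestUtils is mixed into the SQL tests.
    @staticmethod
    def set_up_test_dirs():
        # Reserve a unique path for Worker resources coordination.
        tempdir = tempfile.NamedTemporaryFile(delete=False)
        os.unlink(tempdir.name)
        return tempdir.name


class TorchDistributorLocalUnitTestsOnConnect(
    TorchDistributorTestUtilsMixin, unittest.TestCase
):
    pass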
I see a failure in …

In all the 10 runs, …

Merged to master.
What changes were proposed in this pull request?
`TorchDistributorLocalUnitTestsOnConnect` and `TorchDistributorLocalUnitTestsIIOnConnect` were not stable and occasionally got stuck; however, I can not reproduce the issue locally. The two UTs were disabled, and this PR reenables them.

I found that all the tests for PyTorch set up the regular sessions or Connect sessions in `setUp` and close them in `tearDown`, but such session operations are very expensive and should be placed in `setUpClass` and `tearDownClass` instead. After this change, the related tests seem much more stable. So I think the root cause is still related to resources: since TorchDistributor works in barrier mode, when there are not enough resources in GitHub Actions, the tests just keep waiting.
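Roughly, the session handling moves from per-test to per-class, along these lines (a sketch; the class body and the `local[4]` remote URL are illustrative, not the exact diff):

import unittest

from pyspark.sql import SparkSession


class TorchDistributorLocalUnitTestsOnConnect(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        # Creating a Spark Connect session is expensive, so do it once per
        # test class instead of once per test method.
        cls.spark = SparkSession.builder.remote("local[4]").getOrCreate()

    @classmethod
    def tearDownClass(cls):
        cls.spark.stop()

Why are the changes needed?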
for test coverage
Does this PR introduce any user-facing change?
No, test-only
How was this patch tested?
CI