
[SPARK-43122][CONNECT][PYTHON][ML][TESTS] Reenable TorchDistributorLocalUnitTestsOnConnect and TorchDistributorLocalUnitTestsIIOnConnect #40793

Closed
zhengruifeng wants to merge 11 commits into apache:master from zhengruifeng:torch_reenable

Conversation

@zhengruifeng (Contributor) commented Apr 14, 2023

What changes were proposed in this pull request?

TorchDistributorLocalUnitTestsOnConnect and TorchDistributorLocalUnitTestsIIOnConnect were not stable and occasionally got stuck; however, I could not reproduce the issue locally.

The two UTs were disabled, and this PR re-enables them. I found that all the PyTorch tests set up their regular or Connect sessions in setUp and close them in tearDown; however, such session operations are very expensive and should be placed in setUpClass and tearDownClass instead. After this change, the related tests seem much more stable. So I think the root cause is indeed resource-related: since TorchDistributor works in barrier mode, the tests just keep waiting when there are not enough resources in GitHub Actions.
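
A minimal sketch of the change described above (the class name and master URL here are hypothetical, not taken from the PR): session creation moves from per-test setUp/tearDown to per-class setUpClass/tearDownClass.

import unittest

from pyspark.sql import SparkSession


class TorchDistributorTestBase(unittest.TestCase):
    # Before: the session was created in setUp and stopped in tearDown,
    # paying the startup cost once per test method. Creating it once per
    # class is far cheaper and leaves more resources for the barrier-mode
    # tasks.
    @classmethod
    def setUpClass(cls):
        cls.spark = SparkSession.builder.master("local[4]").getOrCreate()

    @classmethod
    def tearDownClass(cls):
        cls.spark.stop()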

Why are the changes needed?

for test coverage

Does this PR introduce any user-facing change?

No, test-only

How was this patch tested?

CI

@zhengruifeng (Contributor, Author)

Let me try merging commits from master several times to see whether this fix is stable enough.

zhengruifeng changed the title [WIP][SPARK-43122][TESTS] Reenable TorchDistributorLocalUnitTestsOnConnect and TorchDistributorLocalUnitTestsIIOnConnect [SPARK-43122][TESTS] Reenable TorchDistributorLocalUnitTestsOnConnect and TorchDistributorLocalUnitTestsIIOnConnect Apr 17, 2023
zhengruifeng marked this pull request as ready for review April 17, 2023 07:25
@zhengruifeng (Contributor, Author)

This PR actually only changes setUp & tearDown to setUpClass & tearDownClass.

In all 6 runs, the PyTorch-related tests passed, so I think it is ready for review.

@HyukjinKwon @dongjoon-hyun @WeichenXu123

gpu_discovery_script_file = tempfile.NamedTemporaryFile(delete=False)
gpu_discovery_script_file_name = gpu_discovery_script_file.name
gpu_discovery_script_file.write(
    b'echo {\\"name\\": \\"gpu\\", \\"addresses\\": [\\"0\\",\\"1\\",\\"2\\"]}'
(Member)

No biggie, but I would convert a dict to JSON and write it after encoding it.

@zhengruifeng (Contributor, Author)

I didn't find an easy way to add the \; let me just follow the existing test for now.
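
For reference, a hedged sketch of what the reviewer's suggestion could look like; json.dumps builds the JSON from a dict, and shlex.quote is one possible way to supply the shell escaping mentioned above (this is not the code the PR ended up with).

import json
import shlex
import tempfile

# Build the discovery output from a dict instead of a hand-escaped bytes
# literal; shlex.quote wraps the JSON in shell quoting so the echoed string
# keeps its double quotes.
gpu_json = json.dumps({"name": "gpu", "addresses": ["0", "1", "2"]})
gpu_discovery_script_file = tempfile.NamedTemporaryFile(delete=False)
gpu_discovery_script_file.write(("echo " + shlex.quote(gpu_json)).encode("utf-8"))
gpu_discovery_script_file.close()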

gpu_discovery_script_file.write(
    b'echo {\\"name\\": \\"gpu\\", \\"addresses\\": [\\"0\\",\\"1\\",\\"2\\"]}'
)
gpu_discovery_script_file.close()
(Member)

Use a try/finally here.

@zhengruifeng (Contributor, Author)

Got it, will update.
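
A sketch of the try/finally pattern being asked for, using a hypothetical helper that owns the temp file's lifetime (the helper name is illustrative, not from the PR).

import os
import tempfile


def with_gpu_discovery_script(run_tests):
    # Hypothetical helper: create the discovery script, hand its path to the
    # caller, and clean up in finally so the file is removed even if a test
    # raises.
    script = tempfile.NamedTemporaryFile(delete=False)
    try:
        script.write(b'echo {\\"name\\": \\"gpu\\", \\"addresses\\": [\\"0\\",\\"1\\",\\"2\\"]}')
        script.close()
        run_tests(script.name)
    finally:
        os.unlink(script.name)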

# create temporary directory for Worker resources coordination
tempdir = tempfile.NamedTemporaryFile(delete=False)
os.unlink(tempdir.name)
os.chmod(
(Member)

Mind adding a comment on what this means?

@zhengruifeng (Contributor, Author)

Actually, this set_up_test_dirs method just follows the initial logic.

return train_fn


def set_up_test_dirs():
@HyukjinKwon (Member) commented Apr 17, 2023

Ideally we should define another mixin (like SQLTestUtils) and inherit it to be consistent, but I am fine with it as-is for now, since it's just a few methods and they do not assume state.
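
A rough sketch of the mixin shape being suggested (class and method names are hypothetical; the PR itself keeps these as module-level helpers).

import unittest


class TorchTestUtilsMixin:
    # Hypothetical mixin, analogous to SQLTestUtils: it groups the stateless
    # helpers so both the regular and the Connect test classes can inherit
    # them instead of calling free functions.
    @staticmethod
    def set_up_test_dirs():
        ...

    @staticmethod
    def get_local_mode_conf():
        ...


class TorchDistributorLocalUnitTests(TorchTestUtilsMixin, unittest.TestCase):
    ...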

HyukjinKwon changed the title [SPARK-43122][TESTS] Reenable TorchDistributorLocalUnitTestsOnConnect and TorchDistributorLocalUnitTestsIIOnConnect [SPARK-43122][CONNECT][PYTHON][ML][TESTS] Reenable TorchDistributorLocalUnitTestsOnConnect and TorchDistributorLocalUnitTestsIIOnConnect Apr 17, 2023
@zhengruifeng (Contributor, Author)

I see a failure in pyspark.ml.clustering; it should be unrelated. But let me re-run pyspark.ml.clustering and test torch one more time. So far, I haven't seen the previous torch-related issue.

@zhengruifeng (Contributor, Author)

In all 10 runs, test_parity_torch_distributor and test_distributor passed without hanging, so I think it is stable enough to re-enable.

@zhengruifeng (Contributor, Author)

Merged to master.

zhengruifeng deleted the torch_reenable branch April 18, 2023 03:23