Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BEAM-7463] parallelize BQ IT tests #8751

Closed
wants to merge 1 commit into from
Closed

Conversation

Juta
Copy link
Contributor

@Juta Juta commented Jun 4, 2019

Updating _multiprocess_can_split_ = True makes sure that the test cases don't share any variables when run in parallel, this avoids flaky test results

Post-Commit Tests Status (on master branch)

Lang SDK Apex Dataflow Flink Gearpump Samza Spark
Go Build Status --- --- Build Status --- --- ---
Java Build Status Build Status Build Status Build Status
Build Status
Build Status
Build Status Build Status Build Status
Python Build Status
Build Status
--- Build Status
Build Status
Build Status --- --- ---

Pre-Commit Tests Status (on master branch)

--- Java Python Go Website
Non-portable Build Status Build Status Build Status Build Status
Portable --- Build Status --- ---

See .test-infra/jenkins/README for trigger phrase, status and link of all Jenkins jobs.

@Juta Juta changed the title [BEAM-7463] parellelize BQ IT tests [BEAM-7463] parallelize BQ IT tests Jun 4, 2019
@Juta
Copy link
Contributor Author

Juta commented Jun 4, 2019

Run Python PostCommit

@Juta
Copy link
Contributor Author

Juta commented Jun 4, 2019

R: @tvalentyn

@tvalentyn
Copy link
Contributor

Hi @Juta, could you please comment which variables are being shared by test scenarios? This change should improve test parallelism in test modules that have multiple test cases, however I believe it does not remove side-effects in existing test scenarios: the tests you modify use test-case level fixtures (setUp()). For every test method a new instance of TestCase is created, and setUp will be called on this instance.

From unittest docs: https://docs.python.org/3/library/unittest.html#class-and-module-fixtures

A new TestCase instance is created as a unique test fixture used to execute each individual test method.

This change would indeed take effect if we used module-level or class-level fixtures (e.g. setUpClass()): with _multiprocesses_can_split_ module-level fixtures would be called multiple times. However I don't see usages of those types of fixtures in integration tests, and we should not use them since they can create side-effects. So it is not clear to me which variables are shared in the tests, that will not be shared with this change. If you think I am missing something, could you post a simplified code-snippet that demonstrates a potential race/side effect in existing tests? Thank you.

@Juta
Copy link
Contributor Author

Juta commented Jun 5, 2019

When the tests are run in parallel with nose they share class variables. The setup and tear down methods are called once for each test case in a class. Many of the big query test classes create a unique dataset name in the setup and delete this dataset in the tear down. Some of the tests share the same output table name

This can lead to the following executions:

  • when data is written with WRITE_EMPTY, it is possible that a second test already empties the table and right after this the data is checked in the first test. This would cause this checksum error
  • when the tear down method is called this deleted the dataset and it is possible that after deleting a dataset, the data is checked in another test. This would cause
    not found: Table apache-beam-testing:python_bq_file_loads_15595843114095.beam_load_2019_06_03_175817_9_5b3b6196cef9fd926078ec31862c8d0b_0 was not found in location US

@Juta
Copy link
Contributor Author

Juta commented Jun 5, 2019

I tested this on a small usecase
(documentation: https://nose.readthedocs.io/en/latest/doc_tests/test_multiprocess/multiprocess.html)

a = []

class ParallelTest(unittest.TestCase):
  _multiprocess_can_split_ = True

  def setUp(self):
    a.append('hello')

  def tearDown(self):
    a = []

  def test_a(self):
    self.assertEqual(a, ['hello'])

  def test_b(self):
    self.assertEqual(a, ['hello'])

and run this with nosetests --tests test.py --processes=2

If we do not specify _multiprocess_can_split_ = True: what happens is that the a variable is appended to twice and one of the tests will fail. However in the other case the test does not share the a variable and both test succeed

In the bigquery tests https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/big_query_query_to_table_it_test.py#L79 it is the dataset_id that is shared and that causes the problems in the test as explained above

@tvalentyn
Copy link
Contributor

tvalentyn commented Jun 5, 2019

When the tests are run in parallel with nose they share class variables.
I believe this is not the case as per unittest documentation, and I was not able to reproduce such behavior in a simple test.

The test codesnippet you posted actually fails with and without _multiprocess_can_split = True since test methods access a global variable. However, if we make a a field in the class, the test passes, also with and without _multiprocess_can_split_ = True:

(py27env) :nose$ cat nose_test3.py
import unittest

class ParallelTest(unittest.TestCase):
  # _multiprocess_can_split_ = True

  def setUp(self):
    self.a = []
    self.a.append('hello')

  def tearDown(self):
    self.a = []

  def test_a(self):
    self.assertEqual(self.a, ['hello'])

  def test_b(self):
    self.assertEqual(self.a, ['hello'])

(py27env) :nose$ nosetests --nologcapture -v  --processes=2 nose_test3.py
test_a (nose_test3.ParallelTest) ... ok
test_b (nose_test3.ParallelTest) ... ok

@Juta
Copy link
Contributor Author

Juta commented Jun 5, 2019

Thanks Valentyn, you are right, I didn't put my variables inside the class which indeed makes a difference..

@Juta
Copy link
Contributor Author

Juta commented Jun 5, 2019

Run Python PostCommit

@Juta
Copy link
Contributor Author

Juta commented Jun 11, 2019

Run Python PostCommit

@Juta
Copy link
Contributor Author

Juta commented Jun 11, 2019

Run Python PostCommit

@Juta
Copy link
Contributor Author

Juta commented Jun 14, 2019

Run Python PostCommit

@Juta
Copy link
Contributor Author

Juta commented Jun 14, 2019

Run Python PostCommit

1 similar comment
@Juta
Copy link
Contributor Author

Juta commented Jun 17, 2019

Run Python PostCommit

@Juta Juta closed this Jul 8, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants