IMPALA-3501: ee tests: detect build type and support different timeou…

…ts based on the same Impala compiled with the address sanitizer, or compiled with code coverage, runs through code paths much slower. This can cause end-to-end tests that pass on a non-ASAN or non-code coverage build to fail. Some examples include IMPALA-2721, IMPALA-2973, and IMPALA-3501. These classes of failures tend always to involve some time-sensitive condition that fails to succeed under such "slow builds". The works-around in the past have been to simply increase the timeout. The problem with this approach is that it relaxes conditions for tests on builds that see the field--i.e., release builds--for builds that never will--i.e., ASAN and code coverage. This patch fixes that problem by allowing test authors to set timeout values based on a *specific* build type. The author may choose timeouts with a default value, and different timeouts for either or both so-called "slow builds": ASAN and code coverage. We detect the so-called "specific build type" by inspecting the binary expected to be at the path under test. This removes the need to make alterations to Impala itself. The inspection done is to read the DWARF information in the binary, specifically the first compile unit's DW_AT_producer and DW_AT_name DIE attributes. We employ a heuristic based on these attributes' values to guess the build type. If we can't determine the build type, we will assume it's a debug build. More information on this is in IMPALA-3501. A quick summary of the changes follows: 1. Move some of the logic in tests.common.skip to tests.common.environ and rework some skip marks to be more precise. 2. Add Pyelftools for convenient deserialization of DWARF 3. Our Pyelftools usage requires collections.OrderedDict, which isn't in python2.6; also add Monkeypatch to handle this. 4. Add ImpalaBuild and specific_build_type_timeout, the core of the new functionality 5. Fix the statestore tests that only fail under code coverage (the basis for IMPALA-3501) Testing: The tests that were previously, reliably failing under code coverage now pass. I also ran perfunctory tests of debug, release, and ASAN builds to ensure our detection of build type is working. This patch will *not* turn the code coverage builds green; there are other tests that fail, and fixing all of them here is out of the scope of this patch. Change-Id: I2b675c04c54e36d404fd9e5a6cf085fb8d6d0e47 Reviewed-on: http://gerrit.cloudera.org:8080/3156 Reviewed-by: Michael Brown <mikeb@cloudera.com> Tested-by: Internal Jenkins
cloudera · May 26, 2016 · bbad07d · bbad07d
1 parent 60238ed
commit bbad07d
Show file tree

Hide file tree

Showing 8 changed files with 267 additions and 34 deletions.
diff --git a/infra/python/deps/monkeypatch-0.1rc3.zip b/infra/python/deps/monkeypatch-0.1rc3.zip
diff --git a/infra/python/deps/pyelftools-0.23.tar.gz b/infra/python/deps/pyelftools-0.23.tar.gz
diff --git a/infra/python/deps/requirements.txt b/infra/python/deps/requirements.txt
@@ -36,11 +36,13 @@ impyla == 0.11.2
   # before thirdparty is built thrift will be installed anyways.
   thrift == 0.9.0
   thrift_sasl == 0.1.0
+monkeypatch == 0.1rc3
 ordereddict == 1.1
 pexpect == 3.3
 pg8000 == 1.10.2
 prettytable == 0.7.2
 psutil == 0.7.1
+pyelftools == 0.23
 pyparsing == 2.0.3
 pytest == 2.7.2
   py == 1.4.30

diff --git a/tests/common/environ.py b/tests/common/environ.py
@@ -0,0 +1,246 @@
+# Copyright (c) 2016 Cloudera, Inc. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import logging
+import os
+import re
+
+try:
+  from elftools.elf.elffile import ELFFile
+except ImportError as e:
+  # Handle pre-python2.7s' lack of collections.OrderedDict, which we include in
+  # impala-python as ordereddict.OrderedDict.
+  if 'cannot import name OrderedDict' == str(e):
+    import monkeypatch
+    from ordereddict import OrderedDict
+    monkeypatch.patch(OrderedDict, 'collections', 'OrderedDict')
+    from elftools.elf.elffile import ELFFile
+  else:
+    raise e
+
+
+LOG = logging.getLogger('tests.common.environ')
+
+
+# See if Impala is running with legacy aggregations and/or hash joins. This is kind of a
+# hack. It would be better to poll Impala whether it is doing so.
+test_start_cluster_args = os.environ.get("TEST_START_CLUSTER_ARGS", "")
+old_agg_regex = "enable_partitioned_aggregation=false"
+old_hash_join_regex = "enable_partitioned_hash_join=false"
+USING_OLD_AGGS_JOINS = re.search(old_agg_regex, test_start_cluster_args) is not None or \
+    re.search(old_hash_join_regex, test_start_cluster_args) is not None
+
+# Find the likely BuildType of the running Impala. Assume it's found through the path
+# $IMPALA_HOME/be/build/latest as a fallback.
+impala_home = os.environ.get("IMPALA_HOME", "")
+build_type_arg_regex = re.compile(r'--build_type=(\w+)', re.I)
+build_type_arg_search_result = re.search(build_type_arg_regex, test_start_cluster_args)
+
+if build_type_arg_search_result is not None:
+  build_type_dir = build_type_search_result.groups()[0].lower()
+else:
+  build_type_dir = 'latest'
+
+# Resolve any symlinks in the path.
+impalad_basedir = \
+    os.path.realpath(os.path.join(impala_home, 'be/build', build_type_dir)).rstrip('/')
+
+IMPALAD_PATH = os.path.join(impalad_basedir, 'service', 'impalad')
+
+
+class SpecificImpaladBuildTypes:
+  """
+  Represent a specific build type. In reality, there 5 specific build types. These
+  specific build types are needed by Python test code.
+
+  The specific build types and their *most distinguishing* compiler options are:
+
+  1. ADDRESS_SANITIZER (clang -fsanitize=address)
+  2. DEBUG (gcc -ggdb)
+  3. DEBUG_CODE_COVERAGE (gcc -ggdb -ftest-coverage)
+  4. RELEASE (gcc)
+  5. RELEASE_CODE_COVERAGE (gcc -ftest-coverage)
+  """
+  # ./buildall.sh -asan
+  ADDRESS_SANITIZER = 'address_sanitizer'
+  # ./buildall.sh
+  DEBUG = 'debug'
+  # ./buildall.sh -codecoverage
+  DEBUG_CODE_COVERAGE = 'debug_code_coverage'
+  # ./buildall.sh -release
+  RELEASE = 'release'
+  # ./buildall.sh -release -codecoverage
+  RELEASE_CODE_COVERAGE = 'release_code_coverage'
+
+
+class ImpaladBuild(object):
+  """
+  Acquires and provides characteristics about the way the Impala under test was compiled
+  and its likely effects on its responsiveness to automated test timings.
+  """
+  def __init__(self, impalad_path):
+    self.impalad_path = impalad_path
+    die_name, die_producer = self._get_impalad_dwarf_info()
+    self._set_impalad_build_type(die_name, die_producer)
+
+  @property
+  def specific_build_type(self):
+    """
+    Return the correct SpecificImpaladBuildTypes for the Impala under test.
+    """
+    return self._specific_build_type
+
+  def has_code_coverage(self):
+    """
+    Return whether the Impala under test was compiled with code coverage enabled.
+    """
+    return self.specific_build_type in (SpecificImpaladBuildTypes.DEBUG_CODE_COVERAGE,
+                                        SpecificImpaladBuildTypes.RELEASE_CODE_COVERAGE)
+
+  def is_asan(self):
+    """
+    Return whether the Impala under test was compiled with ASAN.
+    """
+    return self.specific_build_type == SpecificImpaladBuildTypes.ADDRESS_SANITIZER
+
+  def is_dev(self):
+    """
+    Return whether the Impala under test is a development build (i.e., any debug or ASAN
+    build).
+    """
+    return self.specific_build_type in (
+        SpecificImpaladBuildTypes.ADDRESS_SANITIZER, SpecificImpaladBuildTypes.DEBUG,
+        SpecificImpaladBuildTypes.DEBUG_CODE_COVERAGE)
+
+  def runs_slowly(self):
+    """
+    Return whether the Impala under test "runs slowly". For our purposes this means
+    either compiled with code coverage enabled or ASAN.
+    """
+    return self.has_code_coverage() or self.is_asan()
+
+  def _get_impalad_dwarf_info(self):
+    """
+    Read the impalad_path ELF binary, which is supposed to contain DWARF, and read the
+    DWARF to understand the compiler options. Return a 2-tuple of the two useful DIE
+    attributes of the first compile unit: the DW_AT_name and DW_AT_producer. If
+    something goes wrong doing this, log a warning and return nothing.
+    """
+    # Some useful references:
+    # - be/CMakeLists.txt
+    # - gcc(1), especially -grecord-gcc-switches, -g, -ggdb, -gdwarf-2
+    # - readelf(1)
+    # - general reading about DWARF
+    # A useful command for exploration without having to wade through many bytes is:
+    # readelf --debug-dump=info --dwarf-depth=1 impalad
+    # The DWARF lines are long, raw, and nasty; I'm hesitant to paste them here, so
+    # curious readers are highly encouraged to try the above, or read IMPALA-3501.
+    die_name = None
+    die_producer = None
+    try:
+      with open(self.impalad_path, 'rb') as fh:
+        impalad_elf = ELFFile(fh)
+        if impalad_elf.has_dwarf_info():
+          dwarf_info = impalad_elf.get_dwarf_info()
+          # We only need the first CU, hence the unconventional use of the iterator
+          # protocol.
+          cu_iterator = dwarf_info.iter_CUs()
+          first_cu = next(cu_iterator)
+          top_die = first_cu.get_top_DIE()
+          die_name = top_die.attributes['DW_AT_name'].value
+          die_producer = top_die.attributes['DW_AT_producer'].value
+    except Exception as e:
+      LOG.warn('Failure to read DWARF info from {0}: {1}'.format(self.impalad_path,
+                                                                 str(e)))
+    return die_name, die_producer
+
+  def _set_impalad_build_type(self, die_name, die_producer):
+    """
+    Use a heuristic based on the DW_AT_producer and DW_AT_name of the first compile
+    unit, as returned by _get_impalad_dwarf_info(), to figure out which of 5 supported
+    builds of impalad we're dealing with. If the heuristic can't determine, fall back to
+    assuming a debug build and log a warning.
+    """
+    ASAN_CU_NAME = 'asan_preinit.cc'
+    NON_ASAN_CU_NAME = 'daemon-main.cc'
+    GDB_FLAG = '-ggdb'
+    CODE_COVERAGE_FLAG = '-ftest-coverage'
+
+    if die_name is None or die_producer is None:
+      LOG.warn('Not enough DWARF info in {0} to determine build type; choosing '
+               'DEBUG'.format(self.impalad_path))
+      self._specific_build_type = SpecificImpaladBuildTypes.DEBUG
+      return
+
+    is_debug = GDB_FLAG in die_producer
+    specific_build_type = SpecificImpaladBuildTypes.DEBUG
+
+    if die_name.endswith(ASAN_CU_NAME):
+      specific_build_type = SpecificImpaladBuildTypes.ADDRESS_SANITIZER
+    elif not die_name.endswith(NON_ASAN_CU_NAME):
+      LOG.warn('Unexpected DW_AT_name in first CU: {0}; choosing '
+               'DEBUG'.format(die_name))
+      specific_build_type = SpecificImpaladBuildTypes.DEBUG
+    elif CODE_COVERAGE_FLAG in die_producer:
+      if is_debug:
+        specific_build_type = SpecificImpaladBuildTypes.DEBUG_CODE_COVERAGE
+      else:
+        specific_build_type = SpecificImpaladBuildTypes.RELEASE_CODE_COVERAGE
+    else:
+      if is_debug:
+        specific_build_type = SpecificImpaladBuildTypes.DEBUG
+      else:
+        specific_build_type = SpecificImpaladBuildTypes.RELEASE
+
+    self._specific_build_type = specific_build_type
+
+
+IMPALAD_BUILD = ImpaladBuild(IMPALAD_PATH)
+
+
+def specific_build_type_timeout(
+    default_timeout, slow_build_timeout=None, asan_build_timeout=None,
+    code_coverage_build_timeout=None):
+  """
+  Return a test environment-specific timeout based on the sort of
+  SpecificImpalaBuildType under test.
+
+  Required parameter: default_timeout - default timeout value. This applies when Impala is
+  a standard release or debug build, or if no other timeouts are specified.
+
+  Optional parameters:
+  slow_build_timeout - timeout to use if we're running against *any* build known to be
+  slow. If specified, this will preempt default_timeout if Impala is expected to be
+  "slow". You can use this as a shorthand in lieu of specifying all of the following
+  parameters.
+
+  The parameters below correspond to specific build types. These preempt both
+  slow_build_timeout and default_timeout, if the Impala under test is a build of the
+  applicable type:
+
+  asan_build_timeout - timeout to use if Impala with ASAN is running
+
+  code_coverage_build_timeout - timeout to use if Impala with code coverage is running
+  (both debug and release code coverage)
+  """
+
+  if IMPALAD_BUILD.is_asan() and asan_build_timeout is not None:
+    timeout_val = asan_build_timeout
+  elif IMPALAD_BUILD.has_code_coverage() and code_coverage_build_timeout is not None:
+    timeout_val = code_coverage_build_timeout
+  elif IMPALAD_BUILD.runs_slowly() and slow_build_timeout is not None:
+    timeout_val = slow_build_timeout
+  else:
+    timeout_val = default_timeout
+  return timeout_val
diff --git a/tests/common/skip.py b/tests/common/skip.py
@@ -21,6 +21,8 @@
 import os
 import pytest
 from functools import partial
+
+from tests.common.environ import IMPALAD_BUILD, USING_OLD_AGGS_JOINS
 from tests.util.filesystem_utils import IS_DEFAULT_FS, IS_S3, IS_ISILON, IS_LOCAL
 
 
@@ -66,22 +68,14 @@ class SkipIfIsilon:
       reason="This Isilon issue has yet to be triaged.")
   jira = partial(pytest.mark.skipif, IS_ISILON)
 
-# TODO: looking at TEST_START_CLUSTER_ARGS is a hack. It would be better to add an option
-# to pytest.
-test_start_cluster_args = os.environ.get("TEST_START_CLUSTER_ARGS","")
-old_agg_regex = "enable_partitioned_aggregation=false"
-old_hash_join_regex = "enable_partitioned_hash_join=false"
-using_old_aggs_joins = re.search(old_agg_regex, test_start_cluster_args) is not None or \
-    re.search(old_hash_join_regex, test_start_cluster_args) is not None
-
 class SkipIfOldAggsJoins:
-  nested_types = pytest.mark.skipif(using_old_aggs_joins,
+  nested_types = pytest.mark.skipif(USING_OLD_AGGS_JOINS,
       reason="Nested types not supported with old aggs and joins")
-  passthrough_preagg = pytest.mark.skipif(using_old_aggs_joins,
+  passthrough_preagg = pytest.mark.skipif(USING_OLD_AGGS_JOINS,
       reason="Passthrough optimization not implemented by old agg")
-  unsupported = pytest.mark.skipif(using_old_aggs_joins,
+  unsupported = pytest.mark.skipif(USING_OLD_AGGS_JOINS,
       reason="Query unsupported with old aggs and joins")
-  requires_spilling = pytest.mark.skipif(using_old_aggs_joins,
+  requires_spilling = pytest.mark.skipif(USING_OLD_AGGS_JOINS,
       reason="Test case requires spilling to pass")
 
 class SkipIfLocal:
@@ -111,21 +105,6 @@ class SkipIfLocal:
   root_path = pytest.mark.skipif(IS_LOCAL,
       reason="Tests rely on the root directory")
 
-# Try to derive the build type. Assume it's 'latest' by default.
-impala_home = os.environ.get("IMPALA_HOME", "")
-build_type_regex = re.compile(r'--build_type=(\w+)', re.I)
-build_type_search_result = re.search(build_type_regex, test_start_cluster_args)
-
-if build_type_search_result is not None:
-  build_type = build_type_search_result.groups()[0].lower()
-else:
-  build_type = 'latest'
-
-# Resolve any symlinks in the path.
-impalad_basedir = \
-    os.path.realpath(os.path.join(impala_home, 'be/build', build_type)).rstrip('/')
-debug_build = os.path.basename(impalad_basedir).lower() == 'debug'
-
-class SkipIfNotDebugBuild:
-  debug_only = pytest.mark.skipif(not debug_build,
+class SkipIfBuildType:
+  not_dev_build = pytest.mark.skipif(not IMPALAD_BUILD.is_dev(),
       reason="Tests depends on debug build startup option.")
diff --git a/tests/custom_cluster/test_alloc_fail.py b/tests/custom_cluster/test_alloc_fail.py
@@ -16,9 +16,9 @@
 import pytest
 from copy import deepcopy
 from tests.common.custom_cluster_test_suite import CustomClusterTestSuite
-from tests.common.skip import SkipIfNotDebugBuild
+from tests.common.skip import SkipIfBuildType
 
-@SkipIfNotDebugBuild.debug_only
+@SkipIfBuildType.not_dev_build
 class TestAllocFail(CustomClusterTestSuite):
   """Tests for handling malloc() failure for UDF/UDA"""
 

diff --git a/tests/custom_cluster/test_exchange_delays.py b/tests/custom_cluster/test_exchange_delays.py
@@ -14,9 +14,9 @@
 
 import pytest
 from tests.common.custom_cluster_test_suite import CustomClusterTestSuite
-from tests.common.skip import SkipIfNotDebugBuild
+from tests.common.skip import SkipIfBuildType
 
-@SkipIfNotDebugBuild.debug_only
+@SkipIfBuildType.not_dev_build
 class TestExchangeDelays(CustomClusterTestSuite):
   """Tests for handling delays in finding data stream receivers"""
 

diff --git a/tests/statestore/test_statestore.py b/tests/statestore/test_statestore.py
@@ -21,6 +21,8 @@
 import urllib2
 import json
 
+from tests.common.environ import specific_build_type_timeout
+
 from thrift.transport import TSocket
 from thrift.transport import TTransport
 from thrift.protocol import TBinaryProtocol
@@ -59,6 +61,9 @@ def get_statestore_subscribers(host='localhost', port=25010):
 DEFAULT_UPDATE_STATE_RESPONSE = TUpdateStateResponse(status=STATUS_OK, topic_updates=[],
                                                      skipped=False)
 
+# IMPALA-3501: the timeout needs to be higher in code coverage builds
+WAIT_FOR_FAILURE_TIMEOUT = specific_build_type_timeout(40, code_coverage_build_timeout=60)
+
 class WildcardServerSocket(TSocket.TSocketBase, TTransport.TServerTransportBase):
   """Specialised server socket that binds to a random port at construction"""
   def __init__(self, host=None, port=0):
@@ -292,7 +297,8 @@ def wait_for_update(self, count=None):
     finally:
       self.update_event.release()
 
-  def wait_for_failure(self, timeout=40):
+
+  def wait_for_failure(self, timeout=WAIT_FOR_FAILURE_TIMEOUT):
     """Waits until this subscriber no longer appears in the statestore's subscriber
     list. If 'timeout' seconds pass, throws an exception."""
     start = time.time()