IMPALA-3343, IMPALA-9489: Make impala-shell compatible with python 3.
This is the main patch for making the impala-shell cross-compatible with
python 2 and python 3. The goal is to wind up with a version of the shell that
will pass python e2e tests irrespective of the version of python used to launch
the shell, under the assumption that the test framework itself will continue to
run with python 2.7.x for the time being.

Notable changes for reviewers to consider:

- With regard to validating the patch, my assumption is that simply passing
  the existing set of e2e shell tests is sufficient to confirm that the shell
  is functioning properly. No new tests were added.

- A new pytest command line option was added in conftest.py to enable a user
  to specify a path to an alternate impala-shell executable to test. It's
  possible to use this to point to an instance of the impala-shell that was
  installed as a standalone python package in a separate virtualenv.

  Example usage:
  USE_THRIFT11_GEN_PY=true impala-py.test --shell_executable=/<path to virtualenv>/bin/impala-shell -sv shell/test_shell_commandline.py

  The target virtualenv may be based on either python3 or python2. However,
  this has no effect on the version of python used to run the test framework,
  which remains tied to python 2.7.x for the foreseeable future.

- The $IMPALA_HOME/bin/impala-shell.sh now sets up the impala-shell python
  environment independently from bin/set-pythonpath.sh. The default version
  of thrift is thrift-0.11.0 (See IMPALA-9489).

- The wording of the header changed a bit to include the python version
  used to run the shell.

    Starting Impala Shell with no authentication using Python 3.7.5
    Opened TCP connection to localhost:21000
    ...

    OR

    Starting Impala Shell with LDAP-based authentication using Python 2.7.12
    Opened TCP connection to localhost:21000
    ...

- By far, the biggest hassle has been juggling str versus unicode versus
  bytes data types. Python 2.x was fairly loose and inconsistent in
  how it dealt with strings. As a quick demo of what I mean:

  Python 2.7.12 (default, Nov 12 2018, 14:36:49)
  [GCC 5.4.0 20160609] on linux2
  Type "help", "copyright", "credits" or "license" for more information.
  >>> d = 'like a duck'
  >>> d == str(d) == bytes(d) == unicode(d) == d.encode('utf-8') == d.decode('utf-8')
  True

  ...and yet there are weird unexpected gotchas.

  >>> d.decode('utf-8') == d.encode('utf-8')
  True
  >>> d.encode('utf-8') == bytearray(d, 'utf-8')
  True
  >>> d.decode('utf-8') == bytearray(d, 'utf-8')   # equality is not transitive?
  False

  As a result, this inconsistency was reflected in the way we handled
  strings in the impala-shell code, but things still just worked.

  In python3, there's a much clearer distinction between strings and bytes, and
  as such, much tighter type consistency is expected by standard libs like
  subprocess, re, sqlparse, prettytable, etc., which are used throughout the
  shell. Even simple calls that worked in python 2.x:

  >>> import re
  >>> re.findall('foo', b'foobar')
  ['foo']

  ...can throw exceptions in python 3.x:

  >>> import re
  >>> re.findall('foo', b'foobar')
  Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
    File "/data0/systest/venvs/py3/lib/python3.7/re.py", line 223, in findall
      return _compile(pattern, flags).findall(string)
  TypeError: cannot use a string pattern on a bytes-like object

  Exceptions like this resulted in many, if not most, shell tests failing
  under python 3.

  What ultimately seemed like a better approach was to try to weed out as many
  existing spurious str.encode() and str.decode() calls as I could, and try to
  implement what has colloquially been called a "unicode sandwich" -- namely,
  "bytes on the outside, unicode on the inside, encode/decode at the edges."

  The primary spot in the shell where we call decode() now is when sanitising
  input...

  args = self.sanitise_input(args.decode('utf-8'))

  ...and also whenever a library like re requires it. Similarly, str.encode()
  is primarily used where a library like readline or csv requires it.

- PYTHONIOENCODING needs to be set to utf-8 to override the default setting for
  python 2. Without this, piping or redirecting stdout results in unicode errors.

- from __future__ import unicode_literals was added throughout.
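
The "unicode sandwich" described in the list above can be sketched in a few
lines. This is a minimal illustration and not code from the patch: read_edge,
count_keyword, and write_edge are hypothetical names, and UTF-8 is assumed at
both edges.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals

import re


def read_edge(raw, encoding='utf-8'):
    """Hypothetical input edge: decode bytes, pass text through unchanged."""
    if isinstance(raw, bytes):
        return raw.decode(encoding)
    return raw


def count_keyword(text, keyword):
    """Inner layer: pattern and subject are both text, so re accepts them."""
    return len(re.findall(re.escape(keyword), text, flags=re.IGNORECASE))


def write_edge(text, encoding='utf-8'):
    """Hypothetical output edge: encode only when data leaves the program."""
    return text.encode(encoding)


# Bytes arrive from outside, are decoded once, and are encoded once on the way out.
raw_query = 'select "café" from t; SELECT 1'.encode('utf-8')
query = read_edge(raw_query)
print(count_keyword(query, 'select'))
print(write_edge(query) == raw_query)
```

Keeping every internal value as text means library calls such as re.findall()
never see a mix of str and bytes, which is the failure mode shown in the
python 3 traceback above.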

Testing:

  To test the changes, I ran the e2e shell tests the way we always do (against
  the normal build tarball), and then I set up a python 3 virtual env with the
  shell installed as a package, and manually ran the tests against that.

  No effort has been made at this point to come up with a way to integrate
  testing of the shell in a python3 environment into our automated test
  processes.
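
For readers unfamiliar with how a pytest option like --shell_executable is
registered, the following is a rough sketch of what a conftest.py hook could
look like. It is not the code from this patch: the help text, the
resolve_shell_executable helper, and the fallback to bin/impala-shell.sh under
IMPALA_HOME are assumptions made for illustration.

```python
import os


def pytest_addoption(parser):
    """pytest calls this hook in conftest.py to collect custom options."""
    parser.addoption(
        "--shell_executable",
        default=None,
        help="Path to an alternate impala-shell executable to test.")


def resolve_shell_executable(option_value, impala_home):
    """Hypothetical helper: use the option if given, else the dev launcher."""
    if option_value:
        return option_value
    # Assumed default for illustration only.
    return os.path.join(impala_home, "bin", "impala-shell.sh")
```

A test or fixture would typically read the value with
request.config.getoption("--shell_executable") and pass it to a helper like
the one above.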

Change-Id: Idb004d352fe230a890a6b6356496ba76c2fab615
Reviewed-on: http://gerrit.cloudera.org:8080/15524
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
David Knupp authored and Impala Public Jenkins committed Apr 18, 2020
1 parent 6fcc758 commit bc9d7e0
Showing 19 changed files with 352 additions and 149 deletions.
33 changes: 32 additions & 1 deletion bin/impala-shell.sh
@@ -17,7 +17,38 @@
# specific language governing permissions and limitations
# under the License.

set -euo pipefail

# This script runs the impala shell from a dev environment.
PYTHONPATH=${PYTHONPATH:-}
SHELL_HOME=${IMPALA_SHELL_HOME:-${IMPALA_HOME}/shell}
-exec impala-python ${SHELL_HOME}/impala_shell.py "$@"
+
+# ${IMPALA_HOME}/bin has bootstrap_toolchain.py, required by bootstrap_virtualenv.py
+PYTHONPATH=${PYTHONPATH}:${IMPALA_HOME}/bin
+
+# Default version of thrift for the impala-shell is thrift >= 0.11.0.
+PYTHONPATH=${PYTHONPATH}:${SHELL_HOME}/build/thrift-11-gen/gen-py
+IMPALA_THRIFT_PY_VERSION=${IMPALA_THRIFT11_VERSION}
+
+THRIFT_PY_ROOT="${IMPALA_TOOLCHAIN}/thrift-${IMPALA_THRIFT_PY_VERSION}"
+
+LD_LIBRARY_PATH+=":$(PYTHONPATH=${PYTHONPATH} \
+  python "$IMPALA_HOME/infra/python/bootstrap_virtualenv.py" \
+  --print-ld-library-path)"
+
+IMPALA_PY_DIR="$(dirname "$0")/../infra/python"
+IMPALA_PYTHON_EXECUTABLE="${IMPALA_PY_DIR}/env/bin/python"
+
+for PYTHON_LIB_DIR in ${THRIFT_PY_ROOT}/python/lib{64,}; do
+  [[ -d ${PYTHON_LIB_DIR} ]] || continue
+  for PKG_DIR in ${PYTHON_LIB_DIR}/python*/site-packages; do
+    PYTHONPATH=${PYTHONPATH}:${PKG_DIR}/
+  done
+done
+
+# Note that this uses the external system python executable
+PYTHONPATH=${PYTHONPATH} python "${IMPALA_PY_DIR}/bootstrap_virtualenv.py"
+
+# This uses the python executable in the impala python env
+PYTHONIOENCODING='utf-8' PYTHONPATH=${PYTHONPATH} \
+  exec "${IMPALA_PYTHON_EXECUTABLE}" ${SHELL_HOME}/impala_shell.py "$@"
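
The launcher sets PYTHONIOENCODING because an interpreter whose stdout is a
pipe or a file may not default to UTF-8. The sketch below is illustrative only
and not part of the patch. It starts a child interpreter with piped stdout and
reports which encoding it picked; child_stdout_encoding is a hypothetical
helper name.

```python
import os
import subprocess
import sys


def child_stdout_encoding(io_encoding=None):
    """Return sys.stdout.encoding as seen by a child whose stdout is a pipe."""
    env = dict(os.environ)
    env.pop("PYTHONIOENCODING", None)  # start from a clean setting
    if io_encoding is not None:
        env["PYTHONIOENCODING"] = io_encoding
    out = subprocess.check_output(
        [sys.executable, "-c", "import sys; print(sys.stdout.encoding)"],
        env=env)
    return out.decode("ascii").strip()


# With the variable set, the child uses the requested encoding.
print(child_stdout_encoding("utf-8"))
# Without it, the result depends on the python version and the locale.
print(child_stdout_encoding())
```

Under python 2 the second call commonly reports None for a piped stdout, which
is the situation that leads to the unicode errors mentioned in the commit
message.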
3 changes: 2 additions & 1 deletion bin/set-pythonpath.sh
@@ -27,8 +27,9 @@
export PYTHONPATH=${IMPALA_HOME}:${IMPALA_HOME}/bin

# Generated Thrift files are used by tests and other scripts.
-if [ -n "${USE_THRIFT11_GEN_PY:-}" ]; then
+if [ "${USE_THRIFT11_GEN_PY:-}" == "true" ]; then
PYTHONPATH=${PYTHONPATH}:${IMPALA_HOME}/shell/build/thrift-11-gen/gen-py
+THRIFT_HOME="${IMPALA_TOOLCHAIN}/thrift-${IMPALA_THRIFT11_VERSION}"
else
PYTHONPATH=${PYTHONPATH}:${IMPALA_HOME}/shell/gen-py
fi
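
The change in the condition above is behavioral, not cosmetic: the old test
was true for any non-empty value, while the new one requires the literal
string "true". The following sketch mirrors the two shell tests in Python to
make the difference visible; the function names are hypothetical and the code
is not part of the patch.

```python
def thrift11_enabled_old(env):
    """Mirrors `[ -n "${USE_THRIFT11_GEN_PY:-}" ]`: any non-empty value counts."""
    return env.get("USE_THRIFT11_GEN_PY", "") != ""


def thrift11_enabled_new(env):
    """Mirrors `[ "${USE_THRIFT11_GEN_PY:-}" == "true" ]`: exact match only."""
    return env.get("USE_THRIFT11_GEN_PY", "") == "true"


# Compare both checks for an unset variable and a few sample values.
for value in (None, "", "true", "false", "1"):
    env = {} if value is None else {"USE_THRIFT11_GEN_PY": value}
    print(repr(value), thrift11_enabled_old(env), thrift11_enabled_new(env))
```

In particular, USE_THRIFT11_GEN_PY=false selected the thrift-11 generated code
under the old test and no longer does under the new one.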
1 change: 1 addition & 0 deletions shell/TSSLSocketWithWildcardSAN.py
@@ -16,6 +16,7 @@
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.
+from __future__ import print_function, unicode_literals

import re
import ssl
11 changes: 10 additions & 1 deletion shell/compatibility.py
@@ -19,11 +19,20 @@
# under the License.
from __future__ import print_function, unicode_literals


"""
A module where we can aggregate python2 -> 3 code contortions.
"""

import os
import sys


if sys.version_info.major == 2:
# default is typically ASCII, but unicode_literals dictates UTF-8
# See also https://stackoverflow.com/questions/492483/setting-the-correct-encoding-when-piping-stdout-in-python # noqa
os.environ['PYTHONIOENCODING'] = 'utf-8'


try:
_xrange = xrange
except NameError:
4 changes: 2 additions & 2 deletions shell/impala-shell
@@ -18,7 +18,7 @@
# under the License.


-# This script runs the Impala shell. Python is required.
+# This script runs the Impala shell. Python is required.
#
# This script assumes that the supporting library files for the Impala shell are
# rooted in either the same directory that this script is in, or in a directory
@@ -51,4 +51,4 @@ for EGG in $(ls ${SHELL_HOME}/ext-py/*.egg); do
done

PYTHONPATH="${EGG_PATH}${SHELL_HOME}/gen-py:${SHELL_HOME}/lib:${PYTHONPATH}" \
-exec python ${SHELL_HOME}/impala_shell.py "$@"
+PYTHONIOENCODING='utf-8' exec python ${SHELL_HOME}/impala_shell.py "$@"
20 changes: 10 additions & 10 deletions shell/impala_client.py
@@ -1,4 +1,5 @@
#!/usr/bin/env python
+# -*- coding: utf-8 -*-
#
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
@@ -16,7 +17,7 @@
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.
-from __future__ import print_function
+from __future__ import print_function, unicode_literals
from compatibility import _xrange as xrange

from bitarray import bitarray
@@ -118,7 +119,7 @@ def __init__(self, impalad, kerberos_host_fqdn, use_kerberos=False,
ldap_password=None, use_ldap=False, client_connect_timeout_ms=60000,
verbose=True, use_http_base_transport=False, http_path=None):
self.connected = False
-self.impalad_host = impalad[0].encode('ascii', 'ignore')
+self.impalad_host = impalad[0]
self.impalad_port = int(impalad[1])
self.kerberos_host_fqdn = kerberos_host_fqdn
self.imp_service = None
@@ -378,8 +379,8 @@ def _get_http_transport(self, connect_timeout_ms):

if self.use_ldap:
# Set the BASIC auth header
-auth = base64.encodestring(
-"{0}:{1}".format(self.user, self.ldap_password)).strip('\n')
+user_passwd = "{0}:{1}".format(self.user, self.ldap_password)
+auth = base64.encodestring(user_passwd.encode()).decode().strip('\n')
transport.setCustomHeaders({"Authorization": "Basic {0}".format(auth)})

transport.open()
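
The hunk above shows the general python 3 pattern for base64: encode text to
bytes going in, decode bytes back to text coming out. Note that
base64.encodestring() was deprecated in python 3 and removed in python 3.9, so
the sketch below uses base64.b64encode() instead. It is an illustration under
that substitution, not the code from the patch, and basic_auth_header is a
hypothetical helper name.

```python
import base64


def basic_auth_header(user, password):
    """Build an HTTP Basic Authorization value from text credentials."""
    credentials = "{0}:{1}".format(user, password)
    # b64encode takes bytes and returns bytes: encode going in, decode coming out.
    token = base64.b64encode(credentials.encode("utf-8")).decode("ascii")
    return "Basic {0}".format(token)


print(basic_auth_header("alice", "secret"))
```

Unlike encodestring(), b64encode() never inserts newlines, so no strip('\n')
is needed even for long credentials.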
@@ -1005,12 +1006,12 @@ def _get_thrift_client(self, protocol):
return ImpalaService.Client(protocol)

def _options_to_string_list(self, set_query_options):
-if sys.version_info.major < 3:
-key_value_pairs = set_query_options.iteritems()
-else:
-key_value_pairs = set_query_options.items()
+if sys.version_info.major < 3:
+key_value_pairs = set_query_options.iteritems()
+else:
+key_value_pairs = set_query_options.items()

-return ["%s=%s" % (k, v) for (k, v) in key_value_pairs]
+return ["%s=%s" % (k, v) for (k, v) in key_value_pairs]

def _open_session(self):
# Beeswax doesn't have a "session" concept independent of connections, so
@@ -1241,4 +1242,3 @@ def _do_beeswax_rpc(self, rpc, suppress_error_on_cancel=True):
raise RPCException("ERROR: %s" % e.message)
if "QueryNotFoundException" in str(e):
raise QueryStateException('Error: Stale query handle')
