Skip to content

Include original error in ConnectionShutdown messages#1266

Closed
dkropachev wants to merge 777 commits intoapache:trunkfrom
scylladb:fix/issue-614-bad-file-descriptor
Closed

Include original error in ConnectionShutdown messages#1266
dkropachev wants to merge 777 commits intoapache:trunkfrom
scylladb:fix/issue-614-bad-file-descriptor

Conversation

@dkropachev
Copy link
Contributor

Summary

When a connection is closed or becomes defunct, include the last_error in the ConnectionShutdown exception message.

Before:

cassandra.connection.ConnectionShutdown: Connection to 10.12.33.86:9042 is defunct

After:

cassandra.connection.ConnectionShutdown: Connection to 10.12.33.86:9042 is defunct: [Errno 9] Bad file descriptor

This helps investigate #614 by preserving the original error reason when subsequent operations try to use a closed connection.

Test plan

  • Added unit tests for send_msg() and wait_for_responses() verifying last_error is included
  • All existing unit tests pass

dkropachev and others added 30 commits January 4, 2025 19:11
It does not match orthodox python style.
And trigger warnings:
  cqlengine/query/test_queryoperators.py:157: SyntaxWarning: "is" with 'int' literal. Did you mean "=="?
    self.assertTrue(len(first_page) is 1)
Invalidate tablets when table or keyspace is deleted
Following warnings shows up all over the code:
```
 SyntaxWarning: invalid escape sequence ...
```
Let's fix it.
python 3.13 is now released, and we should supply
wheel for it.
seems like scylla 6.2 cqlsh isn't working
as expect on old versions of glibc

```
2025-01-19 13:51:07.901 DEBUG [cluster:757]:
node1: (EE) /home/runner/.ccm/scylla-repository/release/6.2.2/python3/bin/../libexec/python3.12.bin:
symbol lookup error: /lib/x86_64-linux-gnu/libpthread.so.0: undefined symbol: __libc_pthread_init, version GLIBC_PRIVATE
```
`setup.py` is pinning to use `Cython>=0.20,!=0.25,<0.30`
code generated form those old versions is not compiling anymore with python3.13

in turn it cause libev part to be missing in the driver, which surface
other issue with asyncio eventloop we currently use as default

in this chnage we pin to newer version of cython when we build
the wheel

Ref: #409
github actions: user newer cython versions
PyEval_InitThreads is depricated, since 3.7 it is not necessary to call
it.
On python 3.12 it is not present anymore, which cases build to fail.
Ref: https://docs.python.org/3/c-api/init.html#c.PyEval_InitThreads
Otherwise it fails to convert unicode to 'str':
```
Error compiling Cython file:
------------------------------------------------------------
...
    if metadata_request_timeout is None:
        return stmt
    ms = int(metadata_request_timeout / datetime.timedelta(milliseconds=1))
    if ms == 0:
        return stmt
    return f"{stmt} USING TIMEOUT {ms}ms"
           ^
------------------------------------------------------------

cassandra/util.py:1813:11: Cannot convert Unicode string to 'str' implicitly. This is not portable and requires explicit encoding.
```
We need to see when extension is failed to build on CICD.
Let's introduce CASS_DRIVER_BUILD_EXTENSIONS_ARE_MUST that is False by
default.
When it is set, setup.py fails when building of any extension is failed.
There is a bug in distutils that does not allow it to pick up cython for
python 3.12 and later.
move to pyproject.toml that supports PEP517/518
this would help us use isolated build env, and avoid
all the fragile situation we have with Cython installation dependcy
missing in some CI variation
By default docker/setup-qemu-action uses `binmt:master`, which leads to
unstable cicd that hard to debug, randm crashes, bogus errors.
It fixes one of the issue with aarch64 builds.
We use old 20.04, it is time to update it.
Runner os should match cibuildwheels `MACOSX_DEPLOYMENT_TARGET`,
otherwise brew pulls wrong packages that not supported by runner OS.
VSCode cl.exe fails to recognize some of the syntax:
```
    cassandra\c_shard_info.c(3083): error C2065: '__uint128_t': undeclared identifier
    cassandra\c_shard_info.c(3083): error C2146: syntax error: missing ')' before identifier '__pyx_v_biased_token'
    cassandra\c_shard_info.c(3083): error C2059: syntax error: ')'
    cassandra\c_shard_info.c(3083): error C2059: syntax error: ')'
    cassandra\c_shard_info.c(3083): error C2059: syntax error: ')'
```
Let's switch to clang.
Full scan queries are slower even if there is only one record.
Old package 3.3.2 was removed from repo and not available anymore
* cicd: trigger workflows only when it is needed

* cicd: make build-push workflow to be reusable

* cicd: make pre-release workflow manually triggerable

1. Make it reuse `lib-build-and-push`
2. Make it manually triggerable only

This workflow is checking if driver works with next pre-release python version.
We don't want to have them in PRs, if it fails it does not mean that PR
is bad.

It would be better to manually trigger it for now, and later we will make it
check on existing pre-release versions and run on them once a month.
On python 3.12.x cqlsh fails with:
```
libexec/python3.12.bin: symbol lookup error: /lib/x86_64-linux-gnu/libpthread.so.0: undefined symbol: __libc_pthread_init, version GLIBC_PRIVATE
```

So we need to work it around by not using cqlsh
* PYTHON-1364 Fix ssl.wrap_socket errors (from eventlet) for Python 3.12 (apache#1181)

* PYTHON-1366 Handle removal of asyncore in Python 3.12 (apache#1187)

* Remove outdated Python pre-3.7 references (apache#1186)

* PYTHON-1371 Add explicit exception type for serialization failures (apache#1193)

* PYTHON-1331 ssl.match_hostname() is deprecated in 3.7 (apache#1191)

* Documentation (and other) updates for 3.29.0 (apache#1194)

* fix(test-requirements): remove python 3.13 restrictions

Original commit c9b24b7 made it not
install certain modules on python 3.13 because at the time python was
broken and did not support these modules, now they work ok, roll it back.

---------

Co-authored-by: Bret McGuire <absurdfarce@users.noreply.github.com>
Co-authored-by: Brad Schoening <5796692+bschoening@users.noreply.github.com>
* cicd/lib-build-and-push: fix python-version description and default

* cicd/lib-build-and-push: make choco dont print progress

* cicd/lib-build-and-push: use fixed windows version

* cicd/lib-build-and-push: allow prerelease python for aarch64

* cicd/lib-build-and-push: use correct `macos-14` instead of `macos-latest`

* cicd/lib-build-and-push: load calculated matrix

* cicd: move cibuildwheel configuration to pyproject.toml

* setup.py: don't panic on wrong command line

* cicd/lib-build-and-push: merge aarch64 job to build-wheels

* cicd/lib-build-and-push: build pypy and native in single blow
following code:
```
.update(
            bin_map__remove={123, 456}
        )
```
Would fail if map is defined with non-nullable value type:
```
columns.Map(columns.BigInt, columns.Bytes, required=False, default={})
```
dgarcia360 and others added 28 commits December 9, 2025 06:53
…utofix

Co-authored-by: Copilot Autofix powered by AI <223894421+github-code-quality[bot]@users.noreply.github.com>
…utofix

Co-authored-by: Copilot Autofix powered by AI <223894421+github-code-quality[bot]@users.noreply.github.com>
…ation

Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
Co-authored-by: mykaul <4655593+mykaul@users.noreply.github.com>
…n permissions

Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
…ng alert no. 7: Workflow does not contain permissions

Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
Node pools should be stable, if cc fails to connect it is not good
enough reason to neither to kill it nor to mark node down.
Continued effort to remove protocol versions < 3 as was done in the native Python code.

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
Co-authored-by: mykaul <4655593+mykaul@users.noreply.github.com>
… starting point

The `test_profile_lb_swap` test logic assumed that `populate`
was called before control connection (cc) was created, meaning
only the contact points from the cluster configuration were
known (a single host). Due to that the starting point was not random.

This commit updates the test to reflect the new behavior, where `populate`
is called on the load-balancing policy after the control connection is
created. This allows the policy to be updated with all known hosts and
ensures the starting point is properly randomized.
Previously, the driver relied on the load-balancing policy (LBP) to determine
the order of hosts to connect to. Since the default LBP is Round Robin, each
reconnection would start from a different host.

After removing fake hosts with random IDs at startup, this behavior changed.
When the LBP is not yet initialized, the driver now uses the endpoints provided
by the control connection (CC), so there is no guarantee that different hosts
will be selected on reconnection.

This change updates the test logic to first establish a connection and
initialize the LBP, and only then verify that two subsequent reconnections
land on different hosts in a healthy cluster.
Only compare hosts endpoints not whole Host instances as we
don't know hosts ids.
In DC aware lbp when local_dc is not provided we set it in
on_add and it needs to be initialized for distance to give
proper results.
Previously, we used endpoints provided to the cluster to create Host
instances with random host_ids in order to populate the LBP before the
ControlConnection was established. This logic led to creating many
connections that were opened and then quickly closed, because once we
learned the correct host_ids from system.peers, we removed the old Hosts
with random IDs and created new ones with the proper host_ids.

This commit introduces a new approach. To establish the ControlConnection,
we now use only the resolved contact points from the cluster configuration.
Only after a successful connection do we populate Host information in the
LBP. If the LBP is already initialized during ControlConnection
reconnection, we reuse the existing values.
…eck if a table is using tablets

table_has_tablets() performs the same self._tablets.get((keyspace, table) that
get_tablet_for_key() does which is a called a line later does, so it's redundant.
Removed the former.

Note - perhaps table_has_tablets() is not needed and can be removed? Unsure, it's unclear
if it's part of the API or not. It's now completely unused across the code.

Adjusted unit tests as well.

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
Replace sleep-based timing with proper Event synchronization:
- Remove executor_init with 0.5s sleep
- Replace time.sleep(0.5) with connection_created.wait()
- Replace time.sleep(3) with executor.shutdown(wait=True)
- Reduce iterations from 20 to 3 (deterministic with proper sync)
- Add explicit assertions for shutdown state

Performance improvement:
- Before: 70.13s call tests/unit/test_host_connection_pool.py::HostConnectionTests::test_fast_shutdown
- After: 0.01s call tests/unit/test_host_connection_pool.py::HostConnectionTests::test_fast_shutdown

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
When a host changes its IP address, the driver previously updated the host endpoint to the new IP before calling on_down.
This caused on_down to mistakenly target the new IP for connection cleanup.
This change reorders the operations to ensure on_down cleans up the old IP's resources before the host object is updated and on_up is called for the new IP.

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
When a connection is closed or becomes defunct, include the last_error
in the ConnectionShutdown exception message. This helps investigate
issues where the original cause (e.g., "Bad file descriptor") was
previously lost when subsequent operations tried to use the connection.

Refs #614
@dkropachev dkropachev closed this Jan 29, 2026
@dkropachev dkropachev deleted the fix/issue-614-bad-file-descriptor branch January 30, 2026 09:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.