
[CT-1760] [Bug] Python resource_tracker causing dbt test to fail #6535

Closed
2 tasks done
alexrosenfeld10 opened this issue Jan 6, 2023 · 5 comments
Labels
bug Something isn't working

Comments

@alexrosenfeld10
Contributor

alexrosenfeld10 commented Jan 6, 2023

Is this a new bug in dbt-core?

  • I believe this is a new bug in dbt-core
  • I have searched the existing issues, and I could not find an existing issue for this bug

Current Behavior

We're seeing sporadic errors causing our production Argo Workflow runs to fail. The issue is neither consistent nor easily reproducible, but when it happens the dbt test command fails and the pod is terminated with code 143. This is obviously not a good thing in a production environment, and it causes alerts that go to on-call engineers.

For full clarity, I will post the various layers that ultimately invoke dbt test.

The runs execute the following Argo Workflow template:

          - name: test-prod
            when: "'{{workflow.parameters.snowflake-env}}' =~ '^prod'"
            inline:
              container:
                image: my-image
                command: ["./execute_domain.sh"]
                args: [test, --select, published]

The execute_domain.sh script looks like this (slightly abridged, but this is where the rubber meets the road):

echo "{ \"level\": \"info\", \"message\": \"running command 'dbt ${DBT_ARGS}'\"}"
dbt --no-use-colors --profiles-dir . --log-format json "$@"

The error that occurs during the run is:

/usr/local/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 2 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
time="2023-01-05T19:25:53.702Z" level=warning msg="Non-transient error: context canceled"

Here is a screenshot detailing the error happening during the test run, and not during any other time:

[screenshot: the resource_tracker warning appears only during the test step]

Note: I realize the logs indicate that the resource_tracker issue is a warning-level log, implying it's not the root cause of the failure... However, there are no other relevant logs output by dbt during these failures, so this is all I have to go on.

Expected Behavior

dbt test should run end to end without crashing.

Steps To Reproduce

This is challenging to reproduce. Let's get a discussion going, and from there if no one on the dbt Labs side of things has ideas, we can investigate ways to reproduce this in a shareable fashion.

Relevant log output

/usr/local/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 2 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
time="2023-01-05T19:25:53.702Z" level=warning msg="Non-transient error: context canceled"

Environment

  • OS:
# cat /etc/os-release
PRETTY_NAME="Debian GNU/Linux 11 (bullseye)"
NAME="Debian GNU/Linux"
VERSION_ID="11"
VERSION="11 (bullseye)"
VERSION_CODENAME=bullseye
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"
  • Python:
# python --version
Python 3.10.3
  • dbt:
# dbt --version
Core:
  - installed: 1.1.0
  - latest:    1.3.2 - Update available!

  Your version of dbt-core is out of date!
  You can find instructions for upgrading here:
  https://docs.getdbt.com/docs/installation

Plugins:
  - snowflake: 1.1.0 - Update available!

  At least one plugin is out of date or incompatible with dbt-core.
  You can find instructions for upgrading here:
  https://docs.getdbt.com/docs/installation

Which database adapter are you using with dbt?

No response

Additional Context

No response

@alexrosenfeld10 alexrosenfeld10 added bug Something isn't working triage labels Jan 6, 2023
@github-actions github-actions bot changed the title [Bug] Python resource_tracker causing dbt test to fail [CT-1760] [Bug] Python resource_tracker causing dbt test to fail Jan 6, 2023
@jtcohen6 jtcohen6 self-assigned this Jan 9, 2023
@jtcohen6
Contributor

jtcohen6 commented Jan 9, 2023

@alexrosenfeld10 Thanks for opening!

After a bit of googling, it seems like this will be tricky to reproduce. When a Python application is running multiple processes (as dbt does to handle threaded execution), this warning crops up when the application is terminated partway through, e.g. if it hits a memory limit. For example, here's a rough sketch (hypothetical, not dbt's actual code) of how the warning can be produced:
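# Hypothetical reproduction, not dbt's actual code: on Python 3.8-3.10,
# a process holding multiprocessing semaphores that is killed before it
# can clean up triggers the same resource_tracker warning at shutdown.
import multiprocessing
import os
import signal
import time

def work(_):
    time.sleep(10)  # long enough to still be running when the kill arrives

if __name__ == "__main__":
    # A Pool allocates semaphores internally, much as dbt's runner does.
    pool = multiprocessing.Pool(2)
    pool.map_async(work, range(4))
    time.sleep(1)
    # Simulate the SIGTERM that k8s/Argo sends (exit status 143 = 128 + 15).
    os.kill(os.getpid(), signal.SIGTERM)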

Is there any chance your Argo Workflow is running out of memory? I'd hope that there'd be other error messages to say something to that effect, and explain why the process is being killed...
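
If a failing pod were still around, something like this could confirm or rule out an OOM kill (the pod name here is hypothetical):

kubectl get pod my-dbt-pod -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
# prints "OOMKilled" if the kernel killed the container for exceeding its memory limit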

But it also sounds like this could be related to a bug in Python 3.8+, stemming from a switch within the multiprocessing library to use spawn instead of fork (!). Check out the bug here: python/cpython#90549. It looks like a fix for it was merged last summer, and included in Python 3.11?

@jtcohen6 jtcohen6 removed their assignment Jan 9, 2023
@alexrosenfeld10
Contributor Author

alexrosenfeld10 commented Jan 9, 2023

@jtcohen6 thanks for the reply! I wonder if the best course of action here is to simply wait for dbt-core to support Python 3.11? Do you know when that might be?

I'm unable to check the memory history on those pods (they've been churned and our monitoring software no longer has the data). I can check if / when this happens again. However, I'd be pretty surprised if they ran out of memory, as all they're responsible for is dbt test, which I believe is a relatively lightweight process? Most of the heavy lifting there happens in the data warehouse, I'd expect.

I didn't see any other error messages related to pods being OOMKilled, which I would've expected to see as well if that was the issue.

@jtcohen6
Contributor

jtcohen6 commented Jan 9, 2023

I wonder if the best course of action here is to simply wait for dbt-core to support python 3.11? Do you know when that might be?

dbt-core will support Py311 in v1.4 (next!), currently in beta, release candidate coming this week. Support on dbt-snowflake will take a little longer because we're waiting on snowflake-connector-python: dbt-labs/dbt-snowflake#331

@jtcohen6
Contributor

jtcohen6 commented Jan 9, 2023

However, I'd be pretty surprised if they ran out of memory, as all they're responsible for is dbt test, which I believe is a relatively lightweight process?

Agreed, I wouldn't have expected that either. It would make more sense if it were a more memory-intensive command, e.g. dbt docs generate.

I missed this on the first read, but I do see a message Error: exit status 143, which means the container running dbt received SIGTERM (exit status 143 = 128 + 15, and 15 is SIGTERM's signal number). Is it possible that Argo/k8s are killing the process due to resource limitations, a timeout, a new workflow being triggered, ... any other imaginable reason?

My hypothesis remains that, while the leaked semaphore warning is pretty gross, it's actually a red herring. The process is being terminated for some reason, and a bug in Python's multiprocessing module in 3.8-3.10 causes that warning to appear whenever a multi-threaded process is interrupted.
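
One way to gather more evidence next time: trap SIGTERM in the wrapper script so there's an explicit log line whenever the container is told to shut down. A rough sketch, adapting the execute_domain.sh from above:

#!/bin/bash
on_sigterm() {
  echo '{ "level": "error", "message": "received SIGTERM; dbt was terminated externally" }'
  kill -TERM "$dbt_pid" 2>/dev/null  # forward the signal to dbt
  exit 143  # 128 + 15 (SIGTERM), matching the observed exit status
}
trap on_sigterm TERM

# Run dbt in the background and wait, so the trap fires promptly;
# bash defers trap handling while a foreground child is running.
dbt --no-use-colors --profiles-dir . --log-format json "$@" &
dbt_pid=$!
wait "$dbt_pid"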

@alexrosenfeld10
Contributor Author

dbt-core will support Py311 in v1.4 (next!), currently in beta, release candidate coming this week. Support on dbt-snowflake will take a little longer because we're waiting on snowflake-connector-python: dbt-labs/dbt-snowflake#331

Glad dbt has their sights set on the new version! Thanks for the info. I've had to fork the Snowflake connectors and DIY stuff because their development is fairly slow (that said, I'm using the .NET connector, which I understand isn't their main market). Hopefully they come through with Python 3.11 support soonish!

It's possible it's being interrupted somewhere somehow, but I don't see any other indicators of that in our system. We'll look for more details if it pops up again. I'll close this out, as the next actionable things here are to upgrade Python and collect more data. Thanks again for the info!
