Replies: 3 comments
-
Thanks for opening your first issue here! Be sure to follow the issue template! If you are willing to raise PR to address this issue please do so, no need to wait for approval.
-
@lopezvit Without wanting to brush this aside, have you approached Google support about this first? How Airflow runs in a Cloud Composer context is unique to them and they might be best suited to identify the root cause of the problem. If it is indeed a bug with the core of Airflow that can be easily understood and replicated then I'm sure the community would aim to address it. But unfortunately, as it stands, I'm not sure you're likely to see much action on this.
-
Converting this into a discussion in case more is needed.
-
Apache Airflow version
Other Airflow 2 version (please specify below)
If "Other Airflow 2 version" selected, which one?
2.6.3
What happened?
The problem is that, quite often (but not always!), the task that (I guess) Airflow detects as a zombie is not retried, and I don't understand why. To me this is clearly a bug.
The task is memory intensive, and I guess that is the underlying problem. I have increased the worker memory size from 4 GB to 6.5 GB and it hasn't failed yet.
But this doesn't look like a very sustainable solution, since memory is expensive and because, when the task is retried, it always succeeds (probably because the worker is no longer under so much pressure), as other executions of the same task show (see the Anything else? section).
I have gone through the documentation, the troubleshooting guide and the known issues, and the only related issue was #37041, but it is hard to tell whether it applies, since Composer uses the Celery executor.
What is the business impact you are facing?
Tasks that fail force a human to go and retry them manually (they are currently left in the failed state to allow better troubleshooting of the issue). The solution of increasing the memory seems expensive, as the issue is not in our code but in the infrastructure.
What you think should happen instead?
Since this happens at moments of high demand, simply retrying the task, as Airflow should, would solve the problem without any human intervention.
How to reproduce
We have a fairly memory-intensive task (around 200 MB) that has to run around 7 times on every DAG run, as it fetches 7 days' worth of past data, because it can take that many days for the data to become golden.
When all these tasks run in parallel (possibly alongside tasks from other DAGs), they use all the memory of the VM, which causes the task to be killed.
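A rough headroom estimate for the figures above, as a minimal sketch. It assumes 1 GB = 1024 MB and takes the 200 MB per task at face value. The worker's own overhead and the footprint of tasks from other DAGs are unknown and not modelled here.

```python
# Rough memory-headroom estimate for a single worker.
# Figures taken from the report: ~200 MB per task, ~7 parallel tasks,
# worker memory of 4 GB (before) and 6.5 GB (after the increase).
# Assumption: 1 GB = 1024 MB; worker overhead and other DAGs are unknown.

TASK_MB = 200
PARALLEL_TASKS = 7


def headroom_mb(worker_gb: float) -> float:
    """Memory left on the worker once the parallel tasks are running, in MB."""
    return worker_gb * 1024 - TASK_MB * PARALLEL_TASKS


for worker_gb in (4, 6.5):
    left = headroom_mb(worker_gb)
    print(f"{worker_gb} GB worker: {left:.0f} MB left for everything else")
```

If the 200 MB figure is accurate, the seven tasks alone would not fill a 4 GB worker, so the pressure would have to come from the worker's own overhead, from tasks of other DAGs, or from peaks above the estimate.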
This is in any case a rare occurrence: the DAG is scheduled twice an hour, 16 hours a day, and it only happened 18 times during a 3-day period.
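To put "rare" into numbers, here is a back-of-the-envelope calculation. It assumes exactly 7 task instances per DAG run (the report says "around 7") and that every scheduled run actually took place.

```python
# Approximate failure rate from the schedule described above.
# Assumptions: exactly 7 task instances per DAG run, no skipped runs.

RUNS_PER_HOUR = 2
HOURS_PER_DAY = 16
DAYS = 3
TASKS_PER_RUN = 7  # assumption: "around 7" in the report
FAILURES = 18

dag_runs = RUNS_PER_HOUR * HOURS_PER_DAY * DAYS
task_instances = dag_runs * TASKS_PER_RUN
failure_rate = FAILURES / task_instances

print(f"DAG runs: {dag_runs}")
print(f"Task instances: {task_instances}")
print(f"Failed without retry: {failure_rate:.1%}")
```

Under these assumptions roughly 2.7% of task instances ended up failed without a retry, which is small but still means manual intervention several times a day.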
Operating System
composer-2.5.2-airflow-2.6.3
Versions of Apache Airflow Providers
directly from the documentation:
absl-py==2.0.0
agate==1.6.3
aiodebug==2.3.0
aiofiles==23.2.1
aiohttp==3.8.6
aiosignal==1.3.1
alembic==1.11.1
amqp==5.1.1
anyio==3.7.1
apache-airflow==2.6.3+composer
apache-airflow-providers-apache-beam==5.3.0
apache-airflow-providers-cncf-kubernetes==7.9.0
apache-airflow-providers-common-sql==1.8.0
apache-airflow-providers-dbt-cloud==3.4.0
apache-airflow-providers-ftp==3.6.1
apache-airflow-providers-google==10.11.1
apache-airflow-providers-hashicorp==3.5.0
apache-airflow-providers-http==4.6.0
apache-airflow-providers-imap==3.4.0
apache-airflow-providers-mysql==5.2.0
apache-airflow-providers-postgres==5.8.0
apache-airflow-providers-sendgrid==3.3.0
apache-airflow-providers-sqlite==3.5.0
apache-airflow-providers-ssh==3.8.1
apache-beam==2.51.0
apispec==5.2.2
appdirs==1.4.4
argcomplete==3.1.1
asgiref==3.7.2
astunparse==1.6.3
async-timeout==4.0.2
attrs==23.1.0
Babel==2.12.1
backoff==2.2.1
backports.zoneinfo==0.2.1
bcrypt==4.0.1
billiard==4.1.0
blinker==1.6.2
cachecontrol==0.13.1
cachelib==0.9.0
cachetools==5.3.1
cattrs==23.1.2
celery==5.3.1
certifi==2023.7.22
cffi==1.15.1
chardet==5.2.0
charset-normalizer==3.1.0
click==8.1.3
click-didyoumean==0.3.0
click-plugins==1.1.1
click-repl==0.3.0
clickclick==20.10.2
cloudpickle==2.2.1
colorama==0.4.6
colorlog==4.8.0
ConfigUpdater==3.1.1
connexion==2.14.2
crcmod==1.7
cron-descriptor==1.4.0
croniter==1.4.1
cryptography==41.0.5
db-dtypes==1.1.1
dbt-bigquery==1.5.4
dbt-core==1.5.4
dbt-extractor==0.4.1
decorator==5.1.1
Deprecated==1.2.14
diff-cover==8.0.0
dill==0.3.1.1
distlib==0.3.6
dnspython==2.3.0
docopt==0.6.2
docutils==0.20.1
email-validator==1.3.1
exceptiongroup==1.1.2
fastavro==1.9.0
fasteners==0.19
filelock==3.12.2
firebase-admin==6.2.0
Flask==2.2.5
Flask-AppBuilder==4.3.1
Flask-Babel==2.0.0
Flask-Bcrypt==1.0.1
Flask-Caching==2.0.2
Flask-JWT-Extended==4.5.2
Flask-Limiter==3.3.1
Flask-Login==0.6.2
flask-session==0.5.0
Flask-SQLAlchemy==2.5.1
Flask-WTF==1.1.1
flatbuffers==23.5.26
flower==2.0.0
frozenlist==1.3.3
fsspec==2023.10.0
future==0.18.3
gast==0.4.0
gcloud-aio-auth==4.2.3
gcloud-aio-bigquery==7.0.0
gcloud-aio-storage==9.0.0
gcsfs==2023.10.0
google-ads==22.1.0
google-api-core==2.14.0
google-api-python-client==2.107.0
google-apitools==0.5.32
google-auth==2.23.4
google-auth-httplib2==0.1.1
google-auth-oauthlib==1.0.0
google-cloud-access-context-manager==0.1.16
google-cloud-aiplatform==1.36.2
google-cloud-appengine-logging==1.3.2
google-cloud-asset==3.20.0
google-cloud-audit-log==0.2.5
google-cloud-automl==2.11.3
google-cloud-batch==0.17.3
google-cloud-bigquery==3.13.0
google-cloud-bigquery-datatransfer==3.12.1
google-cloud-bigquery-storage==2.22.0
google-cloud-bigtable==2.21.0
google-cloud-build==3.21.0
google-cloud-common==1.2.0
google-cloud-compute==1.14.1
google-cloud-container==2.33.0
google-cloud-core==2.3.3
google-cloud-datacatalog==3.16.0
google-cloud-datacatalog-lineage==0.3.1
google-cloud-datacatalog-lineage-producer-client==0.1.0
google-cloud-dataflow-client==0.8.5
google-cloud-dataform==0.5.4
google-cloud-dataplex==1.8.1
google-cloud-dataproc==5.7.0
google-cloud-dataproc-metastore==1.13.0
google-cloud-datastore==2.18.0
google-cloud-dlp==3.13.0
google-cloud-documentai==2.20.2
google-cloud-filestore==1.6.2
google-cloud-firestore==2.13.1
google-cloud-kms==2.19.2
google-cloud-language==2.11.1
google-cloud-logging==3.8.0
google-cloud-memcache==1.7.3
google-cloud-monitoring==2.16.0
google-cloud-orchestration-airflow==1.9.2
google-cloud-org-policy==1.8.3
google-cloud-os-config==1.15.3
google-cloud-os-login==2.11.0
google-cloud-pubsub==2.18.4
google-cloud-pubsublite==0.6.1
google-cloud-redis==2.13.2
google-cloud-resource-manager==1.10.4
google-cloud-run==0.10.0
google-cloud-secret-manager==2.16.4
google-cloud-spanner==3.40.1
google-cloud-speech==2.22.0
google-cloud-storage==2.13.0
google-cloud-storage-transfer==1.9.2
google-cloud-tasks==2.14.2
google-cloud-texttospeech==2.14.2
google-cloud-translate==3.12.1
google-cloud-videointelligence==2.11.4
google-cloud-vision==3.4.5
google-cloud-workflows==1.12.1
google-crc32c==1.5.0
google-pasta==0.2.0
google-re2==1.1
google-resumable-media==2.6.0
googleapis-common-protos==1.60.0
graphviz==0.20.1
greenlet==2.0.2
grpc-google-iam-v1==0.12.7
grpcio==1.59.2
grpcio-gcp==0.2.2
grpcio-status==1.59.2
gunicorn==20.1.0
h11==0.14.0
h5py==3.10.0
hdfs==2.7.3
hologram==0.0.16
httpcore==0.17.3
httplib2==0.22.0
httpx==0.24.1
humanize==4.7.0
hvac==2.0.0
idna==3.4
importlib-metadata==4.13.0
importlib-resources==5.12.0
inflection==0.5.1
iniconfig==2.0.0
isodate==0.6.1
itsdangerous==2.1.2
jaraco.classes==3.3.0
jeepney==0.8.0
Jinja2==3.1.2
Js2Py==0.74
json-merge-patch==0.2
jsonschema==4.18.6
jsonschema-specifications==2023.7.1
keras==2.13.1
keyring==24.3.0
keyrings.google-artifactregistry-auth==1.1.2
kombu==5.3.1
kubernetes==23.6.0
kubernetes-asyncio==24.2.3
lazy-object-proxy==1.9.0
leather==0.3.4
libclang==16.0.6
limits==3.5.0
linkify-it-py==2.0.2
lockfile==0.12.2
Logbook==1.5.3
looker-sdk==23.16.0
Mako==1.2.4
Markdown==3.4.3
markdown-it-py==3.0.0
MarkupSafe==2.1.3
marshmallow==3.19.0
marshmallow-enum==1.5.1
marshmallow-oneofschema==3.0.1
marshmallow-sqlalchemy==0.26.1
mashumaro==3.6
mdit-py-plugins==0.4.0
mdurl==0.1.2
minimal-snowplow-tracker==0.0.2
more-itertools==10.1.0
msgpack==1.0.5
multidict==6.0.4
mysqlclient==2.2.0
networkx==2.8.8
numpy==1.24.3
oauth2client==4.1.3
oauthlib==3.2.2
objsize==0.6.1
opt-einsum==3.3.0
ordered-set==4.1.0
orjson==3.9.10
overrides==6.5.0
packaging==23.1
pandas==2.0.3
pandas-gbq==0.19.2
paramiko==3.3.1
parsedatetime==2.4
pathspec==0.9.0
pendulum==2.1.2
pip==20.2.4
pipdeptree==2.13.1
pkgutil-resolve-name==1.3.10
platformdirs==3.8.1
pluggy==1.2.0
prison==0.2.1
prometheus-client==0.17.0
prompt-toolkit==3.0.39
proto-plus==1.22.3
protobuf==4.24.4
psutil==5.9.5
psycopg2-binary==2.9.9
pyarrow==11.0.0
pyasn1==0.5.0
pyasn1-modules==0.3.0
pycparser==2.21
pydantic==1.10.12
pydata-google-auth==1.8.2
pydot==1.4.2
Pygments==2.16.1
pyjsparser==2.7.1
PyJWT==2.7.0
pymongo==4.6.0
PyNaCl==1.5.0
pyOpenSSL==23.3.0
pyparsing==3.1.1
pytest==7.4.3
python-daemon==3.0.1
python-dateutil==2.8.2
python-http-client==3.3.7
python-nvd3==0.15.0
python-slugify==8.0.1
pytimeparse==1.1.8
pytz==2023.3
pytzdata==2020.1
PyYAML==6.0
redis==3.5.3
referencing==0.30.2
regex==2023.10.3
requests==2.31.0
requests-oauthlib==1.3.1
requests-toolbelt==1.0.0
rfc3339-validator==0.1.4
rich==13.4.2
rich-argparse==1.2.0
rpds-py==0.10.0
rsa==4.9
SecretStorage==3.3.3
sendgrid==6.10.0
setproctitle==1.3.2
setuptools==66.1.1
shapely==2.0.2
six==1.16.0
sniffio==1.3.0
SQLAlchemy==1.4.49
sqlalchemy-bigquery==1.8.0
SQLAlchemy-JSONField==1.0.1.post0
sqlalchemy-spanner==1.6.2
SQLAlchemy-Utils==0.41.1
sqlfluff==2.3.3
sqllineage==1.4.8
sqlparse==0.4.4
sshtunnel==0.4.0
starkbank-ecdsa==2.2.0
statsd==4.0.1
tabulate==0.9.0
tblib==2.0.0
tenacity==8.2.2
tensorboard==2.13.0
tensorboard-data-server==0.7.2
tensorflow==2.13.1
tensorflow-estimator==2.13.0
tensorflow-io-gcs-filesystem==0.34.0
termcolor==2.3.0
text-unidecode==1.3
toml==0.10.2
tomli==2.0.1
tornado==6.3.2
tqdm==4.66.1
typing-extensions==4.5.0
tzdata==2023.3
tzlocal==5.2
uc-micro-py==1.0.2
unicodecsv==0.14.1
uritemplate==4.1.1
urllib3==1.26.18
vine==5.0.0
virtualenv==20.23.1
wcwidth==0.2.6
websocket-client==1.6.1
Werkzeug==2.2.3
wheel==0.41.3
wrapt==1.15.0
WTForms==3.0.1
yarl==1.9.2
zipp==3.15.0
zstandard==0.22.0
Deployment
Google Cloud Composer
Deployment details
Version
composer-2.5.2-airflow-2.6.3
Airflow Configuration Overrides
Environment Configuration:
Pypi Packages
Anything else?
Example of the previously correct execution (it did retry):
Are you willing to submit PR?
Code of Conduct