Race condition in Consul result backend when saving a result #5605

Closed
ShaheedHaque opened this issue Jun 16, 2019 · 40 comments · Fixed by #6823

Comments

@ShaheedHaque
Contributor

ShaheedHaque commented Jun 16, 2019

I'm encountering an issue using the Consul result backend which I suspect is NOT a Celery issue as such, but rather an issue with the lack of thread safety in the library the backend uses.

The symptoms are consistent with confusion between the sorts of responses that Consul gives to the library when creating saved results and when expiring those saved results. So, if Celery both writes and expires saved results using threads, the lack of thread-safety in the python-consul library could produce the results described.

Can somebody confirm if Celery access to a results backend involves multiple threads?

@matteius
Contributor

@ShaheedHaque Basically, if you are using the prefork pool, which is the default for Celery, it forks the process instead of using threads to get past the limitations of the Python GIL. This means that if you are running on a multi-core CPU (you probably are), having more than one worker process means results may be written in parallel while tasks are consumed and their results sent along. I would have to know more about the setup, but if your python-consul library is only listening in one spot for those responses and thinks that it's going to enforce a linear order of execution, that could be the issue.
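
(For anyone who wants to see the process-versus-thread point concretely, here is a minimal sketch; the app name, broker/backend URLs and the task are illustrative, not taken from this project. Under the prefork pool, concurrently running tasks report different PIDs because each pool worker is a separate OS process.)

import os
import threading

from celery import Celery

app = Celery('proj', broker='amqp://localhost//',
             backend='consul://localhost:8500/')

@app.task
def whoami():
    # Under prefork, expect distinct 'pid' values across concurrent tasks,
    # and the main thread within each worker process.
    return {'pid': os.getpid(), 'thread': threading.current_thread().name}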

@ShaheedHaque
Contributor Author

ShaheedHaque commented Jun 17, 2019

Hi @matteius,

I am indeed using the default prefork pool. I have taken a cursory look at the Celery code, but could not easily see where the result backend code is called from, so it is unclear to me whether there are multiple ports involved (i.e. one per forked worker) or just one (i.e. in the master process).

However, my suspicion is that in either event, the result backend is doing both "set" operations to store results and "delete" operations to expire them. The suspicion is based on instrumenting the Consul backend code and seeing that the "set" code, which expects to have created a Consul session and to receive a dictionary response describing the newly created session, something like this:

{'ID': '3e1d1cd2-4de1-be07-02d4-8b3d279acedb'}

is actually sometimes getting a simple boolean response. What can return such a result? The Consul HTTP API call that returns a boolean response is the "delete". The "set" code and the "delete" code both appear to be called, and the evidence is clear that the various responses are getting mixed up. For example, I see instances where the code is attempting to get the ['ID'] of a boolean!

From the way the backend code is written with inline callbacks, the only way I can see this happening is if a given listener is threaded and the session context is getting confused. But before I plunge into the deep end and propose a fix for the Consul library's handling of sessions, I am looking for confirmation that my suspicions match Celery's usage of its backend.

@matteius
Contributor

Which backend are you using for the broker, i.e. RabbitMQ, Redis, SQS, or something else?

@ShaheedHaque
Contributor Author

RabbitMQ.

@matteius
Contributor

In that case, every task that gets a result must define a unique response queue, which Celery handles internally -- is Consul reading from the queue for both expired results, or just using the result.get() method? I read somewhere here recently that you have to call .get() or .ignore() on every result promise object when using the results backend, otherwise it can lead to something possibly like you are describing. It can happen if your code only cares about the results from some tasks.

@ShaheedHaque
Contributor Author

ShaheedHaque commented Jun 19, 2019

For clarity, I'm just a user. I didn't write either the Consul result backend or the Python-consul package.

That said, my code does wait for and check every result.

However, I don't understand what any of that has to do with the problem I described, or the clarification I am seeking.

All I need to check is whether, when Celery invokes the backend to save results, it uses multiple threads. The reason for asking is that I am seeing errors which can be explained if threads are in use, because the python-consul package is not thread-safe.

@thedrow
Member

thedrow commented Jun 19, 2019

In the future, please use the template for bug reports.

Do you have a test case?

@matteius
Contributor

@ShaheedHaque As I already stated above, if you are using the default prefork pool, which you have said you are, then Celery is invoking multiple processes and not multiple threads.

@ShaheedHaque
Contributor Author

ShaheedHaque commented Jun 19, 2019

@thedrow Apologies, I did not consider this to fit any of the templates, and specifically I did not consider this a bug report against Celery. Here is the template anyway...

  • I have read the relevant section in the
    contribution guide
    on reporting bugs.
  • I have checked the issues list
    for similar or identical bug reports.
  • I have checked the pull requests list
    for existing proposed fixes.
  • I have checked the commit log
    to find out if the bug was already fixed in the master branch.
  • I have included all related issues and possible duplicate issues
    in this issue (If there are none, check this box anyway).

Mandatory Debugging Information

  • I have included the output of celery -A proj report in the issue.
  • I have verified that the issue exists against the master branch of Celery.
  • I have included the contents of pip freeze in the issue.
  • I have included all the versions of all the external dependencies required
    to reproduce this bug.

Optional Debugging Information

  • I have tried reproducing the issue on more than one Python version
    and/or implementation.
  • I have tried reproducing the issue on more than one message broker and/or
    result backend.
  • I have tried reproducing the issue on more than one version of the message
    broker and/or result backend.
  • I have tried reproducing the issue on more than one operating system.
  • I have tried reproducing the issue on more than one workers pool.
  • I have tried reproducing the issue with autoscaling, retries,
    ETA/Countdown & rate limits disabled.
  • I have tried reproducing the issue after downgrading
    and/or upgrading Celery and its dependencies.

Related Issues and Possible Duplicates

Related Issues

Possible Duplicates

  • None

Environment & Settings

Celery version:

celery report Output:

$ celery -A paiyroll report
software -> celery:4.3.0 (rhubarb) kombu:4.5.0 py:3.7.3
            billiard:3.6.0.0 py-amqp:2.4.2
platform -> system:Linux arch:64bit, ELF
            kernel version:5.0.0-16-generic imp:CPython
loader   -> celery.loaders.app.AppLoader
settings -> transport:amqp results:consul://localhost:8500/ 

ABSOLUTE_URL_OVERRIDES: {
 }
ADMINS: []
ALLOWED_HOSTS: ['*']
APPEND_SLASH: True
AUTHENTICATION_BACKENDS: ['django.contrib.auth.backends.ModelBackend']
AUTH_PASSWORD_VALIDATORS: '********'
AUTH_USER_MODEL: 'paiyroll.User'
BASE_DIR: '/main/srhaque/Innovatieltd/source'
CACHES: {
 'default': {'BACKEND': 'django.core.cache.backends.locmem.LocMemCache'}}
CACHE_MIDDLEWARE_ALIAS: 'default'
CACHE_MIDDLEWARE_KEY_PREFIX: '********'
CACHE_MIDDLEWARE_SECONDS: 600
CELERY_BROKER_HEARTBEAT: None
CELERY_BROKER_URL: 'amqp://guest:********@localhost:5672//'
CELERY_RESULT_BACKEND: 'consul://localhost:8500/'
CSRF_COOKIE_AGE: 31449600
CSRF_COOKIE_DOMAIN: None
CSRF_COOKIE_HTTPONLY: False
CSRF_COOKIE_NAME: 'csrftoken'
CSRF_COOKIE_PATH: '/'
CSRF_COOKIE_SAMESITE: 'Lax'
CSRF_COOKIE_SECURE: False
CSRF_FAILURE_VIEW: 'django.views.csrf.csrf_failure'
CSRF_HEADER_NAME: 'HTTP_X_CSRFTOKEN'
CSRF_TRUSTED_ORIGINS: []
CSRF_USE_SESSIONS: False
DATABASES: {
    'default': {   'ATOMIC_REQUESTS': False,
                   'AUTOCOMMIT': True,
                   'CONN_MAX_AGE': 0,
                   'ENGINE': 'django.db.backends.postgresql',
                   'HOST': 'localhost',
                   'NAME': 'foo',
                   'OPTIONS': {},
                   'PASSWORD': '********',
                   'PORT': '5432',
                   'TEST': {   'CHARSET': None,
                               'COLLATION': None,
                               'MIRROR': None,
                               'NAME': None},
                   'TIME_ZONE': None,
                   'USER': 'dbcoreuser'},
    'fdw': {   'ATOMIC_REQUESTS': False,
               'AUTOCOMMIT': True,
               'CONN_MAX_AGE': 0,
               'ENGINE': 'django.db.backends.postgresql',
               'HOST': 'localhost',
               'NAME': 'foo',
               'OPTIONS': {},
               'PASSWORD': '********',
               'PORT': '5432',
               'TEST': {   'CHARSET': None,
                           'COLLATION': None,
                           'MIRROR': None,
                           'NAME': None},
               'TIME_ZONE': None,
               'USER': 'dbcoreuser'}}
DATABASE_ROUTERS: '********'
DATA_UPLOAD_MAX_MEMORY_SIZE: 2621440
DATA_UPLOAD_MAX_NUMBER_FIELDS: 1000
DATETIME_FORMAT: 'N j, Y, P'
DATETIME_INPUT_FORMATS: ['%Y-%m-%d %H:%M:%S',
 '%Y-%m-%d %H:%M:%S.%f',
 '%Y-%m-%d %H:%M',
 '%Y-%m-%d',
 '%m/%d/%Y %H:%M:%S',
 '%m/%d/%Y %H:%M:%S.%f',
 '%m/%d/%Y %H:%M',
 '%m/%d/%Y',
 '%m/%d/%y %H:%M:%S',
 '%m/%d/%y %H:%M:%S.%f',
 '%m/%d/%y %H:%M',
 '%m/%d/%y']
DATE_FORMAT: 'N j, Y'
DATE_INPUT_FORMATS: ['%Y-%m-%d',
 '%m/%d/%Y',
 '%m/%d/%y',
 '%b %d %Y',
 '%b %d, %Y',
 '%d %b %Y',
 '%d %b, %Y',
 '%B %d %Y',
 '%B %d, %Y',
 '%d %B %Y',
 '%d %B, %Y']
DEBUG: True
DEBUG_PROPAGATE_EXCEPTIONS: False
DECIMAL_SEPARATOR: '.'
DEFAULT_CHARSET: 'utf-8'
DEFAULT_CONTENT_TYPE: 'text/html'
DEFAULT_DB: 'default'
DEFAULT_EXCEPTION_REPORTER_FILTER: 'django.views.debug.SafeExceptionReporterFilter'
DEFAULT_FILE_STORAGE: 'django.core.files.storage.FileSystemStorage'
DEFAULT_FROM_EMAIL: 'webmaster@localhost'
DEFAULT_INDEX_TABLESPACE: ''
DEFAULT_TABLESPACE: ''
DISALLOWED_USER_AGENTS: []
EMAIL_BACKEND: 'django.core.mail.backends.filebased.EmailBackend'
EMAIL_FILE_PATH: '/tmp/email_messages'
EMAIL_HOST: 'smtp.gmail.com'
EMAIL_HOST_PASSWORD: '********'
EMAIL_HOST_USER: 'paiyroll.com@gmail.com'
EMAIL_PORT: 587
EMAIL_SSL_CERTFILE: ''
EMAIL_SSL_KEYFILE: '********'
EMAIL_SUBJECT_PREFIX: '[Django] '
EMAIL_TIMEOUT: None
EMAIL_USE_LOCALTIME: False
EMAIL_USE_SSL: False
EMAIL_USE_TLS: True
FDW_DB: 'fdw'
FILE_CHARSET: 'utf-8'
FILE_UPLOAD_DIRECTORY_PERMISSIONS: None
FILE_UPLOAD_HANDLERS: 
    ('django_excel.ExcelMemoryFileUploadHandler',
 'django_excel.TemporaryExcelFileUploadHandler')
FILE_UPLOAD_MAX_MEMORY_SIZE: 2621440
FILE_UPLOAD_PERMISSIONS: None
FILE_UPLOAD_TEMP_DIR: None
FIRST_DAY_OF_WEEK: 0
FIXTURE_DIRS: []
FORCE_SCRIPT_NAME: None
FORMAT_MODULE_PATH: None
FORM_RENDERER: 'django.forms.renderers.DjangoTemplates'
IGNORABLE_404_URLS: []
INSTALLED_APPS: ['paiyroll.apps.PaiyrollConfig',
 'django.contrib.admin',
 'django.contrib.auth',
 'django.contrib.contenttypes',
 'django.contrib.sessions',
 'django.contrib.messages',
 'django.contrib.staticfiles',
 'django.contrib.postgres',
 'django_jinja',
 'bootstrapform_jinja',
 'import_export',
 'phonenumber_field',
 'polymorphic',
 'viewflow']
INTERNAL_IPS: []
LANGUAGES: [('af', 'Afrikaans'),
 ('ar', 'Arabic'),
 ('ast', 'Asturian'),
 ('az', 'Azerbaijani'),
 ('bg', 'Bulgarian'),
 ('be', 'Belarusian'),
 ('bn', 'Bengali'),
 ('br', 'Breton'),
 ('bs', 'Bosnian'),
 ('ca', 'Catalan'),
 ('cs', 'Czech'),
 ('cy', 'Welsh'),
 ('da', 'Danish'),
 ('de', 'German'),
 ('dsb', 'Lower Sorbian'),
 ('el', 'Greek'),
 ('en', 'English'),
 ('en-au', 'Australian English'),
 ('en-gb', 'British English'),
 ('eo', 'Esperanto'),
 ('es', 'Spanish'),
 ('es-ar', 'Argentinian Spanish'),
 ('es-co', 'Colombian Spanish'),
 ('es-mx', 'Mexican Spanish'),
 ('es-ni', 'Nicaraguan Spanish'),
 ('es-ve', 'Venezuelan Spanish'),
 ('et', 'Estonian'),
 ('eu', 'Basque'),
 ('fa', 'Persian'),
 ('fi', 'Finnish'),
 ('fr', 'French'),
 ('fy', 'Frisian'),
 ('ga', 'Irish'),
 ('gd', 'Scottish Gaelic'),
 ('gl', 'Galician'),
 ('he', 'Hebrew'),
 ('hi', 'Hindi'),
 ('hr', 'Croatian'),
 ('hsb', 'Upper Sorbian'),
 ('hu', 'Hungarian'),
 ('hy', 'Armenian'),
 ('ia', 'Interlingua'),
 ('id', 'Indonesian'),
 ('io', 'Ido'),
 ('is', 'Icelandic'),
 ('it', 'Italian'),
 ('ja', 'Japanese'),
 ('ka', 'Georgian'),
 ('kab', 'Kabyle'),
 ('kk', 'Kazakh'),
 ('km', 'Khmer'),
 ('kn', 'Kannada'),
 ('ko', 'Korean'),
 ('lb', 'Luxembourgish'),
 ('lt', 'Lithuanian'),
 ('lv', 'Latvian'),
 ('mk', 'Macedonian'),
 ('ml', 'Malayalam'),
 ('mn', 'Mongolian'),
 ('mr', 'Marathi'),
 ('my', 'Burmese'),
 ('nb', 'Norwegian Bokmål'),
 ('ne', 'Nepali'),
 ('nl', 'Dutch'),
 ('nn', 'Norwegian Nynorsk'),
 ('os', 'Ossetic'),
 ('pa', 'Punjabi'),
 ('pl', 'Polish'),
 ('pt', 'Portuguese'),
 ('pt-br', 'Brazilian Portuguese'),
 ('ro', 'Romanian'),
 ('ru', 'Russian'),
 ('sk', 'Slovak'),
 ('sl', 'Slovenian'),
 ('sq', 'Albanian'),
 ('sr', 'Serbian'),
 ('sr-latn', 'Serbian Latin'),
 ('sv', 'Swedish'),
 ('sw', 'Swahili'),
 ('ta', 'Tamil'),
 ('te', 'Telugu'),
 ('th', 'Thai'),
 ('tr', 'Turkish'),
 ('tt', 'Tatar'),
 ('udm', 'Udmurt'),
 ('uk', 'Ukrainian'),
 ('ur', 'Urdu'),
 ('vi', 'Vietnamese'),
 ('zh-hans', 'Simplified Chinese'),
 ('zh-hant', 'Traditional Chinese')]
LANGUAGES_BIDI: ['he', 'ar', 'fa', 'ur']
LANGUAGE_CODE: 'en-us'
LANGUAGE_COOKIE_AGE: None
LANGUAGE_COOKIE_DOMAIN: None
LANGUAGE_COOKIE_NAME: 'django_language'
LANGUAGE_COOKIE_PATH: '/'
LOCALE_PATHS: []
LOGGING: {
    'disable_existing_loggers': False,
    'formatters': {   'standard': {   'format': '%(asctime)s [%(levelname)s] '
                                                '%(name)s: %(message)s'}},
    'handlers': {   'file': {   'backupCount': 10,
                                'class': 'logging.handlers.RotatingFileHandler',
                                'filename': '/main/srhaque/.local/share/paiyroll/django.log',
                                'formatter': 'standard',
                                'level': 'DEBUG',
                                'maxBytes': 15728640}},
    'loggers': {'': {'handlers': ['file'], 'level': 'INFO', 'propagate': True}},
    'version': 1}
LOGGING_CONFIG: 'logging.config.dictConfig'
LOGGING_DIR: '/main/srhaque/.local/share/paiyroll'
LOGIN_REDIRECT_URL: '/accounts/profile/'
LOGIN_URL: '/accounts/login/'
LOGOUT_REDIRECT_URL: None
MANAGERS: []
MEDIA_ROOT: ''
MEDIA_URL: ''
MESSAGE_STORAGE: 'django.contrib.messages.storage.fallback.FallbackStorage'
MIDDLEWARE: ['django.middleware.security.SecurityMiddleware',
 'django.contrib.sessions.middleware.SessionMiddleware',
 'project.middleware.SessionExpiryMiddleware',
 'django.middleware.common.CommonMiddleware',
 'django.middleware.csrf.CsrfViewMiddleware',
 'django.contrib.auth.middleware.AuthenticationMiddleware',
 'django.contrib.messages.middleware.MessageMiddleware',
 'django.middleware.clickjacking.XFrameOptionsMiddleware']
MIGRATION_MODULES: {
 }
MONTH_DAY_FORMAT: 'F j'
NUMBER_GROUPING: 0
PAIYROLL_CLIENT_OUTPUT_DIR: '/tmp'
PAIYROLL_DEBUG_ADMIN_ACCESS_ALL_WORKFLOWS: True
PAIYROLL_DEBUG_NOTIFICATIONS: True
PAIYROLL_DEBUG_PAY_WALK: 0
PAIYROLL_DEBUG_PROMPTS: 259
PAIYROLL_DEBUG_SCHEDULING_DATE: '2018-01-01'
PAIYROLL_DEBUG_SCHEDULING_FAST: 10
PAIYROLL_GB_RTI_ALL_TRANSACT: None
PAIYROLL_GB_RTI_OTHERS_TRANSACT_FROM: '2019-04-06'
PAIYROLL_GB_RTI_TEST_COMPANIES: 
    ('HMRC', 'HMRC RTI Recognition')
PAIYROLL_GB_RTI_TEST_COMPANIES_ACTUAL_T: '2019-03-31'
PAIYROLL_INTERNAL_NETWORK: IPv4Network('192.168.1.0/24')
PAIYROLL_NOTIFICATIONS: {
    'email': {   'backend': 'django.core.mail.backends.filebased.EmailBackend',
                 'file_path': '/tmp/email_messages',
                 'host': 'smtp.gmail.com',
                 'password': '********',
                 'port': 587,
                 'ssl_certfile': '',
                 'ssl_keyfile': '********',
                 'timeout': None,
                 'use_ssl': False,
                 'use_tls': True,
                 'username': 'paiyroll.com@gmail.com'},
    'sms': {   'account_sid': 'ACbabf740b7a3d23010cbe381a8a184fe0',
               'from': '+15005550006',
               'password': '********'}}
PAIYROLL_REPORT_SPEC_DIR: '/main/srhaque/Innovatieltd/source/paiyroll/report/JasperReports'
PAIYROLL_REPORT_SPEC_SPREADSHEET: '/main/srhaque/Innovatieltd/source/paiyroll/report/Spreadsheets'
PAIYROLL_SITE_NAME: 'login.paiyroll.com'
PASSWORD_HASHERS: '********'
PASSWORD_RESET_TIMEOUT_DAYS: '********'
PREPEND_WWW: False
ROOT_URLCONF: 'project.urls'
SECRET_KEY: '********'
SECURE_BROWSER_XSS_FILTER: False
SECURE_CONTENT_TYPE_NOSNIFF: False
SECURE_HSTS_INCLUDE_SUBDOMAINS: False
SECURE_HSTS_PRELOAD: False
SECURE_HSTS_SECONDS: 0
SECURE_PROXY_SSL_HEADER: None
SECURE_REDIRECT_EXEMPT: []
SECURE_SSL_HOST: None
SECURE_SSL_REDIRECT: False
SERVER_EMAIL: 'root@localhost'
SESSION_CACHE_ALIAS: 'default'
SESSION_COOKIE_AGE: 5400
SESSION_COOKIE_DOMAIN: None
SESSION_COOKIE_HTTPONLY: True
SESSION_COOKIE_NAME: 'sessionid'
SESSION_COOKIE_PATH: '/'
SESSION_COOKIE_SAMESITE: 'Lax'
SESSION_COOKIE_SECURE: False
SESSION_ENGINE: 'django.contrib.sessions.backends.db'
SESSION_EXPIRE_AT_BROWSER_CLOSE: False
SESSION_FILE_PATH: None
SESSION_SAVE_EVERY_REQUEST: False
SESSION_SERIALIZER: 'django.contrib.sessions.serializers.JSONSerializer'
SETTINGS_MODULE: 'project.settings'
SHORT_DATETIME_FORMAT: 'm/d/Y P'
SHORT_DATE_FORMAT: 'm/d/Y'
SIGNING_BACKEND: 'django.core.signing.TimestampSigner'
SILENCED_SYSTEM_CHECKS: []
STATICFILES_DIRS: []
STATICFILES_FINDERS: ['django.contrib.staticfiles.finders.FileSystemFinder',
 'django.contrib.staticfiles.finders.AppDirectoriesFinder']
STATICFILES_STORAGE: 'django.contrib.staticfiles.storage.StaticFilesStorage'
STATIC_ROOT: '../staticroot'
STATIC_URL: '/static/'
TEMPLATES: [{'APP_DIRS': True,
  'BACKEND': 'django_jinja.backend.Jinja2',
  'DIRS': ['/main/srhaque/Innovatieltd/source/templates'],
  'OPTIONS': {'app_dirname': 'templates',
              'auto_reload': True,
              'autoescape': True,
              'bytecode_cache': {'backend': 'django_jinja.cache.BytecodeCache',
                                 'enabled': False,
                                 'name': 'default'},
              'constants': {},
              'context_processors': ['django.contrib.auth.context_processors.auth',
                                     'django.contrib.messages.context_processors.messages'],
              'extensions': ['jinja2.ext.do',
                             'jinja2.ext.loopcontrols',
                             'jinja2.ext.with_',
                             'jinja2.ext.i18n',
                             'jinja2.ext.autoescape',
                             'django_jinja.builtins.extensions.CsrfExtension',
                             'django_jinja.builtins.extensions.CacheExtension',
                             'django_jinja.builtins.extensions.TimezoneExtension',
                             'django_jinja.builtins.extensions.UrlsExtension',
                             'django_jinja.builtins.extensions.StaticFilesExtension',
                             'django_jinja.builtins.extensions.DjangoFiltersExtension',
                             'jinja_extensions.DebugExtension'],
              'filters': {},
              'globals': {},
              'match_extension': ('.html',
                                  '.jinja'),
              'match_regex': '^(?!admin/).*',
              'newstyle_gettext': True,
              'tests': {},
              'translation_engine': 'django.utils.translation',
              'undefined': None}},
 {'APP_DIRS': True,
  'BACKEND': 'django.template.backends.django.DjangoTemplates',
  'DIRS': ['/main/srhaque/Innovatieltd/source/templates'],
  'OPTIONS': {'context_processors': ['django.template.context_processors.debug',
                                     'django.template.context_processors.request',
                                     'django.contrib.auth.context_processors.auth',
                                     'django.contrib.messages.context_processors.messages']}}]
TEST_NON_SERIALIZED_APPS: []
TEST_RUNNER: 'django.test.runner.DiscoverRunner'
THOUSAND_SEPARATOR: ','
TIME_FORMAT: 'P'
TIME_INPUT_FORMATS: ['%H:%M:%S', '%H:%M:%S.%f', '%H:%M']
TIME_ZONE: 'UTC'
USE_I18N: True
USE_L10N: True
USE_THOUSAND_SEPARATOR: False
USE_TZ: True
USE_X_FORWARDED_HOST: False
USE_X_FORWARDED_PORT: False
WSGI_APPLICATION: 'project.wsgi.application'
X_FRAME_OPTIONS: 'SAMEORIGIN'
YEAR_MONTH_FORMAT: 'F Y'
is_overridden: <bound method Settings.is_overridden of <Settings "project.settings">>

Steps to Reproduce

Required Dependencies

  • Minimal Python Version: N/A or Unknown
  • Minimal Celery Version: N/A or Unknown
  • Minimal Kombu Version: N/A or Unknown
  • Minimal Broker Version: N/A or Unknown
  • Minimal Result Backend Version: N/A or Unknown
  • Minimal OS and/or Kernel Version: N/A or Unknown
  • Minimal Broker Client Version: N/A or Unknown
  • Minimal Result Backend Client Version: N/A or Unknown

Python Packages

pip freeze Output:

$ pip3 freeze
alembic==1.0.10
amqp==2.4.2
apt-xapian-index==0.47
asn1crypto==0.24.0
atomicwrites==1.3.0
attrs==19.1.0
aws-shell==0.2.1
awscli==1.16.147
Babel==2.6.0
backcall==0.1.0
backports.csv==1.0.7
bcrypt==3.1.6
beautifulsoup4==4.7.1
billiard==3.6.0.0
blinker==1.4
boto3==1.9.137
botocore==1.12.137
calendra==3.0
celery==4.3.0
certifi==2018.8.24
cffi==1.12.3
chardet==3.0.4
chromedriver==2.24.1
Click==7.0
colorama==0.3.9
command-not-found==0.3
configobj==5.0.6
coverage==4.5.3
cryptography==2.3
cssselect==1.0.3
cupshelpers==1.0
datadiff==2.0.0
decorator==4.4.0
deepdiff==4.0.6
defer==1.0.6
defusedxml==0.6.0
diff-match-patch==20181111
distro-info===0.21ubuntu2
Django==2.2
django-excel==0.0.10
django-extra-views==0.12.0
django-filter==2.1.0
django-import-export==1.2.0
django-jinja==2.4.1
django-jinja-bootstrap-form==4.2.3
django-model-utils==3.1.2
django-phonenumber-field==2.3.1
django-polymorphic==2.0.3
django-viewflow==1.5.3
django-webtest==1.9.4
docutils==0.14
entrypoints==0.3
ephem==3.7.6.0
et-xmlfile==1.0.1
Flask==1.0.2
Flask-BabelEx==0.9.3
Flask-Gravatar==0.5.0
Flask-HTMLmin==1.5.0
Flask-Login==0.4.1
Flask-Mail==0.9.1
Flask-Migrate==2.4.0
Flask-Paranoid==0.2.0
Flask-Principal==0.4.0
Flask-Security==3.0.0
Flask-SQLAlchemy==2.3.2
Flask-WTF==0.14.2
gunicorn==19.9.0
html5lib==1.0.1
htmlmin==0.1.12
httplib2==0.11.3
idna==2.6
ipython==7.5.0
ipython-genutils==0.2.0
itsdangerous==1.1.0
jdcal==1.4.1
jedi==0.13.3
Jinja2==2.10.1
jmespath==0.9.4
jpy==0.10.0.dev1
jsonpickle==1.1
keyring==17.1.1
keyrings.alt==3.1.1
kombu==4.5.0
language-selector==0.1
ldap3==2.4.1
lml==0.0.9
lunardate==0.2.0
lxml==4.3.3
Mako==1.0.9
MarkupSafe==1.1.1
more-itertools==7.0.0
multicorn===-VERSION-
netifaces==0.10.4
networkx==2.3
odfpy==1.4.0
olefile==0.46
openpyxl==2.5.14
ordered-set==3.1.1
paiyroll==0.1
paiyroll-fdw==0.1
paramiko==2.4.2
parso==0.4.0
passlib==1.7.1
pexpect==4.6.0
pgadmin4==4.5
phonenumbers==8.10.10
pickleshare==0.7.5
pika==1.0.1
Pillow==5.4.1
pluggy==0.9.0
prompt-toolkit==2.0.9
psutil==5.5.1
psycopg2-binary==2.8.2
py==1.8.0
pyasn1==0.4.2
pycairo==1.16.2
pyCalverter==1.6.1
pycparser==2.19
pycrypto==2.6.1
pycups==1.9.73
pyecharts-jupyter-installer==0.0.3
pyexcel==0.5.13
pyexcel-handsontable==0.0.2
pyexcel-io==0.5.11
pyexcel-ods==0.5.6
pyexcel-webio==0.1.4
pyexcel-xls==0.5.8
pyexcel-xlsx==0.5.7
Pygments==2.3.1
PyGObject==3.32.0
PyJWT==1.7.1
pymacaroons==0.13.0
PyNaCl==1.3.0
PyQt5==5.12.1
pyquery==1.4.0
pyrabbit==1.1.0
PySocks==1.6.8
pytest==4.4.1
python-apt==1.8.4
python-consul==1.1.0
python-dateutil==2.8.0
python-debian==0.1.34
python-editor==1.0.4
pytz==2018.9
PyYAML==3.13
reportlab==3.5.18
requests==2.21.0
requests-unixsocket==0.1.5
rsa==3.4.2
s3transfer==0.2.0
SecretStorage==2.3.1
selenium==3.141.0
simplejson==3.16.0
six==1.12.0
soupsieve==1.8
speaklater==1.3
SQLAlchemy==1.3.3
sqlparse==0.2.4
ssh-import-id==5.7
sshtunnel==0.1.4
systemd-python==234
tablib==0.13.0
tblib==1.3.2
terminaltables==3.1.0
texttable==1.6.1
traitlets==4.3.2
twilio==6.26.2
ubuntu-advantage-tools==19.2
ubuntu-drivers-common==0.0.0
ufw==0.36
uk-postcode-utils==1.0
unattended-upgrades==0.1
urllib3==1.24.1
vine==1.3.0
waitress==1.3.0
wcwidth==0.1.7
webencodings==0.5.1
WebOb==1.8.5
WebTest==2.0.33
Werkzeug==0.15.2
WTForms==2.2.1
xdg==4.0.0
xkit==0.0.0
xlrd==1.2.0
xlwt==1.3.0
xmltodict==0.12.0

Other Dependencies

Consul is version v1.4.2

Minimally Reproducible Test Case

The problem is not easy to reproduce. However:

  • I have just a handful of Celery tasks: one which simply does an apply_async, and one which does an apply_async with a countdown of a few seconds (see the sketch after this list).
  • I expect each task completion to store its result in Consul.
  • All tasks are wrapped in a top-level catch-all exception handler. This ensures that all tasks return Celery SUCCESS, and the result data contains the "actual" result as a JSON-friendly dict.
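
For orientation, here is a minimal sketch of the setup described above; the task name, payloads and URLs are placeholders rather than the actual project code:

from celery import Celery

app = Celery('proj', broker='amqp://localhost:5672//',
             backend='consul://localhost:8500/')

def do_real_work(arg):
    # Stand-in for the project's actual business logic.
    return arg * 2

@app.task
def run_function(arg):
    # Top-level catch-all: the task always finishes as SUCCESS, and the
    # "actual" outcome travels inside a JSON-friendly dict.
    try:
        return {'ok': True, 'value': do_real_work(arg)}
    except Exception as e:
        return {'ok': False, 'error': str(e)}

if __name__ == '__main__':
    # One task queued immediately, and one with a countdown of a few seconds.
    run_function.apply_async(args=[1])
    run_function.apply_async(args=[2], countdown=5)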

Expected Behavior

I expect all results to be posted to Consul, from where I will later retrieve them (subject to being timed out in the usual way).

Actual Behavior

Some results are not stored in Consul due to a KeyError. The basic failure mode is that storing the result throws a KeyError when the client code tries to extract the 'ID':

[screenshot: stack trace — image_2019_06_15T22_36_15_455Z]

This line of code is expecting "data" not to be a boolean, but to have this form:

{'ID': '3e1d1cd2-4de1-be07-02d4-8b3d279acedb'}

(For clarity, the stack trace is from a colleague's machine; on mine, the line numbers are the same.)

I believe the root cause of the issue is likely to be in python-consul, which assumes that the "requests" library is thread-safe when it is not. My underlying assumption in thinking this is that the confusion between the dict (expected as a response to saving a result) and the bool (expected as a response when expiring a result) arises because the Consul backend's self.session is not thread-safe if those calls are interleaved.

@auvipy auvipy added this to the 4.6 milestone Jun 23, 2019
@ShaheedHaque
Contributor Author

ShaheedHaque commented Jul 11, 2019

I now believe this is an issue in Consul itself. I'll leave this open to report on progress.

@ShaheedHaque
Contributor Author

OK, I believe there is a bug in Celery here too. When these errors occur (which, as I say, I currently believe to be caused upstream), the Celery worker process gets "stuck" with the job in hand:

$ celery inspect active
-> celery@freenas: OK
    * {'id': '2996120d-87d2-4c9b-8774-b7a3b64e1332', 'name': 'paiyroll.tasks.function_run', 'args': "['paiyroll.report_run', 'report_preflight_validate_async', 1921]", 'kwargs': '{}', 'type': 'paiyroll.tasks.function_run', 'hostname': 'celery@freenas', 'time_start': 1562921376.9528346, 'acknowledged': True, 'delivery_info': {'exchange': '', 'routing_key': 'celery', 'priority': 0, 'redelivered': False}, 'worker_pid': 7111}
.
.
.

until all workers are stuck like this. Here is an example of what Celery reports:

[2019-07-12 08:48:51,191: WARNING/ForkPoolWorker-80] /usr/local/lib/python3.7/dist-packages/celery/app/trace.py:568: RuntimeWarning: Exception raised outside body: TypeError("'bool' object is not subscriptable"):
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/celery/app/trace.py", line 449, in trace_task
    uuid, retval, task_request, publish_result,
  File "/usr/local/lib/python3.7/dist-packages/celery/backends/base.py", line 149, in mark_as_done
    self.store_result(task_id, result, state, request=request)
  File "/usr/local/lib/python3.7/dist-packages/celery/backends/base.py", line 342, in store_result
    request=request, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/celery/backends/base.py", line 714, in _store_result
    self.set(self.get_key_for_task(task_id), self.encode(meta))
  File "/usr/local/lib/python3.7/dist-packages/celery/backends/consul.py", line 92, in set
    ttl=self.expires)
  File "/usr/local/lib/python3.7/dist-packages/consul/base.py", line 1781, in create
    data=data)
  File "/usr/local/lib/python3.7/dist-packages/consul/std.py", line 33, in put
    self.session.request('PUT', uri, body=data, headers=JSON_HEADER)))
  File "/usr/local/lib/python3.7/dist-packages/consul/base.py", line 234, in cb
    data = data['ID']
TypeError: 'bool' object is not subscriptable

  exc, exc_info.traceback)))

My guess is that the Celery worker, having caught this error, does not go on to clean up as expected.

@ShaheedHaque
Contributor Author

Filed #5642.

@auvipy auvipy removed this from the 4.6 milestone Aug 13, 2019
@ShaheedHaque ShaheedHaque reopened this Jun 24, 2020
@ShaheedHaque
Contributor Author

I finally have what I think is a smoking gun pointing to an issue in Celery. My setup is much as before, though with versions as follows:

$ celery report

software -> celery:4.4.5 (cliffs) kombu:4.6.10 py:3.8.2
            billiard:3.6.3.0 py-amqp:2.6.0
platform -> system:Linux arch:64bit, ELF
            kernel version:5.4.0-37-generic imp:CPython
loader   -> celery.loaders.default.Loader
settings -> transport:amqp results:disabled

and using, as before, the Consul result backend. I modified the Celery code slightly to emit these logs:

2020-06-24 09:16:03,336 [WARNING] ForkPoolWorker-27: Trying to fetch key celery-task-meta-601e8b28-9dcf-4e68-847b-5e1875a8eb05 from Consul
2020-06-24 09:16:03,336 [WARNING] ForkPoolWorker-27: request get="celery-task-meta-601e8b28-9dcf-4e68-847b-5e1875a8eb05"
2020-06-24 09:16:09,033 [WARNING] ForkPoolWorker-27: response data="{'ID': '7abf2295-6dca-6f7d-ed4c-4bff4b60955f'}", decode=Value is_id=False
2020-06-24 09:16:09,033 [ERROR] ForkPoolWorker-27: cannot get celery-task-meta-601e8b28-9dcf-4e68-847b-5e1875a8eb05: 'str' object has no attribute 'get' (None)
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/celery/backends/consul.py", line 64, in get
    _, data = self.client.kv.get(key)
  File "/usr/local/lib/python3.8/dist-packages/python_consul-1.1.0-py3.8.egg/consul/base.py", line 553, in get
    return self.agent.http.get(
  File "/usr/local/lib/python3.8/dist-packages/python_consul-1.1.0-py3.8.egg/consul/std.py", line 21, in get
    return callback(self.response(
  File "/usr/local/lib/python3.8/dist-packages/python_consul-1.1.0-py3.8.egg/consul/base.py", line 232, in cb
    if item.get(decode) is not None:
AttributeError: 'str' object has no attribute 'get'
2020-06-24 09:16:09,035 [WARNING] ForkPoolWorker-27: /usr/local/lib/python3.8/dist-packages/celery/app/trace.py:622: RuntimeWarning: Exception raised outside body: AttributeError("'str' object has no attribute 'get'"):
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/celery/app/trace.py", line 479, in trace_task
    mark_as_done(
  File "/usr/local/lib/python3.8/dist-packages/celery/backends/base.py", line 158, in mark_as_done
    self.store_result(task_id, result, state, request=request)
  File "/usr/local/lib/python3.8/dist-packages/celery/backends/base.py", line 442, in store_result
    self._store_result(task_id, result, state, traceback,
  File "/usr/local/lib/python3.8/dist-packages/celery/backends/base.py", line 853, in _store_result
    current_meta = self._get_task_meta_for(task_id)
  File "/usr/local/lib/python3.8/dist-packages/celery/backends/base.py", line 871, in _get_task_meta_for
    meta = self.get(self.get_key_for_task(task_id))
  File "/usr/local/lib/python3.8/dist-packages/celery/backends/consul.py", line 64, in get
    _, data = self.client.kv.get(key)
  File "/usr/local/lib/python3.8/dist-packages/python_consul-1.1.0-py3.8.egg/consul/base.py", line 553, in get
    return self.agent.http.get(
  File "/usr/local/lib/python3.8/dist-packages/python_consul-1.1.0-py3.8.egg/consul/std.py", line 21, in get
    return callback(self.response(
  File "/usr/local/lib/python3.8/dist-packages/python_consul-1.1.0-py3.8.egg/consul/base.py", line 232, in cb
    if item.get(decode) is not None:
AttributeError: 'str' object has no attribute 'get'

  warn(RuntimeWarning(
2020-06-24 09:16:09,036 [WARNING] ForkPoolWorker-27: Trying to fetch key celery-task-meta-601e8b28-9dcf-4e68-847b-5e1875a8eb05 from Consul
2020-06-24 09:16:09,036 [WARNING] ForkPoolWorker-27: request get="celery-task-meta-601e8b28-9dcf-4e68-847b-5e1875a8eb05"
2020-06-24 09:16:09,053 [WARNING] ForkPoolWorker-27: Trying to create Consul session celery-task-meta-601e8b28-9dcf-4e68-847b-5e1875a8eb05 with TTL 86400
2020-06-24 09:16:09,199 [WARNING] ForkPoolWorker-27: response data="True", decode=False is_id=True
2020-06-24 09:16:09,199 [ERROR] ForkPoolWorker-27: cannot save result for b'celery-task-meta-601e8b28-9dcf-4e68-847b-5e1875a8eb05'={"status": "FAILURE", "result": {"exc_type": "AttributeError", "exc_message": ["'str' object has no attribute 'get'"], "exc_module": "builtins"}, "traceback": "Traceback (most recent call last):\n  File \"/usr/local/lib/python3.8/dist-packages/celery/app/trace.py\", line 479, in trace_task\n    mark_as_done(\n  File \"/usr/local/lib/python3.8/dist-packages/celery/backends/base.py\", line 158, in mark_as_done\n    self.store_result(task_id, result, state, request=request)\n  File \"/usr/local/lib/python3.8/dist-packages/celery/backends/base.py\", line 442, in store_result\n    self._store_result(task_id, result, state, traceback,\n  File \"/usr/local/lib/python3.8/dist-packages/celery/backends/base.py\", line 853, in _store_result\n    current_meta = self._get_task_meta_for(task_id)\n  File \"/usr/local/lib/python3.8/dist-packages/celery/backends/base.py\", line 871, in _get_task_meta_for\n    meta = self.get(self.get_key_for_task(task_id))\n  File \"/usr/local/lib/python3.8/dist-packages/celery/backends/consul.py\", line 64, in get\n    _, data = self.client.kv.get(key)\n  File \"/usr/local/lib/python3.8/dist-packages/python_consul-1.1.0-py3.8.egg/consul/base.py\", line 553, in get\n    return self.agent.http.get(\n  File \"/usr/local/lib/python3.8/dist-packages/python_consul-1.1.0-py3.8.egg/consul/std.py\", line 21, in get\n    return callback(self.response(\n  File \"/usr/local/lib/python3.8/dist-packages/python_consul-1.1.0-py3.8.egg/consul/base.py\", line 232, in cb\n    if item.get(decode) is not None:\nAttributeError: 'str' object has no attribute 'get'\n", "children": [[["b3cc0538-e39d-4ff5-accd-d1d0c0e3e30e", null], null], [["14b3b7a0-1374-4ade-adc6-827f55cbe7ae", null], null], [["8a156251-4545-4bf5-9351-200d79b2a757", null], null], [["2563d59b-7885-41e2-b0f2-e01a0fe173c9", null], null], [["840e501b-744b-4d87-9400-7416391dd05a", null], null], [["b08a15a5-5abb-4103-b4ed-8ed16a760521", null], null], [["eb2de10e-f49b-47b8-a2e1-71cbc983774c", null], null], [["4546e0f5-dd1b-45c6-bfba-6fd3de9cc5e4", null], null], [["cd5a9a00-fdae-4bc4-94a9-eb784c1d5be7", null], null], [["e1797005-0f2f-47c9-a1df-fdf26edc07cd", null], null], [["49b95dec-be25-4f86-a295-4fc4650c0665", null], null], [["96f37a25-9b76-4de5-98c3-4e504816933c", null], null], [["6ef42ea5-a45c-4f43-83e3-29c4964bf134", null], null], [["08efabc6-81e1-4ff7-9970-435d4efcbb2c", null], null], [["4ade8f72-13fb-433c-aec4-844f019607ff", null], null], [["11b1456c-f558-461b-a872-0044aa29c954", null], null], [["d8fb6ad9-9669-4c8a-8e73-aef0ffff452a", null], null], [["890a6831-e033-4b63-9199-2f4858b6dd40", null], null], [["f2d7b230-ac3c-489f-b3a3-b5956b379a2d", null], null], [["34d1a8a8-9f8f-44b5-88e3-1ee8713d7a8a", null], null], [["2522e3dc-7bf2-4cb8-90cb-99a9011844e0", null], null]], "date_done": "2020-06-24T09:16:09.036404", "parent_id": "a91ffd09-bff0-4d44-a954-e596c08de35c", "task_id": "601e8b28-9dcf-4e68-847b-5e1875a8eb05"}: 'bool' object is not subscriptable (FAILURE)
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/celery/app/trace.py", line 479, in trace_task
    mark_as_done(
  File "/usr/local/lib/python3.8/dist-packages/celery/backends/base.py", line 158, in mark_as_done
    self.store_result(task_id, result, state, request=request)
  File "/usr/local/lib/python3.8/dist-packages/celery/backends/base.py", line 442, in store_result
    self._store_result(task_id, result, state, traceback,
  File "/usr/local/lib/python3.8/dist-packages/celery/backends/base.py", line 853, in _store_result
    current_meta = self._get_task_meta_for(task_id)
  File "/usr/local/lib/python3.8/dist-packages/celery/backends/base.py", line 871, in _get_task_meta_for
    meta = self.get(self.get_key_for_task(task_id))
  File "/usr/local/lib/python3.8/dist-packages/celery/backends/consul.py", line 64, in get
    _, data = self.client.kv.get(key)
  File "/usr/local/lib/python3.8/dist-packages/python_consul-1.1.0-py3.8.egg/consul/base.py", line 553, in get
    return self.agent.http.get(
  File "/usr/local/lib/python3.8/dist-packages/python_consul-1.1.0-py3.8.egg/consul/std.py", line 21, in get
    return callback(self.response(
  File "/usr/local/lib/python3.8/dist-packages/python_consul-1.1.0-py3.8.egg/consul/base.py", line 232, in cb
    if item.get(decode) is not None:
AttributeError: 'str' object has no attribute 'get'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/celery/backends/consul.py", line 78, in set
    return self.set_real(key, value, state)
  File "/usr/local/lib/python3.8/dist-packages/celery/backends/consul.py", line 99, in set_real
    session_id = self.client.session.create(name=session_name,
  File "/usr/local/lib/python3.8/dist-packages/python_consul-1.1.0-py3.8.egg/consul/base.py", line 1763, in create
    return self.agent.http.put(
  File "/usr/local/lib/python3.8/dist-packages/python_consul-1.1.0-py3.8.egg/consul/std.py", line 26, in put
    return callback(self.response(
  File "/usr/local/lib/python3.8/dist-packages/python_consul-1.1.0-py3.8.egg/consul/base.py", line 235, in cb
    data = data['ID']
TypeError: 'bool' object is not subscriptable
2020-06-24 09:16:09,200 [ERROR] ForkPoolWorker-27: Task paiyroll.tasks.viewflow_dequeue[601e8b28-9dcf-4e68-847b-5e1875a8eb05] raised unexpected: AttributeError("'str' object has no attribute 'get'")
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/celery/app/trace.py", line 479, in trace_task
    mark_as_done(
  File "/usr/local/lib/python3.8/dist-packages/celery/backends/base.py", line 158, in mark_as_done
    self.store_result(task_id, result, state, request=request)
  File "/usr/local/lib/python3.8/dist-packages/celery/backends/base.py", line 442, in store_result
    self._store_result(task_id, result, state, traceback,
  File "/usr/local/lib/python3.8/dist-packages/celery/backends/base.py", line 853, in _store_result
    current_meta = self._get_task_meta_for(task_id)
  File "/usr/local/lib/python3.8/dist-packages/celery/backends/base.py", line 871, in _get_task_meta_for
    meta = self.get(self.get_key_for_task(task_id))
  File "/usr/local/lib/python3.8/dist-packages/celery/backends/consul.py", line 64, in get
    _, data = self.client.kv.get(key)
  File "/usr/local/lib/python3.8/dist-packages/python_consul-1.1.0-py3.8.egg/consul/base.py", line 553, in get
    return self.agent.http.get(
  File "/usr/local/lib/python3.8/dist-packages/python_consul-1.1.0-py3.8.egg/consul/std.py", line 21, in get
    return callback(self.response(
  File "/usr/local/lib/python3.8/dist-packages/python_consul-1.1.0-py3.8.egg/consul/base.py", line 232, in cb
    if item.get(decode) is not None:
AttributeError: 'str' object has no attribute 'get'

However, it is probably easier to see what is going on by looking at the attached wireshark packet capture.
wireshark.pcapng.gz:

  1. Use gunzip to restore the file.
  2. Load into wireshark.
  3. Filter the display by "http".
  4. Order by time column.
  5. Look at the numbered entries (i.e. the first column) 8953 and 8963.
  6. Note that the log timestamps and wireshark timestamps are 1 hour apart, so 09:00 in the log is 10:00 in wireshark.

Entries 8953 and 8963 seem to be a request/response pair between Consul on port 8500 and the instance of Celery running on port 50268 (which I take to be ForkPoolWorker-27). Note that the request is:

GET http://localhost:8500/v1/kv/celery-task-meta-601e8b28-9dcf-4e68-847b-5e1875a8eb05?consistent=1

and the response is encoded as application/json with content:

true

Note that true is not a valid response to the GET. It is actually the (valid) response to the earlier request in entry 8940:

PUT /v1/kv/celery-task-meta-4546e0f5-dd1b-45c6-bfba-6fd3de9cc5e4?acquire=128ced02-8f69-5327-69f6-56cbec3b3847 HTTP/1.1\r\n

which successfully stored the result of an earlier task.

The obvious implication is that after initiating the store of the result for celery-task-meta-4546e0f5-dd1b-45c6-bfba-6fd3de9cc5e4, Celery did not wait for the acknowledgement true. Then, when celery-task-meta-601e8b28-9dcf-4e68-847b-5e1875a8eb05 came along, it ended up using HTTP/1.1 pipelining and mistook the earlier response for its own.

Does that make sense?

@ShaheedHaque
Contributor Author

Just to refine the previous hypothesis based on looking into how Celery uses python-consul:

  1. python-consul has several client variations, of which Celery ends up using consul.std.Consul.
  2. Celery also instantiates consul.std.Consul() just once (i.e. once per worker).
  3. consul.std.Consul() creates just a single
    self.session = requests.session()

  4. So, Celery just ends up using callbacks fired by the requests library which, AFAICS, does no
    request/response queueing or multiplexing. (Though I confess to not having had the gumption
    to look at exactly how requests works internally; it is certainly known not to be thread-safe.)

Thus, once the Celery worker thinks it has finished sending the task result to the backend, it can pick up the next task even though the associated callback has not fired. Since the next task uses the same consul.std.Consul(), and thereby the same requests.session(), the stage is set for the scenario I see. As a quick hack, mostly intended to minimise the source changes, I am trying this edit:

$ diff -U 5 /usr/local/lib/python3.8/dist-packages/celery/backends/consul.py.saved /usr/local/lib/python3.8/dist-packages/celery/backends/consul.py
--- /usr/local/lib/python3.8/dist-packages/celery/backends/consul.py.saved      2020-06-23 14:58:20.042460818 +0100
+++ /usr/local/lib/python3.8/dist-packages/celery/backends/consul.py    2020-06-24 15:25:09.188833535 +0100
@@ -48,12 +48,16 @@
 
     def _init_from_params(self, hostname, port, virtual_host, **params):
         logger.debug('Setting on Consul client to connect to %s:%d',
                      hostname, port)
         self.path = virtual_host
-        self.client = consul.Consul(host=hostname, port=port,
-                                    consistency=self.consistency)
+        self.hostname = hostname
+        self.port = port
+
+    @property
+    def client(self):
+        return consul.Consul(host=self.hostname, port=self.port, consistency=self.consistency)
 
     def _key_to_consul_key(self, key):
         key = bytes_to_str(key)
         return key if self.path is None else '{0}/{1}'.format(self.path, key)

where each reference to self.client ends up with a unique consul.std.Consul(). This is clearly not how one would want to deal with the issue in real life, but should work for a quick test.

An initial run looks promising; I will report back after further runs.

@ShaheedHaque
Contributor Author

Multiple test runs on Linux and Mac OSX have not shown the problem after the hack in my previous comment was applied.

@thedrow
Member

thedrow commented Jul 13, 2020

You could create a client per thread using thread-local storage.
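
(For illustration, a minimal sketch of that idea, assuming the backend held the client in threading.local() and built it lazily per thread; this is not existing Celery code.)

import threading

import consul

class ThreadLocalConsul:
    """Hypothetical helper: one consul.Consul client per thread."""

    def __init__(self, host='localhost', port=8500, consistency='default'):
        self._args = (host, port, consistency)
        self._local = threading.local()

    @property
    def client(self):
        # Build the client the first time each thread asks for it.
        if not hasattr(self._local, 'client'):
            host, port, consistency = self._args
            self._local.client = consul.Consul(host=host, port=port,
                                               consistency=consistency)
        return self._local.client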

@ShaheedHaque
Contributor Author

@thedrow As far as I know, that won't help because the problem is visible when Celery is not using threads. I think the correct approach is to create one client per request (which the above hack achieves in a fragile fashion).

@thedrow
Member

thedrow commented Jul 13, 2020

But that means we have to connect each time we make a request.
It's very inefficient.

@maybe-sybr
Contributor

We discussed this briefly at the core contributors meeting today and I wanted to chime in with some thoughts and questions for @ShaheedHaque.

  1. This definitely looks like the second PUT to Consul is not being waited on before the next GET for some other ID is made, but I can't see how this could be happening based on the code I see in python-consul 1.1.0 (the version you included in the bug report template last year).
  2. Our use in the Consul backend also seems entirely reasonable, and does not appear to do anything which could be construed as async.
  3. The only change I could see us being able to reasonably make is some validation on the return value of backend.set() within _store_result(). We simply ignore the result in 4.4.5 and on master.

@ShaheedHaque - do you think you might be able to get some information about the return value from that set() call? The only thing I can think of is that a rogue EINTR might be causing the unanswered PUT request to stop awaiting a response, but then the exception I'd expect to be raised in that case would have to be silenced somewhere. I'm not sure if that information will help us fix the issue, but it might inform whether there is anything we can do in Celery to avoid it. If it turns out that the set() response is something we can handle/retry, I think the fix would be a simple retry loop there, depending on idempotency in Consul (I'm not familiar enough with it to say for sure). AFAICT everything else seems to be reasonable code that shouldn't fail in this annoying way :( and since we don't have any real control over the requests.session object used by consul, we'd be breaking in to do any fixes beyond the hack you've already posted above.
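
For illustration, the kind of guard being suggested might look roughly like the sketch below; set_with_retry is hypothetical, and it assumes set() propagates the boolean that python-consul's kv.put() returns (True on success):

import time

def set_with_retry(backend, key, value, retries=3, delay=0.5):
    # Hypothetical wrapper around the Consul backend's set(): treat anything
    # other than True as a failed/garbled response and retry (subject to the
    # idempotency caveat above).
    for attempt in range(retries):
        try:
            if backend.set(key, value):
                return True
        except Exception:
            if attempt == retries - 1:
                raise
        time.sleep(delay)
    return False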

One other option would be to try using one of the other consul client implementations rather than std.Consul, but that would mean we'd need to install tornado or twisted, and I'm not super on board with a change like that.

@ShaheedHaque
Contributor Author

@maybe-sybr Thanks for your analysis.

The bad news is that I have seen no evidence of anything like an interrupted syscall for either the set() or the put(). The callbacks always get a well-formed JSON response (either the right one or the wrong one).

However, I think I may have found an explanation for the race...according to the documentation a requests.Session:

Provides cookie persistence, connection-pooling, and configuration.

I certainly was not expecting any pooling...presumably that is what is giving rise to the race. Assuming this is the root cause, a bit of poking around the API documentation suggests that there might be a way to nobble the pool using something called an "HTTPAdapter" (it looks a little unpleasant...but better than the present hack).

@ShaheedHaque
Contributor Author

Well, either the pooling is not the problem, or my attempt to disable it was wrongly conceived:

from requests.adapters import HTTPAdapter
...
    def _init_from_params(self, hostname, port, virtual_host, **params):
        logger.debug('Setting on Consul client to connect to %s:%d',
                     hostname, port)
        self.path = virtual_host
        self.client = consul.Consul(host=hostname, port=port,
                                    consistency=self.consistency)
        #
        # Disable the default connection pooling.
        #
        adapter = HTTPAdapter(pool_connections=1, pool_maxsize=1, pool_block=True)
        self.client.http.session.mount(self.client.http.base_uri, adapter)

I'm stumped.

@thedrow
Member

thedrow commented Dec 13, 2020

It seems like python-consul is unmaintained.
This fork is active.

@ShaheedHaque Can you please uninstall python-consul and install python-consul2?
If it doesn't reproduce, we should simply switch.

@thedrow
Member

thedrow commented Dec 13, 2020

The unit tests pass with python-consul2.
We only get a handful of deprecation warnings when using it instead of python-consul.

@ShaheedHaque Please test it and see if you can reproduce the problem.

thedrow added a commit that referenced this issue Dec 14, 2020
python-consul has not been maintained in a long while now.
python-consul2 is a maintained fork of the same package.

Ref #5605.
@ShaheedHaque
Contributor Author

I confirm that switching to python-consul2 shows exactly the same behaviour:

$ sudo -H pip3 list | egrep 'celery|consul'
celery                        5.0.5
python-consul2                0.1.5

Nevertheless, given the support situation, switching to it is clearly a good step to take.

matusvalo pushed a commit that referenced this issue Dec 27, 2020
python-consul has not been maintained in a long while now.
python-consul2 is a maintained fork of the same package.

Ref #5605.
@auvipy auvipy modified the milestones: 4.5, 5.1.0 Dec 30, 2020
@ShaheedHaque
Contributor Author

ShaheedHaque commented Feb 19, 2021

As per the discussion in #5642, if there is interest in incorporating the above workaround (on the basis that correctness trumps performance), I'd be happy to provide a PR.

It is perhaps worth saying that we have run the workaround extensively in test and in production, and not seen a single instance of this error with the workaround in place. We do see very occasional errors connecting to Consul (possibly as a result of more frequent connection attempts), but at least those result in a very clear connection error rather than this very murky issue. I plan to submit a PR to python-consul2 to add support for connection retries (which it currently disables), which would further reduce this uncommon knock-on effect.

@ShaheedHaque
Contributor Author

I've submitted the PR to python-consul2 to add support for connection retries (which it currently disables), at poppyred/python-consul2#31; this compensates for the knock-on effect of the workaround, i.e. more connection attempts.

@thedrow thedrow added this to Review in progress in Celery 5.1.0 Feb 24, 2021
@thedrow thedrow moved this from Review in progress to In progress in Celery 5.1.0 Feb 24, 2021
@auvipy auvipy moved this from In progress to Postponed in Celery 5.1.0 May 4, 2021
@auvipy auvipy modified the milestones: 5.1.0, 5.2 May 4, 2021
@ShaheedHaque
Contributor Author

Since there has been no progress on a refined solution, I'd like to propose a PR based on my fix at #5605 (comment). This has been extensively tested by now, and works perfectly. As per the discussion following the above comment, there is a connection per request (which nobody is keen on), but at least the code works reliably.

I'll put one together later for your consideration.

ShaheedHaque added a commit to ShaheedHaque/celery that referenced this issue Jun 27, 2021
…ponses

from Consul with the outbound Celery request that caused it. This leaves
it prone to mistaking the (final) response from an operation N as the
response to an (early) part of operation N + 1.

This change fixes that by using a separate connection for each request.
That of course has the downside of (a) being relatively expensive and (b)
increasing the rate of connection requests into Consul:

- The former is annoying, but at least the backend works reliably.

- The latter can cause Consul to reject excessive connection attempts, but
  if it does, at least it returns a clear indication of this (IIRC, it
  responds with an HTTP 429 "too many connections" indication).

  Additionally, this issue can be ameliorated by enabling retries in
  python-consul2 (which I believe should be turned on regardless, to handle
  transient network issues). This is fixed by the PR at
  https://github.com/poppyred/python-consul2/pull/31.

Note that we have never seen (b) outside a test specifically trying to hammer
the system, but we see (a) all the time in our normal system tests.
ShaheedHaque added a commit to ShaheedHaque/celery that referenced this issue Jun 28, 2021
…ponses

from Consul with the outbound Celery request that caused it. This leaves
it prone to mistaking the (final) response from an operation N as the
response to an (early) part of operation N + 1.

This change fixes that by using a separate connection for each request.
That of course has the downside of (a) being relatively expensive and (b)
increasing the rate of connection requests into Consul:

- The former is annoying, but at least the backend works reliably.

- The latter can cause Consul to reject excessive connection attempts, but
  if it does, at least it returns a clear indication of this (IIRC, it
  responds with an HTTP 429 "too many connections" indication).

  Additionally, this issue can be ameliorated by enabling retries in
  python-consul2 (which I believe should be turned on regardless, to handle
  transient network issues). This is fixed by the PR at
  https://github.com/poppyred/python-consul2/pull/31.

Note that we have never seen (b) outside a test specifically trying to hammer
the system, but we see (a) all the time in our normal system tests.

To opt out from the new behaviour, add a parameter "one_client=1" to the
connection URL.
ShaheedHaque added a commit to ShaheedHaque/celery that referenced this issue Aug 10, 2021
…ponses

from Consul with the outbound Celery request that caused it. This leaves
it prone to mistaking the (final) response from an operation N as the
response to an (early) part of operation N + 1.

This change fixes that by using a separate connection for each request.
That of course has the downside of (a) being relatively expensive and (b)
increasing the rate of connection requests into Consul:

- The former is annoying, but at least the backend works reliably.

- The latter can cause Consul to reject excessive connection attempts, but
  if it does, at least it returns a clear indication of this (IIRC, it
  responds with an HTTP 429 "too many connections" indication).

  Additionally, this issue can be ameliorated by enabling retries in
  python-consul2 (which I believe should be turned on regardless, to handle
  transient network issues). This is fixed by the PR at
  https://github.com/poppyred/python-consul2/pull/31.

Note that we have never seen (b) outside a test specifically trying to hammer
the system, but we see (a) all the time in our normal system tests.

To opt out from the new behaviour, add a parameter "one_client=1" to the
connection URL.
auvipy pushed a commit that referenced this issue Aug 11, 2021
…6823)

* As per #5605, the Consul backend does not cleanly associate responses
from Consul with the outbound Celery request that caused it. This leaves
it prone to mistaking the (final) response from an operation N as the
response to an (early) part of operation N + 1.

This change fixes that by using a separate connection for each request.
That of course has the downside of (a) being relatively expensive and (b)
increasing the rate of connection requests into Consul:

- The former is annoying, but at least the backend works reliably.

- The latter can cause Consul to reject excessive connection attempts, but
  if it does, at least it returns a clear indication of this (IIRC, it
  responds with an HTTP 429 "too many connections" indication).

  Additionally, this issue can be ameliorated by enabling retries in
  python-consul2 (which I believe should be turned on regardless, to handle
  transient network issues). This is fixed by the PR at
  https://github.com/poppyred/python-consul2/pull/31.

Note that we have never seen (b) outside a test specifically trying to hammer
the system, but we see (a) all the time in our normal system tests.

To opt out from the new behaviour, add a parameter "one_client=1" to the
connection URL.

* Increase code coverage.

* Rewrite Consul backend documentation, and describe the options now
available.
@maybe-sybr
Contributor

Should we actually close this with the merge of #6823? We have a workaround but not a fix that we really understand.

That said, I'm not confident we'll ever really spend much more time on working out what's going on here. It seems like we can only really chalk it up to weirdness in an upstream project, and digging further is probably not a great use of our time. As long as we can make the workarounds (the per-request clients and the one_client trapdoor) discoverable for others who might hit this issue, perhaps leaving this closed is fine. I think @ShaheedHaque has done a good job on the documentation in #6823, so I'd be happy to leave this closed rather than have it linger on the backlog forever.
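
For discoverability: per the merged change referenced above, per-request clients are the new default, and the old single-client behaviour can be requested via the connection URL. Assuming the "one_client=1" option from the commit message is passed as a URL query parameter, the configuration would look something like:

# Default after #6823 (a fresh client per request, the workaround from this issue):
CELERY_RESULT_BACKEND = 'consul://localhost:8500/'

# Opt back in to a single shared client:
CELERY_RESULT_BACKEND = 'consul://localhost:8500/?one_client=1'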

jeyrce pushed a commit to jeyrce/celery that referenced this issue Aug 25, 2021
python-consul has not been maintained in a long while now.
python-consul2 is a maintained fork of the same package.

Ref celery#5605.
jeyrce pushed a commit to jeyrce/celery that referenced this issue Aug 25, 2021
…elery#6823)

* As per celery#5605, the Consul backend does not cleanly associate responses
from Consul with the outbound Celery request that caused it. This leaves
it prone to mistaking the (final) response from an operation N as the
response to an (early) part of operation N + 1.

This change fixes that by using a separate connection for each request.
That of course has the downside of (a) being relatively expensive and (b)
increasing the rate of connection requests into Consul:

- The former is annoying, but at least the backend works reliably.

- The latter can cause Consul to reject excessive connection attempts, but
  if it does, at least it returns a clear indication of this (IIRC, it
  responds with an HTTP 429 "too many connections" indication).

  Additionally, this issue can be ameliorated by enabling retries in
  python-consul2 (which I believe should be turned on regardless, to handle
  transient network issues). This is fixed by the PR at
  https://github.com/poppyred/python-consul2/pull/31.

Note that we have never seen (b) outside a test specifically trying to hammer
the system, but we see (a) all the time in our normal system tests.

To opt out from the new behaviour, add a parameter "one_client=1" to the
connection URL.

* Increase code coverage.

* Rewrite Consul backend documentation, and describe the options now
available.