Race condition in Consul result backend when saving a result #5605
@ShaheedHaque Basically, if you are using the prefork pool, which is the default for Celery, it forks worker processes instead of using threads to get past the limitations of the Python GIL. This means that if you are running on a multi-core CPU (you probably are), having more than one worker process means results may be written in parallel while tasks are being consumed. I would have to know more about the setup, but if your python-consul library is only listening in one spot for those responses and assumes a linear order of execution, that could be the issue. |
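For concreteness, a minimal sketch of how pool and concurrency are pinned in Celery configuration (app name and URLs illustrative):

```python
# Sketch only: with the default prefork pool, concurrency means N forked
# child *processes* (not threads), each able to store results concurrently.
from celery import Celery

app = Celery('proj', broker='amqp://', backend='consul://localhost:8500/')
app.conf.worker_pool = 'prefork'     # the default pool implementation
app.conf.worker_concurrency = 4      # four forked worker processes
```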
Hi @matteius, I am indeed using the default prefork pool. I have taken a cursory look at the Celery code, but could not easily see where the result backend code is called from, so it is unclear to me whether there are multiple ports involved (i.e. one per forked worker) or just one (i.e. in the master process). However, my suspicion is that in either event, the result backend is doing both "set" operations to store results and "delete" operations to expire them. The suspicion is based on instrumenting the Consul backend code, and seeing that the "set" code, which expects to have created a Consul session and to receive a dictionary response describing the newly created session, something like this:
is actually sometimes getting a simple boolean response. What can return such a result? The Consul HTTP API call that returns a boolean is the "delete". The "set" and "delete" code paths both appear to be called, and the evidence is clear that the various responses are getting mixed up: for example, I see instances where the code attempts to get the ['ID'] of a boolean! From the way the backend code is written with inline callbacks, the only way I can see this happening is if a given listener is threaded and the session context is getting confused. But before I plunge into the deep end and propose a fix for the Consul library's session handling, I am looking for confirmation that my suspicions match Celery's usage of its backend. |
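To make the two response shapes concrete, here is a sketch of the underlying Consul HTTP API calls (host/port and key illustrative; shapes per the Consul HTTP API documentation):

```python
# Illustration of the two Consul HTTP API calls whose responses appear
# to be getting crossed.
import requests

base = 'http://localhost:8500/v1'

# Creating a session returns a JSON object describing it:
resp = requests.put(f'{base}/session/create', json={'TTL': '60s'})
print(resp.json())    # e.g. {'ID': '4546e0f5-dd1b-45c6-bfba-6fd3de9cc5e4'}

# Deleting a key returns a bare JSON boolean:
resp = requests.delete(f'{base}/kv/celery-task-meta-example')
print(resp.json())    # True
```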
Which backend are you using for the broker, i.e. RabbitMQ, Redis, SQS, or something else? |
RabbitMQ. |
In that case, every task that gets a result must define a unique response queue, which Celery handles internally -- is Consul reading from the queue for both expired results, or just using the result.get() method? I read somewhere here recently that you have to call .get() or .forget() on every result promise object when using the results backend, otherwise it can lead to something possibly like what you are describing. It can happen if your code only cares about the results of some tasks. |
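For illustration, the "settle every result promise" pattern being described might look like this (the task module is hypothetical):

```python
# Illustrative only: settle every AsyncResult, either by fetching the
# value or by explicitly discarding it with forget().
from myapp.tasks import add   # hypothetical task

result = add.delay(2, 2)      # returns an AsyncResult promise
try:
    value = result.get(timeout=10)   # fetch the stored result
finally:
    result.forget()                  # drop the backend entry either way
```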
For clarity, I'm just a user; I didn't write either the Consul result backend or the python-consul package. That said, my code does wait for and check every result. However, I don't understand what any of that has to do with the problem I described, or the clarification I am seeking. All I need to know is whether, when Celery invokes the backend to save results, it uses multiple threads. The reason for asking is that I am seeing errors which can be explained if threads are in use, because the python-consul package is not thread-safe. |
In the future, please use the template for bug reports. Do you have a test case? |
@ShaheedHaque As I already stated above, if you are using the default prefork pool, which you have said you are, then Celery is invoking multiple processes and not multiple threads. |
@thedrow Apologies, I did not consider this to fit any of the templates, and specifically I did not consider this a bug report against Celery. Here is the template anyway...
Mandatory Debugging Information
Optional Debugging Information
Related Issues and Possible Duplicates
Related Issues
Possible Duplicates
Environment & Settings
Celery version:
|
I now believe this is an issue in Consul itself. I'll leave this open to report on progress. |
OK, I believe there is a bug in Celery here too. When these errors occur (which as I say I currently believe to be caused upstream), the celery worker process gets "stuck" with the job in hand:
until all workers are stuck like this. Here is an example of what Celery reports:
My guess is that the Celery worker, having caught this error, does not go on to clean up as expected. |
Filed #5642. |
I finally have what I think is a smoking gun pointing to an issue in Celery. My setup is much as before though with versions as follows:
and using, as before, the Consul result backend. I modified the Celery code slightly to emit these logs:
However, it is probably easier to see what is going on by looking at the attached wireshark packet capture.
Entries 8953 and 8963 seem to be a request/response pair between Consul on port 8500 and the instance of Celery running on port 50268 (which I take to be ForkPoolWorker-27). Note the request is:
and the response is encoded as application/json with content:
Note that
which successfully stored the result of an earlier task. The obvious implication is that, after initiating the storing of the result for celery-task-meta-4546e0f5-dd1b-45c6-bfba-6fd3de9cc5e4, Celery did not wait for the acknowledgement. Does that make sense? |
Just to refine the previous hypothesis based on looking into how Celery uses python-consul:
Thus, once the Celery worker thinks it has finished sending the task result to the backend, even though the associated callback has not fired, it can pick up the next task. Since the next task uses the same consul.std.Consul() and thereby the same requests.session(), the stage is set for the scenario I see. As a quick hack mostly intended to minimise the source changes, I am trying this edit:
where each reference to the shared client is replaced by a freshly constructed one. An initial run looks promising; I will report back after further runs. |
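In outline, the hack amounts to something like the following minimal sketch (not the actual diff; class and method names are hypothetical):

```python
# Minimal sketch of the hack: build a fresh python-consul client, and
# hence a fresh requests session, for every backend operation instead
# of sharing one client across operations.
import consul

class OneClientPerRequest:
    def __init__(self, host='localhost', port=8500):
        self.host, self.port = host, port

    def client(self):
        # A new client per call means a new connection per request, so a
        # response can never be attributed to the wrong operation.
        return consul.Consul(host=self.host, port=self.port)

    def set(self, key, value):
        return self.client().kv.put(key, value)    # returns a boolean

    def get(self, key):
        _, data = self.client().kv.get(key)        # returns (index, data)
        return data

    def delete(self, key):
        return self.client().kv.delete(key)        # returns a boolean
```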
Multiple test runs on Linux and Mac OSX have not shown the problem after the hack in my previous comment was applied. |
You could create a client per thread using local thread storage. |
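For concreteness, a minimal sketch of that client-per-thread suggestion:

```python
# Sketch: one python-consul client per thread via thread-local storage.
import threading
import consul

_local = threading.local()

def get_client():
    # Each thread lazily constructs and reuses its own client.
    if not hasattr(_local, 'client'):
        _local.client = consul.Consul()
    return _local.client
```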
@thedrow As far as I know, that won't help because the problem is visible when Celery is not using threads. I think the correct approach is to create one client per request (which the above hack achieves in a fragile fashion). |
But that means we have to connect each time we make a request. |
We discussed this briefly at the core contributors meeting today and I wanted to chime in with some thoughts and questions for @ShaheedHaque.
@ShaheedHaque - do you think you might be able to get some information about the return value from that put() call? One other option would be to try using one of the other Consul client implementations rather than python-consul. |
@maybe-sybr Thanks for your analysis. The bad news is that I have seen no evidence of anything like an interrupted syscall for either the set() or the put(). The callbacks always get a well-formed JSON response (either the right one or the wrong one). However, I think I may have found an explanation for the race... according to the documentation, a requests.Session:
I certainly was not expecting any pooling...presumably that is what is giving rise to the race. Assuming this is the root cause, a bit of poking around the API documentation suggests that there might be a way to nobble the pool using something called an "HTTPAdapter" (it looks a little unpleasant...but better than the present hack). |
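Roughly, disabling the pool via an HTTPAdapter looks like this (a sketch; as the next comment shows, it did not cure the race):

```python
# Sketch: trying to neutralise requests' connection pooling by mounting
# an HTTPAdapter with a single, blocking pool slot.
import requests
from requests.adapters import HTTPAdapter

session = requests.session()
session.mount('http://', HTTPAdapter(pool_connections=1,
                                     pool_maxsize=1,
                                     pool_block=True))
```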
Well, either the pooling is not the problem, or my attempt to disable it was wrongly conceived:
I'm stumped. |
It seems like python-consul is unmaintained. @ShaheedHaque Can you please uninstall python-consul and install python-consul2? |
The unit tests pass with python-consul2. @ShaheedHaque Please test it and see if you can reproduce the problem. |
python-consul has not been maintained in a long while now. python-consul2 is a maintained fork of the same package. Ref #5605.
I confirm that switching to python-consul2 shows exactly the same behaviour:
Nevertheless, given the support situation, switching to it is clearly a good step to take. |
As per the discussion in #5642, if there is interest in incorporating the above workaround (on the basis that correctness trumps performance), I'd be happy to provide a PR. It is perhaps worth saying that we have run the workaround extensively in test and in production, and have not seen a single instance of this error with the workaround in place. We do see very occasional errors connecting to Consul (possibly as a result of more frequent connection attempts), but at least those result in a very clear connection error rather than this very murky issue. I plan to submit a PR to python-consul2 to add support for connection retries (which it currently disables), which further reduces this uncommon knock-on effect. |
I've submitted the PR to python-consul2 to add support for connection retries (which it currently disables), compensating for the knock-on effect of the workaround (i.e. more connection attempts): poppyred/python-consul2#31. |
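For reference, enabling retries at the requests/urllib3 layer can look like the following sketch (the actual change lives in the linked PR):

```python
# Sketch: retry transient failures (including HTTP 429) with backoff.
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

retry = Retry(total=3, backoff_factor=0.2,
              status_forcelist=(429, 500, 502, 503))
session = requests.Session()
session.mount('http://', HTTPAdapter(max_retries=retry))
```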
Since there has been no progress on a refined solution, I'd like to propose a PR based on my fix at #5605 (comment). This has been extensively tested by now, and works perfectly. As per the discussion following the above comment, there is a connection per request (which nobody is keen on), but at least the code works reliably. I'll put one together later for your consideration. |
…6823) * As per #5605, the Consul backend does not cleanly associate responses from Consul with the outbound Celery request that caused them. This leaves it prone to mistaking the (final) response from an operation N as the response to an (early) part of operation N + 1. This change fixes that by using a separate connection for each request. That of course has the downsides of (a) being relatively expensive and (b) increasing the rate of connection requests into Consul: the former is annoying, but at least the backend works reliably; the latter can cause Consul to reject excessive connection attempts, but if it does, at least it returns a clear indication of this (IIRC, it responds with an HTTP 429 "too many connections" indication). Additionally, this issue can be ameliorated by enabling retries in python-consul2 (which I believe should be turned on regardless, to handle transient network issues); that is addressed by the PR at https://github.com/poppyred/python-consul2/pull/31. Note that we have never seen (b) outside a test specifically trying to hammer the system, but we see (a) all the time in our normal system tests. To opt out of the new behaviour, add the parameter "one_client=1" to the connection URL. * Increase code coverage. * Rewrite the Consul backend documentation, and describe the options now available.
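With that merged, choosing the behaviour is a one-line configuration choice (host/port illustrative):

```python
# Per-request clients are the new default; one_client=1 restores the
# old single shared client, as described in the commit message above.
result_backend = 'consul://localhost:8500/'
# result_backend = 'consul://localhost:8500/?one_client=1'
```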
Should we actually close this with the merge of #6823? We have a workaround, but not a fix that we really understand. That said, I'm not confident we'll ever spend much more time working out what's going on here. It seems like we can only really chalk it up to weirdness in an upstream project, and probably not a great use of our time. As long as we can make the workarounds (the per-request clients and the one_client trapdoor) discoverable for others who might hit this issue, perhaps leaving this closed is fine. I think @ShaheedHaque has done a good job on the documentation in #6823, so I'd be happy to leave this closed rather than have it linger on the backlog forever. |
I'm encountering an issue using the Consul result backend which I suspect is NOT a Celery issue as such, but rather an issue with the lack of thread safety in the library the backend uses.
The symptoms are consistent with confusion between the sorts of responses that Consul gives the library when creating saved results and when expiring them. So, if Celery both writes and expires saved results using threads, the lack of thread-safety in the python-consul library could produce the results described.
Can somebody confirm if Celery access to a results backend involves multiple threads?