Connection starvation in gunicorn worker #3795

idzikovsky · 2024-07-19T17:40:25Z

Is there an existing issue for this?

I have searched the existing issues

Description

I think we've hit some strange behavior somewhere in the Hue web server internals:
After a few users log in (somewhere around 10), the Hue server stops responding to any request until the gunicorn master restarts the workers on timeout (configured by the gunicorn_worker_timeout option).

I spent some time debugging this issue, but I wasn't able to find a root cause. What I've found is that something seems to cause gunicorn connections to hang, but I don't see any exception that might explain it.
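Since no exception shows up, one possible direction is to dump the Python stack of every thread in a hung worker and see where it is blocked. Below is a minimal sketch using the standard library's faulthandler module. The helper name enable_stack_dumps, the log path, and the choice of SIGUSR1 are assumptions for illustration and not part of Hue or gunicorn. Where to call it inside the worker process is left open, and gunicorn assigns its own meaning to several signals, so the signal number should be checked against your setup.

```python
import faulthandler
import signal


def enable_stack_dumps(log_path, signum=signal.SIGUSR1):
    """Dump the Python stack of every thread to log_path when signum arrives.

    Hypothetical helper: it is not part of Hue or gunicorn.
    """
    # faulthandler writes to the raw file descriptor, so the file object
    # must stay open for as long as the handler is registered.
    log_file = open(log_path, "a")
    faulthandler.register(signum, file=log_file, all_threads=True)
    return log_file
```

With this registered in a worker, sending the chosen signal to the worker pid while Hue is unresponsive (for example with kill -USR1 and the pid from the WORKER TIMEOUT line) should append the current stack traces to the log file.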

I observed this on Hue 4.11 with the default configuration. The only thing I changed was configuring it to use PAM authentication, and I reproduced the problem with both MySQL and PostgreSQL databases.
Also, I've just reproduced the same problem on a Hue build from the master branch at commit bdeccbd.

I also changed the gunicorn_worker_timeout value to 120 seconds to be able to reproduce this problem more quickly.

I will continue to investigate this, but it would be helpful to have some clues or directions if possible.

One small note to add: on Hue 4.11 this problem appears only on the Python 3 build. On the Python 2 build everything is fine.

Steps To Reproduce

1. Build Hue from master.
2. Configure Hue to use PAM auth and a MySQL/PostgreSQL database to store its data (I'm not sure whether these steps are required, but this is what I have on my setup).
3. Set gunicorn_worker_timeout to a smaller value, like 120 seconds, to be able to reproduce this faster.
4. Open Hue in an anonymous browser window and log in, then repeat this step ~10 times.

I've created the following Selenium script to automate this process:

#!/usr/bin/env python3

from selenium import webdriver
from selenium.common.exceptions import WebDriverException
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
import sys
from threading import Thread
import time
from webdriver_manager.chrome import ChromeDriverManager

parallel = 10
if len(sys.argv) > 1:
    parallel = int(sys.argv[1])


def main():
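    def wait_for_page_loaded(driver):
        # Treat the page as loaded once no jQuery animation is running.
        waiter = WebDriverWait(driver, 120)
        waiter.until(lambda driver: driver.execute_script('return jQuery(":animated").length == 0;'))

    # Download (or reuse a cached) chromedriver binary for the installed Chrome.
    driver_manager = ChromeDriverManager()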
    driver_executable_path = driver_manager.install()

    driver_options = Options()
    driver_options.add_argument("--disable-features=SidePanelPinning")
    driver_options.add_argument("--auto-open-devtools-for-tabs")
    driver_options.set_capability('unhandledPromptBehavior', 'accept')
    driver_options.add_experimental_option("prefs", {
            "devtools.preferences.panel-selected-tab": '"network"',
            "devtools.preferences.network-log-show-overview": False,
        })

    driver = webdriver.Chrome(service=Service(driver_executable_path), options=driver_options)

    waiter = WebDriverWait(driver, 120)
    driver.get("http://localhost:8888/")

    wait_for_page_loaded(driver)

    waiter.until(EC.element_to_be_clickable((By.CSS_SELECTOR, 'input#id_username')))
    el = driver.find_elements(By.CSS_SELECTOR, 'input#id_username')
    el[0].send_keys("user1")

    el = driver.find_elements(By.CSS_SELECTOR, 'input#id_password')
    el[0].send_keys("password1")

    el = driver.find_elements(By.CSS_SELECTOR, 'input.btn-primary')
    el[0].click()

    # Keep the session alive until the browser window is closed by hand.
    while True:
        try:
            _ = driver.window_handles
        except WebDriverException:
            break
        time.sleep(1)


if __name__ == '__main__':
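    # Log in from `parallel` browser sessions at the same time.
    threads = []
    for _ in range(parallel):
        t = Thread(target=main, daemon=True)
        t.start()
        threads.append(t)

    # Wait for every browser session to end before exiting.
    for thread in threads: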
        thread.join()

Then I wait for those pages to load, and possibly wait one more minute or try to create 2 new sessions. After that I'm not able to open the login page or anything else at all until I see the following message in error.log:

[16/Jul/2024 10:35:48 -0700] glogging     CRITICAL WORKER TIMEOUT (pid:1672386)

Then Hue is fine again, until 10 more users log in.
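To timestamp the window in which Hue stops answering and compare it with the WORKER TIMEOUT entries in error.log, a small polling probe may help. This is a minimal sketch using only the standard library. The function names is_responsive and watch, and the interval and timeout defaults, are assumptions for illustration.

```python
import http.client
import time
import urllib.error
import urllib.request


def is_responsive(url, timeout=5.0):
    """Return True if the server answers url with any HTTP status in time."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # An HTTP error status still means the server answered.
        return True
    except (OSError, http.client.HTTPException):
        # Timeouts, refused connections and broken responses count as a stall.
        return False


def watch(url, rounds, interval=1.0, timeout=5.0):
    """Probe url `rounds` times and return (timestamp, responsive) pairs."""
    results = []
    for _ in range(rounds):
        ok = is_responsive(url, timeout)
        results.append((time.time(), ok))
        print(time.strftime("%H:%M:%S"), "ok" if ok else "NO RESPONSE")
        time.sleep(interval)
    return results
```

For the setup described here, something like watch("http://localhost:8888/", rounds=600) could run in a separate terminal while the Selenium script logs the users in.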

Logs

No response

Hue version

4.11, master, bdeccbd


Harshg999 · 2024-07-23T12:21:58Z

Thanks @idzikovsky for reporting this interesting issue, @amitsrivastava @ranade1 - can you guys take a look?

github-actions · 2024-08-23T01:54:57Z

This issue is stale because it has been open 30 days with no activity and is not labeled "Prevent stale". Remove "stale" label or comment or this will be closed in 10 days.

idzikovsky · 2024-09-03T10:45:42Z

It's strange that no one except me has faced this issue, as I got it on different Hue builds on different operating systems and different Hue configurations.

idzikovsky added the BUG Issue type for reporting failure due to bug in functionality label Jul 19, 2024

github-actions bot added the Stale label Aug 23, 2024

github-actions bot closed this as completed Sep 2, 2024
