
Added storage_max_events option, updated pg_storage event loading #2

Open
wants to merge 5 commits into postgre-strorage

Conversation

johnarnold

This option allows limiting the maximum number of events loaded from the storage provider (e.g. PostgreSQL). Thus, loading performance stays deterministic regardless of how big the table gets.

@@ -27,7 +27,14 @@

_add_event = """INSERT INTO events (time, data) VALUES (%s, %s) ON CONFLICT DO NOTHING"""

_all_events = """SELECT data FROM events ORDER BY time ASC"""
_get_events = """SELECT data FROM (
Owner

@johnarnold My SQL skills are a bit rusty... Does this query work as expected (return all events) when max_events is None?
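For reference, PostgreSQL treats LIMIT NULL the same as LIMIT ALL, so a parameterized limit of None returns every row. A minimal sketch, assuming a psycopg2-style driver; the connection string and simplified query text are illustrative, not the PR's actual code:

import psycopg2

# Simplified, illustrative query -- not the PR's actual _get_events statement.
_get_events = """SELECT data FROM events ORDER BY time DESC LIMIT %s"""

with psycopg2.connect("dbname=flower") as conn:      # connection details are illustrative
    with conn.cursor() as cur:
        cur.execute(_get_events, (None,))             # None -> LIMIT NULL -> same as no LIMIT
        rows = cur.fetchall()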

Owner
@bosonogi Feb 4, 2018

@johnarnold Limiting by a number of events can result in an incomplete task load at the point where loading is cut off. I've just tested this query for loading the events for the last x tasks:

SELECT data FROM events WHERE data->'uuid' IN (
  SELECT data->'uuid' FROM events
  WHERE data @> '{"type": "task-received"}'
  ORDER BY time DESC LIMIT 4
)
ORDER BY time ASC

Author
@johnarnold Feb 5, 2018

max_events should never be None, as it defaults to 30000 (i.e. 3x the default max_tasks).

Also, it's loading the newest events, so while you could get a partial task load, it will be the oldest task(s) that are incomplete. Does that cause any problems? I don't see a way to avoid it, other than keeping 'all events forever' and that doesn't seem doable or useful either.

Author

My thought was that if you load enough (or slightly more) events than are needed to populate the task cache up to its max_tasks size, any partial tasks will get dropped out of the cache.

Author
@johnarnold Feb 5, 2018

I have this running in production, btw. My ratios might be slightly off --
storage_max_events = 300k yielded ~ 75k tasks, whereas my max_tasks = 100k. So the storage_max_events should probably be around 4x the max_tasks.

But it does work.

Author

I'm not sure a contains-type query against the GIN index will get us the performance we want. I'm not a PostgreSQL expert, so maybe there's a query plan optimization we're missing. My head goes towards making uuid its own column and indexing it, so we can do a simpler distinct + order + limit query to build a list of n tasks and then fetch all their events.
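A hedged sketch of what that simpler query might look like, assuming a dedicated, indexed uuid column; the constant name and exact SQL are illustrative, not taken from the PR:

# Illustrative only -- picks the ids of the N most recently active tasks via the
# uuid column, then loads all events for those tasks in ascending time order.
_get_recent_task_events = """
SELECT data FROM events
WHERE uuid IN (
    SELECT uuid FROM events
    GROUP BY uuid
    ORDER BY MAX(time) DESC
    LIMIT %s
)
ORDER BY time ASC
"""

# cursor.execute(_get_recent_task_events, (max_tasks,))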

Owner

@johnarnold Another approach might be to keep the query simple as it was, and move the limit logic to Python code. As we read all events from the database one at a time in descending order, we could detect when all events for max_tasks number of tasks have been loaded and stop reading. Then these events could be replayed in the correct (ascending) order.

This would involve loading the events to memory before using them, but by design that amount of data must be small enough to fit temporarily in memory without issues.
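A rough sketch of that approach, combined with the set-of-uuids check suggested in the reply below. Names and details are illustrative, not code from the PR; it assumes JSONB values come back as Python dicts, which psycopg2 does by default:

def load_recent_events(cursor, max_tasks):
    # Read newest-first and stop once events for max_tasks distinct tasks are in hand.
    cursor.execute("SELECT data FROM events ORDER BY time DESC")
    collected = []
    seen_uuids = set()
    for (data,) in cursor:
        uuid = data.get('uuid')              # task events carry the task id here
        if uuid is not None:
            seen_uuids.add(uuid)
            if len(seen_uuids) > max_tasks:
                break                        # enough tasks collected; stop reading
        collected.append(data)
    collected.reverse()                      # replay in ascending (original) order
    return collected

As discussed above, the oldest task kept can still be partially loaded; the assumption is that it would be the first to drop out of the cache.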

Author

@bosonogi That actually could be pretty straightforward, as a simple Set of uuids with a len() check against max_tasks would tell us when to stop. I'm a bit worried though that the 'get all events' query will be slow to return a cursor as the table grows. I'll do some testing. I think exposing uuid as a column may end up giving the desired perf and keep the query on the simple side.

As an aside, I'm seeing utilization on my PostgreSQL server growing at a rate of about 1% per hour until it flatlines (70% currently) -- I suspect the GIN index becomes troublesome to update at very large table sizes. So we should either think about occasionally truncating the table, or ditching the GIN index. Give it some thought, and let me know if you see similar behavior?

Author

Interesting reading: ORDER BY kills GIN index perf and forces a b-tree index:
https://dba.stackexchange.com/questions/113132/gin-index-not-used-when-adding-order-clause

Owner

Interesting reading indeed. We do order by on the "time" column, so we might get away with it.

I'm not against adding one more column and index for the UUID if it proves to give a significant boost in performance. Note, however, that the UUID field is unfortunately named: it is possible to define any string as the task ID (and we do that in our system so that task IDs are more meaningful). This means you can't limit the column type with UUID-related assumptions.

@johnarnold
Author

@bosonogi Take a look at the latest here, I implemented the uuid column, index, and max_tasks based query. So far so good in my testing.

Note: to test, you may want to drop your events table in PostgreSQL and let it be recreated.

@@ -18,16 +18,40 @@
(
id SERIAL PRIMARY KEY,
time TIMESTAMP NOT NULL,
uuid VARCHAR(36) NOT NULL,
data JSONB NOT NULL,
Author

Could store as TEXT instead of JSONB since we're not using a GIN index... maybe get some more perf by not validating the JSON?

Owner

Dropping the GIN index and the JSONB is not an option for me, as I already depend on those for some data mining that is separate from Flower -- and I expect there will be more of that in the future.

@johnarnold
Author

@bosonogi The latest changes fix up the loading performance. It's suitably fast now. However, I ran into another performance bottleneck -- write performance on the Flower side.

I set up a simple test to spray tasks:

@app.task()
def emit_true():
    return True


class TestPerformance:

    def test_emit_true(self):
        # net_task is the module where emit_true is defined
        for i in range(1, 100000):
            net_task.emit_true.delay()

The result is that Flower basically locks up when I run the test. I haven't dug into the event callback and Tornado enough to understand how it's working yet, but it seems like it's just pegging one thread. Is there a way around this, or are we hitting single-thread perf due to the global connection object?

More digging needed, just thought I'd throw it out there before I call it a night.

@bosonogi
Owner

Database inserts could be queued up from the main thread in a queue.Queue and executed in a separate worker thread. This would avoid blocking the main thread while still eventually filling up the database in the correct order.
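A minimal, untested sketch of that idea; the class name and connection handling are illustrative, and only the _add_event statement is taken from the diff above:

import queue
import threading

_add_event = """INSERT INTO events (time, data) VALUES (%s, %s) ON CONFLICT DO NOTHING"""


class EventWriter:
    """Drains queued events into PostgreSQL on a single worker thread."""

    def __init__(self, conn):
        self._conn = conn
        self._queue = queue.Queue()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def write(self, time, data):
        # Called from the main (Tornado) thread; returns immediately.
        self._queue.put((time, data))

    def _run(self):
        while True:
            time, data = self._queue.get()
            with self._conn.cursor() as cur:
                cur.execute(_add_event, (time, data))
            self._conn.commit()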

@@ -18,16 +22,40 @@
(
id SERIAL PRIMARY KEY,
time TIMESTAMP NOT NULL,
data JSONB NOT NULL,
uuid VARCHAR(36) NOT NULL,
data TEXT NOT NULL,
Owner

@johnarnold Dropping the GIN index and the JSONB is not an option for me, as I already depend on those for some data mining that is separate from Flower -- and I expect there will be more of that in the future.

Author
@johnarnold Feb 14, 2018

@bosonogi Would you need the index if the table schema was expanded? Are there specific columns you need?

JSONB is probably fine (it's about a 20-30% perf hit and a 200% increase in table size), but the GIN index computation is pretty brutal as you get to large event table sizes. I can profile it some more for specifics after I figure out the write queueing.

Owner
@bosonogi Feb 14, 2018

@johnarnold Well, the idea is to be able to access any fields from events in queries. While we don't need access to all of them right now, we do not know which of them we might need in the future.

Inserting in a separate thread should help with the increased computation cost. I've made a pull request to your branch with some quick changes (johnarnold#4) to possibly help things move along -- and to be a bit more involved ;) I hope I will have time to do more than just suggest untested code in the future.

@@ -18,16 +22,40 @@
(
id SERIAL PRIMARY KEY,
time TIMESTAMP NOT NULL,
data JSONB NOT NULL,
uuid VARCHAR(36) NOT NULL,
Owner

@johnarnold 36 characters is too few... I've said in a previous comment that these are not always uuids but can be arbitrary strings -- which can easily be longer than 36 characters. As an example, the current maximum length of a task id in our database is 129 characters.
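One way to accommodate that, sketched here as a suggestion rather than the PR's code: PostgreSQL's TEXT type has no length limit and no performance penalty versus VARCHAR, so it fits arbitrary task id strings. The CREATE TABLE text below is illustrative; only the column set mirrors the diffs above:

_create_events = """CREATE TABLE IF NOT EXISTS events
(
    id SERIAL PRIMARY KEY,
    time TIMESTAMP NOT NULL,
    uuid TEXT NOT NULL,      -- task ids can be arbitrary strings, not just UUIDs
    data JSONB NOT NULL
)"""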

Author

@bosonogi Just letting you know I've been slammed with other work lately, so I haven't gotten back to this, but I will try. For our environment it seems like our volume is stressful to PostgreSQL, so I may write up a very simple Redis provider and contribute it to your PR.

Owner

@johnarnold cool; work situation is similar at my end.

An idea for an alternative approach has been brewing that would sidestep Flower (and its in-memory architecture). As I only use a small subset of what Flower does (starting tasks and looking up task details via its HTTP API), it seems it should be easy to make a Django REST Framework app that does this -- using the event data we already have.
