
[FEAT] Support concurrent updates for Zep users #326

Closed
nicoeiris11 opened this issue Mar 21, 2024 · 12 comments
@nicoeiris11

Is your feature request related to a problem? Please describe.
I'm having a lot of problems IN PROD updating Zep users in async tasks. I run Celery tasks to process data and save it to Users metadata. Different async tasks could potentially update the same user metadata at the same time. I'm aware that Zep implements DB advisory locks in the update as follows:

(screenshot: Zep's advisory-lock code in the user update path)

However, having no control over this behavior from my app makes my data inconsistent and causes many tasks to fail on my side (Zep ApiError).

Describe the solution you'd like
I'd like better handling of locks on the Zep side, implementing:

  • A method to check whether a user is locked
  • Having the update and aupdate methods wait until a lock is released before performing the operation
  • A method to force-release a user lock

Describe alternatives you've considered
Either have the Zep user update operation wait until the lock is released before updating the user, or, as another strategy, provide a method to check whether a user is locked, so the client can wait until the user is unlocked before performing the next update.
Also, a method to force the release of a lock if something goes wrong on the Zep side could be useful.

Additional context
The workaround I currently have in place (which is not fully working) is to do the following each time I want to perform an update:

  1. Check whether user_id is in the Redis cache
  2. If present, wait 20 seconds
  3. If still present, raise an exception
  4. If not present, save user_id in the cache
  5. Run zep.update
  6. Remove user_id from the cache

This is the way I found to handle concurrent updates in my Celery async tasks, but I'm still seeing failures in many cases. Any thoughts? Is there a better approach?
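For reference, the steps above can be sketched as a coarse per-user lock around the update call. This is a minimal illustration, not the actual implementation: `DictCache` is an in-memory stand-in for the real Redis client, and `do_update` stands in for the zep.update call.

```python
import time

class DictCache:
    """In-memory stand-in for Redis. In production this would be a shared
    redis.Redis client, where acquire maps to SET key 1 NX EX <ttl>."""
    def __init__(self):
        self._data = {}

    def acquire(self, key):
        # Atomic "set if absent": returns False if the lock is already held.
        if key in self._data:
            return False
        self._data[key] = 1
        return True

    def release(self, key):
        self._data.pop(key, None)

def update_with_lock(cache, user_id, do_update, wait_seconds=20):
    key = f"zep-user-lock:{user_id}"
    if not cache.acquire(key):          # step 1: is the user locked?
        time.sleep(wait_seconds)        # step 2: wait once
        if not cache.acquire(key):      # step 3: still locked -> give up
            raise RuntimeError(f"user {user_id} is still locked")
    try:
        return do_update()              # step 5: perform the Zep update
    finally:
        cache.release(key)              # step 6: always release the lock
```

Note that step 4 (saving user_id in the cache) is the successful `acquire` itself; a TTL on the Redis key guards against a crashed task holding the lock forever.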

Thank you so much and great work guys!!

@danielchalef
Member

Thanks for raising this. We candidly did not design user metadata handling with high concurrency requirements in mind. It would be helpful to understand how you're using user metadata. What data are you storing in the user object?

@nicoeiris11
Author

Hi @danielchalef , thanks for your quick response.

My data flow is the following:

  • The client requests a new user, attaching a text file associated with it (this file contains 2 sections to be summarized at the same time)
  • My FastAPI service creates the user in Zep with empty metadata and triggers 2 Celery tasks associated with the user (each async task does the heavy processing of summarizing one section)
  • At this point, I queue both Celery tasks from the "main" thread and update the user metadata to save both Celery task IDs
  • Each Celery async task produces the summary of its section and, when it's done, updates the user metadata in Zep with the resulting summary

So in the user metadata I save:

  1. User info
  2. Celery task 1 ID
  3. Celery task 2 ID
  4. Celery task 1 summary result (section 1 of the file)
  5. Celery task 2 summary result (section 2 of the file)
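Concretely, the metadata described above might look something like the following (field names are illustrative; only the shape is implied by the list):

```json
{
  "user_info": {"name": "...", "plan": "..."},
  "task_1_id": "celery-task-uuid-1",
  "task_2_id": "celery-task-uuid-2",
  "task_1_summary": "summary of section 1 of the file",
  "task_2_summary": "summary of section 2 of the file"
}
```

Since each task writes a different key, the conflict is purely at the row/lock level, not a logical one.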

What happens is that sometimes both async tasks, or even a "main"-thread update and one of the async tasks, attempt to update at the same time, and the app breaks with an APIError (produced by the advisory lock in the Postgres DB).

Ideally, one of the concurrent updates should wait for the lock to be released. Or, at the very least, the API should provide a method like zep.user(user_id).is_locked() so the client can avoid the APIError and wait.

@danielchalef
Member

Candidate fix in #329

@danielchalef
Member

danielchalef commented Mar 24, 2024

@nicoeiris11 Zep v0.24.0 includes an experimental approach to locking user metadata that should cope better with high-concurrency updates to the same user record. Please try it out and let me know if this fixes your issue. We'll apply this fix more widely if so.

@nicoeiris11
Author

@danielchalef, thank you so much for the quick fix.

I tested v0.24.0 and the number of failed updates decreased significantly, but I still saw cases where Zep failed after the 3 retries. Is it possible to experiment with a retry policy that uses exponential backoff and more time between attempts?
For an environment with heavy load and multiple users updated concurrently, 200 ms doesn't seem to be enough time for locks to be released. What about starting at 5 seconds with exponential backoff?
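In the meantime, this kind of retry policy can also be applied client-side. Here is a minimal sketch of exponential backoff with jitter; `retriable` is a placeholder for the Zep client's APIError, and the helper is an illustration, not part of any Zep SDK.

```python
import random
import time

def retry_with_backoff(op, retries=5, base_delay=0.2, max_delay=10.0,
                       retriable=(RuntimeError,)):
    """Retry `op`, doubling the delay on each failed attempt.
    In the scenario above, `op` would wrap a zep user update and
    `retriable` would be the client's APIError for a held lock."""
    for attempt in range(retries):
        try:
            return op()
        except retriable:
            if attempt == retries - 1:
                raise  # out of attempts: surface the error
            # delay grows as base, 2*base, 4*base, ... capped at max_delay
            delay = min(base_delay * (2 ** attempt), max_delay)
            time.sleep(delay + random.uniform(0, delay / 2))  # add jitter
```

The jitter matters under load: without it, all the waiting tasks retry at the same instant and collide on the lock again.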

I appreciate your time and dedication to improving the user experience/development of the tool.

@danielchalef
Member

Good point. An exponential backoff may work better here. I'm loath to increase the max backoff time beyond ~10 seconds, preferring the client time out and retry. See #330

@danielchalef
Member

v0.25.0 - Use Exponential Backoff for Metadata Lock Fails is building. Please let me know your thoughts once you've had a chance to check it out.

@nicoeiris11
Author

@danielchalef thank you so much for releasing a new version with the changes so fast.

I'll let you know as soon as I have updates from the new tag testing.

@nicoeiris11
Author

@danielchalef I want to confirm that with v0.25.0 my load test passed 100% ok.

There will always be a request-load threshold beyond which Zep times out due to the number of concurrent DB operations, but in my latest experiments I no longer saw APIErrors caused by locks (only timeouts when I manually forced a very high-load scenario).

I want to thank you for your hard work and dedication to maintaining the repo and addressing developers' issues.
Best regards!

@danielchalef
Member

@nicoeiris11 Great to hear! If you're using Zep's default docker-compose setup, note that the Postgres instance is not tuned for production use. There is, however, plenty of literature online about sizing and tuning Postgres deployments.

@nicoeiris11
Author

@danielchalef Yes, I'm using docker-compose in production, pinned to the stable release tag.
Any specific recommendations/resources on Postgres tuning to get the most out of this Docker service?

@danielchalef
Member

The Postgres website has good guidance on tuning. I've also found this tool useful: https://pgtune.leopard.in.ua/
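As a rough idea of what pgtune-style output looks like, a postgresql.conf fragment might resemble the following. These values are purely illustrative (assuming a machine with roughly 4 GB of RAM and a mixed workload); generate real numbers for your own hardware rather than copying these.

```ini
# Illustrative only — regenerate with pgtune for your machine size.
shared_buffers = 1GB            # ~25% of RAM
effective_cache_size = 3GB      # ~75% of RAM
work_mem = 16MB                 # per sort/hash, per connection
maintenance_work_mem = 256MB
max_connections = 100
```

In a docker-compose setup these can be passed via a mounted config file or `command: postgres -c shared_buffers=1GB ...` overrides.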
