share data between uwsgi workers #1813

Open
dalf opened this issue Jan 28, 2020 · 4 comments
Collaborator

@dalf dalf commented Jan 28, 2020

Why

Currently, data is not shared between uwsgi workers.
This includes:

  • engine statistics: the /stats page is not accurate (see #162 and #199):
    engine.stats = {
    'result_count': 0,
    'search_count': 0,
    'engine_time': 0,
    'engine_time_count': 0,
    'score_count': 0,
    'errors': 0
    }
  • engine error / suspend time: an engine can be suspended in one worker but not in the others:
    'suspend_end_time': 0,
    'continuous_errors': 0,
  • wolframalpha_noapi token: this token is requested by each worker:
    token = {'value': '',
    'last_updated': None}
  • #1812 requires this feature.
  • Embedding searx-checker (#1559) requires a way to make sure it runs only once per (day|hour|whatever).

How?

I don't have an answer to that question. Most probably this issue has already been solved somewhere in a way I haven't thought of.

uwsgi: SharedArea

#415 suggests using SharedArea.

It's an array of bytes, so an abstraction must be built on top of it.

A very simple API can be built:

  • allocate an int, bind this value to this key.
  • allocate an str of xx bytes, bind this value to this key.
  • allocations are possible when the workers start, forbidden later.
  • read value bound to a key.
  • write value bound to a key.

If the structure is the same for all workers, it will work as expected. A checksum of the allocations can be added at the beginning of the structure, so that all workers can make sure they don't corrupt the data. Compared to mmappickle (see below), it would be way faster.
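The API described above could look roughly like the following sketch. It is not searx or uwsgi code: the class name FixedLayout and its methods are hypothetical, a plain bytearray stands in for the SharedArea, and there is no locking, so concurrent writers would still need a lock around write().

```python
import struct
import zlib


class FixedLayout:
    """Hypothetical key -> (format, offset) layout over a shared byte buffer."""

    HEADER = struct.Struct("<I")  # CRC32 of the layout description

    def __init__(self):
        self._fields = {}               # key -> (struct.Struct, offset)
        self._size = self.HEADER.size   # the first bytes hold the checksum
        self._buffer = None

    def allocate_int(self, key):
        self._allocate(key, "<q")

    def allocate_str(self, key, nbytes):
        self._allocate(key, "<%ds" % nbytes)

    def _allocate(self, key, fmt):
        if self._buffer is not None:
            raise RuntimeError("allocations are only allowed at worker start")
        if key in self._fields:
            raise KeyError("already allocated: %s" % key)
        packer = struct.Struct(fmt)
        self._fields[key] = (packer, self._size)
        self._size += packer.size

    def checksum(self):
        description = ";".join(
            "%s:%s@%d" % (key, packer.format, offset)
            for key, (packer, offset) in sorted(self._fields.items())
        )
        # "or 1" keeps 0 free to mean "header not initialized yet"
        return zlib.crc32(description.encode("utf-8")) or 1

    def attach(self, buffer):
        """Bind the layout to the buffer and verify the stored checksum."""
        if len(buffer) < self._size:
            raise ValueError("buffer too small for this layout")
        stored, = self.HEADER.unpack_from(buffer, 0)
        if stored == 0:
            self.HEADER.pack_into(buffer, 0, self.checksum())
        elif stored != self.checksum():
            raise ValueError("layout mismatch between workers")
        self._buffer = buffer

    def write(self, key, value):
        packer, offset = self._fields[key]
        if isinstance(value, str):
            value = value.encode("utf-8")  # truncated if longer than allocated
        packer.pack_into(self._buffer, offset, value)

    def read(self, key):
        packer, offset = self._fields[key]
        value, = packer.unpack_from(self._buffer, offset)
        if isinstance(value, bytes):
            return value.rstrip(b"\x00").decode("utf-8", errors="ignore")
        return value


# Usage: every worker declares the same fields, then attaches to the same bytes.
shared_bytes = bytearray(64)  # stand-in for the real shared area
layout = FixedLayout()
layout.allocate_int("suspend_end_time")
layout.allocate_str("token", 16)
layout.attach(shared_bytes)
layout.write("token", "abc")
```

A worker that declares a different set of fields fails in attach() instead of silently reading the wrong offsets.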

uwsgi: Caching Framework

I've tried the Caching Framework: it works as intended, and allows sharing data between the workers.

Without an abstraction layer, it introduces a hard dependency on uwsgi.

A note for the future (?): I haven't seen a similar feature in ASGI servers.
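One way to avoid the hard dependency is a thin wrapper that uses the uwsgi cache when the uwsgi module is importable and falls back to a per-process dict otherwise. This is only a sketch: the SharedCache class is hypothetical, it assumes a uwsgi cache has been configured (for example with the cache2 option), and values are expected to be bytes.

```python
try:
    import uwsgi  # only importable when the code runs inside uwsgi
except ImportError:
    uwsgi = None


class SharedCache:
    """Hypothetical wrapper: uwsgi cache if available, local dict otherwise."""

    def __init__(self):
        self._local = {}  # fallback storage, NOT shared between processes

    def set(self, key, value):
        if uwsgi is not None:
            # cache_update overwrites an existing key, unlike cache_set
            uwsgi.cache_update(key, value)
        else:
            self._local[key] = value

    def get(self, key, default=None):
        if uwsgi is not None:
            value = uwsgi.cache_get(key)
        else:
            value = self._local.get(key)
        return default if value is None else value


cache = SharedCache()
cache.set("wolframalpha_noapi.token", b"example-token")
```

With the fallback, searx would keep today's behavior (nothing shared) when started outside uwsgi.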

uwsgi: Signal

I've tried uwsgi Signals: it doesn't seem possible to send a signal to all the workers, despite what the documentation says.

Signals could be used to store the data somewhere and then ask all other workers to read the update.

multiprocessing.shared_memory

The Python module multiprocessing.shared_memory allows sharing data between processes.
Unfortunately, it is only available from Python 3.8.
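For illustration, a minimal sketch of a counter in a named shared memory block (Python 3.8+). The helper names create_counter and increment are hypothetical, and the read-modify-write is not atomic, so real code would need a lock shared between the workers.

```python
import struct
from multiprocessing import shared_memory

COUNTER = struct.Struct("<q")  # one signed 64-bit integer


def create_counter():
    """Create a shared block holding a single counter initialized to 0."""
    shm = shared_memory.SharedMemory(create=True, size=COUNTER.size)
    COUNTER.pack_into(shm.buf, 0, 0)
    return shm


def increment(shm_name):
    """Attach to an existing block by name and add 1 (no locking here)."""
    shm = shared_memory.SharedMemory(name=shm_name)
    try:
        value, = COUNTER.unpack_from(shm.buf, 0)
        COUNTER.pack_into(shm.buf, 0, value + 1)
        return value + 1
    finally:
        shm.close()
```

The creating process is responsible for calling close() and unlink() on the block when the data is no longer needed.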

The GitHub project SleepProgger/py_shared_memory extracts this feature, but it does not compile on Python 3.6, only from version 3.7 onward: _PyArg_ParseStackAndKeywords has been renamed, but that's not the only change needed to make it work. Anyway, it would be nice if we could avoid compiling C code.

mmap file

https://docs.python.org/3.0/library/mmap.html

Same comment as the SharedArea.

The "simple" API (as described in the SharedArea section, but applied to mmap) could be a way to solve this issue without an additional server... or perhaps the performance wouldn't be good.
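As a rough idea of what this could look like, here is a sketch of a counter stored in a memory-mapped file and protected by fcntl.flock. The MmapCounter class is hypothetical, the code is POSIX only, and it says nothing about performance under load.

```python
import fcntl
import mmap
import os
import struct

COUNTER = struct.Struct("<q")  # one signed 64-bit integer


class MmapCounter:
    """Hypothetical counter shared through a memory-mapped file."""

    def __init__(self, path):
        self._fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
        if os.fstat(self._fd).st_size < COUNTER.size:
            os.ftruncate(self._fd, COUNTER.size)  # new bytes are zero-filled
        self._map = mmap.mmap(self._fd, COUNTER.size)

    def increment(self, amount=1):
        fcntl.flock(self._fd, fcntl.LOCK_EX)  # exclude other workers
        try:
            value, = COUNTER.unpack_from(self._map, 0)
            COUNTER.pack_into(self._map, 0, value + amount)
            return value + amount
        finally:
            fcntl.flock(self._fd, fcntl.LOCK_UN)

    def read(self):
        value, = COUNTER.unpack_from(self._map, 0)
        return value

    def close(self):
        self._map.close()
        os.close(self._fd)
```

Each worker would open the same path; the file only serves as backing storage for the mapping, and reads do not go through the lock.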

mmappickle

The GitHub project UniNE-CHYN/mmappickle provides a Python dict on top of an mmap.

import timeit
print(timeit.timeit("for u in d: z[u] = u", setup="d = [ 'A' + str(i) for i in range(10) ]; z = {}", number=100))
print(timeit.timeit("for u in d: z[u] = u", setup="from mmappickle.dict import mmapdict; d = [ 'A' + str(i) for i in range(10) ]; z = mmapdict('/tmp/bench_mmap.pickle')", number=100))
print(timeit.timeit("for u in d: z[u]", setup="from mmappickle.dict import mmapdict; d = [ 'A' + str(i) for i in range(10) ]; z = mmapdict('/tmp/bench_mmap.pickle')", number=100))
0.00021737698989454657
21.098201046013855
0.2509144829964498

Damn slow.

multiprocessing.Pipe

If searx creates the processes by itself, multiprocessing.Pipe can be used.
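A minimal sketch of that idea, assuming searx forks its own workers: the child sends a stats update through a pipe and the parent collects it. The function names and the message format are hypothetical, and the "fork" start method is POSIX only.

```python
import multiprocessing


def worker(conn):
    # Child process: report one update to the parent, then exit.
    conn.send({"engine": "example", "errors": 1})
    conn.close()


def collect_once():
    ctx = multiprocessing.get_context("fork")  # POSIX only
    # duplex=False: the first end only receives, the second end only sends
    parent_conn, child_conn = ctx.Pipe(duplex=False)
    process = ctx.Process(target=worker, args=(child_conn,))
    process.start()
    child_conn.close()  # the parent keeps only its receiving end
    update = parent_conn.recv()
    process.join()
    parent_conn.close()
    return update
```

In a real setup the parent would keep one pipe per worker and merge the updates into a single state.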

lmdb

https://lmdb.readthedocs.io/en/release/

LMDB is a tiny database with some excellent properties:

  • Reader/writer transactions: readers don’t block writers, writers don’t block readers. Each environment supports one concurrent write transaction.
  • Read transactions are extremely cheap.
  • Environments may be opened by multiple processes on the same host, making it ideal for working around Python’s GIL.
  • Memory mapped, allowing for zero copy lookup and iteration. This is optionally exposed to Python using the buffer() interface.
  • Maintenance requires no external process or background threads.
  • No application-level caching is required: LMDB fully exploits the operating system’s buffer cache.

From #967

posix_ipc

https://github.com/osvenskan/posix_ipc

Provides locks and shared memory using the POSIX API (so some C code has to be compiled).

external server: Redis

This could be optional: the default configuration wouldn't share anything, but once redis is configured, data is read from and written to it.

I think that's the usual way to solve this issue.

Collaborator

@return42 return42 commented Jan 28, 2020

I doubt that we really need shared state for the main use of searx. Besides stats and "future features", each searx request is stateless and completely isolated. We should not give up this simplicity of searx. Requirements such as these inflate searx disproportionately.

Sorry for nitpicking about suggestions like this, but I would first like to see simple solutions to our most pressing problems. It is a bit OT and not fair on my part, but I would like to remind you of #1785 and dalf/searx-stats2#7 ;)

Collaborator Author

@dalf dalf commented Jan 31, 2020

The purpose here is not to fall into the second-system effect or to add complexity / too many dependencies, but, if possible, to find the right tool to share data between the workers. I've included redis because it's the common solution, but it doesn't fit the searx spirit.

Let's review the items:

  • engine statistics: it's okay if the stats are only more or less accurate; by the end of the day/week/whatever, all uwsgi workers will have handled roughly the same number of requests.
  • engine error / suspend time: there is no open issue about that (?).
  • wolframalpha_noapi: wolframalpha doesn't complain.
  • anti-captcha.com: a file, check the timestamp, done (not efficient since it would use the file system on each request, but it works).
  • embedding searx-checker: same, a file, check the timestamp, done.
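The "a file, check the timestamp, done" approach from the last two items could be sketched as follows. The function name should_run and the stamp file are hypothetical; the lock uses fcntl.flock, so this is POSIX only.

```python
import fcntl
import os
import time


def should_run(path, interval_seconds, now=None):
    """Return True for at most one caller per interval, across processes."""
    now = time.time() if now is None else now
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)  # serialize check-and-update
        stat = os.fstat(fd)
        # An empty file was just created and has never been claimed.
        if stat.st_size > 0 and now - stat.st_mtime < interval_seconds:
            return False
        os.ftruncate(fd, 0)
        os.write(fd, b"1")
        os.utime(path, (now, now))  # record when the task was claimed
        return True
    finally:
        os.close(fd)  # closing the descriptor also releases the lock
```

Each worker would call should_run() before starting the task; only the first caller in each interval gets True.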

The second purpose of this issue is at least to list what would require a global state, and maybe we can decide to do something about it at some point.

For each of the items in the above list, it would have been easier to have a global state: actually, one of the reasons this issue exists is the Python GIL. To make it clearer: if searx had been written in golang, this issue wouldn't exist.

Async code and asyncio can help with that, see #1724 (even if it is recommended to use multiple workers, so...).


Off topic: what is searx from the network point of view?
A server that proxies requests to other servers.

IMHO, searx is weird in this respect: there are multiple HTTP(S) connection pools (one per process), so there is no global throttling of the outgoing request rate.
The same goes for the use of multiple IPs: the rotation happens on each request, whatever the engine.

Is it far-fetched to say that this project https://github.com/unixfox/proxy-ripv6 is related to this issue?

Another example: the a.searx.space server is too slow (Kimsufi.fr KS-1). So if I search for something, I get a timeout from the bing/whatever engine. On the second request, since the connection is already established, it works. But since there are multiple connection pools, I have to click "search" two or three times (until I hit an "initialized" HTTP connection pool). Here, the problem is the server (the CPU performance).

A browser has multiple windows and tabs, but one connection pool when using HTTP/2.

Collaborator

@return42 return42 commented Feb 1, 2020

My practical experience with building networked services is limited. My experience is rather piecemeal, and I try to make an assessment based on that.

What I ask myself is: what are we doing when we share state? Here are my 2 cents...

For each of the items in the above list, it would have been easier to have a global state: actually, one of the reasons this issue exists is the Python GIL.

Is that so? As far as I can tell, it depends on the infrastructure where the searx workers are placed, e.g. for load balancing, and that brings us to the subject at hand.

If we really need global state (you named reasons having a practical value), something like a global state service might be the answer.

Such a global state service would have to be developed with respect for privacy, which is not an easy task. There will always be requirements that are not feasible in terms of privacy. For example, is it okay to store and share a session context globally for a common group of outgoing requests? I won't go into details, but I guess the answer is: "it depends".

If we do so and searx workers share outgoing sessions, the next requirement comes up: outgoing IP management is needed.

To sum up my thoughts: we are talking about searx networks or, to give it a better name, a searx cluster.

Is my conclusion correct / .. thoughts?

Collaborator Author

@dalf dalf commented Feb 3, 2020

If we really need global state (you named reasons having a practical value), something like a global state service might be the answer.

How I "feel" the implementation:

  • a service offers a register function to declare variables / structures.
  • engines and the core register the variables they need.

This makes it possible to review what is shared.
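A sketch of what such a register function might look like. Everything here is hypothetical (the SharedStateRegistry class, the owner/name scheme, the freeze step); it only illustrates how declaring variables up front gives one place to review what is shared, and leaves the actual storage backend open.

```python
class SharedStateRegistry:
    """Hypothetical registry of the variables that workers want to share."""

    def __init__(self):
        self._declared = {}  # "owner.name" -> default value
        self._frozen = False

    def register(self, owner, name, default):
        """Declare a shared variable; only allowed before freeze()."""
        if self._frozen:
            raise RuntimeError("registration is closed")
        key = "%s.%s" % (owner, name)
        if key in self._declared:
            raise ValueError("already registered: %s" % key)
        self._declared[key] = default
        return key

    def freeze(self):
        """Call once all engines and the core have registered their variables."""
        self._frozen = True

    def review(self):
        """List everything that is shared, e.g. for logging at startup."""
        return sorted(self._declared)


registry = SharedStateRegistry()
registry.register("wolframalpha_noapi", "token", "")
registry.register("core", "suspend_end_time", 0)
registry.freeze()
```

The list returned by review() could be logged at startup so an admin can see exactly what leaves the worker process.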

But if searx switches to asyncio code, the global state would be just some Python variables, if and only if one worker is used. So we need to do some measurements / benchmarks of #1724 (asyncio code) or of the updated version ( #1703 (comment) ). If we need more than one worker, then this issue is still open even with asyncio.

Also a note about PR #1724:
if the image proxy is fast enough, morty could be used only when the admin wants to provide an HTML proxy. Not sure if httpx (or aiohttp) could be as fast as valyala/fasthttp.
That may tip the balance.

For example, is it okay to store and share a session context globally for a common group of outgoing requests? I won't go into details, but I guess the answer is: "it depends".

As for the HTTP connections (not sessions), that is already the intent: see #192, even if uwsgi breaks this.

To sum up my thoughts: we are talking about searx networks or, to give it a better name, a searx cluster.

Except for the embedding searx-checker issue (?), yes!
It wasn't clear in my mind when I opened the issue.

Is my conclusion correct / .. thoughts?

Yes.
And actually it is related to the question: will searx use async code?

@dalf dalf added the core label Feb 3, 2020