-
Notifications
You must be signed in to change notification settings - Fork 18
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
scope database sessions to requests instead of threads #214
Conversation
Hello @dannygoldstein! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found: There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2021-03-24 17:51:59 UTC |
I will read in detail tomorrow, but relevant: we do not have threads that run async at the moment, afaik. |
@stefanv Perhaps thread-safe is not the most accurate term for this PR - thread-safe refers to collisions between multiple threads trying to access the same address space. The problem that this PR addresses is somewhat different-- when one single thread is concurrently handling multiple requests, the different requests both share a session, causing the permissions checking infrastructure (which assumes sessions are scoped to requests) to fail. This pr simply enforces that database sessions are scoped to requests instead of threads |
I believe Dima has implemented some async handler methods in F extensions
…On Tue, Mar 23, 2021, 5:39 PM Danny Goldstein ***@***.***> wrote:
@stefanv <https://github.com/stefanv> Perhaps thread-safe is not the most
accurate term for this PR - thread-safe refers to collisions between
multiple threads trying to access the same address space. The problem that
this PR addresses is somewhat different-- when one single thread is
concurrently handling multiple requests, there is currently a bleed through
between the different requests' database sessions.
—
You are receiving this because your review was requested.
Reply to this email directly, view it on GitHub
<#214 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ABXFGTIXAQG4N35UEJIFVMDTFEYCVANCNFSM4ZWBBKZQ>
.
|
The offsets handler is async, as is the finding chart generator |
@dannygoldstein This is perfect. The only thing I don't follow is where you set the HTTP header. Do you mind point that out? |
Second question: why do we need a context variable, instead of an instance UUID? |
@dannygoldstein @stefanv It looks like it's being set by nginx here https://github.com/cesium-ml/baselayer/pull/214/files#diff-374ca41bad5bb33fc44ec617aef88cd102525dfe18eb1c89a75cea68e8f83c85R56 , but not all clients provide that header, right? How do we handle those cases? |
That's just the proxy forwarding the header, isn't it? |
Right, so it looks like this depends on whether the client provides that header in the request? |
nginx can set a header to UUID, if desired, though |
$request_id is an embedded variable that is generated by NGINX for each
incoming request -
http://nginx.org/en/docs/http/ngx_http_core_module.html - grep for
$request_id (no permalink).
…On Wed, Mar 24, 2021 at 10:25 AM Ari Crellin-Quick ***@***.***> wrote:
Right, so it looks like this depends on whether the client provides that
header in the request?
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#214 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAVEFYFFZMCQHGVIAJJPZUTTFIN7XANCNFSM4ZWBBKZQ>
.
--
Danny Goldstein
http://astro.caltech.edu/~danny
|
ah, perfect |
@dannygoldstein but where does the |
@acrellin I've now sidestepped the issue altogether and simply assigned the request_id inside of Tornado instead of forwarding the value from nginx. |
@dannygoldstein What I'm trying to understand is why we need the contexts, and not simply set an attribute on the request instance. |
Can you show an example? |
Each request instance handles a specific HTTP connection, if I'm not mistaken. And you cannot have multiple threads that access the database within one request? |
@stefanv The database session is initialized at the baselayer/app/models.py module level. What I think you are suggesting is to do something like
Problem with this is we would have to go and refactor hundreds of lines of handler code to use self.session instead of DBSession. We'd also have to create a separate sessionfactory for sessions that are used outside of request handlers (e.g., in the IPython terminal). It would be a huge change. |
We'd also have to refactor the event handlers, test suite, fixtures, etc. It would end being thousands of lines of code. |
This way we have a single session factory, consistent scoping both in and out of request handlers, and no need to refactor any other code in the entire code base. |
The reason is that DBSession is initialized at the top level of app/models.py, and can't access any variables local to handler methods in its |
@profjsb "but it's only 6 lines of code"! 😂 Turns out these 6 (now 5) lines of code are... tricky. I spent a lot of time trying to figure out exactly how Tornado navigates contexts. Take a look at this Tornado issue, e.g.: (This patch was introduced after the 6.0.4 that we depend on.) Long and short of it, I think the reason any of this works is because a) Tornado 5 and up builds on asyncio and b) Tornado now has some context awareness itself. What is not clear to me yet is exactly when contexts switch out. Ideally, we'd want a context switch per request handler, but we may get now get contexts depending on... well, I don't know and I need help here. I cannot find any Tornado documentation that discusses contexts. The asyncio documentation states:
The Task in this case would be a method of our handler, I suspect. What I would like to test next is the following:
|
@stefanv https://groups.google.com/g/python-tornado/c/8mzGNnhUxtU?pli=1
from Ben Darnell:
"It will work automatically for native (`async def`) coroutines"
|
@dannygoldstein Right, but what does that mean? import contextvars
import uuid
ctx = contextvars.ContextVar('request_id', default=None)
class MyHandler:
def __init__(self, name):
self._name = name
ctx.set(uuid.uuid4().hex)
def get(self):
print(self._name, ' -> ', ctx.get())
print('Default context:', ctx.get())
print('Initializing handler A')
a = MyHandler('a')
print('Global context value:', ctx.get())
print('Initializing handler B')
b = MyHandler('b')
print('Global context value:', ctx.get())
print('Executing a.get')
a.get()
print('Executing b.get')
b.get()
|
@stefanv I think the issue you linked do above is only relevant to coroutines that are created using the |
Yes, not sure exactly how those differ other than https://www.tornadoweb.org/en/stable/guide/coroutines.html#native-vs-decorated-coroutines |
@stefanv I think there is only one "context" in your example (the global context), whereas in our situations there would be 2 separate contexts, one for each request -- I think a closer analog to what would happen in tornado would be: import contextvars
import asyncio
import uuid
ctx = contextvars.ContextVar('request_id', default=None)
class MyHandler:
def __init__(self, name):
self._name = name
ctx.set(uuid.uuid4().hex)
def get(self):
print(self._name, ' -> ', ctx.get())
async def create_handler_and_set_context(name):
handler = MyHandler(name)
print(f'Initializing handler {name}')
print('Global context value:', ctx.get())
print(f'Executing {name}.get')
handler.get()
async def main():
await create_handler_and_set_context('a')
await create_handler_and_set_context('b')
asyncio.run(main())
print('Global context value:', ctx.get())
|
@dannygoldstein I was wondering more about what would happen to all the "standard" calls (i.e., non-async). I presume we'd prefer for them to also get their own context? I think we can run an experiment with real Tornado to see whether what you're saying above holds (we can use get handlers that sleep for a while). |
@stefanv Want to give that a shot? I tried implementing with import contextvars
import asyncio
import uuid
import time
ctx = contextvars.ContextVar('request_id', default=None)
class MyHandler:
def __init__(self, name):
self._name = name
ctx.set(uuid.uuid4().hex)
def get(self):
print(self._name, ' -> ', ctx.get())
def _body(name):
handler = MyHandler(name)
print(f'Initializing handler {name}')
print('Global context value:', ctx.get())
print(f'Executing {name}.get')
handler.get()
async def create_handler_and_set_context(name):
_body(name)
time.sleep(3)
def blocking_create_handler_and_set_context(name):
_body(name)
time.sleep(1.5)
async def main():
asyncio.create_task(create_handler_and_set_context('async a'))
asyncio.create_task(create_handler_and_set_context('async b'))
blocking_create_handler_and_set_context('block c')
blocking_create_handler_and_set_context('block d')
asyncio.create_task(create_handler_and_set_context('async e'))
asyncio.run(main())
print('Global context value:', ctx.get())
|
From tornado docs: https://www.tornadoweb.org/en/stable/index.html#asyncio-integration
|
@stefanv here is a tornado version, from tornado import web, ioloop
import time
import numpy as np
import contextvars
import asyncio
import uuid
request_id = contextvars.ContextVar('request_id', default=None)
class BaseHandler(web.RequestHandler):
def prepare(self):
request_id.set(uuid.uuid4().hex)
class BlockingHandler(BaseHandler):
def get(self):
time.sleep(np.random.rand())
self.write(request_id.get())
class AsyncHandler(BaseHandler):
async def get(self):
await asyncio.sleep(np.random.rand())
self.write(request_id.get())
def make_app():
return web.Application([
(r"/blocking", BlockingHandler),
(r'/async', AsyncHandler)
])
if __name__ == "__main__":
app = make_app()
app.listen(8888)
ioloop.IOLoop.current().start() I sent 50 requests to this at once using aiohttp (49 to the async handler and 1 to the blocking handler), the whole thing finished in ~1s (indicating that the async stuff on the server worked). I printed the request_ids returned by the server on the client side and verified that all 50 were unique, so I think this works fine in tornado |
Here is my client code # Send 50 requests to the server at once, 49 async, 1 blocking, then
# verify that they all had separate request contexts on the server
import asyncio
import aiohttp
async def run(url_list):
tasks = []
async with aiohttp.ClientSession() as session:
for url in url_list:
task = asyncio.ensure_future(session.get(url))
tasks.append(task)
responses = asyncio.gather(*tasks)
await responses
return responses
async def main():
url_list = ['http://localhost:8888/async'] * 50
# send one blocking request
url_list[25] = url_list[25].replace('async', 'blocking')
task = run(url_list)
future = await task
context_ids = set()
for r in future.result():
context_ids.add(await r.text())
assert len(context_ids) == 50
print(context_ids)
asyncio.run(main())
|
We have two requirements:
Danny confirmed (1) above, and I've now also confirmed (2), so this is good to go. |
Thanks, @dannygoldstein! |
A ContextVar is now used to determine when a new database session should be created. The ContextVar `session_context_id` is set upon request instantiation, and remains constant during its execution (i.e., while running a method like `async get`, `post`, etc.).
A ContextVar is now used to determine when a new database session should be created. The ContextVar `session_context_id` is set upon request instantiation, and remains constant during its execution (i.e., while running a method like `async get`, `post`, etc.).
Prerequisite reading: https://docs.sqlalchemy.org/en/14/orm/session_basics.html#session-faq-whentocreate
At the moment, we maintain a single sqlalchemy database session per thread and use this session for all requests handled by the thread. Said another way, the scope of our sessions currently is the entire thread, rather than the request. This model is unsafe. Consider the following scenario:
This breaks the permission checking framework, which assumes that all of the objects in the current session are in the scope of the current request (rather than the current thread).
This is a known issue with SQLalchemy and tornado. From the Sqlalchemy docs:
https://docs.sqlalchemy.org/en/14/orm/contextual.html#using-thread-local-scope-with-web-applications
Since this is not the case for Tornado, we need to explicitly specify the scope of our sessions:
https://docs.sqlalchemy.org/en/14/orm/contextual.html#using-custom-created-scopes
This PR implements a get_request_id function as a
contextvar
:https://docs.python.org/3/library/contextvars.html
The context var keeps track of the current request ID, which I pass to the handlers from NGINX, following this guide:
https://quanttype.net/posts/2020-02-05-request-id-logging.html
This ensures that each request context maintains its own database session, scoped to the request.
Related to #213.