Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

scope database sessions to requests instead of threads #214

Merged
merged 14 commits into from
Mar 25, 2021

Conversation

dannygoldstein
Copy link
Contributor

@dannygoldstein dannygoldstein commented Mar 23, 2021

Prerequisite reading: https://docs.sqlalchemy.org/en/14/orm/session_basics.html#session-faq-whentocreate

At the moment, we maintain a single sqlalchemy database session per thread and use this session for all requests handled by the thread. Said another way, the scope of our sessions currently is the entire thread, rather than the request. This model is unsafe. Consider the following scenario:

  1. Web worker 1 receives a request that routes to an asynchronous handler. The handler launches into a non-blocking coroutine.
  2. Web worker 1, free to handle new requests, receives another request that is routed to a blocking handler.
  3. Because both of these requests are being processed simultaneously in the same thread, there is a critical information leak: all of the objects brought into the thread's database session by request 1 are mixed with the objects from request 2

This breaks the permission checking framework, which assumes that all of the objects in the current session are in the scope of the current request (rather than the current thread).

This is a known issue with SQLalchemy and tornado. From the Sqlalchemy docs:

https://docs.sqlalchemy.org/en/14/orm/contextual.html#using-thread-local-scope-with-web-applications

As it turns out, most Python web frameworks, with notable exceptions such as the asynchronous frameworks Twisted and Tornado, use threads in a simple way, such that a particular web request is received, processed, and completed within the scope of a single worker thread.

Since this is not the case for Tornado, we need to explicitly specify the scope of our sessions:

https://docs.sqlalchemy.org/en/14/orm/contextual.html#using-custom-created-scopes

This PR implements a get_request_id function as a contextvar:

https://docs.python.org/3/library/contextvars.html

This module provides APIs to manage, store, and access context-local state. The ContextVar class is used to declare and work with Context Variables. The copy_context() function and the Context class should be used to manage the current context in asynchronous frameworks.

Context managers that have state should use Context Variables instead of threading.local() to prevent their state from bleeding to other code unexpectedly, when used in concurrent code.

The context var keeps track of the current request ID, which I pass to the handlers from NGINX, following this guide:

https://quanttype.net/posts/2020-02-05-request-id-logging.html

This ensures that each request context maintains its own database session, scoped to the request.

Related to #213.

@pep8speaks
Copy link

pep8speaks commented Mar 23, 2021

Hello @dannygoldstein! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-03-24 17:51:59 UTC

@stefanv
Copy link
Contributor

stefanv commented Mar 24, 2021

I will read in detail tomorrow, but relevant: we do not have threads that run async at the moment, afaik.

@dannygoldstein
Copy link
Contributor Author

dannygoldstein commented Mar 24, 2021

@stefanv Perhaps thread-safe is not the most accurate term for this PR - thread-safe refers to collisions between multiple threads trying to access the same address space. The problem that this PR addresses is somewhat different-- when one single thread is concurrently handling multiple requests, the different requests both share a session, causing the permissions checking infrastructure (which assumes sessions are scoped to requests) to fail. This pr simply enforces that database sessions are scoped to requests instead of threads

@dannygoldstein dannygoldstein changed the title make database sessions thread-safe scope database sessions to requests instead of threads Mar 24, 2021
@acrellin
Copy link
Member

acrellin commented Mar 24, 2021 via email

@dannygoldstein
Copy link
Contributor Author

The offsets handler is async, as is the finding chart generator

@stefanv
Copy link
Contributor

stefanv commented Mar 24, 2021

@dannygoldstein This is perfect. The only thing I don't follow is where you set the HTTP header. Do you mind point that out?

@stefanv
Copy link
Contributor

stefanv commented Mar 24, 2021

Second question: why do we need a context variable, instead of an instance UUID?

@acrellin
Copy link
Member

acrellin commented Mar 24, 2021

@dannygoldstein @stefanv It looks like it's being set by nginx here https://github.com/cesium-ml/baselayer/pull/214/files#diff-374ca41bad5bb33fc44ec617aef88cd102525dfe18eb1c89a75cea68e8f83c85R56 , but not all clients provide that header, right? How do we handle those cases?

@stefanv
Copy link
Contributor

stefanv commented Mar 24, 2021

That's just the proxy forwarding the header, isn't it?

@acrellin
Copy link
Member

Right, so it looks like this depends on whether the client provides that header in the request?

@acrellin
Copy link
Member

nginx can set a header to UUID, if desired, though

@dannygoldstein
Copy link
Contributor Author

dannygoldstein commented Mar 24, 2021 via email

@acrellin
Copy link
Member

ah, perfect

@acrellin
Copy link
Member

@dannygoldstein but where does the "request-id" header get set? Does the X-Request-ID header get set with that key?

@dannygoldstein
Copy link
Contributor Author

@acrellin I've now sidestepped the issue altogether and simply assigned the request_id inside of Tornado instead of forwarding the value from nginx.

@stefanv
Copy link
Contributor

stefanv commented Mar 24, 2021

@dannygoldstein What I'm trying to understand is why we need the contexts, and not simply set an attribute on the request instance.

@dannygoldstein
Copy link
Contributor Author

@dannygoldstein What I'm trying to understand is why we need the contexts, and not simply set an attribute on the request instance.

Can you show an example?

@stefanv
Copy link
Contributor

stefanv commented Mar 24, 2021

Each request instance handles a specific HTTP connection, if I'm not mistaken. And you cannot have multiple threads that access the database within one request?

@dannygoldstein
Copy link
Contributor Author

@stefanv The database session is initialized at the baselayer/app/models.py module level. What I think you are suggesting is to do something like

class BaseHandler(PSAHandler):

    def prepare(self):
        self.session = scoped_session(sessionfactory())

    def get(self):
        self.session.query(...)

    def on_finish(self):
        self.session.remove()

Problem with this is we would have to go and refactor hundreds of lines of handler code to use self.session instead of DBSession. We'd also have to create a separate sessionfactory for sessions that are used outside of request handlers (e.g., in the IPython terminal). It would be a huge change.

@dannygoldstein
Copy link
Contributor Author

We'd also have to refactor the event handlers, test suite, fixtures, etc. It would end being thousands of lines of code.

@dannygoldstein
Copy link
Contributor Author

This way we have a single session factory, consistent scoping both in and out of request handlers, and no need to refactor any other code in the entire code base.

@dannygoldstein
Copy link
Contributor Author

@dannygoldstein What I'm trying to understand is why we need the contexts, and not simply set an attribute on the request instance.

The reason is that DBSession is initialized at the top level of app/models.py, and can't access any variables local to handler methods in its scopefunc. That's what the contextvar allows us to get around.

@stefanv
Copy link
Contributor

stefanv commented Mar 24, 2021

@profjsb "but it's only 6 lines of code"! 😂

Turns out these 6 (now 5) lines of code are... tricky. I spent a lot of time trying to figure out exactly how Tornado navigates contexts. Take a look at this Tornado issue, e.g.:

tornadoweb/tornado#2938

(This patch was introduced after the 6.0.4 that we depend on.)

Long and short of it, I think the reason any of this works is because a) Tornado 5 and up builds on asyncio and b) Tornado now has some context awareness itself.

What is not clear to me yet is exactly when contexts switch out. Ideally, we'd want a context switch per request handler, but we may get now get contexts depending on... well, I don't know and I need help here. I cannot find any Tornado documentation that discusses contexts.

The asyncio documentation states:

When a Task is created it copies the current context and later runs its coroutine in the copied context.

The Task in this case would be a method of our handler, I suspect.

What I would like to test next is the following:

  1. Instantiate two handlers, a and b
  2. Run a.prepare, then run b.prepare
  3. Execute a method of a then b. Verify whether those methods have access to different contexts.

@dannygoldstein
Copy link
Contributor Author

dannygoldstein commented Mar 24, 2021 via email

@stefanv
Copy link
Contributor

stefanv commented Mar 24, 2021

@dannygoldstein Right, but what does that mean?

import contextvars
import uuid


ctx = contextvars.ContextVar('request_id', default=None)

class MyHandler:
    def __init__(self, name):
        self._name = name
        ctx.set(uuid.uuid4().hex)

    def get(self):
        print(self._name, ' -> ', ctx.get())


print('Default context:', ctx.get())
print('Initializing handler A')
a = MyHandler('a')
print('Global context value:', ctx.get())
print('Initializing handler B')
b = MyHandler('b')
print('Global context value:', ctx.get())
print('Executing a.get')
a.get()
print('Executing b.get')
b.get()
Default context: None
Initializing handler A
Global context value: d9890d36505049c0b44160aa7d0f40e6
Initializing handler B
Global context value: 3071db7f53b64203a51e2795ee0f2bc0
Executing a.get
a  ->  3071db7f53b64203a51e2795ee0f2bc0
Executing b.get
b  ->  3071db7f53b64203a51e2795ee0f2bc0

@dannygoldstein
Copy link
Contributor Author

@stefanv I think the issue you linked do above is only relevant to coroutines that are created using the tornado.gen.coroutine module and not the native python ones that are created using async def. We have only been using the latter so far in SP and F

@stefanv
Copy link
Contributor

stefanv commented Mar 24, 2021

@stefanv I think the issue you linked do above is only relevant to coroutines that are created using the tornado.gen.coroutine module and not the native python ones that are created using async def. We have only been using the latter so far in SP and F

Yes, not sure exactly how those differ other than https://www.tornadoweb.org/en/stable/guide/coroutines.html#native-vs-decorated-coroutines

@dannygoldstein
Copy link
Contributor Author

dannygoldstein commented Mar 25, 2021

@stefanv I think there is only one "context" in your example (the global context), whereas in our situations there would be 2 separate contexts, one for each request -- I think a closer analog to what would happen in tornado would be:

import contextvars
import asyncio
import uuid


ctx = contextvars.ContextVar('request_id', default=None)

class MyHandler:
    def __init__(self, name):
        self._name = name
        ctx.set(uuid.uuid4().hex)

    def get(self):
        print(self._name, ' -> ', ctx.get())


async def create_handler_and_set_context(name):
    handler = MyHandler(name)
    print(f'Initializing handler {name}')
    print('Global context value:', ctx.get())
    print(f'Executing {name}.get')
    handler.get()

async def main():
    await create_handler_and_set_context('a')
    await create_handler_and_set_context('b')

asyncio.run(main())
print('Global context value:', ctx.get())
(skyportal2) Dannys-MBP:baselayer danny$ python resp.py
Initializing handler a
Global context value: 5e451d05d07548b68c45196197e7b721
Executing a.get
a  ->  5e451d05d07548b68c45196197e7b721
Initializing handler b
Global context value: f322233198eb426da64ddfb9fbc6b99e
Executing b.get
b  ->  f322233198eb426da64ddfb9fbc6b99e
Global context value: None

@stefanv
Copy link
Contributor

stefanv commented Mar 25, 2021

@dannygoldstein I was wondering more about what would happen to all the "standard" calls (i.e., non-async). I presume we'd prefer for them to also get their own context?

I think we can run an experiment with real Tornado to see whether what you're saying above holds (we can use get handlers that sleep for a while).

@dannygoldstein
Copy link
Contributor Author

@stefanv Want to give that a shot? I tried implementing with asyncio and I think it would look something like this:

import contextvars
import asyncio
import uuid
import time


ctx = contextvars.ContextVar('request_id', default=None)

class MyHandler:
    def __init__(self, name):
        self._name = name
        ctx.set(uuid.uuid4().hex)

    def get(self):
        print(self._name, ' -> ', ctx.get())


def _body(name):
    handler = MyHandler(name)
    print(f'Initializing handler {name}')
    print('Global context value:', ctx.get())
    print(f'Executing {name}.get')
    handler.get()

async def create_handler_and_set_context(name):
    _body(name)
    time.sleep(3)    

def blocking_create_handler_and_set_context(name):
    _body(name)
    time.sleep(1.5)
    
async def main():
    asyncio.create_task(create_handler_and_set_context('async a'))
    asyncio.create_task(create_handler_and_set_context('async b'))
    blocking_create_handler_and_set_context('block c')
    blocking_create_handler_and_set_context('block d')
    asyncio.create_task(create_handler_and_set_context('async e'))
    

asyncio.run(main())
print('Global context value:', ctx.get())
(skyportal2) Dannys-MBP:baselayer danny$ python resp.py
Initializing handler block c
Global context value: c1ac0cbc2db24259809a9c45411e72ea
Executing block c.get
block c  ->  c1ac0cbc2db24259809a9c45411e72ea
Initializing handler block d
Global context value: 40231eea993d4060995e7131a8971e7e
Executing block d.get
block d  ->  40231eea993d4060995e7131a8971e7e
Initializing handler async a
Global context value: 254b3adfa6a74c04b9fd4c1ca2873007
Executing async a.get
async a  ->  254b3adfa6a74c04b9fd4c1ca2873007
Initializing handler async b
Global context value: 7077808e0fd84fddae4781171489c712
Executing async b.get
async b  ->  7077808e0fd84fddae4781171489c712
Initializing handler async e
Global context value: 9ec747e303324365b22a6692828a4dc7
Executing async e.get
async e  ->  9ec747e303324365b22a6692828a4dc7
Global context value: None

@dannygoldstein
Copy link
Contributor Author

dannygoldstein commented Mar 25, 2021

From tornado docs:

https://www.tornadoweb.org/en/stable/index.html#asyncio-integration

asyncio Integration
Tornado is integrated with the standard library asyncio module and shares the same event loop (by default since Tornado 5.0). In general, libraries designed for use with asyncio can be mixed freely with Tornado.

@dannygoldstein
Copy link
Contributor Author

@stefanv here is a tornado version,

from tornado import web, ioloop
import time
import numpy as np
import contextvars
import asyncio
import uuid

request_id = contextvars.ContextVar('request_id', default=None)

class BaseHandler(web.RequestHandler):

    def prepare(self):
        request_id.set(uuid.uuid4().hex)

class BlockingHandler(BaseHandler):

    def get(self):
        time.sleep(np.random.rand())
        self.write(request_id.get())


class AsyncHandler(BaseHandler):

    async def get(self):
        await asyncio.sleep(np.random.rand())
        self.write(request_id.get())


def make_app():
    return web.Application([
        (r"/blocking", BlockingHandler),
        (r'/async', AsyncHandler)
    ])

if __name__ == "__main__":
    app = make_app()
    app.listen(8888)
    ioloop.IOLoop.current().start()

I sent 50 requests to this at once using aiohttp (49 to the async handler and 1 to the blocking handler), the whole thing finished in ~1s (indicating that the async stuff on the server worked). I printed the request_ids returned by the server on the client side and verified that all 50 were unique, so I think this works fine in tornado

@dannygoldstein
Copy link
Contributor Author

Here is my client code

# Send 50 requests to the server at once, 49 async, 1 blocking, then 
# verify that they all had separate request contexts on the server

import asyncio
import aiohttp

async def run(url_list):
    tasks = []
    async with aiohttp.ClientSession() as session:
        for url in url_list:
            task = asyncio.ensure_future(session.get(url))
            tasks.append(task)
        responses = asyncio.gather(*tasks)
        await responses
    return responses

async def main():
    url_list = ['http://localhost:8888/async'] * 50

    # send one blocking request
    url_list[25] = url_list[25].replace('async', 'blocking')

    task = run(url_list)
    future = await task
    context_ids = set()
    for r in future.result():
        context_ids.add(await r.text())
    assert len(context_ids) == 50
    print(context_ids)


asyncio.run(main())
(skyportal2) Dannys-MBP:baselayer danny$ time python client.py
{'27fc9539924043adac41c85afb533a2b', 'a867b251fdc54fd3a2de279bb3d66443', 'e22a3d6eb0124db89c5fbbe8f2612615', '5573d8ecd18a48c7bd448f6411291c9f', '64e0fd3daeb14a8a9a14a0e2ea10ce4a', 'e39841b3423d4d96bc6438739ae06e0f', '2262acdc09934ba3b270dd5abbfcb255', 'fbc6a204c3864c98badf08bfd0e59096', '130337ac1c214ba6a958d175f13d5d75', '48b0d98dbd664b35a629424dd2f15a7a', 'aa19e5cf270e4d7982d47657f2993173', '02dfc23ac8664f88b3a2a416637a63a9', '33964eb38b464d159257af4038fbec9d', 'b1bfc42e807f43a0be7c1f528f625818', 'e612464b672241509c8dd9fec34508cc', '2e3b3fb708b14b8ca808e25237a01c45', 'cc8b2750def14b13bda52159617d5faf', 'db3ed5124de745ccaf1cc54ee486531f', '4b5c370d19ad44b7b7bf76d223ef114b', 'a188bec587ed4b5ead8eb42157cc6170', '4a9b479144f14bef943d4940885196a8', '7c33d0fcbcb743138defe5b3b0ad9edc', 'dc939d777581426797cd683d7e8cf0b1', 'e97a2646c08342f98c3e7ad64e7e6777', 'fa6e0d303bbc4482b8fd8831a8f5e5ea', 'a915abefe4ae41e4ad81415cda0a893e', 'c7badc7ed9914767b40d7968de37ec3f', '5d8295d2a7fc402f8720cfbe3f8d20b7', 'b8d1733b1c9547e9ae56f3cca2213d30', '9678c52906984df19760de07c7a15732', 'e0e9d114959f469299747c0d011f05e3', '1defaf9261094c2ba1a31439177ad23c', '94177c00f027487eae2f45179e790f9e', '450d3a461a544ffb9d2fbb3a652aec9e', '906bf7ceaa6f43b88519879d9f0633b5', '3eb41aa7d9c74223934db7a763ff2023', '38a72829dd9145439aa3f43b37bc013d', 'b3f8d359601147b78e81e3d0393c5c20', '9beecbe7b8f34ae79349f61a4fc0f0ff', '5821fd67fcd9410d95bc8a0a30c1474c', '6fb1b16dd32a4dfd8a56ce49f8898b3d', 'be953628b8f446bcb7d892c1fa42f01e', '734fa054f11942e2937ee42f6ad27d7b', 'f89ea2e8df944933855dc39697e4bc99', 'f214da61a1814b2fb0c0ac7ddac1326e', '5f2f9dcb507b4262b557e1fdf3b6270a', '7b6286c19a444f83add35de132cbbc65', '9b41cfe14516455bb3cd715145308649', 'cd45200564a2426e8b0e4aeafbd6a524', '39cd38fe12ee41459f44bce44474ee63'}

real	0m1.650s
user	0m0.223s
sys	0m0.049s

@stefanv
Copy link
Contributor

stefanv commented Mar 25, 2021

We have two requirements:

  1. Every new request must execute in its own context
  2. Every async def block must execute in the same context

Danny confirmed (1) above, and I've now also confirmed (2), so this is good to go.

@stefanv stefanv merged commit 960d01f into cesium-ml:master Mar 25, 2021
@stefanv
Copy link
Contributor

stefanv commented Mar 25, 2021

Thanks, @dannygoldstein!

acrellin pushed a commit to acrellin/baselayer that referenced this pull request Apr 2, 2021
A ContextVar is now used to determine when a new database session should be created.
The ContextVar `session_context_id` is set upon request instantiation, and remains constant during its execution (i.e., while running a method like `async get`, `post`, etc.).
profjsb pushed a commit to profjsb/baselayer that referenced this pull request Sep 27, 2022
A ContextVar is now used to determine when a new database session should be created.
The ContextVar `session_context_id` is set upon request instantiation, and remains constant during its execution (i.e., while running a method like `async get`, `post`, etc.).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants