scope database sessions to requests instead of threads #214

dannygoldstein · 2021-03-23T21:50:10Z

Prerequisite reading: https://docs.sqlalchemy.org/en/14/orm/session_basics.html#session-faq-whentocreate

At the moment, we maintain a single sqlalchemy database session per thread and use this session for all requests handled by the thread. Said another way, the scope of our sessions currently is the entire thread, rather than the request. This model is unsafe. Consider the following scenario:

Web worker 1 receives a request that routes to an asynchronous handler. The handler launches into a non-blocking coroutine.
Web worker 1, free to handle new requests, receives another request that is routed to a blocking handler.
Because both of these requests are being processed simultaneously in the same thread, there is a critical information leak: all of the objects brought into the thread's database session by request 1 are mixed with the objects from request 2

This breaks the permission checking framework, which assumes that all of the objects in the current session are in the scope of the current request (rather than the current thread).

This is a known issue with SQLalchemy and tornado. From the Sqlalchemy docs:

https://docs.sqlalchemy.org/en/14/orm/contextual.html#using-thread-local-scope-with-web-applications

As it turns out, most Python web frameworks, with notable exceptions such as the asynchronous frameworks Twisted and Tornado, use threads in a simple way, such that a particular web request is received, processed, and completed within the scope of a single worker thread.

Since this is not the case for Tornado, we need to explicitly specify the scope of our sessions:

https://docs.sqlalchemy.org/en/14/orm/contextual.html#using-custom-created-scopes

This PR implements a get_request_id function as a contextvar:

https://docs.python.org/3/library/contextvars.html

This module provides APIs to manage, store, and access context-local state. The ContextVar class is used to declare and work with Context Variables. The copy_context() function and the Context class should be used to manage the current context in asynchronous frameworks.

Context managers that have state should use Context Variables instead of threading.local() to prevent their state from bleeding to other code unexpectedly, when used in concurrent code.

The context var keeps track of the current request ID, which I pass to the handlers from NGINX, following this guide:

https://quanttype.net/posts/2020-02-05-request-id-logging.html

This ensures that each request context maintains its own database session, scoped to the request.

Related to #213.

cesium-ml#212)

pep8speaks · 2021-03-23T21:50:13Z

Hello @dannygoldstein! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-03-24 17:51:59 UTC

stefanv · 2021-03-24T00:30:30Z

I will read in detail tomorrow, but relevant: we do not have threads that run async at the moment, afaik.

dannygoldstein · 2021-03-24T00:38:52Z

@stefanv Perhaps thread-safe is not the most accurate term for this PR - thread-safe refers to collisions between multiple threads trying to access the same address space. The problem that this PR addresses is somewhat different-- when one single thread is concurrently handling multiple requests, the different requests both share a session, causing the permissions checking infrastructure (which assumes sessions are scoped to requests) to fail. This pr simply enforces that database sessions are scoped to requests instead of threads

acrellin · 2021-03-24T00:42:32Z

I believe Dima has implemented some async handler methods in F extensions

…

On Tue, Mar 23, 2021, 5:39 PM Danny Goldstein ***@***.***> wrote: @stefanv <https://github.com/stefanv> Perhaps thread-safe is not the most accurate term for this PR - thread-safe refers to collisions between multiple threads trying to access the same address space. The problem that this PR addresses is somewhat different-- when one single thread is concurrently handling multiple requests, there is currently a bleed through between the different requests' database sessions. — You are receiving this because your review was requested. Reply to this email directly, view it on GitHub <#214 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ABXFGTIXAQG4N35UEJIFVMDTFEYCVANCNFSM4ZWBBKZQ> .

dannygoldstein · 2021-03-24T00:43:58Z

The offsets handler is async, as is the finding chart generator

stefanv · 2021-03-24T17:00:51Z

@dannygoldstein This is perfect. The only thing I don't follow is where you set the HTTP header. Do you mind point that out?

stefanv · 2021-03-24T17:14:15Z

Second question: why do we need a context variable, instead of an instance UUID?

acrellin · 2021-03-24T17:15:34Z

@dannygoldstein @stefanv It looks like it's being set by nginx here https://github.com/cesium-ml/baselayer/pull/214/files#diff-374ca41bad5bb33fc44ec617aef88cd102525dfe18eb1c89a75cea68e8f83c85R56 , but not all clients provide that header, right? How do we handle those cases?

stefanv · 2021-03-24T17:24:12Z

That's just the proxy forwarding the header, isn't it?

acrellin · 2021-03-24T17:24:55Z

Right, so it looks like this depends on whether the client provides that header in the request?

acrellin · 2021-03-24T17:26:28Z

nginx can set a header to UUID, if desired, though

dannygoldstein · 2021-03-24T17:26:45Z

$request_id is an embedded variable that is generated by NGINX for each incoming request - http://nginx.org/en/docs/http/ngx_http_core_module.html - grep for $request_id (no permalink).

…

On Wed, Mar 24, 2021 at 10:25 AM Ari Crellin-Quick ***@***.***> wrote: Right, so it looks like this depends on whether the client provides that header in the request? — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#214 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAVEFYFFZMCQHGVIAJJPZUTTFIN7XANCNFSM4ZWBBKZQ> .

-- Danny Goldstein http://astro.caltech.edu/~danny

acrellin · 2021-03-24T17:29:20Z

ah, perfect

acrellin · 2021-03-24T17:31:30Z

@dannygoldstein but where does the "request-id" header get set? Does the X-Request-ID header get set with that key?

dannygoldstein · 2021-03-24T17:51:09Z

@acrellin I've now sidestepped the issue altogether and simply assigned the request_id inside of Tornado instead of forwarding the value from nginx.

stefanv · 2021-03-24T18:36:08Z

@dannygoldstein What I'm trying to understand is why we need the contexts, and not simply set an attribute on the request instance.

dannygoldstein · 2021-03-24T18:48:03Z

@dannygoldstein What I'm trying to understand is why we need the contexts, and not simply set an attribute on the request instance.

Can you show an example?

stefanv · 2021-03-24T20:06:34Z

Each request instance handles a specific HTTP connection, if I'm not mistaken. And you cannot have multiple threads that access the database within one request?

dannygoldstein · 2021-03-24T20:42:51Z

@stefanv The database session is initialized at the baselayer/app/models.py module level. What I think you are suggesting is to do something like

class BaseHandler(PSAHandler):

    def prepare(self):
        self.session = scoped_session(sessionfactory())

    def get(self):
        self.session.query(...)

    def on_finish(self):
        self.session.remove()

Problem with this is we would have to go and refactor hundreds of lines of handler code to use self.session instead of DBSession. We'd also have to create a separate sessionfactory for sessions that are used outside of request handlers (e.g., in the IPython terminal). It would be a huge change.

dannygoldstein · 2021-03-24T20:44:31Z

We'd also have to refactor the event handlers, test suite, fixtures, etc. It would end being thousands of lines of code.

dannygoldstein · 2021-03-24T20:45:03Z

This way we have a single session factory, consistent scoping both in and out of request handlers, and no need to refactor any other code in the entire code base.

dannygoldstein · 2021-03-24T20:51:12Z

@dannygoldstein What I'm trying to understand is why we need the contexts, and not simply set an attribute on the request instance.

The reason is that DBSession is initialized at the top level of app/models.py, and can't access any variables local to handler methods in its scopefunc. That's what the contextvar allows us to get around.

stefanv · 2021-03-24T23:35:27Z

@profjsb "but it's only 6 lines of code"! 😂

Turns out these 6 (now 5) lines of code are... tricky. I spent a lot of time trying to figure out exactly how Tornado navigates contexts. Take a look at this Tornado issue, e.g.:

tornadoweb/tornado#2938

(This patch was introduced after the 6.0.4 that we depend on.)

Long and short of it, I think the reason any of this works is because a) Tornado 5 and up builds on asyncio and b) Tornado now has some context awareness itself.

What is not clear to me yet is exactly when contexts switch out. Ideally, we'd want a context switch per request handler, but we may get now get contexts depending on... well, I don't know and I need help here. I cannot find any Tornado documentation that discusses contexts.

The asyncio documentation states:

When a Task is created it copies the current context and later runs its coroutine in the copied context.

The Task in this case would be a method of our handler, I suspect.

What I would like to test next is the following:

Instantiate two handlers, a and b
Run a.prepare, then run b.prepare
Execute a method of a then b. Verify whether those methods have access to different contexts.

dannygoldstein · 2021-03-24T23:50:58Z

@stefanv https://groups.google.com/g/python-tornado/c/8mzGNnhUxtU?pli=1 from Ben Darnell: "It will work automatically for native (`async def`) coroutines"

stefanv · 2021-03-24T23:53:00Z

@dannygoldstein Right, but what does that mean?

import contextvars
import uuid


ctx = contextvars.ContextVar('request_id', default=None)

class MyHandler:
    def __init__(self, name):
        self._name = name
        ctx.set(uuid.uuid4().hex)

    def get(self):
        print(self._name, ' -> ', ctx.get())


print('Default context:', ctx.get())
print('Initializing handler A')
a = MyHandler('a')
print('Global context value:', ctx.get())
print('Initializing handler B')
b = MyHandler('b')
print('Global context value:', ctx.get())
print('Executing a.get')
a.get()
print('Executing b.get')
b.get()

Default context: None
Initializing handler A
Global context value: d9890d36505049c0b44160aa7d0f40e6
Initializing handler B
Global context value: 3071db7f53b64203a51e2795ee0f2bc0
Executing a.get
a  ->  3071db7f53b64203a51e2795ee0f2bc0
Executing b.get
b  ->  3071db7f53b64203a51e2795ee0f2bc0

dannygoldstein · 2021-03-24T23:54:32Z

@stefanv I think the issue you linked do above is only relevant to coroutines that are created using the tornado.gen.coroutine module and not the native python ones that are created using async def. We have only been using the latter so far in SP and F

stefanv · 2021-03-24T23:56:47Z

@stefanv I think the issue you linked do above is only relevant to coroutines that are created using the tornado.gen.coroutine module and not the native python ones that are created using async def. We have only been using the latter so far in SP and F

Yes, not sure exactly how those differ other than https://www.tornadoweb.org/en/stable/guide/coroutines.html#native-vs-decorated-coroutines

dannygoldstein · 2021-03-25T00:20:42Z

@stefanv I think there is only one "context" in your example (the global context), whereas in our situations there would be 2 separate contexts, one for each request -- I think a closer analog to what would happen in tornado would be:

import contextvars
import asyncio
import uuid


ctx = contextvars.ContextVar('request_id', default=None)

class MyHandler:
    def __init__(self, name):
        self._name = name
        ctx.set(uuid.uuid4().hex)

    def get(self):
        print(self._name, ' -> ', ctx.get())


async def create_handler_and_set_context(name):
    handler = MyHandler(name)
    print(f'Initializing handler {name}')
    print('Global context value:', ctx.get())
    print(f'Executing {name}.get')
    handler.get()

async def main():
    await create_handler_and_set_context('a')
    await create_handler_and_set_context('b')

asyncio.run(main())
print('Global context value:', ctx.get())

(skyportal2) Dannys-MBP:baselayer danny$ python resp.py
Initializing handler a
Global context value: 5e451d05d07548b68c45196197e7b721
Executing a.get
a  ->  5e451d05d07548b68c45196197e7b721
Initializing handler b
Global context value: f322233198eb426da64ddfb9fbc6b99e
Executing b.get
b  ->  f322233198eb426da64ddfb9fbc6b99e
Global context value: None

stefanv · 2021-03-25T00:26:59Z

@dannygoldstein I was wondering more about what would happen to all the "standard" calls (i.e., non-async). I presume we'd prefer for them to also get their own context?

I think we can run an experiment with real Tornado to see whether what you're saying above holds (we can use get handlers that sleep for a while).

dannygoldstein · 2021-03-25T00:50:30Z

@stefanv Want to give that a shot? I tried implementing with asyncio and I think it would look something like this:

import contextvars
import asyncio
import uuid
import time


ctx = contextvars.ContextVar('request_id', default=None)

class MyHandler:
    def __init__(self, name):
        self._name = name
        ctx.set(uuid.uuid4().hex)

    def get(self):
        print(self._name, ' -> ', ctx.get())


def _body(name):
    handler = MyHandler(name)
    print(f'Initializing handler {name}')
    print('Global context value:', ctx.get())
    print(f'Executing {name}.get')
    handler.get()

async def create_handler_and_set_context(name):
    _body(name)
    time.sleep(3)    

def blocking_create_handler_and_set_context(name):
    _body(name)
    time.sleep(1.5)
    
async def main():
    asyncio.create_task(create_handler_and_set_context('async a'))
    asyncio.create_task(create_handler_and_set_context('async b'))
    blocking_create_handler_and_set_context('block c')
    blocking_create_handler_and_set_context('block d')
    asyncio.create_task(create_handler_and_set_context('async e'))
    

asyncio.run(main())
print('Global context value:', ctx.get())

(skyportal2) Dannys-MBP:baselayer danny$ python resp.py
Initializing handler block c
Global context value: c1ac0cbc2db24259809a9c45411e72ea
Executing block c.get
block c  ->  c1ac0cbc2db24259809a9c45411e72ea
Initializing handler block d
Global context value: 40231eea993d4060995e7131a8971e7e
Executing block d.get
block d  ->  40231eea993d4060995e7131a8971e7e
Initializing handler async a
Global context value: 254b3adfa6a74c04b9fd4c1ca2873007
Executing async a.get
async a  ->  254b3adfa6a74c04b9fd4c1ca2873007
Initializing handler async b
Global context value: 7077808e0fd84fddae4781171489c712
Executing async b.get
async b  ->  7077808e0fd84fddae4781171489c712
Initializing handler async e
Global context value: 9ec747e303324365b22a6692828a4dc7
Executing async e.get
async e  ->  9ec747e303324365b22a6692828a4dc7
Global context value: None

dannygoldstein · 2021-03-25T01:15:11Z

From tornado docs:

https://www.tornadoweb.org/en/stable/index.html#asyncio-integration

asyncio Integration
Tornado is integrated with the standard library asyncio module and shares the same event loop (by default since Tornado 5.0). In general, libraries designed for use with asyncio can be mixed freely with Tornado.

dannygoldstein · 2021-03-25T02:59:42Z

@stefanv here is a tornado version,

from tornado import web, ioloop
import time
import numpy as np
import contextvars
import asyncio
import uuid

request_id = contextvars.ContextVar('request_id', default=None)

class BaseHandler(web.RequestHandler):

    def prepare(self):
        request_id.set(uuid.uuid4().hex)

class BlockingHandler(BaseHandler):

    def get(self):
        time.sleep(np.random.rand())
        self.write(request_id.get())


class AsyncHandler(BaseHandler):

    async def get(self):
        await asyncio.sleep(np.random.rand())
        self.write(request_id.get())


def make_app():
    return web.Application([
        (r"/blocking", BlockingHandler),
        (r'/async', AsyncHandler)
    ])

if __name__ == "__main__":
    app = make_app()
    app.listen(8888)
    ioloop.IOLoop.current().start()

I sent 50 requests to this at once using aiohttp (49 to the async handler and 1 to the blocking handler), the whole thing finished in ~1s (indicating that the async stuff on the server worked). I printed the request_ids returned by the server on the client side and verified that all 50 were unique, so I think this works fine in tornado

dannygoldstein · 2021-03-25T03:04:51Z

Here is my client code

# Send 50 requests to the server at once, 49 async, 1 blocking, then 
# verify that they all had separate request contexts on the server

import asyncio
import aiohttp

async def run(url_list):
    tasks = []
    async with aiohttp.ClientSession() as session:
        for url in url_list:
            task = asyncio.ensure_future(session.get(url))
            tasks.append(task)
        responses = asyncio.gather(*tasks)
        await responses
    return responses

async def main():
    url_list = ['http://localhost:8888/async'] * 50

    # send one blocking request
    url_list[25] = url_list[25].replace('async', 'blocking')

    task = run(url_list)
    future = await task
    context_ids = set()
    for r in future.result():
        context_ids.add(await r.text())
    assert len(context_ids) == 50
    print(context_ids)


asyncio.run(main())

(skyportal2) Dannys-MBP:baselayer danny$ time python client.py
{'27fc9539924043adac41c85afb533a2b', 'a867b251fdc54fd3a2de279bb3d66443', 'e22a3d6eb0124db89c5fbbe8f2612615', '5573d8ecd18a48c7bd448f6411291c9f', '64e0fd3daeb14a8a9a14a0e2ea10ce4a', 'e39841b3423d4d96bc6438739ae06e0f', '2262acdc09934ba3b270dd5abbfcb255', 'fbc6a204c3864c98badf08bfd0e59096', '130337ac1c214ba6a958d175f13d5d75', '48b0d98dbd664b35a629424dd2f15a7a', 'aa19e5cf270e4d7982d47657f2993173', '02dfc23ac8664f88b3a2a416637a63a9', '33964eb38b464d159257af4038fbec9d', 'b1bfc42e807f43a0be7c1f528f625818', 'e612464b672241509c8dd9fec34508cc', '2e3b3fb708b14b8ca808e25237a01c45', 'cc8b2750def14b13bda52159617d5faf', 'db3ed5124de745ccaf1cc54ee486531f', '4b5c370d19ad44b7b7bf76d223ef114b', 'a188bec587ed4b5ead8eb42157cc6170', '4a9b479144f14bef943d4940885196a8', '7c33d0fcbcb743138defe5b3b0ad9edc', 'dc939d777581426797cd683d7e8cf0b1', 'e97a2646c08342f98c3e7ad64e7e6777', 'fa6e0d303bbc4482b8fd8831a8f5e5ea', 'a915abefe4ae41e4ad81415cda0a893e', 'c7badc7ed9914767b40d7968de37ec3f', '5d8295d2a7fc402f8720cfbe3f8d20b7', 'b8d1733b1c9547e9ae56f3cca2213d30', '9678c52906984df19760de07c7a15732', 'e0e9d114959f469299747c0d011f05e3', '1defaf9261094c2ba1a31439177ad23c', '94177c00f027487eae2f45179e790f9e', '450d3a461a544ffb9d2fbb3a652aec9e', '906bf7ceaa6f43b88519879d9f0633b5', '3eb41aa7d9c74223934db7a763ff2023', '38a72829dd9145439aa3f43b37bc013d', 'b3f8d359601147b78e81e3d0393c5c20', '9beecbe7b8f34ae79349f61a4fc0f0ff', '5821fd67fcd9410d95bc8a0a30c1474c', '6fb1b16dd32a4dfd8a56ce49f8898b3d', 'be953628b8f446bcb7d892c1fa42f01e', '734fa054f11942e2937ee42f6ad27d7b', 'f89ea2e8df944933855dc39697e4bc99', 'f214da61a1814b2fb0c0ac7ddac1326e', '5f2f9dcb507b4262b557e1fdf3b6270a', '7b6286c19a444f83add35de132cbbc65', '9b41cfe14516455bb3cd715145308649', 'cd45200564a2426e8b0e4aeafbd6a524', '39cd38fe12ee41459f44bce44474ee63'}

real	0m1.650s
user	0m0.223s
sys	0m0.049s

stefanv · 2021-03-25T17:58:02Z

We have two requirements:

Every new request must execute in its own context
Every async def block must execute in the same context

Danny confirmed (1) above, and I've now also confirmed (2), so this is good to go.

stefanv · 2021-03-25T18:07:54Z

Thanks, @dannygoldstein!

A ContextVar is now used to determine when a new database session should be created. The ContextVar `session_context_id` is set upon request instantiation, and remains constant during its execution (i.e., while running a method like `async get`, `post`, etc.).

dannygoldstein added 4 commits March 22, 2021 10:01

Fix syntax error in CustomUserAccessControl (cesium-ml#211)

4885d9f

clear the sqlalchemy session at the beginning of each api handler call (

6de3248

cesium-ml#212)

Merge branch 'master' of https://github.com/cesium-ml/baselayer

ac24b8d

make session initialization thread safe and properly scoped

dd851b0

dannygoldstein requested review from stefanv, acrellin and kmshin1397 March 23, 2021 21:50

dannygoldstein added 3 commits March 23, 2021 14:55

pass request id to handlers from nginx

d4552e3

reset session_context_id to none on request finish

d052532

cleanup

664257b

dannygoldstein mentioned this pull request Mar 23, 2021

clear the sqlalchemy session at the beginning of each api handler call #213

Closed

dannygoldstein added 2 commits March 23, 2021 17:23

further cleanup

4964417

final cleanup

b1519da

dannygoldstein changed the title ~~make database sessions thread-safe~~ scope database sessions to requests instead of threads Mar 24, 2021

simplify import

24dce93

dannygoldstein mentioned this pull request Mar 24, 2021

Patch security vulnerabilities and data model violations revealed by test suite after enabling UACL skyportal/skyportal#1801

Merged

dannygoldstein added 3 commits March 24, 2021 10:45

simply use uuid4 for session_context_id

5c20c94

merge upstream

febbcb8

remove nginx config line

1ac8b56

remove space

e3cc391

acrellin approved these changes Mar 24, 2021

View reviewed changes

kmshin1397 approved these changes Mar 24, 2021

View reviewed changes

stefanv merged commit 960d01f into cesium-ml:master Mar 25, 2021

scope database sessions to requests instead of threads #214

scope database sessions to requests instead of threads #214

Conversation

dannygoldstein commented Mar 23, 2021 • edited Loading

pep8speaks commented Mar 23, 2021 • edited Loading

Comment last updated at 2021-03-24 17:51:59 UTC

stefanv commented Mar 24, 2021

dannygoldstein commented Mar 24, 2021 • edited Loading

acrellin commented Mar 24, 2021 via email

dannygoldstein commented Mar 24, 2021

stefanv commented Mar 24, 2021

stefanv commented Mar 24, 2021

acrellin commented Mar 24, 2021 • edited Loading

stefanv commented Mar 24, 2021

acrellin commented Mar 24, 2021

acrellin commented Mar 24, 2021

dannygoldstein commented Mar 24, 2021 via email

acrellin commented Mar 24, 2021

acrellin commented Mar 24, 2021

dannygoldstein commented Mar 24, 2021

stefanv commented Mar 24, 2021

dannygoldstein commented Mar 24, 2021

stefanv commented Mar 24, 2021

dannygoldstein commented Mar 24, 2021

dannygoldstein commented Mar 24, 2021

dannygoldstein commented Mar 24, 2021

dannygoldstein commented Mar 24, 2021

stefanv commented Mar 24, 2021

dannygoldstein commented Mar 24, 2021 via email • edited Loading

stefanv commented Mar 24, 2021

dannygoldstein commented Mar 24, 2021

stefanv commented Mar 24, 2021

dannygoldstein commented Mar 25, 2021 • edited Loading

stefanv commented Mar 25, 2021

dannygoldstein commented Mar 25, 2021

dannygoldstein commented Mar 25, 2021 • edited Loading

dannygoldstein commented Mar 25, 2021

dannygoldstein commented Mar 25, 2021

stefanv commented Mar 25, 2021

stefanv commented Mar 25, 2021

dannygoldstein commented Mar 23, 2021 •

edited

Loading

pep8speaks commented Mar 23, 2021 •

edited

Loading

dannygoldstein commented Mar 24, 2021 •

edited

Loading

acrellin commented Mar 24, 2021 •

edited

Loading

dannygoldstein commented Mar 24, 2021 via email •

edited

Loading

dannygoldstein commented Mar 25, 2021 •

edited

Loading

dannygoldstein commented Mar 25, 2021 •

edited

Loading