Corner-case performance degradation #121

Open · ppolewicz opened this issue Aug 16, 2023 · 4 comments

ppolewicz commented Aug 16, 2023

This might be a known limitation, and I'm pretty sure it is something one shouldn't be doing in nogil mode. There is an easy workaround for now (just set PYTHONGIL=1 in such a case), but I thought it might be interesting to show the results of the test I ran:

1 million increments per thread, 3 threads:

| docker image | PYTHONGIL | time | slowdown | result |
|---|---|---|---|---|
| python:3.9-slim | not set | 0m1,014s | 1x | 1608925 |
| nogil/python | not set | 0m46,443s | 45.8x | 1008232 |
| nogil/python | 1 | 0m1,047s | 1.032x | 3000000 |

5 million increments per thread, 3 threads:

| docker image | PYTHONGIL | time | slowdown | result |
|---|---|---|---|---|
| python:3.9-slim | not set | 0m2,110s | 1x | 7823878 |
| nogil/python | not set | 3m44,655s | 106.47x | 5024261 |
| nogil/python | 1 | 0m2,127s | 1.0080x | 15000000 |

1 million increments per thread, 10 threads:

| docker image | PYTHONGIL | time | slowdown | result |
|---|---|---|---|---|
| python:3.9-slim | not set | 0m1,702s | 1x | 3051994 |
| nogil/python | not set | 2m14,190s | 78.84x | 1002417 |
| nogil/python | 1 | 0m1,761s | 1.0346x | 10000000 |

environment: Ubuntu VM running on 5 cores of an i5-9600KF CPU @ 3.70GHz

It looks like nogil/python with PYTHONGIL=1 produces a more consistent result for += than the official CPython release (it always reached the full expected total, while python:3.9-slim lost updates). Not sure if that's intended. There was no nogil 3.12 Docker image, so I didn't test there.
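For reference, a lock-protected variant of the increment (a sketch added for illustration, with a smaller iteration count than the benchmark below) always produces the full total on both GIL and nogil builds, since the read-modify-write of `c` becomes atomic:

```python
import threading

THREAD_COUNT = 3
BY_HOW_MUCH = 100_000  # smaller than the benchmark so the sketch runs quickly

class Incrementor:
    def __init__(self):
        self.c = 0
        self.lock = threading.Lock()

def incr(incrementor, by_how_much):
    for _ in range(by_how_much):
        with incrementor.lock:  # makes the read-modify-write of c atomic
            incrementor.c += 1

incrementor = Incrementor()
threads = [
    threading.Thread(target=incr, args=(incrementor, BY_HOW_MUCH))
    for _ in range(THREAD_COUNT)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(incrementor.c)  # 300000: no lost updates
```

The lock serializes the hot loop, so this trades throughput for correctness.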

Code

incrementor.py

```python
import sys
print('nogil', hasattr(sys.flags, 'nogil') and sys.flags.nogil)


import threading

THREAD_COUNT = 3
BY_HOW_MUCH = 1_000_000


class Incrementor:
    def __init__(self):
        self.c = 0

def incr(incrementor, by_how_much):
    for i in range(by_how_much):
        incrementor.c += 1

incrementor = Incrementor()

threads = [
    threading.Thread(target=incr, args=(incrementor, BY_HOW_MUCH))
    for i in range(THREAD_COUNT)
]

for t in threads:
    t.start()

for t in threads:
    t.join()

print(incrementor.c)
```

Dockerfile

```dockerfile
FROM nogil/python
#FROM python:3.9-slim
#ENV PYTHONGIL=1

WORKDIR /usr/src/app
COPY incrementor.py .
CMD ["python3", "incrementor.py"]
```
some raw test results

```sh
$ time docker run -it --rm --name incrementor-running incrementor
Sending build context to Docker daemon  4.608kB
Step 1/4 : FROM python:3.9-slim
 ---> 313d2f483acd
Step 2/4 : WORKDIR /usr/src/app
 ---> Using cache
 ---> 21fe8413ac4b
Step 3/4 : COPY incrementor.py .
 ---> Using cache
 ---> e12f9c4f202c
Step 4/4 : CMD ["python3", "incrementor.py"]
 ---> Using cache
 ---> df42d07b5298
Successfully built df42d07b5298
Successfully tagged incrementor:latest
nogil False
1608925

real	0m1,014s
user	0m0,016s
sys	0m0,016s
```

```sh
$ time docker run -it --rm --name incrementor-running incrementor
Sending build context to Docker daemon  4.608kB
Step 1/4 : FROM nogil/python
 ---> fb4c356acc04
Step 2/4 : WORKDIR /usr/src/app
 ---> Using cache
 ---> 7c0cc98d04d0
Step 3/4 : COPY incrementor.py .
 ---> 91b301e2e554
Step 4/4 : CMD ["python3", "incrementor.py"]
 ---> Running in 8bf5cd50d44a
Removing intermediate container 8bf5cd50d44a
 ---> 0604276a25f2
Successfully built 0604276a25f2
Successfully tagged incrementor:latest
nogil True
1008232

real	0m46,443s
user	0m0,021s
sys	0m0,014s
```

```sh
$ time docker run -it --rm --name incrementor-running incrementor
Sending build context to Docker daemon  4.608kB
Step 1/5 : FROM nogil/python
 ---> fb4c356acc04
Step 2/5 : ENV PYTHONGIL=1
 ---> Running in c541a1442b74
Removing intermediate container c541a1442b74
 ---> 242047eabdba
Step 3/5 : WORKDIR /usr/src/app
 ---> Running in 68c1395b279f
Removing intermediate container 68c1395b279f
 ---> 4f2bc8211e90
Step 4/5 : COPY incrementor.py .
 ---> 7bc7aa0dc555
Step 5/5 : CMD ["python3", "incrementor.py"]
 ---> Running in 45fc225db0eb
Removing intermediate container 45fc225db0eb
 ---> 5d79a20dfb2b
Successfully built 5d79a20dfb2b
Successfully tagged incrementor:latest
nogil False
3000000

real	0m1,291s
user	0m0,015s
sys	0m0,019s
```
@pjurgielewicz

Hi @ppolewicz, if I understand correctly, this is somewhat expected behavior due to biased reference counting: non-owning threads can only access shared objects' refcount fields using atomics.

In my tests, I was also able to degrade queue performance, which is relatively easy.

```python
import time
import threading
from queue import SimpleQueue, Queue

def producer(queue):
    buf = bytearray(1500)
    for i in range(10**6):
        queue.put((i, buf))
    queue.put(None)  # Add a sentinel value to indicate the end of data

def consumer(queue):
    while True:
        item = queue.get(timeout=1)
        if item is None:
            break
        # Process the item here (e.g., print or do some computation)

# Create a SimpleQueue object
# queue = Queue()
queue = SimpleQueue()

# Create the producer and consumer threads
producer_thread = threading.Thread(target=producer, args=(queue,))
consumer_thread = threading.Thread(target=consumer, args=(queue,))

# Start the timer
start_time = time.time()

# Start the threads
consumer_thread.start()
producer_thread.start()

# Wait for the threads to finish
producer_thread.join()
consumer_thread.join()

# Calculate the elapsed time
end_time = time.time()
elapsed_time = end_time - start_time

# Print the performance metrics
print(f"Data exchange performance: {elapsed_time} seconds.")
```

In nogil CPython, two (or more) threads concurrently fight for access to the shared resource (the queue), leading to enormous overhead, while the GIL prevents that (on my machine there is a two-orders-of-magnitude difference in execution speed between nogil and regular CPython).

All in all, nogil requires a slightly different approach to data management. In the case of queues, greedy data buffering on the client side provides good results for my project.
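One way to illustrate that buffering idea (a sketch, not the project's actual code; `BATCH_SIZE` is an assumed tuning knob) is to accumulate items locally and put whole batches on the queue, so per-item contention becomes per-batch contention:

```python
import threading
from queue import SimpleQueue

BATCH_SIZE = 1000  # assumed batch size; tune for your workload

def batching_producer(queue, n_items):
    """Accumulate items locally and put whole batches on the queue."""
    batch = []
    for i in range(n_items):
        batch.append(i)
        if len(batch) >= BATCH_SIZE:
            queue.put(batch)   # one contended put per BATCH_SIZE items
            batch = []
    if batch:
        queue.put(batch)
    queue.put(None)  # sentinel: end of data

def batching_consumer(queue, results):
    total = 0
    while True:
        batch = queue.get()
        if batch is None:
            break
        total += len(batch)  # stand-in for per-item processing
    results.append(total)

queue = SimpleQueue()
results = []
c = threading.Thread(target=batching_consumer, args=(queue, results))
p = threading.Thread(target=batching_producer, args=(queue, 10**5))
c.start()
p.start()
p.join()
c.join()
print(results[0])  # 100000: every item arrived, with ~1000x fewer queue operations
```

The trade-off is latency: items sit in the producer's buffer until a batch fills up.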

@ppolewicz (Author)

The GIL required workarounds, and biased reference counting may sometimes require workarounds too. I hope it'll be harder to run into those.

If my understanding is correct, incr() shouldn't be touching the reference counter of the incrementor object a million times; it's rather the integers assigned to c that are being DECREFed with atomics. I switched it to ^= True in the hope of getting rid of that effect, since True and False should be immortal, but it didn't help. Am I missing something? Is nogil/python immortalizing True and False, or was that only a 3.12 thing?

@pjurgielewicz

@ppolewicz

In your code, try switching this:

```python
threads = [
    threading.Thread(target=incr, args=(incrementor, BY_HOW_MUCH))
    for i in range(THREAD_COUNT)
]
```

to:

```python
threads = [
    threading.Thread(target=incr, args=(Incrementor(), BY_HOW_MUCH))
    for i in range(THREAD_COUNT)
]
```

And you will see the maximum possible performance on nogil. The result will of course be wrong, but my point here is that in the original code, 3 child threads are simultaneously trying to access the same instance of the Incrementor object, thus modifying its refcount, and then these threads try to modify the field self.c, which also affects the counter; for non-owning threads, all of these are atomic (lock-acquiring and potentially CPU-cache-invalidating) operations.

About immortalization: yes, immortalization is part of nogil 3.9 (_Py_REF_IMMORTAL_MASK), but it does not kick in at the place you mentioned (in my counterexample, 3 threads freely create and assign ints to per-thread self.c objects).

@colesbury (Owner)

I think @pjurgielewicz's response has covered most of this, but just in case:

  1. Yes, this is expected. (The nogil with PYTHONGIL=1 behavior is a bit surprising, but I think that's just because there are fewer places for the GIL to be released; don't read too much into it.)
  2. Depending on the interpreter, incr() may touch the reference count of incrementor. It definitely does in CPython, because access to local variables requires a LOAD_FAST, which copies the local variable to the stack and increments the reference count. To be honest, I can't remember whether the nogil-3.9 bytecode works this way, but it doesn't matter because that's not going to get upstreamed.
  3. There's also contention on the attribute c. You'll have this even if you switch to booleans and immortalize incrementor.
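The LOAD_FAST point can be checked directly with the dis module (a quick sketch; exact opcode names vary between CPython versions):

```python
import dis

def incr(incrementor, by_how_much):
    for i in range(by_how_much):
        incrementor.c += 1

# Each loop iteration executes a LOAD_FAST on `incrementor` before the
# attribute access, which on stock CPython bumps its reference count.
names = {ins.opname for ins in dis.get_instructions(incr)}
print(any(op.startswith('LOAD_FAST') for op in names))  # True
```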

The general problem of memory contention is not specific to Python or nogil -- you can see similar behavior when modifying the same variable from many threads in C or C++ -- but it's potentially more of an issue in nogil because there's possible contention on more things:

  1. The reference count of incrementor
  2. The reference count of the numbers themselves
  3. The dict lock that protects incrementor.c

The provided example is probably not worth optimizing for, but I expect there will be more real-world examples where memory contention is an issue. We have a few possible strategies for mitigating it, including immortalization and deferred reference counting, but how they're applied will depend on the actual context.
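At the application level, the usual way to sidestep all three contention points is to keep a thread-local accumulator and touch shared state once per thread (a sketch of the general pattern, not one of the interpreter-level mitigations named above):

```python
import threading

THREAD_COUNT = 3
BY_HOW_MUCH = 1_000_000

results = [0] * THREAD_COUNT  # one slot per thread: no shared hot field

def incr(slot):
    count = 0  # thread-local accumulator; no refcount or dict contention
    for _ in range(BY_HOW_MUCH):
        count += 1
    results[slot] = count  # one shared write per thread, not one per increment

threads = [threading.Thread(target=incr, args=(i,)) for i in range(THREAD_COUNT)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sum(results))  # 3000000: each thread only ever wrote its own slot
```

This is the same reduction pattern used to avoid cache-line contention in C or C++.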
