Starlette and pandas results in unreleased memory piling up when using sync endpoints #1573
Replies: 1 comment
- I'll close this as outdated.
The problem
The core problem I'm trying to solve is that when you define sync endpoints in Starlette and a dependency fails to release memory properly, idle memory usage grows unbounded. In my case, pandas has a known issue of not releasing memory.
The problem occurs only with sync endpoints, not with async endpoints.
Reproducing
Results
The process goes from using around 50MiB to roughly 3.5GiB while requests are in flight. Once the requests complete, it settles around 800MiB and never decreases from there. In fact, if you keep making the same requests it grows to 1GiB, then 1.2GiB, and so on.
If I instead point the same HTTP load-testing script at http://localhost:8080/async, the process goes from using 60MiB to roughly 160MiB during request time, and once the requests complete it settles around 100MiB.
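The figures above were presumably read from an external process monitor; for completeness, a stdlib-only way to read peak resident memory from inside the process (not part of the original test setup, and unavailable on Windows):

```python
# Read the process's peak resident set size via getrusage(2).
import resource

def max_rss_kib() -> int:
    # ru_maxrss is the peak RSS: reported in KiB on Linux,
    # in bytes on macOS.
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

print(f"peak RSS: {max_rss_kib()} (KiB on Linux)")
```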
Thoughts
I've read that setting the environment variable MALLOC_TRIM_THRESHOLD_=0 is supposed to help with pandas's memory behaviour, but the test above already uses it. If I remove that environment variable, memory goes from 50MiB to 4.3GiB and settles at 3GiB, i.e. it uses even more memory. So there is some truth to the advice, but it doesn't address the underlying core problem.
When you make sufficient requests to the sync endpoints, Starlette spins up worker threads:
If however you use the async endpoints, it doesn't do that.
You can see here that if the endpoint is a coroutine it is run as-is; otherwise everything runs in a threadpool, which is what creates all of these worker threads:
starlette/starlette/routing.py
Line 62 in 20d24a8
=>
starlette/starlette/concurrency.py
Line 35 in 20d24a8
=>
https://github.com/agronholm/anyio/blob/e23b44e171c71d44e57d0c103a7daec1aaa7ad57/src/anyio/to_thread.py#L10
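The dispatch linked above boils down to a coroutine check. Starlette hands the non-coroutine branch to `anyio.to_thread.run_sync`; the sketch below mirrors the same decision using only the stdlib (so the names and the executor choice are mine, not Starlette's actual code):

```python
# Stdlib-only sketch of "coroutines run on the loop, plain functions
# run in a worker thread", the logic in starlette/routing.py.
import asyncio
import functools
import inspect

async def run_endpoint(func, *args):
    if inspect.iscoroutinefunction(func):
        # Coroutine function: await it directly on the event loop.
        return await func(*args)
    # Plain function: push it to a worker thread. Starlette uses
    # anyio.to_thread.run_sync here; this uses the default executor.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, functools.partial(func, *args))

async def async_endpoint():
    return "async"

def sync_endpoint():
    return "sync"
```

Both call paths return the same result; the difference is only *where* the function body executes, which is why the worker threads appear solely under sync load.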
But I don't think the threadpool itself is the underlying cause of the problem, because these worker threads are killed once they have been idle for 10 seconds and new requests come in.
What I'm not sure about is whether there is a way to avoid this ever-increasing memory usage, other than switching entirely to async endpoints, and whether there is a way to make pandas better behaved w.r.t. memory.
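One mitigation sometimes suggested for glibc-based Linux systems (my addition, not from the original discussion) is to explicitly ask the allocator to hand freed memory back to the OS with malloc_trim(3), e.g. periodically or after large requests. A sketch that degrades gracefully on non-glibc platforms:

```python
# Ask glibc to return freed heap pages to the OS. malloc_trim is
# glibc-specific, so this returns False where it isn't available.
import ctypes
import ctypes.util
import gc

def trim_memory() -> bool:
    gc.collect()  # release any Python-level garbage first
    libc_name = ctypes.util.find_library("c")
    if libc_name is None:
        return False
    try:
        libc = ctypes.CDLL(libc_name)
        libc.malloc_trim(0)  # AttributeError on non-glibc libcs
        return True
    except (OSError, AttributeError):
        return False
```

This doesn't fix whatever keeps the objects alive; it only stops glibc from caching already-freed arenas, which is the same mechanism MALLOC_TRIM_THRESHOLD_=0 tunes.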
References:
Pandas memory leak issues:
pandas-dev/pandas#2659
pandas-dev/pandas#21353