
OOM failures working with USC data in 2-core codespaces #218

Open
EliahKagan opened this issue Aug 6, 2023 · 0 comments

EliahKagan commented Aug 6, 2023

On the usc feature branch, some operations on the U.S. Code XML data, specifically count_tokens on very long texts, require a lot of RAM and produce out-of-memory errors (currently manifesting as Jupyter kernels being killed by the operating system) when insufficient RAM is available. This happens on 2-core codespaces, which have 4 GiB of RAM.

4 GiB should be enough to perform these operations. The underlying problem is that, while counting the tokens in a text intuitively feels like a constant-space operation and could be implemented in constant space, counting tokens with tiktoken takes linear space: it builds a list of all the token IDs and then checks the list's length. Often that's acceptable, because one usually counts tokens in a text that is within a couple of orders of magnitude of being short enough to embed. But we count tokens in much longer texts to produce statistics used for planning.
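A minimal sketch of the two approaches, using a whitespace regex as a stand-in tokenizer (not tiktoken's BPE, and not any real API of tiktoken): the first materializes every token just to read a length, while the second consumes tokens one at a time and needs only constant extra memory.

```python
import re

text = "the quick brown fox " * 1000

# Materialize-then-count: the entire token list is held in memory just so
# its length can be read. Conceptually, this is the linear-space pattern
# described above.
tokens = re.findall(r"\S+", text)
count_via_list = len(tokens)

# Stream-and-count: tokens are produced one at a time by an iterator and
# immediately discarded, so extra memory stays constant regardless of the
# text's length.
count_via_stream = sum(1 for _ in re.finditer(r"\S+", text))

assert count_via_list == count_via_stream == 4000
```

The streaming pattern is only a sketch of the idea; a real constant-space token counter would have to produce BPE token IDs incrementally, which tiktoken's encode does not currently expose.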

The reason it takes more than 4 GiB of RAM relates in part to how Python represents data. The flat vector that tiktoken's Rust code produces is much smaller. But to turn it into a Python list, a Python int object has to be created for each token ID (with associated space overhead) and pointed to by a pointer ("reference") in the list.
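The overhead can be illustrated with the standard library alone. This sketch compares a Python list of int objects against a flat 32-bit buffer (array from the array module, standing in for the Rust Vec); the exact byte counts vary by CPython version and platform, so none are hard-coded.

```python
import sys
from array import array

n = 100_000

# A Python list of n int objects: the list stores one pointer per element,
# and each int object carries its own per-object header on top of its value.
ids = list(range(n))
list_total = sys.getsizeof(ids) + sum(sys.getsizeof(x) for x in ids)

# A flat buffer of 32-bit values, comparable to the vector tiktoken's Rust
# code produces internally: 4 bytes per element plus a small fixed header.
flat = array('I', range(n))
flat_total = sys.getsizeof(flat)

print(f"list of int objects: ~{list_total:,} bytes")
print(f"flat 32-bit array:   ~{flat_total:,} bytes")
```

On a typical 64-bit CPython the list representation is several times larger, which is the gap that pushes these counts past 4 GiB for very long texts.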

Although it could perhaps be helpful to develop a way to count tokens in huge single documents without using as much RAM (which could then be contributed to tiktoken, if its developers want it), the purpose of this bug report is not to propose or plan that. Instead, it is to note that this problem, as a limitation of the default-selected codespace, may go away on its own very soon due to the changes described in Codespaces gives you a free upgrade. As I understand it, the RAM even for 2-core codespaces may soon double to 8 GiB.

This bug report exists to keep track of that, so we don't assume things will start working before it is confirmed. It can be closed when we verify both that (a) some such change has happened that plausibly eliminates the OOM errors in our use, and (b) testing shows that they are, in fact, eliminated. If the change is not sufficient to eliminate them, then this bug report can either be edited accordingly or closed as not planned.
