
OOM failures working with USC data in 2-core codespaces #218

Open
EliahKagan opened this issue Aug 6, 2023 · 0 comments

EliahKagan commented Aug 6, 2023

On the usc feature branch, some operations on the U.S. Code XML data, specifically count_tokens on very long texts, require a lot of RAM and produce out-of-memory errors (currently manifesting as Jupyter kernels being killed by the operating system) when insufficient RAM is available. This happens on 2-core codespaces, which have 4 GiB of RAM.

4 GiB should be enough to perform these operations. The underlying problem is that, while counting the tokens in a text intuitively feels like a constant-space operation and could be implemented in constant space, counting tokens with tiktoken takes linear space: it builds a list of all the token IDs and then checks the list's length. Often that's acceptable, because one usually counts tokens in a text that is within a couple of orders of magnitude of being short enough to embed. But we count tokens in much longer texts to produce statistics used for planning.
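A minimal sketch of the two approaches, using a whitespace regex as a stand-in tokenizer (not tiktoken's BPE, and not any real API of tiktoken): the first materializes every token just to read a length, while the second consumes tokens one at a time and needs only constant extra memory.

```python
import re

text = "the quick brown fox " * 1000

# Materialize-then-count: the entire token list is held in memory just so
# its length can be read. Conceptually, this is the linear-space pattern
# described above.
tokens = re.findall(r"\S+", text)
count_via_list = len(tokens)

# Stream-and-count: tokens are produced one at a time by an iterator and
# immediately discarded, so extra memory stays constant regardless of the
# text's length.
count_via_stream = sum(1 for _ in re.finditer(r"\S+", text))

assert count_via_list == count_via_stream == 4000
```

The streaming pattern is only a sketch of the idea; a real constant-space token counter would have to produce BPE token IDs incrementally, which tiktoken's encode does not currently expose.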

The reason it takes more than 4 GiB of RAM relates in part to how Python represents data. The flat vector that tiktoken's Rust code produces is much smaller. But to turn it into a Python list, a Python int object has to be created for each token ID (with associated space overhead) and pointed to by a pointer ("reference") in the list.
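The overhead can be illustrated with the standard library alone. This sketch compares a Python list of int objects against a flat 32-bit buffer (array from the array module, standing in for the Rust Vec); the exact byte counts vary by CPython version and platform, so none are hard-coded.

```python
import sys
from array import array

n = 100_000

# A Python list of n int objects: the list stores one pointer per element,
# and each int object carries its own per-object header on top of its value.
ids = list(range(n))
list_total = sys.getsizeof(ids) + sum(sys.getsizeof(x) for x in ids)

# A flat buffer of 32-bit values, comparable to the vector tiktoken's Rust
# code produces internally: 4 bytes per element plus a small fixed header.
flat = array('I', range(n))
flat_total = sys.getsizeof(flat)

print(f"list of int objects: ~{list_total:,} bytes")
print(f"flat 32-bit array:   ~{flat_total:,} bytes")
```

On a typical 64-bit CPython the list representation is several times larger, which is the gap that pushes these counts past 4 GiB for very long texts.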

Although it could perhaps be helpful to develop a way to count tokens in huge single documents without using as much RAM (which could then be contributed to tiktoken, if its developers want it), the purpose of this bug report is not to propose or plan that. Instead, it is to note that this problem, as a limitation of the default-selected codespace, may go away on its own very soon due to the changes described in Codespaces gives you a free upgrade. As I understand it, the RAM even for 2-core codespaces may soon double to 8 GiB.

This bug report exists to keep track of that, so we don't assume things will start working before it is confirmed. It can be closed when we verify both that (a) some such change has happened that plausibly eliminates the OOM errors in our use, and (b) testing shows that they are, in fact, eliminated. If the change is not sufficient to eliminate them, then this bug report can either be edited accordingly or closed as not planned.
