On the usc feature branch, some operations on the U.S. Code XML data, specifically `count_tokens` on very long texts, require a lot of RAM and produce out-of-memory errors (currently manifesting as Jupyter kernels being killed by the operating system) when insufficient RAM is available. This happens on 2-core codespaces, which have 4 GiB of RAM.
4 GiB should be enough to perform these operations. The underlying problem is that, while counting the tokens in a text intuitively feels like it should use only constant space, and could be implemented in constant space, using `tiktoken` to count tokens takes linear space, because it builds a list of all the token IDs and then checks the list's length. Often that's okay, because often one counts tokens in a text that is within a couple orders of magnitude of being short enough to embed. But we count tokens in longer texts to produce statistics used for planning.
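One way to bound peak memory, sketched below, would be to count tokens chunk by chunk rather than encoding the whole text at once. This is only a sketch, not something `tiktoken` provides: the `encode` callable is an assumed stand-in for an encoder's encode method (here a trivial whitespace tokenizer, so the example is self-contained), and with a real BPE encoder, naive chunking can miscount slightly when a merge would have spanned a chunk boundary, which splitting at whitespace only mitigates.

```python
from typing import Callable, List

def count_tokens_chunked(
    text: str,
    encode: Callable[[str], List[int]],
    chunk_chars: int = 1_000_000,
) -> int:
    """Count tokens without materializing one huge token list.

    Peak memory is bounded by the token list of a single chunk rather
    than of the whole text. Splitting at whitespace reduces (but, for a
    real BPE encoder, does not eliminate) boundary-related miscounts.
    """
    total = 0
    start = 0
    while start < len(text):
        end = min(start + chunk_chars, len(text))
        if end < len(text):
            # Back up to the last space so we don't cut a word in half.
            ws = text.rfind(" ", start, end)
            if ws > start:
                end = ws
        total += len(encode(text[start:end]))
        start = end
    return total

# A stand-in "encoder" (NOT tiktoken): one token per whitespace-separated word.
fake_encode = lambda s: [hash(w) for w in s.split()]

text = "lorem ipsum " * 50_000  # a long synthetic document
print(count_tokens_chunked(text, fake_encode, chunk_chars=4096))
```

With a whitespace stand-in the chunked count matches the one-shot count exactly; with `tiktoken`, the counts should be checked against each other on texts that still fit in memory before relying on this approach.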
The reason it takes more than 4 GiB of RAM relates in part to how Python represents data. The flat `vec` that the Rust code of `tiktoken` produces is much smaller. But to turn it into a Python list, Python `int` objects have to be created (with associated space overhead) and pointed to by pointers ("references") in the list.
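The overhead is easy to see with the standard library alone. This sketch compares a packed array of 32-bit token IDs, roughly what the Rust side holds, against a Python list of `int` objects; the exact byte counts are CPython-specific and vary by version and platform:

```python
import sys
from array import array

n = 1_000_000
token_ids = list(range(n))    # one million Python int objects
flat = array("I", token_ids)  # the same values, packed as unsigned 32-bit ints

# The flat buffer is itemsize (typically 4) bytes per token ID.
flat_bytes = flat.buffer_info()[1] * flat.itemsize

# The list costs a pointer per element *plus* a ~28-byte int object each
# (in CPython; small ints are shared, so this slightly overcounts).
list_bytes = sys.getsizeof(token_ids) + sum(sys.getsizeof(i) for i in token_ids)

print(f"flat array: {flat_bytes / 1e6:.0f} MB")
print(f"Python list + int objects: {list_bytes / 1e6:.0f} MB")
```

On a typical CPython build this shows roughly an order-of-magnitude difference per token, which is consistent with a token count that "should" take megabytes exhausting gigabytes.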
Although it could perhaps be helpful to develop a way to count tokens in huge single documents without using as much RAM (which could then be contributed to `tiktoken`, if its developers want that), the purpose of this bug report is not to propose or plan that. Instead, it is to note that, as a limitation of the default-selected codespace type, this problem looks like it may go away on its own very soon, due to the changes described in "Codespaces gives you a free upgrade". As I understand it, the RAM even for 2-core codespaces may double to 8 GiB soon.
This bug report exists to keep track of that, so we don't assume things will start working before it is confirmed. It can be closed when we verify both that (a) some such change has happened that plausibly eliminates the OOM errors in our use, and (b) testing shows that they are, in fact, eliminated. If the change is not sufficient to eliminate them, this bug report can either be edited accordingly or closed as not planned.