
Memory leak issue #10015

Jan 10, 2022 · 6 comments · 29 replies
The memory usage increases slightly during processing because the pipeline vocab in nlp.vocab is not static. The lexeme cache (nlp.vocab) and string store (nlp.vocab.strings) grow as texts are processed and new tokens are seen. The lexeme cache is just for speed, but the string store is needed to map the string hashes back to strings.
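
For illustration, here is a minimal sketch of how that growth can be observed (assuming the `en_core_web_sm` pipeline is installed; any pipeline behaves the same way):

```python
import spacy

nlp = spacy.load("en_core_web_sm")

# Sizes before processing any text
print(len(nlp.vocab), len(nlp.vocab.strings))

for doc in nlp.pipe(["Apple is looking at buying a U.K. startup.",
                     "Entirely different tokens appear in this text."]):
    pass

# Both counts have grown: previously unseen tokens were added to the
# lexeme cache and their strings were interned in the string store.
print(len(nlp.vocab), len(nlp.vocab.strings))
```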

If you're saving Doc objects directly for future processing, you'd need the string store to know which token strings were in the docs, since the Doc object just includes the hashes (ints) for the tokens and not the strings.
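
One way to handle this is `spacy.tokens.DocBin`, which stores the strings used by the docs it holds alongside the serialized docs, so the hashes can be resolved again in a fresh process. A sketch, again assuming `en_core_web_sm`:

```python
import spacy
from spacy.tokens import DocBin

nlp = spacy.load("en_core_web_sm")

doc_bin = DocBin()  # collects the strings used by the docs it holds
for doc in nlp.pipe(["One example text.", "Another example text."]):
    doc_bin.add(doc)
data = doc_bin.to_bytes()  # portable bytes, e.g. to write to disk

# Later, possibly in a fresh process with a fresh vocab: restoring the
# docs re-adds the stored strings so the hashes map back to strings.
nlp2 = spacy.load("en_core_web_sm")
docs = list(DocBin().from_bytes(data).get_docs(nlp2.vocab))
```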

If you're saving the resulting annotation/analysis in some other form, then you probably don't need to keep track of the full string store.
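
In that case it can be enough to pull the needed annotations out into plain Python data and let the Doc objects be garbage-collected. A hedged sketch, where entity extraction stands in for whatever analysis you keep:

```python
import spacy

nlp = spacy.load("en_core_web_sm")

def extract(texts):
    # Keep only plain strings/labels; once each Doc is dropped,
    # nothing depends on the hashes in the growing string store.
    return [[(ent.text, ent.label_) for ent in doc.ents]
            for doc in nlp.pipe(texts)]
```

For very long-running processes, a common mitigation discussed in threads like this is to periodically reload the pipeline with `spacy.load`, which resets the lexeme cache and string store to their initial size.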

Answer selected by adrianeboyd
Labels: feat / vectors (Feature: Word vectors and similarity), perf / memory (Performance: memory use)
This discussion was converted from issue #10012 on January 10, 2022 11:30.