Memory leak #56
I found it: it wasn't lxml as such, but rather a combination with a function cache.
Using master, the increase is acceptable now, good job. Can't wait for the new release.
It's out ✔️
Have met this bug as well tonight through trafilatura. It was making the Python process very difficult to scale in the number of pages crawled. Are you sure about the remaining lru_cache calls, i.e., that their function arguments do not involve a lot of memory (e.g., non-small HTML strings or lxml trees as arguments)? Thanks
@dmoklaf Yes, the main problem has been addressed; there are no lxml trees as arguments anymore. Besides, a few other

You may have to update
Thanks - I have done that last night but still have regular memory increases (with a few thousand docs I reach 1 GB quickly, and from there it keeps growing). The use of global variables to hold data (in this case, the dictionaries hidden behind @lru_cache in both htmldate and trafilatura) is causing this and preventing me from telling the process "get rid of all your caches" once the crawling is done. If you refactor htmldate's and trafilatura's code in the next major version, it might be beneficial to encapsulate all the cache data structures (in fact, everything) in an object. That way, when the client of your library is done, they just get rid of this object and have a 100% guarantee that all the memory is freed up (no global variables holding user data staying in the background).
Thanks for your input, the idea seems interesting. Could you please point me to examples of the solution you describe? In the meantime, you can try to set the
I have forced CACHE_SIZE=2 in both htmldate and trafilatura, and I can confirm that the memory growth is gone. It's definitely the various caches, which grow very large and can't be cleared afterward. A clean encapsulation into an object would look like this (illustrated here for trafilatura, but the same principle applies to both libs). Please read the very last sentence, because it simplifies the work to be done A LOT (to something that I think can be done in 1 hour of work):
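The code example from this comment was not captured in this copy. A minimal sketch of the encapsulation idea might look like the following; all class and method names here are hypothetical stand-ins, not Trafilatura's actual API:

```python
from functools import lru_cache


class Extractor:
    """Hypothetical sketch: an extractor object that owns its caches,
    so discarding the object frees all cached memory at once."""

    def __init__(self, cache_size=128):
        # Build a per-instance cache instead of a module-level @lru_cache,
        # so nothing global retains user data.
        self._normalize = lru_cache(maxsize=cache_size)(self._normalize_impl)

    @staticmethod
    def _normalize_impl(text):
        # Placeholder for an expensive, cacheable helper.
        return text.strip().lower()

    def extract(self, html):
        # Use the instance-bound cached helper.
        return self._normalize(html)


extractor = Extractor(cache_size=2)
result = extractor.extract("  Some HTML  ")  # "some html"
# When done, discard the object and all of its caches go with it:
del extractor
```

The point of the design is that the cache's lifetime is tied to the object's lifetime, so the client controls exactly when the memory is released.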
Hi @dmoklaf, thanks for the detailed explanations! The extractor class makes perfect sense, but the trick with
As a workaround, and maybe a simpler alternative, I have just cleared the cache myself every 1000 documents using the @lru_cache cache_clear() method. I currently hit all the cached functions directly, which makes my code fragile (as these might change):
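The snippet itself is missing from this copy. A sketch of that kind of periodic cache_clear() workaround, using a stand-in cached function rather than the libraries' real internals (which is exactly what makes the real version fragile), might be:

```python
from functools import lru_cache


# Stand-in for a cached helper inside htmldate/trafilatura; the real
# functions are internal and may change between releases.
@lru_cache(maxsize=1024)
def cached_helper(s):
    return s.upper()


docs_processed = 0


def process(doc):
    global docs_processed
    cached_helper(doc)
    docs_processed += 1
    # Clear every 1000 documents to bound memory use.
    if docs_processed % 1000 == 0:
        cached_helper.cache_clear()


for i in range(2500):
    process(f"doc-{i}")

# Clears happened at 1000 and 2000 docs, so only docs 2001-2500 remain cached.
print(cached_helper.cache_info().currsize)  # 500
```

The drawback, as noted above, is that the caller has to reach into library internals to find every cached function.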
This could become a target solution if each library were to provide such a public

The advantage of this approach is that you (as the designer of the library) keep control over the cache sizes; it's not one-size-fits-all, as some caches may need to be larger or smaller depending on how often each function is called and the frequency of its specific arguments. What do you think?
Maybe we could just change/set a custom cache size easily when both libraries are imported?
Ah, that's a good solution. Unfortunately, @lru_cache is applied at import time, so once the module is imported it's too late: the caches are already set and can't be adjusted. The solution would be to rebuild them through a utility function. Each library could have a utility function that resets (including clearing) the caches to the appropriate size.
The advantage of this approach is that, like the object approach, the library developer keeps the possibility in specific cases to call
Yes, either a
Yes, both solutions fit the bill. Regarding the second solution, having the user provide an additional
This function just rebuilds the cached version of the utility functions:
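The function body is missing from this copy. A sketch of how such a rebuild could work, using a stand-in cached function rather than htmldate's actual internals: functools.lru_cache exposes the undecorated callable as `__wrapped__`, so the wrapper can be recreated at runtime with a new size, which both clears the cache and resizes it.

```python
from functools import lru_cache


# Stand-in for a module-level cached utility function.
@lru_cache(maxsize=128)
def expensive(x):
    return x * 2


def reset_caches(maxsize=128):
    """Rebuild the cached wrapper: clears all entries and applies a new size."""
    global expensive
    # lru_cache keeps the undecorated function in __wrapped__, so we can
    # re-wrap it with a different maxsize even though the original
    # decoration happened at import time.
    expensive = lru_cache(maxsize=maxsize)(expensive.__wrapped__)


expensive(1)
expensive(2)
reset_caches(maxsize=2)
print(expensive.cache_info().maxsize)   # 2
print(expensive.cache_info().currsize)  # 0 (the cache was rebuilt empty)
```

In a library, the reset function would simply reassign every cached module-level function this way.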
That way, a single function answers all the needs and you don't have to propagate a flag everywhere. The first solution is more limited functionally, but sufficient too (for my needs).

PS: I tried to track the memory usage using the Python tracemalloc module and couldn't find the culprit. This is because tracemalloc tracks Python memory usage, not the memory usage of C modules like lxml. So I think you are still caching lxml arguments somewhere. I am not sure this is useful (i.e., will the cache ever bring back a cached value? what is your use case where this would occur?), but you know your code much better than I do.
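The tracemalloc caveat mentioned above can be illustrated directly: tracemalloc only records allocations made through Python's own allocator, so memory that a C extension obtains with malloc() (such as libxml2 buffers behind lxml) never shows up in its statistics. A minimal sketch with pure-Python allocations:

```python
import tracemalloc

tracemalloc.start()

# Allocations made through Python's allocator are tracked...
data = [bytes(1000) for _ in range(1000)]  # roughly 1 MB of Python objects

snapshot = tracemalloc.take_snapshot()
total = sum(stat.size for stat in snapshot.statistics("lineno"))
print(f"tracked: {total / 1024:.0f} KiB")

# ...but memory allocated directly by C code (e.g. libxml2 trees built
# by lxml) bypasses the Python allocator and is invisible here, which
# is why the leak could not be located this way.
tracemalloc.stop()
```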
Yes, this solution would be easier, and bundling all functions using

Since caching lxml trees was a problem, I already checked, and I think it's not happening anymore. So the memory use should come from somewhere else. I found that the URL manipulation library (
@adbar So with the new release 1.3.0 and the addition of

Edit: By the way, in the changelog it is called
I have not used the new code yet; I have only used my own workaround (while waiting for this fix). The workaround works similarly, clearing all the caches, but in both libraries (trafilatura and htmldate), whereas this fix seems to focus only on htmldate and its charset_normalizer dependency (which is not in my workaround). I call my workaround function to clear the caches every 1000 documents parsed. The outcome of my workaround, which has been running quite a lot, is that clearing the caches works perfectly: I have been using it for a couple hundred thousand documents without any unbounded growth in memory anymore. So my current understanding is that these were not leaks, just over-use of caches (a "default" configuration of the caches that would use several gigabytes of memory if never cleared).
Using the latest version:

```python
from htmldate.meta import reset_caches

reset_caches()
```

I'm getting the following error:

```
AttributeError                            Traceback (most recent call last)
Input In [9], in <cell line: 1>()
----> 1 reset_caches()

Input In [8], in reset_caches()
     23 try:
     24     encoding_languages.cache_clear()
---> 25     is_suspiciously_successive_range.cache_clear()
     26     is_accentuated.cache_clear()
     27     # prevent possible changes in function names

AttributeError: 'function' object has no attribute 'cache_clear'
```
@kinoute Do you have the latest version of

As @dmoklaf says, you're free to use the function whenever you see fit, for example every n URLs. A similar function for Trafilatura will follow (see adbar/trafilatura#219).
@adbar Indeed, upgrading to the latest version of

Can't wait for the same feature in Trafilatura!

Edit: Just in case somebody has the same problem, there was a conflict between
See issue adbar/trafilatura#216.
Extracting the date from the same web page multiple times shows that the module is leaking memory. This doesn't appear to be related to `extensive_search`: tracemalloc doesn't give any clue.