
Memory leak #56

Closed
adbar opened this issue Jun 15, 2022 · 21 comments
Labels
bug Something isn't working

Comments


adbar commented Jun 15, 2022

See issue adbar/trafilatura#216.

Extracting the date from the same web page multiple times shows that the module is leaking memory; this does not appear to be related to extensive_search:

import os
import psutil
from htmldate import find_date

with open('test.html', 'rb') as inputf:
    html = inputf.read()

for i in range(10):
    result = find_date(html, extensive_search=False)
    process = psutil.Process(os.getpid())
    print(i, ":", process.memory_info().rss / 1024 ** 2)

tracemalloc doesn't give any clue.
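
(A minimal sketch, not from the original report, of how one might cross-check this: if the process RSS keeps growing while tracemalloc's traced total stays flat, the allocations are happening outside the Python heap, e.g. in C extensions such as lxml, which tracemalloc cannot see.)

import os
import tracemalloc

import psutil
from htmldate import find_date

with open('test.html', 'rb') as inputf:
    html = inputf.read()

tracemalloc.start()
process = psutil.Process(os.getpid())
for i in range(10):
    find_date(html, extensive_search=False)
    traced, _peak = tracemalloc.get_traced_memory()
    rss = process.memory_info().rss
    # traced covers only Python-level allocations; rss covers everything
    print(i, ":", traced / 1024 ** 2, "MiB traced |", rss / 1024 ** 2, "MiB RSS")
tracemalloc.stop()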

adbar added the bug label Jun 15, 2022

kinoute commented Jun 15, 2022

Running your example with memray, it also seems related to lxml:

0 : 43.53125
1 : 44.6875
2 : 45.953125
3 : 47.3515625
4 : 48.1015625
5 : 49.23828125
6 : 51.14453125
7 : 51.765625
8 : 53.2890625
9 : 54.8046875

[Screenshot: memray output, 2022-06-15]

adbar added a commit that referenced this issue Jun 15, 2022

adbar commented Jun 15, 2022

I found it: it wasn't lxml as such, but rather its combination with a function cache.
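
(For illustration, a minimal sketch of that pattern with hypothetical names, not the actual htmldate code: an lru_cache-decorated function whose arguments are parsed trees keeps a strong reference to every distinct tree it has seen, so none of them can be garbage-collected while they sit in the cache.)

from functools import lru_cache

from lxml import html as lxml_html

@lru_cache(maxsize=8192)
def leaky_helper(tree):
    # the return value is tiny, but the cache key (the whole tree) is kept alive
    return tree.tag

for _ in range(1000):
    tree = lxml_html.fromstring('<html><body>' + 'x' * 100_000 + '</body></html>')
    leaky_helper(tree)  # every call is a miss: each tree is a distinct object

leaky_helper.cache_clear()  # only now can the 1000 trees be collected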


kinoute commented Jun 15, 2022

Using master, the increase is acceptable now, good job. Can't wait for the new release.

adbar added a commit that referenced this issue Jun 16, 2022

adbar commented Jun 16, 2022

It's out ✔️

adbar closed this as completed Jun 16, 2022

dmoklaf commented Jun 29, 2022

I hit this bug as well tonight through trafilatura. It was making the Python process very difficult to scale in the number of pages crawled. Are you sure about the remaining lru_cache calls, i.e. that their function arguments do not involve a lot of memory (e.g. non-small HTML strings or lxml trees as arguments)? Thanks


adbar commented Jun 30, 2022

@dmoklaf Yes, the main problem has been addressed: there are no lxml trees as arguments anymore. Besides, a few other lru_cache calls have been removed.

You may have to update htmldate manually in order to benefit from the fix; a new trafilatura version should be out soon.


dmoklaf commented Jun 30, 2022

Thanks - I did that last night, but I still see regular memory increases (with a few thousand docs I quickly reach 1 GB, and from there it keeps growing).

The use of global variables to hold data (in this case, the dictionaries hidden behind @lru_cache in both htmldate and trafilatura) is causing this and prevents me from telling the process "get rid of all your caches" once the crawling is done.

If you refactor htmldate and trafilatura's code in the next major version, it might be beneficial to encapsulate all the cache data structures (in fact, everything) in an object. This way, when clients of your library are done, they just get rid of this object and have a 100% guarantee that all the memory is freed (no global variables holding user data staying in the background).


adbar commented Jun 30, 2022

Thanks for your input, the idea seems interesting. Could you please point me to examples of the solution you describe?

In the meantime, you can try to set the CACHE_SIZE value in settings.py much lower (say 2 instead of 8192).


dmoklaf commented Jul 1, 2022

I have forced CACHE_SIZE=2 in both htmldate and trafilatura, and I can confirm that the memory growth is gone. It is definitely the various caches, which grow very large and can't be cleared afterwards.

A clean encapsulation into an object would look like this (illustrated here for Trafilatura, but the same principle applies to both libraries). Please read the very last point, because it simplifies the work a lot (to something that I think can be done in an hour). A consolidated sketch follows the list.

  1. Externally, clients of the library don't call global functions anymore. They first create an Extractor object with the proper cache configuration. This object provides public methods to do the job:

    extractor = trafilatura.Extractor(caches_size=1024)
    for document in documents:
        extraction = extractor.extract(document)
        ...

  2. When clients are done with the extraction, they just get rid of this Extractor object. As this object carries all the caching structures (see below), it guarantees that all the memory will be freed upon the next garbage collection.

  3. You thus define this class with a tiny skeleton that does nothing:

    class Extractor:

        def __init__(self, caches_size=4096):
            pass

  4. You remove all @lru_cache decorators.

  5. You move all public and private functions to methods of this new class, prefixing all the private methods (not to be called by your clients), like trim, with a single underscore:

    class Extractor:
        ...
        def _some_private_method(self, ...):
            ...
            trimmed_string = self._trim(string)
            ...

        def _trim(self, text):
            ...

  6. You verify that it works as before (it may be slow, as no caching is going on).

  7. You add back the caches with the following trick: in the constructor of Extractor, you wrap the methods you want to cache using the good old lru_cache:

    class Extractor:
        ...
        def __init__(self, caches_size=4096):
            self._cached_trim = functools.lru_cache(maxsize=caches_size)(self._trim)

  8. In all places that used to call self._trim(), you decide whether to call the cached version self._cached_trim instead. I believe in some cases you may not want to do that (I take trim as an example because it is called from many places, and I am not sure some callers aren't wasting cache resources on very long strings that will never reappear):

    class Extractor:
        ...
        def _some_private_method(self, ...):
            ...
            trimmed_string = self._cached_trim(string)
            ...

  9. The advantage of this approach is that, beyond guaranteeing a zero-memory footprint once the extraction is done, you give yourself the capability to decide selectively, on a per-call basis, whether to use caching.

  10. Your class may grow very large if you migrate all functions to methods, but in fact you only need to migrate the functions that need to be cached; the others can stay where they are, as private global functions.
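
(To make the proposal concrete, a minimal self-contained sketch of the end result; the class and method names are hypothetical, this is not trafilatura's actual API.)

import functools

class Extractor:
    def __init__(self, caches_size=4096):
        # the cache is bound to the instance, not to the module
        self._cached_trim = functools.lru_cache(maxsize=caches_size)(self._trim)

    def _trim(self, text):
        # core logic, no caching
        return ' '.join(text.split())

    def extract(self, document):
        # call sites choose the cached or the uncached variant
        return self._cached_trim(document)

extractor = Extractor(caches_size=1024)
print(extractor.extract('  hello   world  '))
del extractor  # the cache dies with the instance; nothing global survives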


adbar commented Jul 1, 2022

Hi @dmoklaf, thanks for the detailed explanations!

The extractor class makes perfect sense, but the trick with lru_cache looks a bit convoluted.
Would you be interested in drafting a pull request with just the extractor class and the trimming function? I could go on from there.


dmoklaf commented Jul 1, 2022

As a workaround, and maybe a simpler alternative, I have just cleared the caches myself every 1000 documents using the @lru_cache cache_clear() method.

I currently hit all the cached functions directly, which makes my code fragile (as these might change):

import htmldate.core
import htmldate.extractors
import htmldate.validators
import trafilatura.utils

def clear_htmldate_caches() -> None:
    htmldate.core.compare_reference.cache_clear()
    htmldate.extractors.try_date_expr.cache_clear()
    htmldate.validators.date_validator.cache_clear()
    htmldate.validators.filter_ymd_candidate.cache_clear()

def clear_trafilatura_caches() -> None:
    trafilatura.utils.line_processing.cache_clear()
    trafilatura.utils.trim.cache_clear()

This could become the target solution if each library provided such a public clear_caches() function. No need for an object wrapper, and it does the job.

The advantage of this approach is that you (as the designer of the library) keep control over the cache sizes; it's not one-size-fits-all, as some caches may need to be larger or smaller depending on how often each function is called and the frequency of its specific arguments.

What do you think?


kinoute commented Jul 1, 2022

Maybe we could just change/set a custom cache size easily when both libraries are imported?


dmoklaf commented Jul 1, 2022

Ah, that's a good idea. Unfortunately, @lru_cache is applied at import time, so once the module is imported it's too late: the caches are already set and can't be adjusted. The solution would be to rebuild them through a utility function.

We could have for each library a utility function that resets (including clearing) the caches to the appropriate size.

  1. The cached function code is slightly adjusted in each code file:

    def trim(...):
        ... (core code of trim, no caching)

    cached_trim = lru_cache(maxsize=utils.MAXSIZE)(trim)

  2. A global function is added to allow a client of the library to reset the caches:

    def reset_caches(maxsize=utils.MAXSIZE):
        utils.cached_trim = functools.lru_cache(maxsize=maxsize)(utils.trim)
        ...

  3. Individual calls within the library to cached functions are renamed accordingly: utils.trim(mystring) becomes utils.cached_trim(mystring).

The advantage of this approach is that, like the object approach, the library developer keeps the possibility, in specific cases, of calling trim instead of cached_trim: for example in places where the string is known to be very large and would waste process memory for nothing. I suspect this was one of the reasons for the very large memory consumption (but I may be wrong). That gives a lot of flexibility.


adbar commented Jul 1, 2022

Yes, either a reset_caches() function, or two functions trim()/cached_trim() plus an additional use_cache argument for the extraction functions?


dmoklaf commented Jul 2, 2022

Yes both solutions fit the bill.

Regarding the second solution, having the user provide an additional use_cache argument has a development drawback for you: you would have to propagate this argument in each and every function call, plus perform if/else statements each time to determine whether to use the cached or the non-cached function. A simpler approach is to provide the user with an optional maxsize argument in a reset_caches function (like the first solution, but with this optional argument):

  • If he wants no caching at all, he calls reset_caches(maxsize=0)
  • If he wants to reduce/increase caching, he calls reset_caches with the appropriate maxsize
  • If he wants to just clear the caches after having ended his crawling (to do other things with his daemon process), he calls reset_caches()

This function just rebuilds the cached version of the utility functions:

def reset_caches(maxsize=utils.MAXSIZE):
    utils.cached_trim = functools.lru_cache(maxsize=maxsize)(utils.trim)
    ...

That way, a single function answers all the needs and you don't have to propagate a flag everywhere.

The first solution is more limited functionally but sufficient too (for my needs).
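
(A self-contained sketch of this reset_caches() approach, using the same illustrative names as above; the real libraries' module layout may differ.)

import functools

MAXSIZE = 8192

def trim(text):
    # core logic, no caching
    return ' '.join(text.split())

# module-level cached variant, rebuilt on demand
cached_trim = functools.lru_cache(maxsize=MAXSIZE)(trim)

def reset_caches(maxsize=MAXSIZE):
    # dropping the old wrapper discards all cached entries at once;
    # maxsize=0 effectively disables caching, since lru_cache(maxsize=0)
    # stores nothing
    global cached_trim
    cached_trim = functools.lru_cache(maxsize=maxsize)(trim)

print(cached_trim('  a   b  '))
reset_caches(maxsize=0)  # no caching at all
reset_caches()           # rebuild at the default size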

PS: I tried to track the memory usage with the Python tracemalloc module and couldn't find the culprit. This is because tracemalloc tracks Python memory usage, not the memory usage of C modules like lxml. So I think you are still caching lxml arguments somewhere. I am not sure this is useful (i.e. will the cache ever return a cached value? what is your use case where this would occur?), but you know your code much better than me.


adbar commented Jul 4, 2022

Yes, this solution would be easier: bundling all functions using lru_cache into a reset_caches() function should do the trick. Besides, function_name.cache_clear() is enough.

Since caching lxml trees was a problem, I already checked, and I think it's not happening anymore. So the memory use must come from somewhere else. I found that the URL manipulation library (courlan) was using a bit of memory and trimmed it down in the last version.
As for the rest, I don't know, but please share your results if you find something!


kinoute commented Jul 23, 2022

@adbar So with the new release 1.3.0 and the addition of clear_caches(), what is the best way to avoid memory leaks? Use this after every find_date() call?

Edit: By the way, the changelog calls it clear_caches(), but I can only find reset_caches() in recent commits.


dmoklaf commented Jul 23, 2022

I have not used the new code yet, only my own workaround (while waiting for this fix). The workaround works similarly, clearing all the caches, but in both libraries (trafilatura and htmldate), while this fix seems to focus only on htmldate and its charset_normalizer dependency (which my workaround does not cover).

I call my workaround function to clear the caches every 1000 documents parsed. It has been running quite a lot, and the outcome is that clearing the caches works perfectly: I have processed a couple hundred thousand documents without any unbounded memory growth.

So my current understanding is that these were not leaks, but simply over-use of caches (a "default" configuration that would use several gigabytes of memory if never cleared).


kinoute commented Jul 23, 2022

Using the latest version:

from htmldate.meta import reset_caches
reset_caches()

I'm getting the following error:

AttributeError                            Traceback (most recent call last)
Input In [9], in <cell line: 1>()
----> 1 reset_caches()

Input In [8], in reset_caches()
     23 try:
     24     encoding_languages.cache_clear()
---> 25     is_suspiciously_successive_range.cache_clear()
     26     is_accentuated.cache_clear()
     27 # prevent possible changes in function names

AttributeError: 'function' object has no attribute 'cache_clear'


adbar commented Jul 25, 2022

@kinoute Do you have the latest version of charset_normalizer? It could be that you have an older version where the function doesn't use an LRU cache yet.

As @dmoklaf says, you're free to use the function whenever you see fit, for example every n URLs. A similar function for Trafilatura will follow (see adbar/trafilatura#219).


kinoute commented Jul 25, 2022

@adbar Indeed, upgrading to the latest version of charset_normalizer fixed the issue. Thanks!

Can't wait for the same feature in Trafilatura!

Edit: Just in case somebody has the same problem: there was a conflict between requests and charset_normalizer. requests needed charset_normalizer ~2.0.0; you need to upgrade requests to 2.28.1, which is compatible with charset_normalizer 2.1.0.
