Certain input creates extremely long runtime and memory leak #29

Open
radeklat opened this issue Apr 12, 2014 · 15 comments

@radeklat commented Apr 12, 2014

I am using chardet as part of a web crawler written in Python 3. I noticed that over time (many hours), the program consumes all available memory. I narrowed the problem down to single calls of the chardet.detect() method on certain web pages.

After some testing, it seems that chardet has a problem with certain inputs, and I managed to get a sample of one. On my machine it consumes about 220 MB of memory (even though the input is only 2.5 MB) and takes about 1 minute 22 seconds to process (in contrast to 43 ms when the file is truncated to about 2 MB). The problem is not limited to Python 3; in Python 2 the memory consumption is even worse (312 MB).

Versions:

Fedora release 20 (Heisenbug) x86_64
chardet-2.2.1 (via pip)
python3-3.3.2-11.fc20.x86_64
python-2.7.5-11.fc20.x86_64

How to reproduce:

I cannot attach files to this issue, so I uploaded them to my Dropbox account: https://www.dropbox.com/sh/26dry8zj18cv0m1/sKgP_E44qx/chardet_test.zip Please let me know if there is a better place to put them. Here is an overview of the contents and the results:

setup='import chardet; html = open("mem_leak_html.txt", "rb").read()'
python3 -m timeit -s "$setup"  'chardet.detect(html[:2543482])'
# produces: 10 loops, best of 3: 43 ms per loop
python3 -m timeit -s "$setup"  'chardet.detect(html[:2543483])'
# produces: 1 loops, best of 3: 1min 22s per loop
python3 mem_leak_test.py
# produces:
# Good input left 2.65 MB of unfreed memory.
# Bad input left 220.16 MB of unfreed memory.

python -m timeit -s "$setup"  'chardet.detect(html[:2543482])'
# produces: 10 loops, best of 3: 41.7 ms per loop
python -m timeit -s "$setup"  'chardet.detect(html[:2543483])'
# produces: 10 loops, best of 3: 111 sec per loop
python mem_leak_test.py
# produces:
# Good input left 3.00 MB of unfreed memory.
# Bad input left 312.00 MB of unfreed memory.
mem_leak_test.py:
import gc
import resource

import chardet

# On Linux, ru_maxrss reports the peak resident set size in kilobytes,
# so dividing by 1024 yields MB. (On macOS the value is in bytes.)
mem_use = lambda: resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024
html = open("mem_leak_html.txt", "rb").read()

def test(desc, instr):
    # Collect garbage before and after the call so that uncollected
    # garbage cannot account for the difference in reported memory usage.
    gc.collect()
    mem_start = mem_use()
    chardet.detect(instr)
    gc.collect()
    mem_used = mem_use() - mem_start
    print('%s left %.2f MB of unfreed memory.' % (desc, mem_used))

test('Good input', html[:2543482])
test('Bad input', html[:2543483])
@dan-blanchard (Member) commented Apr 15, 2014

If performance is an issue, I'd recommend you just use cChardet.

@sigmavirus24 (Member) commented Apr 16, 2014

If there is in fact a memory leak though, we should try to address it.

@radeklat (Author) commented Apr 16, 2014

I don't really mind the long processing time because it does not happen very frequently. I mind that after two days my program consumes 6 GB of memory because of a single chardet call. I could provide more examples of input data that cause trouble if needed.

@sigmavirus24 (Member) commented Apr 16, 2014

I'm going to work on building Python 2.7 with Victor Stinner's tracemalloc extension to find the cause of the memory leak. Thanks for all of the details, @radeklat.

@dan-blanchard (Member) commented Apr 16, 2014

> If there is in fact a memory leak though, we should try to address it.

I agree. Sorry, I didn't mean to give the impression that this was something we were never going to address. I just meant that if you want it to be fast and efficient, then cChardet is the way to go.

@radeklat (Author) commented Sep 21, 2014

Since the last comment is over five months old, I'd like to ask: is there any status update on this issue? Is there any way I can help resolve it?

@sigmavirus24 (Member) commented Oct 4, 2014

@radeklat I haven't had a chance to look into this. I just saw https://github.com/jtushman/memory_utils mentioned and it reminded me of this bug.

@rsnair2 (Contributor) commented Nov 29, 2014

@sigmavirus24 I am working on profiling this and narrowing down the exact functions/blocks where the problem is happening. I will be back with more info when I am done. If you still need help, is it all right if I join in? This is my first time working on an open source project and I would love to help!

@sigmavirus24 (Member) commented Nov 29, 2014

@rsnair2 we absolutely need the help. I'm very excited that you're willing to help out. Feel free to ping me with questions via email.

@rsnair2 (Contributor) commented Nov 30, 2014

Thanks, @sigmavirus24. Will do, and I will keep you updated.

@rsnair2 (Contributor) commented Dec 14, 2014

So from what I have gathered so far, the "bad input" essentially contains at least one high byte (>= 128). When there are no high bytes (and no escape characters), the detector quickly concludes that the encoding is "ascii", since it does not have to run any of the probers. A single high byte (in the current case at offset 2543482) forces the detector to use the SBCSGroupProber, which makes the program take significantly more time and consume more memory. This is expected behavior.
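
For anyone who wants to verify this locally, a minimal sketch (it assumes, per the analysis above, that the byte at offset 2543482 of the sample file is its first high byte; the exact result dict varies between chardet versions):

import chardet

html = open("mem_leak_html.txt", "rb").read()

# Assumption: offset 2543482 holds the first byte >= 128 in the sample,
# so html[:2543483] is the shortest prefix that is not pure ASCII.
print(html[2543482] >= 128)            # expected: True

print(chardet.detect(html[:2543482]))  # fast: detected as 'ascii'
print(chardet.detect(html[:2543483]))  # slow: full SBCSGroupProber run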

As for the memory, I believe that ru_maxrss measures the high-watermark memory usage - i.e., the maximum memory the process has used at any point in time. So the difference reported above doesn't necessarily imply a memory leak.

I am still fairly new to memory debugging in Python, so I was wondering if you could share your opinion on this, @sigmavirus24.
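
One way to settle this is tracemalloc, which is part of the standard library from Python 3.4 on (so no extension build is needed there). A minimal sketch, assuming the sample file from above: memory still reachable after the call suggests a leak, while a high peak with little retained memory suggests transient usage.

import tracemalloc

import chardet

html = open("mem_leak_html.txt", "rb").read()

tracemalloc.start()
before = tracemalloc.take_snapshot()
chardet.detect(html[:2543483])
after = tracemalloc.take_snapshot()

# current = memory still held by Python objects, peak = high watermark
current, peak = tracemalloc.get_traced_memory()
print('retained: %.2f MB, peak: %.2f MB' % (current / 2**20, peak / 2**20))

# Top five allocation sites that grew across the call
for stat in after.compare_to(before, 'lineno')[:5]:
    print(stat)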

@radeklat (Author) commented Dec 14, 2014

It is possible that ru_maxrss measures the peak memory usage. So yes, in this case you cannot tell whether there is a memory leak or just high memory usage. However, I think the memory leak manifested itself pretty clearly when my program grew to 6 GB of memory after several calls with different problematic data (which I didn't include but can provide if necessary).

@rsnair2 (Contributor) commented Dec 14, 2014

Hey @radeklat, thanks for replying! It would be very useful if you could send me the problematic data so that I can replicate the problem as closely as possible.

@sigmavirus24 (Member) commented Dec 14, 2014

So here's the thing: RSS, when we're talking about memory, usually means RAM allocated in pages. That number will never shrink unless other memory-intensive programs need the pages and the computer page-faults and has to reallocate them. The thing is, a well-behaved program should never be allocated 6 GB of RAM in one run. That said, if at any point in time we're using 6 GB of memory, that's absolutely a problem we need to fix.
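
For what it's worth, on Linux you can watch both numbers directly. A minimal sketch (Linux-only, reading /proc/self/status; VmRSS is the current resident set and VmHWM the peak, which is what ru_maxrss reports):

import chardet

def vm_stats():
    # Pull the current (VmRSS) and peak (VmHWM) resident set sizes, in kB.
    stats = {}
    with open('/proc/self/status') as f:
        for line in f:
            if line.startswith(('VmRSS', 'VmHWM')):
                key, value = line.split(':')
                stats[key] = int(value.split()[0])
    return stats

html = open("mem_leak_html.txt", "rb").read()
print(vm_stats())
chardet.detect(html[:2543483])
print(vm_stats())  # VmHWM stays high even after memory is returned to the allocator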

@jackdied commented Oct 11, 2016

Easier test case: the fast/slow behavior can be reproduced with these inputs:

bytes([120]) + b'x' * 10000  # 10000 x's with one more 'x' (byte 120) in front: fast
bytes([169]) + b'x' * 10000  # 10000 x's with a copyright glyph (byte 169) in front: slow

It doesn't matter whether the copyright character is at the beginning or the end.
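
A quick way to time the two cases (a sketch; absolute numbers will vary by machine and chardet version):

import timeit

setup = 'import chardet'
fast = "chardet.detect(bytes([120]) + b'x' * 10000)"  # pure ASCII
slow = "chardet.detect(bytes([169]) + b'x' * 10000)"  # one high byte

# Total seconds for 10 runs of each statement
print(timeit.timeit(fast, setup=setup, number=10))
print(timeit.timeit(slow, setup=setup, number=10))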

JustAnotherArchivist added a commit to JustAnotherArchivist/qwarc that referenced this issue Jul 25, 2019
chardet can be very slow (chardet/chardet#29 psf/requests#2359) and the decoding may be unnecessary if it's binary content.