Certain input creates extremely long runtime and memory leak #29

Open
radeklat opened this Issue Apr 12, 2014 · 15 comments


radeklat commented Apr 12, 2014

I am using chardet as part of a web crawler written in Python 3. I noticed that over time (many hours), the program consumes all available memory. I narrowed the problem down to single calls of chardet.detect() on certain web pages.

After some testing, it seems that chardet has a problem with certain inputs, and I managed to get a sample of one. On my machine it consumes about 220 MB of memory (although the input is only 2.5 MB) and takes about 1 minute 22 seconds to process (in contrast to 43 ms when the file is truncated to about 2 MB). The problem is not limited to Python 3; in Python 2 the memory consumption is even worse (312 MB).

Versions:

Fedora release 20 (Heisenbug) x86_64
chardet-2.2.1 (via pip)
python3-3.3.2-11.fc20.x86_64
python-2.7.5-11.fc20.x86_64

How to reproduce:

I cannot attach files to this issue, so I uploaded them to my Dropbox account: https://www.dropbox.com/sh/26dry8zj18cv0m1/sKgP_E44qx/chardet_test.zip Please let me know if there is a better place to put them. Here is an overview of the contents and the results:

setup='import chardet; html = open("mem_leak_html.txt", "rb").read()'
python3 -m timeit -s "$setup"  'chardet.detect(html[:2543482])'
# produces: 10 loops, best of 3: 43 ms per loop
python3 -m timeit -s "$setup"  'chardet.detect(html[:2543483])'
# produces: 1 loops, best of 3: 1min 22s per loop
python3 mem_leak_test.py
# produces:
# Good input left 2.65 MB of unfreed memory.
# Bad input left 220.16 MB of unfreed memory.

python -m timeit -s "$setup"  'chardet.detect(html[:2543482])'
# produces: 10 loops, best of 3: 41.7 ms per loop
python -m timeit -s "$setup"  'chardet.detect(html[:2543483])'
# produces: 10 loops, best of 3: 111 sec per loop
python mem_leak_test.py
# produces:
# Good input left 3.00 MB of unfreed memory.
# Bad input left 312.00 MB of unfreed memory.

mem_leak_test.py:
import resource
import chardet
import gc

mem_use = lambda: resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024
html = open("mem_leak_html.txt", "rb").read()

def test(desc, instr):
    gc.collect()
    mem_start = mem_use()
    chardet.detect(instr)    
    gc.collect()
    mem_used = mem_use() - mem_start
    print('%s left %.2f MB of unfreed memory.' % (desc, mem_used))    

test('Good input', html[:2543482])
test('Bad input', html[:2543483])
dan-blanchard (Member) commented Apr 15, 2014

If performance is an issue, I'd recommend you just use cChardet.

sigmavirus24 (Member) commented Apr 16, 2014

If there is in fact a memory leak though, we should try to address it.

radeklat commented Apr 16, 2014

I don't really mind the long processing time because it does not happen very frequently. I mind that after two days my program consumes 6 GB of memory because of a single chardet call. I could provide more examples of input data that cause trouble if needed.

sigmavirus24 (Member) commented Apr 16, 2014

I'm going to work on building python 2.7 with Victor Stinner's tracemalloc extension to find the cause of the memory leak. Thanks for all of the details @rlat
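As an aside for anyone reproducing this today: the tracemalloc extension mentioned above was merged into the standard library in Python 3.4, so no custom interpreter build is needed there. A minimal sketch of snapshot-based leak hunting follows; `work()` is a hypothetical stand-in allocation in place of the chardet.detect() call under investigation, since the point here is only the measurement pattern.

```python
# Sketch: locate where memory is allocated by diffing two tracemalloc
# snapshots. `work()` is a stand-in for the call being investigated.
import tracemalloc


def work():
    # Allocates roughly 1 MB across 1000 distinct bytes objects.
    return [b"x" * 1000 for _ in range(1000)]


tracemalloc.start()
before = tracemalloc.take_snapshot()
result = work()
after = tracemalloc.take_snapshot()

# Statistics are grouped by source line; the top entries show where
# the new memory between the two snapshots was allocated.
stats = after.compare_to(before, "lineno")
for stat in stats[:3]:
    print(stat)
```

Unlike ru_maxrss (used in the original report), this shows live allocations attributed to source lines, which is what you want when deciding whether memory is actually leaked or merely peaked.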

dan-blanchard (Member) commented Apr 16, 2014

> If there is in fact a memory leak though, we should try to address it.

I agree. Sorry, I didn't mean to give the impression that this was something we were never going to address. I just meant that if you want it to be fast and efficient, then cChardet is the way to go.

radeklat commented Sep 21, 2014

Since the last comment is over five months old, I'd like to ask: is there any status update on this issue? Is there any way I can help resolve it?

sigmavirus24 (Member) commented Oct 4, 2014

@rlat I haven't had a chance to look into this. I just saw https://github.com/jtushman/memory_utils mentioned and it reminded me of this bug.

rsnair2 (Contributor) commented Nov 29, 2014

@sigmavirus24 I am working on profiling this out and narrowing down the exact function/blocks where the problem is happening. I will be back with more info when I am done. If you still need help, is it alright if I join in? This is kinda my first time working on an open source project and I would love to help!

sigmavirus24 (Member) commented Nov 29, 2014

@rsnair2 we absolutely need the help. I'm very excited that you're willing to help out. Feel free to ping me with questions via email.

rsnair2 (Contributor) commented Nov 30, 2014

Thanks @sigmavirus24. Will do, and I will keep you updated.

rsnair2 (Contributor) commented Dec 14, 2014

So from what I have gathered so far, the "bad input" essentially contains at least one high byte (>= 128). When there is no high byte (and no escape characters), the detector quickly concludes that the encoding is "ascii", as it does not have to run any of the probers. On the other hand, a single high byte (in this case at offset 2543482) causes the detector to use the SBCSGroupProber, which makes the program take significantly more time and consume more memory. This is expected behavior.

As for the memory, I believe that ru_maxrss measures the high watermark memory usage - i.e., the maximum memory that has been used by the process at some point of time. So that difference doesn't necessarily imply a memory leak.

I am still fairly new to memory debugging in python, so I was wondering if you could share your opinion on this @sigmavirus24.
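The high-water-mark behavior of ru_maxrss described above is easy to verify directly. A minimal sketch, using only the stdlib resource module (Unix-only; the units of ru_maxrss are kilobytes on Linux but bytes on macOS):

```python
# Demonstrates that ru_maxrss reports the *peak* resident set size:
# after freeing a large allocation, the value does not drop back down.
import gc
import resource


def max_rss():
    # Peak RSS so far: kilobytes on Linux, bytes on macOS.
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss


before = max_rss()
blob = bytearray(50 * 1024 * 1024)  # allocate (and zero-fill) ~50 MB
during = max_rss()
del blob
gc.collect()
after = max_rss()

# `after` stays at the peak reached while `blob` was alive.
print(before, during, after)
```

This is why the numbers in mem_leak_test.py above show peak consumption rather than leaked memory: the difference proves chardet allocated a lot at some point, not that the memory was never freed.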

radeklat commented Dec 14, 2014

It is possible that ru_maxrss measures the peak memory usage. So yes, in this case you cannot tell whether there is a memory leak or just high memory usage. However, I think the memory leak manifested itself pretty clearly when my program grew to 6 GB of memory after several calls with different problematic data (which I didn't include but can provide if necessary).

rsnair2 (Contributor) commented Dec 14, 2014

Hey @rlat, thanks for replying! It would be very useful if you could pass me the problematic data so that I can try to replicate the problem as closely as possible.

sigmavirus24 (Member) commented Dec 14, 2014

So here's the thing: rss, when talking about memory, usually means RAM allocated in pages. That number will never shrink unless other memory-intensive programs need the pages and the computer page-faults and has to reallocate them. The thing is, a well-behaved program shouldn't ever be allocated 6 GB of RAM in one run. So if at any point in time we're using 6 GB of memory, that's absolutely a problem we need to fix.

jackdied commented Oct 11, 2016

Easier test case: the fast/slow behavior can be reproduced with inputs of
bytes([120]) + b'x' * 10000  # 10000 x's with an x in front
bytes([169]) + b'x' * 10000  # 10000 x's with a copyright glyph in front

It doesn't matter if the copyright character is at the beginning or end.
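The two inputs above differ in a single byte, which is exactly the condition rsnair2 identified earlier in the thread. An illustrative sketch (not chardet's real implementation) of the fast-path check that separates them:

```python
# Illustrative only: a detector can short-circuit to "ascii" when every
# byte is below 128; a single high byte anywhere forces the much more
# expensive single-byte-charset prober machinery to run instead.
def has_high_byte(data: bytes) -> bool:
    return any(b >= 128 for b in data)


fast = bytes([120]) + b'x' * 10000  # all ASCII -> cheap fast path
slow = bytes([169]) + b'x' * 10000  # 0xA9 (copyright glyph) -> slow path

print(has_high_byte(fast))  # False
print(has_high_byte(slow))  # True
```

Because the scan covers the whole buffer, the position of the high byte is irrelevant, which matches the observation that beginning vs. end makes no difference.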
