
Segmentation fault when using pipe on large amounts of data #4663

Closed
gabeorlanski opened this issue Nov 16, 2019 · 6 comments
Labels
feat / ner (Feature: Named Entity Recognizer) · feat / parser (Feature: Dependency Parser) · more-info-needed (This issue needs more information) · perf / memory (Performance: memory use) · scaling (Scaling, serving and parallelizing spaCy)

Comments


gabeorlanski commented Nov 16, 2019

The code I had used when this error appeared:

import en_core_web_sm

nlp = en_core_web_sm.load()
docs = list(nlp.pipe(reports))  # pipe returns a generator; consume it to run

reports is a list of 2617 documents averaging 27K characters and 360 lines; in total it has about 71 million characters and is roughly 140 MB of data. This is a single group out of around 200 groups with a total of 400K documents, and it is on the smaller side, so I assume it would also break for the groups with many more documents. One possible contributing factor is that these reports come from parsing PDFs that include tables, so they tend to contain strange characters. I am decoding with UTF-8, but that could be causing issues in the parser. The tables in the documents do not have a uniform format, so they are quite hard to strip out.
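For context, a rough sketch of how the texts end up going through the pipe (the batch size and the per-doc work here are only illustrative, not necessarily what triggers the crash):

import en_core_web_sm

def run_pipe(reports, batch_size=50):
    # Stream the texts through nlp.pipe in smaller batches instead of
    # materializing every Doc at once; batch_size is an arbitrary choice.
    nlp = en_core_web_sm.load()
    for doc in nlp.pipe(reports, batch_size=batch_size):
        # Keep only something small per document so the Doc can be freed.
        yield [(ent.text, ent.label_) for ent in doc.ents]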

Environment:

  • spaCy version: 2.2.2
  • Platform: Linux-3.10.0-1062.4.1.el7.x86_64-x86_64-with-centos-7.7.1908-Core
  • Python version: 3.7.4
  • RAM: 128 GB

One other thing I noticed was that I got an error when trying to run nlp(report) individually using multiprocessing on the batched reports. The error I was getting:

error("'i' format requires -2147483648 <= number <= 2147483647")

It came after the entire document had been printed out. The code I had used for that was:

import multiprocessing as mp
from functools import partial

def parseBatch(nlp, doc_batch):
    return [nlp(report) for report in doc_batch]

# Bind nlp first; imap_unordered then gets one batch per work item.
with mp.Pool(cores) as p:
    results = list(p.imap_unordered(partial(parseBatch, nlp), batches))

I googled the error and found that it is related to multiprocessing and joblib, so the segfault could be on their end. In terms of memory usage while running pipe, it never went above 50%. I also tested this with and without the documents that have more than 100K characters, but the issue persisted.
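If it helps anyone hitting the same struct error: my understanding (I could be wrong) is that multiprocessing pickles whatever it sends between processes, and before Python 3.8 a single pickled payload larger than ~2 GB overflows the 32-bit length field, which produces exactly this message. A rough sketch of what a workaround might look like, loading the model once per worker and returning only small tuples instead of Doc objects (reports is the same list of texts as above; the pool size and chunksize are arbitrary):

import multiprocessing as mp
import en_core_web_sm

nlp = None

def init_worker():
    # Load the model once in each worker process instead of pickling it.
    global nlp
    nlp = en_core_web_sm.load()

def extract_entities(report):
    # Return small, picklable tuples so the data sent back over the pipe
    # stays well under the ~2 GB pickle limit.
    doc = nlp(report)
    return [(ent.text, ent.label_) for ent in doc.ents]

if __name__ == "__main__":
    with mp.Pool(4, initializer=init_worker) as pool:
        results = list(pool.imap_unordered(extract_entities, reports, chunksize=16))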


gabeorlanski commented Nov 17, 2019

I did some testing with this code (outdated, see the edit below). You need tqdm and hurry.filesize to run it, or just comment out the filesize print and the tqdm when opening files. Unfortunately, I cannot share the data, but the 3000 documents had:

23,898.62 average characters per document
71,695,872 total characters
308.72 average lines per document
Total size of reports: 139 MB

This was on the second iteration, with the first being 500 documents and the second 750 documents (which included those from the first). The relevant part of the traceback:

File ".../venv/lib/python3.7/site-packages/spacy/util.py", line 560 in minibatch
  File ".../venv/lib/python3.7/site-packages/spacy/language.py", line 815 in pipe
  File "spacy_segfault.py", line 63 in main

I was able to repeat this error, this time starting at 600 documents on the first run as well as starting at 750; it gave me a segfault only at 750, in the same exact spot every time. Running on the same 500 multiple times does cause this issue. The error did not occur at 750 items when I started with 500 and then went up in intervals of 250.

Update: On the 2500 test documents where I first encountered this issue, I realized through some debugging that 9 of those documents were completely unreadable. For reference, here are the first few lines:

����� � � �� ��� �������� ������ � ���� 
 �� ��� ���� - ��� � �
 "������ "�$�% 

These appeared only in the set of files where I had originally encountered this error. However, I still got a segfault on these documents (after removing any with bad characters) when running a batch of 500, an amount that, according to the previous tests, should work.
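For anyone dealing with similar PDF extractions, a rough sketch of the kind of filter involved (illustrative only; the 5% threshold and the heuristic itself are arbitrary, not the exact code I used):

def looks_readable(text, max_bad_ratio=0.05):
    # Treat a document as garbage if too many of its characters are the
    # U+FFFD replacement character or other non-printable junk.
    if not text.strip():
        return False
    bad = sum(1 for ch in text
              if ch == "\ufffd" or (not ch.isprintable() and ch not in "\n\t\r "))
    return bad / len(text) <= max_bad_ratio

clean_reports = [r for r in reports if looks_readable(r)]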

Edit: New testing code. It is very crude at the moment, but it seems to work reasonably well for stress testing my use case.

adrianeboyd added the feat / ner, feat / parser, perf / memory, and scaling labels on Nov 18, 2019
adrianeboyd (Contributor) commented:

It's hard for us to debug this without the data/details to reproduce the issue. If you can pinpoint a particular text that's causing the problem, do you think you could create anonymized/dummy data that reproduces it?

Can you experiment with disabling components and running them individually (tagger, parser, and NER individually) to figure out which one is causing the problem? The parser and NER are actually similar enough underneath that it could turn out to be the same issue in both, but I could imagine that the parser is more likely to have problems with long strings of gibberish because it wouldn't be able to split sentences well. The tagger is less likely to be the problem, but it wouldn't hurt to test it, too.
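Untested, but something roughly like this should let you test each component in isolation (reports standing in for your list of texts; the batch size is arbitrary):

import en_core_web_sm

# Run the pipeline once per trained component, with the other two disabled,
# to see which one triggers the crash.
for keep in ("tagger", "parser", "ner"):
    nlp = en_core_web_sm.load(disable=[name for name in ("tagger", "parser", "ner") if name != keep])
    print("Testing:", keep, "-> pipeline:", nlp.pipe_names)
    for doc in nlp.pipe(reports, batch_size=50):
        pass  # just force each Doc to be processed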

The parser/NER memory leak has been fixed in v2.2.2, so reloading nlp isn't necessary (and just reloading the model doesn't actually help).

For the multiprocessing/pickle error, see:

https://stackoverflow.com/q/47776486/461847

There could be some useful information in the issue tracker if you look at the perf/* and scaling tags.

adrianeboyd (Contributor) commented:

It could also be related to the error reported here (reported in the memory leak thread, but not due to the same cause as the slow memory leak that was fixed): #3618 (comment)

gabeorlanski (Author) commented:

@adrianeboyd I got the idea to make the stress test based on that issue. The segfault itself seemed to be fixed by removing all of those garbage documents. However, even with 1500 documents (31 MB), it never finished after running for around an hour. I have found that (for my case, at least) I get much better results using a subclassed multiprocessing.Process and queues, roughly like the sketch below.
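A rough sketch of that pattern (not my exact code; the worker count and the entity-tuple payload are only for illustration, and reports is the same list of texts as before):

import multiprocessing as mp
import en_core_web_sm

class NLPWorker(mp.Process):
    # Each worker loads its own copy of the model, pulls raw texts off an
    # input queue, and pushes small, picklable results onto an output queue.
    def __init__(self, in_queue, out_queue):
        super().__init__()
        self.in_queue = in_queue
        self.out_queue = out_queue

    def run(self):
        nlp = en_core_web_sm.load()
        while True:
            report = self.in_queue.get()
            if report is None:  # sentinel: no more work
                break
            doc = nlp(report)
            self.out_queue.put([(ent.text, ent.label_) for ent in doc.ents])

if __name__ == "__main__":
    in_q, out_q = mp.Queue(), mp.Queue()
    workers = [NLPWorker(in_q, out_q) for _ in range(4)]
    for w in workers:
        w.start()
    for report in reports:
        in_q.put(report)
    for _ in workers:
        in_q.put(None)  # one sentinel per worker
    # Drain the output queue before joining to avoid blocking on full pipes.
    results = [out_q.get() for _ in range(len(reports))]
    for w in workers:
        w.join()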

In terms of the data, I will take a look to see if I can anonymize it so that it can be of some use to you. And I will test out those different components individually and get back to you on that front.

One theory I have is that the speed issues are almost entirely due to the lack of formatting in the documents, since they come from pdftotext and the way it formats its output. I have been working a lot on that end recently and hope to come up with a solution that might be of some use.

adrianeboyd added the more-info-needed label on Nov 22, 2019

no-response bot commented Dec 6, 2019

This issue has been automatically closed because there has been no response to a request for more information from the original author. With only the information that is currently in the issue, there's not enough information to take action. If you're the original author, feel free to reopen the issue if you have or find the answers needed to investigate further.

no-response bot closed this as completed on Dec 6, 2019

lock bot commented Jan 5, 2020

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked as resolved and limited conversation to collaborators on Jan 5, 2020