Object too large error in preprocessing script #102

Closed
ahalterman opened this issue Apr 14, 2020 · 4 comments · Fixed by #103

@ahalterman (Contributor) commented on Apr 14, 2020:

I've been getting a "bytes object is too large" error when processing a large-ish number of documents with the 01_parse.py script. Creating several smaller doc_bin objects resolves the issue (a sketch of that workaround follows the traceback). Full error:

ahalt@xxxxxxxx:~/sense2vec$ python sense2vec/scripts/01_parse.py hindu_complete.txt docbins en_core_web_sm -n 10
ℹ Using spaCy model en_core_web_sm
Preprocessing text...
Docs: 267103 [1:00:38, 73.42/s]
✔ Processed 267103 docs
Traceback (most recent call last):
  File "sense2vec/scripts/01_parse.py", line 47, in <module>
    plac.call(main)
  File "/home/ahalt/anaconda3/lib/python3.6/site-packages/plac_core.py", line 328, in call
    cmd, result = parser.consume(arglist)
  File "/home/ahalt/anaconda3/lib/python3.6/site-packages/plac_core.py", line 207, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "sense2vec/scripts/01_parse.py", line 39, in main
    doc_bin_bytes = doc_bin.to_bytes()
  File "/home/ahalt/anaconda3/lib/python3.6/site-packages/spacy/tokens/_serialize.py", line 151, in to_bytes
    return zlib.compress(srsly.msgpack_dumps(msg))
  File "/home/ahalt/anaconda3/lib/python3.6/site-packages/srsly/_msgpack_api.py", line 16, in msgpack_dumps
    return msgpack.dumps(data, use_bin_type=True)
  File "/home/ahalt/anaconda3/lib/python3.6/site-packages/srsly/msgpack/__init__.py", line 40, in packb
    return Packer(**kwargs).pack(o)
  File "_packer.pyx", line 285, in srsly.msgpack._packer.Packer.pack
  File "_packer.pyx", line 291, in srsly.msgpack._packer.Packer.pack
  File "_packer.pyx", line 288, in srsly.msgpack._packer.Packer.pack
  File "_packer.pyx", line 235, in srsly.msgpack._packer.Packer._pack
  File "_packer.pyx", line 206, in srsly.msgpack._packer.Packer._pack
ValueError: bytes object is too large
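
For reference, here is a minimal sketch of that splitting workaround, assuming spaCy's DocBin API and the spacy.load / nlp.pipe pattern that 01_parse.py uses; the function name, output filename pattern, and max_docs threshold are illustrative, not taken from the script:

# Flush a fresh DocBin to disk every `max_docs` documents instead of packing
# everything into one DocBin, whose serialized payload can exceed msgpack's
# bytes-object size limit.
from pathlib import Path

import spacy
from spacy.tokens import DocBin

def parse_in_chunks(texts, model="en_core_web_sm", out_dir="docbins", max_docs=50000):
    nlp = spacy.load(model)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    attrs = ["POS", "TAG", "DEP", "ENT_TYPE", "ENT_IOB"]
    doc_bin = DocBin(attrs=attrs)
    part = 0
    for doc in nlp.pipe(texts):
        doc_bin.add(doc)
        if len(doc_bin) >= max_docs:
            # Write this chunk and start a new, empty DocBin.
            (out_dir / f"part_{part}.spacy").write_bytes(doc_bin.to_bytes())
            doc_bin = DocBin(attrs=attrs)
            part += 1
    if len(doc_bin):
        # Flush whatever is left over.
        (out_dir / f"part_{part}.spacy").write_bytes(doc_bin.to_bytes())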
@svlandeg added the bug and enhancement labels on Apr 14, 2020
@ahalterman (Contributor, Author) commented:

If you end up splitting the output files in the 01_parse.py script, you can easily run the preprocessing script over each of them using GNU parallel:

find docbins/ -name '*.spacy' | parallel --jobs 10 python sense2vec/scripts/02_preprocess.py {} s2v_format/ en_core_web_sm 

@dshefman1 commented on Apr 15, 2020:

I had the same problem; see the error message below. After doing some more preprocessing, however, I no longer get the "bytes object is too large" ValueError. The preprocessing steps were: (1) removed duplicate sentences, (2) stripped whitespace from sentence ends, (3) removed sentences longer than 2,520 characters, and (4) removed sentences shorter than 11 characters. These four steps cut my dataset by 74%, from 7,487,357 sentences to 1,978,295, so I'm not sure which of them fixed the problem, but the error is gone. (A sketch of these filters follows the traceback.)

~/sense2vec$ python scripts/01_parse.py ../corpus2.txt ../corpus_parsed2 en_core_web_lg --n 14
✔ Created output directory ../corpus_parsed2
ℹ Using spaCy model en_core_web_lg
Preprocessing text...
Docs: 7487357 [57:44, 2161.41/s]
✔ Processed 7487357 docs
Traceback (most recent call last):
  File "01_parse.py", line 45, in <module>
    plac.call(main)
  File "~\AppData\Local\Continuum\anaconda3\lib\site-packages\plac_core.py", line 328, in call
    cmd, result = parser.consume(arglist)
  File "~\AppData\Local\Continuum\anaconda3\lib\site-packages\plac_core.py", line 207, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "01_parse.py", line 37, in main
    doc_bin_bytes = doc_bin.to_bytes()
  File "~\AppData\Local\Continuum\anaconda3\lib\site-packages\spacy\tokens\_serialize.py", line 151, in to_bytes
    return zlib.compress(srsly.msgpack_dumps(msg))
  File "~\AppData\Local\Continuum\anaconda3\lib\site-packages\srsly\_msgpack_api.py", line 16, in msgpack_dumps
    return msgpack.dumps(data, use_bin_type=True)
  File "~\AppData\Local\Continuum\anaconda3\lib\site-packages\srsly\msgpack\__init__.py", line 40, in packb
    return Packer(**kwargs).pack(o)
  File "_packer.pyx", line 285, in srsly.msgpack._packer.Packer.pack
  File "_packer.pyx", line 291, in srsly.msgpack._packer.Packer.pack
  File "_packer.pyx", line 288, in srsly.msgpack._packer.Packer.pack
  File "_packer.pyx", line 235, in srsly.msgpack._packer.Packer._pack
  File "_packer.pyx", line 206, in srsly.msgpack._packer.Packer._pack
ValueError: bytes object is too large
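
For reference, a rough sketch of those four filtering steps, assuming one sentence per line in a plain-text corpus file; the function name, file handling, and default thresholds (taken from the description above) are illustrative:

# Strip whitespace, drop sentences outside the 11-2520 character range, and
# skip duplicates, writing the surviving sentences to a new file.
def filter_sentences(in_path, out_path, min_len=11, max_len=2520):
    seen = set()
    with open(in_path, encoding="utf8") as fin, open(out_path, "w", encoding="utf8") as fout:
        for line in fin:
            sent = line.strip()
            if len(sent) < min_len or len(sent) > max_len:
                continue  # too short or too long
            if sent in seen:
                continue  # duplicate
            seen.add(sent)
            fout.write(sent + "\n")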

@ahalterman (Contributor, Author) commented:

How big are each of your documents? Is each one a sentence or is it more like a news article? Mine are around 500 words/3000-4000 characters, so if yours are sentence-length that could keep you below the memory limit. (That could also explain why you're getting 2,000 docs/second and I'm getting 100/second on 14 cores.)

In general, though, it's not ideal to have to trim the corpus to avoid the serialization error. I'm about to train vectors on a much larger corpus of text, so I'll see how the splitting solution in #103 works.

@dshefman1 commented:

> How big are each of your documents? Is each one a sentence or is it more like a news article?

Each of my documents is a sentence of about 120 characters on average, so I agree with your assessment.

This issue was closed.