
Errors when indexing wikipedia #1

Closed
thatguysimon opened this issue Dec 28, 2016 · 2 comments

@thatguysimon

Hi,
I've been trying to set up snowball with Wikipedia.
I'm at the step of indexing Wikipedia into Elasticsearch, which I do by running:
python index/index.py
(By the way, I believe this step is missing from the README.)

The code successfully starts the CoreNLP server, but it then runs into the following error:

...
...
INFO:CoreNLP_PyWrapper:Successful ping. The server has started.
INFO:CoreNLP_PyWrapper:Subprocess is ready.
Python(1828,0x7fffefa9d3c0) malloc: *** mach_vm_map(size=8872781095355314176) failed (error code=3)
*** error: can't allocate region
*** set a breakpoint in malloc_error_break to debug
Process Process-2:
Traceback (most recent call last):
File "/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/multiprocessing/process.py", line 114, in run
self._target(*self._args, **self._kwargs)
File "index/index.py", line 35, in work
parse = parser.combined_parse(xml, text, self.corenlp)
File "build/bdist.macosx-10.12-x86_64/egg/corenlp/parser.py", line 221, in combined_parse
sentences = get_sentences(text, corenlp)
File "build/bdist.macosx-10.12-x86_64/egg/corenlp/parser.py", line 203, in get_sentences
sents = parse(text, corenlp)['sentences']
File "build/bdist.macosx-10.12-x86_64/egg/corenlp/parser.py", line 88, in parse
return corenlp.parse_doc(text)
File "/Library/Python/2.7/site-packages/stanford_corenlp_pywrapper/sockwrap.py", line 226, in parse_doc
return self.send_command_and_parse_result(cmd, timeout, raw=raw)
File "/Library/Python/2.7/site-packages/stanford_corenlp_pywrapper/sockwrap.py", line 246, in send_command_and_parse_result
data = self.send_command_and_get_string_result(cmd, timeout)
File "/Library/Python/2.7/site-packages/stanford_corenlp_pywrapper/sockwrap.py", line 289, in send_command_and_get_string_result
data = self.outpipe_fp.read(remaining_size)
INFO:CoreNLP_JavaServer: INPUT: 1 documents, 50939 characters, 8747 tokens, 50939.0 char/doc, 8747.0 tok/doc RATES: 0.125 doc/sec, 1092.4 tok/sec
MemoryError

WARNING:CoreNLP_PyWrapper:Bad JSON returned from subprocess; returning null.
WARNING:CoreNLP_PyWrapper:Bad JSON length 392719, starts with: 'J","NN","."],"lemmas":["Industrial","agriculture","base","on","large-scale","monoculture","farming","have","become","the","dominant","agricultural","methodology","."],"tokens":["Industrial","agriculture","based","on","large-scale","monoculture","farming","has","become","the","dominant","agricultural","methodology","."],"char_offsets":[[590,600],[601,612],[613,618],[619,621],[622,633],[634,645],[646,653],[654,657],[658,664],[665,668],[669,677],[678,690],[691,702],[702,703]],"ner":["O","O","O","O","O","O","O","O","O","O","O","O","O","O"],"normner":["","","","","","","","","","","","","",""]},{"pos":["NNP","NNP",",","NN","NN",",","NNS","JJ","IN","NNS","CC","NNS",",","CC","JJ","NNS","VBP","IN","JJ","NNS","RB","VBD","NNS","IN","NN",",","CC","IN","DT","JJ","NN","VBP","VBN","JJ","JJ","NN","CC","JJ","JJ","NN","NNS","."],"lemmas":["Modern","agronomy",",","plant","breeding",",","agrochemical","such","as","pesticide","and","fertilizer",",","and","technological","development","have","in","many","c'
Process Process-1:
Traceback (most recent call last):
File "/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/multiprocessing/process.py", line 114, in run
self._target(*self._args, **self._kwargs)
File "index/index.py", line 35, in work
parse = parser.combined_parse(xml, text, self.corenlp)
File "build/bdist.macosx-10.12-x86_64/egg/corenlp/parser.py", line 221, in combined_parse
sentences = get_sentences(text, corenlp)
File "build/bdist.macosx-10.12-x86_64/egg/corenlp/parser.py", line 203, in get_sentences
sents = parse(text, corenlp)['sentences']
TypeError: 'NoneType' object has no attribute '__getitem__'

I see three problems here: the memory allocation error, the "bad JSON" warning, and the TypeError at the end. I'm not sure which of these is the root cause, or whether they're connected.

  • I'm using the same CoreNLP and parser versions as in the example config file: "stanford-corenlp-full-2015-04-20".

  • CoreNLP Python wrapper installed from:
    https://github.com/brendano/stanford_corenlp_pywrapper

  • log files:
    log_0.log

    INFO (LOGGER 0): Now processing file: /Users/mac/git/wikipedia/extracted2/AA/wiki_00

    log_1.log

    INFO (LOGGER 1): Now processing file: /Users/mac/git/wikipedia/extracted2/AA/wiki_01

  • Wikipedia was extracted using the latest version of wikipedia-extractor; the version in this repository wasn't working for me.

Would love to hear your thoughts.

Thanks,
Simon.

@aadah
Owner

aadah commented Jan 3, 2017

Generally speaking, this program can't be run on a standard laptop because of insufficient memory, though indexing alone should be fine. My guess is that the process died in the middle of parsing (because of the memory error) and passed ill-formed JSON downstream, which was then parsed to null. Run it again and watch whether the program is eating up RAM.
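
For what it's worth, that null propagates straight into the subscript in corenlp/parser.py, which is what produces the final TypeError. A minimal sketch of the kind of guard that would avoid it (a hypothetical helper, not the code actually in the repo):

```python
import logging

def get_sentences_safe(text, corenlp):
    # parse_doc() comes back as None when the Java subprocess dies (e.g. after
    # the MemoryError above) or returns JSON the wrapper can't parse, so
    # indexing straight into the result is what raises the TypeError.
    result = corenlp.parse_doc(text)
    if result is None:
        logging.warning("parse_doc returned None; skipping document")
        return []
    return result.get("sentences", [])
```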

by the way, I believe this step is missing from the readme

Thanks for the notice. The README is incomplete. (I'm not supporting this repo anymore.) Make sure to run 'python index/index.py -i' to initialize the mapping in Elasticsearch.
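
Roughly speaking, the -i flag just creates the index and puts a mapping into Elasticsearch before any documents are added. A minimal sketch of that step with the elasticsearch Python client is below; the index and field names are placeholders (the real ones live in index/index.py), and the 2.x-era Elasticsearch API the repo targeted additionally nests the properties under a document type:

```python
from elasticsearch import Elasticsearch

# Assumes a local node on localhost:9200 and a 7.x+ client/server.
es = Elasticsearch("http://localhost:9200")

# Placeholder mapping; see index/index.py for the fields actually used.
mapping = {
    "mappings": {
        "properties": {
            "title": {"type": "text"},
            "text": {"type": "text"},
        }
    }
}

es.indices.create(index="wikipedia", body=mapping)
```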

@thatguysimon
Author

I managed to get it to work by passing a parameter to use only one process.
It was extremely slow, but that didn't matter in the end because I rewrote the indexing code to index my own, much smaller corpus. It now works with one process; it takes a while, but it's bearable.
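
For reference, a rough sketch of what the single-process run amounts to (placeholder names; the real worker loop is in index/index.py, and each worker keeps its own CoreNLP JVM resident, which is why memory scales with the worker count):

```python
import multiprocessing

NUM_WORKERS = 1  # one resident CoreNLP JVM instead of several

def work(paths):
    # placeholder for the real per-file parse + index loop in index/index.py
    for path in paths:
        print("would parse and index", path)

if __name__ == "__main__":
    all_files = ["wiki_00", "wiki_01"]  # placeholder file list
    chunks = [all_files[i::NUM_WORKERS] for i in range(NUM_WORKERS)]
    procs = [multiprocessing.Process(target=work, args=(c,)) for c in chunks]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```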
Thanks.
