EOFError: Ran out of input #242

Open
shidaide2019 opened this issue Jan 1, 2021 · 6 comments

Comments

@shidaide2019

Sorry to bother you, but I hit a strange bug while extracting a Wikipedia bz2 dump.
My Python version is 3.8 and my Anaconda version is 2020.11. I installed wikiextractor (3.0.4) with pip install,
and when I ran the command

python -m wikiextractor.WikiExtractor -o extracted enwiki-20201220-pages-articles-multistream.xml.bz2

After running for about 50 minutes, it fails with this error message:

Traceback (most recent call last):
  File "C:\Users\win\Anaconda3\lib\runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\win\Anaconda3\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "C:\Users\win\Anaconda3\lib\site-packages\wikiextractor\WikiExtractor.py", line 621, in <module>
    main()
  File "C:\Users\win\Anaconda3\lib\site-packages\wikiextractor\WikiExtractor.py", line 616, in main
    process_dump(input_file, args.templates, output_path, file_size,
  File "C:\Users\win\Anaconda3\lib\site-packages\wikiextractor\WikiExtractor.py", line 357, in process_dump
    reduce.start()
  File "C:\Users\win\Anaconda3\lib\multiprocessing\process.py", line 121, in start
    self._popen = self._Popen(self)
  File "C:\Users\win\Anaconda3\lib\multiprocessing\context.py", line 224, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "C:\Users\win\Anaconda3\lib\multiprocessing\context.py", line 327, in _Popen
    return Popen(process_obj)
  File "C:\Users\win\Anaconda3\lib\multiprocessing\popen_spawn_win32.py", line 93, in __init__
    reduction.dump(process_obj, to_child)
  File "C:\Users\win\Anaconda3\lib\multiprocessing\reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
TypeError: cannot pickle '_io.TextIOWrapper' object
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\Users\win\Anaconda3\lib\multiprocessing\spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "C:\Users\win\Anaconda3\lib\multiprocessing\spawn.py", line 126, in _main
    self = reduction.pickle.load(from_parent)
EOFError: Ran out of input

I'm looking forward to your answer.

@shidaide2019
Author

At first I thought it might be caused by multiprocessing, so I set the number of processes to 1, but the error is the same as with multiple processes.

@runpingzhong

I ran into the same problem. Have you solved it?

@attardi
Owner

attardi commented Feb 11, 2021

I think there is a problem on Windows with passing file descriptors across processes.
Fixing it would require some rewriting so that the descriptors are opened within the processes themselves.
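
For anyone who wants to see the root cause in isolation: the TypeError in the traceback above comes from pickling an open text file, which is what happens when a process object holding a file handle is started with spawn. The sketch below is not wikiextractor code. The names `try_pickle` and `count_lines` are hypothetical and only illustrate the general pattern of passing a path and opening the file inside the worker.

```python
import os
import pickle
import tempfile


def try_pickle(obj):
    """Return None if obj pickles cleanly, otherwise the error message."""
    try:
        pickle.dumps(obj)
        return None
    except TypeError as exc:
        return str(exc)


def count_lines(path):
    """Hypothetical worker: it receives a path and opens the file itself."""
    with open(path, encoding="utf-8") as handle:
        return sum(1 for _ in handle)


# Create a small temporary file to experiment with.
fd, tmp_path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w", encoding="utf-8") as out:
    out.write("a\nb\nc\n")

with open(tmp_path, encoding="utf-8") as handle:
    handle_error = try_pickle(handle)  # an open text file cannot be pickled

path_error = try_pickle(tmp_path)      # a plain path string can be pickled
line_count = count_lines(tmp_path)     # the worker opens the file on its own
os.remove(tmp_path)

print(handle_error)
print(path_error, line_count)
```

Because a path is just a string, it survives the trip to a spawned child process, while the open handle does not.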

@ArlanCooper

I have the same problem on Windows too. How can I solve it?

@number435398

Same error. Of course, it takes about 30 minutes just to reach the failure point in the code.

@rgryta

rgryta commented Jun 4, 2023

This issue isn't easy to solve, because wikiextractor relies on the multiprocessing module and on the fork mechanism to create new processes, whereas Windows only offers spawn.
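
To check which start methods your own platform offers, a minimal sketch using only the standard multiprocessing API follows. The helper name `start_method_report` is made up for this example. On Windows the list should contain only spawn, which is why everything handed to a child process there has to be picklable.

```python
import multiprocessing as mp


def start_method_report():
    """Summarize the process start methods available on this platform."""
    available = mp.get_all_start_methods()
    return {
        "available": available,
        "default": mp.get_start_method(),
        "fork_supported": "fork" in available,
    }


report = start_method_report()
print(report)
```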

Your best option is to use a WSL environment if you want to use the officially distributed package. If you have to stick with Windows, you can try my quick patch for Windows support: #315

However, this patch basically moves all the logic from multiprocessing to multithreading, which performs far worse than multiprocessing because of the GIL: it is slower by a factor roughly proportional to your CPU count. That said, at least it works. Extraction speed is about 150 articles/s.
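
As a rough illustration of why threads sidestep the pickling error, here is a small sketch. It is not the code from the patch. `extract_text` and `run_threaded` are hypothetical stand-ins, and an in-memory buffer plays the role of the output file. Threads share the parent's memory, so an open file object can be used directly, guarded by a lock, but the GIL keeps the CPU-bound work from running in parallel.

```python
import io
import threading
from concurrent.futures import ThreadPoolExecutor


def extract_text(page):
    """Hypothetical stand-in for the per-article cleaning step."""
    return page.strip().upper()


def run_threaded(pages, output, workers=4):
    """Process pages in worker threads that all write to one shared file object."""
    lock = threading.Lock()

    def handle(page):
        text = extract_text(page)
        with lock:  # serialize writes to the shared handle
            output.write(text + "\n")
        return text

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in input order; write order may differ.
        return list(pool.map(handle, pages))


buffer = io.StringIO()
results = run_threaded([" alpha ", "beta", " gamma"], buffer)
print(results)
```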
