No Module named Html5lib.tokenizer #385

Closed
buffer1900 opened this issue Apr 16, 2018 · 8 comments

@buffer1900

ERROR Fatal exception.
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/wpull/application/app.py", line 152, in run
yield from pipeline.process()
File "/usr/local/lib/python3.6/dist-packages/wpull/pipeline/pipeline.py", line 194, in process
yield from self._process_one_worker()
File "/usr/local/lib/python3.6/dist-packages/wpull/pipeline/pipeline.py", line 215, in _process_one_worker
task.result()
File "/usr/local/lib/python3.6/dist-packages/wpull/pipeline/pipeline.py", line 119, in process
item = yield from self.process_one(_worker_id=worker_id)
File "/usr/local/lib/python3.6/dist-packages/wpull/pipeline/pipeline.py", line 103, in process_one
yield from task.process(item)
File "/usr/lib/python3.6/asyncio/coroutines.py", line 212, in coro
res = func(*args, **kw)
File "/usr/local/lib/python3.6/dist-packages/wpull/application/tasks/download.py", line 31, in process
self.build_html_parser(session)
File "/usr/local/lib/python3.6/dist-packages/wpull/application/tasks/download.py", line 37, in build_html_parser
from wpull.document.htmlparse.html5lib_ import HTMLParser
File "/usr/local/lib/python3.6/dist-packages/wpull/document/htmlparse/html5lib_.py", line 3, in <module>
import html5lib.tokenizer
ModuleNotFoundError: No module named 'html5lib.tokenizer'
CRITICAL Sorry, Wpull unexpectedly crashed.
CRITICAL Please report this problem to the authors at Wpull's issue tracker so it may be fixed. If you know how to program, maybe help us fix it? Thank you for helping us help you help us all.
INFO Exiting with status 1.

Any help?

@chfoo
Member

chfoo commented Apr 16, 2018

This appears to be the same issue as #332 and #333. Can you try to install an older version? Use html5lib==0.9999999.
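For anyone else hitting this traceback, here is a minimal sketch for checking whether the installed html5lib still exposes the public `html5lib.tokenizer` module that Wpull imports. The helper name `has_module` is made up for illustration. The diagnosis in the messages is an assumption based on html5lib's changelog, which describes the tokenizer being moved to an underscore-prefixed private module in releases after 0.9999999.

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if the dotted module name can be located.

    Hypothetical helper for illustration. find_spec() imports parent
    packages but not the leaf module itself.
    """
    try:
        return importlib.util.find_spec(name) is not None
    except ImportError:
        # Raised when a parent package (e.g. html5lib itself) is missing.
        return False


if has_module("html5lib.tokenizer"):
    print("html5lib.tokenizer is available; Wpull's import should succeed.")
elif has_module("html5lib"):
    # Assumption: a newer html5lib is installed where the tokenizer is private.
    print("html5lib is installed but has no public tokenizer module.")
    print("Workaround from this thread: pip install html5lib==0.9999999")
else:
    print("html5lib is not installed.")
```

Run it with the same Python interpreter that Wpull uses (Python 3.6 in the traceback above), since each interpreter has its own set of installed packages.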

@buffer1900
Author

The old html5lib==0.9999999 works fine.
Thanks a lot.

@traverseda

I mean that isn't really solving the problem, so I'm a bit unclear as to why all those issues are closed. Maybe it's time to mark this project as unmaintained?

@TheTechRobo
Contributor

The project is not maintained very well at the moment because ArchiveTeam is mainly focusing on actually archiving everything. Most web crawler development is at https://github.com/ArchiveTeam/wget-lua/ at the moment, as that's what they use for the Warrior. That's a fork of Wget with Lua scripting, zstd WARC compression, deduplication, and other stuff.

> I mean that isn't really solving the problem, so I'm a bit unclear as to why all those issues are closed.

chfoo linked issue #332, which is still open and is probably the canonical (non-duplicate) issue.

@JustAnotherArchivist
Contributor

Not much maintenance has happened in recent years, yeah. I recently tried to start working on it again, but that's currently blocked by technical issues with our CI.

@traverseda

@JustAnotherArchivist How would you feel about GitHub Actions as the CI? I know it's proprietary and kind of crappy, but there's an open-source, more or less compatible implementation here: https://github.com/nektos/act and I hear that Gitea is supporting it as well.

Might give that a shot

@JustAnotherArchivist
Contributor

My thoughts about GitHub Actions: ew no. Nice to see that there's a replacement though. That will help once GitHub Actions disappears or changes in stupid ways.
We have a Drone CI setup; it just isn't working correctly with this repo for some reason that hasn't been identified yet. There are no plans to replace Drone currently since we also use it for many other things, so we'll focus on fixing that instead. The person who administers it has just been very busy recently.

@traverseda

Ahh, fair enough. I've used Drone CI in the past and it is a lot nicer than GitHub Actions. Hopefully that gets up and running soon. I've been trying to get things running (I want a python-playwright subprocessor to replace the PhantomJS one), but I'm having a really hard time getting an environment I can actually develop in. It seems that if I do get the Python version right, I can't compile lxml, or I hit other issues.
