UnicodeEncodeError: 'utf-8' codec can't encode character '\udce2' in position 73: surrogates not allowed #1346

Closed
devurandom opened this issue Jun 30, 2014 · 7 comments

@devurandom devurandom commented Jun 30, 2014

I get the following when executing nikola build from a Mercurial changegroup hook, run by the webserver after hg push:

Scanning posts....done!
Traceback (most recent call last):
  File "/usr/lib/python3.3/site-packages/doit/doit_cmd.py", line 120, in run
    return command.parse_execute(args)
  File "/usr/lib/python3.3/site-packages/doit/cmd_base.py", line 78, in parse_execute
    return self.execute(params, args)
  File "/usr/lib/python3.3/site-packages/doit/cmd_base.py", line 268, in execute
    return self._execute(**exec_params)
  File "/usr/lib/python3.3/site-packages/doit/cmd_run.py", line 212, in _execute
    return runner.run_all(self.control.task_dispatcher())
  File "/usr/lib/python3.3/site-packages/doit/runner.py", line 237, in run_all
    self.run_tasks(task_dispatcher)
  File "/usr/lib/python3.3/site-packages/doit/runner.py", line 200, in run_tasks
    if not self.select_task(node, task_dispatcher.tasks):
  File "/usr/lib/python3.3/site-packages/doit/runner.py", line 111, in select_task
    if node.ignored_deps or self.dep_manager.status_is_ignore(task):
  File "/usr/lib/python3.3/site-packages/doit/dependency.py", line 454, in status_is_ignore
    return self._get(task.name, "ignore:")
  File "/usr/lib/python3.3/site-packages/doit/dependency.py", line 239, in get
    task_data = self._dbm[task_id]
UnicodeEncodeError: 'utf-8' codec can't encode character '\udce2' in position 73: surrogates not allowed

Please advise me how to debug this further. I assume the problem is with the task_id, but I do not know what it is generated from.

If I execute the command manually, there is no error message.

Member

@Kwpolska Kwpolska commented Jun 30, 2014

task_id seems to be UTF-16, while it should be UTF-8. You can try:

  • nuking the .doit.db file (it helped in a sort-of-similar error)
  • checking around your server for UTF-16-named files
  • making sure hg uses UTF-8 while executing that task
  • running nikola list --all manually and via hg — and seeing where it crashes (may or may not help as the culprit function is not used by this command)
  • attempting debugging by sticking a print statement before line 239 in dependency.py (see traceback)
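
A note on that last suggestion: a plain print of a string that contains lone surrogates can itself raise UnicodeEncodeError, so it may be safer to print the ascii() form. This is only a sketch; describe_task_id and the example task name are made up for illustration and are not part of doit.

```python
def describe_task_id(task_id):
    """Return a printable, ASCII-only description of a task id."""
    # ascii() escapes non-ASCII characters and lone surrogates, so the
    # result can be printed under any locale without raising.
    return "task_id={} (type {})".format(ascii(task_id), type(task_id).__name__)


# Hypothetical task name, used only to show the output format.
print(describe_task_id("render_galleries:output/\udce2\udc80\udca6.jpg"))
```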
Author

@devurandom devurandom commented Jun 30, 2014

I think I found it: One of the files contains …, which apparently makes its way into doit as a string that encodes into the UTF-16 sequence \xe2\xdc\x80\xdc\xa6\xdc. (I modified doit/cmd_list.py:81 from line_data = {'name': task.name, 'doc':task.doc} to […] task.name.encode('utf-16') […] to get the UTF-16 representation.)

As a next step I would like to figure out where this comes from:

  • ls -l has no problem printing the filename in an en_GB.UTF-8 locale.
  • How do I get a binary representation of a filename?
  • If that representation differs from the Hg client to the server, maybe Mercurial mangles the filename…
  • If not, maybe doit or Nikola encodes the string in a weird way? Maybe twice?
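
On getting a binary representation of a filename: one way, sketched here with the standard library only, is to ask os.listdir for bytes names, or to convert a str name back with os.fsencode. The helper names raw_filename_bytes and hex_dump are invented for this example.

```python
import os


def raw_filename_bytes(name):
    """Return the bytes the OS stores for a filename given as str."""
    # os.fsencode reverses the decoding Python applies to filenames,
    # so bytes that could not be decoded come back unchanged.
    return os.fsencode(name)


def hex_dump(data):
    """Format bytes as space-separated hex pairs, e.g. 'e2 80 a6'."""
    return " ".join("{:02x}".format(b) for b in data)


if __name__ == "__main__":
    # Passing a bytes path makes os.listdir return raw bytes names.
    for raw in os.listdir(b"."):
        print(hex_dump(raw), repr(raw))
```

Running this on the client and on the server and comparing the hex output should show whether the stored filename bytes actually differ between the two.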
Member

@Kwpolska Kwpolska commented Jun 30, 2014

One of the files contains …

Why did you name your file this way?

Try running os.listdir() in both places. You should also see what sys.getfilesystemencoding() says.
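
A small script along these lines could collect both values in one go; it is a sketch, and encoding_report is a name chosen for this example. The idea is to run it once from a user shell and once from the hg hook, then compare the two outputs.

```python
import locale
import os
import sys


def encoding_report(path="."):
    """Collect the settings that decide how filenames are decoded."""
    return {
        "filesystem_encoding": sys.getfilesystemencoding(),
        "preferred_encoding": locale.getpreferredencoding(False),
        "LANG": os.environ.get("LANG"),
        "LC_ALL": os.environ.get("LC_ALL"),
        "LC_CTYPE": os.environ.get("LC_CTYPE"),
        # ascii() keeps lone surrogates printable as \udcXX escapes.
        "names": [ascii(n) for n in sorted(os.listdir(path))],
    }


if __name__ == "__main__":
    for key, value in encoding_report().items():
        print("{}: {}".format(key, value))
```

If the filesystem encoding differs between the two runs (for example utf-8 in the shell and ascii under the webserver), that would point at the hook's environment rather than at Mercurial or doit.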

Author

@devurandom devurandom commented Jun 30, 2014

why did you name your file this way?

Because Nikola has no other way of setting an explanatory tooltip for an image.

Member

@Kwpolska Kwpolska commented Jun 30, 2014

Ahh, #1286 (by you!). And a bonus implication for #1321/#1330.

But still: please consult os.listdir() and sys.getfilesystemencoding() in Python both locally and on the server, via hg (that’s important; hg and a user shell have different settings!). Someone along the way might be just encoding stuff into something wrong.

But then again: … is U+2026, which can be happily encoded into one UTF-16 word, without the need for surrogates; in fact, its two bytes (0x20 0x26) would read as the ASCII characters space and ampersand. There is clearly some encoding involved! Or maybe it’s actually garbage from .doit.db.
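
One possible explanation that fits the reported bytes, under the assumption (not confirmed in this thread) that the hook runs with an ASCII (C/POSIX) locale: Python decodes filenames with the surrogateescape error handler, so each undecodable UTF-8 byte 0xNN becomes the lone surrogate U+DCNN. The sketch below walks through that with standard codecs only.

```python
# "…" (U+2026) as stored on disk in a UTF-8 filename.
raw = "\u2026".encode("utf-8")
assert raw == b"\xe2\x80\xa6"

# In UTF-16 it is a single code unit, with no surrogates involved.
assert "\u2026".encode("utf-16-be") == b"\x20\x26"

# Assumption: the filesystem encoding is ASCII. Each byte that ASCII
# cannot decode is mapped to a lone surrogate by surrogateescape.
name = raw.decode("ascii", "surrogateescape")
assert name == "\udce2\udc80\udca6"

# Lone surrogates cannot be encoded as strict UTF-8, which matches
# the error in the traceback.
try:
    name.encode("utf-8")
except UnicodeEncodeError as exc:
    error_reason = exc.reason

# Encoding the surrogates as UTF-16-LE gives the byte sequence
# reported earlier in the thread.
utf16 = name.encode("utf-16-le", "surrogatepass")
assert utf16 == b"\xe2\xdc\x80\xdc\xa6\xdc"

# Encoding with surrogateescape restores the original bytes.
assert name.encode("utf-8", "surrogateescape") == raw
```

If this is what happens, '\udce2' would be the first byte of the UTF-8 encoded "…" seen through an ASCII decoder, and the filename on disk would be intact.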

Here’s a patch that might help to debug:

diff --git a/dependency.py b/dependency.py
index d4fd92d..fa44bca 100644
--- a/dependency.py
+++ b/dependency.py
@@ -237,6 +237,12 @@ class DbmDB(object):
         else:
             try:
                 task_data = self._dbm[task_id]
+            except UnicodeEncodeError:
+                with open('/tmp/nikola_debug', 'wb') as fh:
+                    fh.write(b'task id: ')
+                    # task_id is a str; surrogateescape turns it back into raw bytes
+                    fh.write(task_id.encode('utf-8', 'surrogateescape'))
+                    exit(1)
             except KeyError:
                 return
             self._db[task_id] = json.loads(task_data.decode('utf-8'))

At which point, you should build and run od -t x1 -c /tmp/nikola_debug and provide its output here.

Member

@schettino72 schettino72 commented Jun 30, 2014

On Mon, Jun 30, 2014 at 1:39 PM, Dennis Schridde notifications@github.com
wrote:

Please advise me how to debug this further. I assume the problem is with
the task_id, but I do not know what it is generated from.

An error generated by doit itself can be debugged with:

doit --pdb XXX

Using nikola would be:

nikola build --pdb XXX
Member

@Kwpolska Kwpolska commented Jun 24, 2015

No response for a year, closing.

@Kwpolska Kwpolska closed this Jun 24, 2015