Fix 2047 Ensure all files written by AiiDA and plugins are utf8 encoded #2107
Conversation
force-pushed from 4c7bb59 to 9f15c23
ignore my previous review: for some reason GitHub showed me the outdated stuff
force-pushed from a97a9e5 to 7d46e28
subfolder.create_file_from_filelike(StringIO(json.dumps(job_tmpl)), 'job_tmpl.json')
subfolder.create_file_from_filelike(StringIO(json.dumps(calcinfo)), 'calcinfo.json')
subfolder.create_file_from_filelike(StringIO(six.text_type(json.dumps(job_tmpl))), 'job_tmpl.json')
subfolder.create_file_from_filelike(StringIO(six.text_type(json.dumps(calcinfo))), 'calcinfo.json')
this is not the same, StringIO gives you a file-like object while six.text_type converts a str to a unicode on Python 2 and is a no-op on Python 3
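To illustrate the distinction being made here, a small sketch (not part of the original comment, assuming six is available):

import six
from six import StringIO

payload = '{"a": 1}'
fileobj = StringIO(payload)      # file-like object: has .read(), usable with shutil.copyfileobj
text = six.text_type(payload)    # unicode on Python 2, a no-op (still str) on Python 3
print(fileobj.read())            # '{"a": 1}'
print(isinstance(text, six.text_type))  # True on both Python versions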
Hi, I disagree on this one.
I know that the code looks ugly, but it’s to satisfy compatibility with Python 2 and 3.
The underlying issue is really create_file_from_filelike() which uses shutil.copyfileobj internally.
Minimal code that shows the problematic behaviour is below.
# encoding: utf-8
import io
import json
import shutil
from six import StringIO

d_ascii = {'a': 'hello'}
filelike = StringIO(json.dumps(d_ascii))
with io.open('/tmp/test.dat', 'wb') as dest_file:
    shutil.copyfileobj(filelike, dest_file)
This will work in Python 2 but fail in Python 3.
The reason is that the content of the file-like object returned by StringIO is a string (since it comes from json.dumps). In Python 3 this is a unicode string, which cannot be written to a file opened in 'wb' mode, as that mode requires bytes.
One could change the io.open line to encode to UTF-8 (and this is what I now do):
with io.open('/tmp/test.dat', 'w', encoding='utf8') as dest_file:
but now Python 2 will fail, because io.open demands a unicode object while we are supplying a byte string.
A solution which satisfies both Python 2 and Python 3, while ensuring that the written file is UTF-8 encoded, is to make sure that what comes out of json.dumps is converted to a unicode string.
six.text_type() does this. At that point nothing has been encoded yet, but io.open eventually takes care of that, and besides, json handles UTF-8 encoding internally.
In the end, the following example satisfies both Python 2 and 3:
# encoding: utf-8
import io
import json
import shutil
import six
from six import StringIO

d_ascii = {'a': 'hello'}
filelike = StringIO(six.text_type(json.dumps(d_ascii)))
with io.open('/tmp/test.dat', 'w', encoding='utf8') as dest_file:
    shutil.copyfileobj(filelike, dest_file)

d_unicode = {'b': u'γεια σας'}
filelike = StringIO(six.text_type(json.dumps(d_unicode)))
with io.open('/tmp/test.dat', 'w', encoding='utf8') as dest_file:
    shutil.copyfileobj(filelike, dest_file)
The problem with using six.text_type to work around the inconsistent json API is that if you do not specify anything, it uses the default system encoding for decoding a bytes or a str into a unicode string, which is exactly the ambiguity we try to avoid in the first place. But if you are adding the encoding="utf8", then six.text_type("", encoding="utf8") runs on Python 2 and fails on Python 3.
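A minimal sketch of the asymmetry being described (an illustration added for clarity, not part of the original comment):

import six

# Python 2: six.text_type is unicode, so this decodes a byte string -> u''
# Python 3: six.text_type is str, and str('', encoding='utf8') raises TypeError,
#           because only bytes-like objects accept an encoding argument
six.text_type('', encoding='utf8')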
I would rather prefer an explicit if six.PY2: ... switch or a (maintained and consistent) library like simplejson.
The other drawback of this kind of solution is that in a year or two nobody will have a clue anymore why this wrapping is there, whether it is still needed or not, and it will have spread to other places because people copy and paste, especially when it comes to solving problems they don't know about.
This is a fair point.
I think the tidy thing to do, if we want to use a switch, is to have a json utility function, although I don't know where best to put this. Then it would be just a case of changing the json dump function calls where they occur.
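For illustration, a minimal sketch of what such a utility module (later referred to in this thread as /utils/json.py) could contain; the module path, function set and details here are assumptions, not the actual AiiDA implementation:

# utils/json.py (hypothetical) -- wrap the stdlib json module so every caller
# gets text (unicode) in and out, with the Python 2 special case handled in one place
from __future__ import absolute_import
import json as _json
import six

def dumps(data, **kwargs):
    """Serialize data to a text (unicode) JSON string."""
    dumped = _json.dumps(data, ensure_ascii=False, **kwargs)
    if six.PY2 and isinstance(dumped, str):
        # on Python 2 dumps can return a byte string; decode it explicitly
        # instead of relying on the system default encoding
        dumped = dumped.decode('utf-8')
    return dumped

def dump(data, fhandle, **kwargs):
    """Serialize data as JSON to a file handle opened in text mode."""
    fhandle.write(dumps(data, **kwargs))

def loads(string, **kwargs):
    """Deserialize a JSON document from a text string."""
    return _json.loads(string, **kwargs)

def load(fhandle, **kwargs):
    """Deserialize a JSON document from a file handle opened in text mode."""
    return loads(fhandle.read(), **kwargs)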
aiida/backends/tests/parsers.py
Outdated
with open(os.path.join(outfolder, '_aiida_checks.json'), 'w') as f:
    json.dump({}, f)
with io.open(os.path.join(outfolder, '_aiida_checks.json'), 'w', encoding='utf8') as fhandle:
    fhandle.write(six.text_type(json.dumps({}, ensure_ascii=False)))
I would rather write directly the expected content after the json.dumps than do the six.text_type wrapping here.
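For the empty-dict case in this test, that suggestion amounts to something like the following sketch (json.dumps({}) is just '{}'):

with io.open(os.path.join(outfolder, '_aiida_checks.json'), 'w', encoding='utf8') as fhandle:
    fhandle.write(u'{}')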
Wrapping now removed.
aiida/backends/tests/tcodexporter.py
Outdated
with io.open(fd._get_folder_pathsubfolder.get_abs_path(
        calc._SCHED_OUTPUT_FILE), 'w', encoding='utf8') as fhandle:
    fhandle.write(u"standard output")
    fhandle.flush()
the flush calls here and below are pointless since the file will be closed afterwards and therefore flushed.
flush() removed.
aiida/cmdline/commands/cmd_export.py
Outdated
with open(folder.get_abs_path('data.json'), 'w') as handle:
    json.dump(data, handle)
with io.open(folder.get_abs_path('data.json'), 'w', encoding='utf8') as fhandle:
    fhandle.write(six.text_type(json.dumps(data, ensure_ascii=False)))
I am really not convinced that this is a portable and maintainable solution, see #2142
Now resolved.
with open(self.circus_socket_file, 'w') as handle:
    handle.write(socket_dir_path)
with io.open(self.circus_socket_file, 'w', encoding='utf8') as fhandle:
    fhandle.write(six.text_type(socket_dir_path))
My preference would be a separate codepath for Python 2:
if six.PY2:
    ...
else:
    ...
This way it would be clearer what transformation is needed once we drop Python 2.
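Applied to the diff above, the suggestion could look roughly like this sketch (whether socket_dir_path is a byte string on Python 2 is an assumption here):

import io
import six

if six.PY2:
    # assuming socket_dir_path is a byte string on Python 2, decode it explicitly
    with io.open(self.circus_socket_file, 'w', encoding='utf8') as fhandle:
        fhandle.write(socket_dir_path.decode('utf-8'))
else:
    # on Python 3 it is already text and can be written as-is
    with io.open(self.circus_socket_file, 'w', encoding='utf8') as fhandle:
        fhandle.write(socket_dir_path)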
Resolved by /utils/json.py replacement.
We discussed this with @giovannipizzi and @ConradJohnston and we agreed that it would be possible to define the four used json functions
…h a bytestring as input), therefore we use six.text_type to convert it to unicode and pass it to io.open(demofile.txt, w, encoding='utf-8') which expects a unicode object and then it encodes it utf-8.
…nts() to write in binary mode as we're sending the output to /dev/null anyway; remove the encoding arg from folder.open calls in orm/importexport as it's really a call to ZipFolder.open; change cStringIO to StringIO in MyWritingZipFile to allow for unicode
force-pushed from 89c19cc to 71a18b7
…ing the 'with open()' block
Thanks a lot @ConradJohnston, to me it looks good and ready to merge. @dev-zero if you are also happy, let me know and I will merge this PR. Just one final question that occurred to me: since we are now requiring that all files written by AiiDA are utf-8 encoded, shouldn't we also ship a "migration" to properly rewrite all files currently written in the repositories?
Looks very good, thanks @ConradJohnston. @sphuber I am not sure about that one yet. It might be easy to provide a script: use …
I suppose the problem with a migration is that we can't know how the files in a user's repository were written and can't infer the encoding by inspection.
But do we, after this PR, assume that all files in the repository are utf-8 encoded? And if so, and this is not the case, will everything fall over?
@sphuber - You could be right.
I think we can merge this - it (hopefully) doesn't make the situation worse than it is now.
So does that mean that any file that is retrieved by the engine from the supercomputer is stored wholesale in the repository with the same encoding? Which means that nowhere in the code are we (or should we be) expecting the files to always be utf-8, but they could be anything?
That would be my expectation. Instead, for files AiiDA writes itself and then has to read again (e.g. if it stores a file in the repo that it has to read again, say a json or other), we should always ensure utf8.
OK, if we take the stance that we can never assume utf-8 unless we explicitly write it ourselves, we don't need a migration and we can merge this PR.
Many thanks to @ConradJohnston and @dev-zero for the hard work!
@dev-zero already mentioned in the comments that he approves after the changes made
Fixes #2047 and fixes #2142