
Fix 2047 Ensure all files written by AiiDA and plugins are utf8 encoded #2107

Conversation

@ConradJohnston (Contributor) commented Oct 24, 2018

Fixes #2047 and fixes #2142

  • Changes the folders.Folder.open function to use io.open and checks whether the file should be opened in binary mode or text mode. Enforces UTF-8 by default (see the sketch after this list).
  • Replaces the builtin "open" with "io.open" everywhere in the codebase, except where the builtin open is preferable (e.g. HiddenFiles()). Explicitly sets the encoding to UTF-8, or to None for binary files.
  • Sets a number of testing strings as unicode strings explicitly.
  • Changes how writing JSON to zip files is done in Import/Export to handle unicode.
  • In some cases, changes cStringIO to StringIO to ensure that unicode can be used.
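
A minimal sketch of the kind of mode/encoding check the first bullet describes (hypothetical helper name, not the actual folders.Folder.open implementation):

import io

def open_file(path, mode='r', encoding='utf8'):
    """Open path with io.open, enforcing UTF-8 for text files by default."""
    if 'b' in mode:
        # Binary files carry no text encoding; io.open rejects an encoding
        # argument when 'b' is in the mode.
        encoding = None
    return io.open(path, mode, encoding=encoding)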

@ConradJohnston ConradJohnston force-pushed the fix_2047_ensure_all_files_are_utf8_encoded branch from 4c7bb59 to 9f15c23 Compare October 25, 2018 07:29
aiida/backends/tests/cmdline/commands/test_data.py (outdated review thread, resolved)
aiida/backends/tests/dataclasses.py (outdated review thread, resolved)
aiida/backends/tests/test_caching_config.py (outdated review thread, resolved)
@dev-zero (Contributor):

Ignore my previous review: for some reason GitHub showed me the outdated stuff.

aiida/backends/tests/parsers.py (outdated review thread, resolved)
aiida/cmdline/commands/cmd_export.py (outdated review thread, resolved)
@ConradJohnston ConradJohnston force-pushed the fix_2047_ensure_all_files_are_utf8_encoded branch 2 times, most recently from a97a9e5 to 7d46e28 Compare October 29, 2018 15:52
- subfolder.create_file_from_filelike(StringIO(json.dumps(job_tmpl)), 'job_tmpl.json')
- subfolder.create_file_from_filelike(StringIO(json.dumps(calcinfo)), 'calcinfo.json')
+ subfolder.create_file_from_filelike(StringIO(six.text_type(json.dumps(job_tmpl))), 'job_tmpl.json')
+ subfolder.create_file_from_filelike(StringIO(six.text_type(json.dumps(calcinfo))), 'calcinfo.json')
Contributor:

This is not the same: StringIO gives you a file-like object, while six.text_type converts a str to unicode on Python 2 and is a no-op on Python 3.

Contributor Author (@ConradJohnston) replied Oct 30, 2018:

Hi, I disagree on this one.

I know that the code looks ugly, but it's needed to stay compatible with both Python 2 and 3.
The underlying issue is really create_file_from_filelike(), which uses shutil.copyfileobj internally.
Minimal code that shows the problematic behaviour is below.

# encoding: utf-8
import io
import json
import shutil

from six import StringIO

d_ascii = {'a': 'hello'}
filelike = StringIO(json.dumps(d_ascii))

with io.open('/tmp/test.dat', 'wb') as dest_file:
    shutil.copyfileobj(filelike, dest_file)

This will work in Python 2 but fail in Python 3.
The reason is that the file-like object returned by StringIO holds data of the string type (since it comes from json.dumps). In Python 3 this is a unicode string, which can't be written to a file opened in 'wb' mode, which requires bytes.

One could change the io.open line to encode to UTF-8 (and this is what I now do):

with io.open('/tmp/test.dat', 'w', encoding='utf8') as dest_file:

but now Python 2 will fail, as io.open in text mode demands a unicode object and we are supplying a byte string.

A solution which satisfies both Python 2 and Python 3, while ensuring that the written file is UTF-8 encoded, is to make sure that what comes out of json.dumps is converted to a unicode string.
six.text_type() does this. At this point we haven't encoded yet, but io.open eventually does that for us, and besides, json does the encoding to UTF-8 internally.

In the end, the following example satisfies both Python 2 and 3:

# encoding: utf-8
import io
import json
import shutil

import six
from six import StringIO

# Pure-ASCII payload
d_ascii = {'a': 'hello'}
filelike = StringIO(six.text_type(json.dumps(d_ascii)))
with io.open('/tmp/test.dat', 'w', encoding='utf8') as dest_file:
    shutil.copyfileobj(filelike, dest_file)

# Payload containing non-ASCII characters
d_unicode = {'b': u'γεια σας'}
filelike = StringIO(six.text_type(json.dumps(d_unicode)))
with io.open('/tmp/test.dat', 'w', encoding='utf8') as dest_file:
    shutil.copyfileobj(filelike, dest_file)

Contributor:

The problem with using six.text_type to work around the inconsistent json API is that, if you do not specify anything, it uses the default system encoding to decode a bytes or a str into a unicode string, which is exactly the ambiguity we are trying to avoid in the first place.
But if you add encoding="utf8", then six.text_type("", encoding="utf8") runs on Python 2 and fails on Python 3.
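
A minimal demonstration of that asymmetry (a hypothetical snippet, not code from this PR):

import six

six.text_type("", encoding="utf8")
# Python 2: calls unicode("", encoding="utf8") and returns u''
# Python 3: calls str("", encoding="utf8") and raises
#           TypeError: decoding str is not supported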

I would rather prefer an explicit if six.PY2: ... switch or a (maintained and consistent) library like simplejson.

The other drawback of this kind of solution is that in a year or two nobody will have a clue anymore why this wrapping is there or whether it is still needed, and it will have spread to other places because people copy and paste, especially when it comes to solving problems they don't know about.

Contributor Author:

This is a fair point.
I think the tidy thing to do, if we want to use a switch, is to have a json utility function, although I don't know where best to put it. Then it would just be a case of changing the json dump calls where they occur.

@coveralls commented Oct 29, 2018

Coverage Status

Coverage increased (+0.01%) to 68.671% when pulling 369dd9f on ConradJohnston:fix_2047_ensure_all_files_are_utf8_encoded into 9bc4e36 on aiidateam:develop.

- with open(os.path.join(outfolder, '_aiida_checks.json'), 'w') as f:
-     json.dump({}, f)
+ with io.open(os.path.join(outfolder, '_aiida_checks.json'), 'w', encoding='utf8') as fhandle:
+     fhandle.write(six.text_type(json.dumps({}, ensure_ascii=False)))
Contributor:

I would rather write the expected content of the json.dumps directly here than do the six.text_type wrapping.
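
A sketch of what that suggestion amounts to (json.dumps({}) is always exactly the two characters {}; outfolder here is a hypothetical stand-in for the calculation's output folder):

import io
import os

outfolder = '/tmp'  # stand-in for the calculation's output folder

with io.open(os.path.join(outfolder, '_aiida_checks.json'), 'w', encoding='utf8') as fhandle:
    fhandle.write(u'{}')  # the serialised empty dict, written directly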

Contributor Author:

Wrapping now removed.

with io.open(fd._get_folder_pathsubfolder.get_abs_path(
        calc._SCHED_OUTPUT_FILE), 'w', encoding='utf8') as fhandle:
    fhandle.write(u"standard output")
    fhandle.flush()
Contributor:

The flush() here and below is pointless, since the file will be closed afterwards and therefore flushed.

Contributor Author:

flush() removed.

- with open(folder.get_abs_path('data.json'), 'w') as handle:
-     json.dump(data, handle)
+ with io.open(folder.get_abs_path('data.json'), 'w', encoding='utf8') as fhandle:
+     fhandle.write(six.text_type(json.dumps(data, ensure_ascii=False)))
Contributor:

I am really not convinced that this is a portable and maintainable solution, see #2142

Contributor Author:

Now resolved.

- with open(self.circus_socket_file, 'w') as handle:
-     handle.write(socket_dir_path)
+ with io.open(self.circus_socket_file, 'w', encoding='utf8') as fhandle:
+     fhandle.write(six.text_type(socket_dir_path))
Contributor:

My preference would be a separate codepath for Python 2:

if six.PY2:
    ...
else:
    ...

This way it would be clearer on what transformation is needed once we drop Python 2
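
For the write above, such a switch might look like the following (a sketch, with the assumption that socket_dir_path is a UTF-8 byte string on Python 2; the hypothetical function just makes the snippet self-contained):

import io

import six

def write_socket_file(circus_socket_file, socket_dir_path):
    with io.open(circus_socket_file, 'w', encoding='utf8') as fhandle:
        if six.PY2:
            # Python 2: decode the byte string before writing to the text-mode file
            fhandle.write(socket_dir_path.decode('utf8'))
        else:
            # Python 3: the path is already a text string
            fhandle.write(socket_dir_path)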

Contributor Author:

Resolved by the aiida/utils/json.py replacement.

@sphuber (Contributor) commented Oct 30, 2018

We discussed this with @giovannipizzi and @ConradJohnston and agreed that it would be possible to define the four json functions we use, load and dump (and their string variants), in aiida.utils.json, where we can do a try/except import of simplejson and fall back on our own wrapper. We can then change all import json statements to from aiida.utils import json, so that we have one single point of entry. In addition, the ujson and anyjson dependencies can be dropped in favour of simplejson as an optional dependency.
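
A minimal sketch of what such an aiida/utils/json.py wrapper could look like (an illustration of the agreed approach, not the merged code):

# aiida/utils/json.py (sketch)
from __future__ import absolute_import  # avoid this module shadowing the stdlib json on Python 2

try:
    import simplejson as json  # optional dependency with consistent str/unicode handling
except ImportError:
    import json  # fall back on the standard library

def dump(data, fhandle, **kwargs):
    """Serialise data as JSON into the open file handle."""
    return json.dump(data, fhandle, **kwargs)

def dumps(data, **kwargs):
    """Serialise data into a JSON string."""
    return json.dumps(data, **kwargs)

def load(fhandle, **kwargs):
    """Deserialise JSON from the open file handle."""
    return json.load(fhandle, **kwargs)

def loads(string, **kwargs):
    """Deserialise JSON from a string."""
    return json.loads(string, **kwargs)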

@ConradJohnston ConradJohnston force-pushed the fix_2047_ensure_all_files_are_utf8_encoded branch from 89c19cc to 71a18b7 Compare October 31, 2018 14:42
@sphuber (Contributor) commented Nov 2, 2018

Thanks a lot @ConradJohnston, to me it looks good and ready to merge. @dev-zero, if you are also happy, let me know and I will merge this PR.

Just one final question that occurred to me: since we are now requiring that all files written by AiiDA are utf-8 encoded, shouldn't we also ship a "migration" that properly rewrites all files currently in the repositories?

@dev-zero (Contributor) commented Nov 2, 2018

Looks very good, thanks @ConradJohnston

@sphuber I am not sure about that one yet. It might be easy to provide a script: use chardet to detect the encoding and, if something is not pure ASCII or UTF-8, load it using the detected encoding and write it again in UTF-8. But for that we would have to know which files are really text and which are binaries (or are to be treated as such, like the UPF files).
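
A rough sketch of such a script, under the assumption that the file is already known to be text (chardet.detect returns a dict with the guessed encoding and a confidence):

import io

import chardet

def reencode_to_utf8(path):
    """Detect the encoding of a text file and rewrite it as UTF-8."""
    with io.open(path, 'rb') as fhandle:
        raw = fhandle.read()

    encoding = chardet.detect(raw)['encoding']  # e.g. 'ISO-8859-1'
    if encoding is None or encoding.replace('-', '').lower() in ('ascii', 'utf8'):
        return  # pure ASCII or already UTF-8: nothing to do

    with io.open(path, 'w', encoding='utf8') as fhandle:
        fhandle.write(raw.decode(encoding))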

@ConradJohnston (Contributor Author):

> Thanks a lot @ConradJohnston, to me it looks good and ready to merge. @dev-zero, if you are also happy, let me know and I will merge this PR.
>
> Just one final question that occurred to me: since we are now requiring that all files written by AiiDA are utf-8 encoded, shouldn't we also ship a "migration" that properly rewrites all files currently in the repositories?

I suppose the problem with a migration is that we can't know how the files in a user's repository were written and can't infer the encoding by inspection.

@sphuber (Contributor) commented Nov 2, 2018

But do we, after this PR, assume that all files in the repository are utf-8 encoded? And if so, and this is not the case, will everything fall over?

@ConradJohnston (Contributor Author):

@sphuber - You could be right.
The encoder is set up to fail when it encounters byte sequences that aren't valid UTF-8, but the bigger problem is if it silently does the wrong thing. I think some ISO-8859-1 encoded files can be silently and wrongly interpreted by the UTF-8 codec.
I'll see if there's an example of this.
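
For illustration, one such case: the ISO-8859-1 text 'Ã©' encodes to the byte pair 0xC3 0xA9, which happens to be exactly the UTF-8 encoding of 'é', so the UTF-8 codec accepts it without complaint:

raw = u'\xc3\xa9'.encode('latin-1')  # b'\xc3\xa9', i.e. 'Ã©' in ISO-8859-1
print(raw.decode('utf8'))            # prints 'é' - no error, but the wrong text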

@giovannipizzi (Member):

I think we can merge this - this (hopefully) doesn't make the situation worse than it is now.
In any case, I think we must treat any file coming from a remote supercomputer as binary, and the parser has to be clever enough to figure it out - it would be bad to take the output of a code and try to change it when storing. Ideally, we should (in the future, see #2046) keep track of the default encoding on the supercomputer at the time of storing, as additional metadata - even if I think getting this right is complicated (e.g. the user might have changed env vars in the submission script or done other weird stuff).

@sphuber (Contributor) commented Nov 3, 2018

So does that mean that any file retrieved by the engine from the supercomputer is stored wholesale in the repository with the same encoding? Which means that nowhere in the code are we (or should we be) expecting the files to always be utf-8; they could be anything?

@giovannipizzi (Member):

That would be my expectation. For files AiiDA writes itself and then has to read again (e.g. if it stores a file in the repo that it has to read back, say a JSON or other), we should instead always ensure utf8.

@sphuber (Contributor) commented Nov 3, 2018

OK, if we take the stance that we can never assume utf-8 unless we explicitly write it ourselves, we don't need a migration and we can merge this PR.

@sphuber (Contributor) commented Nov 3, 2018

Many thanks to @ConradJohnston and @dev-zero for the hard work!

@sphuber sphuber self-requested a review November 3, 2018 10:58
@sphuber sphuber dismissed dev-zero’s stale review November 3, 2018 10:59

@dev-zero already mentioned in the comments that he approves after the changes made.

@sphuber sphuber merged commit 552c626 into aiidateam:develop Nov 3, 2018
@ConradJohnston ConradJohnston deleted the fix_2047_ensure_all_files_are_utf8_encoded branch November 21, 2018 15:23

Successfully merging this pull request may close these issues:

  • replace json implementations with simplejson
  • Ensure all files written by AiiDA and plugins are utf8 encoded