New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove wchartype dependency and is_multi_byte [1/2] #5062

Merged
merged 7 commits into from Nov 28, 2017

Conversation

Projects
None yet
2 participants
@nsoranzo
Member

nsoranzo commented Nov 23, 2017

wchartype won't make sense in Python3, where all strings are UTF-8 encoded.

See individual commits for details.

xref. #1715

@jmchilton

This comment has been minimized.

Member

jmchilton commented Nov 23, 2017

Does get_file_peek still need to take that is multi-byte argument and just ignore it because of tool shed datatype definitions?

@@ -266,7 +259,7 @@ def is_column_based(fname, sep='\t', skip=0, is_multi_byte=False):
>>> is_column_based(fname)
True
"""
headers = get_headers(fname, sep, is_multi_byte=is_multi_byte)
headers = get_headers(fname, sep)

This comment has been minimized.

@nsoranzo

nsoranzo Nov 23, 2017

Member

Need to add a try/except UnicodeDecodeError around this line.

@nsoranzo nsoranzo changed the title from [WIP] Remove wchartype dependency and is_multi_byte to Remove wchartype dependency and is_multi_byte [1/2] Nov 24, 2017

@nsoranzo nsoranzo added status/review and removed status/WIP labels Nov 24, 2017

@nsoranzo nsoranzo added this to the 18.01 milestone Nov 24, 2017

@nsoranzo

This comment has been minimized.

Member

nsoranzo commented Nov 27, 2017

Does get_file_peek still need to take that is multi-byte argument and just ignore it because of tool shed datatype definitions?

@jmchilton Then I should probably also re-add the is_multi_byte param to all the set_peek() methods, since some TS datatype subclass may be calling the parent's method when overriding it. What do you think?

@jmchilton

This comment has been minimized.

Member

jmchilton commented Nov 27, 2017

@nsoranzo Sounds fair - sorry about these data types - they are why we can't have nice things.

nsoranzo added some commits Nov 22, 2017

Add missing newline at end of some test files
It gets added anyway when the file is uploaded to Galaxy due to the
default "Use POSIX standard" (to_posix_lines) option. Many tests which
simply copy them with cat and compare them would fail a strict diff test.
Improve and test the PDF sniffer
Read just the first 4 bytes in binary mode.
Fix `SnpSiftDbNSFP.generate_primary_file()` to return a string
Also use a `with` statement to open/close a file in
`SnpSiftDbNSFP.regenerate_primary_file()` .
Add a datatype to sniff_order only if it has a sniff() method
This reduces the number of default sniffers from 412 to 246.
Update `get_fileobj()` to use utf-8 encoding in text mode
Also, merge its 3 parameters `gzip_only`, `bz2_only`, `zip_only` into
`compressed_formats` (a list of allowed formats).

As a consequence of the changes in `get_fileobj()`, update:
- `files_diff()`
- `get_file_peek()`, which now determines that a file is binary when a
  `UnicodeDecodeError` exception is raised and doesn't need
  `is_multi_byte` any more
- `iter_headers()` and `get_headers()`, which now return Unicode and don't
  need `is_multi_byte` parameter any more

As a consequence of the changes in `get_file_peek()`, update:
- `set_peek()`, which now doesn't need `is_multi_byte` any more

As a consequence of the changes in `get_headers()`, update:
- `guess_ext` and `is_column_based()`, which now determine that a file is
  binary when a `UnicodeDecodeError` exception is raised and don't need
  `is_multi_byte` any more

As a consequence of the changes to `guess_ext`, update:
- `handle_uploaded_dataset_file() doesn't need `is_multi_byte` any more

Also, remove duplicated calls to `get_file_peek()` in
lib/galaxy/datatypes/molecules.py and lib/galaxy/datatypes/msa.py

The `is_multi_byte` was not removed from the signature of `get_file_peek()`
and `set_peek()` in order to preserve compatibility for ToolShed datatypes,
thanks @jmchilton for the review.
@nsoranzo

This comment has been minimized.

Member

nsoranzo commented Nov 28, 2017

@jmchilton Restored the is_multi_byte in the signatures.

@jmchilton jmchilton merged commit 8d7f344 into galaxyproject:dev Nov 28, 2017

7 checks passed

api test Build finished. 317 tests run, 4 skipped, 0 failed.
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
framework test Build finished. 163 tests run, 0 skipped, 0 failed.
Details
integration test Build finished. 58 tests run, 0 skipped, 0 failed.
Details
lgtm analysis: JavaScript No alert changes
Details
selenium test Build finished. 100 tests run, 1 skipped, 0 failed.
Details
toolshed test Build finished. 577 tests run, 0 skipped, 0 failed.
Details
@jmchilton

This comment has been minimized.

Member

jmchilton commented Nov 28, 2017

Thanks for supporting those datatypes and all of this - really nice work!

@nsoranzo

This comment has been minimized.

Member

nsoranzo commented Nov 28, 2017

Thanks for the review and merge, part 2/2 coming in the next few days, I just have to split changes in commits and test.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment