2420 removed as_bytes and as_string #2468
Conversation
Codecov Report
@@            Coverage Diff            @@
##           master    #2468     +/-  ##
=========================================
+ Coverage   84.74%   84.75%   +<.01%
=========================================
  Files         321      321
  Lines       52417    52399      -18
=========================================
- Hits        44423    44409      -14
+ Misses       7994     7990       -4
Continue to review full report at Codecov.
Bio/bgzf.py (outdated):

@@ -446,7 +445,8 @@ def _load_bgzf_block(handle, text_mode=False):
     if expected_crc != crc:
         raise RuntimeError("CRC is %s, not %s" % (crc, expected_crc))
     if text_mode:
-        return block_size, _as_string(data)
+        import codecs
Might as well put this import at the top of the file (with the other standard library imports, roughly alphabetical)?
Looks good. Given this touches the online tests, we'll need to run the full test suite locally (TravisCI and AppVeyor do the offline tests only).
Can you try a rebase (rather than merging in master), something like this assuming your repository's remote name is
-    old = _as_bytes(old)
+    if isinstance(old, str):
+        import codecs
+        old = codecs.latin_1_encode(old)[0]
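As a side note for readers of this thread, the [0] indexing in the patched line is needed because the codecs.latin_1_encode helper returns a (bytes, length) tuple rather than just the encoded bytes. A minimal standalone sketch (the sample string here is hypothetical, not taken from the pull request):

```python
import codecs

old = "abc\xff"  # hypothetical sample; "\xff" is the character 255 critical in SFF

# codecs.latin_1_encode returns a (bytes, length_consumed) tuple,
# hence the [0] in the patch to keep only the encoded bytes.
encoded = codecs.latin_1_encode(old)
print(encoded)  # (b'abc\xff', 4)

# The plain str.encode method returns the bytes directly:
print(old.encode("latin-1"))  # b'abc\xff'
```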
@peterjc While we are looking at this and as @Andrey-Raspopov touched here, I am actually wondering what is the rationale to use (and keep using) latin-1 encoding as opposed to utf-8?
There may be some cases where that makes sense as the default encoding. Also, historically it seems that latin-1 was at one point a competing Unicode encoding scheme. However, it seems that nowadays it is UTF-8 that is the de-facto unicode encoding standard. It is also the default encoding that Python source files have, sans an encoding comment.
Is there a reason (other than maybe backwards compatibility) that we keep using latin-1?
For reference, I also tried looking at our history and can see that the last commit that introduced latin-1 was made almost 9 years ago.
I agree that UTF-8 makes more sense nowadays.
See also my comment above for Bio/bgzf.py.
Very good question Bow. At least in SFF the mapping of character 255 was critical, and latin-1 preserved it but (IIRC) other encodings did not. So leave SFF as is. For BGZF we probably ought to follow
Other uses probably need a case-by-case consideration.
I'd need to re-read Python 3 best practice, but from memory latin-1 was a pragmatic choice for a minimal set: it would cover plain ASCII and all the characters likely to be seen in English-language data files, but fail on 'exotic' characters like Chinese or Russian: https://en.wikipedia.org/wiki/Latin-1_Supplement_(Unicode_block) Very few bioinformatics file formats ever bother to specify the encoding. In practice most text-based ones are ASCII except for user-entered text, which could be in the user's local encoding, and that is where most of the trouble occurs.
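Peter's point about Latin-1 being a lossless choice for arbitrary byte data can be checked directly. This is a standalone illustration, not code from the pull request:

```python
# Latin-1 (ISO-8859-1) maps every byte value 0-255 to the Unicode code
# point with the same number, so any byte string round-trips losslessly.
data = bytes(range(256))
assert data.decode("latin-1").encode("latin-1") == data

# UTF-8 cannot decode arbitrary bytes: 0xFF (the character 255 that
# matters for SFF files) is never a valid byte in a UTF-8 stream.
try:
    b"\xff".decode("utf-8")
except UnicodeDecodeError:
    print("UTF-8 rejects byte 0xFF")
```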
I've changed the imports and will rebase and run the online tests as soon as I get to the PC.
I've run the online tests and they pass for every test that doesn't already fail on master on my machine.
I am OK with these changes, but the Bio.Entrez stuff has already been taken care of.
Could you file issues on the failing online tests, especially if they are still failing now (a few days later and thus less likely to be temporary network issues)?
Right now I'm having trouble seeing which changes exactly are from this branch; GitHub seems confused by the complicated merge history. For example, it shows the first diff as yield-from changes in
@Andrey-Raspopov, would you object to me cleaning up the branch history to a single commit from the current master? Or could you try that?
@Andrey-Raspopov sadly the git merge just makes this more complicated, GitHub now shows 146 files changed. In these situations I find git rebase more useful (although as always with git, there is more than one way to do anything, sigh - Python has a different philosophy on this).
I could clean up the history by making a new branch and doing two cherry-picks, both of which required manually dealing with merge conflicts. Hopefully this passes all the tests... running the online tests locally now.
I've got rid of the merge conflicts by rebasing. It shows that only 15 files changed; is it the same for you?
@@ -446,7 +445,8 @@ def _load_bgzf_block(handle, text_mode=False):
     if expected_crc != crc:
         raise RuntimeError("CRC is %s, not %s" % (crc, expected_crc))
     if text_mode:
-        return block_size, _as_string(data)
+        # Note ISO-8859-1 aka Latin-1 preserves first 256 chars
+        return block_size, codecs.latin_1_decode(data)[0]
Isn't it inconsistent to use Latin-1 here, but UTF-8 in the write function below?
For example,
>>> import codecs
>>> s = "ü"
>>> b = s.encode()
>>> codecs.latin_1_decode(b)
('Ã¼', 2)
>>> codecs.utf_8_decode(b)
('ü', 2)
I would also suggest using decode on the bytes object directly, as in
>>> b.decode('UTF8')
'ü'
>>> b.decode('latin-1')
'Ã¼'
Then you don't have to extract just the first value from the tuple returned by codecs.utf_8_decode or codecs.latin_1_decode. (If it's UTF-8, you can just do b.decode().)
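To make the two styles in the suggestion above concrete, here is a quick standalone sketch (the expected strings are written with escapes, since literal mojibake is easy to garble when copying):

```python
import codecs

b = "ü".encode("utf-8")  # b'\xc3\xbc'

# The codecs helpers return a (text, bytes_consumed) tuple:
assert codecs.utf_8_decode(b) == ("\xfc", 2)        # ("ü", 2)
assert codecs.latin_1_decode(b) == ("\xc3\xbc", 2)  # ("Ã¼", 2), mojibake

# bytes.decode returns the text directly, with no tuple to unpack:
assert b.decode() == "\xfc"  # UTF-8 is the default encoding
assert b.decode("latin-1") == "\xc3\xbc"
```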
Given Andrey isn't available to further improve this pull request right now (and his work has drawn attention to this pre-existing issue rather than caused it), I will merge this and then follow up on Bio/bgzf.py separately.
Using codecs does seem redundant here.
Unfortunately I won't be available to edit this merge in the next 1.5 weeks, so feel free to make any amendments or cherry-pick them.
Yes, the rebase also gave 15 files changed :)
This pull request addresses issue #2420
I hereby agree to dual licence this and any previous contributions under both the Biopython License Agreement AND the BSD 3-Clause License.

I have read the CONTRIBUTING.rst file, have run flake8 locally, and understand that AppVeyor and TravisCI will be used to confirm the Biopython unit tests and style checks pass with these changes.

I have added my name to the alphabetical contributors listings in the files NEWS.rst and CONTRIB.rst as part of this pull request, am listed already, or do not wish to be listed. (This acknowledgement is optional.)