Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

2420 removed as_bytes and as_string #2468

Merged
merged 3 commits into from
Jan 8, 2020

Conversation

Andrey-Raspopov
Copy link
Contributor

This pull request addresses issue #2420

  • I hereby agree to dual licence this and any previous contributions under both
    the Biopython License Agreement AND the BSD 3-Clause License.

  • I have read the CONTRIBUTING.rst file, have run flake8 locally, and
    understand that AppVeyor and TravisCI will be used to confirm the Biopython unit
    tests and style checks pass with these changes.

  • I have added my name to the alphabetical contributors listings in the files
    NEWS.rst and CONTRIB.rst as part of this pull request, am listed
    already, or do not wish to be listed. (This acknowledgement is optional.)

@codecov
Copy link

codecov bot commented Dec 30, 2019

Codecov Report

Merging #2468 into master will increase coverage by <.01%.
The diff coverage is 65.51%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #2468      +/-   ##
==========================================
+ Coverage   84.74%   84.75%   +<.01%     
==========================================
  Files         321      321              
  Lines       52417    52399      -18     
==========================================
- Hits        44423    44409      -14     
+ Misses       7994     7990       -4
Impacted Files Coverage Δ
Bio/_py3k/__init__.py 40% <ø> (-15.89%) ⬇️
Bio/SearchIO/HmmerIO/hmmer2_text.py 97.51% <ø> (-0.02%) ⬇️
Bio/SearchIO/BlastIO/blast_xml.py 94.47% <0%> (-0.02%) ⬇️
Bio/PDB/PDBList.py 16.45% <0%> (-0.53%) ⬇️
Bio/Blast/NCBIWWW.py 11.82% <0%> (-0.94%) ⬇️
Bio/SearchIO/HmmerIO/hmmer3_text.py 99.17% <100%> (-0.01%) ⬇️
Bio/bgzf.py 91.57% <100%> (+0.03%) ⬆️
Bio/SearchIO/FastaIO.py 95.95% <100%> (-0.02%) ⬇️
Bio/SearchIO/BlatIO.py 96.87% <100%> (-0.01%) ⬇️
Bio/KEGG/KGML/KGML_pathway.py 80.25% <100%> (-0.06%) ⬇️
... and 5 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5da5801...985937d. Read the comment docs.

Bio/bgzf.py Outdated
@@ -446,7 +445,8 @@ def _load_bgzf_block(handle, text_mode=False):
if expected_crc != crc:
raise RuntimeError("CRC is %s, not %s" % (crc, expected_crc))
if text_mode:
return block_size, _as_string(data)
import codecs
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might as well put this import at the top of the file (with the other standard library imports, roughly alphabetical)?

@peterjc
Copy link
Member

peterjc commented Dec 30, 2019

Looks good. Given this touches the online tests, we'll need to run the full test suite locally (TravisCI and AppVeyor do the offline tests only).

@peterjc
Copy link
Member

peterjc commented Dec 30, 2019

Can you try a rebase (rather than merging in master), something like this assuming your repository's remote name is andrey (might be origin) and the official one is upstream (might be origin):

git fetch upstream
git rebase upstream/master
git push andrey 2420-as --force  # must force push as re-wrote branch history

old = _as_bytes(old)
if isinstance(old, str):
import codecs
old = codecs.latin_1_encode(old)[0]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@peterjc While we are looking at this and as @Andrey-Raspopov touched here, I am actually wondering what is the rationale to use (and keep using) latin-1 encoding as opposed to utf-8?

There may be some cases where that makes sense as the default encoding. Also, historically it seems that latin-1 was at one point a competing Unicode encoding scheme. However, it seems that nowadays it is UTF-8 that is the de-facto unicode encoding standard. It is also the default encoding that Python source files have, sans an encoding comment.

Is there a reason (other than maybe backwards compatibility) that we keep using latin-1?

For reference, I also tried looking at our history and can see that the last commit that introduced latin-1 was made almost 9 years ago.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that UTF-8 makes more sense nowadays

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See also my comment above for Bio/bgzf.py.

@peterjc
Copy link
Member

peterjc commented Dec 30, 2019

Very good question Bow.

At least in SFF the mapping of character 255 was critical, and latin1 preserved it but (IIRC) other encodings did not. So leave SFF as is.

For BGZF we probably ought to follow gzip.open and add encoding to our open() function and classes, and default to the system encoding.

Other uses probably need a case by case consideration.

@peterjc
Copy link
Member

peterjc commented Dec 30, 2019

I'd need to re-read Python 3 best practise, but from memory latin1 was a pragmatic choice for a minimal set - it would cover plain ASCII and all the characters likely to be seen in English language data files, but fail on 'exotic' characters like Chinese or Russian:

https://en.wikipedia.org/wiki/Latin-1_Supplement_(Unicode_block)

Very few Bioinformatics file formats ever bother to specify the encoding - in practise most text based ones are ASCII except for user entered text which could be their local encoding, and would be were most of the trouble occurs.

@Andrey-Raspopov
Copy link
Contributor Author

I've changed imports and will rebase and run online tests as soon as I get to the PC.

@Andrey-Raspopov
Copy link
Contributor Author

I've ran online tests and they are positive for every test that doesn't fail on master on my machine.

Bio/Entrez/Parser.py Outdated Show resolved Hide resolved
Bio/Entrez/__init__.py Outdated Show resolved Hide resolved
@mdehoon
Copy link
Contributor

mdehoon commented Jan 1, 2020

I am OK with these changes, but the Bio.Entrez stuff has already been taken care of

@peterjc
Copy link
Member

peterjc commented Jan 3, 2020

Could you file issues on the failing online tests, especially if they are still failing now (a few days later and thus less likely to be temporary network issues)?

@peterjc
Copy link
Member

peterjc commented Jan 3, 2020

Right now I'm having trouble seeing which changes exactly are from this branch - GitHub seems confused by the complicated merge history - for example it shows the first diff as yield-from changes in Bio/AlignIO/__init__.py which were part of 1a5a470

@Andrey-Raspopov would you object to me cleaning up the branch history to a single commit from the current master? Or could you try that?

@peterjc
Copy link
Member

peterjc commented Jan 7, 2020

@Andrey-Raspopov sadly the git merge just makes this more complicated, GitHub now shows 146 files changed. In these situations I find git rebase more useful (although as always with git, there is more than one way to do anything, sigh - Python has a different philosophy on this).

@peterjc
Copy link
Member

peterjc commented Jan 7, 2020

I could clean up the history by making a new branch, and doing two cherry-picks, both of which required manually dealing with merge conflicts. Hopefully this passes all the tests... running the online tests locally now.

@Andrey-Raspopov
Copy link
Contributor Author

I've got rid of the merge conflicts by rebasing. It shows that only 15 files changed, is it the same for you?

@@ -446,7 +445,8 @@ def _load_bgzf_block(handle, text_mode=False):
if expected_crc != crc:
raise RuntimeError("CRC is %s, not %s" % (crc, expected_crc))
if text_mode:
return block_size, _as_string(data)
# Note ISO-8859-1 aka Latin-1 preserves first 256 chars
return block_size, codecs.latin_1_decode(data)[0]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't it inconsistent to use Latin-1 here, but UTF-8 in the write function below?
For example,

>>> import codecs
>>> s = "ü"
>>> b = s.encode()
>>> codecs.latin_1_decode(b)
('ü', 2)
>>> codecs.utf_8_decode(b)
('ü', 2)

I would also suggest to use decode on the string directly, as in

>>> b.decode('UTF8')
'ü'
>>> b.decode('latin-1')
'ü'

Then you don't have to extract just the first value from the tuple returned by codecs.utf_8_decode, codecs.latin_1_decode. (If it's UTF-8, you can just do b.decode()).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given Andrey isn't available to further improve this pull request right now (and his work as drawn attention to this pre-existing issue rather than caused it), I will merge this and then follow up on Bio/bgzf.py separately.

Using codecs does seem redundant here.

@Andrey-Raspopov
Copy link
Contributor Author

Unfortunately I won't be available to edit this merge in the next 1.5 weeks, so feel free to make any amendments or pick them.

@peterjc
Copy link
Member

peterjc commented Jan 8, 2020

Yes, the rebase also gave 15 files changed :)

@peterjc peterjc merged commit 1c212b9 into biopython:master Jan 8, 2020
@peterjc peterjc mentioned this pull request Jan 8, 2020
@Andrey-Raspopov Andrey-Raspopov deleted the 2420-as branch January 8, 2020 15:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants