2420 removed as_bytes and as_string #2468
Conversation
Codecov Report
@@            Coverage Diff            @@
##           master    #2468     +/-  ##
=========================================
+ Coverage   84.74%   84.75%   +<.01%
=========================================
  Files         321      321
  Lines       52417    52399      -18
=========================================
- Hits        44423    44409      -14
+ Misses       7994     7990       -4
Continue to review full report at Codecov.
Bio/bgzf.py (outdated):

@@ -446,7 +445,8 @@ def _load_bgzf_block(handle, text_mode=False):
     if expected_crc != crc:
         raise RuntimeError("CRC is %s, not %s" % (crc, expected_crc))
     if text_mode:
-        return block_size, _as_string(data)
+        import codecs
Might as well put this import at the top of the file (with the other standard library imports, roughly alphabetical)?
Looks good. Given this touches the online tests, we'll need to run the full test suite locally (TravisCI and AppVeyor do the offline tests only).
Can you try a rebase (rather than merging in master), something like this assuming your repository's remote name is
-    old = _as_bytes(old)
+    if isinstance(old, str):
+        import codecs
+        old = codecs.latin_1_encode(old)[0]
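As a side note for readers of this thread, the [0] indexing in the patched line is needed because the codecs.latin_1_encode helper returns a (bytes, length) tuple rather than just the encoded bytes. A minimal standalone sketch (the sample string here is hypothetical, not taken from the pull request):

```python
import codecs

old = "abc\xff"  # hypothetical sample; "\xff" is the character 255 critical in SFF

# codecs.latin_1_encode returns a (bytes, length_consumed) tuple,
# hence the [0] in the patch to keep only the encoded bytes.
encoded = codecs.latin_1_encode(old)
print(encoded)  # (b'abc\xff', 4)

# The plain str.encode method returns the bytes directly:
print(old.encode("latin-1"))  # b'abc\xff'
```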
@peterjc While we are looking at this and as @Andrey-Raspopov touched here, I am actually wondering what is the rationale to use (and keep using) latin-1 encoding as opposed to utf-8?
There may be some cases where that makes sense as the default encoding. Also, historically it seems that latin-1 was at one point a competing Unicode encoding scheme. However, it seems that nowadays it is UTF-8 that is the de-facto unicode encoding standard. It is also the default encoding that Python source files have, sans an encoding comment.
Is there a reason (other than maybe backwards compatibility) that we keep using latin-1?
For reference, I also tried looking at our history and can see that the last commit that introduced latin-1 was made almost 9 years ago.
I agree that UTF-8 makes more sense nowadays.
See also my comment above for Bio/bgzf.py.
Very good question Bow. At least in SFF the mapping of character 255 was critical, and latin-1 preserved it but (IIRC) other encodings did not. So leave SFF as is. For BGZF we probably ought to follow
Other uses probably need a case-by-case consideration.
I'd need to re-read Python 3 best practice, but from memory latin-1 was a pragmatic choice for a minimal set: it would cover plain ASCII and all the characters likely to be seen in English-language data files, but fail on 'exotic' characters like Chinese or Russian: https://en.wikipedia.org/wiki/Latin-1_Supplement_(Unicode_block) Very few bioinformatics file formats ever bother to specify the encoding. In practice most text-based ones are ASCII except for user-entered text, which could be in the user's local encoding, and that is where most of the trouble occurs.
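Peter's point about Latin-1 being a lossless choice for arbitrary byte data can be checked directly. This is a standalone illustration, not code from the pull request:

```python
# Latin-1 (ISO-8859-1) maps every byte value 0-255 to the Unicode code
# point with the same number, so any byte string round-trips losslessly.
data = bytes(range(256))
assert data.decode("latin-1").encode("latin-1") == data

# UTF-8 cannot decode arbitrary bytes: 0xFF (the character 255 that
# matters for SFF files) is never a valid byte in a UTF-8 stream.
try:
    b"\xff".decode("utf-8")
except UnicodeDecodeError:
    print("UTF-8 rejects byte 0xFF")
```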
I've changed the imports and will rebase and run the online tests as soon as I get to the PC.
I've run the online tests and they pass for every test that doesn't already fail on master on my machine.
I am OK with these changes, but the Bio.Entrez stuff has already been taken care of.
Could you file issues on the failing online tests, especially if they are still failing now (a few days later and thus less likely to be temporary network issues)?
Right now I'm having trouble seeing which changes exactly are from this branch; GitHub seems confused by the complicated merge history. For example, it shows the first diff as yield-from changes in
@Andrey-Raspopov, would you object to me cleaning up the branch history to a single commit from the current master? Or could you try that?
@Andrey-Raspopov sadly the git merge just makes this more complicated, GitHub now shows 146 files changed. In these situations I find git rebase more useful (although as always with git, there is more than one way to do anything, sigh - Python has a different philosophy on this).
I could clean up the history by making a new branch and doing two cherry-picks, both of which required manually dealing with merge conflicts. Hopefully this passes all the tests... running the online tests locally now.
I've got rid of the merge conflicts by rebasing. It shows that only 15 files changed; is it the same for you?
@@ -446,7 +445,8 @@ def _load_bgzf_block(handle, text_mode=False):
     if expected_crc != crc:
         raise RuntimeError("CRC is %s, not %s" % (crc, expected_crc))
     if text_mode:
-        return block_size, _as_string(data)
+        # Note ISO-8859-1 aka Latin-1 preserves first 256 chars
+        return block_size, codecs.latin_1_decode(data)[0]
Isn't it inconsistent to use Latin-1 here, but UTF-8 in the write function below?
For example,
>>> import codecs
>>> s = "ü"
>>> b = s.encode()
>>> codecs.latin_1_decode(b)
('Ã¼', 2)
>>> codecs.utf_8_decode(b)
('ü', 2)
I would also suggest using decode on the bytes object directly, as in
>>> b.decode('UTF8')
'ü'
>>> b.decode('latin-1')
'Ã¼'
Then you don't have to extract just the first value from the tuple returned by codecs.utf_8_decode or codecs.latin_1_decode. (If it's UTF-8, you can just do b.decode().)
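To make the two styles in the suggestion above concrete, here is a quick standalone sketch (the expected strings are written with escapes, since literal mojibake is easy to garble when copying):

```python
import codecs

b = "ü".encode("utf-8")  # b'\xc3\xbc'

# The codecs helpers return a (text, bytes_consumed) tuple:
assert codecs.utf_8_decode(b) == ("\xfc", 2)        # ("ü", 2)
assert codecs.latin_1_decode(b) == ("\xc3\xbc", 2)  # ("Ã¼", 2), mojibake

# bytes.decode returns the text directly, with no tuple to unpack:
assert b.decode() == "\xfc"  # UTF-8 is the default encoding
assert b.decode("latin-1") == "\xc3\xbc"
```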
Given Andrey isn't available to further improve this pull request right now (and his work has drawn attention to this pre-existing issue rather than caused it), I will merge this and then follow up on Bio/bgzf.py separately.
Using codecs does seem redundant here.
Unfortunately I won't be available to edit this merge in the next 1.5 weeks, so feel free to make any amendments or cherry-pick them.
Yes, the rebase also gave 15 files changed :)
This pull request addresses issue #2420
I hereby agree to dual licence this and any previous contributions under both the Biopython License Agreement AND the BSD 3-Clause License.

I have read the CONTRIBUTING.rst file, have run flake8 locally, and understand that AppVeyor and TravisCI will be used to confirm the Biopython unit tests and style checks pass with these changes.

I have added my name to the alphabetical contributors listings in the files NEWS.rst and CONTRIB.rst as part of this pull request, am listed already, or do not wish to be listed. (This acknowledgement is optional.)