test_Entrez_online.py test_efetch_gds_utf8 fails with encoding issue (Python 2) #1849

Closed
peterjc opened this issue Nov 15, 2018 · 4 comments

@peterjc
Member

peterjc commented Nov 15, 2018

It looks like this test is working on Python 3, but failing on Python 2.7. We don't run this on AppVeyor or TravisCI because it is an online test: we expect some intermittent failures and also don't want to burden the online servers.

64bit Linux, Python 2.7.15 via conda

$ locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

$ python --version
Python 2.7.15

$ python test_Entrez_online.py
test_ecitmatch (test_Entrez_online.EntrezOnlineCase)
Test Entrez.ecitmatch to search for a citation ... ok
test_efetch_biosystems_xml (test_Entrez_online.EntrezOnlineCase)
Test Entrez parser with XML from biosystems ... ok
test_efetch_gds_utf8 (test_Entrez_online.EntrezOnlineCase)
Test correct handling of encodings in Entrez.efetch ... /mnt/.../conda/lib/python2.7/unittest/case.py:503: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
  if not first == second:
FAIL
...
======================================================================
FAIL: test_efetch_gds_utf8 (test_Entrez_online.EntrezOnlineCase)
Test correct handling of encodings in Entrez.efetch
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/mnt/.../repositories/biopython/Tests/test_Entrez_online.py", line 261, in test_efetch_gds_utf8
    self.assertEqual(result[342:359], expected_result)
AssertionError: '\xe2\x80\x9cfield of injur' != u'\u201cfield of injury\u201d'

(It also fails on #1848 but I have omitted that here)

Same issue on macOS using Apple provided Python 2.7.10

These are the left and right double quote characters,

https://www.fileformat.info/info/unicode/char/201c/index.htm
UTF-8 (hex), 0xE2 0x80 0x9C
UTF-16 (hex), 0x201C

https://www.fileformat.info/info/unicode/char/201d/index.htm
UTF-8 (hex), 0xE2 0x80 0x9D
UTF-16 (hex), 0x201D

Quoting the files,

            expected_result = u'“field of injury”'  # Use of Unicode double qoutation marks U+201C and U+201D
            self.assertEqual(result[342:359], expected_result)

Does test_Entrez_online.py need to declare an encoding since it is non-ASCII?

It does not seem to be a locale issue as the machines are both UTF8 based, and this fails the same way:

$ LANG=C python test_Entrez_online.py

I think the problem is likely an oversight in the encoding/decoding. The following change seems to fix it, but it needs reviewing, especially with regard to the call to _binary_to_string_handle inside _open inside efetch:

$ git diff
diff --git a/Tests/test_Entrez_online.py b/Tests/test_Entrez_online.py
index 12c345ba5..d2f00a480 100644
--- a/Tests/test_Entrez_online.py
+++ b/Tests/test_Entrez_online.py
@@ -257,7 +257,10 @@ class EntrezOnlineCase(unittest.TestCase):
             self.assertIn(URL_API_KEY, handle.url)
             self.assertIn("id=200079209", handle.url)
             result = handle.read()
-            expected_result = u'“field of injury”'  # Use of Unicode double qoutation marks U+201C and U+201D
+            if sys.version_info[0] < 3:
+                result = result.decode("UTF8")
+            # Use of Unicode double quotation marks U+201C and U+201D
+            expected_result = u'“field of injury”'
             self.assertEqual(result[342:359], expected_result)
             handle.close()
         finally:
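
For comparison, the same guard could be written without a version check by testing the type of the data instead. This is only a sketch: `ensure_text` is a hypothetical helper, not something in Biopython, and it assumes the response is UTF-8 encoded as in the diff above.

```python
def ensure_text(data, encoding="utf-8"):
    """Return data as a text string, decoding bytes with the given codec.

    Hypothetical helper for illustration; not part of Biopython.
    """
    if isinstance(data, bytes):  # Python 2 str, or Python 3 bytes
        return data.decode(encoding)
    return data  # already text


# Stand-ins for handle.read(): bytes on Python 2, text on Python 3.
as_bytes = b"\xe2\x80\x9cfield of injury\xe2\x80\x9d"
as_text = u"\u201cfield of injury\u201d"

assert ensure_text(as_bytes) == as_text
assert ensure_text(as_text) == as_text
```

Checking the type rather than `sys.version_info` keeps working if the handle's return type ever changes, at the cost of hiding which Python version needed the workaround.
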
@nimne
Contributor

nimne commented Nov 24, 2018

This error is probably because Python 3 renamed unicode() to str(), so strings are Unicode by default in Python 3 but not in Python 2. See here.

I can't think of a better solution than version checks, as you have suggested.

@peterjc
Member Author

peterjc commented Nov 26, 2018

We might have something suitable already defined in Bio._py3k, perhaps _as_unicode, rather than an explicit version check as per my initial suggestion.

peterjc added a commit that referenced this issue Nov 28, 2018
Closes GitHub issue #1849
@peterjc
Member Author

peterjc commented Nov 28, 2018

Using _as_unicode takes the default encoding and fails - using UTF8 explicitly seems to be required here, so I've applied the if-statement shown earlier.
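
A small sketch of that difference, assuming the default encoding in question is ASCII (which is what `sys.getdefaultencoding()` reports on Python 2). The codec is passed explicitly here so the snippet behaves the same on Python 3, where the default is UTF-8:

```python
# The bytes for the quoted phrase, as UTF-8.
raw = b"\xe2\x80\x9cfield of injury\xe2\x80\x9d"

# Decoding with ASCII (Python 2's default codec) fails on the first byte.
try:
    raw.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

# Decoding with UTF-8 explicitly recovers the expected text.
decoded = raw.decode("utf-8")

assert ascii_ok is False
assert decoded == u"\u201cfield of injury\u201d"
```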

@peterjc
Member Author

peterjc commented Dec 18, 2018

Test passes now, marking as resolved.

@peterjc peterjc closed this as completed Dec 18, 2018