Switch all encoding to UTF-8 #256

jiho · 2018-10-15T07:09:56Z

Latin-1 is not the preferred standard.

Input files in Latin-1 should be detected and converted to UTF-8 upon import. Field names (mapping) should all be UTF-8.
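A minimal sketch of what "detect and convert upon import" could look like, assuming the import code has the raw bytes of the file at hand. The function name `decode_import_bytes` is hypothetical and not part of EcoTaxa; the approach is simply to try strict UTF-8 first and fall back to Latin-1.

```python
def decode_import_bytes(raw: bytes) -> str:
    """Decode an uploaded file, preferring UTF-8 and falling back to Latin-1.

    Hypothetical helper for illustration only.
    """
    try:
        # Strict decoding fails on byte sequences that are not valid UTF-8,
        # which is the usual case for Latin-1 text with accented letters.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a character, so this never raises.
        return raw.decode("latin-1")


legacy = "Copépode".encode("latin-1")
modern = "Copépode".encode("utf-8")
decoded_legacy = decode_import_bytes(legacy)
decoded_modern = decode_import_bytes(modern)
```

Once decoded to `str`, the content can be stored or re-written as UTF-8 regardless of how it arrived. Pure ASCII files decode identically either way, so the fallback only matters for files with non-ASCII characters.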

grololo06 · 2020-09-17T11:56:33Z

While copying WoRMS data into our DB:

ERROR:    For parent 160739 and child 1033779 : (psycopg2.errors.UntranslatableCharacter) character with byte sequence 0xc5 0x84 in encoding "UTF8" has no equivalent in encoding "LATIN1"

[SQL: INSERT INTO worms (aphia_id, url, scientificname, authority, status, unacceptreason, taxon_rank_id, rank, valid_aphia_id, valid_name, valid_authority, parent_name_usage_id, kingdom, phylum, class_, "order", family, genus, citation, lsid, is_marine, is_brackish, is_freshwater, is_terrestrial, is_extinct, match_type, modified, all_fetched) VALUES (%(aphia_id)s, %(url)s, %(scientificname)s, %(authority)s, %(status)s, %(unacceptreason)s, %(taxon_rank_id)s, %(rank)s, %(valid_aphia_id)s, %(valid_name)s, %(valid_authority)s, %(parent_name_usage_id)s, %(kingdom)s, %(phylum)s, %(class_)s, %(order)s, %(family)s, %(genus)s, %(citation)s, %(lsid)s, %(is_marine)s, %(is_brackish)s, %(is_freshwater)s, %(is_terrestrial)s, %(is_extinct)s, %(match_type)s, %(modified)s, %(all_fetched)s)]
[parameters: {'aphia_id': 1033779, 'url': 'http://www.marinespecies.org/aphia.php?p=taxdetails&id=1033779', 'scientificname': 'Austrocidaris seymourensis', 'authority': 'Radwańska, 1996', 'status': 'accepted', 'unacceptreason': None, 'taxon_rank_id': 220, 'rank': 'Species', 'valid_aphia_id': 1033779, 'valid_name': 'Austrocidaris seymourensis', 'valid_authority': 'Radwańska, 1996', 'parent_name_usage_id': 160739, 'kingdom': 'Animalia', 'phylum': 'Echinodermata', 'class_': 'Echinoidea', 'order': 'Cidaroida', 'family': 'Cidaridae', 'genus': 'Austrocidaris', 'citation': 'Kroh, A.; Mooi, R. (2020). World Echinoidea Database. Austrocidaris seymourensis Radwańska, 1996&nbsp;&#8224;. Accessed through: World Register of Marine Species at: http://www.marinespecies.org/aphia.php?p=taxdetails&id=1033779 on 2020-09-17', 'lsid': 'urn:lsid:marinespecies.org:taxname:1033779', 'is_marine': True, 'is_brackish': False, 'is_freshwater': False, 'is_terrestrial': False, 'is_extinct': True, 'match_type': 'exact', 'modified': '2017-09-07T12:19:05.493Z', 'all_fetched': None}]

Problem is 'Radwańska, 1996', I guess.
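That guess can be checked in plain Python, without the database. The short sketch below only reproduces the character-level problem; it does not say which side of the connection was set to LATIN1.

```python
authority = "Radwańska, 1996"

# 'ń' (U+0144) is encoded as the two bytes 0xc5 0x84 in UTF-8,
# the exact byte sequence named in the PostgreSQL error.
utf8_bytes = authority.encode("utf-8")
has_reported_bytes = b"\xc5\x84" in utf8_bytes

# Latin-1 only covers code points up to U+00FF, so 'ń' cannot be encoded.
try:
    authority.encode("latin-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False
```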

moi90 · 2020-11-05T20:31:21Z

I'm very +1 on this!

grololo06 · 2021-09-15T08:06:57Z

The DB is UTF-8, the API is Unicode, and so is all the Python code. The file format still needs to be cleared up.

grololo06 · 2021-09-15T18:13:02Z

Trying to set '🦠 unicode!' in a text field, I got this during the export:
'latin-1' codec can't encode character '\U0001f9a0' in position 708: ordinal not in range(256)

grololo06 · 2021-09-16T09:10:45Z

Writing of files:

Let's leave the choice between latin1 and utf8 to users. For utf8, let's use a BOM (https://en.wikipedia.org/wiki/Byte_order_mark#UTF-8) to mark the produced text file.

Reading of files:

If a BOM is present, let's read with utf-8; otherwise fall back to the present latin1.
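The write and read rules above can be sketched as follows. This is an illustration under assumptions, not the code that went into EcoTaxa: the helper names `write_text` and `read_text` and the file names are made up for the example.

```python
import codecs
import os
import tempfile


def write_text(path, text, use_utf8=True):
    # "utf-8-sig" prepends the UTF-8 BOM (bytes EF BB BF) when writing.
    encoding = "utf-8-sig" if use_utf8 else "latin-1"
    with open(path, "w", encoding=encoding, newline="") as f:
        f.write(text)


def read_text(path):
    with open(path, "rb") as f:
        raw = f.read()
    if raw.startswith(codecs.BOM_UTF8):
        # BOM present: strip it and decode as UTF-8.
        return raw[len(codecs.BOM_UTF8):].decode("utf-8")
    # No BOM: fall back to the legacy encoding.
    return raw.decode("latin-1")


with tempfile.TemporaryDirectory() as tmp:
    utf8_path = os.path.join(tmp, "export_utf8.tsv")
    latin1_path = os.path.join(tmp, "export_latin1.tsv")
    write_text(utf8_path, "note\n🦠 unicode!\n", use_utf8=True)
    write_text(latin1_path, "note\ncopépode\n", use_utf8=False)
    utf8_back = read_text(utf8_path)
    latin1_back = read_text(latin1_path)
```

One consequence of this rule is that a UTF-8 file written without a BOM by another tool would be read as latin1 and come out garbled, so the BOM effectively becomes mandatory for UTF-8 input.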

moi90 · 2021-09-20T14:06:22Z

Do you really insist on BOMs? It only makes reading the files harder, see here, for example: You need to read the files with encoding='utf-8-sig' to remove the BOM which defeats the whole purpose of "we indicate the encoding in the file so that the user does not have to guess".
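To make the concern concrete, here is a small sketch of what happens when a file with a BOM is read with plain `utf-8`. The column names are only illustrative.

```python
import csv
import io

# A tab-separated export with a UTF-8 BOM at the start.
raw = "object_id\tnote\nobj1\tCopepoda\n".encode("utf-8-sig")

# Decoding with plain "utf-8" keeps the BOM as the character U+FEFF,
# which then becomes part of the first column name.
rows = list(csv.DictReader(io.StringIO(raw.decode("utf-8")), delimiter="\t"))
first_column = list(rows[0].keys())[0]
lookup_fails = "object_id" not in rows[0]
```

Here `first_column` is `'\ufeffobject_id'`, so a lookup by the expected name `object_id` fails unless the reader strips the BOM.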

grololo06 · 2021-09-21T06:19:38Z

> Do you really insist on BOMs? It only makes reading the files harder, see here, for example: You need to read the files with encoding='utf-8-sig' to remove the BOM which defeats the whole purpose of "we indicate the encoding in the file so that the user does not have to guess".

Hello, we tried it as ordinary users would, on many OSes, and the BOM is OK for spreadsheet apps.
For devs like us, it's not really difficult to adapt the code.

grololo06 · 2021-09-21T07:58:00Z

Fixed in 2.5.12

moi90 · 2021-09-22T07:17:05Z

Hmm... But even regular users don't always use spreadsheet apps. I really don't think that a BOM is the right way these days where UTF8 is the standard everywhere (even Windows starts using it). But you're the boss...

moi90 · 2021-09-22T07:20:42Z

This means that we might want a simple charset detection in pyecotaxa so the user does not have to worry about this.

jiho · 2021-09-22T13:40:30Z

I had not followed the export part of this. By "leave the choice to users" you mean a checkbox at export time asking to choose the encoding? I think most users will not understand it, and those who do will be knowledgeable enough to deal with this afterwards.

So I think all export should be UTF-8 (and if some obscure windows-only utility fails, then too bad).

Then BOM or no BOM I don't know. @moi90: Rainer tested opening Laurent's file (which I assume is UTF-8 with BOM) with Python and found no problem.

grololo06 · 2021-09-23T03:32:14Z

#724

moi90 · 2021-11-25T16:47:15Z

I found out the following:

There is no reason to check if a BOM exists or not, utf-8-sig manages that for you and behaves exactly as utf-8 if the BOM does not exist.

(https://stackoverflow.com/a/44573867/1116842)

So I have no problem with BOM anymore and will make utf-8-sig the default when reading EcoTaxa files.
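A quick check of that behaviour, using only the standard `utf-8-sig` codec:

```python
text = "object_id\tnote\nobj1\t🦠 unicode!\n"

with_bom = text.encode("utf-8-sig")   # starts with EF BB BF
without_bom = text.encode("utf-8")    # no BOM

# Decoding with "utf-8-sig" strips the BOM if present and is a no-op otherwise.
decoded_with_bom = with_bom.decode("utf-8-sig")
decoded_without_bom = without_bom.decode("utf-8-sig")
```

Both results equal the original `text`, so a reader that always uses `utf-8-sig` handles UTF-8 files from either kind of writer.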

grololo06 added the feature (New functionality) label Mar 20, 2020

picheral added the page-export (Everything related to export functionality) and page-import (Everything related to import functionality) labels May 5, 2020

grololo06 self-assigned this Sep 16, 2020

jiho mentioned this issue Nov 5, 2020

File encoding ecotaxa/pyecotaxa#3

Open

grololo06 added this to the 2.5.11 milestone Jun 16, 2021

grololo06 modified the milestones: 2.5.11, 2.5.12 Sep 15, 2021

grololo06 pushed a commit that referenced this issue Sep 17, 2021

#256: Switch exports to utf-8 but leave users an option for using old encoding.

20848cb

grololo06 pushed a commit to ecotaxa/ecotaxa_back that referenced this issue Sep 17, 2021

ecotaxa/ecotaxa_front#256: Switch exports to utf-8 but leave users an option for using old encoding.

9418f0f

grololo06 added a commit to ecotaxa/ecotaxa_back that referenced this issue Sep 18, 2021

ecotaxa/ecotaxa_front#256: fix a big reg and align tests

e37b6fd

grololo06 closed this as completed Sep 21, 2021

jiho reopened this Sep 22, 2021

grololo06 closed this as completed Sep 23, 2021

moi90 added a commit to moi90/pyecotaxa that referenced this issue Jan 11, 2022

!write_tsv: Change default encoding to utf-8-sig to mirror the EcoTaxa web application. See ecotaxa/ecotaxa_front#256 Closes ecotaxa#3.

26de6bb

moi90 mentioned this issue Jan 11, 2022

!write_tsv: Change default encoding to utf-8-sig to mirror the EcoTax… ecotaxa/pyecotaxa#10

Draft
