Switch all encoding to UTF-8 #256

Closed
jiho opened this issue Oct 15, 2018 · 13 comments
Labels: feature (New functionality), page-export (Everything related to export functionality), page-import (Everything related to import functionality)

Comments

@jiho
Contributor

jiho commented Oct 15, 2018

Latin-1 is not the preferred standard.

Input files in Latin-1 should be detected and converted to UTF-8 upon import. Field names (mapping) should all be UTF-8.

@grololo06 grololo06 added the feature New functionality label Mar 20, 2020
@picheral picheral added page-export Everything related to export functionality page-import Everything related to import functionality labels May 5, 2020
@grololo06 grololo06 self-assigned this Sep 16, 2020
@grololo06
Member

While copying WoRMS data into our DB:

ERROR:    For parent 160739 and child 1033779 : (psycopg2.errors.UntranslatableCharacter) character with byte sequence 0xc5 0x84 in encoding "UTF8" has no equivalent in encoding "LATIN1"

[SQL: INSERT INTO worms (aphia_id, url, scientificname, authority, status, unacceptreason, taxon_rank_id, rank, valid_aphia_id, valid_name, valid_authority, parent_name_usage_id, kingdom, phylum, class_, "order", family, genus, citation, lsid, is_marine, is_brackish, is_freshwater, is_terrestrial, is_extinct, match_type, modified, all_fetched) VALUES (%(aphia_id)s, %(url)s, %(scientificname)s, %(authority)s, %(status)s, %(unacceptreason)s, %(taxon_rank_id)s, %(rank)s, %(valid_aphia_id)s, %(valid_name)s, %(valid_authority)s, %(parent_name_usage_id)s, %(kingdom)s, %(phylum)s, %(class_)s, %(order)s, %(family)s, %(genus)s, %(citation)s, %(lsid)s, %(is_marine)s, %(is_brackish)s, %(is_freshwater)s, %(is_terrestrial)s, %(is_extinct)s, %(match_type)s, %(modified)s, %(all_fetched)s)]
[parameters: {'aphia_id': 1033779, 'url': 'http://www.marinespecies.org/aphia.php?p=taxdetails&id=1033779', 'scientificname': 'Austrocidaris seymourensis', 'authority': 'Radwańska, 1996', 'status': 'accepted', 'unacceptreason': None, 'taxon_rank_id': 220, 'rank': 'Species', 'valid_aphia_id': 1033779, 'valid_name': 'Austrocidaris seymourensis', 'valid_authority': 'Radwańska, 1996', 'parent_name_usage_id': 160739, 'kingdom': 'Animalia', 'phylum': 'Echinodermata', 'class_': 'Echinoidea', 'order': 'Cidaroida', 'family': 'Cidaridae', 'genus': 'Austrocidaris', 'citation': 'Kroh, A.; Mooi, R. (2020). World Echinoidea Database. Austrocidaris seymourensis Radwańska, 1996 †. Accessed through: World Register of Marine Species at: http://www.marinespecies.org/aphia.php?p=taxdetails&id=1033779 on 2020-09-17', 'lsid': 'urn:lsid:marinespecies.org:taxname:1033779', 'is_marine': True, 'is_brackish': False, 'is_freshwater': False, 'is_terrestrial': False, 'is_extinct': True, 'match_type': 'exact', 'modified': '2017-09-07T12:19:05.493Z', 'all_fetched': None}]

The problem is the 'ń' in 'Radwańska, 1996', I guess: 0xc5 0x84 is the UTF-8 encoding of 'ń' (U+0144), which has no Latin-1 equivalent.
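
For anyone wanting to check this, here is a minimal Python sketch. The helper name `non_latin1_chars` is made up for illustration and is not part of EcoTaxa. It decodes the byte sequence from the error message and lists the characters of the authority string that Latin-1 cannot represent:

```python
def non_latin1_chars(text):
    """Return the characters of `text` that have no Latin-1 equivalent."""
    # Latin-1 covers exactly the code points 0..255.
    return [ch for ch in text if ord(ch) > 255]


# The byte sequence quoted in the PostgreSQL error, decoded as UTF-8.
offending = b"\xc5\x84".decode("utf-8")
print(offending)  # the single character U+0144

# "\u0144" is the same character, written as an escape to avoid
# any ambiguity about how this snippet itself is encoded.
print(non_latin1_chars("Radwa\u0144ska, 1996"))
```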

@moi90

moi90 commented Nov 5, 2020

I'm very +1 on this!

@grololo06 grololo06 added this to the 2.5.11 milestone Jun 16, 2021
@grololo06 grololo06 modified the milestones: 2.5.11, 2.5.12 Sep 15, 2021
@grololo06
Member

The DB is UTF-8, the API is Unicode, and so is all the Python code. The file format still needs to be cleared up.

@grololo06
Member

Trying to set '🦠 unicode!' in a text field, I got this during the export:
'latin-1' codec can't encode character '\U0001f9a0' in position 708: ordinal not in range(256)
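
The same failure can be reproduced outside EcoTaxa. This is a small sketch that assumes the export encodes its text with Python's 'latin-1' codec; the sample row and column names are invented:

```python
# An invented two-line TSV export containing the emoji from the report.
row = "object_id\tcomment\nobj_1\t\U0001f9a0 unicode!\n"

try:
    row.encode("latin-1")
    latin1_ok = True
except UnicodeEncodeError as exc:
    # Latin-1 only covers code points 0..255, so the emoji is rejected.
    latin1_ok = False
    print(exc)

# UTF-8 can encode every Unicode character, so the same row round-trips.
utf8_bytes = row.encode("utf-8")
print(utf8_bytes.decode("utf-8") == row)
```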

@grololo06
Member

grololo06 commented Sep 16, 2021

Writing of files:

Reading of files:

  • If a BOM is present, read the file as UTF-8; otherwise fall back to the current Latin-1.
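
A rough sketch of that reading rule in Python. This is not the actual EcoTaxa code, and the function name `read_text_with_fallback` is hypothetical:

```python
import codecs


def read_text_with_fallback(path):
    """Decode as UTF-8 if the file starts with a UTF-8 BOM, else as Latin-1."""
    with open(path, "rb") as f:
        raw = f.read()
    if raw.startswith(codecs.BOM_UTF8):
        # Drop the 3-byte BOM (EF BB BF) before decoding.
        return raw[len(codecs.BOM_UTF8):].decode("utf-8")
    # No BOM: keep the historical Latin-1 behaviour.
    return raw.decode("latin-1")
```

With this rule, a UTF-8 file written without a BOM would still be read as Latin-1.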

grololo06 pushed a commit to ecotaxa/ecotaxa_back that referenced this issue Sep 17, 2021
grololo06 added a commit to ecotaxa/ecotaxa_back that referenced this issue Sep 18, 2021
@moi90

moi90 commented Sep 20, 2021

Do you really insist on BOMs? It only makes reading the files harder, see here, for example: You need to read the files with encoding='utf-8-sig' to remove the BOM which defeats the whole purpose of "we indicate the encoding in the file so that the user does not have to guess".
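
To illustrate in Python (the file content is made up): decoding BOM-prefixed data with plain 'utf-8' leaves U+FEFF glued to the first column name, while 'utf-8-sig' strips it.

```python
import codecs

# Invented header line of a TSV file, written with a UTF-8 BOM.
data = codecs.BOM_UTF8 + b"object_id\tobject_annotation\n"

as_utf8 = data.decode("utf-8")
as_utf8_sig = data.decode("utf-8-sig")

first_col_utf8 = as_utf8.split("\t")[0]
first_col_sig = as_utf8_sig.split("\t")[0]

print(repr(first_col_utf8))  # '\ufeffobject_id'
print(repr(first_col_sig))   # 'object_id'
```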

@grololo06
Member

> Do you really insist on BOMs? It only makes reading the files harder, see here, for example: You need to read the files with encoding='utf-8-sig' to remove the BOM which defeats the whole purpose of "we indicate the encoding in the file so that the user does not have to guess".

Hello, we tested this with ordinary users on many OSes, and the BOM works fine in spreadsheet apps.
For devs like us, it's not really difficult to adapt the code.

@grololo06
Member

Fixed in 2.5.12

@moi90

moi90 commented Sep 22, 2021

Hmm... But even regular users don't always use spreadsheet apps. I really don't think that a BOM is the right way these days, when UTF-8 is the standard everywhere (even Windows is starting to use it). But you're the boss...

@moi90

moi90 commented Sep 22, 2021

This means that we might want a simple charset detection in pyecotaxa so the user does not have to worry about this.
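
One possible shape for such a detection, as a sketch only: `guess_and_decode` is a hypothetical helper, not an existing pyecotaxa function. It tries strict UTF-8 first (with or without a BOM) and falls back to Latin-1, which accepts any byte sequence.

```python
def guess_and_decode(raw):
    """Return (text, encoding) for raw bytes, preferring UTF-8 over Latin-1."""
    try:
        # 'utf-8-sig' strips a leading BOM if there is one.
        return raw.decode("utf-8-sig"), "utf-8-sig"
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a character, so this never fails.
        return raw.decode("latin-1"), "latin-1"
```

This is a heuristic: Latin-1 bytes that happen to form valid UTF-8 would be misread, although that is unlikely for real text with accented characters.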

@jiho
Contributor Author

jiho commented Sep 22, 2021

I had not followed the export part of this. By "leave the choice to users", do you mean a checkbox at export time asking them to choose the encoding? I think most users will not understand it, and those who do would be knowledgeable enough to deal with this afterwards.

So I think all exports should be UTF-8 (and if some obscure Windows-only utility fails, then too bad).

Then BOM or no BOM I don't know. @moi90: Rainer tested opening Laurent's file (which I assume is UTF-8 with BOM) with Python and found no problem.

@jiho jiho reopened this Sep 22, 2021
@grololo06
Member

#724

@moi90

moi90 commented Nov 25, 2021

I found out the following:

There is no reason to check if a BOM exists or not, utf-8-sig manages that for you and behaves exactly as utf-8 if the BOM does not exist.

(https://stackoverflow.com/a/44573867/1116842)

So I have no problem with BOM anymore and will make utf-8-sig the default when reading EcoTaxa files.
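
A quick check of that behaviour in Python, using an invented sample string:

```python
import codecs

payload = "Radwa\u0144ska, 1996".encode("utf-8")

# 'utf-8-sig' strips the BOM when present and is a no-op when it is absent.
with_bom = (codecs.BOM_UTF8 + payload).decode("utf-8-sig")
without_bom = payload.decode("utf-8-sig")

print(with_bom == without_bom)  # True
```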
