Annotation tool : special characters (é,è,',...) #700

Closed
LouisDuverneuil opened this issue Dec 28, 2020 · 7 comments

@LouisDuverneuil

Question

For the annotation tool, I would like to import texts containing special characters (such as é, è or '). But these characters appear like this: �. Is it possible to have texts with special characters?

Additional context
I tried importing text with the special characters encoded (for example, é becomes &#233;). That works for the text displayed during annotation, but not when I download the data from the export labels section (the answers are cut off because the encoding disappears).

Thanks
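
For readers following along, here is a minimal Python sketch of what replacing é with a numeric character reference such as `&#233;` does to a string. It only uses the standard library; the sample word is illustrative and nothing here is taken from the annotation tool's own code. The length difference it prints may be one reason why answer selections stop lining up once a viewer decodes the references.

```python
import html

# Encoding "é" as ASCII with character references yields "&#233;".
escaped = "préparation".encode("ascii", errors="xmlcharrefreplace").decode("ascii")
print(escaped)

# html.unescape turns the references back into the original characters.
restored = html.unescape(escaped)
print(restored)

# The two strings differ in length, so character offsets computed on one
# of them do not apply to the other.
print(len(escaped), len(restored))
```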

@Timoeller
Contributor

Hey @LouisDuverneuil sorry for the late reply, we were laying low between the years : )

About your encoding issue, the tool should be able to display utf-8 encodings. Have you tried this encoding for your files?

Btw. I cannot reproduce your error, see screenshot:
Screenshot from 2021-01-04 19-10-24

@LouisDuverneuil
Author

Hey @Timoeller,

Thank you for your reply. Here are some screenshots to explain the problem:

Screenshot (35)
Screenshot (36)

As you can see, the encoding is not working: the text appears encoded in the Documents section, which causes a problem when selecting answers later.

Thanks

@Timoeller
Contributor

Encoding issues are very annoying because they vary across operating systems and can be browser-related. So I think I need some more information to understand your problem and help you.

So in the second screenshot the encoding works, right? This is where you annotate. I agree that the encoding seems to be broken in the first screenshot, in the "documents" tab.

When I use your example text, annotate in it and export as squad json I do not get errors, see screenshot:
Screenshot from 2021-01-05 11-42-02

Are you using Windows? Which browser are you using? Can you try Chrome please?
And what does your JSON file look like when you export it?

@LouisDuverneuil
Author

Yes, in the second screenshot it works, but when I annotate on it there is a problem with the selection. This is an example of the output JSON when I select 'préparation' as an answer:
Screenshot (39)

I use Windows and Chrome. To create the text to annotate, I use the Windows text editor. When I don't encode the text before importing it, my special characters look like this: �. So I encoded them in ASCII beforehand, but the result is the one shown in the previous screenshot.

Thank you for your time !
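
The � symptom described above can be reproduced in a few lines of Python. This is a sketch under the assumption that the file was saved as Windows-1252 and then read as UTF-8; the sample word is illustrative.

```python
# Bytes a Windows editor would write for "préparation" when saving as Windows-1252.
raw = "préparation".encode("cp1252")

# Read as UTF-8, the lone 0xE9 byte is invalid; with errors="replace" it
# becomes U+FFFD, the replacement character.
shown = raw.decode("utf-8", errors="replace")
print(shown)

# The same text saved as UTF-8 decodes cleanly.
print("préparation".encode("utf-8").decode("utf-8"))
```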

@Timoeller
Contributor

Judging from the screenshot, I would say the annotation tool works in this case:
the answer text can be found as a substring of the "context" and the "answer_start" is also correct. Only the encoding of the context is wrong.

Please encode your text as UTF-8 and NOT ASCII or something else. ASCII does not include the accents you actually want...
Digging into the encoding you used, it seems that your files are Windows-1252 encoded ("&#233" translates to é in this encoding). Please change this!
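
If changing the encoding in the editor is not convenient, the conversion can also be scripted. The following is a minimal sketch, assuming the source files really are Windows-1252; `reencode` and `convert_file` are hypothetical helper names, not part of the annotation tool.

```python
from pathlib import Path


def reencode(raw: bytes, source_encoding: str = "cp1252") -> bytes:
    """Decode bytes from the assumed source encoding and re-encode as UTF-8."""
    return raw.decode(source_encoding).encode("utf-8")


def convert_file(src: Path, dst: Path, source_encoding: str = "cp1252") -> None:
    """Write a UTF-8 copy of src to dst, leaving src untouched."""
    dst.write_bytes(reencode(src.read_bytes(), source_encoding))


# b"\xe9" is "é" in Windows-1252; in UTF-8 it becomes the two bytes b"\xc3\xa9".
print(reencode(b"pr\xe9paration"))
```

Writing to a separate destination keeps the original file available in case the guessed source encoding turns out to be wrong.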

@LouisDuverneuil
Author

It worked, I changed the encoding directly in my text editor.

Thank you very much !!

@Timoeller
Contributor

Nice, happy to assist.

Nevertheless, especially if you have long contexts, it can very well be that there are some encoding and offset issues. If you encounter problems there, please open another issue so we can help.
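
One way to spot such offset issues early is to check an exported file yourself. The sketch below assumes the export follows the standard SQuAD JSON layout (`data` → `paragraphs` → `context` / `qas` → `answers` with `text` and `answer_start`); `find_offset_mismatches` is a hypothetical helper and the sample data is made up.

```python
def find_offset_mismatches(squad: dict) -> list:
    """Return (question id, answer_start, answer text) for every answer whose
    text is not found in the context at the stated character offset."""
    mismatches = []
    for article in squad["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                for answer in qa["answers"]:
                    start, text = answer["answer_start"], answer["text"]
                    if context[start:start + len(text)] != text:
                        mismatches.append((qa.get("id"), start, text))
    return mismatches


# Made-up example: the first answer lines up, the second is off by one.
sample = {"data": [{"paragraphs": [{
    "context": "La préparation dure dix minutes.",
    "qas": [
        {"id": "q1", "answers": [{"text": "préparation", "answer_start": 3}]},
        {"id": "q2", "answers": [{"text": "préparation", "answer_start": 4}]},
    ],
}]}]}
print(find_offset_mismatches(sample))
```

For a real export, load the file with `json.load(open(path, encoding="utf-8"))` and pass the result to the function; an empty list means every answer matches its offset.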
