error: input contains invalid UTF-8 around byte 30 (of 68) #53

Open
talalriaz opened this issue Jan 26, 2022 · 1 comment

talalriaz commented Jan 26, 2022

I encountered this error while running the following code:

import pycld2 as cld2
text = """
Happy Tailors Day! Hackett We�re celebrating with a special offer
"""
isReliable, textBytesFound, details = cld2.detect(text)

Here is the error:

error: input contains invalid UTF-8 around byte 30 (of 68)

ned2 commented Mar 28, 2022

There's been some great exploration of this issue in this polyglot issue and also in the older cld2 project (https://github.com/mikemccand/chromium-compact-language-detector/issues/22) that pycld2 is forked from. Some of those reports come from people using Polyglot, which actually depends on pycld2 rather than on that older cld2 project.

I have not tried it yet, but this solution, which uses a regex to strip the two offending UTF-8 control characters from the input, looks like the most elegant one to me.
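
For illustration, here is a minimal sketch of that kind of pre-cleaning step. It is not the linked solution itself: it uses the standard library's `unicodedata` instead of a regex, and it assumes the characters pycld2 rejects are in the Unicode categories Cc (control) and Cs (surrogate). The helper name `strip_bad_chars`, the decision to keep newlines and tabs, and the use of U+0092 as the stray character in the sample are all assumptions made for the example.

```python
import unicodedata

# Assumption: the rejected characters are control (Cc) or surrogate (Cs)
# code points. Adjust this set if other inputs still fail.
BAD_CATEGORIES = {"Cc", "Cs"}

# Assumption: ordinary whitespace controls are harmless and worth keeping.
KEEP = {"\n", "\r", "\t"}


def strip_bad_chars(text: str) -> str:
    """Return text with control and surrogate characters removed (hypothetical helper)."""
    return "".join(
        ch
        for ch in text
        if ch in KEEP or unicodedata.category(ch) not in BAD_CATEGORIES
    )


# "\x92" stands in for the stray character in the report; which character
# was really there is an assumption.
sample = "Hackett We\x92re celebrating"
cleaned = strip_bad_chars(sample)
print(repr(cleaned))
```

The cleaned string would then be passed to `cld2.detect` in place of the raw text. Note that this drops the stray character rather than restoring the apostrophe it probably stood for, which should not matter much for language detection.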
