Problems reading UTF-8 with portuguese characters #33

Closed
mflps opened this issue Apr 22, 2016 · 2 comments

mflps commented Apr 22, 2016

When reading this .log file, it fails on this line... :(

2016-03-07 11:34:48 W3SVCXYZ805 SERVER13 10.101.146.157 GET /pt/Prt/PublishingImages/mailimages/visto_131114.jpg - 80 - 10.101.146.3 HTTP/1.1 Mozilla/5.0+(compatible;+MSIE+10.0;+Windows+NT+6.1;+WOW64;+Trident/6.0;+SRHE+S.R.+Habitação+e+Equipamentos) - - ind.xyz.pt 200 0 0 10143 308 0

The problem is the characters ç and ã.

If you convert the file to Unicode it works well, though...

Collaborator

MikeRys commented Apr 26, 2016

Hex E7 is the Windows-1252 code page encoding for ç, not its UTF-8 encoding (which would be C3 A7, if I remember correctly). So your file was not in UTF-8 but in the Windows-1252 code page, which is currently not supported.
Please convert your files to a supported encoding before uploading, or write a custom extractor. And feel free to vote for code page support at https://feedback.azure.com/forums/327234-data-lake/suggestions/13077555-add-ansi-code-page-support-for-built-in-extractors
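
For anyone who wants to check this on their own data, here is a minimal Python sketch of the byte-level difference described above. It is not part of the extractor tooling; the helper name `is_valid_utf8` is made up for illustration, and the sample word is taken from the user-agent string in the failing log line.

```python
# The character ç is code point U+00E7.
ch = "ç"

cp1252_bytes = ch.encode("cp1252")  # one byte: 0xE7
utf8_bytes = ch.encode("utf-8")     # two bytes: 0xC3 0xA7

print(cp1252_bytes.hex(), utf8_bytes.hex())  # prints: e7 c3a7


def is_valid_utf8(data: bytes) -> bool:
    """Hypothetical helper: True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# "Habitação" as it would be stored in a Windows-1252 file.
raw = "Habitação".encode("cp1252")

# In UTF-8, 0xE7 announces a three-byte sequence, but the next byte (0xE3, ã)
# is not a continuation byte, so a strict UTF-8 reader rejects the input.
print(is_valid_utf8(raw))  # prints: False

# Decoding with the real source encoding and re-encoding gives valid UTF-8.
fixed = raw.decode("cp1252").encode("utf-8")
print(is_valid_utf8(fixed))  # prints: True
```

Running the same check over a sample of a log file is a quick way to tell whether it is really UTF-8 before uploading it.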

Author

mflps commented May 3, 2016

Hi Michael,

It's exactly that!

Yes, we are converting the files to UTF-8 now, but it's a slow process that we would rather avoid...

I'll vote for this!!

regards,
Miguel
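
For readers facing the same pre-upload conversion, one possible approach is to stream the file in chunks so that large logs never have to fit in memory. This is only a sketch under assumptions: the function name `convert_to_utf8`, its parameters, and the file names are hypothetical, and it assumes the source really is Windows-1252 (Python's `cp1252` codec raises an error on the few byte values that code page leaves undefined).

```python
def convert_to_utf8(src_path, dst_path, src_encoding="cp1252",
                    chunk_chars=1 << 20):
    """Re-encode a text file to UTF-8, reading about 1M characters at a time.

    newline="" disables newline translation on both sides, so the original
    line endings (for example CRLF in IIS logs) are preserved.
    """
    with open(src_path, "r", encoding=src_encoding, newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        while True:
            chunk = src.read(chunk_chars)
            if not chunk:  # empty string means end of file
                break
            dst.write(chunk)


# Example call (hypothetical file names):
# convert_to_utf8("u_ex160307.log", "u_ex160307.utf8.log")
```

The conversion cost is still one full read and write per file, so this does not remove the extra step; it only keeps memory use flat.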

MikeRys closed this as completed Jul 11, 2016