Charset windows-1251 #5493

Closed
dyadyaMisha opened this Issue Oct 31, 2013 · 56 comments

Comments

Projects
None yet

How can I fix this?

2013-10-31_161707

http://www.jackyfox.com/tag/charlock_holmes/ <- here is a solution used to be

olx commented Nov 6, 2013

Интересно для какой версии этот фикс... Такие же проблемы на версии 6.2

Member

Razer6 commented Nov 6, 2013

@olx Please write in English!

olx commented Nov 6, 2013

I have the same problem (version 6.2). This hack http://www.jackyfox.com/tag/charlock_holmes/ for another version

Opened issue 2 month ago: #5007

XVilka commented Dec 4, 2013

Still same for 6.3 gitlab.

XVilka referenced this issue in brianmario/charlock_holmes Dec 4, 2013

Closed

Can't detect windows-1251 in gitlab #53

olx commented Dec 4, 2013

Razer, show us the module in gitlab, which use charlock holmes to detect and encode charset? There is only one encode.rb in gitlab project, but it use json inside.

XVilka commented Dec 4, 2013

This is pretty old bug - brianmario/charlock_holmes#19

XVilka commented Dec 4, 2013

See the answer from @brianmario:

ICU's detection algorithms can sometimes require quite a bit of data to make an accurate detection. If it only has 100-200 bytes it may not be able to "accurately" determine what the source encoding is. But even at that, it's just a guess. The values returned from charlock_holmes are handed over directly from ICU internally so if something is being detected wrong, it's an issue in ICU itself.

We have this issue at GitHub as well from time to time, but I've made sure to make the code aware of the fact that we may need to fall back to forcing a clean UTF-8 version of the output by checking if the String is valid in the detected encoding.

detection = CharlockHolmes::EncodingDetector.detect(data)
data.force_encoding(detection[:encoding]) if detection && !detection[:binary]
if !data.valid_encoding?
  # this will forcibly clean the string of invalid UTF-8 characters
  # and tag it as UTF-8 for us
  data = CharlockHolmes::Converter.convert(data, 'UTF-8', 'UTF-8')
end

Hope this helps!

XVilka commented Dec 13, 2013

May be possible solution is to determine limited list of supported encodings in each project?

+1

mkwork commented Jan 12, 2014

At the first this hack http://www.jackyfox.com/tag/charlock_holmes/ is still working but code was little replaced.
At my configuration in can be found at /home/git/gitlab/vendor/bundle/ruby/1.9.1/gems/gitlab-grit-2.6.3/lib.
So while it,s gem, looks like fix must to be in it.
May be better solution will be replace charlock_holmes for an rchardet looks like it works more robust.

Contributor

jvanbaarsen commented Mar 2, 2014

@randx Can you take a look?

XVilka commented Mar 27, 2014

Any ideas/updates on this bug?

LazyRoy commented Apr 9, 2014

Thanks dyadyaMisha for workaround!

For Gitlab 6.7 you need to patch files:
/home/git/gitlab/vendor/bundle/ruby/2.0.0/gems/gitlab-grit-2.6.4/lib/grit_ext.rb
/home/git/gitlab/vendor/bundle/ruby/2.0.0/gems/gitlab_git-5.7.1/lib/gitlab_git/encoding_herlper.rb
/home/git/gitlab/vendor/bundle/ruby/2.0.0/gems/gitlab_git-5.4.0/lib/gitlab_git/encoding_herlper.rb

Patch:

# encoding message to detect encoding
if detect && detect[:encoding]
  message.force_encoding("windows-1251")
end

YURSHAT commented May 30, 2014

dyadyaMisha's method works in Files but not work in Commits :(
GitLab 6.9
files
commits
Does anyone know how to fix?

Contributor

jvanbaarsen commented May 30, 2014

@YURSHAT Can you please prepare a test repository where the encoding is not correct? That way i have something to work with. Thanks!

YURSHAT commented May 30, 2014

@jvanbaarsen Yes, of course.
Prepared test repository. http://git.krinkels.org/YURSHAT/test-project

Contributor

jvanbaarsen commented May 30, 2014

Thanks!

Contributor

jvanbaarsen commented May 30, 2014

@randx Maybe you can take a look? I've tried looking into this, but i just ran into a massive rabbit hole :(

YURSHAT commented Jun 2, 2014

@jvanbaarsen, you saw a bug with the encoding in the README file?
How can I solve the problem? Or wait for the fix in the next releases?

Contributor

jvanbaarsen commented Jun 2, 2014

@YURSHAT I've taken a look at it, but i cant figure out where to fix this, i asked @randx if he can take a look, but he is very busy, so not sure when he gets around to it.

In the meantime, all i can suggest is not using russian characters, but just use english.

XVilka commented Jun 25, 2014

Any chance to fix that in 7.x.x version?

I have installed the 7.0.0 but the same..

tsipa commented Jul 8, 2014

Mine 6.9 affected too.

Contributor

jvanbaarsen commented Jul 9, 2014

@dosire Is this issue on the radar of the development team? If so, what is the opinion about this?

Owner

dosire commented Jul 9, 2014

@jvanbaarsen I have no idea, I'll ask.

Contributor

jvanbaarsen commented Jul 9, 2014

@dosire Thanks!

Owner

dosire commented Jul 10, 2014

@jvanbaarsen These encoding problems are very hard to fix in a way that doesn't break stuff. This issue should be fixed in Charlock Holmes but that is hard as stated in #5493 (comment)

According to the ICU documentation:

For best accuracy in charset detection, the input data should be primarily in a single language, and a minimum of a few hundred bytes worth of plain text in the language are needed. The detection process will attempt to ignore html or xml style markup that could otherwise obscure the content.

Owner

dosire commented Jul 10, 2014

Thanks for commenting @brianmario I regret for my previous comment. I didn't read #5493 (comment) carefully and I shouldn't have said this was a problem in Charlock Holmes. We will look into the solution you suggested. Again, thanks for taking the time to comment here, it is appreciated.

@dosire no worries or apology needed :) Sorry about the short response earlier, was just about to hit the road for a 3hr drive. Didn't mean it to sound defensive...

I've been monitoring the tweaking the amount of data I feed to charlock (and thus ICU) for the past couple of years at GitHub and for whatever reason never saw what I linked in the ICU documentation :P Glad I found it!

Owner

dosire commented Jul 11, 2014

@brianmario No problem, your response was fine. At our side @randx will look into this and see if we can apply the UTF-8 re-encoding fix for invalid encodings.

Member

maxlazio commented Sep 26, 2014

With the test data that I have at my disposal this fix makes no difference.
Can someone supply a test repository with few different scenarios of where this goes wrong(diff view, commit view, etc)?

This is how a file looks currently in GitLab (v7.4.0.pre)
screenshot_2014-09-26_11 05 45

And this is how it looks like with the above mentioned fix:

screenshot_2014-09-26_11 04 06

now I have 7.3 and there is no fix now to non-utf encoding (cp1251, for example).

XVilka commented Nov 6, 2014

@dosire @brianmario Can be option to select/limit using encodings for the each repository/team something like a workaround solution? E.g. to limit some repo encodings to UTF-8, windows-1251, cp866 and koi8-r for the Russian team/repo?

jehy commented Nov 12, 2014

Still the same with version 7.4.3 - completely spoils all expericence.
Also, russian encoding completely breaks gitlab up to 500 errors: https://gitlab.com/gitlab-org/gitlab-ce/issues/696#note_371638

Option to limit repo encodings could be a solution.

We suffer from this problem too. Got kind of used to applying dyadyaMisha's workaround after each upgrade. But with that new issue mentioned by Jehy it gets a bit frightening. There might be a lot more related bugs waiting behind the next corner.
May be these problems demand closer attention?

Owner

randx commented Dec 2, 2014

Can you provide a test repository please so we can reproduce the problem?

Alexey-I commented Dec 2, 2014

I am working on it. We have issues not only with this error but also with #8163. I am trying to reproduce as much as possible within one test repo to provide it to you. Will post it here as soon as it is ready.

Alexey-I commented Dec 2, 2014

I'd better split those test repos to keep them as simple as possible.
This one demonstrates bug reported here.

Owner

dosire commented Dec 2, 2014

@Alexey-I Which url gives the error?

Alexey-I commented Dec 2, 2014

Commit
shows commit message and both files. Everything is displayed OK except
for "echo-1251.sh" file content — it's encoding is wrong. Both files
should look identical.

This file
contents

are OK.

This file
contents

are garbled.

Ah just noticed: blame for second
file

is even more interesting. It displays not just wrong encoding but
question-mark symbols in place of cyrillic text.

С уажением,
Алексей Иванов.

Owner

dosire commented Dec 2, 2014

@Alexey-I Thanks, that helps.

xRayDev commented Jan 26, 2015

@dyadyaMisha
fix in comment #5493 (comment)
still work but files for path in new version gitlab here
/gem/ruby/2.1.0/gems/gitlab_git-7.0.0.rc14/lib/gitlab_git/encoding_helper.rb
/gem/ruby/2.1.0/gems/gitlab-grit-2.6.12/lib/grit_ext.rb

Fix method still same.

Probably we will not see a fix as their ears. And we will apply the patch with the release of each version. :(

XVilka commented Feb 12, 2015

Can someone prepare the tests, but not in the test repository format but something like:
"bytestream" => decoded text?
To try to go deeper through gitlab and even charlock_homes - directly to the ICU?

@xraydev i have tried to apply this patch to the files that you mentionned in my GitLab 7.7.2 install and all i get is an error 502 page.

xRayDev commented Feb 19, 2015

@exuperok I not try this fix on Gitlab version 7.7.2. My GitLab version 7.7.0.
You watch log?

xRayDev commented Feb 19, 2015

@XVilka I can post file from my repo with windows-1251 cyrillic encoding

that will be helpful, thank you for your answer

On Thu, Feb 19, 2015 at 4:55 PM, xRayDev notifications@github.com wrote:

@XVilka https://github.com/XVilka I can post file from my repo with
windows-1251 cyrillic encoding


Reply to this email directly or view it on GitHub
#5493 (comment).

exuperok commented Mar 4, 2015

@xraydev did you post them on your blog or on a git repository that i can look into?

xRayDev commented Mar 5, 2015

@exuperok Of Course. Here diff xRayDev/gitlab_windows1251@ac65ae9 for Installed gitlab_7.7.1-omnibus.5.4.1.ci-1_amd64.deb

I think automatic recognition will work only on specific set of cases. The easiest solution is to add a parameter "Default codepage" to a repository's settings page. Then use specified codepage every time while converting non UTF-8 data.

Contributor

jvanbaarsen commented Nov 7, 2015

We are currently in the process of closing down the issue tracker on GitHub, to prevent duplication with the GitLab.com issue tracker. Since this is an older issue I'll be closing this for now. If you think this is still an issue I encourage you to open it on the GitLab.com issue tracker.

jvanbaarsen closed this Nov 7, 2015

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment