Skip to content
This repository has been archived by the owner. It is now read-only.

Fix diff encoding for non ascii paths and non utf-8 file encoding #1

Merged
merged 1 commit into from Nov 26, 2012

Conversation

@hiroponz
Copy link

hiroponz commented Nov 24, 2012

  • The encoding of file name is utf-8, but that of diff body is dependent on file encoding.
  • If the output of diff is too short, it will make a mistake in guss of encoding.
@hiroponz

This comment has been minimized.

Copy link
Author

hiroponz commented Nov 24, 2012

The file encoding is Shift_JIS.

Before
befor

After
after

@ghost ghost assigned SaitoWu Nov 24, 2012
@dzaporozhets

This comment has been minimized.

Copy link
Member

dzaporozhets commented Nov 24, 2012

@SaitoWu can you please take a look

@SaitoWu

This comment has been minimized.

Copy link

SaitoWu commented Nov 25, 2012

Seems like github also has this problem: hiroponz/example@ca33b50

@hiroponz

This comment has been minimized.

Copy link
Author

hiroponz commented Nov 26, 2012

I use detect_all to fix hiroponz/example@ca33b50.

For example

pry(#<Grit::Diff>):2> detect = CharlockHolmes::EncodingDetector.detect("@@ -0,0 +1,2 @@\n+\x82\xB1\x82\xEA\x82\xCD\x83e\x83X\x83g\x82\xC5\x82\xB7\x81B\r\n+this is test.\r")
=> {:type=>:text, :encoding=>"windows-1252", :confidence=>51, :language=>"en"}

detect returns wrong encoding.

pry(#<Grit::Diff>):2> detect_all = CharlockHolmes::EncodingDetector.detect_all("@@ -0,0 +1,2 @@\n+\x82\xB1\x82\xEA\x82\xCD\x83e\x83X\x83g\x82\xC5\x82\xB7\x81B\r\n+this is test.\r")
=> [{:type=>:text, :encoding=>"windows-1252", :confidence=>51, :language=>"en"},
 {:type=>:text, :encoding=>"windows-1250", :confidence=>36, :language=>"hu"},
 {:type=>:text, :encoding=>"Shift_JIS", :confidence=>10, :language=>"ja"},
 {:type=>:text, :encoding=>"GB18030", :confidence=>10, :language=>"zh"},
 {:type=>:text, :encoding=>"Big5", :confidence=>10, :language=>"zh"}]

detect_all returns all possible encodings. In this case, ":language=>"ja" is right. So, I get "ja" from ENV"LANG", and find encoding by language.

lines = @diff.lines.to_a

# path encoding is dependent on environment
path = GritExt.encode! lines.shift(2).join

This comment has been minimized.

Copy link
@hiroponz

hiroponz Nov 26, 2012

Author

I have changed comment and variable identifier(file -> path).

@hiroponz

This comment has been minimized.

Copy link
Author

hiroponz commented Nov 26, 2012

Sorry, I have pushed with force option and then review messages have been lost.

@SaitoWu

This comment has been minimized.

Copy link

SaitoWu commented Nov 26, 2012

@hiroponz Cool!. But detect_all also has its problem. its a trade-off.

the ENV['lang'] variable can be only one, ja, en or zh. and the detect_all returns all possible encodings.

if i have a shift_jis and gb18030 encoding files in my repo, and my ENV['lang'] is zh.

the encoed! method will return the wrong encoding for my shift-jis file. because i always choose zh first.

* The encoding of path is dependent on environment, but that of diff body is dependent on file encoding.
@hiroponz

This comment has been minimized.

Copy link
Author

hiroponz commented Nov 26, 2012

@SaitoWu Yes, there is a trade-off. Detect would return the right encoding in many cases, so that I have reverted encode! method.

SaitoWu added a commit that referenced this pull request Nov 26, 2012
Fix diff encoding for non ascii paths and non utf-8 file encoding
@SaitoWu SaitoWu merged commit 78b83d0 into gitlabhq:master Nov 26, 2012
@SaitoWu

This comment has been minimized.

Copy link

SaitoWu commented Nov 26, 2012

@hiroponz Merged! Thanks. 😄

@hiroponz hiroponz deleted the hiroponz:fix_non_utf8_file_encoding branch Oct 16, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
3 participants
You can’t perform that action at this time.