Unicode problem (I guess) #38

Closed
lydell opened this Issue Mar 2, 2013 · 22 comments

Contributor

lydell commented Mar 2, 2013

On Windows 7:

$ ruby --version && kramdown --version
ruby 1.9.3p385 (2013-02-06) [i386-mingw32]
0.14.2

$ kramdown
"å
^Z
c:/Ruby193/lib/ruby/gems/1.9.1/gems/kramdown-0.14.2/lib/kramdown/parser/kramdown.rb:213:in `check': incompatible encoding regexp
 match (UTF-8 regexp with CP850 string) (Encoding::CompatibilityError)
        from c:/Ruby193/lib/ruby/gems/1.9.1/gems/kramdown-0.14.2/lib/kramdown/parser/kramdown.rb:213:in `block in parse_spans'
        from c:/Ruby193/lib/ruby/gems/1.9.1/gems/kramdown-0.14.2/lib/kramdown/parser/kramdown.rb:212:in `each'
        from c:/Ruby193/lib/ruby/gems/1.9.1/gems/kramdown-0.14.2/lib/kramdown/parser/kramdown.rb:212:in `any?'
        from c:/Ruby193/lib/ruby/gems/1.9.1/gems/kramdown-0.14.2/lib/kramdown/parser/kramdown.rb:212:in `parse_spans'
        from c:/Ruby193/lib/ruby/gems/1.9.1/gems/kramdown-0.14.2/lib/kramdown/parser/kramdown.rb:165:in `block in update_tree'
        from c:/Ruby193/lib/ruby/gems/1.9.1/gems/kramdown-0.14.2/lib/kramdown/parser/kramdown.rb:161:in `map!'
        from c:/Ruby193/lib/ruby/gems/1.9.1/gems/kramdown-0.14.2/lib/kramdown/parser/kramdown.rb:161:in `update_tree'
        from c:/Ruby193/lib/ruby/gems/1.9.1/gems/kramdown-0.14.2/lib/kramdown/parser/kramdown.rb:179:in `block in update_tree'
        from c:/Ruby193/lib/ruby/gems/1.9.1/gems/kramdown-0.14.2/lib/kramdown/parser/kramdown.rb:161:in `map!'
        from c:/Ruby193/lib/ruby/gems/1.9.1/gems/kramdown-0.14.2/lib/kramdown/parser/kramdown.rb:161:in `update_tree'
        from c:/Ruby193/lib/ruby/gems/1.9.1/gems/kramdown-0.14.2/lib/kramdown/parser/kramdown.rb:103:in `parse'
        from c:/Ruby193/lib/ruby/gems/1.9.1/gems/kramdown-0.14.2/lib/kramdown/parser/base.rb:78:in `parse'
        from c:/Ruby193/lib/ruby/gems/1.9.1/gems/kramdown-0.14.2/lib/kramdown/document.rb:119:in `initialize'
        from c:/Ruby193/lib/ruby/gems/1.9.1/gems/kramdown-0.14.2/bin/kramdown:72:in `new'
        from c:/Ruby193/lib/ruby/gems/1.9.1/gems/kramdown-0.14.2/bin/kramdown:72:in `<top (required)>'
        from c:/Ruby193/bin/kramdown:23:in `load'
        from c:/Ruby193/bin/kramdown:23:in `<

This seems to happen for any input containing ', " or -- together with a non-ASCII character (such as å, ä, ö, é).
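The error can be reproduced outside kramdown. The following is a minimal sketch, assuming the console delivers `"å` as CP850 bytes (0x22 0x86); the `quote_pattern` regexp is a hypothetical stand-in for a parser pattern containing non-ASCII UTF-8 characters, not kramdown's actual regexp.

```ruby
# '"' followed by 0x86, which is "å" in CP850, labeled the way a CP850
# console would label it.
input = [0x22, 0x86].pack("C*").force_encoding("CP850")

# A regexp containing \u escapes is fixed to UTF-8 (hypothetical pattern).
quote_pattern = /[\u201C\u201D"]/

# Matching a fixed UTF-8 regexp against a non-ASCII CP850 string raises.
outcome =
  begin
    input =~ quote_pattern
    "matched"
  rescue Encoding::CompatibilityError => e
    e.message
  end
puts outcome

# Transcoding the input to UTF-8 first lets the same match succeed.
utf8_input = input.encode("UTF-8")
match_index = utf8_input =~ quote_pattern
puts match_index
```

A pure-ASCII input does not trigger the error because ASCII-only strings are compatible with any ASCII-based encoding, which would explain why the failure only shows up once a character such as å is present.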

Contributor

lydell commented Mar 2, 2013

What settings? How?

@ghost ghost assigned gettalong Mar 2, 2013

Owner

gettalong commented Mar 2, 2013

Could you please run the following:

ruby -e 'puts Encoding.default_external'
ruby -e 'puts Encoding.default_internal'

The solution is probably that the input needs to be converted to UTF-8 before being parsed. Not sure, though, if the output needs to be converted back...

Contributor

lydell commented Mar 3, 2013

$ ruby -e 'puts Encoding.default_external'
CP850
$ ruby -e 'puts Encoding.default_internal'

The solution is probably that the input needs to be converted to UTF-8 before being parsed.

I have also tried to convert files saved in UTF-8 with input as described. The same thing happens.

Owner

gettalong commented Mar 3, 2013

Yeah, because your environment external encoding is CP850 which means that all files read and all text input get the encoding label CP850...
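To illustrate the "encoding label" point: the sketch below writes the UTF-8 bytes for å to a temporary file and reads them back under two different external encodings. Passing `encoding:` to `File.read` here stands in for what the environment's default external encoding does implicitly; the temporary file name is arbitrary.

```ruby
require "tempfile"

# Write the UTF-8 bytes for "å" (0xC3 0xA5) to a temporary file.
file = Tempfile.new("encoding-demo")
file.binmode
file.write([0xC3, 0xA5].pack("C*"))
file.close

# The external encoding only decides the label attached to the bytes read;
# no transcoding happens unless an internal encoding is also set.
as_cp850 = File.read(file.path, encoding: "CP850")
as_utf8  = File.read(file.path, encoding: "UTF-8")

puts as_cp850.encoding
puts as_utf8.encoding
puts as_cp850.bytes == as_utf8.bytes # same bytes, different labels

file.unlink
```

So a file saved as UTF-8 is still treated as CP850 when that is the external encoding, which matches the behaviour reported above.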

What I meant by "The solution..." is that kramdown needs to do this internally, i.e. convert the input to UTF-8. I will see how the Ruby stdlib CSV library does this and will probably follow in its footsteps.

Contributor

lydell commented Mar 3, 2013

Can I work around this? I know nothing about ruby, I just have it installed so I can use kramdown and sass, but it sounds like a bad thing to have the "environment external encoding" set to anything other than UTF-8.

Owner

gettalong commented Mar 3, 2013

You can try calling kramdown in the following way until I can fix this (note that the input has to be valid UTF-8):

ruby --external-encoding UTF-8 -S kramdown
Contributor

lydell commented Mar 3, 2013

Thanks, that works.

Contributor

lydell commented Mar 3, 2013

Possibly related:

C:\Users\Simon\Desktop>ruby aaå.rb
Hello, World!

C:\Users\Simon\Desktop>kramdown aaå.rb
C:/Ruby193/lib/ruby/gems/1.9.1/gems/kramdown-0.14.2/bin/kramdown:72:in `read': No such file or directory - aaå.rb (Errno::ENOENT
)
        from C:/Ruby193/lib/ruby/gems/1.9.1/gems/kramdown-0.14.2/bin/kramdown:72:in `<top (required)>'
        from C:/Ruby193/bin/kramdown:23:in `load'
        from C:/Ruby193/bin/kramdown:23:in `<main>'
Owner

gettalong commented Mar 3, 2013

You may also want to globally change the default encoding to UTF-8 if that is what you use. See for example http://stackoverflow.com/questions/469163/how-to-set-the-default-encoding-in-windows-xp and http://stackoverflow.com/questions/11806512/ruby-1-9-wrong-file-encoding-on-windows.

And yes, this will be related. However, this is a general problem: if you use UTF-8, you should set that as the encoding for your computer, because otherwise the CP850 encoding will keep causing trouble.

svnpenn commented Mar 6, 2013

Setting LANG fixes it for me

LANG=en_US.CP850 kramdown aaå.rb
/usr/lib/ruby/gems/1.9.1/gems/kramdown-0.14.2/bin/kramdown:72:in `read': No such
 file or directory - aaå.rb (Errno::ENOENT)
        from /usr/lib/ruby/gems/1.9.1/gems/kramdown-0.14.2/bin/kramdown:72:in `<
top (required)>'
        from /usr/bin/kramdown:23:in `load'
        from /usr/bin/kramdown:23:in `<main>'
LANG=en_US.UTF-8 kramdown aaå.rb
<p>puts `Hello, World!'</p>

I ran into this problem today because Tilt doesn't handle encoding properly (rtomayko/tilt#75). This is how I patched the problem, in case it's useful to anyone.

Ruby 2

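require 'kramdown'

module Kramdown::Parser
  # Relabels the adapted source as UTF-8. Note that force_encoding only
  # changes the encoding label; it does not transcode the bytes.
  module EncodingFix
    def adapt_source(source)
      super.force_encoding('UTF-8')
    end
  end

  # prepend places EncodingFix ahead of Base in the method lookup chain,
  # so `super` above calls the original Base#adapt_source.
  class Base; prepend EncodingFix; end
end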

Ruby <= 1.9

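require 'kramdown'

module Kramdown::Parser
  class Base
    # Keep a reference to the original method before redefining it.
    alias old_adapt_source adapt_source

    # Relabels the adapted source as UTF-8 (no transcoding takes place).
    def adapt_source(source)
      old_adapt_source(source).force_encoding('UTF-8')
    end
  end
end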

gettalong added a commit that referenced this issue Mar 9, 2013

Hopefully fixed encoding problem of GH issue #38
kramdown now transcodes the source to UTF-8 for internal use and
transcodes it back after conversion.
Owner

gettalong commented Mar 9, 2013

So, this does seem to be a bit more complicated... or funny, depending on how you look at it.

I took the example from @lydell and put it in a file on Windows 7 with Notepad and selected ANSI encoding. So, what does one expect here? I expected that the file would contain CP850 encoded characters because this is what seems to be the encoding on Windows 7 command line. But when you look up what ANSI encoding means, you see that it is actually called CP-1252. So the file gets saved in CP-1252 format.

On the command line, ruby reads it in as CP-850 (because this is the external encoding) and then outputs the result as CP-850, which leads to å becoming Õ... which is just not right.
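The mix-up can be simulated in a few lines. This is a sketch only; `misread_as_cp850` is a hypothetical helper name that encodes a character the way Notepad's "ANSI" option would (Windows-1252) and then decodes the resulting byte as CP850.

```ruby
# Hypothetical helper: what a CP850 reader sees for a character that was
# saved as a single Windows-1252 ("ANSI") byte.
def misread_as_cp850(char)
  ansi = char.encode("Windows-1252")            # byte written by Notepad
  ansi.force_encoding("CP850").encode("UTF-8")  # same byte read as CP850
end

%W[\u00E5 \u00E4].each do |char|
  puts "#{char} -> #{misread_as_cp850(char)}"
end
```

Byte 0xE5 (å in Windows-1252) maps to Õ in CP850, and byte 0xE4 (ä) maps to õ, so which wrong character appears depends on the letter that was typed.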

So... you are basically screwed on Windows if you expect a sane default environment because the command line encoding differs from the GUI encoding (or however one wants to phrase that).

However, since there is also still a bug in kramdown, I will fix it by converting the source string to UTF-8 in Kramdown::Parser::Base.adapt_source and converting the result back to the original encoding of the string in Kramdown::Converter::Base.convert. The back-conversion is not really needed in the most common use cases, because on terminal output or when writing to files Ruby automatically transcodes strings to the external encoding. However, when the string is further transformed in Ruby, the caller probably expects a string in the same encoding as the one they passed in.
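The transcode-in, transcode-back idea can be sketched independently of kramdown. The helper name `convert_preserving_encoding` and the block that wraps the text in `<p>` tags are hypothetical; this is not the code from the actual commit, only an illustration of the round trip described above.

```ruby
# Hypothetical helper: run a UTF-8-only conversion on a string in any
# encoding and hand the result back in the caller's original encoding.
def convert_preserving_encoding(source)
  original_encoding = source.encoding
  utf8_source = source.encode("UTF-8")   # transcode for internal use
  result = yield(utf8_source)            # conversion works on UTF-8 only
  result.encode(original_encoding)       # transcode back for the caller
end

# '"å' as CP850 bytes, as in the original report.
cp850_input = [0x22, 0x86].pack("C*").force_encoding("CP850")

output = convert_preserving_encoding(cp850_input) { |text| "<p>#{text}</p>" }
puts output.encoding
puts output.encode("UTF-8")
```

One caveat of this approach: the final `encode` call raises `Encoding::UndefinedConversionError` if the conversion introduces a character that the original encoding cannot represent, so a real implementation would need to decide how to handle that case.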

And the result of all this? If you save a file on Windows with a CP-850 encoding, kramdown will now work correctly. Just remember that saving a file in Notepad with the ANSI encoding does not mean CP-850 but CP-1252 (or WINDOWS-1252 as it is known to ruby)!

Coming to you with the next release of kramdown which will be the (spoiler alert) 1.0.0 😄


The problem with the input file can't be solved by kramdown since this is a general problem (Question: Is the encoding of the file system paths different to the external encoding on Windows 7 cmd command line? Answer: Yes, it seems so. Solution: For your and my sake/saneness, please just set the default encoding to UTF-8 everywhere and use UTF-8 everywhere).

@gettalong gettalong closed this Mar 9, 2013

Contributor

lydell commented Mar 11, 2013

I just tried 1.0.0. I would like to confirm that the original test case now works! Thanks!

Converting a kramdown file with UTF-8 (without BOM) encoding now works out of the box, without changing any settings or typing extra things on the command line. Great!

However, the "possibly related" issue still persists, with the same error. @svnpenn's LANG fix does not work for me:

$ LANG=en-US.UTF-8 kramdown aaå.rb
c:/Ruby193/lib/ruby/1.9.1/optparse.rb:1351:in `===': invalid byte sequence in UTF-8 (ArgumentError)
        from c:/Ruby193/lib/ruby/1.9.1/optparse.rb:1351:in `block in parse_in_order'
        from c:/Ruby193/lib/ruby/1.9.1/optparse.rb:1347:in `catch'
        from c:/Ruby193/lib/ruby/1.9.1/optparse.rb:1347:in `parse_in_order'
        from c:/Ruby193/lib/ruby/1.9.1/optparse.rb:1341:in `order!'
        from c:/Ruby193/lib/ruby/1.9.1/optparse.rb:1432:in `permute!'
        from c:/Ruby193/lib/ruby/1.9.1/optparse.rb:1453:in `parse!'
        from c:/Ruby193/lib/ruby/gems/1.9.1/gems/kramdown-1.0.0/bin/kramdown:16:in `<top (required)>'
        from c:/Ruby193/bin/kramdown:23:in `load'
        from c:/Ruby193/bin/kramdown:23:in `<main>'

Is it related, or should a new issue be opened? Or, should I change my settings? If the latter—exactly what should I change? (The answer to that should be added to kramdown's install instructions for Windows.) Why can ruby find files with unicode characters in them, but not kramdown?

Anyways, thanks for the quick solution for the main problem! (For the time being, I could just avoid unicode characters in my file names, or rename them temporarily.)

Owner

gettalong commented Mar 11, 2013

There are still known problems with Ruby, Windows and Unicode path names (see, for example, http://bugs.ruby-lang.org/issues/1685). However, they may not apply to this situation.

You should be able to work around this by using ruby --external-encoding UTF-8. However, you need to be aware that this assumes that the content files for kramdown are also in UTF-8 and not CP850!

I have searched a bit and found the chcp command, with which you can change the code page used by the Windows command prompt (cmd.exe). Code page 65001 can be used for UTF-8. You should also change the console font from the raster font to something else (Lucida Console works fine for me).

After changing to code page 1252 (chcp 1252), I was able to execute the following command:

C:\temp>chcp 850
Aktive Codepage: 850.

C:\temp>kramdown ä.txt
C:/Ruby193/lib/ruby/gems/1.9.1/gems/kramdown-1.0.0/bin/kramdown:59:in `read': No such file or directory - ä.txt (Errno::ENOENT)
        from C:/Ruby193/lib/ruby/gems/1.9.1/gems/kramdown-1.0.0/bin/kramdown:59:in `<top (required)>'
        from C:/Ruby193/bin/kramdown:23:in `load'
        from C:/Ruby193/bin/kramdown:23:in `<main>'

C:\temp>chcp 1252
Aktive Codepage: 1252.

C:\temp>kramdown ä.txt
<p>aaä</p>

C:\temp>

Please note that the contents of the ä.txt file were in Windows 1252 encoding and not UTF-8. If they had been in UTF-8, the output would have been garbled, along the lines of <p>aaÃ¤</p> (because Ruby would interpret the bytes as being Windows 1252 encoded before giving them to kramdown).

And the moral of the story? I don't think that I can provide you with a general solution. You need to make sure that the proper code page is set on the command line and that all your files and their contents are encoded in this code page. I'm sorry I can't help more, but I don't really use Windows that often.

svnpenn commented Mar 11, 2013

@lydell you input

LANG=en-US.UTF-8 kramdown aaå.rb

when I input

LANG=en_US.UTF-8 kramdown aaå.rb

notice carefully the underscore.

Owner

gettalong commented Mar 11, 2013

Does using LANG=... on Windows really work?

svnpenn commented Mar 11, 2013

@gettalong with Cygwin

Owner

gettalong commented Mar 11, 2013

Ah, yes, of course, there it should work. But I don't think it works with the Windows Command Shell.

Contributor

lydell commented Mar 11, 2013

Ah, I'm so used to seeing en-US that I just couldn't see that underscore … However, I still got the same result :(

LANG=... does not work with Windows Command Shell, that's right. The discussion might have been a bit confused, since I've sometimes used Windows Command Shell and sometimes (mostly) Git bash (I don't know if LANG=... works there).

In Windows Command Shell, running chcp 1252 before running kramdown aaå.rb works! aaå.rb is saved with UTF-8 encoding though. And using chcp 65001 did not work! I'm now very confused …

Unfortunately, the chcp command does not work in Git bash, which is what I use the most. Luckily, I found a solution: cat aaå.rb | kramdown.

It seems like Sass has the same problem:

$ sass aaå.rb
Errno::ENOENT: No such file or directory - aaÕ.rb
  Use --trace for backtrace.

I also tried feeding the aaå.rb files to half a dozen compilers using node.js. All of them found it and used it correctly. So it really seems to be a ruby thing.

To sum up, thanks for your efforts! The important thing is that the main issue is resolved. I will continue to experiment with this, because I really want Windows users to be able to enjoy kramdown :)

Owner

gettalong commented Mar 11, 2013

If you change the code page using chcp, you basically change what Ruby sees as the encoding of its environment. I referred to this as the default external encoding (use ruby -e "puts Encoding.default_external" to output it).
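To see what Ruby picked up from the environment, the encodings can be printed from a short script. This is a sketch for inspection only; the values differ per machine, code page and Ruby version. `Encoding.find` accepts the special names "locale" and "filesystem" in addition to regular encoding names.

```ruby
# Collect the encodings Ruby derived from its environment.
report = {
  "default_external" => Encoding.default_external,
  "default_internal" => Encoding.default_internal, # nil unless explicitly set
  "locale"           => Encoding.find("locale"),
  "filesystem"       => Encoding.find("filesystem"),
}

report.each do |name, encoding|
  puts format("%-17s %s", name, encoding.inspect)
end
```

Running this before and after `chcp` in the same console window is one way to check whether the code page change actually reached Ruby.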

So if you change the code page to 1252, kramdown/ruby now can read the file name correctly because the encodings match (the default external with the encoding of the file name). However, since the content of the file is in a different encoding than the file name, the output from kramdown is garbled...

(side note: I don't really know how file system paths work on Windows and whether they are encoded in UTF-8 or Windows 1252, I'm just interpreting the data).

What exactly is "Git bash"? If it is based on Cygwin, the LANG trick from @svnpenn should work!

Owner

gettalong commented Mar 11, 2013

Also just found http://stackoverflow.com/questions/2050973/what-encoding-are-filenames-in-ntfs-stored-as

After reading this it seems that Ruby is using the ANSI version of the fopen system call because it works if the external encoding is Windows 1252 but not if it is UTF-8. So this could probably be fixed by always converting the file name to the proper ANSI encoding when passing it to a Ruby file method.
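A rough sketch of that idea follows. The helper name `ansi_path` and the default code page are assumptions made for illustration; whether passing such a path to `File.read` actually fixes the lookup on a given Ruby and Windows combination has not been verified here.

```ruby
# Hypothetical helper: transcode a path into a single-byte "ANSI" code page
# before handing it to a file method. Falls back to the untouched path when
# a character cannot be represented in that code page.
def ansi_path(path, code_page = "Windows-1252")
  path.encode(code_page)
rescue Encoding::UndefinedConversionError
  path
end

converted = ansi_path("aa\u00E5.rb")
puts converted.encoding
puts converted.bytes.inspect
```

The fallback branch matters because file names containing characters outside the chosen code page cannot be expressed in it at all, which is a limitation of the ANSI route in general.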

svnpenn commented Mar 11, 2013

@gettalong "Git Bash" is essentially MinGW, and yes it is based on Cygwin.
