Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incorrect line numbering/location information when certain unicode characters are used #158

Closed
mivok opened this issue Aug 17, 2014 · 7 comments
Assignees
Labels

Comments

@mivok
Copy link

@mivok mivok commented Aug 17, 2014

This is a strange one:

$ cat testcase.rb
#!/usr/bin/env ruby
require 'kramdown'
text = <<EOF
nbsp nbsp (the first space on this line is unicode 0xA0 - a non-breaking space)

- Test

# Test
EOF

doc = Kramdown::Document.new(text).root
puts doc.children.inspect

$ ./testcase.rb
[<kd:p nil {:location=>1} [<kd:text "nbsp nbsp" nil>]>, <kd:blank "\n" nil>, <kd:ul nil {:location=>3} [<kd:li nil {:location=>3} [<kd:p nil {:location=>3, :transparent=>true} [<kd:text "Test" nil>]>]>]>, <kd:blank "\n" nil>, <kd:header nil {:level=>1, :raw_text=>"Test", :location=>6} [<kd:text "Test" nil>]>]

The location for the header is parsed as line 6, when it should be line 5. Interestingly, the location for the list item is shown as line 3, which is correct.

This came from a much larger markdown document, and I managed to get it down to that small test case while still reproducing the issue.

@mivok
Copy link
Author

@mivok mivok commented Aug 17, 2014

I've managed to reproduce with other unicode characters as well, so it's not just the non-breaking space:

#!/usr/bin/env ruby
require 'kramdown'
text = <<EOF
®

- Test

# Test
EOF

doc = Kramdown::Document.new(text).root
puts doc.children.inspect

@mivok mivok changed the title Incorrect line numbering/location information when non-breaking unicode character is used Incorrect line numbering/location information when certain unicode characters are used Aug 17, 2014
mivok added a commit to markdownlint/markdownlint that referenced this issue Aug 17, 2014
Fixes #52

This works around some issues with unicode characters causing issues with line
numbering, and fixes a crash bug in markdownlint.

While no current rules rely on non-ascii characters, this may not always be
the case (e.g a rule that checks for the use of smart quotes in text), and so
this workaround should be considered temporary until a fix is found to
gettalong/kramdown#158
@mivok
Copy link
Author

@mivok mivok commented Aug 17, 2014

I just spotted https://github.com/gettalong/kramdown/blob/master/lib/kramdown/utils/string_scanner.rb#L33 which mentions that this could be an issue with older rubies so I wanted to point out that this is in ruby 2.1.1:

$ ruby -e 'require "strscan";p StringScanner.new("foo", 1).respond_to?(:charpos)'
true

@mivok
Copy link
Author

@mivok mivok commented Aug 17, 2014

Interestingly, changing best_pos to pos in https://github.com/gettalong/kramdown/blob/master/lib/kramdown/utils/string_scanner.rb#L64 causes the line numbers to display correctly, and I ran the testcase in ruby 1.9.3 (after adding # encoding: utf-8 to the script) and it also showed the correct line number. Unfortunately I'm not sure what the implications of this are (presumably there is a good reason for using charpos/best_pos), but it should point you in the right direction.

@gettalong gettalong self-assigned this Aug 17, 2014
@gettalong
Copy link
Owner

@gettalong gettalong commented Aug 17, 2014

Thanks for the detailed bug report! You are completely right!

The StringScanner#charpos method returns the character position whereas the StringScanner#pos method returns the byte position. When setting the position with StringScanner#pos= (see http://www.ruby-doc.org/stdlib-2.0.0/libdoc/strscan/rdoc/StringScanner.html#method-i-pointer-3D) the byte position is needed.

So the #best_pos method is not needed, #pos is sufficient.

@gettalong gettalong added the bug label Aug 17, 2014
@jhund
Copy link
Contributor

@jhund jhund commented Sep 5, 2014

@gettalong I'm happy to submit a pull request to fix this issue since we rely on it in Repositext. Should I go ahead?

@gettalong
Copy link
Owner

@gettalong gettalong commented Sep 6, 2014

@jhund Thanks but I have already implemented the changes, I will just need to push a new release which will be done in the next few days.

@jhund
Copy link
Contributor

@jhund jhund commented Sep 8, 2014

@gettalong great, thanks for fixing the location bug.

@gettalong gettalong closed this Sep 9, 2014
gettalong added a commit that referenced this issue Sep 9, 2014
Since the StringScanner#pos= method sets the byte position,
we need to use StringScanner#pos (not #charpos) to get the
correct position for later use.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
3 participants