Non-English characters are encoded #315

Closed
danroth27 opened this Issue May 28, 2015 · 19 comments

danroth27 (Member) commented May 28, 2015

Reported as aspnet/Razor#409 by @anfomin

Every non-English character is encoded. For example:

@{
    string s = "Добрый день"; // this is Russian text
}
@s

The result is something like this (screenshot of the rendered HTML source): each Cyrillic character is replaced by a numeric character reference such as &#x414;.


Characters are displayed correctly, but the page size increases significantly 👎 Also, MVC 5 rendered Unicode symbols correctly (unencoded) in the HTML source.
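
For context, a minimal sketch of the reported behavior, written against the released System.Text.Encodings.Web package rather than the pre-release API this issue was filed against:

using System;
using System.Text.Encodings.Web;

// The default HtmlEncoder leaves only Basic Latin (U+0000..U+007F) unescaped;
// every other character is written out as a numeric character reference.
Console.WriteLine(HtmlEncoder.Default.Encode("Добрый день"));
// Prints something like: &#x414;&#x43E;&#x431;&#x440;&#x44B;&#x439; &#x434;&#x435;&#x43D;&#x44C;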

blowdart (Member) commented May 28, 2015

Actually MVC 5 would have done the same if you'd switched to AntiXSS as the encoder.

By design.

Tratcher (Member) commented May 28, 2015

What API are you complaining about?

Tratcher (Member) commented May 28, 2015

Ah, IHtmlEncoder

dougbu (Member) commented May 28, 2015

@blowdart however, in MVC 5 it was a choice, and the default HTML encoder produced readable source of a reasonable size.

blowdart (Member) commented May 28, 2015

The standard Russian characters should have been safe-listed with the new encoder.
You'll need to examine the whitelist generation to see why they're not.

dougbu (Member) commented May 28, 2015

@blowdart from the HtmlEncoder source, the default allowed (unescaped) range is UnicodeRanges.BasicLatin, which has this convenient comment:

/// A <see cref="UnicodeRange"/> corresponding to the 'Basic Latin' Unicode block (U+0000..U+007F).

Are you suggesting a different default?

blowdart (Member) commented May 28, 2015

I thought it was changed, but OK. And no, let's be honest, I'd encode everything by default to avoid certain UTF-7 badness. Customers can use the Default property to widen their safe list as needed.

anfomin commented May 29, 2015

Encoding every character brings other problems. Sometimes we generate <script> blocks in Razor pages like this:

<script>
    $(function() {
        var someVariable = "@thisIsDotNetVariable";
        // other code goes here
    });
</script>

So JavaScript will not decode this value and will display the numeric character references.

blowdart (Member) commented May 29, 2015

Don't do that for JavaScript. It's simply wrong.

@ encoding is HTML encoding. It's not JavaScript encoding. Using HTML encoding inside JavaScript means some of the characters that can cause JavaScript injection are not encoded.

The most foolproof way to inject variable values into JavaScript is to have a div or span or other tag and use a data attribute; using @ encoding there is correct, and you're protected. Then you'd retrieve it within the JavaScript.
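
For illustration, a minimal sketch of the data-attribute approach described above (the element id and attribute name are only an example; jQuery is assumed because the earlier snippet already uses it):

@* The value is attribute-encoded by @, which is the correct encoding in this position. *@
<div id="server-data" data-some-value="@thisIsDotNetVariable"></div>

<script>
    $(function () {
        // Read the value back from the DOM; the browser has already decoded the
        // HTML entities, so no server-side text is spliced into the script source.
        var someVariable = $("#server-data").attr("data-some-value");
        // other code goes here
    });
</script>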

anfomin commented May 29, 2015

@blowdart OK, that is a good explanation of why not to use @ with JS. But the encoded text increases the page size 7 times (each character is converted into 7 characters). I do not think this is good default behavior.

Also, no attack can be carried out using Russian or any other language's characters, so they do not require HTML encoding. These characters do not have any "special" behavior compared to English ones.

blowdart (Member) commented May 29, 2015

Actually you can attack via some characters depending on the underlying page encoding and other badness. UTF-7 is a great example of this; http://openmya.hacker.jp/hasegawa/public/20071107/s6/h6.html?file=datae.txt has some details.

We're discussing internally what we want to do, but the encoders are used by things other than Razor, and defaulting to encoding everything over and above ASCII is what makes it 100% safe for the other things to consume, so I'm not minded to change the default encoder configuration. With gzip compression, which most servers support these days, the actual transport-size increase shouldn't be that painful, but I realise that if you're not in an English-speaking country it's probably not a cost you want to take (being a Brit, it annoys me when we end up assuming everyone speaks US English).

You could just whitelist everything yourself, either by

HtmlEncoder.Default = new HtmlEncoder(UnicodeRanges.All);

before the encoder gets registered in DI, or via

serviceColl.AddWebEncoders(o => { 
    o.CodePointFilter = new CodePointFilter(UnicodeRanges.All);
});

If you wanted to narrow it down, then you can create ranges matching the Unicode code charts that contain the characters you need and pass them into the encoder (Unicode doesn't do languages, so you will need to mix and match; you can find the code charts at http://www.unicode.org/charts/).
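
To illustrate narrowing the safe list by code chart, a minimal sketch using the released System.Text.Encodings.Web and System.Text.Unicode APIs (the pre-release types shown above were later reshaped into these):

using System;
using System.Text.Encodings.Web;
using System.Text.Unicode;

// Allow Basic Latin plus the Cyrillic code chart (U+0400..U+04FF); everything else stays escaped.
var encoder = HtmlEncoder.Create(
    UnicodeRanges.BasicLatin,
    UnicodeRange.Create('\u0400', '\u04FF')); // same block as UnicodeRanges.Cyrillic

Console.WriteLine(encoder.Encode("Добрый день")); // Cyrillic now passes through unescaped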

dougbu (Member) commented May 29, 2015

Note setting HtmlEncoder.Default is likely not what you want. That changes only the HTML encoder, leaving the JavaScript and URL encoders with their (ASCII) defaults.

Recommend

  services.Configure<WebEncoderOptions>(options => options.CodePointFilter = new CodePointFilter(UnicodeRanges.All));

because (today) nothing configures WebEncoderOptions but much of the system adds the web encoders.

Tratcher (Member) commented Sep 9, 2015

@blowdart Is there any action required here? This all sounds like it's by design.

blowdart (Member) commented Sep 9, 2015

By design. The MVC team wanted to change it somewhat, but never came up with a suitable suggestion.

Tratcher added the wontfix label Sep 9, 2015

Tratcher closed this Sep 9, 2015

penihel commented Nov 23, 2015

I have the same problem with accents (pt-BR). In previous versions this was not a problem.

I think that in the final version, plain MVC should accept this kind of character natively, without the developer having to change any settings.

More than a technical issue, for me it is a matter of consideration for non-English developers.

Tratcher removed their assignment Nov 23, 2015

kerryjiang commented Jun 4, 2016

The API was changed; the issue can be worked around with this code:

services.Configure<WebEncoderOptions>(options =>
{
    options.TextEncoderSettings = new TextEncoderSettings(UnicodeRanges.All);
});
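
For anyone wondering where this configuration lives, a sketch assuming a typical ASP.NET Core MVC Startup class (the AddMvc call and class layout are assumptions, not part of the workaround itself):

using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.WebEncoders;   // WebEncoderOptions
using System.Text.Encodings.Web;          // TextEncoderSettings
using System.Text.Unicode;                // UnicodeRanges

public class Startup
{
    public void ConfigureServices(IServiceCollection services)
    {
        services.AddMvc();

        // Widens the safe list for the HTML, JavaScript, and URL encoders registered in DI.
        services.Configure<WebEncoderOptions>(options =>
        {
            options.TextEncoderSettings = new TextEncoderSettings(UnicodeRanges.All);
        });
    }
}
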
miloush commented Jun 7, 2016

I agree that BasicLatin is a horrible default. Not only does it increase the file size and overhead (don't we expect Core to run on small devices too?), but it also makes the page illegible to humans.

Cuiqs76 commented Feb 24, 2017

Where is WebEncoderOptions? @kerryjiang

Eirenarch commented Jun 21, 2018

This default is absurd. In addition, it is not even clear to me whether using UnicodeRanges.All is safe, or whether I'd be better off listing a couple of the more popular Unicode ranges like Cyrillic, GreekandCoptic, Hebrew, and Arabic and leaving the rest encoded.
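
As a hedged illustration of the second option, a sketch that safe-lists only a handful of blocks using the released TextEncoderSettings API (the particular set of ranges is just an example, not a recommendation from this thread):

using System;
using System.Text.Encodings.Web;
using System.Text.Unicode;

// Allow a few specific blocks in addition to Basic Latin; everything outside them is still escaped.
var settings = new TextEncoderSettings(
    UnicodeRanges.BasicLatin,
    UnicodeRanges.Cyrillic,
    UnicodeRanges.GreekandCoptic,
    UnicodeRanges.Hebrew,
    UnicodeRanges.Arabic);

Console.WriteLine(HtmlEncoder.Create(settings).Encode("Добрый день")); // no longer escaped

// The same settings object can be assigned to WebEncoderOptions.TextEncoderSettings, as in the comment above.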
