Replace NBSPs #2150

Merged
guardrex merged 2 commits into dotnet:master from svick:remove-nbsps
May 15, 2017

@svick
Member

svick commented May 14, 2017

While working on #2149, I noticed that many code blocks contain the NO-BREAK SPACE character (U+00A0).

I don't know if they actually cause any issues, but I think they shouldn't be used in code blocks, so I replaced them with a normal space.

The code I used is here.
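
For illustration only (this is not the code linked above): a minimal Python sketch of the kind of replacement described, assuming code blocks are delimited by triple-backtick fences. The names `replace_nbsp_in_code_blocks` and `FENCE_RE` are hypothetical, and nested or indented fences are not handled.

```python
import re

NBSP = "\u00a0"
FENCE = "`" * 3  # triple backtick, built this way to keep the example readable

# Assumption: a code block runs from a line starting with the fence
# to the next line that consists of the fence alone.
FENCE_RE = re.compile(
    rf"^{FENCE}.*?^{FENCE}[ \t]*$",
    re.MULTILINE | re.DOTALL,
)


def replace_nbsp_in_code_blocks(markdown: str) -> str:
    """Replace U+00A0 with U+0020 inside fenced code blocks only."""
    return FENCE_RE.sub(lambda m: m.group(0).replace(NBSP, " "), markdown)
```

Text outside the fences is left untouched, which matches the scope described in this comment.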

@guardrex
Contributor

Thanks for hunting these pesky characters down and removing them. When looking at the markdown and then the rendered bytes, it seems they get dropped when the docs are built. It's very agreeable to get these out tho.

Did you do a global search (i.e., outside of code blocks) for U+00A0? If so, did you find more of these ... as @mairaw says ... little "treasures" in there?

Expanding out to other similar characters (e.g., the dreaded BOM!): If this is a larger problem and if there are all kinds of unnecessary and undesirable characters sprinkled in across the docs, shouldn't we determine if the docs can be cleansed of all of these characters in one big pass?

@svick
Member Author

svick commented May 14, 2017

> Did you do a global search (i.e., outside of code blocks) for U+00A0? If so, did you find more of these

Yes, there are more of them. For example, here is the start of the output of `git grep -I $'\u00A0'` on this branch (where NBSPs are shown as <C2><A0>):

core/tools/dotnet-clean.md:Sets the verbosity level of the command. Allowed levels are q[uiet
],<C2><A0>m[inimal],<C2><A0>n[ormal],<C2><A0>d[etailed], and<C2><A0>diag[nostic].
framework/migration-guide/versions-and-dependencies.md:|.NET 4.7|4|- Support for the level of
 TLS support provided by the operating system.<br/> - Ability to configure default message se
curity settings for TLS1.1 or TLS1.2. <br /> - Improved reliability of the <xref:System.Runti
me.Serialization.Json.DataContractJsonSerializer>. <br /> - Improved reliability of serializa
tion and deserialization with WCF applications. <br /> - Ability to extend the ASP.NET object
 cache. <br /> - Support for a touch/stylus stack based on<C2><A0>`WM_POINTER` Windows messag
es instead of<C2><A0>the Windows Ink Services Platform (WISP) for WPF applications. <br /> - 
Use of Window's Print Document Package API for printing in WPF applications.<br /> - Enhanced
 high DPI and multi-monitor support for Windows Forms applications running on Windows 10 Crea
tors Update.<C2><A0>| | <E2><9C><93>  10 Creators Update <br/> <br/> + 10 Anniversary Update 
<br/> + 8.1 <br/> +7| + 2016 <br/> + 2012 R2 <br/> + 2012 <br/> + 2008 R2 SP1 |Use `Release` 
DWORD:<br/><br/> - 460798 (Windows 10 Creators Update) <br/> - 460805 (all other OS versions)
 <br/><br/> (see [instructions](../../../docs/framework/migration-guide/how-to-determine-whic
h-versions-are-installed.md)) |
fsharp/language-reference/code-formatting-guidelines.md:When indentation is required, you mus
t use spaces, not tabs. At least one space is required. Your organization can create coding s
tandards to specify the number of spaces to use for indentation; three or four spaces of inde
ntation at each level where indentation occurs is typical. You can configure Visual Studio to
 match your organization's indentation standards by changing the options in the<C2><A0>`Optio
ns` dialog box, which is available from the `Tools` menu. In the `Text Editor` node, expand `
F#` and then click `Tabs`.<C2><A0>For a description of the available options, see [Options, T
ext Editor, All Languages, Tabs](https://msdn.microsoft.com/library/7sffa753.aspx).
fsharp/language-reference/query-expressions.md:    distinct<C2><A0><C2><A0><C2><A0><C2><A0><C
2><A0><C2><A0><C2><A0>
fsharp/language-reference/query-expressions.md:    for student in db.Student do<C2><A0><C2><A
0><C2><A0><C2><A0><C2><A0><C2><A0><C2><A0>
fsharp/language-reference/query-expressions.md:    groupBy student.Age into g<C2><A0><C2><A0>
<C2><A0><C2><A0><C2><A0><C2><A0><C2><A0>
fsharp/language-reference/query-expressions.md:    where (g.Count() > 1)<C2><A0><C2><A0><C2><
A0><C2><A0><C2><A0><C2><A0><C2><A0>
fsharp/language-reference/query-expressions.md:    select student<C2><A0>
fsharp/language-reference/query-expressions.md:    select n.StudentID<C2><A0><C2><A0><C2><A0>

(The ones in the F# language reference code blocks were not replaced, because they're <pre>, not ```.)
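
A rough Python approximation of the search above, for readers without a shell that supports `$'\u00A0'`. It is a sketch, not a replacement for `git grep`: it walks the file system rather than the Git index, and the function names `show_nbsp` and `find_nbsp_lines` are made up for this example.

```python
from pathlib import Path

NBSP = "\u00a0"


def show_nbsp(line: str) -> str:
    """Render each NBSP as <C2><A0>, mimicking the output shown above."""
    return line.replace(NBSP, "<C2><A0>")


def find_nbsp_lines(root: str, pattern: str = "*.md"):
    """Yield 'path:line' for every line under root that contains U+00A0."""
    base = Path(root)
    for path in sorted(base.rglob(pattern)):
        text = path.read_text(encoding="utf-8", errors="replace")
        for line in text.splitlines():
            if NBSP in line:
                yield f"{path.relative_to(base).as_posix()}:{show_nbsp(line)}"
```

Calling `list(find_nbsp_lines("docs"))` would list matching lines relative to a hypothetical `docs` directory.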

I wasn't completely sure they should be replaced outside code blocks, and I thought they were less of an issue there, so I didn't replace them. Should I?

> Expanding out to other similar characters (e.g., the dreaded BOM!): If this is a larger problem and if there are all kinds of unnecessary and undesirable characters sprinkled in across the docs, shouldn't we determine if the docs can be cleansed of all of these characters in one big pass?

Here is a table of all unusual characters in *.md files in the docs directory (in this branch, so after NBSPs in code blocks were already replaced). Many of them are not wrong (e.g., Greek or Russian characters, or some box-drawing characters from the output of Yeoman) and some are a matter of style (e.g., - vs. – vs. —).

char code count
tab 0009 953
NBSP 00A0 134
¥ 00A5 2
§ 00A7 2
© 00A9 55
ª 00AA 1
® 00AE 52
° 00B0 5
± 00B1 6
´ 00B4 4
À 00C0 7
Á 00C1 2
Ã 00C3 1
Ä 00C4 1
Å 00C5 10
Ê 00CA 7
Ö 00D6 1
× 00D7 55
Ø 00D8 6
ß 00DF 2
à 00E0 7
á 00E1 6
â 00E2 2
ä 00E4 12
å 00E5 7
æ 00E6 1
ç 00E7 37
è 00E8 3
é 00E9 51
ê 00EA 8
í 00ED 6
ï 00EF 1
ñ 00F1 1
ó 00F3 5
ô 00F4 4
ö 00F6 15
ø 00F8 6
ü 00FC 6
İ 0130 2
ı 0131 1
ˆ 02C6 1
̊ 030A 2
Δ 0394 3
Ι 0399 3
έ 03AD 3
ή 03AE 1
ί 03AF 4
α 03B1 6
β 03B2 1
ε 03B5 3
η 03B7 2
ι 03B9 2
κ 03BA 1
μ 03BC 7
ν 03BD 5
ο 03BF 9
ρ 03C1 3
τ 03C4 5
υ 03C5 11
χ 03C7 2
ψ 03C8 1
Ϩ 03E8 2
В 0412 1
Д 0414 4
Ж 0416 3
З 0417 1
К 041A 1
П 041F 2
Т 0422 1
а 0430 4
б 0431 1
в 0432 8
г 0433 2
д 0434 4
е 0435 11
з 0437 1
и 0438 4
й 0439 2
к 043A 3
л 043B 1
м 043C 2
н 043D 5
о 043E 5
п 043F 2
р 0440 6
с 0441 5
т 0442 6
у 0443 4
щ 0449 1
ы 044B 1
ь 044C 2
ю 044E 1
я 044F 2
ӑ 04D1 1
م 0645 2
ấ 1EA5 1
en space 2002 14
ZWSP 200B 20
– 2013 679
— 2014 608
‘ 2018 107
’ 2019 1712
† 2020 9
• 2022 8
… 2026 681
‰ 2030 11
€ 20AC 2
™ 2122 1
Å 212B 4
→ 2192 3
− 2212 6
⌘ 2318 2
─ 2500 84
│ 2502 12
└ 2514 8
├ 251C 8
╭ 256D 1
╮ 256E 1
╯ 256F 1
╰ 2570 1
✓ 2713 520
❯ 276F 1
➊ 278A 7
➋ 278B 3
➌ 278C 8
さ 3055 1
た 305F 1
れ 308C 1
エ 30A8 1
ケ 30B1 1
ス 30B9 1
テ 30C6 1
プ 30D7 1
ラ 30E9 1
リ 30EA 1
ル 30EB 1
ー 30FC 1
列 5217 1
別 5225 1
午 5348 2
子 5B50 1
字 5B57 1
後 5F8C 1
文 6587 1
識 8B58 1
BOM FEFF 1
🔧 1F527 31
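
A table like the one above could be produced along these lines. This is a hedged sketch, not the script actually used: it assumes "unusual" means anything outside printable ASCII other than line breaks, and the names `count_unusual_chars`, `format_table`, and `LABELS` are invented for the example.

```python
from collections import Counter
from pathlib import Path

# Assumed labels for characters that have no visible glyph.
LABELS = {
    "\t": "tab",
    "\u00a0": "NBSP",
    "\u2002": "en space",
    "\u200b": "ZWSP",
    "\ufeff": "BOM",
}


def is_unusual(ch: str) -> bool:
    """Assumption: everything except printable ASCII and line breaks."""
    return ch not in "\r\n" and not (0x20 <= ord(ch) <= 0x7E)


def count_unusual_chars(root: str, pattern: str = "*.md") -> Counter:
    counts = Counter()
    for path in Path(root).rglob(pattern):
        # Plain "utf-8" (not "utf-8-sig") keeps a leading BOM as U+FEFF.
        text = path.read_text(encoding="utf-8", errors="replace")
        counts.update(ch for ch in text if is_unusual(ch))
    return counts


def format_table(counts: Counter) -> list:
    rows = ["char code count"]
    for ch, n in sorted(counts.items(), key=lambda kv: ord(kv[0])):
        rows.append(f"{LABELS.get(ch, ch)} {ord(ch):04X} {n}")
    return rows
```

Printing `"\n".join(format_table(count_unusual_chars("docs")))` would give rows sorted by code point, in the same three-column layout.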

@guardrex
Contributor

guardrex commented May 14, 2017

The <C2> is a Â (U+00C2: LATIN CAPITAL LETTER A WITH CIRCUMFLEX) ... another "treasure" that came into some of those lines via cut-and-paste. The reason I know that is that a few of those lines were handled by me on a PR several weeks ago. Combine this knowledge with the lovely BOM that pops into files from time to time, and I've drawn some conclusions here:

  1. I need to always cleanse cut-and-paste text prior to use (at least right now, see 2 👇).
  2. Contributors everywhere can paste these types of characters into topics and they will not get caught (flagged to us) by CI ... meaning that over time more and more of these pesky devils are going to find their way into the docs.

I hope @mairaw will give us good news on the possibility of the CI warning/failing on at least some of the nastier ones (e.g., the BOM).

RE: Removing the rest of the U+00A0 in favor of U+0020: I like it. Makes sense for this PR if @mairaw likes it. @mairaw will know if this problem can/should be approached more globally than by creating separate PRs that address individual characters.

@svick changed the title from "Replace NBSPs in code blocks" to "Replace NBSPs" on May 15, 2017
@svick
Member Author

svick commented May 15, 2017

> The <C2> is a Â (U+00C2: LATIN CAPITAL LETTER A WITH CIRCUMFLEX)

No, it's not, at least not in this case. <C2><A0> is just UTF-8 encoded U+00A0 (NBSP). git grep seems to display Unicode that way.
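
The point about the encoding can be checked in a few lines of Python. The last step is an assumption about how the mix-up can arise: reading the same two bytes with a single-byte encoding such as Windows-1252 does produce an Â followed by an NBSP.

```python
NBSP = "\u00a0"

# U+00A0 is a single code point, but UTF-8 encodes it as two bytes.
encoded = NBSP.encode("utf-8")
print(encoded)  # b'\xc2\xa0'

# Render the bytes the way the grep output above shows them.
as_tags = "".join(f"<{b:02X}>" for b in encoded)
print(as_tags)  # <C2><A0>

# Misreading those bytes as Windows-1252 yields two characters:
# U+00C2 (LATIN CAPITAL LETTER A WITH CIRCUMFLEX) and U+00A0.
misread = encoded.decode("cp1252")
print([f"U+{ord(c):04X}" for c in misread])  # ['U+00C2', 'U+00A0']
```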

> RE: Removing the rest of the U+00A0 in favor of U+0020: I like it.

Ok, done.

@guardrex
Contributor

Never trust Notepad (ANSI/Windows-1252) to tell you an encoding! lol 😄 Yes, I got the encoding wrong. I see it now (dotnet-clean.md) ...

[Image: capture]

My questions are still valid tho: Can we prevent undesirable codepoints creeping into the docs over the years? Is there an opportunity to purge the docs of undesirable codepoints in one bold stroke?

@guardrex merged commit dc1c456 into dotnet:master on May 15, 2017
@svick deleted the remove-nbsps branch on May 15, 2017 08:56
@svick
Member Author

svick commented May 15, 2017

> Can we prevent undesirable codepoints creeping into the docs over the years? Is there an opportunity to purge the docs of undesirable codepoints in one bold stroke?

That would require CI. And considering that CI currently seems to verify only content (the "OpenPublishing.Build" check) and the few remaining project.json projects (the "OrcaBot [.NET Core - Nix]" check), I think first we have to ask: Can we get a decent working CI?

(Unless I'm missing something and CI is actually working fine.)
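
To make the idea concrete, here is a hypothetical sketch of the kind of check a CI step could run. Nothing like it exists in this repository as far as this thread shows. The `FORBIDDEN` set (NBSP, ZWSP, BOM) is an assumption drawn from the characters discussed above, and `find_forbidden` and `check_tree` are invented names.

```python
from pathlib import Path

# Assumed deny-list; which code points count as undesirable is a policy choice.
FORBIDDEN = {"\u00a0": "NBSP", "\u200b": "ZWSP", "\ufeff": "BOM"}


def find_forbidden(text: str) -> list:
    """Return (line, column, name) for each forbidden character, 1-based."""
    hits = []
    for lineno, line in enumerate(text.split("\n"), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in FORBIDDEN:
                hits.append((lineno, col, FORBIDDEN[ch]))
    return hits


def check_tree(root: str, pattern: str = "*.md") -> int:
    """Print every hit under root; return 1 if any were found, else 0."""
    status = 0
    for path in sorted(Path(root).rglob(pattern)):
        # Plain "utf-8" keeps a leading BOM visible as U+FEFF.
        text = path.read_text(encoding="utf-8", errors="replace")
        for lineno, col, name in find_forbidden(text):
            print(f"{path}:{lineno}:{col}: forbidden character {name}")
            status = 1
    return status
```

A build step could pass the return value of `check_tree` to `sys.exit` so that a pull request introducing one of these characters fails the check.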
