Replace NBSPs by svick · Pull Request #2150 · dotnet/docs

svick · 2017-05-14T14:16:06Z

When working on #2149, I have noticed that many code blocks contain the NO-BREAK SPACE character (U+00A0).

I don't know if they actually cause any issues, but I think they shouldn't be used in code blocks, so I replaced them with a normal space.

The code I used is here.
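
For readers without access to the linked code, the general idea can be sketched as follows. This is not the code used for this PR; it is a minimal Python sketch that assumes code blocks are delimited only by ``` fences, and `replace_nbsp_in_fences` is a hypothetical helper name.

```python
NBSP = "\u00a0"  # NO-BREAK SPACE


def replace_nbsp_in_fences(markdown: str) -> str:
    """Replace NBSP with a normal space, but only inside ``` fenced blocks."""
    out = []
    in_fence = False
    for line in markdown.splitlines(keepends=True):
        if line.lstrip().startswith("```"):
            # A fence line toggles the state and is copied through unchanged.
            in_fence = not in_fence
        elif in_fence:
            line = line.replace(NBSP, " ")
        out.append(line)
    return "".join(out)
```

Text outside the fences is left untouched, which mirrors the scope of the first commit. Indented code blocks and `<pre>` blocks are not handled by this sketch.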

guardrex · 2017-05-14T21:27:02Z

Thanks for hunting these pesky characters down and removing them. Comparing the markdown with the rendered bytes, it seems they get dropped when the docs are built. It's still good to get these out, though.

Did you do a global search (i.e., outside of code blocks) for U+00A0? If so, did you find more of these ... as @mairaw says ... little "treasures" in there?

Expanding out to other similar characters (e.g., the dreaded BOM!): If this is a larger problem and if there are all kinds of unnecessary and undesirable characters sprinkled in across the docs, shouldn't we determine if the docs can be cleansed of all of these characters in one big pass?

svick · 2017-05-14T22:45:57Z

Did you do a global search (i.e., outside of code blocks) for U+00A0? If so, did you find more of these

Yes, there are more of them. For example, here is the start of the output of git grep -I $'\u00A0' on this branch (where NBSPs are shown as <C2><A0>):

core/tools/dotnet-clean.md:Sets the verbosity level of the command. Allowed levels are q[uiet
],<C2><A0>m[inimal],<C2><A0>n[ormal],<C2><A0>d[etailed], and<C2><A0>diag[nostic].
framework/migration-guide/versions-and-dependencies.md:|.NET 4.7|4|- Support for the level of
 TLS support provided by the operating system.<br/> - Ability to configure default message se
curity settings for TLS1.1 or TLS1.2. <br /> - Improved reliability of the <xref:System.Runti
me.Serialization.Json.DataContractJsonSerializer>. <br /> - Improved reliability of serializa
tion and deserialization with WCF applications. <br /> - Ability to extend the ASP.NET object
 cache. <br /> - Support for a touch/stylus stack based on<C2><A0>`WM_POINTER` Windows messag
es instead of<C2><A0>the Windows Ink Services Platform (WISP) for WPF applications. <br /> - 
Use of Window's Print Document Package API for printing in WPF applications.<br /> - Enhanced
 high DPI and multi-monitor support for Windows Forms applications running on Windows 10 Crea
tors Update.<C2><A0>| | <E2><9C><93>  10 Creators Update <br/> <br/> + 10 Anniversary Update 
<br/> + 8.1 <br/> +7| + 2016 <br/> + 2012 R2 <br/> + 2012 <br/> + 2008 R2 SP1 |Use `Release` 
DWORD:<br/><br/> - 460798 (Windows 10 Creators Update) <br/> - 460805 (all other OS versions)
 <br/><br/> (see [instructions](../../../docs/framework/migration-guide/how-to-determine-whic
h-versions-are-installed.md)) |
fsharp/language-reference/code-formatting-guidelines.md:When indentation is required, you mus
t use spaces, not tabs. At least one space is required. Your organization can create coding s
tandards to specify the number of spaces to use for indentation; three or four spaces of inde
ntation at each level where indentation occurs is typical. You can configure Visual Studio to
 match your organization's indentation standards by changing the options in the<C2><A0>`Optio
ns` dialog box, which is available from the `Tools` menu. In the `Text Editor` node, expand `
F#` and then click `Tabs`.<C2><A0>For a description of the available options, see [Options, T
ext Editor, All Languages, Tabs](https://msdn.microsoft.com/library/7sffa753.aspx).
fsharp/language-reference/query-expressions.md:    distinct<C2><A0><C2><A0><C2><A0><C2><A0><C
2><A0><C2><A0><C2><A0>
fsharp/language-reference/query-expressions.md:    for student in db.Student do<C2><A0><C2><A
0><C2><A0><C2><A0><C2><A0><C2><A0><C2><A0>
fsharp/language-reference/query-expressions.md:    groupBy student.Age into g<C2><A0><C2><A0>
<C2><A0><C2><A0><C2><A0><C2><A0><C2><A0>
fsharp/language-reference/query-expressions.md:    where (g.Count() > 1)<C2><A0><C2><A0><C2><
A0><C2><A0><C2><A0><C2><A0><C2><A0>
fsharp/language-reference/query-expressions.md:    select student<C2><A0>
fsharp/language-reference/query-expressions.md:    select n.StudentID<C2><A0><C2><A0><C2><A0>

(The ones in the F# language reference code blocks were not replaced, because they're <pre>, not ```.)

I wasn't completely sure they should be replaced outside code blocks, and I thought they were less of an issue there, so I didn't replace them. Should I?

Expanding out to other similar characters (e.g., the dreaded BOM!): If this is a larger problem and if there are all kinds of unnecessary and undesirable characters sprinkled in across the docs, shouldn't we determine if the docs can be cleansed of all of these characters in one big pass?

Here is a table of all unusual characters in *.md files in the docs directory (in this branch, so after NBSPs in code blocks were already replaced). Many of them are not wrong (e.g., Greek or Russian characters, or some box-drawing characters from the output of Yeoman) and some are a matter of style (e.g., - vs. – vs. —).

char	code	count
tab	0009	953
NBSP	00A0	134
¥	00A5	2
§	00A7	2
©	00A9	55
ª	00AA	1
®	00AE	52
°	00B0	5
±	00B1	6
´	00B4	4
À	00C0	7
Á	00C1	2
Ã	00C3	1
Ä	00C4	1
Å	00C5	10
Ê	00CA	7
Ö	00D6	1
×	00D7	55
Ø	00D8	6
ß	00DF	2
à	00E0	7
á	00E1	6
â	00E2	2
ä	00E4	12
å	00E5	7
æ	00E6	1
ç	00E7	37
è	00E8	3
é	00E9	51
ê	00EA	8
í	00ED	6
ï	00EF	1
ñ	00F1	1
ó	00F3	5
ô	00F4	4
ö	00F6	15
ø	00F8	6
ü	00FC	6
İ	0130	2
ı	0131	1
ˆ	02C6	1
̊	030A	2
Δ	0394	3
Ι	0399	3
έ	03AD	3
ή	03AE	1
ί	03AF	4
α	03B1	6
β	03B2	1
ε	03B5	3
η	03B7	2
ι	03B9	2
κ	03BA	1
μ	03BC	7
ν	03BD	5
ο	03BF	9
ρ	03C1	3
τ	03C4	5
υ	03C5	11
χ	03C7	2
ψ	03C8	1
Ϩ	03E8	2
В	0412	1
Д	0414	4
Ж	0416	3
З	0417	1
К	041A	1
П	041F	2
Т	0422	1
а	0430	4
б	0431	1
в	0432	8
г	0433	2
д	0434	4
е	0435	11
з	0437	1
и	0438	4
й	0439	2
к	043A	3
л	043B	1
м	043C	2
н	043D	5
о	043E	5
п	043F	2
р	0440	6
с	0441	5
т	0442	6
у	0443	4
щ	0449	1
ы	044B	1
ь	044C	2
ю	044E	1
я	044F	2
ӑ	04D1	1
م	0645	2
ấ	1EA5	1
en space	2002	14
ZWSP	200B	20
–	2013	679
—	2014	608
‘	2018	107
’	2019	1712
†	2020	9
•	2022	8
…	2026	681
‰	2030	11
€	20AC	2
™	2122	1
Å	212B	4
→	2192	3
−	2212	6
⌘	2318	2
─	2500	84
│	2502	12
└	2514	8
├	251C	8
╭	256D	1
╮	256E	1
╯	256F	1
╰	2570	1
✓	2713	520
❯	276F	1
➊	278A	7
➋	278B	3
➌	278C	8
さ	3055	1
た	305F	1
れ	308C	1
エ	30A8	1
ケ	30B1	1
ス	30B9	1
テ	30C6	1
プ	30D7	1
ラ	30E9	1
リ	30EA	1
ル	30EB	1
ー	30FC	1
列	5217	1
別	5225	1
午	5348	2
子	5B50	1
字	5B57	1
後	5F8C	1
文	6587	1
識	8B58	1
BOM	FEFF	1
🔧	1F527	31
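
A table like this could be produced along the following lines. This is only a sketch, not the script that generated the numbers above: it assumes "unusual" means a tab or anything outside printable ASCII, that the files are UTF-8, and the function names are hypothetical.

```python
from collections import Counter
from pathlib import Path


def count_unusual_chars(texts):
    """Count tabs and every character outside printable ASCII."""
    counts = Counter()
    for text in texts:
        counts.update(ch for ch in text if ch == "\t" or ord(ch) > 0x7E)
    return counts


def read_markdown(root):
    """Yield the text of every *.md file below root (assumed to be UTF-8)."""
    for path in Path(root).rglob("*.md"):
        yield path.read_text(encoding="utf-8")


def format_table(counts):
    """Render the counts as tab-separated rows ordered by code point."""
    rows = ["char\tcode\tcount"]
    for ch, n in sorted(counts.items(), key=lambda kv: ord(kv[0])):
        rows.append(f"{ch}\t{ord(ch):04X}\t{n}")
    return "\n".join(rows)
```

For example, `print(format_table(count_unusual_chars(read_markdown("docs"))))` would print one row per character. Unlike the table above, this sketch prints invisible characters such as NBSP raw rather than by name.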

guardrex · 2017-05-14T23:48:19Z

The <C2> is a Â (U+00C2 : LATIN CAPITAL LETTER A WITH CIRCUMFLEX) ... another "treasure" that came into some of those lines via cut-and-paste. The reason I know that is that a few of those lines were handled by me on a PR several weeks ago. Combine this knowledge with the lovely BOM that pops into files from time-to-time, and I've drawn some conclusions here:

1. I need to always cleanse cut-and-paste text prior to use (at least right now, see 2 👇).
2. Contributors everywhere can paste these types of characters into topics and they will not get caught (flagged to us) by CI ... meaning that over time more and more of these pesky devils are going to find their way into the docs.

I hope @mairaw will give us good news on the possibility of the CI warning/failing on at least some of the nastier ones (e.g., the BOM).

RE: Removing the rest of the U+00A0 in favor of U+0020: I like it. Makes sense for this PR if @mairaw likes it. @mairaw will know if this problem can/should be approached more globally than by creating separate PRs that address individual characters.

svick · 2017-05-15T01:11:31Z

The <C2> is a Â (U+00C2 : LATIN CAPITAL LETTER A WITH CIRCUMFLEX)

No, it's not, at least not in this case. <C2><A0> is just UTF-8 encoded U+00A0 (NBSP). git grep seems to display Unicode that way.
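
A quick way to check this is shown in the Python sketch below. It also shows how the same two bytes look when read as Windows-1252, which is one plausible way to end up seeing Â; that part is an illustration, not a claim about what any particular editor did.

```python
nbsp = "\u00a0"  # NO-BREAK SPACE

# UTF-8 encodes U+00A0 as the two bytes C2 A0.
utf8_bytes = nbsp.encode("utf-8")
print(utf8_bytes)  # b'\xc2\xa0'

# Read as Windows-1252 instead, the same bytes become two characters:
# U+00C2 (Â) followed by U+00A0 (NBSP).
misread = utf8_bytes.decode("cp1252")
print([f"U+{ord(ch):04X}" for ch in misread])  # ['U+00C2', 'U+00A0']
```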

RE: Removing the rest of the U+00A0 in favor of U+0020: I like it.

Ok, done.

guardrex · 2017-05-15T03:55:12Z

Never trust Notepad (ANSI/Windows-1252) to tell you an encoding! lol 😄 Yes, I got the encoding wrong. I see it now (dotnet-clean.md) ...

My questions are still valid tho: Can we prevent undesirable codepoints creeping into the docs over the years? Is there an opportunity to purge the docs of undesirable codepoints in one bold stroke?

svick · 2017-05-15T11:56:20Z

Can we prevent undesirable codepoints creeping into the docs over the years? Is there an opportunity to purge the docs of undesirable codepoints in one bold stroke?

That would require CI. And considering that CI currently seems to verify only content (the "OpenPublishing.Build" check) and the few remaining project.json projects (the "OrcaBot [.NET Core - Nix]" check), I think first we have to ask: Can we get a decent working CI?

(Unless I'm missing something and CI is actually working fine.)
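
If CI could run an extra step, a check for the nastier characters might look roughly like this. It is a hypothetical sketch, not something the current checks do: the set of flagged characters, the function names, and the assumption that files are UTF-8 are all mine.

```python
from pathlib import Path

# Characters to flag; this selection is an assumption for illustration.
FLAGGED = {"\u00a0": "NBSP", "\u200b": "ZWSP", "\ufeff": "BOM"}


def find_flagged(text):
    """Return (line, column, name) for each flagged character, 1-based."""
    hits = []
    for lineno, line in enumerate(text.split("\n"), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in FLAGGED:
                hits.append((lineno, col, FLAGGED[ch]))
    return hits


def main(root):
    """Print every hit in *.md files below root; return 1 if any were found."""
    failed = False
    for path in sorted(Path(root).rglob("*.md")):
        for lineno, col, name in find_flagged(path.read_text(encoding="utf-8")):
            print(f"{path}:{lineno}:{col}: {name}")
            failed = True
    return 1 if failed else 0
```

A CI step could then call `sys.exit(main("docs"))` so that the build fails whenever one of the flagged characters appears.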

Replace NBSPs in code blocks

bcd49d4

dnfclas added the cla-already-signed label May 14, 2017

Replace NBSPs outside code blocks

8186ec0

svick changed the title ~~Replace NBSPs in code blocks~~ Replace NBSPs May 15, 2017

guardrex approved these changes May 15, 2017

guardrex merged commit dc1c456 into dotnet:master May 15, 2017

svick deleted the remove-nbsps branch May 15, 2017 08:56
