[API Proposal]: New StreamReader Property CurrentBOM #128267

@penalvch

Background and motivation

This proposal adds a new StreamReader property, CurrentBOM, which returns a boolean indicating whether the stream began with a byte order mark (BOM). It is needed because CurrentEncoding reports the encoding in use and is not designed to tell whether a BOM was actually present in a UTF-encoded file.

In addition, approving this proposal would spare the .NET and PowerShell community headaches when working with UTF-encoded files at scale.

Specifically:

  1. In C#, files get read line by line in an attempt to detect a BOM. This is wasteful compared with the proposed property, and slower than ReadToEnd().
  2. In PowerShell running on .NET, users are forced into workarounds (e.g. reading the file twice, or reading it once and converting the bytes in memory for a custom one-off check).
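
For reference, the kind of manual check these workarounds amount to can be sketched in C# as follows. This is only an illustration: `BomProbe` and `HasBom` are hypothetical names, not part of .NET, and the sketch covers only the UTF-8, UTF-16, and UTF-32 marks.

```csharp
using System.IO;

static class BomProbe
{
    // Hypothetical helper (not a .NET API): returns true when the file
    // starts with a UTF-8, UTF-16 (LE/BE), or UTF-32 (LE/BE) byte order mark.
    public static bool HasBom(string path)
    {
        byte[] head = new byte[4];
        int read = 0;
        using (FileStream fs = File.OpenRead(path))
        {
            // Read may return fewer bytes than requested, so loop until
            // four bytes are available or the end of the file is reached.
            int n;
            while (read < head.Length &&
                   (n = fs.Read(head, read, head.Length - read)) > 0)
            {
                read += n;
            }
        }

        bool utf8 = read >= 3 && head[0] == 0xEF && head[1] == 0xBB && head[2] == 0xBF;
        // FF FE is the UTF-16 LE mark and also the first half of the UTF-32 LE mark.
        bool utf16Or32Le = read >= 2 && head[0] == 0xFF && head[1] == 0xFE;
        bool utf16Be = read >= 2 && head[0] == 0xFE && head[1] == 0xFF;
        bool utf32Be = read >= 4 && head[0] == 0x00 && head[1] == 0x00
                                 && head[2] == 0xFE && head[3] == 0xFF;

        return utf8 || utf16Or32Le || utf16Be || utf32Be;
    }
}
```

A check like this means opening the file a second time, or buffering its first bytes separately from the StreamReader, which is the overhead the proposed property is meant to remove.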

The following PowerShell code demonstrates the problem:

# Case 1: file written with a UTF-8 BOM.
$bompath = "$env:TEMP\bom.txt"
$utf8Bom = [System.Text.UTF8Encoding]::new($true)
[System.IO.File]::WriteAllText($bompath, '', $utf8Bom)
$bytes = [System.IO.File]::ReadAllBytes($bompath)
# Manual check of the leading bytes against the UTF-8 BOM (EF BB BF).
if ($bytes.Length -ge 3 -and $bytes[0] -eq 0xEF -and $bytes[1] -eq 0xBB -and $bytes[2] -eq 0xBF) {
    Write-Host 'BOM DETECTED'
} else {
    Write-Host 'NO BOM DETECTED'
}
# The second constructor argument is detectEncodingFromByteOrderMarks.
$sr = New-Object System.IO.StreamReader($bompath, $false)
$encoding = $sr.CurrentEncoding
$encoding.GetPreamble().Length
$sr.Dispose()

# Case 2: file written without a BOM.
$nobompath = "$env:TEMP\bom-no.txt"
$utf8NoBom = [System.Text.UTF8Encoding]::new($false)
[System.IO.File]::WriteAllText($nobompath, '', $utf8NoBom)
$bytes = [System.IO.File]::ReadAllBytes($nobompath)
# Manual check of the leading bytes against the UTF-8 BOM (EF BB BF).
if ($bytes.Length -ge 3 -and $bytes[0] -eq 0xEF -and $bytes[1] -eq 0xBB -and $bytes[2] -eq 0xBF) {
    Write-Host 'BOM DETECTED'
} else {
    Write-Host 'NO BOM DETECTED'
}
# The second constructor argument is detectEncodingFromByteOrderMarks.
$sr = New-Object System.IO.StreamReader($nobompath, $false)
$encoding = $sr.CurrentEncoding
$encoding.GetPreamble().Length
$sr.Dispose()

What is expected:

BOM DETECTED
3
NO BOM DETECTED
0

What you get:

BOM DETECTED
3
NO BOM DETECTED
3

API Proposal

namespace System.IO;

public partial class StreamReader
{
    public bool CurrentBOM { get; }
}

API Usage

using System;
using System.IO;
using System.Text;

class Test
{
    
    public static void Main()
    {
        string path = @"c:\temp\MyTest.txt";

        try
        {
            if (File.Exists(path))
            {
                File.Delete(path);
            }

            //Use UTF-16 encoding
            using (StreamWriter sw = new StreamWriter(path, false, new UnicodeEncoding()))
            {
                sw.WriteLine("My test");
                sw.WriteLine("text.");
            }

            using (StreamReader sr = new StreamReader(path, true))
            {
                while (sr.Peek() >= 0)
                {
                    Console.Write((char)sr.Read());
                }

                //Test for BOM after reading, or at least after the first read.
                Console.WriteLine("BOM present: {0}.", sr.CurrentBOM);
            }
        }
        catch (Exception e)
        {
            Console.WriteLine("The process failed: {0}", e.ToString());
        }
    }
}

Alternative Designs

One could instead amend the CurrentEncoding property so that calling GetPreamble() on it reflects the preamble actually found in the stream (i.e. correct endianness, number of bytes, and exact byte sequence). This would be a breaking change, since callers would start seeing accurate results where they previously saw inaccurate ones, but accuracy and precision in this logic seem a reasonable goal.
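
For comparison, something close to this behavior appears to be reachable today by giving StreamReader a fallback encoding that has no preamble and leaving BOM detection on. The sketch below assumes that detection replaces the fallback with a BOM-bearing encoding only when a mark is found. `PreambleProbe` and `DetectedPreambleLength` are hypothetical names, not part of .NET.

```csharp
using System.IO;
using System.Text;

static class PreambleProbe
{
    // Hypothetical helper (not a .NET API): returns the preamble length of the
    // encoding the reader settles on after looking at the start of the file.
    public static int DetectedPreambleLength(string path)
    {
        // The fallback encoding is UTF-8 without a BOM, so its own preamble is
        // empty. With detectEncodingFromByteOrderMarks set to true, the reader
        // is expected to switch encodings only when it finds a mark.
        using (var sr = new StreamReader(path, new UTF8Encoding(false), true))
        {
            // Detection happens on the first read, so peek before asking.
            sr.Peek();
            return sr.CurrentEncoding.GetPreamble().Length;
        }
    }
}
```

Under that assumption, a non-zero result indicates a BOM was found. It still depends on the caller choosing a particular fallback encoding and reading before checking, which a dedicated property would not require.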

Risks

None identified.

Metadata

Labels: api-suggestion, area-System.IO, untriaged