
Byte Order Mark (BOM) handling functions #3870

Closed · wants to merge 1 commit

Conversation

@burner (Member) commented Dec 11, 2015

No description provided.

@schveiguy changed the title from "Byte Order Mark (BOM) handeling functions" to "Byte Order Mark (BOM) handling functions" Dec 11, 2015
@burner (Member, Author) commented Dec 11, 2015

@schveiguy thanks

@schveiguy (Member):

It seems wasteful to use a runtime linear search to look up byte order sequences. Is an AA not possible? I'm also not sure whether the byte order mark sequence lookup is useful. The byte order mark is 0xFEFF; any code that is going to output some encoding will be able to translate that into the correct bytes. By far the most useful thing that is difficult to get right is a function which returns the encoding based on the bytes given. I know nothing about the encodings other than the standard five, so I can't comment on those.

Also, I think your little endian bytes are wrong for UTF-32; they should be 0xFF, 0xFE, 0x00, 0x00:

tuple(to!(ubyte[])([0xFF, 0xFE]), BOM.utf16le),
tuple(to!(ubyte[])([0x00, 0x00, 0xFE, 0xFF]), BOM.utf32be),
tuple(to!(ubyte[])([0x00, 0x00, 0xFF, 0xFE]), BOM.utf32le),
tuple(to!(ubyte[])([0x00, 0x00, 0xFF, 0xFE]), BOM.utf32le),
Review comment (Member):

Seems like a duplicate line

Reply (Member Author):

I already fixed that on my machine
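The detection function @schveiguy asks for above (return the encoding based on the bytes given) mostly comes down to matching known prefixes, longest first. A minimal sketch in Python for illustration (the PR itself is D; all names here are invented):

```python
# Hypothetical sketch, not the PR's D code: map leading bytes to an
# encoding name by trying the known BOM sequences, longest first.
BOMS = [
    (b"\x00\x00\xFE\xFF", "utf32be"),
    (b"\xFF\xFE\x00\x00", "utf32le"),
    (b"\xEF\xBB\xBF",     "utf8"),
    (b"\xFE\xFF",         "utf16be"),
    (b"\xFF\xFE",         "utf16le"),
]

def detect_bom(data: bytes) -> str:
    # Longest sequences first, so the UTF-32 marks are not mistaken
    # for UTF-16 ones (FF FE is a prefix of FF FE 00 00).
    for seq, name in sorted(BOMS, key=lambda t: -len(t[0])):
        if data.startswith(seq):
            return name
    return "none"  # no BOM; the caller decides (often: assume UTF-8)
```

Trying the longer sequences first is the part that is easy to get wrong, which is exactly the duplicate-prefix problem discussed in this thread.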

@schveiguy (Member):

I'm leaning towards saying that we should ignore all but the standard UTF encodings (UTF-8, UTF-16, UTF-32).

@schveiguy (Member):

> Is an AA not possible?

Actually, forget AA. If you order the array properly, you can just use array index based on the enum value.
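The enum-as-index idea can be sketched like this (Python for illustration; the enum layout is an assumption, not the PR's actual one):

```python
# Sketch: give each BOM enum member a small integer value and order the
# table accordingly, so sequence lookup by enum is a direct O(1) index
# instead of a linear search or an AA.
from enum import IntEnum

class BOM(IntEnum):
    none    = 0
    utf8    = 1
    utf16be = 2
    utf16le = 3
    utf32be = 4
    utf32le = 5

# Row i holds the byte sequence for the enum member with value i.
SEQUENCES = [
    b"",                   # none
    b"\xEF\xBB\xBF",       # utf8
    b"\xFE\xFF",           # utf16be
    b"\xFF\xFE",           # utf16le
    b"\x00\x00\xFE\xFF",   # utf32be
    b"\xFF\xFE\x00\x00",   # utf32le
]

def bom_sequence(bom: BOM) -> bytes:
    return SEQUENCES[bom]  # direct index, no search
```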

@burner (Member, Author) commented Dec 11, 2015

We cannot ignore the non-standard UTF encodings either. I created this PR because these show up in XML quite often (at least in the test suite). I need this to create std.xml2.

After this I will create a readText version that works with BOMs.

@burner (Member, Author) commented Dec 11, 2015

I thought about sorting as well, but I have two access paths: ubyte[] and enum. Maybe I will create some sort of proxy object. Not sure yet.

Update:
Then I can do a binary search/trie on the ubyte[] and use the enum as an index lookup.
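The trie half of that idea could look like this byte-trie sketch (Python for illustration, names assumed; the longest stored sequence found along the input wins, so the UTF-16LE mark does not shadow UTF-32LE):

```python
# Illustrative trie over BOM byte sequences: nested dicts keyed by byte,
# with the encoding name stored under a None sentinel where a full
# sequence ends. Not the PR's D code.
BOMS = {
    b"\xEF\xBB\xBF":         "utf8",
    b"\xFE\xFF":             "utf16be",
    b"\xFF\xFE":             "utf16le",
    b"\x00\x00\xFE\xFF":     "utf32be",
    b"\xFF\xFE\x00\x00":     "utf32le",
}

def build_trie(table):
    root = {}
    for seq, name in table.items():
        node = root
        for byte in seq:
            node = node.setdefault(byte, {})
        node[None] = name  # full sequence ends here
    return root

def match(trie, data: bytes) -> str:
    node, best = trie, "none"
    for byte in data:
        if None in node:
            best = node[None]       # remember the longest match so far
        node = node.get(byte)
        if node is None:
            break                   # fell off the trie
    else:
        if node is not None and None in node:
            best = node[None]       # input ended exactly on a sequence
    return best
```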

}
}

return BOM.ascii;
Review comment (Member):

Seems like this will never occur. The last entry in the table has a 0-length array, which will always match; perhaps the last table entry shouldn't be there.

I'm also thinking ascii really shouldn't even be here. Really, what you mean to say is that there is no BOM (a BOM is not strictly required).

Reply (Member Author):

I agree, the ascii entry should be none.

Review comment (Member):

Also, using nested switch statements instead of looping over arrays will result in much faster and smaller code.

@schveiguy (Member):

Sorting by ubyte sequence (for binary search/trie), and then setting enum values to enable lookup by index, would probably be the best mechanism.

@schveiguy (Member):

BTW, here is my code that does BOM detection in the new io package: https://github.com/schveiguy/phobos/blob/new-io3/std/io/text.d#L208

@burner (Member, Author) commented Dec 11, 2015

Well, I hope at some point you can rewrite that code to use these functions.

Thanks for the feedback, I will work on this over the weekend.

@schveiguy (Member):

Possibly. But I don't plan on using utf7 decoding, etc, since we don't have a char type for that!

@burner (Member, Author) commented Dec 11, 2015

I'm not planning on using utf7 or utf1 either, but at least this is useful when figuring out why reading a file yields no useful results.

@schveiguy (Member):

Right, the question is: what do I do in the case where some obscure UTF comes up? It's possibly another encoding, or possibly a file that has no BOM (or possibly not even a text file!). I think the right thing for an io package to do is nothing (in other words, treat it as UTF-8 and pass the bytes through).

It's an interesting question and puzzle to solve.

@burner (Member, Author) commented Dec 11, 2015

That is a good question. I would say it depends on your application. The good thing is: if you find the BOM for UTF-(8,16), you can be sure the data is not plain UTF-8 or ASCII text, because the first ubyte values are not valid UTF-8 or ASCII characters.

As said earlier, I think it depends, but at least with this PR you don't have to write the code yourself to check whether a given byte sequence indicates some BOM.

At least for UTF-(8,16,32) I plan to write two functions (names are debatable):

BOM doesFileHaveBom(string filename);
T readTextWithBom(T)(string filename) if(isSomeString!T);
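A rough Python analogue of the planned readTextWithBom, using the BOM constants from the standard codecs module (the function name and the fall-back-to-UTF-8 policy are assumptions, not the PR's final design):

```python
import codecs

# Hypothetical sketch of readTextWithBom (names invented): peek at the
# file's leading bytes, strip a recognized mark, decode the remainder.
_MARKS = [
    # Longest first, so the UTF-32 marks beat their UTF-16 prefixes.
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8,     "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]

def read_text_with_bom(filename: str) -> str:
    with open(filename, "rb") as f:
        raw = f.read()
    for mark, encoding in _MARKS:
        if raw.startswith(mark):
            return raw[len(mark):].decode(encoding)
    return raw.decode("utf-8")  # no BOM: assume UTF-8/ASCII
```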

the length of the $(D BOM) sequence. For UTF-7 the value $(D 4) is
returned.
*/
size_t bomLength(in BOM bom) pure @safe nothrow @nogc
Review comment (Member):

Adding this is excessive trivia since, as its implementation shows, it is a trivial consequence of bomSequence().

Review comment (Member):

A simple table lookup, indexed by BOM, would be fast and simple:

[0,3,2,2,4,4,4,3,4,3,3,4]
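Since each length is just the length of the corresponding sequence, the table can even be derived rather than hand-maintained, which is the "trivial consequence of bomSequence()" point made above. A Python sketch with the five standard marks (the 12-entry table quoted above covers the PR's full enum, including UTF-7 and friends):

```python
# Illustrative only: derive the BOM lengths from the sequences
# themselves, so the two tables can never drift apart.
SEQUENCES = {
    "none":    b"",
    "utf8":    b"\xEF\xBB\xBF",
    "utf16be": b"\xFE\xFF",
    "utf16le": b"\xFF\xFE",
    "utf32be": b"\x00\x00\xFE\xFF",
    "utf32le": b"\xFF\xFE\x00\x00",
}
LENGTHS = {name: len(seq) for name, seq in SEQUENCES.items()}

def bom_length(name: str) -> int:
    return LENGTHS[name]  # constant-time lookup, no search
```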

@WalterBright (Member):

I don't know where this functionality belongs, but I know it does not belong in file.d. It is not a file operation, and importing std.file to get it winds up importing the operating system file API, which has nothing whatsoever to do with the operation.

];

/** Returns the $(D BOM) for a given $(D input) of $(D ubyte)s.
If no $(D BOM) can be matched the $(D input) is assumed to be encoded in ASCII.
Review comment (Member):

Should explicitly state that it does not attempt to infer the encoding used if no BOM is present.

@andralex (Member):

Guys, there's a bit of overengineering going on here. Just expose the BOM markers as enumerated values in std.encoding and let people use them in conjunction with startsWith with the sequences of interest. I recommend this be closed; it's just too much compared to what's needed.

@burner (Member, Author) commented Dec 17, 2015

It is becoming overengineered, no doubt, but startsWith doesn't work because:

> In the case where $(D doesThisStart) starts with multiple of the ranges or
> elements in $(D withOneOfThese), then the shortest one matches ...

I will move it to std.encoding and see if I can make it KISS
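The shortest-match behavior quoted from the startsWith documentation is easy to demonstrate: any scheme that accepts the first or shortest matching needle lets the UTF-16LE mark shadow the UTF-32LE one. A toy Python demonstration (mirroring the quoted semantics by construction, not the Phobos implementation):

```python
# Toy demonstration of the shortest-match pitfall described above:
# scanning the needles in naive order returns the 2-byte UTF-16LE mark
# even though the buffer actually begins with the 4-byte UTF-32LE mark.
data = b"\xFF\xFE\x00\x00rest"                 # UTF-32LE BOM + payload
needles = [b"\xFF\xFE", b"\xFF\xFE\x00\x00"]

first_hit = next(n for n in needles if data.startswith(n))
assert first_hit == b"\xFF\xFE"                # shorter needle wins -> wrong BOM

# Sorting the needles longest-first fixes it:
longest_first = sorted(needles, key=len, reverse=True)
best_hit = next(n for n in longest_first if data.startswith(n))
assert best_hit == b"\xFF\xFE\x00\x00"
```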

burner added a commit to burner/phobos that referenced this pull request Jan 17, 2016
* move to std.encoding
* less overengineering

dlang#3870 rework
burner added a commit to burner/phobos that referenced this pull request May 2, 2016
* move to std.encoding
* less overengineering

dlang#3870 rework

Don't use top-level selective import in std.math because of DMD issue 314.

some quickfur comments

whitespace

remove unused import

steven suggestion

utfBom

andrei nitpicks

andrei null
6 participants