Byte Order Mark (BOM) handling functions #3870
Conversation
@schveiguy thanks
It seems wasteful to use a runtime linear search to look up byte order sequences. Is an AA not possible? I'm also not sure whether the byte order mark sequence lookup is useful. Also, I think your little endian bytes are wrong for UTF32; they should be [0xFF, 0xFE, 0x00, 0x00]:
```d
tuple(to!(ubyte[])([0xFF, 0xFE]), BOM.utf16le),
tuple(to!(ubyte[])([0x00, 0x00, 0xFE, 0xFF]), BOM.utf32be),
tuple(to!(ubyte[])([0x00, 0x00, 0xFF, 0xFE]), BOM.utf32le),
tuple(to!(ubyte[])([0x00, 0x00, 0xFF, 0xFE]), BOM.utf32le),
```
Seems like a duplicate line
I already fixed that on my machine
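For reference, the Unicode-defined sequences are FF FE for UTF-16LE, 00 00 FE FF for UTF-32BE, and FF FE 00 00 for UTF-32LE, so the corrected entries would presumably read:

```d
tuple(to!(ubyte[])([0xFF, 0xFE]), BOM.utf16le),
tuple(to!(ubyte[])([0x00, 0x00, 0xFE, 0xFF]), BOM.utf32be),
tuple(to!(ubyte[])([0xFF, 0xFE, 0x00, 0x00]), BOM.utf32le), // FF FE 00 00, not 00 00 FF FE
```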
I'm leaning towards saying that we should ignore all but the standard UTF encodings (UTF-8, UTF-16, UTF-32).
Copy-paste error.
Actually, forget the AA. If you order the array properly, you can just use an array index based on the enum value.
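A minimal sketch of that idea, using a reduced hypothetical BOM enum (the PR's actual enum has more members) whose values double as indices into the sequence table:

```d
enum BOM { none = 0, utf8, utf16be, utf16le, utf32be, utf32le }

// Ordered so that the enum value doubles as the index into the table.
immutable ubyte[][] bomTable = [
    [],                        // BOM.none:    no BOM present
    [0xEF, 0xBB, 0xBF],        // BOM.utf8
    [0xFE, 0xFF],              // BOM.utf16be
    [0xFF, 0xFE],              // BOM.utf16le
    [0x00, 0x00, 0xFE, 0xFF],  // BOM.utf32be
    [0xFF, 0xFE, 0x00, 0x00],  // BOM.utf32le
];

// Lookup by enum value is then a plain O(1) array index, no search at all.
immutable(ubyte)[] bomSequence(BOM bom) pure @safe nothrow @nogc
{
    return bomTable[bom];
}
```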
We cannot ignore the non-standard UTF encodings. I created this PR because these show up in XML quite often (at least in the test suite). I need this to create std.xml2; after that I will create a readText version that works with BOMs.
I thought about sorting as well, but I have two access paths: ubyte[] and enum. Maybe I will create some sort of proxy object. Not sure yet. Update: …
```d
    }
}

return BOM.ascii;
```
Seems like this will never occur. The last entry in the table has a 0-length array, which should always match. Perhaps the last table entry shouldn't be there.
I'm also thinking ascii really shouldn't even be here. Really, what you mean to say is that there is no BOM (a BOM is not strictly required).
I agree, the ascii should be none.
Also, using nested switch statements instead of looping over arrays will result in much faster and smaller code.
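A rough illustration of what that could look like, again with the reduced hypothetical enum from above (not the PR's actual code):

```d
// Switch on the lead byte, then check the remaining bytes. Longer
// sequences are tested first, since FF FE prefixes the UTF-32LE BOM.
BOM detectBOM(const(ubyte)[] input) pure @safe nothrow @nogc
{
    if (input.length < 2)
        return BOM.none;
    switch (input[0])
    {
    case 0xEF:
        if (input.length >= 3 && input[1] == 0xBB && input[2] == 0xBF)
            return BOM.utf8;
        break;
    case 0xFE:
        if (input[1] == 0xFF)
            return BOM.utf16be;
        break;
    case 0xFF:
        if (input[1] == 0xFE)
            return (input.length >= 4 && input[2] == 0 && input[3] == 0)
                ? BOM.utf32le : BOM.utf16le;
        break;
    case 0x00:
        if (input.length >= 4 && input[1] == 0
                && input[2] == 0xFE && input[3] == 0xFF)
            return BOM.utf32be;
        break;
    default:
        break;
    }
    return BOM.none;
}
```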
Sorted by ubyte sequence (for binary search/trie), and then setting enum values to enable lookup by index, would probably be the best mechanism.
BTW, here is my code that does BOM detection in the new io package: https://github.com/schveiguy/phobos/blob/new-io3/std/io/text.d#L208
Well, I hope at some point you can rewrite that code to use these functions. Thanks for the feedback; I will work on this over the weekend.
Possibly. But I don't plan on using utf7 decoding, etc., since we don't have a char type for that!
I'm not planning on using utf7 or utf1 either, but at least this is useful when figuring out why reading a file yields no useful results.
Right, the question is: what do I do in the case where some obscure UTF comes up? It's possibly another encoding, or possibly a file that has no BOM (or possibly not even a text file!). I think the right thing for an io package to do is nothing (in other words, treat it as utf8 and pass the bytes through). It's an interesting question and puzzle to solve.
That is a good question. I would say it depends on your application. The good thing is that if you find the BOM for UTF-16 or UTF-32 you can be sure the content is not UTF-8 or ASCII, because the first ubyte values are not valid UTF-8 or ASCII characters, as said earlier. I think it depends, but at least with this PR you don't have to write the code that checks whether a given byte sequence indicates some BOM, at least for UTF-8/16/32. I plan to write two functions (names are debatable):

```d
BOM doesFileHaveBom(string filename);
T readTextWithBom(T)(string filename) if (isSomeString!T);
```
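A sketch of how the second function might look (hypothetical, not the PR's implementation; it assumes the detectBOM and bomTable sketches from earlier, a little-endian host, and only transcodes UTF-8 and the little-endian encodings):

```d
import std.conv : to;
import std.file : read;
import std.traits : isSomeString;

T readTextWithBom(T)(string filename) if (isSomeString!T)
{
    auto bytes = cast(immutable(ubyte)[]) read(filename);
    immutable bom = detectBOM(bytes);
    bytes = bytes[bomTable[bom].length .. $]; // strip the BOM, if any

    final switch (bom)
    {
    case BOM.none, BOM.utf8:
        return (cast(string) bytes).to!T;  // no BOM: assume UTF-8/ASCII
    case BOM.utf16le:
        return (cast(wstring) bytes).to!T; // reinterpret, then transcode
    case BOM.utf32le:
        return (cast(dstring) bytes).to!T;
    case BOM.utf16be, BOM.utf32be:
        assert(0, "big-endian transcoding left out of this sketch");
    }
}
```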
```d
    the length of the $(D BOM) sequence. For UTF-7 the value $(D 4) is
    returned.
*/
size_t bomLength(in BOM bom) pure @safe nothrow @nogc
```
Adding this is excessive trivia, since as its implementation shows, it is a trivial consequence of bomSequence().
A simple table lookup, indexed by BOM, would be fast and simple:
[0,3,2,2,4,4,4,3,4,3,3,4]
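As a sketch (assuming the table order matches the PR's twelve-member BOM enum, which this hypothetical snippet does not declare):

```d
// Length of each BOM sequence, indexed by the BOM enum value.
immutable ubyte[12] bomLengths = [0, 3, 2, 2, 4, 4, 4, 3, 4, 3, 3, 4];

size_t bomLength(in BOM bom) pure @safe nothrow @nogc
{
    return bomLengths[bom];
}
```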
I don't know where this functionality belongs, but I know it does not belong in file.d. It is not a file operation, and importing …
```d
];

/** Returns the $(D BOM) for a given $(D input) of $(D ubyte)s.
    If no $(D BOM) can be matched the $(D input) is assumed to be encoded in ASCII.
```
Should explicitly state that it does not attempt to infer the encoding used if no BOM is present.
Guys, there's a bit of overengineering going on here. Just expose the BOM markers as an enumerated value in …
It is becoming overengineered, no doubt, but startsWith doesn't work because …
I will move it to std.encoding and see if I can make it KISS.
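One plausible reason plain startsWith is not enough: some BOMs are prefixes of others (FF FE, the UTF-16LE BOM, also starts the UTF-32LE BOM FF FE 00 00), so the candidates have to be tried longest first. A hedged sketch, reusing the hypothetical bomTable from above:

```d
import std.algorithm.searching : startsWith;

// Candidates ordered longest-first; otherwise a UTF-32LE file would be
// misreported as UTF-16LE because its BOM also starts with FF FE.
BOM detectByPrefix(const(ubyte)[] input)
{
    static immutable order = [BOM.utf32le, BOM.utf32be, BOM.utf8,
                              BOM.utf16le, BOM.utf16be];
    foreach (bom; order)
        if (input.startsWith(bomTable[bom]))
            return bom;
    return BOM.none;
}
```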
Commits:

* move to std.encoding
* less overengineering dlang#3870 rework
* Don't use top-level selective import in std.math because of DMD issue 314.
* some quickfur comments
* whitespace
* remove unused import
* steven suggestion
* utfBom
* andrei nitpicks
* andrei null