Byte Order Mark (BOM) handling functions rewrite #3880
| the found $(D BOM) or $(D BOM.invalid). | ||
| */ | ||
| BOM getBOM(Range)(Range input) | ||
| if(isInputRange!Range) |
We need a standardized way of declaring template constraints, because this is the fourth unique variant in style that I have seen.
I'll bet one round of drinks, for everybody at DConf 2016, that that forum discussion is going to become a bikeshedding discussion!
Maybe we should just have Walter or Andrei make a decision, as it only affects Phobos maintainers and the differences between our options are trivial.
What about going with what dfmt does?
You know what, forget I said anything. This is a minor point and I don't want to turn this into a bikeshedding discussion.
dfmt is anti-bikeshedding
|
The logic, however, looks good; much cleaner this time :)
| [0x2B, 0x2F, 0x76, 0x39], | ||
| [0x2B, 0x2F, 0x76, 0x2B], | ||
| [0x2B, 0x2F, 0x76, 0x2F], | ||
| [0x2B, 0x2F, 0x76, 0x38, 0x2D] |
I thought there were only four of these?
| /** Mapping of a byte sequence to $(B Byte Order Mark (BOM)) | ||
| */ | ||
| enum bomEntries = [ | ||
| tuple(BOM.utf32be, to!(ubyte[])([0x00, 0x00, 0xFE, 0xFF])), |
Don't use enum for arrays; this will create a new array every time you use it. Use immutable instead.
good to know
done
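The enum-versus-immutable point above can be illustrated with a minimal sketch. The names `enumTable` and `immutableTable` are hypothetical and are not taken from the PR; the sketch only assumes the usual behaviour that an `enum` array is a manifest constant whose literal is re-evaluated (and allocated) at each use, while an `immutable` array is a single instance.

```d
// Manifest constant: the array literal is pasted in at every use site.
enum int[] enumTable = [1, 2, 3];

// One array stored in static data and shared by every use.
immutable int[] immutableTable = [1, 2, 3];

bool enumAllocatesPerUse()
{
    auto a = enumTable; // allocates a fresh array
    auto b = enumTable; // allocates another one
    return a.ptr !is b.ptr;
}

bool immutableIsShared()
{
    auto a = immutableTable;
    auto b = immutableTable;
    return a.ptr is b.ptr; // both slices refer to the same memory
}

void main()
{
    assert(enumAllocatesPerUse());
    assert(immutableIsShared());
}
```

For a lookup table that is consulted on every call, the `immutable` form avoids a hidden GC allocation per use.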
fd36fad to 7c886fe
|
Is there any good reason for |
| } | ||
| } | ||
|
|
||
| assert(false, "This can never happen, unless startsWith consider an empty" |
startsWith matches the shortest of its needles, and an empty needle always matches.
Yes, as I wrote. Unless startsWith changes, this assert will never execute. If it does, people will know about it.
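The `startsWith` behaviour discussed in this thread can be checked with a small sketch. The byte sequences are the standard UTF-8, UTF-16LE and UTF-32LE marks; the function `firstMatch` and the variable names are hypothetical and not from the PR. With several needles, `std.algorithm.searching.startsWith` returns the 1-based position of the matching needle (0 for no match).

```d
import std.algorithm.searching : startsWith;

// Returns the 1-based position of the needle that matched, or 0.
// Needle order: UTF-32LE, UTF-16LE, UTF-8, and finally an empty needle.
uint firstMatch(ubyte[] data, bool withEmptyNeedle)
{
    ubyte[] utf32le = [0xFF, 0xFE, 0x00, 0x00];
    ubyte[] utf16le = [0xFF, 0xFE];
    ubyte[] utf8 = [0xEF, 0xBB, 0xBF];
    ubyte[] emptyNeedle = [];

    if (withEmptyNeedle)
        return data.startsWith(utf32le, utf16le, utf8, emptyNeedle);
    return data.startsWith(utf32le, utf16le, utf8);
}

void main()
{
    ubyte[] utf8Data = [0xEF, 0xBB, 0xBF, 0x68, 0x69];
    ubyte[] utf32Data = [0xFF, 0xFE, 0x00, 0x00, 0x68, 0x00, 0x00, 0x00];

    // Only the UTF-8 needle matches.
    assert(firstMatch(utf8Data, false) == 3);

    // Both UTF-32LE and UTF-16LE match; the shorter needle is reported.
    assert(firstMatch(utf32Data, false) == 2);

    // An empty needle matches any input and is the shortest possible match.
    assert(firstMatch(utf8Data, true) == 4);
}
```

The second assertion shows why a multi-needle call cannot tell UTF-32LE from UTF-16LE by itself, and the third shows why an empty entry in the needle list would always win.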
|
bomEntries and getBOMTableIndex are public so that you can use them to add a BOM to a file you created.
|
How does |
|
File -> getBOMTableIndex -> new File -> prepend the same BOM sequence by using the index from getBOMTableIndex and bomEntries. UTF-7 could have several start sequences, so using only getBOM does not work.
|
Well, like Steven suggested, Also, the name |
|
@JakobOvrum @schveiguy fixed |
| the found $(D Tuple) of $(D BOM) and the $(D BOM) sequence as a | ||
| $(D ubyte[]). | ||
| */ | ||
| auto getBOM(Range)(Range input) |
Don't use return type inference when the return type is easy to represent and provides documentation value. The return type here is always Tuple!(BOM, immutable ubyte[]).
Named fields à la Tuple!(BOM, "bom", immutable ubyte[], "byteSequence") are probably also desirable.
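As a rough illustration of the suggestion above, here is a sketch of an explicitly typed return value with named tuple fields. The enum `Bom`, the alias `BomSeq` and the function `utf8Entry` are hypothetical stand-ins, and the sketch uses `immutable(ubyte)[]` for the sequence field, which is an assumption rather than the exact type proposed in the review.

```d
import std.typecons : Tuple;

// Hypothetical stand-in for the PR's BOM enum.
enum Bom { none, utf8 }

// Named fields document the meaning of each tuple member.
alias BomSeq = Tuple!(Bom, "bom", immutable(ubyte)[], "byteSequence");

// The explicit return type shows up in the signature and in generated docs,
// unlike `auto`.
BomSeq utf8Entry()
{
    immutable(ubyte)[] sequence = [0xEF, 0xBB, 0xBF];
    return BomSeq(Bom.utf8, sequence);
}

void main()
{
    BomSeq entry = utf8Entry();
    assert(entry.bom == Bom.utf8);          // access by name
    assert(entry.byteSequence.length == 3);
    assert(entry[0] == Bom.utf8);           // positional access still works
}
```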
| /** Mapping of a byte sequence to $(B Byte Order Mark (BOM)) | ||
| */ | ||
| immutable bomTable = [ | ||
| tuple(BOM.noBOM, new ubyte[0]), |
Do we really need this in the table?
IMO yes; otherwise you would have to handle the absence of a BOM differently.
getBOM can still return it
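One way to picture the design being debated is a table whose first entry stands for "no BOM", so that a lookup always returns an entry of the same shape. The following is a minimal sketch under that assumption, not the PR's code: `Bom`, `BomEntry`, `bomTable` and `getBom` are hypothetical names, the table is deliberately incomplete, and longer sequences are listed before their prefixes so that UTF-32LE is not mistaken for UTF-16LE.

```d
import std.algorithm.searching : startsWith;
import std.range.primitives : ElementType, isForwardRange, save;

enum Bom { none, utf32le, utf16le, utf8 }

struct BomEntry
{
    Bom bom;
    immutable(ubyte)[] byteSequence;
}

// Entry 0 is the "no BOM" fallback; the rest are ordered longest-first
// wherever one sequence is a prefix of another.
immutable BomEntry[] bomTable = [
    BomEntry(Bom.none, null),
    BomEntry(Bom.utf32le, [0xFF, 0xFE, 0x00, 0x00]),
    BomEntry(Bom.utf16le, [0xFF, 0xFE]),
    BomEntry(Bom.utf8, [0xEF, 0xBB, 0xBF]),
];

BomEntry getBom(Range)(Range input)
if (isForwardRange!Range && is(immutable ElementType!Range == immutable ubyte))
{
    // Skip the empty "none" entry: an empty needle would match everything.
    foreach (entry; bomTable[1 .. $])
    {
        if (input.save.startsWith(entry.byteSequence))
            return entry;
    }
    return bomTable[0];
}

void main()
{
    ubyte[] utf8Data = [0xEF, 0xBB, 0xBF, 0x68, 0x69];
    ubyte[] utf32Data = [0xFF, 0xFE, 0x00, 0x00, 0x68, 0x00, 0x00, 0x00];
    ubyte[] plainData = [0x68, 0x69];

    assert(getBom(utf8Data).bom == Bom.utf8);
    assert(getBom(utf32Data).bom == Bom.utf32le);

    // Absence of a BOM is reported through the same return type.
    auto none = getBom(plainData);
    assert(none.bom == Bom.none && none.byteSequence.length == 0);
}
```

With this layout the caller never needs a separate code path for "no BOM found", which is the argument made for keeping the entry in the table.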
|
std.file.read returns |
| the found $(D Tuple!(BOM,ubyte[])) of $(D BOM) and the $(D BOM) sequence | ||
| as a $(D ubyte[]). | ||
| */ | ||
| auto getBOM(Range)(Range input) |
Not sure why GitHub marked it as outdated, but my comment about an explicit return type still stands.
|
I just realized something -- nowhere do you specify a fully decoded BOM, e.g. |
|
@schveiguy I do not follow; please elaborate on what you mean by a fully decoded BOM.
|
@burner what I mean is that if you have string data (i.e. a
    enum dchar utfBOM = 0xfeff;
So you can do stuff like:
    // skip any BOM
    if(myString.front == utfBOM) myString.popFront;
without having to remember the hex version.
|
@schveiguy what about encodings whose BOM does not fit into a dchar (UTF-16 LE/BE, UTF-8)?
|
The BOM is used to determine the encoding. But it's possible you already know the encoding. Then the BOM becomes another character. Having a definition of that somewhere is useful. For instance, in my iopipe library, I have an example program "convert" that reads a file, and converts it to another encoding. In this section of code, I have determined the encoding and have the input already set up to process the data. However, on output, I need to ensure there is a BOM in the new file. So I have this code:
    if(!input.window.empty && input.window.front != 0xfeff)
    {
        // write a BOM if not present
        put(oChain, dchar(0xfeff));
    }
It would be nice to just have that constant be a manifest constant somewhere so that I don't have to remember the hex code.
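The use case for a code-point constant can be sketched on already-decoded text as follows. The constant is defined locally here so the example is self-contained, and the helpers `skipBom` and `ensureBom` are hypothetical names, not functions from the PR or from iopipe.

```d
import std.range.primitives : empty, front, popFront;

// The BOM as a code point, independent of any byte-level encoding.
enum dchar utfBOM = 0xFEFF;

// Drops a leading BOM from decoded text, if there is one.
string skipBom(string text)
{
    // `front` on a string decodes one code point.
    if (!text.empty && text.front == utfBOM)
        text.popFront();
    return text;
}

// Returns the text with exactly one leading BOM.
string ensureBom(string text)
{
    if (!text.empty && text.front == utfBOM)
        return text;
    string result;
    result ~= utfBOM; // appending a dchar encodes it as UTF-8
    result ~= text;
    return result;
}

void main()
{
    assert(skipBom("\uFEFFhello") == "hello");
    assert(skipBom("hello") == "hello");
    assert(ensureBom("hi") == "\uFEFFhi");
    assert(ensureBom("\uFEFFhi") == "\uFEFFhi");
}
```

Because the comparison happens after decoding, the same constant works regardless of whether the underlying bytes were UTF-8, UTF-16 or UTF-32.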
|
@DmitryOlshansky the BOM constants will not work, as not all BOM constants are 4 bytes long. The presented use case is exactly what getBOM is for. To append a BOM, I plan to create createBOM(OutputRange) next.
|
@burner You miss the point. The BOM is a code point; please add the constant and stop worrying about encoding. You have already covered the byte-level representations.
|
@DmitryOlshansky added the constant; not sure about the doc, though.
|
Thanks for adding that, LGTM. |
|
LGTM |
| /** Mapping of a byte sequence to $(B Byte Order Mark (BOM)) | ||
| */ | ||
| immutable bomTable = [ | ||
| BOMSeq(BOM.none, new ubyte[0]), |
Doesn't null work here?
done
|
I approve, please pull after nits. Thanks! |
* move to std.encoding
* less overengineering
dlang#3870 rework
Don't use top-level selective import in std.math because of DMD issue 314.
some quickfur comments
whitespace
remove used import
steven suggestion utfBom
andrei nitpicks
andrei null
|
@DmitryOlshansky @schveiguy @JakobOvrum @andralex fixed the nitpicks |
|
Auto-merge toggled on |
|
@schveiguy thanks |
| @@ -3361,3 +3361,175 @@ version(unittest) | |||
| return "0123456789ABCDEF"[n & 0xF]; | |||
| } | |||
| } | |||
|
|
|||
| import std.typecons; | |||
Why here and not at the top like everything else?
|
@JackStouffer we can fix that after if it's really a sticking point. The auto tester is almost done, and it takes 1 hour to run the current test :) |
#3870 rework