Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Byte Order Mark (BOM) handling functions rewrite #3880

Merged
merged 1 commit into from
May 2, 2016

Conversation

burner
Copy link
Member

@burner burner commented Dec 18, 2015

  • move to std.encoding
  • less overengineering

#3870 rework

the found $(D BOM) or $(D BOM.invalid).
*/
BOM getBOM(Range)(Range input)
if(isInputRange!Range)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need a standardized way of declaring template constraints, because this is the fourth unique variant in style that I have seen.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll bet one round of drinks, for everybody at DConf 2016, that that forum discussion is going to become a bikeshedding discussion!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should just have Walter or Andrei make a decision, as it only effects Phobos maintainers and the differences in our options are trivial.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about going with what dfmt does?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You know what, forget I said anything. This is a minor point and I don't want this to turn this into a bikesheading discussion.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dfmt is anti-bikeshedding

@JackStouffer
Copy link
Member

The logic however looks good, much cleaner this time :)

[0x2B, 0x2F, 0x76, 0x39],
[0x2B, 0x2F, 0x76, 0x2B],
[0x2B, 0x2F, 0x76, 0x2F],
[0x2B, 0x2F, 0x76, 0x38, 0x2D]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought there was only 4 of these?

/** Mapping of a byte sequence to $(B Byte Order Mark (BOM))
*/
enum bomEntries = [
tuple(BOM.utf32be, to!(ubyte[])([0x00, 0x00, 0xFE, 0xFF])),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't use enum for arrays, this will create a new array every time you use it. Instead, do immutable.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good to know

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@JakobOvrum
Copy link
Member

Is there any good reason for bomEntries and getBOMTableIndex to be public?

}
}

assert(false, "This can never happen, unless startsWith consider an empty"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

startsWith matches the shortest of its needles, and an empty needle always matches.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, as I wrote. Unless startsWith changes this assert will never execute. If it does people will know about it.

@burner
Copy link
Member Author

burner commented Dec 20, 2015

bomEntries and getBOMTableIndex are public so you use them to add a BOM to a file you created

@JakobOvrum
Copy link
Member

How does getBOMTableIndex help with that?

@burner
Copy link
Member Author

burner commented Dec 20, 2015

File -> getBOMTableIndex -> new File -> prepend same BOM sequence by using the index of getBOMTableIndex and bomEntries.

utf7 could have several start sequences, only using getBOM does not work

@JakobOvrum
Copy link
Member

Well, like Steven suggested, getBOM should return the tuple.

Also, the name bomEntries doesn't make a lot of sense. Entries into what, exactly? Going by getBOMTableIndex it seems like bomTable.

@burner
Copy link
Member Author

burner commented Dec 21, 2015

@JakobOvrum @schveiguy fixed

the found $(D Tuple) of $(D BOM) and the $(D BOM) sequence as a
$(D ubyte[]).
*/
auto getBOM(Range)(Range input)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't use return type inference when the return type is easy to represent and provides documentation value. The return type here is always Tuple!(BOM, immutable ubyte[]).

Named fields ala Tuple!(BOM, "bom", immutable ubyte[], "byteSequence") is probably also desirable.

/** Mapping of a byte sequence to $(B Byte Order Mark (BOM))
*/
immutable bomTable = [
tuple(BOM.noBOM, new ubyte[0]),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really need this in the table?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO yes, otherwise you would have to handle the absence of a BOM differently

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

getBOM can still return it

@JakobOvrum
Copy link
Member

std.file.read returns void[]. Would it be an idea to accept const(void)[] inputs?

the found $(D Tuple!(BOM,ubyte[])) of $(D BOM) and the $(D BOM) sequence
as a $(D ubyte[]).
*/
auto getBOM(Range)(Range input)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure why Github marked it as outdated, but my comment about explicit return type still stands

@schveiguy
Copy link
Member

I just realized something -- nowhere do you specify a fully decoded BOM, e.g. dchar(0xFEFF), this would probably be good to include somehow.

@burner
Copy link
Member Author

burner commented Mar 11, 2016

@schveiguy I do not follow, please eleborate what you mean by fully decoded BOM

@schveiguy
Copy link
Member

@burner what I mean is that if you have string data (i.e. a dchar, or a utf dchar range), it is nice to have it defined, something like:

enum dchar utfBOM = 0xfeff;

So you can do stuff like:

// skip any BOM
if(myString.front == utfBOM) myString.popFront;

without having to remember the hex version.

@burner
Copy link
Member Author

burner commented Mar 16, 2016

@schveiguy what about encodings whos bom does not fit into a dchar utf16(le,be) utf8?
I mean what you are describing is basically the use case of getBOM. It feels weird IMO that you have a string of know type and then ask for its BOM to find out its type? Maybe I'm not understand what you are trying to say.

@schveiguy
Copy link
Member

The BOM is used to determine the encoding. But it's possible you already know the encoding. Then the BOM becomes another character. Having a definition of that somewhere is useful.

For instance, in my iopipe library, I have an example program "convert" that reads a file, and converts it to another encoding. In this section of code, I have determined the encoding and have the input already set up to process the data. However, on output, I need to ensure there is a BOM in the new file. So I have this code:

    if(!input.window.empty && input.window.front != 0xfeff)
    {
        // write a BOM if not present
        put(oChain, dchar(0xfeff));
    }

It would be nice to just have that constant be a manifest constant somewhere so that I don't have to remember the hex code.

@DmitryOlshansky
Copy link
Member

@burner address the last nit - add BOM constant, after all BOM has a constant codepoint assigned to it. Then it should be ready for final nod of approval from @andralex

@burner
Copy link
Member Author

burner commented Apr 6, 2016

@DmitryOlshansky the BOM constants will not work as not all BOM constants are 4 byte long. The presented use case is exactly what getBOM is for. To append a BOM I plan to create to createBOM(OutputRange) next.

@DmitryOlshansky
Copy link
Member

@burner You miss the point. BOM is a codepoint, please add the constant stop worrying about encoding, you already covered byte-level representations.

@burner
Copy link
Member Author

burner commented Apr 6, 2016

@DmitryOlshansky added the constant not sure about the doc though

@schveiguy
Copy link
Member

Thanks for adding that, LGTM.

@burner
Copy link
Member Author

burner commented Apr 26, 2016

@DmitryOlshansky
Copy link
Member

LGTM

/** Mapping of a byte sequence to $(B Byte Order Mark (BOM))
*/
immutable bomTable = [
BOMSeq(BOM.none, new ubyte[0]),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't null work here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@andralex
Copy link
Member

I approve, please pull after nits. Thanks!

* move to std.encoding
* less overengineering

dlang#3870 rework

Don't use top-level selective import in std.math because of DMD issue 314.

some quickfur comments

whitespace

remove used import

steven suggestion

utfBom

andrei nitpicks

andrei null
@burner
Copy link
Member Author

burner commented May 2, 2016

@DmitryOlshansky @schveiguy @JakobOvrum @andralex fixed the nitpicks

@schveiguy
Copy link
Member

Auto-merge toggled on

@burner
Copy link
Member Author

burner commented May 2, 2016

@schveiguy thanks

@@ -3361,3 +3361,175 @@ version(unittest)
return "0123456789ABCDEF"[n & 0xF];
}
}

import std.typecons;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why here and not at the top like everything else?

@schveiguy
Copy link
Member

@JackStouffer we can fix that after if it's really a sticking point. The auto tester is almost done, and it takes 1 hour to run the current test :)

@schveiguy schveiguy merged commit 30587d4 into dlang:master May 2, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants