Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DDMD] Make char unsigned #2215

Closed
wants to merge 1 commit into from
Closed

Conversation

yebblies
Copy link
Member

C++ has the annoying situation of three char variants - signed char, unsigned char, and char. Obviously char is either signed or unsigned, but it gets its own unique mangling anyway.

For the translation, I'm converting signed char to byte, unsigned char to ubyte, and char to char.

For this mapping to be accurate, the C++ char needs to be unsigned, which is enabled for gcc and dmc with the switches below.

The compiler currently uses char when the sign doesn't matter, and unsigned char when it does (eg for unicode stuff). This is problematic in D as string literals do not convert to const(ubyte)*, only const(char)*.

The plan is to change all strings to const(char)* or char* inside the compiler, and this is the first step.

@ghost
Copy link

ghost commented Jun 19, 2013

Ah yes, it's a classic problem when e.g. wrapping C++. You can't tell programmatically whether a char is used as a byte array or a char array, unless you manually look and figure it out on your own.

Btw, what about VC? vcproj/vcxproj might need an update (who do we ping?).

@yebblies
Copy link
Member Author

Well, the switch is /J for msvc. @rainers ?

@WalterBright
Copy link
Member

The plan is to change all strings to const(char)* or char* inside the compiler, and this is the first step.

I really do not want to rely on 'char' in C++ code to be unsigned, switch or no switch. This is an execrable practice, and is unreliable (think linking code modules compiled with different switch settings, such as third party libraries). Those switches only exist to enable bad code to be compiled.

This is problematic in D as string literals do not convert to const(ubyte), only const(char).

Why not just use an explicit cast in the generated D code for now?

@yebblies
Copy link
Member Author

I really do not want to rely on 'char' in C++ code to be unsigned, switch or no switch. This is an execrable practice, and is unreliable (think linking code modules compiled with different switch settings, such as third party libraries). Those switches only exist to enable bad code to be compiled.

While a sensible idea in general I don't think this applies to dmd at all. Are there any actual potential problems you are aware of?

Why not just use an explicit cast in the generated D code for now?

Explicit cast to what? String literals are assigned to char *, const char * and unsigned char * and possibly more, as well as the array versions. An unconditional cast on a string literal will prevent implicit conversions to any other type (except adding const). Knowing which type is needed would require semantic analysis.

The only viable alternative I can think of is adding a typedef unsigned char d_char to the C++ source, and apart from 'avoiding -J' I don't see a single reason why this would be better, it certainly isn't cleaner.

What is wrong with -J? It sure looks like nothing from here.

@WalterBright
Copy link
Member

You said you intended to replace all the unsigned char* in dmd with char *. This means that code that formerly self-documented that it relied on unsigned behavior is now relying on a compiler switch. Bit rot will set in.

There is nothing clean about using -J.

The explicit cast would be:

cast(ubyte*)"string literal"

@yebblies
Copy link
Member Author

Bit rot will set in.

No it will not. Any incorrect changes will break the D version.

There is nothing clean about using -J.

It is clean in that it matches what D does. The code will look and behave similar to what happens in D. Most (all) developers working on the compiler are D programmers, so there should be nothing surprising about char being unsigned. It also helps that you never want char to be signed.

I initially tried converting unsigned char to char without using -J, and the only part of the compiler that broke was the unicode BOM. I have not found any code that cares at all, although there could be some that is not covered by the test suite.

The explicit cast would be:
cast(ubyte*)"string literal"

As I said yesterday, this then does not work if you want to assign the string literal to a const char *. Knowing the type of the LHS (or parameter type) is much much harder that just using the correct type in the C++ code.

This change makes the C++ code behave more like D, which is very very helpful in porting to D.

@yebblies
Copy link
Member Author

cast(ubyte*)"string literal"

Note this also violates the type system.

@yebblies
Copy link
Member Author

yebblies commented Jul 6, 2013

To summarize the points raised so far again using -J: (paraphrased)

It is generally seen as bad practice

This is not important as dmd is one project that we have complete control over. Other things seen as bad practice in C++ are using goto and never freeing memory, both of which can be used effectively as I believe -J can here.

Linking with different libraries with different char signedness settings may cause problems.

We do not do that with dmd, so this does not apply.

unsigned char is self-documenting

char in D is always unsigned, and as most people working on the D compiler are D programmers, and we plan to migrate the compiler to D, I think matching D's semantics removes the need to document char's sign, as it is what D programmers (and a lot of other people) expect.

The conversion problems can be handled another way ...

None of the solutions presented here actually work.


I have a big bunch of string related const-correctness fixes for dmd, and they are blocked by this.

@WalterBright
Copy link
Member

I believe it is a bad practice to write C++ code and require that the C++ compiler behave like a D compiler. This is just doomed to result in problems with any maintenance done to the C++ code, as I outlined before. I don't believe it is a sustainable practice to rely on globally informing anyone working on it "BTW, all char's are ASSUMED to be unsigned."

I (and I doubt anyone else) would care for the displeasure of porting dmd to a new C++ compiler that defaults to signed, and then having to track down all the resulting subtle bugs in a codebase that they do not thoroughly understand.

As for the cast(ubyte*), I don't know exactly what you're doing so cannot give the correct cast you'll need, but a cast of some sort should do the job.

I have a big bunch of string related const-correctness fixes for dmd, and they are blocked by this.

Const-correctness should be an orthogonal issue, though I understand you being blocked by this. dmd could certainly use being refactored for const-correctness. I've done a little bit of that, but obviously more is better.

@yebblies
Copy link
Member Author

yebblies commented Jul 8, 2013

I believe it is a bad practice to write C++ code and require that the C++ compiler behave like a D compiler. This is just doomed to result in problems with any maintenance done to the C++ code, as I outlined before. I don't believe it is a sustainable practice to rely on globally informing anyone working on it "BTW, all char's are ASSUMED to be unsigned."

I was under the impression we are planning to abandon the C++ compiler once the D port is ready, meaning this is a fairly controllable short-term problem.

In my experience, most programmers, especially D programmers, assume that C++ char is unsigned. This is part of the reason I do not expect this to cause confusion.

I (and I doubt anyone else) would care for the displeasure of porting dmd to a new C++ compiler that defaults to signed, and then having to track down all the resulting subtle bugs in a codebase that they do not thoroughly understand.

Again, is it likely that this will ever happen? If we are abandoning the C++ compiler, long-term porting prospects are not really our concern.

I would like to add an assertion somewhere along with a comment, that should be sufficient to explain what the problem is to somebody porting the compiler. Do you know a good place to put this? I would put it in main but that is not shared by gdc and ldc.

As for the cast(ubyte*), I don't know exactly what you're doing so cannot give the correct cast you'll need, but a cast of some sort should do the job.

I need to convert the following two bits of code:

x = "abc";
and
y = "abc";
where x is const char and y is const unsigned char.

const char always translates to D's const(char), so that is easy. unsigned char either means D's char or ubyte, depending on the context. By making char unsigned, unsigned char will always translate to ubyte.

I do not see how a cast can be generated here, as without semantic analysis you cannot know what the type of the LHS is.

@WalterBright
Copy link
Member

In DMD, wherever unsigned char is used, an unsigned byte is meant. Why not translate it that way?

@yebblies
Copy link
Member Author

yebblies commented Jul 8, 2013

In DMD, wherever unsigned char is used, an unsigned byte is meant. Why not translate it that way?

Because the unsigned char * variables have string literals assigned to them, and they are converted to and from char *.

That would mean adding a cast at each place the conversion is done in the c++ source.

D has distinct types for 'unicode char' and 'unsigned byte', C++ does not. Somehow we need to indicate which is which in the C++ source, and I think char vs unsigned char is a good fit.

@andralex
Copy link
Member

I think we should clarify that we must all are on board with the Singularity - at one point we'll step-migrate the code base of dmd to D and then work exclusively on it. There are real advantages to bootstrapping:

  • Work on the compiler will be significantly faster and of significantly better quality (if we don't really believe that we may as well give up!)
  • The compiler being in D is an invaluable resource for early detection of bugs and regressions, ideas for optimizations and new features, and general "virtuous circle" kind of effects.
  • There's an important PR aspect, e.g. translating the compiler to D would propel D as a popular language used on github.

The first two points are not projections or speculations - they are derived from discussions with other language creators.

As the Singularity gets near, I entirely expect we'll have some odd commits happening that we wouldn't normally go for if we wanted to continue with the code base in C++. I consider that entirely normal.

@WalterBright
Copy link
Member

I've done translations of several large projects from one language to another. Yes, that often means putting in some dodgy crud to get things to work, with the intent to refactor it out later. But that dodgy crud goes in the translation, not the original.

The problem is if the original gets converted into unmaintainable junk in the process, then when the old version works and the new doesn't, it's much harder to figure out where the translation went wrong. Even worse is when the original original works, the dodgy original fails, and the translation fails.

A process I've found to work:

  1. convert original to dodgy translation
  2. get translation to work identically with original
  3. refactor translation to remove dodgy crud

When (2) is not working identically, you put instrumentation in the original and the translation to track down where they diverge.

dmd is a large program. The dependencies on chars being unsigned are clearly documented as they are typed as "unsigned char". Removing this would REMOVE the indications of the dependency. This will be a trainwreck sooner or later.

Put the casts in the translated D code.

@andralex
Copy link
Member

Two questions @yebblies:

  1. How far are we from the Singularity?
  2. Can you insert the casts in the translation?

Presumably if 2 is not possible but 1 is near we can rely on this as an exception to the general practice.

@yebblies
Copy link
Member Author

@WalterBright

We are doing an automated translation of an actively developed project. This proposed change is related to the automation more than the port itself.

For this to work, we need to be able to automatically convert the C++ source to D. This is not possible with the current source, so I am introducing minor refactorings to make the code more palatable for the converter.

There are three possible places to make changes during the conversion:

  • Before conversion, in the C++ source (difficulty, must be valid both in C++ and to the generator/D)
  • During conversion (must be possible to automatically apply with reasonable effort, ie no semantic analysis)
  • After conversion, in the D source (AWFUL, must be manually merged every time the original changes)

After is NOT an option. This is like fixing a bug by getting users to patch the exe after building.

During is difficult, because it requires semantic analysis to determine the result type of the conversion.

That leaves before.

dodgy crud
unmaintainable junk
trainwreck

You are severely overstating the problems of changing char to default to unsigned and replacing unsigned char with char. Please focus on the problems this exact change brings to dmd.

@andralex

Two questions @yebblies:
How far are we from the Singularity?

The C++ source still needs the following large fixes before the D compiler can work:

The compiler needs support for:

These don't block the compiler from working, but it currently uses a slightly modified version of D that allows:

  • variable shadowing
  • array -> pointer implicit conversion
  • implicit int -> narrower int
  • implicit int -> bool

So these will need to be fixed in the C++ source.

This is a lot of work, but we are getting there. I've just started running the generated compiler on the test suite.

Can you insert the casts in the translation?

Not as far as I can see.

Presumably if 2 is not possible but 1 is near we can rely on this as an exception to the general practice.

I don't think this is the case, but I don't think it matters.

This change IS NOT the source-destroying pile of garbage Walter is trying to paint it as.

This is a mild decrease in readability, with a very low chance of introducing bugs. For this to create bugs someone would have to write code with the assumption that char is signed, and not test the code. My experience is that bugs from assuming char is unsigned are much more common.

@WalterBright
Copy link
Member

This is a mild decrease in readability, with a very low chance of introducing bugs.

The bugs will be subtle and hard to find. I've been coding in C for decades - these kinds of bugs really suck. The attempt to change the source to enable them just sets my teeth on edge.

For this to create bugs someone would have to write code with the assumption that char is signed, and not test the code.

No, the bugs are in assuming the char is either signed or unsigned. 'char' in C++ is NEITHER, and ANY dependence on it being signed or unsigned is an INVISIBLE bug.

My experience is that bugs from assuming char is unsigned are much more common.

My experience with these kinds of bugs is they silently lurk in your code for long periods of time and then pop up and wreck things.

I do not understand how your translator works, but I just do not understand why the translator cannot simply insert a cast into the generated D code. This would not be manual work.

Barring that, I suggest doing this change as a branch, not the mainline. Being a branch would pretty much free you to manipulate it as you see fit.

@yebblies
Copy link
Member Author

The bugs will be subtle and hard to find. I've been coding in C for decades - these kinds of bugs really suck. The attempt to change the source to enable them just sets my teeth on edge.

I get that. I'm only asking for this as part of the process of moving to D.

@yebblies
Copy link
Member Author

No, the bugs are in assuming the char is either signed or unsigned. 'char' in C++ is NEITHER, and ANY dependence on it being signed or unsigned is an INVISIBLE bug.

This is wrong. With -J or -funsigned-char, char is ALWAYS UNSIGNED. We are explicitly setting it to unsigned. Because we can guarantee it is unsigned, there is no bug in depending on it being unsigned.

This can easily be enforced with the following code in main:
assert((int)(char)-1 == 0xFF); // char MUST be unsigned

How is it possible to introduce bugs? I'm really not seeing it.

Barring that, I suggest doing this change as a branch, not the mainline. Being a branch would pretty much free you to manipulate it as you see fit.

This is what I currently have. Now it is time to merge it into the mainline.

@9rnsr
Copy link
Contributor

9rnsr commented Jul 19, 2013

It might be a silly question but: how about using #define char unsigned char?

@WalterBright
Copy link
Member

I also wouldn't mind if you used a typedef on the C++ side to inform the translator which of ubyte or char it should be translated to.

With -J or -funsigned-char, char is ALWAYS UNSIGNED.

As I mentioned before, those switches are for code written before 1989. I doubt they've been used much since (as programmers got the message to not depend on signed-ness of char), and I doubt they are thoroughly tested as they aren't used. They are an inherently bad idea.

What will really suck about this idea is once all the unsigned chars are rewritten as chars, there's no indication whatsoever in the source code of where dependencies on them being unsigned will be. We have enough problems with bugs in dmd, we do not need to introduce bug-prone and unmaintainable changes in such a large and complex code base.

That is what I currently have.

I meant a branch here, not on your fork.

@yebblies
Copy link
Member Author

It might be a silly question but: how about using #define char unsigned char?

This is pretty much what -J does.

I also wouldn't mind if you used a typedef on the C++ side to inform the translator which of ubyte or char it should be translated to.

So change every use of unsigned char that should translate to D's char to be utf32_t or something? That would work.

Do you have a preference on a name for the typedef?

@yebblies
Copy link
Member Author

And of course by utf32_t I mean utf8_t.

@WalterBright
Copy link
Member

Yes, utf8_t would be fine.

@ibuclaw
Copy link
Member

ibuclaw commented Jul 22, 2013

+1 changing code rather than adding switches.

@WalterBright
Copy link
Member

I think it's fantastic that we were able to find a solution that works for all of us. Can we close this PR now?

@yebblies
Copy link
Member Author

Yup.

@yebblies yebblies closed this Jul 23, 2013
@andralex
Copy link
Member

Congratulations for everybody involved for patiently converging on an intelligent solution!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
5 participants