[DDMD] Make char unsigned #2215
Ah yes, it's a classic problem when e.g. wrapping C++. You can't tell programmatically whether a char is used as a byte array or a char array, unless you manually look and figure it out on your own. Btw, what about VC? vcproj/vcxproj might need an update (who do we ping?). |
Well, the switch is |
I really do not want to rely on 'char' in C++ code to be unsigned, switch or no switch. This is an execrable practice, and is unreliable (think linking code modules compiled with different switch settings, such as third party libraries). Those switches only exist to enable bad code to be compiled.
Why not just use an explicit cast in the generated D code for now? |
While a sensible idea in general I don't think this applies to dmd at all. Are there any actual potential problems you are aware of?
Explicit cast to what? String literals are assigned to The only viable alternative I can think of is adding a What is wrong with |
You said you intended to replace all the unsigned char* in dmd with char *. This means that code that formerly self-documented that it relied on unsigned behavior is now relying on a compiler switch. Bit rot will set in. There is nothing clean about using -J. The explicit cast would be:
|
No it will not. Any incorrect changes will break the D version.
It is clean in that it matches what D does. The code will look and behave similar to what happens in D. Most (all) developers working on the compiler are D programmers, so there should be nothing surprising about char being unsigned. It also helps that you never want char to be signed. I initially tried converting
As I said yesterday, this then does not work if you want to assign the string literal to a This change makes the C++ code behave more like D, which is very very helpful in porting to D. |
Note this also violates the type system. |
To summarize the points raised so far again using
This is not important as dmd is one project that we have complete control over. Other things seen as bad practice in C++ are using goto and never freeing memory, both of which can be used effectively as I believe
We do not do that with dmd, so this does not apply.
None of the solutions presented here actually work. I have a big bunch of string related const-correctness fixes for dmd, and they are blocked by this. |
I believe it is a bad practice to write C++ code and require that the C++ compiler behave like a D compiler. This is just doomed to result in problems with any maintenance done to the C++ code, as I outlined before. I don't believe it is a sustainable practice to rely on globally informing anyone working on it "BTW, all chars are ASSUMED to be unsigned." I would not care (and I doubt anyone else would) for the displeasure of porting dmd to a new C++ compiler that defaults to signed, and then having to track down all the resulting subtle bugs in a codebase that they do not thoroughly understand. As for the cast(ubyte*), I don't know exactly what you're doing so cannot give the correct cast you'll need, but a cast of some sort should do the job.
Const-correctness should be an orthogonal issue, though I understand you being blocked by this. dmd could certainly use being refactored for const-correctness. I've done a little bit of that, but obviously more is better. |
I was under the impression we are planning to abandon the C++ compiler once the D port is ready, meaning this is a fairly controllable short-term problem. In my experience, most programmers, especially D programmers, assume that C++ char is unsigned. This is part of the reason I do not expect this to cause confusion.
Again, is it likely that this will ever happen? If we are abandoning the C++ compiler, long-term porting prospects are not really our concern. I would like to add an assertion somewhere along with a comment, that should be sufficient to explain what the problem is to somebody porting the compiler. Do you know a good place to put this? I would put it in main but that is not shared by gdc and ldc.
I need to convert the following two bits of code:
I do not see how a cast can be generated here, as without semantic analysis you cannot know what the type of the LHS is. |
In DMD, wherever unsigned char is used, an unsigned byte is meant. Why not translate it that way? |
Because the That would mean adding a cast at each place the conversion is done in the c++ source. D has distinct types for 'unicode char' and 'unsigned byte', C++ does not. Somehow we need to indicate which is which in the C++ source, and I think |
I think we should clarify that we must all be on board with the Singularity - at some point we'll step-migrate the code base of dmd to D and then work exclusively on it. There are real advantages to bootstrapping:
The first two points are not projections or speculations - they are derived from discussions with other language creators. As the Singularity gets near, I entirely expect we'll have some odd commits happening that we wouldn't normally go for if we wanted to continue with the code base in C++. I consider that entirely normal. |
I've done translations of several large projects from one language to another. Yes, that often means putting in some dodgy crud to get things to work, with the intent to refactor it out later. But that dodgy crud goes in the translation, not the original. The problem is if the original gets converted into unmaintainable junk in the process, then when the old version works and the new doesn't, it's much harder to figure out where the translation went wrong. Even worse is when the original original works, the dodgy original fails, and the translation fails. A process I've found to work:
When (2) is not working identically, you put instrumentation in the original and the translation to track down where they diverge. dmd is a large program. The dependencies on chars being unsigned are clearly documented as they are typed as "unsigned char". Removing this would REMOVE the indications of the dependency. This will be a trainwreck sooner or later. Put the casts in the translated D code. |
Two questions @yebblies:
Presumably if 2 is not possible but 1 is near we can rely on this as an exception to the general practice. |
We are doing an automated translation of an actively developed project. This proposed change is related to the automation more than the port itself. For this to work, we need to be able to automatically convert the C++ source to D. This is not possible with the current source, so I am introducing minor refactorings to make the code more palatable for the converter. There are three possible places to make changes during the conversion:
After is NOT an option. This is like fixing a bug by getting users to patch the exe after building. During is difficult, because it requires semantic analysis to determine the result type of the conversion. That leaves before.
You are severely overstating the problems of changing
The C++ source still needs the following large fixes before the D compiler can work:
The compiler needs support for:
These don't block the compiler from working, but it currently uses a slightly modified version of D that allows:
So these will need to be fixed in the C++ source. This is a lot of work, but we are getting there. I've just started running the generated compiler on the test suite.
Not as far as I can see.
I don't think this is the case, but I don't think it matters. This change IS NOT the source-destroying pile of garbage Walter is trying to paint it as. This is a mild decrease in readability, with a very low chance of introducing bugs. For this to create bugs someone would have to write code with the assumption that |
The bugs will be subtle and hard to find. I've been coding in C for decades - these kinds of bugs really suck. The attempt to change the source to enable them just sets my teeth on edge.
No, the bugs are in assuming the char is either signed or unsigned. 'char' in C++ is NEITHER, and ANY dependence on it being signed or unsigned is an INVISIBLE bug.
My experience with these kinds of bugs is they silently lurk in your code for long periods of time and then pop up and wreck things. I do not understand how your translator works, but I just do not understand why the translator cannot simply insert a cast into the generated D code. This would not be manual work. Barring that, I suggest doing this change as a branch, not the mainline. Being a branch would pretty much free you to manipulate it as you see fit. |
I get that. I'm only asking for this as part of the process of moving to D. |
This is wrong. With This can easily be enforced with the following code in main: How is it possible to introduce bugs? I'm really not seeing it.
This is what I currently have. Now it is time to merge it into the mainline. |
It might be a silly question but: how about using |
I also wouldn't mind if you used a typedef on the C++ side to inform the translator which of ubyte or char it should be translated to.
As I mentioned before, those switches are for code written before 1989. I doubt they've been used much since (as programmers got the message to not depend on signed-ness of char), and I doubt they are thoroughly tested as they aren't used. They are an inherently bad idea. What will really suck about this idea is once all the unsigned chars are rewritten as chars, there's no indication whatsoever in the source code of where dependencies on them being unsigned will be. We have enough problems with bugs in dmd, we do not need to introduce bug-prone and unmaintainable changes in such a large and complex code base.
I meant a branch here, not on your fork. |
This is pretty much what
So change every use of Do you have a preference on a name for the typedef? |
And of course by |
Yes, utf8_t would be fine. |
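A minimal sketch of how the typedef approach could look on the C++ side, assuming `utf8_t` is defined as `unsigned char` (the definition is an assumption here; the thread only settles on the name). The function `utf8SequenceLength` is a hypothetical example of code that needs unsigned behaviour. The typedef keeps that dependency visible in the source, and the translator can map the name to D's `char`.

```cpp
// Assumed definition: marks bytes that hold UTF-8 code units.
typedef unsigned char utf8_t;

// Hypothetical helper, not dmd code: returns the length in bytes of the
// UTF-8 sequence introduced by `lead`, or 0 if `lead` cannot start one.
int utf8SequenceLength(utf8_t lead)
{
    if (lead < 0x80)           return 1; // 0xxxxxxx: ASCII
    if ((lead & 0xE0) == 0xC0) return 2; // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3; // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4; // 11110xxx
    return 0;                            // continuation or invalid byte
}
```

Because the parameter type is unsigned by definition, the comparisons do not depend on any compiler switch.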
+1 changing code rather than adding switches. |
I think it's fantastic that we were able to find a solution that works for all of us. Can we close this PR now? |
Yup. |
Congratulations to everybody involved on patiently converging on an intelligent solution! |
C++ has the annoying situation of three char variants - `signed char`, `unsigned char`, and `char`. Obviously `char` is either signed or unsigned, but it gets its own unique mangling anyway.
For the translation, I'm converting `signed char` to `byte`, `unsigned char` to `ubyte`, and `char` to `char`.
For this mapping to be accurate, the C++ `char` needs to be unsigned, which is enabled for gcc and dmc with the switches below.
The compiler currently uses `char` when the sign doesn't matter, and `unsigned char` when it does (eg for unicode stuff). This is problematic in D as string literals do not convert to `const(ubyte)*`, only `const(char)*`.
The plan is to change all strings to `const(char)*` or `char*` inside the compiler, and this is the first step.
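The three-way distinction described in this PR can be seen from C++ itself. The sketch below assumes a hypothetical overload set named `dTypeFor`, which is not part of dmd or the translator. It shows that overload resolution treats the three character types as different and returns the D type name each is mapped to in the description above.

```cpp
#include <string>
#include <type_traits>

// char is a distinct type from both signed char and unsigned char, even
// though it shares its representation with one of them.
static_assert(!std::is_same<char, signed char>::value, "char is its own type");
static_assert(!std::is_same<char, unsigned char>::value, "char is its own type");

// Hypothetical helper, not dmd code: one overload per C++ character type,
// returning the D type named in the mapping above.
std::string dTypeFor(signed char)   { return "byte"; }
std::string dTypeFor(unsigned char) { return "ubyte"; }
std::string dTypeFor(char)          { return "char"; }
```

For example, `dTypeFor('a')` selects the `char` overload, since a character literal has type `char`.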