Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Strings without null-termination #1936

Closed

Conversation

matthijskooijman
Copy link
Collaborator

When working with the Arduino String class, I've found that I couldn't efficiently combine it with some external libraries that explicitely pass char* and length around, without nul-terminating their strings. This prompted me to modify and expose the concat (const char* cstr, unsigned int length) method, add a new String(const char* cstr, unsigned int length) constructor. While I was going over the string class, I found some other minor improvements, which are included here.

@matthijskooijman
Copy link
Collaborator Author

I just added one more commit: "Add String::concat(const uint8_t *, unsigned int) version".

@PaulStoffregen
Copy link
Sponsor Contributor

If a concat(pointer, len) is going to be added, perhaps "const void *" would be better?

@matthijskooijman
Copy link
Collaborator Author

@PaulStoffregen Hmm, that might make things easier. On the other hand, having a void* version might complicate overloading, or make it easier for people to accidentally pass a different kind of pointer without the compiler warning them about that. These might not be important for the concat(void*, len) case, but if we do this here, we should (be prepared to) also apply this in other cases where we want to accept a (string) buffer, where this might be more relevant.

Does this make sense to you, or do you see things differntly?

@matthijskooijman matthijskooijman added Component: Core Related to the code for the standard Arduino API Version: 1.5.x feature request A request to make an enhancement (not a bug fix) labels Sep 10, 2014
@matthijskooijman
Copy link
Collaborator Author

@PaulStoffregen, would be great if you could respond to my last comment here.

@matthijskooijman
Copy link
Collaborator Author

@cmaglie, any chance you get merge this for the next release? I just ran into trouble again by not having a concat(str, len) function...

@matthijskooijman
Copy link
Collaborator Author

I've just rebased this PR, so it is again current. I also realized this changed the AVR WString code, but not SAM, so I updated each commit to apply the same changes to both.

@ffissore
Copy link
Contributor

@ArduinoBot build this please

@matthijskooijman
Copy link
Collaborator Author

I tested this PR against all examples in the repository, which all still work as before (78% even compiles to the same binary). Some sketches become bigger, some sketches became smaller (on average all sketches became 3 bytes bigger. Limiting to only the sketches that actually changed, so the sketches using String in some way, these are 14 bytes bigger on average).

All size increases seem to be caused because String::concat(const char*, unsigned int) now calls memcpy instead of strcopy, making it remember the length parameter, causing the compiler to store data on the stack instead of just in registers, which has a little overhead. All size decrease is because strcpy() can be left out, because String::concat(char) is simplifed and because a bunch of strlen() calls are removed. Nothing out of the ordinary or unexpected stood out.

@Chris--A
Copy link
Contributor

Chris--A commented Jul 9, 2015

The features here would combine well with my PR here: #3096

In the future, rather than relying on the len input (diff), validate() could be called to ensure the String internals reflect its contents as a null-terminated string, rather than what was copied directly in.

This would resolve problems if someone was to copy or concatenate a buffer which has trailing nulls.

char buff[10] = {'d','a','t','a'};
String s;

//User could easily use:
s.concat( buff );

//But now has the option to use:
s.concat( buff, sizeof(buff) );

Whether it be a logical error (buffer isn't as full as expected), or intention (nulls mean something else), validate() will ensure s.len reflects:

  • a null terminated string.
  • what is output with Print members.
  • and when used in other c-string function using s.c_str();, or memory functions requiring a length.

@matthijskooijman
Copy link
Collaborator Author

In the future, rather than relying on the len input (diff), validate() could be called to ensure the String internals reflect its contents as a null-terminated string, rather than what was copied directly in.

One of the goals of this PR is to allow a String to store embedded NUL characters as well (e.g. to store a binary packet in it). IIUC, your suggestion would exactly break that possibility.

Having a validate() function sounds reasonable, but I don't think the String class should be calling it automatically - the user should be in control when he wants to fix the length, I think.

Instead of calling strlen in a dozen places, instead just call
String::concat(s), which will then call strlen. This shrinks the code
size of these calls significantly, the StringAppendOperator example on
the Arduino Uno shrinks by 72 bytes. This change does incur a slight
runtime cost, because there is one extra function call.
Previously, these methods took a
nul-terminated string and appended it to the current buffer. The length
argument (or length of the passed String object) was used to allocate
memory, but was not used when actually copying the string. This meant
that:

 - If the length was not correct, or the string passed in was not
   nul-terminated, the buffer would overflow.
 - If the string passed in contained embedded nul characters (e.g, in
   among the first length characters), any characters after the embedded
   nul would not be copied (but len would be updated to indicate they
   were).

In practice, neither of the above would occur, since the length passed is
always the known length of the string, usually as returned by strlen.
However, to allow using this concat method to pass in strings that are
not nul-terminated, its behaviour should change.

This commit changes the method to use memcpy instead of strcpy, copying
exactly the number of bytes passed in. For the current calls to this
method, which pass a nul-terminated string, without embedded nul
characters and a correct length, this behaviour should not change.

However, this concat method can now also be used in the two cases
mentioned above. Non-nul-terminated strings now work as expected and for
strings with embedded newlines the entire string is copied as-is,
instead of leaving uninitialized memory after the embedded nul byte.

Note that a lot of operations will still only see the string up to the
embedded nul byte, but that's not really fixable unless we reimplement
functions like strcmp.
Before, it used strncpy, but that has undefined behaviour when the
source and destination strings overlap. memove is guaranteed to work
correctly in this case.

Note that memmove simply copies all bytes, whereas strncpy stopped at
the first nul-byte. This means this commit also makes remove work for
strings that contain embedded nul bytes.
Now concat(const char*, unsigned int) no longer requires a
nul-terminated string, we can simplify the concat(char) method to just
pass the address of the single character instead of having to create
buffer with nul-termination.
This method is useful when receiving data from external sources that
pass an explicit length instead of a NUL-terminated string.
This constructor allows converting a buffer containing a
non-nul-terminated string to a String object, by explicitely passing the
length.
Before, substring would (temporarily) add a \0 in the original string at
the end of the substring, so the substring could be copied into a new
String object. However, now that the String::copy() method can work with
non-nul-terminated strings (by passing an explicit length), this trick
is not needed and we can just call the copy method instead.
This just calls the char* version, but allows calling the method with a
uint8_t* as well (which is not uncommon for buffers).
This allows creating a String from a uint8_t[] or uint8_t* as well,
without having to add explicit casts.
@matthijskooijman matthijskooijman force-pushed the stringnonull branch 2 times, most recently from f821d03 to deadaa2 Compare August 4, 2015 08:10
@matthijskooijman
Copy link
Collaborator Author

Rebased this PR again, and added one more commit to add a String(const uint8_t *, unsigned int) constructor.

@Chris--A
Copy link
Contributor

Having a validate() function sounds reasonable, but I don't think the String class should be calling it automatically - the user should be in control when he wants to fix the length, I think.

I agree, it should be an optional.

With regards to this PR, as it doesn't seem to change any existing functionality, and only add features, this can have a benefit (saw the issue you referenced above).

When working with the Arduino String class, I've found that I couldn't efficiently combine it with some external libraries that explicitely pass char* and length around

This seems to be a different problem to the issue. By combine, do you mean something like this:

String s = "Hello";
s += Obj.func();

I'm not sure I understand what difficulties you are having.

@facchinm facchinm added the Print and Stream class The Arduino core library's Print and Stream classes label Jan 24, 2017
@matthijskooijman
Copy link
Collaborator Author

matthijskooijman commented Jul 25, 2017

When working with the Arduino String class, I've found that I couldn't efficiently combine it with some external libraries that explicitely pass char* and length around

For example, consider a library that calls a callback function when data is received, passing in a (non-nul-terminated) char* buffer and a length. If I want to accumulate the data in a String object, now I have to allocate memory for a copy of the buffer with added nul-termination, and then concat that to the existing String. Instead, I would like to write something like:

String accum;
void onReceive(const char*buf, unsigned len) {
     accum.concat(buf, len);
}

Which should just copy the data into the String's buffer, without needing another intermediate copy.

For another (pretty much the same) example, see #6546.

@facchinm
Copy link
Member

@matthijskooijman do you think it still makes sense to port this PR to API repo?

@facchinm facchinm added the Waiting for feedback More information must be provided before we can proceed label Sep 18, 2019
@matthijskooijman
Copy link
Collaborator Author

I think these changes are still useful. They probably need a bit of rebasing and cleaning up. I'm willing to have a look at resubmitting this, if you (or someone else) will review and merge it?

@facchinm
Copy link
Member

I can't assure anything about merging but for sure we'll review them (I think it's a good patch so the merge chances are high).

@matthijskooijman
Copy link
Collaborator Author

I can't assure anything about merging but for sure we'll review them (I think it's a good patch so the merge chances are high).

Yeah, that's what I meant. Thanks! Pullrequest created at arduino/ArduinoCore-API#97, so I'm closing this one.

@matthijskooijman matthijskooijman deleted the stringnonull branch March 28, 2020 16:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Component: Core Related to the code for the standard Arduino API feature request A request to make an enhancement (not a bug fix) Print and Stream class The Arduino core library's Print and Stream classes Waiting for feedback More information must be provided before we can proceed
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants