Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Strings without null-termination #97

Merged
merged 9 commits into from
Jan 25, 2021
Merged

Strings without null-termination #97

merged 9 commits into from
Jan 25, 2021

Conversation

matthijskooijman
Copy link
Collaborator

When working with the Arduino String class, I've found that I couldn't efficiently combine it with some external libraries that explicitely pass char* and length around, without nul-terminating their strings. This prompted me to modify and expose the concat (const char* cstr, unsigned int length) method, add a new String(const char* cstr, unsigned int length) constructor. While I was going over the string class, I found some other minor improvements, which are included here.

This is a port of arduino/Arduino#1936. The commits are identical, except for some improved commit messages and one commit was dropped since that change was already made by someone else in the meantime.

I've provided some testcases by updating the string examples: arduino/Arduino#9239

If this is ok to merge, I'll also provide a PR for the reference documentation.

@cvwillegen
Copy link

Can this please be merged? I'm running into problems with this, as I would like to use string with embedded NUL chars in them. When copying a String with a NUL char in it, the memory after the NUL bytes is not being copied, so the rest of that String is uninitialised memory.

Here's sample code showing the error:

String sGlobal;

void dumpString(const String &s)
{
  Serial.print("Got a string of length ");
  Serial.println(s.length());
  Serial.print(">");
  for (size_t t = 0; t < s.length(); ++t) {
    if (s.charAt(t) != '\0' && isascii(s.charAt(t))) {
      Serial.print(s.charAt(t));
    } else if (s.charAt(t) == '\0') {
      Serial.print("\\0");
    } else {
      Serial.print("\\x");
      Serial.print(s.charAt(t), HEX);
    }
  }
  Serial.print("<");

  Serial.println();
}

void encode(String &s)
{
  while (s.length() < 12)
  {
    s += ' ';
  }

  Serial.println("s in encode is, after filling with spaces:");
  dumpString(s);

  s.setCharAt(11, '!');
  s.setCharAt(10, 'd');
  s.setCharAt(9, 'l');
  s.setCharAt(8, 'r');
  s.setCharAt(7, 'o');
  s.setCharAt(6, 'w');
  s.setCharAt(5, '\0');
  s.setCharAt(4, 'o');
  s.setCharAt(3, 'l');
  s.setCharAt(2, 'l');
  s.setCharAt(1, 'e');
  s.setCharAt(0, 'H');
  
  Serial.println("s in encode is, after setting its chars:");
  dumpString(s);

}

void test() {
  String sLocal;
  Serial.println("sLocal in test is, after init:");
  dumpString(sLocal);
  Serial.println("sGlobal in test is, after init:");
  dumpString(sGlobal);
  
  encode(sLocal);
  Serial.println("sLocal in test is, after encode:");
  dumpString(sLocal);
  Serial.println("sGlobal in test is, after encode:");
  dumpString(sGlobal);

  sGlobal = sLocal;
  Serial.println("sGlobal in test is, after assignment:");
  dumpString(sGlobal);
}

void setup()
{
  Serial.begin(115200);

  test();
}

void loop()
{
}

Output of this sample:

Local in test is, after init:
Got a string of length 0
><
sGlobal in test is, after init:
Got a string of length 0
><
s in encode is, after filling with spaces:
Got a string of length 12
>            <
s in encode is, after setting its chars:
Got a string of length 12
>Hello\0world!<
sLocal in test is, after encode:
Got a string of length 12
>Hello\0world!<
sGlobal in test is, after encode:
Got a string of length 0
><
sGlobal in test is, after assignment:
Got a string of length 12
>Hello\0�n\xFFFFFFEF\xFFFFFFD6\xFFFFFFFF\<

@cvwillegen
Copy link

@matthijskooijman How can I build this and run it on my own Arduino's? As you might recall, I added #112 and would like to create a patch to fix it. But then I need to compile this code myself in order to fix it. How does one do that?

@matthijskooijman
Copy link
Collaborator Author

@cvwillegen You should be able to clone this repository (and then the right branch) into ~/Arduino/hardware/arduino-git/avr (so you'd get a ~/Arduino/hardware/arduino-git/avr/platform.txt file, adapt ~/Arduino if you modified your sketchbook directory or are running on Windows).

Then, it is probably useful to change the name in platform.txt file to allow distinguishing your git clone'd Arduino core from the original one (until IDE 1.8.13 is released, containing arduino/Arduino#10007).

If you then restart the Arduino IDE, you should the AVR boards twice in the boards menu. If you select the right one, it will use the code from git rather than the official release.

Is that enough info?

@aentinger
Copy link
Contributor

@matthijskooijman can you please rebase? I'd like to see the CI unit tests run, probably the tests need some fixing.

@codecov-io
Copy link

Codecov Report

Merging #97 (9a604e7) into master (2ca15ad) will decrease coverage by 0.34%.
The diff coverage is 85.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #97      +/-   ##
==========================================
- Coverage   96.41%   96.07%   -0.35%     
==========================================
  Files          14       14              
  Lines         837      840       +3     
==========================================
  Hits          807      807              
- Misses         30       33       +3     
Impacted Files Coverage Δ
api/String.h 90.90% <ø> (+2.02%) ⬆️
api/String.cpp 97.69% <85.00%> (-0.77%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2ca15ad...9a604e7. Read the comment docs.

@matthijskooijman
Copy link
Collaborator Author

@aentinger I finally got around to having another look at this, I see you already did a rebase. I just did another force push, which:

With that, the testcases succeed again.

I think this could be merged as-is (after squashing the fixup), but I also want to look at:

  1. Adding some testcases with embedded nul bytes to verify that those actually work and stay working in the future (and increase coverage).
  2. Fixing String::equals (and related methods) with embedded nul characters (probably required to write proper testcases). This is also String == String does not work with embedded NUL characters #112.
  3. Fixing some more places where embedded nul bytes still cause problems (in particular certain operations will not work since they use library functions that assume nul termination).

Ideally, we'd do 1. and 2. before merging this, 3. is probably more tricky and might be left for later.

@cvwillegen
Copy link

There are still strcmp() calls in the source, instead of memcmp(), so it's not going to work as-is...

@matthijskooijman
Copy link
Collaborator Author

There are still strcmp() calls in the source, instead of memcmp(), so it's not going to work as-is...

Yeah, that's point 2. on my list above. But for some purposes (such as passing around buffers with embedded nuls) the PR is already useful as it stands (as long as you do not need comparison or some other operations). But as I said, ideally we'd fix at least the comparisons before merging this.

@cvwillegen
Copy link

Yeah, that's point 2. on my list above. But for some purposes (such as passing around buffers with embedded nuls) the PR is already useful as it stands (as long as you do not need comparison or some other operations). But as I said, ideally we'd fix at least the comparisons before merging this.

I actually ran into this when doing an == on 2 NUL-containing strings :-)

Indeed, I created #112 and you pointed me to this PR...

Copy link
Contributor

@aentinger aentinger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @matthijskooijman 👋 and a happy new year 🎆

Thank you very much for all your good work 👍! I agree that both

  1. Adding some testcases with embedded nul bytes to verify that those actually work and stay working in the future (and increase coverage).

and

  1. Fixing String::equals (and related methods) with embedded nul characters (probably required to write proper testcases). This is also String == String does not work with embedded NUL characters #112.

will be necessary for merging this PR. Maybe if you could provide test cases in the appropriate files for String::equals and String::compareTo? I don't see a need to provide embedded-nul-string test code for operators ==, !=, <, >, <=, >= as they are basically wrappers around the two functioned mentioned before.

test/src/String/test_comparisonFunc.cpp Outdated Show resolved Hide resolved
test/src/String/StringPrinter.h Outdated Show resolved Hide resolved
@aentinger
Copy link
Contributor

I assume a rebase is in order @matthijskooijman after my merging of #137?

Instead of calling strlen in a dozen places, instead just call
String::concat(s), which will then call strlen. This shrinks the code
size of these calls significantly, the StringAppendOperator example on
the Arduino Uno shrinks by 72 bytes. This change does incur a slight
runtime cost, because there is one extra function call.
Previously, these methods took a nul-terminated string and appended it
to the current buffer. The length argument (or length of the passed
String object) was used to allocate memory, but was not used when
actually copying the string. This meant that:

 - If the passed length was too short, or the string passed in was not
   nul-terminated, the buffer would overflow.
 - If the string passed in contained embedded nul characters (i.e,
   among the first length characters), any characters after the embedded
   nul would not be copied (but len would be updated to indicate they
   were).

In practice, neither of the above would occur, since the length passed is
always the known length of the string, usually as returned by strlen.
However, to make this method public, and to allow using this concat
method to pass in strings that are not nul-terminated, it should be
changed to be more robust.

This commit changes the method to use memcpy instead of strcpy, copying
exactly the number of bytes passed in. For the current calls to this
method, which pass a nul-terminated string, without embedded nul
characters and a correct length, this behaviour should not change.

However, this concat method can now also be used in the two cases
mentioned above. Non-nul-terminated strings now work as expected and for
strings with embedded newlines the entire string is copied as-is,
instead of leaving uninitialized memory after the embedded nul byte.

Note that a lot of operations will still only see the string up to the
embedded nul byte, but that's not really fixable unless we reimplement
functions like strcmp.
Now concat(const char*, unsigned int) no longer requires a
nul-terminated string, we can simplify the concat(char) method to just
pass the address of the single character instead of having to create
buffer with nul-termination.
This method is useful when receiving data from external sources that
pass an explicit length instead of a NUL-terminated string.
This constructor allows converting a buffer containing a
non-nul-terminated string to a String object, by explicitely passing the
length.
Before, substring would (temporarily) add a \0 in the original string at
the end of the substring, so the substring could be copied into a new
String object. However, now that the String::copy() method can work with
non-nul-terminated strings (by passing an explicit length), this trick
is not needed and we can just call the copy method instead.
This just calls the char* version, but allows calling the method with a
uint8_t* as well (which is not uncommon for buffers).
This allows creating a String from a uint8_t[] or uint8_t* as well,
without having to add explicit casts.
@matthijskooijman
Copy link
Collaborator Author

I assume a rebase is in order @matthijskooijman after my merging of #137?

Yup, done.

Copy link
Contributor

@aentinger aentinger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 👍 Thank you @matthijskooijman 🚀

@aentinger aentinger merged commit fdccb19 into master Jan 25, 2021
@aentinger aentinger deleted the stringnonull branch January 25, 2021 09:29
@matthijskooijman
Copy link
Collaborator Author

Yay for merging this, but I didn't make the changes mentioned in #97 (review) yet... And there was still a fixup commit in here that should have been squashed before merging...

Regardless, I believe that the changes in this PR are already useful and worth including, though support for strings with NUL is now not entirely complete and lacks testcases, so that still needs to be fixed (no time for that right now, though).

mysterywolf added a commit to RTduino/RTduino that referenced this pull request Sep 15, 2024
mysterywolf pushed a commit to RTduino/RTduino that referenced this pull request Sep 15, 2024
When working with the Arduino String class, I've found that I couldn't
efficiently combine it with some external libraries that explicitely
pass char* and length around, without nul-terminating their strings.
This prompted me to modify and expose the concat (const char* cstr,
unsigned int length) method, add a new String(const char* cstr, unsigned
int length) constructor. While I was going over the string class, I
found some other minor improvements, which are included here.

This is a port of arduino/Arduino#1936. The commits are identical,
except for some improved commit messages and one commit was dropped
since that change was already made by someone else in the meantime.

I've provided some testcases by updating the string examples:
arduino/Arduino#9239

If this is ok to merge, I'll also provide a PR for the reference
documentation.

cherry-pick from: arduino/ArduinoCore-API#97
mysterywolf pushed a commit to RTduino/RTduino that referenced this pull request Sep 15, 2024
When working with the Arduino String class, I've found that I couldn't
efficiently combine it with some external libraries that explicitely
pass char* and length around, without nul-terminating their strings.
This prompted me to modify and expose the concat (const char* cstr,
unsigned int length) method, add a new String(const char* cstr, unsigned
int length) constructor. While I was going over the string class, I
found some other minor improvements, which are included here.

This is a port of arduino/Arduino#1936. The commits are identical,
except for some improved commit messages and one commit was dropped
since that change was already made by someone else in the meantime.

I've provided some testcases by updating the string examples:
arduino/Arduino#9239

If this is ok to merge, I'll also provide a PR for the reference
documentation.

cherry-pick from: arduino/ArduinoCore-API#97
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants