This conversion to UTF-8 string is not safe #203
Ok, so I have been doing more research, and I think this is the safest way to decode a UTF-8 response in chunks that I'm likely to come up with at this time.

```rust
// Assumes the imports already used by the training example:
// `anyhow::{bail, Result}`, `std::str`, and the HTTP client types
// (`Response`, `EspHttpConnection`).
const BUFFER_SIZE: usize = 256;

fn read_response(mut response: Response<&mut EspHttpConnection>) -> Result<()> {
    // Fixed buffer to read into
    let mut buffer = [0_u8; BUFFER_SIZE];
    // Offset into the buffer to indicate that there are still
    // bytes at the beginning that have not been decoded yet
    let mut offset = 0;
    // Keep track of the total number of bytes read to print later
    let mut total = 0;
    loop {
        // Read into the buffer starting at the offset to not overwrite
        // the incomplete UTF-8 sequence we put there earlier.
        // NOTE: a read error is silently ignored here and the loop retries;
        // propagating the error instead may be preferable.
        if let Ok(size) = response.read(&mut buffer[offset..]) {
            if size == 0 {
                // No more bytes to read from the response
                if offset > 0 {
                    bail!("Response ends with an incomplete UTF-8 sequence of length: {}", offset)
                }
                break;
            }
            // Update the total number of bytes read
            total += size;
            // Remember that we read into an offset and recalculate the
            // real length of the bytes to decode
            let size_plus_offset = size + offset;
            match str::from_utf8(&buffer[..size_plus_offset]) {
                Ok(text) => {
                    // The buffer contains fully valid UTF-8 data:
                    // print it and reset the offset to 0
                    println!("{}", text);
                    offset = 0;
                }
                Err(error) => {
                    // The buffer ends with incomplete UTF-8 data: print the
                    // valid part, copy the incomplete sequence to the
                    // beginning of the buffer and set an offset for the
                    // next read
                    let valid_up_to = error.valid_up_to();
                    println!("{}", str::from_utf8(&buffer[..valid_up_to])?);
                    buffer.copy_within(valid_up_to.., 0);
                    offset = size_plus_offset - valid_up_to;
                }
            }
        }
    }
    println!("Total: {} bytes", total);
    Ok(())
}
```
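The failure mode this guards against can be reproduced on the desktop without any networking. The example below (my own illustration, not from the training code) splits a two-byte character across a chunk boundary and shows what `str::from_utf8` reports:

```rust
use std::str;

fn main() {
    // "é" is the two-byte sequence 0xC3 0xA9; pretend a read boundary
    // split it, so the first chunk ends mid-character.
    let chunk: &[u8] = b"caf\xC3";

    let error = str::from_utf8(chunk).unwrap_err();
    // The valid prefix is "caf" (3 bytes); the trailing 0xC3 is the start
    // of an unfinished multi-byte sequence.
    assert_eq!(error.valid_up_to(), 3);
    // `error_len()` is None for an *incomplete* sequence that might be
    // completed by the next chunk, so those bytes must be carried over.
    assert_eq!(error.error_len(), None);
    println!("valid prefix: {}", str::from_utf8(&chunk[..error.valid_up_to()]).unwrap());
}
```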
I just realised that I should also check whether the decode error reports an `error_len()`: a `Some` value would signify an invalid sequence that needs to be skipped, as stated in the linked docs (I did not cover this case in the function above).
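The two cases that `Utf8Error` distinguishes can be seen directly (a standalone illustration, with byte values I picked for the example):

```rust
use std::str;

fn main() {
    // Truncated sequence at the end of the input: more bytes might follow,
    // so `error_len()` is None and the tail should be kept for the next read.
    let truncated: &[u8] = b"ab\xE2\x82"; // first two bytes of "€" (0xE2 0x82 0xAC)
    let err = str::from_utf8(truncated).unwrap_err();
    assert_eq!(err.error_len(), None);

    // Genuinely invalid byte in the middle: no amount of further input can
    // fix it, so `error_len()` is Some(n) and those n bytes must be skipped
    // (e.g. replaced with U+FFFD).
    let invalid: &[u8] = b"ab\xFFcd";
    let err = str::from_utf8(invalid).unwrap_err();
    assert_eq!(err.valid_up_to(), 2);
    assert_eq!(err.error_len(), Some(1));
}
```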
Ok, this implementation deals with invalid UTF-8 sequences too, but I'm not sure it's helpful in the context of the training materials.

```rust
const BUFFER_SIZE: usize = 256;

struct ResponsePrinter {
    // Fixed buffer to read into
    buffer: [u8; BUFFER_SIZE],
    // Offset into the buffer to indicate that there are still
    // bytes at the beginning that have not been decoded yet
    offset: usize,
}

impl ResponsePrinter {
    fn new() -> ResponsePrinter {
        ResponsePrinter {
            buffer: [0_u8; BUFFER_SIZE],
            offset: 0,
        }
    }

    fn print(&mut self, mut response: Response<&mut EspHttpConnection>) -> Result<()> {
        // Keep track of the total number of bytes read to print later
        let mut total = 0;
        loop {
            // Read into the buffer starting at the offset to not overwrite
            // the incomplete UTF-8 sequence we put there earlier
            if let Ok(size) = response.read(&mut self.buffer[self.offset..]) {
                if size == 0 {
                    // No more bytes to read from the response
                    if self.offset > 0 {
                        bail!("Response ends with an incomplete UTF-8 sequence of length: {}", self.offset)
                    }
                    break;
                }
                // Update the total number of bytes read
                total += size;
                // Recursive print to handle invalid UTF-8 sequences
                self.print_utf8(size)?;
            }
        }
        println!("Total: {} bytes", total);
        Ok(())
    }

    fn print_utf8(&mut self, size: usize) -> Result<()> {
        // Remember that we read into an offset and recalculate the
        // real length of the bytes to decode
        let size_plus_offset = size + self.offset;
        match str::from_utf8(&self.buffer[..size_plus_offset]) {
            Ok(text) => {
                // The buffer contains fully valid UTF-8 data:
                // print it and reset the offset to 0
                print!("{}", text);
                self.offset = 0;
            }
            Err(error) => {
                // A UTF-8 decode error was encountered: print
                // the valid part and figure out what to do with the rest
                let valid_up_to = error.valid_up_to();
                print!("{}", str::from_utf8(&self.buffer[..valid_up_to])?);
                if let Some(error_len) = error.error_len() {
                    // The buffer contains invalid UTF-8 data: print a replacement
                    // character, then copy the remainder (possibly valid) to the
                    // beginning of the buffer, reset the offset and deal with
                    // the remainder in a recursive call to print_utf8
                    print!("{}", char::REPLACEMENT_CHARACTER);
                    let valid_after = valid_up_to + error_len;
                    self.buffer.copy_within(valid_after.., 0);
                    self.offset = 0;
                    return self.print_utf8(size_plus_offset - valid_after);
                } else {
                    // The buffer ends with incomplete UTF-8 data: copy the
                    // incomplete sequence to the beginning of the buffer and
                    // set an offset for the next read
                    self.buffer.copy_within(valid_up_to.., 0);
                    self.offset = size_plus_offset - valid_up_to;
                }
            }
        }
        Ok(())
    }
}
```
Hi! Thanks for opening the issue and sharing your findings on the topic! Would you mind opening a PR with your solution? For the purpose of the training, I would keep it as simple as possible, as the main point of the exercise is the HTTP request.
I'll cut it down to make it more palatable, but at least safe for the happy path of valid UTF-8 :)
https://github.com/esp-rs/std-training/blob/f7ce2e7d7afe520c027d836e81e1e50bd9d64caf/intro/http-client/examples/http_client.rs#L72C43-L72C43
TLDR: The linked code for reading UTF-8 breaks on the given URLs when a UTF-8 sequence is split up into chunks. The rest of this bug report is my meandering through trying to find a correct way to do this, or at least a way that doesn't break. See my later comments for potential solutions:
Handling split UTF-8 sequences: #203 (comment)
Handling split UTF-8 sequences and invalid UTF-8 sequences: #203 (comment)
It seems to me that there is a chance it will fail if the end of the buffer lands in the middle of a multibyte character. Or am I missing something?
I wondered if the choice of a 256-byte buffer would guarantee to keep things on character boundaries, but my understanding of UTF-8 suggests not.
Googling so far has not suggested a good solution to this (from_utf8_lossy doesn't seem to solve the issue). In my own solution I pushed everything into a vector before doing the conversion, but that is obviously problematic. The next idea I had was to let it fail, then try again with one less byte until it works, while remembering to pass the remainder back when I get another buffer full, but that would be quite complicated. So if we're going to go that route, it might be better to think of a proper stream-handling solution that consumes only complete characters until there are no more (probably in a ring buffer or something).
Anyway, I just wanted to flag this in case it is an issue that has a known solution that I missed. As it is, the error condition in the context of HTML is probably quite rare, as the vast majority of markup characters are single-byte.
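Why `from_utf8_lossy` doesn't help can be shown directly: applied per chunk, it turns a merely *split* character into replacement characters, corrupting data that was actually valid. A small standalone demonstration (my own example string):

```rust
fn main() {
    // "café" encoded as UTF-8, split mid-character the way a chunked
    // read might split it: [0x63, 0x61, 0x66, 0xC3] and [0xA9].
    let bytes = "caf\u{e9}".as_bytes();
    let (chunk1, chunk2) = bytes.split_at(4);

    // Decoding each chunk independently mangles the split "é" into
    // U+FFFD on both sides of the boundary.
    let lossy = format!(
        "{}{}",
        String::from_utf8_lossy(chunk1),
        String::from_utf8_lossy(chunk2)
    );
    assert_eq!(lossy, "caf\u{FFFD}\u{FFFD}");
    assert_ne!(lossy, "caf\u{e9}");
}
```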
EDIT: I realised that I didn't check if I could force an error, so I just ran the example against the HTTPS request for https://espressif.com/, which I think might be full of multibyte characters, and with a buffer size of 256 I ran into the following error quite quickly:
EDIT: If you can't tell, I'm new to Rust :) I feel like we need to use something like https://docs.rs/utf8-read/latest/utf8_read/# but with the `embedded_svc::io::Read` trait instead of the `std::io::Read` trait. Given that they are analogous, is there a shortcut for this (like a cast)?
EDIT: Ok, I keep learning: the following works, but I don't think it is efficient enough for production code. I think it would be better to have a UTF-8 reader that just returned chunks of valid UTF-8 bytes instead of converting to 4-byte `char`s (which then get converted back to UTF-8 by `print!`, never mind the cost of printing to stdout one character at a time!). Also, this method makes it difficult to know the size in bytes:
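For illustration, a minimal std-only sketch of that char-at-a-time style (not the `utf8_read` crate itself, and using `std::io::Read` rather than `embedded_svc::io::Read`; the function name `print_chars` is my own) shows the overhead being described: each character is decoded out of the byte stream and then re-encoded by `print!`, one tiny write at a time.

```rust
use std::io::Read;

// Decode a UTF-8 stream one `char` at a time. A genuinely invalid byte
// would stall this loop forever; proper handling of that case (as in the
// earlier ResponsePrinter) is omitted to keep the sketch short.
fn print_chars(mut reader: impl Read) -> std::io::Result<usize> {
    let mut pending = Vec::new(); // bytes of a partially read character
    let mut byte = [0_u8; 1];
    let mut total = 0;
    loop {
        let n = reader.read(&mut byte)?;
        if n == 0 {
            break; // end of stream (leftover pending bytes are incomplete)
        }
        total += 1;
        pending.push(byte[0]);
        // An incomplete sequence just means we need more input,
        // so only print once the pending bytes form a valid char.
        if let Ok(text) = std::str::from_utf8(&pending) {
            print!("{}", text); // one write per character!
            pending.clear();
        }
    }
    Ok(total)
}

fn main() -> std::io::Result<()> {
    let total = print_chars("h\u{e9}llo \u{20ac}".as_bytes())?;
    println!("\nTotal: {} bytes", total);
    Ok(())
}
```

A chunk-oriented reader that yields slices of valid UTF-8 bytes, as suggested above, avoids both the per-character writes and the `char` round trip, and trivially knows the byte count.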