This conversion to UTF-8 string is not safe #203

Closed
pghalliday opened this issue Aug 28, 2023 · 5 comments · Fixed by #204

Comments

@pghalliday
Contributor

pghalliday commented Aug 28, 2023

https://github.com/esp-rs/std-training/blob/f7ce2e7d7afe520c027d836e81e1e50bd9d64caf/intro/http-client/examples/http_client.rs#L72C43-L72C43

TL;DR: The linked code for reading the response as UTF-8 breaks on the given URLs when a UTF-8 sequence is split across chunks. The rest of this bug report is my meandering through trying to find a correct way to do this, or at least one that doesn't break. See my later comments for potential solutions:

Handling split UTF-8 sequences: #203 (comment)

Handling split UTF-8 sequences and invalid UTF-8 sequences: #203 (comment)


It seems to me that there is a chance it will fail if the end of the buffer lands in the middle of a multibyte character. Or am I missing something?

I wondered whether the choice of a 256-byte buffer would guarantee that reads stay on character boundaries, but my understanding of UTF-8 suggests not.

Googling so far has not suggested a good solution to this (from_utf8_lossy doesn't seem to solve the issue). In my own solution I pushed everything into a Vec before doing the conversion, but that is obviously problematic. The next idea I had was to let the conversion fail and then retry with one less byte until it succeeds, remembering to carry the remainder over into the next buffer full, but that would be quite complicated. So if we're going to go that route it might be better to think of a proper stream-handling solution that consumes only complete characters until there are no more (probably with a ring buffer or something).

Anyway, I just wanted to flag this in case it is an issue with a known solution that I missed. As it is, the error condition is probably quite rare in the context of HTML, as the vast majority of markup characters are single-byte.
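To make the failure mode concrete, here is a minimal standard-library-only sketch (not part of the linked example): slicing a byte stream in the middle of a multibyte character makes str::from_utf8 fail with exactly the kind of error shown further down.

use std::str;

fn main() {
    // "café" is five bytes in UTF-8: 'é' is the two-byte sequence 0xC3 0xA9
    let bytes = "café".as_bytes();
    // Pretend the first chunk of a read ends between the two bytes of 'é'
    let chunk = &bytes[..4];
    match str::from_utf8(chunk) {
        Ok(text) => println!("{}", text),
        // Prints: incomplete utf-8 byte sequence from index 3
        Err(error) => println!("{}", error),
    }
}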

EDIT: I realised that I hadn't checked whether I could force an error, so I ran the example against the HTTPS request for https://espressif.com/, which I think might be full of multibyte characters, and with a buffer size of 256 I ran into the following error quite quickly:

Error: incomplete utf-8 byte sequence from index 255

EDIT: If you can't tell, I'm new to Rust :) I feel like we need to use something like https://docs.rs/utf8-read/latest/utf8_read/# but with the embedded_svc::io::Read trait instead of the std::io::Read trait. Given that they are analogous, is there a shortcut for this (like a cast)?

EDIT: OK, I keep learning: the following works, but I don't think it is efficient enough for production code. I think it would be better to have a UTF-8 reader that just returned chunks of valid UTF-8 bytes instead of converting to 4-byte chars (which then get converted back to UTF-8 by print!, never mind the cost of printing to stdout one character at a time; see the batching sketch after this snippet). Also, this method makes it difficult to know the size in bytes:

use utf8_read::Reader;
use embedded_svc::io::adapters::ToStd;

...
            // Adapt the embedded_svc reader to std::io::Read and decode one char at a time
            let mut reader = Reader::new(ToStd::new(response));
            for character_result in reader.into_iter() {
                let character = character_result?;
                print!("{}", character);
            }
...
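
One way to soften the per-character printing cost mentioned above (just a sketch on top of the same utf8_read iterator, not benchmarked, and it still decodes via 4-byte chars) is to batch the decoded characters into a String and flush it in larger pieces:

...
            let mut reader = Reader::new(ToStd::new(response));
            let mut text = String::new();
            for character_result in reader.into_iter() {
                text.push(character_result?);
                // Flush in roughly buffer-sized pieces instead of one char at a time
                if text.len() >= 256 {
                    print!("{}", text);
                    text.clear();
                }
            }
            // Flush whatever is left over
            print!("{}", text);
...
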
@pghalliday changed the title from "Is this conversion to UTF-8 string safe?" to "This conversion to UTF-8 string is not safe" on Aug 29, 2023
@pghalliday
Contributor Author

Ok, so I have been doing more research and have found the valid_up_to method of Utf8Error: https://doc.rust-lang.org/std/str/struct.Utf8Error.html#method.valid_up_to

I think this is the safest way to decode a UTF-8 response in chunks that I'm likely to come up with at this time.

// str::from_utf8 comes from std; bail! and Result are assumed to come from anyhow,
// as used elsewhere in the training code
use std::str;
use anyhow::{bail, Result};

const BUFFER_SIZE: usize = 256;

fn read_response(mut response: Response<&mut EspHttpConnection>) -> Result<()> {
    // Fixed buffer to read into
    let mut buffer = [0_u8; BUFFER_SIZE];
    // Offset into the buffer to indicate that there are still
    // bytes at the beginning that have not been decoded yet
    let mut offset = 0;
    // Keep track of the total number of bytes read to print later
    let mut total = 0;
    loop {
        // read into the buffer starting at the offset to not overwrite
        // the incomplete UTF-8 sequence we put there earlier
        if let Ok(size) = response.read(&mut buffer[offset..]) {
            if size == 0 {
                // no more bytes to read from the response
                if offset > 0 {
                    bail!("Response ends with an invalid UTF-8 sequence with length: {}", offset)
                }
                break;
            }
            // Update the total number of bytes read
            total += size;
            // remember that we read into an offset and recalculate the
            // real length of the bytes to decode
            let size_plus_offset = size + offset;
            match str::from_utf8(&buffer[..size_plus_offset]) {
                Ok(text) => {
                    // buffer contains fully valid UTF-8 data,
                    // print it and reset the offset to 0
                    println!("{}", text);
                    offset = 0;
                },
                Err(error) => {
                    // buffer contains incomplete UTF-8 data, we will
                    // print the valid part, copy the invalid sequence to
                    // the beginning of the buffer and set an offset for the
                    // next read
                    let valid_up_to = error.valid_up_to();
                    println!("{}", str::from_utf8(&buffer[..valid_up_to])?);
                    buffer.copy_within(valid_up_to.., 0);
                    offset = size_plus_offset - valid_up_to;
                }
            }
        }
    }
    println!("Total: {} bytes", total);
    Ok(())
}

@pghalliday
Contributor Author

I just realised that I should also check the error_len from the error: https://doc.rust-lang.org/std/str/struct.Utf8Error.html#method.error_len

This would signify an invalid sequence that needs to be skipped, as stated in the linked docs (I did not cover this case in the function above).
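
For reference, a small standard-library sketch (not from the exercise) of the two error shapes: error_len() is None when a sequence is merely cut off at the end of the input, and Some(n) when the bytes can never form a valid sequence and should be skipped.

use std::str;

fn main() {
    // Truncated sequence at the end of the chunk: wait for more bytes
    let truncated = [0x63, 0x61, 0x66, 0xC3]; // "caf" plus the first byte of 'é'
    let error = str::from_utf8(&truncated).unwrap_err();
    assert_eq!(error.valid_up_to(), 3);
    assert_eq!(error.error_len(), None);

    // Genuinely invalid byte in the middle: skip it and carry on
    let invalid = [0x63, 0xFF, 0x61]; // 0xFF can never appear in valid UTF-8
    let error = str::from_utf8(&invalid).unwrap_err();
    assert_eq!(error.valid_up_to(), 1);
    assert_eq!(error.error_len(), Some(1));
}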

@pghalliday
Contributor Author

OK, this implementation deals with invalid UTF-8 sequences too, but I'm not sure it's helpful in the context of the training materials.

const BUFFER_SIZE: usize = 256;

struct ResponsePrinter {
    // Fixed buffer to read into
    buffer: [u8; BUFFER_SIZE],
    // Offset into the buffer to indicate that there are still
    // bytes at the beginning that have not been decoded yet
    offset: usize,
}

impl ResponsePrinter {
    fn new() -> ResponsePrinter {
        ResponsePrinter {
            buffer: [0_u8; BUFFER_SIZE],
            offset: 0,
        }
    }

    fn print(&mut self, mut response: Response<&mut EspHttpConnection>) -> Result<()> {
        // Keep track of the total number of bytes read to print later
        let mut total = 0;
        loop {
            // read into the buffer starting at the offset to not overwrite
            // the incomplete UTF-8 sequence we put there earlier
            if let Ok(size) = response.read(&mut self.buffer[self.offset..]) {
                if size == 0 {
                    // no more bytes to read from the response
                    if self.offset > 0 {
                        bail!("Response ends with an invalid UTF-8 sequence with length: {}", self.offset)
                    }
                    break;
                }
                // Update the total number of bytes read
                total += size;
                // recursive print to handle invalid UTF-8 sequences
                self.print_utf8(size)?;
            }
        }
        println!("Total: {} bytes", total);
        Ok(())
    }

    fn print_utf8(&mut self, size: usize) -> Result<()> {
        // remember that we read into an offset and recalculate the
        // real length of the bytes to decode
        let size_plus_offset = size + self.offset;
        match str::from_utf8(&self.buffer[..size_plus_offset]) {
            Ok(text) => {
                // buffer contains fully valid UTF-8 data,
                // print it and reset the offset to 0
                print!("{}", text);
                self.offset = 0;
            },
            Err(error) => {
                // A UTF-8 decode error was encountered, print
                // the valid part and figure out what to do with the rest
                let valid_up_to = error.valid_up_to();
                print!("{}", str::from_utf8(&self.buffer[..valid_up_to])?);
                if let Some(error_len) = error.error_len() {
                    // buffer contains invalid UTF-8 data, print a replacement
                    // character then copy the remainder (probably valid) to the
                    // beginning of the buffer, reset the offset and deal with
                    // the remainder in a recursive call to print_utf8
                    print!("{}", char::REPLACEMENT_CHARACTER);
                    let valid_after = valid_up_to + error_len;
                    self.buffer.copy_within(valid_after.., 0);
                    self.offset = 0;
                    return self.print_utf8(size_plus_offset - valid_after);
                } else {
                    // buffer contains incomplete UTF-8 data, copy the invalid
                    // sequence to the beginning of the buffer and set an offset
                    // for the next read
                    self.buffer.copy_within(valid_up_to.., 0);
                    self.offset = size_plus_offset - valid_up_to;
                }
            }
        }
        Ok(())
    }
}
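
Usage at the call site would then be just two lines (the surrounding request/response plumbing is the exercise's existing code and is omitted here):

...
    let mut printer = ResponsePrinter::new();
    printer.print(response)?;
...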

@SergioGasquez
Member

Hi! Thanks for opening the issue and sharing your findings on the topic! Would you mind opening a PR with your solution? For the purpose of the training, I would keep it as simple as possible, as the main point of the exercise is the HTTP request.

@pghalliday
Contributor Author

I'll cut it down to make it more palatable while keeping it safe for the happy path of valid UTF-8 :)
