Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Utf8Error when trying to parse 2902.dat #21

Closed
ScanMountGoat opened this issue Feb 21, 2023 · 6 comments
Closed

Utf8Error when trying to parse 2902.dat #21

ScanMountGoat opened this issue Feb 21, 2023 · 6 comments

Comments

@ScanMountGoat
Copy link
Contributor

I was looking for robust and efficient ldraw parsers in Rust and found this crate. From my initial experience, the library has been very well documented and easy to use. I noticed in #2 that the code likely hasn't been run against the entire parts library yet. I've already found some panics.

I'm getting the error Utf8Error { valid_up_to: 10, error_len: Some(1) } at this section of code. Rust doesn't seem to complain about the formatting when I trying and convert the file to utf8 with std::fs::read_to_string. This happens on the recently downloaded ldraw as well as bricklink studio parts libraries. I can upload the dat file if needed.

weldr/lib/src/lib.rs

Lines 414 to 420 in aae1bff

named!(
comment<Command>,
do_parse!(
content: take_not_cr_or_lf
>> (Command::Comment(CommentCmd::new(std::str::from_utf8(content).unwrap())))
)
);

@ScanMountGoat
Copy link
Contributor Author

This is the code I'm running to test with.

use std::path::Path;
use weldr::{parse, FileRefResolver, ResolveError, SourceMap};

struct MyCustomResolver;

impl FileRefResolver for MyCustomResolver {
    fn resolve(&self, filename: &str) -> Result<Vec<u8>, ResolveError> {
        let catalog_path = Path::new(r"C:\Users\Public\Documents\LDraw");
        let base_paths = vec![
            catalog_path.join("p"),
            catalog_path.join("p").join("48"),
            catalog_path.join("parts"),
            catalog_path.join("parts").join("s"),
        ];
        for prefix in &base_paths {
            let full_path = prefix.join(filename);
            if let Ok(bytes) = std::fs::read(&full_path) {
                return Ok(bytes);
            }
        }
        Err(ResolveError::new_raw(filename))
    }
}

fn main() {
    let resolver = MyCustomResolver {};
    let mut source_map = SourceMap::new();

    for file in std::fs::read_dir(r"C:\Users\Public\Documents\LDraw\parts").unwrap() {
        let file = file.unwrap().path().to_string_lossy().to_string();
        match parse(&file, &resolver, &mut source_map) {
            Ok(file_ref) => {
                let root = file_ref.get(&source_map);
                let count = root.iter(&source_map).count();
                println!("File: {:?}, Commands: {}", root.filename, count);
            }
            Err(e) => println!("error parsing: {file:?}: {e}"),
        }
    }
}

@ScanMountGoat
Copy link
Contributor Author

ScanMountGoat commented Feb 21, 2023

It looks like the issue is actually in how "0 Cran creusé" is represented inparts\s\2902s01.dat as 3020 4372616E 20637265 7573E9 in hex. Rust doesn't like that the e with an accent is only one byte 0xE9. Official tools like LDView can read the file with the comments displayed with the accent.

@djeedai
Copy link
Owner

djeedai commented Feb 22, 2023

I've not touched that code for a long time, but it looks from the description like this is an issue with a non-canonical encoding of a UTF-8 string, which Rust from_utf8() refuses to handle. I'm not quite sure what can be done here.

@ScanMountGoat
Copy link
Contributor Author

ScanMountGoat commented Feb 22, 2023

I'm reworking the parsing code on my fork to use the latest version of nom with functions instead of macros. This also removes the unwrap calls on invalid utf8 and returns an error instead. I should have a PR ready to review soon. I've already recieved a response from the moderators on the ldraw forums. The files in question are not conforming to the spec and will be fixed in the next parts update, so there's no need to do any kind of sanitization in weldr.

@djeedai
Copy link
Owner

djeedai commented Feb 22, 2023

Nice, thanks for all of that!

@djeedai
Copy link
Owner

djeedai commented Feb 23, 2023

I just checked, 2902s01.dat didn't even exist last time this library was updated :)

@djeedai djeedai mentioned this issue Feb 24, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants