My first name is Clément. Throughout my life, I've seen my fair share of misprinted versions of my name because of poor encoding handling: the text is encoded (turned from an internal representation into a sequence of bytes) and then decoded (turned from a sequence of bytes back into an internal representation) using a different scheme. This often leads to non-ASCII characters being mangled, replaced, or dropped altogether.
For example:
The string "Clément"
└╴encoded as UTF-8 is 43 6C C3 A9 6D 65 6E 74
  └╴decoded as Latin-1 / Codepage 1252 is "Clément"
Producing this sort of visualisation is why I created this crate. You take a number of engines, pass them to deencode::deencode() to get back a tree of possible sequences of encodings and decodings, and then work on that tree.
This crate is published on crates.io, with documentation at docs.rs.
// List the engines to use.
let engines: Vec<&dyn Engine> = vec![&UTF8, &LATIN1, &MIXED816BE, &MIXED816LE, &UTF7];
// Explore the tree of possible encodings and decodings.
let mut tree = deencode("Clément", &engines, 1);
// Remove duplicate entries from the tree.
let _ = tree.deduplicate();
// Export the tree with box drawings.
println!("{}", tree);
// Export the tree as JSON.
println!("{}", serde_json::to_string(&tree).unwrap());