cigix/deencode

Deencode: Reverse engineer encoding errors

My first name is Clément. Throughout my life, I have seen my name misprinted many times because of mishandled encodings: the text is encoded (turned from an internal representation into a sequence of bytes) with one scheme, then decoded (turned from a sequence of bytes back into an internal representation) with a different one. This often leads to non-ASCII characters being mangled, replaced, or dropped outright.

For example:

The string "Clément"
└╴encoded as UTF-8 is 43 6C C3 A9 6D 65 6E 74
  └╴decoded as Latin-1 / Codepage 1252 is "ClÃ©ment"
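
The same mangling can be reproduced with the Rust standard library alone. The following is a minimal sketch that does not use this crate; the helper names `hex` and `decode_latin1` are made up for illustration. It relies on the fact that Latin-1 maps each byte to the Unicode code point of the same value, and that Codepage 1252 agrees with Latin-1 for the two bytes involved here (C3 and A9).

```rust
/// Format bytes as space-separated uppercase hex, e.g. "43 6C".
/// (Hypothetical helper, not part of the deencode crate.)
fn hex(bytes: &[u8]) -> String {
    bytes
        .iter()
        .map(|b| format!("{:02X}", b))
        .collect::<Vec<_>>()
        .join(" ")
}

/// Decode bytes as Latin-1: every byte becomes the code point of the same value.
/// (Hypothetical helper, not part of the deencode crate.)
fn decode_latin1(bytes: &[u8]) -> String {
    bytes.iter().map(|&b| char::from(b)).collect()
}

fn main() {
    // Rust strings are UTF-8 internally, so as_bytes() is the UTF-8 encoding.
    let encoded = "Clément".as_bytes();
    println!("{}", hex(encoded));
    // Reading those bytes back with the wrong scheme mangles the "é".
    println!("{}", decode_latin1(encoded));
}
```

The two-byte UTF-8 sequence for "é" is read back as two separate Latin-1 characters, "Ã" and "©", which is exactly the corruption shown in the tree above.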

Producing this sort of visualisation is why I created this crate. You pass a list of engines to deencode::deencode() to get back a tree of possible sequences of encodings and decodings, and then work on that tree.

This crate is published on crates.io, with documentation at docs.rs.

Example usage

// List the engines to use.
let engines: Vec<&dyn Engine> = vec![&UTF8, &LATIN1, &MIXED816BE, &MIXED816LE, &UTF7];
// Explore the tree of possible encodings and decodings.
let mut tree = deencode("Clément", &engines, 1);
// Remove duplicate entries from the tree.
let _ = tree.deduplicate();
// Export the tree with box drawings.
println!("{}", tree);
// Export the tree as JSON.
println!("{}", serde_json::to_string(&tree).unwrap());
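
To give an intuition for what such a tree contains, here is a self-contained toy sketch of a single round of exploration: every engine tries to encode the input, and every engine then tries to decode the resulting bytes. It is not the crate's implementation; the `ToyEngine` trait, the `ToyUtf8` and `ToyLatin1` types, and the `one_round` function are all hypothetical names introduced for illustration, and the real crate may structure things differently.

```rust
// Hypothetical toy engine trait; this is NOT the crate's Engine trait.
trait ToyEngine {
    fn name(&self) -> &'static str;
    /// Encode text into bytes, or None if a character is not representable.
    fn encode(&self, text: &str) -> Option<Vec<u8>>;
    /// Decode bytes into text, or None if the bytes are invalid for this scheme.
    fn decode(&self, bytes: &[u8]) -> Option<String>;
}

struct ToyUtf8;
struct ToyLatin1;

impl ToyEngine for ToyUtf8 {
    fn name(&self) -> &'static str { "UTF-8" }
    fn encode(&self, text: &str) -> Option<Vec<u8>> {
        Some(text.as_bytes().to_vec())
    }
    fn decode(&self, bytes: &[u8]) -> Option<String> {
        String::from_utf8(bytes.to_vec()).ok()
    }
}

impl ToyEngine for ToyLatin1 {
    fn name(&self) -> &'static str { "Latin-1" }
    fn encode(&self, text: &str) -> Option<Vec<u8>> {
        // Only code points up to U+00FF fit in one Latin-1 byte.
        text.chars().map(|c| u8::try_from(u32::from(c)).ok()).collect()
    }
    fn decode(&self, bytes: &[u8]) -> Option<String> {
        Some(bytes.iter().map(|&b| char::from(b)).collect())
    }
}

/// One round: try every (encoder, decoder) pair and keep the successful ones
/// as (encoder name, decoder name, resulting text).
fn one_round(
    input: &str,
    engines: &[&dyn ToyEngine],
) -> Vec<(&'static str, &'static str, String)> {
    let mut results = Vec::new();
    for enc in engines {
        if let Some(bytes) = enc.encode(input) {
            for dec in engines {
                if let Some(text) = dec.decode(&bytes) {
                    results.push((enc.name(), dec.name(), text));
                }
            }
        }
    }
    results
}

fn main() {
    let utf8 = ToyUtf8;
    let latin1 = ToyLatin1;
    let engines: Vec<&dyn ToyEngine> = vec![&utf8, &latin1];
    for (enc, dec, text) in one_round("Clément", &engines) {
        println!("{} -> {}: {}", enc, dec, text);
    }
}
```

Pairs that fail are simply skipped: for instance, the Latin-1 bytes of "Clément" are not valid UTF-8, so that branch produces no result. Repeating the round on each result would yield deeper levels of a tree.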

