Skip to content

daniel5151/compressed-emoji-shortcodes

main
Switch branches/tags

Name already in use

A tag already exists with the provided branch name. Many Git commands accept both tag and branch names, so creating this branch may cause unexpected behavior. Are you sure you want to create this branch?
Code

Latest commit

 

Git stats

Files

Permalink
Failed to load latest commit information.
Type
Name
Latest commit message
Commit time
 
 
 
 
src
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Highly Compressed Emoji Shortcode Mapping

An experiment to try and find a highly compressed representation of the entire unicode shortcodes-to-emoji mapping that can be indexed without requiring any dynamic allocation. In other words: what's the smallest amount of static storage(code and data) required to write a function with the following signature:

fn shortcode_to_emoji(input: &str) -> Option<&str>

Check out the project's writeup for more details on the rationale and idea behind this project.

DISCLAIMER: This is a project that was hacked together over the course of a couple weekends, where iteration speed took priority over code quality/robustness. The implementation is a mess, and the build system is incredibly brittle. If you're even remotely considering building off this repo - please don't. You'd be much better off building off these ideas and writing a new implementation from scratch.

Make sure to try out the online demo!

Building and Running

At the moment, building this library requires running a *nix OS with curl installed, which is required to download the emoji database used as part of the build process. Aside from that, the core maximally-compressed-emoji-shortcodes library is a bog-standard Rust crate which can be built with cargo build.

Given that this is more of a quick-and-dirty experiment rather than a proper ready-to-use library, there isn't any easy-to-use config file to play around with. If you're seriously interested in playing around with this mess of code, get ready to do some spelunking. Some potentially useful starting off points:

  • Changing the size of the hash - ./rust-phf/phf/src/map.rs:18 (the keys: Slice<u16> field) and ./rust-phf/phf_shared/src/lib.rs:43 (the checksum function).
  • Fixup Table search times - build.rs
  • Swapping out the key/value data set - build.rs

This repo includes several test/example projects that use the maximally-compressed-emoji-shortcodes library:

  • kowalski-analysis: a very messy playground for testing various properties of the library (e.g: false-positive rates, accuracy, compression ratio, etc...).
  • example_no_std: a no_std Rust binary that serves as a rough benchmark for how much space the library will occupy when deployed in an embedded context.
  • shortcode-web: using the magic of wasm-pack, you can play with this project via a neat little online demo! Try it out at here!
    • Note: the .wasm size of the shortcode-web demo is not representative of the binary size on an proper embedded platform, since wasm-bindgen introduces almost 15kb of overhead for some reason (i.e: when the single exported function is replaced with a noop). I could probably slim this down by bypassing wasm-bindgen entirely, and figuring out how to accept Javascript Strings over the FFI, but that's high effort. So yeah, just subtract 15kb from the (uncompressed) .wasm size to get a better idea of the compression factor.

Future Work?

I'm pretty much done with this project for now, but there are still a few ideas that might be worth exploring to compress things down even more:

  • Compressing the emoji UTF-8 strings using some sort of domain specific representation
    • e.g: it seems like most emoji fall under a small subrange of unicode codepoints, it might be possible to shave a couple bits of overhead from each emoji mapping by adding/removing an offset from the stored value.
  • Using non-standard key hash sizes (i.e: 9-bit, 10-bit, etc...).
    • Follow up: automatically trying out various key hash sizes to minimize space overhead while maintaining a favorable hash-collision rate.
  • Actually cleaning up this abomination of a codebase and releasing a proper library that employs these various techniques

Oh, and of course, I should probably port it to one of my keyboards. After all, that was the whole inspiration for this endeavor!

Acknowledgements

This project wouldn't have been possible without the incredible rust-phf library. The in-tree version of rust-phf is a stripped down version and heavily modified version the library, optimizing the map's internal representation for this particular use-case.

The initial POC of this project was based off of https://github.com/kornelski/gh-emoji.

The emoji shortcode database is downloaded directly from Github's gemoji library.

Special thanks to Matt D'Souza and Ethan Hardy, who I nerd-sniped into helping me with this funky little project.

About

A Quest to Find a Highly Compressed Emoji :shortcode: Lookup Function

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages