fuww/ftfy


Ftfy — fixes text for you

An Elixir port of the Python ftfy library (version 6.3.1). It takes in broken Unicode text and makes it less broken — most importantly, it detects and fixes mojibake (text that was decoded in the wrong encoding).

iex> Ftfy.fix_text("âœ” No problems")
"✔ No problems"

iex> Ftfy.fix_text("Broken text… it’s flubberific!")
"Broken text… it's flubberific!"

iex> Ftfy.fix_text("ＬＯＵＤ　ＮＯＩＳＥＳ")
"LOUD NOISES"

iex> Ftfy.fix_encoding_and_explain("sÃ³")
{"só", [{"encode", "latin-1"}, {"decode", "utf-8"}]}
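The explanation returned by fix_encoding_and_explain is a plan of encode and decode steps. As a rough illustration of what replaying such a plan means, the sketch below undoes one specific kind of mojibake (UTF-8 bytes that were misread as Latin-1) using only the standard library. MojibakeSketch and undo_latin1_mojibake/1 are hypothetical names, not part of Ftfy, and the real library picks among many encodings with heuristics that this sketch does not attempt.

```elixir
defmodule MojibakeSketch do
  # Hypothetical helper for illustration only; not part of Ftfy.
  #
  # Step 1 ("encode latin-1"): turn each codepoint back into the single byte
  # it would have been in Latin-1.
  # Step 2 ("decode utf-8"): accept the result only if those bytes form
  # valid UTF-8 and actually differ from the input.
  def undo_latin1_mojibake(text) when is_binary(text) do
    case :unicode.characters_to_binary(text, :utf8, :latin1) do
      bytes when is_binary(bytes) ->
        if String.valid?(bytes) and bytes != text do
          {:ok, bytes}
        else
          :unchanged
        end

      # {:error, _, _} or {:incomplete, _, _}: some codepoint does not
      # exist in Latin-1, so this plan cannot apply.
      _other ->
        :unchanged
    end
  end
end

# "s\u00C3\u00B3" is the mojibake string "sÃ³".
MojibakeSketch.undo_latin1_mojibake("s\u00C3\u00B3")
#=> {:ok, "só"}
```

Text that is already correct, such as "só", comes back as :unchanged because its Latin-1 bytes are not valid UTF-8.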

What it does

Ftfy.fix_text/2 runs a sequence of fixes, each individually configurable via Ftfy.TextFixerConfig:

  • fix_encoding — detect mojibake and undo it by re-encoding and re-decoding through the right pair of encodings (the heart of ftfy), including the sub-fixes restore_byte_a0, replace_lossy_sequences, decode_inconsistent_utf8, and fix_c1_controls
  • unescape_html — decode HTML entities (&amp;, &eacute;, &rsquo;, …)
  • remove_terminal_escapes — strip ANSI color codes
  • fix_latin_ligatures — expand Latin ligatures (ﬁ → fi)
  • fix_character_width — fullwidth/halfwidth → standard width
  • uncurl_quotes — curly quotes → straight quotes
  • fix_line_breaks — CRLF, CR, LS, PS, NEL → \n
  • fix_surrogates — repair UTF-16 surrogate pairs
  • remove_control_chars — strip useless control characters
  • Unicode normalization (NFC by default)
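Two of the simpler steps in the list above, line-break fixing and NFC normalization, can be approximated with the standard library alone. The sketch below is only an illustration of what those steps do; LineBreakSketch and its functions are hypothetical names, and Ftfy's own implementation may differ.

```elixir
defmodule LineBreakSketch do
  # Hypothetical helper for illustration only; not part of Ftfy.

  # Replace CRLF, CR, NEL (U+0085), LS (U+2028) and PS (U+2029) with "\n".
  # CRLF is listed first so it becomes one newline instead of two.
  def fix_line_breaks(text) when is_binary(text) do
    String.replace(text, ~r/\r\n|[\r\x{85}\x{2028}\x{2029}]/u, "\n")
  end

  # Compose characters into NFC form, e.g. "e" + U+0301 becomes U+00E9.
  def normalize_nfc(text) when is_binary(text) do
    String.normalize(text, :nfc)
  end
end

LineBreakSketch.fix_line_breaks("one\r\ntwo\rthree")
#=> "one\ntwo\nthree"
```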

Other entry points mirror the Python API: fix_and_explain/2, fix_encoding/2, fix_encoding_and_explain/2, fix_text_segment/2, apply_plan/2, guess_bytes/1, fix_file/2, and explain_unicode/1. The Ftfy.Fixes, Ftfy.Badness, Ftfy.Chardata, Ftfy.Codecs, and Ftfy.Formatting modules expose the lower-level building blocks.

Configuration

Pass a keyword list or a %Ftfy.TextFixerConfig{}:

Ftfy.fix_text(text, uncurl_quotes: false)
Ftfy.fix_text(text, %Ftfy.TextFixerConfig{normalization: "NFKC"})
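Accepting either a keyword list or a struct is a common Elixir pattern. The sketch below shows one way such an argument can be normalized; ConfigSketch is a hypothetical stand-in with only two fields (named after the options shown above), not Ftfy.TextFixerConfig itself, and the defaults are assumptions for the example.

```elixir
defmodule ConfigSketch do
  # Hypothetical two-field config for illustration only.
  defstruct uncurl_quotes: true, normalization: "NFC"

  # A struct is passed through untouched.
  def new(%__MODULE__{} = config), do: config

  # A keyword list overrides the defaults; struct!/2 raises KeyError
  # on an unknown option, so typos are caught early.
  def new(opts) when is_list(opts), do: struct!(__MODULE__, opts)
end

config = ConfigSketch.new(uncurl_quotes: false)
config.uncurl_quotes
#=> false
config.normalization
#=> "NFC"
```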

Command line

Build the escript and fix text from a file or stdin:

mix escript.build
echo '✔ No problems' | ./ftfy
./ftfy -e latin-1 broken.txt -o fixed.txt

Installation

Add ftfy to your dependencies in mix.exs:

def deps do
  [
    {:ftfy, "~> 0.1.0"}
  ]
end

Notes on the port

  • The encoding-detection data tables (HTML entities, the single-byte charmap encodings, the fullwidth/halfwidth map, the wcwidth width tables) and the two large heuristic regexes are generated from the reference implementation by scripts/gen_data.py into the Ftfy.Data module (internal, undocumented). The reference package is vendored as a git submodule at vendor/python-ftfy (pinned to the v6.3.1 tag); run git submodule update --init before regenerating.
  • Ftfy.Codecs reimplements Python's bad_codecs: the sloppy-windows-* and related charmap encodings, and the utf-8-variants (CESU-8 / Java modified UTF-8) decoder, including incremental decoding.
  • The behavioral test corpus is read directly from the pinned vendor/python-ftfy submodule (tests/test_cases.json); the unit tests are ported from python-ftfy. All 151 "pass" cases and 10 "known failure" cases match the reference. (Running the tests therefore needs the submodule: git submodule update --init.)
  • Two deliberate differences: the BEAM cannot represent lone UTF-16 surrogate codepoints in a binary, so Ftfy.Fixes.fix_surrogates/1 is effectively a no-op on valid strings; and explain_unicode/1 omits the Unicode character name, because the BEAM has no names database.
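For readers unfamiliar with the CESU-8 variant mentioned in the notes above: it stores a character outside the Basic Multilingual Plane as two 3-byte sequences, one per UTF-16 surrogate, instead of a single 4-byte UTF-8 sequence. The sketch below shows the idea with binary pattern matching. Cesu8Sketch is a hypothetical module, not the Ftfy.Codecs implementation; it passes every other byte through unchanged and does no incremental decoding or error handling.

```elixir
defmodule Cesu8Sketch do
  # Hypothetical decoder for illustration only; not part of Ftfy.
  import Bitwise

  # A high surrogate (U+D800..U+DBFF) is encoded as ED A0..AF 80..BF and a
  # low surrogate (U+DC00..U+DFFF) as ED B0..BF 80..BF. Match both at once.
  def decode(
        <<0xED, 0b1010::4, a::4, 0b10::2, b::6,
          0xED, 0b1011::4, c::4, 0b10::2, d::6, rest::binary>>
      ) do
    high = 0xD800 + (a <<< 6) + b
    low = 0xDC00 + (c <<< 6) + d
    # Standard UTF-16 surrogate pair to codepoint formula.
    codepoint = 0x10000 + ((high - 0xD800) <<< 10) + (low - 0xDC00)
    <<codepoint::utf8>> <> decode(rest)
  end

  # Any other byte is copied as-is.
  def decode(<<byte, rest::binary>>), do: <<byte>> <> decode(rest)
  def decode(<<>>), do: <<>>
end

# U+1F600 as a CESU-8 surrogate pair (D83D, DE00):
Cesu8Sketch.decode(<<0xED, 0xA0, 0xBD, 0xED, 0xB8, 0x80>>)
#=> "\u{1F600}"
```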

License and credits

This library is a port of ftfy ("fixes text for you"), created by Robyn Speer. ftfy is the result of years of careful work on the messy reality of broken Unicode, and this Elixir port exists only because of it — our deepest thanks to Robyn Speer for building and maintaining the original, and for releasing it under a permissive license.

  • Original ftfy: Copyright 2023 Robyn Speer, licensed under the Apache License, Version 2.0 — https://github.com/rspeer/python-ftfy
  • This Elixir port: Copyright 2026 FashionUnited, also licensed under the Apache License, Version 2.0.

The data tables and test corpus in this repository are generated from / ported directly from python-ftfy 6.3.1 and remain the work of the original author. See LICENSE for the full license text and NOTICE for the attribution and change notice required by the Apache License.

If you use ftfy in research, please cite the original author's work as described at https://github.com/rspeer/python-ftfy.
