A program to detect the encoding of a text file.

tellenc

Overview

Tellenc is a program to detect the encoding of a text file. Its usage is very simple:

tellenc [-v] <filename>

One file name should be provided. The ‘-v’ option makes tellenc generate verbose output, which may help the user understand how it works and provides clues for extending the program. It currently detects the following encodings:

  • ASCII
  • UTF-8
  • UTF-16/32 (little-endian or big-endian)
  • Latin1
  • Windows-1250
  • Windows-1252
  • CP437
  • GB2312
  • GBK
  • Big5
  • SJIS
  • EUC-JP
  • EUC-KR
  • KOI8-R

Extending tellenc

Extending this program should be easy. Here are the steps:

  1. Find some text representative of the language
  2. Save the text in the appropriate legacy encoding
  3. Run tellenc with the ‘-v’ option and the text file created above
  4. Examine the output and choose the double-byte values that occur with high frequency and are also unique (not already in freq_analysis_data in the source code)
  5. Add the value pair { code, encoding_name } to freq_analysis_data in the source code

You are welcome to send me patches. Be sure to send me the test text file, too.

Building tellenc

Tellenc requires only a C++98-conformant compiler and has no other library dependencies. Here are a few possible command lines for different compilers.

MSVC (Windows):

cl /EHsc /Ox tellenc.cpp

GCC (Linux):

g++ -O2 tellenc.cpp -o tellenc -s

Clang (Mac):

clang++ -O2 tellenc.cpp -o tellenc

Previously I could get a very small executable with MSVC 6 + STLport 4.5.1:

cl /Ox /GX /Gr /G6 /MD /D_STLP_NO_IOSTREAMS tellenc.cpp /link /opt:nowin98

However, MSVC 6 is just too obsolete, and it does not accept the UTF-8 BOM character. I no longer maintain this build environment.

I can still get a fairly small Windows executable with MSVC 7.1 + STLport 5.1.0 (less than half the size of the executable generated by a more modern compiler, if the result depends only on system DLLs):

cl /Ox /GX /Gr /G7 /D_STLP_NO_IOSTREAMS tellenc.cpp /link /opt:nowin98

It probably does not matter, unless you like small sizes very much. :-)