A PostgreSQL extension to detect the character set of a string-based column and convert it to UTF8. It returns the character set name that can then be used to convert the string data to a different character set, using the either the PostgreSQL convert(string bytea, ...) functions or the builtin convert_to_UTF8() db function. This extension is primarily useful if your database encoding is SQL_ASCII.
The pg_chardetect extension uses software from ICU (http://site.icu-project.org/) and Pavel Stehule's PLToolBox2 (http://pgfoundry.org/projects/pstcollection/).
Note this version is targeted specifically for PostgreSQL 9.0 on Ubuntu 10.0.4 LTS. Ports to other versions of PostgreSQL and OS's are welcome!
CHARACTER SET DETECTION IS NOT NECESSARILY DETERMINISTIC, PARTICULARLY FOR LATIN-BASED LANGUAGES. THE DETECTOR MAY MAKE THE WRONG GUESS AND TRANSCODE YOUR DATA INCORRECTLY! BE SURE TO FOLLOW PRUDENT PROCEDURES TO PROTECT YOUR DATA, INCLUDING BUT NOT LIMITED TO BACKUPS, HISTORY TABLES, AND THROWAWAY DATABASES!
YOU HAVE BEEN WARNED!
Follow the instructions at http://www.icu-project.org/repos/icu/icu/tags/release-54-1/readme.html to download and install the ICU libraries and headers. Alternatively, you can install the necessary ICU libraries and headers using a package manager, e.g:
sudo apt-get install -y libicu42 libicu42-dbg libicu-dev
Building the pg_chardetect PostgreSQL extension uses the PGXS extension framework and requires the PostgreSQL development packages be installed:
sudo apt-get install postgresql-9.0
sudo apt-get install postgresql-9.0-dbg
sudo apt-get install postgresql-contrib-9.0
sudo apt-get install postgresql-dev
sudo apt-get install postgresql-server-dev-9.0
The ICU libraries are installed in /usr/local/lib
. Run
sudo ldconfig -v | grep icu
to list the libraries in the linker's path. The output should contain libicu*.so
. If not, check the /etc/ld.so.conf
and /etc/ld.so.conf.d/*
files to verify /usr/local/lib
is included in the linker's configuration. Add /usr/local/lib
if necessary and rerun the ldconfig
command above to set the linker's cache.
To build the extension
mkdir pg_chardetect && cd pg_chardetect
git clone https://github.com/aweber/pg_chardetect.git
make clean && make && sudo make install
The supporting libraries above must be built and installed prior to building the pg_chardetect extension. The install step will copy pg_chardetect.sql
and pg_chardetect.so
to the appropriate locations to run under PostgreSQL.
To install the extension into a PostgreSQL database server run:
sudo su - postgres
createdb test
psql test -f $(pg_config --sharedir)/contrib/pg_chardetect.sql
As postgres load the test data:
cd pg_chardetect/test-data
zcat test.dump_p.gz | psql test
Run the following to observe the extension in action:
psql test -c "select original_encoding, language, convert_this, convert_this::bytea, char_set_detect(convert_this), convert_to_UTF8(convert_this, true) from test"
The output of char_set_detect(text)
is a (encoding name, language, confidence level)
tuple. The encoding name should be an IANA encoding name. ICU reports the language, if it can be determined. The confidence level ranges from 0 to 100, with 0 begin no confidence and 100 be absolute confidence.
The output of convert_to_UTF8(text) is, of course, the input text converted to UTF8, if possible. If not possible the original text is returned.
The query above should run without error. The ICU library may or may not report NULL for the charset detection tuple, depending on whether or not it could detect the character set.
Example usage of the pg_chardetect db functions can be found in test-data/pg_chardetect-test.sql, including a trigger function template for automatic conversion during inserts and updates. The update technique will bloat your tables, so be sure to (auto)vacuum well and often!
Please contribute further tests with known character sets and expected results. The more thoroughly tested this module is the better.