Encoding
If you have problems with character encoding in your application (e.g. you see characters replaced with '?'), then I hope this page will be useful to you.
The first task is to load the dump into MySQL correctly. (Task zero is to download the parsed Wiktionary database from this site.)
I use the following commands in the MySQL client, started from the Windows command line (cmd.exe):
mysql> CREATE DATABASE enwikt20100824_parsed;
mysql> USE enwikt20100824_parsed
mysql> SOURCE path_to_file.sql
See, e.g.: [File wikt_parsed_empty_sql](File wikt_parsed_empty_sql)
My connection parameters (from Java source code):
"enwikt20111008_parsed?useUnicode=false&characterEncoding=ISO8859_1&autoReconnect=true&useUnbufferedInput=false";
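For context, the parameter string above could be plugged into a JDBC connection roughly as follows. This is only a sketch, not the project's actual code: the class name `WiktConnectionSketch`, the host `localhost`, and the user and password arguments are assumptions made for illustration. Only the database name and the query parameters are taken from the string above. Calling `connect` also requires the MySQL Connector/J driver on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

class WiktConnectionSketch {
    // Query parameters copied from the connection string above.
    static final String PARAMS =
        "useUnicode=false&characterEncoding=ISO8859_1"
        + "&autoReconnect=true&useUnbufferedInput=false";

    /** Builds a JDBC URL; host and dbName are supplied by the caller. */
    static String buildUrl(String host, String dbName) {
        return "jdbc:mysql://" + host + "/" + dbName + "?" + PARAMS;
    }

    /** Opens a connection; needs MySQL Connector/J on the classpath. */
    static Connection connect(String host, String dbName,
                              String user, String password) throws SQLException {
        return DriverManager.getConnection(buildUrl(host, dbName), user, password);
    }

    public static void main(String[] args) {
        // "localhost" is an assumed host, used only for illustration.
        System.out.println(buildUrl("localhost", "enwikt20111008_parsed"));
    }
}
```

Keeping the URL construction in its own method makes it easy to print and check the exact string the driver receives when debugging encoding problems.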
The parsed Wiktionary database in MySQL has the following character set variables:
mysql> SHOW VARIABLES LIKE 'character_set%';
Variable_name | Value |
---|---|
character_set_client | latin1 |
character_set_connection | latin1 |
character_set_database | latin1 |
character_set_filesystem | binary |
character_set_results | latin1 |
character_set_server | latin1 |
character_set_system | utf8 |
As far as I understand, all text (in the Wikipedia and Wiktionary databases) is stored in binary format, i.e. as raw bytes.
MySQL, however, assumes (see the table above) that the data are stored in the latin1 format.
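The effect of this mismatch can be reproduced without a database. The sketch below assumes the stored bytes are UTF-8 and uses Java's ISO-8859-1 as a stand-in for MySQL's latin1. The class and method names are hypothetical. It shows how UTF-8 bytes read as latin1 turn into garbled text, and how the original text comes back once the same bytes are decoded as UTF-8.

```java
import java.nio.charset.StandardCharsets;

class MojibakeSketch {
    /** Simulates reading UTF-8 bytes as if they were latin1 text. */
    static String misreadAsLatin1(String original) {
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    /** Undoes the misreading: recover the bytes, then decode them as UTF-8. */
    static String repair(String garbled) {
        byte[] raw = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Russian for "word", written with escapes to keep the source ASCII-only.
        String word = "\u0441\u043b\u043e\u0432\u043e";
        String garbled = misreadAsLatin1(word);
        // Each Cyrillic letter takes two UTF-8 bytes, so the garbled
        // string has twice as many characters as the original.
        System.out.println(garbled.length());
        System.out.println(repair(garbled).equals(word));
    }
}
```

The round trip is lossless here because ISO-8859-1 in Java maps every byte value to exactly one character. Other single-byte encodings may not have this property.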
So in Java code I use the following function to decode the text from raw bytes to UTF-8:
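str_sql.append("SELECT text FROM wiki_text WHERE id=");
str_sql.append(id);
ResultSet rs = s.executeQuery(str_sql.toString());
String text = null;
if (rs.next()) {  // move the cursor to the first row before reading
    // read the column as raw bytes rather than as a String
    text = bytesToUTF8(rs.getBytes("text"));
}
...
/** Decodes raw bytes as UTF-8 text. */
public static String bytesToUTF8(byte[] bytes) {
    return bytesTo(bytes, "UTF8");
}
/** Decodes raw bytes using the named encoding. */
public static String bytesTo(byte[] bytes, String encode) {
    try {
        return new String(bytes, encode);
    } catch (UnsupportedEncodingException e) {
        e.printStackTrace();
        return EMPTY_STRING;  // constant defined elsewhere in the class
    }
}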
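On Java 7 or later, the same decoding can probably be written more compactly with `StandardCharsets.UTF_8`, which removes the need to catch `UnsupportedEncodingException`. The following is a sketch of that variant, not the code used in the project. The class name `Utf8DecodeSketch` and the choice to return an empty string for a `null` column value are assumptions.

```java
import java.nio.charset.StandardCharsets;

class Utf8DecodeSketch {
    /** Decodes raw column bytes as UTF-8; a null value yields an empty string. */
    static String bytesToUTF8(byte[] bytes) {
        if (bytes == null) {
            // ResultSet.getBytes returns null for SQL NULL (assumed handling).
            return "";
        }
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // 0xD1 0x81 is the UTF-8 encoding of the Cyrillic letter U+0441.
        byte[] raw = {(byte) 0xD1, (byte) 0x81};
        System.out.println(bytesToUTF8(raw).equals("\u0441"));
    }
}
```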
I hope this helps you find a solution in your programming language.
The following is taken from a letter by a user who successfully solved the encoding problems.
The user worked with the Mac MySQL frontend Sequel (similar to MySQL Workbench, at least in its basic features).
The following variables were used (character_set_client and character_set_database were originally "binary" as well, but somehow changed to utf8):
- character_set_client utf8
- character_set_connection binary
- character_set_database utf8
- character_set_filesystem binary
- character_set_results latin1
- character_set_server binary
- character_set_system utf8
According to this article (Bertheau, listed at the end of this page), the variable "character_set_server" only determines the default encoding of new databases.
Since a non-default encoding was chosen when the database was created (in order to load the parsed Wiktionary database dump file into it), the above setting of character_set_server should be irrelevant.
The database was created with the encoding "cp1252 West European (latin1)". Then the dump file, with the encoding "Western (ISO Latin 1)", was imported into the MySQL database.
- Markus Bertheau. MySQL and UTF-8 — no more question marks!