Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Set charset/collation properly for each text column if using MySQL. #414

Merged
merged 6 commits into from Aug 30, 2014
Merged

Set charset/collation properly for each text column if using MySQL. #414

merged 6 commits into from Aug 30, 2014

Conversation

knu
Copy link
Member

@knu knu commented Jul 25, 2014

With this change, Huginn is able to store up to 4-byte UTF-8 characters
in its database. This should fix #286.

@knu
Copy link
Member Author

knu commented Jul 25, 2014

This is only for Ruby >=2.0 because I used Module#prepend in monkey patch, and it effectively does nothing on Ruby 1.9.
The migration is skipped for non-mysql database adapters, so I think it's safe for everyone.

@knu
Copy link
Member Author

knu commented Jul 25, 2014

Hmm, the error above looks unfamiliar. There might be a strange situation in the migration chain.

@coveralls
Copy link

Coverage Status

Coverage decreased (-0.42%) when pulling 3f4b640 on knu:set_charset_for_mysql into 67fcfef on cantino:master.

@cantino
Copy link
Member

cantino commented Jul 26, 2014

Very interesting approach @knu!

Will this work on Ruby <2.0?

Do you think we should pull this into a gem and require it? That might make this useful for other projects too.

@cantino
Copy link
Member

cantino commented Jul 26, 2014

Going forward, what do our migrations need to contain to have good UTF8 support?

@cantino
Copy link
Member

cantino commented Aug 9, 2014

@knu, please see my questions above when you have time.

@knu
Copy link
Member Author

knu commented Aug 12, 2014

Will this work on Ruby <2.0?

You can make it work in Ruby 1.9.x by replacing prepend with the alias_method_chain technique, but since the 1.9 series of Ruby will be dying in about six month I'd rather urge users to move to a newer version of Ruby.

Do you think we should pull this into a gem and require it? That might make this useful for other projects too.

Maybe, but this is just a local hack after all. Naming it something like "per-column charset support for ActiveRecord/MySQL" would be a hype when it only works on DDL (i.e. migration) and not on DML, which means you can only practically use it for mixing utf8mb4 with utf8 (and ascii).

It would be a long way to make ActiveRecord and the underlying MySQL driver(s) support operating on a table with columns of mixed charsets properly. So, at the moment, while this patch works for many projects including Huginn, I'm still kind of reluctant to release it as a gem as it is now.

I'll have to watch the Rails development closely to figure out how to integrate this into AR.

@cantino
Copy link
Member

cantino commented Aug 12, 2014

Good to know. It'd be nice to maintain support for Ruby 1.9 for the time being, since we want Huginn to be as accessible as possible.

@virtadpt
Copy link
Collaborator

For what it's worth, one of the most common platforms these days is Ubuntu 14.04 LTS (EC2 and Digital Ocean come immediately to mind), which packages Ruby v1.9 and it's notoriously problematic to migrate it to Ruby v2.0. Being able to run Huginn with Ruby v1.9 until the next release of Ubuntu server comes out would be very, very helpful.

@cantino
Copy link
Member

cantino commented Aug 13, 2014

Yea, I'm not comfortable dropping 1.9 yet.

@coveralls
Copy link

Coverage Status

Coverage decreased (-0.4%) when pulling 2fdaab5 on knu:set_charset_for_mysql into d7c5417 on cantino:master.

@coveralls
Copy link

Coverage Status

Coverage decreased (-0.36%) when pulling 2302a59 on knu:set_charset_for_mysql into d7c5417 on cantino:master.

@cantino
Copy link
Member

cantino commented Aug 20, 2014

Is this done @knu?

@knu
Copy link
Member Author

knu commented Aug 20, 2014

@cantino Yes, the utf8mb4 charset support has been extended to cover ruby 1.9 by the last commit.

@cantino
Copy link
Member

cantino commented Aug 21, 2014

Should we be using utf8-unicode-ci instead of utf8-general-ci?

http://stackoverflow.com/questions/1036454/what-are-the-diffrences-between-utf8-general-ci-and-utf8-unicode-ci

options.update(charset: 'utf8', collation: 'utf8_general_ci')
case name
when 'username'
options.update(limit: 767 / 4, charset: 'utf8mb4', collation: 'utf8mb4_general_ci')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you explain again why we treat username differently from the other columns, or why all columns can't be utf8mb4_bin?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is used for login ID, so I thought it should be less sensitive to cases and in terms of uniqueness.

@cantino
Copy link
Member

cantino commented Aug 21, 2014

Testing a deploy now to my setup.

@knu
Copy link
Member Author

knu commented Aug 21, 2014

@cantino Maybe. They are both broken depending on the language used. I chose utf8_general_ci because it's the default collation of MySQL. Actually, we avoid the use of utf8_unicode_ci here in Japan because Japanese words "haha" (mother), "papa" (dad) and "baba" (grandma) are all treated equal in comparison under that collation. But I imagine there'd be a similar kind of problem in other (Western) countries and that would be why Rails uses utf8_unicode_ci by default.

For now, few users would dare to use non-ASCII characters in their username, so I can live with utf8_unicode_ci.

@cantino
Copy link
Member

cantino commented Aug 21, 2014

Oh oh.

 ** [out :: aicomplete.com] == 20140813110107 SetCharsetForMysql: migrating ===============================
 ** [out :: aicomplete.com] 
 ** [out :: aicomplete.com] -- change_column("agents", "options", :text, {:limit=>16777215, :null=>true, :default=>nil, :charset=>"utf8mb4", :collation=>"utf8mb4_bin"})
 ** [out :: aicomplete.com] 
 ** [out :: aicomplete.com] rake aborted!
 ** [out :: aicomplete.com] StandardError: An error has occurred, all later migrations canceled:
 ** [out :: aicomplete.com] 
 ** [out :: aicomplete.com] Mysql2::Error: Unknown character set: 'utf8mb4': ALTER TABLE `agents` CHANGE `options` `options` mediumtext CHARACTER SET utf8mb4 COLLATE utf8mb4_bin DEFAULT NULL

@cantino
Copy link
Member

cantino commented Aug 21, 2014

What version of mysql does this require?

@knu
Copy link
Member Author

knu commented Aug 21, 2014

Oh. Looks like it's from MySQL 5.5.

@cantino
Copy link
Member

cantino commented Aug 21, 2014

Hmm. I what should we do for older versions?

@knu
Copy link
Member Author

knu commented Aug 21, 2014

We could use binary (blob) types instead and implement serializer/deserializer, I guess. As for JSON fields, we have JsonSerializedField to tweak. I smell of unexpected encoding errors everywhere until we could finally get it to work, though...

@cantino
Copy link
Member

cantino commented Aug 21, 2014

Yes, that sounds difficult. How hard would it be to only run the migration on supported versions of MySQL, and make sure schema.rb in the repo works okay on older mysql versions -- that is, your library should just ignore collation commands if mysql cannot handle them?

Does feel like maybe pulling it into a gem would be good.

@knu
Copy link
Member Author

knu commented Aug 21, 2014

@cantino I've added a quick hack to replace utf8mb4 with utf8 if the server has no support for utf8mb4, with which the stock schema.rb will work on any version of MySQL.

@coveralls
Copy link

Coverage Status

Coverage decreased (-0.55%) when pulling f4f6930 on knu:set_charset_for_mysql into d7c5417 on cantino:master.

@cantino
Copy link
Member

cantino commented Aug 22, 2014

Unfortunately, I'm still getting the same error:

** [out :: aicomplete.com] == 20140813110107 SetCharsetForMysql: migrating ===============================
 ** [out :: example.com] 
 ** [out :: example.com] -- change_column("agents", "options", :text, {:limit=>16777215, :null=>true, :default=>nil, :charset=>"utf8mb4", :collation=>"utf8mb4_bin"})
 ** [out :: example.com] 
 ** [out :: example.com] rake aborted!
 ** [out :: example.com] StandardError: An error has occurred, all later migrations canceled:
 ** [out :: example.com] 
 ** [out :: example.com] Mysql2::Error: Unknown character set: 'utf8mb4': ALTER TABLE `agents` CHANGE `options` `options` mediumtext CHARACTER SET utf8mb4 COLLATE utf8mb4_bin DEFAULT NULL
 ** [out :: example.com] /home/huginn/app/shared/bundle/ruby/1.9.1/gems/activerecord-4.1.5/lib/active_record/connection_adapters/abstract_mysql_

When I run ActiveRecord::Base.connection.utf8mb4_supported?, I get false.

One other question: once this is merged, do we have to do anything differently now when we add new tables or columns? How do we set the charset and collation correctly going forward?

@coveralls
Copy link

Coverage Status

Coverage decreased (-0.34%) when pulling 857b8ea on knu:set_charset_for_mysql into 0227496 on cantino:master.

@cantino
Copy link
Member

cantino commented Aug 30, 2014

I think you're good to merge @knu. Going forward, will usages of utf8mb4 cause errors on older MySQL servers if we use it in future migration column definitions, or will your code translate these to utf8 when the server can't support it?

@knu
Copy link
Member Author

knu commented Aug 30, 2014

@cantino The charset fallback will be effective in any future migration coumn definitions.

knu added a commit that referenced this pull request Aug 30, 2014
Set charset/collation properly for each text column if using MySQL.
@knu knu merged commit ac7cd8c into huginn:master Aug 30, 2014
@cantino
Copy link
Member

cantino commented Aug 30, 2014

Awesome, nice work :)

@knu knu deleted the set_charset_for_mysql branch September 4, 2014 04:07
@othreed
Copy link
Contributor

othreed commented Sep 30, 2014

I am having problems with switching to utf8mb4.
I deleted my old database and did the changed mentioned -> changed the mysql configuration in my.cnf to use the utf8mb4 as a default and also changed the DATABASE_ENCODING to utf8mb4
in the .env file.
When I run rake db:migrate I get the following error for the index.

== 20120728210244 DeviseCreateUsers: migrating ================================ create_table(:users) -> 0.0348s -- add_index(:users, :email, {:unique=>true}) rake aborted! StandardError: An error has occurred, all later migrations canceled: Mysql2::Error: Specified key was too long; max key length is 1000 bytes: CREATE UNIQUE INDEX index_users_on_email

what did I miss?
thank you for your help

@knu
Copy link
Member Author

knu commented Oct 1, 2014

It is generally not a good idea to build everything in utf8mb4 by making it the default character set on the server side. Check out this comment. You could still extend the maximum key size by tweaking InnoDB parameters, though.

@othreed
Copy link
Contributor

othreed commented Oct 1, 2014

Thank you for this information. I disabled InnoDB and switched to MyISAM to save on RAM... was this a wrong decision? Also, do I need to do any more changed to huginn or is the code ready for this utf8mb4 tweak?

These changes are inside my my.cnf:
character_set_server=utf8mb4
collation_server=utf8mb4_unicode_ci
skip-innodb
default-storage-engine = myisam

and in the .env file I added this:
''DATABASE_ENCODING=utf8mb4

thanks for any help :)

@knu
Copy link
Member Author

knu commented Oct 1, 2014

You don't need to change the server settings for this. Just let it use utf8. It's the client side that specifies character sets in DDL as necessary.

@othreed
Copy link
Contributor

othreed commented Oct 1, 2014

I removed the server settings and restarted mysql. This is what SHOW VARIABLES; says about my database encoding now:

character_set_client                              | latin1                                                                                                                 
character_set_connection                          | latin1                                                                                                                 
character_set_database                            | latin1                                                                                                                 character_set_filesystem                          | binary                                                                                                                
character_set_results                             | latin1                                                                                                                 
character_set_server                              | latin1                                                                                                                 
character_set_system                              | utf8                                                                                                                   
character_sets_dir                                | /usr/share/mysql/charsets/                                                                                             
collation_connection                              | latin1_swedish_ci                                                                                                      
collation_database                                | latin1_swedish_ci                                                                                                      
collation_server                                  | latin1_swedish_ci

I still get the same error running sudo RAILS_ENV=production bundle exec rake db:migrate --trace

== 20120728210244 DeviseCreateUsers: migrating ================================
-- create_table(:users)
   -> 0.0372s
-- add_index(:users, :email, {:unique=>true})
rake aborted!
StandardError: An error has occurred, all later migrations canceled:

Mysql2::Error: Specified key was too long; max key length is 1000 bytes: CREATE UNIQUE INDEX `index_users_on_email`  ON `users` (`email`) /home/pi/app/shared/bundle/ruby/1.9.1/gems/activerecord-4.1.5/lib/active_record/connection_adapters/abstract_mysql_adapter.rb:303:in `query'

running SHOW CREATE DATABASE huginn_production; gives me the following output:

`CREATE DATABASE `huginn_production` /*!40100 DEFAULT CHARACTER SET utf8mb4 */

running SHOW CREATE TABLE users; gives me the following output:

| users | CREATE TABLE `users` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `email` varchar(255) NOT NULL DEFAULT '',
  `encrypted_password` varchar(255) NOT NULL DEFAULT '',
  `reset_password_token` varchar(255) DEFAULT NULL,
  `reset_password_sent_at` datetime DEFAULT NULL,
  `remember_created_at` datetime DEFAULT NULL,
  `sign_in_count` int(11) DEFAULT '0',
  `current_sign_in_at` datetime DEFAULT NULL,
  `last_sign_in_at` datetime DEFAULT NULL,
  `current_sign_in_ip` varchar(255) DEFAULT NULL,
  `last_sign_in_ip` varchar(255) DEFAULT NULL,
  `created_at` datetime DEFAULT NULL,
  `updated_at` datetime DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8mb4 |

I did not understand the comments... where should I make the changes ?

@knu
Copy link
Member Author

knu commented Oct 1, 2014

Maybe you need to set DATABASE_ENCODING to utf8 when creating a database and then change it to utf8mb4 when running the app. I'll check out and fix that.

knu added a commit that referenced this pull request Oct 1, 2014
This is because we specify the `utf8mb4` charset in some of the columns
and expect others to be defaulted to `utf8`.

I'd like to make this a feature of `ar_mysql_column_charset`.  If you
want to set a table charset, you can always pass a keyword option
`options: "default charset=utf8mb4"` to `create_table`.

Complements #414.
@knu
Copy link
Member Author

knu commented Oct 1, 2014

The above commit should fix the problem.

Please set DATABASE_ENCODING to utf8mb4 and start over from creating a new database. (either bundle exec rake db:create db:migrate or bundle exec rake db:setup)

@othreed
Copy link
Contributor

othreed commented Oct 1, 2014

in my .env file I have set:

DATABASE_ENCODING=utf8mb4

also I included your commit.

Now bundle exec rake db:setup --trace shows me more output but still I get the error:

** Invoke db:setup (first_time)
** Invoke db:schema:load_if_ruby (first_time)
** Invoke db:create (first_time)
** Invoke db:load_config (first_time)
** Execute db:load_config
** Execute db:create
** Invoke environment (first_time)
** Execute environment
** Execute db:schema:load_if_ruby
** Invoke db:schema:load (first_time)
** Invoke environment 
** Invoke db:load_config 
** Execute db:schema:load
-- enable_extension("plpgsql")
   -> 0.0098s
-- create_table("agent_logs", {:force=>true})
   -> 0.3347s
-- create_table("agents", {:force=>true})
   -> 0.0647s
-- add_index("agents", ["guid"], {:name=>"index_agents_on_guid", :using=>:btree})
   -> 0.5116s
-- add_index("agents", ["schedule"], {:name=>"index_agents_on_schedule", :using=>:btree})
   -> 0.0338s
-- add_index("agents", ["type"], {:name=>"index_agents_on_type", :using=>:btree})
   -> 0.0330s
-- add_index("agents", ["user_id", "created_at"], {:name=>"index_agents_on_user_id_and_created_at", :using=>:btree})
   -> 0.0333s
-- create_table("control_links", {:force=>true})
   -> 0.0353s
-- add_index("control_links", ["control_target_id"], {:name=>"index_control_links_on_control_target_id", :using=>:btree})
   -> 0.0294s
-- add_index("control_links", ["controller_id", "control_target_id"], {:name=>"index_control_links_on_controller_id_and_control_target_id", :unique=>true, :using=>:btree})
   -> 0.0307s
-- create_table("delayed_jobs", {:force=>true})
   -> 0.0410s
-- add_index("delayed_jobs", ["priority", "run_at"], {:name=>"delayed_jobs_priority", :using=>:btree})
   -> 0.0300s
-- create_table("events", {:force=>true})
   -> 0.0425s
-- add_index("events", ["agent_id", "created_at"], {:name=>"index_events_on_agent_id_and_created_at", :using=>:btree})
   -> 0.0283s
-- add_index("events", ["expires_at"], {:name=>"index_events_on_expires_at", :using=>:btree})
   -> 0.0310s
-- add_index("events", ["user_id", "created_at"], {:name=>"index_events_on_user_id_and_created_at", :using=>:btree})
   -> 0.0303s
-- create_table("links", {:force=>true})
   -> 0.0341s
-- add_index("links", ["receiver_id", "source_id"], {:name=>"index_links_on_receiver_id_and_source_id", :using=>:btree})
   -> 0.0272s
-- add_index("links", ["source_id", "receiver_id"], {:name=>"index_links_on_source_id_and_receiver_id", :using=>:btree})
   -> 0.0309s
-- create_table("scenario_memberships", {:force=>true})
   -> 0.0296s
-- add_index("scenario_memberships", ["agent_id"], {:name=>"index_scenario_memberships_on_agent_id", :using=>:btree})
   -> 0.0260s
-- add_index("scenario_memberships", ["scenario_id"], {:name=>"index_scenario_memberships_on_scenario_id", :using=>:btree})
   -> 0.0278s
-- create_table("scenarios", {:force=>true})
   -> 0.0424s
-- add_index("scenarios", ["user_id", "guid"], {:name=>"index_scenarios_on_user_id_and_guid", :unique=>true, :using=>:btree})
   -> 0.0331s
-- create_table("services", {:force=>true})
   -> 0.0457s
-- add_index("services", ["provider"], {:name=>"index_services_on_provider", :using=>:btree})
   -> 0.0276s
-- add_index("services", ["uid"], {:name=>"index_services_on_uid", :using=>:btree})
   -> 0.0316s
-- add_index("services", ["user_id", "global"], {:name=>"index_services_on_user_id_and_global", :using=>:btree})
   -> 0.0354s
-- create_table("user_credentials", {:force=>true})
   -> 0.0364s
-- add_index("user_credentials", ["user_id", "credential_name"], {:name=>"index_user_credentials_on_user_id_and_credential_name", :unique=>true, :using=>:btree})
rake aborted!
ActiveRecord::StatementInvalid: Mysql2::Error: Specified key was too long; max key length is 1000 bytes: CREATE UNIQUE INDEX `index_user_credentials_on_user_id_and_credential_name` USING btree ON `user_credentials` (`user_id`, `credential_name`) 

@knu
Copy link
Member Author

knu commented Oct 1, 2014

What does show create database huginn (or whatever you named it) say? The database is expected to be utf8.

@othreed
Copy link
Contributor

othreed commented Oct 1, 2014

thanks for the help.

huginn_production | CREATE DATABASE `huginn_production` /*!40100 DEFAULT CHARACTER SET utf8mb4 */ 

@knu
Copy link
Member Author

knu commented Oct 1, 2014

That's oh. Running rake db:create on the lastest master should create a database in utf8, not utf8mb4.

@othreed
Copy link
Contributor

othreed commented Oct 1, 2014

ok I will look into this , thanks :)

@othreed
Copy link
Contributor

othreed commented Oct 1, 2014

I must have messed up the settings somewhere because rake db:create makes the default to utf8mb4.
so I ran this:

ALTER DATABASE 'huginn_production' DEFAULT CHARACTER SET utf8;

after this db:migrate and db:seed worked with no errors. I will test and report :)

@othreed
Copy link
Contributor

othreed commented Oct 1, 2014

thanks for the help it works now..
I wanted to do this utf8mb4 because I thought it would help me with an error I have with a website agent processing a weather data rss.
http://www.geomar.de/service/wetter/feed/

But even after applying the change I still get the old error with the website agent.

I found out that the square symbol is what causes the data error in the rss feed: Summe Globalstrahlung: 1,89332kWh/m²

In the agent log I get this error:

"\xC2" from ASCII-8BIT to UTF-8

Is this another issue or does it count into this thread?

@knu
Copy link
Member Author

knu commented Oct 1, 2014

"\xC2" is not a valid byte sequence in UTF-8, so I don't think it is related to this issue.

Maybe the WebSiteAgent failed to detect the encoding of the feed. Try force_encoding: "utf8". (And file another issue if you still have a problem)

@othreed
Copy link
Contributor

othreed commented Oct 1, 2014

thanks a bunch for the tip - now it works : )

@knu
Copy link
Member Author

knu commented Oct 1, 2014

@othreed Great! Can you share the agent's options? The URL seems to return a Content-Type with charset=utf-8, so force_encoding shouldn't be essentially necessary. Also, I'd like to know the ruby version you are using for the create database issue above to reproduce for me.

@othreed
Copy link
Contributor

othreed commented Oct 1, 2014

ruby 1.9.3p194 (2012-04-20 revision 35410) [arm-linux-eabihf]
MySQL 5.5.38-0+wheezy1

Website Agent for Local_Weather_Feed

{
  "expected_update_period_in_days": "2",
  "force_encoding": "UTF-8",
  "url": "http://www.geomar.de/service/wetter/feed/",
  "type": "text",
  "mode": "on_change",
  "extract": {
    "weather_sensors": {
      "regexp": "([^:\\s<>][^:><]+): ([^<]+)",
      "index": 0
    }
  }
}

@knu
Copy link
Member Author

knu commented Oct 1, 2014

Thanks! I'll look into this tomorrow.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Issues with UTF-8 Character Sets
6 participants