-
-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Set charset/collation properly for each text column if using MySQL. #414
Conversation
This is only for Ruby >=2.0 because I used Module#prepend in monkey patch, and it effectively does nothing on Ruby 1.9. |
Hmm, the error above looks unfamiliar. There might be a strange situation in the migration chain. |
Very interesting approach @knu! Will this work on Ruby <2.0? Do you think we should pull this into a gem and require it? That might make this useful for other projects too. |
Going forward, what do our migrations need to contain to have good UTF8 support? |
@knu, please see my questions above when you have time. |
You can make it work in Ruby 1.9.x by replacing prepend with the alias_method_chain technique, but since the 1.9 series of Ruby will be dying in about six month I'd rather urge users to move to a newer version of Ruby.
Maybe, but this is just a local hack after all. Naming it something like "per-column charset support for ActiveRecord/MySQL" would be a hype when it only works on DDL (i.e. migration) and not on DML, which means you can only practically use it for mixing utf8mb4 with utf8 (and ascii). It would be a long way to make ActiveRecord and the underlying MySQL driver(s) support operating on a table with columns of mixed charsets properly. So, at the moment, while this patch works for many projects including Huginn, I'm still kind of reluctant to release it as a gem as it is now. I'll have to watch the Rails development closely to figure out how to integrate this into AR. |
Good to know. It'd be nice to maintain support for Ruby 1.9 for the time being, since we want Huginn to be as accessible as possible. |
For what it's worth, one of the most common platforms these days is Ubuntu 14.04 LTS (EC2 and Digital Ocean come immediately to mind), which packages Ruby v1.9 and it's notoriously problematic to migrate it to Ruby v2.0. Being able to run Huginn with Ruby v1.9 until the next release of Ubuntu server comes out would be very, very helpful. |
Yea, I'm not comfortable dropping 1.9 yet. |
Is this done @knu? |
@cantino Yes, the utf8mb4 charset support has been extended to cover ruby 1.9 by the last commit. |
Conflicts: Gemfile.lock
Should we be using |
options.update(charset: 'utf8', collation: 'utf8_general_ci') | ||
case name | ||
when 'username' | ||
options.update(limit: 767 / 4, charset: 'utf8mb4', collation: 'utf8mb4_general_ci') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you explain again why we treat username differently from the other columns, or why all columns can't be utf8mb4_bin
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is used for login ID, so I thought it should be less sensitive to cases and in terms of uniqueness.
Testing a deploy now to my setup. |
@cantino Maybe. They are both broken depending on the language used. I chose For now, few users would dare to use non-ASCII characters in their username, so I can live with |
Oh oh.
|
What version of mysql does this require? |
Oh. Looks like it's from MySQL 5.5. |
Hmm. I what should we do for older versions? |
We could use binary (blob) types instead and implement serializer/deserializer, I guess. As for JSON fields, we have JsonSerializedField to tweak. I smell of unexpected encoding errors everywhere until we could finally get it to work, though... |
Yes, that sounds difficult. How hard would it be to only run the migration on supported versions of MySQL, and make sure schema.rb in the repo works okay on older mysql versions -- that is, your library should just ignore collation commands if mysql cannot handle them? Does feel like maybe pulling it into a gem would be good. |
@cantino I've added a quick hack to replace utf8mb4 with utf8 if the server has no support for utf8mb4, with which the stock schema.rb will work on any version of MySQL. |
Unfortunately, I'm still getting the same error:
When I run One other question: once this is merged, do we have to do anything differently now when we add new tables or columns? How do we set the charset and collation correctly going forward? |
Conflicts: db/schema.rb
I think you're good to merge @knu. Going forward, will usages of |
@cantino The charset fallback will be effective in any future migration coumn definitions. |
Set charset/collation properly for each text column if using MySQL.
Awesome, nice work :) |
I am having problems with switching to utf8mb4.
what did I miss? |
It is generally not a good idea to build everything in utf8mb4 by making it the default character set on the server side. Check out this comment. You could still extend the maximum key size by tweaking InnoDB parameters, though. |
Thank you for this information. I disabled InnoDB and switched to MyISAM to save on RAM... was this a wrong decision? Also, do I need to do any more changed to huginn or is the code ready for this utf8mb4 tweak? These changes are inside my my.cnf: and in the .env file I added this: thanks for any help :) |
You don't need to change the server settings for this. Just let it use utf8. It's the client side that specifies character sets in DDL as necessary. |
I removed the server settings and restarted mysql. This is what SHOW VARIABLES; says about my database encoding now:
I still get the same error running sudo RAILS_ENV=production bundle exec rake db:migrate --trace
running SHOW CREATE DATABASE huginn_production; gives me the following output:
running SHOW CREATE TABLE users; gives me the following output:
I did not understand the comments... where should I make the changes ? |
Maybe you need to set DATABASE_ENCODING to utf8 when creating a database and then change it to utf8mb4 when running the app. I'll check out and fix that. |
This is because we specify the `utf8mb4` charset in some of the columns and expect others to be defaulted to `utf8`. I'd like to make this a feature of `ar_mysql_column_charset`. If you want to set a table charset, you can always pass a keyword option `options: "default charset=utf8mb4"` to `create_table`. Complements #414.
The above commit should fix the problem. Please set DATABASE_ENCODING to utf8mb4 and start over from creating a new database. (either |
in my .env file I have set:
also I included your commit. Now bundle exec rake db:setup --trace shows me more output but still I get the error:
|
What does |
thanks for the help.
|
That's oh. Running |
ok I will look into this , thanks :) |
I must have messed up the settings somewhere because rake db:create makes the default to utf8mb4.
after this db:migrate and db:seed worked with no errors. I will test and report :) |
thanks for the help it works now.. But even after applying the change I still get the old error with the website agent. I found out that the square symbol is what causes the data error in the rss feed: Summe Globalstrahlung: 1,89332kWh/m² In the agent log I get this error:
Is this another issue or does it count into this thread? |
"\xC2" is not a valid byte sequence in UTF-8, so I don't think it is related to this issue. Maybe the WebSiteAgent failed to detect the encoding of the feed. Try |
thanks a bunch for the tip - now it works : ) |
@othreed Great! Can you share the agent's options? The URL seems to return a Content-Type with charset=utf-8, so force_encoding shouldn't be essentially necessary. Also, I'd like to know the ruby version you are using for the create database issue above to reproduce for me. |
ruby 1.9.3p194 (2012-04-20 revision 35410) [arm-linux-eabihf] Website Agent for Local_Weather_Feed
|
Thanks! I'll look into this tomorrow. |
With this change, Huginn is able to store up to 4-byte UTF-8 characters
in its database. This should fix #286.