fix: remove character set and collate column info by default #9316

Merged
merged 3 commits into from Mar 17, 2020

Conversation

villebro (Member)

CATEGORY

Choose one

  • Bug Fix
  • Enhancement (new features, refinement)
  • Refactor
  • Add tests
  • Build / Development Environment
  • Documentation

SUMMARY

It is common for SQL engines to offer the option to define a custom character set and collation scheme for columns. The convention is to define this using syntax along the lines of the following MySQL example: VARCHAR(255) CHARACTER SET LATIN1 COLLATE UTF8MB4_GENERAL_CI.

Currently Superset already removes collation info but retains the character set info. This often causes an overflow when saving metadata, because the type column is only 32 characters wide (VARCHAR(32)). This PR makes removal of both collation and character set info the new default behaviour.
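
To illustrate the intended default behaviour, here is a minimal sketch (not the actual diff in this PR; strip_charset_and_collation is a hypothetical helper) that drops CHARACTER SET and COLLATE clauses from a type string before it is persisted:

import re

# Hypothetical helper sketching the new default behaviour: drop
# CHARACTER SET/CHARSET and COLLATE clauses so only the bare type
# (e.g. "VARCHAR(255)") is stored in the 32-character type column.
CHARSET_COLLATE = re.compile(r"\s+(CHARACTER SET|CHARSET|COLLATE)\s+\S+", re.IGNORECASE)

def strip_charset_and_collation(type_string: str) -> str:
    return CHARSET_COLLATE.sub("", type_string).strip()

print(strip_charset_and_collation(
    "VARCHAR(255) CHARACTER SET LATIN1 COLLATE UTF8MB4_GENERAL_CI"
))  # VARCHAR(255)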

TEST PLAN

Local test + CI

ADDITIONAL INFORMATION

REVIEWERS

@john-bodley @mistercrunch

@@ -161,6 +161,7 @@ class BaseEngineSpec:  # pylint: disable=too-many-public-methods
         utils.DbColumnType.STRING: (
             re.compile(r".*CHAR.*", re.IGNORECASE),
             re.compile(r".*STRING.*", re.IGNORECASE),
+            re.compile(r".*TEXT.*", re.IGNORECASE),
villebro (Member Author)

This is mostly unrelated to this PR, but I noticed that the TEXT types were not being identified as string column types (I'm surprised this hadn't come up previously).
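
For context, a quick sketch (assumed helper name, not part of the diff) of how these patterns classify a raw type name as a string column type:

import re

# The three patterns shown above; with the added TEXT pattern, MySQL's
# TEXT/MEDIUMTEXT/LONGTEXT types are now recognised as string columns.
STRING_TYPE_PATTERNS = (
    re.compile(r".*CHAR.*", re.IGNORECASE),
    re.compile(r".*STRING.*", re.IGNORECASE),
    re.compile(r".*TEXT.*", re.IGNORECASE),
)

def is_string_type(native_type: str) -> bool:
    return any(pattern.match(native_type) for pattern in STRING_TYPE_PATTERNS)

print(is_string_type("MEDIUMTEXT"))    # True (only matched after this change)
print(is_string_type("VARCHAR(255)"))  # True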

john-bodley (Member) commented Mar 17, 2020

@villebro looking at the example data (using a MySQL database),

mysql> show create table energy_usage;
+--------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Table        | Create Table                                                                                                                                                                                                                                        |
+--------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| energy_usage | CREATE TABLE `energy_usage` (
  `source` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,
  `target` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,
  `value` float DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci |
+--------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

it seems that the columns are defined with the COLLATE information, which is reflected here:

>>> from superset.utils.core import get_example_database
>>> example_db = get_example_database()
>>> sqla_table = example_db.get_table("energy_usage")
>>> col = next(iter(sqla_table.columns))
>>> col
Column('source', VARCHAR(collation='utf8_unicode_ci', length=255), table=<energy_usage>)

and thus I'm a little perplexed as to why we need column_datatype_to_string, as surely this collation and character encoding information could (and possibly should) be removed at the column level.

villebro (Member Author) commented Mar 17, 2020

@john-bodley what you're saying makes total sense. The columns from which these strings are created are constructed here:

https://github.com/apache/incubator-superset/blob/85e9a4fa990927327755f5743317e848fc230f01/superset/models/core.py#L577-L586

I'm thinking we should loop through these columns prior to returning the Table instance and remove collation and character set details if present. Do you agree?

Edit: on second thought, it's best not to mutate the original table; it's probably wiser to do this in db_engine_spec.column_datatype_to_string instead.
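
Something along these lines, as a rough sketch (assumed signature; the actual implementation may differ), keeping the stripping inside the engine spec so the reflected column is not mutated:

from copy import copy

from sqlalchemy.engine.interfaces import Dialect
from sqlalchemy.sql.schema import Column

# Sketch only: compile the type string from a copy of the column's type
# so the original reflected Table/Column objects are left untouched.
def column_datatype_to_string(column: Column, dialect: Dialect) -> str:
    sqla_type = copy(column.type)
    # Collation/charset attributes only exist on some string types and dialects.
    if hasattr(sqla_type, "collation"):
        sqla_type.collation = None
    if hasattr(sqla_type, "charset"):
        sqla_type.charset = None
    return sqla_type.compile(dialect=dialect)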

john-bodley (Member)

@villebro just to clarify, what is the issue with having the collation and character set defined at the column level?

villebro (Member Author) commented Mar 17, 2020

@john-bodley right now the column type column is VARCHAR(32), i.e. these longer type strings overflow and cause an exception when adding a new table. We could potentially make the column wider, say VARCHAR(100), which would solve the problem, but wouldn't really add any value beyond the edge case where someone wants to quickly check the collation in Superset (probably not a common use case). So I guess it comes down to either removing the additional info, which avoids a metadata migration and keeps the type column slightly lighter in the table editor, or making the column wider and giving users more context about the column types.
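
As a quick illustration of the overflow (using the example type string from the PR summary):

type_string = "VARCHAR(255) CHARACTER SET LATIN1 COLLATE UTF8MB4_GENERAL_CI"
print(len(type_string))     # 60 -- does not fit in the VARCHAR(32) type column
print(len("VARCHAR(255)"))  # 12 -- fits once charset/collation info is stripped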

villebro merged commit 982c234 into apache:master on Mar 17, 2020
mistercrunch added the 🏷️ bot and 🚢 0.36.0 labels on Feb 28, 2024
Labels
🏷️ bot, size/L, 🚢 0.36.0
Development

Successfully merging this pull request may close these issues.

Data too long for column 'type' at row 1