fix: remove character set and collate column info by default #9316

Merged
merged 3 commits into from Mar 17, 2020

Conversation

villebro (Member)

CATEGORY

Choose one

  • Bug Fix
  • Enhancement (new features, refinement)
  • Refactor
  • Add tests
  • Build / Development Environment
  • Documentation

SUMMARY

It is common for SQL engines to offer the option to define a custom character set and collation scheme for columns. The convention is to define this using syntax along the lines of the following MySQL example: VARCHAR(255) CHARACTER SET LATIN1 COLLATE UTF8MB4_GENERAL_CI.

Currently Superset already removes collation info but retains the character set info. This often causes an overflow when saving metadata, because the type column is only 32 characters wide (VARCHAR(32)). This PR makes removal of both collation and character set info the new default behaviour.
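
To illustrate the intended default behaviour, here is a minimal sketch (not the actual diff in this PR; strip_charset_and_collation is a hypothetical helper) that drops CHARACTER SET and COLLATE clauses from a type string before it is persisted:

import re

# Hypothetical helper sketching the new default behaviour: drop
# CHARACTER SET/CHARSET and COLLATE clauses so only the bare type
# (e.g. "VARCHAR(255)") is stored in the 32-character type column.
CHARSET_COLLATE = re.compile(r"\s+(CHARACTER SET|CHARSET|COLLATE)\s+\S+", re.IGNORECASE)

def strip_charset_and_collation(type_string: str) -> str:
    return CHARSET_COLLATE.sub("", type_string).strip()

print(strip_charset_and_collation(
    "VARCHAR(255) CHARACTER SET LATIN1 COLLATE UTF8MB4_GENERAL_CI"
))  # VARCHAR(255)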

TEST PLAN

Local test + CI

ADDITIONAL INFORMATION

REVIEWERS

@john-bodley @mistercrunch

@@ -161,6 +161,7 @@ class BaseEngineSpec:  # pylint: disable=too-many-public-methods
         utils.DbColumnType.STRING: (
             re.compile(r".*CHAR.*", re.IGNORECASE),
             re.compile(r".*STRING.*", re.IGNORECASE),
+            re.compile(r".*TEXT.*", re.IGNORECASE),
villebro (Member Author)

This is mostly unrelated to this PR, but I noticed that the TEXT types were not being identified as string column types (I'm surprised this hadn't come up previously).
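
For context, a quick sketch (assumed helper name, not part of the diff) of how these patterns classify a raw type name as a string column type:

import re

# The three patterns shown above; with the added TEXT pattern, MySQL's
# TEXT/MEDIUMTEXT/LONGTEXT types are now recognised as string columns.
STRING_TYPE_PATTERNS = (
    re.compile(r".*CHAR.*", re.IGNORECASE),
    re.compile(r".*STRING.*", re.IGNORECASE),
    re.compile(r".*TEXT.*", re.IGNORECASE),
)

def is_string_type(native_type: str) -> bool:
    return any(pattern.match(native_type) for pattern in STRING_TYPE_PATTERNS)

print(is_string_type("MEDIUMTEXT"))    # True (only matched after this change)
print(is_string_type("VARCHAR(255)"))  # True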

john-bodley (Member) commented Mar 17, 2020

@villebro looking at the example data (using a MySQL database),

mysql> show create table energy_usage;
+--------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Table        | Create Table                                                                                                                                                                                                                                        |
+--------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| energy_usage | CREATE TABLE `energy_usage` (
  `source` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,
  `target` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,
  `value` float DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci |
+--------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

it seems that the columns are defined with the COLLATE information, which is reflected here:

>>> from superset.utils.core import get_example_database
>>> example_db = get_example_database()
>>> sqla_table = example_db.get_table("energy_usage")
>>> col = next(iter(sqla_table.columns))
>>> col
Column('source', VARCHAR(collation='utf8_unicode_ci', length=255), table=<energy_usage>)

and thus I'm a little perplexed as to why we need column_datatype_to_string, as surely this collation and character encoding information could (and possibly should) be removed at the column level.

villebro (Member Author) commented Mar 17, 2020

@john-bodley what you're saying makes total sense. The columns from which these strings are created are constructed here:

https://github.com/apache/incubator-superset/blob/85e9a4fa990927327755f5743317e848fc230f01/superset/models/core.py#L577-L586

I'm thinking we should loop through these columns prior to returning the Table instance and remove collation and character set details if present. Do you agree?

Edit: on second thought, it's best not to mutate the original table; it's probably wiser to do this in db_engine_spec.column_datatype_to_string instead.
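
Something along these lines, as a rough sketch (assumed signature; the actual implementation may differ), keeping the stripping inside the engine spec so the reflected column is not mutated:

from copy import copy

from sqlalchemy.engine.interfaces import Dialect
from sqlalchemy.sql.schema import Column

# Sketch only: compile the type string from a copy of the column's type
# so the original reflected Table/Column objects are left untouched.
def column_datatype_to_string(column: Column, dialect: Dialect) -> str:
    sqla_type = copy(column.type)
    # Collation/charset attributes only exist on some string types and dialects.
    if hasattr(sqla_type, "collation"):
        sqla_type.collation = None
    if hasattr(sqla_type, "charset"):
        sqla_type.charset = None
    return sqla_type.compile(dialect=dialect)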

john-bodley (Member)

@villebro just to clarify, what is the issue with having the collation and character set defined at the column level?

villebro (Member Author) commented Mar 17, 2020

@john-bodley right now the column type column is VARCHAR(32), i.e. these longer type strings overflow and cause an exception when adding a new table. We could potentially make the column wider, say VARCHAR(100), which would solve the problem, but wouldn't really add any value beyond the edge case where someone wants to quickly check the collation in Superset (probably not a common use case). So I guess it comes down to either removing the additional info, which avoids a metadata migration and keeps the type column slightly lighter in the table editor, or making the column wider and giving users more context about the column types.
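
As a quick illustration of the overflow (using the example type string from the PR summary):

type_string = "VARCHAR(255) CHARACTER SET LATIN1 COLLATE UTF8MB4_GENERAL_CI"
print(len(type_string))     # 60 -- does not fit in the VARCHAR(32) type column
print(len("VARCHAR(255)"))  # 12 -- fits once charset/collation info is stripped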

villebro merged commit 982c234 into apache:master on Mar 17, 2020
mistercrunch added the 🏷️ bot and 🚢 0.36.0 labels on Feb 28, 2024
Labels
🏷️ bot, size/L, 🚢 0.36.0
Development

Successfully merging this pull request may close these issues.

Data too long for column 'type' at row 1