
fix: Change datatype of column type in BaseColumn to allow larger datatype names for complexed columns #17360

Merged

Conversation

cccs-joel
Contributor

SUMMARY

Some columns, such as those representing a complex structure (array, struct, enum, or a combination of these), may require more than 32 chars to store the datatype. Changing the datatype to TEXT, with no length limit, was suggested by @villebro in the first associated issue listed below.
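
For reference, a minimal sketch of the kind of model-level change being described (illustrative class and table names, not the actual Superset code):

```python
from sqlalchemy import Column, Integer, Text
from sqlalchemy.orm import declarative_base

Base = declarative_base()


class ExampleColumn(Base):
    """Minimal stand-in for the `type` field on BaseColumn."""

    __tablename__ = "example_columns"

    id = Column(Integer, primary_key=True)
    # Before: a bounded String(32), too small for deeply nested ARRAY/STRUCT/ENUM names.
    # After: an unbounded TEXT column.
    type = Column(Text)
```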

BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF

Before:
image
After:
image

TESTING INSTRUCTIONS

After the migration, the easiest way to test is to edit an existing dataset and change the type value of a column (using the legacy datasource editor) to something longer than 32 characters. Superset should accept the change and confirm the row was updated in the database.
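
If you want to double-check at the database level, a quick sketch along these lines can confirm the new column type (this assumes the concrete metadata table is table_columns and uses a placeholder connection URI; adjust both for your environment):

```python
from sqlalchemy import create_engine, inspect

# Placeholder URI -- point this at your Superset metadata database.
engine = create_engine("postgresql://superset:superset@localhost/superset")

inspector = inspect(engine)
for column in inspector.get_columns("table_columns"):
    if column["name"] == "type":
        # After the migration this should report TEXT (or the backend's
        # equivalent) instead of VARCHAR(32).
        print(column["type"])
```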

ADDITIONAL INFORMATION

  • Has associated issue:
  • Required feature flags:
  • Changes UI
  • Includes DB Migration (follow approval process in SIP-59); a sketch of such a migration follows this list
    • Migration is atomic, supports rollback & is backwards-compatible
    • Confirm DB migration upgrade and downgrade tested
    • Runtime estimates and downtime expectations provided: downtime is minimal, takes a second to execute the script.
  • Introduces new feature or API
  • Removes existing feature or API
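
As referenced in the checklist above, a hedged sketch of what such an Alembic migration could look like (revision identifiers and the VARCHAR(32) starting point are assumptions for illustration, not the PR's actual values):

```python
import sqlalchemy as sa
from alembic import op

# Illustrative revision identifiers only.
revision = "example_revision"
down_revision = "example_down_revision"


def upgrade():
    # Widen the column from a bounded VARCHAR to unbounded TEXT.
    with op.batch_alter_table("table_columns") as batch_op:
        batch_op.alter_column(
            "type",
            existing_type=sa.VARCHAR(length=32),
            type_=sa.TEXT(),
            existing_nullable=True,
        )


def downgrade():
    # Revert to the original bounded type; values longer than 32 chars
    # would be truncated or rejected depending on the backend.
    with op.batch_alter_table("table_columns") as batch_op:
        batch_op.alter_column(
            "type",
            existing_type=sa.TEXT(),
            type_=sa.VARCHAR(length=32),
            existing_nullable=True,
        )
```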

@cccs-joel cccs-joel requested a review from a team as a code owner November 5, 2021 17:43
@betodealmeida betodealmeida changed the title Fix: Change datatype of column type in BaseColumn to allow larger datatype names for complexed columns fix: Change datatype of column type in BaseColumn to allow larger datatype names for complexed columns Nov 5, 2021
@betodealmeida betodealmeida added the risk:db-migration PRs that require a DB migration label Nov 5, 2021
@codecov

codecov bot commented Nov 5, 2021

Codecov Report

Merging #17360 (bb42450) into master (485852d) will decrease coverage by 0.04%.
The diff coverage is 100.00%.

❗ Current head bb42450 differs from pull request most recent head 7aebfa5. Consider uploading reports for the commit 7aebfa5 to get more accurate results
Impacted file tree graph

@@            Coverage Diff             @@
##           master   #17360      +/-   ##
==========================================
- Coverage   68.11%   68.07%   -0.05%     
==========================================
  Files        1653     1653              
  Lines       66374    66374              
  Branches     7121     7121              
==========================================
- Hits        45211    45182      -29     
- Misses      19266    19295      +29     
  Partials     1897     1897              
| Flag | Coverage Δ |
|------|------------|
| hive | 81.78% <100.00%> (ø) |
| postgres | ? |
| presto | 82.07% <100.00%> (+<0.01%) ⬆️ |
| python | 82.55% <100.00%> (-0.10%) ⬇️ |
| sqlite | 81.88% <100.00%> (ø) |

Flags with carried forward coverage won't be shown.

| Impacted Files | Coverage Δ |
|----------------|------------|
| superset/connectors/base/models.py | 88.19% <100.00%> (ø) |
| superset/datasets/schemas.py | 96.61% <100.00%> (ø) |
| superset/sql_validators/postgres.py | 50.00% <0.00%> (-50.00%) ⬇️ |
| superset/databases/commands/update.py | 85.71% <0.00%> (-8.17%) ⬇️ |
| superset/common/utils/dataframe_utils.py | 85.71% <0.00%> (-7.15%) ⬇️ |
| superset/databases/commands/create.py | 82.35% <0.00%> (-5.89%) ⬇️ |
| superset/reports/commands/log_prune.py | 85.71% <0.00%> (-3.58%) ⬇️ |
| superset/commands/importers/v1/utils.py | 89.13% <0.00%> (-2.18%) ⬇️ |
| superset/databases/api.py | 90.94% <0.00%> (-2.10%) ⬇️ |
| superset/db_engine_specs/postgres.py | 96.36% <0.00%> (-0.91%) ⬇️ |
| ... and 6 more | |

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0d2299c...7aebfa5. Read the comment docs.

@etr2460
Member

etr2460 commented Nov 5, 2021

Should we use a VARCHAR here with a longer char limit instead of TEXT? I think there are performance implications of using TEXT, but I'm not enough of a backend engineer to know for sure.

@cccs-joel
Contributor Author

Should we use a VARCHAR here with a longer char limit instead of TEXT? I think there are performance implications of using TEXT, but I'm not enough of a backend engineer to know for sure.

We could... but how much longer? We deal with complex columns (think many levels of nested elements), and the schema becomes the default type when saving a query as a dataset in SQL Lab. Others reported similar symptoms in the above issues. But yeah, more than happy to hear from the engineers about the potential side effects of this change.

@etr2460
Member

etr2460 commented Nov 5, 2021

How long are we talking? More than 255 characters? More than 1000? Sorry, I don't know just how complex this can get.

@ktmud
Member

ktmud commented Nov 5, 2021

I'm wondering if we can have two columns for this... store 95% of cases in a VARCHAR with a reasonable limit and use TEXT to store large ENUMs and more advanced structs. Then there could be some helper functions to translate the complex types into more generic types to be used by the UI.
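
Purely to illustrate that suggestion (it is not what this PR implements), the model and helper could look roughly like this, with hypothetical names throughout:

```python
from sqlalchemy import Column, Integer, String, Text
from sqlalchemy.orm import declarative_base

Base = declarative_base()


class ExampleColumn(Base):
    """Illustration of the two-column idea, not part of this PR."""

    __tablename__ = "example_columns"

    id = Column(Integer, primary_key=True)
    type = Column(String(255))  # covers the common, short datatype names
    type_full = Column(Text)    # full definition for large ENUMs / nested structs


def generic_type(full_type: str) -> str:
    """Hypothetical helper: collapse a complex type to a generic bucket for the UI."""
    head = full_type.split("<")[0].split("(")[0].strip().upper()
    return head or "UNKNOWN"
```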

@cccs-joel
Contributor Author

How long are we talking? More than 255 characters? More than 1000? Sorry, I don't know just how complex this can get.

I have use cases with more than 3000 characters, hard to predict.

@github-actions
Contributor

⚠️ @cccs-joel Your base branch master has just also updated superset/migrations.

Please consider rebasing your branch to avoid db migration conflicts.

@betodealmeida
Member

Should we use a VARCHAR here with a longer char limit instead of TEXT? I think there are performance implications of using TEXT, but I'm not enough of a backend engineer to know for sure.

I know for a fact that for Postgres there's no cost in using TEXT instead of VARCHAR, and it might even be faster in some cases. Not sure about MySQL and other DBs.

@betodealmeida betodealmeida left a comment
Member

Looks good.

For reference, in the models for SIP-68 I'm also using TEXT for column types.

@villebro
Member

Should we use a VARCHAR here with a longer char limit instead of TEXT? I think there are performance implications of using TEXT, but I'm not enough of a backend engineer to know for sure.

I know for a fact that for Postgres there's no cost in using TEXT instead of VARCHAR, and it might even be faster in some cases. Not sure about MySQL and other DBs.

It's also my experience that VARCHAR and TEXT have pretty similar performance on all databases I've used. I don't think it will have any performance impact in this case.

@villebro villebro left a comment
Member

LGTM - just wondering if we should add a note in UPDATING.md, as this migration may take some time to complete on large deployments?

@etr2460
Member

etr2460 commented Nov 16, 2021

Thanks to people smarter than me for double-checking the perf implications.

We definitely should have a note in UPDATING though, as this will probably cause a table lock on MySQL DBs. Otherwise, LGTM.

@ktmud
Member

ktmud commented Nov 16, 2021

IIRC, there are some implications on search performance for VARCHAR vs TEXT in MySQL. Search with indexing is either not possible or much slower for TEXT, depending on which storage engine you use and which MySQL version you are on.

@cccs-joel
Contributor Author

Thanks to people smarter than me for double-checking the perf implications.

We definitely should have a note in UPDATING though, as this will probably cause a table lock on MySQL DBs. Otherwise, LGTM.

Sorry, I'm new to this process, should I write this note?

@cccs-joel
Contributor Author

IIRC, there are some implications on search performance for VARCHAR vs TEXT in MySQL. Search with indexing is either not possible or much slower for TEXT, depending on which storage engine you use and which MySQL version you are on.

Thanks for your input, I appreciate it as this is not my expertise. I still need guidance on whether we should use a VARCHAR with a higher limit or TEXT for that specific column.

@github-actions
Contributor

⚠️ @cccs-joel Your base branch master has just also updated superset/migrations.

Please consider rebasing your branch to avoid db migration conflicts.

@betodealmeida
Member

Thanks to people smarter than me for double-checking the perf implications.
We definitely should have a note in UPDATING though, as this will probably cause a table lock on MySQL DBs. Otherwise, LGTM.

Sorry, I'm new to this process, should I write this note?

Yeah, just add it to the file (https://github.com/apache/superset/blob/master/UPDATING.md) under the next release.

@betodealmeida
Member

IIRC, there are some implications on search performance for VARCHAR vs TEXT in MySQL. Search with indexing is either not possible or much slower for TEXT, depending on which storage engine you use and which MySQL version you are on.

Thanks for your input, I appreciate it as this is not my expertise. I still need guidance on whether we should use a VARCHAR with a higher limit or TEXT for that specific column.

Looks like with MySQL we can still use TEXT, but we need to specify a length in order to have an index: https://dev.mysql.com/doc/refman/8.0/en/create-index.html#create-index-column-prefixes. So we could change to TEXT, and later still add an index if needed.

Performance-wise there seems to be an extra cost when operating on TEXT (https://dba.stackexchange.com/a/222182), but I think it's safe to assume that we're only going to do simple scans on this table, so it should be fine from what I understand.

Since we don't know the maximum expected size of this column I think it's OK to:

  1. Switch to TEXT
  2. If needed in the future, add an index prefix after consulting the community on the size (see the sketch below)
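
A hedged sketch of what option 2 could look like later via Alembic (index name, table name, and prefix length are illustrative; MySQL requires a key prefix length when indexing a TEXT column):

```python
from alembic import op


def upgrade():
    op.create_index(
        "ix_table_columns_type",
        "table_columns",
        ["type"],
        # MySQL needs an index prefix length for TEXT columns; other
        # backends ignore this dialect-specific argument.
        mysql_length=100,
    )
```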

@cccs-joel
Contributor Author

cccs-joel commented Nov 24, 2021

Thanks to people smarter than me for double-checking the perf implications.
We definitely should have a note in UPDATING though, as this will probably cause a table lock on MySQL DBs. Otherwise, LGTM.

Sorry, I'm new to this process, should I write this note?

Yeah, just add it to the file (https://github.com/apache/superset/blob/master/UPDATING.md) under the next release.
Done here: #17541, unless you want me to do it in this pull request.

@cccs-joel cccs-joel mentioned this pull request Nov 24, 2021
@github-actions
Contributor

⚠️ @cccs-joel Your base branch master has just also updated superset/migrations.

Please consider rebasing your branch to avoid db migration conflicts.

@github-actions
Contributor

github-actions bot commented Dec 3, 2021

⚠️ @cccs-joel Your base branch master has just also updated superset/migrations.

Please consider rebasing your branch to avoid db migration conflicts.

@cccs-joel
Contributor Author

Can someone take a look at this PR? Some checks didn't pass for obscure reasons, but other than that it seems ready to go.

@betodealmeida betodealmeida left a comment
Member

@cccs-joel this looks great, but we need to update the down revision in your migration before merging.


# revision identifiers, used by Alembic.
revision = "3ba29ecbaac5"
down_revision = "b92d69a6643c"

I think your tests are failing because your migration is now introducing a second HEAD. You can change the down revision here and in line 20 to abe27eaf93db (which you can see if you run superset db heads).
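
In other words, the head of the migration file would end up looking something like this (using the revision hash quoted above):

```python
# revision identifiers, used by Alembic.
revision = "3ba29ecbaac5"
down_revision = "abe27eaf93db"  # was "b92d69a6643c", which introduced a second HEAD
```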

@github-actions
Contributor

⚠️ @cccs-joel Your base branch master has just also updated superset/migrations.

Please consider rebasing your branch to avoid db migration conflicts.
