[SIP] Propose visualizations based on data #12724

wernerdaehn · 2021-01-25T07:06:40Z

[SIP] Propose visualizations based on data

Motivation

I have worked for Business Objects and SAP and have been in the Business Intelligence market for more than 20 years. One thing that is still unsatisfying is how charting options are chosen.
Over time, the number of available charts and their variants will keep growing, and selecting from a long list is cumbersome. Also, not everybody knows all the visualization options for every case.
But given that Superset has a semantic layer, the visualizations can be preselected.

Example: 2 Attributes & 2 Measures? Very likely a Pie Chart will not be the proper visualization.

There is an entire academic theory about different axis types (nominal scale, ordinal scale, interval scale, ratio scale), for example. If you are interested, we can work on the details.

Proposed Change

1. Collect more metadata about attributes: number of distinct values, what axis type it can be used for, ...
2. Define the aggregation type of a measure and whether it is semi-additive.
3. For each charting option and variant, specify a rank for how useful it is based on the number of attributes, the number of measures, the axis type of each attribute, and the measure type.
4. Order the charting options based on an overall rank.
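
As a rough illustration of the last two items, here is a minimal Python sketch. It covers only the part of the ranking that depends on the number of attributes and measures. The chart names, bounds, and rank values are made-up assumptions for discussion, not existing Superset metadata or APIs.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ChartRule:
    """Hypothetical per-chart metadata: which query shapes the chart fits."""
    name: str
    min_attributes: int
    max_attributes: int
    min_measures: int
    max_measures: int
    base_rank: int  # higher means "usually more useful"; values are invented


# Illustrative rules only; real values would need to be agreed on per chart.
RULES = [
    ChartRule("pie", 1, 1, 1, 1, base_rank=50),
    ChartRule("bar", 1, 2, 1, 3, base_rank=70),
    ChartRule("scatter", 0, 1, 2, 3, base_rank=60),
    ChartRule("table", 0, 10, 0, 10, base_rank=10),
]


def rank_charts(n_attributes: int, n_measures: int) -> List[str]:
    """Return the names of charts that fit the query shape, best rank first."""
    fitting = [
        rule
        for rule in RULES
        if rule.min_attributes <= n_attributes <= rule.max_attributes
        and rule.min_measures <= n_measures <= rule.max_measures
    ]
    fitting.sort(key=lambda rule: rule.base_rank, reverse=True)
    return [rule.name for rule in fitting]
```

With these invented rules, a query with 2 attributes and 2 measures no longer offers the pie chart, which matches the example above.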

Please let me know if you are interested, and I will spend some time working out the details.

junlincc · 2021-01-25T10:01:36Z

Thanks for suggesting! @wernerdaehn

Collect more metadata about attributes: Number of distinct values, what axis type it can be used for,...

It is aligned with our long-term product roadmap. In fact, when we implemented the new time picker in Superset, we thought about allowing users to query the earliest (min) and latest (max) time available in the timestamp dimension. We couldn't get to it by v1.0 because of potential performance issues and our time constraints. Collecting more metadata about datasets is something we want to do once we get to refactoring the major control fields like metrics, filters, etc.

Define the aggregation type of a measure and if it is semi-additive

Something we will consider. It will probably require us to 'thicken' the semantic layer in Superset, which would steepen Superset's learning curve.

3 & 4.

Both are features available in Tableau. I agree they provide a nice user experience and enable non-technical users to create visualizations intuitively. We would love to get to both someday.

Screen.Recording.2021-01-25.at.1.26.05.AM.mov

junlincc · 2021-01-25T10:07:33Z

@wernerdaehn if you would like to contribute any of the above items to Superset in any way, we would love to work with you!

wernerdaehn · 2021-01-25T10:08:44Z

@junlincc Thanks for the feedback. Just for the record, what Tableau does is just the very beginning!
See here for how wide the topic can get: https://datavizproject.com/

wernerdaehn · 2021-01-25T10:10:07Z

Any suggestions on what I can do for you in that regard? Otherwise I will try to come up with something to discuss, but I would love to get your guidance.

ktmud · 2021-01-25T10:51:01Z

Thanks for bringing up this topic! This definitely is an interesting area of work and has a lot of potential for Superset.

What you described is often called automated chart specification, or automated Exploratory Data Analysis (EDA), which is also quite big among DataViZ academics: https://github.com/mstaniak/autoEDA-resources

It would be tremendously valuable if we could somehow integrate the latest research findings into an open source/commercial BI software.

This SIP is a good starting point, which seems to have identified a couple of items we can already do. I’d recommend continuing to research this topic and digging into the Superset codebase/architecture to form a more concrete action plan. We should at least be able to answer:

What is possible and what is not?
What is the MVP?
Which APIs do we need to change or add?
What other areas of work do we need to tackle before working on this? E.g. SIP-34 column stats looks like a must.

Some other useful links:

rusackas · 2021-04-22T06:45:54Z

I just wanted to chime in and say that I love this idea, and it's something that my team is starting to investigate more seriously. @wernerdaehn would you be interested in joining discussions (synchronously or otherwise) around this and being a part of implementing the solution? If not, I think we may need more clarification on the approaches to implementation and any risks/dependencies involved, as @ktmud was suggesting. In other words, I think this is a great idea for a SIP, but we need more details to be able to put it to a vote and carry it out effectively.

wernerdaehn · 2021-04-22T07:23:34Z

@rusackas By all means, Evan! More than happy to contribute.

As a preliminary start, here is my thinking:

According to statistics, there are four types of scales (levels of measurement), ordered by capability:

Nominal: The only useful calculation is counting. Example: color.
Ordinal: Additionally has an order. Example: user satisfaction 1-10. It is clear that 1 is better than 2, but the difference between 1 and 3 does not have the same meaning as the difference between 8 and 10.
Interval: Additionally, the distance between two values is meaningful. Example: today it is 5°C warmer than yesterday.
Ratio: Additionally has a meaningful zero, so absolute comparisons make sense. Example: revenue was 10% higher.

If somebody wants to visualize a nominal value and a ratio value, e.g. revenue per color, a bar chart is one of the few chart types that makes sense. For two ratio values, e.g. revenue per customer age, a scatter plot is well suited.
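
Because each scale adds a capability on top of the previous one, the four types can be modeled as an ordered enum. The following Python sketch is only an illustration; the operation names and the `supports` helper are hypothetical and not part of Superset.

```python
from enum import IntEnum


class Scale(IntEnum):
    """Measurement scales, ordered so each level includes the ones below it."""
    NOMINAL = 1   # counting only
    ORDINAL = 2   # adds an order
    INTERVAL = 3  # adds meaningful distances between values
    RATIO = 4     # adds a meaningful zero, so ratios make sense


# Hypothetical mapping from an operation to the weakest scale that supports it.
REQUIRED_SCALE = {
    "count": Scale.NOMINAL,
    "sort": Scale.ORDINAL,
    "difference": Scale.INTERVAL,
    "percent_change": Scale.RATIO,
}


def supports(scale: Scale, operation: str) -> bool:
    """Return True if a column on `scale` can meaningfully use `operation`."""
    return scale >= REQUIRED_SCALE[operation]
```

For example, `supports(Scale.ORDINAL, "difference")` is false for the satisfaction score, while `supports(Scale.RATIO, "percent_change")` is true for revenue.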

The next type of decision is the number of axes.

If there is a single nominal axis, e.g. gender, the pie chart might be interesting to show the number of customers per gender.
If I want to visualize revenue compared to the previous year's revenue, per country and over time, I need a chart type that can show a ratio scale, a list of regions, and the development over time. A geomap colored as a heatmap with a time animation would do the trick.

The type of axis can further be refined:

time: year, month, day, timestamp, week, weekday
geo
hierarchy

One side effect of these types is how missing values are rendered. A country without revenue should still be present on a geomap but not in a bar chart. A month without revenue should still be shown; you do not want to see just 11 months.
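
For the month case, a minimal Python sketch of the gap filling could look like this. The function name, the `YYYY-MM` key format, and the choice of 0 as the fill value are assumptions for illustration only; a chart might prefer an explicit "no data" marker instead.

```python
def fill_missing_months(revenue_by_month: dict, year: int) -> dict:
    """Return all 12 months of `year`, using 0 where the result set has no row.

    Keys are assumed to be strings of the form "YYYY-MM" (hypothetical format).
    """
    return {
        f"{year}-{month:02d}": revenue_by_month.get(f"{year}-{month:02d}", 0)
        for month in range(1, 13)
    }
```

A time axis would apply this kind of fill, whereas a plain nominal axis in a bar chart would keep only the categories that are present in the result set.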

The number of distinct values on nominal and ordinal scales is an important decision point as well. A pie chart with 5000 categories is unlikely to be the best-suited chart type. The revenue per country over time from above could be shown as a line chart with one line per country. That is excellent for comparisons between countries, unless you have 100 countries and hence 100 lines.
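
A cardinality check of this kind can be sketched in a few lines of Python. The chart names and the limits below are arbitrary placeholders, not recommendations or existing Superset settings.

```python
# Hypothetical upper bounds on distinct category values per chart type.
MAX_CATEGORIES = {"pie": 8, "line": 20, "bar": 50}


def charts_within_cardinality(distinct_values: int) -> list:
    """Return chart types whose (assumed) category limit is not exceeded."""
    return [
        chart
        for chart, limit in MAX_CATEGORIES.items()
        if distinct_values <= limit
    ]
```

With these placeholder limits, 100 countries would rule out all three chart types, while 5 countries would allow all of them.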

The final decision type is the purpose of the visualization:

Comparison
Relationship
Proportion
Percent of the whole
Location
Distribution...

The nice thing is that we can start small and grow the solution. Initially we just categorize each column of the result set by scale type, and each chart carries the information about which scale type it allows on which axis. That by itself would shorten the list of charts offered by a lot. From there we can keep growing with the available metadata on the data and the chart info.
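
This first step could be sketched in Python as follows. The chart entries, axis names, and accepted scale sets are illustrative assumptions, not actual Superset chart metadata.

```python
# Hypothetical chart metadata: for each axis, the scale types it accepts.
CHART_AXES = {
    "bar": {"x": {"nominal", "ordinal"}, "y": {"ratio"}},
    "scatter": {"x": {"interval", "ratio"}, "y": {"interval", "ratio"}},
    "pie": {"x": {"nominal"}, "y": {"ratio"}},
}


def compatible_charts(column_scales: dict) -> list:
    """Return charts whose axes all accept the scale of the assigned column.

    `column_scales` maps an axis name to the scale type of the column on it,
    e.g. {"x": "nominal", "y": "ratio"} for revenue per color.
    """
    return [
        chart
        for chart, axes in CHART_AXES.items()
        if axes.keys() == column_scales.keys()
        and all(column_scales[axis] in accepted for axis, accepted in axes.items())
    ]
```

Revenue per color (`{"x": "nominal", "y": "ratio"}`) would keep the bar and pie charts, while revenue per customer age (`{"x": "ratio", "y": "ratio"}`) would leave only the scatter plot.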

stale · 2021-06-26T03:07:38Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. For admin, please label this issue .pinned to prevent stale bot from closing the issue.

junlincc added the enhancement:request, explore:control, explore:dataset, and viz:explore:ux labels Jan 25, 2021

stale bot added the inactive label Jun 26, 2021

apache locked and limited conversation to collaborators Feb 2, 2022

geido converted this issue into discussion #18430 Feb 2, 2022

stale bot removed the inactive label Feb 2, 2022

geido added the explore:design label and removed the viz:explore:ux label Feb 9, 2022

rusackas added the sip label Jun 7, 2023

rusackas added this to SIPs (Superset Improvement Proposals) Jun 7, 2023

rusackas moved this to DENIED / CLOSED in SIPs (Superset Improvement Proposals) Jun 7, 2023

This issue was moved to a discussion.