
Aggregate data outside of topN into a single category #2176

Closed
wants to merge 8 commits into from

Conversation

@asdf2014
Member

commented Feb 15, 2017

Aggregate data outside of topN into a single category

There is a problem: the category icons fill up the whole chart when I have tens of thousands of metrics to display. So we should aggregate the categories outside of the topN into a single category named Others Category.

(screenshots: the chart before and after aggregating the long tail into an Others Category)
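A minimal sketch of the aggregation described above, using pandas. The function name, column layout, and default label are assumptions for illustration, not the PR's actual implementation:

```python
import pandas as pd

def aggregate_others(df, metric, top_n, label="Others Category"):
    # Keep the top_n rows by `metric` and collapse the long tail
    # into a single row labelled `label`.
    df = df.sort_values(metric, ascending=False)
    head, tail = df.iloc[:top_n], df.iloc[top_n:]
    if tail.empty:
        return head.reset_index(drop=True)
    # Assume a single dimension column besides the metric.
    dim = [c for c in df.columns if c != metric][0]
    others = pd.DataFrame({dim: [label], metric: [tail[metric].sum()]})
    return pd.concat([head, others], ignore_index=True)
```

With tens of thousands of categories, only `top_n + 1` rows reach the chart, which is what keeps the icons from flooding it.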

@asdf2014

Member Author

commented Feb 15, 2017

Why can't I find the dbs table in unittests.db when I run druid_tests.py locally? Please give some help... @mistercrunch

[root@superset .superset]# sqlite3 /root/.superset/unittests.db 
SQLite version 3.6.20
Enter ".help" for instructions
Enter SQL statements terminated with a ";"
sqlite> .databases
seq  name             file                                                      
---  ---------------  ----------------------------------------------------------
0    main             /root/.superset/unittests.db                              
sqlite> .tables
ab_permission            ab_register_user         ab_user_role           
ab_permission_view       ab_role                  ab_view_menu           
ab_permission_view_role  ab_user                
sqlite> 
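For reference, the same check can be scripted instead of using the sqlite3 shell; a small sketch that mirrors the `.tables` command above:

```python
import sqlite3

def list_tables(db_path):
    # Equivalent of the sqlite3 shell's `.tables`: read user table
    # names from the sqlite_master catalog.
    with sqlite3.connect(db_path) as conn:
        rows = conn.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type = 'table' ORDER BY name"
        ).fetchall()
    return [name for (name,) in rows]
```

Pointing it at `/root/.superset/unittests.db` would return the `ab_*` tables listed above and confirm that `dbs` is indeed missing.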
@asdf2014

Member Author

commented Feb 15, 2017

And I tried changing `from .base_tests import SupersetTestCase` into `from tests.base_tests import SupersetTestCase` so that base_tests is imported from the package correctly. Is this reasonable? @mistercrunch

@@ -223,6 +224,17 @@ def get_df(self, query_obj=None):
df[DTTM_ALIAS] += timedelta(hours=self.datasource.offset)
df.replace([np.inf, -np.inf], np.nan)
df = df.fillna(0)
if (self.others_category is not None) & \


xrmx Feb 15, 2017

Contributor

you want `and`, not `&`; also you don't need the `\`


asdf2014 Feb 16, 2017

Author Member

Nice! Thx a lot.


asdf2014 Feb 16, 2017

Author Member

Oh no, it still needs the `\` because of the lint error: Line too long (83 > 79 characters) :(

if (self.others_category is not None) & \
(self.others_category != 'None'):
top_n = int(self.others_category)
if (top_n > 0) & (len(df) > top_n):


xrmx Feb 15, 2017

Contributor

same here and you can drop the parens


asdf2014 Feb 16, 2017

Author Member

I see.
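The point of the exchange above can be shown in a few lines. This is only an illustrative sketch (the variable mirrors `self.others_category` from the diff, with `self.` dropped):

```python
others_category = "50"  # hypothetical stand-in for self.others_category

# `&` is the bitwise operator and does not short-circuit; `and` is the
# boolean operator and is the idiomatic choice here. Wrapping the whole
# condition in parentheses also lets it span two lines without a `\`,
# keeping the 79-character line limit happy.
if (others_category is not None
        and others_category != 'None'):
    top_n = int(others_category)
```

With `&`, both sides are always evaluated and operator precedence forces the extra parentheses; with `and`, a `None` left-hand side would stop evaluation before the comparison runs.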

@xrmx
Contributor

commented Feb 15, 2017 (comment minimized)

Benedict Jin and others added some commits Feb 16, 2017

@asdf2014

Member Author

commented Feb 16, 2017

I know... But I still get the issue:

Traceback (most recent call last):
  File "/root/superset-0.15.4/lib/python2.7/site-packages/superset/tests/core_tests.py", line 19, in <module>
    from .base_tests import SupersetTestCase
ValueError: Attempted relative import in non-package
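The error above is the classic symptom of executing a module that uses relative imports directly as a script. A self-contained reproduction (hypothetical package name `pkg`; under Python 3 the message is `ImportError: attempted relative import with no known parent package` rather than the Python 2 `ValueError`):

```python
import os
import subprocess
import sys
import tempfile

# Build a throwaway package with the same shape as superset/tests.
tmp = tempfile.mkdtemp()
pkg = os.path.join(tmp, "pkg")
os.makedirs(pkg)
open(os.path.join(pkg, "__init__.py"), "w").close()
with open(os.path.join(pkg, "base_tests.py"), "w") as f:
    f.write("VALUE = 42\n")
with open(os.path.join(pkg, "core_tests.py"), "w") as f:
    f.write("from .base_tests import VALUE\nprint(VALUE)\n")

# Running the file directly fails: there is no package context,
# so the relative import cannot be resolved.
direct = subprocess.run(
    [sys.executable, os.path.join(pkg, "core_tests.py")],
    capture_output=True, text=True)

# Running it as a module inside the package succeeds.
as_module = subprocess.run(
    [sys.executable, "-m", "pkg.core_tests"],
    capture_output=True, text=True, cwd=tmp)
```

This is why invoking the tests through a runner that imports them as part of a package works while `python core_tests.py` does not, and why switching to the absolute `tests.base_tests` import also sidesteps the problem.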

@xrmx

@mistercrunch

Contributor

commented Feb 17, 2017

We got rid of forms.py as we moved a lot of this logic to the frontend to build a more responsive React app. There's an equivalent fields.js file now.

I'm confused as to how to handle TopN in Druid. It appears that each node or segment processing returns a local TopN and the coordinator merges the results. In our case on high cardinality TopN queries we've seen much of the long tail making it through and incomplete results for items that actually fit in the real TopN.

User may want and expect a few things:

  • The data is evenly distributed, TopN happens to perform well, and results are mostly accurate. We may still want to "pad the TopN" (the user asks for a top 50, but we run a top 100 or something somewhat arbitrary to give room) and chop off the tail
  • Would adding a phase to the query help? First get the topN say 50, then run another TopN query that filters on this. I still feel like this logic belongs in the Druid API, not on our side...
  • Letting the user force a group by when they need accurate results. This would be a checkbox "Avoid TopN queries" that would force a GroupBy query
  • Letting the user define the TopN number AND the actual number of values that we should truncate to.

I may take this to the Druid user group and have a conversation there. I feel like I'm missing something.
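The "pad the TopN" idea from the first bullet can be sketched client-side. Here `run_topn` is a hypothetical callable standing in for an actual Druid TopN request, and the padding factor is arbitrary, not a tuned value:

```python
PAD_FACTOR = 2  # arbitrary padding multiplier, an assumption

def padded_topn(run_topn, metric, n):
    # Ask Druid for more rows than the user requested, to leave room
    # for items the approximate per-segment merge might otherwise
    # drop, then chop off the tail locally.
    rows = run_topn(threshold=n * PAD_FACTOR)
    rows.sort(key=lambda r: r[metric], reverse=True)
    return rows[:n]
```

Padding only reduces the chance of missing true top-N items; for guaranteed accuracy the GroupBy fallback described in the third bullet is still needed.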

@asdf2014

Member Author

commented Feb 20, 2017

You mean that this divide-and-conquer approach in Druid is very efficient, but in our case the TopN query runs into the long-tail situation. So we should add an option to fall back to a plain GroupBy query when accurate results are needed? @mistercrunch
