Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Aggregate data outside of topN into a single category #2176

Closed
wants to merge 8 commits into from
Closed

Aggregate data outside of topN into a single category #2176

wants to merge 8 commits into from

Conversation

asdf2014
Copy link
Member

@asdf2014 asdf2014 commented Feb 15, 2017

Aggregate data outside of topN into a single category

There is a problem that those category icons will fill up the whole chart, when i have tens of thousands metrics need to be display. So we should aggregate those category outside of topN into a single category named Others Category.

image

image

@asdf2014
Copy link
Member Author

@asdf2014 asdf2014 commented Feb 15, 2017

Why i cannot find dbs database in unittests.db when i running druid_tests.py on local? Please give some help... @mistercrunch

[root@superset .superset]# sqlite3 /root/.superset/unittests.db 
SQLite version 3.6.20
Enter ".help" for instructions
Enter SQL statements terminated with a ";"
sqlite> .databases
seq  name             file                                                      
---  ---------------  ----------------------------------------------------------
0    main             /root/.superset/unittests.db                              
sqlite> .tables
ab_permission            ab_register_user         ab_user_role           
ab_permission_view       ab_role                  ab_view_menu           
ab_permission_view_role  ab_user                
sqlite> 

@asdf2014
Copy link
Member Author

@asdf2014 asdf2014 commented Feb 15, 2017

And i tried to change from .base_tests import SupersetTestCase into from tests.base_tests import SupersetTestCase to import .base_tests as a package correctly. Is this reasonable? @mistercrunch

superset/viz.py Outdated
@@ -223,6 +224,17 @@ def get_df(self, query_obj=None):
df[DTTM_ALIAS] += timedelta(hours=self.datasource.offset)
df.replace([np.inf, -np.inf], np.nan)
df = df.fillna(0)
if (self.others_category is not None) & \
Copy link
Contributor

@xrmx xrmx Feb 15, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you want and not &, also you don't need the \

Copy link
Member Author

@asdf2014 asdf2014 Feb 16, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! Thx a lot.

Copy link
Member Author

@asdf2014 asdf2014 Feb 16, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh no, it still need the \ cuz Line too long (83 > 79 characters) :(

superset/viz.py Outdated
if (self.others_category is not None) & \
(self.others_category != 'None'):
top_n = int(self.others_category)
if (top_n > 0) & (len(df) > top_n):
Copy link
Contributor

@xrmx xrmx Feb 15, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here and you can drop the parens

Copy link
Member Author

@asdf2014 asdf2014 Feb 16, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see.

@xrmx
Copy link
Contributor

@xrmx xrmx commented Feb 15, 2017

@asdf2014
Copy link
Member Author

@asdf2014 asdf2014 commented Feb 16, 2017

I know... But i still got the issue:

Traceback (most recent call last):
  File "/root/superset-0.15.4/lib/python2.7/site-packages/superset/tests/core_tests.py", line 19, in <module>
    from .base_tests import SupersetTestCase
ValueError: Attempted relative import in non-package

@xrmx

@mistercrunch
Copy link
Member

@mistercrunch mistercrunch commented Feb 17, 2017

We got rid of forms.py as we moved a lot of this logic to the frontend to build a more responsive React app. There's an equivalent fields.js file now.

I'm confused as to how to handle TopN in Druid. It appears that each node or segment processing returns a local TopN and the coordinator merges the results. In our case on high cardinality TopN queries we've seen much of the long tail making it through and incomplete results for items that actually fit in the real TopN.

User may want and expect a few things:

  • The data is evenly distributed and TopN happens to perform well and results are mostly accurate. We may still want to "pad the TopN" (user asks for top 50, but we run a top 100 or something somewhat arbitrary to give room) and shop off the tail
  • Would adding a phase to the query help? First get the topN say 50, then run another TopN query that filters on this. I still feel like this logic belongs in the Druid API, not on our side...
  • Letting the user force a group by when they need accurate results. This would be a checkbox "Avoid TopN queries" that would force a GroupBy query
  • Letting the user define the TopN number AND the actual number of value that we should truncate to.

I may take this to the Druid user group and have a conversation there. I feel like I'm missing something.

@asdf2014
Copy link
Member Author

@asdf2014 asdf2014 commented Feb 20, 2017

You mean this divide-and-rule-of the way in Druid is very efficient, but in our case the TopN query will happen the long tail situation. So, we should add a option to use simple GroupBy query for higher performance? @mistercrunch

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants