Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support of JOIN statement for queries #875

Closed
ololobus opened this issue Aug 2, 2016 · 24 comments

Comments

@ololobus
Copy link

commented Aug 2, 2016

Are there any plans to add JOINs support?

Here is a simple case: there are two tables documents and languages, every document has language_id column, every language has name and code columns. Thus, now I can only display distribution by language_id which is not too informative.
document_languages

It could be solved andding a simple join statement with select of e.g. languages.code instead of documents.language_id.

I would be glad to help somehow and even started reading the project code, but it's hard enough to navigate through all this code. I found that main functions for queries performing are placed here https://github.com/airbnb/caravel/blob/master/caravel/models.py

P.S. Sorry, if I missed something and there is already a possibility to do it somehow.

@mistercrunch

This comment has been minimized.

Copy link
Contributor

commented Aug 2, 2016

http://airbnb.io/caravel/faq.html#can-i-query-join-multiple-tables-at-one-time

Can I query/join multiple tables at one time?

Not directly no. A Caravel SQLAlchemy datasource can only be a single table or a view.

When working with tables, the solution would be to materialize a table that contains all the fields needed for your analysis, most likely through some scheduled batch process.

A view is a simple logical layer that abstract an arbitrary SQL queries as a virtual table. This can allow you to join and union multiple tables, and to apply some transformation using arbitrary SQL expressions. The limitation there is your database performance as Caravel effectively will run a query on top of your query (view). A good practice may be to limit yourself to joining your main large table to one or many small tables only, and avoid using GROUP BY where possible as Caravel will do its own GROUP BY and doing the work twice might slow down performance.

Whether you use a table or a view, the important factor is whether your database is fast enough to serve it in an interactive fashion to provide a good user experience in Caravel.

@mistercrunch

This comment has been minimized.

Copy link
Contributor

commented Aug 2, 2016

Though we may add support for joinable tables in the future it's not on the short term roadmap, it would only be for many to one left joins on static joining criteria.

@ololobus

This comment has been minimized.

Copy link
Author

commented Aug 2, 2016

@mistercrunch I've seen this FAQ answer, of course. But it looks more confusing than helpful for me, because I understood it in a way, that I can create new views inside Caravel somehow rather than directly inside DB. OK, thank you.

@alanmcruickshank

This comment has been minimized.

Copy link
Contributor

commented Aug 2, 2016

@ololobus @mistercrunch This is actually possible at the moment through some recently added functionality added for the work on additional datetimes. This method is way faster than using views (unless you're able to materialise), due to how sqlalchemy executes the query.

Head into your table setup and click the edit button:
screenshot-a

Go to the "List Table Columns" Tab and click add
screenshot-b

give your column a name (in your case, language_code or something like that), select group by and filter by and then in expression add something like:

(select langauges.code from languages where language_id = documents.languages_id)

Save and then you should be able to group by this column.

What you're effectively doing it added a composite column into the sqlalchemy model which is then injected directly into the query at runtime.

@ololobus

This comment has been minimized.

Copy link
Author

commented Aug 2, 2016

@alanmcruickshank It worked, thank you!

I've take a look on the produced SQL query in this case:

SELECT (select languages.code from languages where languages.id =
    documents.language_id) AS language_code, COUNT(*) AS count 
    FROM documents JOIN (SELECT (select languages.code from languages where languages.id =
    documents.language_id) AS language_code__ 
    FROM documents 
    WHERE created_at >= '2016-07-26 23:09:12.000000' AND created_at <= '2016-08-02 23:09:12.000000' GROUP BY (select languages.code from languages where languages.id =
    documents.language_id) ORDER BY COUNT(*) DESC 
     LIMIT 50) AS anon_1 ON (select languages.code from languages where languages.id =
    documents.language_id) = language_code__ 
    WHERE created_at >= '2016-07-26 23:09:12.000000' AND created_at <= '2016-08-02 23:09:12.000000' GROUP BY (select languages.code from languages where languages.id =
    documents.language_id) ORDER BY count DESC 
     LIMIT 50000

It results in a very tricky JOIN, so created inside db view with joined tables, e.g. documents_language will perform relatively the same even if it's not materialized.

Anyway, I'm glad that I can achieve the same result directly from Caravel. Thank you again.

P.S. It would be great to add such an example inside this FAQ answer, because it's frequent case, I guess :)

@QiXuanWang

This comment has been minimized.

Copy link

commented Mar 8, 2017

I found this issue when I was searching for multiple tables support. In my application, I need to perform data analysis based on raw tables and then display it through superset (hopefully). My raw database used many small tables, while new table/view to be displayed could be stored in a materialized table for view purpose if needed. But I'm not sure if it's a proper use case for superset anymore.
If anyone has comment, I'd like to see how this could be solved.

@mistercrunch

This comment has been minimized.

Copy link
Contributor

commented Mar 8, 2017

Using a view, materialized or not, is how you'd make that happen in Superset. It is a proper use case.

@QiXuanWang

This comment has been minimized.

Copy link

commented Mar 9, 2017

Thanks a lot @mistercrunch
I was wondering since I didn't fully get your comment above:
"A view is a simple logical layer that abstract an arbitrary SQL queries as a virtual table. This can allow you to join and union multiple tables, and to apply some transformation using arbitrary SQL expressions. "

Not sure I fully understand this. By "view" do you mean a class in views.py or just a SQL expression?
For me, I'll make some complicated data manipulating on raw tables so a simple SQL expression should not work. I'm not quite familiar with views.py(FAB) yet so I can't say too much about this.

@mistercrunch

This comment has been minimized.

Copy link
Contributor

commented Mar 9, 2017

I meant it in the database/SQL sense of the term, as in CREATE VIEW foo AS SELECT...

@QiXuanWang

This comment has been minimized.

Copy link

commented Mar 9, 2017

Appreciate it. Now I understand.
My application will call python functions to do post-processing so I guess a materialized table(though it should be called data-source too) is necessary.

@gclsoft

This comment has been minimized.

Copy link

commented Aug 19, 2017

@alanmcruickshank Not work at all.

SELECT cfname,
       orgname FROM (SELECT cfname AS cfname,
                           (select org.fname_l2
                              from t_org_ctrlunit org
                             where fid = org.fcontrolunitid) AS orgname
                      FROM ct_crm_customerinfo
                     WHERE fcreatetime >=
                           TO_TIMESTAMP('2017-08-12T08:21:17\',
                                        'YYYY-MM-DD"T"HH24:MI:SS.ff6\')
                       AND fcreatetime <=
                           TO_TIMESTAMP('2017-08-19T08:21:17\',
                                        'YYYY-MM-DD"T"HH24:MI:SS.ff6\')) WHERE ROWNUM <= 50000

I added (select org.fname_l2 from t_org_ctrlunit org where fid = org.fcontrolunitid) in the Expression.
Supderset can't even left join, the sql is wrong. How to solve it?

@mistercrunch

This comment has been minimized.

Copy link
Contributor

commented Aug 20, 2017

@gclsoft SELECT in expressions as IN

SELECT (SELECT...) ...

isn't supported by all database engines.

Just create a view!

@alanmcruickshank

This comment has been minimized.

Copy link
Contributor

commented Aug 20, 2017

@gclsoft - what database engine are you using?

The other thing that might help you is since this original question, the release of the SQL Lab feature has made it much easier to create tables in superset which point at joined tables within your database.

@lilloraffa

This comment has been minimized.

Copy link

commented Sep 13, 2017

@alanmcruickshank Hi there, I'm trying to explore what you are saying (use SQL Lab to create tables based on sql statement, and so also table made of joins), but I've not been very successful to find a well defined process to do that. Only thing I figured out is to use SQL Lab -> run the query -> click on "visualize" ->this create "magically" a new datasource, which seems more a workaround than a proper process to create a new datasource from a query.
Would you be able to give us more details in case there actually is a more "structural" way to do the job? Thanks a lot!

@xrmx

This comment has been minimized.

Copy link
Contributor

commented Sep 13, 2017

@lilloraffa that's it, what do you have in mind for a more structured way to handle it?

@lilloraffa

This comment has been minimized.

Copy link

commented Sep 14, 2017

@xrmx well, since tables are created and managed in Sources->Tables, I would expect a functionality there to let me create tables based on sql statements. I mean, letting people create table using sql statement from a "visualize" function in SQL Lab seems more like a workaround to me.

@xrmx

This comment has been minimized.

Copy link
Contributor

commented Sep 14, 2017

@lilloraffa i think the idea is that the people will explore the sql they need as datasource as they do for the data. And they don't have access to to a sql shell :)

Your use case is that you already know the sql and have another shell to try it and you just want to copy and paste it in a new table right? That should be doable.

@lilloraffa

This comment has been minimized.

Copy link

commented Sep 14, 2017

@xrmx that is correct, in general I would like to have the flexibility to define table with sql statement so to join more table, even coming from different database.
My general use case is to provide a general purpose analytics framework (we are evaluating superset for this) to be connected to a Big Data Infrastructure, and user can choose which dataset (or join of dataset) to use in their analysis.
Thanks a lot!

@xrmx

This comment has been minimized.

Copy link
Contributor

commented Sep 14, 2017

@lilloraffa it shouldn't take much to implement, feel free to open a PR :)
cough If it's challenging enough looks more useful than a theme ;) cough

@mistercrunch

This comment has been minimized.

Copy link
Contributor

commented Sep 15, 2017

The tricky part is getting the column metadata. In the Visualize button flow in SQL Lab, we already know the columns and have an idea of the data types as we have the result set on hand when the button is clicked.

To do this in the "Add Table" flow, we'd have to run a quick query to do this, which may time out or whatever and again we're just guessing the data types based on the result set which isn't ideal.

Seems like it would be much easier to have another flow that would run a CREATE VIEW on the user's behalf (assuming the db user in the connection string has the permission to do that), and then just use the view as if it were a table (then it's easy to get the actual data types from the DB).

Though then we may need some way to allow users to ALTER VIEW as well. All of this is kind of a can of worms...

@olokedeomotayo

This comment has been minimized.

Copy link

commented Jan 23, 2018

Dear All,

I have a table i joined in SQL Lab in Superset but only one of the Table is displaying. I have used LEFT and Right Join but same thing.. Any help please.

@xrmx

This comment has been minimized.

Copy link
Contributor

commented Jan 23, 2018

@olokedeomotayo you already filed an issue, no need to repeat yourself here please.

@gclsoft

This comment has been minimized.

Copy link

commented Feb 24, 2018

Does superset support left join now? Instead of using views or only one table?

@mistercrunch

This comment has been minimized.

Copy link
Contributor

commented Feb 25, 2018

Still no support for joins at the moment and in the near future. We still tell people to create a view and use that as a source. Many more advanced dbs with good query optmizers will in fact omit the join from the plan when left-joining on a field that has a unique constraint and no fields from the remote table are referenced in the query. My point being that there may be no extra cost for unused joins in views in some cases.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
8 participants
You can’t perform that action at this time.