Add LanceDB to the list of Known Users #7716

Merged
merged 2 commits into from
Oct 2, 2023


alamb
Contributor

@alamb alamb commented Oct 1, 2023

I was reading about LanceDB and then realized it used DataFusion -- https://github.com/search?q=repo%3Alancedb%2Flance%20datafusion&type=code

cc @wjones127

(btw I would love to know why you chose DataFusion and how you are using it -- among other things, it might make an excellent example usecase for #6782)

Review comment on docs/source/user-guide/introduction.md (outdated, resolved)
@wjones127
Member

btw I would love to know why you chose DataFusion and how you are using it -- among other things, it might make an excellent example usecase for

Lance is essentially a table format (like Delta Lake). Table formats blur the line between data format and database, so building one requires database components, such as an expression library. In addition, one of Lance's distinguishing features is support for secondary indexes (right now, just ANN indexes for approximate KNN search). To use these, we need query plans that scan both indexed data and yet-to-be-indexed data in parallel and combine the two in a single query. We use DataFusion to do this.
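
As a rough illustration of the indexed-plus-unindexed plan described here, the following is a minimal pure-Python sketch. It is not Lance's or DataFusion's code: `ann_index_search`, `flat_scan`, and `knn` are hypothetical names, the "index" is simulated with an exact search, and the two branches run one after the other rather than in parallel.

```python
import heapq
import math


def ann_index_search(indexed_rows, query, k):
    """Stand-in for an ANN index lookup over already-indexed rows.

    Assumption: a real index would return approximate neighbours;
    this sketch computes exact distances to stay self-contained.
    """
    scored = ((math.dist(vec, query), row_id) for row_id, vec in indexed_rows.items())
    return heapq.nsmallest(k, scored)


def flat_scan(unindexed_rows, query, k):
    """Brute-force scan over rows that have not been indexed yet."""
    scored = ((math.dist(vec, query), row_id) for row_id, vec in unindexed_rows.items())
    return heapq.nsmallest(k, scored)


def knn(indexed_rows, unindexed_rows, query, k):
    """Combine both branches and keep the overall k nearest row ids."""
    candidates = ann_index_search(indexed_rows, query, k) + flat_scan(unindexed_rows, query, k)
    return [row_id for _, row_id in heapq.nsmallest(k, candidates)]


# Rows 0 and 1 are covered by the index; rows 2 and 3 were appended later.
indexed = {0: (0.0, 0.0), 1: (5.0, 5.0)}
unindexed = {2: (1.0, 0.0), 3: (9.0, 9.0)}
print(knn(indexed, unindexed, (0.0, 0.0), k=2))  # prints [0, 2]
```

Each branch only needs to return its own top k candidates, because the final merge picks the overall k nearest from the union.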

The two things we like about DataFusion in particular are: (1) it's easy to extend with new query nodes and (2) it's Arrow-native. The extensibility matters for operations like scanning indices and our Take operation (get additional columns by their known row locations). DataFusion being Arrow-native has meant it's been easy to integrate with PyArrow and the larger Python data ecosystem. For example, we have many APIs where users write Python functions that operate on RecordBatches, and these can operate directly on the data without having to do any conversion. (We are very heavy users of the C Data Interface.)
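
The Take operation mentioned here can be pictured with plain Python lists standing in for Arrow arrays. This is only a sketch of the idea under that assumption: `take` and the dict-of-lists column layout are hypothetical and do not reflect Lance's actual API.

```python
def take(columns, row_ids, names):
    """Fetch the named columns at the given row positions.

    `columns` maps a column name to a list of values, a hypothetical
    stand-in for a columnar batch; `row_ids` are known row locations.
    """
    return {name: [columns[name][i] for i in row_ids] for name in names}


# A search has already identified rows 2 and 0; fetch one extra column for them.
table = {
    "id": [10, 11, 12, 13],
    "caption": ["a", "b", "c", "d"],
}
print(take(table, row_ids=[2, 0], names=["caption"]))  # prints {'caption': ['c', 'a']}
```

Only the requested columns are touched, and only at the requested positions, which is the point of deferring the fetch until the row locations are known.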

Co-authored-by: Will Jones <willjones127@gmail.com>
@alamb
Contributor Author

alamb commented Oct 2, 2023

@wjones127 -- thank you for your comments in #7716 (comment)

Do you mind if I use this as a usecase in the paper we are working on (#6782)? I think it validates several of the points in the paper (Arrow compatibility and having all the expression machinery).

@eddyxu
Member

eddyxu commented Oct 2, 2023

Do you mind if I use this as a usecase in the paper we are working on (#6782)? I think it validates several of the points in the paper (Arrow compatibility and having all the expression machinery).

We'd love to support your paper submission!

@alamb alamb merged commit 422e68e into main Oct 2, 2023
7 checks passed
@alamb alamb added the documentation Improvements or additions to documentation label Oct 2, 2023
@wjones127 wjones127 deleted the alamb-patch-1 branch October 2, 2023 20:47
Ted-Jiang pushed a commit to Ted-Jiang/arrow-datafusion that referenced this pull request Oct 7, 2023
* Add LanceDB to the list of Known Users

* Update docs/source/user-guide/introduction.md

Co-authored-by: Will Jones <willjones127@gmail.com>

---------

Co-authored-by: Will Jones <willjones127@gmail.com>