
Add Hive Support #1686

Closed
bennfocus opened this issue Jul 5, 2021 · 11 comments

Comments

@bennfocus
Contributor

I need to use Hive as the offline store and Redis as the online store.
Hive support is not included in the current 0.11+ roadmap.
I'd like to work on this; I'm adding this issue for tracking.

@bennfocus
Contributor Author

bennfocus commented Jul 5, 2021

FYI, since there isn't much choice for a Hive Python client, I will use Impyla (rather than PyHive) as a dependency.

@woop
Member

woop commented Jul 5, 2021

Thanks @Baineng. You simply need to create a new OfflineStore class. More details here: https://docs.feast.dev/feast-on-kubernetes/user-guide/extending-feast#custom-offlinestore.

I'd recommend keeping the class as an external dependency of Feast at the start (a new package). We can link to it from our docs and include it in our tests, but the repo can start out as yours. You can reference this class from the feature_store.yaml by using the class path. We will automatically pick it up using https://github.com/feast-dev/feast/blob/master/sdk/python/feast/infra/offline_stores/helpers.py#L8
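
To make that concrete, here is a minimal sketch of such a class, loosely following the 0.11-era OfflineStore interface; the exact signatures live in the docs linked above and may have changed since, and the feast_hive package/module names are assumptions.

```python
# Hypothetical sketch of the extension point described above, loosely
# following the 0.11-era OfflineStore interface; exact signatures may differ.
from feast.infra.offline_stores.offline_store import OfflineStore


class HiveOfflineStore(OfflineStore):
    @staticmethod
    def get_historical_features(config, feature_views, feature_refs,
                                entity_df, registry, project):
        # Upload entity_df to Hive, then run the point-in-time join there.
        raise NotImplementedError

    @staticmethod
    def pull_latest_from_table_or_query(data_source, join_key_columns,
                                        feature_name_columns,
                                        event_timestamp_column,
                                        created_timestamp_column,
                                        start_date, end_date):
        # Read the latest feature rows when materializing to the online store.
        raise NotImplementedError
```

feature_store.yaml would then name the class by its import path, e.g. `type: feast_hive.HiveOfflineStore` under `offline_store:` (package name assumed), which the helper linked above resolves dynamically.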

@bennfocus
Contributor Author

Thanks for the info @woop, that works for me; I'll go that way.

@bennfocus
Contributor Author

Hello @woop @YikSanChan,

FYI, I have implemented part of the Hive offline store and wanted to catch up with you and get some feedback before continuing.
Could you have a look when you have time?

The repo is here: https://github.com/baineng/feast-hive

Basically it's very similar to the BigQuery implementation.

Some thoughts I had while reading the Feast code:

  • The DataSource abstractions in the source code are separated from offline_store and online_store, but they are actually coupled (the BigQuery store relies on BigQuerySource, etc.), so DataSources should probably live alongside their store implementations.
  • Some DataSource code is hardcoded and not easy to extend, such as DataSource.from_proto(data_source).

Some questions:

  • Should I add protos for HiveSource and HiveOptions in my repo as well?
  • I saw you had some discussion about removing query from BigQuerySource. What do you think about leaving it out of the Hive offline store? (Hive should give end users more control, e.g. by creating views.)
  • I saw that get_historical_features() in BigQueryOfflineStore uploads entity_df to BigQuery first, then runs the point-in-time query.
    I think I will need to do the same for Hive, but there is no efficient way to upload a DataFrame to Hive besides writing to HDFS directly. I don't want to add an HDFS client as another dependency, so I plan to use a multi-row INSERT (see the sketch below). What do you think?
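
Roughly what I mean by multi-row INSERT, as a sketch assuming an Impyla/HiveServer2 connection; the host, port, table name, and the naive STRING typing are all placeholders:

```python
# Rough sketch of uploading a pandas entity_df via one multi-row INSERT,
# assuming Impyla and HiveServer2. A real version would map pandas dtypes
# to Hive types and escape values instead of naive string quoting.
import pandas as pd
from impala.dbapi import connect


def upload_entity_df(entity_df: pd.DataFrame, table: str) -> None:
    conn = connect(host="hiveserver2-host", port=10000)
    cursor = conn.cursor()
    columns = ", ".join(f"{col} STRING" for col in entity_df.columns)
    cursor.execute(f"CREATE TABLE {table} ({columns})")
    values = ", ".join(
        "(" + ", ".join(f"'{value}'" for value in row) + ")"
        for row in entity_df.itertuples(index=False)
    )
    cursor.execute(f"INSERT INTO {table} VALUES {values}")
```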

@bennfocus
Contributor Author

bennfocus commented Jul 14, 2021

FYI, I will change the Python client from Impyla to Ibis for better data-writing support.
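
For illustration, the kind of write path Ibis's Impala backend exposes; the exact connection arguments and method signatures vary across Ibis releases, so treat the names below as assumptions rather than a pinned API:

```python
# Illustrative sketch only: connection arguments and signatures vary
# across Ibis releases; names below are assumptions.
import ibis
import pandas as pd

client = ibis.impala.connect(host="hiveserver2-host", port=21050)
entity_df = pd.DataFrame({"driver_id": [1001, 1002]})

# create_table accepts a pandas DataFrame, so Ibis handles the write path
# instead of hand-built multi-row INSERT statements.
client.create_table("entity_df_staging", entity_df)
```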

@woop
Member

woop commented Jul 14, 2021


This is so awesome @Baineng. Thank you for working on Hive support!

> The DataSource abstractions in the source code are separated from offline_store and online_store, but they are actually coupled (the BigQuery store relies on BigQuerySource, etc.), so DataSources should probably live alongside their store implementations.

Yes, we agree. Needs to be cleaned up.

> Some DataSource code is hardcoded and not easy to extend, such as DataSource.from_proto(data_source).

Yea, we realize this as well. Good catch. It needs to be generalized.

> Should I add protos for HiveSource and HiveOptions in my repo as well?

I believe the answer is yes. We need to store the source/config in the registry, so it needs to exist, and I don't think it should be in the main repo.

> I saw you had some discussion about removing query from BigQuerySource. What do you think about leaving it out of the Hive offline store? (Hive should give end users more control, e.g. by creating views.)

I prefer not to introduce query.
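
One way to read the "more control via views" point: the user materializes any query as a Hive view and points the source at it. A hypothetical illustration, assuming Impyla and a HiveSource class in feast-hive (parameter name illustrative):

```python
# Hypothetical illustration of the views workaround: instead of a `query`
# field on the source, the user creates a Hive view and points the
# (assumed) HiveSource at it like a plain table.
from impala.dbapi import connect

conn = connect(host="hiveserver2-host", port=10000)
cursor = conn.cursor()
cursor.execute(
    "CREATE VIEW IF NOT EXISTS my_db.driver_stats_v AS "
    "SELECT driver_id, avg_daily_trips, event_timestamp "
    "FROM my_db.driver_stats"
)
# A source such as HiveSource(table="my_db.driver_stats_v", ...) then
# behaves like any table-backed source.
```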

> I saw that get_historical_features() in BigQueryOfflineStore uploads entity_df to BigQuery first, then runs the point-in-time query.
> I think I will need to do the same for Hive, but there is no efficient way to upload a DataFrame to Hive besides writing to HDFS directly. I don't want to add an HDFS client as another dependency, so I plan to use a multi-row INSERT. What do you think?

You do have to do the same, but I share your intuition. It would be nice if we didn't couple to HDFS but instead wrote directly to Hive. We can also allow users to provide a reference to a table where their entity df is already available within Hive, so we can skip the upload step; that way there is at least an efficient path for them to use Hive.
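
As a usage sketch of that idea, retrieval could accept either a DataFrame or a reference to data already in Hive, mirroring how the 0.11-era SDK lets entity_df be a SQL string; the feature reference and table names below are placeholders:

```python
# Usage sketch: skipping the upload by referencing data already in Hive,
# mirroring how the upstream SDK accepts a SQL string for entity_df.
# Feature refs and table names are placeholders.
import pandas as pd
from feast import FeatureStore

store = FeatureStore(repo_path=".")
entity_df = pd.DataFrame(
    {"driver_id": [1001], "event_timestamp": [pd.Timestamp.utcnow()]}
)

# Option 1: a pandas DataFrame, which the store has to upload to Hive first.
job = store.get_historical_features(
    entity_df=entity_df,
    feature_refs=["driver_stats:avg_daily_trips"],
)

# Option 2: a reference to data already in Hive, skipping the upload.
job = store.get_historical_features(
    entity_df="SELECT driver_id, event_timestamp FROM my_db.entity_df",
    feature_refs=["driver_stats:avg_daily_trips"],
)
training_df = job.to_df()
```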

@bennfocus
Contributor Author

Thanks for your feedback @woop, it definitely helps with my next steps.

@bennfocus
Contributor Author

bennfocus commented Aug 2, 2021

FYI, since Feast's internal code related to OfflineStore and DataSource is changing, I have postponed the Hive support implementation until the next Feast release.

@achals
Member

achals commented Aug 2, 2021

Thanks for the heads up @Baineng, I think we're done with most of the refactoring. If you start development off the master branch, you should be okay. Please let us know if you encounter any bugs!

@bennfocus
Contributor Author

Great, thanks for the update @achals, I will catch up and start soon.

@bennfocus
Contributor Author

FYI,
I have just published the first stable version to PyPI; I think it's ready for use now.
Please create an issue in the repo if you run into any problems.

@woop @achals Please have a review when you have time; I'd appreciate any feedback.

@adchia adchia closed this as completed Nov 22, 2021