Add Hive Support #1686

bennfocus · 2021-07-05T07:17:36Z

I need to use Hive for the offline store and Redis for the online store.
Hive support is not included in the current 0.11+ roadmap.
I'd like to work on this, so I'm adding this issue for tracking.

bennfocus · 2021-07-05T07:21:12Z

FYI, since there isn't much choice among Hive Python clients, I will use ~~PyHive~~ Impyla as a dependency.

woop · 2021-07-05T15:32:05Z

Thanks @Baineng. You simply need to create a new OfflineStore class. Some more details here https://docs.feast.dev/feast-on-kubernetes/user-guide/extending-feast#custom-offlinestore.

I'd recommend keeping the class as an external dependency of Feast at the start (a new package). We can link to it from our docs and include it in our tests, but the repo can start out as yours. You can reference this class from the feature_store.yaml by using the class path. We will automatically pick it up using https://github.com/feast-dev/feast/blob/master/sdk/python/feast/infra/offline_stores/helpers.py#L8
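
To illustrate the class-path idea, here is a minimal sketch of how a dotted path read from feature_store.yaml could be resolved to a class with `importlib`. This is not the code in Feast's helpers.py; the function name `load_class_from_path` is hypothetical, and a standard-library class stands in for a real OfflineStore implementation.

```python
import importlib


def load_class_from_path(class_path: str) -> type:
    """Resolve a dotted path such as 'package.module.ClassName' to a class.

    Hypothetical helper for illustration only; Feast's own loader may differ.
    """
    module_name, _, class_name = class_path.rpartition(".")
    if not module_name:
        raise ValueError(f"Expected 'module.ClassName', got {class_path!r}")
    # Import the module, then look the class up by name.
    module = importlib.import_module(module_name)
    try:
        return getattr(module, class_name)
    except AttributeError as exc:
        raise ImportError(f"{module_name} has no attribute {class_name!r}") from exc


# Stand-in for a class path read from feature_store.yaml (assumption: any
# importable 'module.ClassName' string works the same way).
store_cls = load_class_from_path("collections.OrderedDict")
```

With an external package installed, the same call would take that package's class path instead of the stand-in used here.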

bennfocus · 2021-07-06T00:27:09Z

Thanks for the info @woop, that's fine for me, I will go that way.

bennfocus · 2021-07-13T04:20:24Z

Hello @woop @YikSanChan,

FYI, I have implemented part of the Hive offline store. I'd like to catch up with you and get some feedback before continuing.
Could you have a look when you have time?

The repo is here: https://github.com/baineng/feast-hive

Basically, it's very similar to the BigQuery implementation.

Some thoughts I had while reading the Feast code:

The DataSource abstractions are kept separate from offline_store and online_store in the source code, but they are actually coupled (the BigQuery store relies on BigQuerySource, ...). DataSources should probably live together with their store implementations.
Some DataSource code is also hardcoded and not easy to extend, such as DataSource.from_proto(data_source).

Some questions:

Should I add protos for HiveSource and HiveOptions in my repo as well?
I saw that you have discussed removing query from BigQuerySource. What do you think if I remove it from HiveOfflineStore? (End users should have more control in Hive; I mean by creating views.)
I saw that get_historical_features() in BigQueryOfflineStore uploads entity_df to BigQuery first and then runs the point-in-time query. I think I will need to do the same for Hive, but there is no efficient way to upload a DataFrame to Hive besides uploading to HDFS directly. I don't want to add an HDFS client as another dependency, so I will use multi-row inserts. What do you think?
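
As a rough sketch of the multi-row insert idea, the snippet below batches entity rows into `INSERT INTO TABLE ... VALUES (...), (...)` statements. The table name `entity_df_tmp`, the helper names, and the batch size are assumptions for illustration, and the literal escaping is deliberately simplistic; it is not the feast-hive implementation.

```python
from datetime import datetime
from typing import Iterator, Sequence


def to_sql_literal(value: object) -> str:
    """Render a Python value as a SQL literal (simplistic sketch, not a full escaper)."""
    if value is None:
        return "NULL"
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, (int, float)):
        return repr(value)  # assumption: no NaN/inf values in the data
    if isinstance(value, datetime):
        # Rendered as a quoted string; an explicit CAST may be needed
        # depending on the target column type.
        return "'" + value.strftime("%Y-%m-%d %H:%M:%S") + "'"
    text = str(value).replace("\\", "\\\\").replace("'", "\\'")
    return "'" + text + "'"


def build_insert_statements(
    table: str, rows: Sequence[Sequence[object]], batch_size: int = 1000
) -> Iterator[str]:
    """Yield one multi-row INSERT statement per batch of rows."""
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        values = ", ".join(
            "(" + ", ".join(to_sql_literal(v) for v in row) + ")" for row in batch
        )
        yield f"INSERT INTO TABLE {table} VALUES {values}"


# Hypothetical entity rows: (entity id, event timestamp)
rows = [
    (1001, datetime(2021, 7, 1, 12, 0, 0)),
    (1002, datetime(2021, 7, 2, 8, 30, 0)),
    (1003, None),
]
statements = list(build_insert_statements("entity_df_tmp", rows, batch_size=2))
```

Each statement would then be passed to the Hive client's cursor; batching keeps individual statements from growing without bound for large entity DataFrames.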

bennfocus · 2021-07-14T05:42:26Z

~~FYI, I will change python client from Impyla to Ibis for better data writing support.~~

woop · 2021-07-14T18:24:51Z

This is so awesome @Baineng. Thank you for working on Hive support!

> The DataSource abstractions are kept separate from offline_store and online_store in the source code, but they are actually coupled (the BigQuery store relies on BigQuerySource, ...). DataSources should probably live together with their store implementations.

Yes, we agree. Needs to be cleaned up.

> Some DataSource code is also hardcoded and not easy to extend, such as DataSource.from_proto(data_source).

Yeah, we realize this as well. Good catch. It needs to be generalized.

> Should I add protos for HiveSource and HiveOptions in my repo as well?

I believe the answer is yes. We need to store the source/config in the registry, so it needs to exist, and I don't think it should be in the main repo.

> I saw that you have discussed removing query from BigQuerySource. What do you think if I remove it from HiveOfflineStore? (End users should have more control in Hive; I mean by creating views.)

I prefer not to introduce query.

> I saw that get_historical_features() in BigQueryOfflineStore uploads entity_df to BigQuery first and then runs the point-in-time query. I think I will need to do the same for Hive, but there is no efficient way to upload a DataFrame to Hive besides uploading to HDFS directly. I don't want to add an HDFS client as another dependency, so I will use multi-row inserts. What do you think?

You do have to do the same, but I share your intuition. It would be nice if we didn't couple to HDFS but instead wrote directly to Hive. Also, we can allow users to provide a reference to a table where their entity df is available within Hive, so we can skip the upload step. So at least there is a way for them to use Hive.
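
For readers unfamiliar with the point-in-time query discussed above, here is a minimal pure-Python sketch of the idea: for each entity row, pick the latest feature value whose timestamp is at or before the entity's timestamp. The function name and the row layout are assumptions for illustration; a real offline store would express this in SQL and would handle further details (such as a feature TTL) that this sketch ignores.

```python
from bisect import bisect_right
from collections import defaultdict
from datetime import datetime


def point_in_time_join(entity_rows, feature_rows):
    """Join each (entity_id, ts) with the latest feature value at or before ts.

    entity_rows:  iterable of (entity_id, event_timestamp)
    feature_rows: iterable of (entity_id, feature_timestamp, value)
    Returns a list of (entity_id, event_timestamp, value or None).
    """
    # Group feature rows per entity and sort each group by timestamp.
    by_entity = defaultdict(list)
    for entity_id, ts, value in feature_rows:
        by_entity[entity_id].append((ts, value))
    for history in by_entity.values():
        history.sort(key=lambda item: item[0])

    result = []
    for entity_id, ts in entity_rows:
        history = by_entity.get(entity_id, [])
        timestamps = [t for t, _ in history]
        # Index of the first feature row strictly after ts.
        idx = bisect_right(timestamps, ts)
        value = history[idx - 1][1] if idx > 0 else None
        result.append((entity_id, ts, value))
    return result


features = [
    (1, datetime(2021, 7, 1), 10),
    (1, datetime(2021, 7, 3), 30),
    (2, datetime(2021, 7, 2), 5),
]
entities = [
    (1, datetime(2021, 7, 2)),
    (1, datetime(2021, 7, 3)),
    (2, datetime(2021, 7, 1)),
]
joined = point_in_time_join(entities, features)
```

An entity row with no earlier feature row gets `None`, which avoids leaking feature values from the future into training data.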

bennfocus · 2021-07-15T02:46:46Z

Thanks for your feedback, @woop. It definitely helps with my next steps.

bennfocus · 2021-08-02T02:21:18Z

FYI, since the Feast internals related to OfflineStore and DataSource are changing, I have postponed the Hive support implementation until the next Feast release.

achals · 2021-08-02T17:41:39Z

Thanks for the heads up, @Baineng. I think we're done with most of the refactoring. If you start development off of the master branch, you should be okay. Please let us know if you encounter any bugs!

bennfocus · 2021-08-03T03:09:47Z

Great, thanks for the update @achals, I will catch up and start soon.

bennfocus · 2021-08-30T03:54:41Z

FYI,
I have just published the first stable version to PyPI; I think it's ready for use now.
Please create an issue in the repo if you run into any problems.

@woop @achals Please review it when you have time; I'd appreciate any feedback.

bennfocus mentioned this issue Jul 5, 2021

Feast Roadmap for 0.11+ #1527

Closed

jeina7 mentioned this issue Aug 23, 2021

"Could not identify the source type being added." Error when listing feature views bennfocus/feast-hive#1

Closed

adchia closed this as completed Nov 22, 2021
