
Conversation

@woop (Member) commented May 8, 2020

What this PR does / why we need it:

1. Generalize the Source model.

The current Source model in Feast Core is Kafka specific. For all intents and purposes it is a hardcoded implementation of KafkaSource, with topics/brokers as top-level fields, despite being named Source.

Not generalizing the data model at this point (prior to the release of 0.5) will cause further problems down the road when new sources are introduced.

This PR moves Source configuration into a config object and isolates Kafka-specific logic to case statements. isDefault is retained for the time being as a top-level field, since it can easily be phased out later if needed.

Configuration is stored in String format (screenshot omitted).
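For illustration, a minimal sketch of what such a generalized model might look like (class and field names here are hypothetical, not the actual Feast Core code):

```java
// Hypothetical sketch of a generalized Source entity. Kafka-specific fields
// are no longer top-level; instead there is a type tag plus an opaque config
// string, with Kafka logic isolated to a case statement.
public class Source {
  public enum SourceType { KAFKA }

  private final SourceType type;
  private final String config; // serialized options, e.g. "brokers=...;topic=..."
  private final boolean isDefault; // retained as a top-level field for now

  public Source(SourceType type, String config, boolean isDefault) {
    this.type = type;
    this.config = config;
    this.isDefault = isDefault;
  }

  public SourceType getType() { return type; }
  public String getConfig() { return config; }

  // Source-type specific logic is isolated to case statements.
  public String describe() {
    switch (type) {
      case KAFKA:
        return "kafka source: " + config;
      default:
        throw new IllegalArgumentException("Unknown source type: " + type);
    }
  }

  // isDefault is deliberately excluded from equality, as described above.
  @Override
  public boolean equals(Object o) {
    if (this == o) return true;
    if (!(o instanceof Source)) return false;
    Source other = (Source) o;
    return type == other.type && config.equals(other.config);
  }

  @Override
  public int hashCode() {
    return java.util.Objects.hash(type, config);
  }
}
```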

Comparison between Source objects with .equals() no longer takes the isDefault field into account.

Under this model, identical Source objects (i.e. equal under .equals()) can be stored as duplicate Source objects.

  • Added deduplication code to JobCoordinatorService to take this into account.

2. Make Feast stop duplicate ingestion Jobs.

  • Currently JobCoordinatorService does not stop duplicate jobs (i.e. ingestion jobs that ingest from the same exact source-to-store pairing).
  • Updates JobCoordinatorService to abort these extra ingestion Jobs when safe (i.e. only when JobCoordinatorService can find a running ingestion job for each source-to-store pairing).
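The deduplication described above could be sketched roughly as follows (a simplified illustration, not the actual JobCoordinatorService code; names are hypothetical):

```java
import java.util.*;

// Hypothetical sketch: keep one RUNNING ingestion job per (source, store)
// pairing and flag the rest as extras to abort. Only RUNNING jobs are
// considered, matching the "only abort when a running job exists" safety rule.
public class JobDeduper {
  public enum JobStatus { PENDING, RUNNING, ABORTING }

  public static class Job {
    final String id;
    final String sourceToStoreKey; // e.g. "kafka:config-hash->redis"
    JobStatus status;

    Job(String id, String sourceToStoreKey, JobStatus status) {
      this.id = id;
      this.sourceToStoreKey = sourceToStoreKey;
      this.status = status;
    }
  }

  // Returns the duplicate running jobs that should be aborted.
  public static List<Job> findExtraJobs(List<Job> jobs) {
    Map<String, Job> keeper = new HashMap<>();
    List<Job> extras = new ArrayList<>();
    for (Job job : jobs) {
      if (job.status != JobStatus.RUNNING) continue;
      // First running job per pairing is kept; later duplicates are extras.
      if (keeper.putIfAbsent(job.sourceToStoreKey, job) != null) {
        extras.add(job);
      }
    }
    return extras;
  }
}
```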

3. Job Model Refactors

  • Standardized and updated the JobManager API:
    • startJob() is standardized as transitioning a Job from PENDING to RUNNING.
    • abortJob() is standardized as transitioning a Job from RUNNING to ABORTING.
    • Changed abortJob() to return a Job and take a Job as an argument, to be consistent with the other methods.
  • Refactored JobUpdateTask.call() to be easier to follow.
  • Refactored JobCoordinatorService.poll() into multiple methods (ie getSourceToStoreMapping(), makeJobUpdateTasks()) to make code more readable.
  • Updated Job to store source fields (i.e. type and config) as inline fields in the Job table.
    • This is done to make the Job model more consistent with the ingestion Job it represents.
    • Modifying the Source that the Job model references does not reflect onto the underlying ingestion Job.
    • Hence source fields are copied onto Job to reflect this in the Job model.
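The standardized state machine could be sketched as follows (a minimal illustration of the contract described above; the class names and the trivial in-memory implementation are hypothetical):

```java
// Hypothetical sketch of the standardized JobManager contract:
// startJob() moves PENDING -> RUNNING, abortJob() moves RUNNING -> ABORTING,
// and both take and return a Job for consistency with the other methods.
public class JobManagerSketch {
  public enum JobStatus { PENDING, RUNNING, ABORTING }

  public static class Job {
    JobStatus status;
    public Job(JobStatus status) { this.status = status; }
    public JobStatus getStatus() { return status; }
  }

  public interface JobManager {
    Job startJob(Job job); // expects PENDING, returns it RUNNING
    Job abortJob(Job job); // expects RUNNING, returns it ABORTING
  }

  // A trivial in-memory implementation that only enforces the transitions.
  public static class InMemoryJobManager implements JobManager {
    @Override
    public Job startJob(Job job) {
      if (job.status != JobStatus.PENDING) {
        throw new IllegalStateException("startJob expects a PENDING job");
      }
      job.status = JobStatus.RUNNING;
      return job;
    }

    @Override
    public Job abortJob(Job job) {
      if (job.status != JobStatus.RUNNING) {
        throw new IllegalStateException("abortJob expects a RUNNING job");
      }
      job.status = JobStatus.ABORTING;
      return job;
    }
  }
}
```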

Which issue(s) this PR fixes:

Fixes #632

Does this PR introduce a user-facing change?:

The database schema for Source has been generalized. This is a breaking change and requires a migration.
The database schema for Job has changed. The Job table no longer stores Sources by id; instead it stores Source.config and Source.type as inline fields.

Feast now stops duplicate Ingestion Jobs with the same source and store pairing.

@woop woop requested review from pradithya and zhilingc as code owners May 8, 2020 06:10
@woop woop added kind/techdebt compat/breaking Breaking user-facing change and removed needs-kind labels May 8, 2020
@woop woop changed the title [WIP] Generalize Source data model Generalize Source data model May 8, 2020
Collaborator

Doesn't this mean that a new record is created for every source pushed to the db, even if there was a previously applied fs sharing the same source?

Member Author

Yes it does. I thought about unifying these, but I think it would be unintuitive for one feature set to be able to modify the source of another feature set. Ideally a source would be configured by the administrator and the user should just select it, at which point sharing sources would make more sense in my opinion.

Collaborator

It's not possible for a feature set to modify the source of another feature set; if a feature set's source is updated, a new entry is written to the db, distinct from the old one (still in use by other feature sets).

Member Author

Ok. Didn't realize that. So it acts pretty similarly to how it's written here with some deduplication?

Should I change anything or are you happy with this implementation as a generalization step?

Collaborator

No, this implementation won't work once you're able to alter feature set sources, unfortunately. It will break job creation.

@zhilingc (Collaborator)

/hold

@woop (Member Author) commented May 11, 2020

/retest

@woop woop force-pushed the generalize-sources branch from 161a46d to 6904912 Compare May 18, 2020 02:35
@mrzzy mrzzy force-pushed the generalize-sources branch from 90f552d to 785208a Compare May 19, 2020 10:31
@mrzzy (Collaborator) commented May 24, 2020

Currently, the point of contention for this PR is how to represent the Source's id.

As pointed out by @zhilingc, the source id (if it were to have one) must stay constant to ensure that when the user calls SpecService's applyFeatureSet(), it does not create an entirely new Source object, as JobCoordinatorService does not support duplicate Source objects when figuring out whether to spawn its ingestion Jobs.

In this regard we have a couple of options:

  1. Generate a deterministic id from the contents of the Source model:
    • Combine the source's type with its options and use that as a String id.

      No collisions, but the source id can be quite large since it is the concatenation of the type and the entire options string.

    • Use the source's hash code as an integer id.

      Keeps the id a simple integer. Collisions are possible. A hacky way of solving things.

  2. Add deduplication code to JobCoordinatorService:
    • Update the JobCoordinatorService to deduplicate sources from the database before spawning ingestion jobs with them.

      Adds complexity to JobCoordinatorService's code. Additionally, new code would have to take into account that the database might supply duplicates.

  3. Add deduplication code to SpecService:
    • Update applyFeatureSet() to query for a possible matching source already stored in the DB and use that instead of creating a new one.

      Adds overhead to applyFeatureSet() as we need to hit the DB for a matching source on each request.
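Option (1) could be sketched as below (illustrative only; class and method names are hypothetical, and the hashed variant trades a tiny collision risk for a short id, as noted above):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

// Hypothetical sketch of option (1): derive a deterministic Source id from
// the source's type and its options string.
public class SourceIds {
  // Collision-free, but the id can be very long.
  public static String concatenatedId(String type, String options) {
    return type + "/" + options;
  }

  // Short and deterministic; collisions are astronomically unlikely with
  // SHA-256, but not impossible once truncated.
  public static String hashedId(String type, String options) {
    try {
      MessageDigest sha = MessageDigest.getInstance("SHA-256");
      byte[] digest = sha.digest((type + "/" + options).getBytes(StandardCharsets.UTF_8));
      StringBuilder hex = new StringBuilder();
      for (byte b : digest) hex.append(String.format("%02x", b));
      return type + "-" + hex.substring(0, 12);
    } catch (java.security.NoSuchAlgorithmException e) {
      throw new IllegalStateException(e); // SHA-256 is always available
    }
  }
}
```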

@woop @zhilingc

@woop (Member Author) commented May 24, 2020

the source id (if it were to have one) must stay constant to ensure that when the user calls SpecService's applyFeatureSet(), it does not create an entirely new Source object, as JobCoordinatorService does not support duplicate Source objects when figuring out whether to spawn its ingestion Jobs.

It seems like the problem here is JobCoordinatorService. Using a composed primary key for uniqueness is only one of many ways to deduplicate, and in this case it's particularly dangerous because we are expecting changes to the source table in the future.

1. Generate a deterministic id from the contents of the Source model
3. Add deduplication code to SpecService

How can you do (1) without (3)? What if a new source is being registered that already exists? You first have to check whether a source with that id already exists and reuse it. Then when somebody changes the source you have to go and create a copy instead of just modifying it in place.

Seems like there is a lot of complexity around managing sources that shouldn't exist.

  2. Add deduplication code in JobCoordinatorService

Sources are 1:1 to feature sets, so they should not be shared or deduplicated in the database. In theory we could decouple sources from feature sets but nobody has made that decision yet.

It seems like the current design of the database model has been influenced by an assumption of how the source id's will be used by the JobCoordinatorService. It's worth considering whether we have made the right design choices here.

Unless I am missing something, (2) is the preferred approach.

@mrzzy (Collaborator) commented May 25, 2020

How can you do (1) without (3)? What if a new source is being registered that already exists?

According to a relevant Stack Overflow answer, Hibernate keeps an internal copy of the entity and diffs incoming state against it (dirty checking). It only hits the database with an insert query when it finds a difference in the object's fields (e.g. an id change). If we guarantee that neither the id nor any of the fields change, Hibernate will not perform an insert.

It seems like the problem here is JobCoordinatorService. Using a composed primary key for uniqueness is only one of many ways to deduplicate, and in this case its particularly dangerous because we are expecting changes to the source table in the future.

If we continue to isolate the Source's config in a config string, we can make sources configurable without making major changes to the Source model.

Sources are 1:1 to feature sets, so they should not be shared or deduplicated in the database. In theory we could decouple sources from feature sets but nobody has made that decision yet.

One approach that @zhilingc has suggested on this front is to move towards named sources, similar to what we currently have for stores. This would remove the duplication problem and open up an N:M relationship between Sources and Feature Sets. However, the ability to configure sources at the Feature Set level suffers.

Generally, I think we should move towards approach (3) with auto-incrementing/generated source ids. Since Hibernate already maintains an internal copy of the Source used for constructing the diff, I would assume that Hibernate would use this internal copy when we query for an existing source, so the performance hit is negligible. applyFeatureSet() is also not that performance critical in my opinion, as the number of requests by the average user pales in comparison to, for example, get-online-features requests.

(2) is not a suitable solution in my view as it's a design choice that could have future implications on our codebase, especially as we direct our attention to developing the source part of Feast. All code that uses sources, not just JobCoordinatorService, would have to take into account the possibility of Sources being duplicates. This may confuse new developers and add unnecessary code bloat (i.e. unique-Source checks cropping up all over the codebase).

@woop (Member Author) commented May 25, 2020

If we continue to isolate the Source's config in a config string, we can make sources configurable without making major changes to the Source model.

Agreed on this, as long as it doesn't affect the primary key.

One approach that @zhilingc has suggested on this front is to move towards named sources, similar to what we currently have for stores. This would remove the duplication problem and open up an N:M relationship between Sources and Feature Sets. However, the ability to configure sources at the Feature Set level suffers.

Sure, but this isn't specced out. @ches and @khorshuheng have both highlighted the need to improve functionality around sources, for example having sources customizable locally (per serving deployment) even if they are registered centrally (Core), or having different addresses broadcast to different consumers.

The requirements aren't clear here yet.

All code that uses sources, not just JobCoordinatorService, would have to take into account the possibility of Sources being duplicates.

Sources are supposed to be duplicated. An optimization that is specific to the JobCoordinatorService is that it wants to deduplicate sources in order to spin up less jobs.

Basically what you are arguing for is "pre-deduplication" of sources, managing that at create and update time. But it seems like we are creating a volatile dependency with approach (3).

If I understand you correctly, every time you make a change to a source it could get a new Id. How are you going to maintain referential integrity of sources if a feature set to source relationship is not maintained?

@mrzzy (Collaborator) commented May 26, 2020

But it seems like we are creating a volatile dependency with that approach (3)

In (3), when applyFeatureSet() detects a change by not finding a match for an existing source in the DB, it creates an entirely new Source row in the DB to track the update. Hence sources will not become volatile under (3).

If I understand you correctly, every time you make a change to a source it could get a new Id. How are you going to maintain referential integrity of sources if a feature set to source relationship is not maintained?

I take this in the context of approach (1), which is the only approach where the source id changes based on the contents of the Source. Where Hibernate detects a change in the id via its dirty checking, it automatically figures out that it should create a new row instead of continuing to use the existing one. Since the old Source object is not touched, referential integrity is preserved for other feature sets depending on the old source.

After reading about Feature Sets being designed as an ingestion concept, not a form of logical grouping, however, I think I can now see why (2) makes sense, if Feature Sets themselves are tied to data sourcing and ingestion.

Moving forward with approach (2) if there are no concerns.
@zhilingc

@zhilingc (Collaborator)

@woop I'm curious as to why you say that feature sets have a 1:1 mapping with sources and that sources should be duplicated - there is nothing that suggests that in the code base, nor in the usage of feast. Sources being distinct objects that have a many-to-one mapping to feature sets is the relationship that comes to mind intuitively to me, particularly if we are eventually going to move towards named sources in the future.

@woop (Member Author) commented May 26, 2020

I'm curious as to why you say that feature sets have a 1:1 mapping with sources and that sources should be duplicated - there is nothing that suggests that in the code base, nor in the usage of feast.

You can see this by looking at the feature set specification. Each specification has one source. The configuration of the source is created by the author of that specification.

At no point did we indicate to users that the source they define is a shared resource with other feature sets, since they are authoring/creating it, and not selecting a pre-existing source.

Further illustrated by just looking at exported feature sets. They all contain the sources in line, and do not reference an externally defined source.

Sources being distinct objects that have a many-to-one mapping to feature sets is the relationship that comes to mind intuitively to me, particularly if we are eventually going to move towards named sources in the future.

I agree, but that isn't the current design. So we have to decide if we want to fix this problem by changing the Feast design of sources, or fix this implementation.

@zhilingc (Collaborator)

You can see this by looking at the feature set specification. Each specification has one source. The configuration of the source is created by the author of that specification.
At no point did we indicate to users that the source they define is a shared resource with other feature sets, since they are authoring/creating it, and not selecting a pre-existing source.
Further illustrated by just looking at exported feature sets. They all contain the sources in line, and do not reference an externally defined source.

I'm not really convinced here. Users specifying their sources in-line rather than defining them externally is more out of convenience to the user than to convey the intention that the sources are unique. I don't really see why feast can't treat sources that are the same as... well... the same.

The same goes for displaying the sources in line. It's so that users can get complete information about the feature set, it's not pushing any agenda there.

I'd argue that the internal model of entities within feast supports unique sources a lot better than duplicated ones, and the fact that named sources isn't specced out yet shouldn't immediately discount the option of implementing it as such.

@woop (Member Author) commented May 27, 2020

You can see this by looking at the feature set specification. Each specification has one source. The configuration of the source is created by the author of that specification.
At no point did we indicate to users that the source they define is a shared resource with other feature sets, since they are authoring/creating it, and not selecting a pre-existing source.
Further illustrated by just looking at exported feature sets. They all contain the sources in line, and do not reference an externally defined source.

I'm not really convinced here. Users specifying their sources in-line rather than defining them externally is more out of convenience to the user than to convey the intention that the sources are unique.

It's not trying to convey anything. A source is a part of a feature set, just like entities, just like feature, just like max_age. All attributes in the specification belong to that specification. Sources are no different.

I don't really see why feast can't treat sources that are the same as... well... the same.

It can, and in fact it functions like that today. But that is a storage-only implementation detail that is now being taken as the Feast design for some reason. I see it as technical debt that should be removed. It is not a design goal to have shared sources yet, otherwise our API would indicate that, which it doesn't.

The same goes for displaying the sources in line. It's so that users can get complete information about the feature set, it's not pushing any agenda there.

Just to be clear. A key/value relationship is 1:1. No "agenda" has to be pushed. If you are arguing that the relationship is not 1:1 then it's up to you to provide rationale for that, because our API is 1:1.

I'd argue that the internal model of entities within feast supports unique sources a lot better than duplicated ones

Because it was built around a data model of unique sources.

, and the fact that named sources isn't specced out yet shouldn't immediately discount the option of implementing it as such.

Nobody has discounted that. The point I am making is that it's a big change to make, which would require a proposal and discussions. It's also different from our current design.

Furthermore, in our community calls our contributors have voiced approval for the fact that sources are a part of feature sets and tracked together. This is going to become more important when we have audit logs for example.

We see value in connection strings (BQ, brokers) or aspects of a source being externalized, but not the source itself.

@mrzzy mrzzy force-pushed the generalize-sources branch from 785208a to ce63d63 Compare May 29, 2020 09:18
(source, setsForSource) -> {
// Sources with same type and config in different Feature Sets are different
// objects.
// Make sure that we are dealing with the same source object when spawning jobs.
Member Author

This would not get the same object, but it would get objects with the same type and configuration. As long as we are clear that we won't use the object id anywhere, then that should be fine.

@mrzzy mrzzy changed the title Generalize Source data model and Stopping Dupiicate Ingestion Jobs Generalize Source data model and Stopping Duplicate Ingestion Jobs Jun 12, 2020
@mrzzy mrzzy force-pushed the generalize-sources branch from 69befc5 to 60acbd8 Compare June 12, 2020 07:39
@mrzzy (Collaborator) commented Jun 12, 2020

/test test-end-to-end-batch

@pyalex (Collaborator) commented Jun 12, 2020

On this point we may generate a pair that actually doesn't have a connection. A Source may be connected only to some stores, not all of them, for example. But it is definitely not all stores X all sources.
Why not continue the source compilation (previous line) with pair generation?

.flatMap(
  store ->
    getFeatureSetsForStore(store).stream()
         .map(featureSet -> Pair.of(featureSet.getSource(), store)))
.distinct()
.collect(Collectors.toList())

@pyalex (Collaborator) commented Jun 12, 2020

Correction: since it currently creates a map, we overwrite stores for the same Source key and thus end up with one Store per Source. It should be a list of pairs.

Collaborator

Corrected. Thanks for pointing this out.

Collaborator

My first comment is still valid. It shouldn't be all sources X all stores. You need to create pairs that are actually connected.
I guess to verify that's the case it would be great to add a test that creates
Store1 -> subscribed only to -> Source1
Store2 -> subscribed only to -> Source2
Only two pairs should be generated, whereas the current implementation will generate 4.
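The behavior the reviewer asks for could be sketched like this (illustrative only; the subscription map and class names are hypothetical stand-ins for the real store subscription lookup):

```java
import java.util.*;
import java.util.stream.*;

// Hypothetical sketch: generate (source, store) pairs only for combinations
// actually connected through a store's subscriptions, not the cross product.
public class PairGen {
  // Store1 is subscribed only to Source1, Store2 only to Source2.
  static final Map<String, List<String>> storeToSources = Map.of(
      "Store1", List.of("Source1"),
      "Store2", List.of("Source2"));

  public static List<Map.Entry<String, String>> sourceStorePairs() {
    return storeToSources.entrySet().stream()
        .flatMap(e -> e.getValue().stream()
            .map(src -> Map.entry(src, e.getKey()))) // (source, store)
        .distinct()
        .collect(Collectors.toList());
  }
}
```

With two stores each subscribed to one source, only two pairs come out, instead of the four a cross product would produce.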

Collaborator

Corrected.

@pyalex (Collaborator) commented Jun 12, 2020

This shouldn't be a map; it's not a 1<->1 relation.
One source can be used to populate many stores.
I think it should rather be a list of Pairs.

Collaborator

Corrected

Collaborator

It seems that JobStatus shouldn't be used here, since you only need two values; this API may be misleading. I suggest creating a new Enum.

Collaborator

Corrected.

@pyalex (Collaborator) commented Jun 12, 2020

This can generate a pretty big SQL query by inlining a long list of ids. I'm not sure what the length limit is, but it is definitely not best practice.
I suggest having:

Set<Job> allRunningJobs = getAllRunningJobs();
Set<Job> checkedAsNeeded = new HashSet<>();
for (Pair<Source, Store> pair : sourceToStorePairs) {
   checkedAsNeeded.add(..);
}
Set<Job> toStop = Sets.difference(allRunningJobs, checkedAsNeeded);

Collaborator

Updated getExtraJobs() to do the diff in memory instead of as an SQL query.


@mrzzy mrzzy force-pushed the generalize-sources branch from bfc6f00 to 6894c58 Compare June 18, 2020 13:35
Collaborator

Probably this was meant to be "%s-to-%s-%s", source

@mrzzy (Collaborator) commented Jun 19, 2020

Corrected to Source.type-hash(Source.config)

Collaborator

If this one is supposed to be used in createJobId then it's not exactly correct: this.id won't be populated in ConsolidatedSource, and the more stable formula (type + config) should be used as the Kafka consumer group id.

@pyalex (Collaborator) commented Jun 18, 2020

And I would rather have this logic in createJobId if that's the case

@mrzzy (Collaborator) commented Jun 19, 2020

Moved to createJobId()
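A createJobId() based on the stable (type + config) formula discussed above might look like the following (a hypothetical sketch, not the actual Feast implementation; the real job id format may differ):

```java
// Hypothetical sketch: derive a deterministic job id from the source's
// stable (type + config) formula plus the target store name, so the id
// survives restarts and does not depend on a populated this.id field.
public class JobIds {
  public static String createJobId(String sourceType, String sourceConfig, String storeName) {
    // hash(config) keeps the id short while staying stable for equal configs
    String sourceId = sourceType + "-" + Integer.toHexString(sourceConfig.hashCode());
    return sourceId + "-to-" + storeName;
  }
}
```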

@mrzzy mrzzy changed the title Generalize Source data model and Stopping Duplicate Ingestion Jobs Refactor Source & Job data model and Stopping Duplicate Ingestion Jobs Jun 19, 2020
@mrzzy mrzzy changed the title Refactor Source & Job data model and Stopping Duplicate Ingestion Jobs Refactor Source & Job data model and Stop Duplicate Ingestion Jobs Jun 19, 2020
@mrzzy mrzzy force-pushed the generalize-sources branch from 8ec69e4 to ac798c7 Compare June 19, 2020 02:40
@feast-ci-bot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: pyalex, woop

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details: needs approval from an approver in each of these files.

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@pyalex (Collaborator) commented Jun 19, 2020

/lgtm

@mrzzy (Collaborator) commented Jun 19, 2020

/unhold

@feast-ci-bot feast-ci-bot merged commit f425d7d into feast-dev:master Jun 19, 2020


Successfully merging this pull request may close these issues.

Source data model in Feast Core should be generalized

5 participants