feat: ML Model Backend Implementation #1896

RyanHolstien · 2020-09-25T20:23:37Z

Fixes #1877

Checklist

The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
Links to related issues (if applicable)
Tests for the changes have been added/updated (if applicable)
Docs related to the changes have been added/updated (if applicable)

RyanHolstien · 2020-10-09T17:56:38Z

The build is failing due to a 502 Bad Gateway when reaching out to the Gradle Maven repo for Elasticsearch's REST client. It should pass now that I've fixed a typo I made (if it can reach all the dependencies). I'm not sure how to force a rebuild without making more changes.

jplaisted · 2020-10-09T17:59:48Z

Where was the typo? Did you already commit it? We ran the build jobs for your last few commits (see the Xs or check marks on them).

RyanHolstien · 2020-10-11T14:14:30Z

Where was the typo? Did you already commit it? We ran the build jobs for your last few commits (see the Xs or check marks on them).

Yeah, the typo fix was the last commit. The previous failure happened because the restspec had no EvaluationData endpoints: I had mistyped the text in the Rest.li annotation for EvaluationDataResource, which is now fixed. The current build failure does not seem to be code related. It hit a gateway error when reaching out to the Gradle repository to get a dependency:

Could not resolve all files for configuration ':metadata-utils:compileClasspath'.
Could not resolve org.elasticsearch.client:elasticsearch-rest-high-level-client:5.6.8.
Required by:
project :metadata-utils
> Could not resolve org.elasticsearch.client:elasticsearch-rest-high-level-client:5.6.8.
> Could not get resource 'https://plugins.gradle.org/m2/org/elasticsearch/client/elasticsearch-rest-high-level-client/5.6.8/elasticsearch-rest-high-level-client-5.6.8.pom'.
> Could not GET 'https://jcenter.bintray.com/org/elasticsearch/client/elasticsearch-rest-high-level-client/5.6.8/elasticsearch-rest-high-level-client-5.6.8.pom'. Received status code 502 from server: Bad Gateway

mars-lan · 2020-10-27T03:35:25Z

Sorry but just noticed that all the ML-related models are placed under com.linkedin.ml.metadata namespace (see https://github.com/linkedin/datahub/tree/master/metadata-models/src/main/pegasus/com/linkedin/ml/metadata). @RyanHolstien do you mind if I submit a PR before this to move them to com.linkedin.ml instead?

RyanHolstien · 2020-10-27T16:45:26Z

Sorry but just noticed that all the ML-related models are placed under com.linkedin.ml.metadata namespace (see https://github.com/linkedin/datahub/tree/master/metadata-models/src/main/pegasus/com/linkedin/ml/metadata). @RyanHolstien do you mind if I submit a PR before this to move them to com.linkedin.ml instead?

@mars-lan Not at all, go for it. I'll update as needed 😄

mars-lan · 2020-10-27T16:59:50Z

Sorry but just noticed that all the ML-related models are placed under com.linkedin.ml.metadata namespace (see https://github.com/linkedin/datahub/tree/master/metadata-models/src/main/pegasus/com/linkedin/ml/metadata). @RyanHolstien do you mind if I submit a PR before this to move them to com.linkedin.ml instead?

@mars-lan Not at all, go for it. I'll update as needed 😄

Actually, never mind. It seems the models are already referenced in multiple places, so changing the namespace would quickly become a backward-incompatible change. Let's live with them for now.

jywadhwani · 2020-10-28T00:01:09Z

.../com/linkedin/metadata/builders/graph/relationship/EvaluatedOnBuilderFromEvaluationData.java

+        final List<EvaluatedOn> evaluationDataList = evaluationData.getEvaluationData()
+            .stream()
+            .filter(BaseData::hasDataset)
+            .filter(baseData -> DatasetUrn.ENTITY_TYPE.equals(baseData.getDataset().getEntityType()))


Why do we need this? Since the dataset field is of type DatasetUrn, shouldn't the entity type always be the same as that of DatasetUrn?

jywadhwani · 2020-10-28T00:07:21Z

.../com/linkedin/metadata/builders/graph/relationship/EvaluatedOnBuilderFromEvaluationData.java

+    public <URN extends Urn> List<GraphBuilder.RelationshipUpdates> buildRelationships(@Nonnull URN urn, @Nonnull EvaluationData evaluationData) {
+        final List<EvaluatedOn> evaluationDataList = evaluationData.getEvaluationData()
+            .stream()
+            .filter(BaseData::hasDataset)


dataset seems to be a required field in the BaseData model. Do we really need the hasDataset check?
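The two questions above can be illustrated with a minimal sketch. It assumes, as the reviewer states, that dataset is a required field typed as DatasetUrn. The classes below are hypothetical stand-ins written for this example, not the Pegasus-generated DataHub types, and the URN string form is simplified.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Objects;
import java.util.stream.Collectors;

// Hypothetical stand-ins for the Pegasus-generated types; not the real DataHub classes.
final class DatasetUrn {
    private final String name;

    DatasetUrn(String name) {
        this.name = Objects.requireNonNull(name);
    }

    @Override
    public String toString() {
        // Simplified string form, for illustration only.
        return "urn:li:dataset:" + name;
    }
}

final class BaseData {
    private final DatasetUrn dataset;

    BaseData(DatasetUrn dataset) {
        // "dataset" is modeled as required, so a null value is rejected up front.
        this.dataset = Objects.requireNonNull(dataset);
    }

    DatasetUrn getDataset() {
        return dataset;
    }
}

final class EvaluatedOnSketch {
    // Maps each entry to its dataset URN string without hasDataset or entity-type
    // filters: in this stand-in model the field is required and statically typed,
    // so both checks would always pass.
    static List<String> destinationUrns(List<BaseData> evaluationData) {
        return evaluationData.stream()
            .map(BaseData::getDataset)
            .map(DatasetUrn::toString)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<BaseData> data = Arrays.asList(
            new BaseData(new DatasetUrn("a")),
            new BaseData(new DatasetUrn("b")));
        System.out.println(destinationUrns(data));
    }
}
```

Whether the same reasoning holds for the real generated classes depends on the schema. The later commit "remove unnecessary filters and fields from index builder and relationships" suggests the filters were dropped.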

jywadhwani · 2020-10-28T00:14:07Z

metadata-models/src/main/pegasus/com/linkedin/metadata/relationship/EvaluatedOn.pdl

+  /**
+   * How was the data preprocessed for evaluation (e.g., tokenization of sentences, cropping of images, any filtering such as dropping images without faces)?
+   */
+  preProcessing: optional string


Could you share a use case where storing this property on the edge would be useful? I wouldn't recommend adding this field as an edge property.
cc @keremsahin1 @camelliazhang

camelliazhang · 2020-10-28

@RyanHolstien Thanks for your contribution! Could you please help me understand this relationship better?

Could you please provide a few examples?

What do the queries look like when this relationship is involved? Are you trying to filter on the "preProcessing" property? How many possible values are we talking about? In general, I wouldn't recommend adding an arbitrary string property to an edge.

RyanHolstien · 2021-01-27

I've removed it. My thought was that it would be useful information to return when searching for the relationship, or for filtering on certain types of preprocessing, similar to how OwnedBy has OwnershipType. If the issue is that it's an arbitrary string: I don't currently have a list of reasonable preprocessing examples for an enum, so we can leave it off and add it later if it's worthwhile.

jywadhwani · 2020-10-28T00:15:02Z

metadata-models/src/main/pegasus/com/linkedin/metadata/relationship/TrainedOn.pdl

+  /**
+   * How was the data preprocessed for evaluation (e.g., tokenization of sentences, cropping of images, any filtering such as dropping images without faces)?
+   */
+  preProcessing: optional string


Same comment as before: I don't think we need to store this property on the edge.

jywadhwani · 2020-10-28T00:15:48Z

...java/com/linkedin/metadata/builders/graph/relationship/TrainedOnBuilderFromTrainingData.java

+
+import static com.linkedin.metadata.dao.internal.BaseGraphWriterDAO.RemovalOption.REMOVE_ALL_EDGES_FROM_SOURCE;
+
+public class TrainedOnBuilderFromTrainingData extends BaseRelationshipBuilder<TrainingData> {


Similar comments to those on EvaluatedOnBuilderFromEvaluationData.

jywadhwani · 2020-10-28T00:20:51Z

metadata-builders/src/main/java/com/linkedin/metadata/builders/search/MLModelIndexBuilder.java

+    private MLModelDocument getDocumentToUpdateFromAspect(MLModelUrn urn, EvaluationData evaluationData) {
+        final MLModelDocument doc = new MLModelDocument();
+
+        if (evaluationData.hasEvaluationData()) {


More of a general comment: hasX() doesn't guarantee that getX() will return a non-null value. I suggest we replace this with (evaluationData.getEvaluationData() != null). If you can make similar changes in other parts of the code, that would be great! Related discussion: #1950 (comment)
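A minimal sketch of the suggested guard follows. It checks the getter's result directly instead of relying on a has-style method. EvaluationDataStub and countEvaluationDatasets are hypothetical names for this example; the real EvaluationData class is generated by Pegasus and is not reproduced here.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for an aspect record; the real EvaluationData is Pegasus-generated.
final class EvaluationDataStub {
    private final List<String> evaluationData; // may be null in this sketch

    EvaluationDataStub(List<String> evaluationData) {
        this.evaluationData = evaluationData;
    }

    List<String> getEvaluationData() {
        return evaluationData;
    }
}

final class NullCheckSketch {
    // Reads the getter once and guards on the value that is actually used.
    static int countEvaluationDatasets(EvaluationDataStub aspect) {
        final List<String> data = aspect.getEvaluationData();
        if (data != null) {
            return data.size();
        }
        return 0;
    }

    public static void main(String[] args) {
        System.out.println(countEvaluationDatasets(new EvaluationDataStub(null)));
        System.out.println(countEvaluationDatasets(new EvaluationDataStub(Arrays.asList("a", "b"))));
    }
}
```

Reading the getter into a local variable also avoids calling it twice between the check and the use.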


RyanHolstien · 2021-01-27T20:49:04Z

Sorry about the delay, made the requested changes and updated the code to reflect changes to master over time. Let me know if there's anything else needed to get this merged in.

camelliazhang

@RyanHolstien Thanks for taking time addressing all the review comments!

camelliazhang · 2021-02-02T06:02:16Z

metadata-models/src/main/pegasus/com/linkedin/metadata/entity/MLModelEntity.pdl

+/**
+ * Data model for a ML Model entity
+ */
+record MLModelEntity includes BaseEntity {


This is the entity model for graph. Could you please share some context on this, such as a list of the main graph queries to support?

RyanHolstien · 2021-02-02

Our current graph use cases for ML Models include querying models that have been trained/evaluated on certain datasets. I created the entity model based on the other current entity models, which are rather minimalistic, with just the URN and URN components. The main value is the edge relationships between datasets.

camelliazhang · 2021-02-03

Thanks for the information, @RyanHolstien.
I do see that the search APIs (via ESSearchDAO) are implemented, but not Neo4jQueryDAO. I assume you will have a follow-up for building some APIs that leverage the query DAO (for graph)?

RyanHolstien · 2021-02-03

We have a separate API & service layer diverging from Rest.li in our fork, so changes from there aren't directly transferable. If no one else picks it up before I have time, I can do the follow-up and map it over to the equivalent Rest.li-based resources.

camelliazhang · 2021-02-04

Got it, SGTM.

camelliazhang · 2021-02-02T07:23:12Z

gms/impl/src/main/resources/mlModelESSearchQueryTemplate.json

+                "platform^0.055",
+                "type^0.01"
+              ],
+              "default_operator": "OR"


It looks good to me now. My suggestion is to use a single default operator for the query_string query, either OR or AND, to reduce possible confusion for users. With query_string, the user can always specify an operator explicitly and override the default. The full syntax can be found here:
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-query-string-query.html#query-string-syntax
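
A small sketch of the point about the default operator follows. It assumes OR as the single default and includes only the two boosted fields visible in the excerpt above. QueryStringSketch and its methods are hypothetical helpers for this example; the PR itself keeps the clause in a JSON template file.

```java
// Hypothetical helper for illustration; not part of the PR.
final class QueryStringSketch {
    // Escapes backslashes and double quotes so user input stays inside the JSON string.
    static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    // Builds a query_string clause with one fixed default operator (OR is an assumption).
    static String clause(String userQuery) {
        return "{\"query_string\": {"
            + "\"query\": \"" + escape(userQuery) + "\", "
            + "\"fields\": [\"platform^0.055\", \"type^0.01\"], "
            + "\"default_operator\": \"OR\"}}";
    }

    public static void main(String[] args) {
        // No explicit operator: the default (OR) applies between the two terms.
        System.out.println(clause("tensorflow prod"));
        // Explicit AND in the query text overrides the default for these terms.
        System.out.println(clause("tensorflow AND prod"));
    }
}
```

With a single default, the template behaves predictably, and users who need the other behavior can state the operator in the query text.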

camelliazhang

Thanks @RyanHolstien for your contributions and persistence! Great work here!

RyanHolstien · 2021-02-03T17:27:06Z

Thanks for taking the time to review @camelliazhang ! I'm not sure what the build error is about, the failing test passes locally. The error is saying that the number of arguments doesn't line up with the constructor call in my new search sanity test, but it is structured the same way as the other tests. I don't have access to run it again without pushing another commit. I can try pushing a harmless commit like adding a local variable for one of the Strings in that test class to see if it passes the build so this can be merged if that's the only way to trigger another build.

camelliazhang · 2021-02-04T00:43:09Z

The error is saying that the number of arguments doesn't line up with the constructor call in my new search sanity test,

@jplaisted Could you please take a look at the failure in the new search sanity test? Thanks.

jplaisted · 2021-02-04T00:59:31Z

Yeah, I added a new parameter, which I acknowledge breaks code, but at the time I thought no one was using it besides me.

It is weird to me it passes locally but not on the CI. I'd think it'd fail on both. I don't think CI tries to rebase on master before running changes?

In any case, make sure you've rebased on master, and run ./gradlew idea build. It'll probably fail locally then too, something might be cached.

Check out this: #2067. Basically there's some new sanity tests to test your config too, just pass the super constructor the search config.

arunvasudevan · 2021-02-08T18:48:30Z

@RyanHolstien It would be great if the GMS sample API calls section in the README is updated for MLModel entity - https://github.com/linkedin/datahub/blob/master/gms/README.md#sample-api-calls

jplaisted

Nit: Thoughts on making the L lower case? MlModel?

https://google.github.io/styleguide/javaguide.html#s5.3-camel-case

We don't use google's style guide, but still a useful reference.

shirshanka · 2021-02-17T21:25:49Z

Nit: Thoughts on making the L lower case? MlModel?

https://google.github.io/styleguide/javaguide.html#s5.3-camel-case

We don't use google's style guide, but still a useful reference.

Since the ML here is a short-form for Machine Learning ... maybe MLModel is more appropriate.
Seems like Apple agrees: https://developer.apple.com/documentation/coreml/mlmodel

Looks like Jyoti's concerns are addressed now.

shirshanka · 2021-02-17T21:30:02Z

Thanks @RyanHolstien for the lift and @camelliazhang, @jywadhwani for detailed reviews!

RyanHolstien and others added 6 commits April 9, 2020 15:42

Merge pull request #1 from linkedin/master

7b724a5

Update to master

Merge pull request #2 from linkedin/master

8475ad3

Update To Master

Merge pull request #3 from linkedin/master

c18c55b

Merge Master

Merge pull request #4 from linkedin/master

dc474a7

Update to LI Master

feat(1877): add backend implementation for ML Model

c7d519c

Merge branch 'master' into feature/DATAHUB-1877-MLModelBackendImpl

d347b96

mars-lan reviewed Sep 25, 2020

metadata-ingestion/mce-cli/bootstrap_mce.dat (outdated, resolved)

RyanHolstien and others added 3 commits September 25, 2020 16:38

Merge pull request #5 from linkedin/master

81a342e

feat: Port mce-cli to Java. (#1871)

Merge branch 'master' into feature/DATAHUB-1877-MLModelBackendImpl

02a9970

update mce cli examples

26a2908

mars-lan assigned jywadhwani Sep 27, 2020

jywadhwani suggested changes Oct 5, 2020

docker/elasticsearch-setup/ml-model-index-config.json (outdated, resolved)

mars-lan reviewed Oct 6, 2020

gms/factories/src/main/java/com/linkedin/ml/factory/MLModelDAOFactory.java (outdated, resolved)

changes for review comments

8c897fb

jywadhwani reviewed Oct 7, 2020

docker/elasticsearch-setup/ml-model-index-config.json (outdated, resolved)

RyanHolstien and others added 6 commits October 7, 2020 11:03

modify owners field in index per review comments

87ab115

Merge pull request #6 from linkedin/master

938b8d2

Update to master

Merge branch 'master' into feature/DATAHUB-1877-MLModelBackendImpl

42dc2f5

update package of GMS factories and use BaseSearchableClient

10f5d89

fix for checkstyle

29eab26

remove unused comma analyzers and tokenizers

391304c

RyanHolstien requested a review from jywadhwani October 8, 2020 18:54

RyanHolstien and others added 3 commits October 8, 2020 13:57

Merge pull request #7 from linkedin/master

3ad1b2f

Update master

Merge branch 'master' into feature/DATAHUB-1877-MLModelBackendImpl

912b2a6

fix endpoint naming for evaluation data

69bbcbd

mars-lan changed the title ~~#1877 ML Model Backend Implementation~~ feat: ML Model Backend Implementation Oct 12, 2020

jywadhwani suggested changes Oct 28, 2020

jywadhwani previously requested changes Oct 28, 2020

metadata-builders/src/main/java/com/linkedin/metadata/builders/search/MLModelIndexBuilder.java (resolved)

RyanHolstien and others added 6 commits January 25, 2021 15:02

remove unnecessary filters and fields from index builder and relationships

075ecfa

Merge pull request #8 from linkedin/master

3f83e7f

Update

Merge branch 'master' into feature/DATAHUB-1877-MLModelBackendImpl

21b3c23

remove unused imports, fix formatting, and add new integration test to align with updates

5646b85

add custom analyzer to split up URN components and fix typos

8215a60

update query template to include urn components

25daf59

camelliazhang reviewed Feb 2, 2021

consolidate query template to one operator

1735ff3

camelliazhang approved these changes Feb 3, 2021

RyanHolstien and others added 5 commits February 9, 2021 16:13

Merge pull request #9 from linkedin/master

ea9434c

Update to master

Merge branch 'master' into feature/DATAHUB-1877-MLModelBackendImpl

86b54cb

update sanity test to master

f4e27af

Merge pull request #10 from linkedin/master

22c233b

Update to master

Merge branch 'master' into feature/DATAHUB-1877-MLModelBackendImpl

c2dbf80

jplaisted reviewed Feb 17, 2021

shirshanka merged commit ea86ade into datahub-project:master Feb 17, 2021

arunvasudevan mentioned this pull request Mar 12, 2021

feat: MLmodel Graphql Query #2166

Merged

1 task
