
docs(rfc): Serialising GMS Updates with Preconditions #5818

Merged
17 commits merged into datahub-project:master on Jan 10, 2023

Conversation

mattmatravers
Contributor

@mattmatravers mattmatravers commented Sep 2, 2022

Checklist

@mattmatravers
Contributor Author

mattmatravers commented Sep 9, 2022

This PR describes many of the points raised in #5635

@mattmatravers mattmatravers marked this pull request as ready for review September 9, 2022 16:29
@anshbansal anshbansal added rfc See https://github.com/linkedin/datahub/blob/master/docs/rfc.md for more details docs Issues and Improvements to docs community-contribution PR or Issue raised by member(s) of DataHub Community labels Sep 10, 2022
We wish to avoid silently losing data when two clients make updates to the same aspect. This is quite likely in
an event-driven world driving downstream operations on Datasets. We should offer clients the chance to conditionally
update an aspect on the basis that what they recently observed is still the case, and signal to the client if the
state changed before the update was possible. Essentially we need a long-running "compare-and-swap" operation.
Collaborator

Makes sense to me. Thanks for the clear description up to this point!
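The long "compare-and-swap" idea can be sketched with a toy in-memory store. `AspectStore`, its method names, and the integer version field are illustrative only, not part of the real GMS API:

```python
import threading
from typing import Any, Tuple


class AspectStore:
    """Toy versioned store illustrating conditional ("compare-and-swap") updates."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._aspects: dict = {}  # key -> (version, value)

    def read(self, key: str) -> Tuple[int, Any]:
        with self._lock:
            return self._aspects.get(key, (0, None))

    def conditional_update(self, key: str, expected_version: int, value: Any) -> bool:
        """Apply the update only if the stored version still matches what the client saw."""
        with self._lock:
            current_version, _ = self._aspects.get(key, (0, None))
            if current_version != expected_version:
                return False  # state changed since the client read it: signal, don't clobber
            self._aspects[key] = (current_version + 1, value)
            return True


store = AspectStore()
v, _ = store.read("urn:li:dataset:x#ownership")
assert store.conditional_update("urn:li:dataset:x#ownership", v, {"owners": ["a"]})
# A second writer using the stale version is rejected instead of silently overwriting.
assert not store.conditional_update("urn:li:dataset:x#ownership", v, {"owners": ["b"]})
```

The rejected writer can then re-read and decide whether its update still makes sense, which is exactly the signal the RFC asks GMS to provide.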

* The client needs to be able to pass the same state back to a GMS "update" endpoint if it wants to ensure the
aspect has not changed between fetching it and mutating it.
* A client could include multiple aspects in its precondition state, in case one update relies on the state of many
other aspects. We need to defend against race conditions for all the aspects given.
Collaborator

per-aspect tagging, this makes sense.
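The multi-aspect precondition in the bullet above reduces to checking that every observed version still holds; a minimal sketch, with hypothetical names and dict shapes:

```python
def preconditions_hold(current_versions: dict, observed_versions: dict) -> bool:
    """True only if every aspect the client observed is still at that version."""
    return all(
        current_versions.get(key) == version
        for key, version in observed_versions.items()
    )


# The update depends on two aspects; it should fail if either has moved on.
current = {"ownership": 3, "schemaMetadata": 7}
assert preconditions_hold(current, {"ownership": 3, "schemaMetadata": 7})
assert not preconditions_hold(current, {"ownership": 3, "schemaMetadata": 6})
```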

an aspect has not changed when making an update to it.

There are some potential spin-offs of this design which could involve writing some kind of PATCH update where a client
supplies a diff instead of a complete new state, but this is out of scope for this particular RFC.
Collaborator

@RyanHolstien is currently working on PATCH API support on the GMS ingest path. I'm wondering if we should account for it. I'm assuming that by default PATCH would ignore the version tag.


We don't necessarily have to overhaul the versioning approach here, as all we need is a way of uniquely identifying
the aspect instance that is to be changed. We don't need to rely on numerical version ordering for this, but
something will need to be added at the general aspect level to allow for this new functionality.
Collaborator

This is true

Bad
* No guarantee of state between the read and write
* No guarantee that the plugin code will run
* Specific business logic getting tied up with GMS code
Collaborator

This is the biggest issue with this. I think most companies using DataHub simply wouldn't bother with this, and thus would not benefit from it!

Contributor Author

Yeah I re-read this document fully yesterday and I need to remove some ideas from the main section. This is one of them.


### 2: Read and Conditional Update

Read the current aspect and pass it back to the client. The client can propose an update on the basis that what they
Collaborator

How does the client know what they read is still valid?

Contributor Author

I guess they'll never know, but they will know the version of the aspect they just read. The clients then have the option of passing that version back to GMS during an update request.


Bad
* Slow, especially under high contention
* Client will need to handle retries
Collaborator

If a conflict occurs, what should the client do? Just abort or log a warning or something, presumably?

Contributor Author

The client will have to manage the retry operation or just fail depending on business requirements. GMS won't have enough context to manage this without a suitable patching language that @RyanHolstien is working on.
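The client-managed retry described here is a read/transform/conditional-write loop. A rough sketch, where `read_aspect` and `write_aspect_if_version` are hypothetical stand-ins for real client calls:

```python
def update_with_retries(read_aspect, write_aspect_if_version, transform, max_attempts=5):
    """Retry a conditional update until it lands or attempts are exhausted."""
    for _ in range(max_attempts):
        version, value = read_aspect()
        if write_aspect_if_version(version, transform(value)):
            return True  # no other writer got in between the read and the write
    return False  # abort or surface the conflict, per business requirements


# Toy store to exercise the loop:
state = {"version": 0, "value": []}

def read_aspect():
    return state["version"], list(state["value"])

def write_aspect_if_version(version, new_value):
    if version != state["version"]:
        return False
    state["version"] += 1
    state["value"] = new_value
    return True

assert update_with_retries(read_aspect, write_aspect_if_version,
                           lambda owners: owners + ["owner1"])
```

Under high contention this loop is where the "slow" downside above comes from: each conflict costs a full round trip before the next attempt.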

* Fast, as retry loop will be confined to GMS

Bad
* GMS will need to understand partial (PATCH) updates
Collaborator

@RyanHolstien Please have a look here.

With Ryan's changes, we'll be able to support a specific set of PATCH formats, but it may not always suffice for addressing the above use case (if we don't formally support a patch of the given type). Ryan can elaborate.

Collaborator

As long as the path drills down far enough, JsonPatch semantics are capable of covering this scenario. Basically you can specify precise fields to modify and it will not impact the rest of the aspect, but if you target the top level of the aspect you may be overwriting things. Example:

JsonPatch for Ownership at top level:

op: ADD
path: /
value: { owners: { urn:li:corpuser:owner1: {...} } }

vs JsonPatch at lower level:

op: ADD
path: /owners/urn:li:corpuser:owner1
value: {...}

The first one will wipe out other values in the array, but the second one will not. This is why the SDK implementation is very limited in scope, since JsonPatch is almost too flexible in what it allows. It can very much accommodate this use case, though.
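To make the difference concrete, here is a small hand-rolled sketch of the ADD semantics above (not the real JsonPatch library; `apply_add` is illustrative only):

```python
import copy

def apply_add(doc: dict, path: str, value):
    """Apply a JsonPatch-style ADD: set the target, preserving sibling keys."""
    doc = copy.deepcopy(doc)
    parts = [p for p in path.split("/") if p]
    if not parts:
        return value  # path "/" replaces the whole document
    node = doc
    for part in parts[:-1]:
        node = node.setdefault(part, {})
    node[parts[-1]] = value
    return doc

aspect = {"owners": {"urn:li:corpuser:existing": {"type": "DATAOWNER"}}}

# Top-level ADD: the existing owner is wiped out.
top = apply_add(aspect, "/", {"owners": {"urn:li:corpuser:owner1": {}}})
assert "urn:li:corpuser:existing" not in top["owners"]

# Nested ADD: the existing owner survives alongside the new one.
nested = apply_add(aspect, "/owners/urn:li:corpuser:owner1", {})
assert set(nested["owners"]) == {"urn:li:corpuser:existing", "urn:li:corpuser:owner1"}
```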

Contributor Author

Makes sense, thanks. I'd be keen to know if JsonPatch can handle idempotent updates like "ensure this set contains these values". But I guess clients will need some form of serialisation guarantees regardless of whether it's a PATCH update or a full replacement.

Collaborator

Those are two different things in my mind:

  1. Idempotency of updates -> Guarantees that if you add owner1 multiple times, it doesn't add it repeatedly. This is ensured

  2. Verification of particular values -> Guarantees that owner1 is only added if owner2 is present (or something to this effect? Not sure exactly what you mean here). This is not ensured. Maybe you mean:

add owner1 happens at same time as add owner2, owner1 AND owner2 should appear?

This should also be ensured per transactional consistency.

Contributor Author

  1. That's good news, thanks 👍 .
  2. I'm talking about a scenario where one property of an aspect gets updated using another property as a base state. If the base state is changed then that update is no longer valid, so we will need to offer serialisation for such stateful updates (whether PATCH or UPSERT).

### 4: Serialised Updates Exclusively via Kafka

Here, the client never updates the aspect via HTTP but instead passes it via the MetadataChangeProposal topic in Kafka.
I don't think this is a viable solution due to the possibility of issuing out-of-date updates.
Collaborator

Still has the overwrite / clobbering problem

* No need for business-specific logic in GMS code

Bad
* Slow, especially under high contention
Collaborator

Very slow! Also doesn't well support the MCE pathway, as you mentioned.

1. Previous state: The version(s) of aspects required in order for a GMS update to succeed.
2. Preconditions: A wider set of assertions which must be true in order for a GMS update to succeed.

We don't really cover "Preconditions" in this RFC, and in fact I argue we never should, as this leaks business
Collaborator

Check out the term "etag" -- it's used across other DBs to provide a "version tag" to a document. I'm hoping we can simply add something similar to all of our get/put pathways.

Contributor Author

I like it. https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/ETag seems to cover roughly what we need, though we'd need to think about the multiple aspect use case.
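For a single aspect, the ETag approach would follow standard HTTP conditional-request semantics (RFC 7232): the client sends back the tag it read in an `If-Match` header, and the server answers 412 if it no longer matches. A minimal sketch of that check, with a hypothetical handler shape:

```python
def handle_update(if_match: str, current_etag: str) -> int:
    """Return an HTTP status: 412 if the precondition fails, else 200."""
    if if_match != "*" and if_match != current_etag:
        return 412  # Precondition Failed: aspect changed since the client read it
    return 200

assert handle_update('"v7"', '"v7"') == 200   # tag still current: update proceeds
assert handle_update('"v6"', '"v7"') == 412   # stale tag: client must re-read
assert handle_update("*", '"v7"') == 200      # "*" opts out of the precondition
```

The open question noted above is the multi-aspect case, which a single header value doesn't naturally cover.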



### Aspect Versioning

Currently, [aspect versioning](/docs/advanced/aspect-versioning.md) uses 0 as the latest aspect version. We would
Collaborator

We can just use the total aspect count instead of the version. It serves the same purpose.

Contributor Author

As discussed on video chat, this might not work if we limit the retention of older versions. We need something that is deterministic and not subject to race conditions.

> If we implemented this proposal, how will existing users / developers adopt it? Is it a breaking change? Can we write
> automatic refactoring / migration tools? Can we provide a runtime adapter library for the original API it replaces?

This rollout would be done as either a new API endpoint or an optional additional parameter to an existing API, so
Collaborator

It should be backwards compatible to simply add a total aspect count somewhere.

@mattmatravers
Contributor Author

I'm going to re-organise a lot of this document to be clear what designs are viable and which are not.

If a patch language were available, it might be possible to update collections in such a way that it's not
necessary for clients to know about the previous state of an aspect.

MongoDB offers an update mode which allows clients to [add items to a set](https://mongodb.github.io/mongo-java-driver/4.7/apidocs/mongodb-driver-core/com/mongodb/client/model/Updates.html#addToSet(java.lang.String,TItem)),
Collaborator

The challenge here is that we'd need to write our own operation DSL. This is what MongoDB has done!
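For illustration, an `addToSet`-style verb in such a DSL could be idempotent by construction, so clients need not know the previous state at all. The function name and aspect shape here are hypothetical:

```python
def add_to_set(aspect: dict, field: str, items: list) -> dict:
    """Add items to a list-valued field, skipping ones already present."""
    existing = list(aspect.get(field, []))
    for item in items:
        if item not in existing:
            existing.append(item)
    return {**aspect, field: existing}


a = {"owners": ["urn:li:corpuser:a"]}
a = add_to_set(a, "owners", ["urn:li:corpuser:a", "urn:li:corpuser:b"])
# Applying the same operation again leaves the aspect unchanged (idempotent).
assert add_to_set(a, "owners", ["urn:li:corpuser:b"]) == a
assert a["owners"] == ["urn:li:corpuser:a", "urn:li:corpuser:b"]
```

The cost, as noted, is defining and maintaining a whole vocabulary of such verbs, which is effectively what MongoDB's update operators are.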


### Serialised Updates Exclusively via Kafka

Here, the client never updates the aspect via HTTP but instead passes it via the MetadataChangeProposal topic in Kafka.
Collaborator

I think it's possible that we do provide an optional version tag in the Kafka ingest path as well, but there is a higher likelihood of out of date updates. Also, it's obviously not possible for a client to perform a retry since they won't know what failed...

@jjoyce0510
Collaborator

Merging this RFC to keep for history. Thanks for the awesome contrib and hoping to get this implementation in shortly!

@jjoyce0510 jjoyce0510 merged commit 54153ea into datahub-project:master Jan 10, 2023
@mattmatravers mattmatravers deleted the rfc-serial-updates branch January 10, 2023 05:59
ericyomi pushed a commit to ericyomi/datahub that referenced this pull request Jan 18, 2023
cccs-Dustin pushed a commit to CybercentreCanada/datahub that referenced this pull request Feb 1, 2023
cccs-Dustin added a commit to CybercentreCanada/datahub that referenced this pull request Feb 3, 2023

* chore(react): remove outdated cypress tests and dependency (datahub-project#6948)

* fix(ci): restrict GE to fix build issues (datahub-project#6967)

* feat(queries): [Experimental] Allow customization of # of queries in Query tab via env var (datahub-project#6964)

* feat(ingest/postgres): emit lineage for postgres views (datahub-project#6953)

* feat(ingest/vertica): support projections and lineage in vertica (datahub-project#6785)

Co-authored-by: mraman2512 <MY_mramaan2512@gmail.com>
Co-authored-by: Aman.Kumar <64635307+mraman2512@users.noreply.github.com>
Co-authored-by: Harshal Sheth <hsheth2@gmail.com>

* fix(ingest): add missing dep for powerbi (datahub-project#6969)

* Docs fixes week of 12 22 (datahub-project#6963)

Co-authored-by: John Joyce <john@acryl.io>

* fix(ingest): unfreeze bigquery/snowflake column dataclass (datahub-project#6921)

* chore(frontend) Remove unused dependencies from package.json (datahub-project#6974)

* chore: misc fixes (datahub-project#6966)

* feat(ingest/glue): emit s3 lineage for s3a and s3n schemes (datahub-project#6788)

Co-authored-by: Harshal Sheth <hsheth2@gmail.com>

* fix(kafka-setup): Make kafka-setup run with multiple threads (datahub-project#6970)

* feat(ingest): mark database_alias and env as deprecated (datahub-project#6901)

* fix(docs): Updating Tag, Glossary Term docs to point to correct GraphQL methods (datahub-project#6965)

* chore(deps): bump certifi from 2020.12.5 to 2022.12.7 in /metadata-ingestion/src/datahub/ingestion/source/feast_image (datahub-project#6979)

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: John Joyce <john@acryl.io>

* fix(ingest): profiling - Fixing issue with the wrong timestamp stored in check (datahub-project#6978)

Co-authored-by: Harshal Sheth <hsheth2@gmail.com>

* config(quickstart): enable auto-reindex for quickstart (datahub-project#6983)

* feat(privileges) - Create a privilege to manage glossary children recursively (datahub-project#6731)

Co-authored-by: Kamalaseshan Radha <mkamalas@LAMU02DN212MD6R.uhc.com>
Co-authored-by: John Joyce <john@acryl.io>

* chore(ingest): finish removing feast-legacy (datahub-project#6985)

* feat(ingest): add  import descriptions of two or more nested messages (datahub-project#6959)

Co-authored-by: 서주현[G플레이스데이터개발] <juhyun.seo@navercorp.com>

* feat(docs) Add feature guide for Manual Lineage (datahub-project#6933)

Co-authored-by: John Joyce <john@acryl.io>

* docs(rfc): Serialising GMS Updates with Preconditions (datahub-project#5818)

* fix(ingest): kafka-connect - support newer version of debezium (datahub-project#6943)

Co-authored-by: Mayuri Nehate <33225191+mayurinehate@users.noreply.github.com>
Co-authored-by: John Joyce <john@acryl.io>

* fix(docs): build and broken snowflake docs fix (datahub-project#6997)

* fix(ingest): bigquery - views in case more than 1 datasets with views (datahub-project#6995)

Co-authored-by: Tamas Nemeth <treff7es@gmail.com>

* fix(docs): Renaming Business Glossary Doc (datahub-project#7001)

* fix(ingest/snowflake): fix type annotations + refactor get_connect_args (datahub-project#7004)

* fix(docs): Changing the platform event topic name in kafka custom topic docs (datahub-project#7007)

* fix(docs): fix name of privilege referenced in posts doc (datahub-project#7002)

* fix(SSO): Correctly redirect to originally requested URL in SSO (datahub-project#7011)

* fix(ingest): remove dead code from tests (datahub-project#7005)

Co-authored-by: John Joyce <john@acryl.io>

* feat(ingestion): Tableau # Embed links (datahub-project#6994)

* feat(auth) Update auth cookies to have same-site none for chrome extension (datahub-project#6976)

* docs(website): DPG WIP (datahub-project#6998)

Co-authored-by: Jeff Merrick <jeff@wireform.io>

* docs: resize datahub logo (datahub-project#7014)

* fix(kafka-setup): Remove reference to non-existing topic (datahub-project#7019)

* fix(ingest): powerbi # use display name field as title for powerbi report page (datahub-project#7017)

* feat(auth): Allow session ttl to be configurable by env variable (datahub-project#7022)

* fix(ui): URL Encode all Entity Profile URLs (datahub-project#7023)

* fix(ui ingest): Fix test connection when stateful ingest is enabled (datahub-project#7013)

* docs(sso) move root user warning to earlier in SSO guides (datahub-project#7028)

* fix(ingest/looker): add clarity in chart input parsing logs (datahub-project#7003)

* chore(ingest): remove duplicate data_platform.json file (datahub-project#7026)

* feat(ingestion): PowerBI # Remove corpUserInfo aspect ingestion (datahub-project#7034)

Co-authored-by: MohdSiddique Bagwan <mohdsiddique.bagwan@gslab.com>

* fix(metadata-models): remove unnecessary bin folder (datahub-project#7035)

* fix(docs): fixing typos (datahub-project#7030)

* feat(ingest): Ingest Previews for Looker Charts, Dashboards, and Explores (datahub-project#6941)

* fix(graphql):fix issue: autorender aspect could not be displayed on t… (datahub-project#6993)

Co-authored-by: yangjd33 <yangjd33@chinaunicom.cn>

* fix(config): adding quotes (datahub-project#7038)

* fix(config): adding quotes (datahub-project#7040)

* fix(ingest/bigquery): Turning some usage warning message to debug log as it caused confusion (datahub-project#7024)

* feat(ingest/vertica): Adding Vertica as source in Datahub UI (datahub-project#7010)

Co-authored-by: Vishal <vishal.k@simplify3x.com>
Co-authored-by: VISHAL KUMAR <110387730+vishalkSimplify@users.noreply.github.com>
Co-authored-by: John Joyce <john@acryl.io>

* fix(): Removed a double set for two fields (datahub-project#7037)

* fix(secret-service): fix default encrypt key (datahub-project#7074)

* Fix versioning file

* Modify SourceConfig and SourceReport

* Instantiate StaleEntityRemovalHandler in __init__ method of source

* Add entities from current run to the state object

* Emitting soft-delete workunits associated with stale entities

* Run 'black' tests

* Run 'isort' tests

* Run 'flake8' and 'mypy' tests

* Update status method of all entities

* Add draft of integration test for Iceberg source

* Add a new integration test for the stateful feature of the iceberg source

* Fixing automatic merge

* Add steps to build datahub's front-end and back-end to the pipeline

* Modified command to build front-end to be consistent with what DataHub (upstream) uses

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: Aseem Bansal <asmbansal2@gmail.com>
Co-authored-by: John Joyce <john@acryl.io>
Co-authored-by: david-leifker <114954101+david-leifker@users.noreply.github.com>
Co-authored-by: orlandine <46893386+orlandine@users.noreply.github.com>
Co-authored-by: Tamas Nemeth <treff7es@gmail.com>
Co-authored-by: Harshal Sheth <hsheth2@gmail.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Felix Lüdin <13187726+Masterchen09@users.noreply.github.com>
Co-authored-by: Aditya Radhakrishnan <aditya.radhakrish@gmail.com>
Co-authored-by: Maggie Hays <maggiem.hays@gmail.com>
Co-authored-by: Pedro Silva <pedro@acryl.io>
Co-authored-by: jx2lee <63435794+jx2lee@users.noreply.github.com>
Co-authored-by: Mayuri Nehate <33225191+mayurinehate@users.noreply.github.com>
Co-authored-by: Jan Hicken <janhicken@users.noreply.github.com>
Co-authored-by: Dmitry Bryazgin <58312247+bda618@users.noreply.github.com>
Co-authored-by: mohdsiddique <mohdsiddiquebagwan@gmail.com>
Co-authored-by: MohdSiddique Bagwan <mohdsiddique.bagwan@gslab.com>
Co-authored-by: RyanHolstien <RyanHolstien@users.noreply.github.com>
Co-authored-by: cccs-seb <64980897+cccs-seb@users.noreply.github.com>
Co-authored-by: fully <ssilb4@gmail.com>
Co-authored-by: Patrick Franco Braz <patrickfbraz@poli.ufrj.br>
Co-authored-by: jakobhanna <55106217+jakobhanna@users.noreply.github.com>
Co-authored-by: Shirshanka Das <shirshanka@apache.org>
Co-authored-by: danielli-ziprecruiter <91145628+danielli-ziprecruiter@users.noreply.github.com>
Co-authored-by: Monica Senapati <89276149+senapatim@users.noreply.github.com>
Co-authored-by: Navin Sharma <103643430+NavinSharma13@users.noreply.github.com>
Co-authored-by: Peter Szalai <szalaipeti.vagyok@gmail.com>
Co-authored-by: raysaka <ray.sakanoue@gmail.com>
Co-authored-by: Chris Collins <chriscollins3456@gmail.com>
Co-authored-by: djordje-mijatovic <97875950+djordje-mijatovic@users.noreply.github.com>
Co-authored-by: Dago Romer <dagoromer85@gmail.com>
Co-authored-by: Mirko R <118171912+mirac-cisco@users.noreply.github.com>
Co-authored-by: Teppo Naakka <teppo.naakka@gmail.com>
Co-authored-by: wangsaisai <wangsaisai@users.noreply.github.com>
Co-authored-by: leifker <dleifker@gmail.com>
Co-authored-by: cccs-eric <eric.ladouceur@cyber.gc.ca>
Co-authored-by: Meenakshi Kamalaseshan Radha <62914384+mkamalas@users.noreply.github.com>
Co-authored-by: Kamalaseshan Radha <mkamalas@LAMU02DN212MD6R.uhc.com>
Co-authored-by: Marvin Rösch <marvinroesch99@gmail.com>
Co-authored-by: Stijn De Haes <stijn.de.haes@gmail.com>
Co-authored-by: cc <50856789+ccpypy@users.noreply.github.com>
Co-authored-by: 陈城 <cheng.chen@tenclass.com>
Co-authored-by: Fredrik Sannholm <fredrik.sannholm@wolt.com>
Co-authored-by: Gabe Lyons <itsgabelyons@gmail.com>
Co-authored-by: Lucas Roesler <roesler.lucas@gmail.com>
Co-authored-by: VISHAL KUMAR <110387730+vishalkSimplify@users.noreply.github.com>
Co-authored-by: mraman2512 <MY_mramaan2512@gmail.com>
Co-authored-by: Aman.Kumar <64635307+mraman2512@users.noreply.github.com>
Co-authored-by: Paul Logan <101486603+laulpogan@users.noreply.github.com>
Co-authored-by: seoju <wngus606@gmail.com>
Co-authored-by: 서주현[G플레이스데이터개발] <juhyun.seo@navercorp.com>
Co-authored-by: Matt Matravers <mattmatravers@hotmail.com>
Co-authored-by: 서재권(Data Platform) <90180644+jaegwonseo@users.noreply.github.com>
Co-authored-by: Thosan Girisona <thosan.girisona@gmail.com>
Co-authored-by: Jeff Merrick <jeff@wireform.io>
Co-authored-by: Yang Jiandan <succeedin2010@163.com>
Co-authored-by: yangjd33 <yangjd33@chinaunicom.cn>
Co-authored-by: Rajasekhar-Vuppala <122261609+Rajasekhar-Vuppala@users.noreply.github.com>
Co-authored-by: Vishal <vishal.k@simplify3x.com>