feat(ingest): Add metabase database id to platform instance mapping #8359

Merged


k-popov
Contributor

@k-popov k-popov commented Jul 3, 2023

Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
  • For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

Since DataHub supports platform instances there may be several distinct instances with the same platform in the same environment. When ingesting charts from Metabase it should be possible to find the corresponding datasets (tables) in DataHub.

For example, the dataset URN searched during ingestion is:

  • without the patch: urn:li:dataPlatform:clickhouse,event.new_app,PROD
  • with the patch: urn:li:dataPlatform:clickhouse,clichouse_analytics_prod.event.new_app,PROD
    Only the latter URN is valid for a dataset that was ingested with platform_instance specified.
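
The two URNs above are abbreviated; a full DataHub dataset URN wraps the same parts as urn:li:dataset:(urn:li:dataPlatform:<platform>,<name>,<env>). A minimal sketch of how the platform instance ends up in the dataset name, assuming the instance is prefixed to the name with a dot. The helper build_dataset_urn is made up for this illustration and is not DataHub's API:

```python
from typing import Optional


def build_dataset_urn(
    platform: str,
    name: str,
    env: str = "PROD",
    platform_instance: Optional[str] = None,
) -> str:
    """Hypothetical helper: build a dataset URN, prefixing the instance if set."""
    if platform_instance:
        # Assumption: the instance is joined to the dataset name with a dot.
        name = f"{platform_instance}.{name}"
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"


without_instance = build_dataset_urn("clickhouse", "event.new_app")
with_instance = build_dataset_urn(
    "clickhouse", "event.new_app", platform_instance="clichouse_analytics_prod"
)
print(without_instance)
print(with_instance)
```

The two printed URNs differ only in the dataset name, which is why a chart cannot be linked to a dataset ingested with platform_instance unless the same instance is supplied on the Metabase side.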

The change still leaves name_to_instance_map optional and if platform_instance is not used, the mapping may be skipped.

The change also handles the case of a Metabase user not being found, which is normal for deactivated users. I believe this case should not be treated as a failure: people quit or change roles while the charts and dashboards they created are still used by the company. I believe this behavior has already been mentioned in #5294. Unfortunately, the Metabase API does not support GETting individual user details for deactivated users; it only offers options to list them. Checking for a 404 therefore looks like a reasonable yet simple workaround.

@github-actions github-actions bot added the ingestion PR or Issue related to the ingestion of metadata label Jul 3, 2023
@k-popov k-popov force-pushed the metabase_platform_instance_map branch 2 times, most recently from 6ccbfbe to 01a5717 Compare July 3, 2023 19:21
@k-popov
Contributor Author

k-popov commented Jul 6, 2023

Another option would be to key the mapping on the id of the database in Metabase instead of its name. If that is considered the better option, I will update the pull request.

@hsheth2
Collaborator

hsheth2 commented Jul 6, 2023

@k-popov thanks for the contribution. @mayurinehate could you take a look?

@mayurinehate
Collaborator

Hey @k-popov, can you not use the platform_instance_map config already available in the metabase connector config? For example:


platform_instance_map:
    clickhouse: clichouse_analytics_prod

Or do you have multiple platform instances of clickhouse accessed via the same metabase instance?

@k-popov
Contributor Author

k-popov commented Jul 7, 2023

@mayurinehate thanks for feedback!

Yes, in my case there are multiple platform instances of clickhouse all accessed by the same metabase instance.
In my understanding that's quite a common case in larger companies: different product apps are using separate database instances but there is a single point of access to them - metabase.

I first thought of "re-using" the platform_instance_map config parameter but it would then be a breaking change which I'd prefer to avoid. That's why I introduced a new config option. Still I'm in doubt about the key for the mapping: database id is forced to be unique by Metabase while name can possibly have duplicates. On the other hand mapping with names is more "user-friendly" than mapping with numeric id. What option would you recommend?

@mayurinehate
Collaborator

Got it. I am inclined towards using id rather than name, as that would solve the problem permanently and handle corner cases as well. About user-friendliness, can you confirm whether this is still true - #5647 (comment)? If yes, I'm comfortable with using the database id, and the config becomes database_id_to_platform_instance_map.

Comment on lines 280 to 284
    if (
        hasattr(http_error, "response")
        and http_error.response.status_code == 404
    ):
Collaborator

Suggested change
-    if (
-        hasattr(http_error, "response")
-        and http_error.response.status_code == 404
-    ):
+    if (
+        hasattr(http_error, "response")
+        and http_error.response is not None
+        and http_error.response.status_code == 404
+    ):

Contributor Author

I have replaced the hasattr check with http_error.response is not None, since the exception constructor always sets http_error.response, defaulting to None: https://github.com/psf/requests/blob/cdbc2e271529f467b278b2760f12ee0b5d6930d3/requests/exceptions.py#L19
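
The check discussed in this thread can be sketched as below. To keep the example runnable without extra packages it uses stand-in classes (FakeResponse, FakeHTTPError) that only imitate the relevant shape of an HTTP error whose response attribute defaults to None. These names are assumptions for illustration, not the requests API and not the code merged in this PR:

```python
from typing import Optional


class FakeResponse:
    """Stand-in for an HTTP response object; only status_code is modelled."""

    def __init__(self, status_code: int) -> None:
        self.status_code = status_code


class FakeHTTPError(Exception):
    """Stand-in for an HTTP error whose response attribute defaults to None."""

    def __init__(
        self, *args: object, response: Optional[FakeResponse] = None
    ) -> None:
        super().__init__(*args)
        self.response = response


def is_not_found(http_error: FakeHTTPError) -> bool:
    """Return True only when a response is attached and its status is 404."""
    return (
        http_error.response is not None
        and http_error.response.status_code == 404
    )


print(is_not_found(FakeHTTPError("no response attached")))
print(is_not_found(FakeHTTPError("gone", response=FakeResponse(404))))
print(is_not_found(FakeHTTPError("boom", response=FakeResponse(500))))
```

Because the attribute always exists, the hasattr test adds nothing; the None check is what guards against an error raised before any response was received.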

@k-popov k-popov changed the title feat(ingest): Add metabase name to platform instance mapping feat(ingest): Add metabase database id to platform instance mapping Jul 11, 2023
@k-popov
Contributor Author

k-popov commented Jul 11, 2023

@mayurinehate as discussed, I've changed the mapping key from name to id.
The key is a string, not an integer as in the Metabase API, because integer keys do not seem to be supported by DataHub config: when I tried to make the key an integer, datahub ingest threw AttributeError: 'int' object has no attribute 'endswith'. That's why I replaced the numeric config keys with strings.

@mayurinehate
Collaborator

mayurinehate commented Jul 13, 2023

I am not familiar with such a limitation; this should be supported. Do you happen to have the stack trace of the error, i.e. which line in the code raised it? I am fine with keeping these as strings. I believe it is not mandatory to enclose the integer id in quotes in the recipe yml.

@k-popov
Contributor Author

k-popov commented Jul 13, 2023

Here is the stacktrace I see when using int as config key:

[2023-07-13 21:28:17,685] WARNING  {datahub.ingestion.run.pipeline:296} - Failed to configure reporter: datahub
Traceback (most recent call last):
  File "/home/kpopov/github/datahub/venv/lib/python3.10/site-packages/datahub/ingestion/run/pipeline.py", line 282, in _configure_reporting
    reporter_class.create(
  File "/home/kpopov/github/datahub/venv/lib/python3.10/site-packages/datahub/ingestion/reporting/datahub_ingestion_run_summary_provider.py", line 111, in create
    return cls(sink, reporter_config.report_recipe, ctx)
  File "/home/kpopov/github/datahub/venv/lib/python3.10/site-packages/datahub/ingestion/reporting/datahub_ingestion_run_summary_provider.py", line 139, in __init__
    recipe=self._get_recipe_to_report(ctx),
  File "/home/kpopov/github/datahub/venv/lib/python3.10/site-packages/datahub/ingestion/reporting/datahub_ingestion_run_summary_provider.py", line 156, in _get_recipe_to_report
    return json.dumps(redact_raw_config(ctx.pipeline_config._raw_dict))
  File "/home/kpopov/github/datahub/venv/lib/python3.10/site-packages/datahub/configuration/common.py", line 60, in redact_raw_config
    return {
  File "/home/kpopov/github/datahub/venv/lib/python3.10/site-packages/datahub/configuration/common.py", line 61, in <dictcomp>
    k: _redact_value(v) if _should_redact_key(k) else redact_raw_config(v)
  File "/home/kpopov/github/datahub/venv/lib/python3.10/site-packages/datahub/configuration/common.py", line 60, in redact_raw_config
    return {
  File "/home/kpopov/github/datahub/venv/lib/python3.10/site-packages/datahub/configuration/common.py", line 61, in <dictcomp>
    k: _redact_value(v) if _should_redact_key(k) else redact_raw_config(v)
  File "/home/kpopov/github/datahub/venv/lib/python3.10/site-packages/datahub/configuration/common.py", line 60, in redact_raw_config
    return {
  File "/home/kpopov/github/datahub/venv/lib/python3.10/site-packages/datahub/configuration/common.py", line 61, in <dictcomp>
    k: _redact_value(v) if _should_redact_key(k) else redact_raw_config(v)
  File "/home/kpopov/github/datahub/venv/lib/python3.10/site-packages/datahub/configuration/common.py", line 60, in redact_raw_config
    return {
  File "/home/kpopov/github/datahub/venv/lib/python3.10/site-packages/datahub/configuration/common.py", line 61, in <dictcomp>
    k: _redact_value(v) if _should_redact_key(k) else redact_raw_config(v)
  File "/home/kpopov/github/datahub/venv/lib/python3.10/site-packages/datahub/configuration/common.py", line 35, in _should_redact_key
    return key in REDACT_KEYS or any(key.endswith(suffix) for suffix in REDACT_SUFFIXES)
  File "/home/kpopov/github/datahub/venv/lib/python3.10/site-packages/datahub/configuration/common.py", line 35, in <genexpr>
    return key in REDACT_KEYS or any(key.endswith(suffix) for suffix in REDACT_SUFFIXES)
AttributeError: 'int' object has no attribute 'endswith'

Actually ingestion continues after this stacktrace but having any exception seems bad.
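
The traceback shows key.endswith(...) being called on a config key that is an integer. A minimal sketch of a redaction helper that tolerates non-string keys follows. The values of REDACT_KEYS and REDACT_SUFFIXES and the function names are placeholders assumed for this illustration; this is not the fix applied in DataHub:

```python
from typing import Any

# Placeholder values, assumed for this sketch only.
REDACT_KEYS = {"password", "token"}
REDACT_SUFFIXES = ("_password", "_token")


def should_redact_key(key: Any) -> bool:
    """Like the check in the traceback, but non-string keys are never redacted."""
    if not isinstance(key, str):
        return False
    return key in REDACT_KEYS or any(key.endswith(s) for s in REDACT_SUFFIXES)


def redact(config: Any) -> Any:
    """Recursively mask sensitive values; keys are coerced to str on output."""
    if isinstance(config, dict):
        return {
            str(k): "********" if should_redact_key(k) else redact(v)
            for k, v in config.items()
        }
    return config


recipe = {
    "source": {
        "config": {
            "password": "secret",
            # An integer key, as produced by an unquoted numeric key in a recipe.
            "database_id_to_instance_map": {42: "analytics_prod"},
        }
    }
}
redacted = redact(recipe)
print(redacted)
```

Keeping the mapping keys as strings in the connector config, as done in this PR, avoids the problem on the connector side regardless of how the redaction code behaves.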

  # Set platform_instance if configuration provides a mapping from platform name to instance
  platform_instance = (
      self.config.platform_instance_map.get(platform)
-     if self.config.platform_instance_map
+     if self.config.platform_instance_map and platform_instance is None
      else None
Collaborator

This condition needs to be written as below so that it behaves correctly when platform_instance has already been chosen from datasource_id_in_metabase.

if self.config.platform_instance_map and platform_instance is None:
     platform_instance = self.config.platform_instance_map.get(platform)

Can you please refactor the entire platform instance lookup logic into a separate method and add unit tests for that method?

def get_platform_instance_for_datasource(datasource_id: str, platform: Optional[str])->Optional[str]:
    ...
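
One way the requested method could look, written as a free function so it runs on its own. This is a sketch under the assumption that the per-database mapping takes precedence and the per-platform mapping is only a fallback; the parameter names follow the config options mentioned in this discussion, and the body is not the merged implementation:

```python
from typing import Dict, Optional


def get_platform_instance(
    platform: Optional[str],
    datasource_id: Optional[str],
    database_id_to_instance_map: Optional[Dict[str, str]] = None,
    platform_instance_map: Optional[Dict[str, str]] = None,
) -> Optional[str]:
    platform_instance = None
    # The per-database mapping is the more specific one, so it is tried first.
    if database_id_to_instance_map and datasource_id is not None:
        platform_instance = database_id_to_instance_map.get(str(datasource_id))
    # Fall back to the per-platform mapping only if nothing was found above.
    if platform_instance is None and platform_instance_map and platform is not None:
        platform_instance = platform_instance_map.get(platform)
    return platform_instance


id_map = {"1": "clickhouse_app_a", "2": "clickhouse_app_b"}
platform_map = {"clickhouse": "clickhouse_default"}

print(get_platform_instance("clickhouse", "2", id_map, platform_map))
print(get_platform_instance("clickhouse", "7", id_map, platform_map))
print(get_platform_instance("postgres", "7", id_map, platform_map))
```

With both maps configured, a database id that is present in the id map wins, an unmapped id falls back to the platform-wide instance, and a platform with no mapping at all yields None.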

Contributor Author

I have moved platform_instance detection into a separate method.
I attempted to add a unit test for this method, but because connection checks are done in __init__() of MetabaseSource I also need to mock the HTTP communication. I will continue working on this.

Contributor Author

@mayurinehate would you please check the PR once again? I've implemented the unit-test for the function and it passes normally.

I've also moved network initialization of MetabaseSource class into separate function which is called in MetabaseSource.__init__(). There are no changes in behavior but now init is logically split into "local" initialization and "network" initialization.
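
The split described above makes the lookup testable without a live server. A rough sketch of the pattern follows, using a hypothetical ExampleSource class and a hypothetical setup_session method; neither name is taken from the PR:

```python
from typing import Dict, Optional
from unittest.mock import patch


class ExampleSource:
    """Hypothetical source whose constructor is split into local and network setup."""

    def __init__(self, id_map: Dict[str, str]) -> None:
        # "Local" initialization: plain attribute assignment, no I/O.
        self.id_map = id_map
        # "Network" initialization lives in its own method so tests can patch it.
        self.setup_session()

    def setup_session(self) -> None:
        # Placeholder for login and connectivity checks against a real server.
        raise ConnectionError("no server available in this sketch")

    def get_platform_instance(self, datasource_id: str) -> Optional[str]:
        return self.id_map.get(datasource_id)


# Patch the network step so the object can be built without any HTTP traffic.
with patch.object(ExampleSource, "setup_session", return_value=None) as fake_setup:
    source = ExampleSource({"1": "analytics_prod"})
    result = source.get_platform_instance("1")

print(result)
```

Only the network step is replaced, so the test still exercises the real constructor and the real lookup method.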

I see some tests are failing but these are not related to the change I made.

@anshbansal anshbansal added the community-contribution PR or Issue raised by member(s) of DataHub Community label Jul 17, 2023
Collaborator

@mayurinehate mayurinehate left a comment


lgtm

@k-popov
Contributor Author

k-popov commented Jul 28, 2023

@mayurinehate , thank you for your response. I've removed redundant attribute definition. Related tests look fine. Would you please check the MR once again?

@mayurinehate mayurinehate removed the pending-submitter-response Issue/request has been reviewed but requires a response from the submitter label Jul 31, 2023
@k-popov k-popov force-pushed the metabase_platform_instance_map branch from 51a8fb5 to 88acb48 Compare July 31, 2023 06:25
@k-popov
Contributor Author

k-popov commented Aug 1, 2023

@mayurinehate , sorry to bother you once again but I might be missing something.

After you approved the PR I still didn't have a button to merge it. I first thought that was because the master branch had moved forward. I rebased the changes via GitHub's UI, but as a result I only lost your approval. Am I supposed to have the merge button after approval? Or does one of the project maintainers accept the PR and actually merge the changes?

@mayurinehate
Collaborator

Hi @k-popov, you don't need to take any action after approval. One of the project maintainers will merge the PR.

Collaborator

@hsheth2 hsheth2 left a comment


LGTM

I'm not a huge fan of having both platform_instance_map and database_id_to_instance_map -- in the future, does it make sense to deprecate the former in favor of the latter?

@hsheth2 hsheth2 added the merge-pending-ci A PR that has passed review and should be merged once CI is green. label Aug 1, 2023
@anshbansal anshbansal merged commit eec89a8 into datahub-project:master Aug 2, 2023
43 checks passed