feat(ingest): redshift - Redshift rework #6906

Merged
merged 83 commits into from Apr 12, 2023

Changes from all commits (83 commits)
58ab3cf
Refactoring container creation
treff7es Dec 28, 2022
c809aa8
Adding back accidentally removed imports
treff7es Dec 28, 2022
fae7ec1
Adding back line breaks as well
treff7es Dec 28, 2022
a71f8dc
Merge branch 'master' into sql_common_refactor
treff7es Dec 28, 2022
cb219bd
Remove unneeded line
treff7es Dec 28, 2022
d3e6407
Black formatting
treff7es Dec 28, 2022
93df8fc
Isorting
treff7es Dec 28, 2022
cf074f0
Fixing return types
treff7es Dec 28, 2022
42004e5
Black formatting
treff7es Dec 28, 2022
8fcdb7a
Add option to set container name differently than the key name
treff7es Dec 28, 2022
e985a20
Fix snowflake container generation
treff7es Dec 29, 2022
9af84c9
Fixing 2 tier container generation
treff7es Dec 29, 2022
be580ee
Merge branch 'master' into sql_common_refactor
treff7es Dec 29, 2022
a626a9e
isorting
treff7es Dec 29, 2022
ade9fa2
Removing unused import
treff7es Dec 29, 2022
0da0c55
Fixing presto container generation
treff7es Dec 29, 2022
2268cad
Athena inherits from two tier db
treff7es Dec 29, 2022
480497f
Reverting Athena changes
treff7es Dec 29, 2022
e017f44
Merge branch 'master' into sql_common_refactor
treff7es Dec 29, 2022
d6c1b7d
Merge branch 'master' into sql_common_refactor
treff7es Dec 30, 2022
defaff6
Initial commit
treff7es Nov 24, 2022
1f825e4
Extracting out common methods, fixign mypy issues
treff7es Nov 25, 2022
b522948
- Adding connection test
treff7es Nov 25, 2022
0b56fbc
Redshift rework
treff7es Dec 15, 2022
7b5b420
Fixing issues after merge
treff7es Dec 30, 2022
d3cb087
Fixing accidental change
treff7es Dec 30, 2022
df55103
Flake, black, mypy fixes
treff7es Dec 30, 2022
2945d27
Adding fixes for Redshift source
treff7es Jan 9, 2023
c4106f4
Merge branch 'master' into redshift-rework
treff7es Jan 24, 2023
aecafda
Resolving merge conflicts
treff7es Jan 24, 2023
7753ffb
Fixing wrap_aspect_as_workunit
treff7es Jan 24, 2023
ceb0b37
Merge branch 'master' into redshift-rework
treff7es Jan 24, 2023
12625b3
Fixing usage test
treff7es Jan 25, 2023
442c8e0
Fixing linter error
treff7es Jan 25, 2023
9535790
Adding missing test file
treff7es Jan 25, 2023
4332e07
various fixes
treff7es Jan 26, 2023
dae7284
Adding unload lineage code as well
treff7es Jan 26, 2023
5b6881b
Merge branch 'master' into redshift-rework
treff7es Jan 26, 2023
4efc105
Adding quoting to patch path
treff7es Jan 26, 2023
22a1f6a
Merge branch 'master' into redshift-rework
treff7es Jan 31, 2023
d29dd53
PR review fixes
treff7es Jan 31, 2023
920648f
Adding support for complex types in Redshift
treff7es Feb 6, 2023
da73a74
Renaming non-binding lineage parser to generic view ddl sql parser
treff7es Feb 6, 2023
9264bfa
Merge branch 'master' into redshift-rework
treff7es Feb 6, 2023
c0d4461
Updating docs
treff7es Feb 7, 2023
99ae34f
Merge branch 'master' into redshift-rework
treff7es Feb 28, 2023
7234509
Fixing build issues
treff7es Feb 28, 2023
26fe103
Addressing some pr review comments
treff7es Feb 28, 2023
54556a0
Addressing even more pr review comments
treff7es Feb 28, 2023
a497a25
Additional pr review fixes
treff7es Feb 28, 2023
c2fa727
Additional pr review fixes
treff7es Feb 28, 2023
a44b598
Merge branch 'master' into redshift-rework
treff7es Mar 1, 2023
df05369
Addressing more pr review comment
treff7es Mar 9, 2023
b3cd9a6
Merge branch 'master' into redshift-rework
treff7es Mar 14, 2023
5b304fa
Mergin master
treff7es Mar 14, 2023
281b78d
Merge branch 'master' into redshift-rework
treff7es Mar 14, 2023
ea0a6cf
Removing redshift-beta
treff7es Mar 16, 2023
d41f776
Merge branch 'master' into redshift-rework
treff7es Mar 16, 2023
82160ed
Fixing linter error
treff7es Mar 16, 2023
3e7409a
Merge branch 'master' into redshift-rework
treff7es Mar 20, 2023
5cc4be1
Merge branch 'master' into redshift-rework
jjoyce0510 Mar 20, 2023
298f0d9
Merge branch 'master' into redshift-rework
treff7es Mar 21, 2023
bce47ac
Merge branch 'master' into redshift-rework
treff7es Mar 21, 2023
7f67047
Addressing pr review comments
treff7es Mar 21, 2023
6c9585c
fixing import order
treff7es Mar 21, 2023
5d98eb9
Merge branch 'master' into redshift-rework
treff7es Mar 21, 2023
6c3cf64
Updating dbt golden files
treff7es Mar 23, 2023
ea78f9f
Merge branch 'master' into redshift-rework
treff7es Mar 23, 2023
903e0a7
update to redshift-usage-legacy
hsheth2 Mar 23, 2023
d0e2d42
Merge branch 'master' into redshift-rework
hsheth2 Mar 23, 2023
975274b
update dbt golden files
hsheth2 Mar 23, 2023
7f9bffb
Fixing usgage-legacy
treff7es Mar 23, 2023
aa4c6d4
Recovering legacy redshift-usage
treff7es Mar 24, 2023
3968cf5
Merge branch 'master' into redshift-rework
treff7es Mar 24, 2023
af2d718
fixing lineage query
treff7es Mar 27, 2023
1eb15d1
Merge branch 'master' into redshift-rework
treff7es Mar 27, 2023
c778093
Merge branch 'master' into redshift-rework
hsheth2 Apr 7, 2023
5539139
Merge branch 'master' into redshift-rework
treff7es Apr 11, 2023
88bb936
Fixing linter error
treff7es Apr 11, 2023
25fce12
Fixing linter issues
treff7es Apr 11, 2023
bc9480f
Merge branch 'master' into redshift-rework
treff7es Apr 11, 2023
99dbd43
fixing bugbear linter issue
treff7es Apr 11, 2023
06cdcd4
Merge branch 'master' into redshift-rework
treff7es Apr 12, 2023
16 changes: 10 additions & 6 deletions docs/how/updating-datahub.md
@@ -5,10 +5,10 @@ This file documents any backwards-incompatible changes in DataHub and assists pe
## Next

### Breaking Changes
- #7016 Add `add_database_name_to_urn` flag to the Oracle source, which ensures that Dataset urns have the DB name as a prefix to prevent collisions (e.g. {database}.{schema}.{table}). ONLY breaking if you set this flag to true; otherwise behavior remains the same.
- The Airflow plugin no longer includes the DataHub Kafka emitter by default. Use `pip install acryl-datahub-airflow-plugin[datahub-kafka]` for Kafka support.
- The Airflow lineage backend no longer includes the DataHub Kafka emitter by default. Use `pip install acryl-datahub[airflow,datahub-kafka]` for Kafka support.

### Potential Downtime

@@ -21,7 +21,8 @@ This file documents any backwards-incompatible changes in DataHub and assists pe
### Breaking Changes

- #7103 This should only impact users who have configured explicit non-default names for DataHub's Kafka topics. The environment variables used to configure Kafka topics for DataHub in the `kafka-setup` docker image have been updated to be in line with other DataHub components; for more info see our docs on [Configuring Kafka in DataHub](https://datahubproject.io/docs/how/kafka-config). They were previously suffixed with `_TOPIC`, whereas now the correct suffix is `_TOPIC_NAME`. This change should not affect any user who is using default Kafka names.
- #6906 The Redshift source has been reworked and now also includes usage capabilities. The old Redshift source was renamed to `redshift-legacy`. The `redshift-usage` source has also been renamed to `redshift-usage-legacy` and will be removed in the future.

### Potential Downtime

@@ -45,9 +46,11 @@ Helm with `--atomic`: In general, it is recommended to not use the `--atomic` se
### Potential Downtime

### Deprecations

- #6851 - Sources bigquery-legacy and bigquery-usage-legacy have been removed.

### Other notable Changes

- If anyone faces issues with login, please clear your cookies. This release includes security updates that may cause login issues until cookies are cleared.

## 0.9.4 / 0.9.5
@@ -151,7 +154,7 @@ Helm with `--atomic`: In general, it is recommended to not use the `--atomic` se
### Breaking Changes

- Browse Paths have been upgraded to a new format to align more closely with the intention of the feature.
Learn more about the changes, including steps on upgrading, here: <https://datahubproject.io/docs/advanced/browse-paths-upgrade>
- The dbt ingestion source's `disable_dbt_node_creation` and `load_schema` options have been removed. They were no longer necessary due to the recently added sibling entities functionality.
- The `snowflake` source now uses newer faster implementation (earlier `snowflake-beta`). Config properties `provision_role` and `check_role_grants` are not supported. Older `snowflake` and `snowflake-usage` are available as `snowflake-legacy` and `snowflake-usage-legacy` sources respectively.

@@ -290,4 +293,5 @@ Helm with `--atomic`: In general, it is recommended to not use the `--atomic` se
- #4644 `host_port` option of `snowflake` and `snowflake-usage` sources deprecated as the name was confusing. Use `account_id` option instead.

### Other notable Changes

- #4760 `check_role_grants` option was added in `snowflake` to disable checking roles in `snowflake` as some people were reporting long run times when checking roles.
1 change: 0 additions & 1 deletion metadata-ingestion/docs/sources/redshift/README.md
@@ -1 +0,0 @@
To get all metadata from Redshift you need to use two plugins `redshift` and `redshift-usage`. Both of them are described in this page. These will require 2 separate recipes. We understand this is not ideal and we plan to make this easier in the future.
14 changes: 0 additions & 14 deletions metadata-ingestion/docs/sources/redshift/redshift-usage_recipe.yml

This file was deleted.

16 changes: 12 additions & 4 deletions metadata-ingestion/docs/sources/redshift/redshift_recipe.yml
@@ -1,4 +1,3 @@
source:
type: redshift
config:
# Coordinates
@@ -13,10 +12,19 @@ source:
options:
# driver_option: some-option

include_views: True # whether to include views, defaults to True
include_tables: True # whether to include tables, defaults to True
include_table_lineage: true
include_usage_statistics: true
# The following options are only used when include_usage_statistics is true.
# The email domain is appended to the Redshift username extracted from the
# Redshift audit history, in the format username@email_domain
email_domain: mydomain.com

profiling:
enabled: true
# Only collect table level profiling information
profile_table_level_only: true

sink:
# sink configs

#------------------------------------------------------------------------------
Expand Down
13 changes: 8 additions & 5 deletions metadata-ingestion/setup.py
@@ -337,8 +337,9 @@ def get_long_description():
| {"psycopg2-binary", "acryl-pyhive[hive]>=0.6.12", "pymysql>=1.0.2"},
"pulsar": {"requests"},
"redash": {"redash-toolbelt", "sql-metadata", sqllineage_lib},
"redshift": sql_common | redshift_common,
"redshift-usage": sql_common | usage_common | redshift_common,
"redshift": sql_common | redshift_common | usage_common | {"redshift-connector"},
"redshift-legacy": sql_common | redshift_common,
"redshift-usage-legacy": sql_common | usage_common | redshift_common,
"s3": {*s3_base, *data_lake_profiling},
"sagemaker": aws_common,
"salesforce": {"simple-salesforce"},
@@ -452,7 +453,8 @@ def get_long_description():
"presto",
"redash",
"redshift",
"redshift-usage",
"redshift-legacy",
"redshift-usage-legacy",
"s3",
"snowflake",
"tableau",
@@ -541,8 +543,9 @@ def get_long_description():
"oracle = datahub.ingestion.source.sql.oracle:OracleSource",
"postgres = datahub.ingestion.source.sql.postgres:PostgresSource",
"redash = datahub.ingestion.source.redash:RedashSource",
"redshift = datahub.ingestion.source.sql.redshift:RedshiftSource",
"redshift-usage = datahub.ingestion.source.usage.redshift_usage:RedshiftUsageSource",
"redshift = datahub.ingestion.source.redshift.redshift:RedshiftSource",
"redshift-legacy = datahub.ingestion.source.sql.redshift:RedshiftSource",
"redshift-usage-legacy = datahub.ingestion.source.usage.redshift_usage:RedshiftUsageSource",
"snowflake = datahub.ingestion.source.snowflake.snowflake_v2:SnowflakeV2Source",
"superset = datahub.ingestion.source.superset:SupersetSource",
"tableau = datahub.ingestion.source.tableau:TableauSource",
Empty file.
12 changes: 12 additions & 0 deletions metadata-ingestion/src/datahub/ingestion/source/redshift/common.py
@@ -0,0 +1,12 @@
from datahub.ingestion.source.redshift.config import RedshiftConfig

redshift_datetime_format = "%Y-%m-%d %H:%M:%S"


def get_db_name(config: RedshiftConfig) -> str:
db_name = config.database
db_alias = config.database_alias

db_name = db_alias or db_name
assert db_name is not None, "database name or alias must be specified"
return db_name
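The precedence implemented by `get_db_name` (the alias, when set, wins over the database name) can be sketched with a minimal stand-in config. The `FakeRedshiftConfig` dataclass below is illustrative only and not part of this PR:

```python
from dataclasses import dataclass
from typing import Optional


# Hypothetical stand-in for RedshiftConfig, just for illustration.
@dataclass
class FakeRedshiftConfig:
    database: Optional[str] = None
    database_alias: Optional[str] = None


def get_db_name(config: FakeRedshiftConfig) -> str:
    # database_alias, when set, takes precedence over database.
    db_name = config.database_alias or config.database
    assert db_name is not None, "database name or alias must be specified"
    return db_name
```

With `database="dev"` and `database_alias="prod"` this returns `"prod"`; with only `database` set it falls back to `"dev"`.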
139 changes: 139 additions & 0 deletions metadata-ingestion/src/datahub/ingestion/source/redshift/config.py
@@ -0,0 +1,139 @@
from enum import Enum
from typing import Any, Dict, List, Optional

from pydantic import root_validator
from pydantic.fields import Field

from datahub.configuration import ConfigModel
from datahub.configuration.pydantic_field_deprecation import pydantic_field_deprecated
from datahub.configuration.source_common import DatasetLineageProviderConfigBase
from datahub.ingestion.source.aws.path_spec import PathSpec
from datahub.ingestion.source.sql.postgres import PostgresConfig
from datahub.ingestion.source.state.stateful_ingestion_base import (
StatefulLineageConfigMixin,
StatefulProfilingConfigMixin,
StatefulUsageConfigMixin,
)
from datahub.ingestion.source.usage.usage_common import BaseUsageConfig


# The lineage modes are documented in the Redshift source's docstring.
class LineageMode(Enum):
SQL_BASED = "sql_based"
STL_SCAN_BASED = "stl_scan_based"
MIXED = "mixed"


class S3LineageProviderConfig(ConfigModel):
"""
Any source that produces s3 lineage from/to Datasets should inherit this class.
"""

path_specs: List[PathSpec] = Field(
default=[],
description="List of PathSpec. See below the details about PathSpec",
)

strip_urls: bool = Field(
default=True,
description="Strip filename from s3 url. It only applies if path_specs are not specified.",
)


class S3DatasetLineageProviderConfigBase(ConfigModel):
"""
Any source that produces s3 lineage from/to Datasets should inherit this class.
This is needed to group all lineage-related configs under the `s3_lineage_config` config property.
"""

s3_lineage_config: S3LineageProviderConfig = Field(
default=S3LineageProviderConfig(),
description="Common config for S3 lineage generation",
)


class RedshiftUsageConfig(BaseUsageConfig, StatefulUsageConfigMixin):
email_domain: Optional[str] = Field(
default=None,
description="Email domain of your organisation so users can be displayed on UI appropriately.",
)


class RedshiftConfig(
PostgresConfig,
DatasetLineageProviderConfigBase,
S3DatasetLineageProviderConfigBase,
RedshiftUsageConfig,
StatefulLineageConfigMixin,
StatefulProfilingConfigMixin,
):
database: str = Field(default="dev", description="database")

# Although Amazon Redshift is compatible with Postgres's wire format,
# we actually want to use the sqlalchemy-redshift package and dialect
# because it has better caching behavior. In particular, it queries
# the full table, column, and constraint information in a single larger
# query, and then simply pulls out the relevant information as needed.
# Because of this behavior, it uses dramatically fewer round trips for
# large Redshift warehouses. As an example, see this query for the columns:
# https://github.com/sqlalchemy-redshift/sqlalchemy-redshift/blob/60b4db04c1d26071c291aeea52f1dcb5dd8b0eb0/sqlalchemy_redshift/dialect.py#L745.
scheme = Field(
default="redshift+psycopg2",
description="",
hidden_from_schema=True,
)

_database_alias_deprecation = pydantic_field_deprecated(
"database_alias",
message="database_alias is deprecated. Use platform_instance instead.",
)

default_schema: str = Field(
default="public",
description="The default schema to use if the sql parser fails to parse the schema with the `sql_based` lineage collector",
)

include_table_lineage: Optional[bool] = Field(
default=True, description="Whether table lineage should be ingested."
)
include_copy_lineage: Optional[bool] = Field(
default=True,
description="Whether lineage should be collected from copy commands",
)

include_usage_statistics: bool = Field(
default=False,
description="Generate usage statistics. The email_domain config parameter needs to be set if this is enabled.",
)

include_unload_lineage: Optional[bool] = Field(
default=True,
description="Whether lineage should be collected from unload commands",
)

capture_lineage_query_parser_failures: Optional[bool] = Field(
hidden_from_schema=True,
default=False,
description="Whether to capture lineage query parser errors with dataset properties for debugging",
)

table_lineage_mode: Optional[LineageMode] = Field(
default=LineageMode.STL_SCAN_BASED,
description="Which table lineage collector mode to use. Available modes are: [stl_scan_based, sql_based, mixed]",
)
extra_client_options: Dict[str, Any] = {}

@root_validator(pre=True)
def check_email_is_set_on_usage(cls, values):
if values.get("include_usage_statistics"):
assert (
"email_domain" in values and values["email_domain"]
), "email_domain needs to be set if usage is enabled"
return values

@root_validator()
def check_database_or_database_alias_set(cls, values):
assert values.get("database") or values.get(
"database_alias"
), "either database or database_alias must be set"
return values
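The logic of the `check_email_is_set_on_usage` validator can be exercised outside pydantic with a plain-dict sketch. This is hypothetical illustration code, not the PR's code; the real check runs as a `root_validator(pre=True)` on `RedshiftConfig`:

```python
from typing import Any, Dict


def check_email_is_set_on_usage(values: Dict[str, Any]) -> Dict[str, Any]:
    # Mirrors the root validator: usage extraction builds user identities
    # as username@email_domain, so email_domain must be set whenever
    # usage statistics are enabled.
    if values.get("include_usage_statistics"):
        assert values.get(
            "email_domain"
        ), "email_domain needs to be set if usage is enabled"
    return values
```

A config with `include_usage_statistics: true` and no `email_domain` fails this check, while any config with usage disabled passes unchanged.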