Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor(ingest/bigquery): add inline comments + refactor in table name parsing #7609

Merged
merged 5 commits into from
Mar 24, 2023

Conversation

mayurinehate
Copy link
Collaborator

@mayurinehate mayurinehate commented Mar 16, 2023

Also added tests for widely used utility functions.

Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
  • For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

@@ -90,7 +90,7 @@ class BigQueryV2Config(
)

number_of_datasets_process_in_batch_if_profiling_enabled: int = Field(
default=80,
default=200,
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has been tested independently and is optimal batch value.

@@ -131,7 +131,7 @@ class BigQueryV2Config(

extract_lineage_from_catalog: bool = Field(
default=False,
description="This flag enables the data lineage extraction from Data Lineage API exposed by Google Data Catalog. NOTE: This extractor can't build views lineage. It's recommended to enable the view's DDL parsing. Read the docs to have more information about: https://cloud.google.com/data-catalog/docs/reference/data-lineage/rest",
description="This flag enables the data lineage extraction from Data Lineage API exposed by Google Data Catalog. NOTE: This extractor can't build views lineage. It's recommended to enable the view's DDL parsing. Read the docs to have more information about: https://cloud.google.com/data-catalog/docs/concepts/about-data-lineage",
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated link gives more insight than original link that referenced api directly.

shortened_table_name = re.sub(
self._BIGQUERY_WILDCARD_REGEX, "", shortened_table_name
)

matches = BigQueryTableRef.SNAPSHOT_TABLE_REGEX.match(shortened_table_name)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This reordering is required for below test case which failed earlier:

assert (
        BigqueryTableIdentifier(
            "project", "dataset", "table@1624046611000"
        ).get_table_name()
        == "project.dataset.table"
    )

@mayurinehate mayurinehate changed the title refractor(ingest/bigquery): added inline comments, minor reorder in t… refractor(ingest/bigquery): added inline comments, minor refractor in table name parsing Mar 16, 2023
@github-actions github-actions bot added the ingestion PR or Issue related to the ingestion of metadata label Mar 16, 2023
@hsheth2 hsheth2 changed the title refractor(ingest/bigquery): added inline comments, minor refractor in table name parsing refactor(ingest/bigquery): added inline comments, minor refractor in table name parsing Mar 16, 2023
@hsheth2 hsheth2 changed the title refactor(ingest/bigquery): added inline comments, minor refractor in table name parsing refactor(ingest/bigquery): add inline comments + refactor in table name parsing Mar 16, 2023
"""
Args:
table_name:
table name (in form <table-id> or <table-prefix>_<shard>`, optionally prefixed with <project-id>.<dataset-id>)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why have we changed from get_raw_table() to .table elsewhere in the code, if this one supports the project / dataset prefix?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Due to this ->

 If table_name is fully qualified, i.e. prefixed with <project-id>.<dataset-id> then the special case of
                dataset as a sharded table, i.e. table-id itself as shard can not be detected.

get_table_and_shard("project.dataset.20230112") -> (project.dataset.20230112, None)

get_table_and_shard("20230112") -> (None, 20230112)

Alternatively - we can also fix the regex but it seemed safer to change input instead.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a normal practice with sharding? Or have we just seen some customers put a bunch of tables with date names in a dataset and consider them "shards". For example, https://cloud.google.com/bigquery/docs/partitioned-tables#dt_partition_shard says:

Table sharding is the practice of storing data in multiple tables, using a naming prefix such as [PREFIX]_YYYYMMDD.

Can we be explicit about what cases we identify tables as sharding? It seems like right now, we support:

  • [PREFIX]_YYYYMMDD
  • [PREFIX]$YYYYMMDD nvm, seems like $ is a forbidden character
  • YYYYMMDD

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's right. above two case are what we support with default sharded pattern. the second case (just YYYYMMDD as name) is more of a specific case adopted by very few AFAIK.

The complexity is that the sharded pattern can be overridden using recipe config - sharded_table_pattern . Would be good to understand if this is really used/required.

In case of non-sharded tables, returns (<table-id>, None)
In case of sharded tables, returns (<table-prefix>, shard)
"""
match = re.match(
Copy link
Collaborator Author

@mayurinehate mayurinehate Mar 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note the change from re.search to re.match (match the end of string with _). This is done to differentiate

table_2023010920230109 (as non-sharded table) from
table_20230109 (as sharded table).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In that case, we should add $ to the regex to mark the end of the string but keep it as re.search

That way this method will work for fully qualified table names too

Copy link
Collaborator Author

@mayurinehate mayurinehate Mar 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am having trouble updating the single regex to work for both qualified table name as well as non-qualified table name and support all the cases, at the same time, also given the fact that group numbering should stay the same as this regex can be overriden using recipe config sharded_table_pattern. If this change does not look okay, I can revert this (and keep only others) and we can fix this in follow up PRs.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok that's fine then - we can do it in a follow up PR

The reason I'd like to do it is that some folks partition at the dataset level instead of the table level, so passing the full name into this method enables us to support them in the future

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It'd be nice to simplify this logic as much as possible, and thus try to pass as few types of table identifiers as possible. Most of the time we pass fully qualified names or refs, so ideally I think we try to stop passing just table name most of the time? But am fine with this as Harshal is for now

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that we will deal with fully qualified names more often. Let me know if I should revert this change and address in follow up PR or keep it.

Copy link
Collaborator

@asikowitz asikowitz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comments and cleanup are helpful. I think there's more cleanup to be done around table name parsing, but this is a nice start. Main comment is around converting the existing tests to doctests

"""
Args:
table_name:
table name (in form <table-id> or <table-prefix>_<shard>`, optionally prefixed with <project-id>.<dataset-id>)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a normal practice with sharding? Or have we just seen some customers put a bunch of tables with date names in a dataset and consider them "shards". For example, https://cloud.google.com/bigquery/docs/partitioned-tables#dt_partition_shard says:

Table sharding is the practice of storing data in multiple tables, using a naming prefix such as [PREFIX]_YYYYMMDD.

Can we be explicit about what cases we identify tables as sharding? It seems like right now, we support:

  • [PREFIX]_YYYYMMDD
  • [PREFIX]$YYYYMMDD nvm, seems like $ is a forbidden character
  • YYYYMMDD

@@ -84,7 +84,19 @@ class BigqueryTableIdentifier:

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@treff7es do you know what the $ is doing in _BIGQUERY_DEFAULT_SHARDED_TABLE_REGEX? Seems like we'd support [PREFIX]$YYYYMMDD but $ is in the invalid_chars set

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From what I understand that $ in the middle is for matching non sharded tables. but this is really complex can be simplified.

In case of non-sharded tables, returns (<table-id>, None)
In case of sharded tables, returns (<table-prefix>, shard)
"""
match = re.match(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It'd be nice to simplify this logic as much as possible, and thus try to pass as few types of table identifiers as possible. Most of the time we pass fully qualified names or refs, so ideally I think we try to stop passing just table name most of the time? But am fine with this as Harshal is for now

shortened_table_name = self.table
# if table name ends in _* or * then we strip it as that represents a query on a sharded table
# if table name ends in _* or * or _yyyy* or _yyyymm* then we strip it as that represents a query on a sharded table
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like we support any series of numbers between the _ and * in the regex, so you could technically do something like _yy* or _yyyym* if you wanted. Not sure the best way to reflect this in a comment though

Comment on lines +411 to +429
@pytest.mark.parametrize(
"table_name, expected_table_prefix, expected_shard",
[
# Cases with Fully qualified name as input
("project.dataset.table", "project.dataset.table", None),
("project.dataset.table_20231215", "project.dataset.table", "20231215"),
("project.dataset.table_2023", "project.dataset.table_2023", None),
# incorrectly handled special case where dataset itself is a sharded table if full name is specified
("project.dataset.20231215", "project.dataset.20231215", None),
# Cases with Just the table name as input
("table", "table", None),
("table20231215", "table20231215", None),
("table_20231215", "table", "20231215"),
("table_1624046611000_name", "table_1624046611000_name", None),
("table_1624046611000", "table_1624046611000", None),
# Special case where dataset itself is a sharded table
("20231215", None, "20231215"),
],
)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you guys think about making these doctests instead? Seems like a standard use of them, as the tests are simple and help document the function

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the idea, but won't that require some framework level update to execute them ?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm didn't realize we didn't have any. Definitely a separate change then, I'll try looking into it

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep we don't have doctests yet, but we can enable them by adding --doctest-modules in pytest to the setup.cfg file

Definitely agree that they'd help in these cases

Copy link
Collaborator

@hsheth2 hsheth2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@asikowitz I'm good to merge this if you are

@asikowitz asikowitz merged commit 301c861 into datahub-project:master Mar 24, 2023
gmcgoldrick-r7 pushed a commit to rapid7/datahub that referenced this pull request Mar 27, 2023
yoonhyejin pushed a commit that referenced this pull request Apr 3, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ingestion PR or Issue related to the ingestion of metadata
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants