refactor(ingest/bigquery): add inline comments + refactor in table name parsing #7609
Conversation
@@ -90,7 +90,7 @@ class BigQueryV2Config(
     number_of_datasets_process_in_batch_if_profiling_enabled: int = Field(
-        default=80,
+        default=200,
     )
This has been tested independently and is the optimal batch value.
@@ -131,7 +131,7 @@ class BigQueryV2Config(
     extract_lineage_from_catalog: bool = Field(
         default=False,
-        description="This flag enables the data lineage extraction from Data Lineage API exposed by Google Data Catalog. NOTE: This extractor can't build views lineage. It's recommended to enable the view's DDL parsing. Read the docs to have more information about: https://cloud.google.com/data-catalog/docs/reference/data-lineage/rest",
+        description="This flag enables the data lineage extraction from Data Lineage API exposed by Google Data Catalog. NOTE: This extractor can't build views lineage. It's recommended to enable the view's DDL parsing. Read the docs to have more information about: https://cloud.google.com/data-catalog/docs/concepts/about-data-lineage",
The updated link gives more insight than the original link, which referenced the API directly.
shortened_table_name = re.sub(
    self._BIGQUERY_WILDCARD_REGEX, "", shortened_table_name
)

matches = BigQueryTableRef.SNAPSHOT_TABLE_REGEX.match(shortened_table_name)
This reordering is required for the test case below, which failed earlier:
assert (
BigqueryTableIdentifier(
"project", "dataset", "table@1624046611000"
).get_table_name()
== "project.dataset.table"
)
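The behavior the test exercises can be sketched with a small helper. Note this is a minimal illustration: the regex and helper name here are assumptions, not the actual `BigQueryTableRef.SNAPSHOT_TABLE_REGEX` implementation.

```python
import re

# Hypothetical stand-in for BigQueryTableRef.SNAPSHOT_TABLE_REGEX:
# a base table name followed by an "@<millis-timestamp>" snapshot decorator.
SNAPSHOT_TABLE_REGEX = re.compile(r"^(.+)@(\d{13})$")


def strip_snapshot_suffix(table_name: str) -> str:
    # If the name carries a snapshot decorator, keep only the base name.
    match = SNAPSHOT_TABLE_REGEX.match(table_name)
    return match.group(1) if match else table_name


assert strip_snapshot_suffix("table@1624046611000") == "table"
assert strip_snapshot_suffix("table") == "table"
```

With this stripping applied before shard detection, a name like "project.dataset.table@1624046611000" resolves to "project.dataset.table", as the assertion above expects.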
"""
Args:
    table_name:
        table name (in form <table-id> or <table-prefix>_<shard>, optionally prefixed with <project-id>.<dataset-id>)
why have we changed from get_raw_table() to .table elsewhere in the code, if this one supports the project / dataset prefix?
Due to this ->
If table_name is fully qualified, i.e. prefixed with <project-id>.<dataset-id> then the special case of
dataset as a sharded table, i.e. table-id itself as shard can not be detected.
get_table_and_shard("project.dataset.20230112") -> (project.dataset.20230112, None)
get_table_and_shard("20230112") -> (None, 20230112)
Alternatively - we can also fix the regex but it seemed safer to change input instead.
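The asymmetry described above can be sketched with a simplified pattern. This is an illustration only: the real default sharded-table regex differs and is overridable via the recipe config, so the pattern below is an assumption.

```python
import re

# Simplified stand-in for the default sharded-table pattern:
# an optional "<prefix>_" followed by an 8-digit shard at the end of the name.
SHARD_REGEX = re.compile(r"^(?:(.+)_)?(\d{8})$")


def get_table_and_shard(table_name):
    # Returns (table_prefix, shard). Only an *unqualified* name can be
    # detected as "table-id itself is the shard".
    match = SHARD_REGEX.match(table_name)
    if match:
        return match.group(1), match.group(2)
    return table_name, None


assert get_table_and_shard("20230112") == (None, "20230112")
# Fully qualified: the dot-separated date segment is not treated as a shard,
# because the pattern only recognizes an "_" separator, not a ".".
assert get_table_and_shard("project.dataset.20230112") == ("project.dataset.20230112", None)
```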
Is this a normal practice with sharding? Or have we just seen some customers put a bunch of tables with date names in a dataset and consider them "shards". For example, https://cloud.google.com/bigquery/docs/partitioned-tables#dt_partition_shard says:
Table sharding is the practice of storing data in multiple tables, using a naming prefix such as [PREFIX]_YYYYMMDD.
Can we be explicit about what cases we identify tables as sharding? It seems like right now, we support:
- [PREFIX]_YYYYMMDD
- [PREFIX]$YYYYMMDD (nvm, seems like $ is a forbidden character)
- YYYYMMDD
That's right. The above two cases are what we support with the default sharded pattern. The second case (just YYYYMMDD as the name) is more of a specific case adopted by very few, AFAIK.
The complexity is that the sharded pattern can be overridden using the recipe config sharded_table_pattern. It would be good to understand if this is really used/required.
In case of non-sharded tables, returns (<table-id>, None)
In case of sharded tables, returns (<table-prefix>, shard)
"""
match = re.match(
Note the change from re.search to re.match (anchoring the pattern to the start of the string and matching the shard at the end). This is done to differentiate
table_2023010920230109 (a non-sharded table) from
table_20230109 (a sharded table).
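The distinction can be illustrated with a simplified pattern (an assumption for illustration, not the actual configurable sharded_table_pattern):

```python
import re

name = "table_2023010920230109"  # non-sharded: 16-digit suffix, not a date shard

# An unanchored search can pull a false 8-digit "shard" out of the longer run:
loose = re.search(r"_(\d{8})", name)
assert loose is not None and loose.group(1) == "20230109"  # wrong classification

# Anchoring the shard to the end of the string (re.match plus a trailing $,
# or equivalently re.search with $) rejects it, while still accepting a
# genuine 8-digit shard:
assert re.match(r"(\w+)_(\d{8})$", name) is None
assert re.match(r"(\w+)_(\d{8})$", "table_20230109").groups() == ("table", "20230109")
```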
In that case, we should add $ to the regex to mark the end of the string but keep it as re.search. That way this method will work for fully qualified table names too.
I am having trouble updating the single regex to work for both qualified and non-qualified table names while supporting all the cases, given that the group numbering should stay the same (this regex can be overridden using the recipe config sharded_table_pattern). If this change does not look okay, I can revert it (and keep only the others) and we can fix this in follow-up PRs.
Ok that's fine then - we can do it in a follow up PR
The reason I'd like to do it is that some folks partition at the dataset level instead of the table level, so passing the full name into this method enables us to support them in the future
It'd be nice to simplify this logic as much as possible, and thus pass as few types of table identifiers as possible. Most of the time we pass fully qualified names or refs, so ideally we'd stop passing just the table name in most places. But I'm fine with this, as Harshal is, for now.
I agree that we will deal with fully qualified names more often. Let me know if I should revert this change and address in follow up PR or keep it.
The comments and cleanup are helpful. I think there's more cleanup to be done around table name parsing, but this is a nice start. Main comment is around converting the existing tests to doctests
@@ -84,7 +84,19 @@ class BigqueryTableIdentifier:
@treff7es do you know what the $ is doing in _BIGQUERY_DEFAULT_SHARDED_TABLE_REGEX? Seems like we'd support [PREFIX]$YYYYMMDD, but $ is in the invalid_chars set.
From what I understand, that $ in the middle is for matching non-sharded tables, but this is really complex and can be simplified.
shortened_table_name = self.table
- # if table name ends in _* or * then we strip it as that represents a query on a sharded table
+ # if table name ends in _* or * or _yyyy* or _yyyymm* then we strip it as that represents a query on a sharded table
Seems like we support any series of numbers between the _ and * in the regex, so you could technically do something like _yy* or _yyyym* if you wanted. Not sure the best way to reflect this in a comment though.
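A hypothetical version of the wildcard-stripping pattern the comment describes (the real _BIGQUERY_WILDCARD_REGEX may differ; this is a sketch of the behavior, not the source):

```python
import re

# Strip a trailing "*" optionally preceded by "_" and any run of digits,
# covering "_*", "*", "_yyyy*", "_yyyymm*" -- and, as the reviewer notes,
# also shorter runs like "_yy*" or longer ones like "_yyyym*".
WILDCARD_SUFFIX = re.compile(r"(_\d*)?\*$")


def strip_wildcard(table_name: str) -> str:
    return WILDCARD_SUFFIX.sub("", table_name)


assert strip_wildcard("events_*") == "events"
assert strip_wildcard("events*") == "events"
assert strip_wildcard("events_2023*") == "events"
assert strip_wildcard("events_20*") == "events"  # "any series of numbers" also matches
assert strip_wildcard("events") == "events"      # no wildcard: unchanged
```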
@pytest.mark.parametrize(
    "table_name, expected_table_prefix, expected_shard",
    [
        # Cases with fully qualified name as input
        ("project.dataset.table", "project.dataset.table", None),
        ("project.dataset.table_20231215", "project.dataset.table", "20231215"),
        ("project.dataset.table_2023", "project.dataset.table_2023", None),
        # Incorrectly handled special case where dataset itself is a sharded table if full name is specified
        ("project.dataset.20231215", "project.dataset.20231215", None),
        # Cases with just the table name as input
        ("table", "table", None),
        ("table20231215", "table20231215", None),
        ("table_20231215", "table", "20231215"),
        ("table_1624046611000_name", "table_1624046611000_name", None),
        ("table_1624046611000", "table_1624046611000", None),
        # Special case where dataset itself is a sharded table
        ("20231215", None, "20231215"),
    ],
)
What do you guys think about making these doctests instead? Seems like a standard use of them, as the tests are simple and help document the function
I like the idea, but won't that require some framework-level update to execute them?
Hmm didn't realize we didn't have any. Definitely a separate change then, I'll try looking into it
Yep, we don't have doctests yet, but we can enable them by adding --doctest-modules to pytest in the setup.cfg file.
Definitely agree that they'd help in these cases.
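A sketch of what the conversion might look like once doctests are enabled (the function body here is a simplified stand-in for illustration, not the actual implementation, and the regex is an assumption):

```python
import re


def get_table_and_shard(table_name):
    """Split an unqualified table name into (prefix, shard).

    >>> get_table_and_shard("table_20231215")
    ('table', '20231215')
    >>> get_table_and_shard("table")
    ('table', None)
    >>> get_table_and_shard("20231215")
    (None, '20231215')
    """
    match = re.match(r"^(?:(.+)_)?(\d{8})$", table_name)
    return (match.group(1), match.group(2)) if match else (table_name, None)


if __name__ == "__main__":
    import doctest

    doctest.testmod()  # runs the examples above; silent when all pass
```

With `addopts = --doctest-modules` under `[tool:pytest]` in setup.cfg, pytest collects and runs these examples alongside the regular test suite, so the docstring examples double as documentation and regression tests.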
@asikowitz I'm good to merge this if you are
Also added tests for widely used utility functions.