refactor(ingest/bigquery): add inline comments + refactor in table name parsing #7609

mayurinehate · 2023-03-16T12:52:37Z

Also added tests for widely used utility functions.

Checklist

The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
Links to related issues (if applicable)
Tests for the changes have been added/updated (if applicable)
Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub


mayurinehate · 2023-03-16T12:53:39Z

metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/bigquery_config.py

@@ -90,7 +90,7 @@ class BigQueryV2Config(
    )

    number_of_datasets_process_in_batch_if_profiling_enabled: int = Field(
-        default=80,
+        default=200,


This has been tested independently and is the optimal batch value.

mayurinehate · 2023-03-16T12:54:01Z

metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/bigquery_config.py

@@ -131,7 +131,7 @@ class BigQueryV2Config(

    extract_lineage_from_catalog: bool = Field(
        default=False,
-        description="This flag enables the data lineage extraction from Data Lineage API exposed by Google Data Catalog. NOTE: This extractor can't build views lineage. It's recommended to enable the view's DDL parsing. Read the docs to have more information about: https://cloud.google.com/data-catalog/docs/reference/data-lineage/rest",
+        description="This flag enables the data lineage extraction from Data Lineage API exposed by Google Data Catalog. NOTE: This extractor can't build views lineage. It's recommended to enable the view's DDL parsing. Read the docs to have more information about: https://cloud.google.com/data-catalog/docs/concepts/about-data-lineage",


The updated link gives more insight than the original link, which referenced the API directly.

mayurinehate · 2023-03-16T12:56:05Z

metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/bigquery_audit.py

        shortened_table_name = re.sub(
            self._BIGQUERY_WILDCARD_REGEX, "", shortened_table_name
        )

+        matches = BigQueryTableRef.SNAPSHOT_TABLE_REGEX.match(shortened_table_name)


This reordering is required for the test case below, which failed earlier:

assert ( BigqueryTableIdentifier( "project", "dataset", "table@1624046611000" ).get_table_name() == "project.dataset.table" )

metadata-ingestion/tests/unit/test_bigquery_source.py
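
A minimal sketch of the behaviour this test case checks, assuming the snapshot suffix is an `@` followed by a 13-digit epoch-millisecond timestamp. The pattern and the helper name `strip_snapshot_suffix` are illustrative assumptions, not the actual `SNAPSHOT_TABLE_REGEX` or `get_table_name()` implementation:

```python
import re

# Assumption: a snapshot decorator is "@" plus a 13-digit millisecond timestamp.
# This is a stand-in for BigQueryTableRef.SNAPSHOT_TABLE_REGEX, not a copy of it.
SNAPSHOT_SUFFIX = re.compile(r"^(.+)@(\d{13})$")


def strip_snapshot_suffix(table_name: str) -> str:
    """Return the table name without a trailing @<timestamp>, if one is present."""
    match = SNAPSHOT_SUFFIX.match(table_name)
    return match.group(1) if match else table_name


print(strip_snapshot_suffix("table@1624046611000"))  # table
print(strip_snapshot_suffix("table"))  # table
```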

hsheth2 · 2023-03-17T19:05:07Z

metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/bigquery_audit.py

+        """
+        Args:
+            table_name:
+                table name (in form <table-id> or <table-prefix>_<shard>`, optionally prefixed with <project-id>.<dataset-id>)


Why have we changed from get_raw_table() to .table elsewhere in the code, if this one supports the project/dataset prefix?

This is because of the following limitation:

If table_name is fully qualified, i.e. prefixed with <project-id>.<dataset-id>, then the special case of a dataset acting as a sharded table, where the table-id itself is the shard, cannot be detected.

get_table_and_shard("project.dataset.20230112") -> (project.dataset.20230112, None)

get_table_and_shard("20230112") -> (None, 20230112)

Alternatively, we could fix the regex, but it seemed safer to change the input instead.

Is this a normal practice with sharding, or have we just seen some customers put a bunch of tables with date names in a dataset and consider them "shards"? For example, https://cloud.google.com/bigquery/docs/partitioned-tables#dt_partition_shard says:

Table sharding is the practice of storing data in multiple tables, using a naming prefix such as [PREFIX]_YYYYMMDD.

Can we be explicit about which cases we identify as sharded tables? It seems like right now, we support:

[PREFIX]_YYYYMMDD

~~[PREFIX]$YYYYMMDD~~ nvm, seems like $ is a forbidden character

YYYYMMDD

That's right. The two cases above are what we support with the default sharded pattern. The second case (just YYYYMMDD as the name) is a more specific one that, AFAIK, very few adopt.

The complexity is that the sharded pattern can be overridden through the recipe config option sharded_table_pattern. It would be good to understand whether this is really used or required.
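
To make the limitation discussed in this thread concrete, here is a rough sketch of running shard detection on the table id only. The pattern is an assumed stand-in for the default sharded-table regex (a recipe may override it through sharded_table_pattern), the function body is simplified, and `split_qualified` is a hypothetical helper that is not part of the PR:

```python
import re
from typing import Optional, Tuple

# Assumed stand-in for the default sharded-table pattern.
SHARD_PATTERN = re.compile(r"((.+)[_$])?(\d{8})$")


def get_table_and_shard(table_name: str) -> Tuple[Optional[str], Optional[str]]:
    """Return (prefix, shard) for sharded names, else (table_name, None)."""
    match = SHARD_PATTERN.match(table_name)
    if match:
        return match.group(2), match.group(3)
    return table_name, None


def split_qualified(
    full_name: str,
) -> Tuple[str, str, Optional[str], Optional[str]]:
    """Hypothetical helper: run shard detection on the table id only."""
    project, dataset, table = full_name.split(".", 2)
    prefix, shard = get_table_and_shard(table)
    return project, dataset, prefix, shard


# Passing the qualified name hides the "table id is the shard" case ...
print(get_table_and_shard("project.dataset.20230112"))
# ... while passing only the table id detects it.
print(split_qualified("project.dataset.20230112"))
```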

mayurinehate · 2023-03-20T11:11:40Z

metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/bigquery_audit.py

+                In case of non-sharded tables, returns (<table-id>, None)
+                In case of sharded tables, returns (<table-prefix>, shard)
+        """
+        match = re.match(


Note the change from re.search to re.match (match the end of string with _). This is done to differentiate table_2023010920230109 (a non-sharded table) from table_20230109 (a sharded table).

In that case, we should add $ to the regex to mark the end of the string but keep it as re.search

That way this method will work for fully qualified table names too

I am having trouble updating the single regex so that it works for both qualified and non-qualified table names and supports all the cases, especially since the group numbering must stay the same because this regex can be overridden through the recipe config option sharded_table_pattern. If this change does not look okay, I can revert it (and keep only the others), and we can fix this in follow-up PRs.

Ok that's fine then - we can do it in a follow up PR

The reason I'd like to do it is that some folks partition at the dataset level instead of the table level, so passing the full name into this method enables us to support them in the future

It'd be nice to simplify this logic as much as possible, and thus try to pass as few types of table identifiers as possible. Most of the time we pass fully qualified names or refs, so ideally I think we try to stop passing just the table name most of the time? But I'm fine with this for now, as Harshal is.

I agree that we will deal with fully qualified names more often. Let me know if I should revert this change and address in follow up PR or keep it.
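
The difference between re.search and re.match that this thread turns on can be reproduced with a small sketch. The pattern is an assumed stand-in for the default sharded-table regex, not necessarily the one in the PR:

```python
import re

# Assumed stand-in for the default sharded-table pattern.
SHARD_PATTERN = re.compile(r"((.+)[_$])?(\d{8})$")

for name in ("table_20230109", "table_2023010920230109"):
    searched = SHARD_PATTERN.search(name)  # may start matching mid-string
    matched = SHARD_PATTERN.match(name)  # must start matching at index 0
    print(
        name,
        "search shard:", searched.group(3) if searched else None,
        "match shard:", matched.group(3) if matched else None,
    )
```

Under this assumed pattern, re.search picks up the trailing eight digits of table_2023010920230109 as a shard, while re.match requires the whole name to fit the pattern and so treats it as non-sharded.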

asikowitz

The comments and cleanup are helpful. I think there's more cleanup to be done around table name parsing, but this is a nice start. My main comment is about converting the existing tests to doctests.

metadata-ingestion/tests/unit/test_bigquery_source.py

asikowitz · 2023-03-22T16:01:54Z

metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/bigquery_audit.py

+        """
+        Args:
+            table_name:
+                table name (in form <table-id> or <table-prefix>_<shard>`, optionally prefixed with <project-id>.<dataset-id>)


Is this a normal practice with sharding, or have we just seen some customers put a bunch of tables with date names in a dataset and consider them "shards"? For example, https://cloud.google.com/bigquery/docs/partitioned-tables#dt_partition_shard says:

Table sharding is the practice of storing data in multiple tables, using a naming prefix such as [PREFIX]_YYYYMMDD.

Can we be explicit about which cases we identify as sharded tables? It seems like right now, we support:

[PREFIX]_YYYYMMDD

~~[PREFIX]$YYYYMMDD~~ nvm, seems like $ is a forbidden character

YYYYMMDD

asikowitz · 2023-03-22T16:24:53Z

metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/bigquery_audit.py

@@ -84,7 +84,19 @@ class BigqueryTableIdentifier:



@treff7es do you know what the $ is doing in _BIGQUERY_DEFAULT_SHARDED_TABLE_REGEX? Seems like we'd support [PREFIX]$YYYYMMDD but $ is in the invalid_chars set

From what I understand, the $ in the middle is for matching non-sharded tables. But this is really complex and can be simplified.

asikowitz · 2023-03-22T16:27:18Z

metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/bigquery_audit.py

+                In case of non-sharded tables, returns (<table-id>, None)
+                In case of sharded tables, returns (<table-prefix>, shard)
+        """
+        match = re.match(


It'd be nice to simplify this logic as much as possible, and thus try to pass as few types of table identifiers as possible. Most of the time we pass fully qualified names or refs, so ideally I think we try to stop passing just the table name most of the time? But I'm fine with this for now, as Harshal is.

asikowitz · 2023-03-22T16:31:14Z

metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/bigquery_audit.py

        shortened_table_name = self.table
-        # if table name ends in _* or * then we strip it as that represents a query on a sharded table
+        # if table name ends in _* or * or _yyyy* or _yyyymm* then we strip it as that represents a query on a sharded table


Seems like we support any series of numbers between the _ and * in the regex, so you could technically do something like _yy* or _yyyym* if you wanted. Not sure the best way to reflect this in a comment though
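
As a rough illustration of the point above, a pattern that allows any run of digits between `_` and `*` strips all of these suffixes. The regex below is an assumption for illustration and may differ from `_BIGQUERY_WILDCARD_REGEX`; `strip_wildcard` is a hypothetical helper:

```python
import re

# Assumed pattern: "*" at the end, optionally preceded by "_" and any digits.
WILDCARD_SUFFIX = re.compile(r"((_(\d+)?)\*$)|\*$")


def strip_wildcard(table_name: str) -> str:
    """Drop a trailing wildcard suffix such as *, _*, _2023* or _202301*."""
    return WILDCARD_SUFFIX.sub("", table_name)


for name in ("events*", "events_*", "events_20*", "events_2023*", "events_202301*"):
    print(name, "->", strip_wildcard(name))
```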

asikowitz · 2023-03-22T16:34:19Z

metadata-ingestion/tests/unit/test_bigquery_source.py

+@pytest.mark.parametrize(
+    "table_name, expected_table_prefix, expected_shard",
+    [
+        # Cases with Fully qualified name as input
+        ("project.dataset.table", "project.dataset.table", None),
+        ("project.dataset.table_20231215", "project.dataset.table", "20231215"),
+        ("project.dataset.table_2023", "project.dataset.table_2023", None),
+        # incorrectly handled special case where dataset itself is a sharded table if full name is specified
+        ("project.dataset.20231215", "project.dataset.20231215", None),
+        # Cases with Just the table name as input
+        ("table", "table", None),
+        ("table20231215", "table20231215", None),
+        ("table_20231215", "table", "20231215"),
+        ("table_1624046611000_name", "table_1624046611000_name", None),
+        ("table_1624046611000", "table_1624046611000", None),
+        # Special case where dataset itself is a sharded table
+        ("20231215", None, "20231215"),
+    ],
+)


What do you guys think about making these doctests instead? Seems like a standard use of them, as the tests are simple and help document the function

I like the idea, but won't that require some framework-level update to execute them?

Hmm didn't realize we didn't have any. Definitely a separate change then, I'll try looking into it

Yep, we don't have doctests yet, but we can enable them by adding --doctest-modules to the pytest options in the setup.cfg file.

Definitely agree that they'd help in these cases
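
For reference, a doctest version of a few of the parametrized cases might look like the sketch below. The regex is an assumed stand-in for the default sharded-table pattern and the function body is simplified; with `--doctest-modules` enabled, pytest would collect the examples in the docstring:

```python
import re
from typing import Optional, Tuple

# Assumed stand-in for the default sharded-table pattern.
_SHARD_PATTERN = re.compile(r"((.+)[_$])?(\d{8})$")


def get_table_and_shard(table_name: str) -> Tuple[Optional[str], Optional[str]]:
    """Split a table name into (prefix, shard).

    >>> get_table_and_shard("table")
    ('table', None)
    >>> get_table_and_shard("table_20231215")
    ('table', '20231215')
    >>> get_table_and_shard("table_1624046611000")
    ('table_1624046611000', None)
    >>> get_table_and_shard("20231215")
    (None, '20231215')
    """
    match = _SHARD_PATTERN.match(table_name)
    if match:
        return match.group(2), match.group(3)
    return table_name, None
```

The examples double as documentation of the supported cases, which is the benefit raised in this thread.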

hsheth2

@asikowitz I'm good to merge this if you are


refractor(ingest/bigquery): added inline comments, minor reorder in table name parsing

73003da

mayurinehate commented Mar 16, 2023


mayurinehate changed the title ~~refractor(ingest/bigquery): added inline comments, minor reorder in t…~~ refractor(ingest/bigquery): added inline comments, minor refractor in table name parsing Mar 16, 2023

mayurinehate requested review from treff7es, hsheth2 and asikowitz March 16, 2023 12:57

github-actions bot added the ingestion label (PR or Issue related to the ingestion of metadata) Mar 16, 2023

hsheth2 changed the title ~~refractor(ingest/bigquery): added inline comments, minor refractor in table name parsing~~ refactor(ingest/bigquery): added inline comments, minor refractor in table name parsing Mar 16, 2023

hsheth2 changed the title ~~refactor(ingest/bigquery): added inline comments, minor refractor in table name parsing~~ refactor(ingest/bigquery): add inline comments + refactor in table name parsing Mar 16, 2023

mayurinehate added 2 commits March 17, 2023 14:09

update shard regex matching, update tests

3a40d5c

patch correctly

e259285

hsheth2 reviewed Mar 17, 2023


add comments, reorganize test cases

795aa27

mayurinehate force-pushed the add_bigquery_comments branch from 39f02cf to 795aa27 Compare March 20, 2023 11:05

mayurinehate commented Mar 20, 2023


asikowitz approved these changes Mar 22, 2023


Merge branch 'master' into add_bigquery_comments

a291840

vercel bot deployed to Preview March 23, 2023 07:29 View deployment

hsheth2 approved these changes Mar 24, 2023


asikowitz merged commit 301c861 into datahub-project:master Mar 24, 2023

gmcgoldrick-r7 pushed a commit to rapid7/datahub that referenced this pull request Mar 27, 2023

refactor(ingest/bigquery): add inline comments + refactor in table name parsing (datahub-project#7609)

d3b2830

yoonhyejin pushed a commit that referenced this pull request Apr 3, 2023

refactor(ingest/bigquery): add inline comments + refactor in table name parsing (#7609)

d035d48
