
Using information schema #364

Closed · wants to merge 12 commits

Conversation

@dbeatty10 (Contributor) commented Oct 26, 2022

resolves #113

branched from #238

Problem

As described here, the current implementation uses the project.dataset.__TABLES__ metadata table, which requires elevated permissions compared to the information schema tables. Service accounts provisioned for dbt usage frequently do not have access to this table, which breaks dbt docs generate for those accounts.

Proposed solution

Use project.dataset.INFORMATION_SCHEMA.TABLES instead.
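
For concreteness, here is a minimal sketch contrasting the two approaches (my_project.my_dataset is an illustrative placeholder, not taken from the PR):

-- current approach: hidden metadata table; exposes row_count and
-- size_bytes directly, but requires elevated permissions
SELECT table_id, row_count, size_bytes
FROM `my_project.my_dataset.__TABLES__`

-- proposed approach: standard INFORMATION_SCHEMA access;
-- no row_count or size_bytes columns
SELECT table_catalog, table_schema, table_name, table_type
FROM `my_project.my_dataset.INFORMATION_SCHEMA.TABLES`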

Trade-offs and outstanding questions

The problem is that INFORMATION_SCHEMA.TABLES doesn't contain row_count or size_bytes, so it is not a perfect drop-in replacement. This could possibly be overcome by using the INFORMATION_SCHEMA.TABLE_STORAGE view to get the number of rows plus the size in logical bytes, but that has its own complications (see below).
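
For illustration, a sketch of what such a TABLE_STORAGE query could look like; the project, dataset, and region-us qualifier are assumptions, and the relevant region would somehow need to be inferred or configured:

-- region qualifier is assumed here; TABLE_STORAGE must be queried per region
SELECT table_schema, table_name, total_rows, total_logical_bytes
FROM `my_project`.`region-us`.INFORMATION_SCHEMA.TABLE_STORAGE
WHERE table_schema = 'my_dataset'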

  1. Is it okay to go from reporting the actual row count and size in bytes to reporting 0 for both?
    • Is there a way to make this work without an environment variable? e.g., is there a way for the adapter to know the relevant region(s)?
  2. Are there any region-specific effects that could cause a table/view/external table to not be reflected?
  3. Is this method slower, faster, or the same?

Alternatives

  • Could we try the original method using project.dataset.__TABLES__ and fall back to the new method in case of failure?
  • Could we query the list of unique location values within INFORMATION_SCHEMA.SCHEMATA and then iterate through them to generate all the unique region-REGION.INFORMATION_SCHEMA.TABLE_STORAGE queries (sketched below)? Then we'd have the row counts and sizes in bytes.
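
A rough sketch of that second alternative, with my_project as a placeholder (whether SCHEMATA itself behaves uniformly across regions is part of the open question):

-- step 1: find the distinct dataset locations in the project
SELECT DISTINCT location
FROM `my_project`.INFORMATION_SCHEMA.SCHEMATA

-- step 2: for each location, e.g. 'US', query the corresponding
-- `region-us`.INFORMATION_SCHEMA.TABLE_STORAGE view and union the results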

Checklist

@cla-bot cla-bot bot added the cla:yes label Oct 26, 2022
@Fleid (Contributor) commented Feb 1, 2023

Hey @dbeatty10, is this ready_for_review? Just checking :)

@Fleid (Contributor) commented Feb 1, 2023

Or do you want me to be the judge of that :D

@dbeatty10 (Contributor, Author)

@Fleid I think that @colin-rogers-dbt might have taken a look at this after me, but I'm not sure if he was able to make a breakthrough or if he ran into the same walls.

Trade-offs

If I recall correctly, there are some trade-offs if this PR is merged as-is.

Here are the key considerations:

  • Pro: dbt docs generate works without elevated permissions
  • Con: the row count (row_count) will be degraded to a constant 0
  • Con: the size in bytes (size_bytes) will be degraded to a constant 0

The degradations above would affect backwards compatibility for folks who do have elevated permissions today, since dbt docs generate is already working fine for them right now.

Potential paths forward

It sounds like General Mills has a way to get the row_count and size_bytes, but it requires knowing the relevant {{ region }}.INFORMATION_SCHEMA.TABLE_STORAGE to query from.

If we can somehow infer the relevant region(s) to template within the bigquery__get_catalog query, then we can overcome both of the cons listed above.

Two different ideas we could try:

  1. Could we try the original/current method using project.dataset.__TABLES__ and fall back to the new method in case of failure?
  2. Could we query the list of unique location values within INFORMATION_SCHEMA.SCHEMATA and then iterate through them to generate all the unique region-REGION.INFORMATION_SCHEMA.TABLE_STORAGE queries (see the sketch below)? Then we'd have the row counts and sizes in bytes.
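
To make idea 2 concrete, here is a rough sketch of how the iteration could be templated in a dbt macro. run_query is a real dbt macro, but the variable names and the location-to-region mapping are illustrative assumptions only:

{% set locations_sql %}
    select distinct location
    from `{{ database }}`.INFORMATION_SCHEMA.SCHEMATA
{% endset %}
{% set locations = run_query(locations_sql).columns[0].values() %}

{% for location in locations %}
    {# e.g. location 'US' -> `region-us`.INFORMATION_SCHEMA.TABLE_STORAGE #}
    select table_schema, table_name, total_rows, total_logical_bytes
    from `{{ database }}`.`region-{{ location | lower }}`.INFORMATION_SCHEMA.TABLE_STORAGE
    {% if not loop.last %}union all{% endif %}
{% endfor %}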

@Fleid (Contributor) commented Feb 6, 2023

Hey @dbeatty10, it looks like you're using the better regex pattern to identify shards, as discussed here.
I'm guessing that's on purpose? Would this PR solve #260?

@dbeatty10 (Contributor, Author)

Hey @dbeatty10, it looks like you're using the better regex pattern to identify shards, as discussed here. I'm guessing that's on purpose? Would this PR solve #260?

That's awesome, @Fleid! Credit goes to @hassan-mention-me for the regex (and basically the rest of the implementation seen in this PR). I branched from his implementation in #238, and if this PR is merged, @hassan-mention-me is listed in the changelog entry and his commits are preserved as well.

Giving #260 a re-read, it does look like the regex in this PR would solve it. However, I'd prefer to see explicit test cases added to confirm that the regex is working properly.

Here are some test cases listed here that should not be considered shards:

  • STD_MOBILITY_INDEXED_20220519163648
  • foo20220808
  • foo_bar20220808
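
One way to spot-check them, using the simple ^.+[0-9]{8}$ pattern that appears later in this thread as a stand-in (the PR's actual pattern may differ; the last name is an added example of a genuine date shard):

SELECT name, regexp_contains(name, r'^.+[0-9]{8}$') AS flagged_as_shard
FROM UNNEST([
  'STD_MOBILITY_INDEXED_20220519163648',
  'foo20220808',
  'foo_bar20220808',
  'events_20220808'  -- hypothetical genuine date shard, for contrast
]) AS name

With this simple pattern, all four names are flagged as shards, which is exactly why explicit test cases are worth adding: the first three should come back false.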

@Fleid (Contributor) commented Feb 27, 2023

Hey @dbeatty10, do you think we can make that one not a draft, and flag it ready_for_review?

@dbeatty10 (Contributor, Author)

@Fleid there are two things that make me uncomfortable with marking this as "ready for review", both of which I would consider a breaking change:

  • the row count (row_count) will be degraded to a constant of 0
  • the size in bytes (size_bytes) will be degraded to a constant of 0

My opinion: we should ensure that it's non-breaking first.

Here are two ideas to make it non-breaking, neither of which I have tried:

  1. Try project.dataset.__TABLES__ and fall back to INFORMATION_SCHEMA.TABLES
    • Try the original/current method using project.dataset.__TABLES__ and fall back to the new method in case of failure
  2. Use INFORMATION_SCHEMA.SCHEMATA

Advantage of option 1:

  • Anyone who has sufficient permissions and had non-zero row counts (and sizes in bytes) before would still have non-zero values after.
  • People without sufficient permissions would now have 0 for row counts (and sizes in bytes), but at least they would have everything else.

Advantage of option 2:

  • We'd have the row counts and sizes in bytes 100% of the time for all users.

@Fleid (Contributor) commented Feb 27, 2023

I hear you.

I'm thinking that ready_for_review can also be about "we've tried as much as we can, but for whatever reason couldn't push it over the finish line, so let's get the team in to do it".

Do you want another stab at this, or are you good with passing the baton?

@dbeatty10 dbeatty10 marked this pull request as ready for review February 28, 2023 00:23
@dbeatty10 dbeatty10 requested a review from a team as a code owner February 28, 2023 00:23
@dbeatty10 (Contributor, Author)

I'm marking this as ready_for_review to indicate that:

we've tried as much as we can, but for whatever reason couldn't push it over the finish line, so let's get the team in to do it

I'm passing the baton to whoever reviews this! Here's the best TL;DR of the proposed ways to resolve this: #364 (comment)

@dbeatty10 dbeatty10 added the ready_for_review Externally contributed PR has functional approval, ready for code review from Core engineering label Feb 28, 2023
@Fleid (Contributor) commented Mar 7, 2023

Now tracking at #585

@drnielsen

If the biggest concern is losing the row count and size, wouldn't you be able to solve that by using the INFORMATION_SCHEMA.PARTITIONS table?

Something like the below:

SELECT
  table_catalog as table_database,
  table_schema as table_schema,
  table_name as original_table_name,
  concat(table_catalog, '.', table_schema, '.', table_name) as relation_id,
  row_count as row_count,
  size_bytes as size_bytes,
  case when table_type = 'EXTERNAL' then 'external' else 'table' end as table_type,
  regexp_contains(table_name, '^.+[0-9]{8}$') and table_type = 'BASE TABLE' as is_date_shard,
  regexp_extract(table_name, '^(.+)[0-9]{8}$') as shard_base_name,
  regexp_extract(table_name, '^.+([0-9]{8})$') as shard_name
FROM (
  SELECT
    table_catalog,
    table_schema,
    table_name,
    table_type,
    sum(total_rows) as row_count,
    sum(total_logical_bytes) as size_bytes
  FROM dataset_name.INFORMATION_SCHEMA.TABLES
  LEFT OUTER JOIN dataset_name.INFORMATION_SCHEMA.PARTITIONS
  USING (table_catalog, table_schema, table_name)
  GROUP BY table_catalog, table_schema, table_name, table_type
)

@sambloom92

I'm new to this issue, but my naive impression is that this looks like a good suggestion. Are there any issues with doing it this way?

If the biggest concern is losing the row count and size, wouldn't you be able to solve that by using the INFORMATION_SCHEMA.PARTITIONS table? [...]

@dbeatty10 (Contributor, Author)

Closing in favor of #1213

@dbeatty10 dbeatty10 closed this Jun 6, 2024
@mikealfare mikealfare deleted the dbeatty/using-information-schema branch July 17, 2024 23:51