
Using information schema #364

Closed · wants to merge 12 commits

Conversation

@dbeatty10 (Contributor) commented Oct 26, 2022

resolves #113

branched from #238

Problem

As described here, the current implementation uses the project.dataset.__TABLES__ metadata table, which requires elevated permissions compared to the information schema tables. Service accounts provisioned for dbt usage frequently do not have access to this table, which breaks dbt docs generate for those accounts.

Proposed solution

Use project.dataset.INFORMATION_SCHEMA.TABLES instead.
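
For concreteness, here is a minimal sketch contrasting the two approaches (my_project.my_dataset is an illustrative placeholder, not taken from the PR):

-- current approach: hidden metadata table; exposes row_count and
-- size_bytes directly, but requires elevated permissions
SELECT table_id, row_count, size_bytes
FROM `my_project.my_dataset.__TABLES__`

-- proposed approach: standard INFORMATION_SCHEMA access;
-- no row_count or size_bytes columns
SELECT table_catalog, table_schema, table_name, table_type
FROM `my_project.my_dataset.INFORMATION_SCHEMA.TABLES`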

Trade-offs and outstanding questions

The problem is that INFORMATION_SCHEMA.TABLES doesn't contain row_count or size_bytes, so it is not a perfect drop-in replacement. This could possibly be overcome by using the INFORMATION_SCHEMA.TABLE_STORAGE view to get the number of rows plus the size in logical bytes, but that has its own complications (see below).
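
For illustration, a sketch of what such a TABLE_STORAGE query could look like; the project, dataset, and region-us qualifier are assumptions, and the relevant region would somehow need to be inferred or configured:

-- region qualifier is assumed here; TABLE_STORAGE must be queried per region
SELECT table_schema, table_name, total_rows, total_logical_bytes
FROM `my_project`.`region-us`.INFORMATION_SCHEMA.TABLE_STORAGE
WHERE table_schema = 'my_dataset'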

  1. Is it okay to go from reporting the actual row count and size in bytes to reporting 0 for both?
    • Is there a way to make this work without an environment variable? e.g., is there a way for the adapter to know the relevant region(s)?
  2. Are there any region-specific effects that could cause a table/view/external table to not be reflected?
  3. Is this method slower, faster, or the same?

Alternatives

  • Could we try the original method using project.dataset.__TABLES__ and fall back to the new method in case of failure?
  • Could we query the list of unique location values within INFORMATION_SCHEMA.SCHEMATA and then iterate through them to generate all the unique region-REGION.INFORMATION_SCHEMA.TABLE_STORAGE queries (sketched below)? Then we'd have the row counts and sizes in bytes.
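
A rough sketch of that second alternative, with my_project as a placeholder (whether SCHEMATA itself behaves uniformly across regions is part of the open question):

-- step 1: find the distinct dataset locations in the project
SELECT DISTINCT location
FROM `my_project`.INFORMATION_SCHEMA.SCHEMATA

-- step 2: for each location, e.g. 'US', query the corresponding
-- `region-us`.INFORMATION_SCHEMA.TABLE_STORAGE view and union the results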

Checklist

@cla-bot cla-bot bot added the cla:yes label Oct 26, 2022
@Fleid (Contributor) commented Feb 1, 2023

Hey @dbeatty10, is this ready_for_review? Just checking :)

@Fleid (Contributor) commented Feb 1, 2023

Or do you want me to be the judge of that :D

@dbeatty10 (Contributor, Author)

@Fleid I think that @colin-rogers-dbt might have taken a look at this after me, but I'm not sure if he was able to make a breakthrough or if he ran into the same walls.

Trade-offs

If I recall correctly, there are some trade-offs if this PR is merged as-is.

Here are the key considerations:

  • Pro: dbt docs generate works without elevated permissions
  • Con: the row count (row_count) will be degraded to a constant 0
  • Con: the size in bytes (size_bytes) will be degraded to a constant 0

The degradations above would affect backwards compatibility for folks who do have elevated permissions today, since dbt docs generate is already working fine for them right now.

Potential paths forward

It sounds like General Mills has a way to get the row_count and size_bytes, but it requires knowing the relevant {{ region }}.INFORMATION_SCHEMA.TABLE_STORAGE to query from.

If we can somehow infer the relevant region(s) to template within the bigquery__get_catalog query, then we can overcome both of the cons listed above.

Two different ideas we could try:

  1. Could we try the original/current method using project.dataset.__TABLES__ and fall back to the new method in case of failure?
  2. Could we query the list of unique location values within INFORMATION_SCHEMA.SCHEMATA and then iterate through them to generate all the unique region-REGION.INFORMATION_SCHEMA.TABLE_STORAGE queries (see the sketch below)? Then we'd have the row counts and sizes in bytes.
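
To make idea 2 concrete, here is a rough sketch of how the iteration could be templated in a dbt macro. run_query is a real dbt macro, but the variable names and the location-to-region mapping are illustrative assumptions only:

{% set locations_sql %}
    select distinct location
    from `{{ database }}`.INFORMATION_SCHEMA.SCHEMATA
{% endset %}
{% set locations = run_query(locations_sql).columns[0].values() %}

{% for location in locations %}
    {# e.g. location 'US' -> `region-us`.INFORMATION_SCHEMA.TABLE_STORAGE #}
    select table_schema, table_name, total_rows, total_logical_bytes
    from `{{ database }}`.`region-{{ location | lower }}`.INFORMATION_SCHEMA.TABLE_STORAGE
    {% if not loop.last %}union all{% endif %}
{% endfor %}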

@Fleid (Contributor) commented Feb 6, 2023

Hey @dbeatty10, it looks like you're using the better regex pattern to identify shards, as discussed here.
I'm guessing that's on purpose? Would this PR solve #260?

@dbeatty10 (Contributor, Author)

Hey @dbeatty10, it looks like you're using the better regex pattern to identify shards, as discussed here. I'm guessing that's on purpose? Would this PR solve #260?

That's awesome, @Fleid! Credit goes to @hassan-mention-me for the regex (and basically the rest of the implementation seen in this PR). I branched from his implementation in #238, and if this PR is merged, @hassan-mention-me is listed in the changelog entry and his commits are preserved as well.

Giving #260 a re-read, it does look like the regex in this PR would solve it. However, I'd prefer to see explicit test cases added to confirm that the regex is working properly.

Here are some test cases listed here that should not be considered shards:

  • STD_MOBILITY_INDEXED_20220519163648
  • foo20220808
  • foo_bar20220808
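
One way to spot-check them, using the simple ^.+[0-9]{8}$ pattern that appears later in this thread as a stand-in (the PR's actual pattern may differ; the last name is an added example of a genuine date shard):

SELECT name, regexp_contains(name, r'^.+[0-9]{8}$') AS flagged_as_shard
FROM UNNEST([
  'STD_MOBILITY_INDEXED_20220519163648',
  'foo20220808',
  'foo_bar20220808',
  'events_20220808'  -- hypothetical genuine date shard, for contrast
]) AS name

With this simple pattern, all four names are flagged as shards, which is exactly why explicit test cases are worth adding: the first three should come back false.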

@Fleid (Contributor) commented Feb 27, 2023

Hey @dbeatty10, do you think we can make that one not a draft, and flag it ready_for_review?

@dbeatty10 (Contributor, Author)

@Fleid there are two things that make me uncomfortable with marking this as "ready for review", both of which I would consider a breaking change:

  • the row count (row_count) will be degraded to a constant of 0
  • the size in bytes (size_bytes) will be degraded to a constant of 0

My opinion: we should ensure that it's non-breaking first.

Here are two ideas to make it non-breaking, neither of which I have tried:

  1. Try project.dataset.__TABLES__ and fall back to INFORMATION_SCHEMA.TABLES
    • Try the original/current method using project.dataset.__TABLES__ and fall back to the new method in case of failure
  2. Use INFORMATION_SCHEMA.SCHEMATA

Advantage of option 1:

  • Anyone who has sufficient permissions and had non-zero row counts (and sizes in bytes) before would still have non-zero values after.
  • People without sufficient permissions would now have 0 for row counts (and sizes in bytes), but at least they would have everything else.

Advantage of option 2:

  • We'd have the row counts and sizes in bytes 100% of the time for all users.

@Fleid (Contributor) commented Feb 27, 2023

I hear you.

I'm thinking that ready_for_review can also be about "we've tried as much as we can, but for whatever reason couldn't push it over the finish line, so let's get the team in to do it".

Do you want another stab at this, or are you good with passing the baton?

@dbeatty10 dbeatty10 marked this pull request as ready for review February 28, 2023 00:23
@dbeatty10 dbeatty10 requested a review from a team as a code owner February 28, 2023 00:23
@dbeatty10 (Contributor, Author)

I'm marking this as ready_for_review to indicate that:

we've tried as much as we can, but for whatever reason couldn't push it over the finish line, so let's get the team in to do it

I'm passing the baton to whoever reviews this! Here's the best TL;DR of the proposed ways to resolve this: #364 (comment)

@dbeatty10 dbeatty10 added the ready_for_review Externally contributed PR has functional approval, ready for code review from Core engineering label Feb 28, 2023
@Fleid (Contributor) commented Mar 7, 2023

Now tracking at #585

@drnielsen

If the biggest concern is losing the row count and size, wouldn't you be able to solve that by using the INFORMATION_SCHEMA.PARTITIONS table?

Something like the below:

SELECT
  table_catalog as table_database,
  table_schema as table_schema,
  table_name as original_table_name,
  concat(table_catalog, '.', table_schema, '.', table_name) as relation_id,
  row_count as row_count,
  size_bytes as size_bytes,
  case when table_type = 'EXTERNAL' then 'external' else 'table' end as table_type,
  regexp_contains(table_name, '^.+[0-9]{8}$') and table_type = 'BASE TABLE' as is_date_shard,
  regexp_extract(table_name, '^(.+)[0-9]{8}$') as shard_base_name,
  regexp_extract(table_name, '^.+([0-9]{8})$') as shard_name
FROM (
  SELECT
    table_catalog,
    table_schema,
    table_name,
    table_type,
    sum(total_rows) as row_count,
    sum(total_logical_bytes) as size_bytes
  FROM dataset_name.INFORMATION_SCHEMA.TABLES
  LEFT OUTER JOIN dataset_name.INFORMATION_SCHEMA.PARTITIONS
  USING (table_catalog, table_schema, table_name)
  GROUP BY table_catalog, table_schema, table_name, table_type
)

@sambloom92

I'm new to this issue, but my naive impression is that this looks like a good suggestion. Are there any issues with doing it this way?

If the biggest concern is losing the row count and size, wouldn't you be able to solve that by using the INFORMATION_SCHEMA.PARTITIONS table? [...]

@dbeatty10 (Contributor, Author)

Closing in favor of #1213

@dbeatty10 dbeatty10 closed this Jun 6, 2024
@mikealfare mikealfare deleted the dbeatty/using-information-schema branch July 17, 2024 23:51