Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix Issue #3: Add Support for Identifying Identical Table Names Across Different Data Sets #4

Open
wants to merge 7 commits into
base: main
Choose a base branch
from

Conversation

KazuSh1geru
Copy link

@KazuSh1geru KazuSh1geru commented Apr 24, 2024

What is this PR ?

This PR solve #3

Overview

  • Expanded example.js to accept schema.
  • Expanded JavaScript in includes/assertions to accept schema.

Information

Context Method: ref

ref method use individual arguments for the "schema", and "name" values.

References another action, adding it as a dependency to this action, returning valid SQL to be used in a from expression.
This function can be called with a Resolvable object, for example: ${ref({ name: "name", schema: "schema", database: "database" })}

This function can also be called using individual arguments for the "database", "schema", and "name" values. When only two values are provided, the default database is used and the values are interpreted as "schema" and "name". When only one value is provided, the default database and schema are used, with the provided value interpreted as "name". ${ref("database", "schema", "name")} ${ref("schema", "name")} ${ref("name")}

document

@KazuSh1geru KazuSh1geru changed the title Feature/expansion schema Fix Issue #3: Add Support for Identifying Identical Table Names Across Different Data Sets Apr 24, 2024
},
"second_table": {
"id_in_accepted_values": "id IN (1, 2, 3)"
// Format: "schema": { "table": { "conditionName": "conditionQuery", ... }, ... }
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Expected Format is here.

Comment on lines +66 to +67
"childSchema": "dataform",
"childTable": "second_table",
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add childSchema

.tags("assert-data-completeness")
.query(ctx => `SELECT COUNT(*) AS total_rows,
SUM(CASE WHEN ${columnName} IS NULL THEN 1 ELSE 0 END) AS null_count
FROM ${ctx.ref(tableName)}
FROM ${ctx.ref(schemaName, tableName)}
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reference schemaName.tableName

Comment on lines +47 to +52
for (let schemaName in dataCompletenessConditions) {
const tableNames = dataCompletenessConditions[schemaName];
for (let tableName in tableNames) {
const columnConditions = tableNames[tableName];
createDataCompletenessAssertion(globalParams, schemaName, tableName, columnConditions);
}
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fix nest structure.

.tags("assert-data-freshness")
.query(ctx => `
WITH
freshness AS (
SELECT
DATE_DIFF(CURRENT_DATE(), MAX(${dateColumn}), ${timeUnit}) AS delay
FROM
${ctx.ref(tableName)}
${ctx.ref(schemaName, tableName)}
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reference schemaName.tableName

.database(globalParams.database)
.schema(globalParams.schema)
.description(`Check referential integrity for ${childTable}.${childKey} referencing ${parentTable}.${parentKey}`)
.tags("assert-referential-integrity")
.query(ctx => `
SELECT pt.${parentKey}
FROM ${ctx.ref(parentTable)} AS pt
LEFT JOIN ${ctx.ref(childTable)} AS t ON t.${childKey} = pt.${parentKey}
FROM ${ctx.ref(parentSchema, parentTable)} AS pt
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reference parentSchema.parentTable

FROM ${ctx.ref(parentTable)} AS pt
LEFT JOIN ${ctx.ref(childTable)} AS t ON t.${childKey} = pt.${parentKey}
FROM ${ctx.ref(parentSchema, parentTable)} AS pt
LEFT JOIN ${ctx.ref(childSchema, childTable)} AS t ON t.${childKey} = pt.${parentKey}
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reference childSchema.childTable

for (let tableName in dataCompletenessConditions) {
const columnConditions = dataCompletenessConditions[tableName];
createDataCompletenessAssertion(globalParams, tableName, columnConditions);
for (let schemaName in dataCompletenessConditions) {
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • get schemaName
  • get tableNames

Copy link
Author

@KazuSh1geru KazuSh1geru left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hrialan
Please review this PR!

@hrialan
Copy link
Collaborator

hrialan commented Apr 24, 2024

Thanks a lot for you support @KazuSh1geru !

This is indeed a super important feature that need to be implemented.

What do you think about making the schema argument optional? You can have projects with many datasets, and specify them one by one can be very long and we loose usage simplicity.

I am thinking about something like that:

const commonAssertions = require("dataform-assertions");

const commonAssertionsResult = commonAssertions({
  globalAssertionsParams: {
    "schema": "dataform_assertions_" + dataform.projectConfig.vars.env,
    "location": "asia-northeast1",
    "tags": ["assertions"],
  },
  uniqueKeyConditions: {
    "schema1.filter": ["Id"],
    "schema2.filter": ["Id"],
    "otherTable": ["Id"]
  }
});

@KazuSh1geru
Copy link
Author

KazuSh1geru commented Apr 25, 2024

Thank you for your comment!!

I believe my method of writing the schema in isolation is better from 2 perspectives.

As the first, the above method of specifying tables is not sufficient.
Because the way to specify in the ref function needs to be schema, table. ref("schema1.filter") will cause an error, so the proposed notation needs to be converted to ref("schema1", "filter") and to be correct as the ref argument.
(Parsing of strings is not recommended, as it creates complexity.)

Second, it is against the Dry Principle.
Whenever there is a table other than filter in schema1, e.g. preprocess, postprocess, etc., schema1 must be added.
If this happens in other schemas, readability will be reduced.

const commonAssertionsResult = commonAssertions({
  globalAssertionsParams: {
    "schema": "dataform_assertions_" + dataform.projectConfig.vars.env,
    "location": "asia-northeast1",
    "tags": ["assertions"],
  },
  uniqueKeyConditions: {
    "schema1.filter": ["Id"],
    "schema1.preprocess": ["Id"],
    "schema1.postprocess": ["Id"],
    "schema2.filter": ["Id"],
    "schema2.postprocess": ["Id"],
    "schema3.filter": ["Id"],
    "schema3.preprocess": ["Id"],
    "schema3.postprocess": ["Id"],
    "schema3.transform": ["Id"],
    "otherTable": ["Id"]
  }
});

isolation schema pattern

const commonAssertionsResult = commonAssertions({
  globalAssertionsParams: {
    "schema": "dataform_assertions_" + dataform.projectConfig.vars.env,
    "location": "asia-northeast1",
    "tags": ["assertions"],
  },
  uniqueKeyConditions: {
    "schema1": {
      "filter": ["Id"],
      "preprocess": ["Id"],
      "postprocess": ["Id"]
    },
    "schema2": {
      "filter": ["Id"],
      "postprocess": ["Id"]
    },
    "schema3": {
      "filter": ["Id"],
      "preprocess": ["Id"],
      "postprocess": ["Id"],
      "transform": ["Id"]
    },
    "otherTable": ["Id"]
  }
});

@KazuSh1geru
Copy link
Author

KazuSh1geru commented Apr 25, 2024

@hrialan
The idea of being able to cut assertions in this package into arbitrary javascript is very nice!
I trust that this PR will be considered and better extended!
Best regards

@hrialan
Copy link
Collaborator

hrialan commented Apr 25, 2024

ok, I agree for the isolation pattern, It seems to be more readable.

Before merging, here are two points to improve:

  1. [Not blocking] Add the possibility to optionally set the schema.

Can be implemented in a future version if it is not so complicated.

As the example here for "otherTable" :

const commonAssertionsResult = commonAssertions({
  globalAssertionsParams: {
    "schema": "dataform_assertions_" + dataform.projectConfig.vars.env,
    "location": "asia-northeast1",
    "tags": ["assertions"],
  },
  uniqueKeyConditions: {
    "schema3": {
      "filter": ["Id"],
      "preprocess": ["Id"],
      "postprocess": ["Id"],
      "transform": ["Id"]
    },
    "otherTable": ["Id"] // THIS is not working
  }
});
  1. [Blocking] How can we specify the environment in a schema name ?

In many projects, envs are included in the dataset name. Do you have an idea on how we can do this with your solution ?

Example:

const commonAssertionsResult = commonAssertions({
  globalAssertionsParams: {
    "schema": "dataform_assertions_" + dataform.projectConfig.vars.env,
    "location": "asia-northeast1",
    "tags": ["assertions"],
  },
  uniqueKeyConditions: {
    "schema3" + dataform.projectConfig.vars.env : { // THIS is not working
      "filter": ["Id"],
      "preprocess": ["Id"],
      "postprocess": ["Id"],
      "transform": ["Id"]
    },
    "otherTable": ["Id"]
  }
});

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants