
Handle invalid alias usage by manipulating the AST #219

Merged — 4 commits merged into main from feature/lca-handling on Apr 9, 2024

Conversation

@bishwajit-db (Contributor) commented Apr 7, 2024

Handle invalid alias usage by manipulating the AST. Replace the erroneous alias with the original expression.

Fixes #179, #195, #167

@sundarshankar89 (Contributor) left a comment


LGTM

@sundarshankar89 sundarshankar89 added this pull request to the merge queue Apr 9, 2024
Merged via the queue into main with commit 1026286 Apr 9, 2024
6 checks passed
@sundarshankar89 sundarshankar89 deleted the feature/lca-handling branch April 9, 2024 09:19
ravit-db pushed a commit that referenced this pull request Apr 18, 2024
Handle invalid alias usage by manipulating the AST. Replace the erroneous alias with the original expression.

Fixes #179, #195, #167
nfx added a commit that referenced this pull request May 8, 2024
* Added `Configure Secrets` support to `databricks labs remorph configure-secrets` cli command ([#254](#254)). The `Configure Secrets` feature has been implemented in the `databricks labs remorph` CLI command, specifically for the new `configure-secrets` command. This addition allows users to establish Scope and Secrets within their Databricks Workspace, enhancing security and control over resource access. The implementation includes a new `recon_config_utils.py` file in the `databricks/labs/remorph/helpers` directory, which contains classes and methods for managing Databricks Workspace secrets. Furthermore, the `ReconConfigPrompts` helper class has been updated to handle prompts for selecting sources, entering secret scope names, and handling overwrites. The CLI command has also been updated with a new `configure_secrets` function and corresponding tests to ensure correct functionality.
* Added handling for invalid alias usage by manipulating the AST ([#219](#219)). This commit addresses invalid alias usage in SQL queries by manipulating the Abstract Syntax Tree (AST). It introduces a new method, `unalias_lca_in_select`, which unaliases Lateral Column Aliases (LCA) in the SELECT clause of a query. A new `AliasInfo` class manages aliases more effectively, with attributes for the name, the expression, and a flag indicating whether the alias name is the same as a column name. Additionally, `execute.py` now checks for unsupported LCA via the `lca_utils.check_for_unsupported_lca` method, improving the system's robustness when handling invalid aliases. Test cases in the new file `test_lca_utils.py` validate the behavior of `check_for_unsupported_lca`, ensuring that SQL queries are correctly formatted for the Snowflake dialect and avoid errors due to invalid alias usage.
* Added support for `databricks labs remorph generate-lineage` CLI command ([#238](#238)). A new CLI command, `databricks labs remorph generate-lineage`, has been added to generate lineage for input SQL files, taking the source dialect, input, and output directories as arguments. The command uses existing logic to generate a directed acyclic graph (DAG) and then creates a DOT file in the output directory using the DAG. The new command is supported by new functions `_generate_dot_file_contents`, `lineage_generator`, and methods in the `RootTableIdentifier` and `DAG` classes. The command has been manually tested and includes unit tests, with plans for adding integration tests in the future. The commit also includes a new method `temp_dirs_for_lineage` and updates to the `configure_secrets_databricks` method to handle a new source type "databricks". The command handles invalid input and raises appropriate exceptions.
* Custom Oracle tokenizer ([#316](#316)). In this release, the remorph library has been updated to enhance its handling of Oracle databases. A custom Oracle tokenizer maps the `LONG` datatype to text (string) in the tokenizer, allowing more precise parsing and manipulation of `LONG` columns in Oracle databases. The Oracle dialect in the configuration file has been updated to use the new custom tokenizer, and the `Oracle` class from the snow module has been imported and integrated into the Oracle dialect. These improvements let remorph manage Oracle databases more efficiently, with a particular focus on the `LONG` datatype. The commit also updates test files in the `functional/oracle/test_long_datatype` directory, which verify the conversion of `LONG` to text, and adds a new test file in the `tests/unit/snow` directory that checks compatibility with Oracle's `LONG` data type. Together these changes ensure accurate handling of `LONG` in both Oracle SQL and Databricks SQL.
* Removed strict source dialect checks ([#284](#284)). In the latest release, the `transpile` and `generate_lineage` functions in `cli.py` have undergone changes to allow for greater flexibility in source dialect selection. Previously, only `snowflake` or `tsql` dialects were supported, but now any source dialect supported by SQLGLOT can be used, controlled by the `SQLGLOT_DIALECTS` dictionary. Providing an unsupported source dialect will result in a validation error. Additionally, the input and output folder paths for the `generate_lineage` function are now validated against the file system to ensure their existence and validity. In the `install.py` file of the `databricks/labs/remorph` package, the source dialect selection has been updated to use `SQLGLOT_DIALECTS.keys()`, replacing the previous hardcoded list. This change allows for more flexibility in selecting the source dialect. Furthermore, recent updates to various test functions in the `test_install.py` file suggest that the source selection process has been modified, possibly indicating the addition of new sources or a change in source identification. These modifications provide greater flexibility in testing and potentially in the actual application.
* Set Catalog, Schema from default Config ([#312](#312)). A new feature has been added to our open-source library that allows users to specify the `catalog` and `schema` configuration options as part of the `transpile` command-line interface (CLI). If these options are not provided, the `transpile` function in the `cli.py` file will now set them to the values specified in `default_config`. This ensures that a default catalog and schema are used if they are not explicitly set by the user. The `labs.yml` file has been updated to reflect these changes, with the addition of the `catalog-name` and `schema-name` options to the `commands` object. The `default` property of the `validation` object has also been updated to `true`, indicating that the validation step will be skipped by default. These changes provide increased flexibility and ease-of-use for users of the `transpile` functionality.
* Support for null-safe equality join for the Databricks generator ([#280](#280)). This release implements support for a null-safe equality join in the Databricks generator, addressing issue [#280](#280). The generated SQL now uses the `<=>` operator instead of the `IS NOT DISTINCT FROM` syntax, ensuring accurate comparisons when NULL values are present in the joined columns. The `Generator` class gains a new method, `NullSafeEQ`, which takes an expression and returns the binary form of the expression using the `<=>` operator, and the `preprocess` method of the `Generator` class has been modified to include this functionality. Note that this change may require users to update existing code to align with the new syntax in the Databricks environment. With this enhancement, the Databricks generator performs null-safe equality joins that yield consistent results regardless of NULL values in the join conditions.
@nfx nfx mentioned this pull request May 8, 2024
nfx added a commit that referenced this pull request May 8, 2024
Development

Successfully merging this pull request may close these issues.

[LCA] Alias scoping issue