-
Notifications
You must be signed in to change notification settings - Fork 40
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix open lineage namespace for Sqlite as per OL team request #1142
Conversation
Codecov ReportBase: 91.94% // Head: 93.82% // Increases project coverage by
Additional details and impacted files@@ Coverage Diff @@
## main #1142 +/- ##
==========================================
+ Coverage 91.94% 93.82% +1.87%
==========================================
Files 12 65 +53
Lines 472 3045 +2573
Branches 48 341 +293
==========================================
+ Hits 434 2857 +2423
- Misses 28 129 +101
- Partials 10 59 +49
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. ☔ View full report at Codecov. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thanks, @sunank200 for fixing it.
@@ -18,6 +18,24 @@ class DatabaseSubclass(BaseDatabase): | |||
pass | |||
|
|||
|
|||
def test_openlineage_database_dataset_namespace(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just wondering does this test add some value?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is just for code coverage. It was failing for coverage
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is a good point from @pankajastro, the test should check that the method returns what you expect.
Test coverage reports tell you where you're missing tests, coverage is not a goal in itself.
In this case, the test should check that the method returns a namespace that conforms with the spec.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with @pankajastro and @julienledem here, this test doesn't add any value.
It is just for code coverage. It was failing for coverage
If we just need to do this for code coverage, we are doing it wrong :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure. We already have a test which covers the scenarios: https://github.com/astronomer/astro-sdk/blob/main/python-sdk/tests/sql/test_table.py#L221
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @sunank200 for addressing the request.
Sorry for not giving my review before this got merged.
Please look at the few comments bellow.
I think you still need to tweak it before we can consider it done.
Best, Julien
@@ -18,6 +18,24 @@ class DatabaseSubclass(BaseDatabase): | |||
pass | |||
|
|||
|
|||
def test_openlineage_database_dataset_namespace(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is a good point from @pankajastro, the test should check that the method returns what you expect.
Test coverage reports tell you where you're missing tests, coverage is not a goal in itself.
In this case, the test should check that the method returns a namespace that conforms with the spec.
""" | ||
db = DatabaseSubclass(conn_id="fake_conn_id") | ||
with pytest.raises(NotImplementedError): | ||
db.openlineage_dataset_name(table=BaseTable) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similarly, here you want to mock the method being called to check that openlineage_dataset_name returns the expected value.
with pytest.raises(NotImplementedError): | ||
db.openlineage_dataset_name(table=BaseTable) | ||
|
||
|
||
def test_subclass_missing_not_implemented_methods_raise_exception(): | ||
db = DatabaseSubclass(conn_id="fake_conn_id") | ||
with pytest.raises(NotImplementedError): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto
"/tmp/sqlite.db", | ||
Connection(conn_id="test_conn", conn_type="sqlite", host="tmp/sqlite.db"), | ||
"tmp/sqlite.db.test_tb", | ||
"sqlite://tmp/sqlite.db", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two comments here:
- The namespace should not repeat the dataset name ("tmp/sqlite.db.test_tb").
- The tricky aspect of those local datasets that are private to the machine they are running on is we want the dataset ns+name to be unique. Otherwise if you run this on two different machines the lineage graph will assume that you are writing and reading the same dataset, which is not true. You need to make the namespace more unique by (for example) adding the machine name/IP in it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have added the PR for the change #1178. Please review
# Description ## What is the current behavior? <!-- Please describe the current behavior that you are modifying. --> Fix the comments made on #1142 <!-- Issues are required for both bug fixes and features. Reference it using one of the following: closes: #ISSUE related: #ISSUE --> closes: #1179 ## What is the new behavior? <!-- Please describe the behavior or changes that are being added by this PR. --> - Fix the namespace for sqlite as per comment on #1142 > Two comments here: > > - The namespace should not repeat the dataset name ("tmp/sqlite.db.test_tb"). >- The tricky aspect of those local datasets that are private to the machine they are running on is we want the dataset ns+name to be unique. Otherwise if you run this on two different machines the lineage graph will assume that you are writing and reading the same dataset, which is not true. You need to make the namespace more unique by (for example) adding the machine name/IP in it. - Fix the test for this ## Does this introduce a breaking change? No ### Checklist - [x] Created tests which fail without the change (if possible) - [x] Extended the README / documentation, if necessary Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
# Description ## What is the current behavior? <!-- Please describe the current behavior that you are modifying. --> The sqlite namespace is missing in [OpenLineage naming convention](https://github.com/OpenLineage/OpenLineage/blob/main/spec/Naming.md). So we created the namespace by default. OL team has requested to change this namespace. The Namespaces extracted from AstroSDK do not conform to the [OpenLineage naming convention](https://github.com/OpenLineage/OpenLineage/blob/main/spec/Naming.md) for sqlite. For example, /tmp/sqlite_default.db instead of e.g. sqllite:// <!-- Issues are required for both bug fixes and features. Reference it using one of the following: closes: #ISSUE related: #ISSUE --> closes: #1141 ## What is the new behavior? <!-- Please describe the behavior or changes that are being added by this PR. --> - Fix the namespace for sqlite - Fix the test for this ## Does this introduce a breaking change? No ### Checklist - [x] Created tests which fail without the change (if possible) - [x] Extended the README / documentation, if necessary
# Description ## What is the current behavior? <!-- Please describe the current behavior that you are modifying. --> Fix the comments made on #1142 <!-- Issues are required for both bug fixes and features. Reference it using one of the following: closes: #ISSUE related: #ISSUE --> closes: #1179 ## What is the new behavior? <!-- Please describe the behavior or changes that are being added by this PR. --> - Fix the namespace for sqlite as per comment on #1142 > Two comments here: > > - The namespace should not repeat the dataset name ("tmp/sqlite.db.test_tb"). >- The tricky aspect of those local datasets that are private to the machine they are running on is we want the dataset ns+name to be unique. Otherwise if you run this on two different machines the lineage graph will assume that you are writing and reading the same dataset, which is not true. You need to make the namespace more unique by (for example) adding the machine name/IP in it. - Fix the test for this ## Does this introduce a breaking change? No ### Checklist - [x] Created tests which fail without the change (if possible) - [x] Extended the README / documentation, if necessary Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Hey, there have been new conventions introduced: Naming.md#local-file-system A word of explanation: both IP and port are mandatory. It means that if port is unknown (e. g. if file exists in Airflow's local file system only) you should default to some value. In one of OL extractors it is defaulted to |
@JDarDagran created the issue for this: #1281 |
Describe the bug - There have been new conventions introduced: [Naming.md#local-file-system](https://github.com/OpenLineage/OpenLineage/blob/main/spec/Naming.md#local-file-system) I believe sqlite counts as local file, therefore you could adjust naming convention in astro as well. A word of explanation: both IP and port are mandatory. It means that if port is unknown (e. g. if file exists in Airflow's local file system only) you should default to some value. In one of OL extractors it is defaulted to paramiko.config.SSH_PORT, in most cases it's equal to 22. Reference: #1142 (comment) - Add openlineage_dataset_uri() method to create URI as per each database. closes #1281 Co-authored-by: Ankit Chaurasia <8670962+sunank200@users.noreply.github.com>
Describe the bug - There have been new conventions introduced: [Naming.md#local-file-system](https://github.com/OpenLineage/OpenLineage/blob/main/spec/Naming.md#local-file-system) I believe sqlite counts as local file, therefore you could adjust naming convention in astro as well. A word of explanation: both IP and port are mandatory. It means that if port is unknown (e. g. if file exists in Airflow's local file system only) you should default to some value. In one of OL extractors it is defaulted to paramiko.config.SSH_PORT, in most cases it's equal to 22. Reference: #1142 (comment) - Add openlineage_dataset_uri() method to create URI as per each database. closes #1281 Co-authored-by: Ankit Chaurasia <8670962+sunank200@users.noreply.github.com> (cherry picked from commit b725430)
Description
What is the current behavior?
The sqlite namespace is missing in OpenLineage naming convention. So we created the namespace by default. OL team has requested to change this namespace.
The Namespaces extracted from AstroSDK do not conform to the OpenLineage naming convention for sqlite. For example, /tmp/sqlite_default.db instead of e.g. sqllite://
closes: #1141
What is the new behavior?
Does this introduce a breaking change?
No
Checklist