Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix open lineage namespace for Sqlite as per OL team request #1142

Merged
merged 3 commits into from
Oct 27, 2022

Conversation

sunank200
Copy link
Contributor

@sunank200 sunank200 commented Oct 27, 2022

Description

What is the current behavior?

The sqlite namespace is missing in OpenLineage naming convention. So we created the namespace by default. OL team has requested to change this namespace.
The Namespaces extracted from AstroSDK do not conform to the OpenLineage naming convention for sqlite. For example, /tmp/sqlite_default.db instead of e.g. sqllite://

closes: #1141

What is the new behavior?

  • Fix the namespace for sqlite
  • Fix the test for this

Does this introduce a breaking change?

No

Checklist

  • Created tests which fail without the change (if possible)
  • Extended the README / documentation, if necessary

@sunank200 sunank200 changed the title Fix open lineage namespace as per https://github.com/OpenLineage/Open… Fix open lineage namespace for Sqlite Oct 27, 2022
@codecov
Copy link

codecov bot commented Oct 27, 2022

Codecov Report

Base: 91.94% // Head: 93.82% // Increases project coverage by +1.87% 🎉

Coverage data is based on head (7e326f9) compared to base (496382c).
Patch coverage: 100.00% of modified lines in pull request are covered.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1142      +/-   ##
==========================================
+ Coverage   91.94%   93.82%   +1.87%     
==========================================
  Files          12       65      +53     
  Lines         472     3045    +2573     
  Branches       48      341     +293     
==========================================
+ Hits          434     2857    +2423     
- Misses         28      129     +101     
- Partials       10       59      +49     
Impacted Files Coverage Δ
python-sdk/src/astro/databases/sqlite.py 96.42% <100.00%> (ø)
sql-cli/sql_cli/__main__.py 100.00% <0.00%> (ø)
sql-cli/sql_cli/exceptions.py 100.00% <0.00%> (ø)
python-sdk/src/astro/sql/operators/load_file.py 96.47% <0.00%> (ø)
python-sdk/src/astro/databases/base.py 97.00% <0.00%> (ø)
python-sdk/src/astro/files/locations/http.py 100.00% <0.00%> (ø)
python-sdk/src/astro/databases/postgres.py 96.29% <0.00%> (ø)
python-sdk/src/astro/sql/operators/raw_sql.py 89.65% <0.00%> (ø)
python-sdk/src/astro/files/locations/__init__.py 100.00% <0.00%> (ø)
python-sdk/src/astro/files/locations/amazon/s3.py 97.56% <0.00%> (ø)
... and 47 more

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

@sunank200 sunank200 changed the title Fix open lineage namespace for Sqlite Fix open lineage namespace for Sqlite as per OL team request Oct 27, 2022
Copy link
Contributor

@pankajastro pankajastro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks, @sunank200 for fixing it.

@@ -18,6 +18,24 @@ class DatabaseSubclass(BaseDatabase):
pass


def test_openlineage_database_dataset_namespace():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just wondering does this test add some value?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is just for code coverage. It was failing for coverage

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a good point from @pankajastro, the test should check that the method returns what you expect.
Test coverage reports tell you where you're missing tests, coverage is not a goal in itself.
In this case, the test should check that the method returns a namespace that conforms with the spec.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @pankajastro and @julienledem here, this test doesn't add any value.

It is just for code coverage. It was failing for coverage

If we just need to do this for code coverage, we are doing it wrong :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sunank200 sunank200 self-assigned this Oct 27, 2022
@sunank200 sunank200 merged commit c2773d5 into main Oct 27, 2022
@sunank200 sunank200 deleted the rename-database-namespace branch October 27, 2022 06:54
Copy link

@julienledem julienledem left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @sunank200 for addressing the request.
Sorry for not giving my review before this got merged.
Please look at the few comments bellow.
I think you still need to tweak it before we can consider it done.
Best, Julien

@@ -18,6 +18,24 @@ class DatabaseSubclass(BaseDatabase):
pass


def test_openlineage_database_dataset_namespace():

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a good point from @pankajastro, the test should check that the method returns what you expect.
Test coverage reports tell you where you're missing tests, coverage is not a goal in itself.
In this case, the test should check that the method returns a namespace that conforms with the spec.

"""
db = DatabaseSubclass(conn_id="fake_conn_id")
with pytest.raises(NotImplementedError):
db.openlineage_dataset_name(table=BaseTable)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly, here you want to mock the method being called to check that openlineage_dataset_name returns the expected value.

with pytest.raises(NotImplementedError):
db.openlineage_dataset_name(table=BaseTable)


def test_subclass_missing_not_implemented_methods_raise_exception():
db = DatabaseSubclass(conn_id="fake_conn_id")
with pytest.raises(NotImplementedError):

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

"/tmp/sqlite.db",
Connection(conn_id="test_conn", conn_type="sqlite", host="tmp/sqlite.db"),
"tmp/sqlite.db.test_tb",
"sqlite://tmp/sqlite.db",

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two comments here:

  • The namespace should not repeat the dataset name ("tmp/sqlite.db.test_tb").
  • The tricky aspect of those local datasets that are private to the machine they are running on is we want the dataset ns+name to be unique. Otherwise if you run this on two different machines the lineage graph will assume that you are writing and reading the same dataset, which is not true. You need to make the namespace more unique by (for example) adding the machine name/IP in it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have added the PR for the change #1178. Please review

phanikumv pushed a commit that referenced this pull request Nov 4, 2022
# Description
## What is the current behavior?
<!-- Please describe the current behavior that you are modifying. -->
Fix the comments made on
#1142

<!--
Issues are required for both bug fixes and features.
Reference it using one of the following:

closes: #ISSUE
related: #ISSUE
-->
closes: #1179 

## What is the new behavior?
<!-- Please describe the behavior or changes that are being added by
this PR. -->

- Fix the namespace for sqlite  as per comment on #1142 

> Two comments here:
> 
> - The namespace should not repeat the dataset name
("tmp/sqlite.db.test_tb").
>- The tricky aspect of those local datasets that are private to the
machine they are running on is we want the dataset ns+name to be unique.
Otherwise if you run this on two different machines the lineage graph
will assume that you are writing and reading the same dataset, which is
not true. You need to make the namespace more unique by (for example)
adding the machine name/IP in it.
- Fix the test for this


## Does this introduce a breaking change?
No

### Checklist
- [x] Created tests which fail without the change (if possible)
- [x] Extended the README / documentation, if necessary

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
utkarsharma2 pushed a commit that referenced this pull request Nov 4, 2022
# Description
## What is the current behavior?
<!-- Please describe the current behavior that you are modifying. -->
The sqlite namespace is missing in [OpenLineage naming
convention](https://github.com/OpenLineage/OpenLineage/blob/main/spec/Naming.md).
So we created the namespace by default. OL team has requested to change
this namespace.
The Namespaces extracted from AstroSDK do not conform to the
[OpenLineage naming
convention](https://github.com/OpenLineage/OpenLineage/blob/main/spec/Naming.md)
for sqlite. For example, /tmp/sqlite_default.db instead of e.g.
sqllite://


<!--
Issues are required for both bug fixes and features.
Reference it using one of the following:

closes: #ISSUE
related: #ISSUE
-->
closes: #1141

## What is the new behavior?
<!-- Please describe the behavior or changes that are being added by
this PR. -->

- Fix the namespace for sqlite
- Fix the test for this


## Does this introduce a breaking change?
No

### Checklist
- [x] Created tests which fail without the change (if possible)
- [x] Extended the README / documentation, if necessary
utkarsharma2 pushed a commit that referenced this pull request Nov 4, 2022
# Description
## What is the current behavior?
<!-- Please describe the current behavior that you are modifying. -->
Fix the comments made on
#1142

<!--
Issues are required for both bug fixes and features.
Reference it using one of the following:

closes: #ISSUE
related: #ISSUE
-->
closes: #1179 

## What is the new behavior?
<!-- Please describe the behavior or changes that are being added by
this PR. -->

- Fix the namespace for sqlite  as per comment on #1142 

> Two comments here:
> 
> - The namespace should not repeat the dataset name
("tmp/sqlite.db.test_tb").
>- The tricky aspect of those local datasets that are private to the
machine they are running on is we want the dataset ns+name to be unique.
Otherwise if you run this on two different machines the lineage graph
will assume that you are writing and reading the same dataset, which is
not true. You need to make the namespace more unique by (for example)
adding the machine name/IP in it.
- Fix the test for this


## Does this introduce a breaking change?
No

### Checklist
- [x] Created tests which fail without the change (if possible)
- [x] Extended the README / documentation, if necessary

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
@JDarDagran
Copy link

Hey, there have been new conventions introduced: Naming.md#local-file-system
I believe sqlite counts as local file, therefore you could adjust naming convention in astro as well.

A word of explanation: both IP and port are mandatory. It means that if port is unknown (e. g. if file exists in Airflow's local file system only) you should default to some value. In one of OL extractors it is defaulted to paramiko.config.SSH_PORT, in most cases it's equal to 22.

@sunank200
Copy link
Contributor Author

Hey, there have been new conventions introduced: Naming.md#local-file-system I believe sqlite counts as local file, therefore you could adjust naming convention in astro as well.

A word of explanation: both IP and port are mandatory. It means that if port is unknown (e. g. if file exists in Airflow's local file system only) you should default to some value. In one of OL extractors it is defaulted to paramiko.config.SSH_PORT, in most cases it's equal to 22.

@JDarDagran created the issue for this: #1281

sunank200 added a commit that referenced this pull request Nov 29, 2022
Describe the bug
- There have been new conventions introduced:
[Naming.md#local-file-system](https://github.com/OpenLineage/OpenLineage/blob/main/spec/Naming.md#local-file-system)
I believe sqlite counts as local file, therefore you could adjust naming
convention in astro as well.

A word of explanation: both IP and port are mandatory. It means that if
port is unknown (e. g. if file exists in Airflow's local file system
only) you should default to some value. In one of OL extractors it is
defaulted to paramiko.config.SSH_PORT, in most cases it's equal to 22.

Reference:
#1142 (comment)
- Add openlineage_dataset_uri() method to create URI as per each
database.

closes #1281

Co-authored-by: Ankit Chaurasia <8670962+sunank200@users.noreply.github.com>
sunank200 pushed a commit that referenced this pull request Dec 1, 2022
Describe the bug
- There have been new conventions introduced:
[Naming.md#local-file-system](https://github.com/OpenLineage/OpenLineage/blob/main/spec/Naming.md#local-file-system)
I believe sqlite counts as local file, therefore you could adjust naming
convention in astro as well.

A word of explanation: both IP and port are mandatory. It means that if
port is unknown (e. g. if file exists in Airflow's local file system
only) you should default to some value. In one of OL extractors it is
defaulted to paramiko.config.SSH_PORT, in most cases it's equal to 22.

Reference:
#1142 (comment)
- Add openlineage_dataset_uri() method to create URI as per each
database.

closes #1281

Co-authored-by: Ankit Chaurasia <8670962+sunank200@users.noreply.github.com>
(cherry picked from commit b725430)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Fix namespace for OL for sqlite
5 participants