Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixed reading from zip package to default to text. #13984

Merged
merged 3 commits into from Feb 1, 2021

Conversation

levahim
Copy link
Contributor

@levahim levahim commented Jan 30, 2021

This is a resubmission of PR #13962. The original PR did not include corresponding unit test adjustments and therefore broke the build. This PR fixes the problem.

Original PR #13962 description for reference:

The open_maybe_zipped function returns different file-like objects depending on whether it's called for a plain file or for a file in a zip archive. The problem is that by default io.open (used for plain files) returns file in text mode and subsequent read on it returns strings. ZipFile on the other hand by default returns a binary file and subsequent read on it returns bytes.
The returned value for open_maybe_zipped should be the same regardless whether it's a zip or a plain file--it should be in text mode. Returning binaries for zip packages causes problems. For example, when saving DAG code is turned on, the DagCode model tries to save DAG source code in the metadata database. SQLAlchemy throws an error for DAGs that come from a zip package, because tries to save binary value in a string column.

@potiuk
Copy link
Member

potiuk commented Jan 30, 2021

I think the solution is not complete, as it does not properly include Python encoding. And it is currently (potentially) wrong not only for "zipped" case but also for the "non-zipped" case. Maybe there is a chance to fix it for both cases. It would likely require to change the interface slightly of the open_maybe_zippped function.

In Python 3 default encoding is utf-8, and I guess it covers vast majority of cases, but there might be different encodings specified as defined by PEP 263: https://www.python.org/dev/peps/pep-0263/ . They are rarely used in Python 3 but still, there are cases when it can be useful. Moreover, different python files can be encoded with different encoding and we seem to use always the same encoding (default) as defined by locale.getpreferredencoding(False) (see https://docs.python.org/3/library/io.html#io.TextIOWrapper).

However, this function is only used to read python sources I believe, and there is a way in Python 3 to detect the encoding for Python source files. It is there in the standard library:

There are those two functions that can be used (added in Python 3.2):

They both read BOM of a file (if present) or follow PEP 263 (if not) to detect the file encoding. I think it would not be too complex to use those to reliably detect encoding of python files.

@levahim
Copy link
Contributor Author

levahim commented Jan 30, 2021

@potiuk, I'd say it's a separate issue. The scope of this fix is just to make sure that when the DAG source code is read from a zipped package, it is returned in exactly the same way as for a non-zipped file. In both cases it uses the Python's default encoding and with this fix the outcome of calling open_maybe_zipped is identical for both cases.
I'd suggest a separate issue is created and tracked for what you are describing. In the meantime, this fix unblocks all the folks who want to use zipped packages and have store_dag_code option enabled.

@potiuk
Copy link
Member

potiuk commented Jan 30, 2021

@potiuk, I'd say it's a separate issue. The scope of this fix is just to make sure that when the DAG source code is read from a zipped package, it is returned in exactly the same way as for a non-zipped file. In both cases it uses the Python's default encoding and with this fix the outcome of calling open_maybe_zipped is identical for both cases.
I'd suggest a separate issue is created and tracked for what you are describing. In the meantime, this fix unblocks all the folks who want to use zipped packages and have store_dag_code option enabled.

Happy to hear what others think, but since testing is about the same piece of code and potentially resulting in errors in the same place - I think it might be worth fixing in the same PR.

@levahim
Copy link
Contributor Author

levahim commented Jan 30, 2021

It appears that the CI Build failed for a reason unrelated to the PR. Seems to be a temporary issue with the failed test itself. Is it possible to re-run the failed check?

@potiuk
Copy link
Member

potiuk commented Jan 30, 2021

Unfortunately not (GA has no such possibility). It is indeed traansient error, But it is not blocking for merge. I would love to hear what others think (@kaxil, @ashb whether detection of the encoding of python file is worth a separate issue/PR) before merging.

@ashb
Copy link
Member

ashb commented Jan 30, 2021

Speaking of the tests: we are testing the implementation which is fragile.

If possible we should instead test that when reading from it we get strings, not bytes (and mock as little as possible - we already have actual zip files in the tests for for this purpose)

@ashb
Copy link
Member

ashb commented Jan 30, 2021

As for encoding: so long as we document it, I'm perfectly happy only supporting utf8 source files - it's unlikely any modern system will use anything else, so isn't worth us supporting it.

If someone wants it they can open a PR in future.

I'm not against it, but also agree that it should be a separate PR to this one.

@@ -586,13 +586,16 @@ def test_open_maybe_zipped_normal_file_with_zip_in_name(self):

@mock.patch("zipfile.is_zipfile")
@mock.patch("zipfile.ZipFile")
def test_open_maybe_zipped_archive(self, mocked_zip_file, mocked_is_zipfile):
@mock.patch("io.TextIOWrapper")
def test_open_maybe_zipped_archive(self, mocked_text_io_wrapper, mocked_zip_file, mocked_is_zipfile):
Copy link
Member

@mik-laj mik-laj Jan 30, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you move these test cases to test/utils/test_file.py? During the review, I checked that the file was missing and hence I merged the change that didn't work.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you like to move TestCorrectMaybeZipped to the new tests/utils/test_file.py module as well?

Copy link
Member

@mik-laj mik-laj Feb 1, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Please move it also. YOu can moce these classes in this PR, but you'd better do it as a separate PR.

@levahim
Copy link
Contributor Author

levahim commented Jan 31, 2021

Speaking of the tests: we are testing the implementation which is fragile.

If possible we should instead test that when reading from it we get strings, not bytes (and mock as little as possible - we already have actual zip files in the tests for for this purpose)

That I agree with. In this PR I tried to match the way things are done in the existing code and fix the specific issue at focus with minimal changes. If the elders agree, I'll certainly be happy to change the test implementation. @ashb , I'm new to the codebase, can you recommend exact files (a zip and a non-zip Python source code-like content) to use that you say already exist in the tests?

@github-actions github-actions bot added the full tests needed We need to run full set of tests for this PR to merge label Feb 1, 2021
@github-actions
Copy link

github-actions bot commented Feb 1, 2021

The PR most likely needs to run full matrix of tests because it modifies parts of the core of Airflow. However, committers might decide to merge it quickly and take the risk. If they don't merge it quickly - please rebase it to the latest master at your convenience, or amend the last commit of the PR, and push it with --force-with-lease.

@potiuk potiuk merged commit 0bbc2cb into apache:master Feb 1, 2021
@kaxil kaxil modified the milestones: Airflow 2.0.1, Airflow 2.1 Feb 1, 2021
@ashb
Copy link
Member

ashb commented Feb 4, 2021

That I agree with. In this PR I tried to match the way things are done in the existing code and fix the specific issue at focus with minimal changes. If the elders agree, I'll certainly be happy to change the test implementation. @ashb , I'm new to the codebase, can you recommend exact files (a zip and a non-zip Python source code-like content) to use that you say already exist in the tests?

@levahim tests/dags/test_zip.zip was the zip file I was thinking of, and anything else in that dir for a .py folder.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
full tests needed We need to run full set of tests for this PR to merge
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants