Fixed reading from zip package to default to text. #13984

levahim · 2021-01-30T13:50:34Z

This is a resubmission of PR #13962. The original PR did not include corresponding unit test adjustments and therefore broke the build. This PR fixes the problem.

Original PR #13962 description for reference:

The open_maybe_zipped function returns different file-like objects depending on whether it's called for a plain file or for a file in a zip archive. The problem is that by default io.open (used for plain files) returns file in text mode and subsequent read on it returns strings. ZipFile on the other hand by default returns a binary file and subsequent read on it returns bytes.
The returned value for open_maybe_zipped should be the same regardless whether it's a zip or a plain file--it should be in text mode. Returning binaries for zip packages causes problems. For example, when saving DAG code is turned on, the DagCode model tries to save DAG source code in the metadata database. SQLAlchemy throws an error for DAGs that come from a zip package, because tries to save binary value in a string column.

potiuk · 2021-01-30T15:32:38Z

I think the solution is not complete, as it does not properly include Python encoding. And it is currently (potentially) wrong not only for "zipped" case but also for the "non-zipped" case. Maybe there is a chance to fix it for both cases. It would likely require to change the interface slightly of the open_maybe_zippped function.

In Python 3 default encoding is utf-8, and I guess it covers vast majority of cases, but there might be different encodings specified as defined by PEP 263: https://www.python.org/dev/peps/pep-0263/ . They are rarely used in Python 3 but still, there are cases when it can be useful. Moreover, different python files can be encoded with different encoding and we seem to use always the same encoding (default) as defined by locale.getpreferredencoding(False) (see https://docs.python.org/3/library/io.html#io.TextIOWrapper).

However, this function is only used to read python sources I believe, and there is a way in Python 3 to detect the encoding for Python source files. It is there in the standard library:

There are those two functions that can be used (added in Python 3.2):

They both read BOM of a file (if present) or follow PEP 263 (if not) to detect the file encoding. I think it would not be too complex to use those to reliably detect encoding of python files.

levahim · 2021-01-30T16:59:39Z

@potiuk, I'd say it's a separate issue. The scope of this fix is just to make sure that when the DAG source code is read from a zipped package, it is returned in exactly the same way as for a non-zipped file. In both cases it uses the Python's default encoding and with this fix the outcome of calling open_maybe_zipped is identical for both cases.
I'd suggest a separate issue is created and tracked for what you are describing. In the meantime, this fix unblocks all the folks who want to use zipped packages and have store_dag_code option enabled.

potiuk · 2021-01-30T17:05:31Z

@potiuk, I'd say it's a separate issue. The scope of this fix is just to make sure that when the DAG source code is read from a zipped package, it is returned in exactly the same way as for a non-zipped file. In both cases it uses the Python's default encoding and with this fix the outcome of calling open_maybe_zipped is identical for both cases.
I'd suggest a separate issue is created and tracked for what you are describing. In the meantime, this fix unblocks all the folks who want to use zipped packages and have store_dag_code option enabled.

Happy to hear what others think, but since testing is about the same piece of code and potentially resulting in errors in the same place - I think it might be worth fixing in the same PR.

levahim · 2021-01-30T17:31:49Z

It appears that the CI Build failed for a reason unrelated to the PR. Seems to be a temporary issue with the failed test itself. Is it possible to re-run the failed check?

potiuk · 2021-01-30T17:38:05Z

Unfortunately not (GA has no such possibility). It is indeed traansient error, But it is not blocking for merge. I would love to hear what others think (@kaxil, @ashb whether detection of the encoding of python file is worth a separate issue/PR) before merging.

ashb · 2021-01-30T18:07:05Z

Speaking of the tests: we are testing the implementation which is fragile.

If possible we should instead test that when reading from it we get strings, not bytes (and mock as little as possible - we already have actual zip files in the tests for for this purpose)

ashb · 2021-01-30T18:09:39Z

As for encoding: so long as we document it, I'm perfectly happy only supporting utf8 source files - it's unlikely any modern system will use anything else, so isn't worth us supporting it.

If someone wants it they can open a PR in future.

I'm not against it, but also agree that it should be a separate PR to this one.

mik-laj · 2021-01-30T19:23:01Z

tests/utils/test_dag_processing.py

@@ -586,13 +586,16 @@ def test_open_maybe_zipped_normal_file_with_zip_in_name(self):

    @mock.patch("zipfile.is_zipfile")
    @mock.patch("zipfile.ZipFile")
-    def test_open_maybe_zipped_archive(self, mocked_zip_file, mocked_is_zipfile):
+    @mock.patch("io.TextIOWrapper")
+    def test_open_maybe_zipped_archive(self, mocked_text_io_wrapper, mocked_zip_file, mocked_is_zipfile):


Can you move these test cases to test/utils/test_file.py? During the review, I checked that the file was missing and hence I merged the change that didn't work.

Would you like to move TestCorrectMaybeZipped to the new tests/utils/test_file.py module as well?

Yes. Please move it also. YOu can moce these classes in this PR, but you'd better do it as a separate PR.

levahim · 2021-01-31T14:11:53Z

Speaking of the tests: we are testing the implementation which is fragile.

If possible we should instead test that when reading from it we get strings, not bytes (and mock as little as possible - we already have actual zip files in the tests for for this purpose)

That I agree with. In this PR I tried to match the way things are done in the existing code and fix the specific issue at focus with minimal changes. If the elders agree, I'll certainly be happy to change the test implementation. @ashb , I'm new to the codebase, can you recommend exact files (a zip and a non-zip Python source code-like content) to use that you say already exist in the tests?

github-actions · 2021-02-01T01:18:54Z

The PR most likely needs to run full matrix of tests because it modifies parts of the core of Airflow. However, committers might decide to merge it quickly and take the risk. If they don't merge it quickly - please rebase it to the latest master at your convenience, or amend the last commit of the PR, and push it with --force-with-lease.

ashb · 2021-02-04T11:12:53Z

That I agree with. In this PR I tried to match the way things are done in the existing code and fix the specific issue at focus with minimal changes. If the elders agree, I'll certainly be happy to change the test implementation. @ashb , I'm new to the codebase, can you recommend exact files (a zip and a non-zip Python source code-like content) to use that you say already exist in the tests?

@levahim tests/dags/test_zip.zip was the zip file I was thinking of, and anything else in that dir for a .py folder.

levahim added 3 commits January 28, 2021 17:34

Fixed reading from zip package to default to text.

aa32fed

Fixed open_maybe_zip unit test to account for io.TextIOWrapper.

cd49317

Merge remote-tracking branch 'upstream/master'

6e65fb8

levahim mentioned this pull request Jan 30, 2021

Fixed reading from zip package to default to text. #13962

Merged

mik-laj reviewed Jan 30, 2021

View reviewed changes

mik-laj approved these changes Feb 1, 2021

View reviewed changes

github-actions bot added the full tests needed We need to run full set of tests for this PR to merge label Feb 1, 2021

potiuk merged commit 0bbc2cb into apache:master Feb 1, 2021

kaxil modified the milestones: Airflow 2.0.1, Airflow 2.1 Feb 1, 2021

levahim mentioned this pull request Feb 6, 2021

Improved unit tests for open_maybe_zipped function. #14114

Merged

kaxil mentioned this pull request Jun 28, 2021

Web UI doesn't render packaged dag code properly #10554

Closed

kaxil mentioned this pull request Jul 6, 2021

STORE_DAG_CODE not displaying correct code for zipped dag files #14412

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fixed reading from zip package to default to text. #13984

Fixed reading from zip package to default to text. #13984

levahim commented Jan 30, 2021

potiuk commented Jan 30, 2021 •

edited

levahim commented Jan 30, 2021

potiuk commented Jan 30, 2021 •

edited

levahim commented Jan 30, 2021

potiuk commented Jan 30, 2021

ashb commented Jan 30, 2021

ashb commented Jan 30, 2021 •

edited

mik-laj Jan 30, 2021 •

edited

levahim Jan 31, 2021

mik-laj Feb 1, 2021 •

edited

levahim commented Jan 31, 2021

github-actions bot commented Feb 1, 2021

ashb commented Feb 4, 2021

Fixed reading from zip package to default to text. #13984

Fixed reading from zip package to default to text. #13984

Conversation

levahim commented Jan 30, 2021

potiuk commented Jan 30, 2021 • edited

levahim commented Jan 30, 2021

potiuk commented Jan 30, 2021 • edited

levahim commented Jan 30, 2021

potiuk commented Jan 30, 2021

ashb commented Jan 30, 2021

ashb commented Jan 30, 2021 • edited

mik-laj Jan 30, 2021 • edited

Choose a reason for hiding this comment

levahim Jan 31, 2021

Choose a reason for hiding this comment

mik-laj Feb 1, 2021 • edited

Choose a reason for hiding this comment

levahim commented Jan 31, 2021

github-actions bot commented Feb 1, 2021

ashb commented Feb 4, 2021

potiuk commented Jan 30, 2021 •

edited

potiuk commented Jan 30, 2021 •

edited

ashb commented Jan 30, 2021 •

edited

mik-laj Jan 30, 2021 •

edited

mik-laj Feb 1, 2021 •

edited