Fixed reading from zip package to default to text. #13984
I think the solution is not complete, as it does not properly handle Python source encoding. It is currently (potentially) wrong not only for the "zipped" case but also for the "non-zipped" case. Maybe there is a chance to fix it for both cases. It would likely require a slight change to the interface of the open_maybe_zipped function. In Python 3 the default encoding is utf-8, and I guess it covers the vast majority of cases, but there might be different encodings specified as defined by PEP 263: https://www.python.org/dev/peps/pep-0263/ . They are rarely used in Python 3, but there are still cases where they can be useful. Moreover, different Python files can be encoded with different encodings, and we seem to always use the same (default) encoding. However, I believe this function is only used to read Python sources, and there is a way in Python 3 to detect the encoding of Python source files. It is in the standard library: there are two functions that can be used (added in Python 3.2):
They both read the BOM of a file (if present) or follow PEP 263 (if not) to detect the file encoding. I think it would not be too complex to use those to reliably detect the encoding of Python files.
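A minimal sketch of that idea, assuming the functions meant are tokenize.open and tokenize.detect_encoding from the standard library (the names did not survive in the comment above, so this is a guess). The helper names read_python_source and decode_python_source are hypothetical and are not part of Airflow.

```python
import io
import tokenize


def read_python_source(path):
    """Return the text of a Python source file on disk.

    tokenize.open() detects the encoding from a BOM or a PEP 263 cookie
    and returns a read-only text-mode file object.
    """
    with tokenize.open(path) as source_file:
        return source_file.read()


def decode_python_source(raw):
    """Decode raw source bytes, e.g. bytes read from a zip member.

    detect_encoding() expects a readline callable that yields bytes,
    so the bytes are wrapped in a BytesIO first.
    """
    encoding, _ = tokenize.detect_encoding(io.BytesIO(raw).readline)
    return raw.decode(encoding)
```

The second helper is the variant that would apply to the zipped case, where the content is available as bytes rather than as a path on disk.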
@potiuk, I'd say it's a separate issue. The scope of this fix is just to make sure that when the DAG source code is read from a zipped package, it is returned in exactly the same way as for a non-zipped file. In both cases it uses Python's default encoding and with this fix the outcome of calling
Happy to hear what others think, but since the testing is about the same piece of code and potentially results in errors in the same place, I think it might be worth fixing in the same PR.
It appears that the CI build failed for a reason unrelated to the PR. It seems to be a temporary issue with the failed test itself. Is it possible to re-run the failed check?
Speaking of the tests: we are testing the implementation, which is fragile. If possible we should instead test that when reading from it we get strings, not bytes (and mock as little as possible; we already have actual zip files in the tests for this purpose).
As for encoding: so long as we document it, I'm perfectly happy only supporting utf8 source files. It's unlikely any modern system will use anything else, so it isn't worth us supporting it. If someone wants it they can open a PR in future. I'm not against it, but I also agree that it should be a separate PR to this one.
@@ -586,13 +586,16 @@ def test_open_maybe_zipped_normal_file_with_zip_in_name(self):
 @mock.patch("zipfile.is_zipfile")
 @mock.patch("zipfile.ZipFile")
-def test_open_maybe_zipped_archive(self, mocked_zip_file, mocked_is_zipfile):
+@mock.patch("io.TextIOWrapper")
+def test_open_maybe_zipped_archive(self, mocked_text_io_wrapper, mocked_zip_file, mocked_is_zipfile):
Can you move these test cases to test/utils/test_file.py? During the review, I checked that the file was missing and hence I merged the change that didn't work.
Would you like to move TestCorrectMaybeZipped to the new tests/utils/test_file.py module as well?
Yes. Please move it also. You can move these classes in this PR, but you'd better do it as a separate PR.
That I agree with. In this PR I tried to match the way things are done in the existing code and fix the specific issue in focus with minimal changes. If the elders agree, I'll certainly be happy to change the test implementation. @ashb, I'm new to the codebase, can you recommend the exact files (a zip and a non-zip with Python source code-like content) that you say already exist in the tests?
The PR most likely needs to run the full matrix of tests because it modifies parts of the core of Airflow. However, committers might decide to merge it quickly and take the risk. If they don't merge it quickly, please rebase it to the latest master at your convenience, or amend the last commit of the PR, and push it with --force-with-lease.
@levahim |
This is a resubmission of PR #13962. The original PR did not include corresponding unit test adjustments and therefore broke the build. This PR fixes the problem.
Original PR #13962 description for reference:
The open_maybe_zipped function returns different file-like objects depending on whether it's called for a plain file or for a file in a zip archive. The problem is that by default io.open (used for plain files) returns a file in text mode, and a subsequent read on it returns strings. ZipFile, on the other hand, by default returns a binary file, and a subsequent read on it returns bytes.
The returned value for open_maybe_zipped should be the same regardless of whether it's a zip or a plain file: it should be in text mode. Returning binaries for zip packages causes problems. For example, when saving DAG code is turned on, the DagCode model tries to save DAG source code in the metadata database. SQLAlchemy throws an error for DAGs that come from a zip package, because it tries to save a binary value in a string column.
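The behaviour described above can be sketched roughly as follows. This is not Airflow's actual implementation: the function name open_text_maybe_zipped, the ZIP_PATH_RE pattern, and the fixed utf-8 encoding are assumptions made for illustration. The only technique taken from the PR is wrapping the binary zip member in io.TextIOWrapper so that both branches return a text-mode file.

```python
import io
import re
import zipfile

# Hypothetical pattern: splits "/path/archive.zip/inner/file.py" into the
# archive path and the member name. Only "/" is handled as a separator.
ZIP_PATH_RE = re.compile(r"^(.*\.zip)/(.+)$")


def open_text_maybe_zipped(fileloc):
    """Open a plain file or a zip member in text mode.

    Both branches return a file object whose read() yields str.
    The utf-8 encoding is an assumption of this sketch.
    """
    match = ZIP_PATH_RE.match(fileloc)
    if match and zipfile.is_zipfile(match.group(1)):
        archive, member = match.groups()
        # ZipFile.open() returns a binary file; TextIOWrapper decodes it.
        binary_member = zipfile.ZipFile(archive).open(member)
        return io.TextIOWrapper(binary_member, encoding="utf-8")
    return open(fileloc, mode="r", encoding="utf-8")
```

Closing the returned wrapper closes the zip member but not the ZipFile object itself, which a production version would probably want to manage explicitly.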