Conversation

@alexkruc (Contributor) commented Oct 5, 2022

This PR adds the ability for the download_file method to keep the original file name of the file as it is stored on S3.

Currently, the method always downloads into an auto-generated temp file with a hardcoded prefix, as shown here:

with NamedTemporaryFile(dir=local_path, prefix='airflow_tmp_', delete=False) as local_tmp_file:
    s3_obj.download_fileobj(

As the original ticket states, users sometimes need to keep the file name exactly as it is on S3. To accomplish this, I've added the parameter preserve_file_name which, if set to True, renames the temporary file to the name of the S3 object that was downloaded.

The default is False, meaning the auto-generated file name is kept and used, preserving the existing API and avoiding a breaking change.
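
A minimal usage sketch of the new flag (assuming the amazon provider's S3Hook; the bucket, key, and local path values are hypothetical):

from airflow.providers.amazon.aws.hooks.s3 import S3Hook

hook = S3Hook()
# Keep the original S3 object name instead of an airflow_tmp_* name
local_file = hook.download_file(
    key="path/to/report.json",
    bucket_name="my-bucket",
    local_path="/tmp/downloads",
    preserve_file_name=True,
)
print(local_file)  # ends with .../report.json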

closes: #23514
related: #23654 (closed; it tried to implement the same thing in a different way)


@kaxil (Member) left a comment

@o-nikolas (Contributor) left a comment

Thanks for the contribution! I left a wording nit-pick and a suggestion for a refactor. Feel free to push back on the latter ;)

)

return local_tmp_file.name
if preserve_file_name:
@o-nikolas (Contributor):

It feels to me like the conditional should happen earlier, before the temporary file name is created; it's a bit strange to change the name afterwards when we could just download it under the correct name (whether temp or not) in the first place.

This would require a greater refactoring of this code though.

@alexkruc (Contributor Author) commented Oct 6, 2022

This is a good point. I've refactored the code accordingly; now we are not renaming the file but writing directly to the proper file.
I'd appreciate another review of the new implementation if possible :) 🙏

Comment on lines 923 to 930
filename_in_s3 = s3_obj.key.split('/')[-1]
local_folder_name = local_tmp_file.name.rsplit('/', 1)[0]
local_file_name = f"{local_folder_name}/{filename_in_s3}"

self.log.info("Renaming file from %s to %s", local_tmp_file.name, local_file_name)
rename(local_tmp_file.name, local_file_name)

return local_file_name
@uranusjr (Member):

Suggested change:

-filename_in_s3 = s3_obj.key.split('/')[-1]
-local_folder_name = local_tmp_file.name.rsplit('/', 1)[0]
-local_file_name = f"{local_folder_name}/{filename_in_s3}"
-self.log.info("Renaming file from %s to %s", local_tmp_file.name, local_file_name)
-rename(local_tmp_file.name, local_file_name)
-return local_file_name
+filename_in_s3 = s3_obj.key.rsplit('/', 1)[-1]
+local_tmp_file_name = Path(local_tmp_file.name)
+local_file_name = local_tmp_file_name.with_name(filename_in_s3)
+self.log.info("Renaming file from %s to %s", local_tmp_file_name, local_file_name)
+local_tmp_file_name.rename(local_file_name)
+return str(local_file_name)

Don’t do string manipulation on the path, use structured calls instead.

@alexkruc (Contributor Author):

Thank you for your comment, but I've refactored the code according to @o-nikolas's suggestion, so this is no longer relevant :(

@uranusjr - Can you please review the new implementation? 🙏 It's a bit simpler, and I made sure that I'm using structured calls for paths.


with NamedTemporaryFile(dir=local_path, prefix='airflow_tmp_', delete=False) as local_tmp_file:
if preserve_file_name:
local_dir = local_path if local_path else gettempdir()
@uranusjr (Member):

I'm a bit worried that using the temp dir directly with a predictable file name may cause a vulnerability. I don't have concrete examples, but the combination is sort of a red flag.

@alexkruc (Contributor Author):

WDYM?
I've tested it locally in the Airflow Docker image and in Breeze, and the gettempdir() method returns /tmp on all Linux environments.
When the dir parameter is not provided to NamedTemporaryFile, it is also called directly:
https://github.com/python/cpython/blob/0d68879104dfb392d31e52e25dcb0661801a0249/Lib/tempfile.py#L126

I don't quite understand why it may cause a vulnerability.

Do you think it's better to stay with the older implementation of renaming the file after it's already been created? I think this way is a bit cleaner, but I'm also OK with "reverting" to the old flow.

@alexkruc (Contributor Author):

Also, in all flows, the user can still have a file in S3 with a name that could cause a vulnerability, and it would later be saved in the temp directory generated by the same function (if we don't provide a local_path).

How is this different? Does it mean we can't implement this feature at all?

Member:

To follow up on @uranusjr's thought:

Say we are using LocalExecutor or CeleryExecutor, so two users' jobs can be executed on the same host.

Here you have filename_in_s3 = s3_obj.key.rsplit('/', 1)[-1]. So if user A has the file .../A/data.json and user B has .../B/data.json, there may be a conflict, right?

This is just a vague thought and very likely I missed something. Please feel free to point it out.

Member:

The risk can be greatly reduced if the file is put in a subdirectory instead of directly inside the temp directory root (so the full path the file is downloaded to remains unpredictable), but that may lead to additional cleanup issues since directories are more finicky than files. I'd be happy with it if it works, though.
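
A minimal sketch of that idea (purely illustrative; the prefix and file name are made up):

from pathlib import Path
from tempfile import mkdtemp

# mkdtemp() creates a uniquely named directory, so the full download path stays
# unpredictable even though the file inside keeps its original S3 name.
target = Path(mkdtemp(prefix="airflow_tmp_dir_")) / "data.json"
print(target)  # e.g. /tmp/airflow_tmp_dir_ab12cd34/data.json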

Contributor:

I do also like option 1; it solves many of the complexities you pointed out (or at least bubbles them up to the user) and also allows the user to create a predictable path, so it's probably my preference. But the user should be able to provide a full sub-path within tmp so that they can organize files with similar names to their preference.

Although option 2 would be perfectly serviceable as well.

@alexkruc (Contributor Author):

In the end, I did two things (see the sketch below):

  • Added a check whether the file already exists before writing it, failing the task if it does, to bubble the issue up to the user.
  • Added another parameter, use_autogenerated_subdir, which is True by default and creates a new sub-directory. The user can disable it to control the target file location, but it's on by default.
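
For reference, a rough sketch of how the resulting download flow could look (not the exact diff from this PR; preserve_file_name and use_autogenerated_subdir are the parameters discussed here, everything else is illustrative):

from pathlib import Path
from tempfile import NamedTemporaryFile, gettempdir
from uuid import uuid4


def download_to_local(s3_obj, local_path=None, preserve_file_name=False, use_autogenerated_subdir=True):
    if preserve_file_name:
        local_dir = local_path if local_path else gettempdir()
        # drop the file into a random sub-directory by default so the full path stays unpredictable
        subdir = f"airflow_tmp_dir_{uuid4().hex[:8]}" if use_autogenerated_subdir else ""
        filename_in_s3 = s3_obj.key.rsplit("/", 1)[-1]
        file_path = Path(local_dir, subdir, filename_in_s3)
        if file_path.is_file():
            # bubble the conflict up to the user instead of silently overwriting
            raise FileExistsError(f"{file_path} already exists")
        file_path.parent.mkdir(exist_ok=True, parents=True)
        file = open(file_path, "wb")
    else:
        file = NamedTemporaryFile(dir=local_path, prefix="airflow_tmp_", delete=False)
    with file:
        s3_obj.download_fileobj(file)
    return str(file.name)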

@o-nikolas @uranusjr @XD-DENG Will appreciate your review of the latest additions to this flow 🙏

@o-nikolas (Contributor):

Hey @alexkruc, thanks for sticking with this! The method stub is a little complicated now, but I think it's a decent middle ground given all the constraints that came up in the discussions here 👍

@alexkruc (Contributor Author):

@o-nikolas Thanks!
It seems like this PR is getting a bit stale. Do you think we should do anything else, or is it OK to approve and merge?

@o-nikolas (Contributor):

Hey @alexkruc,

I think this is good to merge, but unfortunately I'm not a committer. CCing @eladkal

@Taragolis (Contributor):

IMHO, this method creates more problems than it solves (it also has an issue with SSE encryption, #25835), and it seems its main purpose is for use in code such as PythonOperator (and the like); only one operator uses this method - S3ToMySqlOperator.

Also, it has the same name as boto3.client("s3").download_file but takes different input arguments, and all other community operators use the boto3 method via S3Hook(...).get_conn().download_file(...) or the equivalent method on the high-level S3 Resource.
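
For illustration, a minimal sketch of that pattern (the connection id, bucket, key, and target path are hypothetical):

from airflow.providers.amazon.aws.hooks.s3 import S3Hook

# Use the hook only for authentication / client creation, then call boto3 directly;
# boto3's download_file(Bucket, Key, Filename) gives full control over the target path.
client = S3Hook(aws_conn_id="aws_default").get_conn()
client.download_file("my-bucket", "path/to/data.json", "/tmp/data.json")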

@eladkal (Contributor) commented Oct 6, 2022

> only one operator uses this method

This is a hook capability, which means it can be used by many operators. Check the user issue: the use case was to use a custom operator with it.

If we are taking the deprecation path, we need to review other functions in the hook (I guess the same question can be raised for other functions as well).

@Taragolis (Contributor) commented Oct 6, 2022

> This is a hook capability, which means it can be used by many operators. Check the user issue: the use case was to use a custom operator with it.

Most of the user issues would be fixed if they avoided this method and just switched to the method provided by boto3 to do whatever they wanted.

In that case, the S3Hook would be used only for authentication and client creation. Everything else they can read in the boto3 documentation, rather than also trying to understand why it won't work in this method and why it behaves differently from download_file.

Again, it's just my IMHO, but I want to highlight that this might be the reason why other community-provided operators generally try to avoid using this method.

And another IMHO: better to just warn the user (UserWarning, docstring, provider documentation) that there is another way to achieve everything they want in their custom components, rather than struggling with this method.

@o-nikolas (Contributor):

> > This is a hook capability, which means it can be used by many operators. Check the user issue: the use case was to use a custom operator with it.
>
> Most of the user issues would be fixed if they avoided this method and just switched to the method provided by boto3 to do whatever they wanted.
>
> In that case, the S3Hook would be used only for authentication and client creation. Everything else they can read in the boto3 documentation, rather than also trying to understand why it won't work in this method and why it behaves differently from download_file.

Yeah, this is a constant balance in Airflow provider packages: the question of when a hook/operator adds enough value over just using the underlying service SDK. In this case the user gets some value from the file/location creation, so it's a hard call.

@Taragolis (Contributor):

I do not have any objections to this PR. My messages are more of a discussion and ideas, not related to the current PR.

My main objection about the S3Hook.download_file method: if it had initially been something like

    @provide_bucket_name
    @unify_bucket_name_and_key
    def download_file(
        self,
        key: str,
        bucket_name: str | None = None, 
        filename: str | None = None, 
        extra_args=None, 
        config=None
    ) -> None:
        self.conn.download_file(
            bucket_name, key, filename, ExtraArgs=extra_args, Callback=None, Config=config
        )

then we would not even be having this discussion and no modification would be required, because it would just be a simple thin wrapper around one boto3 S3 client method that resolves the different combinations of key and bucket name.

But we have what we have and we cannot turn back time. There is probably a lot of existing user code that uses this method; I can only guess how users use the current method and what they additionally want:

  • The user wants to save to a location mounted on persistent storage (an NFS share) and also wants to keep the prefix, not only the filename
  • The user wants to check whether the file already exists in the file system and raise an error
  • The user wants to add the object version to the filename
  • The user wants to delete the file at the end of the execute() method of their operator / callable function

My current idea: just inform users (at least in the docstring) that there might be a better way to solve their cases than trying to use this method.

Also, for a better user experience, we could add an additional static method to the hook for splitting an S3 URL into its parts; it might help users do specific things without writing extra code.

Sample of such a helper:

from typing import NamedTuple
from urllib.parse import urlparse


class S3UrlParts(NamedTuple):
    bucket: str
    key: str | None
    prefix: str | None
    parts: list[str]
    filename: str | None
    basename: str | None
    suffix: str | None
    suffixes: list[str]


def parse_url(url: str):
    if "://" not in url:
        raise ValueError(f"Invalid S3 URL: {url!r}")
    url_parts = urlparse(url)
    if url_parts.scheme != "s3":
        # Yeah, I know there are about 4 ways to define an S3 URL, not limited to s3://bucket/prefix/key.json
        raise ValueError(f"Invalid S3 URL (incorrect schema): {url!r}")
    elif not url_parts.netloc:
        raise ValueError(f"Invalid S3 URL (no bucket name): {url!r}")

    bucket = url_parts.netloc
    key = url_parts.path.lstrip("/")
    key_parts = key.rsplit("/", 1)
    if len(key_parts) == 2:
        prefix, filename = key_parts
    else:
        prefix, filename = None, key_parts[0]

    parts = []
    basename = None
    suffix = None
    suffixes = []
    if prefix:
        parts.extend(prefix.split("/"))
    if filename:
        parts.append(filename)

        file_parts = filename.split(".", 1)
        if len(file_parts) == 2:
            basename, suffix = file_parts
            suffixes = suffix.split(".")
        else:
            basename, suffix = file_parts[0], None

    return S3UrlParts(
        bucket,
        key=key,
        prefix=prefix or None,
        parts=parts,
        filename=filename or None,
        basename=basename,
        suffix=suffix,
        suffixes=suffixes
    )

Sample of "user" code

s3_urls = [
    "s3://awesome-bucket/this/is/path/to/awesome-data.parquet.snappy",
    "s3://awesome-bucket",
    "s3://awesome-bucket/",
    "s3://awesome-bucket/prefix/",
    "s3://awesome-bucket/key.json",
    "s3://awesome-bucket/key-with-no-extension",
    "s3://awesome-bucket/key-with-dot-in-the-end.",
]

for s3_url in s3_urls:
    print(f"S3 URL: {s3_url}")
    print(f"Parsed URL: {parse_url(s3_url)}")
    print()

Sample Output

S3 URL: s3://awesome-bucket/this/is/path/to/awesome-data.parquet.snappy
S3UrlParts(bucket='awesome-bucket', key='this/is/path/to/awesome-data.parquet.snappy', prefix='this/is/path/to', parts=['this', 'is', 'path', 'to', 'awesome-data.parquet.snappy'], filename='awesome-data.parquet.snappy', basename='awesome-data', suffix='parquet.snappy', suffixes=['parquet', 'snappy'])

S3 URL: s3://awesome-bucket
S3UrlParts(bucket='awesome-bucket', key='', prefix=None, parts=[], filename=None, basename=None, suffix=None, suffixes=[])

S3 URL: s3://awesome-bucket/
S3UrlParts(bucket='awesome-bucket', key='', prefix=None, parts=[], filename=None, basename=None, suffix=None, suffixes=[])

S3 URL: s3://awesome-bucket/prefix/
S3UrlParts(bucket='awesome-bucket', key='prefix/', prefix='prefix', parts=['prefix'], filename=None, basename=None, suffix=None, suffixes=[])

S3 URL: s3://awesome-bucket/key.json
S3UrlParts(bucket='awesome-bucket', key='key.json', prefix=None, parts=['key.json'], filename='key.json', basename='key', suffix='json', suffixes=['json'])

S3 URL: s3://awesome-bucket/key-with-no-extension
S3UrlParts(bucket='awesome-bucket', key='key-with-no-extension', prefix=None, parts=['key-with-no-extension'], filename='key-with-no-extension', basename='key-with-no-extension', suffix=None, suffixes=[])

S3 URL: s3://awesome-bucket/key-with-dot-in-the-end.
S3UrlParts(bucket='awesome-bucket', key='key-with-dot-in-the-end.', prefix=None, parts=['key-with-dot-in-the-end.'], filename='key-with-dot-in-the-end.', basename='key-with-dot-in-the-end', suffix='', suffixes=[''])

@alexkruc alexkruc requested a review from eladkal as a code owner October 11, 2022 09:03
:return: the file name.
:rtype: str
"""
self.log.info(
@alexkruc (Contributor Author):

@Taragolis I've added a log message here to make it clear that this function shadows boto3's method - hope that's fine :)

@eladkal (Contributor) left a comment

@uranusjr @XD-DENG any further comments?

@eladkal eladkal requested review from XD-DENG and uranusjr October 16, 2022 07:44
@alexkruc alexkruc requested review from uranusjr and removed request for XD-DENG October 18, 2022 11:16
@eladkal (Contributor) commented Oct 24, 2022

If there are no further concerns, I'll merge this PR so it will be available in the next provider wave.
@alexkruc can you please rebase and resolve conflicts?

@potiuk (Member) commented Oct 24, 2022

Unfortunately it needs a rebase/conflict resolution after the string normalization, @alexkruc.

@alexkruc (Contributor Author):

Hey @eladkal and @potiuk - I've rebased, and after the fix in #27258 (thanks for this! it drove me crazy yesterday 😄) the CI tests are passing.
So can we merge it? 🙏


Labels

area:providers provider:amazon AWS/Amazon - related issues

Development

Successfully merging this pull request may close these issues.

Json files from S3 downloading as text files

8 participants