Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature/keep tgz download #7886

Merged
merged 43 commits into from Oct 27, 2020

Conversation

memsharded
Copy link
Member

@memsharded memsharded commented Oct 15, 2020

Changelog: Feature: Download and keep the conan_package.tgz in the cache, so they are not affected by different Operating Systems compression and de-compression and uploading is way more efficient.
Docs: Omit

The same for other .tgz files will follow in separate PR in: #7938

#tags: slow
#revisions: 1

@memsharded memsharded marked this pull request as draft Oct 15, 2020
@memsharded memsharded requested a review from czoido Oct 26, 2020
@memsharded memsharded added this to the 1.31 milestone Oct 26, 2020
@memsharded memsharded marked this pull request as ready for review Oct 26, 2020
@memsharded memsharded requested a review from jgsogo Oct 26, 2020
czoido
czoido approved these changes Oct 26, 2020
conans/client/cmd/copy.py Outdated Show resolved Hide resolved
conans/client/migrations.py Outdated Show resolved Hide resolved
mkdir(package_folder) # Just in case it doesn't exist, because uncompress did nothing
for file_name, file_path in zipped_files.items(): # copy CONANINFO and CONANMANIFEST
os.rename(file_path, os.path.join(package_folder, file_name))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we moving these files now?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because they are being downloaded to the "download" folder, and these files need to be in the final package folder. There were 2 options:

  • Download to the "package folder", then move the .tgz to other place (the download folder)
  • Download to the "download folder", leave the .tgz there and move the other files to the package folder.

I have decided the second one, to don't move the "heavy" .tgz file.

package_checksums = calc_files_checksum(zipped_files)
download_pkg_folder = layout.download_package(pref)
# Download files to the pkg_tgz folder, not to the final one
zipped_files = self._call_remote(remote, "get_package", pref, download_pkg_folder)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Somewhere around these lines, we need to set/unset the dirty flag to the downloaded file, it can fail during the download. I think the get_package call is only assigning the dirty flag to the package folder.
I see we are no longer removing the folder in the except BaseException

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is the thing, as long as the whole package_folder is protected with a "dirty" flag, everything inside is dirty. There was not dirty flag for this download before, I would say it is not necessary. Probably the other dirty flags over individual .tgz files could also be removed, but I didn't want to change more things in this PR.

The removal of the folder also doesn't make sense, and apparently no test is covering it. If we have the "dirty" mechanism in place to protect interrupted processes, then it shouldn't be necessary to try to remove the folder (in some cases, it could even be annoying, as it won't let debug)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haven't tried it, but this is my concern:

  • Download starts
    • package_folder marked as dirty
  • Download fails
    • package_folder mark is still there
    • We have a broken file in <layout>/download_folder/pid/xxxxxx.tgz (not dirty)
  • Upload the package: lines in the uploader are checking the dirty for the tgz file, not the package_folder, will it upload a broken file?
         download_pkg_folder = layout.download_package(pref)
         package_tgz = os.path.join(download_pkg_folder, PACKAGE_TGZ_NAME)
         if is_dirty(package_tgz):
             self._output.warn("%s: Removing %s, marked as dirty" % (str(pref), PACKAGE_TGZ_NAME))
             os.remove(package_tgz)
             clean_dirty(package_tgz)
        ... 

Without running the test I'm not sure if this is something that can happen or not.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code is doing both:

    def _compress_package_files(self, layout, pref, integrity_check):
        t1 = time.time()
        if layout.package_is_dirty(pref):
            raise ConanException("Package %s is corrupted, aborting upload.\n"
                                 "Remove it with 'conan remove %s -p=%s'"
                                 % (pref, pref.ref, pref.id))

        download_pkg_folder = layout.download_package(pref)
        package_tgz = os.path.join(download_pkg_folder, PACKAGE_TGZ_NAME)
        if is_dirty(package_tgz):
            self._output.warn("%s: Removing %s, marked as dirty" % (str(pref), PACKAGE_TGZ_NAME))
            os.remove(package_tgz)
            clean_dirty(package_tgz)

First, it will raise if the package is dirty (as it doesn't know how to recover from that). If only the .tgz is dirty, it can try to remove it and re-compress.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test in test_upload_dirty() covers the case where the package is marked as dirty.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, we are covered with that raise when the package_folder is dirty. 👍

Copy link
Member Author

@memsharded memsharded left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed the PR:

  • Removed the non related parts of the migration
  • Fixed the copy (only affected package-ids)

Please check the comments

mkdir(package_folder) # Just in case it doesn't exist, because uncompress did nothing
for file_name, file_path in zipped_files.items(): # copy CONANINFO and CONANMANIFEST
os.rename(file_path, os.path.join(package_folder, file_name))
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because they are being downloaded to the "download" folder, and these files need to be in the final package folder. There were 2 options:

  • Download to the "package folder", then move the .tgz to other place (the download folder)
  • Download to the "download folder", leave the .tgz there and move the other files to the package folder.

I have decided the second one, to don't move the "heavy" .tgz file.

package_checksums = calc_files_checksum(zipped_files)
download_pkg_folder = layout.download_package(pref)
# Download files to the pkg_tgz folder, not to the final one
zipped_files = self._call_remote(remote, "get_package", pref, download_pkg_folder)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is the thing, as long as the whole package_folder is protected with a "dirty" flag, everything inside is dirty. There was not dirty flag for this download before, I would say it is not necessary. Probably the other dirty flags over individual .tgz files could also be removed, but I didn't want to change more things in this PR.

The removal of the folder also doesn't make sense, and apparently no test is covering it. If we have the "dirty" mechanism in place to protect interrupted processes, then it shouldn't be necessary to try to remove the folder (in some cases, it could even be annoying, as it won't let debug)

jgsogo
jgsogo approved these changes Oct 27, 2020
@memsharded memsharded merged commit 20e758b into conan-io:develop Oct 27, 2020
2 checks passed
@memsharded memsharded deleted the feature/keep_tgz_download branch Oct 27, 2020
@mietzen
Copy link

mietzen commented Nov 25, 2020

@memsharded Is it possible to turn this feature of? Some of our Devs don't want to store redundant data on their drives.

@memsharded
Copy link
Member Author

Hi @ngerke

At the moment it is not possible. Partly, because this is a bit more than a feature, but it is also fixing some bugs. Read the rationale in the blog post: https://blog.conan.io/2020/11/05/New-conan-release-1-31.html.

Could you please give an order of magnitude of your developer caches sizes? Trying to understand a bit the pain and size of the issue.

In any case, the solution is likely to be based more on the ongoing work with the "LRU", and new features in order to keep the cache size limited. Have a look at: #7980

@mietzen
Copy link

mietzen commented Nov 25, 2020

Our main product has a pkg size of ~ 2 GB (worst case) the SDK pkg ~ 300 MB. Also we have a other pkgs (e.g. Qt) with ~ 300MB if you store multiple versions of the pkgs, cache size of 20-30 GB are not unusual (Without tar.gz pkgs).
I will have a look at "LRU", is there any documentation about it?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants