Conserve disk space when dealing with raster files #692

Merged
lewfish merged 3 commits into develop from lf/save-space2 on Feb 26, 2019

lewfish commented Feb 22, 2019

Overview

Currently, we download all raster files up front, which can exhaust disk space on large datasets. This PR avoids that by downloading each raster individually (on activation), using it, and then deleting it (on deactivation). With this change, you shouldn't run out of disk space as long as the biggest scene fits on disk. However, if multiple jobs are running concurrently on a multi-core instance, you could still run out of disk space. We should keep that in mind when deciding how much disk space to allocate.
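The activate/use/deactivate cycle described above can be sketched as a context manager. This is only an illustration under stated assumptions: `ActivatedRaster` and `process_scenes` are hypothetical names, not Raster Vision's actual classes, and `shutil.copy` stands in for the real download step.

```python
import shutil
import tempfile
from pathlib import Path


class ActivatedRaster:
    """Hypothetical stand-in for a raster source; not Raster Vision's actual class."""

    def __init__(self, uri: str):
        self.uri = uri
        self._tmp_dir = None
        self.local_path = None

    def __enter__(self):
        # Activation: fetch the raster into a fresh temporary directory.
        # shutil.copy stands in for a real download (e.g. from S3).
        self._tmp_dir = tempfile.TemporaryDirectory()
        self.local_path = Path(self._tmp_dir.name) / Path(self.uri).name
        shutil.copy(self.uri, self.local_path)
        return self

    def __exit__(self, exc_type, exc, tb):
        # Deactivation: delete the local copy, even if processing failed.
        self._tmp_dir.cleanup()
        self._tmp_dir = None
        self.local_path = None
        return False


def process_scenes(uris):
    """Process one raster at a time so only one local copy exists at once."""
    sizes = []
    for uri in uris:
        with ActivatedRaster(uri) as raster:
            # Placeholder for real work such as chipping or prediction.
            sizes.append(raster.local_path.stat().st_size)
    return sizes
```

With this pattern, a single job's peak disk usage is bounded by its largest raster rather than by the whole dataset. Several jobs running at once would each hold their own copy, which is why concurrent jobs can still fill the disk.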

Notes

An unfortunate consequence of this PR is that we now have to download each raster twice: once when we read metadata and a test chip in the RasterioRasterSource constructor, and again when we actually use it for chipping, prediction, etc. To get around this, I tried to read the metadata and test chip directly off S3 in the constructor using GDAL/Rasterio's ability to do this, but ran into the problems documented in #691. I'm hoping this won't be a big performance issue, since downloads from S3 are fast on EC2.

Testing Instructions

  • I tested the Vegas workflow with local and remote data.
  • In progress: I tested the analyze command remotely using the IDB dataset duplicated 4 times, which has 60GB+ of imagery, on a 50GB disk. (25 mins)
  • In progress: I compared times for running analyze on the IDB dataset with this branch (10 mins) and develop (8 mins).

Closes #689

@lewfish lewfish added the review label Feb 22, 2019

codecov bot commented Feb 22, 2019

Codecov Report

Merging #692 into develop will increase coverage by 0.03%.
The diff coverage is 100%.

@@             Coverage Diff             @@
##           develop     #692      +/-   ##
===========================================
+ Coverage     71.5%   71.53%   +0.03%     
===========================================
  Files          171      171              
  Lines         8260     8269       +9     
===========================================
+ Hits          5906     5915       +9     
  Misses        2354     2354

Impacted Files                                      Coverage Δ
rastervision/data/raster_source/image_source.py     100% <100%> (ø) ⬆️
rastervision/command/command.py                     83.33% <100%> (+2.08%) ⬆️
rastervision/data/raster_source/rasterio_source.py  93.84% <100%> (+0.74%) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 16d7f5b...f637696. Read the comment docs.

codecov bot commented Feb 22, 2019

Codecov Report

Merging #692 into develop will increase coverage by 0.03%.
The diff coverage is 97.64%.

@@             Coverage Diff             @@
##           develop     #692      +/-   ##
===========================================
+ Coverage     71.5%   71.53%   +0.03%     
===========================================
  Files          171      171              
  Lines         8260     8269       +9     
===========================================
+ Hits          5906     5915       +9     
  Misses        2354     2354

Impacted Files                                      Coverage Δ
rastervision/command/command.py                     83.33% <100%> (+2.08%) ⬆️
rastervision/data/raster_source/rasterio_source.py  93.84% <100%> (+0.74%) ⬆️
rastervision/core/raster_stats.py                   100% <100%> (ø) ⬆️
rastervision/data/raster_source/image_source.py     100% <100%> (ø) ⬆️
rastervision/analyzer/stats_analyzer.py             100% <100%> (ø) ⬆️
rastervision/analyzer/stats_analyzer_config.py      79.36% <90.47%> (ø) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 16d7f5b...05554ac. Read the comment docs.

lewfish added some commits Feb 22, 2019

Retain reference to tmp_dir object
Otherwise the tmp_dir that is returned will not exist when this method returns
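
The commit message above describes a pitfall that is easy to reproduce, assuming `tmp_dir` is a `tempfile.TemporaryDirectory` (the PR text does not confirm this). Such an object deletes its directory when it is cleaned up or garbage-collected, so returning only its path can leave the caller with a path that no longer exists. In this minimal sketch, `broken_get_tmp_dir` and `TmpDirHolder` are hypothetical names used for illustration only.

```python
import os
import tempfile


def broken_get_tmp_dir():
    # The TemporaryDirectory object becomes unreferenced once this function
    # returns, so its finalizer may delete the directory right away
    # (on CPython this typically happens immediately).
    tmp_dir = tempfile.TemporaryDirectory()
    return tmp_dir.name


class TmpDirHolder:
    """Hypothetical holder that keeps the TemporaryDirectory object alive."""

    def __init__(self):
        self._tmp_dir = None

    def get_tmp_dir(self):
        # Retaining the object on self keeps the directory on disk
        # until cleanup() is called or the holder itself is discarded.
        self._tmp_dir = tempfile.TemporaryDirectory()
        return self._tmp_dir.name

    def cleanup(self):
        if self._tmp_dir is not None:
            self._tmp_dir.cleanup()
            self._tmp_dir = None


holder = TmpDirHolder()
path = holder.get_tmp_dir()
print(os.path.isdir(path))  # the directory is still there while holder lives
holder.cleanup()
```

Keeping the reference also gives a natural place to delete the directory explicitly, which fits the download-on-activation, delete-on-deactivation approach of this PR.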

@lewfish lewfish force-pushed the lf/save-space2 branch from 05554ac to eb0a1fc Feb 22, 2019

@lewfish lewfish merged commit ddac0a0 into develop Feb 26, 2019

1 of 2 checks passed

continuous-integration/travis-ci/pr: The Travis CI build failed
continuous-integration/travis-ci/push: The Travis CI build passed

@lewfish lewfish deleted the lf/save-space2 branch Feb 26, 2019

@lewfish lewfish removed the review label Feb 26, 2019

@lewfish lewfish changed the title from "WIP: Conserve disk space when dealing with raster files" to "Conserve disk space when dealing with raster files" Mar 3, 2019
