Optimize StatsAnalyzer 🏁 #690

Merged
lewfish merged 5 commits into develop from lf/fast-analyze on Feb 21, 2019

lewfish commented Feb 19, 2019

Overview

This PR speeds up the StatsAnalyzer, which computes the mean and standard deviation of each channel in the dataset. It does so by reading each chip once rather than 2 * nb_channels times, switching away from npstreams, and adding an option to approximate the stats from a random subset of chips. It also adds unit tests for the StatsAnalyzer. Example:

        analyzer = rv.AnalyzerConfig.builder(rv.STATS_ANALYZER) \
                                    .with_sample_prob(0.01) \
                                    .build()

        experiment = rv.ExperimentConfig.builder() \
                                        .with_id('potsdam-seg') \
                                        .with_analyzer(analyzer)
        # ... remaining builder calls are omitted in the original example

Notes

I considered making some sample_prob the default, but decided against it because I don't yet know a good value, and the right value could depend on the average size of a scene. On the other hand, it seems like the fast option should be the default. Even in sliding window mode (i.e. with no sample_prob), it still runs about 6x faster.

Another caveat is that the sample_prob option will produce greater approximation error when the scenes contain a large amount of NODATA.
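To make the NODATA caveat concrete, here is a minimal sketch of how one might measure the fraction of pixels that actually carry data. It assumes a pixel counts as NODATA when every channel equals a single sentinel value (0 here); that convention and the helper names `valid_pixels` and `valid_fraction` are illustrative assumptions, not code from this PR.

```python
import numpy as np


def valid_pixels(chip, nodata_value=0):
    """Return the (n, nb_channels) array of pixels that are not NODATA.

    Assumption: a pixel is NODATA when every channel equals nodata_value.
    """
    pixels = chip.reshape(-1, chip.shape[-1])
    is_nodata = np.all(pixels == nodata_value, axis=1)
    return pixels[~is_nodata]


def valid_fraction(chips, nodata_value=0):
    """Fraction of pixels across all chips that carry data."""
    total = sum(chip.shape[0] * chip.shape[1] for chip in chips)
    valid = sum(valid_pixels(chip, nodata_value).shape[0] for chip in chips)
    return valid / total
```

If each chip is kept independently with probability sample_prob, the expected number of useful pixels is sample_prob times the number of valid pixels, so a low valid fraction shrinks the effective sample and the estimate becomes noisier.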

Testing

The unit tests exercise the new functionality. I also benchmarked it on two Potsdam scenes: using sample_prob gave a huge speedup (> 30x) with only a small difference in the computed stats.

Using develop:
time 2:07

{"stds": [35.870648282240516, 35.32982639561334, 36.702463479669596, 37.38938087670711], "means": [85.95280349150948, 92.19038055254074, 85.52430486264436, 96.83653263423753]}

Using this branch with no sample_prob:
time 0:21

{"stds": [35.870648282240516, 35.32982639561334, 36.702463479669596, 37.38938087670711], "means": [85.95280349150948, 92.19038055254074, 85.52430486264436, 96.83653263423753]}

Using this branch with sample_prob=0.01:
time 0:03

{"stds": [36.17572923393826, 35.857899881444986, 37.11680588294381, 36.18744804702934], "means": [89.91624927520753, 96.44924354553223, 87.8176760673523, 111.6387071609497]}
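For reference, the speedups and the size of the approximation error can be derived directly from the numbers reported above. The snippet below is just that arithmetic; the variable and function names are made up for illustration.

```python
def to_seconds(mm_ss):
    """Convert a 'minutes:seconds' string such as '2:07' to seconds."""
    minutes, seconds = mm_ss.split(':')
    return int(minutes) * 60 + int(seconds)


# Timings reported in the benchmark above.
develop = to_seconds('2:07')
no_sampling = to_seconds('0:21')
sampled = to_seconds('0:03')

speedup_no_sampling = develop / no_sampling
speedup_sampled = develop / sampled

# Per-channel means reported above: exact (no sample_prob) vs. sample_prob=0.01.
exact_means = [85.95280349150948, 92.19038055254074,
               85.52430486264436, 96.83653263423753]
sampled_means = [89.91624927520753, 96.44924354553223,
                 87.8176760673523, 111.6387071609497]

# Relative difference of each sampled mean from the exact mean.
rel_diff = [abs(s - e) / e for e, s in zip(exact_means, sampled_means)]

print(f'speedup without sampling: {speedup_no_sampling:.1f}x')
print(f'speedup with sampling: {speedup_sampled:.1f}x')
print('relative difference in means:', [f'{d:.1%}' for d in rel_diff])
```

This works out to roughly 6x without sampling and roughly 42x with sample_prob=0.01, with the sampled means differing from the exact ones by a few percent on the first three channels and about 15% on the fourth.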

Closes #648

lewfish added some commits Feb 19, 2019

Speed up analyze command
* Stop using npstreams and update mean and var manually
* Only get each chip once instead of 2 * nb_channels
* Add option to randomly sample a subset of chips
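The three changes in this commit can be illustrated with a small, self-contained sketch. This is not the code from the PR: `RunningChannelStats` and `compute_stats` are hypothetical names, and the approach shown (accumulating a per-channel count, sum, and sum of squares in a single pass, with optional random chip sampling) is one plausible way to update the mean and variance manually while reading each chip only once.

```python
import numpy as np


class RunningChannelStats:
    """Accumulate per-channel count, sum, and sum of squares over chips."""

    def __init__(self, nb_channels):
        self.count = 0
        self.total = np.zeros(nb_channels, dtype=np.float64)
        self.total_sq = np.zeros(nb_channels, dtype=np.float64)

    def update(self, chip):
        # chip has shape (height, width, nb_channels); flatten it to
        # (nb_pixels, nb_channels) and cast to float to avoid overflow.
        pixels = chip.reshape(-1, chip.shape[-1]).astype(np.float64)
        self.count += pixels.shape[0]
        self.total += pixels.sum(axis=0)
        self.total_sq += (pixels ** 2).sum(axis=0)

    def means(self):
        return self.total / self.count

    def stds(self):
        # Population variance E[x^2] - E[x]^2, clipped at zero to absorb
        # floating point rounding error.
        var = self.total_sq / self.count - self.means() ** 2
        return np.sqrt(np.maximum(var, 0.0))


def compute_stats(chips, sample_prob=None, seed=0):
    """Visit each chip once; optionally keep each with probability sample_prob."""
    rng = np.random.default_rng(seed)
    stats = None
    for chip in chips:
        if sample_prob is not None and rng.random() >= sample_prob:
            continue  # skip this chip; it is not part of the random subset
        if stats is None:
            stats = RunningChannelStats(chip.shape[-1])
        stats.update(chip)
    if stats is None:
        raise ValueError('no chips were sampled')
    return stats.means(), stats.stds()
```

Every channel is updated from the same chip in one call to `update`, so the number of chip reads no longer grows with the number of channels. The sum-of-squares formula is simple but can lose precision when the mean is large relative to the spread; Welford-style updates are a common alternative in that case.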

@lewfish lewfish added the review label Feb 19, 2019

@lewfish lewfish changed the title WIP: Speed up analyze WIP: Optimize StatsAnalyzer Feb 19, 2019

@lewfish lewfish changed the title WIP: Optimize StatsAnalyzer Optimize StatsAnalyzer 🏁 Feb 19, 2019

codecov bot commented Feb 19, 2019

Codecov Report

Merging #690 into develop will increase coverage by 0.23%.
The diff coverage is 97.26%.

@@            Coverage Diff             @@
##           develop    #690      +/-   ##
==========================================
+ Coverage    71.27%   71.5%   +0.23%     
==========================================
  Files          171     171              
  Lines         8211    8260      +49     
==========================================
+ Hits          5852    5906      +54     
+ Misses        2359    2354       -5

| Impacted Files | Coverage Δ |
| --- | --- |
| rastervision/analyzer/stats_analyzer.py | 100% <100%> (+44.44%) ⬆️ |
| rastervision/core/raster_stats.py | 100% <100%> (ø) ⬆️ |
| rastervision/analyzer/stats_analyzer_config.py | 79.36% <90.47%> (+8.53%) ⬆️ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update f62810a...396e8e3.

@lewfish lewfish merged commit 16d7f5b into develop Feb 21, 2019

2 checks passed

continuous-integration/travis-ci/pr: The Travis CI build passed
continuous-integration/travis-ci/push: The Travis CI build passed

@lewfish lewfish deleted the lf/fast-analyze branch Feb 21, 2019

@lewfish lewfish removed the review label Feb 21, 2019

ljvmiranda921 commented Feb 22, 2019

Thank you for this, this has been really helpful!
