Issues/54: on the multifile problem in spark-fits #55

JulienPeloton · 2018-10-18T12:45:49Z

What has changed?

This PR brings two major improvements:

Solve the multifile problem reported in #54 (On the multifile problem in spark-fits).
Reduce a potentially large overhead when reading many files, caused by schema checks.
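The second point can be sketched as follows. This is a minimal Python illustration, not the spark-fits implementation: `check_schemas`, the `read_schema` callable, and the file names are hypothetical, and the limit of 10 files is the one quoted in the commit message of this PR. The idea is to compare only a bounded prefix of the inputs against the first file, so the cost of the check no longer grows with the number of files.

```python
from typing import Callable, List, Sequence, Tuple

Schema = Tuple[Tuple[str, str], ...]  # (column name, type) pairs


def check_schemas(
    paths: Sequence[str],
    read_schema: Callable[[str], Schema],
    max_files: int = 10,
) -> Schema:
    """Return the common schema, inspecting at most `max_files` files."""
    if not paths:
        raise ValueError("no input files")
    reference = read_schema(paths[0])
    # Only a bounded prefix is opened; the remaining files are trusted.
    for path in paths[1:max_files]:
        if read_schema(path) != reference:
            raise ValueError(f"schema of {path} differs from {paths[0]}")
    return reference


# Usage with a fake reader that records which files it opens.
opened: List[str] = []


def fake_reader(path: str) -> Schema:
    opened.append(path)
    return (("ra", "double"), ("dec", "double"))


schema = check_schemas([f"file_{i}.fits" for i in range(20000)], fake_reader)
print(len(opened))  # number of files actually inspected
```

The trade-off is that a file with a different schema beyond the inspected prefix is not caught by this check.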

How has this been tested?

The unit test suite passes, and additional integration tests were performed. All seems good, though I need to keep an eye on this.

Is there anything left?

Yes. For 20,000+ input files, the job fails, throwing many errors at the same time (probably related to each other):

java.lang.AssertionError:
        HDU number 0 does not exist!

java.lang.ArithmeticException: / by zero

java.util.NoSuchElementException: key not found: BITPIX

org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: BP-1151505977-134.158.75.222-1469858775214:blk_1075726089_1986406

... (and then it loops over those four)

I need to investigate a bit more.

JulienPeloton · 2018-10-18T12:47:42Z

For the record, the multifile problem was solved with a minimal change: instead of looping over files and unioning RDDs, I now give Spark the full list of files and it fills the partitions on its own... Just magic.
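To illustrate why this helps, here is a toy model in plain Python. It is not Spark code: the `Node` class, the helper functions, and the paths are hypothetical names introduced for illustration. Unioning files one by one in a loop builds a lineage whose depth grows with the number of files, whereas a single flat union, or a single read over one comma-separated path string, keeps it constant.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Toy stand-in for an RDD: a name plus the parents it was built from."""
    name: str
    parents: List["Node"] = field(default_factory=list)

    def depth(self) -> int:
        # Depth of the lineage graph below this node (a leaf has depth 1).
        return 1 + max((p.depth() for p in self.parents), default=0)


def chained_union(leaves: List[Node]) -> Node:
    """Union files one by one: each step wraps the previous union."""
    acc = leaves[0]
    for leaf in leaves[1:]:
        acc = Node("union", [acc, leaf])
    return acc


def flat_union(leaves: List[Node]) -> Node:
    """Union all files at once: a single node with every leaf as parent."""
    return Node("union", list(leaves))


def comma_separated(paths: List[str]) -> str:
    """Build the single comma-separated path string for one read call."""
    return ",".join(paths)


paths = [f"hdfs:///data/file_{i}.fits" for i in range(100)]  # hypothetical paths
leaves = [Node(p) for p in paths]
print(chained_union(leaves).depth())  # grows with the number of files
print(flat_union(leaves).depth())     # independent of the number of files
```

In this model the chained version nests one union per additional file, which is the pattern the first commit of this PR avoids.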

codecov-io · 2018-10-18T12:48:58Z

Codecov Report

Merging #55 into master will decrease coverage by 0.21%.
The diff coverage is 90.9%.

@@            Coverage Diff            @@
##           master     #55      +/-   ##
=========================================
- Coverage   89.52%   89.3%   -0.22%     
=========================================
  Files           9       9              
  Lines         487     477      -10     
  Branches       87      88       +1     
=========================================
- Hits          436     426      -10     
  Misses         51      51

| Impacted Files | Coverage Δ | |
|---|---|---|
| ...strolabsoftware/sparkfits/FitsSourceRelation.scala | `97.36% <90.9%> (-0.31%)` | ⬇️ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update dff168a...6789c37.

JulienPeloton · 2018-10-18T13:14:23Z

Probably the last error arises because of ulimit -n being too small (1024). Need to investigate.
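A quick way to check this hypothesis is to read the open-file limit from inside a process. The snippet below is a minimal sketch using Python's standard `resource` module (Unix only); it reports the limits of the process it runs in, so it would need to run on the machines that actually hit the error, not only on the driver.

```python
import resource

# Soft and hard limits on open file descriptors for the current process.
# The soft limit is the value that `ulimit -n` reports in a shell.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)


def describe(limit: int) -> str:
    """Render a limit, spelling out the 'no limit' sentinel."""
    return "unlimited" if limit == resource.RLIM_INFINITY else str(limit)


print(f"open files: soft={describe(soft)} hard={describe(hard)}")
```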

JulienPeloton · 2018-10-19T11:21:56Z

OK, the problem with >> 10,000 files seems deeper than expected.
I will merge this PR since it solves a large part of the problem, and open a separate one for this remaining issue.

JulienPeloton added 6 commits October 17, 2018 20:53

Call sparkContext.union instead of rdd.union to avoid nested UnionRDDs referencing to each other

f72853f

Union directly at a higher level

98df932

Capitalize on the fact that newAPIHadoopFile accepts comma separated paths as a list of inputs

5efa50f

Reorganise FitsRelation, and simplify its structure

9453f53

Update code documentation with recent changes

84b4eb5

Check the schema of the 10 first files only in case of multifile. This will avoid large overhead.

6789c37

JulienPeloton added the bug, enhancement, and IO labels Oct 18, 2018

JulienPeloton self-assigned this Oct 18, 2018

JulienPeloton mentioned this pull request Oct 18, 2018

On the multifile problem in spark-fits #54

Open

JulienPeloton merged commit 2726caa into master Oct 19, 2018

JulienPeloton mentioned this pull request Oct 19, 2018

Add FITS header check as a user option #56

Open

JulienPeloton deleted the dag-scheduler-event-loop-Fix branch January 7, 2019 10:24

JulienPeloton mentioned this pull request May 14, 2019

.option('columns', 'col1,col2,col3,col4') does not preserve order #76

Closed
