Issues/54: on the multifile problem in spark-fits #55
…s referencing to each other
…paths as a list of inputs
…s will avoid large overhead.
For the record, the multifile problem was solved with a minimal change: instead of looping over the files and unioning the RDDs, I now give Spark the full list of files and it fills the partitions on its own... Just magic.
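To illustrate the change described above, here is a minimal PySpark-style sketch of the two patterns. It is not the spark-fits code: `join_paths`, `load_by_union` and `load_at_once` are hypothetical names, `sc` is assumed to be an existing `SparkContext`, and `textFile` stands in for the FITS reader only because it also accepts a comma-separated list of paths.

```python
from functools import reduce


def join_paths(paths):
    """Join file paths into one comma-separated string.

    Hadoop-style input APIs treat a comma as the path separator, so a path
    that itself contains a comma cannot be passed this way.
    """
    paths = list(paths)
    if not paths:
        raise ValueError("at least one input path is required")
    for p in paths:
        if "," in p:
            raise ValueError(f"path contains a comma: {p!r}")
    return ",".join(paths)


def load_by_union(sc, paths):
    """Previous pattern: one RDD per file, then union them pairwise."""
    rdds = [sc.textFile(p) for p in paths]  # textFile is a stand-in reader
    return reduce(lambda left, right: left.union(right), rdds)


def load_at_once(sc, paths):
    """New pattern: a single call with all paths; Spark plans the partitions."""
    return sc.textFile(join_paths(paths))
```

With the second pattern the driver issues one read call whatever the number of files, whereas the union pattern builds a lineage that grows with every file.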
Codecov Report
@@            Coverage Diff            @@
##           master     #55     +/-   ##
=========================================
- Coverage   89.52%   89.3%   -0.22%
=========================================
  Files           9       9
  Lines         487     477     -10
  Branches       87      88      +1
=========================================
- Hits          436     426     -10
  Misses         51      51
Probably the last error arises because of …
OK, the problem with far more than 10,000 input files seems deeper than expected.
What has changed?
This PR brings two major improvements:
How has this been tested?
The unit test suite passes, and additional integration tests were performed. All seems good, though I need to keep an eye on this.
Is there anything left?
Yes: for 20,000+ input files, the job fails with many errors raised at the same time (probably related to each other):
I need to investigate a bit more.