added ETL #120
Codecov Report
@@            Coverage Diff             @@
##           master     #120      +/-   ##
==========================================
- Coverage   85.80%   85.79%   -0.02%
==========================================
  Files          60       60
  Lines        5559     5560       +1
==========================================
  Hits         4770     4770
- Misses        789      790       +1
I am struggling to understand what |
@Giulio2002
Can you explain what it does? My gut feeling is that you're mixing up a heap with a stack or a queue.
ETL uses external merge sort to put all the entries in sorted order by key. The file provider class just helps with reading each entry. What it does is keep at least one entry from each file in the heap (which I have now replaced with a priority queue) at all times, and gradually put the entries in the database as they come out sorted. The |
The load method is the equivalent of the LoadFunc in the Golang implementation; it is simply one component of the original design. In ETL, this function is called just before each entry is put in the DB, so it can modify the entry right before it is written, or do some additional work.
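To make the two replies above concrete, here is a minimal C++ sketch of the merge phase. It is not the code in this PR: it assumes in-memory vectors (`SortedRun`) in place of the sorted temporary files and a plain vector (`Database`) in place of the DB, and `Entry`, `QueueItem`, `GreaterKey`, `merge_runs` and the `LoadFunc` signature are hypothetical names chosen for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <queue>
#include <string>
#include <vector>

struct Entry {
    std::string key;
    std::string value;
};

// Hypothetical stand-ins: a SortedRun plays the role of one sorted temporary
// file, and Database plays the role of the real key-value store.
using SortedRun = std::vector<Entry>;
using Database = std::vector<Entry>;

// Hypothetical load hook: called once per entry, in key order, and free to
// transform the entry (or skip it) before anything is stored.
using LoadFunc = std::function<void(const Entry&, Database&)>;

struct QueueItem {
    Entry entry;
    std::size_t run;  // index of the run this entry was read from
};

struct GreaterKey {
    bool operator()(const QueueItem& a, const QueueItem& b) const {
        return a.entry.key > b.entry.key;  // turns the queue into a min-heap on key
    }
};

Database merge_runs(const std::vector<SortedRun>& runs, const LoadFunc& load) {
    assert(static_cast<bool>(load) && "load callback must be set");

    std::priority_queue<QueueItem, std::vector<QueueItem>, GreaterKey> queue;
    std::vector<std::size_t> next(runs.size(), 0);  // read cursor per run

    // Seed the queue with the first entry of every non-empty run.
    for (std::size_t i = 0; i < runs.size(); ++i) {
        if (!runs[i].empty()) {
            queue.push(QueueItem{runs[i][0], i});
            next[i] = 1;
        }
    }

    Database db;
    while (!queue.empty()) {
        const QueueItem& smallest = queue.top();  // no copy of the entry
        load(smallest.entry, db);
        const std::size_t run = smallest.run;
        queue.pop();  // `smallest` must not be used after this line

        // Refill from the run we just consumed, so every run that still has
        // data keeps exactly one entry in the queue.
        if (next[run] < runs[run].size()) {
            queue.push(QueueItem{runs[run][next[run]], run});
            ++next[run];
        }
    }
    return db;
}
```

A caller would pass a lambda as the hook, for example one that simply appends each entry to the database. Because every run is already sorted and each one keeps a single entry in the queue, the queue never holds more than one item per run, which is what keeps memory use bounded during the merge.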
Thank you so much for the explanation but, forgive me for being so dumb, after your last commits I'm even more puzzled (if possible) :D Let me recap, and please correct me where I'm wrong:
Am I right so far? Now come my problems in understanding.
The first problem I see is you Now you create a priority queue, populating it with one read obtained from each data_provider, expecting the queue to have the smallest
to this
Even in this case, however, we do not know whether data_provider has returned an empty item because EOF was reached (so effectively there is no data to process) or because some other error has occurred. But besides this you will notice that
but At this point my understanding is you want to keep processing the smallest possible
Then, in the bottom loop, you pick the smallest key by copy and immediately remove it from the queue. There is no need: simply defer the removal until you no longer need the reference. Other caveats:
|
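On the EOF-versus-error point raised in the review above, one possible way to make the two cases distinguishable is sketched below. This is only an illustration under assumptions: `read_entry` and `Entry` are hypothetical names, and the one-record-per-line `key<TAB>value` text format is invented for the example and is not the on-disk format used by this PR. A clean end of input yields an empty `std::optional`, while anything else throws.

```cpp
#include <cassert>
#include <istream>
#include <optional>
#include <sstream>
#include <stdexcept>
#include <string>

struct Entry {
    std::string key;
    std::string value;
};

// Hypothetical reader for an invented "key<TAB>value" line format.
// Returns std::nullopt only on a clean end of input; throws on I/O failure
// or on a malformed record, so the caller never has to guess why an
// "empty" item came back.
std::optional<Entry> read_entry(std::istream& in) {
    std::string line;
    if (!std::getline(in, line)) {
        if (in.eof()) {
            return std::nullopt;  // nothing left to read: not an error
        }
        throw std::runtime_error("I/O error while reading entry");
    }
    const std::string::size_type tab = line.find('\t');
    if (tab == std::string::npos) {
        throw std::runtime_error("malformed entry: missing separator");
    }
    return Entry{line.substr(0, tab), line.substr(tab + 1)};
}
```

With this shape, the code that seeds or refills the priority queue can simply skip a provider that returns no value, and a real failure surfaces as an exception instead of being mistaken for exhausted input.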
…if no data to return
CLang format
No description provided.