Implement full outer join #91
Pull Request Test Coverage Report for Build 327
💛 - Coveralls |
Pull Request Test Coverage Report for Build 334
💛 - Coveralls |
dataflows/processors/join.py
Outdated
db = KVFile()

# Joining mode
if mode is None:
If we go with `mode`, we should properly deprecate `full`.
(see notes below)
dataflows/processors/join.py
Outdated
deduplication = target_key is None
fields = fix_fields(fields)
source_key = KeyCalc(source_key)
target_key = KeyCalc(target_key) if target_key is not None else target_key
db_keys = collections.OrderedDict()
Generally, you should use a KVFile here for memory scalability.
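To illustrate the concern, here is a minimal sketch of tracking join keys on disk rather than in an in-memory `collections.OrderedDict`. It deliberately uses the standard library's `sqlite3` as a stand-in, not KVFile itself, because KVFile's API is not shown in this thread. `DiskKeySet` and its methods are hypothetical names, not part of dataflows.

```python
import os
import sqlite3
import tempfile


class DiskKeySet:
    """Hypothetical helper: an insertion-ordered set of string keys kept on disk.

    sqlite3 is used here only as a stand-in for a disk-backed store.
    """

    def __init__(self):
        self._dir = tempfile.TemporaryDirectory()
        self._conn = sqlite3.connect(os.path.join(self._dir.name, 'keys.db'))
        # `pos` preserves insertion order; UNIQUE makes repeated adds a no-op.
        self._conn.execute(
            'CREATE TABLE keys ('
            'pos INTEGER PRIMARY KEY AUTOINCREMENT, key TEXT UNIQUE)'
        )

    def add(self, key):
        self._conn.execute('INSERT OR IGNORE INTO keys (key) VALUES (?)', (key,))

    def discard(self, key):
        self._conn.execute('DELETE FROM keys WHERE key = ?', (key,))

    def __iter__(self):
        # Materialize nothing in memory beyond the current row.
        for (key,) in self._conn.execute('SELECT key FROM keys ORDER BY pos'):
            yield key

    def close(self):
        self._conn.close()
        self._dir.cleanup()
```

Memory use then stays roughly constant as the number of keys grows, at the cost of disk I/O on every operation.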
dataflows/processors/join.py
Outdated
            continue
        extra = dict(
            (k, row.get(k))
            for k in fields.keys()
        )
        row.update(extra)
        yield row
    if mode == 'full-outer':
        for key in db_keys:
            extra = db.get(key)
avoid code duplication...
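One possible way to address this is to route both the matched-row path and the full-outer tail through a single row-building helper. The sketch below is not the code in this PR: it works on plain dicts held in memory purely for illustration (so it ignores the memory-scalability point), and `build_row` and `join_rows` are hypothetical names.

```python
def build_row(base, extra, fields):
    """Return a copy of `base` with the joined `fields` filled in from `extra`."""
    row = dict(base)
    for name in fields:
        row[name] = extra.get(name) if extra is not None else None
    return row


def join_rows(source_rows, target_rows, key, fields, mode='inner'):
    """Join source fields into target rows (illustrative, in-memory only)."""
    # Index source rows by key; the last row wins, for brevity.
    db = {row[key]: row for row in source_rows}
    # Insertion-ordered set of source keys not yet matched by a target row.
    unmatched = dict.fromkeys(db)

    for row in target_rows:
        extra = db.get(row[key])
        if extra is None:
            if mode == 'inner':
                continue  # an inner join drops target rows without a match
        else:
            unmatched.pop(row[key], None)
        yield build_row(row, extra, fields)

    if mode == 'full-outer':
        # Emit source rows that no target row matched, via the same helper.
        for k in unmatched:
            yield build_row({key: k}, db[k], fields)
```

In this simplified version, rows emitted in the full-outer tail carry only the key and the joined fields; filling the remaining target columns with `None` is left out.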
dataflows/processors/join.py
Outdated
@@ -324,16 +350,16 @@ def func(package: PackageWrapper):
     return func


-def join(source_name, source_key, target_name, target_key, fields={}, full=True, source_delete=True):
-    return join_aux(source_name, source_key, source_delete, target_name, target_key, fields, full)
+def join(source_name, source_key, target_name, target_key, fields={}, full=True, mode=None, source_delete=True):
`full` should default to None and `mode` to 'inner'.
If `full` is not None, show a deprecation warning and update `mode` accordingly.
The same comment applies to the functions below as well.
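A minimal sketch of the deprecation path suggested in this thread might look like the following. `resolve_join_mode` is a hypothetical helper, and the mapping from the old flag (`full=True` to 'half-outer', `full=False` to 'inner') is an assumption about what `full` meant, not something stated here.

```python
import warnings

JOIN_MODES = ('inner', 'half-outer', 'full-outer')


def resolve_join_mode(full=None, mode='inner'):
    """Translate the deprecated `full` flag into a `mode` value."""
    if full is not None:
        warnings.warn(
            'The `full` flag is deprecated; use the `mode` parameter instead.',
            DeprecationWarning,
            stacklevel=2,
        )
        # Assumed mapping: full=True kept all target rows, full=False did not.
        mode = 'half-outer' if full else 'inner'
    if mode not in JOIN_MODES:
        raise ValueError('Unknown join mode: {!r}'.format(mode))
    return mode
```

Each public entry point that accepts `full` could call such a helper once and pass only the resolved `mode` onward, so the deprecation logic lives in one place.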
@akariv
This looks good @roll, thanks!
Hi @akariv,
This is a first attempt at implementing a full outer join mode, as requested in BCODMO/frictionless-usecases#12.
If adding this functionality makes sense, there are a few open questions:
- The `mode` parameter (`inner`/`half-outer`/`full-outer`), introduced because the `full` flag can't cover all the options and slightly contradicts SQL terminology.
- The `db_keys` cache: I'm not sure that keeping it in memory will be good enough for DPP.

Please take a look.