Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement full outer join #91

Merged
merged 9 commits into from
Jul 2, 2019
Merged

Implement full outer join #91

merged 9 commits into from
Jul 2, 2019

Conversation

roll
Copy link
Contributor

@roll roll commented May 30, 2019

Hi @akariv,

It's a first attempt to implement full outer join mode as asked here BCODMO/frictionless-usecases#12

If adding this functionality makes sense there are a few questions:

  • user interface: for now, added additional overriding mode parameter (inner/half-outer/full-outer) because the full flag can't cover all the options and slightly contradict to the SQL terminology
  • effectiveness of the db_keys cache. Not sure will it be good enough using memory for DPP
  • I don't see it's possible to join without aggregation as SQL does by default. I'm curious whether it makes sense to add it in the future
  • other flaws in the implementation?

Please take a look

@coveralls
Copy link

Pull Request Test Coverage Report for Build 327

  • 22 of 23 (95.65%) changed or added relevant lines in 1 file are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.2%) to 84.443%

Changes Missing Coverage Covered Lines Changed/Added Lines %
dataflows/processors/join.py 22 23 95.65%
Totals Coverage Status
Change from base Build 321: 0.2%
Covered Lines: 1585
Relevant Lines: 1877

💛 - Coveralls

1 similar comment
@coveralls
Copy link

Pull Request Test Coverage Report for Build 327

  • 22 of 23 (95.65%) changed or added relevant lines in 1 file are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.2%) to 84.443%

Changes Missing Coverage Covered Lines Changed/Added Lines %
dataflows/processors/join.py 22 23 95.65%
Totals Coverage Status
Change from base Build 321: 0.2%
Covered Lines: 1585
Relevant Lines: 1877

💛 - Coveralls

@coveralls
Copy link

coveralls commented May 30, 2019

Pull Request Test Coverage Report for Build 334

  • 29 of 30 (96.67%) changed or added relevant lines in 1 file are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.2%) to 84.485%

Changes Missing Coverage Covered Lines Changed/Added Lines %
dataflows/processors/join.py 29 30 96.67%
Totals Coverage Status
Change from base Build 321: 0.2%
Covered Lines: 1590
Relevant Lines: 1882

💛 - Coveralls

db = KVFile()

# Joining mode
if mode is None:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if going with mode we should properly deprecate full

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(see notes below)


deduplication = target_key is None
fields = fix_fields(fields)
source_key = KeyCalc(source_key)
target_key = KeyCalc(target_key) if target_key is not None else target_key
db_keys = collections.OrderedDict()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

generally you should use a KVFile here for memory scalability

continue
extra = dict(
(k, row.get(k))
for k in fields.keys()
)
row.update(extra)
yield row
if mode == 'full-outer':
for key in db_keys:
extra = db.get(key)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

avoid code duplication...

@@ -324,16 +350,16 @@ def func(package: PackageWrapper):
return func


def join(source_name, source_key, target_name, target_key, fields={}, full=True, source_delete=True):
return join_aux(source_name, source_key, source_delete, target_name, target_key, fields, full)
def join(source_name, source_key, target_name, target_key, fields={}, full=True, mode=None, source_delete=True):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

full should default to None and mode to 'inner'
if full is not None - show deprecation warning and update mode accordingly

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same comment applies to below functions as well

@roll
Copy link
Contributor Author

roll commented Jun 26, 2019

@akariv
Thanks for the review. I've made changes and updated the readme. Please take a look.

  • I've set mode=half-outer by default because it seems the only non breaking way to make this update
  • Other question is whether we should go with half-outer or left-outer or something else

@roll roll changed the title [WIP] Implement full outer join Implement full outer join Jun 26, 2019
@akariv
Copy link
Member

akariv commented Jul 2, 2019

This looks good @roll, thanks!

@akariv akariv merged commit 95ea664 into datahq:master Jul 2, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants