Skip to content

[BEAM-2990] support MAP in SQL schema#5079

Merged
mingmxu merged 5 commits intoapache:masterfrom
mingmxu:BEAM-2990
Apr 16, 2018
Merged

[BEAM-2990] support MAP in SQL schema#5079
mingmxu merged 5 commits intoapache:masterfrom
mingmxu:BEAM-2990

Conversation

@mingmxu
Copy link

@mingmxu mingmxu commented Apr 10, 2018

Add type MAP.


Follow this checklist to help us incorporate your contribution quickly and easily:

  • Make sure there is a JIRA issue filed for the change (usually before you start working on it). Trivial changes like typos do not require a JIRA issue. Your pull request should address just this issue, without pulling in other changes.
  • Format the pull request title like [BEAM-XXX] Fixes bug in ApproximateQuantiles, where you replace BEAM-XXX with the appropriate JIRA issue.
  • Write a pull request description that is detailed enough to understand:
    • What the pull request does
    • Why it does it
    • How it does it
    • Why this approach
  • Each commit in the pull request should have a meaningful subject line and body.
  • Run mvn clean verify to make sure basic checks pass. A more thorough check will be performed on your pull request automatically.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

@mingmxu mingmxu requested review from reuvenlax and xumingming April 10, 2018 05:12
@mingmxu
Copy link
Author

mingmxu commented Apr 10, 2018

R: + @akedin

Can you guys take a look and let's try to finish it before 2.5 cutoff? Thanks!

@mingmxu
Copy link
Author

mingmxu commented Apr 10, 2018

retest this please

Copy link
Contributor

@akedin akedin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me overall, few questions:

// For MAP type, returns the type of the key element.
@Nullable public abstract FieldType getComponentKeyType();
// For MAP type, returns the type of the value element.
@Nullable public abstract FieldType getComponentValueType();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to me it looks like both getComponentValueType() and getComponentType() mean the same thing, i.e. they both describe the type of the value in the container. Keep just one of them?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer to have different fields for MAP and ARRAY, if it doesn't cause significant performance issue.

@@ -222,6 +224,9 @@ public boolean isDateType() {
public boolean isContainerType() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is map a container type as well? A composite type?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

while, my thought is MAP-> MAP, ARRAY -> CONTAINER, ROW -> COMPOSITE, to make it clear for the backend types.

@mingmxu
Copy link
Author

mingmxu commented Apr 10, 2018

the failure seems unrelated, any tips to handle it?

@mingmxu
Copy link
Author

mingmxu commented Apr 11, 2018

retest this please

@mingmxu
Copy link
Author

mingmxu commented Apr 11, 2018

run java precommit

put(1, "value1");
put(2, "value2");
put(3, "value3");
put(4, "value4");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would also add a test where the map value is itself a complex or nested type.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will update to support primitive/array/map/row as value type in Map

Map<Object, Object> valueMap = (Map<Object, Object>) value;
Map<Object, Object> verifiedMap = Maps.newHashMapWithExpectedSize(valueMap.size());
for (Entry<Object, Object> kv : valueMap.entrySet()) {
verifiedMap.put(verifyPrimitiveType(kv.getKey(), componentKeyType.getTypeName(), fieldName),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is actually incorrect right now, since as coded the key type might not be a primitive type. However as I mentioned in Schema.java, I think it would be better to change the key type to be a TypeName, in which case this logic will be correct.

return this;
}

public <T1, T2> Builder addMap(Map<T1, T2> data) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this overload needed? addArray was needed because otherwise passing an array into addValues tended to unroll the array, but I don't think Java will do that for maps.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's not necessary, will remove.

public static final Set<TypeName> STRING_TYPES = ImmutableSet.of(STRING);
public static final Set<TypeName> DATE_TYPES = ImmutableSet.of(DATETIME);
public static final Set<TypeName> CONTAINER_TYPES = ImmutableSet.of(ARRAY);
public static final Set<TypeName> MAP_TYPES = ImmutableSet.of(MAP);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we consider this part of CONTAINER_TYPES?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would separate here, CONTAINER should be ARRAY/SET. List<KV<>> could be a CONTAINER_TYPE, Map<> is not.

// For MAP type, returns the type of the key element.
@Nullable public abstract FieldType getComponentKeyType();
// For MAP type, returns the type of the value element.
@Nullable public abstract FieldType getComponentValueType();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

XuMingmin wrote:
I would prefer to have different fields for MAP and ARRAY, if it doesn't cause significant performance issue.

two questions/comments

  1. I think the key type be a TypeName instead of a FieldType? Making it a FieldType makes it legal for the key to be a complex value - e.g. they key could be an array type - which doesn't sound very meaningful to me.

  2. Have you considered introducing a new key-value type here (pair of TypeName, FieldType)?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, will change to key as primitive, and value can be primitive/array/map/row

@@ -222,6 +224,9 @@ public boolean isDateType() {
public boolean isContainerType() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

XuMingmin wrote:
while, my thought is MAP-> MAP, ARRAY -> CONTAINER, ROW -> COMPOSITE, to make it clear for the backend types.

I think it's actually just another type of container. i.e. one way of imagining .a map is it's just a container of key-value pairs.

@mingmxu
Copy link
Author

mingmxu commented Apr 12, 2018

retest this please

@mingmxu
Copy link
Author

mingmxu commented Apr 13, 2018

any comments on the change? Would like to close this PR asap as my repository is broken after #4964

@mingmxu
Copy link
Author

mingmxu commented Apr 13, 2018

run java precommit

@mingmxu
Copy link
Author

mingmxu commented Apr 13, 2018

retest this please

@mingmxu mingmxu force-pushed the BEAM-2990 branch 2 times, most recently from 5949dbe to 78f845e Compare April 13, 2018 07:32
Copy link
Contributor

@reuvenlax reuvenlax left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At sdks/java/core/src/main/java/org/apache/beam/sdk/values/Row.java:438:

        } else if (TypeName.ROW.equals(componentType.getTypeName())) {

missing Map case.

I think we need to factor this switch out into a helper function used in all these cases, so we don't have to keep updating all of them.

public static final Set<TypeName> COMPOSITE_TYPES = ImmutableSet.of(ROW);

public boolean isPrimitiveType() {
return isNumericType() || isStringType() || isDateType();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is not correct (e.g. it excludes boolean). better off making this exclusive (return !isContainterType() && !isCompositeType()).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right, will change

public static final Set<TypeName> STRING_TYPES = ImmutableSet.of(STRING);
public static final Set<TypeName> DATE_TYPES = ImmutableSet.of(DATETIME);
public static final Set<TypeName> CONTAINER_TYPES = ImmutableSet.of(ARRAY);
public static final Set<TypeName> MAP_TYPES = ImmutableSet.of(MAP);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

XuMingmin wrote:
would separate here, CONTAINER should be ARRAY/SET. List<KV<>> could be a CONTAINER_TYPE, Map<> is not.

I'm still confused about this. In most systems, Map is considered a container (e.g. in Java Map is a container type)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In Java, container extends Collection and map extends Map, they're very different IMO. If we merge them together I don't see any benefit as this is a backend function and developers are using either TypeName.ARRAY.type().withComponentType() or TypeName.MAP.type().withMapType.

To make it clear, I would prefer to use the term Collection instead of Component or Contianer. Any comments?

return this;
}

public <T1, T2> Builder addMap(Map<T1, T2> data) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

XuMingmin wrote:
it's not necessary, will remove.

Acknowledged.

@mingmxu
Copy link
Author

mingmxu commented Apr 13, 2018

run java precommit

abstract static class Builder {
abstract Builder setTypeName(TypeName typeName);
abstract Builder setComponentType(@Nullable FieldType componentType);
abstract Builder setCollectionType(@Nullable FieldType collectionType);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick here: collectionType reads weird to me, as it seems like it's the type of collection (e.g. array, list, tree, etc.) instead of the type of the component elements of the collection.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then, what's your prefer?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn\t mind componentType, but if you want to include collection in there then maybe CollectionElementType or CollectionValueType?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, let me change it to CollectionElementType

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, LGTM from me once that change is made.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated, please have a look when you've time.

@reuvenlax
Copy link
Contributor

lgtm

@mingmxu
Copy link
Author

mingmxu commented Apr 16, 2018

Appreciate @reuvenlax , squash and merging

@mingmxu mingmxu merged commit 3b3f944 into apache:master Apr 16, 2018
@mingmxu mingmxu deleted the BEAM-2990 branch April 16, 2018 20:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants