ARROW-386: [Java] Respect case of struct / map field names #261

Open
wants to merge 6 commits into
from

4 participants

@alphalfalfa

Changes include:

  • Remove all toLowerCase() calls on field names in MapWriters.java template file, so that the writers can respect case of the field names.
  • Use lower-case keys for internalMap in UnionVector instead of camel-case (e.g. bigInt -> bigint). P.S. I don't know what the original purpose of using camel case here was. It did not cause a conflict before because all field names were converted to lower case.
  • Add a simple test case of MapWriter with mixed-case field names.
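For readers unfamiliar with the problem, here is a minimal sketch of why lower-casing field names on lookup loses information. It does not use the Arrow API: `FieldNameCaseDemo` and `register` are hypothetical names, and a plain map stands in for the writer's field table. `Locale.ROOT` is used only to keep the sketch locale-independent; the template discussed in this PR calls the no-argument `toLowerCase()`.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-in for the field lookup inside a map writer; not the Arrow API.
class FieldNameCaseDemo {

  // Registers each name and returns the resulting field table (key -> field index).
  static Map<String, Integer> register(boolean lowerCaseKeys, String... names) {
    Map<String, Integer> fields = new LinkedHashMap<>();
    for (String name : names) {
      String key = lowerCaseKeys ? name.toLowerCase(Locale.ROOT) : name;
      // Mimics "get the existing writer for this key or create a new one".
      if (!fields.containsKey(key)) {
        fields.put(key, fields.size());
      }
    }
    return fields;
  }

  public static void main(String[] args) {
    // With lower-casing, the two names collapse into a single field.
    System.out.println(register(true, "userId", "UserID").keySet());
    // Without it, both names are kept as distinct fields.
    System.out.println(register(false, "userId", "UserID").keySet());
  }
}
```

With lower-casing enabled the two mixed-case names end up sharing one entry, which is the behavior the first bullet removes.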
Jingyuan Wang Arrow-386: [Java] Respect case of struct / map field names
51da2a1
@wesm
Member
wesm commented Dec 30, 2016

I expect this will also show up if added to the integration tests. @julienledem @jacques-n I'm not sure of the timeline for using the Arrow jars in Drill, but I presume case-sensitivity is something you'd be able to handle without too much trouble on the application side?

@wesm
Member
wesm commented Jan 2, 2017

@alphalfalfa can you change the PR title to start with ARROW-386:? The merge tool is case sensitive

@jacques-n
Contributor

I think we need to support both behaviors. We have a bunch of code that supports case insensitivity that is built on this code. -1 until we come up with a better solution. Maybe @julienledem will have some ideas but he is out this week

@wesm
Member
wesm commented Jan 2, 2017

@jacques-n this is blocking use of Arrow on some internal use cases where we have case sensitive field names. We are also working on Arrow Java<->C++ in Spark -- I'm not sure if Spark SQL metadata is case sensitive or not.

@jacques-n
Contributor

I'm not worried about the internal name change (bigInt > bigint).

The key is the change to MapWriters. Can you add an option in MapWriters, set when you construct one, that specifies whether they are case sensitive or not? Keep the default behavior as it was but allow a fully case preserving/sensitive alternative?

@alphalfalfa

@jacques-n , without the name change of bigInt -> bigint, the test case of promotableWriter inside TestComplexWriter.java would fail. The problematic code block is the following:

public NullableBigIntVector getBigIntVector() {
  if (bigIntVector == null) {
    int vectorCount = internalMap.size();
    bigIntVector = internalMap.addOrGet("bigInt", MinorType.BIGINT, NullableBigIntVector.class);
    if (internalMap.size() > vectorCount) {
      bigIntVector.allocateNew();
      if (callBack != null) {
        callBack.doWork();
      }
    }
  }
  return bigIntVector;
}

After BigIntWriter got promoted into UnionWriter, the internalMap of UnionVector already has an entry of "bigint". Somehow, an additional entry named "bigInt" is created. The subsequent operations are then messed up.
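The failure described above can be modeled in a few lines. This is a simplified sketch under assumptions, not Arrow's real `internalMap`: `InternalMapKeyDemo` and `childrenAfterPromotion` are hypothetical names, the child "vectors" are plain strings, and `Locale.ROOT` is used only to make the sketch locale-independent.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical model of a child-vector container keyed by name; not Arrow's internalMap.
class InternalMapKeyDemo {
  private final Map<String, String> children = new LinkedHashMap<>();
  private final boolean lowerCaseKeys;

  InternalMapKeyDemo(boolean lowerCaseKeys) {
    this.lowerCaseKeys = lowerCaseKeys;
  }

  // Returns the existing child for this name, or creates and stores a new one.
  String addOrGet(String name) {
    String key = lowerCaseKeys ? name.toLowerCase(Locale.ROOT) : name;
    String child = children.get(key);
    if (child == null) {
      child = "vector:" + key;
      children.put(key, child);
    }
    return child;
  }

  int size() {
    return children.size();
  }

  // Simulates promotion registering "bigint", followed by the getter asking for "bigInt".
  static int childrenAfterPromotion(boolean lowerCaseKeys) {
    InternalMapKeyDemo map = new InternalMapKeyDemo(lowerCaseKeys);
    map.addOrGet("bigint");
    map.addOrGet("bigInt");
    return map.size();
  }

  public static void main(String[] args) {
    System.out.println(childrenAfterPromotion(true));  // keys are lower-cased
    System.out.println(childrenAfterPromotion(false)); // keys preserve case
  }
}
```

In this model the second lookup finds the existing child only while keys are lower-cased; once case is preserved, the camel-case key creates a second child, which matches the extra "bigInt" entry reported above.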


For the default behavior, is that the best solution? Maybe it is only me, but I would never expect a data writer to silently lower-case field names provided by users. Of course, if there are already products built on the case insensitivity, it makes sense to keep it as the default.

@alphalfalfa alphalfalfa changed the title from Arrow-386: [Java] Respect case of struct / map field names to ARROW-386: [Java] Respect case of struct / map field names Jan 10, 2017
@alphalfalfa

@wesm title changed, is this good?

@jacques-n
Contributor

@alphalfalfa: I was saying that the bigint change seems correct.

For defaults: we've built a bunch of code on top of this already so changing the default would be quite challenging.

Jingyuan Wang Add option to MapWriters to configure the case sensitivity (defaulted as case-insensitive)
cba60d1
@alphalfalfa

@jacques-n I've added the option and set default to be case-insensitive

@jacques-n

We need to figure out a way to do this so that it isn't in the map(), list(), etc. call path. Maybe subclassing?

can you explain a bit more? what do you mean by map(), list() etc call path?

@@ -102,7 +116,7 @@ public Field getField() {
@Override
public MapWriter map(String name) {
- FieldWriter writer = fields.get(name.toLowerCase());
+ FieldWriter writer = fields.get(handleCase(name));
@wesm
wesm Jan 11, 2017 Member

@alphalfalfa he's referring to the changes in this function and below. I think Jacques is proposing having a CaseSensitiveMapWriter

@alphalfalfa
alphalfalfa Jan 11, 2017

@wesm @jacques-n I am a bit confused here, as I don't have the full picture of how these writer classes are used. What is the benefit of subclassing into CaseSensitiveMapWriter?

@julienledem

Here are my comments on this.

+ return this.caseSensitive? input : input.toLowerCase();
+ }
+
+ public boolean getCaseSensitivity() {
@julienledem
julienledem Jan 12, 2017 Member

isCaseSensitive()

@@ -90,6 +90,7 @@
void clear();
void copyReader(FieldReader reader);
MapWriter rootAsMap();
+ MapWriter rootAsMap(Boolean caseSensitive);
@julienledem
julienledem Jan 12, 2017 Member

I think the ComplexWriter itself should be either case sensitive or not, so this method is not needed.

@julienledem
julienledem Jan 12, 2017 Member

On a side note this should be boolean not Boolean (but it does not apply since I'm suggesting to remove it)

@alphalfalfa
alphalfalfa Jan 12, 2017

I would prefer to configure the case sensitivity directly on ComplexWriter as well. But I only found MapWriters lower-casing field names, which confused me a bit and made me think that case sensitivity probably applies only to MapWriters.

@@ -139,17 +139,22 @@ public MapWriter directMap(){
}
@Override
- public MapWriter rootAsMap() {
+ public MapWriter rootAsMap(Boolean caseSensitive) {
@julienledem
julienledem Jan 12, 2017 Member

I'd suggest that we add instead an optional caseSensitive parameter to the Constructor of ComplexWriterImpl

@@ -159,6 +164,11 @@ public MapWriter rootAsMap() {
return mapRoot;
}
+ @Override
+ public MapWriter rootAsMap() {
+ return rootAsMap(null);
@julienledem
julienledem Jan 12, 2017 Member

s/null/false?

@alphalfalfa
alphalfalfa Jan 12, 2017

There are more than two situations here, so using Boolean and null is probably not the best solution. What would you suggest?

  1. init call, caseSensitive = null -> case insensitive
  2. init call, caseSensitive = true/false -> case sensitive/insensitive
  3. non-init call, caseSensitive = null -> already initialized, do nothing
  4. non-init call, caseSensitive = true/false -> if it differs from the initialized sensitivity, an IllegalArgumentException is thrown; otherwise, do nothing
@alphalfalfa
alphalfalfa Jan 12, 2017

nvm, i guess it doesn't matter if case sensitivity is configured when ComplexWriter is constructed.
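A minimal sketch of why the four cases disappear when the flag is fixed at construction time. The class and method names (`ComplexWriterConfigDemo`, `ComplexWriterSketch`, `rootFieldKey`) are hypothetical and this is not the real `ComplexWriterImpl`; it only illustrates a primitive `boolean` with a case-insensitive default, as suggested in the review comments.

```java
import java.util.Locale;

class ComplexWriterConfigDemo {

  // Hypothetical, simplified stand-in for ComplexWriterImpl; only the flag handling is shown.
  static class ComplexWriterSketch {
    // Final primitive: no null state, and it cannot change after construction.
    private final boolean caseSensitive;

    // Constructor without the flag keeps the previous default (case-insensitive).
    ComplexWriterSketch() {
      this(false);
    }

    ComplexWriterSketch(boolean caseSensitive) {
      this.caseSensitive = caseSensitive;
    }

    boolean isCaseSensitive() {
      return caseSensitive;
    }

    // Shows how a field name would be normalized under the configured mode.
    String rootFieldKey(String name) {
      return caseSensitive ? name : name.toLowerCase(Locale.ROOT);
    }
  }

  public static void main(String[] args) {
    System.out.println(new ComplexWriterSketch().rootFieldKey("BigData"));
    System.out.println(new ComplexWriterSketch(true).rootFieldKey("BigData"));
  }
}
```

Because the mode is decided once, a later call that obtains the root map has nothing to validate and nothing to throw.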

Jingyuan Wang Configure case sensitivity when constructing ComplexWriterImpl
d269e21
@julienledem
Member

thanks for the update @alphalfalfa.
this looks good to me.
@jacques-n ?

@wesm
Member
wesm commented Jan 16, 2017

LGTM also

@jacques-n
Contributor

The modifications don't address my concerns. Specifically, we want to avoid evaluating the boolean in map() every single time it is called. A common pattern when parsing JSON strings is calling this repeatedly. Let's just have two versions of the implementation. In most cases, only one would get loaded, so specialization could occur.

@julienledem
Member
julienledem commented Jan 16, 2017 edited

thanks for clarifying @jacques-n.
@alphalfalfa could you make the following changes to resolve this?

  • make handleCase(final String input) protected and just return input.toLowerCase() in MapWriters
  • create a subclass of NullableMapWriter called CaseInsensitiveNullableMapWriter that overrides handleCase
  • modify ComplexWriterImpl to instantiate the correct NullableMapWriter in rootAsMap()/INIT (@jacques-n: please confirm, but it should be fine to add an if in rootAsMap())
    Then we should be good to go
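The subclassing approach can be sketched roughly as follows. All class names here are hypothetical and heavily simplified compared with the generated MapWriters code. The sketch assumes the base writer lower-cases names and a subclass preserves case, which is the direction suggested by the later commit title "Separate MapWriters with CaseSensitiveMapWriters". `Locale.ROOT` is used only to keep the example locale-independent.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

class MapWriterSpecializationDemo {

  // Hypothetical base writer: case-insensitive, as in the previous default behavior.
  static class MapWriterSketch {
    private final Map<String, Integer> fields = new HashMap<>();

    // Overridable hook; the hot path below contains no boolean check.
    protected String handleCase(String input) {
      return input.toLowerCase(Locale.ROOT);
    }

    // Returns the index of the field, creating it on first use.
    int field(String name) {
      String key = handleCase(name);
      Integer index = fields.get(key);
      if (index == null) {
        index = fields.size();
        fields.put(key, index);
      }
      return index;
    }

    int fieldCount() {
      return fields.size();
    }
  }

  // Hypothetical case-preserving variant: only the hook differs.
  static class CaseSensitiveMapWriterSketch extends MapWriterSketch {
    @Override
    protected String handleCase(String input) {
      return input;
    }
  }

  // The only branch on the flag happens once, when the root writer is created.
  static MapWriterSketch rootAsMap(boolean caseSensitive) {
    return caseSensitive ? new CaseSensitiveMapWriterSketch() : new MapWriterSketch();
  }

  public static void main(String[] args) {
    MapWriterSketch insensitive = rootAsMap(false);
    insensitive.field("Total");
    insensitive.field("TOTAL");
    System.out.println(insensitive.fieldCount());

    MapWriterSketch sensitive = rootAsMap(true);
    sensitive.field("Total");
    sensitive.field("TOTAL");
    System.out.println(sensitive.fieldCount());
  }
}
```

The design point is that the per-call cost becomes a virtual call to `handleCase` instead of a flag test, and if an application only ever loads one of the two classes the JIT has a chance to specialize that call site.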
@julienledem
Member

@alphalfalfa Do you need support for nested maps? It does not look like your current PR passes down the caseSensitive attribute to nested maps.

@jacques-n
Contributor

If condition in rootAsMap seems fine.

Agreed that the whole tree should probably be case sensitive or not (not just a part of it).

@alphalfalfa

sure, I can make the change.

@julienledem I probably don't need support for nested maps. Do you think I should pass down the attribute? If so, can you point me to the right portion of code?

@julienledem
Member

@alphalfalfa I think for correctness it is better to pass down the caseSensitivity attribute to nested maps. That means passing down that attribute to ComplexWriters returned by your writer:

  • in ComplexWriterImpl, pass the attribute to the writer returned by rootAsList()
  • in MapWriters, pass the attribute to the writers returned by list() and map()
    Possibly you can create a NullableMapWriterFactory to pass down to the writers which will be called to create new NullableMapWriter instances.
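One way to picture the factory idea is the sketch below. It is an illustration under assumptions rather than the code that was merged: `WriterFactory`, `WriterSketch`, and `CaseSensitiveWriterSketch` are hypothetical names, and only nested maps are modeled (no lists or scalar writers).

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

class NestedCaseSensitivityDemo {

  // Hypothetical factory that every writer hands on to the children it creates.
  interface WriterFactory {
    WriterSketch create();
  }

  // Hypothetical case-insensitive writer that can contain nested maps.
  static class WriterSketch {
    private final Map<String, WriterSketch> children = new LinkedHashMap<>();
    private final WriterFactory factory;

    WriterSketch(WriterFactory factory) {
      this.factory = factory;
    }

    protected String handleCase(String name) {
      return name.toLowerCase(Locale.ROOT);
    }

    // Returns the nested map writer for this name, creating it through the factory.
    WriterSketch map(String name) {
      String key = handleCase(name);
      WriterSketch child = children.get(key);
      if (child == null) {
        child = factory.create(); // the child inherits the parent's case handling
        children.put(key, child);
      }
      return child;
    }

    int childCount() {
      return children.size();
    }
  }

  static class CaseSensitiveWriterSketch extends WriterSketch {
    CaseSensitiveWriterSketch(WriterFactory factory) {
      super(factory);
    }

    @Override
    protected String handleCase(String name) {
      return name;
    }
  }

  static final WriterFactory CASE_INSENSITIVE = new WriterFactory() {
    @Override
    public WriterSketch create() {
      return new WriterSketch(this);
    }
  };

  static final WriterFactory CASE_SENSITIVE = new WriterFactory() {
    @Override
    public WriterSketch create() {
      return new CaseSensitiveWriterSketch(this);
    }
  };

  // Writes two names differing only by case into a nested map and counts the fields.
  static int nestedChildCount(WriterFactory factory) {
    WriterSketch inner = factory.create().map("Outer");
    inner.map("value");
    inner.map("VALUE");
    return inner.childCount();
  }

  public static void main(String[] args) {
    System.out.println(nestedChildCount(CASE_INSENSITIVE));
    System.out.println(nestedChildCount(CASE_SENSITIVE));
  }
}
```

Since each writer creates its children through the same factory it was built with, the whole tree ends up either case sensitive or case insensitive, never a mix.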
Jingyuan Wang added some commits Jan 17, 2017
Jingyuan Wang Separate MapWriters with CaseSensitiveMapWriters 2fe7bcf
Jingyuan Wang Pass caseSensitive Attribute down to nested MapWriters
7b28bfc
@alphalfalfa

PR updated. Hope I covered all the possible cases.

@julienledem

LGTM
+1
@jacques-n ?

@@ -145,15 +153,16 @@ public void clear() {
@Override
public ListWriter list(String name) {
- FieldWriter writer = fields.get(name.toLowerCase());
+ String finalName = handleCase(name);
+ FieldWriter writer = fields.get(handleCase(finalName));
@julienledem
julienledem Jan 19, 2017 Member

You forgot to remove the call to handleCase here

@@ -159,15 +153,16 @@ public void clear() {
@Override
public ListWriter list(String name) {
- FieldWriter writer = fields.get(handleCase(name));
+ String finalName = handleCase(name);
+ FieldWriter writer = fields.get(handleCase(finalName));
@julienledem
julienledem Jan 19, 2017 Member

you forgot to remove the call to handleCase here

Jingyuan Wang Remove unnecessary handleCase() call
cd08145