Avro schema case sensitivity testing #14500

@hudi-bot

Description

As a fallout of [PR 956|https://github.com//pull/956], we would like to understand how Avro behaves with case-sensitive column names.

A couple of action items:

  • Test with field names that differ only in case.
  • AbstractRealtimeRecordReader is one of the classes where we convert Avro schema field names to lower case so that they can be verified against the column names from Hive. We can consider removing the lower-case conversion there if we verify that doing so does not break anything.
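One concrete risk the first action item could surface is two field names collapsing to the same key once they are lower-cased. The following is a minimal sketch of that check, using only JDK collections rather than the Avro API. The class name `CaseCollisionCheck`, the method `findCaseCollisions`, and the sample field names are hypothetical and exist only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of Hudi or Avro: it only models field names as strings.
final class CaseCollisionCheck {

  // Groups field names by their lower-cased form and keeps only the groups
  // that contain more than one entry, i.e. names that differ only in case.
  static Map<String, List<String>> findCaseCollisions(List<String> fieldNames) {
    Map<String, List<String>> byLowerCase = new LinkedHashMap<>();
    for (String name : fieldNames) {
      byLowerCase
          .computeIfAbsent(name.toLowerCase(Locale.ROOT), key -> new ArrayList<>())
          .add(name);
    }
    // Drop every group with a single name; what remains are the collisions.
    byLowerCase.values().removeIf(names -> names.size() < 2);
    return byLowerCase;
  }

  public static void main(String[] args) {
    // Assumed sample schema: "userId" and "userid" differ only in case.
    List<String> fieldNames = Arrays.asList("userId", "userid", "timestamp");
    System.out.println(findCaseCollisions(fieldNames));
  }
}
```

If a check like this reports any collision for a writer schema, a map keyed by lower-cased names can hold only one of the colliding fields, which is the situation the tests would need to cover.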

 

JIRA info


Comments

23/May/20 22:07;shivnarayan;[~guoyihua]: this ticket is also related to case sensitivity. If you plan to take the other ticket, this should be on similar lines. ;;;


19/Oct/20 13:42;309637554;I do not think this should be changed, because Hive metadata column names are case-insensitive. If we do not lowercase the Avro field names, the Hive metastore schema will not match the Avro schema. For example, the column names in hive_metastoreConstants.META_TABLE_COLUMNS are case-insensitive:

Map<String, Field> schemaFieldsMap = HoodieRealtimeRecordReaderUtils.getNameToFieldMap(writerSchema);
hiveSchema = constructHiveOrderedSchema(writerSchema, schemaFieldsMap);

// Get all column names of the Hive table
String hiveColumnString = jobConf.get(hive_metastoreConstants.META_TABLE_COLUMNS);
LOG.info("Hive Columns : " + hiveColumnString);
String[] hiveColumns = hiveColumnString.split(",");
List<Field> hiveSchemaFields = new ArrayList<>();

for (String columnName : hiveColumns) {
  // Lower-case the Hive column name before the lookup
  Field field = schemaFieldsMap.get(columnName.toLowerCase());

  if (field != null) {
    hiveSchemaFields.add(new Schema.Field(field.name(), field.schema(), field.doc(), field.defaultVal()));
  } else {
    // Hive has some extra virtual columns like BLOCK__OFFSET__INSIDE__FILE which do not exist in the table schema.
    // They will get skipped as they won't be found in the original schema.
    LOG.debug("Skipping Hive Column => " + columnName);
  }
};;;
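The lookup in the snippet above can be modeled without Hudi, Hive, or Avro on the classpath. The sketch below assumes the name-to-field map is keyed by lower-cased Avro field names, as the issue description suggests. The class `HiveColumnMatchSketch`, its methods, and the sample names are hypothetical, and plain strings stand in for `Schema.Field`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical model of the matching loop; strings stand in for Avro fields.
final class HiveColumnMatchSketch {

  // Builds a map from lower-cased field name to the original Avro field name.
  // Assumption: this mirrors how the real name-to-field map is keyed.
  static Map<String, String> lowerCasedNameMap(List<String> avroFieldNames) {
    Map<String, String> byLowerCase = new HashMap<>();
    for (String name : avroFieldNames) {
      byLowerCase.put(name.toLowerCase(Locale.ROOT), name);
    }
    return byLowerCase;
  }

  // Returns the Avro field names in Hive column order, skipping Hive columns
  // (such as virtual columns) that have no matching Avro field.
  static List<String> orderByHiveColumns(List<String> avroFieldNames, String hiveColumnString) {
    Map<String, String> byLowerCase = lowerCasedNameMap(avroFieldNames);
    List<String> ordered = new ArrayList<>();
    for (String columnName : hiveColumnString.split(",")) {
      String avroName = byLowerCase.get(columnName.toLowerCase(Locale.ROOT));
      if (avroName != null) {
        ordered.add(avroName);
      }
    }
    return ordered;
  }

  public static void main(String[] args) {
    // Assumed inputs: mixed-case Avro names, lower-cased Hive names plus a virtual column.
    List<String> avroFields = Arrays.asList("eventTime", "userId");
    String hiveColumns = "userid,eventtime,BLOCK__OFFSET__INSIDE__FILE";
    System.out.println(orderByHiveColumns(avroFields, hiveColumns));
  }
}
```

In this model, dropping the lower-casing on either side would make `userid` miss `userId`, which matches the concern raised in the comment above. It does not show what happens when two Avro fields differ only in case.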


19/Oct/20 13:45;309637554;[~uditme], [~vinoth] what do you think about this? :D;;;


19/Oct/20 23:58;vinoth;[~309637554] this task is about exploring all possibilities and making a call.  IIUC you are making the case for retaining the lower casing. I think what you point out is why we lower cased this. 

I can't decide for myself until we paint the full picture. :) ;;;
