IMPALA-7368: Add initial support for DATE type

DATE values describe a particular year/month/day in the form yyyy-MM-dd. For example: DATE '2019-02-15'. DATE values do not have a time-of-day component. The range of values supported for the DATE type is 0000-01-01 to 9999-12-31.

This initial DATE type support covers the TEXT and HBASE file formats only. 'DateValue' is used as the internal type to represent DATE values.

The changes are as follows:
- Support for DATE literal syntax.
- Explicit casting between DATE and other types (note that invalid casts will fail with an error, just like invalid DECIMAL_V2 casts, while failed casts to other types do not lead to a warning or an error):
    - from STRING to DATE. The string value must be formatted as yyyy-MM-dd HH:mm:ss.SSSSSSSSS. The date component is mandatory, the time component is optional. If the time component is present, it will be truncated silently.
    - from DATE to STRING. The resulting string value is formatted as yyyy-MM-dd.
    - from TIMESTAMP to DATE. The source timestamp's time-of-day component is ignored.
    - from DATE to TIMESTAMP. The target timestamp's time-of-day component is set to 00:00:00.
- Implicit casting between DATE and other types:
    - from STRING to DATE if the source string value is used in a context where a DATE value is expected.
    - from DATE to TIMESTAMP if the source date value is used in a context where a TIMESTAMP value is expected.
- Since STRING -> DATE, STRING -> TIMESTAMP and DATE -> TIMESTAMP implicit conversions are now all possible, the existing function overload resolution logic is not adequate anymore. For example, it resolves the if(false, '2011-01-01', DATE '1499-02-02') function call to the if(BOOLEAN, TIMESTAMP, TIMESTAMP) version of the overloaded function, instead of the if(BOOLEAN, DATE, DATE) version.
  This is clearly wrong, so the function overload resolution logic had to be changed to resolve function calls to the best-fit overloaded function definition if there are multiple applicable candidates.
  An overloaded function definition is an applicable candidate for a function call if each actual parameter in the function call either matches the corresponding formal parameter's type (without casting) or is implicitly castable to that type.
  When looking for the best-fit applicable candidate, a parameter match score (i.e. the number of actual parameters in the function call that match their corresponding formal parameter's type without casting) is calculated and the applicable candidate with the highest parameter match score is chosen.
  There's one more issue that the new resolution logic has to address: if two applicable candidates have the same parameter match score and the only difference between the two is that the first one requires a STRING -> TIMESTAMP implicit cast for some of its parameters while the second one requires a STRING -> DATE implicit cast for the same parameters, then the first candidate has to be chosen so as not to break backward compatibility. E.g., the year('2019-02-15') function call must resolve to year(TIMESTAMP) instead of year(DATE). Note that year(DATE) is not implemented yet, so this is not an issue at the moment, but it will be in the future.
  When the resolution algorithm considers overloaded function definitions, it first orders them lexicographically by the types in their parameter lists. To ensure the backward-compatible behavior, the PrimitiveType.DATE enum value has to come after PrimitiveType.TIMESTAMP.
- Codegen infrastructure changes for expression evaluation.
- 'IS [NOT] NULL' and '[NOT] IN' predicates.
- Common comparison operators (including the 'BETWEEN' operator).
- Infrastructure changes for built-in functions.
- Some built-in functions: conditional, aggregate, analytical and math functions.
- C++ UDF/UDA support.
- Support partitioning and grouping by DATE.
- Beeswax, HiveServer2 support.

These items are tightly coupled and it makes sense to implement them in one change-set.
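The overload resolution rules described above can be illustrated with a small, self-contained sketch. It is not Impala's frontend code: the `Type` enum, `Signature`, `MatchScore` and `ResolveBestFit` are hypothetical names, only the three implicit casts named in this message are modelled, and ties are broken simply by keeping the first candidate in lexicographic order (which is why DATE is declared after TIMESTAMP).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the type enum. DATE is declared after TIMESTAMP on
// purpose, so that TIMESTAMP signatures sort first.
enum class Type { BOOLEAN, STRING, TIMESTAMP, DATE };

using Signature = std::vector<Type>;

// Only the implicit casts named in the commit message are modelled here.
bool ImplicitlyCastable(Type from, Type to) {
  return (from == Type::STRING && (to == Type::TIMESTAMP || to == Type::DATE))
      || (from == Type::DATE && to == Type::TIMESTAMP);
}

// Returns the parameter match score (number of exact type matches), or -1 if
// 'formal' is not an applicable candidate for the call with types 'actual'.
int MatchScore(const Signature& formal, const Signature& actual) {
  if (formal.size() != actual.size()) return -1;
  int score = 0;
  for (std::size_t i = 0; i < formal.size(); ++i) {
    if (formal[i] == actual[i]) {
      ++score;
    } else if (!ImplicitlyCastable(actual[i], formal[i])) {
      return -1;
    }
  }
  return score;
}

// Picks the applicable candidate with the highest score. Candidates are first
// ordered lexicographically by parameter types; on a tie the earlier one wins.
bool ResolveBestFit(std::vector<Signature> candidates, const Signature& actual,
    Signature* best) {
  std::sort(candidates.begin(), candidates.end());
  int best_score = -1;
  for (const Signature& candidate : candidates) {
    int score = MatchScore(candidate, actual);
    if (score > best_score) {
      best_score = score;
      *best = candidate;
    }
  }
  return best_score >= 0;
}

// Usage: the two calls discussed in the commit message.
void DemoResolution() {
  Signature best;

  // if(false, '2011-01-01', DATE '1499-02-02')
  std::vector<Signature> if_overloads = {
      Signature{Type::BOOLEAN, Type::TIMESTAMP, Type::TIMESTAMP},
      Signature{Type::BOOLEAN, Type::DATE, Type::DATE}};
  const Signature if_call = {Type::BOOLEAN, Type::STRING, Type::DATE};
  const Signature if_date = {Type::BOOLEAN, Type::DATE, Type::DATE};
  bool found = ResolveBestFit(if_overloads, if_call, &best);
  assert(found && best == if_date);  // score 2 beats score 1

  // year('2019-02-15') with a hypothetical year(DATE) overload present.
  std::vector<Signature> year_overloads = {
      Signature{Type::DATE}, Signature{Type::TIMESTAMP}};
  const Signature year_call = {Type::STRING};
  const Signature year_timestamp = {Type::TIMESTAMP};
  found = ResolveBestFit(year_overloads, year_call, &best);
  assert(found && best == year_timestamp);  // tie, TIMESTAMP sorts first
}
```

In this sketch the backward-compatible result for year('2019-02-15') falls out of the sort order alone, which mirrors the reason given above for placing the DATE enum value after TIMESTAMP.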

Testing:
- A new partitioned TEXT table 'functional.date_tbl' (and the corresponding HBASE table 'functional_hbase.date_tbl') was introduced for DATE-related tests.
- BE and FE tests were extended to cover the DATE type.
- E2E tests:
    - since the DATE type is supported for the TEXT and HBASE file formats only, most DATE tests were implemented separately in tests/query_test/test_date_queries.py.

Note that this change-set is not a complete DATE type implementation, but it lays the foundation for future work:
- Add date support to the random query generator.
- Implement a complete set of built-in functions.
- Add Parquet support.
- Add Kudu support.
- Optionally support Avro and ORC.
For further details, see IMPALA-6169.

Change-Id: Iea8155ef09557e0afa2f8b2d0b2dc9d0896dc30f
Reviewed-on: http://gerrit.cloudera.org:8080/12481
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
apache · Apr 23, 2019 · b5805de
1 parent 67f77d4
commit b5805de

Showing 163 changed files with 4,241 additions and 463 deletions.
diff --git a/be/src/codegen/codegen-anyval.cc b/be/src/codegen/codegen-anyval.cc
@@ -34,6 +34,7 @@ const char* CodegenAnyVal::LLVM_DOUBLEVAL_NAME    = "struct.impala_udf::DoubleVa
 const char* CodegenAnyVal::LLVM_STRINGVAL_NAME    = "struct.impala_udf::StringVal";
 const char* CodegenAnyVal::LLVM_TIMESTAMPVAL_NAME = "struct.impala_udf::TimestampVal";
 const char* CodegenAnyVal::LLVM_DECIMALVAL_NAME   = "struct.impala_udf::DecimalVal";
+const char* CodegenAnyVal::LLVM_DATEVAL_NAME      = "struct.impala_udf::DateVal";
 
 llvm::Type* CodegenAnyVal::GetLoweredType(LlvmCodeGen* cg, const ColumnType& type) {
   switch(type.type) {
@@ -63,6 +64,8 @@ llvm::Type* CodegenAnyVal::GetLoweredType(LlvmCodeGen* cg, const ColumnType& typ
     case TYPE_DECIMAL: // %"struct.impala_udf::DecimalVal" (isn't lowered)
                        // = { {i8}, [15 x i8], {i128} }
       return cg->GetNamedType(LLVM_DECIMALVAL_NAME);
+    case TYPE_DATE: // i64
+      return cg->i64_type();
     default:
       DCHECK(false) << "Unsupported type: " << type;
       return NULL;
@@ -112,6 +115,9 @@ llvm::Type* CodegenAnyVal::GetUnloweredType(LlvmCodeGen* cg, const ColumnType& t
     case TYPE_DECIMAL:
       result = cg->GetNamedType(LLVM_DECIMALVAL_NAME);
       break;
+    case TYPE_DATE:
+      result = cg->GetNamedType(LLVM_DATEVAL_NAME);
+      break;
     default:
       DCHECK(false) << "Unsupported type: " << type;
       return NULL;
@@ -215,6 +221,7 @@ llvm::Value* CodegenAnyVal::GetIsNull(const char* name) const {
     case TYPE_TINYINT:
     case TYPE_SMALLINT:
     case TYPE_INT:
+    case TYPE_DATE:
     case TYPE_FLOAT:
       // Lowered type is an integer. Get the first byte.
       return builder_->CreateTrunc(value_, codegen_->bool_type(), name);
@@ -265,6 +272,7 @@ void CodegenAnyVal::SetIsNull(llvm::Value* is_null) {
     case TYPE_TINYINT:
     case TYPE_SMALLINT:
     case TYPE_INT:
+    case TYPE_DATE:
     case TYPE_FLOAT: {
       // Lowered type is an integer. Set the first byte to 'is_null'.
       value_ = builder_->CreateAnd(value_, -0x100LL, "masked");
@@ -293,7 +301,8 @@ llvm::Value* CodegenAnyVal::GetVal(const char* name) {
     case TYPE_BOOLEAN:
     case TYPE_TINYINT:
     case TYPE_SMALLINT:
-    case TYPE_INT: {
+    case TYPE_INT:
+    case TYPE_DATE: {
       // Lowered type is an integer. Get the high bytes.
       int num_bits = type_.GetByteSize() * 8;
       llvm::Value* val = GetHighBits(num_bits, value_, name);
@@ -339,7 +348,8 @@ void CodegenAnyVal::SetVal(llvm::Value* val) {
     case TYPE_BOOLEAN:
     case TYPE_TINYINT:
     case TYPE_SMALLINT:
-    case TYPE_INT: {
+    case TYPE_INT:
+    case TYPE_DATE: {
       // Lowered type is an integer. Set the high bytes to 'val'.
       int num_bits = type_.GetByteSize() * 8;
       value_ = SetHighBits(num_bits, val, value_, name_);
@@ -385,7 +395,7 @@ void CodegenAnyVal::SetVal(int16_t val) {
 }
 
 void CodegenAnyVal::SetVal(int32_t val) {
-  DCHECK(type_.type == TYPE_INT || type_.type == TYPE_DECIMAL);
+  DCHECK(type_.type == TYPE_INT || type_.type == TYPE_DECIMAL || type_.type == TYPE_DATE);
   SetVal(builder_->getInt32(val));
 }
 
@@ -560,6 +570,7 @@ void CodegenAnyVal::LoadFromNativePtr(llvm::Value* raw_val_ptr) {
     case TYPE_FLOAT:
     case TYPE_DOUBLE:
     case TYPE_DECIMAL:
+    case TYPE_DATE:
       SetVal(builder_->CreateLoad(raw_val_ptr, "raw_val"));
       break;
     default:
@@ -617,6 +628,7 @@ void CodegenAnyVal::StoreToNativePtr(llvm::Value* raw_val_ptr, llvm::Value* pool
     case TYPE_FLOAT:
     case TYPE_DOUBLE:
     case TYPE_DECIMAL:
+    case TYPE_DATE:
       // The representations of the types match - just store the value.
       builder_->CreateStore(GetVal(), raw_val_ptr);
       break;
@@ -698,6 +710,7 @@ llvm::Value* CodegenAnyVal::Eq(CodegenAnyVal* other) {
     case TYPE_INT:
     case TYPE_BIGINT:
     case TYPE_DECIMAL:
+    case TYPE_DATE:
       return builder_->CreateICmpEQ(GetVal(), other->GetVal(), "eq");
     case TYPE_FLOAT:
     case TYPE_DOUBLE:
@@ -740,6 +753,7 @@ llvm::Value* CodegenAnyVal::EqToNativePtr(llvm::Value* native_ptr,
     case TYPE_INT:
     case TYPE_BIGINT:
     case TYPE_DECIMAL:
+    case TYPE_DATE:
       return builder_->CreateICmpEQ(GetVal(), val, "cmp_raw");
     case TYPE_FLOAT:
     case TYPE_DOUBLE:{

diff --git a/be/src/codegen/codegen-anyval.h b/be/src/codegen/codegen-anyval.h
@@ -53,6 +53,7 @@ namespace impala {
 /// TYPE_TIMESTAMP/TimestampVal: { i64, i64 }
 /// TYPE_DECIMAL/DecimalVal (isn't lowered):
 /// %"struct.impala_udf::DecimalVal" { {i8}, [15 x i8], {i128} }
+/// TYPE_DATE/DateVal: i64
 //
 /// TODO:
 /// - unit tests
@@ -68,6 +69,7 @@ class CodegenAnyVal {
   static const char* LLVM_STRINGVAL_NAME;
   static const char* LLVM_TIMESTAMPVAL_NAME;
   static const char* LLVM_DECIMALVAL_NAME;
+  static const char* LLVM_DATEVAL_NAME;
 
   /// Creates a call to 'fn', which should return a (lowered) *Val, and returns the result.
   /// This abstracts over the x64 calling convention, in particular for functions returning
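
The header comment above lists `TYPE_DATE/DateVal: i64`, and the codegen-anyval.cc hunks group TYPE_DATE with TYPE_INT: the "first byte" of the lowered integer carries the null flag and the "high bytes" carry the value. The following sketch reproduces that packing with plain integer arithmetic instead of LLVM IR, assuming a 4-byte DATE payload (consistent with `GetSlotType()` returning `i32_type()` for TYPE_DATE). All function names are hypothetical and only meant to illustrate the layout.

```cpp
#include <cassert>
#include <cstdint>

// Packs (is_null, val) into one 64-bit integer: the null flag lives in the
// lowest byte, the 32-bit value in the upper 32 bits.
uint64_t LowerDateVal(bool is_null, int32_t val) {
  return (static_cast<uint64_t>(static_cast<uint32_t>(val)) << 32)
      | (is_null ? 1u : 0u);
}

// Analogous to truncating the lowered value to a 1-bit boolean.
bool GetIsNull(uint64_t lowered) {
  return (lowered & 1u) != 0;
}

// Analogous to masking with -0x100 (clearing the low byte) and then OR-ing in
// the new flag.
uint64_t SetIsNull(uint64_t lowered, bool is_null) {
  return (lowered & ~0xFFULL) | (is_null ? 1u : 0u);
}

// Analogous to extracting the high 32 bits.
int32_t GetVal(uint64_t lowered) {
  return static_cast<int32_t>(static_cast<uint32_t>(lowered >> 32));
}

// Analogous to overwriting the high 32 bits while keeping the low half.
uint64_t SetVal(uint64_t lowered, int32_t val) {
  return (lowered & 0xFFFFFFFFULL)
      | (static_cast<uint64_t>(static_cast<uint32_t>(val)) << 32);
}

// Usage: flag and value can be updated independently of each other.
void DemoLowering() {
  uint64_t v = LowerDateVal(false, -5);
  assert(!GetIsNull(v) && GetVal(v) == -5);
  v = SetIsNull(v, true);
  assert(GetIsNull(v) && GetVal(v) == -5);
  v = SetVal(v, 17942);
  assert(GetIsNull(v) && GetVal(v) == 17942);
}
```

Keeping the flag and the value in disjoint bit ranges is what lets DATE share the existing TYPE_INT code paths without any new lowering logic.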

diff --git a/be/src/codegen/gen_ir_descriptions.py b/be/src/codegen/gen_ir_descriptions.py
@@ -57,6 +57,8 @@
    "_ZN6impala18AggregateFunctions9AvgUpdateIN10impala_udf9BigIntValEEEvPNS2_15FunctionContextERKT_PNS2_9StringValE"],
   ["AVG_UPDATE_DOUBLE",
    "_ZN6impala18AggregateFunctions9AvgUpdateIN10impala_udf9DoubleValEEEvPNS2_15FunctionContextERKT_PNS2_9StringValE"],
+  ["AVG_UPDATE_DATE",
+   "_ZN6impala18AggregateFunctions9AvgUpdateIN10impala_udf7DateValEEEvPNS2_15FunctionContextERKT_PNS2_9StringValE"],
   ["AVG_UPDATE_TIMESTAMP",
    "_ZN6impala18AggregateFunctions18TimestampAvgUpdateEPN10impala_udf15FunctionContextERKNS1_12TimestampValEPNS1_9StringValE"],
   ["AVG_UPDATE_DECIMAL",
@@ -97,6 +99,8 @@
    "_ZNK6impala7SlotRef16GetCollectionValEPNS_19ScalarExprEvaluatorEPKNS_8TupleRowE"],
   ["SCALAR_EXPR_NULL_LITERAL_GET_COLLECTION_VAL",
    "_ZNK6impala11NullLiteral16GetCollectionValEPNS_19ScalarExprEvaluatorEPKNS_8TupleRowE"],
+  ["SCALAR_EXPR_GET_DATE_VAL",
+   "_ZN6impala10ScalarExpr10GetDateValEPS0_PNS_19ScalarExprEvaluatorEPKNS_8TupleRowE"],
   ["HASH_CRC", "IrCrcHash"],
   ["HASH_MURMUR", "IrMurmurHash"],
   ["PHJ_PROCESS_BUILD_BATCH",
@@ -147,6 +151,8 @@
    "_ZN6impala18AggregateFunctions9HllUpdateIN10impala_udf12TimestampValEEEvPNS2_15FunctionContextERKT_PNS2_9StringValE"],
   ["HLL_UPDATE_DECIMAL",
    "_ZN6impala18AggregateFunctions9HllUpdateIN10impala_udf10DecimalValEEEvPNS2_15FunctionContextERKT_PNS2_9StringValE"],
+  ["HLL_UPDATE_DATE",
+   "_ZN6impala18AggregateFunctions9HllUpdateIN10impala_udf7DateValEEEvPNS2_15FunctionContextERKT_PNS2_9StringValE"],
   ["HLL_MERGE",
    "_ZN6impala18AggregateFunctions8HllMergeEPN10impala_udf15FunctionContextERKNS1_9StringValEPS4_"],
   ["DECODE_AVRO_DATA",
@@ -192,6 +198,7 @@
   ["STRING_TO_DECIMAL4", "IrStringToDecimal4"],
   ["STRING_TO_DECIMAL8", "IrStringToDecimal8"],
   ["STRING_TO_DECIMAL16", "IrStringToDecimal16"],
+  ["STRING_TO_DATE", "IrStringToDate"],
   ["IS_NULL_STRING", "IrIsNullString"],
   ["GENERIC_IS_NULL_STRING", "IrGenericIsNullString"],
   ["RAW_VALUE_COMPARE",

diff --git a/be/src/codegen/llvm-codegen.cc b/be/src/codegen/llvm-codegen.cc
@@ -544,6 +544,8 @@ llvm::Type* LlvmCodeGen::GetSlotType(const ColumnType& type) {
       return timestamp_value_type_;
     case TYPE_DECIMAL:
       return llvm::Type::getIntNTy(context(), type.GetByteSize() * 8);
+    case TYPE_DATE:
+      return i32_type();
     default:
       DCHECK(false) << "Invalid type: " << type;
       return NULL;
@@ -1369,6 +1371,7 @@ void LlvmCodeGen::CodegenMinMax(LlvmBuilder* builder, const ColumnType& type,
     case TYPE_TINYINT:
     case TYPE_SMALLINT:
     case TYPE_INT:
+    case TYPE_DATE:
     case TYPE_BIGINT:
     case TYPE_DECIMAL:
       if (min) {

diff --git a/be/src/exec/aggregator.cc b/be/src/exec/aggregator.cc
@@ -339,7 +339,8 @@ Status Aggregator::CodegenUpdateSlot(LlvmCodeGen* codegen, int agg_fn_idx,
   const ColumnType& dst_type = agg_fn->intermediate_type();
   bool dst_is_int_or_float_or_bool = dst_type.IsIntegerType()
       || dst_type.IsFloatingPointType() || dst_type.IsBooleanType();
-  bool dst_is_numeric_or_bool = dst_is_int_or_float_or_bool || dst_type.IsDecimalType();
+  bool dst_is_numeric_or_bool = dst_is_int_or_float_or_bool || dst_type.IsDecimalType()
+      || dst_type.IsDateType();
 
   llvm::BasicBlock* ret_block = llvm::BasicBlock::Create(codegen->context(), "ret", *fn);
 

diff --git a/be/src/exec/data-source-scan-node.cc b/be/src/exec/data-source-scan-node.cc
@@ -25,6 +25,7 @@
 #include "exec/read-write-util.h"
 #include "exprs/scalar-expr.h"
 #include "gen-cpp/parquet_types.h"
+#include "runtime/date-value.h"
 #include "runtime/mem-pool.h"
 #include "runtime/mem-tracker.h"
 #include "runtime/row-batch.h"
@@ -306,6 +307,12 @@ Status DataSourceScanNode::MaterializeNextRow(const Timezone& local_tz,
             val.size(), slot));
         break;
       }
+      case TYPE_DATE:
+        if (val_idx >= col.int_vals.size()) {
+          return Status(Substitute(ERROR_INVALID_COL_DATA, "DATE"));
+        }
+        *reinterpret_cast<DateValue*>(slot) = DateValue(col.int_vals[val_idx]);
+        break;
       default:
         DCHECK(false);
     }
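
The hunk above builds a DATE slot from an integer via `DateValue(col.int_vals[val_idx])`. What that integer means is not visible in this part of the diff. Assuming it is a day count relative to 1970-01-01 in the proleptic Gregorian calendar (an assumption, not something this hunk confirms), the conversion between such a count and year/month/day can be sketched with the well-known civil-date arithmetic below. `CivilDate`, `DaysFromCivil` and `CivilFromDays` are hypothetical helpers, not part of Impala.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper type for this sketch.
struct CivilDate {
  int year;
  int month;  // 1..12
  int day;    // 1..31
};

// Days since 1970-01-01 for a proleptic Gregorian date. The year is shifted so
// that it starts on March 1, which puts the leap day at the end of the year.
int32_t DaysFromCivil(int y, int m, int d) {
  y -= (m <= 2) ? 1 : 0;
  const int era = (y >= 0 ? y : y - 399) / 400;
  const int yoe = y - era * 400;                                    // [0, 399]
  const int doy = (153 * (m > 2 ? m - 3 : m + 9) + 2) / 5 + d - 1;  // [0, 365]
  const int doe = yoe * 365 + yoe / 4 - yoe / 100 + doy;         // [0, 146096]
  return era * 146097 + doe - 719468;
}

// Inverse of DaysFromCivil().
CivilDate CivilFromDays(int32_t days) {
  const int z = days + 719468;
  const int era = (z >= 0 ? z : z - 146096) / 146097;
  const int doe = z - era * 146097;                              // [0, 146096]
  const int yoe = (doe - doe / 1460 + doe / 36524 - doe / 146096) / 365;
  const int doy = doe - (365 * yoe + yoe / 4 - yoe / 100);          // [0, 365]
  const int mp = (5 * doy + 2) / 153;                                // [0, 11]
  CivilDate c;
  c.day = doy - (153 * mp + 2) / 5 + 1;
  c.month = mp < 10 ? mp + 3 : mp - 9;
  c.year = yoe + era * 400 + (c.month <= 2 ? 1 : 0);
  return c;
}

// Usage: the example date from the commit message and both ends of the
// supported range round-trip through the day count.
void DemoDayCount() {
  assert(DaysFromCivil(1970, 1, 1) == 0);
  const CivilDate c = CivilFromDays(DaysFromCivil(2019, 2, 15));
  assert(c.year == 2019 && c.month == 2 && c.day == 15);
  const CivilDate lo = CivilFromDays(DaysFromCivil(0, 1, 1));
  assert(lo.year == 0 && lo.month == 1 && lo.day == 1);
  const CivilDate hi = CivilFromDays(DaysFromCivil(9999, 12, 31));
  assert(hi.year == 9999 && hi.month == 12 && hi.day == 31);
}
```

Under this assumption the whole 0000-01-01 to 9999-12-31 range fits comfortably in a signed 32-bit integer.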

diff --git a/be/src/exec/hash-table.cc b/be/src/exec/hash-table.cc
@@ -611,6 +611,7 @@ static void CodegenAssignNullValue(LlvmCodeGen* codegen, LlvmBuilder* builder,
       case TYPE_INT:
       case TYPE_BIGINT:
       case TYPE_DECIMAL:
+      case TYPE_DATE:
         null_value = codegen->GetIntConstant(byte_size, fnv_seed, fnv_seed);
         break;
       case TYPE_FLOAT: {

diff --git a/be/src/exec/hdfs-scanner-ir.cc b/be/src/exec/hdfs-scanner-ir.cc
@@ -153,6 +153,11 @@ void IrStringToTimestamp(TimestampValue* out, const char* s, int len,
   *out = StringParser::StringToTimestamp(s, len, result);
 }
 
+extern "C"
+DateValue IrStringToDate(const char* s, int len, ParseResult* result) {
+  return StringParser::StringToDate(s, len, result);
+}
+
 extern "C"
 Decimal4Value IrStringToDecimal4(const char* s, int len, int type_precision,
     int type_scale, ParseResult* result)  {
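
`IrStringToDate` only forwards to `StringParser::StringToDate`, whose implementation is not part of this hunk. The commit message describes the accepted string form for STRING to DATE casts: a mandatory yyyy-MM-dd date component and an optional time component that is dropped. A deliberately strict sketch of that rule follows. `SimpleDate` and `ParseDateSketch` are hypothetical names, the time component is skipped rather than validated, and whether the real parser accepts exactly the same inputs is an open assumption.

```cpp
#include <cassert>
#include <cctype>

// Hypothetical result type for this sketch.
struct SimpleDate {
  int year;
  int month;
  int day;
};

static bool IsLeapYear(int y) {
  return (y % 4 == 0 && y % 100 != 0) || y % 400 == 0;
}

// Parses exactly 'n' decimal digits starting at 's'.
static bool ParseDigits(const char* s, int n, int* out) {
  int v = 0;
  for (int i = 0; i < n; ++i) {
    if (!std::isdigit(static_cast<unsigned char>(s[i]))) return false;
    v = v * 10 + (s[i] - '0');
  }
  *out = v;
  return true;
}

// Accepts "yyyy-MM-dd", optionally followed by a space and a time component.
// The time component is ignored, not validated (a simplification).
bool ParseDateSketch(const char* s, int len, SimpleDate* out) {
  if (len < 10) return false;
  if (len > 10 && s[10] != ' ') return false;
  if (s[4] != '-' || s[7] != '-') return false;
  SimpleDate d;
  if (!ParseDigits(s, 4, &d.year) || !ParseDigits(s + 5, 2, &d.month)
      || !ParseDigits(s + 8, 2, &d.day)) {
    return false;
  }
  if (d.month < 1 || d.month > 12) return false;
  static const int kDaysInMonth[] = {31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31};
  const int max_day =
      kDaysInMonth[d.month - 1] + ((d.month == 2 && IsLeapYear(d.year)) ? 1 : 0);
  if (d.day < 1 || d.day > max_day) return false;
  *out = d;
  return true;
}

// Usage: a bare date and a date with a time component yield the same DATE.
void DemoParse() {
  SimpleDate d;
  bool ok = ParseDateSketch("2019-02-15", 10, &d);
  assert(ok && d.year == 2019 && d.month == 2 && d.day == 15);
  ok = ParseDateSketch("2019-02-15 13:45:00.5", 21, &d);
  assert(ok && d.year == 2019 && d.month == 2 && d.day == 15);
  ok = ParseDateSketch("2019-02-30", 10, &d);
  assert(!ok);  // February never has 30 days
}
```

Four year digits keep every accepted value inside the 0000-01-01 to 9999-12-31 range stated in the commit message without an explicit range check.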

diff --git a/be/src/exec/hdfs-table-sink.cc b/be/src/exec/hdfs-table-sink.cc
@@ -491,11 +491,25 @@ Status HdfsTableSink::InitOutputPartition(RuntimeState* state,
           new HdfsTextTableWriter(
               this, state, output_partition, &partition_descriptor, table_desc_));
       break;
-    case THdfsFileFormat::PARQUET:
-      output_partition->writer.reset(
-          new HdfsParquetTableWriter(
-              this, state, output_partition, &partition_descriptor, table_desc_));
-      break;
+    case THdfsFileFormat::PARQUET: {
+        // Writing DATE columns to PARQUET is not supported yet.
+        int num_clustering_cols = table_desc_->num_clustering_cols();
+        int num_cols = table_desc_->num_cols() - num_clustering_cols;
+        for (int i = 0; i < num_cols; ++i) {
+          const ColumnType& type = output_expr_evals_[i]->root().type();
+          if (type.type == TYPE_DATE) {
+            ColumnDescriptor col_desc = table_desc_->col_descs()[num_clustering_cols + i];
+            stringstream error_msg;
+            error_msg << "Cannot write DATE column '" << col_desc.name()
+                      << "' to a PARQUET table.";
+            return Status(error_msg.str());
+          }
+        }
+        output_partition->writer.reset(
+            new HdfsParquetTableWriter(
+                this, state, output_partition, &partition_descriptor, table_desc_));
+        break;
+      }
     default:
       stringstream error_msg;
       map<int, const char*>::const_iterator i =

diff --git a/be/src/exec/text-converter.cc b/be/src/exec/text-converter.cc
@@ -230,6 +230,9 @@ Status TextConverter::CodegenWriteSlot(LlvmCodeGen* codegen,
       case TYPE_TIMESTAMP:
         parse_fn_enum = IRFunction::STRING_TO_TIMESTAMP;
         break;
+      case TYPE_DATE:
+        parse_fn_enum = IRFunction::STRING_TO_DATE;
+        break;
       case TYPE_DECIMAL:
         switch (slot_desc->slot_size()) {
           case 4:

diff --git a/be/src/exec/text-converter.inline.h b/be/src/exec/text-converter.inline.h
@@ -28,6 +28,7 @@
 #include "runtime/tuple.h"
 #include "util/string-parser.h"
 #include "runtime/string-value.h"
+#include "runtime/date-value.h"
 #include "runtime/timestamp-value.h"
 #include "runtime/mem-pool.h"
 #include "runtime/string-value.inline.h"
@@ -140,6 +141,11 @@ inline bool TextConverter::WriteSlot(const SlotDescriptor* slot_desc, Tuple* tup
       }
       break;
     }
+    case TYPE_DATE: {
+      *reinterpret_cast<DateValue*>(slot) =
+          StringParser::StringToDate(data, len, &parse_result);
+      break;
+    }
     case TYPE_DECIMAL: {
       switch (slot_desc->slot_size()) {
         case 4:

diff --git a/be/src/exprs/agg-fn-evaluator.cc b/be/src/exprs/agg-fn-evaluator.cc
@@ -27,6 +27,7 @@
 #include "exprs/scalar-expr-evaluator.h"
 #include "exprs/scalar-fn-call.h"
 #include "gutil/strings/substitute.h"
+#include "runtime/date-value.h"
 #include "runtime/lib-cache.h"
 #include "runtime/raw-value.h"
 #include "runtime/runtime-state.h"
@@ -260,6 +261,10 @@ void AggFnEvaluator::SetDstSlot(const AnyVal* src, const SlotDescriptor& dst_slo
         default:
           break;
       }
+    case TYPE_DATE:
+      *reinterpret_cast<DateValue*>(slot) =
+          DateValue::FromDateVal(*reinterpret_cast<const DateVal*>(src));
+      return;
     default:
       DCHECK(false) << "NYI: " << dst_slot_desc.type();
   }
@@ -489,6 +494,13 @@ void AggFnEvaluator::SerializeOrFinalize(Tuple* src,
       }
       break;
     }
+    case TYPE_DATE: {
+      typedef DateVal(*Fn)(FunctionContext*, AnyVal*);
+      DateVal v = reinterpret_cast<Fn>(fn)(
+          agg_fn_ctx_.get(), staging_intermediate_val_);
+      SetDstSlot(&v, dst_slot_desc, dst);
+      break;
+    }
     default:
       DCHECK(false) << "NYI";
   }