DataTable V3 implementation and measure data table serialization cost on server #6710

mqliang · 2021-03-23T01:25:14Z

Description

This PR:

Adds a positional data section to the tail of the data table and bumps the data table version to V3.
Data in the positional data section is supposed to be key/value pairs, and the data is supposed to be positional (the value of a given key is locatable even after serialization), so a String[] is used to store the values and an enum is used to store the keys.
Currently there is only one KV pair (response_serialization_cost) in the positional data section. If more KV pairs are added, a utility function such as getOffsetForValueOfGivenKey() can be added to locate the value of a given key.
Measures the data table serialization cost on the server and puts the cost in the positional data section.

Upgrade Notes

Does this PR prevent a zero down-time upgrade? (Assume upgrade order: Controller, Broker, Server, Minion)

Yes (Please label as backward-incompat, and complete the section below on Release Notes)

Does this PR fix a zero-downtime upgrade introduced earlier?

Yes (Please label this as backward-incompat, and complete the section below on Release Notes)

Does this PR otherwise need attention when creating release notes? Things to consider:

New configuration options
Deprecation of configurations
Signature changes to public methods/interfaces
New plugins added or old plugins removed

Yes (Please label this PR as release-notes and complete the section on Release Notes)

Release Notes

If you have tagged this as either backward-incompat or release-notes,
you MUST add text here that you would like to see appear in release notes of the
next release.

If you have a series of commits adding or enabling a feature, then
add this section only in final commit that marks the feature completed.
Refer to earlier release notes to see examples of text

Documentation

If you have introduced a new feature or configuration, please add it to the documentation as well.
See https://docs.pinot.apache.org/developers/developers-and-contributors/update-document

codecov-io · 2021-03-23T02:09:56Z

Codecov Report

Merging #6710 (636ec0e) into master (8dbb70b) will decrease coverage by 8.00%.
The diff coverage is 83.86%.

@@            Coverage Diff             @@
##           master    #6710      +/-   ##
==========================================
- Coverage   73.83%   65.82%   -8.01%     
==========================================
  Files        1396     1405       +9     
  Lines       67765    68161     +396     
  Branches     9807     9853      +46     
==========================================
- Hits        50035    44870    -5165     
- Misses      14485    20100    +5615     
+ Partials     3245     3191      -54

Flag	Coverage Δ
integration	`?`
unittests	`65.82% <83.86%> (-0.18%)`	⬇️

Impacted Files	Coverage Δ
...org/apache/pinot/common/utils/CommonConstants.java	`21.15% <ø> (-13.47%)`	⬇️
...e/pinot/core/common/datatable/DataTableImplV2.java	`0.00% <0.00%> (-89.46%)`	⬇️
...core/operator/blocks/IntermediateResultsBlock.java	`76.21% <0.00%> (-5.41%)`	⬇️
...core/query/executor/ServerQueryExecutorV1Impl.java	`46.19% <0.00%> (-33.70%)`	⬇️
...e/pinot/core/transport/InstanceRequestHandler.java	`55.88% <0.00%> (-22.06%)`	⬇️
...pinot/server/starter/helix/HelixServerStarter.java	`0.00% <0.00%> (-51.99%)`	⬇️
...e/pinot/core/query/reduce/BrokerReduceService.java	`68.54% <33.33%> (-25.81%)`	⬇️
...che/pinot/core/query/scheduler/QueryScheduler.java	`68.96% <66.66%> (-13.09%)`	⬇️
...pinot/core/common/datatable/DataTableImplBase.java	`76.53% <76.53%> (ø)`
.../pinot/core/common/datatable/DataTableBuilder.java	`86.72% <85.71%> (-0.32%)`	⬇️
... and 366 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8dbb70b...636ec0e. Read the comment docs.

siddharthteotia · 2021-03-23T02:20:19Z

pinot-core/src/main/java/org/apache/pinot/core/common/datatable/DataTableImplV2V3.java

 import org.apache.pinot.spi.utils.ByteArray;
 import org.apache.pinot.spi.utils.BytesUtils;


-public class DataTableImplV2 implements DataTable {
-  private static final int VERSION = 2;
+public class DataTableImplV2V3 implements DataTable {


(nit) suggest not including the version name in class name. It should just be DataTableImpl. Tomorrow if we bump up the version to 4, then the name will be DataTableImplV2V3V4 which is undesirable

I named it DataTableImplV2V3 since V2 and V3 share a lot of common logic. If V2 and V3 end up with major differences, as you suggest:

Since we are anyway bumping up the version, how about we move the existing metadata of key-value pairs to the end of file to keep consistency in the format. So, all the metadata stuff (aka key-value pairs) + new positional stuff can be a file footer.

If we do that, I vote for putting the V2 logic into DataTableImplV2 and the V3 logic into DataTableImplV3, and extracting the common logic (e.g. serializing/de-serializing metadata/dictionaryMap) into DataTableUtils.java.

move the existing metadata of key-value pairs to the end of file

Actually I considered that. I also considered making the metadata a String[] instead of a Map<String, String>, with all metadata keys defined as enum values, and making "serialization_cpu_times_ns" part of the metadata. In other words, "serialization_cpu_times_ns" is part of the metadata and the footer section contains only metadata. In this way:

All metadata is positional, so we can replace values in the metadata even after the data table is serialized. (The bytes of a Map<String, String> are not positional because the iteration order of a hash map is not guaranteed, whereas the iteration order of an array is deterministic.)

Previously the metadata was a Map<String, String>, so we had to write the keys (of type String) to the byte buffer. With a String[], we don't write the enum constant itself, just the value (length + bytes) corresponding to the ordinal/position of the constant. So less data is transferred between server and broker.

But if we make this change, as I previously stated, I vote to keep the current DataTableImplV2.java as it is and create a DataTableImplV3.java to hold all V3 logic (extracting the common logic into DataTableUtils.java). Otherwise, putting all V2/V3 logic in the same file will make the code hard to read.

Let's discuss the approach again by moving the metadata to the end of the payload. I think we both are inclined towards doing that since all the metadata (existing + new) will be together in the footer.

Coming to naming, my initial suggestion of not including version was indeed because they share the logic. So tomorrow if we move to v4 and still share a lot of common logic, we can continue to retain the name DataTableImpl and not DataTableImplv2v3v4 as everything will be in the same file as long as it is readable.

I agree that moving the metadata is a change which will make some code unreadable if we try to keep everything in the same file. So yes, if we go down this path, I agree we should create a new class.

siddharthteotia · 2021-03-23T02:22:49Z

pinot-core/src/main/java/org/apache/pinot/core/common/datatable/DataTableImplV2V3.java

+public class DataTableImplV2V3 implements DataTable {
+  public static final int VERSION_2 = 2;
+  public static final int VERSION_3 = 3;
+  public static final int DEFAULT_VERSION = VERSION_3;


Change this to CURRENT_VERSION ?

siddharthteotia · 2021-03-23T02:23:32Z

pinot-core/src/main/java/org/apache/pinot/core/common/datatable/DataTableImplV2V3.java

@@ -61,12 +65,15 @@
  private final byte[] _variableSizeDataBytes;
  private final ByteBuffer _variableSizeData;
  private final Map<String, String> _metadata;
+  // Only V3 has _positionalData
+  private final String[] _positionalData;


I would suggest calling this footer and add some comments on the structure of footer. Please give some example as well.

Also update the javadocs in class DataTableBuilder because that's where the structure of the file is listed

siddharthteotia · 2021-03-23T02:56:04Z

pinot-core/src/main/java/org/apache/pinot/core/common/datatable/DataTableImplV2V3.java


  /**
   * Construct data table with results. (Server side)
   */
-  public DataTableImplV2(int numRows, DataSchema dataSchema, Map<String, Map<Integer, String>> dictionaryMap,
-      byte[] fixedSizeDataBytes, byte[] variableSizeDataBytes) {
+  public DataTableImplV2V3(int version, int numRows, DataSchema dataSchema,


Are we passing version number to the constructor so that we can do backward compatibility tests between V2 and V3 ? Other than tests, I don't see why server should decide a version. It should always write the data table with CURRENT_VERSION

siddharthteotia · 2021-03-23T03:07:49Z

pinot-core/src/main/java/org/apache/pinot/core/common/datatable/DataTableImplV2V3.java

+
+    dataOutputStream.writeInt(_positionalData.length);
+    for (String entry : _positionalData) {
+      byte[] bytes = StringUtil.encodeUtf8(entry);


Some comments on the format here would be useful. We don't write the enum constant itself. Just the value (length+bytes) corresponding to the ordinal/position of the constant. Correct ?

yes, your understanding is correct.

siddharthteotia · 2021-03-23T03:16:14Z

pinot-core/src/main/java/org/apache/pinot/core/common/datatable/DataTableImplV2V3.java

@@ -344,6 +395,20 @@ public void addException(ProcessingException processingException) {
    return byteArrayOutputStream.toByteArray();
  }

+  private byte[] serializePositionalData()


This does not actually serialize to the main output stream opened by the caller, toBytes().
Like the other serialization functions, this function first writes to a temporary output stream and then converts it to a byte array, which is returned to the caller and written to the main stream. I think the reason for doing that is that we don't know upfront the length of the byte[] array to allocate.

However, for this footer we can probably do it differently, and it might be faster:

Write a loop to go over each entry and keep a running sum of size

At the end of loop, allocate byte array of that size

Start another loop and go over each entry again and fill out the pre-allocated byte array.

Return the filled byte array

This avoids creating the streams at lines 400 and 401, writing to them, and then converting the result to a byte array. We can write directly to the byte array, which I think can be faster.
For the other serialization functions that follow this approach, we can fix them later outside this PR if need be.
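
The two-pass idea in the steps above could be sketched roughly as follows. This is only an illustration under assumptions: the class name TwoPassFooterSerializer and the layout (an entry count followed by length-prefixed UTF-8 entries) are hypothetical and mirror the writeInt/encodeUtf8 pattern in the diff, not the PR's final code.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class TwoPassFooterSerializer {
  /** Serializes entries as: count, then (length, UTF-8 bytes) per entry, into one pre-sized array. */
  public static byte[] serialize(String[] entries) {
    // Pass 1: encode each entry once and keep a running sum of the total size.
    byte[][] encoded = new byte[entries.length][];
    int size = Integer.BYTES; // leading entry count
    for (int i = 0; i < entries.length; i++) {
      encoded[i] = entries[i].getBytes(StandardCharsets.UTF_8);
      size += Integer.BYTES + encoded[i].length;
    }
    // Pass 2: fill the pre-allocated array. ByteBuffer writes big-endian by default,
    // which matches what DataOutputStream.writeInt produces.
    ByteBuffer buffer = ByteBuffer.allocate(size);
    buffer.putInt(entries.length);
    for (byte[] bytes : encoded) {
      buffer.putInt(bytes.length);
      buffer.put(bytes);
    }
    return buffer.array();
  }
}
```

Whether this is actually faster than the stream-based version would need a benchmark; the sketch only shows that the output size can be computed before any allocation.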

I will write a benchmark to compare these two serialization approaches. If the proposed approach is better, I will send a PR to address it. Created an issue to track this: #6714

Let's not worry about spending time on that optimization. It can always be done later and is not a must-have for this change.

siddharthteotia · 2021-03-23T03:28:04Z

pinot-core/src/main/java/org/apache/pinot/core/common/datatable/DataTableImplV2V3.java

+    if (version == VERSION_3 && byteBuffer.hasRemaining()) {
+      int positionalDataStart = variableSizeDataStart + variableSizeDataLength;
+      int positionalDataLength = byteBuffer.remaining();
+      byteBuffer.position(positionalDataStart);


Since we are using byteBuffer.remaining() to compute the length of positional data, it implies we are treating it as a footer of specific format (name-value pairs as defined in the enum) even though we are not calling it. So, technically no other structure can come after this as we will fail to distinguish between the length of positional data + whatever comes after it. I don't think we should limit that flexibility. Even if we call this footer, let's please write the length of footer as well before line 348

Done. The length of the footer is now written into the header.
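
To illustrate why an explicit length keeps the format extensible, here is a small sketch that reads the footer from a recorded start and length instead of relying on byteBuffer.remaining(). The names FooterSlice, footerStart, and footerLength are assumptions made for this example; the real header layout is whatever the PR defines.

```java
import java.nio.ByteBuffer;

public class FooterSlice {
  /** Returns the footer bytes given an explicit start and length (assumed to come from the header). */
  public static byte[] readFooter(ByteBuffer payload, int footerStart, int footerLength) {
    byte[] footer = new byte[footerLength];
    // Work on a duplicate so the caller's position and limit are left untouched.
    ByteBuffer view = payload.duplicate();
    view.position(footerStart);
    view.get(footer);
    return footer;
  }
}
```

With the length known, any bytes that follow the footer are simply not consumed, so another section could be appended later without confusing the reader.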

siddharthteotia · 2021-03-23T03:38:59Z

With the addition of new data structure in this PR, there are essentially two places in DataTable where the key-value / name-value style structure is located.

First is the existing DataTable metadata which is also a series of key-value pairs where key is string and value is some statistic/metric. This is towards the beginning of the byte stream
Second is the structure introduced in this PR which is written as a footer.

Since we are anyway bumping up the version, how about we move the existing metadata of key-value pairs to the end of file to keep consistency in the format. So, all the metadata stuff (aka key-value pairs) + new positional stuff can be a file footer.

siddharthteotia · 2021-03-23T03:53:52Z

With this PR, we should resolve a couple of TODOs introduced in PR #6680

Expose the serialization time through an API at the DataTable level and log it in QueryScheduler. You need to serialize before the logging line. Currently it is after.
Revisit this. The execution CPU time is not yet serialized as part of metadata. Maybe we can just remove line 258.

mcvsubbu · 2021-03-23T16:45:45Z

Any reason we are restricting the trailer (or footer) to have only key-value pairs? We don't need to place that restriction as long as the length is also encoded up front. It can be any serialized object, right?

mqliang · 2021-03-23T18:13:55Z

@mcvsubbu

Any reason we are restricting the trailer (or footer) to have only key-value pairs? We don't need to place that restriction as long as the length is also encoded up front. It can be any serialized object, right?

You are right, it can be any serialized object, but restricting the footer to KV pairs has the following benefits:

Any object can be added as a KV pair: (key, serialized_object). So it is easy to add a new section to the footer in the future.
All keys of the KV pairs in the footer are defined in an enum, so the order of the KV pairs is deterministic when the footer is serialized. This makes every KV pair positional/locatable, so we can replace the value of a given key in the footer even after serialization.
If we want to add a new object to the data table and are OK with putting it into the footer as a KV pair, we don't need to bump up the version.

Here is the pseudocode to serialize/de-serialize the footer:

enum footerkeys {
	k0,
	k1,
	k2,
}

String[] footerkeysToStr = new String[]{
	"k0",
	"k1",
	"k2",
}

function byte[] serializeFooter() {
 	byte[] bytes;
 	for (key in footerkeys) {
 	    String data = encode_to_str(value_of_key(key));
 	    bytes = append(bytes, len(data));
 	    bytes = append(bytes, data.toBytes());
 	}
 	return bytes;
}

function String[] deSerializeFooter(byte[] bytes) {
	String[] values = new String[len(footerkeys)];
	for (int i = 0; i < len(footerkeys); i++) {
	   int data_len = bytes.nextInt();
	   values[i] = bytes.nextBytesOfLen(data_len);
	}
	return values;
}

// If values[i] is a complex object instead of a string, we can deserialize it even further:
String[] footerKVpairs = deSerializeFooter(bytes);
Object_i = deserialize(footerKVpairs[i].toBytes());

So, if we want to add a new object to the footer, we add it as a KV pair. As long as its key is added as the last constant of the enum, an old broker will just ignore the extra entry, so the change is backward compatible.
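
For reference, the pseudocode above could look roughly like this in Java. It is a sketch under assumptions: PositionalFooter and FooterKey are hypothetical names, values are plain strings, and the reader stops after the number of keys its own enum knows about, which is what lets an old reader ignore entries appended by a newer writer.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class PositionalFooter {
  // Hypothetical keys; under this scheme new keys may only be appended at the end.
  public enum FooterKey { K0, K1, K2 }

  /** Writes one (length, UTF-8 bytes) entry per value, in positional order. */
  public static byte[] serialize(String[] values) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    for (String value : values) {
      byte[] data = value.getBytes(StandardCharsets.UTF_8);
      byte[] length = ByteBuffer.allocate(Integer.BYTES).putInt(data.length).array();
      out.write(length, 0, length.length);
      out.write(data, 0, data.length);
    }
    return out.toByteArray();
  }

  /** Reads only as many entries as this reader's enum defines; trailing entries are ignored. */
  public static String[] deserialize(byte[] bytes) {
    ByteBuffer buffer = ByteBuffer.wrap(bytes);
    String[] values = new String[FooterKey.values().length];
    for (int i = 0; i < values.length; i++) {
      byte[] data = new byte[buffer.getInt()];
      buffer.get(data);
      values[i] = new String(data, StandardCharsets.UTF_8);
    }
    return values;
  }
}
```

For example, a writer that emits four values still deserializes cleanly on a reader whose enum has three keys; the fourth entry is left unread.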

If the footer contained not only KV pairs but also other arbitrary serializable objects:

+------------------------------------+
|     
|    serializable object 1
|
+------------------------------------
|
|    serializable object 2
|
+------------------------------------
|
|    KV pairs
|
+------------------------------------

It's not extensible: if we want to add a serializable_object_3 between serializable_object_2 and KV_pairs, we need to bump up the version (and if we bump the version, we can also add it in the middle of the data table, not necessarily in the footer).

That's the reason I prefer a footer that contains only KV pairs: if we want to add a new simple section to the data table without bumping up the version, we add it as a KV pair in the footer. If we want to add a new, very complex section or re-arrange the current sections, we add it in the middle of the data table and bump up the version.

siddharthteotia · 2021-03-23T20:02:05Z

KV pair might be misleading here. Within a KV pair, the value part is indeed an arbitrary serialized object. The KV concept in the footer is just to give it some structure. So, we can keep growing the footer by adding a key to the enum and then the corresponding serialized bytes to the payload.

mcvsubbu · 2021-03-23T20:52:15Z

@siddharthteotia , @mqliang and I met, and agreed on the following (I have added some extras, so take a look)

We will move the metadata to the trailer, retain the other elements in the same order.
We will encode the trailer as
= (int, int, blob)+
The first int is the enum ordinal, second int is the length of the blob, the third part is utf8 encoding of a string, or int/long as dictated by the enum. If int/long, then we will encode in network byte order (big-endian). Alternative is to convert it to a string.
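
A rough sketch of the (int, int, blob)+ layout described above, for illustration only. TrailerCodec and its method names are hypothetical; the sketch assumes the reader looks up, per key, whether a blob holds a UTF-8 string or a big-endian long.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

public class TrailerCodec {
  /** Appends one (ordinal, length, blob) entry. */
  private static void writeEntry(ByteArrayOutputStream out, int ordinal, byte[] blob) {
    byte[] header = ByteBuffer.allocate(2 * Integer.BYTES).putInt(ordinal).putInt(blob.length).array();
    out.write(header, 0, header.length);
    out.write(blob, 0, blob.length);
  }

  public static void writeString(ByteArrayOutputStream out, int ordinal, String value) {
    writeEntry(out, ordinal, value.getBytes(StandardCharsets.UTF_8));
  }

  public static void writeLong(ByteArrayOutputStream out, int ordinal, long value) {
    // ByteBuffer defaults to big-endian, i.e. network byte order.
    writeEntry(out, ordinal, ByteBuffer.allocate(Long.BYTES).putLong(value).array());
  }

  /** Decodes the trailer into ordinal -> raw blob; the caller interprets each blob per key. */
  public static Map<Integer, byte[]> decode(byte[] trailer) {
    Map<Integer, byte[]> entries = new LinkedHashMap<>();
    ByteBuffer buffer = ByteBuffer.wrap(trailer);
    while (buffer.hasRemaining()) {
      int ordinal = buffer.getInt();
      byte[] blob = new byte[buffer.getInt()];
      buffer.get(blob);
      entries.put(ordinal, blob);
    }
    return entries;
  }
}
```

Because every entry carries its own length, a reader can skip over a blob whose ordinal it does not recognize.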

siddharthteotia · 2021-03-23T21:27:09Z

I think (int, int, bytes/blob in utf-8) is preferable as opposed to converting to string

mcvsubbu · 2021-03-23T21:58:58Z

Correct, but if the "blob" is an int or long value, then utf8 will mean long -> string -> utf8, right? The alternative is toBigEndian(longValue).

mqliang · 2021-03-24T17:41:40Z

@mcvsubbu Just found a defect in using enum values as keys and encoding the trailer as (int, int, bytes/blob in utf-8):

We are able to add a new key to the enum without bumping up the version.
We are able to omit a key from the trailer without bumping up the version.
However, we are unable to remove a key from the enum (if the key is no longer used in a future version).

Namely, say we now have three keys:

// old version:
enum {
    key1,
    key2,
    key3,
}

Now suppose we remove key2 from the enum since it is no longer used.

// new version
enum {
    key1,
    key3,
}

Then, when a new broker receives bytes from an old server, it will interpret the value of key2 as the value of key3.
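
The ordinal shift can be reproduced with two small enums. This is a toy sketch; OldKey and NewKey are hypothetical stand-ins for the same enum before and after key2 is removed.

```java
public class OrdinalShiftDemo {
  // The writer (old server) and the reader (new broker) see different versions of the enum.
  enum OldKey { KEY1, KEY2, KEY3 }
  enum NewKey { KEY1, KEY3 } // KEY2 has been removed

  /** The ordinal an old server would put on the wire for the given key. */
  public static int ordinalOnOldServer(String keyName) {
    return OldKey.valueOf(keyName).ordinal();
  }

  /** Resolves an ordinal received on the wire against the new broker's enum. */
  public static String resolveOnNewBroker(int ordinalFromWire) {
    return NewKey.values()[ordinalFromWire].name();
  }
}
```

Here resolveOnNewBroker(ordinalOnOldServer("KEY2")) yields "KEY3", which is exactly the misinterpretation described above.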

So a better solution is to use a string as the key and encode the trailer as (int of key length, bytes of key in utf-8, int of value length, bytes of value in utf-8), which is exactly how we encode metadata in V2.

However, if we do it this way, it is equivalent to just moving the metadata section to the end of the datatable, and it does not make much sense to bump up the version just to rearrange sections in the datatable.

Let's take a step back to what we want to solve:

we want to add serialization_cost to the datatable, but serialization_cost is not available before serialization.
we want to keep backward compatibility.

To add serialization_cost to the datatable after serialization, we basically have two options:

append it to the end of the bytes.
put a temporary value for serialization_cost in place during serialization and, after serialization is done, replace it with the actual value.

So, here is another approach:

don't add a trailer section
put serialization_cost into the metadata
when we serialize metadata in V2, we encode it as (int of key length, bytes of key in utf-8, int of value length, bytes of value in utf-8). Encoding in this way makes value replacement after serialization impossible, since String.valueOf("1000").length() != String.valueOf("100000").length().
In V3, keep all existing logic. However, if the value is a long, encode it as (int of key length, bytes of key in utf-8, toBigEndian(longValue)). In serializeMetadata(), we can keep a variable that records the start offset of the serialization_cost value.

byte[] bytes;
int serialization_cost_value_start_offset;

offset = 0;
for (String key: metadata.keySet()) {
      keybytes[] = to-utf8(key);
      bytes.append(keybytes.length())
      bytes.append(keybytes)

      offset += 4;
      offset += keybytes.length

      if (key.equals("serialization_cost")) {
            // fixed 8-byte value, written without a length prefix
            serialization_cost_value_start_offset = offset;
            valuebytes = toBigEndian(value);
            bytes.append(valuebytes)
            offset += 8;
      } else {
            valuebytes = to-utf8(value);
            bytes.append(valuebytes.length())
            bytes.append(valuebytes)
            offset += 4
            offset += valuebytes.length
      }
}

So after serialization, we are able to replace the value of serialization_cost (toBigEndian(longValue) is always 8 bytes, which makes replacement possible):

offset = metadataStartOffset+serialization_cost_value_start_offset
bytes[offset:offset+8] = toBigEndian(actualValue)
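
In Java, the in-place replacement above can be done with the absolute put/get methods of ByteBuffer. A minimal sketch, assuming the offset of the 8-byte value was recorded during serialization; InPlacePatch is a hypothetical helper name.

```java
import java.nio.ByteBuffer;

public class InPlacePatch {
  /** Overwrites the 8-byte big-endian long stored at the given offset; the array length is unchanged. */
  public static void patchLong(byte[] serialized, int offset, long actualValue) {
    // ByteBuffer.wrap shares the backing array, so the write lands in 'serialized'.
    ByteBuffer.wrap(serialized).putLong(offset, actualValue);
  }

  /** Reads back the 8-byte big-endian long stored at the given offset. */
  public static long readLong(byte[] serialized, int offset) {
    return ByteBuffer.wrap(serialized).getLong(offset);
  }
}
```

The replacement works only because the value occupies a fixed 8 bytes; a string-encoded number would change the length of the payload.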

Jackie-Jiang

High level question: why do we need this new field? We should be able to use the metadata field for this

mqliang · 2021-03-24T19:58:42Z

@Jackie-Jiang

High level question: why do we need this new field? We should be able to use the metadata field for this

We want to measure the CPU time spent serializing the datatable (AKA serialization_cost) on each server and send it back to the broker. Here is the dilemma: we only know the CPU time after serialization is completed, but once serialization is completed, how can we make serialization_cost part of the payload? (It's a chicken-and-egg problem.)

To add serialization_cost to the serialized bytes of the datatable, we basically have two options (we don't want to serialize twice):

append it to the end of the bytes.
put a temporary value for serialization_cost in place during serialization and, after serialization is done, replace it with the actual value.

No matter which option we adopt, we need to bump up the version.
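
To make the chicken-and-egg problem concrete, here is a sketch of measuring the CPU time around a serialization call. It assumes the measurement uses the JDK's ThreadMXBean, which is a guess for illustration and not necessarily how this PR measures it; SerializationTimer and Result are hypothetical names.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.function.Supplier;

public class SerializationTimer {
  /** Holds the serialized bytes and the time spent producing them. */
  public static final class Result {
    public final byte[] bytes;
    public final long cpuTimeNs;

    Result(byte[] bytes, long cpuTimeNs) {
      this.bytes = bytes;
      this.cpuTimeNs = cpuTimeNs;
    }
  }

  public static Result timed(Supplier<byte[]> serializer) {
    ThreadMXBean threadMXBean = ManagementFactory.getThreadMXBean();
    // Fall back to wall-clock time when per-thread CPU time is unsupported or disabled.
    boolean useCpuTime = threadMXBean.isCurrentThreadCpuTimeSupported() && threadMXBean.isThreadCpuTimeEnabled();
    long start = useCpuTime ? threadMXBean.getCurrentThreadCpuTime() : System.nanoTime();
    byte[] bytes = serializer.get();
    long end = useCpuTime ? threadMXBean.getCurrentThreadCpuTime() : System.nanoTime();
    // The cost is known only here, after the bytes already exist.
    return new Result(bytes, end - start);
  }
}
```

The measured value becomes available only after serializer.get() returns, which is why it has to be appended or patched into the already serialized bytes.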

mcvsubbu · 2021-03-24T20:16:00Z

Not sure which option @siddharthteotia agrees with, but the alternatives are something like:
7, 8, "12609856" (8 byte string for a number)
vs
7, 4, 12609856 (4-byte integer for a number)

Maybe we can decide based on what looks easier in code.

mcvsubbu · 2021-03-24T20:21:46Z

Removing enum constants will break the protocol and is not allowed. We need to state this clearly in the comments.
We should use a trailer instead of hacking the length. This will be applicable for streaming use cases as well.

mqliang · 2021-03-24T22:45:45Z

@mcvsubbu Ready for another round of review.

commit of "implement datatable V3":

Add DataTableImplV3. Compared with V2:
- V3 has a trailer section at the end of the datatable
- V3 does not have a metadata section; all KV pairs are put into the trailer section
- V3 has an exceptions section in the middle of the datatable. V2 uses metadata to store exceptions (with
  "Exception"+errCode as the key). In V3, all keys are enum values, which must be defined statically, so we cannot use
  "Exception"+errCode to create new keys; a dedicated section stores exceptions instead
Although the metadata section has been removed in V3, a lot of existing code uses dataTable.getMetadata().get("key")/dataTable.getMetadata().set("key", "value") to get/set metadata KV pairs. To provide the same interface as V2, V3 also implements the getMetadata() method. On serialization, all metadata is moved into the trailer section; on deserialization, all metadata KV pairs in the trailer section are moved into the metadata map.
When serializing the trailer section, for each KV pair:
- if the value is an int/long, encode it as: [keyOrdinal, bigEndianRepresentationOfValue]
- if the value is a string, encode it as: [keyOrdinal, valueLength, Utf8EncodedValue]

To make the review easier, I will @ you wherever V3 differs from V2.

commit of "add responseSerializationCpuTimeNs measurement":

Put a temporary value for serialization_cost in place during serialization; after serialization is done, replace it with the actual value.

mqliang · 2021-03-24T22:49:24Z

pinot-core/src/main/java/org/apache/pinot/core/common/datatable/DataTableImplV3.java

+   * TODO(@mqliang): revise this if we decide to get/set metadata by
+   *  datable.getTailerData(key)/datable.setTailer(key, value).
+   */
+  private final Map<String, String> _metadata;


All metadata KV pairs are stored in the trailer in V3. However, to provide the same interface as V2, V3 also implements the Map<String, String> getMetadata() method. We need to copy KV pairs between _metadata and _trailer during serialization/deserialization.

mqliang · 2021-03-24T22:50:59Z

pinot-core/src/main/java/org/apache/pinot/core/common/datatable/DataTableImplV3.java

+    // Write trailer data (START|SIZE).
+    dataOutputStream.writeInt(dataOffset);
+    // Put all meta data into trailer.
+    _trailer = putAllMetaDataIntoTrailer();


@mcvsubbu Before serializing _trailer, we need to copy all KV pairs in the metadata into the trailer.

mqliang · 2021-03-24T22:52:42Z

pinot-core/src/main/java/org/apache/pinot/core/common/datatable/DataTableImplV3.java

+  /**
+   * Construct data table from V2 byte array. (broker side)
+   */
+  public DataTableImplV3(ByteBuffer byteBuffer, boolean isV2)


@mcvsubbu This constructor is used to deserialize V2 bytes into a V3 datatable object.

mqliang · 2021-03-24T22:53:40Z

pinot-core/src/main/java/org/apache/pinot/core/common/datatable/DataTableImplV3.java

+     * Metadata is actually a part of _trailer in V3 when serialize DataTable into bytes. When deserialize,
+     * we extract metadata from _trailer into this _metadata map to provide the same interface with V2.
+     * */
+    _metadata = extractMetadataFormTrailer();


@mcvsubbu After de-serializing _trailer, we need to copy all metadata KV pairs in _trailer into _metadata.

mqliang · 2021-03-24T22:54:27Z

pinot-core/src/main/java/org/apache/pinot/core/common/datatable/DataTableImplV3.java

+     * V2 stores exceptions as a bunch of KV pairs in metadata, all exceptions has key of "Exception"+errCode.
+     * To interpret V2 bytes as V3 object, extract exceptions from metadata.
+     */
+    _exceptions = extractExceptionsFormV2Metadata();


@mcvsubbu V2 stores exceptions as a bunch of KV pairs in metadata; every exception has a key of the form "Exception"+errCode. To interpret V2 bytes as a V3 object, we extract the exceptions from the metadata and put them into _exceptions.
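
The extraction step could look roughly like the sketch below. It is an illustration under assumptions: V2ExceptionExtractor is a hypothetical name, the exceptions are collected into an errCode -> message map, and matching entries are removed from the metadata map, which may differ from what extractExceptionsFormV2Metadata() actually does.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

public class V2ExceptionExtractor {
  private static final String EXCEPTION_PREFIX = "Exception";

  /** Moves "Exception"+errCode entries out of the metadata into an errCode -> message map. */
  public static Map<Integer, String> extract(Map<String, String> metadata) {
    Map<Integer, String> exceptions = new HashMap<>();
    Iterator<Map.Entry<String, String>> iterator = metadata.entrySet().iterator();
    while (iterator.hasNext()) {
      Map.Entry<String, String> entry = iterator.next();
      String key = entry.getKey();
      if (!key.startsWith(EXCEPTION_PREFIX)) {
        continue;
      }
      String suffix = key.substring(EXCEPTION_PREFIX.length());
      // Treat the entry as an exception only when the suffix is a numeric error code.
      if (suffix.matches("\\d{1,9}")) {
        exceptions.put(Integer.parseInt(suffix), entry.getValue());
        iterator.remove(); // assumption: the entry should no longer appear as plain metadata
      }
    }
    return exceptions;
  }
}
```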

mqliang · 2021-03-24T22:55:09Z

pinot-core/src/main/java/org/apache/pinot/core/common/datatable/DataTableImplV3.java

+   * - if value is int/long, encode it as: [keyOrdinal, bigEndianRepresentationOfValue]
+   * - if value is string, encode it as: [keyOrdinal, valueLength, Utf8EncodedValue]
+   */
+  private byte[] serializeTrailer()


@mcvsubbu This is the code to serialize trailer.

mqliang · 2021-03-24T22:55:22Z

pinot-core/src/main/java/org/apache/pinot/core/common/datatable/DataTableImplV3.java

+    return byteArrayOutputStream.toByteArray();
+  }
+
+  private Map<TrailerKeys, String> deserializeTrailer(byte[] bytes)


@mcvsubbu This is the code to de-serialize trailer.

Jackie-Jiang · 2021-03-24T23:18:08Z

Since we are adding a new data table version, please use this opportunity to address the TODOs within the DataTableBuilder.
For the TrailerKeys enum, let's put an id for each key instead of using the ordinal of the enum. This way it is much easier to manage as long as we don't reuse the ids. Also suggest renaming it to MetadataKeys
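
A sketch of what an enum with explicit ids might look like. The key names and id values below are made up for illustration; the point is only that the wire id is decoupled from the ordinal, so a key can be retired without shifting the ids of the others.

```java
import java.util.HashMap;
import java.util.Map;

public class MetadataKeyDemo {
  // Hypothetical keys and ids; an id is never reused once it has been assigned.
  public enum MetadataKey {
    NUM_DOCS_SCANNED(0),
    // id 1 belonged to a retired key and stays unused
    RESPONSE_SER_CPU_TIME_NS(2);

    private static final Map<Integer, MetadataKey> BY_ID = new HashMap<>();

    static {
      for (MetadataKey key : values()) {
        BY_ID.put(key._id, key);
      }
    }

    private final int _id;

    MetadataKey(int id) {
      _id = id;
    }

    public int getId() {
      return _id;
    }

    /** Returns null for ids this reader does not know, so unknown entries can be skipped. */
    public static MetadataKey fromId(int id) {
      return BY_ID.get(id);
    }
  }
}
```

In this sketch RESPONSE_SER_CPU_TIME_NS has ordinal 1 but id 2, so removing or reordering constants does not change what is written on the wire.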

Jackie-Jiang · 2021-03-24T23:26:39Z

pinot-core/src/main/java/org/apache/pinot/core/common/datatable/DataTableImplV3.java

+  /**
+   * Construct data table from V2 byte array. (broker side)
+   */
+  public DataTableImplV3(ByteBuffer byteBuffer, boolean isV2)


It might be cleaner if you add a data table v2 to v3 converter instead of constructing v3 directly from v2 buffer

siddharthteotia · 2021-03-24T23:59:43Z

@siddharthteotia , @mqliang and I met, and agreed on the following (I have added some extras, so take a look)

We will move the metadata to the trailer, retain the other elements in the same order.

We will encode the trailer as

trailer = (int, int, blob)+

The first int is the enum ordinal, second int is the length of the blob, the third part is utf8 encoding of a string, or int/long as dictated by the enum. If int/long, then we will encode in network byte order (big-endian). Alternative is to convert it to a string.

Not sure which option @siddharthteotia agrees with, but the alternatives are something like:
7, 8, "12609856" (8 byte string for a number)
vs
7, 4, 12609856 (4-byte integer for a number)

Maybe we can decide based on what looks easier in code.

@mcvsubbu I agree with the big-endian approach for the case where the value/blob part itself is fixed-width (int or long)
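To make the two alternatives above concrete, here is a minimal sketch of both encodings for a single (key ordinal, value length, value) entry. The class and method names are hypothetical and are not part of the PR; the sketch only assumes that `java.nio.ByteBuffer` writes in big-endian (network) byte order by default.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not Pinot code: encodes one trailer entry as (key ordinal, value length, value).
final class TrailerEntryEncodingSketch {
  // Alternative 1: the number is carried as a UTF-8 string, e.g. 7, 8, "12609856".
  static byte[] encodeAsString(int keyOrdinal, int value) {
    byte[] valueBytes = Integer.toString(value).getBytes(StandardCharsets.UTF_8);
    return ByteBuffer.allocate(Integer.BYTES * 2 + valueBytes.length)
        .putInt(keyOrdinal)
        .putInt(valueBytes.length)
        .put(valueBytes)
        .array();
  }

  // Alternative 2: the number is carried as a 4-byte big-endian int, e.g. 7, 4, 12609856.
  static byte[] encodeAsBigEndianInt(int keyOrdinal, int value) {
    // ByteBuffer is big-endian by default, so no explicit byte order is set here.
    return ByteBuffer.allocate(Integer.BYTES * 3)
        .putInt(keyOrdinal)
        .putInt(Integer.BYTES)
        .putInt(value)
        .array();
  }
}
```

For the value 12609856 the string form takes 16 bytes and the fixed-width form takes 12. More importantly for this PR, the fixed-width form has the same size for every value, which the string form does not.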

siddharthteotia · 2021-03-25T00:00:43Z

@mcvsubbu Just found a defect in using the enum value as the key and encoding the trailer as (int, int, bytes/blob in utf-8):

We are able to add a new key to the enum without bumping up the version

We are able to omit a key from the trailer without bumping up the version

However, we are unable to remove a key from the enum (if the key is no longer used in a future version)

Namely, say we now have three keys:
// old version:
enum {
    key1,
    key2,
    key3,
}
Now suppose we remove key2 from the enum since it's no longer used.
// new version
enum {
    key1,
    key3,
}
Then, when a new broker receives bytes from an old server, it will interpret the value of key2 as the value of key3.
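The ordinal shift described above can be reproduced with a small sketch. The two enums are hypothetical stand-ins for the old and new versions of the key enum; nothing here is taken from the PR's code.

```java
// Hypothetical enums standing in for the old and new versions of the trailer keys.
final class OrdinalShiftSketch {
  enum OldKeys { KEY1, KEY2, KEY3 }

  enum NewKeys { KEY1, KEY3 } // KEY2 has been removed

  // An old server writes the ordinal of the key; a new broker resolves it against its own enum.
  static String decodeOnNewBroker(OldKeys sentByOldServer) {
    int ordinalOnWire = sentByOldServer.ordinal();
    NewKeys[] known = NewKeys.values();
    if (ordinalOnWire < known.length) {
      return known[ordinalOnWire].name();
    }
    return "<unknown ordinal " + ordinalOnWire + ">";
  }
}
```

With these definitions, a value sent under `OldKeys.KEY2` (ordinal 1) is read as `KEY3` on the new side, and a value sent under `OldKeys.KEY3` (ordinal 2) cannot be resolved at all.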

So a better solution is to use a string as the key and encode the trailer as (int of key length, bytes of key in utf-8, int of value length, bytes of value in utf-8), which is exactly how we encode metadata in V2.

However, if we do it this way, it's equivalent to just moving the metadata section to the end of the datatable, and it does not make much sense to bump up the version just for rearranging sections in the datatable.

Let's take a step back to what we want to solve:

we want to add serialization_cost to the datatable, but serialization_cost is not available before serialization.

we want to keep backward compatibility

To add serialization_cost to datatable after serialization, basically we have two options:

append it to the end of bytes.

put a temporary value of serialization_cost during serialization; after serialization is done, replace it with the actual value.

So, here is another approach:

don't add a trailer section

put serialization_cost into metadata

when we serialize metadata in V2, we encode each entry as (int of key length, bytes of key in utf-8, int of value length, bytes of value in utf-8). Encoding it this way makes value replacement after serialization impossible, since String.valueOf("1000").length() != String.valueOf("100000").length().

In V3, keep all existing logic. However, if the value is a long, we should encode it as (int of key length, bytes of key in utf-8, toBigEndian(longValue)). And in the serializeMetadata() function, we can have a variable to record the start offset of the serialization_cost value.
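// Fragment: assumes a Map<String, String> metadata in scope and an enclosing method that may throw IOException.
ByteArrayOutputStream byteStream = new ByteArrayOutputStream();
DataOutputStream out = new DataOutputStream(byteStream);  // writes int/long in big-endian order
int serializationCostValueStartOffset = -1;

int offset = 0;
for (Map.Entry<String, String> entry : metadata.entrySet()) {
      byte[] keyBytes = entry.getKey().getBytes(StandardCharsets.UTF_8);
      out.writeInt(keyBytes.length);
      out.write(keyBytes);

      offset += Integer.BYTES;
      offset += keyBytes.length;

      if (entry.getKey().equals("serialization_cost")) {
            // Fixed-width value: always 8 bytes, so it can be overwritten after serialization.
            serializationCostValueStartOffset = offset;
            out.writeLong(Long.parseLong(entry.getValue()));
            offset += Long.BYTES;
      } else {
            byte[] valueBytes = entry.getValue().getBytes(StandardCharsets.UTF_8);
            out.writeInt(valueBytes.length);
            out.write(valueBytes);
            offset += Integer.BYTES;
            offset += valueBytes.length;
      }
}
byte[] metadataBytes = byteStream.toByteArray();
So after serialization, we are able to replace the value of serialization_cost (a big-endian long is always 8 bytes, which makes replacement possible):
// dataTableBytes is the fully serialized data table; metadataStartOffset is where metadataBytes begins in it.
int valueOffset = metadataStartOffset + serializationCostValueStartOffset;
ByteBuffer.wrap(dataTableBytes).putLong(valueOffset, actualValue);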
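As a self-contained illustration of the placeholder-then-patch idea in this comment, the sketch below writes one (key length, key bytes, 8-byte long) entry, remembers where the value starts, and overwrites it afterwards. The class and method names are hypothetical, and the key name `serialization_cost` is simply the one used in the discussion.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: write a fixed-width placeholder, then patch it once the real cost is known.
final class PatchFixedWidthValueSketch {
  static final String COST_KEY = "serialization_cost";

  // Writes (keyLength, keyBytes, 8-byte placeholder) and returns the offset where the value starts.
  static int writeCostEntry(ByteBuffer buffer) {
    byte[] keyBytes = COST_KEY.getBytes(StandardCharsets.UTF_8);
    buffer.putInt(keyBytes.length);
    buffer.put(keyBytes);
    int valueOffset = buffer.position();
    buffer.putLong(0L); // placeholder; a long always occupies 8 bytes
    return valueOffset;
  }

  // Overwrites the placeholder in place; the absolute putLong does not move the buffer position.
  static void patchCost(ByteBuffer buffer, int valueOffset, long actualCostNs) {
    buffer.putLong(valueOffset, actualCostNs);
  }
}
```

Because the value is fixed-width, patching never shifts the bytes that follow it, which is exactly what the string encoding cannot guarantee.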

@mqliang @mcvsubbu I don't think we should worry about or even allow removal of enums. It is complicating the design plus it's something that is typically not allowed

siddharthteotia · 2021-03-25T00:06:10Z

With the addition of new data structure in this PR, there are essentially two places in DataTable where the key-value / name-value style structure is located.

First is the existing DataTable metadata which is also a series of key-value pairs where key is string and value is some statistic/metric. This is towards the beginning of the byte stream

Second is the structure introduced in this PR which is written as a footer.

Since we are anyway bumping up the version, how about we move the existing metadata of key-value pairs to the end of file to keep consistency in the format. So, all the metadata stuff (aka key-value pairs) + new positional stuff can be a file footer.

With this PR, we should resolve a couple of TODOs introduced in PR #6680

Expose the serialization time through an API at the DataTable level and log it in QueryScheduler. You need to serialize before the logging line. Currently it is after.

Revisit this. The execution cpu time is not yet serialized as part of metadata. Maybe we can just remove line 258.

@mqliang , please make sure to address these TODOs

mcvsubbu · 2021-03-30T16:03:29Z

I have labelled it as backward-incompat and release-notes. Please add appropriate checkin comments mentioning that this change will be backward incompat if servers are upgraded first, so brokers must be upgraded before servers.
Also mention that the compatibility of the protocols will not be retained beyond 0.8.0 (or the next version that is released), create an issue so that we remove all V2 protocol code after 0.8.0 is released.

mcvsubbu

Add a test for metadata section serialize/deserialize. Be sure to examine the actual bytes in the test and not just call deserialize code. Thanks.

mcvsubbu · 2021-03-30T16:28:33Z

pinot-common/src/main/java/org/apache/pinot/common/utils/CommonConstants.java

@@ -321,6 +321,9 @@
    public static final String CONFIG_OF_ENABLE_THREAD_CPU_TIME_MEASUREMENT =
        "pinot.server.instance.enableThreadCpuTimeMeasurement";
    public static final boolean DEFAULT_ENABLE_THREAD_CPU_TIME_MEASUREMENT = false;
+
+    public static final String CONFIG_OF_CURRENT_DATA_TABLE_VERSION = "pinot.server.instance.currentDataTableVersion";


We can retain this config forever, to be used for upgrading the protocol.

Note that by default the protocol version is the latest (3). The config will be used to downgrade the protocol to 2 without having to roll back the server deployment in case there are any issues.

mcvsubbu · 2021-03-30T16:34:02Z

pinot-core/src/main/java/org/apache/pinot/core/common/datatable/DataTableBuilder.java

    _dataSchema = dataSchema;
    _columnOffsets = new int[dataSchema.size()];
    _rowSizeInBytes = DataTableUtils.computeColumnOffsets(dataSchema, _columnOffsets);
  }

+  public static void setCurrentDataTableVersion(int version) {
+    _version = version;


Throw exception if it is not one of the supported versions
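A minimal sketch of the validation requested here, assuming the two version constants shown elsewhere in this review. The wrapper class name, the getter, and the choice of `IllegalArgumentException` are illustrative assumptions, not the PR's actual code.

```java
// Hypothetical stand-in for the static version state kept in DataTableBuilder.
final class DataTableVersionSketch {
  static final int VERSION_2 = 2;
  static final int VERSION_3 = 3;
  private static int _version = VERSION_3;

  static void setCurrentDataTableVersion(int version) {
    // Reject anything that is not a known protocol version before touching the state.
    if (version != VERSION_2 && version != VERSION_3) {
      throw new IllegalArgumentException("Unsupported data table version: " + version);
    }
    _version = version;
  }

  static int getCurrentDataTableVersion() {
    return _version;
  }
}
```

Validating before assignment means a bad config value leaves the previously configured version in effect.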

mcvsubbu · 2021-03-30T16:36:54Z

pinot-core/src/main/java/org/apache/pinot/core/common/datatable/DataTableImplV3.java

+ * Datatable V3 implementation.
+ * The layout of serialized V3 datatable looks like:
+ * 	+-----------------------------------------------+
+ * 	| 13 bytes of header:                           |


Suggested change

* | 13 bytes of header: |

* | 13 integers of header: |

mcvsubbu · 2021-03-30T16:38:20Z

pinot-core/src/main/java/org/apache/pinot/core/common/datatable/DataTableImplV3.java

+public class DataTableImplV3 extends DataTableImplBase {
+  private static final int HEADER_SIZE = Integer.BYTES * 13;
+  // _exceptions stores exceptions as a map of errorCode->errorMessage
+  private final Map<Integer, String> _exceptions;


Suggested change

private final Map<Integer, String> _exceptions;

private final Map<Integer, String> _errCodeToExceptionMap;

mcvsubbu · 2021-03-30T16:39:38Z

pinot-core/src/main/java/org/apache/pinot/core/common/datatable/DataTableImplV3.java

+
+    // Read metadata.
+    int metadataLength = byteBuffer.getInt();
+    byte[] trailerBytes = new byte[metadataLength];


Suggested change

byte[] trailerBytes = new byte[metadataLength];

byte[] metadataBytes = new byte[metadataLength];

Let us keep the naming consistent.

mcvsubbu · 2021-03-30T16:39:52Z

pinot-core/src/main/java/org/apache/pinot/core/common/datatable/DataTableImplV3.java

+    }
+
+    // Read metadata.
+    int metadataLength = byteBuffer.getInt();


Add the case where metadataLength is 0
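One way the zero-length case could be handled is sketched below. The method name and the extra bounds check are assumptions made for illustration; the PR's constructor may structure this differently.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch of reading the metadata section defensively.
final class MetadataLengthGuardSketch {
  // Returns the raw metadata bytes, or an empty array when the sender wrote a zero length.
  static byte[] readMetadataBytes(ByteBuffer byteBuffer) {
    int metadataLength = byteBuffer.getInt();
    if (metadataLength == 0) {
      return new byte[0];
    }
    // Guard against a corrupt or truncated payload instead of failing deep inside get().
    if (metadataLength < 0 || metadataLength > byteBuffer.remaining()) {
      throw new IllegalStateException("Invalid metadata length: " + metadataLength);
    }
    byte[] metadataBytes = new byte[metadataLength];
    byteBuffer.get(metadataBytes);
    return metadataBytes;
  }
}
```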

mcvsubbu · 2021-03-30T16:45:57Z

pinot-core/src/main/java/org/apache/pinot/core/common/datatable/DataTableImplV3.java

+  /**
+   * Serialize metadata section to bytes.
+   * Format of the bytes looks like:
+   * [numEntries, bytesOfKV2, bytesOfKV2, bytesOfKV3]


this is wrong description. The format is:

length of metadata section

actual metadata
Metadata can be one of two types -- fixed (Int/long) or var length.
A fixed length metadata is coded as: (enumOrdinal , metadata value )
Var length metadata is coded as: (enumOrdinal, metadata length, metadata value)
All integer values (including ordinal, etc.) are encoded in BigEndian format

Oh, actually the length of the metadata section is written outside of this function; it is written by the caller. So the description of [numEntries, bytesOfKV2, bytesOfKV2, bytesOfKV3] here is correct. I have added comments at the caller to highlight the length-writing logic.
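Putting the two descriptions in this thread together (an entry count first, then fixed-width or variable-length entries keyed by enum ordinal, all integers big-endian), the section body could be produced roughly as follows. The enum is a hypothetical subset, `TABLE` is an invented string-valued key, and the int/long classification of the other keys is an assumption. As noted above, the section length itself would be written by the caller.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Map;

// Hypothetical sketch of the metadata section body: [numEntries, entry, entry, ...].
final class MetadataSectionSketch {
  enum Key { UNKNOWN, NUM_SEGMENTS_QUERIED, THREAD_CPU_TIME_NS, TABLE }

  static boolean isInt(Key key) {
    return key == Key.NUM_SEGMENTS_QUERIED;
  }

  static boolean isLong(Key key) {
    return key == Key.THREAD_CPU_TIME_NS;
  }

  static byte[] serialize(Map<Key, String> metadata) {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    // DataOutputStream writes int and long values in big-endian order.
    try (DataOutputStream out = new DataOutputStream(bytes)) {
      out.writeInt(metadata.size());
      for (Map.Entry<Key, String> entry : metadata.entrySet()) {
        Key key = entry.getKey();
        out.writeInt(key.ordinal());
        if (isInt(key)) {
          out.writeInt(Integer.parseInt(entry.getValue())); // (ordinal, value)
        } else if (isLong(key)) {
          out.writeLong(Long.parseLong(entry.getValue())); // (ordinal, value)
        } else {
          byte[] valueBytes = entry.getValue().getBytes(StandardCharsets.UTF_8);
          out.writeInt(valueBytes.length); // (ordinal, length, value)
          out.write(valueBytes);
        }
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return bytes.toByteArray();
  }
}
```

A test in the spirit of the earlier request to "examine the actual bytes" would wrap the result in a `ByteBuffer` and read back each int, long, and string at its expected position rather than only round-tripping through the deserializer.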

siddharthteotia · 2021-03-30T19:30:50Z

pinot-core/src/main/java/org/apache/pinot/core/common/datatable/DataTableImplBase.java

+  protected ByteBuffer _variableSizeData;
+  protected Map<String, String> _metadata;
+
+  public DataTableImplBase(int numRows, DataSchema dataSchema, Map<String, Map<Integer, String>> dictionaryMap,


Please add javadoc

siddharthteotia · 2021-03-30T19:30:54Z

pinot-core/src/main/java/org/apache/pinot/core/common/datatable/DataTableBuilder.java

@@ -91,11 +94,16 @@
  private ByteBuffer _currentRowDataByteBuffer;

  public DataTableBuilder(DataSchema dataSchema) {
+    _version = VERSION_3;


This is not really needed since you already define it at line# 82

siddharthteotia · 2021-03-30T19:37:50Z

pinot-core/src/main/java/org/apache/pinot/core/query/executor/ServerQueryExecutorV1Impl.java

@@ -138,7 +138,7 @@ public DataTable processQuery(ServerQueryRequest queryRequest, ExecutorService e
      String errorMessage = String
          .format("Query scheduling took %dms (longer than query timeout of %dms)", querySchedulingTimeMs,
              queryTimeoutMs);
-      DataTable dataTable = new DataTableImplV2();
+      DataTable dataTable = new DataTableImplV3();


This seems incorrect since if the protocol config is set to V2, we should not be constructing V3 data table.

I think all these places are constructing empty data table on the server right?

I think we should replace these with DataTableUtils.buildEmptyDataTable() to properly build an empty data table. Secondly, since DataTableUtils internally uses DataTableBuilder which is aware of the version so it will build an empty table based on V2 or V3

Let's discuss this to see what we need to do here. Might want to cleanup the existing code first to always build empty data table in the same manner. We have 2 options

Add a static method to DataTableBuilder -- something like DataTableBuilder.getDefaultTable() , this internally has the version so it will either return new DataTableImplV2() or new DataTableImplV3()

Clean the existing code by always using DataTableUtils.buildEmptyDataTable in these situations

For option 2, I am not sure why the existing code (not this PR) is having mixed semantics for constructing empty data table. Several places are directly calling the constructor which sets everything to null whereas in one unique place we are calling DataTableUtils.buildEmptyDataTable(queryContext) to return an empty data table with properly initialized schema

Discussed this offline with @mqliang. For now we decided to go with option 1. Add a TODO there to follow up with a PR that unifies the way empty data tables are constructed everywhere.
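Option 1 could look roughly like the sketch below. The nested `DataTable` interface and the two implementation classes are simplified stand-ins so that the example compiles on its own, and the method name `getEmptyDataTable` is an assumption (the comment above only says "something like DataTableBuilder.getDefaultTable()").

```java
// Hypothetical sketch of a version-aware factory for empty data tables.
final class EmptyDataTableFactorySketch {
  // Simplified stand-ins for the real DataTable types.
  interface DataTable {
    int getVersion();
  }

  static final class DataTableImplV2 implements DataTable {
    public int getVersion() {
      return 2;
    }
  }

  static final class DataTableImplV3 implements DataTable {
    public int getVersion() {
      return 3;
    }
  }

  static final int VERSION_2 = 2;
  static final int VERSION_3 = 3;
  private static int _version = VERSION_3;

  static void setCurrentDataTableVersion(int version) {
    _version = version;
  }

  // Call sites ask for an empty table instead of calling a versioned constructor directly.
  static DataTable getEmptyDataTable() {
    switch (_version) {
      case VERSION_2:
        return new DataTableImplV2();
      case VERSION_3:
        return new DataTableImplV3();
      default:
        throw new IllegalStateException("Unexpected data table version: " + _version);
    }
  }
}
```

With this shape, the three call sites flagged in this review would not need to know which protocol version is configured.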

siddharthteotia · 2021-03-30T19:40:30Z

pinot-core/src/main/java/org/apache/pinot/core/query/scheduler/QueryScheduler.java

@@ -161,13 +163,15 @@ public void stop() {
          queryRequest.getBrokerId(), e);
      // For not handled exceptions
      serverMetrics.addMeteredGlobalValue(ServerMeter.UNCAUGHT_EXCEPTIONS, 1);
-      dataTable = new DataTableImplV2();
+      dataTable = new DataTableImplV3();


I think we should use DataTableUtils.buildEmptyDataTable()

Please address this as per approach discussed in #6710

siddharthteotia · 2021-03-30T19:47:51Z

pinot-core/src/test/java/org/apache/pinot/core/common/datatable/DataTableSerDeTest.java

+  }
+
+  @Test
+  public void testV2V3Compatibility()


let's also add test cases for

v3 data table sent by server is empty

v3 data table sent by server has metadata length as 0

v3 data table sent by server has metadata length as 0

That's impossible, since toBytes() in V3 will always add a threadCpuTimeNs KV pair to metadata, so for V3 the metadata contains at least one KV pair. We can add tests for: an empty datatable (numRow = 0); a datatable whose metadata contains only the threadCpuTimeNs KV pair; a datatable whose metadata has multiple KV pairs, etc.

@mqliang the idea here is to make sure that the receiver handles things as much as possible even if the sender does something weird (say, someone introduces a bug, or somehow the next rev of the protocol does something funky). See the robustness principle: https://en.wikipedia.org/wiki/Robustness_principle

siddharthteotia · 2021-03-30T19:52:54Z

pinot-core/src/test/java/org/apache/pinot/core/common/datatable/DataTableSerDeTest.java

+    DataTable dataTableV2 = dataTableBuilderV2.build(); // create a V2 data table
+    // Deserialize data table bytes as V3
+    DataTable newDataTable = DataTableFactory.getDataTable(dataTableV2.toBytes());
+    Assert.assertEquals(newDataTable.getDataSchema(), dataSchema, ERROR_MESSAGE);


Not sure I follow this test

server is constructing a v2 data table and serializing it

broker will use DataTableFactory to get the data table. How can broker get it as v3 when the version # will indicate 2 and DataTableFactory will accordingly create DataTableImplV2 ?

Oh, that's from the previous implementation, where we had a converter to convert V2 to V3. Will change the comments here.

Done. I have updated the comments.

siddharthteotia · 2021-03-30T20:44:44Z

pinot-core/src/main/java/org/apache/pinot/core/query/scheduler/QueryScheduler.java

@@ -315,7 +313,7 @@ private boolean forceLog(long schedulerWaitMs, long numDocsScanned) {
   */
  protected ListenableFuture<byte[]> immediateErrorResponse(ServerQueryRequest queryRequest,
      ProcessingException error) {
-    DataTable result = new DataTableImplV2();
+    DataTable result = new DataTableImplV3();


Please fix this as per https://github.com/apache/incubator-pinot/pull/6710/files#r604379681

Jackie-Jiang

Suggest separating the builder and reader for v2 and v3 because we will need to remove v2 implementations in the next release

Jackie-Jiang · 2021-03-30T22:08:41Z

pinot-common/src/main/java/org/apache/pinot/common/utils/DataTable.java

+   *  - Always add new keys to the end.
+   *  Otherwise, backward compatibility will be broken.
+   */
+  enum MetadataKeys {


I still suggest associating an id with each key instead of using ordinal of the enum. The convention here should be always increasing the id when adding new keys.
Id is more flexible than ordinal for the following reasons:

Ordinal works as always putting the index key as the id. If by any chance people accidentally change the order of the keys, it will break

With id, we can remove keys in a backward-compatible way in two releases if necessary. With ordinal, we have to keep a place holder so that the ordinal for other keys don't change

@mqliang @siddharthteotia @mcvsubbu Thoughts?

@Jackie-Jiang , I prefer enums. We can add a unit test that asserts (A < B< C ...), to catch any re-orders.
If we have to manually insert a value, then duplicate values are possible (by mistake) and that can also cause problems.

How about we use the enum for now? We can discuss more; if we later decide to associate an id with each key, then as long as we associate the first key with 0, the second with 1, the third with 2, and so on, the bytes sent on the wire will not change. We can address it in a separate PR: it's just a code-level change and will not change any payloads.

Yes, we can argue both ways here but my preference would be enum with implicit ordinal as opposed to id based. I agree the latter gives more flexibility to the user but I don't think we need it. So a simple enum with ordinal as id along with clear javadoc highlighting the rules for updating the enum is preferable imo.
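The unit test suggested in this thread, one that catches accidental re-ordering of the enum, could pin each key to its expected ordinal. The sketch below is a hypothetical illustration: the enum is a small stand-in, and its ordering is an assumption rather than the ordering in the PR.

```java
import java.util.Map;

// Hypothetical sketch: fail fast if any key's ordinal differs from the value pinned in the test.
final class PinnedOrdinalCheckSketch {
  enum MetadataKey { UNKNOWN, NUM_SEGMENTS_QUERIED, THREAD_CPU_TIME_NS }

  static void checkOrdinalsUnchanged(Map<String, Integer> pinnedOrdinals) {
    for (Map.Entry<String, Integer> pinned : pinnedOrdinals.entrySet()) {
      // valueOf throws IllegalArgumentException if a pinned key was removed or renamed.
      int actual = MetadataKey.valueOf(pinned.getKey()).ordinal();
      if (actual != pinned.getValue()) {
        throw new IllegalStateException(
            "Ordinal of " + pinned.getKey() + " changed from " + pinned.getValue() + " to " + actual);
      }
    }
  }
}
```

New keys appended at the end pass the check untouched, while a re-order or a removal in the middle changes at least one pinned ordinal and fails it.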

Jackie-Jiang · 2021-03-30T22:09:21Z

pinot-common/src/main/java/org/apache/pinot/common/utils/DataTable.java

+    THREAD_CPU_TIME_NS("threadCpuTimeNs"),
+    ;


(nit)

Suggested change

THREAD_CPU_TIME_NS("threadCpuTimeNs"),

;

THREAD_CPU_TIME_NS("threadCpuTimeNs");

Jackie-Jiang · 2021-03-30T22:10:31Z

pinot-common/src/main/java/org/apache/pinot/common/utils/DataTable.java

+   *  - Always add new keys to the end.
+   *  Otherwise, backward compatibility will be broken.
+   */
+  enum MetadataKeys {


Suggested change

enum MetadataKeys {

enum MetadataKey {

Jackie-Jiang · 2021-03-30T22:13:02Z

pinot-common/src/main/java/org/apache/pinot/common/utils/DataTable.java

+    }
+
+    // getByOrdinal returns an optional enum key for a given ordinal
+    public static Optional<MetadataKeys> getByOrdinal(int ordinal) {


You don't need Optional here, but either:

Throw exception for invalid id (suggest this way)

Return null for invalid id

Jackie-Jiang · 2021-03-30T22:13:10Z

pinot-common/src/main/java/org/apache/pinot/common/utils/DataTable.java

+    }
+
+    // getByName returns an optional enum key for a given name.
+    public static Optional<MetadataKeys> getByName(String name) {


Jackie-Jiang · 2021-03-30T22:14:29Z

pinot-common/src/main/java/org/apache/pinot/common/utils/DataTable.java

+      return _name;
+    }
+
+    static {


Put this block following the map definition for better readability

Oh, the code block was placed here by IntelliJ's reformat. I'd suggest keeping it as it is: if someone later changes this file and runs IntelliJ reformatting before committing, it will be moved here anyway.

Jackie-Jiang · 2021-03-30T22:16:03Z

pinot-core/src/main/java/org/apache/pinot/core/common/datatable/DataTableBuilder.java

@@ -77,6 +77,9 @@
 // TODO:   3. Given a data schema, write all values one by one instead of using rowId and colId to position (save time).
 // TODO:   4. Store bytes as variable size data instead of String
 public class DataTableBuilder {


Suggest making 2 builders, one for v2 and one for v3. You can extract the common logic into a base class, or just duplicate code because we will deprecate v2 in the next release once v3 is well tested

We plan to remove all V2 logic after the next release. So, we can keep re-factors and beautifications to a minimum. Please do only what is necessary because all of V2 logic will disappear and someone looking at the code will wonder why we have a base class

+1 for keeping the current logic. Another drawback of having two builders is that every caller needs to decide whether to call the v2 builder or the v3 builder based on instance config, which is ugly.

Jackie-Jiang · 2021-03-30T22:19:23Z

pinot-core/src/main/java/org/apache/pinot/core/common/datatable/DataTableBuilder.java

@@ -96,6 +99,17 @@ public DataTableBuilder(DataSchema dataSchema) {
    _rowSizeInBytes = DataTableUtils.computeColumnOffsets(dataSchema, _columnOffsets);


This won't be correct because we want to fix the float value size (should be 4 but use 8 bytes in v2)

As discussed offline, we want this change to focus on the metadata change. I will send a separate PR to bump the version up to V4, dedicated to addressing all TODOs in DataTableBuilder, including:

fix the float value size issue

Store bytes as variable size data instead of String

Use one map of "String->Int" for all columns, instead of one map per column.

Jackie-Jiang · 2021-03-30T22:19:43Z

pinot-core/src/main/java/org/apache/pinot/core/common/datatable/DataTableBuilder.java

@@ -77,6 +77,9 @@
 // TODO:   3. Given a data schema, write all values one by one instead of using rowId and colId to position (save time).
 // TODO:   4. Store bytes as variable size data instead of String
 public class DataTableBuilder {
+  public static final int VERSION_2 = 2;
+  public static final int VERSION_3 = 3;
+  private static int _version = VERSION_3;


This should not be hardcoded but from the config

We have a setCurrentDataTableVersion() static function to set versions, which is called in HelixServerStarter

Jackie-Jiang · 2021-03-30T22:20:37Z

pinot-core/src/main/java/org/apache/pinot/core/common/datatable/DataTableImplBase.java

+/**
+ * Base implementation of the DataTable interface.
+ */
+public abstract class DataTableImplBase implements DataTable {


Rename to BaseDataTable

amrishlal · 2021-04-01T01:48:18Z

pinot-common/src/main/java/org/apache/pinot/common/utils/DataTable.java

+   *  Otherwise, backward compatibility will be broken.
+   */
+  enum MetadataKey {
+    UNKNOWN("unknown"),


Is UNKNOWN really needed? Can we get rid of it?

I have found it useful to have one enum reserved that is never used. It is never sent by the sender, but the receiver, if needed, can set it to this value if it encounters a value that it does not know about. In that case, the special case handling is restricted to the layer that first scans the enums, and the other layers above don't need to worry about default cases.
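A small sketch of the reserved-sentinel pattern described here, where the receiver maps any ordinal it does not recognise to `UNKNOWN`. The enum is a hypothetical subset and `fromWire` is an illustrative name. How many bytes the receiver should skip for an unrecognised key is a separate question that this sketch does not address.

```java
// Hypothetical sketch: ordinal 0 is reserved and never sent; unknown ordinals collapse onto it.
final class UnknownKeySketch {
  enum MetadataKey {
    UNKNOWN,
    NUM_SEGMENTS_QUERIED,
    THREAD_CPU_TIME_NS;

    private static final MetadataKey[] VALUES = values();

    // Resolves an ordinal read from the wire; anything out of range becomes UNKNOWN.
    static MetadataKey fromWire(int ordinal) {
      if (ordinal > 0 && ordinal < VALUES.length) {
        return VALUES[ordinal];
      }
      return UNKNOWN;
    }
  }
}
```

Only the layer that scans the ordinals needs the range check; callers above it can compare against `UNKNOWN` instead of handling nulls or exceptions.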

amrishlal · 2021-04-01T01:49:00Z

pinot-common/src/main/java/org/apache/pinot/common/utils/DataTable.java

+    private static final Set<MetadataKey> _intValueMetadataKey = ImmutableSet
+        .of(MetadataKey.NUM_SEGMENTS_QUERIED, MetadataKey.NUM_SEGMENTS_PROCESSED, MetadataKey.NUM_SEGMENTS_MATCHED,
+            MetadataKey.NUM_RESIZES, MetadataKey.NUM_CONSUMING_SEGMENTS_PROCESSED, MetadataKey.NUM_RESIZES);
+    // _longValueMetadataKey contains all metadata keys which has value of long type.


Why do we need _intValueMetadataKey and _longValueMetadataKey? Instead of maintaining two static maps to decide which parameter is long and which is int, can we add a member variable _type for each of the enum options? This will also allow for replacing isIntValueMetadataKey() and isLongValueMetadataKey() functions with getType()?

Looks good, but wondering if we can use ColumnDataType (which is widely used already) instead of defining a new enum which more or less means the same thing? I think the ordinal position of values in ColumnDataType is already fixed (from a serialization/deserialization point of view), but for safety we can add a comment there saying not to change the ordinal position.

That's a good idea. I see Jackie has some related work to unify the usage of ColumnDataType: #6728; he mentioned that we will consider merging DataType and ColumnDataType in the future. So let's address it separately.
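The suggestion at the top of this thread, a per-key type member instead of two static sets, might look like the following. `MetadataValueType` is a hypothetical enum introduced only for this sketch (the thread leaves open whether `ColumnDataType` should be reused), the name string `numSegmentsQueried` is assumed, and treating `THREAD_CPU_TIME_NS` as a long is also an assumption.

```java
// Hypothetical sketch: each key carries its own value type, replacing the int/long key sets.
final class TypedMetadataKeySketch {
  enum MetadataValueType { INT, LONG, STRING }

  enum MetadataKey {
    UNKNOWN("unknown", MetadataValueType.STRING),
    NUM_SEGMENTS_QUERIED("numSegmentsQueried", MetadataValueType.INT),
    THREAD_CPU_TIME_NS("threadCpuTimeNs", MetadataValueType.LONG);

    private final String _name;
    private final MetadataValueType _valueType;

    MetadataKey(String name, MetadataValueType valueType) {
      _name = name;
      _valueType = valueType;
    }

    String getName() {
      return _name;
    }

    // Replaces isIntValueMetadataKey() / isLongValueMetadataKey() style lookups.
    MetadataValueType getValueType() {
      return _valueType;
    }
  }
}
```

Declaring the type next to the key also avoids the kind of slip visible in the quoted set above, where `NUM_RESIZES` is listed twice.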

amrishlal · 2021-04-01T01:49:30Z

pinot-core/src/main/java/org/apache/pinot/core/common/datatable/BaseDataTable.java

+    super();
+    _numRows = 0;
+    _numColumns = 0;
+    _dataSchema = null;
+    _columnOffsets = null;
+    _rowSizeInBytes = 0;
+    _dictionaryMap = null;
+    _fixedSizeDataBytes = null;
+    _fixedSizeData = null;
+    _variableSizeDataBytes = null;
+    _variableSizeData = null;


This block of code including the call to super() is redundant as Java will automatically do this.

amrishlal · 2021-04-01T01:49:56Z

pinot-core/src/main/java/org/apache/pinot/core/common/datatable/DataTableImplV2.java

-    _variableSizeDataBytes = null;
-    _variableSizeData = null;
-    _metadata = new HashMap<>();
+    super();


This constructor can be removed. super() is redundant.

Done. super() is redundant, but the default constructor is needed here since we have a non-default DataTableImplV2(ByteBuffer byteBuffer) constructor.

amrishlal · 2021-04-01T01:50:12Z

pinot-core/src/main/java/org/apache/pinot/core/common/datatable/DataTableImplV3.java

+   * Construct empty data table. (Server side)
+   */
+  public DataTableImplV3() {
+    super();


call to super() is redundant.

Jackie-Jiang

LGTM otherwise

Jackie-Jiang · 2021-04-02T00:00:11Z

pinot-common/src/main/java/org/apache/pinot/common/utils/DataTable.java

+      this._name = name;
+      this._valueType = valueType;


(nit)

Suggested change

this._name = name;

this._valueType = valueType;

_name = name;

_valueType = valueType;

Jackie-Jiang · 2021-04-02T00:01:01Z

pinot-common/src/main/java/org/apache/pinot/common/utils/DataTable.java

+    }
+
+    // getByOrdinal returns an optional enum key for a given ordinal or null if the key does not exist.
+    public static MetadataKey getByOrdinal(int ordinal) {


Suggested change

public static MetadataKey getByOrdinal(int ordinal) {

@Nullable

public static MetadataKey getByOrdinal(int ordinal) {

Jackie-Jiang · 2021-04-02T00:01:32Z

pinot-common/src/main/java/org/apache/pinot/common/utils/DataTable.java

+    public static MetadataKey getByName(String name) {
+      return _nameToEnumKeyMap.getOrDefault(name, null);
+    }


Suggested change

public static MetadataKey getByName(String name) {

return _nameToEnumKeyMap.getOrDefault(name, null);

}

@Nullable

public static MetadataKey getByName(String name) {

return _nameToEnumKeyMap.get(name);

}

@mqliang , can you please address this?

Jackie-Jiang · 2021-04-02T00:02:31Z

pinot-core/src/main/java/org/apache/pinot/core/common/datatable/BaseDataTable.java

+import org.apache.pinot.spi.utils.ByteArray;
+import org.apache.pinot.spi.utils.BytesUtils;
+
+import static org.apache.pinot.core.common.datatable.DataTableUtils.decodeString;


(Code style) Avoid using static import. Same for other non-test files

@mqliang , can you please fix static imports? They are in a quite a few places

Will do it in a follow-up PR.

fixed in #6738

Jackie-Jiang · 2021-04-02T00:03:40Z

pinot-core/src/main/java/org/apache/pinot/core/common/datatable/BaseDataTable.java

+    }
+  }
+
+  public Map<String, String> getMetadata() {


Put the @Override annotation on the methods that implement the interface

@mqliang , can you please address this?

fixed in #6738

Jackie-Jiang · 2021-04-02T00:12:28Z

pinot-core/src/main/java/org/apache/pinot/core/common/datatable/BaseDataTable.java

+  /**
+   * Helper method to serialize dictionary map.
+   */
+  protected byte[] serializeDictionaryMap(Map<String, Map<Integer, String>> dictionaryMap)


No need to have the argument. It always serializes the _dictionaryMap

… on server DataTable V3 move metadata section to the end of bytes when serialization and use enum values (instead of String in V2) as key. This change will be backward incompat if servers are upgraded first, so brokers must be upgraded before servers. The compatibility of the protocols will not be retained beyond 0.8.0 (or the next version that is released)

siddharthteotia · 2021-04-02T06:24:40Z

pinot-core/src/test/java/org/apache/pinot/core/common/datatable/DataTableSerDeTest.java

+      dataTableV3.getMetadata().put(key, EXPECTED_METADATA.get(key));
+    }
+    newDataTable = DataTableFactory.getDataTable(dataTableV3.toBytes()); // Broker deserialize data table bytes as V2
+    Assert.assertEquals(newDataTable.getDataSchema(), dataSchema, ERROR_MESSAGE);


@mqliang , can you please fix the typo? this should be V3

fixed in #6738

siddharthteotia reviewed Mar 23, 2021

mqliang mentioned this pull request Mar 23, 2021

Benchmark data table serialization logic and pre-allocate byte[] array if need be #6714

Open

Jackie-Jiang reviewed Mar 24, 2021

mqliang changed the title ~~Add a positional data section to data table and measure data table serialization cost on server~~ Add a trailer section to data table and measure data table serialization cost on server Mar 24, 2021

mqliang commented Mar 24, 2021

Jackie-Jiang reviewed Mar 24, 2021

mcvsubbu added the release-notes Referenced by PRs that need attention when compiling the next release notes label Mar 30, 2021

mcvsubbu reviewed Mar 30, 2021

siddharthteotia reviewed Mar 30, 2021

Jackie-Jiang reviewed Mar 30, 2021

mqliang closed this Mar 30, 2021

mqliang reopened this Mar 30, 2021

mqliang closed this Mar 31, 2021

mqliang reopened this Mar 31, 2021

mqliang mentioned this pull request Mar 31, 2021

Remove DataTable V2 code after 0.8.0 release #6732

Open

amrishlal reviewed Apr 1, 2021

siddharthteotia approved these changes Apr 1, 2021

Jackie-Jiang approved these changes Apr 2, 2021

mqliang closed this Apr 2, 2021

mqliang reopened this Apr 2, 2021

mqliang closed this Apr 2, 2021

mqliang reopened this Apr 2, 2021

siddharthteotia merged commit fb7ceb0 into apache:master Apr 2, 2021

siddharthteotia reviewed Apr 2, 2021


Conversation

mqliang commented Mar 23, 2021 • edited by chenboat

Description

Upgrade Notes

Release Notes

Documentation

codecov-io commented Mar 23, 2021 • edited

Codecov Report

Choose a reason for hiding this comment

mqliang Mar 23, 2021 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

siddharthteotia Mar 23, 2021 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

siddharthteotia commented Mar 23, 2021 • edited

siddharthteotia commented Mar 23, 2021 • edited

mcvsubbu commented Mar 23, 2021

mqliang commented Mar 23, 2021

siddharthteotia commented Mar 23, 2021

mcvsubbu commented Mar 23, 2021

siddharthteotia commented Mar 23, 2021

mcvsubbu commented Mar 23, 2021

mqliang commented Mar 24, 2021 • edited

Jackie-Jiang left a comment

Choose a reason for hiding this comment

mqliang commented Mar 24, 2021

mcvsubbu commented Mar 24, 2021

mcvsubbu commented Mar 24, 2021

mqliang commented Mar 24, 2021

Jackie-Jiang commented Mar 24, 2021

siddharthteotia commented Mar 24, 2021

siddharthteotia commented Mar 25, 2021

siddharthteotia commented Mar 25, 2021

mcvsubbu commented Mar 30, 2021

mcvsubbu left a comment

siddharthteotia Mar 30, 2021 • edited

siddharthteotia Mar 30, 2021 • edited

siddharthteotia Mar 30, 2021 • edited

mqliang Mar 31, 2021 • edited