optimize InputRowSerde #2047
Conversation
int metricSize = WritableUtils.readVInt(in);
for (int i = 0; i < metricSize; i++) {
  String metric = readString(in);
  String type = readString(in);
The type can be encoded as a number instead of a string; you can probably use ValueType.XXX.ordinal() as the encoding.
io.druid.segment.column.ValueType is an enum that defines the types.
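A minimal sketch of what the suggestion would look like, assuming a mirror of the `ValueType` enum (the exact constants here are illustrative, not copied from Druid): the type round-trips as a single ordinal byte instead of a length-prefixed UTF-8 string.

```java
import java.io.*;

public class Main {
    // Stand-in for io.druid.segment.column.ValueType; constants assumed for illustration.
    enum ValueType { FLOAT, LONG, COMPLEX }

    // Write the type as one ordinal byte instead of a UTF-8 string.
    static void writeType(DataOutput out, ValueType type) throws IOException {
        out.writeByte(type.ordinal());
    }

    static ValueType readType(DataInput in) throws IOException {
        return ValueType.values()[in.readUnsignedByte()];
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        writeType(new DataOutputStream(bytes), ValueType.LONG);
        DataInput in = new DataInputStream(new ByteArrayInputStream(bytes.toByteArray()));
        System.out.println(readType(in));  // round-trips to LONG
        System.out.println(bytes.size());  // 1 byte on the wire
    }
}
```

One caveat with ordinal encoding: reordering or inserting enum constants changes the wire format, so the enum order becomes part of the serialization contract.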
I just changed it to not write the type at all.
looks good overall, faster is better
@binlijin for the sake of knowledge, why is there a performance gain?
@b-slim The performance gain comes from: (1) far fewer `new Text(String)` calls, each of which re-encodes the String to UTF-8; (2) writing each dimension name only once.
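A hypothetical sketch of point (1), not the actual patch: encode each string to UTF-8 once and cache the bytes, instead of constructing a new `Text(String)` (which re-encodes on every write).

```java
import java.io.*;
import java.nio.charset.StandardCharsets;
import java.util.*;

public class Main {
    // Hypothetical cache: each distinct string is UTF-8 encoded only once.
    static final Map<String, byte[]> utf8Cache = new HashMap<>();

    static void writeString(DataOutput out, String s) throws IOException {
        byte[] utf8 = utf8Cache.computeIfAbsent(s, k -> k.getBytes(StandardCharsets.UTF_8));
        out.writeInt(utf8.length);  // the real code uses a vint; a plain int keeps this short
        out.write(utf8);
    }

    public static void main(String[] args) throws IOException {
        DataOutputStream out = new DataOutputStream(new ByteArrayOutputStream());
        for (int row = 0; row < 1000; row++) {
            writeString(out, "dim1");  // "dim1" is encoded on the first call only
        }
        System.out.println(utf8Cache.size());  // one encoding for 1000 writes
    }
}
```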
@binlijin @himanshug does this mean that the InputRowSerde introduced in #1472 caused a performance degradation? How does this new ser/de compare to just passing Text between mapper and reducer (ignoring the combiner for a moment)?
@xvrl this is used in batch ingestion only. Batch ingestion runtime is mostly dominated by the actual indexing and the trips of data to/from HDFS, so serde time plays a small role here. We did not notice any significant difference in our batch ingestion times after that patch. Also, this was really added so that we could support batch delta ingestion (and other arbitrary InputFormats that produce non-Text data); the combiner optimization was an added benefit.
if (aggs[i].getName().equals(metric)) {
  return aggs[i].getTypeName();
}
for (AggregatorFactory agg : aggs) {
please add a log.warn(..) here so that we know if, in a real use case, this falls back to the for loop.
Also, it would be nice to have a UT verifying that, if the ordering of AggregatorFactory[] stays the same between serialization and deserialization, the for loop is not run.
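A minimal sketch of the suggested shape, with parallel name/type arrays standing in for `AggregatorFactory[]` (names and signatures here are hypothetical): try the positional lookup first, and warn when the order has changed and a scan is needed.

```java
public class Main {
    // Fast path: positional lookup assuming serialization order was preserved.
    // Slow path: warn (stand-in for log.warn) and scan the whole array.
    static String getTypeName(String metric, int expectedIndex, String[] names, String[] types) {
        if (names[expectedIndex].equals(metric)) {
            return types[expectedIndex];  // ordering preserved, no scan
        }
        System.err.println("WARN: aggregator order changed for [" + metric + "], scanning");
        for (int i = 0; i < names.length; i++) {  // fallback for loop
            if (names[i].equals(metric)) {
                return types[i];
            }
        }
        throw new IllegalArgumentException("no aggregator named " + metric);
    }

    public static void main(String[] args) {
        String[] names = {"count", "sum_visits"};
        String[] types = {"longSum", "doubleSum"};
        System.out.println(getTypeName("sum_visits", 1, names, types)); // fast path
        System.out.println(getTypeName("count", 1, names, types));      // warns, then scans
    }
}
```

A UT along these lines could assert on the warning (or a scan counter) to prove the fast path is taken when ordering is stable.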
sigh, I don't know how to add a UT to verify it...
👍
I tested it with some data (1 million records, 100 dimensions, 20 metrics):

Before the patch:
serTime 68122 ms (InputRowSerde.toBytes)
deTime 76761 ms (InputRowSerde.fromBytes)
serde size 2721126345 bytes

With the patch:
serTime 34231 ms (InputRowSerde.toBytes)
deTime 27868 ms (InputRowSerde.fromBytes)
serde size 2048581349 bytes
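Working out the ratios from the figures above: roughly a 2x serialization speedup, a 2.75x deserialization speedup, and about a 25% smaller serialized size.

```java
public class Main {
    public static void main(String[] args) {
        // Figures copied from the benchmark above (ms and bytes).
        double serBefore = 68122, serAfter = 34231;
        double deBefore = 76761, deAfter = 27868;
        double sizeBefore = 2721126345.0, sizeAfter = 2048581349.0;

        System.out.printf("ser speedup:    %.2fx%n", serBefore / serAfter);                       // 1.99x
        System.out.printf("de speedup:     %.2fx%n", deBefore / deAfter);                         // 2.75x
        System.out.printf("size reduction: %.1f%%%n", 100.0 * (sizeBefore - sizeAfter) / sizeBefore); // 24.7%
    }
}
```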