optimize InputRowSerde #2047

Merged
merged 1 commit into from Dec 9, 2015
Conversation

@binlijin (Contributor) commented Dec 5, 2015

I tested it with some data (1 million records, 100 dimensions, 20 metrics):

Before the patch:
serTime 68122 ms (InputRowSerde.toBytes)
deTime 76761 ms (InputRowSerde.fromBytes)
serDe size 2721126345 bytes

With the patch:
serTime 34231 ms (InputRowSerde.toBytes)
deTime 27868 ms (InputRowSerde.fromBytes)
serDe size 2048581349 bytes

int metricSize = WritableUtils.readVInt(in);
for (int i = 0; i < metricSize; i++) {
  String metric = readString(in);
  String type = readString(in);
Contributor:

type can be encoded as a number instead of a string; you can probably use ValueType.XXX.ordinal() as the encoding.
io.druid.segment.column.ValueType is an enum that defines the types.
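For illustration, a minimal sketch of the ordinal-based encoding suggested above (the writeType/readType helper names are hypothetical, not code from this patch):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import io.druid.segment.column.ValueType;
import org.apache.hadoop.io.WritableUtils;

class TypeCodec
{
  // Write the metric type as the enum ordinal (a single vint) instead of a UTF-8 string.
  static void writeType(DataOutput out, ValueType type) throws IOException
  {
    WritableUtils.writeVInt(out, type.ordinal());
  }

  // Map the ordinal back to the enum constant on read.
  static ValueType readType(DataInput in) throws IOException
  {
    return ValueType.values()[WritableUtils.readVInt(in)];
  }
}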

Contributor Author:

I changed it so the type is not written at all.

@himanshug (Contributor):

looks good overall, faster is better

@b-slim (Contributor) commented Dec 5, 2015

@binlijin for the sake of knowledge, why is there a performance gain?

@binlijin (Contributor, Author) commented Dec 7, 2015

@b-slim The performance gain comes from: (1) far fewer new Text(String) allocations, each of which re-encodes the String to UTF-8; (2) writing each dimension name only once.
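To illustrate the first point, a sketch of two ways a String can be written to a DataOutput; the writeString helper below is hypothetical and only shows the general technique of length-prefixing raw UTF-8 bytes rather than going through a Text object:

import java.io.DataOutput;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableUtils;

class StringSerde
{
  // Allocates a Text wrapper for every value and lets it encode the String to UTF-8.
  static void writeStringViaText(DataOutput out, String s) throws IOException
  {
    new Text(s).write(out);
  }

  // Hypothetical helper: encode once, then write a vint length followed by the raw bytes.
  static void writeString(DataOutput out, String s) throws IOException
  {
    byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
    WritableUtils.writeVInt(out, bytes.length);
    out.write(bytes);
  }
}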

@binlijin closed this Dec 8, 2015
@binlijin reopened this Dec 8, 2015
@xvrl (Member) commented Dec 8, 2015

@binlijin @himanshug does this mean that the InputRowSerde introduced in #1472 caused performance degradation? How does this new ser/de compare to just passing Text between mapper and reducer (ignoring the combiner for a moment)?

@himanshug (Contributor):

@xvrl this is used in batch ingestion only. Batch ingestion runtime is mostly dominated by the actual indexing and the trips of data to/from HDFS, so serde time here plays little role. We did not notice any significant difference in our batch ingestion times after that patch. Also, this was really added so that we could support batch delta ingestion (and other arbitrary InputFormats which would produce non-Text data); the combiner optimization was an added benefit.

if (aggs[i].getName().equals(metric)) {
  return aggs[i].getTypeName();
}
for (AggregatorFactory agg : aggs) {
Contributor:

Please add a log.warn(..) here so that we know if, in a real use case, this falls back to the for loop.
It would also be nice to have a UT that verifies that, if the ordering of AggregatorFactory[] stays the same between serialization and deserialization, the for loop is not run.
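For illustration, a sketch of what that warning could look like around the fallback; the getType signature and class name here are assumptions based on the snippet above, not the exact patch code:

import com.metamx.common.logger.Logger;
import io.druid.query.aggregation.AggregatorFactory;

class TypeLookup
{
  private static final Logger log = new Logger(TypeLookup.class);

  // Fast path: the i-th aggregator normally matches the metric read at position i.
  // If it does not, warn and fall back to a linear scan over all factories.
  static String getType(String metric, AggregatorFactory[] aggs, int i)
  {
    if (aggs[i].getName().equals(metric)) {
      return aggs[i].getTypeName();
    }
    log.warn("Aggregator ordering changed, falling back to linear scan for metric[%s]", metric);
    for (AggregatorFactory agg : aggs) {
      if (agg.getName().equals(metric)) {
        return agg.getTypeName();
      }
    }
    throw new IllegalArgumentException("Missing aggregator for metric: " + metric);
  }
}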

Contributor Author:

Sigh, I don't know how to add a UT to verify it...
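One possible shape for such a test, sketched with Mockito against the hypothetical TypeLookup helper above (an assumption for illustration, not code from this PR): when the ordering matches, the type is resolved through the indexed aggregator and the other factories are never inspected.

import static org.mockito.Mockito.mock;
import static org.mockito.Mockito.never;
import static org.mockito.Mockito.verify;
import static org.mockito.Mockito.when;

import io.druid.query.aggregation.AggregatorFactory;
import org.junit.Assert;
import org.junit.Test;

public class TypeLookupTest
{
  @Test
  public void testMatchingOrderSkipsLinearScan()
  {
    AggregatorFactory agg0 = mock(AggregatorFactory.class);
    AggregatorFactory agg1 = mock(AggregatorFactory.class);
    when(agg1.getName()).thenReturn("m1");
    when(agg1.getTypeName()).thenReturn("float");

    AggregatorFactory[] aggs = new AggregatorFactory[]{agg0, agg1};

    // Ordering matches, so the indexed aggregator answers directly...
    Assert.assertEquals("float", TypeLookup.getType("m1", aggs, 1));
    // ...and the fallback loop never touches the other factory.
    verify(agg0, never()).getName();
  }
}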

@binlijin closed this Dec 9, 2015
@binlijin reopened this Dec 9, 2015
@fjy (Contributor) commented Dec 9, 2015

👍

himanshug added a commit that referenced this pull request Dec 9, 2015
@himanshug merged commit f29c25b into apache:master Dec 9, 2015
@fjy modified the milestone: 0.9.0 Feb 4, 2016
@fjy mentioned this pull request Feb 5, 2016