TAJO-1112: Implement histogram interface and a candidate histogram #200
mvhlong wants to merge 5 commits into apache:master from
Conversation
Great work!

Would you add an explanation?

This patch looks truly great!

Thanks @jihoonson,

Thanks for your comment! On using Datum, I have a different opinion. I think that we will need to extend the histogram later to support not only columns of numeric types, but also columns of text and more complex types. If anyone else has other opinions about this, please feel free to suggest them!

For the past several days, I have been very busy due to a business trip schedule. I'll give comments on the patch tomorrow. In addition, I'd like to discuss the roadmap and future direction of statistics information. I'll start the discussion in TAJO-1091 soon.

I'm glad to discuss this with you, guys. @jihoonson, I am a little confused. Your advice is to use NumericDatum (not Datum) instead of Double. In Tajo, NumericDatum represents INTx and FLOATx. I think that Double can cover a broader range of numeric values: not only int and float, but also boolean, bit, datetime, and char.

Because the data is big, the sample data is big, too (if you take too small a sample, the accuracy will be too low). Meanwhile, a table may contain hundreds of columns, each of which needs its own histogram. Hence, histogram construction time can be very long, and we should alleviate this burden by using a simple data type such as Double instead of NumericDatum (or Datum). Conversion from Datum to Double can be done by a utility function.

Equi-width and equi-depth are simple histograms, and thus obtain not-so-great estimation accuracy. When you want to improve the accuracy of selectivity estimation by implementing more complex histograms in the future (for example, multidimensional histograms, to capture the dependencies between different columns), you will see that histogram construction time is a real problem. So, it'd be better to keep it simple. For a fast extension, TEXT can be approximately mapped to a numerical value, too. For more complex types (including TEXT if you wish), in my opinion, Tajo needs special treatment.

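The trade-off described above can be illustrated with a minimal equi-width histogram over plain double values. This is a sketch with illustrative names, not Tajo's actual API: bucket width is fixed, and range selectivity is estimated by assuming a uniform distribution inside each bucket.

```java
import java.util.Arrays;

// Illustrative sketch (class and method names are assumptions, not Tajo's API):
// an equi-width histogram built directly over double values.
public class EquiWidthHistogram {
  private final double min;
  private final double width;   // fixed width of each bucket
  private final long[] counts;  // frequency per bucket

  public EquiWidthHistogram(double[] values, int numBuckets) {
    double lo = Arrays.stream(values).min().orElse(0);
    double hi = Arrays.stream(values).max().orElse(0);
    this.min = lo;
    this.width = (hi - lo) / numBuckets;
    this.counts = new long[numBuckets];
    for (double v : values) {
      int idx = (int) ((v - lo) / width);
      if (idx == numBuckets) idx--;  // the maximum value falls into the last bucket
      counts[idx]++;
    }
  }

  // Estimate how many values fall in [from, to) by assuming a uniform
  // distribution within each bucket (the usual equi-width assumption).
  public double estimateCount(double from, double to) {
    double total = 0;
    for (int i = 0; i < counts.length; i++) {
      double bucketLo = min + i * width;
      double overlap = Math.min(to, bucketLo + width) - Math.max(from, bucketLo);
      if (overlap > 0) {
        total += counts[i] * (overlap / width);
      }
    }
    return total;
  }

  public long[] getCounts() {
    return counts;
  }

  public static void main(String[] args) {
    double[] data = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
    EquiWidthHistogram h = new EquiWidthHistogram(data, 3);
    System.out.println(Arrays.toString(h.getCounts())); // [3, 3, 4]
  }
}
```

Working on primitive doubles keeps the per-value cost to a subtraction, a division, and an array increment, which is the construction-speed argument made above.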
@mvhlong, sorry for the confusion.

If Hyunsik and the other guys don't have any objections, I'll commit this patch.

I'm still reviewing this patch. I think that we need some discussion before committing it.

Any discussion is welcome.

Your patch is a very nice starting point for histograms in Apache Tajo. The patch looks good to me. Additionally, I'll leave some comments. As you mentioned, precision is less important in sampling approaches. In many cases, using double may be the right solution. But when it comes to value range representation, Double still has limitations: in many cases, a Text value can easily be longer than the representation range of DOUBLE or LONG. I have already fixed many sort bugs caused by long text values (up to 256 bytes). For ETL workloads, this is usual. Also, can Double represent the entire Long range? We need to check it. In my opinion, we need to use four value types: Long, Double, Text, and byte[]. Also, I'll start other discussions in TAJO-1091 (https://issues.apache.org/jira/browse/TAJO-1091).

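On the question above of whether Double can represent the entire Long range: it cannot. A double has a 53-bit significand, so longs above 2^53 lose precision when converted to double and back. A quick self-contained check (illustrative, not part of the patch):

```java
public class DoubleLongCheck {
  public static void main(String[] args) {
    // double has a 53-bit significand, so adjacent longs above 2^53
    // collapse to the same double value and the round trip is lossy.
    long a = (1L << 53) + 1;      // 9007199254740993
    long b = (long) (double) a;   // off by one after the round trip
    System.out.println(a == b);   // false

    long small = 1L << 52;        // below 2^53: round trip is exact
    System.out.println(small == (long) (double) small); // true
  }
}
```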
I'd like to explain again about sampling and precision. I checked the Long and Double classes. Double cannot represent the entire value range of Long, so a histogram built on Double cannot replace one built for Long. This was my mistake in making a careless assumption. After reading your comments, I have thought a lot more about supporting other data types in histograms. Now, I think that I should use Datum, since it is the only way to make a unified and clean implementation, although it takes longer processing time. In HistogramBucketProto, min and max will be changed from "double" to "bytes". (@jihoonson, please note that I changed my opinion about the use of Datum.) I will update the patch soon.

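The change to HistogramBucketProto described above might look roughly like this (a sketch only; the tag numbers and the frequency field are illustrative assumptions, not the actual Tajo .proto definition):

```proto
message HistogramBucketProto {
  required bytes min = 1;       // was: required double min = 1;
  required bytes max = 2;       // was: required double max = 2;
  required int64 frequency = 3; // illustrative: per-bucket count
}
```

Storing the bounds as bytes lets each bucket carry a serialized Datum of any type, at the cost of deserializing when buckets are compared.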
Hi @mvhlong, why don't you make a separate histogram implementation for each type? We may need only three implementations: for Long, Double, and byte[]. They would cover all value types. I think that this approach would be good in practice. I have another question: your proposed histogram implementation takes a list of Double values. Does this approach take all values at a time? Or can we build a histogram in an incremental way?

Hi @hyunsik and @jihoonson, with the use of Datum, in contrast to my assumption, histogram construction time is not slower in the tests. This is good for us. With a single histogram idea (for example, equi-width), I prefer to make only one implementation for all data types rather than three different versions for Long, Double, and byte[]. This keeps the source code clean and easy to maintain. Anyway, with the latest update, this problem has been solved. Currently, the histogram implementation takes a list of Datum values. You are right that it takes all values at a time. Building a histogram in a completely incremental way is difficult, because many histograms require sorting all data points. So, I think that we should build the histograms in a partially incremental way. More specifically, we first build many histograms from many different samples, and then merge them. A function to merge the histograms will be implemented later.
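The partially incremental approach described above (build a histogram per sample, then combine) is straightforward in the special case where the histograms share bucket boundaries: the merge reduces to a per-bucket sum of counts. A minimal sketch with illustrative names, not the planned Tajo merge function:

```java
import java.util.Arrays;

// Illustrative sketch: merging two histograms that were built with
// identical bucket boundaries is just a per-bucket sum of counts.
public class HistogramMerge {
  public static long[] merge(long[] countsA, long[] countsB) {
    if (countsA.length != countsB.length) {
      throw new IllegalArgumentException("bucket layouts must match");
    }
    long[] merged = new long[countsA.length];
    for (int i = 0; i < merged.length; i++) {
      merged[i] = countsA[i] + countsB[i];
    }
    return merged;
  }

  public static void main(String[] args) {
    long[] sample1 = {3, 5, 2}; // bucket counts from the first sample
    long[] sample2 = {4, 1, 6}; // bucket counts from the second sample
    System.out.println(Arrays.toString(merge(sample1, sample2))); // [7, 6, 8]
  }
}
```

For equi-depth histograms the boundaries differ from sample to sample, so a real merge function would also need to re-bucket or interpolate counts across mismatched boundaries.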
Hi everyone,
This patch contains:
In the accuracy tests, given a 100k data set and a 10k random sample (10% of the data set), the estimation accuracy is about 80%-95% for random data of both uniform and Gaussian distributions. Histogram construction time (considering only the first construction, without cache effects) is about 15 ms.
Please review it and advise me if anything should be improved. Sincerely!