TAJO-1112: Implement histogram interface and a candidate histogram #200
mvhlong wants to merge 5 commits into apache:master from
Conversation
Great work!

Would you add an explanation?

This patch looks truly great!

Thanks @jihoonson,

Thanks for your comment! On using Datum, I have a different opinion. I think that we will need to extend the histogram later to support not only columns of numeric types, but also columns of text and more complex types. If anyone else has other opinions about this, please feel free to suggest them!

For the past several days, I have been very busy due to a business trip schedule. I'll give comments on the patch tomorrow. In addition, I'd like to discuss the roadmap and future direction of statistics information. I'll start the discussion in TAJO-1091 soon.

I'm glad to discuss this with you, guys. @jihoonson, I am a little confused. Your advice is to use NumericDatum (not Datum) instead of Double. In Tajo, NumericDatum represents INTx and FLOATx. I think that Double can cover a broader range of numeric values: not only int and float, but also boolean, bit, datetime, and char.

Because the data is big, the sample data is big, too (if you take too small a sample, the accuracy will be too low). Meanwhile, a table may contain hundreds of columns, each of which needs its own histogram. Hence, histogram construction time can be very long, and we should alleviate this burden by using a simple data type such as Double instead of NumericDatum (or Datum). Conversion from Datum to Double can be done by a utility function.

Equi-width and equi-depth are simple histograms, and thus obtain not-so-great estimation accuracy. When you want to improve the accuracy of selectivity estimation by implementing more complex histograms in the future (for example, multidimensional histograms, to capture the dependencies between different columns), you will see that histogram construction time is a real problem. So, it'd be better to keep it simple. For a fast extension, TEXT can be approximately mapped to a numerical value, too. For more complex types (including TEXT if you wish), in my opinion, Tajo needs special treatment.

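The trade-off described above can be illustrated with a minimal equi-width histogram over plain double values. This is a sketch with illustrative names, not Tajo's actual API: bucket width is fixed, and range selectivity is estimated by assuming a uniform distribution inside each bucket.

```java
import java.util.Arrays;

// Illustrative sketch (class and method names are assumptions, not Tajo's API):
// an equi-width histogram built directly over double values.
public class EquiWidthHistogram {
  private final double min;
  private final double width;   // fixed width of each bucket
  private final long[] counts;  // frequency per bucket

  public EquiWidthHistogram(double[] values, int numBuckets) {
    double lo = Arrays.stream(values).min().orElse(0);
    double hi = Arrays.stream(values).max().orElse(0);
    this.min = lo;
    this.width = (hi - lo) / numBuckets;
    this.counts = new long[numBuckets];
    for (double v : values) {
      int idx = (int) ((v - lo) / width);
      if (idx == numBuckets) idx--;  // the maximum value falls into the last bucket
      counts[idx]++;
    }
  }

  // Estimate how many values fall in [from, to) by assuming a uniform
  // distribution within each bucket (the usual equi-width assumption).
  public double estimateCount(double from, double to) {
    double total = 0;
    for (int i = 0; i < counts.length; i++) {
      double bucketLo = min + i * width;
      double overlap = Math.min(to, bucketLo + width) - Math.max(from, bucketLo);
      if (overlap > 0) {
        total += counts[i] * (overlap / width);
      }
    }
    return total;
  }

  public long[] getCounts() {
    return counts;
  }

  public static void main(String[] args) {
    double[] data = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
    EquiWidthHistogram h = new EquiWidthHistogram(data, 3);
    System.out.println(Arrays.toString(h.getCounts())); // [3, 3, 4]
  }
}
```

Working on primitive doubles keeps the per-value cost to a subtraction, a division, and an array increment, which is the construction-speed argument made above.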
@mvhlong, sorry for the confusion.

If Hyunsik and the other guys don't have any objections, I'll commit this patch.

I'm still reviewing this patch. I think that we need some discussion before committing it.

Any discussion is welcome.

Your patch is a very nice starting point for histograms in Apache Tajo. The patch looks good to me. Additionally, I'll leave some comments. As you mentioned, precision is less important in sampling approaches. In many cases, using double may be the right solution. But when it comes to value range representation, Double still has limitations: in many cases, a Text value can easily be longer than the representation range of DOUBLE or LONG. I have already fixed many sort bugs caused by long text values (up to 256 bytes). For ETL workloads, this is usual. Also, can Double represent the entire Long range? We need to check it. In my opinion, we need to use four value types: Long, Double, Text, and byte[]. Also, I'll start other discussions in TAJO-1091 (https://issues.apache.org/jira/browse/TAJO-1091).

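On the question above of whether Double can represent the entire Long range: it cannot. A double has a 53-bit significand, so longs above 2^53 lose precision when converted to double and back. A quick self-contained check (illustrative, not part of the patch):

```java
public class DoubleLongCheck {
  public static void main(String[] args) {
    // double has a 53-bit significand, so adjacent longs above 2^53
    // collapse to the same double value and the round trip is lossy.
    long a = (1L << 53) + 1;      // 9007199254740993
    long b = (long) (double) a;   // off by one after the round trip
    System.out.println(a == b);   // false

    long small = 1L << 52;        // below 2^53: round trip is exact
    System.out.println(small == (long) (double) small); // true
  }
}
```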
I'd like to explain again about sampling and precision. I checked the Long and Double classes. Double cannot represent the entire value range of Long, so a histogram built on Double cannot replace one built for Long. This was my mistake in making a careless assumption. After reading your comments, I have thought a lot more about supporting other data types in histograms. Now, I think that I should use Datum, since it is the only way to make a unified and clean implementation, although it takes longer processing time. In HistogramBucketProto, min and max will be changed from "double" to "bytes". (@jihoonson, please note that I changed my opinion about the use of Datum.) I will update the patch soon.

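The change to HistogramBucketProto described above might look roughly like this (a sketch only; the tag numbers and the frequency field are illustrative assumptions, not the actual Tajo .proto definition):

```proto
message HistogramBucketProto {
  required bytes min = 1;       // was: required double min = 1;
  required bytes max = 2;       // was: required double max = 2;
  required int64 frequency = 3; // illustrative: per-bucket count
}
```

Storing the bounds as bytes lets each bucket carry a serialized Datum of any type, at the cost of deserializing when buckets are compared.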
Hi @mvhlong, why don't you make a separate histogram implementation for each type? We may need only three implementations: for Long, Double, and byte[]. They would cover all value types. I think that this approach would be good in practice. I have another question: your proposed histogram implementation takes a list of Double values. Does this approach take all values at a time? Or can we build a histogram in an incremental way?

Hi @hyunsik and @jihoonson, with the use of Datum, in contrast to my assumption, histogram construction time is not slower in the tests. This is good for us. With a single histogram idea (for example, equi-width), I prefer to make only one implementation for all data types rather than three different versions for Long, Double, and byte[]. This keeps the source code clean and easy to maintain. Anyway, with the latest update, this problem has been solved. Currently, the histogram implementation takes a list of Datum values. You are right that it takes all values at a time. Building a histogram in a completely incremental way is difficult, because many histograms require sorting all data points. So, I think that we should build the histograms in a partially incremental way. More specifically, we first build many histograms from many different samples, and then merge them. A function to merge the histograms will be implemented later.
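The partially incremental approach described above (build a histogram per sample, then combine) is straightforward in the special case where the histograms share bucket boundaries: the merge reduces to a per-bucket sum of counts. A minimal sketch with illustrative names, not the planned Tajo merge function:

```java
import java.util.Arrays;

// Illustrative sketch: merging two histograms that were built with
// identical bucket boundaries is just a per-bucket sum of counts.
public class HistogramMerge {
  public static long[] merge(long[] countsA, long[] countsB) {
    if (countsA.length != countsB.length) {
      throw new IllegalArgumentException("bucket layouts must match");
    }
    long[] merged = new long[countsA.length];
    for (int i = 0; i < merged.length; i++) {
      merged[i] = countsA[i] + countsB[i];
    }
    return merged;
  }

  public static void main(String[] args) {
    long[] sample1 = {3, 5, 2}; // bucket counts from the first sample
    long[] sample2 = {4, 1, 6}; // bucket counts from the second sample
    System.out.println(Arrays.toString(merge(sample1, sample2))); // [7, 6, 8]
  }
}
```

For equi-depth histograms the boundaries differ from sample to sample, so a real merge function would also need to re-bucket or interpolate counts across mismatched boundaries.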
Hi everyone,
This patch contains:
In the accuracy tests, given a 100k data set and a 10k random sample (10% of the data set), the estimation accuracy is about 80%-95% for random data of both uniform and Gaussian distributions. Histogram construction time (considering only the first construction, without cache effects) is about 15 ms.
Please review it and advise me if anything should be improved. Sincerely!