Sampling


Readings:

Questions: These three papers are examples of different approaches to sampling.

  • A simple approach would be to take a uniform random sample of the database, store that, and use the sample exclusively to answer database queries. What is the fundamental problem of this approach?
  • If we want to approximate the result of a join query (e.g., SELECT ... FROM T1 JOIN T2 ON T1.a = T2.b), we might draw independent samples S1 from T1 and S2 from T2 and perform a join on S1 and S2. What's the fundamental problem of this approach?

Bonus question

  • Sampling is a useful operation in data visualization, but it's not clear when and how it can be best employed. What do you imagine are the ideal or worst scenarios (e.g., the visualization, resource availability, type of data) for the sampling approaches in the above papers?

Create a new section with a subheading ## YOURUNI YOURNAME and write your answers to the questions in your section.

## ew2493 Eugene Wu

Example answers here

## bgw2119 Brennan Wallace

A simple approach would be to take a uniform random sample of the database, store that, and use the sample exclusively to answer database queries. What is the fundamental problem of this approach?

From the perspective of PFunk-H, the issue is that the sample may, by chance, fail to represent the characteristics of the source data, to the point where an investigator's perception of the nature of the data would be affected.

Similarly, from the perspective of the Sample + Seek paper, the issue might be thought of as having a sample distribution so different from the actual distribution of the data that the analysis is compromised.

These two views are essentially the same, the first being the perceptual effect of the second.

If we want to approximate the result of a join query (e.g., SELECT ... FROM T1 JOIN T2 ON T1.a = T2.b), we might draw independent samples S1 from T1 and S2 from T2 and perform a join on S1 and S2. What's the fundamental problem of this approach?

The two problems identified above still apply: the two samples may not be representative enough of the full data to produce a join with a distribution similar to that of the corresponding join over the entire dataset.

Sampling is a useful operation in data visualization, but it's not clear when and how it can be best employed. What do you imagine are the ideal or worst scenarios (e.g., the visualization, resource availability, type of data) for the sampling approaches in the above papers?

The ideal case would be a dataset whose distribution is simple and easy to estimate, with few outliers (or where outliers are not important). Having many data points would also be good; if there are only a few, the entire dataset could be used anyway.

The worst case would be when outliers exist and matter, since they could be missed even when checking the distribution. A distribution that is hard to determine (possibly due to the size of the data or because it is unusual, perhaps non-continuous) would also be difficult for these tools/approaches to deal with.

## gf2308 Gal Levy-Fix

Q1. A simple approach would be to take a uniform random sample of the database, store that, and use the sample exclusively to answer database queries. What is the fundamental problem of this approach?

The problem with this approach, from the perspective of PFunk-H, is that it ignores human visual perceptual accuracy. If the difference between the true data distribution and the approximation obtained from sampling is perceptually discernible, then the resulting query answers will be misleading. The paper focuses on human perceptual limitations for visualization, which have been studied and assigned functional forms. It may be the case that for some datasets the error from the simple approach described above falls below the discernibility thresholds, while for other datasets it does not.
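
To make this concrete, here is a minimal sketch (not the paper's actual algorithm) of an online sampling loop in the spirit of PFunk-H: keep drawing rows until the confidence-interval half-width of the running estimate falls below a perceptual just-noticeable-difference threshold. The perceptual function `jnd` and the data below are made-up stand-ins for illustration.

```python
import math
import random

def jnd(value):
    """Hypothetical perceptual function: the smallest difference from `value`
    a viewer could notice (a stand-in for PFunk-H's perceptual functions)."""
    return 0.05 * abs(value) + 1e-3

def online_mean(data, confidence_z=1.96, max_n=100_000):
    """Sample with replacement until the CI half-width of the running mean
    is below the perceptual threshold for the current estimate."""
    total, total_sq, n = 0.0, 0.0, 0
    mean = 0.0
    while n < max_n:
        x = random.choice(data)
        n += 1
        total += x
        total_sq += x * x
        mean = total / n
        if n > 30:  # need a few samples before trusting the CI
            var = max(total_sq / n - mean * mean, 0.0)
            half_width = confidence_z * math.sqrt(var / n)
            if half_width <= jnd(mean):  # stop once the error is imperceptible
                break
    return mean, n

data = [random.gauss(100, 15) for _ in range(1_000_000)]
estimate, samples_used = online_mean(data)
print(f"estimate={estimate:.2f} using {samples_used} samples")
```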

From the perspective of Sample + Seek, uniform sampling does not take user-specified error bounds into account. This means that such sampling may produce unbounded errors, which may turn out to be unacceptable to the user. The authors propose a precision metric called distribution precision, which provides better error-bound guarantees.
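
As a rough sketch of the kind of distribution-based error such a metric captures (the exact definition and normalization in Sample + Seek may differ), one can compare the normalized group-by answer vector computed from a sample against the one computed from the full data using L2 distance:

```python
import math
import random
from collections import Counter

def normalized_groupby_sum(rows):
    """SUM(measure) per group, normalized to unit L2 norm."""
    sums = Counter()
    for group, measure in rows:
        sums[group] += measure
    norm = math.sqrt(sum(v * v for v in sums.values())) or 1.0
    return {g: v / norm for g, v in sums.items()}

def l2_distance(a, b):
    keys = set(a) | set(b)
    return math.sqrt(sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys))

rows = [(random.choice("ABCD"), random.expovariate(1.0)) for _ in range(100_000)]
sample = random.sample(rows, 1_000)
error = l2_distance(normalized_groupby_sum(rows), normalized_groupby_sum(sample))
print(f"distribution-style error of the sample's answer: {error:.4f}")
```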

Q2. If we want to approximate the result of a join query (e.g., SELECT ... FROM T1 JOIN T2 ON T1.a = T2.b), we might draw independent samples S1 from T1 and S2 from T2 and perform a join on S1 and S2. What's the fundamental problem of this approach?

If the independent samples S1 and S2 are different enough from the true data T1 and T2 then the join will also be significantly different from the join of the original data. The threshold for the error rate of this approximation may be set using knowledge of human perception per PFUNK-H or by user specified error bounds per Sample+Seek.

Bonus question. Sampling is a useful operation in data visualization, but it's not clear when and how it can be best employed. What do you imagine are the ideal or worst scenarios (e.g., the visualization, resource availability, type of data) for the sampling approaches in the above papers?

In the best scenario, the desired visualization and data type allow large errors to go undetected by human perception and to fall below the user-specified error bound, which permits a large reduction in the needed sample size and thus large performance gains. In the worst-case scenario, the visualization and data types are very “unforgiving”: even small errors are perceptually detected, and a sample size very close to the full data size is required to stay below the specified error bounds.

## fp2358 Fei Peng

  • A simple approach would be to take a uniform random sample of the database, store that, and use the sample exclusively to answer database queries. What is the fundamental problem of this approach?

From the perspective of PFunk-H, this simple approach gives no clear measurement of the difference between the approximate answer and the exact one, so it won't perform well when the trade-off between accuracy and efficiency must be taken into consideration. In addition, uniform sampling is an offline method and may output arbitrary results for some complicated queries because it ignores query semantics.

According to Sample+Seek, this approach ignores the measure values of the rows in the database, so the sampled data may have a distribution different from the original data, which leads to inaccurate query results. For example, uniform sampling does not work well for queries with a SUM aggregate.
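
As a quick illustration (made-up skewed data, not taken from the paper), the scaled SUM estimate from a uniform sample can be badly off when a handful of rows carry most of the mass:

```python
import random

random.seed(0)
# 100,000 small values plus 5 huge outliers that dominate the SUM
values = [1.0] * 100_000 + [1_000_000.0] * 5
true_sum = sum(values)  # 5,100,000

sample = random.sample(values, 1_000)
scaled_estimate = sum(sample) * len(values) / len(sample)  # scale up by sampling rate

print(f"true SUM      = {true_sum:,.0f}")
# the sample usually contains none of the 5 outliers, giving roughly 100,000
print(f"estimated SUM = {scaled_estimate:,.0f}")
```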

  • If we want to approximate the result of a join query (e.g., SELECT ... FROM T1 JOIN T2 ON T1.a = T2.b), we might draw independent samples S1 from T1 and S2 from T2 and perform a join on S1 and S2. What's the fundamental problem of this approach?

Since S1 and S2 are sampled independently, this process does not consider the correlation between T1 and T2. Therefore, S1 join S2 may have a different distribution compared with T1 join T2.

For example, suppose T1 and T2 are joined on T1.a = T2.b, T1 has many records with a = x, but T2 has only one record with b = x. The sampling approach ignores this information and may output an S2 that has no record with b = x. Then S1 join S2 is empty, whereas T1 join T2 has many records.
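
A small simulation of this scenario (toy tables, not from the papers) shows how joining independent samples can come up empty even though the true join is large:

```python
import random
from collections import Counter

random.seed(1)
# T1: 10,000 rows, all with a = 'x'; T2: 10,000 rows, only ONE of which has b = 'x'
T1 = [('x',) for _ in range(10_000)]
T2 = [('x',)] + [(f'y{i}',) for i in range(9_999)]

S1 = random.sample(T1, 100)  # independent uniform samples
S2 = random.sample(T2, 100)

t2_counts = Counter(b for (b,) in T2)
s2_counts = Counter(b for (b,) in S2)
true_join_size = sum(t2_counts[a] for (a,) in T1)      # 10,000 matching pairs
sample_join_size = sum(s2_counts[a] for (a,) in S1)    # almost always 0: S2 rarely
                                                       # contains the single matching row
print(true_join_size, sample_join_size)
```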

  • Sampling is a useful operation in data visualization, but it's not clear when and how it can be best employed. What do you imagine are the ideal or worst scenarios (e.g., the visualization, resource availability, type of data) for the sampling approaches in the above papers?

Most sampling methods are designed for queries with aggregation, whose results are not dominated by any single record in the database. In that case, sampling can work well and answer queries rapidly.

However, if any single record must be considered to give an accurate result, then sampling is not applicable. For example, sampling methods that consider the overall distribution of the dataset cannot give correct results for queries with MAX/MIN operators, because outliers may be dropped during sampling.

## gr2547 Gabriel Ryan

  • A simple approach would be to take a uniform random sample of the database, store that, and use the sample exclusively to answer database queries. What is the fundamental problem of this approach?

This would only give a single level of precision, determined by the extent of the sampling. In the case of PFunk-H, it may be impossible to meet the precision requirement given by the perceptual function. Uniform sampling also suffers from the data-dependent error problem described in Sample + Seek, where some rows may be more significant than others.

  • If we want to approximate the result of a join query (e.g., SELECT ... FROM T1 JOIN T2 ON T1.a = T2.b), we might draw independent samples S1 from T1 and S2 from T2 and perform a join on S1 and S2. What's the fundamental problem of this approach?

From the perspective of PFunk-H, this has the same issue in that it does not take a particular precision guarantee into account. Even if the two tables are independently sampled to a given precision, that does not guarantee that queries on the join will have the same precision as the independent samples. This is similar to the issue described in Sample + Seek where the precision of individual groups does not guarantee precision over all groups.

  • Sampling is a useful operation in data visualization, but it's not clear when and how it can be best employed. What do you imagine are the ideal or worst scenarios (e.g., the visualization, resource availability, type of data) for the sampling approaches in the above papers?

I imagine sampling is best employed when the degree of required precision is known and is low enough to make sampling an effective optimization, and of course when the data size is large enough to justify the speedup from sampling. Interactive visualizations that are low resolution or only shown for a short period of time, for example, can likely tolerate low precision. Another use case is providing answers to analysts who only need to understand relative trends and relationships.

## yz3060 Ye Zhang

  • A simple approach would be to take a uniform random sample of the database, store that, and use the sample exclusively to answer database queries. What is the fundamental problem of this approach?

From the perspective of PFunk-H, a uniform random sample gives no guarantee on perceptual error, so the visualization of the random sample may be discernibly different from that of the real data.

From the perspective of Sample + Seek, a uniform random sample is not guaranteed to be within a certain threshold of the real distribution and can produce unbounded errors. In addition, it does not take the user's expected error guarantee into consideration.

  • If we want to approximate the result of a join query (e.g., SELECT ... FROM T1 JOIN T2 ON T1.a = T2.b), we might draw independent samples S1 from T1 and S2 from T2 and perform a join on S1 and S2. What's the fundamental problem of this approach?

If we draw independent samples S1 from T1 and S2 from T2 and perform a join on S1 and S2, the joined sample does not have the same distribution as the real joined data, because T1 and T2 are correlated through the join constraint T1.a = T2.b.

  • Sampling is a useful operation in data visualization, but it's not clear when and how it can be best employed. What do you imagine are the ideal or worst scenarios (e.g., the visualization, resource availability, type of data) for the sampling approaches in the above papers?

In the best scenario, the size of the data is large and the user specifies a required precision that a sample can meet; in that case, sampling is a useful optimization.

## dn2311 Drashko Nakikj

A simple approach would be to take a uniform random sample of the database, store that, and use the sample exclusively to answer database queries. What is the fundamental problem of this approach?

There are at least five problem categories with this approach:

  • representativeness of the sample (does it represent the real DB well enough),
  • appropriateness of the sample for the query (will this sample be enough to answer all possible queries),
  • size of the sample (is the size of the sample the optimal size to be kept in memory/disk),
  • accessibility (is this the most efficient way to store the sample to secure a fast response), and
  • rendering the answer (will the sample be enough to produce a visualization that precisely delivers the answer to the human eye).

Both papers offer different approaches (online and offline) to overcome these barriers by offering smart solutions for estimating the best possible sample for precise visual rendering of the query answer (based on perceptual functions) and for efficient storage and computation of precomputed samples (the Sample + Seek framework for large and small queries: measure-biased samples, data bubbles, a measure-augmented inverted index, and a low-frequency group index).

If we want to approximate the result of a join query (e.g., SELECT ... FROM T1 JOIN T2 ON T1.a = T2.b), we might draw independent samples S1 from T1 and S2 from T2 and perform a join on S1 and S2. What's the fundamental problem of this approach?

Although there are smart ways to do the sampling so that it best represents the dataset under a set of constraints, it still omits some data. If we do not consider the possible dependencies between the samples (to improve the samples that will be joined), then we risk omitting the rows where the actual join takes place. In the worst-case scenario, the join of the independent samples might be an empty set even though, with all the data available, that is not the case.

Sampling is a useful operation in data visualization, but it's not clear when and how it can be best employed. What do you imagine are the ideal or worst scenarios (e.g., the visualization, resource availability, type of data) for the sampling approaches in the above papers?

In an ideal scenario we would know the distribution of the request load issued to the DB, the query types issued to the DB, and the distribution of the data in the DB, in order to decide which approach to take (offline, online, or a combination), to construct optimal samples (in terms of rendering, representativeness, and size), and to make them easily accessible (indexing). In addition, we would know the relationship between the query type and the perceptual function, with rich context and/or a clear definition of what the user-specified error bound is.

In a worst-case scenario:

  • the error of approximate answers is hard to define;
  • it is hard to estimate the distribution of the data in the dataset;
  • the assumption that the perceptual function is univariate and monotonic no longer holds, i.e., modeling the perceptual function becomes more complex;
  • the user-specified error bound might be subjective and bias the results in the answer;
  • the DB is highly dynamic, so it is hard to estimate the "size" of a query (large vs. small) in order to precompute samples; and
  • the nature of the data is such that the indexes become very big.

Sampling generally works well for aggregation when the user's error tolerance is high and can be estimated/defined up front, together with a good understanding of the data distribution in the database. It performs worse when extreme values are essential to the reliability of the query answer, when the error tolerance is very strict or ambiguous to define, or when it is hard to characterize the distribution of the data in the DB. Joining tables poses a further challenge to the sampling process.

## az2407 Alireza Zareian

A simple approach would be to take a uniform random sample of the database, store that, and use the sample exclusively to answer database queries. What is the fundamental problem of this approach?

If the data has a smooth distribution with few outliers, this might be a good idea. But if the data contains important outliers that significantly affect the result of an aggregation, they might all be missed by uniform sampling, and the resulting aggregate might be arbitrarily inaccurate.

If we want to approximate the result of a join query (e.g., SELECT ... FROM T1 JOIN T2 ON T1.a = T2.b), we might draw independent samples S1 from T1 and S2 from T2 and perform a join on S1 and S2. What's the fundamental problem of this approach?

If we sample the joined table directly, we might have some approximation error. But if we sample each table first, each sampling introduces its own approximation error, and the join compounds the two. This is similar to measuring two numbers each with 10% error and multiplying them: the result can have more than 20% error.
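
A quick check of that arithmetic, writing the two relative errors as \delta_1 and \delta_2 (0.10 each):

```latex
(1+\delta_1)(1+\delta_2) = 1 + \delta_1 + \delta_2 + \delta_1\delta_2
\quad\Rightarrow\quad
\delta_1 = \delta_2 = 0.10 \;\text{ gives } 1.21, \text{ i.e., a combined relative error of } 21\% > 20\%.
```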

Sampling is a useful operation in data visualization, but it's not clear when and how it can be best employed. What do you imagine are the ideal or worst scenarios (e.g., the visualization, resource availability, type of data) for the sampling approaches in the above papers?

The best case for sampling is when the data is very large and exact query execution is not feasible, or when the data distribution is known without cost, or when the tolerable error has a large margin. The worst case is when the exact query can be computed easily, so there is no need to approximate, or when there are many outliers that our distribution does not model, or when high accuracy is desired.

## sh3266 Daniel Hong

  • A simple approach would be to take a uniform random sample of the database, store that, and use the sample exclusively to answer database queries. What is the fundamental problem of this approach?

As discussed in Sample + Seek, and previously in imMens, uniform random sampling has the disadvantage that it is likely to omit extreme points of statistical significance. From a querying perspective, approximate query processing must guarantee both interactive speed and precision. Random sampling raises precision concerns tied to the two problems identified by Sample + Seek: depending on which rows are included in the sample, guaranteeing a user-given error bound is difficult, and when the sample distribution differs greatly from the actual one, large errors are produced. PFunk-H proposes an online sampling algorithm that determines the threshold at which sampling should stop, so that the difference between the estimated and actual answers is perceptually indiscernible. But the perceptual model and its minimum-resolution optimization may not meet a user-specified error bound when plain random sampling is used.

  • If we want to approximate the result of a join query (e.g., SELECT ... FROM T1 JOIN T2 ON T1.a = T2.b), we might draw independent samples S1 from T1 and S2 from T2 and perform a join on S1 and S2. What's the fundamental problem of this approach?

CI-based approximation systems that draw random samples with certain confidence intervals do not account for errors arising from join queries. The CI for each group does not represent the CI over all groups when aggregate operations are performed across groups. When data-dependent join queries are performed, the samples may contain only some of the needed rows, or none of them at all. In either case, the user-given error threshold cannot be guaranteed.

  • Sampling is a useful operation in data visualization, but it's not clear when and how it can be best employed. What do you imagine are the ideal or worst scenarios (e.g., the visualization, resource availability, type of data) for the sampling approaches in the above papers?

The worst case is when a large portion of the data has extreme values that are not suited to uniform random sampling. In such cases, and for queries with highly selective predicates, the sample distribution would be significantly different from the actual one. On the other hand, easily predictable data with known distributional assumptions would be the best case, so that queries can be processed accurately even without a large sample size.

## lw2666 Luren Wang

  • A simple approach would be to take a uniform random sample of the database, store that, and use the sample exclusively to answer database queries. What is the fundamental problem of this approach?

PFunk-H: Unfortunately, the sample would not be guaranteed to be perceived in the same way as the data. As mentioned in PFunk-H, larger sample sizes are needed to make the approximation more and more similar to the underlying data. However, large sample sizes also tend to make the system slower, thereby inhibiting interactivity. Thus, PFunk-H solves this problem in the context of human perceptual limitations. That is, the goal is to make the approximation as efficient as possible without being perceptually discernible.

Sample + Seek: According to this paper, sampling-based approximate query processing (AQP) cannot in general guarantee error bounds, and arbitrarily large errors may be produced for queries such as SUM. The solution in this paper is a sampling scheme called measure-biased sampling, which guarantees precision within a user-specified error bound.
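
Here is a minimal sketch of the measure-biased idea as I understand it (simplified; the paper's actual estimators and indexes are more involved): rows are drawn with probability proportional to their measure value, and the SUM for a predicate is then estimated from the fraction of sampled rows that satisfy the predicate, scaled by the total measure.

```python
import random

def measure_biased_sample(rows, k):
    """Draw k rows with probability proportional to the measure column.
    `rows` is a list of (attributes, measure) pairs; drawn with replacement."""
    weights = [m for _, m in rows]
    return random.choices(rows, weights=weights, k=k)

def estimate_sum(sample, total_measure, predicate):
    """SUM(measure) estimate for rows matching `predicate`: the expected fraction
    of measure-biased samples that match equals SUM(matching) / total_measure."""
    matching = sum(1 for attrs, _ in sample if predicate(attrs))
    return total_measure * matching / len(sample)

# toy data (hypothetical): group 'A' rows carry half of the measure mass
rows = [({'g': 'A'}, 100.0) for _ in range(100)] + [({'g': 'B'}, 1.0) for _ in range(10_000)]
total = sum(m for _, m in rows)

sample = measure_biased_sample(rows, k=500)
print("true SUM for g='A':", 100 * 100.0)
print("estimated SUM     :", estimate_sum(sample, total, lambda a: a['g'] == 'A'))
```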

  • If we want to approximate the result of a join query (e.g., SELECT ... FROM T1 JOIN T2 ON T1.a = T2.b), we might draw independent samples S1 from T1 and S2 from T2 and perform a join on S1 and S2. What's the fundamental problem of this approach?

The fundamental issue is that the sampling distributions of S1 and S2 are not the same as the resulting distribution of their join. In the paper "On Random Sampling over Joins" by Chaudhuri et al., it is stated that "the join of random samples from R1 and R2 does not give a random sample of their join". In fact, the join of the samples is biased: the probability that a tuple appears depends on how many tuples in R2 it joins with.

  • Sampling is a useful operation in data visualization, but it's not clear when and how it can be best employed. What do you imagine are the ideal or worst scenarios (e.g., the visualization, resource availability, type of data) for the sampling approaches in the above papers?

Sampling is useful because it is scalable. However, sampling also produces errors, and the two techniques above mitigate these errors in different ways. PFunk-H mitigates them in the context of human perceptual limitations, which is useful for exploratory analysis where analysts are not concerned with exact error bounds and just want a visually acceptable representation of the underlying data. Other situations have lower tolerances and call for a specific error bound; Sample + Seek's measure-biased sampling is useful in those instances.

## te2245 Tom Effland

Q1. A simple approach would be to take a uniform random sample of the database, store that, and use the sample exclusively to answer database queries. What is the fundamental problem of this approach?

One fundamental problem with this approach is that, with high probability, outliers in the data won't be included in the sample. Additionally, from the perspective of the readings, this sample will likely be overkill if it is too large or unboundedly inaccurate if it is too small.

Q2. If we want to approximate the result of a join query (e.g., SELECT ... FROM T1 JOIN T2 ON T1.a = T2.b), we might draw independent samples S1 from T1 and S2 from T2 and perform a join on S1 and S2. What's the fundamental problem of this approach?

The issue with this approach is a join of two samples is not equivalent to a sample of a join. This is because the two tables are correlated via the join attribute. By sampling these tables first, we are ignoring this correlation and thus are biasing the sample from the join.

Bonus question. Sampling is a useful operation in data visualization, but it's not clear when and how it can be best employed. What do you imagine are the ideal or worst scenarios (e.g., the visualization, resource availability, type of data) for the sampling approaches in the above papers?

I imagine that the ideal scenarios for sampling are when we want to visualize a single static table with no outliers (or where outliers are unimportant for the visualization.) The worst case would be when outliers are critical to the visualization task or our visualization must be computed over the result of one or more joins of sampled tables.