End to End Visualization Systems

Tom Effland edited this page Feb 1, 2017 · 18 revisions

Readings:

Questions: Each of the three systems employs constraints over its interaction expressiveness in order to achieve performance.

  • In what specific ways does each system constrain the space of explorations that the user can perform, and how does the constraint enable better performance? (what does performance mean in each case?)

Polaris and imMens are related to SQL.

  • What is SQL's relationship with data visualization? What types of visualizations/interactions can it express? When is it not enough?

Create a new section with a subheading ## YOURUNI YOURNAME and write the question answers in your section.

ew2493 Eugene Wu

Example answers here

bgw2119

Polaris

In what specific ways does each system constrain the space of explorations that the user can perform, and how does the constraint enable better performance? (what does performance mean in each case?)

  1. The one explicitly identified constraint I found is that, when creating "mappings of ordinal fields to orientation", the orientations in each mapping need to differ by at least 30 degrees. This limits the automatically created mappings to no more than six categories. Presumably this decreases computational demands and thus increases the speed of the process.

  2. Depending on what is meant by performance, another item is of note as well. The paper mentions that, when creating marks for visualizations, a minimum size (the constraint) is chosen such that the properties of the mark other than size can still be identified. This is more of a natural constraint, in that the system doesn't explicitly identify it (based on my reading). It enables better performance if performance means the accurate examination of the data visualizations.

imMens

In what specific ways does each system constrain the space of explorations that the user can perform, and how does the constraint enable better performance? (what does performance mean in each case?)

  1. Limiting the number of dimensions to 4 is one such constraint. It improves performance as without it the visualization may run out of memory resulting in extremely poor speeds and a large reduction in the number of items able to be visualized.

  2. Rectangular tessellations are used for binning instead of alternatives like hexagons because of the faster processing times of the rectangular approach (speed in creating data visualizations being the performance metric).

  3. Stretching the word performance again to include accurate analysis of the plots, another constraint is using only color to indicate density.

What is SQL's relationship with data visualization? What types of visualizations/interactions can it express? When is it not enough?

First, SQL is the way to access the database that holds the actual data. It is also a way to express relational algebra in order to get the data back in an arrangement that aids visualization. The rate at which data can be obtained through SQL queries is a latency cost, and thus an important technical consideration. SQL can aid many visualizations/interactions where the stored data is directly what the visualization needs, and it is especially good when using multiple tables that have a specific relationship that can be modeled as a join. (Here I'm examining the case of using the results of an SQL query immediately for the visualization and ignoring post-processing, as that opens up virtually endless possibilities as long as the latency cost due to the computational overhead is paid.) Furthermore, SQL does well with filtering and grouping data based on a shared value. SQL is not as good at tasks that require the data to be aggregated in complex ways, processed (programming languages are used to make up for this deficiency), or, as stated in the Polaris paper, partitioned.
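
As a minimal sketch of the join, filter, and group pattern described above, the following uses Python's built-in `sqlite3` module. The `products` and `sales` tables and all of their rows are hypothetical, invented only for illustration; they do not come from either paper.

```python
import sqlite3

# Hypothetical schema and rows, invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products (product_id INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE sales (product_id INTEGER, month TEXT, profit REAL);
    INSERT INTO products VALUES (1, 'coffee'), (2, 'tea'), (3, 'coffee');
    INSERT INTO sales VALUES
        (1, 'Jan', 10.0), (1, 'Feb', 12.0),
        (2, 'Jan', 4.0),  (2, 'Feb', 6.0),
        (3, 'Jan', 7.0),  (3, 'Mar', 9.0);
""")

# Join the two tables, filter to a subset of months, and group by a shared value.
bar_rows = conn.execute("""
    SELECT p.category, SUM(s.profit) AS total_profit
    FROM sales AS s
    JOIN products AS p ON p.product_id = s.product_id
    WHERE s.month IN ('Jan', 'Feb')
    GROUP BY p.category
    ORDER BY p.category
""").fetchall()

# Each result row is already one bar of a bar chart: (label, height).
for category, total_profit in bar_rows:
    print(category, total_profit)
```

The query result can be handed to a plotting step without further processing, which is the "use the results immediately" case the answer restricts itself to.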

fp2358 Fei Peng

  • In what specific ways does each system constrain the space of explorations that the user can perform, and how does the constraint enable better performance? (what does performance mean in each case?)

Polaris

In the introduction, the authors say that "all intermediate specifications that can be created in the visual language are valid and can be interpreted to create visualizations". So the constraint on data exploration in Polaris stems from the design or grammar of the visual language used to create visual specifications. Performance in Polaris means expressiveness.

  1. Table Algebra. Polaris supports two types of data - ordinal and quantitative - and provides three operators - cross, nest and concatenation. These datatypes and operators limit the visual specifications that users can create. For example, a data selection that involves a condition cannot be expressed: select the profit in January when the year is even and the profit in February when the year is odd. This constraint makes the system usable and robust, because developers do not need to worry about validating users' visual specifications and users always get a response whatever specification they give.

  2. Graph Type. Although users can specify the axes of a graph, they must choose a graph type from those Polaris supports, such as rectangle, circle, glyph, etc. This is a trade-off between flexibility and succinctness. If Polaris provided more flexible graph configuration, the system would be much more complicated and harder to use.

imMens

imMens is designed for scenarios with large datasets. So its performance refers to low latency during real-time display and small information loss after data reduction.

  1. The number of dimensions is limited to at most four in order to support real-time brushing & linking. The bin count is also adjusted according to the size of the sub-cubes. This constraint limits the datasets that users can use in imMens but reduces the time and space required for computation.

  2. Type of visualization. imMens supports 1D and 2D binned plots for four types of data: numeric, ordinal, temporal and geographic. Most of them use color encoding without displaying actual measure values, because values after data reduction are not representative. This is the inevitable result of the data reduction required to transform the original big dataset into smaller data tiles suitable for real-time data interaction.

  • What is SQL's relationship with data visualization? What types of visualizations/interactions can it express? When is it not enough?

Both Polaris and imMens are high-level tools for visual interaction based on actual databases. The first step of data visualization, selecting the subset of data that users need for visualization, is conducted by SQL queries. However, the motivation for these visualization tools is that SQL is hard to use if the analysts do not know SQL grammar or the computation is too complicated to be expressed in a simple SQL script.

SQL cannot provide any visualization except for displaying simple spreadsheets. But SQL supports interactions that involve complex computation such as aggregation, filtering with conditions, comparison, etc. I think SQL can do any possible computation, but the problem is that some tasks may lead to long, complicated and nested SQL scripts. So SQL is not suitable for graph display or for users without a database background. Besides, some data analysis may involve regression, machine learning or other advanced techniques. In that case, SQL is not enough.

gf2308 Gal Levy-Fix

Polaris provides graphical representations of tables generated from the data using defined algebraic formalism. The system is thus constrained by the data transformations possible by the defined algebraic formalism. This constraint makes the system straightforward to use for those familiar with relational algebra. The limited space of operations also allows the system to be optimized to quickly perform the data manipulation and visualizations. Performance can be defined by expressiveness, ease of use, speed, and accuracy of the system.

imMens constrains the representation of the data to binned aggregation of the data. Image tiles for these bins are pre-computed, limiting the data to 3 or 4 dimensions at a time. Plots are limited to 2 dimensions of the multi-dimensional data. These constraints are put in place so that the system can perform interactive visualization with very large datasets. Performance seems to be largely defined by speed.

Data visualizations can be created from SQL queries. These visualizations are largely limited to the data types amenable to SQL and the data transformations possible in SQL. These interactions may be too limited when it comes to complex data types like image data or visualization based on some more advanced modeling.

gr2547 Gabriel Ryan

Polaris

Polaris constrains explorations to tables via a 'relational table algebra'. Valid operations are cross (enumerate all combinations), nest (display one value nested inside another, quarters and months for example), and concatenate (display both next to each other). This allows the user to view a table as a pivot table, where it is easy to 'pivot' between viewing different dimensions of the table. Performance in this case is defined as the user's ability to easily compare data across several dimensions to answer questions effectively (see the coffee CFO example).
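
The three operators can be sketched in a few lines of Python. This is a simplified reading that treats each field as an ordered list of its values; the paper's formal definitions are richer (they also cover quantitative fields). The `records` data and the function names `domain`, `concat`, `cross`, and `nest` are hypothetical and chosen only for illustration.

```python
from itertools import product

# Hypothetical records, invented for illustration: (quarter, month, profit).
records = [
    ("Q1", "Jan", 10), ("Q1", "Feb", 12), ("Q1", "Mar", 9),
    ("Q2", "Apr", 14), ("Q2", "May", 11),
]

def domain(field_index):
    """Ordered distinct values of one field, in first-seen order."""
    return list(dict.fromkeys(r[field_index] for r in records))

def concat(a, b):
    """Concatenate: place the entries of a and the entries of b next to each other."""
    return [(x,) for x in a] + [(y,) for y in b]

def cross(a, b):
    """Cross: every combination of an entry of a with an entry of b."""
    return list(product(a, b))

def nest(i, j):
    """Nest: only the (field i, field j) pairs that occur together in the records."""
    return list(dict.fromkeys((r[i], r[j]) for r in records))

quarters, months = domain(0), domain(1)
print(len(concat(quarters, months)))  # 2 + 5 headers
print(len(cross(quarters, months)))   # 2 * 5 headers, including empty ones such as (Q1, Apr)
print(nest(0, 1))                     # only the 5 quarter/month pairs present in the data
```

In this sketch the difference between cross and nest is that nest consults the data, so a month only appears under the quarter it belongs to.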

imMens

imMens constrains explorations to brushing and linking, which allows it to use data cubes and data tiles to optimize queries. Results are visualized using binned aggregation, which is effective for visualizing large volumes of data while still showing outliers. Performance is defined as maintaining interactive speeds while visualizing a large data set.

SQL does not seem to allow efficient calculation of more sophisticated pivot tables with aggregates like median and mode.

dn2311 Drashko Nakikj

Each of the three systems employs constraints over its interaction expressiveness in order to achieve performance. In what specific ways does each system constrain the space of explorations that the user can perform, and how does the constraint enable better performance? (what does performance mean in each case?)

Polaris

The Polaris interface is built upon a formalism for constructing graphs and building data transformations. The state of the interface (a set of interactions with it at time t0) represents a visual specification of the analysis task and is automatically compiled into data and graphical transformations, which are presented as tables that consist of layers and panes, where each pane might be a different graphic. The primary interaction that defines the dimensions (which partition the table into rows and columns) and measurements (axes within the panes) of the analysis is drag-and-drop of fields from the DB schema onto shelves in the display. This sequence of interactions determines the structure of the table and the types of graphs in each table pane. However, this approach does not exhaust all possible "visual questions" that can be defined through the interaction and "answered" in a single table view. Brushing (selecting data points) and linking (data display as a response to the brushing) is an interaction that helps here. All these interface interactions are eventually mapped to standard SQL queries, and that is precisely the driver behind the design of the system: interactions that produce visual specifications that rely on relational algebra (and table algebra) and can be transformed into SQL queries. So, the interaction expressiveness is limited by the capability of the visual elements that the interactions produce to be transformed into SQL queries. Performance in Polaris was determined by the capability to iteratively display multi-dimensional data in a way that makes it easy and straightforward to detect and compare patterns and trends, by issuing "visual queries" that can be transformed into standard SQL statements. The size of the data to be queried and presented, how that may affect interaction, and the response time (latency) were not a primary focus.

imMens

Unlike Polaris, imMens is a system that is deeply focused on the quantity of the data to be queried and visualized. It is primarily focused on perceptual and interactive scalability, or in other words: how does the scale of the data affect the perception of the visualized data (representation density) and constrain the interactions (zooming, panning, brushing & linking, response time, rendering), and what are the possibilities for overcoming some of the imposed barriers? One big premise in the imMens design is that perception and interaction should be limited by the chosen resolution of the visualized data and not by the number of records in the dataset. To achieve this goal, imMens takes one of the possible approaches to data reduction, called binned aggregation, and applies it to numeric, ordinal, temporal and geographical variables. This reduction then dictates how the interactions (panning, zooming, and brushing and linking) play out. The main challenge to address is the exponential growth of data as more dimensions are required for the analysis. To address interaction scalability, imMens takes two major approaches: a. enable scalable interaction - precomputation of multivariate data tiles (decomposing a data cube into 3 and 4 dimensional projections for flexible on-demand data management) and b. enable interactive visualization - parallel data processing and rendering (fast response time through a dense indexing scheme and WebGL for parallel processing on the GPU). In essence, the interaction expressiveness in imMens is primarily constrained by the binned aggregation data reduction approach, which calls for frequent panning and zooming for finer data resolution and partially de-aggregated data for brushing and linking. To achieve fast and seamless flow in the interactive data exploration, imMens has an arsenal of approaches depending on the analytics task and the size of the data: 1. data cube queries; 2. decomposing the full cube into 3 or 4 dimensional sub-cube projections; 3. breaking the sub-cubes into data tiles (dense vs. sparse); and 4. parallel query processing (enabled and supported by the previous data decomposition). This imMens approach demands more frequent interactions, but provides more detailed insights into the data. Compared to Profiler, imMens is better able to update frames during brushing and linking, and its performance is invariant to the original data set size (Profiler can't visualize data with more than 10M records).

Profiler

This system for understanding data quality contributes novel methods for integrated statistical and visual analysis (automation coupled with human insight), and provides automatic view suggestions (important for explaining data anomalies: anomalies become more apparent when put in context and in the right visualization type) and scalable visual summaries (to discover and triage causes and consequences of anomalous data). On a high level, the interaction with the system is driven by the statistical analysis that produces possible anomalies in the data (the Type Registry together with the Detector module), represented in an anomaly browser. For a selected anomaly, data visualization views that might explain it are automatically provided (the Recommender module populates the View Manager). To guide the exploration of anomalies, the Recommender produces a primary view and a set of related views: anomaly-oriented views (visualize the column(s) that contain the anomaly) and value-oriented views (predict the presence of anomalies). The View Manager displays these visual summaries in a linked fashion based on the view specification that came from the Recommender. The user interacts with the set of views, based on the brushing and linking paradigm, to get more insight about the possible anomalies in the data. Basically, Profiler takes some of the interaction load and delegates it to statistical analysis and automatic view recommendations to set up the exploration, and then leaves the interaction to the user for more detailed inspection. Profiler was shown to have good performance (brushing and linking) for datasets with up to 1M records. More importantly, this paradigm of integrating statistical and visual analysis reduced the time spent diagnosing data quality issues, allowing domain experts to spend more time on meaningful analysis.

Polaris and imMens are related to SQL. What is SQL's relationship with data visualization? What types of visualizations/interactions can it express? When is it not enough?

Data visualization utilizes the human visual channel (which has a high information throughput) to detect patterns and trends in data and perform comparisons. This becomes very important when it comes to large and multidimensional datasets. This data is typically stored in relational databases represented by interconnected tables. However, we typically need only a subset of the data to be visualized. To retrieve the data necessary for visualization we have to select the appropriate subset of the data. This is where SQL comes into play. It is a powerful tool for retrieving the subset of data for visualization purposes, which can then be mapped to visual cues. It serves the purpose well for the Polaris system, where each table pane is essentially an SQL query. However, the data visualization process (mapping data to visual cues) has multiple constraints in reality, especially when it comes to interactive visualizations. In this scenario, the point where the data is visualized most often resides on a different machine than the data source itself, and each interaction with the visualization interface might imply a need for a new dataset (a subset from the data source) to be visualized. At least three important steps are involved in this process: retrieving the right data in an acceptably short time, rendering the data (shape, size, orientation, color) in an acceptably short time, and doing so in a manner that enables enough interaction expression with the rendered data. The data can be prohibitively large to be frequently transferred from a remote location, but then also prohibitively big to be stored locally and rendered in a short time with enough granularity for delivering the message and enabling interaction. As we could see from the systems in this reading set, imMens and Profiler, SQL in isolation can't fulfill the demands of such a system. A series of data transformations and manipulations (reduction, decomposition, reconstruction, statistical analysis) is necessary for meeting these strict demands.

sh3266 Daniel Hong

  • In what specific ways does each system constrain the space of explorations that the user can perform, and how does the constraint enable better performance? (what does performance mean in each case?)

Polaris

Polaris makes a conscious graphic decision in choosing a table-based display for data characterizations, while defining (limiting) the exploration tools for expressivity, analytics, and interactivity. There are many design choices involved in Polaris, from retinal variables to graphical language and visualization models for processing queries. Each table is a layer of panes that supports different types of data. This provides the benefit of combining statistical analysis and visualization. Users can place desired data sets from the database schema onto the axis shelves for Polaris to generate a table algebra of relationships between various dimensions of the chosen data sets. Two types of data, ordinal and quantitative, can be placed on the axis shelves, leading to three possible graphic types: ordinal-ordinal, ordinal-quantitative, and quantitative-quantitative (this reminded me of the "dimensions" and "measures" fields of Tableau). This limitation allows the program to deliver more powerful analytics results - user-specified queries mapping to the limited possible graphics receive immediate visual and analytical feedback. Furthermore, unlike a complex analytical or visualization toolkit, Polaris supports a few simple shapes. This may sacrifice expressiveness but allows stronger data and graphical transformations.

imMens

imMens is concerned with appropriately adjusting data visualization resolution through data reduction, to prevent over-plotting and ease the strain on users' perceptual and interactive capacities, as it is difficult to graphically analyze and manipulate millions of records. Methods of data reduction such as random sampling or filtering risk losing significant extreme values, so binned aggregation is chosen to visualize densities in predefined bins and adjust the resolution for four data types: numeric, ordinal, temporal and geographic. The binning method allows the conversion from data cubes to reduced data tiles. This limitation allows faster graphical and analytical computation to provide feedback for real-time queries and interactions.

  • What is SQL's relationship with data visualization? What types of visualizations/interactions can it express? When is it not enough?

Fundamentally, data must be queried to be transformed, manipulated, and visualized. Both Polaris and imMens select records from the database for the initial data flow, but their techniques for querying after applying filters and creating predefined bins may differ. Interactivity at the query level is most commonly selection with user-specified conditions and aggregations. The SQL language itself is highly restrictive. It is extremely difficult to process big data analytics at the query level because graphics are not supported. It is a high-level language for flow of data and cannot be used as a software development or big data analytics tool. It is inappropriate for any big data visualizations and certainly for complex computations.

lw2666 Luren Wang

  • In what specific ways does each system constrain the space of explorations that the user can perform, and how does the constraint enable better performance? (what does performance mean in each case?)

Polaris

Polaris constrains the space of explorations to its exploratory interface. This interface consists of tables, each made up of a number of rows, columns, and layers. The characteristics of Polaris tables allow the user to rapidly change the data they are viewing and how they are viewing that data. The constraint in Polaris is the algebraic system used for table configurations. The algebraic system allows users to create expressive table configurations quickly. It is suitable for Polaris's goal of exploratory analysis since it provides a way for the user to incrementally create complex queries while receiving visual feedback to alter their queries. In contrast to SQL, however, the algebraic system operates at a coarser granularity in terms of data manipulation. Performance, in this case, is measured by how quickly the user can incrementally develop visual specifications and get appropriate feedback. Polaris's pivot-table based interface and algebraic system are optimized for this goal.

imMens

imMens utilizes the data reduction technique: binned aggregation as a system constraint. The binned aggregation is scalable in the context of big data. Additionally, imMens also uses other performant techniques such as a dense indexing scheme which simplifies parallel query processing. Performance in imMens means scalability. For instance, the ability to run the system in the browser at a reasonable fps (50 fps) is considered to be good performance. In this case, binned aggregation is used as it is scalable and avoids overwhelming the user by displaying every single data point. While there are other data reduction techniques, the authors used binned aggregation since they claim it conveys both global patterns such as densities and local features such as outliers while enabling multiple levels of resolution by choosing the appropriate bin size.
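
To make binned aggregation concrete, here is a minimal pure-Python sketch that counts points per rectangular bin at two resolutions. The function names `bin_index` and `binned_counts_2d` and the sample points are hypothetical; this illustrates the general technique, not the imMens implementation (which precomputes tiles and aggregates on the GPU).

```python
from collections import Counter

def bin_index(value, lo, hi, n_bins):
    """Map a value in [lo, hi] to one of n_bins equal-width bins."""
    if value >= hi:
        return n_bins - 1  # put the upper edge in the last bin
    return int((value - lo) / (hi - lo) * n_bins)

def binned_counts_2d(points, x_range, y_range, n_bins):
    """Count points per rectangular bin; at most n_bins * n_bins entries result."""
    counts = Counter()
    for x, y in points:
        key = (bin_index(x, *x_range, n_bins), bin_index(y, *y_range, n_bins))
        counts[key] += 1
    return counts

# Hypothetical data: a small cluster plus one outlier.
points = [(0.1, 0.1), (0.3, 0.1), (0.1, 0.4), (0.3, 0.4), (0.9, 0.95)]

coarse = binned_counts_2d(points, (0.0, 1.0), (0.0, 1.0), n_bins=2)
fine = binned_counts_2d(points, (0.0, 1.0), (0.0, 1.0), n_bins=4)
print(dict(coarse))
print(dict(fine))
```

At the coarse resolution the cluster collapses into one dense bin while the outlier keeps its own bin; at the finer resolution the cluster spreads over four bins. The size of the result is bounded by the number of bins, not by the number of input points.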

  • What is SQL's relationship with data visualization? What types of visualizations/interactions can it express? When is it not enough?

SQL is the language that databases implement. On the other hand, data visualization systems are built on top of databases. Therefore, visualization systems will use SQL at the low level to obtain data from the database. However, visualization systems may also implement their own standards that are optimized for exploratory analysis. For example, Polaris uses an algebraic system for its table configurations. While SQL has fine granularity when it comes to data manipulation, it is not suited for exploratory analysis. SQL can perform complex manipulations such as aggregation, conditional filtering, etc. However, the drawback of SQL is its verbosity. In exploratory analysis, users do not want to deal with complex queries that may contain nested SQL statements for example. Instead, flexibility must be sacrificed for the sake of conciseness for data visualizations. Additionally, SQL is also not suitable for very complex data manipulation tasks such as machine learning. In this case, a fully fledged programming language such as Python is used.

az2407 Alireza Zareian

  • In what specific ways does each system constrain the space of explorations that the user can perform, and how does the constraint enable better performance? (what does performance mean in each case?)

    • Polaris: the authors define an algebra for visualization. This algebra serves as a constraint on the visualizations a user can make. So the cost is limitation in possible visualizations. The benefit is that, such an algebra allows the authors to map all possible visualizations to SQL queries. Since relational database management systems are usually implemented with high performance, if visualizations can be directly mapped to SQL queries, they can be generated efficiently.

      More specifically, the algebra limits each graphical view to a table, with rows, columns, and layers. Users map database variables to the axes of this table. Each axis might consist of nested variables. Each entry in such a table is mapped to a SQL query, selecting a part of the data using a WHERE clause. These selected parts of the data are used to generate the content of each table entry. To generate the graph for each entry, aggregation, grouping and sorting, as needed, can be mapped to the aforementioned SQL query, and the result can be mapped to visual elements (and their properties, e.g. location, size, etc.) as specified by users.

    • imMens: the only idea of Polaris for scalability is to use SQL queries to retrieve data. But imMens goes beyond this by adding several other layers of efficiency, both to generally speed up graph generation and to speed up the interaction with the visualization. Polaris does not provide any solution for interaction at all.

      The first idea is based on the following quoted principle: "perceptual and interactive scalability should be limited by the chosen resolution of the visualized data, not the number of records." More specifically, the idea is to aggregate data into bins, where the number of bins is selected according to the number of display pixels devoted for the graph. Panning and zooming can be easily done by changing the bin resolution.

      The second idea is to break the high-dimensional data cube into the minimal set of 2 or 3-dimensional cubes (tiles) required to represent all dimensions. They implement brushing and linking by aggregating two tiles. They additionally discuss some other ideas such as precomputation of tiles in different bin resolutions, sparse/dense data tile storage, and parallel query processing.

      imMens also poses some constraints on the visualizations. Each visualization consists of a set of either two-dimensional charts or three-dimensional heatmaps that can be linked. All the design principles, like binned aggregation and 2 or 3-D data tiles, are based on these constraints.
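
The idea of answering a brush by aggregating a precomputed tile can be sketched as follows. This is a deliberately simplified illustration with a single small tile held in a Python list; the `tile` values and the function `histogram_of_a` are hypothetical, and the real system works on larger multivariate tiles in parallel.

```python
# Hypothetical 2-D data tile: tile[a][b] = number of records in bin a of A and bin b of B.
tile = [
    [5, 0, 1],
    [2, 3, 0],
    [0, 4, 6],
]

def histogram_of_a(tile, brushed_b_bins=None):
    """Sum the tile over B; if a brush is given, sum only the brushed B bins."""
    n_b = len(tile[0])
    selected = range(n_b) if brushed_b_bins is None else brushed_b_bins
    return [sum(row[b] for b in selected) for row in tile]

full = histogram_of_a(tile)            # unfiltered histogram of A
linked = histogram_of_a(tile, [1, 2])  # histogram of A for records brushed in B bins 1 and 2
print(full, linked)
```

The work done per brush depends on the number of bins in the tile rather than on the number of underlying records, which is the property the quoted principle asks for.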

  • What is SQL's relationship with data visualization? What types of visualizations/interactions can it express? When is it not enough?

    All visualizations in both systems should be based on selection and possibly aggregation queries on a filtered part of data, possibly grouped by some dimensions. This of course covers a wide range of visualizations, but has limitations. More importantly, interaction is done by changing the query and running it again, which might not always be the best type of interaction.

    • Polaris: The visualization consists of running a query for each entry of the table. The queries only differ by the range of the filter in the WHERE statement. In each query, at most two dimensions are selected to make a 2-dimensional plot. Each dimension can be aggregated by some grouping dimension. Clearly, this is not the set of all possible visualizations, although it can cover a wide range of useful visualizations. One can compare this to D3, which is powerful enough to represent all possible visualizations, but of course does not solve the scalability issue. Moreover, interaction is done by changing the query and running it again, which might not always be the best type of interaction.

    • imMens: While the ideas behind imMens are not limited to SQL-based databases, I suppose we use a SQL-based backend for it to make it comparable with Polaris. In this system we are limited to 2 or 3-D graphs which are generated by running queries which select 2 or 3 dimensions. Each dimension is grouped to bins and aggregated over bins. This is similarly not as diverse as graphs possible by D3. But again it is worth it for the sake of scalability. Interaction is better defined in this system where panning, zooming, brushing and linking are explicitly defined. But each interaction again transforms to new SQL queries that need to run from scratch, and this might limit the efficiency of the system.

te2245 Tom Effland

Polaris

In what specific ways does each system constrain the space of explorations that the user can perform, and how does the constraint enable better performance? (what does performance mean in each case?)

Polaris defines an algebra over possible groupings of table attributes (pivot tables) that directly maps from visualization specifications to SQL queries. This constrains the set of possible visualizations to table-based graphics, but gets wins in performance by allowing query optimization to be done at the DB level. Additionally there are gains in development performance, because the interface allows the users to build up more complex queries (that could benefit even more from query optimizers) iteratively without having to write the SQL themselves.
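
A rough sketch of what "mapping a specification to SQL" could look like is given below. The function `spec_to_sql`, the `sales` table, and its rows are all hypothetical, and the paper's actual query generation is more involved; the sketch only shows that a shelf-style specification can be turned mechanically into a GROUP BY query that the database then optimizes.

```python
import sqlite3

def spec_to_sql(table, row_fields, col_fields, measure, aggregate="SUM"):
    """Turn a hypothetical shelf specification into one GROUP BY query.

    Identifiers are interpolated directly, so this is only safe for
    trusted, hard-coded names as used in this sketch.
    """
    dims = list(row_fields) + list(col_fields)
    dim_list = ", ".join(dims)
    return (
        f"SELECT {dim_list}, {aggregate}({measure}) AS value "
        f"FROM {table} GROUP BY {dim_list} ORDER BY {dim_list}"
    )

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (state TEXT, product TEXT, quarter TEXT, profit REAL);
    INSERT INTO sales VALUES
        ('NY', 'coffee', 'Q1', 10), ('NY', 'coffee', 'Q2', 12),
        ('NY', 'tea',    'Q1', 3),  ('CA', 'coffee', 'Q1', 8),
        ('CA', 'tea',    'Q1', 5),  ('CA', 'tea',    'Q1', 2);
""")

# Rows shelf: state; columns shelf: product; measure: profit.
query = spec_to_sql("sales", ["state"], ["product"], "profit")
panes = conn.execute(query).fetchall()
print(query)
print(panes)  # one (state, product, value) tuple per table pane
```

Adding a field to a shelf only changes the arguments to `spec_to_sql`, which mirrors how a user can build up a more complex query iteratively without writing the SQL by hand.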

imMens

  • In what specific ways does each system constrain the space of explorations that the user can perform, and how does the constraint enable better performance? (what does performance mean in each case?)

They focus on visualizations that utilize binned aggregation. They argue that this approach is better than sampling, as control over bin sizes makes resolution and detection of outliers more doable than with sampling. They further constrain binning to rectangular bins, as this ensures that linked selections over bins are compatible with 1d and 2d plots. They get performance wins with these assumptions because a binned aggregation over an n-d table can be precomputed as a materialized view. By further recognizing that linked plots typically do not utilize more than 4 dimensions, they partition the n-d data cube into 4-d sub-cubes, which keeps the size of these data cubes (mostly) in check, as the number of multiplicative factors in the cube size is now limited to 4 instead of n. For reasonable bin counts (say 50) and n > 4 this means 50^4 << 50^n. They also get wins by performing aggregations over densely partitioned data tiles in parallel.
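
The size argument can be checked with a few lines of arithmetic. The bin count of 50 comes from the text above; the choice of n = 10 dimensions is an assumption made only to have a concrete number, and the function names are hypothetical.

```python
from math import comb

def full_cube_cells(bins_per_dim, n_dims):
    """Cells in one dense cube over all n dimensions."""
    return bins_per_dim ** n_dims

def sub_cube_cells(bins_per_dim, k=4):
    """Cells in one dense k-dimensional sub-cube."""
    return bins_per_dim ** k

bins, n = 50, 10  # 50 bins per dimension as in the text; n = 10 is an assumed example
full = full_cube_cells(bins, n)
one_sub = sub_cube_cells(bins)
# Upper bound if every 4-subset of the n dimensions needed its own sub-cube:
all_subs = comb(n, 4) * one_sub

print(full)      # 50**10
print(one_sub)   # 50**4
print(all_subs)  # 210 sub-cubes of 50**4 cells each
```

Even in the pessimistic case where all 210 four-dimensional sub-cubes are materialized, the total is several orders of magnitude smaller than the single dense 10-d cube.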

  • What is SQL's relationship with data visualization? What types of visualizations/interactions can it express? When is it not enough?

SQL is a language for querying data that is stored in a multi-table relational schema. For visualizations that require subsets of data, SQL is a way to retrieve this data with a good balance of efficiency and expressiveness. It can express pivoting, joins, and aggregations of data attributes. I'm not sure what it cannot support, as UDFs can implement any program over the data.
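
As one small illustration of the UDF point, the sketch below registers a user-defined aggregate with Python's built-in `sqlite3` module via `Connection.create_aggregate`. The aggregate name `py_median`, the `Median` class, and the `sales` rows are hypothetical; other database systems expose UDFs through different mechanisms.

```python
import sqlite3
from statistics import median

class Median:
    """User-defined aggregate: collects the values of a group and returns their median."""

    def __init__(self):
        self.values = []

    def step(self, value):
        # Called once per row in the group; NULLs are skipped.
        if value is not None:
            self.values.append(value)

    def finalize(self):
        # Called once per group after all rows have been seen.
        return median(self.values) if self.values else None

conn = sqlite3.connect(":memory:")
conn.create_aggregate("py_median", 1, Median)
conn.executescript("""
    CREATE TABLE sales (product TEXT, profit REAL);
    INSERT INTO sales VALUES
        ('coffee', 1), ('coffee', 9), ('coffee', 4),
        ('tea', 2), ('tea', 8);
""")

rows = conn.execute(
    "SELECT product, py_median(profit) FROM sales GROUP BY product ORDER BY product"
).fetchall()
print(rows)
```

Here the aggregate runs inside the query, so it composes with GROUP BY like a built-in; whether such a UDF is efficient at scale is a separate question that this sketch does not address.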