# OLAP Queries

Common operation: **aggregate** a measure over one or more dimensions
* e.g. total sales for each city, top 10 products in the Northeast, top-performing employee per region

**Roll-up** is aggregating at successively coarser levels of a dimension hierarchy
* starting from total sales by city, a roll-up gives sales by state. Rolling up again gives sales by country
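
A minimal sketch of rolling up along a location hierarchy, using Python's built-in `sqlite3`. The `sales` table, its columns, and the sample rows are made up for illustration; each roll-up is just a `GROUP BY` on a coarser column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: one row per sale, with the location hierarchy flattened in.
conn.execute("CREATE TABLE sales (city TEXT, state TEXT, country TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        ("Madison", "WI", "USA", 100.0),
        ("Madison", "WI", "USA", 50.0),
        ("Milwaukee", "WI", "USA", 70.0),
        ("Chicago", "IL", "USA", 200.0),
    ],
)

def total_sales_by(level):
    """Aggregate the measure (amount) at one level of the hierarchy."""
    assert level in ("city", "state", "country")  # guard the interpolated column name
    rows = conn.execute(
        f"SELECT {level}, SUM(amount) FROM sales GROUP BY {level} ORDER BY {level}"
    )
    return dict(rows.fetchall())

by_city = total_sales_by("city")        # finest level
by_state = total_sales_by("state")      # roll up city -> state
by_country = total_sales_by("country")  # roll up state -> country
```

In this toy data, grouping by `city` alone is enough. With real data, cities that share a name across states would need `GROUP BY city, state`.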

## Comparison with SQL Queries
Result of join:
* does not lose tuples
* does not produce duplicate tuples

# Cube Operator
With $k$ dimensions, there are $2^k$ possible SQL `GROUP BY` queries, one for each subset of the dimensions we can pivot on

`CUBE BY` expresses all of them in a single statement

```sql
CUBE BY pid, locid, timeid SUM Sales
```
* Equivalent to rolling up Sales on all eight subsets of the set {pid, locid, timeid}, including the empty set (the grand total)
* Each rollup corresponds to a SQL query
```sql
SELECT ..., SUM(S.Sales)
FROM Sales S
GROUP BY grouping-list
```
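
To make the "one query per subset" idea concrete, here is a hedged sketch that enumerates every subset of the dimensions and runs the matching `GROUP BY` query through `sqlite3`. SQLite has no `CUBE` operator, so this imitates it naively rather than showing how a real engine computes it. The sample rows and the `cube` helper are assumptions.

```python
import sqlite3
from itertools import combinations

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Sales (pid INT, locid INT, timeid INT, sales REAL)")
conn.executemany(
    "INSERT INTO Sales VALUES (?, ?, ?, ?)",
    [(1, 1, 1, 10.0), (1, 2, 1, 20.0), (2, 1, 2, 5.0)],  # made-up rows
)

dims = ["pid", "locid", "timeid"]

def cube(dims):
    """Run one GROUP BY query per subset of dims; return {subset: rows}."""
    results = {}
    for r in range(len(dims) + 1):
        for subset in combinations(dims, r):
            if subset:
                cols = ", ".join(subset)
                sql = (
                    f"SELECT {cols}, SUM(sales) FROM Sales "
                    f"GROUP BY {cols} ORDER BY {cols}"
                )
            else:
                # Empty grouping list: the grand total.
                sql = "SELECT SUM(sales) FROM Sales"
            results[subset] = conn.execute(sql).fetchall()
    return results

result = cube(dims)  # 2**3 == 8 result sets, keyed by the grouping columns
```

Running each query separately rescans `Sales` eight times. Avoiding that repeated work is what motivates the interest in computing the cube efficiently.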

CUBE lets us compute aggregates at multiple granularities in one operation
* However, it is expensive to compute and the result can be large
* CUBE may be *fully materialized*, *partially materialized*, or not materialized at all
* Lots of interest in computing it fast, compressing it, approximating it

# Views and Decision Support
OLAP queries are typically aggregate queries
* precomputation is essential for interactive response time
* CUBE is a collection of aggregate queries, so it's important we precompute as much as we can
* Think of **warehouses** as a collection of asynchronously replicated tables and periodically maintained views

## Issues in View Materialization
* Which views should we materialize? What indexes should we build on the precomputed results?
* Given a query and a set of materialized views, can we use the views to answer the query?
* How frequently should we refresh materialized views to keep them consistent with the underlying tables?
* How can we refresh incrementally?
    * Naive: throw away the old view and recompute it from scratch
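
The difference between the naive refresh and an incremental one can be sketched in plain Python for a `SUM ... GROUP BY` view. The view is modeled as a dict, and the function names and data are hypothetical. The sketch handles inserts only; deletions and aggregates such as `MIN`/`MAX` need more bookkeeping.

```python
from collections import defaultdict

def recompute_view(base_rows):
    """Naive refresh: rebuild SUM(amount) per city from the whole base table."""
    view = defaultdict(float)
    for city, amount in base_rows:
        view[city] += amount
    return dict(view)

def refresh_incrementally(view, inserted_rows):
    """Incremental refresh: fold only the newly inserted rows into the view."""
    for city, amount in inserted_rows:
        view[city] = view.get(city, 0.0) + amount
    return view

# Initial load: the view is computed once from the base table.
base = [("Madison", 100.0), ("Chicago", 200.0)]
view = recompute_view(base)

# New rows arrive; only the delta is touched, not the whole base table.
delta = [("Madison", 50.0), ("Milwaukee", 70.0)]
base.extend(delta)
view = refresh_incrementally(view, delta)
```

The incremental path does work proportional to the size of the delta, while the naive path is proportional to the size of the base table.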
    
### Examples:
* **Top N Queries**: To find the 10 cheapest cars, you want to avoid computing the cost of every car and sorting all of them. Some systems let the query hint at how many rows are actually needed:

```sql
-- DB2-style hint appended to the end of a query
OPTIMIZE FOR 10 ROWS
```
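
SQLite has no such optimizer hint, so the sketch below uses `ORDER BY ... LIMIT` with an index on the sort column as the closest portable analog. Unlike a hint, `LIMIT` also truncates the result. The `cars` table, the index name, and the prices are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cars (id INTEGER PRIMARY KEY, price REAL)")
# Hypothetical data: 1000 cars with distinct made-up prices.
conn.executemany(
    "INSERT INTO cars (id, price) VALUES (?, ?)",
    [(i, 5000.0 + (i * 37) % 1000) for i in range(1000)],
)
# With an index on price, the engine can read rows in price order
# and stop after 10 instead of sorting the whole table.
conn.execute("CREATE INDEX cars_by_price ON cars (price)")

cheapest = conn.execute(
    "SELECT id, price FROM cars ORDER BY price, id LIMIT 10"
).fetchall()
```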

* **Online Aggregation**: Can we give the user partial information while the query is still running, before the final result is complete?
    * e.g. *Find average sales by state*: we can show a "running" average as the computation proceeds. Even after only 2000 records, the estimate is often already close to the final result.
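
A minimal sketch of the running-average idea for a single group. The data is synthetic (normally distributed, fixed seed), and it assumes records arrive in random order, which the early estimate depends on.

```python
import random

def running_average(values):
    """Yield (rows_seen, average_so_far) after each value is processed."""
    total = 0.0
    for n, v in enumerate(values, start=1):
        total += v
        yield n, total / n

random.seed(0)
# Synthetic sales amounts; a real system would stream these from the table.
sales = [random.gauss(100.0, 15.0) for _ in range(100_000)]

estimates = dict(running_average(sales))
early = estimates[2000]        # what the user could see after 2000 records
final = estimates[len(sales)]  # the exact answer after the full scan
```

If the scan order were correlated with the sales values (for example, sorted by date during a seasonal trend), the early estimate could be badly biased.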
    
# SQL Server Management Studio
* Integrated environment supporting
    1. SQL Server for relational and OLAP dbs
    2. SQL Server Integration Services for:
        * db service utils, extract-transform-load (ETL) operations
    3. SQL Server Analysis Services
        * data warehouses and OLAP
        * metadata as XML
        * Multidimensional Expressions (MDX)
        
# Summary
* Data warehousing/OLAP is an emerging and rapidly growing subarea of dbs
    * "business intelligence"
    * OLAP and OLTP differ in important ways
* Data warehouses are large, consolidated data repos
* Data warehouses exploit sophisticated analysis techniques: complex multidimensional queries
    * influenced by SQL and spreadsheet
* Important data management issues for DW/OLAP are
    * Semantic integration (different units, table layouts, etc)
    * Heterogeneous sources
    * etc.
* Data warehouses contain fact tables and dimension tables
    * fact tables connect to dim tables via FKs
* Star schema more common than Snowflake
* Vendors have developed sophisticated engines that work with regular dbs to create data warehouses supporting OLAP