<img src="./img/Dolan.png" width="180px" align="right">

# **DATA 6510**
# **Lesson 7: Dimensional Data Warehouses** 
_Facts, Dimensions, and Cubes_

## **Learning Objectives**
### **Theory / Be able to explain ...**
- How database design for analytical applications is different from that for transaction systems
- The various forms of the star schema and when each is most applicable
- The concept of data granularity in dimensional data warehouses
- Data cubes as a dimensional data model 

### **Skills / Know how to ...**
- Identify dimensions that provide context to facts

---
## **BIG PICTURE: The Holy Grail of Data Warehousing**

Data Warehousing as the "one true source of truth" is an idea that goes all the way back to the 1980s. Analysts had by that point built up quite a repertoire of models for just about any kind of analysis. They could create classification and regression trees (decision trees, random forests, etc.). They could do linear, nonlinear, kernel, and logistic regression with large datasets. They could solve optimization problems with thousands of variables and tens of thousands of constraints. Even the neural network models at the core of the latest and greatest deep learning techniques were pretty mature by 1994.  


What was missing was data! Well, sort of. Mainframe systems had been collecting transaction data for decades. Every day, banks and credit card companies were processing millions of transactions, transportation systems were tracking hundreds of thousands of shipments, etc. All we had to do was archive it all so we could analyze it later. Or so it seemed to most of us. 

Analytical data is not the same as transactional data, of course, in ways that were clear at the time.

|   | Transactional Data | Analytical Data |
|---| --- |--- |
| **Scope** | Functional / Operational | Strategic / Executive |
| **Provenance** | Online / Production | Historical + 3rd-Party |
| **Operating Environment** | Enterprise / Big Iron | Workspace / Workstation |
| **Data Quality** | "good enough to run the company" | "small errors compound over time" |
| **Performance Objective** | maximize transactions per hour | minimize time to results |
| **Access to Datasets** | read-only queries and reports | offline, with ability to make corrections |

Where would anybody get analytical data with the right scope, volume, quality, etc. quickly and easily enough to be useful? 

Even back then the ultimate solutions were known, though not remotely close to being available. Consider, for example, this figure from Ralph Kimball's seminal book *The Data Warehouse Lifecycle Toolkit*, published in 1998 and based on original work **from the mid-1980s**.

![Kimball's Data Warehouse](./img/L10_Kimball_Data_Warehouse_Elements.png)

All of the elements of a modern data pipeline are there. The figure even lays out the steps of the ETL process in detail. It was all there ... to be realized *someday*.

Someday is now. With commodity data storage, ample computing power, ready-made software for just about any kind of modeling, and the analytical results to attract attention from management, data infrastructure is finally seen as what it should have been all along: a critical resource upon which the company relies to make it stand out against the competition. 


That's what the vendors tell us, anyway.

In this lesson we will explore the **Dimensional Data Warehouse** model first proposed by Kimball et al. all those years ago. We will also consider how its mass adoption has influenced the SQL standard in recent years, with the addition of **window functions** and **collection** data types (Lesson 8) that relax fundamental assumptions of the relational data model in favor of analytical use cases. 
 



---
## **The Star Schema Pattern**

At some level, nearly all data warehouses look roughly the same. There is a huge table in the center with lots of columns and foreign key relationships to smaller tables around the periphery. This general pattern is called a **star schema**. A relational database that implements the star schema pattern (or one of its variants) is called a dimensional data warehouse.


> **Heads Up**: The star schema is a **design pattern, a standard solution to a standard problem.** By standardizing what they call these solutions, designers can communicate among themselves with just a few words instead of re-explaining the details every time. It is both more efficient and potentially less error-prone, as any standard should be.

The Star Schema pattern features two kinds of tables:
- **A fact table with precomputed quantitative measures.** The measures themselves are somewhat volatile, with new measures continually added and others redefined to suit the ever-changing needs of the analysts. If there is a way to precompute a statistic or other measure so analysts don't have to, then do it. If a given measure is no longer needed or has become misleading, then we redefine or remove it. 
- **Dimension tables that provide context for the facts.** Dimensions are somewhat timeless and immutable. Even when the facts themselves may change over time, the dimensions remain relatively static. 

Generally, the fact table has a foreign key reference to each of the dimension tables. Dimension tables, meanwhile, stand on their own, without any foreign keys. 

The star schema addresses the disconnect between the way data is recorded versus how it is used by analysts. It reduces all data down to measures (facts) and context (dimensions). Since *all information* takes that form eventually, star schemas strike a nice balance between structure and general applicability.
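
The pattern can be sketched in a few lines. This is a minimal illustration rather than the actual PlayFacts schema: the table and column names (`dim_player`, `dim_game`, `play_facts`) and the sample rows are hypothetical, and SQLite (through Python's built-in `sqlite3` module) stands in for whatever database you are using.

```python
import sqlite3

# Hypothetical, heavily simplified star schema:
# one fact table surrounded by two dimension tables.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    -- Dimension tables stand on their own (no foreign keys).
    CREATE TABLE dim_player (
        player_id   INTEGER PRIMARY KEY,
        player_name TEXT NOT NULL
    );
    CREATE TABLE dim_game (
        game_id   INTEGER PRIMARY KEY,
        game_date TEXT NOT NULL
    );
    -- The fact table references every dimension and carries the measures.
    CREATE TABLE play_facts (
        play_id   INTEGER PRIMARY KEY,
        player_id INTEGER NOT NULL REFERENCES dim_player (player_id),
        game_id   INTEGER NOT NULL REFERENCES dim_game (game_id),
        points    INTEGER NOT NULL,
        rebounds  INTEGER NOT NULL
    );
    INSERT INTO dim_player VALUES (1, 'Player A'), (2, 'Player B');
    INSERT INTO dim_game   VALUES (10, '2024-01-05'), (11, '2024-01-07');
    INSERT INTO play_facts VALUES
        (100, 1, 10, 2, 0),
        (101, 1, 10, 3, 1),
        (102, 2, 10, 0, 1),
        (103, 1, 11, 2, 0);
""")

# Measures are aggregated; dimensions supply the labels.
points_by_player = conn.execute("""
    SELECT p.player_name, SUM(f.points) AS total_points
    FROM play_facts AS f
      JOIN dim_player AS p ON p.player_id = f.player_id
    GROUP BY p.player_name
    ORDER BY p.player_name
""").fetchall()
print(points_by_player)
```

Notice that the query only ever sums columns from the fact table and only ever labels results with columns from a dimension table.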

Once again, here is the NBA PlayFacts warehouse, this time noting some of the key features. We will use it as an example, starting with the dimensions before moving on to the facts. We will also explore variations on the general star schema pattern that fit certain use cases. 

![NBA PlayFacts Dim DW](./img/L10_Star_Schema_Notes.png)

---
## **Dimension Tables**

Dimensions are the lens through which we interpret each fact. They are what give it context and meaning. 


In theory the **dimensions are strong entities that exist independently of the facts**. Each fact, meanwhile, represents a collection (more like a selection) of details found in the dimensions. We'll go deeper into this when we discuss fact tables and granularity.

Though there is some disagreement about this, **the usual recommendation is that dimension tables be fully _denormalized_**. Since they are often fairly small (relative to the fact table) and don't change much, there is little chance of creating anomalies over time. So, while it may be tempting to, for example, normalize out zip codes and cities from a location dimension, there is no real need, especially when it would require an unnecessary table join.

So, what kinds of details are we talking about? A good starting point is the framework used by journalists and storytellers the world over:
- **Who was involved?** People, roles, etc. 
- **What happened?** Event types, outcomes, etc. 
- **When did it happen?** Timing or place in a sequence
- **Where did it happen?** Location, which may be conceptual rather than physical
- **Why did it happen?** Intent, cause, etc. 
- **How did it happen?** Steps, sequential logic, etc.

Each of the who/what/when/where/why/how questions can suggest multiple dimensions. As we can see in the NBA example, the question of "who is involved in a given play?" is answered with *three* dimensions: 
- the individual player who gets credited with each event
- the lineup of players on the court at the time
- the team whose play is being reflected by the facts

It is done this way to support different, independently-calculated measures:
- the counting stats (points, rebounds, assists, etc.) for an individual player, lineup, or team
- the total playing time for each player or lineup on the court, even when they are not generating counting stats

### **How Many Dimensions? Granularity and Focus**

An often overlooked but potentially tricky aspect of dimensional design is whether dimensions are allowed to overlap. In other words, can the same dimension be represented two different ways? Can we combine dimensions to create a third uber-dimension? 

Like a lot of things, it depends. 

**When to combine dimensions:** Some dimensions may permit several levels of *granularity*. For example, a given office is in a region, which is in a division, etc. To borrow terminology from normalization, we say that there is a chain of functional dependencies: office $\rightarrow$ region $\rightarrow$ division. The recommendation is to keep the levels together in a single dimension table rather than separate them out. 

> This would apply, for example, to the way NBA teams are grouped into *divisions* and *conferences* (Eastern or Western) that affect scheduling decisions. In other words, team $\rightarrow$ division $\rightarrow$ conference. Since each group nests cleanly inside the next, there is no need to separate them.
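
As a rough sketch of the "keep the levels together" recommendation, the example below stores division and conference as plain columns of a single team dimension. The table names, team names, and point totals are made up for illustration; the helper `total_points_by` is likewise hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical team dimension: division and conference are kept as
    -- columns in the same table (team -> division -> conference).
    CREATE TABLE dim_team (
        team_id    INTEGER PRIMARY KEY,
        team_name  TEXT NOT NULL,
        division   TEXT NOT NULL,
        conference TEXT NOT NULL
    );
    CREATE TABLE team_facts (
        team_id INTEGER NOT NULL REFERENCES dim_team (team_id),
        points  INTEGER NOT NULL
    );
    INSERT INTO dim_team VALUES
        (1, 'Team A', 'Atlantic', 'Eastern'),
        (2, 'Team B', 'Central',  'Eastern'),
        (3, 'Team C', 'Pacific',  'Western');
    INSERT INTO team_facts VALUES (1, 100), (1, 90), (2, 110), (3, 120);
""")

def total_points_by(level):
    """Sum points at one level of the team hierarchy."""
    # `level` is interpolated into the SQL, so restrict it to known columns.
    if level not in ("team_name", "division", "conference"):
        raise ValueError(f"unknown level: {level}")
    return conn.execute(f"""
        SELECT t.{level}, SUM(f.points)
        FROM team_facts AS f
          JOIN dim_team AS t ON t.team_id = f.team_id
        GROUP BY t.{level}
        ORDER BY t.{level}
    """).fetchall()

by_division = total_points_by("division")
by_conference = total_points_by("conference")
print(by_division)
print(by_conference)
```

Because every level lives in the same table, changing the level of aggregation only changes the column named in the `GROUP BY`; the join stays the same.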

**When to separate into multiple dimensions:** Within a company the *location* of a fact can mean geolocation (addresses) or a spot in organizational hierarchy (functional area, group, etc.). While they are both location dimensions, they are not logically connected via a functional dependency. They should reside in separate dimension tables, not one. 

> We have already seen this case with the NBA players, lineups, and teams. There are many-to-many relationships among them, making it impossible to create a consistent functional dependency chain; they don't nest together cleanly. 

### **Slowly Changing Dimensions**

There are plenty of specialized types of dimensions (conformed, role-playing, junk, etc.) that serve different purposes. Most of the time, however, we are working with so-called **_slowly changing_** dimensions, where the rows and columns don't change much over time. Often such dimensions can be maintained by hand or perhaps via a periodic updating process.

For example, while NBA players do occasionally move from team to team or sign "10-day contracts" to replace injured players, for the most part the team rosters do not change much during the season. We can maintain them as needed, starting with a preliminary roster at the start of the season and then adding to it as new players join the team. 

Similarly, NBA game schedules are released well in advance, allowing us to populate the games dimension all at once, with perhaps a rare rescheduled game to update later. 
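
A periodic updating process for such a dimension can be very simple. The sketch below is one possible approach, not a prescribed method: the `dim_player` table and the `add_new_players` helper are hypothetical, and `INSERT OR IGNORE` is SQLite syntax (PostgreSQL would use `INSERT ... ON CONFLICT DO NOTHING` for the same effect).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical player dimension holding the preliminary roster.
    CREATE TABLE dim_player (
        player_id   INTEGER PRIMARY KEY,
        player_name TEXT NOT NULL
    );
    INSERT INTO dim_player VALUES (1, 'Player A'), (2, 'Player B');
""")

def add_new_players(conn, roster):
    """Insert players not yet in the dimension; existing rows are left untouched."""
    before = conn.execute("SELECT COUNT(*) FROM dim_player").fetchone()[0]
    conn.executemany(
        "INSERT OR IGNORE INTO dim_player (player_id, player_name) VALUES (?, ?)",
        roster,
    )
    conn.commit()
    after = conn.execute("SELECT COUNT(*) FROM dim_player").fetchone()[0]
    return after - before

# The latest roster repeats one known player and adds one new player.
added = add_new_players(conn, [(2, "Player B"), (3, "Player C")])
print(added)
```

Running the same roster through the helper a second time adds nothing, so the update can be scheduled as often as is convenient.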

---
## **Fact Tables**

**Fact tables exist at the intersection of the dimension tables.** Each fact is labeled with foreign keys, usually one key per dimension. The rest of the columns are measures that can be used in aggregate calculations. 

What makes a good measure? Anything for which we can calculate descriptive statistics (counts, averages, etc.): 
- For text data, we generally are limited to the text itself and counts of some sort. We may, for example, count the number of times the word "no" appears, how many sentences there are, etc. 
- For numerical data we can use all of the usual statistics like mean, maximum, minimum, etc., or perhaps *bin* into nonoverlapping categories (or *segments*). 
- For temporal data (dates and times), we may calculate elapsed times, inter-event times, cumulative times, etc. that can be treated like numerical data.
- For binary data (pictures, etc.), the options are very limited, though one may be able to apply a machine learning technique to generate numerical digests that can be aggregated.  

Interestingly, the fact table can only be as **granular** as the **dimensions** allow. In other words, if a given dimension only has 3 possible labels (rows), then that dimension can only divide up the facts three ways. We can, however, increase the granularity by adding **new** dimensions. Each additional dimension potentially increases the granularity but never decreases it. Alternatively, we can increase the granularity by adding strategically selected rows to an **existing** dimension. 

One way to visualize this is with a (hyper-) cube, with each dimension on a side. Each fact is *binned* inside one of the smaller cubes at the intersection of the dimensions. For the NBA PlayFacts cube below, each fact is binned based on the game, team, and player. Thus, with only three dimensions, we would only be able to generate box score stats for full games. In order to get statistics within a game (e.g., for the last two minutes of each period) we would need to include a *play segment* dimension. (Don't ask about how we'd show a 4-dimensional cube. Just know that we can.) Thus, by adding a dimension we have also increased the granularity. 

![Data Cube](./img/L10_DataCube_wide.png)
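
To make the granularity argument concrete, here is a small sketch that bins the same facts first by three dimensions and then by four. The `play_facts` rows are invented, and `segment_id` is a hypothetical stand-in for the play segment dimension (here, simply the period number).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical fact rows, each labeled with four dimension keys.
    CREATE TABLE play_facts (
        game_id    INTEGER NOT NULL,
        team_id    INTEGER NOT NULL,
        player_id  INTEGER NOT NULL,
        segment_id INTEGER NOT NULL,   -- e.g., period 1 through 4
        points     INTEGER NOT NULL
    );
    INSERT INTO play_facts VALUES
        (1, 1, 7, 1, 2),
        (1, 1, 7, 2, 3),
        (1, 1, 7, 4, 2),
        (1, 1, 8, 1, 2),
        (1, 1, 8, 4, 1);
""")

# Three dimensions: one cell per (game, team, player), i.e., box score stats.
box_score = conn.execute("""
    SELECT game_id, team_id, player_id, SUM(points)
    FROM play_facts
    GROUP BY game_id, team_id, player_id
    ORDER BY player_id
""").fetchall()

# Four dimensions: the same facts split further by play segment.
by_segment = conn.execute("""
    SELECT game_id, team_id, player_id, segment_id, SUM(points)
    FROM play_facts
    GROUP BY game_id, team_id, player_id, segment_id
    ORDER BY player_id, segment_id
""").fetchall()

print(len(box_score), len(by_segment))
```

The four-dimensional grouping can never produce fewer cells than the three-dimensional one, which is the sense in which an added dimension never decreases granularity.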

> **Heads Up:** It is sometimes difficult to distinguish dimensions from measures when source data is numerical. For example, is the time on the clock (i.e., seconds remaining in the period) a measure or a dimension? It is a measure, in the sense that it captures the passage of time, but it is also a dimension, in that it records when a given event happened. The key when considering whether any given quantity belongs on the fact table is to ask whether you would i) aggregate it (sum, average, etc.) or ii) cite it. In a basketball game it is the latter (cited), so we separate it out into the play segment dimension. The clock *interval* between events (elapsed time), however, is something that we can sum up by quarter, player, etc. Thus, it belongs on the fact table. 
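
The aggregate-versus-cite test can be sketched as follows, assuming a hypothetical `dim_play_segment` table that holds the clock reading and a fact table that holds the elapsed time since the previous event. The numbers are invented and assume 720-second periods.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical play segment dimension: the clock reading is cited, not summed.
    CREATE TABLE dim_play_segment (
        segment_id        INTEGER PRIMARY KEY,
        period            INTEGER NOT NULL,
        seconds_remaining INTEGER NOT NULL
    );
    -- Elapsed time since the previous event is a measure on the fact table.
    CREATE TABLE play_facts (
        segment_id      INTEGER NOT NULL REFERENCES dim_play_segment (segment_id),
        elapsed_seconds INTEGER NOT NULL
    );
    INSERT INTO dim_play_segment VALUES
        (1, 1, 700), (2, 1, 685), (3, 2, 710), (4, 2, 690);
    INSERT INTO play_facts VALUES (1, 20), (2, 15), (3, 10), (4, 20);
""")

# Summing the measure is meaningful; summing seconds_remaining would not be.
elapsed_by_period = conn.execute("""
    SELECT s.period, SUM(f.elapsed_seconds)
    FROM play_facts AS f
      JOIN dim_play_segment AS s ON s.segment_id = f.segment_id
    GROUP BY s.period
    ORDER BY s.period
""").fetchall()
print(elapsed_by_period)
```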

### **Rollups and Drilldowns**

One of the big advantages of a dimensional data warehouse design is that it makes it very simple to aggregate and disaggregate data at various levels of granularity. 


> **Heads Up**: The terms "rollup" and "drilldown" were coined by vendors of OLAP (OnLine Analytical Processing) systems that do real-time ETL from transaction data, but the terms can be readily applied to any dimensional data warehouse.

A **rollup** is the standard `GROUP BY` aggregation operation. The idea is that a whole stripe of data in the cube (i.e., a dimension) is "rolled up" like a carpet and then replaced with summary data. We can do this for several dimensions at a time to get summarized data for various purposes. 

A **drilldown** is the opposite of a rollup. Starting with aggregate data, a drilldown disaggregates it to a finer level of granularity. Behind the scenes, it is the same as a rollup, just with a lower level of aggregation. Ultimately, the lowest level of a drilldown is the fact table itself. 
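
Both operations can be sketched with ordinary `GROUP BY` queries. The fact rows below are invented for illustration, and only two dimension keys are included to keep the example short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical fact rows labeled by game and player.
    CREATE TABLE play_facts (
        game_id   INTEGER NOT NULL,
        player_id INTEGER NOT NULL,
        points    INTEGER NOT NULL
    );
    INSERT INTO play_facts VALUES
        (1, 7, 2), (1, 7, 3), (1, 8, 2), (2, 7, 1), (2, 8, 3);
""")

# Rollup: the player dimension is rolled up, leaving one row per game.
per_game = conn.execute("""
    SELECT game_id, SUM(points)
    FROM play_facts
    GROUP BY game_id
    ORDER BY game_id
""").fetchall()

# Drilldown: the same query with player_id added back to the grouping.
per_game_player = conn.execute("""
    SELECT game_id, player_id, SUM(points)
    FROM play_facts
    GROUP BY game_id, player_id
    ORDER BY game_id, player_id
""").fetchall()

# The lowest level of any drilldown is the fact table itself.
fact_rows = conn.execute("SELECT COUNT(*) FROM play_facts").fetchone()[0]

print(per_game)
print(per_game_player)
print(fact_rows)
```

The only difference between the rollup and the drilldown is the list of columns in the `GROUP BY` clause.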

---
## **Variations**
 


### **Dimension *Tables* vs Dimension *Columns***

One of the advantages of keeping dimensional data in separate tables is that it can significantly reduce storage costs by eliminating redundant data. However, with the advent of cheap cloud-based data storage, cost becomes less important than performance. Thus, we may choose to denormalize everything into a single table (or materialized view) that doesn't require any expensive joins. It's simple enough. If we already have the data in a normalized form, then we would just need to join in every table and select every column to generate a new "one table fits all" data warehouse. 


> A materialized view stores the result of its query like a table, but it has to be refreshed to pick up changes in the underlying tables; see below.

```sql
CREATE MATERIALIZED VIEW denormalized_fact_table AS
  ( SELECT *
    FROM fact_table 
      JOIN dimension_a ...
      JOIN dimension_b ...
      JOIN ...
  );
```

The dimensions would still be there, just as columns instead of tables. The joins would be completed in advance, simplifying `SELECT` queries even further.

> Note that in practice we would not use `*` in a view definition; it's much more efficient to list specific columns. Also, materialized views need to be refreshed just before use to avoid stale data. 
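
The staleness problem is easy to demonstrate. SQLite has no materialized views, so the sketch below uses `CREATE TABLE ... AS SELECT` as a stand-in and a hypothetical `refresh_denormalized` helper in place of a real refresh command. Table names and data are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_team (team_id INTEGER PRIMARY KEY, team_name TEXT NOT NULL);
    CREATE TABLE dim_game (game_id INTEGER PRIMARY KEY, game_date TEXT NOT NULL);
    CREATE TABLE play_facts (
        team_id INTEGER NOT NULL REFERENCES dim_team (team_id),
        game_id INTEGER NOT NULL REFERENCES dim_game (game_id),
        points  INTEGER NOT NULL
    );
    INSERT INTO dim_team VALUES (1, 'Team A'), (2, 'Team B');
    INSERT INTO dim_game VALUES (10, '2024-01-05');
    INSERT INTO play_facts VALUES (1, 10, 2), (2, 10, 3), (1, 10, 2);
""")

def refresh_denormalized(conn):
    """Rebuild the one-table copy; a stand-in for refreshing a materialized view."""
    conn.executescript("""
        DROP TABLE IF EXISTS denormalized_facts;
        CREATE TABLE denormalized_facts AS
          -- Columns are listed explicitly rather than selected with *.
          SELECT f.game_id, g.game_date, f.team_id, t.team_name, f.points
          FROM play_facts AS f
            JOIN dim_team AS t ON t.team_id = f.team_id
            JOIN dim_game AS g ON g.game_id = f.game_id;
    """)

refresh_denormalized(conn)

# A new fact arrives after the copy was built ...
conn.execute("INSERT INTO play_facts VALUES (2, 10, 1)")
stale_count = conn.execute("SELECT COUNT(*) FROM denormalized_facts").fetchone()[0]

# ... and only shows up once the copy is rebuilt.
refresh_denormalized(conn)
fresh_count = conn.execute("SELECT COUNT(*) FROM denormalized_facts").fetchone()[0]
print(stale_count, fresh_count)
```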


While denormalized views sound great in theory, there are two good reasons for creating separate dimension tables: 
- **If care is taken to *use just foreign keys in the `GROUP BY` clause*, then it can *sometimes* actually be faster to query multiple tables than a single table.** This is because the joins will happen **after** the grouping has reduced the data to a manageable number of fact table rows. The incremental performance cost of the join is then practically nil, especially if there are a small number of groups in the result set. 
- **Dimension tables provide opportunities to add in static descriptive data.** For example, we could add in the seating capacity or age of a given basketball arena if we treat it as a dimension table instead of just a column. If an arena were just a few columns then we'd have to update the view each time we added a column. 
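
One way to write the "group first, join later" query from the first point is to aggregate inside a common table expression and join the dimension afterwards. This is a sketch under assumptions: the `dim_arena` table, its `capacity` column, and the data are hypothetical, and whether it is actually faster depends on the database engine and the data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical arena dimension with a static descriptive column.
    CREATE TABLE dim_arena (
        arena_id   INTEGER PRIMARY KEY,
        arena_name TEXT NOT NULL,
        capacity   INTEGER NOT NULL
    );
    CREATE TABLE play_facts (
        arena_id INTEGER NOT NULL REFERENCES dim_arena (arena_id),
        points   INTEGER NOT NULL
    );
    INSERT INTO dim_arena VALUES (1, 'Arena A', 18000), (2, 'Arena B', 20000);
    INSERT INTO play_facts VALUES (1, 2), (1, 3), (2, 2), (2, 2), (2, 1);
""")

points_by_arena = conn.execute("""
    WITH grouped AS (
        -- Group on the foreign key only; no dimension columns are needed yet.
        SELECT arena_id, SUM(points) AS total_points
        FROM play_facts
        GROUP BY arena_id
    )
    -- The join now touches one row per group rather than one row per fact.
    SELECT a.arena_name, a.capacity, g.total_points
    FROM grouped AS g
      JOIN dim_arena AS a ON a.arena_id = g.arena_id
    ORDER BY a.arena_name
""").fetchall()
print(points_by_arena)
```

The descriptive `capacity` column comes along for free in the final join, which also illustrates the second point: it lives in the dimension table and never has to be copied onto the facts.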


Whether either of these advantages is relevant depends on the situation. As a general rule, unless you have a good reason not to, it is best to create dimension tables instead of dimension columns. You can always create a view if needed. 

### **Snowflakes and Galaxies**

The rule that dimensions be denormalized is more of a convenience than a law. The purpose is to make writing queries as simple and bug-free as possible. However, if we are ultimately going to denormalize the data using the same joins every time, then what difference does it make how many joins there are? Such is the reasoning behind the so-called **Snowflake design pattern**.

Here, for example, is the snowflake version of the PlayFacts database. The dotted lines show relationships normalized out from the star schema dimensions. There is no effect on the fact table, just more detail in the dimensions. 

![](./img/L10_NBA_PlayFacts_Snowflake_DW.png) 

The advantages of normalizing the dimensions include:
- less data redundancy, smaller storage requirements, etc. that we usually associate with normalization
- the ability to add new details (like team franchises or arenas) without disrupting existing dimensions

It is important to point out that the snowflake and star schema patterns are both perfectly valid. Which to use is totally situational. Let the data, its usage, and performance considerations guide your decision. 
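
As a rough illustration of the trade-off, the sketch below normalizes arena details out of a team dimension. The tables and data are hypothetical and are not taken from the PlayFacts diagram; the point is only that the fact table is unchanged while queries need one more join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Arena details normalized out of the team dimension (hypothetical branch).
    CREATE TABLE dim_arena (
        arena_id   INTEGER PRIMARY KEY,
        arena_name TEXT NOT NULL,
        capacity   INTEGER NOT NULL
    );
    CREATE TABLE dim_team (
        team_id   INTEGER PRIMARY KEY,
        team_name TEXT NOT NULL,
        arena_id  INTEGER NOT NULL REFERENCES dim_arena (arena_id)
    );
    -- The fact table still references only the team dimension.
    CREATE TABLE play_facts (
        team_id INTEGER NOT NULL REFERENCES dim_team (team_id),
        points  INTEGER NOT NULL
    );
    INSERT INTO dim_arena VALUES (1, 'Arena A', 18000), (2, 'Arena B', 20000);
    INSERT INTO dim_team  VALUES (1, 'Team A', 1), (2, 'Team B', 2);
    INSERT INTO play_facts VALUES (1, 2), (1, 3), (2, 2);
""")

# Reaching arena attributes takes two joins: fact -> team -> arena.
points_by_arena = conn.execute("""
    SELECT a.arena_name, a.capacity, SUM(f.points) AS total_points
    FROM play_facts AS f
      JOIN dim_team  AS t ON t.team_id  = f.team_id
      JOIN dim_arena AS a ON a.arena_id = t.arena_id
    GROUP BY a.arena_id, a.arena_name, a.capacity
    ORDER BY a.arena_name
""").fetchall()
print(points_by_arena)
```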


Like the snowflake pattern, the **galaxy pattern** also extends the star schema, this time by allowing multiple fact tables. The baseball database from Lesson 2 is a galaxy with ...
- Fact tables for fielding, batting, pitching, etc.  
- Dimension tables for players (Master), teams, all star game appearances, etc.  

It also follows the snowflake pattern, with enhancements like All Star Game appearances, Hall of Fame voting, etc. attached to the players and teams. 

![Lahman 2016 ERD](./img/L2_baseball_stats_schema.png)
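
A galaxy can be sketched as two fact tables that share one dimension. The tables below are loosely inspired by the batting and pitching tables but are hypothetical and much simpler than the real schema. Each fact table is aggregated on its own before joining, which is one way to keep rows from one fact table from multiplying rows from the other.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- One shared dimension ...
    CREATE TABLE dim_player (
        player_id   INTEGER PRIMARY KEY,
        player_name TEXT NOT NULL
    );
    -- ... and two fact tables, each with its own measures.
    CREATE TABLE batting_facts (
        player_id INTEGER NOT NULL REFERENCES dim_player (player_id),
        year_id   INTEGER NOT NULL,
        hits      INTEGER NOT NULL
    );
    CREATE TABLE pitching_facts (
        player_id  INTEGER NOT NULL REFERENCES dim_player (player_id),
        year_id    INTEGER NOT NULL,
        strikeouts INTEGER NOT NULL
    );
    INSERT INTO dim_player VALUES (1, 'Player A'), (2, 'Player B');
    INSERT INTO batting_facts  VALUES (1, 2015, 150), (1, 2016, 160), (2, 2016, 10);
    INSERT INTO pitching_facts VALUES (2, 2015, 200), (2, 2016, 210);
""")

career_totals = conn.execute("""
    SELECT p.player_name,
           COALESCE(b.total_hits, 0)       AS total_hits,
           COALESCE(k.total_strikeouts, 0) AS total_strikeouts
    FROM dim_player AS p
      LEFT JOIN (SELECT player_id, SUM(hits) AS total_hits
                 FROM batting_facts
                 GROUP BY player_id) AS b ON b.player_id = p.player_id
      LEFT JOIN (SELECT player_id, SUM(strikeouts) AS total_strikeouts
                 FROM pitching_facts
                 GROUP BY player_id) AS k ON k.player_id = p.player_id
    ORDER BY p.player_name
""").fetchall()
print(career_totals)
```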

---
## **Congratulations! You've made it to the end of Lesson 7.**

Next week we will consider alternative data models that can improve flexibility and performance. 

