# Chapter 3: Data Modelling for Analytics

## Kimball's Dimensional Data Modelling

### Key Concept 1: The Star Schema

The star schema consists of two types of tables:
- The **Fact Table**, which stores the measures of a business process
- The **Dimension Table**, which stores the descriptive context for the associated fact tables

The general layout places the Fact Table at the centre with the Dimension Tables surrounding it, forming the aptly named "Star Schema" design. As a rule of thumb, Dimension Tables change slowly, while Fact Tables grow and change much faster.

For example:

- In a point-of-sale (POS) system, there could be a `Product Dimension` and a `Customer Dimension`, with associated `Order Fact` tables.
- In an HR / applications management system, there could be an `Employee Dimension` and a `Role/Position Dimension`, with associated `Application Fact` tables.
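
To make the POS example concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The table names, column names, and sample rows are all hypothetical and chosen only for illustration; a real warehouse would use its own engine and naming conventions.

```python
import sqlite3

# Hypothetical POS star schema: one fact table referencing two dimensions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product_dim (
        product_id INTEGER PRIMARY KEY,
        name       TEXT,
        category   TEXT
    );
    CREATE TABLE customer_dim (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT,
        city        TEXT
    );
    -- Each fact row holds measures plus a key into every dimension.
    CREATE TABLE order_fact (
        order_id     INTEGER PRIMARY KEY,
        product_id   INTEGER REFERENCES product_dim(product_id),
        customer_id  INTEGER REFERENCES customer_dim(customer_id),
        quantity     INTEGER,
        amount_cents INTEGER
    );
    INSERT INTO product_dim VALUES
        (1, 'Espresso', 'Drinks'), (2, 'Latte', 'Drinks'), (3, 'Bagel', 'Food');
    INSERT INTO customer_dim VALUES
        (1, 'Ana', 'Singapore'), (2, 'Ben', 'Jakarta');
    INSERT INTO order_fact VALUES
        (1, 1, 1, 2, 600), (2, 2, 2, 1, 450), (3, 3, 1, 3, 900);
""")


def revenue_by_category(conn):
    """Sum the fact measure, grouped by a product dimension attribute."""
    rows = conn.execute("""
        SELECT p.category, SUM(f.amount_cents)
        FROM order_fact AS f
        JOIN product_dim AS p ON p.product_id = f.product_id
        GROUP BY p.category
    """).fetchall()
    return dict(rows)


def revenue_by_city(conn):
    """Sum the same measure, grouped by a customer dimension attribute."""
    rows = conn.execute("""
        SELECT c.city, SUM(f.amount_cents)
        FROM order_fact AS f
        JOIN customer_dim AS c ON c.customer_id = f.customer_id
        GROUP BY c.city
    """).fetchall()
    return dict(rows)
```

The same fact table answers both questions; only the dimension it is joined to changes, which is the main appeal of the layout.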

To develop a star schema, we follow the four steps of data modelling:

1. Select a business process to model. What are the inputs and outputs and what does the business intend to measure?
2. Select the grain. What is the atomic level of data (the level at which records cannot be split further) you want to measure?
3. Identify the Dimensions. What are the descriptive contexts of the grain you want to capture? Note that this can also include externally determined dimensions such as temporal attributes (year, quarter, month, week number, day, day of week, etc.).
4. Identify the Facts. What are the numeric performance indicators you want to capture about the business process at the chosen grain?
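
The temporal attributes mentioned in step 3 are usually generated rather than collected. The following is a small sketch of how a date dimension could be built with Python's standard library. The function name `build_date_dim` and the column names are assumptions made for this example, and the week number follows the ISO 8601 convention, which may differ from a given business calendar.

```python
from datetime import date, timedelta

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]


def build_date_dim(start, end):
    """Return one row per calendar day between start and end, inclusive."""
    rows = []
    current = start
    while current <= end:
        rows.append({
            "date_key": current.isoformat(),
            "year": current.year,
            "quarter": (current.month - 1) // 3 + 1,
            "month": current.month,
            "week_no": current.isocalendar()[1],  # ISO 8601 week number
            "day": current.day,
            "day_of_week": DAY_NAMES[current.weekday()],
        })
        current += timedelta(days=1)
    return rows


# Hypothetical one-week range, purely for illustration.
date_dim = build_date_dim(date(2024, 1, 1), date(2024, 1, 7))
```

Fact rows would then carry only the `date_key`, and any report can group by year, quarter, or day of week through a join.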

## Kimball Modelling, then and now

Three characteristics of today's technology are worth keeping in mind when applying Kimball's ideas:

1. Storage is cheap
2. Compute is cheap
3. Engineering time is expensive

**Example 1: Inventory Management**

Previously, to manage inventory movements, Kimball proposed implementing "snapshot" tables taken periodically and performing aggregation across snapshots. With cheaper storage, however, we can get away with simply storing daily dumps of the facts.

**Example 2: Slowly Changing Dimensions**

Previously, when an attribute in a dimension table changed, there were several ways to handle the change:

1. Type 1: Update the column of the dimension directly, overwriting the old value
2. Type 2: Add a new row to the dimension table, with a newer version and the updated column value
3. Type 3: Add a new column to the dimension table so that the row holds both the old and the updated value
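
The three approaches can be sketched in plain Python on a list of rows. This is an illustrative sketch only: the `customer_dim` rows, the `version` and `is_current` columns, and the `previous_city` column are assumptions for the example. For Type 3, the sketch keeps the prior value in the added column; which of the two columns holds the new value is a naming choice.

```python
from copy import deepcopy

# Hypothetical customer dimension; one row per version of a customer.
customer_dim = [
    {"customer_id": 1, "city": "Singapore", "version": 1, "is_current": True},
]


def scd_type1(rows, customer_id, new_city):
    """Type 1: overwrite the attribute; the old value is lost."""
    rows = deepcopy(rows)
    for row in rows:
        if row["customer_id"] == customer_id:
            row["city"] = new_city
    return rows


def scd_type2(rows, customer_id, new_city):
    """Type 2: retire the current row and append a new versioned row."""
    rows = deepcopy(rows)
    current = next(
        r for r in rows
        if r["customer_id"] == customer_id and r["is_current"]
    )
    current["is_current"] = False
    rows.append({
        **current,
        "city": new_city,
        "version": current["version"] + 1,
        "is_current": True,
    })
    return rows


def scd_type3(rows, customer_id, new_city):
    """Type 3: keep the old and new values side by side in two columns."""
    rows = deepcopy(rows)
    for row in rows:
        if row["customer_id"] == customer_id:
            row["previous_city"] = row["city"]
            row["city"] = new_city
    return rows
```

Each function returns a modified copy, so the three results can be compared against the same starting table: Type 1 keeps one row with no history, Type 2 keeps full history as extra rows, and Type 3 keeps exactly one prior value.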

Today, we can use the data warehouse's partitioning feature to store periodic (e.g. daily or monthly) snapshots of a dimension. Comparing the same row of a dimension across different time periods then becomes much more intuitive.
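
As a rough sketch of the snapshot approach, the example below stores two daily copies of a dimension in one table and compares them with a self-join. SQLite has no native partitioning, so a plain `snapshot_date` column stands in for the partition key here; the table name, columns, and rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- snapshot_date plays the role of the partition key in this sketch.
    CREATE TABLE customer_dim_snapshot (
        snapshot_date TEXT,
        customer_id   INTEGER,
        city          TEXT,
        PRIMARY KEY (snapshot_date, customer_id)
    );
    INSERT INTO customer_dim_snapshot VALUES
        ('2024-01-01', 1, 'Singapore'),
        ('2024-01-01', 2, 'Jakarta'),
        ('2024-01-02', 1, 'Jakarta'),
        ('2024-01-02', 2, 'Jakarta');
""")


def changed_cities(conn, earlier, later):
    """Return (customer_id, old_city, new_city) for rows that differ
    between two snapshot dates."""
    return conn.execute("""
        SELECT a.customer_id, a.city, b.city
        FROM customer_dim_snapshot AS a
        JOIN customer_dim_snapshot AS b
          ON b.customer_id = a.customer_id
        WHERE a.snapshot_date = ?
          AND b.snapshot_date = ?
          AND a.city <> b.city
        ORDER BY a.customer_id
    """, (earlier, later)).fetchall()
```

Calling `changed_cities(conn, "2024-01-01", "2024-01-02")` lists the customers whose city differs between the two days. The trade-off is that every snapshot repeats unchanged rows, which is acceptable only because storage is cheap.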

## Conclusion

In conclusion, the book argues that Kimball's concepts are still very relevant. We can still write ad hoc queries for reports or for proofs of concept (POCs). But when certain features can be combined, it is worth remodelling the process using Kimball's techniques.

We can still use data modelling tools to annotate data, and still rely on SQL to perform simple ELT within a data warehouse. Ideally, we develop a capability that leans on technology rather than manpower, which is more efficient now that more self-service tools are at hand.

The goal of modelling is self-service: all the required data, business logic, and presentation logic are captured within the data modelling tool, so that insights reach the business unit accurately and on time.