<div align="right"><img src="./img/logos.png" alt="drawing" width="400" /> </div>



# Data modeling at scale: introduction to dimension modeling
_Diego Ardila - Staff Data Scientist at Shopify_



* [Data modeling at scale?](#data_modeling_at_scale)
  * [What is a data warehouse?](#what_is_data_warehouse)
  * [Why a data warehouse?](#why_a_data_warehouse)
  * [What is ETL/ELT?](#what_is_etl)
* [Methodology (dimensional modeling)](#dimensional_modeling)
  * [Example 1: Who are the best movie directors?](#dimensional_modeling_example_1)
    * [Why dimensional modeling?](#why_dimensional_modeling_example_1)
    * [Dimensions](#dimensional_modeling_dimensions)
    * [Fact tables](#dimensional_modeling_facts)
  * [Example 2: When did a show start and end? What was the total runtime?](#dimensional_modeling_example_2)
  * [Example 3: What were the first and last shows of a director? What were the best and worst shows? How many shows has a director had?](#dimensional_modeling_example_3)

## Data modeling at scale? <a class="anchor" id="data_modeling_at_scale"></a>

*Infrastructure + ETL/ELT + Methodology (dimensional modeling)  = Data warehouse/Data lake*

## What is a data warehouse? <a class="anchor" id="what_is_data_warehouse"></a>

1. Database 
2. Data collected from one or multiple sources
3. Designed to support reporting and data analytics

## Why a data warehouse? <a class="anchor" id="why_a_data_warehouse"></a>
1. Keep historic data
2. Central view
3. Data quality and Common model

![alt](https://imgs.xkcd.com/comics/standards_2x.png)

4. Query performance without impacting production systems
5. Augment source system  



## What is ETL/ELT? <a class="anchor" id="what_is_etl"></a>

**Extraction + Transformation + Loading**

- Extract: move data from the source to the data lake. Technologies such as [iceberg](https://iceberg.apache.org/), [kafka](https://kafka.apache.org/).
- Transform: augment/transform the sources into a consistent format. Technologies such as [spark](https://spark.apache.org/), [dbt](https://www.getdbt.com/), [flink](https://flink.apache.org/), [apache beam](https://beam.apache.org/).
- Load: load data into frontroom tables for consumption.

All of this is typically wrapped in a scheduling solution for batch applications, e.g. [airflow](https://airflow.apache.org/), [prefect](https://www.prefect.io/), [luigi](https://github.com/spotify/luigi).

Alternatively, we can talk about ELT, where the raw data is loaded first and transformed inside the warehouse:

**Extraction + Loading + Transformation**

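A minimal sketch of the difference between the two orderings, using Python's built-in `sqlite3` as a stand-in for the warehouse. The table names (`ratings_etl`, `raw_ratings`, `ratings_elt`) and the sample rows are made up for illustration; in a real pipeline the tools listed above would play these roles.

```python
import sqlite3

NULL_MARKER = r"\N"  # the raw IMDB files use this string for missing values

# Made-up raw records (tconst, rating, votes), all strings as in a source extract.
raw_rows = [("tt0000001", "5.7", "1500"), ("tt0000002", NULL_MARKER, "200")]


def run_etl(conn):
    """ETL: clean the rows in Python first, then load only the cleaned result."""
    cleaned = [
        (tconst, None if rating == NULL_MARKER else float(rating), int(votes))
        for tconst, rating, votes in raw_rows
    ]
    conn.execute("create table ratings_etl (tconst text, rating real, votes integer)")
    conn.executemany("insert into ratings_etl values (?, ?, ?)", cleaned)
    return conn.execute("select * from ratings_etl order by tconst").fetchall()


def run_elt(conn):
    """ELT: load the raw strings untouched, then transform with SQL in the database."""
    conn.execute("create table raw_ratings (tconst text, rating text, votes text)")
    conn.executemany("insert into raw_ratings values (?, ?, ?)", raw_rows)
    conn.execute("create table ratings_elt (tconst text, rating real, votes integer)")
    conn.execute(
        """
        insert into ratings_elt
        select tconst, cast(nullif(rating, ?) as real), cast(votes as integer)
        from raw_ratings
        """,
        (NULL_MARKER,),
    )
    return conn.execute("select * from ratings_elt order by tconst").fetchall()


conn = sqlite3.connect(":memory:")
etl_rows = run_etl(conn)
elt_rows = run_elt(conn)
print(etl_rows == elt_rows)
```

Both paths end with the same clean table. What changes is where the transformation runs and whether the raw copy stays available in the database for later reprocessing.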

## Methodology (dimensional modeling) <a class="anchor" id="dimensional_modeling"></a>
- A data warehouse design technique developed by [Kimball and Ross](https://www.kimballgroup.com/data-warehouse-business-intelligence-resources/books/data-warehouse-dw-toolkit/).
- The latest (3rd) edition of the Kimball and Ross book was published in 2013, around the time Redshift launched and before cloud data warehouses became mainstream. The approach still pays off in understandability.
- Based on the "star" model. Two main elements: conformed dimensions and facts:

![alt text](https://cdn.holistics.io/guidebook/star-schema.png "Title")

- Four main design steps, followed by implementation:
  - Select the business process
  - Identify the grain
  - Identify the dimensions
  - Identify the facts
  - Implement the model. We can use a SQL transformation engine. Check out [dbt](https://www.getdbt.com/).
- Let's use an example to introduce the main concepts.


### Our database: IMDB database (RAW!).




[IMDB datasets](https://www.imdb.com/interfaces/)


<div align="center"><img src="./img/imdb_raw.png" width="500" /> </div>

Let's take a look... 

In [41]:
import pandas as pd
import psycopg2  # PostgreSQL driver that SQLAlchemy uses for postgresql:// URLs
import sqlalchemy
from IPython.display import display

# Workshop credentials for the raw IMDB database
username = 'ssc_workshop'
password = 'sql_for_ds'
host = "ssc-2022-workshop.ct6ghoz7smhy.us-east-1.rds.amazonaws.com"
port = '5432'
engine = sqlalchemy.create_engine(f'postgresql://{username}:{password}@{host}:{port}/imdb_raw')


In [42]:
title_basics_sample = pd.read_sql("select * from title_basics limit 10", engine)
crew_sample = pd.read_sql("select * from crew limit 10", engine)
ratings_sample = pd.read_sql("select * from title_ratings limit 10", engine)
episodes_sample = pd.read_sql("select * from episodes limit 10", engine)
print("Ratings sample is...")
display(ratings_sample)
print("Title basics sample...")
display(title_basics_sample)
print("Crew sample is...")
display(crew_sample)
print("Episodes_sample sample is...")
display(episodes_sample)

Ratings sample is...


| | index | tconst | averageRating | numVotes |
|---|---|---|---|---|
| 0 | 208757 | tt0368133 | 6.6 | 8 |
| 1 | 251121 | tt0466988 | 6.5 | 188 |
| 2 | 451315 | tt10001184 | 8.9 | 1456 |
| 3 | 451452 | tt10004456 | 7.6 | 156 |
| 4 | 451795 | tt10009170 | 7.5 | 17256 |
| 5 | 452022 | tt10012914 | 7.6 | 60 |
| 6 | 452025 | tt10012924 | 7.6 | 47 |
| 7 | 452031 | tt10013078 | 7.3 | 9 |
| 8 | 452097 | tt10014140 | 7.7 | 521 |
| 9 | 452101 | tt10014162 | 8.6 | 600 |


Title basics sample...


| | index | tconst | titleType | primaryTitle | originalTitle | isAdult | startYear | endYear | runtimeMinutes | genres |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1358469 | tt10702128 | tvEpisode | Fallout | Fallout | 0 | 2021.0 | | 29.0 | Comedy,Romance |
| 1 | 3321629 | tt14281396 | tvEpisode | Empire | Empire | 0 | 2021.0 | | | Documentary |
| 2 | 2830079 | tt13386180 | tvEpisode | Nate Silver/View Your Deal | Nate Silver/View Your Deal | 0 | 2020.0 | | | Talk-Show |
| 3 | 2496861 | tt12770530 | tvEpisode | Barnwood Backyard | Barnwood Backyard | 0 | 2020.0 | | | Documentary |
| 4 | 5268374 | tt20494264 | tvEpisode | Episode dated 23 May 2022 | Episode dated 23 May 2022 | 0 | 2022.0 | | | News,Talk-Show |
| 5 | 2082528 | tt11997690 | tvEpisode | Winnie the Pooh: Flower Pots | Winnie the Pooh: Flower Pots | 0 | 2020.0 | | | Family,Reality-TV |
| 6 | 5312581 | tt20765906 | tvEpisode | Episode #12.2 | Episode #12.2 | 0 | 2022.0 | | | Reality-TV |
| 7 | 3450749 | tt14515708 | video | Internal Love Vol. 7 | Internal Love Vol. 7 | 1 | 2021.0 | | 144.0 | Adult,Romance |
| 8 | 4820459 | tt18305916 | tvEpisode | Episode dated 21 February 2022 | Episode dated 21 February 2022 | 0 | 2022.0 | | | Talk-Show |
| 9 | 4339666 | tt16249538 | tvEpisode | Episode #1.25 | Episode #1.25 | 0 | 2021.0 | | 62.0 | Comedy,News |


Crew sample is...


| | index | tconst | directors | writers |
|---|---|---|---|---|
| 0 | 219209 | tt0228802 | nm0847300 | \N |
| 1 | 352804 | tt0368133 | nm0698464 | nm0698464 |
| 2 | 448620 | tt0466988 | nm0279202 | \N |
| 3 | 968546 | tt10001184 | nm0920274 | nm0920274,nm0008036,nm1954466 |
| 4 | 970390 | tt10004456 | nm3069099,nm7679077 | \N |
| 5 | 970732 | tt10005078 | \N | nm4323749 |
| 6 | 973006 | tt10009170 | nm2885631 | nm0663048,nm0663050 |
| 7 | 975070 | tt10012914 | nm0490640,nm7968966 | nm2244274,nm2268818,nm0614014,nm2956436 |
| 8 | 975076 | tt10012924 | nm0490640,nm7968966 | nm2268818,nm2956436 |
| 9 | 975166 | tt10013078 | nm0934639 | nm0581709,nm8200343 |


Episodes_sample sample is...


| | index | tconst | parentTconst | seasonNumber | episodeNumber |
|---|---|---|---|---|---|
| 0 | 445617 | tt10014036 | tt10009170 | 1 | 1 |
| 1 | 445620 | tt10014044 | tt10009170 | 1 | 2 |
| 2 | 445622 | tt10014052 | tt10009170 | 1 | 3 |
| 3 | 445625 | tt10014062 | tt10009170 | 1 | 4 |
| 4 | 445642 | tt10014102 | tt10009170 | 1 | 5 |
| 5 | 445657 | tt10014140 | tt10009170 | 1 | 6 |
| 6 | 445660 | tt10014150 | tt10009170 | 1 | 7 |
| 7 | 445663 | tt10014162 | tt10009170 | 1 | 8 |
| 8 | 461863 | tt10048860 | tt10048452 | 1 | 4 |
| 9 | 461879 | tt10048888 | tt10048452 | 1 | 3 |

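The crew sample packs several people into one comma-separated field and uses `\N` for a missing value. As a rough illustration of what it takes to get one row per (title, director), here is a plain-Python sketch over three rows copied from the crew sample. The helper name `explode_directors` is made up, and skipping `\N` is a choice made for this sketch.

```python
NULL_MARKER = r"\N"  # how the raw IMDB files encode a missing value

# (tconst, directors) pairs copied from the crew sample above.
crew_rows = [
    ("tt0228802", "nm0847300"),
    ("tt10004456", "nm3069099,nm7679077"),
    ("tt10005078", NULL_MARKER),
]


def explode_directors(rows):
    """Yield one (tconst, director_id) pair per director, skipping missing values."""
    for tconst, directors in rows:
        if directors == NULL_MARKER:
            continue
        for director_id in directors.split(","):
            yield tconst, director_id


title_directors = list(explode_directors(crew_rows))
print(title_directors)
```

The title with two directors becomes two rows, and the title without a director disappears. This many-to-many shape is what a fact table later has to resolve.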

### Example 1: Who are the best movie directors? <a class="anchor" id="dimensional_modeling_example_1"></a>

#### Why dimensional modeling? <a class="anchor" id="why_dimensional_modeling_example_1"></a>

Let's first write the SQL query...



In [30]:
who_are_best_movie_directors = """
-- One row per (title, director): split the comma-separated directors field
with directors as (
  select  tconst, d as director
  from crew c, unnest(regexp_split_to_array(directors, ',')) d
),
-- Average rating per director across all of their rated titles
-- (there is no filter on "titleType", so every title type counts)
ratings_directors as (
select d.director, avg(r."averageRating") as averageRating
from 
    title_basics t 
    join directors d on t.tconst = d.tconst
    join title_ratings r on t.tconst = r.tconst
group by 1
)
select rd.director, rd.averageRating
from ratings_directors rd
order by averageRating desc
limit 10 
"""
print(pd.read_sql(who_are_best_movie_directors, engine))

     director  averagerating
0  nm11301451           10.0
1  nm11185737           10.0
2  nm11066250           10.0
3  nm10625960           10.0
4   nm0237008           10.0
5  nm11092554           10.0
6  nm10690473           10.0
7   nm0837908           10.0
8  nm10348145           10.0
9  nm11315554           10.0


This works, so what are the problems?

- We carry over the naming and standards of the source database.
- It is fine on an ad-hoc basis, but what if you need this information regularly?
- Aggregations against production systems are costly. You don't want to run them there.

Ideally we would have something simpler:

```
select director_id, avg(average_rating) as average_rating
from directors_facts join directors_dimension using(director_id)
group by 1
order by 2 desc
limit 10
```

#### Dimensions <a class="anchor" id="dimensional_modeling_dimensions"></a>

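- A dimension describes a business entity. In our example, `directors_dimension`.
- There are several types of dimensions, for example:
  - Type 1 dimension: a changed attribute overwrites the old value, so no history is kept.
  - Type 2 dimension: a change adds a new row, so history is preserved.
- Simple case: type 1 dimension: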




In [31]:
create_dimensions = """
-- Director dimension: one row per id found in the comma-separated crew.directors field.
-- Without a distinct, an id repeats once per title it appears on.
drop table if exists directors_dimension_diego_ardila;
create table directors_dimension_diego_ardila as
select d as director_id
from crew c, unnest(regexp_split_to_array(directors, ',')) d
limit 10000;

-- Movie dimension: rename the source columns to our own naming standard.
drop table if exists movie_dimension_diego_ardila;
create table movie_dimension_diego_ardila as
select t.tconst as movie_id, "primaryTitle" as primary_title, "startYear" as release_year, "runtimeMinutes" as runtime_minutes
from title_basics t
limit 10000;

"""
# Engine.execute() is available in SQLAlchemy versions before 2.0
engine.execute(create_dimensions)
print("Movie dimension \n ...")
pd.read_sql("select * from movie_dimension_diego_ardila", engine).head()

Movie dimension 
 ...


| | movie_id | primary_title | release_year | runtime_minutes |
|---|---|---|---|---|
| 0 | tt10702128 | Fallout | 2021.0 | 29.0 |
| 1 | tt14281396 | Empire | 2021.0 | |
| 2 | tt13386180 | Nate Silver/View Your Deal | 2020.0 | |
| 3 | tt12770530 | Barnwood Backyard | 2020.0 | |
| 4 | tt20494264 | Episode dated 23 May 2022 | 2022.0 | |

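The two dimension types listed above differ in how they handle a changed attribute. The sketch below illustrates this with Python's built-in `sqlite3`. The `primary_name` attribute, the validity columns, the surrogate `director_key`, and all values are hypothetical and do not come from the workshop database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Type 1: one row per director; a change overwrites the old value, so history is lost.
conn.execute(
    "create table directors_dimension_type1 (director_id text primary key, primary_name text)"
)
conn.execute("insert into directors_dimension_type1 values ('nm0000001', 'Old Name')")
conn.execute(
    "update directors_dimension_type1 set primary_name = 'New Name' "
    "where director_id = 'nm0000001'"
)

# Type 2: a change closes the current row and adds a new one, so history is kept.
conn.execute(
    """
    create table directors_dimension_type2 (
        director_key integer primary key,  -- surrogate key, one per version
        director_id text,
        primary_name text,
        valid_from text,
        valid_to text,                     -- null while the row is current
        is_current integer
    )
    """
)
insert_version = (
    "insert into directors_dimension_type2 "
    "(director_id, primary_name, valid_from, valid_to, is_current) "
    "values (?, ?, ?, null, 1)"
)
conn.execute(insert_version, ("nm0000001", "Old Name", "2020-01-01"))
conn.execute(
    "update directors_dimension_type2 set valid_to = '2022-06-01', is_current = 0 "
    "where director_id = 'nm0000001' and is_current = 1"
)
conn.execute(insert_version, ("nm0000001", "New Name", "2022-06-01"))

type1_rows = conn.execute(
    "select director_id, primary_name from directors_dimension_type1"
).fetchall()
type2_rows = conn.execute(
    "select primary_name, valid_from, valid_to, is_current "
    "from directors_dimension_type2 order by director_key"
).fetchall()
print(type1_rows)
print(type2_rows)
```

The type 1 table ends up with a single row carrying the new name. The type 2 table holds both versions, and a fact row can point at whichever `director_key` was current when the event happened.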

#### Fact tables <a class="anchor" id="dimensional_modeling_facts"></a>

- A fact table contains measurements.
- Its grain is defined by the related dimensions.
- Facts are usually additive, but not always.
- It resolves many-to-many relationships.
- There are several types of fact tables:
  - *Transaction Fact Table* records an event at the point in time when it occurs.
  - *Snapshot Fact Table* describes the state of things at a particular time.
  - *Accumulating Fact Table* shows the activity of a process that has a beginning and an end.
- Simple case: Transaction Fact Table of movies directed by a director.



In [32]:
create_facts = """
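-- Grain: one entry for every (director, title) pair that has a rating.
-- There is no filter on "titleType", so titles other than movies are included too.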
drop table if exists directors_facts_diego_ardila;
create table directors_facts_diego_ardila as
with directors as (
  select  tconst, d as director_id
  from crew c, unnest(regexp_split_to_array(directors, ',')) d
)
select d.director_id, t.tconst as movie_id, "averageRating" as average_rating, "startYear" as movie_released_on_year
from 
    title_basics t 
    join title_ratings r on t.tconst = r.tconst
    join directors d on t.tconst = d.tconst
limit 100;

"""
engine.execute(create_facts)
print("Directors Facts")
pd.read_sql("select * from directors_facts_diego_ardila", engine).head()

Directors Facts


| | director_id | movie_id | average_rating | movie_released_on_year |
|---|---|---|---|---|
| 0 | nm0005690 | tt0000001 | 5.7 | 1894.0 |
| 1 | nm0721526 | tt0000002 | 5.9 | 1892.0 |
| 2 | nm0721526 | tt0000003 | 6.5 | 1892.0 |
| 3 | nm0721526 | tt0000004 | 5.8 | 1892.0 |
| 4 | nm0005690 | tt0000005 | 6.2 | 1893.0 |

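The remark that facts are usually additive, but not always, can be shown with two rows from the ratings sample earlier in the notebook. This is a sketch only: it treats `averageRating` as a simple mean of the votes, which is an assumption about how the source computes it.

```python
# (average_rating, num_votes) pairs for tt10009170 and tt10013078 from the ratings sample.
facts = [(7.5, 17256), (7.3, 9)]

# num_votes is additive: summing it across rows gives a meaningful total.
total_votes = sum(votes for _, votes in facts)

# average_rating is not additive: the mean of the per-title averages ignores
# how many votes stand behind each one.
mean_of_averages = sum(rating for rating, _ in facts) / len(facts)

# An overall average has to be rebuilt from additive pieces (rating * votes, votes).
vote_weighted_average = sum(rating * votes for rating, votes in facts) / total_votes

print(total_votes, mean_of_averages, vote_weighted_average)
```

The two averages differ because one title has far more votes than the other. This is why fact tables tend to store additive components and leave ratios and averages to query time.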

How does the query look now?


In [33]:
who_are_best_movie_directors_using_dm = """
select director_id, avg(average_rating) as average_rating
from directors_facts_diego_ardila join directors_dimension_diego_ardila using(director_id)
group by 1
order by 2 desc
limit 10 
"""
pd.read_sql(who_are_best_movie_directors_using_dm, engine).head()

| | director_id | average_rating |
|---|---|---|
| 0 | nm0279202 | 7.175 |
| 1 | \N | 5.844715 |


##### What if we have a new use case: first year a director directed a movie?



In [34]:
first_year_director_directed_movie_sql = """
select director_id, min(movie_released_on_year) as first_year_released_movie
from directors_facts_diego_ardila join directors_dimension_diego_ardila using(director_id)
group by 1
order by 2 desc
limit 10 
"""
pd.read_sql(first_year_director_directed_movie_sql, engine).head()

| | director_id | first_year_released_movie |
|---|---|---|
| 0 | nm0279202 | 1927.0 |
| 1 | \N | 1896.0 |


### Example 2: When did a show start and end? What was the total runtime? <a class="anchor" id="dimensional_modeling_example_2"></a>

1. What grain? One row per TV show.
2. What kind of query are we looking for? 

```
select show_id, primary_title, first_episode_started_on, last_episode_ended_on, number_of_episodes, number_of_seasons, total_run_time
from show_accumulating_facts
```

3. Let's first write the SQL query



In [35]:
when_did_a_show_started_and_ended_sql = """
with series as (
  select tconst, "primaryTitle", "startYear", "titleType"
  from title_basics t
  where "titleType" = 'tvSeries'
),
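episodes_series as (
  -- Caution: this filter keeps series rows, not episode rows, so the join on episode ids
  -- below finds no match and the year and runtime columns come back empty.
  -- Filtering on 'tvEpisode' would select the episode rows instead.
  select tconst, "primaryTitle", "startYear", "runtimeMinutes"
  from title_basics t
  where "titleType" = 'tvSeries'
),
series_facts as (
select s.tconst as series_id
   
    , min(es."startYear") as first_episode_started_on
    , max(es."startYear") as last_episode_ended_on
    , count(distinct "seasonNumber") as number_of_seasons
    -- counts distinct episode numbers; these repeat across seasons
    , count(distinct "episodeNumber") as number_of_episodes
    , sum(cast(es."runtimeMinutes" as float)) as total_run_time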
from 
    series s 
    left join episodes e on s.tconst = e."parentTconst"
    left join episodes_series es on e.tconst = es.tconst
 group by 1
)
select series_id, first_episode_started_on, last_episode_ended_on, number_of_episodes, number_of_seasons, total_run_time
from series_facts
limit 5
"""
pd.read_sql(when_did_a_show_started_and_ended_sql, engine)

| | series_id | first_episode_started_on | last_episode_ended_on | number_of_episodes | number_of_seasons | total_run_time |
|---|---|---|---|---|---|---|
| 0 | tt10009170 | | | 8 | 1 | |
| 1 | tt10048452 | | | 8 | 1 | |
| 2 | tt10062292 | | | 10 | 4 | |
| 3 | tt10065678 | | | 8 | 2 | |
| 4 | tt10087640 | | | 8 | 1 | |


4. Enter dimensional modeling...


#### Which dimensions?

In [36]:
create_dimensions = """
drop table if exists series_dimension_diego_ardila;
create table series_dimension_diego_ardila as
select tconst as series_id, "primaryTitle" as primary_title, "startYear" as series_started_on
from title_basics t
where "titleType" = 'tvSeries'
"""
engine.execute(create_dimensions)
pd.read_sql("select * from series_dimension_diego_ardila limit 100", engine).head()

| | series_id | primary_title | series_started_on |
|---|---|---|---|
| 0 | tt13972342 | My Wife and Me | 2020.0 |
| 1 | tt11998020 | Interested Anthony | 2020.0 |
| 2 | tt19726764 | Rules of Engagement | 2022.0 |
| 3 | tt13498220 | Ban Sao Sod | 2020.0 |
| 4 | tt12740750 | Everything Is Planned | 2020.0 |


#### Which facts?

In [37]:
create_facts =  """
drop table if exists series_accumulating_facts_diego_ardila;
create table series_accumulating_facts_diego_ardila as
with series as (
  select tconst, "primaryTitle", "startYear", "titleType"
  from title_basics t
  where "titleType" = 'tvSeries'
),
episodes_series as (
  -- Caution: 'tvSeries' keeps series rows; selecting episode rows would need 'tvEpisode'
  select tconst, "primaryTitle", "startYear", "runtimeMinutes"
  from title_basics t
  where "titleType" = 'tvSeries'
),
series_facts as (
select s.tconst as series_id 
    , min(es."startYear") as first_episode_started_on
    , max(es."startYear") as last_episode_ended_on
    , count(distinct "seasonNumber") as number_of_seasons
    , count(distinct "episodeNumber") as number_of_episodes
    , sum(cast(es."runtimeMinutes" as float)) as total_run_time
from 
    series s 
    left join episodes e on s.tconst = e."parentTconst"
    left join episodes_series es on e.tconst = es.tconst
 group by 1
)
select series_id, first_episode_started_on, last_episode_ended_on, number_of_episodes, number_of_seasons, total_run_time
from series_facts limit 10
"""
engine.execute(create_facts)




*How does the query look now?*

In [38]:

pd.read_sql("select * from series_accumulating_facts_diego_ardila limit 10", engine).head()

| | series_id | first_episode_started_on | last_episode_ended_on | number_of_episodes | number_of_seasons | total_run_time |
|---|---|---|---|---|---|---|
| 0 | tt10009170 | | | 8 | 1 | |
| 1 | tt10048452 | | | 8 | 1 | |
| 2 | tt10062292 | | | 10 | 4 | |
| 3 | tt10065678 | | | 8 | 2 | |
| 4 | tt10087640 | | | 8 | 1 | |

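In the output above, the year and runtime columns are empty. A likely reason is that the `episodes_series` CTE keeps only `'tvSeries'` rows, so joining it on episode ids finds nothing. The sketch below runs the same aggregation on a tiny made-up dataset, with SQLite standing in for Postgres. It filters the episode side on `'tvEpisode'` and counts episodes by id rather than by episode number. All ids and values are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    create table title_basics (tconst text, titleType text, startYear integer, runtimeMinutes integer);
    create table episodes (tconst text, parentTconst text, seasonNumber integer, episodeNumber integer);

    -- One made-up series with three episodes over two seasons.
    insert into title_basics values
        ('s1', 'tvSeries', 2019, null),
        ('e1', 'tvEpisode', 2019, 30),
        ('e2', 'tvEpisode', 2019, 30),
        ('e3', 'tvEpisode', 2020, 45);
    insert into episodes values
        ('e1', 's1', 1, 1),
        ('e2', 's1', 1, 2),
        ('e3', 's1', 2, 1);
    """
)

series_facts = conn.execute(
    """
    select s.tconst as series_id
        , min(es.startYear) as first_episode_started_on
        , max(es.startYear) as last_episode_ended_on
        , count(distinct e.seasonNumber) as number_of_seasons
        , count(distinct e.tconst) as number_of_episodes
        , sum(es.runtimeMinutes) as total_run_time
    from title_basics s
        left join episodes e on s.tconst = e.parentTconst
        left join title_basics es on e.tconst = es.tconst and es.titleType = 'tvEpisode'
    where s.titleType = 'tvSeries'
    group by 1
    """
).fetchall()
print(series_facts)
```

Episode number 1 appears in both seasons, so counting distinct episode numbers would report two episodes here, while counting episode ids reports three.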

### Example 3: What were the first and last shows of a director? What were the best and worst shows? How many shows has a director had?  <a class="anchor" id="dimensional_modeling_example_3"></a>

1. What grain? One row per director.
2. What kind of query are we looking for?

```
select director_id
    , first_series_started_on
    , last_series_started_on
    , fsd.series_id as best_series_id
    , lsd.series_id as last_series_id
    , number_of_series
from directors_accumulating_facts d
    join series_dimension fsd on  d.best_series_id = fsd.series_id
    join series_dimension lsd on  d.last_series_id = lsd.series_id
```
3. Let's first write the SQL query
4. Which dimensions? Which facts?

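One possible starting point for this exercise is sketched below, with SQLite standing in for Postgres. It assumes a hypothetical input table `director_series` at the (director, series) grain. That table, its columns, and every value in it are made up, and the sketch assumes no ties in rating within a director.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical input at the (director, series) grain; ids and values are made up.
    create table director_series (
        director_id text, series_id text, series_started_on integer, average_rating real
    );
    insert into director_series values
        ('d1', 's1', 2015, 7.0),
        ('d1', 's2', 2018, 9.0),
        ('d1', 's3', 2021, 6.0),
        ('d2', 's4', 2020, 8.0);
    """
)

directors_accumulating_facts = conn.execute(
    """
    with per_director as (
        select director_id
            , min(series_started_on) as first_series_started_on
            , max(series_started_on) as last_series_started_on
            , max(average_rating) as best_rating
            , min(average_rating) as worst_rating
            , count(*) as number_of_series
        from director_series
        group by 1
    )
    select p.director_id
        , p.first_series_started_on
        , p.last_series_started_on
        , best.series_id as best_series_id
        , worst.series_id as worst_series_id
        , p.number_of_series
    from per_director p
        join director_series best
            on best.director_id = p.director_id and best.average_rating = p.best_rating
        join director_series worst
            on worst.director_id = p.director_id and worst.average_rating = p.worst_rating
    order by p.director_id
    """
).fetchall()
print(directors_accumulating_facts)
```

The result has one row per director, and the best and worst series ids can then be joined to a series dimension to pick up titles.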


# Other topics that we did not cover


- Slowly changing dimensions
- Junk dimensions
- Outrigger dimensions
- Role-playing dimensions
- Bus Architecture
- ... :)  

**Thanks for joining and ...**
<div align="center"><img src="./img/thats_all_folks.png" width="500" /> </div>