### Connect to Pagila

In [1]:
%load_ext sql

In [2]:
DB_ENDPOINT = "127.0.0.1"
DB = 'pagila'
DB_USER = 'postgres'
DB_PORT = '5432'

# postgresql://username:password@host:port/database
conn_string = "postgresql://{}@{}:{}/{}".format(DB_USER, DB_ENDPOINT, DB_PORT, DB)

print(conn_string)

postgresql://postgres@127.0.0.1:5432/pagila
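
The comment in the cell above mentions a `username:password@host` form, but the cell itself builds a URL without a password. The following is a minimal sketch of how the password variant could be built. The helper name `build_conn_string` and the sample password are hypothetical and not part of the original notebook. The password is percent-encoded so that characters such as `@` or `/` do not break the URL.

```python
from urllib.parse import quote

def build_conn_string(user, host, port, db, password=None):
    """Return a postgresql:// URL; the password part is optional.

    Hypothetical helper for illustration only.
    """
    auth = quote(user, safe="")
    if password:
        # Percent-encode reserved characters such as '@' and '/'
        auth += ":" + quote(password, safe="")
    return "postgresql://{}@{}:{}/{}".format(auth, host, port, db)

# Same URL as the cell above (no password)
print(build_conn_string("postgres", "127.0.0.1", "5432", "pagila"))
# Made-up password containing reserved characters
print(build_conn_string("postgres", "127.0.0.1", "5432", "pagila", password="p@ss/word"))
```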


In [3]:
%sql $conn_string

### Star Schema

<img src="pagila-star.png" width="50%"/>

### Start with a simple cube

In [4]:
%%sql
SELECT dimDate.day, dimMovie.rating, dimCustomer.city, sum(sales_amount) as revenue
FROM factSales 
JOIN dimMovie     on (dimMovie.movie_key         = factSales.movie_key)
JOIN dimDate      on (dimDate.date_key         = factSales.date_key)
JOIN dimCustomer  on (dimCustomer.customer_key = factSales.customer_key)
group by (dimDate.day, dimMovie.rating, dimCustomer.city)
order by revenue desc
limit  20;

 * postgresql://postgres@127.0.0.1:5432/pagila
20 rows affected.


day,rating,city,revenue
15,PG-13,Jhansi,20.97
14,G,Qomsheh,19.97
4,NC-17,Lapu-Lapu,19.97
16,PG-13,Kamyin,18.97
27,NC-17,Sincelejo,18.97
2,NC-17,Shubra al-Khayma,18.97
4,R,Siegen,18.96
1,PG,Izumisano,18.96
17,R,Athenai,17.98
24,PG,Memphis,17.98


### Slicing

* Slicing reduces the dimensionality of a cube by 1 (e.g., from 3 dimensions to 2) by fixing one of the dimensions to a single value.
* The cube above has 3 dimensions: `day`, `rating`, and `city`.
* In the example below, `rating` is fixed to "PG-13", which leaves a 2-dimensional slice over `day` and `city`.


In [5]:
%%sql
SELECT dimDate.day, dimMovie.rating, dimCustomer.city, sum(sales_amount) as revenue
FROM factSales
JOIN dimMovie     on (dimMovie.movie_key         = factSales.movie_key)
JOIN dimDate     on (dimDate.date_key         = factSales.date_key)
JOIN dimCustomer on (dimCustomer.customer_key = factSales.customer_key)
WHERE dimMovie.rating = 'PG-13'
GROUP by (dimDate.day, dimCustomer.city, dimMovie.rating)
ORDER by revenue desc
LIMIT  20;

 * postgresql://postgres@127.0.0.1:5432/pagila
20 rows affected.


day,rating,city,revenue
15,PG-13,Jhansi,20.97
16,PG-13,Kamyin,18.97
5,PG-13,Karnal,17.97
8,PG-13,Probolinggo,16.98
1,PG-13,s-Hertogenbosch,16.98
16,PG-13,Cuauhtmoc,16.98
10,PG-13,Greensboro,15.98
22,PG-13,Jedda,15.98
27,PG-13,Binzhou,14.98
21,PG-13,Otsu,14.98


### Dicing
 - Dicing creates a subcube with the same dimensionality but fewer values for 2 or more dimensions.

In [6]:
%%sql
SELECT dimDate.day,dimMovie.rating, dimCustomer.city, sum(sales_amount) as revenue
FROM factSales
JOIN dimMovie     on (dimMovie.movie_key         = factSales.movie_key)
JOIN dimDate     on (dimDate.date_key         = factSales.date_key)
JOIN dimCustomer on (dimCustomer.customer_key = factSales.customer_key)
WHERE dimMovie.rating in ('PG-13', 'PG')
AND dimCustomer.city in ('Bellevue', 'Lancaster')
AND dimDate.day in ('1', '15', '30')
GROUP by (dimDate.day, dimCustomer.city, dimMovie.rating)
ORDER by revenue desc
LIMIT  20;

 * postgresql://postgres@127.0.0.1:5432/pagila
3 rows affected.


day,rating,city,revenue
15,PG-13,Lancaster,8.99
1,PG,Lancaster,6.99
30,PG,Bellevue,0.99
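
Slicing and dicing use the same grouped query and differ only in the `WHERE` clause. The sketch below illustrates this on a tiny in-memory SQLite table so it runs without the Pagila database. The table `sales`, its rows, and the helper `run` are made up for illustration and are not the notebook's star schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (day INTEGER, rating TEXT, city TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        (1, "PG", "Lancaster", 2.0),
        (1, "PG-13", "Lancaster", 3.0),
        (15, "PG-13", "Bellevue", 4.0),
        (15, "R", "Bellevue", 5.0),
        (30, "PG", "Memphis", 6.0),
    ],
)

CUBE_QUERY = """
SELECT day, rating, city, SUM(amount) AS revenue
FROM sales
{where}
GROUP BY day, rating, city
ORDER BY day, rating, city
"""

def run(where=""):
    """Run the 3-dimensional cube query with an optional WHERE clause."""
    return conn.execute(CUBE_QUERY.format(where=where)).fetchall()

# Slice: one dimension pinned to a single value
slice_rows = run("WHERE rating = 'PG-13'")

# Dice: several dimensions restricted to sets of values
dice_rows = run(
    "WHERE rating IN ('PG', 'PG-13') AND city IN ('Lancaster', 'Bellevue')"
)

print(slice_rows)
print(dice_rows)
```

In the slice every row carries the same `rating`, so that column no longer varies. In the dice all three dimensions still vary, just over fewer values.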


### Roll-up
- Stepping up the level of aggregation to a larger grouping
- e.g., `city` is summed up to `country`

In [7]:
%%sql
SELECT dimDate.day, dimMovie.rating, dimCustomer.country, sum(sales_amount) as revenue
FROM factSales
JOIN dimMovie     on (dimMovie.movie_key         = factSales.movie_key)
JOIN dimDate     on (dimDate.date_key         = factSales.date_key)
JOIN dimCustomer on (dimCustomer.customer_key = factSales.customer_key)
GROUP by (dimDate.day,  dimMovie.rating, dimCustomer.country)
ORDER by revenue desc
LIMIT  20;

 * postgresql://postgres@127.0.0.1:5432/pagila
20 rows affected.


day,rating,country,revenue
18,PG-13,India,85.85
9,PG-13,India,80.85
27,PG-13,China,78.81
27,R,India,75.83
14,NC-17,India,75.82
17,PG-13,China,74.84
16,NC-17,China,73.87
26,PG,India,73.83
25,PG-13,India,72.87
25,PG-13,China,72.87


### Drill-down
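- Breaking up one of the dimensions into a lower (finer) level.
- e.g., `country` is broken down into `district` (in Pagila, a district is a state or province, so it sits between `country` and `city`)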

In [8]:
%%sql
SELECT dimDate.day, dimMovie.rating, dimCustomer.district, sum(sales_amount) as revenue
FROM factSales
JOIN dimMovie     on (dimMovie.movie_key         = factSales.movie_key)
JOIN dimDate     on (dimDate.date_key         = factSales.date_key)
JOIN dimCustomer on (dimCustomer.customer_key = factSales.customer_key)
GROUP by (dimDate.day, dimCustomer.district, dimMovie.rating)
ORDER by revenue desc
LIMIT  20;

 * postgresql://postgres@127.0.0.1:5432/pagila
20 rows affected.


day,rating,district,revenue
18,PG-13,West Bengali,36.95
27,NC-17,Buenos Aires,36.91
3,R,West Bengali,34.94
12,NC-17,Buenos Aires,33.92
19,G,California,32.96
26,G,California,32.94
9,PG-13,Buenos Aires,30.96
24,G,England,30.95
17,NC-17,Illinois,29.96
19,G,Shandong,29.95
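
Roll-up and drill-down move along the same dimension hierarchy in opposite directions: only the column in `GROUP BY` changes, while the overall total stays the same. The sketch below shows this on a made-up in-memory SQLite table. The table `sales`, the placeholder city names, and the helper `revenue_by` are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (country TEXT, city TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("India", "City A", 10.0),
        ("India", "City B", 5.0),
        ("India", "City C", 2.0),
        ("China", "City D", 4.0),
    ],
)

# Hierarchy levels, ordered from coarse to fine
LEVELS = ("country", "city")

def revenue_by(level):
    """Aggregate revenue at one level of the hierarchy."""
    if level not in LEVELS:
        raise ValueError("unknown level: " + level)
    sql = "SELECT {0}, SUM(amount) FROM sales GROUP BY {0} ORDER BY {0}".format(level)
    return conn.execute(sql).fetchall()

by_city = revenue_by("city")        # fine-grained view
by_country = revenue_by("country")  # roll-up: cities summed into countries

print(by_country)
print(by_city)
```

Going from `by_country` to `by_city` is the drill-down direction. Going the other way is the roll-up.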


### Grouping Sets
- With 3 dimensions, you often want to aggregate a fact:
    - by nothing (total)
    - then by the 1st dimension
    - then by the 2nd 
    - then by the 3rd 
    - then by the 1st and 2nd
    - then by the 2nd and 3rd
    - then by the 1st and 3rd
    - then by the 1st and 2nd and 3rd
    
- Since this is very common, and in every case we scan the whole fact table anyway, there is a more clever way to do it: the SQL `GROUPING SETS` clause, which computes several groupings in a single query.
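
The eight groupings listed above are exactly the subsets of the three dimensions. As a small sketch, they can be enumerated with `itertools.combinations` and rendered as a `GROUPING SETS` clause. The helper names `all_grouping_sets` and `grouping_sets_clause` are hypothetical and not part of the original notebook.

```python
from itertools import combinations

def all_grouping_sets(dimensions):
    """Return every subset of the dimensions, from () up to all of them."""
    sets = []
    for size in range(len(dimensions) + 1):
        sets.extend(combinations(dimensions, size))
    return sets

def grouping_sets_clause(dimensions):
    """Render the subsets as a GROUP BY GROUPING SETS clause."""
    parts = ["(" + ", ".join(gs) + ")" for gs in all_grouping_sets(dimensions)]
    return "GROUP BY GROUPING SETS (" + ", ".join(parts) + ")"

# Three dimensions give 2**3 = 8 groupings; () is the grand total
for gs in all_grouping_sets(["day", "rating", "city"]):
    print(gs)

# Two dimensions, as in the month/country query
print(grouping_sets_clause(["month", "country"]))
```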

In [33]:
%%sql
SELECT dimDate.month, dimStore.country, sum(sales_amount) as revenue
FROM factSales
JOIN dimDate  on (dimDate.date_key  = factSales.date_key)
JOIN dimStore on (dimStore.store_key = factSales.store_key)
GROUP by grouping sets ((), dimDate.month,  dimStore.country, (dimDate.month,  dimStore.country));

 * postgresql://postgres@127.0.0.1:5432/pagila
24 rows affected.


month,country,revenue
,,67416.51
7.0,South Africa,4956.26
5.0,New Zealand,5643.63
2.0,South Africa,5110.04
2.0,New Zealand,5030.0
3.0,New Zealand,5937.78
1.0,New Zealand,1465.58
6.0,New Zealand,5360.52
4.0,South Africa,5261.55
6.0,South Africa,5535.97


### CUBE 
- `CUBE(a, b)` is shorthand for `GROUPING SETS ((), a, b, (a, b))`, so this query is equivalent to the grouping sets query above.

In [38]:
%%sql
SELECT dimDate.month, dimStore.country, sum(sales_amount) as revenue
FROM factSales
JOIN dimDate  on (dimDate.date_key         = factSales.date_key)
JOIN dimStore on (dimStore.store_key = factSales.store_key)
GROUP by cube(dimDate.month,  dimStore.country);

 * postgresql://postgres@127.0.0.1:5432/pagila
24 rows affected.


month,country,revenue
,,67416.51
7.0,South Africa,4956.26
5.0,New Zealand,5643.63
2.0,South Africa,5110.04
2.0,New Zealand,5030.0
3.0,New Zealand,5937.78
1.0,New Zealand,1465.58
6.0,New Zealand,5360.52
4.0,South Africa,5261.55
6.0,South Africa,5535.97


### Naive Way
- One `SELECT` per grouping, combined with `UNION ALL`; takes roughly 50-60% more time than grouping sets or cube.

In [28]:
%%time
%%sql
SELECT  NULL as month, NULL as country, sum(sales_amount) as revenue
FROM factSales
    UNION all 
SELECT NULL, dimStore.country,sum(sales_amount) as revenue
FROM factSales
JOIN dimStore on (dimStore.store_key = factSales.store_key)
GROUP by  dimStore.country
    UNION all 
SELECT cast(dimDate.month as text) , NULL, sum(sales_amount) as revenue
FROM factSales
JOIN dimDate on (dimDate.date_key = factSales.date_key)
GROUP by dimDate.month
    UNION all
SELECT cast(dimDate.month as text),dimStore.country,sum(sales_amount) as revenue
FROM factSales
JOIN dimDate     on (dimDate.date_key         = factSales.date_key)
JOIN dimStore on (dimStore.store_key = factSales.store_key)
GROUP by (dimDate.month, dimStore.country)

 * postgresql://postgres@127.0.0.1:5432/pagila
24 rows affected.
CPU times: user 3.87 ms, sys: 1.38 ms, total: 5.25 ms
Wall time: 33.7 ms


month,country,revenue
,,67416.51
,South Africa,33689.74
,New Zealand,33726.77
7.0,,9760.54
1.0,,3074.84
5.0,,11373.24
4.0,,10746.53
2.0,,10140.04
6.0,,10896.49
3.0,,11424.83
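
The hand-written `UNION ALL` above follows a mechanical pattern: one `SELECT` per subset of the dimensions, with `NULL` in place of every dimension that is not grouped. The sketch below generates that pattern and runs it against a made-up in-memory SQLite table, chosen only so the example runs without a server (SQLite has no `GROUPING SETS` or `CUBE`). The helper `naive_cube_sql`, the table `sales`, and its rows are assumptions for illustration. Unlike the notebook's query, it reads a single flat table and needs no joins.

```python
import sqlite3
from itertools import combinations

def naive_cube_sql(table, dimensions, measure):
    """Build one SELECT per subset of dimensions, glued with UNION ALL."""
    selects = []
    for size in range(len(dimensions) + 1):
        for subset in combinations(dimensions, size):
            # Dimensions outside the subset are reported as NULL
            cols = [d if d in subset else "NULL AS " + d for d in dimensions]
            sql = "SELECT {}, SUM({}) AS revenue FROM {}".format(
                ", ".join(cols), measure, table
            )
            if subset:
                sql += " GROUP BY " + ", ".join(subset)
            selects.append(sql)
    return "\nUNION ALL\n".join(selects)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (month INTEGER, country TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        (1, "New Zealand", 10.0),
        (1, "South Africa", 20.0),
        (2, "New Zealand", 30.0),
        (2, "South Africa", 40.0),
    ],
)

query = naive_cube_sql("sales", ["month", "country"], "amount")
print(query)

rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Each of the generated `SELECT` statements scans the table again, which is the extra work that `GROUPING SETS` and `CUBE` are designed to avoid.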
