# Exercise 1: Sakila Star Schema & ETL

All the database tables in this demo are based on public sample databases and transformations of them:
- `Sakila` is a sample database created by `MySQL` [Link](https://video.udacity-data.com/topher/2021/August/61120e06_pagila-3nf/pagila-3nf.png)
- The PostgreSQL version of it is called `Pagila` [Link](https://github.com/devrimgunduz/pagila)
- The fact and dimension table design is based on O'Reilly's public dimensional modelling tutorial schema [Link](https://video.udacity-data.com/topher/2021/August/61120d38_pagila-star/pagila-star.png)

## 1. Connect to the newly created Pagila database
The database was created locally using Docker and PostgreSQL, following these [instructions](https://github.com/devrimgunduz/pagila).

In [1]:
# Load ipython-sql
%load_ext sql

# Setup database connection
DB_ENDPOINT = 'localhost'
DB_NAME = 'pagila'
DB_USER = 'postgres'
DB_PASSWORD = 'postgres'
DB_PORT = '5432'

conn_string = f"postgresql://{DB_USER}:{DB_PASSWORD}@{DB_ENDPOINT}:{DB_PORT}/{DB_NAME}"

%sql $conn_string

In [4]:
DB_ENDPOINT = 'localhost'
DB_NAME = 'pagila'
DB_USER = 'postgres'
DB_PASSWORD = 'postgres'
DB_PORT = '5432'

# postgresql://username:password@host:port/database
conn_string = f"postgresql://{DB_USER}:{DB_PASSWORD}@{DB_ENDPOINT}:{DB_PORT}/{DB_NAME}"

print(conn_string)

postgresql://postgres:postgres@localhost:5432/pagila
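The f-string above works because the demo credentials contain only plain characters. If a password contained URL delimiters such as `@` or `/`, it would need to be percent-encoded first. The following is a minimal sketch of that idea, not part of the original exercise: the helper name `build_conn_string` and the sample password are hypothetical, while `urllib.parse.quote` comes from the standard library.

```python
from urllib.parse import quote


def build_conn_string(user, password, host, port, database):
    """Build a postgresql:// URL, percent-encoding the user and password."""
    # safe="" makes quote() escape '/', and it also escapes '@' and ':',
    # which would otherwise be read as URL delimiters.
    return (
        f"postgresql://{quote(user, safe='')}:{quote(password, safe='')}"
        f"@{host}:{port}/{database}"
    )


# Same values as the demo; nothing needs escaping here.
demo_url = build_conn_string("postgres", "postgres", "localhost", "5432", "pagila")
print(demo_url)

# Hypothetical password containing URL delimiters.
escaped_url = build_conn_string("postgres", "p@ss/word", "localhost", "5432", "pagila")
print(escaped_url)
```

With the demo credentials the helper produces the same string as the f-string in the cell above.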


## 2. Explore the 3NF Schema

<img src="../../../images/cloud_data_warehouse_pagila.png" width="50%"/>

## 2.1 How much? What data sizes are we looking at?

In [6]:
nStores = %sql SELECT COUNT(*) from store;
nFilms = %sql SELECT COUNT(*) from film;
nCustomers = %sql SELECT COUNT(*) from customer;
nRentals = %sql SELECT COUNT(*) from rental;
nPayment = %sql SELECT COUNT(*) from payment;
nStaff = %sql SELECT COUNT(*) from staff;
nCity = %sql SELECT COUNT(*) from city;
nCountry = %sql SELECT COUNT(*) from country;

print("nFilms\t\t=", nFilms[0][0])
print("nCustomers\t=", nCustomers[0][0])
print("nRentals\t=", nRentals[0][0])
print("nPayment\t=", nPayment[0][0])
print("nStaff\t\t=", nStaff[0][0])
print("nStores\t\t=", nStores[0][0])
print("nCities\t\t=", nCity[0][0])
print("nCountry\t\t=", nCountry[0][0])

 * postgresql://postgres:***@localhost:5432/pagila
1 rows affected.
 * postgresql://postgres:***@localhost:5432/pagila
1 rows affected.
 * postgresql://postgres:***@localhost:5432/pagila
1 rows affected.
 * postgresql://postgres:***@localhost:5432/pagila
1 rows affected.
 * postgresql://postgres:***@localhost:5432/pagila
1 rows affected.
 * postgresql://postgres:***@localhost:5432/pagila
1 rows affected.
 * postgresql://postgres:***@localhost:5432/pagila
1 rows affected.
 * postgresql://postgres:***@localhost:5432/pagila
1 rows affected.
nFilms		= 1000
nCustomers	= 599
nRentals	= 16044
nPayment	= 16049
nStaff		= 2
nStores		= 2
nCities		= 600
nCountry		= 109
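The eight nearly identical `SELECT COUNT(*)` statements could also be driven from a list of table names. The sketch below illustrates the idea with the standard-library `sqlite3` module standing in for the Pagila connection, so it runs without a database server. The helper `count_rows` and the toy tables are hypothetical; against PostgreSQL the same loop would need a DB-API connection from a driver of your choice, which is an assumption and not something this notebook sets up.

```python
import sqlite3

# Table names cannot be passed as query parameters, so restrict them to a
# fixed allow-list before interpolating them into the SQL text.
ALLOWED_TABLES = ("store", "film", "customer")


def count_rows(conn, tables):
    """Return {table_name: row_count} for each table in `tables`."""
    counts = {}
    cur = conn.cursor()
    for table in tables:
        if table not in ALLOWED_TABLES:
            raise ValueError(f"unexpected table name: {table!r}")
        cur.execute(f"SELECT COUNT(*) FROM {table}")
        counts[table] = cur.fetchone()[0]
    return counts


# Toy in-memory database (made-up rows, not the real Pagila contents).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE store (store_id INTEGER)")
conn.execute("CREATE TABLE film (film_id INTEGER)")
conn.execute("CREATE TABLE customer (customer_id INTEGER)")
conn.executemany("INSERT INTO store VALUES (?)", [(1,), (2,)])
conn.executemany("INSERT INTO film VALUES (?)", [(1,), (2,), (3,)])

row_counts = count_rows(conn, ALLOWED_TABLES)
for table, n in row_counts.items():
    print(f"{table:<10} = {n}")
```

Formatting the label with a fixed width (`{table:<10}`) keeps the `=` signs aligned without hand-tuned tab characters.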


## 2.2 When? What time period are we talking about?

In [9]:
%%sql
SELECT MIN(payment_date) AS start, MAX(payment_date) AS end FROM payment;

 * postgresql://postgres:***@localhost:5432/pagila
1 rows affected.


| start | end |
|---|---|
| 2020-01-24 21:21:56.996577+00:00 | 2020-05-14 12:44:29.996577+00:00 |
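To put a number on that period, the two timestamps returned by the query can be subtracted. This is a minimal sketch using only the standard library, with the values copied from the query result; the variable names are illustrative.

```python
from datetime import datetime

# Values copied from the query output.
start = datetime.fromisoformat("2020-01-24 21:21:56.996577+00:00")
end = datetime.fromisoformat("2020-05-14 12:44:29.996577+00:00")

span = end - start  # a datetime.timedelta
print(f"Payments span {span.days} full days")
```

The payments therefore cover 110 full days, a little under four months.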


## 2.3 Where? Where do events in this database occur?
TODO: Write a query that displays the number of addresses by district in the address table. Limit the table to the top 10 districts. Your results should match the table below.

In [8]:
%%sql
SELECT
    district,
    COUNT(*) AS n
FROM address
GROUP BY 1
ORDER BY n DESC
LIMIT 10

 * postgresql://postgres:***@localhost:5432/pagila
10 rows affected.


| district | n |
|---|---:|
| Buenos Aires | 10 |
| Shandong | 9 |
| California | 9 |
| West Bengali | 9 |
| Uttar Pradesh | 8 |
| So Paulo | 8 |
| England | 7 |
| Maharashtra | 7 |
| Southern Tagalog | 6 |
| Gois | 5 |
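Several districts in this result share the same count (three of them have 9 addresses), and `ORDER BY n DESC` alone does not guarantee the relative order of tied rows. One way to make the output reproducible is a secondary sort key. The sketch below demonstrates this with `sqlite3` and a few made-up rows; it illustrates the technique and is not the notebook's own query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE address (address_id INTEGER, district TEXT)")
# Hypothetical sample rows; two districts tie on the count.
rows = [(1, "Shandong"), (2, "California"), (3, "Shandong"),
        (4, "California"), (5, "England")]
conn.executemany("INSERT INTO address VALUES (?, ?)", rows)

query = """
SELECT district, COUNT(*) AS n
FROM address
GROUP BY district
ORDER BY n DESC, district  -- district name breaks ties deterministically
LIMIT 10
"""
top_districts = conn.execute(query).fetchall()
print(top_districts)
```

Note that an alphabetical tie-breaker may list tied districts in a different order than the result shown for this exercise.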
