# Lesson 2 Exercise 2: Creating Denormalized Tables

<img src="../../../images/postgre_sql_logo.png" width="250" height="250">

In [1]:
from src.utils.postgres.generic_commands import PostgresCommands

##### Create a connection to the database, get a cursor, and set autocommit to true

In [5]:
postgres = PostgresCommands()

PostgresCommands class initiated


In [6]:
postgres.connect(host='localhost', port='5432', database_name='nano_data_engineering_db', user='postgres', password='####', autocommit=True)

Connection established with nano_data_engineering_db


##### Let's start with the normalized (3NF) set of tables from the last exercise, plus one new table: `sales`.

`Table Name: transactions` \
`column 0: Transaction Id` \
`column 1: Customer Name` \
`column 2: Cashier Id` \
`column 3: Year` 

`Table Name: albums` \
`column 0: Album Id` \
`column 1: Transaction Id` \
`column 2: Album Name` 

`Table Name: employees` \
`column 0: Employee Id` \
`column 1: Employee Name` 

`Table Name: sales` \
`column 0: Transaction Id` \
`column 1: Amount Spent` 

<img src="../../../images/normalized_transactions.png" width="450" height="450"> 
<img src="../../../images/normalized_albums.png" width="450" height="450"> 
<img src="../../../images/normalized_employees.png" width="350" height="350"> 
<img src="../../../images/normalized_sales.png" width="350" height="350">

##### Generate and check tables

In [11]:
# Transactions
postgres.create_table(table_name='transactions', schema='(transaction_id int, customer_name varchar, cashier_id int, year int)')
postgres.insert_rows(table_name='transactions', 
                     columns='(transaction_id, customer_name, cashier_id, year)',
                     rows=[
                           (1, 'Amanda', 1, 2000),
                           (2, 'Toby', 1, 2000),
                           (3, 'Max', 2, 2018)
                          ]
                    )
postgres.print_rows(table_name='transactions')

# Albums
postgres.create_table(table_name='albums', schema='(album_id int, transaction_id int, album_name varchar)')
postgres.insert_rows(table_name='albums', 
                     columns='(album_id, transaction_id, album_name)',
                     rows=[
                           (1, 1, 'Rubber Soul'),
                           (2, 1, 'Let It Be'),
                           (3, 2, 'My Generation'),
                           (4, 3, 'Meet the Beatles'),
                           (5, 3, 'Help!'),
                          ]
                    )
postgres.print_rows(table_name='albums')

# Employees
postgres.create_table(table_name='employees', schema='(employee_id int, employee_name varchar)')
postgres.insert_rows(table_name='employees', 
                     columns='(employee_id, employee_name)',
                     rows=[
                           (1, 'Sam'),
                           (2, 'Bob')
                          ]
                    )
postgres.print_rows(table_name='employees')

# Sales
postgres.create_table(table_name='sales', schema='(transaction_id int, amount_spent int)')
postgres.insert_rows(table_name='sales', 
                     columns='(transaction_id, amount_spent)',
                     rows=[
                           (1, 40),
                           (2, 19),
                           (3, 45)
                          ]
                    )
postgres.print_rows(table_name='sales')

Table named transactions created
(1, 'Amanda', 1, 2000)
(2, 'Toby', 1, 2000)
(3, 'Max', 2, 2018)
Table named albums created
(1, 1, 'Rubber Soul')
(2, 1, 'Let It Be')
(3, 2, 'My Generation')
(4, 3, 'Meet the Beatles')
(5, 3, 'Help!')
Table named employees created
(1, 'Sam')
(2, 'Bob')
Table named sales created
(1, 40)
(2, 19)
(3, 45)


##### Let's say you need to do a query that gives:

`transaction_id` \
`customer_name` \
`cashier_name` \
`year` \
`album_name` \
`amount_spent`

In [16]:
postgres.custom_query(query="""
                            SELECT
                                transaction_id,
                                customer_name,
                                employee_name AS cashier_name,
                                year,
                                album_name,
                                amount_spent
                            FROM transactions
                            LEFT JOIN employees
                            ON transactions.cashier_id = employees.employee_id
                            LEFT JOIN albums
                            USING (transaction_id)
                            LEFT JOIN sales
                            USING (transaction_id)
                            """
                     )                                

[(1, 'Amanda', 'Sam', 2000, 'Rubber Soul', 40),
 (1, 'Amanda', 'Sam', 2000, 'Let It Be', 40),
 (2, 'Toby', 'Sam', 2000, 'My Generation', 19),
 (3, 'Max', 'Bob', 2018, 'Meet the Beatles', 45),
 (3, 'Max', 'Bob', 2018, 'Help!', 45)]

Great, we were able to get the data we wanted.

But we had to chain three `JOIN`s across four tables to get there. That flexibility is useful, but `JOIN`s can be expensive on large tables, and a read-heavy workload that requires low-latency queries benefits from fewer of them. Let's think about denormalizing our normalized tables.

##### With denormalization you want to think about the queries you are running and how to reduce the number of JOINS even if that means duplicating data. The following are the queries you need to run.

#### Query 1: `SELECT transaction_id, customer_name, amount_spent FROM <min number of tables>`

One way to do this would be to `JOIN` the `sales` and `transactions` tables, but we want to minimize the use of `JOIN`s.

To reduce the number of tables, first add `amount_spent` to the `transactions` table so that you will not need to do a JOIN at all. 

`Table Name: transactions` \
`column 0: transaction_id` \
`column 1: customer_name` \
`column 2: cashier_id` \
`column 3: year` \
`column 4: amount_spent`

<img src="../../../images/denormalized.png" width="450" height="450">
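Rather than retyping the rows by hand, the denormalized table can also be derived from the normalized ones with a single `JOIN` at write time. The following is a minimal sketch of that idea, not part of the exercise itself: it uses Python's built-in `sqlite3` as a stand-in for Postgres (an assumption, so it runs without a database server), and `transactions_denorm` is a hypothetical table name chosen for illustration.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for Postgres (assumption).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Normalized source tables with the same rows as in the exercise.
cur.execute(
    "CREATE TABLE transactions "
    "(transaction_id int, customer_name text, cashier_id int, year int)"
)
cur.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?, ?)",
    [(1, "Amanda", 1, 2000), (2, "Toby", 1, 2000), (3, "Max", 2, 2018)],
)
cur.execute("CREATE TABLE sales (transaction_id int, amount_spent int)")
cur.executemany("INSERT INTO sales VALUES (?, ?)", [(1, 40), (2, 19), (3, 45)])

# Pay for the JOIN once, when writing, and store the combined result.
# transactions_denorm is a hypothetical name used only in this sketch.
cur.execute(
    """
    CREATE TABLE transactions_denorm AS
    SELECT t.transaction_id, t.customer_name, t.cashier_id, t.year, s.amount_spent
    FROM transactions AS t
    LEFT JOIN sales AS s ON s.transaction_id = t.transaction_id
    """
)

# Reads against the new table need no JOIN.
denormalized_rows = cur.execute(
    "SELECT * FROM transactions_denorm ORDER BY transaction_id"
).fetchall()
for row in denormalized_rows:
    print(row)

conn.close()
```

The trade-off is that `amount_spent` now lives in two places, so any later change to `sales` has to be propagated to the denormalized copy as well.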

In [19]:
# Drop earlier created table
postgres.drop_table('transactions')

# Transactions, now including amount_spent
postgres.create_table(table_name='transactions', schema='(transaction_id int, customer_name text, cashier_id int, year int, amount_spent int)')
postgres.insert_rows(table_name='transactions', 
                     columns='(transaction_id, customer_name, cashier_id, year, amount_spent)',
                     rows=[
                           (1, 'Amanda', 1, 2000, 40),
                           (2, 'Toby', 1, 2000, 19),
                           (3, 'Max', 2, 2018, 45)
                          ]
                    )
postgres.print_rows(table_name='transactions')

Dropped table transactions
Table named transactions created
(1, 'Amanda', 1, 2000, 40)
(2, 'Toby', 1, 2000, 19)
(3, 'Max', 2, 2018, 45)


##### Now you should be able to run a simplified query to get the information you need. No `JOIN` is needed.

In [22]:
for row in postgres.custom_query(query="""
                                        SELECT
                                            transaction_id,
                                            customer_name,
                                            amount_spent
                                        FROM transactions
                                        """
                                ):
    print(row)

(1, 'Amanda', 40)
(2, 'Toby', 19)
(3, 'Max', 45)


#### Query 2: `SELECT cashier_name, SUM(amount_spent) FROM <min number of tables> GROUP BY cashier_name`

To avoid using any `JOINS`, first create a new table with just the information we need. 

`Table Name: cashier_sales` \
`column 0: transaction_id` \
`column 1: cashier_name` \
`column 2: cashier_id` \
`column 3: amount_spent`


<img src="../../../images/denormalized_2.png" width="450" height="450">
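As with Query 1, the `cashier_sales` table could be built from existing tables instead of being typed in by hand. The sketch below is one possible way to do that, assuming the denormalized `transactions` table (with `amount_spent`) and the `employees` table from this exercise. It again uses the standard-library `sqlite3` module as a stand-in for Postgres, which is an assumption made so the example is self-contained.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for Postgres (assumption).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Denormalized transactions table (already carries amount_spent).
cur.execute(
    "CREATE TABLE transactions "
    "(transaction_id int, customer_name text, cashier_id int, year int, amount_spent int)"
)
cur.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?, ?, ?)",
    [(1, "Amanda", 1, 2000, 40), (2, "Toby", 1, 2000, 19), (3, "Max", 2, 2018, 45)],
)

# Employees lookup table.
cur.execute("CREATE TABLE employees (employee_id int, employee_name text)")
cur.executemany("INSERT INTO employees VALUES (?, ?)", [(1, "Sam"), (2, "Bob")])

# Resolve the cashier name once and store it next to each sale.
cur.execute(
    """
    CREATE TABLE cashier_sales AS
    SELECT t.transaction_id,
           e.employee_name AS cashier_name,
           t.cashier_id,
           t.amount_spent
    FROM transactions AS t
    LEFT JOIN employees AS e ON e.employee_id = t.cashier_id
    """
)

cashier_sales_rows = cur.execute(
    "SELECT * FROM cashier_sales ORDER BY transaction_id"
).fetchall()
for row in cashier_sales_rows:
    print(row)

conn.close()
```

Here the cashier name is duplicated on every sale, which is exactly the redundancy that lets Query 2 run against a single table.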

In [24]:
# Cashier Sales
postgres.create_table(table_name='cashier_sales', schema='(transaction_id int, cashier_name text, cashier_id int, amount_spent int)')
postgres.insert_rows(table_name='cashier_sales', 
                     columns='(transaction_id, cashier_name, cashier_id, amount_spent)',
                     rows=[
                           (1, 'Sam', 1, 40),
                           (2, 'Sam', 1, 19),
                           (3, 'Bob', 2, 45)
                          ]
                    )
postgres.print_rows(table_name='cashier_sales')

Table named cashier_sales created
(1, 'Sam', 1, 40)
(2, 'Sam', 1, 19)
(3, 'Bob', 2, 45)


##### Run the query

In [25]:
for row in postgres.custom_query(query="""
                                        SELECT
                                            cashier_name,
                                            SUM(amount_spent) AS sales
                                        FROM cashier_sales
                                        GROUP BY 1
                                        """
                                ):
    print(row)

('Sam', 59)
('Bob', 45)
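Because denormalization duplicates data, it can be worth checking that the denormalized table still agrees with the normalized source. The following is a small sketch of such a check, under the assumption that both sets of tables exist side by side. It uses `sqlite3` from the standard library in place of Postgres, and the variable names are illustrative only.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for Postgres (assumption).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Normalized tables from the start of the exercise.
cur.execute("CREATE TABLE transactions (transaction_id int, customer_name text, cashier_id int, year int)")
cur.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?, ?)",
    [(1, "Amanda", 1, 2000), (2, "Toby", 1, 2000), (3, "Max", 2, 2018)],
)
cur.execute("CREATE TABLE employees (employee_id int, employee_name text)")
cur.executemany("INSERT INTO employees VALUES (?, ?)", [(1, "Sam"), (2, "Bob")])
cur.execute("CREATE TABLE sales (transaction_id int, amount_spent int)")
cur.executemany("INSERT INTO sales VALUES (?, ?)", [(1, 40), (2, 19), (3, 45)])

# Denormalized table used by Query 2.
cur.execute("CREATE TABLE cashier_sales (transaction_id int, cashier_name text, cashier_id int, amount_spent int)")
cur.executemany(
    "INSERT INTO cashier_sales VALUES (?, ?, ?, ?)",
    [(1, "Sam", 1, 40), (2, "Sam", 1, 19), (3, "Bob", 2, 45)],
)

# Totals per cashier computed the normalized way (two JOINs).
normalized_totals = cur.execute(
    """
    SELECT e.employee_name, SUM(s.amount_spent)
    FROM transactions AS t
    JOIN employees AS e ON e.employee_id = t.cashier_id
    JOIN sales AS s ON s.transaction_id = t.transaction_id
    GROUP BY e.employee_name
    ORDER BY e.employee_name
    """
).fetchall()

# The same totals from the denormalized table (no JOIN).
denormalized_totals = cur.execute(
    """
    SELECT cashier_name, SUM(amount_spent)
    FROM cashier_sales
    GROUP BY cashier_name
    ORDER BY cashier_name
    """
).fetchall()

totals_match = normalized_totals == denormalized_totals
print(denormalized_totals, "matches normalized source:", totals_match)

conn.close()
```

The explicit `ORDER BY` makes the two result lists comparable, since SQL does not guarantee row order for a `GROUP BY` without it.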


##### And finally close your cursor and connection

In [26]:
postgres.close_connection()

Closed cursor
Closed connection
