<a href="https://colab.research.google.com/github/brendanpshea/database_sql/blob/main/Database_08_Database_Managment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Database Management at "The Office"
Welcome to the chapter on Database Management. This topic is an integral part of database systems and has a direct impact on the efficiency, security, reliability, and overall performance of these systems. The importance of database management can't be overstated, as it plays a pivotal role in the smooth operation of businesses and organizations, allowing them to process transactions, secure sensitive information, ensure data consistency, and make informed decisions.

Database management isn't just about storing data; it's about storing it in a way that makes it easily retrievable, secure, and consistent. It includes the techniques and tools that enable the performance optimization of a database, manage the data and transactions, maintain the security and privacy of the data, and also facilitate data recovery in case of any failure or mishap.

In this chapter, we will delve into the specifics of database management. We will discuss key topics like transactions, performance optimization, security, the role of database systems in business, and more. To make these topics more approachable and easier to understand, we will relate them to real-world scenarios throughout the chapter.

## Case Study Introduction

Let's imagine Dunder Mifflin, a mid-sized paper supply company known to many from the TV show The Office. In the heart of Scranton, Pennsylvania, Dunder Mifflin maintains a busy workflow, with various departments such as sales, accounting, customer service, and human resources, all relying on data to carry out their operations.

They have a database filled with various types of data, including employee information, client details, product inventory, sales orders, and more. Given the volume and variety of data, the database system's role becomes crucial to support the day-to-day operations, strategic planning, and the overall success of Dunder Mifflin.

Through this case study, we'll explore how different aspects of database management apply to Dunder Mifflin's operations. We'll discuss how transactions are processed, the role of performance optimization techniques like indexing, how security measures like database views protect sensitive data, and how the database system contributes to the company's broader business goals.

As we journey through the complex yet intriguing landscape of database management, we'll get a deeper understanding of its importance, not just theoretically, but in a practical, business-oriented setting.

The cell below installs PostgreSQL and connects the notebook to the database server we'll be working with:

In [1]:
!pip install SQLAlchemy==1.3.24 -q # Needed to avoid problems with more recent versions in Colab

# For this section, we need to use PostgreSQL,
# which has more advanced indexing and database management abilities

!apt install postgresql postgresql-contrib &>log
!service postgresql start
!sudo -u postgres psql -c "CREATE USER root WITH SUPERUSER"
# set connection
%load_ext sql
%config SqlMagic.feedback=False
%config SqlMagic.autopandas=True
%sql postgresql+psycopg2://@/postgres

  Preparing metadata (setup.py) ... done
  Building wheel for SQLAlchemy (setup.py) ... done
 * Starting PostgreSQL 12 database server
   ...done.
CREATE ROLE


## Introduction to the Dunder Mifflin Database
In any organization, data is often one of the most valuable assets. It gives the company insights into its operations and helps make informed decisions. For Dunder Mifflin, a fictional paper sales company, data is stored in a relational database, organized into tables that represent different entities in the organization. Let's take a closer look at the design of this database and understand why it's structured the way it is. (You can see the SQL code immediately following this section).

![Dunder Mifflin Database](https://github.com/brendanpshea/database_sql/raw/main/images/dunder_mifflin.png)





1\. Employees Table

The Employees table represents the employees of Dunder Mifflin. Each employee is uniquely identified by an 'employee_id'. This is the table's primary key, ensuring there are no duplicates and every employee can be uniquely identified. Other fields include the employee's first name, last name, email, and job title. Note that the email field is also unique, reflecting the real-world constraint that each employee should have a distinct email address.

2\. Clients Table

The Clients table represents the clients Dunder Mifflin does business with. Much like the Employees table, each client is uniquely identified by a 'client_id'. The company name is mandatory ('NOT NULL'), reflecting the business rule that every client must be associated with a company. Contact details and the company's location details are also stored here.

3\. Products Table

The Products table keeps track of the products Dunder Mifflin sells. Each product has a unique 'product_id' and must have a name and a price. The price must also be non-negative, which is enforced by a 'CHECK' constraint. This illustrates how databases can enforce data integrity at the field level.

4\. Orders Table

The Orders table represents the business transactions, i.e., the orders made by clients. Each order has a unique 'order_id' and is associated with a client and an employee, who are referenced by their respective IDs. This creates a relationship between the tables: for each record in the Orders table, there is a corresponding client in the Clients table and an employee in the Employees table. This association is enforced through 'FOREIGN KEY' constraints, which maintain the integrity of these relationships.

5\. OrderDetails Table

The OrderDetails table describes the specific items included in each order, connecting the order to the Products table. Each order detail record is uniquely identified by a combination of 'order_id' and 'product_id', meaning that the same product can appear only once in each order (though it can appear in multiple different orders). This is an example of a composite primary key.

This database design reflects a common approach in relational databases: breaking down information into smaller, manageable pieces and defining relationships between them. Entities are represented as tables (Employees, Clients, Products), and relationships are enforced through foreign keys (in Orders and OrderDetails). The choice of data types and constraints reflects the nature of the data and ensures its integrity. For example, employee email addresses are unique and non-nullable, and prices are non-negative.

In [25]:
%%sql
-- The Employees table represents the employees of Dunder Mifflin.
DROP TABLE IF EXISTS employees cascade;
CREATE TABLE employees (
    employee_id INT PRIMARY KEY,  -- Every employee has a unique ID
    first_name VARCHAR(255) NOT NULL,  -- Employees' first names cannot be null
    last_name VARCHAR(255) NOT NULL,  -- Employees' last names cannot be null
    email VARCHAR(255) UNIQUE NOT NULL,  -- Every employee has a unique and not-null email address
    job_title VARCHAR(255) NOT NULL  -- Job title of each employee cannot be null
);

-- The Clients table represents the clients of Dunder Mifflin.
DROP TABLE IF EXISTS clients cascade;
CREATE TABLE clients (
    client_id INT PRIMARY KEY,  -- Every client has a unique ID
    company_name VARCHAR(255) NOT NULL,  -- Clients' company names cannot be null
    contact_name VARCHAR(255),  -- Name of the contact person at client's company
    contact_email VARCHAR(255) UNIQUE,  -- Unique email of the contact person at client's company
    address VARCHAR(255),  -- Address of the client's company
    city VARCHAR(255),  -- City where client's company is located
    postal_code VARCHAR(10),  -- Postal code of client's company
    country VARCHAR(255)  -- Country where client's company is located
);

-- The Products table represents the products that Dunder Mifflin sells.
DROP TABLE IF EXISTS products cascade;
CREATE TABLE products (
    product_id INT PRIMARY KEY,  -- Every product has a unique ID
    product_name VARCHAR(255) NOT NULL,  -- Product names cannot be null
    category VARCHAR(255),  -- Category that each product belongs to
    price DECIMAL(10, 2) NOT NULL CHECK (price >= 0)  -- Price of each product, cannot be null or negative
);

-- The Orders table represents the orders made by clients. Each order is handled by an employee.
DROP TABLE IF EXISTS orders cascade;
CREATE TABLE orders (
    order_id INT PRIMARY KEY,  -- Every order has a unique ID
    client_id INT NOT NULL,  -- Every order is associated with a client, client_id cannot be null
    employee_id INT NOT NULL,  -- Every order is handled by an employee, employee_id cannot be null
    order_date DATE NOT NULL,  -- Date when the order was made, cannot be null
    FOREIGN KEY (client_id) REFERENCES clients(client_id),  -- Link to the Clients table
    FOREIGN KEY (employee_id) REFERENCES employees(employee_id)  -- Link to the Employees table
);

-- The OrderDetails table represents the details of each order (which products were ordered and in what quantity).
DROP TABLE IF EXISTS order_details cascade;
CREATE TABLE order_details (
    order_id INT,  -- ID of the order
    product_id INT,  -- ID of the ordered product
    quantity INT NOT NULL CHECK (quantity > 0),  -- Quantity of the ordered product, cannot be null and must be positive
    FOREIGN KEY (order_id) REFERENCES orders(order_id),  -- Link to the Orders table
    FOREIGN KEY (product_id) REFERENCES products(product_id),  -- Link to the Products table
    PRIMARY KEY (order_id, product_id)  -- Each combination of order_id and product_id is unique
);


 * postgresql+psycopg2://@/postgres
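
To see these constraints doing their job, here is a minimal sketch that tries a few bad inserts and records which ones the database refuses. It assumes an in-memory SQLite database (via Python's standard `sqlite3` module) as a stand-in for the chapter's PostgreSQL server, and it uses a trimmed-down copy of the schema; the helper `is_rejected` is a hypothetical name introduced only for this illustration.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for the chapter's PostgreSQL server.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked

# Trimmed-down versions of the products, orders, and order_details tables.
conn.executescript("""
CREATE TABLE products (
    product_id INTEGER PRIMARY KEY,
    product_name TEXT NOT NULL,
    price NUMERIC NOT NULL CHECK (price >= 0)
);
CREATE TABLE orders (
    order_id INTEGER PRIMARY KEY
);
CREATE TABLE order_details (
    order_id INTEGER REFERENCES orders(order_id),
    product_id INTEGER REFERENCES products(product_id),
    quantity INTEGER NOT NULL CHECK (quantity > 0),
    PRIMARY KEY (order_id, product_id)
);
INSERT INTO products VALUES (1, 'Letter Paper', 10.00);
INSERT INTO orders VALUES (1);
INSERT INTO order_details VALUES (1, 1, 5);
""")


def is_rejected(sql):
    """Return True if the database refuses the statement with an integrity error."""
    try:
        conn.execute(sql)
        return False
    except sqlite3.IntegrityError:
        return True


results = {
    "negative price": is_rejected("INSERT INTO products VALUES (2, 'Stapler', -5.00)"),
    "duplicate (order_id, product_id)": is_rejected("INSERT INTO order_details VALUES (1, 1, 9)"),
    "unknown product_id": is_rejected("INSERT INTO order_details VALUES (1, 99, 1)"),
    "valid second product": is_rejected("INSERT INTO products VALUES (2, 'Stapler', 5.00)"),
}
for case, rejected in results.items():
    print(f"{case}: {'rejected' if rejected else 'accepted'}")
```

The first three statements violate the `CHECK`, composite primary key, and foreign key rules respectively, so only the last one should be accepted. PostgreSQL reports these violations with its own error messages, but the idea is the same.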


## Inserting Test Data
Database management is often as much about the data as it is about the structure and schema. While the schema defines the shape of the data and the relationships between different data entities, it is often the data itself that brings the system to life and allows us to see how it behaves in real-world scenarios. This is especially true when it comes to database performance tuning, which includes activities like creating indexes, optimizing queries, or fine-tuning your database configuration settings.

Test data, or data that mimics real-world data, can be instrumental in these scenarios. By populating the database with test data, we can simulate actual load and usage patterns, identify performance bottlenecks, and fine-tune the database design or settings to achieve optimal performance. Test data can help us answer questions like:

-   How quickly can we retrieve customer order history?
-   Does our database scale as the number of orders increase?
-   Are the database indexes we've set up helping speed up data retrieval?

To start with, let's just insert some basic entries on employees, clients, and products.

In [26]:
%%sql
--Now, we can insert some sample data
TRUNCATE TABLE employees CASCADE;  -- Empty the table, plus any tables that reference it
INSERT INTO employees (employee_id, first_name, last_name, email, job_title)
VALUES
(1, 'Jim', 'Halpert', 'jim.halpert@dundermifflin.com', 'Sales Representative'),
(2, 'Pam', 'Beesly', 'pam.beesly@dundermifflin.com', 'Receptionist'),
(3, 'Michael', 'Scott', 'michael.scott@dundermifflin.com', 'Regional Manager'),
(4, 'Dwight', 'Schrute', 'dwight.schrute@dundermifflin.com', 'Assistant to the Regional Manager'),
(5, 'Angela', 'Martin', 'angela.martin@dundermifflin.com', 'Accountant'),
(6, 'Kevin', 'Malone', 'kevin.malone@dundermifflin.com', 'Accountant'),
(7, 'Stanley', 'Hudson', 'stanley.hudson@dundermifflin.com', 'Sales Representative'),
(8, 'Phyllis', 'Vance', 'phyllis.vance@dundermifflin.com', 'Sales Representative'),
(9, 'Oscar', 'Martinez', 'oscar.martinez@dundermifflin.com', 'Accountant'),
(10, 'Toby', 'Flenderson', 'toby.flenderson@dundermifflin.com', 'HR Representative');

TRUNCATE TABLE clients CASCADE;  -- Empty the table, plus any tables that reference it
INSERT INTO clients (client_id, company_name, contact_name, contact_email, address, city, postal_code, country)
VALUES
(1, 'Schrute Farms', 'Dwight Schrute', 'dwight@schrutefarms.com', 'Farm Road', 'Scranton', '18505', 'USA'),
(2, 'Poor Richards', 'Richard Poor', 'richard@poorrichards.com', 'Bar Street', 'Scranton', '18505', 'USA'),
(3, 'The Finer Things Club', 'Pam Beesly', 'pam@finerthingsclub.com', 'Office Park', 'Scranton', '18505', 'USA'),
(4, 'Vance Refrigeration', 'Bob Vance', 'bob@vancerefrigeration.com', 'Industrial Park', 'Scranton', '18505', 'USA'),
(5, 'Prince Family Paper', 'David Prince', 'david@princefamilypaper.com', 'Suburban Rd', 'Scranton', '18505', 'USA'),
(6, 'Michael Scott Paper Company', 'Michael Scott', 'michael@mspapercompany.com', 'Office Park', 'Scranton', '18505', 'USA'),
(7, 'Alfredo’s Pizza Cafe', 'Alfredo', 'alfredo@alfredospizza.com', 'Pizza Street', 'Scranton', '18505', 'USA'),
(8, 'Pizza by Alfredo', 'Alfredo Jr.', 'alfredo.jr@pizzabyalfredo.com', 'Pizza Street', 'Scranton', '18505', 'USA'),
(9, 'Cooper’s Seafood House', 'Paul Cooper', 'paul@cooperseafood.com', 'Dock Street', 'Scranton', '18505', 'USA'),
(10, 'Hooters', 'John Hooter', 'john@hooters.com', 'Restaurant Row', 'Scranton', '18505', 'USA');

TRUNCATE TABLE products CASCADE;  -- Empty the table, plus any tables that reference it
INSERT INTO products (product_id, product_name, category, price)
VALUES
(1, 'Letter Paper', 'Paper', 10.00),
(2, 'Legal Paper', 'Paper', 15.00),
(3, 'Printer', 'Office Supplies', 120.00),
(4, 'Stapler', 'Office Supplies', 5.00),
(5, 'Desk Lamp', 'Office Supplies', 20.00),
(6, 'Computer Monitor', 'Office Electronics', 150.00),
(7, 'Keyboard', 'Office Electronics', 30.00),
(8, 'Mouse', 'Office Electronics', 15.00),
(9, 'File Cabinet', 'Furniture', 80.00),
(10, 'Desk Chair', 'Furniture', 85.00);




 * postgresql+psycopg2://@/postgres


Now we'll do something a bit more complex: insert thousands of rows of test data using some more advanced features of PostgreSQL.

In [27]:
%%sql
TRUNCATE TABLE orders, order_details;  -- Empty both tables together (order_details references orders)

-- Generate 500 random orders
WITH data AS (
    SELECT
        s.id,
        (random() * 9 + 1)::int,  -- Assumes 10 clients with IDs 1, 2, 3, etc
        (random() * 9 + 1)::int,  -- Assumes 10 employees with IDs 1, 2, 3, etc.
        timestamp '2005-01-01' + random() * (timestamp '2015-12-31' - timestamp '2005-01-01')  -- Random date between 2005 and 2015
    FROM generate_series(1, 500) AS s(id)
)
INSERT INTO orders (order_id, client_id, employee_id, order_date)
SELECT * FROM data
ON CONFLICT (order_id) DO NOTHING;

-- Attempt 5000 random order details (duplicate order/product pairs are skipped)
WITH data AS (
    SELECT
        (random() * 499 + 1)::int,  -- Assumes 500 orders with IDs 1 through 500
        (random() * 9 + 1)::int,  -- Assumes 10 products with IDs 1, 2, 3, etc.
        (random() * 999 + 1)::int  -- Random quantity between 1 and 1000
    FROM generate_series(1, 5000)
)
INSERT INTO order_details (order_id, product_id, quantity)
SELECT * FROM data
ON CONFLICT (order_id, product_id) DO NOTHING;



 * postgresql+psycopg2://@/postgres


This code block demonstrates a simple (and relatively efficient) way of generating a large amount of test data. Here, we're using PostgreSQL's `generate_series` function to create a large number of records for our `orders` and `order_details` tables. We're using some randomness in our data to simulate real-world scenarios: for example, orders are associated randomly with different clients and employees, and the order quantities vary randomly.

We're also using PostgreSQL's `ON CONFLICT` clause, the same clause that powers an "upsert" (update or insert). When we attempt to insert a new row that would violate a unique constraint (like a primary key), `ON CONFLICT ... DO NOTHING` tells PostgreSQL to skip that row and continue with the next one. This avoids errors during data generation, but it also means the tables can end up with fewer rows than we attempted to insert.

This is just one of the many ways to generate test data. Other strategies might involve using dedicated data generation tools or libraries, scripting your data generation in a programming language like Python or Java, or even manually creating your data if the scale is small. The right approach depends on your specific needs, the complexity of your data model, and the scale of your data.
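
As a rough illustration of the "script it in Python" option, the sketch below generates random order details and lets the database skip duplicate keys. It is not the chapter's method: it assumes an in-memory SQLite database in place of PostgreSQL, omits the foreign keys for brevity, and uses SQLite's `INSERT OR IGNORE`, which plays the same role here as `ON CONFLICT DO NOTHING`. The constant names are made up for the example.

```python
import random
import sqlite3

random.seed(42)  # fixed seed so the sketch is repeatable

# Hypothetical sizes mirroring the SQL cell above.
NUM_ORDERS, NUM_PRODUCTS, NUM_ATTEMPTS = 500, 10, 5000

# Assumption: in-memory SQLite stands in for PostgreSQL; foreign keys omitted.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE order_details (
        order_id INTEGER,
        product_id INTEGER,
        quantity INTEGER NOT NULL CHECK (quantity > 0),
        PRIMARY KEY (order_id, product_id)
    )
""")

# Each row is (order_id, product_id, quantity) drawn uniformly at random.
rows = [
    (
        random.randint(1, NUM_ORDERS),
        random.randint(1, NUM_PRODUCTS),
        random.randint(1, 1000),
    )
    for _ in range(NUM_ATTEMPTS)
]

# INSERT OR IGNORE skips rows whose (order_id, product_id) already exists.
conn.executemany("INSERT OR IGNORE INTO order_details VALUES (?, ?, ?)", rows)
conn.commit()

inserted = conn.execute("SELECT COUNT(*) FROM order_details").fetchone()[0]
distinct_keys = len({(order_id, product_id) for order_id, product_id, _ in rows})
print(f"attempted {NUM_ATTEMPTS}, inserted {inserted}")
```

Because many random pairs collide, the number of rows actually stored equals the number of distinct `(order_id, product_id)` pairs generated, not the number of attempts.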

The goal here isn't to memorize the specifics of each command (as this goes beyond the scope of an introductory textbook like this one). Rather, it's to understand why we generate test data and how it helps us manage and optimize our databases. Later in this chapter, we'll dive deeper into how we can use this test data to explore database performance and optimization techniques.

## Transactions and ACID

In the world of database systems, a **transaction** refers to a sequence of operations performed as a single, indivisible logical unit of work. To understand transactions, imagine them as packages of work that either succeed entirely or don't occur at all. If a transaction is interrupted (due to a system failure or an unexpected error), the database system should be able to revert or "roll back" all changes made during the transaction, restoring the database to its previous state.

In the context of Dunder Mifflin, a simple example of a transaction could be the processing of a sales order. So, suppose that Dwight Schrute, the top salesman, makes a sale to a client. He needs to record this transaction in the company database, which involves a series of operations:

1. Subtract the sold quantity of paper from the inventory.
2. Add a new sales record, including the client's details and the details of the sale.
3. Update the client's total purchases and the company's total sales.

These operations should all occur together as a single transaction. If any of them fails — say, a power outage at the office causes the database server to abruptly shut down after the inventory is updated, but before the sales record is created — then the transaction should be rolled back to maintain data consistency. We don't want to show a reduced inventory without a corresponding sales record.

To ensure data consistency and reliability, a database transaction must satisfy the properties collectively known as **ACID**:

- **Atomicity:** This property implies that a transaction is an indivisible unit, meaning it is all or nothing. If Dwight gets distracted in the middle of recording a sale — perhaps by another prank from Jim — and only completes part of the transaction, atomicity ensures that partial transactions don't happen.

- **Consistency:** The consistency property guarantees that a transaction brings the database from one valid state to another. This means that at the end of a transaction, all rules and constraints defined in the database should still hold true. So if Michael decides to purchase 1000 boxes of paper for his "Michael Scott Paper Company", but Dunder Mifflin only has 500 boxes in stock, the sale shouldn't be allowed to go through.

- **Isolation:** Isolation ensures that the concurrent execution of transactions results in a system state that would be obtained if transactions were executed serially. So if both Jim and Dwight are trying to sell the last box of paper in stock, isolation ensures that only one of them can succeed.

- **Durability:** Once a transaction has been committed, it remains so, even in the event of a system failure. For instance, if Pam records a big sale and the system crashes soon after, the transaction shouldn't be lost when the database recovers.

By adhering to the ACID properties, Dunder Mifflin can ensure that their sales transactions are processed reliably and consistently, reducing the chance of any discrepancies in their database — even when they're distracted by yet another of Michael's "mandatory" conference room meetings!

SQL implements transactions through a series of SQL commands, primarily BEGIN TRANSACTION, COMMIT, and ROLLBACK. Understanding how to control transactions is crucial to maintaining data integrity and consistency.

1. We use the `BEGIN TRANSACTION` command to initiate a transaction. This statement signals the start of a transaction consisting of one or more SQL statements.

2. If everything goes according to plan, the transaction is finalized using the `COMMIT` command. This statement will save all changes made since the last `BEGIN TRANSACTION` to the database.

3. However, if there's an error during the transaction, or we decide that we don't want to save the changes for whatever reason, we can use the `ROLLBACK` command. This will undo all changes made since the last `BEGIN TRANSACTION`.

Now let's put these commands to work in a Dunder Mifflin context.

Suppose Dwight is inputting a new sales transaction into the database. He needs to subtract the sold quantity from the inventory, add a new sales record, and update the client's total purchases. In SQL, that might look something like this (NOTE: Transactions aren't supported in this "notebook" format, which is why I haven't included this as a "code" cell).


```
BEGIN TRANSACTION;

UPDATE inventory SET quantity = quantity - 100
  WHERE item = 'A4 Paper';

INSERT INTO sales (client_name, item, quantity, salesperson)
  VALUES ('Client X', 'A4 Paper', 100, 'Dwight Schrute');

UPDATE clients SET total_purchases = total_purchases + 100
  WHERE client_name = 'Client X';

COMMIT;
```

In this transaction, we first decrease the quantity of 'A4 Paper' in the inventory by 100. Then we insert a new record into the `sales` table. Finally, we increase the `total_purchases` for 'Client X' in the `clients` table. If all of these statements execute successfully, the `COMMIT` statement will save these changes to the database.

However, if there's a problem (say, one of the statements fails, or Jim puts Dwight's stapler in Jello again and he gets distracted, causing an error), the `COMMIT` command never runs. If the session is still open, we'll want to execute a `ROLLBACK` command to undo all changes made since the `BEGIN TRANSACTION`; if the server itself goes down, the database discards the uncommitted changes when it recovers. Here's what the explicit rollback looks like:

```
ROLLBACK;
```

The `ROLLBACK` command will bring the database back to the state it was in before the `BEGIN TRANSACTION`, ensuring the consistency and reliability of the data.
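
The all-or-nothing behavior can be tried outside the `%%sql` cells with a small Python sketch. It assumes an in-memory SQLite database instead of PostgreSQL, simplified versions of the hypothetical `inventory` and `sales` tables from the example above, and a made-up helper named `record_sale`. A `CHECK` constraint on the stock level plays the role of the failure.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for PostgreSQL.
# isolation_level=None puts the connection in autocommit mode,
# so the BEGIN / COMMIT / ROLLBACK statements below are issued explicitly.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.executescript("""
CREATE TABLE inventory (
    item TEXT PRIMARY KEY,
    quantity INTEGER NOT NULL CHECK (quantity >= 0)
);
CREATE TABLE sales (
    sale_id INTEGER PRIMARY KEY,
    item TEXT NOT NULL,
    quantity INTEGER NOT NULL
);
INSERT INTO inventory VALUES ('A4 Paper', 500);
""")


def record_sale(item, quantity):
    """Record the sale and reduce the stock together, or do neither."""
    conn.execute("BEGIN")
    try:
        conn.execute(
            "INSERT INTO sales (item, quantity) VALUES (?, ?)", (item, quantity)
        )
        # Fails the CHECK constraint if the stock would go negative.
        conn.execute(
            "UPDATE inventory SET quantity = quantity - ? WHERE item = ?",
            (quantity, item),
        )
        conn.execute("COMMIT")
        return True
    except sqlite3.IntegrityError:
        conn.execute("ROLLBACK")  # also undoes the sales row inserted above
        return False


ok_small = record_sale("A4 Paper", 100)   # fits within the 500 in stock
ok_large = record_sale("A4 Paper", 1000)  # would drive the stock negative

stock = conn.execute(
    "SELECT quantity FROM inventory WHERE item = 'A4 Paper'"
).fetchone()[0]
sales_count = conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
print(ok_small, ok_large, stock, sales_count)  # True False 400 1
```

The second sale inserts its `sales` row before the stock update fails, yet only one sale remains afterwards: the rollback removed the partial work, which is the atomicity property in action.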

In this way, by properly using transactions, SQL allows us to ensure that the ACID properties (Atomicity, Consistency, Isolation, Durability) are maintained during the database operations, leading to a more robust and reliable system.

## Introduction to Query Performance

In the world of databases, query performance is the measure of how quickly a database can execute a given query. This is incredibly important for any application, as slow queries can lead to slow application performance and a poor user experience.

There are many factors that can influence the speed at which a query executes. Let's consider a few examples using our Dunder Mifflin database:

1.  *Data Volume:* If a table in your database has a large number of records, queries against that table will take longer to execute than against a table with fewer records. For example, if the "orders" table in the Dunder Mifflin database has millions of records, a query to retrieve all orders might be quite slow.

2.  *Query Complexity:* Complex queries, such as those that involve multiple joins, subqueries, or complex calculations, can take longer to execute than simpler queries. For example, a query that tries to find the average order quantity for each client by joining the "orders", "order_details", and "clients" tables could be slower than a query that simply retrieves all orders from the "orders" table.

3.  *Data Types:* The type of data in your tables can also impact query performance. Some data types take longer to compare and sort than others. For example, text comparisons can be slower than integer comparisons. So, if we were to look for a client by their email address in the "clients" table (a text comparison), this could be slower than looking up a client by their client_id (an integer comparison).

PostgreSQL provides a tool called "EXPLAIN ANALYZE" that can help you understand why a particular query is performing the way it is. This tool provides information about the query execution plan chosen by PostgreSQL's query planner, along with performance metrics like execution time and the number of rows processed.

For example, if we wanted to understand the performance of a query on the Dunder Mifflin database that retrieves all orders for a particular client, we could use "EXPLAIN ANALYZE" like so:

In [7]:
%%sql
--Here is the query
SELECT * FROM orders
WHERE client_id = 1
LIMIT 5;

 * postgresql+psycopg2://@/postgres


Unnamed: 0,order_id,client_id,employee_id,order_date
0,3,1,4,2011-11-04
1,5,1,4,2007-09-15
2,6,1,3,2011-09-05
3,7,1,3,2006-12-11
4,8,1,4,2014-07-16


In [8]:
%%sql
--Now, let's analyze the query's performance
EXPLAIN ANALYZE SELECT * FROM orders WHERE client_id = 1;

 * postgresql+psycopg2://@/postgres


Unnamed: 0,QUERY PLAN
0,Seq Scan on orders (cost=0.00..2.25 rows=43 w...
1,Filter: (client_id = 1)
2,Rows Removed by Filter: 57
3,Planning Time: 0.091 ms
4,Execution Time: 0.068 ms


Here's what Postgres does to execute this query:

1. *Seq Scan on orders:* PostgreSQL starts with a **sequential scan** on the "orders" table, meaning it reads the entire table, row by row. The **cost** represents the database's estimate of how "expensive" this operation is in terms of time and resources.

2. *Filter: (client_id = 1):* As it reads each row, PostgreSQL applies a filter, keeping only the rows where "client_id" is equal to 1.

3. *Rows Removed by Filter*: The filter removes 57 rows from the result set because those rows did not meet the condition of having "client_id" equal to 1.

4. *Planning Time:* This is the time it took PostgreSQL to plan how to execute this query. It's like the blueprint construction time before starting actual work.

5. *Execution Time:* This is the actual time it took to execute the query according to the plan.
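
The scan-and-filter steps above can be modeled in a few lines of plain Python. This is only a conceptual sketch, not how PostgreSQL is implemented: the `orders` list is made-up data (100 rows spread evenly over 10 clients), and `seq_scan` is a hypothetical helper.

```python
# Made-up data: 100 orders, with client_id cycling through 1..10.
orders = [{"order_id": i, "client_id": i % 10 + 1} for i in range(1, 101)]


def seq_scan(rows, predicate):
    """Read every row in order, keeping matches and counting the rest."""
    kept = []
    removed = 0
    for row in rows:          # a sequential scan touches every row
        if predicate(row):    # the filter is checked as each row is read
            kept.append(row)
        else:
            removed += 1      # analogous to "Rows Removed by Filter"
    return kept, removed


matches, rows_removed = seq_scan(orders, lambda row: row["client_id"] == 1)
print(len(matches), rows_removed)  # 10 90
```

The work done grows with the size of the table, no matter how few rows match, which is why sequential scans become expensive on large tables.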

### Analyzing a More Complex Query

Now, let's say we want to retrieve all orders along with the names of the clients and the employees who handled them. We would need to join the "orders", "clients", and "employees" tables. Here's the query and how we might use "EXPLAIN ANALYZE":

In [9]:
%%sql
--the query
SELECT o.order_id, c.company_name, e.first_name, e.last_name
FROM orders o
JOIN clients c ON o.client_id = c.client_id
JOIN employees e ON o.employee_id = e.employee_id
LIMIT 5;


 * postgresql+psycopg2://@/postgres


Unnamed: 0,order_id,company_name,first_name,last_name
0,1,Poor Richards,Dwight,Schrute
1,2,The Finer Things Club,Michael,Scott
2,3,Schrute Farms,Dwight,Schrute
3,4,The Finer Things Club,Pam,Beesly
4,5,Schrute Farms,Dwight,Schrute


In [10]:
%%sql
EXPLAIN ANALYZE
SELECT o.order_id, c.company_name, e.first_name, e.last_name
FROM orders o
JOIN clients c ON o.client_id = c.client_id
JOIN employees e ON o.employee_id = e.employee_id;


 * postgresql+psycopg2://@/postgres


Unnamed: 0,QUERY PLAN
0,Hash Join (cost=21.12..23.74 rows=100 width=1...
1,Hash Cond: (o.employee_id = e.employee_id)
2,-> Hash Join (cost=10.45..12.77 rows=100 w...
3,Hash Cond: (o.client_id = c.client_id)
4,-> Seq Scan on orders o (cost=0.00.....
5,-> Hash (cost=10.20..10.20 rows=20 w...
6,Buckets: 1024 Batches: 1 Memor...
7,-> Seq Scan on clients c (cost...
8,-> Hash (cost=10.30..10.30 rows=30 width=1...
9,Buckets: 1024 Batches: 1 Memory Usag...


Here's what Postgres has to do to execute this query.

1.  Seq Scan on orders o: PostgreSQL starts by performing a **sequential scan** on the "orders" table, which means it reads through the entire table, row by row. The "cost" here is the database's estimate of how "expensive" this operation is in terms of time and resources.

2.  Hash Join: Next, the database joins the "orders" and "clients" tables. The method used here is called a Hash Join. The database creates a "hash table" of the smaller table ("clients"), which is an in-memory data structure that allows very fast lookups. It then scans the larger table ("orders") and for each row, it uses the hash function to quickly locate matching rows in the smaller table. The condition for the join (Hash Cond) is `(o.client_id = c.client_id)`.

3.  Seq Scan on clients c & Hash: Before the join, a sequential scan on the "clients" table is performed and a hash table is created (this is what "Hash" refers to).

4.  The process is repeated to join the resulting table with the "employees" table.

5.  Planning Time: This is the time it took PostgreSQL to create the plan for executing this query.

6.  Execution Time: This is the actual time it took to execute the query according to the plan.
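
The build-then-probe idea behind a hash join can be sketched in plain Python, with a dictionary playing the role of the hash table. This is a conceptual illustration only, not PostgreSQL's implementation; the `hash_join` function and its parameter names are hypothetical, and the sample rows are taken from the query output shown earlier.

```python
# (client_id, company_name): the smaller table, used to build the hash table.
clients = [
    (1, "Schrute Farms"),
    (2, "Poor Richards"),
    (3, "The Finer Things Club"),
]
# (order_id, client_id): the larger table, scanned once during the probe.
orders = [(1, 2), (2, 3), (3, 1), (4, 3), (5, 1)]


def hash_join(build_rows, probe_rows, build_key, probe_key):
    """Join two lists of tuples on build_rows[build_key] == probe_rows[probe_key]."""
    # Build phase: index the smaller table by its join column.
    table = {}
    for row in build_rows:
        table.setdefault(row[build_key], []).append(row)
    # Probe phase: one dictionary lookup per row of the larger table.
    for row in probe_rows:
        for match in table.get(row[probe_key], []):
            yield row + match


joined = [
    (order_id, company_name)
    for order_id, _client_id, _matched_id, company_name
    in hash_join(clients, orders, build_key=0, probe_key=1)
]
print(joined)
```

Each table is read only once, and each lookup into the dictionary takes roughly constant time, which is why this strategy suits joins where one side fits comfortably in memory.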

The PostgreSQL optimizer (or the optimizer in whatever DBMS you happen to be using) uses statistics about the data stored in tables (like the number of rows, data distribution, etc.) to decide the best way to execute a query. In this case, it decided that doing a sequential scan of the entire "orders" table, and then performing hash joins, was the fastest approach.

## Introduction to Indexes

Just like an index in a book helps you find specific information faster without going through each page, database indexes help the database engine find, filter, and sort records much faster. An **index** is a data structure that improves the speed of data retrieval operations. It is a critical factor in optimizing the performance of a database by reducing disk I/O operations and thus enhancing the query performance.

However, while indexes speed up data retrieval, they slow down data modification operations like `INSERT`, `UPDATE`, and `DELETE` as the index also needs to be updated. Thus, there is a trade-off that needs to be considered based on the specific use-case of your database.

Indexes can be created using one or more columns of a database table, providing the basis for both rapid random lookups and efficient access of ordered records. The database engine uses indexes to quickly locate data without having to search every row each time the table is accessed.

In PostgreSQL, you can create an index using the `CREATE INDEX` command. The syntax is as follows:

```
CREATE INDEX idx_name ON table_name(column_name);
```
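
To watch an index change how a query is executed, here is a small Python sketch. It assumes an in-memory SQLite database rather than PostgreSQL, so it uses SQLite's `EXPLAIN QUERY PLAN` (a simpler cousin of `EXPLAIN ANALYZE`) and made-up rows; the index name `idx_orders_client_id` and the helper `plan_for` are hypothetical.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for PostgreSQL, with made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE orders (
        order_id INTEGER PRIMARY KEY,
        client_id INTEGER NOT NULL,
        order_date TEXT
    )
""")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(i, i % 10 + 1, "2010-01-01") for i in range(1, 501)],
)

QUERY = "SELECT * FROM orders WHERE client_id = 1"


def plan_for(sql):
    """Return SQLite's plan description for a query as a single string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)  # last column holds the detail text


before = plan_for(QUERY)  # no index yet: a full table scan
conn.execute("CREATE INDEX idx_orders_client_id ON orders(client_id)")
after = plan_for(QUERY)   # the planner can now search via the index

print("before:", before)
print("after: ", after)
```

The exact wording of the plan varies between SQLite versions, but the first plan should describe a scan of `orders` and the second a search that names the new index. PostgreSQL's output looks different, though the underlying shift from reading every row to an index lookup is the same idea.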

### Context

Let's go back to our Dunder Mifflin example. Imagine one fine day, Dwight Schrute, the Assistant (to the) Regional Manager, decides he wants to personally send a bobblehead doll to every client who places an order with a quantity greater than 500. These large orders require special attention, and Dwight believes this personal touch will boost customer satisfaction.

The company has thousands of orders, and Dwight doesn't have all day. He needs to find these big orders quickly. The original database design doesn't cater to this specific requirement, as finding these large orders would require scanning every row in the order_details table, which could be time-consuming.

Here's the result of the initial analysis for the query (without an index):

In [28]:
%%sql
EXPLAIN ANALYZE
SELECT *
FROM order_details
WHERE quantity > 500;


 * postgresql+psycopg2://@/postgres


Unnamed: 0,QUERY PLAN
0,Seq Scan on order_details (cost=0.00..60.35 r...
1,Filter: (quantity > 500)
2,Rows Removed by Filter: 1548
3,Planning Time: 0.096 ms
4,Execution Time: 0.766 ms
