<a href="https://colab.research.google.com/github/brendanpshea/data-science/blob/main/Data_Science_10_DatabaseSQL.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introducing Databases

At the heart of data management is the concept of a **relational database**. The term 'relational' hints at its essence: data is stored in tables (formally called relations) that can be related to one another. Many of you might be familiar with data frames from your work with Pandas. A data frame provides a two-dimensional structure, where each row represents an entry and each column signifies a feature or attribute of that entry. A relational database goes further by connecting many such structures.

A relational database is a collection of interrelated **tables (or "relations")**. Each table in this database is akin to a data frame, but what sets a relational database apart is its ability to establish connections or relationships between these tables. These relationships allow for efficient organization, retrieval, and manipulation of data, especially when dealing with complex datasets.

Let's draw a comparison for better understanding:

1.  Data Frames

    -   Two-dimensional: rows and columns.
    -   Each row is an entry; each column is an attribute or feature.
    -   Useful for a single, self-contained dataset where relationships to other datasets are not the main focus.
2.  Relational Databases

    -   Multiple tables: each table is two-dimensional, and a database holds many of them.
    -   Relationships between tables are defined using keys.
    -   Designed to handle complex datasets where interrelations between data are essential.

Now, with this foundational knowledge, let's dive into a practical application.

## Jurassic Park Database Case Study

Imagine stepping into the vast and thrilling world of Jurassic Park. The park is teeming with a variety of dinosaurs, each housed in an enclosure suited to it. As park managers, it's crucial to keep track of every dinosaur, its characteristics, its habitat, feeding times, and so much more. A simple list or a single table won't suffice. This is where our relational database comes into play.

In our case study, we'll be exploring a partial Jurassic Park database. This database contains tables that represent different entities like 'Dinosaurs', 'Enclosures', and perhaps 'Park Staff'. These tables not only store information about each entity but also define relationships. For instance, which dinosaur resides in which enclosure? Who is the caretaker responsible for a particular dinosaur? Answering such questions becomes seamless with our relational database.

As we delve deeper into the world of SQL and databases, you'll see how this Jurassic Park scenario helps illuminate the power and flexibility of relational databases in managing and querying data.

## What is a Relational Database?
A relational database is a collection of data items organized as a set of tables. Each table represents a category of data, making it easier to store, retrieve, and manage information.

At the foundation of each table is the **entity**. Think of an entity as the main topic or subject of a table. In our Jurassic Park example, one entity could be 'Dinosaurs'. So, there would be a table dedicated to dinosaurs, containing all relevant information about them. Other potential tables for a theme park database might include things like:

- Enclosures: to store data about the animals' living quarters
- Employees: to store data about employees
- Visitors: to record information about visitors to the park
- And many others...

Each entity has various **attributes**, which are specific pieces of information we want to capture about the entity. Attributes are represented as columns in a table. For the 'Dinosaurs' entity, attributes could include 'name', indicating the dinosaur's name, 'species' specifying its species type, and 'diet' detailing whether it's a herbivore, carnivore, or omnivore. Each row in this table would represent a specific dinosaur, and the data in that row would provide the details for each attribute.

But databases aren't just about storing isolated chunks of data. They shine when showing **relationships** between data. In Jurassic Park, we might want to know which dinosaur resides in which enclosure. This relationship can be represented by linking our 'Dinosaurs' table to another table, 'Enclosures', using shared attributes. In our case, the shared attribute could be 'enclosure_number'.

A vital component in our tables is the **primary key**. This is a unique identifier for each record in a table. In the 'Dinosaurs' table, this could be 'dinosaur_id'. No two dinosaurs would have the same 'dinosaur_id', ensuring that each record is distinct and easily identifiable.

Lastly, we define each attribute with a specific data type to ensure consistency in the data we store. Data types determine the nature of data an attribute can hold. For example, 'dinosaur_id' might be an integer (whole number), 'name' would be text, and 'dob' (date of birth) would be a date. This ensures we store data consistently and helps prevent errors. For instance, in our 'Enclosures' table, the 'square_feet' attribute would be an integer, representing the size of the enclosure in square feet.

## Two Tables at Jurassic Park

To see how this might look in a concrete case, let's look at a small section of a potential Jurassic Park database.


### Dinosaurs Table

| dinosaur_id | name | species | diet | enclosure_number | dob | biography |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | Rex | Tyrannosaurus | Carnivore | 5 | 1990-05-20 | The most fearsome dinosaur in the park. |
| 2 | Daisy | Brachiosaurus | Herbivore | 3 | 1991-08-14 | Known for her gentle nature and tall neck. |
| 3 | Spike | Triceratops | Herbivore | 4 | 1992-04-03 | Recognized by his three distinctive horns. |

### Enclosures Table

| enclosure_number | square_feet | security_level | habitat_type | notes |
| --- | --- | --- | --- | --- |
| 3 | 5000 | 2 | Tropical Forest | Ideal for herbivores. |
| 4 | 3000 | 3 | Grassland | Open space for dinosaurs to roam freely. |
| 5 | 4000 | 5 | Rocky Terrain | High security for predatory dinosaurs. |

In these tables, we can see the key elements of relational databases:

- **Tables and Attributes.** The Dinosaurs Table stores one record per dinosaur, while the Enclosures Table stores one record per enclosure. Each column in these tables, like 'name', 'species', and 'diet' in the Dinosaurs Table or 'square_feet' and 'habitat_type' in the Enclosures Table, represents an **attribute** of that entity.
- **Primary Keys.** The 'dinosaur_id' in the Dinosaurs Table and 'enclosure_number' in the Enclosures Table are primary keys. They uniquely identify each record. For example, the dinosaur with 'dinosaur_id' 1 is Rex.
- **Relationships.**  The 'enclosure_number' in the Dinosaurs Table establishes a relationship with the Enclosures Table. It indicates which enclosure a specific dinosaur resides in. For instance, Rex (from the Dinosaurs Table) resides in the enclosure with 'enclosure_number' 5, which is a rocky terrain with high security (from the Enclosures Table).
- **Data Types.**  The tables use different data types for their attributes. For example, 'name' stores text, 'dob' stores date values, and 'dinosaur_id' uses whole numbers.

By understanding these tables and their attributes, we can see how relational databases organize information, define relationships between data points, and ensure each record's uniqueness, all of which are fundamental concepts in database management.
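To make the relationship concrete, here is a minimal sketch of following the shared 'enclosure_number' from one table to the other. It assumes Python's built-in `sqlite3` module as a lightweight stand-in for a full database server, keeps only a few of the columns shown above, and uses a hypothetical helper named `habitat_of`. The `CREATE TABLE`, `INSERT INTO`, and `SELECT` statements it relies on are covered later in this document.

```python
import sqlite3

# In-memory SQLite database, used here only as a lightweight stand-in.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Enclosures (
    enclosure_number INTEGER PRIMARY KEY,
    square_feet INTEGER,
    security_level INTEGER,
    habitat_type TEXT
);
CREATE TABLE Dinosaurs (
    dinosaur_id INTEGER PRIMARY KEY,
    name TEXT,
    species TEXT,
    enclosure_number INTEGER
);
INSERT INTO Enclosures VALUES (3, 5000, 2, 'Tropical Forest'),
                              (4, 3000, 3, 'Grassland'),
                              (5, 4000, 5, 'Rocky Terrain');
INSERT INTO Dinosaurs VALUES (1, 'Rex', 'Tyrannosaurus', 5),
                             (2, 'Daisy', 'Brachiosaurus', 3),
                             (3, 'Spike', 'Triceratops', 4);
""")


def habitat_of(dinosaur_name):
    """Hypothetical helper: follow enclosure_number from Dinosaurs to Enclosures."""
    row = conn.execute(
        "SELECT enclosure_number FROM Dinosaurs WHERE name = ?",
        (dinosaur_name,),
    ).fetchone()
    if row is None:
        return None  # no dinosaur with that name
    # row is a one-element tuple such as (5,), which serves as the parameter list.
    return conn.execute(
        "SELECT habitat_type, security_level FROM Enclosures "
        "WHERE enclosure_number = ?",
        row,
    ).fetchone()


print(habitat_of("Rex"))  # ('Rocky Terrain', 5)
```

The lookup happens in two steps on purpose: first find the dinosaur's 'enclosure_number', then use that value to find the matching enclosure record.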

## Diving into Records

Within the structured confines of an SQL table, a **record** is a specific set of associated data that collectively provides comprehensive information about an entity. It encapsulates a singular instance of the entity the table represents. In essence, a record is a horizontal collection of values, each belonging to a specific column or attribute of the table.

When drawing parallels with other data structures, it's tempting to equate a record in SQL with a row in a spreadsheet or a dataframe. On the surface, they seem quite similar, both representing horizontal collections of data. However, there are nuanced differences:

1.  **Uniqueness.** In SQL tables, each record is usually required to have a unique identifier, known as the primary key. This ensures that every record can be precisely identified and differentiated from all others. While spreadsheets and dataframes can have unique identifiers, they aren't inherently structured to enforce this uniqueness.

2.  **Constraints.** SQL tables are rigid in their structure. They can have constraints that enforce specific rules on the data. For instance, a column might be set to only accept integer values or dates. A spreadsheet row or dataframe, on the other hand, is more flexible and typically accepts mixed kinds of data without such strict enforcement.

3.  **Relational Integrity.** SQL tables are designed to maintain relational integrity. This means that records in one table can be related to records in another table through foreign keys. While you can create relationships in spreadsheets or dataframes, they don't natively support or enforce these relationships as SQL does.


Venturing into the Jurassic landscape, consider the 'Dinosaurs' table. Each dinosaur, whether it's the towering T-Rex or the agile Velociraptor, occupies its own unique record in the table. This arrangement has several implications:

1. Due to the primary key constraint, typically on a 'dinosaur_id' column, no two dinosaurs can share the same identifier. This ensures that even if there are two T-Rexes in the park, they each have a unique record in the database.

2. The SQL table structure ensures that the data for each dinosaur is consistent. For example, the 'dob' (date of birth) column can only contain valid dates. If someone tried to enter text or a number that doesn't represent a date, the database would reject it.

3.  Using the unique records of the 'Dinosaurs' table, we can delve into the relational aspects of the database. For instance, the 'enclosure_number' might be a foreign key linking each dinosaur to a specific enclosure in the 'Enclosures' table. This showcases how records in one table relate to records in another, providing a holistic view of the park's operations.

In summary, while records in an SQL table might seem similar to rows in a spreadsheet or dataframe at first glance, their inherent properties, especially in terms of uniqueness, constraints, and relational capabilities, set them apart. These distinctions become evident when managing a complex system like a Jurassic Park database.
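The uniqueness point can be sketched in a few lines. The example below is an illustration under assumptions, not part of the park database itself: it uses Python's built-in `sqlite3` module in place of PostgreSQL and a cut-down two-column 'Dinosaurs' table. A plain Python list accepts two entries with the same identifier, while the table rejects the second one.

```python
import sqlite3

# A plain Python list happily stores two entries with the same identifier.
rows = [{"dinosaur_id": 1, "name": "Rex"}]
rows.append({"dinosaur_id": 1, "name": "Ruby"})
duplicates_in_list = len(rows)

# The same data in a table with a primary key (SQLite stand-in, reduced columns).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Dinosaurs (dinosaur_id INTEGER PRIMARY KEY, name TEXT NOT NULL)"
)
conn.execute("INSERT INTO Dinosaurs VALUES (1, 'Rex')")

try:
    conn.execute("INSERT INTO Dinosaurs VALUES (1, 'Ruby')")
    rejected = False
except sqlite3.IntegrityError as err:
    # The database refuses a second record with dinosaur_id 1.
    rejected = True
    print("Rejected:", err)

count = len(conn.execute("SELECT name FROM Dinosaurs").fetchall())
print(duplicates_in_list, rejected, count)  # 2 True 1
```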

## Launching PostgreSQL with SQL magic

Before creating tables in **PostgreSQL** (an open-source database), let's set up the environment in a Google Colab notebook. Execute the following commands in a single cell:

In [1]:
# Install and launch PostgreSQL as a "Superuser"
!apt install postgresql postgresql-contrib &>log
!service postgresql start
!sudo -u postgres psql -c "CREATE USER root WITH SUPERUSER"

# set connection
%load_ext sql
%sql postgresql+psycopg2://@/postgres


 * Starting PostgreSQL 14 database server
   ...done.
CREATE ROLE


## Introduction to the CREATE TABLE Statement
With PostgreSQL up and running, it's time to delve into one of the foundational SQL commands: the `CREATE TABLE` statement. This command allows us to define and create a new table in the database.

The structure is straightforward:

```sql
CREATE TABLE table_name (
   column1 datatype1 PRIMARY KEY,
   column2 datatype2,
   column3 datatype3,
   ...
);
```

-   `table_name`: This is the name you want to give to the new table.
-   `column1, column2, ...`: These are the names of the columns that you want to create in the table.
-   `datatype1, datatype2, ...`: Each column must have a data type associated with it, defining the kind of data it will store, such as `INTEGER`, `VARCHAR`, or `DATE`.
-   `PRIMARY KEY`: This is an optional constraint you can add to a column, indicating it will be used as the unique identifier for the table.

Let's create our Dinosaurs and Enclosures tables using the CREATE TABLE statement:

In [18]:
%%sql
--First, we'll delete the tables if they already exist
DROP TABLE IF EXISTS Enclosures CASCADE;
DROP TABLE IF EXISTS Dinosaurs CASCADE;

--Now, the actual table creation

CREATE TABLE Enclosures (
   enclosure_number INTEGER PRIMARY KEY,
   square_feet INTEGER CHECK (square_feet > 0),
   security_level INTEGER CHECK (security_level BETWEEN 1 AND 5),
   habitat_type VARCHAR(128) NOT NULL,
   notes TEXT
);

CREATE TABLE Dinosaurs (
   dinosaur_id INTEGER PRIMARY KEY,
   name VARCHAR(30) NOT NULL,
   species VARCHAR(50),
   diet VARCHAR(64) CHECK (diet IN ('Herbivore', 'Carnivore', 'Omnivore')),
   enclosure_number INTEGER REFERENCES Enclosures(enclosure_number),
   dob DATE,
   biography TEXT
);


 * postgresql+psycopg2://@/postgres
Done.
Done.
Done.
Done.


[]

Here, two tables are being created: `Enclosures` and `Dinosaurs`. `Enclosures` is created first because `Dinosaurs` refers to it. Each table has been designed to store specific sets of data, and certain rules (constraints) have been applied to ensure that the data is accurate and reliable.

#### Dinosaurs Table

1.  `dinosaur_id`: Every dinosaur gets a unique number (or ID). No two dinosaurs will share the same ID, ensuring that each one can be distinctly identified.

2.  `name`: This is the name of the dinosaur, like "Rex" or "Blue". It's a mandatory field, so every dinosaur must have a name recorded.

3.  `species`: This indicates the species of the dinosaur, such as "Tyrannosaurus" or "Velociraptor".

4.  `diet`: This tells us what the dinosaur eats. There are specific options for this: either 'Herbivore', 'Carnivore', or 'Omnivore'. Any other diet type will not be accepted.

5.  `enclosure_number`: This number shows where the dinosaur is housed in the park. It refers to a specific enclosure from the `Enclosures` table. This creates a connection between the two tables.

6.  `dob`: This is the dinosaur's date of birth.

7.  `biography`: A text section that can contain more detailed information or stories about the dinosaur.

#### Enclosures Table

1.  `enclosure_number`: Every enclosure in the park gets a unique number. This ensures each enclosure can be clearly identified.

2.  `square_feet`: This tells us the size of the enclosure. The size should always be a positive value.

3.  `security_level`: This is an indication of the safety measures in place for the enclosure. The value can range from 1 to 5, with 1 being the least secure and 5 being the most secure.

4.  `habitat_type`: Describes the type of environment the enclosure replicates, like a "Tropical Forest" or "Desert". This field is mandatory, so every enclosure must have a habitat type recorded.

5.  `notes`: A text section where additional information or observations about the enclosure can be recorded.

In essence, these tables allow for organized storage of data about dinosaurs and their living environments within the park. The established rules (constraints) ensure that the data stored is consistent, valid, and maintains the relationships between different pieces of information.
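As a rough illustration of these rules in action, the sketch below recreates the `Enclosures` definition and tries a few rows against it. It assumes Python's built-in `sqlite3` module rather than the PostgreSQL server used in this notebook (SQLite also enforces `CHECK` and `NOT NULL`, though it is looser about column types), and `try_insert` is a hypothetical helper written for this example. The `INSERT INTO` statement it uses is introduced in the next section.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stand-in for the notebook's PostgreSQL
conn.execute("""
CREATE TABLE Enclosures (
    enclosure_number INTEGER PRIMARY KEY,
    square_feet INTEGER CHECK (square_feet > 0),
    security_level INTEGER CHECK (security_level BETWEEN 1 AND 5),
    habitat_type VARCHAR(128) NOT NULL,
    notes TEXT
)
""")


def try_insert(values):
    """Hypothetical helper: True if the row is accepted, False if a constraint rejects it."""
    try:
        conn.execute("INSERT INTO Enclosures VALUES (?, ?, ?, ?, ?)", values)
        return True
    except sqlite3.IntegrityError:
        return False


results = {
    "valid row": try_insert((5, 4000, 5, "Rocky Terrain", None)),
    "negative size": try_insert((6, -100, 3, "Grassland", None)),
    "security level 9": try_insert((7, 2000, 9, "Swamp", None)),
    "missing habitat": try_insert((8, 2000, 2, None, None)),
}
print(results)
# {'valid row': True, 'negative size': False, 'security level 9': False, 'missing habitat': False}
```

Only the first row satisfies every rule. Note that `notes` may be left empty because it has no `NOT NULL` constraint.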

## Introduction to the INSERT INTO Statement

In SQL, once you've established the structure of your tables, the next step is populating them with data. The `INSERT INTO` statement is used for this purpose. It allows you to insert new records (rows of data) into a table.

```sql
INSERT INTO table_name (column1, column2, column3, ...)
VALUES (value1, value2, value3, ...);
```

-   table_name: The name of the table you wish to insert data into.
-   column1, column2, ...: The names of the columns in the table where you want to insert data.
-   value1, value2, ...: The corresponding values for these columns.

It's essential to ensure that the order of columns matches the order of values, and the data types of the values match the data types of the columns.
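The following sketch shows why the ordering matters. It is an illustration under assumptions: it uses Python's built-in `sqlite3` module instead of PostgreSQL and a reduced three-column table. Columns can be listed in any order as long as the values follow the same order; if two text values are swapped by mistake, the database has no way to notice.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stand-in, reduced table for illustration
conn.execute(
    "CREATE TABLE Dinosaurs (dinosaur_id INTEGER PRIMARY KEY, name TEXT, species TEXT)"
)

# Columns listed in a different order than the table definition: fine,
# because the values are given in that same order.
conn.execute(
    "INSERT INTO Dinosaurs (species, name, dinosaur_id) VALUES (?, ?, ?)",
    ("Tyrannosaurus", "Rex", 1),
)

# Values for name and species accidentally reversed: the row is accepted,
# but each value lands in the wrong column.
conn.execute(
    "INSERT INTO Dinosaurs (dinosaur_id, name, species) VALUES (?, ?, ?)",
    (2, "Velociraptor", "Blue"),
)

query = "SELECT dinosaur_id, name, species FROM Dinosaurs WHERE dinosaur_id = ?"
first = conn.execute(query, (1,)).fetchone()
second = conn.execute(query, (2,)).fetchone()
print(first)   # (1, 'Rex', 'Tyrannosaurus')
print(second)  # (2, 'Velociraptor', 'Blue')
```

Matching data types catch some of these mistakes (a name placed in a date column would be rejected by PostgreSQL), but not a swap between two columns of the same type.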

Let's populate our `Dinosaurs` and `Enclosures` tables with some sample data using the `INSERT INTO` statement:

In [19]:
%%sql

--Add an enclosure
INSERT INTO Enclosures (enclosure_number, square_feet, security_level, habitat_type, notes)
VALUES (5, 4000, 5, 'Rocky Terrain', 'High security for predatory dinosaurs.');

--Add a row to the Dinosaur Table
INSERT INTO Dinosaurs (dinosaur_id, name, species, diet, enclosure_number, dob, biography)
VALUES (1, 'Rex', 'Tyrannosaurus', 'Carnivore', 5, '1990-05-20', 'The most fearsome dinosaur in the park.');

 * postgresql+psycopg2://@/postgres
1 rows affected.
1 rows affected.


[]

In this example:

-   We first add a record to the `Enclosures` table for an enclosure with the number 5, which has a size of 4000 square feet, a security level of 5, and is a rocky terrain. It comes first because the dinosaur record below refers to it.
-   We then add a record to the `Dinosaurs` table for a Tyrannosaurus named Rex, who is a carnivore, resides in enclosure number 5, and was born on May 20, 1990.
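The order of the two inserts matters because of the `REFERENCES` constraint on `enclosure_number`. Here is a minimal sketch of what happens when the order is reversed. It assumes Python's built-in `sqlite3` module as a stand-in for PostgreSQL, reduced versions of both tables, and a hypothetical helper named `add_dinosaur`. Unlike PostgreSQL, SQLite only enforces foreign keys after `PRAGMA foreign_keys = ON`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked to
conn.executescript("""
CREATE TABLE Enclosures (
    enclosure_number INTEGER PRIMARY KEY,
    habitat_type TEXT NOT NULL
);
CREATE TABLE Dinosaurs (
    dinosaur_id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    enclosure_number INTEGER REFERENCES Enclosures(enclosure_number)
);
""")


def add_dinosaur(dinosaur_id, name, enclosure_number):
    """Hypothetical helper: True if the row is accepted, False if it is rejected."""
    try:
        conn.execute(
            "INSERT INTO Dinosaurs VALUES (?, ?, ?)",
            (dinosaur_id, name, enclosure_number),
        )
        return True
    except sqlite3.IntegrityError:
        return False


before = add_dinosaur(1, "Rex", 5)  # enclosure 5 does not exist yet
conn.execute("INSERT INTO Enclosures VALUES (5, 'Rocky Terrain')")
after = add_dinosaur(1, "Rex", 5)   # now the reference can be resolved
print(before, after)  # False True
```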

Now, let's use the `INSERT INTO` statement to add more records to these tables.

In [20]:
%%sql
INSERT INTO Enclosures (enclosure_number, square_feet, security_level, habitat_type, notes)
VALUES
(1, 30000, 3, 'Tropical Forest', 'Lush greenery for herbivores.'),
(2, 25000, 4, 'Grassland', 'Open plains for dinosaurs to roam.'),
(3, 45000, 2, 'Swamp', 'Wetland habitat suitable for diverse species.');

 * postgresql+psycopg2://@/postgres
3 rows affected.


[]

In [21]:
%%sql
-- Inserting data for Dinosaurs
INSERT INTO Dinosaurs (dinosaur_id, name, species, diet, enclosure_number, dob, biography)
VALUES
(2, 'Ruby', 'Tyrannosaurus', 'Carnivore', 5, '1991-07-25', 'Younger T-Rex, known for her vibrant scales.'),
(3, 'Blue', 'Velociraptor', 'Carnivore', 2, '1992-03-17', 'The alpha of her raptor pack.'),
(4, 'Echo', 'Velociraptor', 'Carnivore', 2, '1992-11-03', 'Blue''s second-in-command, fiercely loyal.'),
(5, 'Sarah', 'Triceratops', 'Herbivore', 1, '1992-08-11', 'Easily recognizable with her large frill.'),
(6, 'Tommy', 'Triceratops', 'Herbivore', 1, '1993-01-02', 'Youngest Triceratops, still growing his horns.'),
(7, 'Pete', 'Pterodactyl', 'Carnivore', 3, '1991-02-28', 'Often seen soaring above the swamps.'),
(8, 'Daisy', 'Brachiosaurus', 'Herbivore', 2, '1988-12-05', 'Oldest herbivore, with a gentle disposition.'),
(9, 'Bella', 'Brachiosaurus', 'Herbivore', 2, '1989-06-16', 'Daisy''s close companion, often seen together.'),
(10, 'Stego', 'Stegosaurus', 'Herbivore', 3, '1992-05-15', 'Known for the large plates along its spine.'),
(11, 'Stella', 'Stegosaurus', 'Herbivore', 3, '1992-05-16', 'Stego''s twin, with slightly smaller plates.'),
(12, 'Spike', 'Spinosaurus', 'Carnivore', 1, '1993-06-09', 'Prefers staying near water bodies.'),
(13, 'Dino', 'Dilophosaurus', 'Carnivore', 2, '1990-07-23', 'Identified by the frill around its neck.'),
(14, 'Herbie', 'Hadrosaurus', 'Herbivore', 3, '1991-09-10', 'Loud calls can be heard across the park.'),
(15, 'Hank', 'Hadrosaurus', 'Herbivore', 3, '1991-11-12', 'A bit more reclusive than Herbie.');


 * postgresql+psycopg2://@/postgres
14 rows affected.


[]

## Exercise 1: Create a Table for Prehistoric Plants
Create a table to record data about various prehistoric plants found in the park.

Directions:

1.  Name the table `PrehistoricPlants`.
2.  The table should have the following columns:
    -   `plant_id`: An integer that serves as the primary key.
    -   `name`: A variable character string with a maximum length of 30, which should not be null.
    -   `period`: A variable character string with a maximum length of 50 to denote the geological period (e.g., Jurassic, Cretaceous).
    -   `dietary_use`: A variable character string with a maximum length of 64 to denote if it was a primary food source for herbivores or just decorative.

You'll be using `CREATE TABLE` to do this.

In [22]:
%%sql
--Run this cell first to drop old tables
DROP TABLE IF EXISTS PrehistoricPlants CASCADE;

 * postgresql+psycopg2://@/postgres
Done.


[]

In [None]:
%%sql
--Exercise 1 -- Your code below

## Exercise 2: Insert Data into the Prehistoric Plants Table

Populate the `PrehistoricPlants` table with data. Directions:

1.  Add the following plants to the `PrehistoricPlants` table:
    -   Name: "Cycadeoidea", Period: "Jurassic", Dietary Use: "Primary Food Source"
    -   Name: "Williamsonia", Period: "Jurassic", Dietary Use: "Decorative"
2.  Ensure that each entry has a unique `plant_id`.

Here, you'll be using the `INSERT INTO` statement.

In [None]:
%%sql
--Run this cell first to delete old data
DELETE FROM PrehistoricPlants;

## Introduction to SELECT...FROM...WHERE in SQL

The essence of databases lies in the ability to query them, to ask questions and retrieve answers. The `SELECT` statement in SQL is the fundamental tool to achieve this. Let's dissect it piece by piece.

### The Anatomy of a Simple Query

Every query typically has three core components:

-   What you want to select.
-   From where you want to select it.
-   Under what conditions you want to select.

In SQL terms, these map to `SELECT`, `FROM`, and `WHERE`.

### The SELECT Clause

The `SELECT` clause determines which columns you want to view in your results. Think of it as shining a spotlight on specific parts of your table.

```sql
SELECT column_name
```

For instance, if you have a table of books and you only want to view the titles, you'd use:

```sql
SELECT title
FROM books;
```

### The FROM Clause

The `FROM` clause tells the database from which table you're trying to select data. It's like choosing a specific bookshelf in a vast library.

```sql
SELECT column_name FROM table_name
```

If you wanted to see all authors from a 'books' table, it'd be:

```sql
SELECT author
FROM books;
```

### The WHERE Clause

The `WHERE` clause allows you to filter your results based on conditions. It's akin to only selecting books of a certain genre from a shelf.

```sql
SELECT column_name FROM table_name WHERE condition
```

For our book example, if you wished to only view titles of books published after 2000:

```sql
SELECT title
FROM books
WHERE publication_year > 2000;
```

### SELECTing Multiple Columns

You're not limited to selecting just one column. By separating column names with commas, you can retrieve multiple columns:

```sql
SELECT title, author
FROM books;
```

This would display both the title and author for every book in the 'books' table.
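The book queries above can be tried end to end with a small sketch. Everything in it is an assumption made for illustration: the 'books' table does not exist in this notebook, the three rows are sample data, and Python's built-in `sqlite3` module stands in for PostgreSQL. The results are sorted in Python because a query without `ORDER BY` does not guarantee row order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stand-in; 'books' is a hypothetical table
conn.execute("CREATE TABLE books (title TEXT, author TEXT, publication_year INTEGER)")
conn.executemany(
    "INSERT INTO books VALUES (?, ?, ?)",
    [
        ("Jurassic Park", "Michael Crichton", 1990),
        ("The Lost World", "Michael Crichton", 1995),
        ("Project Hail Mary", "Andy Weir", 2021),
    ],
)

# SELECT one column from every row.
titles = sorted(conn.execute("SELECT title FROM books").fetchall())

# Add WHERE to keep only the rows that satisfy a condition.
recent = sorted(
    conn.execute("SELECT title FROM books WHERE publication_year > 2000").fetchall()
)

# SELECT several columns by separating them with commas.
pairs = sorted(conn.execute("SELECT title, author FROM books").fetchall())

print(titles)  # [('Jurassic Park',), ('Project Hail Mary',), ('The Lost World',)]
print(recent)  # [('Project Hail Mary',)]
print(pairs[0])  # ('Jurassic Park', 'Michael Crichton')
```

Each result row comes back as a tuple with one element per selected column, which is why single-column results look like `('Jurassic Park',)`.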

With our understanding of `SELECT`, `FROM`, and `WHERE`, let's consider the 'Dinosaurs' table. If you wanted to know the names and species of all herbivorous dinosaurs:

In [25]:
%%sql
SELECT name, species
FROM Dinosaurs
WHERE diet = 'Herbivore';


 * postgresql+psycopg2://@/postgres
8 rows affected.


| name | species |
| --- | --- |
| Sarah | Triceratops |
| Tommy | Triceratops |
| Daisy | Brachiosaurus |
| Bella | Brachiosaurus |
| Stego | Stegosaurus |
| Stella | Stegosaurus |
| Herbie | Hadrosaurus |
| Hank | Hadrosaurus |
