<a href="https://colab.research.google.com/github/brendanpshea/data-science/blob/main/Data_Science_10_DatabaseSQL.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introducing Databases

At the heart of data management is the concept of a **relational database**. The term 'relational' gives away its essence: it's all about relationships. Many of you will be familiar with data frames from your work with Pandas. A data frame is a single two-dimensional structure in which each row represents an entry and each column an attribute of that entry; a relational database goes further.

A relational database is a collection of interrelated **tables (or "relations")**. Each table in this database is akin to a data frame, but what sets a relational database apart is its ability to establish connections or relationships between these tables. These relationships allow for efficient organization, retrieval, and manipulation of data, especially when dealing with complex datasets.

Let's draw a comparison for better understanding:

1.  Data Frames

    -   Two-dimensional: rows and columns.
    -   Each row is an entry; each column is an attribute or feature.
    -   Useful for self-contained datasets where relationships to other tables are not the main focus.
2.  Relational Databases

    -   Multiple tables: each table is two-dimensional, and a database comprises many of them.
    -   Relationships between tables are defined using keys.
    -   Designed to handle complex datasets where interrelations between data are essential.

Now, with this foundational knowledge, let's dive into a practical application.

## Jurassic Park Database Case Study

Imagine stepping into the vast and thrilling world of Jurassic Park. The park is teeming with a variety of dinosaurs, each housed in its unique enclosure. As park managers, it's crucial to keep track of every dinosaur, its characteristics, its habitat, feeding times, and so much more. A simple list or a single table won't suffice. This is where our relational database comes into play.

In our case study, we'll be exploring a partial Jurassic Park database. This database contains tables that represent different entities like 'Dinosaurs', 'Enclosures', and perhaps 'Park Staff'. These tables not only store information about each entity but also define relationships. For instance, which dinosaur resides in which enclosure? Who is the caretaker responsible for a particular dinosaur? Answering such questions becomes seamless with our relational database.

As we delve deeper into the world of SQL and databases, you'll see how this Jurassic Park scenario helps illuminate the power and flexibility of relational databases in managing and querying data.

## What is a Relational Database?
A relational database is a collection of data items organized as a set of tables. Each table represents a category of data, making it easier to store, retrieve, and manage information.

At the foundation of each table is the **entity**. Think of an entity as the main topic or subject of a table. In our Jurassic Park example, one entity could be 'Dinosaurs'. So, there would be a table dedicated to dinosaurs, containing all relevant information about them. Other potential tables for a theme park database might include things like:

- Enclosures: to store data about the animals' living quarters
- Employees: to store data about employees
- Visitors: to record information about visitors to the park
- And many others...

Each entity has various **attributes**, which are specific pieces of information we want to capture about the entity. Attributes are represented as columns in a table. For the 'Dinosaurs' entity, attributes could include 'name', indicating the dinosaur's name, 'species' specifying its species type, and 'diet' detailing whether it's a herbivore, carnivore, or omnivore. Each row in this table would represent a specific dinosaur, and the data in that row would provide the details for each attribute.

But databases aren't just about storing isolated chunks of data. They shine when showing **relationships** between data. In Jurassic Park, we might want to know which dinosaur resides in which enclosure. This relationship can be represented by linking our 'Dinosaurs' table to another table, 'Enclosures', using shared attributes. In our case, the shared attribute could be 'enclosure_number'.

A vital component in our tables is the **primary key**. This is a unique identifier for each record in a table. In the 'Dinosaurs' table, this could be 'dinosaur_id'. No two dinosaurs would have the same 'dinosaur_id', ensuring that each record is distinct and easily identifiable.

Lastly, we define each attribute with a specific **data type**, which determines the kind of data the attribute can hold. For example, 'dinosaur_id' might be an integer (whole number), 'name' would be text, and 'dob' (date of birth) would be a date. Likewise, in our 'Enclosures' table, the 'square_feet' attribute would be an integer representing the size of the enclosure in square feet. Declaring types up front keeps the stored data consistent and helps prevent errors.

## Two Tables at Jurassic Park

To see how this might look in a concrete case, let's look at a small section of a potential Jurassic Park database.


### Dinosaurs Table

| dinosaur_id | name | species | diet | enclosure_number | dob | biography |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | Rex | Tyrannosaurus | Carnivore | 5 | 1990-05-20 | The most fearsome dinosaur in the park. |
| 2 | Daisy | Brachiosaurus | Herbivore | 3 | 1991-08-14 | Known for her gentle nature and tall neck. |
| 3 | Spike | Triceratops | Herbivore | 4 | 1992-04-03 | Recognized by his three distinctive horns. |

### Enclosures Table

| enclosure_number | square_feet | security_level | habitat_type | notes |
| --- | --- | --- | --- | --- |
| 3 | 5000 | 2 | Tropical Forest | Ideal for herbivores. |
| 4 | 3000 | 3 | Grassland | Open space for dinosaurs to roam freely. |
| 5 | 4000 | 5 | Rocky Terrain | High security for predatory dinosaurs. |

In these tables, we can see the key elements of relational databases:

- **Tables and Attributes.** Each row of the Dinosaurs Table represents an individual dinosaur, while each row of the Enclosures Table represents an individual enclosure. Each column in these tables, like 'name', 'species', and 'diet' in the Dinosaurs Table or 'square_feet' and 'habitat_type' in the Enclosures Table, represents an **attribute** of that entity.
- **Primary Keys.** The 'dinosaur_id' in the Dinosaurs Table and 'enclosure_number' in the Enclosures Table are primary keys. They uniquely identify each record. For example, the dinosaur with 'dinosaur_id' 1 is Rex.
- **Relationships.**  The 'enclosure_number' in the Dinosaurs Table establishes a relationship with the Enclosures Table. It indicates which enclosure a specific dinosaur resides in. For instance, Rex (from the Dinosaurs Table) resides in the enclosure with 'enclosure_number' 5, which is a rocky terrain with high security (from the Enclosures Table).
- **Data Types.**  The tables use different data types for their attributes. For example, 'name' stores text, 'dob' stores date values, and 'dinosaur_id' uses whole numbers.

By understanding these tables and their attributes, we can see how relational databases organize information, define relationships between data points, and ensure each record's uniqueness, all of which are fundamental concepts in database management.
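
To make the idea of a relationship concrete, here is a minimal sketch in plain Python rather than SQL. It copies a few columns from the two tables above into dictionaries keyed by their primary keys; the `habitat_of` helper is a hypothetical name introduced only for this illustration. Following `enclosure_number` from one dictionary to the other is, in miniature, what a database does when it relates two tables.

```python
# Toy stand-ins for the two tables above, keyed by their primary keys.
enclosures = {
    3: {"square_feet": 5000, "security_level": 2, "habitat_type": "Tropical Forest"},
    4: {"square_feet": 3000, "security_level": 3, "habitat_type": "Grassland"},
    5: {"square_feet": 4000, "security_level": 5, "habitat_type": "Rocky Terrain"},
}
dinosaurs = {
    1: {"name": "Rex", "species": "Tyrannosaurus", "enclosure_number": 5},
    2: {"name": "Daisy", "species": "Brachiosaurus", "enclosure_number": 3},
    3: {"name": "Spike", "species": "Triceratops", "enclosure_number": 4},
}


def habitat_of(dinosaur_id):
    """Follow enclosure_number from a dinosaur record to its enclosure record."""
    dino = dinosaurs[dinosaur_id]
    enclosure = enclosures[dino["enclosure_number"]]
    return dino["name"], enclosure["habitat_type"], enclosure["security_level"]


rex_summary = habitat_of(1)
print(rex_summary)  # ('Rex', 'Rocky Terrain', 5)
```

Unlike a real database, nothing here stops us from giving a dinosaur an `enclosure_number` that does not exist; enforcing that rule is one of the things SQL adds.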

## Diving into Records

Within the structured confines of an SQL table, a **record** is a specific set of associated data that collectively provides comprehensive information about an entity. It encapsulates a singular instance of the entity the table represents. In essence, a record is a horizontal collection of values, each belonging to a specific column or attribute of the table.

When drawing parallels with other data structures, it's tempting to equate a record in SQL with a row in a spreadsheet or a dataframe. On the surface, they seem quite similar, both representing horizontal collections of data. However, there are nuanced differences:

1.  **Uniqueness.** In SQL tables, each record is often required to have a unique identifier, known as the primary key. This ensures that every record can be precisely identified and differentiated from all others. While spreadsheets and dataframes can have unique identifiers, they aren't inherently structured to enforce this uniqueness.

2.  **Constraints.** SQL tables are rigid in their structure. They can have constraints that enforce specific rules on the data. For instance, a column might be set to only accept integer values or dates. On the other hand, a spreadsheet row or dataframe is more flexible, allowing any kind of data to be entered without such strict enforcement.

3.  **Relational Integrity.** SQL tables are designed to maintain relational integrity. This means that records in one table can be related to records in another table through foreign keys. While you can create relationships in spreadsheets or dataframes, they don't natively support or enforce these relationships as SQL does.


Venturing into the Jurassic landscape, consider the 'Dinosaurs' table. Each dinosaur, whether it's the towering T-Rex or the agile Velociraptor, occupies its own unique record in the table. This arrangement has several implications:

1. Due to the primary key constraint, typically on a 'dinosaur_id' column, no two dinosaurs can share the same identifier. This ensures that even if there are two T-Rexes in the park, they each have a unique record in the database.

2. The SQL table structure ensures that the data for each dinosaur is consistent. For example, the 'dob' (date of birth) column can only contain valid dates. If someone tried to enter text or a number that doesn't represent a date, the database would reject it.

3.  Using the unique records of the 'Dinosaurs' table, we can delve into the relational aspects of the database. For instance, the 'enclosure_number' might be a foreign key linking each dinosaur to a specific enclosure in the 'Enclosures' table. This showcases how records in one table relate to records in another, providing a holistic view of the park's operations.

In summary, while records in an SQL table might seem similar to rows in a spreadsheet or dataframe at first glance, their inherent properties, especially in terms of uniqueness, constraints, and relational capabilities, set them apart. These distinctions become evident when managing a complex system like a Jurassic Park database.
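
As a quick illustration of the uniqueness point, the sketch below uses Python's built-in `sqlite3` module with an in-memory database. This is an assumption made so the example runs anywhere; it is not the PostgreSQL setup used in the rest of this notebook, and SQLite is more lenient than PostgreSQL about column types, so only the primary key rule is shown.

```python
import sqlite3

# In-memory SQLite database: a lightweight stand-in for PostgreSQL.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Dinosaurs (dinosaur_id INTEGER PRIMARY KEY, name TEXT NOT NULL)"
)
conn.execute("INSERT INTO Dinosaurs VALUES (1, 'Rexie')")

# A second record with the same primary key violates the uniqueness rule.
try:
    conn.execute("INSERT INTO Dinosaurs VALUES (1, 'Chomper')")
    duplicate_rejected = False
except sqlite3.IntegrityError as err:
    duplicate_rejected = True
    print("Rejected:", err)

# Only the first record was stored.
row_count = conn.execute("SELECT COUNT(*) FROM Dinosaurs").fetchone()[0]
print(row_count)
```

A dataframe would happily accept both rows; the database refuses the second one, so two T-Rexes can never end up sharing an identifier.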

## Launching PostgreSQL with SQL magic

Before creating tables in **PostgreSQL** (an open-source database), let's set up the environment in a Google Colab notebook. Execute the following commands in a single cell:

In [1]:
# Install and launch PostgreSQL as a "Superuser"
!apt install postgresql postgresql-contrib &>log
!service postgresql start
!sudo -u postgres psql -c "CREATE USER root WITH SUPERUSER"

# set connection
%load_ext sql
%sql postgresql+psycopg2://@/postgres


 * Starting PostgreSQL 14 database server
   ...done.
CREATE ROLE


## Introduction to the CREATE TABLE Statement
With PostgreSQL up and running, it's time to delve into one of the foundational SQL commands: the `CREATE TABLE` statement. This command allows us to define and create a new table in the database.

The structure is straightforward:

```sql
CREATE TABLE table_name (
   column1 datatype1 PRIMARY KEY,
   column2 datatype2,
   column3 datatype3,
   ...
);
```

-   `table_name`: This is the name you want to give to the new table.
-   `column1, column2, ...`: These are the names of the columns that you want to create in the table.
-   `datatype1, datatype2, ...`: Each column must have a data type associated with it, defining the kind of data it will store, such as `INTEGER`, `VARCHAR`, or `DATE`.
-   `PRIMARY KEY`: This is an optional constraint you can add to a column, indicating it will be used as the unique identifier for the table.

Let's create our Dinosaurs and Enclosures tables using the CREATE TABLE statement:

In [27]:
%%sql
--First, we'll delete the tables if they already exist
DROP TABLE IF EXISTS Enclosures CASCADE;
DROP TABLE IF EXISTS Dinosaurs CASCADE;

--Now, the actual table creation

CREATE TABLE Enclosures (
   enclosure_number INTEGER PRIMARY KEY,
   square_feet INTEGER CHECK (square_feet > 0),
   security_level INTEGER CHECK (security_level BETWEEN 1 AND 5),
   habitat_type VARCHAR(128) NOT NULL,
   min_temp_c FLOAT,
   max_temp_c FLOAT
);

CREATE TABLE Dinosaurs (
   dinosaur_id INTEGER PRIMARY KEY,
   name VARCHAR(30) NOT NULL,
   species VARCHAR(50),
   diet VARCHAR(64) CHECK (diet IN ('Herbivore', 'Carnivore', 'Omnivore')),
   enclosure_number INTEGER REFERENCES Enclosures(enclosure_number),
   dob DATE,
   weight_kg INT,
   length_m FLOAT  -- FLOAT rather than INT: an INT column would round fractional lengths such as 6.5
);


 * postgresql+psycopg2://@/postgres
Done.
Done.
Done.
Done.


[]

Here, two tables are being created: `Dinosaurs` and `Enclosures`. Each table has been designed to store specific sets of data, and certain rules (constraints) have been applied to ensure that the data is accurate and reliable.

#### Enclosures Table

1.  `enclosure_number`: Every enclosure in the park gets a unique number. This ensures each enclosure can be clearly identified.

2.  `square_feet`: This tells us the size of the enclosure. The size must always be a positive value.

3.  `security_level`: This is an indication of the safety measures in place for the enclosure. The value can range from 1 to 5, with 1 being the least secure and 5 being the most secure.

4.  `habitat_type`: Describes the type of environment the enclosure replicates, like a "Tropical Forest" or "Desert". This field is mandatory, so every enclosure must have a habitat type recorded.

5.  `min_temp_c` and `max_temp_c`: These record the enclosure's minimum and maximum temperatures in degrees Celsius.


#### Dinosaurs Table

1.  `dinosaur_id`: Every dinosaur gets a unique number (or ID). No two dinosaurs will share the same ID, ensuring that each one can be distinctly identified.

2.  `name`: This is the name of the dinosaur, like "Rex" or "Blue". It's a mandatory field, so every dinosaur must have a name recorded.

3.  `species`: This indicates the species of the dinosaur, such as "Tyrannosaurus" or "Velociraptor".

4.  `diet`: This tells us what the dinosaur eats. There are specific options for this: either 'Herbivore', 'Carnivore', or 'Omnivore'. Any other diet type will not be accepted.

5.  `enclosure_number`: This number shows where the dinosaur is housed in the park. It refers to a specific enclosure from the `Enclosures` table. This creates a connection between the two tables.

6.  `dob`: This is the dinosaur's date of birth.

7.  `weight_kg` and `length_m`: These record the dinosaur's weight in kilograms and length in meters.


In essence, these tables allow for organized storage of data about dinosaurs and their living environments within the park. The established rules (constraints) ensure that the data stored is consistent, valid, and maintains the relationships between different pieces of information.
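
To watch these constraints at work without touching the PostgreSQL tables, here is a hedged sketch using Python's built-in `sqlite3` module (an assumption chosen for portability, not the notebook's own setup). The tables are trimmed-down versions of the ones above, and the `bad_rows` labels are made up for this illustration. Note that SQLite only enforces foreign keys after `PRAGMA foreign_keys = ON`, whereas PostgreSQL always enforces them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves foreign keys off by default
conn.executescript("""
CREATE TABLE Enclosures (
    enclosure_number INTEGER PRIMARY KEY,
    security_level INTEGER CHECK (security_level BETWEEN 1 AND 5),
    habitat_type VARCHAR(128) NOT NULL
);
CREATE TABLE Dinosaurs (
    dinosaur_id INTEGER PRIMARY KEY,
    name VARCHAR(30) NOT NULL,
    diet VARCHAR(64) CHECK (diet IN ('Herbivore', 'Carnivore', 'Omnivore')),
    enclosure_number INTEGER REFERENCES Enclosures(enclosure_number)
);
INSERT INTO Enclosures VALUES (1, 5, 'Tropical Rainforest');
""")

# Each row breaks exactly one rule from the table definition.
bad_rows = {
    "unknown diet": (1, "Rexie", "Insectivore", 1),
    "missing name": (2, None, "Carnivore", 1),
    "nonexistent enclosure": (3, "Blue", "Carnivore", 99),
}
rejected = []
for label, row in bad_rows.items():
    try:
        conn.execute("INSERT INTO Dinosaurs VALUES (?, ?, ?, ?)", row)
    except sqlite3.IntegrityError as err:
        rejected.append(label)
        print(f"{label}: {err}")

# A row that satisfies every constraint is accepted.
conn.execute("INSERT INTO Dinosaurs VALUES (?, ?, ?, ?)", (1, "Rexie", "Carnivore", 1))
accepted = conn.execute("SELECT COUNT(*) FROM Dinosaurs").fetchone()[0]
print(accepted)
```

All three bad rows are turned away and only the valid one is stored, which is the sense in which constraints keep the data "consistent and valid".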

## Introduction to the INSERT INTO Statement

In SQL, once you've established the structure of your tables, the next step is populating them with data. The `INSERT INTO` statement is used for this purpose. It allows you to insert new records (rows of data) into a table.

```sql
INSERT INTO table_name (column1, column2, column3, ...)
VALUES (value1, value2, value3, ...);
```

-   `table_name`: The name of the table you wish to insert data into.
-   `column1, column2, ...`: The names of the columns in the table where you want to insert data.
-   `value1, value2, ...`: The corresponding values for these columns.

It's essential to ensure that the order of columns matches the order of values, and the data types of the values match the data types of the columns.
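
The following sketch shows how values are paired with columns by position. It uses Python's built-in `sqlite3` module and a cut-down `Enclosures` table, both assumptions made to keep the example self-contained; the same `INSERT INTO` statements would work in PostgreSQL with literal values in place of the `?` placeholders.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Enclosures (
        enclosure_number INTEGER PRIMARY KEY,
        square_feet INTEGER,
        habitat_type VARCHAR(128) NOT NULL
    )
""")

# Columns are listed in a different order than in CREATE TABLE;
# each value is matched to the column at the same position in the list.
conn.execute(
    "INSERT INTO Enclosures (habitat_type, enclosure_number, square_feet) VALUES (?, ?, ?)",
    ("Desert", 4, 36000),
)

# A column left out of the column list is stored as NULL (None in Python).
conn.execute(
    "INSERT INTO Enclosures (enclosure_number, habitat_type) VALUES (?, ?)",
    (5, "Wetlands"),
)

rows = sorted(
    conn.execute("SELECT enclosure_number, square_feet, habitat_type FROM Enclosures")
)
print(rows)  # [(4, 36000, 'Desert'), (5, None, 'Wetlands')]
```

Swapping two values without swapping the matching column names would put data in the wrong column, or fail outright if a constraint catches it.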

Let's populate our `Dinosaurs` and `Enclosures` tables with some sample data using the `INSERT INTO` statement:

In [28]:
%%sql
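--Delete existing data (in case you re-run this cell). If Dinosaurs already contains rows,
--run DELETE FROM Dinosaurs first: its rows reference these enclosures, so this DELETE would fail.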
DELETE FROM Enclosures;

INSERT INTO Enclosures (enclosure_number, square_feet, security_level, habitat_type, min_temp_c, max_temp_c) VALUES
(1, 50000, 5, 'Tropical Rainforest', 20, 35),
(2, 250000, 4, 'Grasslands', 15, 28),
(3, 180000, 3, 'Woodlands', 10, 25),
(4, 36000, 2, 'Desert', 25, 40),
(5, 22000, 5, 'Wetlands', 16, 30);


 * postgresql+psycopg2://@/postgres
0 rows affected.
5 rows affected.


[]

In [32]:
%%sql
--Delete existing data (in case you re-run this cell)
DELETE FROM Dinosaurs;

-- Inserting data for Dinosaurs
INSERT INTO Dinosaurs (dinosaur_id, name, species, diet, enclosure_number, dob, weight_kg, length_m) VALUES
(1, 'Rexie', 'Tyrannosaurus Rex', 'Carnivore', 1, '1990-07-18', 7000, 12),
(2, 'Blue', 'Velociraptor', 'Carnivore', 2, '1993-05-14', NULL, 2),
(3, 'Ducky', 'Brachiosaurus', 'Herbivore', 1, '1989-04-16', 50000, 30),
(4, 'Spike', 'Stegosaurus', 'Herbivore', 3, NULL, 3100, 9),
(5, 'Chomper', 'Tyrannosaurus Rex', 'Carnivore', 1, '1991-09-02', 8000, NULL),
(6, 'Littlefoot', 'Apatosaurus', 'Herbivore', 2, '1988-11-07', 22000, 21),
(7, 'Cera', 'Triceratops', 'Herbivore', 3, '1990-01-30', 6000, 8),
(8, 'Petrie', 'Pteranodon', 'Carnivore', NULL, '1992-07-22', 20, 6),
(9, 'Munch', 'Ankylosaurus', 'Herbivore', 4, '1993-12-11', 6000, 6.5),
(10, 'Blink', 'Velociraptor', 'Carnivore', 2, '1994-03-15', NULL, 2.5),
(11, 'Ivy', 'Diplodocus', 'Herbivore', 5, '1987-05-06', 12000, 27),
(12, 'Echo', 'Velociraptor', 'Carnivore', 2, NULL, 90, 2),
(13, 'Delta', 'Velociraptor', 'Carnivore', 2, '1995-10-23', 100, 2),
(14, 'Ruby', 'Gallimimus', 'Omnivore', 5, '1992-04-01', 440, 6),
(15, 'Scar', 'Allosaurus', 'Carnivore', 4, '1990-06-30', 1500, 8.5);


 * postgresql+psycopg2://@/postgres
0 rows affected.
15 rows affected.


[]

## Exercise 1: Create a Table for Prehistoric Plants
Create a table to record data about various prehistoric plants found in the park.

Directions:

1.  Name the table `PrehistoricPlants`.
2.  The table should have the following columns:
    -   `plant_id`: An integer that serves as the primary key.
    -   `name`: A variable character string with a maximum length of 30, which should not be null.
    -   `period`: A variable character string with a maximum length of 50 to denote the geological period (e.g., Jurassic, Cretaceous).
    -   `dietary_use`: A variable character string with a maximum length of 64 to denote if it was a primary food source for herbivores or just decorative.

You'll be using `CREATE TABLE` to do this.

In [6]:
%%sql
--Run this cell if you need to drop old tables
DROP TABLE IF EXISTS PrehistoricPlants CASCADE;

 * postgresql+psycopg2://@/postgres
Done.


[]

In [7]:
%%sql
--Exercise 1 -- Your code below



## Exercise 2: Insert Data into the Prehistoric Plants Table

Populate the `PrehistoricPlants` table with data. Directions:

1.  Add the following plants to the `PrehistoricPlants` table:
    -   Name: "Cycadeoidea", Period: "Jurassic", Dietary Use: "Primary Food Source"
    -   Name: "Williamsonia", Period: "Jurassic", Dietary Use: "Decorative"
2.  Ensure that each entry has a unique `plant_id`.

Here, you'll be using the `INSERT INTO` statement.

In [8]:
%%sql
--Run this cell if you need to delete old data
DELETE FROM PrehistoricPlants;



## Introduction to SELECT...FROM...WHERE in SQL

The essence of databases lies in the ability to query them, to ask questions and retrieve answers. The `SELECT` statement in SQL is the fundamental tool to achieve this. Let's dissect it piece by piece.

### The Anatomy of a Simple Query

Every query typically has three core components:

-   What you want to select.
-   From where you want to select it.
-   Under what conditions you want to select.

In SQL terms, these map to `SELECT`, `FROM`, and `WHERE`.

### The SELECT Clause

The `SELECT` clause determines which columns you want to view in your results. Think of it as shining a spotlight on specific parts of your table.

```sql
SELECT column_name
```

For instance, if you have a table of books and you only want to view the titles, you'd use:

```sql
SELECT title
FROM books;
```

### The FROM Clause

The `FROM` clause tells the database from which table you're trying to select data. It's like choosing a specific bookshelf in a vast library.

```sql
SELECT column_name FROM table_name
```

If you wanted to see all authors from a 'books' table, it'd be:

```sql
SELECT author
FROM books;
```

### The WHERE Clause

The `WHERE` clause allows you to filter your results based on conditions. It's akin to only selecting books of a certain genre from a shelf.

```sql
SELECT column_name FROM table_name WHERE condition
```

For our book example, if you wished to only view titles of books published after 2000:

```sql
SELECT title
FROM books
WHERE publication_year > 2000;
```

### SELECTing Multiple Columns

You're not limited to selecting just one column. By separating column names with commas, you can retrieve multiple columns:

```sql
SELECT title, author
FROM books;
```

This would display both the title and author for every book in the 'books' table.
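
The book queries above have no table behind them in this notebook, so here is a small runnable sketch using Python's built-in `sqlite3` module. The `books` table and its third row are invented for illustration; only the three SQL queries come from the text above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books (title TEXT, author TEXT, publication_year INTEGER)")
conn.executemany(
    "INSERT INTO books VALUES (?, ?, ?)",
    [
        ("Jurassic Park", "Michael Crichton", 1990),
        ("The Lost World", "Michael Crichton", 1995),
        ("A Hypothetical Sequel", "A. N. Author", 2005),  # made-up row
    ],
)

# SELECT one column FROM a table: every row, titles only.
titles = sorted(row[0] for row in conn.execute("SELECT title FROM books"))

# Add WHERE to keep only the rows that satisfy a condition.
recent_titles = [
    row[0]
    for row in conn.execute("SELECT title FROM books WHERE publication_year > 2000")
]

# SELECT several columns by separating them with commas.
title_author_pairs = sorted(conn.execute("SELECT title, author FROM books"))

print(titles)
print(recent_titles)
print(title_author_pairs)
```

Each result row comes back as a tuple with one element per selected column, which is why the single-column queries unpack `row[0]`.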

With our understanding of `SELECT`, `FROM`, and `WHERE`, let's consider the 'Dinosaurs' table. If you wanted to know the names and species of all herbivorous dinosaurs:

In [33]:
%%sql
SELECT name, species
FROM Dinosaurs
WHERE diet = 'Herbivore';


 * postgresql+psycopg2://@/postgres
6 rows affected.


name,species
Ducky,Brachiosaurus
Spike,Stegosaurus
Littlefoot,Apatosaurus
Cera,Triceratops
Munch,Ankylosaurus
Ivy,Diplodocus


## Filtering Data With `WHERE`

The `WHERE` clause in SQL serves as a filter applied to rows in a database table, enabling the selection of records that fulfill a specific criterion. It's an indispensable tool for sifting through data. Here's a breakdown of its syntax:

1.  **Comparison Operators** are symbols (such as `<`, `=`, `>`, etc.) that denote how one value compares to another. They are foundational to most queries and can be used to filter data numerically, textually, and chronologically.

2.  **Logical Operators** include `AND`, `OR`, and `NOT`. They allow for the combination of multiple conditions, either to narrow down results (`AND`) or broaden them (`OR`), or to specifically exclude certain records (`NOT`).

3. The `BETWEEN` operator is used for range conditions and is inclusive, meaning it will select records where the column's value lies within the given range.

4.  `IN` is used to filter records where a column's value matches any in a provided list. It's a shorthand for multiple `OR` conditions.

5.  `LIKE` is used for pattern matching in strings. `%` represents any sequence of characters (including none), and `_` represents exactly one character. For example:
  - `LIKE 'The%'` means "find any string starting with 'The'."
  - `LIKE '%y'` means "find any string ending with 'y'."
  - `LIKE '%Rex%'` means "find any string with 'Rex' anywhere in it."

6.  `IS NULL` checks for missing values. A field is `NULL` when no value was supplied for it at record creation. Because `NULL` means "unknown", it must be tested with `IS NULL` (or `IS NOT NULL`) rather than `= NULL`.

In [35]:
%%sql
-- This query selects the names and species of herbivorous dinosaurs.
SELECT name, species
FROM Dinosaurs
-- diet = 'Herbivore' is a comparison operation checking for equality.
WHERE diet = 'Herbivore';

 * postgresql+psycopg2://@/postgres
6 rows affected.


name,species
Ducky,Brachiosaurus
Spike,Stegosaurus
Littlefoot,Apatosaurus
Cera,Triceratops
Munch,Ankylosaurus
Ivy,Diplodocus


In [36]:
%%sql
-- This query finds dinosaurs born before the year 2000 that are also carnivores.
SELECT name, dob
FROM Dinosaurs
-- The logical operator 'AND' combines two conditions.
WHERE diet = 'Carnivore' AND dob < '2000-01-01';

 * postgresql+psycopg2://@/postgres
7 rows affected.


name,dob
Rexie,1990-07-18
Blue,1993-05-14
Chomper,1991-09-02
Petrie,1992-07-22
Blink,1994-03-15
Delta,1995-10-23
Scar,1990-06-30


In [37]:
%%sql
-- This query retrieves the names of dinosaurs whose species fall within a set list.
SELECT name
FROM Dinosaurs
-- 'IN' allows for matching against multiple values. Matches must be exact:
-- 'T-Rex' finds nothing here because the table stores 'Tyrannosaurus Rex'.
WHERE species IN ('T-Rex', 'Velociraptor', 'Triceratops');

 * postgresql+psycopg2://@/postgres
5 rows affected.


name
Blue
Cera
Blink
Echo
Delta


In [38]:
%%sql
-- Let's find enclosures that have between 10,000 and 40,000 sqft
SELECT * -- This means select "all columns"
FROM Enclosures
WHERE square_feet BETWEEN 10000 AND 40000;

 * postgresql+psycopg2://@/postgres
2 rows affected.


enclosure_number,square_feet,security_level,habitat_type,min_temp_c,max_temp_c
4,36000,2,Desert,25.0,40.0
5,22000,5,Wetlands,16.0,30.0


In [39]:
%%sql
-- This query searches for dinosaurs whose names start with 'S'
SELECT name
FROM Dinosaurs
-- 'LIKE' with '%' wildcard matches any sequence of characters.
WHERE name LIKE 'S%';

 * postgresql+psycopg2://@/postgres
2 rows affected.


name
Spike
Scar


In [40]:
%%sql
-- This query finds dinosaurs with no recorded date of birth.
SELECT name
FROM Dinosaurs
-- 'IS NULL' checks for the absence of data.
WHERE dob IS NULL;

 * postgresql+psycopg2://@/postgres
2 rows affected.


name
Spike
Echo
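
The cells above do not exercise `OR`, `NOT`, the `_` wildcard, or the difference between `= NULL` and `IS NULL`, so here is a hedged sketch that does. It uses Python's built-in `sqlite3` module and a five-row excerpt of the `Dinosaurs` data (an assumption for portability); the `names` helper is a hypothetical convenience function, not part of any library.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Dinosaurs (name TEXT, species TEXT, diet TEXT, dob TEXT, weight_kg INTEGER)"
)
conn.executemany(
    "INSERT INTO Dinosaurs VALUES (?, ?, ?, ?, ?)",
    [
        ("Rexie", "Tyrannosaurus Rex", "Carnivore", "1990-07-18", 7000),
        ("Blue", "Velociraptor", "Carnivore", "1993-05-14", None),
        ("Spike", "Stegosaurus", "Herbivore", None, 3100),
        ("Ruby", "Gallimimus", "Omnivore", "1992-04-01", 440),
        ("Scar", "Allosaurus", "Carnivore", "1990-06-30", 1500),
    ],
)


def names(where):
    """Return the sorted names matching a WHERE condition (trusted input only)."""
    query = "SELECT name FROM Dinosaurs WHERE " + where
    return sorted(row[0] for row in conn.execute(query))


# NOT excludes rows that satisfy the condition.
not_carnivores = names("NOT diet = 'Carnivore'")

# OR keeps rows that satisfy either condition.
heavy_or_omnivore = names("weight_kg > 5000 OR diet = 'Omnivore'")

# Four underscores match names that are exactly four characters long.
four_letter_names = names("name LIKE '____'")

# Comparing with = NULL never matches; IS NULL is the correct test.
equals_null = names("dob = NULL")
is_null = names("dob IS NULL")

print(not_carnivores, heavy_or_omnivore, four_letter_names, equals_null, is_null)
```

Blue has no recorded weight, so the comparison `weight_kg > 5000` is unknown for that row and Blue is left out of `heavy_or_omnivore`.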


## Overview of SQL JOIN

The real power of relational databases lies in their ability to "relate" different tables. The `JOIN` operation combines rows from two or more tables based on a related column between them, known as a "key." An (INNER) JOIN specifically retrieves records that have matching values in both tables.

To see how this works, imagine two tables as two different sets of information. An INNER JOIN effectively finds the intersection of these two sets, where the specified condition is met in both. It's akin to the overlapping section in a Venn diagram where both sets meet.

```sql
SELECT column_name(s)
FROM table1
INNER JOIN table2
ON table1.column_name = table2.column_name;
```

Here, `table1` and `table2` are the tables from which you want to fetch data. The `ON` clause is critical as it specifies the column on which the join will be based. The result of this operation is a new table that combines columns from `table1` and `table2`, including only those rows where the join condition is true.

For example, consider a database with two tables: `Employees` and `Departments`.

-   `Employees` might contain: EmployeeID, Name, DeptID.
-   `Departments` might contain: DeptID, DepartmentName.

If you want to list all employees along with their respective department names, you would use an INNER JOIN to join on the common column, which is `DeptID` in both tables.

```sql
SELECT Employees.Name, Departments.DepartmentName
FROM Employees
INNER JOIN Departments
ON Employees.DeptID = Departments.DeptID;
```

This query joins the `Employees` table with the `Departments` table where the `DeptID` is matching in both tables, and selects the `Name` of the employee along with their `DepartmentName`. Only employees who have a department will be included in the results, thanks to the nature of the INNER JOIN.
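
Here is a runnable sketch of that query, using Python's built-in `sqlite3` module. The employee and department rows are invented for illustration; only the table layout and the `INNER JOIN` query come from the example above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Departments (DeptID INTEGER PRIMARY KEY, DepartmentName TEXT);
CREATE TABLE Employees (EmployeeID INTEGER PRIMARY KEY, Name TEXT, DeptID INTEGER);
INSERT INTO Departments VALUES (1, 'Veterinary'), (2, 'Security'), (3, 'Gift Shop');
INSERT INTO Employees VALUES
    (10, 'Ellie', 1),
    (11, 'Robert', 2),
    (12, 'Dennis', NULL);  -- made-up employee with no department
""")

# Only rows whose DeptID appears in BOTH tables survive the INNER JOIN.
pairs = sorted(conn.execute("""
    SELECT Employees.Name, Departments.DepartmentName
    FROM Employees
    INNER JOIN Departments
    ON Employees.DeptID = Departments.DeptID
"""))
print(pairs)  # [('Ellie', 'Veterinary'), ('Robert', 'Security')]
```

Dennis (no department) and the Gift Shop (no employees) both drop out, because neither has a matching row on the other side of the join.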

### Example (SQL JOIN): Matching Dinosaurs to Their Enclosures

To find out which dinosaur is in which enclosure, we can join the `Dinosaurs` table with the `Enclosures` table using the `enclosure_number` as the key.

In [41]:
%%sql
SELECT Dinosaurs.name AS DinosaurName,
  Enclosures.habitat_type AS HabitatType
FROM Dinosaurs JOIN Enclosures
  ON Dinosaurs.enclosure_number = Enclosures.enclosure_number;

 * postgresql+psycopg2://@/postgres
14 rows affected.


dinosaurname,habitattype
Rexie,Tropical Rainforest
Blue,Grasslands
Ducky,Tropical Rainforest
Spike,Woodlands
Chomper,Tropical Rainforest
Littlefoot,Grasslands
Cera,Woodlands
Munch,Desert
Blink,Grasslands
Ivy,Wetlands
Echo,Grasslands
Delta,Grasslands
Ruby,Wetlands
Scar,Desert


This query selects the name of each dinosaur and the type of habitat it lives in. A plain `JOIN` is shorthand for `INNER JOIN`: each dinosaur is matched with its enclosure, but only where its enclosure number exists in both tables. Petrie, whose `enclosure_number` is `NULL`, has no match, which is why the result has 14 rows rather than 15.

### Example (SQL JOIN): Finding Carnivorous Dinosaurs and Their Security Levels

Suppose we want to identify all carnivorous dinosaurs and the security level of their enclosures.

In [42]:
%%sql
SELECT Dinosaurs.name AS DinosaurName,
  Dinosaurs.diet,
  Enclosures.security_level AS SecurityLevel
FROM Dinosaurs INNER JOIN Enclosures
  ON Dinosaurs.enclosure_number = Enclosures.enclosure_number
WHERE Dinosaurs.diet = 'Carnivore';


 * postgresql+psycopg2://@/postgres
7 rows affected.


dinosaurname,diet,securitylevel
Rexie,Carnivore,5
Blue,Carnivore,4
Chomper,Carnivore,5
Blink,Carnivore,4
Echo,Carnivore,4
Delta,Carnivore,4
Scar,Carnivore,2


In this query, we're joining the `Dinosaurs` table with the `Enclosures` table, again on enclosure_number, but we're also filtering the results with a `WHERE` clause to only include dinosaurs whose diet is 'Carnivore'. This shows us the security levels for enclosures containing carnivorous species.

## GROUP BY Clause

The `GROUP BY` clause in SQL arranges rows that share the same value(s) into groups. Combined with aggregate functions such as `COUNT`, `MAX`, `MIN`, `SUM`, and `AVG`, it collapses each group into a single summary row.
The syntax is:

```sql
SELECT column_name(s), AGGREGATE_FUNCTION(column_name)
FROM table_name
WHERE condition
GROUP BY column_name(s);
```

-   `column_name(s)`: The columns by which the result set is grouped.
-   `AGGREGATE_FUNCTION`: An SQL function like `COUNT`, `SUM`, `AVG`, `MAX`, or `MIN`.
-   `table_name`: The name of the table from which to retrieve records.
-   `condition`: A condition to filter the result set before it is grouped.

## HAVING Clause

The `HAVING` clause is like a `WHERE` clause but for grouped records. Since the `WHERE` clause cannot be used with aggregate functions, `HAVING` is used to filter the results returned by the `GROUP BY` clause.

```sql
SELECT column_name(s), AGGREGATE_FUNCTION(column_name)
FROM table_name
WHERE condition
GROUP BY column_name(s)
HAVING condition;
```

-   `condition`: In the context of `HAVING`, this is a condition on a group, typically a comparison involving an aggregate function (for example, `COUNT(*) > 2`).
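
The difference between `WHERE` and `HAVING` is easiest to see side by side. The sketch below uses Python's built-in `sqlite3` module and a six-row excerpt of the dinosaur data (assumptions made so it runs on its own): `WHERE` filters individual rows before they are grouped, while `HAVING` filters whole groups afterwards.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Dinosaurs (name TEXT, diet TEXT, enclosure_number INTEGER)")
conn.executemany(
    "INSERT INTO Dinosaurs VALUES (?, ?, ?)",
    [
        ("Rexie", "Carnivore", 1), ("Ducky", "Herbivore", 1), ("Chomper", "Carnivore", 1),
        ("Blue", "Carnivore", 2), ("Littlefoot", "Herbivore", 2),
        ("Spike", "Herbivore", 3),
    ],
)

# GROUP BY alone: one summary row per enclosure.
per_enclosure = sorted(conn.execute("""
    SELECT enclosure_number, COUNT(name)
    FROM Dinosaurs
    GROUP BY enclosure_number
"""))

# WHERE runs first, so only carnivores are counted in each group.
carnivores_per_enclosure = sorted(conn.execute("""
    SELECT enclosure_number, COUNT(name)
    FROM Dinosaurs
    WHERE diet = 'Carnivore'
    GROUP BY enclosure_number
"""))

# HAVING runs after grouping, so it can test the aggregate itself.
crowded = sorted(conn.execute("""
    SELECT enclosure_number, COUNT(name)
    FROM Dinosaurs
    GROUP BY enclosure_number
    HAVING COUNT(name) >= 2
"""))

print(per_enclosure)             # [(1, 3), (2, 2), (3, 1)]
print(carnivores_per_enclosure)  # [(1, 2), (2, 1)]
print(crowded)                   # [(1, 3), (2, 2)]
```

Enclosure 3 vanishes from the second result because it holds no carnivores, and from the third because its group has fewer than two members.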

## Aggregate Functions
Aggregate functions perform a calculation on a set of values and return a single value. They are used with the `SELECT` statement, and are often used with `GROUP BY` and `HAVING`.

-   `COUNT()`: Returns the number of rows that match a specified criterion.
-   `SUM()`: Returns the total sum of a numeric column.
-   `AVG()`: Returns the average value of a numeric column.
-   `MAX()`: Returns the largest value of the selected column.
-   `MIN()`: Returns the smallest value of the selected column.
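
One detail worth seeing in action is how these functions treat `NULL`. The sketch below uses Python's built-in `sqlite3` module (an assumption for portability) and the weights of the five Grasslands residents from our sample data, two of which are missing. Except for `COUNT(*)`, the aggregate functions skip `NULL` values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Dinosaurs (name TEXT, weight_kg INTEGER)")
conn.executemany(
    "INSERT INTO Dinosaurs VALUES (?, ?)",
    [
        ("Blue", None), ("Blink", None),  # weight not recorded
        ("Echo", 90), ("Delta", 100), ("Littlefoot", 22000),
    ],
)

total_rows, known_weights, total, lightest, heaviest, average = conn.execute("""
    SELECT COUNT(*), COUNT(weight_kg), SUM(weight_kg),
           MIN(weight_kg), MAX(weight_kg), AVG(weight_kg)
    FROM Dinosaurs
""").fetchone()

print(total_rows, known_weights)  # 5 rows, but only 3 known weights
print(total, lightest, heaviest)  # 22190 90 22000
print(round(average, 2))          # 22190 / 3, not 22190 / 5
```

The average is taken over the three known weights only, so missing data does not drag it toward zero.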

### Example (SQL GROUP BY): Counting Dinosaurs Per Enclosure
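To get a sense of how `GROUP BY` works with aggregate functions, let's count the number of dinosaurs per enclosure. (PostgreSQL lets us select `habitat_type` without grouping by it because we group by the table's primary key, `enclosure_number`, which determines it.)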

In [45]:
%%sql
SELECT
  Enclosures.enclosure_number,
  Enclosures.habitat_type,
  COUNT(Dinosaurs.dinosaur_id) AS "Number of Dinosaurs"
FROM Dinosaurs JOIN Enclosures
    ON Dinosaurs.enclosure_number = Enclosures.enclosure_number
GROUP BY Enclosures.enclosure_number;


 * postgresql+psycopg2://@/postgres
5 rows affected.


enclosure_number,habitat_type,Number of Dinosaurs
3,Woodlands,2
5,Wetlands,2
4,Desert,2
2,Grasslands,5
1,Tropical Rainforest,3


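### Example (SQL GROUP BY): Average Dinosaur Weight by Habitat
Here we combine `GROUP BY` with `HAVING` to compute the average dinosaur weight for each habitat type, keeping only habitats whose average exceeds 500 kg.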

In [46]:
%%sql
SELECT
    Enclosures.habitat_type,
    AVG(Dinosaurs.weight_kg) AS AverageWeight
FROM Dinosaurs JOIN Enclosures
  ON Dinosaurs.enclosure_number = Enclosures.enclosure_number
GROUP BY Enclosures.habitat_type
HAVING AVG(Dinosaurs.weight_kg) > 500;


 * postgresql+psycopg2://@/postgres
5 rows affected.


habitat_type,averageweight
Tropical Rainforest,21666.666666666668
Wetlands,6220.0
Grasslands,7396.666666666665
Desert,3750.0
Woodlands,4550.0


## Dr. Ian Malcolm's Guide to Writing Good Queries
(For those who don't know, Dr. Ian Malcolm is the "crazy (but correct) scientist" character in the Jurassic Park movies.)

### Know Your Territory

Before you write a query, get to know your database like I know the Jurassic landscape -- intimately. Understand the lay of the land (your tables), the creatures that roam it (the data), and how they interact (relationships). Remember, a database without relationships is like a park without dinosaurs -- not much to see here.

*Example:*
```sql
-- This is like recognizing that T-Rex doesn't want to be fed, it wants to hunt.
-- So, we JOIN the 'Dinosaurs' table with the 'Enclosures' to see where the hunting happens.
SELECT D.name, E.habitat_type
FROM Dinosaurs D
JOIN Enclosures E
  ON D.enclosure_number = E.enclosure_number;
```

### The Butterfly Effect
In SQL, as in chaos theory, a small error can have large repercussions. Ensure your questions are as clear as crystal -- vague questions lead to a database rampage.

*Example:*
```sql
-- Imagine asking a T-Rex for a high-five. Bad idea.
-- It's like running this without a WHERE clause. You'll get more than you bargained for.
SELECT name, species
FROM Dinosaurs;
```

### Life Starts Simple
Your first query should be as simple as a single-celled organism. Start basic, get it right, then evolve. Complexity will find a way.

```sql
-- It's like hatching your first dinosaur -- thrilling, yet manageable.
SELECT name
FROM Dinosaurs
WHERE diet = 'Carnivore';
```

### Survival of the Fittest Query
Your queries should adapt and evolve. Introduce WHERE, GROUP BY, and HAVING like introducing new species into the park -- carefully and one at a time.

*Example:*

```sql
-- It's like observing the food chain in action.
-- We're looking for the top predators, but only the ones that weigh (on average) more than a metric ton (1,000 kg).
SELECT species, AVG(weight_kg) AS average_weight
FROM Dinosaurs
WHERE diet = 'Carnivore'
GROUP BY species
HAVING AVG(weight_kg) > 1000;
```

### Expect Chaos
No matter how well you plan, expect the unexpected. Your query might work on paper but fail in the wild. Test it, tweak it, and test again. And remember, if you don't document what you did, it's like it never happened.

*Example:*
```sql
-- Think you've contained the raptors? Think again.
-- This query might return more 'velociraptors' than you thought existed.
SELECT *
FROM Dinosaurs
WHERE species LIKE '%raptor%';
```

In essence, approach SQL with a mix of respect, caution, and a good sense of humor, much like how one would navigate a park filled with prehistoric creatures.