<a href="https://colab.research.google.com/github/brendanpshea/database_sql/blob/main/Database_05_Joins_Sets_SQL.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Getting Tables Together With Joins and Sets
### Brendan Shea, PhD

In this chapter, we'll learn about joins and set operations using SQLite. We'll use a database called movie.sqlite, which is based on data from IMDB and includes movie, actor, director, person, and Oscar data.

- First, we'll show you the database schema for movie.sqlite. This is a guide to how the data is organized. We'll also display the first five rows of each table, so you can see what the data looks like.

- Next, we'll explain what an entity-relationship diagram is. It's a way of showing how different tables in a database are connected. It helps you understand the database's structure. We'll also discuss VARCHAR and CHAR, which are different kinds of data types, and explain how they compare to TEXT.

- Then, we'll talk about how tables relate to each other, using something called cardinalities. We'll go over "join" tables, like the Actor and Director ones, and explain their importance in connecting data. We'll also explain what primary keys are and why they're necessary in each table.


- The main part of this chapter is about joins: INNER JOIN, LEFT (OUTER) JOIN, and CROSS JOIN. We'll show you how each one works, give examples, and explain when you might want to use them.

- Finally, we'll go over set operations: UNION, INTERSECT, and EXCEPT (SQL's version of set difference). These are methods you can use to combine or compare results from different SQL queries.

Remember, the key to understanding databases is practice. Don't be afraid to experiment and make mistakes. That's how you'll learn. Now, let's get started on joins and set operations!


## Brendan's Lecture
Run the following cell to see my lecture.

In [None]:
##Click here to launch my lecture
from IPython.display import YouTubeVideo
YouTubeVideo('XljJMe3nwTk', width=800, height=500)

## Introduction to the Movie Database
For this chapter, we'll be using a database based on data from the Internet Movie Database (IMDB). The full dataset is available here: https://developer.imdb.com/non-commercial-datasets/. The formatting of this database is adapted from that of David Sullivan (a computer science professor at Harvard/Boston U). We'll just be using a small part of this data, based on movies that have (at one point or another) been in the "top 100" in terms of box office returns.

To start off with, we need to download a copy of the database and connect to it:

In [None]:
# Now let's download the file we'll be using for this lab
!wget -N 'https://github.com/brendanpshea/database_sql/raw/main/data/movie.sqlite' -q

%load_ext sql
%sql sqlite:///movie.sqlite

### Exploring the Data
Now, we can take a look at the data to see what tables, attributes, and relationships are present. First, we'll display the table "schema" (the list of tables):

In [None]:
# Now let's display the table schema
%sql SELECT name FROM sqlite_master WHERE type='table';

 * sqlite:///movie.sqlite
Done.


name
Movie
Person
Actor
Director
Oscar


In [None]:
# Show the data types for each table
movie_df = %sql PRAGMA table_info(Movie);
person_df = %sql PRAGMA table_info(Person);
actor_df = %sql PRAGMA table_info(Actor);
director_df = %sql PRAGMA table_info(Director);
oscar_df = %sql PRAGMA table_info(Oscar);

print('\nMovie\n', movie_df,'\nPerson\n',person_df, '\nActor\n', actor_df,
      '\nDirector\n', director_df, '\nOscar\n', oscar_df)

 * sqlite:///movie.sqlite
Done.
 * sqlite:///movie.sqlite
Done.
 * sqlite:///movie.sqlite
Done.
 * sqlite:///movie.sqlite
Done.
 * sqlite:///movie.sqlite
Done.

Movie
 +-----+---------------+-------------+---------+------------+----+
| cid |      name     |     type    | notnull | dflt_value | pk |
+-----+---------------+-------------+---------+------------+----+
|  0  |       id      |   CHAR(7)   |    0    |    None    | 1  |
|  1  |      name     | VARCHAR(64) |    0    |    None    | 0  |
|  2  |      year     |   INTEGER   |    0    |    None    | 0  |
|  3  |     rating    |  VARCHAR(5) |    0    |    None    | 0  |
|  4  |    runtime    |   INTEGER   |    0    |    None    | 0  |
|  5  |     genre     | VARCHAR(16) |    0    |    None    | 0  |
|  6  | earnings_rank |   INTEGER   |    0    |    None    | 0  |
+-----+---------------+-------------+---------+------------+----+ 
Person
 +-----+------+--------------+---------+------------+----+
| cid | name |     type     | notnull |

Now, using our knowledge of the table names, we can display the first few rows of each table. (Note: I've used a bit of "Python" here to format things a bit more nicely):

In [None]:
# Show the first 5 rows of each table
movie_df = %sql SELECT * FROM Movie LIMIT 5;
person_df = %sql SELECT * FROM Person LIMIT 5;
actor_df = %sql SELECT * FROM Actor LIMIT 5;
director_df = %sql SELECT * FROM Director LIMIT 5;
oscar_df = %sql SELECT * FROM Oscar LIMIT 5;
print('\nMovie\n', movie_df,'\nPerson\n',person_df, '\nActor\n', actor_df,
      '\nDirector\n', director_df, '\nOscar\n', oscar_df)

 * sqlite:///movie.sqlite
Done.
 * sqlite:///movie.sqlite
Done.
 * sqlite:///movie.sqlite
Done.
 * sqlite:///movie.sqlite
Done.
 * sqlite:///movie.sqlite
Done.

Movie
 +---------+------------------------------+------+--------+---------+-------+---------------+
|    id   |             name             | year | rating | runtime | genre | earnings_rank |
+---------+------------------------------+------+--------+---------+-------+---------------+
| 2488496 | Star Wars: The Force Awakens | 2015 | PG-13  |   138   |   A   |       1       |
| 4154796 |      Avengers: Endgame       | 2019 | PG-13  |   181   |  AVS  |       2       |
| 1087260 |   Spider-Man: No Way Home    | 2021 | PG-13  |   148   |  AVFS |       3       |
| 0499549 |            Avatar            | 2009 | PG-13  |   162   |  AVYS |       4       |
| 1745960 |      Top Gun: Maverick       | 2022 | PG-13  |   130   |   AD  |       5       |
+---------+------------------------------+------+--------+---------+-------+------------

### Movies: Basic Structure

Our database consists of five tables: Movie, Person, Actor, Director, and Oscar. These tables store data related to movies, persons (actors and directors), and Oscar awards. Now, let's explore each of these tables, their attributes, data types, and primary keys.

### Data Types

As we've discussed previously, every column in a SQL table has a related "data type". The data type defines what kind of data a column can store. Here are some of the data types found in our tables:

1.  **CHAR:** This type is used to store character strings of a fixed length. The number within the parentheses (like CHAR(7)) represents the fixed length of characters the field can hold.

2.  **VARCHAR:** Unlike CHAR, VARCHAR is used for variable-length strings. The number within the parentheses represents the maximum number of characters the field can hold.

3.  **INTEGER:** This type is used for integer values (negative and positive whole numbers, such as -2, 0, 2, etc.)

4.  **DATE:** This type is used to store date values. SQLite has no separate date storage class; in this database, dates are stored as text strings such as `1924-09-16`.
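
One SQLite-specific point is worth seeing in action: the lengths in `CHAR(7)` or `VARCHAR(64)` describe the intended data, but SQLite does not enforce them, and it treats both types much like TEXT. The sketch below uses Python's built-in `sqlite3` module and a hypothetical `Demo` table (not part of movie.sqlite) to illustrate this.

```python
import sqlite3

# In-memory toy table; "Demo" is a made-up name used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Demo (id CHAR(7), name VARCHAR(5), year INTEGER)")

# SQLite accepts values longer than the declared length, without truncating them.
conn.execute("INSERT INTO Demo VALUES ('12345678901', 'A much longer name', 1997)")

# typeof() reports how each value is actually stored.
row = conn.execute(
    "SELECT id, length(id), typeof(id), typeof(name), typeof(year) FROM Demo"
).fetchone()
print(row)  # ('12345678901', 11, 'text', 'text', 'integer')
conn.close()
```

Other database systems (for example PostgreSQL or MySQL) are stricter about declared lengths, so it is still good practice to choose them carefully.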

### Tables


1\. Movie: This table contains information about different movies. It has 7 attributes:

-   `id` (CHAR(7)): This is a unique identifier for each movie. This serves as the primary key for this table.
-   `name` (VARCHAR(64)): The name of the movie.
-   `year` (INTEGER): The year when the movie was released.
-   `rating` (VARCHAR(5)): The MPAA rating of the movie.
-   `runtime` (INTEGER): The duration of the movie in minutes.
-   `genre` (VARCHAR(16)): The genre of the movie.
-   `earnings_rank` (INTEGER): The earnings rank of the movie.

2\. Person: This table stores data related to persons (actors and directors). It consists of 4 attributes:

-  `id` (CHAR(7)): This is a unique identifier for each person. This is the primary key for this table.
-   `name` (VARCHAR(64)): The name of the person.
-   `dob` (DATE): The date of birth of the person.
-   `pob` (VARCHAR(128)): The place of birth of the person.

3\. Actor and Director: These are **junction tables** (or **join tables**) which store the relationship between movies and persons (actors or directors). A join table is used to resolve a many-to-many relationship by breaking it down into two one-to-many relationships. In this case, a movie can have multiple actors and a person can act in multiple movies. The same applies to directors. Each of these tables has two attributes:

-   `actor_id/director_id` (CHAR(7)): This refers to the id of the person who is an actor/director. This forms part of the composite primary key.
-   `movie_id` (CHAR(7)): This refers to the id of the movie. This also forms part of the composite primary key.

4\. Oscar: This table stores information about Oscar awards. It contains 4 attributes:

-   `movie_id` (CHAR(7)): This refers to the id of the movie that won an Oscar.
-   `person_id` (CHAR(7)): This refers to the id of the person who won an Oscar.
-   `type` (VARCHAR(23)): This represents the type of Oscar award.
-   `year` (INTEGER): The year when the Oscar was awarded.

The key for this table is a combination of ALL FOUR columns.
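
To make the idea of a composite primary key concrete, here is a minimal sketch of a junction table built with Python's `sqlite3` module. The `CREATE TABLE` statement and the ID values are assumptions made for illustration; the actual definition inside movie.sqlite may differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical DDL for a junction table; the real movie.sqlite definition may differ.
conn.execute("""
    CREATE TABLE Actor (
        actor_id CHAR(7),
        movie_id CHAR(7),
        PRIMARY KEY (actor_id, movie_id)
    )
""")

# One person in two movies, and two people in one movie: all distinct pairs.
conn.executemany(
    "INSERT INTO Actor VALUES (?, ?)",
    [("0000001", "0000100"), ("0000001", "0000101"), ("0000002", "0000100")],
)

# Repeating an existing (actor_id, movie_id) pair violates the composite key.
try:
    conn.execute("INSERT INTO Actor VALUES ('0000001', '0000100')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM Actor").fetchone()[0]
print(duplicate_rejected, row_count)  # True 3
conn.close()
```

Neither column is unique on its own (an actor or a movie can appear in many rows), but each pair can appear only once.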

### Movies: Some Design Principles
Let's conclude with some key takeaways:

1. Unique identifiers (and primary keys) ensure that each record in a table is unique. They also allow us to link tables together efficiently. The Movie and Person tables demonstrate this.

2. The Actor and Director tables illustrate how **junction tables** enable many-to-many relationships between entities. Many actors can star in many movies, and a movie can have multiple directors.

3.  The Actor, Director, and Oscar tables show how **composite primary keys** (where more than one column is used to identify a record uniquely) can handle complex relationships.

4. The design of the Oscar table emphasizes that database design should closely reflect the realities of the data you're working with. It demonstrates how to incorporate multiple, overlapping relationships in a single table.

This dataset is an example of a relational database, where tables connect via relationships, allowing us to organize complex, real-world data in a structured way. In the next sections, we'll explore how to leverage these relationships to query and manipulate our data efficiently. Remember, practice is the key to mastering database design, so don't be afraid to experiment and learn from your mistakes!

## An Entity-Relationship Diagram of the Movie Data
To help us better understand the structure of the movie data, let's create an **entity-relationship** diagram.

![E-R Diagram](https://github.com/brendanpshea/database_sql/raw/main/images/movies_crows_foot.png)

### Why ER Diagrams?
An **Entity-Relationship (ER) diagram** (like the one you see here) is a visual tool that allows us to model and represent the relationships between different entities in a database. An 'entity' could be anything we want to store information about, such as movies, people, or awards in our movie.sqlite database.

In an ER diagram, **entities** (roughly, classes of things corresponding to tables) are usually represented as rectangles, while **relationships** are shown as lines or arrows connecting these rectangles. These relationships illustrate how data in one entity relates to data in another. For example, in our movie.sqlite database, a relationship exists between the 'Movie' and 'Person' entities through the 'Actor' and 'Director' entities.

This diagram uses a simplified version of **Crow's foot notation**, which is a type of ER diagram where relationships are displayed with symbols that somewhat resemble a bird's footprints (hence the name). This notation is particularly helpful in showing the cardinality of relationships, or in other words, how many instances of one entity relate to instances of another entity.

In crow's foot notation, we look at the **connectors** at the end of lines to determine the relationships between entities.

-   A `1` (also indicated with `|`) stands for 'exactly 1'.
-   A crow's foot (`<` or `>`) symbol stands for 'many'.
-  A circle (`O`) on a line represents "optional" (we don't have any of these in this diagram).

So, for example, consider the line between 'Movie' and 'Director'. Since there is a crow's foot ("many") next to the Director box and a 1 next to the Movie box, there is a **many-to-one** relationship between rows of the Director table and movies: each Director row points to exactly one movie, but a single movie can appear in many Director rows (that is, it can have several directors). Many-to-one relationships form the "heart" of relational databases, since **one-to-one relationships** are often encoded as "attributes" of tables (for example, a Person's relationship with their own name) while **many-to-many** relationships (for example, the relationship between Persons and Movies, where a person can be in many movies and a movie can have many persons) can't be represented without breaking them down into multiple many-to-one (or one-to-many) relationships (as we've done here with Director and Actor).

### The Role of ER Diagrams in Database Design and Interpretation

ER diagrams play a crucial role in both the design and interpretation of databases:

-   Design: When designing a database, ER diagrams help us to visualize the structure and relationships in our database before we start creating it. By mapping out entities and their relationships, we can see and adjust our database's design more easily. It helps ensure we accurately represent the data and relationships we want to capture.

-   Interpretation: ER diagrams can also be handy tools for understanding existing databases. They provide a clear, visual representation of the structure of a database, making it easier to understand the relationships between different entities. This understanding is key when it comes to effectively querying a database.

Remember, a well-designed ER diagram can save you time and trouble. It gives you a clear roadmap of your database, helps prevent errors, and makes the database easier to work with, whether you're the one who created it or not.

## Discussion Questions: Movies DB
1. In the given database, why do we use CHAR as the data type for the 'id' attribute in Movie and Person tables? What would happen if we used VARCHAR instead of CHAR?

2. Analyze the Movie table. Why might 'year' be represented as an INTEGER instead of a DATE type? Can you think of any advantages or disadvantages to this approach?

3. Can you explain the role and significance of the Actor and Director tables in our movie database? How do these "join tables" help us solve the problem of many-to-many relationships between movies and persons?

4. Looking at the Oscar table, why do you think all four columns form the primary key? Discuss why a single attribute, or a subset of these attributes, might not serve as an efficient primary key in this context.

5. Consider the ER diagram provided. Based on it, can you describe the relationship between the entities Movie and Person, as mediated by the Actor and Director tables? How would you describe these relationships using the terms "one-to-one", "one-to-many", and "many-to-many"?

### My Answers: Movies DB

1.

2.

3.

4.

5.

## What is a JOIN and What Role Do They Play in Relational Databases?

In the world of relational databases, tables are like individual characters in a play. Each character—be it Hamlet in Shakespeare's classic or Elsa in Disney's "Frozen"—has their own storyline, traits, and connections to others. But the story becomes compelling when these characters interact. In databases, this interaction is orchestrated by SQL Joins.

A **JOIN** is a SQL operation used to combine rows from two or more tables based on a related column between them. Imagine a high school drama club preparing for a play. One table, let's call it `Students`, lists all the members: their names, roles, and ages. Another table, named `Roles`, provides details about each role in the play: the character's name, importance, and the number of lines they have.

Here, each table is distinct and serves a specific purpose. But what if you want to know which student is playing which role and how many lines they have? This is where a Join comes into play. It combines these separate pieces of information into a cohesive whole, much like how a director combines various elements to create a unified performance.

In databases, the "director's notes" for these Joins are often found in what we call **primary keys** and **foreign keys**. A primary key is a unique identifier for a record in a table. Think of it as a student's unique ID number. A foreign key, on the other hand, is a field in one table that refers to the primary key in another table. It's like a reference in the script that mentions which student is playing which role. These keys establish the relationship between tables and set the stage for a join.

There are several types of Joins—INNER, LEFT, RIGHT, FULL—but they all serve the same purpose: to bring different tables together in a meaningful way. For instance, an INNER JOIN would only list the students who actually have a role in the play. In contrast, a LEFT JOIN would list all students, but only show role details for those who have one.


### Understanding the Links Between Tables: The Role of Junction Tables and References

In our database, akin to a cinematic universe, the characters are not isolated figures; they exist in a web of relationships. The `Movie`, `Person`, `Actor`, `Director`, and `Oscar` tables are individual chapters, but they are connected in ways that deepen our understanding of each.

-   `Actor` Table: This table contains `actor_id` and `movie_id`, serving as a junction table. Each row tells us which actor (from the `Person` table) played in which movie (from the `Movie` table). It's like a program booklet that tells you which actor is playing which role.

-   `Director` Table: Similarly, this table links directors from the `Person` table to their films in the `Movie` table. A row in this table is like a director's credit in a film reel, pointing from the name to the artwork they've overseen.

These junction tables use Foreign Keys---the `actor_id` and `director_id` are Foreign Keys that refer to the `id` in the `Person` table, and the `movie_id` is a Foreign Key referring to the `id` in the `Movie` table.

The `Oscar` table is the grand awards ceremony where artists and their works are celebrated. This table uses Foreign Keys to refer to both `Person` and `Movie` tables, connecting them through accolades.

-   `movie_id`: Tells us which movie won or was nominated for an Oscar. This is a Foreign Key linking to the `Movie` table.
-   `person_id`: Indicates the person who received the award or nomination. This is another Foreign Key, this time linking to the `Person` table.

This table captures instances where a movie or a person has been honored, acting as a bridge between the art and the artist, the film and the filmmaker.
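
The way a foreign key ties a junction table to its parent tables can be sketched with Python's `sqlite3` module. The table definitions below are hypothetical (we have not inspected whether movie.sqlite declares its foreign keys), and note that SQLite only enforces foreign keys on a connection after `PRAGMA foreign_keys = ON` has been run.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on for the connection.
conn.execute("PRAGMA foreign_keys = ON")

# Hypothetical definitions, loosely modeled on the tables described above.
conn.executescript("""
    CREATE TABLE Person (id CHAR(7) PRIMARY KEY, name VARCHAR(64));
    CREATE TABLE Movie  (id CHAR(7) PRIMARY KEY, name VARCHAR(64));
    CREATE TABLE Actor (
        actor_id CHAR(7) REFERENCES Person(id),
        movie_id CHAR(7) REFERENCES Movie(id),
        PRIMARY KEY (actor_id, movie_id)
    );
    INSERT INTO Person VALUES ('0000001', 'Tom Hanks');
    INSERT INTO Movie  VALUES ('0000100', 'Forrest Gump');
    INSERT INTO Actor  VALUES ('0000001', '0000100');
""")

# An Actor row whose actor_id matches no Person is rejected.
try:
    conn.execute("INSERT INTO Actor VALUES ('9999999', '0000100')")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

actor_rows = conn.execute("SELECT COUNT(*) FROM Actor").fetchone()[0]
print(orphan_rejected, actor_rows)  # True 1
conn.close()
```

Joins work whether or not the foreign keys are formally declared; the declaration simply stops rows that point at nothing from being inserted.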

### INNER JOIN

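An INNER JOIN returns records that have matching values in both tables being joined. This is the most common form of join in SQL, and the keyword INNER is optional: writing just JOIN means the same thing. (It should not be confused with a **natural join**, written NATURAL JOIN, which matches rows automatically on all columns that share a name in both tables.)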

The general form of an SQL JOIN statement looks like this:

```
SELECT column1, column2, ...
FROM table1
[INNER] JOIN table2 ON condition;
```

The INNER JOIN keyword selects records that have matching values in both tables. This means you only get rows back when there's something to match in both tables.

To illustrate, consider two simplified tables in our database, 'Actor' and 'Person'. The 'Actor' table contains two columns: 'actor_id' and 'movie_id'. The 'Person' table includes 'id' and 'name'. Here are the tables with a few rows of data:

'Actor' Table:

| actor_id | movie_id |
| --- | --- |
| 1 | 100 |
| 2 | 100 |
| 3 | 101 |
| 4 | 102 |

'Person' Table:

| id | name |
| --- | --- |
| 1 | Tom Hanks |
| 2 | Morgan Freeman |
| 3 | Meryl Streep |
| 5 | Robert De Niro |

Now, suppose we want to find out which actors played in the movie with the ID '100'. To do that, we can use an INNER JOIN to combine the 'Actor' and 'Person' tables:

```
SELECT Person.name
FROM Actor
JOIN Person ON Actor.actor_id = Person.id
WHERE Actor.movie_id = 100;
```

This statement takes the 'Actor' table and joins it with the 'Person' table where the 'actor_id' in 'Actor' matches the 'id' in 'Person'. We're only interested in the movie with ID '100', so we specify that with `WHERE Actor.movie_id = 100`.

The result of this query would look like this:

| name |
| --- |
| Tom Hanks |
| Morgan Freeman |

This result shows us that Tom Hanks and Morgan Freeman were both actors in the movie with the ID '100'.

Now, let's move on to multiple-table joins. Suppose we want to get a list of movie names along with the actors. To do this, we would join three tables: 'Movie', 'Actor', and 'Person'. Let's start by adding a 'Movie' table into our example:

'Movie' Table:

| id | name |
| --- | --- |
| 100 | Forrest Gump |
| 101 | The Godfather |
| 102 | Casablanca |

We could adjust our previous SQL query to include the 'Movie' table like this:

```
SELECT Movie.name, Person.name
FROM Movie
JOIN Actor ON Movie.id = Actor.movie_id
JOIN Person ON Actor.actor_id = Person.id;
```

Now we are joining 'Movie' to 'Actor', and then that result to 'Person'. This would give us a list of movie names along with the corresponding actor names, something like this:

| Movie Name | Actor Name |
| --- | --- |
| Forrest Gump | Tom Hanks |
| Forrest Gump | Morgan Freeman |
| The Godfather | Meryl Streep |
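
If you'd like to check these toy results yourself, the following sketch rebuilds the three small example tables in an in-memory SQLite database using Python's `sqlite3` module and runs both queries. The `ORDER BY` clauses are an addition here, included only so the row order is predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Recreate the simplified example tables (not the real movie.sqlite data).
conn.executescript("""
    CREATE TABLE Person (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE Movie  (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE Actor  (actor_id INTEGER, movie_id INTEGER);
    INSERT INTO Person VALUES (1, 'Tom Hanks'), (2, 'Morgan Freeman'),
                              (3, 'Meryl Streep'), (5, 'Robert De Niro');
    INSERT INTO Movie  VALUES (100, 'Forrest Gump'), (101, 'The Godfather'),
                              (102, 'Casablanca');
    INSERT INTO Actor  VALUES (1, 100), (2, 100), (3, 101), (4, 102);
""")

# Two-table join: who acted in movie 100?
two_table = conn.execute("""
    SELECT Person.name
    FROM Actor
    JOIN Person ON Actor.actor_id = Person.id
    WHERE Actor.movie_id = 100
    ORDER BY Person.id
""").fetchall()

# Three-table join: movie names alongside actor names.
three_table = conn.execute("""
    SELECT Movie.name, Person.name
    FROM Movie
    JOIN Actor ON Movie.id = Actor.movie_id
    JOIN Person ON Actor.actor_id = Person.id
    ORDER BY Movie.id, Person.id
""").fetchall()

print(two_table)
print(three_table)
conn.close()
```

Notice that Casablanca is missing from the second result: its only Actor row has `actor_id` 4, which matches no row in Person, so the inner join drops it. Robert De Niro is missing for the mirror-image reason.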

#### Example 1: Actors in 'PG-13' Movies
Suppose we want to list actors together with the PG-13 movies they appeared in. To accomplish this, we'll INNER JOIN the Person, Actor, and Movie tables, linking Person.id to Actor.actor_id and Actor.movie_id to Movie.id. We'll then filter the results to only include movies with a 'PG-13' rating.

In [None]:
%%sql
SELECT Person.name as "Actor",
  Movie.name AS "Movie Title", Rating
FROM Person
INNER JOIN Actor ON Person.id = Actor.actor_id
INNER JOIN Movie ON Actor.movie_id = Movie.id
WHERE Movie.rating = 'PG-13'
LIMIT 10;

 * sqlite:///movie.sqlite
Done.


Actor,Movie Title,rating
Leonardo DiCaprio,Titanic,PG-13
Kate Winslet,Titanic,PG-13
Billy Zane,Titanic,PG-13
Kathy Bates,Titanic,PG-13
Bill Paxton,Titanic,PG-13
Tobey Maguire,Spider-Man,PG-13
Willem Dafoe,Spider-Man,PG-13
Kirsten Dunst,Spider-Man,PG-13
James Franco,Spider-Man,PG-13
Cliff Robertson,Spider-Man,PG-13


#### Example 2: Aggregate Functions - Average Runtime of Movies by Rating
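Suppose you're planning a movie marathon and want to know the average runtime of movies based on their ratings. We can use an INNER JOIN between the Movie and Actor tables and then apply the SQL aggregate function AVG() to calculate this. Keep in mind that the join produces one row per movie-actor pair, so a movie with many listed actors counts more heavily toward the average, and movies with no rows in Actor are left out; running AVG() on the Movie table alone would give a plain per-movie average.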

In [None]:
%%sql
SELECT Movie.rating,
  ROUND(AVG(Movie.runtime), 2) AS "Average Time (min)"
FROM Movie
INNER JOIN Actor ON Movie.id = Actor.movie_id
GROUP BY Movie.rating;

 * sqlite:///movie.sqlite
Done.


rating,Average Time (min)
,115.93
G,117.13
GP,187.0
M,124.0
NC-17,114.18
PG,112.2
PG-13,128.74
R,126.35


#### Example 3: Text Filtering with LIKE - Directors Born in New York

What if you're curious to know which directors were born in New York and have directed a movie? We can INNER JOIN the `Person` and `Director` tables and use the SQL `LIKE` keyword to filter text data.

In [None]:
%%sql
SELECT Person.name, pob
FROM Person
INNER JOIN Director ON Person.id = Director.director_id
WHERE Person.pob LIKE '%New York%'
LIMIT 5;

 * sqlite:///movie.sqlite
Done.


name,pob
Mel Gibson,"Peekskill, New York, USA"
Barry Sonnenfeld,"New York, New York, USA"
Joel Zwick,"Brooklyn, New York, USA"
Martin Brest,"The Bronx, New York, New York, USA"
Bryan Singer,"New York, New York, USA"


#### Example 4: INNER JOIN with Multiple Conditions - Oscar-Winning Actors in PG-13 Movies
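The following query counts, for each person who has won a Best Actor Oscar, the PG-13 movies in which that person acted. It first joins four tables: `Person` to `Actor` on the person's `id`, `Actor` to `Movie` on the movie's `id`, and `Person` to `Oscar` on `person_id`. It then keeps only the rows where the movie rating is `PG-13` and the Oscar type is `BEST-ACTOR`. Finally, it groups the rows by actor name and counts the rows in each group, giving a table with two columns: `Actor` and `Number of PG Movies`. Note that `Oscar` is joined on the person only, so the Oscar does not have to be for the PG-13 movie being counted, and a person with several Best Actor Oscars has each movie counted once per Oscar. Adding `AND Oscar.movie_id = Movie.id` to the last join condition would restrict the count to PG-13 movies for which the Oscar was won.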

In [None]:
%%sql

SELECT Person.name AS "Actor",
  COUNT(Movie.name) AS "Number of PG Movies"
FROM Person
INNER JOIN Actor ON Person.id = Actor.actor_id
INNER JOIN Movie ON Actor.movie_id = Movie.id
INNER JOIN Oscar ON Person.id = Oscar.person_id
WHERE Movie.rating = 'PG-13' AND Oscar.type = 'BEST-ACTOR'
GROUP BY Person.name
LIMIT 5;


 * sqlite:///movie.sqlite
Done.


Actor,Number of PG Movies
Adrien Brody,1
Anthony Hopkins,6
Ben Kingsley,1
Charles Laughton,1
Cliff Robertson,1
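
How much the join condition on `Oscar` matters can be seen on a small made-up dataset. In the sketch below (Python's `sqlite3`, with invented people, movies, and awards), one person acted in two PG-13 movies and one R movie, and won two Best Actor Oscars, only one of them for a PG-13 movie.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Invented data for illustration only; none of it comes from movie.sqlite.
conn.executescript("""
    CREATE TABLE Person (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE Movie  (id INTEGER PRIMARY KEY, name TEXT, rating TEXT);
    CREATE TABLE Actor  (actor_id INTEGER, movie_id INTEGER);
    CREATE TABLE Oscar  (movie_id INTEGER, person_id INTEGER, type TEXT, year INTEGER);
    INSERT INTO Person VALUES (1, 'Example Actor');
    INSERT INTO Movie  VALUES (100, 'First Film', 'PG-13'),
                              (101, 'Second Film', 'PG-13'),
                              (102, 'Third Film', 'R');
    INSERT INTO Actor  VALUES (1, 100), (1, 101), (1, 102);
    INSERT INTO Oscar  VALUES (100, 1, 'BEST-ACTOR', 2001),
                              (102, 1, 'BEST-ACTOR', 2005);
""")

base_query = """
    SELECT Person.name, COUNT(Movie.name)
    FROM Person
    INNER JOIN Actor ON Person.id = Actor.actor_id
    INNER JOIN Movie ON Actor.movie_id = Movie.id
    INNER JOIN Oscar ON Person.id = Oscar.person_id {extra}
    WHERE Movie.rating = 'PG-13' AND Oscar.type = 'BEST-ACTOR'
    GROUP BY Person.name
"""

# Oscar joined on the person only: 2 PG-13 movies x 2 Oscars = 4 rows counted.
loose = conn.execute(base_query.format(extra="")).fetchall()

# Oscar joined on person AND movie: only the PG-13 movie that won is counted.
strict = conn.execute(
    base_query.format(extra="AND Oscar.movie_id = Movie.id")
).fetchall()

print(loose)   # [('Example Actor', 4)]
print(strict)  # [('Example Actor', 1)]
conn.close()
```

Which version is "right" depends on the question being asked; the point is simply that every join condition changes which row combinations survive to be counted.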


#### Example 5: Sorting with ORDER BY - Top Earning PG-13 Movies and Their Directors

Perhaps you're interested in knowing which 'PG-13' movies earned the most at the box office and who directed them. In this example, we use INNER JOINs to connect the `Movie`, `Director`, and `Person` tables and sort the results using the `ORDER BY` keyword.

In [None]:
%%sql
SELECT Movie.name AS "Movie",
  Person.name AS "Director",
  Movie.earnings_rank AS "Earnings Rank"
FROM Movie
INNER JOIN Director ON Movie.id = Director.movie_id
INNER JOIN Person ON Director.director_id = Person.id
WHERE Movie.rating = 'PG-13'
  AND Movie.earnings_rank IS NOT NULL
ORDER BY Movie.earnings_rank ASC
LIMIT 5;

 * sqlite:///movie.sqlite
Done.


Movie,Director,Earnings Rank
Star Wars: The Force Awakens,J.J. Abrams,1
Avengers: Endgame,Anthony Russo,2
Avengers: Endgame,Joe Russo,2
Spider-Man: No Way Home,Jon Watts,3
Avatar,James Cameron,4


#### Example 6: Counting Entries with COUNT - Number of Movies by Rating

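For a school project, you might need to know how many movies exist in each rating category. Here we use the `COUNT()` function along with an INNER JOIN between the `Movie` and `Actor` tables, then group the results by rating. Because the join yields one row per movie-actor pair, the numbers below are counts of acting credits per rating rather than counts of distinct movies; `COUNT(DISTINCT Movie.id)` would count each movie once.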

In [None]:
%%sql
SELECT Movie.rating, COUNT(Movie.id)
FROM Movie
INNER JOIN Actor ON Movie.id = Actor.movie_id
GROUP BY Movie.rating;


 * sqlite:///movie.sqlite
Done.


rating,COUNT(Movie.id)
,693
G,197
GP,5
M,25
NC-17,11
PG,727
PG-13,1231
R,851
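
The effect of joining before aggregating is easy to see on a tiny invented dataset. In this sketch (Python's `sqlite3`; the two movies and their cast sizes are made up), one PG-13 movie has three actor rows and another has one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Invented data: a 100-minute movie with 3 actors, a 200-minute movie with 1 actor.
conn.executescript("""
    CREATE TABLE Movie (id INTEGER PRIMARY KEY, name TEXT, rating TEXT, runtime INTEGER);
    CREATE TABLE Actor (actor_id INTEGER, movie_id INTEGER);
    INSERT INTO Movie VALUES (100, 'Short Film', 'PG-13', 100),
                             (101, 'Long Film',  'PG-13', 200);
    INSERT INTO Actor VALUES (1, 100), (2, 100), (3, 100), (4, 101);
""")

# Aggregating after the join works on movie-actor pairs.
joined = conn.execute("""
    SELECT Movie.rating,
           COUNT(Movie.id),
           COUNT(DISTINCT Movie.id),
           AVG(Movie.runtime)
    FROM Movie
    INNER JOIN Actor ON Movie.id = Actor.movie_id
    GROUP BY Movie.rating
""").fetchall()

# Aggregating the Movie table alone works on movies.
movies_only = conn.execute("""
    SELECT rating, COUNT(*), AVG(runtime)
    FROM Movie
    GROUP BY rating
""").fetchall()

print(joined)       # [('PG-13', 4, 2, 125.0)]
print(movies_only)  # [('PG-13', 2, 150.0)]
conn.close()
```

`COUNT(Movie.id)` sees four rows after the join, while `COUNT(DISTINCT Movie.id)` recovers the two movies. The average runtime is likewise pulled toward the movie with the larger cast.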


### LEFT JOIN (LEFT OUTER JOIN)

The general form of a LEFT [OUTER] JOIN SQL statement looks like this:

```
SELECT column1, column2, ...
FROM table1
LEFT JOIN table2 ON condition;
```

A LEFT JOIN returns all records from the left table (table1), and the matched records from the right table (table2). If there is no match, the result is NULL on the right side. This is quite different from an INNER JOIN, which only returns records where there is a match in both tables.

Continuing with our previous example, consider the 'Actor' and 'Person' tables:

'Actor' Table:

| actor_id | movie_id |
| --- | --- |
| 1 | 100 |
| 2 | 100 |
| 3 | 101 |
| 4 | 102 |

'Person' Table:

| id | name |
| --- | --- |
| 1 | Tom Hanks |
| 2 | Morgan Freeman |
| 3 | Meryl Streep |
| 5 | Robert De Niro |

Now, suppose we want to list every person, including those who didn't act in any movie. We can use a LEFT JOIN to combine the 'Person' and 'Actor' tables:

```
SELECT Person.name, Actor.movie_id
FROM Person
LEFT JOIN Actor ON Person.id = Actor.actor_id;
```

This statement takes the 'Person' table and joins it with the 'Actor' table where the 'id' in 'Person' matches the 'actor_id' in 'Actor'. If there's no match, the 'movie_id' will be NULL.

The result of this query would look like this:

| name | movie_id |
| --- | --- |
| Tom Hanks | 100 |
| Morgan Freeman | 100 |
| Meryl Streep | 101 |
| Robert De Niro | NULL |

This result shows us that Tom Hanks, Morgan Freeman, and Meryl Streep acted in various movies, while Robert De Niro's name appears with a NULL 'movie_id', meaning he hasn't acted in any of the movies listed in the 'Actor' table.

In contrast, if we had used an INNER JOIN instead, Robert De Niro wouldn't appear in our result at all, because INNER JOIN only includes records where there's a match in both tables. The choice between INNER JOIN and LEFT JOIN depends on whether you want to include all records from one table (LEFT JOIN), or only records where there's a match in both tables (INNER JOIN).
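
The contrast between the two joins can be checked directly on the toy tables. The sketch below uses Python's `sqlite3` module; the `ORDER BY` clauses and the third query (which keeps only the unmatched rows) are additions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Recreate the simplified example tables (not the real movie.sqlite data).
conn.executescript("""
    CREATE TABLE Person (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE Actor  (actor_id INTEGER, movie_id INTEGER);
    INSERT INTO Person VALUES (1, 'Tom Hanks'), (2, 'Morgan Freeman'),
                              (3, 'Meryl Streep'), (5, 'Robert De Niro');
    INSERT INTO Actor  VALUES (1, 100), (2, 100), (3, 101), (4, 102);
""")

# LEFT JOIN keeps every Person row; missing matches show up as None (NULL).
left_rows = conn.execute("""
    SELECT Person.name, Actor.movie_id
    FROM Person
    LEFT JOIN Actor ON Person.id = Actor.actor_id
    ORDER BY Person.id
""").fetchall()

# INNER JOIN keeps only the Person rows that have a match in Actor.
inner_rows = conn.execute("""
    SELECT Person.name, Actor.movie_id
    FROM Person
    INNER JOIN Actor ON Person.id = Actor.actor_id
    ORDER BY Person.id
""").fetchall()

# A common LEFT JOIN idiom: keep only the rows that found no match.
unmatched = conn.execute("""
    SELECT Person.name
    FROM Person
    LEFT JOIN Actor ON Person.id = Actor.actor_id
    WHERE Actor.actor_id IS NULL
""").fetchall()

print(left_rows)
print(inner_rows)
print(unmatched)  # [('Robert De Niro',)]
conn.close()
```

The last query shows why LEFT JOIN is useful beyond "list everything": filtering on `IS NULL` answers questions of the form "which persons have no acting credit?"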

Some versions of SQL also offer commands for RIGHT (OUTER) JOIN (which just reverses the order of the two tables in LEFT JOIN) and FULL OUTER JOIN (which combines the results of LEFT and RIGHT joins); SQLite has supported both since version 3.39.0. Where they aren't available, you can replicate these by smartly using LEFT JOIN.
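
As a rough sketch of how that replication might look on the toy tables: a RIGHT JOIN is just a LEFT JOIN with the tables swapped, and a FULL OUTER JOIN can be assembled from a LEFT JOIN plus the rows that only the swapped LEFT JOIN would add. This is one common pattern, shown here with Python's `sqlite3`, not the only way to do it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Recreate the simplified example tables (not the real movie.sqlite data).
conn.executescript("""
    CREATE TABLE Person (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE Actor  (actor_id INTEGER, movie_id INTEGER);
    INSERT INTO Person VALUES (1, 'Tom Hanks'), (2, 'Morgan Freeman'),
                              (3, 'Meryl Streep'), (5, 'Robert De Niro');
    INSERT INTO Actor  VALUES (1, 100), (2, 100), (3, 101), (4, 102);
""")

# "Person RIGHT JOIN Actor" written as "Actor LEFT JOIN Person".
right_like = conn.execute("""
    SELECT Person.name, Actor.movie_id
    FROM Actor
    LEFT JOIN Person ON Person.id = Actor.actor_id
    ORDER BY Actor.actor_id
""").fetchall()

# FULL OUTER JOIN emulation: all LEFT JOIN rows, plus Actor rows with no Person.
full_like = conn.execute("""
    SELECT Person.name, Actor.movie_id
    FROM Person
    LEFT JOIN Actor ON Person.id = Actor.actor_id
    UNION ALL
    SELECT Person.name, Actor.movie_id
    FROM Actor
    LEFT JOIN Person ON Person.id = Actor.actor_id
    WHERE Person.id IS NULL
""").fetchall()

print(right_like)
print(sorted(full_like, key=str))
conn.close()
```

The emulated full join contains five rows: the three matched pairs, Robert De Niro with no movie, and movie 102 with no known person.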

#### Example 7: All Movies and Their Oscar Wins
Imagine you're compiling a list of all movies and want to know which ones have won Oscars. Even if a movie hasn't won, it should still appear in the list.

In [None]:
%%sql
SELECT Movie.name, Oscar.type
FROM Movie
LEFT JOIN Oscar ON Movie.id = Oscar.movie_id
LIMIT 30;


 * sqlite:///movie.sqlite
Done.


name,type
Star Wars: The Force Awakens,
Avengers: Endgame,
Spider-Man: No Way Home,
Avatar,
Top Gun: Maverick,
Black Panther,
Avengers: Infinity War,
Titanic,BEST-DIRECTOR
Titanic,BEST-PICTURE
Jurassic World,


#### Example 8: All the "Steven"s and Movies They've Directed
For whatever reason, you are interested in knowing which people named Steven have directed which movies. You'd also like to know which Stevens haven't directed any movies. A LEFT JOIN can fetch this information, including any Stevens who haven't directed a movie (for them, the movie column is NULL).

In [None]:
%%sql
SELECT Person.name AS "Person",
  Movie.name AS "Movie Directed"
FROM Person
LEFT JOIN Director ON Director.director_id = Person.id
LEFT JOIN Movie ON Movie.id = Director.movie_id
WHERE Person.name LIKE 'Steven%';



 * sqlite:///movie.sqlite
Done.


Person,Movie Directed
Steven Spielberg,Jaws
Steven Spielberg,Indiana Jones and the Raiders of the Lost Ark
Steven Spielberg,E.T. the Extra-Terrestrial
Steven Spielberg,Indiana Jones and the Temple of Doom
Steven Spielberg,Indiana Jones and the Last Crusade
Steven Spielberg,Jurassic Park
Steven Spielberg,Schindler's List
Steven Spielberg,"Lost World: Jurassic Park, The"
Steven Spielberg,Saving Private Ryan
Steven Spielberg,Minority Report


### CROSS JOIN

A CROSS JOIN, also known as a Cartesian join, returns the Cartesian product of the two tables. This means that it returns every combination of rows from the first table with every row from the second table. Unlike other types of joins, the CROSS JOIN does not require a join condition, and if used without one, it returns all possible combinations of rows from the joined tables.

The general form of a SQL CROSS JOIN statement looks like this:

```
SELECT column1, column2, ...
FROM table1
CROSS JOIN table2;
```

The CROSS JOIN keyword combines every record from the first table with every record from the second table. It doesn't need a matching condition.

To illustrate, let's use the same 'Actor' and 'Person' tables from our previous examples:

'Actor' Table:

| actor_id | movie_id |
| --- | --- |
| 1 | 100 |
| 2 | 100 |
| 3 | 101 |
| 4 | 102 |

'Person' Table:

| id | name |
| --- | --- |
| 1 | Tom Hanks |
| 2 | Morgan Freeman |
| 3 | Meryl Streep |
| 5 | Robert De Niro |

Now, suppose we want to see all possible combinations of these two tables. We can perform a CROSS JOIN as follows:

```
SELECT Person.name, Actor.movie_id
FROM Person
CROSS JOIN Actor;
```

The result of this query would include every combination of the 'Person' and 'Actor' tables, resulting in a total of 16 rows (4 rows from 'Actor' multiplied by 4 rows from 'Person'):

| name | movie_id |
| --- | --- |
| Tom Hanks | 100 |
| Tom Hanks | 100 |
| Tom Hanks | 101 |
| Tom Hanks | 102 |
| Morgan Freeman | 100 |
| Morgan Freeman | 100 |
| Morgan Freeman | 101 |
| Morgan Freeman | 102 |
| ... (continues) |  |

The CROSS JOIN produces a large result set, especially if the tables involved have many rows. It's often used in scenarios where you need all possible combinations between two sets of data, such as in certain analytical tasks. Care should be taken when using this join, as it can quickly lead to a large number of results, potentially affecting performance.
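
If you'd like to check the row count yourself, here is a minimal sketch that rebuilds the two toy tables above in a throwaway in-memory database. It uses Python's built-in `sqlite3` module instead of the `%%sql` magic, and the variable names (`cross_rows`, `inner_rows`) are just illustrative choices.

```python
import sqlite3

# Throwaway in-memory database holding the toy tables from this section
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Person (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE Actor (actor_id INTEGER, movie_id INTEGER);
    INSERT INTO Person VALUES (1, 'Tom Hanks'), (2, 'Morgan Freeman'),
                              (3, 'Meryl Streep'), (5, 'Robert De Niro');
    INSERT INTO Actor VALUES (1, 100), (2, 100), (3, 101), (4, 102);
""")

# CROSS JOIN: no join condition, so every Person row pairs with every Actor row
cross_rows = conn.execute(
    "SELECT Person.name, Actor.movie_id FROM Person CROSS JOIN Actor"
).fetchall()

# INNER JOIN on the same tables, for comparison: only matching ids survive
inner_rows = conn.execute(
    "SELECT Person.name, Actor.movie_id FROM Person "
    "INNER JOIN Actor ON Person.id = Actor.actor_id"
).fetchall()

print(len(cross_rows))  # 4 Person rows x 4 Actor rows = 16
print(len(inner_rows))  # ids 1, 2, and 3 match, so 3 rows
```

Comparing the two counts is a quick sanity check: if a query that was supposed to be an INNER JOIN returns (rows in table 1) x (rows in table 2) results, the join condition is probably missing.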

#### Example 9: Every Actor With Every Movie
In a fantastical "what if" scenario, you might wonder what it would look like if every actor had acted in every movie. A CROSS JOIN between the `Person` and `Movie` tables, restricted to people whose `id` appears in the `Actor` table, can simulate this.

In [None]:
%%sql
SELECT *
FROM Person
CROSS JOIN Movie
WHERE Person.id IN (SELECT actor_id FROM Actor)
LIMIT 10;

 * sqlite:///movie.sqlite
Done.


id,name,dob,pob,id_1,name_1,year,rating,runtime,genre,earnings_rank
2,Lauren Bacall,1924-09-16,"New York, New York, USA",2488496,Star Wars: The Force Awakens,2015,PG-13,138,A,1
2,Lauren Bacall,1924-09-16,"New York, New York, USA",4154796,Avengers: Endgame,2019,PG-13,181,AVS,2
2,Lauren Bacall,1924-09-16,"New York, New York, USA",1087260,Spider-Man: No Way Home,2021,PG-13,148,AVFS,3
2,Lauren Bacall,1924-09-16,"New York, New York, USA",499549,Avatar,2009,PG-13,162,AVYS,4
2,Lauren Bacall,1924-09-16,"New York, New York, USA",1745960,Top Gun: Maverick,2022,PG-13,130,AD,5
2,Lauren Bacall,1924-09-16,"New York, New York, USA",1825683,Black Panther,2018,PG-13,134,AVS,6
2,Lauren Bacall,1924-09-16,"New York, New York, USA",4154756,Avengers: Infinity War,2018,PG-13,149,AVYS,7
2,Lauren Bacall,1924-09-16,"New York, New York, USA",120338,Titanic,1997,PG-13,194,DR,8
2,Lauren Bacall,1924-09-16,"New York, New York, USA",369610,Jurassic World,2015,PG-13,124,A,9
2,Lauren Bacall,1924-09-16,"New York, New York, USA",848228,The Avengers,2012,PG-13,143,A,10


## Set Operations in SQL
SQL also allows us to do basic **set** operations, of the sort you might have learned about in high-school or college math classes. In SQL, the main set operations we talk about are UNION, INTERSECT, and EXCEPT. All these operations can be applied to the result sets returned by SELECT statements.

To help see how these work, let's suppose we have the following tables:

#### Old_Movies:

| id | title |
| --- | --- |
| 1 | Avatar |
| 2 | Titanic |
| 3 | The Godfather |
| 4 | Star Wars |
| 5 | Jaws |

#### New_Movies:

| id | title |
| --- | --- |
| 1 | Avatar |
| 2 | Endgame |
| 3 | Inception |
| 4 | Interstellar |
| 5 | Titanic |

### UNION

The `UNION` operator combines the result sets of two or more SELECT statements and removes duplicate rows from the combined result. The SELECT statements within the UNION must have the same number of columns, and the corresponding columns must have compatible data types. For example:


```
SELECT title FROM Old_Movies
UNION
SELECT title FROM New_Movies;
```

This gives us:

| title |
| --- |
| Avatar |
| Titanic |
| The Godfather |
| Star Wars |
| Jaws |
| Endgame |
| Inception |
| Interstellar |


### INTERSECT

The INTERSECT operator returns the intersection of 2 or more SELECT statements, i.e., it returns only the rows that are common to all the SELECT statements. For example:

```
SELECT title FROM Old_Movies
INTERSECT
SELECT title FROM New_Movies;
```

This gives us:

| title |
| --- |
| Avatar |
| Titanic |

### EXCEPT

The EXCEPT operator returns the difference between the first SELECT statement and the second SELECT statement. It returns rows from the first SELECT statement that are not returned by the second SELECT statement. For example:


```
SELECT title FROM Old_Movies
EXCEPT
SELECT title FROM New_Movies;
```

This gives us:

| title |
| --- |
| The Godfather |
| Star Wars |
| Jaws |

These set operations help you manipulate and analyze your data in a variety of ways, providing you with the flexibility to obtain the insights you need. Just like SQL join operations, they're another powerful tool in your SQL toolkit for data analysis and manipulation.
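
Here is a small sketch that runs all three operators against the `Old_Movies` and `New_Movies` tables shown above. It assumes an in-memory SQLite database created with Python's `sqlite3` module; the helper function `titles` is a name made up for this illustration. The results are collected into Python sets because SQL does not promise any row order unless you add ORDER BY.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Old_Movies (id INTEGER, title TEXT);
    CREATE TABLE New_Movies (id INTEGER, title TEXT);
    INSERT INTO Old_Movies VALUES (1, 'Avatar'), (2, 'Titanic'),
        (3, 'The Godfather'), (4, 'Star Wars'), (5, 'Jaws');
    INSERT INTO New_Movies VALUES (1, 'Avatar'), (2, 'Endgame'),
        (3, 'Inception'), (4, 'Interstellar'), (5, 'Titanic');
""")

def titles(operator):
    """Run Old_Movies <operator> New_Movies and return the titles as a set."""
    sql = f"SELECT title FROM Old_Movies {operator} SELECT title FROM New_Movies"
    return {row[0] for row in conn.execute(sql)}

union_titles = titles("UNION")          # everything, duplicates removed
intersect_titles = titles("INTERSECT")  # titles in both tables
except_titles = titles("EXCEPT")        # titles only in Old_Movies

print(len(union_titles))         # 8
print(sorted(intersect_titles))  # ['Avatar', 'Titanic']
print(sorted(except_titles))     # ['Jaws', 'Star Wars', 'The Godfather']
```

Note that EXCEPT is not symmetric: swapping the two SELECT statements would return the titles that appear only in `New_Movies`.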

#### Example 10: List of All Actors and Directors
You might want to create a list of all persons who have either acted in or directed a movie. This is a case for UNION.

In [None]:
%%sql
SELECT Person.name FROM Person
INNER JOIN Actor ON Person.id = Actor.actor_id
UNION
SELECT Person.name FROM Person
INNER JOIN Director ON Person.id = Director.director_id
LIMIT 10;


 * sqlite:///movie.sqlite
Done.


name
Aaron Eckhart
Aaron Taylor-Johnson
Abe Vigoda
Abigail Breslin
Adam Brody
Adam Bryant
Adam Driver
Adam McKay
Adam Sandler
Adam Shankman


#### Example 11: Persons Who Are Both Actors and Directors
Perhaps you're interested in multi-talented individuals who have both acted in and directed movies. Here, we can use `INTERSECT`.

In [None]:
%%sql
SELECT Person.name FROM Person
INNER JOIN Actor ON Person.id = Actor.actor_id
INTERSECT
SELECT Person.name FROM Person
INNER JOIN Director ON Person.id = Director.director_id
LIMIT 10;


 * sqlite:///movie.sqlite
Done.


name
Ben Affleck
Bob Peterson
Bradley Cooper
Chris Sanders
Clint Eastwood
David Butler
Denzel Washington
Ed Harris
Elizabeth Banks
Garry Marshall


#### Example 12: Directors Who Have Never Acted
In an exploration of how to become a director, you might want to know which directors have never acted. Now, we can use `EXCEPT`.

In [None]:
%%sql
SELECT Person.name FROM Person
INNER JOIN Director ON Person.id = Director.director_id
EXCEPT
SELECT Person.name FROM Person
INNER JOIN Actor ON Person.id = Actor.actor_id
LIMIT 10;



 * sqlite:///movie.sqlite
Done.


name
Adam McKay
Adam Shankman
Adrian Lyne
Adrian Molina
Alan J. Pakula
Alan Taylor
Alejandro Gonzalez Inarritu
Alex Proyas
Alexander Korda
Alfonso Cuaron


## More Sample Queries
Let's try out what we've learned using joins to explore our movies database. Along the way, we'll also learn a few new tricks and techniques in SQL.


#### Example 13: Actors from Batman Returns

In [None]:
%%sql
-- Get a list of actors in "Batman Returns"
SELECT * FROM Person
JOIN Actor ON Person.id = Actor.actor_id
JOIN Movie on Movie.id = Actor.movie_id
WHERE Movie.name = "Batman Returns";

 * sqlite:///movie.sqlite
Done.


id,name,dob,pob,actor_id,movie_id,id_1,name_1,year,rating,runtime,genre,earnings_rank
474,Michael Keaton,1951-09-05,"Coraopolis, Pennsylvania, USA",474,103776,103776,Batman Returns,1992,PG-13,126,AYRT,
362,Danny DeVito,1944-11-17,"Neptune, New Jersey, USA",362,103776,103776,Batman Returns,1992,PG-13,126,AYRT,
201,Michelle Pfeiffer,1958-04-29,"Santa Ana, California, USA",201,103776,103776,Batman Returns,1992,PG-13,126,AYRT,
686,Christopher Walken,1943-03-31,"Queens, New York, USA",686,103776,103776,Batman Returns,1992,PG-13,126,AYRT,
1284,Michael Gough,1917-11-23,Malaya. [now Malaysia],1284,103776,103776,Batman Returns,1992,PG-13,126,AYRT,


Here, we "join" Actor, Person, and Movie to get a list of actors in "Batman Returns" by doing the following:

-   `SELECT *` - Select all columns from every table in the join (`Person`, `Actor`, and `Movie`).
-   `FROM Person` - Specify that the data should be retrieved from the `Person` table.
-   `JOIN Actor ON Person.id = Actor.actor_id` - Join the `Person` table to the `Actor` table on the `Person.id` and `Actor.actor_id` columns.
-   `JOIN Movie on Movie.id = Actor.movie_id` - Join the `Actor` table to the `Movie` table on the `Movie.id` and `Actor.movie_id` columns.
-   `WHERE Movie.name = "Batman Returns"` - Filter the results to only include records where the `Movie.name` column is equal to "Batman Returns".

This shows everything (all columns) returned by the join. We could also make the output a bit nicer with something like:

In [None]:
%%sql
SELECT Person.name AS "Batman Returns Actor",
  dob AS "Date of birth"
FROM Person
  JOIN Actor ON Person.id = Actor.actor_id
  JOIN Movie on Movie.id = Actor.movie_id
WHERE Movie.name = "Batman Returns"
ORDER BY dob;

 * sqlite:///movie.sqlite
Done.


Batman Returns Actor,Date of birth
Michael Gough,1917-11-23
Christopher Walken,1943-03-31
Danny DeVito,1944-11-17
Michael Keaton,1951-09-05
Michelle Pfeiffer,1958-04-29


#### Example 14: Recent Star Wars and Star Trek Films

In [None]:
%%sql
SELECT *
FROM Movie
WHERE (name LIKE "Star Wars%"
  OR name LIKE "Star Trek%")
AND year > 2010
ORDER BY year

 * sqlite:///movie.sqlite
Done.


id,name,year,rating,runtime,genre,earnings_rank
1408101,Star Trek Into Darkness,2013,PG-13,132,A,164
2488496,Star Wars: The Force Awakens,2015,PG-13,138,A,1
2527336,Star Wars: Episode VIII - The Last Jedi,2017,PG-13,152,AVYS,11
2527338,Star Wars: The Rise Of Skywalker,2019,PG-13,141,AVYS,16


Here, we get a list of recent Star Wars and Star Trek films by doing the following:
-   `SELECT *` - Select all columns from the `Movie` table.
-   `FROM Movie` - Specify that the data should be retrieved from the `Movie` table.
-   `WHERE (name LIKE "Star Wars%" OR name LIKE "Star Trek%")` - Filter the results to only include records where the `name` column begins with the string "Star Wars" or "Star Trek".
-   `AND year > 2010` - Filter the results to only include records where the `year` column is greater than 2010.
-   `ORDER BY year` - Sort the results by the `year` column in ascending order.

Note that the parentheses here matter! Here's what happens if I remove them:

In [None]:
%%sql
SELECT *
FROM Movie
WHERE name LIKE "Star Wars%"
  OR name LIKE "Star Trek%"
AND year > 2010
ORDER BY year

 * sqlite:///movie.sqlite
Done.


id,name,year,rating,runtime,genre,earnings_rank
76759,Star Wars: Episode IV - A New Hope,1977,PG,121,AVYS,21
80684,Star Wars: Episode V - The Empire Strikes Back,1980,PG,124,AVYS,101
86190,Star Wars: Episode VI - Return of the Jedi,1983,PG,134,AVYS,90
120915,Star Wars: Episode I - The Phantom Menace,1999,PG,133,AVS,20
121765,Star Wars: Episode II - Attack of the Clones,2002,PG,143,AVS,88
121766,Star Wars: Episode III - Revenge of the Sith,2005,PG-13,140,AVYS,47
1408101,Star Trek Into Darkness,2013,PG-13,132,A,164
2488496,Star Wars: The Force Awakens,2015,PG-13,138,A,1
2527336,Star Wars: Episode VIII - The Last Jedi,2017,PG-13,152,AVYS,11
2527338,Star Wars: The Rise Of Skywalker,2019,PG-13,141,AVYS,16


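Here, the "order of operations" is different, so I end up with different results. In SQL, `AND` binds more tightly than `OR`, so without the parentheses the condition is read as `name LIKE "Star Wars%" OR (name LIKE "Star Trek%" AND year > 2010)`. I get Star Wars films from *any* year, plus Star Trek films released after 2010.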
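
To see the precedence rule in isolation, here is a minimal sketch using a tiny, made-up `Movie` table in an in-memory database (via Python's `sqlite3` module). The four rows and the helper `names_where` are assumptions chosen for illustration; they are not taken from movie.sqlite.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Toy rows for illustration only
    CREATE TABLE Movie (name TEXT, year INTEGER);
    INSERT INTO Movie VALUES
        ('Star Wars: Episode IV - A New Hope', 1977),
        ('Star Trek: First Contact', 1996),
        ('Star Trek Into Darkness', 2013),
        ('Star Wars: The Force Awakens', 2015);
""")

def names_where(condition):
    """Return movie names matching the WHERE condition, oldest first."""
    sql = f"SELECT name FROM Movie WHERE {condition} ORDER BY year"
    return [row[0] for row in conn.execute(sql)]

# Parentheses force the OR to be evaluated before the AND
with_parens = names_where(
    "(name LIKE 'Star Wars%' OR name LIKE 'Star Trek%') AND year > 2010")

# No parentheses: AND is evaluated first, so the year test only applies to Star Trek
without_parens = names_where(
    "name LIKE 'Star Wars%' OR name LIKE 'Star Trek%' AND year > 2010")

# The same thing with the implicit grouping written out
explicit = names_where(
    "name LIKE 'Star Wars%' OR (name LIKE 'Star Trek%' AND year > 2010)")

print(with_parens)                 # the two films after 2010
print(without_parens)              # the 1977 Star Wars film sneaks back in
print(without_parens == explicit)  # True
```

When a WHERE clause mixes `AND` and `OR`, adding parentheses even where they are not strictly required makes the intent obvious to the next reader.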

#### Example 15: Actors Who Have Appeared in More than 10 Movies
Now, let's get a list of actors who have appeared in more than 10 movies, along with a count of these movies:

In [None]:
%%sql
SELECT P.name, COUNT(*) AS "Num_Movies"
FROM Person P JOIN
Actor A ON A.actor_id = P.id
JOIN Movie M on M.id = A.movie_id
GROUP BY P.name
HAVING COUNT(*) > 10

 * sqlite:///movie.sqlite
Done.


name,Num_Movies
Harrison Ford,11
Robert Downey Jr.,11
Tom Cruise,16
Tom Hanks,12
Will Smith,13


The above query uses several important SQL techniques, including:

-   JOINs to combine data from multiple tables. In this case, the query uses two JOINs to connect the `Person`, `Actor`, and `Movie` tables.
-   GROUP BY to group the results by a particular column. In this case, the query groups the results by the `Person` name.
-   HAVING to filter the grouped results based on a condition. In this case, the query filters the results to include ONLY people who have appeared in more than 10 movies.
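
The same JOIN, GROUP BY, and HAVING pattern can be tried on a toy dataset. The sketch below assumes a made-up in-memory database with two people and a lower threshold (more than 2 movies) so the effect is easy to see. It also groups by `P.id` in addition to `P.name`, which is a small variation on the query above: it keeps two different people who happen to share a name from being merged into one group.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Toy rows for illustration only
    CREATE TABLE Person (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE Actor (actor_id INTEGER, movie_id INTEGER);
    INSERT INTO Person VALUES (1, 'Person One'), (2, 'Person Two');
    INSERT INTO Actor VALUES (1, 100), (1, 101), (1, 102), (2, 100);
""")

rows = conn.execute("""
    SELECT P.name, COUNT(*) AS Num_Movies
    FROM Person P
    JOIN Actor A ON A.actor_id = P.id
    GROUP BY P.id, P.name      -- one group per person
    HAVING COUNT(*) > 2        -- keep only groups with more than 2 rows
""").fetchall()

print(rows)  # [('Person One', 3)]
```

`Person Two` has only one row in `Actor`, so that group is dropped by HAVING. A WHERE clause could not do this, because WHERE is applied to individual rows before any counting happens.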

#### Example 16: Information About Movies that Have Won 4 or more Oscars
Now, let's get information about movies that have won four or more Oscars. (Note: Our database only tracks a limited number of Oscars, such as BEST-PICTURE, BEST-ACTOR, BEST-ACTRESS, BEST-SUPPORTING-ACTRESS, and BEST-SUPPORTING-ACTOR.)

In [None]:
%%sql
SELECT M.name, M.year, M.rating, COUNT(M.id)
FROM Movie M JOIN
Oscar O ON O.movie_id = M.id
GROUP BY M.id
HAVING COUNT(M.id) > 3
ORDER BY M.year

 * sqlite:///movie.sqlite
Done.


name,year,rating,COUNT(M.id)
It Happened One Night,1934,,4
Gone with the Wind,1939,G,4
Mrs. Miniver,1942,,4
Going My Way,1944,,4
"Best Years of Our Lives, The",1946,,4
From Here to Eternity,1953,,4
On the Waterfront,1954,,4
Ben-Hur,1959,G,4
West Side Story,1961,,5
One Flew Over the Cuckoo's Nest,1975,R,4


Important ideas:
-   The query uses a `JOIN` to combine the `Movie` and `Oscar` tables.
-   It then groups the results by the `Movie` id.
-   The query filters the results using `HAVING` to only include movies that have won more than 3 Oscars (that is, 4 or more).
- Finally, it sorts the results by the `Movie` year.

#### Example 17: Simple Subqueries
Now, let's review what we learned earlier about subqueries with a few different examples:

In [None]:
%%sql
-- Let's find the movie (or movies) with the longest run time
SELECT name, runtime FROM Movie
  WHERE runtime IN
    (SELECT MAX(runtime) FROM Movie)

 * sqlite:///movie.sqlite
Done.


name,runtime
Justice League,242


In [None]:
%%sql
-- Or, we could just find the list of movies
-- that have a run time of more than 150% of the average
SELECT name, runtime FROM Movie
  WHERE runtime > 1.5 *
    (SELECT AVG(runtime) FROM Movie)

 * sqlite:///movie.sqlite
Done.


name,runtime
Titanic,194
Avengers: Age of Ultron,195
"Lord of the Rings: The Return of the King, The",201
Batman v Superman: Dawn of Justice,183
Justice League,242
King Kong,187
Gone with the Wind,238
Pearl Harbor,183
"Green Mile, The",188
Schindler's List,197


#### Example 18: Tricky Subqueries
Now, let's consider a tricky subquery: determining the number of movies appeared in by each actor who has been in any Star Wars film.


In [None]:
%%sql
-- A tricky one: The number of movies appeared in
-- by actors who have been in any Star Wars Film
SELECT P.name, COUNT(A.actor_id) as "# of Movies" FROM Actor A
  JOIN Person P ON P.id = A.actor_id
  GROUP BY A.actor_id, P.name
  HAVING A.actor_id IN
    (SELECT A1.actor_id FROM Movie M1
      JOIN Actor A1 ON A1.movie_id=M1.id
      WHERE M1.name LIKE '%Star Wars%')
    AND COUNT(A.actor_id) > 5;

 * sqlite:///movie.sqlite
Done.


name,# of Movies
Harrison Ford,11
Samuel L. Jackson,7
Natalie Portman,7
Carrie Fisher,7
Mark Hamill,6


Let's break this down piece by piece:

1.  `SELECT P.name, COUNT(A.actor_id) as "# of Movies" FROM Actor A JOIN Person P ON P.id = A.actor_id`: This is the main part of the query. It retrieves each actor's name from the `Person` table and counts that actor's rows in the `Actor` table. Because each row in `Actor` links one actor to one movie, `COUNT(A.actor_id)` gives the number of movies each actor has appeared in.

2.  `JOIN`: The `JOIN` statement is used to combine rows from two or more tables, based on a related column between them. In this case, the related column is the `id` field in the `Person` table and `actor_id` field in the `Actor` table. Essentially, this is how we link an actor's name with their roles.

3.  `GROUP BY A.actor_id, P.name`: This clause groups the result set by one or more columns, and is used together with aggregate functions like `COUNT`, `SUM`, `AVG`, etc. Here it groups the rows by `actor_id` and `name`, so the movie count is computed separately for each unique actor.

4.  `HAVING A.actor_id IN (SELECT A1.actor_id FROM Movie M1 JOIN Actor A1 ON A1.movie_id=M1.id WHERE M1.name LIKE '%Star Wars%') AND COUNT(A.actor_id) > 5;`: This part of the query is a filter that applies to the grouped data. The `HAVING` clause works like a WHERE clause, but for aggregated data. The subquery `(SELECT A1.actor_id FROM Movie M1 JOIN Actor A1 ON A1.movie_id=M1.id WHERE M1.name LIKE '%Star Wars%')` returns a list of actor IDs who have played in a movie with 'Star Wars' in its title. The `IN` operator checks whether a value is within a set of values returned by a subquery. So, `HAVING A.actor_id IN ...` is used to include only the actors who were in a 'Star Wars' movie. The other part of the `HAVING` clause, `AND COUNT(A.actor_id) > 5`, keeps only the actors who have played in more than 5 movies.
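
The steps above can be tried on a toy dataset. This sketch assumes a made-up in-memory database (three placeholder actors, three placeholder films, and a threshold of more than 1 movie instead of more than 5). It also shows a variation that is not in the original query: moving the `IN` test into a WHERE clause, which filters the rows before grouping and, on this data, gives the same answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Toy rows for illustration only
    CREATE TABLE Person (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE Movie (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE Actor (actor_id INTEGER, movie_id INTEGER);
    INSERT INTO Person VALUES (1, 'Actor A'), (2, 'Actor B'), (3, 'Actor C');
    INSERT INTO Movie VALUES (1, 'Star Wars: Toy Episode'),
                             (2, 'Other Film 1'), (3, 'Other Film 2');
    INSERT INTO Actor VALUES (1, 1), (1, 2), (1, 3),  -- Actor A: 3 films
                             (2, 1),                  -- Actor B: Star Wars only
                             (3, 2), (3, 3);          -- Actor C: never in Star Wars
""")

STAR_WARS_ACTORS = """
    SELECT A1.actor_id FROM Movie M1
    JOIN Actor A1 ON A1.movie_id = M1.id
    WHERE M1.name LIKE '%Star Wars%'
"""

# Same shape as the notebook query: both tests live in HAVING
having_version = conn.execute(f"""
    SELECT P.name, COUNT(A.actor_id) AS num_movies
    FROM Actor A JOIN Person P ON P.id = A.actor_id
    GROUP BY A.actor_id, P.name
    HAVING A.actor_id IN ({STAR_WARS_ACTORS}) AND COUNT(A.actor_id) > 1
""").fetchall()

# Variation: filter actors with WHERE before grouping, count with HAVING after
where_version = conn.execute(f"""
    SELECT P.name, COUNT(A.actor_id) AS num_movies
    FROM Actor A JOIN Person P ON P.id = A.actor_id
    WHERE A.actor_id IN ({STAR_WARS_ACTORS})
    GROUP BY A.actor_id, P.name
    HAVING COUNT(A.actor_id) > 1
""").fetchall()

print(having_version)  # [('Actor A', 3)]
print(where_version)   # [('Actor A', 3)]
```

Actor B is dropped by the count test and Actor C by the `IN` test. Note that the count for Actor A is 3, not 1: the subquery only decides *which actors* qualify, while the outer query still counts all of their movies.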

#### Example 19: Making Sense of Genre and Rating
Finally, let's talk a bit about two columns we haven't done much with so far: `genre` and `rating`. The `genre` column allows movies to have MULTIPLE genres, with each genre being abbreviated by a letter. So, for example, the entry for Avatar is as follows:

| id | name | year | rating | runtime | genre | earnings_rank |
| --- | --- | --- | --- | --- | --- | --- |
| 0499549 | Avatar | 2009 | PG-13 | 162 | AVYS | 4 |

Here `AVYS` means it belongs to the Action (A), Adventure (V), Youth (Y), and Science Fiction (S) genres. Other genres include Fantasy (F), Drama (D), and others.

We can use this knowledge to make queries. For example: "What movie ratings (G, PG, PG-13, R) did Science Fiction movies receive in the decade 2010-2019?"

In [None]:
%%sql
SELECT rating,
  COUNT(rating) As Num_Movies
FROM Movie
WHERE genre LIKE "%S%"
AND year BETWEEN 2010 AND 2019
GROUP BY rating

 * sqlite:///movie.sqlite
Done.


rating,Num_Movies
PG,2
PG-13,21
R,1


The syntax of this query is as follows:
-   SELECT: We want to retrieve the `rating`  column and the  `COUNT(rating)` "calculated" column.
-   FROM: We want to retrieve data from the `Movie` table. (No joins needed!)
-   WHERE: We are filtering the results to only include movies that have a genre that contains the letter `S` (for "science fiction"). We are also filtering the results to only include movies that were released between 2010 and 2019.
-   GROUP BY: We group the results of the query by the `rating` column. This means that we will get one row for each unique rating, and the `COUNT(rating)` column will tell us how many movies have that rating.
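
Here is a minimal sketch of the same filter-then-group pattern on a handful of made-up movies, using Python's `sqlite3` module and an in-memory database. The rows are invented for illustration; only the genre-letter convention comes from this chapter.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Toy rows for illustration only
    CREATE TABLE Movie (name TEXT, year INTEGER, rating TEXT, genre TEXT);
    INSERT INTO Movie VALUES
        ('Film 1', 2012, 'PG-13', 'AVS'),  -- sci-fi, in range
        ('Film 2', 2015, 'PG-13', 'S'),    -- sci-fi, in range
        ('Film 3', 2016, 'R',     'DS'),   -- sci-fi, in range
        ('Film 4', 2014, 'PG',    'D'),    -- in range, but no S in genre
        ('Film 5', 2005, 'PG-13', 'AS');   -- sci-fi, but too early
""")

rows = conn.execute("""
    SELECT rating, COUNT(rating) AS Num_Movies
    FROM Movie
    WHERE genre LIKE '%S%'            -- genre string contains the letter S
      AND year BETWEEN 2010 AND 2019  -- BETWEEN includes both endpoints
    GROUP BY rating
    ORDER BY rating
""").fetchall()

print(rows)  # [('PG-13', 2), ('R', 1)]
```

One thing to keep in mind: in SQLite, `LIKE` is case-insensitive for ASCII letters by default, so `'%S%'` would also match a lowercase `s`. That is harmless here because the genre codes are all uppercase, but it could matter with other data.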

## Review With Quizlet
Click the following cell to launch the flashcards for this chapter.

In [None]:
%%html
<iframe src="https://quizlet.com/819344635/learn/embed?i=psvlh&x=1jj1" height="600" width="100%" style="border:0"></iframe>

## Exercises

Here are some problems for you to try. All of your answers should include LIMIT 5 at the end (if you don't, some will lead to MANY results). Before starting the problems, make sure you have "run" the cells that start the database (near the beginning of this document). You can find answers to selected exercises below.

1. Find the names, release years, and ratings of all movies released in the year 2015 with a rating of 'PG-13'. Limit the results to 5. Hint: Use SELECT and WHERE clauses.

2. Retrieve the names of the top 5 actors who played in the movie 'Avatar' along with the movie's name. Hint: Use JOIN to connect the Actor, Movie, and Person tables.

3. Find the names of the top 5 directors and the number of movies directed by each. Hint: Use JOIN to connect the Director, Movie, and Person tables, GROUP BY on the director_id column, and LIMIT.

4. Determine the number of movies in the "science fiction" genre released in each year. (Note: In the genre column, any entry that contains the letter "S" belongs to this genre). Hint: You'll need to use LIKE, WHERE, and GROUP BY.

5. Find the names of the top 5 directors who have directed movies with names starting with 'A' along with the number of movies they directed. Hint: Use JOIN to connect the Director, Movie, and Person tables, use WHERE with the LIKE operator to filter the movie names, use GROUP BY on the director_id column, and LIMIT.

6. Find the names of the top 5 actors who have acted in the most movies directed by George Lucas along with the names of those movies. Hint: Use a subquery in the WHERE clause with the IN operator and LIMIT.

7. Find the names of the top 5 actors who have acted in more movies than the average number of movies acted in by all actors, along with the number of movies they acted in. Hint: Use a subquery in the HAVING clause to find the average number of movies acted in by all actors, and LIMIT.

8. Find the name of the movie in the "Action" genre (in the genre column, anything that contains "A") with the longest runtime along with its runtime. Hint: Use a subquery with the MAX aggregate function to find the longest runtime and LIMIT.

In [None]:
%%sql
--Ex. 1

 * sqlite:///movie.sqlite
Done.


[]

In [None]:
%%sql
--Ex. 2

 * sqlite:///movie.sqlite
Done.


[]

In [None]:
%%sql
--Ex. 3

 * sqlite:///movie.sqlite
Done.


[]

In [None]:
%%sql
--Ex. 4

 * sqlite:///movie.sqlite
Done.


[]

In [None]:
%%sql
--Ex. 5

 * sqlite:///movie.sqlite
Done.


[]

In [None]:
%%sql
--Ex. 6

 * sqlite:///movie.sqlite
Done.


[]

In [None]:
%%sql
--Ex. 7

 * sqlite:///movie.sqlite
Done.


[]

In [None]:
%%sql
--Ex. 8

 * sqlite:///movie.sqlite
Done.


[]

## Table: SQL Queries With Joins and Sets
Here's a table illustrating some of the main concepts we've been covering in this section:

| SQL Query | Description in English |
| --- | --- |
| `SELECT * FROM Movie;` | SQL code to retrieve all records from the 'Movie' table |
| `SELECT name FROM Person;` | SQL code to retrieve the names of all persons |
| `SELECT actor_id FROM Actor;` | SQL code to find all actors' ids |
| `SELECT DISTINCT type FROM Oscar;` | SQL code to list all unique Oscar types |
| `SELECT COUNT(DISTINCT id) FROM Movie;` | SQL code to count the number of different movies |
| `SELECT DISTINCT(P.name) FROM Person P JOIN Oscar O ON P.id = O.person_id;` | SQL code to retrieve the names of people who have won an Oscar |
| `SELECT M.name FROM Movie M JOIN Director D ON M.id = D.movie_id JOIN Person P ON D.director_id = P.id WHERE P.name LIKE 'Steven Spielberg';` | SQL code to find all movies directed by Steven Spielberg. |
| `SELECT P.name FROM Person P JOIN Actor A ON P.id = A.actor_id JOIN Movie M ON A.movie_id = M.id WHERE M.name LIKE '%Batman%';` | SQL code to find all actors who starred in any movie with "Batman" in the title. |
| `SELECT M.name FROM Movie M JOIN Actor A ON M.id = A.movie_id JOIN Person P ON A.actor_id = P.id WHERE P.name LIKE 'Marlon Brando';` | SQL code to find all movies where Marlon Brando has starred. |
| `SELECT M.name FROM Movie M JOIN Oscar O ON M.id = O.movie_id WHERE O.type = "BEST-PICTURE" AND M.name LIKE '%War%';` | SQL code to find all movies nominated for Best Picture Oscars with "War" in the title. |
| `SELECT DISTINCT(M.name) FROM Movie M JOIN Oscar O ON M.id = O.movie_id WHERE M.genre LIKE '%A%';` | SQL code to find all 'Action' genre movies that have received an Oscar nomination. |
| `SELECT DISTINCT(P.name) FROM Person P JOIN Director D ON P.id = D.director_id JOIN Movie M ON D.movie_id = M.id WHERE M.name LIKE '%Spider-Man%';` | SQL code to find the director(s) of "Spider-Man" movies. |
| `SELECT p.name FROM Person p INNER JOIN Actor a ON p.id = a.actor_id INTERSECT SELECT p.name FROM Person p INNER JOIN Director d ON p.id = d.director_id;` | SQL code to find persons who are both actors and directors |
| `SELECT name, earnings_rank FROM Movie WHERE year > 2000;` | SQL code to retrieve all movies released after 2000 with their earnings ranks |
| `SELECT m.name, COUNT(a.actor_id) as num_actors FROM Movie m JOIN Actor a ON m.id = a.movie_id GROUP BY m.id HAVING num_actors > 3;` | SQL code to count the number of actors for each movie, only including movies with more than 3 actors |
| `SELECT p.name, COUNT(o.type) as num_oscars FROM Person p JOIN Oscar o ON p.id = o.person_id GROUP BY p.id ORDER BY num_oscars DESC LIMIT 1;` | SQL code to find the person who won the most Oscars |

## Glossary
| Term | Definition |
| --- | --- |
| VARCHAR(n) | A data type used for variable-length strings where the number inside the parentheses specifies the maximum number of characters the field can hold. |
| CHAR(n) | A data type used for character strings of a fixed length, with the number inside the parentheses specifying the exact length of characters the field can hold. |
| Junction (Join) Table | A table in a database used to resolve many-to-many relationships between two other tables. It typically contains foreign keys that correspond to the primary keys of the related tables. |
| INNER (NATURAL) JOIN | A type of join operation in SQL that returns rows where there is a match in both tables being joined. If specified as NATURAL, it automatically matches columns between the tables with the same names. |
| LEFT (OUTER) JOIN | A join operation that returns all the rows from the left table and matched rows from the right table. If there is no match, the result from the right side will contain NULL values. |
| Entity-Relationship Diagram | A visual representation of different entities within a database and the relationships between them. Entities are typically depicted as rectangles, with relationships illustrated as lines or arrows connecting these rectangles. |
| UNION | An SQL operation that combines rows from two or more SELECT statements into a single result, eliminating duplicate entries. All SELECT statements within the UNION must have the same number of columns with compatible data types. |
| INTERSECT | An SQL operation that returns the common records between two SELECT statements. Both SELECT statements need to have the same number of columns with compatible data types. |
| EXCEPT | An SQL operation that returns the records present in the first SELECT statement but not in the second one. Both SELECT statements must have the same number of columns with compatible data types. |