# Joining Data in SQL

## Introducing Joins

In the SQL Fundamentals course, we worked exclusively with data that existed in a single table. In the real world, it's much more common for a database to spread its data across more than one table. To work with that data, we have to combine multiple tables within a query. In SQL, we do this using **joins**. As in the SQL Fundamentals course, we'll continue to use [SQLite](https://sqlite.org/index.html) throughout this course.

In this mission, we're going to be using a version of the CIA World Factbook (Factbook) database from the guided project from the SQL Fundamentals course. To refresh your memory, this database had one table called `facts`, where each row represented a country from the Factbook. Here are the first 5 rows of the `facts` table:

In [1]:
import sqlite3
import pandas as pd
from matplotlib import pyplot as plt

%matplotlib inline

In [4]:
conn = sqlite3.connect("data/factbook.db")

q = "SELECT * FROM facts LIMIT 5;"
pd.read_sql_query(q, conn)

| | id | code | name | area | area_land | area_water | population | population_growth | birth_rate | death_rate | migration_rate |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | af | Afghanistan | 652230 | 652230 | 0 | 32564342 | 2.32 | 38.57 | 13.89 | 1.51 |
| 1 | 2 | al | Albania | 28748 | 27398 | 1350 | 3029278 | 0.3 | 12.92 | 6.58 | 3.3 |
| 2 | 3 | ag | Algeria | 2381741 | 2381741 | 0 | 39542166 | 1.84 | 23.67 | 4.31 | 0.92 |
| 3 | 4 | an | Andorra | 468 | 468 | 0 | 85580 | 0.12 | 8.13 | 6.96 | 0.0 |
| 4 | 5 | ao | Angola | 1246700 | 1246700 | 0 | 19625353 | 2.78 | 38.78 | 11.49 | 0.46 |


In addition to the `facts` table, we've added a new table called `cities`, which contains information on [major urban areas](https://www.cia.gov/library/publications/the-world-factbook/docs/notesanddefs.html?fieldkey=2219&term=Major%20urban%20areas%20-%20population) from countries in the Factbook (for the rest of this mission, we'll use the word 'cities' to mean the same as 'major urban areas'). Let's take a look at the first few rows of this new table and a description of what each column represents:

In [5]:
q = "SELECT * FROM cities LIMIT 5;"
pd.read_sql_query(q, conn)

| | id | name | population | capital | facts_id |
|---|---|---|---|---|---|
| 0 | 1 | Oranjestad | 37000 | 1 | 216 |
| 1 | 2 | Saint John'S | 27000 | 1 | 6 |
| 2 | 3 | Abu Dhabi | 942000 | 1 | 184 |
| 3 | 4 | Dubai | 1978000 | 0 | 184 |
| 4 | 5 | Sharjah | 983000 | 0 | 184 |


* `id` - A unique ID for each city.
* `name` - The name of the city.
* `population` - The population of the city.
* `capital` - Whether the city is a capital city: `1` if it is, `0` if it isn't.
* `facts_id` - The ID of the country the city belongs to, taken from the `id` column of the `facts` table.

The last column is of particular interest to us, because its values also exist in our original `facts` table. This link between the tables is important: it's what we use to combine their data in our queries. Below is a **schema diagram**, which shows the two tables in our database, the columns within them, and how the two are linked.

![Schema diagram of the facts and cities tables](https://s3.amazonaws.com/dq-content/179/schema.svg)

The line in the schema diagram shows the link between the `id` column in the `facts` table and the `facts_id` column in the `cities` table. You may need to refer back to this schema diagram throughout the mission.
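To make the link in the diagram concrete, here is a minimal sketch that rebuilds a cut-down version of the two tables in an in-memory SQLite database. The column types, the `FOREIGN KEY` declaration, and the name `schema_demo` are assumptions made for illustration; the real `factbook.db` may declare its tables differently. The three rows are copied from the table previews above.

```python
import sqlite3

# Toy, in-memory copy of the schema (assumed types; not the real factbook.db).
schema_demo = sqlite3.connect(":memory:")
schema_demo.executescript("""
    CREATE TABLE facts (
        id INTEGER PRIMARY KEY,
        code TEXT,
        name TEXT
    );
    CREATE TABLE cities (
        id INTEGER PRIMARY KEY,
        name TEXT,
        population INTEGER,
        capital INTEGER,
        facts_id INTEGER,
        FOREIGN KEY (facts_id) REFERENCES facts(id)
    );
    INSERT INTO facts VALUES (184, 'ae', 'United Arab Emirates');
    INSERT INTO cities VALUES (3, 'Abu Dhabi', 942000, 1, 184);
    INSERT INTO cities VALUES (4, 'Dubai', 1978000, 0, 184);
""")

# Step 1: read the facts_id stored on the city row.
dubai_facts_id = schema_demo.execute(
    "SELECT facts_id FROM cities WHERE name = 'Dubai';"
).fetchone()[0]

# Step 2: use that value to look up the matching row in facts by its id.
country = schema_demo.execute(
    "SELECT name FROM facts WHERE id = ?;", (dubai_facts_id,)
).fetchone()[0]

print(dubai_facts_id, country)
```

Following the link by hand takes two separate queries here; a join lets us do the same lookup for every row in a single query.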

The most common way to join data using SQL is using an **inner join**. The syntax for an inner join is:

```sql
SELECT [column_names] FROM [table_name_one]
INNER JOIN [table_name_two] ON [join_constraint];
```

The inner join clause is made up of two parts:

* `INNER JOIN`, which tells the SQL engine the name of the table you wish to join in your query, and that you wish to use an inner join.
* `ON`, which tells the SQL engine what columns to use to join the two tables.

Joins are usually used in a query after the `FROM` clause. Let's look at a basic inner join where we combine the data from both of our tables.

```sql
SELECT * FROM facts
INNER JOIN cities ON cities.facts_id = facts.id
LIMIT 5;
```

Let's look at the two parts of the join clause in this query:

* `INNER JOIN cities` - This tells the SQL engine that we wish to join the `cities` table to our query using an inner join.
* `ON cities.facts_id = facts.id` - This tells the SQL engine which columns to use when joining the data, following the syntax `table_name.column_name`.

You might presume that `SELECT * FROM facts` means the query returns only columns from the `facts` table. However, when used with a join, the `*` wildcard gives you all columns from both tables. Here is the result of this query:

In [7]:
query = '''
    SELECT * FROM facts
    INNER JOIN cities ON cities.facts_id = facts.id
    LIMIT 5;
'''

In [8]:
pd.read_sql_query(query, conn)

| | id | code | name | area | area_land | area_water | population | population_growth | birth_rate | death_rate | migration_rate | id.1 | name.1 | population.1 | capital | facts_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 216 | aa | Aruba | 180 | 180 | 0 | 112162 | 1.33 | 12.56 | 8.18 | 8.92 | 1 | Oranjestad | 37000 | 1 | 216 |
| 1 | 6 | ac | Antigua and Barbuda | 442 | 442 | 0 | 92436 | 1.24 | 15.85 | 5.69 | 2.21 | 2 | Saint John'S | 27000 | 1 | 6 |
| 2 | 184 | ae | United Arab Emirates | 83600 | 83600 | 0 | 5779760 | 2.58 | 15.43 | 1.97 | 12.36 | 3 | Abu Dhabi | 942000 | 1 | 184 |
| 3 | 184 | ae | United Arab Emirates | 83600 | 83600 | 0 | 5779760 | 2.58 | 15.43 | 1.97 | 12.36 | 4 | Dubai | 1978000 | 0 | 184 |
| 4 | 184 | ae | United Arab Emirates | 83600 | 83600 | 0 | 5779760 | 2.58 | 15.43 | 1.97 | 12.36 | 5 | Sharjah | 983000 | 0 | 184 |
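The effect of the `*` wildcard can also be checked without pandas. The sketch below uses a toy in-memory database (the name `wildcard_demo` and the trimmed-down column lists are assumptions for illustration, not the real `factbook.db`) and reads the column names from `cursor.description`, which the standard `sqlite3` module fills in after a `SELECT`. It then names columns explicitly with the `table_name.column_name` syntax to pick out just two of them.

```python
import sqlite3

# Toy tables with only a few of the real columns (assumed for illustration).
wildcard_demo = sqlite3.connect(":memory:")
wildcard_demo.executescript("""
    CREATE TABLE facts (id INTEGER, code TEXT, name TEXT);
    CREATE TABLE cities (id INTEGER, name TEXT, facts_id INTEGER);
    INSERT INTO facts VALUES (216, 'aa', 'Aruba');
    INSERT INTO cities VALUES (1, 'Oranjestad', 216);
""")

# SELECT * on a join: columns from facts come first, then columns from cities.
cursor = wildcard_demo.execute("""
    SELECT * FROM facts
    INNER JOIN cities ON cities.facts_id = facts.id;
""")
all_columns = [col[0] for col in cursor.description]
print(all_columns)

# Naming columns as table_name.column_name selects one column from each table.
selected_rows = wildcard_demo.execute("""
    SELECT facts.name, cities.name FROM facts
    INNER JOIN cities ON cities.facts_id = facts.id;
""").fetchall()
print(selected_rows)
```

Both tables contribute an `id` and a `name` column, so those names appear twice in the joined result. Qualifying a column with its table name is how you say which one you mean.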


This query gives us all columns from both tables, and every row where there is a match between the `id` column from `facts` and the `facts_id` column from `cities`, limited to the first 5 rows. We'll look at how the join itself works in detail in a moment, but first let's practice writing our first join.

* Write a query that returns all columns from the `facts` and `cities` tables.
  * Use an `INNER JOIN` to join the `cities` table to the `facts` table.
  * Join the tables on the values where `facts.id` and `cities.facts_id` are equal.
  * Limit the query to the first 10 rows.

In [9]:
query = '''
    SELECT * FROM facts
    INNER JOIN cities ON cities.facts_id = facts.id
    LIMIT 10;
'''

pd.read_sql_query(query, conn)

| | id | code | name | area | area_land | area_water | population | population_growth | birth_rate | death_rate | migration_rate | id.1 | name.1 | population.1 | capital | facts_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 216 | aa | Aruba | 180 | 180 | 0 | 112162 | 1.33 | 12.56 | 8.18 | 8.92 | 1 | Oranjestad | 37000 | 1 | 216 |
| 1 | 6 | ac | Antigua and Barbuda | 442 | 442 | 0 | 92436 | 1.24 | 15.85 | 5.69 | 2.21 | 2 | Saint John'S | 27000 | 1 | 6 |
| 2 | 184 | ae | United Arab Emirates | 83600 | 83600 | 0 | 5779760 | 2.58 | 15.43 | 1.97 | 12.36 | 3 | Abu Dhabi | 942000 | 1 | 184 |
| 3 | 184 | ae | United Arab Emirates | 83600 | 83600 | 0 | 5779760 | 2.58 | 15.43 | 1.97 | 12.36 | 4 | Dubai | 1978000 | 0 | 184 |
| 4 | 184 | ae | United Arab Emirates | 83600 | 83600 | 0 | 5779760 | 2.58 | 15.43 | 1.97 | 12.36 | 5 | Sharjah | 983000 | 0 | 184 |
| 5 | 1 | af | Afghanistan | 652230 | 652230 | 0 | 32564342 | 2.32 | 38.57 | 13.89 | 1.51 | 6 | Kabul | 3097000 | 1 | 1 |
| 6 | 3 | ag | Algeria | 2381741 | 2381741 | 0 | 39542166 | 1.84 | 23.67 | 4.31 | 0.92 | 7 | Algiers | 2916000 | 1 | 3 |
| 7 | 3 | ag | Algeria | 2381741 | 2381741 | 0 | 39542166 | 1.84 | 23.67 | 4.31 | 0.92 | 8 | Oran | 783000 | 0 | 3 |
| 8 | 11 | aj | Azerbaijan | 86600 | 82629 | 3971 | 9780780 | 0.96 | 16.64 | 7.07 | 0.0 | 9 | Baku | 2123000 | 1 | 11 |
| 9 | 2 | al | Albania | 28748 | 27398 | 1350 | 3029278 | 0.3 | 12.92 | 6.58 | 3.3 | 10 | Tirana | 419000 | 1 | 2 |
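In the output above, countries such as the United Arab Emirates and Algeria appear once for each of their cities. The sketch below reproduces that matching behavior on a toy in-memory database. The name `match_demo` and the choice to give Andorra no city rows are assumptions made purely for illustration and say nothing about the real `cities` table.

```python
import sqlite3

# Toy data: Algeria has two cities, Albania has one, Andorra has none
# (the missing Andorra city is a made-up gap for this example).
match_demo = sqlite3.connect(":memory:")
match_demo.executescript("""
    CREATE TABLE facts (id INTEGER, name TEXT);
    CREATE TABLE cities (id INTEGER, name TEXT, facts_id INTEGER);
    INSERT INTO facts VALUES (2, 'Albania');
    INSERT INTO facts VALUES (3, 'Algeria');
    INSERT INTO facts VALUES (4, 'Andorra');
    INSERT INTO cities VALUES (7, 'Algiers', 3);
    INSERT INTO cities VALUES (8, 'Oran', 3);
    INSERT INTO cities VALUES (10, 'Tirana', 2);
""")

# The inner join keeps only pairs of rows where facts.id equals cities.facts_id.
joined = match_demo.execute("""
    SELECT facts.name, cities.name FROM facts
    INNER JOIN cities ON cities.facts_id = facts.id
    ORDER BY cities.id;
""").fetchall()

for country_name, city_name in joined:
    print(country_name, city_name)
```

Algeria is returned twice because two `cities` rows match its `id`, while Andorra does not appear at all because no `cities` row in this toy data has a `facts_id` of 4.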
