# Joining Data in SQL

## Introducing Joins

In the SQL Fundamentals course, we worked exclusively with data that existed in a single table. In the real world, it's much more common for databases to have data in more than one table. If we want to work with that data, we'll have to combine multiple tables within a query. In SQL, we do this using **joins**. As in the SQL Fundamentals course, we'll continue to use [SQLite](https://sqlite.org/index.html) throughout this course.

In this mission, we're going to be using a version of the CIA World Factbook (Factbook) database from the guided project from the SQL Fundamentals course. To refresh your memory, this database had one table called `facts`, where each row represented a country from the Factbook. Here are the first 5 rows of the `facts` table:

In [1]:
import sqlite3
import pandas as pd
from matplotlib import pyplot as plt

%matplotlib inline

In [2]:
conn = sqlite3.connect("data/factbook.db")

q = "SELECT * FROM facts LIMIT 5;"
pd.read_sql_query(q, conn)

Unnamed: 0,id,code,name,area,area_land,area_water,population,population_growth,birth_rate,death_rate,migration_rate
0,1,af,Afghanistan,652230,652230,0,32564342,2.32,38.57,13.89,1.51
1,2,al,Albania,28748,27398,1350,3029278,0.3,12.92,6.58,3.3
2,3,ag,Algeria,2381741,2381741,0,39542166,1.84,23.67,4.31,0.92
3,4,an,Andorra,468,468,0,85580,0.12,8.13,6.96,0.0
4,5,ao,Angola,1246700,1246700,0,19625353,2.78,38.78,11.49,0.46


In addition to the `facts` table, we've added a new table called `cities`, which contains information on [major urban areas](https://www.cia.gov/library/publications/the-world-factbook/docs/notesanddefs.html?fieldkey=2219&term=Major%20urban%20areas%20-%20population) from countries in the Factbook (for the rest of this mission, we'll use the word 'cities' to mean the same as 'major urban areas'). Let's take a look at the first few rows of this new table and a description of what each column represents:

In [3]:
q = "SELECT * FROM cities LIMIT 5;"
pd.read_sql_query(q, conn)

Unnamed: 0,id,name,population,capital,facts_id
0,1,Oranjestad,37000,1,216
1,2,Saint John'S,27000,1,6
2,3,Abu Dhabi,942000,1,184
3,4,Dubai,1978000,0,184
4,5,Sharjah,983000,0,184


* `id` - A unique ID for each city.
* `name` - The name of the city.
* `population` - The population of the city.
* `capital` - Whether the city is a capital city: `1` if it is, `0` if it isn't.
* `facts_id` - The ID of the country, from the `facts` table.

The last column is of particular interest to us, as it is a column of data that also exists in our original `facts` table. This link between tables is important as it's used to combine the data in our queries. Below is a **schema diagram**, which shows the two tables in our database, the columns within them and how the two are linked.

![https://s3.amazonaws.com/dq-content/179/schema.svg](https://s3.amazonaws.com/dq-content/179/schema.svg)

The line in the schema diagram shows the link between the `id` column in the `facts` table and the `facts_id` column in the `cities` table. You may need to refer back to this schema diagram throughout the mission.

The most common way to join data using SQL is using an **inner join**. The syntax for an inner join is:

```sql
SELECT [column_names] FROM [table_name_one]
INNER JOIN [table_name_two] ON [join_constraint];
```

The inner join clause is made up of two parts:

* `INNER JOIN`, which tells the SQL engine the name of the table you wish to join in your query, and that you wish to use an inner join.
* `ON`, which tells the SQL engine what columns to use to join the two tables.

Joins are usually used in a query after the `FROM` clause. Let's look at a basic inner join where we combine the data from both of our tables.

```sql
SELECT * FROM facts
INNER JOIN cities ON cities.facts_id = facts.id
LIMIT 5;
```

Let's look at the line of the query with the join in it:
* `INNER JOIN cities` - This tells the SQL engine that we wish to join the `cities` table to our query using an inner join.
* `ON cities.facts_id = facts.id` - This tells the SQL engine which columns to use when joining the data, following the syntax `table_name.column_name`.

You might presume that `SELECT * FROM facts` means the query returns only columns from the `facts` table. However, when used with a join, the `*` wildcard gives you all columns from both tables. Here is the result of this query:

In [4]:
query = '''
        select * from facts
        inner join cities
        on facts.id = cities.facts_id
        limit 5;
'''

In [5]:
pd.read_sql_query(query, conn)

Unnamed: 0,id,code,name,area,area_land,area_water,population,population_growth,birth_rate,death_rate,migration_rate,id.1,name.1,population.1,capital,facts_id
0,216,aa,Aruba,180,180,0,112162,1.33,12.56,8.18,8.92,1,Oranjestad,37000,1,216
1,6,ac,Antigua and Barbuda,442,442,0,92436,1.24,15.85,5.69,2.21,2,Saint John'S,27000,1,6
2,184,ae,United Arab Emirates,83600,83600,0,5779760,2.58,15.43,1.97,12.36,3,Abu Dhabi,942000,1,184
3,184,ae,United Arab Emirates,83600,83600,0,5779760,2.58,15.43,1.97,12.36,4,Dubai,1978000,0,184
4,184,ae,United Arab Emirates,83600,83600,0,5779760,2.58,15.43,1.97,12.36,5,Sharjah,983000,0,184


This query gives us all columns from both tables and every row where there is a match between the `id` column from `facts` and the `facts_id` from `cities`, limited to the first 5 rows. We'll look at how the join itself works in detail in a moment, but first let's practice writing our first join.

* Write a query that returns all columns from the `facts` and `cities` tables.
  * Use an `INNER JOIN` to join the `cities` table to the `facts` table.
  * Join the tables on the values where `facts.id` and `cities.facts_id` are equal.
  * Limit the query to the first 10 rows.

In [6]:
query = '''
        select * from facts
        inner join cities
        on facts.id = cities.facts_id
        limit 10;
'''

pd.read_sql_query(query, conn)

Unnamed: 0,id,code,name,area,area_land,area_water,population,population_growth,birth_rate,death_rate,migration_rate,id.1,name.1,population.1,capital,facts_id
0,216,aa,Aruba,180,180,0,112162,1.33,12.56,8.18,8.92,1,Oranjestad,37000,1,216
1,6,ac,Antigua and Barbuda,442,442,0,92436,1.24,15.85,5.69,2.21,2,Saint John'S,27000,1,6
2,184,ae,United Arab Emirates,83600,83600,0,5779760,2.58,15.43,1.97,12.36,3,Abu Dhabi,942000,1,184
3,184,ae,United Arab Emirates,83600,83600,0,5779760,2.58,15.43,1.97,12.36,4,Dubai,1978000,0,184
4,184,ae,United Arab Emirates,83600,83600,0,5779760,2.58,15.43,1.97,12.36,5,Sharjah,983000,0,184
5,1,af,Afghanistan,652230,652230,0,32564342,2.32,38.57,13.89,1.51,6,Kabul,3097000,1,1
6,3,ag,Algeria,2381741,2381741,0,39542166,1.84,23.67,4.31,0.92,7,Algiers,2916000,1,3
7,3,ag,Algeria,2381741,2381741,0,39542166,1.84,23.67,4.31,0.92,8,Oran,783000,0,3
8,11,aj,Azerbaijan,86600,82629,3971,9780780,0.96,16.64,7.07,0.0,9,Baku,2123000,1,11
9,2,al,Albania,28748,27398,1350,3029278,0.3,12.92,6.58,3.3,10,Tirana,419000,1,2
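
To check that `*` really brings back the columns of both tables, here is a minimal, self-contained sketch. It assumes a throwaway in-memory database (called `toy` here) holding cut-down versions of `facts` and `cities`; the reduced column set is an assumption made for brevity, not the real Factbook schema.

```python
import sqlite3

# Toy stand-ins for the two tables (hypothetical, trimmed-down columns).
toy = sqlite3.connect(":memory:")
toy.executescript("""
    CREATE TABLE facts (id INTEGER PRIMARY KEY, name TEXT, population INTEGER);
    CREATE TABLE cities (id INTEGER PRIMARY KEY, name TEXT, population INTEGER,
                         capital INTEGER, facts_id INTEGER);
    INSERT INTO facts VALUES (1, 'Afghanistan', 32564342), (3, 'Algeria', 39542166);
    INSERT INTO cities VALUES (6, 'Kabul', 3097000, 1, 1),
                              (7, 'Algiers', 2916000, 1, 3),
                              (8, 'Oran', 783000, 0, 3);
""")

cur = toy.execute("""
    SELECT * FROM facts
    INNER JOIN cities ON cities.facts_id = facts.id;
""")
# cursor.description has one entry per result column; item 0 is its name.
columns = [d[0] for d in cur.description]
rows = cur.fetchall()
toy.close()

print(columns)    # three columns from facts, then five from cities
print(len(rows))  # one row per matching city
```

The repeated `id`, `name`, and `population` names in `columns` are the same duplicates that appear as `id.1`, `name.1`, and `population.1` in the output tables above.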


## Understanding Inner Joins

We've now joined the two tables to give us extra information about each row in `cities`. Let's take a closer look at how this inner join works.

An inner join works by including only rows from each table that have a match as specified using the `ON` clause. Let's look at a diagram of how our join from the previous screen works. We have included a selection of rows which best illustrate the join:

![https://s3.amazonaws.com/dq-content/179/inner_join.svg](https://s3.amazonaws.com/dq-content/179/inner_join.svg)

Our inner join will include:
* Rows from the `cities` table that have a `cities.facts_id` that matches a `facts.id` from `facts`.

Our inner join will not include:
* Rows from the `cities` table that have a `cities.facts_id` that doesn't match any `facts.id` from `facts`.
* Rows from the `facts` table that have a `facts.id` that doesn't match any `cities.facts_id` from `cities`.

You can see this represented as a Venn diagram:

![https://s3.amazonaws.com/dq-content/179/venn_inner.svg](https://s3.amazonaws.com/dq-content/179/venn_inner.svg)
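
The inclusion rules above can be sketched on a small example. The data here is hypothetical: the `id` used for Monaco and the unmatched 'Lost City' row (with a `facts_id` of 999) are made up purely to show rows being dropped from either side.

```python
import sqlite3

# Hypothetical toy data: Monaco has no row in cities, and 'Lost City'
# points at a facts_id (999) that does not exist in facts.
toy = sqlite3.connect(":memory:")
toy.executescript("""
    CREATE TABLE facts (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE cities (id INTEGER PRIMARY KEY, name TEXT, facts_id INTEGER);
    INSERT INTO facts VALUES (1, 'Afghanistan'), (3, 'Algeria'), (117, 'Monaco');
    INSERT INTO cities VALUES (6, 'Kabul', 1), (7, 'Algiers', 3),
                              (8, 'Oran', 3), (99, 'Lost City', 999);
""")

matched = toy.execute("""
    SELECT facts.name, cities.name FROM facts
    INNER JOIN cities ON cities.facts_id = facts.id
    ORDER BY cities.id;
""").fetchall()
toy.close()

# Neither Monaco (no city) nor 'Lost City' (no country) is printed.
for country, city in matched:
    print(country, city)
```

Algeria appears twice because two rows in `cities` match it; an inner join returns one row per matching pair.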

In the SQL Fundamentals course, we learned how to use [aliases](https://www.tutorialspoint.com/sqlite/sqlite_alias_syntax.htm) to specify custom names for columns, e.g.:

```sql
SELECT AVG(population) AS AVERAGE_POPULATION
```

We can also create aliases for table names, which makes queries with joins easier to both read and write. Instead of:

```sql
SELECT * FROM facts
INNER JOIN cities ON cities.facts_id = facts.id
```
We can write:

```sql
SELECT * FROM facts AS f
INNER JOIN cities AS c ON c.facts_id = f.id
```
Just like with column names, using `AS` is optional. We can get the same result by writing:

```sql
SELECT * FROM facts f
INNER JOIN cities c ON c.facts_id = f.id
```
We can also combine aliases with wildcards. For instance, using the aliases created above, `c.*` would give us all columns from the table `cities`.

While our query from the previous screen included both columns from the `ON` clause, we don't need to use either column from our `ON` clause in our final list of columns. This is useful as it means we can show only the information we're interested in, rather than having to include the two join columns every time.
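
As a quick check of both points (that `AS` is optional, and that the join columns don't have to appear in the output), here is a small sketch on a hypothetical in-memory database with cut-down tables. The table contents are toy data, not the real Factbook.

```python
import sqlite3

# Hypothetical, trimmed-down versions of the two tables.
toy = sqlite3.connect(":memory:")
toy.executescript("""
    CREATE TABLE facts (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE cities (id INTEGER PRIMARY KEY, name TEXT, facts_id INTEGER);
    INSERT INTO facts VALUES (1, 'Afghanistan'), (3, 'Algeria');
    INSERT INTO cities VALUES (6, 'Kabul', 1), (7, 'Algiers', 3), (8, 'Oran', 3);
""")

# Table and column aliases written with AS.
cur = toy.execute("""
    SELECT c.name AS city, f.name AS country
    FROM cities AS c
    INNER JOIN facts AS f ON c.facts_id = f.id
    ORDER BY c.id;
""")
columns = [d[0] for d in cur.description]
rows_with_as = cur.fetchall()

# The same query with every AS keyword left out.
rows_without_as = toy.execute("""
    SELECT c.name city, f.name country
    FROM cities c
    INNER JOIN facts f ON c.facts_id = f.id
    ORDER BY c.id;
""").fetchall()
toy.close()

print(columns)                          # neither join column is in the output
print(rows_with_as == rows_without_as)  # both spellings give the same rows
```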

Let's use what we've learned to build on our original query.

Write a query that:
* Joins `cities` to `facts` using an `INNER JOIN`.
* Uses aliases for table names.
* Includes, in order:
  * All columns from `cities`.
  * The `name` column from `facts` aliased to `country_name`.
* Includes only the first 5 rows.

In [7]:
query = '''
        select c.*, f.name AS country_name
        from cities as c
        inner join facts as f
        on c.facts_id = f.id
        limit 5
'''

pd.read_sql(query, conn)

Unnamed: 0,id,name,population,capital,facts_id,country_name
0,1,Oranjestad,37000,1,216,Aruba
1,2,Saint John'S,27000,1,6,Antigua and Barbuda
2,3,Abu Dhabi,942000,1,184,United Arab Emirates
3,4,Dubai,1978000,0,184,United Arab Emirates
4,5,Sharjah,983000,0,184,United Arab Emirates


## Practicing Inner Joins

Let's practice writing a query to answer a question from our database using an inner join. Say we want to produce a table of countries and their capital cities from our database using what we've learned so far. Our first step is to think about what columns we'll need in our final query. We'll need:

* The `name` column from `facts`
* The `name` column from `cities`

Given that we've identified that we need data from two tables, we need to think about how to join them. The schema diagram from earlier indicated that there is only one column in each table that links them together, so we can use an inner join with those columns to join the data.

Having thought the question through this far, we can already write most of our query:

```sql
SELECT f.name, c.name FROM cities c
INNER JOIN facts f ON f.id = c.facts_id
```
The last part of our process is to make sure we have the correct rows. From the previous two screens we know that a query like this will return all rows from `cities` that have a corresponding match from `facts` in the `facts_id` column. We're only interested in the capital cities from the `cities` table, so we'll need to use a `WHERE` clause on the `capital` column, which has a value of `1` if the city is a capital, and `0` if it isn't:

```sql
WHERE c.capital = 1
```
We can now put this all together to write a query that answers our question.

Write a query that returns, in order:
* A column of country names, called `country`.
* A column of each country's capital city, called `capital_city`.

Use an `INNER JOIN` to join the two tables in your query.

In [9]:
query = '''
        select f.name as country, c.name as capital_city
        from cities as c
        inner join facts as f
        on c.facts_id = f.id
        where c.capital = 1
        limit 5
'''
pd.read_sql(query, conn)

Unnamed: 0,country,capital_city
0,Aruba,Oranjestad
1,Antigua and Barbuda,Saint John'S
2,United Arab Emirates,Abu Dhabi
3,Afghanistan,Kabul
4,Algeria,Algiers


## Left Joins

As we mentioned earlier, an inner join will not include any rows where there is not a mutual match from both tables. This means there could be information we are not seeing in our query where rows don't match.

We can use the SQL console to run some queries to explore this:

```sql
>>> SELECT COUNT(DISTINCT(name)) FROM facts;

    [["COUNT(DISTINCT(name))"], [261]]

>>> SELECT COUNT(DISTINCT(facts_id)) FROM cities;

    [["COUNT(DISTINCT(facts_id))"], [210]]
```

By running these two queries, we can see that there are some countries in the `facts` table that don't have corresponding cities in the `cities` table, which indicates we may have some incomplete data.

Let's look at how we can create a query to explore the missing data using a new type of join: the **left join**.

A left join includes all the rows that an inner join will select, plus any rows from the first (or left) table that don't have a match in the second table. We can see this represented as a Venn diagram.

![https://s3.amazonaws.com/dq-content/179/venn_left.svg](https://s3.amazonaws.com/dq-content/179/venn_left.svg)

Let's look at an example by replacing `INNER JOIN` with `LEFT JOIN` in the first query we wrote, and looking at the same selection of rows from our earlier diagram:

```sql
SELECT * FROM facts
LEFT JOIN cities ON cities.facts_id = facts.id
```

![https://s3.amazonaws.com/dq-content/179/left_join.svg](https://s3.amazonaws.com/dq-content/179/left_join.svg)

Here we can see that for the rows where `facts.id` doesn't match any values in `cities.facts_id` (237, 238, 240, and 244), the rows are still included in the results. When this happens, all of the columns from the `cities` table are populated with null values.
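
The same behavior can be reproduced on a small scale. The sketch below assumes a hypothetical in-memory database in which Monaco (given a made-up `id`) has no row in `cities`; Python's `sqlite3` module returns SQL null values as `None`.

```python
import sqlite3

# Hypothetical toy data: Monaco has no matching row in cities.
toy = sqlite3.connect(":memory:")
toy.executescript("""
    CREATE TABLE facts (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE cities (id INTEGER PRIMARY KEY, name TEXT, facts_id INTEGER);
    INSERT INTO facts VALUES (1, 'Afghanistan'), (3, 'Algeria'), (117, 'Monaco');
    INSERT INTO cities VALUES (6, 'Kabul', 1), (7, 'Algiers', 3), (8, 'Oran', 3);
""")

left_rows = toy.execute("""
    SELECT f.name, c.name, c.facts_id
    FROM facts f
    LEFT JOIN cities c ON c.facts_id = f.id
    ORDER BY f.id, c.id;
""").fetchall()
toy.close()

# The Monaco row is kept, with None in every column that comes from cities.
for row in left_rows:
    print(row)
```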

We can use these null values to filter our results to just the countries that don't exist in `cities` with a `WHERE` clause. When making a comparison to null in SQL, we use the `IS` keyword, rather than the `=` sign. If we want to select rows where a column is null, we can write:

```sql
WHERE column_name IS NULL
```
If we want to select rows where a column isn't null, we use:

```sql
WHERE column_name IS NOT NULL
```
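
To see why `IS` matters, the following sketch compares `= NULL` with `IS NULL` on a hypothetical toy database. A comparison with `=` against null never evaluates to true, so that version of the filter returns no rows at all.

```python
import sqlite3

# Hypothetical toy data: Monaco has no matching row in cities.
toy = sqlite3.connect(":memory:")
toy.executescript("""
    CREATE TABLE facts (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE cities (id INTEGER PRIMARY KEY, name TEXT, facts_id INTEGER);
    INSERT INTO facts VALUES (1, 'Afghanistan'), (3, 'Algeria'), (117, 'Monaco');
    INSERT INTO cities VALUES (6, 'Kabul', 1), (7, 'Algiers', 3), (8, 'Oran', 3);
""")

base = """
    SELECT f.name FROM facts f
    LEFT JOIN cities c ON c.facts_id = f.id
    WHERE c.facts_id {condition};
"""
equals_null = toy.execute(base.format(condition="= NULL")).fetchall()
is_null = toy.execute(base.format(condition="IS NULL")).fetchall()
not_null = toy.execute(base.format(condition="IS NOT NULL")).fetchall()
toy.close()

print(equals_null)    # empty: "= NULL" is never true
print(is_null)        # only the country without a city
print(len(not_null))  # one row per matched city
```
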
Let's use a left join to explore the countries that don't exist in the `cities` table.


Write a query that returns the countries that don't exist in `cities`:
* Your query should return two columns:
  * The country names, with the alias `country`.
  * The country population.
* Use a `LEFT JOIN` to join `cities` to `facts`.
* Include only the countries from `facts` that don't have a corresponding value in `cities`.

In [11]:
query = '''
        select f.name as country, f.population
        from facts as f
        left join cities as c
        on f.id = c.facts_id
        where c.facts_id IS NULL
        limit 5
'''

pd.read_sql(query, conn)

Unnamed: 0,country,population
0,Kosovo,1870981
1,Monaco,30535
2,Nauru,9540
3,San Marino,33020
4,Singapore,5674472


## Right Joins and Outer Joins

Looking through the results of the query we wrote in the previous screen, we can see a number of different reasons that countries don't have corresponding values in `cities`:

* Countries with small populations and/or no major urban areas (which are defined as having populations of over 750,000), e.g. San Marino, Kosovo, and Nauru.
* City-states, such as Monaco and Singapore.
* Territories that are not themselves countries, such as Hong Kong, Gibraltar, and the Cook Islands.
* Regions and oceans that aren't countries, such as the European Union and the Pacific Ocean.
* Genuine cases of missing data, such as Taiwan.

It's important whenever you use inner joins to be mindful that you might be excluding important data, especially if you are joining based on columns that aren't linked in the database schema.

There are two less-common join types that you should be aware of, neither of which SQLite supported before version 3.39.0. The first is a **right join**. A right join, as the name indicates, is the mirror image of a left join. Where the left join includes all rows from the table before the `JOIN` clause, the right join includes all rows from the new table in the `JOIN` clause. We can see a right join in the Venn diagram below:

![https://s3.amazonaws.com/dq-content/179/venn_right.svg](https://s3.amazonaws.com/dq-content/179/venn_right.svg)

The following two queries, one using a left join and one using a right join, produce identical results.

```sql
SELECT f.name country, c.name city
FROM facts f
LEFT JOIN cities c ON c.facts_id = f.id
LIMIT 5;
```
```sql
SELECT f.name country, c.name city
FROM cities c
RIGHT JOIN facts f ON f.id = c.facts_id
LIMIT 5;
```
The main reason a right join would be used is in a complex query where you are joining more than two tables. In these cases, using a right join is preferable because it can avoid restructuring your whole query to join one table. Outside of this, right joins are rare, so for simple joins it's better to use a left join, as your query will be easier for others to read and understand.
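
The equivalence of the two queries can be checked with a short sketch. `RIGHT JOIN` is only available in SQLite 3.39.0 and later, so the code below checks the library version first and skips the right join on older builds. The toy tables are hypothetical, and an `ORDER BY` is added to both queries so the row order is comparable.

```python
import sqlite3

# Hypothetical toy data: Monaco has no matching row in cities.
toy = sqlite3.connect(":memory:")
toy.executescript("""
    CREATE TABLE facts (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE cities (id INTEGER PRIMARY KEY, name TEXT, facts_id INTEGER);
    INSERT INTO facts VALUES (1, 'Afghanistan'), (3, 'Algeria'), (117, 'Monaco');
    INSERT INTO cities VALUES (6, 'Kabul', 1), (7, 'Algiers', 3), (8, 'Oran', 3);
""")

left_rows = toy.execute("""
    SELECT f.name country, c.name city
    FROM facts f
    LEFT JOIN cities c ON c.facts_id = f.id
    ORDER BY f.id, c.id;
""").fetchall()

right_rows = None  # stays None if this SQLite build lacks RIGHT JOIN
if sqlite3.sqlite_version_info >= (3, 39, 0):
    right_rows = toy.execute("""
        SELECT f.name country, c.name city
        FROM cities c
        RIGHT JOIN facts f ON f.id = c.facts_id
        ORDER BY f.id, c.id;
    """).fetchall()
toy.close()

if right_rows is None:
    print("RIGHT JOIN is not available in SQLite", sqlite3.sqlite_version)
else:
    print(left_rows == right_rows)
```

On builds without `RIGHT JOIN`, swapping the table order and using `LEFT JOIN`, as in the first query, gives the same rows.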

The other join type that SQLite did not support before version 3.39.0 is a **full outer join**. A full outer join will include all rows from the tables on both sides of the join. We can see a full outer join in the Venn diagram below:

![https://s3.amazonaws.com/dq-content/179/venn_full.svg](https://s3.amazonaws.com/dq-content/179/venn_full.svg)

Like right joins, full outer joins are reasonably uncommon, and similar results can be achieved using a `UNION` clause (which we will teach in the next mission). The standard SQL syntax for a full outer join is:

```sql
SELECT f.name country, c.name city
FROM cities c
FULL OUTER JOIN facts f ON f.id = c.facts_id
LIMIT 5;
```
When joining `cities` and `facts` with a full outer join, the result will be the same as our left and right joins above, because there are no values in `cities.facts_id` that don't exist in `facts.id`.
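
For SQLite builds without `FULL OUTER JOIN`, one common workaround is to combine two left joins with `UNION ALL`. The sketch below is one possible way to do this, not the only one. Unlike the real database, the hypothetical toy data includes a city ('Lost City') whose `facts_id` matches no country, so that the result differs from a plain left join.

```python
import sqlite3

# Hypothetical toy data: Monaco has no city, 'Lost City' has no country.
toy = sqlite3.connect(":memory:")
toy.executescript("""
    CREATE TABLE facts (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE cities (id INTEGER PRIMARY KEY, name TEXT, facts_id INTEGER);
    INSERT INTO facts VALUES (1, 'Afghanistan'), (3, 'Algeria'), (117, 'Monaco');
    INSERT INTO cities VALUES (6, 'Kabul', 1), (7, 'Algiers', 3),
                              (8, 'Oran', 3), (99, 'Lost City', 999);
""")

# First part: every country, with its cities where they exist.
# Second part: only the cities that have no matching country.
emulated = toy.execute("""
    SELECT f.name AS country, c.name AS city
    FROM facts f
    LEFT JOIN cities c ON c.facts_id = f.id
    UNION ALL
    SELECT f.name, c.name
    FROM cities c
    LEFT JOIN facts f ON f.id = c.facts_id
    WHERE f.id IS NULL
    ORDER BY country, city;
""").fetchall()

native = None  # stays None if this SQLite build lacks FULL OUTER JOIN
if sqlite3.sqlite_version_info >= (3, 39, 0):
    native = toy.execute("""
        SELECT f.name AS country, c.name AS city
        FROM facts f
        FULL OUTER JOIN cities c ON c.facts_id = f.id
        ORDER BY country, city;
    """).fetchall()
toy.close()

for row in emulated:
    print(row)
```

The `WHERE f.id IS NULL` filter in the second part keeps matched rows from being counted twice.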

Let's look at the Venn diagrams of each join type side by side, which should help you compare the four join types we've discussed so far.

![https://s3.amazonaws.com/dq-content/179/join_venn_diagram.svg](https://s3.amazonaws.com/dq-content/179/join_venn_diagram.svg)

Next, let's practice using joins to answer some questions about our data.

## Finding the Most Populous Capital Cities

Previously, we've used column names when specifying order for our query results, like so:

```sql
SELECT name, migration_rate FROM facts
ORDER BY migration_rate DESC;
```
There is a handy shortcut we can use in our queries which lets us skip the column names, and instead use the order in which the columns appear in the `SELECT` clause. In this instance, `migration_rate` is the second column in our `SELECT` clause so we can just use `2` instead of the column name:

```sql
SELECT name, migration_rate FROM facts
ORDER BY 2 DESC;
```
You can use this shortcut in either the `ORDER BY` or `GROUP BY` clauses. Be mindful that you want to ensure your queries are still readable, so typing the full column name may be better for more complex queries.
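
As a quick check that the two spellings are interchangeable, this sketch runs both against a hypothetical in-memory `facts` table holding only the `name` and `migration_rate` values from the first five rows shown at the start of the mission.

```python
import sqlite3

# Cut-down facts table (an assumption for brevity): two columns, five rows.
toy = sqlite3.connect(":memory:")
toy.executescript("""
    CREATE TABLE facts (name TEXT, migration_rate REAL);
    INSERT INTO facts VALUES ('Afghanistan', 1.51), ('Albania', 3.3),
                             ('Algeria', 0.92), ('Andorra', 0.0),
                             ('Angola', 0.46);
""")

by_name = toy.execute("""
    SELECT name, migration_rate FROM facts
    ORDER BY migration_rate DESC;
""").fetchall()

by_position = toy.execute("""
    SELECT name, migration_rate FROM facts
    ORDER BY 2 DESC;
""").fetchall()
toy.close()

print(by_name == by_position)  # the position 2 refers to migration_rate
print(by_name[0])              # the row with the highest migration rate
```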

Let's use what we've learned to produce a list of the top 10 capital cities by population. Remember that `capital` is a boolean column containing `1` or `0`, depending on whether a city is a capital or not. We won't specify which join type you should use - you will need to think about what results you require and select an appropriate join type.

* Write a query that returns the 10 capital cities with the highest population ranked from biggest to smallest population.
* You should include the following columns, in order:
  * `capital_city`, the name of the city.
  * `country`, the name of the country the city is from.
  * `population`, the population of the city.

In [17]:
query = '''
        select c.name as capital_city,
                f.name as country,
                c.population as population
        from facts f
        inner join cities c
        on f.id = c.facts_id
        where c.capital = 1
        order by population desc
        limit 10
'''
pd.read_sql(query, conn)

Unnamed: 0,capital_city,country,population
0,Tokyo,Japan,37217000
1,New Delhi,India,22654000
2,Mexico City,Mexico,20446000
3,Beijing,China,15594000
4,Dhaka,Bangladesh,15391000
5,Buenos Aires,Argentina,13528000
6,Manila,Philippines,11862000
7,Moscow,Russia,11621000
8,Cairo,Egypt,11169000
9,Jakarta,Indonesia,9769000


## Combining Joins with Subqueries

As we learned in the SQL Fundamentals course, subqueries can be used to substitute parts of queries, allowing us to find the answers to more complex questions. We can also join to the result of a subquery, just like we could a table.

Here's an example of using a join and a subquery to produce a table of countries and their capital cities, like we did earlier in the mission.

![https://s3.amazonaws.com/dq-content/179/explain_subquery.svg](https://s3.amazonaws.com/dq-content/179/explain_subquery.svg)

Reading subqueries can be overwhelming at first, so we'll break down what happens in this example in several steps. The important thing to remember is that the results of any subqueries are always calculated first, so we read from the inside out.
* The subquery, in the red box, is calculated first. This simple query selects all columns from `cities`, keeping only the rows that are marked as capital cities by having a value of `1` for `capital`.
* The `INNER JOIN` joins the subquery result, aliased as `c`, to the `facts` table based on the `ON` clause.
* Two columns are selected from the results of the join:
  * `f.name`, aliased as `country`.
  * `c.name`, aliased as `capital_city`.
* The results are limited to the first 10 rows.

Below is the output of this query:

In [18]:
query_ = '''
        select f.name country, c.name capital_city
        from facts f
        inner join (
                    select * from cities
                    where capital = 1
        ) c
        on c.facts_id = f.id
        limit 10;
'''
pd.read_sql(query_, conn)

Unnamed: 0,country,capital_city
0,Aruba,Oranjestad
1,Antigua and Barbuda,Saint John'S
2,United Arab Emirates,Abu Dhabi
3,Afghanistan,Kabul
4,Algeria,Algiers
5,Azerbaijan,Baku
6,Albania,Tirana
7,Armenia,Yerevan
8,Andorra,Andorra La Vella
9,Angola,Luanda


Using this example as a model, we'll write a similar query to find the capital cities with populations of over 10 million.

* Using a join and a subquery, write a query that returns capital cities with populations of over 10 million ordered from largest to smallest. Include the following columns:
  * `capital_city` - the name of the city.
  * `country` - the name of the country the city is the capital of.
  * `population` - the population of the city.

In [20]:
query = '''
        select c.name capital_city,
                f.name country,
                c.population population
        from facts f
        inner join (
                select * from cities
                where capital = 1
        ) c
        on f.id = c.facts_id
        where c.population > 10000000
        order by c.population desc
        limit 10
'''

pd.read_sql(query, conn)

Unnamed: 0,capital_city,country,population
0,Tokyo,Japan,37217000
1,New Delhi,India,22654000
2,Mexico City,Mexico,20446000
3,Beijing,China,15594000
4,Dhaka,Bangladesh,15391000
5,Buenos Aires,Argentina,13528000
6,Manila,Philippines,11862000
7,Moscow,Russia,11621000
8,Cairo,Egypt,11169000


## Challenge: Complex Query with Joins and Subqueries

Let's take everything we've learned so far and use it to write a more complex query. It's not uncommon to find that 'thinking in SQL' takes a bit of getting used to, so don't be discouraged if this challenge takes you a while. It will get easier with practice!

When you're writing complex queries with joins and subqueries, it helps to follow this process:
* Think about what data you need in your final output.
* Work out which tables you'll need to join, and whether you will need to join to a subquery.
  * If you need to join to a subquery, write the subquery first.
* Then start writing your `SELECT` clause, followed by the join and any other clauses you will need.
* Don't be afraid to write your query in steps, running it as you go. For instance, you can run your subquery as a 'stand alone' query first to make sure it looks the way you want before writing the outer query.

We will be writing a query to find the countries where the urban center (city) population is more than half of the country's total population. Our final results will look like this:

country|urban_pop|total_pop|urban_pct
---|---|---|---
Uruguay|1672000|3341893|0.500315
Congo, Republic of the|2445000|4755097|0.514185
Brunei|241000|429646|0.560927
New Caledonia|157000|271615|0.578024
Virgin Islands|60000|103574|0.579296
...|...|...|...


To help you out, the query you will write will include:
* A join to a subquery.
* A subquery to make a calculation.
* An aggregate function.
* A `WHERE` clause.
* A `CAST` expression.

Remember that there are multiple ways to write this query, and the list above is based on the approach we took in our solution.
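
The `CAST` expression is on the list because SQLite performs integer division when both operands are integers, which would turn every ratio below 1 into `0`. A minimal sketch, using the Uruguay figures from the table above as literal values:

```python
import sqlite3

toy = sqlite3.connect(":memory:")
int_div, float_div = toy.execute("""
    SELECT 1672000 / 3341893,
           CAST(1672000 AS FLOAT) / CAST(3341893 AS FLOAT);
""").fetchone()
toy.close()

print(int_div)              # integer division drops the fractional part
print(round(float_div, 6))  # casting first keeps the fraction
```

Casting only one of the two operands is enough to get a fractional result; casting both, as here, simply makes the intent explicit.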

Write a query that generates output as shown above. The query should include:
* The following columns, in order:
  * `country`, the name of the country.
  * `urban_pop`, the sum of the population in major urban areas belonging to that country.
  * `total_pop`, the total population of the country.
  * `urban_pct`, the proportion of the population within urban areas, calculated by dividing `urban_pop` by `total_pop`.
* Only countries that have an `urban_pct` greater than 0.5.
* Rows should be sorted by `urban_pct` in ascending order.

In [26]:
test_query = '''
        select population, name
        from facts
        limit 10
'''
pd.read_sql(test_query, conn)

Unnamed: 0,population,name
0,32564342,Afghanistan
1,3029278,Albania
2,39542166,Algeria
3,85580,Andorra
4,19625353,Angola
5,92436,Antigua and Barbuda
6,43431886,Argentina
7,3056382,Armenia
8,22751014,Australia
9,8665550,Austria


In [27]:
# test_1

test_query = '''
        select sum(c.population) urban_pop,
                f.population total_pop,
                f.name country
        from cities c 
        inner join facts f
        on c.facts_id = f.id
        group by f.name
        limit 10
'''
pd.read_sql(test_query, conn)

Unnamed: 0,urban_pop,total_pop,country
0,3097000,32564342,Afghanistan
1,419000,3029278,Albania
2,3699000,39542166,Algeria
3,64000,54343,American Samoa
4,23000,85580,Andorra
5,6166000,19625353,Angola
6,2000,16418,Anguilla
7,27000,92436,Antigua and Barbuda
8,18951000,43431886,Argentina
9,1116000,3056382,Armenia


In [42]:
# test_2
# add more query statements/modifications to test_1.

test_query = '''
        
        select sub.country, sub.urban_pop, sub.total_pop,
                sub.urban_pct
        from (select f.name country, 
                    sum(c.population) urban_pop,
                    f.population total_pop,
                    (cast(sum(c.population) as float)/cast(f.population as float)) urban_pct
            from cities c 
            inner join facts f
            on c.facts_id = f.id
            group by f.name) sub
            
        where sub.urban_pct > 0.5
        order by sub.urban_pct
        limit 10
        
'''

pd.read_sql(test_query, conn)

Unnamed: 0,country,urban_pop,total_pop,urban_pct
0,Uruguay,1672000,3341893,0.500315
1,"Congo, Republic of the",2445000,4755097,0.514185
2,Brunei,241000,429646,0.560927
3,New Caledonia,157000,271615,0.578024
4,Virgin Islands,60000,103574,0.579296
5,Falkland Islands (Islas Malvinas),2000,3361,0.595061
6,Djibouti,496000,828324,0.5988
7,Australia,13789000,22751014,0.606083
8,Iceland,206000,331918,0.620635
9,Israel,5226000,8049314,0.649248


## Next Steps

In this mission we learned:
* The difference between inner and left joins.
* How to choose which join is appropriate for your task.
* How to use joins with subqueries, aggregate functions, and other SQL techniques.

In the next mission, we're going to keep practicing using joins and learn some advanced joining techniques, including:
* Queries with more than one join.
* Nesting and recursive joins.
* Joining data using `UNION`.
* Using `CASE` to categorize data.

In [43]:
conn.close()