<img src = "https://images2.imgbox.com/60/09/VFwl5LOq_o.jpg" width="400">

# 1. Introduction to Joins
---

In this chapter, you'll be introduced to the concept of joining tables, and will explore the different ways you can enrich your queries using inner joins and self joins. You'll also see how to use the case statement to split up a field into different categories.

In [1]:
from sqlalchemy import create_engine

In [2]:
engine_leaders = create_engine('sqlite:///data/leaders.db')
conn_leaders = engine_leaders.connect()

In [3]:
engine_countries = create_engine('sqlite:///data/countries.db')
conn_countries = engine_countries.connect()

## Introduction to INNER JOIN
---

As the name suggests, the focus of this course is using SQL to join two or more database tables together into a single table, an essential skill for data scientists. In this chapter, you'll learn about the INNER JOIN, which, together with LEFT JOIN, is probably one of the two most common joins. You'll see diagrams throughout this course that are designed to help you understand the mechanics of the different joins. Let's begin with a diagram showing the layout of some data and then how an INNER JOIN can be applied to that data.

<img src = "https://images2.imgbox.com/9f/c3/3g9SQsPC_o.png" width="400">

In this chapter and the next, we'll often work with two tables named left and right. You can see that matching values of the id field are colored with the same color. The id field is known as a KEY field since it can be used to reference one table to another. Both the left and right tables also have another field named val. This will be useful in helping you see specifically which records and values are included in each join.

### INNER JOIN diagram

An INNER JOIN only includes records in which the key is in both tables. You can see here that the id field matches for values of 1 and 4 only. With inner joins we look for matches in the right table corresponding to all entries in the key field in the left table.

<img src = "https://images2.imgbox.com/d2/84/5DBhNa6v_o.png" width="400">

### INNER JOIN diagram (2)

So the focus here shifts to only those records with a match in terms of the id field. The records not of interest to INNER JOIN have been faded.

<img src = "https://images2.imgbox.com/b7/f9/qy0sDGqu_o.png" width="400">

### INNER JOIN diagram (3)

Here's the single table that results from the INNER JOIN clause. It keeps only the records with an id value of 1 or 4, which are colored yellow and purple, and brings in the val field from the right table for those records. Now that you have a sense for how INNER JOIN works, let's try an example in SQL.

<img src = "https://images2.imgbox.com/cb/ff/UPAlB8Ta_o.png" width="400">
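The diagram can be reproduced on a toy dataset. The following is a minimal sketch that uses Python's built-in `sqlite3` module with an in-memory database instead of the notebook's SQLAlchemy connections. The rows are hypothetical stand-ins for the diagram; only the fact that ids 1 and 4 match comes from the text above.

```python
import sqlite3

# In-memory database; the rows below are made-up stand-ins for the diagram.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE left_table  (id INTEGER, val TEXT);
    CREATE TABLE right_table (id INTEGER, val TEXT);
    INSERT INTO left_table  VALUES (1, 'L1'), (2, 'L2'), (3, 'L3'), (4, 'L4');
    INSERT INTO right_table VALUES (1, 'R1'), (4, 'R2'), (5, 'R3'), (6, 'R4');
""")

# Only ids present in BOTH tables survive the inner join.
rows = conn.execute("""
    SELECT left_table.id   AS L_id,
           left_table.val  AS L_val,
           right_table.val AS R_val
    FROM   left_table
           INNER JOIN right_table
                   ON left_table.id = right_table.id
    ORDER  BY L_id
""").fetchall()

print(rows)
conn.close()
```

Ids 2 and 3 (left only) and 5 and 6 (right only) have no partner, so they do not appear in `rows`.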

### prime_ministers table

The prime_ministers table is one of the tables in the leaders database. It is displayed here. Note the countries that are included. Suppose you were interested in determining nations that have both a prime minister and a president AND putting the results into a single table. Next you'll see the presidents table.

In [4]:
ministers = conn_leaders.execute("""

SELECT *
FROM   prime_ministers 
    
""")

In [5]:
ministers_res = ministers.fetchall()

In [6]:
ministers_list = [x for i, x in enumerate(ministers_res) if i < 10]

In [7]:
ministers_list

[('Egypt', 'Africa', 'Sherif Ismail'),
 ('Portugal', 'Europe', 'Antonio Costa'),
 ('Vietnam', 'Asia', 'Nguyen Xuan Phuc'),
 ('Haiti', 'North America', 'Jack Guy Lafontant'),
 ('India', 'Asia', 'Narendra Modi'),
 ('Australia', 'Oceania', 'Malcolm Turnbull'),
 ('Norway', 'Europe', 'Erna Solberg'),
 ('Brunei', 'Asia', 'Hassanal Bolkiah'),
 ('Oman', 'Asia', 'Qaboos bin Said al Said')]

### presidents table

How was the prime_ministers table displayed in the previous section? Recall the use of the SELECT and FROM clauses, shown here for the presidents table. Which countries appear in both tables? With small tables like these, it is easy to notice that Egypt, Portugal, and Haiti appear in both tables. For larger tables, it isn't as simple as picking these countries out visually.

In [8]:
presidents = conn_leaders.execute("""

SELECT *
FROM   presidents 

""")

In [9]:
presidents_res = presidents.fetchall()

In [10]:
presidents_list = [x for i, x in enumerate(presidents_res) if i < 10]

In [11]:
presidents_list

[('Egypt', 'Africa', 'Abdel Fattah el-Sisi'),
 ('Portugal', 'Europe', 'Marcelo Rebelo de Sousa'),
 ('Haiti', 'North America', 'Jovenel Moise'),
 ('Uruguay', 'South America', 'Jose Mujica'),
 ('Liberia', 'Africa', 'Ellen Johnson Sirleaf'),
 ('Chile', 'South America', 'Michelle Bachelet')]

So what does the syntax look like for SQL to get the results of countries with a prime minister and a president from these two tables into one?

### INNER JOIN in SQL

The syntax for completing an INNER JOIN from the prime_ministers table to the presidents table based on a key field of country is shown. Note the use of aliases for prime_ministers as p1 and presidents as p2. This helps to simplify your code, especially with longer table names like prime_ministers and presidents. A SELECT statement is used to select specific fields from the two tables. Since country exists in both tables, we must prefix it with the table alias and a period (p1.country) to avoid a SQL error; the same goes for continent. Next we list the table on the left of the inner join after FROM, and then we list the table on the right after INNER JOIN. Lastly, we use ON to specify the keys in the two tables that we would like to match on.

In [12]:
query = conn_leaders.execute("""

SELECT p1.country,
       p1.continent,
       prime_minister,
       president
FROM   prime_ministers AS p1
       INNER JOIN presidents AS p2
               ON p1.country = p2.country 

""")

In [13]:
result_res = query.fetchall()
result_list = [x for i, x in enumerate(result_res) if i < 10]

result_list

[('Egypt', 'Africa', 'Sherif Ismail', 'Abdel Fattah el-Sisi'),
 ('Portugal', 'Europe', 'Antonio Costa', 'Marcelo Rebelo de Sousa'),
 ('Haiti', 'North America', 'Jack Guy Lafontant', 'Jovenel Moise')]
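To see why the `p1.` prefix matters, here is a small sketch using the standard-library `sqlite3` module and an in-memory copy of one row from each table (an assumption made for self-containment; the notebook itself reads `data/leaders.db` through SQLAlchemy). It runs the join once with an unqualified `country` and once with `p1.country`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE prime_ministers (country TEXT, continent TEXT, prime_minister TEXT);
    CREATE TABLE presidents      (country TEXT, continent TEXT, president TEXT);
    INSERT INTO prime_ministers VALUES ('Egypt', 'Africa', 'Sherif Ismail');
    INSERT INTO presidents      VALUES ('Egypt', 'Africa', 'Abdel Fattah el-Sisi');
""")

join_clause = """
    FROM   prime_ministers AS p1
           INNER JOIN presidents AS p2
                   ON p1.country = p2.country
"""

# Unqualified: country exists in both tables, so SQLite cannot tell which one is meant.
try:
    conn.execute("SELECT country, prime_minister, president" + join_clause)
    error_message = None
except sqlite3.OperationalError as exc:
    error_message = str(exc)

# Qualified with the table alias: unambiguous.
rows = conn.execute(
    "SELECT p1.country, prime_minister, president" + join_clause
).fetchall()

print(error_message)
print(rows)
conn.close()
```

Fields that exist in only one table, such as prime_minister and president, can stay unqualified.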

## Inner join
---

Throughout this course, you'll be working with the **countries** database containing information about the most populous world cities as well as country-level economic data, population data, and geographic data. This **countries** database also contains information on languages spoken in each country.

Before you continue with the course, look through the tables in this database to get a sense for the types of data that each one contains. Take note of the fields that appear to be shared across the tables.

Recall from the video the basic syntax for an `INNER JOIN`, here including all columns in **both** tables:

`SELECT *
 FROM   left_table
        INNER JOIN right_table
                ON left_table.id = right_table.id `

You'll start off with a `SELECT` statement and then build up to an `INNER JOIN` with the **cities** and **countries** tables. Let's get to it!

### Instructions
Begin by selecting all columns from the cities table.

In [14]:
query = conn_countries.execute("""

SELECT *
FROM   cities

""")

In [15]:
result_res = query.fetchall()
result_list = [x for i, x in enumerate(result_res) if i < 10]

result_list

[('Abidjan', 'CIV', 4765000.0, None, 4765000.0),
 ('Abu Dhabi', 'ARE', 1145000.0, None, 1145000.0),
 ('Abuja', 'NGA', 1235880.0, 6000000.0, 1235880.0),
 ('Accra', 'GHA', 2070463.0, 4010054.0, 2070463.0),
 ('Addis Ababa', 'ETH', 3103673.0, 4567857.0, 3103673.0),
 ('Ahmedabad', 'IND', 5570585.0, None, 5570585.0),
 ('Alexandria', 'EGY', 4616625.0, None, 4616625.0),
 ('Algiers', 'DZA', 3415811.0, 5000000.0, 3415811.0),
 ('Almaty', 'KAZ', 1703481.0, None, 1703481.0),
 ('Ankara', 'TUR', 5271000.0, 4585000.0, 5271000.0)]

Inner join the `cities` table on the left to the `countries` table on the right, keeping all of the fields in both tables.

You should match the tables on the `country_code` field in `cities` and the `code` field in `countries`.

**Do not** alias your tables here or in the next step. Using `cities` and `countries` is fine for now.

In [16]:
query = conn_countries.execute("""

SELECT *
FROM   cities
       INNER JOIN countries
               ON cities.country_code = countries.code 

""")

In [17]:
result_res = query.fetchall()
result_list = [x for i, x in enumerate(result_res) if i < 10]

result_list

[('Abidjan', 'CIV', 4765000.0, None, 4765000.0, 'CIV', "Cote d'Ivoire", 'Africa', 'Western Africa', 322463.0, 1960, 'Cote d\x92Ivoire', 'Republic', 'Yamoussoukro', -4.0305, 5.332),
 ('Abu Dhabi', 'ARE', 1145000.0, None, 1145000.0, 'ARE', 'United Arab Emirates', 'Asia', 'Middle East', 83600.0, 1971, 'Al-Imarat al-´Arabiya al-Muttahida', 'Emirate Federation', 'Abu Dhabi', 54.3705, 24.4764),
 ('Abuja', 'NGA', 1235880.0, 6000000.0, 1235880.0, 'NGA', 'Nigeria', 'Africa', 'Western Africa', 923768.0, 1960, 'Nigeria', 'Federal Republic', 'Abuja', 7.48906, 9.05804),
 ('Accra', 'GHA', 2070463.0, 4010054.0, 2070463.0, 'GHA', 'Ghana', 'Africa', 'Western Africa', 238533.0, 1957, 'Ghana', 'Republic', 'Accra', -0.20795, 5.57045),
 ('Addis Ababa', 'ETH', 3103673.0, 4567857.0, 3103673.0, 'ETH', 'Ethiopia', 'Africa', 'Eastern Africa', 1104300.0, -1000, 'YeItyop´iya', 'Republic', 'Addis Ababa', 38.7468, 9.02274),
 ('Ahmedabad', 'IND', 5570585.0, None, 5570585.0, 'IND', 'India', 'Asia', 'Southern and Cent

Modify the `SELECT` statement to keep only the name of the city, the name of the country, and the name of the region the country resides in.
Alias the name of the city `AS city` and the name of the country `AS country`.

In [18]:
query = conn_countries.execute("""

SELECT cities.name    AS city,
       countries.name AS country,
       region
FROM   cities
       INNER JOIN countries
               ON cities.country_code = countries.code

""")

In [19]:
result_res = query.fetchall()
result_list = [x for i, x in enumerate(result_res) if i < 10]

result_list

[('Abidjan', "Cote d'Ivoire", 'Western Africa'),
 ('Abu Dhabi', 'United Arab Emirates', 'Middle East'),
 ('Abuja', 'Nigeria', 'Western Africa'),
 ('Accra', 'Ghana', 'Western Africa'),
 ('Addis Ababa', 'Ethiopia', 'Eastern Africa'),
 ('Ahmedabad', 'India', 'Southern and Central Asia'),
 ('Alexandria', 'Egypt', 'Northern Africa'),
 ('Algiers', 'Algeria', 'Northern Africa'),
 ('Almaty', 'Kazakhstan', 'Southern and Central Asia'),
 ('Ankara', 'Turkey', 'Middle East')]

## Inner join (2)
---

Instead of writing the full table name, you can use table aliasing as a shortcut. As with fields, you use `AS` to add the alias immediately after the table name, separated by a space. Check out the aliasing of **cities** and **countries** below.

`
SELECT c1.NAME AS city,
        c2.NAME AS country
 FROM   cities AS c1
        INNER JOIN countries AS c2
                ON c1.country_code = c2.code `

Notice that to select a field in your query that appears in multiple tables, you'll need to identify which table/table alias you're referring to by using a `.` in your `SELECT` statement.

You'll now explore a way to get data from both the **countries** and **economies** tables to examine the inflation rate for both 2010 and 2015.

Sometimes it's easier to write SQL code out of order: you write the `SELECT` statement after you've done the `JOIN`.

### Instructions

Join the tables `countries` (left) and `economies` (right) aliasing `countries AS c` and `economies AS e`.

Specify the field to match the tables `ON`.

From this join, `SELECT`:
`c.code`, aliased as `country_code`.
`name`, `year`, and `inflation_rate`, not aliased.

In [20]:
query = conn_countries.execute("""

SELECT c.code AS country_code,
       name,
       year,
       inflation_rate
FROM   countries AS c
       INNER JOIN economies AS e
               ON c.code = e.code

""")

In [21]:
result_res = query.fetchall()
result_list = [x for i, x in enumerate(result_res) if i < 10]

result_list

[('AFG', 'Afghanistan', 2010, 2.179),
 ('AFG', 'Afghanistan', 2015, -1.549),
 ('AGO', 'Angola', 2010, 14.48),
 ('AGO', 'Angola', 2015, 10.287),
 ('ALB', 'Albania', 2010, 3.605),
 ('ALB', 'Albania', 2015, 1.896),
 ('ARE', 'United Arab Emirates', 2010, 0.878),
 ('ARE', 'United Arab Emirates', 2015, 4.07),
 ('ARG', 'Argentina', 2010, 10.461),
 ('ARG', 'Argentina', 2015, None)]

## Inner join (3)
---

The ability to combine multiple joins in a single query is a powerful feature of SQL, e.g.:

`SELECT *
 FROM   left_table
        INNER JOIN right_table
                ON left_table.id = right_table.id
        INNER JOIN another_table
                ON left_table.id = another_table.id `
    
As you can see here it becomes tedious to continually write long table names in joins. This is when it becomes useful to alias each table using the first letter of its name (e.g. `countries AS c`)! It is standard practice to alias in this way and, if you choose to alias tables or are asked to specifically for an exercise in this course, you should follow this protocol.

Now, for each country, you want to get the country name, its region, the fertility rate, and the unemployment rate for both 2010 and 2015.

Note that results should work throughout this course with or without table aliasing unless specified differently.

### Instructions

Inner join `countries` (left) and `populations` (right) on the `code` and `country_code` fields respectively.

Alias `countries AS c` and `populations AS p`.

Select `code`, `name`, and `region` from `countries` and also select `year` and `fertility_rate` from `populations` (5 fields in total).

In [22]:
query = conn_countries.execute("""

SELECT c.code,
       c.name,
       c.region,
       p.year,
       p.fertility_rate
FROM   countries AS c
       INNER JOIN populations AS p
               ON c.code = p.country_code  

""")

In [23]:
result_res = query.fetchall()
result_list = [x for i, x in enumerate(result_res) if i < 10]

result_list

[('AFG', 'Afghanistan', 'Southern and Central Asia', 2015, 4.653),
 ('AFG', 'Afghanistan', 'Southern and Central Asia', 2010, 5.746),
 ('ALB', 'Albania', 'Southern Europe', 2015, 1.793),
 ('ALB', 'Albania', 'Southern Europe', 2010, 1.663),
 ('DZA', 'Algeria', 'Northern Africa', 2015, 2.805),
 ('DZA', 'Algeria', 'Northern Africa', 2010, 2.873),
 ('ASM', 'American Samoa', 'Polynesia', 2015, None),
 ('ASM', 'American Samoa', 'Polynesia', 2010, None),
 ('AND', 'Andorra', 'Southern Europe', 2015, None),
 ('AND', 'Andorra', 'Southern Europe', 2010, 1.27)]

Add an additional `INNER JOIN` with `economies` to your previous query by joining on `code`.

Include the `unemployment_rate` column that became available through joining with `economies`.

Note that `year` appears in both `populations` and `economies`, so you have to explicitly use `e.year` instead of `year` as you did before.

In [24]:
query = conn_countries.execute("""

SELECT c.code,
       c.name,
       c.region,
       e.year,
       p.fertility_rate,
       e.unemployment_rate
FROM   countries AS c
       INNER JOIN populations AS p
               ON c.code = p.country_code
       INNER JOIN economies AS e
               ON e.code = c.code  

""")

In [25]:
result_res = query.fetchall()
result_list = [x for i, x in enumerate(result_res) if i < 10]

result_list

[('AFG', 'Afghanistan', 'Southern and Central Asia', 2010, 4.653, None),
 ('AFG', 'Afghanistan', 'Southern and Central Asia', 2015, 4.653, None),
 ('AFG', 'Afghanistan', 'Southern and Central Asia', 2010, 5.746, None),
 ('AFG', 'Afghanistan', 'Southern and Central Asia', 2015, 5.746, None),
 ('ALB', 'Albania', 'Southern Europe', 2010, 1.793, 14.0),
 ('ALB', 'Albania', 'Southern Europe', 2015, 1.793, 17.1),
 ('ALB', 'Albania', 'Southern Europe', 2010, 1.663, 14.0),
 ('ALB', 'Albania', 'Southern Europe', 2015, 1.663, 17.1),
 ('DZA', 'Algeria', 'Northern Africa', 2010, 2.805, 9.961),
 ('DZA', 'Algeria', 'Northern Africa', 2015, 2.805, 11.214)]

Scroll down the query result and take a look at the results for Albania from your previous query. Does something seem off to you?

The trouble with doing your last join on `c.code = e.code` and not also including year is that e.g. the 2010 value for `fertility_rate` is also paired with the 2015 value for `unemployment_rate`.

Fix your previous query: in your last `ON` clause, use `AND` to add an additional joining condition. In addition to joining on `code` in `c` and `e`, also join on `year` in `e` and `p`.

In [26]:
query = conn_countries.execute("""

SELECT c.code,
       name,
       region,
       e.year,
       fertility_rate,
       unemployment_rate
FROM   countries AS c
       INNER JOIN populations AS p
               ON c.code = p.country_code
       INNER JOIN economies AS e
               ON c.code = e.code
                  AND e.year = p.year  

""")

In [27]:
result_res = query.fetchall()
result_list = [x for i, x in enumerate(result_res) if i < 10]

result_list

[('AFG', 'Afghanistan', 'Southern and Central Asia', 2015, 4.653, None),
 ('AFG', 'Afghanistan', 'Southern and Central Asia', 2010, 5.746, None),
 ('ALB', 'Albania', 'Southern Europe', 2015, 1.793, 17.1),
 ('ALB', 'Albania', 'Southern Europe', 2010, 1.663, 14.0),
 ('DZA', 'Algeria', 'Northern Africa', 2015, 2.805, 11.214),
 ('DZA', 'Algeria', 'Northern Africa', 2010, 2.873, 9.961),
 ('AGO', 'Angola', 'Central Africa', 2015, 5.996, None),
 ('AGO', 'Angola', 'Central Africa', 2010, 6.416, None),
 ('ATG', 'Antigua and Barbuda', 'Caribbean', 2015, 2.063, None),
 ('ATG', 'Antigua and Barbuda', 'Caribbean', 2010, 2.13, None)]
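The effect of the extra `year` condition can be checked on a toy dataset. This sketch uses the standard-library `sqlite3` module with an in-memory database (an assumption; the notebook uses SQLAlchemy against `data/countries.db`) and only the Albania figures shown in the outputs above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE countries   (code TEXT, name TEXT);
    CREATE TABLE populations (country_code TEXT, year INTEGER, fertility_rate REAL);
    CREATE TABLE economies   (code TEXT, year INTEGER, unemployment_rate REAL);
    INSERT INTO countries   VALUES ('ALB', 'Albania');
    INSERT INTO populations VALUES ('ALB', 2010, 1.663), ('ALB', 2015, 1.793);
    INSERT INTO economies   VALUES ('ALB', 2010, 14.0),  ('ALB', 2015, 17.1);
""")

base = """
    SELECT e.year, p.fertility_rate, e.unemployment_rate
    FROM   countries AS c
           INNER JOIN populations AS p
                   ON c.code = p.country_code
           INNER JOIN economies AS e
                   ON c.code = e.code
"""

# Joining on code alone pairs every populations year with every economies year.
code_only = conn.execute(base + " ORDER BY e.year, p.year").fetchall()

# Adding the year condition keeps only the pairs that describe the same year.
code_and_year = conn.execute(
    base + " AND e.year = p.year ORDER BY e.year"
).fetchall()

print(len(code_only), code_only)
print(len(code_and_year), code_and_year)
conn.close()
```

With two years per country, the code-only join returns 2 x 2 = 4 rows for Albania, while the corrected join returns the expected 2.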

## INNER JOIN via USING
---

Recall the INNER JOIN diagram you saw. Think about the SQL code needed to complete this diagram. Let's check it out. We select and alias three fields and use the left table on the left of the join and the right table on the right of the join matching based on the entries for the id key field.

<img src = "https://images2.imgbox.com/cb/ff/UPAlB8Ta_o.png" width="400">

`SELECT left_table.id   AS L_id,
        left_table.val  AS L_val,
        right_table.val AS R_val
 FROM   left_table
        INNER JOIN right_table
                ON left_table.id = right_table.id `


### The INNER JOIN diagram with USING

When the key field you'd like to join on has the same name in both tables, you can use a USING clause instead of the ON clause you have seen so far. Since id has the same name in both the left table and the right table, we can specify USING instead of ON here. Note that the parentheses around the key field are required with USING.

<img src = "https://images2.imgbox.com/61/05/BnIToehf_o.png" width="400">

`SELECT left_table.id   AS L_id,
        left_table.val  AS L_val,
        right_table.val AS R_val
 FROM   left_table
        INNER JOIN right_table USING( id )`


### Countries with prime ministers and presidents

Let's revisit the example of joining the prime_ministers table to the presidents table to determine countries with both types of leaders. How could you fill in the blanks to get the result with USING? This one is a bit tricky: the presidents table is now on the left and the prime_ministers table is on the right. But why does this work? Since an INNER JOIN only includes entries that appear in both tables, and both tables contain the countries listed, the order in which we place the tables in the join doesn't matter if we SELECT these columns. You'll be told in the exercises which table to use on the left and on the right to avoid this confusion. Note again the use of the parentheses around country after USING.

In [28]:
query = conn_leaders.execute("""

SELECT p1.country,
       p1.continent,
       prime_minister,
       president
FROM   presidents AS p1
       INNER JOIN prime_ministers AS p2 USING( country )
       
""")

In [29]:
result_res = query.fetchall()
result_list = [x for i, x in enumerate(result_res) if i < 10]

result_list

[('Egypt', 'Africa', 'Sherif Ismail', 'Abdel Fattah el-Sisi'),
 ('Portugal', 'Europe', 'Antonio Costa', 'Marcelo Rebelo de Sousa'),
 ('Haiti', 'North America', 'Jack Guy Lafontant', 'Jovenel Moise')]
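One practical difference between ON and USING shows up with `SELECT *`. The sketch below, which assumes SQLite accessed through the standard-library `sqlite3` module and uses made-up rows, compares the columns each form returns. Other database engines may behave differently, so treat this as an illustration of SQLite's behavior only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE left_table  (id INTEGER, val TEXT);
    CREATE TABLE right_table (id INTEGER, val TEXT);
    INSERT INTO left_table  VALUES (1, 'L1'), (4, 'L4');
    INSERT INTO right_table VALUES (1, 'R1'), (4, 'R2');
""")

# ON: both copies of the key column are returned.
on_cursor = conn.execute("""
    SELECT *
    FROM   left_table
           INNER JOIN right_table
                   ON left_table.id = right_table.id
""")
on_columns = [col[0] for col in on_cursor.description]
on_rows = on_cursor.fetchall()

# USING: the key column named in the parentheses is returned once.
using_cursor = conn.execute("""
    SELECT *
    FROM   left_table
           INNER JOIN right_table USING (id)
""")
using_columns = [col[0] for col in using_cursor.description]
using_rows = using_cursor.fetchall()

print(on_columns, on_rows)
print(using_columns, using_rows)
conn.close()
```

The matched records are the same either way; only the number of key columns in the result differs.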

## Inner join with using
---

When joining tables with a common field name, e.g.

`SELECT *
FROM   countries
       INNER JOIN economies
               ON countries.code = economies.code `
    
You can use `USING` as a shortcut:

`SELECT *
FROM   countries
       INNER JOIN economies USING( code )`
    
You'll now explore how this can be done with the `countries` and `languages` tables.

### Instructions

Inner join `countries` on the left and `languages` on the right with `USING(code)`.

Select the fields corresponding to:
- country name `AS country`,
- continent name,
- language name `AS language`, and
- whether or not the language is official.

Remember to alias your tables using the first letter of their names.

In [30]:
query = conn_countries.execute("""

SELECT c.name AS country,
       c.continent,
       l.name AS language,
       l.official
FROM   countries AS c
       INNER JOIN languages AS l USING( code ) 

""")

In [31]:
result_res = query.fetchall()
result_list = [x for i, x in enumerate(result_res) if i < 10]

result_list

[('Afghanistan', 'Asia', 'Dari', 'TRUE'),
 ('Afghanistan', 'Asia', 'Pashto', 'TRUE'),
 ('Afghanistan', 'Asia', 'Turkic', 'FALSE'),
 ('Afghanistan', 'Asia', 'Other', 'FALSE'),
 ('Albania', 'Europe', 'Albanian', 'TRUE'),
 ('Albania', 'Europe', 'Greek', 'FALSE'),
 ('Albania', 'Europe', 'Other', 'FALSE'),
 ('Albania', 'Europe', 'unspecified', 'FALSE'),
 ('Algeria', 'Africa', 'Arabic', 'TRUE'),
 ('Algeria', 'Africa', 'French', 'FALSE')]

## Self-ish joins, just in CASE
---

You'll now dive into inner joins where a table is joined with itself. Sounds a little selfish, doesn't it? These types of joins, as you may have guessed, are called self joins. You'll also explore how to slice a numerical field into categories using the CASE command.

### Join a table to itself?

Joining a table to itself may seem like a bit of a crazy, strange thing to ever want to do. Self-joins are used to compare values in a field to other values of the same field from within the same table. Let's further explore this with an example. Recall the prime_ministers table from earlier. What if you wanted to create a new table showing countries that are in the same continent matched as pairs? Let's explore a chunk of INNER JOIN code using the prime_ministers table.

### Join prime_ministers to itself?

The country column is selected twice, along with continent. The prime_ministers table is on both the left and the right. The vital step here is setting the key columns by which we match the table to itself. For each country, we will have a match if the country in the "right table" (which is also prime_ministers) is in the same continent. Lastly, since the results of this query are more than can fit on the slide, you'll only see the first 14 records. See how we have exactly this in the result! It's a pairing of each country with every country in its continent, including itself. But do you see a problem here? We don't want to list a country with itself, after all. In the next slide, you'll see a way to avoid this. Pause to think about how to get around this before continuing.

In [32]:
query = conn_leaders.execute("""

SELECT p1.country AS country1,
       p2.country AS country2,
       p1.continent
FROM   prime_ministers AS p1
       INNER JOIN prime_ministers AS p2
               ON p1.continent = p2.continent
LIMIT  14 

""")

In [33]:
result_res = query.fetchall()
result_list = [x for i, x in enumerate(result_res) if i < 14]

result_list

[('Egypt', 'Egypt', 'Africa'),
 ('Portugal', 'Norway', 'Europe'),
 ('Portugal', 'Portugal', 'Europe'),
 ('Vietnam', 'Brunei', 'Asia'),
 ('Vietnam', 'India', 'Asia'),
 ('Vietnam', 'Oman', 'Asia'),
 ('Vietnam', 'Vietnam', 'Asia'),
 ('Haiti', 'Haiti', 'North America'),
 ('India', 'Brunei', 'Asia'),
 ('India', 'India', 'Asia'),
 ('India', 'Oman', 'Asia'),
 ('India', 'Vietnam', 'Asia'),
 ('Australia', 'Australia', 'Oceania'),
 ('Norway', 'Norway', 'Europe')]

### Finishing off the self-join on prime_ministers

We don't want to include rows where the country is the same in the country1 and country2 fields. Adding AND to the ON clause lets us check that multiple conditions are met. Here a match will not be made between prime_ministers and itself if the countries match. You thus have the correct table now; the results here are again limited in order for them to fit on the slide. Notice that a self-join doesn't have a syntax quite as simple as INNER JOIN (you can't just write SELF JOIN in SQL code).

In [34]:
query = conn_leaders.execute("""

SELECT p1.country AS country1,
       p2.country AS country2,
       p1.continent
FROM   prime_ministers AS p1
       INNER JOIN prime_ministers AS p2
               ON p1.continent = p2.continent
                  AND p1.country <> p2.country 

""")

In [35]:
result_res = query.fetchall()
result_list = [x for i, x in enumerate(result_res) if i < 10]

result_list

[('Portugal', 'Norway', 'Europe'),
 ('Vietnam', 'Brunei', 'Asia'),
 ('Vietnam', 'India', 'Asia'),
 ('Vietnam', 'Oman', 'Asia'),
 ('India', 'Brunei', 'Asia'),
 ('India', 'Oman', 'Asia'),
 ('India', 'Vietnam', 'Asia'),
 ('Norway', 'Portugal', 'Europe'),
 ('Brunei', 'India', 'Asia'),
 ('Brunei', 'Oman', 'Asia')]
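The two self-join variants can be compared side by side on a few rows. This is a minimal sketch using the standard-library `sqlite3` module and a three-country subset of prime_ministers taken from the output shown earlier (the in-memory table is an assumption made so the example runs on its own).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE prime_ministers (country TEXT, continent TEXT);
    INSERT INTO prime_ministers VALUES
        ('Egypt', 'Africa'), ('Portugal', 'Europe'), ('Norway', 'Europe');
""")

# Same continent only: every country is also paired with itself.
with_self_pairs = conn.execute("""
    SELECT p1.country AS country1,
           p2.country AS country2,
           p1.continent
    FROM   prime_ministers AS p1
           INNER JOIN prime_ministers AS p2
                   ON p1.continent = p2.continent
    ORDER  BY country1, country2
""").fetchall()

# Same continent AND different country: self-pairs are dropped.
without_self_pairs = conn.execute("""
    SELECT p1.country AS country1,
           p2.country AS country2,
           p1.continent
    FROM   prime_ministers AS p1
           INNER JOIN prime_ministers AS p2
                   ON p1.continent = p2.continent
                      AND p1.country <> p2.country
    ORDER  BY country1, country2
""").fetchall()

print(with_self_pairs)
print(without_self_pairs)
conn.close()
```

Egypt disappears from the second result entirely because, in this subset, it has no other country in its continent to pair with.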

### CASE WHEN and THEN

The next command isn't a join, but is a useful tool in your repertoire. You'll be introduced to using CASE with another table in the leaders database. The states table contains numeric data about different countries in the six inhabited world continents. We'll focus on the field indep_year now. Suppose we'd like to group the year of independence into categories of before 1900, between 1900 and 1930, and after 1930. CASE will get us there! CASE is a way to do multiple if-then-else statements in a simplified way in SQL.

### Preparing indep_year_group in states

You can now see the basic layout for creating a new field containing the groupings. How might we fill in the blanks? After the first WHEN, we should specify that we want to check for indep_year being less than 1900. Next, we want indep_year_group to contain 'between 1900 and 1930' in the next blank. Lastly, any other record not matching these conditions will be assigned the value of 'after 1930' for indep_year_group.

### Creating indep_year_group in states

Check out the completed query with completed blanks. Notice how the values of indep_year are grouped in indep_year_group. Also observe how continent relates to indep_year_group.

In [36]:
query = conn_leaders.execute("""

SELECT NAME,
       continent,
       indep_year,
       CASE
         WHEN indep_year < 1900 THEN 'before 1900'
         WHEN indep_year <= 1930 THEN 'between 1900 and 1930'
         ELSE 'after 1930'
       END AS indep_year_group
FROM   states
ORDER  BY indep_year_group

""")

In [37]:
result_res = query.fetchall()
result_list = [x for i, x in enumerate(result_res) if i < 10]

result_list

[('Brunei', 'Asia', 1984, 'after 1930'),
 ('India', 'Asia', 1947, 'after 1930'),
 ('Oman', 'Asia', 1951, 'after 1930'),
 ('Chile', 'South America', 1810, 'before 1900'),
 ('Haiti', 'North America', 1804, 'before 1900'),
 ('Liberia', 'Africa', 1847, 'before 1900'),
 ('Portugal', 'Europe', 1143, 'before 1900'),
 ('Spain', 'Europe', 1492, 'before 1900'),
 ('Uruguay', 'South America', 1828, 'before 1900'),
 ('Australia', 'Oceania', 1901, 'between 1900 and 1930')]

## Self-join
---

In this exercise, you'll use the populations table to perform a self-join to calculate the percentage increase in population from 2010 to 2015 for each country code!

Since you'll be joining the populations table to itself, you can alias populations as p1 and also populations as p2. This is good practice whenever you are aliasing and your tables have the same first letter. Note that you are required to alias the tables with self-joins.

### Instructions

Join `populations` with itself `ON` `country_code`.

Select the `country_code` from `p1` and the `size` field from both `p1` and `p2`. Both fields would otherwise appear under the same name, `size`, so alias `p1.size` as `size2010` and `p2.size` as `size2015` to tell them apart.

In [38]:
query = conn_countries.execute("""

SELECT p1.country_code,
       p1.size AS size2010,
       p2.size AS size2015
FROM   populations AS p1
       INNER JOIN populations AS p2
               ON p1.country_code = p2.country_code

""")

In [39]:
result_res = query.fetchall()
result_list = [x for i, x in enumerate(result_res) if i < 10]

result_list

[('AFG', 32526562.0, 27962207.0),
 ('AFG', 32526562.0, 32526562.0),
 ('AFG', 27962207.0, 27962207.0),
 ('AFG', 27962207.0, 32526562.0),
 ('ALB', 2889167.0, 2889167.0),
 ('ALB', 2889167.0, 2913021.0),
 ('ALB', 2913021.0, 2889167.0),
 ('ALB', 2913021.0, 2913021.0),
 ('DZA', 39666519.0, 36036159.0),
 ('DZA', 39666519.0, 39666519.0)]

Notice from the result that for each `country_code` you have four entries laying out all combinations of 2010 and 2015.

Extend the `ON` in your query to include only those records where the `p1.year` (2010) matches with `p2.year - 5` (2015 - 5 = 2010). This will omit the three entries per `country_code` that you aren't interested in.

In [40]:
query = conn_countries.execute("""

SELECT p1.country_code,
       p1.size AS size2010,
       p2.size AS size2015
FROM   populations AS p1
       INNER JOIN populations AS p2
               ON p1.country_code = p2.country_code
                  AND p1.year = p2.year - 5

""")

In [41]:
result_res = query.fetchall()
result_list = [x for i, x in enumerate(result_res) if i < 10]

result_list

[('AFG', 27962207.0, 32526562.0),
 ('ALB', 2913021.0, 2889167.0),
 ('DZA', 36036159.0, 39666519.0),
 ('ASM', 55636.0, 55538.0),
 ('AND', 84419.0, 70473.0),
 ('AGO', 21219954.0, 25021974.0),
 ('ATG', 87233.0, 91818.0),
 ('ARG', 41222875.0, 43416755.0),
 ('ARM', 2963496.0, 3017712.0),
 ('ABW', 101597.0, 103889.0)]

As you just saw, you can also use SQL to calculate values like `p2.year - 5` for you. With two fields like `size2010` and `size2015`, you may want to determine the percentage increase from one field to the next:

With two numeric fields A and B, the percentage growth from A to B can be calculated as `(B - A) / A * 100.0`.

Add a new field to `SELECT`, aliased as `growth_perc`, that calculates the percentage population growth from 2010 to 2015 for each country, using `p2.size` and `p1.size`.

In [42]:
query = conn_countries.execute("""

SELECT p1.country_code,
       p1.size                                     AS size2010,
       p2.size                                     AS size2015,
       ( ( p2.size - p1.size ) / p1.size * 100.0 ) AS growth_perc
FROM   populations AS p1
       INNER JOIN populations AS p2
               ON p1.country_code = p2.country_code
                  AND p1.year = ( p2.year - 5 ) 

""")

In [43]:
result_res = query.fetchall()
result_list = [x for i, x in enumerate(result_res) if i < 10]

result_list

[('AFG', 27962207.0, 32526562.0, 16.3233002316305),
 ('ALB', 2913021.0, 2889167.0, -0.8188749754979453),
 ('DZA', 36036159.0, 39666519.0, 10.074214624261147),
 ('ASM', 55636.0, 55538.0, -0.17614494212380472),
 ('AND', 84419.0, 70473.0, -16.519977730131842),
 ('AGO', 21219954.0, 25021974.0, 17.917192468937493),
 ('ATG', 87233.0, 91818.0, 5.256038425825089),
 ('ARG', 41222875.0, 43416755.0, 5.321996585633583),
 ('ARM', 2963496.0, 3017712.0, 1.8294608799876901),
 ('ABW', 101597.0, 103889.0, 2.255972125161176)]
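The growth calculation can be verified on two of the countries above. The sketch below uses the standard-library `sqlite3` module with an in-memory table (an assumption for self-containment) holding the AFG and ALB sizes from the output. It also illustrates one caveat that applies in SQLite: if both operands of `/` are integers, the division is truncated, so the order of operations matters when the fields are not stored as floating-point numbers.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE populations (country_code TEXT, year INTEGER, size REAL);
    INSERT INTO populations VALUES
        ('AFG', 2010, 27962207.0), ('AFG', 2015, 32526562.0),
        ('ALB', 2010, 2913021.0),  ('ALB', 2015, 2889167.0);
""")

# Self-join pairing each 2010 record (p1) with the 2015 record (p2).
growth = conn.execute("""
    SELECT p1.country_code,
           (p2.size - p1.size) / p1.size * 100.0 AS growth_perc
    FROM   populations AS p1
           INNER JOIN populations AS p2
                   ON p1.country_code = p2.country_code
                      AND p1.year = p2.year - 5
    ORDER  BY p1.country_code
""").fetchall()

# Caveat: with INTEGER operands SQLite divides as integers, so 10 / 100 is 0.
truncated = conn.execute("SELECT (110 - 100) / 100 * 100.0").fetchone()[0]
# Multiplying by 100.0 first keeps the arithmetic in floating point.
preserved = conn.execute("SELECT (110 - 100) * 100.0 / 100").fetchone()[0]

print(growth)
print(truncated, preserved)
conn.close()
```

Because `size` is stored as a floating-point number here, as it appears to be in the notebook's output, the formula works as written.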

## Case when and then
---

Often it's useful to look at a numerical field not as raw data, but instead as being in different categories or groups.

You can use `CASE` with `WHEN`, `THEN`, `ELSE`, and `END` to define a new grouping field.

### Instructions 

Using the `countries` table, create a new field `AS geosize_group` that groups the countries into three groups:

- If `surface_area` is greater than 2 million, `geosize_group` is `'large'`.
- If `surface_area` is greater than 350 thousand but not larger than 2 million, `geosize_group` is `'medium'`.
- Otherwise, `geosize_group` is `'small'`.

In [44]:
query = conn_countries.execute("""

SELECT NAME,
       continent,
       code,
       surface_area,
       CASE
         WHEN surface_area > 2000000 THEN 'large'
         WHEN surface_area > 350000 THEN 'medium'
         ELSE 'small'
       END AS geosize_group
FROM   countries 

""")

In [45]:
result_res = query.fetchall()
result_list = [x for i, x in enumerate(result_res) if i < 10]

result_list

[('Afghanistan', 'Asia', 'AFG', 652090.0, 'medium'),
 ('Netherlands', 'Europe', 'NLD', 41526.0, 'small'),
 ('Albania', 'Europe', 'ALB', 28748.0, 'small'),
 ('Algeria', 'Africa', 'DZA', 2381740.0, 'large'),
 ('American Samoa', 'Oceania', 'ASM', 199.0, 'small'),
 ('Andorra', 'Europe', 'AND', 468.0, 'small'),
 ('Angola', 'Africa', 'AGO', 1246700.0, 'medium'),
 ('Antigua and Barbuda', 'North America', 'ATG', 442.0, 'small'),
 ('United Arab Emirates', 'Asia', 'ARE', 83600.0, 'small'),
 ('Argentina', 'South America', 'ARG', 2780400.0, 'large')]
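
Note that the second `WHEN` does not need an explicit upper bound: `CASE` checks its conditions in order and stops at the first one that is true. The sketch below illustrates this on an in-memory SQLite table. The three named countries use surface areas from the output above, while the two `Boundary` rows are hypothetical values added to show what happens exactly at the thresholds.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE countries (name TEXT, surface_area REAL)")
conn.executemany(
    "INSERT INTO countries VALUES (?, ?)",
    [
        ("Algeria", 2381740.0),
        ("Angola", 1246700.0),
        ("Andorra", 468.0),
        ("Boundary A", 2000000.0),  # hypothetical: exactly 2 million
        ("Boundary B", 350000.0),   # hypothetical: exactly 350 thousand
    ],
)

rows = conn.execute("""
SELECT name,
       CASE
         WHEN surface_area > 2000000 THEN 'large'
         WHEN surface_area > 350000 THEN 'medium'
         ELSE 'small'
       END AS ordered_case,
       CASE
         WHEN surface_area > 2000000 THEN 'large'
         WHEN surface_area > 350000
              AND surface_area <= 2000000 THEN 'medium'
         ELSE 'small'
       END AS explicit_case
FROM   countries
""").fetchall()

# Map each country to its group and check that both spellings agree.
groups = {name: ordered for name, ordered, _ in rows}
same = all(ordered == explicit for _, ordered, explicit in rows)
print(groups, same)
```

Because the comparisons are strict, a value of exactly 2 million falls into `'medium'` and a value of exactly 350 thousand falls into `'small'`.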

## Inner challenge
---

The table you created with the added `geosize_group` field has been loaded for you here with the name `countries_plus`. Observe the use of (and the placement of) the `INTO` command to create this `countries_plus` table:

`SELECT NAME,
        continent,
        code,
        surface_area,
        CASE
          WHEN surface_area > 2000000 THEN 'large'
          WHEN surface_area > 350000 THEN 'medium'
          ELSE 'small'
        END AS geosize_group
 INTO   countries_plus
 FROM   countries`
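
The `INTO` form shown above is PostgreSQL-style syntax. SQLite, which the connections in this notebook point to, does not accept `SELECT ... INTO`, so the following is a minimal sketch of one way to get the same table there using `CREATE TABLE ... AS SELECT`. It runs against a throwaway in-memory database with three rows copied from the earlier output, not against `data/countries.db`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE countries "
    "(name TEXT, continent TEXT, code TEXT, surface_area REAL)"
)
conn.executemany(
    "INSERT INTO countries VALUES (?, ?, ?, ?)",
    [
        ("Algeria", "Africa", "DZA", 2381740.0),
        ("Angola", "Africa", "AGO", 1246700.0),
        ("Andorra", "Europe", "AND", 468.0),
    ],
)

# SELECT ... INTO is rejected by SQLite with a syntax error.
try:
    conn.execute("SELECT name INTO countries_plus FROM countries")
    into_supported = True
except sqlite3.Error:
    into_supported = False

# SQLite equivalent: create the new table directly from the query.
conn.execute("""
CREATE TABLE countries_plus AS
SELECT name,
       continent,
       code,
       surface_area,
       CASE
         WHEN surface_area > 2000000 THEN 'large'
         WHEN surface_area > 350000 THEN 'medium'
         ELSE 'small'
       END AS geosize_group
FROM   countries
""")

plus_rows = conn.execute(
    "SELECT code, geosize_group FROM countries_plus ORDER BY code"
).fetchall()
print(into_supported, plus_rows)
```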

You will now explore the relationship between the size of a country in terms of surface area and in terms of population using grouping fields created with `CASE`.

### Instructions

Using the `populations` table, restricted to the `year` 2015, create a new field aliased as `popsize_group` that organizes population `size` into three groups:

- `'large'` (> 50 million),
- `'medium'` (> 1 million), and
- `'small'` (all remaining countries).

Select only the country code, population size, and this new `popsize_group` as fields.

In [46]:
query = conn_countries.execute("""

SELECT country_code,
       size,
       CASE
         WHEN size > 50000000 THEN 'large'
         WHEN size > 1000000 THEN 'medium'
         ELSE 'small'
       END AS popsize_group
FROM   populations
WHERE  year = 2015 

""")

In [47]:
result_res = query.fetchall()
result_list = [x for i, x in enumerate(result_res) if i < 10]

result_list

[('AFG', 32526562.0, 'medium'),
 ('ALB', 2889167.0, 'medium'),
 ('DZA', 39666519.0, 'medium'),
 ('ASM', 55538.0, 'small'),
 ('AND', 70473.0, 'small'),
 ('AGO', 25021974.0, 'medium'),
 ('ATG', 91818.0, 'small'),
 ('ARG', 43416755.0, 'medium'),
 ('ARM', 3017712.0, 'medium'),
 ('ABW', 103889.0, 'small')]

Use `INTO` to save the result of the previous query as `pop_plus`. You can see an example of this in the `countries_plus` code in the assignment text. Make sure to include a `;` at the end of your `WHERE` clause!

Then, add a second query below the first to display all the records in `pop_plus` using `SELECT * FROM pop_plus;`, so that the query result shows the contents of `pop_plus`.

`SELECT country_code,
        size,
        CASE
          WHEN size > 50000000 THEN 'large'
          WHEN size > 1000000 THEN 'medium'
          ELSE 'small'
        END AS popsize_group
 INTO   pop_plus
 FROM   populations
 WHERE  year = 2015 `

Keep the first query, which creates `pop_plus` using `INTO`, intact.

Write a query to join `countries_plus AS c` on the left with `pop_plus AS p` on the right, matching on the country code fields.

Sort the data based on `geosize_group`, in ascending order so that `large` appears on top.

Select the `name`, `continent`, `geosize_group`, and `popsize_group` fields.

`SELECT country_code,
        size,
        CASE
          WHEN size > 50000000 THEN 'large'
          WHEN size > 1000000 THEN 'medium'
          ELSE 'small'
        END AS popsize_group
 INTO   pop_plus
 FROM   populations
 WHERE  year = 2015`

    
`SELECT c.NAME,
        c.continent,
        c.geosize_group,
        p.popsize_group
 FROM   countries_plus AS c
        INNER JOIN pop_plus AS p
                ON c.code = p.country_code
 ORDER  BY geosize_group`
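
Sorting `geosize_group` in ascending order puts `large` on top only because the labels happen to sort alphabetically as `large`, `medium`, `small`. The sketch below checks this on a small in-memory SQLite database. The two tables are created directly with three rows taken from the earlier outputs rather than with `INTO`, and the extra `c.name` sort key is an addition made here so that ties come back in a predictable order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE countries_plus
  (name TEXT, continent TEXT, code TEXT, geosize_group TEXT);
CREATE TABLE pop_plus
  (country_code TEXT, size REAL, popsize_group TEXT);

INSERT INTO countries_plus VALUES
  ('Andorra', 'Europe', 'AND', 'small'),
  ('Angola',  'Africa', 'AGO', 'medium'),
  ('Algeria', 'Africa', 'DZA', 'large');

INSERT INTO pop_plus VALUES
  ('AND', 70473.0,    'small'),
  ('AGO', 25021974.0, 'medium'),
  ('DZA', 39666519.0, 'medium');
""")

joined = conn.execute("""
SELECT c.name,
       c.continent,
       c.geosize_group,
       p.popsize_group
FROM   countries_plus AS c
       INNER JOIN pop_plus AS p
               ON c.code = p.country_code
ORDER  BY c.geosize_group,
          c.name  -- tie-breaker added for this sketch only
""").fetchall()

for row in joined:
    print(row)
```

With labels that do not sort in the desired order, an explicit `CASE` expression in `ORDER BY` would be one way to control the ranking.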