# Joining Data in SQL

> Explore the power of joining tables while examining interesting features of countries and their cities throughout the world.

- author: Victor Omondi
- toc: true
- comments: true
- categories: [joins, sql]
- image: images/jds-shield-png

# Overview

We'll explore the power of joining tables while examining interesting features of countries and their cities throughout the world. We will use inner and outer joins, as well as self joins, semi joins, anti joins and cross joins, all fundamental tools in any PostgreSQL wizard's toolbox. The examples in this notebook run against an in-memory SQLite database.

# Setup 

In [1]:
import pandas as pd
%load_ext sql
%sql sqlite://

'Connected: @None'

# Introduction to joins

We'll look at the concept of joining tables and at different ways to enrich queries using inner joins and self joins. We'll also see how to use the `CASE` statement to split up a field into different categories.

## Introduction to INNER JOIN

### prime_ministers table

In [3]:
prime_ministers = pd.read_csv("datasets/leaders/prime_ministers.csv")
prime_ministers.head()

Unnamed: 0,country,continent,prime_minister
0,Egypt,Africa,Sherif Ismail
1,Portugal,Europe,Antonio Costa
2,Vietnam,Asia,Nguyen Xuan Phuc
3,Haiti,North America,Jack Guy Lafontant
4,India,Asia,Narendra Modi

### presidents table


In [5]:
presidents = pd.read_csv("datasets/leaders/presidents.csv")
presidents.head()

Unnamed: 0,country,continent,president
0,Egypt,Africa,Abdel Fattah el-Sisi
1,Portugal,Europe,Marcelo Rebelo de Sousa
2,Haiti,North America,Jovenel Moise
3,Uruguay,South America,Jose Mujica
4,Liberia,Africa,Ellen Johnson Sirleaf


In [6]:
%sql DROP TABLE IF EXISTS prime_ministers;
%sql PERSIST prime_ministers;
%sql DROP TABLE IF EXISTS presidents;
%sql PERSIST presidents;

 * sqlite://
Done.
 * sqlite://
 * sqlite://
Done.
 * sqlite://


'Persisted presidents'

### INNER JOIN in SQL

In [7]:
%%sql
SELECT p1.country, p1.continent, prime_minister, president
    FROM prime_ministers AS p1
    INNER JOIN presidents AS p2
    ON p1.country = p2.country

 * sqlite://
Done.


country,continent,prime_minister,president
Egypt,Africa,Sherif Ismail,Abdel Fattah el-Sisi
Portugal,Europe,Antonio Costa,Marcelo Rebelo de Sousa
Vietnam,Asia,Nguyen Xuan Phuc,Tran Dai Quang
Haiti,North America,Jack Guy Lafontant,Jovenel Moise
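
An inner join keeps only the rows whose key appears in both tables. The following is a minimal sketch of that behaviour using Python's built-in `sqlite3` module rather than the `%sql` magic; the two tables hold only a hand-picked subset of the rows shown above, so the table contents here are an assumption made for illustration, not the full datasets.

```python
import sqlite3

# In-memory database with a small, hand-picked subset of the rows shown above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE prime_ministers (country TEXT, continent TEXT, prime_minister TEXT);
CREATE TABLE presidents (country TEXT, continent TEXT, president TEXT);
INSERT INTO prime_ministers VALUES
    ('Egypt', 'Africa', 'Sherif Ismail'),
    ('Portugal', 'Europe', 'Antonio Costa'),
    ('India', 'Asia', 'Narendra Modi');
INSERT INTO presidents VALUES
    ('Egypt', 'Africa', 'Abdel Fattah el-Sisi'),
    ('Portugal', 'Europe', 'Marcelo Rebelo de Sousa'),
    ('Uruguay', 'South America', 'Jose Mujica');
""")

# Same join as above: only countries present in BOTH tables survive.
rows = conn.execute("""
SELECT p1.country, prime_minister, president
FROM prime_ministers AS p1
INNER JOIN presidents AS p2
ON p1.country = p2.country
ORDER BY p1.country;
""").fetchall()

joined_countries = [row[0] for row in rows]
print(joined_countries)
```

In this subset, India has no row in `presidents` and Uruguay has no row in `prime_ministers`, so neither appears in the joined result.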


We'll be working with the `countries` database containing information about the most populous world cities as well as country-level economic data, population data, and geographic data. This `countries` database also contains information on languages spoken in each country.

In [8]:
cities = pd.read_csv("datasets/countries/cities.csv")
cities.head()

Unnamed: 0,name,country_code,city_proper_pop,metroarea_pop,urbanarea_pop
0,Abidjan,CIV,4765000,,4765000
1,Abu Dhabi,ARE,1145000,,1145000
2,Abuja,NGA,1235880,6000000.0,1235880
3,Accra,GHA,2070463,4010054.0,2070463
4,Addis Ababa,ETH,3103673,4567857.0,3103673


In [9]:
countries = pd.read_csv("datasets/countries/countries.csv")
countries.head()

Unnamed: 0,code,country_name,continent,region,surface_area,indep_year,local_name,gov_form,capital,cap_long,cap_lat
0,AFG,Afghanistan,Asia,Southern and Central Asia,652090.0,1919.0,Afganistan/Afqanestan,Islamic Emirate,Kabul,69.1761,34.5228
1,NLD,Netherlands,Europe,Western Europe,41526.0,1581.0,Nederland,Constitutional Monarchy,Amsterdam,4.89095,52.3738
2,ALB,Albania,Europe,Southern Europe,28748.0,1912.0,Shqiperia,Republic,Tirane,19.8172,41.3317
3,DZA,Algeria,Africa,Northern Africa,2381740.0,1962.0,Al-Jazair/Algerie,Republic,Algiers,3.05097,36.7397
4,ASM,American Samoa,Oceania,Polynesia,199.0,,Amerika Samoa,US Territory,Pago Pago,-170.691,-14.2846


In [10]:
%sql DROP TABLE IF EXISTS cities;
%sql DROP TABLE IF EXISTS countries;
%sql PERSIST cities;
%sql PERSIST countries;

 * sqlite://
Done.
 * sqlite://
Done.
 * sqlite://
 * sqlite://


'Persisted countries'

In [11]:
%%sql
-- Select all columns from cities
SELECT *
    FROM cities
    LIMIT 5;

 * sqlite://
Done.


index,name,country_code,city_proper_pop,metroarea_pop,urbanarea_pop
0,Abidjan,CIV,4765000,,4765000
1,Abu Dhabi,ARE,1145000,,1145000
2,Abuja,NGA,1235880,6000000.0,1235880
3,Accra,GHA,2070463,4010054.0,2070463
4,Addis Ababa,ETH,3103673,4567857.0,3103673


In [12]:
%%sql
SELECT * 
FROM cities
    -- 1. Inner join to countries
    INNER JOIN countries
    -- 2. Match on the country codes
    ON cities.country_code = countries.code
    LIMIT 5;

 * sqlite://
Done.


index,name,country_code,city_proper_pop,metroarea_pop,urbanarea_pop,index_1,code,country_name,continent,region,surface_area,indep_year,local_name,gov_form,capital,cap_long,cap_lat
0,Abidjan,CIV,4765000,,4765000,133,CIV,Cote d'Ivoire,Africa,Western Africa,322463.0,1960.0,Cote dIvoire,Republic,Yamoussoukro,-4.0305,5.332000000000001
1,Abu Dhabi,ARE,1145000,,1145000,8,ARE,United Arab Emirates,Asia,Middle East,83600.0,1971.0,Al-Imarat al-´Arabiya al-Muttahida,Emirate Federation,Abu Dhabi,54.3705,24.4764
2,Abuja,NGA,1235880,6000000.0,1235880,131,NGA,Nigeria,Africa,Western Africa,923768.0,1960.0,Nigeria,Federal Republic,Abuja,7.48906,9.05804
3,Accra,GHA,2070463,4010054.0,2070463,52,GHA,Ghana,Africa,Western Africa,238533.0,1957.0,Ghana,Republic,Accra,-0.20795,5.57045
4,Addis Ababa,ETH,3103673,4567857.0,3103673,45,ETH,Ethiopia,Africa,Eastern Africa,1104300.0,-1000.0,YeItyop´iya,Republic,Addis Ababa,38.7468,9.02274


In [14]:
%%sql
-- 1. Select name fields (with alias) and region 
SELECT cities.name AS city, countries.country_name AS country, countries.region
    FROM cities
    INNER JOIN countries
    ON cities.country_code = countries.code
    LIMIT 5;

 * sqlite://
Done.


city,country,region
Abidjan,Cote d'Ivoire,Western Africa
Abu Dhabi,United Arab Emirates,Middle East
Abuja,Nigeria,Western Africa
Accra,Ghana,Western Africa
Addis Ababa,Ethiopia,Eastern Africa


Instead of writing the full table name each time, we can use a table alias as a shortcut. As with column aliases, we use `AS` to add the alias immediately after the table name, separated by a space. To select a field that appears in more than one table, we need to say which table (or table alias) we mean by prefixing the field with the table name or alias and a `.` in the `SELECT` statement, for example `p1.country`.
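
To illustrate why the table prefix matters, here is a small sketch using Python's built-in `sqlite3` module. It assumes two tiny tables that reuse the `prime_ministers` and `presidents` column names with only two rows each (an illustrative subset, not the full data). Selecting the shared `country` column without a prefix should fail in SQLite with an "ambiguous column name" error, while the aliased version runs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE prime_ministers (country TEXT, continent TEXT, prime_minister TEXT);
CREATE TABLE presidents (country TEXT, continent TEXT, president TEXT);
INSERT INTO prime_ministers VALUES
    ('Egypt', 'Africa', 'Sherif Ismail'),
    ('Portugal', 'Europe', 'Antonio Costa');
INSERT INTO presidents VALUES
    ('Egypt', 'Africa', 'Abdel Fattah el-Sisi'),
    ('Portugal', 'Europe', 'Marcelo Rebelo de Sousa');
""")

# `country` exists in both tables, so the unqualified name cannot be resolved.
try:
    conn.execute("""
    SELECT country, prime_minister, president
    FROM prime_ministers AS p1
    INNER JOIN presidents AS p2
    ON p1.country = p2.country;
    """)
    ambiguous_error = None
except sqlite3.OperationalError as exc:
    ambiguous_error = str(exc)

# Prefixing shared columns with the alias removes the ambiguity.
qualified = conn.execute("""
SELECT p1.country, p1.continent, prime_minister, president
FROM prime_ministers AS p1
INNER JOIN presidents AS p2
ON p1.country = p2.country
ORDER BY p1.country;
""").fetchall()

print(ambiguous_error)
print(qualified)
```

Columns that exist in only one of the joined tables, such as `prime_minister` and `president`, can be left unprefixed.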

We'll now explore a way to get data from both the `countries` and `economies` tables to examine the inflation rate for both 2010 and 2015.

In [15]:
economies = pd.read_csv("datasets/countries/economies.csv")
economies.head()

Unnamed: 0,econ_id,code,year,income_group,gdp_percapita,gross_savings,inflation_rate,total_investment,unemployment_rate,exports,imports
0,1,AFG,2010,Low income,539.667,37.133,2.179,30.402,,46.394,24.381
1,2,AFG,2015,Low income,615.091,21.466,-1.549,18.602,,-49.11,-7.294
2,3,AGO,2010,Upper middle income,3599.27,23.534,14.48,14.433,,-3.266,-21.076
3,4,AGO,2015,Upper middle income,3876.2,-0.425,10.287,9.552,,6.721,-21.778
4,5,ALB,2010,Upper middle income,4098.13,20.011,3.605,31.305,14.0,10.645,-8.013


In [18]:
%sql DROP TABLE IF EXISTS economies;
%sql PERSIST economies;

 * sqlite://
Done.
 * sqlite://


'Persisted economies'

In [20]:
%%sql
-- 3. Select fields with aliases
SELECT c.code AS country_code, country_name, year, inflation_rate
    FROM countries AS c
    -- 1. Join to economies (alias e)
    INNER JOIN economies AS e
    -- 2. Match on code
    ON c.code = e.code
    LIMIT 5;

 * sqlite://
Done.


country_code,country_name,year,inflation_rate
AFG,Afghanistan,2010,2.1790000000000003
AFG,Afghanistan,2015,-1.5490000000000002
NLD,Netherlands,2010,0.932
NLD,Netherlands,2015,0.22
ALB,Albania,2010,3.605


In [22]:
populations = pd.read_csv("datasets/countries/populations.csv")
populations.head()

Unnamed: 0,pop_id,country_code,year,fertility_rate,life_expectancy,size
0,20,ABW,2010,1.704,74.953537,101597.0
1,19,ABW,2015,1.647,75.573585,103889.0
2,2,AFG,2010,5.746,58.970829,27962207.0
3,1,AFG,2015,4.653,60.717171,32526562.0
4,12,AGO,2010,6.416,50.654171,21219954.0


In [24]:
%sql DROP TABLE IF EXISTS populations;
%sql PERSIST populations;

 * sqlite://
Done.
 * sqlite://


'Persisted populations'

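Now, for each country, we want the country name, its region, and the fertility rate and unemployment rate for both 2010 and 2015. We start by joining `countries` to `populations` to get the fertility rate, then add `economies` for the unemployment rate.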

In [26]:
%%sql
-- 4. Select fields
SELECT c.code, country_name, region, year, fertility_rate
    -- 1. From countries (alias as c)
    FROM countries AS c
    -- 2. Join with populations (as p)
    INNER JOIN populations AS p
    -- 3. Match on country code
    ON c.code = p.country_code
    LIMIT 5;

 * sqlite://
Done.


code,country_name,region,year,fertility_rate
AFG,Afghanistan,Southern and Central Asia,2010,5.746
AFG,Afghanistan,Southern and Central Asia,2015,4.6530000000000005
NLD,Netherlands,Western Europe,2010,1.79
NLD,Netherlands,Western Europe,2015,1.71
ALB,Albania,Southern Europe,2010,1.663


In [28]:
%%sql
-- 6. Select fields
SELECT c.code, country_name, region, e.year, fertility_rate, unemployment_rate
    -- 1. From countries (alias as c)
    FROM countries AS c
    -- 2. Join to populations (as p)
    INNER JOIN populations AS p
    -- 3. Match on country code
    ON c.code = p.country_code
    -- 4. Join to economies (as e)
    INNER JOIN economies AS e
    -- 5. Match on country code
    ON c.code = e.code
    LIMIT 5;

 * sqlite://
Done.


code,country_name,region,year,fertility_rate,unemployment_rate
AFG,Afghanistan,Southern and Central Asia,2010,4.6530000000000005,
AFG,Afghanistan,Southern and Central Asia,2015,4.6530000000000005,
AFG,Afghanistan,Southern and Central Asia,2010,5.746,
AFG,Afghanistan,Southern and Central Asia,2015,5.746,
NLD,Netherlands,Western Europe,2010,1.71,4.995
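
Notice that the result above pairs a 2010 row with the 2015 fertility rate. Because both `populations` and `economies` hold one row per country per year, matching on the country code alone combines every population year with every economy year. The sketch below reproduces this with Python's built-in `sqlite3` module on a hypothetical one-country subset (the Netherlands values are copied from the outputs in this post; the reduced table schemas are an assumption for brevity), and compares it with a join that also matches on `year`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE countries (code TEXT, country_name TEXT);
CREATE TABLE populations (country_code TEXT, year INTEGER, fertility_rate REAL);
CREATE TABLE economies (code TEXT, year INTEGER, unemployment_rate REAL);
INSERT INTO countries VALUES ('NLD', 'Netherlands');
INSERT INTO populations VALUES ('NLD', 2010, 1.79), ('NLD', 2015, 1.71);
INSERT INTO economies VALUES ('NLD', 2010, 4.995), ('NLD', 2015, 6.891);
""")

# Query template: {extra} optionally adds the year condition to the second join.
base = """
SELECT p.year AS pop_year, e.year AS econ_year, fertility_rate, unemployment_rate
FROM countries AS c
INNER JOIN populations AS p ON c.code = p.country_code
INNER JOIN economies AS e ON c.code = e.code {extra}
ORDER BY pop_year, econ_year;
"""

# Matching on code only: 2 population rows x 2 economy rows = 4 rows.
code_only = conn.execute(base.format(extra="")).fetchall()

# Matching on code and year: one row per country per year.
code_and_year = conn.execute(base.format(extra="AND e.year = p.year")).fetchall()

print(len(code_only), len(code_and_year))
for row in code_only:
    print(row)
```

Rows in `code_only` where `pop_year` and `econ_year` differ are the mismatched pairs; adding the year condition to the `ON` clause leaves only the rows where both years agree.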


In [31]:
%%sql
-- 6. Select fields
SELECT c.code, country_name, region, e.year, fertility_rate, unemployment_rate
    -- 1. From countries (alias as c)
    FROM countries AS c
    -- 2. Join to populations (as p)
    INNER JOIN populations AS p
    -- 3. Match on country code
    ON c.code = p.country_code
    -- 4. Join to economies (as e)
    INNER JOIN economies AS e
    -- 5. Match on country code and year
    ON c.code = e.code AND e.year=p.year
    LIMIT 5;

 * sqlite://
Done.


code,country_name,region,year,fertility_rate,unemployment_rate
AFG,Afghanistan,Southern and Central Asia,2010,5.746,
AFG,Afghanistan,Southern and Central Asia,2015,4.6530000000000005,
NLD,Netherlands,Western Europe,2010,1.79,4.995
NLD,Netherlands,Western Europe,2015,1.71,6.891
ALB,Albania,Southern Europe,2010,1.663,14.0


## INNER JOIN via USING
