<div align="right" style=" font-size: 80%; text-align: center; margin: 0 auto">
<img
 src="https://raw.githubusercontent.com/Explore-AI/Pictures/master/alx-courses/aice/assets/Content_page_banner_blue_dots.png"
 alt="ALX Content Header"
 class="full-width-image"
/>
</div>

# Subquery in the JOIN clause

In this notebook, we look at subqueries, a powerful tool for more in-depth analysis in SQL. A subquery is essentially an intermediate result set that another query accesses: **a query inside another query**. Subqueries can appear in various places in a query, and their results can take various forms. Here, we look at the **use of a subquery in the `JOIN` clause.**

> ⚠️ This notebook will not run on Google Colab because it cannot connect to a local database. Please make sure that this notebook is running on the same local machine as your MySQL Workbench installation and MySQL `united_nations` database.

## Learning objectives

In this train, we will learn:
- How to use the result set of a subquery in the main query by joining the main table to the subquery on a related column.

## Overview

Imagine we want to calculate the share of a sub-region’s total land area that each country in that sub-region occupies, expressed as a percentage. We would need to divide each country’s land area by the summed land area of all countries in that sub-region.

Previously, we created a correlated subquery that recalculated the sub-region’s total land area for each row. Let’s improve on that. It would be more efficient to calculate the total land area once for each sub-region. **We can then run a query that simply retrieves the total land area value from the result of the inner query.**
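
To make the two approaches concrete, the sketch below runs both on a tiny in-memory SQLite table and checks that they return the same percentages. It is only an illustration: SQLite stands in for MySQL, the four rows are hypothetical and not taken from the `united_nations` database, and `Land_area` is stored as `REAL` so that SQLite does not perform integer division.

```python
import sqlite3

# Hypothetical rows for illustration only; not the real united_nations data.
rows = [
    ("Aland", "North", 300.0),
    ("Borduria", "North", 100.0),
    ("Carpania", "South", 50.0),
    ("Dorne", "South", 150.0),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Geographic_location "
    "(Country_name TEXT, Sub_region TEXT, Land_area REAL)"
)
conn.executemany("INSERT INTO Geographic_location VALUES (?, ?, ?)", rows)

# Correlated subquery: the inner SUM depends on the current outer row.
correlated = conn.execute("""
    SELECT
        g.Country_name,
        g.Land_area / (SELECT SUM(Land_area)
                       FROM Geographic_location
                       WHERE Sub_region = g.Sub_region) * 100 AS Pct
    FROM Geographic_location AS g
    ORDER BY g.Country_name
""").fetchall()

# Subquery in the JOIN clause: totals are computed once per sub-region,
# then joined back to the main table on Sub_region.
joined = conn.execute("""
    SELECT
        g.Country_name,
        g.Land_area / t.TotalLandArea * 100 AS Pct
    FROM Geographic_location AS g
    JOIN (SELECT Sub_region, SUM(Land_area) AS TotalLandArea
          FROM Geographic_location
          GROUP BY Sub_region) AS t
        ON g.Sub_region = t.Sub_region
    ORDER BY g.Country_name
""").fetchall()

print(joined)
```

Both queries produce the same rows here; the difference lies in how the totals are obtained, not in the result.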

## Connecting to our MySQL database

We will use our `Geographic_location` table in our `united_nations` database that we created in MySQL Workbench. We can apply the same queries we used in MySQL Workbench in this notebook if we connect to our MySQL server by running the cells below.


In [1]:
# Load and activate the SQL extension to allow us to execute SQL in a Jupyter notebook. 
# If you get an error here, make sure that mysql and pymysql are installed correctly. 

%load_ext sql

In [2]:
# Establish a connection to the local database using the '%sql' magic command.
# Replace 'password' with our connection password and `db_name` with our database name. 
# If you get an error here, please make sure the database name or password is correct.

%sql mysql+pymysql://root:password@localhost:3306/united_nations

'Connected: root@united_nations'
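
If the connection password contains characters that are reserved in URLs, such as `@`, they need to be percent-encoded before being placed in the connection string; otherwise the URL may be parsed incorrectly. A minimal sketch using Python's standard library is shown below. The credentials are hypothetical placeholders.

```python
from urllib.parse import quote_plus

# Hypothetical credentials for illustration only.
user = "root"
password = "p@ssw0rd!"
database = "united_nations"

# Percent-encode reserved characters so they are not read as URL delimiters.
encoded_password = quote_plus(password)
connection_url = (
    f"mysql+pymysql://{user}:{encoded_password}@localhost:3306/{database}"
)
print(connection_url)
```

The printed string can then be passed to the `%sql` magic command in place of a hand-written URL.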

## Exercise

### 1. Calculate the total land area for each sub-region

Write a query that aggregates the data by summing `Land_area` and grouping the sums by `Sub_region`. Give the sums the alias `TotalLandArea`.

In [6]:
%%sql
SELECT
    Sub_region,
    SUM(Land_area) AS TotalLandArea
FROM
    Geographic_location
GROUP BY 
    Sub_region;

 * mysql+pymysql://root:***@localhost:3306/united_nations
18 rows affected.


| Sub_region | TotalLandArea |
| --- | --- |
| Southern Asia | 4770136 |
| Northern Africa | 6610941 |
| Polynesia | 7218 |
| Middle Africa | 3888270 |
| Caribbean | 208104 |
| South America | 15401392 |
| Western Asia | 3488572 |
| Australia and New Zealand | 7953710 |
| Central America | 2452080 |
| Western Africa | 5735549 |


### 2. Calculate country land area percentages for all the regions using a subquery in the JOIN clause 

Write a main query that selects the columns `Country_name`, `Land_area`, and `Sub_region` from the `Geographic_location` table. Add a calculated column that divides each country's `Land_area` by the `TotalLandArea` value returned by the subquery and multiplies the result by 100 to express it as a percentage. Give this calculated column the alias `Pct_of_region_land`.

The query should also have a `JOIN` clause that contains the query from Exercise 1 as a subquery with the alias `Land_per_region`. Join the `Geographic_location` table to the `Land_per_region` subquery on the `Sub_region` column.

In [8]:
%%sql
SELECT
    Country_name,
    Land_area,
    Sub_region,
    (Land_area/(SELECT
                    SUM(Land_area) AS TotalLandArea
                FROM
                    Geographic_location
                GROUP BY 
                    Sub_region;)
FROM
    Geographic_location;

 * mysql+pymysql://root:***@localhost:3306/united_nations
(pymysql.err.ProgrammingError) (1064, "You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near ';)\nFROM\n    Geographic_location' at line 10")
[SQL: SELECT
    Country_name,
    Land_area,
    Sub_region,
    (Land_area/(SELECT
                    SUM(Land_area) AS TotalLandArea
                FROM
                    Geographic_location
                GROUP BY 
                    Sub_region;)
FROM
    Geographic_location;]
(Background on this error at: https://sqlalche.me/e/14/f405)

The attempt above raises a syntax error because the subquery ends with a semicolon (`Sub_region;)`), which terminates the statement before the closing parenthesis, and because the parenthesis opened before `Land_area` is never closed. Even with the syntax repaired, a subquery placed in the `SELECT` list must return a single value, whereas this one returns one total per sub-region. Moving the subquery into the `JOIN` clause, as the exercise asks, avoids that restriction.


## Solutions

### 1. Calculate the total land area for each sub-region

In [None]:
%%sql

SELECT 
    Sub_region, 
    SUM(Land_area) AS Total_Land_Area
FROM 
    Geographic_location
GROUP BY 
    Sub_region

This gives us a table listing every sub-region with its total land area. In this case, the subquery returns not a single value but an entire table of values. We can now join this table to the main one, using `Sub_region` as the key.

### 2. Calculate country land area percentages for all the regions using a subquery in the JOIN clause

In [9]:
%%sql

SELECT 
    geoloc.Country_name,
    geoloc.Land_area,
    geoloc.Sub_region,
    (geoloc.Land_area / Land_per_region.Total_Land_Area) * 100 AS Pct_Of_Region_Land
FROM
    Geographic_location AS geoloc
JOIN 
    (
    SELECT 
        Sub_region, 
        SUM(Land_area) AS Total_Land_Area
    FROM 
        Geographic_location
    GROUP BY 
        Sub_region
    ) AS Land_per_region 
    ON 
        geoloc.Sub_region = Land_per_region.Sub_region;

 * mysql+pymysql://root:***@localhost:3306/united_nations
182 rows affected.


| Country_name | Land_area | Sub_region | Pct_Of_Region_Land |
| --- | --- | --- | --- |
| Sri Lanka | 61878.0 | Southern Asia | 1.2972 |
| Pakistan | 770880.0 | Southern Asia | 16.1605 |
| Nepal | 143350.0 | Southern Asia | 3.0052 |
| Maldives | 300.0 | Southern Asia | 0.0063 |
| Iran (Islamic Republic of) |  | Southern Asia |  |
| India | 2973190.0 | Southern Asia | 62.3293 |
| Bhutan | 38138.0 | Southern Asia | 0.7995 |
| Bangladesh | 130170.0 | Southern Asia | 2.7289 |
| Afghanistan | 652230.0 | Southern Asia | 13.6732 |
| Tunisia | 155360.0 | Northern Africa | 2.35 |
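
In the result above, the row for Iran has an empty `Land_area` and an empty percentage. Assuming the blank represents a `NULL` value, this is consistent with two standard SQL rules: `SUM` skips `NULL` values, and arithmetic involving `NULL` yields `NULL`. The sketch below illustrates both rules on hypothetical rows, with an in-memory SQLite database standing in for MySQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Geographic_location "
    "(Country_name TEXT, Sub_region TEXT, Land_area REAL)"
)
# Hypothetical rows; 'Country C' has an unknown (NULL) land area.
conn.executemany(
    "INSERT INTO Geographic_location VALUES (?, ?, ?)",
    [
        ("Country A", "Region X", 300.0),
        ("Country B", "Region X", 100.0),
        ("Country C", "Region X", None),
    ],
)

result = conn.execute("""
    SELECT
        g.Country_name,
        g.Land_area / t.Total_Land_Area * 100 AS Pct_Of_Region_Land
    FROM Geographic_location AS g
    JOIN (SELECT Sub_region, SUM(Land_area) AS Total_Land_Area
          FROM Geographic_location
          GROUP BY Sub_region) AS t
        ON g.Sub_region = t.Sub_region
    ORDER BY g.Country_name
""").fetchall()

# The NULL row is ignored by SUM (total = 400) and its own percentage is NULL.
print(result)
```

The practical consequence is that the percentages of the remaining countries add up to 100 for the sub-region, because the missing value is left out of the total.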


In [13]:
%%sql
SELECT 
    Sub_region, 
    SUM(Land_area) AS Total_Land_Area
FROM 
    Geographic_location
GROUP BY 
    Sub_region;

 * mysql+pymysql://root:***@localhost:3306/united_nations
18 rows affected.


| Sub_region | Total_Land_Area |
| --- | --- |
| Southern Asia | 4770136 |
| Northern Africa | 6610941 |
| Polynesia | 7218 |
| Middle Africa | 3888270 |
| Caribbean | 208104 |
| South America | 15401392 |
| Western Asia | 3488572 |
| Australia and New Zealand | 7953710 |
| Central America | 2452080 |
| Western Africa | 5735549 |


While this method may look a bit more complicated, it typically runs faster than the correlated subquery because the sub-region totals are calculated once instead of being recalculated for every row. The difference becomes more noticeable as tables grow larger.
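
The reasoning behind this can be sketched in plain Python by counting how many rows each strategy has to visit. This is a conceptual model only: the function names and rows are hypothetical, and a real database optimizer may cache or rewrite a correlated subquery, so actual MySQL timings can differ.

```python
from collections import defaultdict

# Hypothetical (country, sub_region, land_area) rows for illustration.
rows = [
    ("Country A", "Region X", 300.0),
    ("Country B", "Region X", 100.0),
    ("Country C", "Region Y", 50.0),
    ("Country D", "Region Y", 150.0),
]


def per_row_totals(rows):
    """Mimic a naive correlated subquery: rescan the table for every row."""
    rows_scanned = 0
    result = {}
    for country, region, area in rows:
        total = 0.0
        for _, other_region, other_area in rows:
            rows_scanned += 1
            if other_region == region:
                total += other_area
        result[country] = area / total * 100
    return result, rows_scanned


def precomputed_totals(rows):
    """Mimic the JOIN approach: one pass builds totals, one pass looks them up."""
    rows_scanned = 0
    totals = defaultdict(float)
    for _, region, area in rows:
        rows_scanned += 1
        totals[region] += area
    result = {}
    for country, region, area in rows:
        rows_scanned += 1
        result[country] = area / totals[region] * 100
    return result, rows_scanned


slow_result, slow_scans = per_row_totals(rows)
fast_result, fast_scans = precomputed_totals(rows)
print(slow_scans, fast_scans)
```

For these four rows, the per-row strategy visits 16 rows while the precomputed strategy visits 8. In this model the first count grows with the square of the table size and the second grows linearly.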


<div align="center" style=" font-size: 80%; text-align: center; margin: 0 auto">
<img src="https://raw.githubusercontent.com/Explore-AI/Pictures/refs/heads/master/ALX_banners/ALX_Navy.png" style="width:100px" />
</div>