# Introduction to SQL Using Python: Independent Subqueries

This tutorial is a brief introduction to __INDEPENDENT SUBQUERIES__. A __SUBQUERY__ is a query inside a query. An __INDEPENDENT SUBQUERY__ is a subquery that can be run on its own, without the main query.

We will be using the Football Delphi database, which can be downloaded here (https://www.kaggle.com/laudanum/footballdelphi). If you are not familiar with the database, now is a good time to review the tables and the columns in each table. If you have not seen my previous SQL tutorials, they are linked below:
 - Introduction to SQL Using Python: Filtering Data with the WHERE Statement (https://medium.com/analytics-vidhya/introduction-to-sql-using-python-filtering-data-with-the-where-statement-80d89688f39e)
 - Introduction to SQL Using Python: Computing Statistics & Aggregating Data (https://medium.com/analytics-vidhya/introduction-to-sql-using-python-computing-statistics-aggregating-data-c2861186b79f)
 - Introduction to SQL Using Python: Using JOIN Statements to Merge Multiple Tables

The following topics will be discussed in this tutorial:
 - SUBQUERIES
 - SUBQUERIES with Conditional Operators
 - SUBQUERIES with Aggregation Functions
 - SUBQUERIES that Return More than One Value

To begin, we will import the necessary libraries, __sqlite3__ and __pandas__.

In [1]:
# Import necessary libraries

import sqlite3
import pandas as pd

Next, you will need to connect to the database and create a cursor object.

In [2]:
# Connect to database
conn = sqlite3.connect('''database.sqlite''')

# Create cursor object
cur = conn.cursor()

The following is the format we will be using to run our SQL queries in Python.

In [None]:
cur.execute('''Enter SQL query here;''') # Runs SQL query
data = pd.DataFrame(cur.fetchall()) # Converts SQL query results into dataframe format
data.columns = [x[0] for x in cur.description] # Labels the columns of the dataframe
data # View SQL results dataframe
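
The template above can also be wrapped in a small helper so it does not have to be retyped for every query. The following is a minimal sketch and not part of the original notebook: `run_query` is a hypothetical name, and the in-memory `Demo` table holds made-up rows so the snippet runs without the Football Delphi file. It uses only __sqlite3__ and returns the column names and rows separately.

```python
import sqlite3

def run_query(conn, sql, params=()):
    """Run a SQL query and return (column_names, rows). Hypothetical helper."""
    cur = conn.cursor()
    cur.execute(sql, params)
    columns = [col[0] for col in cur.description]  # column labels from the cursor
    rows = cur.fetchall()
    return columns, rows

# Made-up demo data in an in-memory database (not the Football Delphi file)
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE Demo (TeamName TEXT, Season INTEGER)")
demo_conn.executemany("INSERT INTO Demo VALUES (?, ?)",
                      [("Wolfsburg", 2016), ("Dortmund", 2012)])

columns, rows = run_query(demo_conn, "SELECT * FROM Demo WHERE Season = ?;", (2016,))
print(columns, rows)
```

With __pandas__ imported, the two return values can be passed to `pd.DataFrame(rows, columns=columns)`, which labels the columns in one step and also works when a query returns no rows.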

# SUBQUERIES

A __SUBQUERY__ is a query inside a query. A __SUBQUERY__ is written inside parentheses, and an independent subquery is evaluated before the main query uses its result. For example, what if you wanted to find all the teams that had the same number of players as Wolfsburg did during the 2016 season? To do this you would first need to write a query that returns the number of players (__KaderHome__) on Wolfsburg during the 2016 season. This would be the __SUBQUERY__. You could then write the main query, which compares the __KaderHome__ value in every row of the __Teams__ table to the value returned by the __SUBQUERY__. Because the comparison uses __=__, the __SUBQUERY__ should return a single value. To see what this looks like, examine the query below:

In [8]:
cur.execute('''SELECT * 
               FROM Teams
               WHERE KaderHome = (
                   SELECT KaderHome
                   FROM Teams
                   WHERE TeamName ='Wolfsburg' AND Season = 2016);''')
df =pd.DataFrame(cur.fetchall())
df.columns = [x[0] for x in cur.description]
df

Unnamed: 0,Season,TeamName,KaderHome,AvgAgeHome,ForeignPlayersHome,OverallMarketValueHome,AvgMarketValueHome,StadiumCapacity
0,2016,Wolfsburg,43,24,21,225350000,5240000,30000
1,2013,Hoffenheim,43,23,24,77080000,1790000,30164
2,2010,Karlsruhe,43,24,12,18930000,440000,47728


When we run the __SUBQUERY__ on its own below we see that the value returned is 43.

In [113]:
cur.execute('''SELECT KaderHome
                FROM Teams
                WHERE TeamName ='Wolfsburg' AND Season = 2016;''')
df =pd.DataFrame(cur.fetchall())
df.columns = [x[0] for x in cur.description]
df

Unnamed: 0,KaderHome
0,43


Instead of writing 43 in the __WHERE__ clause, we wrote the __SUBQUERY__ to look up the number of players on __Wolfsburg__ during the __2016 Season__. The outer query compares every __KaderHome__ value to the value returned by the __SUBQUERY__, 43, so every row returned has 43 as its __KaderHome__ value.
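
To check that the subquery form and the hard-coded form are interchangeable, the two can be run side by side. This is a minimal sketch under stated assumptions: it builds a three-row in-memory stand-in for the __Teams__ table (values copied from outputs in this tutorial, with only three columns) rather than connecting to the real database.

```python
import sqlite3

# Small in-memory stand-in for the Teams table (sample rows only)
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE Teams (Season INTEGER, TeamName TEXT, KaderHome INTEGER)")
demo.executemany("INSERT INTO Teams VALUES (?, ?, ?)", [
    (2016, "Wolfsburg", 43),
    (2013, "Hoffenheim", 43),
    (2017, "Dortmund", 33),
])

# Outer query compares KaderHome to the value the subquery returns
with_subquery = demo.execute('''SELECT TeamName
                                FROM Teams
                                WHERE KaderHome = (
                                    SELECT KaderHome
                                    FROM Teams
                                    WHERE TeamName = 'Wolfsburg' AND Season = 2016)
                                ORDER BY TeamName;''').fetchall()

# Same query with the subquery replaced by its result
with_literal = demo.execute('''SELECT TeamName
                               FROM Teams
                               WHERE KaderHome = 43
                               ORDER BY TeamName;''').fetchall()

print(with_subquery == with_literal, with_subquery)
```

The advantage of the subquery form is that it keeps working if the underlying value changes, whereas the literal 43 would have to be updated by hand.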

Practice using __SUBQUERIES__ by writing a query that returns all the matches from the __Matches__ table where the number of home goals scored (__FTHG__) is the same as the number of home goals scored by __Bayern Munich__ on __2010-04-17__. Compare your query to the one below:

In [117]:
cur.execute('''SELECT * 
               FROM Matches
               WHERE FTHG = (
                   SELECT FTHG
                   FROM Matches
                   WHERE HomeTeam ='Bayern Munich' AND Date = '2010-04-17');''')
df =pd.DataFrame(cur.fetchall())
df.columns = [x[0] for x in cur.description]
df

Unnamed: 0,Match_ID,Div,Season,Date,HomeTeam,AwayTeam,FTHG,FTAG,FTR
0,128,D1,2009,2010-04-17,Bayern Munich,Hannover,7,0,H
1,852,D1,2010,2010-09-18,Stuttgart,M'gladbach,7,0,H
2,1197,D1,2011,2011-09-10,Bayern Munich,Freiburg,7,0,H
3,1643,D1,2011,2012-03-10,Bayern Munich,Hoffenheim,7,1,H
4,5561,D1,2005,2006-02-11,Schalke 04,Leverkusen,7,4,H
5,5911,D2,2005,2006-05-03,Karlsruhe,Braunschweig,7,0,H
6,6730,D1,2007,2008-05-17,Hamburg,Karlsruhe,7,0,H
7,30488,D2,1993,1993-10-16,F Koln,Saarbrucken,7,4,H
8,30810,D1,1994,1994-09-24,M'gladbach,Bochum,7,1,H
9,31329,D2,1994,1995-05-14,Hannover,FSV Frankfurt,7,0,H


# SUBQUERIES With Conditional Operators

__SUBQUERIES__ can also be used with other operators besides the equality sign (__=__). The following operators can also be used with __SUBQUERIES__:
 - __>__ (greater than)
 - __>=__ (greater than or equal to)
 - __<__ (less than)
 - __<=__ (less than or equal to)
 - __!=__ (not equal to)

The query below returns data for every team with a higher __StadiumCapacity__ value than __Stuttgart's__ __StadiumCapacity__ during the __2010 Season__:

In [118]:
cur.execute('''SELECT * 
               FROM Teams
               WHERE  Season = 2010 AND StadiumCapacity > (
                   SELECT StadiumCapacity
                   FROM Teams
                   WHERE TeamName ='Stuttgart' AND Season = 2010);''')
df =pd.DataFrame(cur.fetchall())
df.columns = [x[0] for x in cur.description]
df

Unnamed: 0,Season,TeamName,KaderHome,AvgAgeHome,ForeignPlayersHome,OverallMarketValueHome,AvgMarketValueHome,StadiumCapacity
0,2010,Bayern Munich,31,25,17,284500000,9180000,75000
1,2010,Schalke 04,36,24,23,139650000,3880000,62271
2,2010,Dortmund,31,24,16,114650000,3700000,81359
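
One thing worth knowing when comparing against a subquery is what happens if the subquery matches no rows, for example because of a misspelled team name. In SQLite the subquery then evaluates to NULL, the comparison is never true, and the outer query returns nothing rather than raising an error. The sketch below illustrates this with a made-up in-memory table; the Stuttgart capacity for 2010 is an assumed sample value.

```python
import sqlite3

# In-memory stand-in for the Teams table (sample values, not the real database)
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE Teams (Season INTEGER, TeamName TEXT, StadiumCapacity INTEGER)")
demo.executemany("INSERT INTO Teams VALUES (?, ?, ?)", [
    (2010, "Bayern Munich", 75000),
    (2010, "Schalke 04", 62271),
    (2010, "Dortmund", 81359),
    (2010, "Stuttgart", 60449),
    (2010, "Hoffenheim", 30164),
])

query = '''SELECT TeamName
           FROM Teams
           WHERE Season = 2010 AND StadiumCapacity > (
               SELECT StadiumCapacity
               FROM Teams
               WHERE TeamName = ? AND Season = 2010)
           ORDER BY TeamName;'''

# Correct spelling: the subquery returns one value to compare against
larger = demo.execute(query, ("Stuttgart",)).fetchall()

# Misspelled name: the subquery returns no rows, so the comparison is never true
no_match = demo.execute(query, ("Stuttgartt",)).fetchall()

print(larger)
print(no_match)
```

If an outer query unexpectedly returns an empty result, running the subquery on its own is a quick way to check whether it found anything.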


Practice using comparison operators with __SUBQUERIES__ by writing a query that returns all the rows from the __Teams__ table where the average age of the players (__AvgAgeHome__) is less than the average age of the players on __Dortmund__ during the __2012 Season__. The solution below also sorts the results by __AvgMarketValueHome__. Compare your query with the one below:

In [132]:
cur.execute('''SELECT * 
               FROM Teams
               WHERE  AvgAgeHome < (
                   SELECT AvgAgeHome
                   FROM Teams
                   WHERE TeamName ='Dortmund' AND Season = 2012)
            ORDER BY AvgMarketValueHome DESC;''')
df =pd.DataFrame(cur.fetchall())
df.columns = [x[0] for x in cur.description]
df

Unnamed: 0,Season,TeamName,KaderHome,AvgAgeHome,ForeignPlayersHome,OverallMarketValueHome,AvgMarketValueHome,StadiumCapacity
0,2010,Munich 1860,37,22,13,20930000,566000,47728
1,2016,Bochum,35,22,10,15180000,434000,47728
2,2009,Hoffenheim,33,22,16,90630000,2750000,30164
3,2015,RB Leipzig,31,22,12,35300000,1140000,47728


# SUBQUERIES with Aggregation Functions

We can also use the aggregation functions discussed in my previous tutorials within __SUBQUERIES__. The query below shows all the teams from the __2014 Season__ whose number of foreign players (__ForeignPlayersHome__) is higher than the average number of foreign players across all teams during the __2014 Season__.

In [35]:
cur.execute('''SELECT * 
               FROM Teams
               WHERE Season = 2014 AND ForeignPlayersHome > (
                   SELECT AVG(ForeignPlayersHome)
                   FROM Teams
                   WHERE Season = 2014);''')
df =pd.DataFrame(cur.fetchall())
df.columns = [x[0] for x in cur.description]
df

Unnamed: 0,Season,TeamName,KaderHome,AvgAgeHome,ForeignPlayersHome,OverallMarketValueHome,AvgMarketValueHome,StadiumCapacity
0,2014,Bayern Munich,33,25,17,564180000,17100000,75000
1,2014,Dortmund,37,24,17,329800000,8910000,81359
2,2014,Schalke 04,36,23,15,213380000,5930000,62271
3,2014,Wolfsburg,30,25,18,176130000,5870000,30000
4,2014,Leverkusen,26,24,14,166550000,6410000,30210
5,2014,Hamburg,38,24,21,114950000,3030000,57376
6,2014,M'gladbach,27,25,14,107330000,3980000,54014
7,2014,Stuttgart,30,24,14,97900000,3260000,60449
8,2014,Hannover,35,25,16,78050000,2230000,49200
9,2014,Mainz,36,24,19,76900000,2140000,34000
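
To make the two steps of this query visible, the sketch below runs the aggregate subquery on its own first and then the full query. It is an illustration only: the in-memory table contains a handful of sample rows, so the average here is not the average in the real database.

```python
import sqlite3

# In-memory stand-in for the Teams table (a few sample rows)
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE Teams (Season INTEGER, TeamName TEXT, ForeignPlayersHome INTEGER)")
demo.executemany("INSERT INTO Teams VALUES (?, ?, ?)", [
    (2014, "Bayern Munich", 17),
    (2014, "Leverkusen", 14),
    (2014, "Aalen", 8),
    (2013, "Dortmund", 11),   # different season, ignored by both filters
])

# Step 1: the subquery on its own returns a single aggregated value
avg_2014 = demo.execute('''SELECT AVG(ForeignPlayersHome)
                           FROM Teams
                           WHERE Season = 2014;''').fetchone()[0]

# Step 2: the outer query keeps the 2014 teams above that average
above_avg = demo.execute('''SELECT TeamName, ForeignPlayersHome
                            FROM Teams
                            WHERE Season = 2014 AND ForeignPlayersHome > (
                                SELECT AVG(ForeignPlayersHome)
                                FROM Teams
                                WHERE Season = 2014)
                            ORDER BY TeamName;''').fetchall()

print(avg_2014)
print(above_avg)
```

Note that the season filter appears twice: once in the subquery, so the average covers only 2014, and once in the outer query, so only 2014 teams are compared against it.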


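To practice aggregating with __SUBQUERIES__, write a query that shows each team from the __Teams__ table where the sum of that team's __OverallMarketValueHome__ across all seasons is less than the sum of __Bayern Munich's__ __OverallMarketValueHome__. Because the solution groups by __TeamName__ and selects every column, SQLite fills the non-aggregated columns with values from a single row of each group. Compare your query to the one below: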

In [142]:
cur.execute('''SELECT *
               FROM Teams
               GROUP BY TeamName
               HAVING SUM(OverallMarketValueHome) < (
                   SELECT SUM(OverallMarketValueHome)
                   FROM Teams
                   WHERE TeamName = 'Bayern Munich');''')
df =pd.DataFrame(cur.fetchall())
df.columns = [x[0] for x in cur.description]
df

Unnamed: 0,Season,TeamName,KaderHome,AvgAgeHome,ForeignPlayersHome,OverallMarketValueHome,AvgMarketValueHome,StadiumCapacity
0,2006,Aachen,26,25,5,23950000,921000,32960
1,2014,Aalen,33,25,8,11580000,351000,47728
2,2009,Ahlen,37,25,14,12930000,349000,47728
3,2017,Augsburg,36,26,20,63100000,1750000,30660
4,2008,Bielefeld,27,26,11,30000000,1110000,26515
5,2009,Bochum,32,26,15,44550000,1390000,29299
6,2013,Braunschweig,32,25,10,29600000,925000,23325
7,2006,Burghausen,28,25,13,11030000,394000,47728
8,2007,CZ Jena,33,24,15,14080000,427000,47728
9,2008,Cottbus,34,26,21,29450000,866000,22528
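
Since the grouped query above selects every column, the summed value that drives the __HAVING__ filter never appears in the output. A possible variation, sketched below on a made-up in-memory table, selects the team name and the sum explicitly. The alias `TotalValue` is a name chosen for this example.

```python
import sqlite3

# In-memory stand-in for the Teams table (sample rows, two seasons per team at most)
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE Teams (Season INTEGER, TeamName TEXT, OverallMarketValueHome INTEGER)")
demo.executemany("INSERT INTO Teams VALUES (?, ?, ?)", [
    (2014, "Bayern Munich", 564180000),
    (2017, "Bayern Munich", 597950000),
    (2014, "Dortmund", 329800000),
    (2017, "Dortmund", 416730000),
    (2014, "Aalen", 11580000),
])

# Show the aggregated value per team instead of SELECT *
totals = demo.execute('''SELECT TeamName,
                                SUM(OverallMarketValueHome) AS TotalValue
                         FROM Teams
                         GROUP BY TeamName
                         HAVING SUM(OverallMarketValueHome) < (
                             SELECT SUM(OverallMarketValueHome)
                             FROM Teams
                             WHERE TeamName = 'Bayern Munich')
                         ORDER BY TotalValue DESC;''').fetchall()

print(totals)
```

Bayern Munich itself is excluded because its total is equal to, not less than, the value returned by the subquery.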


# SUBQUERIES That Return More than One Value

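The previous queries all included __SUBQUERIES__ that returned a single value. __SUBQUERIES__ can also return more than one value, in which case the outer query compares against them with the __IN__ operator instead of __=__. To see how __IN__ works, first look at the query below, which uses a hard-coded list of values in place of a __SUBQUERY__.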

In [53]:
cur.execute('''SELECT * 
               FROM Teams
               WHERE Season = 2014 AND KaderHome IN (36, 38, 40);''')
df =pd.DataFrame(cur.fetchall())
df.columns = [x[0] for x in cur.description]
df

Unnamed: 0,Season,TeamName,KaderHome,AvgAgeHome,ForeignPlayersHome,OverallMarketValueHome,AvgMarketValueHome,StadiumCapacity
0,2014,Schalke 04,36,23,15,213380000,5930000,62271
1,2014,Hamburg,38,24,21,114950000,3030000,57376
2,2014,Mainz,36,24,19,76900000,2140000,34000
3,2014,Werder Bremen,36,23,16,58580000,1630000,42100
4,2014,Greuther Furth,40,23,17,25150000,629000,47728
5,2014,Munich 1860,36,24,18,22450000,624000,47728


In the query below, the __SUBQUERY__ returns every __HomeTeam__ value from the __Matches__ table that begins with a "D". The outer query compares every __TeamName__ from the __2013 Season__ in the __Teams__ table against that list. If a __TeamName__ matches any of the __HomeTeam__ values returned by the __SUBQUERY__, that row is returned.

In [143]:
cur.execute('''SELECT * 
               FROM Teams
               WHERE Season = 2013 AND TeamName IN (
                    SELECT HomeTeam
                    FROM Matches
                    WHERE HomeTeam LIKE 'D%');''')
df =pd.DataFrame(cur.fetchall())
df.columns = [x[0] for x in cur.description]
df

Unnamed: 0,Season,TeamName,KaderHome,AvgAgeHome,ForeignPlayersHome,OverallMarketValueHome,AvgMarketValueHome,StadiumCapacity
0,2013,Dortmund,31,24,11,285150000,9200000,81359
1,2013,Dresden,27,25,12,13300000,493000,47728
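
The same pattern can be tried on a small scale. The sketch below is an illustration with two made-up in-memory tables that share only the column names used above. It runs the subquery alone first, then the full __IN__ query, and shows that repeated values in the subquery result do not produce repeated rows in the outer query.

```python
import sqlite3

# In-memory stand-ins for the Teams and Matches tables (sample rows only)
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE Teams (Season INTEGER, TeamName TEXT)")
demo.execute("CREATE TABLE Matches (Match_ID INTEGER, HomeTeam TEXT)")
demo.executemany("INSERT INTO Teams VALUES (?, ?)", [
    (2013, "Dortmund"), (2013, "Dresden"), (2013, "Hoffenheim"),
])
demo.executemany("INSERT INTO Matches VALUES (?, ?)", [
    (1, "Dortmund"), (2, "Dortmund"), (3, "Dresden"), (4, "Hamburg"),
])

# The subquery on its own returns several values, including a repeat
home_teams = demo.execute('''SELECT HomeTeam
                             FROM Matches
                             WHERE HomeTeam LIKE 'D%'
                             ORDER BY Match_ID;''').fetchall()

# IN checks membership, so each matching Teams row is returned once
matched = demo.execute('''SELECT TeamName
                          FROM Teams
                          WHERE Season = 2013 AND TeamName IN (
                              SELECT HomeTeam
                              FROM Matches
                              WHERE HomeTeam LIKE 'D%')
                          ORDER BY TeamName;''').fetchall()

print(home_teams)
print(matched)
```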


Practice writing a

You have reached the end of this tutorial. We have discussed __INDEPENDENT SUBQUERIES__ and the following topics:
 - SUBQUERIES
 - SUBQUERIES with Conditional Operators
 - SUBQUERIES with Aggregation Functions
 - SUBQUERIES that Return More than One Value
 
 Keep practicing with this database and using subqueries to make your SQL queries more complex!

In [111]:
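# Scratch cell: SQLite does not support the ANY operator, so this query raises the
# syntax error shown below. FTHG is also a Matches column, not a Teams column.
cur.execute('''SELECT * 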
               FROM Teams
               WHERE FTHG > ANY (
                   SELECT AVG(ForeignPlayersHome)
                   FROM Teams);''')
df =pd.DataFrame(cur.fetchall())
df.columns = [x[0] for x in cur.description]
df

OperationalError: near "SELECT": syntax error
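
SQLite has no __ANY__ operator, which would explain the syntax error above. A comparison such as "greater than any value in the subquery" can usually be rewritten as "greater than the smallest value in the subquery" using __MIN__. The sketch below shows that rewrite on a made-up in-memory table. It is one possible workaround under these assumptions, not a query from the original tutorial.

```python
import sqlite3

# In-memory stand-in for the Teams table (sample rows only)
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE Teams (Season INTEGER, TeamName TEXT, KaderHome INTEGER)")
demo.executemany("INSERT INTO Teams VALUES (?, ?, ?)", [
    (2014, "Bayern Munich", 33),
    (2014, "Leverkusen", 26),
    (2017, "Bayern Munich", 27),
    (2017, "Dortmund", 33),
    (2017, "FC Koln", 24),
])

# Attempt the ANY form and record the error message, if any
try:
    demo.execute('''SELECT TeamName
                    FROM Teams
                    WHERE Season = 2017 AND KaderHome > ANY (
                        SELECT KaderHome FROM Teams WHERE Season = 2014);''')
    any_error = None
except sqlite3.OperationalError as err:
    any_error = str(err)

# Rewrite: "greater than any 2014 value" means "greater than the smallest 2014 value"
rows = demo.execute('''SELECT TeamName
                       FROM Teams
                       WHERE Season = 2017 AND KaderHome > (
                           SELECT MIN(KaderHome)
                           FROM Teams
                           WHERE Season = 2014)
                       ORDER BY TeamName;''').fetchall()

print(any_error)
print(rows)
```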

In [None]:
SELECT *
FROM trip
WHERE city_id IN
(SELECT id
FROM city
WHERE area > 100)

In [70]:
# Return the home teams that average more than 1.8 home goals per match

cur.execute('''SELECT HomeTeam
                FROM Matches
                GROUP BY HomeTeam
                HAVING AVG(FTHG) > 1.8;''')
df =pd.DataFrame(cur.fetchall())
df.columns = [x[0] for x in cur.description]
df

Unnamed: 0,HomeTeam
0,Arsenal
1,Bayern Munich
2,Chelsea
3,Dortmund
4,Leverkusen
5,Liverpool
6,M'Gladbach
7,Man United
8,Middlesboro
9,Reutlingen


In [98]:
# Return all rows from the Matches table

cur.execute('''SELECT *
                FROM Matches;''')
df =pd.DataFrame(cur.fetchall())
df.columns = [x[0] for x in cur.description]
df

Unnamed: 0,Match_ID,Div,Season,Date,HomeTeam,AwayTeam,FTHG,FTAG,FTR
0,1,D2,2009,2010-04-04,Oberhausen,Kaiserslautern,2,1,H
1,2,D2,2009,2009-11-01,Munich 1860,Kaiserslautern,0,1,A
2,3,D2,2009,2009-10-04,Frankfurt FSV,Kaiserslautern,1,1,D
3,4,D2,2009,2010-02-21,Frankfurt FSV,Karlsruhe,2,1,H
4,5,D2,2009,2009-12-06,Ahlen,Karlsruhe,1,3,A
5,6,D2,2009,2010-04-03,Union Berlin,Karlsruhe,1,1,D
6,7,D2,2009,2009-08-14,Paderborn,Karlsruhe,2,0,H
7,8,D2,2009,2010-03-08,Bielefeld,Karlsruhe,0,1,A
8,9,D2,2009,2009-09-26,Kaiserslautern,Karlsruhe,2,0,H
9,10,D2,2009,2009-11-21,Hansa Rostock,Karlsruhe,2,1,H


A __SELF JOIN__ is similar to a __JOIN__, but instead of joining one table to another table, you join a table to itself. Each copy of the table is given its own alias so the two can be told apart. To see an example of a __SELF JOIN__, look at the query below:

In [None]:
# Self join with no ON condition: every row is paired with every row

cur.execute('''SELECT *
               FROM Teams_in_Matches T1
               JOIN Teams_in_Matches T2;''')
df =pd.DataFrame(cur.fetchall())
df.columns = [x[0] for x in cur.description]
df

In [None]:
# Self join on Match_ID: pair rows that belong to the same match

cur.execute('''SELECT *
               FROM Teams_in_Matches T1
               JOIN Teams_in_Matches T2
               ON T1.Match_ID = T2.Match_ID;''')
df =pd.DataFrame(cur.fetchall())
df.columns = [x[0] for x in cur.description]
df

In [None]:
# Self join: keep pairs where T2 is a different team in the same match

cur.execute('''SELECT T1.Match_ID,
                      T1.Unique_Team_ID
               FROM Teams_in_Matches T1
               JOIN Teams_in_Matches T2
               ON T1.Match_ID = T2.Match_ID AND T1.Unique_Team_ID != T2.Unique_Team_ID;''')
df =pd.DataFrame(cur.fetchall())
df.columns = [x[0] for x in cur.description]
df
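
The effect of the join condition is easier to see on a tiny table. The sketch below assumes a `Teams_in_Matches` table with only the two columns used above (`Match_ID` and `Unique_Team_ID`) and fills it with made-up IDs. It compares the `!=` condition, which lists every pairing in both directions, with a `<` condition, which lists each pairing once.

```python
import sqlite3

# In-memory stand-in for Teams_in_Matches (made-up IDs, two teams per match)
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE Teams_in_Matches (Match_ID INTEGER, Unique_Team_ID INTEGER)")
demo.executemany("INSERT INTO Teams_in_Matches VALUES (?, ?)", [
    (1, 10), (1, 20),
    (2, 10), (2, 30),
])

# != keeps both orderings of each pairing: (10, 20) and (20, 10)
both_directions = demo.execute('''SELECT T1.Match_ID, T1.Unique_Team_ID, T2.Unique_Team_ID
                                  FROM Teams_in_Matches T1
                                  JOIN Teams_in_Matches T2
                                  ON T1.Match_ID = T2.Match_ID
                                     AND T1.Unique_Team_ID != T2.Unique_Team_ID
                                  ORDER BY T1.Match_ID, T1.Unique_Team_ID;''').fetchall()

# < keeps each pairing once, with the smaller ID first
once = demo.execute('''SELECT T1.Match_ID, T1.Unique_Team_ID, T2.Unique_Team_ID
                       FROM Teams_in_Matches T1
                       JOIN Teams_in_Matches T2
                       ON T1.Match_ID = T2.Match_ID
                          AND T1.Unique_Team_ID < T2.Unique_Team_ID
                       ORDER BY T1.Match_ID;''').fetchall()

print(both_directions)
print(once)
```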

In [6]:
# Return all rows from the Teams table

cur.execute('''SELECT * 
               FROM Teams;''')
df =pd.DataFrame(cur.fetchall())
df.columns = [x[0] for x in cur.description]
df

Unnamed: 0,Season,TeamName,KaderHome,AvgAgeHome,ForeignPlayersHome,OverallMarketValueHome,AvgMarketValueHome,StadiumCapacity
0,2017,Bayern Munich,27,26,15,597950000,22150000,75000
1,2017,Dortmund,33,25,18,416730000,12630000,81359
2,2017,Leverkusen,31,24,15,222600000,7180000,30210
3,2017,RB Leipzig,30,23,15,180130000,6000000,42959
4,2017,Schalke 04,29,24,17,179550000,6190000,62271
5,2017,M'gladbach,31,25,17,154400000,4980000,54014
6,2017,Wolfsburg,31,24,14,124430000,4010000,30000
7,2017,FC Koln,24,26,9,118550000,4940000,49968
8,2017,Hoffenheim,31,24,14,107330000,3460000,30164
9,2017,Hertha,26,26,12,86800000,3340000,74475
