# Introduction to SQL Using Python: SUBQUERIES and SELF JOINS

To begin, we will import the necessary libraries, __sqlite3__ and __pandas__.

In [1]:
# Import necessary libraries

import sqlite3
import pandas as pd

Next, you will need to connect to the database and create a cursor object.

In [2]:
# Connect to database
conn = sqlite3.connect('''database.sqlite''')

# Create cursor object
cur = conn.cursor()

The following is the format we will use to run our SQL queries in Python.

In [None]:
cur.execute('''Enter SQL query here;''') # Runs SQL query
data = pd.DataFrame(cur.fetchall()) # Converts SQL query results into dataframe format
data.columns = [x[0] for x in cur.description] # Labels the columns of the dataframe
data # View SQL results dataframe
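
The steps above can be wrapped in a small helper so that each query is a single call. The sketch below is an assumption, not part of the original notebook: `run_query` is a hypothetical name, and the demo uses an in-memory database with a tiny made-up `Teams` table so it runs without `database.sqlite`. It returns the column names and rows; passing them to `pd.DataFrame(rows, columns=columns)` should give the same kind of dataframe as the cell above.

```python
import sqlite3

def run_query(connection, sql, params=()):
    """Run a SQL query and return (column_names, rows)."""
    cursor = connection.cursor()
    cursor.execute(sql, params)
    # The first item of each cursor.description entry is the column name
    columns = [entry[0] for entry in cursor.description]
    rows = cursor.fetchall()
    return columns, rows

# Hypothetical demo data in an in-memory database
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE Teams (Season INTEGER, TeamName TEXT, KaderHome INTEGER)")
demo.executemany(
    "INSERT INTO Teams VALUES (?, ?, ?)",
    [(2016, "Wolfsburg", 43), (2013, "Hoffenheim", 43)],
)

columns, rows = run_query(demo, "SELECT * FROM Teams WHERE Season = ?;", (2016,))
print(columns)
print(rows)
```

Using a `?` placeholder for the season keeps values out of the SQL string, which is optional for the fixed queries in this notebook but useful once values come from Python variables.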

# SUBQUERIES

A __SUBQUERY__ is a query nested inside another query. The __SUBQUERY__ is written inside parentheses and is executed first; the outer query then uses its result. For example, suppose you wanted to find every team that had the same number of players as Wolfsburg did during the 2016 season. You would first write a query that returns the number of players (__KaderHome__) on Wolfsburg during the 2016 season. This is the __SUBQUERY__. The outer query then compares the __KaderHome__ value of every team in every season to the value returned by the __SUBQUERY__:

In [8]:
# Return every team and season with the same number of players (KaderHome) as Wolfsburg in 2016

cur.execute('''SELECT * 
               FROM Teams
               WHERE KaderHome = (
                   SELECT KaderHome
                   FROM Teams
                   WHERE TeamName ='Wolfsburg' AND Season = 2016);''')
df = pd.DataFrame(cur.fetchall())
df.columns = [x[0] for x in cur.description]
df

Unnamed: 0,Season,TeamName,KaderHome,AvgAgeHome,ForeignPlayersHome,OverallMarketValueHome,AvgMarketValueHome,StadiumCapacity
0,2016,Wolfsburg,43,24,21,225350000,5240000,30000
1,2013,Hoffenheim,43,23,24,77080000,1790000,30164
2,2010,Karlsruhe,43,24,12,18930000,440000,47728
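
One way to convince yourself that the __SUBQUERY__ simply supplies a single value to the outer query is to run the two steps separately and compare. The sketch below does that on an in-memory database; the table is a cut-down, illustrative copy of __Teams__ with only three columns, and `demo` is a hypothetical connection name.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE Teams (Season INTEGER, TeamName TEXT, KaderHome INTEGER)")
demo.executemany(
    "INSERT INTO Teams VALUES (?, ?, ?)",
    [
        (2016, "Wolfsburg", 43),
        (2013, "Hoffenheim", 43),
        (2010, "Karlsruhe", 43),
        (2017, "Bayern Munich", 27),
    ],
)

# Step 1: run the inner query on its own to get Wolfsburg's 2016 squad size
inner_value = demo.execute(
    "SELECT KaderHome FROM Teams WHERE TeamName = 'Wolfsburg' AND Season = 2016;"
).fetchone()[0]

# Step 2: feed that value to the outer query as a parameter
two_step = demo.execute(
    "SELECT TeamName FROM Teams WHERE KaderHome = ? ORDER BY TeamName;",
    (inner_value,),
).fetchall()

# The same result from a single statement with a subquery
nested = demo.execute(
    """SELECT TeamName
       FROM Teams
       WHERE KaderHome = (
           SELECT KaderHome
           FROM Teams
           WHERE TeamName = 'Wolfsburg' AND Season = 2016)
       ORDER BY TeamName;"""
).fetchall()

print(inner_value)
print(two_step == nested)
```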


__SUBQUERIES__ can also be used with operators other than the equality sign. The query below returns data for every team whose __StadiumCapacity__ was higher than __Stuttgart's__ __StadiumCapacity__ during the __2010 Season__:

In [31]:
cur.execute('''SELECT * 
               FROM Teams
               WHERE  Season = 2010 AND StadiumCapacity > (
                   SELECT StadiumCapacity
                   FROM Teams
                   WHERE TeamName ='Stuttgart' AND Season = 2010);''')
df = pd.DataFrame(cur.fetchall())
df.columns = [x[0] for x in cur.description]
df

Unnamed: 0,Season,TeamName,KaderHome,AvgAgeHome,ForeignPlayersHome,OverallMarketValueHome,AvgMarketValueHome,StadiumCapacity
0,2010,Bayern Munich,31,25,17,284500000,9180000,75000
1,2010,Schalke 04,36,24,23,139650000,3880000,62271
2,2010,Dortmund,31,24,16,114650000,3700000,81359
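
A comparison against a __SUBQUERY__ depends on the inner query finding a row. In SQLite, a scalar subquery that matches nothing evaluates to NULL, and a comparison with NULL is never true, so a misspelled team name produces an empty result rather than an error. The sketch below illustrates this on an in-memory table; the rows are illustrative values assigned to the 2010 season for the demo, and `teams_larger_than` is a hypothetical helper.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE Teams (Season INTEGER, TeamName TEXT, StadiumCapacity INTEGER)")
demo.executemany(
    "INSERT INTO Teams VALUES (?, ?, ?)",
    [
        (2010, "Bayern Munich", 75000),
        (2010, "Schalke 04", 62271),
        (2010, "Dortmund", 81359),
        (2010, "Stuttgart", 60449),
        (2010, "Mainz", 34000),
    ],
)

def teams_larger_than(team_name):
    """Return 2010 teams whose stadium is larger than the named team's stadium."""
    rows = demo.execute(
        """SELECT TeamName
           FROM Teams
           WHERE Season = 2010 AND StadiumCapacity > (
               SELECT StadiumCapacity
               FROM Teams
               WHERE TeamName = ? AND Season = 2010)
           ORDER BY TeamName;""",
        (team_name,),
    ).fetchall()
    return [row[0] for row in rows]

correct = teams_larger_than("Stuttgart")
misspelled = teams_larger_than("Stuttgartt")  # inner query finds no row, so it yields NULL
print(correct)
print(misspelled)
```

If a query like this unexpectedly returns no rows, running the inner query by itself is a quick first check.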


We can also use the aggregate functions discussed in previous blogs within __SUBQUERIES__. The query below shows all the teams from the __2014 Season__ whose number of foreign players (__ForeignPlayersHome__) is higher than the average number of foreign players across all teams during the 2014 season.

In [35]:
cur.execute('''SELECT * 
               FROM Teams
               WHERE Season = 2014 AND ForeignPlayersHome > (
                   SELECT AVG(ForeignPlayersHome)
                   FROM Teams
                   WHERE Season = 2014);''')
df = pd.DataFrame(cur.fetchall())
df.columns = [x[0] for x in cur.description]
df

Unnamed: 0,Season,TeamName,KaderHome,AvgAgeHome,ForeignPlayersHome,OverallMarketValueHome,AvgMarketValueHome,StadiumCapacity
0,2014,Bayern Munich,33,25,17,564180000,17100000,75000
1,2014,Dortmund,37,24,17,329800000,8910000,81359
2,2014,Schalke 04,36,23,15,213380000,5930000,62271
3,2014,Wolfsburg,30,25,18,176130000,5870000,30000
4,2014,Leverkusen,26,24,14,166550000,6410000,30210
5,2014,Hamburg,38,24,21,114950000,3030000,57376
6,2014,M'gladbach,27,25,14,107330000,3980000,54014
7,2014,Stuttgart,30,24,14,97900000,3260000,60449
8,2014,Hannover,35,25,16,78050000,2230000,49200
9,2014,Mainz,36,24,19,76900000,2140000,34000
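
The same pattern can be checked on a small example by computing the average separately. The sketch below uses an in-memory table with a handful of illustrative rows, so the average here is not the real 2014 league average. Note that the `Season = 2014` filter appears in both the inner and the outer query: the inner one decides which rows are averaged, the outer one decides which rows are compared.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE Teams (Season INTEGER, TeamName TEXT, ForeignPlayersHome INTEGER)")
demo.executemany(
    "INSERT INTO Teams VALUES (?, ?, ?)",
    [
        (2014, "Hamburg", 21),
        (2014, "Mainz", 19),
        (2014, "Leverkusen", 14),
        (2014, "Stuttgart", 14),
        (2013, "Hoffenheim", 24),  # different season, ignored by both filters
    ],
)

# The inner query on its own: AVG returns a floating-point value
season_avg = demo.execute(
    "SELECT AVG(ForeignPlayersHome) FROM Teams WHERE Season = 2014;"
).fetchone()[0]

# Teams from 2014 with more foreign players than that average
above_avg = demo.execute(
    """SELECT TeamName, ForeignPlayersHome
       FROM Teams
       WHERE Season = 2014 AND ForeignPlayersHome > (
           SELECT AVG(ForeignPlayersHome)
           FROM Teams
           WHERE Season = 2014)
       ORDER BY TeamName;"""
).fetchall()

print(season_avg)
print(above_avg)
```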


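The previous queries all included __SUBQUERIES__ that returned a single value. __SUBQUERIES__ can also return more than a single value, in which case the outer query compares against them with the __IN__ operator. To see how __IN__ works, first look at the query below, which uses a hard-coded list of values rather than a __SUBQUERY__.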

In [53]:
cur.execute('''SELECT * 
               FROM Teams
               WHERE Season = 2014 AND KaderHome IN (36, 38, 40);''')
df = pd.DataFrame(cur.fetchall())
df.columns = [x[0] for x in cur.description]
df

Unnamed: 0,Season,TeamName,KaderHome,AvgAgeHome,ForeignPlayersHome,OverallMarketValueHome,AvgMarketValueHome,StadiumCapacity
0,2014,Schalke 04,36,23,15,213380000,5930000,62271
1,2014,Hamburg,38,24,21,114950000,3030000,57376
2,2014,Mainz,36,24,19,76900000,2140000,34000
3,2014,Werder Bremen,36,23,16,58580000,1630000,42100
4,2014,Greuther Furth,40,23,17,25150000,629000,47728
5,2014,Munich 1860,36,24,18,22450000,624000,47728
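
When the list of values lives in a Python variable rather than in the SQL text, one common approach is to generate one `?` placeholder per value. The sketch below is an illustration on an in-memory table with a few sample rows; `kader_sizes` and `demo` are hypothetical names.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE Teams (Season INTEGER, TeamName TEXT, KaderHome INTEGER)")
demo.executemany(
    "INSERT INTO Teams VALUES (?, ?, ?)",
    [
        (2014, "Schalke 04", 36),
        (2014, "Hamburg", 38),
        (2014, "Greuther Furth", 40),
        (2014, "Bayern Munich", 33),
        (2014, "Wolfsburg", 30),
    ],
)

kader_sizes = [36, 38, 40]

# Build "?, ?, ?" so the IN list has one placeholder per value
placeholders = ", ".join("?" for _ in kader_sizes)
sql = (
    "SELECT TeamName, KaderHome FROM Teams "
    f"WHERE Season = ? AND KaderHome IN ({placeholders}) "
    "ORDER BY TeamName;"
)

matching = demo.execute(sql, [2014, *kader_sizes]).fetchall()
print(sql)
print(matching)
```

Only the placeholders are formatted into the SQL string; the values themselves are still passed separately to `execute`.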


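Not every subquery operator is available in SQLite. The first query below tries to use __ALL__ to find matches in which the home team scored more goals (__FTHG__) than every away-goal total (__FTAG__) in the __Matches__ table, and it fails with a syntax error because SQLite does not support __ALL__. The second query uses __IN__ with a __SUBQUERY__: the __SUBQUERY__ returns every __HomeTeam__ from the __Matches__ table whose name begins with "D", and the outer query checks every __TeamName__ from __2013__ in the __Teams__ table against that list. If a __TeamName__ matches any of the __HomeTeam__ values returned by the __SUBQUERY__, that team's row is returned.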

In [106]:
cur.execute('''SELECT * 
               FROM Matches
               WHERE FTHG > ALL (
                   SELECT FTAG
                   FROM Matches);''')
df = pd.DataFrame(cur.fetchall())
df.columns = [x[0] for x in cur.description]
df

OperationalError: near "ALL": syntax error
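
Since the __ALL__ form is rejected, one possible workaround is to compare against `MAX(FTAG)` in a scalar __SUBQUERY__. The two forms should agree as long as the inner query returns at least one row and contains no NULLs; with an empty inner query, __ALL__ would be true for every row while the `MAX` version returns nothing. The sketch below runs both on an in-memory table holding a few rows in the shape of __Matches__; `demo` is a hypothetical connection name.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.execute(
    "CREATE TABLE Matches (Match_ID INTEGER, HomeTeam TEXT, AwayTeam TEXT, FTHG INTEGER, FTAG INTEGER)"
)
demo.executemany(
    "INSERT INTO Matches VALUES (?, ?, ?, ?, ?)",
    [
        (1, "Oberhausen", "Kaiserslautern", 2, 1),
        (2, "Munich 1860", "Kaiserslautern", 0, 1),
        (7, "Paderborn", "Karlsruhe", 2, 0),
        (8, "Bielefeld", "Karlsruhe", 0, 1),
    ],
)

# The ALL form: expected to raise the same syntax error shown above
try:
    demo.execute("SELECT * FROM Matches WHERE FTHG > ALL (SELECT FTAG FROM Matches);")
    all_supported = True
except sqlite3.OperationalError as err:
    all_supported = False
    print("ALL failed:", err)

# Workaround: "greater than every FTAG" is "greater than the largest FTAG"
high_scoring = demo.execute(
    """SELECT HomeTeam, FTHG
       FROM Matches
       WHERE FTHG > (
           SELECT MAX(FTAG)
           FROM Matches)
       ORDER BY Match_ID;"""
).fetchall()
print(high_scoring)
```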

In [None]:
cur.execute('''SELECT * 
               FROM Teams
               WHERE Season = 2013 AND TeamName IN (
                    SELECT HomeTeam
                    FROM Matches
                    WHERE HomeTeam LIKE 'D%');''')
df = pd.DataFrame(cur.fetchall())
df.columns = [x[0] for x in cur.description]
df
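
A detail worth noting about __IN__ with a __SUBQUERY__: the inner query may return the same value many times (a team hosts many matches), yet each outer row is still returned at most once. The sketch below shows this on made-up in-memory tables and contrasts it with a plain __JOIN__, which would repeat the team once per matching match. All rows and the `demo` name are illustrative assumptions.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE Teams (Season INTEGER, TeamName TEXT)")
demo.execute("CREATE TABLE Matches (Match_ID INTEGER, HomeTeam TEXT)")
demo.executemany(
    "INSERT INTO Teams VALUES (?, ?)",
    [(2013, "Dortmund"), (2013, "Hoffenheim")],
)
demo.executemany(
    "INSERT INTO Matches VALUES (?, ?)",
    [(1, "Dortmund"), (2, "Dortmund"), (3, "Darmstadt"), (4, "Hoffenheim")],
)

# IN with a subquery: Dortmund appears once, although it hosts two matches
in_rows = demo.execute(
    """SELECT TeamName
       FROM Teams
       WHERE Season = 2013 AND TeamName IN (
           SELECT HomeTeam
           FROM Matches
           WHERE HomeTeam LIKE 'D%');"""
).fetchall()

# A JOIN on the same condition returns one row per matching match
join_count = demo.execute(
    """SELECT COUNT(*)
       FROM Teams T
       JOIN Matches M ON T.TeamName = M.HomeTeam
       WHERE T.Season = 2013 AND M.HomeTeam LIKE 'D%';"""
).fetchone()[0]

print(in_rows)
print(join_count)
```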

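In [None]:
# Generic form of the same IN pattern, kept as a comment because the
# trip and city tables are not part of this database:
# SELECT *
# FROM trip
# WHERE city_id IN (
#     SELECT id
#     FROM city
#     WHERE area > 100);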

In [70]:
# Return the home teams that average more than 1.8 home goals (FTHG) per match

cur.execute('''SELECT HomeTeam
                FROM Matches
                GROUP BY HomeTeam
                HAVING AVG(FTHG) > 1.8;''')
df = pd.DataFrame(cur.fetchall())
df.columns = [x[0] for x in cur.description]
df

Unnamed: 0,HomeTeam
0,Arsenal
1,Bayern Munich
2,Chelsea
3,Dortmund
4,Leverkusen
5,Liverpool
6,M'Gladbach
7,Man United
8,Middlesboro
9,Reutlingen


In [98]:
# Preview the Matches table

cur.execute('''SELECT *
                FROM Matches;''')
df = pd.DataFrame(cur.fetchall())
df.columns = [x[0] for x in cur.description]
df

Unnamed: 0,Match_ID,Div,Season,Date,HomeTeam,AwayTeam,FTHG,FTAG,FTR
0,1,D2,2009,2010-04-04,Oberhausen,Kaiserslautern,2,1,H
1,2,D2,2009,2009-11-01,Munich 1860,Kaiserslautern,0,1,A
2,3,D2,2009,2009-10-04,Frankfurt FSV,Kaiserslautern,1,1,D
3,4,D2,2009,2010-02-21,Frankfurt FSV,Karlsruhe,2,1,H
4,5,D2,2009,2009-12-06,Ahlen,Karlsruhe,1,3,A
5,6,D2,2009,2010-04-03,Union Berlin,Karlsruhe,1,1,D
6,7,D2,2009,2009-08-14,Paderborn,Karlsruhe,2,0,H
7,8,D2,2009,2010-03-08,Bielefeld,Karlsruhe,0,1,A
8,9,D2,2009,2009-09-26,Kaiserslautern,Karlsruhe,2,0,H
9,10,D2,2009,2009-11-21,Hansa Rostock,Karlsruhe,2,1,H


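# SELF JOINS

A __SELF JOIN__ is similar to a __JOIN__, but instead of joining one table to another you join a table to itself, giving each copy its own alias (__T1__ and __T2__ below). The three queries below build one up step by step. The first has no __ON__ clause, so it pairs every row of __Teams_in_Matches__ with every row of the same table, including itself. The second keeps only the pairs of rows that share a __Match_ID__. The third also requires the two rows to have different __Unique_Team_ID__ values, so a team is never paired with itself.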

In [None]:
# Pair every row of Teams_in_Matches with every row of the same table (no ON clause)

cur.execute('''SELECT *
               FROM Teams_in_Matches T1
               JOIN Teams_in_Matches T2;''')
df = pd.DataFrame(cur.fetchall())
df.columns = [x[0] for x in cur.description]
df

In [None]:
# Pair only the rows that belong to the same match

cur.execute('''SELECT *
               FROM Teams_in_Matches T1
               JOIN Teams_in_Matches T2
               ON T1.Match_ID = T2.Match_ID;''')
df = pd.DataFrame(cur.fetchall())
df.columns = [x[0] for x in cur.description]
df

In [None]:
# Pair rows from the same match that belong to different teams

cur.execute('''SELECT T1.Match_ID,
                      T1.Unique_Team_ID
               FROM Teams_in_Matches T1
               JOIN Teams_in_Matches T2
               ON T1.Match_ID = T2.Match_ID AND T1.Unique_Team_ID != T2.Unique_Team_ID;''')
df = pd.DataFrame(cur.fetchall())
df.columns = [x[0] for x in cur.description]
df
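
To see what each step of the __SELF JOIN__ does to the row count, the sketch below repeats the three joins on a tiny in-memory table. It assumes __Teams_in_Matches__ has one row per team per match with the columns __Match_ID__ and __Unique_Team_ID__ used above; the four rows, `demo`, and `count_rows` are made up for illustration.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE Teams_in_Matches (Match_ID INTEGER, Unique_Team_ID INTEGER)")
demo.executemany(
    "INSERT INTO Teams_in_Matches VALUES (?, ?)",
    [(1, 10), (1, 20), (2, 10), (2, 30)],  # two matches, two teams each
)

def count_rows(join_condition):
    """Count the rows produced by joining the table to itself under a condition."""
    sql = (
        "SELECT COUNT(*) FROM Teams_in_Matches T1 "
        f"JOIN Teams_in_Matches T2 {join_condition};"
    )
    return demo.execute(sql).fetchone()[0]

# No ON clause: every row paired with every row (4 x 4)
no_condition = count_rows("")
# Same match only: 2 x 2 pairs per match
same_match = count_rows("ON T1.Match_ID = T2.Match_ID")
# Same match, different teams: each pairing appears twice (A-B and B-A)
different_team = count_rows(
    "ON T1.Match_ID = T2.Match_ID AND T1.Unique_Team_ID != T2.Unique_Team_ID"
)

# Using < instead of != keeps only one ordering of each pairing
pairs = demo.execute(
    """SELECT T1.Match_ID, T1.Unique_Team_ID, T2.Unique_Team_ID
       FROM Teams_in_Matches T1
       JOIN Teams_in_Matches T2
       ON T1.Match_ID = T2.Match_ID AND T1.Unique_Team_ID < T2.Unique_Team_ID
       ORDER BY T1.Match_ID;"""
).fetchall()

print(no_condition, same_match, different_team)
print(pairs)
```

Whether `!=` or `<` is the better choice depends on whether you want both orderings of each pairing or just one row per match.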

In [6]:
# Preview the Teams table

cur.execute('''SELECT * 
               FROM Teams;''')
df = pd.DataFrame(cur.fetchall())
df.columns = [x[0] for x in cur.description]
df

Unnamed: 0,Season,TeamName,KaderHome,AvgAgeHome,ForeignPlayersHome,OverallMarketValueHome,AvgMarketValueHome,StadiumCapacity
0,2017,Bayern Munich,27,26,15,597950000,22150000,75000
1,2017,Dortmund,33,25,18,416730000,12630000,81359
2,2017,Leverkusen,31,24,15,222600000,7180000,30210
3,2017,RB Leipzig,30,23,15,180130000,6000000,42959
4,2017,Schalke 04,29,24,17,179550000,6190000,62271
5,2017,M'gladbach,31,25,17,154400000,4980000,54014
6,2017,Wolfsburg,31,24,14,124430000,4010000,30000
7,2017,FC Koln,24,26,9,118550000,4940000,49968
8,2017,Hoffenheim,31,24,14,107330000,3460000,30164
9,2017,Hertha,26,26,12,86800000,3340000,74475
