# Introduction to SQL Using Python: Filtering Data with the WHERE Statement

Structured Query Language, more popularly known as SQL, is a powerful language for working with relational databases. It allows you to access and manipulate data stored in a SQL database. This tutorial will show you how to access the power of SQL using Python and how data can be filtered using the WHERE statement.

The first step is to import the necessary libraries. You will need to import sqlite3, which will allow you to connect to a SQL database and run SQL queries. You will also need to import pandas, which will allow you to view the queried results in a clean, easy-to-read data frame format.

In [1]:
# Import necessary libraries

import sqlite3
import pandas as pd

For this tutorial, the Football Delphi database will be used. You can download the database from https://www.kaggle.com/laudanum/footballdelphi. Take some time to read the description to learn more about the database and the tables we will be using.

Next, you will need to connect to the database and create a cursor object. Later we will call the cursor's execute method to run SQL queries.

In [2]:
# Connect to database
conn = sqlite3.connect('''database.sqlite''')

# Create cursor object
cur = conn.cursor()

You should now be connected to the Football Delphi database. This is a relational database. A relational database includes tables. Each table is its own unique dataset that stores information. Each table is organized by columns and rows. This database includes four tables:
    - Unique_Teams
    - Matches
    - Teams
    - Teams_in_Matches
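
To check which tables a connection can actually see, you can query SQLite's built-in catalog table, `sqlite_master`. The sketch below is only an illustration: it uses an in-memory database with two empty stand-in tables whose column definitions are made up for the demo, not the real Football Delphi file. It also uses a __WHERE__ filter, which is covered later in this tutorial.

```python
import sqlite3

# In-memory stand-in database; the tutorial itself connects to 'database.sqlite'.
demo_conn = sqlite3.connect(":memory:")
demo_cur = demo_conn.cursor()

# Hypothetical, simplified table definitions used only for this demo.
demo_cur.execute("CREATE TABLE Unique_Teams (TeamName TEXT, Unique_Team_ID INTEGER);")
demo_cur.execute("CREATE TABLE Matches (Match_ID INTEGER, HomeTeam TEXT);")

# sqlite_master lists the tables, indexes, and views stored in the database.
demo_cur.execute("""SELECT name
                    FROM sqlite_master
                    WHERE type = 'table';""")
table_names = sorted(row[0] for row in demo_cur.fetchall())
print(table_names)

demo_conn.close()
```

Running the same query against your own connection is a quick sanity check. `sqlite3.connect` creates a new, empty database when the file path does not exist, so an empty list of tables usually means the path is wrong.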

To run a SQL query and view the results in an easy-to-read data frame, we will use the following code:

In [None]:
cur.execute('''Enter SQL query here;''') # Runs SQL query
data = pd.DataFrame(cur.fetchall()) # Converts SQL query results into dataframe format
data.columns = [x[0] for x in cur.description] # Labels the columns of the dataframe
data # View SQL results dataframe
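
To make the template above less opaque, here is a small sketch of what `cur.fetchall()` and `cur.description` contain after a query has run. It assumes a tiny in-memory table named `Demo_Teams` (a hypothetical name) with two rows copied by hand for illustration, rather than the real database.

```python
import sqlite3

# Tiny in-memory table used only to inspect what the cursor returns.
demo_conn = sqlite3.connect(":memory:")
demo_cur = demo_conn.cursor()
demo_cur.execute("CREATE TABLE Demo_Teams (TeamName TEXT, Unique_Team_ID INTEGER);")
demo_cur.executemany("INSERT INTO Demo_Teams VALUES (?, ?);",
                     [("Bayern Munich", 1), ("Dortmund", 2)])

demo_cur.execute("SELECT * FROM Demo_Teams;")

# fetchall() returns the result rows as a list of tuples.
rows = demo_cur.fetchall()

# description holds one 7-item tuple per result column; item 0 is the column name.
column_names = [col[0] for col in demo_cur.description]

print(rows)
print(column_names)
demo_conn.close()
```

The list of tuples becomes the body of the data frame, and the names taken from `description` become its column labels.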

For the first query, we will preview the contents of the table, __Unique_Teams__. To do this we can run the following query:

In [3]:
# View Unique_Teams dataframe

cur.execute('''SELECT * 
               FROM Unique_Teams;''')
Unique_Teams_df = pd.DataFrame(cur.fetchall())
Unique_Teams_df.columns = [x[0] for x in cur.description]
Unique_Teams_df

Unnamed: 0,TeamName,Unique_Team_ID
0,Bayern Munich,1
1,Dortmund,2
2,Leverkusen,3
3,RB Leipzig,4
4,Schalke 04,5
5,M'gladbach,6
6,Wolfsburg,7
7,FC Koln,8
8,Hoffenheim,9
9,Hertha,10


The __SELECT__ statement tells the database what columns we are trying to pull from the dataset. The asterisk, (*), tells the database that we want to select all the columns available in the table.
The __FROM__ statement tells the database that we want to select data from the table, __Unique_Teams__.
The SQL query is ended by a semicolon, (;). This tells the database that you have ended your query similar to the way a period ends a sentence.

You can practice this on your own by trying to query all the contents of the __Matches__ table and comparing your query to the answer below.

In [4]:
# View Matches dataframe

cur.execute('''SELECT * 
               FROM Matches;''')
Matches_df = pd.DataFrame(cur.fetchall())
Matches_df.columns = [x[0] for x in cur.description]
Matches_df

Unnamed: 0,Match_ID,Div,Season,Date,HomeTeam,AwayTeam,FTHG,FTAG,FTR
0,1,D2,2009,2010-04-04,Oberhausen,Kaiserslautern,2,1,H
1,2,D2,2009,2009-11-01,Munich 1860,Kaiserslautern,0,1,A
2,3,D2,2009,2009-10-04,Frankfurt FSV,Kaiserslautern,1,1,D
3,4,D2,2009,2010-02-21,Frankfurt FSV,Karlsruhe,2,1,H
4,5,D2,2009,2009-12-06,Ahlen,Karlsruhe,1,3,A
5,6,D2,2009,2010-04-03,Union Berlin,Karlsruhe,1,1,D
6,7,D2,2009,2009-08-14,Paderborn,Karlsruhe,2,0,H
7,8,D2,2009,2010-03-08,Bielefeld,Karlsruhe,0,1,A
8,9,D2,2009,2009-09-26,Kaiserslautern,Karlsruhe,2,0,H
9,10,D2,2009,2009-11-21,Hansa Rostock,Karlsruhe,2,1,H


This is the table we will use for the rest of the tutorial. There are 9 columns in the __Matches__ table. The Kaggle page this database was downloaded from describes each column as follows:

 - Match_ID (int): unique ID per match
 - Div (str): identifies the division the match was played in (D1 = Bundesliga, D2 = Bundesliga 2, E0 = English Premier League)
 - Season (int): Season the match took place in (usually covering the period of August till May of the following year)
 - Date (str): Date of the match
 - HomeTeam (str): Name of the home team
 - AwayTeam (str): Name of the away team
 - FTHG (int) (Full Time Home Goals): Number of goals scored by the home team
 - FTAG (int) (Full Time Away Goals): Number of goals scored by the away team
 - FTR (str) (Full Time Result): 3-way result of the match (H = Home Win, D = Draw, A = Away Win)
 
Perhaps we only want to select the column __Match_ID__. To do this, we write __Match_ID__ in our select statement instead of the asterisk. The code below shows how this is done:

In [5]:
# View Match_ID from Matches

cur.execute('''SELECT Match_ID
               FROM Matches;''')
Matches_df = pd.DataFrame(cur.fetchall())
Matches_df.columns = [x[0] for x in cur.description]
Matches_df

Unnamed: 0,Match_ID
0,1
1,2
2,3
3,4
4,5
5,6
6,7
7,8
8,9
9,10


Try selecting only the column __HomeTeam__ from the __Matches__ table and compare your code to the query below:

In [6]:
# Select HomeTeam from the Matches table

cur.execute('''SELECT HomeTeam
               FROM Matches;''')
Matches_df = pd.DataFrame(cur.fetchall())
Matches_df.columns = [x[0] for x in cur.description]
Matches_df

Unnamed: 0,HomeTeam
0,Oberhausen
1,Munich 1860
2,Frankfurt FSV
3,Frankfurt FSV
4,Ahlen
5,Union Berlin
6,Paderborn
7,Bielefeld
8,Kaiserslautern
9,Hansa Rostock


You can select multiple columns by typing the name of each column in the select statement and separating each column name by a comma, (,). You can see the names of both the Home Teams and Away Teams in each match below:

In [7]:
# View HomeTeam and AwayTeam from Matches

cur.execute('''SELECT HomeTeam,
                      AwayTeam
               FROM Matches;''')
Matches_df = pd.DataFrame(cur.fetchall())
Matches_df.columns = [x[0] for x in cur.description]
Matches_df

Unnamed: 0,HomeTeam,AwayTeam
0,Oberhausen,Kaiserslautern
1,Munich 1860,Kaiserslautern
2,Frankfurt FSV,Kaiserslautern
3,Frankfurt FSV,Karlsruhe
4,Ahlen,Karlsruhe
5,Union Berlin,Karlsruhe
6,Paderborn,Karlsruhe
7,Bielefeld,Karlsruhe
8,Kaiserslautern,Karlsruhe
9,Hansa Rostock,Karlsruhe


Practice selecting multiple columns by selecting __Match_ID__ and __Date__ from the __Matches__ table and compare your query to the code below:

In [9]:
# View Match_ID and Date from Matches

cur.execute('''SELECT Match_ID,
                      Date
               FROM Matches;''')
Matches_df = pd.DataFrame(cur.fetchall())
Matches_df.columns = [x[0] for x in cur.description]
Matches_df

Unnamed: 0,Match_ID,Date
0,1,2010-04-04
1,2,2009-11-01
2,3,2009-10-04
3,4,2010-02-21
4,5,2009-12-06
5,6,2010-04-03
6,7,2009-08-14
7,8,2010-03-08
8,9,2009-09-26
9,10,2009-11-21


# The WHERE Statement

There are 24,625 rows in the Matches table. This makes it difficult to find specific information. Perhaps we only want information from the 2015 season. To do this, we can add a __WHERE__ statement to our SQL query. The __WHERE__ statement allows us to select information that meets a certain condition. To query only the information pertaining to the 2015 season from the __Matches__ table, we can use the query below:

In [10]:
# View all columns from Matches from the 2015 season

cur.execute('''SELECT *
               FROM Matches
               WHERE Season = 2015;''')
Matches_df = pd.DataFrame(cur.fetchall())
Matches_df.columns = [x[0] for x in cur.description]
Matches_df

Unnamed: 0,Match_ID,Div,Season,Date,HomeTeam,AwayTeam,FTHG,FTAG,FTR
0,3540,D1,2015,2015-10-17,Werder Bremen,Bayern Munich,0,1,A
1,3541,D1,2015,2016-03-05,Dortmund,Bayern Munich,0,0,D
2,3542,D1,2015,2016-03-19,FC Koln,Bayern Munich,0,1,A
3,3543,D1,2015,2016-01-22,Hamburg,Bayern Munich,1,2,A
4,3544,D1,2015,2015-12-19,Hannover,Bayern Munich,0,1,A
5,3545,D1,2015,2016-02-27,Wolfsburg,Bayern Munich,0,2,A
6,3546,D1,2015,2016-04-09,Stuttgart,Bayern Munich,1,3,A
7,3547,D1,2015,2015-09-26,Mainz,Bayern Munich,0,3,A
8,3548,D1,2015,2016-02-06,Leverkusen,Bayern Munich,0,0,D
9,3549,D1,2015,2016-02-14,Augsburg,Bayern Munich,1,3,A


Now we only have 992 rows, and each row is from a 2015 season match. This lets us see the data from the 2015 season much more easily than scrolling through the entire table as queried before. Just as the __SELECT__ statement allows us to return specified columns from the data table, the __WHERE__ statement allows us to return specific rows from the dataset that meet a certain condition, like Season = 2015.

To practice using the __WHERE__ statement, write a query that selects all the columns from the __Matches__ table where the Full Time Home Goals (__FTHG__) is equal to 5 and compare your query to the query below:

In [11]:
# View all columns from Matches where FTHG = 5

cur.execute('''SELECT *
               FROM Matches
               WHERE FTHG = 5;''')
Matches_df = pd.DataFrame(cur.fetchall())
Matches_df.columns = [x[0] for x in cur.description]
Matches_df

Unnamed: 0,Match_ID,Div,Season,Date,HomeTeam,AwayTeam,FTHG,FTAG,FTR
0,125,D1,2009,2009-12-12,M'gladbach,Hannover,5,3,H
1,143,D2,2009,2009-09-20,Augsburg,Hansa Rostock,5,2,H
2,167,D1,2009,2009-12-19,Bayern Munich,Hertha,5,2,H
3,170,D1,2009,2009-09-27,Hoffenheim,Hertha,5,1,H
4,280,D2,2009,2009-09-18,Paderborn,Cottbus,5,1,H
5,320,D2,2009,2010-03-14,St Pauli,Oberhausen,5,3,H
6,379,D2,2009,2009-09-13,Union Berlin,Paderborn,5,4,H
7,449,D2,2009,2010-01-17,Duisburg,Frankfurt FSV,5,0,H
8,562,D1,2010,2011-04-10,M'gladbach,FC Koln,5,1,H
9,601,D1,2010,2010-09-22,Dortmund,Kaiserslautern,5,0,H


So far we have only used numerical data in the __WHERE__ statement. In order to use string data in the __WHERE__ statement, we need to wrap quotation marks around the data we are using as a condition. To query all the data where Arsenal is the Home Team we can use the query below:

In [12]:
# View all data from Matches where Arsenal is the Home Team

cur.execute('''SELECT *
               FROM Matches
               WHERE HomeTeam = 'Arsenal';''')
Matches_df = pd.DataFrame(cur.fetchall())
Matches_df.columns = [x[0] for x in cur.description]
Matches_df

Unnamed: 0,Match_ID,Div,Season,Date,HomeTeam,AwayTeam,FTHG,FTAG,FTR
0,37491,E0,1993,1993-08-14,Arsenal,Coventry,0,3,A
1,37525,E0,1993,1993-08-24,Arsenal,Leeds,2,1,H
2,37536,E0,1993,1993-08-28,Arsenal,Everton,2,0,H
3,37557,E0,1993,1993-09-11,Arsenal,Ipswich,4,0,H
4,37579,E0,1993,1993-09-25,Arsenal,Southampton,1,0,H
5,37601,E0,1993,1993-10-16,Arsenal,Man City,0,0,D
6,37623,E0,1993,1993-10-30,Arsenal,Norwich,0,0,D
7,37634,E0,1993,1993-11-06,Arsenal,Aston Villa,1,2,A
8,37666,E0,1993,1993-11-27,Arsenal,Newcastle,2,1,H
9,37688,E0,1993,1993-12-06,Arsenal,Tottenham,1,1,D
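
When the value in a condition comes from a Python variable, the sqlite3 module lets you pass it separately through a `?` placeholder instead of typing it into the query string. The sketch below assumes a small in-memory table named `Demo_Matches` (a hypothetical stand-in for __Matches__ with only three columns and three hand-written rows).

```python
import sqlite3

# In-memory stand-in for the Matches table, reduced to three columns.
demo_conn = sqlite3.connect(":memory:")
demo_cur = demo_conn.cursor()
demo_cur.execute("CREATE TABLE Demo_Matches (HomeTeam TEXT, AwayTeam TEXT, FTR TEXT);")
demo_cur.executemany("INSERT INTO Demo_Matches VALUES (?, ?, ?);", [
    ("Arsenal", "Leeds", "H"),
    ("M'gladbach", "Hannover", "H"),
    ("Chelsea", "Arsenal", "A"),
])

# The team name contains an apostrophe, which would break a hand-quoted string.
team = "M'gladbach"

# The ? placeholder is filled from the tuple passed as the second argument.
demo_cur.execute("""SELECT *
                    FROM Demo_Matches
                    WHERE HomeTeam = ?;""", (team,))
home_rows = demo_cur.fetchall()
print(home_rows)
demo_conn.close()
```

With a placeholder, sqlite3 takes care of the quoting, so team names that contain an apostrophe can be used without any extra escaping.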


To practice using string data as a condition in the __WHERE__ statement, select all the columns for the matches that were played in the D2 division and compare your query to the one below:

In [13]:
# View all data from Matches where the division is D2

cur.execute('''SELECT *
               FROM Matches
               WHERE Div = 'D2';''')
Matches_df = pd.DataFrame(cur.fetchall())
Matches_df.columns = [x[0] for x in cur.description]
Matches_df

Unnamed: 0,Match_ID,Div,Season,Date,HomeTeam,AwayTeam,FTHG,FTAG,FTR
0,1,D2,2009,2010-04-04,Oberhausen,Kaiserslautern,2,1,H
1,2,D2,2009,2009-11-01,Munich 1860,Kaiserslautern,0,1,A
2,3,D2,2009,2009-10-04,Frankfurt FSV,Kaiserslautern,1,1,D
3,4,D2,2009,2010-02-21,Frankfurt FSV,Karlsruhe,2,1,H
4,5,D2,2009,2009-12-06,Ahlen,Karlsruhe,1,3,A
5,6,D2,2009,2010-04-03,Union Berlin,Karlsruhe,1,1,D
6,7,D2,2009,2009-08-14,Paderborn,Karlsruhe,2,0,H
7,8,D2,2009,2010-03-08,Bielefeld,Karlsruhe,0,1,A
8,9,D2,2009,2009-09-26,Kaiserslautern,Karlsruhe,2,0,H
9,10,D2,2009,2009-11-21,Hansa Rostock,Karlsruhe,2,1,H


## Comparison Operators

So far, we have only used the equal sign (=) as an operator in the __WHERE__ statement. In addition to the equal sign, you can use other operators, including:
    - > (greater than)
    - < (less than)
    - >= (greater than or equal to)
    - <= (less than or equal to)
    - != (not equal to)

To see all the data from the __Matches__ table where the Away Team scored 3 or more times, we can use the following query:

In [15]:
# View all data from Matches where the Away Team scored 3 or more times

cur.execute('''SELECT *
               FROM Matches
               WHERE FTAG >= 3;''')
Matches_df = pd.DataFrame(cur.fetchall())
Matches_df.columns = [x[0] for x in cur.description]
Matches_df

Unnamed: 0,Match_ID,Div,Season,Date,HomeTeam,AwayTeam,FTHG,FTAG,FTR
0,5,D2,2009,2009-12-06,Ahlen,Karlsruhe,1,3,A
1,11,D2,2009,2009-12-19,Greuther Furth,Karlsruhe,1,4,A
2,14,D2,2009,2009-09-14,Cottbus,Karlsruhe,2,4,A
3,20,D2,2009,2009-08-24,Munich 1860,Karlsruhe,1,3,A
4,29,D1,2009,2009-08-22,Freiburg,Leverkusen,0,5,A
5,34,D1,2009,2009-09-12,Wolfsburg,Leverkusen,2,3,A
6,35,D1,2009,2010-01-24,Hoffenheim,Leverkusen,0,3,A
7,38,D1,2009,2009-11-21,Wolfsburg,Nurnberg,2,3,A
8,45,D1,2009,2010-01-30,Hannover,Nurnberg,1,3,A
9,66,D1,2009,2010-03-06,Ein Frankfurt,Schalke 04,1,4,A


To practice using comparison operators in the __WHERE__ statement, query all the data from the __Matches__ table where the Home Team did not win and compare your query to the one below:

In [16]:
# View all data from Matches where the Home Team did not win

cur.execute('''SELECT *
               FROM Matches
               WHERE FTR != 'H';''')
Matches_df = pd.DataFrame(cur.fetchall())
Matches_df.columns = [x[0] for x in cur.description]
Matches_df

Unnamed: 0,Match_ID,Div,Season,Date,HomeTeam,AwayTeam,FTHG,FTAG,FTR
0,2,D2,2009,2009-11-01,Munich 1860,Kaiserslautern,0,1,A
1,3,D2,2009,2009-10-04,Frankfurt FSV,Kaiserslautern,1,1,D
2,5,D2,2009,2009-12-06,Ahlen,Karlsruhe,1,3,A
3,6,D2,2009,2010-04-03,Union Berlin,Karlsruhe,1,1,D
4,8,D2,2009,2010-03-08,Bielefeld,Karlsruhe,0,1,A
5,11,D2,2009,2009-12-19,Greuther Furth,Karlsruhe,1,4,A
6,12,D2,2009,2010-04-16,Koblenz,Karlsruhe,2,2,D
7,14,D2,2009,2009-09-14,Cottbus,Karlsruhe,2,4,A
8,15,D2,2009,2010-05-02,Duisburg,Karlsruhe,0,1,A
9,18,D2,2009,2009-10-18,Augsburg,Karlsruhe,1,1,D
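
To compare the operators side by side, the sketch below runs each one against the same threshold on a tiny in-memory table. The table name `Demo_Matches` and its five rows are made up for illustration; only the operators themselves come from the list above.

```python
import sqlite3

# Five made-up matches with 0 to 4 away goals.
demo_conn = sqlite3.connect(":memory:")
demo_cur = demo_conn.cursor()
demo_cur.execute("CREATE TABLE Demo_Matches (Match_ID INTEGER, FTAG INTEGER);")
demo_cur.executemany("INSERT INTO Demo_Matches VALUES (?, ?);",
                     [(1, 0), (2, 1), (3, 2), (4, 3), (5, 4)])

matched_ids = {}
for op in [">", "<", ">=", "<=", "!=", "="]:
    # Only the operator is formatted into the query; it comes from the fixed list above.
    demo_cur.execute(f"SELECT Match_ID FROM Demo_Matches WHERE FTAG {op} 2;")
    matched_ids[op] = sorted(row[0] for row in demo_cur.fetchall())
    print(op, matched_ids[op])

demo_conn.close()
```

Note how `>` and `>=` differ only in whether the match with exactly 2 away goals (Match_ID 3) is returned.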


## AND & OR: Using Multiple Conditions in the WHERE Statement 

So far we have put only one condition in the __WHERE__ statement, but it is possible to use more than one. To do this, you can add an __AND__ or __OR__ clause to your SQL query to further filter your results. When using the __AND__ clause in the __WHERE__ statement, the conditions on both sides of the __AND__ clause must be true for an observation to be returned. To see how this works, the query below shows all observations from the __Matches__ table where Man United was the Home Team and the final outcome of the game was a draw.

In [17]:
# View all data from Matches where the Home Team was Man United 
# and the outcome of the game was a draw

cur.execute('''SELECT *
               FROM Matches
               WHERE HomeTeam = 'Man United' 
                   AND FTR = 'D';''')
Matches_df = pd.DataFrame(cur.fetchall())
Matches_df.columns = [x[0] for x in cur.description]
Matches_df

Unnamed: 0,Match_ID,Div,Season,Date,HomeTeam,AwayTeam,FTHG,FTAG,FTR
0,37518,E0,1993,1993-08-21,Man United,Newcastle,1,1,D
1,37660,E0,1993,1993-11-24,Man United,Ipswich,0,0,D
2,37681,E0,1993,1993-12-04,Man United,Norwich,2,2,D
3,37718,E0,1993,1993-12-26,Man United,Blackburn,1,1,D
4,37740,E0,1993,1994-01-01,Man United,Leeds,0,0,D
5,37952,E0,1993,1994-05-08,Man United,Coventry,0,0,D
6,38178,E0,1994,1994-12-28,Man United,Leicester,1,1,D
7,38303,E0,1994,1995-03-15,Man United,Tottenham,0,0,D
8,38331,E0,1994,1995-04-02,Man United,Leeds,0,0,D
9,38363,E0,1994,1995-04-17,Man United,Chelsea,0,0,D


When using the __OR__ clause in a __WHERE__ statement, rows will only be included in the results when either the first condition is true or the second condition is true. If both conditions are true for an observation then that observation will also be returned. To see how the __OR__ clause works, view the following query that shows all the rows from the __Matches__ table that were in either the 2012 season or the 2013 season:

In [18]:
# View all data from Matches from the 2012 and 2013 seasons

cur.execute('''SELECT *
               FROM Matches
               WHERE Season = 2012 
                   OR Season = 2013;''')
Matches_df = pd.DataFrame(cur.fetchall())
Matches_df.columns = [x[0] for x in cur.description]
Matches_df

Unnamed: 0,Match_ID,Div,Season,Date,HomeTeam,AwayTeam,FTHG,FTAG,FTR
0,1704,D1,2012,2013-05-04,Dortmund,Bayern Munich,1,1,D
1,1705,D1,2012,2013-02-02,Mainz,Bayern Munich,0,3,A
2,1706,D1,2012,2013-02-15,Wolfsburg,Bayern Munich,0,2,A
3,1707,D1,2012,2012-09-29,Werder Bremen,Bayern Munich,0,2,A
4,1708,D1,2012,2013-01-27,Stuttgart,Bayern Munich,0,2,A
5,1709,D1,2012,2012-11-17,Nurnberg,Bayern Munich,1,1,D
6,1710,D1,2012,2012-12-08,Augsburg,Bayern Munich,0,2,A
7,1711,D1,2012,2012-09-22,Schalke 04,Bayern Munich,0,2,A
8,1712,D1,2012,2012-11-03,Hamburg,Bayern Munich,0,3,A
9,1713,D1,2012,2013-03-03,Hoffenheim,Bayern Munich,0,1,A


To practice using the __AND__ clause in the __WHERE__ statement, select all the rows from the __Matches__ table where the Home Team scored more than 3 goals and the Away Team won the game and compare to the query below:

In [19]:
# View all the rows from Matches where the Home Team scored 
# more than 3 goals and the Away Team won the game

cur.execute('''SELECT *
               FROM Matches
               WHERE FTHG > 3 
                   AND FTR = 'A';''')
Matches_df = pd.DataFrame(cur.fetchall())
Matches_df.columns = [x[0] for x in cur.description]
Matches_df

Unnamed: 0,Match_ID,Div,Season,Date,HomeTeam,AwayTeam,FTHG,FTAG,FTR
0,404,D2,2009,2009-11-07,Greuther Furth,Augsburg,4,5,A
1,3116,D1,2014,2014-10-25,Ein Frankfurt,Stuttgart,4,5,A
2,3260,D1,2014,2015-02-14,Leverkusen,Wolfsburg,4,5,A
3,4163,D1,2016,2017-05-13,RB Leipzig,Bayern Munich,4,5,A
4,30734,D2,1993,1994-05-28,RW Essen,TB Berlin,4,6,A
5,32702,D1,1997,1997-10-31,Duisburg,M'gladbach,4,5,A
6,33508,D1,1998,1999-05-29,Munich 1860,Schalke 04,4,5,A
7,37891,E0,1993,1994-04-09,Norwich,Southampton,4,5,A
8,41961,E0,2004,2004-11-13,Tottenham,Arsenal,4,5,A
9,46238,E0,2015,2016-01-23,Norwich,Liverpool,4,5,A


To practice using the __OR__ clause in the __WHERE__ statement, select the HomeTeam, AwayTeam, FTHG, FTAG, FTR from the __Matches__ table where the Home Team scored more than 7 goals or the Away Team scored more than 7 goals and compare to the query below:

In [20]:
# Show the HomeTeam, AwayTeam, FTHG, FTAG, FTR from the Matches table 
# where the Home Team scored more than 7 goals or the Away Team 
# scored more than 7 goals

cur.execute('''SELECT HomeTeam,
                      AwayTeam,
                      FTHG,
                      FTAG,
                      FTR
               FROM Matches
               WHERE FTHG > 7 
                   OR FTAG > 7;''')
Matches_df = pd.DataFrame(cur.fetchall())
Matches_df.columns = [x[0] for x in cur.description]
Matches_df

Unnamed: 0,HomeTeam,AwayTeam,FTHG,FTAG,FTR
0,St Pauli,Bayern Munich,1,8,A
1,Bayern Munich,Hamburg,9,2,H
2,Bayern Munich,Hamburg,8,0,H
3,Bayern Munich,Hamburg,8,0,H
4,Werder Bremen,Bielefeld,8,1,H
5,Hansa Rostock,Koblenz,9,0,H
6,M'gladbach,Leverkusen,2,8,A
7,Ulm,Leverkusen,1,9,A
8,Reutlingen,Saarbrucken,8,2,H
9,Unterhaching,Saarbrucken,8,0,H
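
When __AND__ and __OR__ appear in the same __WHERE__ statement, SQL evaluates __AND__ before __OR__, so parentheses can change which rows come back. The sketch below illustrates this on a made-up in-memory table (`Demo_Matches` and its five rows are assumptions for the demo).

```python
import sqlite3

demo_conn = sqlite3.connect(":memory:")
demo_cur = demo_conn.cursor()
demo_cur.execute("CREATE TABLE Demo_Matches (Match_ID INTEGER, Season INTEGER, FTR TEXT);")
demo_cur.executemany("INSERT INTO Demo_Matches VALUES (?, ?, ?);", [
    (1, 2012, "H"), (2, 2012, "D"), (3, 2013, "H"), (4, 2013, "D"), (5, 2014, "D"),
])

# Without parentheses, AND is applied first, so this reads as:
# Season = 2012 OR (Season = 2013 AND FTR = 'D')
demo_cur.execute("""SELECT Match_ID
                    FROM Demo_Matches
                    WHERE Season = 2012
                        OR Season = 2013
                        AND FTR = 'D';""")
without_parens = sorted(row[0] for row in demo_cur.fetchall())

# With parentheses, the OR is evaluated first and the draw filter applies to both seasons.
demo_cur.execute("""SELECT Match_ID
                    FROM Demo_Matches
                    WHERE (Season = 2012 OR Season = 2013)
                        AND FTR = 'D';""")
with_parens = sorted(row[0] for row in demo_cur.fetchall())

print(without_parens)
print(with_parens)
demo_conn.close()
```

The first query also returns the 2012 home win (Match_ID 1), because the draw condition is attached only to the 2013 season. Adding parentheses whenever both clauses are mixed keeps the intent explicit.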


## BETWEEN & NOT BETWEEN: Filter Rows that Fall Into a Range

Now, we will introduce two more clauses that can be included in the __WHERE__ statement, __BETWEEN__ and __NOT BETWEEN__. With a __BETWEEN__ clause, you can filter your results based on the condition that the value in the column being filtered has to be between two values in order to be returned. Both end values are included in the range. For example, if we wanted to select all the matches from the 2012 season through the 2015 season, we would use the following query:

In [21]:
# View all data from Matches between the 2012 season and the 2015 season

cur.execute('''SELECT *
               FROM Matches
               WHERE Season BETWEEN 2012 
                   AND 2015;''')
Matches_df = pd.DataFrame(cur.fetchall())
Matches_df.columns = [x[0] for x in cur.description]
Matches_df

Unnamed: 0,Match_ID,Div,Season,Date,HomeTeam,AwayTeam,FTHG,FTAG,FTR
0,1704,D1,2012,2013-05-04,Dortmund,Bayern Munich,1,1,D
1,1705,D1,2012,2013-02-02,Mainz,Bayern Munich,0,3,A
2,1706,D1,2012,2013-02-15,Wolfsburg,Bayern Munich,0,2,A
3,1707,D1,2012,2012-09-29,Werder Bremen,Bayern Munich,0,2,A
4,1708,D1,2012,2013-01-27,Stuttgart,Bayern Munich,0,2,A
5,1709,D1,2012,2012-11-17,Nurnberg,Bayern Munich,1,1,D
6,1710,D1,2012,2012-12-08,Augsburg,Bayern Munich,0,2,A
7,1711,D1,2012,2012-09-22,Schalke 04,Bayern Munich,0,2,A
8,1712,D1,2012,2012-11-03,Hamburg,Bayern Munich,0,3,A
9,1713,D1,2012,2013-03-03,Hoffenheim,Bayern Munich,0,1,A


The __NOT BETWEEN__ clause works in the opposite way. If we were to write __NOT BETWEEN__ in place of __BETWEEN__ in the query above, it would return only the matches outside the 2012-2015 season range: matches before the 2012 season and matches after the 2015 season would be returned, but none of the matches from the 2012 season through the 2015 season. To query the matches that have a __Match_ID__ that is not between 25 and 46,750, we could use the following query:

In [22]:
# Select all the columns from the Matches table that show the 
# matches that do not have a Match_ID between 25 and 46,750

cur.execute('''SELECT *
               FROM Matches
               WHERE Match_ID NOT BETWEEN 25 
                   AND 46750;''')
Matches_df = pd.DataFrame(cur.fetchall())
Matches_df.columns = [x[0] for x in cur.description]
Matches_df

Unnamed: 0,Match_ID,Div,Season,Date,HomeTeam,AwayTeam,FTHG,FTAG,FTR
0,1,D2,2009,2010-04-04,Oberhausen,Kaiserslautern,2,1,H
1,2,D2,2009,2009-11-01,Munich 1860,Kaiserslautern,0,1,A
2,3,D2,2009,2009-10-04,Frankfurt FSV,Kaiserslautern,1,1,D
3,4,D2,2009,2010-02-21,Frankfurt FSV,Karlsruhe,2,1,H
4,5,D2,2009,2009-12-06,Ahlen,Karlsruhe,1,3,A
5,6,D2,2009,2010-04-03,Union Berlin,Karlsruhe,1,1,D
6,7,D2,2009,2009-08-14,Paderborn,Karlsruhe,2,0,H
7,8,D2,2009,2010-03-08,Bielefeld,Karlsruhe,0,1,A
8,9,D2,2009,2009-09-26,Kaiserslautern,Karlsruhe,2,0,H
9,10,D2,2009,2009-11-21,Hansa Rostock,Karlsruhe,2,1,H


To practice on your own, select FTHG, FTAG, FTR from all the matches where the number of goals scored by the Home Team is between 7 and 9 AND the number of goals scored by the Away Team is not between 0 and 3. When you have done this, compare your query to the one below:

In [23]:
# Select FTHG, FTAG, FTR of all the matches where the number of goals 
# scored by the Home Team is between 7 and 9 AND the number of goals 
# scored by the Away Team is not between 0 and 3.

cur.execute('''SELECT FTHG,
                      FTAG,
                      FTR
               FROM Matches
               WHERE (FTHG BETWEEN 7 AND 9)
                   AND (FTAG NOT BETWEEN 0 AND 3);''')
Matches_df = pd.DataFrame(cur.fetchall())
Matches_df.columns = [x[0] for x in cur.description]
Matches_df

Unnamed: 0,FTHG,FTAG,FTR
0,7,4,H
1,7,4,H
2,7,6,H
3,7,4,H
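
A quick way to see how the endpoints of a range are treated is to run __BETWEEN__, its comparison-operator equivalent, and __NOT BETWEEN__ against the same small table. The sketch below assumes a made-up in-memory table `Demo_Matches` with one row per season from 2011 to 2016.

```python
import sqlite3

demo_conn = sqlite3.connect(":memory:")
demo_cur = demo_conn.cursor()
demo_cur.execute("CREATE TABLE Demo_Matches (Match_ID INTEGER, Season INTEGER);")
demo_cur.executemany("INSERT INTO Demo_Matches VALUES (?, ?);",
                     [(1, 2011), (2, 2012), (3, 2013), (4, 2014), (5, 2015), (6, 2016)])

def seasons_where(condition):
    """Return the sorted Season values of rows matching a fixed demo condition."""
    demo_cur.execute(f"SELECT Season FROM Demo_Matches WHERE {condition};")
    return sorted(row[0] for row in demo_cur.fetchall())

between = seasons_where("Season BETWEEN 2012 AND 2015")
comparison = seasons_where("Season >= 2012 AND Season <= 2015")
not_between = seasons_where("Season NOT BETWEEN 2012 AND 2015")

print(between)
print(comparison)
print(not_between)
demo_conn.close()
```

The first two conditions return the same rows, including the 2012 and 2015 endpoints, while __NOT BETWEEN__ returns only the seasons outside that range.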


## IN( ): Selecting Values from a List

Sometimes you may only want to select data that is included in a certain list. This is where the __IN__ operator comes in. You can use the __IN__ operator in the __WHERE__ statement to return only the rows whose value in a column appears in a list. For example, the following query will return all the matches where Chelsea, Hull or Watford was the Home Team:

In [24]:
# Show all matches where the Home Team was either Chelsea, Hull or Watford

cur.execute('''SELECT *
               FROM Matches
               WHERE HomeTeam IN('Chelsea', 'Hull', 'Watford');''')
Matches_df = pd.DataFrame(cur.fetchall())
Matches_df.columns = [x[0] for x in cur.description]
Matches_df

Unnamed: 0,Match_ID,Div,Season,Date,HomeTeam,AwayTeam,FTHG,FTAG,FTR
0,37493,E0,1993,1993-08-14,Chelsea,Blackburn,1,2,A
1,37529,E0,1993,1993-08-25,Chelsea,QPR,2,0,H
2,37538,E0,1993,1993-08-28,Chelsea,Sheffield Weds,1,1,D
3,37559,E0,1993,1993-09-11,Chelsea,Man United,1,0,H
4,37581,E0,1993,1993-09-25,Chelsea,Liverpool,1,0,H
5,37602,E0,1993,1993-10-16,Chelsea,Norwich,1,2,A
6,37625,E0,1993,1993-10-30,Chelsea,Oldham,0,1,A
7,37647,E0,1993,1993-11-20,Chelsea,Arsenal,0,2,A
8,37656,E0,1993,1993-11-22,Chelsea,Man City,0,0,D
9,37697,E0,1993,1993-12-11,Chelsea,Ipswich,1,1,D


Practice using the __IN__ operator to select the AwayTeam from the __Matches__ table when the Away team is either Liverpool, Man City or Swansea and compare your query to the one below:

In [25]:
# Select AwayTeam from Matches when the Away Team name is Liverpool, Man City or Swansea

cur.execute('''SELECT AwayTeam
               FROM Matches
               WHERE AwayTeam IN('Liverpool', 'Man City', 'Swansea');''')
Matches_df = pd.DataFrame(cur.fetchall())
Matches_df.columns = [x[0] for x in cur.description]
Matches_df

Unnamed: 0,AwayTeam
0,Man City
1,Liverpool
2,Man City
3,Liverpool
4,Liverpool
5,Man City
6,Liverpool
7,Man City
8,Liverpool
9,Man City
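
An __IN__ list behaves like a chain of __OR__ conditions on the same column, and it can be negated with __NOT IN__. The sketch below illustrates both on a made-up in-memory table (`Demo_Matches` and its five rows are assumptions for the demo).

```python
import sqlite3

demo_conn = sqlite3.connect(":memory:")
demo_cur = demo_conn.cursor()
demo_cur.execute("CREATE TABLE Demo_Matches (HomeTeam TEXT);")
demo_cur.executemany("INSERT INTO Demo_Matches VALUES (?);",
                     [("Chelsea",), ("Hull",), ("Watford",), ("Arsenal",), ("Leeds",)])

def home_teams_where(condition):
    """Return the sorted HomeTeam values of rows matching a fixed demo condition."""
    demo_cur.execute(f"SELECT HomeTeam FROM Demo_Matches WHERE {condition};")
    return sorted(row[0] for row in demo_cur.fetchall())

in_list = home_teams_where("HomeTeam IN ('Chelsea', 'Hull', 'Watford')")
or_chain = home_teams_where(
    "HomeTeam = 'Chelsea' OR HomeTeam = 'Hull' OR HomeTeam = 'Watford'")
not_in_list = home_teams_where("HomeTeam NOT IN ('Chelsea', 'Hull', 'Watford')")

print(in_list)
print(or_chain)
print(not_in_list)
demo_conn.close()
```

The __IN__ form is shorter and easier to extend than the equivalent __OR__ chain, which is the main reason to prefer it for longer lists.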


## LIKE Operator: Filtering String Data

Next, we will introduce the __LIKE__ operator. This operator is useful when you want to filter string data. To see how it works, look at the query below. This query will return all the matches where the Home Team name begins with "A".

In [26]:
# Show all the matches where the Home Team name begins with an "A"

cur.execute('''SELECT *
               FROM Matches
               WHERE HomeTeam LIKE 'A%';''')
Matches_df = pd.DataFrame(cur.fetchall())
Matches_df.columns = [x[0] for x in cur.description]
Matches_df

Unnamed: 0,Match_ID,Div,Season,Date,HomeTeam,AwayTeam,FTHG,FTAG,FTR
0,5,D2,2009,2009-12-06,Ahlen,Karlsruhe,1,3,A
1,17,D2,2009,2010-01-15,Aachen,Karlsruhe,3,1,H
2,18,D2,2009,2009-10-18,Augsburg,Karlsruhe,1,1,D
3,143,D2,2009,2009-09-20,Augsburg,Hansa Rostock,5,2,H
4,144,D2,2009,2009-11-09,Aachen,Hansa Rostock,1,0,H
5,156,D2,2009,2009-10-04,Ahlen,Hansa Rostock,0,2,A
6,200,D2,2009,2010-05-02,Augsburg,Munich 1860,1,0,H
7,204,D2,2009,2009-08-28,Ahlen,Munich 1860,0,0,D
8,205,D2,2009,2009-09-20,Aachen,Munich 1860,2,0,H
9,210,D2,2009,2009-11-22,Augsburg,St Pauli,3,2,H


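After the __LIKE__ operator, we wrapped the pattern A% in quotation marks. The A means that the value must begin with an "A". The percent sign is a wildcard that matches any sequence of zero or more characters, so anything may follow the A. Note that in SQLite, __LIKE__ is case-insensitive for ASCII letters by default, so the pattern 'a%' would match the same rows. To see how this works, view the following: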

 - WHERE HomeTeam LIKE 'Be%' (Returns rows where HomeTeam begins with "Be")
 - WHERE HomeTeam LIKE '%ty' (Returns rows where HomeTeam ends with "ty")
 - WHERE HomeTeam LIKE '%a%' (Returns rows where HomeTeam includes "a")
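
The patterns above can be tried out on a small table. The sketch below assumes a made-up in-memory table `Demo_Matches` with six team names, and it also shows the underscore wildcard (`_`), which matches exactly one character and is not used elsewhere in this tutorial.

```python
import sqlite3

demo_conn = sqlite3.connect(":memory:")
demo_cur = demo_conn.cursor()
demo_cur.execute("CREATE TABLE Demo_Matches (HomeTeam TEXT);")
demo_cur.executemany("INSERT INTO Demo_Matches VALUES (?);",
                     [("Ahlen",), ("Bayern Munich",), ("Bielefeld",),
                      ("Coventry",), ("Hull",), ("Man City",)])

def home_teams_like(pattern):
    """Return the sorted HomeTeam values that match a LIKE pattern."""
    demo_cur.execute("""SELECT HomeTeam
                        FROM Demo_Matches
                        WHERE HomeTeam LIKE ?;""", (pattern,))
    return sorted(row[0] for row in demo_cur.fetchall())

starts_with_b = home_teams_like("B%")    # begins with "B"
ends_with_ty = home_teams_like("%ty")    # ends with "ty"
contains_a = home_teams_like("%a%")      # includes "a" in any letter case
one_char_gap = home_teams_like("H_ll")   # "H", any single character, then "ll"

print(starts_with_b)
print(ends_with_ty)
print(contains_a)
print(one_char_gap)
demo_conn.close()
```

"Ahlen" is returned by the '%a%' pattern even though its only "a" is uppercase, because SQLite's default __LIKE__ ignores case for ASCII letters.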
 
To practice using the __LIKE__ operator, return all matches where the Away Team name includes the term "City" and compare to the following code:

In [27]:
# Show all the matches where the Away Team name includes the term "City"

cur.execute('''SELECT *
               FROM Matches
               WHERE AwayTeam LIKE '%City%';''')
Matches_df = pd.DataFrame(cur.fetchall())
Matches_df.columns = [x[0] for x in cur.description]
Matches_df

Unnamed: 0,Match_ID,Div,Season,Date,HomeTeam,AwayTeam,FTHG,FTAG,FTR
0,37503,E0,1993,1993-08-17,Everton,Man City,1,0,H
1,37521,E0,1993,1993-08-21,Tottenham,Man City,1,0,H
2,37555,E0,1993,1993-09-01,Swindon,Man City,1,3,A
3,37578,E0,1993,1993-09-20,Wimbledon,Man City,1,0,H
4,37587,E0,1993,1993-09-25,Sheffield United,Man City,0,1,A
5,37601,E0,1993,1993-10-16,Arsenal,Man City,0,0,D
6,37633,E0,1993,1993-11-01,West Ham,Man City,3,1,H
7,37650,E0,1993,1993-11-20,Norwich,Man City,1,1,D
8,37656,E0,1993,1993-11-22,Chelsea,Man City,0,0,D
9,37680,E0,1993,1993-12-04,Leeds,Man City,3,2,H


You can also use the term __NOT LIKE__. If you wanted to return all the matches where the Date did not begin with 2017 you could use the following query:

In [29]:
# Show all the matches where the Date did not begin with '2017'

cur.execute('''SELECT *
               FROM Matches
               WHERE Date NOT LIKE '2017%';''')
Matches_df = pd.DataFrame(cur.fetchall())
Matches_df.columns = [x[0] for x in cur.description]
Matches_df

Unnamed: 0,Match_ID,Div,Season,Date,HomeTeam,AwayTeam,FTHG,FTAG,FTR
0,1,D2,2009,2010-04-04,Oberhausen,Kaiserslautern,2,1,H
1,2,D2,2009,2009-11-01,Munich 1860,Kaiserslautern,0,1,A
2,3,D2,2009,2009-10-04,Frankfurt FSV,Kaiserslautern,1,1,D
3,4,D2,2009,2010-02-21,Frankfurt FSV,Karlsruhe,2,1,H
4,5,D2,2009,2009-12-06,Ahlen,Karlsruhe,1,3,A
5,6,D2,2009,2010-04-03,Union Berlin,Karlsruhe,1,1,D
6,7,D2,2009,2009-08-14,Paderborn,Karlsruhe,2,0,H
7,8,D2,2009,2010-03-08,Bielefeld,Karlsruhe,0,1,A
8,9,D2,2009,2009-09-26,Kaiserslautern,Karlsruhe,2,0,H
9,10,D2,2009,2009-11-21,Hansa Rostock,Karlsruhe,2,1,H


To practice on your own, select all columns from the __Matches__ table where the __Date__ does not begin with 2016, the HomeTeam name ends with a "y", and the AwayTeam name does not include an "e". Compare your query to the one below:

In [30]:
# Show all the matches where the Date did not begin with '2016'
# and the Home Team Name ends in a "y"
# and the Away Team Name does not include an "e"

cur.execute('''SELECT *
               FROM Matches
               WHERE (Date NOT LIKE '2016%')
                   AND (HomeTeam LIKE '%y')
                   AND (AwayTeam NOT LIKE '%e%');''')
Matches_df = pd.DataFrame(cur.fetchall())
Matches_df.columns = [x[0] for x in cur.description]
Matches_df

Unnamed: 0,Match_ID,Div,Season,Date,HomeTeam,AwayTeam,FTHG,FTAG,FTR
0,37526,E0,1993,1993-08-24,Man City,Blackburn,0,2,A
1,37560,E0,1993,1993-09-11,Man City,QPR,3,0,H
2,37600,E0,1993,1993-10-04,Man City,Oldham,1,1,D
3,37603,E0,1993,1993-10-16,Coventry,Southampton,1,1,D
4,37708,E0,1993,1993-12-18,Coventry,Oldham,1,1,D
5,37729,E0,1993,1993-12-28,Man City,Southampton,1,1,D
6,37750,E0,1993,1994-01-03,Coventry,Swindon,1,1,D
7,37768,E0,1993,1994-01-22,Coventry,QPR,0,1,A
8,37777,E0,1993,1994-02-02,Coventry,Ipswich,1,0,H
9,37780,E0,1993,1994-02-05,Man City,Ipswich,2,1,H


## IS NULL & IS NOT NULL: Dealing with Missing Values

Sometimes, data may be missing in some rows. These values are known as __NULL__ data. You can check if any of the data is missing for a specific column using the clause __IS NULL__ in the __WHERE__ statement. To see if the outcome of any match is missing from the Matches table we can use the following query:

In [31]:
# Select all columns from the Matches table where FTR is a NULL value

cur.execute('''SELECT *
               FROM Matches
               WHERE FTR IS NULL;''')
Matches_df = pd.DataFrame(cur.fetchall())
Matches_df.columns = [x[0] for x in cur.description]
Matches_df

ValueError: Length mismatch: Expected axis has 0 elements, new values have 9 elements

We received a length mismatch error. This is because there are no observations where the outcome of a game is unknown: the query returned zero rows, so the data frame was created with no columns and could not accept the nine column names. That is good news, because it means no results are missing. Practice using the __IS NULL__ clause to check whether the Home Team goal count is missing from any of the observations in the __Matches__ table and compare your query to the one below:

In [32]:
# Select all columns from the Matches table where FTHG is a NULL value

cur.execute('''SELECT *
               FROM Matches
               WHERE FTHG IS NULL;''')
Matches_df = pd.DataFrame(cur.fetchall())
Matches_df.columns = [x[0] for x in cur.description]
Matches_df

ValueError: Length mismatch: Expected axis has 0 elements, new values have 9 elements
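
The length mismatch error can be avoided by passing the column names to `pd.DataFrame` when the data frame is created, instead of assigning them afterwards. This is a sketch of one possible variation on the tutorial's pattern, shown on a made-up in-memory table (`Demo_Matches` with three columns) rather than the real database.

```python
import sqlite3
import pandas as pd

demo_conn = sqlite3.connect(":memory:")
demo_cur = demo_conn.cursor()
demo_cur.execute("CREATE TABLE Demo_Matches (Match_ID INTEGER, FTHG INTEGER, FTR TEXT);")
demo_cur.executemany("INSERT INTO Demo_Matches VALUES (?, ?, ?);",
                     [(1, 2, "H"), (2, 0, "A")])

# No row has a missing FTR, so this query returns zero rows.
demo_cur.execute("""SELECT *
                    FROM Demo_Matches
                    WHERE FTR IS NULL;""")

# Supplying the names up front works even when fetchall() returns an empty list.
empty_df = pd.DataFrame(demo_cur.fetchall(),
                        columns=[x[0] for x in demo_cur.description])

print(len(empty_df), list(empty_df.columns))
demo_conn.close()
```

The result is an empty data frame that still carries the column labels, which makes "no rows matched" easy to tell apart from a real error.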

There should be no values missing. The statement __IS NOT NULL__ works in the opposite way. This will return all values where a column does not have a NULL value. The following query shows all the observations from the __Matches__ table where there is not a null value in the __HomeTeam__ column.

In [34]:
# Select all columns from the Matches table where HomeTeam is not a NULL value

cur.execute('''SELECT *
               FROM Matches
               WHERE HomeTeam IS NOT NULL;''')
Matches_df = pd.DataFrame(cur.fetchall())
Matches_df.columns = [x[0] for x in cur.description]
Matches_df

Unnamed: 0,Match_ID,Div,Season,Date,HomeTeam,AwayTeam,FTHG,FTAG,FTR
0,1,D2,2009,2010-04-04,Oberhausen,Kaiserslautern,2,1,H
1,2,D2,2009,2009-11-01,Munich 1860,Kaiserslautern,0,1,A
2,3,D2,2009,2009-10-04,Frankfurt FSV,Kaiserslautern,1,1,D
3,4,D2,2009,2010-02-21,Frankfurt FSV,Karlsruhe,2,1,H
4,5,D2,2009,2009-12-06,Ahlen,Karlsruhe,1,3,A
5,6,D2,2009,2010-04-03,Union Berlin,Karlsruhe,1,1,D
6,7,D2,2009,2009-08-14,Paderborn,Karlsruhe,2,0,H
7,8,D2,2009,2010-03-08,Bielefeld,Karlsruhe,0,1,A
8,9,D2,2009,2009-09-26,Kaiserslautern,Karlsruhe,2,0,H
9,10,D2,2009,2009-11-21,Hansa Rostock,Karlsruhe,2,1,H


Practice using the __IS NOT NULL__ clause by querying all the columns from the __Matches__ table where __Date__ does not have a null value and compare to the query below:

In [35]:
# Select all columns from the Matches table where Date is not a NULL value

cur.execute('''SELECT *
               FROM Matches
               WHERE Date IS NOT NULL;''')
Matches_df = pd.DataFrame(cur.fetchall())
Matches_df.columns = [x[0] for x in cur.description]
Matches_df

Unnamed: 0,Match_ID,Div,Season,Date,HomeTeam,AwayTeam,FTHG,FTAG,FTR
0,1,D2,2009,2010-04-04,Oberhausen,Kaiserslautern,2,1,H
1,2,D2,2009,2009-11-01,Munich 1860,Kaiserslautern,0,1,A
2,3,D2,2009,2009-10-04,Frankfurt FSV,Kaiserslautern,1,1,D
3,4,D2,2009,2010-02-21,Frankfurt FSV,Karlsruhe,2,1,H
4,5,D2,2009,2009-12-06,Ahlen,Karlsruhe,1,3,A
5,6,D2,2009,2010-04-03,Union Berlin,Karlsruhe,1,1,D
6,7,D2,2009,2009-08-14,Paderborn,Karlsruhe,2,0,H
7,8,D2,2009,2010-03-08,Bielefeld,Karlsruhe,0,1,A
8,9,D2,2009,2009-09-26,Kaiserslautern,Karlsruhe,2,0,H
9,10,D2,2009,2009-11-21,Hansa Rostock,Karlsruhe,2,1,H


An important thing to note when using __IS NULL__ and __IS NOT NULL__ is that a value of zero is not equivalent to a NULL value. A NULL is a missing value, whereas 0 is an actual value.
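
The difference between 0 and NULL can be checked directly. The sketch below assumes a made-up in-memory table `Demo_Matches` in which one row has 0 home goals and another has a missing goal count (Python's `None` is stored as NULL by sqlite3).

```python
import sqlite3

demo_conn = sqlite3.connect(":memory:")
demo_cur = demo_conn.cursor()
demo_cur.execute("CREATE TABLE Demo_Matches (Match_ID INTEGER, FTHG INTEGER);")
# Match 1 has zero goals, match 2 has a missing (NULL) goal count.
demo_cur.executemany("INSERT INTO Demo_Matches VALUES (?, ?);",
                     [(1, 0), (2, None), (3, 3)])

def match_ids_where(condition):
    """Return the sorted Match_ID values of rows matching a fixed demo condition."""
    demo_cur.execute(f"SELECT Match_ID FROM Demo_Matches WHERE {condition};")
    return sorted(row[0] for row in demo_cur.fetchall())

zero_ids = match_ids_where("FTHG = 0")
null_ids = match_ids_where("FTHG IS NULL")
not_null_ids = match_ids_where("FTHG IS NOT NULL")
equals_null_ids = match_ids_where("FTHG = NULL")

print(zero_ids, null_ids, not_null_ids, equals_null_ids)
demo_conn.close()
```

The last condition returns no rows at all: comparing anything to NULL with `=` never evaluates to true, which is why __IS NULL__ and __IS NOT NULL__ are needed.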

This is the end of this tutorial. We have covered how to connect to and query a SQL database with Python. We also discussed what the __WHERE__ statement is and the following filtering techniques that can be used with the __WHERE__ statement:
 - Comparison Operators
 - AND & OR
 - BETWEEN & NOT BETWEEN
 - IN
 - LIKE & NOT LIKE
 - IS NULL & IS NOT NULL
 
I encourage you to keep practicing these techniques to gain a deeper understanding of how to filter a SQL table using the __WHERE__ statement.