# Introduction to SQL Using Python: Aggregating Data with the GROUP BY Statement

In my last blog, I discussed how to filter data using SQL with the __WHERE__ statement. This blog is a tutorial on how to aggregate data using the __GROUP BY__ statement.

In [1]:
# Import necessary libraries

import sqlite3
import pandas as pd

In [2]:
# Connect to database
conn = sqlite3.connect('''database.sqlite''')

# Create cursor object
cur = conn.cursor()

In [None]:
cur.execute('''Enter SQL query here;''') # Runs SQL query
data = pd.DataFrame(cur.fetchall()) # Converts SQL query results into dataframe format
data.columns = [x[0] for x in cur.description] # Labels the columns of the dataframe
data # View SQL results dataframe
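
Since the same three steps (run the query, fetch the rows, label the columns) repeat in every cell, they can be wrapped in a small helper. The sketch below is one possible way to do that; the function name `run_query` and the tiny in-memory `Teams` table are assumptions made for illustration and are not part of the original database.

```python
import sqlite3
import pandas as pd


def run_query(conn, sql):
    """Run a SQL query and return the result as a labelled DataFrame."""
    cur = conn.cursor()
    cur.execute(sql)
    rows = cur.fetchall()
    # cur.description holds one tuple per result column; item 0 is the column name
    columns = [col[0] for col in cur.description]
    return pd.DataFrame(rows, columns=columns)


# Hypothetical in-memory table standing in for the real database file
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE Teams (Season INTEGER, TeamName TEXT)")
demo_conn.executemany(
    "INSERT INTO Teams VALUES (?, ?)",
    [(2017, "Bayern Munich"), (2017, "Dortmund"), (2016, "Leverkusen")],
)

teams = run_query(demo_conn, "SELECT * FROM Teams;")
print(teams)
```

Passing the column names directly to `pd.DataFrame` also keeps the headers when a query returns no rows, whereas assigning `data.columns` afterwards would fail on an empty result.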

In [7]:
# View Teams dataframe

cur.execute('''SELECT * 
               FROM Teams;''')
Teams_df = pd.DataFrame(cur.fetchall())
Teams_df.columns = [x[0] for x in cur.description]
Teams_df

Unnamed: 0,Season,TeamName,KaderHome,AvgAgeHome,ForeignPlayersHome,OverallMarketValueHome,AvgMarketValueHome,StadiumCapacity
0,2017,Bayern Munich,27,26,15,597950000,22150000,75000
1,2017,Dortmund,33,25,18,416730000,12630000,81359
2,2017,Leverkusen,31,24,15,222600000,7180000,30210
3,2017,RB Leipzig,30,23,15,180130000,6000000,42959
4,2017,Schalke 04,29,24,17,179550000,6190000,62271
5,2017,M'gladbach,31,25,17,154400000,4980000,54014
6,2017,Wolfsburg,31,24,14,124430000,4010000,30000
7,2017,FC Koln,24,26,9,118550000,4940000,49968
8,2017,Hoffenheim,31,24,14,107330000,3460000,30164
9,2017,Hertha,26,26,12,86800000,3340000,74475


# COUNT( )

The first aggregation function we will look at is the __COUNT( )__ function. Previously, we either put a single asterisk in the __SELECT__ statement or listed the names of the columns we wanted returned. The __COUNT( )__ function can also be listed in the __SELECT__ statement, and it counts rows based on what is placed between the parentheses. __COUNT(*)__ counts all the rows in the specified table. The query below counts all the rows in the __Teams__ dataset.

In [9]:
# View the number of rows from Teams dataframe

cur.execute('''SELECT COUNT(*) 
               FROM Teams;''')
Teams_df = pd.DataFrame(cur.fetchall())
Teams_df.columns = [x[0] for x in cur.description]
Teams_df

Unnamed: 0,COUNT(*)
0,468


There are a total of 468 rows in the __Teams__ dataset. What happens if we insert a column name instead of an asterisk between the parentheses of the __COUNT( )__ function, for example, __COUNT(Season)__? The function will then count only the rows where __Season__ has a non-null value. If __Season__ is null for any row, that row will not be counted. If there are no null values in the __Season__ column, the following query will return 468, the same as the total number of rows in the __Teams__ dataset.

In [10]:
# Count the rows with a non-null Season value in the Teams Dataframe

cur.execute('''SELECT COUNT(Season)
               FROM Teams;''')
Teams_df = pd.DataFrame(cur.fetchall())
Teams_df.columns = [x[0] for x in cur.description]
Teams_df

Unnamed: 0,COUNT(Season)
0,468


Great, the number of rows with non-null values is 468, the same as the total number of rows in the __Teams__ dataset. That means there are no null values in the __Season__ column. To practice using the __COUNT( )__ function, count the number of rows in the __Matches__ dataset and compare your query to the one below:

In [11]:
# Count the number of rows in the Matches dataframe

cur.execute('''SELECT COUNT(*) 
               FROM Matches;''')
Matches_df = pd.DataFrame(cur.fetchall())
Matches_df.columns = [x[0] for x in cur.description]
Matches_df

Unnamed: 0,COUNT(*)
0,24625


There should be 24,625 rows in the __Matches__ dataset. Check to see if there are any missing values in the column __Season__ from the __Matches__ dataset and compare your query to the one below:

In [12]:
# Count the rows with a non-null Season value in the Matches Dataframe

cur.execute('''SELECT COUNT(Season)
               FROM Matches;''')
Matches_df = pd.DataFrame(cur.fetchall())
Matches_df.columns = [x[0] for x in cur.description]
Matches_df

Unnamed: 0,COUNT(Season)
0,24625


There are 24,625 rows with non-null values in the __Season__ column in the __Matches__ dataset. Since this is the same as the total number of rows in the __Matches__ dataset, there are no missing/null values in the __Season__ column.
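
Because neither table has missing __Season__ values, the two counts above always match. To see what happens when a null is present, here is a minimal sketch on a toy table. The table `TeamsDemo`, its rows, and the connection name `conn_demo` are made up for illustration, and the results are left as plain tuples rather than a dataframe to keep the example short.

```python
import sqlite3

# Hypothetical toy table (not the real Teams data) with one missing Season
conn_demo = sqlite3.connect(":memory:")
conn_demo.execute("CREATE TABLE TeamsDemo (TeamName TEXT, Season INTEGER)")
conn_demo.executemany(
    "INSERT INTO TeamsDemo VALUES (?, ?)",
    [("Bayern Munich", 2017), ("Dortmund", 2017), ("Leverkusen", None)],
)

# COUNT(*) counts every row; COUNT(Season) skips rows where Season is NULL
total_rows, non_null_seasons = conn_demo.execute(
    "SELECT COUNT(*), COUNT(Season) FROM TeamsDemo;"
).fetchone()

# The difference between the two counts is the number of missing values
missing_seasons = total_rows - non_null_seasons
print(total_rows, non_null_seasons, missing_seasons)  # 3 2 1
```

Subtracting the two counts is a quick way to check a column for missing values without scanning the rows by eye.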

# SELECT DISTINCT

With the __COUNT( )__ function, we were able to see how many rows are in both the __Teams__ and __Matches__ datasets. We also saw the number of rows with non-null values in the __Season__ column for both datasets. What if we want to know the unique values in the __Season__ column of the __Teams__ dataset? We can't use __COUNT(Season)__, because that only gives us the number of rows that have a non-null value in the __Season__ column. Instead, we will use the __DISTINCT__ clause in the __SELECT__ statement. The query below shows all of the unique values in the __Season__ column of the __Teams__ dataset.

In [13]:
# View the unique Season values in the Teams Dataframe

cur.execute('''SELECT DISTINCT Season
               FROM Teams;''')
Teams_df = pd.DataFrame(cur.fetchall())
Teams_df.columns = [x[0] for x in cur.description]
Teams_df

Unnamed: 0,Season
0,2017
1,2016
2,2015
3,2014
4,2013
5,2012
6,2011
7,2010
8,2009
9,2008


You can also use more than one column name with the __DISTINCT__ clause in the __SELECT__ statement. For example, one soccer team may be the home team multiple times in a season, but if we only want to know the seasons in which each team was a home team, we can use the following query:

In [19]:
cur.execute('''SELECT DISTINCT Season, HomeTeam
               FROM Matches;''')
Matches_df = pd.DataFrame(cur.fetchall())
Matches_df.columns = [x[0] for x in cur.description]
Matches_df

Unnamed: 0,Season,HomeTeam
0,2009,Oberhausen
1,2009,Munich 1860
2,2009,Frankfurt FSV
3,2009,Ahlen
4,2009,Union Berlin
5,2009,Paderborn
6,2009,Bielefeld
7,2009,Kaiserslautern
8,2009,Hansa Rostock
9,2009,Greuther Furth


The above results show every distinct combination of __Season__ and __HomeTeam__ in the __Matches__ table. This might seem a bit confusing right now, but we will play around more with the __DISTINCT__ clause later. For now, practice selecting all the unique __AwayTeam__ values in the __Matches__ dataset and compare your query to the one below:

In [22]:
# Show all the unique AwayTeams in the Matches Dataframe

cur.execute('''SELECT DISTINCT AwayTeam
               FROM Matches;''')
Matches_df = pd.DataFrame(cur.fetchall())
Matches_df.columns = [x[0] for x in cur.description]
Matches_df

Unnamed: 0,AwayTeam
0,Kaiserslautern
1,Karlsruhe
2,Leverkusen
3,Nurnberg
4,Schalke 04
5,Stuttgart
6,Werder Bremen
7,Bochum
8,Hannover
9,Hansa Rostock
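
To make the difference between one-column and two-column __DISTINCT__ easier to see, here is a small sketch on a made-up table. The table `MatchesDemo`, its four rows, and the connection name `conn_demo` are assumptions for illustration only. The results are sorted in Python before printing because a query without an ordering clause does not promise any particular row order.

```python
import sqlite3

# Hypothetical toy table (not the real Matches data)
conn_demo = sqlite3.connect(":memory:")
conn_demo.execute("CREATE TABLE MatchesDemo (Season INTEGER, HomeTeam TEXT)")
conn_demo.executemany(
    "INSERT INTO MatchesDemo VALUES (?, ?)",
    [
        (2016, "Dortmund"),
        (2016, "Dortmund"),  # same team at home twice in one season
        (2017, "Dortmund"),
        (2017, "Hertha"),
    ],
)

# One column: each Season value appears once
seasons = conn_demo.execute(
    "SELECT DISTINCT Season FROM MatchesDemo;"
).fetchall()

# Two columns: each (Season, HomeTeam) combination appears once
pairs = conn_demo.execute(
    "SELECT DISTINCT Season, HomeTeam FROM MatchesDemo;"
).fetchall()

print(sorted(seasons))  # [(2016,), (2017,)]
print(sorted(pairs))    # [(2016, 'Dortmund'), (2017, 'Dortmund'), (2017, 'Hertha')]
```

Dortmund still appears twice in the second result because the two rows differ in __Season__; only the exact duplicate row was dropped.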


# Using COUNT( ) & DISTINCT Together

We just saw how to get all the unique __AwayTeam__ names from the __Matches__ dataset. What if we want to know how many unique values there are? To get the count of the unique __AwayTeam__ names, we can use the __COUNT( )__ function together with the __DISTINCT__ clause. The query below finds all the distinct __AwayTeam__ names in the __Matches__ dataset and returns how many there are.

In [23]:
# Count the unique AwayTeams in the Matches Dataframe

cur.execute('''SELECT COUNT (DISTINCT AwayTeam)
               FROM Matches;''')
Matches_df = pd.DataFrame(cur.fetchall())
Matches_df.columns = [x[0] for x in cur.description]
Matches_df

Unnamed: 0,COUNT (DISTINCT AwayTeam)
0,128


Now we can see that there are 128 unique away teams in the __Matches__ dataset. We can also see that this column is named __COUNT (DISTINCT AwayTeam)__, which is a bit long and unclear. We can change the name of a column using an __AS__ clause. In the __SELECT__ statement, we can rename any column by writing __AS__ after the expression we are selecting, followed by the new name. The query below shows the number of unique away teams in the __Matches__ dataset, but this time the column header is __Num_of_AwayTeams__.

In [25]:
# Count the unique AwayTeams in the Matches Dataframe and rename the column

cur.execute('''SELECT COUNT (DISTINCT AwayTeam) AS Num_of_AwayTeams
               FROM Matches;''')
Matches_df = pd.DataFrame(cur.fetchall())
Matches_df.columns = [x[0] for x in cur.description]
Matches_df

Unnamed: 0,Num_of_AwayTeams
0,128


To practice using the __COUNT( )__ function and __DISTINCT__ clause together, query the number of distinct __Season__ values in the __Teams__ dataset and rename the column to __Num_of_Seasons__. Compare your query to the one below:

In [27]:
# Count of unique seasons in the Teams dataset

cur.execute('''SELECT COUNT(DISTINCT Season) AS Num_of_Seasons
               FROM Teams;''')
Teams_df = pd.DataFrame(cur.fetchall())
Teams_df.columns = [x[0] for x in cur.description]
Teams_df

Unnamed: 0,Num_of_Seasons
0,13
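
As a quick recap of this section, the sketch below puts a plain count and a distinct count side by side and gives each an alias. The table `MatchesDemo`, its rows, the alias `Num_of_Rows`, and the names `conn_demo` and `cur_demo` are hypothetical and only serve the example. It also shows that the alias is what ends up in `cur.description`, which is why the dataframe header changes when we use __AS__.

```python
import sqlite3

# Hypothetical toy table (not the real Matches data)
conn_demo = sqlite3.connect(":memory:")
conn_demo.execute("CREATE TABLE MatchesDemo (Season INTEGER, AwayTeam TEXT)")
conn_demo.executemany(
    "INSERT INTO MatchesDemo VALUES (?, ?)",
    [(2016, "Hertha"), (2016, "Dortmund"), (2017, "Hertha"), (2017, "Hertha")],
)

cur_demo = conn_demo.cursor()
cur_demo.execute(
    """SELECT COUNT(AwayTeam) AS Num_of_Rows,
              COUNT(DISTINCT AwayTeam) AS Num_of_AwayTeams
       FROM MatchesDemo;"""
)

# The aliases given with AS become the column names reported by the cursor
column_names = [col[0] for col in cur_demo.description]
counts = cur_demo.fetchone()

print(column_names)  # ['Num_of_Rows', 'Num_of_AwayTeams']
print(counts)        # (4, 2)
```

Four rows have an away team, but only two different team names occur, so the distinct count is smaller than the plain count.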


# MAX( ) & MIN( ) Functions

We will continue to look at how data can be aggregated in the __SELECT__ statement before we get to the __GROUP BY__ statement. Next, we will look at the __MAX( )__ and __MIN( )__ functions. Perhaps we want to know the maximum stadium capacity in the __Teams__ dataset. We can use the query below:

In [29]:
# Return the maximum StadiumCapacity in the Teams dataset

cur.execute('''SELECT MAX(StadiumCapacity)
               FROM Teams;''')
Teams_df = pd.DataFrame(cur.fetchall())
Teams_df.columns = [x[0] for x in cur.description]
Teams_df

Unnamed: 0,MAX(StadiumCapacity)
0,81359
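
The __MIN( )__ function works the same way as __MAX( )__, and both can be requested in a single query. The sketch below assumes a toy table `TeamsDemo` holding just three of the rows shown in the earlier __Teams__ output, so its minimum is not necessarily the minimum of the full dataset. The aliases `Largest` and `Smallest` and the connection name `conn_demo` are made up for the example.

```python
import sqlite3

# Hypothetical toy table with three rows copied from the Teams output above
conn_demo = sqlite3.connect(":memory:")
conn_demo.execute(
    "CREATE TABLE TeamsDemo (TeamName TEXT, StadiumCapacity INTEGER)"
)
conn_demo.executemany(
    "INSERT INTO TeamsDemo VALUES (?, ?)",
    [("Bayern Munich", 75000), ("Dortmund", 81359), ("Wolfsburg", 30000)],
)

# MAX( ) and MIN( ) each collapse the column into a single value
largest, smallest = conn_demo.execute(
    """SELECT MAX(StadiumCapacity) AS Largest,
              MIN(StadiumCapacity) AS Smallest
       FROM TeamsDemo;"""
).fetchone()

print(largest, smallest)  # 81359 30000
```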


In [28]:
# View Teams dataframe

cur.execute('''SELECT * 
               FROM Teams;''')
Teams_df = pd.DataFrame(cur.fetchall())
Teams_df.columns = [x[0] for x in cur.description]
Teams_df

Unnamed: 0,Season,TeamName,KaderHome,AvgAgeHome,ForeignPlayersHome,OverallMarketValueHome,AvgMarketValueHome,StadiumCapacity
0,2017,Bayern Munich,27,26,15,597950000,22150000,75000
1,2017,Dortmund,33,25,18,416730000,12630000,81359
2,2017,Leverkusen,31,24,15,222600000,7180000,30210
3,2017,RB Leipzig,30,23,15,180130000,6000000,42959
4,2017,Schalke 04,29,24,17,179550000,6190000,62271
5,2017,M'gladbach,31,25,17,154400000,4980000,54014
6,2017,Wolfsburg,31,24,14,124430000,4010000,30000
7,2017,FC Koln,24,26,9,118550000,4940000,49968
8,2017,Hoffenheim,31,24,14,107330000,3460000,30164
9,2017,Hertha,26,26,12,86800000,3340000,74475


# ORDER BY Statement

In [None]:
# DESC & ASC