# Introduction to SQL Using Python: Aggregating Data with the GROUP BY Statement

In my last blog post, I discussed how to filter data using SQL with the __WHERE__ statement. This post is a tutorial on how to aggregate data using the __GROUP BY__ statement.

In [1]:
# Import necessary libraries

import sqlite3
import pandas as pd

In [2]:
# Connect to database
conn = sqlite3.connect('''database.sqlite''')

# Create cursor object
cur = conn.cursor()

In [None]:
cur.execute('''Enter SQL query here;''') # Runs SQL query
data = pd.DataFrame(cur.fetchall()) # Converts SQL query results into dataframe format
data.columns = [x[0] for x in cur.description] # Labels the columns of the dataframe
data # View SQL results dataframe
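
The template cell above relies on `cur.description` to recover the column names. As a minimal sketch of that idea, the hypothetical helper `run_query` below wraps the same steps using only the standard library. The in-memory `Demo` table and its two rows are made up for illustration and are not part of `database.sqlite`.

```python
import sqlite3

def run_query(connection, sql, params=()):
    """Run a query and return (column_names, rows)."""
    cursor = connection.cursor()
    cursor.execute(sql, params)
    # cursor.description holds one tuple per result column;
    # the first item of each tuple is the column name.
    column_names = [col[0] for col in cursor.description]
    return column_names, cursor.fetchall()

# Hypothetical in-memory table standing in for the real database file.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE Demo (TeamName TEXT, Season INTEGER)")
demo_conn.executemany(
    "INSERT INTO Demo VALUES (?, ?)",
    [("Team A", 2016), ("Team B", 2017)],
)

columns, rows = run_query(demo_conn, "SELECT * FROM Demo ORDER BY Season;")
print(columns)
print(rows)
```

The returned names and rows can be passed to `pd.DataFrame(rows, columns=columns)` to get the same dataframe the template cell builds.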

In [7]:
# View Teams dataframe

cur.execute('''SELECT * 
               FROM Teams;''')
Teams_df =pd.DataFrame(cur.fetchall())
Teams_df.columns = [x[0] for x in cur.description]
Teams_df

Unnamed: 0,Season,TeamName,KaderHome,AvgAgeHome,ForeignPlayersHome,OverallMarketValueHome,AvgMarketValueHome,StadiumCapacity
0,2017,Bayern Munich,27,26,15,597950000,22150000,75000
1,2017,Dortmund,33,25,18,416730000,12630000,81359
2,2017,Leverkusen,31,24,15,222600000,7180000,30210
3,2017,RB Leipzig,30,23,15,180130000,6000000,42959
4,2017,Schalke 04,29,24,17,179550000,6190000,62271
5,2017,M'gladbach,31,25,17,154400000,4980000,54014
6,2017,Wolfsburg,31,24,14,124430000,4010000,30000
7,2017,FC Koln,24,26,9,118550000,4940000,49968
8,2017,Hoffenheim,31,24,14,107330000,3460000,30164
9,2017,Hertha,26,26,12,86800000,3340000,74475


# COUNT( )

The first aggregation function we will look at is the __COUNT( )__ function. Previously, we either put a single asterisk in the __SELECT__ statement or listed the names of the columns we wanted returned. The __COUNT( )__ function can also be listed in the __SELECT__ statement, and it counts the rows specified between the parentheses. __COUNT(*)__ counts all the rows in the specified table, so the query below counts all the rows in the __Teams__ dataset.

In [9]:
# View the number of rows from Teams dataframe

cur.execute('''SELECT COUNT(*) 
               FROM Teams;''')
Teams_df =pd.DataFrame(cur.fetchall())
Teams_df.columns = [x[0] for x in cur.description]
Teams_df

Unnamed: 0,COUNT(*)
0,468


There are a total of 468 rows included in the __Teams__ dataset. What happens if we insert a column name instead of an asterisk between the parentheses of the __COUNT( )__ function, for example, __COUNT(Season)__? The function will then count all the rows in the __Season__ column where there is a non-null value. If the value of __Season__ is null for any row, then that row will not be counted. If there are no null values in the __Season__ column, then the following query will return 468, the same number as the total number of rows in the __Teams__ dataset.

In [10]:
# Count the number of Season rows in the Teams Dataframe

cur.execute('''SELECT COUNT(Season) 
               FROM Teams;''')
Teams_df =pd.DataFrame(cur.fetchall())
Teams_df.columns = [x[0] for x in cur.description]
Teams_df

Unnamed: 0,COUNT(Season)
0,468


Great, the number of rows with non-null values is 468, the same as the total number of rows included in the __Teams__ dataset. That means there are no null values in the __Season__ column. To practice using the __COUNT( )__ function, count the number of rows in the __Matches__ dataset and compare your query to the one below:

In [11]:
# Count the number of rows in the Matches dataframe

cur.execute('''SELECT COUNT(*) 
               FROM Matches;''')
Matches_df =pd.DataFrame(cur.fetchall())
Matches_df.columns = [x[0] for x in cur.description]
Matches_df

Unnamed: 0,COUNT(*)
0,24625


There should be 24,625 rows in the __Matches__ dataset. Check to see if there are any missing values in the column __Season__ from the __Matches__ dataset and compare your query to the one below:

In [12]:
# Count the number of Season rows in the Matches Dataframe

cur.execute('''SELECT COUNT(Season) 
               FROM Matches;''')
Matches_df =pd.DataFrame(cur.fetchall())
Matches_df.columns = [x[0] for x in cur.description]
Matches_df

Unnamed: 0,COUNT(Season)
0,24625


There are 24,625 rows with non-null values in the __Season__ column in the __Matches__ dataset. Since this is the same as the total number of rows in the __Matches__ dataset, there are no missing/null values in the __Season__ column.
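
Neither dataset has a null __Season__, so the two counts always matched above. To see them differ, here is a minimal sketch on a made-up in-memory table (`ToyTeams` and its rows are assumptions for illustration, not data from `database.sqlite`) where one row has no season.

```python
import sqlite3

demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE ToyTeams (TeamName TEXT, Season INTEGER)")
demo_conn.executemany(
    "INSERT INTO ToyTeams VALUES (?, ?)",
    # Python's None is stored as SQL NULL
    [("Team A", 2016), ("Team B", 2017), ("Team C", None)],
)

# COUNT(*) counts every row; COUNT(Season) skips rows where Season is NULL.
total_rows, season_rows = demo_conn.execute(
    "SELECT COUNT(*), COUNT(Season) FROM ToyTeams;"
).fetchone()
print(total_rows, season_rows)  # 3 2
```

The difference between the two numbers is the number of rows with a missing value in that column.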

# SELECT DISTINCT

With the __COUNT( )__ function, we were able to see how many rows were listed in both the __Teams__ and __Matches__ datasets. We also saw the number of rows that did not have null values in the __Season__ column for both datasets. What if we want to know the unique values in the __Season__ column in the __Teams__ dataset? We can't use __COUNT(Season)__, because that just gives us the number of rows that have a non-null value in the __Season__ column. Instead, we will use the __DISTINCT__ clause in the __SELECT__ statement. The query below will show us all of the unique values that are included in the __Season__ column in the __Teams__ dataset.

In [13]:
# View the unique Season values in the Teams Dataframe

cur.execute('''SELECT DISTINCT Season
               FROM Teams;''')
Teams_df =pd.DataFrame(cur.fetchall())
Teams_df.columns = [x[0] for x in cur.description]
Teams_df

Unnamed: 0,Season
0,2017
1,2016
2,2015
3,2014
4,2013
5,2012
6,2011
7,2010
8,2009
9,2008


You can also use more than one column name when using the __DISTINCT__ clause in the __SELECT__ statement. For example, one soccer team may be the home team multiple times in a season, but if we only want to know the seasons in which each team was a home team, we can use the following query:

In [19]:
cur.execute('''SELECT DISTINCT Season, HomeTeam
               FROM Matches;''')
Matches_df =pd.DataFrame(cur.fetchall())
Matches_df.columns = [x[0] for x in cur.description]
Matches_df

Unnamed: 0,Season,HomeTeam
0,2009,Oberhausen
1,2009,Munich 1860
2,2009,Frankfurt FSV
3,2009,Ahlen
4,2009,Union Berlin
5,2009,Paderborn
6,2009,Bielefeld
7,2009,Kaiserslautern
8,2009,Hansa Rostock
9,2009,Greuther Furth


The above results show every distinct combination of __Season__ and __HomeTeam__ in the __Matches__ table, so a team appears once for each season in which it played at home. This might seem a bit confusing right now, but we will play around more with the __DISTINCT__ clause later. For now, practice selecting all the unique __AwayTeam__ values in the __Matches__ dataset and compare your query to the one below:

In [22]:
# Show all the unique AwayTeams in the Matches Dataframe

cur.execute('''SELECT DISTINCT AwayTeam
               FROM Matches;''')
Matches_df =pd.DataFrame(cur.fetchall())
Matches_df.columns = [x[0] for x in cur.description]
Matches_df

Unnamed: 0,AwayTeam
0,Kaiserslautern
1,Karlsruhe
2,Leverkusen
3,Nurnberg
4,Schalke 04
5,Stuttgart
6,Werder Bremen
7,Bochum
8,Hannover
9,Hansa Rostock
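
If __DISTINCT__ on two columns still feels abstract, a tiny example may help. The sketch below uses a made-up in-memory table (`ToyMatches` and its four rows are assumptions for illustration) in which one team hosts twice in the same season. The `ORDER BY` clause is only there to make the printed order predictable.

```python
import sqlite3

demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE ToyMatches (Season INTEGER, HomeTeam TEXT)")
demo_conn.executemany(
    "INSERT INTO ToyMatches VALUES (?, ?)",
    [(2016, "Team A"), (2016, "Team A"), (2016, "Team B"), (2017, "Team A")],
)

# DISTINCT on one column: each team name appears once.
unique_teams = demo_conn.execute(
    "SELECT DISTINCT HomeTeam FROM ToyMatches ORDER BY HomeTeam;"
).fetchall()

# DISTINCT on two columns: each (Season, HomeTeam) pair appears once,
# so the repeated (2016, 'Team A') row collapses into a single row.
unique_pairs = demo_conn.execute(
    "SELECT DISTINCT Season, HomeTeam FROM ToyMatches ORDER BY Season, HomeTeam;"
).fetchall()

print(unique_teams)
print(unique_pairs)
```

Four input rows become two unique teams but three unique season and team pairs, because uniqueness is judged on the whole row that is selected.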


# Using COUNT( ) & DISTINCT Together

We just saw how to get all the unique __AwayTeam__ names from the __Matches__ dataset above. What happens when we want to know how many unique values are returned? To get the count of the unique __AwayTeam__ names, we can use the __COUNT( )__ function together with the __DISTINCT__ clause. The query below finds all the distinct __AwayTeam__ names in the __Matches__ dataset and returns how many there are.

In [23]:
# Count the unique AwayTeams in the Matches Dataframe

cur.execute('''SELECT COUNT (DISTINCT AwayTeam)
               FROM Matches;''')
Matches_df =pd.DataFrame(cur.fetchall())
Matches_df.columns = [x[0] for x in cur.description]
Matches_df

Unnamed: 0,COUNT (DISTINCT AwayTeam)
0,128


Now we can see that there are 128 unique away teams in the __Matches__ dataset. We can also see that this column is named __COUNT (DISTINCT AwayTeam)__, which is a bit long and unclear. We can change the name of a column using an __AS__ clause. In the __SELECT__ statement, we can rename any column just by writing __AS__ after the data we are selecting and giving it a new name. The query below shows the number of unique away teams in the __Matches__ dataset, but this time the column header is __Num_of_AwayTeams__.

In [25]:
# Count the unique AwayTeams in the Matches Dataframe and rename the column

cur.execute('''SELECT COUNT (DISTINCT AwayTeam) AS Num_of_AwayTeams
               FROM Matches;''')
Matches_df =pd.DataFrame(cur.fetchall())
Matches_df.columns = [x[0] for x in cur.description]
Matches_df

Unnamed: 0,Num_of_AwayTeams
0,128


To practice using the __COUNT( )__ function and __DISTINCT__ clause together, query the number of distinct __Seasons__ included in the __Teams__ dataset and rename the column to __Num_of_Seasons__. Compare your query to the one below:

In [27]:
# Count of unique seasons in the Teams dataset

cur.execute('''SELECT COUNT(DISTINCT Season) AS Num_of_Seasons
               FROM Teams;''')
Teams_df =pd.DataFrame(cur.fetchall())
Teams_df.columns = [x[0] for x in cur.description]
Teams_df

Unnamed: 0,Num_of_Seasons
0,13
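
To make the difference between __COUNT(column)__ and __COUNT(DISTINCT column)__ concrete, here is a minimal sketch on a made-up in-memory table (`ToyMatches` and its rows are assumptions for illustration). It also shows that the names given with __AS__ are what `cursor.description` reports.

```python
import sqlite3

demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE ToyMatches (AwayTeam TEXT)")
demo_conn.executemany(
    "INSERT INTO ToyMatches VALUES (?)",
    [("Team A",), ("Team B",), ("Team A",), (None,)],  # one NULL, one repeat
)

demo_cur = demo_conn.execute(
    """SELECT COUNT(AwayTeam) AS Num_Rows,
              COUNT(DISTINCT AwayTeam) AS Num_of_AwayTeams
       FROM ToyMatches;"""
)
headers = [col[0] for col in demo_cur.description]  # aliases from the AS clauses
counts = demo_cur.fetchone()
print(headers)
print(counts)  # (3, 2): three non-null names, two of them unique
```

Both counts ignore the null row; only the __DISTINCT__ version also collapses the repeated name.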


# MAX( ) & MIN( ) Functions

We will continue to look at how data can be aggregated in the __SELECT__ statement before we get to the __GROUP BY__ statement. Next we will look at how to use the __MAX( )__ function and __MIN( )__ function. Perhaps we want to know the maximum stadium capacity included in the __Teams__ dataset. To do this, we can use the query below:

In [31]:
# Return the maximum StadiumCapacity in the Teams dataset

cur.execute('''SELECT MAX(StadiumCapacity)
               FROM Teams;''')
Teams_df =pd.DataFrame(cur.fetchall())
Teams_df.columns = [x[0] for x in cur.description]
Teams_df

Unnamed: 0,MAX(StadiumCapacity)
0,81359


The __MAX( )__ function returns the highest or maximum value in a specified column. In the example above, 81,359 was the largest value in the __StadiumCapacity__ column in the __Teams__ dataset. To see what the smallest value is in the __StadiumCapacity__ column, we can use the __MIN( )__ function. The __MIN( )__ function is used to return the minimum value of a column.

In [32]:
# Return the minimum StadiumCapacity in the Teams dataset

cur.execute('''SELECT MIN(StadiumCapacity)
               FROM Teams;''')
Teams_df =pd.DataFrame(cur.fetchall())
Teams_df.columns = [x[0] for x in cur.description]
Teams_df

Unnamed: 0,MIN(StadiumCapacity)
0,15000


We can see that the smallest stadium capacity included in the __Teams__ dataset is 15,000. To practice using the __MAX( )__ and __MIN( )__ functions, query the largest number of players in a squad (__KaderHome__) and rename the column to __Max_Players__, then query the smallest number of players in a squad and rename that column to __Min_Players__. Compare your query to the one below:

In [37]:
# Return the maximum and minimum squad count from the Teams table

cur.execute('''SELECT MAX(KaderHome) AS Max_Players,
                      MIN(KaderHome) AS Min_Players
               FROM Teams;''')
Teams_df =pd.DataFrame(cur.fetchall())
Teams_df.columns = [x[0] for x in cur.description]
Teams_df

Unnamed: 0,Max_Players,Min_Players
0,44,23
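
One detail worth knowing is how __MAX( )__ and __MIN( )__ treat missing data. The sketch below uses a made-up in-memory table (`ToyTeams` and its rows are assumptions for illustration): null values are skipped, and if no rows match at all, both functions return null, which Python shows as `None`.

```python
import sqlite3

demo_conn = sqlite3.connect(":memory:")
demo_conn.execute(
    "CREATE TABLE ToyTeams (TeamName TEXT, Season INTEGER, StadiumCapacity INTEGER)"
)
demo_conn.executemany(
    "INSERT INTO ToyTeams VALUES (?, ?, ?)",
    [("Team A", 2017, 75000), ("Team B", 2017, 30000), ("Team C", 2017, None)],
)

# NULL capacities are ignored when finding the largest and smallest value.
largest, smallest = demo_conn.execute(
    "SELECT MAX(StadiumCapacity), MIN(StadiumCapacity) FROM ToyTeams;"
).fetchone()

# No rows match this filter, so there is nothing to compare: both are NULL.
empty_result = demo_conn.execute(
    """SELECT MAX(StadiumCapacity), MIN(StadiumCapacity)
       FROM ToyTeams
       WHERE Season = 1900;"""
).fetchone()

print(largest, smallest)  # 75000 30000
print(empty_result)       # (None, None)
```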


# AVG( ) Function & SUM( ) Function

Now we will look at two more functions that can be used in the __SELECT__ statement. The __AVG( )__ function will return the average value of a specified column. We will continue to look at the __StadiumCapacity__ column in the __Teams__ dataset. The query below shows the average stadium capacity in the __Teams__ dataset.

In [38]:
# Return the average stadium capacity from the Teams table

cur.execute('''SELECT AVG(StadiumCapacity) AS Average_StadiumCapacity
               FROM Teams;''')
Teams_df =pd.DataFrame(cur.fetchall())
Teams_df.columns = [x[0] for x in cur.description]
Teams_df

Unnamed: 0,Average_StadiumCapacity
0,47728.094017


There are a lot of numbers following the decimal point. If we want to round to the hundredths place, we can wrap our __AVG( )__ function in a __ROUND( )__ function, like the query below:

In [39]:
# Return the average stadium capacity from the Teams table, rounded to two decimal places

cur.execute('''SELECT ROUND(AVG(StadiumCapacity), 2) AS Average_StadiumCapacity
               FROM Teams;''')
Teams_df =pd.DataFrame(cur.fetchall())
Teams_df.columns = [x[0] for x in cur.description]
Teams_df

Unnamed: 0,Average_StadiumCapacity
0,47728.09


A quick note on the __ROUND( )__ function. The format is as follows: __ROUND(number to be rounded, number of decimal places to round number to)__.
The number we wanted to round was the __AVG(StadiumCapacity)__ and we wanted to round to the nearest hundredth, or __2__ digits following the decimal point, therefore the format we used was __ROUND(AVG(StadiumCapacity), 2)__.
To read more about the __ROUND( )__ function, visit https://www.w3schools.com/sql/func_sqlserver_round.asp.

The __SUM( )__ function will return the sum of a specified column. For example, if we wanted to see the number of home goals scored in the __Matches__ dataset, we could use the following query:

In [40]:
# Show total number of home goals included in the Matches Dataframe

cur.execute('''SELECT SUM(FTHG) AS Total_HomeGoals
               FROM Matches;''')
Matches_df =pd.DataFrame(cur.fetchall())
Matches_df.columns = [x[0] for x in cur.description]
Matches_df

Unnamed: 0,Total_HomeGoals
0,37357


To practice using the __AVG( )__ and __SUM( )__ functions on your own, query from the __Teams__ dataset the average overall market value of the team pre-season in EUR, __OverallMarketValueHome__, round the value to the nearest hundredth, and label that column __Avg_Market_Val__. Also query the sum of the average age of players, __AvgAgeHome__, and rename that column to __Avg_Age__. Compare your query to the one below:

In [43]:
# Return the average overall market value of the team pre-season and round to the nearest hundredth,
# also return the sum of the average age of players column

cur.execute('''SELECT ROUND(AVG(OverallMarketValueHome), 2) AS Avg_Market_Val,
                      SUM(AvgAgeHome) AS Avg_Age
               FROM Teams;''')
Teams_df =pd.DataFrame(cur.fetchall())
Teams_df.columns = [x[0] for x in cur.description]
Teams_df

Unnamed: 0,Avg_Market_Val,Avg_Age
0,60808525.64,11492
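
To check our understanding of __AVG( )__, __SUM( )__ and __ROUND( )__ on numbers small enough to verify by hand, here is a minimal sketch on a made-up in-memory table (`ToyTeams` and its squad sizes are assumptions for illustration). Note that __AVG( )__ divides by the number of non-null values, not by the number of rows.

```python
import sqlite3

demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE ToyTeams (TeamName TEXT, KaderHome INTEGER)")
demo_conn.executemany(
    "INSERT INTO ToyTeams VALUES (?, ?)",
    [("Team A", 24), ("Team B", 31), ("Team C", None), ("Team D", 30)],
)

total, non_null, average, rounded = demo_conn.execute(
    """SELECT SUM(KaderHome),
              COUNT(KaderHome),
              AVG(KaderHome),
              ROUND(AVG(KaderHome), 2)
       FROM ToyTeams;"""
).fetchone()

# 24 + 31 + 30 = 85 across 3 non-null values, so the average is 85 / 3.
print(total, non_null)
print(average, rounded)
```

The null squad size is left out of both the sum and the average, so the average here is 85 / 3 rather than 85 / 4.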


# Bringing Back the WHERE Statement

If you did not read my last blog post about filtering data in the __WHERE__ statement, now may be a good time to take a look, https://medium.com/analytics-vidhya/introduction-to-sql-using-python-filtering-data-with-the-where-statement-80d89688f39e. We are now going to start aggregating filtered data.

What if you wanted to know how many goals Man United scored during home games on average? To get this answer, we could use the following query:

In [45]:
# Show average number of home goals scored by Man United in the Matches Dataframe

cur.execute('''SELECT ROUND(AVG(FTHG), 2) AS Avg_HomeGoals
               FROM Matches
               WHERE HomeTeam = 'Man United';''')
Matches_df =pd.DataFrame(cur.fetchall())
Matches_df.columns = [x[0] for x in cur.description]
Matches_df

Unnamed: 0,Avg_HomeGoals
0,2.16


By adding the __WHERE__ statement, we are telling our query to keep only the rows that have a __HomeTeam__ of 'Man United'. To understand this better, we will briefly discuss the SQL order of operations in relation to this query. Although the __SELECT__ statement is the first statement in our SQL query, the first statement to be executed is the __FROM__ statement. Our __FROM__ statement above tells SQL that we are using the __Matches__ dataset as our table. The next statement executed is the __WHERE__ statement. The __WHERE__ statement filters our dataset to include only the information we want; in this case, we only wanted the rows where __HomeTeam = 'Man United'__ to be included. Lastly, the __SELECT__ statement is executed, and it returns the data in the form we specify. In this case, we want to return the rounded average of home goals scored, calculated from the rows that the __WHERE__ statement kept.

Practice aggregating and using the __WHERE__ statement by writing a query that shows the sum of squad players for all teams during the 2014 season from the __Teams__ dataset. Rename the sum of squad players column to __Total_Soccer_Players__. When you are done, compare your query to the one below.

In [56]:
# Query the sum of squad players for all teams during the 2014 season from the Teams dataset

cur.execute('''SELECT SUM(KaderHome) AS Total_Soccer_Players
               FROM Teams
               WHERE Season = 2014;''')
Teams_df =pd.DataFrame(cur.fetchall())
Teams_df.columns = [x[0] for x in cur.description]
Teams_df

Unnamed: 0,Total_Soccer_Players
0,1164
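
The FROM, then WHERE, then SELECT order can be checked by hand on a small example. The sketch below uses a made-up in-memory table (`ToyMatches` and its rows are assumptions for illustration). It also passes the team name through a `?` placeholder, which `sqlite3` supports as an alternative to typing the value directly into the query string.

```python
import sqlite3

demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE ToyMatches (HomeTeam TEXT, FTHG INTEGER)")
demo_conn.executemany(
    "INSERT INTO ToyMatches VALUES (?, ?)",
    [("Team A", 2), ("Team A", 3), ("Team B", 0), ("Team A", 1)],
)

# 1. FROM picks the table, 2. WHERE keeps only Team A's three rows,
# 3. SELECT averages FTHG over those rows: (2 + 3 + 1) / 3 = 2.0
(avg_home_goals,) = demo_conn.execute(
    """SELECT ROUND(AVG(FTHG), 2) AS Avg_HomeGoals
       FROM ToyMatches
       WHERE HomeTeam = ?;""",
    ("Team A",),
).fetchone()
print(avg_home_goals)  # 2.0
```

Team B's match never reaches the __AVG( )__ function, because the __WHERE__ statement removed it before the __SELECT__ statement ran.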


# GROUP BY Statement & AVG( ) Function

Finally, we have reached the __GROUP BY__ statement. The __GROUP BY__ statement will group together the rows that have matching values in a specified column. To understand how the __GROUP BY__ statement works, we will discuss the query below.

In [60]:
# Return the total home games each team won during the 2016 
# season from the Matches dataset

cur.execute('''SELECT HomeTeam, COUNT(FTR) AS Total_Home_Wins
               FROM Matches
               WHERE FTR = 'H' AND Season = '2016'
               GROUP BY HomeTeam;''')
Teams_df =pd.DataFrame(cur.fetchall())
Teams_df.columns = [x[0] for x in cur.description]
Teams_df

Unnamed: 0,HomeTeam,Total_Home_Wins
0,Arsenal,14
1,Augsburg,5
2,Bayern Munich,13
3,Bielefeld,7
4,Bochum,7
5,Bournemouth,9
6,Braunschweig,13
7,Burnley,10
8,Chelsea,17
9,Crystal Palace,6


This query returns the total number of home games each team won during the 2016 season. There are two variables in our __SELECT__ statement, __HomeTeam__ and __COUNT(FTR)__. The first variable is an unaggregated column from the __Matches__ dataset, and the second variable is an aggregated column from the __Matches__ dataset. The __WHERE__ statement says to only use data where __FTR = 'H'__ (the outcome of a game is a home win) and __Season = '2016'__. The __GROUP BY__ statement tells the query to combine all the rows with the same home team name together. The result is one row per home team, and every home game that team won is counted in its __Total_Home_Wins__ value.

In [63]:
# Same query as above with Date added to the SELECT statement; Date is neither
# grouped nor aggregated, so SQLite returns it from an arbitrary row of each group

cur.execute('''SELECT HomeTeam, Date, COUNT(FTR) AS Total_Home_Wins
               FROM Matches
               WHERE FTR = 'H' AND Season = '2016'
               GROUP BY HomeTeam;''')
Teams_df =pd.DataFrame(cur.fetchall())
Teams_df.columns = [x[0] for x in cur.description]
Teams_df

Unnamed: 0,HomeTeam,Date,Total_Home_Wins
0,Arsenal,2016-09-10,14
1,Augsburg,2017-04-15,5
2,Bayern Munich,2017-04-08,13
3,Bielefeld,2017-03-17,7
4,Bochum,2017-04-28,7
5,Bournemouth,2016-09-10,9
6,Braunschweig,2017-04-10,13
7,Burnley,2016-08-20,10
8,Chelsea,2016-08-15,17
9,Crystal Palace,2016-09-18,6


Practice using a __GROUP BY__ statement by writing a query that shows the average number of foreign players for each team from the __Teams__ dataset. Rename the column for average number of foreign players to __Avg_Num_Foreign_Players__ and round the column to the nearest hundredth. Compare your query to the one below:

In [66]:
# Return the average number of foreign players for each team from the Teams table

cur.execute('''SELECT TeamName, ROUND(AVG(ForeignPlayersHome),2) AS Avg_Num_Foreign_Players
               FROM Teams
               GROUP BY TeamName;''')
Teams_df =pd.DataFrame(cur.fetchall())
Teams_df.columns = [x[0] for x in cur.description]
Teams_df

Unnamed: 0,TeamName,Avg_Num_Foreign_Players
0,Aachen,8.71
1,Aalen,7.0
2,Ahlen,8.67
3,Augsburg,14.83
4,Bayern Munich,14.85
5,Bielefeld,10.2
6,Bochum,11.92
7,Braunschweig,9.67
8,Burghausen,11.0
9,CZ Jena,13.0
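
It can help to see what __GROUP BY__ does by performing the same grouping by hand. The sketch below runs a grouped average on a made-up in-memory table (`ToyTeams` and its rows are assumptions for illustration) and then reproduces it in plain Python with a dictionary. This is only an illustration of the idea, not a description of how SQLite works internally.

```python
import sqlite3
from collections import defaultdict

toy_rows = [("Team A", 10), ("Team A", 15), ("Team B", 8)]

demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE ToyTeams (TeamName TEXT, ForeignPlayersHome INTEGER)")
demo_conn.executemany("INSERT INTO ToyTeams VALUES (?, ?)", toy_rows)

sql_result = demo_conn.execute(
    """SELECT TeamName, ROUND(AVG(ForeignPlayersHome), 2) AS Avg_Num_Foreign_Players
       FROM ToyTeams
       GROUP BY TeamName
       ORDER BY TeamName;"""
).fetchall()

# The same idea by hand: collect the values of each team into one bucket,
# then reduce every bucket to a single number.
buckets = defaultdict(list)
for team_name, foreign_players in toy_rows:
    buckets[team_name].append(foreign_players)

python_result = [
    (team_name, round(sum(values) / len(values), 2))
    for team_name, values in sorted(buckets.items())
]

print(sql_result)
print(python_result)
```

Each bucket corresponds to one group, and the aggregate function turns each bucket into one output row.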


# ORDER BY Statement

Looking at the result we got from the last query, we see that the results are organized by __TeamName__. What if we wanted to see the teams with the lowest average number of foreign players at the top of the list and the teams with the highest average number of foreign players at the bottom of the list? We can do this by using an __ORDER BY__ statement.

In [67]:
# Return the average number of foreign players for each team, lowest average first

cur.execute('''SELECT TeamName, ROUND(AVG(ForeignPlayersHome),2) AS Avg_Num_Foreign_Players
               FROM Teams
               GROUP BY TeamName
               ORDER BY ROUND(AVG(ForeignPlayersHome),2);''')
Teams_df =pd.DataFrame(cur.fetchall())
Teams_df.columns = [x[0] for x in cur.description]
Teams_df

Unnamed: 0,TeamName,Avg_Num_Foreign_Players
0,Kiel,3.0
1,Heidenheim,4.0
2,Oberhausen,6.33
3,Osnabruck,6.33
4,St Pauli,6.82
5,Aalen,7.0
6,Regensburg,7.5
7,Erzgebirge Aue,7.6
8,Sandhausen,8.33
9,Ahlen,8.67


The query above has an __ORDER BY__ statement. We tell the query to order the results by __ROUND(AVG(ForeignPlayersHome),2)__, repeating the expression from the __SELECT__ statement. Because the __ORDER BY__ statement is executed after the __SELECT__ statement, SQLite also recognizes the new column name, so __ORDER BY Avg_Num_Foreign_Players__ would return the same ordering. What if we wanted the highest average number of foreign players to be returned at the top of our results? We can add a __DESC__ clause to the end of the __ORDER BY ROUND(AVG(ForeignPlayersHome),2)__ statement. See how this works below:

In [71]:
# Return the average number of foreign players for each team, highest average first

cur.execute('''SELECT TeamName, ROUND(AVG(ForeignPlayersHome),2) AS Avg_Num_Foreign_Players
               FROM Teams
               GROUP BY TeamName
               ORDER BY ROUND(AVG(ForeignPlayersHome),2) DESC;''')
Teams_df =pd.DataFrame(cur.fetchall())
Teams_df.columns = [x[0] for x in cur.description]
Teams_df

Unnamed: 0,TeamName,Avg_Num_Foreign_Players
0,Wolfsburg,21.54
1,Hamburg,19.0
2,Schalke 04,17.69
3,Mainz,17.62
4,Hoffenheim,17.27
5,Hertha,17.0
6,Werder Bremen,16.38
7,M'gladbach,16.0
8,Hannover,15.92
9,Ein Frankfurt,15.85


Our results were now returned in DESCENDING order by the average number of foreign players for each team. If we do not put the __DESC__ clause at the end of an __ORDER BY__ statement, the results will be returned in ASCENDING order. You can put __ASC__ at the end of an __ORDER BY__ statement to return results in ascending order, but it is not necessary. To practice using the __ORDER BY__ statement, write a query that shows the number of games that ended in a draw for every season. Display the season with the highest number of draws at the top of the results and the season with the lowest number of draws at the bottom. Rename the column with the number of draws to __Num_of_Draws__. Compare your query with the one below:

In [75]:
# Show the number of draws for each season in the Matches Dataframe, most draws first

cur.execute('''SELECT Season, COUNT(FTR) AS Num_of_Draws
               FROM Matches
               WHERE FTR = 'D'
               GROUP BY Season
               ORDER BY COUNT(FTR) DESC;''')
Matches_df =pd.DataFrame(cur.fetchall())
Matches_df.columns = [x[0] for x in cur.description]
Matches_df

Unnamed: 0,Season,Num_of_Draws
0,1993,340
1,1994,322
2,1998,287
3,1995,286
4,1996,282
5,1997,279
6,2012,276
7,1999,274
8,2014,272
9,2003,267
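
A small experiment worth trying: in SQLite, the __ORDER BY__ statement can refer to a column either by repeating the expression or by using the name given with __AS__. The sketch below checks that both spellings give the same result on a made-up in-memory table (`ToyMatches` and its rows are assumptions for illustration).

```python
import sqlite3

demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE ToyMatches (Season INTEGER, FTR TEXT)")
demo_conn.executemany(
    "INSERT INTO ToyMatches VALUES (?, ?)",
    [(2015, "D"), (2015, "D"), (2015, "H"),
     (2016, "D"),
     (2017, "D"), (2017, "D"), (2017, "D")],
)

# Order by repeating the aggregate expression.
by_expression = demo_conn.execute(
    """SELECT Season, COUNT(FTR) AS Num_of_Draws
       FROM ToyMatches
       WHERE FTR = 'D'
       GROUP BY Season
       ORDER BY COUNT(FTR) DESC;"""
).fetchall()

# Order by the alias defined in the SELECT statement.
by_alias = demo_conn.execute(
    """SELECT Season, COUNT(FTR) AS Num_of_Draws
       FROM ToyMatches
       WHERE FTR = 'D'
       GROUP BY Season
       ORDER BY Num_of_Draws DESC;"""
).fetchall()

print(by_expression)
print(by_alias)
```

Using the alias keeps the query shorter and avoids having to update the expression in two places.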


# HAVING Statement

The __HAVING__ statement filters groups. It comes after the __GROUP BY__ statement and before the __ORDER BY__ statement. It works like the __WHERE__ statement, but while __WHERE__ filters individual rows before they are grouped, __HAVING__ filters the grouped rows after the aggregate functions have been calculated. To see how it works, we will run the same query we did before, but this time we only want to show the seasons where the number of draws was equal to or greater than 275.

In [76]:
# Show the seasons with at least 275 draws in the Matches Dataframe, most draws first

cur.execute('''SELECT Season, COUNT(FTR) AS Num_of_Draws
               FROM Matches
               WHERE FTR = 'D'
               GROUP BY Season
               HAVING COUNT(FTR) >= 275
               ORDER BY COUNT(FTR) DESC;''')
Matches_df =pd.DataFrame(cur.fetchall())
Matches_df.columns = [x[0] for x in cur.description]
Matches_df

Unnamed: 0,Season,Num_of_Draws
0,1993,340
1,1994,322
2,1998,287
3,1995,286
4,1996,282
5,1997,279
6,2012,276


Practice using the __HAVING__ statement by querying the sum of the average market value (per player) of the team pre-season in EUR (__AvgMarketValueHome__) for each season where that sum is at least 65,000,000. Compare your query to the one below:

In [82]:
# Show the sum of average market value for each season where the sum is at least 65,000,000

cur.execute('''SELECT Season, SUM(AvgMarketValueHome) AS Sum_Avg_Market_Val
               FROM Teams
               GROUP BY Season
               HAVING SUM(AvgMarketValueHome) >= 65000000;''')
Teams_df = pd.DataFrame(cur.fetchall())
Teams_df.columns = [x[0] for x in cur.description]
Teams_df

Unnamed: 0,Season,Sum_Avg_Market_Val
0,2010,65782000
1,2011,69956000
2,2012,73202000
3,2013,75985000
4,2014,84207000
5,2015,87008000
6,2016,84025000
7,2017,100466000
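
To see why we need both __WHERE__ and __HAVING__, here is a minimal sketch on a made-up in-memory table (`ToyMatches` and its rows are assumptions for illustration). __WHERE__ removes the non-draw rows before grouping, __HAVING__ removes whole seasons after counting, and trying to put the aggregate in the __WHERE__ statement makes SQLite raise an error.

```python
import sqlite3

demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE ToyMatches (Season INTEGER, FTR TEXT)")
demo_conn.executemany(
    "INSERT INTO ToyMatches VALUES (?, ?)",
    [(2015, "D"), (2015, "D"), (2015, "H"),
     (2016, "D"),
     (2017, "D"), (2017, "D"), (2017, "D")],
)

# WHERE filters rows (keep draws), HAVING filters groups (keep busy seasons).
busy_seasons = demo_conn.execute(
    """SELECT Season, COUNT(FTR) AS Num_of_Draws
       FROM ToyMatches
       WHERE FTR = 'D'
       GROUP BY Season
       HAVING COUNT(FTR) >= 2
       ORDER BY COUNT(FTR) DESC;"""
).fetchall()
print(busy_seasons)

# An aggregate cannot be used in WHERE, because WHERE runs before grouping.
try:
    demo_conn.execute(
        "SELECT Season FROM ToyMatches WHERE COUNT(FTR) >= 2 GROUP BY Season;"
    )
    aggregate_in_where_failed = False
except sqlite3.Error as err:
    aggregate_in_where_failed = True
    print("SQLite rejected the query:", err)
```

The 2016 season has only one draw, so it survives the __WHERE__ filter but is dropped by the __HAVING__ filter.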



In [None]:
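When using a __GROUP BY__ statement, every variable in the __SELECT__ statement should either be used to group the data together in the __GROUP BY__ statement, or be an aggregate function. The earlier query that added __Date__ to the __SELECT__ statement shows what happens otherwise: SQLite still runs the query, but the __Date__ returned for each team is taken from an arbitrary row in that team's group.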
