### Using a self LEFT JOIN to find rows at the minimum or maximum for a group

This workbook will demonstrate techniques for finding the row corresponding to a group minimum or maximum. 

In [1]:
import pandas as pd

In [2]:
entries = [
    ['Bob',1,10], 
    ['Shawn',1,12], 
    ['Jill',1,17], 
    ['Jake',2,19],  
    ['Paul',2,16],
    ['Laura',2,19],
    ['Suzie',3,15],
    ['Jim',3,13],  
    ['Bob',3,11],
    ['Gene',3,15]
] 

headers = ['Name','dept_id','years_of_service']

In [3]:
df = pd.DataFrame(entries, columns=headers)

In [4]:
df

Unnamed: 0,Name,dept_id,years_of_service
0,Bob,1,10
1,Shawn,1,12
2,Jill,1,17
3,Jake,2,19
4,Paul,2,16
5,Laura,2,19
6,Suzie,3,15
7,Jim,3,13
8,Bob,3,11
9,Gene,3,15


In [5]:
from pandasql import sqldf
pysqldf = lambda q: sqldf(q, globals())

We can use an aggregation to get various calculations on years of service, including SUM, AVG, MIN, and MAX.

Here, we'll calculate the average for each department.

In [6]:
# average
pysqldf("""
SELECT 
    dept_id, AVG(years_of_service)
FROM
    df
GROUP BY
    dept_id
""")

Unnamed: 0,dept_id,AVG(years_of_service)
0,1,13.0
1,2,18.0
2,3,13.5


Alternatively, we can aggregate to find the maximum value for each group (dept_id).

In [7]:
pysqldf("""
SELECT 
    dept_id, MAX(years_of_service)
FROM
    df
GROUP BY
    dept_id
""")

Unnamed: 0,dept_id,MAX(years_of_service)
0,1,17
1,2,19
2,3,15


This query gives me the group and maximum, but what if I want the *name* of the longest serving person in each department in addition to the years of service?

#### Using a "bare column" aggregation 

SQLite will allow you to list a column in an aggregate query that is neither in the GROUP BY clause nor inside the aggregation function, but keep in mind that this isn't standard SQL and won't work in most other SQL databases.

In [8]:
pysqldf("""
SELECT 
    name, dept_id, MAX(years_of_service) as max_yos
FROM 
    df 
GROUP BY 
    dept_id;
""")

Unnamed: 0,Name,dept_id,max_yos
0,Jill,1,17
1,Jake,2,19
2,Suzie,3,15


Take a moment and consider what is happening with this query. This is allowed in SQLite (and I believe it is allowed in MySQL when its ONLY_FULL_GROUP_BY mode is turned off), but would not be allowed in most SQL databases, including BigQuery on Google Cloud.

In standard SQL, every column listed in the SELECT of an aggregate query must either appear inside an aggregation function or be part of the GROUP BY clause. This restriction is intuitive for many aggregation functions, such as AVG or SUM. If you were to select the average service years by department, it doesn't really make sense to try to include a name. Which one would you choose?

Here's an example. SQLite will allow us to run this query (many SQL databases will simply throw an error). Let's try it.

In [9]:
pysqldf("""
SELECT 
    name, dept_id, AVG(years_of_service) as avg_yos
FROM 
    df 
GROUP BY 
    dept_id;
""")

Unnamed: 0,Name,dept_id,avg_yos
0,Bob,1,13.0
1,Jake,2,18.0
2,Suzie,3,13.5


The query returns the name from the first row in each group, along with the dept_id and the average years of service for each department. The average is properly calculated, but the name is very misleading. In this case, the name comes from the first row of each group, but you shouldn't count on this behavior - think of the name instead as a value from an arbitrarily chosen row in each group.

The practice of picking a row can make sense for two aggregation functions - MIN and MAX, because while these are aggregation functions that operate on a group of rows, they return specific row values. In this case, SQLite will find the MIN or MAX value for the group and return a row with that value. Keep in mind, though, the MIN and MAX may not be unique! SQLite will pick *one* that matches, but won't list all the values at the min or max. 
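
A minimal sketch of that special case for MIN, using Python's built-in `sqlite3` module in place of pandasql (an assumption made so the snippet runs on its own; the table is recreated by hand with the same name and data as `df`). With these data the minimum in each department happens to be unique, so the bare `Name` column identifies the shortest-serving person:

```python
import sqlite3

# Same rows as the notebook's DataFrame, loaded into an in-memory SQLite table.
entries = [
    ('Bob', 1, 10), ('Shawn', 1, 12), ('Jill', 1, 17),
    ('Jake', 2, 19), ('Paul', 2, 16), ('Laura', 2, 19),
    ('Suzie', 3, 15), ('Jim', 3, 13), ('Bob', 3, 11), ('Gene', 3, 15),
]
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE df (Name TEXT, dept_id INTEGER, years_of_service INTEGER)')
conn.executemany('INSERT INTO df VALUES (?, ?, ?)', entries)

# Bare column "Name" next to MIN: SQLite takes Name from a row holding the minimum.
min_rows = conn.execute("""
SELECT
    Name, dept_id, MIN(years_of_service) AS min_yos
FROM
    df
GROUP BY
    dept_id
ORDER BY
    dept_id
""").fetchall()

print(min_rows)
```

As with MAX, this is SQLite-specific behavior and should not be expected to run on databases that reject bare columns.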

Take a look at the original dataframe - you'll see that for two of the departments (2 and 3), two employees share the maximum number of service years, but the bare-column MAX query above lists only one of them.
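
One way to see the ties directly is to count how many employees share each years-of-service value within a department. The sketch below uses the standard-library `sqlite3` module instead of pandasql (an assumption, so that it is self-contained) and keeps the top value per department in a hypothetical `ties` dictionary:

```python
import sqlite3

# Same rows as the notebook's DataFrame, loaded into an in-memory SQLite table.
entries = [
    ('Bob', 1, 10), ('Shawn', 1, 12), ('Jill', 1, 17),
    ('Jake', 2, 19), ('Paul', 2, 16), ('Laura', 2, 19),
    ('Suzie', 3, 15), ('Jim', 3, 13), ('Bob', 3, 11), ('Gene', 3, 15),
]
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE df (Name TEXT, dept_id INTEGER, years_of_service INTEGER)')
conn.executemany('INSERT INTO df VALUES (?, ?, ?)', entries)

# Count employees per (department, years of service), highest values first.
rows = conn.execute("""
SELECT
    dept_id, years_of_service, COUNT(*) AS n_employees
FROM
    df
GROUP BY
    dept_id, years_of_service
ORDER BY
    dept_id, years_of_service DESC
""").fetchall()

# Because of the ORDER BY, the first row seen for each department is its maximum.
ties = {}
for dept_id, yos, n_employees in rows:
    ties.setdefault(dept_id, (yos, n_employees))

print(ties)  # maps dept_id -> (max years of service, employees at that max)
```

Any department whose count is greater than 1 has a tie that a bare-column MAX query would hide.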

For a discussion of how SQLite handles the special case of MIN and MAX, see:

https://www.sqlite.org/lang_select.html#bareagg

#### Using a LEFT JOIN

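To include all rows that match the maximum (or minimum), as well as to write a portable query that will work in other database systems that don't allow bare columns in aggregations, you can use a LEFT OUTER self join. The join pairs each row with every row in the same department that has more years of service. Rows with no such partner come back with NULL on the right-hand side, and those rows are exactly the ones at the group maximum.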

In [10]:
# Keep each row "a" for which no row "b" in the same department has more years of service
pysqldf("""
SELECT 
    a.name, a.dept_id, a.years_of_service
FROM 
    df a
LEFT JOIN 
    df b
ON 
    a.dept_id = b.dept_id AND a.years_of_service < b.years_of_service
WHERE 
    b.years_of_service IS NULL
""")

Unnamed: 0,Name,dept_id,years_of_service
0,Jill,1,17
1,Jake,2,19
2,Laura,2,19
3,Suzie,3,15
4,Gene,3,15
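
The same pattern finds the minimum by flipping the comparison: keep each row for which no row in the same department has *fewer* years of service. A minimal sketch, run through Python's built-in `sqlite3` module rather than pandasql (an assumption made to keep it self-contained), with an `ORDER BY` added so the output order is deterministic:

```python
import sqlite3

# Same rows as the notebook's DataFrame, loaded into an in-memory SQLite table.
entries = [
    ('Bob', 1, 10), ('Shawn', 1, 12), ('Jill', 1, 17),
    ('Jake', 2, 19), ('Paul', 2, 16), ('Laura', 2, 19),
    ('Suzie', 3, 15), ('Jim', 3, 13), ('Bob', 3, 11), ('Gene', 3, 15),
]
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE df (Name TEXT, dept_id INTEGER, years_of_service INTEGER)')
conn.executemany('INSERT INTO df VALUES (?, ?, ?)', entries)

# "b" matches only rows in the same department with strictly fewer years.
# Rows of "a" with no such match get NULLs for "b" and are the group minimums.
min_rows = conn.execute("""
SELECT
    a.Name, a.dept_id, a.years_of_service
FROM
    df a
LEFT JOIN
    df b
ON
    a.dept_id = b.dept_id AND a.years_of_service > b.years_of_service
WHERE
    b.years_of_service IS NULL
ORDER BY
    a.dept_id, a.Name
""").fetchall()

print(min_rows)
```

Like the maximum query, this returns every row tied at the extreme. Note that both versions assume years_of_service is never NULL: a row with a NULL value would match nothing in the join and so would pass the WHERE filter as well.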


#### Discussion

* https://www.sqlite.org/lang_select.html#bareagg
* https://stackoverflow.com/questions/12102200/get-records-with-max-value-for-each-group-of-grouped-sql-results
* https://stackoverflow.com/questions/52013306/leaving-terms-out-of-the-group-by-in-sqlite