# Aggregates - SQL

References
- [Select - Databricks](https://docs.databricks.com/spark/latest/spark-sql/language-manual/select.html)
- [SQL Guide - Databricks](https://docs.databricks.com/spark/latest/spark-sql/index.html)

## Contents
1. Setup
1. Without grouping
1. With grouping
1. `order by` and `sort by` clauses
1. `having` and `where` clauses

## 1. Setup

The following code creates the `iris` table.

In [6]:
%sql 
drop table if exists iris;
create temporary table iris 
using CSV 
options(path="mnt/datalab-datasets/file-samples/iris.csv", 
        header=TRUE)

List the columns of the `iris` table.

In [8]:
%sql 
show columns in iris
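
To see how the columns were typed, the table can also be described. This is a minimal sketch rather than part of the original notebook: because the `create` statement above sets only `path` and `header`, Spark's CSV reader does not infer types, so every column is expected to be reported as `string`.

```sql
%sql
-- List each column of the iris table together with its data type.
describe iris
```

This matters for functions such as `min` and `max`, which compare strings character by character unless the column is cast to a numeric type.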

## 2. Without grouping

In [10]:
%sql
select count(*) as count,
       avg(SepalLength) as avg_SepalLength
from iris
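
Without a `group by` clause, the aggregate functions run over every row in the table, so the query returns a single row. Other aggregates can be added in the same way. The sketch below is an illustration, not part of the original notebook; the `cast` assumes `SepalLength` was loaded as a string, which is the default when the CSV reader is not asked to infer types.

```sql
%sql
-- Summarize the whole table in one row.
select count(*) as count,
       count(distinct Name) as distinct_Name,                -- number of distinct Name values
       min(cast(SepalLength as double)) as min_SepalLength,  -- cast so the comparison is numeric
       max(cast(SepalLength as double)) as max_SepalLength
from iris
```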

## 3. With grouping

Find the number of records and the average value of `SepalLength` for each of the three groups defined by the `Name` column.

In [14]:
%sql
select Name, 
       count(Name) as count_Name,
       avg(SepalLength) as avg_SepalLength
from iris
group by Name

Notice that there is one line for each of the three groups of rows (which correspond to the three unique values of the `Name` column).
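
Several aggregates can be computed per group in one query. As a sketch (the extra columns are illustrative additions): `count(*)` counts every row in the group, while `count(SepalLength)` counts only the rows where `SepalLength` is not null, so the two can differ when a column has missing values.

```sql
%sql
select Name,
       count(*) as count_rows,                   -- all rows in the group
       count(SepalLength) as count_SepalLength,  -- rows with a non-null SepalLength
       avg(SepalLength) as avg_SepalLength
from iris
group by Name
```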

## 4. `order by` and `sort by` clauses

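The `order by` clause sorts the entire result set, so the output has a single, total order. It is applied after `group by`, so it can sort on aggregated columns.

In Spark SQL, the `sort by` clause sorts rows only within each partition, so the overall order of the result is not guaranteed when the data spans more than one partition.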

In [18]:
%sql
select Name, 
       count(Name) as count_Name,
       avg(SepalLength) as avg_SepalLength
from iris
group by Name
order by avg_SepalLength

Notice that the resulting rows are sorted by `avg_SepalLength`.
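
The sort direction and tie-breaking can be controlled in the `order by` clause. A small sketch, not part of the original notebook: `desc` puts the largest average first, and `Name` is added as a second key to break ties between equal averages.

```sql
%sql
select Name,
       count(Name) as count_Name,
       avg(SepalLength) as avg_SepalLength
from iris
group by Name
order by avg_SepalLength desc,  -- largest average first
         Name                   -- tie-breaker, ascending by default
```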

Notice that `order by` is used instead of `sort by`. 
- `order by` guarantees a total order over the whole result
- `sort by` sorts rows only within each partition, so the overall result may not be fully ordered
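
For comparison, here is a hedged sketch of `sort by` on the ungrouped table. In Spark SQL, `sort by` orders rows within each partition only; a small single-file table like this one likely sits in one partition, so the output may look the same as with `order by`. The `cast` assumes `SepalLength` was loaded as a string.

```sql
%sql
-- Sort rows within each partition by numeric sepal length.
select Name, SepalLength
from iris
sort by cast(SepalLength as double)
```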

## 5. `having` and `where` clauses

The `having` and `where` clauses are used to filter rows. 
- `having` filters the rows produced by the `group by` clause, after the aggregates are computed
- `where` filters the rows read from the table, before they are grouped and aggregated

Notice the difference in the output below.

In [23]:
%sql
select Name,
  count(Name) as count_Name,
  avg(SepalLength) as avg_SepalLength
from iris
group by Name
having avg_SepalLength > 6

The output includes only one of the three rows produced by the `group by` clause.
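
The query above refers to the alias `avg_SepalLength` in the `having` clause, which Spark SQL accepts. Some other SQL engines do not resolve `select` aliases in `having`, so a more portable sketch repeats the aggregate expression instead:

```sql
%sql
select Name,
       count(Name) as count_Name,
       avg(SepalLength) as avg_SepalLength
from iris
group by Name
having avg(SepalLength) > 6  -- same filter, written without the alias
```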

In [25]:
%sql
select Name,
  count(Name) as count_Name,
  avg(SepalLength) as avg_SepalLength
from iris
where SepalLength < 6
group by Name

The output includes three rows (one for each group), but the groups only contain rows where the `SepalLength` is less than `6`. 

Compare this with the command below that doesn't contain the `where` clause.

In [27]:
%sql
select Name,
  count(Name) as count_Name,
  avg(SepalLength) as avg_SepalLength
from iris
group by Name

The average sepal length differs in two of the three groups.
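
The two filters can also be combined in one query. In this sketch (the threshold of 10 is an arbitrary illustration), `where` first drops rows with a sepal length of 6 or more, the remaining rows are grouped, and `having` then keeps only the groups that still contain at least 10 rows. The `cast` assumes `SepalLength` was loaded as a string.

```sql
%sql
select Name,
       count(Name) as count_Name,
       avg(SepalLength) as avg_SepalLength
from iris
where cast(SepalLength as double) < 6  -- row filter, applied before grouping
group by Name
having count(Name) >= 10               -- group filter, applied after aggregation
```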

__The End__