## Data Aggregation using GROUP BY and ORDER BY:

The `GROUP BY` clause is an optional component of the `SELECT` statement that allows you to group a selected set of rows into summary rows based on the values of one or more columns.

When you apply the `GROUP BY` clause, the query returns a single row for each group. Within each group, you can use aggregate functions such as `MIN`, `MAX`, `SUM`, `COUNT`, or `AVG` to compute a summary value for that group.

In [None]:
%load_ext sql

### Connect to the database

In [None]:
%sql mysql://root:root@localhost:3306/sql-training

### Grouping Data:

In SQL, grouping data means combining the rows of a table that share the same values in one or more columns. The result contains one summary row for each distinct value (or combination of values) in the specified column(s). Grouping is done with the `GROUP BY` clause, usually together with aggregate functions that compute a value per group.

**Syntax:**

The basic syntax for grouping data using the `GROUP BY` clause is as follows:

`SELECT column1, column2, aggregate_function(column3) FROM table_name GROUP BY column1, column2;`

**Description:**

-   **SELECT:** The `SELECT` keyword is used to indicate that we are retrieving data from the table.
    
-   **column1, column2:** These are the columns we want to group the data by. Rows with the same values in these columns will be combined into groups.
    
-   **aggregate_function(column3):** An aggregate function is applied to a specific column (column3) within each group to derive summary results. Examples of aggregate functions include `COUNT`, `SUM`, `AVG`, `MIN`, and `MAX`.
    
-   **FROM table_name:** Specifies the table from which we are retrieving the data.
    
-   **GROUP BY column1, column2:** The `GROUP BY` clause groups the data based on the values in column1 and column2. The result set will have one row for each unique combination of values in these columns.
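
The syntax above can be tried without a MySQL server. The following is a minimal sketch using Python's built-in `sqlite3` module; the table `readings`, its columns, and its values are made up for illustration and are not part of the `rch` data.

```python
import sqlite3

# In-memory SQLite database with a hypothetical table (not the rch data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (site INTEGER, yr INTEGER, flow REAL)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?, ?)",
    [(1, 1981, 10.0), (1, 1981, 30.0), (1, 1982, 5.0), (2, 1981, 7.0)],
)

# GROUP BY site, yr yields one row per unique (site, yr) combination.
# The rows are sorted in Python because GROUP BY alone does not promise an order.
grouped = sorted(
    conn.execute(
        "SELECT site, yr, SUM(flow) FROM readings GROUP BY site, yr"
    ).fetchall()
)
print(grouped)
```

Four input rows collapse into three output rows, because two of them share the same `(site, yr)` pair.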

Consider the table `rch` as a representative example.

####  Review the Table Columns First

Before manipulating or analyzing the data, examine the table's columns. Knowing the structure of the table and the data it contains lets you identify the columns needed for each operation and write accurate queries.

In [None]:
%sql SELECT * FROM rch LIMIT 3

 * mysql://root:***@localhost:3306/sql-training
3 rows affected.


RCH,YR,MO,FLOW_INcms,FLOW_OUTcms,EVAPcms,TLOSScms,SED_INtons,SED_OUTtons,SEDCONCmg_kg,ORGN_INkg,ORGN_OUTkg,ORGP_INkg,ORGP_OUTkg,NO3_INkg,NO3_OUTkg,NH4_INkg,NH4_OUTkg,NO2_INkg,NO2_OUTkg,MINP_INkg,MINP_OUTkg,CHLA_INkg,CHLA_OUTkg,CBOD_INkg,CBOD_OUTkg,DISOX_INkg,DISOX_OUTkg,SOLPST_INmg,SOLPST_OUTmg,SORPST_INmg,SORPST_OUTmg,REACTPSTmg,VOLPSTmg,SETTLPSTmg,RESUSP_PSTmg,DIFFUSEPSTmg,REACBEDPSTmg,BURYPSTmg,BED_PSTmg,BACTP_OUTct,BACTLP_OUTct,CMETAL_1kg,CMETAL_2kg,CMETAL_3kg,TOT_Nkg,TOT_Pkg,NO3ConcMg_l,WTMPdegc
1,1981,1,146.34377,146.25249,0.091280885,0.0,2.3320462e-07,61619.465,155.3719,0.016086288,0.0,0.04825888,0.0,362.04868,361.8135,203.62085,421.18378,0.0,23.018433,0.016107244,0.0,1.1839052e-11,0.0,0.0,0.0,5627225.0,5623486.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,806.01575,0.0,0.0,0.0
2,1981,1,96.22569,96.18285,0.042821284,0.0,1.6426765e-07,0.0,0.0,0.013631537,0.0,0.04089462,0.0,315.60052,315.45798,0.0,127.00502,0.0,0.0,0.01365605,0.0,4.1369722000000005e-16,0.0,0.0,0.0,3757606.5,3698301.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,442.463,0.0,0.0,0.0
3,1981,1,11.952719,11.861368,0.09135183,0.0,2.0325824e-07,2.0325824e-07,6.595061e-09,0.011466288,0.0,0.03439886,0.009118038,48.296375,47.931503,0.0,62.46762,0.0,0.0,0.011485105,0.0,5.941029e-14,0.0,0.0,0.0,360979.9,456115.9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,110.399124,0.009118038,0.0,0.0


#### Checking Unique Values:

To begin, we can count how many different values the `RCH` column contains. Placing the `DISTINCT` keyword inside `COUNT` eliminates duplicate entries, so each `RCH` value is counted only once.

In [None]:
%%sql
SELECT COUNT(DISTINCT RCH) AS nRCH
FROM rch

 * mysql://root:***@localhost:3306/sql-training
1 rows affected.


nRCH
23


The `GROUP BY` clause can list those unique values as well: grouping by `RCH` without any aggregate function returns one row per distinct `RCH` value.

In [None]:
%%sql
SELECT RCH
FROM rch
GROUP BY RCH

 * mysql://root:***@localhost:3306/sql-training
23 rows affected.


RCH
1
2
3
4
5
6
7
8
9
10
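
To make the relationship between `DISTINCT` and `GROUP BY` concrete, here is a small sketch with Python's built-in `sqlite3` module. The table `readings` and its contents are hypothetical, and SQLite stands in for MySQL here.

```python
import sqlite3

# Hypothetical table with repeated site values.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (site INTEGER, flow REAL)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?)",
    [(1, 10.0), (1, 30.0), (2, 7.0), (3, 2.5), (3, 4.0)],
)

# Both queries return each site once; sorting makes the comparison order-independent.
via_distinct = sorted(conn.execute("SELECT DISTINCT site FROM readings").fetchall())
via_group_by = sorted(conn.execute("SELECT site FROM readings GROUP BY site").fetchall())

# COUNT(DISTINCT ...) returns the number of unique values as a single number.
n_sites = conn.execute("SELECT COUNT(DISTINCT site) FROM readings").fetchone()[0]

print(via_distinct == via_group_by, n_sites)
```

`GROUP BY` becomes the better choice once an aggregate per group is needed, which `DISTINCT` alone cannot provide.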


#### Utilizing Aggregate Functions with Grouped Data
Aggregating within groups produces one summary value per group, which is more informative than a single aggregate computed over the entire column.

In [None]:
%%sql
SELECT RCH, AVG(FLOW_INcms), AVG(FLOW_OUTcms)
FROM rch
GROUP BY RCH

 * mysql://root:***@localhost:3306/sql-training
23 rows affected.


RCH,AVG(FLOW_INcms),AVG(FLOW_OUTcms)
1,104.300084,103.3526398
2,40.5485714,40.4861037
3,185.30486956,185.02848704
4,21.59333614,21.3818216
5,801.470258,796.166714
6,2522.582614,2521.323754
7,197.08561,196.4364072
8,1724.320878,1720.0342920000005
9,361.428025,361.01458
10,26.65798225,26.464232
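
The difference between a whole-column aggregate and a per-group aggregate can be sketched as follows, again using Python's `sqlite3` module with a made-up `readings` table (an assumption for illustration, not the `rch` data).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (site INTEGER, flow REAL)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?)",
    [(1, 10.0), (1, 30.0), (2, 80.0)],
)

# Without GROUP BY, AVG collapses the whole column into a single value.
overall = conn.execute("SELECT AVG(flow) FROM readings").fetchone()[0]

# With GROUP BY, AVG is computed separately for each site.
per_site = dict(
    conn.execute("SELECT site, AVG(flow) FROM readings GROUP BY site").fetchall()
)

print(overall, per_site)
```

The overall average hides the fact that the two sites behave very differently, which the per-site averages reveal.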


### Ordering Records:

To begin, we compute the highest `FLOW_INcms` for each combination of `RCH`, year, and month. This first query does not sort its results.

In [None]:
%%sql
SELECT RCH, YR, MO, MAX(FLOW_INcms)
FROM rch
GROUP BY RCH, YR, MO;

 * mysql://root:***@localhost:3306/sql-training
100 rows affected.


RCH,YR,MO,MAX(FLOW_INcms)
1,1981,1,146.34377
2,1981,1,96.22569
3,1981,1,11.952719
4,1981,1,49.486492
5,1981,1,274.0668
6,1981,1,486.71063
7,1981,1,23.575495
8,1981,1,215.61806
9,1981,1,193.72772
10,1981,1,53.946423


Without an explicit sort, the database does not guarantee the order in which rows are returned. The `ORDER BY` clause fixes this. It is placed near the end of a SQL statement, after any `WHERE`, `GROUP BY`, and `HAVING` clauses. With `ORDER BY`, we can arrange the query results in ascending or descending order by year and then by month.

In [None]:
%%sql
SELECT RCH, YR, MO, ROUND(MAX(FLOW_INcms), 2) AS max_flow
FROM rch
GROUP BY RCH, YR, MO
ORDER BY YR, MO;

 * mysql://root:***@localhost:3306/sql-training
100 rows affected.


RCH,YR,MO,max_flow
1,1981,1,146.34
2,1981,1,96.23
3,1981,1,11.95
4,1981,1,49.49
5,1981,1,274.07
6,1981,1,486.71
7,1981,1,23.58
8,1981,1,215.62
9,1981,1,193.73
10,1981,1,53.95


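By default, `ORDER BY` sorts in ascending order, which can also be requested explicitly with the `ASC` keyword. To sort in descending order, add the `DESC` keyword after the column name. The direction applies per column, so `ORDER BY YR DESC, MO` sorts years in descending order and months in ascending order within each year.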

In [None]:
%%sql
SELECT RCH, YR, MO, ROUND(MAX(FLOW_INcms), 2) AS max_flow
FROM rch
GROUP BY RCH, YR, MO
ORDER BY YR DESC, MO;

 * mysql://root:***@localhost:3306/sql-training
100 rows affected.


RCH,YR,MO,max_flow
1,1981,1,146.34
2,1981,1,96.23
3,1981,1,11.95
4,1981,1,49.49
5,1981,1,274.07
6,1981,1,486.71
7,1981,1,23.58
8,1981,1,215.62
9,1981,1,193.73
10,1981,1,53.95
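
Since the displayed rows cover only a single year and month, the effect of `DESC` is hard to see in the output above. The following sketch uses Python's `sqlite3` module and a hypothetical `months` table (not the `rch` data) to show mixed sort directions on a few rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE months (yr INTEGER, mo INTEGER)")
conn.executemany(
    "INSERT INTO months VALUES (?, ?)",
    [(1981, 2), (1982, 1), (1981, 1), (1982, 3)],
)

# Years newest first; within each year, months in the default ascending order.
ordered = conn.execute(
    "SELECT yr, mo FROM months ORDER BY yr DESC, mo"
).fetchall()
print(ordered)
```

Both 1982 rows come first, and inside each year the months still run from low to high.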


### Filtering Data on Groups with the HAVING Clause:

In certain situations, we need to filter records based on group-level or aggregated values. Although the initial inclination might be to use a WHERE clause, it will not work, because WHERE filters individual rows before any grouping or aggregation takes place. For instance, using WHERE to keep only results where MAX(FLOW_INcms) is greater than 3000 raises a MySQLdb.OperationalError: MySQL rejects an aggregate function inside WHERE, and referring to the aggregate by its SELECT alias fails too, because the alias is not a column that WHERE can see.

To address this scenario, we can utilize the HAVING clause in SQL. The HAVING clause is specifically designed to filter data after the grouping and aggregation process has taken place. It enables us to set conditions on the aggregated values to determine which groups to include in the final result.

**Example:**

Suppose we have a table named "water_flow" with columns "date," "flow_in_cms," and "location." If we want to find the locations where the maximum flow_in_cms is greater than 3000, we can use the HAVING clause as follows:


`SELECT location, MAX(flow_in_cms) AS max_flow FROM water_flow GROUP BY location HAVING MAX(flow_in_cms) > 3000;`

In this example, the GROUP BY clause groups the records by "location," and the MAX(flow_in_cms) aggregate function calculates the maximum flow_in_cms for each location. The HAVING clause filters the grouped results and only includes groups where the maximum flow_in_cms value is greater than 3000.
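
The `water_flow` example can be run as a minimal sketch with Python's `sqlite3` module. The table layout follows the description above; the rows and the location names `A` and `B` are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE water_flow (date TEXT, flow_in_cms REAL, location TEXT)"
)
conn.executemany(
    "INSERT INTO water_flow VALUES (?, ?, ?)",
    [
        ("1981-05-01", 8000.0, "A"),
        ("1981-06-01", 1200.0, "A"),
        ("1981-05-01", 2500.0, "B"),
        ("1981-06-01", 2900.0, "B"),
    ],
)

# HAVING is evaluated per group, after MAX has been computed for each location.
high_flow = conn.execute(
    "SELECT location, MAX(flow_in_cms) AS max_flow "
    "FROM water_flow GROUP BY location "
    "HAVING MAX(flow_in_cms) > 3000"
).fetchall()
print(high_flow)
```

Location `B` is dropped because its largest value, 2900, does not exceed the threshold.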

**Advantages:**

The HAVING clause provides several advantages when filtering data on groups:

1.  **Group-Specific Filtering:** It allows filtering based on aggregated values, considering the entire group's characteristics rather than individual records.
    
2.  **Aggregate Filtering:** The HAVING clause is tailored to work with aggregate functions, making it a suitable choice for filtering based on aggregated data.
    
3.  **Efficient Data Analysis:** It streamlines data analysis by enabling direct filtering of summarized information, reducing the need for complex subqueries or post-processing.
    

In summary, when we need to filter data based on the results of an aggregate function or grouped data, the HAVING clause proves to be the appropriate solution in SQL. It enables us to apply conditions on the aggregated values, leading to accurate and effective data filtering on grouped results.

Applied to the `rch` table, the following attempt filters on the alias `MAX_FLOWIN` in the WHERE clause and fails with an error:

In [None]:
%%sql
SELECT RCH, YR, MO, MAX(FLOW_INcms) as MAX_FLOWIN
FROM rch
WHERE MAX_FLOWIN > 3000.0
GROUP BY RCH
ORDER BY YR DESC, MO

 * mysql://root:***@localhost:3306/sql-training
(MySQLdb.OperationalError) (1054, "Unknown column 'MAX_FLOWIN' in 'where clause'")
[SQL: SELECT RCH, YR, MO, MAX(FLOW_INcms) as MAX_FLOWIN
FROM rch
WHERE MAX_FLOWIN > 3000.0
GROUP BY RCH
ORDER BY YR DESC, MO]
(Background on this error at: https://sqlalche.me/e/20/e3q8)


In such situations, we can utilize the HAVING clause to define a filtering condition for a group or an aggregate. The HAVING clause is an optional part of the SELECT statement, and it is commonly used in conjunction with the GROUP BY clause. The GROUP BY clause groups rows into summary rows or groups, and the HAVING clause then filters these groups based on specified conditions.

> The HAVING clause must come after the GROUP BY clause and before any ORDER BY clause in the SQL statement.

In [None]:
%%sql
SELECT RCH, YR, MO, MAX(FLOW_INcms) as MAX_FLOWIN
FROM rch
GROUP BY RCH, YR, MO
HAVING MAX(FLOW_INcms) > 3000.0
ORDER BY YR DESC, MO;

 * mysql://root:***@localhost:3306/sql-training
2 rows affected.


RCH,YR,MO,MAX_FLOWIN
6,1981,5,8009.7104
8,1981,5,6357.578
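
WHERE and HAVING can also appear in the same query: WHERE removes rows before grouping, and HAVING removes groups afterwards. The sketch below illustrates this with Python's `sqlite3` module and a hypothetical `readings` table. Note the assumption that SQLite stands in for MySQL, so the exact error raised for an aggregate in WHERE differs from the MySQL message.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (site INTEGER, mo INTEGER, flow REAL)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?, ?)",
    [(1, 1, 100.0), (1, 5, 4000.0), (2, 1, 3500.0), (2, 5, 200.0), (3, 5, 50.0)],
)

# WHERE keeps only the month-5 rows; HAVING then keeps the groups
# whose maximum among those remaining rows exceeds 3000.
kept = conn.execute(
    "SELECT site, MAX(flow) FROM readings "
    "WHERE mo = 5 GROUP BY site HAVING MAX(flow) > 3000"
).fetchall()

# An aggregate inside WHERE is rejected, because WHERE runs before grouping.
try:
    conn.execute("SELECT site FROM readings WHERE MAX(flow) > 3000 GROUP BY site")
    where_rejected = False
except sqlite3.Error:
    where_rejected = True

print(kept, where_rejected)
```

Site 2 has a value above 3000 in month 1, but that row is removed by WHERE before the groups are formed, so only site 1 remains.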


### Conclusion
Throughout this notebook, we explored various SQL techniques to enhance our query results and gain valuable insights from the data.

First, we used the DISTINCT keyword to count and retrieve unique, non-duplicate values in our queries.

Next, we aggregated and sorted data using GROUP BY and ORDER BY. The GROUP BY clause groups rows based on specific columns, and aggregate functions such as SUM(), MAX(), MIN(), AVG(), and COUNT() derive summary statistics for each group. The ORDER BY clause then arranges the results in ascending or descending order.

Finally, we used the HAVING clause to filter aggregated data with conditions that the WHERE clause cannot express.

Together, these SQL concepts provide the tools to summarize, sort, and filter grouped data effectively.