# Group by

Spark SQL supports typical "group by" operations. The tools for them are provided by the grouped data object (`GroupedData` in PySpark) that the data frame's `groupBy` method returns. This page discusses the options for using the grouped data object.

In [2]:
from pyspark.sql import SparkSession

spark_session = SparkSession.builder.appName('Temp').getOrCreate()

Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
25/09/19 15:34:52 WARN Utils: Your hostname, user-ThinkPad-E16-Gen-2, resolves to a loopback address: 127.0.1.1; using 10.202.22.210 instead (on interface enp0s31f6)
25/09/19 15:34:52 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/09/19 15:34:53 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
25/09/19 15:34:54 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
25/09/19 15:34:54 WARN Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.


## Direct methods

A set of methods each return one specific aggregation directly: `min`, `max`, `avg`, `mean`, and `count`. Each method does what its name says; `mean` is an alias for `avg`. By default, `min`, `max`, `avg`, and `mean` aggregate every numeric column, but you can pass column names to restrict the output to those columns. `count` takes no column arguments and returns the number of rows in each group.

---

The following cell defines the data frame and grouped data that will be used as the example.

In [11]:
df = spark_session.createDataFrame(
    [
        ("a", 10, 7, 9),
        ("a", 18, 3, 1),
        ("b", 12, 9, 1),
        ("b", 15, 7, 0),
        ("c", 4, 9, 12),
        ("c", 12, 15, 5)
    ],
    schema=["group", "value1", "value2", "value3"]
)
gb = df.groupBy("group")

The following cell applies the `min` method without specifying which columns to use, so it is computed for every numeric column.

In [12]:
gb.min().show()

+-----+-----------+-----------+-----------+
|group|min(value1)|min(value2)|min(value3)|
+-----+-----------+-----------+-----------+
|    a|         10|          3|          1|
|    b|         12|          7|          0|
|    c|          4|          9|          5|
+-----+-----------+-----------+-----------+
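
To make the semantics of the table above concrete, the following is a minimal plain-Python sketch of a per-group, column-wise minimum over the same rows. It is not how Spark computes the result; the helper name `group_min` and the variable `rows` are hypothetical and used only for illustration.

```python
from collections import defaultdict

# Same rows as the example data frame: (group, value1, value2, value3).
rows = [
    ("a", 10, 7, 9),
    ("a", 18, 3, 1),
    ("b", 12, 9, 1),
    ("b", 15, 7, 0),
    ("c", 4, 9, 12),
    ("c", 12, 15, 5),
]


def group_min(rows):
    """Hypothetical helper: column-wise minimum of the values for each group key."""
    buckets = defaultdict(list)
    for key, *values in rows:
        buckets[key].append(values)
    # zip(*group_rows) transposes the rows of one group into columns.
    return {
        key: tuple(min(column) for column in zip(*group_rows))
        for key, group_rows in buckets.items()
    }


result = group_min(rows)
for key in sorted(result):
    print(key, result[key])
```

Each printed tuple should correspond to one row of the `gb.min()` output, in the order `value1`, `value2`, `value3`.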



The following cell applies the `max` method to the `value1` column only:

In [14]:
gb.max("value1").show()

+-----+-----------+
|group|max(value1)|
+-----+-----------+
|    a|         18|
|    b|         15|
|    c|         12|
+-----+-----------+



The following cell applies the `avg` method to the `value1` and `value2` columns:

In [18]:
gb.avg("value1", "value2").show()

+-----+-----------+-----------+
|group|avg(value1)|avg(value2)|
+-----+-----------+-----------+
|    a|       14.0|        5.0|
|    b|       13.5|        8.0|
|    c|        8.0|       12.0|
+-----+-----------+-----------+

