# Group by

Spark SQL supports typical "group by" operations. The corresponding tools are provided by the grouped data object (`GroupedData` in PySpark) that the data frame's `groupBy` method returns. This page discusses the options for using the grouped data object.

In [1]:
from pyspark.sql import SparkSession

spark_session = SparkSession.builder.appName('Temp').getOrCreate()



## Direct methods

There is a set of methods that each return one specific aggregation: `min`, `max`, `sum`, `avg`, `mean`, and `count`. Each method computes exactly what its name says. `count` takes no arguments and returns the number of rows in each group. The other methods aggregate every numeric column by default, but you can pass column names to restrict the output to those columns.

---

The following cell defines the data frame and grouped data that will be used as the example.

In [11]:
df = spark_session.createDataFrame(
    [
        ("a", 10, 7, 9),
        ("a", 18, 3, 1),
        ("b", 12, 9, 1),
        ("b", 15, 7, 0),
        ("c", 4, 9, 12),
        ("c", 12, 15, 5) 
    ],
    schema=['group', "value1", "value2", "value3"]
)
gb = df.groupBy("group")

The following cell applies the `min` function without specifying which columns to use, so every numeric column is aggregated.

In [12]:
gb.min().show()

+-----+-----------+-----------+-----------+
|group|min(value1)|min(value2)|min(value3)|
+-----+-----------+-----------+-----------+
|    a|         10|          3|          1|
|    b|         12|          7|          0|
|    c|          4|          9|          5|
+-----+-----------+-----------+-----------+



Here the `max` function is applied to the `value1` column only:

In [14]:
gb.max("value1").show()

+-----+-----------+
|group|max(value1)|
+-----+-----------+
|    a|         18|
|    b|         15|
|    c|         12|
+-----+-----------+



Several columns can be passed at once. Here `avg` is applied to `value1` and `value2`:

In [18]:
gb.avg("value1", "value2").show()

+-----+-----------+-----------+
|group|avg(value1)|avg(value2)|
+-----+-----------+-----------+
|    a|       14.0|        5.0|
|    b|       13.5|        8.0|
|    c|        8.0|       12.0|
+-----+-----------+-----------+



## Agg

The `agg` method of the grouped data object provides general aggregations. You only need to list the expressions that instruct Spark what to compute. The following table lists the functions that can be used to design an aggregation:

| Function | Description |
|---------|-------------|
| `count(col)` | Number of non-null values in the given column. |
| `countDistinct(col, *cols)` | Count of distinct values across one or more columns. |
| `approx_count_distinct(col, rsd=0.05)` | Approximate count of distinct values using HyperLogLog (faster than `countDistinct`). |
| `sum(col)` | Sum of values in a column. |
| `sumDistinct(col)` | Sum of distinct values in a column (deprecated in recent Spark versions in favor of `sum_distinct`). |
| `avg(col)` / `mean(col)` | Average (mean) of column values. |
| `max(col)` | Maximum value in the column. |
| `min(col)` | Minimum value in the column. |
| `first(col, ignorenulls=False)` | First value in the group. |
| `last(col, ignorenulls=False)` | Last value in the group. |
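| `collect_list(col)` | Collects values into an array column (duplicates preserved); the array is returned as a Python list when collected. |
| `collect_set(col)` | Collects unique values into an array column (duplicates removed); also returned as a Python list, with no guaranteed order. |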
| `variance(col)` / `var_samp(col)` | Sample variance of values in the group. |
| `var_pop(col)` | Population variance of values in the group. |
| `stddev(col)` / `stddev_samp(col)` | Sample standard deviation of values in the group. |
| `stddev_pop(col)` | Population standard deviation of values in the group. |
| `corr(col1, col2)` | Pearson correlation coefficient between two columns. |
| `covar_samp(col1, col2)` | Sample covariance between two columns. |
| `covar_pop(col1, col2)` | Population covariance between two columns. |
| `skewness(col)` | Skewness of values in the group. |
| `kurtosis(col)` | Kurtosis of values in the group. |
| `approx_percentile(col, percentage, accuracy=10000)` | Approximate percentile of column values (for quantile analysis). |
| `bit_and(col)` | Bitwise AND of all values in the group. |
| `bit_or(col)` | Bitwise OR of all values in the group. |
| `bit_xor(col)` | Bitwise XOR of all values in the group. |
| `mode(col)` | Returns the most frequent value (mode) in the column. |

---

The following cell defines a data frame that will be used as an example:

In [2]:
df = spark_session.createDataFrame(
    [
        ("a", 10),
        ("a", 18),
        ("b", 12),
        ("b", 15),
        ("c", 4),
        ("c", 12) 
    ],
    schema=['group', "value"]
)
gb = df.groupBy("group")

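The following cell calls the `agg` method with several aggregation functions and uses `alias` to rename one of the resulting columns. Every value in this data occurs only once within its group, so all values tie for the mode and `mode` returns one of them.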

In [8]:
from pyspark.sql import functions as F

gb.agg(
    F.sum("value"),
    F.avg("value").alias("new name"),
    F.mode("value")
).show()

+-----+----------+--------+-----------+
|group|sum(value)|new name|mode(value)|
+-----+----------+--------+-----------+
|    a|        28|    14.0|         18|
|    b|        27|    13.5|         15|
|    c|        16|     8.0|          4|
+-----+----------+--------+-----------+

