# Spark DataFrame Basics III - GroupBy and Aggregate functions

<p>Note: After downloading the Databricks notebook as an .ipynb file, the cell output loses its formatting. If you run this notebook on a Databricks cluster, the output is displayed as a table.</p>

<p>E.g.:</p>
<p>The following output:</p>
<p>+----+-------+ age| name| +----+-------+ null|Michael| 30| Andy| 19| Justin| +----+-------+</p>
<p>is actually rendered as:</p>
<pre>+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+</pre>

### Create session and load data

In [1]:
from pyspark.sql import SparkSession

In [2]:
spark = SparkSession.builder.appName('aggs').getOrCreate()

In [3]:
# Read the CSV, using the first row as the header and inferring column types
df = spark.read.csv('/FileStore/tables/sales_info.csv', inferSchema=True, header=True)

In [4]:
df.show()

In [5]:
df.printSchema()

### Use groupBy function

In [6]:
# groupBy returns a GroupedData object; nothing is computed until an aggregation is applied
df.groupBy('Company')

In [7]:
# Mean of every numeric column, per company
df.groupBy('Company').mean().show()
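
As a rough illustration of what `groupBy('Company').mean()` computes, the following plain-Python sketch groups rows by company and averages one numeric column. It does not use Spark. The rows and the `group_mean` helper are hypothetical and exist only for illustration, since the contents of `sales_info.csv` are not shown here.

```python
from collections import defaultdict

# Hypothetical rows standing in for a (Company, Person, Sales) table
rows = [
    {"Company": "A", "Person": "p1", "Sales": 100.0},
    {"Company": "A", "Person": "p2", "Sales": 300.0},
    {"Company": "B", "Person": "p3", "Sales": 50.0},
]


def group_mean(rows, key, value):
    """Return {group: mean of value}, mimicking groupBy(key).mean() for one column."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[row[key]].append(row[value])
    return {group: sum(values) / len(values) for group, values in buckets.items()}


means = group_mean(rows, "Company", "Sales")
print(means)
```

Spark does the same bucketing and averaging, but distributed across the cluster and only when an action such as `show()` is called.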

### Use aggregate function

In [8]:
# Aggregate over the whole DataFrame (no grouping): total of the Sales column
df.agg({'Sales': 'sum'}).show()

In [9]:
group_data = df.groupBy('Company')

In [10]:
# The dict maps a column name to the name of the aggregate function: max Sales per company
group_data.agg({'Sales': 'max'}).show()

### Use pyspark functions for descriptive statistics analysis of the data

In [11]:
from pyspark.sql.functions import countDistinct, avg, stddev

In [12]:
# Number of distinct values in the Sales column
df.select(countDistinct('Sales')).show()

In [13]:
# alias() renames the resulting column
df.select(avg('Sales').alias('Average Sales')).show()

In [14]:
# stddev is the sample standard deviation (alias for stddev_samp)
df.select(stddev('Sales')).show()
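
Spark's `stddev` is an alias for `stddev_samp`, the sample standard deviation (it divides by n - 1), while `stddev_pop` divides by n. The plain-Python sketch below shows the difference using the standard `statistics` module. The `sales` values are hypothetical and are not taken from `sales_info.csv`.

```python
import statistics

# Hypothetical values standing in for the Sales column
sales = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

# Divides by n - 1, analogous to Spark's stddev / stddev_samp
sample_std = statistics.stdev(sales)

# Divides by n, analogous to Spark's stddev_pop
population_std = statistics.pstdev(sales)

print(sample_std, population_std)
```

The sample version is always the larger of the two for the same data, and the gap shrinks as the number of rows grows.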

### Use format_number function

In [15]:
from pyspark.sql.functions import format_number

In [16]:
sales_std = df.select(stddev('Sales').alias('std'))
# format_number rounds to 2 decimal places and returns the value as a string
sales_std.select(format_number('std', 2).alias('std')).show()
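
`format_number(col, d)` returns a string column with thousands separators and `d` decimal places, so the result is meant for display rather than further arithmetic. A rough plain-Python analogue is the `",.2f"` format specification. The input values below are hypothetical.

```python
# Hypothetical unrounded standard deviation
std = 250.08742410799007

# Two decimal places with thousands separators, similar in spirit to format_number(col, 2)
formatted = format(std, ",.2f")
big = format(1234567.891, ",.2f")

print(formatted, big)
print(type(formatted).__name__)  # the result is a string, not a float
```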

### Use orderBy function

In [17]:
# Ascending order is the default
df.orderBy('Sales').show()

In [18]:
# Descending order requires a Column object, so use df['Sales'].desc()
df.orderBy(df['Sales'].desc()).show()
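
For comparison, the same two orderings can be sketched in plain Python with `sorted`. This is only an analogy for what `orderBy` returns, not how Spark sorts internally. The `(Company, Sales)` tuples are hypothetical.

```python
# Hypothetical (Company, Sales) rows
rows = [("A", 300.0), ("B", 50.0), ("C", 120.0)]

# Analogous to df.orderBy('Sales')
ascending = sorted(rows, key=lambda row: row[1])

# Analogous to df.orderBy(df['Sales'].desc())
descending = sorted(rows, key=lambda row: row[1], reverse=True)

print(ascending)
print(descending)
```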