# Group By and Aggregation with Pyspark

> "Group By and Aggregation with Pyspark"

- toc: true
- branch: master
- badges: true
- comments: true
- author: David Kearney
- categories: [pyspark, jupyter]
- description: Group By and Aggregation with Pyspark
- title: Group By and Aggregation with Pyspark

## Read CSV and inferSchema

In [1]:
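from pyspark.sql import SparkSession
from pyspark.sql.functions import countDistinct, avg, stddev

# `spark` and `display` are predefined in Databricks notebooks.
# getOrCreate() returns that existing session there and builds a new one elsewhere.
spark = SparkSession.builder.getOrCreate()

# Load data from a CSV, using the first row as the header and inferring column types
file_location = "/FileStore/tables/df_panel_fix.csv"
df = spark.read.format("csv").option("inferSchema", True).option("header", True).load(file_location)
display(df.take(5))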


| _c0 | province | specific | general | year | gdp | fdi | rnr | rr | i | fr | reg | it |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Anhui | 147002.0 | | 1996 | 2093.3 | 50661 | 0.0 | 0.0 | 0.0 | 1128873 | East China | 631930 |
| 1 | Anhui | 151981.0 | | 1997 | 2347.32 | 43443 | 0.0 | 0.0 | 0.0 | 1356287 | East China | 657860 |
| 2 | Anhui | 174930.0 | | 1998 | 2542.96 | 27673 | 0.0 | 0.0 | 0.0 | 1518236 | East China | 889463 |
| 3 | Anhui | 285324.0 | | 1999 | 2712.34 | 26131 | | | | 1646891 | East China | 1227364 |
| 4 | Anhui | 195580.0 | 32100.0 | 2000 | 2902.09 | 31847 | 0.0 | 0.0 | 0.0 | 1601508 | East China | 1499110 |


In [2]:
df.printSchema()

## Using groupBy for Averages and Counts

In [3]:
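# groupBy on its own returns a GroupedData object; nothing is computed until an aggregation is applied
df.groupBy("province")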

In [4]:
# Mean of every numeric column within each province
df.groupBy("province").mean().show()

In [5]:
df.groupBy("reg").mean().show()

In [6]:
# Count
df.groupBy("reg").count().show()

In [7]:
# Max
df.groupBy("reg").max().show()

In [8]:
# Min
df.groupBy("reg").min().show()

In [9]:
# Sum
df.groupBy("reg").sum().show()

In [10]:
# Max of `specific` across the whole DataFrame (no grouping)
df.agg({'specific':'max'}).show()

In [11]:
# Max of `it` within each region
grouped = df.groupBy("reg")
grouped.agg({"it":'max'}).show()

In [12]:
df.select(countDistinct("reg")).show()

In [13]:
df.select(countDistinct("reg").alias("Distinct Region")).show()

In [14]:
df.select(avg('specific')).show()

In [15]:
df.select(stddev("specific")).show()

## Choosing Decimal Places with format_number

In [16]:
from pyspark.sql.functions import format_number


In [17]:
specific_std = df.select(stddev("specific").alias('std'))
specific_std.show()

In [18]:
specific_std.select(format_number('std',0)).show()

## Using orderBy

In [19]:
df.orderBy("specific").show()

In [20]:
df.orderBy(df["specific"].desc()).show()


This post includes code adapted from the [Spark and Python for Big Data Udemy course](https://udemy.com/course/spark-and-python-for-big-data-with-pyspark) and the [Spark and Python for Big Data notebooks](https://github.com/SuperJohn/spark-and-python-for-big-data-with-pyspark).