# Examples of using PySpark

04 03 25

---

Installation instructions for PySpark:

1) First make sure that the Java Development Kit (JDK) is installed. Run `java -version`; the version number should be above 11.
2) If Java is not installed, install it from the Developer Downloads link here: https://www.java.com/en/
3) The second code cell in this notebook sets a Java option that avoids the security manager error.
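The Java check in step 1 can also be done from Python. This is a minimal sketch, assuming `java` is on the `PATH` and reports its version in the usual `version "17.0.9"` form. The helper names `parse_java_major` and `installed_java_major` are hypothetical and are not part of PySpark.

```python
import re
import shutil
import subprocess


def parse_java_major(version_output):
    """Return the major Java version found in `java -version` output, or None."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    major = int(match.group(1))
    # Old releases report themselves as 1.x (for example 1.8 means Java 8).
    if major == 1 and match.group(2) is not None:
        return int(match.group(2))
    return major


def installed_java_major():
    """Run `java -version` and parse the result; None if java is not found."""
    if shutil.which("java") is None:
        return None
    # `java -version` normally writes to stderr, so both streams are checked.
    result = subprocess.run(["java", "-version"], capture_output=True, text=True)
    return parse_java_major(result.stderr + result.stdout)


print(installed_java_major())
```

A result of `None` suggests that Java is either missing or reporting its version in a format this sketch does not recognise.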

In [2]:
from pyspark.sql import SparkSession
import warnings

warnings.filterwarnings('ignore')

In [3]:
spark = SparkSession.builder \
    .appName("MyApp") \
    .config("spark.driver.extraJavaOptions", "-Djava.security.manager=allow") \
    .getOrCreate()

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/03/04 22:41:59 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


## Load some data

In this case, the data is the G-NAF Core PSV (pipe-separated values) file, available here: https://geoscape.com.au/solutions/g-naf/

In [4]:
df = (
    spark
    .read
    .options(delimiter='|')
    .option('header', 'true')
    .csv("/Users/alexlee/Desktop/Data/geo/gnaf_november2024/GNAF_CORE.psv")
)

## Analyse the data

### How many rows are there in the dataframe?

In [5]:
df.count()

15659449

### Show the first five rows

In [6]:
df.show(5)

+------------------+------------+--------------------+-----------------+---------------+---------+-----------+----------+------------+------------+-----------+----------+-----------+-----------+-------------+-------------+-----+--------+---------------+-----------+---------------+-------------+-----------------+--------------+-----------------+------------+------------+
|ADDRESS_DETAIL_PID|DATE_CREATED|       ADDRESS_LABEL|ADDRESS_SITE_NAME|  BUILDING_NAME|FLAT_TYPE|FLAT_NUMBER|LEVEL_TYPE|LEVEL_NUMBER|NUMBER_FIRST|NUMBER_LAST|LOT_NUMBER|STREET_NAME|STREET_TYPE|STREET_SUFFIX|LOCALITY_NAME|STATE|POSTCODE|LEGAL_PARCEL_ID|    MB_CODE|ALIAS_PRINCIPAL|PRINCIPAL_PID|PRIMARY_SECONDARY|   PRIMARY_PID|     GEOCODE_TYPE|   LONGITUDE|    LATITUDE|
+------------------+------------+--------------------+-----------------+---------------+---------+-----------+----------+------------+------------+-----------+----------+-----------+-----------+-------------+-------------+-----+--------+---------------+-----------+---------------+-------------+-----------------+--------------+-----------------+------------+------------+

25/03/04 22:42:09 WARN SparkStringUtils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
25/03/04 22:42:11 WARN GarbageCollectionMetrics: To enable non-built-in garbage collector(s) List(G1 Concurrent GC), users should configure it(them) to spark.eventLog.gcMetrics.youngGenerationGarbageCollectors or spark.eventLog.gcMetrics.oldGenerationGarbageCollectors


### How many addresses for each postcode?

In [7]:
(
    df
    .groupBy("POSTCODE")
    .count()
    .orderBy("count", ascending=False)
    .show(10)
)



+--------+------+
|POSTCODE| count|
+--------+------+
|    3000|118720|
|    4350| 72025|
|    3029| 69749|
|    2000| 62767|
|    2170| 59867|
|    3064| 56874|
|    3030| 55812|
|    4870| 53134|
|    4217| 52100|
|    4670| 51676|
+--------+------+
only showing top 10 rows

### Select the main components of the addresses and show the first five rows

In [8]:
(
    df
    .select(
        'ADDRESS_DETAIL_PID', 
        'NUMBER_FIRST', 
        'STREET_NAME', 
        'STREET_TYPE', 
        'LOCALITY_NAME', 
        'STATE', 
        'POSTCODE', 
        'LATITUDE', 
        'LONGITUDE')
    .show(5)
)

+------------------+------------+-----------+-----------+-------------+-----+--------+------------+------------+
|ADDRESS_DETAIL_PID|NUMBER_FIRST|STREET_NAME|STREET_TYPE|LOCALITY_NAME|STATE|POSTCODE|    LATITUDE|   LONGITUDE|
+------------------+------------+-----------+-----------+-------------+-----+--------+------------+------------+
|    GATAS702553725|          50|    PRINCES|     STREET|    SANDY BAY|  TAS|    7005|-42.89734575|147.32265805|
|    GATAS702448146|         113|     CHAPEL|     STREET|    GLENORCHY|  TAS|    7010|-42.84005781| 147.2654763|
|    GATAS702765642|           2|    PINKARD|     STREET|KINGS MEADOWS|  TAS|    7249|-41.47229116|147.16425791|
|    GATAS717990791|       12990|     TASMAN|    HIGHWAY|      SWANSEA|  TAS|    7190| -42.1557993|148.07629134|
|    GATAS702391018|          5A|       ALMA|     STREET|    BELLERIVE|  TAS|    7018|-42.87202074|147.37119623|
+------------------+------------+-----------+-----------+-------------+-----+--------+------------+------------+

### How many addresses are there in South Australia?

In [9]:
df.filter(df.STATE == "SA").count()

1145456