<a href="https://colab.research.google.com/github/ducline/edit-data_processing/blob/main/spark/examples/06-write_partitioning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Write
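- .write - entry point to the DataFrameWriter
- .format - output format (parquet, csv, json)
- .option / .options - format-specific settings, such as the delimiter and header for CSV
- spark.sql.sources.partitionOverwriteMode = dynamic - when overwriting, replace only the partitions present in the incoming DataFrame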

# Write Mode
- overwrite - Replaces any data that already exists at the target path (SaveMode.Overwrite in the Scala/Java API).
- append - Adds the new data to whatever already exists at the target path (SaveMode.Append).
- ignore - Silently skips the write when data already exists at the target path (SaveMode.Ignore).
- errorifexists or error - The default mode. Raises an error when data already exists at the target path (SaveMode.ErrorIfExists).

# Partitioning
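Partitioning organizes the data into multiple chunks based on the values of one or more columns.
Each partition is stored in its own sub-folder, named column=value.
Partitioning can improve read performance in Spark: a query that filters on the partition column only needs to read the matching sub-folders.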
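
To make the sub-folder layout concrete, here is a plain-Python sketch of the column=value directory structure. It is not Spark's implementation. The rows, the `write_partitioned` and `read_partition` helpers, and the `part-00000.csv` file name are all hypothetical and chosen only for illustration.

```python
import csv
import os
import tempfile
from collections import defaultdict

# Hypothetical rows; "date_part" plays the role of the partition column
rows = [
    {"name": "Ana", "date_part": "20240501"},
    {"name": "Bo", "date_part": "20240502"},
    {"name": "Cy", "date_part": "20240501"},
]


def write_partitioned(rows, base_dir, partition_col):
    """Group rows by partition value and write one sub-folder per value."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[partition_col]].append(row)
    for value, group in groups.items():
        folder = os.path.join(base_dir, f"{partition_col}={value}")
        os.makedirs(folder, exist_ok=True)
        # The partition value is kept in the folder name, not inside the file
        fields = [key for key in group[0] if key != partition_col]
        with open(os.path.join(folder, "part-00000.csv"), "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=fields, extrasaction="ignore")
            writer.writeheader()
            writer.writerows(group)


def read_partition(base_dir, partition_col, value):
    """Open only the sub-folder for one partition value."""
    path = os.path.join(base_dir, f"{partition_col}={value}", "part-00000.csv")
    with open(path, newline="") as f:
        return list(csv.DictReader(f))


base_dir = tempfile.mkdtemp()
write_partitioned(rows, base_dir, "date_part")

folders = sorted(os.listdir(base_dir))
print(folders)  # ['date_part=20240501', 'date_part=20240502']

may_first = read_partition(base_dir, "date_part", "20240501")
print(may_first)
```

A reader that needs a single day opens one folder and never touches the others, which is the idea behind skipping partitions at read time. It also suggests why a partition sub-folder read directly has no partition column: the value is encoded in the folder name.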

# Setting up PySpark

In [1]:
%pip install pyspark



In [2]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.master('local').appName('Spark Course').getOrCreate()

# Preparing data

In [3]:
!pip install faker

Collecting faker
  Downloading Faker-33.0.0-py3-none-any.whl.metadata (15 kB)
Downloading Faker-33.0.0-py3-none-any.whl (1.9 MB)
Installing collected packages: faker
Successfully installed faker-33.0.0


In [4]:
from faker import Faker
from datetime import datetime

fake = Faker()

users = []
for _ in range(50):
    user = {
        'date': fake.date_time_between_dates(datetime(2024, 5, 1), datetime(2024, 5, 5)),
        'name': fake.name(),
        'address': fake.address(),
        'email': fake.email(),
        'dob': fake.date_of_birth(),
        'phone': fake.phone_number()
    }
    users.append(user)

df = spark.createDataFrame(users)

df.show(10, False)


+-------------------------------------------------------+--------------------------+----------+-------------------------+------------------+---------------------+
|address                                                |date                      |dob       |email                    |name              |phone                |
+-------------------------------------------------------+--------------------------+----------+-------------------------+------------------+---------------------+
|4973 Amber Ranch Apt. 834\nStephensonmouth, CO 07625   |2024-05-03 11:37:52.167497|1918-06-08|odelacruz@example.org    |Dawn Ortega       |253.508.8977         |
|883 King Villages\nHamptonburgh, NJ 38378              |2024-05-03 09:31:21.718106|1993-02-26|yvega@example.com        |Arthur Mack       |2395131993           |
|700 Nicole Path\nSheilamouth, LA 62980                 |2024-05-03 20:57:51.270688|2014-11-29|torresrebecca@example.net|Justin Gonzalez   |9529406655           |
|02437 Luna Highway Ap

# Writing as PARQUET



In [5]:
# Writing as PARQUET with no partitions

path = "/content/write_partitioning/parquet_no_partitions"

df.write.mode("overwrite").format("parquet").save(path)

!ls /content/write_partitioning/parquet_no_partitions

spark.read.format("parquet").load(path).count()

part-00000-1fc4cfc2-4469-4c7b-b71c-207cdf78ddbc-c000.snappy.parquet  _SUCCESS


50

In [6]:
# Writing as PARQUET with partitions
from pyspark.sql.functions import col, date_format

path = "/content/write_partitioning/parquet_with_partitions"

# Creating partition column
df = df.withColumn("date_part", date_format(col("date"), "yyyyMMdd"))

# Enable dynamic partition overwrite: only the partitions present in the DataFrame are replaced
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

(df  # optionally add .where("date_part = '20240503'") here to rewrite a single partition
 .write
 .mode("overwrite")          # with dynamic mode, replaces only the incoming partitions (static mode would replace the whole path)
 .partitionBy("date_part")   # one sub-folder per distinct value of the column
 .format("parquet")          # output format
 .save(path))                # target path

!ls /content/write_partitioning/parquet_with_partitions

spark.read.format("parquet").load(path).count()

'date_part=20240501'  'date_part=20240502'  'date_part=20240503'  'date_part=20240504'


50

In [9]:
# Checking a single partition - reading the sub-folder directly drops the date_part column, which is stored in the folder name
spark.read.parquet("/content/write_partitioning/parquet_with_partitions/date_part=20240502").show()

+--------------------+--------------------+----------+--------------------+------------------+--------------------+
|             address|                date|       dob|               email|              name|               phone|
+--------------------+--------------------+----------+--------------------+------------------+--------------------+
|41718 Donald Isla...|2024-05-02 08:21:...|1929-08-07|candice41@example...|     Jeremy Holmes|001-950-609-8212x212|
|66919 Collins Sky...|2024-05-02 07:43:...|1928-11-14|vburnett@example.org| William Blackburn|    946.743.0437x584|
|222 Hannah Park A...|2024-05-02 09:19:...|1929-11-14|bennettanthony@ex...|      Heather Gray|+1-554-698-6296x1976|
|57547 Kimberly Da...|2024-05-02 22:44:...|2012-02-10|  okirby@example.net|    Olivia Johnson|001-570-547-8859x679|
|936 Christina Clu...|2024-05-02 01:57:...|1980-10-19|timothy34@example...|    Brittany White|        751.283.8413|
|0538 Thomas Locks...|2024-05-02 07:46:...|1990-11-30|  xgrant@example.o

# Writing as CSV

https://spark.apache.org/docs/3.5.1/sql-data-sources-csv.html

In [10]:
df.count()

50

In [11]:
path = "/content/write_partitioning/csv_no_partitioning/"

# write as csv
(df
  .write
  .format("csv")
  .mode("overwrite")
  .option("delimiter", "|")
  .option("header", True)
  .save(path))

# listing files in the folder
!ls /content/write_partitioning/csv_no_partitioning/

# read as csv
(spark
  .read
  .options(sep="|", multiLine=True, header=True)
  .csv(path)
  .count())

part-00000-0064c4ed-5cc5-4f4a-9d62-97d34bfc4e68-c000.csv  _SUCCESS


50

# Writing as JSON

https://spark.apache.org/docs/3.5.1/sql-data-sources-json.html

In [13]:
path = "/content/write_partitioning/json_no_partitioning/"

# write as json
(df
.write
.mode("overwrite")
.format("json")
.save(path))

# listing files in the folder
!ls /content/write_partitioning/json_no_partitioning/

# read as json
(spark
  .read
  .json(path)
  .count())

part-00000-056371c8-33df-400b-835c-28462a32bdad-c000.json  _SUCCESS


50

In [14]:
# reading json as text
spark.read.text(path).show(10, False)

+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|value                                                                                                                                                                                                                                |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|{"address":"4973 Amber Ranch Apt. 834\nStephensonmouth, CO 07625","date":"2024-05-03T11:37:52.167Z","dob":"1918-06-08","email":"odelacruz@example.org","name":"Dawn Ortega","phone":"253.508.8977","date_part":"20240503"}           |
|{"address":"883 King Villages\nHamptonburgh, NJ 38378","date":"2024-05-

In [15]:
# reading json as a DataFrame
spark.read.json(path).show(10, False)

+-------------------------------------------------------+------------------------+---------+----------+-------------------------+------------------+---------------------+
|address                                                |date                    |date_part|dob       |email                    |name              |phone                |
+-------------------------------------------------------+------------------------+---------+----------+-------------------------+------------------+---------------------+
|4973 Amber Ranch Apt. 834\nStephensonmouth, CO 07625   |2024-05-03T11:37:52.167Z|20240503 |1918-06-08|odelacruz@example.org    |Dawn Ortega       |253.508.8977         |
|883 King Villages\nHamptonburgh, NJ 38378              |2024-05-03T09:31:21.718Z|20240503 |1993-02-26|yvega@example.com        |Arthur Mack       |2395131993           |
|700 Nicole Path\nSheilamouth, LA 62980                 |2024-05-03T20:57:51.270Z|20240503 |2014-11-29|torresrebecca@example.net|Justin Gonzalez 

In [17]:
# partition json data + saveAsTable

# Creating partition column
df = df.withColumn("date_part", date_format(col("date"), "yyyyMMdd"))

# write as json
(df.write
  .partitionBy("date_part")
  .mode("overwrite")
  .format("json")
  .saveAsTable("tbl_json_part"))

# read the table back and count the rows
print(spark.table("tbl_json_part").count())

# list the table's partitions
spark.sql("show partitions tbl_json_part").show()

50
+------------------+
|         partition|
+------------------+
|date_part=20240501|
|date_part=20240502|
|date_part=20240503|
|date_part=20240504|
+------------------+



# Append Mode

In [27]:
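# Writing as PARQUET with APPEND - every run adds new files, so re-running this cell keeps growing the row count
# (the output below lists 10 part files and 500 rows, consistent with ten runs of 50 rows each)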

path = "/content/write_partitioning/parquet_append"

df.write.mode("append").format("parquet").save(path)

!ls /content/write_partitioning/parquet_append

spark.read.format("parquet").load(path).count()

part-00000-018b8989-6340-420c-9f1f-510eb18f63f6-c000.snappy.parquet
part-00000-02799fa2-e41b-4cdb-8ba6-3e6fb5fa1124-c000.snappy.parquet
part-00000-18c62a08-1d06-4fb0-948d-5147b17e84dd-c000.snappy.parquet
part-00000-1b715300-051a-4713-bdd7-45870cf73ea9-c000.snappy.parquet
part-00000-1dd1f85b-f385-4f43-b747-169e6984bd39-c000.snappy.parquet
part-00000-29fef1fd-0454-4699-91b2-82c720fac0e5-c000.snappy.parquet
part-00000-46410ea7-14c1-47a8-ad15-02887f0f92be-c000.snappy.parquet
part-00000-a57562f5-80c0-4e75-9fa2-35f8e396635e-c000.snappy.parquet
part-00000-f3ea985a-863e-4db5-95da-8d4ded8b7104-c000.snappy.parquet
part-00000-fa4b706b-9d71-4f99-ae7a-4c1d4f5283cb-c000.snappy.parquet
_SUCCESS


500