Simple PySpark test
===================

A Spark DataFrame is created in memory and saved as a Parquet file on the local file system.
When Spark saves a DataFrame, it writes a directory containing several Parquet part files.
To obtain a single Parquet file, the Spark DataFrame is therefore first converted to a Pandas
DataFrame, which is then saved as one file.
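
To make the contrast concrete, the sketch below inspects a path and reports whether it holds a Spark-style Parquet dataset (a directory of `part-*.parquet` files, usually with a `_SUCCESS` marker) or a single Parquet file such as the one Pandas writes. The helper name `describe_parquet_output` is hypothetical and not part of Spark or Pandas; it only looks at names on disk, using the standard library.

```python
from pathlib import Path


def describe_parquet_output(path: str) -> dict:
    """Report whether `path` is a single Parquet file or a directory of part files.

    Hypothetical helper for illustration: it checks file and directory names
    only and does not parse any Parquet content.
    """
    p = Path(path)
    if p.is_dir():
        # Spark writes a directory holding one part file per written
        # partition, usually alongside a _SUCCESS marker file.
        parts = sorted(f.name for f in p.glob("part-*.parquet"))
        return {
            "kind": "directory",
            "part_files": parts,
            "has_success_marker": (p / "_SUCCESS").exists(),
        }
    if p.is_file():
        # A single file, as produced by pandas.DataFrame.to_parquet.
        return {
            "kind": "single_file",
            "part_files": [p.name],
            "has_success_marker": False,
        }
    return {"kind": "missing", "part_files": [], "has_success_marker": False}
```

Calling `describe_parquet_output("../data/parquet/user-details.parquet")` after the notebook has run should report a single file, whereas pointing it at the output of a Spark write would typically report a directory.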


In [4]:
import sys, platform
print(sys.version)
print(platform.python_version())

3.11.11 (main, Dec  6 2024, 12:21:43) [Clang 16.0.0 (clang-1600.0.26.4)]
3.11.11


In [1]:
user_data_fp: str = "../data/parquet/user-details.parquet"

In [11]:
# Import Libraries
import pyspark.sql.types as T
from pyspark.sql import SparkSession

# Setup the Configuration
#conf = pyspark.SparkConf()

# Retrieve the Spark session
spark = SparkSession.builder.getOrCreate()
print(spark.version)

3.5.4


In [12]:
# Set up the schema
schema = T.StructType([
    T.StructField("User ID", T.IntegerType(), True),
    T.StructField("Username", T.StringType(), True),
    T.StructField("Browser", T.StringType(), True),
    T.StructField("OS", T.StringType(), True),
])

# Add data: one tuple per row, in schema order
data = [
    (1580, "Barry", "FireFox", "Windows"),
    (5820, "Sam", "MS Edge", "Linux"),
    (2340, "Harry", "Vivaldi", "Windows"),
    (7860, "Albert", "Chrome", "Windows"),
    (1123, "May", "Safari", "macOS"),
]

# Set up the DataFrame
user_data_df = spark.createDataFrame(data, schema=schema)

In [13]:
# Convert to a Pandas DataFrame (this collects all rows onto the driver)
user_data_pdf = user_data_df.toPandas()
user_data_pdf

   User ID Username  Browser       OS
0     1580    Barry  FireFox  Windows
1     5820      Sam  MS Edge    Linux
2     2340    Harry  Vivaldi  Windows
3     7860   Albert   Chrome  Windows
4     1123      May   Safari    macOS


In [7]:
# Pandas writes a single Parquet file (requires pyarrow or fastparquet)
user_data_pdf.to_parquet(user_data_fp)

In [8]:
%%sh
ls -lFh ../data/parquet/

total 8
-rw-r--r--@ 1 mac-DARNAU24  staff   3.1K Feb 13 17:12 user-details.parquet
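
Beyond listing the directory, a quick sanity check is possible without any Parquet library: an unencrypted Parquet file starts and ends with the 4-byte magic number `PAR1`. The sketch below relies on that property; the function name `looks_like_parquet` is hypothetical, and the check only confirms the file framing, not that the contents are readable.

```python
from pathlib import Path

# Magic number at both ends of an unencrypted Parquet file.
PARQUET_MAGIC = b"PAR1"


def looks_like_parquet(path: str) -> bool:
    """Return True if the file starts and ends with the Parquet magic bytes.

    Hypothetical helper for illustration; it does not validate the footer
    metadata or any column data.
    """
    p = Path(path)
    # Leading magic (4) + footer length (4) + trailing magic (4) = 12 bytes minimum.
    if not p.is_file() or p.stat().st_size < 12:
        return False
    with p.open("rb") as fh:
        head = fh.read(4)
        fh.seek(-4, 2)  # 2 = seek relative to the end of the file
        tail = fh.read(4)
    return head == PARQUET_MAGIC and tail == PARQUET_MAGIC
```

In the notebook, `looks_like_parquet(user_data_fp)` would be the natural call once the file has been written.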
