# PySpark DuckLake Example

This example notebook shows how to read a DuckLake table with [PySpark](https://spark.apache.org/docs/latest/api/python/index.html) through the DuckDB JDBC driver.

Install PySpark:

In [None]:
!pip install pyspark

Import the `pyspark` package:

In [2]:
import pyspark

Specify the DuckLake URI and table name:

In [3]:
ducklake_uri = "metadata.ducklake"
table_name = "penguins"

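Create a Spark session. The `spark.jars.packages` setting resolves the DuckDB JDBC driver from its Maven coordinates, and the `-Djava.security.manager=allow` option permits a security manager to be installed at runtime, which Java 18 and later disallow by default: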

In [4]:
spark = (
    pyspark.sql.SparkSession.builder
        .config("spark.jars.packages", "org.duckdb:duckdb_jdbc:1.3.1.0")
        .config("spark.driver.defaultJavaOptions", "-Djava.security.manager=allow")
        .getOrCreate()
)
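
The same settings can be kept in a dictionary and applied in a loop, which may be easier to maintain as the list of options grows. This is a minimal sketch: `SPARK_CONF` and `apply_conf` are hypothetical names introduced here for illustration, not part of PySpark. The only PySpark call it relies on is `builder.config(key, value)`.

```python
# Settings copied from the cell above; the dictionary name is an assumption.
SPARK_CONF = {
    "spark.jars.packages": "org.duckdb:duckdb_jdbc:1.3.1.0",
    "spark.driver.defaultJavaOptions": "-Djava.security.manager=allow",
}


def apply_conf(builder, conf):
    """Call builder.config(key, value) for each entry and return the final builder."""
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder


# With PySpark installed, the helper would be used like this:
# spark = apply_conf(pyspark.sql.SparkSession.builder, SPARK_CONF).getOrCreate()
```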

Create a PySpark DataFrame from the DuckLake table over JDBC:

In [5]:
df = (
    spark.read.format("jdbc")
        .option("driver", "org.duckdb.DuckDBDriver")
        .option("url", f"jdbc:duckdb:ducklake:{ducklake_uri}")
        .option("dbtable", table_name)
        .load()
)
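
The reader options above can also be collected in one place. The sketch below assumes a hypothetical helper named `ducklake_jdbc_options` (not part of PySpark or DuckLake) that builds the same three options as a dictionary, so they can be passed to `DataFrameReader.options` in a single call.

```python
def ducklake_jdbc_options(ducklake_uri: str, table_name: str) -> dict[str, str]:
    """Build the JDBC reader options used in this notebook (hypothetical helper)."""
    return {
        "driver": "org.duckdb.DuckDBDriver",
        "url": f"jdbc:duckdb:ducklake:{ducklake_uri}",
        "dbtable": table_name,
    }


options = ducklake_jdbc_options("metadata.ducklake", "penguins")
print(options["url"])

# With an active Spark session, the dictionary can be unpacked into the reader:
# df = spark.read.format("jdbc").options(**options).load()
```

Keeping the URL construction in one function makes it straightforward to read several tables from the same DuckLake without repeating the driver and prefix.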

Preview the DataFrame:

In [6]:
df.show(10)

+-------+---------+--------------+-------------+-----------------+-----------+------+----+
|species|   island|bill_length_mm|bill_depth_mm|flipper_length_mm|body_mass_g|   sex|year|
+-------+---------+--------------+-------------+-----------------+-----------+------+----+
| Adelie|Torgersen|          39.1|         18.7|              181|       3750|  male|2007|
| Adelie|Torgersen|          39.5|         17.4|              186|       3800|female|2007|
| Adelie|Torgersen|          40.3|           18|              195|       3250|female|2007|
| Adelie|Torgersen|            NA|           NA|               NA|         NA|    NA|2007|
| Adelie|Torgersen|          36.7|         19.3|              193|       3450|female|2007|
| Adelie|Torgersen|          39.3|         20.6|              190|       3650|  male|2007|
| Adelie|Torgersen|          38.9|         17.8|              181|       3625|female|2007|
| Adelie|Torgersen|          39.2|         19.6|              195|       4675|  male|2007|

Get the count for each species:

In [7]:
df.groupby("species").count().show()

+---------+-----+
|  species|count|
+---------+-----+
|   Gentoo|  124|
|   Adelie|  152|
|Chinstrap|   68|
+---------+-----+



Register the DataFrame as a temporary view and query it with SQL:

In [8]:
df.createOrReplaceTempView(table_name)
spark.sql(f"SELECT count(*) FROM {table_name}").show()

+--------+
|count(1)|
+--------+
|     344|
+--------+
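
Because the query string is built with an f-string, the table name is interpolated into the SQL text as-is. If the name could ever come from outside the notebook, it may be worth checking it first. The sketch below assumes a deliberately conservative rule (letters, digits, and underscores only); `count_query` is a hypothetical helper, not a Spark API.

```python
import re

# Conservative pattern: a letter or underscore, then letters, digits, or underscores.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def count_query(table_name: str) -> str:
    """Return a row-count query for table_name, rejecting names outside the pattern."""
    if not _IDENTIFIER.fullmatch(table_name):
        raise ValueError(f"unsafe table name: {table_name!r}")
    return f"SELECT count(*) FROM {table_name}"


print(count_query("penguins"))

# With an active Spark session:
# spark.sql(count_query(table_name)).show()
```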



Stop the Spark session, which also stops the underlying `SparkContext`:

In [9]:
spark.stop()