# Loading Spark dataframes and Delta tables as Hugging Face datasets

This example notebook demonstrates how to create Hugging Face datasets from Spark dataframes and Delta tables.

## Initialize Spark session

In [None]:
import pyspark
from delta import configure_spark_with_delta_pip

builder = (
    pyspark.sql.SparkSession.builder.appName("huggingface-from-spark")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config(
        "spark.sql.catalog.spark_catalog",
        "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    )
    .config("spark.executor.memory", "8g")
)

spark = configure_spark_with_delta_pip(builder).getOrCreate()

## Basic usage
As shown in the [Databricks blog post](https://www.databricks.com/blog/contributing-spark-loader-for-hugging-face-datasets) and the [Hugging Face documentation](https://huggingface.co/docs/datasets/use_with_spark), creating a Hugging Face `Dataset` from a Spark DataFrame, or from a Delta table loaded into a DataFrame, is straightforward. We start with a small sample DataFrame:

In [None]:
# Create sample Spark DataFrame
df = spark.createDataFrame([(1, "Hello"), (2, "World")])
type(df)

To load the Spark DataFrame into a Hugging Face `Dataset`, all we need to do is call the `Dataset.from_spark()` class method.

In [None]:
from datasets import Dataset, DatasetDict

dataset = Dataset.from_spark(df)

## Delta table example
Let's try something more substantial. We'll download the full MNIST dataset, convert it to a Delta table, and then load it as a Hugging Face dataset.

In [None]:
from torchvision import datasets
from pyspark.sql.types import (
    IntegerType,
    StructType,
    StructField,
    FloatType,
    BinaryType,
)
import numpy as np

train_set = datasets.MNIST(root="./data", train=True, download=True)
test_set = datasets.MNIST(root="./data", train=False, download=True)

# Define the schema for the Spark DataFrame
schema = StructType(
    [
        StructField("id", IntegerType(), False),
        StructField("label", FloatType(), False),
        # Each image is stored as raw bytes (28 * 28 uint8 pixels)
        StructField("features", BinaryType(), False),
    ]
)

# Convert each PIL image to a NumPy array and serialize it to raw bytes
train_data = [
    (i, float(y), bytearray(np.array(x))) for i, (x, y) in enumerate(train_set)
]
train_df = spark.createDataFrame(train_data, schema).repartition(50)

test_data = [
    (i, float(y), bytearray(np.array(x))) for i, (x, y) in enumerate(test_set)
]
test_df = spark.createDataFrame(test_data, schema).repartition(50)

# Write the DataFrame to Delta Lake format
train_df.write.format("delta").mode("overwrite").option(
    "overwriteSchema", "true"
).save("./data/mnist_delta/train")
test_df.write.format("delta").mode("overwrite").option(
    "overwriteSchema", "true"
).save("./data/mnist_delta/test")
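
A `BinaryType` column holds only raw bytes, so an image's shape and dtype are not stored with it and have to be supplied again when decoding. The sketch below illustrates this with a synthetic 28x28 `uint8` array standing in for an MNIST image. The helper names `encode_image` and `decode_image` are hypothetical and not part of the notebook.

```python
import numpy as np

IMAGE_SHAPE = (28, 28)  # MNIST image size


def encode_image(image: np.ndarray) -> bytearray:
    """Serialize a uint8 image to its raw bytes, one byte per pixel."""
    return bytearray(image.astype(np.uint8).tobytes())


def decode_image(payload, shape=IMAGE_SHAPE) -> np.ndarray:
    """Rebuild the image; shape and dtype must be known in advance."""
    return np.frombuffer(payload, dtype=np.uint8).reshape(shape)


# Synthetic stand-in for an MNIST image (random pixels, fixed seed).
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=IMAGE_SHAPE, dtype=np.uint8)

payload = encode_image(image)
restored = decode_image(payload)

print(len(payload))                     # 784 bytes: 28 * 28 pixels, 1 byte each
print(np.array_equal(image, restored))  # True
```

Storing pixels as bytes keeps the Delta table compact, at the cost of having to track the `(28, 28)` shape and `uint8` dtype outside the table.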

Now we'll load the Delta Tables as DataFrames and use them to create a Hugging Face dataset.

In [None]:
# Load the Delta tables as Spark DataFrames
train_df = spark.read.format("delta").load("./data/mnist_delta/train")
test_df = spark.read.format("delta").load("./data/mnist_delta/test")

# Create Hugging Face Datasets from the Spark DataFrames
train_dataset = Dataset.from_spark(train_df)
test_dataset = Dataset.from_spark(test_df)

# Group the train and test datasets as named splits of a DatasetDict
dataset_dict = DatasetDict(
    {
        "train": train_dataset,
        "test": test_dataset,
    }
)

Let's confirm that we can recover the original image data after the round trip through Delta and Hugging Face.

In [None]:
import matplotlib.pyplot as plt

row = train_dataset[8]
# Extract the image data and label
image_data = row["features"]
label = row["label"]

# Convert the binary data back to a NumPy array and reshape it
image_array = np.frombuffer(image_data, dtype=np.uint8).reshape(28, 28)

# Plot the image
plt.imshow(image_array, cmap="gray")
plt.title(f"Label: {label}")
plt.show()