# Dataset creation with pyspark

In this example, we'll save a dataset in parquet format with pyspark and version its metadata in Verta.

In [1]:
from pyspark.sql import SparkSession
spark = SparkSession.builder \
  .master("local") \
  .appName("parquet_example") \
  .getOrCreate()

First, we read a local csv and save it out as parquet. Here we're saving to local disk, but we could have saved to HDFS or S3 instead.

In [2]:
df = spark.read.csv('census-train.csv', header=True, inferSchema=True)
df.repartition(5).write.mode('overwrite').parquet('datasets/census-train-parquet')
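
To see what the write produced, it can help to list the part files in the output directory. The sketch below uses only the standard library and assumes the directory is on local disk, as in this example. `list_part_files` is a hypothetical helper, not part of pyspark or Verta.

```python
import glob
import os


def list_part_files(directory):
    """Return (filename, size_in_bytes) for each parquet part file, sorted by name."""
    pattern = os.path.join(directory, "part-*.parquet")
    return [(os.path.basename(p), os.path.getsize(p)) for p in sorted(glob.glob(pattern))]


if __name__ == "__main__":
    # Prints nothing if the directory does not exist or holds no part files
    for name, size in list_part_files("datasets/census-train-parquet"):
        print(name, size)
```

Marker files such as `_SUCCESS` do not match the `part-*.parquet` pattern, so only the data files are listed.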

Now we can use pyspark to list the files that were written and create a Path dataset that holds the metadata for each one.

In [3]:
# This cell will be replaced with the following in a future client version
#
# from verta.dataset import Path
# path_dataset = Path.with_spark('datasets/census-train-parquet')

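# `sc` is only predefined in the pyspark shell, so take it from the session
sc = spark.sparkContext
rdd = sc.binaryFiles('datasets/census-train-parquet')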

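# Create Path metadata for each component file
def process_component(entry):
    # binaryFiles yields (path, content) pairs; only the path is needed here
    filepath, _ = entry
    # Local paths come back as URIs, so drop the "file:" scheme prefix
    filepath = filepath[len("file:"):]
    # Imported inside the function so the import happens wherever the function runs
    from verta.dataset import Path
    return Path(filepath)

# Then combine the per-file Path objects into one dataset by adding them together
path_dataset = rdd.map(process_component).reduce(lambda a, b: a + b)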
print(path_dataset)

Path Version
    /Users/conrado/workspace/modeldb/demos/census/datasets/census-train-parquet/part-00000-aa276e29-cd0e-4208-bead-14fec5a13149-c000.snappy.parquet
        57199 bytes
        last modified: 2021-01-29 12:29:51.080000
        MD5 checksum: 1f3201dadb184e7c0caed32c10c61206
    /Users/conrado/workspace/modeldb/demos/census/datasets/census-train-parquet/part-00001-aa276e29-cd0e-4208-bead-14fec5a13149-c000.snappy.parquet
        57017 bytes
        last modified: 2021-01-29 12:29:51.081000
        MD5 checksum: 5d86421785794e1374bf7635065e4551
    /Users/conrado/workspace/modeldb/demos/census/datasets/census-train-parquet/part-00002-aa276e29-cd0e-4208-bead-14fec5a13149-c000.snappy.parquet
        57283 bytes
        last modified: 2021-01-29 12:29:51.080000
        MD5 checksum: 4230c0bec341695187e9c1371c6fddc7
    /Users/conrado/workspace/modeldb/demos/census/datasets/census-train-parquet/part-00003-aa276e29-cd0e-4208-bead-14fec5a13149-c000.snappy.parquet
        57102 bytes
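
Each entry in the output above records a file's path, size, modification time, and MD5 checksum. As a rough illustration of where those values come from, the sketch below computes the same fields with the standard library. It is not Verta's implementation, and `file_metadata` is a hypothetical helper introduced only for this example.

```python
import hashlib
import os
from datetime import datetime


def file_metadata(filepath, chunk_size=1 << 20):
    """Return size, last-modified time, and MD5 checksum for one file."""
    md5 = hashlib.md5()
    with open(filepath, "rb") as f:
        # Read in chunks so large part files are not loaded into memory at once
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
    stat = os.stat(filepath)
    return {
        "path": os.path.abspath(filepath),
        "size_bytes": stat.st_size,
        "last_modified": datetime.fromtimestamp(stat.st_mtime),
        "md5": md5.hexdigest(),
    }
```

Because the checksum depends only on file contents, rewriting the dataset with identical data but a new timestamp would change `last_modified` while leaving `md5` the same.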


Then we can save that metadata to Verta as a new dataset version.

In [4]:
from verta import Client

client = Client("https://point72.app.verta.ai")
dataset = client.get_or_create_dataset("Census parquet - pyspark example", workspace="p72-mi-data")
dataset_version = dataset.create_version(path_dataset)

set email from environment
set developer key from environment
connection successfully established
created new Dataset: Census parquet - pyspark example in workspace: p72-mi-data
created new Dataset Version: 1 for Census parquet - pyspark example
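
The first two log lines indicate that the client picked up its credentials from the environment. A small pre-flight check can make a missing variable obvious before connecting. The sketch below assumes the variable names `VERTA_EMAIL` and `VERTA_DEV_KEY`; these are an assumption here and should be confirmed against the Verta client documentation. `missing_credentials` is a hypothetical helper.

```python
import os

# Assumed variable names; check the Verta client documentation for your version
REQUIRED_VARS = ("VERTA_EMAIL", "VERTA_DEV_KEY")


def missing_credentials(env=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_credentials()
    if missing:
        print("missing environment variables:", ", ".join(missing))
```

Passing a plain dictionary as `env` makes the check easy to exercise without touching the real environment.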
