<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 400px">
</div>

# Reader & Writer
1. Read from CSV files
1. Read from JSON files
1. Write DataFrame to files
1. Write DataFrame to tables
1. Write DataFrame to a Delta table

##### Methods
- DataFrameReader (<a href="https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql.html#input-and-output" target="_blank">Python</a>/<a href="https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/DataFrameReader.html" target="_blank">Scala</a>): `csv`, `json`, `option`, `schema`
- DataFrameWriter (<a href="https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql.html#input-and-output" target="_blank">Python</a>/<a href="https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/DataFrameWriter.html" target="_blank">Scala</a>): `mode`, `option`, `parquet`, `format`, `save`, `saveAsTable`
- StructType (<a href="https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.types.StructType.html#pyspark.sql.types.StructType" target="_blank">Python</a>/<a href="https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/types/StructType.html" target="_blank">Scala</a>): `toDDL`

##### Spark Types
- Types (<a href="https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql.html#data-types" target="_blank">Python</a>/<a href="https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/types/index.html" target="_blank">Scala</a>): `ArrayType`, `DoubleType`, `IntegerType`, `LongType`, `StringType`, `StructType`, `StructField`

In [0]:
%run ./Includes/Classroom-Setup

### ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Read from CSV files
Read from CSV with DataFrameReader's `csv` method and the following options:

Tab separator, use first line as header, infer schema

In [0]:
usersCsvPath = "/mnt/training/ecommerce/users/users-500k.csv"

usersDF = (spark.read
  .option("sep", "\t")
  .option("header", True)
  .option("inferSchema", True)
  .csv(usersCsvPath))

usersDF.printSchema()

root
 |-- user_id: string (nullable = true)
 |-- user_first_touch_timestamp: long (nullable = true)
 |-- email: string (nullable = true)
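
Schema inference is convenient, but it is not free: the reader has to scan the data to decide each column's type. The sketch below is a plain-Python analogy, not Spark's actual algorithm. `SAMPLE`, `infer_type`, and `infer_schema` are hypothetical names, and the sample rows are made up for illustration.

```python
import csv
import io

# Hypothetical tab-separated sample; the values are made up for illustration.
SAMPLE = (
    "user_id\tuser_first_touch_timestamp\temail\n"
    "UA001\t1592539216262230\ta@example.com\n"
    "UA002\t1592539226602157\t\n"
)

def infer_type(values):
    """Return the narrowest of long/double/string that fits every non-empty value."""
    non_empty = [v for v in values if v != ""]
    for name, cast in (("long", int), ("double", float)):
        try:
            for v in non_empty:
                cast(v)
            return name
        except ValueError:
            continue
    return "string"

def infer_schema(text, sep="\t"):
    # The first line supplies the column names, like option("header", True).
    rows = list(csv.DictReader(io.StringIO(text), delimiter=sep))
    # Inference looks at every value of every column: a full pass over the data.
    return {col: infer_type([row[col] for row in rows]) for col in rows[0]}

schema = infer_schema(SAMPLE)
print(schema)
```

On a file with 500k rows, that extra pass is the cost that a user-defined schema avoids.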



Manually define the schema by creating a `StructType` with column names and data types

In [0]:
from pyspark.sql.types import LongType, StringType, StructType, StructField

userDefinedSchema = StructType([
  StructField("user_id", StringType(), True),
  StructField("user_first_touch_timestamp", LongType(), True),
  StructField("email", StringType(), True)
])

Read from CSV using this user-defined schema instead of inferring schema

In [0]:
usersDF = (spark.read
  .option("sep", "\t")
  .option("header", True)
  .schema(userDefinedSchema)
  .csv(usersCsvPath))

Alternatively, define the schema using a DDL formatted string.

In [0]:
DDLSchema = "user_id string, user_first_touch_timestamp long, email string"

usersDF = (spark.read
  .option("sep", "\t")
  .option("header", True)
  .schema(DDLSchema)
  .csv(usersCsvPath))
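
A DDL-formatted string is a comma-separated list of `name type` pairs, so it can be assembled from a list instead of typed by hand. This is a minimal sketch in plain Python. `to_ddl` and `users_fields` are hypothetical names, and the helper does not escape backticks inside column names.

```python
def to_ddl(fields):
    """Join (column name, Spark SQL type) pairs into a DDL-formatted schema string."""
    # Backticks keep names that contain spaces or reserved words valid.
    return ", ".join(f"`{name}` {dtype}" for name, dtype in fields)

users_fields = [
    ("user_id", "string"),
    ("user_first_touch_timestamp", "long"),
    ("email", "string"),
]

ddl_schema = to_ddl(users_fields)
print(ddl_schema)
```

The resulting string can be passed to `.schema(...)` in the same way as `DDLSchema`.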

### ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Read from JSON files

Read from JSON with DataFrameReader's `json` method. The JSON reader infers the schema whenever none is supplied, so the `inferSchema` option (a CSV option) is not required here.

In [0]:
eventsJsonPath = "/mnt/training/ecommerce/events/events-500k.json"

eventsDF = (spark.read
  .option("inferSchema", True)
  .json(eventsJsonPath))

eventsDF.printSchema()

root
 |-- device: string (nullable = true)
 |-- ecommerce: struct (nullable = true)
 |    |-- purchase_revenue_in_usd: double (nullable = true)
 |    |-- total_item_quantity: long (nullable = true)
 |    |-- unique_items: long (nullable = true)
 |-- event_name: string (nullable = true)
 |-- event_previous_timestamp: long (nullable = true)
 |-- event_timestamp: long (nullable = true)
 |-- geo: struct (nullable = true)
 |    |-- city: string (nullable = true)
 |    |-- state: string (nullable = true)
 |-- items: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- coupon: string (nullable = true)
 |    |    |-- item_id: string (nullable = true)
 |    |    |-- item_name: string (nullable = true)
 |    |    |-- item_revenue_in_usd: double (nullable = true)
 |    |    |-- price_in_usd: double (nullable = true)
 |    |    |-- quantity: long (nullable = true)
 |-- traffic_source: string (nullable = true)
 |-- user_first_touch_timestamp: long (nullable = true)

Read data faster by creating a `StructType` with the column names and data types. Supplying the schema lets Spark skip the extra pass over the data that inference requires.

In [0]:
from pyspark.sql.types import ArrayType, DoubleType, IntegerType, LongType, StringType, StructType, StructField

userDefinedSchema = StructType([
  StructField("device", StringType(), True),
  StructField("ecommerce", StructType([
    StructField("purchase_revenue_in_usd", DoubleType(), True),
    StructField("total_item_quantity", LongType(), True),
    StructField("unique_items", LongType(), True)
  ]), True),
  StructField("event_name", StringType(), True),
  StructField("event_previous_timestamp", LongType(), True),
  StructField("event_timestamp", LongType(), True),
  StructField("geo", StructType([
    StructField("city", StringType(), True),
    StructField("state", StringType(), True)
  ]), True),
  StructField("items", ArrayType(
    StructType([
      StructField("coupon", StringType(), True),
      StructField("item_id", StringType(), True),
      StructField("item_name", StringType(), True),
      StructField("item_revenue_in_usd", DoubleType(), True),
      StructField("price_in_usd", DoubleType(), True),
      StructField("quantity", LongType(), True)
    ])
  ), True),
  StructField("traffic_source", StringType(), True),
  StructField("user_first_touch_timestamp", LongType(), True),
  StructField("user_id", StringType(), True)
])

eventsDF = (spark.read
  .schema(userDefinedSchema)
  .json(eventsJsonPath))
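
Field names in a user-defined schema have to match the keys in the JSON data exactly. Spark does not raise an error for a name it cannot find. In the default permissive mode the column is simply filled with nulls. The plain-Python analogy below shows the effect. `project`, `matching`, and `mismatched` are hypothetical names, and the record values are made up.

```python
import json

# Hypothetical record shaped like one line of the events JSON; values are made up.
record = json.loads(
    '{"device": "macOS", "ecommerce": {"purchase_revenue_in_usd": 1075.5, '
    '"total_item_quantity": 1, "unique_items": 1}}'
)

def project(value, schema):
    """Keep only the fields named in `schema`; a name missing from the data becomes None."""
    if value is None:
        return None
    if isinstance(schema, dict):
        # A dict stands for a struct: recurse into each named field.
        return {name: project(value.get(name), sub) for name, sub in schema.items()}
    # None stands for a leaf field: return the value unchanged.
    return value

matching = {"device": None, "ecommerce": {"purchase_revenue_in_usd": None}}
mismatched = {"device": None, "ecommerce": {"purchaseRevenue": None}}

print(project(record, matching))
print(project(record, mismatched))
```

The second call returns `None` for the misnamed field without any warning, which is why comparing a hand-written schema against `printSchema()` output is worthwhile.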

You can use the `StructType` Scala method `toDDL` to have a DDL-formatted string created for you.

In a Python notebook, run it in a `%scala` cell, then copy the resulting string into a Python cell.

In [0]:
%scala
spark.read.parquet("/mnt/training/ecommerce/events/events.parquet").schema.toDDL

In [0]:
DDLSchema = "`device` STRING,`ecommerce` STRUCT<`purchase_revenue_in_usd`: DOUBLE, `total_item_quantity`: BIGINT, `unique_items`: BIGINT>,`event_name` STRING,`event_previous_timestamp` BIGINT,`event_timestamp` BIGINT,`geo` STRUCT<`city`: STRING, `state`: STRING>,`items` ARRAY<STRUCT<`coupon`: STRING, `item_id`: STRING, `item_name`: STRING, `item_revenue_in_usd`: DOUBLE, `price_in_usd`: DOUBLE, `quantity`: BIGINT>>,`traffic_source` STRING,`user_first_touch_timestamp` BIGINT,`user_id` STRING"

eventsDF = (spark.read
  .schema(DDLSchema)
  .json(eventsJsonPath))

### ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Write DataFrames to files

Write `usersDF` to Parquet with DataFrameWriter's `parquet` method and the following configurations:

Snappy compression, overwrite mode

In [0]:
usersOutputPath = workingDir + "/users.parquet"

(usersDF.write
  .option("compression", "snappy")
  .mode("overwrite")
  .parquet(usersOutputPath)
)
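
The save mode decides what happens when the output path already holds data. DataFrameWriter accepts `append`, `overwrite`, `ignore`, and `error`/`errorifexists`, with the last one as the default. The sketch below mimics those semantics on a local text file. It is an analogy that runs without a cluster, and `write_lines`, `read_lines`, and `target` are hypothetical names.

```python
import os
import tempfile

VALID_MODES = ("error", "errorifexists", "ignore", "append", "overwrite")

def write_lines(path, lines, mode="errorifexists"):
    """Mimic the DataFrameWriter save modes on a single local text file."""
    if mode not in VALID_MODES:
        raise ValueError(f"unknown mode: {mode}")
    exists = os.path.exists(path)
    if exists and mode in ("error", "errorifexists"):
        raise FileExistsError(path)  # default: refuse to touch existing data
    if exists and mode == "ignore":
        return  # leave the existing data as it is
    # "append" adds to the existing data; every remaining case (re)creates the file.
    with open(path, "a" if mode == "append" else "w") as f:
        f.writelines(line + "\n" for line in lines)

def read_lines(path):
    with open(path) as f:
        return f.read().splitlines()

target = os.path.join(tempfile.mkdtemp(), "users.txt")
write_lines(target, ["a"])               # nothing there yet, so the default mode succeeds
write_lines(target, ["b"], "append")     # file now holds a, b
write_lines(target, ["c"], "ignore")     # file exists, so nothing is written
write_lines(target, ["d"], "overwrite")  # file now holds only d
print(read_lines(target))
```

With `overwrite`, a notebook cell can be rerun without failing on the output left by the previous run.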

### ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Write DataFrames to tables

Write `eventsDF` to a table using the DataFrameWriter method `saveAsTable`

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> This creates a global table, unlike the local view created by the DataFrame method `createOrReplaceTempView`

In [0]:
eventsDF.write.mode("overwrite").saveAsTable("events_p")

This table was saved in the database that the classroom setup created for you. The database name is printed below.

In [0]:
print(databaseName)

cenzwongekimetricscom_spark_programming_1_4_reader___writer_py


### ![Delta Lake Tiny Logo](https://pages.databricks.com/rs/094-YMS-629/images/delta-lake-tiny-logo.png) Best Practice: Write Results to a Delta Table

In almost all cases, the best practice is to use <a href="https://delta.io/" target="_blank">Delta Lake</a>, especially whenever the data will be referenced from a Databricks Workspace. Data in Delta tables is stored in Parquet format.

Write `eventsDF` to Delta with DataFrameWriter's `save` method and the following configurations:

Delta format, overwrite mode

In [0]:
eventsOutputPath = workingDir + "/delta/events"

(eventsDF.write
  .format("delta")
  .mode("overwrite")
  .save(eventsOutputPath)
)

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Ingesting Data Lab

Read in CSV files containing products data.
1. Read with infer schema
2. Read with user-defined schema
3. Read with DDL formatted string
4. Write to Delta

### 1. Read with infer schema
- View the first CSV file using DBUtils method `fs.head` with the filepath provided in the variable `singleProductCsvFilePath`
- Create `productsDF` by reading from CSV files located in the filepath provided in the variable `productsCsvPath`
  - Configure options to use first line as header and infer schema

In [0]:
# ANSWER
singleProductCsvFilePath = "/mnt/training/ecommerce/products/products.csv/part-00000-tid-1663954264736839188-daf30e86-5967-4173-b9ae-d1481d3506db-2367-1-c000.csv"
print(dbutils.fs.head(singleProductCsvFilePath))

productsCsvPath = "/mnt/training/ecommerce/products/products.csv"

productsDF = (spark.read
  .option("header", True)
  .option("inferSchema", True)
  .csv(productsCsvPath))

productsDF.printSchema()

item_id,name,price
M_PREM_Q,Premium Queen Mattress,1795.0
M_STAN_F,Standard Full Mattress,945.0
M_PREM_F,Premium Full Mattress,1695.0

root
 |-- item_id: string (nullable = true)
 |-- name: string (nullable = true)
 |-- price: double (nullable = true)



##### <img alt="Best Practice" title="Best Practice" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.3em" src="https://files.training.databricks.com/static/images/icon-blue-ribbon.svg"/> Check your work

In [0]:
assert(productsDF.count() == 12)

### 2. Read with user-defined schema
Define schema by creating a `StructType` with column names and data types

In [0]:
# ANSWER
userDefinedSchema = StructType([
  StructField("item_id", StringType(), True),
  StructField("name", StringType(), True),
  StructField("price", DoubleType(), True)
])

productsDF2 = (spark.read
  .option("header", True)
  .schema(userDefinedSchema)
  .csv(productsCsvPath))

##### <img alt="Best Practice" title="Best Practice" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.3em" src="https://files.training.databricks.com/static/images/icon-blue-ribbon.svg"/> Check your work

In [0]:
assert(userDefinedSchema.fieldNames() == ["item_id", "name", "price"])

In [0]:
from pyspark.sql import Row

expected1 = Row(item_id="M_STAN_Q", name="Standard Queen Mattress", price=1045.0)
result1 = productsDF2.first()

assert(expected1 == result1)

### 3. Read with DDL formatted string

In [0]:
# ANSWER
DDLSchema = "`item_id` STRING,`name` STRING,`price` DOUBLE"

productsDF3 = (spark.read
  .option("header", True)
  .schema(DDLSchema)
  .csv(productsCsvPath))

##### <img alt="Best Practice" title="Best Practice" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.3em" src="https://files.training.databricks.com/static/images/icon-blue-ribbon.svg"/> Check your work

In [0]:
assert(productsDF3.count() == 12)

### 4. Write to Delta
Write `productsDF` to the filepath provided in the variable `productsOutputPath`

In [0]:
# ANSWER
productsOutputPath = workingDir + "/delta/products"
(productsDF.write
  .format("delta")
  .mode("overwrite")
  .save(productsOutputPath)
)

##### <img alt="Best Practice" title="Best Practice" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.3em" src="https://files.training.databricks.com/static/images/icon-blue-ribbon.svg"/> Check your work

In [0]:
assert(len(dbutils.fs.ls(productsOutputPath)) == 5)

### Clean up classroom

In [0]:
%run ./Includes/Classroom-Cleanup
