<div style="text-align: center; line-height: 0; padding-top: 2px;">
  <img src="https://www.quantiaconsulting.com/logos/quantia_logo_orizz.png" alt="Quantia Consulting" style="width: 600px; height: 250px">
</div>

# ![Spark Logo Tiny](https://www.quantiaconsulting.com/logos/logo_spark_tiny.png) Writing Data in Spark

Just as there are many ways to read data, there are just as many ways to write it.

In this notebook, we will take a quick peek at how to write data back out to Parquet files.

**Technical Accomplishments:**
- Writing data to Parquet files

## Getting Started

Let's start by importing libraries and defining a few useful variables.

In [None]:
%load_ext autotime

import os
import qcutils
from pyspark.sql import SparkSession
import boto3
import io

s3 = boto3.client('s3')
baseUri = "s3a://quantia-master/training/"

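# Must be set before the SparkSession is created so the S3A connector packages are loaded
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.5 pyspark-shell'

spark = (SparkSession.builder
    .master("local[*]")  # Run locally, using all available cores
    .appName("test")
    .getOrCreate()
)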
qcutils.init_spark_session(spark)

## Writing Data

Let's start with one of our original data sources, the tab-separated file **wikipedia_pageviews_by_second.tsv**:

In [None]:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

csvSchema = StructType([
  StructField("timestamp", StringType(), False),
  StructField("site", StringType(), False),
  StructField("requests", IntegerType(), False)
])

csvFile = baseUri + "wikipedia_pageviews_by_second.tsv"

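csvDF = (spark.read
  .option('header', 'true')  # The first line holds column names, not data
  .option('sep', "\t")       # Fields are tab-separated
  .schema(csvSchema)         # Use our schema instead of inferring one
  .csv(csvFile)
)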

Now that we have a `DataFrame`, we can write it back out as Parquet files or in various other formats.

In [None]:
outputBaseUri = "/home/jovyan/data/pyspark/"

(csvDF.write                       # Our DataFrameWriter
  .option("compression", "snappy") # Parquet codecs include none, snappy, gzip, and lzo
  .mode("overwrite")               # Replace any existing data at the target path
  .parquet(outputBaseUri + "wikipedia_pageviews_by_second.parquet") # Write the DataFrame as Parquet files under this directory
)
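The same `DataFrameWriter` can also produce other formats. As a minimal sketch, the generic `format(...).save(...)` form selects the output format by name. The helper name `write_as` and the `.json` output path in the usage comment are hypothetical and only serve as illustration; they are not part of this notebook.

```python
def write_as(df, path, fmt="parquet", mode="overwrite"):
    """Write a DataFrame to `path` in the given format and return the path.

    `write_as` is a hypothetical helper for illustration. `fmt` is passed
    to DataFrameWriter.format(), e.g. "parquet", "json", "csv" or "orc".
    """
    (df.write
       .format(fmt)   # Select the output format by name
       .mode(mode)    # e.g. "overwrite", "append", "ignore" or "error"
       .save(path))   # The path becomes a directory of part files
    return path

# Example usage (assumes the csvDF and outputBaseUri defined in this notebook):
# write_as(csvDF, outputBaseUri + "wikipedia_pageviews_by_second.json", fmt="json")
```

Keeping the format as a parameter makes it easy to compare output sizes or read speeds across formats without duplicating the write call.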

Now that the data has been written, we can read it back and count the number of rows:

In [None]:
outputBaseUri = "/home/jovyan/data/pyspark/"
spark.read.parquet(outputBaseUri + "wikipedia_pageviews_by_second.parquet").count()
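Note that the path we read from is a directory rather than a single file: Spark writes one part file per partition and, by default, a `_SUCCESS` marker when the job completes. The following sketch uses only the standard library to inspect such a directory on the local filesystem. The helper names `list_part_files` and `has_success_marker` are hypothetical, and the exact part-file names depend on the Spark version and compression codec.

```python
from pathlib import Path

def list_part_files(output_dir):
    """Return the sorted names of the Parquet part files in a Spark output directory."""
    return sorted(p.name for p in Path(output_dir).glob("part-*.parquet"))

def has_success_marker(output_dir):
    """Return True if the _SUCCESS marker file is present in the directory."""
    return (Path(output_dir) / "_SUCCESS").exists()

# Example usage (assumes the local output path written in this notebook):
# out_dir = "/home/jovyan/data/pyspark/wikipedia_pageviews_by_second.parquet"
# print(list_part_files(out_dir))
# print(has_success_marker(out_dir))
```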

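Now we will try to append more rows to the existing Parquet data. In `append` mode, Spark adds new part files to the output directory rather than modifying the files that are already there.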

In [None]:
outputBaseUri = "/home/jovyan/data/pyspark/"

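(csvDF.write
  .option("compression", "snappy")
  .mode("append")                  # Add to the existing data instead of replacing it
  .parquet(outputBaseUri + "wikipedia_pageviews_by_second.parquet"))

# If this cell runs once after the overwrite above, the count should be double the previous one
spark.read.parquet(outputBaseUri + "wikipedia_pageviews_by_second.parquet").count()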
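To make the effect of an append explicit, the row counts before and after the write can be compared. This is a minimal sketch, assuming the target path already holds Parquet data; the helper name `append_and_count` is hypothetical, and it only uses the `read.parquet`, `write.mode` and `count` calls already shown in this notebook.

```python
def append_and_count(spark, df, path):
    """Append `df` to the Parquet data at `path`.

    Returns a tuple (rows_before, rows_after). Hypothetical helper for
    illustration; `path` must already contain Parquet data.
    """
    rows_before = spark.read.parquet(path).count()  # Rows currently stored
    df.write.mode("append").parquet(path)           # Adds new part files
    rows_after = spark.read.parquet(path).count()   # Rows after the append
    return rows_before, rows_after

# Example usage (assumes spark, csvDF and outputBaseUri from this notebook):
# before, after = append_and_count(
#     spark, csvDF, outputBaseUri + "wikipedia_pageviews_by_second.parquet")
# assert after == before + csvDF.count()
```

Because every call appends again, running this repeatedly keeps growing the dataset; rerun the `overwrite` cell to reset it.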

##### ![Quantia Tiny Logo](https://www.quantiaconsulting.com/logos/quantia_logo_tiny.png) 2020 Quantia Consulting, srl. All rights reserved.