## Overview of Spark Write APIs

Let us understand how we can write Data Frames to different file formats.

* All the batch write APIs are grouped under `write`, which is exposed on Data Frame objects.
* The format-specific APIs are exposed under a Data Frame's `write` attribute (the counterpart of `spark.read`):
  * `text` - to write single column data to text files.
  * `csv` - to write to text files with delimiters. Default is a comma, but we can use other delimiters as well.
  * `json` - to write data to JSON files
  * `orc` - to write data to ORC files
  * `parquet` - to write data to Parquet files.
* We can also write data to other file formats by plugging in the corresponding package and using `write.format`, for example **avro**.
* We can use options based on the file format to which we are writing the Data Frame.
  * `compression` - Compression codec (`gzip`, `snappy` etc)
  * `sep` - to specify delimiters while writing into text files using **csv**
* We can `overwrite` the directories or `append` to existing directories using `mode`
* Create a copy of orders data in **parquet** file format with no compression. If the folder already exists, overwrite it. Target Location: **/user/[YOUR_USER_NAME]/retail_db/orders**
* When you pass options, typos in option names are ignored rather than causing a failure. Be careful and make sure that the output is validated.
* By default, the number of files in the output directory is equal to the number of tasks used to process the data in the last stage. However, we might want to control the number of files so that we don't run into the issue of too many small files.
* We can control the number of files by using `coalesce`. It has to be invoked on the Data Frame before invoking `write`.
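
The points above can be pulled together into one small helper. The following is a minimal sketch and not part of the original notebook: `write_df`, `check_writer_args`, `KNOWN_OPTIONS` and `VALID_MODES` are hypothetical names, and the option table is a deliberately small subset that covers only the options mentioned in this section. Because misspelled option names may be silently ignored, the sketch rejects unknown names before it touches the writer.

```python
# Hypothetical helper; these names are illustrative and not part of PySpark.
# Assumption: only the options discussed in this section are listed here.
KNOWN_OPTIONS = {
    "csv": {"compression", "sep", "header"},
    "json": {"compression"},
    "orc": {"compression"},
    "parquet": {"compression"},
    "text": {"compression"},
}
VALID_MODES = {"append", "overwrite", "ignore", "error", "errorifexists"}


def check_writer_args(fmt, mode, options):
    """Raise ValueError for an unknown format, save mode or option name."""
    if fmt not in KNOWN_OPTIONS:
        raise ValueError(f"unsupported format in this sketch: {fmt!r}")
    if mode not in VALID_MODES:
        raise ValueError(f"unknown save mode: {mode!r}")
    unknown = sorted(set(options) - KNOWN_OPTIONS[fmt])
    if unknown:
        raise ValueError(f"unknown option(s) for {fmt}: {unknown}")


def write_df(df, path, fmt="parquet", mode="error", num_files=None, **options):
    """Validate the arguments, optionally coalesce, then write the Data Frame."""
    check_writer_args(fmt, mode, options)
    if num_files is not None:
        df = df.coalesce(num_files)  # has to happen before .write
    df.write.mode(mode).format(fmt).options(**options).save(path)
```

With this sketch, a call such as `write_df(df, path, mode='overwrite', compression='none')` would perform the uncompressed parquet write described above, while a misspelling such as `compresion='none'` raises `ValueError` before anything is written.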

Let us start the Spark session for this notebook so that we can execute the code provided. You can sign up for our [10 node state of the art cluster/labs](https://labs.itversity.com/plans) to learn Spark SQL using our unique integrated LMS.

In [3]:
from pyspark.sql import SparkSession

import getpass
username = getpass.getuser()

spark = SparkSession. \
    builder. \
    config('spark.ui.port', '0'). \
    config("spark.sql.warehouse.dir", f"/user/{username}/warehouse"). \
    enableHiveSupport(). \
    appName(f'{username} | Python - Data Processing - Overview'). \
    master('yarn'). \
    getOrCreate()

If you are going to use CLIs, you can launch Spark using one of the following 3 approaches.

**Using Spark SQL**

```
spark2-sql \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse
```

**Using Scala**

```
spark2-shell \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse
```

**Using Pyspark**

```
pyspark2 \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse
```

In [4]:
%%sh

hdfs dfs -rm -R -skipTrash /user/${USER}/retail_db

Deleted /user/itv002461/retail_db


In [5]:
orders = spark. \
    read. \
    csv('/public/retail_db/orders',
        schema='''
            order_id INT, 
            order_date STRING, 
            order_customer_id INT, 
            order_status STRING
        '''
       )

In [6]:
orders.printSchema()

root
 |-- order_id: integer (nullable = true)
 |-- order_date: string (nullable = true)
 |-- order_customer_id: integer (nullable = true)
 |-- order_status: string (nullable = true)



In [7]:
orders.show()

+--------+--------------------+-----------------+---------------+
|order_id|          order_date|order_customer_id|   order_status|
+--------+--------------------+-----------------+---------------+
|       1|2013-07-25 00:00:...|            11599|         CLOSED|
|       2|2013-07-25 00:00:...|              256|PENDING_PAYMENT|
|       3|2013-07-25 00:00:...|            12111|       COMPLETE|
|       4|2013-07-25 00:00:...|             8827|         CLOSED|
|       5|2013-07-25 00:00:...|            11318|       COMPLETE|
|       6|2013-07-25 00:00:...|             7130|       COMPLETE|
|       7|2013-07-25 00:00:...|             4530|       COMPLETE|
|       8|2013-07-25 00:00:...|             2911|     PROCESSING|
|       9|2013-07-25 00:00:...|             5657|PENDING_PAYMENT|
|      10|2013-07-25 00:00:...|             5648|PENDING_PAYMENT|
|      11|2013-07-25 00:00:...|              918| PAYMENT_REVIEW|
|      12|2013-07-25 00:00:...|             1837|         CLOSED|
|      13|

In [8]:
orders.count()

68883

In [9]:
orders. \
    write. \
    parquet(f'/user/{username}/retail_db/orders', 
            mode='overwrite', 
            compression='none'
           )

In [10]:
%%sh

hdfs dfs -ls /user/${USER}/retail_db/orders

# File extension should not contain compression algorithms such as snappy.

Found 2 items
-rw-r--r--   3 itv002461 supergroup          0 2022-06-06 09:38 /user/itv002461/retail_db/orders/_SUCCESS
-rw-r--r--   3 itv002461 supergroup     495238 2022-06-06 09:38 /user/itv002461/retail_db/orders/part-00000-ea98f8cc-0f2c-49a3-8b5a-346bbac35d27-c000.parquet


In [11]:
# Alternative approach - using option
orders. \
    write. \
    mode('overwrite'). \
    option('compression', 'none'). \
    parquet(f'/user/{username}/retail_db/orders')

In [12]:
%%sh

hdfs dfs -ls /user/${USER}/retail_db/orders

# File extension should not contain compression algorithms such as snappy.

Found 2 items
-rw-r--r--   3 itv002461 supergroup          0 2022-06-06 09:38 /user/itv002461/retail_db/orders/_SUCCESS
-rw-r--r--   3 itv002461 supergroup     495238 2022-06-06 09:38 /user/itv002461/retail_db/orders/part-00000-453d51a8-3b7f-4a86-91aa-17a8b723a92c-c000.parquet


In [13]:
# Alternative approach - using format
orders. \
    write. \
    mode('overwrite'). \
    option('compression', 'none'). \
    format('parquet'). \
    save(f'/user/{username}/retail_db/orders')

In [14]:
%%sh

hdfs dfs -ls /user/${USER}/retail_db/orders

# File extension should not contain compression algorithms such as snappy.

Found 2 items
-rw-r--r--   3 itv002461 supergroup          0 2022-06-06 09:38 /user/itv002461/retail_db/orders/_SUCCESS
-rw-r--r--   3 itv002461 supergroup     495238 2022-06-06 09:38 /user/itv002461/retail_db/orders/part-00000-372fd3ce-45bc-4b54-8e24-c5f51cb1fd81-c000.parquet
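
The checks above rely on reading the file extension by eye. As a small sketch of the same check in Python, the helper below infers the codec from a part file name. `codec_from_part_file` and its token table are assumptions made for illustration, based only on file names of the kind shown in this section; this is not a Spark API.

```python
# Hypothetical helper: infer the compression codec from a part file name.
# Assumption: names look like part-...-c000.parquet,
# part-...-c000.snappy.parquet or part-...-c000.csv.gz.
CODEC_TOKENS = {"snappy": "snappy", "gz": "gzip", "gzip": "gzip"}


def codec_from_part_file(path):
    """Return the codec hinted at by the file name, or None if there is none."""
    file_name = path.rsplit("/", 1)[-1]
    # Skip the first dot-separated piece: it is the part-<id>-c000 stem.
    for token in file_name.split(".")[1:]:
        if token in CODEC_TOKENS:
            return CODEC_TOKENS[token]
    return None


uncompressed = (
    "/user/itv002461/retail_db/orders/"
    "part-00000-ea98f8cc-0f2c-49a3-8b5a-346bbac35d27-c000.parquet"
)
print(codec_from_part_file(uncompressed))  # None
```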


* Read order_items data from **/public/retail_db_json/order_items** and write it to pipe-delimited files with gzip compression. Target Location: **/user/[YOUR_USER_NAME]/retail_db/order_items**. Make sure to validate.
* If the target location already exists, skip the write without raising an error (`ignore` mode). Also make sure to write into only one file. We can use `coalesce` for it.

**`coalesce` will be covered in detail at a later point in time**
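
Until then, here is a minimal sketch of how the number of output files might be chosen and applied. The function names and the 128 MB target size are assumptions for illustration only; the notebook itself simply uses `coalesce(1)`.

```python
import math

# Assumed target size per output file; 128 MB is an illustrative choice only.
TARGET_FILE_BYTES = 128 * 1024 * 1024


def pick_num_files(total_bytes, target_file_bytes=TARGET_FILE_BYTES):
    """Return how many files keep each one near or below the target size (at least 1)."""
    return max(1, math.ceil(total_bytes / target_file_bytes))


def write_with_file_count(df, path, num_files, fmt="parquet", mode="overwrite"):
    """Reduce the Data Frame to at most num_files partitions, then write it."""
    # coalesce only lowers the partition count; it never increases it.
    df.coalesce(num_files).write.mode(mode).format(fmt).save(path)


# The uncompressed orders output listed earlier is 495238 bytes, so one file is enough.
print(pick_num_files(495238))  # 1
```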

In [23]:
order_items = spark. \
    read. \
    json('/public/retail_db_json/order_items')

In [24]:
order_items.show()

+-------------+-------------------+---------------------+------------------------+-------------------+-------------------+
|order_item_id|order_item_order_id|order_item_product_id|order_item_product_price|order_item_quantity|order_item_subtotal|
+-------------+-------------------+---------------------+------------------------+-------------------+-------------------+
|            1|                  1|                  957|                  299.98|                  1|             299.98|
|            2|                  2|                 1073|                  199.99|                  1|             199.99|
|            3|                  2|                  502|                    50.0|                  5|              250.0|
|            4|                  2|                  403|                  129.99|                  1|             129.99|
|            5|                  4|                  897|                   24.99|                  2|              49.98|
|            6| 

In [25]:
order_items.printSchema()

root
 |-- order_item_id: long (nullable = true)
 |-- order_item_order_id: long (nullable = true)
 |-- order_item_product_id: long (nullable = true)
 |-- order_item_product_price: double (nullable = true)
 |-- order_item_quantity: long (nullable = true)
 |-- order_item_subtotal: double (nullable = true)



In [18]:
order_items.count()

172198

In [26]:
# Using format
order_items. \
    coalesce(1). \
    write. \
    mode('ignore'). \
    option('compression', 'gzip'). \
    option('sep', '|'). \
    format('csv'). \
    save(f'/user/{username}/retail_db/order_items')
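
With `mode('ignore')`, nothing is written and no error is raised when the target directory already exists. The following is a minimal sketch of how the save modes differ. `plan_write` is a hypothetical name; it only describes the behaviour and does not write anything.

```python
# Sketch of how each save mode reacts to an existing target directory.
def plan_write(mode, target_exists):
    """Return 'write', 'replace', 'add', 'skip' or 'fail' for the given mode."""
    if mode not in {"append", "overwrite", "ignore", "error", "errorifexists"}:
        raise ValueError(f"unknown save mode: {mode!r}")
    if not target_exists:
        return "write"  # every mode writes when nothing is there yet
    return {
        "append": "add",         # new files are added next to the existing ones
        "overwrite": "replace",  # existing contents are replaced
        "ignore": "skip",        # the write is silently skipped
        "error": "fail",         # default behaviour: an exception is raised
        "errorifexists": "fail",
    }[mode]


print(plan_write("ignore", target_exists=True))  # skip
```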

In [27]:
%%sh

hdfs dfs -ls /user/${USER}/retail_db/order_items

Found 2 items
-rw-r--r--   3 itv002461 supergroup          0 2022-06-06 09:39 /user/itv002461/retail_db/order_items/_SUCCESS
-rw-r--r--   3 itv002461 supergroup    1032820 2022-06-06 09:39 /user/itv002461/retail_db/order_items/part-00000-6c2b4139-5d31-4519-853a-0da32a28a5d5-c000.csv.gz


In [28]:
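# Alternative approach - using keyword arguments
# mode='overwrite' is used here because the folder already exists at this point;
# with mode='ignore' this write would be skipped.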
order_items. \
    coalesce(1). \
    write. \
    csv(f'/user/{username}/retail_db/order_items',
        sep='|',
        mode='overwrite',
        compression='gzip'
       )

In [29]:
%%sh

hdfs dfs -ls /user/${USER}/retail_db/order_items

Found 2 items
-rw-r--r--   3 itv002461 supergroup          0 2022-06-06 09:42 /user/itv002461/retail_db/order_items/_SUCCESS
-rw-r--r--   3 itv002461 supergroup    1032820 2022-06-06 09:42 /user/itv002461/retail_db/order_items/part-00000-b197bee8-4cc9-463c-a0ad-63a599f0aef9-c000.csv.gz
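
The listing confirms the `.csv.gz` extension but not the delimiter. One way to check the contents is to copy the part file locally (for example with `hdfs dfs -get`) and inspect it with the Python standard library. The sketch below is an illustration under stated assumptions: `summarize_delimited_gzip` is a hypothetical helper, and the sample bytes are built in memory from the first two rows shown by `order_items.show()` rather than read from the real file.

```python
import csv
import gzip
import io


def summarize_delimited_gzip(data, sep="|"):
    """Return (row_count, distinct_field_counts) for gzip-compressed delimited bytes."""
    text = gzip.decompress(data).decode("utf-8")
    rows = list(csv.reader(io.StringIO(text), delimiter=sep))
    return len(rows), {len(row) for row in rows}


# Two sample records in the same shape as the order_items data (6 fields each).
sample = gzip.compress(b"1|1|957|299.98|1|299.98\n2|2|1073|199.99|1|199.99\n")
print(summarize_delimited_gzip(sample))  # (2, {6})
```

For the real part file, the bytes could be passed in after reading the local copy in binary mode. If the write worked as intended, the row count should match `order_items.count()` and every row should have 6 fields.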
