# File Format

Store data on different file systems, such as the fourth extended filesystem (ext4) and the Hadoop Distributed File System (HDFS), in different file formats: plain text file, [Resilient Distributed Datasets (RDDs)](https://spark.apache.org/docs/latest/rdd-programming-guide.html), [JSON lines (JSONL)](https://spark.apache.org/docs/latest/sql-data-sources-json.html), [Parquet](https://spark.apache.org/docs/latest/sql-data-sources-parquet.html) and [ORC](https://orc.apache.org/).

## License

MIT License

Copyright (c) 2018 PT Bukalapak.com

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

## Software Version

In [1]:
import sys
print("Python %s" % sys.version)
import base64

Python 3.6.3 |Anaconda, Inc.| (default, Nov  9 2017, 00:19:18) 
[GCC 7.2.0]


In [2]:
import pyspark
print("PySpark %s" % pyspark.__version__)
from pyspark.sql import SparkSession, Row

PySpark 2.3.1


In [3]:
import platform
print("platform %s" % platform.__version__)

platform 1.0.8


In [4]:
print("OS", platform.platform())

OS Linux-4.15.0-38-generic-x86_64-with-debian-buster-sid


In [5]:
%%bash
cat /etc/os-release

NAME="Ubuntu"
VERSION="18.04 LTS (Bionic Beaver)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 18.04 LTS"
VERSION_ID="18.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=bionic
UBUNTU_CODENAME=bionic


## Setup Spark

In [6]:
APP_NAME = "bukalapak-core-ai.big-data-3v.variety-file-format-spark"
spark = SparkSession \
    .builder \
    .appName(APP_NAME) \
    .getOrCreate()

In [7]:
sc = spark.sparkContext

In [8]:
sc

## Get Serialized Image

In [9]:
pickle_path_filename = \
    '/home/jovyan/work/' + \
    'image/preprocessed_wo_norm_jacket_python_source.pkl'

In [10]:
with open(pickle_path_filename, "rb") as f:
    image_pkl = f.read()

In [11]:
image_pkl[:50]

b'\x80\x03cnumpy.core.multiarray\n_reconstruct\nq\x00cnumpy\nnda'

In [12]:
image_pkl[-20:]

b'\xdb\xdb\xdb\xdb\xdb\xdb\xdb\xdb\xdb\xdb]]]q\rtq\x0eb.'

## ext4 - Text File

### Write

In [13]:
ext4_text_file_path_filename = \
    '/home/jovyan/work/' + \
    'file_format/ext4_text_file.txt'

In [14]:
with open(ext4_text_file_path_filename, 'wb') as fh:
    fh.write(image_pkl)

### Read

In [15]:
with open(ext4_text_file_path_filename, "rb") as f:
    image_pkl_text_file_new = f.read()

The following comparison must return True.

In [16]:
image_pkl_text_file_new == image_pkl

True

## HDFS - RDD

Generate an RDD of three serialized images.

In [17]:
total_images_list = [1, 2, 3]
total_images_rdd = sc.parallelize(total_images_list)

In [18]:
images_rdd = total_images_rdd.map(lambda count: image_pkl)

In [19]:
images_rdd

PythonRDD[1] at RDD at PythonRDD.scala:49

In [20]:
images_rdd.count()

3

### Write

Do not use `saveAsTextFile`, because it uses the newline character (`'\n'`) as the record delimiter. Pickle-serialized data can contain newline bytes, so the data would be split into extra pieces when read back. Use `saveAsPickleFile` instead.
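The newline problem can be illustrated without Spark. The following is a minimal sketch that uses only the standard library; the sample object and variable names are made up for this illustration, and the `split` call only mimics what a line-oriented reader does, not Spark's actual implementation.

```python
import pickle

# Hypothetical sample object; pickled data may contain the newline byte 0x0A.
sample = {"caption": "first line\nsecond line"}
payload = pickle.dumps(sample, protocol=3)

# A line-oriented reader cuts records on b"\n", as a text file reader would.
pieces = payload.split(b"\n")

print(b"\n" in payload)  # the payload contains a newline byte
print(len(pieces))       # so it is cut into more than one piece

# Rejoining the pieces by hand restores the payload, but a reader that treats
# every line as a separate record never does this.
restored = pickle.loads(b"\n".join(pieces))
print(restored == sample)
```

Each piece on its own is an incomplete pickle, which is why a record-oriented binary format is the safer choice here.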

In [21]:
hdfs_rdd_path_filename = \
    '/home/jovyan/work/' + \
    'file_format/hdfs_rdd.rdd'

Note: Delete any existing output first. By default the write does not overwrite an existing output directory; it fails with the following error message.
```
Py4JJavaError: An error occurred while calling o52.saveAsObjectFile.
: org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory file:/home/jovyan/work/file_format/hdfs_rdd.rdd already exists
```
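The output paths in this notebook are on the local file system (`file:/...`), so one way to clear a previous run is with the standard library. This is a sketch under that assumption; `remove_output_dir` is a hypothetical helper, not part of Spark, and on a real HDFS cluster a different tool such as `hdfs dfs -rm -r` would be needed.

```python
import os
import shutil
import tempfile


def remove_output_dir(path):
    """Delete a local output directory if it exists; return True if removed."""
    if os.path.isdir(path):
        shutil.rmtree(path)
        return True
    return False


# Demonstrate on a throwaway directory instead of the real output path.
demo_dir = os.path.join(tempfile.mkdtemp(), "hdfs_rdd.rdd")
os.makedirs(demo_dir)
open(os.path.join(demo_dir, "part-00000"), "wb").close()

removed_first = remove_output_dir(demo_dir)  # directory existed
removed_again = remove_output_dir(demo_dir)  # nothing left to remove
print(removed_first, removed_again)          # True False
```

Calling such a helper on the output path before each save makes the notebook safe to re-run, at the cost of silently discarding the previous output.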

In [22]:
images_rdd.saveAsPickleFile(hdfs_rdd_path_filename)

### Read

In [23]:
images_rdd_new = sc.pickleFile(hdfs_rdd_path_filename)

In [24]:
images_rdd_new

MapPartitionsRDD[7] at objectFile at NativeMethodAccessorImpl.java:0

In [25]:
images_rdd_new.count()

3

Every line of the following output must be True.

In [26]:
images_pkl_rdd_new = images_rdd_new.take(3)
for image_pkl_rdd_new in images_pkl_rdd_new:
    print(image_pkl_rdd_new == image_pkl)

True
True
True


## HDFS - JSONL

Convert the set of images in the RDD from the previous section into a DataFrame. JSON cannot carry raw bytes, so convert each serialized image to a string with the `base64` encoder. The column names of the image DataFrame are as follows:
 - `fn`: filename
 - `esp`: encoded serialized Pickle
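The encoding step can be checked in isolation. Below is a minimal standard-library sketch of the round trip for a single record; the `raw` bytes are a made-up stand-in for the pickled image, and the `json` module is used here only to imitate one JSONL line.

```python
import base64
import json

raw = bytes([0x80, 0x03, 0x0A, 0xFF, 0x00])  # stand-in for pickled bytes

# Encode: bytes -> Base64 bytes -> ASCII-safe str, as in the `esp` column.
esp = base64.b64encode(raw).decode("UTF-8")
line = json.dumps({"fn": "image.png", "esp": esp})

# Decode: parse the JSON line, then reverse the Base64 step.
record = json.loads(line)
decoded = base64.b64decode(record["esp"].encode("UTF-8"))

print("\n" in line)    # False: the record fits on one line
print(decoded == raw)  # True: the original bytes are recovered
```

Base64 output contains no newline characters, so each record stays on a single line even when the raw bytes include `0x0A`.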

In [27]:
images_dict_rdd = images_rdd \
    .map(lambda x: Row(fn="image.png",
                       esp=base64.b64encode(x).decode('UTF-8')))

In [28]:
images_dict_rdd

PythonRDD[12] at RDD at PythonRDD.scala:49

In [29]:
images_dict_df = spark.createDataFrame(images_dict_rdd)

In [30]:
images_dict_df

DataFrame[esp: string, fn: string]

In [31]:
images_dict_df.show()

+--------------------+---------+
|                 esp|       fn|
+--------------------+---------+
|gANjbnVtcHkuY29yZ...|image.png|
|gANjbnVtcHkuY29yZ...|image.png|
|gANjbnVtcHkuY29yZ...|image.png|
+--------------------+---------+



### Write

In [32]:
hdfs_jsonl_path_filename = \
    '/home/jovyan/work/' + \
    'file_format/hdfs_jsonl.jsonl'

Note: Delete any existing output first. By default the write does not overwrite an existing output directory; it fails with the following error message.
```
Py4JJavaError: An error occurred while calling o542.saveAsTextFile.
: org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory file:/home/jovyan/work/file_format/hdfs_jsonl.jsonl already exists
```

In [33]:
images_dict_df.toJSON().saveAsTextFile(hdfs_jsonl_path_filename)

### Read

In [34]:
images_dict_df_jsonl_new = spark.read.json(hdfs_jsonl_path_filename)

In [35]:
images_dict_df_jsonl_new

DataFrame[esp: string, fn: string]

In [36]:
images_dict_df_jsonl_new.show()

+--------------------+---------+
|                 esp|       fn|
+--------------------+---------+
|gANjbnVtcHkuY29yZ...|image.png|
|gANjbnVtcHkuY29yZ...|image.png|
|gANjbnVtcHkuY29yZ...|image.png|
+--------------------+---------+



Every value in the following output must be True.

In [37]:
for image_dict_row in images_dict_df_jsonl_new.toLocalIterator():
    print(image_dict_row.fn == "image.png",
          base64.b64decode(image_dict_row.esp.encode('UTF-8')) == image_pkl)

True True
True True
True True
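The same row-by-row check is needed for every DataFrame-based format, so it could be wrapped in a small function. The sketch below is one possible shape; `rows_match` and `FakeRow` are hypothetical names, and the namedtuple only stands in for rows that expose `fn` and `esp` as attributes.

```python
import base64
from collections import namedtuple


def rows_match(rows, expected_fn, expected_bytes):
    """Return True if every row has the expected filename and payload."""
    return all(
        row.fn == expected_fn
        and base64.b64decode(row.esp.encode("UTF-8")) == expected_bytes
        for row in rows
    )


# Stand-in rows with `fn` and `esp` attributes, like the rows iterated above.
FakeRow = namedtuple("FakeRow", ["fn", "esp"])
payload = b"\x80\x03example."
good = [FakeRow("image.png", base64.b64encode(payload).decode("UTF-8"))] * 3
bad = good + [FakeRow("image.png", base64.b64encode(b"other").decode("UTF-8"))]

good_result = rows_match(good, "image.png", payload)
bad_result = rows_match(bad, "image.png", payload)
print(good_result, bad_result)  # True False
```

An empty iterable also yields True, so such a helper is best paired with a `count()` check on the DataFrame.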


## HDFS - Parquet

We can use the same images DataFrame as before.

In [38]:
images_dict_df

DataFrame[esp: string, fn: string]

In [39]:
images_dict_df.show()

+--------------------+---------+
|                 esp|       fn|
+--------------------+---------+
|gANjbnVtcHkuY29yZ...|image.png|
|gANjbnVtcHkuY29yZ...|image.png|
|gANjbnVtcHkuY29yZ...|image.png|
+--------------------+---------+



### Write

In [40]:
hdfs_parquet_path_filename = \
    '/home/jovyan/work/' + \
    'file_format/hdfs_parquet.paquet'

Note: Delete any existing output first. By default the write does not overwrite an existing path; it fails with the following error message.
```
Py4JJavaError: An error occurred while calling o154.save.
: org.apache.spark.sql.AnalysisException: path file:/home/jovyan/work/file_format/hdfs_parquet.paquet already exists.;
```

In [41]:
images_dict_df.write.save(hdfs_parquet_path_filename, format="parquet")

### Read

In [42]:
images_dict_df_parquet_new = spark.read.parquet(hdfs_parquet_path_filename)

In [43]:
images_dict_df_parquet_new

DataFrame[esp: string, fn: string]

In [44]:
images_dict_df_parquet_new.show()

+--------------------+---------+
|                 esp|       fn|
+--------------------+---------+
|gANjbnVtcHkuY29yZ...|image.png|
|gANjbnVtcHkuY29yZ...|image.png|
|gANjbnVtcHkuY29yZ...|image.png|
+--------------------+---------+



Every value in the following output must be True.

In [45]:
for image_dict_row in images_dict_df_parquet_new.toLocalIterator():
    print(image_dict_row.fn == "image.png",
          base64.b64decode(image_dict_row.esp.encode('UTF-8')) == image_pkl)

True True
True True
True True


## HDFS - ORC

We can use the same images DataFrame as before.

In [46]:
images_dict_df

DataFrame[esp: string, fn: string]

In [47]:
images_dict_df.show()

+--------------------+---------+
|                 esp|       fn|
+--------------------+---------+
|gANjbnVtcHkuY29yZ...|image.png|
|gANjbnVtcHkuY29yZ...|image.png|
|gANjbnVtcHkuY29yZ...|image.png|
+--------------------+---------+



### Write

In [48]:
hdfs_orc_path_filename = \
    '/home/jovyan/work/' + \
    'file_format/hdfs_orc.orc'

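Note: Delete any existing output first. By default the write does not overwrite an existing path; it fails with an error like the following. The message shown was recorded for the Parquet path; the ORC write fails the same way with its own path.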
```
Py4JJavaError: An error occurred while calling o154.save.
: org.apache.spark.sql.AnalysisException: path file:/home/jovyan/work/file_format/hdfs_parquet.paquet already exists.;
```

In [49]:
images_dict_df.write.save(hdfs_orc_path_filename, format="orc")

### Read

In [50]:
images_dict_df_orc_new = spark.read.orc(hdfs_orc_path_filename)

In [51]:
images_dict_df_orc_new

DataFrame[esp: string, fn: string]

In [52]:
images_dict_df_orc_new.show()

+--------------------+---------+
|                 esp|       fn|
+--------------------+---------+
|gANjbnVtcHkuY29yZ...|image.png|
|gANjbnVtcHkuY29yZ...|image.png|
|gANjbnVtcHkuY29yZ...|image.png|
+--------------------+---------+



Every value in the following output must be True.

In [53]:
for image_dict_row in images_dict_df_orc_new.toLocalIterator():
    print(image_dict_row.fn == "image.png",
          base64.b64decode(image_dict_row.esp.encode('UTF-8')) == image_pkl)

True True
True True
True True
