# Data format lab

## Overview 

In this lab we convert data between file formats (CSV, JSON, and Parquet) using Spark.

We will run this lab in local mode first, and then on a Hadoop cluster.

## Duration 

15 minutes

## Step-1: Verify data

We will use transaction data, located in `data/transactions/transactions-sample.csv` (the notebook cells below refer to it as `../data/transactions/transactions-sample.csv`). You can also use generated data instead.
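
If you prefer to verify the file from Python rather than the shell, a minimal sketch using only the standard library might look like the following. The helper name `preview_lines` is made up for this example and is not part of Spark.

```python
from itertools import islice
from pathlib import Path


def preview_lines(path, n=5):
    """Return the first n lines of a text file, without trailing newlines."""
    file_path = Path(path)
    # Fail early with a clear message if the data file is missing.
    if not file_path.is_file():
        raise FileNotFoundError(f"Data file not found: {file_path}")
    with file_path.open(encoding="utf-8") as handle:
        # islice stops after n lines, so large files are not read in full.
        return [line.rstrip("\n") for line in islice(handle, n)]
```

For example, `preview_lines("../data/transactions/transactions-sample.csv")` should return the header row followed by the first few transactions, assuming the notebook runs from the directory used in the later steps.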

## Step-2: Init Spark

In [None]:
try:
    spark
except NameError:
    import findspark
    findspark.init()  # uses SPARK_HOME
    print("Spark found in : ", findspark.find())

    import pyspark
    from pyspark import SparkConf
    from pyspark.sql import SparkSession

    # use a unique temp dir for the warehouse dir, so we can run multiple Spark sessions in one dir
    import tempfile
    tmpdir = tempfile.TemporaryDirectory()

    config = ( SparkConf()
             .setAppName("TestApp")
             .setMaster("local[*]")
             .set('spark.executor.memory', '2g')
             .set('spark.sql.warehouse.dir', tmpdir.name)
             .set("some_property", "some_value") # another example
             )

    spark = SparkSession.builder.config(conf=config).getOrCreate()
    sc = spark.sparkContext

print('Spark UI running on port ' + spark.sparkContext.uiWebUrl.split(':')[2])

## Step-3: Load CSV Data

In [None]:
## Inspect the data

! cat "../data/transactions/transactions-sample.csv"  | head -n 5

In [None]:
# Load data in Spark

df_csv = spark.read.csv("../data/transactions/transactions-sample.csv", header=True, inferSchema=True)
df_csv.printSchema()

## Step-4: Save data in JSON format

In [None]:
df_csv.write.json("../data/transactions/json", mode="overwrite")

In [None]:
## inspect the data

! ls -l ../data/transactions/json/


In [None]:
! cat ../data/transactions/json/* | head -n 5

In [None]:
## Read JSON data back

df_json = spark.read.json('../data/transactions/json/')
df_json.show(5, truncate=True)
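
The files Spark writes in this step use the JSON Lines layout: each line of a `part-*` file is one complete JSON object, rather than the whole file being a single JSON array. The sketch below parses such a directory with the standard library only, to show what that layout implies. The helper name `read_json_lines` is hypothetical.

```python
import json
from pathlib import Path


def read_json_lines(directory, limit=5):
    """Parse up to `limit` records from the part files in a Spark JSON output directory."""
    records = []
    # Spark names its data files part-*; marker files such as _SUCCESS hold no records.
    for part in sorted(Path(directory).glob("part-*.json")):
        with part.open(encoding="utf-8") as handle:
            for line in handle:
                if not line.strip():
                    continue  # skip blank lines
                if len(records) >= limit:
                    return records
                # Each non-blank line is a standalone JSON document.
                records.append(json.loads(line))
    return records
```

Calling `read_json_lines("../data/transactions/json")` should give a list of dictionaries, one per transaction, with keys matching the CSV header.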

## Step-5: Save data in Parquet format

In [None]:
df_csv.write.parquet("../data/transactions/parquet", mode="overwrite")

In [None]:
! ls -l ../data/transactions/parquet/

In [None]:
## Parquet files are binary, so `cat` will not show readable output; inspect them with a tool such as parquet-tools
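
Even without parquet-tools, a quick sanity check is possible: the Parquet format places the 4-byte marker `PAR1` at both the start and the end of every unencrypted file. The sketch below tests for that marker using only the standard library. The helper name `looks_like_parquet` is made up for this example and is not part of Spark or parquet-tools, and the check says nothing about whether the contents are otherwise valid.

```python
import os
from pathlib import Path

PARQUET_MAGIC = b"PAR1"


def looks_like_parquet(path):
    """Return True if the file starts and ends with the Parquet magic bytes."""
    file_path = Path(path)
    # A file shorter than 8 bytes cannot hold both magic markers.
    if file_path.stat().st_size < 8:
        return False
    with file_path.open("rb") as handle:
        head = handle.read(4)
        handle.seek(-4, os.SEEK_END)  # jump to the last 4 bytes
        tail = handle.read(4)
    return head == PARQUET_MAGIC and tail == PARQUET_MAGIC
```

To apply it to the lab output, pass each `part-*.parquet` file under `../data/transactions/parquet/`.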

In [None]:
## Read back Parquet files

df_parquet = spark.read.parquet('../data/transactions/parquet/')
df_parquet.show(5)
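
With the same data now stored as JSON and Parquet, one optional follow-up is to compare how much disk space each output directory uses. The sketch below sums file sizes with the standard library. The helper name `dir_size_bytes` is hypothetical, and the total includes Spark's small marker and checksum files as well as the data files.

```python
from pathlib import Path


def dir_size_bytes(directory):
    """Sum the sizes of all regular files under a directory, recursively."""
    return sum(p.stat().st_size for p in Path(directory).rglob("*") if p.is_file())
```

For example, compare `dir_size_bytes("../data/transactions/json")` with `dir_size_bytes("../data/transactions/parquet")`. The actual numbers depend on your data, so none are quoted here.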