# Reading from CSV / JSON files and writing to Parquet files

This sample code reads a few fields from nested JSON into a dataframe, then writes the dataframe to storage.

In [None]:
from pyspark.sql import SparkSession
import pyspark.sql.functions as f
from pyspark.sql.types import *

In [None]:
spark = SparkSession.builder.appName('02 reading').getOrCreate()

When running locally on a PC, the Spark UI is available at http://localhost:4040

# Read nested json into a dataframe

HINT: During testing, create a tiny jsonl file so reading is fast. For example `head -n 12 the-file.json > test_12.json`
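
If `head` is not available (for example on Windows), the same truncation can be sketched in plain Python. The `head_lines` helper and the file names below are hypothetical and only serve as an illustration.

```python
from itertools import islice
from pathlib import Path
import tempfile

def head_lines(src, dst, n=12):
    """Copy the first n lines of src to dst and return how many lines were written."""
    with open(src, "r", encoding="utf-8") as fin, open(dst, "w", encoding="utf-8") as fout:
        lines = list(islice(fin, n))
        fout.writelines(lines)
    return len(lines)

# Demo on a throwaway file (hypothetical names, not the lesson's data).
tmp_dir = Path(tempfile.mkdtemp())
big_file = tmp_dir / "the-file.json"
big_file.write_text("".join(f'{{"id": {i}}}\n' for i in range(100)), encoding="utf-8")

small_file = tmp_dir / "test_12.json"
written = head_lines(big_file, small_file, n=12)
```

Because a jsonl file holds one record per line, the truncated file is still valid input for `spark.read.json`.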

In [None]:
# https://sparkbyexamples.com/pyspark/pyspark-read-json-file-into-dataframe/
# https://bigdataprogrammers.com/read-nested-json-in-spark-dataframe/
# https://sparkbyexamples.com/pyspark/pyspark-explode-array-and-map-columns-to-rows/
from pyspark.sql.types import MapType
fname = "../data/sample.json"

# Note: By default, the schema is inferred from the data.
# This is slower and may sometimes fail due to bad input files.
# A possible workaround is to read a short, well-defined file (fname_ref),
# extract the schema from it, and then read the full file using this schema:
# inferred = spark.read.json(fname_ref)
# inferred.printSchema()
# df = spark.read.schema(inferred.schema).json(fname)
df = spark.read.json(fname)
df.show()

In [None]:
df.printSchema()

In [None]:
# Now we can select a few columns of our choice. Note the dot notation for the nested field.
subset = df.select('address.zip', 'name')
subset.show(4)

If the notebook runs inside a Docker container, the container needs access to the data directory on the host.

For example, create a directory on the host and mount it as a volume in docker-compose.

## Reading multiple files at a time
https://sparkbyexamples.com/pyspark/pyspark-read-json-file-into-dataframe/

Using the `read.json()` method you can also read multiple JSON files from different paths in one call: pass the fully qualified file paths as a Python list.

# Writing the dataframe to storage

What if you want to persist (save the values of) a DF?
It can be saved to a database (covered in another lesson) or to a file in the file system.
The **Parquet** format is very efficient for this, as the following test shows.


For example, in one test I read a jsonl file (602MB) into a DF, then wrote it to a parquet file (Spark actually creates a directory with several files).
The parquet output is compressed, so the saved files took 92MB in total.
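
Because the parquet output is a directory rather than a single file, measuring its size means summing the files inside it. The helper below is a plain-Python sketch; `dir_size_bytes` and the fake directory contents are hypothetical and only demonstrate the idea.

```python
from pathlib import Path
import tempfile

def dir_size_bytes(path):
    """Sum the sizes of all regular files under path (recursively)."""
    return sum(p.stat().st_size for p in Path(path).rglob("*") if p.is_file())

# Demo with a fake "parquet directory" (hypothetical file names and contents).
out_dir = Path(tempfile.mkdtemp()) / "demo.parquet"
out_dir.mkdir()
(out_dir / "part-00000.snappy.parquet").write_bytes(b"x" * 1000)
(out_dir / "part-00001.snappy.parquet").write_bytes(b"x" * 500)
(out_dir / "_SUCCESS").write_bytes(b"")

total = dir_size_bytes(out_dir)
print(f"{total / 1e6:.4f} MB")
```

Calling the same helper on a real output directory such as `./fines1M.parquet` gives a number you can compare with the size of the source file.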


## The Parquet format
A column-based binary file format.
- write once, read many → immutable.
- optimized and compressed (per column) → write is slow, read is fast.
- not indexed?

## Alternative formats
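Formats and technologies related to Parquet:
 - Apache Iceberg (a table format that sits on top of data files such as Parquet)  https://www.infoworld.com/article/3669848/why-apache-iceberg-will-rule-data-in-the-cloud.html
 - Snappy (a compression algorithm rather than a file format)
 - Avro (a row-based file format)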

In [None]:
%%time 
# Read a CSV into a dataframe, inferring the schema.
dataPath = "../data/Open_Parking_and_Camera_Violations_1M.csv"
fines = spark.read.format("csv")\
  .option("header","true")\
  .option("inferSchema", "true")\
  .load(dataPath)
  

In [None]:
fines.columns

In [None]:
%%time
# The output path must NOT already exist (unless you add .mode("overwrite")).
# Column names must not include spaces (or some other special characters).
newColNames = [name.replace(' ', '_') for name in fines.columns]  # convert to valid names for Parquet
fines.toDF(*newColNames).write.parquet("./fines1M.parquet")

# You can also drop irrelevant columns by selecting only the ones you need:
fines.select('Plate', 'Amount Due').withColumnRenamed('Amount Due', 'AmountDue').write.parquet("./OnlyTwoFields.parquet")
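
Replacing only spaces is enough for this dataset, but other characters can trip the Parquet writer too. The sketch below generalizes the renaming step in plain Python. The character set is an assumption taken from the error message of Spark versions around 3.2 (newer versions may be more permissive), and the `sanitize` helper and sample column names are hypothetical.

```python
import re

# Assumed set of rejected characters: space , ; { } ( ) newline tab =
_INVALID = re.compile(r"[ ,;{}()\n\t=]")

def sanitize(name):
    """Replace every character from the assumed invalid set with an underscore."""
    return _INVALID.sub("_", name)

# Hypothetical column names, for illustration only.
cols = ["Plate", "Amount Due", "Fine (USD)"]
new_cols = [sanitize(c) for c in cols]
print(new_cols)
```

The result can be applied in the same way as `newColNames`, for example `fines.toDF(*[sanitize(c) for c in fines.columns])`. If two original names collapse to the same sanitized name, they need to be disambiguated by hand.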

In [None]:
%%time 
# read the DF from the parquet file:
restored_df = spark.read.parquet("./fines1M.parquet")

Let's run a few actions on the df:

In [None]:
restored_df.count()

In [None]:
restored_df.select(f.max("Plate")).collect()

In [None]:
restored_df.sort('Plate','County').limit(6).toPandas()

In [None]:
restored_df.sort(f.col('Plate').desc(),'County').limit(6).toPandas()

### Repartition before writing to storage

Spark's `DataFrameWriter` provides a `partitionBy` method that partitions the data on write: the output is split into separate sub-directories, one per distinct value of the provided columns. [2]

Correctly choosing the key is important for good performance!

### Bucketing
(see SDG chapter 9, page 184)

To speed up lookups on a column, we can bucket the data by that column's value.
(In Spark 3.2.1, `bucketBy` must be used together with `saveAsTable`.)

How many files are created?<br>
Answer: num_partitions * num_buckets == 8*4 in this example.
(Actually, each data file gets an accompanying CRC file, so about 2\*8\*4 files.)


In [14]:
fines.select(['county', 'state', 'Violation'])\
.write.format("parquet").mode("overwrite").bucketBy(4, "county")\
.saveAsTable("bucketed")

# Check yourself

* What will happen if you replace `fines.toDF(*newColNames).show()` with `fines.toDF(*newColNames).toPandas()`?