# ETL I

## Task I - Define correct schema for json data

* load json dataset
* look at the inferred schema (is it inferred correctly or is it wrong?)
* define the schema explicitly
* see what happens if the schema is defined wrong

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, desc, count, explode, split, regexp_replace, collect_list

from pyspark.sql.types import StructType, StructField, StringType, LongType, TimestampType

import os

In [None]:
spark = (
    SparkSession
    .builder
    .appName('ETL I')
    .getOrCreate()
)

In [None]:
print(spark.version)

The input dataset is in the json format and is in the `data/questions-json` folder. Below is the path definition:

In [None]:
base_path = os.getcwd()

project_path = ('/').join(base_path.split('/')[0:-3]) 

data_input_path = os.path.join(project_path, 'data/questions-json')

output_path = os.path.join(project_path, 'output/questions-transformed')

#### First let Spark infer the schema:

In [None]:
# your code here:

Note: Here we have only 8 JSON files. If you have many JSON files and you know that they all share the same schema, consider loading just one file to check the schema, because inferring the schema from many files can be expensive.

#### Define the schema:

What you need to do here depends on the Spark version:
* If the Spark version is different from 3.0, `creation_date` is inferred as `StringType`, although it actually holds timestamps. Define the schema by hand and pass it to the reader when creating the DataFrame.
* If the Spark version is 3.0, `creation_date` is inferred correctly as `TimestampType`.

In [None]:
# your code here:

# You can skip this if Spark already inferred creation_date as TimestampType (see above)

#### What happens if the actual data type doesn't match the schema:

* set `title` as `LongType` in the defined schema

Hint
* Different things will happen depending on the `mode` option, where `mode` is one of the following:
    * FAILFAST
    * DROPMALFORMED
    * PERMISSIVE (default)
* For more details about the mode and also other json options see the [docs](https://spark.apache.org/docs/latest/sql-data-sources-json.html)


In [None]:
# Define the schema with a mistake in the title column:
# your code here:

In [None]:
# Try the default PERMISSIVE mode
# your code here:

In [None]:
# Try the DROPMALFORMED mode
# your code here:

In [None]:
# Try the FAILFAST mode
# your code here:

#### Note

* To read more about schema inference and schema evolution of JSON files in Spark SQL, read my article: https://medium.com/swlh/notes-about-json-schema-handling-in-spark-sql-be1e7f13839d

## Task II - Transform json to parquet and convert String column to an array

* convert the column `tags` to an array of tags
* `<tag1><tag2><tag3>` ---> `[tag1, tag2, tag3]`

#### Convert tags to an array

Hint
* use [split](http://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.split.html#pyspark.sql.functions.split) to get an array
* [explode](http://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.explode.html#pyspark.sql.functions.explode) the array to access each element separately
* use [regexp_replace](http://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.regexp_replace.html#pyspark.sql.functions.regexp_replace) to remove the leftover `<` and `>` characters, and split on `><`
* [groupBy](http://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.groupBy.html#pyspark.sql.DataFrame.groupBy) + [collect_list](http://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.collect_list.html#pyspark.sql.functions.collect_list)
* join with original questions DataFrame

In [None]:
# your code here:

#### Note
This is an old-school solution, typical of Spark versions before 2.4. Since 2.4, higher-order functions can solve the problem more elegantly (we will see that later in the section on Higher Order Functions).

There are also some side-effects of this solution:

1. groupBy creates a shuffle (quite expensive)
2. the elements in the final array may come in a different order than in the original string
3. the groupBy key must be unique, otherwise rows that share the same key are merged into a single array

#### Save the data

Hint:
* use [write](http://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.write.html#pyspark.sql.DataFrame.write) + [save](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrameWriter.save.html#pyspark.sql.DataFrameWriter.save)
* [repartition](http://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.repartition.html#pyspark.sql.DataFrame.repartition) the data to 8 partitions before saving
 * this will create 8 files
 
Note
* there are other ways to save data with Spark; we will cover them in the Tables notebook

In [None]:
# your code here:

#### Check if we saved the data correctly:

In [None]:
# your code here:

In [None]:
spark.stop()