# ETL I

## Task - Transform JSON to Parquet and convert a string column to an array

* load the JSON dataset
* look at the inferred schema
* define the schema explicitly
* convert the tags column to an array of tags using
  * split
  * explode
  * regexp_replace
  * groupBy + collect_list
  * join

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, desc, count, explode, split, regexp_replace, collect_list

from pyspark.sql.types import StructType, StructField, StringType, LongType, TimestampType

import os

In [None]:
spark = (
    SparkSession
    .builder
    .appName('ETL I')
    .getOrCreate()
)

In [None]:
base_path = os.getcwd()

project_path = ('/').join(base_path.split('/')[0:-3]) 

data_input_path = os.path.join(project_path, 'data/questions-json')

output_path = os.path.join(project_path, 'output/questions-transformed')
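The slice `[0:-3]` drops the last three path components, so the project root is assumed to sit three directories above the notebook's working directory. A minimal sketch of that behaviour, using a made-up POSIX path (the directory names are purely illustrative), next to an equivalent `pathlib` expression:

```python
from pathlib import PurePosixPath

# Hypothetical working directory, used only for illustration.
base_path = '/home/user/project/notebooks/etl/solutions'

# Same expression as in the notebook: drop the last three components.
project_path = '/'.join(base_path.split('/')[0:-3])

# pathlib equivalent: parents[2] is the directory three levels up.
project_path_alt = str(PurePosixPath(base_path).parents[2])

print(project_path)
print(project_path_alt)
```

Both expressions assume `/` as the separator; if the notebook is moved to a different depth in the project, the `-3` (or the `2`) has to change with it.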

<b>First let Spark infer the schema:</b>

In [None]:
questionsDF = (
    spark
    .read
    .format('json')
    .option('path', data_input_path)
    .load()
)

Note:
Inferring the schema forces Spark to read the JSON data first, which takes a long time when the directory contains many files. To check the schema, consider loading only one file, for example:

data/questions-json/part-00000-1240736e-7aa5-41b0-9223-6d492e57da6a-c000.json
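Field names and value types can also be previewed without Spark, because Spark's JSON reader expects one JSON object per line by default. A rough sketch using only the standard library: the two records below are invented stand-ins for the first lines of a real part file, and `peek_fields` is a hypothetical helper, not part of any library.

```python
import json

# Two made-up records in JSON Lines form (one object per line).
sample_lines = [
    '{"question_id": 1, "title": "How to join?", "tags": "<python><apache-spark>", "accepted_answer_id": null}',
    '{"question_id": 2, "title": "What is an RDD?", "tags": "<scala>", "accepted_answer_id": 7}',
]

def peek_fields(lines):
    """Map each field name to the set of Python type names seen for it."""
    seen = {}
    for line in lines:
        for key, value in json.loads(line).items():
            seen.setdefault(key, set()).add(type(value).__name__)
    return seen

fields = peek_fields(sample_lines)
for name in sorted(fields):
    print(name, sorted(fields[name]))
```

A field that shows both `NoneType` and another type is a nullable column, which is why every `StructField` in an explicit schema would typically be declared nullable.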

In [None]:
questionsDF.printSchema()

<b>Now define the schema:</b>

In [None]:
json_schema = StructType(
    [
        StructField('question_id', LongType(), True),
        StructField('creation_date', TimestampType(), True),
        StructField('title', StringType(), True),
        StructField('body', StringType(), True),
        StructField('tags', StringType(), True),
        StructField('accepted_answer_id', LongType(), True),
        StructField('answers', LongType(), True),
        StructField('comments', LongType(), True),
        StructField('user_id', LongType(), True),
        StructField('views', LongType(), True),
    ]
)

In [None]:
questionsDF = (
    spark
    .read
    .format('json')
    .schema(json_schema)
    .option('path', data_input_path)
    .load()
)

In [None]:
questionsDF.show(truncate=5)

#### Convert tags to an array

Hint:
* use split to get an array
* explode the array
* use regexp_replace
* groupBy + collect_list
* join with original questions DataFrame
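To see what the first hints do to a single value, the same string operations can be tried in plain Python before writing the Spark version. This is only an illustration: the `<tag><tag>` layout of the raw string is an assumption inferred from the hints, and the sample value is made up.

```python
import re

# Hypothetical raw value of the tags column; the '<tag><tag>' layout is
# an assumption inferred from the split and regexp_replace hints.
raw_tags = '<python><apache-spark><etl>'

# split: cut the string at every '><' boundary.
# The outer brackets survive: ['<python', 'apache-spark', 'etl>']
pieces = raw_tags.split('><')

# explode + regexp_replace: handle one piece at a time, dropping '<' and '>'.
cleaned = [re.sub(r'(<|>)', '', piece) for piece in pieces]

# groupBy + collect_list: the pieces end up in one list per question again.
print(cleaned)
```

In Spark the per-piece step happens on separate rows, which is why the pieces have to be grouped by question and collected afterwards.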

In [None]:
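resultDF = (
    questionsDF
    # split the tag string at every '><' boundary
    .withColumn('tags_arr', split('tags', '><'))
    # one row per tag
    .withColumn('tag', explode('tags_arr'))
    # strip the remaining '<' and '>' characters
    .withColumn('tag', regexp_replace('tag', '(<|>)', ''))
    # collect the cleaned tags back into one array per question
    .groupBy('question_id')
    .agg(collect_list('tag').alias('tags'))
    # re-attach the other columns, without the original string tags column
    .join(questionsDF.drop('tags'), 'question_id')
)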

<b>Save the data</b>

Hint:

* repartition the data to 4 partitions before saving
  * this will create 4 files

In [None]:
(
    resultDF
    .repartition(4)
    .write
    .format('parquet')  # explicit; otherwise save() falls back to spark.sql.sources.default
    .mode('overwrite')
    .option('path', output_path)
    .save()
)
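One way to confirm the number of files is to count the data files in the output directory. The sketch below assumes Spark's usual layout, where data files start with `part-` and sit next to a `_SUCCESS` marker (and, on some filesystems, hidden checksum files). `count_part_files` is a hypothetical helper, and the demonstration runs on a throwaway directory with made-up file names rather than on the real output.

```python
import tempfile
from pathlib import Path

def count_part_files(directory):
    """Count files directly inside `directory` whose names start with 'part-'."""
    return sum(
        1
        for p in Path(directory).iterdir()
        if p.is_file() and p.name.startswith('part-')
    )

# Demonstration on a temporary directory that imitates Spark's output layout.
with tempfile.TemporaryDirectory() as tmp:
    for i in range(4):
        (Path(tmp) / f'part-{i:05d}-example.snappy.parquet').touch()
    (Path(tmp) / '_SUCCESS').touch()  # marker file, not a data file
    n_files = count_part_files(tmp)

print(n_files)
```

In the notebook, calling `count_part_files(output_path)` after the write should give 4 if every partition received at least one row.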

<b>Check if we saved the data correctly:</b>

In [None]:
checkDF = (
    spark
    .read
    .parquet(output_path)
)

In [None]:
checkDF.count()

In [None]:
checkDF.show(truncate=5)

In [None]:
checkDF.select('tags').show(truncate=False)

In [None]:
spark.stop()