# Partitioning

In this notebook you will partition a dataset on disk and see how to control the number of files written to each partition folder.

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import year, rand

import os

In [None]:
spark = (
    SparkSession
    .builder
    .appName('Partitioning I')
    .getOrCreate()
)

In [None]:
base_path = os.getcwd()

# the project root is three directories above the current working directory
project_path = '/'.join(base_path.split('/')[0:-3])

questions_input_path = os.path.join(project_path, 'data/questions')

output_path_I = os.path.join(project_path, 'output/questions-partitioned/1')
output_path_II = os.path.join(project_path, 'output/questions-partitioned/2')
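
The `split`/`join` expression above drops the last three components of the working directory, so `project_path` points three levels up. A minimal sketch of that behaviour, assuming a POSIX-style path; `example_cwd` is a hypothetical value used only for illustration, and the `pathlib` variant is an alternative way to express the same step, not what the notebook uses:

```python
from pathlib import PurePosixPath

# Hypothetical working directory, used only for illustration.
example_cwd = '/home/user/project/notebooks/partitioning/solutions'

# Same expression as in the notebook: drop the last three path components.
via_split = '/'.join(example_cwd.split('/')[0:-3])

# Equivalent with pathlib: parents[2] is the third ancestor of the path.
via_pathlib = str(PurePosixPath(example_cwd).parents[2])

print(via_split)
print(via_pathlib)
```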

# Task I

* read the questions dataset into a DataFrame
* add a new column `year` that is derived from the `creation_date`
* partition the questions dataset by this new `year` column and make sure that only one file is created per folder
* save the data in the `output_path_I` location

In [None]:
# read the questions data and add column year

questionsDF = (
    spark
    .read
    .option('path', questions_input_path)
    .load()
    .withColumn('year', year('creation_date'))
)

#### Save the data:

Hint:
* [repartition](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.repartition.html#pyspark.sql.DataFrame.repartition) the data by the `year` column to achieve one file per folder
* call [partitionBy](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrameWriter.partitionBy.html#pyspark.sql.DataFrameWriter.partitionBy) on the DataFrameWriter

In [None]:
(
    questionsDF
    .repartition('year')
    .write
    .mode('overwrite')
    .partitionBy('year')
    .option('path', output_path_I)
    .save()
)
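
One way to check the result is to count the data files in each `year=...` folder. The sketch below uses only the standard library and assumes that data files carry the usual `part-` name prefix; `count_part_files` is a hypothetical helper, not part of Spark or of this notebook.

```python
import os


def count_part_files(root):
    """Return {partition folder: number of data files} for a partitioned output directory."""
    counts = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        # Count only data files; this skips _SUCCESS markers and hidden .crc checksum files.
        n = sum(1 for name in filenames if name.startswith('part-'))
        if n:
            counts[os.path.relpath(dirpath, root)] = n
    return counts
```

Calling `count_part_files(output_path_I)` after the write should then map every `year=...` folder to `1`.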

# Task II

Partition the data again. Do the same as before, but this time make sure that five files are created per folder, and save the data in the `output_path_II` location.

Hint:
* repartition the data by `year` and a random expression that generates an integer from the interval [0, 4]
    * use [rand](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.rand.html#pyspark.sql.functions.rand)
    * use the modulo operator `%`
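
To see why scaling a uniform random number, truncating it to an integer, and taking it modulo 5 gives a value in [0, 4], the same arithmetic can be tried in plain Python. This is only an illustration of the expression, not Spark code; `bucket` is a hypothetical helper and the scale factor of 100 is an assumption chosen for the example.

```python
import random
from collections import Counter


def bucket(r, n_buckets=5):
    """Map one uniform draw r in [0, 1) to an integer bucket in [0, n_buckets - 1]."""
    # Scale to [0, 100), truncate to an integer, then wrap with modulo.
    return int(r * 100) % n_buckets


rng = random.Random(12)  # fixed seed so the run is repeatable
counts = Counter(bucket(rng.random()) for _ in range(10_000))

# Each of the five buckets should receive roughly a fifth of the draws.
print(sorted(counts.items()))
```

Repartitioning by `year` together with such a bucket spreads each year over up to five shuffle partitions. Because `repartition` hashes the combined key into a fixed number of partitions, two keys can occasionally land in the same partition, so a folder may end up with fewer than five files.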

In [None]:
(
    questionsDF
    .repartition('year', (rand(12) * 100).cast('int') % 5)
    .write
    .mode('overwrite')
    .partitionBy('year')
    .option('path', output_path_II)
    .save()
)

In [None]:
spark.stop()