### Reading

In [0]:
import getpass
username = getpass.getuser()

In [0]:
input_dir = '/public/retail_db_json'
output_dir = f'/user/{username}/retail_db_parquet'

In [0]:
for file_details in dbutils.fs.ls(input_dir):
    # Skip the .git folder and SQL scripts; every other entry is a data set folder.
    if not ('.git' in file_details.path or file_details.path.endswith('sql')):
        print(f'Converting data in {file_details.path} folder from json to parquet')
        # Folder paths end with '/', so the data set name is the second-to-last segment.
        data_set_dir = file_details.path.split('/')[-2]
        df = spark.read.json(file_details.path)
        df.coalesce(1).write.parquet(f'{output_dir}/{data_set_dir}', mode='overwrite')

Converting data in dbfs:/public/retail_db_json/categories/ folder from json to parquet
Converting data in dbfs:/public/retail_db_json/customers/ folder from json to parquet
Converting data in dbfs:/public/retail_db_json/departments/ folder from json to parquet
Converting data in dbfs:/public/retail_db_json/order_items/ folder from json to parquet
Converting data in dbfs:/public/retail_db_json/orders/ folder from json to parquet
Converting data in dbfs:/public/retail_db_json/products/ folder from json to parquet


In [0]:
orders = spark.read.parquet(f'/user/{username}/retail_db_parquet/orders')
orders.show(5)

+-----------------+--------------------+--------+---------------+
|order_customer_id|          order_date|order_id|   order_status|
+-----------------+--------------------+--------+---------------+
|            11599|2013-07-25 00:00:...|       1|         CLOSED|
|              256|2013-07-25 00:00:...|       2|PENDING_PAYMENT|
|            12111|2013-07-25 00:00:...|       3|       COMPLETE|
|             8827|2013-07-25 00:00:...|       4|         CLOSED|
|            11318|2013-07-25 00:00:...|       5|       COMPLETE|
+-----------------+--------------------+--------+---------------+
only showing top 5 rows



In [0]:
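# Reading JSON files with the CSV reader does not fail: each line is split on commas,
# so every column ends up holding a raw JSON fragment instead of a value.
orders = spark.read.csv('/public/retail_db_json/orders', header=False, inferSchema=False).toDF('order_id', 'order_date', 'order_customer_id', 'order_status')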
orders.show(1)

+-------------+--------------------+--------------------+--------------------+
|     order_id|          order_date|   order_customer_id|        order_status|
+-------------+--------------------+--------------------+--------------------+
|{"order_id":1|"order_date":"201...|"order_customer_i...|"order_status":"C...|
+-------------+--------------------+--------------------+--------------------+
only showing top 1 row



In [0]:
df = spark.read.json('/public/retail_db_json/orders')
df = spark.read.format('json').load('/public/retail_db_json/orders')

***Parquet to SparkDF***

In [0]:
df = spark.read.parquet(f'/user/{username}/retail_db_parquet/orders')
df = spark.read.format('parquet').load(f'/user/{username}/retail_db_parquet/orders')

df.show(5)

+-----------------+--------------------+--------+---------------+
|order_customer_id|          order_date|order_id|   order_status|
+-----------------+--------------------+--------+---------------+
|            11599|2013-07-25 00:00:...|       1|         CLOSED|
|              256|2013-07-25 00:00:...|       2|PENDING_PAYMENT|
|            12111|2013-07-25 00:00:...|       3|       COMPLETE|
|             8827|2013-07-25 00:00:...|       4|         CLOSED|
|            11318|2013-07-25 00:00:...|       5|       COMPLETE|
+-----------------+--------------------+--------+---------------+
only showing top 5 rows



### Writing

In [0]:
courses = [{'course_id': 1,
  'course_name': '2020 Complete Python Bootcamp: From Zero to Hero in Python',
  'suitable_for': 'Beginner',
  'enrollment': 1100093,
  'stars': 4.6,
  'number_of_ratings': 318066},
 {'course_id': 4,
  'course_name': 'Angular - The Complete Guide (2020 Edition)',
  'suitable_for': 'Intermediate',
  'enrollment': 422557,
  'stars': 4.6,
  'number_of_ratings': 129984},
 {'course_id': 12,
  'course_name': 'Automate the Boring Stuff with Python Programming',
  'suitable_for': 'Advanced',
  'enrollment': 692617,
  'stars': 4.6,
  'number_of_ratings': 70508},
 {'course_id': 10,
  'course_name': 'Complete C# Unity Game Developer 2D',
  'suitable_for': 'Advanced',
  'enrollment': 364934,
  'stars': 4.6,
  'number_of_ratings': 78989},
 {'course_id': 5,
  'course_name': 'Java Programming Masterclass for Software Developers',
  'suitable_for': 'Advanced',
  'enrollment': 502572,
  'stars': 4.6,
  'number_of_ratings': 123798},
 {'course_id': 15,
  'course_name': 'Learn Python Programming Masterclass',
  'suitable_for': 'Advanced',
  'enrollment': 240790,
  'stars': 4.5,
  'number_of_ratings': 58677},
 {'course_id': 3,
  'course_name': 'Machine Learning A-Z™: Hands-On Python & R In Data Science',
  'suitable_for': 'Intermediate',
  'enrollment': 692812,
  'stars': 4.5,
  'number_of_ratings': 132228},
 {'course_id': 14,
  'course_name': 'Modern React with Redux [2020 Update]',
  'suitable_for': 'Intermediate',
  'enrollment': 203214,
  'stars': 4.7,
  'number_of_ratings': 60835},
 {'course_id': 8,
  'course_name': 'Python for Data Science and Machine Learning Bootcamp',
  'suitable_for': 'Intermediate',
  'enrollment': 387789,
  'stars': 4.6,
  'number_of_ratings': 87403},
 {'course_id': 6,
  'course_name': 'React - The Complete Guide (incl Hooks, React Router, Redux)',
  'suitable_for': 'Intermediate',
  'enrollment': 304670,
  'stars': 4.6,
  'number_of_ratings': 90964},
 {'course_id': 18,
  'course_name': 'Selenium WebDriver with Java -Basics to Advanced+Frameworks',
  'suitable_for': 'Advanced',
  'enrollment': 148562,
  'stars': 4.6,
  'number_of_ratings': 49947},
 {'course_id': 21,
  'course_name': 'Spring & Hibernate for Beginners (includes Spring Boot)',
  'suitable_for': 'Advanced',
  'enrollment': 177053,
  'stars': 4.6,
  'number_of_ratings': 45329},
 {'course_id': 7,
  'course_name': 'The Complete 2020 Web Development Bootcamp',
  'suitable_for': 'Beginner',
  'enrollment': 270656,
  'stars': 4.7,
  'number_of_ratings': 88098},
 {'course_id': 9,
  'course_name': 'The Complete JavaScript Course 2020: Build Real Projects!',
  'suitable_for': 'Intermediate',
  'enrollment': 347979,
  'stars': 4.6,
  'number_of_ratings': 83521},
 {'course_id': 16,
  'course_name': 'The Complete Node.js Developer Course (3rd Edition)',
  'suitable_for': 'Advanced',
  'enrollment': 202922,
  'stars': 4.7,
  'number_of_ratings': 50885},
 {'course_id': 13,
  'course_name': 'The Complete Web Developer Course 2.0',
  'suitable_for': 'Intermediate',
  'enrollment': 273598,
  'stars': 4.5,
  'number_of_ratings': 63175},
 {'course_id': 11,
  'course_name': 'The Data Science Course 2020: Complete Data Science Bootcamp',
  'suitable_for': 'Beginner',
  'enrollment': 325047,
  'stars': 4.5,
  'number_of_ratings': 76907},
 {'course_id': 20,
  'course_name': 'The Ultimate MySQL Bootcamp: Go from SQL Beginner to Expert',
  'suitable_for': 'Beginner',
  'enrollment': 203366,
  'stars': 4.6,
  'number_of_ratings': 45382},
 {'course_id': 2,
  'course_name': 'The Web Developer Bootcamp',
  'suitable_for': 'Beginner',
  'enrollment': 596726,
  'stars': 4.6,
  'number_of_ratings': 182997},
 {'course_id': 19,
  'course_name': 'Unreal Engine C++ Developer: Learn C++ and Make Video Games',
  'suitable_for': 'Advanced',
  'enrollment': 229005,
  'stars': 4.5,
  'number_of_ratings': 45860},
 {'course_id': 17,
  'course_name': 'iOS 13 & Swift 5 - The Complete iOS App Development Bootcamp',
  'suitable_for': 'Advanced',
  'enrollment': 179598,
  'stars': 4.8,
  'number_of_ratings': 49972}]

In [0]:
from pyspark.sql import Row
courses_df = spark.createDataFrame([Row(**course) for course in courses])

In [0]:
courses_df.write.csv(f'/user/{username}/courses')

> using compression while writing

In [0]:
courses_df. \
    coalesce(1). \
    write. \
    csv(
        f'/user/{username}/courses', 
        mode='overwrite', 
        compression='gzip',
        header=True)
    

In [0]:
dbutils.fs.ls(f'/user/{username}/courses')

Out[12]: [FileInfo(path='dbfs:/user/root/courses/_SUCCESS', name='_SUCCESS', size=0, modificationTime=1672743223000),
 FileInfo(path='dbfs:/user/root/courses/_committed_664287561301651003', name='_committed_664287561301651003', size=368, modificationTime=1672743006000),
 FileInfo(path='dbfs:/user/root/courses/_committed_7514389859754421215', name='_committed_7514389859754421215', size=468, modificationTime=1672743222000),
 FileInfo(path='dbfs:/user/root/courses/_started_664287561301651003', name='_started_664287561301651003', size=0, modificationTime=1672743005000),
 FileInfo(path='dbfs:/user/root/courses/_started_7514389859754421215', name='_started_7514389859754421215', size=0, modificationTime=1672743222000),
 FileInfo(path='dbfs:/user/root/courses/part-00000-tid-7514389859754421215-c6a2dadb-3e51-4370-9f0c-2056f9584666-7-1-c000.csv.gz', name='part-00000-tid-7514389859754421215-c6a2dadb-3e51-4370-9f0c-2056f9584666-7-1-c000.csv.gz', size=896, modificationTime=1672743222000)]

* `coalesce` and `repartition` are DataFrame methods.
* `coalesce` is typically used to **reduce the number of partitions** for downstream processing. Do not confuse it with the `coalesce` function in `pyspark.sql.functions`, which deals with null values in a given column.
* `repartition` is used to reshuffle the data into a **higher or lower number of partitions** for downstream processing.
* To run these examples and see the difference yourself, use a cluster with a larger configuration:
  * 2 to 3 worker nodes using Standard with 14 to 16 GB RAM and 4 cores each.

* `repartition` incurs **shuffling**, which takes time because the data has to be redistributed across the new number of partitions.
* You can also `repartition` the DataFrame based on specified columns.
* `coalesce` does not incur shuffling. It merges existing partitions, so it can only reduce their number.
* We often use `coalesce` before writing, so that the data is written to fewer files.

There are several equivalent ways to specify the save mode when writing a DataFrame to files. `file_format` can be any format supported out of the box, such as `text`, `csv`, `json`, `parquet`, or `orc`.
* `courses_df.write.mode(saveMode).file_format(path_to_folder)`
* `courses_df.write.file_format(path_to_folder, mode=saveMode)`
* `courses_df.write.mode(saveMode).format('file_format').save(path_to_folder)`
* `courses_df.write.format('file_format').save(path_to_folder, mode=saveMode)`