* Reading files using direct APIs such as `csv`, `json`, etc under `spark.read`.
* Reading files using `format` and `load` under `spark.read`.
* Specifying options as arguments as well as using functions such as `option` and `options`.
* Supported file formats.
  * `csv`
  * `text`
  * `json`
  * `parquet`
  * `orc`
* Other common file formats.
  * `xml`
  * `avro`
* Important file formats for certification - `csv`, `json`, `parquet`
* Reading compressed files

Steps to follow while reading files

* Check if the files are compressed (gz, snappy, bz2, etc). Most common ones are gz and snappy.
* Understand the file format (text, json, avro, parquet, orc, etc). Sometimes files will not have extensions.
* If the files do not have extensions, confirm the details by going through the tech spec or by opening the file.
* We will get tech specs from our leads or architects while working on real world projects.
* If the files are of text file format, check if the data is delimited or separated by a specific character.
* Use appropriate API under `spark.read` to read the data.
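
The first checks in the list above can be sketched with the Python standard library alone. This is a minimal sketch, not part of the original material: the helper names `detect_compression` and `guess_delimiter` and the sample file are hypothetical, and only gzip and bz2 are detected (snappy is left out because its leading bytes depend on the container format).

```python
import csv
import gzip
import os
import tempfile

# Leading "magic" bytes that identify two common compressed formats.
MAGIC_BYTES = {
    b"\x1f\x8b": "gzip",
    b"BZh": "bz2",
}


def detect_compression(path):
    """Return 'gzip', 'bz2', or None based on the first bytes of the file."""
    with open(path, "rb") as f:
        head = f.read(4)
    for magic, name in MAGIC_BYTES.items():
        if head.startswith(magic):
            return name
    return None


def guess_delimiter(sample_text, candidates=",|\t;"):
    """Guess the field delimiter from a small sample of decoded text."""
    return csv.Sniffer().sniff(sample_text, delimiters=candidates).delimiter


# Demo on a throwaway gzip file that deliberately has no extension.
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "courses_no_extension")
with gzip.open(sample_path, "wt", encoding="utf-8") as f:
    f.write("course_id|course_name|stars\n1|Python Bootcamp|4.6\n2|Web Bootcamp|4.6\n")

compression = detect_compression(sample_path)
with gzip.open(sample_path, "rt", encoding="utf-8") as f:
    delimiter = guess_delimiter(f.read(1024))

print(compression, repr(delimiter))
```

Once the compression and the delimiter are known, the matching API under `spark.read` can be chosen, for example `spark.read.csv` with `sep` set to the detected character.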

* We can read the data from CSV files into Spark Data Frame using multiple approaches.
* Approach 1: `spark.read.csv('path_to_folder')`
* Approach 2: `spark.read.format('csv').load('path_to_folder')`
* We can explicitly specify the schema as a string or using `StructType`.
* We can also read data which is delimited or separated by characters other than a comma.
* If the files have a header, we can create the Data Frame with schema by using options such as `header` and `inferSchema`. Column names are picked from the header, while data types are inferred from the data.
* If the files do not have a header, we can create the Data Frame with schema by passing column names using `toDF` and by using the `inferSchema` option.

We can pass the options using different ways while creating the Data Frame.
* Using keyword arguments as part of the APIs. We can use keyword arguments with `load` as well as with a direct API (`csv`).
* `spark.read.option`
* `spark.read.options`
* If a key passed through `option` or `options` is misspelled, that option is silently ignored.

The available options and arguments vary depending on the API used for each file format.

Side effects of inferring schema while creating Spark Data Frame

* If `inferSchema` is used, the entire data needs to be read to infer the schema accurately while creating the Data Frame.
* If the data size is large, additional time will be spent inferring the schema.
* When we explicitly specify the schema, data will not be read while creating the Data Frame.
* As we have seen, we can explicitly specify the schema using a string or `StructType`.
* Inferring the schema comes in handy to quickly understand the structure of the data as part of proofs of concept as well as design.
* A schema is inferred by default for JSON, Parquet and ORC files. Parquet and ORC files carry column names and data types in their own metadata, while for JSON Spark derives them by scanning the records.
* For CSV files without a header, Spark generates the column names (`_c0`, `_c1`, and so on). If `inferSchema` is used, Spark determines the data types from the data. If the files contain a header, the column names can be taken from it using `header`. If not, we need to pass the column names explicitly using `toDF`.

Writing

* Writing files using direct APIs such as `csv`, `json`, etc. under `df.write`, where `df` is a Spark Data Frame and `df.write` is of type `DataFrameWriter`.
* Writing files using `format` and `save` under `df.write`
* Specifying options as arguments as well as using functions such as `option` and `options`.
* Supported file formats.
    * `csv`
    * `text`
    * `json`
    * `parquet`
    * `orc`
* Other common file formats.
    * `xml`
    * `avro`
* Important file formats for certification - `csv`,`json`,`parquet`
* Writing into compressed files

In [0]:
courses = [{'course_id': 1,
  'course_name': '2020 Complete Python Bootcamp: From Zero to Hero in Python',
  'suitable_for': 'Beginner',
  'enrollment': 1100093,
  'stars': 4.6,
  'number_of_ratings': 318066},
 {'course_id': 4,
  'course_name': 'Angular - The Complete Guide (2020 Edition)',
  'suitable_for': 'Intermediate',
  'enrollment': 422557,
  'stars': 4.6,
  'number_of_ratings': 129984},
 {'course_id': 12,
  'course_name': 'Automate the Boring Stuff with Python Programming',
  'suitable_for': 'Advanced',
  'enrollment': 692617,
  'stars': 4.6,
  'number_of_ratings': 70508},
 {'course_id': 10,
  'course_name': 'Complete C# Unity Game Developer 2D',
  'suitable_for': 'Advanced',
  'enrollment': 364934,
  'stars': 4.6,
  'number_of_ratings': 78989},
 {'course_id': 5,
  'course_name': 'Java Programming Masterclass for Software Developers',
  'suitable_for': 'Advanced',
  'enrollment': 502572,
  'stars': 4.6,
  'number_of_ratings': 123798},
 {'course_id': 15,
  'course_name': 'Learn Python Programming Masterclass',
  'suitable_for': 'Advanced',
  'enrollment': 240790,
  'stars': 4.5,
  'number_of_ratings': 58677},
 {'course_id': 3,
  'course_name': 'Machine Learning A-Z™: Hands-On Python & R In Data Science',
  'suitable_for': 'Intermediate',
  'enrollment': 692812,
  'stars': 4.5,
  'number_of_ratings': 132228},
 {'course_id': 14,
  'course_name': 'Modern React with Redux [2020 Update]',
  'suitable_for': 'Intermediate',
  'enrollment': 203214,
  'stars': 4.7,
  'number_of_ratings': 60835},
 {'course_id': 8,
  'course_name': 'Python for Data Science and Machine Learning Bootcamp',
  'suitable_for': 'Intermediate',
  'enrollment': 387789,
  'stars': 4.6,
  'number_of_ratings': 87403},
 {'course_id': 6,
  'course_name': 'React - The Complete Guide (incl Hooks, React Router, Redux)',
  'suitable_for': 'Intermediate',
  'enrollment': 304670,
  'stars': 4.6,
  'number_of_ratings': 90964},
 {'course_id': 18,
  'course_name': 'Selenium WebDriver with Java -Basics to Advanced+Frameworks',
  'suitable_for': 'Advanced',
  'enrollment': 148562,
  'stars': 4.6,
  'number_of_ratings': 49947},
 {'course_id': 21,
  'course_name': 'Spring & Hibernate for Beginners (includes Spring Boot)',
  'suitable_for': 'Advanced',
  'enrollment': 177053,
  'stars': 4.6,
  'number_of_ratings': 45329},
 {'course_id': 7,
  'course_name': 'The Complete 2020 Web Development Bootcamp',
  'suitable_for': 'Beginner',
  'enrollment': 270656,
  'stars': 4.7,
  'number_of_ratings': 88098},
 {'course_id': 9,
  'course_name': 'The Complete JavaScript Course 2020: Build Real Projects!',
  'suitable_for': 'Intermediate',
  'enrollment': 347979,
  'stars': 4.6,
  'number_of_ratings': 83521},
 {'course_id': 16,
  'course_name': 'The Complete Node.js Developer Course (3rd Edition)',
  'suitable_for': 'Advanced',
  'enrollment': 202922,
  'stars': 4.7,
  'number_of_ratings': 50885},
 {'course_id': 13,
  'course_name': 'The Complete Web Developer Course 2.0',
  'suitable_for': 'Intermediate',
  'enrollment': 273598,
  'stars': 4.5,
  'number_of_ratings': 63175},
 {'course_id': 11,
  'course_name': 'The Data Science Course 2020: Complete Data Science Bootcamp',
  'suitable_for': 'Beginner',
  'enrollment': 325047,
  'stars': 4.5,
  'number_of_ratings': 76907},
 {'course_id': 20,
  'course_name': 'The Ultimate MySQL Bootcamp: Go from SQL Beginner to Expert',
  'suitable_for': 'Beginner',
  'enrollment': 203366,
  'stars': 4.6,
  'number_of_ratings': 45382},
 {'course_id': 2,
  'course_name': 'The Web Developer Bootcamp',
  'suitable_for': 'Beginner',
  'enrollment': 596726,
  'stars': 4.6,
  'number_of_ratings': 182997},
 {'course_id': 19,
  'course_name': 'Unreal Engine C++ Developer: Learn C++ and Make Video Games',
  'suitable_for': 'Advanced',
  'enrollment': 229005,
  'stars': 4.5,
  'number_of_ratings': 45860},
 {'course_id': 17,
  'course_name': 'iOS 13 & Swift 5 - The Complete iOS App Development Bootcamp',
  'suitable_for': 'Advanced',
  'enrollment': 179598,
  'stars': 4.8,
  'number_of_ratings': 49972}]

from pyspark.sql import Row
courses_df = spark.createDataFrame([Row(**course) for course in courses])

In [0]:
type(courses_df.write)

Out[2]: pyspark.sql.readwriter.DataFrameWriter

In [0]:
import getpass
username = getpass.getuser()

In [0]:
courses_df.write.json(f'/user/{username}/courses', mode='overwrite')

In [0]:
dbutils.fs.ls(f'/user/{username}/courses')

Out[5]: [FileInfo(path='dbfs:/user/root/courses/_SUCCESS', name='_SUCCESS', size=0, modificationTime=1651321814000),
 FileInfo(path='dbfs:/user/root/courses/_committed_5079863179303282773', name='_committed_5079863179303282773', size=728, modificationTime=1651321813000),
 FileInfo(path='dbfs:/user/root/courses/_started_2283988279216726262', name='_started_2283988279216726262', size=0, modificationTime=1651321086000),
 FileInfo(path='dbfs:/user/root/courses/_started_2689050122303283664', name='_started_2689050122303283664', size=0, modificationTime=1651320895000),
 FileInfo(path='dbfs:/user/root/courses/_started_5079863179303282773', name='_started_5079863179303282773', size=0, modificationTime=1651321810000),
 FileInfo(path='dbfs:/user/root/courses/_started_9222419460581728937', name='_started_9222419460581728937', size=0, modificationTime=1651320835000),
 FileInfo(path='dbfs:/user/root/courses/part-00000-tid-5079863179303282773-6ae761ae-0644-4001-8937-851eac58aca4-0-1-c000.json', name

In [0]:
courses_df.write.format('json').save(f'/user/{username}/courses', mode = 'overwrite')

##### Steps to follow to write Spark Data Frames into Files.

* Make sure to analyse the schema of the Data Frame.
* Make sure you have write permission on the target location.
* Decide whether you want to overwrite, append, ignore, or raise an exception if the target folder already exists.
* Decide whether you would like to compress the data or not.
* Make sure you understand whether the files will be compressed or not by default.
* Use appropriate APIs along with the right arguments based on the requirements.

* We can write the data from a Spark Data Frame into CSV files using multiple approaches.
* Approach 1: `df.write.csv('path_to_folder')`
* Approach 2: `df.write.format('csv').save('path_to_folder')`
* The column names from the schema can be added as a header to each of the files by passing `header=True`.
* We can also write the data into files using characters other than comma as delimiters or separators.
* We can also compress the data while writing into files using csv.

In [0]:
courses_df.show()

+---------+--------------------+------------+----------+-----+-----------------+
|course_id|         course_name|suitable_for|enrollment|stars|number_of_ratings|
+---------+--------------------+------------+----------+-----+-----------------+
|        1|2020 Complete Pyt...|    Beginner|   1100093|  4.6|           318066|
|        4|Angular - The Com...|Intermediate|    422557|  4.6|           129984|
|       12|Automate the Bori...|    Advanced|    692617|  4.6|            70508|
|       10|Complete C# Unity...|    Advanced|    364934|  4.6|            78989|
|        5|Java Programming ...|    Advanced|    502572|  4.6|           123798|
|       15|Learn Python Prog...|    Advanced|    240790|  4.5|            58677|
|        3|Machine Learning ...|Intermediate|    692812|  4.5|           132228|
|       14|Modern React with...|Intermediate|    203214|  4.7|            60835|
|        8|Python for Data S...|Intermediate|    387789|  4.6|            87403|
|        6|React - The Compl

In [0]:
courses_df.dtypes

Out[11]: [('course_id', 'bigint'),
 ('course_name', 'string'),
 ('suitable_for', 'string'),
 ('enrollment', 'bigint'),
 ('stars', 'double'),
 ('number_of_ratings', 'bigint')]

In [0]:
import getpass
username = getpass.getuser()

In [0]:
dbutils.fs.ls(f'/user/{username}/courses')

Out[13]: [FileInfo(path='dbfs:/user/root/courses/_SUCCESS', name='_SUCCESS', size=0, modificationTime=1651321825000),
 FileInfo(path='dbfs:/user/root/courses/_committed_5079863179303282773', name='_committed_5079863179303282773', size=728, modificationTime=1651321813000),
 FileInfo(path='dbfs:/user/root/courses/_committed_6197698546425054723', name='_committed_6197698546425054723', size=1448, modificationTime=1651321823000),
 FileInfo(path='dbfs:/user/root/courses/_started_2283988279216726262', name='_started_2283988279216726262', size=0, modificationTime=1651321086000),
 FileInfo(path='dbfs:/user/root/courses/_started_2689050122303283664', name='_started_2689050122303283664', size=0, modificationTime=1651320895000),
 FileInfo(path='dbfs:/user/root/courses/_started_5079863179303282773', name='_started_5079863179303282773', size=0, modificationTime=1651321810000),
 FileInfo(path='dbfs:/user/root/courses/_started_6197698546425054723', name='_started_6197698546425054723', size=0, modifica

In [0]:
dbutils.fs.rm(f'/user/{username}/courses',recurse=True)

Out[14]: True

In [0]:
dbutils.fs.ls(f'/user/{username}/courses')

---------------------------------------------------------------------------
ExecutionError                            Traceback (most recent call last)
<command-2016470539919608> in <module>
----> 1 dbutils.fs.ls(f'/user/{username}/courses')

/databricks/python_shell/dbruntime/dbutils.py in f_with_exception_handling(*args, **kwargs)
    387                     exc.__context__ = None
    388                     exc.__cause__ = None
--> 389                     raise exc
    390 
    391             return f

In [0]:
# Default behavior
# It will delimit the data using comma as separator

courses_df.write.csv(f'/user/{username}/courses')

In [0]:
# The default number of files is determined implicitly based on several factors

dbutils.fs.ls(f'/user/{username}/courses')

Out[17]: [FileInfo(path='dbfs:/user/root/courses/_SUCCESS', name='_SUCCESS', size=0, modificationTime=1651322047000),
 FileInfo(path='dbfs:/user/root/courses/_committed_3880254170003546332', name='_committed_3880254170003546332', size=728, modificationTime=1651322046000),
 FileInfo(path='dbfs:/user/root/courses/_started_3880254170003546332', name='_started_3880254170003546332', size=0, modificationTime=1651322044000),
 FileInfo(path='dbfs:/user/root/courses/part-00000-tid-3880254170003546332-8d8cc4a0-a31b-416b-bfa8-464a632bfedc-24-1-c000.csv', name='part-00000-tid-3880254170003546332-8d8cc4a0-a31b-416b-bfa8-464a632bfedc-24-1-c000.csv', size=166, modificationTime=1651322045000),
 FileInfo(path='dbfs:/user/root/courses/part-00001-tid-3880254170003546332-8d8cc4a0-a31b-416b-bfa8-464a632bfedc-25-1-c000.csv', name='part-00001-tid-3880254170003546332-8d8cc4a0-a31b-416b-bfa8-464a632bfedc-25-1-c000.csv', size=144, modificationTime=1651322045000),
 FileInfo(path='dbfs:/user/root/courses/part-000

In [0]:
# Using spark.read.text we can read the raw data as single column data frame 
# We can confirm the default delimiter as comma by looking at the data

spark.read.text(f'/user/{username}/courses').show(truncate=False)

+----------------------------------------------------------------------------------------------+
|value                                                                                         |
+----------------------------------------------------------------------------------------------+
|5,Java Programming Masterclass for Software Developers,Advanced,502572,4.6,123798             |
|15,Learn Python Programming Masterclass,Advanced,240790,4.5,58677                             |
|3,Machine Learning A-Z™: Hands-On Python & R In Data Science,Intermediate,692812,4.5,132228   |
|14,Modern React with Redux [2020 Update],Intermediate,203214,4.7,60835                        |
|7,The Complete 2020 Web Development Bootcamp,Beginner,270656,4.7,88098                        |
|9,The Complete JavaScript Course 2020: Build Real Projects!,Intermediate,347979,4.6,83521     |
|16,The Complete Node.js Developer Course (3rd Edition),Advanced,202922,4.7,50885              |
|13,The Complete Web Developer

In [0]:
courses_df.columns

Out[19]: ['course_id',
 'course_name',
 'suitable_for',
 'enrollment',
 'stars',
 'number_of_ratings']

In [0]:
courses_df.\
  coalesce(1).\
  write.\
  csv(f'/user/{username}/courses', mode='overwrite', header=True)

In [0]:
dbutils.fs.ls(f'/user/{username}/courses')

Out[21]: [FileInfo(path='dbfs:/user/root/courses/_SUCCESS', name='_SUCCESS', size=0, modificationTime=1651322055000),
 FileInfo(path='dbfs:/user/root/courses/_committed_3880254170003546332', name='_committed_3880254170003546332', size=728, modificationTime=1651322046000),
 FileInfo(path='dbfs:/user/root/courses/_committed_4072114454121548129', name='_committed_4072114454121548129', size=826, modificationTime=1651322054000),
 FileInfo(path='dbfs:/user/root/courses/_started_3880254170003546332', name='_started_3880254170003546332', size=0, modificationTime=1651322044000),
 FileInfo(path='dbfs:/user/root/courses/_started_4072114454121548129', name='_started_4072114454121548129', size=0, modificationTime=1651322053000),
 FileInfo(path='dbfs:/user/root/courses/part-00000-tid-4072114454121548129-562e36c2-b2bb-4772-844b-2bc1da316fcc-40-1-c000.csv', name='part-00000-tid-4072114454121548129-562e36c2-b2bb-4772-844b-2bc1da316fcc-40-1-c000.csv', size=1775, modificationTime=1651322053000)]

In [0]:
spark.read.csv(f'/user/{username}/courses',header=True).show()

+---------+--------------------+------------+----------+-----+-----------------+
|course_id|         course_name|suitable_for|enrollment|stars|number_of_ratings|
+---------+--------------------+------------+----------+-----+-----------------+
|        1|2020 Complete Pyt...|    Beginner|   1100093|  4.6|           318066|
|        4|Angular - The Com...|Intermediate|    422557|  4.6|           129984|
|       12|Automate the Bori...|    Advanced|    692617|  4.6|            70508|
|       10|Complete C# Unity...|    Advanced|    364934|  4.6|            78989|
|        5|Java Programming ...|    Advanced|    502572|  4.6|           123798|
|       15|Learn Python Prog...|    Advanced|    240790|  4.5|            58677|
|        3|Machine Learning ...|Intermediate|    692812|  4.5|           132228|
|       14|Modern React with...|Intermediate|    203214|  4.7|            60835|
|        8|Python for Data S...|Intermediate|    387789|  4.6|            87403|
|        6|React - The Compl

In [0]:
courses_df.\
  coalesce(1).\
  write.\
  format('csv').\
  save(f'/user/{username}/courses',mode='overwrite',header=True)

In [0]:
dbutils.fs.ls(f'/user/{username}/courses')

Out[24]: [FileInfo(path='dbfs:/user/root/courses/_SUCCESS', name='_SUCCESS', size=0, modificationTime=1651322062000),
 FileInfo(path='dbfs:/user/root/courses/_committed_3880254170003546332', name='_committed_3880254170003546332', size=728, modificationTime=1651322046000),
 FileInfo(path='dbfs:/user/root/courses/_committed_4072114454121548129', name='_committed_4072114454121548129', size=826, modificationTime=1651322054000),
 FileInfo(path='dbfs:/user/root/courses/_committed_6883595519992830733', name='_committed_6883595519992830733', size=199, modificationTime=1651322061000),
 FileInfo(path='dbfs:/user/root/courses/_started_3880254170003546332', name='_started_3880254170003546332', size=0, modificationTime=1651322044000),
 FileInfo(path='dbfs:/user/root/courses/_started_4072114454121548129', name='_started_4072114454121548129', size=0, modificationTime=1651322053000),
 FileInfo(path='dbfs:/user/root/courses/_started_6883595519992830733', name='_started_6883595519992830733', size=0, mod

In [0]:
spark.read.csv(f'/user/{username}/courses',header=True).show()

+---------+--------------------+------------+----------+-----+-----------------+
|course_id|         course_name|suitable_for|enrollment|stars|number_of_ratings|
+---------+--------------------+------------+----------+-----+-----------------+
|        1|2020 Complete Pyt...|    Beginner|   1100093|  4.6|           318066|
|        4|Angular - The Com...|Intermediate|    422557|  4.6|           129984|
|       12|Automate the Bori...|    Advanced|    692617|  4.6|            70508|
|       10|Complete C# Unity...|    Advanced|    364934|  4.6|            78989|
|        5|Java Programming ...|    Advanced|    502572|  4.6|           123798|
|       15|Learn Python Prog...|    Advanced|    240790|  4.5|            58677|
|        3|Machine Learning ...|Intermediate|    692812|  4.5|           132228|
|       14|Modern React with...|Intermediate|    203214|  4.7|            60835|
|        8|Python for Data S...|Intermediate|    387789|  4.6|            87403|
|        6|React - The Compl

Using Compression while writing Spark Data Frames

In [0]:
courses_df.\
  coalesce(1).\
  write.\
  csv(f'/user/{username}/courses', mode='overwrite', header=True)

In [0]:
dbutils.fs.ls(f'/user/{username}/courses')

Out[27]: [FileInfo(path='dbfs:/user/root/courses/_SUCCESS', name='_SUCCESS', size=0, modificationTime=1651322067000),
 FileInfo(path='dbfs:/user/root/courses/_committed_3880254170003546332', name='_committed_3880254170003546332', size=728, modificationTime=1651322046000),
 FileInfo(path='dbfs:/user/root/courses/_committed_4072114454121548129', name='_committed_4072114454121548129', size=826, modificationTime=1651322054000),
 FileInfo(path='dbfs:/user/root/courses/_committed_4960594449838999904', name='_committed_4960594449838999904', size=199, modificationTime=1651322066000),
 FileInfo(path='dbfs:/user/root/courses/_committed_6883595519992830733', name='_committed_6883595519992830733', size=199, modificationTime=1651322061000),
 FileInfo(path='dbfs:/user/root/courses/_started_3880254170003546332', name='_started_3880254170003546332', size=0, modificationTime=1651322044000),
 FileInfo(path='dbfs:/user/root/courses/_started_4072114454121548129', name='_started_4072114454121548129', size=

In [0]:
help(courses_df.write.csv)

Help on method csv in module pyspark.sql.readwriter:

csv(path, mode=None, compression=None, sep=None, quote=None, escape=None, header=None, nullValue=None, escapeQuotes=None, quoteAll=None, dateFormat=None, timestampFormat=None, ignoreLeadingWhiteSpace=None, ignoreTrailingWhiteSpace=None, charToEscapeQuoteEscaping=None, encoding=None, emptyValue=None, lineSep=None) method of pyspark.sql.readwriter.DataFrameWriter instance
    Saves the content of the :class:`DataFrame` in CSV format at the specified path.
    
    .. versionadded:: 2.0.0
    
    Parameters
    ----------
    path : str
        the path in any Hadoop supported file system
    mode : str, optional
        specifies the behavior of the save operation when data already exists.
    
        * ``append``: Append contents of this :class:`DataFrame` to existing data.
        * ``overwrite``: Overwrite existing data.
        * ``ignore``: Silently ignore this operation if data already exists.
        * ``error`` or ``errorife

In [0]:
courses_df.\
  coalesce(1).\
  write.\
  csv(f'/user/{username}/courses',mode='overwrite',compression='gzip',header=True)

In [0]:
dbutils.fs.ls(f'/user/{username}/courses')

Out[30]: [FileInfo(path='dbfs:/user/root/courses/_SUCCESS', name='_SUCCESS', size=0, modificationTime=1651322071000),
 FileInfo(path='dbfs:/user/root/courses/_committed_3880254170003546332', name='_committed_3880254170003546332', size=728, modificationTime=1651322046000),
 FileInfo(path='dbfs:/user/root/courses/_committed_4072114454121548129', name='_committed_4072114454121548129', size=826, modificationTime=1651322054000),
 FileInfo(path='dbfs:/user/root/courses/_committed_4125942472826022451', name='_committed_4125942472826022451', size=202, modificationTime=1651322070000),
 FileInfo(path='dbfs:/user/root/courses/_committed_4960594449838999904', name='_committed_4960594449838999904', size=199, modificationTime=1651322066000),
 FileInfo(path='dbfs:/user/root/courses/_committed_6883595519992830733', name='_committed_6883595519992830733', size=199, modificationTime=1651322061000),
 FileInfo(path='dbfs:/user/root/courses/_started_3880254170003546332', name='_started_3880254170003546332',

In [0]:
spark.read.csv(f'/user/{username}/courses', header=True).show()

+---------+--------------------+------------+----------+-----+-----------------+
|course_id|         course_name|suitable_for|enrollment|stars|number_of_ratings|
+---------+--------------------+------------+----------+-----+-----------------+
|        1|2020 Complete Pyt...|    Beginner|   1100093|  4.6|           318066|
|        4|Angular - The Com...|Intermediate|    422557|  4.6|           129984|
|       12|Automate the Bori...|    Advanced|    692617|  4.6|            70508|
|       10|Complete C# Unity...|    Advanced|    364934|  4.6|            78989|
|        5|Java Programming ...|    Advanced|    502572|  4.6|           123798|
|       15|Learn Python Prog...|    Advanced|    240790|  4.5|            58677|
|        3|Machine Learning ...|Intermediate|    692812|  4.5|           132228|
|       14|Modern React with...|Intermediate|    203214|  4.7|            60835|
|        8|Python for Data S...|Intermediate|    387789|  4.6|            87403|
|        6|React - The Compl

Specifying Delimiter while Writing

In [0]:
courses_df.\
  coalesce(1).\
  write.\
  csv(f'/user/{username}/courses_pipe', mode='overwrite', compression='gzip', header=True, sep='|')

Reading CSV with pipe delimiter

In [0]:
spark.read.csv(f'/user/{username}/courses_pipe', header=True, sep='|').show()

+---------+--------------------+------------+----------+-----+-----------------+
|course_id|         course_name|suitable_for|enrollment|stars|number_of_ratings|
+---------+--------------------+------------+----------+-----+-----------------+
|        1|2020 Complete Pyt...|    Beginner|   1100093|  4.6|           318066|
|        4|Angular - The Com...|Intermediate|    422557|  4.6|           129984|
|       12|Automate the Bori...|    Advanced|    692617|  4.6|            70508|
|       10|Complete C# Unity...|    Advanced|    364934|  4.6|            78989|
|        5|Java Programming ...|    Advanced|    502572|  4.6|           123798|
|       15|Learn Python Prog...|    Advanced|    240790|  4.5|            58677|
|        3|Machine Learning ...|Intermediate|    692812|  4.5|           132228|
|       14|Modern React with...|Intermediate|    203214|  4.7|            60835|
|        8|Python for Data S...|Intermediate|    387789|  4.6|            87403|
|        6|React - The Compl

Using option/options while Writing

In [0]:
help(courses_df.write.options)

Help on method options in module pyspark.sql.readwriter:

options(**options) method of pyspark.sql.readwriter.DataFrameWriter instance
    Adds output options for the underlying data source.
    
    .. versionadded:: 1.4



In [0]:
courses_df.\
  coalesce(1).\
  write.\
  mode('overwrite').\
  option('compression','gzip').\
  option('header',True).\
  option('sep','|').\
  csv(f'/user/{username}/courses_pipe')

In [0]:
dbutils.fs.ls(f'/user/{username}/courses_pipe')

Out[46]: [FileInfo(path='dbfs:/user/root/courses_pipe/_SUCCESS', name='_SUCCESS', size=0, modificationTime=1651322908000),
 FileInfo(path='dbfs:/user/root/courses_pipe/_committed_8090480025241025876', name='_committed_8090480025241025876', size=115, modificationTime=1651322564000),
 FileInfo(path='dbfs:/user/root/courses_pipe/_committed_8773122826888077239', name='_committed_8773122826888077239', size=216, modificationTime=1651322907000),
 FileInfo(path='dbfs:/user/root/courses_pipe/_started_8090480025241025876', name='_started_8090480025241025876', size=0, modificationTime=1651322563000),
 FileInfo(path='dbfs:/user/root/courses_pipe/_started_8773122826888077239', name='_started_8773122826888077239', size=0, modificationTime=1651322906000),
 FileInfo(path='dbfs:/user/root/courses_pipe/part-00000-tid-8773122826888077239-ebf11f7f-dab4-4547-851e-e756a284bfd1-69-1-c000.csv.gz', name='part-00000-tid-8773122826888077239-ebf11f7f-dab4-4547-851e-e756a284bfd1-69-1-c000.csv.gz', size=896, modifi

In [0]:
spark.read.csv(f'/user/{username}/courses_pipe', sep='|', header=True, inferSchema=True).show()

+---------+--------------------+------------+----------+-----+-----------------+
|course_id|         course_name|suitable_for|enrollment|stars|number_of_ratings|
+---------+--------------------+------------+----------+-----+-----------------+
|        1|2020 Complete Pyt...|    Beginner|   1100093|  4.6|           318066|
|        4|Angular - The Com...|Intermediate|    422557|  4.6|           129984|
|       12|Automate the Bori...|    Advanced|    692617|  4.6|            70508|
|       10|Complete C# Unity...|    Advanced|    364934|  4.6|            78989|
|        5|Java Programming ...|    Advanced|    502572|  4.6|           123798|
|       15|Learn Python Prog...|    Advanced|    240790|  4.5|            58677|
|        3|Machine Learning ...|Intermediate|    692812|  4.5|           132228|
|       14|Modern React with...|Intermediate|    203214|  4.7|            60835|
|        8|Python for Data S...|Intermediate|    387789|  4.6|            87403|
|        6|React - The Compl

In [0]:
courses_df. \
    coalesce(1). \
    write. \
    mode('overwrite'). \
    options(sep='|', header=True, compression='gzip'). \
    csv(f'/user/{username}/courses_pipe')

In [0]:
spark.read.csv(f'/user/{username}/courses_pipe', sep='|', header=True, inferSchema=True).show()

+---------+--------------------+------------+----------+-----+-----------------+
|course_id|         course_name|suitable_for|enrollment|stars|number_of_ratings|
+---------+--------------------+------------+----------+-----+-----------------+
|        1|2020 Complete Pyt...|    Beginner|   1100093|  4.6|           318066|
|        4|Angular - The Com...|Intermediate|    422557|  4.6|           129984|
|       12|Automate the Bori...|    Advanced|    692617|  4.6|            70508|
|       10|Complete C# Unity...|    Advanced|    364934|  4.6|            78989|
|        5|Java Programming ...|    Advanced|    502572|  4.6|           123798|
|       15|Learn Python Prog...|    Advanced|    240790|  4.5|            58677|
|        3|Machine Learning ...|Intermediate|    692812|  4.5|           132228|
|       14|Modern React with...|Intermediate|    203214|  4.7|            60835|
|        8|Python for Data S...|Intermediate|    387789|  4.6|            87403|
|        6|React - The Compl

In [0]:
options = {
  'sep':'|',
  'header':True,
  'compression':'gzip'
}
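
The next cell passes this dictionary as `options(**options)`. That is ordinary Python keyword unpacking, so it is the same call as `options(sep='|', header=True, compression='gzip')`. A minimal sketch, where `collect_options` is a hypothetical stand-in (not a Spark API) that only records the keyword arguments it receives:

```python
# Hypothetical stand-in for an `options(**options)` style method.
# It is NOT part of Spark; it just returns the keyword arguments it was given.
def collect_options(**options):
    return dict(options)


options = {
    'sep': '|',
    'header': True,
    'compression': 'gzip',
}

# Spelling out each keyword and unpacking the dictionary are the same call.
explicit = collect_options(sep='|', header=True, compression='gzip')
unpacked = collect_options(**options)

print(explicit == unpacked)
```

Keeping the options in one dictionary makes it easy to reuse the same settings for both reading and writing.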

In [0]:
# reading using options
# we can use **options in same way while writing
spark.read.options(**options).csv(f'/user/{username}/courses_pipe').show()

+---------+--------------------+------------+----------+-----+-----------------+
|course_id|         course_name|suitable_for|enrollment|stars|number_of_ratings|
+---------+--------------------+------------+----------+-----+-----------------+
|        1|2020 Complete Pyt...|    Beginner|   1100093|  4.6|           318066|
|        4|Angular - The Com...|Intermediate|    422557|  4.6|           129984|
|       12|Automate the Bori...|    Advanced|    692617|  4.6|            70508|
|       10|Complete C# Unity...|    Advanced|    364934|  4.6|            78989|
|        5|Java Programming ...|    Advanced|    502572|  4.6|           123798|
|       15|Learn Python Prog...|    Advanced|    240790|  4.5|            58677|
|        3|Machine Learning ...|Intermediate|    692812|  4.5|           132228|
|       14|Modern React with...|Intermediate|    203214|  4.7|            60835|
|        8|Python for Data S...|Intermediate|    387789|  4.6|            87403|
|        6|React - The Compl
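
The files written above can also be inspected without Spark. As a rough sketch using only Python's standard library, the block below writes and reads a gzip-compressed, pipe-delimited file with a header. The local temp path and the file name `part-00000.csv.gz` are assumptions for illustration, and the two sample rows are copied from the truncated output above.

```python
import csv
import gzip
import os
import tempfile

# Column names and two sample rows as they appear in the output above
# (course names are truncated there, so they are truncated here too).
header = ['course_id', 'course_name', 'suitable_for',
          'enrollment', 'stars', 'number_of_ratings']
rows = [
    [1, '2020 Complete Pyt...', 'Beginner', 1100093, 4.6, 318066],
    [4, 'Angular - The Com...', 'Intermediate', 422557, 4.6, 129984],
]

# Hypothetical local path standing in for one part file of the output folder.
path = os.path.join(tempfile.mkdtemp(), 'part-00000.csv.gz')

# Write with gzip compression, a header line and '|' as the separator.
with gzip.open(path, 'wt', newline='') as f:
    writer = csv.writer(f, delimiter='|')
    writer.writerow(header)
    writer.writerows(rows)

# Read it back: the header supplies the column names.
# Without any type inference every value comes back as a string.
with gzip.open(path, 'rt', newline='') as f:
    records = list(csv.DictReader(f, delimiter='|'))

print(len(records))
print(records[0]['course_name'])
```

The values come back as strings here, which is the same reason `inferSchema=True` is passed to `spark.read.csv` in the cells above.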

Writing Data Frames into JSON Files

In [0]:
courses_df.\
  coalesce(1).\
  write.\
  json(f'/user/{username}/courses', mode='overwrite')

In [0]:
spark.read.json(f'/user/{username}/courses').show()

+---------+--------------------+----------+-----------------+-----+------------+
|course_id|         course_name|enrollment|number_of_ratings|stars|suitable_for|
+---------+--------------------+----------+-----------------+-----+------------+
|        1|2020 Complete Pyt...|   1100093|           318066|  4.6|    Beginner|
|        4|Angular - The Com...|    422557|           129984|  4.6|Intermediate|
|       12|Automate the Bori...|    692617|            70508|  4.6|    Advanced|
|       10|Complete C# Unity...|    364934|            78989|  4.6|    Advanced|
|        5|Java Programming ...|    502572|           123798|  4.6|    Advanced|
|       15|Learn Python Prog...|    240790|            58677|  4.5|    Advanced|
|        3|Machine Learning ...|    692812|           132228|  4.5|Intermediate|
|       14|Modern React with...|    203214|            60835|  4.7|Intermediate|
|        8|Python for Data S...|    387789|            87403|  4.6|Intermediate|
|        6|React - The Compl

In [0]:
courses_df.\
  coalesce(1).\
  write.\
  format('json').\
  save(f'/user/{username}/courses', mode='overwrite')

In [0]:
spark.read.json(f'/user/{username}/courses').show()

+---------+--------------------+----------+-----------------+-----+------------+
|course_id|         course_name|enrollment|number_of_ratings|stars|suitable_for|
+---------+--------------------+----------+-----------------+-----+------------+
|        1|2020 Complete Pyt...|   1100093|           318066|  4.6|    Beginner|
|        4|Angular - The Com...|    422557|           129984|  4.6|Intermediate|
|       12|Automate the Bori...|    692617|            70508|  4.6|    Advanced|
|       10|Complete C# Unity...|    364934|            78989|  4.6|    Advanced|
|        5|Java Programming ...|    502572|           123798|  4.6|    Advanced|
|       15|Learn Python Prog...|    240790|            58677|  4.5|    Advanced|
|        3|Machine Learning ...|    692812|           132228|  4.5|Intermediate|
|       14|Modern React with...|    203214|            60835|  4.7|Intermediate|
|        8|Python for Data S...|    387789|            87403|  4.6|Intermediate|
|        6|React - The Compl

In [0]:
spark.read.json(f'/user/{username}/courses').dtypes

Out[58]: [('course_id', 'bigint'),
 ('course_name', 'string'),
 ('enrollment', 'bigint'),
 ('number_of_ratings', 'bigint'),
 ('stars', 'double'),
 ('suitable_for', 'string')]
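
Note that the columns above are listed alphabetically and the types were inferred from the data. Spark's JSON source stores one JSON object per line, and the reader appears to sort the inferred column names, which would explain why the order differs from the CSV output. A minimal standard-library sketch of the same idea, using two rows copied from the truncated output above (the Python type names are only an analogy for `bigint`, `double` and `string`):

```python
import json

# Two sample records taken from the output above (names are truncated there).
records = [
    {'course_id': 1, 'course_name': '2020 Complete Pyt...',
     'suitable_for': 'Beginner', 'enrollment': 1100093,
     'stars': 4.6, 'number_of_ratings': 318066},
    {'course_id': 4, 'course_name': 'Angular - The Com...',
     'suitable_for': 'Intermediate', 'enrollment': 422557,
     'stars': 4.6, 'number_of_ratings': 129984},
]

# One JSON object per line, similar to the contents of a JSON part file.
lines = [json.dumps(record) for record in records]

# Parse each line independently and look at names and types of the first record.
parsed = [json.loads(line) for line in lines]
columns = sorted(parsed[0])
types = {name: type(parsed[0][name]).__name__ for name in columns}

print(columns)
print(types)
```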

Compression while writing Spark Data Frames into json files

In [0]:
courses_df.\
  coalesce(1).\
  write.\
  json(f'/user/{username}/courses',mode='overwrite',compression='gzip')

In [0]:
dbutils.fs.ls(f'/user/{username}/courses')

Out[60]: [FileInfo(path='dbfs:/user/root/courses/_committed_3880254170003546332', name='_committed_3880254170003546332', size=728, modificationTime=1651322046000),
 FileInfo(path='dbfs:/user/root/courses/_committed_4072114454121548129', name='_committed_4072114454121548129', size=826, modificationTime=1651322054000),
 FileInfo(path='dbfs:/user/root/courses/_committed_4125942472826022451', name='_committed_4125942472826022451', size=202, modificationTime=1651322070000),
 FileInfo(path='dbfs:/user/root/courses/_committed_43006940038443296', name='_committed_43006940038443296', size=201, modificationTime=1651330793000),
 FileInfo(path='dbfs:/user/root/courses/_committed_4960594449838999904', name='_committed_4960594449838999904', size=199, modificationTime=1651322066000),
 FileInfo(path='dbfs:/user/root/courses/_committed_6747008991652655760', name='_committed_6747008991652655760', size=204, modificationTime=1651331210000),
 FileInfo(path='dbfs:/user/root/courses/_committed_68835955199928

Writing Data Frames into Parquet Files

In [0]:
courses_df.\
  coalesce(1).\
  write.\
  parquet(f'/user/{username}/courses', mode='overwrite')

In [0]:
dbutils.fs.ls(f'/user/{username}/courses')

Out[63]: [FileInfo(path='dbfs:/user/root/courses/_committed_3880254170003546332', name='_committed_3880254170003546332', size=728, modificationTime=1651322046000),
 FileInfo(path='dbfs:/user/root/courses/_committed_4072114454121548129', name='_committed_4072114454121548129', size=826, modificationTime=1651322054000),
 FileInfo(path='dbfs:/user/root/courses/_committed_4125942472826022451', name='_committed_4125942472826022451', size=202, modificationTime=1651322070000),
 FileInfo(path='dbfs:/user/root/courses/_committed_43006940038443296', name='_committed_43006940038443296', size=201, modificationTime=1651330793000),
 FileInfo(path='dbfs:/user/root/courses/_committed_4960594449838999904', name='_committed_4960594449838999904', size=199, modificationTime=1651322066000),
 FileInfo(path='dbfs:/user/root/courses/_committed_6747008991652655760', name='_committed_6747008991652655760', size=204, modificationTime=1651331210000),
 FileInfo(path='dbfs:/user/root/courses/_committed_68835955199928

In [0]:
spark.read.parquet(f'/user/{username}/courses').show()


+---------+--------------------+------------+----------+-----+-----------------+
|course_id|         course_name|suitable_for|enrollment|stars|number_of_ratings|
+---------+--------------------+------------+----------+-----+-----------------+
|        1|2020 Complete Pyt...|    Beginner|   1100093|  4.6|           318066|
|        4|Angular - The Com...|Intermediate|    422557|  4.6|           129984|
|       12|Automate the Bori...|    Advanced|    692617|  4.6|            70508|
|       10|Complete C# Unity...|    Advanced|    364934|  4.6|            78989|
|        5|Java Programming ...|    Advanced|    502572|  4.6|           123798|
|       15|Learn Python Prog...|    Advanced|    240790|  4.5|            58677|
|        3|Machine Learning ...|Intermediate|    692812|  4.5|           132228|
|       14|Modern React with...|Intermediate|    203214|  4.7|            60835|
|        8|Python for Data S...|Intermediate|    387789|  4.6|            87403|
|        6|React - The Compl

In [0]:
spark.read.parquet(f'/user/{username}/courses').dtypes

Out[65]: [('course_id', 'bigint'),
 ('course_name', 'string'),
 ('suitable_for', 'string'),
 ('enrollment', 'bigint'),
 ('stars', 'double'),
 ('number_of_ratings', 'bigint')]

In [0]:
courses_df. \
    coalesce(1). \
    write. \
    format('parquet'). \
    save(
        f'/user/{username}/courses', 
        mode='overwrite'
    )

In [0]:
dbutils.fs.ls(f'/user/{username}/courses')

Out[67]: [FileInfo(path='dbfs:/user/root/courses/_committed_3708197124227421430', name='_committed_3708197124227421430', size=221, modificationTime=1651331456000),
 FileInfo(path='dbfs:/user/root/courses/_committed_3880254170003546332', name='_committed_3880254170003546332', size=728, modificationTime=1651322046000),
 FileInfo(path='dbfs:/user/root/courses/_committed_4072114454121548129', name='_committed_4072114454121548129', size=826, modificationTime=1651322054000),
 FileInfo(path='dbfs:/user/root/courses/_committed_4125942472826022451', name='_committed_4125942472826022451', size=202, modificationTime=1651322070000),
 FileInfo(path='dbfs:/user/root/courses/_committed_43006940038443296', name='_committed_43006940038443296', size=201, modificationTime=1651330793000),
 FileInfo(path='dbfs:/user/root/courses/_committed_4960594449838999904', name='_committed_4960594449838999904', size=199, modificationTime=1651322066000),
 FileInfo(path='dbfs:/user/root/courses/_committed_67470089916526

In [0]:
spark.read.parquet(f'/user/{username}/courses').show()

+---------+--------------------+------------+----------+-----+-----------------+
|course_id|         course_name|suitable_for|enrollment|stars|number_of_ratings|
+---------+--------------------+------------+----------+-----+-----------------+
|        1|2020 Complete Pyt...|    Beginner|   1100093|  4.6|           318066|
|        4|Angular - The Com...|Intermediate|    422557|  4.6|           129984|
|       12|Automate the Bori...|    Advanced|    692617|  4.6|            70508|
|       10|Complete C# Unity...|    Advanced|    364934|  4.6|            78989|
|        5|Java Programming ...|    Advanced|    502572|  4.6|           123798|
|       15|Learn Python Prog...|    Advanced|    240790|  4.5|            58677|
|        3|Machine Learning ...|Intermediate|    692812|  4.5|           132228|
|       14|Modern React with...|Intermediate|    203214|  4.7|            60835|
|        8|Python for Data S...|Intermediate|    387789|  4.6|            87403|
|        6|React - The Compl

Compression while writing Spark Data Frames into parquet files

In [0]:
# By default parquet files are compressed using snappy
courses_df. \
    coalesce(1). \
    write. \
    parquet(
        f'/user/{username}/courses', 
        mode='overwrite'
    )

In [0]:
dbutils.fs.ls(f'/user/{username}/courses')

Out[70]: [FileInfo(path='dbfs:/user/root/courses/_committed_3708197124227421430', name='_committed_3708197124227421430', size=221, modificationTime=1651331456000),
 FileInfo(path='dbfs:/user/root/courses/_committed_3747276753524983091', name='_committed_3747276753524983091', size=221, modificationTime=1651331519000),
 FileInfo(path='dbfs:/user/root/courses/_committed_3880254170003546332', name='_committed_3880254170003546332', size=728, modificationTime=1651322046000),
 FileInfo(path='dbfs:/user/root/courses/_committed_4072114454121548129', name='_committed_4072114454121548129', size=826, modificationTime=1651322054000),
 FileInfo(path='dbfs:/user/root/courses/_committed_4125942472826022451', name='_committed_4125942472826022451', size=202, modificationTime=1651322070000),
 FileInfo(path='dbfs:/user/root/courses/_committed_43006940038443296', name='_committed_43006940038443296', size=201, modificationTime=1651330793000),
 FileInfo(path='dbfs:/user/root/courses/_committed_49605944498389

In [0]:
spark.read.parquet(f'/user/{username}/courses').show()

+---------+--------------------+------------+----------+-----+-----------------+
|course_id|         course_name|suitable_for|enrollment|stars|number_of_ratings|
+---------+--------------------+------------+----------+-----+-----------------+
|        1|2020 Complete Pyt...|    Beginner|   1100093|  4.6|           318066|
|        4|Angular - The Com...|Intermediate|    422557|  4.6|           129984|
|       12|Automate the Bori...|    Advanced|    692617|  4.6|            70508|
|       10|Complete C# Unity...|    Advanced|    364934|  4.6|            78989|
|        5|Java Programming ...|    Advanced|    502572|  4.6|           123798|
|       15|Learn Python Prog...|    Advanced|    240790|  4.5|            58677|
|        3|Machine Learning ...|Intermediate|    692812|  4.5|           132228|
|       14|Modern React with...|Intermediate|    203214|  4.7|            60835|
|        8|Python for Data S...|Intermediate|    387789|  4.6|            87403|
|        6|React - The Compl

In [0]:
dbutils.fs.rm(f'/user/{username}/courses', recurse=True)

Out[72]: True

In [0]:
help(courses_df.write.parquet)

Help on method parquet in module pyspark.sql.readwriter:

parquet(path, mode=None, partitionBy=None, compression=None) method of pyspark.sql.readwriter.DataFrameWriter instance
    Saves the content of the :class:`DataFrame` in Parquet format at the specified path.
    
    .. versionadded:: 1.4.0
    
    Parameters
    ----------
    path : str
        the path in any Hadoop supported file system
    mode : str, optional
        specifies the behavior of the save operation when data already exists.
    
        * ``append``: Append contents of this :class:`DataFrame` to existing data.
        * ``overwrite``: Overwrite existing data.
        * ``ignore``: Silently ignore this operation if data already exists.
        * ``error`` or ``errorifexists`` (default case): Throw an exception if data already                 exists.
    partitionBy : str or list, optional
        names of partitioning columns
    
    Other Parameters
    ----------------
    Extra options
        For the extr

In [0]:
spark.conf.get('spark.sql.parquet.compression.codec')

Out[74]: 'snappy'

In [0]:
# Write parquet files without compression
# compression can be set to none or uncompressed

courses_df.\
  coalesce(1).\
  write.\
  parquet(f'/user/{username}/courses',mode='overwrite',compression='none')

In [0]:
dbutils.fs.ls(f'/user/{username}/courses')

Out[76]: [FileInfo(path='dbfs:/user/root/courses/_SUCCESS', name='_SUCCESS', size=0, modificationTime=1651331717000),
 FileInfo(path='dbfs:/user/root/courses/_committed_6205692918970763963', name='_committed_6205692918970763963', size=117, modificationTime=1651331716000),
 FileInfo(path='dbfs:/user/root/courses/_started_6205692918970763963', name='_started_6205692918970763963', size=0, modificationTime=1651331716000),
 FileInfo(path='dbfs:/user/root/courses/part-00000-tid-6205692918970763963-cbeddb83-3606-41b6-8966-f9d17aa5e49b-101-1-c000.parquet', name='part-00000-tid-6205692918970763963-cbeddb83-3606-41b6-8966-f9d17aa5e49b-101-1-c000.parquet', size=3912, modificationTime=1651331716000)]

In [0]:
courses_df.\
  coalesce(1).\
  write.\
  parquet(f'/user/{username}/courses', mode='overwrite', compression='gzip')

In [0]:
dbutils.fs.ls(f'/user/{username}/courses')

Out[80]: [FileInfo(path='dbfs:/user/root/courses/_SUCCESS', name='_SUCCESS', size=0, modificationTime=1651331916000),
 FileInfo(path='dbfs:/user/root/courses/_committed_5979609897587269052', name='_committed_5979609897587269052', size=223, modificationTime=1651331916000),
 FileInfo(path='dbfs:/user/root/courses/_committed_6205692918970763963', name='_committed_6205692918970763963', size=117, modificationTime=1651331716000),
 FileInfo(path='dbfs:/user/root/courses/_started_5979609897587269052', name='_started_5979609897587269052', size=0, modificationTime=1651331915000),
 FileInfo(path='dbfs:/user/root/courses/_started_6205692918970763963', name='_started_6205692918970763963', size=0, modificationTime=1651331716000),
 FileInfo(path='dbfs:/user/root/courses/part-00000-tid-5979609897587269052-e3062872-42f6-46c8-893a-b431edcaa184-103-1-c000.gz.parquet', name='part-00000-tid-5979609897587269052-e3062872-42f6-46c8-893a-b431edcaa184-103-1-c000.gz.parquet', size=3189, modificationTime=16513319

In [0]:
# set compression to none at current session level

spark.conf.set('spark.sql.parquet.compression.codec','none')

In [0]:
courses_df. \
    coalesce(1). \
    write. \
    parquet(
        f'/user/{username}/courses', 
        mode='overwrite'
    )

In [0]:
dbutils.fs.ls(f'/user/{username}/courses')

Out[83]: [FileInfo(path='dbfs:/user/root/courses/_SUCCESS', name='_SUCCESS', size=0, modificationTime=1651332014000),
 FileInfo(path='dbfs:/user/root/courses/_committed_5979609897587269052', name='_committed_5979609897587269052', size=223, modificationTime=1651331916000),
 FileInfo(path='dbfs:/user/root/courses/_committed_6205692918970763963', name='_committed_6205692918970763963', size=117, modificationTime=1651331716000),
 FileInfo(path='dbfs:/user/root/courses/_committed_8360750057545770571', name='_committed_8360750057545770571', size=212, modificationTime=1651332014000),
 FileInfo(path='dbfs:/user/root/courses/_started_5979609897587269052', name='_started_5979609897587269052', size=0, modificationTime=1651331915000),
 FileInfo(path='dbfs:/user/root/courses/_started_6205692918970763963', name='_started_6205692918970763963', size=0, modificationTime=1651331716000),
 FileInfo(path='dbfs:/user/root/courses/_started_8360750057545770571', name='_started_8360750057545770571', size=0, mod

##### Different Modes to write Spark Data Frame into Files

In [0]:
help(courses_df.write.mode)

Help on method mode in module pyspark.sql.readwriter:

mode(saveMode) method of pyspark.sql.readwriter.DataFrameWriter instance
    Specifies the behavior when data or table already exists.
    
    Options include:
    
    * `append`: Append contents of this :class:`DataFrame` to existing data.
    * `overwrite`: Overwrite existing data.
    * `error` or `errorifexists`: Throw an exception if data already exists.
    * `ignore`: Silently ignore this operation if data already exists.
    
    .. versionadded:: 1.4.0
    
    Examples
    --------
    >>> df.write.mode('append').parquet(os.path.join(tempfile.mkdtemp(), 'data'))



The save mode can be specified in different ways while writing a Data Frame into files. Here `file_format` can be any valid out of the box format such as `text`, `csv`, `json`, `parquet`, `orc`.
* `courses_df.write.mode(saveMode).file_format(path_to_folder)`
* `courses_df.write.file_format(path_to_folder, mode=saveMode)`
* `courses_df.write.mode(saveMode).format('file_format').save(path_to_folder)`
* `courses_df.write.format('file_format').save(path_to_folder, mode=saveMode)`

* Understand the default behavior (mode `error` or `errorifexists`).
    * Fails if the folder already exists.
    * Otherwise creates the folder and then adds files to it.

In [0]:
dbutils.fs.rm(f'/user/{username}/courses', recurse=True)

Out[86]: True

In [0]:
# Will create folder and add files with data using parquet format

courses_df.\
  coalesce(1).\
  write.\
  parquet(f'/user/{username}/courses')

In [0]:
spark.read.parquet(f'/user/{username}/courses').show()

+---------+--------------------+------------+----------+-----+-----------------+
|course_id|         course_name|suitable_for|enrollment|stars|number_of_ratings|
+---------+--------------------+------------+----------+-----+-----------------+
|        1|2020 Complete Pyt...|    Beginner|   1100093|  4.6|           318066|
|        4|Angular - The Com...|Intermediate|    422557|  4.6|           129984|
|       12|Automate the Bori...|    Advanced|    692617|  4.6|            70508|
|       10|Complete C# Unity...|    Advanced|    364934|  4.6|            78989|
|        5|Java Programming ...|    Advanced|    502572|  4.6|           123798|
|       15|Learn Python Prog...|    Advanced|    240790|  4.5|            58677|
|        3|Machine Learning ...|Intermediate|    692812|  4.5|           132228|
|       14|Modern React with...|Intermediate|    203214|  4.7|            60835|
|        8|Python for Data S...|Intermediate|    387789|  4.6|            87403|
|        6|React - The Compl

In [0]:
# Fails as mode is error or errorIfExists by default

courses_df.write.parquet(f'/user/{username}/courses')

---------------------------------------------------------------------------
AnalysisException                         Traceback (most recent call last)
<command-2395919756489905> in <module>
      1 # Fails as mode is error or errorIfExists by default
      2 
----> 3 courses_df.write.parquet(f'/user/{username}/courses')

/databricks/spark/python/pyspark/sql/readwriter.py in parquet(self, path, mode, partitionBy, compression)
    883             self.partitionBy(partitionBy)
    884         self._set_opts(compression=compression)

In [0]:
spark.read.parquet(f'/user/{username}/courses').count()


Out[90]: 21

In [0]:
courses_df. \
    coalesce(1). \
    write. \
    parquet(f'/user/{username}/courses', mode='overwrite')

In [0]:
courses_df. \
    coalesce(1). \
    write. \
    mode('overwrite'). \
    parquet(f'/user/{username}/courses')

In [0]:
courses_df. \
    coalesce(1). \
    write. \
    format('parquet'). \
    save(f'/user/{username}/courses', mode='overwrite')

In [0]:
spark.read.parquet(f'/user/{username}/courses').count()

Out[94]: 21

In [0]:
courses_df. \
    coalesce(1). \
    write. \
    mode('append'). \
    parquet(f'/user/{username}/courses')

In [0]:
spark.read.parquet(f'/user/{username}/courses').count()

Out[96]: 42

In [0]:
# ignore if data already exists
courses_df. \
    coalesce(1). \
    write. \
    mode('ignore'). \
    parquet(f'/user/{username}/courses')

In [0]:
spark.read.parquet(f'/user/{username}/courses').count()

Out[98]: 42
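
The behavior of the four modes shown above can be summarised with a toy model in plain Python. This is only a sketch of the semantics on a local folder, not how Spark implements them; the `save` and `count` helpers, the `part-NNNNN.txt` file names and the temp folder are all hypothetical.

```python
import os
import shutil
import tempfile

VALID_MODES = ('error', 'errorifexists', 'append', 'overwrite', 'ignore')


def save(rows, folder, mode='error'):
    """Toy model of the save modes: each successful call writes one part file."""
    if mode not in VALID_MODES:
        raise ValueError(f'unknown mode: {mode}')
    if os.path.exists(folder):
        if mode in ('error', 'errorifexists'):
            raise FileExistsError(f'path {folder} already exists')
        if mode == 'ignore':
            return  # leave the existing data untouched
        if mode == 'overwrite':
            shutil.rmtree(folder)  # drop the existing data first
        # mode == 'append' falls through and adds another file
    os.makedirs(folder, exist_ok=True)
    part = os.path.join(folder, f'part-{len(os.listdir(folder)):05d}.txt')
    with open(part, 'w') as f:
        f.writelines(f'{row}\n' for row in rows)


def count(folder):
    """Count the rows across all part files in the folder."""
    total = 0
    for name in os.listdir(folder):
        with open(os.path.join(folder, name)) as f:
            total += sum(1 for _ in f)
    return total


folder = os.path.join(tempfile.mkdtemp(), 'courses')
rows = list(range(21))  # 21 rows, matching the count shown above

save(rows, folder)  # folder does not exist yet, so it is created
failed = False
try:
    save(rows, folder)  # default mode fails because the folder exists
except FileExistsError:
    failed = True

save(rows, folder, mode='overwrite')
after_overwrite = count(folder)
save(rows, folder, mode='append')
after_append = count(folder)
save(rows, folder, mode='ignore')
after_ignore = count(folder)

print(failed, after_overwrite, after_append, after_ignore)
```

The counts follow the same pattern as the notebook: 21 after `overwrite`, 42 after `append`, and still 42 after `ignore`.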

##### Coalesce and Repartitioning of Spark Data Frames

* `coalesce` and `repartition` are functions available on the Data Frame. Do not confuse `coalesce` on a Data Frame with the `coalesce` function used to deal with null values in a given column.

* `coalesce` is typically used to **reduce the number of partitions** to deal with as part of downstream processing.

* `repartition` is used to reshuffle the data into a **higher or lower number of partitions** to deal with as part of downstream processing.
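
The difference between the two can be illustrated with a toy model in which a Data Frame is just a list of partitions. This is a simplified sketch and not Spark's actual algorithm: the `coalesce` and `repartition` functions below are hypothetical helpers, and the round-robin placement is an assumption made for illustration.

```python
def coalesce(partitions, n):
    """Merge whole partitions into at most n groups.

    Rows of an input partition always stay together, and the
    number of partitions is never increased.
    """
    if n >= len(partitions):
        return [list(part) for part in partitions]
    merged = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        merged[i % n].extend(part)
    return merged


def repartition(partitions, n):
    """Redistribute every row individually across n partitions (a full reshuffle)."""
    rows = [row for part in partitions for row in part]
    result = [[] for _ in range(n)]
    for i, row in enumerate(rows):
        result[i % n].append(row)
    return result


partitions = [[1, 2, 3], [4, 5], [6], [7, 8, 9, 10]]

fewer = coalesce(partitions, 2)      # partitions are merged, sizes may be uneven
unchanged = coalesce(partitions, 6)  # cannot go above the current 4 partitions
more = repartition(partitions, 6)    # rows are spread over 6 partitions

print(fewer)
print(len(unchanged))
print(more)
```

In this model `coalesce` only glues existing partitions together, while `repartition` touches every row, which mirrors why `repartition` is the one that can raise the partition count.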