#PySpark Read JSON file into DataFrame

**PySpark SQL provides read.json("path") to read a single-line or multiline JSON file into a PySpark DataFrame, and write.json("path") to save a DataFrame as JSON. In this tutorial, you will learn how to read a single file, multiple files, and all files in a directory into a DataFrame, and how to write a DataFrame back to JSON, using Python examples.**

---


###Note: Out of the box, PySpark supports reading JSON and many other file formats into a PySpark DataFrame.


---


**Table of contents:**

- PySpark Read JSON file into DataFrame

- Read JSON file from multiline

- Read multiple files at a time

- Read all files in a directory

- Read file with a user-specified schema

- Read file using PySpark SQL

- Options while reading JSON file
  - nullValues
  - dateFormat

- PySpark Write DataFrame to JSON file
  - Using options 
  - Saving Mode
  
  
---

##PySpark Read JSON file into DataFrame


**Using read.json("path") or read.format("json").load("path"), you can read a JSON file into a PySpark DataFrame. Both methods take a file path as an argument.**

**Unlike the CSV reader, the JSON data source infers the schema from the input file by default.**

In [0]:
# Read a JSON file into a DataFrame

df = spark.read.json("dbfs:/FileStore/json_resources/zipcodes.json")
df.printSchema()
df.show(truncate=False)

root
 |-- City: string (nullable = true)
 |-- Country: string (nullable = true)
 |-- Decommisioned: boolean (nullable = true)
 |-- EstimatedPopulation: long (nullable = true)
 |-- Lat: double (nullable = true)
 |-- Location: string (nullable = true)
 |-- LocationText: string (nullable = true)
 |-- LocationType: string (nullable = true)
 |-- Long: double (nullable = true)
 |-- Notes: string (nullable = true)
 |-- RecordNumber: long (nullable = true)
 |-- State: string (nullable = true)
 |-- TaxReturnsFiled: long (nullable = true)
 |-- TotalWages: long (nullable = true)
 |-- WorldRegion: string (nullable = true)
 |-- Xaxis: double (nullable = true)
 |-- Yaxis: double (nullable = true)
 |-- Zaxis: double (nullable = true)
 |-- ZipCodeType: string (nullable = true)
 |-- Zipcode: long (nullable = true)

+-------------------+-------+-------------+-------------------+-----+----------------------------+-----------------------+--------------+-------+-------------+------------+-----+------------

**When you use the format() method, you can also specify the data source by its fully qualified name, as below.**

In [0]:
# Read JSON file into dataframe
df_format = spark.read.format("org.apache.spark.sql.json")\
.load("dbfs:/FileStore/json_resources/zipcodes.json")

df_format.printSchema()

root
 |-- City: string (nullable = true)
 |-- Country: string (nullable = true)
 |-- Decommisioned: boolean (nullable = true)
 |-- EstimatedPopulation: long (nullable = true)
 |-- Lat: double (nullable = true)
 |-- Location: string (nullable = true)
 |-- LocationText: string (nullable = true)
 |-- LocationType: string (nullable = true)
 |-- Long: double (nullable = true)
 |-- Notes: string (nullable = true)
 |-- RecordNumber: long (nullable = true)
 |-- State: string (nullable = true)
 |-- TaxReturnsFiled: long (nullable = true)
 |-- TotalWages: long (nullable = true)
 |-- WorldRegion: string (nullable = true)
 |-- Xaxis: double (nullable = true)
 |-- Yaxis: double (nullable = true)
 |-- Zaxis: double (nullable = true)
 |-- ZipCodeType: string (nullable = true)
 |-- Zipcode: long (nullable = true)



##Read JSON file from multiline


**The PySpark JSON data source provides several options for reading files. Use the multiline option to read JSON records that span multiple lines. By default, the multiline option is set to false, and each line of the file must contain one complete JSON record.**


---

[{
  "RecordNumber": 2,
  "Zipcode": 704,
  "ZipCodeType": "STANDARD",
  "City": "PASEO COSTA DEL SUR",
  "State": "PR"
},

{
  "RecordNumber": 10,
  "Zipcode": 709,
  "ZipCodeType": "STANDARD",
  "City": "BDA SAN LUIS",
  "State": "PR"
}]

---


**Using read.option("multiline","true")**

In [0]:
# Read multiline json file 
multiline_df = spark.read.option("multiline", "true")\
.json("dbfs:/FileStore/json_resources/multiline_zipcode.json")

multiline_df.printSchema()

multiline_df.show(truncate=False)

root
 |-- City: string (nullable = true)
 |-- RecordNumber: long (nullable = true)
 |-- State: string (nullable = true)
 |-- ZipCodeType: string (nullable = true)
 |-- Zipcode: long (nullable = true)

+-------------------+------------+-----+-----------+-------+
|City               |RecordNumber|State|ZipCodeType|Zipcode|
+-------------------+------------+-----+-----------+-------+
|PASEO COSTA DEL SUR|2           |PR   |STANDARD   |704    |
|BDA SAN LUIS       |10          |PR   |STANDARD   |709    |
+-------------------+------------+-----+-----------+-------+
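
Why the option matters can be illustrated without Spark. The sketch below is only an analogy using Python's standard json module: it tries to parse an abbreviated version of the sample above one line at a time, the way a JSON Lines reader would, and then parses it as a single document. The helper name parse_line_by_line is hypothetical.

```python
import json

# Abbreviated version of the multiline sample: one JSON array spread over several lines.
multiline_text = """[{
  "RecordNumber": 2,
  "Zipcode": 704
},
{
  "RecordNumber": 10,
  "Zipcode": 709
}]"""


def parse_line_by_line(text):
    """Try to parse each non-empty line as its own JSON document."""
    parsed, failed = [], 0
    for line in text.splitlines():
        if not line.strip():
            continue
        try:
            parsed.append(json.loads(line))
        except json.JSONDecodeError:
            failed += 1
    return parsed, failed


# Line-oriented parsing finds no complete record on any line.
line_records, line_failures = parse_line_by_line(multiline_text)

# Parsing the whole text as one document recovers both records.
whole_document = json.loads(multiline_text)
```

No single line is a complete JSON value here, which is roughly the situation a line-oriented reader faces when the multiline option is left at its default.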



##Reading multiple files at a time


**Using the read.json() method, you can also read multiple JSON files from different paths. Pass a Python list of fully qualified file paths, for example:**

In [0]:
#Read multiple files

df2 = spark.read.json(["dbfs:/FileStore/json_resources/zipcode1.json", "dbfs:/FileStore/json_resources/zipcode2.json"])

df2.show(truncate=False)

+-------------------+-------+-------------+-----+----------------------------+-----------------------+--------------+------+------------+-----+-----------+-----+-----+-----+-----------+-------+
|City               |Country|Decommisioned|Lat  |Location                    |LocationText           |LocationType  |Long  |RecordNumber|State|WorldRegion|Xaxis|Yaxis|Zaxis|ZipCodeType|Zipcode|
+-------------------+-------+-------------+-----+----------------------------+-----------------------+--------------+------+------------+-----+-----------+-----+-----+-----+-----------+-------+
|PASEO COSTA DEL SUR|US     |false        |17.96|NA-US-PR-PASEO COSTA DEL SUR|Paseo Costa Del Sur, PR|NOT ACCEPTABLE|-66.22|2           |PR   |NA         |0.38 |-0.87|0.3  |STANDARD   |704    |
|BDA SAN LUIS       |US     |false        |18.14|NA-US-PR-BDA SAN LUIS       |Bda San Luis, PR       |NOT ACCEPTABLE|-66.26|10          |PR   |NA         |0.38 |-0.86|0.31 |STANDARD   |709    |
|PARC PARQUE        |US     |f

##Reading all files in a directory


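**We can read all JSON files from a directory into a DataFrame by passing the directory path to the json() method, or, as in the example below, a wildcard pattern that matches only the .json files. The _corrupt_record column in the output most likely comes from a file that could not be parsed line by line, such as a multiline JSON file read without the multiline option.**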

In [0]:
# Read all JSON files from a folder

df3 = spark.read.json("dbfs:/FileStore/json_resources/*.json")
df3.show(truncate=False)

+-------------------+-------+-------------+-------------------+-----+----------------------------+-----------------------+--------------+-------+-------------+------------+-----+---------------+----------+-----------+-----+-----+-----+-----------+-------+---------------+
|City               |Country|Decommisioned|EstimatedPopulation|Lat  |Location                    |LocationText           |LocationType  |Long   |Notes        |RecordNumber|State|TaxReturnsFiled|TotalWages|WorldRegion|Xaxis|Yaxis|Zaxis|ZipCodeType|Zipcode|_corrupt_record|
+-------------------+-------+-------------+-------------------+-----+----------------------------+-----------------------+--------------+-------+-------------+------------+-----+---------------+----------+-----------+-----+-----+-----+-----------+-------+---------------+
|PARC PARQUE        |US     |false        |null               |17.96|NA-US-PR-PARC PARQUE        |Parc Parque, PR        |NOT ACCEPTABLE|-66.22 |null         |1           |PR   |null  

##Reading files with a user-specified custom schema


**A PySpark schema defines the structure of the data, that is, the column names and data types of the DataFrame. PySpark SQL provides the StructType and StructField classes to specify this structure programmatically.**

**If you know the schema of the file ahead of time and do not want to rely on schema inference, use the schema() method of the reader to supply custom column names and data types.**

**Use the PySpark StructType class to create a custom schema. Below, we instantiate it with a list of StructField objects, each giving a column name, a data type, and a nullable flag. StructType also has an add() method for appending columns one at a time.**

In [0]:
# Define custom schema
from pyspark.sql.types import (
    StructType, StructField, StringType, IntegerType, DoubleType, BooleanType
)

schema = StructType([
      StructField("RecordNumber",IntegerType(),True),
      StructField("Zipcode",IntegerType(),True),
      StructField("ZipCodeType",StringType(),True),
      StructField("City",StringType(),True),
      StructField("State",StringType(),True),
      StructField("LocationType",StringType(),True),
      StructField("Lat",DoubleType(),True),
      StructField("Long",DoubleType(),True),
      StructField("Xaxis",DoubleType(),True),
      StructField("Yaxis",DoubleType(),True),
      StructField("Zaxis",DoubleType(),True),
      StructField("WorldRegion",StringType(),True),
      StructField("Country",StringType(),True),
      StructField("LocationText",StringType(),True),
      StructField("Location",StringType(),True),
      StructField("Decommisioned",BooleanType(),True),
      StructField("TaxReturnsFiled",StringType(),True),
      StructField("EstimatedPopulation",IntegerType(),True),
      StructField("TotalWages",IntegerType(),True),
      StructField("Notes",StringType(),True)
  ])



df_with_schema = spark.read.schema(schema).json("dbfs:/FileStore/json_resources/zipcodes.json")

df_with_schema.printSchema()

df_with_schema.show()

root
 |-- RecordNumber: integer (nullable = true)
 |-- Zipcode: integer (nullable = true)
 |-- ZipCodeType: string (nullable = true)
 |-- City: string (nullable = true)
 |-- State: string (nullable = true)
 |-- LocationType: string (nullable = true)
 |-- Lat: double (nullable = true)
 |-- Long: double (nullable = true)
 |-- Xaxis: double (nullable = true)
 |-- Yaxis: double (nullable = true)
 |-- Zaxis: double (nullable = true)
 |-- WorldRegion: string (nullable = true)
 |-- Country: string (nullable = true)
 |-- LocationText: string (nullable = true)
 |-- Location: string (nullable = true)
 |-- Decommisioned: boolean (nullable = true)
 |-- TaxReturnsFiled: string (nullable = true)
 |-- EstimatedPopulation: integer (nullable = true)
 |-- TotalWages: integer (nullable = true)
 |-- Notes: string (nullable = true)

+------------+-------+-----------+-------------------+-----+--------------+-----+-------+-----+-----+-----+-----------+-------+--------------------+--------------------+------

## Read JSON file using PySpark SQL

**PySpark SQL also lets you read a JSON file by creating a temporary view directly over the file, using a CREATE TEMPORARY VIEW ... USING json statement passed to spark.sql(), and then querying that view.**

In [0]:
spark.sql("CREATE OR REPLACE TEMPORARY VIEW zipcode USING json OPTIONS (path 'dbfs:/FileStore/json_resources/zipcodes.json')")
spark.sql("SELECT * FROM zipcode").show(truncate=False)

+-------------------+-------+-------------+-------------------+-----+----------------------------+-----------------------+--------------+-------+-------------+------------+-----+---------------+----------+-----------+-----+-----+-----+-----------+-------+
|City               |Country|Decommisioned|EstimatedPopulation|Lat  |Location                    |LocationText           |LocationType  |Long   |Notes        |RecordNumber|State|TaxReturnsFiled|TotalWages|WorldRegion|Xaxis|Yaxis|Zaxis|ZipCodeType|Zipcode|
+-------------------+-------+-------------+-------------------+-----+----------------------------+-----------------------+--------------+-------+-------------+------------+-----+---------------+----------+-----------+-----+-----+-----+-----------+-------+
|PARC PARQUE        |US     |false        |null               |17.96|NA-US-PR-PARC PARQUE        |Parc Parque, PR        |NOT ACCEPTABLE|-66.22 |null         |1           |PR   |null           |null      |NA         |0.38 |-0.87|0.3

##Options while reading JSON file


###nullValues

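**The nullValue option specifies a string that should be read as null. For example, you may want a date column holding the placeholder “1900-01-01” to become null in the DataFrame. Note that Spark documents nullValue for the CSV data source and does not list it among the JSON options, so confirm that it has an effect on your Spark version before relying on it.**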

---

###dateFormat

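**The dateFormat option sets the pattern used to parse DateType columns; TimestampType columns are controlled by the separate timestampFormat option. Spark 2.x accepts java.text.SimpleDateFormat patterns, while Spark 3.0 and later use Spark's own datetime patterns, which are based on java.time.format.DateTimeFormatter.**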

---

###Note: Besides the options above, the PySpark JSON data source supports many others, such as mode, primitivesAsString, and allowComments.


---


##Applying DataFrame transformations

**Once you have created a PySpark DataFrame from the JSON file, you can apply all the transformations and actions that DataFrames support.**

##Write PySpark DataFrame to JSON file

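**Use the write property of a DataFrame, which returns a DataFrameWriter, and call its json() method to write JSON output. Spark treats the target path as a directory and writes one or more part files into it.**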

In [0]:
df2.write.json("/tmp/spark_output_json/zipcodes.json")

##PySpark Options while writing JSON files


**While writing JSON you can set several options on the DataFrameWriter.**

**Options available when writing include dateFormat, timestampFormat, and compression.**

---

###PySpark Saving modes

**The PySpark DataFrameWriter also has a mode() method to specify the save mode. It accepts overwrite, append, ignore, or errorifexists (alias error).**

- overwrite – replaces any data that already exists at the output path

- append – adds the new data to the data already at the output path

- ignore – skips the write without an error when the output path already exists

- errorifexists or error – the default; raises an error when the output path already exists

In [0]:
df2.write.mode("overwrite").json("/tmp/spark_output_json/zipcodes.json")