# 09: SparkSQL - File Formats

This script demonstrates the methods for reading and writing files in the [Parquet](http://parquet.io) and JSON formats. It reads in the same data as in the previous example, writes it to new files in Parquet format, then reads it back in and runs queries on it. Then it repeats the exercise using JSON. Additional support is now available in Spark for ORC files, as well as old stand-bys like CSV.

The key [SparkSession](http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.SparkSession) and [Dataset](http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset) methods are `SparkSession.read.parquet(inpath)` and `Dataset.write.parquet(outpath)` for reading and writing Parquet, and `SparkSession.read.json(inpath)` and `Dataset.write.json(outpath)` for reading and writing JSON. (The generic `Dataset.write.save(outpath)` method also writes Parquet by default; the format it uses can be changed per call with `write.format(...)` or globally with the `spark.sql.sources.default` setting.)
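As a rough illustration of the generic `save`/`load` forms next to an explicit format, here is a minimal sketch. It assumes a local `SparkSession`, a tiny made-up data set (`sampleDF`), and temporary output directories; none of these names come from the tutorial's script.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// In a notebook or spark-shell a session already exists; getOrCreate() then just returns it.
val spark = SparkSession.builder.master("local[*]").appName("FormatsSketch").getOrCreate()

// Hypothetical sample rows, not the tutorial's data.
val sampleDF = spark.createDataFrame(Seq(
  ("Gen", 1, 1, "In the beginning..."),
  ("Gen", 1, 2, "And the earth...")
)).toDF("book", "chapter", "verse", "text")

// Spark refuses to write into an existing directory by default,
// so write to new subdirectories of a fresh temporary directory.
val base = Files.createTempDirectory("formats-sketch").toString

// With no explicit format, save/load use the default data source
// (Parquet, unless spark.sql.sources.default has been changed).
sampleDF.write.save(s"$base/default")
val fromDefault = spark.read.load(s"$base/default")

// The same generic calls with an explicit format.
sampleDF.write.format("json").save(s"$base/json")
val fromJson = spark.read.format("json").load(s"$base/json")
```

The format-specific methods such as `write.parquet` and `read.json` are shorthand for these generic calls.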

See the corresponding "script" suitable for _spark-shell_, [SparkSQLFileFormats9-script.scala](https://github.com/deanwampler/spark-scala-tutorial/blob/master/src/main/scala/sparktutorial/SparkSQLFileFormats9-script.scala).

In [1]:
val in = "../data/kjvdat.txt"

in = ../data/kjvdat.txt


../data/kjvdat.txt

Parse the lines into `Verse` instances as before.

In [2]:
case class Verse(book: String, chapter: Int, verse: Int, text: String)

defined class Verse


In [3]:
// (.*?) is non-greedy, so the optional trailing ~ is left out of the captured text.
val lineRE = """^\s*([^|]+)\s*\|\s*([\d]+)\s*\|\s*([\d]+)\s*\|\s*(.*?)~?\s*$""".r

lineRE = ^\s*([^|]+)\s*\|\s*([\d]+)\s*\|\s*([\d]+)\s*\|\s*(.*?)~?\s*$


^\s*([^|]+)\s*\|\s*([\d]+)\s*\|\s*([\d]+)\s*\|\s*(.*?)~?\s*$

In [4]:
val versesRDD = sc.textFile(in).flatMap {
  case lineRE(book, chapter, verse, text) =>
    Seq(Verse(book, chapter.toInt, verse.toInt, text))
  case line =>
    Console.err.println(s"Unexpected line: $line")
    Nil // or use Seq.empty[Verse]. It will be eliminated by flattening.
}

versesRDD = MapPartitionsRDD[2] at flatMap at <console>:33


MapPartitionsRDD[2] at flatMap at <console>:33
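One subtlety in patterns of this shape is how the final text group interacts with the optional `~?`. The sketch below needs no Spark and compares a greedy and a non-greedy text group on a single made-up line. The sample line and the helper `textOf` are assumptions for illustration, including the assumption that lines end with a `~` marker.

```scala
import scala.util.matching.Regex

// Hypothetical sample line in a "book|chapter|verse|text~" layout.
val sample = "Gen|1|1| In the beginning God created the heaven and the earth.~"

// Identical patterns except for the last group: (.*) versus (.*?).
val greedyRE = """^\s*([^|]+)\s*\|\s*([\d]+)\s*\|\s*([\d]+)\s*\|\s*(.*)~?\s*$""".r
val lazyRE   = """^\s*([^|]+)\s*\|\s*([\d]+)\s*\|\s*([\d]+)\s*\|\s*(.*?)~?\s*$""".r

// Return the captured text field, or None if the line does not match.
def textOf(re: Regex, line: String): Option[String] = line match {
  case re(_, _, _, text) => Some(text)
  case _                 => None
}

val greedyText = textOf(greedyRE, sample) // the greedy group swallows the trailing "~"
val lazyText   = textOf(lazyRE, sample)   // the lazy group leaves "~" for the ~? to consume
```

With the greedy group, `~?` always matches the empty string, so the marker stays in the `text` field.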

In [5]:
val verses = spark.createDataFrame(versesRDD)
verses.createOrReplaceTempView("kjv_bible")
verses.cache()  // not really needed in this notebook, but...

verses = [book: string, chapter: int ... 2 more fields]


[book: string, chapter: int ... 2 more fields]

Save as Parquet. (Remember to delete this output directory before you run this cell again; by default, Spark refuses to write to a path that already exists.)

In [6]:
val parquetDir = "output/parquet"
println(s"Saving 'verses' as a Parquet file to $parquetDir.")
verses.write.parquet(parquetDir)

Saving 'verses' as a Parquet file to output/parquet.


parquetDir = output/parquet


output/parquet
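An alternative to deleting the output by hand is to set a save mode on the writer. The following is a minimal sketch, assuming a local session, a one-row made-up data set, and a temporary directory (`outDir`, a hypothetical name); the notebook itself keeps the default mode.

```scala
import java.nio.file.Files
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder.master("local[*]").appName("OverwriteSketch").getOrCreate()

// Hypothetical sample row, not the tutorial's data.
val oneVerse = spark.createDataFrame(Seq(("Gen", 1, 1, "In the beginning...")))
  .toDF("book", "chapter", "verse", "text")

// A path that does not exist yet, inside a fresh temporary directory.
val outDir = Files.createTempDirectory("overwrite-sketch").resolve("parquet").toString

oneVerse.write.parquet(outDir)                          // first run: creates the directory
oneVerse.write.mode(SaveMode.Overwrite).parquet(outDir) // rerun: replaces the old output

// A second plain oneVerse.write.parquet(outDir) would fail here,
// because the default mode is SaveMode.ErrorIfExists.
val reread = spark.read.parquet(outDir)
```

`SaveMode.Overwrite` silently discards whatever was at the path, so it is best reserved for scratch output like this.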

Now read it back in as a new `Dataset`.

In [7]:
println(s"Reading in the Parquet file from $parquetDir:")
val verses2 = spark.read.parquet(parquetDir)
verses2.createOrReplaceTempView("verses2")
verses2.show

Reading in the Parquet file from output/parquet:
+----+-------+-----+--------------------+
|book|chapter|verse|                text|
+----+-------+-----+--------------------+
| Psa|     68|   11|The Lord gave the...|
| Psa|     68|   12|Kings of armies d...|
| Psa|     68|   13|Though ye have li...|
| Psa|     68|   14|When the Almighty...|
| Psa|     68|   15|The hill of God i...|
| Psa|     68|   16|Why leap ye, ye h...|
| Psa|     68|   17|The chariots of G...|
| Psa|     68|   18|Thou hast ascende...|
| Psa|     68|   19|Blessed be the Lo...|
| Psa|     68|   20|He that is our Go...|
| Psa|     68|   21|But God shall wou...|
| Psa|     68|   22|The Lord said, I ...|
| Psa|     68|   23|That thy foot may...|
| Psa|     68|   24|They have seen th...|
| Psa|     68|   25|The singers went ...|
| Psa|     68|   26|Bless ye God in t...|
| Psa|     68|   27|There is little B...|
| Psa|     68|   28|Thy God hath comm...|
| Psa|     68|   29|Because of thy te...|
| Psa|     68|   30|Rebuke the compan...|
+----+-------+-----+--------------------+

verses2 = [book: string, chapter: int ... 2 more fields]


[book: string, chapter: int ... 2 more fields]

Run a SQL query...

In [8]:
val jesusVerses = spark.sql("SELECT * FROM verses2 WHERE text LIKE '%Jesus%'")
println("Number of Jesus Verses: "+jesusVerses.count())
jesusVerses.show

Number of Jesus Verses: 936
+----+-------+-----+--------------------+
|book|chapter|verse|                text|
+----+-------+-----+--------------------+
| Mat|      1|    1|The book of the g...|
| Mat|      1|   16|And Jacob begat J...|
| Mat|      1|   18|Now the birth of ...|
| Mat|      2|    1|Now when Jesus wa...|
| Mat|      3|   13|Then cometh Jesus...|
| Mat|      3|   15|And Jesus answeri...|
| Mat|      3|   16|And Jesus, when h...|
| Mat|      4|    1|Then was Jesus le...|
| Mat|      4|    7|Jesus said unto h...|
| Mat|      4|   10|Then saith Jesus ...|
| Mat|      4|   12|Now when Jesus ha...|
| Mat|      4|   17|From that time Je...|
| Mat|      4|   18|And Jesus, walkin...|
| Mat|      4|   23|And Jesus went ab...|
| Mat|      7|   28|And it came to pa...|
| Mat|      8|    3|And Jesus put for...|
| Mat|      8|    4|And Jesus saith u...|
| Mat|      8|    5|And when Jesus wa...|
| Mat|      8|    7|And Jesus saith u...|
| Mat|      8|   10|When Jesus heard ...|
+----+-------+-----+--------------------+

jesusVerses = [book: string, chapter: int ... 2 more fields]


[book: string, chapter: int ... 2 more fields]
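The same kind of filter can also be written with the `Dataset` API instead of a SQL string. This is a minimal sketch on two made-up rows; the view name `sample_verses` and the values are assumptions for illustration only.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder.master("local[*]").appName("QuerySketch").getOrCreate()

// Hypothetical sample rows standing in for the full data set.
val sampleVerses = spark.createDataFrame(Seq(
  ("Mat", 1, 1, "The book of the generation of Jesus Christ"),
  ("Gen", 1, 1, "In the beginning God created the heaven and the earth.")
)).toDF("book", "chapter", "verse", "text")
sampleVerses.createOrReplaceTempView("sample_verses")

// SQL string against the temporary view...
val viaSql = spark.sql("SELECT * FROM sample_verses WHERE text LIKE '%Jesus%'")

// ...and the equivalent filter expressed with Column methods.
val viaApi = sampleVerses.filter(col("text").like("%Jesus%"))
```

Both forms go through the same query optimizer, so the choice between them is mostly a matter of style.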

Now work with JSON. By default, Spark's JSON reader _requires_ each JSON "document" to be on a single line.
Let's first write some JSON.

In [9]:
val jsonDir = "output/json"
println(s"Saving 'verses' as a JSON file to $jsonDir.")
verses.write.json(jsonDir)

Saving 'verses' as a JSON file to output/json.


jsonDir = output/json


output/json
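To see the one-document-per-line layout directly, the written files can be read back as plain text. This is a minimal sketch, assuming a local session, two made-up rows, and a temporary output path (`jsonOut`, a hypothetical name).

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.master("local[*]").appName("JsonLinesSketch").getOrCreate()

// Hypothetical sample rows, not the tutorial's data.
val twoVerses = spark.createDataFrame(Seq(
  ("Gen", 1, 1, "In the beginning..."),
  ("Gen", 1, 2, "And the earth...")
)).toDF("book", "chapter", "verse", "text")

// A path that does not exist yet, inside a fresh temporary directory.
val jsonOut = Files.createTempDirectory("jsonl-sketch").resolve("json").toString
twoVerses.write.json(jsonOut)

// Read the part files as plain text: each line is one complete JSON object.
val rawLines = spark.read.textFile(jsonOut).collect()
rawLines.foreach(println)
```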

... and read it back in.

In [10]:
val versesJSON = spark.read.json(jsonDir)
versesJSON.show

+----+-------+--------------------+-----+
|book|chapter|                text|verse|
+----+-------+--------------------+-----+
| Psa|     68|The Lord gave the...|   11|
| Psa|     68|Kings of armies d...|   12|
| Psa|     68|Though ye have li...|   13|
| Psa|     68|When the Almighty...|   14|
| Psa|     68|The hill of God i...|   15|
| Psa|     68|Why leap ye, ye h...|   16|
| Psa|     68|The chariots of G...|   17|
| Psa|     68|Thou hast ascende...|   18|
| Psa|     68|Blessed be the Lo...|   19|
| Psa|     68|He that is our Go...|   20|
| Psa|     68|But God shall wou...|   21|
| Psa|     68|The Lord said, I ...|   22|
| Psa|     68|That thy foot may...|   23|
| Psa|     68|They have seen th...|   24|
| Psa|     68|The singers went ...|   25|
| Psa|     68|Bless ye God in t...|   26|
| Psa|     68|There is little B...|   27|
| Psa|     68|Thy God hath comm...|   28|
| Psa|     68|Because of thy te...|   29|
| Psa|     68|Rebuke the compan...|   30|
+----+-------+--------------------+-----+

versesJSON = [book: string, chapter: bigint ... 2 more fields]


[book: string, chapter: bigint ... 2 more fields]

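Note that the book order isn't preserved; the rows start in Psalms rather than Genesis. The JSON reader also returns the columns in alphabetical order (`text` now comes before `verse`) and infers integer columns as `bigint` rather than `int`.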
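One way to control the column order and types when reading JSON is to pass an explicit schema to the reader instead of relying on inference. This is a minimal sketch, assuming a local session, made-up rows, and a temporary path; `verseSchema` is a hypothetical name that mirrors the fields of `Verse`.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val spark = SparkSession.builder.master("local[*]").appName("SchemaSketch").getOrCreate()

// Hypothetical sample rows, not the tutorial's data.
val someVerses = spark.createDataFrame(Seq(
  ("Gen", 1, 1, "In the beginning..."),
  ("Gen", 1, 2, "And the earth...")
)).toDF("book", "chapter", "verse", "text")

val jsonPath = Files.createTempDirectory("schema-sketch").resolve("json").toString
someVerses.write.json(jsonPath)

// Inference: columns come back sorted by name, integers as bigint (LongType).
val inferred = spark.read.json(jsonPath)

// Explicit schema: the reader uses the declared order and types.
val verseSchema = StructType(Seq(
  StructField("book", StringType),
  StructField("chapter", IntegerType),
  StructField("verse", IntegerType),
  StructField("text", StringType)
))
val typed = spark.read.schema(verseSchema).json(jsonPath)

inferred.printSchema()
typed.printSchema()
```

Supplying the schema also avoids the extra pass over the data that inference needs. Row order is a separate matter: an explicit `orderBy` is the only way to guarantee it.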