# Different File Formats and Custom Delimiters

In this session we look at the file formats Spark supports and at custom delimiters in text data. We will see how to read and write data in each format, and we will also cover the persist and cache APIs on Data Frames.

* Overview of write APIs – dataframe.write
* Overview of read APIs – spark.read
* Supported file formats
    * csv, text (for text file formats)
    * json (using complex schema)
    * orc
    * parquet
    * avro (3rd party)
* Text Data – Custom Delimiters
* Data Frames Persistence

### Overview of write APIs – dataframe.write

Let us see how to write data to different targets using the APIs under dataframe.write.

* Supported file formats – csv, text, json, orc, parquet, etc.
* We can also write data in file formats supported through 3rd party plugins, such as avro.
* Data can be written to Hive tables as well.
* We can connect to relational databases over JDBC and save our output into remote relational databases.
* We can also connect to any 3rd party database using the relevant plugin and save data there.


In [1]:
import org.apache.spark.sql.types._
val schemaString = "order_id:int order_date:string order_customer_id:int order_status:string"

val a = schemaString.split(" ")

// Using pattern matching
val fields = a.map(f => f.split(":")(1) match {
  case "int" => StructField(f.split(":")(0), IntegerType)
  case _ => StructField(f.split(":")(0), StringType)
})

val schema = StructType(fields)
val inputBaseDir = "/public/retail_db"

val ordersDF = spark.
  read.
  schema(schema).
  csv(inputBaseDir + "/orders")

ordersDF.printSchema
ordersDF.show

root
 |-- order_id: integer (nullable = true)
 |-- order_date: string (nullable = true)
 |-- order_customer_id: integer (nullable = true)
 |-- order_status: string (nullable = true)

+--------+--------------------+-----------------+---------------+
|order_id|          order_date|order_customer_id|   order_status|
+--------+--------------------+-----------------+---------------+
|       1|2013-07-25 00:00:...|            11599|         CLOSED|
|       2|2013-07-25 00:00:...|              256|PENDING_PAYMENT|
|       3|2013-07-25 00:00:...|            12111|       COMPLETE|
|       4|2013-07-25 00:00:...|             8827|         CLOSED|
|       5|2013-07-25 00:00:...|            11318|       COMPLETE|
|       6|2013-07-25 00:00:...|             7130|       COMPLETE|
|       7|2013-07-25 00:00:...|             4530|       COMPLETE|
|       8|2013-07-25 00:00:...|             2911|     PROCESSING|
|       9|2013-07-25 00:00:...|             5657|PENDING_PAYMENT|
|      10|2013-07-25 00:0

schemaString = order_id:int order_date:string order_customer_id:int order_status:string
a = Array(order_id:int, order_date:string, order_customer_id:int, order_status:string)
fields = Array(StructField(order_id,IntegerType,true), StructField(order_date,StringType,true), StructField(order_customer_id,IntegerType,true), StructField(order_status,StringType,true))
schema = StructType(StructField(order_id,IntegerType,true), StructField(order_date,StringType,true), StructField(order_customer_id,IntegerType,true), StructField(order_status,StringType,true))
inputBaseDir = /public/retail_db
ordersDF = [order_id: i...


[order_id: i...

In [None]:
ordersDF.write.
  format("json").
  save("/user/training/sparkscala/orders_json")

In [None]:
ordersDF.write.json("/user/training/sparkscala/orders_json")

### Overview of read APIs – spark.read

spark.read has a set of APIs to read data from different source types.

* Supported file formats – csv, text, json, orc, parquet, etc.
* We can also read data from file formats supported through 3rd party plugins, such as avro.
* We can read data directly from Hive tables.
* JDBC – to read data from relational databases.
* There is also a generic API called format. It can be used together with option to pass the relevant arguments, followed by load to read data from either files or over JDBC.
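
As a rough illustration of the generic API, the sketch below reads the same files once through format, option and load, and once through the csv shortcut. The temporary directory, the two sample rows and the names sampleOrders, viaFormat and viaShortcut are made up for this example; header and inferSchema are standard options of the csv source. It assumes Spark runs in local mode so that the temporary directory is visible to the executors.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// In spark-shell or a notebook `spark` already exists; getOrCreate() then returns that session.
val spark = SparkSession.builder().master("local[*]").appName("read-api-sketch").getOrCreate()
import spark.implicits._

// Hypothetical sample data written to a throwaway local directory
val readSketchDir = Files.createTempDirectory("read_api_sketch")
val csvPath = readSketchDir.resolve("orders_csv").toUri.toString
val sampleOrders = Seq((1, "CLOSED"), (2, "COMPLETE")).toDF("order_id", "order_status")
sampleOrders.write.option("header", "true").csv(csvPath)

// Generic form: format + option(s) + load
val viaFormat = spark.read.
  format("csv").
  option("header", "true").
  option("inferSchema", "true").
  load(csvPath)

// Shortcut form: same options, the format is implied by the method name
val viaShortcut = spark.read.
  option("header", "true").
  option("inferSchema", "true").
  csv(csvPath)

viaFormat.printSchema()
viaShortcut.printSchema()
```

Both forms go through the same csv source, so the two data frames should end up with the same schema. The generic form is mostly useful when the format name is only known at run time or comes from configuration.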

In [7]:
// dataframe-read-examples-01-files.scala

// val orders = spark.read.
//   format("json").
//   load("/public/retail_db_json/orders")

val orders = spark.read.json("/public/retail_db_json/orders")

orders.show()
orders.printSchema()

+-----------------+--------------------+--------+---------------+
|order_customer_id|          order_date|order_id|   order_status|
+-----------------+--------------------+--------+---------------+
|            11599|2013-07-25 00:00:...|       1|         CLOSED|
|              256|2013-07-25 00:00:...|       2|PENDING_PAYMENT|
|            12111|2013-07-25 00:00:...|       3|       COMPLETE|
|             8827|2013-07-25 00:00:...|       4|         CLOSED|
|            11318|2013-07-25 00:00:...|       5|       COMPLETE|
|             7130|2013-07-25 00:00:...|       6|       COMPLETE|
|             4530|2013-07-25 00:00:...|       7|       COMPLETE|
|             2911|2013-07-25 00:00:...|       8|     PROCESSING|
|             5657|2013-07-25 00:00:...|       9|PENDING_PAYMENT|
|             5648|2013-07-25 00:00:...|      10|PENDING_PAYMENT|
|              918|2013-07-25 00:00:...|      11| PAYMENT_REVIEW|
|             1837|2013-07-25 00:00:...|      12|         CLOSED|
|         

orders = [order_customer_id: bigint, order_date: string ... 2 more fields]


[order_customer_id: bigint, order_date: string ... 2 more fields]

### Supported file formats

Let us look at the file formats Spark supports for creating data frames and saving them.

* The following file formats are supported out of the box with Spark
    * text – using text (each line as a single value column, e.g. for fixed length records) or csv (delimited)
    * json
    * orc
    * parquet
* Avro is available with 3rd party plugins
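
One practical difference between these formats is how much of the schema survives a round trip. The sketch below is an illustration only: it writes a made-up two-row data frame (typedOrders) to a temporary local directory in csv, json and parquet, and then checks the type of order_id after reading each one back. It assumes local mode; the helper name target is hypothetical.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{IntegerType, LongType, StringType}

// In spark-shell or a notebook `spark` already exists; getOrCreate() then returns that session.
val spark = SparkSession.builder().master("local[*]").appName("formats-sketch").getOrCreate()
import spark.implicits._

val formatsDir = Files.createTempDirectory("formats_sketch")
def target(name: String): String = formatsDir.resolve(name).toUri.toString

// order_id is an integer column in the source data frame
val typedOrders = Seq((1, "CLOSED"), (2, "COMPLETE")).toDF("order_id", "order_status")

typedOrders.write.csv(target("csv"))
typedOrders.write.json(target("json"))
typedOrders.write.parquet(target("parquet"))

// csv: no schema is stored, every column comes back as string unless a schema is supplied
val csvType = spark.read.csv(target("csv")).schema.fields(0).dataType
// json: the schema is inferred from the values, whole numbers are read as long
val jsonType = spark.read.json(target("json")).schema("order_id").dataType
// parquet: the schema is stored with the data
val parquetType = spark.read.parquet(target("parquet")).schema("order_id").dataType

println(s"csv=$csvType json=$jsonType parquet=$parquetType")
```

This matches the outputs in the cells of this section, where the csv read reports order_id as string, the json read reports bigint, and the orc and parquet reads report int.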

In [3]:
val ordersCSV = spark.read.csv("/public/retail_db/orders").
  toDF("order_id", "order_date", "order_customer_id", "order_status")

import org.apache.spark.sql.types.{IntegerType, FloatType}
val orders = ordersCSV.
  withColumn("order_id", $"order_id".cast("int")).
  withColumn("order_customer_id", $"order_customer_id".cast(IntegerType))


val orders_read = spark.read.format("text").
  load("/user/training/sparkscala/orders_text")

Name: org.apache.hadoop.security.AccessControlException
Message: Permission denied: user=nagasreed, access=EXECUTE, inode="/user/ramchander_chikkala/sparkscala/orders_text":ramchander_chikkala:hdfs:drwx------
	at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:353)
	at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkTraverse(FSPermissionChecker.java:292)
	at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:238)
	at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:190)
	at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1950)
	at org.apache.hadoop.hdfs.server.namenode.FSDirStatAndListingOp.getFileInfo(FSDirStatAndListingOp.java:108)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getFileInfo(FSNamesystem.java:4146)
	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getFileInfo

In [4]:
orders.selectExpr("concat(order_id, ',', order_date, ',', order_customer_id, ',', order_status)").
  write.
  format("text").
  save("/user/training/sparkscala/orders_text")

Name: Unknown Error
Message: lastException: Throwable = null
<console>:26: error: not found: value orders
       orders.selectExpr("concat(order_id, ',', order_date, ',', order_customer_id, ',', order_status)").
       ^

StackTrace: 

In [3]:
orders.selectExpr("concat(order_id, ',', order_date, ',', order_customer_id, ',', order_status)").
  write.
  mode("overwrite").
  text("/user/training/sparkscala/orders_text")


In [4]:
val orders_read = spark.read.text("/user/training/sparkscala/orders_text")

orders_read = [value: string]


[value: string]

In [6]:
orders_read.printSchema()

root
 |-- value: string (nullable = true)



In [5]:
orders_read.show()

+--------------------+
|               value|
+--------------------+
|1,2013-07-25 00:0...|
|2,2013-07-25 00:0...|
|3,2013-07-25 00:0...|
|4,2013-07-25 00:0...|
|5,2013-07-25 00:0...|
|6,2013-07-25 00:0...|
|7,2013-07-25 00:0...|
|8,2013-07-25 00:0...|
|9,2013-07-25 00:0...|
|10,2013-07-25 00:...|
|11,2013-07-25 00:...|
|12,2013-07-25 00:...|
|13,2013-07-25 00:...|
|14,2013-07-25 00:...|
|15,2013-07-25 00:...|
|16,2013-07-25 00:...|
|17,2013-07-25 00:...|
|18,2013-07-25 00:...|
|19,2013-07-25 00:...|
|20,2013-07-25 00:...|
+--------------------+
only showing top 20 rows



In [1]:
val ordersCSV = spark.read.csv("/public/retail_db/orders").
  toDF("order_id", "order_date", "order_customer_id", "order_status")

import org.apache.spark.sql.types.{IntegerType, FloatType}
val orders = ordersCSV.
  withColumn("order_id", $"order_id".cast("int")).
  withColumn("order_customer_id", $"order_customer_id".cast(IntegerType))

orders.write.
  format("csv").
  save("/user/training/bootcampdemo/pyspark/orders_csv")

orders.write.csv("/user/training/bootcampdemo/pyspark/orders_csv")

Name: org.apache.spark.sql.AnalysisException
Message: path hdfs://nn01.itversity.com:8020/user/ramchander_chikkala/bootcampdemo/pyspark/orders_csv already exists.;
StackTrace:   at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:109)
  at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
  at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
  at org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apa
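
The AnalysisException above is what the default save mode produces when the target path already exists. The sketch below walks through the save modes accepted by mode on the writer. The two-row data frame twoOrders and the temporary local directory are made up for illustration, and the example assumes local mode.

```scala
import java.nio.file.Files
import scala.util.Try
import org.apache.spark.sql.SparkSession

// In spark-shell or a notebook `spark` already exists; getOrCreate() then returns that session.
val spark = SparkSession.builder().master("local[*]").appName("save-modes-sketch").getOrCreate()
import spark.implicits._

val modesDir = Files.createTempDirectory("save_modes_sketch")
val modesTarget = modesDir.resolve("orders_csv").toUri.toString
val twoOrders = Seq((1, "CLOSED"), (2, "COMPLETE")).toDF("order_id", "order_status")

// The path does not exist yet, so the first write succeeds
twoOrders.write.csv(modesTarget)

// Default mode: fail if the path exists (captured in a Try instead of aborting the cell)
val secondWrite = Try(twoOrders.write.csv(modesTarget))

// overwrite: replace whatever is at the path (2 rows afterwards)
twoOrders.write.mode("overwrite").csv(modesTarget)

// append: add new files next to the existing ones (4 rows afterwards)
twoOrders.write.mode("append").csv(modesTarget)

// ignore: the path exists, so nothing is written
twoOrders.write.mode("ignore").csv(modesTarget)

val rowsAfterModes = spark.read.csv(modesTarget).count()
println(s"second write failed: ${secondWrite.isFailure}, rows: $rowsAfterModes")
```

For cells that are meant to be re-run, as in the rest of this section, mode("overwrite") is usually the convenient choice.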

In [2]:
val orders_read1 = spark.read.
  format("csv").
  load("/user/training/bootcampdemo/pyspark/orders_csv").
  toDF("order_id", "order_date", "order_customer_id", "order_status")

orders_read1.show()
orders_read1.printSchema()

+--------+--------------------+-----------------+---------------+
|order_id|          order_date|order_customer_id|   order_status|
+--------+--------------------+-----------------+---------------+
|       1|2013-07-25 00:00:...|            11599|         CLOSED|
|       2|2013-07-25 00:00:...|              256|PENDING_PAYMENT|
|       3|2013-07-25 00:00:...|            12111|       COMPLETE|
|       4|2013-07-25 00:00:...|             8827|         CLOSED|
|       5|2013-07-25 00:00:...|            11318|       COMPLETE|
|       6|2013-07-25 00:00:...|             7130|       COMPLETE|
|       7|2013-07-25 00:00:...|             4530|       COMPLETE|
|       8|2013-07-25 00:00:...|             2911|     PROCESSING|
|       9|2013-07-25 00:00:...|             5657|PENDING_PAYMENT|
|      10|2013-07-25 00:00:...|             5648|PENDING_PAYMENT|
|      11|2013-07-25 00:00:...|              918| PAYMENT_REVIEW|
|      12|2013-07-25 00:00:...|             1837|         CLOSED|
|      13|

orders_read1 = [order_id: string, order_date: string ... 2 more fields]



[order_id: string, order_date: string ... 2 more fields]

In [4]:
val orders_read2 = spark.read.
  csv("/user/training/bootcampdemo/pyspark/orders_csv").
  toDF("order_id", "order_date", "order_customer_id", "order_status")

orders_read2.show()
orders_read2.printSchema()

+--------+--------------------+-----------------+---------------+
|order_id|          order_date|order_customer_id|   order_status|
+--------+--------------------+-----------------+---------------+
|       1|2013-07-25 00:00:...|            11599|         CLOSED|
|       2|2013-07-25 00:00:...|              256|PENDING_PAYMENT|
|       3|2013-07-25 00:00:...|            12111|       COMPLETE|
|       4|2013-07-25 00:00:...|             8827|         CLOSED|
|       5|2013-07-25 00:00:...|            11318|       COMPLETE|
|       6|2013-07-25 00:00:...|             7130|       COMPLETE|
|       7|2013-07-25 00:00:...|             4530|       COMPLETE|
|       8|2013-07-25 00:00:...|             2911|     PROCESSING|
|       9|2013-07-25 00:00:...|             5657|PENDING_PAYMENT|
|      10|2013-07-25 00:00:...|             5648|PENDING_PAYMENT|
|      11|2013-07-25 00:00:...|              918| PAYMENT_REVIEW|
|      12|2013-07-25 00:00:...|             1837|         CLOSED|
|      13|

orders_read2 = [order_id: string, order_date: string ... 2 more fields]



[order_id: string, order_date: string ... 2 more fields]

In [5]:
val ordersCSV = spark.read.csv("/public/retail_db/orders").
  toDF("order_id", "order_date", "order_customer_id", "order_status")

ordersCSV = [order_id: string, order_date: string ... 2 more fields]


[order_id: string, order_date: string ... 2 more fields]

In [6]:
import org.apache.spark.sql.types.{IntegerType, FloatType}
val orders = ordersCSV.
  withColumn("order_id", $"order_id".cast("int")).
  withColumn("order_customer_id", $"order_customer_id".cast(IntegerType))


orders = [order_id: int, order_date: string ... 2 more fields]


[order_id: int, order_date: string ... 2 more fields]

In [7]:
orders.write.
  format("json").
  mode("overwrite").
  save("/user/training/sparkscala/orders_json")

In [8]:
orders.write.
  mode("overwrite").
  json("/user/training/sparkscala/orders_json")

In [9]:
val orders_read = spark.read.
  format("json").
  load("/user/training/sparkscala/orders_json")

orders_read = [order_customer_id: bigint, order_date: string ... 2 more fields]


[order_customer_id: bigint, order_date: string ... 2 more fields]

In [10]:
val orders_read = spark.read.
  json("/user/training/sparkscala/orders_json")

orders_read = [order_customer_id: bigint, order_date: string ... 2 more fields]


[order_customer_id: bigint, order_date: string ... 2 more fields]

In [13]:
orders_read.show()

+-----------------+--------------------+--------+---------------+
|order_customer_id|          order_date|order_id|   order_status|
+-----------------+--------------------+--------+---------------+
|            11599|2013-07-25 00:00:...|       1|         CLOSED|
|              256|2013-07-25 00:00:...|       2|PENDING_PAYMENT|
|            12111|2013-07-25 00:00:...|       3|       COMPLETE|
|             8827|2013-07-25 00:00:...|       4|         CLOSED|
|            11318|2013-07-25 00:00:...|       5|       COMPLETE|
|             7130|2013-07-25 00:00:...|       6|       COMPLETE|
|             4530|2013-07-25 00:00:...|       7|       COMPLETE|
|             2911|2013-07-25 00:00:...|       8|     PROCESSING|
|             5657|2013-07-25 00:00:...|       9|PENDING_PAYMENT|
|             5648|2013-07-25 00:00:...|      10|PENDING_PAYMENT|
|              918|2013-07-25 00:00:...|      11| PAYMENT_REVIEW|
|             1837|2013-07-25 00:00:...|      12|         CLOSED|
|         

In [12]:
orders_read.printSchema()

root
 |-- order_customer_id: long (nullable = true)
 |-- order_date: string (nullable = true)
 |-- order_id: long (nullable = true)
 |-- order_status: string (nullable = true)
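
The orders data is flat, but json can also carry a complex schema with nested objects and arrays. As a minimal sketch, the example below reads a single made-up record (the field names customer, items, sku and qty are hypothetical), lets Spark infer the nested schema, and then flattens it with dot notation and explode.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.explode

// In spark-shell or a notebook `spark` already exists; getOrCreate() then returns that session.
val spark = SparkSession.builder().master("local[*]").appName("nested-json-sketch").getOrCreate()
import spark.implicits._

// One hypothetical JSON record with a nested object and an array of objects
val nestedJson = Seq(
  """{"order_id": 1, "customer": {"id": 100, "city": "Austin"}, "items": [{"sku": "a", "qty": 2}, {"sku": "b", "qty": 1}]}"""
).toDS()

// The schema, including the struct and the array of structs, is inferred from the data
val nestedOrders = spark.read.json(nestedJson)
nestedOrders.printSchema()

// Dot notation reaches into a struct; explode turns each array element into its own row
val orderItems = nestedOrders.
  select($"order_id", $"customer.id".alias("customer_id"), explode($"items").alias("item")).
  select($"order_id", $"customer_id", $"item.sku".alias("sku"), $"item.qty".alias("qty"))

orderItems.show()
```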



In [14]:
val ordersCSV = spark.read.csv("/public/retail_db/orders").
  toDF("order_id", "order_date", "order_customer_id", "order_status")

import org.apache.spark.sql.types.{IntegerType, FloatType}
val orders = ordersCSV.
  withColumn("order_id", $"order_id".cast("int")).
  withColumn("order_customer_id", $"order_customer_id".cast(IntegerType))


ordersCSV = [order_id: string, order_date: string ... 2 more fields]
orders = [order_id: int, order_date: string ... 2 more fields]


[order_id: int, order_date: string ... 2 more fields]

In [15]:
orders.write.
  format("orc").
  mode("overwrite").
  save("/user/training/sparkscala/orders_orc")

In [17]:
orders.write.
  mode("overwrite").
  orc("/user/training/sparkscala/orders_orc")


In [19]:
val orders_read = spark.read.
  format("orc").
  load("/user/training/sparkscala/orders_orc")

orders_read = [order_id: int, order_date: string ... 2 more fields]



[order_id: int, order_date: string ... 2 more fields]

In [21]:
val orders_read = spark.read.
  orc("/user/training/sparkscala/orders_orc")

Name: Unknown Error
Message: <console>:2: error: unclosed string literal
  orc("/user/ramchander_chikkala
      ^
<console>:3: error: unclosed string literal
  /sparkscala/orders_orc")
                         ^

StackTrace: 

In [22]:
orders_read.show()
orders_read.printSchema()

+--------+--------------------+-----------------+---------------+
|order_id|          order_date|order_customer_id|   order_status|
+--------+--------------------+-----------------+---------------+
|       1|2013-07-25 00:00:...|            11599|         CLOSED|
|       2|2013-07-25 00:00:...|              256|PENDING_PAYMENT|
|       3|2013-07-25 00:00:...|            12111|       COMPLETE|
|       4|2013-07-25 00:00:...|             8827|         CLOSED|
|       5|2013-07-25 00:00:...|            11318|       COMPLETE|
|       6|2013-07-25 00:00:...|             7130|       COMPLETE|
|       7|2013-07-25 00:00:...|             4530|       COMPLETE|
|       8|2013-07-25 00:00:...|             2911|     PROCESSING|
|       9|2013-07-25 00:00:...|             5657|PENDING_PAYMENT|
|      10|2013-07-25 00:00:...|             5648|PENDING_PAYMENT|
|      11|2013-07-25 00:00:...|              918| PAYMENT_REVIEW|
|      12|2013-07-25 00:00:...|             1837|         CLOSED|
|      13|

In [23]:
val ordersCSV = spark.read.csv("/public/retail_db/orders").
  toDF("order_id", "order_date", "order_customer_id", "order_status")

import org.apache.spark.sql.types.{IntegerType, FloatType}
val orders = ordersCSV.
  withColumn("order_id", $"order_id".cast("int")).
  withColumn("order_customer_id", $"order_customer_id".cast(IntegerType))

ordersCSV = [order_id: string, order_date: string ... 2 more fields]
orders = [order_id: int, order_date: string ... 2 more fields]


[order_id: int, order_date: string ... 2 more fields]

In [24]:
orders.write.
  format("parquet").
  mode("overwrite").
  save("/user/training/sparkscala/orders_parquet")

In [25]:
orders.write.
  mode("overwrite").
  parquet("/user/training/sparkscala/orders_parquet")

In [26]:
val orders_read = spark.read.
  format("parquet").
  load("/user/training/sparkscala/orders_parquet")

orders_read = [order_id: int, order_date: string ... 2 more fields]


[order_id: int, order_date: string ... 2 more fields]

In [27]:
val orders_read = spark.read.
  parquet("/user/training/sparkscala/orders_parquet")

orders_read = [order_id: int, order_date: string ... 2 more fields]


[order_id: int, order_date: string ... 2 more fields]

In [28]:
orders_read.show()
orders_read.printSchema()

+--------+--------------------+-----------------+---------------+
|order_id|          order_date|order_customer_id|   order_status|
+--------+--------------------+-----------------+---------------+
|       1|2013-07-25 00:00:...|            11599|         CLOSED|
|       2|2013-07-25 00:00:...|              256|PENDING_PAYMENT|
|       3|2013-07-25 00:00:...|            12111|       COMPLETE|
|       4|2013-07-25 00:00:...|             8827|         CLOSED|
|       5|2013-07-25 00:00:...|            11318|       COMPLETE|
|       6|2013-07-25 00:00:...|             7130|       COMPLETE|
|       7|2013-07-25 00:00:...|             4530|       COMPLETE|
|       8|2013-07-25 00:00:...|             2911|     PROCESSING|
|       9|2013-07-25 00:00:...|             5657|PENDING_PAYMENT|
|      10|2013-07-25 00:00:...|             5648|PENDING_PAYMENT|
|      11|2013-07-25 00:00:...|              918| PAYMENT_REVIEW|
|      12|2013-07-25 00:00:...|             1837|         CLOSED|
|      13|

In [35]:
%%ssh
spark-shell --master yarn --conf spark.ui.port=12901 --packages com.databricks:spark-avro_2.11:4.0.0

Name: Error parsing magics!
Message: Magic ssh does not exist!
StackTrace: 

In [29]:
// spark-shell --master yarn --conf spark.ui.port=12901 --packages com.databricks:spark-avro_2.11:4.0.0

val ordersCSV = spark.read.csv("/public/retail_db/orders").
  toDF("order_id", "order_date", "order_customer_id", "order_status")

ordersCSV = [order_id: string, order_date: string ... 2 more fields]


[order_id: string, order_date: string ... 2 more fields]

In [30]:
import org.apache.spark.sql.types.{IntegerType, FloatType}
val orders = ordersCSV.
  withColumn("order_id", $"order_id".cast("int")).
  withColumn("order_customer_id", $"order_customer_id".cast(IntegerType))

orders = [order_id: int, order_date: string ... 2 more fields]


[order_id: int, order_date: string ... 2 more fields]

In [32]:
orders.write.
  format("com.databricks.spark.avro").
  mode("overwrite").
  save("/user/training/sparkscala/orders_avro")


Name: org.apache.spark.sql.AnalysisException
Message: Failed to find data source: com.databricks.spark.avro. Please find an Avro package at http://spark.apache.org/third-party-projects.html;
StackTrace:   at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:630)
  at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:241)
  at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:225)

In [None]:
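// avro(...) on the writer is an implicit provided by the spark-avro package
import com.databricks.spark.avro._
orders.write.
  mode("overwrite").
  avro("/user/training/sparkscala/orders_avro")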

In [None]:
val orders_read = spark.read.
  format("com.databricks.spark.avro").
  load("/user/training/sparkscala/orders_avro")

In [None]:
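// avro(...) on the reader also needs the spark-avro implicits in scope
import com.databricks.spark.avro._
val orders_read = spark.read.
  avro("/user/training/sparkscala/orders_avro")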

In [None]:
orders_read.show()
orders_read.printSchema()

### Text Data – Custom Delimiters

Now let us understand how to process text data with different line and field delimiters.

* We can read text data into an RDD using SparkContext’s textFile. It treats the newline character as the record delimiter.
* We have to parse each record in the RDD and derive the data to process further.
* With Spark Data Frames we have the csv and text APIs to read text data into a Data Frame.
* Both of them use the newline character as the record delimiter. When we use the csv API to create a data frame, we can also specify the field separator/delimiter using sep as one of the options.
* We can also specify sep when writing data to text files with the csv API, using any field separator or delimiter. Alternatively, we can concatenate the fields as part of selectExpr with a delimiter of our choice and use the text API.
* Here is an example that reads and writes data with the ASCII null character as the field delimiter.

In [None]:
val inputBaseDir = "/mnt/c/data/retail_db"
val outputBaseDir = "/mnt/c/data/sparkscala"

val ordersCSV = spark.
  read.
  csv(inputBaseDir +"/orders").
  toDF("order_id", "order_date", "order_customer_id", "order_status")

In [None]:
import org.apache.spark.sql.types.{IntegerType, FloatType}
val orders = ordersCSV.
  withColumn("order_id", $"order_id".cast(IntegerType)).
  withColumn("order_customer_id", $"order_customer_id".cast("int"))

In [None]:
orders.
  selectExpr("concat(order_id, '\00', order_date, '\00', order_customer_id, '\00', order_status)").
  write.
  text(outputBaseDir + "/orders_null")

In [None]:
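// Alternative to the text API above: let the csv writer apply the separator.
// The path already exists after the previous cell, hence mode("overwrite").
orders.write.
  mode("overwrite").
  option("sep", "\00").
  csv(outputBaseDir + "/orders_null")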

In [None]:
val orders_read_csv = spark.
  read.
  option("sep", "\00").
  csv(outputBaseDir + "/orders_null").
  toDF("order_id", "order_date", "order_customer_id", "order_status")

In [None]:
val orders_read = orders_read_csv.
  withColumn("order_id", $"order_id".cast(IntegerType)).
  withColumn("order_customer_id", $"order_customer_id".cast(IntegerType))

In [None]:
orders_read.show()
orders_read.printSchema()
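
The RDD route mentioned at the start of this topic, reading with SparkContext’s textFile and parsing each record by hand, is not covered by the cells above. A minimal sketch of it follows. The sample file and its two rows are made up for illustration and only mimic the shape of the orders data; the example assumes local mode.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// In spark-shell or a notebook `spark` and `sc` already exist; getOrCreate() returns the same session.
val spark = SparkSession.builder().master("local[*]").appName("textfile-sketch").getOrCreate()
val sc = spark.sparkContext

// Hypothetical comma-delimited sample with one record per line
val sampleFile = Files.createTempFile("orders_sample", ".txt")
Files.write(
  sampleFile,
  "1,2020-01-01,100,CLOSED\n2,2020-01-02,200,COMPLETE\n".getBytes(StandardCharsets.UTF_8))

// textFile splits records on newline; the fields have to be split and converted by hand
val parsedOrders = sc.textFile(sampleFile.toUri.toString).
  map(line => line.split(",")).
  map(f => (f(0).toInt, f(1), f(2).toInt, f(3))).
  collect()

parsedOrders.foreach(println)
```

With the csv API the same splitting and type handling is done through sep and a schema, which is why the Data Frame route is usually less code.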

* At times, we might have to deal with text data where the line delimiter is something other than the newline character.
* In this case, we need to use the Hadoop APIs (newAPIHadoopFile with a custom textinputformat.record.delimiter) to read data from files with a custom line delimiter into an RDD, and then process it further using either transformations/actions or data frame operations.

In [None]:
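// size and split are column functions and are not imported by default
import org.apache.spark.sql.functions.{size, split}

val path = "/public/yelp-dataset/yelp_review.csv"

// Distribution of the number of comma separated fields per newline-delimited line
spark.read.text(path).
  select(size(split($"value", ",")).alias("value")).
  groupBy("value").
  count.
  show(200)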

In [None]:
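// Work on a copy so the custom delimiter does not leak into other reads
val conf = new org.apache.hadoop.conf.Configuration(sc.hadoopConfiguration)
conf.set("textinputformat.record.delimiter", "\r")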

In [None]:
val yelpReview = sc.newAPIHadoopFile(path, 
  classOf[org.apache.hadoop.mapreduce.lib.input.TextInputFormat], 
  classOf[org.apache.hadoop.io.LongWritable], 
  classOf[org.apache.hadoop.io.Text], 
  conf)

In [None]:
yelpReview.count()

In [None]:
yelpReview.map(rec => rec._2.toString).
  take(10).
  foreach(println)

In [None]:
yelpReview.
  map(rec => (rec._2.toString.split("\",\"").size, 1)).
  reduceByKey((x, y) => x + y).
  collect.
  foreach(println)
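
The records read this way can also be handed over to data frame operations. The sketch below repeats the technique on a tiny made-up file so that the result is easy to check: three records separated by carriage returns are read with a custom record delimiter and converted to a Data Frame with a single value column. The file contents and the names crFile, crConf, crRecords and crDF are hypothetical, and the example assumes local mode.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.sql.SparkSession

// In spark-shell or a notebook `spark` already exists; getOrCreate() then returns that session.
val spark = SparkSession.builder().master("local[*]").appName("record-delimiter-sketch").getOrCreate()
import spark.implicits._

// Hypothetical file whose records are separated by \r instead of \n
val crFile = Files.createTempFile("cr_delimited", ".txt")
Files.write(crFile, "a,1\rb,2\rc,3".getBytes(StandardCharsets.UTF_8))

// Copy of the Hadoop configuration with the custom record delimiter
val crConf = new Configuration(spark.sparkContext.hadoopConfiguration)
crConf.set("textinputformat.record.delimiter", "\r")

// Convert the reusable Text values to String right away, before collecting or caching
val crRecords = spark.sparkContext.
  newAPIHadoopFile(
    crFile.toUri.toString,
    classOf[TextInputFormat],
    classOf[LongWritable],
    classOf[Text],
    crConf).
  map(rec => rec._2.toString)

// From here on the usual data frame operations apply
val crDF = crRecords.toDF("value")
crDF.show()
```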

### Data Frames Persistence

Now let us see how we can persist data frames.

* By default, data is not retained: it is streamed to executor tasks as it is being processed.
* Here is what happens when data is read into an executor task for processing
    * Deserialize into objects
    * Stream into memory
    * Process the data in the executor task by applying the logic
    * Flush the deserialized objects from memory when the executor tasks terminate
* Sometimes we have to read the same data multiple times within the same job. By default, the data has to be deserialized and submitted to executor tasks for processing every time.
* To avoid deserializing into Java objects when the same data has to be read multiple times, we can leverage caching.
* There are 2 methods, persist and cache. From Spark 2, data frames are cached at MEMORY_AND_DISK by default.
* cache is a shorthand for persist at MEMORY_AND_DISK.
* This is what happens when we cache a Data Frame
    * Caching is done only when the data is read at least once for processing
    * Each record is deserialized into an object
    * These deserialized objects are cached in memory as long as they fit
    * If they do not fit, the deserialized objects are spilled to disk
* You can get details about different persistence levels from [here](https://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.StorageLevel).
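
The points above can be tried out with a small sketch like the one below. The data frames cachedIds and memoryOnlyIds are made-up examples built with spark.range; storageLevel is used only to inspect which level a data frame was marked with.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

// In spark-shell or a notebook `spark` already exists; getOrCreate() then returns that session.
val spark = SparkSession.builder().master("local[*]").appName("persistence-sketch").getOrCreate()

// cache() only marks the data frame; nothing is read or stored yet
val cachedIds = spark.range(0, 1000).toDF("id")
cachedIds.cache()
val cacheLevel = cachedIds.storageLevel

cachedIds.count()     // first action reads the data and fills the cache
cachedIds.count()     // later actions can be served from the cache
cachedIds.unpersist() // release the cached data when it is no longer needed

// persist takes an explicit storage level when the default is not what we want
val memoryOnlyIds = spark.range(0, 500).toDF("id").persist(StorageLevel.MEMORY_ONLY)
val persistLevel = memoryOnlyIds.storageLevel
memoryOnlyIds.count()
memoryOnlyIds.unpersist()

println(s"cache: $cacheLevel, persist: $persistLevel")
```

Caching pays off only when the same data frame is used by more than one action; for a single pass it adds the cost of storing the data without any later benefit.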