# Processing text data with different delimiters

As part of this topic and the next few, let us look at the different file formats supported by Spark, along with compression algorithms. File formats include csv, orc, parquet, and avro; compression algorithms include gzip, deflate, and snappy.

* Dealing with different delimiters for the text file format (e.g., the Yelp dataset)
* Different file formats and APIs associated with them
* Spark 2.x supports these file formats out of the box
    * json
    * csv
    * orc
    * parquet
    * and more
* We can also use third-party APIs to read data from file formats such as Avro
* Compression can be used extensively to reduce storage requirements and also to improve performance
* The way compression is configured differs a bit from one file format to another
* Columnar file formats such as orc and parquet compress data by default
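
The points above can be sketched as follows. This is a minimal illustration, not a definitive recipe: the sample rows, the column names and the temporary directory are made up for the example, and it covers only three of the built-in formats. It assumes a local Spark session; in spark-shell or a notebook, getOrCreate returns the session that already exists.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// In spark-shell or a notebook, getOrCreate() returns the existing session.
val spark = SparkSession.builder().appName("file-formats-sketch").master("local[*]").getOrCreate()

// Hypothetical sample data and scratch directory, used only for illustration.
val orders = spark.
  createDataFrame(Seq((1, "COMPLETE"), (2, "PENDING"), (3, "CLOSED"))).
  toDF("order_id", "order_status")
val baseDir = Files.createTempDirectory("formats-sketch").toString

// The writer API is the same for each format; only the format method and the codec change.
orders.write.option("compression", "gzip").json(s"$baseDir/json")
orders.write.option("compression", "gzip").option("header", "true").csv(s"$baseDir/csv")
orders.write.option("compression", "snappy").parquet(s"$baseDir/parquet")

// Read each copy back with the matching reader; compressed files are decompressed transparently.
val fromJson = spark.read.json(s"$baseDir/json")
val fromCsv = spark.read.option("header", "true").csv(s"$baseDir/csv")
val fromParquet = spark.read.parquet(s"$baseDir/parquet")

println(s"json=${fromJson.count}, csv=${fromCsv.count}, parquet=${fromParquet.count}")
```

The compression option is set explicitly here only to make the codec visible; leaving it out falls back to each format's default.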

### Custom Record Delimiter
* sc.textFile or spark.read.csv works fine as long as the record delimiter is the newline character
* But if the record delimiter is any character other than newline, we have to use lower level Hadoop APIs such as org.apache.hadoop.mapreduce.lib.input.TextInputFormat
* Spark Context (sc) has an API called newAPIHadoopFile to use these lower level Hadoop APIs
* newAPIHadoopFile takes 5 arguments
    * path
    * input file format
    * key type
    * value type
    * configuration
* We first need to get the Hadoop configuration from the Spark context and set textinputformat.record.delimiter on it
* Key type and value type depend purely on the input format. For the text input format, the key type is org.apache.hadoop.io.LongWritable (the byte offset of the record) and the value type is org.apache.hadoop.io.Text (the contents of the record)
* To preview the data we have to convert each value to a String by calling toString as part of map, because Text is a Hadoop Writable rather than a plain String
* The complete code snippet is shown below

In [3]:
val path = "/public/yelp-dataset/yelp_review.csv"

// Get the Hadoop configuration from the Spark context and override the record delimiter
val conf = sc.hadoopConfiguration
conf.set("textinputformat.record.delimiter", "\r")

import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.hadoop.io.LongWritable
import org.apache.hadoop.io.Text

// Each element is a (LongWritable, Text) pair: byte offset and record contents
val yelpReview = sc.newAPIHadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text], conf)
yelpReview.count

// Preview the first 10 records as strings
yelpReview.map(r => r._2.toString).take(10).foreach(println)

// Count how many records have each number of fields when split on ","
yelpReview.map(r => (r._2.toString.split("\",\"").size, 1)).reduceByKey(_ + _).collect.foreach(println)
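
The cell above sets the delimiter directly on sc.hadoopConfiguration, which is shared across the session, so later reads such as sc.textFile may pick up the custom delimiter as well. One possible variation, sketched below, copies the configuration first so that the setting applies to a single read. The sample records and the temporary file are made up for the example so that it can run without the Yelp dataset, and a local Spark session is assumed.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.sql.SparkSession

// In spark-shell or a notebook, getOrCreate() returns the existing session.
val spark = SparkSession.builder().appName("record-delimiter-sketch").master("local[*]").getOrCreate()
val sc = spark.sparkContext

// Hypothetical sample: three records separated by carriage returns,
// the first of which contains a newline inside a quoted field.
val sample = "\"r1\",\"good\nfood\"\r\"r2\",\"ok\"\r\"r3\",\"bad\""
val samplePath = Files.createTempFile("delimiter-sketch", ".csv")
Files.write(samplePath, sample.getBytes(StandardCharsets.UTF_8))

// Copy the shared configuration so the delimiter applies to this read only.
val localConf = new Configuration(sc.hadoopConfiguration)
localConf.set("textinputformat.record.delimiter", "\r")

// Convert each Text value to a String right away.
val records = sc.
  newAPIHadoopFile(samplePath.toUri.toString, classOf[TextInputFormat], classOf[LongWritable], classOf[Text], localConf).
  map(_._2.toString)

// Number of records for each field count, splitting on ","
val fieldCounts = records.map(r => (r.split("\",\"").length, 1)).reduceByKey(_ + _).collect()

println(s"records=${records.count}")
fieldCounts.foreach(println)
```

Because the embedded newline is no longer treated as a record boundary, the first review stays in one piece and every record splits into the same number of fields.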