# Processing text data with different delimiters

As part of this topic and the next few topics, let us look at the different file formats supported by Spark, along with compression algorithms. File formats include csv, orc, parquet, and avro; compression algorithms include gzip, deflate, and snappy.

* Dealing with different delimiters for the text file format (e.g., the Yelp data set)
* Different file formats and the APIs associated with them
* Spark 2.x supports these file formats out of the box
    * json
    * csv
    * orc
    * parquet
    * and more
* We can also use third-party libraries to read data from file formats such as Avro
* Compression can be used extensively to reduce storage requirements and, by cutting I/O, to improve performance
* The way compression is configured differs from one file format to another
* File formats such as orc and parquet compress data by default
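
As a minimal sketch of the reader and writer APIs listed above, the snippet below writes a small DataFrame as json, csv, and parquet and reads it back. It assumes a spark-shell session on Spark 2.x running in local mode, where spark is predefined. The scratch directory, the orders data, and its column names are hypothetical and exist only for illustration.

```scala
// Assumes spark-shell (Spark 2.x, local mode), where `spark` is predefined.
import java.nio.file.Files
import spark.implicits._

// Hypothetical scratch directory and sample data, used only for this illustration
val base = Files.createTempDirectory("file-formats-demo").toString
val orders = Seq((1, "COMPLETE"), (2, "PENDING"), (3, "CLOSED")).toDF("order_id", "order_status")

// Write the same data in several formats; the "compression" option selects the codec
orders.write.option("compression", "gzip").json(s"$base/json")
orders.write.option("header", "true").option("compression", "gzip").csv(s"$base/csv")
orders.write.option("compression", "snappy").parquet(s"$base/parquet")

// Read each format back with the matching DataFrameReader method
val fromJson = spark.read.json(s"$base/json")
val fromCsv = spark.read.option("header", "true").csv(s"$base/csv")
val fromParquet = spark.read.parquet(s"$base/parquet")

fromParquet.show()
```

spark.read.orc and the orc writer follow the same pattern. In Spark 2.x, json and csv output is written uncompressed unless a codec is requested, whereas parquet output is compressed even when the option is omitted.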

### Custom Record Delimiter
* sc.textFile or spark.read.csv works fine as long as the record delimiter is the new line character
* If the record delimiter is any other character, we have to use lower-level Hadoop APIs such as org.apache.hadoop.mapreduce.lib.input.TextInputFormat
* Spark Context (sc) has an API called newAPIHadoopFile to use these lower-level Hadoop APIs
* newAPIHadoopFile takes 5 arguments
    * path
    * input file format
    * key type
    * value type
    * configuration
* We first need to get the Hadoop configuration from the Spark Context and set textinputformat.record.delimiter
* Key type and value type depend on the file format. For the text file format, the key type is org.apache.hadoop.io.LongWritable and the value type is org.apache.hadoop.io.Text
* To preview the data, we have to convert each value to a string by calling toString as part of map
* The complete code snippet is given below

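```scala
// spark-newAPIHadoopFile-yelp_review.scala

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

val path = "/public/yelp-dataset/yelp_review.csv"

// Use carriage return as the record delimiter instead of the default new line.
// sc.hadoopConfiguration is shared, so this setting may also affect later reads in the session.
val conf = sc.hadoopConfiguration
conf.set("textinputformat.record.delimiter", "\r")

// Keys are byte offsets (LongWritable) and values are the record contents (Text)
val yelpReview = sc.newAPIHadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text], conf)
yelpReview.count

// Preview the first 10 records as strings
yelpReview.map(r => r._2.toString).take(10).foreach(println)

// Count records by number of fields to check that records were split consistently
yelpReview.map(r => (r._2.toString.split("\",\"").size, 1)).reduceByKey(_ + _).collect.foreach(println)
```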

```text
"review_id","user_id","business_id","stars","date","text","useful","funny","cool"

"vkVSCC7xljjrAI4UGfnKEQ","bv2nCi5Qv5vroFiqKGopiw","AEx2SYEUJmTxVVB18LlCwA","5","2016-05-28","Super simple place but amazing nonetheless. It's been around since the 30's and they still serve the same thing they started with: a bologna and salami sandwich with mustard. 

Staff was very helpful and friendly.","0","0","0"

"n6QzIUObkYshz4dz2QRJTw","bv2nCi5Qv5vroFiqKGopiw","VR6GpWIda3SfvPC-lg9H3w","5","2016-05-28","Small unassuming place that changes their menu every so often. Cool decor and vibe inside their 30 seat restaurant. Call for a reservation. 

We had their beef tartar and pork belly to start and a salmon dish and lamb meal for mains. Everything was incredible! I could go on at length about how all the listed ingredients really make their dishes amazing but honestly you just need to go. 

A bit outside of downtown montreal but take the metro out and it's less than a 10 minute walk from the station."
```
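
The preview shows review text that contains line breaks of its own, which is why new line cannot serve as the record delimiter for this file. The plain Scala sketch below illustrates the effect without Spark. The sample string and the helper names splitRecords and fieldCountHistogram are hypothetical; the histogram mirrors the reduceByKey check in the snippet above.

```scala
import java.util.regex.Pattern

// Hypothetical sample: three records terminated by "\r"; the first has line breaks inside its text field
val sample =
  "\"r1\",\"5\",\"Simple place.\n\nStaff was friendly.\",\"0\"\r" +
  "\"r2\",\"4\",\"Cool decor.\",\"1\"\r" +
  "\"r3\",\"3\",\"Fine.\",\"0\"\r"

// Split raw text into records on a literal delimiter, dropping empty pieces
def splitRecords(data: String, delimiter: String): List[String] =
  data.split(Pattern.quote(delimiter)).filter(_.nonEmpty).toList

// Count how many records have each number of fields, splitting fields on ","
def fieldCountHistogram(records: List[String]): Map[Int, Int] =
  records.groupBy(_.split("\",\"").length).map { case (fields, rs) => fields -> rs.size }

// Splitting on "\r" keeps every review intact: 3 records with 4 fields each
val byCarriageReturn = splitRecords(sample, "\r")
val goodHistogram = fieldCountHistogram(byCarriageReturn) // Map(4 -> 3)

// Splitting on "\n" breaks the first review apart and merges the remaining records
val byNewLine = splitRecords(sample, "\n")
val badHistogram = fieldCountHistogram(byNewLine) // Map(3 -> 1, 8 -> 1)
```

A histogram with a single key equal to the expected number of columns suggests that the delimiter was chosen correctly; several different keys usually mean that records were split in the wrong places.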

```text
path = /public/yelp-dataset/yelp_review.csv
conf = Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, hdfs-default.xml, hdfs-site.xml, __spark_hadoop_conf__.xml, file:/etc/spark2/2.6.5.0-292/0/hive-site.xml
yelpReview = /public/yelp-dataset/yelp_review.csv NewHadoopRDD[0] at newAPIHadoopFile at <console>:37
```
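
Setting the delimiter directly on sc.hadoopConfiguration changes a configuration object that other reads in the same session may also use. One possible variation, sketched below, copies the configuration first so that the delimiter applies to a single read. It assumes a spark-shell session where sc is predefined; textFileWithDelimiter is a hypothetical helper name, not a Spark API.

```scala
// Assumes spark-shell, where `sc` is predefined.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.rdd.RDD

// Hypothetical helper (not part of Spark): read a text file using a custom record delimiter
def textFileWithDelimiter(path: String, delimiter: String): RDD[String] = {
  // Copy the shared configuration so the delimiter applies only to this read
  val conf = new Configuration(sc.hadoopConfiguration)
  conf.set("textinputformat.record.delimiter", delimiter)
  sc.newAPIHadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text], conf)
    .map { case (_, value) => value.toString } // keep only the record text, as a String
}

// Same path and delimiter as in the snippet above; nothing is read until an action runs
val reviews = textFileWithDelimiter("/public/yelp-dataset/yelp_review.csv", "\r")
```

Calling reviews.take(10).foreach(println) previews the records as before. Converting each Text value to a String right away is also the safer choice, because Hadoop record readers reuse the same Writable object from one record to the next.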