In [2]:
import org.apache.spark.sql.functions.col

### The Spark Programming Model
- Spark programming consists of operations on a data set, usually residing in some form of distributed, persistent storage (e.g. HDFS)
- consists of the following steps:
    - Define a set of transformations on the input data set.
    - Invoke actions that output the transformed data sets to persistent storage or return results to the driver's local memory.
    - Run local computations that operate on the results computed in a distributed fashion.
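
The three steps above can be sketched end to end as follows. This is only a minimal illustration on a generated range of numbers rather than data in HDFS; the app name and the variable names are assumptions made for the example.

```scala
import org.apache.spark.sql.SparkSession

// In spark-shell a session already exists; getOrCreate() simply returns it.
val spark = SparkSession.builder().master("local[*]").appName("model-sketch").getOrCreate()
val sc = spark.sparkContext

// Step 1: define transformations on the input data set (lazy; nothing runs yet).
val numbers = sc.parallelize(1 to 100, 4)
val evenSquares = numbers.filter(_ % 2 == 0).map(n => n.toLong * n)

// Step 2: invoke an action, which triggers the distributed computation
// and returns the results to the driver's local memory.
val results: Array[Long] = evenSquares.collect()

// Step 3: run a purely local computation on the returned results.
val total = results.sum
println(total)
```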
    
### Record Linkage
- the problem of tying multiple duplicate records to the same underlying entity when we have a large collection of records from one or more source systems
- the difficulty is that the criteria for deciding duplicate/not-duplicate vary from case to case
    - in some cases, very different-looking records refer to the same entity, and in other cases, very similar-looking records refer to different entities
    
##### spark-shell instructions
- if running examples on personal computer, can launch a local Spark cluster by specifying ```--master local[N]```, where N is the number of threads to run
    - specifying local[\*] will match the number of threads to the number of cores available on machine
- other arguments
    - ```--driver-memory 2g``` -> lets single local process use 2 GB of memory

In [3]:
// The SparkContext object
sc

#### Resilient Distributed Datasets
- ```SparkContext``` has methods that allow us to create _Resilient Distributed Datasets_, or _RDDs_, which are Spark's abstraction for representing a collection of objects that can be distributed across multiple machines in a cluster
- two ways to create _RDDs_
    - use ```SparkContext``` to create RDD from external data source
    - perform a transformation on one or more existing RDDs, yielding an RDD as a result (e.g. filtering records, aggregating records by common key, joining multiple RDDs together)
- _RDDs_ are laid out across the cluster of machines as a collection of _partitions_, each including a subset of the data
    - Spark then processes the objects within a partition in sequence, and processes multiple partitions in parallel
- One simple way to create an RDD is to call the ```parallelize``` method on ```SparkContext``` with a local collection of objects
    - first arg is the collection of objects to parallelize (any Scala ```Seq```, e.g. an ```Array```)
    - second arg is number of partitions to create

In [4]:
val rdd = sc.parallelize(Array(1, 2, 2, 4), 4)

rdd = ParallelCollectionRDD[0] at parallelize at <console>:28


ParallelCollectionRDD[0] at parallelize at <console>:28

- to create RDD from text file or directory of text files, pass the name of the file or directory to ```textFile``` 
    - ```textFile``` can access paths that reside on the local file system
    - if given a directory, it will consider all of the files in that directory as part of the given RDD
    - at this point, no data has been read by Spark or loaded into memory; instead, objects are loaded into the cluster at computation time

In [8]:
val rawblocks = sc.textFile("linkage1")

rawblocks = linkage1 MapPartitionsRDD[4] at textFile at <console>:28


linkage1 MapPartitionsRDD[4] at textFile at <console>:28

#### The REPL and Compilation

- Spark supports both interactive shell and compiled applications, which can be compiled and managed using _Apache Maven_
- shell method
    - starting work in the REPL enables quick prototyping, faster iteration, and less lag between ideas and results
    - drawbacks: not suited for large programs, since Scala interpretation takes longer
- hybrid method
    - develop in the REPL, but move established pieces of code into compiled library
    - ```spark-shell``` can use compiled JAR files with the ```--jars``` flag

### Bringing Data from the Cluster to the Client

- RDDs have various methods that allow us to read data from the cluster into the Scala REPL
- ```RDD.first``` returns the first element of the RDD into the client

In [9]:
rawblocks.first

"id_1","id_2","cmp_fname_c1","cmp_fname_c2","cmp_lname_c1","cmp_lname_c2","cmp_sex","cmp_bd","cmp_bm","cmp_by","cmp_plz","is_match"

- ```RDD.collect``` returns all the contents of an RDD to the client as an array
    - not recommended for huge data sets
- ```RDD.take``` allows us to read a given number of records into an array on the client

In [11]:
val head = rawblocks.take(10)
head.length

head = Array("id_1","id_2","cmp_fname_c1","cmp_fname_c2","cmp_lname_c1","cmp_lname_c2","cmp_sex","cmp_bd","cmp_bm","cmp_by","cmp_plz","is_match", 6698,40542,1,1,1,?,1,1,1,1,1,TRUE, 45037,49220,1,?,1,?,1,1,1,1,1,TRUE, 31835,69902,1,?,1,1,1,1,1,1,1,TRUE, 4356,31352,0.875,?,1,?,1,1,1,1,1,TRUE, 45723,49837,1,?,1,?,1,1,1,1,1,TRUE, 39716,49297,1,?,1,?,1,1,1,1,1,TRUE, 71970,71971,1,?,1,?,1,1,1,1,1,TRUE, 96601,96625,1,?,1,?,1,1,1,1,1,TRUE, 28553,71491,1,?,1,?,1,1,1,1,1,TRUE)


10

- creating an RDD does not cause distributed computation to take place on the cluster
- instead, RDDs define logical data sets that are more like intermediate computation steps
- distributed computation occurs upon invoking an _action_ on an RDD
    - e.g. the ```count``` action returns the number of objects in the RDD

In [12]:
rdd.count()

4

In [13]:
// brings objects from the RDD into local memory as an Array
rdd.collect()

[1, 2, 2, 4]

- ```saveAsTextFile``` saves contents of RDD to persistent storage
    - creates a directory, and writes out each partition as a separate file
    - this created directory might be used as an input directory by a future Spark job

In [14]:
rdd.saveAsTextFile("numbers")

Name: org.apache.hadoop.mapred.FileAlreadyExistsException
Message: Output directory file:/home/chtka/Projects/Spark/numbers already exists
StackTrace:   at org.apache.hadoop.mapred.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:131)
  at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1184)
  at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply(PairRDDFunctions.scala:1161)
  at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply(PairRDDFunctions.scala:1161)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
  at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
  at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:1161)
  at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply$mcV$sp(PairRDDFunctions.scala:10
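
The exception above is raised because the output directory is left over from an earlier run and ```saveAsTextFile``` refuses to write into an existing directory. One possible workaround, sketched here under the assumption that the old output is safe to discard, is to delete the directory through the Hadoop ```FileSystem``` API before saving. The directory name ```numbers-sketch``` is a hypothetical stand-in chosen for this example.

```scala
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.SparkSession

// In spark-shell a session already exists; getOrCreate() simply returns it.
val spark = SparkSession.builder().master("local[*]").appName("overwrite-sketch").getOrCreate()
val sc = spark.sparkContext

// Hypothetical output location, used only in this sketch.
val outputDir = new Path("numbers-sketch")

// Resolve the file system the path belongs to (local disk, HDFS, ...).
val fs = outputDir.getFileSystem(sc.hadoopConfiguration)

// Remove a previous run's output, if any; `true` deletes recursively.
if (fs.exists(outputDir)) fs.delete(outputDir, true)

// The save now succeeds because the target directory no longer exists.
val rdd = sc.parallelize(Array(1, 2, 2, 4), 4)
rdd.saveAsTextFile(outputDir.toString)
```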

In [9]:
val rdd2 = sc.textFile("numbers")
rdd2.collect()

Array(1, 2, 4, 2)

- ```foreach``` method can be used in conjunction with ```println``` to print out each value in the array on its own line:

In [15]:
head.foreach(println)

"id_1","id_2","cmp_fname_c1","cmp_fname_c2","cmp_lname_c1","cmp_lname_c2","cmp_sex","cmp_bd","cmp_bm","cmp_by","cmp_plz","is_match"
6698,40542,1,1,1,?,1,1,1,1,1,TRUE
45037,49220,1,?,1,?,1,1,1,1,1,TRUE
31835,69902,1,?,1,1,1,1,1,1,1,TRUE
4356,31352,0.875,?,1,?,1,1,1,1,1,TRUE
45723,49837,1,?,1,?,1,1,1,1,1,TRUE
39716,49297,1,?,1,?,1,1,1,1,1,TRUE
71970,71971,1,?,1,?,1,1,1,1,1,TRUE
96601,96625,1,?,1,?,1,1,1,1,1,TRUE
28553,71491,1,?,1,?,1,1,1,1,1,TRUE




- examining the data, we see a header row that we might want to remove

In [16]:
def isHeader(line: String) = line.contains("id_1")

isHeader: (line: String)Boolean


In [18]:
// two equivalent ways to drop the header line from the local array
head.filterNot(isHeader)
head.filter(x => !isHeader(x))

[6698,40542,1,1,1,?,1,1,1,1,1,TRUE, 45037,49220,1,?,1,?,1,1,1,1,1,TRUE, 31835,69902,1,?,1,1,1,1,1,1,1,TRUE, 4356,31352,0.875,?,1,?,1,1,1,1,1,TRUE, 45723,49837,1,?,1,?,1,1,1,1,1,TRUE, 39716,49297,1,?,1,?,1,1,1,1,1,TRUE, 71970,71971,1,?,1,?,1,1,1,1,1,TRUE, 96601,96625,1,?,1,?,1,1,1,1,1,TRUE, 28553,71491,1,?,1,?,1,1,1,1,1,TRUE]

### Shipping Code from the Client to the Cluster
- we can interactively develop and debug data-munging code against a small amount of data that we sample from the cluster before applying to the entire data set when we're ready to transform it

In [19]:
val noheader = rawblocks.filter(x => !isHeader(x))

noheader = MapPartitionsRDD[6] at filter at <console>:32


MapPartitionsRDD[6] at filter at <console>:32

In [20]:
noheader.first

6698,40542,1,1,1,?,1,1,1,1,1,TRUE

### From RDDs to Data Frames

- Spark's ```DataFrame``` is an abstraction built on top of RDDs for data sets with regular structure
    - each row is made up of a set of columns, and each column has well-defined data type
    - basically the Spark analogue of a table in a relational database
    - differ from Python's ```pandas.DataFrame``` in that they represent distributed data sets on a cluster, instead of local data
- ```SparkSession``` is a wrapper around the ```SparkContext``` object
- can create a Data Frame with the ```csv``` method on ```SparkSession```'s Reader API

In [21]:
val prev = spark.read.csv("linkage1")

prev = [_c0: string, _c1: string ... 10 more fields]


[_c0: string, _c1: string ... 10 more fields]

In [22]:
prev.show()

+-----+-----+------------+------------+------------+------------+-------+------+------+------+-------+--------+
|  _c0|  _c1|         _c2|         _c3|         _c4|         _c5|    _c6|   _c7|   _c8|   _c9|   _c10|    _c11|
+-----+-----+------------+------------+------------+------------+-------+------+------+------+-------+--------+
| id_1| id_2|cmp_fname_c1|cmp_fname_c2|cmp_lname_c1|cmp_lname_c2|cmp_sex|cmp_bd|cmp_bm|cmp_by|cmp_plz|is_match|
| 3148| 8326|           1|           ?|           1|           ?|      1|     1|     1|     1|      1|    TRUE|
|14055|94934|           1|           ?|           1|           ?|      1|     1|     1|     1|      1|    TRUE|
|33948|34740|           1|           ?|           1|           ?|      1|     1|     1|     1|      1|    TRUE|
|  946|71870|           1|           ?|           1|           ?|      1|     1|     1|     1|      1|    TRUE|
|64880|71676|           1|           ?|           1|           ?|      1|     1|     1|     1|      1|  

- Spark can do some data processing while parsing, like inferring column names from a header, recognizing null values, and inferring the data types of each column


In [23]:
val parsed = spark.read.
    option("header", "true").
    option("nullValue", "?").
    option("inferSchema", "true").
    csv("linkage1")



parsed = [id_1: int, id_2: int ... 10 more fields]


[id_1: int, id_2: int ... 10 more fields]

In [24]:
parsed.show()

+-----+-----+------------+------------+------------+------------+-------+------+------+------+-------+--------+
| id_1| id_2|cmp_fname_c1|cmp_fname_c2|cmp_lname_c1|cmp_lname_c2|cmp_sex|cmp_bd|cmp_bm|cmp_by|cmp_plz|is_match|
+-----+-----+------------+------------+------------+------------+-------+------+------+------+-------+--------+
| 3148| 8326|         1.0|        null|         1.0|        null|      1|     1|     1|     1|      1|    true|
|14055|94934|         1.0|        null|         1.0|        null|      1|     1|     1|     1|      1|    true|
|33948|34740|         1.0|        null|         1.0|        null|      1|     1|     1|     1|      1|    true|
|  946|71870|         1.0|        null|         1.0|        null|      1|     1|     1|     1|      1|    true|
|64880|71676|         1.0|        null|         1.0|        null|      1|     1|     1|     1|      1|    true|
|25739|45991|         1.0|        null|         1.0|        null|      1|     1|     1|     1|      1|  

- we can examine the schema of the ```parsed``` Data Frame with ```printSchema```
    - each ```StructField``` contains the name of the column, the most specific data type that could handle the type of data contained in each record, and a boolean field that indicates whether the column may contain null values
    - to do this, Spark does _two_ passes over the data set: one pass to figure out column types, and a second pass to do the actual parsing
    - if schema is known in advance, can create instance of ```org.apache.spark.sql.types.StructType``` and pass to Reader API via ```schema``` function, possibly saving significant resources when the data set is very large
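
As a rough sketch of the last point, a schema can be declared up front and handed to the reader so that no inference pass is needed. The column names and types below mirror the inferred schema of the linkage data. To keep the example self-contained, it writes a three-line sample file (the header plus two rows from the data set) to a temporary directory, which is an assumption of this sketch; in the notebook you would pass ```"linkage1"``` to ```csv``` instead.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{BooleanType, DoubleType, IntegerType, StructField, StructType}

// In spark-shell a session already exists; getOrCreate() simply returns it.
val spark = SparkSession.builder().master("local[*]").appName("schema-sketch").getOrCreate()

// Tiny sample file standing in for the real linkage1 directory.
val sampleCsv =
  """|"id_1","id_2","cmp_fname_c1","cmp_fname_c2","cmp_lname_c1","cmp_lname_c2","cmp_sex","cmp_bd","cmp_bm","cmp_by","cmp_plz","is_match"
     |6698,40542,1,1,1,?,1,1,1,1,1,TRUE
     |4356,31352,0.875,?,1,?,1,1,1,1,1,TRUE
     |""".stripMargin
val sampleFile = Files.createTempDirectory("linkage-sketch").resolve("sample.csv")
Files.write(sampleFile, sampleCsv.getBytes(StandardCharsets.UTF_8))

// Explicit schema: every field is nullable by default.
val linkageSchema = StructType(
  Seq("id_1", "id_2").map(StructField(_, IntegerType)) ++
  Seq("cmp_fname_c1", "cmp_fname_c2", "cmp_lname_c1", "cmp_lname_c2").map(StructField(_, DoubleType)) ++
  Seq("cmp_sex", "cmp_bd", "cmp_bm", "cmp_by", "cmp_plz").map(StructField(_, IntegerType)) ++
  Seq(StructField("is_match", BooleanType))
)

// No inferSchema option: Spark parses the file using the supplied types.
val parsedWithSchema = spark.read.
  option("header", "true").
  option("nullValue", "?").
  schema(linkageSchema).
  csv(sampleFile.toString)

parsedWithSchema.printSchema()
```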

In [25]:
parsed.printSchema()

root
 |-- id_1: integer (nullable = true)
 |-- id_2: integer (nullable = true)
 |-- cmp_fname_c1: double (nullable = true)
 |-- cmp_fname_c2: double (nullable = true)
 |-- cmp_lname_c1: double (nullable = true)
 |-- cmp_lname_c2: double (nullable = true)
 |-- cmp_sex: integer (nullable = true)
 |-- cmp_bd: integer (nullable = true)
 |-- cmp_bm: integer (nullable = true)
 |-- cmp_by: integer (nullable = true)
 |-- cmp_plz: integer (nullable = true)
 |-- is_match: boolean (nullable = true)



- through ```DataFrameReader``` and ```DataFrameWriter``` APIs, Spark supports reading and writing data frames in a variety of formats
    - _json_ - similar functionality to CSV format
    - _parquet_ and _orc_ - columnar-oriented binary file formats
    - _jdbc_ - connects to relational database via JDBC data connection standard
    - _libsvm_ - popular text file format for representing labeled observations with sparse features
    - _text_ - maps each line of a file to a data frame with a single column of type ```String```
- access ```DataFrameReader``` API through ```read``` method on a ```SparkSession``` instance
    - load data from file using either combination of ```format``` and ```load``` methods, or one of the shortcuts for built-in formats
- to write out data, access ```DataFrameWriter``` via ```write``` method on any DataFrame Instance
- by default, Spark throws an error if you try to save a data frame to a path that already exists; control this behavior with the ```SaveMode``` enum, which has ```Overwrite```, ```Append```, and ```Ignore``` options

// val d1 = spark.read.format("json").load("file.json")
// val d2 = spark.read.json("file.json")
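
A minimal sketch of the writer side, including the ```SaveMode``` options mentioned above. The two-row DataFrame and the temporary output path are stand-ins invented for this example so that it does not depend on the linkage data.

```scala
import java.nio.file.Files
import org.apache.spark.sql.{SaveMode, SparkSession}

// In spark-shell a session already exists; getOrCreate() simply returns it.
val spark = SparkSession.builder().master("local[*]").appName("savemode-sketch").getOrCreate()
import spark.implicits._

// Tiny stand-in DataFrame (hypothetical columns).
val df = Seq((1, "a"), (2, "b")).toDF("id", "label")

// Hypothetical output path inside a fresh temporary directory.
val outPath = Files.createTempDirectory("savemode-sketch").resolve("out.parquet").toString

df.write.parquet(outPath)                          // first write: path does not exist yet
df.write.mode(SaveMode.Overwrite).parquet(outPath) // replaces the existing output
df.write.mode(SaveMode.Append).parquet(outPath)    // adds rows to the existing output
df.write.mode(SaveMode.Ignore).parquet(outPath)    // silently does nothing, since the path exists

// Read the result back through the DataFrameReader shortcut for Parquet.
val reloaded = spark.read.parquet(outPath)
println(reloaded.count())
```

Calling ```df.write.parquet(outPath)``` a second time without a mode would fail, because the default mode is to raise an error when the path exists.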

### Analyzing Data with the DataFrame API

- Every time we've processed the data set, Spark has re-opened the file, reparsed the rows, and then performed the requested action
- Instead of doing this, we can save the data in its parsed form on the cluster
- we can accomplish this via the ```cache``` method on the Data Frame instance
- ```cache``` call indicates that contents of DataFrame should be stored in memory the next time it's computed
    - so in this example, the call to ```count``` does the re-opening, reparsing, and action (counting)
    - the call to ```take``` accesses the cached data instead

In [26]:
parsed.cache()

[id_1: int, id_2: int ... 10 more fields]

In [27]:
parsed.count()



5749132

In [28]:
parsed.take(10)

0,1,2,3,4,5,6,7,8,9,10,11
3148,8326,1.0,,1.0,,1,1,1,1,1,True
14055,94934,1.0,,1.0,,1,1,1,1,1,True
33948,34740,1.0,,1.0,,1,1,1,1,1,True
946,71870,1.0,,1.0,,1,1,1,1,1,True
64880,71676,1.0,,1.0,,1,1,1,1,1,True
25739,45991,1.0,,1.0,,1,1,1,1,1,True
62415,93584,1.0,,1.0,,1,1,1,1,0,True
27995,31399,1.0,,1.0,,1,1,1,1,1,True
4909,12238,1.0,,1.0,,1,1,1,1,1,True
15161,16743,1.0,,1.0,,1,1,1,1,1,True


- ```StorageLevel``` values indicate where data should be stored
    - for example, on an RDD ```cache()``` is shorthand for ```persist(StorageLevel.MEMORY_ONLY)```, which stores rows as unserialized Java objects (on a DataFrame, ```cache()``` defaults to ```MEMORY_AND_DISK```)
    - if a partition is estimated to not fit in memory, Spark will simply not store it and will recompute it the next time it's needed
    - ```MEMORY_ONLY``` level makes most sense when objects are referenced frequently or require low-latency access
- ```MEMORY_ONLY_SER``` - allocates large byte buffers in memory and serializes the records into them, taking up less space
- ```MEMORY_AND_DISK``` and ```MEMORY_AND_DISK_SER``` will store partitions that don't fit in memory on the disk
- both RDDs and DataFrames can cache data, but the knowledge of the data gained through a DataFrame's schema allow for far more efficient storage
- data should be cached when it is likely to be referenced by multiple actions, is relatively small compared to available memory/disk, and is expensive to regenerate
- the RDD underlying a DataFrame is made up of ```org.apache.spark.sql.Row``` instances, which have accessor methods for getting values by index position, as well as the ```getAs[T]``` method, allowing us to look up fields of a given type by their name

In [29]:
parsed.rdd.map(_.getAs[Boolean]("is_match")).
    countByValue()



Map(true -> 20931, false -> 5728201)

- problems with ```countByValue``` -> we only want to use this when we know there are only a few distinct values in the data set
    - otherwise, we would want to use a function that won't return results to client, like ```reduceByKey```
- if we need the results for a subsequent distributed computation, we have to ship the data back to the cluster with the ```parallelize``` method

In [21]:
// reduceByKey example
val letters = sc.parallelize(Array(("a", 1), ("a", 1), ("c", 2)))
letters.reduceByKey((a, b) => a + b).collect()

letters = ParallelCollectionRDD[98] at parallelize at <console>:31


[(a,2), (c,2)]

In [22]:
val res = parsed.
    groupBy("is_match").
    count().
    orderBy(col("count").desc)

res = [is_match: boolean, count: bigint]


[is_match: boolean, count: bigint]

In [19]:
res.show()

+--------+-------+                                                              
|is_match|  count|
+--------+-------+
|   false|5728201|
|    true|  20931|
+--------+-------+



#### DataFrame Aggregation Functions

- other, more complex aggregations like sums, mins, maxes, means, and standard deviations can be computed using the ```agg``` method of DataFrame
    - these functions are located in ```org.apache.spark.sql.functions``` package


In [26]:
import org.apache.spark.sql.functions.avg
import org.apache.spark.sql.functions.stddev

parsed.agg(avg(col("cmp_sex")), stddev(col("cmp_sex"))).show()

+-----------------+--------------------+                                        
|     avg(cmp_sex)|stddev_samp(cmp_sex)|
+-----------------+--------------------+
|0.955001381078048|  0.2073011111689795|
+-----------------+--------------------+



- functions are similar to components of SQL queries; we can actually treat any DataFrame as a database table and express queries using SQL syntax
- we can create a temporary SQL table with the Spark SQL engine by the ```createOrReplaceTempView``` function from the DataFrame API

In [27]:
parsed.createOrReplaceTempView("linkage")

In [29]:
// triple quotes are part of Scala; allow us to write multiline quotes
spark.sql("""
    SELECT is_match, COUNT(*) cnt
    FROM linkage
    GROUP BY is_match
    ORDER BY cnt DESC
    """).show()

+--------+-------+
|is_match|    cnt|
+--------+-------+
|   false|5728201|
|    true|  20931|
+--------+-------+



- Spark SQL vs. DataFrame API
    - SQL is very familiar and expressive for simple queries, and is the best way to quickly read and filter data stored in columnar file formats
    - DataFrame API shines in complex, multistage analyses

### Fast Summary Statistics for DataFrames

- compute the count, min, max, mean, and stddev of all non-null values in the numerical columns of a data frame with ```describe``` (same name as in Pandas)

In [31]:
val summary = parsed.describe()



summary = [summary: string, id_1: string ... 10 more fields]


[summary: string, id_1: string ... 10 more fields]

- one column for each variable in the ```parsed``` DataFrame, plus an additional column named ```summary``` that indicates which metric is present in the rest of the columns of the row
- use the ```select``` method to choose subset of columns you want to read

In [35]:
summary.select("summary", "cmp_fname_c1", "cmp_fname_c2").show()

+-------+------------------+------------------+
|summary|      cmp_fname_c1|      cmp_fname_c2|
+-------+------------------+------------------+
|  count|           5748125|            103698|
|   mean|0.7129024704436274|0.9000176718903216|
| stddev|0.3887583596162788|0.2713176105782331|
|    min|               0.0|               0.0|
|    max|               1.0|               1.0|
+-------+------------------+------------------+



- to understand correlations between each feature and the value of the ```is_match``` column, we might try computing these summary statistics for the rows of the DataFrame that correspond to matches, and then for nonmatches
- use ```where``` method with SQL-style syntax on DataFrames, or can use ```Column``` objects

In [37]:
// SQL-style syntax
val matches = parsed.where("is_match = true")
val matchSummary = matches.describe()

[Stage 54:>                                                         (0 + 4) / 5]

matches = [id_1: int, id_2: int ... 10 more fields]
matchSummary = [summary: string, id_1: string ... 10 more fields]


[summary: string, id_1: string ... 10 more fields]

In [40]:
matchSummary.select("summary", "cmp_fname_c1", "cmp_fname_c2").show()

+-------+--------------------+-------------------+
|summary|        cmp_fname_c1|       cmp_fname_c2|
+-------+--------------------+-------------------+
|  count|               20922|               1333|
|   mean|  0.9973163859635038| 0.9898900320318174|
| stddev|0.036506675848336785|0.08251973727615237|
|    min|                 0.0|                0.0|
|    max|                 1.0|                1.0|
+-------+--------------------+-------------------+



In [41]:
// Column object syntax
// "filter" and "where" methods are the same method
val misses = parsed.filter(col("is_match") === false)
val missSummary = misses.describe()



misses = [id_1: int, id_2: int ... 10 more fields]
missSummary = [summary: string, id_1: string ... 10 more fields]


[summary: string, id_1: string ... 10 more fields]

In [43]:
missSummary.select("summary", "cmp_fname_c1", "cmp_fname_c2").show()

+-------+-------------------+------------------+
|summary|       cmp_fname_c1|      cmp_fname_c2|
+-------+-------------------+------------------+
|  count|            5727203|            102365|
|   mean| 0.7118634802174252|0.8988473514090173|
| stddev|0.38908060096985714|0.2727209029401023|
|    min|                0.0|               0.0|
|    max|                1.0|               1.0|
+-------+-------------------+------------------+



- we want to pivot the DataFrame so that the summary metrics become the columns and the original features become the rows
- "wide" form - rows of metrics and columns of variables
- "long" form - rows consisting of one metric, one variable, and the value of that metric/variable pair
- to convert from wide form to long form, can use the ```flatMap``` function
    - takes a function argument that processes each input record and returns a sequence of zero or more output records
- need the ```schema``` object of the ```DataFrame```
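
As a rough local illustration of the wide-to-long reshaping, here is ```flatMap``` on plain Scala collections rather than on a Spark DataFrame. The two "wide" rows reuse the min/max values shown for ```cmp_fname_c1``` and ```cmp_fname_c2``` above; the variable names are made up for this sketch.

```scala
// Variable names, in the same order as the values in each wide row.
val fields = Seq("cmp_fname_c1", "cmp_fname_c2")

// Wide form: one row per metric, one (string) value per variable.
val wideRows = Seq(
  ("min", Seq("0.0", "0.0")),
  ("max", Seq("1.0", "1.0"))
)

// flatMap emits one (metric, field, value) record per variable in each wide row,
// so two wide rows with two variables each become four long rows.
val longRows: Seq[(String, String, Double)] = wideRows.flatMap { case (metric, values) =>
  fields.zip(values).map { case (field, v) => (metric, field, v.toDouble) }
}

longRows.foreach(println)
```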

In [44]:
summary.printSchema()

root
 |-- summary: string (nullable = true)
 |-- id_1: string (nullable = true)
 |-- id_2: string (nullable = true)
 |-- cmp_fname_c1: string (nullable = true)
 |-- cmp_fname_c2: string (nullable = true)
 |-- cmp_lname_c1: string (nullable = true)
 |-- cmp_lname_c2: string (nullable = true)
 |-- cmp_sex: string (nullable = true)
 |-- cmp_bd: string (nullable = true)
 |-- cmp_bm: string (nullable = true)
 |-- cmp_by: string (nullable = true)
 |-- cmp_plz: string (nullable = true)



- since every field is a string, we need to convert values from strings to doubles
- output should be data frame w/ three columns: name of the metric, name of the column, and Double value of summary statistic for that column

In [49]:
val schema = summary.schema
val longForm = summary.flatMap(row => {
    val metric = row.getString(0)
    (1 until row.size).map(i => {
        (metric, schema(i).name, row.getString(i).toDouble)
    })
})

schema = StructType(StructField(summary,StringType,true), StructField(id_1,StringType,true), StructField(id_2,StringType,true), StructField(cmp_fname_c1,StringType,true), StructField(cmp_fname_c2,StringType,true), StructField(cmp_lname_c1,StringType,true), StructField(cmp_lname_c2,StringType,true), StructField(cmp_sex,StringType,true), StructField(cmp_bd,StringType,true), StructField(cmp_bm,StringType,true), StructField(cmp_by,StringType,true), StructField(cmp_plz,StringType,true))
longForm = [_1: string, _2: string ... 1 more field]


[_1: string, _2: string ... 1 more field]

In [52]:
longForm.show()

+-----+------------+-------------------+
|   _1|          _2|                 _3|
+-----+------------+-------------------+
|count|        id_1|          5749132.0|
|count|        id_2|          5749132.0|
|count|cmp_fname_c1|          5748125.0|
|count|cmp_fname_c2|           103698.0|
|count|cmp_lname_c1|          5749132.0|
|count|cmp_lname_c2|             2464.0|
|count|     cmp_sex|          5749132.0|
|count|      cmp_bd|          5748337.0|
|count|      cmp_bm|          5748337.0|
|count|      cmp_by|          5748337.0|
|count|     cmp_plz|          5736289.0|
| mean|        id_1|  33324.48559643438|
| mean|        id_2|  66587.43558331935|
| mean|cmp_fname_c1| 0.7129024704436274|
| mean|cmp_fname_c2| 0.9000176718903216|
| mean|cmp_lname_c1| 0.3156278193084133|
| mean|cmp_lname_c2|0.31841283153174377|
| mean|     cmp_sex|  0.955001381078048|
| mean|      cmp_bd|0.22446526708507172|
| mean|      cmp_bm|0.48885529849763504|
+-----+------------+-------------------+
only showing top

- ```flatMap```, in general, operates on one argument and returns a sequence of zero or more values
    - in this case, flatMap operates on a row (corresponding to each metric), and returns a sequence of tuples of the form (_metric_, _field_, _value_)
- ```toDouble``` is an example of an _implicit conversion_
    - ```java.lang.String``` does not have a ```toDouble``` method, so Scala will try to convert String into a class that does have one
    - in this case, the ```StringOps``` class has a ```toDouble``` method, and Scala provides an implicit method that can convert ```String``` to ```StringOps```
    - so compiler converts ```String``` -> ```StringOps``` -> ```Double```
- _implicit type conversion_ enhances functionality of core classes like ```String``` that are otherwise unmodifiable
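
The same mechanism can be sketched with a user-defined conversion. ```PercentOps```, ```stringToPercentOps```, and ```toFraction``` are hypothetical names invented for this example; they only mimic the role that ```StringOps``` plays for ```toDouble``` and are not part of the Scala library.

```scala
import scala.language.implicitConversions

// Hypothetical wrapper class that adds a method String itself does not have.
class PercentOps(s: String) {
  // Parses strings such as "87.5%" into a fraction.
  def toFraction: Double = s.stripSuffix("%").toDouble / 100.0
}

// Implicit conversion the compiler can insert when a String is asked for toFraction.
implicit def stringToPercentOps(s: String): PercentOps = new PercentOps(s)

// Compiles as stringToPercentOps("87.5%").toFraction
val fraction = "87.5%".toFraction
println(fraction)
```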

In [54]:
longForm.getClass

class org.apache.spark.sql.Dataset

- ```longForm``` is of type ```Dataset[T]```, which generalizes ```DataFrame``` to be able to handle more data types than just instances of the ```Row``` class
- convert ```Dataset``` back to DataFrame through ```toDF``` method

In [58]:
val longDF = longForm.toDF("metric", "field", "value")

longDF = [metric: string, field: string ... 1 more field]


[metric: string, field: string ... 1 more field]

- we can transform from long form to wide form by using the ```groupBy``` operator on the column we want to use as the pivot table's row, followed by the ```pivot``` operator on the column we want to use as the pivot table's column
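
A minimal sketch of that ```groupBy``` / ```pivot``` combination, assuming a small long-form DataFrame with the same ```metric```, ```field```, and ```value``` columns as ```longDF```. The four rows are a stand-in built from the min/max values shown earlier, and aggregating with ```first``` is one reasonable choice here because each (field, metric) pair has exactly one value.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.first

// In spark-shell a session already exists; getOrCreate() simply returns it.
val spark = SparkSession.builder().master("local[*]").appName("pivot-sketch").getOrCreate()
import spark.implicits._

// Small long-form stand-in for longDF.
val longSketch = Seq(
  ("min", "cmp_fname_c1", 0.0),
  ("max", "cmp_fname_c1", 1.0),
  ("min", "cmp_fname_c2", 0.0),
  ("max", "cmp_fname_c2", 1.0)
).toDF("metric", "field", "value")

// Rows come from "field", columns from the distinct "metric" values.
// Listing the pivot values explicitly spares Spark a pass to discover them.
val wideSketch = longSketch.
  groupBy("field").
  pivot("metric", Seq("min", "max")).
  agg(first("value"))

wideSketch.show()
```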