# Preamble

In [None]:
import $ivy.`org.apache.spark::spark-sql:2.4.5` 
import $ivy.`sh.almond::almond-spark:0.4.0`

//import org.apache.spark._
import org.apache.spark.sql.{NotebookSparkSession, SparkSession}
import org.apache.spark.sql.{functions => func, _}
import org.apache.spark.sql.types._

val spark = NotebookSparkSession
      .builder()
      .config("spark.sql.join.preferSortMergeJoin", false)
      .config("spark.sql.shuffle.partitions", 64)
      .master("local[*]")
      .getOrCreate()

import spark.implicits._

import org.slf4j.LoggerFactory
import org.apache.log4j.{Level, Logger}

Logger.getRootLogger().setLevel(Level.ERROR)

def run[A](code: => A): A = {
    val start = System.currentTimeMillis()
    val res = code
    println(s"Took ${System.currentTimeMillis() - start} ms")
    res
}

# On `DataFrame`s

We can create datasets from external data sources in different formats, e.g. JSON, Parquet, CSV, etc.

###### To read data in other formats, use `.read.<format>("file.<format>")`, for example `.read.csv("file.csv")`
###### The `multiline` option allows a single JSON record to span several lines

In [None]:
val data: DataFrame = spark.read.option("multiline", "true").json("D:/TFGAlvaroSanchez/data2/2916A(Vitigudino)-2018.json")

Note that we created a `DataFrame`, not a `Dataset`. Dataframes are like datasets, i.e. programs to generate distributed data sets, but *dynamically typed*. This means that, at compile time, Scala only knows that a dataframe consists of `Row`s.

In [None]:
data.collect
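To make "dynamically typed" concrete, here is a minimal sketch that uses a small in-memory `DataFrame` instead of the weather file. The sample values and the column name `no_such_column` are made up for illustration. Fields of a `Row` are looked up by name at run time, so a misspelled name still compiles.

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import scala.util.Try

// getOrCreate() reuses an already running session (such as the notebook's) if there is one.
val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical two-column sample, not the real weather data.
val sample: DataFrame = Seq(("2018-1", "21.5"), ("2018-2", "19.0")).toDF("fecha", "ta_max")

// Statically, all we know is that we hold a Row; the field is resolved by name at run time.
val first: Row = sample.collect().head
val fecha: String = first.getAs[String]("fecha")

// This also compiles, but the lookup fails at run time because the field does not exist.
val missing: Try[String] = Try(first.getAs[String]("no_such_column"))
```

Wrapping the second lookup in `Try` is only a convenience to keep the sketch from aborting; the point is that the compiler accepts both lookups.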

In fact, `DataFrame` is defined as a type alias of `Dataset[Row]`:

In [None]:
val dataDS: Dataset[Row] = data

But the type of the information to be processed is there! 

In [None]:
data.schema
data.printSchema

and we can convert a dataframe into a dataset: 

###### With `.as[Class]` we turn the DataFrame into a Dataset

In [None]:
org.apache.spark.sql.catalyst.encoders.OuterScopes.addOuterScope(this)

case class Data(fecha: String, indicativo: String, p_max: String, hr: String, inso: String, q_max: String,
                nw_55: String, q_mar: String, q_med: String, tm_min: String, ta_max: String,
                ts_min: String, nt_30: String, w_racha: String, np_100: String, p_sol: String, nw_91: String, np_001: String,
                ta_min: String, w_rec: String, e: String, np_300: String, p_mes: String, w_med: String,
                nt_00: String, ti_max: String, tm_mes: String, tm_max: String, q_min: String, np_010: String)

val dataDs: Dataset[Data] = data.as[Data]

In [None]:
dataDs.show
data.show

# Untyped transformations

The `Dataset` API includes a section on _untyped transformations_. These are transformations that are not defined over the Scala types but over the inner Spark SQL types (i.e. `StructType`s). More exactly, these could be named *dynamically typed transformations*.

These transformations are in close correspondence with their SQL counterparts: `SELECT`, `WHERE`, `GROUP BY`, `FROM`, etc.

### The `select` transformation

For instance, the untyped equivalent of the typed `map` transformation is `select`:

In [None]:
val ds: Dataset[String] = dataDs.map(_.ta_max)
ds.collect
ds.show
ds.explain

In [None]:
val df: DataFrame = 
    spark.read.option("multiline", "true").json("D:/TFGAlvaroSanchez/data2/2916A(Vitigudino)-2018.json").select($"ta_max")
df.collect
df.show
df.schema

###### Note that we lost the column label (`ta_max`) in the Dataset transformation. This does not happen with `select` (DataFrame). Moreover, we have more control over the resulting schema:


###### `substring()` -> `substring(start, end)`
###### `substr()` -> `substr(start, length)`

In [None]:
dataDs.schema
dataDS.map(t => (t(1).toString, t(20).toString.substring(0,4), t(1).toString.substring(5,6)))
    .show

In [None]:
data.select($"fecha", $"ta_max".substr(0,4) as "temperatura max", $"fecha".substr(6,1) as "mes")
    .show
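The two slicing methods used in the cells above follow different conventions, which is easy to get wrong. The sketch below contrasts them on a made-up value shaped like `fecha` (the string `2018-12` is an assumption, not taken from the data): Scala's `String.substring` is 0-based with an exclusive end index, whereas `Column.substr` takes a 1-based start position and a length.

```scala
import org.apache.spark.sql.{Row, SparkSession}

// getOrCreate() reuses an already running session (such as the notebook's) if there is one.
val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical value with the same year-month shape as the `fecha` column.
val fecha = "2018-12"

// String.substring(begin, end): 0-based, `end` is exclusive.
val yearScala: String = fecha.substring(0, 4)
val monthScala: String = fecha.substring(5, 7)

// Column.substr(startPos, len): 1-based start position followed by a length.
val sliced: Row = Seq(fecha).toDF("fecha")
  .select($"fecha".substr(1, 4) as "year", $"fecha".substr(6, 2) as "month")
  .head
```

Both versions extract the same year and month; only the index arithmetic differs.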

The [org.apache.spark.sql.functions](https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$) object contains dozens of column operators.

Note that the _untyped_ (or, more properly, _dynamically typed_) character of these transformations means that the Scala compiler won't complain if we choose a non-existent column:

###### The Scala compiler will not complain if we choose a non-existent column; the error is reported at run time

In [None]:
lazy val df: DataFrame = spark.read.option("multiline", "true").json("D:/TFGAlvaroSanchez/data2/2916A(Vitigudino)-2018.json").select($"nam")

The error will be shown at runtime: 

In [None]:
df

On the contrary, the error in the dataset transformation manifests at compile-time:

###### With the Dataset, the error is reported at compile time

In [None]:
dataDs.map(_.nam)
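The run-time failure of the untyped version can also be observed programmatically. The following is a sketch on a made-up in-memory `DataFrame` (the sample row and the result labels are illustrative assumptions). Spark rejects the unknown column while analysing the query, that is, as soon as the `DataFrame` is defined and before any data is read, which is why the notebook defers the definition with `lazy val`.

```scala
import org.apache.spark.sql.{AnalysisException, SparkSession}
import scala.util.{Failure, Success, Try}

// getOrCreate() reuses an already running session (such as the notebook's) if there is one.
val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical sample with the columns `name` and `age`.
val sample = Seq(("Ana", 30L)).toDF("name", "age")

// `nam` is a typo for `name`: the compiler accepts it, Spark's analyser does not.
val outcome: String = Try(sample.select($"nam")) match {
  case Success(_)                    => "accepted"
  case Failure(_: AnalysisException) => "rejected during analysis"
  case Failure(other)                => s"unexpected failure: $other"
}
```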

### The `filter` transformation

This is the untyped equivalent of the typed `filter` transformation:

###### In this case we first had to convert the `ta_max` column to integer type. We do this with `.withColumn("newColumnName", $"columnName".cast(type))`

In [None]:
val data1 = data.withColumn("ta_max", $"ta_max".substr(0,4).cast(IntegerType))
data1.filter($"ta_max" > 30)
    .show

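If the predicate compares values of mismatched types (below, the string column `fecha` against an integer), we won't even get a run-time exception: Spark casts the operands implicitly, and rows whose value cannot be cast are silently dropped: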

In [None]:
def df: DataFrame = 
    data.filter($"fecha" > 2001)

In [None]:
df.show
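As a sketch of why this kind of mistake goes unnoticed, the snippet below repeats the comparison on two made-up `fecha` values. It assumes Spark 2.4 with default settings, where the string is implicitly cast to an integer and an unparsable string becomes `null`. The second filter shows one possible way to state the intended comparison explicitly; it is an illustration, not the author's method.

```scala
import org.apache.spark.sql.SparkSession

// getOrCreate() reuses an already running session (such as the notebook's) if there is one.
val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical sample: `fecha` is a string column, as in the weather data.
val sample = Seq("2018-1", "2018-2").toDF("fecha")

// String vs. integer: no exception, but "2018-1" cannot be cast to an integer,
// so under the assumptions above the predicate is null and every row is dropped.
val matched: Long = sample.filter($"fecha" > 2001).count()

// Stating the intent explicitly: compare the year part as an integer.
val intended: Long = sample.filter($"fecha".substr(1, 4).cast("int") > 2001).count()
```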

### The `groupBy` transformation

###### We group the rows by the given column: rows with the same value in that column fall into the same group, and `count` reports how many rows each group contains

In [None]:
val students: DataFrame = spark.read.json("D:/GitHub/spark-intro/data/students.json")

In [None]:
students.groupBy($"degree").count.show
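Since the contents of `students.json` are not shown here, the following sketch reproduces the same `groupBy`/`count` pattern on a small made-up table (the names and degrees are assumptions). The result has one row per distinct `degree`, with the group size in a `count` column of type `Long`.

```scala
import org.apache.spark.sql.SparkSession

// getOrCreate() reuses an already running session (such as the notebook's) if there is one.
val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical students, standing in for the JSON file.
val sample = Seq(("Ana", "Maths"), ("Luis", "Physics"), ("Eva", "Maths")).toDF("name", "degree")

// One output row per distinct degree; `count` holds the number of grouped rows.
val counts: Map[String, Long] = sample
  .groupBy($"degree")
  .count()
  .collect()
  .map(row => row.getString(0) -> row.getLong(1))
  .toMap
```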

### `Join` transformations

We already discussed joins, but we didn't mention that the resulting type of a join is a dataframe, not a dataset: 

###### The result type of a join is a DataFrame, not a Dataset

In [None]:
org.apache.spark.sql.catalyst.encoders.OuterScopes.addOuterScope(this)

case class Student(name: String, degree: String)
case class Person(name: String, age: Long)

val people : DataFrame = spark.read.json("D:/GitHub/spark-intro/data/people.json") 
val peopleDs: Dataset[Person] = people.as[Person]

In [None]:
peopleDs.join(students.as[Student], "name")

###### Note that the result is a `DataFrame`: the `Dataset[Person]` type of `peopleDs` is lost
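If the element types need to survive the join, one option is `joinWith`, which returns a `Dataset` of pairs instead of a `DataFrame`. The sketch below is an illustration under stated assumptions: it uses tuples rather than the `Person` and `Student` case classes (so that it does not depend on the notebook's outer-scope registration), and the sample rows are made up.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

// getOrCreate() reuses an already running session (such as the notebook's) if there is one.
val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical data: (name, age) and (name, degree). Tuple columns are named _1 and _2.
val people: Dataset[(String, Long)] = Seq(("Ana", 30L), ("Luis", 25L)).toDS()
val students: Dataset[(String, String)] = Seq(("Ana", "Maths")).toDS()

// `join` forgets the element types: the result is a DataFrame of Rows.
val untyped: DataFrame =
  people.toDF("name", "age").join(students.toDF("name", "degree"), "name")

// `joinWith` keeps both element types and pairs up the matching elements (inner join by default).
val typed: Dataset[((String, Long), (String, String))] =
  people.joinWith(students, people("_1") === students("_1"))
```

The trade-off is that `joinWith` keeps both sides intact as a pair, so the join key appears twice.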

# The problems of `Dataset`s

Datasets are nice because they are type-safe but, unfortunately, they are less efficient than dataframes in several respects. This is best shown by reading from Parquet source files.

###### Datasets are nice because they are type-safe but, unfortunately, they are less efficient than DataFrames.

Parquet is a _columnar_ format, which means that it physically stores data column by column, allowing us to read only the data of a particular column without reading entire rows.

###### Parquet is a columnar format: data is physically stored column by column, so a single column can be read without reading whole rows.

In [None]:
people.write.mode("overwrite").parquet("D:/GitHub/spark-intro/data2/people.parquet")

In [None]:
spark.read.parquet("D:/GitHub/spark-intro/data2/people.parquet").schema

### The `ReadSchema` optimization

Let's create a program that simply reads the _name_ column of the people dataset:

In [None]:
val ds: Dataset[String] = 
    spark.read.parquet("D:/GitHub/spark-intro/data2/people.parquet").as[Person]
        .map(_.name)

which works as intended: 

In [None]:
ds.show

We have a problem, however: 

In [None]:
ds.explain

As we can see, the plan includes the directive `ReadSchema: struct<age:bigint,name:string>`, which means that the scan reads every column of the Parquet file. But we just want to read the names! We can write an optimal program using dataframes:

In [None]:
val df: DataFrame = 
    spark.read.parquet("D:/GitHub/spark-intro/data2/people.parquet").select($"name")

which works similarly: 

In [None]:
df.show

but more efficiently (note the value of the `ReadSchema` directive):

In [None]:
df.explain

We can empirically check that it actually works using the Spark UI. First, we create a Parquet file with enough rows and several columns:

In [None]:
import org.apache.spark.sql.functions.{lit, rand, round}
spark.range(0, 1000000)
    .select($"id" as "_1", lit(1) as "_2")
    .write.mode("overwrite").parquet("D:/GitHub/spark-intro/data/test")

Now, we read the second column using both datasets and dataframes, and check the Spark UI for the _Input Size_ field.

In [None]:
val test = spark.read.parquet("D:/GitHub/spark-intro/data/test")
test.as[Tuple2[Long, Int]].map(_._2).collect

Using dataframes, the input size is much lower, since we only read the second column:

In [None]:
test.select($"_2").collect
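Column pruning and a typed result are not mutually exclusive. One possible compromise, sketched below under stated assumptions, is to select the column first and convert to a `Dataset` afterwards, so that the projection stays visible to the optimizer. The temporary directory and the sample rows are made up so that the snippet does not depend on the paths used in this notebook.

```scala
import java.nio.file.Files
import org.apache.spark.sql.{Dataset, SparkSession}

// getOrCreate() reuses an already running session (such as the notebook's) if there is one.
val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical Parquet file in a fresh temporary directory.
val path = Files.createTempDirectory("people-sketch").resolve("people.parquet").toString
Seq(("Ana", 30L), ("Luis", 25L)).toDF("name", "age").write.parquet(path)

// Project with `select` first, then recover a static type with `as`.
val names: Dataset[String] = spark.read.parquet(path).select($"name").as[String]

// The ReadSchema entry of the printed plan should now mention only the `name` column.
names.explain()
```

The cost of this approach is that the column name is again a string checked only at run time; only the result type is static.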

### The `PushedFilters` optimization

Let's consider the following equivalent dataset and dataframe programs: 

In [None]:
val ds: Dataset[(Long, Int)] = 
    test.as[(Long, Int)]
        .filter(_._1 >= 999995)

val df: DataFrame = 
    test
        .filter($"_1" >= 999995)

Functionally, they are equivalent, but their performance differs significantly:

In [None]:
df.collect
ds.collect

The explanation of this difference lies in another optimization applied by the Spark SQL compiler: the so-called push-down filter optimization. In the previous `ReadSchema` optimization, we skipped certain columns of the dataset; now, we skip rows and read only the ones we are interested in (those that satisfy the predicate). We can check if the push-down filter optimization is actually applied by inspecting the query plan. 

In [None]:
df.explain
ds.explain
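The same comparison can be reproduced in a self-contained way. The sketch below writes a small made-up Parquet file to a temporary directory (the path, the size and the threshold `995` are assumptions) and builds both variants, plus a third one that filters with a column expression and only then converts to a typed `Dataset`. A Scala lambda is opaque to the optimizer, whereas a column expression can be handed to the Parquet reader.

```scala
import java.nio.file.Files
import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.functions.lit

// getOrCreate() reuses an already running session (such as the notebook's) if there is one.
val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical test file: 1000 rows with columns _1 (Long) and _2 (Int).
val path = Files.createTempDirectory("pushdown-sketch").resolve("test").toString
spark.range(0, 1000).select($"id" as "_1", lit(1) as "_2").write.parquet(path)
val sample = spark.read.parquet(path)

// Typed filter: the predicate is a Scala function that Spark cannot inspect.
val lambdaFilter: Dataset[(Long, Int)] = sample.as[(Long, Int)].filter(_._1 >= 995)

// Column filter followed by `as`: still a typed result, but the predicate is an expression.
val columnFilter: Dataset[(Long, Int)] = sample.filter($"_1" >= 995).as[(Long, Int)]

// Compare the PushedFilters entries of the two printed plans.
lambdaFilter.explain()
columnFilter.explain()
```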

### The `PartitionFilters` optimization

Let's create a test file with an additional column: 

In [None]:
spark.range(0, 1000000)
    .select($"id" as "_1", lit(1) as "_2", round(rand() * 10) mod lit(10) as "_3")
    .write.mode("overwrite").parquet("D:/GitHub/spark-intro/data/test")

In [None]:
val test: DataFrame = spark.read.parquet("D:/GitHub/spark-intro/data/test")

In [None]:
test.show

Let's suppose that we want to read data with value `_3` equal to `9.0`:

In [None]:
test.filter($"_3" === lit(9.0)).show

A pushed filter is generated, but it would be better if we could directly read just the rows with the exact value for the third column. We can achieve that as follows:

In [None]:
test.write.mode("overwrite").partitionBy("_3").parquet("D:/GitHub/spark-intro/data/test/testP")

As we can see, the Parquet output is split into ten partitions, one per distinct value of `_3`. Now, if we just want to process data with a particular key, Spark will generate an optimal query:

In [None]:
val testP: DataFrame = spark.read.parquet("D:/GitHub/spark-intro/data/test/testP")

In [None]:
testP.filter($"_3" === lit(9.0)).show

We can inspect the Spark UI to check that we read less data in the last action.
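Besides the Spark UI, the effect of `partitionBy` can be seen directly on disk. The sketch below is a simplified stand-in for the example above: it uses a temporary directory and a made-up key with only three values (`id % 3`), both of which are assumptions. Each distinct key gets its own sub-directory, and a filter on the partition column lets Spark skip the other directories.

```scala
import java.io.File
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// getOrCreate() reuses an already running session (such as the notebook's) if there is one.
val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical data: 100 rows whose key `_3` takes the values 0, 1 and 2.
val path = Files.createTempDirectory("partition-sketch").resolve("testP").toString
spark.range(0, 100)
  .select($"id" as "_1", ($"id" % 3) as "_3")
  .write.partitionBy("_3").parquet(path)

// One sub-directory per distinct key, named `_3=<value>`.
val partitionDirs: Seq[String] =
  new File(path).listFiles().filter(_.isDirectory).map(_.getName).sorted.toSeq

// Filtering on the partition column: the plan should list it under PartitionFilters.
val onlyKey2 = spark.read.parquet(path).filter($"_3" === 2)
onlyKey2.explain()
```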