<img align="right" width="200" height="200" src="https://static.tildacdn.com/tild6236-6337-4339-b337-313363643735/new_logo.png">

# Configuring, Monitoring, and Optimizing Spark
**Andrey Titov**  
tenke.iu8@gmail.com  

## In this session
+ Configuring the spark-submit environment
+ Working with the Spark UI
+ "Harmful" advice

## Configuring the spark-submit environment
The Spark distribution contains libraries, sample configuration files, and a set of utilities. Every Spark application is launched with one of these utilities:
- spark-shell - launches an interactive `Scala REPL` with Spark support
- pyspark - launches an interactive `python` shell with Spark support
- spark-submit - launches Spark applications packaged as a `jar` or `py` file with dependencies

### Structure of the Spark distribution
Download a release from the official site (this notebook uses Spark 2.4.5):
https://spark.apache.org/downloads.html

In [None]:
import sys.process._

println("wget https://apache-mirror.rbc.ru/pub/apache/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz -P lib/".!!)

Unpack the `.tgz` archive:

In [None]:
println("tar xvf lib/spark-2.4.5-bin-hadoop2.7.tgz -C lib/".!!)

The archive can now be deleted:

In [None]:
println("rm -f lib/spark-2.4.5-bin-hadoop2.7.tgz".!!)

Let's look at the contents of the unpacked distribution:

In [None]:
println("ls -alh lib/spark-2.4.5-bin-hadoop2.7".!!)

`spark-submit`, `pyspark`, and `spark-shell` live in the `bin` directory. Whenever you launch Spark jobs, these are the utilities used under the hood. They only need to be installed on the hosts from which applications are launched; ordinary cluster nodes do not need them.

In [None]:
println("ls -alh lib/spark-2.4.5-bin-hadoop2.7/bin".!!)

The `python/lib` directory holds the `python` dependencies. When you set up a development environment, there is no need to install pyspark into your venv: it is better to add these libraries from the distribution, which avoids possible version conflicts later.

In [None]:
println("ls -alh lib/spark-2.4.5-bin-hadoop2.7/python/lib".!!)

In [None]:
println("ls -alh lib/spark-2.4.5-bin-hadoop2.7/jars".!!)

These `jar` files contain the libraries, including the driver and executor classes:
- org.apache.spark.repl.Main
- org.apache.spark.deploy.SparkSubmit
- org.apache.spark.executor.CoarseGrainedExecutorBackend

```scala
object SparkSubmit extends CommandLineUtils with Logging {
  override def main(args: Array[String]): Unit
}
```

```scala
private[spark] object CoarseGrainedExecutorBackend extends Logging {
    def main(args: Array[String])
}
```

```scala
object Main extends Logging {
    def main(args: Array[String]): Unit
}
```

The distribution also includes libraries for working with Hadoop, with built-in data sources such as Parquet, and so on.

The `conf` directory contains templates of the configuration files:

In [None]:
println("ls -alh lib/spark-2.4.5-bin-hadoop2.7/conf".!!)

In [None]:
// YarnSchedulerBackend

**spark-defaults.conf**  
The main Spark configuration file. It specifies launch options, dependencies, the number of executors, and so on.

**spark-env.sh**  
A script that sets environment variables (for example, `HADOOP_CONF_DIR` or `YARN_CONF_DIR`).

**log4j.properties**  
Logger configuration: here you can set the required logging level for individual Spark components.

Let's look at `spark-submit --help`:

In [None]:
println("lib/spark-2.4.5-bin-hadoop2.7/bin/spark-submit --help".!!)

Option precedence (highest first):
- `spark-submit` parameters
- environment variables
- the `spark-defaults.conf` configuration file
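
The precedence list above can be pictured as a chain of fallbacks. The sketch below is plain Scala with no Spark dependency and is not how Spark itself resolves options; `ConfigPrecedence`, `resolve`, and the three maps are hypothetical names. It also assumes that all three sources have already been normalised to the same key names, which is a simplification, since environment variables and configuration properties are named differently in practice.

```scala
// Hypothetical helper: an illustration of the fallback order, not Spark's own code.
object ConfigPrecedence {
  /** Returns the first value found for `key`, checking sources from highest to lowest priority. */
  def resolve(
      key: String,
      submitArgs: Map[String, String],   // values passed to spark-submit
      environment: Map[String, String],  // values taken from environment variables
      defaultsFile: Map[String, String]  // values read from spark-defaults.conf
  ): Option[String] =
    submitArgs.get(key)
      .orElse(environment.get(key))
      .orElse(defaultsFile.get(key))
}
```

A key present in several sources resolves to the value from the highest-priority one; a key present in none of them yields `None`.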

### Conclusions:
- Hadoop is not required for Spark to work
- Spark has a large number of parameters that control how it runs. Most of them are described here: https://spark.apache.org/docs/latest/configuration.html
- The `spark-submit` utility can launch a Spark application either locally or on a cluster

## Working with the Spark UI

By default, every Spark application starts a UI that lets you inspect the state of the job and diagnose performance problems. The data shown in the UI is also available through a REST API.

The Spark UI URL can be obtained as follows:

In [1]:
val sparkUiUrl: Option[String] = spark.sparkContext.uiWebUrl

sparkUiUrl = Some(http://192.168.88.241:4040)


Some(http://192.168.88.241:4040)

In [2]:
import sys.process._
sparkUiUrl.foreach( x => println(s"curl -s $x/api/v1/applications".!!))

[ {
  "id" : "local-1634749158085",
  "name" : "Apache Toree",
  "attempts" : [ {
    "startTime" : "2021-10-20T16:59:16.972GMT",
    "endTime" : "1969-12-31T23:59:59.999GMT",
    "lastUpdated" : "2021-10-20T16:59:16.972GMT",
    "duration" : 0,
    "sparkUser" : "t3nq",
    "completed" : false,
    "appSparkVersion" : "2.4.5",
    "startTimeEpoch" : 1634749156972,
    "endTimeEpoch" : -1,
    "lastUpdatedEpoch" : 1634749156972
  } ]
} ]



In [3]:
val appId = spark.sparkContext.applicationId

appId = local-1634749158085


local-1634749158085

In [4]:
sparkUiUrl.foreach( x => println(s"curl -s $x/api/v1/applications/$appId".!!))

{
  "id" : "local-1634749158085",
  "name" : "Apache Toree",
  "attempts" : [ {
    "startTime" : "2021-10-20T16:59:16.972GMT",
    "endTime" : "1969-12-31T23:59:59.999GMT",
    "lastUpdated" : "2021-10-20T16:59:16.972GMT",
    "duration" : 0,
    "sparkUser" : "t3nq",
    "completed" : false,
    "appSparkVersion" : "2.4.5",
    "startTimeEpoch" : 1634749156972,
    "endTimeEpoch" : -1,
    "lastUpdatedEpoch" : 1634749156972
  } ]
}



The list of available endpoints is at: https://spark.apache.org/docs/latest/monitoring.html#rest-api
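
The same REST calls can be made without shelling out to `curl`. The following is a minimal sketch that uses only the Scala standard library; `SparkRestApi`, `endpoint`, and `fetch` are hypothetical helper names, and the resource names mentioned in the comments (`jobs`, `stages`, `executors`) are taken from the monitoring documentation linked above.

```scala
import scala.io.Source

// Hypothetical helpers for calling the Spark monitoring REST API.
object SparkRestApi {
  /** Builds an endpoint URL of the form <uiUrl>/api/v1/applications/<appId>/<resource>. */
  def endpoint(uiUrl: String, appId: String, resource: String): String =
    s"${uiUrl.stripSuffix("/")}/api/v1/applications/$appId/$resource"

  /** Downloads the response body as a string; requires a running Spark application. */
  def fetch(url: String): String = {
    val source = Source.fromURL(url)
    try source.mkString finally source.close()
  }
}

// Possible usage in this notebook, with `sparkUiUrl` and `appId` from the cells above:
// sparkUiUrl.foreach(u => println(SparkRestApi.fetch(SparkRestApi.endpoint(u, appId, "jobs"))))
```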

## Harmful advice
This section covers common mistakes made when working with data in the DataFrame API.

Let's read the [Airport Codes](https://datahub.io/core/airport-codes) dataset:

In [11]:
// This one is fine :)
val csvOptions = Map("header" -> "true", "inferSchema" -> "true")
val airports = spark.read.options(csvOptions).csv("/tmp/datasets/airport-codes.csv")
airports.printSchema
airports.show(numRows = 1, truncate = 100, vertical = true)

root
 |-- ident: string (nullable = true)
 |-- type: string (nullable = true)
 |-- name: string (nullable = true)
 |-- elevation_ft: integer (nullable = true)
 |-- continent: string (nullable = true)
 |-- iso_country: string (nullable = true)
 |-- iso_region: string (nullable = true)
 |-- municipality: string (nullable = true)
 |-- gps_code: string (nullable = true)
 |-- iata_code: string (nullable = true)
 |-- local_code: string (nullable = true)
 |-- coordinates: string (nullable = true)

-RECORD 0------------------------------------------
 ident        | 00A                                
 type         | heliport                           
 name         | Total Rf Heliport                  
 elevation_ft | 11                                 
 continent    | NA                                 
 iso_country  | US                                 
 iso_region   | US-PA                              
 municipality | Bensalem                           
 gps_code     | 00A                 

csvOptions = Map(header -> true, inferSchema -> true)
airports = [ident: string, type: string ... 10 more fields]


[ident: string, type: string ... 10 more fields]

In [9]:
val data = airports.groupBy('iso_country).count
data.cache()
data.count()

data = [iso_country: string, count: bigint]


243

In [10]:
data.unpersist()

[iso_country: string, count: bigint]

### Window functions

In [12]:
import org.apache.spark.sql.expressions._
import org.apache.spark.sql.functions._

val wnd = Window.partitionBy()

val ranked = airports.select('ident, 'iso_country, 'elevation_ft, 'type, count("*").over(wnd))
ranked.show(20, false)
ranked.explain

+-----+-----------+------------+-------------+-----------------------------------+
|ident|iso_country|elevation_ft|type         |count(1) OVER (unspecifiedframe$())|
+-----+-----------+------------+-------------+-----------------------------------+
|00A  |US         |11          |heliport     |56226                              |
|00AA |US         |3435        |small_airport|56226                              |
|00AK |US         |450         |small_airport|56226                              |
|00AL |US         |820         |small_airport|56226                              |
|00AR |US         |237         |closed       |56226                              |
|00AS |US         |1100        |small_airport|56226                              |
|00AZ |US         |3810        |small_airport|56226                              |
|00CA |US         |3038        |small_airport|56226                              |
|00CL |US         |87          |small_airport|56226                              |
|00C

wnd = org.apache.spark.sql.expressions.WindowSpec@449f17c7
ranked = [ident: string, iso_country: string ... 3 more fields]


[ident: string, iso_country: string ... 3 more fields]

In [13]:
import org.apache.spark.sql.expressions._
import org.apache.spark.sql.functions._

// val wnd = Window.partitionBy()
val count = airports.count

val ranked = airports.select('ident, 'iso_country, 'elevation_ft, 'type, lit(count).alias("count"))
ranked.show(20, false)
ranked.explain

+-----+-----------+------------+-------------+-----+
|ident|iso_country|elevation_ft|type         |count|
+-----+-----------+------------+-------------+-----+
|00A  |US         |11          |heliport     |56226|
|00AA |US         |3435        |small_airport|56226|
|00AK |US         |450         |small_airport|56226|
|00AL |US         |820         |small_airport|56226|
|00AR |US         |237         |closed       |56226|
|00AS |US         |1100        |small_airport|56226|
|00AZ |US         |3810        |small_airport|56226|
|00CA |US         |3038        |small_airport|56226|
|00CL |US         |87          |small_airport|56226|
|00CN |US         |3350        |heliport     |56226|
|00CO |US         |4830        |closed       |56226|
|00FA |US         |53          |small_airport|56226|
|00FD |US         |25          |heliport     |56226|
|00FL |US         |35          |small_airport|56226|
|00GA |US         |700         |small_airport|56226|
|00GE |US         |957         |heliport     |

count = 56226
ranked = [ident: string, iso_country: string ... 3 more fields]


[ident: string, iso_country: string ... 3 more fields]

In [25]:
val wnd = Window.partitionBy(col("one")).orderBy("elevation_ft")

wnd = org.apache.spark.sql.expressions.WindowSpec@2d6faf1a


org.apache.spark.sql.expressions.WindowSpec@2d6faf1a

In [37]:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.Dataset

def printItemPerPartition[T](ds: Dataset[T]): Unit = {
    ds.mapPartitions { x => Iterator(x.length) }
    .withColumnRenamed("value", "itemPerPartition")
    .show(300, false)
}

printItemPerPartition: [T](ds: org.apache.spark.sql.Dataset[T])Unit


In [48]:
import org.apache.spark.sql.expressions._
import org.apache.spark.sql.functions._

val ranked = airports.withColumn("one", lit(1))
                .select('ident, 'iso_country, 'elevation_ft, 'type, row_number().over(wnd))

ranked.cache
ranked.count

ranked.groupBy(spark_partition_id()).count.show
ranked.groupBy(spark_partition_id()).count.explain
// printItemPerPartition(ranked)
// ranked.explain
ranked.unpersist

+--------------------+-----+
|SPARK_PARTITION_ID()|count|
+--------------------+-----+
|                  43|56226|
+--------------------+-----+

== Physical Plan ==
*(2) HashAggregate(keys=[_nondeterministic#1397], functions=[count(1)])
+- Exchange hashpartitioning(_nondeterministic#1397, 200)
   +- *(1) HashAggregate(keys=[_nondeterministic#1397], functions=[partial_count(1)])
      +- *(1) Project [SPARK_PARTITION_ID() AS _nondeterministic#1397]
         +- InMemoryTableScan [ident#167, iso_country#172, elevation_ft#170, type#168, row_number() OVER (PARTITION BY one ORDER BY elevation_ft ASC NULLS FIRST unspecifiedframe$())#1269]
               +- InMemoryRelation [ident#167, iso_country#172, elevation_ft#170, type#168, row_number() OVER (PARTITION BY one ORDER BY elevation_ft ASC NULLS FIRST unspecifiedframe$())#1269], StorageLevel(disk, memory, deserialized, 1 replicas)
                     +- Window [row_number() windowspecdefinition(1, elevation_ft#170 ASC NULLS FIRST, specified

ranked = [ident: string, iso_country: string ... 3 more fields]


[ident: string, iso_country: string ... 3 more fields]

In [50]:
case class Distribution(id: Int, count: Long)

defined class Distribution


In [54]:
val data = airports.select('ident, 'iso_country, 'elevation_ft, 'type)

val counts = data.groupBy(spark_partition_id().alias("id")).count.as[Distribution].collect.toList
println(counts)

List(Distribution(1,17662), Distribution(0,38564))


data = [ident: string, iso_country: string ... 2 more fields]
counts = List(Distribution(1,17662), Distribution(0,38564))


List(Distribution(1,17662), Distribution(0,38564))

In [59]:
val arr_column = array(counts.map(x => struct(lit(x.id).alias("id"), lit(x.count).alias("count")).alias("foo")):_*)

val zoom = data.withColumn("pid", spark_partition_id()).withColumn("bar", arr_column)
zoom.show(2, false)

+-----+-----------+------------+-------------+---+------------------------+
|ident|iso_country|elevation_ft|type         |pid|bar                     |
+-----+-----------+------------+-------------+---+------------------------+
|00A  |US         |11          |heliport     |0  |[[1, 17662], [0, 38564]]|
|00AA |US         |3435        |small_airport|0  |[[1, 17662], [0, 38564]]|
+-----+-----------+------------+-------------+---+------------------------+
only showing top 2 rows



arr_column = array(named_struct(id, 1 AS `id`, count, 17662 AS `count`) AS `foo`, named_struct(id, 0 AS `id`, count, 38564 AS `count`) AS `foo`)
zoom = [ident: string, iso_country: string ... 4 more fields]


[ident: string, iso_country: string ... 4 more fields]

In [71]:
case class Result(start: Long, count: Long)

defined class Result


In [72]:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.Row

val calculate_udf = udf { (pid: Int, bar: Seq[Row]) => 
    val d = bar.map { row =>
        val id = row.getAs[Int]("id")
        val count = row.getAs[Long]("count")
        Distribution(id, count)
    }.toList
    val start = d.filter(x => x.id < pid).map(x => x.count).sum
    val thisDistro = d.find(x => x.id == pid).get
    Result(start, thisDistro.count)
}

calculate_udf = UserDefinedFunction(<function2>,StructType(StructField(start,LongType,false), StructField(count,LongType,false)),None)


UserDefinedFunction(<function2>,StructType(StructField(start,LongType,false), StructField(count,LongType,false)),None)

In [74]:
zoom.select(calculate_udf('pid, 'bar)).toJSON.show(20, false)

+-------------------------------------------+
|value                                      |
+-------------------------------------------+
|{"UDF(pid, bar)":{"start":0,"count":38564}}|
|{"UDF(pid, bar)":{"start":0,"count":38564}}|
|{"UDF(pid, bar)":{"start":0,"count":38564}}|
|{"UDF(pid, bar)":{"start":0,"count":38564}}|
|{"UDF(pid, bar)":{"start":0,"count":38564}}|
|{"UDF(pid, bar)":{"start":0,"count":38564}}|
|{"UDF(pid, bar)":{"start":0,"count":38564}}|
|{"UDF(pid, bar)":{"start":0,"count":38564}}|
|{"UDF(pid, bar)":{"start":0,"count":38564}}|
|{"UDF(pid, bar)":{"start":0,"count":38564}}|
|{"UDF(pid, bar)":{"start":0,"count":38564}}|
|{"UDF(pid, bar)":{"start":0,"count":38564}}|
|{"UDF(pid, bar)":{"start":0,"count":38564}}|
|{"UDF(pid, bar)":{"start":0,"count":38564}}|
|{"UDF(pid, bar)":{"start":0,"count":38564}}|
|{"UDF(pid, bar)":{"start":0,"count":38564}}|
|{"UDF(pid, bar)":{"start":0,"count":38564}}|
|{"UDF(pid, bar)":{"start":0,"count":38564}}|
|{"UDF(pid, bar)":{"start":0,"coun
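
For each row, the UDF above returns the number of rows held by partitions with a smaller id (`start`) together with the size of the row's own partition (`count`). Presumably such an offset could be combined with a per-partition index to number rows without moving all data into a single partition, although the notebook does not show that step. The offset calculation itself is an exclusive prefix sum, sketched below in plain Scala; `PartitionOffsets`, `PartitionCount`, and `startOffsets` are hypothetical names used only for this illustration.

```scala
// Hypothetical illustration of the offset arithmetic; no Spark dependency.
object PartitionOffsets {
  final case class PartitionCount(id: Int, count: Long)

  /** Maps each partition id to the number of rows in partitions with a smaller id. */
  def startOffsets(counts: Seq[PartitionCount]): Map[Int, Long] = {
    val sorted = counts.sortBy(_.id)
    // Running total that starts at 0; it has one more element than `sorted`.
    val starts = sorted.scanLeft(0L)(_ + _.count)
    // zip drops the final element (the grand total), leaving one start per partition.
    sorted.map(_.id).zip(starts).toMap
  }
}
```

With the counts collected earlier, partition 0 starts at 0 and partition 1 starts at 38564, which agrees with the `start` values the UDF produces.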

### Sorting

In [76]:
val ret = 
    airports.groupBy('iso_country).agg(max('elevation_ft).alias("height")).repartition(1)
            .sortWithinPartitions('height.asc)
ret.write.mode("ignore").parquet("/tmp/datasets/out")
ret.explain
println(ret.rdd.getNumPartitions)

== Physical Plan ==
*(3) Sort [height#1777 ASC NULLS FIRST], false, 0
+- Exchange RoundRobinPartitioning(1)
   +- *(2) HashAggregate(keys=[iso_country#172], functions=[max(elevation_ft#170)])
      +- Exchange hashpartitioning(iso_country#172, 200)
         +- *(1) HashAggregate(keys=[iso_country#172], functions=[partial_max(elevation_ft#170)])
            +- *(1) FileScan csv [elevation_ft#170,iso_country#172] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:/tmp/datasets/airport-codes.csv], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<elevation_ft:int,iso_country:string>
1


ret = [iso_country: string, height: int]


[iso_country: string, height: int]

### Dataset API

In [77]:
val jsoned = airports.toJSON
jsoned.show(10, 50)
jsoned.explain

+--------------------------------------------------+
|                                             value|
+--------------------------------------------------+
|{"ident":"00A","type":"heliport","name":"Total ...|
|{"ident":"00AA","type":"small_airport","name":"...|
|{"ident":"00AK","type":"small_airport","name":"...|
|{"ident":"00AL","type":"small_airport","name":"...|
|{"ident":"00AR","type":"closed","name":"Newport...|
|{"ident":"00AS","type":"small_airport","name":"...|
|{"ident":"00AZ","type":"small_airport","name":"...|
|{"ident":"00CA","type":"small_airport","name":"...|
|{"ident":"00CL","type":"small_airport","name":"...|
|{"ident":"00CN","type":"heliport","name":"Kitch...|
+--------------------------------------------------+
only showing top 10 rows

== Physical Plan ==
*(2) SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, java.lang.String, true], true, false) AS value#1791]
+- MapPartitions <function1>, obj#1790:

jsoned = [value: string]


[value: string]

In [78]:
airports.select(to_json(struct(col("*")))).explain

== Physical Plan ==
Project [structstojson(named_struct(ident, ident#167, type, type#168, name, name#169, elevation_ft, elevation_ft#170, continent, continent#171, iso_country, iso_country#172, iso_region, iso_region#173, municipality, municipality#174, gps_code, gps_code#175, iata_code, iata_code#176, local_code, local_code#177, coordinates, coordinates#178), Some(Europe/Moscow)) AS structstojson(named_struct(NamePlaceholder(), unresolvedstar()))#1809]
+- *(1) FileScan csv [ident#167,type#168,name#169,elevation_ft#170,continent#171,iso_country#172,iso_region#173,municipality#174,gps_code#175,iata_code#176,local_code#177,coordinates#178] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:/tmp/datasets/airport-codes.csv], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<ident:string,type:string,name:string,elevation_ft:int,continent:string,iso_country:string,...


In [79]:
case class Apple(size: Int, color: String)

defined class Apple


In [80]:
List(Apple(1, "red")).toDS.map(x => x.size).explain

== Physical Plan ==
*(1) SerializeFromObject [input[0, int, false] AS value#1820]
+- *(1) MapElements <function1>, obj#1819: int
   +- *(1) DeserializeToObject newInstance(class $line255.$read$$iw$$iw$Apple), obj#1818: $line255.$read$$iw$$iw$Apple
      +- LocalTableScan [size#1814, color#1815]


In [83]:
List(Apple(1, "red")).toDS.localCheckpoint.select('size).explain

== Physical Plan ==
*(1) Project [size#1837]
+- Scan ExistingRDD[size#1837,color#1838]


### UDF

In [None]:
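// Rows are represented internally as org.apache.spark.sql.catalyst.InternalRow.
// Types to keep in mind for UDF arguments:
//   java.lang.Long    // != Long: the boxed type can hold null
//   java.lang.Integer // != Int: the boxed type can hold null
//   java.lang.String  // == String (UTF8String internally)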

In [88]:
val mega_udf = udf { (left: java.lang.Integer, right: String) => "ok" }

mega_udf = UserDefinedFunction(<function2>,StringType,Some(List(IntegerType, StringType)))


UserDefinedFunction(<function2>,StringType,Some(List(IntegerType, StringType)))

In [89]:
spark
    .range(1)
    .select(
        lit(1).alias("left"), 
        lit("foo").alias("right")
    )
    .select(mega_udf('left, 'right))
    .show

+----------------+
|UDF(left, right)|
+----------------+
|              ok|
+----------------+



In [90]:
spark
    .range(1)
    .select(
        lit(null).alias("left"), 
        lit("foo").alias("right")
    )
    .select(mega_udf('left, 'right))
    .show

+----------------+
|UDF(left, right)|
+----------------+
|              ok|
+----------------+



In [91]:
spark
    .range(1)
    .select(
        lit(1).alias("left"), 
        lit(null).alias("right")
    )
    .select(mega_udf('left, 'right))
    .show

+----------------+
|UDF(left, right)|
+----------------+
|              ok|
+----------------+



In [96]:
val foo: java.lang.Integer = null
val foe: Int = if (foo == null) ??? else foo

lastException = null


Name: java.lang.NullPointerException
Message: null
StackTrace:   at scala.Predef$.Integer2int(Predef.scala:362)
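
The stack trace above points at `scala.Predef.Integer2int`, the implicit conversion that unboxes a `java.lang.Integer` into an `Int`; it fails when the boxed value is `null`. One way to avoid this is to wrap the value in `Option` before unboxing. The sketch below is plain Scala; `NullSafeUnboxing` and `toIntOrElse` are hypothetical names, and choosing a default value is only one of several possible policies.

```scala
// Hypothetical helper showing null-safe unboxing of a java.lang.Integer.
object NullSafeUnboxing {
  /** Converts a possibly-null boxed Integer to Int, returning `default` when it is null. */
  def toIntOrElse(boxed: java.lang.Integer, default: Int): Int =
    Option(boxed)          // Option(null) is None, so no unboxing is attempted
      .map(_.intValue)     // explicit unboxing of a non-null value
      .getOrElse(default)
}
```

Inside a UDF that receives nullable columns, the same pattern keeps a `null` input from surfacing as a `NullPointerException` at the unboxing step.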

In [97]:
case class Foo(first: Int, second: Int, third: Int)

defined class Foo


lastException: Throwable = null


In [102]:
val mega_udf2 = udf { () => Thread.sleep(1000); Foo(1,2,3) }.asNondeterministic

mega_udf2 = UserDefinedFunction(<function0>,StructType(StructField(first,IntegerType,false), StructField(second,IntegerType,false), StructField(third,IntegerType,false)),Some(List()))


UserDefinedFunction(<function0>,StructType(StructField(first,IntegerType,false), StructField(second,IntegerType,false), StructField(third,IntegerType,false)),Some(List()))

In [103]:
spark.time { 
spark
    .range(0, 10, 1, 1)
    .select(mega_udf2().alias("res"))
    .select('res("first"), 'res("second"), 'res("third")).show
}

+---------+----------+---------+
|res.first|res.second|res.third|
+---------+----------+---------+
|        1|         2|        3|
|        1|         2|        3|
|        1|         2|        3|
|        1|         2|        3|
|        1|         2|        3|
|        1|         2|        3|
|        1|         2|        3|
|        1|         2|        3|
|        1|         2|        3|
|        1|         2|        3|
+---------+----------+---------+

Time taken: 10135 ms


In [None]:
spark.time { 
spark
    .range(0, 10, 1, 1)
    .select(mega_udf2().alias("res"))
    .select(col("res.*")).show
}

In [104]:
spark
    .range(0, 10, 1, 1)
    .select(mega_udf2().alias("res"))
    .select('res("first"), 'res("second"), 'res("third")).explain(true)

== Parsed Logical Plan ==
'Project [unresolvedalias('res[first], None), unresolvedalias('res[second], None), unresolvedalias('res[third], None)]
+- Project [UDF() AS res#1987]
   +- Range (0, 10, step=1, splits=Some(1))

== Analyzed Logical Plan ==
res.first: int, res.second: int, res.third: int
Project [res#1987.first AS res.first#1996, res#1987.second AS res.second#1997, res#1987.third AS res.third#1998]
+- Project [UDF() AS res#1987]
   +- Range (0, 10, step=1, splits=Some(1))

== Optimized Logical Plan ==
Project [res#1987.first AS res.first#1996, res#1987.second AS res.second#1997, res#1987.third AS res.third#1998]
+- Project [UDF() AS res#1987]
   +- Range (0, 10, step=1, splits=Some(1))

== Physical Plan ==
*(1) Project [res#1987.first AS res.first#1996, res#1987.second AS res.second#1997, res#1987.third AS res.third#1998]
+- *(1) Project [UDF() AS res#1987]
   +- *(1) Range (0, 10, step=1, splits=1)


### Cache

In [105]:
import sys.process._
"""cp -f /tmp/datasets/source/1.txt /tmp/datasets/cache.txt/1.txt""".!
"""cp -f /tmp/datasets/source/2.txt /tmp/datasets/cache.txt/2.txt""".!

0

In [114]:
val df = spark.read.text("/tmp/datasets/cache.txt")
// df.show(10)

// df.cache
df.show(1000)

+-----+
|value|
+-----+
|    1|
|    1|
|    1|
|    1|
|    1|
|    3|
|    3|
|    3|
|    3|
|    3|
+-----+



df = [value: string]


[value: string]

In [116]:
df.localCheckpoint
df.unpersist

[value: string]

In [108]:
"""cp -f /tmp/datasets/source/1.txt /tmp/datasets/cache.txt/2.txt""".!
"""cp -f /tmp/datasets/source/3.txt /tmp/datasets/cache.txt/1.txt""".!

0

In [111]:
df.show()

+-----+
|value|
+-----+
|    1|
|    1|
|    1|
|    1|
|    1|
|    3|
|    3|
|    3|
|    3|
|    3|
+-----+



In [112]:
spark.read.text("/tmp/datasets/cache.txt").show

+-----+
|value|
+-----+
|    1|
|    1|
|    1|
|    1|
|    1|
|    3|
|    3|
|    3|
|    3|
|    3|
+-----+



In [122]:
spark.sharedState.cacheManager.clearCache

### Coalesce

In [123]:
spark.time {
val ret = spark.range(0, 10, 1, 2).withColumn("foo", mega_udf2()).coalesce(1)
ret.collect
ret.explain
}

== Physical Plan ==
Coalesce 1
+- *(1) Project [id#2185L, UDF() AS foo#2187]
   +- *(1) Range (0, 10, step=1, splits=2)
Time taken: 10108 ms


In [124]:
spark.time {
val ret = spark.range(0, 10, 1, 2).withColumn("foo", mega_udf2())
ret.cache
ret.count
ret.coalesce(1).collect
ret.coalesce(1).explain
}

== Physical Plan ==
Coalesce 1
+- InMemoryTableScan [id#2192L, foo#2194]
      +- InMemoryRelation [id#2192L, foo#2194], StorageLevel(disk, memory, deserialized, 1 replicas)
            +- *(1) Project [id#2192L, UDF() AS foo#2194]
               +- *(1) Range (0, 10, step=1, splits=2)
Time taken: 5163 ms
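
The two timings above are consistent with a simple back-of-the-envelope model: each row costs about one second in `mega_udf2`, and the wall-clock time is set by the task that processes the most rows. The sketch below is plain Scala; `CoalesceCost` and `estimateSeconds` are hypothetical names, and the model assumes that rows are split evenly, that all tasks run at the same time, and that scheduling overhead is negligible.

```scala
// Hypothetical cost model; it ignores scheduling overhead and data skew.
object CoalesceCost {
  /** Rough wall-clock estimate: the slowest task processes ceil(rows / tasks) rows. */
  def estimateSeconds(rows: Long, tasks: Int, secondsPerRow: Double): Double =
    math.ceil(rows.toDouble / tasks) * secondsPerRow
}
```

For 10 rows, one task gives an estimate of 10 seconds, in line with the run where `coalesce(1)` sits directly on top of the UDF, while two tasks give 5 seconds, in line with the run where the UDF result is cached across two partitions before `coalesce(1)` is applied.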


In [None]:
spark.sharedState.cacheManager.clearCache