<br><br><br>
<span style="color:red;font-size:60px">Apache Spark</span>
<br><br>
<div class="list">
<span>A <b>framework</b> for <b>cluster computing</b></span>
<ul>
    <li>becoming a standard for "big data" analytics</li>
    <li>provides services for streaming analytics</li>
    <li>provides services for graph analytics</li>
    <li>provides services for machine learning</li>
    <li>provides services for in-memory data querying</li>
</ul>
    </div>

<div class="list">
<li>Spark is written in <span style="color:blue">Scala</span></li>
<li>Runs on the JVM (Java Virtual Machine); Scala code compiles to JVM bytecode</li>
<li>Has APIs for Scala, Python, Java, and R</li>
    </div>

Class notes:

Advantages of Scala over Python:
- statically typed (type errors are caught at compile time)
- favors immutable values (val)

<br><br><br>
<span style="color:blue;font-size:x-large">A Spark Program</span>


In [1]:
val text = sc.textFile("shakespeare.txt")
// sc.textFile is a Spark function
// text is a Scala variable that holds a Spark object (an RDD)
val relevant_lines = text.filter(l => l.contains("Music"))
val result = relevant_lines.count() // result is a plain Scala value (Long)

Intitializing Scala interpreter ...

Spark Web UI available at http://192.168.11.104:4041
SparkContext available as 'sc' (version = 3.3.0, master = local[*], app id = local-1675373514373)
SparkSession available as 'spark'


text: org.apache.spark.rdd.RDD[String] = shakespeare.txt MapPartitionsRDD[1] at textFile at <console>:24
relevant_lines: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[2] at filter at <console>:27
result: Long = 52


<img src="spark_context.png">

<br><br><br>
<span style="color:blue;font-size:x-large">Spark context</span>
<div class="list">
<li>Spark context encapsulates the connection between the application and the cluster</li>
    <li>It handles job distribution, broadcasting, creating in-memory datasets (distributed), etc.</li> 
<li>Both Jupyter and spark-shell create the Spark context for us automatically</li>
<li>Only one Spark context can be active at a time on a Java Virtual Machine (Spark runs on the JVM). Two or more would be very confusing. Why?</li>
    </div>
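The one-context rule can be checked directly. The sketch below is illustrative only: the app names are made up, and it assumes Spark is on the classpath. <span style="color:blue">SparkContext.getOrCreate</span> returns the active context when there is one instead of starting a second.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// In the notebook this returns the kernel's existing context;
// run standalone, it starts a local one. App names are hypothetical.
val first = SparkContext.getOrCreate(
  new SparkConf().setAppName("demo").setMaster("local[*]"))

// A second request does not create a second context.
val second = SparkContext.getOrCreate(
  new SparkConf().setAppName("another-demo").setMaster("local[*]"))

// Reference equality: both names point at the same SparkContext object.
val sameContext = first eq second
```

The second configuration is ignored because a context is already active, which is why the app name reported by the context does not change.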

In [10]:
sc

res0: org.apache.spark.SparkContext = org.apache.spark.SparkContext@1493a28c


In [11]:
sc.appName

res1: String = spylon-kernel


In [12]:
sc.master

res2: String = local[*]


<img src="spark_environment.png">
<p>
</p>
<h3>Notes</h3>
<li>Each application gets its own environment</li>
<li>The environment runs from the point the application starts until it terminates</li>
<li><b>Each application is, thus, isolated from other applications!</b></li>
<ul>
    <li>Each application schedules its own tasks</li>
    <li>Each application runs in its own JVM</li>
    <li>As a consequence, data cannot be shared directly between applications (it can only be shared through an external storage system)</li>
</ul>
<li>The cluster manager can be Spark's own cluster manager or some other cluster manager (YARN, Kubernetes). That is, Spark's driver program can talk to non-Spark cluster managers</li>
<li>The driver program must be network addressable from the worker nodes, because the workers send messages to the driver throughout the lifetime of the application</li>
<li>To ensure quick turnaround, Spark drivers and workers are typically on the same local network and use local network addressing to communicate</li>

<br><br><br>
<span style="color:blue;font-size:large">Lazy Spark</span>
<br><br>
<li>Spark is intrinsically lazy. Nothing is evaluated until an action is called</li>
<li>In our program, count() is the action that triggers evaluation</li>
<li>For example, the following code does not give an error</li>

In [13]:
// lazy steps
val text = sc.textFile("file_does_not_exist") // Spark does not yet check whether the file exists
val relevant_lines = text.filter(l => l.contains("Music"))
//val result = relevant_lines.count()

text: org.apache.spark.rdd.RDD[String] = file_does_not_exist MapPartitionsRDD[7] at textFile at <console>:26
relevant_lines: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[8] at filter at <console>:27


<li>But this code does</li>

In [14]:
val text = sc.textFile("file_does_not_exist")
val relevant_lines = text.filter(l => l.contains("Music"))
val result = relevant_lines.count()

org.apache.hadoop.mapred.InvalidInputException:  Input path does not exist: file:/Users/weizhou/Columbia Coursework/Cloud Analysis 4526/file_does_not_exist

In [15]:
val text = sc.textFile("shakespeare.txt")
val relevant_lines = text.filter(l => l.contains("Music"))
val result = relevant_lines.count()

text: org.apache.spark.rdd.RDD[String] = shakespeare.txt MapPartitionsRDD[13] at textFile at <console>:26
relevant_lines: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[14] at filter at <console>:27
result: Long = 52
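
Laziness can also be observed without a missing file. The following is a minimal sketch, not part of the original notes: it uses an accumulator (the name "elements seen" is made up) to count how many elements the map function has actually processed before and after an action.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Reuses the notebook's context if one is active; otherwise starts a local one.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("lazy-demo").setMaster("local[*]"))

// Accumulator that counts how many elements the map function has processed.
val seen = sc.longAccumulator("elements seen")

// Transformation only: Spark records the step but runs nothing yet.
val doubled = sc.parallelize(1 to 100).map { x => seen.add(1); x * 2 }
val beforeAction = seen.sum   // still 0, because no action has run

// Action: triggers the computation, so the map function now executes.
val n = doubled.count()
val afterAction = seen.sum    // expected to be 100, barring task retries
```

Accumulator updates made inside transformations can be applied more than once if a task is re-executed, so this is a diagnostic trick rather than a reliable counter.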


<br><br><br>
<span style="color:green;font-size:xx-large">Resilient Distributed Datasets</span>
<br><p>
A Resilient Distributed Dataset (RDD) is the primary data abstraction used in Spark
    <p>
<br>RDDs are:
<li>immutable (read only)</li>
<li>resilient (fault tolerant) </li>
<li>distributed (spread on multiple nodes) </li>
<p>
    RDDs allow for low level (programming level) operations on data

<br><br><br>
<span style="color:green;font-size:xx-large">Properties of RDDs</span>
<br><p>
<li>immutable, resilient, distributed (see above)</li>
<li><b>in-memory</b>: as far as possible, an RDD is maintained in memory for faster computation</li>
<li><b>cached</b>: an RDD can be cached (persisted) for reuse; if memory is not available, its partitions can be stored on disk, depending on the storage level</li>
<li><b>lazy evaluation</b>: RDDs are not evaluated unless an action step is encountered</li>
<li><b>parallel computation</b>: when evaluation is necessary, it is done in parallel on each partition</li>
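
A minimal sketch of the caching property, assuming the standard <span style="color:blue">persist</span> API and the <span style="color:blue">MEMORY_AND_DISK</span> storage level. The data set (squares of 1 to 1000) is made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Reuses the notebook's context if one is active; otherwise starts a local one.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("cache-demo").setMaster("local[*]"))

val squares = sc.parallelize(1 to 1000).map(x => x.toLong * x)

// Keep partitions in memory, spilling to disk if memory runs short.
squares.persist(StorageLevel.MEMORY_AND_DISK)

val n = squares.count()            // first action computes and caches the partitions
val total = squares.reduce(_ + _)  // second action can reuse the cached partitions
val level = squares.getStorageLevel
```

Without the persist call, each action would recompute the map step from the source data.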



In [12]:
val text = sc.textFile("shakespeare.txt")
//val text1 = sc.textFile("shakespeare.txt")

text: org.apache.spark.rdd.RDD[String] = shakespeare.txt MapPartitionsRDD[9] at textFile at <console>:24


<br><br><br>
<span style="color:green;font-size:xx-large">Partitioning</span>
<br><br>
<li>Notice that the data type of the Shakespeare text RDD is MapPartitionsRDD</li>
<li>Spark automatically partitions the data and, when a job needs to be run, schedules one task per partition to process that partition's data</li>
<li>The number of partitions can be set by the programmer or left at the default (usually based on the number of cores)</li>
<li>The number of partitions is a trade-off between communication overhead (driver to worker nodes) and processing time within a partition (a function of the size of the data in that partition)</li>
<li>Note that a single core can be assigned multiple partitions</li>
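
To see how elements are spread over partitions, one option is <span style="color:blue">mapPartitionsWithIndex</span>. The sketch below is illustrative: the ten-element range and the partition counts (3 and 5) are arbitrary choices, not values from the notes.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Reuses the notebook's context if one is active; otherwise starts a local one.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("partition-demo").setMaster("local[*]"))

// Ask for 3 partitions explicitly.
val nums = sc.parallelize(1 to 10, 3)

// For each partition, emit (partition index, number of elements it holds).
val sizes = nums
  .mapPartitionsWithIndex((idx, it) => Iterator((idx, it.size)))
  .collect()

sizes.foreach { case (idx, size) => println(s"partition $idx holds $size elements") }

// repartition returns a new RDD with the requested number of partitions.
val wider = nums.repartition(5).getNumPartitions
```

Each partition becomes one task, so the sizes reported here indicate how evenly the work would be divided.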


<br><br><br>
<span style="color:blue;font-size:large">Cores on a Mac</span>

In [17]:
!sysctl hw.physicalcpu hw.logicalcpu

hw.physicalcpu: 8
hw.logicalcpu: 8



<br><br><br>
<span style="color:blue;font-size:large">Cores on Windows</span>


https://support.microsoft.com/en-us/help/4026757/windows-10-find-out-how-many-cores-your-processor-has

<br><br><br>
<span style="color:blue;font-size:large">Partitions are maintained in an array of Partition objects</span>

In [18]:
text

res3: org.apache.spark.rdd.RDD[String] = shakespeare.txt MapPartitionsRDD[16] at textFile at <console>:24


In [13]:
text.partitions

res9: Array[org.apache.spark.Partition] = Array(org.apache.spark.rdd.HadoopPartition@4b9, org.apache.spark.rdd.HadoopPartition@4ba)


In [14]:
text.partitions.length

res10: Int = 2


<br><br><br>
<span style="color:blue;font-size:large">Setting the number of partitions</span>

In [8]:
val text = sc.textFile("shakespeare.txt",5)
val np = text.partitions.length

text: org.apache.spark.rdd.RDD[String] = shakespeare.txt MapPartitionsRDD[6] at textFile at <console>:25
np: Int = 5


In [10]:
text.getNumPartitions

res7: Int = 5


In [24]:
text.collect()

res7: Array[String] = Array("The Project Gutenberg EBook of The Complete Works of William Shakespeare, by ", William Shakespeare, "", This eBook is for the use of anyone anywhere at no cost and with, almost no restrictions whatsoever.  You may copy it, give it away or, re-use it under the terms of the Project Gutenberg License included, with this eBook or online at www.gutenberg.org, "", ** This is a COPYRIGHTED Project Gutenberg eBook, Details Below **, **     Please follow the copyright guidelines in this file.     **, "", Title: The Complete Works of William Shakespeare, "", Author: William Shakespeare, "", Posting Date: September 1, 2011 [EBook #100], Release Date: January, 1994, "", Language: English, "", "", *** START OF THIS PROJECT GUTENBERG EBOOK COMPLETE WORKS--WILLIAM SHAKESP...


In [28]:
text.collect

res11: Array[String] = Array("The Project Gutenberg EBook of The Complete Works of William Shakespeare, by ", William Shakespeare, "", This eBook is for the use of anyone anywhere at no cost and with, almost no restrictions whatsoever.  You may copy it, give it away or, re-use it under the terms of the Project Gutenberg License included, with this eBook or online at www.gutenberg.org, "", ** This is a COPYRIGHTED Project Gutenberg eBook, Details Below **, **     Please follow the copyright guidelines in this file.     **, "", Title: The Complete Works of William Shakespeare, "", Author: William Shakespeare, "", Posting Date: September 1, 2011 [EBook #100], Release Date: January, 1994, "", Language: English, "", "", *** START OF THIS PROJECT GUTENBERG EBOOK COMPLETE WORKS--WILLIAM SHAKES...


<br><br><br>
<span style="color:blue;font-size:large">Creating partitions using parallelize</span>
<li><span style="color:green">parallelize</span> can be used to partition any local Scala collection</li>
<li>As with sc.textFile, you can specify the number of partitions</li>
    </div>

In [11]:
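// note: parallelize treats the String as a sequence of characters; it does not read the file
val text = sc.parallelize("shakespeare.txt")
val np = text.getNumPartitions // defaults to sc.defaultParallelism (here, the number of cores)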

text.collect

text: org.apache.spark.rdd.RDD[Char] = ParallelCollectionRDD[7] at parallelize at <console>:25
np: Int = 8
res8: Array[Char] = Array(s, h, a, k, e, s, p, e, a, r, e, ., t, x, t)


In [7]:
text.partitions.length

res5: Int = 8


In [23]:
val names = Array("John","Qing","Vladimir","Audrey","Baskin","Robbins")
val n = sc.parallelize(names)
n.getNumPartitions
n.collect

names: Array[String] = Array(John, Qing, Vladimir, Audrey, Baskin, Robbins)
n: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[18] at parallelize at <console>:27
res15: Array[String] = Array(John, Qing, Vladimir, Audrey, Baskin, Robbins)


In [31]:
val text = sc.parallelize("shakespeare.txt",25)
val np = text.getNumPartitions




text: org.apache.spark.rdd.RDD[Char] = ParallelCollectionRDD[21] at parallelize at <console>:25
np: Int = 25


<div class="list">
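<li><span style="color:blue">sc.textFile</span> uses Hadoop's block size as a guide and the default minimum number of partitions (<span style="color:blue">sc.defaultMinPartitions</span>) as a floor for the number of partitions</li>
<li><span style="color:blue">sc.parallelize</span> uses a default number of partitions (<span style="color:blue">sc.defaultParallelism</span>)</li>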
    </div>

In [32]:

print(sc.defaultParallelism)

8

In [33]:
print(sc.defaultMinPartitions)

2

<br><br><br><br>
<p>
<span style="color:green;font-size:xx-large">RDD operations</span>
<li>RDDs are lazy</li>
<li>Two kinds of operations are allowed</li>
<ol><li>transformations: a function that produces a new RDD from an existing RDD</li>
    <li>actions: a function that returns actual values (not RDDs) from an RDD</li>
    </ol>
    <li>The set of transformations and their order associated with an RDD is called its <b>lineage</b></li>
<br><br>
<li><a href="https://spark.apache.org/docs/latest/rdd-programming-guide.html#transformations">transformations</a> </li>
<li><a href="https://spark.apache.org/docs/latest/rdd-programming-guide.html#actions">actions</a></li>
<li><b>Important!</b> Transformations and actions may look like Scala functions but they are Spark RDD operations!</li>
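
The distinction between the two kinds of operations shows up in the return types. A minimal sketch with a made-up four-word data set: transformations return another RDD, while the action returns a plain Scala value.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Reuses the notebook's context if one is active; otherwise starts a local one.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("ops-demo").setMaster("local[*]"))

val words = sc.parallelize(Seq("spark", "scala", "rdd", "lineage"))

// Transformations: each returns a new RDD and only extends the lineage.
val lengths = words.map(_.length)       // RDD[Int]
val longOnes = lengths.filter(_ > 4)    // RDD[Int]

// Action: runs the lineage and returns an ordinary Int to the driver.
val total = longOnes.reduce(_ + _)
```

Here the word lengths are 5, 5, 3 and 7; the filter keeps 5, 5 and 7, so the reduce returns 17.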

<img src="lineage.png">

<br><br><br>
<span style="color:blue;font-size:large">lineage of relevant_lines</span>
<p>
    <li>The RDD method <span style="color:blue">toDebugString</span> returns the lineage of an RDD</li>

In [16]:
val text = sc.textFile("shakespeare.txt")
val relevant_lines = text.filter(l => l.contains("Music"))
val result = relevant_lines.count()

text: org.apache.spark.rdd.RDD[String] = shakespeare.txt MapPartitionsRDD[10] at textFile at <console>:26
relevant_lines: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[11] at filter at <console>:27
result: Long = 52


In [17]:
relevant_lines.toDebugString


res11: String =
(2) MapPartitionsRDD[11] at filter at <console>:27 []
 |  shakespeare.txt MapPartitionsRDD[10] at textFile at <console>:26 []
 |  shakespeare.txt HadoopRDD[9] at textFile at <console>:26 []


<br><br><br>
<span style="color:green;font-size:xx-large">RDD actions</span>
<li><span style="color:red">.toDebugString</span>: returns the lineage of an RDD (strictly a utility method rather than an action; it does not trigger a computation)</li>
<li><span style="color:red">.count</span>: counts the number of elements in the RDD</li>
<li><span style="color:red">.first</span>: returns the first element of the RDD</li>
<li><span style="color:red">.take(n)</span>: returns the first n elements of an RDD</li>
<li><span style="color:red">.takeOrdered(n)(ordering)</span>: returns the first n elements according to the given ordering</li>
<li><span style="color:red">.collect</span>: collects all the data resulting from the series of operations onto one node (the driver). Use carefully, because the data must fit in the driver's memory!</li>
<li><span style="color:red">.foreach</span>: applies a function to each element of an RDD for its side effects; nothing is returned</li>


In [18]:
val text = sc.textFile("shakespeare.txt")
val relevant_lines = text.filter(l => l.contains("Music"))
val result = relevant_lines.count()

text: org.apache.spark.rdd.RDD[String] = shakespeare.txt MapPartitionsRDD[13] at textFile at <console>:26
relevant_lines: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[14] at filter at <console>:27
result: Long = 52


In [19]:
text.take(5).foreach(println)

The Project Gutenberg EBook of The Complete Works of William Shakespeare, by 
William Shakespeare

This eBook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever.  You may copy it, give it away or


In [20]:
val r3 = text.takeOrdered(3)(Ordering[String].reverse).foreach(println)


your written explanation.  The person or entity that provided you with
your equipment.
your debt. But a good conscience will make any possible


r3: Unit = ()


In [51]:
val r4 = text.takeOrdered(3)(Ordering[Double].reverse.on(x => 1.0 * x.length)).foreach(println)

    whither wilt?' ROSALIND. Nay, you might keep that check for it, till you met your
*** START OF THIS PROJECT GUTENBERG EBOOK COMPLETE WORKS--WILLIAM SHAKESPEARE ***
*** END OF THIS PROJECT GUTENBERG EBOOK COMPLETE WORKS--WILLIAM SHAKESPEARE ***


r4: Unit = ()


In [56]:
val r1 = text.first
val r2 = text.take(3)
val r3 = text.takeOrdered(3)(Ordering[String].reverse)
val r4 = text.takeOrdered(3)(Ordering[Double].reverse.on(x => 1.0*x.length))
val r5 = text.takeOrdered(3)(Ordering[Double].on(x => 1.0/x.length))
val r6 = text.collect

r1: String = "The Project Gutenberg EBook of The Complete Works of William Shakespeare, by "
r2: Array[String] = Array("The Project Gutenberg EBook of The Complete Works of William Shakespeare, by ", William Shakespeare, "")
r3: Array[String] = Array(your written explanation.  The person or entity that provided you with, your equipment., your debt. But a good conscience will make any possible)
r4: Array[String] = Array("    whither wilt?' ROSALIND. Nay, you might keep that check for it, till you met your", *** START OF THIS PROJECT GUTENBERG EBOOK COMPLETE WORKS--WILLIAM SHAKESPEARE ***, *** END OF THIS PROJECT GUTENBERG EBOOK COMPLETE WORKS--WILLIAM SHAKESPEARE ***)
r5: Array[String] = Array("    whither wilt?' ROSALIND. Nay, you might keep that check for it, till you met your", *** ST...


In [57]:
text.takeOrdered(10)(Ordering[String].reverse).foreach(println)

your written explanation.  The person or entity that provided you with
your equipment.
your debt. But a good conscience will make any possible
you- but, indeed, to pray for the Queen.
you!) can copy and distribute it in the United States without permission
you must, at no additional cost, fee or expense to the user, provide a
you good night.
written explanation to the person you received the work from.  If you
works.  See paragraph 1.E below.
works.


<br><br><br>
<span style="color:blue;font-size:large">Try this: Extend our simple program</span>
<li>Limit results to lines that are shorter than 30 characters and contain the word Music or music</li>


In [58]:
val text = sc.textFile("shakespeare.txt") // textFile is a Spark API call; its argument is a Scala String
val relevant_lines = text.filter(l => l.length < 30 && (l.contains("Music") || l.contains("music"))) // Spark applies this Scala predicate to each line
val result = relevant_lines.count()
relevant_lines.foreach(println)

  ALL. The music, ho!
    provided this music?
[followed by Musicians].
    Music.
    Come, some music!
    Music do I hear?
    music.
    Like music.
  Three Musicians.
    Held current music too.
    What music is this?
    As howling after music.
  JULIA. That will be music.


text: org.apache.spark.rdd.RDD[String] = shakespeare.txt MapPartitionsRDD[50] at textFile at <console>:26
relevant_lines: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[51] at filter at <console>:27
result: Long = 13


In [None]:
# if in a Python (PySpark) kernel
text = sc.textFile("shakespeare.txt")
relevant_lines = text.filter(lambda l: len(l) < 30 and ("Music" in l or "music" in l))
result = relevant_lines.count()
print(result)