# Module 1: Introduction to Spark

## Our first Spark Application

### Lesson Objectives

After completing this lesson, you should be able to:
* Describe an RDD and its properties
* Understand the basic workflow of a Word Count application with Spark

### What is an RDD?

* Resilient Distributed Dataset
* The core abstraction of Spark
* Immutable at its core, assuring thread safety
* Odersky has called it the "ultimate Scala collection"
* Each step of a dataflow that transforms an RDD results in a new RDD being created
* RDDs are "lazy"
  - A DAG (directed acyclic graph) of computation is constructed; because the graph is acyclic, it cannot contain loops
  - The actual data is processed only when results are requested
* RDDs know their "parents", and transitively, all of their "ancestors" in the lineage of the data flow
* RDDs are resilient, and a lost partition can be reconstructed from its lineage
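
The properties above can be seen in a few lines. The following is a minimal sketch, assuming a live `SparkContext` named `sc` as provided by the notebook; the names `numbers`, `doubled`, `large`, and `result` are illustrative only. Each transformation returns a new RDD, nothing is computed until the `collect` action runs, and `toDebugString` prints the lineage that Spark would use to rebuild a lost partition.

```scala
// Assumption: `sc` is the SparkContext supplied by the notebook environment.
val numbers = sc.parallelize(1 to 10)   // source RDD
val doubled = numbers.map(_ * 2)        // a new RDD; nothing is computed yet
val large   = doubled.filter(_ > 10)    // another new RDD, still lazy

// Each RDD knows its parents; this prints the chain back to the source.
println(large.toDebugString)

// An action triggers the actual computation.
val result = large.collect()
println(result.mkString(", "))          // prints "12, 14, 16, 18, 20"
```

Note that `numbers` and `doubled` are unchanged by the later steps, which is what immutability means here.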

### Word Count Application with Spark Example

The file `/resources/data/input/all-shakespeare.txt` contains the text whose words we are going to count. You can save the result of the processing permanently with `wc.saveAsTextFile(outpath)`, where `outpath` is the output path (Spark writes a directory of part files there). However, Spark does not overwrite existing output: `saveAsTextFile` fails if the path already exists, so you have to delete the output manually before each re-run of the job.
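
One way to avoid the manual clean-up is to delete the output directory from code before saving. This is a sketch only, assuming a live `SparkContext` named `sc`; the path `wc-output` and the small `sample` RDD are hypothetical stand-ins for the real output path and the word-count RDD built later in this lesson.

```scala
import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical output location; adjust to your environment.
val outpath = "wc-output"

// Assumption: `sc` is the SparkContext supplied by the notebook environment.
val fs = FileSystem.get(sc.hadoopConfiguration)
val outDir = new Path(outpath)

// Remove output left over from a previous run (recursive delete).
if (fs.exists(outDir)) fs.delete(outDir, true)

// A tiny stand-in for the word-count RDD; any RDD is saved the same way.
val sample = sc.parallelize(Seq(("to", 2), ("be", 2), ("or", 1)))
sample.saveAsTextFile(outpath)
```

Deleting recursively is destructive, so it is worth double-checking that `outpath` points only at job output.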

In [1]:
import sys.process._

// Download the input file; `.!` runs the command and returns its exit code.
"wget https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/SC0103EN/data/all-shakespeare.txt".!

--2022-03-04 20:27:56--  https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/SC0103EN/data/all-shakespeare.txt
Resolving s3-api.us-geo.objectstorage.softlayer.net (s3-api.us-geo.objectstorage.softlayer.net)... 67.228.254.196
Connecting to s3-api.us-geo.objectstorage.softlayer.net (s3-api.us-geo.objectstorage.softlayer.net)|67.228.254.196|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5342761 (5.1M) [text/plain]
Saving to: ‘all-shakespeare.txt’


0

In [2]:
//package course2.module1
 
val inpath = "all-shakespeare.txt"
    
val input = sc.textFile(inpath)
val wc = input.
  map(_.toLowerCase).
  flatMap(text => text.split("""\W+""")).
  groupBy(word => word). // Like SQL GROUP BY: RDD[(String,Iterable[String])]
  mapValues(group => group.size) // RDD[(String,Int)]
    
println("Output")
wc.take(10).foreach(t => println(s"${t._1} - ${t._2}"))
println("\n")

Output
bone - 21
vailing - 3
bombast - 4
fartuous - 2
hem - 10
stinks - 1
fuller - 2
tough - 8
jade - 16
countervail - 3




inpath = all-shakespeare.txt
input = all-shakespeare.txt MapPartitionsRDD[1] at textFile at <console>:34
wc = MapPartitionsRDD[6] at mapValues at <console>:39
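
The `groupBy` version above collects every occurrence of a word before counting it. A commonly used alternative is `reduceByKey`, which sums partial counts within each partition before shuffling. The sketch below is not the lesson's own code: it assumes a live `SparkContext` named `sc` and uses two in-memory sample lines (hypothetical data) instead of the Shakespeare file so the result is easy to check.

```scala
// Assumption: `sc` is the SparkContext supplied by the notebook environment.
val lines = sc.parallelize(Seq("To be, or not to be", "that is the question"))

val counts = lines.
  flatMap(_.toLowerCase.split("""\W+""")).
  filter(_.nonEmpty).        // split can yield an empty first token
  map(word => (word, 1)).    // pair each word with a count of 1
  reduceByKey(_ + _)         // sum the counts per word: RDD[(String,Int)]

val byWord = counts.collect().toMap
println(byWord("to"))        // prints 2
println(byWord("question"))  // prints 1
```

The same pipeline should work on the full file by replacing `lines` with `sc.textFile(inpath)`.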



### About the Authors

[Petro Verkhogliad](https://www.linkedin.com/in/vpetro) is a Consulting Manager at Lightbend. He holds a Master's degree in Computer Science with a specialization in Intelligent Systems. He is passionate about functional programming and applications of AI.