<img src="uva_seal.png">  

## Spark Getting Started

### University of Virginia
### DS 5110: Big Data Systems
### Last Updated: January 18, 2026

---  

### SOURCES
Learning Spark, First Edition

Chapter 1: Introduction to Data Analysis with Spark  
Chapter 2: Getting Started

### OBJECTIVES
-  Spark background
-  Setup and installation
-  Basic concepts
-  Minimal code examples
-  Running Spark: Interactive Session
-  Running Spark: Command Line

### CONCEPTS

- Cluster: a set of connected computers (nodes)

- Functional programming

- SparkSession - single point of entry for interacting with Spark functionality

- Resilient Distributed Datasets (RDDs) - Spark’s fundamental abstraction for distributed data and computation

- Dataset

- Driver Program - contains the application's main function, defines RDDs on the cluster, and applies operations to them

- Worker Node or Executor - the units that perform tasks
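
The functional programming style listed above can be previewed without a cluster. The following is a minimal plain-Python sketch, not Spark code: it uses the built-in `map` and `filter` and `functools.reduce` on an ordinary list, with made-up sample data (`numbers`), to show the pattern of passing small functions to operations on a collection.

```python
from functools import reduce

# Hypothetical sample data; a plain Python list stands in for a dataset
numbers = [1, 2, 3, 4, 5, 6]

# map: apply a function to every element
squares = list(map(lambda x: x * x, numbers))

# filter: keep only the elements for which the function returns True
even_squares = list(filter(lambda x: x % 2 == 0, squares))

# reduce: combine all elements into a single value
total = reduce(lambda x, y: x + y, even_squares)

print(squares)
print(even_squares)
print(total)
```

The RDD examples later in this notebook follow the same pattern, with `lambda` functions passed to `filter`, `map`, and `flatMap`.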

---

### 1. Spark Benefits

- Designed to be fast  
Jobs should not take hours to return, because analysts need to work with data interactively  

- Designed to handle big data

- General purpose  
Unlike Hadoop, Spark provides several modules in one place: 
  - Machine learning
  - SQL queries
  - Streaming
  - Graph analytics


- Caching is possible, so intermediate data can be stored in memory on workers

- Highly accessible: simple APIs for Python, Java, Scala, R, SQL  
Integrates with other Big Data tools such as Hadoop and Cassandra  
Can access data in HDFS, Amazon S3, and others

---

### 2. Set up a Spark Session with Minimal Parameters

- Use the local machine as the master (`local` runs Spark on this machine with a single worker thread)
- Name the app

In [None]:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
        .master("local") \
        .appName("pyspark_test") \
        .getOrCreate()

In [None]:
# print info about the session
spark

In [None]:
# need this for working with RDDs
sc = spark.sparkContext

---

### 3. RDDs (Resilient Distributed Datasets)

We study RDD objects, which are the most basic abstraction in Spark. 

#### 3.1 RDD Background

RDDs have these properties:

- **resilient**: each RDD keeps a list of dependencies that tells Spark how it was constructed from its inputs.  
  If the RDD is lost or corrupted, Spark can recreate it from those dependencies.
  
- **partitioned**: an RDD stores its data in pieces called `partitions`. Spark automatically partitions RDDs and distributes the partitions across nodes in the cluster.
  
- **distributed**: placing the partitions across nodes in the cluster allows for storing massive datasets that wouldn't fit on a single machine. 
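
To make these properties concrete, here is a toy plain-Python model. It is an illustration under simplifying assumptions, not how Spark is implemented: the helper names `make_partitions`, `double`, and `recompute_partition` and the sample data are hypothetical. The input is split into partitions, a transformation plays the role of a dependency, and a "lost" partition is rebuilt from the input by reapplying that transformation.

```python
def make_partitions(data, num_partitions):
    """Split a list into num_partitions roughly equal slices (toy partitioner)."""
    size = -(-len(data) // num_partitions)  # ceiling division
    return [data[i * size:(i + 1) * size] for i in range(num_partitions)]


def double(x):
    """The transformation that defines the derived dataset."""
    return x * 2


def recompute_partition(source_partitions, transform, index):
    """Rebuild one partition by reapplying the transformation to its source."""
    return [transform(x) for x in source_partitions[index]]


# Hypothetical input, split into 3 partitions (think: 3 nodes)
source = make_partitions(list(range(10)), 3)

# Derived dataset: each partition is transformed independently
derived = [[double(x) for x in part] for part in source]

# Simulate losing partition 1, then recover it from the dependency
derived[1] = None
derived[1] = recompute_partition(source, double, 1)

print(source)
print(derived)
```

Only the lost partition is recomputed; the other partitions are left untouched.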

**RDD History**

Before Spark 2.0, the main programming interface of Spark was the *Resilient Distributed Dataset (RDD)*.  

Starting with Spark 2.0, the *Dataset* and *DataFrame* APIs became the main programming interface (DataFrames first appeared in Spark 1.3 and Datasets in Spark 1.6). They are built on top of RDDs.  
We work with DataFrames later.

The RDD interface is still supported.  

---

#### 3.2 Computing with RDDs

We will look at several examples.

**Example 1: Read lines from text file**

In [None]:
data_filename = 'README.txt'

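In [None]:
# create an RDD with one record per line of the file
lines = sc.textFile(data_filename)

In [None]:
# number of lines in the file
lines.count()

In [None]:
# first line
lines.first()

In [None]:
# bring all lines back to the driver as a Python list
lines.collect()

In [None]:
type(lines.collect())

In [None]:
# second line (index 1)
lines.collect()[1]

In [None]:
# each record is a string
type(lines.collect()[0])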

**Example 2: Text Search - apply a filter to keep only the lines containing “Spark”**

In [None]:
spark_lines = lines.filter(lambda x: "Spark" in x)

In [None]:
# return list of first 5 records
spark_lines.take(5)   

In [None]:
type(spark_lines)
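
The predicate passed to `filter` is an ordinary Python function, so it can be checked on a few strings before it is run on an RDD. Below is a small sketch with made-up sample lines (`sample_lines` and `contains_spark` are hypothetical names, and the strings are not the contents of `README.txt`). Note that the `in` test is case-sensitive.

```python
# Hypothetical sample lines; these are not taken from README.txt
sample_lines = [
    "Apache Spark is a cluster computing system",
    "spark-shell starts an interactive session",
    "See the documentation for details",
]

# Same predicate as the one passed to filter in Example 2
contains_spark = lambda x: "Spark" in x

# Plain-Python equivalent of applying the filter and collecting the result
matches = [line for line in sample_lines if contains_spark(line)]
print(matches)
```

The second line is dropped because it contains only the lowercase `spark`.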

**Example 3: Word Count**

In [None]:
# Read the file into an RDD
lines = sc.textFile(data_filename)

In [None]:
type(lines)

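In [None]:
# split each line on whitespace and flatten the results into one RDD of words
words = lines.flatMap(lambda x: x.split())

In [None]:
words.take(5)

In [None]:
# pair each word with 1, sum the 1s per word, swap to (count, word),
# then sort by count in descending order
wordcounts = words.map(lambda x: (x, 1)) \
                  .reduceByKey(lambda x, y: x + y) \
                  .map(lambda x: (x[1], x[0])) \
                  .sortByKey(False)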

In [None]:
wordcounts.take(10)
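
To see what each stage of the Word Count pipeline produces, the same steps can be traced in plain Python on a tiny in-memory sample. This is a sketch for intuition only: `sample_lines` is made up, and a dictionary stands in for `reduceByKey`, which Spark performs across partitions.

```python
# Hypothetical sample input; stands in for the lines RDD
sample_lines = ["to be or not to be", "to see or not to see"]

# flatMap: split every line and flatten into one list of words
words = [word for line in sample_lines for word in line.split()]

# map: pair each word with a count of 1
pairs = [(word, 1) for word in words]

# reduceByKey: add up the counts for each distinct word
counts = {}
for word, n in pairs:
    counts[word] = counts.get(word, 0) + n

# map: swap to (count, word) so that the count becomes the key
swapped = [(n, word) for word, n in counts.items()]

# sortByKey(False): sort by count, largest first
wordcounts = sorted(swapped, key=lambda pair: pair[0], reverse=True)

print(wordcounts[:3])
```

The swap to `(count, word)` is what lets `sortByKey` order the pairs by frequency. Only the ordering by count is determined by the sort, so words with equal counts should not be expected in any particular order.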

---

**TRY FOR YOURSELF (UNGRADED EXERCISES)**

1) Convert the Word Count example into a function called `word_count()`.

The function assumes you have already read in the text file into `lines`.  
It should take two inputs: 
- `lines`: the RDD containing the text
- `num_records`: the number of word count pairs to return

It should output a list of the `num_records` most frequent word count pairs.  
Enter the code for `word_count()` in the cell below.

In [None]:
## definition of word_count()


Now test that `word_count()` returns the expected result.  
Also ensure that the output type is a list.

In [None]:
## test function: word_count()
## type(output) should be list


**SOLUTIONS**

In [None]:
def word_count(lines, num_records):
    """Return the num_records most frequent (count, word) pairs as a list."""
    wordcounts = lines.flatMap(lambda x: x.split()) \
                      .map(lambda x: (x, 1)) \
                      .reduceByKey(lambda x, y: x + y) \
                      .map(lambda x: (x[1], x[0])) \
                      .sortByKey(False)

    return wordcounts.take(num_records)

In [None]:
out = word_count(lines, 10)
out

In [None]:
type(out)