<a href="https://colab.research.google.com/github/aaabhijith13/linkedIN_posts/blob/main/Spark_rdd's.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**This simple word count example demonstrates Spark's core strengths: lazy evaluation, DAG-based execution, in-memory processing, and fault tolerance through lineage, all without writing intermediate results to a distributed file system between steps the way chained Hadoop MapReduce jobs do.**

Author: Abhijith

Date: 1/6/2026

Happy New Year!

# Install PySpark


In [1]:
!pip install pyspark



# Start PySpark and Create the SparkContext

In [2]:
from pyspark import SparkConf, SparkContext

conf = SparkConf() \
    .setAppName("RDD-Deep-Dive") \
    .setMaster("local[*]")

sc = SparkContext.getOrCreate(conf)


# Creating RDDs


In [3]:
data = """Spark is a fast and general-purpose cluster computing system
Spark provides high-level APIs in Java Python Scala and R
Spark is optimized for large-scale data processing
Hadoop MapReduce is a disk-based processing framework
Spark improves performance by keeping data in memory
ERROR Failed to connect to data source
Spark supports batch processing streaming and machine learning
ERROR Timeout occurred while reading data
HDFS stores data reliably across distributed nodes
Spark runs on top of YARN Mesos or its own standalone cluster manager
ERROR Disk read failure on worker node
Big data processing requires fault tolerance and scalability
Spark uses lazy evaluation to optimize execution plans
MapReduce processes data in multiple stages
ERROR Network failure during shuffle phase
"""
with open('spark_file.txt', 'w') as file:
    file.write(data)

In [4]:
rdd = sc.textFile("spark_file.txt")
rdd.take(5)

['Spark is a fast and general-purpose cluster computing system',
 'Spark provides high-level APIs in Java Python Scala and R',
 'Spark is optimized for large-scale data processing',
 'Hadoop MapReduce is a disk-based processing framework',
 'Spark improves performance by keeping data in memory']

In [5]:
numbers = sc.parallelize([1, 2, 3, 4, 5, 9, 67, 97, 2122, 3, 4, 4, 5, 2, 5,])
numbers.collect()

[1, 2, 3, 4, 5, 9, 67, 97, 2122, 3, 4, 4, 5, 2, 5]

In [6]:
numbers.getNumPartitions()

2

In [7]:
numbers = numbers.repartition(4)
numbers.getNumPartitions()

4

# Transformations (Lazy Operations)

Transformations do not trigger execution.

In [8]:
mapped = rdd.map(lambda line: line.upper())
filtered = mapped.filter(lambda line: "ERROR" in line)
# No execution yet: map and filter only record the lineage

In [9]:
filtered.count()  # count() is an action, so the pipeline is executed now

4
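
The lazy behaviour above can be mimicked in plain Python with generators. This is only an analogy, not how Spark is implemented: the `calls` list and the `to_upper` helper are hypothetical names used to show that nothing runs until the pipeline is consumed.

```python
# Plain-Python analogy for lazy transformations (not Spark itself).
calls = []  # records every time the mapping function actually runs


def to_upper(line):
    calls.append(line)
    return line.upper()


lines = ["Spark is fast", "ERROR Disk read failure", "ERROR Timeout occurred"]

# "Transformations": generator expressions describe the work but do not run it.
mapped = (to_upper(line) for line in lines)
filtered = (line for line in mapped if "ERROR" in line)
calls_before_action = len(calls)  # nothing has executed yet

# "Action": consuming the generator forces the whole pipeline to run.
error_count = sum(1 for _ in filtered)
calls_after_action = len(calls)  # every input line has now been processed

print(calls_before_action, error_count, calls_after_action)
```

As with `filtered.count()` in the notebook, the work happens only when a result is requested.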

In [10]:
filtered.toDebugString() #string representation of the RDD lineage

b'(2) PythonRDD[10] at RDD at PythonRDD.scala:56 []\n |  spark_file.txt MapPartitionsRDD[1] at textFile at NativeMethodAccessorImpl.java:0 []\n |  spark_file.txt HadoopRDD[0] at textFile at NativeMethodAccessorImpl.java:0 []'


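#### Lineage
Reading the output above from top to bottom (newest RDD first; the leading (2) is the number of partitions):
1.   PythonRDD[10] at RDD at PythonRDD.scala:56: the map and filter steps, pipelined into a single Python RDD
2.   spark_file.txt MapPartitionsRDD[1] at textFile at NativeMethodAccessorImpl.java:0: the lines of text produced by textFile
3.   spark_file.txt HadoopRDD[0] at textFile at NativeMethodAccessorImpl.java:0: the raw read of spark_file.txt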
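
In this session `toDebugString()` returned a bytes object, which is why the output shows a `b'...'` literal with `\n` escapes. A minimal sketch of making it readable, using the bytes copied from the output above in place of a live call (in the notebook, `raw` would be `filtered.toDebugString()`):

```python
# Lineage bytes copied from the cell output above; in the notebook this value
# would come from filtered.toDebugString().
raw = (
    b"(2) PythonRDD[10] at RDD at PythonRDD.scala:56 []\n"
    b" |  spark_file.txt MapPartitionsRDD[1] at textFile at NativeMethodAccessorImpl.java:0 []\n"
    b" |  spark_file.txt HadoopRDD[0] at textFile at NativeMethodAccessorImpl.java:0 []"
)

# Decode only if the value is bytes, so the same code also works on a str.
text = raw.decode("utf-8") if isinstance(raw, bytes) else raw
lineage_steps = text.split("\n")

for step in lineage_steps:
    print(step)
```

Each printed line is one RDD in the lineage, with the most recent RDD first.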



# Caching RDDs (Performance Optimization)

In [11]:
mapped = rdd.map(lambda line: line.upper())

errors = rdd.filter(lambda line: "ERROR" in line)

errors.cache()  # lazy: only marks the RDD for caching; the first action fills the cache

PythonRDD[11] at RDD at PythonRDD.scala:56

In [12]:
errors.count(), errors.take(10) #count of the number of error messages in the file above

(4,
 ['ERROR Failed to connect to data source',
  'ERROR Timeout occurred while reading data',
  'ERROR Disk read failure on worker node',
  'ERROR Network failure during shuffle phase'])



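1. Cached in executor memory: cache() only marks the RDD; the first action (the count() above) computes its partitions and stores them
2. Avoids recomputation: later actions on errors reuse the stored partitions instead of re-reading and re-filtering the file
3. Programmer-controlled: Spark does not cache RDDs on its own; call cache() or persist() to keep one and unpersist() to release it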
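
The three points above can be illustrated with a toy class in plain Python. `CachedPipeline` and its `computations` counter are hypothetical names invented for this sketch; it imitates the observable behaviour of `cache()` and is not Spark's implementation.

```python
class CachedPipeline:
    """Toy stand-in for a filtered RDD with cache(); not a Spark class."""

    def __init__(self, source, predicate):
        self.source = source
        self.predicate = predicate
        self.use_cache = False
        self._cached = None
        self.computations = 0  # how many times the filter actually ran

    def cache(self):
        # Like RDD.cache(), this only sets a flag; nothing is computed yet.
        self.use_cache = True
        return self

    def _materialize(self):
        if self._cached is not None:
            return self._cached  # reuse the stored result
        self.computations += 1
        result = [x for x in self.source if self.predicate(x)]
        if self.use_cache:
            self._cached = result
        return result

    def count(self):
        return len(self._materialize())

    def take(self, n):
        return self._materialize()[:n]


lines = ["Spark is fast", "ERROR Disk read failure", "ERROR Timeout occurred"]

errors_cached = CachedPipeline(lines, lambda l: "ERROR" in l).cache()
computations_after_cache = errors_cached.computations  # cache() ran nothing
n_errors = errors_cached.count()   # first action: computes and stores
first_error = errors_cached.take(1)  # second action: served from the cache

errors_uncached = CachedPipeline(lines, lambda l: "ERROR" in l)
errors_uncached.count()
errors_uncached.take(1)  # recomputed, because nothing was stored

print(errors_cached.computations, errors_uncached.computations)
```

With caching the filter runs once for two actions; without it, the filter runs once per action.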



In [13]:
full_example = sc.textFile("spark_file.txt")

word_count = (
    full_example.flatMap(lambda line: line.split())
       .map(lambda word: (word.lower(), 1))
       .reduceByKey(lambda a, b: a + b)
)

word_count.saveAsTextFile("finalwordcount")  # action: writes part files into this directory; fails if it already exists

In [14]:
word_count.collect() #Use only when results fit in driver memory. Spark builds a DAG and executes the entire pipeline.

[('fast', 1),
 ('and', 4),
 ('computing', 1),
 ('java', 1),
 ('python', 1),
 ('optimized', 1),
 ('for', 1),
 ('hadoop', 1),
 ('framework', 1),
 ('performance', 1),
 ('by', 1),
 ('memory', 1),
 ('error', 4),
 ('failed', 1),
 ('to', 3),
 ('supports', 1),
 ('batch', 1),
 ('streaming', 1),
 ('machine', 1),
 ('learning', 1),
 ('occurred', 1),
 ('while', 1),
 ('reading', 1),
 ('stores', 1),
 ('distributed', 1),
 ('runs', 1),
 ('of', 1),
 ('mesos', 1),
 ('own', 1),
 ('disk', 1),
 ('read', 1),
 ('failure', 2),
 ('big', 1),
 ('requires', 1),
 ('fault', 1),
 ('tolerance', 1),
 ('uses', 1),
 ('lazy', 1),
 ('optimize', 1),
 ('processes', 1),
 ('multiple', 1),
 ('network', 1),
 ('shuffle', 1),
 ('phase', 1),
 ('spark', 7),
 ('is', 3),
 ('a', 2),
 ('general-purpose', 1),
 ('cluster', 2),
 ('system', 1),
 ('provides', 1),
 ('high-level', 1),
 ('apis', 1),
 ('in', 3),
 ('scala', 1),
 ('r', 1),
 ('large-scale', 1),
 ('data', 7),
 ('processing', 4),
 ('mapreduce', 2),
 ('disk-based', 1),
 ('improves', 1

In [15]:
word_count.count()  # number of distinct (lowercased) words

85
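
To make the three steps of the word count concrete, here is a plain-Python sketch of the same flatMap, map, and reduceByKey logic on two lines from the sample file. It runs on a single machine with no partitions or shuffle, so it only mirrors the logic; `word_count_local` is a name chosen for this example.

```python
from collections import defaultdict
from itertools import chain

lines = [
    "Spark is optimized for large-scale data processing",
    "Spark improves performance by keeping data in memory",
]

# flatMap: split every line and flatten the pieces into one stream of words
words = chain.from_iterable(line.split() for line in lines)

# map: pair each lowercased word with a count of 1
pairs = ((word.lower(), 1) for word in words)

# reduceByKey: merge the values of each key with a + b
counts = defaultdict(int)
for word, one in pairs:
    counts[word] = counts[word] + one

word_count_local = dict(counts)
print(word_count_local["spark"], word_count_local["data"], len(word_count_local))
```

In Spark the merge step additionally moves all pairs with the same key to the same partition, which is the shuffle visible in the lineage output.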

In [16]:
print(word_count.toDebugString())

b'(2) PythonRDD[24] at collect at /tmp/ipython-input-879584768.py:1 []\n |  MapPartitionsRDD[20] at mapPartitions at PythonRDD.scala:168 []\n |  ShuffledRDD[19] at partitionBy at NativeMethodAccessorImpl.java:0 []\n +-(2) PairwiseRDD[18] at reduceByKey at /tmp/ipython-input-2954201668.py:6 []\n    |  PythonRDD[17] at reduceByKey at /tmp/ipython-input-2954201668.py:6 []\n    |  spark_file.txt MapPartitionsRDD[16] at textFile at NativeMethodAccessorImpl.java:0 []\n    |  spark_file.txt HadoopRDD[15] at textFile at NativeMethodAccessorImpl.java:0 []'


In [17]:
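word_count.unpersist()  # drops any cached or persisted copy of this RDD from memory and/or disk
# word_count was never cached, so nothing is removed here. Either way, only stored data is dropped:
# the lineage stays, so Spark can still recompute all 85 items on demand.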

PythonRDD[24] at collect at /tmp/ipython-input-879584768.py:1

# Fault Tolerance via Lineage

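If a partition of any step is lost, Spark replays the lineage needed to rebuild it:

textFile → flatMap → map → reduceByKey

No partition is actually lost in this notebook. Because word_count holds no cached data, the .collect() below makes Spark rebuild the result from its lineage, which is the same mechanism it uses to recover lost partitions.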
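
The idea of rebuilding a lost partition from lineage can be sketched in plain Python. This is a simplified analogy under stated assumptions: the partitions are hand-made lists, the lineage is a list of per-partition functions (a map and a filter, with no shuffle step), and `compute` is a hypothetical helper, not a Spark API.

```python
# Source data split into two hand-made "partitions".
source_partitions = [
    ["Spark is fast", "ERROR Disk read failure"],
    ["ERROR Timeout occurred", "HDFS stores data"],
]

# The lineage: an ordered list of steps that each work on one partition.
lineage = [
    lambda part: [line.upper() for line in part],             # map
    lambda part: [line for line in part if "ERROR" in line],  # filter
]


def compute(partition):
    """Replay every lineage step on a single source partition."""
    for step in lineage:
        partition = step(partition)
    return partition


results = [compute(p) for p in source_partitions]

# Simulate losing the second result partition (for example, a worker failure).
results[1] = None

# Recovery: replay the lineage for the lost partition only.
results[1] = compute(source_partitions[1])
print(results)
```

Only the lost partition is recomputed; the surviving one is left untouched, which is what makes lineage-based recovery cheaper than replicating every intermediate result.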

In [18]:
word_count.collect() #Use only when results fit in driver memory.

[('fast', 1),
 ('and', 4),
 ('computing', 1),
 ('java', 1),
 ('python', 1),
 ('optimized', 1),
 ('for', 1),
 ('hadoop', 1),
 ('framework', 1),
 ('performance', 1),
 ('by', 1),
 ('memory', 1),
 ('error', 4),
 ('failed', 1),
 ('to', 3),
 ('supports', 1),
 ('batch', 1),
 ('streaming', 1),
 ('machine', 1),
 ('learning', 1),
 ('occurred', 1),
 ('while', 1),
 ('reading', 1),
 ('stores', 1),
 ('distributed', 1),
 ('runs', 1),
 ('of', 1),
 ('mesos', 1),
 ('own', 1),
 ('disk', 1),
 ('read', 1),
 ('failure', 2),
 ('big', 1),
 ('requires', 1),
 ('fault', 1),
 ('tolerance', 1),
 ('uses', 1),
 ('lazy', 1),
 ('optimize', 1),
 ('processes', 1),
 ('multiple', 1),
 ('network', 1),
 ('shuffle', 1),
 ('phase', 1),
 ('spark', 7),
 ('is', 3),
 ('a', 2),
 ('general-purpose', 1),
 ('cluster', 2),
 ('system', 1),
 ('provides', 1),
 ('high-level', 1),
 ('apis', 1),
 ('in', 3),
 ('scala', 1),
 ('r', 1),
 ('large-scale', 1),
 ('data', 7),
 ('processing', 4),
 ('mapreduce', 2),
 ('disk-based', 1),
 ('improves', 1

In [19]:
word_count.count()  # Same count as before unpersist(): the result is recomputed from the lineage

85

# Cleanup

In [20]:
sc.stop()