<a href="https://colab.research.google.com/github/arthi-rajendran-DS/Medium-Implementations/blob/main/PyBytes_Day22.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Resilient Distributed Dataset - CheatSheet

RDD stands for Resilient Distributed Dataset. It is one of the core data structures in Apache Spark, designed to handle distributed data processing tasks efficiently.

An RDD is an immutable, partitioned collection of objects that can be processed in parallel across a cluster of computers. Its partitions can be held in memory or persisted to disk. RDDs are fault-tolerant: a partition lost to a failure can be recomputed rather than restored from a copy.

Key characteristics of RDDs:

**Distributed**: RDDs are split into partitions that are spread across the nodes of a cluster, allowing parallel processing of data. Each partition is processed independently, and different partitions can run on different nodes.

**Resilient**: RDDs are fault-tolerant, meaning they can recover from node failures. If a partition is lost, Spark can recompute it from the RDD's lineage, the recorded chain of transformations that produced it.

**Immutable**: RDDs are read-only and cannot be modified once created. However, you can create new RDDs by transforming existing ones using various operations.

**Lazily Evaluated**: RDD transformations are lazily evaluated, which means they are not executed immediately. Transformations build a lineage of operations that are only computed when an action is called.

**Cacheable**: RDDs can be cached in memory, allowing faster access for subsequent operations. Caching is beneficial for iterative algorithms or when the same RDD is used multiple times.

RDDs provide a programming abstraction that lets developers write parallel, fault-tolerant data processing tasks without managing data distribution or failure recovery themselves. For structured and semi-structured data, however, the DataFrame and Dataset APIs have become the preferred interface because they are better optimized and easier to use.

Note: RDDs are Spark's low-level API. Since Spark 2.0, the DataFrame and Dataset APIs have been the recommended interface for most use cases, although the RDD API is still supported.

In [1]:
!pip install pyspark
!pip install findspark


In [3]:
import findspark
findspark.init()

from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()
# Create an RDD from a Python list.
rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5])
# Transformation (lazy): double every element. Nothing runs until an action is called.
transformed_rdd = rdd.map(lambda x: x * 2)
# Transformation (lazy): keep only the elements greater than 3.
filtered_rdd = rdd.filter(lambda x: x > 3)
# Action: combine all elements with a custom function (here, their sum).
result = rdd.reduce(lambda x, y: x + y)
# Action: bring every element back to the driver as a Python list.
collected_list = rdd.collect()
# Action: count the number of elements.
count = rdd.count()
# Action: write the RDD as text. The path becomes a directory of part files,
# and the call fails if that directory already exists.
rdd.saveAsTextFile("/content/sample_data/output.txt")