# PySpark RDD Tutorial 

An RDD (Resilient Distributed Dataset) is the fundamental building block of PySpark: a fault-tolerant, immutable, distributed collection of objects.
Immutable means that once you create an RDD you cannot change it. Each RDD is divided into logical partitions, which can be computed
on different nodes of the cluster.

In other words, an RDD is a collection of objects similar to a Python list, with the difference that an RDD is computed by several processes scattered
across multiple physical servers (also called nodes) in a cluster, while a Python collection lives and is processed in just one process.

Additionally, RDDs abstract away the partitioning and distribution of the data, which are designed to run computations in parallel on several nodes.
When applying transformations to an RDD, we don't have to worry about parallelism because PySpark provides it by default.

This PySpark RDD tutorial describes the basic operations available on RDDs, such as `map()`, `filter()`, `persist()`, and many more. In addition,
it explains Pair RDD functions that operate on RDDs of key-value pairs, such as `groupByKey()` and `join()`.

Note: RDDs can have a name and a unique identifier (id).

## PySpark RDD Benefits

PySpark is widely adopted for machine learning (ML) due to its advantages over traditional Python programming.

## In-Memory Processing

PySpark loads data from disk into memory and keeps it in memory. This is the main difference between PySpark and MapReduce, which is I/O intensive.
In between transformations, we can also cache/persist the RDD in memory to reuse previous computations.

## Immutability

PySpark RDDs are immutable in nature: once an RDD is created, you cannot modify it. When we apply a transformation to an RDD, PySpark creates a new RDD and
maintains the RDD lineage.

## Fault Tolerance

PySpark operates on fault-tolerant data stores such as HDFS and S3. If a partition of an RDD is lost, PySpark automatically recomputes it from the source data using the RDD lineage.
Also, when PySpark applications run on a cluster, failed tasks are automatically retried a certain number of times (as per the configuration) so that the application can finish seamlessly.

## Lazy Evaluation

PySpark does not evaluate RDD transformations as the driver encounters them. Instead, it records all transformations as it encounters
them (in a DAG) and evaluates them only when it sees the first RDD action.

## Partitioning

When you create an RDD from data, PySpark partitions its elements by default. By default, the Spark engine creates a number of partitions
equal to the number of cores available.

## PySpark RDD Limitations 

PySpark RDDs are not well suited to applications that make asynchronous, fine-grained updates to shared state, such as a storage system for a web application. For these
applications, it is more efficient to use systems that perform traditional update logging and data checkpointing, such as databases. The goal of RDDs is
to provide an efficient programming model for batch analytics and to leave these asynchronous applications aside.



