![Spark Image](https://upload.wikimedia.org/wikipedia/commons/thumb/f/f3/Apache_Spark_logo.svg/1200px-Apache_Spark_logo.svg.png)

## Introduction to RDDs

An RDD in Spark is simply an immutable distributed collection of objects. Each RDD is split into multiple partitions, which may be computed on different nodes of the cluster.<br>
RDDs are `immutable`, `fault-tolerant`, `parallel data structures` that let users explicitly persist intermediate results `in memory`, control their partitioning to optimize data placement, and `manipulate` them using a rich set of `operators`.

## RDD Operations

Resilient Distributed Datasets (RDDs) are a fundamental data structure in Apache Spark. They are designed to provide a simple, read-only, and distributed collection of objects that can be processed in parallel across a Spark cluster. Here’s a brief overview of the key features and characteristics of RDDs:

### Key Features of RDDs

1. **Resilient**: RDDs are fault-tolerant, meaning they can recover quickly from node failures. They achieve this by lineage, which allows them to rebuild lost data using the operations that originally created them.

2. **Distributed**: The data in an RDD is distributed across many nodes in a cluster, allowing computations to be performed on multiple nodes simultaneously.

3. **Immutable**: Once an RDD is created, it cannot be changed. Transformations on an RDD create a new RDD. This immutability helps to ensure consistency during computations.

4. **Lazy Evaluation**: RDDs use lazy evaluation, meaning that the computation on them is not executed immediately after a transformation is applied. Instead, the execution is delayed until an action (like `collect()`, `count()`, `reduce()`) that requires a result to be returned to the driver program is called.

5. **Parallel**: Operations on RDDs are inherently parallel, and Spark automatically distributes the data contained in RDDs across the cluster and parallelizes the operations that process the data.

### Creating and Manipulating RDDs

RDDs can be created through two primary methods:

1. **Loading an external dataset**: Spark can create RDDs from files stored in various storage systems like HDFS (Hadoop Distributed File System), S3 (Amazon S3), or local file systems.

2. **Parallelizing an existing collection**: RDDs can be created by parallelizing an existing collection in your driver program, such as a list or array, using SparkContext’s `parallelize` method.

### Operations on RDDs

RDDs provide a rich set of commonly needed data processing operations. They include the ability to perform data transformation, filtering, grouping, joining, aggregation, sorting, and counting.<br>
Each row in a dataset is represented as an object whose structure is opaque to Spark (a Java object in the JVM APIs, a Python object in PySpark). The user of an RDD has complete control over how to manipulate this object. This flexibility comes with responsibility: some commonly needed operations, such as computing an average, have to be handcrafted. Higher-level abstractions such as the Spark SQL component provide this functionality out of the box.<br>
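
As a rough illustration of what "handcrafted" means here, the sketch below computes an average the way one might with an RDD's `aggregate` action: carry a `(sum, count)` pair through each partition, then merge the pairs. It is plain Python with no Spark dependency; the `partitions` list and the helper names `seq_op` and `comb_op` are assumptions made for this illustration.

```python
from functools import reduce

# Hypothetical data, pre-split into two "partitions" to mimic an RDD.
partitions = [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]]

def seq_op(acc, value):
    """Fold one element into a (running_sum, running_count) pair."""
    return (acc[0] + value, acc[1] + 1)

def comb_op(left, right):
    """Merge the (sum, count) pairs produced by two partitions."""
    return (left[0] + right[0], left[1] + right[1])

# Step 1: each partition is reduced independently (in Spark, in parallel).
partials = [reduce(seq_op, part, (0, 0)) for part in partitions]

# Step 2: the partial results are merged into a single (sum, count) pair.
total, count = reduce(comb_op, partials, (0, 0))
average = total / count
print(average)
```

With a real RDD, the same two functions could be passed to `rdd.aggregate((0, 0), seq_op, comb_op)`, leaving only the final division to be done on the driver.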

***The RDD operations are classified into two types: `transformations` and `actions`***

| Type | Evaluation | Returned Value |
|--|--|--|
| Transformation | Lazy | Another RDD |
| Action | Eager | Some result or write result to disk |

Transformation operations are lazily evaluated, meaning Spark will delay the evaluations of the invoked operations until an action is taken. In other words, the transformation operations merely record the specified transformation logic and will apply them at a later point. On the other hand, invoking an action operation will trigger the evaluation of all the transformations that preceded it, and it will either return some result to the driver or write data to a storage system, such as HDFS or the local file system.

## Creating RDDs

**There are two ways to create RDDs:**

**`The first way to create an RDD is to parallelize a Python object, meaning converting it to a distributed dataset that can be operated on in parallel.`**

In [0]:
stringList = ["Spark is awesome","Spark is cool"]
stringRDD = sc.parallelize(stringList)

In [0]:
stringRDD

*Notice that evaluating `stringRDD` only displays a description of the RDD object, not its contents. Because of Spark's lazy evaluation, the data is not materialized until you call an action on that RDD.*

In [0]:
stringRDD.collect()

*`.collect()` is an `action`. As its name suggests, it collects all the rows from each of the partitions in an RDD and brings them over to the driver program.*

**`The second way to create an RDD is to read a dataset from a storage system, which can be a local computer file system, HDFS, Cassandra, Amazon S3, and so on.`**

In [0]:
testdata = sc.textFile("/FileStore/tables/testdata.txt")

In [0]:
testdata.collect()[:5]

In this particular example the file has about 1M rows, so calling `.collect()` on it did not take long. If your RDD contains 100 billion rows, however, invoking the collect action is not a good idea because the driver program most likely doesn’t have sufficient memory to hold all those rows. As a result, the driver will most likely run into an out-of-memory error and your Spark application or shell will die. This action is typically used once the RDD has been filtered down to a size that fits in the driver program's memory. To inspect just a few rows, `take(n)` is the safer choice.

In [0]:
# take(5) returns only the first five rows to the driver
testdata.take(5)

### Lazy Evaluation and Performance Features of Azure Databricks:

#### Lazy Evaluation:
Lazy evaluation is a powerful feature in Apache Spark and Azure Databricks. It refers to the delayed execution of operations on RDDs (Resilient Distributed Datasets) until an action is triggered.
When you perform transformations (such as map, filter, or join) on an RDD, Spark doesn’t immediately execute them. Instead, it builds a logical execution plan (known as the RDD lineage) that represents the sequence of transformations.

The actual computation occurs only when an action (such as collect, count, or saveAsTextFile) is called. At that point, Spark turns the lineage into stages of tasks, pipelining consecutive narrow transformations such as map and filter so that each partition is processed in a single pass.

#### Lazy evaluation provides several benefits:
- **Optimization Opportunities:** Because Spark sees the whole chain of operations before running anything, it can plan the work as a unit. For RDDs this mainly means pipelining transformations within a stage; for DataFrames and Spark SQL, the Catalyst optimizer can additionally apply rewrites such as predicate pushdown to reduce the amount of data read and processed.
- **Efficient Resource Utilization:** By deferring computation until necessary, Spark avoids materializing unnecessary intermediate results and optimizes resource usage.

**Example:**
Suppose you have an RDD with millions of records, and you want to filter out specific data. With lazy evaluation, Spark won’t process the entire dataset until you explicitly request the filtered results.
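
The idea can be modelled in plain Python with generators, which are also lazy: building the pipeline does no work, and elements are only processed when something consumes the result. This is an analogy rather than Spark code; `records`, `parse`, and `processed` are hypothetical names.

```python
processed = []  # records which elements were actually touched

def parse(x):
    processed.append(x)
    return x * 10

records = range(1_000_000)

# "Transformations": generator expressions only describe the work.
mapped = (parse(x) for x in records)
filtered = (y for y in mapped if y % 20 == 0)

touched_before = len(processed)  # nothing has run yet

# "Action": consuming the pipeline finally pulls elements through it.
first_three = [next(filtered) for _ in range(3)]
touched_after = len(processed)

print(touched_before, first_three, touched_after)
```

Only the handful of elements needed to produce three results are ever parsed, even though the source has a million records.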

### Eager Functions
Eager Functions, on the other hand, are operations that trigger immediate computation. These are typically actions in Spark’s terminology, such as collect(), count(), and show(), which require the system to execute all transformations up to that point and produce an output.

#### Usage of Eager Functions:
- **Immediate Results:** Useful for obtaining immediate results from computations, necessary for debugging, testing, or iterative development.
- **Trigger Execution:** They are necessary to trigger the actual job execution in a Spark application.

## Transformations

Transformations are operations on RDDs that return a new RDD. Transformed RDDs are computed lazily, only when you use them in an action.

The following table describes commonly used transformations.

<table>
<tbody><tr><th style="width:25%">Transformation</th><th>Meaning</th></tr>
<tr>
  <td> <b>map</b>(<i>func</i>) </td>
  <td> Return a new distributed dataset formed by passing each element of the source through a function <i>func</i>. </td>
</tr>
<tr>
  <td> <b>filter</b>(<i>func</i>) </td>
  <td> Return a new dataset formed by selecting those elements of the source on which <i>func</i> returns true. </td>
</tr>
<tr>
  <td> <b>flatMap</b>(<i>func</i>) </td>
  <td> Similar to map, but each input item can be mapped to 0 or more output items (so <i>func</i> should return a Seq rather than a single item). </td>
</tr>
<tr>
  <td> <b>mapPartitions</b>(<i>func</i>) <a name="MapPartLink"></a> </td>
  <td> Similar to map, but runs separately on each partition (block) of the RDD, so <i>func</i> must be of type
    Iterator&lt;T&gt; =&gt; Iterator&lt;U&gt; when running on an RDD of type T. </td>
</tr>
<tr>
  <td> <b>mapPartitionsWithIndex</b>(<i>func</i>) </td>
  <td> Similar to mapPartitions, but also provides <i>func</i> with an integer value representing the index of
  the partition, so <i>func</i> must be of type (Int, Iterator&lt;T&gt;) =&gt; Iterator&lt;U&gt; when running on an RDD of type T.
  </td>
</tr>
<tr>
  <td> <b>sample</b>(<i>withReplacement</i>, <i>fraction</i>, <i>seed</i>) </td>
  <td> Sample a fraction <i>fraction</i> of the data, with or without replacement, using a given random number generator seed. </td>
</tr>
<tr>
  <td> <b>union</b>(<i>otherDataset</i>) </td>
  <td> Return a new dataset that contains the union of the elements in the source dataset and the argument. </td>
</tr>
<tr>
  <td> <b>intersection</b>(<i>otherDataset</i>) </td>
  <td> Return a new RDD that contains the intersection of elements in the source dataset and the argument. </td>
</tr>
<tr>
  <td> <b>distinct</b>([<i>numPartitions</i>]) </td>
  <td> Return a new dataset that contains the distinct elements of the source dataset.</td>
</tr>
<tr>
  <td> <b>groupByKey</b>([<i>numPartitions</i>]) <a name="GroupByLink"></a> </td>
  <td> When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable&lt;V&gt;) pairs. <br>
    <b>Note:</b> If you are grouping in order to perform an aggregation (such as a sum or
      average) over each key, using <code>reduceByKey</code> or <code>aggregateByKey</code> will yield much better
      performance.
    <br>
    <b>Note:</b> By default, the level of parallelism in the output depends on the number of partitions of the parent RDD.
      You can pass an optional <code>numPartitions</code> argument to set a different number of tasks.
  </td>
</tr>
<tr>
  <td> <b>reduceByKey</b>(<i>func</i>, [<i>numPartitions</i>]) <a name="ReduceByLink"></a> </td>
  <td> When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function <i>func</i>, which must be of type (V,V) =&gt; V. Like in <code>groupByKey</code>, the number of reduce tasks is configurable through an optional second argument. </td>
</tr>
<tr>
  <td> <b>aggregateByKey</b>(<i>zeroValue</i>)(<i>seqOp</i>, <i>combOp</i>, [<i>numPartitions</i>]) <a name="AggregateByLink"></a> </td>
  <td> When called on a dataset of (K, V) pairs, returns a dataset of (K, U) pairs where the values for each key are aggregated using the given combine functions and a neutral "zero" value. Allows an aggregated value type that is different than the input value type, while avoiding unnecessary allocations. Like in <code>groupByKey</code>, the number of reduce tasks is configurable through an optional second argument. </td>
</tr>
<tr>
  <td> <b>sortByKey</b>([<i>ascending</i>], [<i>numPartitions</i>]) <a name="SortByLink"></a> </td>
  <td> When called on a dataset of (K, V) pairs where K implements Ordered, returns a dataset of (K, V) pairs sorted by keys in ascending or descending order, as specified in the boolean <code>ascending</code> argument.</td>
</tr>
<tr>
  <td> <b>join</b>(<i>otherDataset</i>, [<i>numPartitions</i>]) <a name="JoinLink"></a> </td>
  <td> When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key.
    Outer joins are supported through <code>leftOuterJoin</code>, <code>rightOuterJoin</code>, and <code>fullOuterJoin</code>.
  </td>
</tr>
<tr>
  <td> <b>cogroup</b>(<i>otherDataset</i>, [<i>numPartitions</i>]) <a name="CogroupLink"></a> </td>
  <td> When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (Iterable&lt;V&gt;, Iterable&lt;W&gt;)) tuples. This operation is also called <code>groupWith</code>. </td>
</tr>
<tr>
  <td> <b>cartesian</b>(<i>otherDataset</i>) </td>
  <td> When called on datasets of types T and U, returns a dataset of (T, U) pairs (all pairs of elements). </td>
</tr>
<tr>
  <td> <b>pipe</b>(<i>command</i>, <i>[envVars]</i>) </td>
  <td> Pipe each partition of the RDD through a shell command, e.g. a Perl or bash script. RDD elements are written to the
    process's stdin and lines output to its stdout are returned as an RDD of strings. </td>
</tr>
<tr>
  <td> <b>coalesce</b>(<i>numPartitions</i>) <a name="CoalesceLink"></a> </td>
  <td> Decrease the number of partitions in the RDD to numPartitions. Useful for running operations more efficiently
    after filtering down a large dataset. </td>
</tr>
<tr>
  <td> <b>repartition</b>(<i>numPartitions</i>) </td>
  <td> Reshuffle the data in the RDD randomly to create either more or fewer partitions and balance it across them.
    This always shuffles all data over the network. <a name="RepartitionLink"></a></td>
</tr>
<tr>
  <td> <b>repartitionAndSortWithinPartitions</b>(<i>partitioner</i>) <a name="Repartition2Link"></a></td>
  <td> Repartition the RDD according to the given partitioner and, within each resulting partition,
  sort records by their keys. This is more efficient than calling <code>repartition</code> and then sorting within
  each partition because it can push the sorting down into the shuffle machinery. </td>
</tr>
</tbody></table>
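
To see why the `groupByKey` note in the table recommends `reduceByKey` for aggregations, the following plain-Python sketch counts how many records would have to cross the network under each strategy. It is a simplified model, not Spark's actual shuffle implementation; the `partitions` data and all three helper functions are hypothetical.

```python
from collections import defaultdict

# Hypothetical (key, value) pairs spread over two partitions.
partitions = [
    [("a", 1), ("b", 1), ("a", 1), ("a", 1)],
    [("b", 1), ("a", 1), ("b", 1)],
]

def shuffled_by_group(parts):
    """groupByKey-style: every record is sent across the shuffle as-is."""
    return [pair for part in parts for pair in part]

def shuffled_by_reduce(parts, func):
    """reduceByKey-style: combine values per key inside each partition first."""
    out = []
    for part in parts:
        local = {}
        for key, value in part:
            local[key] = func(local[key], value) if key in local else value
        out.extend(local.items())
    return out

def final_sums(records):
    """Reduce side: merge whatever arrived from the shuffle."""
    totals = defaultdict(int)
    for key, value in records:
        totals[key] += value
    return dict(totals)

grouped = shuffled_by_group(partitions)
reduced = shuffled_by_reduce(partitions, lambda a, b: a + b)

print(len(grouped), len(reduced))                   # records shuffled
print(final_sums(grouped) == final_sums(reduced))   # same answer either way
```

Both strategies arrive at the same per-key totals, but the pre-combined version sends at most one record per key per partition.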

## Transformation Examples

### Map transformation

*Return a new RDD by applying a function to each element of this RDD*

This example will illustrate the difference between a transformation (lazy evaluation) and an action (eager evaluation) in Spark.

In [0]:
# Transformation: This operation is lazy; Spark will not execute it until an action is called.
stringRDD_uppercase= stringRDD.map(lambda x: x.upper())


In [0]:
# Action: collect the RDD's contents (eager evaluation)
# This operation is eager; Spark executes all transformations up to this point.
stringRDD_uppercase.collect()

- **Transformation (`map`):** This line only defines what operation should be performed and does not trigger any computation.
- **Action (`collect`):** This line actually triggers the execution of all preceding transformations. The entire lineage of transformations is executed only when this action is called.

In [0]:
def alternate_char_upper(text):
    new_text= []
    for i, character in enumerate(text):
        if i % 2 == 0:
            new_text.append(character.upper())
        else:
            new_text.append(character)
    return ''.join(new_text)
stringRDD_alternate_uppercase= stringRDD.map(alternate_char_upper)
stringRDD_alternate_uppercase.collect()  

### FlatMap Transformation

*Return a new RDD by first applying a function to all elements of this RDD, and then flattening the results*

In [0]:
flatMap_Split= stringRDD.flatMap(lambda x: x.split(" "))
flatMap_Split.collect()

### Difference Between Map and FlatMap

In [0]:
print("Split using Map transformation:")
map_Split= stringRDD.map(lambda x: x.split(" "))
map_Split.collect()

In [0]:
print("Split using FlatMap transformation:")
flatMap_Split.collect()
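
The difference is that `map` produces exactly one output element per input element (here, one list of words per sentence), whereas `flatMap` flattens those per-element results into a single sequence of words. A plain-Python equivalent, using the same two sentences and no Spark, might look like this:

```python
from itertools import chain

string_list = ["Spark is awesome", "Spark is cool"]

# map: one output per input, so the result is a list of lists.
mapped = [line.split(" ") for line in string_list]

# flatMap: apply the same function, then flatten one level.
flat_mapped = list(chain.from_iterable(line.split(" ") for line in string_list))

print(mapped)
print(flat_mapped)
```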

### Filter Transformation

*Return a new RDD containing only the elements that satisfy a predicate*

In [0]:
awesomeLineRDD = stringRDD.filter(lambda x: "awesome" in x)
awesomeLineRDD.collect()

In [0]:
sparkLineRDD = stringRDD.filter(lambda x: "spark" in x.lower())
sparkLineRDD.collect()

### Union Transformation

*Return a new RDD containing all items from two original RDDs. Duplicates are not culled.*

In [0]:
rdd1 = sc.parallelize([1,2,3,4,5])
rdd2 = sc.parallelize([1,6,7,8])
rdd3 = rdd1.union(rdd2)
rdd3.collect()

### Intersection Transformation

*Return the intersection of this RDD and another one. The output will not contain any duplicate elements, even if the input RDDs did.*

In [0]:
rdd1 = sc.parallelize(["One", "Two", "Three"])
rdd2 = sc.parallelize(["two","One","threed","One"])
rdd3 = rdd1.intersection(rdd2)
rdd3.collect()

### Subtract Transformation

*Return each value in `self` that is not contained in `other`.*

In [0]:
words = sc.parallelize(["The amazing thing about spark \
                is that it is very simple to learn"]).flatMap(lambda x: x.split(" ")).map(lambda c: c.lower())

stopWords = sc.parallelize(["the", "it", "is", "to", "that", ''])

realWords = words.subtract(stopWords)
realWords.collect()

### Distinct Transformation

*Return a new RDD containing distinct items from the original RDD (omitting all duplicates)*

In [0]:
duplicateValueRDD = sc.parallelize(["one", 1,"two", 2, "three", "one", "two", 1, 2])
duplicateValueRDD.distinct().collect()

### Sample Transformation

*Return a new RDD containing a statistical sample of the original RDD*

In [0]:
numbers = sc.parallelize([1,2,3,4,5,6,7,8,9,10], 2)
numbers.sample(True, 0.5).collect()

### GroupBy Transformation

*Group the data in the original RDD. Create pairs where the key is the output of a user function, and the value is all items for which the function yields this key.*

In [0]:
x = sc.parallelize(['John', 'Fred', 'Anna', 'James'])
y = x.groupBy(lambda w: w[0])
print([(k, list(v)) for (k, v) in y.collect()])

### GroupByKey Transformation

*Group the values for each key in the original RDD. Create a new pair where the original key corresponds to this collected group of values.*

In [0]:
x = sc.parallelize([('B',5),('B',4),('A',3),('A',2),('A',1)])
y = x.groupByKey()
print(x.collect())
print(list((j[0], list(j[1])) for j in y.collect()))

### MapPartitions Transformation

*Return a new RDD by applying a function to each partition of this RDD*

In [0]:
x = sc.parallelize([1,2,3], 2)
def f(iterator): yield sum(iterator); yield 42
y = x.mapPartitions(f)
# glom() gathers the elements of each partition into a list
print(x.glom().collect())
print(y.glom().collect())

### MapPartitionsWithIndex Transformation

*Return a new RDD by applying a function to each partition of this RDD, while tracking the index of the original partition.*

In [0]:
x = sc.parallelize([1,2,3], 2)
def f(partitionIndex, iterator): yield (partitionIndex, sum(iterator))
y = x.mapPartitionsWithIndex(f)
# glom() gathers the elements of each partition into a list
print(x.glom().collect())
print(y.glom().collect())

### Join Transformation

*Return a new RDD containing all pairs of elements having the same key in the original RDDs*

`join(otherRDD, numPartitions=None)`

In [0]:
x = sc.parallelize([("a", 1), ("b", 2)])
y = sc.parallelize([("a", 3), ("a", 4), ("b", 5)])
z = x.join(y)
print(z.collect())

### Coalesce Transformation

*Return a new RDD which is reduced to a smaller number of partitions*

`coalesce(numPartitions, shuffle=False)`

In [0]:
x = sc.parallelize([1, 2, 3, 4, 5], 3)
y = x.coalesce(2)
print(x.glom().collect())
print(y.glom().collect())

### KeyBy Transformation

*Create a Pair RDD, forming one pair for each item in the original RDD. The pair’s key is calculated from the value via a user-supplied function.*

In [0]:
x = sc.parallelize(['John', 'Fred', 'Anna', 'James'])
y = x.keyBy(lambda w: w[0])
print(y.collect())

### PartitionBy Transformation

*Return a new RDD with the specified number of partitions, placing original items into the partition returned by a user supplied function*

`partitionBy(numPartitions, partitionFunc=portable_hash)`

In [0]:
x = sc.parallelize([('J','James'),('F','Fred'),
('A','Anna'),('J','John')], 3)
y = x.partitionBy(2, lambda w: 0 if w[0] < 'H' else 1)
print(x.glom().collect())
print(y.glom().collect())
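
A simplified model of how the partition function in this example decides where each pair goes is sketched below. It assumes, as in PySpark's implementation, that the target partition is the function's result for the key modulo `numPartitions`; the helper `assign_partitions` is hypothetical, and the order of pairs inside each partition on a real cluster may differ.

```python
def assign_partitions(pairs, num_partitions, partition_func):
    """Place each (key, value) pair in bucket partition_func(key) % num_partitions."""
    buckets = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        buckets[partition_func(key) % num_partitions].append((key, value))
    return buckets

pairs = [("J", "James"), ("F", "Fred"), ("A", "Anna"), ("J", "John")]

# Keys before 'H' go to partition 0, all others to partition 1.
buckets = assign_partitions(pairs, 2, lambda k: 0 if k < "H" else 1)
print(buckets)
```

Note that the function receives only the key of each pair, which is why both `'J'` entries end up in the same partition.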

### Zip Transformation

*Return a new RDD containing pairs whose key is the item in the original RDD, and whose
value is that item’s corresponding element (same partition, same index) in a second RDD*

`zip(otherRDD)`

In [0]:
x = sc.parallelize([1, 2, 3])
y = x.map(lambda n:n*n)
z = x.zip(y)
print(z.collect())


## Actions

<table class="table">
<tbody><tr><th>Action</th><th>Meaning</th></tr>
<tr>
  <td> <b>reduce</b>(<i>func</i>) </td>
  <td> Aggregate the elements of the dataset using a function <i>func</i> (which takes two arguments and returns one). The function should be commutative and associative so that it can be computed correctly in parallel. </td>
</tr>
<tr>
  <td> <b>collect</b>() </td>
  <td> Return all the elements of the dataset as an array at the driver program. This is usually useful after a filter or other operation that returns a sufficiently small subset of the data. </td>
</tr>
<tr>
  <td> <b>count</b>() </td>
  <td> Return the number of elements in the dataset. </td>
</tr>
<tr>
  <td> <b>first</b>() </td>
  <td> Return the first element of the dataset (similar to take(1)). </td>
</tr>
<tr>
  <td> <b>take</b>(<i>n</i>) </td>
  <td> Return an array with the first <i>n</i> elements of the dataset. </td>
</tr>
<tr>
  <td> <b>takeSample</b>(<i>withReplacement</i>, <i>num</i>, [<i>seed</i>]) </td>
  <td> Return an array with a random sample of <i>num</i> elements of the dataset, with or without replacement, optionally pre-specifying a random number generator seed.</td>
</tr>
<tr>
  <td> <b>takeOrdered</b>(<i>n</i>, <i>[ordering]</i>) </td>
  <td> Return the first <i>n</i> elements of the RDD using either their natural order or a custom comparator. </td>
</tr>
<tr>
  <td> <b>saveAsTextFile</b>(<i>path</i>) </td>
  <td> Write the elements of the dataset as a text file (or set of text files) in a given directory in the local filesystem, HDFS or any other Hadoop-supported file system. Spark will call toString on each element to convert it to a line of text in the file. </td>
</tr>
<tr>
  <td> <b>saveAsSequenceFile</b>(<i>path</i>) <br> (Java and Scala) </td>
  <td> Write the elements of the dataset as a Hadoop SequenceFile in a given path in the local filesystem, HDFS or any other Hadoop-supported file system. This is available on RDDs of key-value pairs that implement Hadoop's Writable interface. In Scala, it is also
   available on types that are implicitly convertible to Writable (Spark includes conversions for basic types like Int, Double, String, etc). </td>
</tr>
<tr>
  <td> <b>saveAsObjectFile</b>(<i>path</i>) <br> (Java and Scala) </td>
  <td> Write the elements of the dataset in a simple format using Java serialization, which can then be loaded using
    <code>SparkContext.objectFile()</code>. </td>
</tr>
<tr>
  <td> <b>countByKey</b>() <a name="CountByLink"></a> </td>
  <td> Only available on RDDs of type (K, V). Returns a hashmap of (K, Int) pairs with the count of each key. </td>
</tr>
<tr>
  <td> <b>foreach</b>(<i>func</i>) </td>
  <td> Run a function <i>func</i> on each element of the dataset. This is usually done for side effects such as updating an Accumulator or interacting with external storage systems.
  <br><b>Note</b>: modifying variables other than Accumulators outside of the <code>foreach()</code> may result in undefined behavior. See Understanding closures for more details.</td>
</tr>
</tbody></table>

### GetNumPartitions Action

*Return the number of partitions in RDD*

In [0]:
x = sc.parallelize([1,2,3], 2)
y = x.getNumPartitions()
print(x.glom().collect())
print(y)

### Collect Action

*Return all items in the RDD to the driver in a single list*

In [0]:
x = sc.parallelize([1,2,3], 3)
y = x.collect()
print(x.glom().collect())
print(y)

### Count Action

*Return the number of elements in this RDD.*

In [0]:
numberRDD = sc.parallelize([1,2,3,4,5,6,7,8,9,10], 2)
numberRDD.count()

### First Action

*Return the first element in this RDD.*

In [0]:
numberRDD = sc.parallelize([1,2,3,4,5,6,7,8,9,10], 2)
numberRDD.first()

### Take Action

*Take the first num elements of the RDD.*

In [0]:
numberRDD = sc.parallelize([1,2,3,4,5,6,7,8,9,10], 2)
numberRDD.take(4)

### Reduce Action

*Aggregate all the elements of the RDD by applying a user function pairwise to elements and partial results, and return a result to the driver*

In [0]:
x = sc.parallelize([1,2,3,4])
y = x.reduce(lambda a,b: a+b)
print(x.collect())
print(y)
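
Because each partition is reduced separately and the partial results are then combined, the function passed to `reduce` needs to be commutative and associative. The plain-Python sketch below models that two-step process (the helper `partitioned_reduce` is hypothetical, and real Spark does not guarantee the order in which partial results are merged) and shows that addition is safe while subtraction is not.

```python
from functools import reduce

def partitioned_reduce(partitions, func):
    """Reduce each partition on its own, then reduce the partial results."""
    partials = [reduce(func, part) for part in partitions if part]
    return reduce(func, partials)

one_partition = [[1, 2, 3, 4]]
two_partitions = [[1, 2], [3, 4]]

add = lambda a, b: a + b
sub = lambda a, b: a - b

# Addition gives the same answer however the data is split.
add_results = (partitioned_reduce(one_partition, add),
               partitioned_reduce(two_partitions, add))

# Subtraction depends on the split, so it is not a valid reduce function.
sub_results = (partitioned_reduce(one_partition, sub),
               partitioned_reduce(two_partitions, sub))

print(add_results, sub_results)
```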

### Aggregate Action

Aggregate all the elements of the RDD by:
- applying a user function to combine elements with user-supplied objects,
- then combining those user-defined results via a second user function,
- and finally returning a result to the driver.

In [0]:
seqOp = lambda data, item: (data[0] + [item], data[1] + item)
combOp = lambda d1, d2: (d1[0] + d2[0], d1[1] + d2[1])
x = sc.parallelize([1,2,3,4])
y = x.aggregate(([], 0), seqOp, combOp)
print(y)

### Max Action

*Return the maximum item in the RDD*

In [0]:
x = sc.parallelize([2,4,1])
y = x.max()
print(x.collect())
print(y)

### The Catalyst Optimizer

The **Catalyst Optimizer** is an integral part of Spark SQL, designed to optimize query plans. Azure Databricks uses this component to enhance the performance of data processing tasks. Here’s how it works:

- **Logical Plan Generation:** Initially, a user's query is converted into a logical plan, which represents a tree of logical operators (e.g., filters, joins) without concern for how the operations will be performed.
- **Logical Plan Optimization:** The optimizer applies various rules to transform this logical plan into a more efficient one, such as predicate pushdown, constant folding, and boolean simplification.
- **Physical Planning:** Catalyst then uses a cost-based optimizer to generate multiple physical plans from the optimized logical plan, choosing the most cost-effective one based on statistics from the data.
- **Code Generation:** Finally, it uses whole-stage code generation techniques to compile parts of this physical plan into bytecode, which runs directly on the JVM, minimizing any runtime overhead.
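
To give a feel for what a rule-based rewrite such as constant folding does, here is a toy sketch in plain Python. It is not Catalyst's implementation (Catalyst is written in Scala and operates on its own tree classes); the tuple-based expression format and the `fold_constants` function are invented for illustration.

```python
import operator

OPS = {"+": operator.add, "*": operator.mul, ">": operator.gt}

def fold_constants(expr):
    """Recursively replace operator nodes whose operands are all literals."""
    if not isinstance(expr, tuple):
        return expr                      # literal or column name: leave as-is
    op, left, right = expr
    left, right = fold_constants(left), fold_constants(right)
    both_literals = all(isinstance(v, (int, float)) for v in (left, right))
    if both_literals:
        return OPS[op](left, right)      # evaluate once, at "planning" time
    return (op, left, right)

# Age > (18 + 2) * 1, with the column represented by the string "Age".
plan = (">", "Age", ("*", ("+", 18, 2), 1))
folded = fold_constants(plan)
print(folded)
```

The arithmetic on literals is evaluated once while planning, so the rewritten expression compares the column against a single constant instead of recomputing it for every row.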


In [0]:
# Create a DataFrame
data = [("Alice", 21), ("Bob", 22), ("Carol", 19)]
columns = ["Name", "Age"]
df = spark.createDataFrame(data, columns)

# Applying multiple transformations
filtered_df = df.filter(df["Age"] > 20).select("Name")

# Explain the logical and physical plans
filtered_df.explain(True)

**Explanation:**
- Calling `explain(True)` prints the parsed, analyzed, and optimized logical plans followed by the physical plan. The logical plans show how Spark understands the operations, and the physical plan shows how Spark will execute them.
- You might see how Spark rearranges filters and projections to optimize execution, demonstrating Catalyst's optimization capabilities.




### Performance Enhancements Enabled by Tungsten and Shuffle Operations

**Tungsten** and shuffle operations are critical to optimizing performance in Azure Databricks:

- **Tungsten:** An execution engine that improves the efficiency of memory and CPU for Spark applications. It employs techniques like binary processing, cache-aware computation, and whole-stage code generation to maximize computational efficiency and minimize memory usage.
- **Shuffle Operations:** These are involved when data needs to be redistributed across different nodes to perform certain transformations like `groupBy()` or `reduceByKey()`. Optimization of shuffle operations involves minimizing data transfer and efficiently managing data storage during these operations, reducing the overall time and resource consumption.

**Enhancements through Tungsten and Shuffle Operations:**
- **Memory Management:** Tungsten uses off-heap memory management to reduce garbage collection overheads.
- **Data Encoding:** Uses efficient data encoding schemes to reduce the size of the data stored in memory.
- **Optimized Shuffle Operations:** Implements algorithms that reduce data shuffle across the network and improve the speed of data grouping and aggregation tasks.

These features collectively enhance the performance of data processing tasks in Azure Databricks, making it a powerful platform for handling large-scale data analytics and machine learning workloads.



### Showcasing Performance Enhancements Through Tungsten

We can't directly show Tungsten's internal optimizations in code because they're part of the Spark engine's underlying implementation. However, we can make sure whole-stage code generation is enabled and look for its effect in the explain plan. (Older tutorials also set `spark.sql.tungsten.enabled`; that was a Spark 1.x setting, and in Spark 2.0 and later Tungsten is always on, so it is omitted here.)

```python
from pyspark.sql import SparkSession

# Whole-stage codegen is on by default; it is set explicitly here for clarity.
# getOrCreate() returns the existing session if one is already running.
spark = SparkSession.builder \
    .appName("TungstenExample") \
    .config("spark.sql.codegen.wholeStage", "true") \
    .getOrCreate()

# Create a large DataFrame and perform operations
df = spark.range(0, 10000000)
df2 = df.selectExpr("id * 5 as id")

# Action to trigger computation
df2.show(5)

# Explain the plan to see whole-stage codegen
df2.explain()

# Stop the Spark session (skip this in a shared notebook environment
# such as Databricks, where the session is managed for you)
spark.stop()
```

**Explanation:**
- **Whole-stage code generation** is a key feature of Tungsten that compiles entire stages of the query plan into compact Java bytecode, which runs faster than traditional interpreted execution plans.
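- The `explain()` output in the above example will show if whole-stage code generation is applied: operators that are part of a generated stage are prefixed with an asterisk, such as `*(1)`, in the physical plan (the Spark UI labels these stages `WholeStageCodegen`).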

These examples provide a practical look at how different Spark concepts are applied within the environment of Azure Databricks, illustrating the implementation of lazy evaluation, eager functions, and the workings of the Catalyst optimizer and Tungsten optimizations.
