
Commit

Resolve: #4 + Spark RDD example (#13)
* Resolve: #4 + Spark RDD example

* Make headers h2

* Apply suggestions from code review

* Apply suggestions from code review

Co-authored-by: Jonathan Porter (he/him) <JPHaus@users.noreply.github.com>
icharo-tb and JPHaus committed Nov 4, 2022
1 parent 6ea3cc7 commit d29f6fa
Showing 2 changed files with 87 additions and 1 deletion.
25 changes: 24 additions & 1 deletion Tools/Apache Spark.md
@@ -20,7 +20,30 @@ https://spark.apache.org/docs/latest/

## Apache Spark Disadvantages

#placeholder/description
- Spark lacks a native storage option
- Incorrect use of RDD partitions on the Spark Context can degrade HDFS performance and lead to driver memory-overhead issues (see the sketch below)
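
A minimal sketch of controlling partitions explicitly, assuming an existing `SparkContext` named `sc` (the values below are illustrative only):

```
# Assumes an existing SparkContext `sc`; sizes are illustrative only
rdd = sc.parallelize(range(100), numSlices=4)  # create the RDD with 4 partitions
print(rdd.getNumPartitions())                  # 4

rdd8 = rdd.repartition(8)                      # full shuffle into 8 partitions
print(rdd8.getNumPartitions())                 # 8
```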

## Apache Spark storage

Apache Spark is compatible with Hadoop storage such as HDFS through the Hadoop APIs. Spark also works with other storage services such as NoSQL databases, Elasticsearch, and Amazon S3.
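
For instance, assuming a configured `SparkContext` named `sc` and the placeholder URIs below, Spark picks the storage backend from the URI scheme:

```
# Hypothetical paths; replace them with real HDFS or S3 locations
hdfs_rdd = sc.textFile("hdfs://namenode:9000/data/events.log")
s3_rdd = sc.textFile("s3a://my-bucket/data/events.log")
```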

## Apache Spark model

Apache Spark's programming model is based on **parallel operators**. This is Spark's main advantage and defining feature: it lets us use a Leader-Follower strategy to process data.

Spark describes tasks as a DAG (**Directed Acyclic Graph**), which gives programmers the option of developing complex pipelines. To understand what a DAG does, we can break down its name:

- *Directed*: the process flows one way
- *Acyclic*: there are no loops among the tasks
- *Graph*: the tasks can be displayed as an actual graph

A DAG system is also fault-tolerant.
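
As a minimal sketch (assuming a `SparkContext` named `sc`), chaining transformations builds such a DAG; Spark records the lineage and can recompute any lost partition from it:

```
# Hypothetical word-count pipeline: each transformation adds a node to the DAG
lines = sc.parallelize(["a b", "b c", "c d"])
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

print(counts.toDebugString())  # prints the lineage (the DAG) Spark will execute
```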

## Apache Spark RDDs

RDDs (**Resilient Distributed Datasets**) are Spark's main abstraction. They are sets of elements that are fault-tolerant and can be processed in parallel. RDDs are also scalable, because the processing is distributed, and they are immutable.

RDDs support two kinds of operations: **transformations** (such as filter, map...) and **actions** (such as reduce, collect...). Spark RDDs are lazily evaluated, which improves efficiency by executing operations only when their results are needed.
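
For example (a minimal sketch assuming a `SparkContext` named `sc`), the transformation below returns immediately without touching the data, and the computation only runs when the action is called:

```
nums = sc.parallelize([1, 2, 3, 4, 5])

squares = nums.map(lambda x: x * x)          # transformation: recorded lazily, nothing runs yet
total = squares.reduce(lambda a, b: a + b)   # action: triggers the actual computation

print(total)  # 55
```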

## Apache Spark Learning Resources

63 changes: 63 additions & 0 deletions Tutorials/Apache Spark RDD example.md
@@ -0,0 +1,63 @@
---
Aliases: []
Tags: [seedling]
publish: true
---

# Overview

This tutorial will cover basic RDD operations that can be run on either Google Colab or Databricks Community Edition.

## Official Documentation

https://spark.apache.org/docs/latest/rdd-programming-guide.html

## Configuration

While Databricks has Spark installed as a native module, Google Colab needs some prior configuration to set up the environment for the RDD operations.

We will first install PySpark, a Python library that lets us use Apache Spark:

```
!pip install pyspark
```

After the module is installed, we will set up the Spark configuration so we can use a **SparkContext**:

```
from pyspark import SparkContext, SparkConf
conf = SparkConf().setAppName('test').setMaster('local')
sc = SparkContext(conf=conf)
```

## First Steps

In order to work with RDDs, we need to understand how an RDD is created. We execute `sc.parallelize([your_data])` to create an RDD. By default, Spark accepts *lists* and *dictionaries* as the `parallelize` argument. If we want to see the content of an RDD, we execute `.collect()`.
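
For instance, assuming the `SparkContext` `sc` configured above:

```
data = sc.parallelize([1, 2, 3])  # create an RDD from a Python list
data.collect()                    # returns [1, 2, 3]
```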

We can see here an easy example:

```
nums2 = sc.parallelize([3, 2, 1, 4, 5])
evens = nums2.filter(lambda elem: elem % 2 == 0)  # transformation: keep the even numbers
odds = nums2.filter(lambda elem: elem % 2 != 0)   # transformation: keep the odd numbers
order = evens.union(odds)                         # transformation: join both RDDs
order.takeOrdered(5)                              # action: return the 5 smallest elements in order
```
```
[1, 2, 3, 4, 5]
```

Let's walk through this operation:
- We first create our RDD with `sc.parallelize()` and store it in the variable `nums2`.
- We apply the `filter` transformation with a lambda function that keeps the even numbers.
- We do the same, but keeping the odd numbers.
- We apply the `union` transformation to join the even and odd RDDs into a single RDD.
- To end this operation, we execute the `takeOrdered` action to get an ordered list back.

%% wiki footer: Please don't edit anything below this line %%

## This note in GitHub

<span class="git-footer">[Edit In GitHub](https://github.dev/data-engineering-community/data-engineering-wiki/blob/main/Tutorials/Apache%20Spark%20RDD%20example.md "git-hub-edit-note") | [Copy this note](https://github.dev/data-engineering-community/data-engineering-wiki/blob/main/Tutorials/Apache%20Spark%20RDD%20example.md "git-hub-copy-note") </span>
