
Commit

Resolve: #4 + Spark RDD example (#13)
* Resolve: #4 + Spark RDD example

* Make headers h2

* Apply suggestions from code review

* Apply suggestions from code review

Co-authored-by: Jonathan Porter (he/him) <JPHaus@users.noreply.github.com>
icharo-tb and JPHaus committed Nov 4, 2022
1 parent 6ea3cc7 commit d29f6fa
Showing 2 changed files with 87 additions and 1 deletion.
25 changes: 24 additions & 1 deletion Tools/Apache Spark.md
@@ -20,7 +20,30 @@ https://spark.apache.org/docs/latest/

## Apache Spark Disadvantages

#placeholder/description
- Spark lacks a native storage option
- Incorrect use of RDD partitions on the Spark Context can degrade HDFS performance and lead to driver memory-overhead issues (see the sketch below)
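
A minimal sketch of controlling partitions explicitly, assuming an existing `SparkContext` named `sc` (the values below are illustrative only):

```
# Assumes an existing SparkContext `sc`; sizes are illustrative only
rdd = sc.parallelize(range(100), numSlices=4)  # create the RDD with 4 partitions
print(rdd.getNumPartitions())                  # 4

rdd8 = rdd.repartition(8)                      # full shuffle into 8 partitions
print(rdd8.getNumPartitions())                 # 8
```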

## Apache Spark storage

Apache Spark is compatible with Hadoop storage such as HDFS through the Hadoop APIs. Spark also works with other storage services such as NoSQL databases, Elasticsearch, and Amazon S3.
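
For instance, assuming a configured `SparkContext` named `sc` and the placeholder URIs below, Spark picks the storage backend from the URI scheme:

```
# Hypothetical paths; replace them with real HDFS or S3 locations
hdfs_rdd = sc.textFile("hdfs://namenode:9000/data/events.log")
s3_rdd = sc.textFile("s3a://my-bucket/data/events.log")
```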

## Apache Spark model

Apache Spark's programming model is based on **parallel operators**. This is Spark's main advantage and defining feature: it lets us use a Leader-Follower strategy to process data.

Spark describes tasks as a DAG (**Directed Acyclic Graph**), which gives programmers the option of developing complex pipelines. To understand what a DAG does, we can break down its name:

- *Directed*: the process flows one way
- *Acyclic*: there are no loops among the tasks
- *Graph*: the tasks can be displayed as an actual graph

A DAG system is also fault-tolerant.
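
As a minimal sketch (assuming a `SparkContext` named `sc`), chaining transformations builds such a DAG; Spark records the lineage and can recompute any lost partition from it:

```
# Hypothetical word-count pipeline: each transformation adds a node to the DAG
lines = sc.parallelize(["a b", "b c", "c d"])
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

print(counts.toDebugString())  # prints the lineage (the DAG) Spark will execute
```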

## Apache Spark RDDs

RDDs (**Resilient Distributed Datasets**) are Spark's main abstraction. They are sets of elements that are fault-tolerant and can be processed in parallel. RDDs are also scalable, because the processing is distributed, and they are immutable.

RDDs support two kinds of operations: **transformations** (such as filter, map...) and **actions** (such as reduce, collect...). Spark RDDs are lazily evaluated, which improves efficiency by executing operations only when their results are needed.
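
For example (a minimal sketch assuming a `SparkContext` named `sc`), the transformation below returns immediately without touching the data, and the computation only runs when the action is called:

```
nums = sc.parallelize([1, 2, 3, 4, 5])

squares = nums.map(lambda x: x * x)          # transformation: recorded lazily, nothing runs yet
total = squares.reduce(lambda a, b: a + b)   # action: triggers the actual computation

print(total)  # 55
```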

## Apache Spark Learning Resources

63 changes: 63 additions & 0 deletions Tutorials/Apache Spark RDD example.md
@@ -0,0 +1,63 @@
---
Aliases: []
Tags: [seedling]
publish: true
---

# Overview

This tutorial will cover basic RDD operations that can be run on either Google Colab or Databricks Community Edition.

## Official Documentation

https://spark.apache.org/docs/latest/rdd-programming-guide.html

## Configuration

While Databricks has Spark installed as a native module, Google Colab needs some prior configuration to set up the environment for the RDD operations.

We will first install PySpark, a Python library that lets us use Apache Spark:

```
!pip install pyspark
```

After the module is installed, we will set up the Spark configuration so we can use a **SparkContext**:

```
from pyspark import SparkContext, SparkConf
conf = SparkConf().setAppName('test').setMaster('local')
sc = SparkContext(conf=conf)
```

## First Steps

In order to work with RDDs, we need to understand how an RDD is created. We execute `sc.parallelize([your_data])` to create an RDD. By default, Spark accepts *lists* and *dictionaries* as the `parallelize` argument. If we want to see the content of an RDD, we execute `.collect()`.
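
For instance, assuming the `SparkContext` `sc` configured above:

```
data = sc.parallelize([1, 2, 3])  # create an RDD from a Python list
data.collect()                    # returns [1, 2, 3]
```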

We can see here an easy example:

```
nums2 = sc.parallelize([3, 2, 1, 4, 5])
evens = nums2.filter(lambda elem: elem % 2 == 0)  # transformation: keep the even numbers
odds = nums2.filter(lambda elem: elem % 2 != 0)   # transformation: keep the odd numbers
order = evens.union(odds)                         # transformation: join both RDDs
order.takeOrdered(5)                              # action: return the 5 smallest elements in order
```
```
[1, 2, 3, 4, 5]
```

Let's walk through this operation:
- We first create our RDD with `sc.parallelize()` and store it in the variable `nums2`.
- We apply the `filter` transformation with a lambda function that keeps the even numbers.
- We do the same, but keeping the odd numbers.
- We apply the `union` transformation to join the even and odd RDDs into a single RDD.
- To end this operation, we execute the `takeOrdered` action to get an ordered list back.

%% wiki footer: Please don't edit anything below this line %%

## This note in GitHub

<span class="git-footer">[Edit In GitHub](https://github.dev/data-engineering-community/data-engineering-wiki/blob/main/Tutorials/Apache%20Spark%20RDD%20example.md "git-hub-edit-note") | [Copy this note](https://github.dev/data-engineering-community/data-engineering-wiki/blob/main/Tutorials/Apache%20Spark%20RDD%20example.md "git-hub-copy-note") </span>
