# Abstracting Data with RDDs

## Introduction

Resilient Distributed Datasets (RDDs) are collections of immutable JVM objects that are distributed across an Apache Spark cluster.

An RDD is the most fundamental dataset type of Apache Spark; any action on a Spark DataFrame eventually gets translated into a highly optimized execution of transformations and actions on RDDs.

Data in an RDD is split into chunks based on a key and then dispersed across all the executor nodes. RDDs are highly resilient, that is, they are able to recover quickly from any issues, as the same data chunks are replicated across multiple executor nodes. Thus, even if one executor fails, another will still process the data. This allows you to perform your functional calculations against your dataset very quickly by harnessing the power of multiple nodes. RDDs keep a log of all the execution steps applied to each chunk. This, on top of the data replication, speeds up the computations and, if anything goes wrong, RDDs can still recover the portion of the data lost due to an executor error.

While it is common to lose a node in distributed environments (for example, due to connectivity issues or hardware problems), distribution and replication of the data defend against data loss, while data lineage allows the system to recover quickly.

## Creating RDDs

There are two ways to create an RDD in PySpark: you can either use the parallelize() method on a collection (a list or an array of some elements), or reference a file (or files) located either locally or through an external source.

In [3]:
import findspark
findspark.init()

from pyspark.sql import SparkSession
spark = SparkSession.builder.master('local[*]').getOrCreate()
sc = spark.sparkContext

### Using parallelize() method

The following code snippet creates your RDD (myRDD) using
the sc.parallelize() method:

In [4]:
myRDD = sc.parallelize([('Mike', 19), ('June', 18), ('Rachel',16), ('Rob', 18),
('Scott', 17)])

To view what is inside your RDD, you can run the following code snippet:

In [5]:
myRDD.take(5)

[('Mike', 19), ('June', 18), ('Rachel', 16), ('Rob', 18), ('Scott', 17)]

#### How it works...
Let's break down the two methods in the preceding code snippet: sc.parallelize() and take().

##### Spark context parallelize method

The sc.parallelize() method is the SparkContext's parallelize method to create a parallelized collection.
This allows Spark to distribute the data across multiple nodes, instead of depending on a single node to process the data.

Now that we have created myRDD as a parallelized collection, Spark can operate against this distributed dataset in parallel.
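As a rough mental model of what a parallelized collection means, the plain-Python sketch below cuts a list into contiguous slices, one per partition. The helper `split_into_slices` is hypothetical and is not part of PySpark; Spark's real slicing logic may differ. It only illustrates that each slice could then be handed to a different worker.

```python
# Conceptual sketch only: split_into_slices is a hypothetical helper,
# not part of PySpark, and Spark's real slicing logic may differ.
def split_into_slices(items, num_slices):
    """Cut a list into num_slices contiguous, near-equal chunks."""
    n = len(items)
    return [
        items[(i * n) // num_slices:((i + 1) * n) // num_slices]
        for i in range(num_slices)
    ]

people = [('Mike', 19), ('June', 18), ('Rachel', 16), ('Rob', 18), ('Scott', 17)]
slices = split_into_slices(people, 2)
# Each inner list stands in for one partition that a worker could process.
```

Every element lands in exactly one slice, so processing the slices independently and combining the results covers the whole collection.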

##### .take(...) method

Now that you have created your RDD (myRDD), we will use the take() method, an RDD action, to return the values to the console (or notebook cell). Note that a common approach in PySpark is to use collect(), which returns all values in your RDD from the Spark worker nodes to the driver. This has performance implications when working with a large amount of data, as it translates to large volumes of data being transferred from the Spark worker nodes to the driver. For small amounts of data (such as in this recipe), this is perfectly fine but, as a matter of habit, you should almost always use the take(n) method instead; it returns the first n elements of the RDD instead of the whole dataset. It is a more efficient method because it first scans one partition and uses the results from that partition to determine the number of partitions required to return the results.
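The partition-scanning idea behind take(n) can be sketched in plain Python. The function `take_sketch` below is hypothetical and heavily simplified (it simply doubles the number of partitions read per round); it is not PySpark's implementation, only an illustration of why a small n usually touches few partitions.

```python
# Simplified sketch of the idea behind take(n); take_sketch is hypothetical
# and is not PySpark's implementation.
def take_sketch(partitions, n):
    """Collect up to n elements, scanning as few partitions as possible."""
    taken = []
    scanned = 0          # how many partitions have been read so far
    batch = 1            # start by reading a single partition
    while len(taken) < n and scanned < len(partitions):
        for part in partitions[scanned:scanned + batch]:
            taken.extend(part[:n - len(taken)])
        scanned += batch
        batch *= 2       # read more partitions per round if still short
    return taken, min(scanned, len(partitions))

partitions = [[1, 2], [3, 4], [5, 6], [7, 8]]
first_two, scanned_small = take_sketch(partitions, 2)   # one partition suffices
first_five, scanned_large = take_sketch(partitions, 5)  # needs more partitions
```

By contrast, a collect()-style operation would always read every partition and ship all of its elements to the driver.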

### Reading data from files

We will create an RDD by reading a local file in PySpark.

Note that while this recipe is specific to reading local files, a similar syntax can be applied for Hadoop, AWS S3, Azure WASBs, and/or Google Cloud Storage.

In [6]:
directory = "/home/elvin/Documents/Engenharia-De-Dados/Estudos/PySpark/"

myRDD = (sc.textFile(directory + 'airport-codes-na.txt', minPartitions=4, use_unicode=True)).map(lambda element: element.split("\t"))

In [7]:
myRDD.take(5)

[['City', 'State', 'Country', 'IATA'],
 ['Abbotsford', 'BC', 'Canada', 'YXX'],
 ['Aberdeen', 'SD', 'USA', 'ABR'],
 ['Abilene', 'TX', 'USA', 'ABI'],
 ['Akron', 'OH', 'USA', 'CAK']]

In [8]:
myRDD.count()  # number of rows in the RDD

527

In [9]:
myRDD.getNumPartitions()  # number of partitions backing this RDD

4

#### How it works...

The first code snippet, which reads the file and returns values via take(), can be broken down into its two components: sc.textFile() and map().

##### .textFile(...) method

To read the file, we are using SparkContext's textFile() method.

Only the first parameter is required, which indicates the location of the text file as per ~/data/flights/airport-codes-na.txt. There are two optional parameters as well:

* minPartitions: Indicates the minimum number of partitions that make up the RDD. The Spark engine can often determine the best number of partitions based on the file size, but you may want to change the number of partitions for performance reasons and, hence, the ability to specify the minimum number.
* use_unicode: Engage this parameter if you are processing Unicode data.

Note that if you were to execute this statement without the subsequent map() function, the resulting RDD would not reference the tab-delimiter; each element would simply be one line of text.

##### .map(...) method

To make sense of the tab-delimiter with an RDD, we will use the .map(...) function to transform the data from a list of strings to a list of lists.

The key components of this map transformation are:
* lambda: An anonymous function (that is, a function defined without a name) composed of a single expression.
* split: We're using Python's built-in str.split() method (not the split function in pyspark.sql.functions, which operates on DataFrame columns and takes a regular expression pattern) to split each line around a delimiter; in this case, our delimiter is a tab (that is, \t).
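The lambda passed to map() can be tried on ordinary Python strings before running it on a cluster. The sketch below uses Python's built-in map() as a stand-in for the RDD transformation; the two sample lines are assumed to mirror the first rows of the file shown in the output above.

```python
# Plain-Python illustration; the sample lines are assumed to mirror
# the first rows of airport-codes-na.txt.
lines = [
    "City\tState\tCountry\tIATA",
    "Abbotsford\tBC\tCanada\tYXX",
]

# Same lambda as in the recipe: split one line on the tab character.
split_line = lambda element: element.split("\t")

# Built-in map() applies the lambda to every line, like RDD.map() does
# for every element of the RDD.
rows = list(map(split_line, lines))
```

Each string becomes a list of fields, which is the "list of strings to list of lists" change described above.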

## Partitions and performance

A key aspect of partitions for your RDD is that the more partitions you have, the higher the parallelism. Potentially, having more partitions will improve your query performance.

In [10]:
myRDD = sc.textFile(directory + "departuredelays.csv").map(lambda x: x.split(","))

In [11]:
myRDD.count()

1391579

In [12]:
myRDD.getNumPartitions()

2

In [13]:
myRDD = sc.textFile(directory + "departuredelays.csv", minPartitions=8).map(lambda x: x.split(","))

In [14]:
myRDD.count()

1391579

## Overview of RDD transformations

There are two types of operation that can be used to shape data in an RDD: transformations and actions. A transformation, as the name suggests, transforms one RDD into another. In other words, it takes an existing RDD and transforms it into one or more output RDDs. In the preceding recipes, we used a map() function, which is an example of a transformation, to split the data by its tab-delimiter.

Transformations are lazy (unlike actions). They only get executed when an action is called on an RDD. For example, calling the count() function is an action.
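Laziness can be observed with a plain-Python analogy: the built-in map() in Python 3 also defers work until its result is consumed. This is only an analogy for how Spark defers transformations until an action runs, not Spark itself; `split_and_log` is a hypothetical helper that records each call.

```python
calls = []  # records every line the function actually processes

def split_and_log(line):
    """Split a CSV line and remember that we were called."""
    calls.append(line)
    return line.split(",")

lines = ["01011245,6,602,ABE,ATL", "01020600,-8,369,ABE,DTW"]

pending = map(split_and_log, lines)  # like a transformation: nothing runs yet
calls_before = len(calls)            # still 0 at this point

rows = list(pending)                 # like an action: forces the computation
calls_after = len(calls)             # now one call per line
```

The function body runs only once the result is demanded, in the same way that a chain of RDD transformations runs only when an action such as count() or take() is called.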

In [33]:
airports = (sc.textFile(directory + 'airport-codes-na.txt')).map(lambda x: x.split("\t"))

In [34]:
airports.take(5)

[['City', 'State', 'Country', 'IATA'],
 ['Abbotsford', 'BC', 'Canada', 'YXX'],
 ['Aberdeen', 'SD', 'USA', 'ABR'],
 ['Abilene', 'TX', 'USA', 'ABI'],
 ['Akron', 'OH', 'USA', 'CAK']]

In [29]:
flights = (sc.textFile(directory + 'departuredelays.csv').map(lambda x: x.split(",")))

In [30]:
flights.take(5)

[['date', 'delay', 'distance', 'origin', 'destination'],
 ['01011245', '6', '602', 'ABE', 'ATL'],
 ['01020600', '-8', '369', 'ABE', 'DTW'],
 ['01021245', '-2', '602', 'ABE', 'ATL'],
 ['01020605', '-4', '602', 'ABE', 'ATL']]

The transformations include the following common tasks:
* Removing the header line from your text file: zipWithIndex()
* Selecting columns from your RDD: map()
* Running a WHERE (filter) clause: filter()
* Getting the distinct values: distinct()
* Getting the number of partitions: getNumPartitions()
* Determining the size of your partitions (that is, the number of elements within each partition): mapPartitionsWithIndex()
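The first task in the list, removing the header line, is not shown elsewhere in this section, so here is a plain-Python sketch of the idea. It pairs each row with its index and keeps rows whose index is greater than zero. On an RDD, the assumed equivalent would be a chain such as airports.zipWithIndex().filter(lambda pair: pair[1] > 0).map(lambda pair: pair[0]); the sample rows are copied from the output above.

```python
# Plain-Python analogy for dropping the header with zipWithIndex().
rows = [
    ['City', 'State', 'Country', 'IATA'],
    ['Abbotsford', 'BC', 'Canada', 'YXX'],
    ['Aberdeen', 'SD', 'USA', 'ABR'],
]

# zipWithIndex() yields (element, index) pairs; enumerate() yields
# (index, element), so the order is swapped here to match.
indexed = [(row, idx) for idx, row in enumerate(rows)]

# Keep everything except index 0 (the header), then drop the index again.
no_header = [row for row, idx in indexed if idx > 0]
```

This relies on the header being the first element, which holds for a file read in order from its first line.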

### .map(...) transformation

The map(f) transformation returns a new RDD formed by passing each
element through a function, f.

In [36]:
airports.map(lambda c: (c[0], c[1])).take(5)

[('City', 'State'),
 ('Abbotsford', 'BC'),
 ('Aberdeen', 'SD'),
 ('Abilene', 'TX'),
 ('Akron', 'OH')]

### .filter(...) transformation

The filter(f) transformation returns a new RDD based on selecting elements for which the f function returns true.

In [37]:
# Use filter() to keep rows where the second column == "WA"
airports.map(lambda c: (c[0], c[1])).filter(lambda c: c[1] == "WA").take(5)

[('Bellingham', 'WA'),
 ('Moses Lake', 'WA'),
 ('Pasco', 'WA'),
 ('Pullman', 'WA'),
 ('Seattle', 'WA')]