# Big Data Fundamentals with PySpark
There's been a lot of buzz about Big Data over the past few years, and it has finally become mainstream for many companies. But what is Big Data? This course covers the fundamentals of Big Data via PySpark. Spark is a "lightning fast cluster computing" framework for Big Data. It provides a general data processing engine and lets you run programs up to 100x faster in memory, or 10x faster on disk, than Hadoop. You’ll use PySpark, a Python package for Spark programming, and its powerful, higher-level libraries such as SparkSQL and MLlib (for machine learning). You will explore the works of William Shakespeare, analyze FIFA 2018 data, and perform clustering on genomic datasets. At the end of this course, you will have gained an in-depth understanding of PySpark and its application to general Big Data analysis.

Instructor: Upendra Devisetty, Science Analyst at CyVerse

## $\star$ Introduction to Big Data Analysis with Spark
This chapter introduces the exciting world of Big Data, as well as the various concepts and different frameworks for processing Big Data. You will see why Apache Spark is widely regarded as a leading framework for Big Data.

#### The 3 Vs of Big Data
* The 3 Vs are used to describe big data's characteristics
* **Volume:** Size of the data 
* **Variety:** Different sources and formats of data
* **Velocity:** Speed at which the data is generated and available for processing

#### Big Data concepts and Terminology
* **Clustered computing:** collection of resources of multiple machines
* **Parallel computing:** a type of computation in which many calculations are carried out simultaneously
* **Distributed computing:** Collection of nodes (networked computers) that run in parallel
* **Batch processing:** Breaking the job into small pieces and running them on individual machines
* **Real-time processing:** Immediate processing of data
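
The batch and parallel ideas above can be sketched on a single machine with Python's standard library. This is only an analogy and not Spark itself: the `chunked` helper, the chunk size of 25, and the use of a thread pool in place of separate machines are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(items, size):
    """Split a list into consecutive pieces of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


numbers = list(range(1, 101))
pieces = chunked(numbers, 25)  # break the job into 4 small pieces

# Each worker sums one piece; on a cluster each piece would go to a machine
with ThreadPoolExecutor(max_workers=4) as pool:
    partial_sums = list(pool.map(sum, pieces))

total = sum(partial_sums)  # combine the partial results
print(partial_sums, total)
```

The final `sum` over the partial results plays the role of the step that gathers the output of the individual machines.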

#### Big Data processing systems
* **Hadoop/MapReduce:** Scalable and fault-tolerant framework; written in Java
    * Open source
    * Batch processing
* **Apache Spark:** General purpose and lightning fast cluster computing system
    * Open source
    * Suited for both batch and real-time data processing

#### Features of Apache Spark framework
* Distributed cluster computing framework
* Efficient in-memory computations for large scale data sets
* Lightning-fast data processing framework
* Provides support for Java, Scala, Python, R, and SQL

#### Spark modes of deployment
* **Local mode:** Single machine such as your laptop
    * Convenient for testing, debugging, and demonstration
* **Cluster mode:** Set of pre-defined machines
    * Good for production
* Typical workflow: Local $\Rightarrow$ clusters
    * During this transition, no code change is necessary

### PySpark: Spark with Python
#### What is Spark shell?
* Interactive environment for running Spark jobs
* Helpful for fast interactive prototyping
* Spark's shells allow interacting with data on disk or in memory across many machines or one, and Spark takes care of automatically distributing this processing
* Three different Spark shells:
    * Spark-shell for Scala
    * PySpark-shell for Python
    * SparkR for R
    
#### PySpark shell
* PySpark shell is the Python-based command line tool
* PySpark shell allows data scientists to interface with Spark data structures
* PySpark shell supports connecting to a cluster

#### Understanding SparkContext
* SparkContext is an entry point into the world of Spark
* An **entry point** is where control is transferred from the Operating system to the provided program.
    * An entry point is a way of connecting to Spark cluster
    * An entry point is "like a key to the house." 
* Access the SparkContext in the PySpark shell as a variable named `sc`

#### Inspecting SparkContext
* **Version:** retrieve the Spark version that the SparkContext is currently running:
    * `sc.version`
* **Python Version:** retrieve the Python version *that SparkContext is currently using*:
    * `sc.pythonVer`
* **Master:** the URL of the cluster, or the string "local" when the SparkContext runs in local mode:
    * `sc.master`
    * If it returns `local[*]`, the SparkContext acts as a master on a local node, using all available threads on the computer where it is running.
    
#### Loading data in PySpark
* SparkContext's **`parallelize()`** method (used on a list)
    * For example, to create parallelized collections holding the numbers 1 to 5:
    * `rdd = sc.parallelize([1, 2, 3, 4, 5])`
* SparkContext's **`textFile()`** method (used on a file)
    * For example, to load a text file named `test.txt` using this method:
    * `rdd2 = sc.textFile("test.txt")`
    
```
# Print the version of SparkContext
print("The version of Spark Context in the PySpark shell is", sc.version)

# Print the Python version of SparkContext
print("The Python version of Spark Context in the PySpark shell is", sc.pythonVer)

# Print the master of SparkContext
print("The master of Spark Context in the PySpark shell is", sc.master)
```
***

```
# Create a Python list of numbers from 1 to 100 
numb = range(1, 101)

# Load the list into PySpark  
spark_data = sc.parallelize(numb)
```
***

```
# Load a local file into PySpark shell
lines = sc.textFile(file_path)
```

### Use of lambda functions in Python
* Understanding PySpark becomes a lot easier if we understand functional programming principles in Python:
    * `lambda`
    * `map`
    * `filter`
* Python supports the creation of anonymous functions.
    * **Anonymous functions** are functions that are not bound to a name at runtime; they are created with a construct called `lambda`
    * Lambda functions are very powerful and well-integrated into Python
    * Lambda is especially convenient with `map()` and `filter()`
    * Like `def`, `lambda` creates a function to be called later in the program. However, it returns the function instead of assigning it to a name (i.e., it is **anonymous**).
    * In practice, lambdas are used to inline a function definition or to defer execution of a piece of code.
    
#### Lambda function syntax
* Lambda functions can be used wherever function objects are required.
* They can have any number of arguments but only one expression; that expression is evaluated and its value is returned.
* **The general syntax of the lambda function is:**

**`lambda arguments: expression`**

Examples:

```
double = lambda x: x * 2
print(double(3))
```
*** 

```
g = lambda x: x**3
print(g(10))
```

results of the two examples:

**`6`** and **`1000`**

* No return statement for lambda
* Can put lambda function anywhere, without ever assigning it to a variable
* We use lambda functions when we require a nameless function for a short period of time
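
As a small sketch of the last two points, a lambda can be passed straight into another function or called on the spot, without ever being bound to a name. The `people` list and the variable names here are made up for illustration.

```python
people = [("Sam", 23), ("Mary", 34), ("Peter", 25)]

# The lambda is passed directly as the sort key and never given a name
by_age = sorted(people, key=lambda pair: pair[1])

# A lambda can also be called immediately where it is defined
cube_of_two = (lambda x: x ** 3)(2)

print(by_age)
print(cube_of_two)
```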

#### Use of Lambda function in Python - map()
* `map()` takes a function and a list and applies the function to each item; in Python 3 it returns an iterator, so wrap it in `list()` to get a new list of the results
* General syntax of `map()`: 
    * **`map(function, list)`**
* Example of `map` with `lambda`:

```
items = [1, 2, 3, 4]
list(map(lambda x: x + 2, items))
```

result:

**`[3, 4, 5, 6]`**


#### Use of Lambda function in Python - filter()
* `filter()` takes a function and a list and keeps only the items for which the function evaluates to true; in Python 3 it returns an iterator, so wrap it in `list()` to get a new list
* General syntax of `filter()`:
    * **`filter(function, list)`**
* Example of `filter()` with `lambda`:

```
items = [1, 2, 3, 4]
list(filter(lambda x: (x%2 != 0), items))
```

result:

**`[1, 3]`**

```
# Print my_list in the console
print("Input list is", my_list)

# Square all numbers in my_list
squared_list_lambda = list(map(lambda x: x**2, my_list))

# Print the result of the map function
print("The squared numbers are", squared_list_lambda)
```
***

```
# Print my_list2 in the console
print("Input list is:", my_list2)

# Filter numbers divisible by 10
filtered_list = list(filter(lambda x: (x%10 == 0), my_list2))

# Print the numbers divisible by 10
print("Numbers divisible by 10 are:", filtered_list)
```
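
The two built-ins can also be chained. The sketch below, with a made-up `numbers` list, filters first and then maps. It also shows that in Python 3 both functions return lazy iterators, which is why the examples in these notes wrap them in `list()`.

```python
numbers = [10, 21, 30, 43, 50]

# filter() and map() return lazy iterators in Python 3
divisible_by_10 = filter(lambda x: x % 10 == 0, numbers)
squared = map(lambda x: x ** 2, divisible_by_10)

# Nothing is computed until the iterator is consumed, here by list()
result = list(squared)
print(result)

# An iterator can be consumed only once; a second pass yields nothing
leftover = list(squared)
print(leftover)
```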

## $\star$ Chapter 2: Programming in PySpark RDDs
The main abstraction Spark provides is a resilient distributed dataset (RDD), which is the fundamental and backbone data type of this engine. This chapter introduces RDDs and shows how RDDs can be created and executed using RDD Transformations and Actions.

#### Introduction to PySpark RDD
In this chapter, we will start working with RDDs which are Spark's core abstraction for working with data. 

* **RDD** = **Resilient Distributed Datasets**
* RDDs are a collection of data distributed across the cluster
* RDD is the fundamental and backbone data type in PySpark

#### Decomposing RDDs
* Resilient Distributed Datasets
    * **Resilient:** ability to withstand failures
    * **Distributed:** spanning across multiple machines
    * **Datasets:** collection of partitioned data, e.g., arrays, tables, tuples, etc.
    
#### Creating RDDs
* Three methods for creating RDDs
* **Parallelize:**
    * The **simplest method** to create RDDs is to take an existing collection of objects (for example a list, array, or set) and pass it to SparkContext's parallelize method.
* **External datasets:**
    * A **more common** way to create RDDs is to load data from external datasets such as:
        * files stored in HDFS
        * objects in an Amazon S3 bucket
        * lines in a text file
* **From existing RDDs**

#### Parallelized collection (parallelizing)
* RDDs are created from a list or a set using the SparkContext's `parallelize` method.

```
numRDD = sc.parallelize([1, 2, 3, 4])
helloRDD = sc.parallelize("Hello world")
type(helloRDD)
```

#### From external datasets
* Creating RDDs from external datasets is by far the most common method in PySpark
* `textFile()` for creating RDDs from external datasets
* For a file `README.md` stored locally:
* `fileRDD = sc.textFile("README.md")`
* `type(fileRDD)`

#### Understanding Partitioning in PySpark
* Data partitioning is an important concept in Spark and understanding how Spark deals with partitions allows one to control parallelism.
* A **partition** is a logical division of a large distributed data set. 
* By default, Spark partitions the data when the RDD is created, based on factors such as the available resources and the external dataset. This behavior can be controlled by passing a second argument: `numSlices` for `parallelize()` (the number of partitions) or `minPartitions` for `textFile()` (the minimum number of partitions to create for the RDD)
* `parallelize()` method:
    * `numRDD = sc.parallelize(range(10), numSlices = 6)`
* `textFile()` method:
    * `fileRDD = sc.textFile("README.md", minPartitions = 6)`
* The number of partitions in an RDD can always be found by using the `getNumPartitions()` method
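
To make the idea of a partition concrete, here is a plain-Python sketch that cuts a collection into a requested number of contiguous slices. The `split_into_partitions` helper is hypothetical and is not claimed to be the exact rule Spark applies; it only illustrates what "a logical division of a data set" looks like.

```python
def split_into_partitions(data, num_partitions):
    """Split `data` into `num_partitions` contiguous, near-equal slices."""
    data = list(data)
    n = len(data)
    return [
        data[i * n // num_partitions:(i + 1) * n // num_partitions]
        for i in range(num_partitions)
    ]


partitions = split_into_partitions(range(10), 6)
print(len(partitions))
print(partitions)
```

Each inner list stands in for one partition that a separate worker could process independently.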

```
# Create an RDD from a list of words
RDD = sc.parallelize(["Spark", "is", "a", "framework", "for", "Big Data processing"])

# Print out the type of the created object
print("The type of RDD is", type(RDD))
```
***

```
# Print the file_path
print("The file_path is", file_path)

# Create a fileRDD from file_path
fileRDD = sc.textFile(file_path)

# Check the type of fileRDD
print("The file type of fileRDD is", type(fileRDD))
```
***

```
# Check the number of partitions in fileRDD
print("Number of partitions in fileRDD is", fileRDD.getNumPartitions())

# Create a fileRDD_part from file_path with 5 partitions
fileRDD_part = sc.textFile(file_path, minPartitions = 5)

# Check the number of partitions in fileRDD_part
print("Number of partitions in fileRDD_part is", fileRDD_part.getNumPartitions())
```

#### RDD Operations in PySpark
* RDDs in PySpark support two different types of operations:
    * Transformations
    * Actions
* **Transformations** are operations on RDDs that return a new RDD
    * follow lazy evaluation
    * basic RDD transformations:
        * `map()`
        * `filter()`
        * `flatMap()`
        * `union()`
* **Actions** are operations that perform some computation on the RDD
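
Lazy evaluation can be mimicked in plain Python, where `map()` also defers work until its result is consumed. This is an analogy under that assumption, not Spark code: the `log` list and `square` function exist only to show when the computation actually runs.

```python
log = []


def square(x):
    log.append(x)  # record each time the function really runs
    return x * x


data = [1, 2, 3, 4]

# "Transformation": builds a lazy pipeline, no element is processed yet
pipeline = map(square, data)
calls_before_action = len(log)

# "Action": consuming the pipeline triggers the actual computation
result = list(pipeline)
calls_after_action = len(log)

print(calls_before_action, calls_after_action, result)
```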

#### map() Transformation 
* The `map()` transformation applies a function to all elements in the RDD
* Example:

```
RDD = sc.parallelize([1, 2, 3, 4])
RDD_map = RDD.map(lambda x: x * x)
```

#### filter() Transformation
* The `filter()` transformation takes in a function and returns an RDD that only has elements that pass the condition.

```
RDD = sc.parallelize([1, 2, 3, 4])
RDD_filter = RDD.filter(lambda x: x > 2)
```

#### flatMap() Transformation
* The `flatMap()` transformation is similar to the `map()` transformation, except that it can return multiple values for each element in the source RDD.
* A simple usage of `flatMap()` is splitting up an input string into words
* In the example below, even though the input RDD has 2 elements, the output RDD contains 5 elements:

```
RDD = sc.parallelize(["hello world", "how are you"])
RDD_flatmap = RDD.flatMap(lambda x: x.split(" "))
```
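
The difference between the two transformations can be seen without a cluster. The following plain-Python sketch (an analogy, with `itertools.chain` standing in for the flattening step) applies the same split function in a map-like and a flatMap-like way.

```python
from itertools import chain

lines = ["hello world", "how are you"]

# map-like: one output element (a list of words) per input element
mapped = [line.split(" ") for line in lines]

# flatMap-like: the per-element lists are flattened into a single sequence
flat_mapped = list(chain.from_iterable(line.split(" ") for line in lines))

print(len(mapped), mapped)
print(len(flat_mapped), flat_mapped)
```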

#### union() Transformation
* The `union()` transformation returns the union of one RDD with another RDD
* Similar to pandas' `concat()`

```
inputRDD = sc.textFile("logs.txt")
errorRDD = inputRDD.filter(lambda x: "error" in x.split())
warningsRDD = inputRDD.filter(lambda x: "warnings" in x.split())
combinedRDD = errorRDD.union(warningsRDD)
```

### RDD Actions
* Operations that return a value after running a computation on the RDD
* Basic RDD Actions
    * **`collect()`:** returns all the elements of the dataset as an array
        * `RDD_map.collect()`
    * **`take(n)`:** returns an array with the first N elements of the dataset
        * `RDD_map.take(2)`
    * **`first()`:** returns the first element of the RDD
        * `RDD_map.first()`
    * **`count()`:** returns the number of elements in the RDD
        * `RDD_flatmap.count()`
        
```
# Create map() transformation to cube numbers
cubedRDD = numbRDD.map(lambda x: x**3)

# Collect the results
numbers_all = cubedRDD.collect()

# Print the numbers from numbers_all
for numb in numbers_all:
	print(numb)
```
*** 

```
# Filter the fileRDD to select lines with Spark keyword
fileRDD_filter = fileRDD.filter(lambda line: 'Spark' in line)

# How many lines contain the keyword Spark?
print("The total number of lines with the keyword Spark is", fileRDD_filter.count())

# Print the first four lines of fileRDD_filter
for line in fileRDD_filter.take(4):
  print(line)
```

#### Working with Pair RDDs in PySpark
* Real-life datasets are usually key-value pairs
* Each row is a key and maps to one or more values
* **Pair RDD** is a special data structure to work with this kind of dataset
* **Pair RDD:** Key is the identifier and value is data

#### Creating pair RDDs
* Two common ways to create pair RDDs:
    * From a list of key-value tuples
    * From a regular RDD
* Get the data into key-value form for a pair RDD

```
my_tuple = [('Sam', 23), ('Mary', 34), ('Peter', 25)]
pairRDD_tuple = sc.parallelize(my_tuple)
```
***

```
my_list = ['Sam 23', 'Mary 34', 'Peter 25']
regularRDD = sc.parallelize(my_list)
pairRDD_RDD = regularRDD.map(lambda s: (s.split(' ')[0], s.split(' ')[1]))
```
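
One detail worth noticing in the second approach is that `split()` produces strings, so the values stay text such as `'23'`. The plain-Python sketch below shows the difference; `to_pair` is a hypothetical helper, and the assumption is that a function like it could be passed to `map()` in place of the lambda when numeric values are needed.

```python
my_list = ['Sam 23', 'Mary 34', 'Peter 25']


def to_pair(s):
    """Turn 'name age' into a (name, age) tuple with an integer age."""
    name, age = s.split(' ')
    return (name, int(age))


# Without int(), the values would stay strings such as '23'
string_pairs = [(s.split(' ')[0], s.split(' ')[1]) for s in my_list]
int_pairs = [to_pair(s) for s in my_list]

print(string_pairs)
print(int_pairs)
```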

#### Transformations on pair RDDs
* All regular transformations work on pair RDDs
* Because they contain tuples, we need to pass functions that operate on key value pairs rather than on individual elements
* Examples of paired RDD Transformations:
    * **`reduceByKey()`**: Combine values with the same key
    * **`groupByKey()`**: Group values with the same key
    * **`sortByKey()`**: Return an RDD sorted by the key
    * **`join()`**: Join two pair RDDs based on their key
    
#### reduceByKey() transformation
* The most popular pair RDD transformation which combines values with the same key using a function
* It runs parallel operations for each key in the dataset
* `reduceByKey()` is a transformation and not an action, as datasets can have very large numbers of keys

```
regularRDD = sc.parallelize([("Messi", 23), ("Ronaldo", 34), ("Neymar", 22), ("Messi", 24)])
pairRDD_reducebykey = regularRDD.reduceByKey(lambda x,y : x + y)
pairRDD_reducebykey.collect()
```
* result: `[('Neymar', 22), ('Ronaldo', 34), ('Messi', 47)]`
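
What "combining values with the same key using a function" means can be sketched in plain Python. The `reduce_by_key` helper below is hypothetical and runs on one machine; it only mirrors the logic, not how Spark distributes the work.

```python
def reduce_by_key(pairs, func):
    """Combine values that share a key using a two-argument function."""
    combined = {}
    for key, value in pairs:
        if key in combined:
            combined[key] = func(combined[key], value)
        else:
            combined[key] = value
    return combined


pairs = [("Messi", 23), ("Ronaldo", 34), ("Neymar", 22), ("Messi", 24)]
totals = reduce_by_key(pairs, lambda x, y: x + y)
print(totals)
```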

#### sortByKey() transformation
* Sorting of data is necessary for many downstream applications
* We can sort pair RDDs as long as there is an ordering defined in the key.
* `sortByKey()` operation orders pair RDD by key
* It returns an RDD sorted by key in ascending or descending order

```
pairRDD_reducebykey_rev = pairRDD_reducebykey.map(lambda x: (x[1], x[0]))
pairRDD_reducebykey_rev.sortByKey(ascending=False).collect()
```
* result: `[(47, 'Messi'), (34, 'Ronaldo'), (22, 'Neymar')]`
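
The same swap-then-sort idea in plain Python, as a rough single-machine analogy (the `totals` list is typed in by hand here and is assumed to match the reduced pairs from the previous example):

```python
totals = [("Neymar", 22), ("Ronaldo", 34), ("Messi", 47)]

# Swap each pair so that the number becomes the key
swapped = [(value, key) for key, value in totals]

# Sort by key (the first tuple element) in descending order
sorted_desc = sorted(swapped, key=lambda pair: pair[0], reverse=True)
print(sorted_desc)
```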

#### groupByKey() transformation
* `groupByKey()` groups all the values with the same key in the pair RDD
* If the data is already keyed in the way that we want, the `groupByKey` operation groups all the values with the same key in the pair RDD.

```
airports = [("US", "JFK"), ("UK", "LHR"), ("FR", "CDG"), ("US", "SFO")]
regularRDD = sc.parallelize(airports)
pairRDD_group = regularRDD.groupByKey().collect()
for cont, air in pairRDD_group:
    print(cont, list(air))
```
* result:

```
FR ['CDG']
US ['JFK', 'SFO']
UK ['LHR']
```
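
For comparison, grouping by key can be sketched in plain Python with a `defaultdict`. This is an analogy only; the keys are sorted before printing to make the output order predictable, which is an added assumption and not something the Spark example does.

```python
from collections import defaultdict

airports = [("US", "JFK"), ("UK", "LHR"), ("FR", "CDG"), ("US", "SFO")]

grouped = defaultdict(list)
for country, airport in airports:
    grouped[country].append(airport)  # collect every value under its key

for country in sorted(grouped):
    print(country, grouped[country])
```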

#### join() transformation
* `join()` joins two pair RDDs based on their key 

```
RDD1 = sc.parallelize([("Messi", 34), ("Ronaldo", 32), ("Neymar", 24)])
RDD2 = sc.parallelize([("Ronaldo", 80), ("Neymar", 120), ("Messi", 100)])
RDD1.join(RDD2).collect()
```
* result: `[('Neymar', (24, 120)), ('Ronaldo', (32, 80)), ('Messi', (34, 100))]`
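
A join on keys can be pictured with a small plain-Python sketch. The `inner_join` helper is hypothetical, and the extra `("Kane", 50)` entry is invented to show that a key present on only one side is dropped, which is the behavior of an inner join.

```python
pairs1 = [("Messi", 34), ("Ronaldo", 32), ("Neymar", 24)]
pairs2 = [("Ronaldo", 80), ("Neymar", 120), ("Messi", 100), ("Kane", 50)]


def inner_join(left, right):
    """Pair up values from both lists wherever the keys match."""
    return [
        (lkey, (lval, rval))
        for lkey, lval in left
        for rkey, rval in right
        if lkey == rkey
    ]


joined = inner_join(pairs1, pairs2)
print(joined)
```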

#### reduceByKey

```
# Create PairRDD Rdd with key value pairs
Rdd = sc.parallelize([(1,2), (3,4), (3,6), (4,5)])

# Apply reduceByKey() operation on Rdd
Rdd_Reduced = Rdd.reduceByKey(lambda x, y: x+y)

# Iterate over the result and print the output
for num in Rdd_Reduced.collect(): 
  print("Key {} has {} Counts".format(num[0], num[1]))
```
***

```
# Sort the reduced RDD with the key by descending order
Rdd_Reduced_Sort = Rdd_Reduced.sortByKey(ascending=False)

# Iterate over the result and retrieve all the elements of the RDD
for num in Rdd_Reduced_Sort.collect():
  print("Key {} has {} Counts".format(num[0], num[1]))
```

### More actions

#### reduce() action
* `reduce(func)` action is used for aggregating the elements of a regular RDD
* The function should be *commutative* (changing the order of the operands does not change the result) and *associative* (changing the grouping of the operations does not change the result)

```
x = [1, 3, 4, 6]
RDD = sc.parallelize(x)
RDD.reduce(lambda x, y : x + y)
```
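
Why the function needs these properties can be sketched with `functools.reduce` in plain Python. The split of the list into two halves is an assumption that stands in for two partitions being reduced separately and then combined.

```python
from functools import reduce

x = [1, 3, 4, 6]
add = lambda a, b: a + b
sub = lambda a, b: a - b

# Simulate two partitions that are reduced separately, then combined
part1, part2 = x[:2], x[2:]

add_sequential = reduce(add, x)
add_partitioned = add(reduce(add, part1), reduce(add, part2))

# Subtraction is neither commutative nor associative, so grouping matters
sub_sequential = reduce(sub, x)                                # ((1 - 3) - 4) - 6
sub_partitioned = sub(reduce(sub, part1), reduce(sub, part2))  # (1 - 3) - (4 - 6)

print(add_sequential, add_partitioned)
print(sub_sequential, sub_partitioned)
```

Addition gives the same answer either way, while subtraction depends on how the data happened to be split.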

#### saveAsTextFile() action
* In many cases, it is not advisable to run the `collect()` action on RDDs because of the huge size of the data
* In these cases, it's common to write data out to a distributed storage system such as HDFS or Amazon S3.
* `saveAsTextFile()` action saves RDD into a text file inside a directory with each partition as a separate file.
* Below is an example of `saveAsTextFile` that saves an RDD with each partition as a separate file inside a directory:

```
RDD.saveAsTextFile("tempFile")
```
* To get a single output file instead, first use the `coalesce()` method, which returns a new RDD that is reduced into a single partition
* The `coalesce()` method can therefore be used to save an RDD as a single text file.

```
RDD.coalesce(1).saveAsTextFile("tempFile")
```
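
The "one file per partition inside a directory" layout can be imitated with plain Python. This sketch does not call Spark: `save_partitions` is a hypothetical helper, the two-partition dataset is made up, and the `part-NNNNN` file names are only loosely modeled on the kind of names seen in such output directories.

```python
import os
import tempfile

partitions = [[1, 3], [4, 6]]  # hypothetical 2-partition dataset


def save_partitions(parts, directory):
    """Write each partition to its own part-NNNNN file in `directory`."""
    os.makedirs(directory, exist_ok=True)
    for index, part in enumerate(parts):
        path = os.path.join(directory, "part-{:05d}".format(index))
        with open(path, "w") as handle:
            handle.write("\n".join(str(item) for item in part) + "\n")
    return sorted(os.listdir(directory))


with tempfile.TemporaryDirectory() as tmp:
    many_files = save_partitions(partitions, os.path.join(tmp, "many"))
    # Merging into one partition first yields a single output file
    merged = [[item for part in partitions for item in part]]
    one_file = save_partitions(merged, os.path.join(tmp, "single"))

print(many_files)
print(one_file)
```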

### Action Operations on pair RDDs
* RDD actions available for PySpark pair RDDs
* Pair RDD actions leverage the key-value data
* A few examples of pair RDD actions include:
    * `countByKey()`
    * `collectAsMap()`
    
#### countByKey() action
* `countByKey()` is only available on RDDs of type (Key, Value)
* `countByKey()` operation counts the number of elements for each key.

```
rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
for kee, val in rdd.countByKey().items():
    print(kee, val)
```
* result:

```
a 2
b 1
```
* One thing to **note** is that `countByKey()` should only be used on a dataset whose size is small enough to fit in memory
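
Counting elements per key has a close plain-Python analogue in `collections.Counter`. The sketch below is an illustration on an ordinary list, not a Spark call; the keys are sorted before printing only to make the output order predictable.

```python
from collections import Counter

pairs = [("a", 1), ("b", 1), ("a", 1)]

# Count how many pairs carry each key; the values themselves are ignored
counts = Counter(key for key, _ in pairs)

for kee, val in sorted(counts.items()):
    print(kee, val)
```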

#### collectAsMap() action
* `collectAsMap()` returns the key-value pairs in the RDD as a dictionary

```
sc.parallelize([(1, 2), (3,4)]).collectAsMap()
```
* result: `{1: 2, 3: 4}`
* **Note** this action should also only be used if the resulting data is expected to be small, because all the data is loaded into memory.
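
Turning key-value pairs into a dictionary looks like this in plain Python. The second part, with a repeated key, describes how an ordinary Python `dict` behaves; the duplicate pair `(1, 9)` is made up for illustration.

```python
pairs = [(1, 2), (3, 4)]
as_map = dict(pairs)
print(as_map)

# With a repeated key, a plain dict keeps only the last value seen
with_duplicates = dict([(1, 2), (3, 4), (1, 9)])
print(with_duplicates)
```

A dictionary holds one value per key, so this form suits data where each key appears once.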