# Big Data Fundamentals with PySpark
There's been a lot of buzz about Big Data over the past few years, and it has finally become mainstream for many companies. But what is Big Data? This course covers the fundamentals of Big Data via PySpark. Spark is a "lightning fast cluster computing" framework for Big Data. It provides a general-purpose data processing engine and, according to the project, lets you run programs up to 100x faster in memory, or 10x faster on disk, than Hadoop MapReduce. You’ll use PySpark, a Python package for Spark programming, and its powerful, higher-level libraries such as SparkSQL and MLlib (for machine learning). You will explore the works of William Shakespeare, analyze FIFA 2018 data, and perform clustering on genomic datasets. At the end of this course, you will have gained an in-depth understanding of PySpark and its application to general Big Data analysis.

Instructor: Upendra Devisetty, Science Analyst at CyVerse

## $\star$ Introduction to Big Data Analysis with Spark
This chapter introduces the exciting world of Big Data, as well as the various concepts and different frameworks for processing Big Data. You will understand why Apache Spark is considered the best framework for Big Data.

#### The 3 Vs of Big Data
* The 3 Vs are used to describe big data's characteristics
* **Volume:** Size of the data 
* **Variety:** Different sources and formats of data
* **Velocity:** Speed at which the data is generated and available for processing

#### Big Data concepts and Terminology
* **Clustered computing:** collection of resources of multiple machines
* **Parallel computing:** a type of computation in which many calculations are carried out simultaneously
* **Distributed computing:** Collection of nodes (networked computers) that run in parallel
* **Batch processing:** Breaking the job into small pieces and running them on individual machines
* **Real-time processing:** Immediate processing of data
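
The batch-processing idea above can be sketched on a single machine. This is only an illustration, not how Spark schedules work: threads in a `ThreadPoolExecutor` stand in for the individual machines, and the helper names `split_into_chunks` and `chunk_sum` are invented for this example.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(data, num_chunks):
    """Split a list into num_chunks contiguous pieces of near-equal size."""
    size, remainder = divmod(len(data), num_chunks)
    chunks, start = [], 0
    for i in range(num_chunks):
        # The first `remainder` chunks each take one extra element
        end = start + size + (1 if i < remainder else 0)
        chunks.append(data[start:end])
        start = end
    return chunks


def chunk_sum(chunk):
    """The work done independently on each piece."""
    return sum(chunk)


numbers = list(range(1, 101))
chunks = split_into_chunks(numbers, 4)

# Each worker thread plays the role of one machine in the cluster
with ThreadPoolExecutor(max_workers=4) as pool:
    partial_sums = list(pool.map(chunk_sum, chunks))

# Combine the partial results into the final answer
total = sum(partial_sums)
print(partial_sums, total)
```

The same split, process, and combine pattern is what a cluster framework automates across real machines.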

#### Big Data processing systems
* **Hadoop/MapReduce:** Scalable and fault-tolerant framework; written in Java
    * Open source
    * Batch processing
* **Apache Spark:** General purpose and lightning fast cluster computing system
    * Open source
    * Suited for both batch and real-time data processing

#### Features of Apache Spark framework
* Distributed cluster computing framework
* Efficient in-memory computations for large scale data sets
* Lightning-fast data processing framework
* Provides support for Java, Scala, Python, R, and SQL

#### Spark modes of deployment
* **Local mode:** Single machine such as your laptop
    * Convenient for testing, debugging, and demonstration
* **Cluster mode:** Set of pre-defined machines
    * Good for production
* Typical workflow: Local $\Rightarrow$ clusters
    * During this transition, no code change is necessary
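
One way to keep application code identical across both modes is to read the master URL from configuration instead of hard-coding it. The sketch below assumes a hypothetical environment variable named `DEMO_SPARK_MASTER`, hypothetical helpers `resolve_master` and `build_context`, and a placeholder cluster address `spark://host:7077`. The `pyspark` import is deferred so that the URL logic can be tried without Spark installed.

```python
import os


def resolve_master(default="local[*]"):
    """Return the master URL from the environment, or fall back to local mode.

    DEMO_SPARK_MASTER is a made-up variable name used only for this example.
    """
    return os.environ.get("DEMO_SPARK_MASTER", default)


def build_context(app_name="demo-app"):
    """Create (or reuse) a SparkContext for whichever master is configured."""
    # Imported here so the rest of the example runs without PySpark installed
    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setMaster(resolve_master()).setAppName(app_name)
    return SparkContext.getOrCreate(conf)


# With the variable unset, the application falls back to local mode
os.environ.pop("DEMO_SPARK_MASTER", None)
print(resolve_master())

# Pointing at a cluster changes only the configuration, not the application code
os.environ["DEMO_SPARK_MASTER"] = "spark://host:7077"  # placeholder address
print(resolve_master())
```

When jobs are launched with `spark-submit`, a similar effect is commonly achieved with its `--master` option.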

### PySpark: Spark with Python
#### What is Spark shell?
* Interactive environment for running Spark jobs
* Helpful for fast interactive prototyping
* Spark's shells allow interacting with data on disk or in memory across many machines or one, and Spark takes care of automatically distributing this processing
* Three different Spark shells:
    * Spark-shell for Scala
    * PySpark-shell for Python
    * SparkR for R
    
#### PySpark shell
* PySpark shell is the Python-based command line tool
* PySpark shell allows data scientists to interface with Spark data structures
* PySpark shell supports connecting to a cluster

#### Understanding SparkContext
* SparkContext is an entry point into the world of Spark
* An **entry point** is where control is transferred from the operating system to the provided program.
    * An entry point is a way of connecting to Spark cluster
    * An entry point is "like a key to the house." 
* Access the SparkContext in the PySpark shell as a variable named `sc`

#### Inspecting SparkContext
* **Version:** to retrieve the version of Spark that the SparkContext is currently running:
    * `sc.version`
* **Python Version:** to retrieve Python version *that SparkContext is currently using*
    * `sc.pythonVer`
* **Master:** URL of the cluster, or the string `local` to run in local mode
    * `sc.master`
    * If it returns `local[*]`, the SparkContext acts as a master on a local node, using all available threads on the computer where it is running.
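
As an illustration of how the local master strings read, the helper below (a hypothetical `local_threads` function, not part of PySpark) maps `local`, `local[K]`, and `local[*]` to the number of worker threads they request. It is a simplified sketch: other forms, such as cluster URLs, are deliberately not handled.

```python
import os
import re


def local_threads(master):
    """Return the worker-thread count implied by a local master string.

    Returns None for anything that is not one of the three simple local forms.
    """
    if master == "local":
        return 1  # a single worker thread
    match = re.fullmatch(r"local\[(\*|\d+)\]", master)
    if match is None:
        return None  # e.g. a cluster URL such as spark://host:7077
    value = match.group(1)
    # "*" asks for as many threads as there are logical cores
    return os.cpu_count() if value == "*" else int(value)


for master in ["local", "local[4]", "local[*]", "spark://host:7077"]:
    print(master, "->", local_threads(master))
```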
    
#### Loading data in PySpark
* SparkContext's **`parallelize()`** method (used on a list)
    * For example, to create parallelized collections holding the numbers 1 to 5:
    * `rdd = sc.parallelize([1, 2, 3, 4, 5])`
* SparkContext's **`textFile()`** method (used on a file)
    * For example, to load a text file named `test.txt` using this method:
    * `rdd2 = sc.textFile("test.txt")`
    
```
# Print the version of SparkContext
print("The version of Spark Context in the PySpark shell is", sc.version)

# Print the Python version of SparkContext
print("The Python version of Spark Context in the PySpark shell is", sc.pythonVer)

# Print the master of SparkContext
print("The master of Spark Context in the PySpark shell is", sc.master)
```
***

```
# Create a sequence of numbers from 1 to 100 (a range, which parallelize() accepts like a list)
numb = range(1, 101)

# Load the list into PySpark  
spark_data = sc.parallelize(numb)
```
***

```
# Load a local file into PySpark shell (file_path is assumed to hold the path to a text file)
lines = sc.textFile(file_path)
```

### Review of functional programming in Python
* Understanding PySpark becomes a lot easier if we understand functional programming principles in Python:
    * `lambda`
    * `map`
    * `filter`
* Python supports the creation of anonymous functions.
    * **Anonymous functions** are functions that are not bound to a name at runtime, using a construct called `lambda`
    * Lambda functions are very powerful and well-integrated into Python
    * Lambdas are especially convenient with `map()` and `filter()`
    * Like `def`, `lambda` creates a function to be called later in the program. However, it returns the function instead of assigning it to a name (i.e., it is **anonymous**).
    * In practice, they are used as a way to inline a function definition, or to defer execution of a piece of code.
    
#### Lambda function syntax
* Lambda functions can be used wherever function objects are required.
* They can have any number of arguments, but only one expression, and the expression is evaluated and returned.
* **The general syntax of the lambda function is:**

**`lambda arguments: expression`**

Examples:

```
double = lambda x: x * 2
print(double(3))  # prints 6
```
***

```
g = lambda x: x**3
print(g(10))  # prints 1000
```


* No return statement for lambda
* Can put lambda function anywhere, without ever assigning it to a variable
* We use lambda functions when we require a nameless function for a short period of time
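
To make the points above concrete, the short sketch below compares a `def` function with the equivalent lambda and then uses a lambda inline without naming it. The names and sample data (`cube_def`, `cube_lambda`, `words`) are invented for illustration.

```python
# A named function and the equivalent lambda behave the same way
def cube_def(x):
    return x ** 3

cube_lambda = lambda x: x ** 3
print(cube_def(4), cube_lambda(4))

# A lambda used inline as a sort key, never bound to a name
words = ["spark", "rdd", "python", "map"]
by_length = sorted(words, key=lambda w: len(w))
print(by_length)

# A lambda defined and called in a single expression
result = (lambda a, b: a + b)(2, 3)
print(result)
```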

#### Use of Lambda function in Python - map()
* `map()` takes a function and an iterable (such as a list) and returns an iterator over the results of applying that function to each item; wrap it in `list()` to get a list
* General syntax of `map()`: 
    * **`map(function, iterable)`**
* Example of `map` with `lambda`:

```
items = [1, 2, 3, 4]
list(map(lambda x: x + 2, items))
```

result:

**`[3, 4, 5, 6]`**
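
One detail worth seeing in action: in Python 3, `map()` returns a lazy iterator rather than a list, which is why the example wraps it in `list()`. A small sketch using the same `items` list (the variable names `mapped`, `first`, and `rest` are chosen only for this illustration):

```python
items = [1, 2, 3, 4]

# map() builds an iterator; nothing has been computed yet
mapped = map(lambda x: x + 2, items)
print(type(mapped).__name__)

# Values are produced one at a time, on demand
first = next(mapped)
print(first)

# list() consumes whatever is left in the iterator
rest = list(mapped)
print(rest)

# The iterator is now exhausted, so a second pass yields nothing
print(list(mapped))
```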



#### Use of Lambda function in Python - filter()
* `filter()` takes a function and an iterable (such as a list) and returns an iterator over the items for which the function evaluates to true; wrap it in `list()` to get a list
* General syntax of `filter()`:
    * **`filter(function, iterable)`**
* Example of `filter()` with `lambda`:

```
items = [1, 2, 3, 4]
list(filter(lambda x: (x%2 != 0), items))
```

result:

**`[1, 3]`**

```
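# Sample input; the original exercise assumes my_list already exists
my_list = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# Print my_list in the console
print("Input list is", my_list)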

# Square all numbers in my_list
squared_list_lambda = list(map(lambda x: x**2, my_list))

# Print the result of the map function
print("The squared numbers are", squared_list_lambda)
```
***

```
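# Sample input; the original exercise assumes my_list2 already exists
my_list2 = [10, 21, 31, 40, 51, 60, 72, 80, 93, 101]

# Print my_list2 in the console
print("Input list is:", my_list2)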

# Filter numbers divisible by 10
filtered_list = list(filter(lambda x: (x%10 == 0), my_list2))

# Print the numbers divisible by 10
print("Numbers divisible by 10 are:", filtered_list)
```

## $\star$ Chapter 2: Programming in PySpark RDDs
The main abstraction Spark provides is a resilient distributed dataset (RDD), which is the fundamental and backbone data type of this engine. This chapter introduces RDDs and shows how RDDs can be created and executed using RDD Transformations and Actions.

#### Introduction to PySpark RDD
In this chapter, we will start working with RDDs, which are Spark's core abstraction for working with data.

* **RDD** = **Resilient Distributed Datasets**
* RDDs are a collection of data distributed across the cluster
* RDD is the fundamental and backbone data type in PySpark

#### Decomposing RDDs
* Resilient Distributed Datasets
    * **Resilient:** ability to withstand failures
    * **Distributed:** spanning across multiple machines
    * **Datasets:** collection of partitioned data, e.g., arrays, tables, tuples
    
#### Creating RDDs
* Three methods for creating RDDs
* **Parallelize:**
    * The **simplest method** to create RDDs is to take an existing collection of objects (for example a list, array, or set) and pass it to SparkContext's parallelize method.
* **External datasets:**
    * A **more common** way to create RDDs is to load data from external datasets such as:
        * files stored in HDFS
        * Objects in Amazon S3 bucket
        * lines in a text file
* **From existing RDDs**

#### Parallelized collection (parallelizing)
* RDDs are created from a list or a set using the SparkContext's `parallelize` method.

```
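# Parallelize a list: each number becomes one element of the RDD
numRDD = sc.parallelize([1, 2, 3, 4])
# A string is an iterable, so each character becomes one element
helloRDD = sc.parallelize("Hello world")
# Inspect the type of the resulting object
type(helloRDD)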
```

#### From external datasets
* Creating RDDs from external datasets is by far the most common method in PySpark
* `textFile()` for creating RDDs from external datasets
* For example, for a `README.md` file stored locally:
    * `fileRDD = sc.textFile("README.md")`
    * `type(fileRDD)`

#### Understanding Partitioning in PySpark
* Data partitioning is an important concept in Spark 