# Big Data Fundamentals with PySpark
There's been a lot of buzz about Big Data over the past few years, and it has finally become mainstream for many companies. But what is Big Data? This course covers the fundamentals of Big Data via PySpark. Spark is a "lightning fast cluster computing" framework for Big Data. It provides a general-purpose data processing engine and, according to the Spark project, lets you run programs up to 100x faster in memory, or 10x faster on disk, than Hadoop MapReduce. You’ll use PySpark, a Python package for Spark programming, and its powerful higher-level libraries such as Spark SQL and MLlib (for machine learning). You will explore the works of William Shakespeare, analyze FIFA 2018 data, and perform clustering on genomic datasets. At the end of this course, you will have gained an in-depth understanding of PySpark and its application to general Big Data analysis.

Instructor: Upendra Devisetty, Science Analyst at CyVerse

## $\star$ Introduction to Big Data Analysis with Spark
This chapter introduces the exciting world of Big Data, as well as the concepts and frameworks used to process it. You will understand why Apache Spark is often considered the framework of choice for Big Data.

#### The 3 Vs of Big Data
* The 3 Vs are used to describe big data's characteristics
* **Volume:** Size of the data 
* **Variety:** Different sources and formats of data
* **Velocity:** Speed at which the data is generated and available for processing

#### Big Data concepts and Terminology
* **Clustered computing:** Pooling the resources of multiple machines
* **Parallel computing:** A type of computation in which many calculations are carried out simultaneously
* **Distributed computing:** A collection of nodes (networked computers) that run in parallel
* **Batch processing:** Breaking the job into small pieces and running them on individual machines
* **Real-time processing:** Immediate processing of data
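
The split-then-combine idea behind parallel and batch processing can be sketched in plain Python. This is only an illustration under stated assumptions: worker threads stand in for the individual machines of a cluster, and the helper names `split_into_chunks` and `sum_of_squares` are hypothetical. It shows the pattern, not the speedup a real cluster would give.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(items, n_chunks):
    """Break a list into at most n_chunks contiguous pieces."""
    size = -(-len(items) // n_chunks)  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


def sum_of_squares(chunk):
    """The work done independently on each piece."""
    return sum(x * x for x in chunk)


data = list(range(1, 1001))
chunks = split_into_chunks(data, 4)

# Each worker processes one chunk; here threads play the role of machines.
with ThreadPoolExecutor(max_workers=4) as pool:
    partial_sums = list(pool.map(sum_of_squares, chunks))

# Combine the partial results into the final answer.
total = sum(partial_sums)
print(total)
```

Spark applies the same pattern, but it distributes the pieces across the nodes of a cluster and handles scheduling and failures for you.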

#### Big Data processing systems
* **Hadoop/MapReduce:** Scalable and fault-tolerant framework; written in Java
    * Open source
    * Batch processing
* **Apache Spark:** General purpose and lightning fast cluster computing system
    * Open source
    * Suited for both batch and real-time data processing

#### Features of Apache Spark framework
* Distributed cluster computing framework
* Efficient in-memory computations for large scale data sets
* Lightning-fast data processing framework
* Provides support for Java, Scala, Python, R, and SQL

#### Spark modes of deployment
* **Local mode:** Single machine such as your laptop
    * Convenient for testing, debugging, and demonstration
* **Cluster mode:** Set of pre-defined machines
    * Good for production
* Typical workflow: Local $\Rightarrow$ clusters
    * During this transition, no code change is necessary

### PySpark: Spark with Python
#### What is Spark shell?
* Interactive environment for running Spark jobs
* Helpful for fast interactive prototyping
* Spark's shells allow interacting with data on disk or in memory across many machines or one, and Spark takes care of automatically distributing this processing
* Three different Spark shells:
    * `spark-shell` for Scala
    * `pyspark` for Python
    * `sparkR` for R
    
#### PySpark shell
* PySpark shell is the Python-based command line tool
* PySpark shell allows data scientists to interface with Spark data structures
* PySpark shell supports connecting to a cluster

#### Understanding SparkContext
* SparkContext is an entry point into the world of Spark
* An **entry point** is where control is transferred from the operating system to the provided program.
    * In Spark, the entry point is the way a program connects to a Spark cluster
    * An entry point is "like a key to the house."
* Access the SparkContext in the PySpark shell as a variable named `sc`

#### Inspecting SparkContext
* **Version:** to retrieve the version of Spark that you are currently running:
    * `sc.version`
* **Python Version:** to retrieve the Python version *that SparkContext is currently using*
    * `sc.pythonVer`
* **Master:** URL of the cluster, or the string "local" to run SparkContext in local mode
    * `sc.master`
    * If it returns `local[*]`, SparkContext is acting as the master on a local node, using all available threads on the computer where it is running.
    
#### Loading data in PySpark
* SparkContext's **`parallelize()`** method (used on a list)
    * For example, to create a parallelized collection holding the numbers 1 to 5:
    * `rdd = sc.parallelize([1, 2, 3, 4, 5])`
* SparkContext's **`textFile()`** method (used on a file)
    * For example, to load a text file named `test.txt` using this method:
    * `rdd2 = sc.textFile("test.txt")`