# Introduction to Big Data analysis with Spark

## What is Big Data?

There is no single definition of Big Data because the term is used in quite different ways.

The 3 V's of Big Data:
* Volume: Size of the data
* Variety: Different sources and formats
* Velocity: Speed at which data is generated and processed

Some of the concepts of Big Data:

* Clustered computing: Pooling the resources of multiple machines to complete jobs.
* Parallel computing: Running computations simultaneously.
* Distributed computing: Nodes (networked computers) that run jobs in parallel.
* Batch processing: Breaking a job into smaller pieces and running each piece on an individual machine.
* Real-time processing: Processing data immediately as it arrives.
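As a rough illustration of batch and parallel processing on a single machine, the sketch below splits a list into pieces and sums each piece on a separate worker. It uses only Python's standard library, not Spark. The helper names `chunked` and `sum_chunk` and the chunk size of 25 are assumptions made for this example.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(items, size):
    """Split a list into consecutive pieces of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def sum_chunk(chunk):
    """Work done on one piece of the data."""
    return sum(chunk)


data = list(range(1, 101))
chunks = chunked(data, 25)  # four pieces of 25 numbers each

# Each chunk is handed to a worker; the partial results are combined at the end.
with ThreadPoolExecutor(max_workers=4) as pool:
    partial_sums = list(pool.map(sum_chunk, chunks))

total = sum(partial_sums)
print(partial_sums)  # prints [325, 950, 1575, 2200]
print(total)  # prints 5050
```

Threads keep the sketch simple here. On a cluster, each piece would instead be placed on a different machine.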

There are two popular frameworks for Big Data processing:

* Hadoop/MapReduce: An open-source, scalable framework for batch data.
* Apache Spark: A parallel framework for storing and processing Big Data across clustered computers. It is open source and suited to both batch and real-time data processing.
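The MapReduce model behind Hadoop can be sketched in plain Python as a map step, a grouping step and a reduce step. This is only a single-machine illustration of the idea, not Hadoop's API, and the sample `documents` are made up for the example.

```python
from collections import defaultdict
from functools import reduce

# Hypothetical input: three tiny "documents".
documents = ["spark is fast", "hadoop is scalable", "spark is open source"]

# Map: emit a (word, 1) pair for every word in every document.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: group the emitted values by key (the word).
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce: combine the values for each key into a single count.
word_counts = {
    word: reduce(lambda a, b: a + b, counts)
    for word, counts in grouped.items()
}

print(word_counts["spark"])  # prints 2
print(word_counts["is"])  # prints 3
```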

Spark distributes data and computation across multiple computers. It runs most computations in memory, which gives better performance for applications such as interactive data mining.

Spark can be run in two modes:

* Local mode: Spark runs on a single machine, such as a laptop. This is convenient for testing, debugging and demonstration purposes.
* Cluster mode: Spark runs on a cluster, that is, a set of pre-defined machines. This is suited to production.

## PySpark: Spark with Python

PySpark lets data scientists work with Spark data structures from Python. To interact with Spark from the PySpark shell, you need an entry point, which is a way of connecting to a Spark cluster. SparkContext is the entry point to the underlying Spark functionality.

You can load data into PySpark through SparkContext in two ways: call SparkContext's parallelize method on a list, or call SparkContext's textFile method on a file.

In [8]:
from pyspark import SparkContext
sc = SparkContext("local", "pyspark-shell")

### Understanding SparkContext

In [11]:
print("The version of Spark Context in the PySpark shell is", sc.version)
print()
print("The Python version of Spark Context in the PySpark shell is", sc.pythonVer)
print()
print("The master of Spark Context in the PySpark shell is", sc.master)

The version of Spark Context in the PySpark shell is 3.1.2

The Python version of Spark Context in the PySpark shell is 3.8

The master of Spark Context in the PySpark shell is local


### Interactive Use of PySpark


In [13]:
numb = range(1, 101)

spark_data = sc.parallelize(numb)

### Loading data in PySpark shell

In [None]:
# file_path is a placeholder: set it to the path of a text file before running this cell
lines = sc.textFile(file_path)

## Review of functional programming in Python

Lambda functions are anonymous functions: they are defined inline with the lambda keyword and do not need to be bound to a name. They are concise and are most often used with built-in functions such as map() and filter().
The map() function takes a function and a list and returns the result of applying that function to each item. The filter() function takes a function and a list and returns only the items for which the function evaluates to true. In Python 3 both return iterators rather than lists, which is why the examples below wrap the result in list().

General syntax of filter and map:

    filter(function, list)
    map(function, list)

In [28]:
my_list =  [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

print("Input list is", my_list)

squared_list_lambda = list(map(lambda x: x**2, my_list))

print("The squared numbers are", squared_list_lambda)

Input list is [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
The squared numbers are [1, 4, 9, 16, 25, 36, 49, 64, 81, 100]
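The lambda used with map() above behaves like a small named function. The sketch below makes that equivalence explicit and also shows that map() returns a lazy iterator that can be consumed only once. The function name `square` is chosen for this example.

```python
def square(x):
    """Named equivalent of `lambda x: x**2`."""
    return x ** 2


my_list = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

squared_lazy = map(square, my_list)  # an iterator; nothing is computed yet
squared_list_named = list(squared_lazy)  # consuming the iterator builds the list
squared_list_lambda = list(map(lambda x: x ** 2, my_list))

print(squared_list_named == squared_list_lambda)  # prints True
print(list(squared_lazy))  # prints [] because the iterator is already exhausted
```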


### Use of lambda with filter()


In [30]:
my_list2 = [10, 21, 31, 40, 51, 60, 72, 80, 93, 101]
print("Input list is:", my_list2)

filtered_list = list(filter(lambda x: x%10 == 0, my_list2))
print("Numbers divisible by 10 are:", filtered_list)

Input list is: [10, 21, 31, 40, 51, 60, 72, 80, 93, 101]
Numbers divisible by 10 are: [10, 40, 60, 80]
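filter() and map() can also be chained, since the output of one is a valid input to the other. The sketch below keeps the multiples of 10 from the same list and then halves them, with an equivalent list comprehension for comparison. Halving is an arbitrary operation picked for the illustration.

```python
my_list2 = [10, 21, 31, 40, 51, 60, 72, 80, 93, 101]

# Keep the multiples of 10, then halve each of them.
halved = list(map(lambda x: x // 2, filter(lambda x: x % 10 == 0, my_list2)))

# The same result written as a list comprehension.
halved_comprehension = [x // 2 for x in my_list2 if x % 10 == 0]

print(halved)  # prints [5, 20, 30, 40]
print(halved == halved_comprehension)  # prints True
```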
