<img src="images/spark.png" alt="drawing" width="200"/>

# Introduction to Spark and PySpark




## Lambda Functions in Python

### Introduction

Taken literally, an anonymous function is a function without a name. In Python, an anonymous function is created with the lambda keyword. More loosely, it may or may not be assigned a name.
Lambda functions can take any number of arguments but contain only one expression. The expression is evaluated and its value is returned. Lambda functions can be used wherever function objects are required.

**Syntax: lambda arguments: expression**

1. A lambda function can take any number of arguments but contains only one expression, which is evaluated and returned.
1. Lambda functions can be used wherever function objects are required.
1. Keep in mind that lambda functions are syntactically restricted to a single expression; statements such as assignments or loops are not allowed.
1. They are most often used as short, throwaway functions that are passed as arguments to other functions.



Consider a two-argument anonymous function defined with lambda but not bound to a variable. The lambda is not given a name:

In [4]:
(lambda x, y: x + y)(2, 3)


5

Consider a one-argument anonymous function defined with lambda and bound to a variable:

In [1]:
# Multiply an element by 2 
double = lambda x: x * 2

print(double(5))

10


This is nearly the same as:

In [6]:
def double(x):
   return x * 2

double(5)

10

In Python, lambda functions are generally used as arguments to higher-order functions (functions that take other functions as arguments). They are often used along with built-in functions such as filter() and map().
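
As a small illustration of passing a lambda to a higher-order function, the sketch below uses the built-ins sorted() and max(), both of which accept a function through their key parameter. The sample data is hypothetical and invented only for this example.

```python
# Hypothetical sample data: (name, age) pairs, invented for illustration.
people = [("Ada", 36), ("Linus", 28), ("Grace", 45)]

# sorted() is a higher-order function: its `key` parameter receives a function.
# Here the lambda tells sorted() to order the pairs by age.
by_age = sorted(people, key=lambda person: person[1])
print(by_age)

# max() accepts the same kind of key function.
oldest = max(people, key=lambda person: person[1])
print(oldest)
```

The lambda is never given a name here; it exists only for the duration of the call.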

### Example Using filter function

The filter() function in Python takes a function and an iterable (here, a list) as arguments.
The function is called on each item, and filter() returns an iterator over the items for which the function evaluates to True. Wrapping the result in list() produces a new list.

In [1]:
# Program to keep only the items greater than 6 from a list
my_list = [1, 5, 4, 6, 8, 11, 3, 12]
print(my_list)

new_list = list(filter(lambda x: (x > 6) , my_list))

print(new_list)

[1, 5, 4, 6, 8, 11, 3, 12]
[8, 11, 12]
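
The call to list() in the cell above matters because, in Python 3, filter() returns a lazy iterator rather than a list. The following minimal sketch shows that behaviour and an equivalent list comprehension; the variable names are chosen only for this illustration.

```python
my_list = [1, 5, 4, 6, 8, 11, 3, 12]

# filter() returns a lazy iterator, not a list.
filtered = filter(lambda x: x > 6, my_list)
is_list = isinstance(filtered, list)
print(is_list)

# Items are produced only when the iterator is consumed.
first_pass = list(filtered)
second_pass = list(filtered)  # the iterator is exhausted after the first pass
print(first_pass)
print(second_pass)

# An equivalent list comprehension builds the list directly.
comprehension = [x for x in my_list if x > 6]
print(comprehension)
```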


### Example Using map function

The map() function in Python takes a function and an iterable (here, a list) as arguments.
The function is called on each item, and map() returns an iterator over the values returned by that function. Wrapping the result in list() produces a new list.

In [9]:
# Program to double each item in a list using map()

my_list = [1, 5, 4, 6, 8, 11, 3, 12]

new_list = list(map(lambda x: x * 2 , my_list))

print(new_list)

[2, 10, 8, 12, 16, 22, 6, 24]
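
map() also accepts several iterables, in which case the function receives one item from each on every call. The sketch below uses hypothetical price and quantity lists invented for this example.

```python
# Hypothetical data, invented for illustration.
prices = [2.0, 3.5, 1.25]
quantities = [3, 2, 4]

# The two-argument lambda receives one item from each list per call.
totals = list(map(lambda p, q: p * q, prices, quantities))
print(totals)

# The same computation written with zip() and a list comprehension.
totals_zip = [p * q for p, q in zip(prices, quantities)]
print(totals_zip)
```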


### Example Using reduce function

The reduce() function, available in the functools module, implements a mathematical technique called folding or reduction.
reduce() is useful when you need to apply a two-argument function cumulatively to the items of an iterable and reduce them to a single value.

In [2]:
from functools import reduce

lst = [1, 2, 3, 4, 5]
reduce(lambda x, y: x + y, lst)

15
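
To make the folding step more concrete, the sketch below spells out the order of evaluation and shows the optional third argument of reduce(), the initializer. The variable names are chosen only for this illustration.

```python
from functools import reduce

lst = [1, 2, 3, 4, 5]

# reduce() folds from the left: ((((1 + 2) + 3) + 4) + 5)
total = reduce(lambda acc, x: acc + x, lst)
print(total)

# The optional initializer is used as the starting value of the accumulator...
total_from_100 = reduce(lambda acc, x: acc + x, lst, 100)
print(total_from_100)

# ...and is returned unchanged when the iterable is empty.
empty_total = reduce(lambda acc, x: acc + x, [], 0)
print(empty_total)

# Any two-argument function works, for example keeping the larger value.
largest = reduce(lambda a, b: a if a > b else b, lst)
print(largest)
```

Without an initializer, calling reduce() on an empty iterable raises a TypeError, so supplying one is a reasonable default when the input may be empty.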

### Pros and Cons of a Lambda Function in Python

* Pros
  * It’s an ideal choice for evaluating a single expression that is supposed to be evaluated only once.
  * It can be invoked as soon as it is defined.
  * Its syntax is more compact in comparison to a corresponding normal function.
  * It can be passed as a parameter to a higher-order function, like filter(), map(), and reduce().

* Cons
  * It can’t contain multiple expressions or any statements.
  * It can easily become cumbersome, for example when it chains conditional expressions to emulate an if-elif-…-else block.
  * It can’t contain any variable assignments (e.g., lambda x: x=0 will throw a SyntaxError).
  * We can’t provide a docstring to a lambda function.
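
The trade-offs listed above can be seen in a short side-by-side sketch. The function names sign and sign_def are hypothetical and chosen only for this illustration.

```python
# A lambda is limited to a single expression, so branching has to be written
# as chained conditional expressions, which quickly becomes hard to read.
sign = lambda x: "positive" if x > 0 else "negative" if x < 0 else "zero"


def sign_def(x):
    """Return 'positive', 'negative' or 'zero' for a number."""
    if x > 0:
        return "positive"
    elif x < 0:
        return "negative"
    else:
        return "zero"


print(sign(-3), sign_def(-3))

# The lambda has a generic name and no docstring; the def version has both.
print(sign.__name__, sign.__doc__)
print(sign_def.__name__, sign_def.__doc__)
```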

## Spark
Apache Spark is a fast, general-purpose cluster computing framework.
It performs computations in memory, which lets it analyze data with low latency.
It emerged because Apache Hadoop MapReduce supported only batch processing and lacked a real-time processing feature.
Apache Spark can perform stream processing in near real time and can also handle batch processing.
Apart from stream and batch processing, Apache Spark also supports interactive queries and iterative algorithms. Apache Spark ships with its own standalone cluster manager, on which it can host its applications. It can also build on Apache Hadoop: it can use HDFS (Hadoop Distributed File System) for storage and run Spark applications on YARN.

## Parallel Programming in Python

### Parallelizable and non-parallelizable tasks

Some tasks are easily parallelizable while others inherently aren’t. 
However, it might not always be immediately apparent that a task is parallelizable.
Let us consider the following piece of code.

In [6]:
x = [1, 2, 3] # Input

total = 0 # Initialize output (named total to avoid shadowing the built-in sum)

for e in x:
  total = total + e # Add each element to the output variable

print(total) # Print output

6


Each successive iteration of the loop uses the result of the previous iteration, so, as written, the iterations cannot run independently of each other. The following dependency diagram makes that clear:

<img src="images/serial.svg" alt="serial" width="200"/>
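
Although the loop above is serial as written, addition is associative, so the same result can be obtained by summing independent chunks and then combining the partial sums. The following is a minimal sketch of that restructuring, not a performance recipe: the helper chunk() and the choice of four chunks are assumptions made for this illustration, and a thread pool is used only to show the structure (for CPU-bound work in CPython, threads are limited by the global interpreter lock).

```python
from concurrent.futures import ThreadPoolExecutor

x = list(range(1, 101))  # Input: the integers 1..100


def chunk(data, n_chunks):
    """Split data into at most n_chunks contiguous pieces (hypothetical helper)."""
    size = -(-len(data) // n_chunks)  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


chunks = chunk(x, 4)

# Each chunk is summed independently of the others, so these calls
# have no dependency on one another and could run in parallel.
with ThreadPoolExecutor(max_workers=4) as pool:
    partial_sums = list(pool.map(sum, chunks))

# Only the final combination step depends on all the partial results.
total = sum(partial_sums)

print(partial_sums)
print(total)
```

This split-then-combine pattern is the same idea behind map and reduce operations on distributed data.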

## PySpark

Apache Spark is written in the Scala programming language.
To support Python with Spark, the Apache Spark community released a tool called PySpark.
Using PySpark, you can also work with RDDs (Resilient Distributed Datasets) in Python.
This is made possible by a library called Py4J.
PySpark offers the PySpark shell, which links the Python API to the Spark core and initializes the Spark context.
Many data scientists and analytics experts today use Python because of its rich set of libraries.
Integrating Python with Spark is a boon to them.

### PySpark shell

The PySpark shell is a REPL (Read-Eval-Print Loop) used to quickly test PySpark statements.
Spark shells are available for Scala, Python and R.
The pyspark command launches Spark with a Python shell, also called the PySpark shell.

<img src="images/pyspark shell.png" alt="drawing" width="800"/>



### Using Spark in Python - Spark Context

The first step in using Spark is connecting to a Spark cluster.
A SparkContext represents the connection to a Spark cluster, and can be used to create RDDs, accumulators and broadcast variables on that cluster.
SparkContext is the entry point to any Spark functionality. When we run a Spark application, a driver program starts; it contains the main function, and the SparkContext is initiated there.
The driver program then runs the operations inside the executors on worker nodes.
In practice, the cluster will be hosted on a remote machine that's connected to all other nodes.
There will be one computer, called the master, that manages splitting up the data and the computations.
The master is connected to the rest of the computers in the cluster, which are called workers.
The master sends the workers data and calculations to run, and they send their results back to the master.

<img src="images/sparkContext.png" alt="drawing" width="600"/>

In [1]:
from pyspark import SparkConf 
from pyspark.context import SparkContext 

sc = SparkContext.getOrCreate(SparkConf().setMaster("local[*]").setAppName("Intro pyspark"))


# Verify SparkContext
print(sc)

# Print Spark version
print(sc.version)

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


22/11/06 13:32:08 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
<SparkContext master=local[*] appName=Intro pyspark>
3.3.0
