<a href="https://colab.research.google.com/github/cagBRT/PySpark/blob/master/PySpark_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# Clone the entire repo.
!git clone -l -s https://github.com/cagBRT/PySpark.git cloned-repo
%cd cloned-repo
#!ls

In [None]:
from IPython.display import Image

def page(name):
    # Display the image <name>.png from the cloned repo
    return Image(f"/content/cloned-repo/{name}.png", width=640)

PySpark is the Python API for Apache Spark, an open-source distributed computing framework and set of libraries for real-time, large-scale data processing.

In [None]:
page("PySpark")

In [None]:
!pip install pyspark

In [None]:
#Import SparkSession
from pyspark.sql import SparkSession
# Create a Spark Session
#getOrCreate gets or creates a session
spark = SparkSession.builder.master("local[*]").getOrCreate()
# Check Spark Session Information
spark

In [None]:
#Import a Spark function from library
from pyspark.sql.functions import col

In [None]:
from pyspark import SparkContext
from pyspark.sql import SparkSession

# getOrCreate() returns the existing session if one was already created
spark = SparkSession.builder.master("local[*]").getOrCreate()
print("If no error - everything is working")


**Get the SparkContext**<br>
The SparkContext is the entry point for creating RDDs

In [None]:
sc = SparkContext.getOrCreate()

In [None]:
# Tools we need to connect to the Spark server, load our data,
# clean it and prepare it

from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.feature import IndexToString, StringIndexer, VectorIndexer, VectorAssembler
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

from pyspark.sql.functions import isnan, when, count, col

In [None]:
import pandas as pd



---



---



In [None]:
page("distributed computing")

In [None]:
page("parallel computing")

A graphical representation of Amdahl's law. The speedup of a program from parallelization is limited by how much of the program can be parallelized. For example, if 90% of the program can be parallelized, the theoretical maximum speedup using parallel computing would be 10 times no matter how many processors are used.

In [None]:
page("amdahls law chart")
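
Amdahl's law can be written as a short formula: if a fraction `p` of the program can be parallelized and `n` processors are used, the theoretical speedup is `1 / ((1 - p) + p / n)`, which approaches `1 / (1 - p)` as `n` grows. The sketch below evaluates it for the 90% case described above; the function name `amdahl_speedup` is only illustrative.

```python
def amdahl_speedup(p, n):
    """Theoretical speedup when a fraction p of the work runs on n processors."""
    return 1.0 / ((1.0 - p) + p / n)

# With 90% of the program parallelizable, the speedup approaches 10x
# as processors are added, but never exceeds it
for n in (1, 10, 100, 1000, 1_000_000):
    print(f"{n:>9} processors -> speedup {amdahl_speedup(0.9, n):.2f}")
```

Adding processors gives diminishing returns because the serial 10% of the program always has to run in full.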

Assume that a task has two independent parts, A and B. Part B takes roughly 25% of the time of the whole computation. With a great deal of effort, one may be able to make this part 5 times faster, but this reduces the time for the whole computation by only about 20%. In contrast, it may take less work to make part A twice as fast, which reduces the total time by about 37.5%. This makes the computation much faster than optimizing part B, even though part B's speedup is greater by ratio (5 times versus 2 times).

In [None]:
page("speedup")
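
The comparison between parts A and B can be checked with a few lines of arithmetic. This sketch assumes part A accounts for the remaining 75% of the run time, which follows from part B taking roughly 25%; the helper `overall_speedup` is a hypothetical name.

```python
def overall_speedup(fraction, factor):
    """Whole-program speedup when `fraction` of the run time is made `factor` times faster."""
    return 1.0 / ((1.0 - fraction) + fraction / factor)

speedup_b = overall_speedup(0.25, 5)  # part B (25% of the time) made 5x faster
speedup_a = overall_speedup(0.75, 2)  # part A (75% of the time) made 2x faster

print(f"Optimizing B: {speedup_b:.2f}x overall")
print(f"Optimizing A: {speedup_a:.2f}x overall")
```

Under these assumptions, optimizing B makes the whole computation about 1.25 times faster, while optimizing A makes it about 1.6 times faster.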

# **RDD stands for Resilient Distributed Dataset**
An RDD is a collection of elements that is spread across multiple nodes of a cluster so that it can be processed in parallel.<br>
**Once you create an RDD you cannot change it.** <br>
RDDs are also fault tolerant: in case of a failure, they recover automatically. You can apply multiple operations to an RDD to accomplish a task.<br>

There are two kinds of operations that can be applied to RDDs:

>Transformation<br>
Action<br>



**Transformation** − An operation applied to an RDD **to create a new RDD**. `filter`, `groupBy`, and `map` are examples of transformations.

**Action** − An operation applied to an RDD that instructs Spark to perform the computation and send the result back to the driver.

**When to use RDDs**<br>
Use RDDs in situations where:<br>

**Data is unstructured.** Unstructured data sources such as media or text streams benefit from the performance advantages RDDs offer.<br><br>
**Transformations are low-level.** Data manipulation should be fast and straightforward when nearer to the data source.<br><br>
**Schema is unimportant.** Since RDDs do not impose schemas, use them when accessing specific data by column or attribute is not relevant.

**Create a PySpark RDD**

In [None]:
words = sc.parallelize(
    ["scala",
     "java",
     "hadoop",
     "spark",
     "akka",
     "spark vs hadoop",
     "pyspark",
     "pyspark and spark"]
)

**Assignment 1**<br>
Create an RDD called favorites. <br>
It should contain 10 of your favorite things. 

In [None]:
#Assignment


**count()**<br>
Returns the number of elements in the RDD

In [None]:
counts = words.count()
print ("Number of elements in RDD -> %i" % (counts))

**Assignment 2**<br>
Count the number of elements in the RDD you created called favorites

In [None]:
#Assignment 2

**collect()**<br>
Returns all the elements in the RDD to the driver as a list

In [None]:
coll = words.collect()
print ("Elements in RDD -> %s" % (coll))

**Assignment 3**<br>
Collect the elements in favorites

In [None]:
#Assignment 3




---


---

Lambda Example Begin

---



---



**Lambda functions**<br>
A lambda function is a small anonymous function.<br>

A lambda function can take any number of arguments, but can only have one expression.


Syntax for lambda functions:<br>
>lambda arguments : expression

In [None]:
# a is the input to the lambda
# the lambda returns a + 5
returnOfFunction = lambda a: a + 5
# print the value returned when a = 3
print(returnOfFunction(3))

In [None]:
#a function with more than one input
returnOfFunction=lambda a,b: a+b+100
print(returnOfFunction(3,4))

**Lambda Assignment**<br>
Write a lambda function that adds three input values<br>


In [None]:
#@title
returnOfFunction=lambda a,b,c: a+b+c
print(returnOfFunction(3,4,5))

The power of lambdas is clearer when you use one as an anonymous function inside another function.

1. Define a function that returns a lambda function

In [None]:
def afunction(n):
  return lambda a:a*n

Assign the output of the function to a variable<br>
(the argument sets the "n" value)

In [None]:
multiplier=afunction(4)

Now call the returned function to supply a value for "a"

In [None]:
print(multiplier(3))
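
The same pattern can produce several independent functions, each remembering its own value of n. A short sketch that redefines `afunction` so it runs on its own; the names `doubler` and `tripler` are illustrative.

```python
def afunction(n):
    # The returned lambda remembers the value of n it was created with
    return lambda a: a * n

doubler = afunction(2)
tripler = afunction(3)

print(doubler(10))  # 20
print(tripler(10))  # 30
```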

**Assignment:**
Create a function that uses a lambda function

In [None]:
#Assignment




---


---

Lambda Example End

---



---



**filter(f)**<br>
A filter returns a new RDD containing only the elements that meet the requested condition

In the example below, the words RDD is searched for those elements containing 'spark'

The words RDD:<br>
>["scala", <br>
   "java",<br> 
   "hadoop", <br>
   "spark", <br>
   "akka",<br>
   "spark vs hadoop", <br>
   "pyspark",<br>
   "pyspark and spark"]<br>

In [None]:
words_filter = words.filter(lambda x: 'spark' in x)
filtered = words_filter.collect()
print ("Filtered RDD -> %s" % (filtered))
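
The lambda passed to `filter` is an ordinary Python function that returns True or False for one element at a time, so it can be tried out on a plain list before handing it to Spark. A sketch, with `words_local` and `contains_spark` as illustrative names:

```python
words_local = ["scala", "java", "hadoop", "spark", "akka",
               "spark vs hadoop", "pyspark", "pyspark and spark"]

contains_spark = lambda x: 'spark' in x

# Apply the predicate element by element, as filter() would
kept = [w for w in words_local if contains_spark(w)]
print(kept)
```

The same function can then be passed to Spark unchanged, for example `words.filter(contains_spark)`.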

**Assignment 4**:<br>
Find all the elements in words that contain the letter 'k'

In [None]:
#Assignment 4


In [None]:
#@title
words_filter = words.filter(lambda x: 'k' in x)
filtered = words_filter.collect()
print ("Filtered RDD -> %s" % (filtered))

**reduce()**<br>
Combines the elements of the RDD into a single value by repeatedly applying the **specified commutative and associative binary operation**, and returns that value<br>

For example, adding the numbers 1 to 5: <br>
reduce(lambda x, y : x + y, [1,2,3,4,5]) ==> ((((1+2)+3)+4)+5)



In [None]:
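from operator import add

listRdd = spark.sparkContext.parallelize([-1,33,3,4,5,3,2])
# reduce() takes a binary function: min and add each combine two values
print("output min using reduce : ", listRdd.reduce(min))
print("output add using reduce : ", listRdd.reduce(add))

# max() is a built-in RDD action, so no reduce is needed
print("output max using max() : ", listRdd.max())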
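
The operation must be commutative and associative because Spark reduces each partition separately and then combines the partial results, so the grouping of elements depends on how the data is split. The sketch below imitates this in plain Python with `functools.reduce`; the helper `reduce_by_partitions` and the chosen splits are hypothetical and are not Spark's actual implementation.

```python
from functools import reduce
from operator import add, sub

def reduce_by_partitions(op, partitions):
    # Reduce each partition on its own, then reduce the partial results
    partials = [reduce(op, part) for part in partitions]
    return reduce(op, partials)

data = [-1, 33, 3, 4, 5, 3, 2]
split_a = [data[:3], data[3:]]             # two partitions
split_b = [data[:2], data[2:5], data[5:]]  # three partitions

# Addition gives the same answer for any split
print(reduce_by_partitions(add, split_a), reduce_by_partitions(add, split_b))

# Subtraction is neither commutative nor associative,
# so the answer depends on the split
print(reduce_by_partitions(sub, split_a), reduce_by_partitions(sub, split_b))
```

With addition both splits give 49; with subtraction the two splits give different results, which is why such an operation is not safe to pass to `reduce`.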
