<a href="https://colab.research.google.com/github/cagBRT/PySpark/blob/master/PySpark_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# Clone the entire repo.
!git clone -l -s https://github.com/cagBRT/PySpark.git cloned-repo
#%cd cloned-repo
#!ls

In [None]:
from IPython.display import Image
def page(num):
    return Image("/content/cloned-repo/"+str(num)+ ".png" , width=640)

In [None]:
page("PySpark")

In [None]:
!pip install pyspark

In [None]:
#Import SparkSession
from pyspark.sql import SparkSession
# Create a Spark Session
#getOrCreate gets or creates a session
spark = SparkSession.builder.master("local[*]").getOrCreate()
# Check Spark Session Information
spark

In [None]:
#Import a Spark function from library
from pyspark.sql.functions import col

In [None]:
import os
from pyspark.sql import SparkSession
from pyspark import SparkContext

spark = SparkSession.builder.master("local[*]").getOrCreate()
print("If no error - everything is working")


**Get the SparkContext**

In [None]:
sc = SparkContext.getOrCreate()

In [None]:
# Tools we need to connect to the Spark server, load our data,
# clean it and prepare it

from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.feature import IndexToString, StringIndexer, VectorIndexer, VectorAssembler
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

from pyspark.sql.functions import isnan, when, count, col

In [None]:
import pandas as pd



---



# **RDD stands for Resilient Distributed Dataset**<br>
An RDD is a collection of elements that is split across multiple nodes so that a cluster can process it in parallel.<br>
**Once you create an RDD you cannot change it.** <br>
RDDs are also fault tolerant: after a failure, Spark rebuilds the lost data automatically. You can apply multiple operations to an RDD to accomplish a task.<br>

There are two kinds of operations you can apply to an RDD −

>Transformation<br>
Action<br>



**Transformation** − An operation applied to an RDD to create a new RDD. filter, groupBy and map are examples of transformations.

**Action** − An operation applied to an RDD that instructs Spark to perform the computation and send the result back to the driver. count and collect are examples of actions.
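
The distinction matters because Spark evaluates transformations lazily: nothing is computed until an action asks for a result. The sketch below is a plain-Python analogy rather than Spark code. Generators stand in for transformations and `list()` stands in for an action; the `trace` helper and the `calls` list are illustrative names, not part of any Spark API.

```python
calls = []  # records every element the "transformation" actually touches

def trace(word):
    """Hypothetical helper: log the element, then upper-case it."""
    calls.append(word)
    return word.upper()

words = ["scala", "java", "spark"]

# "Transformations": building the pipeline does no work yet
mapped = (trace(w) for w in words)
filtered = (w for w in mapped if "S" in w)
work_before_action = len(calls)

# "Action": materialising the result forces the whole pipeline to run
result = list(filtered)
work_after_action = len(calls)

print(work_before_action, work_after_action, result)
```

No element is processed while the pipeline is being described; all of the work happens when the result is requested.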

**When to use RDDs**<br>
Use RDDs in situations where:<br>

Data is unstructured.<br><br> Unstructured data sources such as media or text streams benefit from the performance advantages RDDs offer.<br><br>
Transformations are low-level.<br><br> Data manipulation should be fast and straightforward when it happens close to the data source.<br><br>
Schema is unimportant.<br><br> Since RDDs do not impose a schema, use them when accessing specific data by column or attribute is not relevant.

**Create a PySpark RDD**

In [None]:
words = sc.parallelize (
   ["scala", 
   "java", 
   "hadoop", 
   "spark", 
   "akka",
   "spark vs hadoop", 
   "pyspark",
   "pyspark and spark"]
)

**Assignment 1**<br>
Create an RDD called favorites. <br>
It should contain 10 of your favorite things. 

In [None]:
#Assignment


**count()**

In [None]:
counts = words.count()
print ("Number of elements in RDD -> %i" % (counts))

**Assignment 2**<br>
Count the number of elements in the RDD you created called favorites

In [None]:
#Assignment 2

**collect()**<br>
Returns all the elements in the RDD to the driver as a list

In [None]:
coll = words.collect()
print ("Elements in RDD -> %s" % (coll))

**Assignment 3**<br>
Collect the elements in favorites

In [None]:
#Assignment 3

**filter(f)**<br>
A filter returns the elements that meet the requested condition.

In the example below, the words RDD is searched for elements containing 'spark'.

The words RDD:<br>
>["scala", <br>
   "java",<br> 
   "hadoop", <br>
   "spark", <br>
   "akka",<br>
   "spark vs hadoop", <br>
   "pyspark",<br>
   "pyspark and spark"]<br>

In [None]:
words_filter = words.filter(lambda x: 'spark' in x)
filtered = words_filter.collect()
print ("Filtered RDD -> %s" % (filtered))

**Assignment 6**:<br>
Find all the elements in words that contain the letter 'k'

In [None]:
#Assignment 6

# **DataFrames**
RDDs offer low-level control over data, while the Dataset and DataFrame APIs add structure and high-level abstractions. Converting an RDD to a Dataset or DataFrame is straightforward.

**When to use Datasets and DataFrames**<br>
Use them in situations where:<br>

Data requires a structure. <br><br>
DataFrames infer a schema on structured and semi-structured data.<br><br>
Transformations are high-level. If your data requires high-level processing, columnar functions, and SQL queries, use Datasets and DataFrames.<br><br>
A high degree of type safety is necessary. Compile-time type safety catches errors early, which speeds up development. Note that the typed Dataset API is available only in Scala and Java; in PySpark you work with DataFrames.

In [None]:
# Prepare Data
columns = ["Seqno","Name"]
data = [("1", "john jones"),
    ("2", "tracey smith"),
    ("3", "amy sanders")]

# Create DataFrame
df = spark.createDataFrame(data=data,schema=columns)
df.show()
df.select("Name").show()

In [None]:
df.count()

**Create and use a function on the dataset**

In [None]:
def f(df,col):
  df.select(col).show()

In [None]:
f(df,"Name")

In [None]:
f(df,"Seqno")

**Assignment 4**<br>
Use the given dataset<br>
Show the dataset<br>
Show each column individually<br>
Count the number of rows<br>

In [None]:
#Assignment 4
columns = ["Race_event","Winner"]
data_assignment4 = [
    ("100yd", "baby smith"),
    ("quarter Mile", "bonny jones"),
    ("half mile", "cutie topper"),
    ("mile", "legs mccarty"),
    ("two mile", "speed hari"),
    ("five mile", "betsy boop")
    ]

**foreach(f)**<br>
Applies the function f to every element. foreach returns nothing, so it is used for side effects such as updating an accumulator.

In this example we use an accumulator to gather values from the dataset.<br><br>
An accumulator is created from an initial value v by calling SparkContext.accumulator(v). <br>

Tasks running on the cluster can then add to it using the add method or the += operator. However, they cannot read its value; only the driver program can, through the value attribute.

Accumulators support ints and floats by default. For other types, supply a custom AccumulatorParam.

In [None]:
def addup(df):
  accum=spark.sparkContext.accumulator(0)
  df.foreach(lambda x:accum.add(int(x.Seqno)))
  print (accum.value)

In [None]:
addup(df)

In the example below we comment out the accumulator(0) function. <br>
Note what happens

In [None]:
def addup(df):
  #accum=spark.sparkContext.accumulator(0)
  df.foreach(lambda x:accum.add(int(x.Seqno)))
  print(accum.value)

In [None]:
addup(df)

**Using foreach**

In [None]:
# foreach() with accumulator Example
accum=spark.sparkContext.accumulator(0)
df.foreach(lambda x:accum.add(int(x.Seqno)))
print(accum.value) #Accessed by driver

**Assignment 5**<br>
Use the given dataset<br>
Add up the values in each appropriate (numeric) column

In [None]:
columns = ["candy", "sales San Jose","sales for California","sales for US" ]
data_assignment5 = [
    ("snickers","34","124","564"),
    ("goodbar","23","445","876"),
    ("twix","12","765","1234"),
    ("m&m","365","987","121234"),
    ("mars","56","87","234657"),
    ("red hots","7","67","989877")
    ]

**max()**

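In [None]:
# Import under an alias so Python's built-in max() is not shadowed
from pyspark.sql.functions import col, max as spark_max

In [None]:
columns = ["Number","letter"]
dataNum=[("44","a"),("5","z"),("7.3","c"),("5.5","x")]
dNum=spark.createDataFrame(data=dataNum,schema=columns)

In [None]:
dNum.show()

In [None]:
# "Number" holds strings, so cast it to double to get the numeric maximum;
# without the cast the comparison is lexicographic
maxValue = dNum.agg(spark_max(col("Number").cast("double"))).collect()[0][0]
print("maxValue: ",maxValue)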
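
One thing to watch with this dataset: the Number column is built from Python strings, and strings are ordered character by character rather than by numeric value. The sketch below shows the effect in plain Python (it makes no Spark calls); the variable names are illustrative.

```python
import builtins

numbers_as_text = ["44", "5", "7.3", "5.5"]

# builtins.max is used explicitly in case the name max has been
# rebound to the Spark function in the current session.
# Strings compare lexicographically: "7" sorts after "4" and "5".
text_max = builtins.max(numbers_as_text)

# Converting to float first gives the numeric maximum
numeric_max = builtins.max(float(n) for n in numbers_as_text)

print(text_max, numeric_max)
```

Spark orders string columns in the same character-by-character way, so an aggregate over an uncast string column would be expected to pick "7.3" here. Casting the column to a numeric type before aggregating avoids the surprise.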

**Assignment 7**<br>
Use the given dataset. <br>
Find the maximum value for each appropriate column

In [None]:
#Assignment 7
columns = ["candy", "sales San Jose","sales for California","sales for US" ]
data_assignment7 = [
    ("snickers","34","124","564"),
    ("goodbar","23","445","876"),
    ("twix","12","765","1234"),
    ("m&m","365","987","121234"),
    ("mars","56","87","234657"),
    ("red hots","7","67","989877")
    ]

# What is Partitioning?

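**map()** transforms the RDD by applying the lambda function to every element and returns a new RDD. The data in an RDD is split into partitions that Spark can process in parallel; getNumPartitions() reports how many partitions an RDD has.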

In [None]:
data = ["Project","Gutenberg’s","Alice’s","Adventures",
"in","Wonderland","Project","Gutenberg’s","Adventures",
"in","Wonderland","Project","Gutenberg’s"]

rdd=spark.sparkContext.parallelize(data)

Here we pair each element with a second value (23)

In [None]:
words_map = rdd.map(lambda x: (x, 23))
mapping = words_map.collect()
print ("Key value pair -> %s" % (mapping))

In [None]:
for element in words_map.collect():
    print(element)

In [None]:
print("partitions=",rdd.getNumPartitions())
print("count=",words_map.count())
print(words_map.collect())

**Assignment 8**<br>
Add another element to words_map

In [None]:
#Assignment 8

In [None]:
# One possible solution to Assignment 8
words_map2 = words_map.map(lambda x: (x, "windy"))
mapping = words_map2.collect()
print ("Key value pair -> %s" % (mapping))
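
Mapping over an RDD that already holds pairs wraps each pair in a new tuple, so the results are nested. The following plain-Python sketch contrasts nesting with extending each pair; the list names are illustrative and no Spark calls are made.

```python
pairs = [("Project", 23), ("Wonderland", 23)]

# Wrapping: each existing pair becomes the first item of a new pair
nested = [(p, "windy") for p in pairs]

# Extending: unpack the pair and append the new value to get a flat triple
flat = [(*p, "windy") for p in pairs]

print(nested[0])
print(flat[0])
```

The same two shapes can be written as lambdas for map: `lambda x: (x, "windy")` nests, while `lambda x: (*x, "windy")` flattens.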

**reduce()**<br>
Combines the elements of the RDD using the specified commutative and associative binary operation and returns the single resulting value.<br>

For example, with Python's functools.reduce: <br>
reduce(lambda x, y : x + y, [1,2,3,4,5]) ==> ((((1+2)+3)+4)+5)
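
The operation has to be commutative and associative because the work is distributed: each partition is reduced on its own, and the partial results are then combined. Below is a plain-Python sketch of that two-stage process using functools.reduce. The `reduce_in_partitions` helper and the particular split into partitions are assumptions made for illustration, not Spark's actual scheduling.

```python
from functools import reduce
from operator import add, sub

def reduce_in_partitions(op, partitions):
    """Reduce each partition separately, then reduce the partial results."""
    partials = [reduce(op, part) for part in partitions]
    return reduce(op, partials)

data = [1, 2, 3, 4, 5]
partitions = [[1, 2], [3, 4, 5]]  # one hypothetical way to split the data

# Addition is associative and commutative: the split does not matter
add_single = reduce(add, data)
add_split = reduce_in_partitions(add, partitions)

# Subtraction is neither, so the answer depends on the split
sub_single = reduce(sub, data)
sub_split = reduce_in_partitions(sub, partitions)

print(add_single, add_split, sub_single, sub_split)
```

With addition both routes agree. With subtraction they differ, which is why an operation like that is not safe to pass to reduce on a partitioned dataset.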



In [None]:
from operator import add

listRdd = spark.sparkContext.parallelize([-1,33,3,4,5,3,2])
print("output min using binary : ", listRdd.reduce(min))
print("output add using binary : ", listRdd.reduce(add))

print("output max using binary : ", listRdd.max())


In [None]:
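import builtins
from operator import add

listRdd = spark.sparkContext.parallelize((1,2,3,4,5,3,2))
# Use the built-in min/max explicitly: the name max may have been rebound
# by "from pyspark.sql.functions import max" earlier in the notebook
print("output min using binary : ",listRdd.reduce(builtins.min))
print("output max using binary : ",listRdd.reduce(builtins.max))
print("output sum using binary : ",listRdd.reduce(add))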

**Assignment 9**

In [None]:
columns = ["candy", "sales San Jose","sales for California","sales for US" ]
data_assignment9 = [
    ("snickers","34","124","564"),
    ("goodbar","23","445","876"),
    ("twix","12","765","1234"),
    ("m&m","365","987","121234"),
    ("mars","56","87","234657"),
    ("red hots","7","67","989877")
    ]

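# A Python list has no reduce(); parallelize it into an RDD first.
# As one example, total the "sales San Jose" column (index 1).
sales_rdd = sc.parallelize(data_assignment9).map(lambda row: int(row[1]))
adding = sales_rdd.reduce(add)
print ("Adding all the elements -> %i" % (adding))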

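**join**<br>
Joining two RDDs of (key, value) pairs returns one pair for every key present in both, with the two values combined into a tuple.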

In [None]:
x = sc.parallelize([("spark", 1), ("hadoop", 4)])
y = sc.parallelize([("spark", 2), ("hadoop", 5)])
joined = x.join(y)
final = joined.collect()
print ("Join RDD -> %s" % (final))
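
To make the shape of a join result concrete, here is a plain-Python sketch of an inner join on key. It makes no Spark calls, and the extra "akka" and "scala" entries are added only to illustrate what happens to keys without a match.

```python
x = [("spark", 1), ("hadoop", 4), ("akka", 7)]
y = [("spark", 2), ("hadoop", 5), ("scala", 9)]

# Index the right-hand side by key so matches can be looked up directly
y_by_key = {}
for key, value in y:
    y_by_key.setdefault(key, []).append(value)

# Keep only keys found on both sides, pairing the values in a tuple
joined = [(key, (xv, yv)) for key, xv in x for yv in y_by_key.get(key, [])]

print(joined)
```

Keys that appear on only one side are dropped, and each surviving key carries a tuple of the left and right values.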

**cache()**

In [None]:
words = sc.parallelize (
   ["scala", 
   "java", 
   "hadoop", 
   "spark", 
   "akka",
   "spark vs hadoop", 
   "pyspark",
   "pyspark and spark"]
) 
words.cache()  # mark the RDD to be kept in memory once it is computed
caching = words.is_cached
print("Words got cached > %s" % (caching))

**broadcast()**<br>
A broadcast variable makes a read-only value available on every node. Its value attribute returns the broadcast data.



In [None]:
words_new = sc.broadcast(["scala", "java", "hadoop", "spark", "akka"]) 
data = words_new.value 
print ("Stored data -> %s" % (data) )
elem = words_new.value[2] 
print ("Printing a particular element of the broadcast list -> %s" % (elem))

**Accumulator()**<br>

In [None]:
num = sc.accumulator(10) 
def f(x): 
   global num 
   num+=x 
rdd = sc.parallelize([20,30,40,50]) 
rdd.foreach(f) 
final = num.value 
print ("Accumulated value is -> %i" % (final))

# GroupBy()

Create a simple dataset

In [None]:
simpleData = [("James","Sales","NY",90000,34,10000),
    ("Michael","Sales","NY",86000,56,20000),
    ("Robert","Sales","CA",81000,30,23000),
    ("Maria","Finance","CA",90000,24,23000),
    ("Raman","Finance","CA",99000,40,24000),
    ("Scott","Finance","NY",83000,36,19000),
    ("Jen","Finance","NY",79000,53,15000),
    ("Jeff","Marketing","CA",80000,25,18000),
    ("Kumar","Marketing","NY",91000,50,21000)
  ]


Define the data schema - the column names

In [None]:
schema = ["employee_name","department","state","salary","age","bonus"]

Create the dataframe

In [None]:
df = spark.createDataFrame(data=simpleData, schema = schema)
df.printSchema()
df.show(truncate=False)

**Assignment:** <br>
Create a dataframe with the following column headings. <br>
Create at least 10 rows <br>

Name, state, team_name, event_name, race_time1, race_time2, race_time3

Using groupby()<br>
Group the dataframe into three categories: Sales, Finance, Marketing<br>
Notice the salary is the sum of the values in each category

In [None]:
df.groupBy("department").sum("salary").show(truncate=False)

Group by department and count the employees

In [None]:
df.groupBy("department").count().show()

In [None]:
df.groupBy("department").min("salary").show()

In [None]:
df.groupBy("department").max("salary").show()

In [None]:
df.groupBy("department").mean( "salary").show()

Groupby on multiple columns<br>
Group by department and state

In [None]:
#GroupBy on multiple columns
df.groupBy("department","state") \
    .sum("salary","bonus") \
    .show()


**Computing multiple aggregates at the same time**

In [None]:
from pyspark.sql.functions import sum,avg,max
df.groupBy("department") \
    .agg(sum("salary").alias("sum_salary"), \
         avg("salary").alias("avg_salary"), \
         sum("bonus").alias("sum_bonus"), \
         max("bonus").alias("max_bonus") \
     ) \
    .show(truncate=False)

# Configuring Spark


The following are some of the most commonly used methods of SparkConf −

>set(key, value) − To set a configuration property.

>setMaster(value) − To set the master URL.

>setAppName(value) − To set an application name.

>get(key, defaultValue=None) − To get a configuration value of a key.

>setSparkHome(value) − To set Spark installation path on worker nodes.




In [None]:
from pyspark import SparkConf

In [None]:
conf = SparkConf().setAppName("PySpark App").setMaster("spark://master:7077")
#sc = SparkContext(conf=conf)

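**SparkFiles**<br>
get(filename) returns the absolute path of a file added through SparkContext.addFile().<br>
getRootDirectory() returns the root directory that contains the files added through SparkContext.addFile().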

In [None]:
from pyspark import SparkFiles
airlines_path = "/content/cloned-repo/airlines.csv"
# Look the file up by the same name it was added under
airlines_name = "airlines.csv"
sc.addFile(airlines_path)
print ("Absolute Path -> %s" % SparkFiles.get(airlines_name))

**Storage Levels**

DISK_ONLY = StorageLevel(True, False, False, False, 1)

DISK_ONLY_2 = StorageLevel(True, False, False, False, 2)

MEMORY_AND_DISK = StorageLevel(True, True, False, False, 1)

MEMORY_AND_DISK_2 = StorageLevel(True, True, False, False, 2)

MEMORY_AND_DISK_SER = StorageLevel(True, True, False, False, 1)

MEMORY_AND_DISK_SER_2 = StorageLevel(True, True, False, False, 2)

MEMORY_ONLY = StorageLevel(False, True, False, False, 1)

MEMORY_ONLY_2 = StorageLevel(False, True, False, False, 2)

MEMORY_ONLY_SER = StorageLevel(False, True, False, False, 1)

MEMORY_ONLY_SER_2 = StorageLevel(False, True, False, False, 2)

OFF_HEAP = StorageLevel(True, True, True, False, 1)
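
The five arguments in each constant are, in order, useDisk, useMemory, useOffHeap, deserialized and replication. The sketch below mirrors that argument order with a plain namedtuple so the flags can be read by name. `Level` and `describe` are illustrative stand-ins, not PySpark classes.

```python
from collections import namedtuple

# Same field order as the StorageLevel constructor
Level = namedtuple("Level", "useDisk useMemory useOffHeap deserialized replication")

def describe(level):
    """Hypothetical helper: summarise where the data is kept and how often it is replicated."""
    places = [name for name, used in (("disk", level.useDisk),
                                      ("memory", level.useMemory),
                                      ("off-heap", level.useOffHeap)) if used]
    return "%s, replication=%d" % (" + ".join(places), level.replication)

memory_and_disk_2 = Level(True, True, False, False, 2)
disk_only = Level(True, False, False, False, 1)

print(describe(memory_and_disk_2))
print(describe(disk_only))
```

Read this way, the `_2` suffix in the list above corresponds to a replication value of 2, and the first two flags say whether disk and memory are used.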

In [None]:
import pyspark
rdd1 = sc.parallelize([1,2])
# Keep the RDD in memory and on disk, replicated on two nodes
rdd1.persist( pyspark.StorageLevel.MEMORY_AND_DISK_2 )
print(rdd1.getStorageLevel())
