<a href="https://colab.research.google.com/github/antonioGoncalves64/pyspark/blob/main/TutorialPyspark.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Automatic installation of PySpark

The easiest way to install PySpark on Google Colab is to use pip install.

In [2]:
! pip install pyspark

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/



# Create a Spark session
After installation, we can create a SparkContext, the entry point for the RDD API used in this tutorial, and check its information.

In [6]:
from pyspark import SparkConf 
from pyspark.context import SparkContext 

sc = SparkContext.getOrCreate(SparkConf().setMaster("local[*]").setAppName("Intro pyspark"))


# Verify SparkContext
sc



# Resilient Distributed Datasets (RDD)
Spark revolves around the concept of a resilient distributed dataset (RDD), which is a fault-tolerant collection of elements that can be operated on in parallel.

## How to create RDDs

There are two ways to create RDDs:

* parallelizing an existing collection in your driver program, or
* referencing a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat.

### Parallelized collection

Parallelized collections are created by calling SparkContext’s parallelize method on an existing iterable or collection in your driver program. The elements of the collection are copied to form a distributed dataset that can be operated on in parallel. For example, here is how to create two parallelized collections: one holding the numbers 1 to 5 and the other created from a string. Note that parallelizing a string distributes its individual characters, because a string is an iterable of characters.

In [9]:
numRDD = sc.parallelize([1, 2, 3, 4, 5])
print("numRDD:   ", type(numRDD))  # confirm that the object is an RDD

helloRDD = sc.parallelize("Hello world")
print("helloRDD: ", type(helloRDD))  # confirm that the object is an RDD

numRDD:    <class 'pyspark.rdd.RDD'>
helloRDD:  <class 'pyspark.rdd.RDD'>


Once created, the distributed dataset (distData) can be operated on in parallel. For example, we can call distData.reduce(lambda a, b: a + b) to add up the elements of the list.

In [10]:
distData = sc.parallelize([1,2,3])
distData.reduce(lambda a, b: a + b)

6
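Spark reduces each partition separately and then merges the partial results, so the function passed to reduce should be commutative and associative. The plain-Python sketch below (no Spark required) models that two-stage evaluation. The helper `partitioned_reduce` and the hand-written partition layouts are illustrative assumptions, not Spark's actual implementation.

```python
from functools import reduce

def partitioned_reduce(func, partitions):
    """Hypothetical model: reduce each non-empty partition, then merge the partial results."""
    partials = [reduce(func, part) for part in partitions if part]
    return reduce(func, partials)

data = [1, 2, 3]
one_partition = [data]           # everything in a single partition
two_partitions = [[1], [2, 3]]   # an assumed split into two partitions

# Addition is commutative and associative: the partitioning does not matter.
sum_one = partitioned_reduce(lambda a, b: a + b, one_partition)
sum_two = partitioned_reduce(lambda a, b: a + b, two_partitions)
print(sum_one, sum_two)

# Subtraction is neither, so the result depends on how the data was split.
diff_one = partitioned_reduce(lambda a, b: a - b, one_partition)
diff_two = partitioned_reduce(lambda a, b: a - b, two_partitions)
print(diff_one, diff_two)
```

The two sums agree while the two differences do not, which is why operations such as subtraction are unsafe to pass to reduce.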

One important parameter for parallel collections is the number of partitions to cut the dataset into. Spark runs one task for each partition. Typically you want 2-4 partitions for each CPU in your cluster. Normally, Spark tries to set the number of partitions automatically based on your cluster. However, you can also set it manually by passing it as a second parameter to parallelize (e.g. sc.parallelize(data, 10)).

In [11]:
data01 = sc.parallelize([1,2,3,4,5])
print ("Data01 NumPartitions: ", data01.getNumPartitions())

data02 = sc.parallelize([1,2,3,4,5],3)
print ("data02 NumPartitions: ", data02.getNumPartitions())

Data01 NumPartitions:  2
data02 NumPartitions:  3
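To get a feel for what "cutting the dataset into partitions" means, here is a minimal plain-Python sketch that slices a list into contiguous chunks of near-equal size. The function `split_evenly` is hypothetical and the boundaries it picks are an assumption; Spark's real layout may differ, and in a live session `rdd.glom().collect()` shows the actual contents of each partition.

```python
def split_evenly(items, num_slices):
    """Hypothetical helper: cut a list into num_slices contiguous, near-equal chunks."""
    n = len(items)
    return [
        items[i * n // num_slices:(i + 1) * n // num_slices]
        for i in range(num_slices)
    ]

chunks = split_evenly([1, 2, 3, 4, 5], 3)
print(len(chunks))  # number of chunks, analogous to getNumPartitions()
print(chunks)       # contents of each chunk
```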


### External Datasets
PySpark can create distributed datasets from any storage source supported by Hadoop, including your local file system, HDFS, Cassandra, HBase, Amazon S3, etc. Spark supports text files, SequenceFiles, and any other Hadoop InputFormat.

Text file RDDs can be created using SparkContext’s textFile method. This method takes a URI for the file (either a local path on the machine, or a hdfs://, s3a://, etc URI) and reads it as a collection of lines. Here is an example invocation:

In [12]:
# Download the cars data, originally from https://perso.telecom-paristech.fr/eagan/class/igr204/datasets
!wget https://jacobceles.github.io/knowledge_repo/colab_and_pyspark/cars.csv

# Each element of the RDD is one line of the file
fileRDD = sc.textFile("cars.csv")

# take(3) is an action: it returns the first three lines as a Python list
firstLines = fileRDD.take(3)

for line in firstLines:
    print(line)

--2022-12-14 16:51:29--  https://jacobceles.github.io/knowledge_repo/colab_and_pyspark/cars.csv
Resolving jacobceles.github.io (jacobceles.github.io)... 185.199.108.153, 185.199.109.153, 185.199.110.153, ...
Connecting to jacobceles.github.io (jacobceles.github.io)|185.199.108.153|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://jacobcelestine.com/knowledge_repo/colab_and_pyspark/cars.csv [following]
--2022-12-14 16:51:29--  https://jacobcelestine.com/knowledge_repo/colab_and_pyspark/cars.csv
Resolving jacobcelestine.com (jacobcelestine.com)... 185.199.108.153, 185.199.109.153, 185.199.110.153, ...
Connecting to jacobcelestine.com (jacobcelestine.com)|185.199.108.153|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 22608 (22K) [text/csv]
Saving to: ‘cars.csv’


2022-12-14 16:51:29 (109 MB/s) - ‘cars.csv’ saved [22608/22608]

Car;MPG;Cylinders;Displacement;Horsepower;Weight;Acceleration;Model;Origin
Chevrolet Che
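Since textFile yields raw lines, each line still has to be split into fields. The sketch below does this in plain Python for the header line shown in the output above, using the semicolon delimiter visible there. The names `columns` and `column_index` are illustrative.

```python
# Header line copied from the output above
header = "Car;MPG;Cylinders;Displacement;Horsepower;Weight;Acceleration;Model;Origin"

# Split on the semicolon delimiter used by this file
columns = header.split(";")

# Map each column name to its position, handy for picking fields later
column_index = {name: pos for pos, name in enumerate(columns)}

print(columns)
print(column_index["Horsepower"])
```

With the RDD itself, the same split could be applied to every line with fileRDD.map(lambda line: line.split(";")).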

### RDD Operations

RDDs support two types of operations: transformations and actions.

* Transformations create a new dataset from an existing one.
* Actions return a value to the driver program after running a computation on the dataset.

For example, map is a transformation that passes each dataset element through a function and returns a new RDD representing the results. On the other hand, reduce is an action that aggregates all the elements of the RDD using some function and returns the final result to the driver program (although there is also a parallel reduceByKey that returns a distributed dataset).

All transformations in Spark are lazy, in that they do not compute their results right away. Instead, they just remember the transformations applied to some base dataset (e.g. a file). The transformations are only computed when an action requires a result to be returned to the driver program. This design enables Spark to run more efficiently. For example, we can realize that a dataset created through map will be used in a reduce and return only the result of the reduce to the driver, rather than the larger mapped dataset.
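Lazy evaluation can be imitated in plain Python with a generator, which likewise only remembers what to compute until something consumes it. This is an analogy under that assumption, not how Spark is implemented; the `calls` list is just a hypothetical probe to show when the function really runs.

```python
calls = []  # records every element the function was actually applied to

def double(x):
    calls.append(x)
    return x * 2

data = [1, 2, 3, 4, 5]

# "Transformation": building the generator does not call double() yet.
mapped = (double(x) for x in data)
calls_before_action = len(calls)

# "Action": consuming the generator triggers the computation.
total = sum(mapped)
calls_after_action = len(calls)

print(calls_before_action, calls_after_action, total)
```

Only the single number `total` is produced at the end, mirroring how a map followed by a reduce returns just the reduced value to the driver.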


#### Transformations


In [14]:
# map( ) - Return a new RDD by applying a function to each element of this RDD.

RDD = sc.parallelize([1,2,3,4,5])
RDD_map = RDD.map(lambda x : x * 2)
print("RDD_map: ", RDD_map.collect())  # collect() is an action that returns the elements as a list

RDD_map:  [2, 4, 6, 8, 10]


In [15]:
# filter( ) returns a new RDD with only the elements that pass the condition

RDD = sc.parallelize([1,2,3,4])
RDD_filter = RDD.filter(lambda x: x > 2)
print("RDD_filter: ", RDD_filter.collect())  # collect() is an action that returns the elements as a list

RDD_filter:  [3, 4]


In [16]:
# flatMap( ) applies a function that returns several values per element and flattens the results into one RDD

RDD = sc.parallelize(["hello word", "How are you"])
RDD_flatMap = RDD.flatMap(lambda x : x.split(" "))
print("RDD_flatMap: ", RDD_flatMap.collect())  # collect() is an action that returns the elements as a list

RDD_flatMap:  ['hello', 'word', 'How', 'are', 'you']
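The difference between map and flatMap is easiest to see side by side. The plain-Python sketch below mimics both on the same two strings; it is a model of the semantics, assuming map keeps one output per input while flatMap flattens the per-element results.

```python
from itertools import chain

lines = ["hello word", "How are you"]

# map-like: one result per input element, so the output is a list of lists
mapped = [line.split(" ") for line in lines]

# flatMap-like: the per-element lists are flattened into a single sequence
flat_mapped = list(chain.from_iterable(line.split(" ") for line in lines))

print(mapped)
print(flat_mapped)
```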


In [17]:
# union( ) Return the union of this RDD and another one

rdd01 = sc.parallelize([1, 3, 5, 7])
rdd02 = sc.parallelize([2, 4, 6, 8])
rdd03 = rdd01.union(rdd02)
rdd03.collect()

[1, 3, 5, 7, 2, 4, 6, 8]

#### Actions

In [18]:
# collect( ) – Returns a list that contains all of the elements in this RDD

data = [1,2,3,4,5,6,7,8,9,10,11,12]
rdd  = sc.parallelize(data)

newData = rdd.collect()
for d in newData:
    print (f"Value: {d}")

Value: 1
Value: 2
Value: 3
Value: 4
Value: 5
Value: 6
Value: 7
Value: 8
Value: 9
Value: 10
Value: 11
Value: 12


In [19]:
# take(num) – Take the first num elements of the RDD

data = [1,2,3,4,5,6,7,8,9,10,11,12]
rdd  = sc.parallelize(data)

newData = rdd.take(2)
for d in newData:
    print (f"Value: {d}")

Value: 1
Value: 2


In [20]:
# first( ) – Returns the first record of the RDD

data = [1,2,3,4,5,6,7,8,9,10,11,12]
rdd  = sc.parallelize(data)

newData = rdd.first()
print (f"Value: {newData}")

Value: 1


In [None]:
# count( ) – Returns the number of records in an RDD

data = [1,2,3,4,5,6,7,8,9,10,11,12]
rdd  = sc.parallelize(data)

num = rdd.count()
print (f"Count: {num}")

In [None]:
# max( ) – Returns max record

data = [1,2,3,4,5,6,7,8,9,10,11,12]
rdd  = sc.parallelize(data)

num = rdd.max()
print (f"Max: {num}")

In [None]:
# reduce( ) – Reduces the records to a single value; we can use this to count or sum

data = [1,2,3,4,5,6,7,8,9,10,11,12]
rdd  = sc.parallelize(data)

num = rdd.reduce(lambda a, b: a + b)
print(f"Sum: {num}")
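The same reduce pattern covers summing, counting, and finding a maximum; only the function (and, for counting, a preceding map to 1) changes. The sketch below uses functools.reduce on a plain list as a stand-in for the RDD, which is an assumption made so the example runs without Spark.

```python
from functools import reduce

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]

# Sum: add the elements pairwise
total = reduce(lambda a, b: a + b, data)

# Count: replace every element by 1, then add
count = reduce(lambda a, b: a + b, map(lambda _: 1, data))

# Max: keep the larger of each pair
largest = reduce(lambda a, b: a if a >= b else b, data)

print(total, count, largest)
```

On an RDD the counting variant would read rdd.map(lambda _: 1).reduce(lambda a, b: a + b).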

### Pair RDDs

Pair RDDs are RDDs whose elements are key-value pairs. A key-value pair (KVP) consists of two linked data items: the key is the identifier, and the value is the data associated with that key.

#### Creating Pair RDDs
There are two common ways to create a pair RDD:

* From a list of key-value tuples
* From a regular RDD

In [21]:
# Create a Pair RDD from regular RDD

rdd = sc.parallelize(["b", "a", "c"])
sorted(rdd.map(lambda x: (x, 1)).collect())

[('a', 1), ('b', 1), ('c', 1)]

In [22]:
# Create a Pair RDD from a list

rdd = sc.parallelize([(1,"a"), (2,"b"), (3,"c")])
rdd.collect()

[(1, 'a'), (2, 'b'), (3, 'c')]

#### Transformations on pair RDDs

All regular transformations work on pair RDDs, but the functions you pass must operate on key-value pairs rather than on individual elements.

In [23]:
# reduceByKey(func) - merges the values for each key using the given function

rdd = sc.parallelize([("a",1), ("b",2), ("c", 10),("a", 2), ("d", 5), ("a", 4) ])
rdd_reduceByKey = rdd.reduceByKey(lambda x, y: x+y )
rdd_reduceByKey.collect()

[('b', 2), ('c', 10), ('d', 5), ('a', 7)]
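To make the per-key merging explicit, here is a minimal plain-Python model of what reduceByKey computes. The helper `reduce_by_key` is hypothetical and processes the pairs sequentially, whereas Spark merges values within and across partitions.

```python
def reduce_by_key(func, pairs):
    """Hypothetical model: merge all values that share a key using func."""
    merged = {}
    for key, value in pairs:
        # First value for a key is stored as-is; later ones are merged in.
        merged[key] = func(merged[key], value) if key in merged else value
    return merged

pairs = [("a", 1), ("b", 2), ("c", 10), ("a", 2), ("d", 5), ("a", 4)]
totals = reduce_by_key(lambda x, y: x + y, pairs)

# Sorted for a stable display; the order of keys is not guaranteed in Spark.
print(sorted(totals.items()))
```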

In [24]:
# sortByKey(ascending) - sorts a pair RDD by key

rdd = sc.parallelize([("a",1), ("c",2), ("b", 10),("a", 2), ("d", 5), ("a", 4) ])
rdd_reduceByKey = rdd.reduceByKey(lambda x, y: x+y )
rdd_reduceByKey.sortByKey(ascending = True).collect()

[('a', 7), ('b', 10), ('c', 2), ('d', 5)]

In [27]:
# groupByKey( ) - Groups all the values with the same key in the pair

rdd = sc.parallelize([("a",1), ("c",2), ("b", 10),("a", 2), ("d", 5), ("a", 4) ])
rdd_groupByKey = rdd.groupByKey().collect()
for letter, value in  rdd_groupByKey:
    print (letter, list(value))

c [2]
b [10]
d [5]
a [1, 2, 4]


In [None]:
# join( ) - transformation that joins two pair RDDs on their keys (only keys present in both are kept)

rdd01 = sc.parallelize([("a",1), ("b", 5),("c", 7) ])
rdd02 = sc.parallelize([("a",2), ("b", 3),("d", 4) ])

rdd01.join(rdd02).collect()
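Because the cell above has no recorded output, the following plain-Python sketch models what an inner join on keys produces for the same two lists. The function `inner_join` is a hypothetical stand-in, and the order of results returned by Spark may differ.

```python
def inner_join(left, right):
    """Hypothetical model of an inner join on keys between two lists of pairs."""
    right_by_key = {}
    for key, value in right:
        right_by_key.setdefault(key, []).append(value)
    # Emit (key, (left_value, right_value)) for every matching combination.
    return [
        (key, (lv, rv))
        for key, lv in left
        for rv in right_by_key.get(key, [])
    ]

left = [("a", 1), ("b", 5), ("c", 7)]
right = [("a", 2), ("b", 3), ("d", 4)]

joined = inner_join(left, right)
print(sorted(joined))
```

Keys "c" and "d" are dropped because each appears in only one of the inputs.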

In [None]:
# countByKey( ) - action that counts the number of elements for each key

rdd = sc.parallelize([("a",2), ("b", 4),("a", 3) ])
for key, val in  rdd.countByKey().items():
    print (key, val)
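Note that countByKey counts how many pairs carry each key; it does not add up the values. A plain-Python equivalent using collections.Counter, offered as a model of the semantics rather than Spark's implementation, makes this visible:

```python
from collections import Counter

pairs = [("a", 2), ("b", 4), ("a", 3)]

# Count occurrences of each key; the values 2, 4 and 3 are ignored.
counts = Counter(key for key, _ in pairs)

for key, val in sorted(counts.items()):
    print(key, val)
```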

In [26]:
# collectAsMap( ) - action that returns the key-value pairs in the RDD as a dictionary

rdd = sc.parallelize([("a",2), ("b", 4),("c", 3) ])
rdd.collectAsMap()

{'a': 2, 'b': 4, 'c': 3}
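One caveat with dictionaries is that each key can appear only once. The sketch below assumes collectAsMap behaves like building a Python dict from the collected pairs, in which case a later pair overwrites an earlier one with the same key; treat this as a model to verify in your own session.

```python
# Same shape of data as above, but with a duplicated key "a"
pairs = [("a", 2), ("b", 4), ("a", 3)]

# Building a dict keeps only the last value seen for each key
as_map = dict(pairs)

print(as_map)
```

If all values per key are needed, groupByKey or reduceByKey is the safer choice before collecting.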