<a href="https://colab.research.google.com/github/cagBRT/PySpark/blob/master/Distrib_vs_Single3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook compares running the same task in a distributed environment (Spark) and on a single CPU (a single computer).

In [None]:
!git clone https://github.com/cagBRT/PySpark.git

Cloning into 'PySpark'...
remote: Enumerating objects: 447, done.
remote: Counting objects: 100% (117/117), done.
remote: Compressing objects: 100% (117/117), done.
remote: Total 447 (delta 72), reused 0 (delta 0), pack-reused 330 (from 1)
Receiving objects: 100% (447/447), 3.41 MiB | 5.50 MiB/s, done.
Resolving deltas: 100% (270/270), done.




---



# Set up Spark

If you are using Databricks, this section will be different.

In [None]:
!pip install pyspark

Collecting pyspark
  Downloading pyspark-3.5.2.tar.gz (317.3 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 317.3/317.3 MB 3.2 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.5.2-py2.py3-none-any.whl size=317812365 sha256=2e6eea042a7f9a45b6b72fd86dd07d24c2577fabbfa5349e0071aa788bed1bb7
  Stored in directory: /root/.cache/pip/wheels/34/34/bd/03944534c44b677cd5859f248090daa9fb27b3c8f8e5f49574
Successfully built pyspark
Installing collected packages: pyspark
Successfully installed pyspark-3.5.2


In [None]:
#Import SparkSession
from pyspark.sql import SparkSession
# Create a Spark Session
#getOrCreate gets or creates a session
spark = SparkSession.builder.master("local[*]").getOrCreate()
# Check Spark Session Information
spark

In [None]:
import urllib.request
from pyspark.sql import SparkSession
from pyspark import SparkContext

In [None]:
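# Note: if the session created above is still active, getOrCreate() returns it,
# so the master and driver-memory settings below may not take effect.
spark = (
    SparkSession.builder
        .appName('learn')
        .config('spark.driver.memory', '8g')
        .master('local[4]')
        .config('spark.sql.execution.arrow.pyspark.enabled', True)
        .config('spark.sql.execution.arrow.pyspark.fallback.enabled', False)
        .getOrCreate()
)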

In [None]:
# The session already exposes its SparkContext as an attribute
sc = spark.sparkContext



---



We will now load the Gettysburg Address (a very small file) and perform count and sort tasks.

# Get the Gettysburg Address

In [None]:
import time



---

**Gettysburg Address**

---



**Count all the words using a distributed environment**

In [None]:
start_time = time.time()
Lincoln = sc.textFile("/content/PySpark/GettysBurg.txt")
LincolnCount=Lincoln.flatMap(lambda line: line.split(" ")).map(lambda word: (word, 1))
dist_lincoln_count=LincolnCount.count()
print("--- %s seconds ---" % (time.time() - start_time))
dc_time=time.time() - start_time
dist_lincoln_count

--- 5.02198600769043 seconds ---


266

**Count all the words using a single CPU**

In [None]:
start_time = time.time()
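# Read the whole file at once; the context manager closes it automatically
with open("/content/PySpark/GettysBurg.txt", "r") as f:
    single_lincoln_count = f.read()
# Split the text on single spaces to get the word list
single_lincoln_count = single_lincoln_count.split(" ")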
print("--- %s seconds ---" % (time.time() - start_time))
sc_time=time.time() - start_time
len(single_lincoln_count)

--- 0.0004978179931640625 seconds ---


263

In [None]:
print("distributed time:    ",dc_time )
print("single compute time: ", sc_time)

distributed time:     5.022210597991943
single compute time:  0.0006508827209472656
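
The two counts above differ (266 from Spark, 263 from the single-CPU cell). One likely cause, stated here as an assumption rather than something verified against the file, is tokenization: `sc.textFile` yields one line at a time, so each line is split separately, while `f.read().split(" ")` leaves newline characters inside tokens. The sketch below uses a hypothetical two-line sample (not `GettysBurg.txt`) to show the effect.

```python
# Hypothetical two-line sample; this is not the notebook's GettysBurg.txt.
sample = "Four score and seven\nyears ago our fathers"

# Whole-text split on a single space: the newline stays inside one token,
# so "seven\nyears" is counted as a single word.
whole_file_tokens = sample.split(" ")

# Line-by-line split, mirroring how lines from sc.textFile reach flatMap.
per_line_tokens = [word for line in sample.splitlines() for word in line.split(" ")]

# str.split() with no argument splits on any run of whitespace,
# including newlines, and drops empty strings.
whitespace_tokens = sample.split()

print(len(whole_file_tokens), len(per_line_tokens), len(whitespace_tokens))
```

Using `split()` with no argument in both the Spark and single-CPU paths would be one way to make the two counts directly comparable.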


**Count unique words using a distributed environment**

In [None]:
#Count the number of distinct words
start_time = time.time()
DistinctWordsCount=LincolnCount.reduceByKey(lambda a,b: a+b)
DistinctWordsCount.count()
dc_time=time.time() - start_time
print("--- %s seconds ---" % (time.time() - start_time))
DistinctWordsCount.count()

--- 2.343093156814575 seconds ---


158

**Count the unique words using a single computer**

In [None]:
start_time = time.time()
words_unique=set(single_lincoln_count)
print("--- %s seconds ---" % (time.time() - start_time))
sc_time=time.time() - start_time
len(words_unique)

--- 0.0001590251922607422 seconds ---


155

In [None]:
print("distributed time:    ",dc_time )
print("single compute time: ", sc_time)

distributed time:     2.3430016040802
single compute time:  0.010693550109863281
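
The start/stop timing pattern used in these cells can be wrapped in a small helper so that each measurement is taken once and reused for both printing and comparison. This is a minimal sketch: `timed` is a hypothetical helper name, the word list is a made-up sample, and `time.perf_counter` is swapped in for `time.time` because it is intended for measuring short intervals.

```python
import time

def timed(fn, *args, **kwargs):
    """Call fn once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed

# Hypothetical sample word list standing in for single_lincoln_count.
words = "the quick brown fox jumps over the lazy dog the end".split()

# Time the same unique-word step the notebook performs with set().
unique_words, elapsed = timed(set, words)

print(f"--- {elapsed:.6f} seconds ---")
print(len(unique_words))
```

For the Spark side, the function passed to `timed` would need to end in an action such as `count()`, since transformations alone do not run the full job.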


**Count the word frequency using a distributed environment**

In [None]:
start_time = time.time()
# Swap to (count, word) pairs and sort from most to least frequent
SortedWordsCount=DistinctWordsCount.map(lambda a: (a[1], a[0])).sortByKey(ascending=False)
print("--- %s seconds ---" % (time.time() - start_time))
dc_time=time.time() - start_time
# Print the 20 most frequent words
#SortedWordsCount.top(20)

--- 1.5747027397155762 seconds ---


**Count the word frequency using a single computer**

In [None]:
start_time = time.time()
counted_words=[]
words_unique=list(words_unique)
# Count each unique word in the full word list; counting inside the
# unique list itself would always give 1
for word in words_unique:
  counted_words.append((single_lincoln_count.count(word), word))
print("--- %s seconds ---" % (time.time() - start_time))
sc_time= time.time() - start_time


--- 0.0010328292846679688 seconds ---


In [None]:
print("distributed time:    ",dc_time )
print("single compute time: ", sc_time)

distributed time:     1.574899673461914
single compute time:  0.005253791809082031
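
On a single machine, the standard library's `collections.Counter` is a common alternative for the word-frequency step, since it tallies every word in one pass over the list. The sketch below is an illustration only: the token list is a short phrase from the address rather than the full file, and the variable names are hypothetical.

```python
from collections import Counter

# Short sample phrase; the notebook itself uses the full single_lincoln_count list.
tokens = (
    "that we here highly resolve that these dead shall not "
    "have died in vain that this nation"
).split()

# One pass over the tokens builds a word -> count mapping.
frequencies = Counter(tokens)

# Build (count, word) pairs, as in the notebook, sorted from most to least frequent.
counted_words = sorted(((count, word) for word, count in frequencies.items()), reverse=True)

print(counted_words[0])
```

This produces the same `(count, word)` shape as the Spark result, so the two outputs can be compared directly.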
