<a href="https://colab.research.google.com/github/cagBRT/PySpark/blob/master/Distrib_vs_Single.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:
!git clone https://github.com/cagBRT/PySpark.git

Cloning into 'PySpark'...
remote: Enumerating objects: 390, done.
remote: Counting objects: 100% (60/60), done.
remote: Compressing objects: 100% (60/60), done.
remote: Total 390 (delta 37), reused 0 (delta 0), pack-reused 330
Receiving objects: 100% (390/390), 3.36 MiB | 13.50 MiB/s, done.
Resolving deltas: 100% (235/235), done.




---



# Setup Spark

In [3]:
!pip install pyspark

Collecting pyspark
  Downloading pyspark-3.5.1.tar.gz (317.0 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 317.0/317.0 MB 1.8 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.5.1-py2.py3-none-any.whl size=317488491 sha256=f68ff5c4f80079db13955e1bf7ff1fefdf78e2a0098d26645dcb7148ffdc2a0a
  Stored in directory: /root/.cache/pip/wheels/80/1d/60/2c256ed38dddce2fdd93be545214a63e02fbd8d74fb0b7f3a6
Successfully built pyspark
Installing collected packages: pyspark
Successfully installed pyspark-3.5.1


In [4]:
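# Import SparkSession
from pyspark.sql import SparkSession
# Create a Spark session that runs locally on all available cores (local[*]).
# getOrCreate() returns the existing session if there is one; otherwise it creates a new one.
spark = SparkSession.builder.master("local[*]").getOrCreate()
# Display the Spark session information
spark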

In [5]:
import urllib.request
from pyspark.sql import SparkSession
from pyspark import SparkContext

In [6]:
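# Note: a session already exists from the earlier cell, so getOrCreate() returns it;
# startup settings such as the master and spark.driver.memory may not take effect here.
spark = (
    SparkSession.builder
        .appName('learn')
        .config('spark.driver.memory', '8g')
        .master('local[4]')
        .config('spark.sql.execution.arrow.pyspark.enabled', True)
        .config('spark.sql.execution.arrow.pyspark.fallback.enabled', False)
        .getOrCreate()
)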

In [7]:
# The session already owns a SparkContext, so reuse it directly
sc = spark.sparkContext



---



# Get all of Shakespeare’s works

In [8]:
import time

**Let's find out how long it takes a distributed compute engine (Spark, here running in local mode) to count all the words in Shakespeare's works**.

In [17]:
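start_time = time.time()
# Count the number of words in all of Shakespeare's works
Words=sc.textFile("/content/PySpark/shakespeare.txt")
WordsCount=Words.flatMap(lambda line: line.split(" ")).map(lambda word: (word, 1))
# Note: textFile, flatMap and map are lazy transformations, so the time printed
# here covers only building the pipeline, not reading or splitting the file.
print("--- %s seconds ---" % (time.time() - start_time))
dc_time=time.time() - start_time
# count() is the action that actually runs the job; it sits outside the timed region.
WordsCount.count()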

--- 0.07819986343383789 seconds ---


1418390
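
The cell above stops its timer before `count()` runs, and Spark transformations such as `flatMap` and `map` are lazy, so the printed time reflects pipeline construction only. Below is a minimal sketch of one way to bring the action inside the measurement. `timed` is a hypothetical helper (it is not part of the notebook or of PySpark), and the toy `lines` list stands in for the RDD so that the block runs without Spark.

```python
import time


def timed(action):
    """Run a zero-argument callable and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = action()  # the work happens here, inside the timed region
    return result, time.perf_counter() - start


# Toy stand-in for the text file (an assumption made for illustration only).
lines = ["to be or not to be", "that is the question"]

# Count space-separated tokens line by line, as the Spark cell does.
total, elapsed = timed(lambda: sum(len(line.split(" ")) for line in lines))
print("tokens:", total)
print("--- %s seconds ---" % elapsed)

# With the notebook's RDD the same helper could be used as:
#   total, dc_time = timed(WordsCount.count)
```

Timing the action this way should make the Spark figure comparable to the plain-Python cell, which does all of its work eagerly.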

**Let's find out how long it takes a *single computer* running plain Python to count all the words in Shakespeare's works**.

In [18]:
start_time = time.time()
# Read the whole file eagerly, then split it on single spaces
with open("/content/PySpark/shakespeare.txt", "r") as f:
    words_shakes = f.read()
word_shakes_python=words_shakes.split(" ")
print("--- %s seconds ---" % (time.time() - start_time))
sc_time=time.time() - start_time
len(word_shakes_python)

--- 0.1267104148864746 seconds ---


1293935
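
The two cells report different totals (1418390 from Spark, 1293935 from plain Python). A likely explanation is that `sc.textFile` yields one record per line before splitting, whereas `f.read().split(" ")` splits the whole text on spaces only, so a word at the end of one line and the first word of the next stay glued together around the newline. The sketch below reproduces the effect on a small made-up string; the sample text is an assumption for illustration, not data from the file.

```python
# Three short lines joined by newlines (sample text assumed for illustration).
text = "to be or not\nto be that\nis the question"

# Spark-style: split into lines first, then split each line on a space.
per_line_tokens = [word for line in text.splitlines() for word in line.split(" ")]

# Notebook's plain-Python style: split the entire text on a space only.
whole_text_tokens = text.split(" ")

# Splitting on any whitespace treats newlines as separators too.
whitespace_tokens = text.split()

print(len(per_line_tokens))    # 10
print(len(whole_text_tokens))  # 8
print(len(whitespace_tokens))  # 10
print(whole_text_tokens[3])    # 'not\nto' printed across two lines
```

In this toy case the gap between the two counts is the number of line breaks (2). The notebook's gap of 124455 would be consistent with the same mechanism if the file has that many line breaks. Using `split()` with no argument in both versions is one way to make the counts agree.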

In [19]:
print("distributed time:    ",dc_time )
print("single compute time: ", sc_time)

distributed time:     0.07841944694519043
single compute time:  0.1269824504852295


In [22]:
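start_time = time.time()
# Sum the counts per word; the number of resulting keys is the number of distinct words
DistinctWordsCount=WordsCount.reduceByKey(lambda a,b: a+b)
# reduceByKey is a lazy transformation, so this times only the pipeline definition
print("--- %s seconds ---" % (time.time() - start_time))
dc_time=time.time() - start_time
DistinctWordsCount.count()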

--- 0.034563302993774414 seconds ---


67506

In [23]:
start_time = time.time()
words_unique=set(word_shakes_python)
print("--- %s seconds ---" % (time.time() - start_time))
sc_time=time.time() - start_time
len(words_unique)

--- 0.2365860939025879 seconds ---


86196

In [24]:
print("distributed time:    ",dc_time )
print("single compute time: ", sc_time)

distributed time:     0.03486371040344238
single compute time:  0.2383289337158203


In [25]:
start_time = time.time()
# Swap each pair to (count, word) and sort by count (sortByKey sorts ascending by default)
SortedWordsCount=DistinctWordsCount.map(lambda a: (a[1], a[0])).sortByKey()
print("--- %s seconds ---" % (time.time() - start_time))
dc_time=time.time() - start_time
# Print the 20 most frequent words; top(20) returns the largest keys in descending order
SortedWordsCount.top(20)

--- 1.3808770179748535 seconds ---


[(517065, ''),
 (23242, 'the'),
 (19540, 'I'),
 (18297, 'and'),
 (15623, 'to'),
 (15544, 'of'),
 (12532, 'a'),
 (10824, 'my'),
 (9576, 'in'),
 (9081, 'you'),
 (7851, 'is'),
 (7531, 'that'),
 (7068, 'And'),
 (6948, 'not'),
 (6722, 'with'),
 (6218, 'his'),
 (6009, 'your'),
 (6002, 'be'),
 (5616, 'for'),
 (5236, 'have')]
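
The most frequent "word" in the output above is the empty string. That is what `split(" ")` produces for indentation, runs of consecutive spaces, and blank lines. A small sketch of the behaviour follows, using a made-up indented line as an assumed example.

```python
# An indented line, as is common in play scripts (sample text assumed).
indented = "  Enter HAMLET"

on_space = indented.split(" ")   # splits on every single space
on_whitespace = indented.split() # collapses runs of whitespace

print(on_space)       # ['', '', 'Enter', 'HAMLET']
print(on_whitespace)  # ['Enter', 'HAMLET']

# A blank line still yields one empty token when split on a space.
blank_on_space = "".split(" ")
blank_on_whitespace = "".split()
print(blank_on_space)       # ['']
print(blank_on_whitespace)  # []
```

If empty tokens are not wanted in the ranking, one option is to split with `line.split()` in the `flatMap` step, or to add an RDD `filter` that drops empty strings before counting.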

In [27]:
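# Single-machine frequency count, left commented out: list.count() rescans the full word list once per unique word
#start_time = time.time()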
#counted_words=[]
#words_unique=list(words_unique)
#for i in range(len(words_unique)):
#  tuples= (word_shakes_python.count(words_unique[i]), words_unique[i])
#  counted_words.append(tuples)
#print("--- %s seconds ---" % (time.time() - start_time))
#sc_time= time.time() - start_time

--- 2157.6034429073334 seconds ---


In [28]:
#print("distributed time:    ",dc_time )
#print("single compute time: ", sc_time)

distributed time:     1.3811061382293701
single compute time:  2157.603858232498
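
The single-machine loop above calls `list.count()` once per unique word, so its cost grows with (unique words × total words), and much of the 2157 seconds is probably due to that quadratic pattern rather than to running on one machine. As a hedged alternative (a swapped-in technique, not the notebook's method), `collections.Counter` from the standard library tallies every word in a single pass. The short word list below is an assumption for illustration; in the notebook the input would be `word_shakes_python`.

```python
from collections import Counter

# Small stand-in for word_shakes_python (sample data assumed for illustration).
words = "the cat and the hat and the bat".split(" ")

# One pass over the list builds a word -> count mapping.
counts = Counter(words)

# Build (count, word) tuples, matching the shape used in the Spark cells.
counted_words = [(count, word) for word, count in counts.items()]

# most_common(n) returns the n highest counts in descending order.
top_two = counts.most_common(2)
print(top_two)  # [('the', 3), ('and', 2)]
```

Timing this version against the Spark pipeline, with the Spark action included in the measurement, would give a closer like-for-like comparison.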
