<a href="https://colab.research.google.com/github/cagBRT/PySpark/blob/master/Distrib_vs_Single.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In this notebook we compare doing a task in a distributed environment and on a single CPU (single computer)

In [1]:
!git clone https://github.com/cagBRT/PySpark.git

Cloning into 'PySpark'...
remote: Enumerating objects: 438, done.[K
remote: Counting objects: 100% (108/108), done.[K
remote: Compressing objects: 100% (108/108), done.[K
remote: Total 438 (delta 66), reused 0 (delta 0), pack-reused 330 (from 1)[K
Receiving objects: 100% (438/438), 3.39 MiB | 11.35 MiB/s, done.
Resolving deltas: 100% (264/264), done.




---



---



# Setup PySpark<br>

If using DataBricks this section will be different

In [31]:
!pip install pyspark



In [30]:
#Import SparkSession
from pyspark.sql import SparkSession
# Create a Spark Session
#getOrCreate gets or creates a session
spark = SparkSession.builder.master("local[*]").getOrCreate()
# Check Spark Session Information
spark

In [33]:
def setupSpark():
  # Spark needs to run with Java 8 ...
  !pip install -q findspark
  !apt-get install openjdk-8-jdk-headless > /dev/null
  !echo 2 | update-alternatives --config java > /dev/null
  # !java -version
  import os, findspark
  os.environ['JAVA_HOME'] = '/usr/lib/jvm/java-8-openjdk-amd64'
  # !echo JAVA_HOME=$JAVA_HOME
  !pip install -q pyspark
  findspark.init(spark_home='/usr/local/lib/python3.6/dist-packages/pyspark')
  !pyspark --version


setupSpark()

/usr/local/bin/pyspark: line 24: /usr/local/lib/python3.6/dist-packages/pyspark/bin/load-spark-env.sh: No such file or directory
/usr/local/bin/pyspark: line 68: /usr/local/lib/python3.6/dist-packages/pyspark/bin/spark-submit: No such file or directory


In [4]:
import urllib.request
from pyspark.sql import SparkSession
from pyspark import SparkContext

In [5]:
spark = (
    SparkSession.builder
        .appName('learn')
        .config('spark.driver.memory', '8g')
        .master('local[4]',)
        .config('spark.sql.execution.arrow.pyspark.enabled', True)
        .config('spark.sql.execution.arrow.pyspark.fallback.enabled', False)
        .getOrCreate()
)

In [6]:
sc=spark.sparkContext.getOrCreate()

In [7]:
import time



---



---



We will now get all the works of Shakespeare (a very large file) and perform count and sort tasks.

# Get all of Shakespeare’s works

**Let's find out how long it takes a distributed computer to perform the task of counting all the words in Shakespeare's works**.

In [8]:
start_time = time.time()
#Count the number of words in all of Shakespear's works
Words=sc.textFile("/content/PySpark/shakespeare.txt")
WordsCount=Words.flatMap(lambda line: line.split(" ")).map(lambda word: (word, 1))
print("--- %s seconds ---" % (time.time() - start_time))
dc_time=time.time() - start_time
print("Number of words:",WordsCount.count())

--- 1.2822182178497314 seconds ---
Number of words: 1418390


**Let's find out how long it takes a *single computer* to perform the task of counting all the words in Shakespeare's works**.

In [35]:
start_time = time.time()
f= open("/content/PySpark/shakespeare.txt", "r")
words_shakes= f.read()
f.close()
word_shakes_python=words_shakes.split(" ")
sc_time=time.time() - start_time
print("--- %s seconds ---" % (sc_time))
len(word_shakes_python)

--- 0.15586352348327637 seconds ---


1293935

In [36]:
print("distributed time:    ",dc_time )
print("single compute time: ", sc_time)

distributed time:     1.501145601272583
single compute time:  0.15586352348327637


**How long for counting distinct words?**

In [11]:
start_time = time.time()
#Count the number of distinct words
DistinctWordsCount=WordsCount.reduceByKey(lambda a,b: a+b)
print("--- %s seconds ---" % (time.time() - start_time))
dc_time=time.time() - start_time
print("Number of distinct words:",DistinctWordsCount.count())

--- 0.0923318862915039 seconds ---
Number of distinct words: 67506


In [12]:
start_time = time.time()
words_unique=set(word_shakes_python)
print("--- %s seconds ---" % (time.time() - start_time))
sc_time=time.time() - start_time
len(words_unique)

--- 0.12170147895812988 seconds ---


86196

In [13]:
print("distributed time:    ",dc_time )
print("single compute time: ", sc_time)

distributed time:     0.09261393547058105
single compute time:  0.12292981147766113


In [14]:
start_time = time.time()
#Sort the words by most-to-least words
SortedWordsCount=DistinctWordsCount.map(lambda a: (a[1], a[0])).sortByKey()
print("--- %s seconds ---" % (time.time() - start_time))
dc_time=time.time() - start_time
#print most frequent 20 words
SortedWordsCount.top(20)

--- 1.3316192626953125 seconds ---


[(517065, ''),
 (23242, 'the'),
 (19540, 'I'),
 (18297, 'and'),
 (15623, 'to'),
 (15544, 'of'),
 (12532, 'a'),
 (10824, 'my'),
 (9576, 'in'),
 (9081, 'you'),
 (7851, 'is'),
 (7531, 'that'),
 (7068, 'And'),
 (6948, 'not'),
 (6722, 'with'),
 (6218, 'his'),
 (6009, 'your'),
 (6002, 'be'),
 (5616, 'for'),
 (5236, 'have')]

In [15]:
#start_time = time.time()
#counted_words=[]
#words_unique=list(words_unique)
#for i in range(len(words_unique)):
#  tuples= (word_shakes_python.count(words_unique[i]), words_unique[i])
#  counted_words.append(tuples)
#print("--- %s seconds ---" % (time.time() - start_time))
#sc_time= time.time() - start_time

In [16]:
#print("distributed time:    ",dc_time )
#print("single compute time: ", sc_time)

distributed time:     1.38<br>
single compute time:  2158



---

# Gettyburg Address

---



**Count all the words using a distributed environment**

In [17]:
start_time = time.time()
Lincoln = sc.textFile("/content/PySpark/GettysBurg.txt")
LincolnCount=Lincoln.flatMap(lambda line: line.split(" ")).map(lambda word: (word, 1))
dist_lincoln_count=LincolnCount.count()
print("--- %s seconds ---" % (time.time() - start_time))
dc_time=time.time() - start_time
print("Number of words:",dist_lincoln_count)

--- 0.5492563247680664 seconds ---
Number of words: 266


**Count all the words using a single CPU**

In [18]:
start_time = time.time()
f= open("/content/PySpark/GettysBurg.txt", "r")
single_lincoln_count= f.read()
f.close()
single_lincoln_count=single_lincoln_count.split(" ")
print("--- %s seconds ---" % (time.time() - start_time))
sc_time=time.time() - start_time
len(single_lincoln_count)

--- 0.0017499923706054688 seconds ---


263

In [19]:
print("distributed time:    ",dc_time )
print("single compute time: ", sc_time)

distributed time:     0.5495297908782959
single compute time:  0.0018970966339111328


**Count unique words using a distributed environment**

In [20]:
#Count the number of distinct words
start_time = time.time()
DistinctWordsCount=LincolnCount.reduceByKey(lambda a,b: a+b)
DistinctWordsCount.count()
dc_time=time.time() - start_time
print("--- %s seconds ---" % (time.time() - start_time))
DistinctWordsCount.count()

--- 1.115626573562622 seconds ---


158

**Count the unique words using a single computer**

In [21]:
start_time = time.time()
words_unique=set(single_lincoln_count)
print("--- %s seconds ---" % (time.time() - start_time))
sc_time=time.time() - start_time
len(words_unique)

--- 0.00014400482177734375 seconds ---


155

In [22]:
print("distributed time:    ",dc_time )
print("single compute time: ", sc_time)

distributed time:     1.1155269145965576
single compute time:  0.00784754753112793


**Count the word frequency using a distributed environment**

In [23]:
start_time = time.time()
#Sort the words by most-to-least words
SortedWordsCount=DistinctWordsCount.map(lambda a: (a[1], a[0])).sortByKey()
print("--- %s seconds ---" % (time.time() - start_time))
dc_time=time.time() - start_time
#print most frequent 20 words
#SortedWordsCount.top(20)

--- 1.5009355545043945 seconds ---


**Count the word frequency using a single computer**

In [24]:
start_time = time.time()
counted_words=[]
words_unique=list(words_unique)
for i in range(len(words_unique)):
  tuples= (words_unique.count(words_unique[i]), words_unique[i])
  counted_words.append(tuples)
print("--- %s seconds ---" % (time.time() - start_time))
sc_time= time.time() - start_time


--- 0.0011508464813232422 seconds ---


In [25]:
print("distributed time:    ",dc_time )
print("single compute time: ", sc_time)

distributed time:     1.501145601272583
single compute time:  0.013538599014282227


In [37]:
sc.stop()