# PySpark Scaling Demo

## HPC and Data Science Summer Institute

Mai H. Nguyen - UC San Diego

--- 

## Setup

In [1]:
# Initialize Spark

import pyspark
from pyspark.sql import SparkSession, Row

# Change N in local[N] to set the number of local worker threads
# (local[*] uses all available cores), then note the change in execution time
conf = pyspark.SparkConf().setAll([('spark.master', 'local[1]'),
                                   ('spark.app.name', 'Spark Demo'),
                                   ('spark.driver.memory', '3G'),
                                   ('spark.driver.maxResultSize', '2G'),
                                   ('spark.executor.memory', '2G')])
spark = SparkSession.builder.config(conf=conf).getOrCreate()

print(spark.version, pyspark.__version__)

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/08/08 01:44:07 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


3.2.1 3.2.1


In [2]:
print(spark.sparkContext.defaultParallelism)

1
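The `local[N]` master setting controls how many worker threads Spark runs locally, and in local mode `defaultParallelism` normally reports that same N. As a minimal sketch for repeating the demo with different values of N, the helper below builds the same configuration pairs for a given thread count. The name `build_conf_pairs` is hypothetical and not part of PySpark; the Spark calls are shown only as comments so the block runs without a Spark installation.

```python
def build_conf_pairs(n_threads, app_name="Spark Demo"):
    """Return (key, value) pairs for a local Spark session with n_threads worker threads.

    Hypothetical helper for this demo; not part of the PySpark API.
    """
    if n_threads < 1:
        raise ValueError("n_threads must be at least 1")
    return [
        ("spark.master", f"local[{n_threads}]"),
        ("spark.app.name", app_name),
        ("spark.driver.memory", "3G"),
        ("spark.driver.maxResultSize", "2G"),
        ("spark.executor.memory", "2G"),
    ]


pairs = build_conf_pairs(4)
print(pairs[0])

# With PySpark installed, the pairs can be passed to SparkConf as in the setup cell:
#   conf = pyspark.SparkConf().setAll(build_conf_pairs(4))
#   spark = SparkSession.builder.config(conf=conf).getOrCreate()
```

Because `getOrCreate()` returns the already-running session if one exists, call `spark.stop()` (or restart the kernel) before creating a session with a different N.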


In [3]:
# Record starting time

import time
start_time = time.time()
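Recording a single start time measures the whole notebook run. To time individual steps as well, one option is a small context manager, sketched below. The name `stopwatch` and the `timings` dictionary are hypothetical, and the summation is only a stand-in workload so the block runs without Spark.

```python
import time
from contextlib import contextmanager


@contextmanager
def stopwatch(label, results):
    """Store the wall-clock seconds spent inside the with-block in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


timings = {}
with stopwatch("sum", timings):
    # Stand-in workload; in the notebook this could be an action such as textDF.count()
    total = sum(range(1_000_000))

print(f"sum: {timings['sum']:.3f} s")
```

`time.perf_counter()` is used here instead of `time.time()` because it is a monotonic clock intended for measuring intervals.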

## Read data

In [4]:
# Read data into Spark DataFrame 
# Data source: https://jmcauley.ucsd.edu/data/amazon/ 

from os.path import expanduser
HOME = expanduser("~")

data_path = HOME + '/data/'

dataFileName = data_path + "BookReviews_5M.txt"
# cache() is lazy: the file is read and cached when the first action (count) runs
textDF = spark.read.text(dataFileName).cache()

## Process data

In [5]:
%%time

# Count number of rows
textDF.count()



CPU times: user 10.3 ms, sys: 2.68 ms, total: 13 ms
Wall time: 11.4 s

5000000

In [6]:
# Show the first 20 rows (long values are truncated)

textDF.show()

+--------------------+
|               value|
+--------------------+
|This was the firs...|
|Also after going ...|
|As with all of Ms...|
|I've not read any...|
|This romance nove...|
|Carolina Garcia A...|
|Not only can she ...|
|Once again Garcia...|
|The timing is jus...|
|Engaging. Dark. R...|
|Set amid the back...|
|This novel is a d...|
|If readers are ad...|
| Reviewed by Phyllis|
|      APOOO BookClub|
|A guilty pleasure...|
|In the tradition ...|
|Beryl Unger, top ...|
|What follows is a...|
|The book flap say...|
+--------------------+
only showing top 20 rows



In [7]:
# Print wall-clock seconds elapsed since start_time was recorded

print(time.time() - start_time)

13.902007579803467
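The elapsed time above was measured with `local[1]`. After rerunning with a larger N, the change is usually summarized as speedup (baseline time divided by the time with N threads) and parallel efficiency (speedup divided by N). The sketch below assumes that convention. The function names are hypothetical, and the example timings are placeholders, not measurements from this demo.

```python
def speedup(baseline_seconds, parallel_seconds):
    """Ratio of the single-thread run time to the run time with N threads."""
    if baseline_seconds <= 0 or parallel_seconds <= 0:
        raise ValueError("timings must be positive")
    return baseline_seconds / parallel_seconds


def efficiency(baseline_seconds, parallel_seconds, n_threads):
    """Speedup divided by the number of threads; 1.0 would be ideal linear scaling."""
    if n_threads < 1:
        raise ValueError("n_threads must be at least 1")
    return speedup(baseline_seconds, parallel_seconds) / n_threads


# Placeholder timings in seconds, NOT measured in this notebook
example_speedup = speedup(12.0, 4.0)
example_efficiency = efficiency(12.0, 4.0, 4)
print(example_speedup, example_efficiency)
```

To use it with real numbers, record the value printed by the last timing cell for each setting of N and pass the `local[1]` value as `baseline_seconds`.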


## Stop Spark Session

In [8]:
spark.stop()