<br><br><br>
<span style="color:red;font-size:60px">PySpark</span>
<br><br>

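Notes:

1. Data objects are untyped: PySpark works with ordinary Python objects
2. The Python way: columns accessed by index or with col, lambda functions, and backslash line continuation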

<li>The PySpark package should already be installed</li>
<li>If it isn't, use pip or conda to install the correct version of pyspark</li>
<li>Check the Spark version with <span style="color:blue">spark.version</span> (mine is 3.3.0) and install the matching version of pyspark (e.g., <span style="color:blue">pip install pyspark==3.3.0</span>)</li>

<span style="color:blue;font-size:large">Start a Spark session</span>
<li>Unlike the Scala Spark shell, which creates the session for you, a Python script or notebook must start the Spark session explicitly</li>
<li>You also need to extract the Spark context from the session explicitly</li>


In [1]:
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("PySpark ML Basics").getOrCreate()
sc = spark.sparkContext


23/02/02 16:33:15 WARN Utils: Your hostname, VickyZMacBook-Pro-2.local resolves to a loopback address: 127.0.0.1; using 192.168.11.104 instead (on interface en0)
23/02/02 16:33:15 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address


Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


23/02/02 16:33:16 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
23/02/02 16:33:16 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
23/02/02 16:33:16 WARN Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.


In [2]:
sc.setLogLevel("WARN")

<br><br><br>

<span style="color:green;font-size:xx-large">Working with RDDs in pyspark</span>

<li>The Spark API for Python is similar to the API for Scala</li>
<li>The difference is that the data objects are Python objects (untyped), not Scala objects</li>
<li>Compare the RDD created below, which carries no element type, with the equivalent typed RDD in Scala:</li>
<pre>
x: Array[(String, Int, Int)] = Array((John,20,32), (Jill,15,45))
res6: org.apache.spark.rdd.RDD[(String, Int, Int)] = ParallelCollectionRDD[7] at parallelize at <console>:33
</pre>

<span style="color:blue;font-size:large">Creating an RDD</span>

In [3]:
x = [("John",20,32),("Jill",15,45)]
sc.parallelize(x) #No type information!

ParallelCollectionRDD[0] at readRDDFromFile at PythonRDD.scala:274

<span style="color:blue;font-size:large">Reading a text file</span>

In [4]:
data = sc.textFile("nyc_311_2022_clean.csv")
data.first()

'Created Date,Closed Date,Agency,Agency Name,Complaint Type,Incident Zip,Borough,Latitude,Longitude,processing_time,processing_days'

In [6]:
data.take(5)

['Created Date,Closed Date,Agency,Agency Name,Complaint Type,Incident Zip,Borough,Latitude,Longitude,processing_time,processing_days',
 '2020-01-07 14:09:00,2020-01-13 11:20:00,DSNY,Department of Sanitation,Electronics Waste Appointment,11692,QUEENS,40.58993519447414,-73.78942049765358,5 days 21:11:00,5.882638888888889',
 '2020-01-04 11:37:00,2020-01-08 13:19:00,DSNY,Department of Sanitation,Electronics Waste Appointment,10310,STATEN ISLAND,40.62719924888892,-74.11245623591475,4 days 01:42:00,4.070833333333334',
 '2020-01-03 16:33:00,2020-01-05 00:00:00,DSNY,Department of Sanitation,Request Large Bulky Item Collection,11213,BROOKLYN,40.6672052181697,-73.93463635283278,1 days 07:27:00,1.3104166666666668',
 '2020-01-06 17:27:00,2020-01-11 00:00:00,DSNY,Department of Sanitation,Request Large Bulky Item Collection,11379,QUEENS,40.72687041685842,-73.8769198480706,4 days 06:33:00,4.272916666666666']

In [5]:
data

nyc_311_2022_clean.csv MapPartitionsRDD[2] at textFile at NativeMethodAccessorImpl.java:0

<span style="color:blue;font-size:large">Using map</span>
<li>Note the use of lambda functions (Python's anonymous functions)</li>
<li>Actions like <span style="color:blue">take</span>, <span style="color:blue">first</span>, <span style="color:blue">collect</span>, etc., are the same as in the Scala API</li>
<li>But you must always call them with parentheses, even when they take no arguments (Scala lets you omit them; Python does not)</li>

In [7]:
time_data = data.map(lambda l: l.split(",")).map(lambda l: (l[2],l[10]))
time_data.first()
#time_data

('Agency', 'processing_days')

In [8]:
time_data.take(4)

[('Agency', 'processing_days'),
 ('DSNY', '5.882638888888889'),
 ('DSNY', '4.070833333333334'),
 ('DSNY', '1.3104166666666668')]
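Splitting on "," as above works as long as no field contains a comma inside quotes. If the file did contain such fields (an assumption; the rows shown here do not), a minimal sketch using Python's standard csv module could parse each line instead. The helper name parse_csv_line and the quoted sample line are made up for illustration.

```python
import csv

def parse_csv_line(line):
    """Parse one CSV line into a list of fields, honoring double quotes."""
    # csv.reader accepts any iterable of lines; wrap the single line in a list.
    return next(csv.reader([line]), [])

plain = "2020-01-07 14:09:00,2020-01-13 11:20:00,DSNY"
quoted = 'DSNY,"Sanitation, Department of",QUEENS'  # hypothetical line with a quoted comma

plain_fields = parse_csv_line(plain)    # 3 fields, same result as plain.split(",")
naive_fields = quoted.split(",")        # 4 pieces: the quoted field is cut in two
quoted_fields = parse_csv_line(quoted)  # 3 fields: the quoted comma is preserved

print(plain_fields)
print(naive_fields)
print(quoted_fields)
```

Under that assumption, data.map(parse_csv_line) could stand in for data.map(lambda l: l.split(",")), provided no field spans more than one line.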

In [9]:
time_data.getNumPartitions()

49

In [10]:
time_data.count() # returns a plain Python int

8835630

<span style="color:blue;font-size:large">Drop the header row</span>

In [11]:
time_data.mapPartitionsWithIndex(lambda i,it: iter(list(it)[1:] if i==0 else it)).count()

8835629

In [12]:
time_data.mapPartitionsWithIndex(lambda i,it: iter(list(it)[1:] if i==0 else it)).take(4)



[('DSNY', '5.882638888888889'),
 ('DSNY', '4.070833333333334'),
 ('DSNY', '1.3104166666666668'),
 ('DSNY', '4.272916666666666')]
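The lambda above builds a full Python list of partition 0 just to drop one row. A possible alternative, sketched here in plain Python so it runs without Spark, skips the first row lazily with itertools.islice. The function name drop_header and the two small sample partitions are illustrative, and like the original it assumes the header is the first row of partition 0.

```python
from itertools import islice

def drop_header(partition_index, rows):
    """Skip the first row of partition 0; pass other partitions through unchanged."""
    # islice(rows, 1, None) yields everything after the first element
    # without materializing the partition as a list.
    return islice(rows, 1, None) if partition_index == 0 else rows

# Plain-Python stand-in for an RDD with two partitions (made-up rows).
partitions = [
    [("Agency", "processing_days"), ("DSNY", "5.88")],
    [("DSNY", "4.07"), ("NYPD", "0.33")],
]

cleaned = [
    row
    for index, part in enumerate(partitions)
    for row in drop_header(index, iter(part))
]
print(cleaned)
```

Since the function takes a partition index and an iterator and returns an iterator, it has the shape that mapPartitionsWithIndex expects, so time_data.mapPartitionsWithIndex(drop_header) should behave like the lambda version.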

<span style="color:blue;font-size:large">Convert processing time to float</span>

In [13]:
time_data.mapPartitionsWithIndex(lambda i,it: iter(list(it)[1:] if i==0 else it)).\
    map(lambda l: (l[0],float(l[1]))).\
    take(4)

[('DSNY', 5.882638888888889),
 ('DSNY', 4.070833333333334),
 ('DSNY', 1.3104166666666668),
 ('DSNY', 4.272916666666666)]

In [14]:
processed_data = time_data.mapPartitionsWithIndex(lambda i,it: iter(list(it)[1:] if i==0 else it)).\
    map(lambda l: (l[0],float(l[1])))
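The float conversion above raises ValueError on an empty or malformed value, and one such row is enough to fail the whole Spark job. This file converted cleanly, so the following is only a defensive sketch for messier data; the helper name parse_days and the sample rows are hypothetical.

```python
def parse_days(pair):
    """Return (agency, days as float), or None when days is not a valid number."""
    agency, days = pair
    try:
        return (agency, float(days))
    except (TypeError, ValueError):
        return None

# Made-up rows: one valid, one empty, one non-numeric.
rows = [("DSNY", "5.882638888888889"), ("NYPD", ""), ("DOT", "n/a")]

parsed = [p for p in map(parse_days, rows) if p is not None]
print(parsed)
```

On an RDD the same idea would read .map(parse_days).filter(lambda p: p is not None) in place of the single map with float.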

<span style="color:blue;font-size:large">Calculate averages using combineByKey</span>

In [15]:
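def combiner(x):  # create the initial (count, sum) accumulator the first time a key is seen in a partition
    return (1, x)

def merger(x, y):  # fold another value y into the accumulator x within a partition
    return (x[0] + 1, x[1] + y)

def merge_and_combiner(x, y):  # merge two accumulators for the same key from different partitions
    return (x[0] + y[0], x[1] + y[1])

# average = sum / count for each agency
processed_data.combineByKey(combiner, merger, merge_and_combiner)\
    .map(lambda x: (x[0], x[1][1] / x[1][0]))\
    .collect()
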

[('DHS', 1.257410746037668),
 ('OSE', 0.12648533950617283),
 ('DEP', 5.006676030990059),
 ('HPD', 13.242378653308977),
 ('DOT', 14.494687185115147),
 ('MAYORâ\x80\x99S OFFICE OF SPECIAL ENFORCEMENT', 10.993121710572918),
 ('EDC', 56.00096076391819),
 ('DOB', 39.10644259286542),
 ('DFTA', 13.390725308641976),
 ('DOHMH', 15.397117888616293),
 ('DOF', 19.76244362870432),
 ('TLC', 53.682035801957745),
 ('DSNY', 6.984900241530092),
 ('OFFICE OF TECHNOLOGY AND INNOVATION', 0.7780439814814815),
 ('DOITT', 28.392176800309237),
 ('NYPD', 0.33957275800732817),
 ('DCA', 3.1134426905912487),
 ('DPR', 65.99335925229713),
 ('FDNY', 402.1443981481481),
 ('DOE', 43.447434186637615)]
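To see how the three functions cooperate, here is a plain-Python simulation of the combineByKey flow on two small made-up partitions. It is a simplified model of the semantics (per-partition accumulation, then a cross-partition merge), not Spark's actual implementation, and combine_by_key is a hypothetical helper.

```python
def combiner(x):
    return (1, x)                        # first value for a key: (count, sum)

def merger(acc, value):
    return (acc[0] + 1, acc[1] + value)  # add one more value within a partition

def merge_and_combiner(a, b):
    return (a[0] + b[0], a[1] + b[1])    # merge accumulators across partitions

def combine_by_key(partitions, create, merge_value, merge_combiners):
    """Simplified model: accumulate per partition, then merge the partial results."""
    partials = []
    for part in partitions:
        acc = {}
        for key, value in part:
            acc[key] = merge_value(acc[key], value) if key in acc else create(value)
        partials.append(acc)
    result = {}
    for acc in partials:
        for key, comb in acc.items():
            result[key] = merge_combiners(result[key], comb) if key in result else comb
    return result

# Made-up (agency, processing_days) pairs spread over two partitions.
partitions = [
    [("DSNY", 2.0), ("NYPD", 1.0), ("DSNY", 4.0)],
    [("DSNY", 6.0), ("NYPD", 3.0)],
]

totals = combine_by_key(partitions, combiner, merger, merge_and_combiner)
averages = {key: total / count for key, (count, total) in totals.items()}
print(totals)    # (count, sum) per agency
print(averages)  # mean processing days per agency
```

Carrying (count, sum) pairs instead of partial averages is what makes the cross-partition merge correct: counts and sums can simply be added, whereas averages of unequal-sized partitions cannot.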

<br><br><br>
<span style="color:green;font-size:xx-large">PySpark DataFrames</span>
<br><br>

In [18]:
df = spark\
.read\
.option('header',True).csv("nyc_311_2022_clean.csv")



In [19]:
df.printSchema()

root
 |-- Created Date: string (nullable = true)
 |-- Closed Date: string (nullable = true)
 |-- Agency: string (nullable = true)
 |-- Agency Name: string (nullable = true)
 |-- Complaint Type: string (nullable = true)
 |-- Incident Zip: string (nullable = true)
 |-- Borough: string (nullable = true)
 |-- Latitude: string (nullable = true)
 |-- Longitude: string (nullable = true)
 |-- processing_time: string (nullable = true)
 |-- processing_days: string (nullable = true)



<span style="color:blue;font-size:large">Reassigning variables is possible in PySpark</span>

In [22]:
df = df.withColumnRenamed("Incident Zip","Zip Code") # in Scala, df could not be reassigned like this if it were declared as a val
df

DataFrame[Created Date: string, Closed Date: string, Agency: string, Agency Name: string, Complaint Type: string, Zip Code: string, Borough: string, Latitude: string, Longitude: string, processing_time: string, processing_days: string]

<span style="color:blue;font-size:large">df.select</span>


In [23]:
df.select("Agency","processing_days").show()

+------+------------------+
|Agency|   processing_days|
+------+------------------+
|  DSNY| 5.882638888888889|
|  DSNY| 4.070833333333334|
|  DSNY|1.3104166666666668|
|  DSNY| 4.272916666666666|
|  DSNY|1.5930555555555554|
|  DSNY| 2.272222222222222|
|  DSNY| 5.195138888888889|
|  DSNY|2.1868055555555554|
|  DSNY|2.4805555555555556|
|  DSNY|2.5652777777777778|
|  DSNY| 4.317361111111111|
|  DSNY|1.2472222222222222|
|  DSNY| 3.395138888888889|
|  DSNY| 5.979166666666667|
|  DSNY|1.3180555555555555|
|  DSNY| 3.870833333333333|
|  DSNY|1.5958333333333332|
|  DSNY| 9.120138888888889|
|  DSNY|3.5493055555555557|
|  DSNY|               2.0|
+------+------------------+
only showing top 20 rows



Scala equivalent of the import below: <span style="color:blue">import org.apache.spark.sql.functions.col</span>

In [24]:
from pyspark.sql.functions import col # the Scala $"column" shorthand is not available in PySpark
df.select("Agency",col("processing_days")*24).toDF("Agency","processing_hours") # option 1: refer to the column with col()

DataFrame[Agency: string, processing_hours: double]

In [25]:
df.select("Agency",df["processing_days"]*24).toDF("Agency","processing_hours") # option 2: index the DataFrame by column name

DataFrame[Agency: string, processing_hours: double]

In [26]:
df.withColumn("processing_hours",df["processing_days"]*24).take(4)

[Row(Created Date='2020-01-07 14:09:00', Closed Date='2020-01-13 11:20:00', Agency='DSNY', Agency Name='Department of Sanitation', Complaint Type='Electronics Waste Appointment', Zip Code='11692', Borough='QUEENS', Latitude='40.58993519447414', Longitude='-73.78942049765358', processing_time='5 days 21:11:00', processing_days='5.882638888888889', processing_hours=141.18333333333334),
 Row(Created Date='2020-01-04 11:37:00', Closed Date='2020-01-08 13:19:00', Agency='DSNY', Agency Name='Department of Sanitation', Complaint Type='Electronics Waste Appointment', Zip Code='10310', Borough='STATEN ISLAND', Latitude='40.62719924888892', Longitude='-74.11245623591475', processing_time='4 days 01:42:00', processing_days='4.070833333333334', processing_hours=97.70000000000002),
 Row(Created Date='2020-01-03 16:33:00', Closed Date='2020-01-05 00:00:00', Agency='DSNY', Agency Name='Department of Sanitation', Complaint Type='Request Large Bulky Item Collection', Zip Code='11213', Borough='BROOKLYN

<span style="color:blue;font-size:large">SQL</span>



In [27]:
df.createOrReplaceTempView("dataDB")
spark.sql("select Agency, AVG(processing_days) from dataDB GROUP BY Agency").show()



+--------------------+--------------------+
|              Agency|avg(processing_days)|
+--------------------+--------------------+
|MAYORâS OFFICE ...|  10.993121710572932|
|                 DOT|  14.494687185115172|
|                 HPD|  13.242378653308977|
|                 DCA|  3.1134426905912487|
|                 DPR|   65.99335925229708|
|                 TLC|   53.68203580195775|
|                 EDC|   56.00096076391825|
|                 DOF|   19.76244362870433|
|                NYPD|  0.3395727580073328|
|                 DOB|  39.106442592865434|
|                 DEP|   5.006676030990062|
|                 DOE|   43.44743418663761|
|               DOHMH|  15.397117888616267|
|                DSNY|   6.984900241530156|
|                 DHS|   1.257410746037668|
|               DOITT|  28.392176800309258|
|                DFTA|  13.390725308641976|
|OFFICE OF TECHNOL...|  0.7780439814814815|
|                 OSE| 0.12648533950617283|
|                FDNY|   402.144

In [28]:
df.toPandas() # toPandas() collects the whole DataFrame into driver memory; here it was too large and failed with OutOfMemoryError

[Stage 17:>                                                        (0 + 8) / 13]

22/12/07 09:01:53 ERROR Executor: Exception in task 0.0 in stage 17.0 (TID 220)
java.lang.OutOfMemoryError: Java heap space
	at java.base/java.util.Arrays.copyOf(Arrays.java:3745)
	at java.base/java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:120)
	at java.base/java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:95)
	at java.base/java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:156)
	at org.apache.spark.util.ByteBufferOutputStream.write(ByteBufferOutputStream.scala:41)
	at java.base/java.io.ObjectOutputStream$BlockDataOutputStream.write(ObjectOutputStream.java:1849)
	at java.base/java.io.ObjectOutputStream.write(ObjectOutputStream.java:708)
	at org.apache.spark.util.Utils$.writeByteBuffer(Utils.scala:244)
	at org.apache.spark.scheduler.DirectTaskResult.$anonfun$writeExternal$1(TaskResult.scala:53)
	at org.apache.spark.scheduler.DirectTaskResult$$Lambda$1604/0x0000000800be0c40.apply$mcV$sp(Unknown Source)
	at scala.runtime.java8.JFunction0$

----------------------------------------
Exception occurred during processing of request from ('127.0.0.1', 62851)
ERROR:root:Exception while sending command.
Traceback (most recent call last):
  File "/opt/homebrew/Cellar/apache-spark/3.3.0/libexec/python/lib/py4j-0.10.9.5-src.zip/py4j/clientserver.py", line 516, in send_command
    raise Py4JNetworkError("Answer from Java side is empty")
py4j.protocol.Py4JNetworkError: Answer from Java side is empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/homebrew/Cellar/apache-spark/3.3.0/libexec/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line 1038, in send_command
    response = connection.send_command(command)
  File "/opt/homebrew/Cellar/apache-spark/3.3.0/libexec/python/lib/py4j-0.10.9.5-src.zip/py4j/clientserver.py", line 539, in send_command
    raise Py4JNetworkError(
py4j.protocol.Py4JNetworkError: Error while sending or receiving
Traceback (most recen

ConnectionRefusedError: [Errno 61] Connection refused