<br><br><br>
<span style="color:red;font-size:60px">PySpark</span>
<br><br>

<li>The PySpark package should already be installed</li>
<li>If it isn't, use pip or conda to install a pyspark version that matches your Spark installation</li>
<li>Check the Spark version with <span style="color:blue">spark.version</span> (mine is 3.3.0) and install the same version of pyspark (e.g., <span style="color:blue">pip install pyspark==3.3.0</span>)</li>
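One way to see which pyspark version, if any, is installed in the current Python environment is to ask the package metadata. This is only a minimal sketch: the helper name installed_pyspark_version is illustrative, and the result still has to be compared by hand with the value of spark.version.

```python
from importlib.metadata import PackageNotFoundError, version

def installed_pyspark_version():
    """Return the installed pyspark version string, or None if it is missing."""
    try:
        return version("pyspark")
    except PackageNotFoundError:
        return None

found = installed_pyspark_version()
print(found if found else "pyspark is not installed in this environment")
```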

<span style="color:blue;font-size:large">Start a spark session</span>
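<li>Unlike the Scala Spark shell, where <span style="color:blue">spark</span> and <span style="color:blue">sc</span> are predefined, a Python script or notebook has to start the Spark session explicitly</li>
<li>The Spark context is then extracted from the session with <span style="color:blue">spark.sparkContext</span></li>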


In [None]:
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("PySpark ML Basics").getOrCreate()
sc = spark.sparkContext


<br><br><br>

<span style="color:green;font-size:xx-large">Working with RDDs in pyspark</span>

<li>The Spark API for Python is similar to the API for Scala</li>
<li>The difference is that the data objects are dynamically typed Python objects rather than statically typed Scala objects, so the RDD carries no element type</li>
<li>Compare the RDD created below with the equivalent RDD in Scala, whose type records the element type (String, Int, Int):</li>
<pre>
x: Array[(String, Int, Int)] = Array((John,20,32), (Jill,15,45))
res6: org.apache.spark.rdd.RDD[(String, Int, Int)] = ParallelCollectionRDD[7] at parallelize at &lt;console&gt;:33
</pre>

<span style="color:blue;font-size:large">Creating an RDD</span>

In [None]:
x = [("John",20,32),("Jill",15,45)]
sc.parallelize(x) #No type information!
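Because the RDD carries no static element type, the contents of the tuples can only be checked by inspecting the Python values at runtime. A small sketch on the same list x, which runs without Spark:

```python
x = [("John", 20, 32), ("Jill", 15, 45)]

# Python records the type on each value, not on the collection that holds it
element_types = [tuple(type(v).__name__ for v in row) for row in x]
print(element_types)  # [('str', 'int', 'int'), ('str', 'int', 'int')]
```

On an RDD the same check could be applied with map followed by first() or take().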

<span style="color:blue;font-size:large">Reading a text file</span>

In [None]:
data = sc.textFile("../../DataAnalytics/DataVisualization/nyc_311_2022_clean.csv")
data.first()

In [None]:
data

<span style="color:blue;font-size:large">Using map</span>
<li>Note the use of lambda functions (Python's anonymous functions)</li>
<li>Actions like <span style="color:blue">take</span>, <span style="color:blue">first</span>, <span style="color:blue">collect</span>, etc., are the same as in the Scala API</li>
<li>But in Python they must always be called with parentheses (e.g., <span style="color:blue">data.first()</span>), whereas Scala accepts <span style="color:blue">data.first</span> without them</li>

In [None]:
time_data = data.map(lambda l: l.split(",")).map(lambda l: (l[2],l[10]))
time_data.first()
#time_data
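Splitting on every comma is the simplest option, but it breaks a field apart if that field is quoted and contains a comma itself. Whether the 311 file has such fields is not shown here, so the following is only a sketch of an alternative using Python's csv module; the helper parse_csv_line and the sample record are hypothetical.

```python
import csv

def parse_csv_line(line):
    """Split one CSV line into fields, honoring double-quoted fields."""
    return next(csv.reader([line]))

# Hypothetical record, not taken from the 311 file
line = 'NYPD,"Noise, Residential",2.5'

naive = line.split(",")
parsed = parse_csv_line(line)
print(naive)   # ['NYPD', '"Noise', ' Residential"', '2.5']
print(parsed)  # ['NYPD', 'Noise, Residential', '2.5']
```

With the RDD above, this would be used as data.map(parse_csv_line) in place of the lambda that calls split.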

In [None]:
time_data.take(4)

In [None]:
time_data.getNumPartitions()

In [None]:
time_data.count()

<span style="color:blue;font-size:large">Drop the header row</span>

In [None]:
time_data.mapPartitionsWithIndex(lambda i,it: iter(list(it)[1:] if i==0 else it)).count()


In [None]:
time_data.mapPartitionsWithIndex(lambda i,it: iter(list(it)[1:] if i==0 else it)).take(4)
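The lambda above copies all of partition 0 into a list just to discard its first element. A possible alternative, sketched here under the same assumption that the header is the first record of partition 0, skips that record lazily with itertools.islice. The function name drop_header and the sample rows are illustrative only.

```python
from itertools import islice

def drop_header(index, iterator):
    """Skip the first record of partition 0; pass other partitions through untouched."""
    return islice(iterator, 1, None) if index == 0 else iterator

# Plain-Python check with two made-up partitions (no Spark needed)
partition_0 = iter([("Agency", "processing_days"), ("NYPD", "0.5")])
partition_1 = iter([("DOT", "2.0")])
rows = list(drop_header(0, partition_0)) + list(drop_header(1, partition_1))
print(rows)  # [('NYPD', '0.5'), ('DOT', '2.0')]
```

On the RDD this would be applied as time_data.mapPartitionsWithIndex(drop_header).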



<span style="color:blue;font-size:large">Convert processing time to float</span>

In [None]:
time_data.mapPartitionsWithIndex(lambda i,it: iter(list(it)[1:] if i==0 else it)).\
    map(lambda l: (l[0],float(l[1]))).\
    take(4)

In [None]:
processed_data = time_data.mapPartitionsWithIndex(lambda i,it: iter(list(it)[1:] if i==0 else it)).\
    map(lambda l: (l[0],float(l[1])))
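Calling float directly raises an exception on the first value that is blank or not numeric, and that exception fails the whole Spark job. If the file is not guaranteed to be clean, one option is a tolerant converter followed by a filter. This is a sketch; the helper to_float_or_none and the sample values are hypothetical.

```python
def to_float_or_none(text):
    """Convert text to float, returning None when it is missing or not numeric."""
    try:
        return float(text)
    except (TypeError, ValueError):
        return None

samples = ["2.5", "", "N/A", None]
converted = [to_float_or_none(s) for s in samples]
print(converted)  # [2.5, None, None, None]
```

In the pipeline above this would replace float(l[1]), followed by .filter(lambda kv: kv[1] is not None) to drop the rows that could not be converted.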

<span style="color:blue;font-size:large">Calculate averages using combineByKey</span>

In [None]:
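# Create the accumulator for the first value seen for a key: (count, sum)
def combiner(x):
    return (1, x)

# Fold one more value into an existing accumulator within a partition
def merger(x, y):
    return (x[0] + 1, x[1] + y)

# Merge two accumulators that were built on different partitions
def merge_and_combiner(x, y):
    return (x[0] + y[0], x[1] + y[1])

# Average per key = sum / count
processed_data.combineByKey(combiner, merger, merge_and_combiner)\
    .map(lambda x: (x[0], x[1][1] / x[1][0]))\
    .collect()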
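To see how the three functions passed to combineByKey cooperate, it can help to replay them by hand on two small partitions. The sketch below imitates that flow in plain Python; it is not Spark's implementation, the function names are descriptive stand-ins for the ones above, and the data is made up.

```python
def create_combiner(value):
    return (1, value)                          # first value for a key: (count, sum)

def merge_value(acc, value):
    return (acc[0] + 1, acc[1] + value)        # same partition, one more value

def merge_combiners(a, b):
    return (a[0] + b[0], a[1] + b[1])          # accumulators from two partitions

def combine_partition(records):
    """Build one (count, sum) accumulator per key inside a single partition."""
    accs = {}
    for key, value in records:
        accs[key] = merge_value(accs[key], value) if key in accs else create_combiner(value)
    return accs

# Made-up partitions of (agency, processing_days) pairs
partition_a = [("NYPD", 1.0), ("DOT", 4.0), ("NYPD", 3.0)]
partition_b = [("NYPD", 2.0), ("DOT", 6.0)]

merged = combine_partition(partition_a)
for key, acc in combine_partition(partition_b).items():
    merged[key] = merge_combiners(merged[key], acc) if key in merged else acc

averages = {key: total / count for key, (count, total) in merged.items()}
print(averages)  # {'NYPD': 2.0, 'DOT': 5.0}
```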
    

<br><br><br>
<span style="color:green;font-size:xx-large">PySpark Dataframes</span>
<br><br>

In [None]:
df = spark.read.option('header',True).csv("../../DataAnalytics/DataVisualization/nyc_311_2022_clean.csv")

In [None]:
df.printSchema()

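<span style="color:blue;font-size:large">Reassigning a variable is possible in PySpark</span>
<li>Unlike a Scala <span style="color:blue">val</span>, a Python name can be rebound to a new object</li>
<li>The DataFrame itself is immutable: <span style="color:blue">withColumnRenamed</span> returns a new DataFrame, which is then assigned back to <span style="color:blue">df</span></li>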

In [None]:
df = df.withColumnRenamed("Incident Zip","Zip Code")
df

<span style="color:blue;font-size:large">df.select</span>


In [None]:
df.select("Agency","processing_days")

In [None]:
from pyspark.sql.functions import col
df.select("Agency",col("processing_days")*24).toDF("Agency","processing_hours")

In [None]:
df.select("Agency",df["processing_days"]*24).toDF("Agency","processing_hours")

In [None]:
df.withColumn("processing_hours",df["processing_days"]*24).take(4)

<span style="color:blue;font-size:large">SQL</span>



In [None]:
df.createOrReplaceTempView("dataDB")
spark.sql("SELECT Agency, AVG(processing_days) FROM dataDB GROUP BY Agency").show()