## PySpark SQL Recipes With HiveQL, Dataframe and Graphframes

In [1]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("create").getOrCreate()

In [2]:
swimmerDf = spark.read.csv('swimmerData.csv',header=True, inferSchema=True)

In [3]:
swimmerDf.printSchema()

root
 |-- id: string (nullable = true)
 |-- Gender: string (nullable = true)
 |-- Occupation: string (nullable = true)
 |-- swimTimeInSecond: double (nullable = true)
 |-- Speed: double (nullable = true)



In [38]:
swimmerDf.show(4)

+---+------+-----------+----------------+-----+
| id|Gender| Occupation|swimTimeInSecond|Speed|
+---+------+-----------+----------------+-----+
|id1|  Male| programmer|           16.73|1.195|
|id2|Female|    Manager|           15.56|1.285|
|id3|  Male|    Manager|           15.15| 1.32|
|id4|  Male|riskanalyst|           15.27| 1.31|
+---+------+-----------+----------------+-----+
only showing top 4 rows



In [40]:
from pyspark.sql.types import StructField, StructType, StringType, DoubleType

**Creating the Schema of the DataFrame**

1. Let’s look at the arguments of StructField(). The first argument is the column name; for the first column it is id.
2. The second argument is the datatype of the column's elements. The datatype of the id column is StringType().
3. The last argument, set to True, marks the column as nullable: if some ID is missing, the corresponding element of the column will be null.

In [41]:
idColumn = StructField("id",StringType(),True)
genderColumn = StructField("Gender",StringType(),True)
OccupationColumn = StructField("Occupation",StringType(),True)
swimTimeInSecondColumn = StructField("swimTimeInSecond",DoubleType(),True)
speed = StructField("Speed",DoubleType(),True)
columnList = [idColumn, genderColumn, OccupationColumn,swimTimeInSecondColumn,speed]
swimmerDfSchema = StructType(columnList)
swimmerDfSchema 

StructType(List(StructField(id,StringType,true),StructField(Gender,StringType,true),StructField(Occupation,StringType,true),StructField(swimTimeInSecond,DoubleType,true),StructField(Speed,DoubleType,true)))

In [43]:
swimmerDf = spark.read.csv('swimmerData.csv',header=True,schema=swimmerDfSchema)
swimmerDf.printSchema()

root
 |-- id: string (nullable = true)
 |-- Gender: string (nullable = true)
 |-- Occupation: string (nullable = true)
 |-- swimTimeInSecond: double (nullable = true)
 |-- Speed: double (nullable = true)



In [10]:
swimmerDf1 = swimmerDf.withColumn('swimmerSpeed',20.0/swimmerDf.swimTimeInSecond)
swimmerDf1.show(4)

+---+------+-----------+----------------+-----+------------------+
| id|Gender| Occupation|swimTimeInSecond|Speed|      swimmerSpeed|
+---+------+-----------+----------------+-----+------------------+
|id1|  Male| programmer|           16.73|1.195| 1.195457262402869|
|id2|Female|    Manager|           15.56|1.285|1.2853470437017995|
|id3|  Male|    Manager|           15.15| 1.32|1.3201320132013201|
|id4|  Male|riskanalyst|           15.27| 1.31| 1.309757694826457|
+---+------+-----------+----------------+-----+------------------+
only showing top 4 rows



In [11]:
from pyspark.sql.functions import round
swimmerDf2 = swimmerDf1.withColumn('swimmerSpeed',
round(swimmerDf1.swimmerSpeed, 3))

swimmerDf2.show(5)

+---+------+-----------+----------------+-----+------------+
| id|Gender| Occupation|swimTimeInSecond|Speed|swimmerSpeed|
+---+------+-----------+----------------+-----+------------+
|id1|  Male| programmer|           16.73|1.195|       1.195|
|id2|Female|    Manager|           15.56|1.285|       1.285|
|id3|  Male|    Manager|           15.15| 1.32|        1.32|
|id4|  Male|riskanalyst|           15.27| 1.31|        1.31|
|id5|  Male| programmer|           15.65|1.278|       1.278|
+---+------+-----------+----------------+-----+------------+
only showing top 5 rows
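As a quick sanity check in plain Python (no Spark needed), the rounded swimmerSpeed values can be reproduced from the swimTimeInSecond values shown above. The names `swim_times` and `swimmer_speeds` are only for illustration; the constant 20.0 is taken from the withColumn() call.

```python
# swimTimeInSecond values copied from the first five rows shown above.
swim_times = [16.73, 15.56, 15.15, 15.27, 15.65]

# Same arithmetic as the withColumn() calls: 20.0 / time, rounded to 3 places.
swimmer_speeds = [round(20.0 / t, 3) for t in swim_times]
print(swimmer_speeds)
```

Python's built-in round() and Spark's round() treat exact ties differently, but none of these values sits on a tie, so both give the same result here.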



In [17]:
swimmerDf3 = swimmerDf2.select("swimTimeInSecond")
swimmerDf3.show(4)

+----------------+
|swimTimeInSecond|
+----------------+
|           16.73|
|           15.56|
|           15.15|
|           15.27|
+----------------+
only showing top 4 rows



In [18]:
swimmerDf4 = swimmerDf2.select("id","swimmerSpeed")
swimmerDf4.show(3)

+---+------------+
| id|swimmerSpeed|
+---+------------+
|id1|       1.195|
|id2|       1.285|
|id3|        1.32|
+---+------------+
only showing top 3 rows



In [26]:
print(swimmerDf2.columns)
print('\n')
swimmerDf2.show(2)

['id', 'Gender', 'Occupation', 'swimTimeInSecond', 'Speed', 'swimmerSpeed']


+---+------+----------+----------------+-----+------------+
| id|Gender|Occupation|swimTimeInSecond|Speed|swimmerSpeed|
+---+------+----------+----------------+-----+------------+
|id1|  Male|programmer|           16.73|1.195|       1.195|
|id2|Female|   Manager|           15.56|1.285|       1.285|
+---+------+----------+----------------+-----+------------+
only showing top 2 rows



In [20]:
swimmerDf5 = swimmerDf2.select("id",swimmerDf2.swimmerSpeed*2)
swimmerDf5.show(3)

+---+------------------+
| id|(swimmerSpeed * 2)|
+---+------------------+
|id1|              2.39|
|id2|              2.57|
|id3|              2.64|
+---+------------------+
only showing top 3 rows



In [21]:
swimmerDf3 = swimmerDf2.filter(swimmerDf2.Gender == 'Male')
swimmerDf3.show()

+----+------+-----------+----------------+-----+------------+
|  id|Gender| Occupation|swimTimeInSecond|Speed|swimmerSpeed|
+----+------+-----------+----------------+-----+------------+
| id1|  Male| programmer|           16.73|1.195|       1.195|
| id3|  Male|    Manager|           15.15| 1.32|        1.32|
| id4|  Male|riskanalyst|           15.27| 1.31|        1.31|
| id5|  Male| programmer|           15.65|1.278|       1.278|
| id6|  Male|riskanalyst|           15.74|1.271|       1.271|
| id8|  Male|    Manager|           17.11|1.169|       1.169|
|id11|  Male| programmer|           15.96|1.253|       1.253|
+----+------+-----------+----------------+-----+------------+



In [7]:
swimmerDf4 = swimmerDf.filter((swimmerDf.Gender =='Male') & (swimmerDf.Occupation == 'programmer'))

swimmerDf4.show()

+----+------+----------+----------------+-----+
|  id|Gender|Occupation|swimTimeInSecond|Speed|
+----+------+----------+----------------+-----+
| id1|  Male|programmer|           16.73|1.195|
| id5|  Male|programmer|           15.65|1.278|
|id11|  Male|programmer|           15.96|1.253|
+----+------+----------+----------------+-----+



In [12]:
swimmerDf2.filter((swimmerDf2.Occupation ==
'programmer') & (swimmerDf2.swimmerSpeed > 1.17) ).show()

+----+------+----------+----------------+-----+------------+
|  id|Gender|Occupation|swimTimeInSecond|Speed|swimmerSpeed|
+----+------+----------+----------------+-----+------------+
| id1|  Male|programmer|           16.73|1.195|       1.195|
| id5|  Male|programmer|           15.65|1.278|       1.278|
| id7|Female|programmer|            16.8| 1.19|        1.19|
| id9|Female|programmer|           16.83|1.188|       1.188|
|id11|  Male|programmer|           15.96|1.253|       1.253|
+----+------+----------+----------------+-----+------------+



In [13]:
swimmerDf3 = swimmerDf2.drop(swimmerDf2.id)
swimmerDf3.show(6)

+------+-----------+----------------+-----+------------+
|Gender| Occupation|swimTimeInSecond|Speed|swimmerSpeed|
+------+-----------+----------------+-----+------------+
|  Male| programmer|           16.73|1.195|       1.195|
|Female|    Manager|           15.56|1.285|       1.285|
|  Male|    Manager|           15.15| 1.32|        1.32|
|  Male|riskanalyst|           15.27| 1.31|        1.31|
|  Male| programmer|           15.65|1.278|       1.278|
|  Male|riskanalyst|           15.74|1.271|       1.271|
+------+-----------+----------------+-----+------------+
only showing top 6 rows



In [14]:
swimmerDf5 = swimmerDf2.drop("id", "Occupation")
swimmerDf5.show(6)

+------+----------------+-----+------------+
|Gender|swimTimeInSecond|Speed|swimmerSpeed|
+------+----------------+-----+------------+
|  Male|           16.73|1.195|       1.195|
|Female|           15.56|1.285|       1.285|
|  Male|           15.15| 1.32|        1.32|
|  Male|           15.27| 1.31|        1.31|
|  Male|           15.65|1.278|       1.278|
|  Male|           15.74|1.271|       1.271|
+------+----------------+-----+------------+
only showing top 6 rows



In [29]:
from pyspark.sql.types import DoubleType
from pyspark.sql.functions import udf

In [53]:
tempInCelsius = StructField("tempInCelsius",DoubleType(),True)
time = StructField("Time",StringType(),True)
columnList = [time,tempInCelsius ]
schema = StructType(columnList)

In [54]:
tempDf = spark.read.json('tempData.json', schema=schema)
tempDf.show(3)
tempDf.printSchema()

+----+-------------+
|Time|tempInCelsius|
+----+-------------+
| 6AM|         15.0|
| 8AM|         16.0|
|10AM|         17.0|
+----+-------------+
only showing top 3 rows

root
 |-- Time: string (nullable = true)
 |-- tempInCelsius: double (nullable = true)



In the next cell, the udf() function takes the Python function itself (celsiustoFahrenheit, not its name as a string) as its first argument and the return type of the UDF, DoubleType(), as its second argument.

In [69]:
def celsiustoFahrenheit(temp):
    return ((temp*9.0/5.0)+32)

celsiustoFahrenheitUdf = udf(celsiustoFahrenheit,DoubleType())

celsiustoFahrenheit(15)

59.0

* The required UDF has been created. We’ll now use it to transform the temperature from Celsius to Fahrenheit and add the result as a new column.
* We use the withColumn() function, passing the UDF applied to tempInCelsius as its second argument.
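Because tempInCelsius is declared nullable in the schema, a Python UDF may be handed None for a missing reading, and the arithmetic in celsiustoFahrenheit would then raise a TypeError. One possible guard is sketched below in plain Python; the name `celsius_to_fahrenheit_safe` is hypothetical and not part of the notebook.

```python
def celsius_to_fahrenheit_safe(temp):
    """Convert Celsius to Fahrenheit, passing missing values through as None."""
    if temp is None:
        # Leave missing readings missing instead of failing on None * 9.0.
        return None
    return (temp * 9.0 / 5.0) + 32


print(celsius_to_fahrenheit_safe(15))
print(celsius_to_fahrenheit_safe(None))
```

Such a function could be wrapped with udf(..., DoubleType()) in the same way as celsiustoFahrenheit.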

In [71]:
tempDfFahrenheit = tempDf.withColumn('tempInFahrenheit',celsiustoFahrenheitUdf(tempDf.tempInCelsius))
tempDfFahrenheit.show()

+----+-------------+----------------+
|Time|tempInCelsius|tempInFahrenheit|
+----+-------------+----------------+
| 6AM|         15.0|            59.0|
| 8AM|         16.0|            60.8|
|10AM|         17.0|            62.6|
|12AM|         17.0|            62.6|
| 2PM|         18.0|            64.4|
| 4PM|         17.0|            62.6|
| 6PM|         16.0|            60.8|
| 8PM|         14.0|            57.2|
+----+-------------+----------------+



In [65]:
from pyspark.sql.functions import udf
def labelTemprature(temp) :
    if temp > 15 :
        return "High"
    else :
        return "Low"

In [66]:
labelTemprature(14), labelTemprature(16)

('Low', 'High')

In [67]:
labelTempratureUdf = udf(labelTemprature)

In [68]:
tempDf2 = tempDf.withColumn("label",labelTempratureUdf(tempDf.tempInCelsius))
tempDf2.show()

+----+-------------+-----+
|Time|tempInCelsius|label|
+----+-------------+-----+
| 6AM|         15.0|  Low|
| 8AM|         16.0| High|
|10AM|         17.0| High|
|12AM|         17.0| High|
| 2PM|         18.0| High|
| 4PM|         17.0| High|
| 6PM|         16.0| High|
| 8PM|         14.0|  Low|
+----+-------------+-----+



In [72]:
corrData = spark.read.json(path='corrData.json')
corrData.show(6)

+----+----+-----+
| iv1| iv2|  iv3|
+----+----+-----+
| 5.5| 8.5|  9.5|
|6.13|9.13|10.13|
|5.92|8.92| 9.92|
|6.89|9.89|10.89|
|6.12|9.12|10.12|
+----+----+-----+



In [73]:
corrData.printSchema()

root
 |-- iv1: double (nullable = true)
 |-- iv2: double (nullable = true)
 |-- iv3: double (nullable = true)



In [80]:
# Calculating the Mean of Each Column
meanVal = corrData.agg({"iv1":"avg","iv2":"avg","iv3":"avg"})
meanVal.show()

# Calculating the Sample Variance of Each Column
varSampleVal = corrData.agg({"iv1":"var_samp","iv2":"var_samp","iv3":"var_samp"})
varSampleVal.show()

#Now we calculate the population variance:
varPopulation = corrData.agg({"iv1":"var_pop","iv2":"var_pop","iv3":"var_pop"})
varPopulation.show()

#Counting the Number of Data Points in Each Column
countVal = corrData.agg({"iv1":"count","iv2":"count","iv3":"count"})
countVal.show()

+--------+--------+--------+
|avg(iv2)|avg(iv1)|avg(iv3)|
+--------+--------+--------+
|   9.112|   6.112|  10.112|
+--------+--------+--------+

+------------------+-------------+------------------+
|     var_samp(iv2)|var_samp(iv1)|     var_samp(iv3)|
+------------------+-------------+------------------+
|0.2542699999999997|      0.25427|0.2542699999999997|
+------------------+-------------+------------------+

+-------------------+-------------------+-------------------+
|       var_pop(iv2)|       var_pop(iv1)|       var_pop(iv3)|
+-------------------+-------------------+-------------------+
|0.20341599999999976|0.20341599999999999|0.20341599999999976|
+-------------------+-------------------+-------------------+

+----------+----------+----------+
|count(iv2)|count(iv1)|count(iv3)|
+----------+----------+----------+
|         5|         5|         5|
+----------+----------+----------+
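The gap between var_samp and var_pop comes only from the divisor: n - 1 for the sample variance and n for the population variance. A minimal plain-Python check using the five iv1 values shown earlier (the variable names are illustrative):

```python
import math
import statistics

# iv1 values copied from the corrData.show() output above.
iv1 = [5.5, 6.13, 5.92, 6.89, 6.12]
n = len(iv1)

sample_var = statistics.variance(iv1)       # divides by n - 1
population_var = statistics.pvariance(iv1)  # divides by n

print(sample_var, population_var)
# The two are related by a factor of (n - 1) / n.
print(math.isclose(population_var, sample_var * (n - 1) / n))
```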



In [76]:
meanVal.describe().show()

+-------+--------+--------+--------+
|summary|avg(iv2)|avg(iv1)|avg(iv3)|
+-------+--------+--------+--------+
|  count|       1|       1|       1|
|   mean|   9.112|   6.112|  10.112|
| stddev|     NaN|     NaN|     NaN|
|    min|   9.112|   6.112|  10.112|
|    max|   9.112|   6.112|  10.112|
+-------+--------+--------+--------+



In [82]:
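# Calculating Summation, Mean, and Standard Deviation on the First Column
# A Python dict keeps only the last value for a repeated key, so this call computes only stddev_samp(iv1).
moreAggOnOneCol = corrData.agg({"iv1":"sum","iv1":"avg", "iv1":"stddev_samp"})
moreAggOnOneCol.show()

# Passing column functions instead of a dict allows several aggregations on the same column.
from pyspark.sql import functions as F
moreAggOnOneCol = corrData.agg(F.sum("iv1"), F.avg("iv1"), F.stddev_samp("iv1"))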
moreAggOnOneCol.show()

+------------------+
|  stddev_samp(iv1)|
+------------------+
|0.5042519211663947|
+------------------+

+--------+--------+------------------+
|sum(iv1)|avg(iv1)|  stddev_samp(iv1)|
+--------+--------+------------------+
|   30.56|   6.112|0.5042519211663947|
+--------+--------+------------------+
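The first table above has a single column because of how Python dict literals behave, independent of Spark. A short plain-Python illustration (`agg_spec` is an illustrative name):

```python
# A dict literal with a repeated key keeps only the last value for that key.
agg_spec = {"iv1": "sum", "iv1": "avg", "iv1": "stddev_samp"}

print(agg_spec)
print(len(agg_spec))
```

By the time agg() receives the dict, only one aggregation is left in it.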



In [83]:
# Calculating the Variance of the First Column, the Mean of the Second Column, and the Standard Deviation of the Third Column

colWiseDiffAggregation = corrData.agg({"iv1":"var_samp","iv2":"avg","iv3":"stddev_samp"})
colWiseDiffAggregation.show()

+--------+-------------+------------------+
|avg(iv2)|var_samp(iv1)|  stddev_samp(iv3)|
+--------+-------------+------------------+
|   9.112|      0.25427|0.5042519211663945|
+--------+-------------+------------------+



In [86]:
# Calculating Covariance Between Variables
print(corrData.cov('iv1','iv2'))
print(corrData.cov('iv1','iv3'))

0.2542699999999998
0.2542699999999998


In [88]:
#Calculating Correlation Between Variables
print(corrData.corr('iv1','iv2'))
print(corrData.corr('iv2','iv3'))

0.9999999999999996
1.0
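In the rows shown earlier, iv2 equals iv1 + 3 and iv3 equals iv1 + 4. A constant shift changes neither the spread nor the direction of the data, which would explain why the covariances equal the sample variance of iv1 and why the correlations are 1 up to floating-point error. A plain-Python sketch of the same quantities, with hypothetical helper names `sample_cov` and `pearson_corr`:

```python
import math

# Values copied from the corrData.show() output above.
iv1 = [5.5, 6.13, 5.92, 6.89, 6.12]
iv2 = [8.5, 9.13, 8.92, 9.89, 9.12]


def sample_cov(xs, ys):
    """Sample covariance (divides by n - 1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def pearson_corr(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    return sample_cov(xs, ys) / math.sqrt(sample_cov(xs, xs) * sample_cov(ys, ys))


print(sample_cov(iv1, iv2))
print(pearson_corr(iv1, iv2))
```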


In [100]:
#Applying the describe( ) Function on Each Column
dataDescription = corrData.describe()
dataDescription.show()

+-------+------------------+------------------+------------------+
|summary|               iv1|               iv2|               iv3|
+-------+------------------+------------------+------------------+
|  count|                 5|                 5|                 5|
|   mean|             6.112|             9.112|            10.112|
| stddev|0.5042519211663947|0.5042519211663945|0.5042519211663945|
|    min|               5.5|               8.5|               9.5|
|    max|              6.89|              9.89|             10.89|
+-------+------------------+------------------+------------------+



In [99]:
#Applying the describe( ) Function on Columns
dataDescriptioniv1iv2 = corrData.describe(['iv1', 'iv2'])
dataDescriptioniv1iv2.show()

+-------+------------------+------------------+
|summary|               iv1|               iv2|
+-------+------------------+------------------+
|  count|                 5|                 5|
|   mean|             6.112|             9.112|
| stddev|0.5042519211663947|0.5042519211663945|
|    min|               5.5|               8.5|
|    max|              6.89|              9.89|
+-------+------------------+------------------+



In [97]:
#Applying the summary( ) Function on Each Column
summaryData = corrData.summary()
summaryData.show()

+-------+------------------+------------------+------------------+
|summary|               iv1|               iv2|               iv3|
+-------+------------------+------------------+------------------+
|  count|                 5|                 5|                 5|
|   mean|             6.112|             9.112|            10.112|
| stddev|0.5042519211663947|0.5042519211663945|0.5042519211663945|
|    min|               5.5|               8.5|               9.5|
|    25%|              5.92|              8.92|              9.92|
|    50%|              6.12|              9.12|             10.12|
|    75%|              6.13|              9.13|             10.13|
|    max|              6.89|              9.89|             10.89|
+-------+------------------+------------------+------------------+



In [98]:
#Using the summary() function, we can apply selective summary statistics
summaryMeanMax = corrData.summary(['mean','max'])
summaryMeanMax.show()

+-------+-----+-----+------+
|summary|  iv1|  iv2|   iv3|
+-------+-----+-----+------+
|   mean|6.112|9.112|10.112|
|    max| 6.89| 9.89| 10.89|
+-------+-----+-----+------+



In [101]:
#Applying the summary( ) Function on Columns
summaryiv1iv2 = corrData.select('iv1','iv2').summary('min','max')
summaryiv1iv2.show()

+-------+----+----+
|summary| iv1| iv2|
+-------+----+----+
|    min| 5.5| 8.5|
|    max|6.89|9.89|
+-------+----+----+



In [111]:
#Adding a Column of Categorical Variables
from pyspark.sql.functions import udf
def labelIt(x):
    if x > 10.0:
        return 'High'
    else:
        return 'Low'

In [112]:
labelIt = udf(labelIt)

In [113]:
corrData.show(2)
corrData.printSchema()

+----+----+-----+
| iv1| iv2|  iv3|
+----+----+-----+
| 5.5| 8.5|  9.5|
|6.13|9.13|10.13|
+----+----+-----+
only showing top 2 rows

root
 |-- iv1: double (nullable = true)
 |-- iv2: double (nullable = true)
 |-- iv3: double (nullable = true)



In [114]:
corrData1 = corrData.withColumn('iv4', labelIt('iv3'))
corrData1.show(5)

+----+----+-----+----+
| iv1| iv2|  iv3| iv4|
+----+----+-----+----+
| 5.5| 8.5|  9.5| Low|
|6.13|9.13|10.13|High|
|5.92|8.92| 9.92| Low|
|6.89|9.89|10.89|High|
|6.12|9.12|10.12|High|
+----+----+-----+----+



In [115]:
meanMaxSummary = corrData1.summary('mean','max')
meanMaxSummary.show()

+-------+-----+-----+------+----+
|summary|  iv1|  iv2|   iv3| iv4|
+-------+-----+-----+------+----+
|   mean|6.112|9.112|10.112|null|
|    max| 6.89| 9.89| 10.89| Low|
+-------+-----+-----+------+----+



In [116]:
countMinMaxSummary = corrData1.summary('count',
'min','max')
countMinMaxSummary.show()

+-------+----+----+-----+----+
|summary| iv1| iv2|  iv3| iv4|
+-------+----+----+-----+----+
|  count|   5|   5|    5|   5|
|    min| 5.5| 8.5|  9.5|High|
|    max|6.89|9.89|10.89| Low|
+-------+----+----+-----+----+
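For the string column iv4, min and max appear to follow string ordering rather than any notion of "low" and "high", which is why min is High and max is Low. The same ordering can be seen in plain Python with the five labels from corrData1 (the list name `labels` is illustrative):

```python
# iv4 labels copied from the corrData1.show(5) output above.
labels = ["Low", "High", "Low", "High", "High"]

# Strings compare character by character, and "H" sorts before "L".
print(min(labels), max(labels))
```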



**Sort Data in a DataFrame Problem**

In [117]:
swimmerDf.show(4)

+---+------+-----------+----------------+-----+
| id|Gender| Occupation|swimTimeInSecond|Speed|
+---+------+-----------+----------------+-----+
|id1|  Male| programmer|           16.73|1.195|
|id2|Female|    Manager|           15.56|1.285|
|id3|  Male|    Manager|           15.15| 1.32|
|id4|  Male|riskanalyst|           15.27| 1.31|
+---+------+-----------+----------------+-----+
only showing top 4 rows



In [119]:
swimmerDfSorted1 = swimmerDf.orderBy("swimTimeInSecond",ascending=True)
swimmerDfSorted1.show(6)

+----+------+-----------+----------------+-----+
|  id|Gender| Occupation|swimTimeInSecond|Speed|
+----+------+-----------+----------------+-----+
| id3|  Male|    Manager|           15.15| 1.32|
| id4|  Male|riskanalyst|           15.27| 1.31|
| id2|Female|    Manager|           15.56|1.285|
| id5|  Male| programmer|           15.65|1.278|
| id6|  Male|riskanalyst|           15.74|1.271|
|id12|Female|riskanalyst|            15.9|1.258|
+----+------+-----------+----------------+-----+
only showing top 6 rows



In [120]:
swimmerDfSorted2 =swimmerDf.orderBy("swimTimeInSecond",ascending=False)
swimmerDfSorted2.show(6)

+----+------+-----------+----------------+-----+
|  id|Gender| Occupation|swimTimeInSecond|Speed|
+----+------+-----------+----------------+-----+
| id8|  Male|    Manager|           17.11|1.169|
| id9|Female| programmer|           16.83|1.188|
| id7|Female| programmer|            16.8| 1.19|
| id1|  Male| programmer|           16.73|1.195|
|id10|Female|riskanalyst|           16.34|1.224|
|id11|  Male| programmer|           15.96|1.253|
+----+------+-----------+----------------+-----+
only showing top 6 rows



In [126]:
# Sorting on Two Columns in Different Order
swimmerDfSorted3 = swimmerDf.orderBy("Occupation","swimTimeInSecond", ascending=[False,True])
swimmerDfSorted3.show(6)

+----+------+-----------+----------------+-----+
|  id|Gender| Occupation|swimTimeInSecond|Speed|
+----+------+-----------+----------------+-----+
| id4|  Male|riskanalyst|           15.27| 1.31|
| id6|  Male|riskanalyst|           15.74|1.271|
|id12|Female|riskanalyst|            15.9|1.258|
|id10|Female|riskanalyst|           16.34|1.224|
| id5|  Male| programmer|           15.65|1.278|
|id11|  Male| programmer|           15.96|1.253|
+----+------+-----------+----------------+-----+
only showing top 6 rows
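The mixed-direction ordering above (Occupation descending, swimTimeInSecond ascending) can be imitated in plain Python with two stable sorts. This is only a sketch of the ordering logic, not of how Spark sorts; the rows are collected from the outputs shown in this notebook, and the variable names are illustrative.

```python
# (id, Occupation, swimTimeInSecond) collected from the outputs above.
swimmers = [
    ("id1", "programmer", 16.73), ("id2", "Manager", 15.56),
    ("id3", "Manager", 15.15), ("id4", "riskanalyst", 15.27),
    ("id5", "programmer", 15.65), ("id6", "riskanalyst", 15.74),
    ("id7", "programmer", 16.8), ("id8", "Manager", 17.11),
    ("id9", "programmer", 16.83), ("id10", "riskanalyst", 16.34),
    ("id11", "programmer", 15.96), ("id12", "riskanalyst", 15.9),
]

# Python's sort is stable: sort by the secondary key first (time, ascending),
# then by the primary key (occupation, descending); ties keep the time order.
by_time = sorted(swimmers, key=lambda row: row[2])
by_occupation_desc = sorted(by_time, key=lambda row: row[1], reverse=True)

top_ids = [row[0] for row in by_occupation_desc[:6]]
print(top_ids)
```

Note that "Manager" sorts after the lowercase occupations in descending order because uppercase letters compare lower than lowercase ones.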



**Sort Data Partition-Wise Problem**

A DataFrame is split into partitions that can be spread over many nodes. Here we want to sort the rows within each partition, without producing a global ordering across partitions.

In [124]:
sortedPartitons = swimmerDf.sortWithinPartitions("Occupation","swimTimeInSecond",
                                                 ascending=[False,True])
# Note: this displays swimmerDf1 from an earlier cell, not the sortedPartitons result.
swimmerDf1.show(6)

+---+------+-----------+----------------+-----+------------------+
| id|Gender| Occupation|swimTimeInSecond|Speed|      swimmerSpeed|
+---+------+-----------+----------------+-----+------------------+
|id1|  Male| programmer|           16.73|1.195| 1.195457262402869|
|id2|Female|    Manager|           15.56|1.285|1.2853470437017995|
|id3|  Male|    Manager|           15.15| 1.32|1.3201320132013201|
|id4|  Male|riskanalyst|           15.27| 1.31| 1.309757694826457|
|id5|  Male| programmer|           15.65|1.278|1.2779552715654952|
|id6|  Male|riskanalyst|           15.74|1.271|1.2706480304955527|
+---+------+-----------+----------------+-----+------------------+
only showing top 6 rows



In [125]:
swimmerDf.rdd.getNumPartitions()

1

In [127]:
swimmerDf1 = swimmerDf.repartition(2)
swimmerDf1.rdd.glom().collect()

[[Row(id='id12', Gender='Female', Occupation='riskanalyst', swimTimeInSecond=15.9, Speed=1.258),
  Row(id='id3', Gender='Male', Occupation='Manager', swimTimeInSecond=15.15, Speed=1.32),
  Row(id='id6', Gender='Male', Occupation='riskanalyst', swimTimeInSecond=15.74, Speed=1.271),
  Row(id='id1', Gender='Male', Occupation='programmer', swimTimeInSecond=16.73, Speed=1.195),
  Row(id='id10', Gender='Female', Occupation='riskanalyst', swimTimeInSecond=16.34, Speed=1.224),
  Row(id='id2', Gender='Female', Occupation='Manager', swimTimeInSecond=15.56, Speed=1.285)],
 [Row(id='id5', Gender='Male', Occupation='programmer', swimTimeInSecond=15.65, Speed=1.278),
  Row(id='id4', Gender='Male', Occupation='riskanalyst', swimTimeInSecond=15.27, Speed=1.31),
  Row(id='id8', Gender='Male', Occupation='Manager', swimTimeInSecond=17.11, Speed=1.169),
  Row(id='id11', Gender='Male', Occupation='programmer', swimTimeInSecond=15.96, Speed=1.253),
  Row(id='id7', Gender='Female', Occupation='programmer', 

In [129]:
sortedPartitons = swimmerDf1.sortWithinPartitions("Occupation","swimTimeInSecond", ascending=[False,True])
sortedPartitons.show(5)

+----+------+-----------+----------------+-----+
|  id|Gender| Occupation|swimTimeInSecond|Speed|
+----+------+-----------+----------------+-----+
| id6|  Male|riskanalyst|           15.74|1.271|
|id12|Female|riskanalyst|            15.9|1.258|
|id10|Female|riskanalyst|           16.34|1.224|
| id1|  Male| programmer|           16.73|1.195|
| id3|  Male|    Manager|           15.15| 1.32|
+----+------+-----------+----------------+-----+
only showing top 5 rows
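Partition-wise sorting can be pictured in plain Python as sorting each partition's list on its own, with no merging across partitions. The sketch below uses the rows from the glom() output above; the second partition is truncated in that output, so only its fully shown rows are included. The helper name `sort_partition` is hypothetical.

```python
# (id, Occupation, swimTimeInSecond) per partition, from the glom() output above.
partitions = [
    [("id12", "riskanalyst", 15.9), ("id3", "Manager", 15.15),
     ("id6", "riskanalyst", 15.74), ("id1", "programmer", 16.73),
     ("id10", "riskanalyst", 16.34), ("id2", "Manager", 15.56)],
    # Only the rows shown in full; the source output is cut off after these.
    [("id5", "programmer", 15.65), ("id4", "riskanalyst", 15.27),
     ("id8", "Manager", 17.11), ("id11", "programmer", 15.96)],
]


def sort_partition(rows):
    """Occupation descending, then time ascending, using two stable sorts."""
    by_time = sorted(rows, key=lambda row: row[2])
    return sorted(by_time, key=lambda row: row[1], reverse=True)


# Each partition is sorted independently; rows never move between partitions.
sorted_partitions = [sort_partition(rows) for rows in partitions]
for rows in sorted_partitions:
    print([row[0] for row in rows])
```

The first sorted partition starts with id6, id12, id10, id1, id3, which matches the five rows displayed by sortedPartitons.show(5).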



**Remove Duplicate Records from a DataFrame**

In [144]:
duplicateDataDf = spark.read.csv(path='duplicateData.csv', inferSchema=True, header=True)
duplicateDataDf.show(5)

+-----------+--------------+----------+-------+--------+--------------------+
|Employee ID| Employee Name|Day Worked|Time In|Time Out|Report Download Date|
+-----------+--------------+----------+-------+--------+--------------------+
|       1816|Alyssa Reddall|10/21/2016| 7:21am|  5:00pm|          10/24/2016|
|       1816|Alyssa Reddall|10/21/2016| 7:21am|  5:00pm|          10/25/2016|
|       1816|Alyssa Reddall|10/22/2016| 8:03am|  5:12pm|          10/24/2016|
|       1816|Alyssa Reddall|10/22/2016| 8:03am|  5:12pm|          10/25/2016|
|       1719|  Carl Borunda|10/21/2016| 7:55am|  5:54pm|          10/24/2016|
+-----------+--------------+----------+-------+--------+--------------------+
only showing top 5 rows



In [141]:
duplicateDataDf.count()

56

In [143]:
noDuplicateDf1 = duplicateDataDf.drop_duplicates()
noDuplicateDf1.count()

54

In [146]:
# Removing the Duplicate Records Conditioned on a Column
noDuplicateDf2 = duplicateDataDf.drop_duplicates(['Employee Name'])
print(noDuplicateDf2.count())
noDuplicateDf2.show()

14
+-----------+-------------------+----------+-------+--------+--------------------+
|Employee ID|      Employee Name|Day Worked|Time In|Time Out|Report Download Date|
+-----------+-------------------+----------+-------+--------+--------------------+
|       1816|     Alyssa Reddall|10/21/2016| 7:21am|  5:00pm|          10/24/2016|
|       1719|       Carl Borunda|10/21/2016| 7:55am|  5:54pm|          10/24/2016|
|       1651|     Carrie Richard|10/21/2016| 8:04am|  5:03pm|          10/24/2016|
|       1740|         Eric Trent|10/21/2016| 8:10am|  5:03pm|          10/24/2016|
|       1731|     Francis Turner|10/21/2016| 8:03am|  5:02pm|          10/24/2016|
|       1618|    Jeanne Brunelle|10/21/2016| 8:05am|  5:54pm|          10/24/2016|
|       1428|     Leslie Carlton|10/21/2016| 8:12am|  4:53pm|          10/24/2016|
|       1041|     Linda Costigan|10/21/2016| 7:43am|  5:15pm|          10/24/2016|
|       1088|       Marino Varga|10/21/2016| 7:55am|  5:00pm|          10/24/2016|
|

In [148]:
# Removing the Duplicate Records Conditioned on two Columns 
noDuplicateDf3 = duplicateDataDf.drop_duplicates(['Employee ID','Employee Name'])
print(noDuplicateDf3.count())
noDuplicateDf3.show()

14
+-----------+-------------------+----------+-------+--------+--------------------+
|Employee ID|      Employee Name|Day Worked|Time In|Time Out|Report Download Date|
+-----------+-------------------+----------+-------+--------+--------------------+
|       1041|     Linda Costigan|10/21/2016| 7:43am|  5:15pm|          10/24/2016|
|       1088|       Marino Varga|10/21/2016| 7:55am|  5:00pm|          10/24/2016|
|       1091|Stephanie Alexander|10/21/2016| 7:12am|  5:12pm|          10/24/2016|
|       1209|Stephen Kirkpatrick|10/21/2016| 8:25am|  5:30pm|          10/24/2016|
|       1220|     Randell Harvey|10/21/2016| 8:03am|  5:15pm|          10/24/2016|
|       1428|     Leslie Carlton|10/21/2016| 8:12am|  4:53pm|          10/24/2016|
|       1618|    Jeanne Brunelle|10/21/2016| 8:05am|  5:54pm|          10/24/2016|
|       1651|     Carrie Richard|10/21/2016| 8:04am|  5:03pm|          10/24/2016|
|       1673|      Ryan Fontenot|10/21/2016| 8:12am|  5:12pm|          10/24/2016|
|
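The difference between whole-row and key-based de-duplication can be sketched in plain Python with the five rows displayed by duplicateDataDf.show(5). The helper `dedupe` is hypothetical and keeps the first row it sees for each key; Spark does not promise which row of a duplicate group survives, so this is an assumption made for the sketch.

```python
# Rows copied from the duplicateDataDf.show(5) output above.
rows = [
    (1816, "Alyssa Reddall", "10/21/2016", "7:21am", "5:00pm", "10/24/2016"),
    (1816, "Alyssa Reddall", "10/21/2016", "7:21am", "5:00pm", "10/25/2016"),
    (1816, "Alyssa Reddall", "10/22/2016", "8:03am", "5:12pm", "10/24/2016"),
    (1816, "Alyssa Reddall", "10/22/2016", "8:03am", "5:12pm", "10/25/2016"),
    (1719, "Carl Borunda", "10/21/2016", "7:55am", "5:54pm", "10/24/2016"),
]


def dedupe(rows, key_indexes=None):
    """Keep the first row per key; the key is the whole row when key_indexes is None."""
    seen = set()
    kept = []
    for row in rows:
        key = row if key_indexes is None else tuple(row[i] for i in key_indexes)
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


print(len(dedupe(rows)))          # whole-row comparison
print(len(dedupe(rows, [1])))     # Employee Name only
print(len(dedupe(rows, [0, 1])))  # Employee ID and Employee Name
```

These five rows all differ in at least one field (for example Report Download Date), so a whole-row comparison removes none of them, while keying on the name leaves one row per employee.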

**Sampling Data from the noDuplicateDf1 DataFrame Without Replacement**

In [149]:
noDuplicateDf1.count()

54

In [150]:
#We are going to fetch 50% of the records as a sample without replacement.
sampleWithoutKeyConsideration = noDuplicateDf1.sample(withReplacement=False, fraction=0.5, seed=200)
sampleWithoutKeyConsideration.count()

25

In [151]:
# Sampling Data from the noDuplicateDf1 DataFrame with Replacement
sampleWithoutKeyConsideration1 = noDuplicateDf1.sample(withReplacement=True, fraction=0.5, seed=200)
sampleWithoutKeyConsideration1.count()

20
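Neither count above is exactly 27 (half of 54), because the fraction is a per-row probability rather than an exact quota. A rough mental model for the without-replacement case, assuming each row is kept independently with probability `fraction`, is sketched below in plain Python. It does not use Spark's random generator, so its count will generally differ from the 25 shown above; `bernoulli_sample` is a hypothetical helper.

```python
import random


def bernoulli_sample(items, fraction, seed):
    """Keep each item independently with probability `fraction`."""
    rng = random.Random(seed)
    return [item for item in items if rng.random() < fraction]


# Stand-in for the 54 de-duplicated records.
records = list(range(54))
sample = bernoulli_sample(records, fraction=0.5, seed=200)

# The size is close to, but usually not exactly, 0.5 * 54.
print(len(sample))
```

Reusing the same seed reproduces the same sample, which is what the seed argument is for.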

**Find Frequent Items**

In [156]:
duplicateDataDf.freqItems(cols=['Employee Name']).show()

+-----------------------+
|Employee Name_freqItems|
+-----------------------+
|   [Marino Varga, Ca...|
+-----------------------+



In [157]:
duplicateDataDf.freqItems(cols=['Employee ID','Employee Name']).show()

+---------------------+-----------------------+
|Employee ID_freqItems|Employee Name_freqItems|
+---------------------+-----------------------+
| [1088, 1091, 1719...|   [Marino Varga, Ca...|
+---------------------+-----------------------+



**Aggregate Data on a Single Key Problem**

In [167]:
ucbDataFrame = spark.read.csv(path='UCB_Admission.csv', inferSchema=True, header=True)
ucbDataFrame.show(5)

+--------+------+----------+----+
|   Admit|Gender|department|Freq|
+--------+------+----------+----+
|Admitted|  Male|         A| 512|
|Rejected|  Male|         A| 313|
|Admitted|Female|         A|  89|
|Rejected|Female|         A|  19|
|Admitted|  Male|         B| 353|
+--------+------+----------+----+
only showing top 5 rows



In [168]:
#Calculating the Required Means
groupedOnAdmit = ucbDataFrame.groupby(["admit"]).mean()
groupedOnAdmit.show()

+--------+------------------+
|   admit|         avg(Freq)|
+--------+------------------+
|Admitted|            146.25|
|Rejected|230.91666666666666|
+--------+------------------+



In [169]:
# Grouping by Gender
groupedOnGender = ucbDataFrame.groupby(["gender"]).mean()
groupedOnGender.show()

+------+------------------+
|gender|         avg(Freq)|
+------+------------------+
|  Male|            224.25|
|Female|152.91666666666666|
+------+------------------+



In [170]:
# Finding the Average Frequency of Application by Department
groupedOnDepartment = ucbDataFrame.groupby(["department"]).mean()
groupedOnDepartment.show()

+----------+---------+
|department|avg(Freq)|
+----------+---------+
|         A|   233.25|
|         B|   146.25|
|         C|    229.5|
|         D|    198.0|
|         E|    146.0|
|         F|    178.5|
+----------+---------+



In [173]:
#Aggregate Data on Multiple Keys Problem
groupedOnAdmitGender =ucbDataFrame.groupby(["admit" ,"gender"]).mean()
groupedOnAdmitGender.show()

+--------+------+------------------+
|   admit|gender|         avg(Freq)|
+--------+------+------------------+
|Admitted|  Male|199.66666666666666|
|Rejected|  Male|248.83333333333334|
|Admitted|Female| 92.83333333333333|
|Rejected|Female|             213.0|
+--------+------+------------------+



In [174]:
# Grouping the Data on Admit and Department and Finding the Mean of Applications
groupedOnAdmitDepartment = ucbDataFrame.groupby(["admit", "department"]).mean()
groupedOnAdmitDepartment.show()

+--------+----------+---------+
|   admit|department|avg(Freq)|
+--------+----------+---------+
|Admitted|         A|    300.5|
|Rejected|         A|    166.0|
|Admitted|         B|    185.0|
|Rejected|         B|    107.5|
|Admitted|         C|    161.0|
|Rejected|         C|    298.0|
|Admitted|         D|    134.5|
|Rejected|         D|    261.5|
|Admitted|         E|     73.5|
|Rejected|         E|    218.5|
|Admitted|         F|     23.0|
|Rejected|         F|    334.0|
+--------+----------+---------+
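Grouping on several keys and averaging can be sketched in plain Python with a dictionary keyed by tuples. The example uses only the four department A rows visible in ucbDataFrame.show(5); the helper `group_mean` is hypothetical.

```python
from collections import defaultdict

# (Admit, Gender, department, Freq) for department A, from the output above.
rows = [
    ("Admitted", "Male", "A", 512),
    ("Rejected", "Male", "A", 313),
    ("Admitted", "Female", "A", 89),
    ("Rejected", "Female", "A", 19),
]


def group_mean(rows, key_indexes, value_index):
    """Average the value column for each distinct combination of key columns."""
    totals = defaultdict(lambda: [0.0, 0])  # key -> [running sum, row count]
    for row in rows:
        key = tuple(row[i] for i in key_indexes)
        totals[key][0] += row[value_index]
        totals[key][1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


# Group on (Admit, department) and average Freq.
means = group_mean(rows, key_indexes=(0, 2), value_index=3)
print(means)
```

The two results agree with the department A rows of the table above (300.5 and 166.0).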



**Create a Contingency Table Problem**

In [197]:
surveyDf = spark.read.json(path='restaurants.json')
surveyDf = surveyDf.drop("address")
surveyDf.show(4)

+---------+----------+--------------------+--------------------+-------------+
|  borough|   cuisine|              grades|                name|restaurant_id|
+---------+----------+--------------------+--------------------+-------------+
|    Bronx|    Bakery|[[[1393804800000]...|Morris Park Bake ...|     30075445|
| Brooklyn|Hamburgers|[[[1419897600000]...|             Wendy'S|     30112340|
|Manhattan|     Irish|[[[1409961600000]...|Dj Reynolds Pub A...|     30191841|
| Brooklyn| American |[[[1402358400000]...|     Riviera Caterer|     40356018|
+---------+----------+--------------------+--------------------+-------------+
only showing top 4 rows



In [198]:
surveyDf = spark.read.json(path='DataFrames_sample.json')
surveyDf = surveyDf.drop("Id")
surveyDf.show()

+----+----+---------+-----------+----+-----------+-----+-------+-----+
|   D|   H|      HDD|      Model| RAM| ScreenSize|    W| Weight| Year|
+----+----+---------+-----------+----+-----------+-----+-------+-----+
|9.48|0.61|512GB SSD|MacBook Pro|16GB|        15"|13.75|   4.02| 2015|
|7.74|0.52|256GB SSD|    MacBook| 8GB|        12"|11.04|   2.03| 2016|
|8.94|0.68|128GB SSD|MacBook Air| 8GB|      13.3"| 12.8|   2.96| 2016|
| 8.0|20.3|  1TB SSD|       iMac|64GB|        27"| 25.6|   20.8| 2017|
| 8.5|20.9|  1TB SSD|       iMac|64GB|        27"| 25.2|   20.8| 2017|
|9.48|0.61|512GB SSD|MacBook Pro|16GB|        15"|13.75|   4.02| 2015|
+----+----+---------+-----------+----+-----------+-----+-------+-----+



In [199]:
surveyDf.columns

[' D', ' H', ' HDD', ' Model', ' RAM', ' ScreenSize', ' W', ' Weight', ' Year']

In [200]:
surveyDf.crosstab(" Model"," ScreenSize").show()

+------------------+---+-----+---+---+
| Model_ ScreenSize|12"|13.3"|15"|27"|
+------------------+---+-----+---+---+
|       MacBook Air|  0|    1|  0|  0|
|           MacBook|  1|    0|  0|  0|
|       MacBook Pro|  0|    0|  2|  0|
|              iMac|  0|    0|  0|  2|
+------------------+---+-----+---+---+
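
As a rough illustration of what `crosstab` computes, the sketch below counts (Model, ScreenSize) pairs in plain Python, without Spark. The `rows` list is copied by hand from the six rows of `surveyDf` shown earlier; the names `rows`, `pair_counts`, and `contingency` are illustrative assumptions and are not part of the PySpark API.

```python
from collections import Counter

# (Model, ScreenSize) pairs copied by hand from the six rows of surveyDf.
rows = [
    ("MacBook Pro", '15"'),
    ("MacBook", '12"'),
    ("MacBook Air", '13.3"'),
    ("iMac", '27"'),
    ("iMac", '27"'),
    ("MacBook Pro", '15"'),
]

# A contingency table is the count of each distinct pair of values.
pair_counts = Counter(rows)

models = sorted({model for model, _ in rows})
sizes = sorted({size for _, size in rows})

# One entry per model, one column per screen size; absent pairs count as 0.
contingency = {
    model: {size: pair_counts.get((model, size), 0) for size in sizes}
    for model in models
}

for model in models:
    print(model, contingency[model])
```

Each printed dictionary corresponds to one row of the `crosstab` output, with zeros for the combinations that never occur in the data.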



**Perform Joining Operations on Two DataFrames**

In [2]:
studentsDf = spark.read.csv('studentData.csv',header=True, inferSchema=True)
subjectsDf = spark.read.csv('subjects.csv',header=True, inferSchema=True)
subjectsDf = subjectsDf.drop("id")

In [208]:
studentsDf.show(),subjectsDf.show()

+---------+-------+------+
|studentid|   name|gender|
+---------+-------+------+
|      si1|  Robin|     M|
|      si2|  Maria|     F|
|      si3|  Julie|     F|
|      si4|    Bob|     M|
|      si6|William|     M|
+---------+-------+------+

+-----+---------+--------+
|marks|studentid|subjects|
+-----+---------+--------+
|   78|      si4|     C++|
|   83|      si2|    Java|
|   72|      si3|    Ruby|
|   85|      si2|  Python|
|   77|      si5|     C++|
|   75|      si1|  Python|
|   84|      si4|  Python|
|   76|      si3|    Java|
|   81|      si1|    Java|
+-----+---------+--------+



(None, None)

In [209]:
#Performing an Inner Join on DataFrames
innerDf = studentsDf.join(subjectsDf, studentsDf.studentid == subjectsDf.studentid, how= "inner")
innerDf.show()

+---------+-----+------+-----+---------+--------+
|studentid| name|gender|marks|studentid|subjects|
+---------+-----+------+-----+---------+--------+
|      si4|  Bob|     M|   78|      si4|     C++|
|      si2|Maria|     F|   83|      si2|    Java|
|      si3|Julie|     F|   72|      si3|    Ruby|
|      si2|Maria|     F|   85|      si2|  Python|
|      si1|Robin|     M|   75|      si1|  Python|
|      si4|  Bob|     M|   84|      si4|  Python|
|      si3|Julie|     F|   76|      si3|    Java|
|      si1|Robin|     M|   81|      si1|    Java|
+---------+-----+------+-----+---------+--------+



In [210]:
# Performing a Left Outer Join on DataFrames
leftOuterDf = studentsDf.join(subjectsDf, studentsDf.studentid == subjectsDf.studentid, how= "left")
leftOuterDf.show()

+---------+-------+------+-----+---------+--------+
|studentid|   name|gender|marks|studentid|subjects|
+---------+-------+------+-----+---------+--------+
|      si1|  Robin|     M|   81|      si1|    Java|
|      si1|  Robin|     M|   75|      si1|  Python|
|      si2|  Maria|     F|   85|      si2|  Python|
|      si2|  Maria|     F|   83|      si2|    Java|
|      si3|  Julie|     F|   76|      si3|    Java|
|      si3|  Julie|     F|   72|      si3|    Ruby|
|      si4|    Bob|     M|   84|      si4|  Python|
|      si4|    Bob|     M|   78|      si4|     C++|
|      si6|William|     M| null|     null|    null|
+---------+-------+------+-----+---------+--------+



In [211]:
#Performing a Right Outer Join on DataFrames
rightOuterDf = studentsDf.join(subjectsDf, studentsDf.studentid == subjectsDf.studentid, how= "right")
rightOuterDf.show()

+---------+-----+------+-----+---------+--------+
|studentid| name|gender|marks|studentid|subjects|
+---------+-----+------+-----+---------+--------+
|      si4|  Bob|     M|   78|      si4|     C++|
|      si2|Maria|     F|   83|      si2|    Java|
|      si3|Julie|     F|   72|      si3|    Ruby|
|      si2|Maria|     F|   85|      si2|  Python|
|     null| null|  null|   77|      si5|     C++|
|      si1|Robin|     M|   75|      si1|  Python|
|      si4|  Bob|     M|   84|      si4|  Python|
|      si3|Julie|     F|   76|      si3|    Java|
|      si1|Robin|     M|   81|      si1|    Java|
+---------+-----+------+-----+---------+--------+



In [212]:
#  Performing a Full Outer Join on DataFrames
outerDf = studentsDf.join(subjectsDf, studentsDf.studentid == subjectsDf.studentid, how= "outer")
outerDf.show()

+---------+-------+------+-----+---------+--------+
|studentid|   name|gender|marks|studentid|subjects|
+---------+-------+------+-----+---------+--------+
|      si1|  Robin|     M|   75|      si1|  Python|
|      si1|  Robin|     M|   81|      si1|    Java|
|      si2|  Maria|     F|   83|      si2|    Java|
|      si2|  Maria|     F|   85|      si2|  Python|
|      si3|  Julie|     F|   72|      si3|    Ruby|
|      si3|  Julie|     F|   76|      si3|    Java|
|      si4|    Bob|     M|   78|      si4|     C++|
|      si4|    Bob|     M|   84|      si4|  Python|
|     null|   null|  null|   77|      si5|     C++|
|      si6|William|     M| null|     null|    null|
+---------+-------+------+-----+---------+--------+
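
The four join types differ only in which unmatched rows they keep. The following plain-Python sketch (no Spark involved) counts the rows each join type would return for the `studentid` values shown earlier. The helper `join_row_count` is a hypothetical name introduced for illustration; it is not a PySpark function.

```python
# studentid values copied by hand from studentsDf and subjectsDf.
student_ids = ["si1", "si2", "si3", "si4", "si6"]
subject_ids = ["si4", "si2", "si3", "si2", "si5", "si1", "si4", "si3", "si1"]


def join_row_count(left_keys, right_keys, how):
    """Count the rows an equi-join on a single key would return."""
    # Every matching (left, right) pair produces one output row.
    matched = sum(1 for l in left_keys for r in right_keys if l == r)
    # Unmatched rows are kept (padded with nulls) only by outer-style joins.
    left_only = sum(1 for l in left_keys if l not in right_keys)
    right_only = sum(1 for r in right_keys if r not in left_keys)
    if how == "inner":
        return matched
    if how == "left":
        return matched + left_only
    if how == "right":
        return matched + right_only
    if how == "outer":
        return matched + left_only + right_only
    raise ValueError(f"unsupported join type: {how}")


row_counts = {
    how: join_row_count(student_ids, subject_ids, how)
    for how in ("inner", "left", "right", "outer")
}
print(row_counts)
```

The counts agree with the outputs above: `si6` has no subject row, so it appears only in the left and full outer joins, and `si5` has no student row, so it appears only in the right and full outer joins.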



**Vertically Stack Two DataFrames**

In [213]:
verticalDfOne = spark.read.csv('iv1.csv',header=True, inferSchema=True)
verticalDfTwo = spark.read.csv('iv2.csv',header=True, inferSchema=True)

In [214]:
verticalDfOne.show(),verticalDfTwo.show()

+---+---+---+
|iv1|iv2|iv3|
+---+---+---+
|  1|  2|  3|
|  4|  5|  6|
|  7|  8|  9|
+---+---+---+

+---+---+---+
|iv1|iv2|iv3|
+---+---+---+
| 10| 11| 12|
| 13| 14| 15|
| 16| 17| 18|
+---+---+---+



(None, None)

In [215]:
vstackedDf = verticalDfOne.union(verticalDfTwo)
vstackedDf.show()

+---+---+---+
|iv1|iv2|iv3|
+---+---+---+
|  1|  2|  3|
|  4|  5|  6|
|  7|  8|  9|
| 10| 11| 12|
| 13| 14| 15|
| 16| 17| 18|
+---+---+---+
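
Note that `union` matches columns by position, not by name, while `unionByName` matches them by name. The difference only shows when the two DataFrames list their columns in a different order. The plain-Python sketch below illustrates the idea with two hypothetical one-row frames; the data and variable names are assumptions made up for the example.

```python
# Hypothetical frames: same column names, different column order.
columns_one = ["iv1", "iv2", "iv3"]
rows_one = [(1, 2, 3)]
columns_two = ["iv3", "iv2", "iv1"]
rows_two = [(12, 11, 10)]

# Positional stacking: rows are appended as they are, so 12 lands under iv1.
stacked_by_position = rows_one + rows_two

# Name-based stacking: reorder each row of the second frame to follow
# the column order of the first frame before appending it.
reordered_two = [
    tuple(dict(zip(columns_two, row))[name] for name in columns_one)
    for row in rows_two
]
stacked_by_name = rows_one + reordered_two

print(stacked_by_position)
print(stacked_by_name)
```

With `iv1.csv`-style inputs whose headers are identical and in the same order, as in the cell above, both approaches give the same result.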



**Perform Missing Value Imputation**

In [219]:
missingDf  = spark.read.csv('missing.csv',header=True, inferSchema=True)
missingDf.show()

+-----+-----+-----+
|  iv1|  iv2|  iv3|
+-----+-----+-----+
|  9.0|11.43|10.25|
|10.23| null| 8.17|
|10.26| 8.35| 9.94|
| 9.84| null| null|
|10.77|10.18|11.02|
+-----+-----+-----+



In [220]:
missingDf.printSchema()

root
 |-- iv1: double (nullable = true)
 |-- iv2: double (nullable = true)
 |-- iv3: double (nullable = true)



In [218]:
from pyspark.sql.types import DoubleType

In [221]:
missingDf = (missingDf
             .withColumn("iv2", missingDf.iv2.cast(DoubleType()))
             .withColumn("iv3", missingDf.iv3.cast(DoubleType())))

In [222]:
#Dropping the Rows that Have Null Values
missingDf.dropna(how ='any').show()

+-----+-----+-----+
|  iv1|  iv2|  iv3|
+-----+-----+-----+
|  9.0|11.43|10.25|
|10.26| 8.35| 9.94|
|10.77|10.18|11.02|
+-----+-----+-----+



In [223]:
#No row has all of its values null, so how='all' leaves the DataFrame unchanged
missingDf.dropna(how ='all').show()

+-----+-----+-----+
|  iv1|  iv2|  iv3|
+-----+-----+-----+
|  9.0|11.43|10.25|
|10.23| null| 8.17|
|10.26| 8.35| 9.94|
| 9.84| null| null|
|10.77|10.18|11.02|
+-----+-----+-----+



1. Dropping Rows that Have Null Values Using the thresh Argument
2. If the thresh value is set to 2, any row containing fewer than two non-null values will be dropped. Only the fourth row has fewer than two non-null values (it has only one), so it is the only row that will be dropped. When thresh is given, it overrides the how argument.

In [225]:
missingDf.show()
missingDf.dropna(how ='all',thresh=2).show()

+-----+-----+-----+
|  iv1|  iv2|  iv3|
+-----+-----+-----+
|  9.0|11.43|10.25|
|10.23| null| 8.17|
|10.26| 8.35| 9.94|
| 9.84| null| null|
|10.77|10.18|11.02|
+-----+-----+-----+

+-----+-----+-----+
|  iv1|  iv2|  iv3|
+-----+-----+-----+
|  9.0|11.43|10.25|
|10.23| null| 8.17|
|10.26| 8.35| 9.94|
|10.77|10.18|11.02|
+-----+-----+-----+



In [227]:
missingDf.dropna(how ='all',thresh=3).show()

+-----+-----+-----+
|  iv1|  iv2|  iv3|
+-----+-----+-----+
|  9.0|11.43|10.25|
|10.26| 8.35| 9.94|
|10.77|10.18|11.02|
+-----+-----+-----+
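
To make the `thresh` rule concrete, here is a small plain-Python sketch that applies the same "at least `thresh` non-null values" test to the five rows of `missingDf`. It does not use Spark; `None` stands in for null, and `drop_below_threshold` is a hypothetical helper named for this example only.

```python
# Rows copied by hand from missingDf; None stands in for null.
rows = [
    (9.0, 11.43, 10.25),
    (10.23, None, 8.17),
    (10.26, 8.35, 9.94),
    (9.84, None, None),
    (10.77, 10.18, 11.02),
]


def drop_below_threshold(rows, thresh):
    """Keep only the rows that have at least `thresh` non-null values."""
    return [
        row for row in rows
        if sum(value is not None for value in row) >= thresh
    ]


kept_with_two = drop_below_threshold(rows, thresh=2)
kept_with_three = drop_below_threshold(rows, thresh=3)
print(len(kept_with_two), len(kept_with_three))
```

With a threshold of 2 only the row holding a single value is removed; with a threshold of 3 every row that has any null is removed, which gives the same result as `how='any'` on this three-column DataFrame.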



In [228]:
#Filling in the Missing Value with Some Number
missingDf.fillna(value=0).show()

+-----+-----+-----+
|  iv1|  iv2|  iv3|
+-----+-----+-----+
|  9.0|11.43|10.25|
|10.23|  0.0| 8.17|
|10.26| 8.35| 9.94|
| 9.84|  0.0|  0.0|
|10.77|10.18|11.02|
+-----+-----+-----+
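
Filling every gap with 0 is simple, but it pulls the column averages down. A common alternative, sketched here in plain Python as an assumption rather than as something the notebook itself does, is to fill each gap with the mean of the non-null values in its column. The names `column_means` and `imputed` are illustrative only.

```python
# Rows copied by hand from missingDf; None stands in for null.
rows = [
    (9.0, 11.43, 10.25),
    (10.23, None, 8.17),
    (10.26, 8.35, 9.94),
    (9.84, None, None),
    (10.77, 10.18, 11.02),
]


def column_means(rows):
    """Mean of each column, ignoring None values."""
    means = []
    for index in range(len(rows[0])):
        present = [row[index] for row in rows if row[index] is not None]
        means.append(sum(present) / len(present))
    return means


means = column_means(rows)

# Replace each None with the mean of its own column.
imputed = [
    tuple(means[i] if value is None else value for i, value in enumerate(row))
    for row in rows
]

for row in imputed:
    print(row)
```

In PySpark, per-column fill values can be passed to `fillna` as a dictionary keyed by column name, so means computed by an aggregation could be applied in a single call.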



**Create a Temp View from a DataFrame**

In [11]:
studentsDf.show()

+---------+-------+------+
|studentid|   name|gender|
+---------+-------+------+
|      si1|  Robin|     M|
|      si2|  Maria|     F|
|      si3|  Julie|     F|
|      si4|    Bob|     M|
|      si6|William|     M|
+---------+-------+------+



In [3]:
studentsDf.createOrReplaceTempView("Students")

In [5]:
#spark.catalog.listTables()

In [9]:
outputDf = spark.sql("select * from students")
outputDf.show()

+---------+-------+------+
|studentid|   name|gender|
+---------+-------+------+
|      si1|  Robin|     M|
|      si2|  Maria|     F|
|      si3|  Julie|     F|
|      si4|    Bob|     M|
|      si6|William|     M|
+---------+-------+------+



In [16]:
#Using Column Names in Spark SQL
spark.sql("Describe students").show()

+---------+---------+-------+
| col_name|data_type|comment|
+---------+---------+-------+
|studentid|   string|   null|
|     name|   string|   null|
|   gender|   string|   null|
+---------+---------+-------+



In [15]:
#Creating an Alias for Column Names
spark.sql("select name as Name,gender as Sex from students").show()

+-------+---+
|   Name|Sex|
+-------+---+
|  Robin|  M|
|  Maria|  F|
|  Julie|  F|
|    Bob|  M|
|William|  M|
+-------+---+



In [21]:
#Filtering Data Using a Where Clause
spark.sql("select name ,gender from students\
          where gender ='F'").show()

+-----+------+
| name|gender|
+-----+------+
|Maria|     F|
|Julie|     F|
+-----+------+



In [23]:
spark.sql("select name ,gender from Students where Gender =\
'f'").show()

+----+------+
|name|gender|
+----+------+
+----+------+



In [27]:
spark.sql("select name ,gender from Students\
          where lower(gender) = 'f'").show()

+-----+------+
| name|gender|
+-----+------+
|Maria|     F|
|Julie|     F|
+-----+------+



In [6]:
#Apply Spark UDF Methods on Spark SQL
studentsDf = spark.read.csv('studentData.csv',header=True, inferSchema=True)
studentsDf.show()

+---------+-------+------+-----------+-----+
|studentid|   name|gender|dateofbirth|score|
+---------+-------+------+-----------+-----+
|      si1|  Robin|     M| 1981-09-06|   20|
|      si2|  Maria|     F| 1986-06-06|   30|
|      si3|  Julie|     F| 1988-09-05|   10|
|      si4|    Bob|     M| 1987-05-04|   15|
|      si6|William|     M| 1980-11-12|   25|
+---------+-------+------+-----------+-----+



**Representing a String Value as a Date Value**

In [7]:
studentsDf.createOrReplaceTempView("Students")
spark.sql("Describe students").show()

+-----------+---------+-------+
|   col_name|data_type|comment|
+-----------+---------+-------+
|  studentid|   string|   null|
|       name|   string|   null|
|     gender|   string|   null|
|dateofbirth|   string|   null|
|      score|      int|   null|
+-----------+---------+-------+



In [8]:
spark.sql("select studentid, name, gender,\
          to_date(dateofbirth) as DOB from students").show()

+---------+-------+------+----------+
|studentid|   name|gender|       DOB|
+---------+-------+------+----------+
|      si1|  Robin|     M|1981-09-06|
|      si2|  Maria|     F|1986-06-06|
|      si3|  Julie|     F|1988-09-05|
|      si4|    Bob|     M|1987-05-04|
|      si6|William|     M|1980-11-12|
+---------+-------+------+----------+



In [11]:
studentDf_dob=spark.sql("select studentid, name, gender,\
          to_date(dateofbirth) as dateofbirth from students")
studentDf_dob.createOrReplaceTempView("StudentsDob")
spark.sql("Describe studentsDob").show()

+-----------+---------+-------+
|   col_name|data_type|comment|
+-----------+---------+-------+
|  studentid|   string|   null|
|       name|   string|   null|
|     gender|   string|   null|
|dateofbirth|     date|   null|
+-----------+---------+-------+



In [8]:
#Using Date Functions for Various Date Manipulations
spark.sql("select dateofbirth,dayofmonth(dateofbirth) day,\
          month(dateofbirth) month, year(dateofbirth) year\
          from studentsdob").show()

+-----------+---+-----+----+
|dateofbirth|day|month|year|
+-----------+---+-----+----+
| 1981-09-06|  6|    9|1981|
| 1986-06-06|  6|    6|1986|
| 1988-09-05|  5|    9|1988|
| 1987-05-04|  4|    5|1987|
| 1980-11-12| 12|   11|1980|
+-----------+---+-----+----+



In [11]:
#Create a New PySpark UDF
def genderCodeToValue(code):
    return('FEMALE' if code == 'F' else 'MALE' if code =='M' else 'NA')
print(genderCodeToValue('M'))
print(genderCodeToValue('F'))
print(genderCodeToValue('S'))

MALE
FEMALE
NA


In [12]:
#Registering the Python Function in a Spark SQL Session
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

In [13]:
spark.udf.register("genderCodeToValue", genderCodeToValue,StringType())

<function __main__.genderCodeToValue(code)>

In [17]:
#Calling the Registered UDF from PySpark SQL
spark.sql("select name as Name, genderCodeToValue(gender) \
          as Gender from studentsdob").show()

+-------+------+
|   Name|Gender|
+-------+------+
|  Robin|  MALE|
|  Maria|FEMALE|
|  Julie|FEMALE|
|    Bob|  MALE|
|William|  MALE|
+-------+------+
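
For a simple lookup like this one, the same mapping can also be written directly in SQL with a CASE expression, which avoids registering a Python function. The sketch below uses Python's built-in `sqlite3` module purely as a stand-in so that it runs without a Spark session; this is a swapped-in engine, not what the notebook uses. The query text is the point of the example, and running it through `spark.sql` against the registered view is an assumption that is not executed here.

```python
import sqlite3

# In-memory SQLite database used only as a stand-in for the Spark session.
conn = sqlite3.connect(":memory:")
conn.execute("create table students (name text, gender text)")
conn.executemany(
    "insert into students values (?, ?)",
    [("Robin", "M"), ("Maria", "F"), ("Julie", "F"), ("Bob", "M"), ("William", "M")],
)

# CASE expression that mirrors genderCodeToValue: F -> FEMALE, M -> MALE, else NA.
case_query = """
select name as Name,
       case gender when 'F' then 'FEMALE'
                   when 'M' then 'MALE'
                   else 'NA' end as Gender
from students
"""

labelled = conn.execute(case_query).fetchall()
for name, gender in labelled:
    print(name, gender)
```

A UDF remains the better fit when the logic is too involved to express in SQL.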



**Join Two DataFrames Using SQL**

In [10]:
subjectsDf = spark.read.csv('subjects.csv',header=True, inferSchema=True)
subjectsDf.createOrReplaceTempView("subjects")
spark.sql("select * from subjects").show()

+---+-----+---------+--------+
| id|marks|studentid|subjects|
+---+-----+---------+--------+
|  6|   78|      si4|     C++|
|  9|   83|      si2|    Java|
|  5|   72|      si3|    Ruby|
|  4|   85|      si2|  Python|
|  7|   77|      si5|     C++|
|  1|   75|      si1|  Python|
|  8|   84|      si4|  Python|
|  2|   76|      si3|    Java|
|  3|   81|      si1|    Java|
+---+-----+---------+--------+



In [12]:
spark.sql("select studentId,name,gender from studentsdob").show()

+---------+-------+------+
|studentId|   name|gender|
+---------+-------+------+
|      si1|  Robin|     M|
|      si2|  Maria|     F|
|      si3|  Julie|     F|
|      si4|    Bob|     M|
|      si6|William|     M|
+---------+-------+------+



In [13]:
'''By observing both these datasets, note that studentId is the column
that can be used to join these two datasets. Let’s write the query to join
these datasets'''

spark.sql("select * from studentsdob st join subjects sb on\
          (st.studentId = sb.studentId)").show()

+---------+-----+------+-----------+---+-----+---------+--------+
|studentid| name|gender|dateofbirth| id|marks|studentid|subjects|
+---------+-----+------+-----------+---+-----+---------+--------+
|      si4|  Bob|     M| 1987-05-04|  6|   78|      si4|     C++|
|      si2|Maria|     F| 1986-06-06|  9|   83|      si2|    Java|
|      si3|Julie|     F| 1988-09-05|  5|   72|      si3|    Ruby|
|      si2|Maria|     F| 1986-06-06|  4|   85|      si2|  Python|
|      si1|Robin|     M| 1981-09-06|  1|   75|      si1|  Python|
|      si4|  Bob|     M| 1987-05-04|  8|   84|      si4|  Python|
|      si3|Julie|     F| 1988-09-05|  2|   76|      si3|    Java|
|      si1|Robin|     M| 1981-09-06|  3|   81|      si1|    Java|
+---------+-----+------+-----------+---+-----+---------+--------+



**Join Multiple DataFrames Using SQL**

In [40]:
studentDf_dob=spark.sql("select studentid, name, gender,to_date(dateofbirth) as dateofbirth from students")
studentDf_dob.createOrReplaceTempView("StudentsDob")
print('StudentDOB st')
spark.sql("select * from studentsDob").show()

subjectsDf = spark.read.csv('subjects.csv',header=True, inferSchema=True)
subjectsDf.createOrReplaceTempView("subjects")
print('subjects sb')
spark.sql("select * from subjects").show()

attendanceDf = spark.read.csv('attendance.csv',header=True, inferSchema=True)
attendanceDf.createOrReplaceTempView("attendance")
print('attendance at')
spark.sql("select * from attendance").show()

StudentDOB st
+---------+-------+------+-----------+
|studentid|   name|gender|dateofbirth|
+---------+-------+------+-----------+
|      si1|  Robin|     M| 1981-09-06|
|      si2|  Maria|     F| 1986-06-06|
|      si3|  Julie|     F| 1988-09-05|
|      si4|    Bob|     M| 1987-05-04|
|      si6|William|     M| 1980-11-12|
+---------+-------+------+-----------+

subjects sb
+---+-----+---------+-------+
| id|marks|studentid|subject|
+---+-----+---------+-------+
|  6|   78|      si4|    C++|
|  9|   83|      si2|   Java|
|  5|   72|      si3|   Ruby|
|  4|   85|      si2| Python|
|  7|   77|      si5|    C++|
|  1|   75|      si1| Python|
|  8|   84|      si4| Python|
|  2|   76|      si3|   Java|
|  3|   81|      si1|   Java|
+---+-----+---------+-------+

attendance at
+---------+-------+----------+
|studentid|subject|attendance|
+---------+-------+----------+
|      si1| Python|        30|
|      si3|   Java|        22|
|      si1|   Java|        34|
|      si2| Python|        39|


The following query joins the third dataset as a continuation of the existing
join. You can join any number of tables in the same manner.

In [41]:
spark.sql("select * from studentsdob st inner Join subjects sb on\
          (st.studentid=sb.studentid) Join attendance at on\
          (at.studentid=st.studentid)").show()

+---------+-----+------+-----------+---+-----+---------+-------+---------+-------+----------+
|studentid| name|gender|dateofbirth| id|marks|studentid|subject|studentid|subject|attendance|
+---------+-----+------+-----------+---+-----+---------+-------+---------+-------+----------+
|      si4|  Bob|     M| 1987-05-04|  6|   78|      si4|    C++|      si4| Python|        39|
|      si4|  Bob|     M| 1987-05-04|  6|   78|      si4|    C++|      si4|    C++|        38|
|      si2|Maria|     F| 1986-06-06|  9|   83|      si2|   Java|      si2|   Java|        39|
|      si2|Maria|     F| 1986-06-06|  9|   83|      si2|   Java|      si2| Python|        39|
|      si3|Julie|     F| 1988-09-05|  5|   72|      si3|   Ruby|      si3|   Ruby|        25|
|      si3|Julie|     F| 1988-09-05|  5|   72|      si3|   Ruby|      si3|   Java|        22|
|      si2|Maria|     F| 1986-06-06|  4|   85|      si2| Python|      si2|   Java|        39|
|      si2|Maria|     F| 1986-06-06|  4|   85|      si2| Pyt

Now the task is to produce a cleaner report that combines all three
datasets (student, subject, attendance) and keeps only the required
columns:

**Complete Code**

In [42]:
#Load and register StudentsDob
studentDf = spark.read.csv('studentData.csv',header=True, inferSchema=True)
studentDf.createOrReplaceTempView("StudentsDob")
#Load and register subjects
subjectsDf = spark.read.csv('subjects.csv',header=True, inferSchema=True)
subjectsDf.createOrReplaceTempView("subjects")
#Load and register attendance
attendanceDf = spark.read.csv('attendance.csv',header=True, inferSchema=True)
attendanceDf.createOrReplaceTempView("attendance")

In [43]:
#Create the gender User Defined Function
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
def genderCodeToValue(code):
    return('FEMALE' if code == 'F' else 'MALE' if code == 'M' else 'NA')
spark.udf.register("genderCodeToValue", genderCodeToValue,StringType())

<function __main__.genderCodeToValue(code)>

In [44]:
# Apply a query to get the final report
spark.sql("select name as Name, genderCodeToValue(gender) as Gender,\
          marks as Marks, attendance as Attendance from studentsdob st\
          Join subjects sb on (st.studentId = sb.studentId)\
          Join attendance at on (at.studentId = st.studentId)").show()

+-----+------+-----+----------+
| Name|Gender|Marks|Attendance|
+-----+------+-----+----------+
|Robin|  MALE|   81|        34|
|Robin|  MALE|   81|        30|
|Robin|  MALE|   75|        34|
|Robin|  MALE|   75|        30|
|Maria|FEMALE|   85|        39|
|Maria|FEMALE|   85|        39|
|Maria|FEMALE|   83|        39|
|Maria|FEMALE|   83|        39|
|Julie|FEMALE|   76|        25|
|Julie|FEMALE|   76|        22|
|Julie|FEMALE|   72|        25|
|Julie|FEMALE|   72|        22|
|  Bob|  MALE|   84|        39|
|  Bob|  MALE|   84|        38|
|  Bob|  MALE|   78|        39|
|  Bob|  MALE|   78|        38|
+-----+------+-----+----------+
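
The report above pairs every mark of a student with every attendance figure of that student, because the attendance join matches on `studentid` alone. If one row per student and subject is wanted, one option is to match on the subject as well. The sketch below shows the difference on Robin's rows only. It uses Python's built-in `sqlite3` module as a stand-in so that it runs without a Spark session; this is an assumption made for the sake of a runnable example, and the alias `att` is chosen here for illustration.

```python
import sqlite3

# In-memory SQLite database used only as a stand-in for the Spark session.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    create table studentsdob (studentid text, name text);
    create table subjects (studentid text, subject text, marks integer);
    create table attendance (studentid text, subject text, attendance integer);
    """
)
# Robin's rows, copied by hand from the tables shown earlier.
conn.execute("insert into studentsdob values ('si1', 'Robin')")
conn.executemany(
    "insert into subjects values (?, ?, ?)",
    [("si1", "Python", 75), ("si1", "Java", 81)],
)
conn.executemany(
    "insert into attendance values (?, ?, ?)",
    [("si1", "Python", 30), ("si1", "Java", 34)],
)

select_clause = """
select name, sb.subject, marks, att.attendance
from studentsdob st
join subjects sb on (st.studentid = sb.studentid)
"""

# Matching on studentid only: 2 subject rows x 2 attendance rows = 4 rows.
by_student = conn.execute(
    select_clause + "join attendance att on (att.studentid = st.studentid)"
).fetchall()

# Matching on studentid and subject: one row per (student, subject).
by_student_and_subject = conn.execute(
    select_clause
    + "join attendance att on (att.studentid = sb.studentid"
    + " and att.subject = sb.subject)"
).fetchall()

print(len(by_student), len(by_student_and_subject))
```

Which join condition is right depends on what the report is meant to show; the stricter condition is only appropriate if attendance is recorded per subject, as the `subject` column in the attendance data suggests.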



**Optimizing PySpark SQL**

**Apply Aggregation Using PySpark SQL**

In [45]:
#Load and register Student data
studentDf = spark.read.csv('studentData.csv',header=True, inferSchema=True)
studentDf.createOrReplaceTempView("StudentsDob")
#Load and register subjects
subjectsDf = spark.read.csv('subjects.csv',header=True, inferSchema=True)
subjectsDf.createOrReplaceTempView("subjects")
spark.sql("select * from studentsdob st join subjects\
          sb on(st.studentId = sb.studentId)").show()

+---------+-----+------+-----------+-----+---+-----+---------+-------+
|studentid| name|gender|dateofbirth|score| id|marks|studentid|subject|
+---------+-----+------+-----------+-----+---+-----+---------+-------+
|      si1|Robin|     M| 1981-09-06|   20|  3|   81|      si1|   Java|
|      si1|Robin|     M| 1981-09-06|   20|  1|   75|      si1| Python|
|      si2|Maria|     F| 1986-06-06|   30|  4|   85|      si2| Python|
|      si2|Maria|     F| 1986-06-06|   30|  9|   83|      si2|   Java|
|      si3|Julie|     F| 1988-09-05|   10|  2|   76|      si3|   Java|
|      si3|Julie|     F| 1988-09-05|   10|  5|   72|      si3|   Ruby|
|      si4|  Bob|     M| 1987-05-04|   15|  8|   84|      si4| Python|
|      si4|  Bob|     M| 1987-05-04|   15|  6|   78|      si4|    C++|
+---------+-----+------+-----------+-----+---+-----+---------+-------+



In [46]:
#Group By on this query to identify the average marks per student
spark.sql("select name,avg(marks) from studentsdob st left\
          join subjects sb on (st.studentId = sb.studentId) group by\
          name").show()

+-------+----------+
|   name|avg(marks)|
+-------+----------+
|  Robin|      78.0|
|  Maria|      84.0|
|  Julie|      74.0|
|    Bob|      81.0|
|William|      null|
+-------+----------+



In [47]:
# Multiple Steps
resultDF = spark.sql("select * from studentsdob st left join\
                     subjects sb on(st.studentId = sb.studentId)")
resultDF.show()

+---------+-------+------+-----------+-----+----+-----+---------+-------+
|studentid|   name|gender|dateofbirth|score|  id|marks|studentid|subject|
+---------+-------+------+-----------+-----+----+-----+---------+-------+
|      si1|  Robin|     M| 1981-09-06|   20|   3|   81|      si1|   Java|
|      si1|  Robin|     M| 1981-09-06|   20|   1|   75|      si1| Python|
|      si2|  Maria|     F| 1986-06-06|   30|   4|   85|      si2| Python|
|      si2|  Maria|     F| 1986-06-06|   30|   9|   83|      si2|   Java|
|      si3|  Julie|     F| 1988-09-05|   10|   2|   76|      si3|   Java|
|      si3|  Julie|     F| 1988-09-05|   10|   5|   72|      si3|   Ruby|
|      si4|    Bob|     M| 1987-05-04|   15|   8|   84|      si4| Python|
|      si4|    Bob|     M| 1987-05-04|   15|   6|   78|      si4|    C++|
|      si6|William|     M| 1980-11-12|   25|null| null|     null|   null|
+---------+-------+------+-----------+-----+----+-----+---------+-------+



In [48]:
resultDF.groupBy("name").agg({"marks":"avg"}).show()

+-------+----------+
|   name|avg(marks)|
+-------+----------+
|  Robin|      78.0|
|  Maria|      84.0|
|  Julie|      74.0|
|    Bob|      81.0|
|William|      null|
+-------+----------+



In [49]:
spark.sql("select name,subject,avg(marks) from studentsdob\
          st left join subjects sb on (st.studentId = sb.studentId)\
          group by name,subject").show()

+-------+-------+----------+
|   name|subject|avg(marks)|
+-------+-------+----------+
|  Robin|   Java|      81.0|
|  Robin| Python|      75.0|
|  Maria| Python|      85.0|
|  Maria|   Java|      83.0|
|  Julie|   Java|      76.0|
|  Julie|   Ruby|      72.0|
|    Bob| Python|      84.0|
|    Bob|    C++|      78.0|
|William|   null|      null|
+-------+-------+----------+



In [51]:
# Alternate
resultDF.groupBy('name','subject').agg({"marks":"avg"}).show()

+-------+-------+----------+
|   name|subject|avg(marks)|
+-------+-------+----------+
|  Robin|   Java|      81.0|
|  Robin| Python|      75.0|
|  Maria| Python|      85.0|
|  Maria|   Java|      83.0|
|  Julie|   Java|      76.0|
|  Julie|   Ruby|      72.0|
|    Bob| Python|      84.0|
|    Bob|    C++|      78.0|
|William|   null|      null|
+-------+-------+----------+



**Finding the Number of Students per Subject**

In [53]:
spark.sql("select subject,count(name) Students_count\
          from students st left join subjects sb on\
          (st.studentId =sb.studentId) group by subject").show()

+-------+--------------+
|subject|Students_count|
+-------+--------------+
|   Java|             3|
| Python|             3|
|   Ruby|             1|
|    C++|             1|
|   null|             1|
+-------+--------------+



In [54]:
# Finding the Number of Subjects per Student
spark.sql("select name,count(subject) Subjects_count\
          from studentsdob st left join subjects sb on\
          (st.studentId =sb.studentId) group by name").show()

+-------+--------------+
|   name|Subjects_count|
+-------+--------------+
|  Robin|             2|
|  Maria|             2|
|  Julie|             2|
|    Bob|             2|
|William|             0|
+-------+--------------+



**Apply Windows Functions Using PySpark SQL**

The rank function assigns a rank to each row within a selected set of rows. Rows that tie on the ordering column share the same rank, and the ranks that follow are skipped accordingly.

In [60]:
# Apply the rank function over all the rows, without partitioning them into windows

spark.sql("select name,subject,marks,rank() over\
          (order by name) as RANK from studentsdob st join\
          subjects sb ON (st.studentId = sb.studentId)").show()

+-----+-------+-----+----+
| name|subject|marks|RANK|
+-----+-------+-----+----+
|  Bob| Python|   84|   1|
|  Bob|    C++|   78|   1|
|Julie|   Java|   76|   3|
|Julie|   Ruby|   72|   3|
|Maria| Python|   85|   5|
|Maria|   Java|   83|   5|
|Robin|   Java|   81|   7|
|Robin| Python|   75|   7|
+-----+-------+-----+----+
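
The gaps in the RANK column (1, 1, 3, 3, ...) come from how `rank()` treats ties. Spark SQL also offers `dense_rank()` and `row_number()`, which number the same rows differently. The plain-Python sketch below reproduces the three numbering schemes on the sorted names from the output above; the helper functions are hypothetical and exist only to illustrate the semantics.

```python
def rank_values(ordered):
    """rank(): tied values share a rank and the next rank skips ahead."""
    ranks = []
    for position, value in enumerate(ordered, start=1):
        if ranks and value == ordered[position - 2]:
            ranks.append(ranks[-1])
        else:
            ranks.append(position)
    return ranks


def dense_rank_values(ordered):
    """dense_rank(): tied values share a rank and no rank is skipped."""
    ranks = []
    for index, value in enumerate(ordered):
        if ranks and value == ordered[index - 1]:
            ranks.append(ranks[-1])
        else:
            ranks.append(ranks[-1] + 1 if ranks else 1)
    return ranks


# Names from the joined rows, sorted as "order by name" would sort them.
ordered_names = sorted(
    ["Robin", "Bob", "Maria", "Julie", "Bob", "Julie", "Maria", "Robin"]
)

rank_column = rank_values(ordered_names)
dense_rank_column = dense_rank_values(ordered_names)
# row_number(): a plain running count with no ties at all.
row_number_column = list(range(1, len(ordered_names) + 1))

print(rank_column)
print(dense_rank_column)
print(row_number_column)
```

The choice matters for top-N queries: filtering on `rank()` can return more than N rows when there are ties, whereas filtering on `row_number()` returns at most N rows per window.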



**Using Partition By in a Rank Function**

In this query we add a PARTITION BY clause on the subject column, which
means that we are windowing on subject. This creates a separate window
for each distinct value of subject and applies the rank window function
only within the rows of each window.

In [62]:
spark.sql("select name,subject,marks,rank() over\
          (PARTITION BY subject order by marks DESC) as RANK\
          from studentsdob st join subjects sb ON\
          (st.studentId = sb.studentId)").show()

+-----+-------+-----+----+
| name|subject|marks|RANK|
+-----+-------+-----+----+
|  Bob|    C++|   78|   1|
|Maria|   Java|   83|   1|
|Robin|   Java|   81|   2|
|Julie|   Java|   76|   3|
|Maria| Python|   85|   1|
|  Bob| Python|   84|   2|
|Robin| Python|   75|   3|
|Julie|   Ruby|   72|   1|
+-----+-------+-----+----+



**Obtaining the Top Two Students for Each Subject**

In [63]:
spark.sql("select name,subject,marks,RANK from (select\
          name,subject,marks,rank() over\
          (PARTITION BY subject order by marks desc) as RANK\
          from studentsdob st join subjects sb ON\
          (st.studentId = sb.studentId)) where RANK <= 2").show()

+-----+-------+-----+----+
| name|subject|marks|RANK|
+-----+-------+-----+----+
|  Bob|    C++|   78|   1|
|Maria|   Java|   83|   1|
|Robin|   Java|   81|   2|
|Maria| Python|   85|   1|
|  Bob| Python|   84|   2|
|Julie|   Ruby|   72|   1|
+-----+-------+-----+----+



**Cache Data Using PySpark SQL**

With caching, Spark keeps the data in memory so that it
does not need to read it again. Reading the data from disk is a costly operation.

We don’t want to cache huge volumes of data, because that can lead to
out-of-memory exceptions. Holding huge volumes of data in memory also leaves
Spark with less memory for other operations.

In [64]:
spark.sql("cache table students_cached AS select * from studentsdob").show()

++
||
++
++



In [65]:
spark.sql("select st.studentid,name,subject,marks\
          from students_cached st left join subjects sb\
          on (st.studentId =sb.studentId)").show()

+---------+-------+-------+-----+
|studentid|   name|subject|marks|
+---------+-------+-------+-----+
|      si1|  Robin|   Java|   81|
|      si1|  Robin| Python|   75|
|      si2|  Maria| Python|   85|
|      si2|  Maria|   Java|   83|
|      si3|  Julie|   Java|   76|
|      si3|  Julie|   Ruby|   72|
|      si4|    Bob| Python|   84|
|      si4|    Bob|    C++|   78|
|      si6|William|   null| null|
+---------+-------+-------+-----+



**Apply the Distribute By, Sort By, and Cluster By Clauses in PySpark SQL**