# Spark SQL

## 1. Introduction to DataFrames and Datasets

### Differences between RDDs, DataFrames and Datasets

**RDDs (2011)**:
- Distributed collection of Java Virtual Machine (JVM) objects
- Functional Operators (map, reduce, filter, etc)

**DataFrame (2013)**:
- Distributed collection of Row objects
- Expression based operations and User Defined Functions (UDFs)
- Logical plans and optimizer
- Fast/efficient internal representations

**Dataset (2015)**:
- Internally Rows, externally JVM objects
- Almost the "Best of both worlds": type safe + fast
- Slower than DataFrames and less suited to interactive analysis; the typed Dataset API is not available in Python

## 2. DataFrames

![DataFrame Concept](../images/dataframe_concept.png)

### Basic concepts
- Distributed collection of Row objects that carries its schema along with the data.
- Data is organized into columns like a relational database.
- The main features of DataFrames:
    - **Catalyst**: the query optimizer that powers the DataFrame and SQL APIs. It works in four phases:
        1. Analyzing a logical plan to resolve references
        2. Logical plan optimization
        3. Physical planning
        4. Code generation to compile parts of the query to Java bytecode
    - **Tungsten**: provides a physical execution backend which explicitly manages memory and dynamically generates bytecode for expression evaluation.


### Creating DataFrames

DataFrames can be created from many different sources such as existing RDDs, structured and unstructured data files (CSV, JSON, Parquet), databases using JDBC, etc.

In [18]:
!echo $JAVA_HOME




In [19]:
from pyspark.sql import SparkSession
from pyspark.sql.types import Row, IntegerType

# Initialize Spark session
spark = SparkSession.builder.appName("SparkSQL").getOrCreate()

# Creating DataFrame from an existing RDD
rdd = spark.sparkContext.parallelize([0, 1, 2, 3, 4, 5])
df = rdd.map(lambda x: Row(x)).toDF()
print("DataFrame from RDD.toDF() function")
df.show()
schema = IntegerType()
df = spark.createDataFrame(rdd, schema)
print("DataFrame from RDD with spark.createDataFrame()")
df.show()

# Creating a DataFrame from a Python collection
cars = [
    {"brand": "Toyota", "name": "Camry", "price": 24000},
    {"brand": "Honda", "name": "Civic", "price": 22000},
    {"brand": "Ford", "name": "Mustang", "price": 27000},
    {"brand": "Tesla", "name": "Model 3", "price": 35000},
    {"brand": "Chevrolet", "name": "Malibu", "price": 23000}
]
df = spark.createDataFrame(cars)
print("DataFrame from a Python collection")
df.show()

# Creating a DataFrame from a JSON file
df = spark.read.json('../datasets/students.json')
print("DataFrame from a JSON file")
df.show()

DataFrame from RDD.toDF() function
+---+
| _1|
+---+
|  0|
|  1|
|  2|
|  3|
|  4|
|  5|
+---+

DataFrame from RDD with spark.createDataFrame()
+-----+
|value|
+-----+
|    0|
|    1|
|    2|
|    3|
|    4|
|    5|
+-----+

DataFrame from a Python collection
+---------+-------+-----+
|    brand|   name|price|
+---------+-------+-----+
|   Toyota|  Camry|24000|
|    Honda|  Civic|22000|
|     Ford|Mustang|27000|
|    Tesla|Model 3|35000|
|Chevrolet| Malibu|23000|
+---------+-------+-----+

DataFrame from a JSON file
+--------------------+---+--------------------+-----+--------+-----------------+
|                 _id|age|               email|grade|    name|          surname|
+--------------------+---+--------------------+-----+--------+-----------------+
|33a624e7-e6f1-40b...| 23|Valeria.Sebastian...| 7.56| Valeria| Sebastian Garcia|
|2cd47675-43f3-415...| 23|Sanchez.Abascal@g...| 8.16|    Emma|  Sanchez Abascal|
|594ea4e7-75e3-456...| 20|Sarabia.Lopez@gma...| 8.22| Agustin|    Sarabia

## 3. Running SQL Queries using Spark SQL

In Spark you can work with DataFrames using either the built-in DataFrame functions or the Spark SQL API, which provides a syntax similar to standard SQL.

In [20]:
# First, let's check the schema of the df we will work with
df.printSchema()

root
 |-- _id: string (nullable = true)
 |-- age: long (nullable = true)
 |-- email: string (nullable = true)
 |-- grade: double (nullable = true)
 |-- name: string (nullable = true)
 |-- surname: string (nullable = true)



### Reading the top 10 lines

#### DataFrame mode

In [21]:
df.limit(10).show()

+--------------------+---+--------------------+-----+-------+-----------------+
|                 _id|age|               email|grade|   name|          surname|
+--------------------+---+--------------------+-----+-------+-----------------+
|33a624e7-e6f1-40b...| 23|Valeria.Sebastian...| 7.56|Valeria| Sebastian Garcia|
|2cd47675-43f3-415...| 23|Sanchez.Abascal@g...| 8.16|   Emma|  Sanchez Abascal|
|594ea4e7-75e3-456...| 20|Sarabia.Lopez@gma...| 8.22|Agustin|    Sarabia Lopez|
|3b521244-d2d4-40b...| 25|MartinaySebastian...| 7.67|Martina|Corominas Sarabia|
|e6f52130-362f-4a5...| 19|DavidyValeria@gma...| 7.45|  David|   Miranda Grande|
|cee04454-f6ea-48b...| 20|Lopez.Bernal@outl...| 7.35|   Laia|     Lopez Bernal|
|6e5b75cd-0d5f-41f...| 22|MarcosySantiago@h...|  6.8| Marcos|     Garcia Aznar|
|47435195-80b1-473...| 18|Judith.Garcia@gma...|  9.1| Judith|      Garcia Cruz|
|fbdf66dc-49da-467...| 21| IkerySara@gmail.com| 7.77|   Iker|    Seco Coronado|
|0df69140-84ac-47d...| 22|Lopez.Sastre@h

#### SQL mode

In [22]:
# This line only needs to be executed once to create a temporary view in the Spark session
df.createOrReplaceTempView("students")
# You can check the tables you have defined in a Spark Session with the following line
spark.sql("show tables").show()

+---------+---------+-----------+
|namespace|tableName|isTemporary|
+---------+---------+-----------+
|         | students|       true|
+---------+---------+-----------+



In [23]:
# Once the table is created you can access it using SQL as follows
spark.sql('SELECT * FROM students LIMIT 10').show()

+--------------------+---+--------------------+-----+-------+-----------------+
|                 _id|age|               email|grade|   name|          surname|
+--------------------+---+--------------------+-----+-------+-----------------+
|33a624e7-e6f1-40b...| 23|Valeria.Sebastian...| 7.56|Valeria| Sebastian Garcia|
|2cd47675-43f3-415...| 23|Sanchez.Abascal@g...| 8.16|   Emma|  Sanchez Abascal|
|594ea4e7-75e3-456...| 20|Sarabia.Lopez@gma...| 8.22|Agustin|    Sarabia Lopez|
|3b521244-d2d4-40b...| 25|MartinaySebastian...| 7.67|Martina|Corominas Sarabia|
|e6f52130-362f-4a5...| 19|DavidyValeria@gma...| 7.45|  David|   Miranda Grande|
|cee04454-f6ea-48b...| 20|Lopez.Bernal@outl...| 7.35|   Laia|     Lopez Bernal|
|6e5b75cd-0d5f-41f...| 22|MarcosySantiago@h...|  6.8| Marcos|     Garcia Aznar|
|47435195-80b1-473...| 18|Judith.Garcia@gma...|  9.1| Judith|      Garcia Cruz|
|fbdf66dc-49da-467...| 21| IkerySara@gmail.com| 7.77|   Iker|    Seco Coronado|
|0df69140-84ac-47d...| 22|Lopez.Sastre@h

### Understanding the Physical and Logical plans

As previously stated, **Catalyst** is in charge of creating and optimizing the plan, which is then converted into a DAG and sent to the executors. The following diagram shows the steps that Catalyst performs to produce an execution plan:

<img src="../images/catalyst_steps.webp" title="Catalyst Steps Diagram" width="700px"/>

To see the plan generated by **Catalyst**, call the function `explain()`, which displays the physical plan by default.

In [24]:
df.limit(10).explain()
spark.sql('SELECT * FROM students LIMIT 10').explain()

== Physical Plan ==
CollectLimit 10
+- FileScan json [_id#360,age#361L,email#362,grade#363,name#364,surname#365] Batched: false, DataFilters: [], Format: JSON, Location: InMemoryFileIndex(1 paths)[file:/home/amarchan/Documentos/formacion_numa/spark_NUMA_7/notebooks/d..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<_id:string,age:bigint,email:string,grade:double,name:string,surname:string>


== Physical Plan ==
CollectLimit 10
+- FileScan json [_id#360,age#361L,email#362,grade#363,name#364,surname#365] Batched: false, DataFilters: [], Format: JSON, Location: InMemoryFileIndex(1 paths)[file:/home/amarchan/Documentos/formacion_numa/spark_NUMA_7/notebooks/d..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<_id:string,age:bigint,email:string,grade:double,name:string,surname:string>




#### Getting various plans
Before Apache Spark 3.0, there were only two modes available to format the explain output:

- `explain(extended=False)` which displayed only the physical plan
- `explain(extended=True)` which displayed all the plans (logical and physical)

In [25]:
df.limit(10).explain(extended=True)

== Parsed Logical Plan ==
GlobalLimit 10
+- LocalLimit 10
   +- Relation [_id#360,age#361L,email#362,grade#363,name#364,surname#365] json

== Analyzed Logical Plan ==
_id: string, age: bigint, email: string, grade: double, name: string, surname: string
GlobalLimit 10
+- LocalLimit 10
   +- Relation [_id#360,age#361L,email#362,grade#363,name#364,surname#365] json

== Optimized Logical Plan ==
GlobalLimit 10
+- LocalLimit 10
   +- Relation [_id#360,age#361L,email#362,grade#363,name#364,surname#365] json

== Physical Plan ==
CollectLimit 10
+- FileScan json [_id#360,age#361L,email#362,grade#363,name#364,surname#365] Batched: false, DataFilters: [], Format: JSON, Location: InMemoryFileIndex(1 paths)[file:/home/amarchan/Documentos/formacion_numa/spark_NUMA_7/notebooks/d..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<_id:string,age:bigint,email:string,grade:double,name:string,surname:string>



Starting with Apache Spark 3.0, a new `mode` parameter selects the output format of the plan:

- `explain(mode="simple")` which will display the physical plan
- `explain(mode="extended")` which will display the physical and logical plans (like `extended=True`)
- `explain(mode="codegen")` which will display the Java code planned to be executed
- `explain(mode="cost")` which will display the optimized logical plan and related statistics (if they exist)
- `explain(mode="formatted")` which will display a split output composed of a physical plan outline and a section with the details of each node

In [26]:
df.limit(10).explain(mode="formatted")

== Physical Plan ==
CollectLimit (2)
+- Scan json  (1)


(1) Scan json 
Output [6]: [_id#360, age#361L, email#362, grade#363, name#364, surname#365]
Batched: false
Location: InMemoryFileIndex [file:/home/amarchan/Documentos/formacion_numa/spark_NUMA_7/notebooks/datasets/students.json]
ReadSchema: struct<_id:string,age:bigint,email:string,grade:double,name:string,surname:string>

(2) CollectLimit
Input [6]: [_id#360, age#361L, email#362, grade#363, name#364, surname#365]
Arguments: 10




### Top 5 students with the highest grades

#### DataFrame mode

In [27]:
from pyspark.sql.functions import desc
df.select('*').orderBy(desc('grade')).limit(5).show()

+--------------------+---+--------------------+-----+-------+----------------+
|                 _id|age|               email|grade|   name|         surname|
+--------------------+---+--------------------+-----+-------+----------------+
|9e27157e-062b-47f...| 22|LaiayVictoria@gma...| 10.0|   Laia|   Segovia Lopez|
|1b36d010-64c0-4aa...| 23|Elena1993@outlook...| 9.99|  Elena| Sanchez Gisbert|
|afae424a-3c2b-43f...| 25|Gomez.Segarra@hot...| 9.99|   Sara|   Gomez Segarra|
|cc8f7297-5920-431...| 25|David.Pascual@yah...| 9.98|  David|Pascual Gonzalez|
|8a4da0df-2988-4f9...| 19|Esteban1997@gmail...| 9.98|Esteban|   Comas Segovia|
+--------------------+---+--------------------+-----+-------+----------------+



In [28]:
df.select('*').orderBy(desc('grade')).limit(5).explain(mode="formatted")

== Physical Plan ==
TakeOrderedAndProject (2)
+- Scan json  (1)


(1) Scan json 
Output [6]: [_id#360, age#361L, email#362, grade#363, name#364, surname#365]
Batched: false
Location: InMemoryFileIndex [file:/home/amarchan/Documentos/formacion_numa/spark_NUMA_7/notebooks/datasets/students.json]
ReadSchema: struct<_id:string,age:bigint,email:string,grade:double,name:string,surname:string>

(2) TakeOrderedAndProject
Input [6]: [_id#360, age#361L, email#362, grade#363, name#364, surname#365]
Arguments: 5, [grade#363 DESC NULLS LAST], [_id#360, age#361L, email#362, grade#363, name#364, surname#365]
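
The `TakeOrderedAndProject` node in the plan above suggests that Spark does not need to fully sort the data to answer an "order by, then limit 5" query: it only has to keep track of the best 5 rows seen so far. A rough plain-Python analogue of that idea (no Spark required) uses `heapq.nlargest`. The sample names and grades below are made up for illustration, and this is only a sketch of the concept, not Spark's actual implementation.

```python
import heapq

# Hypothetical (name, grade) pairs standing in for the students table.
students = [
    ("Ana", 7.5), ("Luis", 9.9), ("Marta", 8.2), ("Pau", 6.1),
    ("Nora", 9.4), ("Iker", 7.7), ("Eva", 10.0), ("Joan", 8.8),
]

# Keep only the 5 best rows while scanning, instead of sorting everything.
top5 = heapq.nlargest(5, students, key=lambda s: s[1])
print(top5)
```
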




#### SQL mode

In [29]:
spark.sql('SELECT * FROM students ORDER BY grade DESC LIMIT 5').show()

+--------------------+---+--------------------+-----+-------+----------------+
|                 _id|age|               email|grade|   name|         surname|
+--------------------+---+--------------------+-----+-------+----------------+
|9e27157e-062b-47f...| 22|LaiayVictoria@gma...| 10.0|   Laia|   Segovia Lopez|
|1b36d010-64c0-4aa...| 23|Elena1993@outlook...| 9.99|  Elena| Sanchez Gisbert|
|afae424a-3c2b-43f...| 25|Gomez.Segarra@hot...| 9.99|   Sara|   Gomez Segarra|
|cc8f7297-5920-431...| 25|David.Pascual@yah...| 9.98|  David|Pascual Gonzalez|
|8a4da0df-2988-4f9...| 19|Esteban1997@gmail...| 9.98|Esteban|   Comas Segovia|
+--------------------+---+--------------------+-----+-------+----------------+



### Max and Mean grade

#### DataFrame mode

In [30]:
# Note: importing `max` this way shadows Python's built-in max() for the rest of the session
from pyspark.sql.functions import max, mean
df.select(max('grade').alias('max_grade'), mean('grade').alias('mean_grade')).show()

+---------+-----------------+
|max_grade|       mean_grade|
+---------+-----------------+
|     10.0|7.954802000000002|
+---------+-----------------+



#### SQL mode

In [31]:
spark.sql("""
SELECT 
    MAX(grade) AS max_grade,
    MEAN(grade) AS mean_grade
FROM
    students
""").show()

+---------+-----------------+
|max_grade|       mean_grade|
+---------+-----------------+
|     10.0|7.954802000000002|
+---------+-----------------+



### Top 5 students with highest mean grade

#### DataFrame mode

In [32]:
from pyspark.sql.functions import mean
df.groupBy('name', 'surname') \
    .agg(mean('grade').alias('mean_grade')) \
    .orderBy(desc('mean_grade')) \
    .limit(5).show()

+--------+----------------+----------+
|    name|         surname|mean_grade|
+--------+----------------+----------+
|    Laia|   Segovia Lopez|      10.0|
|   Elena| Sanchez Gisbert|      9.99|
| Esteban|   Comas Segovia|      9.98|
|   David|Pascual Gonzalez|      9.98|
|Fernando| Bermejo Gisbert|      9.98|
+--------+----------------+----------+



#### SQL mode

In [33]:
spark.sql("""
SELECT 
    name, 
    surname,
    MEAN(grade) AS mean_grade
FROM
    students
GROUP BY
    name,
    surname
ORDER BY
    mean_grade DESC
LIMIT 5
""").show()

+--------+----------------+----------+
|    name|         surname|mean_grade|
+--------+----------------+----------+
|    Laia|   Segovia Lopez|      10.0|
|   Elena| Sanchez Gisbert|      9.99|
| Esteban|   Comas Segovia|      9.98|
|   David|Pascual Gonzalez|      9.98|
|Fernando| Bermejo Gisbert|      9.98|
+--------+----------------+----------+



### Average grade per student age

#### DataFrame mode

In [34]:
from pyspark.sql.functions import mean
df.groupBy('age') \
    .agg(mean('grade').alias('mean_grade')) \
    .orderBy('age').show()

+---+------------------+
|age|        mean_grade|
+---+------------------+
| 18| 7.935931528662416|
| 19|7.9874188034187945|
| 20| 7.920192791282483|
| 21|7.9860449050086375|
| 22| 7.916797804208606|
| 23|7.9727236971484805|
| 24| 7.999281364190009|
| 25|7.9337354651162775|
| 26| 7.903791102514502|
| 27|  7.92304878048781|
| 28| 7.921007194244603|
| 29| 8.041525423728814|
| 30| 8.114893617021275|
| 31| 8.014242424242425|
| 32|             8.091|
| 33| 8.180588235294119|
| 34|7.9399999999999995|
| 35|             7.698|
| 36|              7.68|
| 37|              8.11|
+---+------------------+



#### SQL mode

In [35]:
# No ORDER BY here, so the row order is not guaranteed (compare with the sorted DataFrame version)
spark.sql("""
SELECT 
    age,
    MEAN(grade) AS mean_grade
FROM
    students
GROUP BY
    age
""").show()

+---+------------------+
|age|        mean_grade|
+---+------------------+
| 29| 8.041525423728814|
| 26| 7.903791102514502|
| 19|7.9874188034187945|
| 22| 7.916797804208606|
| 34|7.9399999999999995|
| 32|             8.091|
| 31| 8.014242424242425|
| 25|7.9337354651162775|
| 27|  7.92304878048781|
| 28| 7.921007194244603|
| 33| 8.180588235294119|
| 37|              8.11|
| 35|             7.698|
| 36|              7.68|
| 18| 7.935931528662416|
| 21|7.9860449050086375|
| 30| 8.114893617021275|
| 23|7.9727236971484805|
| 20| 7.920192791282483|
| 24| 7.999281364190009|
+---+------------------+



### Add column `excellent` to indicate that a student has a grade over 9.5

#### DataFrame mode

In [36]:
from pyspark.sql.functions import when, col
df.withColumn('excellent', when(col('grade') > 9.5, 'YES').otherwise('NO')).show()

+--------------------+---+--------------------+-----+--------+-----------------+---------+
|                 _id|age|               email|grade|    name|          surname|excellent|
+--------------------+---+--------------------+-----+--------+-----------------+---------+
|33a624e7-e6f1-40b...| 23|Valeria.Sebastian...| 7.56| Valeria| Sebastian Garcia|       NO|
|2cd47675-43f3-415...| 23|Sanchez.Abascal@g...| 8.16|    Emma|  Sanchez Abascal|       NO|
|594ea4e7-75e3-456...| 20|Sarabia.Lopez@gma...| 8.22| Agustin|    Sarabia Lopez|       NO|
|3b521244-d2d4-40b...| 25|MartinaySebastian...| 7.67| Martina|Corominas Sarabia|       NO|
|e6f52130-362f-4a5...| 19|DavidyValeria@gma...| 7.45|   David|   Miranda Grande|       NO|
|cee04454-f6ea-48b...| 20|Lopez.Bernal@outl...| 7.35|    Laia|     Lopez Bernal|       NO|
|6e5b75cd-0d5f-41f...| 22|MarcosySantiago@h...|  6.8|  Marcos|     Garcia Aznar|       NO|
|47435195-80b1-473...| 18|Judith.Garcia@gma...|  9.1|  Judith|      Garcia Cruz|       NO|

#### SQL mode

In [37]:
spark.sql("""
SELECT 
    *,
    CASE 
        WHEN grade > 9.5 THEN 'YES'
        ELSE 'NO'
    END AS excellent
FROM
    students
""").show()

+--------------------+---+--------------------+-----+--------+-----------------+---------+
|                 _id|age|               email|grade|    name|          surname|excellent|
+--------------------+---+--------------------+-----+--------+-----------------+---------+
|33a624e7-e6f1-40b...| 23|Valeria.Sebastian...| 7.56| Valeria| Sebastian Garcia|       NO|
|2cd47675-43f3-415...| 23|Sanchez.Abascal@g...| 8.16|    Emma|  Sanchez Abascal|       NO|
|594ea4e7-75e3-456...| 20|Sarabia.Lopez@gma...| 8.22| Agustin|    Sarabia Lopez|       NO|
|3b521244-d2d4-40b...| 25|MartinaySebastian...| 7.67| Martina|Corominas Sarabia|       NO|
|e6f52130-362f-4a5...| 19|DavidyValeria@gma...| 7.45|   David|   Miranda Grande|       NO|
|cee04454-f6ea-48b...| 20|Lopez.Bernal@outl...| 7.35|    Laia|     Lopez Bernal|       NO|
|6e5b75cd-0d5f-41f...| 22|MarcosySantiago@h...|  6.8|  Marcos|     Garcia Aznar|       NO|
|47435195-80b1-473...| 18|Judith.Garcia@gma...|  9.1|  Judith|      Garcia Cruz|       NO|

## 4. Spark SQL Joins

Spark DataFrames support all the basic SQL join types: `INNER`, `LEFT OUTER`, `RIGHT OUTER`, `LEFT ANTI`, `LEFT SEMI`, `CROSS` and `FULL OUTER`. Joins are wide transformations that shuffle data over the network, so they can cause serious performance problems when not designed with care.

On the other hand, Spark SQL joins benefit from optimizations applied by default (thanks to the DataFrame and Dataset APIs), although there are still performance issues to consider when using them.
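
To make the shuffle more concrete, here is a minimal plain-Python sketch (no Spark required) of hash partitioning by join key, the idea behind a shuffle-based join: rows from both sides that share a key are routed to the same partition so they can be matched locally. The helper `hash_partition`, the partition count and the sample rows are hypothetical; Spark's real partitioner and join strategies differ in detail.

```python
from collections import defaultdict

def hash_partition(rows, key, num_partitions):
    """Route each row to a partition chosen from its join key."""
    partitions = defaultdict(list)
    for row in rows:
        # Integer keys hash to themselves in Python, so this is deterministic.
        partitions[hash(row[key]) % num_partitions].append(row)
    return partitions

# Hypothetical sample rows, loosely modeled on an employees/departments pair.
employees = [
    {"emp_id": 1, "emp_dept_id": 10},
    {"emp_id": 2, "emp_dept_id": 20},
    {"emp_id": 3, "emp_dept_id": 10},
]
departments = [
    {"dept_id": 10, "dept_name": "Finance"},
    {"dept_id": 20, "dept_name": "Marketing"},
]

emp_parts = hash_partition(employees, "emp_dept_id", num_partitions=4)
dept_parts = hash_partition(departments, "dept_id", num_partitions=4)

# After the "shuffle", matching keys sit in the same partition on both sides,
# so each partition can be joined independently of the others.
for p in sorted(set(emp_parts) | set(dept_parts)):
    print(p, emp_parts.get(p, []), dept_parts.get(p, []))
```

Every row that has to move to another partition is network traffic in a real cluster, which is why the size of both inputs and the distribution of the join keys matter so much for join performance.
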

### Most common types of Joins

<img src="../images/joins.jpg" title="Types of Joins" width="700px"/>

### How to perform a Join in Spark (cheat sheet)

1. Referencing columns with different names in each df:

```python
dfA.join(dfB, dfA["idA"] == dfB["idB"], "type-of-join")

```

2. Key column has the same name in both DataFrames:

```python
dfA.join(dfB, on="id", how="type-of-join")

```

3. More than one key column:

```python
dfA.join(dfB, ["id", "code"], "type-of-join")

dfA.join(dfB, (dfA["id"] == dfB["id"]) & (dfA["code"] == dfB["code"]), "type-of-join")

```

4. Keeping both key columns (afterwards, always reference the column through its source DataFrame to avoid ambiguity):

```python
dfC = dfA.join(dfB, dfA["id"] == dfB["id"], "type-of-join")

dfC.select(dfA["id"]).show()
```

### Read the example data

In [39]:
df_emp = spark.read.json("../datasets/employees.json")
df_emp.show()

+-----------+------+------+--------+------+---------------+-----------+
|emp_dept_id|emp_id|gender|    name|salary|superior_emp_id|year_joined|
+-----------+------+------+--------+------+---------------+-----------+
|         10|     1|     M|   Smith|  3000|             -1|       2018|
|         20|     2|     M|    Rose|  4000|              1|       2010|
|         10|     4|     F|   Jones|  2000|              2|       2005|
|         10|     3|     M|Williams|  1000|              1|       2010|
|         50|     6|      |   Brown|    -1|              2|       2010|
|         40|     5|      |   Brown|    -1|              2|       2010|
+-----------+------+------+--------+------+---------------+-----------+



In [41]:
df_dept = spark.read.json("../datasets/departments.json")
df_dept.show()

+-------+---------+
|dept_id|dept_name|
+-------+---------+
|     10|  Finance|
|     20|Marketing|
|     30|    Sales|
|     40|       IT|
+-------+---------+



### Inner Join

In [42]:
# df_emp.join(df_dept,  df_emp["emp_dept_id"] == df_dept["dept_id"], how="inner").show()
df_emp.join(df_dept, df_emp["emp_dept_id"] == df_dept["dept_id"]).show()

+-----------+------+------+--------+------+---------------+-----------+-------+---------+
|emp_dept_id|emp_id|gender|    name|salary|superior_emp_id|year_joined|dept_id|dept_name|
+-----------+------+------+--------+------+---------------+-----------+-------+---------+
|         10|     1|     M|   Smith|  3000|             -1|       2018|     10|  Finance|
|         20|     2|     M|    Rose|  4000|              1|       2010|     20|Marketing|
|         10|     4|     F|   Jones|  2000|              2|       2005|     10|  Finance|
|         10|     3|     M|Williams|  1000|              1|       2010|     10|  Finance|
|         40|     5|      |   Brown|    -1|              2|       2010|     40|       IT|
+-----------+------+------+--------+------+---------------+-----------+-------+---------+



- The employee that belongs to the nonexistent department with id 50 is dropped
- The Sales department, which has no associated employees, is dropped

### Full Outer Join

A full outer join (a.k.a. `outer`, `full` or `fullouter`) returns all rows from both DataFrames; where the join expression doesn't match, the columns of the missing side are null.

In [43]:
df_emp.join(df_dept, df_emp["emp_dept_id"] == df_dept["dept_id"],"outer").show()
# df_emp.join(df_dept, df_emp["emp_dept_id"] == df_dept["dept_id"],"fullouter").show()
# df_emp.join(df_dept, df_emp["emp_dept_id"] == df_dept["dept_id"],"full").show()

+-----------+------+------+--------+------+---------------+-----------+-------+---------+
|emp_dept_id|emp_id|gender|    name|salary|superior_emp_id|year_joined|dept_id|dept_name|
+-----------+------+------+--------+------+---------------+-----------+-------+---------+
|         10|     1|     M|   Smith|  3000|             -1|       2018|     10|  Finance|
|         10|     4|     F|   Jones|  2000|              2|       2005|     10|  Finance|
|         10|     3|     M|Williams|  1000|              1|       2010|     10|  Finance|
|         20|     2|     M|    Rose|  4000|              1|       2010|     20|Marketing|
|       null|  null|  null|    null|  null|           null|       null|     30|    Sales|
|         40|     5|      |   Brown|    -1|              2|       2010|     40|       IT|
|         50|     6|      |   Brown|    -1|              2|       2010|   null|     null|
+-----------+------+------+--------+------+---------------+-----------+-------+---------+



- There is no department with id 50 hence the values for the department are null
- There are no employees in the Sales department hence the employee values are null

### Left Join

A left join (a.k.a. left outer join) returns all rows from the left DataFrame regardless of whether a match is found on the right. Where the join expression doesn't match, the right-side columns are null; right-side rows without a match are dropped.

In [44]:
df_emp.join(df_dept, df_emp["emp_dept_id"] == df_dept["dept_id"],"left").show()
# df_emp.join(df_dept, df_emp["emp_dept_id"] == df_dept["dept_id"],"leftouter").show()

+-----------+------+------+--------+------+---------------+-----------+-------+---------+
|emp_dept_id|emp_id|gender|    name|salary|superior_emp_id|year_joined|dept_id|dept_name|
+-----------+------+------+--------+------+---------------+-----------+-------+---------+
|         10|     1|     M|   Smith|  3000|             -1|       2018|     10|  Finance|
|         20|     2|     M|    Rose|  4000|              1|       2010|     20|Marketing|
|         10|     4|     F|   Jones|  2000|              2|       2005|     10|  Finance|
|         10|     3|     M|Williams|  1000|              1|       2010|     10|  Finance|
|         50|     6|      |   Brown|    -1|              2|       2010|   null|     null|
|         40|     5|      |   Brown|    -1|              2|       2010|     40|       IT|
+-----------+------+------+--------+------+---------------+-----------+-------+---------+



- There is no department with id 50 hence the values for the department are null
- There are no employees in the Sales department hence no row is shown

### Right Join

A right join (a.k.a. right outer join) is the opposite of a left join: it returns all rows from the right DataFrame regardless of whether a match is found on the left. Where the join expression doesn't match, the left-side columns are null; left-side rows without a match are dropped.

In [45]:
df_emp.join(df_dept, df_emp["emp_dept_id"] == df_dept["dept_id"],"right").show()
# df_emp.join(df_dept, df_emp["emp_dept_id"] == df_dept["dept_id"],"rightouter").show()

+-----------+------+------+--------+------+---------------+-----------+-------+---------+
|emp_dept_id|emp_id|gender|    name|salary|superior_emp_id|year_joined|dept_id|dept_name|
+-----------+------+------+--------+------+---------------+-----------+-------+---------+
|         10|     3|     M|Williams|  1000|              1|       2010|     10|  Finance|
|         10|     4|     F|   Jones|  2000|              2|       2005|     10|  Finance|
|         10|     1|     M|   Smith|  3000|             -1|       2018|     10|  Finance|
|         20|     2|     M|    Rose|  4000|              1|       2010|     20|Marketing|
|       null|  null|  null|    null|  null|           null|       null|     30|    Sales|
|         40|     5|      |   Brown|    -1|              2|       2010|     40|       IT|
+-----------+------+------+--------+------+---------------+-----------+-------+---------+



- There is no department with id 50 hence no row is shown
- There are no employees in the Sales department hence the values of the employee columns are null
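
The four joins above, plus the left semi and left anti joins mentioned at the start of this section but not run in this notebook, can be modeled in a few lines of plain Python (no Spark required). This is only a sketch of the semantics: the `join` helper is hypothetical, rows are reduced to a few fields copied from the example data, and null keys are ignored. In Spark itself the corresponding `how` strings include `"leftsemi"` and `"leftanti"`.

```python
# Rows reduced to (emp_id, name, emp_dept_id) and (dept_id, dept_name),
# copied from the employees and departments example data.
employees = [
    (1, "Smith", 10), (2, "Rose", 20), (4, "Jones", 10),
    (3, "Williams", 10), (6, "Brown", 50), (5, "Brown", 40),
]
departments = [(10, "Finance"), (20, "Marketing"), (30, "Sales"), (40, "IT")]

def join(left, right, how):
    """Join employees (key = last field) with departments (key = first field)."""
    right_keys = {r[0] for r in right}
    left_keys = {l[-1] for l in left}
    if how == "leftsemi":   # left rows that have a match; left columns only
        return [l for l in left if l[-1] in right_keys]
    if how == "leftanti":   # left rows that have no match; left columns only
        return [l for l in left if l[-1] not in right_keys]
    rows = [(l, r) for l in left for r in right if l[-1] == r[0]]  # inner part
    if how in ("left", "full"):    # keep unmatched left rows, right side is None
        rows += [(l, None) for l in left if l[-1] not in right_keys]
    if how in ("right", "full"):   # keep unmatched right rows, left side is None
        rows += [(None, r) for r in right if r[0] not in left_keys]
    return rows

for how in ("inner", "left", "right", "full", "leftsemi", "leftanti"):
    print(how, len(join(employees, departments, how)))
```

The row counts for the first four modes (5, 6, 6 and 7) agree with the outputs shown in the sections above.
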

### Cross Join

This join combines each row of the first table with each row of the second table. For example, if one table has `m` rows and the other has `n`, the resulting table has `m * n` rows.

**Note**: A table of 1000 customers combined with a table of 1000 products would produce 1,000,000 records! Try to avoid this with large tables in production.
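
As a quick sanity check of the `m * n` rule, the following plain-Python sketch (no Spark required) builds the same kind of pairing with `itertools.product`. The id lists simply mirror the sizes of the example data (6 employees, 4 departments); the variable names are illustrative.

```python
from itertools import product

# Sizes taken from the example data: 6 employees and 4 departments.
employee_ids = [1, 2, 3, 4, 5, 6]
department_ids = [10, 20, 30, 40]

# A cross join pairs every left row with every right row.
pairs = list(product(employee_ids, department_ids))
print(len(pairs))  # 6 * 4 = 24

# The example from the note, computed without materializing the rows.
customers, products = 1000, 1000
print(customers * products)  # 1000000
```
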

In [46]:
df_emp.crossJoin(df_dept).show()

+-----------+------+------+--------+------+---------------+-----------+-------+---------+
|emp_dept_id|emp_id|gender|    name|salary|superior_emp_id|year_joined|dept_id|dept_name|
+-----------+------+------+--------+------+---------------+-----------+-------+---------+
|         10|     1|     M|   Smith|  3000|             -1|       2018|     10|  Finance|
|         10|     1|     M|   Smith|  3000|             -1|       2018|     20|Marketing|
|         10|     1|     M|   Smith|  3000|             -1|       2018|     30|    Sales|
|         10|     1|     M|   Smith|  3000|             -1|       2018|     40|       IT|
|         20|     2|     M|    Rose|  4000|              1|       2010|     10|  Finance|
|         20|     2|     M|    Rose|  4000|              1|       2010|     20|Marketing|
|         20|     2|     M|    Rose|  4000|              1|       2010|     30|    Sales|
|         20|     2|     M|    Rose|  4000|              1|       2010|     40|       IT|
|         

## 5. User Defined Functions (UDFs)
User Defined Functions (UDFs) in Spark allow users to define their own transformations using Python or other programming languages, and then apply those functions on a Spark DataFrame. This can be very powerful when you need to make specific transformations to your data that aren't easily achieved using Spark's built-in functions.

When a UDF is defined in PySpark, the Python function is serialized (pickled) and shipped to the executor nodes, where Python worker processes deserialize and run it. This allows the UDF to be executed on rows of the DataFrame in parallel. However, because every row must be serialized and transferred between the JVM and the Python workers, UDFs can be considerably slower than native Spark functions.

Here's how you can define and use a UDF in Spark with Python (PySpark):

In [47]:
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def add_suffix(name):
    return name + "_UDF"

suffix_udf = udf(add_suffix, StringType())
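
One caveat with `add_suffix` as written: Spark passes SQL nulls to Python UDFs as `None`, and `None + "_UDF"` raises a `TypeError`. A minimal null-safe variant, shown in plain Python so it can be checked without a cluster, might look like the following. `add_suffix_safe` is a hypothetical name; it can be wrapped with `udf(add_suffix_safe, StringType())` in the same way as `add_suffix`.

```python
from typing import Optional

def add_suffix_safe(name: Optional[str]) -> Optional[str]:
    """Append "_UDF" to a name, passing nulls (None) through unchanged."""
    if name is None:
        return None
    return name + "_UDF"

print(add_suffix_safe("Valeria"))  # Valeria_UDF
print(add_suffix_safe(None))       # None
```
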

#### DataFrame mode

In [49]:
df.withColumn("name_with_suffix", suffix_udf(df["name"])) \
  .select("name", "name_with_suffix") \
  .show()

+--------+----------------+
|    name|name_with_suffix|
+--------+----------------+
| Valeria|     Valeria_UDF|
|    Emma|        Emma_UDF|
| Agustin|     Agustin_UDF|
| Martina|     Martina_UDF|
|   David|       David_UDF|
|    Laia|        Laia_UDF|
|  Marcos|      Marcos_UDF|
|  Judith|      Judith_UDF|
|    Iker|        Iker_UDF|
|   Pablo|       Pablo_UDF|
|  Marcos|      Marcos_UDF|
|   Oriol|       Oriol_UDF|
|  Marcos|      Marcos_UDF|
|  Sandra|      Sandra_UDF|
|   Lucia|       Lucia_UDF|
|  Marcos|      Marcos_UDF|
|    Emma|        Emma_UDF|
|   Pedro|       Pedro_UDF|
|Santiago|    Santiago_UDF|
|    Juan|        Juan_UDF|
+--------+----------------+
only showing top 20 rows



#### SQL mode

In [51]:
# Register the UDF
spark.udf.register("suffixSQL", add_suffix, StringType())
spark.sql("""
SELECT 
    name,
    suffixSQL(name) as name_with_suffix
FROM
    students
""").show()

+--------+----------------+
|    name|name_with_suffix|
+--------+----------------+
| Valeria|     Valeria_UDF|
|    Emma|        Emma_UDF|
| Agustin|     Agustin_UDF|
| Martina|     Martina_UDF|
|   David|       David_UDF|
|    Laia|        Laia_UDF|
|  Marcos|      Marcos_UDF|
|  Judith|      Judith_UDF|
|    Iker|        Iker_UDF|
|   Pablo|       Pablo_UDF|
|  Marcos|      Marcos_UDF|
|   Oriol|       Oriol_UDF|
|  Marcos|      Marcos_UDF|
|  Sandra|      Sandra_UDF|
|   Lucia|       Lucia_UDF|
|  Marcos|      Marcos_UDF|
|    Emma|        Emma_UDF|
|   Pedro|       Pedro_UDF|
|Santiago|    Santiago_UDF|
|    Juan|        Juan_UDF|
+--------+----------------+
only showing top 20 rows

