#### User-defined functions (UDFs)

In [1]:
from pyspark.sql.types import LongType

In [2]:
# Create cubed function

def cubed(s):
    return s * s * s

In [3]:
# Register UDF

spark.udf.register("cubed", cubed, LongType())

<function __main__.cubed(s)>

In [4]:
# Generate temporary view

spark.range(1, 9).createOrReplaceTempView("udf_test")

In [5]:
# Use Spark SQL to execute the cubed() function

spark.sql("SELECT id, cubed(id) AS id_cubed FROM udf_test").show()

+---+--------+
| id|id_cubed|
+---+--------+
|  1|       1|
|  2|       8|
|  3|      27|
|  4|      64|
|  5|     125|
|  6|     216|
|  7|     343|
|  8|     512|
+---+--------+



PySpark UDFs have historically been slower than Scala UDFs, because data has to be moved between the JVM and Python and is processed one row at a time.
To address this, Pandas UDFs (also known as vectorized UDFs) were introduced in Apache Spark 2.3.
A Pandas UDF uses Apache Arrow to transfer data and pandas to work with it.
You define a Pandas UDF by using pandas_udf as a decorator or by wrapping the function with it.
Instead of operating on individual inputs row by row, a Pandas UDF operates on a pandas Series or DataFrame.
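The wrapping approach described above can be sketched as follows, reusing the cubed name from the earlier cells. This is a minimal sketch, not a definitive implementation: make_cubed_udf is a hypothetical helper introduced here so that pyspark is imported only when the UDF is actually built, and the snippet assumes pandas is installed (plus pyspark and pyarrow for the Spark part).

```python
import pandas as pd


def cubed(a: pd.Series) -> pd.Series:
    """Cube a whole batch of values at once instead of one row at a time."""
    return a * a * a


def make_cubed_udf():
    """Wrap cubed() as a Pandas UDF (hypothetical helper; needs pyspark and pyarrow)."""
    # Imported lazily so the plain pandas function can be tried without Spark.
    from pyspark.sql.functions import pandas_udf
    from pyspark.sql.types import LongType

    return pandas_udf(cubed, returnType=LongType())


# Local check with plain pandas; no Spark session is required for this part.
print(cubed(pd.Series([1, 2, 3])).tolist())  # [1, 8, 27]
```

With an active spark session, something like `cubed_udf = make_cubed_udf()` followed by `spark.range(1, 4).select("id", cubed_udf(col("id"))).show()` (with col imported from pyspark.sql.functions) should apply the function to Arrow-transferred batches rather than to single rows.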

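As of Apache Spark 3.0 (with Python 3.6 and above), Pandas UDFs are split into two API categories:
* Pandas UDFs: Spark infers the UDF type from Python type hints such as pandas.Series and pandas.DataFrame.
* Pandas Function APIs: let you apply a local Python function directly to a PySpark DataFrame, where both the input and the output are pandas instances.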

In [2]:
from pyspark.sql.functions import expr

# Set file paths

tripdelaysFilePath = "C:/Users/alice.marchi/Downloads/LearningSparkV2-master/databricks-datasets/learning-spark-v2/flights/departuredelays.csv"
airportsnaFilePath = "C:/Users/alice.marchi/Downloads/LearningSparkV2-master/databricks-datasets/learning-spark-v2/flights/airport-codes-na.txt"
 
# Obtain airports data set

airportsna = (spark.read
 .format("csv")
 .options(header="true", inferSchema="true", sep="\t")
 .load(airportsnaFilePath))
airportsna.createOrReplaceTempView("airports_na")

# Obtain departure delays data set

departureDelays = (spark.read
 .format("csv")
 .options(header="true")
 .load(tripdelaysFilePath))
departureDelays = (departureDelays
 .withColumn("delay", expr("CAST(delay as INT) as delay"))
 .withColumn("distance", expr("CAST(distance as INT) as distance")))
departureDelays.createOrReplaceTempView("departureDelays")

# Create temporary small table

foo = (departureDelays
 .filter(expr("""origin == 'SEA' and destination == 'SFO' and 
 date like '01010%' and delay > 0""")))
foo.createOrReplaceTempView("foo")


The departureDelays DataFrame contains data on more than 1.3 million flights, while the foo DataFrame contains just three rows: 
delayed flights (delay > 0) from SEA to SFO within a specific time range.

In [3]:
spark.sql("SELECT * FROM airports_na LIMIT 10").show()

+-----------+-----+-------+----+
|       City|State|Country|IATA|
+-----------+-----+-------+----+
| Abbotsford|   BC| Canada| YXX|
|   Aberdeen|   SD|    USA| ABR|
|    Abilene|   TX|    USA| ABI|
|      Akron|   OH|    USA| CAK|
|    Alamosa|   CO|    USA| ALS|
|     Albany|   GA|    USA| ABY|
|     Albany|   NY|    USA| ALB|
|Albuquerque|   NM|    USA| ABQ|
| Alexandria|   LA|    USA| AEX|
|  Allentown|   PA|    USA| ABE|
+-----------+-----+-------+----+



In [4]:
spark.sql("SELECT * FROM departureDelays LIMIT 10").show()

+--------+-----+--------+------+-----------+
|    date|delay|distance|origin|destination|
+--------+-----+--------+------+-----------+
|01011245|    6|     602|   ABE|        ATL|
|01020600|   -8|     369|   ABE|        DTW|
|01021245|   -2|     602|   ABE|        ATL|
|01020605|   -4|     602|   ABE|        ATL|
|01031245|   -4|     602|   ABE|        ATL|
|01030605|    0|     602|   ABE|        ATL|
|01041243|   10|     602|   ABE|        ATL|
|01040605|   28|     602|   ABE|        ATL|
|01051245|   88|     602|   ABE|        ATL|
|01050605|    9|     602|   ABE|        ATL|
+--------+-----+--------+------+-----------+



In [5]:
spark.sql("SELECT * FROM foo").show()

+--------+-----+--------+------+-----------+
|    date|delay|distance|origin|destination|
+--------+-----+--------+------+-----------+
|01010710|   31|     590|   SEA|        SFO|
|01010955|  104|     590|   SEA|        SFO|
|01010730|    5|     590|   SEA|        SFO|
+--------+-----+--------+------+-----------+



#### Union
A common pattern in Apache Spark is to union two DataFrames that share the same schema. This can be done with the union() method.

In [6]:
# Union two tables

bar = departureDelays.union(foo)
bar.createOrReplaceTempView("bar")

# Show the union (filtering for SEA and SFO in a specific time range)

bar.filter(expr("""origin == 'SEA' AND destination == 'SFO'
AND date LIKE '01010%' AND delay > 0""")).show()


+--------+-----+--------+------+-----------+
|    date|delay|distance|origin|destination|
+--------+-----+--------+------+-----------+
|01010710|   31|     590|   SEA|        SFO|
|01010955|  104|     590|   SEA|        SFO|
|01010730|    5|     590|   SEA|        SFO|
|01010710|   31|     590|   SEA|        SFO|
|01010955|  104|     590|   SEA|        SFO|
|01010730|    5|     590|   SEA|        SFO|
+--------+-----+--------+------+-----------+



The bar DataFrame is the union of foo with departureDelays. Applying the same filtering criteria to bar shows the foo rows twice, as expected, because union() keeps duplicate rows.
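The duplication above follows from union() behaving like SQL's UNION ALL rather than UNION. As a minimal sketch of that difference, the snippet below uses SQLite through Python's sqlite3 module as a stand-in for Spark SQL (an assumption made so that it runs without a Spark session), with the three foo rows shown earlier and only a subset of their columns.

```python
import sqlite3

# In-memory SQLite database standing in for Spark SQL (illustration only).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE foo (date TEXT, delay INTEGER, origin TEXT, destination TEXT)"
)
conn.executemany(
    "INSERT INTO foo VALUES (?, ?, ?, ?)",
    [
        ("01010710", 31, "SEA", "SFO"),
        ("01010955", 104, "SEA", "SFO"),
        ("01010730", 5, "SEA", "SFO"),
    ],
)

# UNION ALL keeps every row from both sides, like DataFrame.union().
union_all = conn.execute("SELECT * FROM foo UNION ALL SELECT * FROM foo").fetchall()

# UNION removes duplicate rows, like union() followed by distinct().
union_distinct = conn.execute("SELECT * FROM foo UNION SELECT * FROM foo").fetchall()

print(len(union_all), len(union_distinct))  # 6 3
```

In PySpark, the deduplicated result would correspond to `departureDelays.union(foo).distinct()`.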

In [7]:
spark.sql("""
SELECT * 
 FROM bar 
 WHERE origin = 'SEA' 
 AND destination = 'SFO' 
 AND date LIKE '01010%' 
 AND delay > 0
""").show()

+--------+-----+--------+------+-----------+
|    date|delay|distance|origin|destination|
+--------+-----+--------+------+-----------+
|01010710|   31|     590|   SEA|        SFO|
|01010955|  104|     590|   SEA|        SFO|
|01010730|    5|     590|   SEA|        SFO|
|01010710|   31|     590|   SEA|        SFO|
|01010955|  104|     590|   SEA|        SFO|
|01010730|    5|     590|   SEA|        SFO|
+--------+-----+--------+------+-----------+



#### Joins
A common DataFrame operation is to join two DataFrames (or tables) together. By default, a Spark SQL join is an inner join.

In [12]:
# Join departure delays data (foo) with airport info

foo.join(
 airportsna, 
 airportsna.IATA == foo.origin
).select("City", "State", "date", "delay", "distance", "destination").show()

+-------+-----+--------+-----+--------+-----------+
|   City|State|    date|delay|distance|destination|
+-------+-----+--------+-----+--------+-----------+
|Seattle|   WA|01010710|   31|     590|        SFO|
|Seattle|   WA|01010955|  104|     590|        SFO|
|Seattle|   WA|01010730|    5|     590|        SFO|
+-------+-----+--------+-----+--------+-----------+



In [13]:
spark.sql("""
SELECT a.City, a.State, f.date, f.delay, f.distance, f.destination 
 FROM foo f
 JOIN airports_na a
 ON a.IATA = f.origin
""").show()

+-------+-----+--------+-----+--------+-----------+
|   City|State|    date|delay|distance|destination|
+-------+-----+--------+-----+--------+-----------+
|Seattle|   WA|01010710|   31|     590|        SFO|
|Seattle|   WA|01010955|  104|     590|        SFO|
|Seattle|   WA|01010730|    5|     590|        SFO|
+-------+-----+--------+-----+--------+-----------+
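
Both queries above are inner joins, so a flight whose origin has no match in airports_na would be dropped from the result. The sketch below contrasts an inner join with a left join on toy data. It uses SQLite through Python's sqlite3 module as a stand-in for Spark SQL (an assumption, so the snippet runs without a Spark session); the flights table and the XXX origin code are made up for illustration.

```python
import sqlite3

# In-memory SQLite database standing in for Spark SQL (illustration only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE airports_na (City TEXT, State TEXT, IATA TEXT)")
conn.execute("CREATE TABLE flights (origin TEXT, delay INTEGER)")
conn.execute("INSERT INTO airports_na VALUES ('Seattle', 'WA', 'SEA')")
conn.executemany(
    "INSERT INTO flights VALUES (?, ?)",
    [("SEA", 31), ("XXX", 7)],  # 'XXX' is a made-up code with no airport row
)

# Inner join (the default): unmatched flights disappear.
inner = conn.execute("""
    SELECT f.origin, a.City
      FROM flights f
      JOIN airports_na a ON a.IATA = f.origin
     ORDER BY f.origin
""").fetchall()

# Left join: every flight is kept, with NULL where no airport matches.
left = conn.execute("""
    SELECT f.origin, a.City
      FROM flights f
      LEFT JOIN airports_na a ON a.IATA = f.origin
     ORDER BY f.origin
""").fetchall()

print(inner)  # [('SEA', 'Seattle')]
print(left)   # [('SEA', 'Seattle'), ('XXX', None)]
```

In the DataFrame API, the join type is passed as the third argument, for example `foo.join(airportsna, airportsna.IATA == foo.origin, "left")`.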



#### Spark MySQL
Load data from a JDBC source using load().
(Use Spark to load the employee and department data.)

In [11]:
employees = (spark
 .read
 .format("jdbc")
 .option("url", "jdbc:mysql://localhost:3306/employees")
 .option("driver", "com.mysql.jdbc.Driver")
 .option("dbtable", "employees")
 .option("user", "root")
 .option("password", "root1234-")
 .load())

In [12]:
from pyspark.sql.functions import *

In [13]:
employees.show(10)

+------+----------+----------+---------+------+----------+
|emp_no|birth_date|first_name|last_name|gender| hire_date|
+------+----------+----------+---------+------+----------+
| 10001|1953-09-02|    Georgi|  Facello|     M|1986-06-26|
| 10002|1964-06-02|   Bezalel|   Simmel|     F|1985-11-21|
| 10003|1959-12-03|     Parto|  Bamford|     M|1986-08-28|
| 10004|1954-05-01| Chirstian|  Koblick|     M|1986-12-01|
| 10005|1955-01-21|   Kyoichi| Maliniak|     M|1989-09-12|
| 10006|1953-04-20|    Anneke|  Preusig|     F|1989-06-02|
| 10007|1957-05-23|   Tzvetan|Zielinski|     F|1989-02-10|
| 10008|1958-02-19|    Saniya| Kalloufi|     M|1994-09-15|
| 10009|1952-04-19|    Sumant|     Peac|     F|1985-02-18|
| 10010|1963-06-01| Duangkaew| Piveteau|     F|1989-08-24|
+------+----------+----------+---------+------+----------+
only showing top 10 rows



In [14]:
departments = (spark
 .read
 .format("jdbc")
 .option("url", "jdbc:mysql://localhost:3306/employees")
 .option("driver", "com.mysql.jdbc.Driver")
 .option("dbtable", "departments")
 .option("user", "root")
 .option("password", "root1234-")
 .load())

In [15]:
departments.show(10)

+-------+------------------+
|dept_no|         dept_name|
+-------+------------------+
|   d009|  Customer Service|
|   d005|       Development|
|   d002|           Finance|
|   d003|   Human Resources|
|   d001|         Marketing|
|   d004|        Production|
|   d006|Quality Management|
|   d008|          Research|
|   d007|             Sales|
+-------+------------------+



In [16]:
salaries = (spark
 .read
 .format("jdbc")
 .option("url", "jdbc:mysql://localhost:3306/employees")
 .option("driver", "com.mysql.jdbc.Driver")
 .option("dbtable", "salaries")
 .option("user", "root")
 .option("password", "root1234-")
 .load())

In [17]:
salaries.show(10)

+------+------+----------+----------+
|emp_no|salary| from_date|   to_date|
+------+------+----------+----------+
| 10001| 60117|1986-06-26|1987-06-26|
| 10001| 62102|1987-06-26|1988-06-25|
| 10001| 66074|1988-06-25|1989-06-25|
| 10001| 66596|1989-06-25|1990-06-25|
| 10001| 66961|1990-06-25|1991-06-25|
| 10001| 71046|1991-06-25|1992-06-24|
| 10001| 74333|1992-06-24|1993-06-24|
| 10001| 75286|1993-06-24|1994-06-24|
| 10001| 75994|1994-06-24|1995-06-24|
| 10001| 76884|1995-06-24|1996-06-23|
+------+------+----------+----------+
only showing top 10 rows



In [18]:
titles = (spark
 .read
 .format("jdbc")
 .option("url", "jdbc:mysql://localhost:3306/employees")
 .option("driver", "com.mysql.jdbc.Driver")
 .option("dbtable", "titles")
 .option("user", "root")
 .option("password", "root1234-")
 .load())

In [19]:
titles.show(10)

+------+---------------+----------+----------+
|emp_no|          title| from_date|   to_date|
+------+---------------+----------+----------+
| 10001|Senior Engineer|1986-06-26|9999-01-01|
| 10002|          Staff|1996-08-03|9999-01-01|
| 10003|Senior Engineer|1995-12-03|9999-01-01|
| 10004|       Engineer|1986-12-01|1995-12-01|
| 10004|Senior Engineer|1995-12-01|9999-01-01|
| 10005|   Senior Staff|1996-09-12|9999-01-01|
| 10005|          Staff|1989-09-12|1996-09-12|
| 10006|Senior Engineer|1990-08-05|9999-01-01|
| 10007|   Senior Staff|1996-02-11|9999-01-01|
| 10007|          Staff|1989-02-10|1996-02-11|
+------+---------------+----------+----------+
only showing top 10 rows



Use joins to show the employee information along with each employee's title and salary.

In [25]:
(employees
 .join(titles, employees.emp_no == titles.emp_no)
 .select("first_name", "last_name", "birth_date", "title")
 .show(10))

+----------+---------+----------+----------------+
|first_name|last_name|birth_date|           title|
+----------+---------+----------+----------------+
|  Alassane|  Iwayama|1960-09-19|Technique Leader|
|   Shalesh|  dAstous|1963-09-16|    Senior Staff|
|Aleksander|   Danlos|1953-07-11|        Engineer|
|Aleksander|   Danlos|1953-07-11| Senior Engineer|
|       Uri|  Rullman|1958-10-02|    Senior Staff|
|       Uri|  Rullman|1958-10-02|           Staff|
|   Shushma|     Bahk|1957-03-01|        Engineer|
|   Shushma|     Bahk|1957-03-01| Senior Engineer|
|   Vasiliy|Kermarrec|1957-08-20|        Engineer|
|   Vasiliy|Kermarrec|1957-08-20| Senior Engineer|
+----------+---------+----------+----------------+
only showing top 10 rows



In [27]:
(employees
 .join(salaries, employees.emp_no == salaries.emp_no)
 .select("first_name", "last_name", "birth_date", "salary")
 .show(10))

+----------+---------+----------+------+
|first_name|last_name|birth_date|salary|
+----------+---------+----------+------+
|  Alassane|  Iwayama|1960-09-19| 40000|
|  Alassane|  Iwayama|1960-09-19| 43519|
|  Alassane|  Iwayama|1960-09-19| 46265|
|  Alassane|  Iwayama|1960-09-19| 46865|
|  Alassane|  Iwayama|1960-09-19| 47837|
|  Alassane|  Iwayama|1960-09-19| 52042|
|  Alassane|  Iwayama|1960-09-19| 52370|
|  Alassane|  Iwayama|1960-09-19| 53202|
|  Alassane|  Iwayama|1960-09-19| 56087|
|  Alassane|  Iwayama|1960-09-19| 59252|
+----------+---------+----------+------+
only showing top 10 rows



In [32]:
(employees
 .join(salaries, employees.emp_no == salaries.emp_no)
 .join(titles, employees.emp_no == titles.emp_no)
 .select("first_name", "last_name", "birth_date", "salary", "title")
 .show(10))

+----------+---------+----------+------+----------------+
|first_name|last_name|birth_date|salary|           title|
+----------+---------+----------+------+----------------+
|  Alassane|  Iwayama|1960-09-19| 40000|Technique Leader|
|  Alassane|  Iwayama|1960-09-19| 43519|Technique Leader|
|  Alassane|  Iwayama|1960-09-19| 46265|Technique Leader|
|  Alassane|  Iwayama|1960-09-19| 46865|Technique Leader|
|  Alassane|  Iwayama|1960-09-19| 47837|Technique Leader|
|  Alassane|  Iwayama|1960-09-19| 52042|Technique Leader|
|  Alassane|  Iwayama|1960-09-19| 52370|Technique Leader|
|  Alassane|  Iwayama|1960-09-19| 53202|Technique Leader|
|  Alassane|  Iwayama|1960-09-19| 56087|Technique Leader|
|  Alassane|  Iwayama|1960-09-19| 59252|Technique Leader|
+----------+---------+----------+------+----------------+
only showing top 10 rows



In [35]:
employees.createOrReplaceTempView("employees_tb")

In [36]:
spark.sql("""
SELECT * FROM employees_tb
""").show(10)

+------+----------+----------+---------+------+----------+
|emp_no|birth_date|first_name|last_name|gender| hire_date|
+------+----------+----------+---------+------+----------+
| 10001|1953-09-02|    Georgi|  Facello|     M|1986-06-26|
| 10002|1964-06-02|   Bezalel|   Simmel|     F|1985-11-21|
| 10003|1959-12-03|     Parto|  Bamford|     M|1986-08-28|
| 10004|1954-05-01| Chirstian|  Koblick|     M|1986-12-01|
| 10005|1955-01-21|   Kyoichi| Maliniak|     M|1989-09-12|
| 10006|1953-04-20|    Anneke|  Preusig|     F|1989-06-02|
| 10007|1957-05-23|   Tzvetan|Zielinski|     F|1989-02-10|
| 10008|1958-02-19|    Saniya| Kalloufi|     M|1994-09-15|
| 10009|1952-04-19|    Sumant|     Peac|     F|1985-02-18|
| 10010|1963-06-01| Duangkaew| Piveteau|     F|1989-08-24|
+------+----------+----------+---------+------+----------+
only showing top 10 rows



In [37]:
salaries.createOrReplaceTempView("salaries_tb")

In [38]:
titles.createOrReplaceTempView("titles_tb")

In [39]:
spark.sql("""
SELECT a.first_name, a.last_name, a.birth_date, b.salary, c.title 
 FROM employees_tb AS a
 JOIN salaries_tb AS b
 ON a.emp_no = b.emp_no
 JOIN titles_tb AS c
 ON a.emp_no = c.emp_no
""").show(10)

+----------+---------+----------+------+----------------+
|first_name|last_name|birth_date|salary|           title|
+----------+---------+----------+------+----------------+
|  Alassane|  Iwayama|1960-09-19| 40000|Technique Leader|
|  Alassane|  Iwayama|1960-09-19| 43519|Technique Leader|
|  Alassane|  Iwayama|1960-09-19| 46265|Technique Leader|
|  Alassane|  Iwayama|1960-09-19| 46865|Technique Leader|
|  Alassane|  Iwayama|1960-09-19| 47837|Technique Leader|
|  Alassane|  Iwayama|1960-09-19| 52042|Technique Leader|
|  Alassane|  Iwayama|1960-09-19| 52370|Technique Leader|
|  Alassane|  Iwayama|1960-09-19| 53202|Technique Leader|
|  Alassane|  Iwayama|1960-09-19| 56087|Technique Leader|
|  Alassane|  Iwayama|1960-09-19| 59252|Technique Leader|
+----------+---------+----------+------+----------------+
only showing top 10 rows



##### Difference between RANK and DENSE_RANK (window operations)

##### RANK vs DENSE_RANK

Both functions order values and assign a number to each one.
(For example, if three students have three different grades, 100, 85, and 72, the function assigns the numbers 1, 2, and 3 to the students according to the ordering we chose.)

Both functions use an OVER() clause with PARTITION BY and ORDER BY.
(PARTITION BY is optional, but ORDER BY is required.)

SELECT student_name, RANK() OVER(ORDER BY grades DESC) AS grade_ranking

PARTITION BY groups the rankings: whenever the value in the specified column changes, the ranking starts over.

Example: SELECT student_name, DENSE_RANK() OVER(PARTITION BY subject ORDER BY grades DESC) AS grade_ranking

The difference shows up when, for example, two students have the same grade:

```
Student     Grade   RANK   DENSE_RANK
Madison     100     1      1
Sebastian   100     1      1
Eric        92      3      2
Jessica     76      4      3
Josephine   63      5      4
```

RANK skips the number 2, whereas DENSE_RANK does not skip any value.
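
The table above can be reproduced with a short query. The following is a minimal sketch that uses SQLite (version 3.25 or later, for window functions) through Python's sqlite3 module as a stand-in for Spark SQL, so it runs without a Spark session; the grades table and its column names are illustrative.

```python
import sqlite3

# In-memory SQLite database standing in for Spark SQL (illustration only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE grades (student_name TEXT, grade INTEGER)")
conn.executemany(
    "INSERT INTO grades VALUES (?, ?)",
    [
        ("Jessica", 76),
        ("Madison", 100),
        ("Sebastian", 100),
        ("Eric", 92),
        ("Josephine", 63),
    ],
)

# Compute both rankings side by side over the same ordering.
rows = conn.execute("""
    SELECT student_name,
           grade,
           RANK()       OVER (ORDER BY grade DESC) AS rnk,
           DENSE_RANK() OVER (ORDER BY grade DESC) AS dense_rnk
      FROM grades
     ORDER BY grade DESC, student_name
""").fetchall()

for row in rows:
    print(row)
```

The tie at 100 gives both students rank 1 under either function; the next student gets 3 from RANK and 2 from DENSE_RANK.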

In [46]:
spark.sql("""
SELECT a.first_name, a.last_name, a.birth_date, b.salary, b.from_date, b.to_date,
LAST_VALUE(b.salary) OVER (PARTITION BY a.first_name, a.last_name ORDER BY from_date) AS last_value
 FROM employees_tb AS a
 JOIN salaries_tb AS b
 ON a.emp_no = b.emp_no
""").show(20)

+----------+---------+----------+------+----------+----------+----------+
|first_name|last_name|birth_date|salary| from_date|   to_date|last_value|
+----------+---------+----------+------+----------+----------+----------+
|     Aamer| Feinberg|1953-11-25| 45996|1992-11-02|1993-11-02|     45996|
|     Aamer| Feinberg|1953-11-25| 46979|1993-11-02|1994-11-02|     46979|
|     Aamer| Feinberg|1953-11-25| 49729|1994-11-02|1995-11-02|     49729|
|     Aamer| Feinberg|1953-11-25| 50212|1995-11-02|1996-11-01|     50212|
|     Aamer| Feinberg|1953-11-25| 51062|1996-11-01|1997-11-01|     51062|
|     Aamer| Feinberg|1953-11-25| 51370|1997-11-01|1998-11-01|     51370|
|     Aamer| Feinberg|1953-11-25| 55258|1998-11-01|1999-11-01|     55258|
|     Aamer| Feinberg|1953-11-25| 59075|1999-11-01|2000-10-31|     59075|
|     Aamer| Feinberg|1953-11-25| 60199|2000-10-31|2001-10-31|     60199|
|     Aamer| Feinberg|1953-11-25| 59974|2001-10-31|9999-01-01|     59974|
|     Aamer| Molenaar|1957-10-27| 7919
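
In the output above, last_value simply repeats salary on every row. This is what the default window frame produces: when ORDER BY is given without an explicit frame, the frame ends at the current row, so LAST_VALUE returns the current row's own value. A minimal sketch of the effect follows, using SQLite (3.25 or later) through Python's sqlite3 module as a stand-in for Spark SQL so that it runs without a Spark session. It uses three of the salary rows shown above; emp_no 1 is a made-up key.

```python
import sqlite3

# In-memory SQLite database standing in for Spark SQL (illustration only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE salaries (emp_no INTEGER, salary INTEGER, from_date TEXT)")
conn.executemany(
    "INSERT INTO salaries VALUES (?, ?, ?)",
    [
        (1, 45996, "1992-11-02"),
        (1, 46979, "1993-11-02"),
        (1, 49729, "1994-11-02"),
    ],
)

rows = conn.execute("""
    SELECT salary,
           -- Default frame: stops at the current row.
           LAST_VALUE(salary) OVER (
               PARTITION BY emp_no ORDER BY from_date
           ) AS default_frame,
           -- Explicit frame: spans the whole partition.
           LAST_VALUE(salary) OVER (
               PARTITION BY emp_no ORDER BY from_date
               ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
           ) AS full_frame
      FROM salaries
     ORDER BY from_date
""").fetchall()

for row in rows:
    print(row)
```

Only the full_frame column returns the most recent salary (49729) on every row. In the Spark SQL query above, adding ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING after the ORDER BY clause should have the same effect.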

In [53]:
spark.sql("""
 SELECT first_name, last_name, salary, title, from_date, to_date, rank
 FROM ( 
 SELECT a.first_name, a.last_name, b.salary, c.title, b.from_date, b.to_date, 
 dense_rank() OVER (PARTITION BY a.first_name, a.last_name ORDER BY b.to_date DESC) AS rank
 FROM employees_tb AS a
 JOIN salaries_tb AS b
 ON a.emp_no = b.emp_no
 JOIN titles_tb AS c
 ON a.emp_no = c.emp_no)
 WHERE rank <= 3
 """).show(20)

+----------+------------+------+---------------+----------+----------+----+
|first_name|   last_name|salary|          title| from_date|   to_date|rank|
+----------+------------+------+---------------+----------+----------+----+
|     Aamer|    Feinberg| 59974|   Senior Staff|2001-10-31|9999-01-01|   1|
|     Aamer|    Feinberg| 59974|          Staff|2001-10-31|9999-01-01|   1|
|     Aamer|    Feinberg| 60199|   Senior Staff|2000-10-31|2001-10-31|   2|
|     Aamer|    Feinberg| 60199|          Staff|2000-10-31|2001-10-31|   2|
|     Aamer|    Feinberg| 59075|   Senior Staff|1999-11-01|2000-10-31|   3|
|     Aamer|    Feinberg| 59075|          Staff|1999-11-01|2000-10-31|   3|
|     Aamer|    Molenaar|115331|   Senior Staff|2002-04-18|9999-01-01|   1|
|     Aamer|    Molenaar|115331|          Staff|2002-04-18|9999-01-01|   1|
|     Aamer|    Molenaar|114295|   Senior Staff|2001-04-18|2002-04-18|   2|
|     Aamer|    Molenaar|114295|          Staff|2001-04-18|2002-04-18|   2|
|     Aamer|

In [54]:
dept_emp = (spark
 .read
 .format("jdbc")
 .option("url", "jdbc:mysql://localhost:3306/employees")
 .option("driver", "com.mysql.jdbc.Driver")
 .option("dbtable", "dept_emp")
 .option("user", "root")
 .option("password", "root1234-")
 .load())

In [55]:
dept_emp.show(10)

+------+-------+----------+----------+
|emp_no|dept_no| from_date|   to_date|
+------+-------+----------+----------+
| 10001|   d005|1986-06-26|9999-01-01|
| 10002|   d007|1996-08-03|9999-01-01|
| 10003|   d004|1995-12-03|9999-01-01|
| 10004|   d004|1986-12-01|9999-01-01|
| 10005|   d003|1989-09-12|9999-01-01|
| 10006|   d005|1990-08-05|9999-01-01|
| 10007|   d008|1989-02-10|9999-01-01|
| 10008|   d005|1998-03-11|2000-07-31|
| 10009|   d006|1985-02-18|9999-01-01|
| 10010|   d004|1996-11-24|2000-06-26|
+------+-------+----------+----------+
only showing top 10 rows



In [59]:
dept_emp.createOrReplaceTempView("dept_emp_tb")

In [60]:
departments.createOrReplaceTempView("departments_tb")

Using window operations, get each employee's current (that is, latest or most recent) salary, position (title), and department.

In [62]:
spark.sql("""
 SELECT first_name, last_name, salary, title, dept_name, from_date, to_date, rank
 FROM ( 
 SELECT a.first_name, a.last_name, b.salary, c.title, e.dept_name, b.from_date, b.to_date,
 dense_rank() OVER (PARTITION BY a.first_name, a.last_name ORDER BY b.to_date DESC) AS rank
 FROM employees_tb AS a
 JOIN salaries_tb AS b
 ON a.emp_no = b.emp_no
 JOIN titles_tb AS c
 ON a.emp_no = c.emp_no
 JOIN dept_emp_tb AS d
 ON a.emp_no = d.emp_no
 JOIN departments_tb AS e 
 ON d.dept_no = e.dept_no)
 WHERE rank = 1
 """).show(20)

+----------+------------+------+---------------+----------------+----------+----------+----+
|first_name|   last_name|salary|          title|       dept_name| from_date|   to_date|rank|
+----------+------------+------+---------------+----------------+----------+----------+----+
|     Aamer|    Feinberg| 59974|   Senior Staff|         Finance|2001-10-31|9999-01-01|   1|
|     Aamer|    Feinberg| 59974|          Staff|         Finance|2001-10-31|9999-01-01|   1|
|     Aamer|    Molenaar|115331|   Senior Staff|           Sales|2002-04-18|9999-01-01|   1|
|     Aamer|    Molenaar|115331|          Staff|           Sales|2002-04-18|9999-01-01|   1|
| Abdelaziz|       Rosin| 53248|       Engineer|     Development|2001-02-17|2001-06-13|   1|
|Abdelghani|Bernardeschi| 54691|       Engineer|     Development|2002-07-08|9999-01-01|   1|
|Abdelghani|Bernardeschi| 60020|       Engineer|      Production|2001-08-21|9999-01-01|   1|
|Abdelghani|Bernardeschi| 60020|Senior Engineer|      Production|2001-