<h2>PySpark Join Types</h2>
<p>A PySpark join combines two DataFrames, and chaining joins lets you combine several. It supports the join types familiar from traditional <strong>SQL: INNER, LEFT OUTER, RIGHT OUTER, FULL OUTER, LEFT SEMI, LEFT ANTI, and CROSS, as well as the SELF JOIN pattern.</strong>
</p>
<h2>1. PySpark Join Syntax</h2>
<p>The join method is called directly on a DataFrame and has the following syntax.</p>
<p height="50" width="100%" style="background:black;font-size:20px;color:white">
</br>
# Syntax</br>
join(self, other, on=None, how=None)</br>
</p>
<p>You can also express the join condition by calling the <strong>where() or filter()</strong> method on the joined DataFrame, and a join condition can span multiple columns.</p>
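<p>To see why a join condition can be moved into where() or filter(), it may help to model a join outside Spark. The sketch below is plain Python, not PySpark: the rows, the region column, and the helper names are made up for illustration. It treats an inner join on two columns as a filter over every pairing of left and right rows. Spark normally chooses a more efficient execution plan, so this is only a model of which rows come back, not of how Spark computes them.</p>

```python
from itertools import product

# Hypothetical rows, modelled as dicts (not Spark Rows).
employees = [
    {"emp_id": 1, "dept_id": 10, "region": "EU"},
    {"emp_id": 2, "dept_id": 20, "region": "US"},
    {"emp_id": 3, "dept_id": 10, "region": "US"},
]
departments = [
    {"dept_name": "Finance", "dept_id": 10, "region": "EU"},
    {"dept_name": "Marketing", "dept_id": 20, "region": "US"},
]

def matches(left, right):
    """Join condition on two columns, combined with a logical AND."""
    return left["dept_id"] == right["dept_id"] and left["region"] == right["region"]

def inner_join(left_rows, right_rows, condition):
    """Keep each (left, right) pairing for which the condition holds."""
    return [(l, r) for l, r in product(left_rows, right_rows) if condition(l, r)]

def cross_then_filter(left_rows, right_rows, condition):
    """Pair every left row with every right row, then filter afterwards."""
    all_pairs = list(product(left_rows, right_rows))
    return [pair for pair in all_pairs if condition(*pair)]

joined = inner_join(employees, departments, matches)
filtered = cross_then_filter(employees, departments, matches)

# Both routes select the same pairings.
pairs = [(l["emp_id"], r["dept_name"]) for l, r in joined]
print(pairs)
print(joined == filtered)
```

<p>In PySpark itself, a condition over several columns is written by combining column comparisons with the &amp; operator, with each comparison wrapped in parentheses.</p>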
<h2>2. PySpark Join Types</h2>
<p height="50" width="100%" style="background:black;font-size:20px;color:white">
</br>
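# Prepare data</br>
from pyspark.sql import SparkSession</br>
</br>
# Create (or reuse) the SparkSession used by the examples; the app name is arbitrary</br>
spark = SparkSession.builder.appName("join-examples").getOrCreate()</br>
</br>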
emp = [(1,"Smith",-1,"2018","10","M",3000),</br>
    (2,"Rose",1,"2010","20","M",4000),</br>
    (3,"Williams",1,"2010","10","M",1000),</br>
    (4,"Jones",2,"2005","10","F",2000),</br>
    (5,"Brown",2,"2010","40","",-1),</br>
    (6,"Brown",2,"2010","50","",-1)</br>
  ]</br>
empColumns = ["emp_id","name","superior_emp_id","year_joined",</br>
       "emp_dept_id","gender","salary"]</br>
</br>
empDF = spark.createDataFrame(data=emp, schema = empColumns)</br>
empDF.printSchema()</br>
empDF.show(truncate=False)</br>
</br>
dept = [("Finance",10),</br>
    ("Marketing",20),</br>
    ("Sales",30),</br>
    ("IT",40)</br>
  ]</br>
deptColumns = ["dept_name","dept_id"]</br>
deptDF = spark.createDataFrame(data=dept, schema = deptColumns)</br>
deptDF.printSchema()</br>
deptDF.show(truncate=False)</br>
</p>
<h2>4. PySpark Inner Join DataFrame</h2>
<p height="50" width="100%" style="background:black;font-size:20px;color:white">
</br>
# Inner join</br>
empDF.join(deptDF,empDF.emp_dept_id == deptDF.dept_id,"inner") \</br>
     .show(truncate=False)</br>
</p>
<h2>5. PySpark Left Outer Join</h2>
<p height="50" width="100%" style="background:black;font-size:20px;color:white">
</br>
# Left outer join ("left" and "leftouter" are equivalent)</br>
empDF.join(deptDF,empDF.emp_dept_id == deptDF.dept_id,"left") \</br>
    .show(truncate=False)</br>
empDF.join(deptDF,empDF.emp_dept_id == deptDF.dept_id,"leftouter") \</br>
    .show(truncate=False)</br>
</p>
<h2>6. Right Outer Join</h2>
<p height="50" width="100%" style="background:black;font-size:20px;color:white">
</br>
# Right outer join</br>
empDF.join(deptDF,empDF.emp_dept_id ==  deptDF.dept_id,"right") \</br>
   .show(truncate=False)</br>
empDF.join(deptDF,empDF.emp_dept_id ==  deptDF.dept_id,"rightouter") \</br>
   .show(truncate=False)</br>
</p>
<h2>7. PySpark Full Outer Join</h2>
<p height="50" width="100%" style="background:black;font-size:20px;color:white">
</br>
# Full outer join</br>
empDF.join(deptDF,empDF.emp_dept_id ==  deptDF.dept_id,"outer") \</br>
    .show(truncate=False)</br>
empDF.join(deptDF,empDF.emp_dept_id ==  deptDF.dept_id,"full") \</br>
    .show(truncate=False)</br>
empDF.join(deptDF,empDF.emp_dept_id == deptDF.dept_id,"fullouter") \</br>
    .show(truncate=False)</br>
</p>
<h2>8. Left Semi Join</h2>
<p>A left semi join returns only the rows of the left DataFrame (the first DataFrame in the join call) that have a match in the right DataFrame (the second DataFrame). The result contains only the left DataFrame's columns.</p>
<p height="50" width="100%" style="background:black;font-size:20px;color:white">
</br>
# Left semi join</br>
empDF.join(deptDF,empDF.emp_dept_id ==  deptDF.dept_id,"leftsemi") \</br>
   .show(truncate=False)</br>
</p>
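<p>The left semi semantics described above can be modelled in a few lines of plain Python. This is a sketch under stated assumptions, not Spark's implementation: the tuples are simplified copies of the emp and dept rows (three columns only, department ids as integers), and left_semi is a hypothetical helper.</p>

```python
# Simplified copies of the emp and dept rows: (emp_id, name, emp_dept_id)
emp = [
    (1, "Smith", 10), (2, "Rose", 20), (3, "Williams", 10),
    (4, "Jones", 10), (5, "Brown", 40), (6, "Brown", 50),
]
# (dept_name, dept_id)
dept = [("Finance", 10), ("Marketing", 20), ("Sales", 30), ("IT", 40)]

def left_semi(left_rows, right_rows, left_key, right_key):
    """Return left rows whose key appears on the right; right columns are dropped."""
    right_keys = {right_key(r) for r in right_rows}
    return [l for l in left_rows if left_key(l) in right_keys]

semi = left_semi(emp, dept, lambda e: e[2], lambda d: d[1])

# Employee 6 works in department 50, which has no row in dept, so it is excluded.
for row in semi:
    print(row)
```

<p>Each left row appears at most once in this model, however many right rows share its key, and no dept_name value is attached to it.</p>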
<h2>9. Left Anti Join</h2>
<p>A left anti join returns only the rows of the left DataFrame (the first DataFrame in the join call) that have no match in the right DataFrame (the second DataFrame). As with a left semi join, the result contains only the left DataFrame's columns.</p>
<p height="50" width="100%" style="background:black;font-size:20px;color:white">
</br>
# Left anti join</br>
empDF.join(deptDF,empDF.emp_dept_id ==  deptDF.dept_id,"leftanti") \</br>
   .show(truncate=False)</br>
</p>
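<p>The left anti join is the complement of the left semi join, which a small plain-Python model can make concrete. As before, this is only an illustration: the tuples are simplified copies of the emp and dept rows with integer department ids, and left_anti is a hypothetical helper, not a PySpark function.</p>

```python
# Simplified copies of the emp and dept rows: (emp_id, name, emp_dept_id)
emp = [
    (1, "Smith", 10), (2, "Rose", 20), (3, "Williams", 10),
    (4, "Jones", 10), (5, "Brown", 40), (6, "Brown", 50),
]
# (dept_name, dept_id)
dept = [("Finance", 10), ("Marketing", 20), ("Sales", 30), ("IT", 40)]

def left_anti(left_rows, right_rows, left_key, right_key):
    """Return left rows whose key does not appear anywhere on the right."""
    right_keys = {right_key(r) for r in right_rows}
    return [l for l in left_rows if left_key(l) not in right_keys]

anti = left_anti(emp, dept, lambda e: e[2], lambda d: d[1])

# Only the employee in department 50 lacks a matching dept row.
print(anti)
```

<p>A typical use for this pattern is finding orphaned records, such as employees assigned to a department id that does not exist in the department table.</p>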
<h2>10. PySpark Self Join</h2>
<p height="100" width="100%" style="background:black;font-size:20px;color:white">
</br>
# Self join</br>
from pyspark.sql.functions import col</br>
</br>
empDF.alias("emp1").join(empDF.alias("emp2"), \</br>
    col("emp1.superior_emp_id") == col("emp2.emp_id"),"inner") \</br>
    .select(col("emp1.emp_id"),col("emp1.name"), \</br>
      col("emp2.emp_id").alias("superior_emp_id"), \</br>
      col("emp2.name").alias("superior_emp_name")) \</br>
   .show(truncate=False)</br>
</p>