**_pySpark Basics: Merging and Joining Data_**

_by Jeff Levy (jlevy@urban.org)_

_Last Updated: 17 June 2016, Spark v1.6.1_

_Abstract: This guide covers the ways to combine two or more dataframes: stacking them row-wise and joining them on key columns._

***

We begin with some basic setup, first to verify that the Spark Context was successfully created by the startup script and then to import the SQL structure that supports the dataframes we'll be using:

In [1]:
try:
    sc
except NameError:
    raise Exception('Spark context not created.')

In [2]:
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)

You may have the same columns in each dataframe and just want to stack one on top of the other, row-wise.  We can make this happen with a helper function, after we first build three simple toy dataframes:

In [3]:
from pyspark.sql import Row

row = Row("name", "pet", "count")

df1 = sc.parallelize([
    row("Sue", "cat", 16),
    row("Kim", "dog", 1),    
    row("Bob", "fish", 5)
    ]).toDF()

df2 = sc.parallelize([
    row("Fred", "cat", 2),
    row("Kate", "ant", 179),    
    row("Marc", "lizard", 5)
    ]).toDF()

df3 = sc.parallelize([
    row("Sarah", "shark", 3),
    row("Jason", "kids", 2),    
    row("Scott", "squirrel", 1)
    ]).toDF()

In [5]:
from pyspark.sql import DataFrame
from functools import reduce

def union_many(*dfs):
    # Accepts any number of dataframes; unionAll matches columns by position, not by name
    return reduce(DataFrame.unionAll, dfs)

df_union = union_many(df1, df2, df3)

In [6]:
df_union.show()

+-----+--------+-----+
| name|     pet|count|
+-----+--------+-----+
|  Sue|     cat|   16|
|  Kim|     dog|    1|
|  Bob|    fish|    5|
| Fred|     cat|    2|
| Kate|     ant|  179|
| Marc|  lizard|    5|
|Sarah|   shark|    3|
|Jason|    kids|    2|
|Scott|squirrel|    1|
+-----+--------+-----+
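
The way `reduce` folds a two-argument stacking step over any number of inputs can be seen without Spark. The following is a minimal plain-Python sketch, not Spark code: the tables are lists of tuples, and `stack` and `stack_many` are hypothetical stand-ins for `DataFrame.unionAll` and `union_many`. It assumes, as `unionAll` does, that every input lists its columns in the same order, and it keeps duplicate rows rather than removing them.

```python
from functools import reduce

# Toy stand-ins for dataframes: each "table" is a list of row tuples
# in (name, pet, count) order. These are illustrative, not Spark objects.
t1 = [("Sue", "cat", 16), ("Kim", "dog", 1)]
t2 = [("Fred", "cat", 2)]
t3 = [("Sue", "cat", 16)]  # repeats a row already present in t1


def stack(a, b):
    # Append b's rows under a's rows, keeping duplicates and
    # trusting that both sides list their columns in the same order.
    return a + b


def stack_many(*tables):
    # reduce folds left to right: stack(stack(t1, t2), t3)
    return reduce(stack, tables)


stacked = stack_many(t1, t2, t3)
for r in stacked:
    print(r)
```

Because `reduce` is given no initial value here, calling `stack_many()` with no arguments raises a `TypeError`; the same holds for `union_many()` called with no dataframes.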



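The other way to merge is a join: rows from two dataframes are matched on one or more key columns, and their remaining columns are combined side by side.  There are four ways to specify the logic of the operation (`inner`, `outer`, `left` and `right`).  We first build two dataframes that share a `name` key: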

In [9]:
row1 = Row("name", "pet", "count")
row2 = Row("name", "pet2", "count2")

df1 = sc.parallelize([
    row1("Sue", "cat", 16),
    row1("Kim", "dog", 1),    
    row1("Bob", "fish", 5),
    row1("Libuse", "horse", 1)
    ]).toDF()

df2 = sc.parallelize([
    row2("Sue", "eagle", 2),
    row2("Kim", "ant", 179),    
    row2("Bob", "lizard", 5),
    row2("Ferdinand", "bees", 23)
    ]).toDF()

First we'll do an `inner` join, which merges rows that have a match in both dataframes and drops all others.  This is the default type of join, so the `how` argument could be omitted here if you didn't wish to be explicit:

In [10]:
df1.join(df2, 'name', how='inner').show()

+----+----+-----+------+------+
|name| pet|count|  pet2|count2|
+----+----+-----+------+------+
| Sue| cat|   16| eagle|     2|
| Bob|fish|    5|lizard|     5|
| Kim| dog|    1|   ant|   179|
+----+----+-----+------+------+



Next is an `outer` join, which keeps all rows from both dataframes regardless of matches and fills in `null` where a value is missing:

In [11]:
df1.join(df2, 'name', how='outer').show()

+---------+-----+-----+------+------+
|     name|  pet|count|  pet2|count2|
+---------+-----+-----+------+------+
|Ferdinand| null| null|  bees|    23|
|      Sue|  cat|   16| eagle|     2|
|      Bob| fish|    5|lizard|     5|
|   Libuse|horse|    1|  null|  null|
|      Kim|  dog|    1|   ant|   179|
+---------+-----+-----+------+------+



A `left` join keeps every row from the left dataframe (in this case `df1`) and adds values from the right dataframe only where the key matches, filling in `null` otherwise:

In [12]:
df1.join(df2, 'name', how='left').show()

+------+-----+-----+------+------+
|  name|  pet|count|  pet2|count2|
+------+-----+-----+------+------+
|   Sue|  cat|   16| eagle|     2|
|   Bob| fish|    5|lizard|     5|
|Libuse|horse|    1|  null|  null|
|   Kim|  dog|    1|   ant|   179|
+------+-----+-----+------+------+



A `right` join is the opposite: it keeps every row from the right dataframe.  It is equivalent to performing a `left` join with the places of `df1` and `df2` switched in the code block (although the resulting dataframes would have different column orderings).
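
To make the four join types and the left/right equivalence concrete, here is a small plain-Python model of the logic. It is only a sketch of the semantics, not how Spark implements joins: `join` is a hypothetical helper, rows are dictionaries, `None` plays the role of `null`, and each key is assumed to appear at most once per side. Row order in a real Spark result is not guaranteed to match this model.

```python
def join(left, right, key, how="inner"):
    """Join two non-empty lists of dict rows on `key` (unique per side)."""
    left_by_key = {r[key]: r for r in left}
    right_by_key = {r[key]: r for r in right}
    left_cols = [c for c in left[0] if c != key]
    right_cols = [c for c in right[0] if c != key]

    # Decide which key values survive for each join type.
    if how == "inner":
        keys = [k for k in left_by_key if k in right_by_key]
    elif how == "left":
        keys = list(left_by_key)
    elif how == "right":
        keys = list(right_by_key)
    elif how == "outer":
        keys = list(left_by_key) + [k for k in right_by_key if k not in left_by_key]
    else:
        raise ValueError("unknown join type: %r" % how)

    out = []
    for k in keys:
        row = {key: k}
        l_row = left_by_key.get(k, {})
        r_row = right_by_key.get(k, {})
        # Left columns come first, then right columns; missing values become None.
        for c in left_cols:
            row[c] = l_row.get(c)
        for c in right_cols:
            row[c] = r_row.get(c)
        out.append(row)
    return out


# Same toy data as df1 and df2 above, as plain dictionaries.
t1 = [
    {"name": "Sue", "pet": "cat", "count": 16},
    {"name": "Kim", "pet": "dog", "count": 1},
    {"name": "Bob", "pet": "fish", "count": 5},
    {"name": "Libuse", "pet": "horse", "count": 1},
]
t2 = [
    {"name": "Sue", "pet2": "eagle", "count2": 2},
    {"name": "Kim", "pet2": "ant", "count2": 179},
    {"name": "Bob", "pet2": "lizard", "count2": 5},
    {"name": "Ferdinand", "pet2": "bees", "count2": 23},
]

right_join = join(t1, t2, "name", how="right")
swapped_left = join(t2, t1, "name", how="left")

# Same rows and values, but the column order differs.
print(right_join == swapped_left)
print(list(right_join[0]))
print(list(swapped_left[0]))
```

The two results compare equal because dictionary equality ignores key order, while listing the keys of the first row shows the different column orderings mentioned above.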

Finally, note that joins can be done on multiple columns by passing a **list** of column name strings instead of a single string.
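
What matching on several columns means can be sketched in plain Python: the key becomes the tuple of values from all listed columns, and two rows match only when every part of that tuple agrees. This is an illustrative model rather than Spark code. The helper `inner_join_on`, the tables and the `age` column are all hypothetical, and the sketch assumes the two sides share no column names other than the keys.

```python
def inner_join_on(left, right, keys):
    """Inner join of two lists of dict rows on a list of key column names."""
    # Index the right side by the tuple of its key values.
    index = {}
    for r in right:
        index.setdefault(tuple(r[k] for k in keys), []).append(r)

    out = []
    for l in left:
        composite = tuple(l[k] for k in keys)
        for r in index.get(composite, []):
            merged = dict(l)
            # Add the right side's non-key columns (assumed not to collide).
            merged.update({c: v for c, v in r.items() if c not in keys})
            out.append(merged)
    return out


# Hypothetical toy data: both tables carry "name" and "pet" columns.
owners = [
    {"name": "Sue", "pet": "cat", "count": 16},
    {"name": "Sue", "pet": "dog", "count": 2},
    {"name": "Kim", "pet": "dog", "count": 1},
]
ages = [
    {"name": "Sue", "pet": "cat", "age": 3},
    {"name": "Kim", "pet": "cat", "age": 7},
]

matched = inner_join_on(owners, ages, ["name", "pet"])
print(matched)
```

Only the Sue/cat row survives: Kim appears in both tables, but with a different `pet`, so the composite key does not match.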