**_pySpark Basics: Pivoting Data_**

_by Jeff Levy (jlevy@urban.org)_

_Last Updated: 20 June 2016, Spark v1.6.1_

_Abstract: This guide will illustrate how to reshape (pivot) data._

***

A few initial setup items: first we test that the Spark context was successfully created during bootstrap and is available in the global namespace as `sc`. After that we create the SQL context necessary for working with a dataframe (panel data).

In [1]:
try:
    sc
except NameError:
    raise Exception('Spark context not created.')

In [2]:
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)

Then we create a toy dataframe to work with:

In [4]:
from pyspark.sql import Row

row = Row('state', 'industry', 'hq', 'jobs')

df = sc.parallelize([
    row('MI', 'auto', 'domestic', 716),
    row('MI', 'auto', 'foreign', 123),
    row('MI', 'auto', 'domestic', 1340),
    row('MI', 'retail', 'foreign', 12),
    row('MI', 'retail', 'foreign', 33),
    row('OH', 'auto', 'domestic', 349),
    row('OH', 'auto', 'foreign', 101),
    row('OH', 'auto', 'foreign', 77),
    row('OH', 'retail', 'domestic', 45),
    row('OH', 'retail', 'foreign', 12)
    ]).toDF()

In [17]:
df.show()

+-----+--------+--------+----+
|state|industry|      hq|jobs|
+-----+--------+--------+----+
|   MI|    auto|domestic| 716|
|   MI|    auto| foreign| 123|
|   MI|    auto|domestic|1340|
|   MI|  retail| foreign|  12|
|   MI|  retail| foreign|  33|
|   OH|    auto|domestic| 349|
|   OH|    auto| foreign| 101|
|   OH|    auto| foreign|  77|
|   OH|  retail|domestic|  45|
|   OH|  retail| foreign|  12|
+-----+--------+--------+----+



Pivot operations must always be preceded by a groupBy operation. In our first case we will simply pivot to show domestic versus foreign jobs in each of our two states:

In [23]:
df_pivot1 = df.groupBy('state').pivot('hq', values=['domestic', 'foreign']).sum('jobs')

In [24]:
df_pivot1.show()

+-----+--------+-------+
|state|domestic|foreign|
+-----+--------+-------+
|   OH|     394|    190|
|   MI|    2056|    168|
+-----+--------+-------+



Note that the `values=['domestic', 'foreign']` argument to the `pivot` method is optional. If we don't supply a list, pySpark will infer the values by first finding the distinct entries in the pivot column, which naturally requires more processing than specifying them up front. As your datasets get larger, supplying the list becomes more and more important.
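
To make the cost of inferring the values concrete, here is a plain-Python model of what the pivot above computes. It is only a sketch of the semantics, not of how Spark executes the operation; the helper name `pivot_sum` and the tuple layout are assumptions made for illustration. When `values` is omitted, the helper needs an extra pass over the data to discover the distinct entries of the pivot column, which mirrors the extra processing described above.

```python
from collections import defaultdict

# The same toy data as above, as plain (state, industry, hq, jobs) tuples.
rows = [
    ('MI', 'auto', 'domestic', 716),
    ('MI', 'auto', 'foreign', 123),
    ('MI', 'auto', 'domestic', 1340),
    ('MI', 'retail', 'foreign', 12),
    ('MI', 'retail', 'foreign', 33),
    ('OH', 'auto', 'domestic', 349),
    ('OH', 'auto', 'foreign', 101),
    ('OH', 'auto', 'foreign', 77),
    ('OH', 'retail', 'domestic', 45),
    ('OH', 'retail', 'foreign', 12),
]

def pivot_sum(rows, values=None):
    """Sum jobs per state, with one output column per hq value."""
    if values is None:
        # Extra pass over the data: discover the distinct pivot values.
        values = sorted({hq for _, _, hq, _ in rows})
    # Every state starts with all pivot columns present and empty (None).
    totals = defaultdict(lambda: dict.fromkeys(values))
    for state, _, hq, jobs in rows:
        if hq in values:  # rows whose hq is not in the list are skipped
            totals[state][hq] = (totals[state][hq] or 0) + jobs
    return dict(totals)

explicit = pivot_sum(rows, values=['domestic', 'foreign'])
inferred = pivot_sum(rows)
print(explicit)
print(inferred == explicit)
```

Both calls produce the same totals as the table above; the only difference is the additional scan needed when the list is not supplied.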

Here's another example, this time grouping by both `state` and `industry` before pivoting on `hq`:

In [15]:
df_pivot = df.groupBy('state', 'industry').pivot('hq', values=['domestic', 'foreign']).sum('jobs')

In [16]:
df_pivot.show()

+-----+--------+--------+-------+
|state|industry|domestic|foreign|
+-----+--------+--------+-------+
|   MI|    auto|    2056|    123|
|   OH|  retail|      45|     12|
|   OH|    auto|     349|    178|
|   MI|  retail|    null|     45|
+-----+--------+--------+-------+
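
The `null` in the `MI` / `retail` row appears because the toy data contains no domestic retail rows for Michigan, so there is nothing to sum for that cell. The plain-Python sketch below models this behavior; it illustrates the semantics rather than Spark's implementation, and the names `cells` and `table` are assumptions. If zeros are preferable to nulls, PySpark dataframes offer `na.fill` (for example `df_pivot.na.fill(0)`), though whether a missing group should count as zero depends on the analysis.

```python
# The same toy data as above, as plain (state, industry, hq, jobs) tuples.
rows = [
    ('MI', 'auto', 'domestic', 716),
    ('MI', 'auto', 'foreign', 123),
    ('MI', 'auto', 'domestic', 1340),
    ('MI', 'retail', 'foreign', 12),
    ('MI', 'retail', 'foreign', 33),
    ('OH', 'auto', 'domestic', 349),
    ('OH', 'auto', 'foreign', 101),
    ('OH', 'auto', 'foreign', 77),
    ('OH', 'retail', 'domestic', 45),
    ('OH', 'retail', 'foreign', 12),
]
pivot_values = ['domestic', 'foreign']

# Sum jobs for every (state, industry, hq) combination that actually occurs.
cells = {}
for state, industry, hq, jobs in rows:
    key = (state, industry, hq)
    cells[key] = cells.get(key, 0) + jobs

# Build one output row per (state, industry) group. A combination that never
# occurred has no entry in `cells`, so .get() returns None (Spark shows null).
groups = sorted({(state, industry) for state, industry, _, _ in rows})
table = {g: [cells.get(g + (v,)) for v in pivot_values] for g in groups}

for (state, industry), (domestic, foreign) in table.items():
    print(state, industry, domestic, foreign)
```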



The `sum` method at the end can be replaced as necessary, for example with `avg`.
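
As a sketch of what swapping the aggregate means, the plain-Python model below applies either a sum or a mean to the same grouped cells. The helper `pivot_agg` and the use of `statistics.mean` are assumptions for illustration, standing in for the `sum` and `avg` methods; in PySpark itself the change is simply `.avg('jobs')` in place of `.sum('jobs')`.

```python
from collections import defaultdict
from statistics import mean

# The same toy data as above, as plain (state, industry, hq, jobs) tuples.
rows = [
    ('MI', 'auto', 'domestic', 716),
    ('MI', 'auto', 'foreign', 123),
    ('MI', 'auto', 'domestic', 1340),
    ('MI', 'retail', 'foreign', 12),
    ('MI', 'retail', 'foreign', 33),
    ('OH', 'auto', 'domestic', 349),
    ('OH', 'auto', 'foreign', 101),
    ('OH', 'auto', 'foreign', 77),
    ('OH', 'retail', 'domestic', 45),
    ('OH', 'retail', 'foreign', 12),
]

def pivot_agg(rows, agg):
    """Group by state, pivot on hq, and apply `agg` to each cell's jobs."""
    cells = defaultdict(list)
    for state, _, hq, jobs in rows:
        cells[(state, hq)].append(jobs)
    return {key: agg(jobs) for key, jobs in cells.items()}

sums = pivot_agg(rows, sum)   # models .sum('jobs')
avgs = pivot_agg(rows, mean)  # models .avg('jobs')

for key in sorted(sums):
    print(key, sums[key], avgs[key])
```

The grouping and pivoting steps are identical in both calls; only the function applied to each cell changes.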