**_pySpark Basics: Dataframe Concepts_**

_by Jeff Levy (jlevy@urban.org)_

_Last Updated: 22 June 2016, Spark v1.6.1_

_Abstract: This guide will explore some basic concepts necessary for working with many dataframe operations, in particular `groupBy` and `persist`._

***

Dataframes are a convenient wrapper over the basic data structure that Spark uses, which is called an RDD (Resilient Distributed Dataset).  It's _resilient_ because a core benefit of Spark is how it handles network and hardware failures.  When you're working with a small dataset on your desktop computer, this rarely becomes an issue.  However, another core benefit of Spark is that it's _distributed_ across many computers, and as that number rises the number of failures you encounter will naturally rise as well.  By some estimates, if you have 10,000 computers working together - which Spark can handle - you should expect one system to fail _every day_.  And that's not even counting things like a system freezing, other software problems, or trouble with traffic over your network.

Spark's central manager (the "master") is very smart in this regard.  First, the underlying distributed storage typically keeps all data in multiple locations (usually three), so no one failure can threaten your data.  But beyond that, the master keeps track of all of the tasks it has sent to the nodes, and if the node doing the work breaks, the master just sends the task to a different node.  This is completely transparent to the user.

Even more than that, if the node doing the work doesn't reply in a reasonable amount of time, the master can send the task out to another node to finish and then take the result from whichever one reports back first.  This is known as speculative execution.
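
The reassignment described above can be pictured with a toy model in plain Python.  This is only a sketch of the idea, not Spark's scheduler: `run_with_retries`, `broken_node` and `healthy_node` are hypothetical names made up for illustration, and each "node" is just a function.

```python
def run_with_retries(task, nodes):
    """Try the task on each node in turn until one succeeds (toy model)."""
    failures = []
    for node in nodes:
        try:
            return node(task), failures
        except RuntimeError as err:
            failures.append(str(err))   # note the failure and move on
    raise RuntimeError('all nodes failed: %s' % failures)


def broken_node(task):
    # Hypothetical node that always fails
    raise RuntimeError('node down')


def healthy_node(task):
    # Hypothetical node that does the work: here, summing a list
    return sum(task)


# The first node breaks, so the task is handed to the second one
result, failures = run_with_retries([1, 2, 3], [broken_node, healthy_node])
```

The caller only sees `result`; the failed attempt is recorded in `failures` but never interrupts the work, which is the sense in which the recovery is transparent.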

Spark does its RDD computations in what is called a _lazy_ fashion.  That is, when you tell it to do things to an RDD it _doesn't do them right away._  Instead it makes sure they're valid commands, then stacks them up until you actually ask it to return a value or a dataframe to you.  This stack of pending instructions is called a _lineage_ in Spark, and it means an RDD isn't a store of data, it's a store of instructions.
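
To make the idea of a "store of instructions" concrete, here is a minimal sketch in plain Python.  It is a toy, not how Spark is implemented: the `LazyDataset` class and the `square` and `add_one` helpers are hypothetical names invented for this illustration.

```python
class LazyDataset:
    """Toy stand-in for an RDD: a source plus a list of pending steps."""

    def __init__(self, source, steps=()):
        self.source = source          # the original data
        self.steps = tuple(steps)     # the lineage: functions not yet applied

    def map(self, func):
        # Transformation: record the step and return a new object; no work is done
        return LazyDataset(self.source, self.steps + (func,))

    def collect(self):
        # Action: only now is every recorded step applied to the data
        rows = list(self.source)
        for func in self.steps:
            rows = [func(row) for row in rows]
        return rows


calls = []  # records each time a step actually runs


def square(x):
    calls.append('square')
    return x ** 2


def add_one(x):
    calls.append('add_one')
    return x + 1


lazy = LazyDataset([1, 2, 3]).map(square).map(add_one)
calls_before_action = len(calls)   # still 0: nothing has run yet
result = lazy.collect()            # runs both steps over all three rows
```

Building `lazy` only lengthens the list of steps.  The functions are not called until `collect()` asks for a result, which mirrors how the dataframe code below returns instantly until `show` is called.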

Let's see it in action.  First we'll load up the same dataframe we did in basics 1:

In [1]:
try:
    sc
except NameError:
    raise Exception('Spark context not created.')

In [2]:
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)

In [3]:
df = sqlContext.read.load('s3://ui-hfpc/Performance_2015Q1.txt',
                          format='com.databricks.spark.csv',
                          header='false',
                          inferSchema='true',
                          delimiter='|')

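That takes a while, because `read.load()` is not lazy here: with `inferSchema='true'` Spark has to scan the file to work out the column types before it can return a dataframe.  Let's check those column types, then try some numerical operations on a column: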

In [5]:
df.dtypes

[('C0', 'bigint'),
 ('C1', 'string'),
 ('C2', 'string'),
 ('C3', 'double'),
 ('C4', 'double'),
 ('C5', 'int'),
 ('C6', 'int'),
 ('C7', 'int'),
 ('C8', 'string'),
 ('C9', 'int'),
 ('C10', 'string'),
 ('C11', 'string'),
 ('C12', 'int'),
 ('C13', 'string'),
 ('C14', 'string'),
 ('C15', 'string'),
 ('C16', 'string'),
 ('C17', 'string'),
 ('C18', 'string'),
 ('C19', 'string'),
 ('C20', 'string'),
 ('C21', 'string'),
 ('C22', 'string'),
 ('C23', 'string'),
 ('C24', 'string'),
 ('C25', 'string'),
 ('C26', 'int'),
 ('C27', 'string')]

In [24]:
df_new = df.withColumn('New_C12', df['C12'] ** 2)    #New_C12 = C12^2
df_new = df_new.withColumn('New_C12', df_new['New_C12'] + df_new['C12']) #New_C12 = New_C12 + C12
df_grp = df_new.groupBy('C2')
df_avg = df_grp.avg('C3', 'C5', 'C6', 'C12', 'New_C12')

Here we performed two (arbitrary) math operations, then performed a `groupBy` operation over the entries in `C2` (more on `groupBy` in a minute) while asking it to calculate averages for five numeric columns within those groups.

However, you hopefully noticed when running that code block that it finished instantly - despite there being over 3.5 million rows of data.  This is _lazy_ computing - you're just stacking instructions up.  Now let's see what happens if we tell it to `show` us the results:

In [25]:
df_avg.show()

+--------------------+------------------+--------------------+------------------+------------------+------------------+
|                  C2|           avg(C3)|             avg(C5)|           avg(C6)|          avg(C12)|      avg(New_C12)|
+--------------------+------------------+--------------------+------------------+------------------+------------------+
|      PNC BANK, N.A.| 4.371742567994939|  1.1707779886148009|358.78747628083494|               1.0|               2.0|
|PHH MORTGAGE CORP...| 4.156480329368712|  0.9780420860018298|359.02195791399816|              null|              null|
|  QUICKEN LOANS INC.| 4.353479515908324|-0.08899247348614438| 358.5689787889155|              null|              null|
|  CITIMORTGAGE, INC.| 4.101532687651331|   0.338498789346247|359.41670702179175|              null|              null|
|WELLS FARGO BANK,...| 4.266629427172304|  0.6704475572258285|359.25937820293814|              null|              null|
|JP MORGAN CHASE B...| 4.327598085711821

That should have taken a long time to run, because when you executed `show` you asked for a dataframe to be returned to you, which meant Spark went back and calculated all of the previous operations.  You could have done any number of intermediate steps similar to those before calling `show` and they all would have been lazy operations that finished instantly, until `show` ran them all.

Now this would just be a background peculiarity that you only notice by which commands execute quickly and which ones slowly, except that we have some control over the process.  If you imagine your _lineage_ as a straight line leading from your source data to your output, we can use the `persist()` method to create a point for branching.  Essentially it tells Spark "follow the instructions to this point, then _hold these results_ because I'm going to come back to them again."

Let's redo the previous code block with a `persist()`:

In [41]:
df_new = df.withColumn('New_C12', df['C12'] ** 2)    #New_C12 = C12^2
df_new = df_new.withColumn('New_C12', df_new['New_C12'] + df_new['C12']) #New_C12 = New_C12 + C12
df_new.persist()
df_grp = df_new.groupBy('C2')
df_avg = df_grp.avg('C3', 'C5', 'C6', 'C12', 'New_C12')

In [42]:
df_avg.show()

+--------------------+------------------+--------------------+------------------+------------------+------------------+
|                  C2|           avg(C3)|             avg(C5)|           avg(C6)|          avg(C12)|      avg(New_C12)|
+--------------------+------------------+--------------------+------------------+------------------+------------------+
|      PNC BANK, N.A.| 4.371742567994939|  1.1707779886148009|358.78747628083494|               1.0|               2.0|
|PHH MORTGAGE CORP...| 4.156480329368712|  0.9780420860018298|359.02195791399816|              null|              null|
|  QUICKEN LOANS INC.| 4.353479515908324|-0.08899247348614438| 358.5689787889155|              null|              null|
|  CITIMORTGAGE, INC.| 4.101532687651331|   0.338498789346247|359.41670702179175|              null|              null|
|WELLS FARGO BANK,...| 4.266629427172304|  0.6704475572258285|359.25937820293814|              null|              null|
|JP MORGAN CHASE B...| 4.327598085711822

Showing the groupBy averages this way is no different than the first way we did it - Spark executes the entire lineage.  But now let's get the sums of our groupBy object:

In [43]:
df_sum = df_grp.sum('C3', 'C5', 'C6', 'C12', 'New_C12')

In [44]:
df_sum.show()

+--------------------+--------------------+--------+----------+--------+------------+
|                  C2|             sum(C3)| sum(C5)|   sum(C6)|sum(C12)|sum(New_C12)|
+--------------------+--------------------+--------+----------+--------+------------+
|      PNC BANK, N.A.|   6911.724999999999|    1851|    567243|       1|         2.0|
|PHH MORTGAGE CORP...|   9086.066000000004|    2138|    784822|    null|        null|
|  QUICKEN LOANS INC.|  101801.76500000026|   -2081|   8384777|    null|        null|
|  CITIMORTGAGE, INC.|  16939.329999999998|    1398|   1484391|    null|        null|
|WELLS FARGO BANK,...|          187326.365|   29436|  15773283|    null|        null|
|JP MORGAN CHASE B...|   50187.15499999999|   19197|   4155651|    null|        null|
|ROUNDPOINT MORTGA...|   67708.25999999991|   82336|   5669070|      74|       148.0|
|SUNTRUST MORTGAGE...|  21530.767999999898|    4325|   1884795|    null|        null|
|               OTHER|    904855.043999989|   25163|  

Hopefully you noticed that was significantly faster than calculating and showing the averages.  This is because Spark kept the intermediate results up to our `persist()` call from when we calculated the averages, and thus only had to run the code that came after that.  We can now do as many different branches of operations as we want stemming from `df_new` and, since we persisted it, all the code before can be skipped.
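
The effect of a branching point can be sketched with another toy model in plain Python.  Again, this is an illustration under made-up names (`LazyDataset`, `expensive`), not Spark's implementation: a persisted step holds its rows after the first action, so later branches skip everything upstream of it.

```python
class LazyDataset:
    """Toy model of a lineage with an optional persisted step (not Spark code)."""

    def __init__(self, source=None, parent=None, func=None):
        self.source = source      # raw data, only set on the root
        self.parent = parent      # previous step in the lineage
        self.func = func          # function this step applies to each row
        self.persisted = False
        self._held = None         # rows held after the first computation

    def map(self, func):
        return LazyDataset(parent=self, func=func)

    def persist(self):
        self.persisted = True     # only marks the step; nothing is computed yet
        return self

    def unpersist(self):
        self.persisted = False
        self._held = None         # drop the held rows to free resources
        return self

    def collect(self):
        if self._held is not None:
            return list(self._held)           # reuse held rows, skip upstream work
        if self.parent is None:
            rows = list(self.source)
        else:
            rows = [self.func(row) for row in self.parent.collect()]
        if self.persisted:
            self._held = list(rows)
        return rows


expensive_calls = []  # records every time the costly step really runs


def expensive(x):
    expensive_calls.append(x)
    return x ** 2 + x


base = LazyDataset(source=[1, 2, 3]).map(expensive).persist()
doubled = base.map(lambda x: x * 2).collect()   # first action: runs expensive()
calls_after_first = len(expensive_calls)
negated = base.map(lambda x: -x).collect()      # second branch: reuses held rows
calls_after_second = len(expensive_calls)
```

In this sketch `expensive` runs three times for the first branch and not at all for the second.  Note that `persist()` itself does no work; the rows are only held once the first action has computed them, just as our first `show` above was no faster than before.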

There is no need for persisting if there is no branching.  As a matter of good practice, and to free up more resources, you can call `.unpersist()` on a persisted object to drop it from storage when done with it:

In [46]:
df_new.unpersist();

(The trailing `;` simply suppresses the output from the command.  We don't need to see the summary of what we just unpersisted.)

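Also note that `cache()` is essentially a synonym for `persist()`: it persists using the default storage level.  `persist()` can additionally take a storage level argument (for example `StorageLevel.MEMORY_ONLY` or `StorageLevel.MEMORY_AND_DISK`), which controls whether Spark keeps the held results strictly in memory for the fastest recall or is allowed to swap some of them to disk if necessary.  The defaults differ between Spark versions and between RDDs and dataframes, so check the documentation for your version.  A memory-only level only works well if the dataframe you are forcing it to hold is small enough to fit in the memory available across the nodes, so use it with care.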

And finally, a bit more on `groupBy`.  Hopefully the usage above has given you some insight into how it works.  In short, `groupBy` is the vehicle for aggregation in a dataframe.  A `groupBy` object is, in itself, incomplete.  So, the line in the code block where we introduced a `persist()` above that looks like this:

`df_grp = df_new.groupBy('C2')`

generates a `groupBy` object where the data is grouped around the unique values found in column `C2`, but it is just a foundation.  It is like the sentence _"We are going to group our data up by the unique values found in column C2, and then..."_  The sentence is unfinished!  The next line of code contains the rest:

`df_avg = df_grp.avg('C3', 'C5', 'C6', 'C12', 'New_C12')`

Or to finish the sentence, _"... calculate the averages for these five columns within each group."_
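
The two halves of that sentence can be sketched in plain Python.  This is a toy model rather than Spark's implementation: the `GroupedRows` class and the three sample rows are made up for illustration, and unlike the real data they contain no nulls.

```python
from collections import defaultdict


class GroupedRows:
    """Toy stand-in for the object returned by groupBy (not Spark's class)."""

    def __init__(self, rows, key):
        # First half of the sentence: bucket the rows by the unique key values
        self.groups = defaultdict(list)
        for row in rows:
            self.groups[row[key]].append(row)

    def avg(self, *columns):
        # Second half: one output entry per group, one average per column
        result = {}
        for group_key, members in self.groups.items():
            result[group_key] = {
                'avg(%s)' % col: sum(r[col] for r in members) / float(len(members))
                for col in columns
            }
        return result


# Hypothetical sample rows, using the same column names as above
rows = [
    {'C2': 'BANK A', 'C3': 4.0, 'C5': 1},
    {'C2': 'BANK A', 'C3': 5.0, 'C5': 3},
    {'C2': 'BANK B', 'C3': 3.5, 'C5': 0},
]

grouped = GroupedRows(rows, 'C2')     # "group by C2, and then..."
averages = grouped.avg('C3', 'C5')    # "...average these columns within each group"
```

On its own `grouped` only holds the buckets.  It becomes a usable result once an aggregation such as `avg` finishes the sentence, and the same `grouped` object could be finished in other ways, as we did with `sum` earlier.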
