**_pySpark Basics: Dataframe Concepts_**

_by Jeff Levy (jlevy@urban.org)_

_Last Updated: 8 Aug 2016, Spark v2.0_

_Abstract: This guide will explore some basic concepts necessary for working with many dataframe operations, in particular `groupBy` and `persist`._

_Main operations used: read.csv, withColumn, groupBy, persist, cache, unpersist_

***

Spark does its RDD computations in what is called a _lazy_ fashion.  That is, when you tell it to do things to an RDD it _doesn't do them right away._  Instead it makes sure they're valid commands, then stacks them up until you actually ask it to return a value or a dataframe to you.  This is called a _lineage_ in Spark, and means an RDD isn't a store of data, it's a store of instructions.  
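To make the "store of instructions" idea concrete, here is a minimal plain-Python sketch of a lazily evaluated object. It is only a toy analogy, not how Spark is implemented: the `LazyList` class, its `map` and `collect` methods, and the helper functions are all hypothetical names invented for this illustration.

```python
class LazyList:
    """Toy stand-in for a lazily evaluated dataset (not Spark's implementation)."""

    def __init__(self, source, lineage=()):
        self.source = source           # the raw data, never modified
        self.lineage = tuple(lineage)  # the stacked-up instructions

    def map(self, func):
        # Lazy: record the instruction and return a new object; no data is touched.
        return LazyList(self.source, self.lineage + (func,))

    def collect(self):
        # Eager: only now is every recorded instruction actually run.
        rows = list(self.source)
        for func in self.lineage:
            rows = [func(row) for row in rows]
        return rows


calls = []  # one entry is appended every time real work happens


def square(x):
    calls.append("square")
    return x ** 2


def add_one(x):
    calls.append("add_one")
    return x + 1


lazy = LazyList([1, 2, 3]).map(square).map(add_one)
work_before_collect = len(calls)  # 0: the instructions are only stacked up
result = lazy.collect()           # runs both instructions over all three rows
work_after_collect = len(calls)   # 6: two instructions times three rows

print(work_before_collect, result, work_after_collect)
```

Stacking the two `map` calls does no work at all; the functions only run once `collect()` asks for an actual result, which is the same pattern the Spark code in this guide follows.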

Let's see it in action.  First we'll load up the same dataframe we did in basics 1:

In [1]:
df = spark.read.csv('s3://ui-spark-data/Performance_2015Q1.txt', header=False, inferSchema=True, sep='|')

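That takes a while, because with `inferSchema=True` Spark has to scan the file to work out each column's type before `read.csv()` can return a dataframe.  Now let's look at the column types and then try some numerical operations on a column: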

In [2]:
df.dtypes

[('_c0', 'bigint'),
 ('_c1', 'string'),
 ('_c2', 'string'),
 ('_c3', 'double'),
 ('_c4', 'double'),
 ('_c5', 'int'),
 ('_c6', 'int'),
 ('_c7', 'int'),
 ('_c8', 'string'),
 ('_c9', 'int'),
 ('_c10', 'string'),
 ('_c11', 'string'),
 ('_c12', 'int'),
 ('_c13', 'string'),
 ('_c14', 'string'),
 ('_c15', 'string'),
 ('_c16', 'string'),
 ('_c17', 'string'),
 ('_c18', 'string'),
 ('_c19', 'string'),
 ('_c20', 'string'),
 ('_c21', 'string'),
 ('_c22', 'string'),
 ('_c23', 'string'),
 ('_c24', 'string'),
 ('_c25', 'string'),
 ('_c26', 'int'),
 ('_c27', 'string')]

In [3]:
%%time
df_new = df.withColumn('New_c12', df['_c12'] ** 2)    #New_c12 = _c12^2
df_new = df_new.withColumn('New_c12', df_new['New_c12'] + df_new['_c12']) #New_c12 = New_c12 + _c12
df_grp = df_new.groupBy('_c2')
df_avg = df_grp.avg('_c3', '_c5', '_c6', '_c12', 'New_c12')

CPU times: user 4 ms, sys: 0 ns, total: 4 ms
Wall time: 281 ms


Here we performed two (arbitrary) math operations, then performed a `groupBy` operation over the entries in `_c2` (more on `groupBy` in a minute) while asking it to calculate averages for five numeric columns within those groups.  

However, notice that the code block finished nearly instantly - we added a simple timer to print out how long it took by using the Jupyter `time` "magic" - despite there being over 3.5 million rows of data.  This is _lazy_ computing - **nothing was actually computed here because we are just stacking instructions up.**  All pySpark did was make sure they were valid instructions.  Now let's see what happens if we tell it to `show` us the results:

In [4]:
%%time
df_avg.show()

+--------------------+------------------+--------------------+------------------+------------------+------------------+
|                 _c2|          avg(_c3)|            avg(_c5)|          avg(_c6)|         avg(_c12)|      avg(New_c12)|
+--------------------+------------------+--------------------+------------------+------------------+------------------+
|  QUICKEN LOANS INC.|  4.35347951590827|-0.08899247348614438| 358.5689787889155|              null|              null|
|NATIONSTAR MORTGA...| 4.172708234075604| 0.39047125841532887| 359.5821853961678|               1.0|               2.0|
|WELLS FARGO BANK,...| 4.266629427172305|  0.6704475572258285|359.25937820293814|              null|              null|
|FANNIE MAE/SETERU...| 4.433333333333333|   9.333333333333334| 350.6666666666667|              null|              null|
|DITECH FINANCIAL LLC| 4.192566550005296|   5.147629653197582| 354.7811008590519|               1.0|               2.0|
|SENECA MORTGAGE S...|4.1412100378136785

That takes a bit longer to run, because when you executed `show` you asked for a dataframe to be returned to you, which meant **Spark went back and calculated the previous operations.**  You could have done any number of intermediate steps similar to those before calling `show` and they all would have been lazy operations that finished nearly instantly, until `show` ran them all.

Now this would just be a background peculiarity, except that we have some control over the process.  If you imagine your _lineage_ as a straight line of instructions leading from your source data to your output, **we can use the `persist()` method to create a point for branching.**  Essentially it tells Spark "follow the instructions to this point, then _hold these results_ because I'm going to come back to them again."

Let's redo the previous code block with a `persist()`:

In [5]:
%%time
df_new = df.withColumn('New_c12', df['_c12'] ** 2)    #New_c12 = _c12^2
df_new = df_new.withColumn('New_c12', df_new['New_c12'] + df_new['_c12']) #New_c12 = New_c12 + _c12
df_new.persist()
df_grp = df_new.groupBy('_c2')
df_avg = df_grp.avg('_c3', '_c5', '_c6', '_c12', 'New_c12')

CPU times: user 8 ms, sys: 0 ns, total: 8 ms
Wall time: 221 ms


The `persist` command adds very little overhead in this case, finishing in well under a second.  Now we call `show` again to force it to calculate:

In [6]:
%%time
df_avg.show()

+--------------------+------------------+--------------------+------------------+------------------+------------------+
|                 _c2|          avg(_c3)|            avg(_c5)|          avg(_c6)|         avg(_c12)|      avg(New_c12)|
+--------------------+------------------+--------------------+------------------+------------------+------------------+
|  QUICKEN LOANS INC.|  4.35347951590827|-0.08899247348614438| 358.5689787889155|              null|              null|
|NATIONSTAR MORTGA...| 4.172708234075604| 0.39047125841532887| 359.5821853961678|               1.0|               2.0|
|WELLS FARGO BANK,...| 4.266629427172305|  0.6704475572258285|359.25937820293814|              null|              null|
|FANNIE MAE/SETERU...| 4.433333333333334|   9.333333333333334| 350.6666666666667|              null|              null|
|DITECH FINANCIAL LLC| 4.192566550005296|   5.147629653197582| 354.7811008590519|               1.0|               2.0|
|SENECA MORTGAGE S...| 4.141210037813679

Showing the `groupBy` averages this way took a bit longer because of the `persist` overhead.  But now let's back up and, in addition to the mean, let's also get the sums of our `groupBy` object:

In [7]:
%%time
df_sum = df_grp.sum('_c3', '_c5', '_c6', '_c12', 'New_c12')

CPU times: user 4 ms, sys: 0 ns, total: 4 ms
Wall time: 38.8 ms


That was the *lazy* portion, now we make it execute:

In [8]:
%%time
df_sum.show()

+--------------------+--------------------+--------+----------+---------+------------+
|                 _c2|            sum(_c3)|sum(_c5)|  sum(_c6)|sum(_c12)|sum(New_c12)|
+--------------------+--------------------+--------+----------+---------+------------+
|  QUICKEN LOANS INC.|    101801.764999999|   -2081|   8384777|     null|        null|
|NATIONSTAR MORTGA...|  40287.497999999956|    3770|   3471766|        2|         4.0|
|WELLS FARGO BANK,...|  187326.36500000005|   29436|  15773283|     null|        null|
|FANNIE MAE/SETERU...|  26.599999999999998|      56|      2104|     null|        null|
|DITECH FINANCIAL LLC|  39531.709999999934|   48537|   3345231|       41|        82.0|
|SENECA MORTGAGE S...|  24093.559999999983|   -1192|   2095648|     null|        null|
|SUNTRUST MORTGAGE...|  21530.767999999953|    4325|   1884795|     null|        null|
|ROUNDPOINT MORTGA...|   67708.25999999992|   82336|   5669070|       74|       148.0|
|      PENNYMAC CORP.|  15209.140000000003|

That was dramatically faster than the calculation showing the averages - 1.49 seconds versus over 18 seconds.  This is because Spark kept the intermediate results up to our `persist()` call from when we calculated the averages, and thus only had to run the code that came after that.  We can now do as many different branches of operations as we want stemming from `df_new` and since we persisted it, all the code before can be skipped.
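The branching behaviour can be sketched in plain Python. This is a toy model under stated assumptions, not Spark's API or implementation: the `LazyValue` class, its `then`, `persist`, `unpersist` and `show` methods, and the small list of numbers are all hypothetical and exist only to show when the upstream work is repeated.

```python
class LazyValue:
    """Toy lineage with an optional persist point (hypothetical, not Spark)."""

    def __init__(self, compute, parent=None):
        self.compute = compute   # the instruction for this step
        self.parent = parent     # the previous step in the lineage, if any
        self.persisted = False
        self._stored = None
        self._has_stored = False

    def then(self, func):
        # Lazy: add one more instruction to the lineage.
        return LazyValue(func, parent=self)

    def persist(self):
        # Only marks the branch point; nothing is computed yet.
        self.persisted = True
        return self

    def unpersist(self):
        # Drop the stored result so later actions recompute from the source.
        self.persisted = False
        self._stored, self._has_stored = None, False
        return self

    def show(self):
        # Action: walk the lineage, reusing a stored result where one exists.
        if self._has_stored:
            return self._stored
        if self.parent is None:
            value = self.compute()
        else:
            value = self.compute(self.parent.show())
        if self.persisted:
            self._stored, self._has_stored = value, True
        return value


load_runs = []  # one entry per time the "expensive" load actually runs


def load():
    load_runs.append(1)
    return [1, 2, 3, 4]  # made-up data for the illustration


def average(rows):
    return sum(rows) / len(rows)


base = LazyValue(load).then(lambda rows: [r ** 2 + r for r in rows]).persist()
avg_branch = base.then(average)
sum_branch = base.then(sum)

avg_result = avg_branch.show()       # runs load, the math step, then average
sum_result = sum_branch.show()       # reuses the stored rows; load is skipped
runs_while_persisted = len(load_runs)

base.unpersist()
sum_branch.show()                    # nothing stored any more, so load runs again
runs_after_unpersist = len(load_runs)

print(avg_result, sum_result, runs_while_persisted, runs_after_unpersist)
```

In this sketch the second branch costs almost nothing while the branch point is held, and becomes as expensive as the first again once `unpersist()` is called.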

There is no need for persisting if there is no branching.  In fact, as we saw, `persist` adds a bit of overhead to the process, and so is actually a hindrance if you're not going to be utilizing the branch point.  As a matter of good practice, and to free up more resources, you can call `.unpersist()` on a persisted object to drop it from storage when done with it:

In [9]:
df_new.unpersist();

(The trailing `;` simply suppresses the output from the command. We don't need to see the summary of what we just unpersisted.)

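Also note that `cache()` is essentially a synonym for `persist()` called with no arguments, so both use a default storage level.  What `persist()` adds is an optional `storageLevel` argument (for example `StorageLevel.MEMORY_ONLY` or `StorageLevel.MEMORY_AND_DISK`) that controls whether Spark may swap some of the stored data to disk if necessary.  A memory-only level gives the fastest recall, but any partitions that don't fit in memory are recomputed from the lineage when they are needed, so use it with care on large dataframes.  The default level has differed between Spark versions, so check the documentation for the version you are running.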

And finally, a bit more on `groupBy`.  Hopefully the usage above has given you some insight into how it works.  In short, `groupBy` is the vehicle for aggregation in a dataframe.  A `groupBy` object is, in itself, incomplete.  Consider this line from the code block above where we introduced `persist()`:

`df_grp = df_new.groupBy('_c2')`

It generates a `groupBy` object where the data is grouped around the unique values found in column `_c2`, but it is just a foundation.  It is like the sentence _"We are going to group our data up by the unique values found in column `_c2`, and then..."_  The sentence is unfinished!  The next line of code contains the rest:

`df_avg = df_grp.avg('_c3', '_c5', '_c6', '_c12', 'New_c12')`

Or to finish the sentence, _"... calculate the averages for these five columns within each group."_
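The two halves of that sentence can be mimicked in plain Python. This is only an analogy for the group-then-aggregate pattern, not how Spark performs it: the `group_by` and `avg` helpers, the column names, and the three rows of data are made up for the illustration.

```python
from collections import defaultdict

# Made-up rows standing in for a dataframe.
rows = [
    {"seller": "A", "rate": 4.0},
    {"seller": "B", "rate": 3.5},
    {"seller": "A", "rate": 5.0},
]


def group_by(rows, key):
    # "We are going to group our data by the unique values of `key`, and then..."
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    return dict(groups)


def avg(groups, column):
    # "...calculate the average of `column` within each group."
    return {
        name: sum(member[column] for member in members) / len(members)
        for name, members in groups.items()
    }


grouped = group_by(rows, "seller")  # unfinished: just rows sorted into buckets
averages = avg(grouped, "rate")     # finished: one number per bucket

print(averages)
```

The `grouped` value on its own is not a result you would report; it only becomes one when an aggregation such as `avg` is applied, and the same `grouped` value could be reused for a sum or a count.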
