**_pySpark Basics: Dataframe Concepts_**

## _by Jeff Levy (jlevy@urban.org) & Alex Engler (aengler@urban.org)_

_Last Updated: 31 Jul, Spark v2.1_

_Abstract: This guide will explore some basic concepts necessary for working with many dataframe operations, in particular `groupBy` and `persist`._

_Main operations used: read.load, withColumn, groupBy, persist, cache, unpersist_

***

Spark does its computations in what is called a _lazy_ fashion.  That is, when you tell it to do things to your data, it _doesn't do them right away._  Instead it checks that they're valid commands, then stacks them up until you actually ask it to return a value or a dataframe to you.  This stack of commands is called a _lineage_ in Spark, and it means we can think of a Spark dataframe object as a list of instructions built on top of your original data.
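To make the idea of a lineage concrete, here is a minimal pure-Python sketch. It is not Spark itself: the `LazyFrame` class and its `with_step` and `collect` methods are hypothetical names invented for illustration, and they only mimic how instructions are recorded first and run later when a result is requested.

```python
class LazyFrame:
    """Toy stand-in for a lazily evaluated dataframe (hypothetical, not a Spark class)."""

    def __init__(self, source, steps=()):
        self.source = source        # the original data
        self.steps = tuple(steps)   # the "lineage": functions to apply, in order

    def with_step(self, func):
        # Record the instruction and return a new object; nothing is computed here.
        return LazyFrame(self.source, self.steps + (func,))

    def collect(self):
        # Only now is the lineage replayed against the source data.
        rows = list(self.source)
        for func in self.steps:
            rows = [func(row) for row in rows]
        return rows


calls = []  # records when work actually happens

def double(x):
    calls.append(x)
    return x * 2

lazy = LazyFrame([1, 2, 3]).with_step(double).with_step(lambda x: x + 1)
work_before_collect = len(calls)   # 0: building the lineage did no work
result = lazy.collect()            # [3, 5, 7]
```

Building `lazy` only stores the two functions; `double` is not called until `collect()` asks for the result.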

Let's see it in action.  First we'll load up the same dataframe we did in basics 1:

In [1]:
df = spark.read.format('com.databricks.spark.csv').options(header='False', inferschema='true', sep='|').load('s3://ui-spark-social-science-public/data/Performance_2015Q1.txt')

We'll take a subset of the columns and rename them, like we did in the first tutorial: 

In [2]:
df_lim = df.select('_c0','_c1','_c2', '_c3', '_c4', '_c5', '_c6', '_c7', '_c8', '_c9', '_c10', '_c11', '_c12', '_c13')

old_names = ['_c0','_c1','_c2', '_c3', '_c4', '_c5', '_c6', '_c7', '_c8', '_c9', '_c10', '_c11', '_c12', '_c13']
new_names = ['loan_id','period','servicer_name', 'new_int_rt', 'act_endg_upb', 'loan_age', 'mths_remng', 'aj_mths_remng', 'dt_matr', 'cd_msa', 'delq_sts', 'flag_mod', 'cd_zero_bal', 'dt_zero_bal']
for old, new in zip(old_names, new_names):
    df_lim = df_lim.withColumnRenamed(old, new)

But now let's try some numerical operations on a column. We can use the `.withColumn` method to create a new dataframe that also has an additional calculated variable, in this case the sum of `loan_age` and `mths_remng` (months remaining).

In [3]:
## Add a column named 'loan_length' to the existing dataframe:
df_lim = df_lim.withColumn('loan_length', df_lim['loan_age'] + df_lim['mths_remng'])

## Group the new dataframe by servicer name:
df_grp = df_lim.groupBy('servicer_name')

## Compute average loan age, months remaining, and loan length by servicer:
df_avg = df_grp.avg('loan_age', 'mths_remng', 'loan_length')

Here we performed a simple math operation (adding `loan_age` to `mths_remng`), then performed a `groupBy` operation over the entries in `servicer_name` (more on `groupBy` in a minute) while asking it to calculate averages for three numeric columns within each servicer.

However, if you actually ran the code, you probably noticed that the code block finished nearly instantly, despite there being over 3.5 million rows of data.  This is an example of _lazy_ computing: **nothing was actually computed here.**  At the moment, we're just creating a list of instructions. All pySpark did was make sure they were valid instructions.  Now let's see what happens if we tell it to `show` us the results:

In [4]:
df_avg.show()

+--------------------+--------------------+------------------+------------------+
|       servicer_name|       avg(loan_age)|   avg(mths_remng)|  avg(loan_length)|
+--------------------+--------------------+------------------+------------------+
|  QUICKEN LOANS INC.|-0.08899247348614438| 358.5689787889155|358.47998631542936|
|NATIONSTAR MORTGA...| 0.39047125841532887| 359.5821853961678| 359.9726566545831|
|                null|  5.6264681794400015|354.21486809483747| 359.8413362742775|
|WELLS FARGO BANK,...|  0.6704475572258285|359.25937820293814|359.92982576016396|
|FANNIE MAE/SETERU...|   9.333333333333334| 350.6666666666667|             360.0|
|DITECH FINANCIAL LLC|   5.147629653197582| 354.7811008590519|359.92873051224944|
|SENECA MORTGAGE S...| -0.2048814025438295|360.20075627363354| 359.9958748710897|
|SUNTRUST MORTGAGE...|  0.8241234756097561| 359.1453887195122|  359.969512195122|
|ROUNDPOINT MORTGA...|   5.153408024034549| 354.8269387244163|359.98034674845087|
|      PENNYMAC 

That takes a bit longer to run, because when you executed `show` you asked for a dataframe to be returned to you, which meant **Spark went back and calculated the three previous operations.**  You could have done any number of intermediate steps similar to those before calling `show` and they all would have been lazy operations that finished nearly instantly, until `show` ran them all.

Now this would just be a background peculiarity, except that we have some control over the process.  If you imagine your _lineage_ as a straight line of instructions leading from your source data to your output, **we can use the `persist()` method to create a point for branching.**  Essentially it tells Spark "follow the instructions to this point, then _hold these results_ because I'm going to come back to them again."

Let's redo the previous code block with a `persist()`:

In [5]:
df_keep = df_lim.withColumn('loan_length', df_lim['loan_age'] + df_lim['mths_remng'])

df_keep.persist()

df_grp = df_keep.groupBy('servicer_name')
df_avg = df_grp.avg('loan_age', 'mths_remng', 'loan_length')

The `persist` command adds very little overhead in this case, finishing in well under a second.  Now we call `show` again to force it to calculate the averages by group:

In [6]:
df_avg.show()

+--------------------+--------------------+------------------+------------------+
|       servicer_name|       avg(loan_age)|   avg(mths_remng)|  avg(loan_length)|
+--------------------+--------------------+------------------+------------------+
|  QUICKEN LOANS INC.|-0.08899247348614438| 358.5689787889155|358.47998631542936|
|NATIONSTAR MORTGA...| 0.39047125841532887| 359.5821853961678| 359.9726566545831|
|                null|  5.6264681794400015|354.21486809483747| 359.8413362742775|
|WELLS FARGO BANK,...|  0.6704475572258285|359.25937820293814|359.92982576016396|
|FANNIE MAE/SETERU...|   9.333333333333334| 350.6666666666667|             360.0|
|DITECH FINANCIAL LLC|   5.147629653197582| 354.7811008590519|359.92873051224944|
|SENECA MORTGAGE S...| -0.2048814025438295|360.20075627363354| 359.9958748710897|
|SUNTRUST MORTGAGE...|  0.8241234756097561| 359.1453887195122|  359.969512195122|
|ROUNDPOINT MORTGA...|   5.153408024034549| 354.8269387244163|359.98034674845087|
|      PENNYMAC 

Showing the `groupBy` averages this way took a bit longer because of the `persist` overhead.  But now let's back up and, in addition to the mean, let's also get the sums of our `groupBy` object:

In [7]:
df_sum = df_grp.sum('new_int_rt', 'loan_age', 'mths_remng', 'cd_zero_bal', 'loan_length')

That was the *lazy* portion; now we make it execute:

In [8]:
df_sum.show()

+--------------------+--------------------+-------------+---------------+----------------+----------------+
|       servicer_name|     sum(new_int_rt)|sum(loan_age)|sum(mths_remng)|sum(cd_zero_bal)|sum(loan_length)|
+--------------------+--------------------+-------------+---------------+----------------+----------------+
|  QUICKEN LOANS INC.|  101801.76500000055|        -2081|        8384777|            null|         8382696|
|NATIONSTAR MORTGA...|  40287.497999999934|         3770|        3471766|               2|         3475536|
|                null|1.3139130895007337E7|     17690263|     1113692280|           16932|      1131382543|
|WELLS FARGO BANK,...|  187326.36499999996|        29436|       15773283|            null|        15802719|
|FANNIE MAE/SETERU...|                26.6|           56|           2104|            null|            2160|
|DITECH FINANCIAL LLC|   39531.70999999991|        48537|        3345231|              41|         3393768|
|SENECA MORTGAGE S...|   240

That was dramatically faster than the calculation showing the averages (we benchmarked it at 1.49 seconds versus over 18 seconds).  This is because Spark kept the intermediate results up to `persist()` from when we calculated the averages, and thus only had to run the code that came after that.  We can now do as many different branches of operations as we want stemming from `df_keep`, and since we persisted it, all the code before the `persist()` won't be executed again.
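The effect of the branch point can be sketched in plain Python. This is a toy model, not Spark's mechanism: `expensive_step`, `run_counter`, and `persisted` are hypothetical names, the three rows are made up, and "persisting" here just means holding an intermediate result in a variable.

```python
run_counter = {"expensive_step": 0}

def expensive_step(rows):
    # Stands in for everything before the branch point (e.g. adding a column).
    run_counter["expensive_step"] += 1
    return [(name, age, age + remaining) for name, age, remaining in rows]

source = [("A", 1, 359), ("A", 3, 357), ("B", 5, 355)]

# Without a held result, each branch replays the lineage from the source.
total_without = sum(r[2] for r in expensive_step(source))
mean_without = sum(r[2] for r in expensive_step(source)) / len(source)
runs_without_persist = run_counter["expensive_step"]   # 2

# With a held result, the expensive step runs once and both branches reuse it.
run_counter["expensive_step"] = 0
persisted = expensive_step(source)
total_with = sum(r[2] for r in persisted)
mean_with = sum(r[2] for r in persisted) / len(persisted)
runs_with_persist = run_counter["expensive_step"]      # 1
```

Both versions produce the same numbers; the only difference is how many times the work before the branch point is repeated.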

There is no need for persisting if there is no branching.  In fact, as we saw, `persist` adds a bit of overhead to the process, and so is actually a hindrance if you're not going to be utilizing the branch point.  As a matter of good practice, and to free up more resources, you can call `.unpersist()` on a persisted object to drop it from storage when done with it:

In [9]:
df_keep.unpersist();

DataFrame[loan_id: bigint, period: string, servicer_name: string, new_int_rt: double, act_endg_upb: double, loan_age: int, mths_remng: int, aj_mths_remng: int, dt_matr: string, cd_msa: int, delq_sts: string, flag_mod: string, cd_zero_bal: int, dt_zero_bal: string, loan_length: int]

(The trailing `;` simply suppresses the output from the command; we don't need to see the summary of what we just unpersisted.)

Also note that `cache()` is essentially a synonym for `persist()`: for dataframes, both use the default storage level, which keeps the checkpoint in memory for the fastest recall and allows Spark to swap some of it to disk if necessary.  To change that behavior, pass an explicit storage level to `persist()`; a memory-only level, for example, only works well if the dataframe you are forcing it to hold is small enough to fit in the memory of your nodes, so use it with care.

And finally, a bit more on `groupBy`.  Hopefully the usage above has given you some insight into how it works.  In short, `groupBy` is the vehicle for aggregation in a dataframe.  A `groupBy` object is, in itself, incomplete.  Consider the line in the code block where we introduced a `persist()` above that looks like this:

`df_grp = df_keep.groupBy('servicer_name')`

This generates a `groupBy` object where the data is grouped around the unique values found in column `servicer_name`, but it is just a foundation.  It is like the sentence _"We are going to group our data up by the unique values found in column servicer_name, and then..."_  The next line of code contains the rest:

`df_avg = df_grp.avg('loan_age', 'mths_remng', 'loan_length')`

Or to finish the sentence, _"... calculate the averages for these three columns within each group."_
