**_pySpark Basics: Subsetting Data_**

_by Jeff Levy (jlevy@urban.org)_

_Last Updated: 28 June 2016, Spark v1.6.1_

_Abstract: This guide covers filtering your data by specified criteria to obtain a subset_

_Main operations used: dtypes, take, show, select, drop, filter/where, sample_

***

We begin with some basic setup, importing the SQL structure that supports the dataframes we'll be using.  Note that `sc`, the Spark Context, is created automatically when the cluster is loaded:

In [15]:
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)

Then we load some real data from an S3 bucket (the same data used in the csv tutorial), letting pySpark infer the schema automatically, and take a quick peek at it:

In [16]:
df = sqlContext.read.load('s3://ui-hfpc/Performance_2015Q1.txt',
                          format='com.databricks.spark.csv',
                          header='false',
                          inferSchema='true',
                          delimiter='|')

In [17]:
df.dtypes

[('C0', 'bigint'),
 ('C1', 'string'),
 ('C2', 'string'),
 ('C3', 'double'),
 ('C4', 'double'),
 ('C5', 'int'),
 ('C6', 'int'),
 ('C7', 'int'),
 ('C8', 'string'),
 ('C9', 'int'),
 ('C10', 'string'),
 ('C11', 'string'),
 ('C12', 'int'),
 ('C13', 'string'),
 ('C14', 'string'),
 ('C15', 'string'),
 ('C16', 'string'),
 ('C17', 'string'),
 ('C18', 'string'),
 ('C19', 'string'),
 ('C20', 'string'),
 ('C21', 'string'),
 ('C22', 'string'),
 ('C23', 'string'),
 ('C24', 'string'),
 ('C25', 'string'),
 ('C26', 'int'),
 ('C27', 'string')]

In [18]:
#Note: this output looks messy because `take` doesn't format the results; it just returns the rows.  `show()` formats them
#as a table, but this dataframe is so wide that the table would still look messy.  See below for `show()` in action.
df.take(5)

[Row(C0=100002091588, C1=u'01/01/2015', C2=u'OTHER', C3=4.125, C4=None, C5=0, C6=360, C7=360, C8=u'01/2045', C9=16740, C10=u'0', C11=u'N', C12=None, C13=u'', C14=u'', C15=u'', C16=u'', C17=u'', C18=u'', C19=u'', C20=u'', C21=u'', C22=u'', C23=u'', C24=u'', C25=u'', C26=None, C27=u''),
 Row(C0=100002091588, C1=u'02/01/2015', C2=u'', C3=4.125, C4=None, C5=1, C6=359, C7=359, C8=u'01/2045', C9=16740, C10=u'0', C11=u'N', C12=None, C13=u'', C14=u'', C15=u'', C16=u'', C17=u'', C18=u'', C19=u'', C20=u'', C21=u'', C22=u'', C23=u'', C24=u'', C25=u'', C26=None, C27=u''),
 Row(C0=100002091588, C1=u'03/01/2015', C2=u'', C3=4.125, C4=None, C5=2, C6=358, C7=358, C8=u'01/2045', C9=16740, C10=u'0', C11=u'N', C12=None, C13=u'', C14=u'', C15=u'', C16=u'', C17=u'', C18=u'', C19=u'', C20=u'', C21=u'', C22=u'', C23=u'', C24=u'', C25=u'', C26=None, C27=u''),
 Row(C0=100002091588, C1=u'04/01/2015', C2=u'', C3=4.125, C4=None, C5=3, C6=357, C7=357, C8=u'01/2045', C9=16740, C10=u'0', C11=u'N', C12=None, C13=u'',

One of the simplest ways to subset is to select just a few of the columns:

In [19]:
from pyspark.sql.functions import col

df_select = df.select(col('C0'), col('C1'), col('C3'), col('C9'))

In [20]:
#Note: `show()` defaults to showing you the first 20 rows, but here we've specified only 5
df_select.show(5)

+------------+----------+-----+-----+
|          C0|        C1|   C3|   C9|
+------------+----------+-----+-----+
|100002091588|01/01/2015|4.125|16740|
|100002091588|02/01/2015|4.125|16740|
|100002091588|03/01/2015|4.125|16740|
|100002091588|04/01/2015|4.125|16740|
|100002091588|05/01/2015|4.125|16740|
+------------+----------+-----+-----+
only showing top 5 rows



Or we can get a subset by dropping columns instead, which is convenient when we want to keep more columns than we drop:

In [21]:
df_drop = df_select.drop(col('C3'))

In [22]:
df_drop.show(5)

+------------+----------+-----+
|          C0|        C1|   C9|
+------------+----------+-----+
|100002091588|01/01/2015|16740|
|100002091588|02/01/2015|16740|
|100002091588|03/01/2015|16740|
|100002091588|04/01/2015|16740|
|100002091588|05/01/2015|16740|
+------------+----------+-----+
only showing top 5 rows



We often want to subset by rows as well, for example by specifying a condition:
<a id="filter_rows"></a>

In [23]:
#Note that we have to use .show() at the end of .describe(), because .describe() returns a new dataframe with the information.
#In many other programs, such as Stata, `describe` returns a formatted table.  Here, `summary` and `C6` are actually column names
df.describe('C6').show()

+-------+-----------------+
|summary|               C6|
+-------+-----------------+
|  count|          3526154|
|   mean|354.7084951479714|
| stddev|4.011812510792076|
|    min|              292|
|    max|              480|
+-------+-----------------+



In [24]:
#Note that `where` is an alias for `filter`: you can use them interchangeably
df_filter = df.filter(df['C6'] < 358)

In [25]:
df_filter.describe('C6').show()

+-------+------------------+
|summary|                C6|
+-------+------------------+
|  count|           2598037|
|   mean|353.15604897081914|
| stddev| 3.517021305688398|
|    min|               292|
|    max|               357|
+-------+------------------+



We can repeat the same procedure for multiple conditions and columns, combining them with logical operators such as `&` (and), this time using `where`, the alias for `filter`:

In [26]:
df_filter = df.where((df['C6'] > 340) & (df['C5'] < 4))

In [27]:
df_filter.describe('C6', 'C5').show()

+-------+------------------+------------------+
|summary|                C6|                C5|
+-------+------------------+------------------+
|  count|           1254131|           1254131|
|   mean|358.48713810598736| 1.474693632483369|
| stddev|1.3789619103497548|1.2067831502138442|
|    min|               341|                -1|
|    max|               361|                 3|
+-------+------------------+------------------+
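
One detail in the `where` call above is easy to miss: each condition is wrapped in its own parentheses. In Python, `&` binds more tightly than comparison operators such as `>` and `<`, so leaving the parentheses out changes how the expression is parsed. The sketch below shows the effect with plain integers standing in for one row's values (the names `c6` and `c5` are illustrative only, not part of the dataframe API). With pySpark columns, the unparenthesized form typically fails with an error rather than quietly returning a different answer.

```python
# Plain integers stand in for a single row's C6 and C5 values (illustrative only).
c6, c5 = 300, 2

# Intended test: C6 > 340 AND C5 < 4. Here C6 is too small, so the row fails.
with_parens = (c6 > 340) & (c5 < 4)

# Without parentheses Python evaluates `340 & c5` first, so this is parsed as
# the chained comparison  c6 > (340 & c5) < 4,  which is a different test.
without_parens = c6 > 340 & c5 < 4

print(with_parens)     # False
print(without_parens)  # True
```

The safe habit is to parenthesize every comparison before combining it with `&`.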



And finally, you might want to take a random sample of rows.  This can be particularly useful, for example, if your data is large enough to require more expensive clusters to be spun up to work with it all.  Take a more digestible sample of the whole, do your intermediate work and testing on it using a cheaper cluster, then, when everything is ready, spin up the more expensive cluster for a final run on the whole dataset.

In [28]:
#You can pass three arguments into sample: the first is a boolean, which is True to sample with replacement, False without.
#The second is the fraction of the dataset to take, in this case 5%, and the third is an optional random seed.  If
#you specify an integer here, anyone who runs the same sampling operation with the same seed
#will get the same result.  If no seed is passed, the exact random sample can't be reproduced.

df_sample = df.sample(False, 0.05, 99)
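
As a rough mental model of sampling without replacement, each row can be thought of as being kept independently with probability equal to the fraction, using a seeded random number generator. The plain-Python sketch below illustrates that idea. `bernoulli_sample` is a hypothetical helper, not a pySpark function, and Spark's actual implementation and random number generator differ. It shows two consequences: the sample size is close to, but not exactly, the requested fraction, and the same seed reproduces the same sample.

```python
import random

def bernoulli_sample(rows, fraction, seed=None):
    """Hypothetical helper: keep each row independently with probability `fraction`."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]

rows = list(range(100000))  # stand-in for the rows of a dataframe

first = bernoulli_sample(rows, 0.05, seed=99)
second = bernoulli_sample(rows, 0.05, seed=99)

print(len(first))       # close to 5000, but not necessarily exactly 5000
print(first == second)  # True: the same seed gives the same sample
```

Because the fraction is a per-row probability in this model, the row count of a sample should be read as approximate.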

In [29]:
df_sample.describe('C6').show()

+-------+------------------+
|summary|                C6|
+-------+------------------+
|  count|            176428|
|   mean| 354.7217051715147|
| stddev|3.9450655094773754|
|    min|               299|
|    max|               361|
+-------+------------------+



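If you compare this to our original summary stats on unfiltered column C6 from above, you'll see that the sample's mean (354.72 vs. 354.71) and stddev (3.95 vs. 4.01) stay close to those of the full data, even though the sample holds only about 5% of the rows.  Extreme values are not necessarily preserved: the sample's max is 361, while the full data's max is 480.  You can then write this sample to a new file in an S3 bucket and work with it instead of the whole data.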
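
The comparison can also be checked with a few lines of plain Python. This is only a quick arithmetic check on the printed `describe` output for C6, with the values copied by hand from the two tables above; the variable names are illustrative.

```python
# Summary statistics for C6, copied from the describe() output above.
full = {'count': 3526154, 'mean': 354.7084951479714, 'stddev': 4.011812510792076}
sample = {'count': 176428, 'mean': 354.7217051715147, 'stddev': 3.9450655094773754}

# Share of rows that ended up in the sample (the requested fraction was 0.05).
realized_fraction = sample['count'] / float(full['count'])

# Relative difference between the sample statistics and the full-data statistics.
mean_rel_diff = (sample['mean'] - full['mean']) / full['mean']
stddev_rel_diff = (sample['stddev'] - full['stddev']) / full['stddev']

print('realized fraction: %.4f' % realized_fraction)
print('mean differs by %.4f%%' % (100 * mean_rel_diff))
print('stddev differs by %.2f%%' % (100 * stddev_rel_diff))
```

The realized fraction comes out at roughly 0.0500, the mean differs by well under a hundredth of a percent, and the stddev by a bit under 2%.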