In [None]:
def display(*args, **kargs): pass

# Sampling
 
This lab demonstrates how to perform sampling, including stratified sampling.  It includes examples using both `DataFrame` and `RDD` operations.

In [None]:
baseDir = '/mnt/ml-class/'
irisTwoFeatures = sqlContext.read.parquet(baseDir + 'irisTwoFeatures.parquet')

In [None]:
display(irisTwoFeatures)

When using a `DataFrame`, we can call `.sampleBy` to return a stratified sample without replacement.  `sampleBy` takes a column name and a dictionary that maps each value in that column to the fraction of its rows to sample; a value that is missing from the dictionary is treated as having a fraction of zero.  An explanation of `sampleBy` can be found under [DataFrame](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.sampleBy) for the Python API and under [DataFrameStatFunctions](http://spark.apache.org/docs/latest/api/scala/#org.apache.spark.sql.DataFrameStatFunctions) for the Scala API.
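
To make the idea concrete, here is a minimal pure-Python sketch of stratified sampling without replacement. It is not Spark's implementation; it assumes each row is kept independently with the probability listed for its stratum. The names `stratifiedSampleSketch`, `toyRows`, and `toySample` are hypothetical, and the toy data stands in for the iris rows.

```python
import random

def stratifiedSampleSketch(rows, keyOf, fractions, seed=None):
    """Keep each row independently with the probability given for its stratum."""
    rng = random.Random(seed)
    kept = []
    for row in rows:
        # Strata that are missing from `fractions` are treated as fraction 0.
        fraction = fractions.get(keyOf(row), 0.0)
        if rng.random() < fraction:
            kept.append(row)
    return kept

# Hypothetical toy data: 50 (label, id) rows for each of three labels.
toyRows = [(label, i) for label in (0, 1, 2) for i in range(50)]
toySample = stratifiedSampleSketch(toyRows, lambda r: r[0],
                                   {0: .10, 1: .20, 2: .30}, seed=42)
```

Because every row is an independent draw, the size of each stratum in the sample varies from run to run and only matches its fraction on average. No row can appear twice, which is what "without replacement" means here.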

In [None]:
help(irisTwoFeatures.sampleBy)

In [None]:
stratifiedSample = irisTwoFeatures.sampleBy('label', {0: .10, 1: .20, 2: .30})
display(stratifiedSample)

How many rows did we sample?  And how many came from each label?

In [None]:
print('total count: {0}'.format(stratifiedSample.count()))

In [None]:
labelCounts = (stratifiedSample
               .groupBy('label')
               .count()
               .orderBy('label'))
display(labelCounts)

Now let's sample with replacement from the `DataFrame`.
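
When sampling with replacement, the fraction is better read as the expected number of times each row is chosen, so a row can appear more than once. The sketch below models that in pure Python by drawing a Poisson-distributed copy count for every row. This is an illustration under that assumption, not Spark's code, and `poissonDraw`, `sampleWithReplacementSketch`, `toyIds`, and `toyResample` are hypothetical names.

```python
import math
import random

def poissonDraw(rng, mean):
    """Draw from a Poisson distribution using Knuth's multiplication method."""
    limit = math.exp(-mean)
    count = 0
    product = rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count

def sampleWithReplacementSketch(rows, fraction, seed=None):
    """Repeat each row a Poisson(fraction) number of times, possibly zero."""
    rng = random.Random(seed)
    sampled = []
    for row in rows:
        sampled.extend([row] * poissonDraw(rng, fraction))
    return sampled

# Hypothetical toy data: 150 row ids standing in for the iris rows.
toyIds = list(range(150))
toyResample = sampleWithReplacementSketch(toyIds, .20, seed=42)
```

With a fraction of `.20`, most rows are dropped, some appear once, and a few may appear two or more times.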

In [None]:
help(irisTwoFeatures.sample)

In [None]:
sampleWithReplace = irisTwoFeatures.sample(True, .20)
labelCountsReplace = (sampleWithReplace
                      .groupBy('label')
                      .count()
                      .orderBy('label'))
display(labelCountsReplace)

#### Convert to an RDD and sample from an RDD

First, we'll convert our `DataFrame` to an `RDD`.
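
Key-based operations such as `sampleByKey` expect an `RDD` of `(key, value)` pairs, so each row has to be reshaped with the label first. The sketch below shows that reshaping on plain Python tuples. `ToyRow` is a hypothetical stand-in for a Spark `Row`, and the column order (features first, label second) is an assumption about this dataset.

```python
from collections import namedtuple

# Hypothetical stand-in for a Spark Row; the field names and order are assumptions.
ToyRow = namedtuple('ToyRow', ['features', 'label'])

toyRowList = [ToyRow((5.1, 3.5), 0), ToyRow((7.0, 3.2), 1)]

# Swap the positions so the label becomes the key and the features the value.
toyPairs = [(r[1], r[0]) for r in toyRowList]
```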

In [None]:
# Re-key each row as a (label, features) pair; this assumes the label is the second column
irisTwoFeaturesRDD = (irisTwoFeatures
                      .rdd
                      .map(lambda r: (r[1], r[0])))

print('\n'.join(map(repr, irisTwoFeaturesRDD.take(2))))

Next, we'll perform stratified sampling by key with `sampleByKey`, this time with replacement.

In [None]:
help(irisTwoFeaturesRDD.sampleByKey)

In [None]:
irisSampleRDD = irisTwoFeaturesRDD.sampleByKey(True, {0: 0.5, 1: 0.5, 2: 0.1}, seed=1)

print('\n'.join(map(repr, irisSampleRDD.take(5))))

What do our counts look like?
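
As a rough guide for reading the counts, the expected number of sampled rows for a key is that key's row count multiplied by its fraction. The helper below is a small sketch of that arithmetic. `expectedSampleCounts` is a hypothetical name, and the per-label count of 50 is an assumption taken from the classic iris data rather than from this file.

```python
def expectedSampleCounts(keyCounts, fractions):
    """Expected rows per key: the key's row count times its sampling fraction."""
    return dict((key, keyCounts[key] * fractions[key]) for key in keyCounts)

# Hypothetical counts: 50 rows per label.
toyCounts = {0: 50, 1: 50, 2: 50}
toyExpected = expectedSampleCounts(toyCounts, {0: 0.5, 1: 0.5, 2: 0.1})
```

The sampled counts will usually land near these values rather than exactly on them, because each row is drawn at random.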

In [None]:
print(irisTwoFeaturesRDD.countByKey())
print(irisSampleRDD.countByKey())

We could also call `sample` to perform a random sample instead of a stratified sample.