#PySpark Random Sample with Example


---


**PySpark provides the pyspark.sql.DataFrame.sample(), pyspark.sql.DataFrame.sampleBy(), RDD.sample(), and RDD.takeSample() methods for drawing a random sample from a large dataset. In this article, I will explain each of them with Python examples.**

**If you work as a data scientist or data analyst, you often have to analyze large datasets or files with billions or trillions of records. Processing datasets of this size takes time, so during the analysis phase it is recommended to work on a random subset of the data instead.**

##1. PySpark SQL sample() Usage & Examples

---


**PySpark sampling (pyspark.sql.DataFrame.sample()) is a mechanism for getting random sample records from a dataset. It is helpful when you have a large dataset and want to analyze or test a subset of the data, for example 10% of the original file.**



---


###Below is the syntax of the sample() function.


##sample(withReplacement, fraction, seed=None)


---

- withReplacement – Whether to sample with replacement (default False). It can be omitted, as in df.sample(0.1).

- fraction – Fraction of rows to generate, range [0.0, 1.0]. The result is approximate: the number of rows returned is not guaranteed to be exactly this fraction of the dataset.

- seed – Seed for sampling (defaults to a random seed). Pass the same seed to reproduce the same random sample.

###1.1 Using fraction to get a random sample in PySpark


**A fraction between 0 and 1 returns approximately that fraction of the dataset. For example, 0.1 returns roughly 10% of the rows. However, it is not guaranteed to return exactly 10% of the records.**

---

**Note: If you run these examples on your system, you may see different results.**

In [0]:
df = spark.range(100)
print(df.sample(0.06).collect())

[Row(id=3), Row(id=11), Row(id=31), Row(id=35), Row(id=53), Row(id=83), Row(id=95)]


**My DataFrame has 100 records and I asked for a 6% sample, which would be 6 records, but sample() returned 7. This shows that sample() does not return the exact fraction specified.**
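**To see why the size varies, here is a minimal pure-Python sketch. It assumes that sampling without replacement keeps each row independently with probability equal to the fraction (Bernoulli sampling). The helper bernoulli_sample is hypothetical and only illustrates the idea; it is not Spark's implementation and will not reproduce Spark's output.**

```python
import random

def bernoulli_sample(rows, fraction, seed=None):
    """Hypothetical helper: keep each row independently with probability `fraction`."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]

rows = list(range(100))

# Draw 1,000 samples with different seeds and record the size of each one.
sizes = [len(bernoulli_sample(rows, 0.06, seed)) for seed in range(1000)]

print(min(sizes), max(sizes))   # the sample size changes from run to run
print(sum(sizes) / len(sizes))  # the average stays close to 100 * 0.06 = 6
```

**Under this assumption, 6 is only the expected size, so getting 5, 7, or 8 rows from a single run is normal.**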

###1.2 Using seed to reproduce the same Samples in PySpark


---


**When you don't pass a seed, every run of sample() returns a different set of records. During development and testing, however, you may need to regenerate the same sample every time so that you can compare results with a previous run. To get the same random sample consistently, use the same seed value for every run. Change the seed value to get a different sample.**

In [0]:
print(df.sample(0.1, 123).collect())

[Row(id=35), Row(id=38), Row(id=41), Row(id=45), Row(id=71), Row(id=84), Row(id=87), Row(id=99)]


In [0]:
print(df.sample(0.1, 123).collect())

[Row(id=35), Row(id=38), Row(id=41), Row(id=45), Row(id=71), Row(id=84), Row(id=87), Row(id=99)]


In [0]:
print(df.sample(0.1, 456).collect())

[Row(id=22), Row(id=33), Row(id=35), Row(id=41), Row(id=53), Row(id=80), Row(id=83), Row(id=87), Row(id=92)]


**In the first two examples I used the seed value 123, so the sampling results are the same. In the last example I used 456 as the seed value, which generates a different sample.**
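**The same behavior can be sketched in plain Python. The function sample_ids below is a hypothetical stand-in for a seeded sampler, not Spark code: a seeded pseudo-random generator always produces the same sequence of draws, so the same rows are selected.**

```python
import random

def sample_ids(n, fraction, seed):
    """Hypothetical helper: select ids 0..n-1 using a generator fixed by `seed`."""
    rng = random.Random(seed)  # private generator, fully determined by the seed
    return [i for i in range(n) if rng.random() < fraction]

first = sample_ids(100, 0.1, seed=123)
second = sample_ids(100, 0.1, seed=123)
other = sample_ids(100, 0.1, seed=456)

print(first == second)  # same seed, same sample
print(first == other)   # a different seed gives a different sequence of draws
```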

###1.3 Sample withReplacement (May contain duplicates)


**Sometimes you may need a random sample that can contain repeated values. Passing True as withReplacement allows the same row to be selected more than once.**

In [0]:
print(df.sample(True, 0.3, 123).collect())  # With Duplicates

[Row(id=0), Row(id=5), Row(id=9), Row(id=11), Row(id=13), Row(id=16), Row(id=17), Row(id=26), Row(id=26), Row(id=37), Row(id=41), Row(id=45), Row(id=49), Row(id=50), Row(id=50), Row(id=57), Row(id=58), Row(id=58), Row(id=65), Row(id=66), Row(id=71), Row(id=74), Row(id=77), Row(id=80), Row(id=81), Row(id=82), Row(id=84), Row(id=88), Row(id=90), Row(id=91), Row(id=91), Row(id=92), Row(id=94), Row(id=96)]


In [0]:
print(df.sample(0.3, 123).collect())  # No Duplicates

[Row(id=0), Row(id=4), Row(id=12), Row(id=15), Row(id=19), Row(id=21), Row(id=23), Row(id=24), Row(id=25), Row(id=28), Row(id=29), Row(id=34), Row(id=35), Row(id=36), Row(id=38), Row(id=41), Row(id=45), Row(id=47), Row(id=50), Row(id=52), Row(id=59), Row(id=63), Row(id=65), Row(id=71), Row(id=82), Row(id=84), Row(id=87), Row(id=94), Row(id=99)]


**In the first example, the values 26, 50, 58, and 91 each appear twice.**
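**One way to picture sampling with replacement is to let every row be emitted a random number of times, with an average of fraction copies per row. The sketch below assumes a Poisson-distributed copy count, drawn with Knuth's multiplication method. Both helpers are hypothetical illustrations rather than Spark's implementation, so the values differ from the output above.**

```python
import math
import random

def poisson(rng, lam):
    """Draw one Poisson(lam) value with Knuth's method (suitable for small lam)."""
    threshold = math.exp(-lam)
    k, p = 0, rng.random()
    while p > threshold:
        k += 1
        p *= rng.random()
    return k

def sample_with_replacement(rows, fraction, seed=None):
    """Hypothetical helper: emit each row Poisson(fraction) times."""
    rng = random.Random(seed)
    out = []
    for row in rows:
        out.extend([row] * poisson(rng, fraction))
    return out

sample = sample_with_replacement(range(100), 0.3, seed=123)
repeated = sorted({x for x in sample if sample.count(x) > 1})

print(len(sample))  # close to 100 * 0.3 on average, but not exact
print(repeated)     # rows that were emitted more than once, if any
```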

###1.4 Stratified sampling in PySpark


**You can do stratified sampling without replacement in PySpark by using the sampleBy() method. It returns a sample in which each stratum is sampled at the fraction you specify for it. If a stratum is not specified, its fraction is treated as zero.**



---

###sampleBy() Syntax


##sampleBy(col, fractions, seed=None)



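- col – Name of the DataFrame column that defines the strata.

- fractions – A dictionary that maps each stratum (key) to its sampling fraction (value).

- seed – Seed for sampling (optional).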


---


###sampleBy() Example

In [0]:
df2 = df.select((df.id % 3).alias("key"))

print(df2.sampleBy("key", {0: 0.1, 1:0.2}, 0).collect())

[Row(key=0), Row(key=0), Row(key=1), Row(key=1), Row(key=0), Row(key=1), Row(key=0), Row(key=1), Row(key=0), Row(key=0), Row(key=1), Row(key=1), Row(key=0)]
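**Notice that the output contains no rows with key 2, because that stratum is missing from the fractions dictionary. The pure-Python sketch below mimics this rule with a hypothetical helper, sample_by; it only illustrates the per-stratum logic and is not how Spark computes the result.**

```python
import random

def sample_by(rows, key_of, fractions, seed=None):
    """Hypothetical helper: keep each row with the fraction assigned to its stratum."""
    rng = random.Random(seed)
    # A stratum missing from `fractions` gets 0.0, so none of its rows are kept.
    return [row for row in rows if rng.random() < fractions.get(key_of(row), 0.0)]

keys = [i % 3 for i in range(100)]  # strata 0, 1 and 2
sample = sample_by(keys, lambda k: k, {0: 0.1, 1: 0.2}, seed=0)

counts = {k: sample.count(k) for k in (0, 1, 2)}
print(counts)  # stratum 2 always has a count of 0
```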


##2. PySpark RDD Sample


**PySpark RDD also provides a sample() function to get a random sample, as well as a related method, takeSample(), that returns the sampled elements as a Python list.**


---


###RDD sample() Syntax & Example

**The PySpark RDD sample() function returns a random sample just like the DataFrame version and takes the same parameters, except that withReplacement must be passed explicitly. Since I've already explained these parameters in the DataFrame section, I will not repeat the explanation here. If you have not read that section yet, I recommend reading it first.**

**RDD sample() is a transformation: it returns a new RDD that contains the randomly sampled elements.**

---

###Below is the syntax.


##sample(self, withReplacement, fraction, seed=None)

---


**Below is an example of the RDD sample() function.**

In [0]:
rdd = sc.range(0,100)

print(rdd.sample(False, 0.1, 0).collect())

[23, 48, 53, 60, 72, 87, 91, 96, 98]


In [0]:
print(rdd.sample(True, 0.3, 0).collect())

[2, 4, 4, 5, 7, 13, 15, 17, 23, 24, 25, 26, 29, 30, 30, 31, 31, 32, 37, 38, 42, 43, 45, 48, 52, 55, 57, 62, 68, 69, 73, 74, 74, 76, 82, 83, 84, 86, 93]


###RDD takeSample() Syntax & Example

**RDD takeSample() is an action, so you need to be careful when you use it: it returns the sampled records to the driver's memory. As with collect(), returning too much data can cause an out-of-memory error.**


---


###Syntax of RDD takeSample()


##takeSample(self, withReplacement, num, seed=None) 


----


**Example of RDD takeSample()**

In [0]:
print(rdd.takeSample(False, 10, 0))

[18, 60, 51, 68, 22, 1, 35, 84, 75, 72]


In [0]:
print(rdd.takeSample(True, 30, 123))

[72, 91, 55, 86, 37, 49, 34, 46, 63, 21, 81, 17, 20, 84, 29, 46, 84, 14, 59, 7, 80, 25, 60, 59, 54, 22, 34, 83, 82, 25]
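**Unlike sample(), takeSample() takes the number of elements to return (num) instead of a fraction, which is why the two outputs above contain exactly 10 and 30 elements. As a rough analogy only, and not Spark's algorithm, Python's standard random module behaves the same way on a local list:**

```python
import random

data = list(range(100))
rng = random.Random(0)

without_replacement = rng.sample(data, 10)  # exactly 10 distinct elements
with_replacement = rng.choices(data, k=30)  # exactly 30 elements, repeats allowed

print(len(without_replacement), len(set(without_replacement)))  # 10 10
print(len(with_replacement))                                    # 30
```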


##Conclusion


**In summary, PySpark sampling can be done on both RDDs and DataFrames. With sample() and sampleBy() you specify the fraction of the data you want and get approximately that many records, while RDD takeSample() takes the number of elements to return.**

**Use seed to regenerate the same sample across multiple runs, and use withReplacement if you are okay with repeated records in the sample.**