- Author: Ben Du
- Date: 2020-08-26 10:43:36
- Title: Sample Rows from a Spark DataFrame
- Slug: spark-dataframe-sample
- Category: Computer Science
- Tags: programming, Scala, Spark, DataFrame, sample

## SQL API

```
SELECT * FROM some_table
TABLESAMPLE (100 ROWS)
```

```
SELECT * FROM some_table
TABLESAMPLE (50 PERCENT)
```
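
The same queries can also be run from Scala through `spark.sql`. The sketch below is only an illustration: it assumes a local `SparkSession` and registers a made-up temporary view named `some_table` holding the integers 0 to 999, neither of which comes from the examples above.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for illustration; in a notebook `spark` usually exists already.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("tablesample-sketch")
  .getOrCreate()

// Hypothetical data: 1000 rows with a single column `id`, registered as `some_table`.
spark.range(1000).createOrReplaceTempView("some_table")

// Row-based sampling returns at most the requested number of rows.
val byRows = spark.sql("SELECT * FROM some_table TABLESAMPLE (100 ROWS)")

// Percent-based sampling keeps each row with a probability of about 0.5,
// so the resulting count varies from run to run.
val byPercent = spark.sql("SELECT * FROM some_table TABLESAMPLE (50 PERCENT)")

println(byRows.count())
println(byPercent.count())
```

As far as I can tell, Spark treats the `ROWS` form like a limit rather than a random draw, so it is worth checking the result before relying on it as a random sample.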

In [1]:
val df = spark.read.json("../data/people.json")
df.show

+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+



## Sample with Replacement

In [3]:
df.sample(true, 0.9).show

+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|null|Michael|
|  30|   Andy|
|  30|   Andy|
|  19| Justin|
+----+-------+
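
When sampling with replacement, the same row can be drawn more than once, as the duplicated rows above show. For a repeatable draw, `sample` also accepts a seed. The sketch below assumes a local `SparkSession` and a small stand-in DataFrame called `people` (a hypothetical replacement for `../data/people.json`); the seed value 42 is arbitrary.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for illustration.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("sample-with-replacement-sketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical stand-in for the people data used above.
val people = Seq[(String, Option[Int])](
  ("Michael", None),
  ("Andy", Some(30)),
  ("Justin", Some(19))
).toDF("name", "age")

// Two draws with the same seed over the same data select the same rows.
val s1 = people.sample(withReplacement = true, fraction = 0.9, seed = 42L)
val s2 = people.sample(withReplacement = true, fraction = 0.9, seed = 42L)

s1.show()
s2.show()
```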



## Sample without Replacement

In [5]:
df.sample(false, 0.9).show

+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+



In [7]:
df.sample(false, 0.5).show

+---+------+
|age|  name|
+---+------+
| 30|  Andy|
| 19|Justin|
+---+------+
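
The outputs above suggest that `fraction` acts as a per-row inclusion probability rather than an exact proportion, so the number of rows returned varies. If an exact number of rows is needed, one common workaround (not part of the original notebook) is to shuffle with `rand()` and take a `limit`. The sketch below assumes a local `SparkSession`; the names `numbers`, `approx`, `exact`, and `n` are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.rand

// Assumption: a local session for illustration.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("exact-sample-size-sketch")
  .getOrCreate()

// 99 rows with a single column `x` holding the values 1 to 99.
val numbers = spark.range(1, 100).toDF("x")

// sample(): the row count is only approximately fraction * total.
val approx = numbers.sample(withReplacement = false, fraction = 0.5)

// Shuffle the rows randomly, then keep exactly n of them.
// This requires sorting the whole DataFrame, which can be costly on large data.
val n = 10
val exact = numbers.orderBy(rand()).limit(n)

println(approx.count())
println(exact.count())
```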



## Be Careful with Subsampling

If you don't persist the DataFrame,
it is recomputed every time an action runs on it!
This is dangerous for any data processing that involves randomness,
e.g., subsampling.

In [5]:
val df = Range(1, 100).toDF("x").sample(false, 0.5)

df = [x: int]


[x: int]

In [6]:
df.count

49

In [9]:
df.count

49

In [10]:
df.show

+---+
|  x|
+---+
|  4|
|  6|
|  7|
|  8|
| 10|
| 11|
| 12|
| 13|
| 14|
| 16|
| 17|
| 20|
| 21|
| 22|
| 27|
| 28|
| 31|
| 34|
| 35|
| 39|
+---+
only showing top 20 rows
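
One way to guard against the recomputation described in this section is to fix the seed and persist the sample before using it more than once. This is a minimal sketch assuming a local `SparkSession`; the name `sampled` and the seed value 2020 are arbitrary choices, not from the original notebook.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for illustration.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("persist-sample-sketch")
  .getOrCreate()
import spark.implicits._

// Fix the seed and mark the sample for caching so later actions can reuse the same rows.
val sampled = Range(1, 100).toDF("x")
  .sample(withReplacement = false, fraction = 0.5, seed = 2020L)
  .persist()

val firstCount = sampled.count()  // first action materializes the cache
val secondCount = sampled.count() // later actions read the cached rows
println(firstCount == secondCount)

// Release the cached data once it is no longer needed.
sampled.unpersist()
```

Cached partitions can still be evicted and recomputed, so for a sample that must stay fixed across jobs, writing it out (for example to Parquet) and reading it back is likely the safer option.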

