How to access first n rows in Spark

Comparing the main Spark Dataset functions: limit(n), head(n), first(), and take(n)

limit(n)

Returns a new Dataset containing the first n rows. (org.apache.spark.sql.Dataset[...])
A transformation: calling it does not trigger any computation by itself.

Example

val df = sc.parallelize(Seq((1,"hello"), (2,"world"),(3, "haha"))).toDF("no", "word")

df.limit(2)
// res35: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [no: int, word: string]
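
Because limit(n) is a transformation, nothing is computed until an action runs on the Dataset it returns. The following is a minimal sketch of that, written so it can be pasted into spark-shell; the local master, the app name, and the variable names `limited` and `rows` are assumptions made for illustration and are not part of the original note.

```scala
import org.apache.spark.sql.{Dataset, Row, SparkSession}

// Assumption: a local session; inside spark-shell getOrCreate() reuses the existing one.
val spark = SparkSession.builder().master("local[*]").appName("limit-is-lazy").getOrCreate()
import spark.implicits._

val df = Seq((1, "hello"), (2, "world"), (3, "haha")).toDF("no", "word")

// Transformation: only describes "the first 2 rows"; no Spark job runs here.
val limited: Dataset[Row] = df.limit(2)

// Action: collect() triggers execution and brings the rows to the driver.
val rows: Array[Row] = limited.collect()
rows.foreach(println)
```

Any other action (show(), count(), a write) would trigger execution of `limited` in the same way.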

head(n)

Returns the first n rows. (Array[org.apache.spark.sql.Row])
When called with no argument, returns a single Row. (org.apache.spark.sql.Row)
An action: the computation runs as soon as it is called.

Internally, it calls limit(n) to build a Dataset and then collects it, which is what performs the action.

  def head(n: Int): Array[T] = withAction("head", limit(n).queryExecution)(collectFromPlan)
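
Seen from the outside, the definition above suggests that head(n) should give the same rows as spelling out limit(n) followed by collect(). A small sketch to check this on the sample data, assuming a local session; `viaHead` and `viaLimit` are illustrative names only.

```scala
import org.apache.spark.sql.{Row, SparkSession}

// Assumption: a local session; inside spark-shell getOrCreate() reuses the existing one.
val spark = SparkSession.builder().master("local[*]").appName("head-vs-limit-collect").getOrCreate()
import spark.implicits._

val df = Seq((1, "hello"), (2, "world"), (3, "haha")).toDF("no", "word")

// Action in a single call.
val viaHead: Array[Row] = df.head(2)

// The same steps written explicitly: transformation first, then collect.
val viaLimit: Array[Row] = df.limit(2).collect()

// Compare the two arrays element by element.
println(viaHead.sameElements(viaLimit))
```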

Functions that call head internally when executed: first(), take()

Example

val df = sc.parallelize(Seq((1,"hello"), (2,"world"),(3, "haha"))).toDF("no", "word")

df.head(2)
// res33: Array[org.apache.spark.sql.Row] = Array([1,hello], [2,world])

df.head()
// res34: org.apache.spark.sql.Row = [1,hello]

first()

Returns the first Row.
Same as head().

take(n: Int)

Returns the first n rows.
Same as head(n).
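
As a quick check of the two aliases, the sketch below calls first() next to head() and take(n) next to head(n) on the sample data and compares the results. It assumes a local session, and the `via...` variable names are made up for this example.

```scala
import org.apache.spark.sql.{Row, SparkSession}

// Assumption: a local session; inside spark-shell getOrCreate() reuses the existing one.
val spark = SparkSession.builder().master("local[*]").appName("head-aliases").getOrCreate()
import spark.implicits._

val df = Seq((1, "hello"), (2, "world"), (3, "haha")).toDF("no", "word")

// Single-row variants.
val viaHead: Row = df.head()
val viaFirst: Row = df.first()

// n-row variants.
val viaHeadN: Array[Row] = df.head(2)
val viaTake: Array[Row] = df.take(2)

// Row compares by value, so equal contents means equal rows.
println(viaFirst == viaHead)
println(viaTake.sameElements(viaHeadN))
```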

limit(n) vs head(n)

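head(n) is an action and returns the actual data as an array.
limit(n) is a transformation and returns a new Dataset.
head(n) can cause an OutOfMemoryError in the driver process if n is too large, because all n rows are collected into driver memory.
head(n) calls limit(n) to build a Dataset and returns the result of collecting it.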
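
One way to act on this difference when n is large is to stay with limit(n) and keep working on the resulting Dataset, so that only a small final result reaches the driver. The sketch below is an illustration under assumptions rather than a recommendation from the sources: `n` stands in for a large value, and the filter on `no` is an arbitrary follow-up step.

```scala
import org.apache.spark.sql.{Dataset, Row, SparkSession}

// Assumption: a local session; inside spark-shell getOrCreate() reuses the existing one.
val spark = SparkSession.builder().master("local[*]").appName("limit-stays-distributed").getOrCreate()
import spark.implicits._

val df = Seq((1, "hello"), (2, "world"), (3, "haha")).toDF("no", "word")

// Stand-in for a very large n.
val n = 2

// head(n) would materialize all n rows as one Array[Row] in driver memory.
// limit(n) keeps them as a Dataset, so the follow-up work is still planned by Spark.
val limited: Dataset[Row] = df.limit(n)

// Only a single Long comes back to the driver.
val matching: Long = limited.filter($"no" > 1).count()
println(matching)
```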

summary

method	description
limit(n)	transformation; returns a new Dataset
head(n)	action; returns an Array of Row
first()	alias for head()
take(n)	alias for head(n)

Reference

https://stackoverflow.com/questions/46832394/spark-access-first-n-rows-take-vs-limit
https://stackoverflow.com/questions/35869884/more-than-one-hour-to-execute-pyspark-sql-dataframe-take4/35870245#35870245
https://stackoverflow.com/questions/45138742/apache-spark-dataset-api-headnint-vs-takenint
https://spark.apache.org/docs/latest/api/scala
