- What is take() in PySpark?
    - take(num) is a DataFrame action in PySpark that retrieves the first 'num' rows and returns them to the driver as a Python list of Row objects.

- Key Points:
    - Returns the first 'num' rows.
    - Returns a Python list containing Row objects.
    - If 'num' is greater than the number of rows in the DataFrame, it returns all rows.
    - Commonly used for quick previews of data.

- Syntax:
    DataFrame.take(num)

- Example: 
    df.take(5) -> Returns the first 5 rows.

In [1]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("takeFunctionExample").getOrCreate()


Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
25/09/15 14:40:03 WARN Utils: Your hostname, KLZPC0015, resolves to a loopback address: 127.0.1.1; using 172.25.17.96 instead (on interface eth0)
25/09/15 14:40:03 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/09/15 14:40:16 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
25/09/15 14:40:19 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
25/09/15 14:40:19 WARN Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.
25/09/15 14:40:19 WARN Utils: Service 'SparkUI' could not bind on port 4042. Attempting port 4043.
25/09/15 14:40:19 WARN Utils: Serv

In [3]:
data = [
    (29, "Dipankar"),
    (28, "Prodipta"),
    (27, "Padma"),
    (26, "Souvik"),
    (28, "Soukarjya")
]

columns = ["age", "name"]

df = spark.createDataFrame(data, columns)
df.show()




+---+---------+
|age|     name|
+---+---------+
| 29| Dipankar|
| 28| Prodipta|
| 27|    Padma|
| 26|   Souvik|
| 28|Soukarjya|
+---+---------+



                                                                                

In [4]:
# Using take() function in PySpark
# Example: Take the first 2 rows

first_two_rows = df.take(2)

print("First 2 rows using take(): ")
for row in first_two_rows:
    print(row)


                                                                                

First 2 rows using take(): 
Row(age=29, name='Dipankar')
Row(age=28, name='Prodipta')


In [5]:
# Example: Take more rows than exist (request 10 rows, only 5 rows in DataFrame)
more_rows = df.take(10)

print("Taking more rows than available (requested 10 rows): ")
for row in more_rows:
    print(row)



Taking more rows than available (requested 10 rows): 
Row(age=29, name='Dipankar')
Row(age=28, name='Prodipta')
Row(age=27, name='Padma')
Row(age=26, name='Souvik')
Row(age=28, name='Soukarjya')


                                                                                

- Summary:
    - take() is a quick way to fetch the first 'num' rows from a DataFrame; it does not sort the data.
    - Returns a list of Row objects that you can loop through in Python.
    - Helpful for small data previews, testing, and debugging.