#### **collect()**: Retrieve data from a DataFrame

- collect() is an **action** that returns the **entire dataset** to the **driver** as an **array**; in PySpark this is a Python `list` of `Row` objects.

- Because **collect()** is an `action`, it **doesn't return a DataFrame**. Once the data is in the list, you can use a Python `for` loop to process it further.

- Use collect() only when the **result set is small**. Calling collect() on an RDD/DataFrame with a **large result set** can cause an **out-of-memory** error, because it returns the **entire dataset (from all workers) to the driver**, so avoid calling it on large datasets.

- collect() is available on both RDDs and DataFrames. It retrieves all rows from every partition and brings them to the driver node/program.

#### **1) Loop Array in Python**
**collect():** Returns the data to the driver as an array (a Python list of `Row` objects).

In [0]:
data = [("Finance", 10, "Manager"), ("Marketing", 20, "Sr.Manager"), ("Sales", 30, "Representative"), ("IT", 40, "Software")]
schema = ["dept_name", "dept_id", "Designation"]

df = spark.createDataFrame(data, schema)
display(df)

dept_name,dept_id,Designation
Finance,10,Manager
Marketing,20,Sr.Manager
Sales,30,Representative
IT,40,Software


In [0]:
# collect() is an action, so it does not return a DataFrame.
# It retrieves all rows of the DataFrame to the driver node as a list of Row objects.
# DataFrame methods therefore fail on the result, e.g. df_collect.display() raises
# AttributeError: 'list' object has no attribute 'display'
df_collect = df.collect()
df_collect

[Row(dept_name='Finance', dept_id=10, Designation='Manager'),
 Row(dept_name='Marketing', dept_id=20, Designation='Sr.Manager'),
 Row(dept_name='Sales', dept_id=30, Designation='Representative'),
 Row(dept_name='IT', dept_id=40, Designation='Software')]

In [0]:
for i in df_collect:
    print(i)

Row(dept_name='Finance', dept_id=10, Designation='Manager')
Row(dept_name='Marketing', dept_id=20, Designation='Sr.Manager')
Row(dept_name='Sales', dept_id=30, Designation='Representative')
Row(dept_name='IT', dept_id=40, Designation='Software')


     for i in df_collect:
         print(i['dept_name'])
             (or)
     for i in df_collect:
         print(i[0])

In [0]:
for i in df_collect:
    print(i[0])

Finance
Marketing
Sales
IT


     for i in df_collect:
         print(i['dept_id'])
            (or)
     for i in df_collect:
         print(i[1])

In [0]:
for i in df_collect:
    print(i[1])

10
20
30
40


     for i in df_collect:
         print(i['Designation'])
             (or)
     for i in df_collect:
         print(i[2])

In [0]:
for i in df_collect:
    print(i[2])

Manager
Sr.Manager
Representative
Software
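
Because each collected `Row` behaves like a tuple, the loops above can also unpack a row straight into one variable per column. The following is only a minimal sketch: `DeptRow` is a `collections.namedtuple` used as a stand-in for `pyspark.sql.Row` so the example runs without a Spark session, and `rows` is a hypothetical stand-in for `df_collect`. A real `Row` additionally supports access by column name with `row['dept_name']`, which a namedtuple does not.

```python
from collections import namedtuple

# Stand-in for pyspark.sql.Row (assumption: used only so this runs without Spark)
DeptRow = namedtuple("DeptRow", ["dept_name", "dept_id", "Designation"])

# Hypothetical stand-in for the list returned by df.collect()
rows = [
    DeptRow("Finance", 10, "Manager"),
    DeptRow("Marketing", 20, "Sr.Manager"),
    DeptRow("Sales", 30, "Representative"),
    DeptRow("IT", 40, "Software"),
]

# Unpack each row into one variable per column
for dept_name, dept_id, designation in rows:
    print(dept_id, dept_name, designation)

# Build a driver-side lookup from dept_id to dept_name
dept_by_id = {row.dept_id: row.dept_name for row in rows}
print(dept_by_id[30])
```

Unpacking avoids repeating numeric indexes such as `i[0]` and `i[1]`, which can make loops over wider rows easier to read.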


#### **2) collect()[0][0]**

In [0]:
dept = [("Finance",10,"Manager"), ("Marketing",20,"Sr.Manager"), ("Sales",30,"Representative"), ("IT",40,"Software")]
deptColumns = ["dept_name", "dept_id", "Designation"]

df = spark.createDataFrame(data=dept, schema = deptColumns)
df.show(truncate=False)

+---------+-------+--------------+
|dept_name|dept_id|Designation   |
+---------+-------+--------------+
|Finance  |10     |Manager       |
|Marketing|20     |Sr.Manager    |
|Sales    |30     |Representative|
|IT       |40     |Software      |
+---------+-------+--------------+



In [0]:
# collect() is an action, so it does not return a DataFrame.
# It retrieves all rows of the DataFrame to the driver node as a list of Row objects.
dataCollect = df.collect()
print(dataCollect)

[Row(dept_name='Finance', dept_id=10, Designation='Manager'), Row(dept_name='Marketing', dept_id=20, Designation='Sr.Manager'), Row(dept_name='Sales', dept_id=30, Designation='Representative'), Row(dept_name='IT', dept_id=40, Designation='Software')]


     # returns the first element in an array (1st row)
     df.collect()[0]

     # returns the second element in an array (2nd row)
     df.collect()[1]

     # returns the third element in an array (3rd row)
     df.collect()[2]

     # returns the fourth element in an array (4th row)
     df.collect()[3]

In [0]:
# returns the first element in an array (1st row)
df.collect()[0]

Row(dept_name='Finance', dept_id=10, Designation='Manager')

In [0]:
# returns the second element in an array (2nd row)
df.collect()[1]

Row(dept_name='Marketing', dept_id=20, Designation='Sr.Manager')

In [0]:
# returns the third element in an array (3rd row)
df.collect()[2]

Row(dept_name='Sales', dept_id=30, Designation='Representative')

In [0]:
# returns the fourth element in an array (4th row)
df.collect()[3]

Row(dept_name='IT', dept_id=40, Designation='Software')

     # returns the value of the first row & 1st column
     df.collect()[0][0]

     # returns the value of the first row & 2nd column
     df.collect()[0][1]

     # returns the value of the first row & 3rd column
     df.collect()[0][2]

In [0]:
# returns the value of the first row & 1st column
df.collect()[0][0]

'Finance'

In [0]:
# returns the value of the first row & 2nd column
df.collect()[0][1]

10

In [0]:
# returns the value of the first row & 3rd column
df.collect()[0][2]

'Manager'

     # returns the value of the 2nd row & 1st column
     df.collect()[1][0]

     # returns the value of the 2nd row & 2nd column
     df.collect()[1][1]

     # returns the value of the 2nd row & 3rd column
     df.collect()[1][2]

In [0]:
# returns the value of the 2nd row & 1st column
df.collect()[1][0]

'Marketing'

In [0]:
# returns the value of the 2nd row & 2nd column
df.collect()[1][1]

20

In [0]:
# returns the value of the 2nd row & 3rd column
df.collect()[1][2]

'Sr.Manager'

     # returns the value of the 3rd row & 1st column
     df.collect()[2][0]

     # returns the value of the 3rd row & 2nd column
     df.collect()[2][1]

     # returns the value of the 3rd row & 3rd column
     df.collect()[2][2]

In [0]:
# returns the value of the 3rd row & 1st column
df.collect()[2][0]

'Sales'

In [0]:
# returns the value of the 3rd row & 2nd column
df.collect()[2][1]

30

In [0]:
# returns the value of the 3rd row & 3rd column
df.collect()[2][2]

'Representative'

     # returns the value of the 4th row & 1st column
     df.collect()[3][0]

     # returns the value of the 4th row & 2nd column
     df.collect()[3][1]

     # returns the value of the 4th row & 3rd column
     df.collect()[3][2]

In [0]:
# returns the value of the 4th row & 1st column
df.collect()[3][0]

'IT'

In [0]:
# returns the value of the 4th row & 2nd column
df.collect()[3][1]

40

In [0]:
# returns the value of the 4th row & 3rd column
df.collect()[3][2]

'Software'
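
Each `df.collect()` call above is a separate action, so Spark runs the job again every time. It is usually cheaper to collect once and index the stored list, as `dataCollect` does. Indexing also raises `IndexError` when the result has fewer rows or columns than expected. The sketch below shows one way to guard the lookup; `cell()` is a hypothetical helper, and `DeptRow` is a `namedtuple` standing in for `pyspark.sql.Row` so the example runs without a Spark session.

```python
from collections import namedtuple

# Stand-in for pyspark.sql.Row (assumption: used only so this runs without Spark)
DeptRow = namedtuple("DeptRow", ["dept_name", "dept_id", "Designation"])

# Hypothetical stand-in for: dataCollect = df.collect()  (collected once, reused below)
data_collect = [
    DeptRow("Finance", 10, "Manager"),
    DeptRow("Marketing", 20, "Sr.Manager"),
    DeptRow("Sales", 30, "Representative"),
    DeptRow("IT", 40, "Software"),
]


def cell(rows, row_idx, col_idx, default=None):
    """Return rows[row_idx][col_idx], or default if either index is out of range."""
    try:
        return rows[row_idx][col_idx]
    except IndexError:
        return default


print(cell(data_collect, 0, 0))        # Finance
print(cell(data_collect, 3, 2))        # Software
print(cell([], 0, 0, default="n/a"))   # n/a (empty result, no IndexError)
```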

#### **3) df.select("column").collect()[0][0]**

In [0]:
print(df.select("dept_name").collect())

[Row(dept_name='Finance'), Row(dept_name='Marketing'), Row(dept_name='Sales'), Row(dept_name='IT')]


     # returns the first element in an array (1st row)
     print(df.select(['dept_name', 'dept_id', 'Designation']).collect()[0])

     # returns the second element in an array (2nd row)
     print(df.select(['dept_name', 'dept_id', 'Designation']).collect()[1])

     # returns the third element in an array (3rd row)
     print(df.select(['dept_name', 'dept_id', 'Designation']).collect()[2])

     # returns the fourth element in an array (4th row)
     print(df.select(['dept_name', 'dept_id', 'Designation']).collect()[3])

In [0]:
# returns the first element in an array (1st row)
print(df.select(['dept_name', 'dept_id', 'Designation']).collect()[0])

Row(dept_name='Finance', dept_id=10, Designation='Manager')


In [0]:
# returns the second element in an array (2nd row)
print(df.select(['dept_name', 'dept_id', 'Designation']).collect()[1])

Row(dept_name='Marketing', dept_id=20, Designation='Sr.Manager')


In [0]:
# returns the third element in an array (3rd row)
print(df.select(['dept_name', 'dept_id', 'Designation']).collect()[2])

Row(dept_name='Sales', dept_id=30, Designation='Representative')


In [0]:
# returns the fourth element in an array (4th row)
print(df.select(['dept_name', 'dept_id', 'Designation']).collect()[3])

Row(dept_name='IT', dept_id=40, Designation='Software')


     # returns the value of the first row & 1st column
     print(df.select(['dept_name', 'dept_id', 'Designation']).collect()[0][0])

     # returns the value of the first row & 2nd column
     print(df.select(['dept_name', 'dept_id', 'Designation']).collect()[0][1])

     # returns the value of the first row & 3rd column
     print(df.select(['dept_name', 'dept_id', 'Designation']).collect()[0][2])

In [0]:
# returns the value of the first row & 1st column
print(df.select(['dept_name', 'dept_id', 'Designation']).collect()[0][0])

Finance


In [0]:
# returns the value of the first row & 2nd column
print(df.select(['dept_name', 'dept_id', 'Designation']).collect()[0][1])

10


In [0]:
# returns the value of the first row & 3rd column
print(df.select(['dept_name', 'dept_id', 'Designation']).collect()[0][2])

Manager


     # returns the value of the second row & 1st column
     print(df.select(['dept_name', 'dept_id', 'Designation']).collect()[1][0])

     # returns the value of the second row & 2nd column
     print(df.select(['dept_name', 'dept_id', 'Designation']).collect()[1][1])

     # returns the value of the second row & 3rd column
     print(df.select(['dept_name', 'dept_id', 'Designation']).collect()[1][2])

In [0]:
# returns the value of the second row & 1st column
print(df.select(['dept_name', 'dept_id', 'Designation']).collect()[1][0])

Marketing


In [0]:
# returns the value of the second row & 2nd column
print(df.select(['dept_name', 'dept_id', 'Designation']).collect()[1][1])

20


In [0]:
# returns the value of the second row & 3rd column
print(df.select(['dept_name', 'dept_id', 'Designation']).collect()[1][2])

Sr.Manager


     # returns the value of the 3rd row & 1st column
     print(df.select(['dept_name', 'dept_id', 'Designation']).collect()[2][0])

     # returns the value of the 3rd row & 2nd column
     print(df.select(['dept_name', 'dept_id', 'Designation']).collect()[2][1])

     # returns the value of the 3rd row & 3rd column
     print(df.select(['dept_name', 'dept_id', 'Designation']).collect()[2][2])

In [0]:
# returns the value of the 3rd row & 1st column
print(df.select(['dept_name', 'dept_id', 'Designation']).collect()[2][0])

Sales


In [0]:
# returns the value of the 3rd row & 2nd column
print(df.select(['dept_name', 'dept_id', 'Designation']).collect()[2][1])

30


In [0]:
# returns the value of the 3rd row & 3rd column
print(df.select(['dept_name', 'dept_id', 'Designation']).collect()[2][2])

Representative


     # returns the value of the 4th row & 1st column
     print(df.select(['dept_name', 'dept_id', 'Designation']).collect()[3][0])

     # returns the value of the 4th row & 2nd column
     print(df.select(['dept_name', 'dept_id', 'Designation']).collect()[3][1])

     # returns the value of the 4th row & 3rd column
     print(df.select(['dept_name', 'dept_id', 'Designation']).collect()[3][2])

In [0]:
# returns the value of the 4th row & 1st column
print(df.select(['dept_name', 'dept_id', 'Designation']).collect()[3][0])

IT


In [0]:
# returns the value of the 4th row & 2nd column
print(df.select(['dept_name', 'dept_id', 'Designation']).collect()[3][1])

40


In [0]:
# returns the value of the 4th row & 3rd column
print(df.select(['dept_name', 'dept_id', 'Designation']).collect()[3][2])

Software
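
When only one column is selected, as in `df.select("dept_name").collect()`, each `Row` holds a single value, so a list comprehension can turn the result into a plain Python list. This is a minimal sketch under stated assumptions: `NameRow` is a `namedtuple` standing in for `pyspark.sql.Row`, and `name_rows` is a hypothetical stand-in for the collected result, so the example runs without a Spark session.

```python
from collections import namedtuple

# Stand-in for a single-column pyspark.sql.Row (assumption: avoids needing Spark)
NameRow = namedtuple("NameRow", ["dept_name"])

# Hypothetical stand-in for: df.select("dept_name").collect()
name_rows = [NameRow("Finance"), NameRow("Marketing"), NameRow("Sales"), NameRow("IT")]

# Take the first (and only) field of every row
dept_names = [row[0] for row in name_rows]
print(dept_names)  # ['Finance', 'Marketing', 'Sales', 'IT']
```

Selecting only the needed column before collecting also keeps the amount of data sent to the driver smaller than collecting every column.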
