**collect() Action with RDD Operations**

In [1]:
# `sc` is assumed to be an existing SparkContext (the PySpark shell creates one automatically)
data = sc.textFile("students.csv")

In [2]:
# The first line is the header; drop every line equal to it
header = data.first()
rows = data.filter(lambda line: line != header)

In [3]:
split_rdd = rows.map(lambda line: line.split(","))
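
Splitting on every comma works for the rows shown in this notebook, but it would break a field that itself contains a comma inside quotes. A minimal sketch of a more tolerant parser, using Python's standard `csv` module, is given below. The function name `parse_csv_line` and the quoted sample line are hypothetical and do not come from students.csv.

```python
import csv

def parse_csv_line(line):
    """Split one CSV line into fields, honoring double-quoted fields."""
    # csv.reader expects an iterable of lines, so wrap the single line in a list.
    return next(csv.reader([line]))

# Hypothetical line with a comma inside a quoted name (not from students.csv)
print(parse_csv_line('11,"Doe, Jane",21,F,80,85,80'))

# An unquoted line gives the same result as line.split(",")
print(parse_csv_line("1,Alice,20,F,66,92,44"))
```

In the notebook, this function could be passed to the same transformation, for example `rows.map(parse_csv_line)`.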

In [4]:
print("=== Student Dataset (first 10 rows) ===")
for row in split_rdd.take(10):   # increase 10 to preview more rows
    print(row)

=== Student Dataset (first 10 rows) ===
['1', 'Alice', '20', 'F', '66', '92', '44']
['2', 'Bob', '20', 'M', '82', '52', '77']
['3', 'Charlie', '22', 'F', '43', '57', '76']
['4', 'David', '19', 'M', '95', '69', '46']
['5', 'Eva', '19', 'F', '62', '44', '96']
['6', 'Frank', '22', 'F', '70', '78', '94']
['7', 'Grace', '24', 'F', '67', '66', '93']
['8', 'Henry', '21', 'F', '53', '82', '60']
['9', 'Ivy', '19', 'M', '64', '52', '46']
['10', 'Jack', '19', 'F', '44', '59', '60']


In [5]:
# Cast columns 0, 2, 4, 5 and 6 to int; keep columns 1 and 3 as strings
students_rdd = split_rdd.map(lambda x: (int(x[0]), x[1], int(x[2]), x[3], int(x[4]), int(x[5]), int(x[6])))

In [6]:
# Pair each name (column 1) with the mean of the three marks (columns 4 to 6)
avg_marks_rdd = students_rdd.map(lambda x: (x[1], (x[4] + x[5] + x[6]) / 3))

In [7]:
passed_rdd = avg_marks_rdd.filter(lambda x: x[1] >= 75)

In [8]:
sorted_passed_rdd = passed_rdd.sortBy(lambda x: x[1], ascending=False)

In [9]:
# collect() is an action: it returns every element to the driver as a Python list
results = sorted_passed_rdd.collect()
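
Because collect() brings every element back to the driver, it suits small results like this one but can exhaust driver memory on large ones. When only the highest few averages are needed, the RDD method `takeOrdered` is one alternative that returns a bounded number of records. The helper below is a sketch only: the name `top_students` and its parameters are hypothetical and not part of the original notebook.

```python
def top_students(avg_rdd, n=5):
    """Return the n highest (name, average) pairs without collecting the whole RDD."""
    # takeOrdered returns the smallest n elements by key,
    # so negate the average to get the largest values first.
    return avg_rdd.takeOrdered(n, key=lambda pair: -pair[1])

# Possible call in this notebook (assumes passed_rdd from cell 7):
#   top_students(passed_rdd, n=5)
```

With this approach the separate sortBy() step is not needed for the returned records, since `takeOrdered` already returns them in order.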

In [10]:
print("=== Students with Average >= 75 ===")
for student in results:
    print(f"Name: {student[0]}, Avg Marks: {student[1]:.2f}")

=== Students with Average >= 75 ===
Name: Leo, Avg Marks: 88.00
Name: Olivia, Avg Marks: 88.00
Name: Rita, Avg Marks: 86.67
Name: Kathy, Avg Marks: 81.67
Name: George, Avg Marks: 81.67
Name: Frank, Avg Marks: 80.67
Name: Oscar, Avg Marks: 80.00
Name: Uma, Avg Marks: 78.33
Name: Kyle, Avg Marks: 78.33
Name: Matt, Avg Marks: 78.33
Name: Tina, Avg Marks: 76.00
Name: Victor, Avg Marks: 75.67
Name: Grace, Avg Marks: 75.33
Name: Mona, Avg Marks: 75.00
Name: Will, Avg Marks: 75.00
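
As a sanity check, the same map, filter and sort steps can be reproduced in plain Python on the ten rows printed earlier. This is only a cross-check sketch using list comprehensions in place of RDD operations; the variable names `sample_rows`, `averages` and `passed` are introduced here for illustration.

```python
# First ten rows exactly as printed by split_rdd.take(10)
sample_rows = [
    ['1', 'Alice', '20', 'F', '66', '92', '44'],
    ['2', 'Bob', '20', 'M', '82', '52', '77'],
    ['3', 'Charlie', '22', 'F', '43', '57', '76'],
    ['4', 'David', '19', 'M', '95', '69', '46'],
    ['5', 'Eva', '19', 'F', '62', '44', '96'],
    ['6', 'Frank', '22', 'F', '70', '78', '94'],
    ['7', 'Grace', '24', 'F', '67', '66', '93'],
    ['8', 'Henry', '21', 'F', '53', '82', '60'],
    ['9', 'Ivy', '19', 'M', '64', '52', '46'],
    ['10', 'Jack', '19', 'F', '44', '59', '60'],
]

# map: pair each name with the mean of the three marks
averages = [(r[1], (int(r[4]) + int(r[5]) + int(r[6])) / 3) for r in sample_rows]

# filter: keep averages of 75 or above
passed = [pair for pair in averages if pair[1] >= 75]

# sortBy(..., ascending=False): highest average first
passed.sort(key=lambda pair: pair[1], reverse=True)

for name, avg in passed:
    print(f"Name: {name}, Avg Marks: {avg:.2f}")
# Name: Frank, Avg Marks: 80.67
# Name: Grace, Avg Marks: 75.33
```

Frank and Grace are the only two of these ten students at or above 75, and their averages match the corresponding lines in the collected output.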


In [11]:
# count() is a separate action, so the pipeline behind passed_rdd is evaluated again here
count_passed = passed_rdd.count()
print("\nNumber of students who passed:", count_passed)


Number of students who passed: 15


**Conclusion**
This PySpark RDD program processes and analyzes a student dataset with distributed data operations. It loads the CSV file, removes the header, splits each line into fields, and calculates each student’s average marks. The transformations map(), filter(), and sortBy() select the students whose average is 75 or above and order them from highest to lowest. The collect() action then returns the processed results from the worker nodes to the driver as a Python list for display, and count() reports how many records passed the filter. The output shows that 15 students passed, with Leo and Olivia tied for the highest average (88.00). Overall, the experiment shows how PySpark’s RDD operations express transformation, filtering, sorting, and counting as a pipeline that Spark can run in parallel across partitions.