In [8]:
'''
Write a PySpark program to count the occurrences of each word in a given text file. The solution must utilize RDD transformations and actions for processing, and then convert the final RDD into a DataFrame for output. Sort DataFrame by count in descending order.

Hint
To convert an RDD to a DataFrame, use the spark.createDataFrame() method. Example:

df = spark.createDataFrame(rdd, schema=["word", "count"])
Input
The input data will be provided as a plain text file located at /datasets/notes.txt. Each line contains a sentence or phrase. Example input text:

hello world
hello PySpark
PySpark is fun
hello world again
Output
Sample Output Schema
word: String
count: Integer
Example Table
word	count
hello	3
world	2
PySpark	2
is	1
fun	1
again	1
Explanation
The program reads the text file into an RDD.
The RDD is processed to split lines into words, count the occurrences of each word, and sort by word.
The final RDD is converted into a DataFrame with columns word and count and sorted by count in descending order.
Use display(df) to show the final DataFrame.
'''

'''
🚨 When to Use RDD

Use RDD only if:

Data is completely unstructured
Need low-level control
Custom partitioning logic
Legacy Spark code

✅ When to Use DataFrame (Almost Always)

Use DataFrame when:
✔ Working with structured/semi-structured data
✔ Need performance
✔ Writing SQL queries
✔ Using Spark 2.x+

As a rule of thumb, the large majority of Spark jobs should use DataFrames

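What kind of fine-grained control does an RDD provide that is missing in DataFrames?

When people say "RDD gives fine-grained control", they mean control over execution behavior and
data handling that DataFrames intentionally hide for optimization and simplicity, for example
how records are assigned to partitions (partitionBy) and how each partition is processed (mapPartitions).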



'''

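from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Initialize the Spark session
spark = SparkSession.builder.appName('Spark Playground').getOrCreate()

# Read the text file into an RDD of lines.
# This run uses a local sample file; the task statement places the input at /datasets/notes.txt.
rdd = spark.sparkContext.textFile("./sample_dataset.txt")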

word_count_rdd = (
    rdd.flatMap(lambda line: line.split(" "))  # split lines into words
       .map(lambda word: word.strip())         # remove surrounding whitespace
       .filter(lambda word: word != "")        # drop empty strings
       .map(lambda word: (word, 1))            # build (word, 1) pairs
       .reduceByKey(lambda a, b: a + b)        # sum the 1s per word
)

'''
✔ flatMap → converts lines into individual words
✔ map → creates (word, 1) pairs
✔ reduceByKey → efficiently aggregates counts
'''

df = spark.createDataFrame(word_count_rdd, schema=["word", "count"])

df_result = df.sort(F.col("count").desc())

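# Show the final DataFrame. show() works in any PySpark session; the display(df) call
# mentioned in the task is only available in some notebook environments (for example Databricks).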
df_result.show()

+----------+-----+
|      word|count|
+----------+-----+
|     hello|    4|
|   PySpark|    3|
|        is|    2|
|     world|    2|
|     Spark|    2|
|       fun|    1|
|     makes|    1|
|     again|    1|
|      data|    1|
|       big|    1|
|processing|    1|
|      easy|    1|
|  powerful|    1|
+----------+-----+
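
To make the flatMap, map, and reduceByKey steps above easier to follow, here is a minimal plain-Python sketch of what each stage produces for the four example lines from the task statement. It is not Spark code and runs on a single machine; the variable names are illustrative only.

```python
from collections import defaultdict

# Example input lines from the task statement
lines = ["hello world", "hello PySpark", "PySpark is fun", "hello world again"]

# flatMap: every line yields several words, flattened into one list
words = [word for line in lines for word in line.split(" ")]

# map + filter: strip whitespace and drop empty strings
cleaned = [w.strip() for w in words]
cleaned = [w for w in cleaned if w != ""]

# map: build (word, 1) pairs
pairs = [(w, 1) for w in cleaned]

# reduceByKey: add up the values that share a key
counts = defaultdict(int)
for word, n in pairs:
    counts[word] += n

# Sort by count, descending. Python's sort is stable, so ties keep
# first-seen order here; Spark gives no such guarantee for ties.
result = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
print(result)
```

The printed pairs match the Example Table in the task statement: hello 3, world 2, PySpark 2, then the three words that occur once.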


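
One concrete form of the fine-grained control mentioned in the notes is custom partitioning. The sketch below is plain Python that mimics how a pair RDD's `partitionBy(numPartitions, partitionFunc)` places each key: the partition index is `partitionFunc(key) % numPartitions`. Both `first_letter_partitioner` and `assign_partitions` are hypothetical helpers written for this illustration, not part of PySpark.

```python
def first_letter_partitioner(word):
    """Hypothetical partition function: route words by their first letter."""
    return ord(word[0].lower()) if word else 0


def assign_partitions(pairs, num_partitions, partition_func):
    """Mimic partitionBy: bucket index is partition_func(key) % num_partitions."""
    buckets = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        buckets[partition_func(key) % num_partitions].append((key, value))
    return buckets


pairs = [("hello", 1), ("world", 1), ("hello", 1), ("PySpark", 1)]
buckets = assign_partitions(pairs, 2, first_letter_partitioner)

# All pairs with the same key land in the same bucket, which is what lets
# a later per-key aggregation run without moving data again.
for index, bucket in enumerate(buckets):
    print(index, bucket)

# In PySpark the corresponding call on a pair RDD would look like:
#   word_count_rdd.partitionBy(2, first_letter_partitioner)
```

The DataFrame API offers `repartition` by column but does not accept an arbitrary Python function for choosing the partition, which is one reason the RDD API is still used for this kind of logic.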