In [8]:
'''
Write a PySpark program to count the occurrences of each word in a given text file. The solution must utilize RDD transformations and actions for processing, and then convert the final RDD into a DataFrame for output. Sort DataFrame by count in descending order.

Hint
To convert an RDD to a DataFrame, use the spark.createDataFrame() method. Example:

df = spark.createDataFrame(rdd, schema=["word", "count"])
Input
The input data will be provided as a plain text file located at /datasets/notes.txt. Each line contains a sentence or phrase. Example input text:

hello world
hello PySpark
PySpark is fun
hello world again
Output
Sample Output Schema
word: String
count: Integer
Example Table
word	count
hello	3
world	2
PySpark	2
is	1
fun	1
again	1
Explanation
The program reads the text file into an RDD.
The RDD is processed to split lines into words, count the occurrences of each word, and sort by word.
The final RDD is converted into a DataFrame with columns word and count and sorted by count in descending order.
Use display(df) to show the final DataFrame.
'''

'''
🚨 When to Use RDD

Use RDD only if:

Data is completely unstructured
Need low-level control
Custom partitioning logic
Legacy Spark code

✅ When to Use DataFrame (Almost Always)

Use DataFrame when:
✔ Working with structured/semi-structured data
✔ Need performance
✔ Writing SQL queries
✔ Using Spark 2.x+

As a rule of thumb, the large majority of Spark jobs should use DataFrames.

What kind of fine-grained control do RDDs provide that DataFrames lack?

When people say “RDD gives fine-grained control”, they mean control over execution behavior and data handling that DataFrames intentionally hide for optimization and simplicity.

🔍 What “Fine-Grained Control” RDD Provides (That DataFrames Don’t)
1️⃣ Custom Partitioning Logic
RDD

You can define exactly how keys are distributed across partitions.

# rdd must be a pair RDD of (key, value) tuples
rdd.partitionBy(
    numPartitions=10,
    partitionFunc=lambda key: hash(key) % 10
)


✔ Control data locality
✔ Reduce shuffle for joins
✔ Optimize skewed keys

DataFrame

❌ No custom partition function
✔ Only repartition(), repartitionByRange(), or DataFrameWriter.partitionBy() on write (coarse-grained)
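
A minimal sketch of a partition function that could be passed to partitionBy on a pair RDD. The names skew_aware_partitioner, HOT_KEYS and NUM_PARTITIONS are made up for illustration, and treating "hello" as a skewed key is only an assumption. The function is plain Python, so it can be checked without a cluster.

```python
import zlib

NUM_PARTITIONS = 4
# Hypothetical skewed key, pinned to a dedicated partition.
HOT_KEYS = {"hello": 0}


def skew_aware_partitioner(key):
    # Hot keys go to their reserved partition.
    if key in HOT_KEYS:
        return HOT_KEYS[key]
    # crc32 is stable across Python processes, unlike the built-in hash() for str.
    digest = zlib.crc32(str(key).encode("utf-8"))
    # Spread the remaining keys over partitions 1 .. NUM_PARTITIONS - 1.
    return 1 + digest % (NUM_PARTITIONS - 1)


# With a pair RDD of (word, 1) tuples this would be used as:
#   pairs.partitionBy(NUM_PARTITIONS, skew_aware_partitioner)
```

A stable hash is used because the built-in hash() of a string can differ between Python processes, which would send the same key to different partitions.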

2️⃣ Control Over Data Serialization
RDD

You control:

Serialization format

Object representation

Custom serializers (Kryo tuning)

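# spark.serializer is read at startup, so set it on the builder rather than with spark.conf.set().
spark = SparkSession.builder.config("spark.serializer", "org.apache.spark.serializer.KryoSerializer").getOrCreate()


With RDDs you can tune object-level serialization. In PySpark this setting affects JVM-side data only; Python objects are serialized with pickle.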

DataFrame

❌ Serialization is managed internally
✔ Columnar + Tungsten (fast but opaque)

3️⃣ Processing Unstructured / Irregular Data
RDD

Can process:

Free-text

Logs

Binary blobs

Irregular nested formats

rdd.flatMap(lambda line: custom_parse(line))

DataFrame

❌ Needs schema
✔ Best for structured / semi-structured data
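
As an illustration of the flatMap pattern above, here is one possible shape for custom_parse. The log format ("<LEVEL> <message>") is an assumption chosen for the example, not something the notes specify. Because the function is a generator, a line can produce zero or one record, and flatMap flattens the results.

```python
def custom_parse(line):
    # Hypothetical log format: "<LEVEL> <message>", e.g. "ERROR disk full".
    parts = line.strip().split(" ", 1)
    if len(parts) == 2 and parts[0].isupper():
        # Yield one (level, message) record for a well-formed line.
        yield (parts[0], parts[1])
    # Malformed or empty lines yield nothing, so flatMap drops them.


lines = ["ERROR disk full", "", "garbage", "INFO job started"]
# Plain-Python equivalent of rdd.flatMap(custom_parse).collect()
records = [rec for line in lines for rec in custom_parse(line)]
```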

4️⃣ Custom Stateful Computation
RDD

You can maintain state inside transformations.

rdd.mapPartitions(lambda it: custom_stateful_logic(it))


✔ Full control per partition
✔ Stateful streaming logic (legacy)

DataFrame

❌ Stateless by design
✔ Only declarative aggregations
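
A small sketch of what per-partition state can look like with mapPartitions. The function name running_totals and the sample records are illustrative only. The state is a local dict, so it is scoped to a single partition and is not shared across partitions or executors.

```python
def running_totals(records):
    # State lives in a local dict for the lifetime of one partition.
    totals = {}
    for key, value in records:
        totals[key] = totals.get(key, 0) + value
        # Emit the running total seen so far for this key.
        yield (key, totals[key])


partition = iter([("a", 1), ("b", 5), ("a", 2)])
result = list(running_totals(partition))
# With Spark this would be: rdd.mapPartitions(running_totals)
```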

5️⃣ Control Over Memory Usage
RDD

You can choose:

Storage level

Serialization strategy

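from pyspark import StorageLevel  # recent PySpark has no separate *_SER levels; Python objects are always pickled
rdd.persist(StorageLevel.MEMORY_AND_DISK)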


✔ Fine memory tuning

DataFrame

✔ Can cache() or persist() with a storage level
❌ Cannot control serialization details

6️⃣ Side Effects & External Systems
RDD

Allows:

Writing to external systems

Network calls

Custom I/O

rdd.foreachPartition(send_to_api)

DataFrame

❌ Discouraged
✔ Must use structured sinks
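
A hedged sketch of how a function passed to foreachPartition might batch records before sending them out. make_partition_sender and send_batch are hypothetical names, and a plain list stands in for the external API so the logic can be run locally.

```python
def make_partition_sender(send_batch, batch_size=2):
    # send_batch is a hypothetical callable standing in for an API client.
    def send_partition(records):
        batch = []
        for record in records:
            batch.append(record)
            if len(batch) == batch_size:
                send_batch(batch)
                batch = []
        if batch:
            # Flush the final, possibly smaller batch.
            send_batch(batch)
    return send_partition


sent = []
send_to_api = make_partition_sender(sent.append, batch_size=2)
send_to_api(iter(["r1", "r2", "r3"]))
# With Spark this would be: rdd.foreachPartition(send_to_api)
```

On a real cluster the function runs on the executors, so a driver-side list like sent would stay empty there. A real sender would open its connection inside send_partition, once per partition.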

7️⃣ Deterministic Execution Order (Within Partition)
RDD

Order is preserved inside partitions.

rdd.mapPartitions(process_in_order)

DataFrame

❌ Order is NOT guaranteed unless explicitly sorted
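
One way process_in_order could rely on record order inside a partition is to compute the change from the previous record. This body is an assumption for illustration; the notes do not define the function.

```python
def process_in_order(records):
    # Compute the change from the previous record within one partition.
    previous = None
    for value in records:
        if previous is not None:
            yield value - previous
        previous = value


deltas = list(process_in_order(iter([10, 13, 19])))
# With Spark this would be: rdd.mapPartitions(process_in_order)
```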

8️⃣ Custom Error Handling & Retry Logic
RDD

You can implement:

Try/catch per record

Fallback logic

rdd.map(lambda x: safe_parse(x))

DataFrame

❌ Errors fail the whole job
✔ Limited via try_cast (SQL)
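
A minimal sketch of per-record error handling with rdd.map. The tagged-tuple convention ("ok" / "error") is an assumption chosen for this example. Bad records are kept as data and can be filtered or logged later instead of failing the job.

```python
def safe_parse(raw):
    # Return a tagged tuple instead of raising, so one bad record cannot fail the job.
    try:
        return ("ok", int(raw))
    except (TypeError, ValueError) as exc:
        return ("error", f"{raw!r}: {exc}")


parsed = [safe_parse(x) for x in ["1", "x", None, " 7 "]]
good = [value for tag, value in parsed if tag == "ok"]
bad = [msg for tag, msg in parsed if tag == "error"]
# With Spark this would be: rdd.map(safe_parse).filter(lambda t: t[0] == "ok")
```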

⚠ Why Spark Hides This in DataFrames

Spark intentionally removes fine-grained control in DataFrames to:

✔ Enable Catalyst optimization
✔ Enable predicate pushdown
✔ Enable vectorized execution
✔ Reduce developer errors
✔ Improve performance

You trade control for speed and safety.

🎯 Interview-Ready Summary

RDDs provide fine-grained control over data partitioning, serialization, stateful computation, memory management, and custom execution logic.
DataFrames abstract these details to enable powerful optimizations through Catalyst and Tungsten.
RDDs are useful for unstructured data or custom processing, while DataFrames are preferred for most analytics workloads.

'''

# Initialize Spark session
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName('Spark Playground').getOrCreate()

rdd = spark.sparkContext.textFile("./sample_dataset.txt")

word_count_rdd = (
    rdd.flatMap(lambda line: line.split(" "))   # split lines into words
       .map(lambda word: word.strip())          # remove extra spaces
       .filter(lambda word: word != "")         # remove empty words
       .map(lambda word: (word, 1))             # (word, 1)
       .reduceByKey(lambda a, b: a + b)         # count words
)

'''
✔ flatMap → converts lines into individual words
✔ map → creates (word, 1) pairs
✔ reduceByKey → efficiently aggregates counts
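
To see what each step produces, the same pipeline can be traced in plain Python on the example input from the problem statement. This is only a local illustration of the data flow, not how Spark executes it; in Spark the steps run lazily and in parallel across partitions.

```python
from collections import defaultdict

lines = ["hello world", "hello PySpark", "PySpark is fun", "hello world again"]

# flatMap: one flat list of words from all lines
words = [word for line in lines for word in line.split(" ")]
# map + filter: strip whitespace, drop empty strings
words = [w.strip() for w in words if w.strip() != ""]
# map: (word, 1) pairs
pairs = [(w, 1) for w in words]
# reduceByKey: sum the values for each key
counts = defaultdict(int)
for word, one in pairs:
    counts[word] += one
# sort by count, descending (ties keep first-seen order here)
result = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
```

In Spark the order of rows with equal counts is not guaranteed, which is why tied words can appear in a different order in the DataFrame output.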
'''

df = spark.createDataFrame(word_count_rdd, schema=["word", "count"])

df_result = df.sort(F.col("count").desc())

# Show the final DataFrame. show() prints to the console; display(df) exists only in some notebook environments.
df_result.show()

+----------+-----+
|      word|count|
+----------+-----+
|     hello|    4|
|   PySpark|    3|
|        is|    2|
|     world|    2|
|     Spark|    2|
|       fun|    1|
|     makes|    1|
|     again|    1|
|      data|    1|
|       big|    1|
|processing|    1|
|      easy|    1|
|  powerful|    1|
+----------+-----+

