# Big Data & Streaming: Exercise Results

## 1. PySpark DataFrame
- Read a CSV file with PySpark and print the schema.


In [None]:
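from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# inferSchema=True makes Spark detect column types; without it every column is read as a string
df = spark.read.csv('data.csv', header=True, inferSchema=True)
# printSchema() prints the schema itself and returns None, so it is not wrapped in print()
df.printSchema()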


## 2. Filtering and Aggregation
- Filter rows where the `value` column is greater than 100, group by the `category` column, and count the rows in each group.


In [None]:
result = df.filter(df['value'] > 100).groupBy('category').count()
result.show()
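
As a cross-check on what the filter-then-count step computes, the same logic can be sketched in plain Python on a tiny in-memory sample. The rows below and the helper `count_by_category` are hypothetical and only stand in for the contents of `data.csv`; this is an illustration of the semantics, not a replacement for the Spark job.

```python
from collections import Counter

# Hypothetical sample rows standing in for data.csv
rows = [
    {"category": "a", "value": 150},
    {"category": "a", "value": 50},
    {"category": "b", "value": 200},
    {"category": "a", "value": 300},
]


def count_by_category(rows, threshold=100):
    """Count rows per category, keeping only rows whose value exceeds threshold."""
    return Counter(r["category"] for r in rows if r["value"] > threshold)


counts = count_by_category(rows)
print(dict(counts))
```

Running the Spark cell on a small file with the same rows should show the same per-category counts, which is a quick way to confirm the filter condition is applied before the grouping.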


## 3. Spark Transformations
- Chain at least two transformations (e.g., filter, select, groupBy) before calling an action.


In [None]:
filtered = df.filter(df['value'] > 100).select('category')
grouped = filtered.groupBy('category').count()
grouped.show()

## 4. Kafka Streaming
- Write a Python script to send and consume messages from a Kafka topic (localhost).


In [None]:
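from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(bootstrap_servers='localhost:9092')
producer.send('topic', b'data')
producer.flush()  # block until the buffered message has actually been sent

# Start from the earliest offset so the message sent above is not missed,
# and stop iterating after 5 s without new messages instead of blocking forever.
consumer = KafkaConsumer('topic', bootstrap_servers='localhost:9092',
                         auto_offset_reset='earliest', consumer_timeout_ms=5000)

for msg in consumer:
    print(msg.value)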
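
Kafka transports raw bytes, so structured records need to be encoded before sending and decoded after receiving. The sketch below shows one possible JSON round trip; the helper names `serialize` and `deserialize` and the sample record are assumptions made for illustration. In kafka-python, functions like these can be passed as `value_serializer` to `KafkaProducer` and `value_deserializer` to `KafkaConsumer`.

```python
import json


def serialize(record):
    """Encode a JSON-compatible Python object as UTF-8 bytes."""
    return json.dumps(record).encode("utf-8")


def deserialize(payload):
    """Decode UTF-8 JSON bytes back into a Python object."""
    return json.loads(payload.decode("utf-8"))


# Hypothetical record, used only to show the round trip without a broker
record = {"sensor": "s1", "reading": 42}
payload = serialize(record)
print(type(payload).__name__, deserialize(payload))
```

Keeping the encoding in one pair of functions means the producer and consumer cannot drift apart on the wire format.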


---

### Challenge
- Process a simulated data stream (e.g., random numbers) in real-time and compute a running average using PySpark Structured Streaming.


In [None]:
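from pyspark.sql import SparkSession
from pyspark.sql.functions import avg

spark = SparkSession.builder.getOrCreate()
# The built-in 'rate' source emits (timestamp, value) rows, with value counting up from 0
stream_df = spark.readStream.format('rate').option('rowsPerSecond', 1).load()
# .over() needs a window spec, and non-time-based window functions are not supported
# on streaming DataFrames; a global aggregation gives the average of all rows seen so far
avg_df = stream_df.agg(avg('value').alias('running_avg'))
# A streaming aggregation without a watermark cannot use the default 'append' output mode
query = avg_df.writeStream.outputMode('complete').format('console').start()
query.awaitTermination()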
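
To make the running-average logic concrete outside Spark, here is a minimal pure-Python sketch that keeps only a running total and a count, which is conceptually the state a streaming average needs. The names `running_average` and `simulated_stream` are hypothetical, and the seeded random numbers simply stand in for a live source.

```python
import random


def running_average(values):
    """Yield the average of all values seen so far, one result per input value."""
    total = 0.0
    for count, value in enumerate(values, start=1):
        total += value
        yield total / count


def simulated_stream(n, seed=0):
    """Yield n pseudo-random floats in [0, 1) as a stand-in for a live stream."""
    rng = random.Random(seed)
    for _ in range(n):
        yield rng.random()


for current_avg in running_average(simulated_stream(5)):
    print(round(current_avg, 3))
```

Because only the total and the count are stored, memory use stays constant however long the stream runs.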
