1. Working with RDDs:
   a) Write a Python program to create an RDD from a local data source.
   b) Implement transformations and actions on the RDD to perform data processing tasks.
   c) Analyze and manipulate data using RDD operations such as map, filter, reduce, or aggregate.


In [None]:
from pyspark import SparkContext

# Create a SparkContext
sc = SparkContext("local", "RDDExample")

# Create an RDD from a local data source (list)
data = [1, 2, 3, 4, 5]
rdd = sc.parallelize(data)

# Perform transformations and actions on the RDD
squared_rdd = rdd.map(lambda x: x ** 2)  # Apply a transformation: square each element
filtered_rdd = squared_rdd.filter(lambda x: x > 10)  # Apply a transformation: filter elements > 10
sum_result = filtered_rdd.reduce(lambda x, y: x + y)  # Apply an action: calculate sum

# Analyze and manipulate data using RDD operations
count = rdd.count()  # Count the number of elements in the RDD
min_value = rdd.min()  # Find the minimum value in the RDD
max_value = rdd.max()  # Find the maximum value in the RDD

# Print the results
print("Original RDD: {}".format(rdd.collect()))
print("Squared RDD: {}".format(squared_rdd.collect()))
print("Filtered RDD: {}".format(filtered_rdd.collect()))
print("Sum of Filtered RDD: {}".format(sum_result))
print("Number of elements in RDD: {}".format(count))
print("Minimum value in RDD: {}".format(min_value))
print("Maximum value in RDD: {}".format(max_value))

# Stop the SparkContext
sc.stop()
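
The cell above covers map, filter, and reduce, but not the aggregate operation named in 1c. The following is a minimal plain-Python sketch of what `rdd.aggregate(zeroValue, seqOp, combOp)` computes; it does not use Spark. The helper name `aggregate` and the two-partition split are assumptions made for illustration.

```python
from functools import reduce


def aggregate(partitions, zero_value, seq_op, comb_op):
    """Plain-Python emulation of the result of rdd.aggregate (illustrative helper)."""
    # Fold each partition on its own, starting from the zero value
    partials = [reduce(seq_op, part, zero_value) for part in partitions]
    # Merge the per-partition results, again starting from the zero value
    return reduce(comb_op, partials, zero_value)


# Hypothetical split of [1, 2, 3, 4, 5] into two partitions
partitions = [[1, 2], [3, 4, 5]]

# The accumulator is a (sum, count) pair
seq_op = lambda acc, x: (acc[0] + x, acc[1] + 1)    # add one element to an accumulator
comb_op = lambda a, b: (a[0] + b[0], a[1] + b[1])   # merge two accumulators

total, count = aggregate(partitions, (0, 0), seq_op, comb_op)
mean = total / count
print("Sum: {}, Count: {}, Mean: {}".format(total, count, mean))
```

With a live SparkContext, the corresponding call on the RDD from the cell above would be `rdd.aggregate((0, 0), seq_op, comb_op)`.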


2. Spark DataFrame Operations:
   a) Write a Python program to load a CSV file into a Spark DataFrame.
   b) Perform common DataFrame operations such as filtering, grouping, or joining.
   c) Apply Spark SQL queries on the DataFrame to extract insights from the data.


In [None]:
from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder.appName("DataFrameExample").getOrCreate()

# Load a CSV file into a DataFrame
df = spark.read.csv("data.csv", header=True, inferSchema=True)

# Perform common DataFrame operations
filtered_df = df.filter(df["age"] > 25)  # Apply a filter: select rows where age > 25
grouped_df = df.groupBy("gender").count()  # Group by gender and count occurrences
joined_df = df.join(grouped_df, "gender")  # Join with grouped_df based on the "gender" column

# Apply Spark SQL queries on the DataFrame
df.createOrReplaceTempView("people")  # Create a temporary view for the DataFrame
sql_result = spark.sql("SELECT name, age FROM people WHERE age > 30")  # Execute a SQL query

# Show the results
df.show()
filtered_df.show()
grouped_df.show()
joined_df.show()
sql_result.show()

# Stop the SparkSession
spark.stop()
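
The cell above expects a file named data.csv with at least the columns name, age, and gender. The following is a minimal sketch for producing such a file before running it. Only the column names come from the cell; the rows are made-up sample values.

```python
import csv

csv_path = "data.csv"  # same relative path the cell above reads from

# Hypothetical sample rows; only the column names are taken from the cell above
rows = [
    {"name": "Alice", "age": 34, "gender": "F"},
    {"name": "Bob", "age": 41, "gender": "M"},
    {"name": "Carol", "age": 23, "gender": "F"},
    {"name": "Dave", "age": 29, "gender": "M"},
]

# Write a header row followed by the sample rows
with open(csv_path, "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "age", "gender"])
    writer.writeheader()
    writer.writerows(rows)

# Read the file back to confirm the header and the row count
with open(csv_path, newline="") as f:
    loaded = list(csv.DictReader(f))

print(list(loaded[0].keys()), len(loaded))
```

Because the cell reads the file with `header=True` and `inferSchema=True`, the first line supplies the column names and Spark infers the column types from the values.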


3. Spark Streaming:
   a) Write a Python program to create a Spark Streaming application.
   b) Configure the application to consume data from a streaming source (e.g., Kafka or a socket).
   c) Implement streaming transformations and actions to process and analyze the incoming data stream.


In [None]:
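from pyspark import SparkContext
from pyspark.streaming import StreamingContext
# Note: pyspark.streaming.kafka is only available in Spark 2.x (removed in Spark 3.0)
from pyspark.streaming.kafka import KafkaUtils

# Create a SparkContext with two local threads
sc = SparkContext("local[2]", "StreamingExample")

# Create a Spark Streaming Context with a batch interval of 1 second
ssc = StreamingContext(sc, 1)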

# Configure the streaming application to consume data from Kafka
kafka_params = {
    "bootstrap.servers": "localhost:9092",
    "group.id": "my_consumer_group",
    "auto.offset.reset": "latest"
}
topic = "my_topic"
stream = KafkaUtils.createDirectStream(ssc, [topic], kafka_params)

# Implement streaming transformations and actions
lines = stream.map(lambda x: x[1])  # Extract the value from the Kafka message
word_counts = lines.flatMap(lambda line: line.split(" ")) \
                   .map(lambda word: (word, 1)) \
                   .reduceByKey(lambda a, b: a + b)

# Print the word counts
word_counts.pprint()

# Start the streaming context
ssc.start()

# Wait up to 30 seconds for the streaming context to terminate
ssc.awaitTerminationOrTimeout(30)  # Returns after 30 seconds if the stream is still running

# Stop the streaming context
ssc.stop(stopSparkContext=True, stopGraceFully=True)
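
The word count above is applied to each one-second batch independently, so counts do not carry over from one batch to the next. The following is a minimal plain-Python sketch of that per-batch behavior; it does not use Spark or Kafka. The helper `word_counts_for_batch` and the sample lines are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import chain


def word_counts_for_batch(lines):
    """Emulate flatMap -> map -> reduceByKey for one batch (illustrative helper)."""
    # flatMap: split every line into words and flatten the result
    words = chain.from_iterable(line.split(" ") for line in lines)
    # map: pair each word with the count 1
    pairs = ((word, 1) for word in words)
    # reduceByKey: sum the counts for each distinct word
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)


# Hypothetical message values arriving in two consecutive batches
batches = [
    ["spark streaming", "spark sql"],
    ["kafka spark"],
]

# Each batch is counted on its own, as in the DStream pipeline above
results = [word_counts_for_batch(batch) for batch in batches]
for i, counts in enumerate(results):
    print("Batch {}: {}".format(i, counts))
```

Keeping a running total across batches would require a stateful operation such as `updateStateByKey`, which the cell above does not use.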


4. Spark SQL and Data Source Integration:
   a) Write a Python program to connect Spark with a relational database (e.g., MySQL, PostgreSQL).
   b) Perform SQL operations on the data stored in the database using Spark SQL.
   c) Explore the integration capabilities of Spark with other data sources, such as Hadoop Distributed File System (HDFS) or Amazon S3.


In [None]:
from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder.appName("SparkSQLExample").getOrCreate()

# Connect Spark with a relational database (e.g., MySQL, PostgreSQL)
db_url = "jdbc:mysql://localhost:3306/mydatabase"
db_properties = {
    "user": "username",
    "password": "password",
    # With MySQL Connector/J 8 or later, the driver class is "com.mysql.cj.jdbc.Driver"
    "driver": "com.mysql.jdbc.Driver"
}
table_name = "mytable"

# Read data from the database table into a DataFrame
df = spark.read.jdbc(url=db_url, table=table_name, properties=db_properties)

# Perform SQL operations on the data stored in the database using Spark SQL
df.createOrReplaceTempView("mydata")  # Create a temporary view for the DataFrame
sql_result = spark.sql("SELECT * FROM mydata WHERE age > 30")  # Execute a SQL query

# Explore integration capabilities with other data sources (e.g., HDFS or Amazon S3)
hdfs_path = "hdfs://localhost:9000/path/to/hdfs/data.parquet"
s3_path = "s3a://bucket/path/to/s3/data.parquet"

# Read data from HDFS into a DataFrame
hdfs_df = spark.read.parquet(hdfs_path)

# Read data from Amazon S3 into a DataFrame
s3_df = spark.read.parquet(s3_path)

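# Perform operations on the data from different sources
# Note: union() matches columns by position, so all three DataFrames must share the same column order and types
combined_df = df.union(hdfs_df).union(s3_df)  # Combine data from different sources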

# Show the results
df.show()
sql_result.show()
combined_df.show()

# Stop the SparkSession
spark.stop()
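
`DataFrame.union` resolves columns by position, while `DataFrame.unionByName` resolves them by name. The following is a minimal plain-Python sketch of the difference; it does not use Spark. The helper functions, column names, and rows are assumptions made for illustration.

```python
def union_by_position(cols_a, rows_a, cols_b, rows_b):
    """Append rows_b as-is, keeping the left-hand column names (illustrative helper)."""
    if len(cols_a) != len(cols_b):
        raise ValueError("both inputs need the same number of columns")
    return cols_a, rows_a + rows_b


def union_by_name(cols_a, rows_a, cols_b, rows_b):
    """Reorder rows_b to the left-hand column order before appending (illustrative helper)."""
    if set(cols_a) != set(cols_b):
        raise ValueError("both inputs need the same column names")
    # For each left-hand column, find its position in the right-hand input
    index = [cols_b.index(col) for col in cols_a]
    reordered = [tuple(row[i] for i in index) for row in rows_b]
    return cols_a, rows_a + reordered


# Hypothetical data: the same two columns, stored in a different order
db_cols, db_rows = ["name", "age"], [("Alice", 34)]
file_cols, file_rows = ["age", "name"], [(41, "Bob")]

_, positional = union_by_position(db_cols, db_rows, file_cols, file_rows)
_, by_name = union_by_name(db_cols, db_rows, file_cols, file_rows)

print("By position:", positional)  # second row has age under "name"
print("By name:    ", by_name)     # second row is reordered to (name, age)
```

If the database table and the Parquet files may store their columns in a different order, `df.unionByName(hdfs_df)` is the name-based counterpart to the `union` call in the cell above.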
