#### 1. **Set the Schema and Load Data using Structured Streaming (5 Marks)**
   - **Objective**: Load data from different locations into DataFrames using PySpark Structured Streaming.
   - **Tools**: PySpark, HDFS, Kafka.
   - **Approach**:
     - **HDFS**: Store all datasets (`cust_dimen`, `market_fact`, `orders_dimen`, `prod_dimen`, `shipping_dimen`) in HDFS.
     - **Kafka**: Stream data into Spark. Use Kafka topics for each dataset and set up producers to publish data from HDFS to Kafka.
     - **PySpark Structured Streaming**: Read data from Kafka topics using structured streaming and define schemas for each dataset.
     - Example code snippet:
       ```python
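       from pyspark.sql import SparkSession
       from pyspark.sql.functions import col, from_json
       from pyspark.sql.types import StructType, StructField, StringType, IntegerType
       
       # The Kafka source needs the spark-sql-kafka connector package on the classpath
       spark = SparkSession.builder \
           .appName("SalesDataStreaming") \
           .getOrCreate()

       # Placeholder schema: replace with the actual columns of each dataset
       schema = StructType([
           StructField("Column1", StringType(), True),
           StructField("Column2", IntegerType(), True),
           # Add other fields based on your datasets
       ])
       
       raw_df = spark.readStream \
           .format("kafka") \
           .option("kafka.bootstrap.servers", "localhost:9092") \
           .option("subscribe", "cust_dimen") \
           .option("startingOffsets", "earliest") \
           .load()
       
       # Kafka delivers the payload as bytes: cast it to a string, then parse the JSON
       # against the schema (a streaming DataFrame cannot be read back through .rdd)
       cust_dimen_df = raw_df \
           .select(from_json(col("value").cast("string"), schema).alias("data")) \
           .select("data.*")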
       ```
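
The streaming read above parses each Kafka message value as JSON, so each producer needs to publish one JSON object per record. A minimal sketch of that serialization step, assuming the HDFS files are CSV with a header row. The sample column names are hypothetical, and the actual send call is left out because it depends on the Kafka client library you use:

```python
import csv
import io
import json

def csv_to_json_messages(csv_text):
    """Turn CSV text (with a header row) into one UTF-8 JSON message per record."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [json.dumps(row).encode("utf-8") for row in reader]

# Hypothetical two-row sample; the real column names come from your dataset
sample_csv = "Cust_id,Customer_Name\nCust_1,Alice\nCust_2,Bob\n"
messages = csv_to_json_messages(sample_csv)
for message in messages:
    print(message)  # each item is the value of one Kafka message
```

Because `csv.DictReader` yields every value as a string, numeric fields arrive as JSON strings. You may need to convert them before publishing, or declare them as `StringType` in the schema and cast them in Spark.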

#### 2. **Join DataFrames to Create `Full_DataFrame` (5 Marks)**
   - **Objective**: Combine all datasets into a single DataFrame without duplicate columns.
   - **Approach**:
     - Use PySpark DataFrame join operations.
     - Ensure you remove or rename duplicate columns during the join process.
     - Example code snippet:
       ```python
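       # Join each dimension table to the fact table on its own key.
       # The key names here are illustrative; substitute your datasets' actual key columns.
       # Passing the key as a string keeps a single copy of that column in the result.
       full_df = market_fact_df.join(cust_dimen_df, "customer_id") \
                               .join(orders_dimen_df, "order_id") \
                               .join(prod_dimen_df, "product_id") \
                               .join(shipping_dimen_df, "shipping_id")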
       ```

#### 3. **Convert Date Columns and Show Top 5 Records (5 Marks)**
   - **Objective**: Convert `Order_Date` and `Ship_Date` columns to `DateType` and display schema and records.
   - **Approach**:
     - Use `to_date` function in PySpark to convert the columns.
     - Example code snippet:
       ```python
       from pyspark.sql.functions import to_date
       
       full_df = full_df.withColumn("Order_Date", to_date(full_df["Order_Date"], "yyyy-MM-dd")) \
                        .withColumn("Ship_Date", to_date(full_df["Ship_Date"], "yyyy-MM-dd"))
       
       full_df.printSchema()
       full_df.select("Order_Date", "Ship_Date").show(5)
       ```

#### 4. **Find Top 3 Customers by Number of Orders (5 Marks)**
   - **Objective**: Identify the top 3 customers with the highest number of orders.
   - **Approach**:
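     - Group by `customer_id` and count distinct orders (after the join, a plain row count would count an order once per product line), then use `orderBy` to get the top 3.
     - Example code snippet:
       ```python
       from pyspark.sql.functions import countDistinct
       # "order_id" is an assumed column name; substitute your dataset's order key
       top_customers = full_df.groupBy("customer_id").agg(countDistinct("order_id").alias("order_count")).orderBy("order_count", ascending=False).limit(3)
       top_customers.show()
       ```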

#### 5. **Create `DaysTakenForDelivery` Column (5 Marks)**
   - **Objective**: Calculate the difference between `Order_Date` and `Ship_Date`.
   - **Approach**:
     - Use the `datediff` function.
     - Example code snippet:
       ```python
       from pyspark.sql.functions import datediff
       
       full_df = full_df.withColumn("DaysTakenForDelivery", datediff("Ship_Date", "Order_Date"))
       full_df.show(5)
       ```
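
`datediff(end, start)` returns the number of days from `start` to `end`, so the later date goes first. A plain-Python sketch of the same arithmetic on made-up dates, which can help when checking the sign of the result on a few sample rows:

```python
from datetime import date

def days_taken_for_delivery(order_date, ship_date):
    """Mirror datediff(ship_date, order_date): positive when shipping follows the order."""
    return (ship_date - order_date).days

# Made-up dates for illustration only
print(days_taken_for_delivery(date(2011, 1, 28), date(2011, 2, 2)))  # 5
```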

#### 6. **Find Customer with Maximum Delivery Time (5 Marks)**
   - **Objective**: Identify the customer whose order took the longest to deliver.
   - **Approach**:
     - Use `orderBy` on the `DaysTakenForDelivery` column.
     - Example code snippet:
       ```python
       max_delivery_customer = full_df.orderBy("DaysTakenForDelivery", ascending=False).select("customer_id").first()
       print(max_delivery_customer)
       ```

#### 7. **Retrieve Total Sales per Product using Window Functions (5 Marks)**
   - **Objective**: Calculate total sales for each product.
   - **Approach**:
     - Use PySpark Window functions to partition by `product_id`.
     - Example code snippet:
       ```python
       from pyspark.sql.window import Window
       from pyspark.sql.functions import sum as _sum
       
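       windowSpec = Window.partitionBy("product_id")
       # A window aggregate keeps every input row, so select the product columns
       # and de-duplicate to get one total per product
       sales_per_product = full_df.withColumn("Total_Sales", _sum("sales").over(windowSpec)) \
           .select("product_id", "Total_Sales").distinct()
       sales_per_product.show()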
       ```

#### 8. **Retrieve Total Profit per Product (5 Marks)**
   - **Objective**: Calculate total profit per product, both with and without window functions.
   - **Approach**:
     - **With Window Function**:
       ```python
       # Reuses _sum and windowSpec from step 7
       profit_per_product = full_df.withColumn("Total_Profit", _sum("profit").over(windowSpec)) \
           .select("product_id", "Total_Profit").distinct()
       profit_per_product.show()
       ```
     - **Without Window Function**:
       ```python
       profit_per_product_no_window = full_df.groupBy("product_id").sum("profit")
       profit_per_product_no_window.show()
       ```

#### 9. **Count Unique Customers in January and Recurring Customers (5 Marks)**
   - **Objective**: Count unique customers in January 2011 and those who return every month throughout the year.
   - **Approach**:
     - Use `filter` for January data and `groupBy` for counting returning customers.
     - Example code snippet:
       ```python
       from pyspark.sql.functions import countDistinct, month, year
       
       orders_2011 = full_df.filter(year("Order_Date") == 2011)
       jan_customers = orders_2011.filter(month("Order_Date") == 1).select("customer_id").distinct()
       # Count the distinct months (not distinct dates) in which each customer ordered
       returning_customers = orders_2011.groupBy("customer_id").agg(countDistinct(month("Order_Date")).alias("order_months"))
       recurring_customers = returning_customers.filter("order_months = 12")
       jan_customers_count = jan_customers.count()
       recurring_customers_count = recurring_customers.count()
       ```
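
The recurring-customer rule can be checked on a small sample without Spark. A sketch in plain Python, assuming each record is a `(customer_id, order_date)` pair; the sample data is invented:

```python
from collections import defaultdict
from datetime import date

def recurring_customers(orders, target_year):
    """Return the customers who ordered in all 12 months of target_year."""
    months_seen = defaultdict(set)
    for customer_id, order_date in orders:
        if order_date.year == target_year:
            months_seen[customer_id].add(order_date.month)
    return sorted(c for c, months in months_seen.items() if len(months) == 12)

# Invented sample: "A" orders in every month of 2011, "B" orders twice in January
orders = [("A", date(2011, m, 15)) for m in range(1, 13)]
orders += [("B", date(2011, 1, 3)), ("B", date(2011, 1, 20))]
print(recurring_customers(orders, 2011))  # ['A']
```

Customer `B` has two distinct order dates but only one distinct month, which is why the rule counts months rather than dates.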

#### 10. **Calculate Total Sales, Profit, Quantity, and Discount by Customer (5 Marks)**
   - **Objective**: Compute total sales, profit, quantity, and discounts by customer and sort by total profit.
   - **Approach**:
     - Use `groupBy` and aggregation functions.
     - Example code snippet:
       ```python
       result_df = full_df.groupBy("customer_id") \
           .agg(_sum("quantity").alias("Total_Quantity"),
                _sum("discount").alias("Total_Discount"),
                _sum("sales").alias("Total_Sales"),
                _sum("profit").alias("Total_Profit")) \
           .orderBy("Total_Profit", ascending=False)
       
       result_df.show()
       ```

### Additional Steps

1. **Data Loading with Sqoop**:
   - If the data is stored in a relational database, use Sqoop to import the data into HDFS.

2. **Data Storage in Hive**:
   - Create external Hive tables pointing to the data stored in HDFS.
   - This allows you to run SQL queries using PySpark's `spark.sql` method for additional analysis.

3. **Setting Up the Environment in Jupyter**:
   - Ensure all necessary PySpark configurations are set in your Jupyter notebook, and all required libraries are imported.
   - Configure connections to HDFS, Hive, and Kafka correctly.

4. **Testing and Validation**:
   - Thoroughly test each step with sample data before running the full dataset.
   - Validate the output at each stage to ensure accuracy.
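
One way to validate a Spark result is to recompute it on a small sample with plain Python and compare the two. A sketch for the top-customers query, assuming the sample has been collected as `(customer_id, order_id)` pairs; the identifiers below are invented:

```python
from collections import Counter

def top_customers_by_orders(pairs, n=3):
    """Count distinct orders per customer and return the n largest as (customer, count)."""
    distinct_pairs = set(pairs)  # drop repeats of the same order (one row per product line)
    counts = Counter(customer_id for customer_id, _ in distinct_pairs)
    # Sort by count descending, then by customer id, so ties are deterministic
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:n]

# Invented sample: the repeated ("C1", "O1") pair stands for a two-product order
sample = [("C1", "O1"), ("C1", "O1"), ("C1", "O2"),
          ("C2", "O3"),
          ("C3", "O4"), ("C3", "O5"), ("C3", "O6"),
          ("C4", "O7")]
print(top_customers_by_orders(sample))  # [('C3', 3), ('C1', 2), ('C2', 1)]
```

The same pattern (a small reference function plus a comparison against the collected Spark output) can be applied to the other aggregations.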
