### Day 3 Exercise: PySpark & Distributed Processing for Fraud Rules 🌪️

#### Objective
This exercise moves from single-node processing with Pandas to distributed data processing with PySpark. You will re-implement the existing fraud rule and add a more complex one using Spark's DataFrame and window-function APIs, so that the same logic can run across a cluster on a much larger dataset.

#### Scenario
The "Real-time Transaction Fraud Detection System" is now handling millions of transactions per hour. Pandas is no longer sufficient. Your task is to re-implement the existing fraud logic and add a new, more complex rule using PySpark within a simulated Azure Databricks environment.

#### Data to Use 📊
You will start with the enriched Pandas DataFrame (`df_enriched_transactions`) created in the Day 2 exercise. The first step will be to convert this into a PySpark DataFrame.

---

### Part 1: PySpark Implementation 💻

#### 1.1 Setup and Data Conversion
* **Initialize SparkSession**: Create a `SparkSession`, the entry point for any Spark functionality.
* **Create PySpark DataFrame**: Convert your cleaned and enriched Pandas DataFrame (`df_enriched_transactions`) into a PySpark DataFrame. Name it `sdf_enriched_transactions`.
* **Show Schema and Data**: Print the schema (`.printSchema()`) and show the first 5 rows (`.show(5)`) of your new PySpark DataFrame to verify the conversion.

#### 1.2 Fraud Rule 1 in PySpark: High Transaction Value for New Customers
* **Re-implement**: Using PySpark DataFrame operations, re-implement the logic for "Rule 1".
* **Logic Recap**:
    * A "new customer" is one whose `registration_date` falls within the 7 days leading up to the latest transaction `timestamp`.
    * A "high transaction value" is an `amount` > $500.
* **Implementation**: Use `pyspark.sql.functions` (like `col`, `when`, `lit`, `datediff`) to create the `is_fraudulent_rule1` column.

#### 1.3 Fraud Rule 2 in PySpark: Multiple Transactions in a Short Period from Different Locations
* **Implement New Rule**: You will now implement "**Rule 2: Multiple Transactions in a Short Period from Different Locations**".
* **Logic**: Flag a transaction if the *same customer* has made more than **3 transactions** within a **10-minute window** and the transaction's `ip_address` differs from the `ip_address` of that customer's previous transaction.
* **Implementation Steps**:
    1.  **Define a Window**: Use `Window.partitionBy("customer_id").orderBy("timestamp")` to create a window specification.
    2.  **Use Window Functions**:
        * Use the `lag()` function over the window to get the `ip_address` and `timestamp` of the previous transaction for each customer.
        * Calculate the time difference in seconds between the current and previous transaction.
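        * Use `count()` over a *time-based window* to count the customer's transactions in the preceding 10 minutes, for example `rangeBetween(-600, 0)` on a window ordered by the timestamp cast to epoch seconds (range windows need a numeric ordering column). A *rows-based window* such as `rowsBetween(-2, 0)` spans at most 3 rows and ignores elapsed time, so on its own it cannot detect "more than 3 transactions" in 10 minutes.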
    3.  **Apply Conditions**: Create a new boolean column `is_fraudulent_rule2` that is `True` if all conditions are met.
* **Display Results**: Show the `transaction_id`, `customer_id`, `timestamp`, `ip_address`, `is_fraudulent_rule1`, and `is_fraudulent_rule2` for any transactions flagged by either rule.

---

### Part 2: Conceptual Questions 🤔

#### 2.1 Why PySpark?
* Explain the fundamental reasons for moving from Pandas to PySpark for this fraud detection system. Discuss the limitations of Pandas and the architectural advantages of Spark (e.g., distributed processing, fault tolerance, lazy evaluation).

#### 2.2 Performance Tuning in Spark
* Imagine your PySpark job is running slowly. Describe **two** optimization techniques you could apply and explain how they would improve performance.
    * **Hint**: Consider concepts like partitioning, broadcasting, caching, or data serialization formats (like Parquet vs. CSV).

#### 2.3 Role of Azure Databricks
* Conceptually, how does a platform like Azure Databricks simplify the work you just did? Touch upon at least **three** aspects (e.g., managed infrastructure, collaborative notebooks, integration with Delta Lake, job scheduling).