### PySpark Data Processing Project Checklist

#### 1. Data Loading
- [ ] Load `orders.csv` into a Spark DataFrame
- [ ] Load `customers.csv` into a Spark DataFrame

#### 2. Data Transformation
- [ ] Merge the datasets on `customer_id`
- [ ] Make all columns uppercase 
- [ ] Calculate each customer's total spend across all orders
- [ ] Create a new column `days_since_first_order` with the number of days since each customer's first order

#### 3. Aggregation
- [ ] Group data by `product_category`
- [ ] Calculate the total order amount for each category
- [ ] Identify the top 3 customers based on total spend

#### 4. Data Export
- [ ] Create `customer_spend.csv` with columns:
  - [ ] `customer_id`
  - [ ] `customer_name`
  - [ ] `total_spend`
  - [ ] `days_since_first_order`
- [ ] Create `category_summary.csv` with columns:
  - [ ] `product_category`
  - [ ] `total_order_amount`
- [ ] Verify both CSV files were saved correctly

#### 5. Validation (Optional)
- [ ] Verify row counts match expectations
- [ ] Check for null values in critical columns
- [ ] Validate calculations with sample data

In [7]:
# Imports for the ETL job.
# Note: sum, round, min, and max imported from pyspark.sql.functions
# shadow the Python built-ins of the same name in this notebook.
import os
from pyspark.sql import SparkSession
from pyspark.sql.functions import (
    col, lit, sum, avg, count, row_number, round, dayofmonth,
    min, max, current_date, datediff, upper, to_date,
)
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType, TimestampType
from pyspark.sql.window import Window

# Create (or reuse) a local Spark session that uses all available cores.
spark = (
    SparkSession.builder
    .appName("ecomm_orders_etl")
    .master("local[*]")
    .getOrCreate()
)