# Task 1 – Big Data Analysis with PySpark

This notebook demonstrates big data analysis using **PySpark** on a synthetic retail transactions dataset.

It covers:
- Creating a Spark session
- Generating a large synthetic dataset
- Loading data into a Spark DataFrame
- Running transformations and aggregations
- Deriving business insights from the results


In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, sum as spark_sum, count, avg, hour
import random
from datetime import datetime, timedelta

# Create Spark session
spark = SparkSession.builder \
    .appName('CodTech_Big_Data_Analysis') \
    .getOrCreate()

spark


In [None]:
# Generate synthetic large-scale retail transaction data
random.seed(42)  # fixed seed (arbitrary value) so repeated runs produce the same dataset
num_records = 500000  # half a million rows as an example of big data

customers = [f'CUST_{i:04d}' for i in range(1, 501)]
products = [
    ('PROD_A', 'Electronics'),
    ('PROD_B', 'Groceries'),
    ('PROD_C', 'Clothing'),
    ('PROD_D', 'Home & Kitchen'),
    ('PROD_E', 'Books')
]

start_date = datetime(2024, 1, 1)
rows = []

for i in range(num_records):
    cust = random.choice(customers)
    prod, category = random.choice(products)
    quantity = random.randint(1, 5)
    price = random.choice([199, 299, 399, 499, 799, 999, 1499])
    amount = quantity * price
    order_time = start_date + timedelta(minutes=random.randint(0, 60 * 24 * 90))

    rows.append((i + 1, cust, prod, category, quantity, price, amount, order_time))

columns = ['transaction_id', 'customer_id', 'product_id', 'category', 'quantity', 'price', 'amount', 'order_time']

rdd = spark.sparkContext.parallelize(rows)
df = spark.createDataFrame(rdd, schema=columns)

df.printSchema()
df.show(5, truncate=False)


In [None]:
# Basic data exploration
print('Total number of records:', df.count())

df.describe(['quantity', 'price', 'amount']).show()


In [None]:
# Insight 1: Total revenue by product category
revenue_by_category = df.groupBy('category').agg(
    spark_sum('amount').alias('total_revenue'),
    count('*').alias('num_transactions')
).orderBy(col('total_revenue').desc())

revenue_by_category.show()


In [None]:
# Insight 2: Top 10 customers by total spending
top_customers = df.groupBy('customer_id').agg(
    spark_sum('amount').alias('total_spent'),
    count('*').alias('num_orders')
).orderBy(col('total_spent').desc()).limit(10)

top_customers.show()


In [None]:
# Insight 3: Revenue by hour of day (peak business hours)
df_with_hour = df.withColumn('order_hour', hour(col('order_time')))

revenue_by_hour = df_with_hour.groupBy('order_hour').agg(
    spark_sum('amount').alias('total_revenue'),
    count('*').alias('num_transactions')
).orderBy('order_hour')

revenue_by_hour.show(24)
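
Because `order_time` is drawn uniformly over the 90-day window, the hourly breakdown above should come out nearly flat. The sketch below re-runs the same sampling rule in plain Python, without Spark, to show how small the hour-to-hour variation is. The sample size, the seed, and the names `rng`, `hour_counts`, and `max_deviation` are illustrative assumptions and not part of the notebook.

```python
import random
from collections import Counter
from datetime import datetime, timedelta

# Local generator so the notebook's global random state is left untouched
rng = random.Random(0)
start_date = datetime(2024, 1, 1)
sample_size = 48_000  # hypothetical sample size, small enough to run instantly

# Count simulated orders per hour of day using the notebook's sampling rule
hour_counts = Counter()
for _ in range(sample_size):
    offset = rng.randint(0, 60 * 24 * 90)
    order_time = start_date + timedelta(minutes=offset)
    hour_counts[order_time.hour] += 1

# Compare each hour against the flat expectation of sample_size / 24
expected = sample_size / 24
max_deviation = max(abs(n - expected) / expected for n in hour_counts.values())

print(f"Expected orders per hour: {expected:.0f}")
print(f"Largest relative deviation: {max_deviation:.1%}")
```

A small deviation here means any apparent peak hour in the Spark output is sampling noise rather than a pattern built into the data.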


## Summary of Insights

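- **Top Revenue Categories:** `revenue_by_category` ranks the five categories by total revenue. Because category and price are sampled independently and uniformly, every category has the same expected revenue (about 20% of the total), so the ranking changes from run to run and the gaps between categories reflect sampling noise rather than a real effect.
- **High-Value Customers:** `top_customers` lists the 10 customers with the highest total spend, the kind of segment a loyalty program would target. With 500 customers drawn uniformly, each places roughly 1,000 orders, so the top 10 here differ only by chance.
- **Peak Hours:** `revenue_by_hour` breaks revenue down by hour of day, which on real data would guide staffing and marketing decisions. Order times are uniform across the 90-day window, so the synthetic data shows no genuine peak.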
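
As a cross-check on the aggregation results, the expected figures can be worked out directly from the generator's parameters. The sketch below is plain Python and assumes the sampling is exactly uniform and independent, as written in the data-generation cell. The variable names are illustrative and do not appear in the notebook.

```python
from fractions import Fraction

# Parameters copied from the data-generation cell
num_records = 500_000
num_customers = 500
num_categories = 5
prices = [199, 299, 399, 499, 799, 999, 1499]
quantities = range(1, 6)  # random.randint(1, 5) is inclusive on both ends

# Quantity and price are independent, so E[amount] = E[quantity] * E[price]
mean_quantity = Fraction(sum(quantities), len(quantities))
mean_price = Fraction(sum(prices), len(prices))
mean_amount = mean_quantity * mean_price

# Uniform sampling splits orders and revenue evenly in expectation
expected_total_revenue = mean_amount * num_records
expected_revenue_per_category = expected_total_revenue / num_categories
expected_orders_per_customer = Fraction(num_records, num_customers)

print(f"Expected amount per transaction: {float(mean_amount):,.2f}")
print(f"Expected revenue per category:   {float(expected_revenue_per_category):,.2f}")
print(f"Expected orders per customer:    {float(expected_orders_per_customer):,.0f}")
```

Under these assumptions each transaction averages about 2,011 and each category should total roughly 201 million, so Spark results close to those values indicate the pipeline is aggregating correctly.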

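This notebook shows PySpark aggregating 500,000 records on a single machine. The same DataFrame code runs unchanged on a cluster, although the row-generation loop executes in plain Python on the driver and would need to be replaced by a distributed data source for much larger datasets.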
