# Homework 04 (Build PySpark pipeline for the data transformation)

## Radosław Jurczak

-------------------------------------------------

The setup uses a Docker network named `de_network`, created with:
```{bash}
docker network create de_network
```

MinIO was started with the following command:
```{bash}
docker run -p 9000:9000 -p 9090:9090 --name minio --network=de_network -v ~/minio/data:/data -e "MINIO_ROOT_USER=admin" -e "MINIO_ROOT_PASSWORD=adminadmin" quay.io/minio/minio server /data --console-address ":9090"
```

To run the code below, first create a MinIO bucket called `hw4`.
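
The bucket can be created in the MinIO console (port 9090 in the command above) or from code. The following is a minimal sketch, assuming the `minio` Python client package is installed and the script runs on the host, so that the server is reachable at `localhost:9000`. The helper names `make_client` and `ensure_bucket` are illustrative and not part of the original setup.

```python
def make_client(endpoint="localhost:9000"):
    """Build a MinIO client with the credentials from the docker command above."""
    # Imported lazily so ensure_bucket() can be used with any compatible client.
    from minio import Minio  # assumption: installed via `pip install minio`

    return Minio(
        endpoint=endpoint,
        access_key="admin",
        secret_key="adminadmin",
        secure=False,  # the server above is started without TLS
    )


def ensure_bucket(client, name="hw4"):
    """Create the bucket if it is missing; return True when it was created."""
    if client.bucket_exists(bucket_name=name):
        return False
    client.make_bucket(bucket_name=name)
    return True


# Usage (needs the MinIO container to be running):
# ensure_bucket(make_client())
```

From inside the `de_network` network, the endpoint would presumably be `minio:9000` instead.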

The notebook was run inside a Docker container, started with:
```{bash}
docker run \
-it -d --rm \
--network=de_network \
-p 10000:8888 -p 4041:4040 \
-v "${PWD}":/home/rj/data_engineering \
quay.io/jupyter/all-spark-notebook:2023-10-20
```

---------------------------------------------------
The data describes transactions in an online store.

I store the data in MinIO according to the __Star Schema__: a single long fact table records `150 000` transaction events (`transaction_id`, `customer_id`, `product_id`, `location_id`, `time_id`, `product_amount`), and four dimension tables describe customers, products, times (exact timestamp, date, weekday, month, quarter, year) and customer locations.

### Data generation:

In [1]:
!pip install randomtimestamp



In [2]:
import datetime
import random

import randomtimestamp

In [3]:
N_TRANSACTIONS = 150_000
N_CUSTOMERS = 10_000
N_PRODUCTS = 5_000
N_LOCATIONS = 30_000
N_TIMES = 100_000

N_CITIES = 10_000
N_COUNTRIES = 195
N_CONTINENTS = 6
N_PRODUCT_CATEGORIES = 30

#### Customers dimension table: `customer_id`, `first_name`, `last_name`, `zip_code`, `is_company`, `discount`; the shop has 10_000 registered clients.

In [4]:
cities_aux = [f"city_{i}" for i in range(N_CITIES)]

customers = [
    (customer_id,
     f"first_name_{customer_id}",
     f"last_name_{customer_id}",
     # Zero-pad both parts so every code has the XX-XXX shape
     f"{random.randint(0, 99):02d}-{random.randint(0, 999):03d}",
     random.choice([True, False]),
     random.choice([0.0, 0.1, 0.15, 0.2])
    )
    for customer_id in range(N_CUSTOMERS)
]

#### Products dimension table: `product_id`, `product_name`, `product_category`, `unit_price`; the shop has 5_000 products on sale in 30 categories.

In [5]:
products = [
    (product_id,
     f"product_name_{product_id}",
     # randrange excludes the upper bound, giving exactly N_PRODUCT_CATEGORIES labels
     f"category_{random.randrange(N_PRODUCT_CATEGORIES)}",
     round(random.uniform(0.5, 2000.0), 2),
    )
    for product_id in range(N_PRODUCTS)
]
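
A note on the random helpers used for the labels: `random.randint(a, b)` includes both endpoints, while `random.randrange(n)` stops at `n - 1`. The sketch below, with a fixed seed chosen only for repeatability, counts how many distinct labels each call can produce for 30 categories.

```python
import random

N_PRODUCT_CATEGORIES = 30
random.seed(0)  # fixed seed so the check is repeatable

# randint(0, n) can return n itself; randrange(n) never does.
with_randint = {random.randint(0, N_PRODUCT_CATEGORIES) for _ in range(10_000)}
with_randrange = {random.randrange(N_PRODUCT_CATEGORIES) for _ in range(10_000)}

print("randint:  ", len(with_randint), "distinct values, max", max(with_randint))
print("randrange:", len(with_randrange), "distinct values, max", max(with_randrange))
```

With this many draws, the first set should contain one value more than the second.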

#### Locations dimension table: `location_id`, `location_name`, `country`, `continent`, `area_type`.

In [6]:
locations = [
    (location_id,
     f"location_name_{location_id}",
     # randrange excludes the upper bound, so the label counts match the constants
     f"country_{random.randrange(N_COUNTRIES)}",
     f"continent_{random.randrange(N_CONTINENTS)}",
     random.choice(["city", "town", "rural"]),
    )
    for location_id in range(N_LOCATIONS)
]

#### Times dimension table: `time_id`, `timestamp`, `date`, `weekday`, `month`, `quarter`, `year`

In [7]:
times = []
for i in range(N_TIMES):
    timestamp = randomtimestamp.randomtimestamp(start_year=2022, end_year=2023)
    times.append((
        i,
        timestamp,
        timestamp.date(),
        timestamp.weekday(),
        timestamp.month,
        (timestamp.month - 1)//3,
        timestamp.year
    ))
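
The `quarter` column is 0-based: `(month - 1) // 3` maps January through March to 0 and October through December to 3. A small sketch of the mapping (the function name `quarter_index` is illustrative):

```python
import datetime


def quarter_index(ts):
    """Return the 0-based quarter (0..3) of a date or datetime."""
    return (ts.month - 1) // 3


examples = [datetime.date(2023, month, 1) for month in (1, 3, 4, 6, 7, 12)]
print([quarter_index(d) for d in examples])  # [0, 0, 1, 1, 2, 3]
```

The first half of a year therefore corresponds to quarters 0 and 1.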

#### Transaction events fact table: `transaction_id`, `customer_id`, `product_id`, `location_id`, `time_id`, `product_amount`

In [8]:
transactions = [
    (transaction_id,
     # randrange excludes the upper bound, so every key matches a dimension row (IDs 0..N-1)
     random.randrange(N_CUSTOMERS),
     random.randrange(N_PRODUCTS),
     random.randrange(N_LOCATIONS),
     random.randrange(N_TIMES),
     random.randint(0, 500))
    for transaction_id in range(N_TRANSACTIONS)
]
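
Since the dimension IDs run from 0 to N - 1, every key in the fact table should fall in that range; a key with no matching dimension row is silently dropped by an inner join. Below is a minimal sketch of such a check on toy data. The helper `dangling_keys` and the toy rows are hypothetical and only illustrate the idea.

```python
def dangling_keys(fact_rows, key_position, dimension_ids):
    """Return the foreign-key values in fact_rows that no dimension row provides."""
    known = set(dimension_ids)
    return sorted({row[key_position] for row in fact_rows} - known)


# Toy data: three customers (ids 0..2) and four transactions.
toy_customers = [(0, "first_name_0"), (1, "first_name_1"), (2, "first_name_2")]
toy_transactions = [(0, 1), (1, 3), (2, 0), (3, 3)]  # (transaction_id, customer_id)

missing = dangling_keys(toy_transactions, 1, (c[0] for c in toy_customers))
print(missing)  # [3]
```

On the generated data, a call such as `dangling_keys(transactions, 1, range(N_CUSTOMERS))` would check the customer key in the same way.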

#### Convert everything to Spark, then save in Minio as parquet files organized according to the Star Schema model.

In [9]:
from pyspark.conf import SparkConf
from pyspark.context import SparkContext
from pyspark.sql import SparkSession
from pyspark.sql import functions as f
from pyspark.sql.types import StringType

In [10]:
spark_conf = (
    SparkConf()
    .set("spark.jars.packages", 'org.apache.hadoop:hadoop-client:3.3.4,org.apache.hadoop:hadoop-aws:3.3.4')
    .set("spark.driver.memory", "6g")
    .set("spark.hadoop.fs.s3a.endpoint", "minio:9000")
    .set("spark.hadoop.fs.s3a.access.key", "admin")
    .set("spark.hadoop.fs.s3a.secret.key", "adminadmin")
    .set("spark.hadoop.fs.s3a.path.style.access", "true")
    .set("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    .set('spark.hadoop.fs.s3a.aws.credentials.provider', 'org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider')
    .set("spark.hadoop.fs.s3a.connection.ssl.enabled", "false")
)
sc = SparkContext.getOrCreate(spark_conf)
spark = SparkSession(sc)

In [11]:
transaction_df = spark.createDataFrame(transactions, ["transaction_id", "customer_id", "product_id", "location_id", "time_id", "product_amount"])
print("Transactions df created")
customer_df = spark.createDataFrame(customers, ["customer_id", "first_name", "last_name", "zip_code", "is_company", "discount"])
print("Customers df created")
product_df = spark.createDataFrame(products, ["product_id", "name", "category", "unit_price"])
print("Products df created")
location_df = spark.createDataFrame(locations, ["location_id", "location_name", "country", "continent", "area_type"])
print("Locations df created")
time_df = spark.createDataFrame(times, ["time_id", "timestamp", "date", "weekday", "month", "quarter", "year"])
print("Times df created")

Transactions df created
Customers df created
Products df created
Locations df created
Times df created


In [12]:
transaction_df.write.format('parquet').mode('overwrite').save('s3a://hw4/transactions')
print("Transactions df saved")
customer_df.write.format('parquet').mode('overwrite').save('s3a://hw4/customers')
print("Customers df saved")
product_df.write.format('parquet').mode('overwrite').save('s3a://hw4/products')
print("Products df saved")
location_df.write.format('parquet').mode('overwrite').save('s3a://hw4/locations')
print("Locations df saved")
time_df.write.format('parquet').mode('overwrite').save('s3a://hw4/times')
print("Times df saved")

Transactions df saved
Customers df saved
Products df saved
Locations df saved
Times df saved


#### Retrieve data and create a report: top 10 product categories by sales revenue in the first half of the year 2023

In [13]:
transaction_df = spark.read.format("parquet").load('s3a://hw4/transactions')
transaction_df.printSchema()

root
 |-- transaction_id: long (nullable = true)
 |-- customer_id: long (nullable = true)
 |-- product_id: long (nullable = true)
 |-- location_id: long (nullable = true)
 |-- time_id: long (nullable = true)
 |-- product_amount: long (nullable = true)



In [14]:
product_df = spark.read.format("parquet").load('s3a://hw4/products')
product_df.printSchema()

root
 |-- product_id: long (nullable = true)
 |-- name: string (nullable = true)
 |-- category: string (nullable = true)
 |-- unit_price: double (nullable = true)



In [15]:
time_df = spark.read.format("parquet").load('s3a://hw4/times')
time_df.printSchema()

root
 |-- time_id: long (nullable = true)
 |-- timestamp: timestamp (nullable = true)
 |-- date: date (nullable = true)
 |-- weekday: long (nullable = true)
 |-- month: long (nullable = true)
 |-- quarter: long (nullable = true)
 |-- year: long (nullable = true)



In [16]:
report = (
    transaction_df
    .join(product_df, on="product_id")
    .join(time_df, on="time_id")
    # `quarter` is 0-based, so quarters 0 and 1 cover January through June
    .filter((f.col("year") == 2023) & (f.col("quarter").isin({0, 1})))
    .withColumn("revenue", f.round(f.col("product_amount") * f.col("unit_price"), 2))
    .groupby("category")
    .agg(f.sum("revenue").alias("total_revenue"))
    .sort("total_revenue", ascending=False)
    .limit(10)
)
print("10 categories that generated most revenue in the first two quarters of 2023:")
report.show()

10 categories that generated most revenue in the first two quarters of 2023:
+-----------+--------------------+
|   category|       total_revenue|
+-----------+--------------------+
| category_8|       3.758324464E8|
| category_5|      3.5610826225E8|
|category_13| 3.319138924000001E8|
| category_0|      3.2437222756E8|
| category_6|3.1898830100000006E8|
|category_14|      3.1811856803E8|
|category_25|3.1694807359999996E8|
| category_1|3.0797836798999995E8|
|category_29|      3.0713342208E8|
|category_22| 3.065464972800001E8|
+-----------+--------------------+
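
As a sanity check on the join, filter and aggregation logic, the same report can be reproduced on a handful of hand-written rows in plain Python. This is only a sketch: the toy tables and their values are made up, and only the column meanings follow the notebook.

```python
from collections import defaultdict

# Toy dimension tables (hypothetical values).
toy_products = {0: ("category_a", 10.0), 1: ("category_b", 2.5)}  # product_id -> (category, unit_price)
toy_times = {0: (2023, 0), 1: (2023, 1), 2: (2023, 2), 3: (2022, 0)}  # time_id -> (year, quarter)

# Toy fact table: (transaction_id, product_id, time_id, product_amount).
toy_transactions = [
    (0, 0, 0, 3),    # 2023, quarter 0: counted
    (1, 1, 1, 4),    # 2023, quarter 1: counted
    (2, 0, 2, 100),  # 2023, quarter 2: filtered out
    (3, 1, 3, 100),  # 2022: filtered out
    (4, 0, 1, 1),    # 2023, quarter 1: counted
]

revenue = defaultdict(float)
for _, product_id, time_id, amount in toy_transactions:
    category, unit_price = toy_products[product_id]  # join with products
    year, quarter = toy_times[time_id]                # join with times
    if year == 2023 and quarter in (0, 1):            # first half of 2023
        revenue[category] += round(amount * unit_price, 2)

top = sorted(revenue.items(), key=lambda kv: kv[1], reverse=True)[:10]
print(top)  # [('category_a', 40.0), ('category_b', 10.0)]
```

Only the three transactions from the first two quarters of 2023 contribute to the totals.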

