# Homework 06 (DataHub)

## Radosław Jurczak

-------------------------------------------------

All containers share a Docker network, `de_network`, created with:
```{bash}
docker network create de_network
```

MinIO was started with the following command:
```{bash}
docker run -p 9000:9000 -p 9090:9090 --name minio --network=de_network -v ~/minio/data:/data -e "MINIO_ROOT_USER=admin" -e "MINIO_ROOT_PASSWORD=adminadmin" quay.io/minio/minio server /data --console-address ":9090"
```

To run the code below successfully, first create a MinIO bucket called `hw6`.
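
The bucket can be created in the MinIO console, or with a few lines of Python. The sketch below is not part of the original setup. It assumes the `minio` Python client package is installed (`pip install minio`), that the server is reachable from the host at `localhost:9000` (per the port mapping above), and it reuses the credentials from the `docker run` command. The helper names `ensure_bucket` and `make_client` are hypothetical.

```python
def ensure_bucket(client, bucket_name):
    """Create bucket_name if it does not exist yet; return True if it was created.

    `client` is assumed to expose `bucket_exists` and `make_bucket`,
    as the `minio.Minio` client does.
    """
    if client.bucket_exists(bucket_name=bucket_name):
        return False
    client.make_bucket(bucket_name=bucket_name)
    return True


def make_client():
    """Build a client for the MinIO container above (assumed reachable on localhost)."""
    from minio import Minio  # imported lazily so ensure_bucket has no hard dependency

    return Minio(
        endpoint="localhost:9000",
        access_key="admin",
        secret_key="adminadmin",
        secure=False,  # the container serves plain HTTP
    )
```

Calling `ensure_bucket(make_client(), "hw6")` once before running the notebook should be enough; a second call leaves the existing bucket untouched.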

The notebook was run inside Docker, in a container started with:
```{bash}
docker run \
-it -d --rm \
--network=de_network \
-p 10000:8888 -p 4041:4040 \
-v "${PWD}":/home/rj/data_engineering \
quay.io/jupyter/all-spark-notebook
```

---------------------------------------------------
Two base datasets are created:
 - `user` table, with columns `user_id`, `location_id`, `first_name`, `last_name`, `age`;
 - `location` table, with columns `location_id`, `zip_code`, `city`, `city_size`.

An additional table, `user_location`, is created by joining the location columns onto the `user` dataset by `location_id`.

Finally, a report table, `user_age_by_city`, is created by grouping `user_location` by city and computing the average user age in each group.
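
To make the two derived tables concrete, here is a plain-Python sketch of the same join and aggregation on a toy dataset. It is an illustration only: the rows are made up, the helper names `join_users_with_locations` and `average_age_by_city` are hypothetical, and the notebook itself does this with Spark DataFrames.

```python
from collections import defaultdict

# Toy rows, invented for illustration; column order follows the table descriptions.
users = [
    ("user_0", "location_0", "Ann", "Smith", 30),
    ("user_1", "location_1", "Bob", "Jones", 50),
    ("user_2", "location_0", "Eve", "Brown", 40),
]
locations = [
    ("location_0", "12-345", "city_0", "small"),
    ("location_1", "67-890", "city_1", "large"),
]


def join_users_with_locations(users, locations):
    """Inner join on location_id: each user row is extended with its location's columns."""
    by_id = {loc[0]: loc[1:] for loc in locations}
    return [
        (location_id, user_id, first, last, age, *by_id[location_id])
        for user_id, location_id, first, last, age in users
        if location_id in by_id
    ]


def average_age_by_city(user_location):
    """Group joined rows by city and return {city: average age rounded to 2 decimals}."""
    ages = defaultdict(list)
    for row in user_location:
        age, city = row[4], row[6]
        ages[city].append(age)
    return {city: round(sum(v) / len(v), 2) for city, v in ages.items()}


user_location = join_users_with_locations(users, locations)
report = average_age_by_city(user_location)
print(report)  # {'city_0': 35.0, 'city_1': 50.0}
```

The two users in `location_0` both land in `city_0`, so that city's average is the mean of their ages.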

---------------------------------------------------
Screenshots from DataHub are attached separately.

In [1]:
!pip install delta-spark
!pip install randomtimestamp
!pip install names



In [2]:
import random

import names
from pyspark.conf import SparkConf
from pyspark.context import SparkContext
from pyspark.sql import SparkSession
from pyspark.sql import functions as f

In [3]:
spark_conf = (
    SparkConf()
    # Hadoop S3A connector and Delta Lake packages
    .set("spark.jars.packages", "org.apache.hadoop:hadoop-client:3.3.4,org.apache.hadoop:hadoop-aws:3.3.4,io.delta:delta-spark_2.12:3.0.0")
    .set("spark.driver.memory", "6g")
    # MinIO connection: the `minio` container on de_network, plain HTTP, path-style access
    .set("spark.hadoop.fs.s3a.endpoint", "minio:9000")
    .set("spark.hadoop.fs.s3a.access.key", "admin")
    .set("spark.hadoop.fs.s3a.secret.key", "adminadmin")
    .set("spark.hadoop.fs.s3a.path.style.access", "true")
    .set("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    .set("spark.hadoop.fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider")
    .set("spark.hadoop.fs.s3a.connection.ssl.enabled", "false")
    # Delta Lake SQL extension and catalog
    .set("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .set("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    # .set("spark.databricks.delta.schema.autoMerge.enabled", "true") # enable adding columns on merge
)
sc = SparkContext.getOrCreate(spark_conf)
spark = SparkSession(sc)

In [4]:
print(f"Hadoop version = {spark._jvm.org.apache.hadoop.util.VersionInfo.getVersion()}")
print(f"Spark version = {spark.version}")

Hadoop version = 3.3.4
Spark version = 3.5.0


In [5]:
N_USERS = 20_000
N_LOCATIONS = 1_000
N_CITIES = 30

#### Generate user and location datasets and push them to Delta Lake on MinIO

In [6]:
# Each user gets a random location, a random name and an age between 18 and 100 (inclusive).
users = [
    (
        f"user_{i}",
        f"location_{random.randint(0, N_LOCATIONS - 1)}",
        names.get_first_name(),
        names.get_last_name(),
        random.randint(18, 100),
    )
    for i in range(N_USERS)
]

# Every city is assigned one fixed size category.
city2size = {
    f"city_{i}": random.choice(("small", "medium", "large"))
    for i in range(N_CITIES)
}
city_names = list(city2size.keys())

# Each location gets a random city and a random zip code in NN-NNN format.
locations = []
for i in range(N_LOCATIONS):
    city = random.choice(city_names)
    locations.append((
        f"location_{i}",
        f"{random.randint(10, 99)}-{random.randint(100, 999)}",
        city,
        city2size[city],
    ))

user_df = spark.createDataFrame(users, ["user_id", "location_id", "first_name", "last_name", "age"])
user_df.write.format("delta").mode("overwrite").save("s3a://hw6/user")
user_df.show(5)

location_df = spark.createDataFrame(locations, ["location_id", "zip_code", "city", "city_size"])
location_df.write.format("delta").mode("overwrite").save("s3a://hw6/location")
location_df.show(5)

+-------+------------+----------+-----------+---+
|user_id| location_id|first_name|  last_name|age|
+-------+------------+----------+-----------+---+
| user_0|location_372| Roosevelt|      Watts| 41|
| user_1|location_461|     Velma|      Loper| 65|
| user_2|location_873|      Mark|Zimmerebner| 61|
| user_3|location_973|   Douglas|    Kirksey| 58|
| user_4|location_893|     Maria|    Swilley| 40|
+-------+------------+----------+-----------+---+
only showing top 5 rows

+-----------+--------+-------+---------+
|location_id|zip_code|   city|city_size|
+-----------+--------+-------+---------+
| location_0|  78-256|city_27|   medium|
| location_1|  47-932| city_6|    large|
| location_2|  16-330|city_21|    large|
| location_3|  12-350| city_4|    small|
| location_4|  42-704|city_13|    large|
+-----------+--------+-------+---------+
only showing top 5 rows
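
Because the rows are drawn with `random`, every run produces different data. If reproducible output is wanted, one option (not used in the notebook) is to draw from a seeded `random.Random` instance. The sketch below applies this to the location rows only; `make_locations` and the seed value are hypothetical.

```python
import random


def make_locations(n_locations, n_cities, seed):
    """Generate location rows like the cell above, but from a seeded generator."""
    rng = random.Random(seed)
    city2size = {
        f"city_{i}": rng.choice(("small", "medium", "large"))
        for i in range(n_cities)
    }
    city_names = list(city2size)
    rows = []
    for i in range(n_locations):
        city = rng.choice(city_names)
        zip_code = f"{rng.randint(10, 99)}-{rng.randint(100, 999)}"
        rows.append((f"location_{i}", zip_code, city, city2size[city]))
    return rows


first = make_locations(n_locations=5, n_cities=3, seed=42)
second = make_locations(n_locations=5, n_cities=3, seed=42)
print(first == second)  # True: same seed, same rows
```

Using a dedicated `random.Random` instance rather than the module-level functions keeps the seed from affecting, or being affected by, other code that also uses `random`.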



#### Join users with locations and store the result in MinIO

In [7]:
user_location_df = user_df.join(
    location_df, on="location_id"
)
user_location_df.write.format("delta").mode("overwrite").save("s3a://hw6/user_location")
user_location_df.show(5)

+-----------+----------+----------+---------+---+--------+------+---------+
|location_id|   user_id|first_name|last_name|age|zip_code|  city|city_size|
+-----------+----------+----------+---------+---+--------+------+---------+
| location_8|user_18918|    Donald|  Rickman| 48|  76-507|city_3|   medium|
| location_8|user_17549|  Jennifer|    Perez| 95|  76-507|city_3|   medium|
| location_8|user_13131|    Evelyn|   Newman| 73|  76-507|city_3|   medium|
| location_8|user_12875|      John|    Yates| 54|  76-507|city_3|   medium|
| location_8|user_12411|      Erma|   Berger| 59|  76-507|city_3|   medium|
+-----------+----------+----------+---------+---+--------+------+---------+
only showing top 5 rows



#### Create a report table: average user age by city; store the result in MinIO

In [8]:
user_age_by_city_df = user_location_df.select(
    "city", "age"
).groupby("city").agg(
    f.round(f.avg("age"), 2).alias("average_user_age")
).sort("average_user_age", ascending=False)
user_age_by_city_df.write.format("delta").mode("overwrite").save("s3a://hw6/user_age_by_city")
user_age_by_city_df.show(5)

+-------+----------------+
|   city|average_user_age|
+-------+----------------+
|city_29|           60.67|
|city_20|            60.0|
|city_13|           59.99|
| city_3|           59.99|
|city_10|           59.89|
+-------+----------------+
only showing top 5 rows
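
As a quick sanity check on these numbers: ages are drawn uniformly from the integers 18 to 100, so the expected average is 59, and the per-city averages should scatter around it by roughly one year. The sketch below works this out. It assumes users are spread evenly across cities, which holds only approximately because locations are assigned to cities at random.

```python
import math

N_USERS = 20_000
N_CITIES = 30
AGE_MIN, AGE_MAX = 18, 100

# Mean and standard deviation of a discrete uniform distribution on [AGE_MIN, AGE_MAX].
n_values = AGE_MAX - AGE_MIN + 1
expected_age = (AGE_MIN + AGE_MAX) / 2
age_std = math.sqrt((n_values**2 - 1) / 12)

# Standard error of a city average, assuming an equal share of users per city.
users_per_city = N_USERS / N_CITIES
std_error = age_std / math.sqrt(users_per_city)

print(f"expected average age: {expected_age}")
print(f"approximate standard error per city: {std_error:.2f}")
```

The top averages shown above lie within two such standard errors of 59, which is plausible for the highest values among 30 cities.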

