# Spark with Parquet File Reads on S3

## Prerequisites

#### 1. Make sure Maven is configured with Java 11
1. Hadoop 3.x requires Java 8 or Java 11, so point `PATH` to a Java 11 installation.
2. Check the active versions with `mvn --version`.
3. If Maven is not configured, set it up: https://phoenixnap.com/kb/install-maven-windows
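
The Java check can also be scripted. The following is a minimal sketch, assuming `java` is on `PATH`; the helper names `parse_java_major` and `installed_java_major` are hypothetical and not part of any library.

```python
import re
import shutil
import subprocess


def parse_java_major(version_output: str) -> int:
    """Return the major Java version from `java -version` style output."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        raise ValueError("no Java version string found")
    major = int(match.group(1))
    # Java 8 and earlier report themselves as 1.x (for example "1.8.0_381").
    if major == 1 and match.group(2) is not None:
        return int(match.group(2))
    return major


def installed_java_major():
    """Run `java -version`; return the major version, or None if java is missing."""
    java = shutil.which("java")
    if java is None:
        return None
    # `java -version` writes its banner to stderr rather than stdout.
    result = subprocess.run([java, "-version"], capture_output=True, text=True)
    return parse_java_major(result.stderr or result.stdout)
```

A result other than 8 or 11 from `installed_java_major()` suggests that `PATH` points at a different Java installation than the one intended here.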

#### 2. Add `winutils.exe` to the Hadoop home
1. Create a directory for the Hadoop home (e.g. `C:\hadoop`).
2. Download `winutils.exe`. A popular repository that hosts it is https://github.com/steveloughran/winutils.  
   File link: https://github.com/steveloughran/winutils/blob/master/hadoop-3.0.0/bin/winutils.exe
3. Create a folder named `bin` in the Hadoop home and add `winutils.exe` to it.
4. Add the environment variable `HADOOP_HOME` (`C:\hadoop`).
5. Add `HADOOP_HOME\bin` to `PATH` (`%HADOOP_HOME%\bin`).
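
The layout described in the steps above can be verified before starting Spark. This is a small sketch; `winutils_problems` is a hypothetical helper that only inspects the environment mapping passed to it and checks that `winutils.exe` exists.

```python
import os
from pathlib import Path


def winutils_problems(env) -> list:
    """Return a list of human-readable problems with the winutils setup in `env`."""
    hadoop_home = env.get("HADOOP_HOME")
    if not hadoop_home:
        return ["HADOOP_HOME is not set"]
    problems = []
    bin_dir = Path(hadoop_home) / "bin"
    if not (bin_dir / "winutils.exe").is_file():
        problems.append(f"winutils.exe not found in {bin_dir}")
    # Compare resolved paths so trailing separators or relative segments do not matter.
    path_dirs = [Path(p).resolve() for p in env.get("PATH", "").split(os.pathsep) if p]
    if bin_dir.resolve() not in path_dirs:
        problems.append(f"{bin_dir} is not on PATH")
    return problems
```

Calling `winutils_problems(os.environ)` in a fresh terminal returns an empty list when the setup is complete. Environment variable changes on Windows usually only show up in newly opened terminals.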

#### 3. Add the following AWS values to environment variables

These can be taken from https://sysco-sso.awsapps.com/start/#

1. `AWS_REGION`
2. `AWS_ACCESS_KEY_ID`
3. `AWS_SECRET_ACCESS_KEY`
4. `AWS_SESSION_TOKEN`
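
A quick way to confirm that all four variables are visible to the Python process is sketched below. `missing_aws_vars` is a hypothetical helper; it reports only names and never prints secret values.

```python
import os

# The four variables listed above.
REQUIRED_AWS_VARS = (
    "AWS_REGION",
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "AWS_SESSION_TOKEN",
)


def missing_aws_vars(env=None) -> list:
    """Return the names of required AWS variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_AWS_VARS if not env.get(name)]
```

Running `missing_aws_vars()` before creating the Spark session makes a forgotten variable easier to spot than a failed S3 request later on. Session tokens are temporary, so the values generally need to be refreshed once they expire.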

## Create Spark Session

In [2]:
ICEBERG_S3_WAREHOUSE = "s3://cx-unique-purchase-data-non-prod/dev/"
AWS_REGION = "us-east-1"

In [3]:
from pyspark.sql import SparkSession
import time

start_time = time.time()

spark = (
    SparkSession.builder
    .appName('Spark Iceberg Example')

    # General tuning: long network timeout, no automatic broadcast joins,
    # static executor allocation, Kryo serialization
    .config("spark.network.timeout", '10000s')
    .config('spark.sql.autoBroadcastJoinThreshold', -1)
    .config('spark.shuffle.consolidateFiles', True)
    .config('spark.dynamicAllocation.enabled', False)
    .config("spark.serializer", 'org.apache.spark.serializer.KryoSerializer')
    .config('spark.shuffle.service.enabled', False)

    # Route s3:// URIs through the S3A connector
    .config('spark.hadoop.fs.s3.impl', 'org.apache.hadoop.fs.s3a.S3AFileSystem')

    # Regional S3 endpoint
    .config("spark.hadoop.fs.s3a.endpoint", f"s3.{AWS_REGION}.amazonaws.com")

    # S3A connector and AWS SDK bundle, downloaded when the session starts
    .config('spark.jars.packages', 
            'org.apache.hadoop:hadoop-aws:3.3.4,'
            'com.amazonaws:aws-java-sdk-bundle:1.11.901')

    .config('spark.partial-progress.enabled', True)
    .getOrCreate()
)

end_time = time.time()

print(f"Spark session created in {end_time - start_time} seconds")

Spark session created in 9.816956281661987 seconds
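
The session above sets no credentials explicitly, so the S3A connector presumably picks them up from the environment variables in the prerequisites. If they need to be passed explicitly instead, one option is the S3A temporary-credentials provider. The sketch below only builds the configuration dictionary; `s3a_session_conf` is a hypothetical helper, and whether explicit configuration is needed in this setup is an assumption.

```python
import os


def s3a_session_conf(env=None) -> dict:
    """Build Spark settings that pass temporary AWS credentials to S3A explicitly."""
    env = os.environ if env is None else env
    return {
        # Provider shipped in hadoop-aws for access key + secret key + session token.
        "spark.hadoop.fs.s3a.aws.credentials.provider":
            "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider",
        "spark.hadoop.fs.s3a.access.key": env["AWS_ACCESS_KEY_ID"],
        "spark.hadoop.fs.s3a.secret.key": env["AWS_SECRET_ACCESS_KEY"],
        "spark.hadoop.fs.s3a.session.token": env["AWS_SESSION_TOKEN"],
    }
```

Each entry can be applied with `.config(key, value)` on the builder before `.getOrCreate()`. Avoid printing or logging the returned dictionary, since it contains the secret values.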


## Read Parquet File

In [4]:
s3_parquet_path = "s3://cx-staging-glue-catalog/dev/sysco_src_restrictions_product_attributes_staging.parquet"
parquet_df = spark.read.parquet(s3_parquet_path)
parquet_df.show()

+-----------+-----------+---------+-------------+---------+--------------------+--------------------+-------+
|customer_id|cust_id_int|seller_id|sub_seller_id|list_type|             list_id|          updated_at|opco_id|
+-----------+-----------+---------+-------------+---------+--------------------+--------------------+-------+
|     000000|          0|     USBL|         USBL|    Block|056_common_cannotbuy|2024-03-12 12:47:...|     56|
|     000000|          0|     USBL|         USBL|  Can Buy|          056_canbuy|2024-03-12 12:47:...|     56|
|     000006|          6|     USBL|         USBL|    Block|056_common_cannotbuy|2024-09-24 23:35:...|     56|
|     000008|          8|     USBL|         USBL|  Can Buy|          056_canbuy|2024-09-24 23:35:...|     56|
|     000009|          9|     USBL|         USBL|    Block|056_common_cannotbuy|2024-03-12 12:47:...|     56|
|     000009|          9|     USBL|         USBL|  Can Buy|          056_canbuy|2024-03-12 12:47:...|     56|
|     0000