### Gold Parquet to Delta

Because of my laptop's limitations, I could not process the silver layer locally with Spark. Instead, I used a simple Python-only approach and worked through batches of parquet on Runpod with 2 rented GPUs. The results are stored in `gold_parquet_gpu0` and `gold_parquet_gpu1`.

Due to dependency errors with XLRS in the Runpod cloud environment, the GI lexicon features had to be dropped.

This notebook reads the roughly 800 parquet chunks and combines them into a single Delta table to work with later.


# Parquet to Delta table

In [1]:
from pyspark.sql import SparkSession
from delta import configure_spark_with_delta_pip

builder = (
    SparkSession.builder.appName("gold_pipeline")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .config("spark.driver.memory", "8g")
    .config("spark.sql.shuffle.partitions", "16")
)

spark = configure_spark_with_delta_pip(builder).getOrCreate()

print("Spark session initialised.")


25/12/04 16:25:59 WARN Utils: Your hostname, david-ThinkPad-T490 resolves to a loopback address: 127.0.1.1; using 172.16.0.186 instead (on interface wlp0s20f3)
25/12/04 16:25:59 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address


:: loading settings :: url = jar:file:/home/david/School/CapStone/.venv/lib/python3.10/site-packages/pyspark/jars/ivy-2.5.1.jar!/org/apache/ivy/core/settings/ivysettings.xml


Ivy Default Cache set to: /home/david/.ivy2/cache
The jars for the packages stored in: /home/david/.ivy2/jars
io.delta#delta-spark_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-cbdb9142-0957-4b95-a307-98ee2dc3485a;1.0
	confs: [default]
	found io.delta#delta-spark_2.12;3.1.0 in central
	found io.delta#delta-storage;3.1.0 in central
	found org.antlr#antlr4-runtime;4.9.3 in central
:: resolution report :: resolve 485ms :: artifacts dl 17ms
	:: modules in use:
	io.delta#delta-spark_2.12;3.1.0 from central in [default]
	io.delta#delta-storage;3.1.0 from central in [default]
	org.antlr#antlr4-runtime;4.9.3 from central in [default]

Spark session initialised.


25/12/04 16:26:15 WARN GarbageCollectionMetrics: To enable non-built-in garbage collector(s) List(G1 Concurrent GC), users should configure it(them) to spark.eventLog.gcMetrics.youngGenerationGarbageCollectors or spark.eventLog.gcMetrics.oldGenerationGarbageCollectors


In [2]:
from pathlib import Path
import pandas as pd
from math import ceil

# ------------------------
# Directories
# ------------------------
FILE_DIR = Path.cwd()
G0 = FILE_DIR / "gold_parquet_gpu0"
G1 = FILE_DIR / "gold_parquet_gpu1"

all_files = sorted(G0.glob("*.parquet")) + sorted(G1.glob("*.parquet"))
print("Total parquet files detected:", len(all_files))

# ------------------------
# Exclude known-bad columns (anchor_policy has an int/string type mismatch)
# ------------------------
sample_df = pd.read_parquet(all_files[0])
bad_cols = ["anchor_policy"]

use_cols = [c for c in sample_df.columns if c not in bad_cols]
print("Using columns:", len(use_cols))
print("Dropped columns:", bad_cols)

# ------------------------
# Delta output directory
# ------------------------
DELTA_OUT = FILE_DIR / "gold_delta"
DELTA_OUT.mkdir(exist_ok=True)


Total parquet files detected: 780
Using columns: 461
Dropped columns: ['anchor_policy']
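
The cell above hard-codes `anchor_policy` as the only bad column. A minimal sketch of how such a type mismatch could be found automatically, assuming `pyarrow` is installed: read only the footer schema of each file and report every column whose type differs between files. The helper names `column_types` and `find_mismatched_columns` are hypothetical and not part of the original notebook, and the example types are made up for illustration.

```python
from collections import defaultdict


def column_types(path):
    """Return {column name: type string} from a parquet footer (no data is read)."""
    import pyarrow.parquet as pq  # assumption: pyarrow is available locally

    schema = pq.read_schema(str(path))
    return {name: str(dtype) for name, dtype in zip(schema.names, schema.types)}


def find_mismatched_columns(schemas):
    """Given {file: {column: type}}, return {column: sorted list of types}
    for every column that appears with more than one type."""
    seen = defaultdict(set)
    for types in schemas.values():
        for name, dtype in types.items():
            seen[name].add(dtype)
    return {name: sorted(t) for name, t in seen.items() if len(t) > 1}


# Illustrative input only (file names and types are invented).
example = {
    "chunk_000.parquet": {"len_text": "int32", "anchor_policy": "int64"},
    "chunk_001.parquet": {"len_text": "int32", "anchor_policy": "string"},
}
print(find_mismatched_columns(example))
```

In the notebook, this could be called as `find_mismatched_columns({f: column_types(f) for f in all_files})`, and the keys of the result used in place of the hand-written `bad_cols` list.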


In [3]:
batch_size = 10        # Safe for this laptop; lower it further if memory is still tight.
num_batches = ceil(len(all_files) / batch_size)

print(f"Processing {len(all_files)} files in {num_batches} batches of {batch_size} files each")

first_mode = "overwrite"
append_mode = "append"

for i in range(num_batches):
    batch_files = all_files[i*batch_size:(i+1)*batch_size]
    print(f"\n--- Batch {i+1}/{num_batches} ({len(batch_files)} files) ---")

    df = (
        spark.read
            .option("timestampType", "TIMESTAMP_MILLIS")
            .parquet(*[str(f) for f in batch_files])
            .select(*use_cols)
    )

    write_mode = first_mode if i == 0 else append_mode
    df.write.format("delta").mode(write_mode).save(str(DELTA_OUT))

    spark.catalog.clearCache()
    print(f"Batch {i+1}/{num_batches} complete.")


Processing 780 files in 78 batches of 10 files each

--- Batch 1/78 (10 files) ---


25/12/04 16:29:59 WARN SparkStringUtils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
                                                                                

Batch 1/78 complete.

[Output for batches 2 through 77 omitted; each printed the same header and completion line.]

--- Batch 78/78 (10 files) ---




Batch 78/78 complete.
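
The batching logic in the loop above does not depend on Spark and can be checked on its own. A minimal sketch, using the hypothetical helper names `iter_batches` and `write_mode_for`: slice the file list into groups of `batch_size`, and use `overwrite` only for the first group so that a rerun replaces the table instead of appending to it. The file names are stand-ins.

```python
from math import ceil


def iter_batches(items, batch_size):
    """Yield (index, batch) pairs; the last batch may be shorter."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield start // batch_size, items[start:start + batch_size]


def write_mode_for(batch_index):
    """First batch replaces any existing table; later batches append to it."""
    return "overwrite" if batch_index == 0 else "append"


files = [f"chunk_{n:03d}.parquet" for n in range(780)]  # stand-in names
batches = list(iter_batches(files, 10))

# The number of batches matches the ceil() arithmetic used in the notebook.
print(len(batches), ceil(len(files) / 10))
print([write_mode_for(i) for i, _ in batches[:3]])
```

Note that a run interrupted partway through leaves a partial table, and restarting from batch 0 rebuilds it from scratch because of the initial `overwrite`.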


                                                                                

In [4]:
df_delta = spark.read.format("delta").load(str(DELTA_OUT))

print("Delta table loaded.")
print("Schema:")
df_delta.printSchema()

print("Row count:")
print(df_delta.count())


Delta table loaded.
Schema:
root
 |-- date: timestamp_ntz (nullable = true)
 |-- text: string (nullable = true)
 |-- publication: string (nullable = true)
 |-- author: string (nullable = true)
 |-- url: string (nullable = true)
 |-- text_type: string (nullable = true)
 |-- time_precision: string (nullable = true)
 |-- date_trading: timestamp_ntz (nullable = true)
 |-- tz_hint: string (nullable = true)
 |-- dataset: string (nullable = true)
 |-- dataset_source: string (nullable = true)
 |-- source: string (nullable = true)
 |-- source_file: string (nullable = true)
 |-- len_text: integer (nullable = true)
 |-- silver_ingestion_ts: timestamp_ntz (nullable = true)
 |-- emb_0: float (nullable = true)
 |-- emb_1: float (nullable = true)
 |-- emb_2: float (nullable = true)
 |-- emb_3: float (nullable = true)
 |-- emb_4: float (nullable = true)
 |-- emb_5: float (nullable = true)
 |-- emb_6: float (nullable = true)
 |-- emb_7: float (nullable = true)
 |-- emb_8: float (nullable = true)
 |-- e



1949081
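
One way to confirm that no rows were lost in the conversion is to compare the Delta row count with the sum of the row counts stored in the parquet footers. This is a sketch under the assumption that `pyarrow` is installed. `footer_row_count`, `total_rows`, and `check_row_counts` are hypothetical helpers, and the counter is injectable so the logic can be exercised without real files.

```python
def footer_row_count(path):
    """Row count from parquet metadata (reads the footer, not the data)."""
    import pyarrow.parquet as pq  # assumption: pyarrow is available locally

    return pq.ParquetFile(str(path)).metadata.num_rows


def total_rows(paths, count_fn=footer_row_count):
    """Sum per-file row counts using count_fn."""
    return sum(count_fn(p) for p in paths)


def check_row_counts(expected, actual):
    """Raise if the converted table does not hold the expected number of rows."""
    if expected != actual:
        raise ValueError(f"row count mismatch: expected {expected}, got {actual}")
    return True


# Stand-in counts for illustration only.
fake_counts = {"a.parquet": 3, "b.parquet": 4}
print(check_row_counts(total_rows(fake_counts, fake_counts.get), 7))
```

In the notebook, the equivalent call would be `check_row_counts(total_rows(all_files), df_delta.count())`.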


                                                                                