# DS5460 Milestone 2 - EDA 

### Ingesting Files to PySpark

Author: Anne Tumlin

Date: 03/21/25

Now that we have downloaded the files from the original GCS bucket, extracted them, and copied them into our own GCS bucket (see `docs/EXTRACTION_PROCESS` in GitHub for more details), we can begin to ingest our data into PySpark.

In [2]:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("app_name") \
    .getOrCreate()

Remember to replace the bucket in the path below with your own Google Cloud Storage bucket.

In [3]:
json_path = "gs://ds5460-tumlinam-fp-bucket/gridopt-dataset-tmp/dataset_release_1/pglib_opf_case500_goc/group_1/*.json"

### Testing on a Small Scale
Let's try this with only 100 files first. 

In [9]:
from google.cloud import storage
client = storage.Client()

Reminder: change `bucket_name` to your own bucket.

In [10]:
bucket_name = "ds5460-tumlinam-fp-bucket"
bucket = client.get_bucket(bucket_name)

In [12]:
prefix = "gridopt-dataset-tmp/dataset_release_1/pglib_opf_case500_goc/group_1/"

In [13]:
# List blobs (files) in the specified prefix and collect the first 100 file paths
blobs = bucket.list_blobs(prefix=prefix)
file_paths = []
for blob in blobs:
    file_paths.append(f"gs://{bucket_name}/{blob.name}")
    if len(file_paths) >= 100:
        break
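
The loop above keeps the first 100 objects under the prefix, whatever they are. A prefix listing can also return non-JSON objects (for example, a folder placeholder), so here is a minimal sketch of the same idea with a `.json` filter. The helper name `first_n_json_paths` and the sample object names are hypothetical, and the helper takes plain strings so it can be tried without touching GCS.

```python
from itertools import islice


def first_n_json_paths(blob_names, bucket_name, n=100):
    """Return up to n gs:// URIs, keeping only object names that end in .json."""
    json_names = (name for name in blob_names if name.endswith(".json"))
    # islice stops consuming the iterator as soon as n names have been taken
    return [f"gs://{bucket_name}/{name}" for name in islice(json_names, n)]


# Hypothetical object names standing in for the `blob.name` values
sample_names = [
    "group_1/",
    "group_1/example_1.json",
    "group_1/example_2.json",
    "group_1/notes.txt",
]
sample_paths = first_n_json_paths(sample_names, "my-bucket", n=100)
print(sample_paths)
```

With the real bucket, the call would look like `first_n_json_paths((b.name for b in bucket.list_blobs(prefix=prefix)), bucket_name)`.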

In [14]:
print("Loading the following 100 JSON file paths:")
for path in file_paths:
    print(path)

Loading the following 100 JSON file paths:
gs://ds5460-tumlinam-fp-bucket/gridopt-dataset-tmp/dataset_release_1/pglib_opf_case500_goc/group_1/example_15000.json
gs://ds5460-tumlinam-fp-bucket/gridopt-dataset-tmp/dataset_release_1/pglib_opf_case500_goc/group_1/example_15001.json
gs://ds5460-tumlinam-fp-bucket/gridopt-dataset-tmp/dataset_release_1/pglib_opf_case500_goc/group_1/example_15002.json
gs://ds5460-tumlinam-fp-bucket/gridopt-dataset-tmp/dataset_release_1/pglib_opf_case500_goc/group_1/example_15003.json
gs://ds5460-tumlinam-fp-bucket/gridopt-dataset-tmp/dataset_release_1/pglib_opf_case500_goc/group_1/example_15004.json
gs://ds5460-tumlinam-fp-bucket/gridopt-dataset-tmp/dataset_release_1/pglib_opf_case500_goc/group_1/example_15005.json
gs://ds5460-tumlinam-fp-bucket/gridopt-dataset-tmp/dataset_release_1/pglib_opf_case500_goc/group_1/example_15006.json
gs://ds5460-tumlinam-fp-bucket/gridopt-dataset-tmp/dataset_release_1/pglib_opf_case500_goc/group_1/example_15007.json
gs://ds5460-t

**IMPORTANT NOTE:** After testing and running into issues with the schema, I discovered that, because each JSON file is a single object spread across multiple lines, we must use the multiline read option. By default, Spark expects one complete JSON record per line, so without this option the data is not parsed. No exception is raised; instead, the schema collapses to a single column: `|-- _corrupt_record: string (nullable = true)`.
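
To see why line-by-line parsing fails, here is a small sketch that uses only Python's standard `json` module, not Spark. The miniature record and its `value` field are hypothetical stand-ins for the real files. It shows that a pretty-printed object is not valid JSON when read one line at a time, while the same object on a single line is.

```python
import json

# Hypothetical miniature record; the real files are far larger
record = {"grid": {"context": [[[100.0]]]}, "metadata": {"value": 1.0}}

pretty_text = json.dumps(record, indent=2)   # spans many lines, like our files
compact_text = json.dumps(record)            # one record on one line


def parses_line_by_line(text):
    """True only if every non-empty line is a complete JSON value by itself."""
    try:
        for line in text.splitlines():
            if line.strip():
                json.loads(line)
        return True
    except json.JSONDecodeError:
        return False


pretty_ok = parses_line_by_line(pretty_text)     # first line is just "{"
compact_ok = parses_line_by_line(compact_text)
whole_doc = json.loads(pretty_text)              # whole-document parse works
print(pretty_ok, compact_ok, whole_doc == record)
```

Reading the whole file as one document is what the multiline option asks Spark to do.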

In [19]:
# Make sure to use multiline read option!
df_small = spark.read.option("multiline", "true").json(file_paths)

In [20]:
df_small.printSchema()

root
 |-- grid: struct (nullable = true)
 |    |-- context: array (nullable = true)
 |    |    |-- element: array (containsNull = true)
 |    |    |    |-- element: array (containsNull = true)
 |    |    |    |    |-- element: double (containsNull = true)
 |    |-- edges: struct (nullable = true)
 |    |    |-- ac_line: struct (nullable = true)
 |    |    |    |-- features: array (nullable = true)
 |    |    |    |    |-- element: array (containsNull = true)
 |    |    |    |    |    |-- element: double (containsNull = true)
 |    |    |    |-- receivers: array (nullable = true)
 |    |    |    |    |-- element: long (containsNull = true)
 |    |    |    |-- senders: array (nullable = true)
 |    |    |    |    |-- element: long (containsNull = true)
 |    |    |-- generator_link: struct (nullable = true)
 |    |    |    |-- receivers: array (nullable = true)
 |    |    |    |    |-- element: long (containsNull = true)
 |    |    |    |-- senders: array (nullable = true)
 |    |    |  

In [21]:
df_small.show(2)

+--------------------+--------------------+--------------------+
|                grid|            metadata|            solution|
+--------------------+--------------------+--------------------+
|[[[[100.0]]], [[[...| [443934.8106702195]|[[[[[1.2271252469...|
|[[[[100.0]]], [[[...|[465533.45792886155]|[[[[[1.3700529900...|
+--------------------+--------------------+--------------------+
only showing top 2 rows



### Testing on the Full Dataset

In [None]:
df = spark.read.option("multiline", "true").json(json_path)

In [None]:
df.printSchema()
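
Since a failed parse shows up only in the schema, a quick programmatic check on the top-level column names can catch it before any further processing. The sketch below is one possible way to do that. The helper name `check_top_level_columns` is hypothetical, and the expected columns are an assumption taken from the `df_small.show(2)` output above.

```python
def check_top_level_columns(
    column_names,
    expected=("grid", "metadata", "solution"),  # assumed from the small-scale test
    corrupt_column="_corrupt_record",           # Spark's default corrupt-record column
):
    """Return a list of problems found; an empty list means the read looks fine."""
    problems = []
    if corrupt_column in column_names:
        problems.append(f"found {corrupt_column}: the JSON was probably not parsed")
    missing = [name for name in expected if name not in column_names]
    if missing:
        problems.append(f"missing expected columns: {missing}")
    return problems


good = check_top_level_columns(["grid", "metadata", "solution"])
bad = check_top_level_columns(["_corrupt_record"])
print(good)
print(bad)
```

On the real DataFrame, this would be called as `check_top_level_columns(df.columns)`.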