# **Knowledge Graph Generation**


### Load data into Spark

In [1]:
import sys
sys.path.append("/usr/local/python-env/py39/lib/python3.9/site-packages")

import pyspark
print(pyspark.__version__)

print(sys.executable)

3.5.1
/usr/bin/python3.9


### Initialze a SparkSession

Initialize a test session to ensure the SparkSession is working properly. This will connect to the resource manager node that is running the YARN cluster. If we visit the YARN web portal, we can see that the Spark application is running.

Ensuring the pyspark library is being accessed from my local usr directory.


In [2]:
import os
os.environ['PYSPARK_PYTHON'] = '/usr/bin/python3.9'

In [3]:
import pkg_resources

sedona_version = pkg_resources.get_distribution("apache-sedona").version
print(f"Apache Sedona version: {sedona_version}")

Apache Sedona version: 1.5.1


In [4]:
print(os.environ['SPARK_HOME'])
print(os.environ['PYSPARK_PYTHON'])

/usr/local/spark/latest
/usr/bin/python3.9


In [5]:
from pyspark.sql import SparkSession, DataFrame
from pyspark.sql.functions import col, lit
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.functions import split
from pyspark.sql.functions import col
from pyspark.sql import Row
from pyspark.sql.types import IntegerType, DateType
from pyspark.sql.functions import year  # used to extract year from date, could do this manually as well
from pyspark.sql import Window
from pyspark.sql.functions import sum as pyspark_sum
from pyspark.sql import DataFrame
from pyspark.sql.types import StructType, StructField, StringType
from pyspark.sql import SparkSession
from sedona.register import SedonaRegistrator
from sedona.utils import SedonaKryoRegistrator, KryoSerializer
from sedona.spark import *
import geopandas as gpd

Skipping SedonaKepler import, verify if keplergl is installed


## Now to make the app

In [6]:


spark = SparkSession \
    .builder \
    .appName('GeoSpatialQueries_Freddy') \
    .master('spark://columbus-oh.cs.colostate.edu:30800') \
    .config("spark.yarn.resourcemanager.address", "columbia.cs.colostate.edu:30799") \
    .config("spark.serializer", KryoSerializer.getName) \
    .config("spark.kryo.registrator", SedonaKryoRegistrator.getName) \
    .config('spark.jars.packages',
            'org.apache.sedona:sedona-spark-3.5_2.12:1.5.1,'
            'org.datasyslab:geotools-wrapper:1.5.1-28.2') \
    .config('spark.jars.repositories', 'https://artifacts.unidata.ucar.edu/repository/unidata-all') \
    .getOrCreate()

# Set log level to DEBUG
spark.sparkContext.setLogLevel("ERROR")

sedona = SedonaContext.create(spark)
SedonaRegistrator.registerAll(spark)

# create a logger
logger = spark._jvm.org.apache.log4j.LogManager.getLogger(__name__)
logger.info("Pyspark initialized...")

https://artifacts.unidata.ucar.edu/repository/unidata-all added as a remote repository with the name: repo-1
Ivy Default Cache set to: /s/chopin/a/grad/flarrieu/.ivy2/cache
The jars for the packages stored in: /s/chopin/a/grad/flarrieu/.ivy2/jars
org.apache.sedona#sedona-spark-3.5_2.12 added as a dependency
org.datasyslab#geotools-wrapper added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-2998bf84-9932-4b8a-8ef0-266f4e53c2d8;1.0
	confs: [default]
	found org.apache.sedona#sedona-spark-3.5_2.12;1.5.1 in central


:: loading settings :: url = jar:file:/usr/local/spark/3.5.0-with-hadoop3.3/jars/ivy-2.5.1.jar!/org/apache/ivy/core/settings/ivysettings.xml


	found org.apache.sedona#sedona-common;1.5.1 in central
	found org.apache.commons#commons-math3;3.6.1 in central
	found org.locationtech.jts#jts-core;1.19.0 in central
	found org.wololo#jts2geojson;0.16.1 in central
	found org.locationtech.spatial4j#spatial4j;0.8 in central
	found com.google.geometry#s2-geometry;2.0.0 in central
	found com.google.guava#guava;25.1-jre in central
	found com.google.code.findbugs#jsr305;3.0.2 in user-list
	found org.checkerframework#checker-qual;2.0.0 in central
	found com.google.errorprone#error_prone_annotations;2.1.3 in central
	found com.google.j2objc#j2objc-annotations;1.1 in central
	found org.codehaus.mojo#animal-sniffer-annotations;1.14 in central
	found com.uber#h3;4.1.1 in central
	found net.sf.geographiclib#GeographicLib-Java;1.52 in central
	found com.github.ben-manes.caffeine#caffeine;2.9.2 in central
	found org.checkerframework#checker-qual;3.10.0 in central
	found com.google.errorprone#error_prone_annotations;2.5.1 in central
	found org.apac

## Load the datasets

Got the code for this from https://sedona.apache.org/1.5.1/tutorial/sql/

Load GeoJSON using Spark JSON Data Source:

Spark SQL's built-in JSON data source supports reading GeoJSON data. To ensure proper parsing of the geometry property, we can define a schema with the geometry property set to type 'string'. This prevents Spark from interpreting the property and allows us to use the ST_GeomFromGeoJSON function for accurate geometry parsing.

```python
schema = "type string, crs string, totalFeatures long, features array<struct<type string, geometry string, properties map<string, string>>>";
(sedona.read.json(geojson_path, schema=schema)
    .selectExpr("explode(features) as features") # Explode the envelope to get one feature per row.
    .select("features.*") # Unpack the features struct.
    .withColumn("geometry", f.expr("ST_GeomFromGeoJSON(geometry)")) # Convert the geometry string.
    .printSchema())
```

In [7]:
# Import the necessary module from py4j to interact with JVM
from py4j.java_gateway import java_import

# Import the Path class from Hadoop. This class is used to handle file paths in Hadoop.
java_import(spark._jvm, 'org.apache.hadoop.fs.Path')

# Define a function to recursively get all .json and .geojson files in a directory and its subdirectories
def get_files_recursive(path):
    # Use the listStatus method of the FileSystem class to get an array of FileStatus objects
    # Each FileStatus object represents a file or directory in the given path
    file_status_arr = fs.listStatus(spark._jvm.Path(path))
    
    # Initialize an empty list to hold the file paths
    file_paths = []
    
    # Loop through each FileStatus object in the array
    for file_status in file_status_arr:
        # If the FileStatus object represents a directory
        if file_status.isDirectory():
            # Call the get_files_recursive function with the directory path
            # This is a recursive call, which means the function calls itself
            # Add the returned file paths to the file_paths list
            file_paths += get_files_recursive(file_status.getPath().toString())
        # If the FileStatus object represents a file that ends with .json or .geojson
        elif file_status.getPath().getName().endswith(('.json', '.geojson')):
            # Add the file path to the file_paths list
            file_paths.append(file_status.getPath().toString())
    
    print(file_paths)
    # Return the list of file paths
    return file_paths

In [8]:

# Initialize a Hadoop file system 
fs = spark._jvm.org.apache.hadoop.fs.FileSystem.get(spark._jsc.hadoopConfiguration())

# Directory containing the files
json_directory = "hdfs://columbus-oh.cs.colostate.edu:30785/geospatial/input/"

# Define the schema for the GeoJSON data
geojsonSchema = "type string, crs string, totalFeatures long, features array<struct<type string, geometry string, properties map<string, string>>>"


# Get a list of the JSON and GeoJSON files in the directory and its subdirectories
json_files = get_files_recursive(json_directory)

# Create a dictionary to hold the DataFrames
json_dataset_dataframes = {}

# Define the current and desired EPSG codes
current_epsg = "EPSG:3857"  # Web Mercator
desired_epsg = "EPSG:4326"  # WGS84

# Load each JSON file into a DataFrame and store it in the dictionary
for file_path in json_files:
    file_name = file_path.split('/')[-1]
    
    # Print the file path
    print(f"Processing file: {file_path}")
    
    # Read the GeoJSON file using the defined schema using sedona into a spark dataframe
    df = spark.read.schema(geojsonSchema).json(file_path, multiLine=True)
    
    # Explode the features array to create a row for each feature and select the columns
    df = (df
        .select(F.explode("features").alias("features"))
        .select("features.*")
        # Use Sedona's ST_GeomFromGeoJSON function to convert the geometry string to a geometry object
        .withColumn("geometry", F.expr("ST_GeomFromGeoJSON(geometry)"))
        )
    
    json_dataset_dataframes[file_name] = df

['hdfs://columbus-oh.cs.colostate.edu:30785/geospatial/input/BlocksByState/Alabama.geojson', 'hdfs://columbus-oh.cs.colostate.edu:30785/geospatial/input/BlocksByState/Alaska.geojson', 'hdfs://columbus-oh.cs.colostate.edu:30785/geospatial/input/BlocksByState/AmericanSamoa.geojson', 'hdfs://columbus-oh.cs.colostate.edu:30785/geospatial/input/BlocksByState/Arizona.geojson', 'hdfs://columbus-oh.cs.colostate.edu:30785/geospatial/input/BlocksByState/Arkansas.geojson', 'hdfs://columbus-oh.cs.colostate.edu:30785/geospatial/input/BlocksByState/California.geojson', 'hdfs://columbus-oh.cs.colostate.edu:30785/geospatial/input/BlocksByState/Colorado.geojson', 'hdfs://columbus-oh.cs.colostate.edu:30785/geospatial/input/BlocksByState/CommonwealthoftheNorthernMarianaIslands.geojson', 'hdfs://columbus-oh.cs.colostate.edu:30785/geospatial/input/BlocksByState/Connecticut.geojson', 'hdfs://columbus-oh.cs.colostate.edu:30785/geospatial/input/BlocksByState/Delaware.geojson', 'hdfs://columbus-oh.cs.colostate

# **Helper Functions**

In [9]:
# Assuming you have access to a function to write to HDFS or similar storage
def append_to_csv(df: DataFrame):
    path = "hdfs://columbus-oh.cs.colostate.edu:30785/geospatial/graph/base_graph.csv"
    # Here we use mode 'append' to add to the existing file
    df.write.csv(path=path, mode='append', header=True)

In [10]:
def load_and_display_graph() -> DataFrame:
    hdfs_path =  "hdfs://columbus-oh.cs.colostate.edu:30785/geospatial/graph/base_graph.csv"
    
    # Define the schema for the graph data
    schema = StructType([
        StructField("Subject", StringType(), True),
        StructField("Relationship", StringType(), True),
        StructField("Object", StringType(), True)
    ])
    
    # Load the CSV file into a DataFrame with the defined schema
    graph_df = spark.read.csv(hdfs_path, header=True, schema=schema)
    
    # Show a sample of the DataFrame
    graph_df.show(truncate=False)
    
    return graph_df

# **Add Continents**

In [8]:
from pyspark.sql import functions as F

# Create a DataFrame for Earth with a single row
earth_df = spark.createDataFrame([("Earth",)], ["name"])

# Assuming df_continents already loaded
df_continents = json_dataset_dataframes['WorldContinents.geojson']

# Add columns to df_continents that establishes the 'partOf' relationship to Earth
df_continents = df_continents.withColumn("Subject", F.col("properties.CONTINENT"))
df_continents = df_continents.withColumn("Relationship", F.lit("partOf"))
df_continents = df_continents.withColumn("Object", F.lit("Earth"))

# Select the new structured columns to form the triple
df_to_save = df_continents.select("Subject", "Relationship", "Object")

# Show DataFrame to verify the structure
df_to_save.show()

[Stage 6:>                                                          (0 + 1) / 1]

+-------------+------------+------+
|      Subject|Relationship|Object|
+-------------+------------+------+
|       Africa|      partOf| Earth|
|         Asia|      partOf| Earth|
|    Australia|      partOf| Earth|
|      Oceania|      partOf| Earth|
|South America|      partOf| Earth|
|   Antarctica|      partOf| Earth|
|       Europe|      partOf| Earth|
|North America|      partOf| Earth|
+-------------+------------+------+



                                                                                

In [11]:
# Define the HDFS path for the output CSV file
hdfs_output_path = "hdfs://columbus-oh.cs.colostate.edu:30785/geospatial/graph/base_graph.csv"

# Try to save the DataFrame to HDFS
try:
    df_to_save.write.csv(path=hdfs_output_path, mode="overwrite", header=True)
    print(f"CSV file has been successfully saved to HDFS at {hdfs_output_path}.")
except Exception as e:
    print("Failed to write CSV file to HDFS:", e)

CSV file has been successfully saved to HDFS at hdfs://columbus-oh.cs.colostate.edu:30785/geospatial/graph/base_graph.csv.


# **Add Countries**

In [8]:
df_continents = json_dataset_dataframes['WorldContinents.geojson']
df_countries = json_dataset_dataframes['CountryTerritories.geojson']

In [11]:
df_continents.show(n=1, truncate=False)
df_countries.show(n=1, truncate=False)

                                                                                

+-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

[Stage 9:>                                                          (0 + 1) / 1]

+-------+---------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|type   |geometry                                                                                                 |properties                                                                                                                                                                                                                                                              |
+-------+---------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------

                                                                                

In [21]:
# Rename geometries directly and avoid unnecessary transformation if already in 'epsg:4326'
df_continents = df_continents.withColumnRenamed("geometry", "continents_geometry")
df_countries = df_countries.withColumnRenamed("geometry", "country_geometry")

# Also rename properties to differentiate them
df_continents = df_continents.withColumnRenamed("properties", "continent_properties")
df_countries = df_countries.withColumnRenamed("properties", "country_properties")

# Now perform the spatial join using the renamed columns
df_country_continent_instersects = df_continents.crossJoin(df_countries).where(
    F.expr("ST_Contains(continents_geometry, ST_Centroid(country_geometry))")
)

# Show results, making sure to differentiate which properties you're referring to
df_country_continent_instersects.select("continent_properties", "country_properties").show(n=1, truncate=False)


[Stage 120:>                                                        (0 + 1) / 1]

+--------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|continent_properties                                                                                                                                    |country_properties                                                                                                                                                                                                                                                      |
+--------------------------------------------------------------------------------------------------------------------------------------------------------+------

                                                                                

In [22]:
from pyspark.sql import SparkSession, DataFrame
from pyspark.sql.functions import col, lit

df_relationships = df_country_continent_instersects.select(
    col("country_properties.name").alias("Subject"),
    lit("isPartOf").alias("Relationship"),
    col("continent_properties.CONTINENT").alias("Object")
)

df_relationships.show(5, truncate=False)

+-----------------+------------+------+
|Subject          |Relationship|Object|
+-----------------+------------+------+
|Ma'tan al-Sarra  |isPartOf    |Africa|
|Equatorial Guinea|isPartOf    |Africa|
|Niger            |isPartOf    |Africa|
|Morocco          |isPartOf    |Africa|
|Algeria          |isPartOf    |Africa|
+-----------------+------------+------+
only showing top 5 rows



In [None]:
append_to_csv(df_relationships)

In [24]:
load_and_display_graph()

+----------------------+------------+------+
|Subject               |Relationship|Object|
+----------------------+------------+------+
|Ma'tan al-Sarra       |isPartOf    |Africa|
|Equatorial Guinea     |isPartOf    |Africa|
|Niger                 |isPartOf    |Africa|
|Morocco               |isPartOf    |Africa|
|Algeria               |isPartOf    |Africa|
|Togo                  |isPartOf    |Africa|
|Libyan Arab Jamahiriya|isPartOf    |Africa|
|Liberia               |isPartOf    |Africa|
|Rwanda                |isPartOf    |Africa|
|Mozambique            |isPartOf    |Africa|
|Zimbabwe              |isPartOf    |Africa|
|Uganda                |isPartOf    |Africa|
|Eritrea               |isPartOf    |Africa|
|Congo                 |isPartOf    |Africa|
|Guinea                |isPartOf    |Africa|
|Zambia                |isPartOf    |Africa|
|Mayotte               |isPartOf    |Africa|
|Cameroon              |isPartOf    |Africa|
|Western Sahara        |isPartOf    |Africa|
|Malawi   

DataFrame[Subject: string, Relationship: string, Object: string]

24/04/18 20:01:11 ERROR StandaloneSchedulerBackend: Application has been killed. Reason: Master removed our application: KILLED
24/04/18 20:01:11 ERROR Inbox: Ignoring error
org.apache.spark.SparkException: Exiting due to error from cluster scheduler: Master removed our application: KILLED
	at org.apache.spark.errors.SparkCoreErrors$.clusterSchedulerError(SparkCoreErrors.scala:291)
	at org.apache.spark.scheduler.TaskSchedulerImpl.error(TaskSchedulerImpl.scala:981)
	at org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend.dead(StandaloneSchedulerBackend.scala:165)
	at org.apache.spark.deploy.client.StandaloneAppClient$ClientEndpoint.markDead(StandaloneAppClient.scala:263)
	at org.apache.spark.deploy.client.StandaloneAppClient$ClientEndpoint$$anonfun$receive$1.applyOrElse(StandaloneAppClient.scala:170)
	at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
	at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
	at org.apache.spark.rpc.netty.Inbox.proce

# **Add States**

In [14]:
df_countries = json_dataset_dataframes['CountryTerritories.geojson']
df_states = json_dataset_dataframes['States.geojson']

In [15]:
df_states.show(n=1, truncate=False)

+-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

In [17]:
df_countries = df_countries.withColumnRenamed("geometry", "country_geometry")
df_states = df_states.withColumnRenamed("geometry", "state_geometry")

df_countries = df_countries.withColumnRenamed("properties", "country_properties")
df_states = df_states.withColumnRenamed("properties", "state_properties")

# Now perform the spatial join using the renamed columns
df_state_partOf_country = df_countries.crossJoin(df_states).where(
    F.expr("ST_Contains(country_geometry, ST_Centroid(state_geometry))")
)

# Show results, making sure to differentiate which properties you're referring to
df_state_partOf_country.select("country_properties", "state_properties").show(n=1, truncate=False)

[Stage 13:>                                                         (0 + 1) / 1]

+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|country_properties                                                                                                                                                                                                                                                                      |state_properties                                                                                                                                                      |
+---------------------------------------------------------------------------------------------------

                                                                                

In [24]:
df_relationships = df_state_partOf_country.select(
    col("state_properties.NAME").alias("Subject"),
    lit("isPartOf").alias("Relationship"),
    col("country_properties.name").alias("Object")
)

df_relationships.show(5, truncate=False)

[Stage 19:>                                                         (0 + 1) / 1]

+--------------+------------+------------------------+
|Subject       |Relationship|Object                  |
+--------------+------------+------------------------+
|Mississippi   |isPartOf    |United States of America|
|North Carolina|isPartOf    |United States of America|
|Oklahoma      |isPartOf    |United States of America|
|Virginia      |isPartOf    |United States of America|
|West Virginia |isPartOf    |United States of America|
+--------------+------------+------------------------+
only showing top 5 rows



                                                                                

In [25]:
append_to_csv(df_relationships)

                                                                                

# **Add Counties**

In [14]:
# Initialize a Hadoop file system 
fs = spark._jvm.org.apache.hadoop.fs.FileSystem.get(spark._jsc.hadoopConfiguration())

counties_dataframes = {}

# Directory containing the files
json_directory = "hdfs://columbus-oh.cs.colostate.edu:30785/geospatial/input/CountiesByState/"

# Define the schema for the GeoJSON data
geojsonSchema = "type string, crs string, totalFeatures long, features array<struct<type string, geometry string, properties map<string, string>>>"

json_files = get_files_recursive(json_directory)

# Load each JSON file into a DataFrame and store it in the dictionary
for file_path in json_files:
    file_name = file_path.split('/')[-1]
    
    # Print the file path
    print(f"Processing file: {file_name}")
    
    # Read the GeoJSON file using the defined schema using sedona into a spark dataframe
    df = spark.read.schema(geojsonSchema).json(file_path, multiLine=True)
    
    # Explode the features array to create a row for each feature and select the columns
    df = (df
        .select(F.explode("features").alias("features"))
        .select("features.*")
        # Use Sedona's ST_GeomFromGeoJSON function to convert the geometry string to a geometry object
        .withColumn("geometry", F.expr("ST_GeomFromGeoJSON(geometry)"))
        )
    
    counties_dataframes[file_name] = df

['hdfs://columbus-oh.cs.colostate.edu:30785/geospatial/input/CountiesByState/AlabamaCounties.geojson', 'hdfs://columbus-oh.cs.colostate.edu:30785/geospatial/input/CountiesByState/AlaskaCounties.geojson', 'hdfs://columbus-oh.cs.colostate.edu:30785/geospatial/input/CountiesByState/AmericanSamoaCounties.geojson', 'hdfs://columbus-oh.cs.colostate.edu:30785/geospatial/input/CountiesByState/ArizonaCounties.geojson', 'hdfs://columbus-oh.cs.colostate.edu:30785/geospatial/input/CountiesByState/ArkansasCounties.geojson', 'hdfs://columbus-oh.cs.colostate.edu:30785/geospatial/input/CountiesByState/CaliforniaCounties.geojson', 'hdfs://columbus-oh.cs.colostate.edu:30785/geospatial/input/CountiesByState/ColoradoCounties.geojson', 'hdfs://columbus-oh.cs.colostate.edu:30785/geospatial/input/CountiesByState/CommonwealthoftheNorthernMarianaIslandsCounties.geojson', 'hdfs://columbus-oh.cs.colostate.edu:30785/geospatial/input/CountiesByState/ConnecticutCounties.geojson', 'hdfs://columbus-oh.cs.colostate.ed

In [14]:
df_states = json_dataset_dataframes['States.geojson']
for state in counties_dataframes.keys():
    
    print(state)
    df_counties = counties_dataframes[state]

    df_states = df_states.withColumnRenamed("geometry", "state_geometry")
    df_counties = df_counties.withColumnRenamed("geometry", "county_geometry")

    df_states = df_states.withColumnRenamed("properties", "state_properties")
    df_counties = df_counties.withColumnRenamed("properties", "county_properties")

    df_counties_partOf_states = df_states.crossJoin(df_counties).where(
        F.expr("ST_Contains(state_geometry, ST_Centroid(county_geometry))")
    )

    df_counties_partOf_states.select("state_properties", "county_properties").show(n=1, truncate=False)

    df_relationships = df_counties_partOf_states.select(
        col("county_properties.NAME").alias("Subject"),
        lit("isPartOf").alias("Relationship"),
        col("state_properties.NAME").alias("Object")
    )

    df_relationships.show(3, truncate=False)

    append_to_csv(df_relationships)


AlabamaCounties.geojson


                                                                                

+------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|state_properties                                                                                                                                                  |county_properties                                                                                                                                                                                                                 

                                                                                

+-------+------------+-------+
|Subject|Relationship|Object |
+-------+------------+-------+
|Sumter |isPartOf    |Alabama|
|Dallas |isPartOf    |Alabama|
|Lee    |isPartOf    |Alabama|
+-------+------------+-------+
only showing top 3 rows



                                                                                

AlaskaCounties.geojson
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|state_properties                                                                                                                                                    |county_properties                                                                                                                                                                            

                                                                                

+-----------+------------+------+
|Subject    |Relationship|Object|
+-----------+------------+------+
|Bristol Bay|isPartOf    |Alaska|
|North Slope|isPartOf    |Alaska|
|Nome       |isPartOf    |Alaska|
+-----------+------------+------+
only showing top 3 rows



                                                                                

AmericanSamoaCounties.geojson
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|state_properties                                                                                                                                                      |county_properties                                                                                                                                                                   

                                                                                

+-------------+------------+--------------+
|Subject      |Relationship|Object        |
+-------------+------------+--------------+
|Eastern      |isPartOf    |American Samoa|
|Swains Island|isPartOf    |American Samoa|
|Western      |isPartOf    |American Samoa|
+-------------+------------+--------------+



                                                                                

ArizonaCounties.geojson


                                                                                

+------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|state_properties                                                                                                                                                  |county_properties                                                                                                                                                                                                           

                                                                                

+-------+------------+-------+
|Subject|Relationship|Object |
+-------+------------+-------+
|Cochise|isPartOf    |Arizona|
|La Paz |isPartOf    |Arizona|
|Gila   |isPartOf    |Arizona|
+-------+------------+-------+
only showing top 3 rows



                                                                                

ArkansasCounties.geojson
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|state_properties                                                                                                                                                   |county_properties                                                                                                                                                                                     

                                                                                

CaliforniaCounties.geojson
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|state_properties                                                                                                                                                      |county_properties                                                                                                                                                                    

                                                                                

ColoradoCounties.geojson
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|state_properties                                                                                                                                                   |county_properties                                                                                                                                                                                  

                                                                                

+--------+------------+--------+
|Subject |Relationship|Object  |
+--------+------------+--------+
|Hinsdale|isPartOf    |Colorado|
|Delta   |isPartOf    |Colorado|
|Gilpin  |isPartOf    |Colorado|
+--------+------------+--------+
only showing top 3 rows

CommonwealthoftheNorthernMarianaIslandsCounties.geojson
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|st

                                                                                

+---------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|state_properties                                                                                                                                                     |county_properties                                                                                                                                                                                                    

                                                                                

+----------+------------+--------+
|Subject   |Relationship|Object  |
+----------+------------+--------+
|Kent      |isPartOf    |Delaware|
|New Castle|isPartOf    |Delaware|
|Sussex    |isPartOf    |Delaware|
+----------+------------+--------+

DistrictofColumbiaCounties.geojson
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|state_properties                                                  

                                                                                

GeorgiaCounties.geojson


                                                                                

+------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|state_properties                                                                                                                                                  |county_properties                                                                                                                                                                                                                   

                                                                                

+--------+------------+-------+
|Subject |Relationship|Object |
+--------+------------+-------+
|Greene  |isPartOf    |Georgia|
|Pulaski |isPartOf    |Georgia|
|Columbia|isPartOf    |Georgia|
+--------+------------+-------+
only showing top 3 rows



                                                                                

GuamCounties.geojson


                                                                                

+-----------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|state_properties                                                                                                                                           |county_properties                                                                                                                                                                                                                                             

                                                                                

+----------+------------+------+
|Subject   |Relationship|Object|
+----------+------------+------+
|Latah     |isPartOf    |Idaho |
|Washington|isPartOf    |Idaho |
|Nez Perce |isPartOf    |Idaho |
+----------+------------+------+
only showing top 3 rows



                                                                                

IllinoisCounties.geojson
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|state_properties                                                                                                                                                   |county_properties                                                                                                                                                                                     

                                                                                

IndianaCounties.geojson
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|state_properties                                                                                                                                                 |county_properties                                                                                                                                                                                            

                                                                                

+---------+------------+-------+
|Subject  |Relationship|Object |
+---------+------------+-------+
|Martin   |isPartOf    |Indiana|
|Hendricks|isPartOf    |Indiana|
|Posey    |isPartOf    |Indiana|
+---------+------------+-------+
only showing top 3 rows



                                                                                

IowaCounties.geojson
+---------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|state_properties                                                                                                                                               |county_properties                                                                                                                                                                                                      

                                                                                

KansasCounties.geojson
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|state_properties                                                                                                                                                 |county_properties                                                                                                                                                                                           

                                                                                

+-------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|state_properties                                                                                                                                                   |county_properties                                                                                                                                                                                                        

                                                                                

+----------+------------+--------+
|Subject   |Relationship|Object  |
+----------+------------+--------+
|Cumberland|isPartOf    |Kentucky|
|Johnson   |isPartOf    |Kentucky|
|Larue     |isPartOf    |Kentucky|
+----------+------------+--------+
only showing top 3 rows



                                                                                

LouisianaCounties.geojson
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|state_properties                                                                                                                                                     |county_properties                                                                                                                                                                                 

                                                                                

MaineCounties.geojson
+----------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|state_properties                                                                                                                                                |county_properties                                                                                                                                                                                           

                                                                                

MassachusettsCounties.geojson
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|state_properties                                                                                                                                                       |county_properties                                                                                                                                                                     

                                                                                

+---------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|state_properties                                                                                                                                                     |county_properties                                                                                                                                                                                                          

                                                                                

MinnesotaCounties.geojson


                                                                                

+---------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|state_properties                                                                                                                                                     |county_properties                                                                                                                                                                                                       

                                                                                

+-------+------------+---------+
|Subject|Relationship|Object   |
+-------+------------+---------+
|Jackson|isPartOf    |Minnesota|
|Scott  |isPartOf    |Minnesota|
|Wilkin |isPartOf    |Minnesota|
+-------+------------+---------+
only showing top 3 rows



                                                                                

MississippiCounties.geojson


                                                                                

+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|state_properties                                                                                                                                                      |county_properties                                                                                                                                                                                                   

                                                                                

MissouriCounties.geojson


                                                                                

+-------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|state_properties                                                                                                                                                   |county_properties                                                                                                                                                                                                             

                                                                                

MontanaCounties.geojson
+------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|state_properties                                                                                                                                                  |county_properties                                                                                                                                                                                          

                                                                                

NebraskaCounties.geojson
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|state_properties                                                                                                                                                   |county_properties                                                                                                                                                                                    

                                                                                

NevadaCounties.geojson
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|state_properties                                                                                                                                                 |county_properties                                                                                                                                                                                  

                                                                                

NewMexicoCounties.geojson
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|state_properties                                                                                                                                                    |county_properties                                                                                                                                                                                  

                                                                                

+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|state_properties                                                                                                                                                    |county_properties                                                                                                                                                                                                                  

                                                                                

NorthDakotaCounties.geojson
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|state_properties                                                                                                                                                       |county_properties                                                                                                                                                                         

                                                                                

+-------+------------+------------+
|Subject|Relationship|Object      |
+-------+------------+------------+
|Bowman |isPartOf    |North Dakota|
|Grant  |isPartOf    |North Dakota|
|Rolette|isPartOf    |North Dakota|
+-------+------------+------------+
only showing top 3 rows

OhioCounties.geojson
+----------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|state_properties                                                                                    

                                                                                

+--------+------------+------+
|Subject |Relationship|Object|
+--------+------------+------+
|Marion  |isPartOf    |Ohio  |
|Hancock |isPartOf    |Ohio  |
|Clermont|isPartOf    |Ohio  |
+--------+------------+------+
only showing top 3 rows



                                                                                

OklahomaCounties.geojson


                                                                                

+-------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|state_properties                                                                                                                                                   |county_properties                                                                                                                                                                                                           

                                                                                

SouthDakotaCounties.geojson
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|state_properties                                                                                                                                                       |county_properties                                                                                                                                                                          

                                                                                

TennesseeCounties.geojson


                                                                                

+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|state_properties                                                                                                                                                    |county_properties                                                                                                                                                                                                        

                                                                                

TexasCounties.geojson
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|state_properties                                                                                                                                                 |county_properties                                                                                                                                                                                             

                                                                                

UnitedStatesVirginIslandsCounties.geojson
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|state_properties                                                                                                                                                                    |county_properties                                                                                                            

                                                                                

+---------------+------------+--------+
|Subject        |Relationship|Object  |
+---------------+------------+--------+
|Nottoway       |isPartOf    |Virginia|
|Charlottesville|isPartOf    |Virginia|
|Lee            |isPartOf    |Virginia|
+---------------+------------+--------+
only showing top 3 rows



                                                                                

WashingtonCounties.geojson
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|state_properties                                                                                                                                                      |county_properties                                                                                                                                                                         

                                                                                

WisconsinCounties.geojson
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|state_properties                                                                                                                                                     |county_properties                                                                                                                                                                             

                                                                                

WyomingCounties.geojson


                                                                                

+------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|state_properties                                                                                                                                                  |county_properties                                                                                                                                                                                                                  

                                                                                

+-------+------------+-------+
|Subject|Relationship|Object |
+-------+------------+-------+
|Park   |isPartOf    |Wyoming|
|Platte |isPartOf    |Wyoming|
|Goshen |isPartOf    |Wyoming|
+-------+------------+-------+
only showing top 3 rows



# **Add Tracts**

In [11]:
# Initialize a Hadoop file system 
fs = spark._jvm.org.apache.hadoop.fs.FileSystem.get(spark._jsc.hadoopConfiguration())

tracts_dataframes = {}

# Directory containing the files
json_directory = "hdfs://columbus-oh.cs.colostate.edu:30785/geospatial/input/TractsByState/"

# Define the schema for the GeoJSON data
geojsonSchema = "type string, crs string, totalFeatures long, features array<struct<type string, geometry string, properties map<string, string>>>"

json_files = get_files_recursive(json_directory)

# Load each JSON file into a DataFrame and store it in the dictionary
for file_path in json_files:
    file_name = file_path.split('/')[-1]
    
    # Print the file path
    print(f"Processing file: {file_path}")
    
    # Read the GeoJSON file using the defined schema using sedona into a spark dataframe
    df = spark.read.schema(geojsonSchema).json(file_path, multiLine=True)
    
    # Explode the features array to create a row for each feature and select the columns
    df = (df
        .select(F.explode("features").alias("features"))
        .select("features.*")
        # Use Sedona's ST_GeomFromGeoJSON function to convert the geometry string to a geometry object
        .withColumn("geometry", F.expr("ST_GeomFromGeoJSON(geometry)"))
        )
    
    tracts_dataframes[file_name] = df

['hdfs://columbus-oh.cs.colostate.edu:30785/geospatial/input/TractsByState/AlabamaTracts.geojson', 'hdfs://columbus-oh.cs.colostate.edu:30785/geospatial/input/TractsByState/AlaskaTracts.geojson', 'hdfs://columbus-oh.cs.colostate.edu:30785/geospatial/input/TractsByState/AmericanSamoaTracts.geojson', 'hdfs://columbus-oh.cs.colostate.edu:30785/geospatial/input/TractsByState/ArizonaTracts.geojson', 'hdfs://columbus-oh.cs.colostate.edu:30785/geospatial/input/TractsByState/ArkansasTracts.geojson', 'hdfs://columbus-oh.cs.colostate.edu:30785/geospatial/input/TractsByState/CaliforniaTracts.geojson', 'hdfs://columbus-oh.cs.colostate.edu:30785/geospatial/input/TractsByState/ColoradoTracts.geojson', 'hdfs://columbus-oh.cs.colostate.edu:30785/geospatial/input/TractsByState/CommonwealthoftheNorthernMarianaIslandsTracts.geojson', 'hdfs://columbus-oh.cs.colostate.edu:30785/geospatial/input/TractsByState/ConnecticutTracts.geojson', 'hdfs://columbus-oh.cs.colostate.edu:30785/geospatial/input/TractsBySta

In [15]:
tracts_dataframes['WyomingTracts.geojson'].show(n=1, truncate=False)
counties_dataframes['AlabamaCounties.geojson'].show(n=1, truncate=False)

                                                                                

+-------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|type   |geometry                                                                                                                                                                                                                                                                                                                      

In [16]:
for counties_file_name in counties_dataframes.keys():
    for tracts_file_name in tracts_dataframes.keys():  
        counties = counties_file_name.replace('Counties.geojson', '')
        tracts = tracts_file_name.replace('Tracts.geojson', '')
        if counties in tracts:
            print(counties)
            df_counties = counties_dataframes[counties_file_name]
            df_tracts = tracts_dataframes[tracts_file_name]
            
            df_counties = df_counties.withColumnRenamed("geometry", "counties_geometry")
            df_tracts = df_tracts.withColumnRenamed("geometry", "tracts_geometry")

            df_counties = df_counties.withColumnRenamed("properties", "counties_properties")
            df_tracts = df_tracts.withColumnRenamed("properties", "tracts_properties")

            df_tracts_partOf_counties = df_counties.crossJoin(df_tracts).where(
                F.expr("ST_Contains(counties_geometry, ST_Centroid(tracts_geometry))")
            )

            df_relationships = df_tracts_partOf_counties.select(
                col("tracts_properties.NAME").alias("Subject"),
                lit("isPartOf").alias("Relationship"),
                col("counties_properties.NAME").alias("Object")
            )
            
            append_to_csv(df_relationships)


Alabama


                                                                                

+-------+------------+------+
|Subject|Relationship|Object|
+-------+------------+------+
|114    |isPartOf    |Sumter|
|113.02 |isPartOf    |Sumter|
|115    |isPartOf    |Sumter|
+-------+------------+------+
only showing top 3 rows



                                                                                

Alaska


                                                                                

+-------+------------+---------+
|Subject|Relationship|Object   |
+-------+------------+---------+
|10     |isPartOf    |Anchorage|
|17.01  |isPartOf    |Anchorage|
|20     |isPartOf    |Anchorage|
+-------+------------+---------+
only showing top 3 rows



                                                                                

AmericanSamoa
+-------+------------+-------------+
|Subject|Relationship|Object       |
+-------+------------+-------------+
|9518   |isPartOf    |Manu'a       |
|9512.02|isPartOf    |Western      |
|9520   |isPartOf    |Swains Island|
+-------+------------+-------------+
only showing top 3 rows



                                                                                

Arizona
+-------+------------+-------+
|Subject|Relationship|Object |
+-------+------------+-------+
|16.02  |isPartOf    |Cochise|
|17.01  |isPartOf    |Cochise|
|1.01   |isPartOf    |Cochise|
+-------+------------+-------+
only showing top 3 rows



                                                                                

Arkansas


                                                                                

+-------+------------+--------+
|Subject|Relationship|Object  |
+-------+------------+--------+
|9504.01|isPartOf    |Columbia|
|9505   |isPartOf    |Columbia|
|9503.02|isPartOf    |Columbia|
+-------+------------+--------+
only showing top 3 rows



                                                                                

California


                                                                                

+-------+------------+--------+
|Subject|Relationship|Object  |
+-------+------------+--------+
|101.02 |isPartOf    |Humboldt|
|2      |isPartOf    |Humboldt|
|7      |isPartOf    |Humboldt|
+-------+------------+--------+
only showing top 3 rows



                                                                                

Colorado
+-------+------------+--------+
|Subject|Relationship|Object  |
+-------+------------+--------+
|9731   |isPartOf    |Hinsdale|
|9646   |isPartOf    |Delta   |
|9651   |isPartOf    |Delta   |
+-------+------------+--------+
only showing top 3 rows



                                                                                

CommonwealthoftheNorthernMarianaIslands
+-------+------------+------+
|Subject|Relationship|Object|
+-------+------------+------+
|9      |isPartOf    |Saipan|
|11     |isPartOf    |Saipan|
|9501   |isPartOf    |Rota  |
+-------+------------+------+
only showing top 3 rows

Connecticut
+-------+------------+--------+
|Subject|Relationship|Object  |
+-------+------------+--------+
|5042   |isPartOf    |Hartford|
|5247   |isPartOf    |Hartford|
|4207   |isPartOf    |Hartford|
+-------+------------+--------+
only showing top 3 rows

Delaware
+-------+------------+----------+
|Subject|Relationship|Object    |
+-------+------------+----------+
|103    |isPartOf    |New Castle|
|115    |isPartOf    |New Castle|
|139.03 |isPartOf    |New Castle|
+-------+------------+----------+
only showing top 3 rows

DistrictofColumbia
+-------+------------+--------------------+
|Subject|Relationship|Object              |
+-------+------------+--------------------+
|2.01   |isPartOf    |District of Columbi

                                                                                

Georgia
+-------+------------+------+
|Subject|Relationship|Object|
+-------+------------+------+
|9501   |isPartOf    |Greene|
|9502   |isPartOf    |Greene|
|9504   |isPartOf    |Greene|
+-------+------------+------+
only showing top 3 rows



                                                                                

Guam
+-------+------------+------+
|Subject|Relationship|Object|
+-------+------------+------+
|9530   |isPartOf    |Guam  |
|9803   |isPartOf    |Guam  |
|9510   |isPartOf    |Guam  |
+-------+------------+------+
only showing top 3 rows

Hawaii
+-------+------------+--------+
|Subject|Relationship|Object  |
+-------+------------+--------+
|20.03  |isPartOf    |Honolulu|
|111.05 |isPartOf    |Honolulu|
|106.01 |isPartOf    |Honolulu|
+-------+------------+--------+
only showing top 3 rows

Idaho


                                                                                

+-------+------------+------+
|Subject|Relationship|Object|
+-------+------------+------+
|56     |isPartOf    |Latah |
|55     |isPartOf    |Latah |
|57     |isPartOf    |Latah |
+-------+------------+------+
only showing top 3 rows



                                                                                

Illinois


                                                                                

+-------+------------+--------+
|Subject|Relationship|Object  |
+-------+------------+--------+
|9507   |isPartOf    |Iroquois|
|9502   |isPartOf    |Iroquois|
|9505   |isPartOf    |Iroquois|
+-------+------------+--------+
only showing top 3 rows



                                                                                

Indiana
+-------+------------+------+
|Subject|Relationship|Object|
+-------+------------+------+
|9502   |isPartOf    |Martin|
|9501   |isPartOf    |Martin|
|9503   |isPartOf    |Martin|
+-------+------------+------+
only showing top 3 rows



                                                                                

Iowa
+-------+------------+-------+
|Subject|Relationship|Object |
+-------+------------+-------+
|2701   |isPartOf    |Hancock|
|2702   |isPartOf    |Hancock|
|2704   |isPartOf    |Hancock|
+-------+------------+-------+
only showing top 3 rows



                                                                                

Kansas
+-------+------------+--------+
|Subject|Relationship|Object  |
+-------+------------+--------+
|9583   |isPartOf    |Cherokee|
|9581   |isPartOf    |Cherokee|
|9585   |isPartOf    |Cherokee|
+-------+------------+--------+
only showing top 3 rows



                                                                                

Kentucky


                                                                                

+-------+------------+----------+
|Subject|Relationship|Object    |
+-------+------------+----------+
|9502   |isPartOf    |Cumberland|
|9501   |isPartOf    |Cumberland|
|9601   |isPartOf    |Johnson   |
+-------+------------+----------+
only showing top 3 rows



                                                                                

Louisiana
+-------+------------+------+
|Subject|Relationship|Object|
+-------+------------+------+
|211    |isPartOf    |Caddo |
|223    |isPartOf    |Caddo |
|236    |isPartOf    |Caddo |
+-------+------------+------+
only showing top 3 rows



                                                                                

Maine
+-------+------------+------+
|Subject|Relationship|Object|
+-------+------------+------+
|9711   |isPartOf    |Knox  |
|9703.01|isPartOf    |Knox  |
|9706   |isPartOf    |Knox  |
+-------+------------+------+
only showing top 3 rows



                                                                                

Maryland
+-------+------------+----------+
|Subject|Relationship|Object    |
+-------+------------+----------+
|4      |isPartOf    |Washington|
|111    |isPartOf    |Washington|
|112.03 |isPartOf    |Washington|
+-------+------------+----------+
only showing top 3 rows



                                                                                

Massachusetts
+-------+------------+----------+
|Subject|Relationship|Object    |
+-------+------------+----------+
|113    |isPartOf    |Barnstable|
|126.01 |isPartOf    |Barnstable|
|135    |isPartOf    |Barnstable|
+-------+------------+----------+
only showing top 3 rows



                                                                                

Michigan
+-------+------------+-------+
|Subject|Relationship|Object |
+-------+------------+-------+
|8      |isPartOf    |Clare  |
|613.02 |isPartOf    |Lenawee|
|1612   |isPartOf    |Oakland|
+-------+------------+-------+
only showing top 3 rows



                                                                                

Minnesota
+-------+------------+-------+
|Subject|Relationship|Object |
+-------+------------+-------+
|4804   |isPartOf    |Jackson|
|4802   |isPartOf    |Jackson|
|4803   |isPartOf    |Jackson|
+-------+------------+-------+
only showing top 3 rows



                                                                                

Mississippi
+-------+------------+-------+
|Subject|Relationship|Object |
+-------+------------+-------+
|9501   |isPartOf    |Carroll|
|9502.01|isPartOf    |Carroll|
|9502.02|isPartOf    |Carroll|
+-------+------------+-------+
only showing top 3 rows



                                                                                

Missouri
+-------+------------+------+
|Subject|Relationship|Object|
+-------+------------+------+
|9609   |isPartOf    |Marion|
|9606   |isPartOf    |Marion|
|9605   |isPartOf    |Marion|
+-------+------------+------+
only showing top 3 rows



                                                                                

Montana
+-------+------------+-------+
|Subject|Relationship|Object |
+-------+------------+-------+
|9403   |isPartOf    |Sanders|
|2.02   |isPartOf    |Sanders|
|2.01   |isPartOf    |Sanders|
+-------+------------+-------+
only showing top 3 rows



                                                                                

Nebraska
+-------+------------+-------+
|Subject|Relationship|Object |
+-------+------------+-------+
|50     |isPartOf    |Douglas|
|52     |isPartOf    |Douglas|
|54     |isPartOf    |Douglas|
+-------+------------+-------+
only showing top 3 rows



                                                                                

Nevada
+-------+------------+---------+
|Subject|Relationship|Object   |
+-------+------------+---------+
|9501   |isPartOf    |Churchill|
|9507   |isPartOf    |Churchill|
|9503.02|isPartOf    |Churchill|
+-------+------------+---------+
only showing top 3 rows

NewHampshire
+-------+------------+-------+
|Subject|Relationship|Object |
+-------+------------+-------+
|9561.01|isPartOf    |Carroll|
|9556.02|isPartOf    |Carroll|
|9559.02|isPartOf    |Carroll|
+-------+------------+-------+
only showing top 3 rows

NewJersey
+-------+------------+------+
|Subject|Relationship|Object|
+-------+------------+------+
|168    |isPartOf    |Essex |
|171    |isPartOf    |Essex |
|1      |isPartOf    |Essex |
+-------+------------+------+
only showing top 3 rows



                                                                                

NewMexico
+-------+------------+------+
|Subject|Relationship|Object|
+-------+------------+------+
|9507   |isPartOf    |Colfax|
|9506   |isPartOf    |Colfax|
|9505   |isPartOf    |Colfax|
+-------+------------+------+
only showing top 3 rows

NewYork
+-------+------------+-------+
|Subject|Relationship|Object |
+-------+------------+-------+
|108    |isPartOf    |Chemung|
|1      |isPartOf    |Chemung|
|3      |isPartOf    |Chemung|
+-------+------------+-------+
only showing top 3 rows



                                                                                

NorthCarolina


                                                                                

+-------+------------+------+
|Subject|Relationship|Object|
+-------+------------+------+
|9203.01|isPartOf    |Polk  |
|9203.03|isPartOf    |Polk  |
|9201.04|isPartOf    |Polk  |
+-------+------------+------+
only showing top 3 rows



                                                                                

NorthDakota
+-------+------------+------+
|Subject|Relationship|Object|
+-------+------------+------+
|9652   |isPartOf    |Bowman|
|9653   |isPartOf    |Bowman|
|9659   |isPartOf    |Grant |
+-------+------------+------+
only showing top 3 rows

Ohio
+-------+------------+------+
|Subject|Relationship|Object|
+-------+------------+------+
|11     |isPartOf    |Marion|
|1      |isPartOf    |Marion|
|2      |isPartOf    |Marion|
+-------+------------+------+
only showing top 3 rows



                                                                                

Oklahoma
+-------+------------+--------+
|Subject|Relationship|Object  |
+-------+------------+--------+
|7797   |isPartOf    |McIntosh|
|7802   |isPartOf    |McIntosh|
|7803.02|isPartOf    |McIntosh|
+-------+------------+--------+
only showing top 3 rows



                                                                                

Oregon
+-------+------------+-------+
|Subject|Relationship|Object |
+-------+------------+-------+
|9512   |isPartOf    |Clatsop|
|9506   |isPartOf    |Clatsop|
|9504   |isPartOf    |Clatsop|
+-------+------------+-------+
only showing top 3 rows



                                                                                

Pennsylvania
+-------+------------+------+
|Subject|Relationship|Object|
+-------+------------+------+
|1004   |isPartOf    |Blair |
|1003   |isPartOf    |Blair |
|1008   |isPartOf    |Blair |
+-------+------------+------+
only showing top 3 rows



                                                                                

PuertoRico
+-------+------------+------+
|Subject|Relationship|Object|
+-------+------------+------+
|5003.04|isPartOf    |Juncos|
|5002   |isPartOf    |Juncos|
|5004.01|isPartOf    |Juncos|
+-------+------------+------+
only showing top 3 rows



                                                                                

RhodeIsland
+-------+------------+-------+
|Subject|Relationship|Object |
+-------+------------+-------+
|216    |isPartOf    |Kent   |
|201.01 |isPartOf    |Kent   |
|408    |isPartOf    |Newport|
+-------+------------+-------+
only showing top 3 rows

SouthCarolina
+-------+------------+---------+
|Subject|Relationship|Object   |
+-------+------------+---------+
|9203   |isPartOf    |McCormick|
|9202   |isPartOf    |McCormick|
|9201   |isPartOf    |McCormick|
+-------+------------+---------+
only showing top 3 rows



                                                                                

SouthDakota


                                                                                

+-------+------------+-------+
|Subject|Relationship|Object |
+-------+------------+-------+
|9536   |isPartOf    |Deuel  |
|9537   |isPartOf    |Deuel  |
|9696   |isPartOf    |Douglas|
+-------+------------+-------+
only showing top 3 rows



                                                                                

Tennessee
+-------+------------+--------+
|Subject|Relationship|Object  |
+-------+------------+--------+
|9502   |isPartOf    |Campbell|
|9506.01|isPartOf    |Campbell|
|9506.02|isPartOf    |Campbell|
+-------+------------+--------+
only showing top 3 rows



                                                                                

Texas


                                                                                

+-------+------------+--------+
|Subject|Relationship|Object  |
+-------+------------+--------+
|9502   |isPartOf    |San Saba|
|9501   |isPartOf    |San Saba|
|208.04 |isPartOf    |Hidalgo |
+-------+------------+--------+
only showing top 3 rows



                                                                                

UnitedStatesVirginIslands
+-------+------------+----------+
|Subject|Relationship|Object    |
+-------+------------+----------+
|9611   |isPartOf    |St. Thomas|
|9610   |isPartOf    |St. Thomas|
|9713   |isPartOf    |St. Croix |
+-------+------------+----------+
only showing top 3 rows

Utah
+-------+------------+-------+
|Subject|Relationship|Object |
+-------+------------+-------+
|1301   |isPartOf    |Kane   |
|1302   |isPartOf    |Kane   |
|9601   |isPartOf    |Daggett|
+-------+------------+-------+
only showing top 3 rows



                                                                                

Vermont
+-------+------------+----------+
|Subject|Relationship|Object    |
+-------+------------+----------+
|201    |isPartOf    |Grand Isle|
|202    |isPartOf    |Grand Isle|
|9642   |isPartOf    |Rutland   |
+-------+------------+----------+
only showing top 3 rows

Virginia
+-------+------------+--------+
|Subject|Relationship|Object  |
+-------+------------+--------+
|3      |isPartOf    |Nottoway|
|1.01   |isPartOf    |Nottoway|
|2      |isPartOf    |Nottoway|
+-------+------------+--------+
only showing top 3 rows



                                                                                

Virginia


                                                                                

+-------+------------+------+
|Subject|Relationship|Object|
+-------+------------+------+
+-------+------------+------+



                                                                                

Washington
+-------+------------+--------+
|Subject|Relationship|Object  |
+-------+------------+--------+
|9801   |isPartOf    |Franklin|
|204.02 |isPartOf    |Franklin|
|202.01 |isPartOf    |Franklin|
+-------+------------+--------+
only showing top 3 rows



                                                                                

WestVirginia
+-------+------------+-------+
|Subject|Relationship|Object |
+-------+------------+-------+
|9657   |isPartOf    |Barbour|
|9655   |isPartOf    |Barbour|
|9656   |isPartOf    |Barbour|
+-------+------------+-------+
only showing top 3 rows



                                                                                

Wisconsin
+-------+------------+------+
|Subject|Relationship|Object|
+-------+------------+------+
|1013.01|isPartOf    |Oconto|
|1006   |isPartOf    |Oconto|
|1007   |isPartOf    |Oconto|
+-------+------------+------+
only showing top 3 rows



                                                                                

Wyoming
+-------+------------+------+
|Subject|Relationship|Object|
+-------+------------+------+
|9651   |isPartOf    |Park  |
|9653.02|isPartOf    |Park  |
|9653.01|isPartOf    |Park  |
+-------+------------+------+
only showing top 3 rows



# **Add Zipcodes**

In [12]:
df_zipcodes = json_dataset_dataframes['Zipcodes.geojson']

for tracts_file_name in tracts_dataframes.keys():
    df_tracts = tracts_dataframes[tracts_file_name]
    df_zipcodes = df_zipcodes.withColumnRenamed("geometry", "zipcodes_geometry")
    df_tracts = df_tracts.withColumnRenamed("geometry", "tracts_geometry")

    df_zipcodes = df_zipcodes.withColumnRenamed("properties", "zipcodes_properties")
    df_tracts = df_tracts.withColumnRenamed("properties", "tracts_properties")

    # Now perform the spatial join using the renamed columns
    df_tracts_zipcodes = df_zipcodes.crossJoin(df_tracts).where(
        F.expr("ST_Contains(zipcodes_geometry, ST_Centroid(tracts_geometry))")
    )

    # Show results, making sure to differentiate which properties you're referring to
    df_tracts_zipcodes.select("zipcodes_properties", "tracts_properties").show(1)

                                                                                

+--------------------+--------------------+
| zipcodes_properties|   tracts_properties|
+--------------------+--------------------+
|{ZCTA5CE10 -> 360...|{STATEFP -> 01, C...|
+--------------------+--------------------+
only showing top 1 row



                                                                                

+--------------------+--------------------+
| zipcodes_properties|   tracts_properties|
+--------------------+--------------------+
|{ZCTA5CE10 -> 995...|{STATEFP -> 02, C...|
+--------------------+--------------------+
only showing top 1 row



                                                                                

+--------------------+--------------------+
| zipcodes_properties|   tracts_properties|
+--------------------+--------------------+
|{ZCTA5CE10 -> 967...|{STATEFP -> 60, C...|
+--------------------+--------------------+
only showing top 1 row



                                                                                

+--------------------+--------------------+
| zipcodes_properties|   tracts_properties|
+--------------------+--------------------+
|{ZCTA5CE10 -> 852...|{STATEFP -> 04, C...|
+--------------------+--------------------+
only showing top 1 row



                                                                                

+--------------------+--------------------+
| zipcodes_properties|   tracts_properties|
+--------------------+--------------------+
|{ZCTA5CE10 -> 716...|{STATEFP -> 05, C...|
+--------------------+--------------------+
only showing top 1 row



                                                                                

+--------------------+--------------------+
| zipcodes_properties|   tracts_properties|
+--------------------+--------------------+
|{ZCTA5CE10 -> 933...|{STATEFP -> 06, C...|
+--------------------+--------------------+
only showing top 1 row



                                                                                

+--------------------+--------------------+
| zipcodes_properties|   tracts_properties|
+--------------------+--------------------+
|{ZCTA5CE10 -> 815...|{STATEFP -> 08, C...|
+--------------------+--------------------+
only showing top 1 row



                                                                                

+--------------------+--------------------+
| zipcodes_properties|   tracts_properties|
+--------------------+--------------------+
|{ZCTA5CE10 -> 969...|{STATEFP -> 69, C...|
+--------------------+--------------------+
only showing top 1 row



                                                                                

+--------------------+--------------------+
| zipcodes_properties|   tracts_properties|
+--------------------+--------------------+
|{ZCTA5CE10 -> 064...|{STATEFP -> 09, C...|
+--------------------+--------------------+
only showing top 1 row



                                                                                

+--------------------+--------------------+
| zipcodes_properties|   tracts_properties|
+--------------------+--------------------+
|{ZCTA5CE10 -> 199...|{STATEFP -> 10, C...|
+--------------------+--------------------+
only showing top 1 row



                                                                                

+--------------------+--------------------+
| zipcodes_properties|   tracts_properties|
+--------------------+--------------------+
|{ZCTA5CE10 -> 200...|{STATEFP -> 11, C...|
+--------------------+--------------------+
only showing top 1 row



                                                                                

+--------------------+--------------------+
| zipcodes_properties|   tracts_properties|
+--------------------+--------------------+
|{ZCTA5CE10 -> 342...|{STATEFP -> 12, C...|
+--------------------+--------------------+
only showing top 1 row



                                                                                

+--------------------+--------------------+
| zipcodes_properties|   tracts_properties|
+--------------------+--------------------+
|{ZCTA5CE10 -> 304...|{STATEFP -> 13, C...|
+--------------------+--------------------+
only showing top 1 row



                                                                                

+--------------------+--------------------+
| zipcodes_properties|   tracts_properties|
+--------------------+--------------------+
|{ZCTA5CE10 -> 969...|{STATEFP -> 66, C...|
+--------------------+--------------------+
only showing top 1 row



                                                                                

+--------------------+--------------------+
| zipcodes_properties|   tracts_properties|
+--------------------+--------------------+
|{ZCTA5CE10 -> 967...|{STATEFP -> 15, C...|
+--------------------+--------------------+
only showing top 1 row



                                                                                

+--------------------+--------------------+
| zipcodes_properties|   tracts_properties|
+--------------------+--------------------+
|{ZCTA5CE10 -> 838...|{STATEFP -> 16, C...|
+--------------------+--------------------+
only showing top 1 row



                                                                                

+--------------------+--------------------+
| zipcodes_properties|   tracts_properties|
+--------------------+--------------------+
|{ZCTA5CE10 -> 600...|{STATEFP -> 17, C...|
+--------------------+--------------------+
only showing top 1 row



                                                                                

+--------------------+--------------------+
| zipcodes_properties|   tracts_properties|
+--------------------+--------------------+
|{ZCTA5CE10 -> 477...|{STATEFP -> 18, C...|
+--------------------+--------------------+
only showing top 1 row



                                                                                

+--------------------+--------------------+
| zipcodes_properties|   tracts_properties|
+--------------------+--------------------+
|{ZCTA5CE10 -> 505...|{STATEFP -> 19, C...|
+--------------------+--------------------+
only showing top 1 row



                                                                                

+--------------------+--------------------+
| zipcodes_properties|   tracts_properties|
+--------------------+--------------------+
|{ZCTA5CE10 -> 662...|{STATEFP -> 20, C...|
+--------------------+--------------------+
only showing top 1 row



                                                                                

+--------------------+--------------------+
| zipcodes_properties|   tracts_properties|
+--------------------+--------------------+
|{ZCTA5CE10 -> 410...|{STATEFP -> 21, C...|
+--------------------+--------------------+
only showing top 1 row



                                                                                

+--------------------+--------------------+
| zipcodes_properties|   tracts_properties|
+--------------------+--------------------+
|{ZCTA5CE10 -> 703...|{STATEFP -> 22, C...|
+--------------------+--------------------+
only showing top 1 row



                                                                                

+--------------------+--------------------+
| zipcodes_properties|   tracts_properties|
+--------------------+--------------------+
|{ZCTA5CE10 -> 047...|{STATEFP -> 23, C...|
+--------------------+--------------------+
only showing top 1 row



                                                                                

+--------------------+--------------------+
| zipcodes_properties|   tracts_properties|
+--------------------+--------------------+
|{ZCTA5CE10 -> 218...|{STATEFP -> 24, C...|
+--------------------+--------------------+
only showing top 1 row



                                                                                

+--------------------+--------------------+
| zipcodes_properties|   tracts_properties|
+--------------------+--------------------+
|{ZCTA5CE10 -> 024...|{STATEFP -> 25, C...|
+--------------------+--------------------+
only showing top 1 row



                                                                                

+--------------------+--------------------+
| zipcodes_properties|   tracts_properties|
+--------------------+--------------------+
|{ZCTA5CE10 -> 498...|{STATEFP -> 26, C...|
+--------------------+--------------------+
only showing top 1 row



                                                                                

+--------------------+--------------------+
| zipcodes_properties|   tracts_properties|
+--------------------+--------------------+
|{ZCTA5CE10 -> 564...|{STATEFP -> 27, C...|
+--------------------+--------------------+
only showing top 1 row



                                                                                

+--------------------+--------------------+
| zipcodes_properties|   tracts_properties|
+--------------------+--------------------+
|{ZCTA5CE10 -> 391...|{STATEFP -> 28, C...|
+--------------------+--------------------+
only showing top 1 row



                                                                                

+--------------------+--------------------+
| zipcodes_properties|   tracts_properties|
+--------------------+--------------------+
|{ZCTA5CE10 -> 641...|{STATEFP -> 29, C...|
+--------------------+--------------------+
only showing top 1 row



                                                                                

+--------------------+--------------------+
| zipcodes_properties|   tracts_properties|
+--------------------+--------------------+
|{ZCTA5CE10 -> 598...|{STATEFP -> 30, C...|
+--------------------+--------------------+
only showing top 1 row



                                                                                

+--------------------+--------------------+
| zipcodes_properties|   tracts_properties|
+--------------------+--------------------+
|{ZCTA5CE10 -> 693...|{STATEFP -> 31, C...|
+--------------------+--------------------+
only showing top 1 row



                                                                                

+--------------------+--------------------+
| zipcodes_properties|   tracts_properties|
+--------------------+--------------------+
|{ZCTA5CE10 -> 891...|{STATEFP -> 32, C...|
+--------------------+--------------------+
only showing top 1 row



                                                                                

+--------------------+--------------------+
| zipcodes_properties|   tracts_properties|
+--------------------+--------------------+
|{ZCTA5CE10 -> 030...|{STATEFP -> 33, C...|
+--------------------+--------------------+
only showing top 1 row



                                                                                

+--------------------+--------------------+
| zipcodes_properties|   tracts_properties|
+--------------------+--------------------+
|{ZCTA5CE10 -> 076...|{STATEFP -> 34, C...|
+--------------------+--------------------+
only showing top 1 row



                                                                                

+--------------------+--------------------+
| zipcodes_properties|   tracts_properties|
+--------------------+--------------------+
|{ZCTA5CE10 -> 880...|{STATEFP -> 35, C...|
+--------------------+--------------------+
only showing top 1 row



                                                                                

+--------------------+--------------------+
| zipcodes_properties|   tracts_properties|
+--------------------+--------------------+
|{ZCTA5CE10 -> 115...|{STATEFP -> 36, C...|
+--------------------+--------------------+
only showing top 1 row



                                                                                

+--------------------+--------------------+
| zipcodes_properties|   tracts_properties|
+--------------------+--------------------+
|{ZCTA5CE10 -> 284...|{STATEFP -> 37, C...|
+--------------------+--------------------+
only showing top 1 row



                                                                                

+--------------------+--------------------+
| zipcodes_properties|   tracts_properties|
+--------------------+--------------------+
|{ZCTA5CE10 -> 588...|{STATEFP -> 38, C...|
+--------------------+--------------------+
only showing top 1 row



                                                                                

+--------------------+--------------------+
| zipcodes_properties|   tracts_properties|
+--------------------+--------------------+
|{ZCTA5CE10 -> 441...|{STATEFP -> 39, C...|
+--------------------+--------------------+
only showing top 1 row



                                                                                

+--------------------+--------------------+
| zipcodes_properties|   tracts_properties|
+--------------------+--------------------+
|{ZCTA5CE10 -> 735...|{STATEFP -> 40, C...|
+--------------------+--------------------+
only showing top 1 row



                                                                                

+--------------------+--------------------+
| zipcodes_properties|   tracts_properties|
+--------------------+--------------------+
|{ZCTA5CE10 -> 973...|{STATEFP -> 41, C...|
+--------------------+--------------------+
only showing top 1 row



                                                                                

+--------------------+--------------------+
| zipcodes_properties|   tracts_properties|
+--------------------+--------------------+
|{ZCTA5CE10 -> 191...|{STATEFP -> 42, C...|
+--------------------+--------------------+
only showing top 1 row



                                                                                

+--------------------+--------------------+
| zipcodes_properties|   tracts_properties|
+--------------------+--------------------+
|{ZCTA5CE10 -> 009...|{STATEFP -> 72, C...|
+--------------------+--------------------+
only showing top 1 row



                                                                                

+--------------------+--------------------+
| zipcodes_properties|   tracts_properties|
+--------------------+--------------------+
|{ZCTA5CE10 -> 029...|{STATEFP -> 44, C...|
+--------------------+--------------------+
only showing top 1 row



                                                                                

+--------------------+--------------------+
| zipcodes_properties|   tracts_properties|
+--------------------+--------------------+
|{ZCTA5CE10 -> 294...|{STATEFP -> 45, C...|
+--------------------+--------------------+
only showing top 1 row



                                                                                

+--------------------+--------------------+
| zipcodes_properties|   tracts_properties|
+--------------------+--------------------+
|{ZCTA5CE10 -> 577...|{STATEFP -> 46, C...|
+--------------------+--------------------+
only showing top 1 row



                                                                                

+--------------------+--------------------+
| zipcodes_properties|   tracts_properties|
+--------------------+--------------------+
|{ZCTA5CE10 -> 370...|{STATEFP -> 47, C...|
+--------------------+--------------------+
only showing top 1 row



                                                                                

+--------------------+--------------------+
| zipcodes_properties|   tracts_properties|
+--------------------+--------------------+
|{ZCTA5CE10 -> 784...|{STATEFP -> 48, C...|
+--------------------+--------------------+
only showing top 1 row



                                                                                

+--------------------+--------------------+
| zipcodes_properties|   tracts_properties|
+--------------------+--------------------+
|{ZCTA5CE10 -> 008...|{STATEFP -> 78, C...|
+--------------------+--------------------+
only showing top 1 row



                                                                                

+--------------------+--------------------+
| zipcodes_properties|   tracts_properties|
+--------------------+--------------------+
|{ZCTA5CE10 -> 846...|{STATEFP -> 49, C...|
+--------------------+--------------------+
only showing top 1 row



                                                                                

+--------------------+--------------------+
| zipcodes_properties|   tracts_properties|
+--------------------+--------------------+
|{ZCTA5CE10 -> 054...|{STATEFP -> 50, C...|
+--------------------+--------------------+
only showing top 1 row



                                                                                

+--------------------+--------------------+
| zipcodes_properties|   tracts_properties|
+--------------------+--------------------+
|{ZCTA5CE10 -> 241...|{STATEFP -> 51, C...|
+--------------------+--------------------+
only showing top 1 row



                                                                                

+--------------------+--------------------+
| zipcodes_properties|   tracts_properties|
+--------------------+--------------------+
|{ZCTA5CE10 -> 983...|{STATEFP -> 53, C...|
+--------------------+--------------------+
only showing top 1 row



                                                                                

+--------------------+--------------------+
| zipcodes_properties|   tracts_properties|
+--------------------+--------------------+
|{ZCTA5CE10 -> 254...|{STATEFP -> 54, C...|
+--------------------+--------------------+
only showing top 1 row



                                                                                

+--------------------+--------------------+
| zipcodes_properties|   tracts_properties|
+--------------------+--------------------+
|{ZCTA5CE10 -> 545...|{STATEFP -> 55, C...|
+--------------------+--------------------+
only showing top 1 row



[Stage 115:>                                                        (0 + 1) / 1]

+--------------------+--------------------+
| zipcodes_properties|   tracts_properties|
+--------------------+--------------------+
|{ZCTA5CE10 -> 822...|{STATEFP -> 56, C...|
+--------------------+--------------------+
only showing top 1 row



                                                                                

# **Add Blocks**

In [12]:
# Initialize a Hadoop file system 
fs = spark._jvm.org.apache.hadoop.fs.FileSystem.get(spark._jsc.hadoopConfiguration())

blocks_dataframes = {}

# Directory containing the files
json_directory = "hdfs://columbus-oh.cs.colostate.edu:30785/geospatial/input/BlocksByState/"

# Define the schema for the GeoJSON data
geojsonSchema = "type string, crs string, totalFeatures long, features array<struct<type string, geometry string, properties map<string, string>>>"

json_files = get_files_recursive(json_directory)

# Load each JSON file into a DataFrame and store it in the dictionary
for file_path in json_files:
    file_name = file_path.split('/')[-1]
    
    # Print the file path
    print(f"Processing file: {file_path}")
    
    # Read the GeoJSON file using the defined schema using sedona into a spark dataframe
    df = spark.read.schema(geojsonSchema).json(file_path, multiLine=True)
    
    # Explode the features array to create a row for each feature and select the columns
    df = (df
        .select(F.explode("features").alias("features"))
        .select("features.*")
        # Use Sedona's ST_GeomFromGeoJSON function to convert the geometry string to a geometry object
        .withColumn("geometry", F.expr("ST_GeomFromGeoJSON(geometry)"))
        )
    
    blocks_dataframes[file_name] = df

['hdfs://columbus-oh.cs.colostate.edu:30785/geospatial/input/BlocksByState/Alabama.geojson', 'hdfs://columbus-oh.cs.colostate.edu:30785/geospatial/input/BlocksByState/Alaska.geojson', 'hdfs://columbus-oh.cs.colostate.edu:30785/geospatial/input/BlocksByState/AmericanSamoa.geojson', 'hdfs://columbus-oh.cs.colostate.edu:30785/geospatial/input/BlocksByState/Arizona.geojson', 'hdfs://columbus-oh.cs.colostate.edu:30785/geospatial/input/BlocksByState/Arkansas.geojson', 'hdfs://columbus-oh.cs.colostate.edu:30785/geospatial/input/BlocksByState/California.geojson', 'hdfs://columbus-oh.cs.colostate.edu:30785/geospatial/input/BlocksByState/Colorado.geojson', 'hdfs://columbus-oh.cs.colostate.edu:30785/geospatial/input/BlocksByState/CommonwealthoftheNorthernMarianaIslands.geojson', 'hdfs://columbus-oh.cs.colostate.edu:30785/geospatial/input/BlocksByState/Connecticut.geojson', 'hdfs://columbus-oh.cs.colostate.edu:30785/geospatial/input/BlocksByState/Delaware.geojson', 'hdfs://columbus-oh.cs.colostate

In [13]:
tracts_dataframes['AlabamaTracts.geojson'].show(n=1, truncate=False)
blocks_dataframes['Alabama.geojson'].show(n=1, truncate=False)

                                                                                

+-------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|type   |geometry                                                                                                                                                                                                 

[Stage 7:>                                                          (0 + 1) / 1]

+-------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|type   |geometry                                                                                                                                                                                                                                                                                                                                                                               

                                                                                

In [14]:
for tracts_file_name in tracts_dataframes.keys():
    for blocks_file_name in blocks_dataframes.keys():  
        blocks = blocks_file_name.replace('.geojson', '')
        tracts = tracts_file_name.replace('Tracts.geojson', '')
        if blocks in tracts:
            print(blocks)
            df_blocks = blocks_dataframes[blocks_file_name]
            df_tracts = tracts_dataframes[tracts_file_name]
            
            df_blocks = df_blocks.withColumnRenamed("geometry", "blocks_geometry")
            df_tracts = df_tracts.withColumnRenamed("geometry", "tracts_geometry")

            df_blocks = df_blocks.withColumnRenamed("properties", "blocks_properties")
            df_tracts = df_tracts.withColumnRenamed("properties", "tracts_properties")

            df_blocks_partOf_tracts = df_blocks.crossJoin(df_tracts).where(
                F.expr("ST_Contains(tracts_geometry, ST_Centroid(blocks_geometry))")
            )
           
            df_relationships = df_blocks_partOf_tracts.select(
                col("blocks_properties.NAME").alias("Subject"),
                lit("isPartOf").alias("Relationship"),
                col("tracts_properties.NAME").alias("Object")
            )
            
            append_to_csv(df_relationships)

Alabama


                                                                                

Alaska


                                                                                

AmericanSamoa


                                                                                

Arizona


                                                                                

Arkansas


                                                                                

California


                                                                                

Colorado


                                                                                

CommonwealthoftheNorthernMarianaIslands
Connecticut


                                                                                

Delaware
DistrictofColumbia


                                                                                

Florida


                                                                                

Georgia


                                                                                

Guam


                                                                                

Hawaii
Idaho


                                                                                

Illinois


                                                                                

Indiana


                                                                                

Iowa


                                                                                

Kansas


                                                                                

Kentucky


                                                                                

Louisiana


                                                                                

Maine


                                                                                

Maryland


                                                                                

Massachusetts


                                                                                

Michigan


                                                                                

Minnesota


                                                                                

Mississippi


                                                                                

Missouri


                                                                                

Montana


                                                                                

Nebraska


                                                                                

Nevada


                                                                                

NewHampshire
NewJersey


                                                                                

NewMexico


                                                                                

NewYork


                                                                                

NorthCarolina


                                                                                

NorthDakota
Ohio


                                                                                

Oklahoma


                                                                                

Oregon


                                                                                

Pennsylvania


                                                                                

PuertoRico


                                                                                

RhodeIsland
SouthCarolina


                                                                                

SouthDakota
Tennessee


                                                                                

Texas


                                                                                

UnitedStatesVirginIslands
Utah


                                                                                

Vermont
Virginia


                                                                                

Washington


                                                                                

Virginia


                                                                                

WestVirginia


                                                                                

Wisconsin


                                                                                

Wyoming


## More Comparitive Datasets

Combined wildland fire datasets for the United States and certain territories

Source: https://www.sciencebase.gov/catalog/item/61aa537dd34eb622f699df81

Temperature, Average Annual 1971 - 2000 for Wyoming at 1:250,000

Source: https://www.sciencebase.gov/catalog/item/4f4e479ee4b07f02db4927d7

World Urban Areas

Source: https://www.sciencebase.gov/catalog/item/537f6b14e4b021317a86f8dc

Land status in the Colorado Plateau coal assessment study area

Source: https://www.sciencebase.gov/catalog/item/60a6bbddd34ea221ce4ba94b

TIGER/Line Geodatabases

Source: https://www.census.gov/geographies/mapping-files/time-series/geo/tiger-geodatabase-file.html

PAD-US 2.1 Download data by State GeoJSON

Source: https://www.sciencebase.gov/catalog/item/6025985bd34eb12031138e21