# **Excavating New Datasets via GeoSpatial Insights from Datalakes**


Eric Martin <br>
Federico Larrieu <br>
CS 555 Distributed Systems <br>
Colorado State University <br>
Spring 2024 <br>

### Objective

    • Perform analytics over a large-scale temporal network

### Overview
In this assignment, I will perform an analysis of a continuously evolving temporal network. Large-scale networks are observed in many different sociological and scientific settings, such as computer networks, social media networks, academic/technical citation networks, and hyperlink networks. To understand such networks, several properties of interest have been studied, based primarily on two key measurements: the degrees of nodes and the shortest distances between pairs of nodes. The node-to-node distances determine the graph's diameter, which is the maximum shortest distance among all connected pairs of nodes.
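
The two measurements named above can be illustrated on a small graph. The following is a minimal single-machine sketch, not the Spark computation used in the assignment; `toy_graph` and the helper names are hypothetical and chosen only for illustration.

```python
from collections import deque


def shortest_distances(graph, source):
    """Breadth-first search: hop counts from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def diameter(graph):
    """Largest shortest-path distance over all connected pairs of nodes."""
    return max(max(shortest_distances(graph, n).values()) for n in graph)


# Hypothetical undirected toy graph: a path a-b-c-d plus a branch b-e
toy_graph = {
    "a": ["b"],
    "b": ["a", "c", "e"],
    "c": ["b", "d"],
    "d": ["c"],
    "e": ["b"],
}

# Degree of a node = number of neighbors in the adjacency list
degrees = {node: len(neighbors) for node, neighbors in toy_graph.items()}
print(degrees)
print(diameter(toy_graph))
```

Running one breadth-first search per node is only practical for small graphs; it is shown here because it follows the definition of the diameter directly.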

Most large networks evolve over time as new members/items and relationships are added or existing ones are removed. Traditional temporal network analysis rests on two major hypotheses:

    (a) The average node degree in the network remains constant over time
    (equivalently, the number of edges grows linearly in the number of nodes).

    (b) The diameter is a slowly growing function of the network size.

    How are hypotheses (a) and (b) reflected in real-world data?
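
Hypothesis (a) can be checked with simple arithmetic once cumulative node and edge counts per year are known. The sketch below uses made-up snapshot numbers (they are not taken from the arXiv dataset) and a hypothetical helper `average_degree`; it only shows the shape of the check.

```python
# Hypothetical cumulative snapshots: (year, nodes so far, edges so far)
snapshots = [
    (1993, 100, 150),
    (1994, 250, 500),
    (1995, 500, 1250),
]


def average_degree(nodes, edges):
    """Average out-degree of a directed graph: edges per node."""
    return edges / nodes


# Under hypothesis (a) these values would stay roughly constant over time
for year, nodes, edges in snapshots:
    print(year, average_degree(nodes, edges))
```

In this toy series the ratio increases every year, which is the kind of pattern that would contradict hypothesis (a).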

In this assignment, I measure fundamental network properties of a citation network and investigate how they evolve. I will perform the following computations using Apache Spark.

### Dataset

The dataset for this assignment is the arXiv citation graph (J. Gehrke, P. Ginsparg, and J. M. Kleinberg. Overview of the 2003 KDD Cup. SIGKDD Explorations, 5(2):149–151, 2003), which covers papers published from January 1993 to April 2003 (11 years).

Please note that the dataset for the last year (2003) is incomplete and does not represent the entire year.

## Task 1: Exercises on the basic Spark features

In this task, I will practice with the key features of Apache Spark.

    (1) Count the number of unique published papers per year - create an output file with the number of papers published each year.

    (2) Count the number of edges (citations) generated per year - create an output file with the number of citations added each year.
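
Both counts are group-by-year aggregations. The following plain-Python sketch shows the logic on a few made-up records; the record layout, the names `published` and `citations`, and the rule that a citation belongs to the year of its citing paper are all assumptions, since the real file formats are not shown here.

```python
from collections import Counter

# Hypothetical records; the real file layouts are not shown in this document
published = {"p1": "1993-02-10", "p2": "1993-11-03", "p3": "1994-05-21"}
citations = [("p2", "p1"), ("p3", "p1"), ("p3", "p2")]  # (citing, cited)


def year_of(date_string):
    """Extract the year from a 'YYYY-MM-DD' string (assumed date format)."""
    return int(date_string[:4])


# (1) Unique papers per year: the dict keys are already unique paper ids
papers_per_year = Counter(year_of(d) for d in published.values())

# (2) Citations per year
# Assumption: a citation is attributed to the year its citing paper appeared
citations_per_year = Counter(year_of(published[src]) for src, _ in citations)

print(sorted(papers_per_year.items()))
print(sorted(citations_per_year.items()))
```

In Spark the same idea would typically be expressed as a `groupBy` on a year column followed by a count.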


First, I transfer my two datasets to HDFS using the following commands.

1) Start HDFS on NameNode
    ```bash
    start-dfs.sh
    ```
2) Start YARN on ResourceManager
    ```bash
    start-yarn.sh
    ```
3) Start Spark on the NameNode 
    ```bash
    start-master.sh
    start-workers.sh
    ```
4) Transfer the files to HDFS
    ```bash
        hadoop fs -mkdir /pa1
        hadoop fs -mkdir /pa1/input
        hadoop fs -put cs535/PA1/citations-redo.txt /pa1/input
        hadoop fs -put cs535/PA1/published-dates-redo.txt /pa1/input
    ```
5) Verify the files are in HDFS
    5.1) Set up SSH with tunneling
    ```bash
        ssh -L 8080:localhost:8080 ebmartin@hartford.cs.colostate.edu
    ```
    5.2) Find HDFS web address 
    ```bash
        cd hadoopConf
        vim hdfs-site.xml
    ```
    5.3) Find the following:
    ```xml
        <property>
            <name>dfs.namenode.http-address</name>
            <value>hartford.cs.colostate.edu:30182</value>
            <description>Location of the DFS web UI</description>
        </property>
    ```
    5.4) Open local web browser and go to HDFS web address (http://<namenode>:<port>)
    ```bash
        http://hartford.cs.colostate.edu:30182/
    ```
    5.5) Verify files are in HDFS
    ```bash
        http://hartford.cs.colostate.edu:30182/explorer.html#/pa1/input
    ```
6) Check the YARN web portal to see if the Spark application is running (http://<resource_manager_host>:<port>)
    ```bash
        http://honolulu.cs.colostate.edu:30194/
    ```
7) Check the Spark web portal to see if the Spark application is running (http://<SPARK_MASTER_IP>:<SPARK_MASTER_WEBUI_PORT>)
    ```bash
        http://hartford.cs.colostate.edu:30197/
    ```
    *Note:*
    *I wrote all this so I can reference how I did it in future assignments.*

### Load data into Spark

In [1]:
import sys
sys.path.append("/usr/local/python-env/py39/lib/python3.9/site-packages")

import pyspark
print(pyspark.__version__)


print(sys.executable)

3.5.0
/usr/bin/python3.9


### Initialize a SparkSession

Initialize a test session to ensure the SparkSession is working properly. This will connect to the resource manager node that is running the YARN cluster. If we visit the YARN web portal, we can see that the Spark application is running.

The next cell sets PYSPARK_PYTHON so that PySpark uses the Python 3.9 interpreter under /usr/bin.


In [2]:
import os
os.environ['PYSPARK_PYTHON'] = '/usr/bin/python3.9'

In [3]:
import pkg_resources

sedona_version = pkg_resources.get_distribution("apache-sedona").version
print(f"Apache Sedona version: {sedona_version}")

Apache Sedona version: 1.5.1


In [4]:

print(os.environ['SPARK_HOME'])
print(os.environ['PYSPARK_PYTHON'])

/usr/local/spark/latest
/usr/bin/python3.9


## Now to make the app

In [9]:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.functions import split
from pyspark.sql.functions import col
from pyspark.sql import Row
from pyspark.sql.types import IntegerType, DateType
from pyspark.sql.functions import year  # used to extract year from date, could do this manually as well
from pyspark.sql import Window
from pyspark.sql.functions import sum as pyspark_sum


# Apache Sedona (geospatial extensions for Spark)
from sedona.register import SedonaRegistrator
from sedona.utils import SedonaKryoRegistrator, KryoSerializer
from sedona.spark import *
import geopandas as gpd

spark = SparkSession \
    .builder \
    .appName('GeoSpatialQueries') \
    .master('spark://hartford:30196') \
    .config("spark.yarn.resourcemanager.address", "honolulu.cs.colostate.edu:30190") \
    .config("spark.serializer", KryoSerializer.getName) \
    .config("spark.kryo.registrator", SedonaKryoRegistrator.getName) \
    .config('spark.jars.packages',
            'org.apache.sedona:sedona-spark-3.5_2.12:1.5.1,'
            'org.datasyslab:geotools-wrapper:1.5.1-28.2') \
    .config('spark.jars.repositories', 'https://artifacts.unidata.ucar.edu/repository/unidata-all') \
    .getOrCreate()

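# Only log errors to keep the notebook output readable
spark.sparkContext.setLogLevel("ERROR")

# SedonaContext.create registers Sedona's types and SQL functions,
# so the deprecated SedonaRegistrator.registerAll call is not needed
sedona = SedonaContext.create(spark)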

# create a logger
logger = spark._jvm.org.apache.log4j.LogManager.getLogger(__name__)
logger.info("Pyspark initialized...")

# IF YOU WANT TO RUN THE TEST, SET isTest = True
isTest = True



## Load the datasets

The code below is adapted from the Sedona SQL tutorial: https://sedona.apache.org/1.5.1/tutorial/sql/

Load GeoJSON using Spark JSON Data Source:

Spark SQL's built-in JSON data source supports reading GeoJSON data. To ensure proper parsing of the geometry property, we can define a schema with the geometry property set to type 'string'. This prevents Spark from interpreting the property and allows us to use the ST_GeomFromGeoJSON function for accurate geometry parsing.

```python
from pyspark.sql import functions as F

# geojson_path is the HDFS path to a GeoJSON file (defined in the loading cell below)
schema = "type string, crs string, totalFeatures long, features array<struct<type string, geometry string, properties map<string, string>>>"
(sedona.read.json(geojson_path, schema=schema)
    .selectExpr("explode(features) as features") # Explode the envelope to get one feature per row.
    .select("features.*") # Unpack the features struct.
    .withColumn("geometry", F.expr("ST_GeomFromGeoJSON(geometry)")) # Convert the geometry string.
    .printSchema())
```
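
To make the explode step concrete, here is a plain-Python analogue that uses only the standard `json` module. It is a sketch of the idea rather than Sedona's implementation, and the two-feature collection is made up for illustration.

```python
import json

# Hypothetical two-feature FeatureCollection, shaped like the schema above
geojson_text = json.dumps({
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature",
         "geometry": {"type": "Point", "coordinates": [-77.0, 38.9]},
         "properties": {"NAME": "A"}},
        {"type": "Feature",
         "geometry": {"type": "Point", "coordinates": [-105.1, 40.6]},
         "properties": {"NAME": "B"}},
    ],
})

collection = json.loads(geojson_text)

# "Explode": one output row per element of the features array.
# The geometry is kept as a JSON string, mirroring 'geometry string' in the
# schema, so that a later step can parse it into a geometry object.
rows = [
    {
        "type": feature["type"],
        "geometry": json.dumps(feature["geometry"]),
        "properties": {k: str(v) for k, v in feature["properties"].items()},
    }
    for feature in collection["features"]
]

for row in rows:
    print(row["properties"]["NAME"], row["geometry"])
```

Keeping the geometry as a string is the point of the schema: the geometry stays untouched until a dedicated parser handles it.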

In [25]:
# Initialize a Hadoop file system 
fs = spark._jvm.org.apache.hadoop.fs.FileSystem.get(spark._jsc.hadoopConfiguration())

# Directory containing the CSV files
csv_directory = "hdfs:///cs555/Datasets/"
json_directory = "hdfs:///geospatial/input/"

# Define the schema for the GeoJSON data
geojsonSchema = "type string, crs string, totalFeatures long, features array<struct<type string, geometry string, properties map<string, string>>>"


if isTest:

    # Path to the GeoJSON file
    geojson_path = "hdfs:///geospatial/input/cb_2018_us_state_20m.json"

    # Read the GeoJSON file using the defined schema using sedona into a spark dataframe
    state_boundaries_sedona = spark.read.schema(geojsonSchema).json(geojson_path, multiLine=True)
    
    # Explode the features array to create a row for each feature and select the columns
    state_boundaries_sedona = (state_boundaries_sedona
                               .select(F.explode("features").alias("features"))
                               .select("features.*")
                               # Use Sedona's ST_GeomFromGeoJSON function to convert the geometry string to a geometry object
                               .withColumn("geometry", F.expr("ST_GeomFromGeoJSON(geometry)"))
                              )

else:
    
    # Hadoop Path class, used to build glob patterns for the file system handle `fs`
    Path = spark._jvm.org.apache.hadoop.fs.Path

    # List the CSV and JSON files in their respective directories
    csv_files = fs.globStatus(Path(csv_directory + "*.csv"))
    json_files = fs.globStatus(Path(json_directory + "*.json"))
    
    # Create dictionaries to hold the DataFrames, keyed by file name
    csv_dataset_dataframes = {}
    json_dataset_dataframes = {}

    # Load each CSV file into a DataFrame and store it in the dictionary
    for csv_file in csv_files:
        file_name = csv_file.getPath().getName()
        csv_dataset_dataframes[file_name] = spark.read.csv(csv_directory + file_name, header=True, inferSchema=True)

    # Load each JSON file the same way; multiLine handles pretty-printed JSON
    for json_file in json_files:
        file_name = json_file.getPath().getName()
        json_dataset_dataframes[file_name] = spark.read.json(json_directory + file_name, multiLine=True)
        

## We can see that the datasets are not in a workable format

In [11]:
# print the schemas

if (isTest):
    print("State Boundaries Sedona Schema:")
    state_boundaries_sedona.printSchema()

State Boundaries Sedona Schema:
root
 |-- type: string (nullable = true)
 |-- geometry: geometry (nullable = false)
 |-- properties: map (nullable = true)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)



In [20]:
# View the first 5 rows of the state_boundaries_sedona DataFrame

if (isTest):
    state_boundaries_sedona.show(5, truncate=False)


## Running Spatial Queries

https://sedona.apache.org/1.5.1/api/sql/Function/

### Range Query

This example demonstrates how to perform a range query using ST_Contains to find geometries within a specified polygon:

In [21]:
# Define a polygon using ST_PolygonFromEnvelope and perform a range query

bbox_polygon = "ST_PolygonFromEnvelope(-79.5, 37.9, -75.6, 39.8)"

# Perform the range query to find features within the bounding box
contained_features = state_boundaries_sedona.filter(
    F.expr(f"ST_Contains({bbox_polygon}, geometry)")
)

# Show results
contained_features.show()

+-------+--------------------+--------------------+
|   type|            geometry|          properties|
+-------+--------------------+--------------------+
|Feature|POLYGON ((-77.119...|{STATEFP -> 11, S...|
+-------+--------------------+--------------------+
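
Only one feature is returned because the query keeps a geometry only when it lies entirely inside the box; a geometry that merely overlaps the box is dropped. The sketch below approximates that test in plain Python for an axis-aligned box: since the box is convex, a polygon is inside it when all of its vertices are. The function `box_contains` and both rings are hypothetical and are not real state boundaries.

```python
def box_contains(box, ring):
    """True when every vertex of the ring lies inside the axis-aligned box."""
    min_x, min_y, max_x, max_y = box
    return all(min_x <= x <= max_x and min_y <= y <= max_y for x, y in ring)


# Same envelope as the query above: (min lon, min lat, max lon, max lat)
query_box = (-79.5, 37.9, -75.6, 39.8)

# Hypothetical rings, not real boundaries
small_ring = [(-77.1, 38.8), (-76.9, 38.8), (-76.9, 39.0), (-77.1, 39.0), (-77.1, 38.8)]
wide_ring = [(-83.7, 36.5), (-75.2, 36.5), (-75.2, 39.5), (-83.7, 39.5), (-83.7, 36.5)]

print(box_contains(query_box, small_ring))  # fully inside the box
print(box_contains(query_box, wide_ring))   # overlaps the box but extends past it
```

If partially overlapping features are also wanted, Sedona's `ST_Intersects` predicate would be the one to try in place of `ST_Contains`.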



                                                                                

### KNN Query

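This example demonstrates how to perform a k-nearest neighbors (KNN) query: ST_Distance computes the distance from each geometry to a specified point, and the rows are sorted so that only the k nearest are kept. Because the coordinates are longitude/latitude, the distances are planar values in degrees, not meters.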

In [23]:
from pyspark.sql import functions as F

# Calculate the center of the bounding box and create a WKT representation of the point
center_longitude = (-79.5 + -75.6) / 2
center_latitude = (37.9 + 39.8) / 2
center_point_wkt = f"POINT({center_longitude} {center_latitude})"

# Perform the KNN query using ST_Distance to calculate the distance to the center point
knnQueryResult = state_boundaries_sedona.select(
    # Access the 'NAME' from the 'properties' map
    F.col("properties").getItem("NAME").alias("NAME"),
    F.expr(f"ST_Distance(ST_GeomFromWKT('{center_point_wkt}'), geometry)").alias("distance")
).orderBy("distance").limit(5)

knnQueryResult.show()



+--------------------+-------------------+
|                NAME|           distance|
+--------------------+-------------------+
|            Virginia|                0.0|
|            Maryland| 0.2425364842513449|
|       West Virginia|0.39633443061384227|
|District of Columbia|0.43843022219048333|
|        Pennsylvania| 0.8705736535332227|
+--------------------+-------------------+
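
A distance of 0.0 means the query point falls inside that geometry. The selection logic itself (score every candidate, keep the k smallest) can be sketched on a single machine with the standard library. The candidate points and the helper `k_nearest` below are hypothetical, and point-to-point distance stands in for the point-to-polygon distance used in the query.

```python
import heapq
import math


def k_nearest(query, candidates, k):
    """Return the k (name, distance) pairs closest to query, nearest first."""
    scored = ((name, math.dist(query, point)) for name, point in candidates.items())
    return heapq.nsmallest(k, scored, key=lambda pair: pair[1])


# Center of the bounding box used in the query above
query_point = (-77.55, 38.85)

# Hypothetical candidate points as (longitude, latitude)
candidates = {
    "A": (-77.55, 38.85),
    "B": (-77.0, 38.9),
    "C": (-80.0, 40.0),
    "D": (-75.5, 39.0),
}

nearest = k_nearest(query_point, candidates, 2)
for name, distance in nearest:
    print(name, round(distance, 4))
```

As in the query, these are planar distances in degrees; a geodesic formula would be needed for distances in meters.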



In [24]:
state_boundaries_sedona.show()

+-------+--------------------+--------------------+
|   type|            geometry|          properties|
+-------+--------------------+--------------------+
|Feature|MULTIPOLYGON (((-...|{STATEFP -> 24, S...|
|Feature|POLYGON ((-96.621...|{STATEFP -> 19, S...|
|Feature|POLYGON ((-75.773...|{STATEFP -> 10, S...|
|Feature|MULTIPOLYGON (((-...|{STATEFP -> 39, S...|
|Feature|POLYGON ((-80.519...|{STATEFP -> 42, S...|
|Feature|POLYGON ((-104.05...|{STATEFP -> 31, S...|
|Feature|MULTIPOLYGON (((-...|{STATEFP -> 53, S...|
|Feature|MULTIPOLYGON (((-...|{STATEFP -> 72, S...|
|Feature|POLYGON ((-88.468...|{STATEFP -> 01, S...|
|Feature|POLYGON ((-94.617...|{STATEFP -> 05, S...|
|Feature|POLYGON ((-109.04...|{STATEFP -> 35, S...|
|Feature|POLYGON ((-106.62...|{STATEFP -> 48, S...|
|Feature|MULTIPOLYGON (((-...|{STATEFP -> 06, S...|
|Feature|POLYGON ((-89.544...|{STATEFP -> 21, S...|
|Feature|POLYGON ((-85.605...|{STATEFP -> 13, S...|
|Feature|MULTIPOLYGON (((-...|{STATEFP -> 55, S...|
|Feature|POL