# Close Encounters Calculator
### Preamble: Session details
Start a Cloudera Machine Learning (CML) session with the following session settings:

![Cloudera Machine Learning Session Settings](close-encounters/media/CloseEncountersSessionCML.JPG "Cloudera Machine Learning Session Settings")

### 1. Install requirements 
The first time you run this notebook, you may need to install additional Python packages. If so, uncomment and run the cell below.

In [1]:
#!pip install close-encounters==0.1.0

### 2. Library imports

In [2]:
# Python
import pandas as pd
pd.DataFrame.iteritems = pd.DataFrame.items  # Hotfix: iteritems was removed in pandas 2.0, but older PySpark versions still call it
import numpy as np
from time import time
from close_encounters import CloseEncounters
import os
from pyspark.sql import SparkSession
from IPython.display import display, HTML

### 2. Close encounter algorithm settings

In [3]:
# Set minimal horizontal separation in nautical miles (NM)
h_dist_NM = 5

# Set minimal vertical separation in feet (ft)
v_dist_ft = 1000

# Set minimal flight level (FL)
# Note: All flight sections below this flight level are pruned before the close encounter algorithm is applied.
v_cutoff_FL = 245

# Set resampling interval in seconds (s)
freq_s = 5

# Set maximal interpolation time in minutes (min)
# Note: Gaps in a trajectory that last longer than this time are not interpolated.
t_max = 10
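
The settings above mix three units: nautical miles, feet, and flight levels. The following is a minimal sketch for sanity-checking them. The helper names are illustrative and not part of `close_encounters`; the conversions use the standard definitions 1 NM = 1852 m and 1 FL = 100 ft.

```python
# Illustrative helpers (hypothetical, not part of the close_encounters package).
NM_TO_M = 1852.0    # 1 nautical mile is defined as exactly 1852 m
FT_PER_FL = 100.0   # 1 flight level corresponds to 100 ft


def nm_to_m(nm):
    """Convert nautical miles to metres."""
    return nm * NM_TO_M


def fl_to_ft(fl):
    """Convert a flight level to feet."""
    return fl * FT_PER_FL


def ft_to_fl(ft):
    """Convert feet to flight levels."""
    return ft / FT_PER_FL


# Values copied from the settings cell above
h_dist_NM = 5
v_dist_ft = 1000
v_cutoff_FL = 245

print(f"Horizontal separation: {nm_to_m(h_dist_NM):.0f} m")
print(f"Vertical separation:   {ft_to_fl(v_dist_ft):.0f} flight levels")
print(f"Altitude cutoff:       {fl_to_ft(v_cutoff_FL):.0f} ft")
```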

### 3. Spark Session Initialization

In [None]:
# Initialize the Spark Session
spark = SparkSession.builder \
    .appName("CloseEncounters") \
    .config("spark.executor.memory", "12g") \
    .config("spark.driver.memory", "10g") \
    .config("spark.executor.cores", "1") \
    .config("spark.executor.instances", "5") \
    .config("spark.sql.shuffle.partitions", "100") \
    .config("spark.default.parallelism", "100") \
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
    .config("spark.rpc.message.maxSize", "512") \
    .getOrCreate()

# Display the Spark URL to monitor the process
# Get environment variables
engine_id = os.getenv('CDSW_ENGINE_ID')
domain = os.getenv('CDSW_DOMAIN')

# Format the URL
url = f"https://spark-{engine_id}.{domain}"

# Display the clickable URL
display(HTML(f'<a href="{url}">{url}</a>'))

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/06/27 17:28:55 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


----------------------------------------
Exception occurred during processing of request from ('127.0.0.1', 59163)
Traceback (most recent call last):
  File "/opt/anaconda3/envs/ce/lib/python3.10/socketserver.py", line 316, in _handle_request_noblock
    self.process_request(request, client_address)
  File "/opt/anaconda3/envs/ce/lib/python3.10/socketserver.py", line 347, in process_request
    self.finish_request(request, client_address)
  File "/opt/anaconda3/envs/ce/lib/python3.10/socketserver.py", line 360, in finish_request
    self.RequestHandlerClass(request, client_address, self)
  File "/opt/anaconda3/envs/ce/lib/python3.10/socketserver.py", line 747, in __init__
    self.handle()
  File "/opt/anaconda3/envs/ce/lib/python3.10/site-packages/pyspark/accumulators.py", line 295, in handle
    poll(accum_updates)
  File "/opt/anaconda3/envs/ce/lib/python3.10/site-packages/pyspark/accumulators.py", line 267, in poll
    if self.rfile in r and func():
  File "/opt/anaconda3/envs/ce/l
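
The URL cell above assumes that `CDSW_ENGINE_ID` and `CDSW_DOMAIN` are set, which holds inside a CML session but not necessarily elsewhere (the paths in the log above suggest a local Anaconda environment). As a hedged sketch, the helper below (`spark_ui_url`, a hypothetical name) returns `None` instead of building a link from missing values.

```python
import os


def spark_ui_url(env=os.environ):
    """Return the CML Spark UI URL, or None when the CDSW variables are absent."""
    engine_id = env.get("CDSW_ENGINE_ID")
    domain = env.get("CDSW_DOMAIN")
    if not engine_id or not domain:
        return None
    return f"https://spark-{engine_id}.{domain}"


url = spark_ui_url()
if url is None:
    print("CDSW_ENGINE_ID / CDSW_DOMAIN not set; skipping the Spark UI link.")
else:
    print(url)
```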

### 4. Run on sample data

In [5]:
%%time
# Initialize CloseEncounters with the Spark session
ce = CloseEncounters(spark = spark)

# Load trajectories into close encounters
#ce = ce.load_parquet_trajectories(
#    parquet_path = 'data/flight_profiles_cpf_20240701_filtered.parquet',
#    flight_id_col = 'FLIGHT_ID', 
#    icao24_col = 'ICAO24',
#    longitude_col = 'LONGITUDE',
#    latitude_col = 'LATITUDE',
#    time_over_col = 'TIME_OVER',
#    flight_level_col = 'FLIGHT_LEVEL'
#)

ce = ce.load_sample_trajectories(nrows = 10000000)

[2025-06-27 17:28:55,799] INFO - Initialized CloseEncounters class.
[2025-06-27 17:28:57,395] INFO - Loaded trajectory data from parquet: data/flight_profiles_cpf_20240701_filtered.parquet
[2025-06-27 17:28:57,516] INFO - Loaded trajectory data from Spark DataFrame.


CPU times: user 6.71 ms, sys: 4.17 ms, total: 10.9 ms
Wall time: 1.72 s


In [6]:
%%time
ce = ce.resample(freq_s = freq_s, t_max=t_max)

25/06/27 17:29:09 WARN GarbageCollectionMetrics: To enable non-built-in garbage collector(s) List(G1 Concurrent GC), users should configure it(them) to spark.eventLog.gcMetrics.youngGenerationGarbageCollectors or spark.eventLog.gcMetrics.oldGenerationGarbageCollectors
[2025-06-27 17:29:30,017] INFO - Resampling complete. Total segments: 50865675  


CPU times: user 108 ms, sys: 46.1 ms, total: 154 ms
Wall time: 32.5 s
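
To illustrate what `freq_s` and `t_max` control, here is a minimal pure-Python sketch of resampling by linear interpolation. It is an assumption about the general idea, not the `close_encounters` implementation: points are inserted every `freq_s` seconds between consecutive samples, and gaps longer than `t_max` minutes are left unfilled. The function name and tuple layout are hypothetical.

```python
def resample_linear(points, freq_s, t_max_min):
    """Linearly interpolate (t_seconds, lat, lon) samples sorted by time.

    Returns (t, lat, lon, is_interpolated) tuples. Gaps longer than
    t_max_min minutes are not interpolated.
    """
    max_gap_s = t_max_min * 60
    out = []
    for (t0, lat0, lon0), (t1, lat1, lon1) in zip(points, points[1:]):
        out.append((t0, lat0, lon0, False))  # keep the original sample
        gap = t1 - t0
        if gap > max_gap_s:
            continue  # gap too long: leave it unfilled
        t = t0 + freq_s
        while t < t1:
            w = (t - t0) / gap  # interpolation weight in (0, 1)
            out.append((t, lat0 + w * (lat1 - lat0), lon0 + w * (lon1 - lon0), True))
            t += freq_s
    if points:
        out.append((*points[-1], False))  # keep the final original sample
    return out


# Toy trajectory: a 10 s gap (filled) followed by a 990 s gap (not filled)
track = [(0, 0.0, 0.0), (10, 1.0, 2.0), (1000, 5.0, 5.0)]
for row in resample_linear(track, freq_s=5, t_max_min=10):
    print(row)
```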


In [None]:
%%time
# Find close encounters
ce_sdf = ce.find_close_encounters(
    h_dist_NM=h_dist_NM,
    v_dist_ft=v_dist_ft,
    v_cutoff_FL=v_cutoff_FL,
    freq_s=freq_s,
    t_max=t_max,
    method = 'brute_force'
)

# Convert from a Spark DataFrame (sdf) to a pandas DataFrame (pdf)
ce_pdf_bf = ce_sdf.toPandas()

print(ce_pdf_bf.shape)

In [7]:
%%time
# Find close encounters
ce_duckdb = ce.find_close_encounters_duckdb(
    h_dist_NM=h_dist_NM,
    v_dist_ft=v_dist_ft,
    v_cutoff_FL=v_cutoff_FL,
    freq_s=freq_s,
    t_max=t_max
)

print(ce_duckdb.shape)


[2025-06-27 17:29:30,025] INFO - Starting DuckDB-based detection (1:1 with Spark).
[2025-06-27 17:29:30,026] INFO - Skipping resample: already done (freq_s=5, t_max=10)
                                                                                


[2025-06-27 17:32:36,599] INFO - DuckDB (1:1 Spark) found 242534 encounters


(242534, 12)
CPU times: user 8min 37s, sys: 1min 44s, total: 10min 22s
Wall time: 3min 13s
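
For reference, the pairwise test behind a close encounter can be sketched in plain Python: a great-circle distance compared against `h_dist_NM`, plus an altitude difference compared against `v_dist_ft`. This is a hedged illustration, not the library's code. The haversine formula, the mean Earth radius, and the inclusive comparisons are all assumptions, and the function names are hypothetical. The sample coordinates are taken from an encounter row reported later in this notebook, whose `h_dist_NM` is about 1.97.

```python
import math

# Mean Earth radius in km converted to nautical miles (assumed value)
EARTH_RADIUS_NM = 6371.0088 / 1.852


def haversine_nm(lat1, lon1, lat2, lon2):
    """Great-circle distance in nautical miles between two points in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_NM * math.asin(math.sqrt(a))


def is_close_encounter(p1, p2, h_dist_NM, v_dist_ft):
    """p1, p2: (lat, lon, altitude_ft). Inclusive thresholds are an assumption."""
    h_ok = haversine_nm(p1[0], p1[1], p2[0], p2[1]) <= h_dist_NM
    v_ok = abs(p1[2] - p2[2]) <= v_dist_ft
    return h_ok and v_ok


p1 = (65.502611, -11.325444, 35000.0)
p2 = (65.471611, -11.299778, 34000.0)
print(round(haversine_nm(p1[0], p1[1], p2[0], p2[1]), 3))
print(is_close_encounter(p1, p2, h_dist_NM=5, v_dist_ft=1000))
```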


In [13]:
ce_duckdb["ID_combined"] = ce_duckdb["ID1"].astype(str).str.cat(ce_duckdb["ID2"].astype(str), sep="_")

In [14]:
ce_duckdb['isin_spark'] = ce_duckdb.ID_combined.isin(ce_pdf_half_disk.ID.to_list())

In [16]:
ce_duckdb[~ce_duckdb['isin_spark']]

Unnamed: 0,time_over,ID1,ID2,lat1,lon1,altitude_ft1,flight_id1,lat2,lon2,altitude_ft2,flight_id2,h_dist_NM,ID_combined,isin_spark
92112,2024-07-01 13:15:10,309237931563,532576201785,65.502611,-11.325444,35000.0,273714416.0,65.471611,-11.299778,34000.0,273713939.0,1.970174,309237931563_532576201785,False
114431,2024-07-01 13:15:00,309237931561,532576201783,65.488111,-11.284111,35000.0,273714416.0,65.456556,-11.257056,34000.0,273713939.0,2.013255,309237931561_532576201783,False
132444,2024-07-01 09:30:40,128849261916,618475446304,40.703611,16.594537,32933.33,273711156.0,40.743333,16.55713,31933.33,273705871.0,2.933274,128849261916_618475446304,False
200873,2024-07-01 13:15:05,309237931562,532576201784,65.495361,-11.304778,35000.0,273714416.0,65.464083,-11.278417,34000.0,273713939.0,1.991682,309237931562_532576201784,False
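
All four rows above show an altitude difference that is nominally 1000 ft (35000 vs 34000 ft, and 32933.33 vs 31933.33 ft), which equals `v_dist_ft`. One possible explanation, an assumption that the source does not confirm, is that the two implementations treat the threshold boundary differently. A minimal sketch of how a strict versus an inclusive comparison changes the outcome (function names are hypothetical):

```python
def within_strict(alt1_ft, alt2_ft, v_dist_ft):
    """True when the vertical separation is strictly below the threshold."""
    return abs(alt1_ft - alt2_ft) < v_dist_ft


def within_inclusive(alt1_ft, alt2_ft, v_dist_ft):
    """True when the vertical separation is below or equal to the threshold."""
    return abs(alt1_ft - alt2_ft) <= v_dist_ft


# First pair sits exactly on the 1000 ft boundary, second pair is inside it
pairs = [(35000.0, 34000.0), (35000.0, 34100.0)]
for a1, a2 in pairs:
    print(a1, a2, within_strict(a1, a2, 1000), within_inclusive(a1, a2, 1000))
```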


25/06/27 22:28:57 WARN HeartbeatReceiver: Removing executor driver with no recent heartbeats: 949853 ms exceeds timeout 120000 ms
25/06/27 22:28:57 WARN SparkContext: Killing executors is not supported by current scheduler.
25/06/27 22:29:01 ERROR Inbox: Ignoring error
org.apache.spark.SparkException: Exception thrown in awaitResult: 
	at org.apache.spark.util.SparkThreadUtils$.awaitResult(SparkThreadUtils.scala:56)
	at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:310)
	at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
	at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:102)
	at org.apache.spark.rpc.RpcEnv.setupEndpointRef(RpcEnv.scala:110)
	at org.apache.spark.util.RpcUtils$.makeDriverRef(RpcUtils.scala:36)
	at org.apache.spark.storage.BlockManagerMasterEndpoint.driverEndpoint$lzycompute(BlockManagerMasterEndpoint.scala:124)
	at org.apache.spark.storage.BlockManagerMasterEndpoint.org$apache$spark$storage$BlockManagerMasterEndpoint$$

In [15]:
ce_pdf_half_disk.altitude_ft1.values[0]

np.float64(37000.0)

In [12]:
%%time
# Find close encounters
ce_sdf = ce.find_close_encounters(
    h_dist_NM=h_dist_NM,
    v_dist_ft=v_dist_ft,
    v_cutoff_FL=v_cutoff_FL,
    freq_s=freq_s,
    t_max=t_max,
    method = 'half_disk'
)

# Convert from a Spark DataFrame (sdf) to Pandas Dataframe (pdf)
ce_pdf_half_disk = ce_sdf.toPandas()

print(ce_pdf_half_disk.shape)

[2025-06-27 17:49:54,539] INFO - Starting close encounter detection with method='half_disk'
[2025-06-27 17:49:54,539] INFO - Skipping resample: already done (freq_s=5, t_max=10)
[2025-06-27 17:55:18,620] INFO - Found 242530 candidate close encounters        


(242530, 20)
CPU times: user 10.6 s, sys: 503 ms, total: 11.1 s
Wall time: 5min 34s


In [None]:
%%time
# Find close encounters
ce_sdf = ce.find_close_encounters_duckdb(
    h_dist_NM=h_dist_NM,
    v_dist_ft=v_dist_ft,
    v_cutoff_FL=v_cutoff_FL,
    freq_s=freq_s,
    t_max=t_max
)

print(ce_sdf.shape)


[2025-06-27 01:26:17,217] INFO - Starting DuckDB-based detection (1:1 with Spark).
[2025-06-27 01:26:17,217] INFO - Skipping resample: already done (freq_s=1, t_max=10)
25/06/27 01:26:24 WARN MemoryStore: Not enough space to cache rdd_53_50 in memory! (computed 27.1 MiB so far)
25/06/27 01:26:24 WARN MemoryStore: Not enough space to cache rdd_53_52 in memory! (computed 13.7 MiB so far)
25/06/27 01:26:24 WARN MemoryStore: Not enough space to cache rdd_53_53 in memory! (computed 7.1 MiB so far)
25/06/27 01:26:25 WARN MemoryStore: Not enough space to cache rdd_53_56 in memory! (computed 27.1 MiB so far)
25/06/27 01:26:25 WARN MemoryStore: Not enough space to cache rdd_53_57 in memory! (computed 13.7 MiB so far)
25/06/27 01:26:26 WARN MemoryStore: Not enough space to cache rdd_53_61 in memory! (computed 27.1 MiB so far)
25/06/27 01:26:26 WARN MemoryStore: Not enough space to cache rdd_53_62 in memory! (computed 13.7 MiB so far)
25/06/27 01:26:26 WARN MemoryStore: Not enough space to cache 


OutOfMemoryException: Out of Memory Error: failed to offload data block of size 256.0 KiB (499.9 GiB/500.0 GiB used).
This limit was set by the 'max_temp_directory_size' setting.
By default, this setting utilizes the available disk space on the drive where the 'temp_directory' is located.
You can adjust this setting, by using (for example) PRAGMA max_temp_directory_size='10GiB'

Possible solutions:
* Reducing the number of threads (SET threads=X)
* Disabling insertion-order preservation (SET preserve_insertion_order=false)
* Increasing the memory limit (SET memory_limit='...GB')

See also https://duckdb.org/docs/stable/guides/performance/how_to_tune_workloads
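
The error message above lists several DuckDB settings that could be tuned. As a hedged sketch, the helper below only builds those statements as strings, using the forms quoted in the message. The values are placeholders, the helper name is hypothetical, and how (or whether) `close_encounters` exposes its DuckDB connection is not shown in this notebook.

```python
def duckdb_tuning_statements(threads=4, memory_limit="8GB", max_temp_dir="100GiB"):
    """Build the tuning statements suggested by the DuckDB error message.

    All default values are placeholders and should be sized to the machine.
    """
    return [
        f"SET threads={int(threads)}",
        "SET preserve_insertion_order=false",
        f"SET memory_limit='{memory_limit}'",
        f"PRAGMA max_temp_directory_size='{max_temp_dir}'",
    ]


for stmt in duckdb_tuning_statements():
    # With a DuckDB connection `con` (for example from duckdb.connect()),
    # each statement could be applied with con.execute(stmt).
    print(stmt)
```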

In [9]:
import math
from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from pyspark.sql.functions import col, lag, udf
from pyspark.sql.types import DoubleType

def calculate_bearing(lat1, lon1, lat2, lon2):
    """
    Calculate the initial bearing (forward azimuth) between two points
    specified in decimal degrees using the great-circle formula.

    Parameters:
        lat1 (float): Latitude of the first point.
        lon1 (float): Longitude of the first point.
        lat2 (float): Latitude of the second point.
        lon2 (float): Longitude of the second point.

    Returns:
        float: Initial bearing in degrees, normalized to [0, 360).
    """
    if None in (lat1, lon1, lat2, lon2):
        return None

    lat1_rad = math.radians(lat1)
    lat2_rad = math.radians(lat2)
    delta_lon_rad = math.radians(lon2 - lon1)

    x = math.sin(delta_lon_rad) * math.cos(lat2_rad)
    y = (math.cos(lat1_rad) * math.sin(lat2_rad) -
         math.sin(lat1_rad) * math.cos(lat2_rad) * math.cos(delta_lon_rad))

    bearing_rad = math.atan2(x, y)
    bearing_deg = math.degrees(bearing_rad)

    return (bearing_deg + 360) % 360


# Register UDF
calculate_bearing_udf = udf(calculate_bearing, DoubleType())

# Assume `resampled_sdf` is your existing DataFrame
# Define window for each flight ordered by timestamp
window_spec = Window.partitionBy("flight_id").orderBy("time_over")

# Add previous point's latitude and longitude
resampled_sdf = resampled_sdf.withColumn(
    "prev_latitude", lag("latitude").over(window_spec)
)
resampled_sdf = resampled_sdf.withColumn(
    "prev_longitude", lag("longitude").over(window_spec)
)

# Compute heading using the UDF
resampled_sdf = resampled_sdf.withColumn(
    "heading",
    calculate_bearing_udf(
        col("prev_latitude"),
        col("prev_longitude"),
        col("latitude"),
        col("longitude")
    )
)


In [10]:
resampled_pdf = resampled_sdf.limit(20000).toPandas()

                                                                                

In [11]:
resampled_pdf

Unnamed: 0,time_over,flight_level,latitude,longitude,flight_id,icao24,is_ts_interpolated,segment_id,prev_latitude,prev_longitude,heading
0,2024-07-01 12:00:15,370.0,65.029722,-6.206111,273696561.0,3965AF,False,1432,,,
1,2024-07-01 12:00:20,370.0,65.020000,-6.200083,273696561.0,3965AF,True,1433,65.029722,-6.206111,165.327592
2,2024-07-01 12:00:25,370.0,65.010278,-6.194056,273696561.0,3965AF,True,1434,65.020000,-6.200083,165.322479
3,2024-07-01 12:00:30,370.0,65.000556,-6.188028,273696561.0,3965AF,True,1435,65.010278,-6.194056,165.317367
4,2024-07-01 12:00:35,370.0,64.990833,-6.182000,273696561.0,3965AF,True,1436,65.000556,-6.188028,165.312255
...,...,...,...,...,...,...,...,...,...,...,...
19995,2024-07-01 13:32:10,380.0,41.479722,41.306111,273697355.0,4406DE,False,26171,41.478148,41.321343,277.858780
19996,2024-07-01 13:32:15,380.0,41.481065,41.292917,273697355.0,4406DE,True,26172,41.479722,41.306111,277.738986
19997,2024-07-01 13:32:20,380.0,41.482407,41.279722,273697355.0,4406DE,True,26173,41.481065,41.292917,277.739145
19998,2024-07-01 13:32:25,380.0,41.483750,41.266528,273697355.0,4406DE,True,26174,41.482407,41.279722,277.739303
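
The `heading` column above can be spot-checked without Spark. The sketch below restates the same initial-bearing formula as a standalone function (`bearing_deg` is a hypothetical name), checks the four cardinal directions, and recomputes the heading of row 1 from its rounded coordinates, which should land close to the reported 165.327592.

```python
import math


def bearing_deg(lat1, lon1, lat2, lon2):
    """Initial great-circle bearing from point 1 to point 2, in degrees within [0, 360)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dlon = math.radians(lon2 - lon1)
    x = math.sin(dlon) * math.cos(phi2)
    y = math.cos(phi1) * math.sin(phi2) - math.sin(phi1) * math.cos(phi2) * math.cos(dlon)
    return (math.degrees(math.atan2(x, y)) + 360) % 360


# Cardinal directions, starting near the equator
print("north:", bearing_deg(0, 0, 1, 0))
print("east: ", bearing_deg(0, 0, 0, 1))
print("south:", bearing_deg(1, 0, 0, 0))
print("west: ", bearing_deg(0, 1, 0, 0))

# Row 1 of the table: previous point -> current point
print("row 1:", bearing_deg(65.029722, -6.206111, 65.020000, -6.200083))
```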


In [15]:
!pip uninstall -y plotly


In [9]:
import plotly.express as px
px.scatter(resampled_pdf, x='longitude', y='latitude')

ModuleNotFoundError: No module named 'plotly'


In [10]:
ce = ce.find_close_encounters()

Skipping resample: Already done (w. freq_s = 5 and t_max = 10)


In [11]:
ce.show()



+-----------+----------+-------------------+---------------+--------------------+------------------+------------------+-------------------+------------------+------------+-------+------------------+------------------+-------------------+-----------------+------------+-------+-----------+------------------+------------------+
|        ID2|       ID1|          time_over|       h3_group|                  ID|              lat1|              lon1|              time1|       flight_lvl1|  flight_id1|icao241|              lat2|              lon2|              time2|      flight_lvl2|  flight_id2|icao242|time_diff_s|         v_dist_FL|         h_dist_NM|
+-----------+----------+-------------------+---------------+--------------------+------------------+------------------+-------------------+------------------+------------+-------+------------------+------------------+-------------------+-----------------+------------+-------+-----------+------------------+------------------+
|     869136|    57

                                                                                

In [12]:
df = load_sample_trajectories()
encounters_df = CloseEncountersH3HalfDisk(
    df, 
    distance_nm = horizontal_separation_NM, 
    FL_diff = vertical_separation_FL, 
    FL_min = minimal_FL, 
    deltaT_min = deltaT_min, 
    pnumb = 100, 
    spark = spark)

NameError: name 'CloseEncountersH3HalfDisk' is not defined

In [None]:
create_keplergl_html(encounters_df)

TypeError: CloseEncounters.__init__() missing 1 required positional argument: 'spark'