# Coordinate Parsing Example

Here is a quick example of showing coordinate parsing on a small dataset (<10k rows).
This was made from trips randomly constructed from GPS points in the Kansas City (USA)
metro area.

Some are small, some are large. I have no idea. it was grabbing a bunch of random
coordinates and finding routes between them.

Below is an example of using ``rapidgeo`` to create polylines out of the coordinate arrays.


In [1]:
import time
import sys
import pandas as pd
import numpy as np

import rapidgeo

In [2]:
print(f"RapidGeo version: {rapidgeo.__version__}")

RapidGeo version: 0.2.2


In [3]:
def to_python_lists(coords_array):
    return [[float(coord[0]), float(coord[1])] for coord in coords_array]

def to_flat_numpy(coords_array):
    return np.array(coords_array).flatten()

def to_contiguous_2d(coords_array):
  return np.array([[float(coord[0]), float(coord[1])] for coord in coords_array], dtype=np.float64)


df = pd.read_parquet('kc_metro_combined.parquet', columns=('trip_id', 'coordinates_array', 'gps_points'))
print(f"Loaded {len(df):,} KC Metro routes")

df['contiguous_coords'] = df['coordinates_array'].apply(to_contiguous_2d)

coords = df['coordinates_array'].iloc[0]
print(f"\nOriginal data type: {type(coords)}")

coords = df['contiguous_coords'].iloc[0]
print(f"contiguous coords type: {type(coords)}")

df.head()

Loaded 7,496 KC Metro routes

Original data type: <class 'numpy.ndarray'>
contiguous coords type: <class 'numpy.ndarray'>


Unnamed: 0,trip_id,coordinates_array,gps_points,contiguous_coords
0,kc_000000,"[[-94.6821126, 38.9578505], [-94.6820774999999...",13,"[[-94.6821126, 38.9578505], [-94.6820774999999..."
1,kc_000001,"[[-95.217415, 38.9726943], [-95.2190526, 38.97...",12,"[[-95.217415, 38.9726943], [-95.2190526, 38.97..."
2,kc_000002,"[[-94.8391741, 38.7676072], [-94.8282483999999...",16,"[[-94.8391741, 38.7676072], [-94.8282483999999..."
3,kc_000003,"[[-94.9115648, 39.0301498], [-94.9113505999999...",11,"[[-94.9115648, 39.0301498], [-94.9113505999999..."
4,kc_000004,"[[-94.6948674, 38.78569299999999], [-94.694291...",13,"[[-94.6948674, 38.78569299999999], [-94.694291..."


In [4]:
def test_coords(df_name, coords):
    """
    Test Performance time creating polylines
    """
    start_time = time.time()
    df_name['polyline'] = df_name[coords].apply(
        lambda coords: rapidgeo.polyline.encode(
            rapidgeo.formats.coords_to_lnglat(coords)
        )
    )
    end_time = time.time()
    
    total_time = end_time - start_time
    routes_per_sec = len(df) / total_time
    
    print(f"\t\tProcessed {len(df_name):,} routes in {total_time:.4f} seconds")
    print(f"\t\tRate: {routes_per_sec:.4f} routes/second")

## Processing original ndarray

Out of parquet/pandas we get a numpy array. These are fast!  The polyline algorithm should just chug through this.


In [5]:
print("native numpy array")
test_coords(df, 'coordinates_array')

print("\ncontiguous_2d")
test_coords(df, 'contiguous_coords')

native numpy array
		Processed 7,496 routes in 0.0402 seconds
		Rate: 186653.6620 routes/second

contiguous_2d
		Processed 7,496 routes in 0.0273 seconds
		Rate: 274227.8984 routes/second


## Python Lists

Let's see what happens on this dataset in a list of lists

It might surprise you.

In [5]:
df['list_coords'] = df['coordinates_array'].apply(to_python_lists)

coords = df['list_coords'].iloc[0]
print(f"List Data Type: {type(coords)}")

List Data Type: <class 'list'>


## Results

Pretty interesting, right?

Let's make some larger datasets to see what happens!

In [6]:
df_2x = pd.concat([df] * 2, ignore_index=True)
df_6x = pd.concat([df] * 5, ignore_index=True)
df_10x = pd.concat([df] * 15, ignore_index=True)
df_35x = pd.concat([df] * 35, ignore_index=True)

print(f"Original: {len(df):,} routes")
print(f"2x: {len(df_2x):,} routes")
print(f"10x: {len(df_10x):,} routes")
print(f"35x: {len(df_35x):,} routes")

Original: 7,496 routes
2x: 14,992 routes
10x: 112,440 routes
35x: 262,360 routes


In [7]:
print("=" * 60)
for dataframe in (df_2x, df_6x, df_10x, df_35x):
    print(f"Testing: {len(dataframe):,} Records!")
    print("\tNumpy Array")
    test_coords(dataframe, 'coordinates_array')

    print("\n\tList")
    test_coords(dataframe, 'list_coords')

    print("\n\tcontiguous_2d")
    test_coords(dataframe, 'contiguous_coords')
    print("=" * 60)
    print("")


Testing: 14,992 Records!
	Numpy Array
		Processed 14,992 routes in 0.8098 seconds
		Rate: 9256.7466 routes/second

	List
		Processed 14,992 routes in 0.3781 seconds
		Rate: 19824.5972 routes/second

	contiguous_2d
		Processed 14,992 routes in 0.3459 seconds
		Rate: 21671.5821 routes/second

Testing: 37,480 Records!
	Numpy Array
		Processed 37,480 routes in 2.0028 seconds
		Rate: 3742.8497 routes/second

	List
		Processed 37,480 routes in 1.0121 seconds
		Rate: 7406.5958 routes/second

	contiguous_2d
		Processed 37,480 routes in 0.8655 seconds
		Rate: 8660.9901 routes/second

Testing: 112,440 Records!
	Numpy Array
		Processed 112,440 routes in 6.1878 seconds
		Rate: 1211.4184 routes/second

	List
		Processed 112,440 routes in 2.8069 seconds
		Rate: 2670.5847 routes/second

	contiguous_2d
		Processed 112,440 routes in 2.6264 seconds
		Rate: 2854.0772 routes/second

Testing: 262,360 Records!
	Numpy Array
		Processed 262,360 routes in 14.4903 seconds
		Rate: 517.3101 routes/second

	List
	

# Batch encoding polylines

We've been a bit unfair to the process so far, as all of the benchmarks are taking into account 
BOTH coordinate parsing *and* creating polylines.

Let's get out of a single CPU and show what ``encode_batch()`` can do with this!


In [11]:
def test_batch_coords(df_name, coords):
  """
  Test Performance time creating polylines
  """
  start_time = time.time()
  df_name['polyline'] = rapidgeo.polyline.encode_column(df_name[coords])
  end_time = time.time()

  total_time = end_time - start_time
  routes_per_sec = len(df) / total_time

  print(f"\t\tProcessed {len(df_name):,} routes in {total_time:.4f} seconds")
  print(f"\t\tRate: {routes_per_sec:.4f} routes/second")

In [12]:
print("=" * 60)
for dataframe in (df_2x, df_6x, df_10x, df_35x):
    print(f"Testing: {len(dataframe):,} Records!")
    print("\tNumpy Array")
    test_batch_coords(dataframe, 'coordinates_array')

    print("\n\tList")
    test_batch_coords(dataframe, 'list_coords')

    print("\n\tcontiguous_2d")
    test_batch_coords(dataframe, 'contiguous_coords')
    print("=" * 60)
    print("")


Testing: 14,992 Records!
	Numpy Array
		Processed 14,992 routes in 0.0451 seconds
		Rate: 166326.3456 routes/second

	List
		Processed 14,992 routes in 0.0221 seconds
		Rate: 339017.7139 routes/second

	contiguous_2d
		Processed 14,992 routes in 0.0182 seconds
		Rate: 412659.1782 routes/second

Testing: 37,480 Records!
	Numpy Array
		Processed 37,480 routes in 0.1096 seconds
		Rate: 68390.2506 routes/second

	List
		Processed 37,480 routes in 0.0526 seconds
		Rate: 142576.9685 routes/second

	contiguous_2d
		Processed 37,480 routes in 0.0459 seconds
		Rate: 163374.8144 routes/second

Testing: 112,440 Records!
	Numpy Array
		Processed 112,440 routes in 0.3402 seconds
		Rate: 22033.3135 routes/second

	List
		Processed 112,440 routes in 0.1650 seconds
		Rate: 45440.7535 routes/second

	contiguous_2d
		Processed 112,440 routes in 0.1519 seconds
		Rate: 49350.2521 routes/second

Testing: 262,360 Records!
	Numpy Array
		Processed 262,360 routes in 0.8287 seconds
		Rate: 9045.1409 routes/sec

## Results

We have a dropoff as we get more results. If I were in a cluster, I could throw more computers at it.

But on a single machine, we can chunk and try to get the best results

In [13]:
def chunked_encode_with_assignment(df_sample, coord_column, chunk_size=10000, description=""):
    """
    FULL TEST: Process large datasets in chunks AND assign back to DataFrame
    This is the true apples-to-apples comparison with the original monolithic approach
    """
    print(f"\n=== Chunked Processing + DF Assignment: {description} ===")
    print(f"Total records: {len(df_sample):,} | Chunk size: {chunk_size:,}")

    coord_list = df_sample[coord_column].tolist()
    total_chunks = (len(coord_list) + chunk_size - 1) // chunk_size

    all_polylines = []
    total_time = 0

    print(f"Processing in {total_chunks} chunks...")

    # TIME THE ENTIRE WORKFLOW INCLUDING DATAFRAME OPERATIONS
    workflow_start = time.time()

    for i in range(0, len(coord_list), chunk_size):
        chunk = coord_list[i:i + chunk_size]
        actual_chunk_size = len(chunk)

        # Process chunk at peak performance
        start_time = time.time()
        chunk_polylines = rapidgeo.polyline.encode_column(chunk)
        chunk_time = time.time() - start_time

        total_time += chunk_time
        all_polylines.extend(chunk_polylines)

        chunk_rate = actual_chunk_size / chunk_time
        print(f"\t\tChunk {(i//chunk_size)+1}/{total_chunks}: {actual_chunk_size:,} records in {chunk_time:.3f}s = {chunk_rate:.0f} routes/sec")

    # ASSIGN BACK TO DATAFRAME (this is what we were missing!)
    assignment_start = time.time()
    df_sample['polyline_chunked'] = all_polylines
    assignment_time = time.time() - assignment_start

    total_workflow_time = time.time() - workflow_start

    # Overall stats INCLUDING DataFrame assignment
    processing_rate = len(df_sample) / total_time
    workflow_rate = len(df_sample) / total_workflow_time

    print(f"\nCOMPLETE WORKFLOW RESULTS:")
    print(f"\t\tProcessing time: {total_time:.3f}s")
    print(f"\t\tDataFrame assignment time: {assignment_time:.3f}s")
    print(f"\t\tTotal workflow time: {total_workflow_time:.3f}s")
    print(f"\t\tProcessing rate: {processing_rate:.0f} routes/second")
    print(f"\t\tComplete workflow rate: {workflow_rate:.0f} routes/second")
    print(f"\t\tAssignment overhead:  {(assignment_time/total_workflow_time)*100:.1f}%")

    return all_polylines

def compare_chunked_vs_monolithic(df_sample, coord_column, description=""):
    """
    Direct comparison: chunked vs monolithic with full DataFrame  assignment
    """
    print(f"\nHEAD-TO-HEAD: {description}")

    # Test 1: Monolithic approach (original)
    print("1️MONOLITHIC (original approach):")
    start_time = time.time()
    df_sample['polyline_mono'] = rapidgeo.polyline.encode_column(df_sample[coord_column])
    mono_time = time.time() - start_time
    mono_rate = len(df_sample) / mono_time
    print(f"\t\tMonolithic: {len(df_sample):,} records in {mono_time:.3f}s = {mono_rate:.0f} routes/sec")

    # Test 2: Chunked approach  
    print("2️CHUNKED (optimized approach):")
    chunked_polylines = chunked_encode_with_assignment(df_sample, coord_column, chunk_size=10000, description="")

    # Verify results are identical
    if df_sample['polyline_mono'].equals(df_sample['polyline_chunked']):
        print("Results match!")
    else:
        print("Results differ!")

    return mono_rate

# Run the COMPLETE comparison
print("COMPLETE WORKFLOW COMPARISON")
print("Testing full pipeline: processing + DataFrame assignment")

datasets = [
    (df, "Original (7.5k)"),
    (df_2x, "2x (15k)"),
    (df_6x, "5x (37k)"),
    (df_10x, "15x (112k)"),
    (df_35x, "35x (262k)")
]

for dataset, name in datasets:
    mono_rate = compare_chunked_vs_monolithic(dataset.copy(), 'contiguous_coords', description=name)

COMPLETE WORKFLOW COMPARISON
Testing full pipeline: processing + DataFrame assignment

HEAD-TO-HEAD: Original (7.5k)
1️MONOLITHIC (original approach):
		Monolithic: 7,496 records in 0.009s = 811347 routes/sec
2️CHUNKED (optimized approach):

=== Chunked Processing + DF Assignment:  ===
Total records: 7,496 | Chunk size: 10,000
Processing in 1 chunks...
		Chunk 1/1: 7,496 records in 0.009s = 843836 routes/sec

COMPLETE WORKFLOW RESULTS:
		Processing time: 0.009s
		DataFrame assignment time: 0.002s
		Total workflow time: 0.011s
		Processing rate: 843836 routes/second
		Complete workflow rate: 686579 routes/second
		Assignment overhead:  17.6%
Results match!

HEAD-TO-HEAD: 2x (15k)
1️MONOLITHIC (original approach):
		Monolithic: 14,992 records in 0.016s = 931074 routes/sec
2️CHUNKED (optimized approach):

=== Chunked Processing + DF Assignment:  ===
Total records: 14,992 | Chunk size: 10,000
Processing in 2 chunks...
		Chunk 1/2: 10,000 records in 0.011s = 923205 routes/sec
		Chunk 2/2: 4