# Benchmark: Polymo vs Spark UDF for REST API ingestion

This notebook compares the performance of Polymo's native Spark DataSource against a plain Spark UDF that fetches REST API data row-by-row. It uses the public JSONPlaceholder service so no credentials are required.

## Prerequisites

- Install the project extras required for the Builder / DataSource: `pip install "polymo[builder]"`
- Ensure PySpark 4.x is available (Polymo's minimum requirement).
- The JSONPlaceholder API is rate-limited; keep batch sizes modest for repeatable results.

In [None]:
from pathlib import Path
from time import perf_counter
import tempfile
import textwrap

import requests
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

from polymo import ApiReader

spark = SparkSession.builder.appName("polymo-benchmark").master("local[*]").getOrCreate()
spark.sparkContext.setLogLevel('WARN')
spark.dataSource.register(ApiReader)

config_yaml = textwrap.dedent('''
version: 0.1
source:
  type: rest
  base_url: https://jsonplaceholder.typicode.com
stream:
  name: posts
  path: /posts
  params:
    _limit: 100
  pagination:
    type: none
  infer_schema: true
''').strip()

config_dir = Path(tempfile.mkdtemp(prefix='polymo-benchmark-'))
config_path = config_dir / 'jsonplaceholder.yml'
config_path.write_text(config_yaml)
config_path

In [None]:
def benchmark_polymo(config_path: Path) -> dict:
    start = perf_counter()
    df = spark.read.format('polymo').option('config_path', str(config_path)).load()
    count = df.count()
    elapsed = perf_counter() - start
    return {'approach': 'Polymo DataSource', 'rows': count, 'seconds': elapsed}

def benchmark_udf() -> dict:
    ids = spark.range(1, 101).toDF('id')

    def fetch_post(post_id: int) -> str:
        url = f'https://jsonplaceholder.typicode.com/posts/{post_id}'
        response = requests.get(url, timeout=30)
        response.raise_for_status()
        return response.text

    spark.udf.register('fetch_post', fetch_post, StringType())

    schema = StructType([
        StructField('userId', IntegerType()),
        StructField('id', IntegerType()),
        StructField('title', StringType()),
        StructField('body', StringType()),
    ])

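    start = perf_counter()
    # Invoke the registered UDF through a SQL expression; col('fetch_post(id)') would look up
    # a column literally named 'fetch_post(id)', which does not exist.
    udf_df = ids.selectExpr('fetch_post(id) AS raw').select(from_json(col('raw'), schema).alias('post'))
    # Filter on the parsed column so count() depends on the UDF output; without this,
    # Spark may prune the unused projection and skip the HTTP calls.
    count = udf_df.where(col('post').isNotNull()).count()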
    elapsed = perf_counter() - start
    return {'approach': 'Spark UDF (requests per row)', 'rows': count, 'seconds': elapsed}


In [None]:
results = [
    benchmark_polymo(config_path),
    benchmark_udf(),
]
results

In [None]:
import pandas as pd
summary = pd.DataFrame(results)
summary['throughput_rows_per_sec'] = summary['rows'] / summary['seconds']
summary

## Interpretation

The Polymo reader issues a single batched request (the configuration sets `_limit: 100` and `pagination: none`, so all 100 posts arrive in one response) and then runs the usual Spark pipeline, whereas the UDF issues one HTTP GET per row, 100 in total. In practice you should see Polymo outperform the row-by-row UDF in both total runtime and rows/sec.
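
The gap can be estimated with a back-of-the-envelope model. The sketch below is illustrative only: the helper `estimated_request_seconds`, the 0.1 s latency, and the parallelism of 8 are hypothetical placeholders, not measurements from this notebook, and the model ignores Spark scheduling, JSON parsing, and rate limiting.

```python
import math


def estimated_request_seconds(
    rows: int,
    latency_s: float,
    rows_per_request: int = 1,
    parallelism: int = 1,
) -> float:
    """Rough wall-clock estimate for fetching `rows` records over HTTP.

    Assumes every request takes `latency_s` seconds and that up to
    `parallelism` requests run at the same time.
    """
    requests_needed = math.ceil(rows / rows_per_request)
    # Requests run in waves of `parallelism`; each wave costs one latency.
    waves = math.ceil(requests_needed / parallelism)
    return waves * latency_s


# Hypothetical inputs: 100 rows, 0.1 s per request, 8 concurrent tasks.
per_row = estimated_request_seconds(100, 0.1, rows_per_request=1, parallelism=8)
batched = estimated_request_seconds(100, 0.1, rows_per_request=100, parallelism=8)
print(f"per-row: {per_row:.2f}s, batched: {batched:.2f}s")
```

Under these assumptions the per-row approach needs 13 waves of requests while the batched approach needs one, so request latency dominates the difference long before per-row Python overhead does.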

Absolute numbers will vary with the API and its pagination settings, but the pattern typically holds: pushing the REST logic down into the DataSource avoids an HTTP round trip and a Python UDF call for every row, and enables Spark to optimise the plan.
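
Because a single timed run is sensitive to warm-up and network jitter, it can help to repeat each benchmark and report the median. The helper below is a minimal sketch using only the standard library; `time_repeated` and its parameters are hypothetical names introduced here, and the workload in the example is a stand-in rather than a real HTTP call.

```python
from statistics import median
from time import perf_counter
from typing import Callable


def time_repeated(fn: Callable[[], object], repeats: int = 5, warmup: int = 1) -> dict:
    """Run `fn` several times and summarise the wall-clock samples."""
    # Untimed warm-up runs absorb one-off costs such as session start-up.
    for _ in range(warmup):
        fn()

    samples = []
    for _ in range(repeats):
        start = perf_counter()
        fn()
        samples.append(perf_counter() - start)

    return {
        'runs': repeats,
        'median_seconds': median(samples),
        'min_seconds': min(samples),
        'max_seconds': max(samples),
    }


# Stand-in workload; in the notebook this would wrap a benchmark function.
stats = time_repeated(lambda: sum(range(10_000)), repeats=3)
print(stats)
```

In this notebook the callable could be `lambda: benchmark_polymo(config_path)` or `benchmark_udf`; keep `repeats` small, since each UDF run issues 100 requests against a rate-limited API.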

In [None]:
spark.stop()