- approxQuantile() returns approximate quantiles (percentiles) of a numeric column
- It is usually faster than computing exact quantiles, which makes it practical on big datasets

- Syntax:
    - df.approxQuantile(col, probabilities, relativeError)

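- col: the numeric column you want to calculate quantiles on (a column name, or a list of names)
- probabilities: list of quantiles you want, each between 0 and 1 (e.g., 0.5 for the median)
- relativeError: allowed error, must be >= 0 (0 = exact, but that can be very expensive)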
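As a reference point for what is being approximated, here is a minimal pure-Python sketch of an exact quantile. The helper name `exact_quantile` and the rank convention (the value at 1-based rank `ceil(p * N)` in sorted order) are assumptions made for illustration; this is not how Spark computes its result.

```python
import math

def exact_quantile(values, p):
    """Hypothetical helper: value at 1-based rank ceil(p * N) in sorted order."""
    if not values:
        raise ValueError("values must be non-empty")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    ordered = sorted(values)
    # p = 0 would give rank 0, so clamp to the first element
    rank = max(1, math.ceil(p * len(ordered)))
    return ordered[rank - 1]

# The same ages used in the DataFrame in this notebook
ages = [24, 27, 27, 26, 28, 28, 27, 27]
reference = [exact_quantile(ages, p) for p in (0.0, 0.5, 1.0)]
print(reference)  # [24, 27, 28]
```

Sorting the whole column like this is cheap for 8 values but costly on a large distributed dataset, which is the case approxQuantile() is meant for.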

In [None]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("RowExample").getOrCreate()

In [None]:
data = [
    (1, "Manta", 75000, "IT", 24),
    (2, "Dipankar", 30000, "Post Master", 27),
    (3, "Souvik", 60000, "Army Officer", 27),
    (4, "Soukarjya", 45000, "BDO", 26),
    (5, "Arvind", 35000, "Business Data Analyst", 28),
    (6, "Prodipta", 25000, "Data Analyst", 28),
    (7, "Padma", 20000, "Data Analyst", 27),
    (8, "Panta", 125000, "Business Analyst", 27)
]

df = spark.createDataFrame(data, ["id", "name", "salary", "department", "age"])

# show full DataFrame
df.show()

#### Calculate Quantiles for Age Column

In [None]:
# Get the median (50th percentile)
# approxQuantile returns a list with one value per probability, so take element 0
median_age = df.approxQuantile("age", [0.5], 0.01)[0]
print("Median Age: ", median_age)

In [None]:
# get the 25th, 50th, and 75th percentiles
quantiles = df.approxQuantile("age", [0.25, 0.5, 0.75], 0.01)
print("25th, 50th, and 75th Percentiles: ", quantiles)

In [None]:
# get min (0th), median (50th), max (100th) percentiles
min_median_max = df.approxQuantile("age", [0.0, 0.5, 1.0], 0.01)
print("Min, Median, Max Age: ", min_median_max)

#### How relativeError works

- Lower relativeError = more accurate but slower
- Higher relativeError = less accurate but faster
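The Spark documentation states the accuracy guarantee in terms of ranks: for N rows, probability p and relative error err, the returned value x satisfies floor((p - err) * N) <= rank(x) <= ceil((p + err) * N). The sketch below only evaluates that rank window; `rank_bounds` is a hypothetical helper, and clamping the window to the range 0..N is an added assumption.

```python
import math

def rank_bounds(p, err, n):
    """Hypothetical helper: rank window allowed by the documented guarantee."""
    # Clamping to [0, n] is an assumption added here for readability
    low = max(0, math.floor((p - err) * n))
    high = min(n, math.ceil((p + err) * n))
    return low, high

# Window for the median with 8 rows (as in this notebook) and with 1000 rows
small = {err: rank_bounds(0.5, err, 8) for err in (0.01, 0.1)}
large = {err: rank_bounds(0.5, err, 1000) for err in (0.01, 0.1)}
print("8 rows:   ", small)
print("1000 rows:", large)
```

With only 8 rows, both tolerances give the same window for the median (ranks 3 to 5), so the two settings are unlikely to give noticeably different results on this DataFrame. The window for 0.1 becomes much wider than the one for 0.01 as the row count grows, which is where the speed vs accuracy trade-off starts to matter.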

In [None]:
# Set relativeError to 0.1 (less accurate but faster)
quantiles_with_error = df.approxQuantile("age", [0.25, 0.5, 0.75], 0.1)
print("Quantiles with higher relative error: ", quantiles_with_error)

#### Quick Summary

- approxQuantile() is useful when you need quick estimates of quantiles
- It works well on large datasets, where exact quantiles would be expensive
- relativeError lets you control the speed vs accuracy trade-off