This notebook explores an example case where the baseline quant_gen parameterization has been shown to be a poor representation of an original PDF - especially as compared to the quant_piecewise_gen (constant) parameterization.

Sam Schmidt provided the foundation for this notebook by establishing some input multimodal PDFs with relatively sharp peaks, such that the P(z) value between the peaks was approximately 0. 

In the first cell, the input PDF is defined, and 51 quantiles are defined for the baseline comparison. The cell also builds a second set of points in two halves. The first half of the requested number uses evenly spaced quantiles (`percentiles`) to look up the corresponding values (`pcts`). The second half uses linearly spaced values (`other_pcts`) to retrieve the corresponding quantiles (`other_percentiles`) by interpolation. The two sets of (quantile, CDF) pairs are then combined and used to examine the quant_gen parameterization in comparison to the baseline and the quant_piecewise_gen parameterization.

In [None]:
import numpy as np
import qp
import scipy.stats as sps
import matplotlib.pyplot as plt
from scipy.interpolate import interp1d

mu = np.array([[0.0,1.1, 2.9], [0.5, 1.25, 2.8], [0.3, 1.9, 2.2]])
sig = np.array([[0.05,0.01,0.04], [0.05,0.01,0.02], [0.025, 0.01, 0.025]])

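# NOTE: the weights are equal here, whereas `makesamples` draws 1000/3000/1000 samples per component.
wt = np.array([[1,1,1], [1,1,1], [1,1,1]])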

def makesamples(mu, sig):
    allsamples=np.zeros(5000)
    sizes = [1000,3000, 1000]
    for i in range(3):
        samples = np.array([])
        for j in range(3):
            tmpnorm = sps.norm(loc=mu[i,j], scale=sig[i,j])
            samples=np.hstack((samples, np.array(tmpnorm.rvs(size=sizes[j]))))
        samps = np.array(samples).flatten()

        allsamples = np.vstack([allsamples, samps])
    return allsamples

samps = makesamples(mu,sig)
goodsamps = samps[1:]

# Number of quantiles to use for the baseline comparison
num_grid_points_baseline = 51
percentiles_baseline = np.linspace(0.001, 0.999, num_grid_points_baseline)
pcts_baseline = np.array([np.quantile(goodsamps[x], percentiles_baseline) for x in range(3)])

# Half the baseline quantile points are used to define the first set of quantiles.
num_grid_points = int(np.ceil(num_grid_points_baseline/2))
percentiles = np.linspace(0.001, 0.999, num_grid_points)
pcts = np.array([np.quantile(goodsamps[x], percentiles) for x in range(3)])

plt.plot(pcts_baseline[1], percentiles_baseline)
plt.scatter(pcts[1], percentiles, marker='.')
plt.xlim([-0.1,3.5])

do_other_percentiles = True
if do_other_percentiles:

    low = np.mean(pcts[1,0:2]) # mean of the first two points
    high = np.mean(pcts[1,-2:]) # mean of the last two points

    # define linear spacing of `num_grid_points` locations between the low and high values
    other_pcts = np.linspace(low, high, num_grid_points)

    # use linear interpolation to get the second half of percentile values
    # notice the difference on the plot below of the orange dots and blue line.
    # if instead we used the CDF function of the distribution, presumably there would be less difference.
    # But the spacing of the `percentiles_baseline` vs `percentiles` also plays a role.
    interp_function = interp1d(pcts[1], percentiles, kind='linear')
    other_percentiles = interp_function(other_pcts)

    # plot 'em so you see 'em.
    plt.scatter(other_pcts, other_percentiles, marker='.', color='orange')

    # combine original and interpolated quants and locations
    all_pcts = np.concatenate((pcts[1], other_pcts))
    all_percentiles = np.concatenate((percentiles, other_percentiles))

    # Sort and select the unique values (to avoid division by 0 errors when calculating derivatives)
    inds = np.argsort(all_pcts)

    combined_pcts = np.take_along_axis(all_pcts, inds, axis=0)
    combined_percentiles = np.take_along_axis(all_percentiles, inds, axis=0)

    output_pcts, inds = np.unique(combined_pcts, return_index=True)
    output_percentiles = np.take_along_axis(combined_percentiles, inds, axis=0)
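
A comment in the cell above speculates that evaluating the distribution's CDF directly, rather than interpolating between the sampled pairs, would reduce the mismatch between the orange dots and the blue line. The following is a minimal sketch of that alternative, using NumPy and `statistics.NormalDist` from the standard library. The component weights are an assumption: they follow the sample counts (1000, 3000, 1000) used in `makesamples`. The names `mixture_cdf`, `locations`, and `cdf_values`, and the range 0.4 to 2.9, are illustrative only.

```python
import numpy as np
from statistics import NormalDist

# Second row of `mu` and `sig` from the cell above.
means = [0.5, 1.25, 2.8]
stds = [0.05, 0.01, 0.02]
# Assumption: weights follow the sample counts (1000, 3000, 1000) in `makesamples`.
weights = np.array([1000, 3000, 1000]) / 5000

def mixture_cdf(x):
    """Evaluate the CDF of the Gaussian mixture at each location in x."""
    x = np.atleast_1d(x)
    components = [NormalDist(m, s) for m, s in zip(means, stds)]
    return np.array([
        sum(w * c.cdf(float(xi)) for w, c in zip(weights, components))
        for xi in x
    ])

# Hypothetical evenly spaced locations; the notebook derives its own from `low` and `high`.
locations = np.linspace(0.4, 2.9, 26)
cdf_values = mixture_cdf(locations)
```

The resulting `(locations, cdf_values)` pairs could take the place of `(other_pcts, other_percentiles)` when this exact mixture is the intended target.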

Baseline PDF using the piecewise linear quantile representation (i.e. `interpolate_multi_x_multi_y` instead of `evaluate_histo_multi_x_multi_y`).

In [None]:
qens = qp.Ensemble(qp.quant,data=dict(quants=percentiles_baseline, locs=pcts_baseline[1]))
qens.plot_native(xlim=(0,3))

The following includes the second set of quantiles, which are a linear interpolation between existing (quantile, CDF) pairs. The number of pairs of points for this representation is the same as above.

In [None]:
if do_other_percentiles:
    qens_2d = qp.Ensemble(qp.quant,data=dict(quants=output_percentiles, locs=output_pcts))
    qens_2d.plot_native(xlim=(0,3))

For reference, the following is the baseline piecewise_constant representation.

In [None]:
if do_other_percentiles:
    qens_pw_const = qp.Ensemble(qp.quant_piecewise,data=dict(quants=output_percentiles, locs=output_pcts))
    qens_pw_const.plot_native(xlim=(0,3))
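
As a rough illustration of what a piecewise constant representation means, the sketch below builds a histogram-like density from sorted pairs of locations and CDF values: within each interval the density is the change in CDF divided by the interval width. This is a simplified stand-in written for this notebook, not qp's implementation, and the function name `piecewise_constant_pdf` and the toy inputs are hypothetical.

```python
import numpy as np

def piecewise_constant_pdf(locs, quants, x):
    """Density that is constant between neighbouring (location, CDF value) pairs."""
    locs = np.asarray(locs, dtype=float)
    quants = np.asarray(quants, dtype=float)
    x = np.atleast_1d(np.asarray(x, dtype=float))
    # Height of each bin: probability mass in the interval divided by its width.
    heights = np.diff(quants) / np.diff(locs)
    # Index of the interval containing each x; points outside the range get density 0.
    idx = np.searchsorted(locs, x, side="right") - 1
    inside = (idx >= 0) & (idx < heights.size)
    pdf = np.zeros_like(x)
    pdf[inside] = heights[idx[inside]]
    return pdf

# Hypothetical toy input: three locations with CDF values 0.1, 0.5 and 0.9.
toy_locs = [0.0, 1.0, 3.0]
toy_quants = [0.1, 0.5, 0.9]
toy_pdf = piecewise_constant_pdf(toy_locs, toy_quants, [-1.0, 0.5, 2.0, 4.0])
# Total area under the steps equals the CDF range covered by the pairs.
toy_area = float(np.sum(np.diff(toy_quants)))
```

In this sketch the total area is the last CDF value minus the first, so pairs spanning 0.001 to 0.999 would integrate to 0.998 rather than 1.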

The following is the original PDF that is being approximated.

In [None]:
ens = qp.Ensemble(qp.mixmod, data=dict(means=mu, stds=sig, weights=wt))
ens[1].plot_native(xlim=(0,3))

The area under the curve is not equal to 1, but it does approach 1 as the number of (quantile, CDF) pairs increases.

In [None]:
grid = np.linspace(0,3.1,1_000_000)

pdf_values = qens.pdf(grid)
pdf_values = np.nan_to_num(pdf_values)
print(f'Linear-baseline constant gridding integral: {np.trapz(pdf_values, grid)}')

pdf_values = qens_2d.pdf(grid)
pdf_values = np.nan_to_num(pdf_values)
print(f'X/Y sampled CDF gridding integral: {np.trapz(pdf_values, grid)}')

pdf_values = qens_pw_const.pdf(grid)
pdf_values = np.nan_to_num(pdf_values)
print(f'PW constant gridding integral: {np.trapz(pdf_values, grid)}')
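
The three integrals above repeat the same steps: replace NaN values with zero, then apply the trapezoid rule. A small helper can capture that pattern. The sketch below is an assumption about how one might factor it out, and the name `integrate_pdf` is hypothetical. It also picks `np.trapezoid` when available, since `np.trapz` is deprecated as of NumPy 2.0.

```python
import numpy as np

# np.trapezoid is the NumPy 2.x name; older releases only provide np.trapz.
trapezoid = getattr(np, "trapezoid", None) or np.trapz

def integrate_pdf(pdf_values, grid):
    """Trapezoid-rule integral of PDF values on a grid, treating NaN as zero."""
    values = np.nan_to_num(np.squeeze(pdf_values))
    return float(trapezoid(values, grid))

# Sanity check on a standard normal, which should integrate to about 1.
check_grid = np.linspace(-8.0, 8.0, 10_001)
gauss = np.exp(-0.5 * check_grid**2) / np.sqrt(2.0 * np.pi)
area = integrate_pdf(gauss, check_grid)
```

With this helper, each integral in the cell above could be written as `integrate_pdf(qens.pdf(grid), grid)`, assuming the ensemble returns an array that squeezes to one dimension.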

The following is a single plot showing the original PDF being approximated, the piecewise constant PDF, and the X/Y sampled PDF.

In [None]:
xvals = np.linspace(0., 3., 101)
pw_pdf = qens_pw_const.pdf(xvals)
linear_2d_pdf = qens_2d.pdf(xvals)
original_pdf = ens[1].pdf(xvals)

fig = ens[1].plot_native(xlim=(0,3), label='Original PDF')

fig.plot(xvals, pw_pdf, linestyle='-.', label='Piecewise constant PDF')
fig.plot(xvals, linear_2d_pdf, color='blueviolet', linestyle='--', label='X/Y Sampled PDF')
fig.legend()

The following are plots of the residuals, i.e. the pairwise differences between the original, piecewise constant, and X/Y sampled PDFs.

In [None]:
fig, (ax0, ax1, ax2) = plt.subplots(3, 1, sharex=True)
ax0.set_xlim([0,3])

ax0.plot(xvals, np.squeeze(original_pdf) - pw_pdf, label='Original PDF - Piecewise PDF')
ax0.legend()
ax1.plot(xvals, pw_pdf - linear_2d_pdf, label='Piecewise PDF - X/Y sampled PDF')
ax1.legend()
ax2.plot(xvals, np.squeeze(original_pdf) - linear_2d_pdf, label='Original PDF - X/Y sampled PDF')
ax2.legend()
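
To complement the residual plots with numbers, one option is to reduce each residual curve to a couple of scalar summaries. The sketch below is one possible choice (maximum absolute residual and root-mean-square residual), not something the notebook itself computes. The function name `residual_summary` and the toy arrays are hypothetical.

```python
import numpy as np

def residual_summary(reference, approximation):
    """Return the maximum absolute and root-mean-square difference of two PDFs."""
    residual = np.squeeze(reference) - np.squeeze(approximation)
    return {
        "max_abs": float(np.max(np.abs(residual))),
        "rms": float(np.sqrt(np.mean(residual**2))),
    }

# Hypothetical toy arrays standing in for PDF values on a shared grid.
toy_reference = np.array([[1.0, 2.0, 3.0]])
toy_approximation = np.array([1.0, 2.0, 5.0])
summary = residual_summary(toy_reference, toy_approximation)
```

In the notebook this could be called as `residual_summary(original_pdf, pw_pdf)`. Note that with 101 points on [0, 3] the grid spacing is 0.03, which is wider than the 0.01 standard deviation of the narrowest component, so a finer `xvals` grid may change these numbers noticeably.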
