# Exploring the BPZ Test Data

_Alex Malz & Phil Marshall_

We have a small dataset to test our `qp` approximations on: 30,000 photometric redshift 1D posterior PDFs, in "gridded" format, from Melissa Graham (UW, LSST). In this notebook we visualize these distributions, and develop machinery to evaluate our approximations on the whole set in "survey mode." 

## Set-up, Ingest

In [None]:
from __future__ import print_function

%load_ext autoreload
%autoreload 2
    
import numpy as np
import random
import pandas as pd
pd.set_option('display.max_columns', None)

import matplotlib.pyplot as plt
%matplotlib inline

import qp

The data file doesn't appear to come with redshifts at which the PDFs are evaluated, but we are told they're evenly spaced between 0.01 and 3.51.

In [None]:
z = np.arange(0.01, 3.51, 0.01, dtype='float')
zrange = 3.51 - 0.01

The PDFs in the data file aren't properly normalized.  In order to be PDFs, we want $\int\ p(z)\ dz=1$, but the data file entries satisfy $\sum_{z}\ p(z)=1$, which is not the same.  We approximate the desired integral as $\int\ p(z)\ dz\ \approx\ \Delta z\ \sum_{i=1}^{N}\ p(z_{i})$, where $\Delta z=\frac{z_{max}-z_{min}}{N}$ is the spacing between neighboring points of the $N$-point redshift grid on which the PDF is evaluated.
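
As a quick illustration of this rescaling, here is a minimal sketch on a synthetic Gaussian rather than the BPZ data. The names `z_grid`, `raw`, `delta_z` and `pdf` are placeholders chosen for this example only.

```python
import numpy as np

# Synthetic stand-in for one row of the data file: values that sum to 1.
z_grid = np.arange(0.01, 3.51, 0.01)
raw = np.exp(-0.5 * ((z_grid - 1.0) / 0.2) ** 2)
raw /= raw.sum()  # sum_i p(z_i) = 1, like the file entries

# Grid spacing, Delta z = (z_max - z_min) / N.
delta_z = (3.51 - 0.01) / len(z_grid)

# Rescale so that Delta z * sum_i p(z_i) = 1.
pdf = raw / delta_z

# Riemann-sum approximation to the integral of p(z) dz.
riemann_integral = delta_z * pdf.sum()
print(riemann_integral)
```

Dividing by $\Delta z$ turns per-grid-point probabilities into a density, which is the form in which the PDFs are passed to `qp.PDF` as a gridded parametrization.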

In [None]:
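with open('bpz_euclid_test_10_2.probs', 'r') as data_file:
    lines = (line.split(None) for line in data_file)
    next(lines)  # skip the first line (works in Python 2 and 3)
    # next(lines)
    # Drop the first column of each row and keep the gridded p(z) values
    pdfs = np.array([[float(line[k]) for k in range(1,len(line))] for line in lines])
    pdf_shape = np.shape(pdfs)
    #print(np.sum(pdfs, axis=1)[:100] / zrange)
    # Delta z = (z_max - z_min) / N, so that Delta z * sum_i p(z_i) = 1
    norm_factor = zrange / pdf_shape[1]
    pdfs /= norm_factor
    # Sanity check: the Riemann sum of each PDF should now be close to 1
    print(np.sum(pdfs * norm_factor, axis=1)[:100])
# Round-trip through the log, presumably to keep entries away from exact zero
log_pdfs = qp.utils.safelog(pdfs)
pdfs = np.exp(log_pdfs)
print(np.sum(pdfs * norm_factor, axis=1)[:100])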

## Visualizing the BPZ $p(z)$'s

Let's plot a few interesting PDFs from the dataset.

In [None]:
indices = [1, 3, 14, 16, 19, 21]
colors = ['red','green','blue','cyan','magenta','yellow']
for i in range(len(colors)):
    plt.plot(z, pdfs[indices[i]], color=colors[i], label='Galaxy '+str(indices[i]))
plt.xlabel('redshift $z$', fontsize=16)
plt.legend();

Now, let's turn one of them into a `qp.PDF` object initialized with a gridded parametrization.

In [None]:
# chosen = random.choice(indices)
# print(chosen)

chosen=14
G = qp.PDF(gridded=(z, pdfs[chosen]))
G.plot()

## Approximating the BPZ $p(z)$'s


Quantile and histogram representations cannot be computed directly from gridded PDFs. We first need to fit a Gaussian mixture model (GMM), and then instantiate a `qp.PDF` object using a `qp.composite` object based on that GMM as `qp.PDF.truth`.  Currently, a GMM can only be fit to samples, so we start by sampling our gridded parametrization.
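
For intuition about the sampling step, here is a minimal sketch of one common way to draw samples from a gridded PDF, namely inverse transform sampling. This is an assumption made for illustration and not necessarily what `qp.PDF.sample` does internally. The helper `sample_gridded` and the toy Gaussian are hypothetical.

```python
import numpy as np

def sample_gridded(z_grid, pdf, n_samples, seed=0):
    """Draw samples from a gridded PDF by inverting its cumulative sum."""
    rng = np.random.RandomState(seed)
    # Build a CDF on the grid and normalize it to end at exactly 1.
    cdf = np.cumsum(pdf)
    cdf = cdf / cdf[-1]
    # Map uniform draws through the inverse CDF by linear interpolation.
    u = rng.uniform(0.0, 1.0, size=n_samples)
    return np.interp(u, cdf, z_grid)

# Toy gridded PDF: a narrow Gaussian centered on z = 1.
z_grid = np.arange(0.01, 3.51, 0.01)
toy_pdf = np.exp(-0.5 * ((z_grid - 1.0) / 0.1) ** 2)

toy_samples = sample_gridded(z_grid, toy_pdf, 1000)
print(toy_samples.mean(), toy_samples.std())
```

Samples drawn this way are confined to the range of the grid, and a mixture model can then be fit to them.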

In [None]:
G.sample(1000, vb=False)
G.plot()

Now that there are samples, we can fit the GMM, producing a `qp.composite` object.

In [None]:
M_dist = G.mix_mod_fit(n_components=2, vb=False)
G.plot(vb=False)

The `qp.composite` object can be used as the `qp.PDF.truth` to initialize a new `qp.PDF` object that doesn't have any information about the gridded or sample approximations.  Now we can approximate it any way we like!

In [None]:
M = qp.PDF(truth=M_dist)
M.quantize(vb=False)
M.histogramize(vb=False)
M.sample(N=100,vb=False)
M.plot(vb=False)

## Quantifying the Accuracy of the Approximation

Let's start by computing the RMSE and KLD between each approximation and the truth, in a sample of systems - and then graduate to looking at the estimated $n(z)$. We'll need a function to do all the analysis on a single object, and then accumulate the outputs to analyze them.
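
To make the two metrics concrete, the sketch below evaluates them for two PDFs tabulated on a shared grid, with the KLD approximated by a Riemann sum. The functions `grid_kld` and `grid_rmse` are hypothetical helpers written for this example. They are not the implementations behind `qp.calculate_kl_divergence` and `qp.calculate_rmse`, whose details may differ.

```python
import numpy as np

def grid_kld(p, q, dx, eps=1e-300):
    """Riemann-sum estimate of KL(p || q) = int p log(p / q) dx."""
    # Clip to avoid log(0); eps is an arbitrary floor chosen for this sketch.
    p = np.clip(p, eps, None)
    q = np.clip(q, eps, None)
    return np.sum(p * (np.log(p) - np.log(q))) * dx

def grid_rmse(p, q):
    """Root-mean-square difference between two PDFs on the same grid."""
    return np.sqrt(np.mean((p - q) ** 2))

def gaussian(x, mu, sigma):
    """Normalized Gaussian density evaluated at x."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

dx = 0.01
grid = np.arange(0.0, 3.5, dx)
p_true = gaussian(grid, 1.0, 0.2)    # stand-in "truth"
q_approx = gaussian(grid, 1.1, 0.2)  # stand-in approximation, shifted in z

# For equal-width Gaussians the analytic KLD is (mu_p - mu_q)^2 / (2 sigma^2).
kld_value = grid_kld(p_true, q_approx, dx)
rmse_value = grid_rmse(p_true, q_approx)
print(kld_value, rmse_value)
```

Note that the KLD is asymmetric in its arguments, so the order of truth and approximation matters, whereas the RMSE is symmetric.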

In [None]:
def analyze(chosen, vb=False, z=None):
    """
    Model the input BPZ P(z) as a GMM, approximate that GMM in 
    various ways, and assess the quality of each approximation.
    
    Parameters
    ----------
    chosen : int
        ID of galaxy
    vb : boolean
        Verbose output?
    z : ndarray of float
        Redshift array for input gridded "truth". Used for 
        evaluating n(z) too

    Returns
    -------
    result : dict
        Dictionary containing metric values, n(z) on standard 
        grid, samples, "true" GMM gridded p(z).
        
    Notes
    -----
    In some cases the GMM does not fit well, leading to bad KLD and 
    RMSE values when it is compared to the truth.
    
    """
    # Make z array if we don't already have it:
    if z is None:
        z = np.arange(0.01, 3.51, 0.01, dtype='float')

    # Make a dictionary to contain the results:     
    result = {}
    
    # Make a GMM model of the input BPZ p(z) (which are stored
    # in the global 'pdfs' variable):
    G = qp.PDF(gridded=(z, pdfs[chosen]), vb=vb)
    
    # Draw 1000 samples, fit a GMM model to them, and make a true PDF:
    G.sample(1000, vb=vb)
    GMM = G.mix_mod_fit(n_components=5, vb=vb)
    P = qp.PDF(truth=GMM, vb=vb)
    
    # Evaluate the GMM on the z grid, and store in the result dictionary. We'll 
    # need this to make our "true" n(z) estimator. We don't need to keep the 
    # z array, as we passed that in.
    result['truth'] = P.evaluate(z, using='truth', vb=vb)[1]

    # Now approximate P in various ways, and assess:
    Q, KLD, RMSE, approximation = {}, {}, {}, {}
    zlimits, dz = [0.0, 3.5], 0.01
    Q['quantiles'] = qp.PDF(quantiles=P.quantize(N=100, vb=vb), vb=vb)
    Q['histogram'] = qp.PDF(histogram=P.histogramize(N=100, binrange=zlimits, vb=vb), vb=vb)
    Q['samples'] = qp.PDF(samples=P.sample(N=100, vb=vb), vb=vb)
    for k in Q.keys():
        KLD[k] = qp.calculate_kl_divergence(P, Q[k], limits=zlimits, dx=dz, vb=vb)
        RMSE[k] = qp.calculate_rmse(P, Q[k], limits=zlimits, dx=dz, vb=vb)
        approximation[k] = Q[k].evaluate(z, using=k, vb=vb)[1]
        
    # Store approximations:
    result['KLD'] = KLD
    result['RMSE'] = RMSE
    result['approximation'] = approximation
    result['samples'] = Q['samples'].samples
    
    return result

In [None]:
x = analyze(14, z=z, vb=False)

In [None]:
x.keys()

In [None]:
print(x['approximation']['quantiles'].shape)
print(x['approximation']['histogram'].shape)
print(x['approximation']['samples'].shape)
print(x['truth'].shape)
print(x['samples'].shape)

In [None]:
x['samples']

OK, now let's loop over the first 100 galaxies, and look at the distribution of metric values.

In [None]:
%%time
results = []
for i in range(100):
    results.append(analyze(i, z=z))
    if i%10 == 0: print('.', end='')

There is almost certainly a better way of collating the KLD values out of all our results dictionaries than with a for loop, but I don't know what it is.
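
One possibility, sketched here on toy data, is a dictionary comprehension wrapped in a small helper. The `collate` function and the `toy_results` list are hypothetical. They only mimic the nested layout of the dictionaries returned by `analyze`.

```python
import numpy as np

# Hypothetical stand-in for the list built from repeated calls to analyze().
toy_results = [
    {'KLD': {'samples': 0.3, 'quantiles': 0.1},
     'RMSE': {'samples': 0.2, 'quantiles': 0.05}},
    {'KLD': {'samples': 0.5, 'quantiles': 0.2},
     'RMSE': {'samples': 0.4, 'quantiles': 0.10}},
]

def collate(results, metric):
    """Gather one metric into a dict of arrays keyed by approximation name."""
    return {name: np.array([r[metric][name] for r in results])
            for name in results[0][metric]}

toy_KLD = collate(toy_results, 'KLD')
toy_RMSE = collate(toy_results, 'RMSE')
print(toy_KLD['samples'])
```

This also avoids the repeated `np.append` calls, each of which copies the whole array.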

In [None]:
KLD, RMSE = {}, {}
for approximation in results[0]['KLD'].keys():
    x = np.array([])
    for k in range(len(results)):
        x = np.append(x, results[k]['KLD'][approximation])
    KLD[approximation] = x
    x = np.array([])
    for k in range(len(results)):
        x = np.append(x, results[k]['RMSE'][approximation])
    RMSE[approximation] = x

Now let's plot histograms of the metric values.

In [None]:
colors = {'samples':'red', 'quantiles':'green', 'histogram':'blue'}
plt.figure(figsize=(12, 4))

# Lefthand panel: KLD
plt.subplot(1, 2, 1)
bins = np.linspace(0.0, 5, 25)
for k in ['samples', 'quantiles', 'histogram']:
    plt.hist(KLD[k], bins, label=k, fc=colors[k], ec=colors[k], alpha=0.3)
plt.xlabel('KL Divergence Metric', fontsize=16)
plt.ylim(0.1, 100.0)
plt.legend()

# Righthand panel: RMSE
plt.subplot(1, 2, 2)
bins = np.linspace(0.0, 5, 25)
for k in ['samples', 'quantiles', 'histogram']:
    plt.hist(RMSE[k], bins, label=k, fc=colors[k], ec=colors[k], alpha=0.3)
plt.xlabel('RMS Error Metric', fontsize=16)
plt.ylim(0.1, 100.0)
plt.legend();

Interesting: looks like the quantile approximation does better in RMSE, but only slightly better in KLD. Histogram does noticeably worse in both metrics.

KLD seems to flag more "bad" approximations than RMSE. How do we know where to set the threshold in each metric? 

Now let's compute the estimated $n(z)$. We'll do this with the GMM "truth", and then using each of our approximations. And we'll normalize the $n(z)$ to account for lost systems with bad approximations.

In [None]:
n = {}

# Pull out all truths and compute the average at each z:
x = np.zeros([len(z), len(results)])
y = {}
for approx in ['samples', 'quantiles', 'histogram']:
    y[approx] = np.zeros([len(z), len(results)])
    for k in range(len(results)):
         y[approx][:,k] = results[k]['approximation'][approx] 
for k in range(len(results)):
    x[:,k] = results[k]['truth'] 

# Now do the averaging to make the estimators:
n['truth'] = np.mean(x, axis=1)
for approx in ['samples', 'quantiles', 'histogram']:
    n[approx] = np.mean(y[approx], axis=1)

# Note: this uses the samples' KDE to make the approximation. We could (and 
# should!) also try simply concatenating the samples and histogramming them.
    
# Plot truth and all the approximations. 
# The NaNs in the histogram approximation make that unplottable for now.
plt.plot(z, n['truth'], color='black', lw=4, alpha=0.3, label='truth')
for k in ['samples', 'quantiles', 'histogram']:
    plt.plot(z, n[k], label=k, color=colors[k])
plt.xlabel('redshift z')
plt.ylabel('n(z)')
plt.legend();
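
The alternative mentioned in the comments above, concatenating the samples and histogramming them, could look roughly like the following sketch. It uses synthetic samples in place of `results[k]['samples']` and assumes each entry is a 1D array. The names `toy_results`, `all_samples` and `bin_edges` are placeholders, and the binning is arbitrary.

```python
import numpy as np

rng = np.random.RandomState(42)
# Hypothetical stand-in for results[k]['samples']: 100 samples per galaxy.
toy_results = [{'samples': rng.normal(1.0, 0.2, size=100)} for _ in range(50)]

# Pool every galaxy's samples into one long array.
all_samples = np.concatenate([r['samples'] for r in toy_results])

# density=True normalizes the histogram to integrate to 1 over the bins.
bin_edges = np.linspace(0.0, 3.5, 36)
n_hist, _ = np.histogram(all_samples, bins=bin_edges, density=True)

bin_width = bin_edges[1] - bin_edges[0]
print(np.sum(n_hist * bin_width))
```

Unlike the KDE-based stack, this estimator needs a choice of bins rather than a kernel bandwidth.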

The "samples" approximation seems to give a slightly better result for the $n(z)$ estimator than the "quantiles" approximation; "histogram" is noticeably worse than both. Let's use the `qp.PDF` object to compare them quantitatively (since $n(z)$ can be normalized to give the global $p(z)$).

In [None]:
p = {}
for k in ['samples', 'quantiles', 'histogram']:
    p[k] = qp.PDF(gridded=(z,n[k]), vb=False)

p['truth'] = qp.PDF(gridded=(z,n['truth']), vb=False)

In [None]:
KLD, RMSE = {}, {}
zlimits, dz = [0.0, 3.5], 0.01
for k in ['samples', 'quantiles', 'histogram']:
    KLD[k] = qp.calculate_kl_divergence(p['truth'], p[k], limits=zlimits, dx=dz, vb=False)
    RMSE[k] = qp.calculate_rmse(p['truth'], p[k], limits=zlimits, dx=dz, vb=False)

In [None]:
print('KLD metrics for n(z) estimator: ', KLD)
print('RMSE metrics for n(z) estimator: ', RMSE)

This early indication suggests that all three approximations are fairly closely matched in these metrics. The rank order of the three methods is the same when the $n(z)$ estimates are compared with the KLD metric and the RMSE metric: from best to worst we have "histogram", "samples" and "quantiles." A bigger test, using the full dataset, should allow this to be tested further: jack-knife error bars should also be calculable.
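
As a sketch of how such jack-knife error bars might be computed for a stacked $n(z)$, the delete-one jackknife below estimates the standard error of the mean at each redshift. The helper `jackknife_mean_error` and the random `toy_pz` array are hypothetical. The function expects one row per galaxy, so arrays laid out with galaxies along the second axis (as in the averaging cell above) would need transposing first.

```python
import numpy as np

def jackknife_mean_error(per_galaxy):
    """Delete-one jackknife standard error of the mean over galaxies (axis 0)."""
    per_galaxy = np.asarray(per_galaxy, dtype=float)
    n_gal = per_galaxy.shape[0]
    total = per_galaxy.sum(axis=0)
    # Leave-one-out means, one row per deleted galaxy.
    loo_means = (total - per_galaxy) / (n_gal - 1)
    spread = loo_means - loo_means.mean(axis=0)
    return np.sqrt((n_gal - 1.0) / n_gal * np.sum(spread ** 2, axis=0))

rng = np.random.RandomState(0)
# Toy stand-in: 100 galaxies, each with p(z) evaluated at 5 redshifts.
toy_pz = rng.uniform(0.0, 1.0, size=(100, 5))

jk_error = jackknife_mean_error(toy_pz)
print(jk_error)
```

For a simple mean the jackknife error reduces to the usual standard error, so its value lies mainly in extending to derived quantities such as the KLD or RMSE of the stacked estimator, by recomputing the metric with each galaxy left out in turn.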

A different set of quantile points may give a different result. Also, it would be interesting to see whether the conclusions about the choice of approximation vary as the number of available stored values is varied away from 100 (to, perhaps, 3, 10, 30, 100, 300).