# Exploring the BPZ Test Data

_Alex Malz & Phil Marshall_

We have a small dataset to test our `qp` approximations on: 30,000 photometric redshift 1D posterior PDFs, in "gridded" format, from Melissa Graham (UW, LSST). In this notebook we visualize these distributions, and develop machinery to evaluate our approximations on the whole set in "survey mode." 

## Set-up, Ingest

In [None]:
import numpy as np
import random

import matplotlib.pyplot as plt
%matplotlib inline

import qp

The data file doesn't appear to come with the redshifts at which the PDFs are evaluated, but we are told they're evenly spaced between 0.01 and 3.51.

In [None]:
z = np.arange(0.01, 3.51, 0.01, dtype='float')
zrange = 3.51-0.01

The PDFs in the data file aren't properly normalized. In order to be PDFs, we want $\int\ p(z)\ dz=1$, but the data file entries satisfy $\sum_{z}\ p(z)=1$, which is not the same. We approximate the desired integral as $\int\ p(z)\ dz\ \approx\ \Delta z\ \sum_{i=1}^{N}\ p(z_{i})$, where $\Delta z=\frac{z_{max}-z_{min}}{N}$ is the spacing between neighboring redshifts on the grid of $N$ points at which the PDF is evaluated. Dividing each entry by $\Delta z$ therefore turns the unit sum into a unit integral.

In [None]:
with open('bpz_euclid_test_10_2.probs', 'rb') as data_file:
    lines = (line.split(None) for line in data_file)
    # skip the first line of the file
    next(lines)
    # skip the first column of each row and parse the rest as floats
    pdfs = np.array([[float(line[k]) for k in range(1,len(line))] for line in lines])
    pdf_shape = np.shape(pdfs)
    # grid spacing Delta z
    norm_factor = zrange / pdf_shape[1]
    pdfs /= norm_factor
    # check: Delta z * sum_i p(z_i) should be 1 for every PDF
    print(np.sum(pdfs * norm_factor, axis=1)[:100])
# round-trip through safelog and exp (intended to remove exact zeros)
log_pdfs = qp.utils.safelog(pdfs)
pdfs = np.exp(log_pdfs)
# repeat the normalization check after the round-trip
print(np.sum(pdfs * norm_factor, axis=1)[:100])
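
To see the normalization at work without the data file, here is a minimal sketch on a synthetic stand-in. The Gaussian bump and the names `z_grid`, `raw`, `delta_z`, and `pdf` are assumptions made for illustration; they are not part of the dataset or of `qp`.

```python
import numpy as np

# Hypothetical stand-in for one row of the data file: a Gaussian bump
# evaluated on the notebook's grid and scaled so that its entries sum to 1.
z_grid = np.arange(0.01, 3.51, 0.01)
raw = np.exp(-0.5 * ((z_grid - 1.0) / 0.2) ** 2)
raw /= raw.sum()  # sum-normalized, like the file entries

# Grid spacing: Delta z = (z_max - z_min) / N
delta_z = (3.51 - 0.01) / len(z_grid)

# Dividing by Delta z turns the unit sum into a unit Riemann-sum integral.
pdf = raw / delta_z
integral = delta_z * pdf.sum()
print(raw.sum(), integral)
```

Both printed numbers should be 1 to within floating-point error: the first is the sum of the raw entries, the second is the approximate integral of the rescaled PDF.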

## Visualizing the BPZ $p(z)$'s

Let's plot a few interesting PDFs from the dataset.

In [None]:
indices = [1, 3, 14, 16, 19, 21]
colors = 'rgbcmy'
for i in range(len(colors)):
    plt.plot(z, pdfs[indices[i]], color=colors[i])
plt.xlabel('redshift $z$', fontsize=16)

Now, let's turn one of them into a `qp.PDF` object initialized with a gridded parametrization.

In [None]:
# chosen = random.choice(indices)
# print(chosen)

chosen=14
G = qp.PDF(gridded=(z, pdfs[chosen]))
G.plot()

## Approximating the BPZ $p(z)$'s


Quantile and histogram representations cannot be computed directly from gridded PDFs, so we first need to fit a Gaussian mixture model (GMM). A `qp.composite` object based on that GMM then serves as the `qp.PDF.truth` of a new `qp.PDF` object. Currently, a GMM can only be fit to samples, so we start by sampling our gridded parametrization.

In [None]:
G.sample(1000, vb=False)
G.plot()
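
For intuition about what sampling a gridded PDF involves, here is a minimal sketch of inverse-CDF sampling with plain `numpy`. It is not necessarily how `qp.PDF.sample` works internally; the helper `sample_gridded` and the synthetic Gaussian are hypothetical, introduced only for illustration.

```python
import numpy as np

def sample_gridded(z_grid, pdf_vals, n_samples, seed=0):
    """Draw samples from a gridded PDF by inverting its cumulative sum.

    Hypothetical helper for illustration; not part of qp.
    """
    rng = np.random.RandomState(seed)
    cdf = np.cumsum(pdf_vals)
    cdf = cdf / cdf[-1]  # rescale so the CDF ends at 1
    u = rng.uniform(size=n_samples)
    # Interpolating z as a function of the CDF maps uniform draws to redshifts.
    return np.interp(u, cdf, z_grid)

# Synthetic stand-in for one gridded p(z)
z_grid = np.arange(0.01, 3.51, 0.01)
pdf_vals = np.exp(-0.5 * ((z_grid - 1.0) / 0.2) ** 2)
samples = sample_gridded(z_grid, pdf_vals, 1000)
print(samples.mean(), samples.std())
```

The cumulative sum is a crude CDF estimate, so the samples carry a bias of up to one grid step. That is tolerable when the grid is fine compared with the width of the PDF.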

Now that there are samples, we can fit the GMM, producing a `qp.composite` object.

In [None]:
M_dist = G.mix_mod_fit(n_components=2, vb=False)
G.plot(vb=False)
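
To make the fitting step less of a black box, here is a minimal sketch of expectation-maximization for a 1D Gaussian mixture, written with `numpy` only. It illustrates the general technique under simple assumptions (initial means at data percentiles, a fixed number of iterations) and is not the routine behind `mix_mod_fit`; the helper `fit_gmm_1d` is hypothetical.

```python
import numpy as np

def fit_gmm_1d(x, n_components=2, n_iter=200):
    """Fit a 1D Gaussian mixture by expectation-maximization.

    Hypothetical helper for illustration; not part of qp.
    """
    weights = np.full(n_components, 1.0 / n_components)
    # Deterministic start: means at evenly spaced interior percentiles of the data
    means = np.percentile(x, np.linspace(0.0, 100.0, n_components + 2)[1:-1])
    stds = np.full(n_components, np.std(x))
    for _ in range(n_iter):
        # E-step: responsibility of each component for each sample
        dens = weights * np.exp(-0.5 * ((x[:, None] - means) / stds) ** 2)
        dens = dens / (stds * np.sqrt(2.0 * np.pi))
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: update weights, means, and widths from the responsibilities
        n_k = resp.sum(axis=0)
        weights = n_k / len(x)
        means = (resp * x[:, None]).sum(axis=0) / n_k
        var = (resp * (x[:, None] - means) ** 2).sum(axis=0) / n_k
        stds = np.sqrt(var) + 1e-6  # small floor keeps widths positive
    return weights, means, stds

# Synthetic bimodal samples standing in for draws from a gridded p(z)
rng = np.random.RandomState(1)
x = np.concatenate([rng.normal(0.5, 0.05, 600), rng.normal(1.5, 0.1, 400)])
weights, means, stds = fit_gmm_1d(x)
print(weights, means, stds)
```

With two well-separated modes the fit should recover weights, means, and widths close to the ones used to generate the samples. Real $p(z)$'s with overlapping or very narrow peaks are harder, and the choice of `n_components` matters.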

The `qp.composite` object can be used as the `qp.PDF.truth` to initialize a new `qp.PDF` object that doesn't have any information about the gridded or sample approximations.  Now we can approximate it any way we like!

In [None]:
M = qp.PDF(truth=M_dist)
M.quantize(vb=False)
M.histogramize(vb=False)
M.sample(N=100,vb=False)
M.plot(vb=False)
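
As a rough picture of what the quantile and histogram parametrizations store, here is a sketch that computes both from a set of samples with `numpy`. Working from samples is a simplification chosen for this example and may not match how `qp` computes them; the Gaussian samples and the names `n_params`, `probs`, `quantiles`, `heights`, and `edges` are assumptions.

```python
import numpy as np

rng = np.random.RandomState(0)
samples = rng.normal(1.0, 0.2, 10000)  # hypothetical stand-in for draws from p(z)
n_params = 10

# Quantiles: the redshifts at which the CDF reaches evenly spaced probabilities
probs = np.linspace(0.0, 1.0, n_params + 2)[1:-1]
quantiles = np.percentile(samples, 100.0 * probs)

# Histogram: bin edges plus bin heights normalized to integrate to 1
heights, edges = np.histogram(samples, bins=n_params, density=True)

print(quantiles)
print(heights, edges)
```

Quantiles place their resolution where the probability is concentrated, while histogram bins of equal width spread it evenly over the redshift range. That difference is one reason to compare the two at a fixed number of parameters.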

## Quantifying the Accuracy of the Approximation

Let's compute the Kullback-Leibler divergence (KLD) between each approximation and the truth; the RMSE is not yet computed in the cell below.

In [None]:
def compare(M, vb=False):
    P = qp.PDF(truth=M.truth)
    Q = {}
    Q['quantiles'] = qp.PDF(quantiles=M.quantize(N=100, vb=vb), vb=vb)
    Q['histogram'] = qp.PDF(histogram=M.histogramize(N=100, vb=vb), vb=vb)
    Q['samples'] = qp.PDF(samples=M.sample(N=100, vb=vb), vb=vb)
    KLD = {}
    for approximation in Q.keys():
        KLD[approximation] = qp.utils.calculate_kl_divergence(P, Q[approximation], limits=[0.0, 1.0], vb=False)
    print(KLD)
    return

compare(M)
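
For reference, the KLD of an approximation $Q$ from the truth $P$ is $\int p(z)\ \log\frac{p(z)}{q(z)}\ dz$. Here is a minimal sketch that estimates it with a Riemann sum on a grid and checks the result against the closed form for two Gaussians. The helpers `kld_on_grid` and `gaussian` are hypothetical, and `qp.utils.calculate_kl_divergence` may differ in its details.

```python
import numpy as np

def kld_on_grid(p_vals, q_vals, dz):
    """Riemann-sum estimate of KL(P||Q) = int p log(p/q) dz.

    Hypothetical helper for illustration; not part of qp.
    """
    # Floor both densities to avoid log(0); the floor value is an arbitrary choice.
    p = np.clip(p_vals, 1e-300, None)
    q = np.clip(q_vals, 1e-300, None)
    return dz * np.sum(p * (np.log(p) - np.log(q)))

def gaussian(x, mu, sigma):
    """Normal density with mean mu and standard deviation sigma."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

grid = np.linspace(-2.0, 4.0, 6001)
dz = grid[1] - grid[0]
p_vals = gaussian(grid, 1.0, 0.2)  # "truth"
q_vals = gaussian(grid, 1.1, 0.3)  # "approximation"

kld = kld_on_grid(p_vals, q_vals, dz)
# Closed form for two Gaussians, used as a cross-check
exact = np.log(0.3 / 0.2) + (0.2 ** 2 + 0.1 ** 2) / (2.0 * 0.3 ** 2) - 0.5
print(kld, exact)
```

The KLD is not symmetric: swapping `p_vals` and `q_vals` gives a different number, so the truth should always go in the first slot. The integration limits also matter, since any probability mass outside them is ignored.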