# Exploring the BPZ Test Data

_Alex Malz & Phil Marshall_

We have a small dataset to test our `qp` approximations on: 30,000 photometric redshift 1D posterior PDFs, in "gridded" format, from Melissa Graham (UW, LSST). In this notebook we visualize these distributions, and develop machinery to evaluate our approximations on the whole set in "survey mode." 

## Set-up, Ingest

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
from __future__ import print_function
    
import pickle
import numpy as np
from pathos.multiprocessing import ProcessingPool as Pool
import random
import cProfile
import pstats
import StringIO

import pandas as pd
pd.set_option('display.max_columns', None)

import matplotlib.pyplot as plt
%matplotlib inline

import qp

The data file doesn't appear to come with redshifts at which the PDFs are evaluated, but we are told they're evenly spaced between 0.01 and 3.51.

In [None]:
z = np.arange(0.01, 3.51, 0.01, dtype='float')
z_range = 3.51 - 0.01
delta_z = z_range / len(z)

In [None]:
with open('bpz_euclid_test_10_2.probs', 'rb') as data_file:
    lines = (line.split(None) for line in data_file)
    next(lines)  # discard the first line of the file
    # Skip the first column of each row and convert the rest to floats
    pdfs = np.array([[float(value) for value in line[1:]] for line in lines])

## Visualizing the BPZ $p(z)$'s

Let's plot a few interesting PDFs from the dataset.

In [None]:
indices = [1, 3, 14, 16, 19, 21]
colors = ['red','green','blue','cyan','magenta','yellow']
for i in range(len(colors)):
    plt.plot(z, pdfs[indices[i]], color=colors[i], label='Galaxy '+str(indices[i]))
plt.xlabel('redshift $z$', fontsize=16)
plt.legend();

Now, let's turn one of them into a `qp.PDF` object initialized with a gridded parametrization.  

Note: The PDFs in the data file weren't properly normalized.  In order to be PDFs, we want $\int\ p(z)\ dz=1$, but the data file entries satisfy $\sum_{z}\ p(z)=1$, which is not the same.  `qp` approximates the desired integral as $\int\ p(z)\ dz\ \approx\ \Delta z\ \sum_{i}^{N}\ p(z_{i})$ where $\Delta z=\frac{z_{max}-z_{min}}{N}$ is the distance between each neighbor pair $i$ of $N$ redshifts at which the PDF is evaluated.
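
The renormalization described in the note can be sketched with plain `numpy`. This is only an illustration of the arithmetic, not `qp`'s internal code; the array `raw` is a hypothetical stand-in for one row of the data file.

```python
import numpy as np

# Same grid as in the notebook, with the same definition of delta_z.
z = np.arange(0.01, 3.51, 0.01, dtype='float')
delta_z = (3.51 - 0.01) / len(z)

# Hypothetical stand-in for one row of the data file: values that sum to 1.
raw = np.exp(-0.5 * ((z - 1.0) / 0.2) ** 2)
raw /= np.sum(raw)

# Rescale so that the Riemann sum delta_z * sum(p) equals 1 instead.
normalized = raw / (np.sum(raw) * delta_z)
```

After this step `delta_z * np.sum(normalized)` is 1 to within floating-point precision, whereas `np.sum(raw)` was 1 before.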

In [None]:
chosen=1
G = qp.PDF(gridded=(z, pdfs[chosen]))
G.plot()

## Approximating the BPZ $p(z)$'s


Quantile and histogram representations cannot be computed directly from gridded PDFs - we need to make a GMM first, and use this to instantiate a `qp.PDF` object using a `qp.composite` object based on that GMM as `qp.PDF.truth`.  We can fit a GMM directly to the gridded PDF, or we can sample it and fit a GMM to the samples.

In [None]:
G.mix_mod_fit(n_components=3)
G.plot()

We can also fit the GMM to samples, producing a very similar `qp.composite` object.

In [None]:
G.sample(1000, vb=False)
M_dist = G.mix_mod_fit(n_components=3, vb=True)
G.plot()

The `qp.composite` object can be used as the `qp.PDF.truth` to initialize a new `qp.PDF` object that doesn't have any information about the gridded or sample approximations.  Now we can approximate it any way we like!

In [None]:
M = qp.PDF(truth=M_dist)
M.sample(N=100,vb=False)
M.quantize(N=100, vb=False)
M.histogramize(N=100, vb=False)
M.plot(vb=False)
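
To make the three parametrizations concrete, here is a minimal `numpy`-only sketch of what each one stores. It is not how `qp` computes them: the two-component mixture, the choice of evenly spaced interior quantile levels, and the bin range are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.RandomState(42)

# Hypothetical "truth": a two-component Gaussian mixture, drawn as samples.
n_draws = 10000
in_first = rng.uniform(size=n_draws) < 0.7
draws = np.where(in_first,
                 rng.normal(0.5, 0.1, n_draws),
                 rng.normal(1.5, 0.2, n_draws))

N = 10  # number of stored parameters per representation

# Samples: keep N of the draws.
samples = draws[:N]

# Quantiles: redshifts at N evenly spaced interior CDF levels (an assumption).
levels = np.linspace(0.0, 1.0, N + 2)[1:-1]
quantiles = np.percentile(draws, 100.0 * levels)

# Histogram: N bin heights over a fixed range, normalized to integrate to 1.
heights, edges = np.histogram(draws, bins=N, range=(0.0, 3.5), density=True)
```

Each representation costs the same number of floats, `N`, which is what makes the comparison in the next section a fair one.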

## Quantifying the Accuracy of the Approximation

Let's start by computing the RMSE and KLD between each approximation and the truth, in a sample of systems - and then graduate to looking at the estimated $n(z)$. We'll need a function to do all the analysis on a single object, and then accumulate the outputs to analyze them.
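
For reference, both metrics can be written down for two PDFs evaluated on a shared grid. The sketch below is a plain `numpy` version under our own assumptions (a Riemann sum for the KLD integral, an unweighted mean for the RMSE); the function names are hypothetical and this is not necessarily what `qp.calculate_kl_divergence` and `qp.calculate_rmse` do internally.

```python
import numpy as np

def kld_on_grid(p, q, dx):
    """Approximate KL(p || q) = int p * log(p / q) dx by a Riemann sum."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0  # grid points where p = 0 contribute nothing
    return dx * np.sum(p[mask] * np.log(p[mask] / q[mask]))

def rmse_on_grid(p, q):
    """Root-mean-square difference between two PDFs on the same grid."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return np.sqrt(np.mean((p - q) ** 2))

def gaussian(x, mu, sigma):
    """Normalized Gaussian density."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

# Two hypothetical PDFs on the notebook's grid.
z_grid = np.arange(0.01, 3.51, 0.01)
p = gaussian(z_grid, 1.0, 0.2)
q = gaussian(z_grid, 1.1, 0.2)

kld = kld_on_grid(p, q, dx=0.01)
rmse = rmse_on_grid(p, q)
```

For two Gaussians of equal width $\sigma$ whose means differ by $\Delta$, the KLD is $\Delta^2 / 2\sigma^2$, which is 0.125 for the values above, so the grid estimate can be checked against a known answer.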

In [None]:
def analyze(index, N_comps, z, N_floats=5, vb=False):
    """
    Model the input BPZ P(z) as a GMM, approximate that GMM in 
    various ways, and assess the quality of each approximation.
    
    Parameters
    ----------
    index : int
        ID of galaxy
    N_comps : int
        Number of components used in GMM
    z : float, ndarr
        Redshift array for input gridded "truth". Used for 
        evaluating n(z) too
    N_floats : int
        Number of floats used to parametrize the P(z)
    vb : boolean
        Verbose output?

    Returns
    -------
    result : dict
        Dictionary containing metric values, n(z) on standard 
        grid, samples, "true" GMM gridded p(z).
        
    Notes
    -----
    In some cases the GMM does not fit well, leading to bad KLD and 
    RMSE values when it is compared to the truth.
    
    """
#     # Make z array if we don't already have it:
#     if z is None:
#         z = np.arange(0.01, 3.51, 0.01, dtype='float')
    dz = (max(z) - min(z)) / len(z)
    zlimits = [min(z), max(z)]

    # Make a dictionary to contain the results:     
    result = {}
    
    # Make a gridded qp.PDF from the input BPZ p(z) (stored
    # in the global 'pdfs' variable):
    G = qp.PDF(gridded=(z, pdfs[index]), vb=vb)
    
    # Draw 1000 samples, fit a GMM model to them, and make a true PDF:
    G.sample(1000, vb=vb)
    GMM = G.mix_mod_fit(n_components=N_comps, vb=vb)
    P = qp.PDF(truth=GMM, vb=vb)
    
    # Evaluate the GMM on the z grid, and store in the result dictionary. We'll 
    # need this to make our "true" n(z) estimator. We don't need to keep the 
    # z array, as we passed that in.
    result['truth'] = P.evaluate(z, using='truth', vb=vb)[1]

    # Now approximate P in various ways, and assess:
    Q, KLD, RMSE, approximation = {}, {}, {}, {}
    Q['quantiles'] = qp.PDF(quantiles=P.quantize(N=N_floats, vb=vb), vb=vb)
    Q['histogram'] = qp.PDF(histogram=P.histogramize(N=N_floats, binrange=zlimits, vb=vb), vb=vb)
    Q['samples'] = qp.PDF(samples=P.sample(N=N_floats, vb=vb), vb=vb)
    for k in Q.keys():
        KLD[k] = qp.calculate_kl_divergence(P, Q[k], limits=zlimits, dx=dz, vb=vb)
        RMSE[k] = qp.calculate_rmse(P, Q[k], limits=zlimits, dx=dz, vb=vb)
        approximation[k] = Q[k].evaluate(z, using=k, vb=vb)[1]
        
    # Store approximations:
    result['KLD'] = KLD
    result['RMSE'] = RMSE
    result['approximation'] = approximation
    result['samples'] = Q['samples'].samples
    
    return result

OK, now let's collate the metrics for the first 100 galaxies over a variable number of parameters, and look at the distribution of metric values.  The `for` loop is slow, which is why multiprocessing is set up here (the pool calls are currently commented out, so the loop below runs serially); the rate-limiting step is the optimization routine for finding quantiles of a GMM.

In [None]:
def one_analysis(N):
    
    pr = cProfile.Profile()
    pr.enable()
    
    all_results[str(N)] = []
    for i in range(100):
        all_results[str(N)].append(analyze(i, 2, z, N_floats=N))
        if i%10 == 0: print('.', end='')
            
    pr.disable()
    s = StringIO.StringIO()
    sortby = 'cumtime'
    ps = pstats.Stats(pr, stream=s).sort_stats(sortby)
    ps.print_stats()
    print(N, s.getvalue())
    
    return# all_results

In [None]:
#%%time

numbers = [3, 10, 30, 100]
n_numbers = len(numbers)
all_results = {}

# pool = Pool(3)
# all_results = pool.map(one_analysis, numbers)
for N in numbers:
    one_analysis(N)
    
#all_results = all_results[0]
#print(all_results.keys())
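
If the pool is re-enabled, `one_analysis` would probably need to return its results rather than write into the global `all_results`, since worker processes generally do not share the parent's dictionary. The sketch below shows that shape using the built-in `map`; `analyze_stub` and `one_analysis_returning` are hypothetical names, and swapping in a pool's `map` method is an untested assumption.

```python
def analyze_stub(index, N_floats):
    """Hypothetical stand-in for analyze(); returns a small result dict."""
    return {'index': index, 'N_floats': N_floats}

def one_analysis_returning(N, n_galaxies=5):
    """Return (key, results) instead of writing to a global dictionary."""
    results = [analyze_stub(i, N) for i in range(n_galaxies)]
    return str(N), results

numbers = [3, 10, 30, 100]

# Built-in map is used here; a pool's map method should be usable in its place.
all_results_sketch = dict(map(one_analysis_returning, numbers))
```

The resulting dictionary has the same layout as `all_results` (keyed by `str(N)`, each value a list of per-galaxy results), so the downstream cells would not need to change.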

Since the previous step is quite slow (on the order of 5 minutes per test of different numbers of parameters), this is a good point to save the results.  We can load them from the file later and not remake them if we only want to do the rest of the analysis.

In [None]:
with open('all_results.pkl', 'wb') as result_file: 
    pickle.dump(all_results, result_file)

In [None]:
with open('all_results.pkl', 'rb') as result_file: 
    all_results = pickle.load(result_file)
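
The save and load cells can be combined into a single "load if present, otherwise compute" helper. This is a minimal sketch; `load_or_compute` and the demo file name are hypothetical and not part of the notebook.

```python
import os
import pickle
import tempfile

def load_or_compute(path, compute):
    """Load pickled results from path if present; otherwise compute and save."""
    if os.path.exists(path):
        with open(path, 'rb') as result_file:
            return pickle.load(result_file)
    results = compute()
    with open(path, 'wb') as result_file:
        pickle.dump(results, result_file)
    return results

# Usage with a hypothetical cheap computation and a temporary file:
cache_path = os.path.join(tempfile.mkdtemp(), 'all_results_demo.pkl')
first = load_or_compute(cache_path, lambda: {'3': [0.1, 0.2]})
second = load_or_compute(cache_path, lambda: {'3': 'never computed'})
```

The second call finds the file and returns the cached dictionary without calling its `compute` argument, which is the behavior we want when only the later analysis cells are being rerun.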

In [None]:
all_KLD, all_RMSE = [], []
for n in range(n_numbers):
    KLD, RMSE = {}, {}
    for approximation in all_results[str(numbers[n])][0]['KLD'].keys():
        x = np.array([])
        for k in range(len(all_results[str(numbers[n])])):
            x = np.append(x, all_results[str(numbers[n])][k]['KLD'][approximation])
        KLD[approximation] = x
        x = np.array([])
        for k in range(len(all_results[str(numbers[n])])):
            x = np.append(x, all_results[str(numbers[n])][k]['RMSE'][approximation])
        RMSE[approximation] = x
    all_KLD.append(KLD)
    all_RMSE.append(RMSE)

Now let's plot histograms of the metric values.

In [None]:
colors = {'samples':'green', 'quantiles':'blue', 'histogram':'red'}
plt.figure(figsize=(12, 5*n_numbers))

i=0
for n in range(n_numbers):
    i += 1
    # Lefthand panel: KLD
    plt.subplot(n_numbers, 2, i)
    plt.title('KLD for '+str(numbers[n])+' numbers')
    bins = np.linspace(0.0, 5., 25)
    for k in ['samples', 'quantiles', 'histogram']:
        plt.hist(all_KLD[n][k], bins, label=k, fc=colors[k], ec=colors[k], alpha=0.3, normed=True)
    #plt.semilogx()
    plt.xlabel('KL Divergence Metric', fontsize=16)
    plt.ylim(0., 5.0)
    plt.xlim(0., 5.0)
    plt.legend()
    
    i += 1
    # Righthand panel: RMSE
    plt.subplot(n_numbers, 2, i)#+n_numbers)
    plt.title('RMSE for '+str(numbers[n])+' numbers')
    bins = np.linspace(0.0, 5., 25)
    for k in ['samples', 'quantiles', 'histogram']:
        plt.hist(all_RMSE[n][k], bins, label=k, fc=colors[k], ec=colors[k], alpha=0.3, normed=True)
    #plt.semilogx()
    plt.xlabel('RMS Error Metric', fontsize=16)
    plt.ylim(0., 5.0)
    plt.xlim(0., 5.0)
    plt.legend();
    
plt.savefig('money.png')

Interestingly, the metrics don't agree, nor is the behavior consistent across different numbers of parameters.  However, as the number of parameters increases, the distributions of both metrics shift toward lower values.

KLD seems to flag more "bad" approximations than RMSE. How do we know where to set the threshold in each metric? 

We should think of the right way to get a summary statistic (first moment?) on the ensemble of KLD or RMSE values so we can make the plot of number of parameters vs. quality of approximation.
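
As a starting point for that choice, here is a small sketch comparing the first moment with the median on made-up numbers. The values in `kld_values` are hypothetical, chosen only to mimic an ensemble with one badly fit galaxy.

```python
import numpy as np

# Hypothetical KLD values for one approximation; the last one mimics a bad fit.
kld_values = np.array([0.02, 0.05, 0.03, 0.04, 0.06, 4.0])

summary = {
    'mean': np.mean(kld_values),      # first moment, pulled up by the outlier
    'median': np.median(kld_values),  # robust to a few bad approximations
}
```

With a single large outlier the mean is dominated by that one galaxy while the median is not, so the two statistics answer different questions about the ensemble.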

Now let's compute the estimated $n(z)$. We'll do this with the GMM "truth", and then using each of our approximations. And we'll normalize the $n(z)$ to account for lost systems with bad approximations.

In [None]:
plt.figure(figsize=(6, 5*n_numbers))
all_n = []
all_x = []
all_y = []

for i in range(n_numbers):
    results = all_results[str(numbers[i])]
    n = {}

    # Pull out all truths and compute the average at each z:
    x = np.zeros([len(z), len(results)])
    y = {}
    for approx in ['samples', 'quantiles', 'histogram']:
        y[approx] = np.zeros([len(z), len(results)])
        for k in range(len(results)):
            y[approx][:,k] = results[k]['approximation'][approx] 
    for k in range(len(results)):
        x[:,k] = results[k]['truth'] 

    # Now do the averaging to make the estimators:
    n['truth'] = np.mean(x, axis=1)
    n['truth'] /= np.sum(n['truth']) * delta_z
    for approx in ['samples', 'quantiles', 'histogram']:
        n[approx] = np.mean(y[approx], axis=1)
        n[approx] /= np.sum(n[approx]) * delta_z
        
    all_n.append(n)
    all_x.append(x)
    all_y.append(y)

    # Note: this uses the samples' KDE to make the approximation. We could (and 
    # should!) also try simply concatenating the samples and histogramming them.
    
    # Plot truth and all the approximations. 
    # The NaNs in the histogram approximation make that unplottable for now.
    plt.subplot(n_numbers, 1, i+1)#+n_numbers)
    plt.title(r'$n(z)$ for '+str(numbers[i])+' numbers')
    plt.plot(z, n['truth'], color='black', lw=4, alpha=0.3, label='truth')
    for k in ['samples', 'quantiles', 'histogram']:
        plt.plot(z, n[k], label=k, color=colors[k])
    plt.xlabel('redshift z')
    plt.ylabel('n(z)')
    plt.legend();
plt.savefig('nz_comparison.png')
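
The alternative mentioned in the code comment above, concatenating the samples and histogramming them, might look like the following. The per-galaxy samples here are randomly generated placeholders, and the bin width of 0.1 is an arbitrary choice.

```python
import numpy as np

rng = np.random.RandomState(0)

# Hypothetical per-galaxy samples: 100 galaxies, 30 samples each.
centers = rng.uniform(0.3, 2.0, size=100)
samples = [rng.normal(c, 0.1, size=30) for c in centers]

# Concatenate all samples and histogram them over the notebook's redshift range.
stacked = np.concatenate(samples)
edges = np.linspace(0.0, 3.5, 36)
n_hist, _ = np.histogram(stacked, bins=edges, density=True)

# Bin centers, useful for plotting n_hist against redshift.
z_mid = 0.5 * (edges[:-1] + edges[1:])
```

With the real data, `samples` would be replaced by the `result['samples']` arrays stored by `analyze`, and `n_hist` could be overplotted on the KDE-based estimate for comparison.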

The "samples" approximation gives the best result for the $n(z)$ estimator even with a small number of samples.  However, once the number of parameters increases slightly, the "quantiles" approximation performs similarly.  It takes a large number of parameters before the "histogram" approximation approaches the other options. Let's use the `qp.PDF` object to compare them quantitatively (since $n(z)$ can be normalized to give the global $p(z)$).

In [None]:
all_p = []

for i in range(n_numbers):
    n = all_n[i]
    p = {}
    for k in ['samples', 'quantiles', 'histogram']:
        p[k] = qp.PDF(gridded=(z,n[k]), vb=False)

    p['truth'] = qp.PDF(gridded=(z,n['truth']), vb=False)
    
    all_p.append(p)

In [None]:
all_KLD_nz, all_RMSE_nz = {}, {}
zlimits, dz = [0.0, 3.5], 0.01
for k in ['samples', 'quantiles', 'histogram']:
    KLD_nz, RMSE_nz = [], []
    for i in range(n_numbers):
        KLD_nz.append(qp.calculate_kl_divergence(all_p[i]['truth'], all_p[i][k], limits=zlimits, dx=dz, vb=False))
        RMSE_nz.append(qp.calculate_rmse(all_p[i]['truth'], all_p[i][k], limits=zlimits, dx=dz, vb=False))
    
    all_KLD_nz[k] = KLD_nz
    all_RMSE_nz[k] = RMSE_nz

In [None]:
plt.figure(figsize=(12, 5))
both = [plt.subplot(1, 2, i+1) for i in range(2)]
KLD_plot = both[0]
RMSE_plot = both[1]
KLD_plot.set_title(r'KLD for $n(z)$')
RMSE_plot.set_title(r'RMSE for $n(z)$')
KLD_plot.set_xlabel('number of parameters')
RMSE_plot.set_xlabel('number of parameters')
KLD_plot.set_ylabel('KLD')
RMSE_plot.set_ylabel('RMSE')
# KLD_plot.semilogx()
# KLD_plot.semilogy()
# RMSE_plot.semilogx()
# RMSE_plot.semilogy()

for k in ['samples', 'quantiles', 'histogram']:
    KLD_plot.plot(numbers, all_KLD_nz[k], color=colors[k], label=k)
    RMSE_plot.plot(numbers, all_RMSE_nz[k], color=colors[k], label=k)

KLD_plot.semilogy()
KLD_plot.semilogx()
RMSE_plot.semilogy()
RMSE_plot.semilogx()
KLD_plot.legend()
RMSE_plot.legend()
plt.savefig('summary.png')

In [None]:
print('KLD metrics for n(z) estimator: ', all_KLD_nz)
print('RMSE metrics for n(z) estimator: ', all_RMSE_nz)

At 100 parameters the histogram approximation gives the lowest KLD, an early indication that it may be the best after all, although it still gives the highest RMSE there. The rank order of the three methods is the same under the KLD and the RMSE at 3 and 30 parameters; at 10 parameters samples and quantiles swap places, performing almost identically under the KLD, and at 100 parameters the two metrics disagree.

| Number of parameters | Lowest KLD | Middle KLD | Highest KLD |
| --------------------- | ---------- | ---------- | ----------- |
| 3 | samples | histogram | quantiles |
| 10 | samples | quantiles | histogram |
| 30 | quantiles | samples | histogram |
| 100 | histogram | quantiles | samples |

| Number of parameters | Lowest RMSE | Middle RMSE | Highest RMSE |
| --------------------- | ----------- | ----------- | ------------ |
| 3 | samples | histogram | quantiles |
| 10 | quantiles | samples | histogram |
| 30 | quantiles | samples | histogram |
| 100 | quantiles | samples | histogram |

A bigger test, using the full dataset, should allow this to be tested further: jack-knife error bars should also be calculable. 
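
As a sketch of how such error bars could be obtained, here is a delete-one jackknife estimate of the standard error on the mean of a metric. The function name and the lognormal placeholder values are hypothetical; with real results, `metric` would be one of the per-galaxy KLD or RMSE arrays.

```python
import numpy as np

def jackknife_mean_error(values):
    """Delete-one jackknife estimate of the standard error of the mean."""
    values = np.asarray(values, dtype=float)
    n = len(values)
    # Mean of the sample with each element left out in turn.
    loo_means = (np.sum(values) - values) / (n - 1)
    return np.sqrt((n - 1.0) / n * np.sum((loo_means - np.mean(loo_means)) ** 2))

# Hypothetical metric values standing in for per-galaxy KLDs.
rng = np.random.RandomState(1)
metric = rng.lognormal(mean=-2.0, sigma=0.5, size=100)
err = jackknife_mean_error(metric)
```

For the mean, the jackknife reproduces the usual $s/\sqrt{n}$ standard error; its value here would be in applying the same leave-one-out loop to statistics without a simple formula, such as the KLD of the stacked $n(z)$.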