# Exploring the __SLICS-HR__ particle data
notebook by _Alex Malz (GCCL@RUB)_, (add your name here)

In [None]:
import astropy as ap
from astropy.cosmology import FlatLambdaCDM
import matplotlib as mpl
import matplotlib.pyplot as plt
import multiprocessing as mp
import numpy as np
import os
import pandas as pd

%matplotlib inline

We're only considering the first 8 SLICS snapshots because the GAMA data doesn't have good enough coverage beyond that.

In [None]:
z_SLICS = np.array([0.042, 0.080, 0.130, 0.221, 0.317, 0.418, 0.525, 0.640])
#, 0.764, 0.897, 1.041, 1.199, 1.372, 1.562, 1.772, 2.007, 2.269, 2.565, 2.899])
z_mids = (z_SLICS[1:] + z_SLICS[:-1]) / 2.
z_bins = np.insert(z_mids, 0, 0.023)
z_bins = np.append(z_bins, 3.066)
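
As a quick sanity check on the binning, each snapshot redshift should fall strictly inside its own bin. The sketch below rebuilds the edges from the same values as the cell above and checks this; it assumes nothing beyond those numbers.

```python
import numpy as np

z_SLICS = np.array([0.042, 0.080, 0.130, 0.221, 0.317, 0.418, 0.525, 0.640])

# interior edges are midpoints between neighbouring snapshots
z_mids = (z_SLICS[1:] + z_SLICS[:-1]) / 2.
# outer edges are the same values used in the cell above
z_bins = np.concatenate(([0.023], z_mids, [3.066]))

# one bin per snapshot, and every snapshot lies inside its bin
inside = (z_bins[:-1] < z_SLICS) & (z_SLICS < z_bins[1:])
print(len(z_bins), inside.all())
```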

## Read in data

To start, download one of the 64 per-node files available at each of the 20 redshifts from Joachim Harnois-Deraps.
I chose file 21 for this example; the cell below picks the snapshot with `which_z`, and `which_z = 2` corresponds to $z=0.130$.

In [None]:
which_z = 2
z_str = '{:<05}'.format(str(z_SLICS[which_z]))
fn_base = 'xv'
fn_index = 21
fn_ext = '.dat'
fn = z_str + fn_base + str(fn_index) + fn_ext
data_dir = 'particle_data/cuillin.roe.ac.uk/~jharno/SLICS/SLICS_HR/LOS1'
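
The file name is the zero-padded redshift followed by `xv`, the node index, and `.dat`. A minimal sketch of the same construction as a reusable function follows; `slics_filename` is a hypothetical helper, and it uses `'{:.3f}'` formatting, which should give the same strings as the cell above for redshifts quoted to three decimals or fewer.

```python
def slics_filename(z, index, base='xv', ext='.dat'):
    """Hypothetical helper: build a name like '0.130xv21.dat'."""
    # '{:.3f}' always yields three decimals, e.g. 0.13 -> '0.130'
    return '{:.3f}{}{}{}'.format(z, base, index, ext)

print(slics_filename(0.130, 21))
print(slics_filename(0.08, 37))
```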

Read in the file as binary 4-byte floats (`f4`), six per particle: three positions followed by three velocities.

In [None]:
dt_each = 'f' + str(4)
dt = np.dtype([('x', dt_each), ('y', dt_each), ('z', dt_each), ('vx', dt_each), ('vy', dt_each), ('vz', dt_each)])

In [None]:
with open(os.path.join(data_dir, fn), 'rb') as f1:
    raw_data = np.fromfile(f1, dtype=dt)

Throw out the first 12 float values (the first 2 six-field records) as unwanted header information.

In [None]:
loc_data = pd.DataFrame(data=raw_data[2:], columns=['x', 'y', 'z', 'vx', 'vy', 'vz'])
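
To see why dropping 2 records removes 12 header values, the sketch below writes a small synthetic file with the same layout and reads it back. The file contents are made up for illustration; only the structured `dtype` and the 12-value header come from the cells above.

```python
import os
import tempfile
import numpy as np

fields = ['x', 'y', 'z', 'vx', 'vy', 'vz']
dt = np.dtype([(name, 'f4') for name in fields])

# synthetic stand-in for a SLICS file: 12 header floats + 3 particles
header = np.zeros(12, dtype='f4')
particles = np.arange(18, dtype='f4')
tmp_path = os.path.join(tempfile.mkdtemp(), 'fake_xv.dat')
np.concatenate([header, particles]).tofile(tmp_path)

raw = np.fromfile(tmp_path, dtype=dt)
# 12 header floats occupy exactly 2 six-field records
records = raw[12 // len(fields):]
print(len(raw), len(records))
print(records['x'], records['vz'])
```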

In [None]:
plt.hist2d(loc_data['x'], loc_data['y'], bins=(200, 200), norm=mpl.colors.LogNorm(), cmap='Spectral_r')

In [None]:
len(loc_data)

In [None]:
loc_to_plot = loc_data.sample(50000)
# plt.scatter(loc_to_plot['x'], loc_to_plot['y'], marker='.', s=1, alpha=0.5)

## Convert to physical units

The particle data starts out in simulation units relative to the per-node subvolume and needs to be converted to physical units in the space of all subvolumes before the whole volume can be considered.

In [None]:
# number of MPI tasks per dimension
nodes_dim = 4

# volume size
rnc = 3072.

# subvolume size
ncc = rnc / nodes_dim

# physical scale in Mpc/h
phys_scale = 505.

Note that the conversion below makes sense for `x`, `y`, and `z` but not for `vx`, `vy`, and `vz`.
Because of how the data is distributed across the files, I think 21, 22, 25, 26, 37, 38, 41, 42 are "adjacent" and free of edge effects.

In [None]:
all_nodes_coords = np.empty((nodes_dim, nodes_dim, nodes_dim))
for k1 in range(1, nodes_dim+1):
    for j1 in range(1, nodes_dim+1):
        for i1 in range(1, nodes_dim+1):
            current_ind = (i1 - 1) + (j1 - 1) * nodes_dim + (k1 - 1) * nodes_dim ** 2
            node_coords = {'x': i1 - 1, 'y': j1 - 1, 'z': k1 - 1}
            if fn_index == current_ind:
                print('found index '+str(fn_index)+' at '+str((i1, j1, k1)))
                true_node_coords = node_coords
            all_nodes_coords[node_coords['x'], node_coords['y'], node_coords['z']] = current_ind
            
# print(all_nodes_coords)
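
The same index-to-coordinate mapping can be sketched without the triple loop using `np.unravel_index`. This assumes, as the loop above does, that file indices are 0-based with `index = x + 4*y + 16*z`; `node_coords` is a hypothetical helper name.

```python
import numpy as np

nodes_dim = 4
shape = (nodes_dim,) * 3

def node_coords(index):
    """0-based (x, y, z) of a file index, assuming index = x + 4*y + 16*z."""
    # order='F' makes the first axis (x) vary fastest, matching the loop above
    return tuple(int(c) for c in np.unravel_index(index, shape, order='F'))

block = [21, 22, 25, 26, 37, 38, 41, 42]
coords = {idx: node_coords(idx) for idx in block}
for idx, xyz in coords.items():
    print(idx, xyz)
```

Under that assumption, the eight files listed above occupy the interior 2 x 2 x 2 block with coordinates 1 or 2 along every axis.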

To get coherent coordinates across all files, we need to shift them accordingly.
The next cell is unexpectedly slow.

In [None]:
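# shift data (work on a copy so loc_data keeps the per-node coordinates)
glob_data = loc_data.copy()
for col in ['x', 'y', 'z']:
    glob_data[col] = np.remainder(loc_data[col] + true_node_coords[col] * ncc, rnc)
    # every shifted coordinate must lie inside the full volume
    assert glob_data[col].max() <= rnc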

In [None]:
# convert to Mpc/h
phys_data = glob_data * phys_scale / rnc
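
A worked example of the two steps on a handful of made-up coordinates may help. The particle positions and the node offset below are hypothetical; the constants are the ones defined above.

```python
import numpy as np

nodes_dim = 4
rnc = 3072.            # full volume, simulation cells per dimension
ncc = rnc / nodes_dim  # subvolume, simulation cells per dimension
phys_scale = 505.      # full volume, Mpc/h per dimension

# hypothetical local coordinates (in cells) of three particles on one node
local = np.array([0., 384., 767.5])
node_offset = 1  # 0-based node position along this axis

# shift into the global frame, wrapping periodically, then rescale
glob = np.remainder(local + node_offset * ncc, rnc)
phys = glob * phys_scale / rnc
print(glob)
print(phys)
```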

In [None]:
for dim in ['x', 'y', 'z']:
    plt.hist(phys_data[dim], density=True, alpha=0.5)
plt.xlabel('distance (Mpc/h)')

In [None]:
plt.hist2d(phys_data['x'], phys_data['y'], bins=(200,200), norm=mpl.colors.LogNorm(), cmap='Spectral_r')

In [None]:
phys_to_plot = phys_data.sample(50000)
# plt.scatter(phys_to_plot['x'], phys_to_plot['y'], marker='.', s=1, alpha=0.5)

## How much data do we need?

Obtain necessary depth from ~~[Ned Wright's cosmology calculator](http://www.astro.ucla.edu/~wright/CosmoCalc.html)~~ `astropy`.
The SLICS cosmology has $\Omega_{m} = 0.2905$, $\Omega_{\Lambda} = 0.7095$, $\Omega_{b} = 0.0473$, $h = 0.6898$, $\sigma_{8} = 0.826$, and $n_{s} = 0.969$.
Let's assume the naive Cartesian-to-angular coordinate conversion and flatten along the `z` direction.
We need to flatten a depth corresponding to the bounds of each redshift bin.

In [None]:
h = 0.6898
cosmo = FlatLambdaCDM(H0=100.*h, Om0=0.2905, Ob0=0.0473)

# astropy returns distances in Mpc; multiply by h to express them in Mpc/h,
# the unit of phys_scale
d_comov = np.array([cosmo.comoving_distance(float(z)).value * h for z in z_bins])
depths = d_comov[1:] - d_comov[:-1]

avg_d_comov = np.array([cosmo.comoving_distance(float(z)).value * h for z in z_SLICS])

# print(depths)
# print(avg_d_comov)

Wherever `depths` exceeds `phys_scale` (both in Mpc/h), the corresponding GAMA redshift bin may require opening two files.
I think the way they're arranged means that (21, 37), (22, 38), (25, 41), and (26, 42) are pairs adjacent in `z`.
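
As a cross-check that does not depend on `astropy`, the comoving distance in a flat $\Lambda$CDM universe can be integrated directly. Expressed in Mpc/h it is $(c / 100\,\mathrm{km\,s^{-1}\,Mpc^{-1}}) \int_0^z dz' / E(z')$, which is independent of $h$; a distance in Mpc corresponds to this value after multiplying by $h$. The sketch below is a simple trapezoidal integration that neglects radiation; `comoving_distance_mpc_h` is a hypothetical helper name.

```python
import numpy as np

C_KM_S = 299792.458  # speed of light in km/s
OM0 = 0.2905         # SLICS matter density; flat, so Omega_Lambda = 1 - OM0

def comoving_distance_mpc_h(z, n_steps=2001):
    """Flat LambdaCDM comoving distance in Mpc/h via trapezoidal integration."""
    zs = np.linspace(0., z, n_steps)
    inv_e = 1. / np.sqrt(OM0 * (1. + zs) ** 3 + (1. - OM0))
    integral = np.sum(0.5 * (inv_e[1:] + inv_e[:-1]) * np.diff(zs))
    # c / (100 km/s/Mpc) is the Hubble distance in Mpc/h
    return C_KM_S / 100. * integral

z_SLICS = [0.042, 0.080, 0.130, 0.221, 0.317, 0.418, 0.525, 0.640]
d_c = np.array([comoving_distance_mpc_h(z) for z in z_SLICS])
print(np.round(d_c, 1))
```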

_This is as good a time as any to note that our mock catalog will have a bit of a degeneracy if we use the same file numbers for all redshifts because each file corresponds to the same physical volume across cosmic time, whereas in a real survey, our redshift bins contain different volumes/galaxies.
We have a choice to make about discontinuities or non-physical repetition._

## Convert to angular units

Obtain the angle $\theta = x / d_{a}$ from the angular diameter distance $d_{a} = d_{c} / (1 + z)$, where $d_{c}$ is the comoving distance and $x$ is the distance in the SLICS data.
Compare with the GAMA footprint of $286\ \mathrm{deg}^{2} \times (\pi / 180)^{2} \approx 0.087\ \mathrm{sr}$.

In [None]:
d_ang = avg_d_comov / (1 + z_SLICS)
theta_box = phys_scale / d_ang * 180. / np.pi
footprint = theta_box**2
# print(footprint)
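
The unit conversions in this step are easy to get wrong, so here is a small worked sketch. The distance of 1000 Mpc/h is a made-up value chosen only to illustrate the small-angle formula, and `box_angle_deg` is a hypothetical helper name.

```python
import math

# GAMA footprint: square degrees to steradians
gama_area_deg2 = 286.
gama_area_sr = gama_area_deg2 * (math.pi / 180.) ** 2
print(round(gama_area_sr, 4))

def box_angle_deg(box_size, distance):
    """Small-angle estimate of the angle subtended by box_size at distance (same units)."""
    return math.degrees(box_size / distance)

# hypothetical distance of 1000 Mpc/h, for illustration only
theta_example = box_angle_deg(505., 1000.)
print(round(theta_example, 2), round(theta_example ** 2, 1))
```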


The scaling behavior is as expected;
`phys_scale` subtends a larger angle at low redshifts and a smaller angle at high redshifts.
One file's worth of SLICS data covers the GAMA footprint in every redshift bin where `footprint` exceeds $286\ \mathrm{deg}^{2}$; any bin below that would require more than one file's worth of data.
We need to pick an angular area for our mock galaxy catalog.
Let's go with twice the GAMA area for now.
_Do we think twice the GAMA area is sufficiently compelling?_

In [None]:
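theta_gama = 286.  # GAMA footprint area in square degrees
# side length in Mpc/h of a square GAMA footprint at each snapshot
GAMA_phys_scale = np.sqrt(theta_gama) * (np.pi / 180.) * d_ang
# doubling the area scales the side length by sqrt(2)
# print(np.sqrt(2.) * GAMA_phys_scale)

If we go with twice the GAMA footprint, each redshift bin needs as many files per dimension as it takes for `phys_scale` to cover $\sqrt{2}$ times `GAMA_phys_scale`.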
I think (21, 22), (25, 26), (37, 38), and (41, 42) are adjacent in `x`/`RA` and (21, 25), (22, 26), (37, 41), and (38, 42) are adjacent in `y`/`DEC`.

In [None]:
# for i in range(4):
#     j = i+1
#     subset = phys_data[(phys_data['x'] <= 10.*j) & (phys_data['y'] <= 10.*j) & (phys_data['z'] <= 10.*j)]
#     subset.to_csv('spat'+str(j)+'0Mpc.csv', header=False, index=False, sep=' ', columns=['x', 'y', 'z'])
#     angular = subset / 313.5 * 69.6 / 100. * float(j) * 180 / np.pi
#     print((min(angular['x']), max(angular['x'])))
#     print((min(angular['y']), max(angular['y'])))  
#     angular.to_csv('ang'+str(j)+'deg.csv', header=False, index=False, sep=',', columns=['x', 'y'])

Right now, I'm not going to deal with combining adjacent files, just chopping up ones that are too big.
This is slow!

In [None]:
ang_data = phys_data[np.mod(phys_data['z'] - phys_data['z'].min(), phys_scale) < depths[which_z]].copy()
ang_data['RA'] = ang_data['x'] / d_ang[which_z] * 180. / np.pi
ang_data['DEC'] = ang_data['y'] / d_ang[which_z] * 180. / np.pi

We'd change this for the area of our mock survey when we decide on it.
_There is an edge effect going on right now.
I need to switch to one of the internal files to avoid roll-over that's breaking min/max checks._

In [None]:
lim_theta = np.sqrt(2. * theta_gama)
cut_data = ang_data[(ang_data['RA'] < lim_theta + min(ang_data['RA']))
                    & (ang_data['DEC'] < lim_theta + min(ang_data['DEC']))]

# plt.hist(cut_data['RA'])
# plt.hist(cut_data['DEC'])

cut_data.to_csv(z_str+'cut.csv', header=True, index=False, sep=',', columns=['RA', 'DEC'])
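
The cut keeps a square patch whose area is twice the GAMA footprint, anchored at the minimum `RA` and `DEC`. The sketch below applies the same mask to made-up uniform positions with plain `numpy` arrays in place of the `DataFrame`, and confirms that the kept patch is no larger than twice the GAMA area.

```python
import numpy as np

rng = np.random.default_rng(0)
# hypothetical angular positions in degrees, standing in for ang_data
ra = rng.uniform(10., 60., size=10000)
dec = rng.uniform(-20., 30., size=10000)

theta_gama = 286.                     # GAMA area in square degrees
lim_theta = np.sqrt(2. * theta_gama)  # side of a square with twice that area

# same selection as above: a square anchored at the minimum RA and DEC
keep = (ra < ra.min() + lim_theta) & (dec < dec.min() + lim_theta)
area = (ra[keep].max() - ra[keep].min()) * (dec[keep].max() - dec[keep].min())
print(keep.sum(), round(area, 1))
```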

In [None]:
plt.hist2d(cut_data['RA'], cut_data['DEC'], bins=(200,200), norm=mpl.colors.LogNorm(), cmap='Spectral_r')
plt.xlabel('RA (deg)')
plt.ylabel('DEC (deg)')

In [None]:
cut_to_plot = cut_data.sample(50000)
# plt.scatter(cut_to_plot['x'], cut_to_plot['y'], marker='.', s=1, alpha=0.5)

# Scratch after here

## Spatially subsample data

Turns out 1/64th of the total data was still way more than we could reasonably use at once to compute correlation functions!
This should really be sliced by size of box.
First, just break it up into smaller boxes.
Let's say we want $10^{5}$ particles per box, so we'll cut it in 16 in each dimension.
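
One way to sketch the 16 x 16 split is to assign every particle an integer cell index along each angular axis, so that each particle lands in exactly one cell even when it sits on a cell edge. The positions below are made up, and `cell_index` is a hypothetical helper name.

```python
import numpy as np

rng = np.random.default_rng(1)
# hypothetical angular positions in degrees, standing in for ang_data
ra = rng.uniform(0., 20., size=5000)
dec = rng.uniform(0., 20., size=5000)
n_cells = 16

def cell_index(values, n):
    """Map values onto n equal-width cells between their min and max."""
    scaled = (values - values.min()) / (values.max() - values.min())
    # the maximum lands exactly on 1.0, so clip it into the last cell
    return np.minimum((scaled * n).astype(int), n - 1)

i_ra = cell_index(ra, n_cells)
i_dec = cell_index(dec, n_cells)
counts = np.zeros((n_cells, n_cells), dtype=int)
np.add.at(counts, (i_ra, i_dec), 1)
print(counts.sum(), counts.min(), counts.max())
```

Selecting one cell is then a matter of masking on `(i_ra == i) & (i_dec == j)`.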

In [None]:
# # distances = np.sqrt(phys_data['x']**2 + phys_data['y']**2 + phys_data['z']**2)
# splitpoints = {}
# for dim in ['RA', 'DEC']:
#     splitpoints[dim] = np.linspace(min(ang_data[dim]), max(ang_data[dim]), 17)
#     print(splitpoints[dim])

In [None]:
# for i in range(16):
#     for j in range(16):
#         subsample = ang_data.loc[(ang_data['RA'] >= splitpoints['RA'][i]) & (ang_data['RA'] <= splitpoints['RA'][i+1])\
#                                  & (ang_data['DEC'] >= splitpoints['DEC'][j]) & (ang_data['DEC'] <= splitpoints['DEC'][j+1])]
#         subsample.to_csv(z_str+'slice_'+str(i)+'_'+str(j)+'.csv', header=True, index=False, sep=',', columns=['RA', 'DEC'])

## Randomly subsample data

In [None]:
# print(angular)

In [None]:
# to_plot = angular.sample(5000)

In [None]:
# plt.hist2d(to_plot['x'], to_plot['y'], bins=100, norm=mpl.colors.LogNorm(), cmap='Spectral_r')
# plt.savefig('mock_gal_pos.png', dpi=250)

In [None]:
# try_distances = np.flip(np.geomspace(0.01, 1.0, 10), axis=0)

In [None]:
# import environment as galenv

# def calc_env(ind):
#     res = []
#     friends = data
#     for dist in try_distances:
#         friends = galenv.nn_finder(friends, data[ind], dist)
#         res.append(len(friends))
#     return res

In [None]:
# data = [to_plot['x'].values, to_plot['y'].values]
# print(data)

In [None]:
# data = np.array([to_plot['x'].values, to_plot['y'].values]).T
# nps = mp.cpu_count()
# pool = mp.Pool(nps - 1)
# envs = pool.map(calc_env, range(len(data)))
# pool.close()
# # envs_arr = np.array(all_envs)
# # envs_df = pd.DataFrame(data=envs_arr, index = envs_arr[:, 0], columns = ['CATAID']+[str(i) for i in try_distances])

# # df = pd.merge(envs_df, zdf, on='CATAID')
# # df.to_csv('enviros.csv')

No clue what to plot here...

## Examine the precomputed 2PCF

Download the 2PCF at several redshifts [here](https://drive.google.com/drive/folders/1eGlAO_wl9h0xiXiTMKV_m7h9YCRhDHP_?usp=sharing).

Note that the data is $\Delta^{2}(k)$, not the more familiar (to me) $\mathcal{P}(k)$.  (A reminder of the relationship between them can be found [here](http://universe-review.ca/R05-04-powerspectrum.htm), particularly in [this figure](http://universe-review.ca/I02-20-correlate1b.png).)
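
For reference, a common convention relates the two as $\Delta^{2}(k) = k^{3} P(k) / (2 \pi^{2})$. Whether the SLICS files follow exactly this convention is an assumption that should be checked against their documentation. Under that assumption, a conversion can be sketched as follows; `delta2_to_pk` is a hypothetical helper and the power-law spectrum is made up for the round-trip check.

```python
import numpy as np

def delta2_to_pk(k, delta2):
    """P(k) from the dimensionless Delta^2(k), assuming Delta^2 = k^3 P / (2 pi^2)."""
    return 2. * np.pi ** 2 * delta2 / k ** 3

# round trip on a hypothetical power-law spectrum
k = np.geomspace(0.01, 10., 5)
pk_true = 1.0e4 * k ** -1.5
delta2 = k ** 3 * pk_true / (2. * np.pi ** 2)
print(np.allclose(delta2_to_pk(k, delta2), pk_true))
```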

In [None]:
# pk = np.genfromtxt('NptFns/0.042ngpps_new.dat_LOS1').T

In [None]:
# plt.plot(pk[0], pk[1])
# plt.semilogx()
# plt.semilogy()
# plt.xlabel(r'$k$ [h/Mpc]')
# plt.ylabel(r'$\Delta^2(k)$')

In [None]:
# rmin = 2 * np.pi / max(pk[0])
# rmax = 2 * np.pi / min(pk[0])
# print((rmin, rmax))

# Next steps

Ultimately, we will need to calculate the 2 and 3+ point correlation functions of the particle data.
Because the data is split into 64 files per redshift, we also need a way to combine the positional information from each file to get coherent correlation functions.
We may be able to more easily accomplish both goals if we first smooth the data using a Fourier-space basis like wavelets.

## Combine particle data from multiple files

## Calculate the N-point correlation functions