To make it easier for everyone to access the large (50+ GB) dataset in bite-sized chunks, I have written a script called repackage_data.py (under dustcurve/) that takes a healpy nside 128 pixel index as input and writes a repackaged data file named after that index (index.h5), containing all the stellar posterior, stellar coordinate, and CO intensity data for that pixel. A healpy nside 128 pixel index identifies a unique 0.2 sq. deg. patch of sky, so each file holds everything needed to run the MCMC code on that specific bite-sized area. All the mini-area files will eventually be made available on a public Odyssey server, so users will be able to retrieve only as much data as they need.
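
As a quick sanity check on the 0.2 sq. deg. figure, the pixel area follows from the fact that a HEALPix map divides the sphere into 12 × nside² equal-area pixels. The sketch below uses only the standard library; the function name `healpix_pixel_area_sq_deg` is illustrative and is not part of dustcurve or healpy.

```python
import math


def healpix_pixel_area_sq_deg(nside):
    """Area of one HEALPix pixel in square degrees (all pixels have equal area)."""
    npix = 12 * nside ** 2  # total number of pixels covering the sphere
    full_sky_sq_deg = 4 * math.pi * (180 / math.pi) ** 2  # whole sky, about 41253 sq. deg.
    return full_sky_sq_deg / npix


npix_128 = 12 * 128 ** 2
area_128 = healpix_pixel_area_sq_deg(128)
print("nside=128: {} pixels of {:.3f} sq. deg. each".format(npix_128, area_128))
```

At this resolution, covering the roughly 3x3 deg Cepheus region would take on the order of a few dozen such pixels.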

For the purposes of this tutorial, I used repackage_data.py to create a data file called "89998.h5", which contains all the stars in a 0.2 sq. deg. patch centered at (l,b)=(Galactic longitude, Galactic latitude)=(109.3359375,13.861949672). This is right in the middle of our cloud of interest (the Cepheus molecular cloud). While we would eventually like to run the script on the entire area covering Cepheus (approximately 3x3 deg), we start with a small section to make sure the script runs smoothly on real data.

This tutorial closely mimics the analysis done in model.ipynb, except that we are now using real data and estimating the values of d1 through d12, not just d7 and d11 (as in our thought experiment). Even though we're only using a small subset of the data, our values for d1-d12 should still hover around the literature distance to the Cepheus molecular cloud (approximately d=10). Before starting the walkers at completely random distances, let's start them near the expected value and check that we get well-mixed chains; then we can try something harder.
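
To illustrate what "starting near the expected value" means in practice, here is a minimal standard-library sketch that scatters starting positions in a Gaussian ball around d=10 with unit width in each of the 12 dimensions. It is similar in spirit to a Gaussian-ball initializer, but the helper name `gaussian_ball` and the fixed seed are assumptions made for this example, not part of dustcurve or emcee.

```python
import random


def gaussian_ball(center, std, nwalkers, seed=None):
    """Draw nwalkers starting positions, each scattered around `center`.

    Each coordinate is drawn independently from a normal distribution
    with the matching mean (from `center`) and width (from `std`).
    """
    rng = random.Random(seed)
    return [[rng.gauss(c, s) for c, s in zip(center, std)]
            for _ in range(nwalkers)]


ndim, nwalkers = 12, 50
starts = gaussian_ball([10.0] * ndim, [1.0] * ndim, nwalkers, seed=0)

# every walker gets its own 12-dimensional starting vector near d=10
flat = [value for walker in starts for value in walker]
print("walkers:", len(starts), "dimensions:", len(starts[0]))
print("mean starting distance: {:.2f}".format(sum(flat) / len(flat)))
```

Starting from a tight ball like this makes it easier to separate sampler problems from initialization problems before moving on to widely dispersed starting points.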

In [None]:
import emcee
from dustcurve import model
import matplotlib.pyplot as plt  # needed for plt.subplots below
import seaborn as sns
import numpy as np
import pandas as pd

# the model has 12 parameters; we'll use 50 walkers and 500 steps each
ndim = 12
nwalkers = 50
nsteps = 500

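# This MH code has been adapted from code snippets found in the PHYS 201 week 9 MCMC notebook (written by Vinny and Tom)
# and the PHYS 201 week 9 homework solutions (written by Tom and Kevin Shane)
# Note: emcee.MHSampler is part of the emcee 2.x API; it was removed in emcee 3
filename = "89998.h5"
# Metropolis-Hastings sampler with an identity proposal covariance (unit variance per parameter)
sampler = emcee.MHSampler(np.eye(ndim), ndim, model.log_posterior, args=[filename])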

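allsamples = np.empty((1, ndim))
# alternative: start each parameter at a random integer distance
# pos_array = [np.random.randint(4, 19) for i in range(ndim)]
pos_array = [10. for i in range(ndim)]  # start every parameter near d=10
std_array = [1. for i in range(ndim)]   # unit scatter around the starting value
# set up the initial position vectors for our walkers
starting_positions = emcee.utils.sample_ball(pos_array, std_array, nwalkers)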

# run one 500-step chain from the first starting position, then extend it
# nwalkers more times, each run resuming from the chain's last position
%time sampler.run_mcmc(starting_positions[0],nsteps)
for i in range(nwalkers):
    %time sampler.run_mcmc(sampler.chain[-1,:],nsteps)
print('Done')

# set up one subplot per parameter for chain plotting
fig, axes = plt.subplots(ndim)

# label each axis and plot the trace of the matching parameter,
# discarding the first 1000 steps as burn-in
# (sns.tsplot is deprecated and missing from recent seaborn releases)
for j, ax in enumerate(axes):
    ax.set(ylabel='d{}'.format(j + 1))
    sns.tsplot(sampler.chain[1000:, j], ax=ax)

# collect the post-burn-in samples (first 1000 steps discarded), one column per parameter
parameter_samples = pd.DataFrame({'d{}'.format(j + 1): sampler.chain[1000:, j]
                                  for j in range(ndim)})

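# 16th, 50th, and 84th percentiles: the median and the bounds of the central 68% interval
q = parameter_samples.quantile([0.16, 0.50, 0.84], axis=0)

# what values do we get? report each parameter as median + upper error - lower error
for i in range(1, 13):
    name = "d" + str(i)
    print("{} = {:.2f} + {:.2f} - {:.2f}".format(name, q[name][0.50],
                                                  q[name][0.84] - q[name][0.50],
                                                  q[name][0.50] - q[name][0.16]))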

This script is still running on Odyssey; I will update this section when the results come back. The full dataset would take at least a day to run, but since we're only using a small subset, this run should take a few hours.
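
In the meantime, the quantile summary used in the cell above can be tried on synthetic samples. The sketch below is a standard-library stand-in, not the real analysis: the samples are drawn from a made-up Gaussian (mean 10, width 0.5), and the helper `quantile` is a hypothetical function using linear interpolation between order statistics. For a Gaussian posterior, the 16th and 84th percentiles sit about one standard deviation on either side of the median, which is why this triple is a common way to report a value with asymmetric error bars.

```python
import random


def quantile(sorted_vals, q):
    """Linearly interpolated quantile of an already-sorted list (0 <= q <= 1)."""
    pos = q * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac


# synthetic stand-in for one parameter's post-burn-in chain (NOT real results)
rng = random.Random(42)
samples = sorted(rng.gauss(10.0, 0.5) for _ in range(20000))

p16, p50, p84 = (quantile(samples, q) for q in (0.16, 0.50, 0.84))
upper, lower = p84 - p50, p50 - p16
print("d = {:.2f} + {:.2f} - {:.2f}".format(p50, upper, lower))
```

With real chains the upper and lower errors need not be equal, which is the reason for reporting them separately rather than as a single standard deviation.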