# Preprocess training and test set for StarNet
This notebook walks through pre-processing the data needed to train StarNet and separating out a high S/N test set.

Requirements:
- python packages: `numpy h5py vos`

In [22]:
import numpy as np
import h5py
import os
import vos

datadir='/home/ubuntu/starnet_data/'  # or "/path/to/my/starnet/directory"

def starnet_download_file(filename):
    vclient = vos.Client()
    vclient.copy('vos:starnet/public/'+filename, datadir+filename)
    print(filename+' downloaded')
    
starnet_download_file('apStar_visits_main.h5')

apStar_visits_main.h5 downloaded


**Load the file that contains individual visit spectra along with APOGEE data associated with each star**

In [32]:
filename = datadir + 'apStar_visits_main.h5'
f = h5py.File(filename,'r')
print('Dataset keys in file: \n')
f.keys()

**Load the APOGEE data set into memory**

Training StarNet requires only the spectra and labels. To restrict the training set to the most reliable labels, however, we also load the quantities used for the quality cuts. We therefore read the APOGEE_IDs, the spectra, the S/N of the combined spectra, $T_{\mathrm{eff}}$, $\log(g)$, [Fe/H], $V_{scatter}$, the STARFLAGs, and the ASPCAPFLAGs.

In [24]:
ap_id = f['IDs'][:,0]
spectra = f['spectrum'][:]
combined_snr = f['stacked_snr'][:]
starflag = f['star_flag'][:]
aspcapflag = f['aspcap_flag'][:]
teff = f['TEFF'][:]
logg = f['LOGG'][:]
fe_h = f['FE_H'][:]
vscatter = f['VSCATTER'][:]

f.close()

print('Obtained spectra and data for '+str(len(ap_id))+' from '+str(len(set(ap_id)))+' stars.')


Obtained spectra and data for 559359 from 142333 stars.
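
Because each star can be observed on several visits, the number of spectra exceeds the number of unique APOGEE IDs. A minimal sketch of that count, using made-up ID strings rather than values from the file:

```python
import numpy as np

# Hypothetical visit-level IDs: one entry per visit spectrum
toy_ids = np.array([b'star_A', b'star_A', b'star_B',
                    b'star_C', b'star_C', b'star_C'])

# np.unique returns the sorted distinct IDs and the number of visits for each
unique_ids, visits_per_star = np.unique(toy_ids, return_counts=True)

print(len(toy_ids), 'visit spectra from', len(unique_ids), 'stars')
print(dict(zip(unique_ids.tolist(), visits_per_star.tolist())))
```

`np.unique` with `return_counts=True` also shows how unevenly visits are distributed across stars, which matters when the set is later split by visit index.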


**Separate out a dataset with good labels**
- combined spectral S/N $\geq$ 200
- STARFLAG = 0
- ASPCAPFLAG = 0
- 4000K < $T_{\mathrm{eff}}$ < 5500K
- -3.0 dex < [Fe/H]
- $\log(g)$ $\neq$ -9999. (value defined by ASPCAP when no ASPCAP labels are given)
- $V_{scatter}$ < 1.0 km/s

In [5]:
snr_min = 200.
teff_min = 4000.
teff_max = 5500.
vscatter_max = 1.
fe_h_min = -3.

In [6]:
# Combine all quality cuts into one boolean mask, one criterion per line
mask = ((aspcapflag[:]==0.) &
        (starflag[:]==0.) &
        (combined_snr[:]>=snr_min) &
        (vscatter[:]<vscatter_max) &
        (fe_h[:]>fe_h_min) &
        (teff[:]>teff_min) & (teff[:]<teff_max) &
        (logg[:]!=-9999.).reshape(len(ap_id),1))
indices, cols = np.where(mask)

ap_id = ap_id[indices]
spectra = spectra[indices]
teff = teff[indices]
logg = logg[indices]
fe_h = fe_h[indices]

print(str(len(ap_id))+' spectra remain from '+str(len(set(ap_id)))+' stars.')

53548 spectra remain from 17001 stars.
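
The cuts listed above can be sketched on a few toy values to show how the boolean mask behaves. The arrays below are hypothetical and 1-D; only a subset of the criteria is shown, and the real arrays in this notebook may have a different shape.

```python
import numpy as np

# Toy per-visit values (hypothetical, for illustration only)
toy_snr      = np.array([250., 150., 300., 220.])
toy_teff     = np.array([4500., 4800., 6000., 5000.])
toy_logg     = np.array([2.5, 3.0, 2.0, -9999.])
toy_starflag = np.array([0, 0, 0, 0])

# Combine the criteria element-wise; every array shares the same 1-D shape
toy_mask = ((toy_starflag == 0) &
            (toy_snr >= 200.) &
            (toy_teff > 4000.) & (toy_teff < 5500.) &
            (toy_logg != -9999.))

# Row indices of the visits that pass every cut
toy_indices = np.flatnonzero(toy_mask)
print(toy_indices)
```

Here only the first visit survives: the second fails the S/N cut, the third is too hot, and the fourth has the placeholder $\log(g)$ value.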


**Select the visits for the reference set**

Later on, it will be split into training and cross-validation sets.

In [7]:
num_ref = 44784 # number of reference spectra

ap_id_train = ap_id[0:num_ref]
spectra = spectra[0:num_ref]
teff = teff[0:num_ref]
logg = logg[0:num_ref]
fe_h = fe_h[0:num_ref]

print('Reference set includes '+str(len(ap_id_train))+' individual visit spectra from '+str(len(set(ap_id_train)))+' stars.')


Reference set includes 44784 individual visit spectra from 14221 stars.


**Separate a test set of APOGEE IDs**

- These APOGEE IDs will be processed in a following notebook to create StarNet's High S/N test set
- Remove from the test set any APOGEE ID that also appears in the training set (this is necessary because there are some duplicates in the APOGEE v603.fits file)

In [8]:
ap_id_test = ap_id[num_ref:]
ap_id_test = list(set(ap_id_test)-set(ap_id_train))
np.save('high_snr_test_apids', ap_id_test)
print(str(len(ap_id_test))+' stars to be processed for the High S/N test set.')

2780 stars to be processed for the High S/N test set.
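
The set difference used above can be checked on a small example. This is a sketch with made-up IDs and a hypothetical cut index `toy_num_ref`; it shows that a star whose visits straddle the cut is kept out of the test set.

```python
import numpy as np

# Hypothetical visit-level IDs, already filtered for quality
toy_ids = np.array([b'A', b'A', b'B', b'C', b'C', b'D', b'D', b'B'])
toy_num_ref = 5  # the first five visits go to the reference set

ref_ids = toy_ids[:toy_num_ref]

# Unique IDs after the cut, minus any star already present in the reference set
test_ids = sorted(set(toy_ids[toy_num_ref:]) - set(ref_ids))

print(test_ids)
```

Star `B` has a visit on each side of the cut, so it is dropped from the test IDs and only `D` remains.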


**Normalize spectra**
1. separate into three chips
2. divide by median value in each chip
3. recombine each spectrum into a vector of 7214 flux values

In [10]:
# Define edges of detectors
blue_chip_begin = 322
blue_chip_end = 3242
green_chip_begin = 3648
green_chip_end = 6048   
red_chip_begin = 6412
red_chip_end = 8306 

In [11]:
# Separate spectra into chips

blue_sp = spectra[:,blue_chip_begin:blue_chip_end]
green_sp = spectra[:,green_chip_begin:green_chip_end]
red_sp = spectra[:,red_chip_begin:red_chip_end]

# Normalize spectra by chips

blue_sp = (blue_sp.T / np.median(blue_sp, axis=1)).T
green_sp = (green_sp.T / np.median(green_sp, axis=1)).T
red_sp = (red_sp.T / np.median(red_sp, axis=1)).T 

# Recombine spectra

spectra = np.column_stack((blue_sp, green_sp, red_sp))

print('Reference spectra dataset now contains ' + str(spectra.shape[0])+' spectra, each with '+str(spectra.shape[1])+' wavelength bins')

Reference spectra dataset now contains 44784 spectra, each with 7214 wavelength bins
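
The chip widths follow from the edges defined above: (3242 - 322) + (6048 - 3648) + (8306 - 6412) = 2920 + 2400 + 1894 = 7214 bins. The same normalization can be sketched on a toy array. The two-chip layout and pixel edges below are made up and far smaller than the real detectors, and `keepdims=True` is used here as an alternative to the transpose trick in the cell above.

```python
import numpy as np

# Two toy spectra of 8 pixels; pixels 0-3 and 5-7 stand in for two chips
# (hypothetical edges, much smaller than the real detector edges)
toy_spectra = np.array([[2., 4., 6., 8., 0., 10., 20., 30.],
                        [1., 1., 3., 3., 0.,  5.,  5., 50.]])
chip_edges = [(0, 4), (5, 8)]

normalized_chips = []
for begin, end in chip_edges:
    chip = toy_spectra[:, begin:end]
    # keepdims=True keeps shape (n_spectra, 1) so the division broadcasts per row
    normalized_chips.append(chip / np.median(chip, axis=1, keepdims=True))

# Recombine the chips; the gap pixel (index 4) is dropped
toy_normalized = np.column_stack(normalized_chips)
print(toy_normalized.shape)
```

After this step each chip of each spectrum has a median of 1, so flux levels are comparable across chips and across stars.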


**Save new training data file**

The file contains the APOGEE IDs, spectra, and labels.

In [12]:
savename = datadir + 'training_data.h5'
# Remove any previous copy; on a first run there is nothing to delete
if os.path.exists(savename):
    os.remove(savename)
with h5py.File(savename, 'a') as f:
     
    spectra_ds = f.create_dataset('spectra', spectra.shape, dtype="f")
    teff_ds = f.create_dataset('TEFF', teff.shape, dtype="f")
    logg_ds = f.create_dataset('LOGG', logg.shape, dtype="f")
    fe_h_ds = f.create_dataset('FE_H', fe_h.shape, dtype="f")
    ap_id_ds = f.create_dataset('Ap_IDs', ap_id_train.shape, dtype="S18")
    
    spectra_ds[:] = spectra
    teff_ds[:] = teff
    logg_ds[:] = logg
    fe_h_ds[:] = fe_h
    ap_id_ds[:] = ap_id_train.tolist()

print(savename + ' has been saved as the reference set to be used in 4_Train_Model.ipynb')

training_data.h5 has been saved as the reference set to be used in 4_Train_Model.ipynb
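
To confirm that a file written this way can be read back, a round trip on toy data may help. This is a sketch only: the array values, the temporary path, and the name `toy_training_data.h5` are assumptions, and only three of the datasets are written.

```python
import os
import tempfile

import h5py
import numpy as np

# Toy stand-ins for the reference arrays (values are made up)
toy_spectra = np.ones((3, 5), dtype='f')
toy_teff = np.array([4500., 4800., 5000.], dtype='f')
toy_ids = np.array([b'star_A', b'star_B', b'star_C'], dtype='S18')

# Write to a temporary file; mode 'w' creates the file or overwrites it
toy_path = os.path.join(tempfile.mkdtemp(), 'toy_training_data.h5')
with h5py.File(toy_path, 'w') as f:
    f.create_dataset('spectra', data=toy_spectra)
    f.create_dataset('TEFF', data=toy_teff)
    f.create_dataset('Ap_IDs', data=toy_ids)

# Read the datasets back and check their shapes
with h5py.File(toy_path, 'r') as f:
    read_spectra = f['spectra'][:]
    read_ids = f['Ap_IDs'][:]
    print(sorted(f.keys()), read_spectra.shape, read_ids.shape)
```

Opening the file in mode `'w'` overwrites any existing copy, which avoids a separate delete step.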
