# Random Forest Training for QSO Target Selection

**Author:** Edmond Chaussidon (CEA Saclay)

This notebook explains how the random forest files for the target selection are generated.

All files are written to and saved at NERSC in `/global/cfs/cdirs/desi/target/analysis/RF`.

**WARNING** This notebook has already generated files at NERSC! **PLEASE** change the path and savename so that you do not overwrite data, or make sure that the current files are preserved.
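
One way to follow this warning is to check for an existing file before writing anything. The sketch below is only an illustration: `safe_output_path` is a hypothetical helper and is not part of desitarget.

```python
from pathlib import Path


def safe_output_path(directory, savename):
    """Return directory/savename, refusing a path that already exists.

    Hypothetical helper for illustration; it is not part of desitarget.
    """
    path = Path(directory) / savename
    if path.exists():
        raise FileExistsError(f'{path} already exists; choose another savename')
    return path
```

Calling it on each output name before running a cell would stop the notebook before an existing file is replaced.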

The training is divided into three parts:

* 1) data_collection: collect data from dr9
* 2) data_preparation: build attributes for the RF
* 3) train_test_RF: training and some tests

**WARNING** You need a version of topcat: `http://www.star.bris.ac.uk/~mbt/topcat/`

In [None]:
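import subprocess
from pathlib import Path

# NERSC directory where the random forest files are written
DIR = '/global/cfs/cdirs/desi/target/analysis/RF/'

# Location of the desitarget training code, relative to the current working directory
path_train = f'{Path().absolute()}/../../py/desitarget/train/'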

-------
## 1)  data_collection

**REMARK:** You do not need to run this section for the training if the files already exist in DIR.
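
Whether the files are already present can be checked before deciding to skip the section. The sketch below is a minimal example; `missing_files` is a hypothetical helper, and the list of names is an assumption taken from the file names used in this notebook, so adjust it to the files you actually need.

```python
from pathlib import Path


def missing_files(directory, filenames):
    """Return the names in `filenames` that do not exist in `directory`."""
    directory = Path(directory)
    return [name for name in filenames if not (directory / name).is_file()]


# Output names used by the data_collection cells of this notebook
# (assumed list; the raw samples are removed again in section 2).
collection_outputs = [
    'dr9s_sweep_meta.fits',
    'dr9n_sweep_meta.fits',
    'QSO_DR9s.fits',
    'STARS_DR9s.fits',
    'TEST_DR9s.fits',
]
```

An empty result from `missing_files(DIR, collection_outputs)` would suggest that the collection step can be skipped.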

In [None]:
from desitarget.train.data_collection.sweep_meta import sweep_meta

sweep_meta('dr9s', f'{DIR}dr9s_sweep_meta.fits')
sweep_meta('dr9n', f'{DIR}dr9n_sweep_meta.fits')

* Set the path to your version of topcat in my_tractor_extract_batch.py:

    `STILTSCMD = 'java -jar -Xmx4096M /global/homes/e/edmondc/Software/topcat/topcat-full.jar -stilts'` 
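
Before launching the extraction, it can help to confirm that Java and the topcat jar are reachable. This is a sketch under the assumption that the command only needs a `java` executable on the PATH and a readable jar file; `stilts_prerequisites` is a hypothetical helper, not part of desitarget or topcat.

```python
import shutil
from pathlib import Path


def stilts_prerequisites(jar_path):
    """Return a list of problems that would prevent running STILTS from the jar."""
    problems = []
    # shutil.which returns None when the executable is not on the PATH
    if shutil.which('java') is None:
        problems.append('java executable not found on PATH')
    if not Path(jar_path).is_file():
        problems.append(f'topcat jar not found: {jar_path}')
    return problems
```

An empty list means both prerequisites were found; it does not guarantee that the jar itself is a working topcat build.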


In [None]:
from desitarget.train.data_collection.my_tractor_extract_batch import my_tractor_extract_batch

# Collect the QSO sample
my_tractor_extract_batch(16, f'{DIR}QSO_DR9s.fits', 'dr9s', '0,360,-10,30', 'qso', path_train, DIR)

In [None]:
# Collect the stars sample
my_tractor_extract_batch(4, f'{DIR}STARS_DR9s.fits', 'dr9s', '320,340,-1.25,1.25', 'stars', path_train, DIR)

In [None]:
# Collect the test sample
my_tractor_extract_batch(4, f'{DIR}TEST_DR9s.fits', 'dr9s', '30,45,-5,5', 'test', path_train, DIR)
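
The fourth argument of each call looks like a rectangular sky region written as `'RA_min,RA_max,Dec_min,Dec_max'` in degrees. That reading is an assumption, inferred from the test region 30<RA<45, -5<DEC<5 quoted in the next section. Under that assumption, a region string can be validated before a long extraction is started; `SkyBox` and `parse_region` are hypothetical names.

```python
from typing import NamedTuple


class SkyBox(NamedTuple):
    ra_min: float
    ra_max: float
    dec_min: float
    dec_max: float


def parse_region(region):
    """Parse 'ra_min,ra_max,dec_min,dec_max' (assumed order, in degrees)."""
    values = [float(v) for v in region.split(',')]
    if len(values) != 4:
        raise ValueError(f'expected 4 comma-separated numbers, got {region!r}')
    box = SkyBox(*values)
    # Reject boxes with zero or negative extent
    if box.ra_min >= box.ra_max or box.dec_min >= box.dec_max:
        raise ValueError(f'empty region: {region!r}')
    return box
```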

--------
## 2) data_preparation 

**Remark:** The test region (30<RA<45 & -5<DEC<5) is removed from the training samples in *data_preparation/Code/make_training_samples.py*, where it is **hard-coded**. Take **CARE** if you do not use this region for the test sample.
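
As an illustration of such a cut, the sketch below builds a boolean mask that excludes the test region from a set of coordinates. It is not the code in make_training_samples.py: the function name is hypothetical, and the strict inequalities simply mirror the bounds quoted above.

```python
import numpy as np


def outside_test_region(ra, dec, ra_range=(30.0, 45.0), dec_range=(-5.0, 5.0)):
    """Boolean mask that is True for objects outside the test region.

    Default bounds are the ones quoted in the text; the strict inequalities
    are an assumption about the actual implementation.
    """
    ra = np.asarray(ra, dtype=float)
    dec = np.asarray(dec, dtype=float)
    in_test = ((ra > ra_range[0]) & (ra < ra_range[1])
               & (dec > dec_range[0]) & (dec < dec_range[1]))
    return ~in_test
```

If a different test region is used, the same bounds would have to be changed both here and in the hard-coded cut.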

In [None]:
# Build attributes and resample stars to avoid over-representation in the training samples
tmpstr = f'python data_preparation/make_training_samples.py -i1 {DIR}QSO_DR9s.fits -i2 {DIR}STARS_DR9s.fits -o1 {DIR}QSO_TrainingSample_DR9s.fits -o2 {DIR}STARS_TrainingSample_DR9s.fits'
ret = subprocess.call(tmpstr, shell=True)

# Remove the raw samples only if the previous step succeeded
if ret == 0:
    tmpstr = f'rm {DIR}QSO_DR9s.fits {DIR}STARS_DR9s.fits'
    subprocess.call(tmpstr, shell=True)

In [None]:
# Build attributes for the test sample
tmpstr = f'python ./Code/make_test_sample.py -i {DIR}TEST_DR9s.fits -o {DIR}TestSample_DR9s.fits'
ret = subprocess.call(tmpstr, shell=True)

# Remove the raw test sample only if the previous step succeeded
if ret == 0:
    tmpstr = f'rm {DIR}TEST_DR9s.fits'
    subprocess.call(tmpstr, shell=True)
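
The cells in this section run a script and then delete its input files. A variant that avoids `shell=True` and stops before the deletion when the script fails could look like the sketch below; `run_then_remove` is a hypothetical helper, not part of desitarget.

```python
import subprocess
from pathlib import Path


def run_then_remove(command, inputs):
    """Run `command` (a list of arguments) and delete `inputs` only on success.

    subprocess.run(check=True) raises CalledProcessError on a non-zero exit
    status, so the deletion loop is never reached when the command fails.
    """
    subprocess.run(command, check=True)
    for name in inputs:
        Path(name).unlink()
```

Passing the command as a list also avoids quoting problems if a path ever contains spaces.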

------
## 3) train_test_RF

* Modify **filenames** in *train_test_RF/PipelineConfigScript.py*

* Modify **filenames** in *train_test_RF/Convert_to_DESI_RF.py*

In [None]:
# Pipeline configuration (to generate different RF)
tmpstr = 'python train_test_RF/PipelineConfigScript.py'
subprocess.call(tmpstr, shell=True)
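
The training cells read `./WorkingDir/config.npz`, which is presumably the file written by this configuration step (an assumption, since the script itself is not shown here). To see what an `.npz` archive contains without loading its arrays, the stored names can be listed with NumPy; `list_npz_keys` is a hypothetical helper.

```python
import numpy as np


def list_npz_keys(path):
    """Return the array names stored in an .npz archive without loading them."""
    # NpzFile loads members lazily; `.files` only lists the stored names
    with np.load(path) as archive:
        return sorted(archive.files)
```

Reading the values themselves may require `allow_pickle=True` if the archive stores Python objects, which should only be used on files you trust.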

In [None]:
# RF training
tmpstr = 'python train_test_RF/train_RF.py --config_fpn ./WorkingDir/config.npz --MODEL DR8s_LOW --mod_dpn ./WorkingDir/DR8s/RFmodel/DR8s_LOW'
subprocess.call(tmpstr, shell=True)

In [None]:
# RF HighZ training
tmpstr = 'python train_test_RF/train_RF.py --config_fpn ./WorkingDir/config.npz --MODEL DR8s_HighZ --mod_dpn ./WorkingDir/DR8s/RFmodel/DR8s_HighZ'
subprocess.call(tmpstr, shell=True)
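
The two training commands differ only in the model name, so they can be generated from one template. The sketch below reuses the paths and flags from the cells above; `train_command` is a hypothetical helper, and the commands are only printed, not executed.

```python
def train_command(model, config='./WorkingDir/config.npz',
                  model_root='./WorkingDir/DR8s/RFmodel'):
    """Build the train_RF.py argument list for one model name."""
    return [
        'python', 'train_test_RF/train_RF.py',
        '--config_fpn', config,
        '--MODEL', model,
        '--mod_dpn', f'{model_root}/{model}',
    ]


# Model names taken from the two training cells of this notebook
commands = [train_command(model) for model in ('DR8s_LOW', 'DR8s_HighZ')]
for command in commands:
    print(' '.join(command))
```

Each list could then be passed to `subprocess.run(command, check=True)` to run the training.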

In [None]:
# Convert from sklearn to desitarget format
tmpstr = 'python train_test_RF/Convert_to_DESI_RF.py'
subprocess.call(tmpstr, shell=True)

------------
## 4) Some tests

In [None]:
# TODO: run compare_rf (using dr8 from desitarget as an example)