In [None]:
import os
import pathlib
import json
import matplotlib.pyplot as plt
import numpy as np
from scope.utils import read_parquet

# Data on Google Drive
#### https://drive.google.com/drive/folders/13cm3Tf3RtudlVA5fBaMcHVi8anOnn8Gy?usp=sharing

***Available data include:***
- a field of generated SCoPe features from ZTF DR16 (`field_297`)
- text files with current SCoPe model performance
- the full training set, including features generated from DR16
- a downsampled training set containing 10% of the full training set's rows
- a column guide describing selected columns of the training sets
- DR16 light curves associated with the downsampled training set

## Outline of Notebook:
- **Installing SCoPe**
- **Training**
- **Plotting classifier performance**
- **Inference**
- **Examining predictions**
- **Plot Field 297 predictions**
- **Train a DNN classifier**

**Tasks** are intended to be the primary means of interacting with this notebook.

***Notes*** are meant to more broadly describe SCoPe functionality, but you're welcome to explore these avenues further if time permits. 

# Installing SCoPe

The SCoPe GitHub repository is located here: https://github.com/ZwickyTransientFacility/scope

Follow the instructions here to install SCoPe on your computer: https://zwickytransientfacility.github.io/scope-docs/developer.html

# Training

By default, SCoPe code is run from the command line. In this notebook, we use `os.system` to call `scope.py train`.
 
**Tasks:**
 
- To start, open the `SCoPe training set column guide` on Google Drive to have a reference for the columns of the training set.

- Download the `DR16_merged_classifications_features_revamped_updated_imputed.parquet` training set from Google Drive, placing it within your `scope` directory.
 
- Change the dataset path in `config.yaml` before running the following code. The path specifies the location within the `scope` directory where you put the training set. In `config.yaml`, the path can be found under `training: dataset: `.

In [None]:
# We begin by specifying the tag (or label) on which to train a binary classifier:
tag = 'vnv'
# See all tags under "Labels" in the "SCoPe training set column guide" Google Sheet

# SCoPe supports neural network (dnn) and XGBoost (xgb) algorithms:
algorithm = 'xgb'

# If --save is passed, training results are saved to the group named below: 
group = 'ss23'

# SCoPe determines light curve periods using GPU-accelerated algorithms.
# These algorithms include a Lomb-Scargle approach (ELS), Conditional Entropy (ECE),
# Analysis of Variance (AOV), and an approach nesting all three (ELS_ECE_EAOV).
# Periodic features are stored with the suffix specified below:
period_suffix = 'ELS_ECE_EAOV'

# We require at least min_count positive examples to run training.
min_count = 1000

# Neural network training takes an --epochs argument that we set to 30 here.
epochs = 30

In [None]:
# Build the shell command as a plain string before passing it to os.system
os.system(f'{pathlib.Path.home()}/scope/scope.py train {tag} --algorithm={algorithm} \
          --group={group} --period_suffix={period_suffix} --epochs={epochs} --verbose --save --plot --skip_cv')


***Notes:***

*The above training runs the XGB algorithm by default and skips cross-validation in the interest of time. If you have time after going through the notebook, you can remove the `--skip_cv` argument to run a cross-validated grid search of XGB hyperparameters during training.*

*DNN hyperparameters have already been optimized using a different approach - Weights and Biases Sweeps (https://docs.wandb.ai/guides/sweeps). The results of these sweeps have been saved in the config file. To run another round of sweeps for DNN, you can create a WandB account and set the `--run_sweeps` keyword in your call to `scope.py train`.*
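
A call to `scope.py train` can also be assembled as a list of arguments and run with `subprocess.run`, which makes it easy to toggle `--skip_cv` and to stop on a failed run. The following is a minimal sketch rather than part of SCoPe: the helper names `build_train_command` and `run_command` are hypothetical, and the flags are only those already used in this notebook.

```python
import shlex
import subprocess
from pathlib import Path


def build_train_command(scope_dir, tag, algorithm, group, period_suffix, epochs, skip_cv=True):
    """Return the `scope.py train` call as a list of arguments (hypothetical helper)."""
    cmd = [
        str(Path(scope_dir) / "scope.py"),
        "train",
        tag,
        f"--algorithm={algorithm}",
        f"--group={group}",
        f"--period_suffix={period_suffix}",
        f"--epochs={epochs}",
        "--verbose",
        "--save",
        "--plot",
    ]
    if skip_cv:
        # Omit this flag to run the cross-validated grid search described above
        cmd.append("--skip_cv")
    return cmd


def run_command(cmd):
    """Run the command and raise CalledProcessError on a non-zero exit code."""
    print("Running:", shlex.join(cmd))
    return subprocess.run(cmd, check=True)


cmd = build_train_command(Path.home() / "scope", "vnv", "xgb", "ss23", "ELS_ECE_EAOV", 30)
# run_command(cmd)  # uncomment to launch training
```

Unlike `os.system`, `subprocess.run(..., check=True)` raises an exception when the command exits with an error, so a failed training run is harder to miss.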

**Task: train multiple classifiers with one script**
- Run the cell below to use `scope.py create_training_script` to generate a script that trains many classifiers sequentially.

In [None]:
# Note: you will get an error if you try to create a training script with a name that already exists
os.system(f'{pathlib.Path.home()}/scope/scope.py create_training_script --filename=train_{algorithm}_ss.sh \
          --min_count={min_count} --algorithm={algorithm} --period_suffix={period_suffix} --add_keywords="--save --plot --group={group} --epochs={epochs} --skip_cv"')

- Add executable permissions to the new training script:

In [None]:
os.system(f'chmod +x $HOME/scope/train_{algorithm}_ss.sh')

- Run the training script you generated in a terminal window (using `./train_xgb_ss.sh`) to train multiple labels for the XGB algorithm. This could take ~15-20 minutes to finish for all classifiers. Continue to the "Plotting classifier performance" section in the meantime.

***Note: running training on HPC resources***

*`train_algorithm_slurm.py` and `train_algorithm_job_submission.py` can be used to generate and submit `slurm` scripts to train all classifiers in parallel using HPC resources.*

# Plotting classifier performance

SCoPe saves diagnostic plots and JSON files to report each classifier's performance. The code below shows the location of these results for one classifier.

In [None]:
path_model = pathlib.Path.home() / f'scope/models_{algorithm}/{group}/{tag}'
# Take the first matching stats file (sorted so the choice is deterministic)
path_stats = sorted(path_model.glob('*plots/val/*stats.json'))[0]

In [None]:
# Path to model
print(path_model)

# Path to performance stats json (validation set)
print(path_stats)

In [None]:
with open(path_stats) as f:
    stats = json.load(f)

In [None]:
stats

In [None]:
plt.figure(figsize=(6,4))
plt.rcParams['font.size']=13
plt.title(f"{algorithm} performance ({tag})")
plt.bar(tag, stats['precision'], color='blue',width=1,label='precision')
plt.bar(tag, stats['recall'], color='red',width=0.6, label='recall')
plt.legend(ncol=2,loc=0)
plt.ylim(0,1.15)
plt.xlim(-3,3)
plt.ylabel('Score')

# Can also loop over many labels to compare each one

**Tasks:**
- In `scope/models_xgb/ss23`, find and examine the `vnv` diagnostic plots/text files for the validation set (`val`).
    - How are the accuracy, precision, recall, F1 score, and area under the ROC curve defined?
    - What are the top three most important features for the `vnv` classifier? Bottom three? (Use results files with `impvars` in the name)
- As more models complete training, write code to plot the results for each classifier in a way that compares their validation precision/recall. **Once training completes for all models, continue on to the "Inference" section and resume plotting work after you complete that section.**
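
One possible starting point for the comparison task is to gather the validation stats of every trained classifier into a single dictionary. This sketch assumes the directory layout shown earlier in this section (`<tag>/*plots/val/*stats.json` inside the group directory) and that each stats file contains `precision` and `recall` keys; the function name `collect_val_stats` is hypothetical.

```python
import json
from pathlib import Path


def collect_val_stats(group_dir, keys=("precision", "recall")):
    """Map each classifier tag to selected validation stats.

    Assumes the layout <group_dir>/<tag>/*plots/val/*stats.json.
    """
    results = {}
    for tag_dir in sorted(Path(group_dir).iterdir()):
        if not tag_dir.is_dir():
            continue
        stats_files = sorted(tag_dir.glob("*plots/val/*stats.json"))
        if not stats_files:
            continue  # no stats yet, e.g. training has not finished for this tag
        with open(stats_files[0]) as f:
            stats = json.load(f)
        # Keep only the requested keys that are present in the file
        results[tag_dir.name] = {k: stats[k] for k in keys if k in stats}
    return results
```

Calling `collect_val_stats(pathlib.Path.home() / f'scope/models_{algorithm}/{group}')` would return one entry per finished classifier, which can then be passed to `plt.bar` to draw precision and recall side by side.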

#### Define performance stats here
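
As a reference while answering, the sketch below computes accuracy, precision, recall, and F1 score from confusion-matrix counts using the standard textbook definitions. It is not taken from SCoPe's code, so details such as the probability threshold used to assign labels may differ. The counts in the example are made up for illustration, and the area under the ROC curve is not covered.

```python
def binary_metrics(tp, fp, tn, fn):
    """Textbook binary-classification metrics from confusion-matrix counts.

    tp/fp: true/false positives, tn/fn: true/false negatives.
    """
    total = tp + fp + tn + fn
    # Precision: fraction of predicted positives that are truly positive
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: fraction of true positives that were recovered
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1: harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Illustrative counts only (not SCoPe results)
example = binary_metrics(tp=80, fp=20, tn=890, fn=10)
print(example)
```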



In [None]:
# Analyze feature importance here


In [None]:
# Plot classifier results here


# Inference

We use `tools/inference.py` to run inference on a field of generated features.



**Tasks:**
- Download the `generated_features` directory from Google Drive and place it within your `scope` directory.

In [None]:
feature_directory = 'generated_features'
field = 297

- Generate an inference script. This is the easiest way to run inference with all trained models.
### Ensure that the training script has finished before running the cell below: ###

In [None]:
# Note you will get an error if you try to create an inference script with a name that already exists
os.system(f'{pathlib.Path.home()}/scope/scope.py create_inference_script --filename=get_all_preds_ss_{algorithm}.sh --group_name={group} \
          --algorithm={algorithm} --period_suffix={period_suffix} --feature_directory={feature_directory}')

- Add executable permissions to the new inference script:

In [None]:
os.system(f'chmod +x $HOME/scope/get_all_preds_ss_{algorithm}.sh')

- Run inference by calling the inference script:

In [None]:
# Can take a few minutes to impute features before displaying output
os.system(f'{pathlib.Path.home()}/scope/get_all_preds_ss_{algorithm}.sh {field}')

***Note: running inference on HPC resources***

*`run_inference_slurm.py` and `run_inference_job_submission.py` can be used to generate and submit `slurm` scripts to run inference for all classifiers in parallel using HPC resources.*

# Examining predictions

The result of running the inference script is a parquet file containing some descriptive columns followed by columns giving each classification's probability for each source in the field. By default, the file is located as follows:

In [None]:
path_preds = pathlib.Path.home() / f"scope/preds_{algorithm}/field_{field}/field_{field}.parquet"
print(path_preds)

**Tasks:**
- Use SCoPe's `read_parquet` utility to read the predictions file

In [None]:
preds = read_parquet(path_preds)

In [None]:
preds.columns

In [None]:
preds.describe()

# Plot Field 297 predictions

**Tasks:**
- Make a histogram of probabilities for a single classification in Field 297.
- Make a scatter plot comparing the probabilities of two related classifications (e.g. `vnv` and `rrlyr`).
- Determine what fraction of Field 297 sources have a `vnv` probability greater than 0.7.
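
For the last task, the fraction of sources above a probability threshold can be computed with a small helper like the one below. This is a sketch using plain Python so that it works on any iterable of numbers; the function name `fraction_above` is hypothetical, and the example values are made up rather than taken from Field 297.

```python
def fraction_above(probabilities, threshold=0.7):
    """Return the fraction of values strictly greater than `threshold`.

    NaN values compare as False, so they count as not above the threshold.
    """
    values = [float(p) for p in probabilities]
    if not values:
        raise ValueError("no probabilities given")
    return sum(v > threshold for v in values) / len(values)


# Illustrative values only (not real predictions)
demo = fraction_above([0.1, 0.75, 0.9, 0.7])
print(demo)
```

With the predictions loaded, the same function could be applied to one column, e.g. `fraction_above(preds[column])`, where `column` is the relevant name found in `preds.columns` (the exact column naming is not assumed here).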


# Train a DNN classifier

**Tasks:**
- Once finished with these tasks, return to the top of the notebook and choose one classification to rerun training/inference for DNN, setting `algorithm = 'dnn'`. You can choose more than one, but it will take longer. The defaults in this notebook should take ~15 mins to train one DNN classifier.
- Compare training performance between the XGB and DNN classifiers. Which algorithm performs better for your chosen classification?
- Compare predictions between the XGB and DNN algorithms for your chosen classification. What differences do you see?
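
One simple way to compare two sets of predictions for the same sources is to count how often both algorithms place a source above a chosen probability threshold. The sketch below assumes the two probability sequences are aligned source by source; the function name `compare_predictions` and the threshold of 0.7 are illustrative choices, and the example values are made up.

```python
def compare_predictions(probs_a, probs_b, threshold=0.7):
    """Count agreements and disagreements between two aligned probability lists."""
    a = [float(p) for p in probs_a]
    b = [float(p) for p in probs_b]
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    if not a:
        raise ValueError("inputs must not be empty")
    counts = {"both": 0, "only_a": 0, "only_b": 0, "neither": 0}
    for pa, pb in zip(a, b):
        above_a, above_b = pa > threshold, pb > threshold
        if above_a and above_b:
            counts["both"] += 1
        elif above_a:
            counts["only_a"] += 1
        elif above_b:
            counts["only_b"] += 1
        else:
            counts["neither"] += 1
    # Agreement: both algorithms fall on the same side of the threshold
    counts["agreement"] = (counts["both"] + counts["neither"]) / len(a)
    return counts


# Illustrative values only (not real predictions)
summary = compare_predictions([0.9, 0.8, 0.2, 0.1], [0.95, 0.3, 0.85, 0.05])
print(summary)
```

Sources counted under `only_a` or `only_b` are the ones where the two algorithms disagree and are a natural place to start looking at individual light curves.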

***Note:***
*SCoPe DNN training does not provide feature importance information (due to the hidden layers of the network). Feature importance can be estimated for neural networks, but doing so is more computationally expensive than obtaining this "free" information from XGB.*