# Comparison of two sleep classifiers with LOOX

In this notebook, we analyze the performance of logistic regression on the _Walch et al._ data set and on a modified form we call the "hybrid motion" data set. The hybrid motion data set was created by taking triaxial accelerometer recordings sampled at 50 Hz and combining them with gyroscopic measurements from a naval platform to capture the ship's rotational and vibrational noise.

We further compare logistic regression with the deep UNet model from _Mads Olsen et al._, using its best-performing open weights.

For each 30-second epoch of an input recording (accelerometer, actigraphy, and/or heart rate and other features), our models output the probability that the epoch corresponds to a `sleep` PSG label. When running inference to group the epochs into periods of sleep and wake (or sleep stages), we set a threshold on this probability, above which we predict the epoch as sleep. For example, if a model outputs `0.71` (probability of sleep) for a given epoch, then a threshold of `0.5` maps this prediction to `sleep` since `0.71 > 0.5`, whereas a more stringent threshold of `0.76` returns a prediction of `wake` since `0.71 <= 0.76`.
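The thresholding rule above can be sketched in a few lines. This is a minimal illustration, not the notebook's inference code; the function name `threshold_epochs` and the example probabilities are hypothetical.

```python
def threshold_epochs(sleep_proba, threshold=0.5):
    """Map per-epoch sleep probabilities to labels: 1 = sleep, 0 = wake.

    An epoch is scored as sleep only when its probability is strictly
    above the threshold, matching the `0.71 <= 0.76` -> wake example.
    """
    return [1 if p > threshold else 0 for p in sleep_proba]


# Hypothetical model outputs for three consecutive 30-second epochs.
proba = [0.71, 0.20, 0.93]

labels_050 = threshold_epochs(proba, threshold=0.5)   # [1, 0, 1]
labels_076 = threshold_epochs(proba, threshold=0.76)  # [0, 0, 1]
```

Raising the threshold from `0.5` to `0.76` flips only the first epoch, the one whose probability lies between the two thresholds.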

An ROC curve captures this trade-off across all possible thresholds. It also reflects the fact that the model's probabilities do not need to span 0 to 1: even if the model only outputs predictions between 0.7 and 0.8, we could still pick an optimal threshold between those values as the cut-off for differentiating `sleep` from `wake`.
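To make the point about the score range concrete, here is a small sketch that computes the area under the ROC curve through its rank interpretation: the probability that a randomly chosen sleep epoch receives a higher score than a randomly chosen wake epoch, with ties counting one half. The helper `auroc` and the toy data are assumptions made for illustration; in practice a library routine such as scikit-learn's `roc_auc_score` would normally be used.

```python
def auroc(y_true, scores):
    """AUROC via pairwise comparison of sleep (1) and wake (0) scores."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one epoch of each class")
    # Count pairs where the sleep epoch outranks the wake epoch (ties = 0.5).
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy labels (1 = sleep, 0 = wake) and hypothetical model scores.
y_true = [1, 1, 0, 1, 0, 1, 1, 0]
scores = [0.95, 0.80, 0.60, 0.55, 0.30, 0.90, 0.40, 0.10]

# The same scores squeezed into the interval [0.7, 0.8].
squeezed = [0.7 + 0.1 * s for s in scores]

auc_full = auroc(y_true, scores)        # 13/15, about 0.867
auc_squeezed = auroc(y_true, squeezed)  # same ranking, same AUROC
```

Because the squeezing preserves the ordering of the scores, both calls give the same value; only the numeric thresholds along the curve change.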


The metrics we will use are:
1. **AUROC**: Area under the ROC curve.

    This is a useful way to evaluate the predictive power of a trained model on highly class-imbalanced data, such as our recordings, in which people are mostly asleep, with brief awakenings at the beginning and end of the recording and in a few places in the middle.
2. **WASA93**: Wake accuracy when sleep accuracy is at least 93%.

    Recalling the trade-off between scoring each class accurately and not missing epochs, for WASA93 we choose a threshold that makes the sleep accuracy (the fraction of true sleep epochs scored correctly) at least 93%, and then evaluate the accuracy on wake epochs with that threshold. This score is one way of quantifying the reliability of the prediction, especially in high-risk scenarios where monitoring awakenings during scheduled sleep hours is important.

3. **Cohen's kappa**: Agreement with the true labels beyond chance.

    A classical statistic that quantifies how much the predicted labels (at a given threshold) agree with the true labels beyond what "random" agreement would produce. For example, when a model simply outputs the same class (i.e., everything is reported as `sleep`, independent of input), its accuracy equals the fraction of true labels that take on that class, yet kappa is 0 because that agreement is no better than chance.
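The last two metrics can be sketched as follows. This is an illustrative implementation under stated assumptions, not the code used later in the notebook: the helper names `wasa` and `cohens_kappa`, the choice to search candidate thresholds among the observed scores, and the toy data are all hypothetical. Labels use 1 for `sleep` and 0 for `wake`, and an epoch is scored as sleep when its score is strictly above the threshold.

```python
def wasa(y_true, scores, target_sleep_acc=0.93):
    """Return (threshold, wake_accuracy) with sleep accuracy >= target.

    Among the observed scores, pick the threshold that maximizes wake
    accuracy while keeping sleep accuracy at or above the target.
    """
    sleep = [s for y, s in zip(y_true, scores) if y == 1]
    wake = [s for y, s in zip(y_true, scores) if y == 0]
    if not sleep or not wake:
        raise ValueError("need at least one sleep and one wake epoch")
    # Fallback: a threshold below every score predicts sleep everywhere.
    best = (float("-inf"), 0.0)
    for t in sorted(set(scores)):
        sleep_acc = sum(s > t for s in sleep) / len(sleep)
        wake_acc = sum(s <= t for s in wake) / len(wake)
        if sleep_acc >= target_sleep_acc and wake_acc > best[1]:
            best = (t, wake_acc)
    return best


def cohens_kappa(y_true, y_pred):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e)."""
    y_true, y_pred = list(y_true), list(y_pred)
    n = len(y_true)
    # Observed agreement.
    p_o = sum(a == b for a, b in zip(y_true, y_pred)) / n
    # Agreement expected by chance from the two label distributions.
    classes = set(y_true) | set(y_pred)
    p_e = sum((y_true.count(c) / n) * (y_pred.count(c) / n) for c in classes)
    if p_e == 1.0:
        return float("nan")  # undefined when both label sets are one constant class
    return (p_o - p_e) / (1 - p_e)


# Toy recording: 8 sleep epochs followed by 4 wake epochs.
y_true = [1] * 8 + [0] * 4
scores = [0.9, 0.85, 0.8, 0.75, 0.7, 0.65, 0.6, 0.35, 0.5, 0.3, 0.2, 0.1]

threshold, wake_acc = wasa(y_true, scores, target_sleep_acc=0.93)
y_pred = [1 if s > threshold else 0 for s in scores]
kappa_at_threshold = cohens_kappa(y_true, y_pred)

# A model that reports every epoch as sleep agrees only by chance.
kappa_all_sleep = cohens_kappa(y_true, [1] * len(y_true))
```

With only 8 sleep epochs in this toy example, a 93% sleep accuracy forces every sleep epoch to be scored correctly, so the selected threshold is `0.3` and three of the four wake epochs are caught. The all-sleep predictor reaches an accuracy of 8/12 but a kappa of 0.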