TODO:
- Add test scores to table in paper
- Add related work

# Table of Contents

uses the most recent Fourier data from EJ (`sc-agg-f16.npz`) to classify the sleep state for each time interval. The models used are:

- a Kalman filter on the Fourier data, then using softmax classification on the hidden states
- a softmax classifier on the original Fourier data

The approaches are:

1.  **done**: train Kalman filter and softmax on one person, training and validating softmax on random time segments
1. **done**: train Kalman filter and softmax on one person, then run Kalman and softmax on other person
1. **done**: repeat, training and testing on multiple people
1. **done**: train softmax on one person, training and validating on random time segments
1. **done**: train softmax on one person, then run softmax on other person
1. **done**: repeat, training and testing on multiple people
1. **done** (compare results with 2 and 3 above):
- 
  - fit separate Kalman filter on each patient's first night
  - fit one softmax on hidden states of each patient's first night
  - filter on **other** night using each patient's Kalman filter
  - softmax on hidden states for same night
8. **done** (if time) for a given person (predict *future* sleep states):
- 
  - fit a Kalman filter with EM on one night
  - fit softmax on Fourier and hidden states of same night
  - on other night, filter or smooth on first N steps
  - **sample** observations and hidden states for rest of night
  - softmax observations and hidden states

Note of something possibly useful: compare **likelihood** of true observations (train night) to likelihood of predicted observations (val night)

In [2]:
from matplotlib import pyplot as plt
import numpy as np

from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.metrics import homogeneity_score, completeness_score, v_measure_score, \
  adjusted_rand_score, adjusted_mutual_info_score, fowlkes_mallows_score, accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

from pykalman import KalmanFilter

π = np.pi

In [3]:
def get_scores(y, pred):
  """Get same scores as other methods used in the project."""
  names = [
    "Fowlkes-Meadows",
    "Homogeneity",
    "Completeness",
    "V Measure",
    "Adjusted Rand",
    "Adjusted Mutual",
    "Accuracy"
  ]

  scores = [
    fowlkes_mallows_score(y, pred),
    homogeneity_score(y, pred),
    completeness_score(y, pred),
    v_measure_score(y, pred),
    adjusted_rand_score(y, pred),
    adjusted_mutual_info_score(y, pred),
    accuracy_score(y, pred)
  ]
  
  formatted = ""
  num_places = 4
  for name, score in zip(names, scores):
    formatted += f" & {str(round(score, num_places)).lstrip('0')}"
    print(f"{name}: {round(score, num_places)}")

  return formatted

def report(y, pred, nan_mask, which_report: str):
  print(f'{which_report} report')
  
  print(classification_report(y[~nan_mask], pred, zero_division=np.nan))

  print('proportions in classes')
  print(np.unique(y[~nan_mask], return_counts=True)[1] / len(pred))
  print()

  print('confusion matrix')
  print(np.round(confusion_matrix(y[~nan_mask], pred, normalize='true'), 2))
  print()

  return get_scores(y[~nan_mask], pred)

# Load data

In [4]:
data_path = 'sc-agg-f16.npz'

data = np.load(data_path)

# Softmax classification on Kalman filter hidden states

## 1. train Kalman and softmax on one person, training and validating softmax on random time segments

In [6]:
patient = data['train_patients'][0]
print(patient)
X, y = data[patient][:, :-1], data[patient][:, -1]

num_features = X.shape[1]

X.shape

SC4241


(2702, 64)

In [7]:
sleep_stages=  ('m', 'w', '1', '2', '3', '4', 'r')
# dim_x doesn't necessarily need to be the same as num_stages
num_stages = len(sleep_stages)
dim_x = num_stages
dim_z = num_features

# Init
kf = KalmanFilter(n_dim_state=dim_x, n_dim_obs=dim_z, em_vars="all")

# Fit to fourier data with expectation maximization
kf.em(X)

<pykalman.standard.KalmanFilter at 0x7f95d84f9e70>

In [8]:
# Predict the states for the fourier data
states, _ = kf.filter(X)

In [13]:
# Train-validation split
val_size = 0.3
X_train, X_val, y_train, y_val = \
  train_test_split(states, y, test_size=val_size, random_state=42)

In [14]:
# Transform data to have mean zero and standard deviation one
# (otherwise, LogisticRegression says it doesn't converge though performs about the same)
scaler = StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_val = scaler.transform(X_val)

In [15]:
train_nan_mask = np.isnan(y_train)
val_nan_mask = np.isnan(y_val)

model = LogisticRegression(max_iter=200)
model.fit(X_train[~train_nan_mask], y_train[~train_nan_mask])

In [17]:
train_pred = model.predict(X_train[~train_nan_mask])

report(y_train, train_pred, train_nan_mask, "training")

training report
              precision    recall  f1-score   support

         0.0       0.99      0.99      0.99      1314
         1.0       0.60      0.05      0.09        59
         2.0       0.81      0.94      0.87       360
         3.0       0.33      0.04      0.07        27
         5.0       0.74      0.86      0.80       131

    accuracy                           0.93      1891
   macro avg       0.70      0.58      0.56      1891
weighted avg       0.92      0.93      0.91      1891

proportions in classes
[0.69487044 0.03120042 0.19037546 0.01427816 0.06927552]

confusion matrix
[[0.99 0.   0.   0.   0.01]
 [0.14 0.05 0.59 0.   0.22]
 [0.01 0.   0.94 0.01 0.04]
 [0.   0.   0.96 0.04 0.  ]
 [0.   0.   0.14 0.   0.86]]

Fowlkes-Meadows: 0.9592
Homogeneity: 0.7115
Completeness: 0.8037
V Measure: 0.7548
Adjusted Rand: 0.9131
Adjusted Mutual: 0.7537
Accuracy: 0.9286


' & .9592 & .7115 & .8037 & .7548 & .9131 & .7537 & .9286'

In [18]:
val_pred = model.predict(X_val[~val_nan_mask])

report(y_val, val_pred, val_nan_mask, "validation")

validation report
              precision    recall  f1-score   support

         0.0       0.98      0.99      0.99       559
         1.0        nan      0.00      0.00        26
         2.0       0.82      0.96      0.88       156
         3.0        nan      0.00      0.00        17
         5.0       0.76      0.94      0.84        53

    accuracy                           0.93       811
   macro avg       0.85      0.58      0.54       811
weighted avg       0.93      0.93      0.91       811

proportions in classes
[0.6892725  0.03205919 0.19235512 0.02096178 0.06535142]

confusion matrix
[[0.99 0.   0.   0.   0.01]
 [0.31 0.   0.5  0.   0.19]
 [0.   0.   0.96 0.   0.04]
 [0.   0.   1.   0.   0.  ]
 [0.02 0.   0.04 0.   0.94]]

Fowlkes-Meadows: 0.9561
Homogeneity: 0.7102
Completeness: 0.8449
V Measure: 0.7717
Adjusted Rand: 0.9067
Adjusted Mutual: 0.7703
Accuracy: 0.9285


' & .9561 & .7102 & .8449 & .7717 & .9067 & .7703 & .9285'

## 2. train Kalman and softmax on one person, then run Kalman and softmax on other person

In [19]:
train_patient = data['train_patients'][0]
print('train patient:', train_patient)
X_train, y_train = data[train_patient][:, :-1], data[train_patient][:, -1]

val_patient = data['validate_patients'][0]
print('val patient:', val_patient)
X_val, y_val = data[val_patient][:, :-1], data[val_patient][:, -1]

num_features = X_train.shape[1]

X_train.shape, X_val.shape

train patient: SC4241
val patient: SC4261


((2702, 64), (2800, 64))

In [20]:
sleep_stages=  ('m', 'w', '1', '2', '3', '4', 'r')
num_stages = len(sleep_stages)

# dim_x doesn't necessarily need to be the same as num_stages
dim_x = num_stages
dim_z = num_features

# Init
kf = KalmanFilter(n_dim_state=dim_x, n_dim_obs=dim_z, em_vars="all")

# Fit to fourier data with expectation maximization
kf.em(X_train)

<pykalman.standard.KalmanFilter at 0x7f95d84fb0a0>

In [21]:
# Predict the states for the fourier data
X_train_states, _ = kf.filter(X_train)
X_val_states, _ = kf.filter(X_val)

In [22]:
# Transform data to have mean zero and standard deviation one
scaler = StandardScaler().fit(X_train_states)
X_train_states = scaler.transform(X_train_states)
X_val_states = scaler.transform(X_val_states)

In [23]:
train_nan_mask = np.isnan(y_train)
val_nan_mask = np.isnan(y_val)

model = LogisticRegression(max_iter=200)
model.fit(X_train_states[~train_nan_mask], y_train[~train_nan_mask])

In [24]:
train_pred = model.predict(X_train_states[~train_nan_mask])

report(y_train, train_pred, train_nan_mask, "training")

training report
              precision    recall  f1-score   support

         0.0       0.99      0.99      0.99      1873
         1.0       0.71      0.06      0.11        85
         2.0       0.84      0.93      0.88       516
         3.0       0.61      0.32      0.42        44
         5.0       0.76      0.90      0.83       184

    accuracy                           0.93      2702
   macro avg       0.78      0.64      0.64      2702
weighted avg       0.93      0.93      0.92      2702

proportions in classes
[0.69319023 0.03145818 0.19096965 0.01628423 0.06809771]

confusion matrix
[[0.99 0.   0.   0.   0.01]
 [0.2  0.06 0.53 0.   0.21]
 [0.01 0.   0.93 0.02 0.04]
 [0.   0.   0.68 0.32 0.  ]
 [0.01 0.   0.09 0.   0.9 ]]

Fowlkes-Meadows: 0.9612
Homogeneity: 0.726
Completeness: 0.8016
V Measure: 0.7619
Adjusted Rand: 0.9176
Adjusted Mutual: 0.7612
Accuracy: 0.9338


' & .9612 & .726 & .8016 & .7619 & .9176 & .7612 & .9338'

In [25]:
val_pred = model.predict(X_val_states[~val_nan_mask])

report(y_val, val_pred, val_nan_mask, "validation")

validation report
              precision    recall  f1-score   support

         0.0       1.00      0.07      0.13      1830
         1.0        nan      0.00      0.00       245
         2.0       0.24      1.00      0.38       405
         3.0       0.00      0.00      0.00        85
         4.0        nan      0.00      0.00         9
         5.0       0.00      0.00      0.00       226

    accuracy                           0.19      2800
   macro avg       0.31      0.18      0.09      2800
weighted avg       0.76      0.19      0.14      2800

proportions in classes
[0.65357143 0.0875     0.14464286 0.03035714 0.00321429 0.08071429]

confusion matrix
[[0.07 0.   0.42 0.02 0.   0.5 ]
 [0.   0.   0.89 0.   0.   0.11]
 [0.   0.   1.   0.   0.   0.  ]
 [0.   0.   1.   0.   0.   0.  ]
 [0.   0.   1.   0.   0.   0.  ]
 [0.   0.   1.   0.   0.   0.  ]]

Fowlkes-Meadows: 0.4568
Homogeneity: 0.1782
Completeness: 0.2288
V Measure: 0.2003
Adjusted Rand: -0.0328
Adjusted Mutual: 0.1981


' & .4568 & .1782 & .2288 & .2003 & -0.0328 & .1981 & .1904'

## 3. train Kalman and softmax on multiple people, then run Kalman and softmax on other people

In [64]:
rng = np.random.default_rng(42)

num_train, num_val = 5, 5

print('number of train patients:', len(data['train_patients']))
train_patients = rng.choice(data['train_patients'], num_train, replace=False)
print('train patients:', train_patients)
X_trains, y_train = [data[train_patient][:, :-1] for train_patient in train_patients], \
  np.concatenate([data[train_patient][:, -1] for train_patient in train_patients])
print()

print('number of val patients:', len(data['validate_patients']))
val_patients = rng.choice(data['validate_patients'], num_val, replace=False)
print('val patients:', val_patients)
X_vals, y_val = [data[val_patient][:, :-1] for val_patient in val_patients], \
  np.concatenate([data[val_patient][:, -1] for val_patient in val_patients])

num_features = X_trains[0].shape[1]

number of train patients: 75
train patients: ['SC4472' 'SC4061' 'SC4112' 'SC4761' 'SC4702']

number of val patients: 31
val patients: ['SC4102' 'SC4491' 'SC4501' 'SC4201' 'SC4462']


In [65]:
num_expectation_maximization_cycles = 3

sleep_stages=  ('m', 'w', '1', '2', '3', '4', 'r')
num_stages = len(sleep_stages)

# dim_x doesn't necessarily need to be the same as num_stages
dim_x = num_stages
dim_z = num_features

# Init
kf = KalmanFilter(n_dim_state=dim_x, n_dim_obs=dim_z, em_vars="all")

rng = np.random.default_rng(42)

# Fit to fourier data with expectation maximization

# Perform num_expecation_maximization_cycles
for i in range(num_expectation_maximization_cycles):
  # Each cycle, choose a random order of the patients to perform EM
  for idx in rng.permutation(len(X_trains)):
    kf.em(X_trains[idx], n_iter=1)

In [66]:
# Predict the states for the fourier data
X_trains_states = np.concatenate([kf.filter(X_train)[0] for X_train in X_trains], axis=0)
X_vals_states = np.concatenate([kf.filter(X_val)[0] for X_val in X_vals], axis=0)

In [71]:
# Transform data to have mean zero and standard deviation one
scaler = StandardScaler().fit(X_trains_states)
X_trains_states = scaler.transform(X_trains_states) 
X_vals_states = scaler.transform(X_vals_states)

In [72]:
train_nan_mask = np.isnan(y_train)
val_nan_mask = np.isnan(y_val)

model = LogisticRegression(max_iter=200)
model.fit(X_trains_states[~train_nan_mask], y_train[~train_nan_mask])

In [73]:
train_pred = model.predict(X_trains_states[~train_nan_mask])

report(y_train, train_pred, train_nan_mask, "training")

training report
              precision    recall  f1-score   support

         0.0       0.85      0.97      0.90     10050
         1.0       0.15      0.01      0.01       595
         2.0       0.56      0.53      0.54      1971
         3.0        nan      0.00      0.00       265
         4.0       0.73      0.51      0.60       187
         5.0       0.11      0.02      0.03       506

    accuracy                           0.80     13574
   macro avg       0.48      0.34      0.35     13574
weighted avg       0.74      0.80      0.76     13574

proportions in classes
[0.74038603 0.0438338  0.14520407 0.01952262 0.01377634 0.03727715]

confusion matrix
[[0.97 0.   0.02 0.   0.   0.  ]
 [0.81 0.01 0.14 0.   0.   0.05]
 [0.47 0.   0.53 0.   0.   0.  ]
 [0.24 0.   0.74 0.   0.02 0.  ]
 [0.1  0.   0.39 0.   0.51 0.  ]
 [0.57 0.   0.41 0.   0.   0.02]]

Fowlkes-Meadows: 0.8093
Homogeneity: 0.2068
Completeness: 0.3746
V Measure: 0.2665
Adjusted Rand: 0.4461
Adjusted Mutual: 0.2657
Acc

' & .8093 & .2068 & .3746 & .2665 & .4461 & .2657 & .8004'

In [74]:
val_pred = model.predict(X_vals_states[~val_nan_mask])

report(y_val, val_pred, val_nan_mask, "validation")

validation report
              precision    recall  f1-score   support

         0.0       0.79      0.78      0.78      9804
         1.0       0.05      0.03      0.04       693
         2.0       0.54      0.68      0.60      2534
         3.0       0.01      0.05      0.02        64
         4.0       0.00       nan      0.00         0
         5.0       0.23      0.02      0.03       973

    accuracy                           0.67     14068
   macro avg       0.27      0.31      0.25     14068
weighted avg       0.66      0.67      0.66     14068

proportions in classes
[0.69690077 0.04926073 0.18012511 0.00454933 0.06916406]

confusion matrix
[[0.78 0.04 0.12 0.02 0.05 0.  ]
 [0.76 0.03 0.15 0.   0.   0.06]
 [0.3  0.01 0.68 0.01 0.   0.  ]
 [0.11 0.   0.84 0.05 0.   0.  ]
 [0.   0.   0.   0.   0.   0.  ]
 [0.81 0.03 0.14 0.   0.   0.02]]

Fowlkes-Meadows: 0.6181
Homogeneity: 0.1474
Completeness: 0.1505
V Measure: 0.149
Adjusted Rand: 0.1907
Adjusted Mutual: 0.1483
Accuracy: 0.6

' & .6181 & .1474 & .1505 & .149 & .1907 & .1483 & .669'

# Softmax classification on Fourier data

## 4. train softmax on one person, training and validating on random time segments

In [33]:
patient = data['train_patients'][0]
print(patient)
X, y = data[patient][:, :-1], data[patient][:, -1]
X.shape

SC4241


(2702, 64)

In [34]:
# Train-validation split
val_size = 0.3
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=val_size, random_state=42)

# Transform data to have mean zero and standard deviation one
scaler = StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_val = scaler.transform(X_val)

X_train.shape, X_val.shape

((1891, 64), (811, 64))

In [35]:
train_nan_mask = np.isnan(y_train)
val_nan_mask = np.isnan(y_val)

model = LogisticRegression()
model.fit(X_train[~train_nan_mask], y_train[~train_nan_mask])

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [36]:
train_pred = model.predict(X_train[~train_nan_mask])

report(y_train, train_pred, train_nan_mask, "training")

training report
              precision    recall  f1-score   support

         0.0       0.99      1.00      0.99      1314
         1.0       0.65      0.25      0.37        59
         2.0       0.85      0.93      0.88       360
         3.0       0.68      0.48      0.57        27
         5.0       0.88      0.85      0.86       131

    accuracy                           0.94      1891
   macro avg       0.81      0.70      0.73      1891
weighted avg       0.94      0.94      0.94      1891

proportions in classes
[0.69487044 0.03120042 0.19037546 0.01427816 0.06927552]

confusion matrix
[[1.   0.   0.   0.   0.  ]
 [0.1  0.25 0.54 0.   0.1 ]
 [0.02 0.01 0.92 0.02 0.03]
 [0.   0.   0.52 0.48 0.  ]
 [0.03 0.02 0.1  0.   0.85]]

Fowlkes-Meadows: 0.9665
Homogeneity: 0.7472
Completeness: 0.8039
V Measure: 0.7745
Adjusted Rand: 0.9282
Adjusted Mutual: 0.7734
Accuracy: 0.9429


' & .9665 & .7472 & .8039 & .7745 & .9282 & .7734 & .9429'

In [37]:
val_pred = model.predict(X_val[~val_nan_mask])

report(y_val, val_pred, val_nan_mask, "validation")

validation report
              precision    recall  f1-score   support

         0.0       0.98      0.98      0.98       559
         1.0       0.50      0.23      0.32        26
         2.0       0.87      0.95      0.91       156
         3.0       0.82      0.53      0.64        17
         5.0       0.84      0.87      0.85        53

    accuracy                           0.93       811
   macro avg       0.80      0.71      0.74       811
weighted avg       0.93      0.93      0.93       811

proportions in classes
[0.6892725  0.03205919 0.19235512 0.02096178 0.06535142]

confusion matrix
[[0.98 0.01 0.   0.   0.  ]
 [0.27 0.23 0.38 0.   0.12]
 [0.02 0.01 0.95 0.   0.03]
 [0.   0.   0.47 0.53 0.  ]
 [0.06 0.   0.08 0.   0.87]]

Fowlkes-Meadows: 0.9458
Homogeneity: 0.7128
Completeness: 0.7592
V Measure: 0.7353
Adjusted Rand: 0.8863
Adjusted Mutual: 0.7323
Accuracy: 0.9346


' & .9458 & .7128 & .7592 & .7353 & .8863 & .7323 & .9346'

## 5. train softmax on one person, then run softmax on other person

In [38]:
train_patient = data['train_patients'][0]
print('train patient:', train_patient)
X_train, y_train = data[train_patient][:, :-1], data[train_patient][:, -1]

val_patient = data['validate_patients'][0]
print('val patient:', val_patient)
X_val, y_val = data[val_patient][:, :-1], data[val_patient][:, -1]

X_train.shape, X_val.shape

train patient: SC4241
val patient: SC4261


((2702, 64), (2800, 64))

In [39]:
# Transform data to have mean zero and standard deviation one
scaler = StandardScaler().fit(X_train)
train_X = scaler.transform(X_train)
val_X = scaler.transform(X_val)

X_train.shape, X_val.shape

((2702, 64), (2800, 64))

In [40]:
train_nan_mask = np.isnan(y_train)
val_nan_mask = np.isnan(y_val)

model = LogisticRegression()
model.fit(X_train[~train_nan_mask], y_train[~train_nan_mask])

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [41]:
train_pred = model.predict(X_train[~train_nan_mask])

report(y_train, train_pred, train_nan_mask, "training")

training report
              precision    recall  f1-score   support

         0.0       0.99      0.99      0.99      1873
         1.0       0.57      0.25      0.34        85
         2.0       0.86      0.92      0.89       516
         3.0       0.74      0.64      0.68        44
         5.0       0.80      0.83      0.81       184

    accuracy                           0.94      2702
   macro avg       0.79      0.72      0.74      2702
weighted avg       0.93      0.94      0.93      2702

proportions in classes
[0.69319023 0.03145818 0.19096965 0.01628423 0.06809771]

confusion matrix
[[0.99 0.   0.   0.   0.  ]
 [0.15 0.25 0.45 0.   0.15]
 [0.01 0.01 0.92 0.02 0.04]
 [0.   0.   0.36 0.64 0.  ]
 [0.03 0.02 0.12 0.   0.83]]

Fowlkes-Meadows: 0.9617
Homogeneity: 0.733
Completeness: 0.7725
V Measure: 0.7522
Adjusted Rand: 0.9188
Adjusted Mutual: 0.7514
Accuracy: 0.9378


' & .9617 & .733 & .7725 & .7522 & .9188 & .7514 & .9378'

In [42]:
val_pred = model.predict(X_val[~val_nan_mask])

report(y_val, val_pred, val_nan_mask, "validation")

validation report
              precision    recall  f1-score   support

         0.0       0.97      0.93      0.95      1830
         1.0       0.26      0.27      0.26       245
         2.0       0.55      0.93      0.69       405
         3.0       0.03      0.01      0.02        85
         4.0        nan      0.00      0.00         9
         5.0       0.82      0.30      0.44       226

    accuracy                           0.79      2800
   macro avg       0.53      0.41      0.39      2800
weighted avg       0.81      0.79      0.78      2800

proportions in classes
[0.65357143 0.0875     0.14464286 0.03035714 0.00321429 0.08071429]

confusion matrix
[[0.93 0.04 0.03 0.   0.   0.  ]
 [0.18 0.27 0.48 0.04 0.   0.03]
 [0.   0.04 0.93 0.02 0.   0.01]
 [0.   0.01 0.98 0.01 0.   0.  ]
 [0.   0.   1.   0.   0.   0.  ]
 [0.   0.47 0.16 0.06 0.   0.3 ]]

Fowlkes-Meadows: 0.8535
Homogeneity: 0.5008
Completeness: 0.5416
V Measure: 0.5204
Adjusted Rand: 0.7288
Adjusted Mutual: 0.5188
A

' & .8535 & .5008 & .5416 & .5204 & .7288 & .5188 & .7893'

## 6. train softmax on multiple people, then run softmax on other people

In [43]:
print('number of train patients:', len(data['train_patients']))
train_patients = data['train_patients']
print('train patients:', train_patients)
X_trains, y_train = \
  np.concatenate([data[train_patient][:, :-1] for train_patient in train_patients], axis=0), \
  np.concatenate([data[train_patient][:, -1] for train_patient in train_patients])
print()

print('number of val patients:', len(data['validate_patients']))
val_patients = data['validate_patients']
print('val patients:', val_patients)
X_vals, y_val = \
  np.concatenate([data[val_patient][:, :-1] for val_patient in val_patients], axis=0), \
  np.concatenate([data[val_patient][:, -1] for val_patient in val_patients])

num_features = X_trains.shape[1]

number of train patients: 75
train patients: ['SC4241' 'SC4242' 'SC4001' 'SC4002' 'SC4271' 'SC4272' 'SC4761' 'SC4762'
 'SC4021' 'SC4022' 'SC4341' 'SC4342' 'SC4011' 'SC4012' 'SC4411' 'SC4412'
 'SC4281' 'SC4282' 'SC4431' 'SC4432' 'SC4041' 'SC4042' 'SC4081' 'SC4082'
 'SC4731' 'SC4732' 'SC4631' 'SC4632' 'SC4121' 'SC4122' 'SC4051' 'SC4052'
 'SC4061' 'SC4062' 'SC4031' 'SC4032' 'SC4091' 'SC4092' 'SC4771' 'SC4772'
 'SC4211' 'SC4212' 'SC4451' 'SC4452' 'SC4251' 'SC4252' 'SC4111' 'SC4112'
 'SC4371' 'SC4372' 'SC4171' 'SC4172' 'SC4561' 'SC4562' 'SC4471' 'SC4472'
 'SC4581' 'SC4582' 'SC4741' 'SC4742' 'SC4231' 'SC4232' 'SC4522' 'SC4751'
 'SC4752' 'SC4651' 'SC4652' 'SC4611' 'SC4612' 'SC4151' 'SC4152' 'SC4321'
 'SC4322' 'SC4701' 'SC4702']

number of val patients: 31
val patients: ['SC4261' 'SC4262' 'SC4491' 'SC4492' 'SC4161' 'SC4162' 'SC4301' 'SC4302'
 'SC4311' 'SC4312' 'SC4571' 'SC4572' 'SC4591' 'SC4592' 'SC4501' 'SC4502'
 'SC4362' 'SC4721' 'SC4722' 'SC4641' 'SC4642' 'SC4461' 'SC4462' 'SC4201'
 'SC4202

In [44]:
# Transform data to have mean zero and standard deviation one
scaler = StandardScaler().fit(X_trains)
X_trains = scaler.transform(X_trains)
X_vals = scaler.transform(X_vals)

In [47]:
train_nan_mask = np.isnan(y_train)
val_nan_mask = np.isnan(y_val)

model = LogisticRegression(max_iter=200)
model.fit(X_trains[~train_nan_mask], y_train[~train_nan_mask])

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [48]:
train_pred = model.predict(X_trains[~train_nan_mask])

report(y_train, train_pred, train_nan_mask, "training")

training report
              precision    recall  f1-score   support

         0.0       0.97      0.98      0.97    139982
         1.0       0.41      0.15      0.22      9958
         2.0       0.74      0.90      0.81     33507
         3.0       0.54      0.38      0.45      4618
         4.0       0.74      0.52      0.61      2609
         5.0       0.69      0.67      0.68     12809

    accuracy                           0.89    203483
   macro avg       0.68      0.60      0.62    203483
weighted avg       0.87      0.89      0.87    203483

proportions in classes
[0.6879297  0.04893775 0.16466732 0.02269477 0.01282171 0.06294875]

confusion matrix
[[0.98 0.01 0.01 0.   0.   0.01]
 [0.27 0.15 0.4  0.   0.   0.18]
 [0.03 0.01 0.9  0.01 0.   0.04]
 [0.02 0.   0.52 0.38 0.07 0.  ]
 [0.01 0.   0.09 0.38 0.52 0.  ]
 [0.07 0.05 0.22 0.   0.   0.67]]

Fowlkes-Meadows: 0.9233
Homogeneity: 0.5919
Completeness: 0.6498
V Measure: 0.6195
Adjusted Rand: 0.8406
Adjusted Mutual: 0.6195
Acc

' & .9233 & .5919 & .6498 & .6195 & .8406 & .6195 & .8861'

In [49]:
val_pred = model.predict(X_vals[~val_nan_mask])

report(y_val, val_pred, val_nan_mask, "validation")

validation report
              precision    recall  f1-score   support

         0.0       0.95      0.97      0.96     57161
         1.0       0.37      0.15      0.22      4775
         2.0       0.74      0.89      0.81     14654
         3.0       0.25      0.12      0.16      1147
         4.0       0.11      0.11      0.11       207
         5.0       0.60      0.55      0.58      5290

    accuracy                           0.87     83234
   macro avg       0.50      0.47      0.47     83234
weighted avg       0.85      0.87      0.85     83234

proportions in classes
[0.68675061 0.05736838 0.17605786 0.01378043 0.00248696 0.06355576]

confusion matrix
[[0.97 0.01 0.01 0.   0.   0.01]
 [0.22 0.15 0.43 0.   0.   0.19]
 [0.05 0.02 0.89 0.01 0.   0.03]
 [0.04 0.   0.76 0.12 0.09 0.  ]
 [0.   0.   0.43 0.46 0.11 0.  ]
 [0.19 0.05 0.2  0.   0.   0.55]]

Fowlkes-Meadows: 0.8951
Homogeneity: 0.5139
Completeness: 0.5683
V Measure: 0.5397
Adjusted Rand: 0.7801
Adjusted Mutual: 0.5396
A

' & .8951 & .5139 & .5683 & .5397 & .7801 & .5396 & .8659'

# Train on first and test on second nights

## 7. train on first night, test on second night

train separate Kalman on each patient's first night, train one softmax, then filter on second night using corresponding Kalman and use same softmax

In [56]:
num_patients = 5

rng = np.random.default_rng(42)

# Trim the 1 or 2 at the end (which indicates the first or second night)
# Make sure both nights are in the set
patients = list({ night[:-1] for night in data["train_patients"]
                 if night[:-1] + "1" in data["train_patients"]
                 and night[:-1] + "2" in data["train_patients"]})
all_train_nights = np.array([ patient + "1" for patient in patients ])
all_val_nights = np.array([ patient + "2" for patient in patients ])

print("number of patients:", len(all_train_nights))

idx = rng.choice(len(all_train_nights), size=num_patients, replace=False)

train_nights = all_train_nights[idx]
print('train nights:', train_nights)
X_trains, y_train = [data[train_night][:, :-1] for train_night in train_nights], \
  np.concatenate([data[train_night][:, -1] for train_night in train_nights])
print()

val_nights = all_val_nights[idx]
print('val nights:', val_nights)
X_vals, y_val = [data[val_night][:, :-1] for val_night in val_nights], \
  np.concatenate([data[val_night][:, -1] for val_night in val_nights])

num_features = X_trains[0].shape[1]

number of patients: 37
train nights: ['SC4651' 'SC4011' 'SC4771' 'SC4171' 'SC4121']

val nights: ['SC4652' 'SC4012' 'SC4772' 'SC4172' 'SC4122']


In [58]:
num_expectation_maximization_cycles = 3

sleep_stages=  ('m', 'w', '1', '2', '3', '4', 'r')
num_stages = len(sleep_stages)

# dim_x doesn't necessarily need to be the same as num_stages
dim_x = num_stages
dim_z = num_features

# Init one Kalman filter for each patient
kfs = [KalmanFilter(n_dim_state=dim_x, n_dim_obs=dim_z, em_vars="all") for _ in range(num_patients)]

# Fit to fourier data with expectation maximization

for X_train, kf in zip(X_trains, kfs):
  kf.em(X_train, n_iter=num_expectation_maximization_cycles)

In [59]:
# Predict the states for the fourier data
X_trains_states = np.concatenate([kf.filter(X_train)[0] for X_train, kf in zip(X_trains, kfs)], axis=0)
X_vals_states = np.concatenate([kf.filter(X_val)[0] for X_val, kf in zip(X_vals, kfs)], axis=0)

In [60]:
# Transform data to have mean zero and standard deviation one
scaler = StandardScaler().fit(X_trains_states)
X_trains_states = scaler.transform(X_trains_states) 
X_vals_states = scaler.transform(X_vals_states)

In [61]:
train_nan_mask = np.isnan(y_train)
val_nan_mask = np.isnan(y_val)

model = LogisticRegression(max_iter=200)
model.fit(X_trains_states[~train_nan_mask], y_train[~train_nan_mask])

In [62]:
train_pred = model.predict(X_trains_states[~train_nan_mask])

report(y_train, train_pred, train_nan_mask, "training")

training report
              precision    recall  f1-score   support

         0.0       0.93      0.95      0.94      9053
         1.0       0.48      0.23      0.31       626
         2.0       0.72      0.89      0.79      2636
         3.0       0.61      0.38      0.47       447
         4.0       0.68      0.12      0.20       125
         5.0       0.47      0.34      0.39       958

    accuracy                           0.84     13845
   macro avg       0.65      0.48      0.52     13845
weighted avg       0.82      0.84      0.83     13845

proportions in classes
[0.65388227 0.04521488 0.19039364 0.03228602 0.00902853 0.06919466]

confusion matrix
[[0.95 0.01 0.01 0.   0.   0.02]
 [0.26 0.23 0.34 0.   0.   0.17]
 [0.07 0.01 0.89 0.01 0.   0.02]
 [0.09 0.   0.53 0.38 0.   0.  ]
 [0.05 0.   0.14 0.7  0.12 0.  ]
 [0.27 0.05 0.34 0.   0.   0.34]]

Fowlkes-Meadows: 0.8589
Homogeneity: 0.4664
Completeness: 0.5392
V Measure: 0.5002
Adjusted Rand: 0.7219
Adjusted Mutual: 0.4997
Acc

' & .8589 & .4664 & .5392 & .5002 & .7219 & .4997 & .8405'

In [63]:
val_pred = model.predict(X_vals_states[~val_nan_mask])

report(y_val, val_pred, val_nan_mask, "validation")

validation report
              precision    recall  f1-score   support

         0.0       0.92      0.96      0.94      8824
         1.0       0.17      0.07      0.10       572
         2.0       0.63      0.84      0.72      2337
         3.0       0.52      0.21      0.30       477
         4.0       0.00      0.00      0.00       163
         5.0       0.40      0.21      0.28       886

    accuracy                           0.81     13259
   macro avg       0.44      0.38      0.39     13259
weighted avg       0.77      0.81      0.78     13259

proportions in classes
[0.66551022 0.04314051 0.17625764 0.03597556 0.01229354 0.06682254]

confusion matrix
[[0.96 0.02 0.01 0.   0.   0.01]
 [0.32 0.07 0.47 0.   0.   0.14]
 [0.1  0.   0.84 0.01 0.   0.05]
 [0.1  0.   0.69 0.21 0.   0.  ]
 [0.14 0.   0.39 0.47 0.   0.  ]
 [0.3  0.03 0.45 0.   0.   0.21]]

Fowlkes-Meadows: 0.8459
Homogeneity: 0.4028
Completeness: 0.5045
V Measure: 0.448
Adjusted Rand: 0.6832
Adjusted Mutual: 0.4474
Ac

' & .8459 & .4028 & .5045 & .448 & .6832 & .4474 & .81'

## 8. train on first night, sample and compare with second night

train Kalman on each patient's first night, train softmax on hidden states and observations (Fourier), then predict hidden states and observations on other night and softmax those

In [5]:
num_patients = 5

rng = np.random.default_rng(42)

# Trim the 1 or 2 at the end (which indicates the first or second night)
# Make sure both nights are in the set
patients = list({ night[:-1] for night in data["train_patients"]
                 if night[:-1] + "1" in data["train_patients"]
                 and night[:-1] + "2" in data["train_patients"]})
all_train_nights = np.array([ patient + "1" for patient in patients ])
all_val_nights = np.array([ patient + "2" for patient in patients ])

print("number of patients:", len(all_train_nights))

idx = rng.choice(len(all_train_nights), size=num_patients, replace=False)

train_nights = all_train_nights[idx]
print('train nights:', train_nights)
X_trains, y_train = [data[train_night][:, :-1] for train_night in train_nights], \
  np.concatenate([data[train_night][:, -1] for train_night in train_nights])
print()

val_nights = all_val_nights[idx]
print('val nights:', val_nights)
X_vals, y_val = [data[val_night][:, :-1] for val_night in val_nights], \
  np.concatenate([data[val_night][:, -1] for val_night in val_nights])

num_features = X_trains[0].shape[1]

number of patients: 37
train nights: ['SC4341' 'SC4241' 'SC4281' 'SC4471' 'SC4211']

val nights: ['SC4342' 'SC4242' 'SC4282' 'SC4472' 'SC4212']


In [6]:
num_expectation_maximization_cycles = 3

sleep_stages=  ('m', 'w', '1', '2', '3', '4', 'r')
num_stages = len(sleep_stages)

# dim_x doesn't necessarily need to be the same as num_stages
dim_x = num_stages
dim_z = num_features

# Init one Kalman filter for each patient
kfs = [KalmanFilter(n_dim_state=dim_x, n_dim_obs=dim_z, em_vars="all") for _ in range(num_patients)]

rng = np.random.default_rng(42)

# Fit to fourier data with expectation maximization

for X_train, kf in zip(X_trains, kfs):
  kf.em(X_train, n_iter=num_expectation_maximization_cycles)

In [7]:
# A typical night has about 2700 steps
start_steps = 800
random_seed = 42

# Get last hidden state to start sampling from
train_start_states = [kf.filter(X_train[:start_steps])[0][-1] for X_train, kf in zip(X_trains, kfs)]
val_start_states = [kf.filter(X_val[:start_steps])[0][-1] for X_val, kf in zip(X_vals, kfs)]

# Sample the hidden states and observations (Fourier coefficients) for remaining timesteps
X_trains_states_obs = [
  kf.sample(
    n_timesteps=len(X_train) - start_steps,
    initial_state=start_state,
    random_state=random_seed
  ) for X_train, kf, start_state in zip(X_trains, kfs, train_start_states)
]
X_trains_states, X_trains_obs = zip(*X_trains_states_obs)
X_trains_states = np.concatenate(X_trains_states, axis=0)
X_trains_obs = np.concatenate(X_trains_obs, axis=0)

X_vals_states_obs = [
  kf.sample(
    n_timesteps=len(X_val) - start_steps,
    initial_state=start_state,
    random_state=random_seed
  ) for X_val, kf, start_state in zip(X_vals, kfs, val_start_states)
]
X_vals_states, X_vals_obs = zip(*X_vals_states_obs)
X_vals_states = np.concatenate(X_vals_states, axis=0)
X_vals_obs = np.concatenate(X_vals_obs, axis=0)

In [8]:
# Drop the labels for the start steps
lengths = [0] + list(map(len, X_trains))
cumul_lengths = list(np.cumsum(lengths))

y_train = np.concatenate([y_train[length_prev + start_steps : length_curr]
  for length_prev, length_curr in zip(cumul_lengths[:-1], cumul_lengths[1:])])

lengths = [0] + list(map(len, X_vals))
cumul_lengths = list(np.cumsum(lengths))

y_val = np.concatenate([y_val[length_prev + start_steps : length_curr]
  for length_prev, length_curr in zip(cumul_lengths[:-1], cumul_lengths[1:])])

In [9]:
# Transform data to have mean zero and standard deviation one
states_scaler = StandardScaler().fit(X_trains_states)
X_trains_states = states_scaler.transform(X_trains_states)
X_vals_states = states_scaler.transform(X_vals_states)

obs_scaler = StandardScaler().fit(X_trains_obs)
X_trains_obs = obs_scaler.transform(X_trains_obs)
X_vals_obs = obs_scaler.transform(X_vals_obs)

In [10]:
train_nan_mask = np.isnan(y_train)
val_nan_mask = np.isnan(y_val)

states_model = LogisticRegression(max_iter=200)
states_model.fit(X_trains_states[~train_nan_mask], y_train[~train_nan_mask])

obs_model = LogisticRegression(max_iter=200)
obs_model.fit(X_trains_obs[~train_nan_mask], y_train[~train_nan_mask])

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [11]:
train_states_pred = states_model.predict(X_trains_states[~train_nan_mask])
train_obs_pred = obs_model.predict(X_trains_obs[~train_nan_mask])

report(y_train, train_states_pred, train_nan_mask, "training on hidden states")
report(y_train, train_obs_pred, train_nan_mask, "training on observations")

training on hidden states report
              precision    recall  f1-score   support

         0.0       0.58      0.98      0.73      5546
         1.0        nan      0.00      0.00       452
         2.0       0.35      0.06      0.10      2468
         3.0        nan      0.00      0.00       320
         4.0        nan      0.00      0.00       208
         5.0        nan      0.00      0.00       796

    accuracy                           0.57      9790
   macro avg       0.47      0.17      0.14      9790
weighted avg       0.51      0.57      0.44      9790

proportions in classes
[0.56649642 0.04616956 0.25209397 0.03268641 0.02124617 0.08130746]

confusion matrix
[[0.98 0.   0.02 0.   0.   0.  ]
 [0.94 0.   0.06 0.   0.   0.  ]
 [0.94 0.   0.06 0.   0.   0.  ]
 [0.94 0.   0.06 0.   0.   0.  ]
 [0.95 0.   0.05 0.   0.   0.  ]
 [0.9  0.   0.1  0.   0.   0.  ]]

Fowlkes-Meadows: 0.6158
Homogeneity: 0.0052
Completeness: 0.0354
V Measure: 0.0091
Adjusted Rand: 0.0287
Adjusted M

' & .6007 & .0066 & .0291 & .0108 & .0356 & .0095 & .5662'

In [12]:
val_states_pred = states_model.predict(X_vals_states[~val_nan_mask])
val_obs_pred = obs_model.predict(X_vals_obs[~val_nan_mask])

report(y_val, val_states_pred, val_nan_mask, "validation on hidden states")
report(y_val, val_obs_pred, val_nan_mask, "validation on observations")

validation on hidden states report
              precision    recall  f1-score   support

         0.0       0.62      0.98      0.76      5944
         1.0        nan      0.00      0.00       533
         2.0       0.46      0.08      0.14      2268
         3.0        nan      0.00      0.00       237
         4.0        nan      0.00      0.00       101
         5.0        nan      0.00      0.00       715

    accuracy                           0.61      9798
   macro avg       0.54      0.18      0.15      9798
weighted avg       0.57      0.61      0.49      9798

proportions in classes
[0.60665442 0.05439886 0.23147581 0.02418861 0.01030823 0.07297408]

confusion matrix
[[0.98 0.   0.02 0.   0.   0.  ]
 [0.94 0.   0.06 0.   0.   0.  ]
 [0.92 0.   0.08 0.   0.   0.  ]
 [0.94 0.   0.06 0.   0.   0.  ]
 [0.99 0.   0.01 0.   0.   0.  ]
 [0.96 0.   0.04 0.   0.   0.  ]]

Fowlkes-Meadows: 0.6432
Homogeneity: 0.0063
Completeness: 0.0421
V Measure: 0.0109
Adjusted Rand: 0.029
Adjusted 

' & .6237 & .0042 & .0176 & .0068 & .0298 & .0055 & .5977'

# Test

## 3. train Kalman and softmax on multiple people, then run Kalman and softmax on other people

In [16]:
print('number of train patients:', len(data['train_patients']))
train_patients = data['train_patients']
print('train patients:', train_patients)
X_trains, y_train = [data[train_patient][:, :-1] for train_patient in train_patients], \
  np.concatenate([data[train_patient][:, -1] for train_patient in train_patients])
print()

print('number of test patients:', len(data['test_patients']))
test_patients = data['test_patients']
print('val patients:', val_patients)
X_tests, y_test = [data[test_patient][:, :-1] for test_patient in test_patients], \
  np.concatenate([data[test_patient][:, -1] for test_patient in test_patients])

num_features = X_trains[0].shape[1]

number of train patients: 75
train patients: ['SC4241' 'SC4242' 'SC4001' 'SC4002' 'SC4271' 'SC4272' 'SC4761' 'SC4762'
 'SC4021' 'SC4022' 'SC4341' 'SC4342' 'SC4011' 'SC4012' 'SC4411' 'SC4412'
 'SC4281' 'SC4282' 'SC4431' 'SC4432' 'SC4041' 'SC4042' 'SC4081' 'SC4082'
 'SC4731' 'SC4732' 'SC4631' 'SC4632' 'SC4121' 'SC4122' 'SC4051' 'SC4052'
 'SC4061' 'SC4062' 'SC4031' 'SC4032' 'SC4091' 'SC4092' 'SC4771' 'SC4772'
 'SC4211' 'SC4212' 'SC4451' 'SC4452' 'SC4251' 'SC4252' 'SC4111' 'SC4112'
 'SC4371' 'SC4372' 'SC4171' 'SC4172' 'SC4561' 'SC4562' 'SC4471' 'SC4472'
 'SC4581' 'SC4582' 'SC4741' 'SC4742' 'SC4231' 'SC4232' 'SC4522' 'SC4751'
 'SC4752' 'SC4651' 'SC4652' 'SC4611' 'SC4612' 'SC4151' 'SC4152' 'SC4321'
 'SC4322' 'SC4701' 'SC4702']



number of test patients: 47
val patients: ['SC4261' 'SC4641' 'SC4621' 'SC4721' 'SC4491' 'SC4801' 'SC4461' 'SC4302'
 'SC4362' 'SC4312' 'SC4462' 'SC4802' 'SC4102' 'SC4262' 'SC4501' 'SC4642'
 'SC4201' 'SC4572' 'SC4202' 'SC4301' 'SC4162' 'SC4311' 'SC4492' 'SC4571'
 'SC4591' 'SC4722' 'SC4592' 'SC4161' 'SC4622' 'SC4502' 'SC4101']


In [17]:
num_expectation_maximization_cycles = 3

sleep_stages=  ('m', 'w', '1', '2', '3', '4', 'r')
num_stages = len(sleep_stages)

# dim_x doesn't necessarily need to be the same as num_stages
dim_x = num_stages
dim_z = num_features

# Init
kf = KalmanFilter(n_dim_state=dim_x, n_dim_obs=dim_z, em_vars="all")

rng = np.random.default_rng(42)

# Fit to fourier data with expectation maximization

# Perform num_expecation_maximization_cycles
for i in range(num_expectation_maximization_cycles):
  # Each cycle, choose a random order of the patients to perform EM
  for idx in rng.permutation(len(X_trains)):
    kf.em(X_trains[idx], n_iter=1)

In [18]:
# Predict the states for the fourier data
X_trains_states = np.concatenate([kf.filter(X_train)[0] for X_train in X_trains], axis=0)
X_tests_states = np.concatenate([kf.filter(X_test)[0] for X_test in X_tests], axis=0)

In [19]:
# Transform data to have mean zero and standard deviation one
scaler = StandardScaler().fit(X_trains_states)
X_trains_states = scaler.transform(X_trains_states) 
X_tests_states = scaler.transform(X_tests_states)

In [20]:
train_nan_mask = np.isnan(y_train)
test_nan_mask = np.isnan(y_test)

model = LogisticRegression(max_iter=200)
model.fit(X_trains_states[~train_nan_mask], y_train[~train_nan_mask])

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [21]:
train_pred = model.predict(X_trains_states[~train_nan_mask])

report(y_train, train_pred, train_nan_mask, "training")

training report
              precision    recall  f1-score   support

         0.0       0.69      0.99      0.81    139982
         1.0        nan      0.00      0.00      9958
         2.0       0.14      0.00      0.01     33507
         3.0        nan      0.00      0.00      4618
         4.0        nan      0.00      0.00      2609
         5.0       0.04      0.00      0.00     12809

    accuracy                           0.68    203483
   macro avg       0.29      0.17      0.14    203483
weighted avg       0.54      0.68      0.56    203483

proportions in classes
[0.6879297  0.04893775 0.16466732 0.02269477 0.01282171 0.06294875]

confusion matrix
[[0.99 0.   0.   0.   0.   0.  ]
 [0.99 0.   0.01 0.   0.   0.  ]
 [1.   0.   0.   0.   0.   0.  ]
 [1.   0.   0.   0.   0.   0.  ]
 [1.   0.   0.   0.   0.   0.  ]
 [0.99 0.   0.01 0.   0.   0.  ]]

Fowlkes-Meadows: 0.7078
Homogeneity: 0.0005
Completeness: 0.0128
V Measure: 0.001
Adjusted Rand: -0.0001
Adjusted Mutual: 0.001
Accu

' & .7078 & .0005 & .0128 & .001 & -0.0001 & .001 & .6842'

In [22]:
test_pred = model.predict(X_tests_states[~test_nan_mask])

report(y_test, test_pred, test_nan_mask, "test")

test report
              precision    recall  f1-score   support

         0.0       0.69      1.00      0.82     88290
         1.0        nan      0.00      0.00      6799
         2.0       0.26      0.00      0.01     21020
         3.0        nan      0.00      0.00      3042
         4.0        nan      0.00      0.00      1434
         5.0       0.23      0.00      0.00      7787

    accuracy                           0.69    128372
   macro avg       0.39      0.17      0.14    128372
weighted avg       0.58      0.69      0.56    128372

proportions in classes
[0.6877668  0.05296326 0.16374287 0.02369676 0.01117066 0.06065965]

confusion matrix
[[1.   0.   0.   0.   0.   0.  ]
 [0.99 0.   0.01 0.   0.   0.  ]
 [1.   0.   0.   0.   0.   0.  ]
 [1.   0.   0.   0.   0.   0.  ]
 [1.   0.   0.   0.   0.   0.  ]
 [0.99 0.   0.01 0.   0.   0.  ]]

Fowlkes-Meadows: 0.7114
Homogeneity: 0.0008
Completeness: 0.0405
V Measure: 0.0016
Adjusted Rand: 0.0038
Adjusted Mutual: 0.0015
Accurac

' & .7114 & .0008 & .0405 & .0016 & .0038 & .0015 & .6873'

## 6. train softmax on multiple people, then run softmax on other people

In [74]:
print('number of train patients:', len(data['train_patients']))
train_patients = data['train_patients']
print('train patients:', train_patients)
X_trains, y_train = \
  np.concatenate([data[train_patient][:, :-1] for train_patient in train_patients], axis=0), \
  np.concatenate([data[train_patient][:, -1] for train_patient in train_patients])
print()

print('number of test patients:', len(data['test_patients']))
test_patients = data['test_patients']
print('test patients:', test_patients)
X_tests, y_test = \
  np.concatenate([data[test_patient][:, :-1] for test_patient in test_patients], axis=0), \
  np.concatenate([data[test_patient][:, -1] for test_patient in test_patients])

num_features = X_trains.shape[1]

number of train patients: 75
train patients: ['SC4241' 'SC4242' 'SC4001' 'SC4002' 'SC4271' 'SC4272' 'SC4761' 'SC4762'
 'SC4021' 'SC4022' 'SC4341' 'SC4342' 'SC4011' 'SC4012' 'SC4411' 'SC4412'
 'SC4281' 'SC4282' 'SC4431' 'SC4432' 'SC4041' 'SC4042' 'SC4081' 'SC4082'
 'SC4731' 'SC4732' 'SC4631' 'SC4632' 'SC4121' 'SC4122' 'SC4051' 'SC4052'
 'SC4061' 'SC4062' 'SC4031' 'SC4032' 'SC4091' 'SC4092' 'SC4771' 'SC4772'
 'SC4211' 'SC4212' 'SC4451' 'SC4452' 'SC4251' 'SC4252' 'SC4111' 'SC4112'
 'SC4371' 'SC4372' 'SC4171' 'SC4172' 'SC4561' 'SC4562' 'SC4471' 'SC4472'
 'SC4581' 'SC4582' 'SC4741' 'SC4742' 'SC4231' 'SC4232' 'SC4522' 'SC4751'
 'SC4752' 'SC4651' 'SC4652' 'SC4611' 'SC4612' 'SC4151' 'SC4152' 'SC4321'
 'SC4322' 'SC4701' 'SC4702']

number of test patients: 47
test patients: ['SC4181' 'SC4182' 'SC4551' 'SC4552' 'SC4541' 'SC4542' 'SC4191' 'SC4192'
 'SC4661' 'SC4662' 'SC4421' 'SC4422' 'SC4141' 'SC4142' 'SC4071' 'SC4072'
 'SC4401' 'SC4402' 'SC4331' 'SC4332' 'SC4291' 'SC4292' 'SC4381' 'SC4382'
 'SC45

In [75]:
# Transform data to have mean zero and standard deviation one
scaler = StandardScaler().fit(X_trains)
X_trains = scaler.transform(X_trains)
X_tests = scaler.transform(X_tests)

In [76]:
train_nan_mask = np.isnan(y_train)
test_nan_mask = np.isnan(y_test)

model = LogisticRegression(max_iter=200)
model.fit(X_trains[~train_nan_mask], y_train[~train_nan_mask])

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [77]:
train_pred = model.predict(X_trains[~train_nan_mask])

report(y_train, train_pred, train_nan_mask, "training")

training report
              precision    recall  f1-score   support

         0.0       0.97      0.98      0.97    139982
         1.0       0.41      0.15      0.22      9958
         2.0       0.74      0.90      0.81     33507
         3.0       0.54      0.38      0.45      4618
         4.0       0.74      0.52      0.61      2609
         5.0       0.69      0.67      0.68     12809

    accuracy                           0.89    203483
   macro avg       0.68      0.60      0.62    203483
weighted avg       0.87      0.89      0.87    203483

proportions in classes
[0.6879297  0.04893775 0.16466732 0.02269477 0.01282171 0.06294875]

confusion matrix
[[0.98 0.01 0.01 0.   0.   0.01]
 [0.27 0.15 0.4  0.   0.   0.18]
 [0.03 0.01 0.9  0.01 0.   0.04]
 [0.02 0.   0.52 0.38 0.07 0.  ]
 [0.01 0.   0.09 0.38 0.52 0.  ]
 [0.07 0.05 0.22 0.   0.   0.67]]

Fowlkes-Meadows: 0.9233
Homogeneity: 0.5919
Completeness: 0.6498
V Measure: 0.6195
Adjusted Rand: 0.8406
Adjusted Mutual: 0.6195
Acc

' & .9233 & .5919 & .6498 & .6195 & .8406 & .6195 & .8861'

In [79]:
test_pred = model.predict(X_tests[~test_nan_mask])

report(y_test, test_pred, test_nan_mask, "test")

test report
              precision    recall  f1-score   support

         0.0       0.96      0.97      0.96     88290
         1.0       0.37      0.25      0.30      6799
         2.0       0.72      0.90      0.80     21020
         3.0       0.44      0.29      0.35      3042
         4.0       0.64      0.21      0.31      1434
         5.0       0.76      0.52      0.62      7787

    accuracy                           0.87    128372
   macro avg       0.65      0.52      0.56    128372
weighted avg       0.86      0.87      0.86    128372

proportions in classes
[0.6877668  0.05296326 0.16374287 0.02369676 0.01117066 0.06065965]

confusion matrix
[[0.97 0.02 0.01 0.   0.   0.  ]
 [0.26 0.25 0.39 0.   0.   0.1 ]
 [0.04 0.03 0.9  0.01 0.   0.02]
 [0.06 0.   0.6  0.29 0.04 0.01]
 [0.02 0.   0.19 0.58 0.21 0.  ]
 [0.14 0.11 0.22 0.   0.   0.52]]

Fowlkes-Meadows: 0.9071
Homogeneity: 0.5448
Completeness: 0.6091
V Measure: 0.5752
Adjusted Rand: 0.8056
Adjusted Mutual: 0.5751
Accurac

' & .9071 & .5448 & .6091 & .5752 & .8056 & .5751 & .8702'

## 7. train on first night, test on second night

train separate Kalman on each patient's first night, train one softmax, then filter on second night using corresponding Kalman and use same softmax

In [57]:
# Trim the 1 or 2 at the end (which indicates the first or second night)
# Make sure both nights are in the set
patients = list({ night[:-1] for night in data["test_patients"]
                 if night[:-1] + "1" in data["test_patients"]
                 and night[:-1] + "2" in data["test_patients"]})
all_nights_1 = np.array([ patient + "1" for patient in patients ])
all_nights_2 = np.array([ patient + "2" for patient in patients ])

print("number of patients:", len(all_train_nights))

nights_1 = all_nights_1
print('first nights:', all_nights_1)
Xs_1, y_1 = [data[night_1][:, :-1] for night_1 in nights_1], \
  np.concatenate([data[night_1][:, -1] for night_1 in nights_1])
print()

nights_2 = all_nights_2
print('second nights:', nights_2)
Xs_2, y_2 = [data[night_2][:, :-1] for night_2 in nights_2], \
  np.concatenate([data[night_2][:, -1] for night_2 in nights_2])

num_features = Xs_1[0].shape[1]

number of patients: 37
first nights: ['SC4811' 'SC4141' 'SC4401' 'SC4671' 'SC4181' 'SC4551' 'SC4541' 'SC4711'
 'SC4601' 'SC4441' 'SC4331' 'SC4511' 'SC4531' 'SC4661' 'SC4421' 'SC4821'
 'SC4481' 'SC4351' 'SC4071' 'SC4221' 'SC4191' 'SC4291' 'SC4381']

second nights: ['SC4812' 'SC4142' 'SC4402' 'SC4672' 'SC4182' 'SC4552' 'SC4542' 'SC4712'
 'SC4602' 'SC4442' 'SC4332' 'SC4512' 'SC4532' 'SC4662' 'SC4422' 'SC4822'
 'SC4482' 'SC4352' 'SC4072' 'SC4222' 'SC4192' 'SC4292' 'SC4382']


In [58]:
num_expectation_maximization_cycles = 3

sleep_stages=  ('m', 'w', '1', '2', '3', '4', 'r')
num_stages = len(sleep_stages)

# dim_x doesn't necessarily need to be the same as num_stages
dim_x = num_stages
dim_z = num_features

# Init one Kalman filter for each patient
kfs = [KalmanFilter(n_dim_state=dim_x, n_dim_obs=dim_z, em_vars="all") for _ in range(len(nights_1))]

# Fit to fourier data with expectation maximization

for X_1, kf in zip(Xs_1, kfs):
  kf.em(X_1, n_iter=num_expectation_maximization_cycles)

In [59]:
# Predict the states for the fourier data
Xs_1_states = np.concatenate([kf.filter(X_1)[0] for X_1, kf in zip(Xs_1, kfs)], axis=0)
Xs_2_states = np.concatenate([kf.filter(X_2)[0] for X_2, kf in zip(Xs_2, kfs)], axis=0)

In [60]:
# Transform data to have mean zero and standard deviation one
scaler = StandardScaler().fit(Xs_1_states)
Xs_1_states = scaler.transform(Xs_1_states)
Xs_2_states = scaler.transform(Xs_2_states)

In [61]:
_1_nan_mask = np.isnan(y_1)
_2_nan_mask = np.isnan(y_2)

model = LogisticRegression(max_iter=200)
model.fit(Xs_1_states[~_1_nan_mask], y_1[~_1_nan_mask])

In [62]:
_1_pred = model.predict(Xs_1_states[~_1_nan_mask])

report(y_1, _1_pred, _1_nan_mask, "night 1")

night 1 report
              precision    recall  f1-score   support

         0.0       0.89      0.96      0.92     42499
         1.0       0.30      0.12      0.17      3711
         2.0       0.74      0.87      0.80     10566
         3.0       0.59      0.31      0.40      1476
         4.0       0.72      0.51      0.60       650
         5.0       0.53      0.24      0.33      3668

    accuracy                           0.83     62570
   macro avg       0.63      0.50      0.54     62570
weighted avg       0.80      0.83      0.81     62570

proportions in classes
[0.67922327 0.05930957 0.16886687 0.02358958 0.01038837 0.05862234]

confusion matrix
[[0.96 0.02 0.01 0.   0.   0.01]
 [0.5  0.12 0.32 0.   0.   0.06]
 [0.1  0.01 0.87 0.01 0.   0.02]
 [0.11 0.   0.54 0.31 0.05 0.  ]
 [0.05 0.   0.12 0.33 0.51 0.  ]
 [0.52 0.05 0.19 0.   0.   0.24]]

Fowlkes-Meadows: 0.8419
Homogeneity: 0.4113
Completeness: 0.5155
V Measure: 0.4576
Adjusted Rand: 0.6554
Adjusted Mutual: 0.4574
Accu

' & .8419 & .4113 & .5155 & .4576 & .6554 & .4574 & .8332'

In [63]:
_2_pred = model.predict(Xs_2_states[~_2_nan_mask])

report(y_2, _2_pred, _2_nan_mask, "night 2")

night 2 report
              precision    recall  f1-score   support

         0.0       0.88      0.93      0.90     43850
         1.0       0.28      0.09      0.13      3031
         2.0       0.76      0.77      0.77      9957
         3.0       0.43      0.29      0.35      1503
         4.0       0.22      0.63      0.32       700
         5.0       0.39      0.28      0.33      3947

    accuracy                           0.80     62988
   macro avg       0.50      0.50      0.47     62988
weighted avg       0.79      0.80      0.79     62988

proportions in classes
[0.69616435 0.04812028 0.15807773 0.02386169 0.01111323 0.06266273]

confusion matrix
[[0.93 0.01 0.01 0.   0.02 0.02]
 [0.56 0.09 0.22 0.   0.   0.14]
 [0.15 0.01 0.77 0.02 0.03 0.02]
 [0.1  0.   0.47 0.29 0.14 0.  ]
 [0.03 0.   0.07 0.28 0.63 0.  ]
 [0.53 0.04 0.16 0.   0.   0.28]]

Fowlkes-Meadows: 0.8064
Homogeneity: 0.3583
Completeness: 0.3999
V Measure: 0.378
Adjusted Rand: 0.5772
Adjusted Mutual: 0.3778
Accur

' & .8064 & .3583 & .3999 & .378 & .5772 & .3778 & .8038'