# Imports

In [1]:
%matplotlib inline

import scipy as sp
from scipy import stats
import numpy as np
import pandas as pd

# Project-local analysis module; note that it shadows the standard-library `statistics`.
import statistics

# Load Data

In [2]:
tracing_df = statistics.load_tracing_features()
recording_df = statistics.load_recording_relations()
recording_df = statistics.associate_recordings(recording_df, tracing_df)
gaze_dfs = statistics.load_gaze_features()
statistics.check_gaze_recording_associations(recording_df, gaze_dfs)
statistics.compute_gaze_features(gaze_dfs)
gaze_df = statistics.combine_gaze_features(gaze_dfs)
full_df = statistics.combine_all_features(recording_df, gaze_df)
full_df.columns

Index(['id', 'subjectNumber', 'scenarioNumber', 'newAfterOld', 'scenarioType',
       'displayType', 'sensorPlacementTime', 'ppvStartTime', 'ccStartTime',
       'inSpO2TargetRangeDuration', 'inSpO2LooseTargetRangeDuration',
       'inSpO2TargetRangeStartTime', 'aboveSpO2TargetRangeDuration',
       'belowSpO2TargetRangeDuration', 'inFiO2TargetRangeDuration',
       'inFiO2TargetRangeStartTime', 'aboveFiO2TargetRangeDuration',
       'belowFiO2TargetRangeDuration', 'spO2SignedErrorIntegral',
       'spO2UnsignedErrorIntegral', 'spO2SquaredErrorIntegral',
       'fiO2LargeAdjustments', 'visitDuration_fiO2Dial',
       'visitDuration_infant', 'visitDuration_monitorApgarTimer',
       'visitDuration_monitorBlank', 'visitDuration_monitorFiO2',
       'visitDuration_monitorFull', 'visitDuration_monitorGraph',
       'visitDuration_monitorHeartRate', 'visitDuration_monitorSpO2',
       'visitDuration_spO2ReferenceTable',
       'visitDuration_warmerInstrumentPanel', 'visitDuration_combinedFi

# Pairing

### Scenario Type Pairing

In [3]:
scenario_pairing = statistics.build_pairing(full_df, 'scenarioType')

In [4]:
scenario_pairing.describe()

Pairing against scenarioType:
  0: easy vs. 1: hard
  22 0 vs. 1 pairs.
  Paired t-test alternative hypotheses:
    Ha left-tailed (diff < 0): mean 0 - mean 1 < 0
    Ha two-tailed (|diff| > 0): mean 0 - mean 1 != 0
    Ha right-tailed (diff > 0): mean 0 - mean 1 > 0
  Wilcoxon signed-rank alternative hypotheses:
    Ha left-tailed (P(x > y) < 0.5)
    Ha two-tailed (P(x > y) != 0.5)
    Ha right-tailed (P(x > y) > 0.5)


In [5]:
statistics.test_gaze_duration_outcomes(scenario_pairing)

visitDuration_infant:
  mean diff = -13.453; stdev diff = 24.960
  Paired t-test:
  **|diff| > 0: p = 0.022
  **diff < 0: p = 0.011
    diff > 0: p = 0.989
  Wilcoxon signed-rank test:
  **P(x > y) != 0.5: p = 0.033
  **P(x > y) < 0.5: p = 0.017
    P(x > y) > 0.5: p = 0.983
visitDuration_warmerInstrumentPanel:
  mean diff = 1.327; stdev diff = 8.296
  Paired t-test:
    |diff| > 0: p = 0.472
    diff < 0: p = 0.764
    diff > 0: p = 0.236
  Wilcoxon signed-rank test:
    P(x > y) != 0.5: p = 0.355
    P(x > y) < 0.5: p = 0.823
   ~P(x > y) > 0.5: p = 0.177
visitDuration_fiO2Dial:
  mean diff = 2.428; stdev diff = 5.650
  Paired t-test:
   *|diff| > 0: p = 0.062
    diff < 0: p = 0.969
  **diff > 0: p = 0.031
  Wilcoxon signed-rank test:
   ~P(x > y) != 0.5: p = 0.168
    P(x > y) < 0.5: p = 0.916
   *P(x > y) > 0.5: p = 0.084
visitDuration_spO2ReferenceTable:
  mean diff = 4.176; stdev diff = 6.495
  Paired t-test:
  **|diff| > 0: p = 0.008
    diff < 0: p = 0.996
  **diff > 0: p = 0.

Observations:

* The Wilcoxon signed-rank and paired t-tests generally agree on which outcomes differ.
* In the easy scenario, subjects look less at the infant and the heart rate, and more at the FiO2 dial, the SpO2 reference table, the graph, and the combined SpO2 elements, than in the hard scenario.
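
As a rough illustration of the comparison in the first observation, the sketch below runs both paired tests under all three alternative hypotheses using SciPy directly. It is not the implementation behind `statistics.test_gaze_duration_outcomes`, which is not shown in this notebook; the data are synthetic and the helper name `paired_tests` is hypothetical.

```python
import numpy as np
from scipy import stats

# Synthetic paired durations (seconds); NOT data from this study.
rng = np.random.default_rng(0)
easy = rng.normal(loc=40.0, scale=10.0, size=22)
hard = easy + rng.normal(loc=5.0, scale=8.0, size=22)


def paired_tests(x, y):
    """Return p-values of both paired tests for each alternative hypothesis."""
    results = {}
    for alternative in ("two-sided", "less", "greater"):
        # Paired t-test on the differences x - y.
        t_p = stats.ttest_rel(x, y, alternative=alternative).pvalue
        # Wilcoxon signed-rank test on the same differences.
        w_p = stats.wilcoxon(x, y, alternative=alternative).pvalue
        results[alternative] = {"t": t_p, "wilcoxon": w_p}
    return results


p_values = paired_tests(easy, hard)
for alternative, p in p_values.items():
    print(f"{alternative:>9}: t-test p = {p['t']:.3f}; Wilcoxon p = {p['wilcoxon']:.3f}")
```

Here `"less"` corresponds to the left-tailed alternative (diff < 0) and `"greater"` to the right-tailed one (diff > 0), with diff taken as the first argument minus the second.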

### Display Type Pairing

In [6]:
display_pairing = statistics.build_pairing(full_df, 'displayType')

In [7]:
display_pairing.describe()

Pairing against displayType:
  0: minimal vs. 1: full
  26 0 vs. 1 pairs.
  Paired t-test alternative hypotheses:
    Ha left-tailed (diff < 0): mean 0 - mean 1 < 0
    Ha two-tailed (|diff| > 0): mean 0 - mean 1 != 0
    Ha right-tailed (diff > 0): mean 0 - mean 1 > 0
  Wilcoxon signed-rank alternative hypotheses:
    Ha left-tailed (P(x > y) < 0.5)
    Ha two-tailed (P(x > y) != 0.5)
    Ha right-tailed (P(x > y) > 0.5)


In [8]:
statistics.test_gaze_duration_outcomes(display_pairing)

visitDuration_infant:
  mean diff = 0.480; stdev diff = 23.748
  Paired t-test:
    |diff| > 0: p = 0.920
    diff < 0: p = 0.540
    diff > 0: p = 0.460
  Wilcoxon signed-rank test:
    P(x > y) != 0.5: p = 0.713
    P(x > y) < 0.5: p = 0.644
    P(x > y) > 0.5: p = 0.356
visitDuration_warmerInstrumentPanel:
  mean diff = -0.173; stdev diff = 9.184
  Paired t-test:
    |diff| > 0: p = 0.926
    diff < 0: p = 0.463
    diff > 0: p = 0.537
  Wilcoxon signed-rank test:
    P(x > y) != 0.5: p = 0.770
    P(x > y) < 0.5: p = 0.385
    P(x > y) > 0.5: p = 0.615
visitDuration_fiO2Dial:
  mean diff = 0.986; stdev diff = 5.208
  Paired t-test:
    |diff| > 0: p = 0.353
    diff < 0: p = 0.824
   ~diff > 0: p = 0.176
  Wilcoxon signed-rank test:
    P(x > y) != 0.5: p = 0.316
    P(x > y) < 0.5: p = 0.842
   ~P(x > y) > 0.5: p = 0.158
visitDuration_spO2ReferenceTable:
  mean diff = 0.851; stdev diff = 4.313
  Paired t-test:
    |diff| > 0: p = 0.334
    diff < 0: p = 0.833
   ~diff > 0: p = 0.1

Observations:

* The Wilcoxon signed-rank and paired t-tests generally agree on which outcomes differ.
* With the full display, subjects look less at the Apgar timer, the heart rate, and the SpO2, and more at the combined FiO2 and the combined SpO2, than with the minimal display.
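
The `mean diff` and `stdev diff` lines in the output relate to the paired t-test in a simple way: the test is a one-sample t-test on the per-pair differences. The sketch below works through that arithmetic on synthetic data and checks it against SciPy. It assumes the sample standard deviation (`ddof=1`); the notebook does not state which convention the reported `stdev diff` uses.

```python
import numpy as np
from scipy import stats

# Synthetic paired measurements; NOT data from this study.
rng = np.random.default_rng(1)
minimal = rng.normal(loc=30.0, scale=6.0, size=26)
full = minimal + rng.normal(loc=0.5, scale=5.0, size=26)

diff = minimal - full                 # condition 0 minus condition 1
n = diff.size
mean_diff = diff.mean()
stdev_diff = diff.std(ddof=1)         # sample standard deviation (assumed)

# Paired t statistic computed by hand from the summary values.
t_manual = mean_diff / (stdev_diff / np.sqrt(n))
p_manual = 2.0 * stats.t.sf(abs(t_manual), df=n - 1)   # two-tailed p-value

# The same test via SciPy, for comparison.
t_scipy, p_scipy = stats.ttest_rel(minimal, full)

print(f"mean diff = {mean_diff:.3f}; stdev diff = {stdev_diff:.3f}")
print(f"manual: t = {t_manual:.3f}, p = {p_manual:.3f}")
print(f"scipy:  t = {t_scipy:.3f}, p = {p_scipy:.3f}")
```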

### Display Type Pairing, Split by Scenario

In [9]:
scenario_display_pairings = {
    scenario: statistics.build_pairing(scenario_subset, 'displayType')
    for (scenario, scenario_subset) in enumerate(scenario_pairing)
}

#### Easy Scenarios

In [10]:
scenario_display_pairings[0].describe()

Pairing against displayType:
  0: minimal vs. 1: full
  11 0 vs. 1 pairs.
  Paired t-test alternative hypotheses:
    Ha left-tailed (diff < 0): mean 0 - mean 1 < 0
    Ha two-tailed (|diff| > 0): mean 0 - mean 1 != 0
    Ha right-tailed (diff > 0): mean 0 - mean 1 > 0
  Wilcoxon signed-rank alternative hypotheses:
    Ha left-tailed (P(x > y) < 0.5)
    Ha two-tailed (P(x > y) != 0.5)
    Ha right-tailed (P(x > y) > 0.5)


In [11]:
statistics.test_gaze_duration_outcomes(scenario_display_pairings[0])

visitDuration_infant:
  mean diff = -0.642; stdev diff = 19.301
  Paired t-test:
    |diff| > 0: p = 0.918
    diff < 0: p = 0.459
    diff > 0: p = 0.541
  Wilcoxon signed-rank test:
    P(x > y) != 0.5: p = 1.000
    P(x > y) < 0.5: p = 0.500
    P(x > y) > 0.5: p = 0.500
visitDuration_warmerInstrumentPanel:
  mean diff = 0.203; stdev diff = 8.799
  Paired t-test:
    |diff| > 0: p = 0.943
    diff < 0: p = 0.528
    diff > 0: p = 0.472
  Wilcoxon signed-rank test:
    P(x > y) != 0.5: p = 0.929
    P(x > y) < 0.5: p = 0.535
    P(x > y) > 0.5: p = 0.465
visitDuration_fiO2Dial:
  mean diff = 1.118; stdev diff = 5.596
  Paired t-test:
    |diff| > 0: p = 0.542
    diff < 0: p = 0.729
    diff > 0: p = 0.271
  Wilcoxon signed-rank test:
    P(x > y) != 0.5: p = 0.286
    P(x > y) < 0.5: p = 0.857
   ~P(x > y) > 0.5: p = 0.143
visitDuration_spO2ReferenceTable:
  mean diff = -0.285; stdev diff = 5.330
  Paired t-test:
    |diff| > 0: p = 0.869
    diff < 0: p = 0.434
    diff > 0: p = 0.

Observations:

* The Wilcoxon signed-rank and paired t-tests generally agree on which outcomes differ.
* With the full display, subjects look less at the Apgar timer, and more at the combined FiO2 and the combined SpO2, than with the minimal display. They may also look at the heart rate less.
* This is generally similar to the results for easy and hard scenarios combined, but with fewer significantly different outcomes.
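
The test output in this notebook prefixes some lines with `**`, `*`, or `~`. The notebook does not define these markers; the visible p-values are consistent with cutoffs of 0.05, 0.10, and 0.20, but that is an inference. The sketch below shows how such a marker could be produced under that assumption. The function names are hypothetical and the sample p-values are taken from the scenario type pairing output.

```python
def significance_marker(p, thresholds=((0.05, "**"), (0.10, "*"), (0.20, "~"))):
    """Return a marker for a p-value; the cutoffs are inferred, not confirmed."""
    for cutoff, marker in thresholds:
        if p < cutoff:
            return marker
    return ""


def format_result(label, p):
    # Right-align the marker in a 4-character gutter, mirroring the output layout.
    return f"{significance_marker(p):>4}{label}: p = {p:.3f}"


for label, p in [("|diff| > 0", 0.022), ("|diff| > 0", 0.062),
                 ("P(x > y) != 0.5", 0.168), ("|diff| > 0", 0.472)]:
    print(format_result(label, p))
```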

#### Hard Scenarios

In [12]:
scenario_display_pairings[1].describe()

Pairing against displayType:
  0: minimal vs. 1: full
  11 0 vs. 1 pairs.
  Paired t-test alternative hypotheses:
    Ha left-tailed (diff < 0): mean 0 - mean 1 < 0
    Ha two-tailed (|diff| > 0): mean 0 - mean 1 != 0
    Ha right-tailed (diff > 0): mean 0 - mean 1 > 0
  Wilcoxon signed-rank alternative hypotheses:
    Ha left-tailed (P(x > y) < 0.5)
    Ha two-tailed (P(x > y) != 0.5)
    Ha right-tailed (P(x > y) > 0.5)


In [13]:
statistics.test_gaze_duration_outcomes(scenario_display_pairings[1])

visitDuration_infant:
  mean diff = -2.841; stdev diff = 24.110
  Paired t-test:
    |diff| > 0: p = 0.717
    diff < 0: p = 0.359
    diff > 0: p = 0.641
  Wilcoxon signed-rank test:
    P(x > y) != 0.5: p = 0.859
    P(x > y) < 0.5: p = 0.429
    P(x > y) > 0.5: p = 0.571
visitDuration_warmerInstrumentPanel:
  mean diff = 0.745; stdev diff = 10.132
  Paired t-test:
    |diff| > 0: p = 0.821
    diff < 0: p = 0.590
    diff > 0: p = 0.410
  Wilcoxon signed-rank test:
    P(x > y) != 0.5: p = 0.790
    P(x > y) < 0.5: p = 0.605
    P(x > y) > 0.5: p = 0.395
visitDuration_fiO2Dial:
  mean diff = -0.367; stdev diff = 3.074
  Paired t-test:
    |diff| > 0: p = 0.713
    diff < 0: p = 0.357
    diff > 0: p = 0.643
  Wilcoxon signed-rank test:
    P(x > y) != 0.5: p = 0.790
    P(x > y) < 0.5: p = 0.395
    P(x > y) > 0.5: p = 0.605
visitDuration_spO2ReferenceTable:
  mean diff = 0.720; stdev diff = 2.121
  Paired t-test:
    |diff| > 0: p = 0.308
    diff < 0: p = 0.846
   ~diff > 0: p = 0

Observations:

* The Wilcoxon signed-rank and paired t-tests generally agree on which outcomes differ.
* With the full display, subjects look at the heart rate less and the combined FiO2 more than with the minimal display. They may also look at the combined SpO2 more.
* These results match those for easy and hard scenarios combined only for the combined FiO2 test.

#### Summary

* In both scenario types, subjects look at the combined FiO2 more with the full display (which shows the FiO2 reading) than with the minimal display (which doesn't).
* In the easy scenario type, subjects look at the combined SpO2 more with the full display (which shows the SpO2 graph) than with the minimal display (which doesn't). In the hard scenario type, they might also look at the combined SpO2 more.

### Scenario Order Pairing, Split by Scenario

In [14]:
scenario_order_pairings = {
    0: statistics.build_pairing(scenario_pairing[0], 'scenarioNumber', values=(1, 4), check_validity=False),
    1: statistics.build_pairing(scenario_pairing[1], 'scenarioNumber', values=(2, 3), check_validity=False),
}
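
To make the idea of pairing on two specific values of a column concrete, here is a minimal pandas sketch. It is not the implementation of `statistics.build_pairing`, which is not shown in this notebook; the data frame is made up, the helper `pair_by_values` is hypothetical, and it assumes one row per subject per scenario.

```python
import pandas as pd

# Hypothetical long-format data: one row per subject per scenario. NOT study data.
df = pd.DataFrame({
    "subjectNumber": [1, 1, 2, 2, 3, 3],
    "scenarioNumber": [1, 4, 1, 4, 1, 4],
    "visitDuration_infant": [50.0, 35.0, 42.0, 30.0, 61.0, 48.0],
})


def pair_by_values(frame, column, values, outcome, unit="subjectNumber"):
    """Pivot `outcome` so each row is one unit and each column one condition."""
    subset = frame[frame[column].isin(values)]
    wide = subset.pivot(index=unit, columns=column, values=outcome)
    # Keep only units observed under both conditions, in the requested order.
    return wide[list(values)].dropna()


pairs = pair_by_values(df, "scenarioNumber", (1, 4), "visitDuration_infant")
diff = pairs[1] - pairs[4]   # first minus second, matching "mean 0 - mean 1"
print(pairs)
print("mean diff:", diff.mean())
```

The per-subject differences in `diff` are what a paired t-test or Wilcoxon signed-rank test would then operate on.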

#### Easy Scenarios

In [15]:
scenario_order_pairings[0].describe()

Pairing against scenarioNumber:
  0: first vs. 1: second
  11 0 vs. 1 pairs.
  Paired t-test alternative hypotheses:
    Ha left-tailed (diff < 0): mean 0 - mean 1 < 0
    Ha two-tailed (|diff| > 0): mean 0 - mean 1 != 0
    Ha right-tailed (diff > 0): mean 0 - mean 1 > 0
  Wilcoxon signed-rank alternative hypotheses:
    Ha left-tailed (P(x > y) < 0.5)
    Ha two-tailed (P(x > y) != 0.5)
    Ha right-tailed (P(x > y) > 0.5)


In [16]:
statistics.test_gaze_duration_outcomes(scenario_order_pairings[0])

visitDuration_infant:
  mean diff = 14.653; stdev diff = 12.579
  Paired t-test:
  **|diff| > 0: p = 0.004
    diff < 0: p = 0.998
  **diff > 0: p = 0.002
  Wilcoxon signed-rank test:
  **P(x > y) != 0.5: p = 0.013
    P(x > y) < 0.5: p = 0.994
  **P(x > y) > 0.5: p = 0.006
visitDuration_warmerInstrumentPanel:
  mean diff = 0.247; stdev diff = 8.798
  Paired t-test:
    |diff| > 0: p = 0.931
    diff < 0: p = 0.535
    diff > 0: p = 0.465
  Wilcoxon signed-rank test:
    P(x > y) != 0.5: p = 0.286
    P(x > y) < 0.5: p = 0.857
   ~P(x > y) > 0.5: p = 0.143
visitDuration_fiO2Dial:
  mean diff = 0.863; stdev diff = 5.641
  Paired t-test:
    |diff| > 0: p = 0.639
    diff < 0: p = 0.681
    diff > 0: p = 0.319
  Wilcoxon signed-rank test:
    P(x > y) != 0.5: p = 0.594
    P(x > y) < 0.5: p = 0.703
    P(x > y) > 0.5: p = 0.297
visitDuration_spO2ReferenceTable:
  mean diff = -0.213; stdev diff = 5.334
  Paired t-test:
    |diff| > 0: p = 0.902
    diff < 0: p = 0.451
    diff > 0: p = 0.

Observations:

* In scenario 1, participants look more at the infant, less at the monitor, and possibly less at the combined SpO2 than in scenario 4.
* These results suggest an adaptation effect between scenarios 1 and 4.

#### Hard Scenarios

In [17]:
scenario_order_pairings[1].describe()

Pairing against scenarioNumber:
  0: first vs. 1: second
  11 0 vs. 1 pairs.
  Paired t-test alternative hypotheses:
    Ha left-tailed (diff < 0): mean 0 - mean 1 < 0
    Ha two-tailed (|diff| > 0): mean 0 - mean 1 != 0
    Ha right-tailed (diff > 0): mean 0 - mean 1 > 0
  Wilcoxon signed-rank alternative hypotheses:
    Ha left-tailed (P(x > y) < 0.5)
    Ha two-tailed (P(x > y) != 0.5)
    Ha right-tailed (P(x > y) > 0.5)


In [18]:
statistics.test_gaze_duration_outcomes(scenario_order_pairings[1])

visitDuration_infant:
  mean diff = 5.870; stdev diff = 23.556
  Paired t-test:
    |diff| > 0: p = 0.449
    diff < 0: p = 0.776
    diff > 0: p = 0.224
  Wilcoxon signed-rank test:
    P(x > y) != 0.5: p = 0.424
    P(x > y) < 0.5: p = 0.788
    P(x > y) > 0.5: p = 0.212
visitDuration_warmerInstrumentPanel:
  mean diff = -4.431; stdev diff = 9.142
  Paired t-test:
   ~|diff| > 0: p = 0.156
   *diff < 0: p = 0.078
    diff > 0: p = 0.922
  Wilcoxon signed-rank test:
    P(x > y) != 0.5: p = 0.213
   ~P(x > y) < 0.5: p = 0.107
    P(x > y) > 0.5: p = 0.893
visitDuration_fiO2Dial:
  mean diff = -0.265; stdev diff = 3.084
  Paired t-test:
    |diff| > 0: p = 0.791
    diff < 0: p = 0.396
    diff > 0: p = 0.604
  Wilcoxon signed-rank test:
    P(x > y) != 0.5: p = 0.722
    P(x > y) < 0.5: p = 0.361
    P(x > y) > 0.5: p = 0.639
visitDuration_spO2ReferenceTable:
  mean diff = -0.462; stdev diff = 2.192
  Paired t-test:
    |diff| > 0: p = 0.520
    diff < 0: p = 0.260
    diff > 0: p = 0

Observations:

* In scenario 2, participants look less at the blank part of the monitor, and possibly more at the Apgar timer, more at the combined SpO2, and less at the instrument panel, than in scenario 3.
* These results are weak but might suggest adaptation between scenarios 2 and 3.

#### Summary

* There seems to be a learning effect between scenarios 1 and 4, but at most a weak one between scenarios 2 and 3.