# Imports

In [1]:
%matplotlib inline

import scipy as sp
from scipy import stats
import numpy as np
import pandas as pd

import statistics

# Load Data

In [2]:
tracing_df = statistics.load_tracing_features()
tracing_df.columns

Index(['subjectNumber', 'scenarioNumber', 'newAfterOld', 'scenarioType',
       'displayType', 'sensorPlacementTime', 'ppvStartTime', 'ccStartTime',
       'inSpO2TargetRangeDuration', 'inSpO2LooseTargetRangeDuration',
       'inSpO2TargetRangeStartTime', 'aboveSpO2TargetRangeDuration',
       'belowSpO2TargetRangeDuration', 'inFiO2TargetRangeDuration',
       'inFiO2TargetRangeStartTime', 'aboveFiO2TargetRangeDuration',
       'belowFiO2TargetRangeDuration', 'spO2SignedErrorIntegral',
       'spO2UnsignedErrorIntegral', 'spO2SquaredErrorIntegral',
       'fiO2LargeAdjustments'],
      dtype='object')

# Pairing

### Scenario Type Pairing

In [3]:
scenario_pairing = statistics.build_pairing(tracing_df, 'scenarioType')

In [4]:
scenario_pairing.describe()

Pairing against scenarioType:
  0: easy vs. 1: hard
  30 0 vs. 1 pairs.
  Paired t-test alternative hypotheses:
    Ha left-tailed (diff < 0): mean 0 - mean 1 < 0
    Ha two-tailed (|diff| > 0): mean 0 - mean 1 != 0
    Ha right-tailed (diff > 0): mean 0 - mean 1 > 0
  Wilcoxon signed-rank alternative hypotheses:
    Ha left-tailed (P(x > y) < 0.5)
    Ha two-tailed (P(x > y) != 0.5)
    Ha right-tailed (P(x > y) > 0.5)
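
As a rough illustration of the hypotheses listed above, the sketch below runs the same three alternatives with `scipy.stats.ttest_rel` and `scipy.stats.wilcoxon`. It is not the code behind `statistics.build_pairing` or `statistics.test_tracing_outcomes`, whose implementation is not shown in this notebook. The helper name `paired_tests` and the sample values are made up for illustration, and the `alternative` keyword assumes SciPy 1.6 or later.

```python
import numpy as np
from scipy import stats


def paired_tests(x, y):
    """Return p-values for the three alternatives of each paired test."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    results = {"mean difference": float(np.mean(x - y))}
    # SciPy names the alternatives "less" (diff < 0), "two-sided", "greater" (diff > 0).
    for alternative in ("two-sided", "less", "greater"):
        t_p = stats.ttest_rel(x, y, alternative=alternative).pvalue
        w_p = stats.wilcoxon(x, y, alternative=alternative).pvalue
        results[alternative] = {"t-test": float(t_p), "wilcoxon": float(w_p)}
    return results


# Hypothetical paired measurements (condition 0 vs. condition 1), not study data.
condition_0 = [12.0, 15.5, 9.0, 20.0, 14.5, 11.0, 18.0, 16.5]
condition_1 = [10.0, 11.5, 9.5, 15.0, 13.0, 8.0, 12.5, 14.0]

results = paired_tests(condition_0, condition_1)
for name, value in results.items():
    print(name, value)
```

For the t-test, the two one-tailed p-values sum to 1 and the two-tailed p-value is twice the smaller of them, which matches the pattern in the outputs of this notebook (for example 0.075, 0.963, and 0.037 for `ppvStartTime`).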


In [5]:
statistics.test_tracing_outcomes(scenario_pairing)

sensorPlacementTime:
  mean difference = 0.800
  Paired t-test:
    |diff| > 0: p = 0.800
    diff < 0: p = 0.600
    diff > 0: p = 0.400
  Wilcoxon signed-rank test:
    P(x > y) != 0.5: p = 0.793
    P(x > y) < 0.5: p = 0.603
    P(x > y) > 0.5: p = 0.397
ppvStartTime:
  mean difference = 4.833
  Paired t-test:
   *|diff| > 0: p = 0.075
    diff < 0: p = 0.963
  **diff > 0: p = 0.037
  Wilcoxon signed-rank test:
   ~P(x > y) != 0.5: p = 0.130
    P(x > y) < 0.5: p = 0.935
   *P(x > y) > 0.5: p = 0.065
ccStartTime:
  mean difference = 99.567
  Paired t-test:
  **|diff| > 0: p = 0.000
    diff < 0: p = 1.000
  **diff > 0: p = 0.000
  Wilcoxon signed-rank test:
  **P(x > y) != 0.5: p = 0.000
    P(x > y) < 0.5: p = 1.000
  **P(x > y) > 0.5: p = 0.000
inSpO2TargetRangeStartTime:
  mean difference = -67.733
  Paired t-test:
  **|diff| > 0: p = 0.000
  **diff < 0: p = 0.000
    diff > 0: p = 1.000
  Wilcoxon signed-rank test:
  **P(x > y) != 0.5: p = 0.000
  **P(x > y) < 0.5: p = 0.000
   

Observations:

* The Wilcoxon signed-rank test and the paired t-test generally agree.
* Most outcomes show significant differences.
* PPV starts later in the easy scenario, unexpectedly. Perhaps the hard scenario gives a higher sense of urgency.
* SpO2 target range is reached earlier in the easy scenario and maintained for longer in the easy scenario, as expected.
* Participants are above the SpO2 target range for longer in the easy scenario. However, this may simply reflect that the hard scenario makes it difficult for participants to raise SpO2 high enough to reach the target range at all.
* Participants are below the SpO2 target range for longer in the hard scenario, as expected.
* Participants are in the FiO2 target range less of the time in the easy scenario, unexpectedly. They're above the target range more of the time in the easy scenario. They're below the target range less of the time in the hard scenario. This suggests that, in the easy scenario, participants tend to keep FiO2 higher than they should. This could be an artifact of the scenario design, or it could point to a tendency to overestimate oxygenation needs. Note that, in the hard scenario, the target range is the maximum possible, so we can't say whether overestimation differs by scenario difficulty.
* Participants have lower absolute and squared SpO2 error in the easy scenario, as expected. However, they have higher signed SpO2 error in the easy scenario, confirming that participants tend to be above the target range more than below the target range in the easy scenario. This could just be because it's impossible to be above the target range in the hard scenario, however.
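
The difference between signed, unsigned (absolute), and squared error in the last observation can be made concrete with a small sketch. It assumes that the error is the signed distance outside the target range (zero while inside it) and that each integral is taken over time with the trapezoidal rule. The function name, the target range, and the sample trace are hypothetical; the notebook does not show how the `spO2...ErrorIntegral` features are actually computed.

```python
import numpy as np


def error_integrals(times, spo2, target_low, target_high):
    """Integrate signed, unsigned, and squared range error over time."""
    times = np.asarray(times, dtype=float)
    spo2 = np.asarray(spo2, dtype=float)
    # Signed distance outside the target range; zero while inside it (assumption).
    error = np.where(
        spo2 > target_high,
        spo2 - target_high,
        np.where(spo2 < target_low, spo2 - target_low, 0.0),
    )
    dt = np.diff(times)

    def trapezoid(values):
        # Trapezoidal rule written out by hand to avoid version-specific NumPy names.
        return float(np.sum((values[1:] + values[:-1]) / 2.0 * dt))

    return {
        "signed": trapezoid(error),
        "unsigned": trapezoid(np.abs(error)),
        "squared": trapezoid(error ** 2),
    }


# Hypothetical trace that starts below the range and ends above it.
times = [0, 1, 2, 3, 4]
spo2 = [84, 86, 90, 94, 94]
integrals = error_integrals(times, spo2, target_low=88, target_high=92)
print(integrals)
```

In this example the undershoot and overshoot partly cancel in the signed integral but add up in the unsigned and squared integrals, which is why a scenario can show lower absolute error and higher signed error at the same time.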

### Display Type Pairing

In [6]:
display_pairing = statistics.build_pairing(tracing_df, 'displayType')

In [7]:
display_pairing.describe()

Pairing against displayType:
  0: minimal vs. 1: full
  33 0 vs. 1 pairs.
  Paired t-test alternative hypotheses:
    Ha left-tailed (diff < 0): mean 0 - mean 1 < 0
    Ha two-tailed (|diff| > 0): mean 0 - mean 1 != 0
    Ha right-tailed (diff > 0): mean 0 - mean 1 > 0
  Wilcoxon signed-rank alternative hypotheses:
    Ha left-tailed (P(x > y) < 0.5)
    Ha two-tailed (P(x > y) != 0.5)
    Ha right-tailed (P(x > y) > 0.5)


In [8]:
statistics.test_tracing_outcomes(display_pairing)

sensorPlacementTime:
  mean difference = -5.000
  Paired t-test:
    |diff| > 0: p = 0.203
   ~diff < 0: p = 0.102
    diff > 0: p = 0.898
  Wilcoxon signed-rank test:
   ~P(x > y) != 0.5: p = 0.194
   *P(x > y) < 0.5: p = 0.097
    P(x > y) > 0.5: p = 0.903
ppvStartTime:
  mean difference = -0.636
  Paired t-test:
    |diff| > 0: p = 0.813
    diff < 0: p = 0.406
    diff > 0: p = 0.594
  Wilcoxon signed-rank test:
    P(x > y) != 0.5: p = 0.503
    P(x > y) < 0.5: p = 0.251
    P(x > y) > 0.5: p = 0.749
ccStartTime:
  mean difference = -2.485
  Paired t-test:
    |diff| > 0: p = 0.522
    diff < 0: p = 0.261
    diff > 0: p = 0.739
  Wilcoxon signed-rank test:
    P(x > y) != 0.5: p = 0.712
    P(x > y) < 0.5: p = 0.356
    P(x > y) > 0.5: p = 0.644
inSpO2TargetRangeStartTime:
  mean difference = 6.242
  Paired t-test:
    |diff| > 0: p = 0.535
    diff < 0: p = 0.732
    diff > 0: p = 0.268
  Wilcoxon signed-rank test:
    P(x > y) != 0.5: p = 0.692
    P(x > y) < 0.5: p = 0.654
   

Observations:

* The Wilcoxon signed-rank test and the paired t-test generally agree.
* No outcomes show significant differences. This suggests that we need to split the groups by scenario difficulty, as expected.
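
The outputs in this notebook prefix some hypotheses with `**`, `*`, or `~`, but the cutoffs are never stated. Judging only from the printed p-values, the markers appear to correspond to p < 0.05, p < 0.10, and p < 0.20. The sketch below encodes that inferred rule; the function name and the thresholds are assumptions, not code from the `statistics` module.

```python
def significance_marker(p):
    """Map a p-value to the marker it appears to receive in the printed output."""
    if p < 0.05:  # inferred threshold
        return "**"
    if p < 0.10:  # inferred threshold
        return "*"
    if p < 0.20:  # inferred threshold
        return "~"
    return ""


# p-values copied from the display type pairing output for sensorPlacementTime.
for p in (0.203, 0.102, 0.097):
    print(f"{significance_marker(p):>2} p = {p:.3f}")
```

Under this reading, the `*` on the one-tailed Wilcoxon test for `sensorPlacementTime` marks a marginal result rather than significance at the 0.05 level.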

### Display Type Pairing, Split by Scenario

In [9]:
scenario_display_pairings = {
    scenario: statistics.build_pairing(scenario_subset, 'displayType')
    for (scenario, scenario_subset) in enumerate(scenario_pairing)
}
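
To picture what a per-scenario split involves, here is a minimal pandas sketch that groups rows by `scenarioType` and then pairs each subject's minimal-display and full-display values. It only illustrates the idea: `pair_by` is a hypothetical helper, the data frame is invented, and `statistics.build_pairing` may work quite differently.

```python
import pandas as pd


def pair_by(df, column, outcome):
    """Return one row per subject with the outcome under each level of `column`."""
    wide = df.pivot_table(
        index="subjectNumber", columns=column, values=outcome, aggfunc="mean"
    )
    # Keep only subjects observed under every level, so each row is a complete pair.
    return wide.dropna()


# Invented data: two subjects, each seeing both displays in both scenario types.
df = pd.DataFrame({
    "subjectNumber": [1, 1, 1, 1, 2, 2, 2, 2],
    "scenarioType": [0, 0, 1, 1, 0, 0, 1, 1],
    "displayType": [0, 1, 0, 1, 0, 1, 0, 1],
    "ppvStartTime": [30.0, 28.0, 25.0, 20.0, 35.0, 36.0, 27.0, 24.0],
})

pairings = {
    scenario: pair_by(subset, "displayType", "ppvStartTime")
    for scenario, subset in df.groupby("scenarioType")
}
# Difference is display 0 minus display 1, matching "mean 0 - mean 1" in the outputs.
differences = {
    scenario: (wide[0] - wide[1]).tolist() for scenario, wide in pairings.items()
}
print(differences)
```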

#### Easy Scenarios

In [10]:
scenario_display_pairings[0].describe()

Pairing against displayType:
  0: minimal vs. 1: full
  15 0 vs. 1 pairs.
  Paired t-test alternative hypotheses:
    Ha left-tailed (diff < 0): mean 0 - mean 1 < 0
    Ha two-tailed (|diff| > 0): mean 0 - mean 1 != 0
    Ha right-tailed (diff > 0): mean 0 - mean 1 > 0
  Wilcoxon signed-rank alternative hypotheses:
    Ha left-tailed (P(x > y) < 0.5)
    Ha two-tailed (P(x > y) != 0.5)
    Ha right-tailed (P(x > y) > 0.5)


In [11]:
statistics.test_tracing_outcomes(scenario_display_pairings[0])

sensorPlacementTime:
  mean difference = -3.533
  Paired t-test:
    |diff| > 0: p = 0.557
    diff < 0: p = 0.279
    diff > 0: p = 0.721
  Wilcoxon signed-rank test:
    P(x > y) != 0.5: p = 0.222
   ~P(x > y) < 0.5: p = 0.111
    P(x > y) > 0.5: p = 0.889
ppvStartTime:
  mean difference = -4.933
  Paired t-test:
    |diff| > 0: p = 0.368
   ~diff < 0: p = 0.184
    diff > 0: p = 0.816
  Wilcoxon signed-rank test:
    P(x > y) != 0.5: p = 0.609
    P(x > y) < 0.5: p = 0.305
    P(x > y) > 0.5: p = 0.695
ccStartTime:
  mean difference = 0.000
    Skipped t-test.
    Skipped Wilcoxon test.
inSpO2TargetRangeStartTime:
  mean difference = 9.467
  Paired t-test:
    |diff| > 0: p = 0.623
    diff < 0: p = 0.689
    diff > 0: p = 0.311
  Wilcoxon signed-rank test:
    P(x > y) != 0.5: p = 0.727
    P(x > y) < 0.5: p = 0.637
    P(x > y) > 0.5: p = 0.363
inSpO2TargetRangeDuration:
  mean difference = 0.400
  Paired t-test:
    |diff| > 0: p = 0.966
    diff < 0: p = 0.517
    diff > 0: p = 

  z = (T - mn - correction) / se


Observations:

* The Wilcoxon signed-rank test and the paired t-test generally agree.
* No outcomes show significant differences. Thus, we can't conclude that the display improves performance during easy scenarios.
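
The `ccStartTime` outcome above reports a mean difference of 0.000 and skips both tests. That is consistent with every paired difference being zero, in which case the t statistic is 0/0 and the signed-rank test has no nonzero differences to rank. The notebook does not show the actual skip condition, so the guard below is only a guess at the idea; `paired_tests_or_skip` is a hypothetical name and the sample values are invented.

```python
import numpy as np
from scipy import stats


def paired_tests_or_skip(x, y):
    """Run both paired tests, or return None when every difference is zero."""
    differences = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    if np.all(differences == 0):
        # Neither test is defined without any nonzero difference.
        return None
    return {
        "t-test": float(stats.ttest_rel(x, y).pvalue),
        "wilcoxon": float(stats.wilcoxon(x, y).pvalue),
    }


# Identical pairs are skipped; pairs with nonzero differences are tested.
skipped = paired_tests_or_skip([5, 5, 5], [5, 5, 5])
tested = paired_tests_or_skip([3, 6, 9, 13, 8, 10], [1, 5, 6, 9, 10.5, 4])
print(skipped, tested)
```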

#### Hard Scenarios

In [12]:
scenario_display_pairings[1].describe()

Pairing against displayType:
  0: minimal vs. 1: full
  15 0 vs. 1 pairs.
  Paired t-test alternative hypotheses:
    Ha left-tailed (diff < 0): mean 0 - mean 1 < 0
    Ha two-tailed (|diff| > 0): mean 0 - mean 1 != 0
    Ha right-tailed (diff > 0): mean 0 - mean 1 > 0
  Wilcoxon signed-rank alternative hypotheses:
    Ha left-tailed (P(x > y) < 0.5)
    Ha two-tailed (P(x > y) != 0.5)
    Ha right-tailed (P(x > y) > 0.5)


In [13]:
statistics.test_tracing_outcomes(scenario_display_pairings[1])

sensorPlacementTime:
  mean difference = -3.933
  Paired t-test:
    |diff| > 0: p = 0.490
    diff < 0: p = 0.245
    diff > 0: p = 0.755
  Wilcoxon signed-rank test:
    P(x > y) != 0.5: p = 0.826
    P(x > y) < 0.5: p = 0.413
    P(x > y) > 0.5: p = 0.587
ppvStartTime:
  mean difference = 4.067
  Paired t-test:
  **|diff| > 0: p = 0.045
    diff < 0: p = 0.978
  **diff > 0: p = 0.022
  Wilcoxon signed-rank test:
   *P(x > y) != 0.5: p = 0.073
    P(x > y) < 0.5: p = 0.963
  **P(x > y) > 0.5: p = 0.037
ccStartTime:
  mean difference = -5.533
  Paired t-test:
    |diff| > 0: p = 0.526
    diff < 0: p = 0.263
    diff > 0: p = 0.737
  Wilcoxon signed-rank test:
    P(x > y) != 0.5: p = 0.675
    P(x > y) < 0.5: p = 0.337
    P(x > y) > 0.5: p = 0.663
inSpO2TargetRangeStartTime:
  mean difference = -1.200
  Paired t-test:
    |diff| > 0: p = 0.891
    diff < 0: p = 0.446
    diff > 0: p = 0.554
  Wilcoxon signed-rank test:
    P(x > y) != 0.5: p = 0.789
    P(x > y) < 0.5: p = 0.395
   

  z = (T - mn - correction) / se


Observations:

* PPV starts earlier with the full display. Perhaps the prominently displayed graph region prompts participants to think about oxygen supplementation sooner.
* There's no significant difference in the time when participants first reach the SpO2 target range.
* Time in the SpO2 target range might be lower (and time below the target range higher) with the full display. However, when the target range is widened by +/- 5 percentage points, the difference is not significant.
* Time in the FiO2 target range might also be lower (and time below the target range higher) with the full display. The Wilcoxon test suggests this is a significant finding.
* Absolute error (and possibly squared error) might be higher with the full display. The Wilcoxon test suggests this is a significant finding for absolute error.
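
The comparison between the strict target range and the range widened by +/- 5 percentage points can be sketched as follows. This is an illustration under stated assumptions: the function name, the target range, the sample trace, and the rule of crediting each sampling interval to the status at its start are all hypothetical, and the notebook does not show how `inSpO2TargetRangeDuration` or `inSpO2LooseTargetRangeDuration` are computed.

```python
import numpy as np


def in_range_duration(times, values, low, high, tolerance=0.0):
    """Total time spent within [low - tolerance, high + tolerance]."""
    times = np.asarray(times, dtype=float)
    values = np.asarray(values, dtype=float)
    inside = (values >= low - tolerance) & (values <= high + tolerance)
    # Credit each sampling interval to the range status at its start (assumption).
    return float(np.sum(np.diff(times)[inside[:-1]]))


# Hypothetical SpO2 trace sampled every 10 seconds.
times = [0, 10, 20, 30, 40, 50]
spo2 = [70, 80, 84, 88, 91, 95]

strict = in_range_duration(times, spo2, low=85, high=92)
loose = in_range_duration(times, spo2, low=85, high=92, tolerance=5)
print(strict, loose)
```

A trace that hovers just outside the strict range counts as out of range under the strict measure but in range under the loose one, so a difference that appears only in the strict measure may reflect small deviations near the boundary.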

#### Summary

* The display doesn't seem to improve outcomes on the easy scenario.
* The display seems to make outcomes worse on the hard scenario.

### Scenario Order Pairing, Split by Scenario

In [14]:
scenario_order_pairings = {
    0: statistics.build_pairing(
        scenario_pairing[0], 'scenarioNumber', values=(1, 4), check_validity=False
    ),
    1: statistics.build_pairing(
        scenario_pairing[1], 'scenarioNumber', values=(2, 3), check_validity=False
    ),
}

#### Easy Scenarios

In [15]:
scenario_order_pairings[0].describe()

Pairing against scenarioNumber:
  0: first vs. 1: second
  15 0 vs. 1 pairs.
  Paired t-test alternative hypotheses:
    Ha left-tailed (diff < 0): mean 0 - mean 1 < 0
    Ha two-tailed (|diff| > 0): mean 0 - mean 1 != 0
    Ha right-tailed (diff > 0): mean 0 - mean 1 > 0
  Wilcoxon signed-rank alternative hypotheses:
    Ha left-tailed (P(x > y) < 0.5)
    Ha two-tailed (P(x > y) != 0.5)
    Ha right-tailed (P(x > y) > 0.5)


In [16]:
statistics.test_tracing_outcomes(scenario_order_pairings[0])

sensorPlacementTime:
  mean difference = 13.000
  Paired t-test:
  **|diff| > 0: p = 0.018
    diff < 0: p = 0.991
  **diff > 0: p = 0.009
  Wilcoxon signed-rank test:
  **P(x > y) != 0.5: p = 0.015
    P(x > y) < 0.5: p = 0.993
  **P(x > y) > 0.5: p = 0.007
ppvStartTime:
  mean difference = 14.800
  Paired t-test:
  **|diff| > 0: p = 0.002
    diff < 0: p = 0.999
  **diff > 0: p = 0.001
  Wilcoxon signed-rank test:
  **P(x > y) != 0.5: p = 0.003
    P(x > y) < 0.5: p = 0.999
  **P(x > y) > 0.5: p = 0.001
ccStartTime:
  mean difference = 0.000
    Skipped t-test.
    Skipped Wilcoxon test.
inSpO2TargetRangeStartTime:
  mean difference = 30.267
  Paired t-test:
   *|diff| > 0: p = 0.100
    diff < 0: p = 0.950
  **diff > 0: p = 0.050
  Wilcoxon signed-rank test:
  **P(x > y) != 0.5: p = 0.021
    P(x > y) < 0.5: p = 0.989
  **P(x > y) > 0.5: p = 0.011
inSpO2TargetRangeDuration:
  mean difference = -18.267
  Paired t-test:
  **|diff| > 0: p = 0.036
  **diff < 0: p = 0.018
    diff > 0: p

  z = (T - mn - correction) / se


Observations:

* Participants in scenario 1 take longer (compared to scenario 4) to place the sensor, start PPV, and reach the SpO2 target range.
* Participants in scenario 4 spend more time (compared to scenario 1) in the SpO2 target range.
* Participants in scenario 1 accumulate higher absolute and squared errors (compared to scenario 4).
* These results all point to a learning effect between scenarios 1 and 4.

#### Hard Scenarios

In [17]:
scenario_order_pairings[1].describe()

Pairing against scenarioNumber:
  0: first vs. 1: second
  15 0 vs. 1 pairs.
  Paired t-test alternative hypotheses:
    Ha left-tailed (diff < 0): mean 0 - mean 1 < 0
    Ha two-tailed (|diff| > 0): mean 0 - mean 1 != 0
    Ha right-tailed (diff > 0): mean 0 - mean 1 > 0
  Wilcoxon signed-rank alternative hypotheses:
    Ha left-tailed (P(x > y) < 0.5)
    Ha two-tailed (P(x > y) != 0.5)
    Ha right-tailed (P(x > y) > 0.5)


In [18]:
statistics.test_tracing_outcomes(scenario_order_pairings[1])

sensorPlacementTime:
  mean difference = 9.533
  Paired t-test:
   *|diff| > 0: p = 0.080
    diff < 0: p = 0.960
  **diff > 0: p = 0.040
  Wilcoxon signed-rank test:
   ~P(x > y) != 0.5: p = 0.116
    P(x > y) < 0.5: p = 0.942
   *P(x > y) > 0.5: p = 0.058
ppvStartTime:
  mean difference = -1.133
  Paired t-test:
    |diff| > 0: p = 0.601
    diff < 0: p = 0.301
    diff > 0: p = 0.699
  Wilcoxon signed-rank test:
    P(x > y) != 0.5: p = 0.649
    P(x > y) < 0.5: p = 0.325
    P(x > y) > 0.5: p = 0.675
ccStartTime:
  mean difference = 21.267
  Paired t-test:
  **|diff| > 0: p = 0.006
    diff < 0: p = 0.997
  **diff > 0: p = 0.003
  Wilcoxon signed-rank test:
  **P(x > y) != 0.5: p = 0.003
    P(x > y) < 0.5: p = 0.999
  **P(x > y) > 0.5: p = 0.001
inSpO2TargetRangeStartTime:
  mean difference = -0.667
  Paired t-test:
    |diff| > 0: p = 0.940
    diff < 0: p = 0.470
    diff > 0: p = 0.530
  Wilcoxon signed-rank test:
    P(x > y) != 0.5: p = 0.824
    P(x > y) < 0.5: p = 0.412
   

  z = (T - mn - correction) / se


Observations:

* Participants in scenario 2 take longer (compared to scenario 3) to place the sensor and start CC. There is no significant difference for starting PPV or entering the SpO2 target range.
* There are no significant differences for time in SpO2 and FiO2 target ranges between scenarios 2 and 3.
* There are no significant differences in accumulated errors between scenarios 2 and 3. One caveat: accumulated squared error might be higher in scenario 2 than in scenario 3.
* These results do not suggest a learning effect between scenarios 2 and 3.

#### Summary

* There seems to be a learning effect between scenarios 1 and 4, but not between scenarios 2 and 3.