# Imports

In [1]:
%matplotlib inline

import scipy as sp
from scipy import stats
import numpy as np
import pandas as pd

# Project-local helper module; it shadows the standard library's `statistics`.
import statistics

# Load Data

In [2]:
tracing_df = statistics.load_tracing_features()
tracing_df.columns

Index(['subjectNumber', 'scenarioNumber', 'newAfterOld', 'scenarioType',
       'displayType', 'sensorPlacementTime', 'ppvStartTime', 'ccStartTime',
       'inSpO2TargetRangeDuration', 'inSpO2LooseTargetRangeDuration',
       'inSpO2TargetRangeStartTime', 'aboveSpO2TargetRangeDuration',
       'belowSpO2TargetRangeDuration', 'inFiO2TargetRangeDuration',
       'inFiO2TargetRangeStartTime', 'aboveFiO2TargetRangeDuration',
       'belowFiO2TargetRangeDuration', 'spO2SignedErrorIntegral',
       'spO2UnsignedErrorIntegral', 'spO2SquaredErrorIntegral',
       'fiO2LargeAdjustments'],
      dtype='object')

# Pairing

### Scenario Type Pairing

In [3]:
scenario_pairing = statistics.build_pairing(tracing_df, 'scenarioType')

In [4]:
scenario_pairing.describe()

Pairing against scenarioType:
  0: easy vs. 1: hard
  40 0 vs. 1 pairs.
  Paired t-test alternative hypotheses:
    Ha left-tailed (diff < 0): mean 0 - mean 1 < 0
    Ha two-tailed (|diff| > 0): mean 0 - mean 1 != 0
    Ha right-tailed (diff > 0): mean 0 - mean 1 > 0
  Wilcoxon signed-rank alternative hypotheses:
    Ha left-tailed (P(x > y) < 0.5)
    Ha two-tailed (P(x > y) != 0.5)
    Ha right-tailed (P(x > y) > 0.5)
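
For the paired t-test, the three alternatives listed above are linked by the symmetry of the t distribution: the tail on the side of the observed mean difference gets half of the two-tailed p-value, and the opposite tail gets the remainder. The sketch below illustrates that relationship only. The function name `one_tailed_from_two_tailed` and its inputs are hypothetical and are not part of the notebook's helper module.

```python
def one_tailed_from_two_tailed(p_two_tailed, mean_diff):
    """Return (left-tailed, right-tailed) p-values for a paired t-test.

    Relies on the symmetry of the t distribution: the tail on the side of
    the observed mean difference gets half of the two-tailed p-value.
    """
    half = p_two_tailed / 2
    if mean_diff > 0:
        return 1 - half, half
    return half, 1 - half


# Illustrative inputs: a two-tailed p of 0.109 with a positive mean difference.
p_left, p_right = one_tailed_from_two_tailed(0.109, mean_diff=9.8)
print(f"diff < 0: p = {p_left:.4f}")
print(f"diff > 0: p = {p_right:.4f}")
```

Because the input p-value is itself rounded, the result only matches a reported one-tailed value approximately.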


In [5]:
statistics.test_tracing_outcomes(scenario_pairing)

inSpO2TargetRangeDuration:
  mean diff = 9.800; stdev diff = 37.349
  Paired t-test:
   ~|diff| > 0: p = 0.109
    diff < 0: p = 0.945
   *diff > 0: p = 0.055
  Wilcoxon signed-rank test:
   ~P(x > y) != 0.5: p = 0.110
    P(x > y) < 0.5: p = 0.945
   *P(x > y) > 0.5: p = 0.055
inSpO2LooseTargetRangeDuration:
  mean diff = 54.200; stdev diff = 35.827
  Paired t-test:
  **|diff| > 0: p = 0.000
    diff < 0: p = 1.000
  **diff > 0: p = 0.000
  Wilcoxon signed-rank test:
  **P(x > y) != 0.5: p = 0.000
    P(x > y) < 0.5: p = 1.000
  **P(x > y) > 0.5: p = 0.000


Observations:

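* The Wilcoxon signed-rank and paired t-tests agree closely: for each outcome shown, their p-values differ by at most 0.001.
* Participants spend more time in the SpO2 target range on the easy scenario than on the hard scenario, as expected. The difference is clearly significant for the loose range (p < 0.001) but only marginal for the strict range (one-tailed p = 0.055).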
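
The comparison between the two tests can be reproduced directly with SciPy. This is a minimal sketch on synthetic paired data, not the study data, and it does not show how `statistics.test_tracing_outcomes` is implemented. It assumes a SciPy version in which both `scipy.stats.ttest_rel` and `scipy.stats.wilcoxon` accept the `alternative` argument.

```python
import numpy as np
from scipy import stats

# Synthetic paired measurements (illustrative only, not the study data).
rng = np.random.default_rng(0)
x = rng.normal(loc=60.0, scale=30.0, size=40)      # e.g. condition 0
y = x - rng.normal(loc=10.0, scale=35.0, size=40)  # e.g. condition 1

results = {}
for alternative in ("two-sided", "less", "greater"):
    # Both tests operate on the per-pair differences x - y.
    t_res = stats.ttest_rel(x, y, alternative=alternative)
    w_res = stats.wilcoxon(x, y, alternative=alternative)
    results[alternative] = (t_res.pvalue, w_res.pvalue)
    print(f"{alternative:>9}: t-test p = {t_res.pvalue:.3f}, "
          f"Wilcoxon p = {w_res.pvalue:.3f}")
```

The t-test assumes roughly normal differences, while the Wilcoxon test only uses the ranks of the differences, so close agreement between the two suggests the conclusions do not hinge on the normality assumption.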

### Display Type Pairing

In [6]:
display_pairing = statistics.build_pairing(tracing_df, 'displayType')

In [7]:
display_pairing.describe()

Pairing against displayType:
  0: minimal vs. 1: full
  44 0 vs. 1 pairs.
  Paired t-test alternative hypotheses:
    Ha left-tailed (diff < 0): mean 0 - mean 1 < 0
    Ha two-tailed (|diff| > 0): mean 0 - mean 1 != 0
    Ha right-tailed (diff > 0): mean 0 - mean 1 > 0
  Wilcoxon signed-rank alternative hypotheses:
    Ha left-tailed (P(x > y) < 0.5)
    Ha two-tailed (P(x > y) != 0.5)
    Ha right-tailed (P(x > y) > 0.5)


In [8]:
statistics.test_tracing_outcomes(display_pairing)

inSpO2TargetRangeDuration:
  mean diff = -0.136; stdev diff = 34.405
  Paired t-test:
    |diff| > 0: p = 0.979
    diff < 0: p = 0.490
    diff > 0: p = 0.510
  Wilcoxon signed-rank test:
    P(x > y) != 0.5: p = 0.844
    P(x > y) < 0.5: p = 0.422
    P(x > y) > 0.5: p = 0.578
inSpO2LooseTargetRangeDuration:
  mean diff = 1.045; stdev diff = 38.446
  Paired t-test:
    |diff| > 0: p = 0.859
    diff < 0: p = 0.570
    diff > 0: p = 0.430
  Wilcoxon signed-rank test:
    P(x > y) != 0.5: p = 0.769
    P(x > y) < 0.5: p = 0.615
    P(x > y) > 0.5: p = 0.385


Observations:

* The Wilcoxon signed-rank and paired t-tests lead to the same conclusions.
* Neither outcome shows a significant difference (all p >= 0.385). Pooling the easy and hard scenarios may mask a display effect that depends on difficulty, so we need to split the groups by scenario difficulty, as expected.
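
Splitting by difficulty before pairing can be sketched with plain pandas. The table below is made up for illustration and only reuses the notebook's column names. The sketch is an assumption about the general approach, not a description of what `statistics.build_pairing` does internally.

```python
import pandas as pd

# Tiny made-up long-format table using the notebook's column names.
df = pd.DataFrame({
    "subjectNumber": [1, 1, 1, 1, 2, 2, 2, 2],
    "scenarioType":  [0, 0, 1, 1, 0, 0, 1, 1],   # 0: easy, 1: hard
    "displayType":   [0, 1, 0, 1, 0, 1, 0, 1],   # 0: minimal, 1: full
    "inSpO2TargetRangeDuration": [50, 55, 40, 20, 60, 70, 35, 25],
})

diffs_by_scenario = {}
for scenario, subset in df.groupby("scenarioType"):
    # One row per subject, one column per display type.
    wide = subset.pivot(index="subjectNumber", columns="displayType",
                        values="inSpO2TargetRangeDuration")
    # Same sign convention as the pairing output: condition 0 minus condition 1.
    diffs_by_scenario[scenario] = wide[0] - wide[1]

for scenario, diffs in diffs_by_scenario.items():
    print(f"scenarioType {scenario}: mean diff = {diffs.mean():.3f}")
```

In this toy table the per-scenario mean differences have opposite signs, which is the kind of pattern that a pooled comparison would partly cancel out.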

### Display Type Pairing, Split by Scenario

In [9]:
scenario_display_pairings = {
    scenario: statistics.build_pairing(scenario_subset, 'displayType')
    for (scenario, scenario_subset) in enumerate(scenario_pairing)
}

#### Easy Scenarios

In [10]:
scenario_display_pairings[0].describe()

Pairing against displayType:
  0: minimal vs. 1: full
  20 0 vs. 1 pairs.
  Paired t-test alternative hypotheses:
    Ha left-tailed (diff < 0): mean 0 - mean 1 < 0
    Ha two-tailed (|diff| > 0): mean 0 - mean 1 != 0
    Ha right-tailed (diff > 0): mean 0 - mean 1 > 0
  Wilcoxon signed-rank alternative hypotheses:
    Ha left-tailed (P(x > y) < 0.5)
    Ha two-tailed (P(x > y) != 0.5)
    Ha right-tailed (P(x > y) > 0.5)


In [11]:
statistics.test_tracing_outcomes(scenario_display_pairings[0])

inSpO2TargetRangeDuration:
  mean diff = -8.800; stdev diff = 35.504
  Paired t-test:
    |diff| > 0: p = 0.293
   ~diff < 0: p = 0.147
    diff > 0: p = 0.853
  Wilcoxon signed-rank test:
   ~P(x > y) != 0.5: p = 0.151
   *P(x > y) < 0.5: p = 0.075
    P(x > y) > 0.5: p = 0.925
inSpO2LooseTargetRangeDuration:
  mean diff = -5.000; stdev diff = 47.017
  Paired t-test:
    |diff| > 0: p = 0.648
    diff < 0: p = 0.324
    diff > 0: p = 0.676
  Wilcoxon signed-rank test:
    P(x > y) != 0.5: p = 0.533
    P(x > y) < 0.5: p = 0.266
    P(x > y) > 0.5: p = 0.734


Observations:

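* The Wilcoxon signed-rank and paired t-tests lead to the same conclusions.
* No outcome shows a difference that is significant at the 0.05 level. The closest is a marginal trend toward more time in the strict SpO2 target range with the full display (left-tailed Wilcoxon p = 0.075). Thus, we can't conclude that the display improves performance during easy scenarios.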

#### Hard Scenarios

In [12]:
scenario_display_pairings[1].describe()

Pairing against displayType:
  0: minimal vs. 1: full
  20 0 vs. 1 pairs.
  Paired t-test alternative hypotheses:
    Ha left-tailed (diff < 0): mean 0 - mean 1 < 0
    Ha two-tailed (|diff| > 0): mean 0 - mean 1 != 0
    Ha right-tailed (diff > 0): mean 0 - mean 1 > 0
  Wilcoxon signed-rank alternative hypotheses:
    Ha left-tailed (P(x > y) < 0.5)
    Ha two-tailed (P(x > y) != 0.5)
    Ha right-tailed (P(x > y) > 0.5)


In [13]:
statistics.test_tracing_outcomes(scenario_display_pairings[1])

inSpO2TargetRangeDuration:
  mean diff = 14.000; stdev diff = 27.691
  Paired t-test:
  **|diff| > 0: p = 0.040
    diff < 0: p = 0.980
  **diff > 0: p = 0.020
  Wilcoxon signed-rank test:
  **P(x > y) != 0.5: p = 0.044
    P(x > y) < 0.5: p = 0.978
  **P(x > y) > 0.5: p = 0.022
inSpO2LooseTargetRangeDuration:
  mean diff = 11.400; stdev diff = 26.443
  Paired t-test:
   *|diff| > 0: p = 0.076
    diff < 0: p = 0.962
  **diff > 0: p = 0.038
  Wilcoxon signed-rank test:
   *P(x > y) != 0.5: p = 0.088
    P(x > y) < 0.5: p = 0.956
  **P(x > y) > 0.5: p = 0.044


Observations:

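* Participants spend less time in the strict SpO2 target range with the full display than with the minimal display (mean diff = 14.000 for minimal minus full; right-tailed p = 0.020 for the t-test and 0.022 for the Wilcoxon test).
* The same holds for the loose target range (mean diff = 11.400; right-tailed p = 0.038 and 0.044), although the two-tailed tests are only marginal there (p = 0.076 and 0.088).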
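
As a sanity check, the paired t statistic can be recovered from the reported mean and standard deviation of the differences and compared against the standard critical values of Student's t with 19 degrees of freedom. The notebook does not say whether the reported stdev uses divisor n or n - 1, so the sketch below treats that as an open assumption and evaluates both. The function `paired_t_statistic` is a hypothetical helper, not part of the notebook's module.

```python
import math


def paired_t_statistic(mean_diff, stdev_diff, n, ddof=1):
    """t statistic of a paired t-test from summary statistics.

    ddof=1 treats stdev_diff as the sample standard deviation (divisor n - 1);
    ddof=0 treats it as the population form (divisor n, NumPy's default).
    """
    # Standard error of the mean difference under either convention.
    standard_error = stdev_diff / math.sqrt(n - 1 + ddof)
    return mean_diff / standard_error


# Critical values of Student's t with 19 degrees of freedom at the 0.05 level.
T_CRIT_TWO_TAILED = 2.093
T_CRIT_ONE_TAILED = 1.729

for name, mean_diff, stdev_diff in [
    ("inSpO2TargetRangeDuration", 14.000, 27.691),
    ("inSpO2LooseTargetRangeDuration", 11.400, 26.443),
]:
    for ddof in (0, 1):
        t = paired_t_statistic(mean_diff, stdev_diff, n=20, ddof=ddof)
        print(f"{name} (ddof={ddof}): t = {t:.3f}, "
              f"two-tailed significant: {t > T_CRIT_TWO_TAILED}, "
              f"one-tailed significant: {t > T_CRIT_ONE_TAILED}")
```

Under either convention the strict-range statistic clears the two-tailed threshold, while the loose-range statistic clears only the one-tailed threshold, which matches the pattern of significance markers in the output.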

#### Summary

* The full display doesn't seem to improve outcomes on the easy scenario.
* The full display seems to make outcomes worse on the hard scenario: participants spend less time in the SpO2 target range than with the minimal display.

### Scenario Order Pairing, Split by Scenario

In [14]:
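scenario_order_pairings = {
    # Easy scenarios (key 0): scenario 1 (first) vs. scenario 4 (second).
    0: statistics.build_pairing(scenario_pairing[0], 'scenarioNumber', values=(1, 4), check_validity=False),
    # Hard scenarios (key 1): scenario 2 (first) vs. scenario 3 (second).
    1: statistics.build_pairing(scenario_pairing[1], 'scenarioNumber', values=(2, 3), check_validity=False),
}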

#### Easy Scenarios

In [15]:
scenario_order_pairings[0].describe()

Pairing against scenarioNumber:
  0: first vs. 1: second
  20 0 vs. 1 pairs.
  Paired t-test alternative hypotheses:
    Ha left-tailed (diff < 0): mean 0 - mean 1 < 0
    Ha two-tailed (|diff| > 0): mean 0 - mean 1 != 0
    Ha right-tailed (diff > 0): mean 0 - mean 1 > 0
  Wilcoxon signed-rank alternative hypotheses:
    Ha left-tailed (P(x > y) < 0.5)
    Ha two-tailed (P(x > y) != 0.5)
    Ha right-tailed (P(x > y) > 0.5)


In [16]:
statistics.test_tracing_outcomes(scenario_order_pairings[0])

inSpO2TargetRangeDuration:
  mean diff = -19.200; stdev diff = 31.135
  Paired t-test:
  **|diff| > 0: p = 0.015
  **diff < 0: p = 0.007
    diff > 0: p = 0.993
  Wilcoxon signed-rank test:
  **P(x > y) != 0.5: p = 0.021
  **P(x > y) < 0.5: p = 0.010
    P(x > y) > 0.5: p = 0.990
inSpO2LooseTargetRangeDuration:
  mean diff = -21.600; stdev diff = 42.060
  Paired t-test:
  **|diff| > 0: p = 0.037
  **diff < 0: p = 0.019
    diff > 0: p = 0.981
  Wilcoxon signed-rank test:
  **P(x > y) != 0.5: p = 0.015
  **P(x > y) < 0.5: p = 0.007
    P(x > y) > 0.5: p = 0.993


Observations:

* Participants in scenario 1 take longer (compared to scenario 4) to place the sensor, start PPV, and reach the SpO2 target range.
* Participants in scenario 4 spend more time (compared to scenario 1) in the SpO2 target range.
* Participants in scenario 1 accumulate higher absolute and squared errors (compared to scenario 4).
* These results all point to a learning effect between scenarios 1 and 4.
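
The sign convention behind the learning-effect reading is first attempt minus second attempt, so a negative difference in a duration outcome means more time in range on the second attempt. The sketch below shows that bookkeeping on made-up values in plain Python. The `durations` mapping is hypothetical and does not reflect the study data or the internals of `statistics.build_pairing`.

```python
# Made-up per-subject durations (illustrative only):
# subject -> {scenarioNumber: time in the SpO2 target range}.
durations = {
    1: {1: 40.0, 4: 65.0},
    2: {1: 55.0, 4: 50.0},
    3: {1: 30.0, 4: 60.0},
}

# First attempt (scenario 1) minus second attempt (scenario 4), per subject.
diffs = [by_scenario[1] - by_scenario[4] for by_scenario in durations.values()]
mean_diff = sum(diffs) / len(diffs)

# A negative difference means the subject did better on the second attempt.
improved = sum(d < 0 for d in diffs)

print(f"mean diff = {mean_diff:.3f}")
print(f"{improved} of {len(diffs)} subjects spent longer in range on the second attempt")
```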

#### Hard Scenarios

In [17]:
scenario_order_pairings[1].describe()

Pairing against scenarioNumber:
  0: first vs. 1: second
  20 0 vs. 1 pairs.
  Paired t-test alternative hypotheses:
    Ha left-tailed (diff < 0): mean 0 - mean 1 < 0
    Ha two-tailed (|diff| > 0): mean 0 - mean 1 != 0
    Ha right-tailed (diff > 0): mean 0 - mean 1 > 0
  Wilcoxon signed-rank alternative hypotheses:
    Ha left-tailed (P(x > y) < 0.5)
    Ha two-tailed (P(x > y) != 0.5)
    Ha right-tailed (P(x > y) > 0.5)


In [18]:
statistics.test_tracing_outcomes(scenario_order_pairings[1])

inSpO2TargetRangeDuration:
  mean diff = 7.800; stdev diff = 30.033
  Paired t-test:
    |diff| > 0: p = 0.272
    diff < 0: p = 0.864
   ~diff > 0: p = 0.136
  Wilcoxon signed-rank test:
    P(x > y) != 0.5: p = 0.293
    P(x > y) < 0.5: p = 0.853
   ~P(x > y) > 0.5: p = 0.147
inSpO2LooseTargetRangeDuration:
  mean diff = 4.600; stdev diff = 28.426
  Paired t-test:
    |diff| > 0: p = 0.489
    diff < 0: p = 0.755
    diff > 0: p = 0.245
  Wilcoxon signed-rank test:
    P(x > y) != 0.5: p = 0.453
    P(x > y) < 0.5: p = 0.773
    P(x > y) > 0.5: p = 0.227


Observations:

* Participants in scenario 2 take longer (compared to scenario 3) to place the sensor and start CC. There is no significant difference for starting PPV or entering the SpO2 target range.
* There are no significant differences for time in SpO2 and FiO2 target ranges between scenarios 2 and 3.
* There are no significant differences for accumulated errors between scenarios 2 and 3. One caveat is that the accumulated squared error might be higher in scenario 2 than in scenario 3.
* These results do not suggest a learning effect between scenarios 2 and 3.

#### Summary

* There seems to be a learning effect between scenarios 1 and 4, but not between scenarios 2 and 3.