# Imports

In [1]:
%matplotlib inline

import scipy as sp
from scipy import stats
import numpy as np
import pandas as pd

import statistics

# Load Data

In [2]:
tracing_df = statistics.load_tracing_features()
recording_df = statistics.load_recording_relations()
recording_df = statistics.associate_recordings(recording_df, tracing_df)
gaze_dfs = statistics.load_gaze_features()
statistics.check_gaze_recording_associations(recording_df, gaze_dfs)
statistics.compute_gaze_features(gaze_dfs)
gaze_df = statistics.combine_gaze_features(gaze_dfs)
full_df = statistics.combine_all_features(recording_df, gaze_df)
full_df.columns

Index(['id', 'subjectNumber', 'scenarioNumber', 'newAfterOld', 'scenarioType',
       'displayType', 'sensorPlacementTime', 'ppvStartTime', 'ccStartTime',
       'inSpO2TargetRangeDuration', 'inSpO2LooseTargetRangeDuration',
       'inSpO2TargetRangeStartTime', 'aboveSpO2TargetRangeDuration',
       'belowSpO2TargetRangeDuration', 'inFiO2TargetRangeDuration',
       'inFiO2TargetRangeStartTime', 'aboveFiO2TargetRangeDuration',
       'belowFiO2TargetRangeDuration', 'spO2SignedErrorIntegral',
       'spO2UnsignedErrorIntegral', 'spO2SquaredErrorIntegral',
       'fiO2LargeAdjustments', 'visitDuration_fiO2Dial',
       'visitDuration_infant', 'visitDuration_monitorApgarTimer',
       'visitDuration_monitorBlank', 'visitDuration_monitorFiO2',
       'visitDuration_monitorFull', 'visitDuration_monitorGraph',
       'visitDuration_monitorHeartRate', 'visitDuration_monitorSpO2',
       'visitDuration_spO2ReferenceTable',
       'visitDuration_warmerInstrumentPanel', 'visitDuration_combinedFi

# Stratification
For each scenario type, we rank subjects by their total unsigned SpO2 error, summed over both of their scenarios of that type, and take the 5 lowest and 5 highest as strata.

In [3]:
stratification_size = 5
scenario_split = statistics.build_split(full_df, 'scenarioType')
errors = [
    df[['subjectNumber', 'spO2UnsignedErrorIntegral']]
    .groupby('subjectNumber')
    .sum()
    .sort_values(by=['spO2UnsignedErrorIntegral'])
    for df in scenario_split.dfs
]
tails = [
    [
        df.index[:stratification_size].values,
        df.index[-stratification_size:].values
    ]
    for df in errors
]

masks = [
    [df.subjectNumber.isin(lowest_tail), df.subjectNumber.isin(highest_tail)]
    for (df, (lowest_tail, highest_tail)) in zip(scenario_split.dfs, tails)
]
print('Best subjects, easy:', tails[0][0])
print('Worst subjects, easy:', tails[0][1])
print('Best subjects, hard:', tails[1][0])
print('Worst subjects, hard:', tails[1][1])

Best subjects, easy: [ 9  8  7 12 10]
Worst subjects, easy: [17 13  4 19 20]
Best subjects, hard: [16  6  9  7 18]
Worst subjects, hard: [11 17 13  8 15]
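The tail selection above depends on the project's local `statistics` module, so it cannot be run in isolation. The following is a minimal sketch of the same ranking logic on made-up data, in plain Python rather than pandas. The subject numbers, error values, and the helper name `error_tails` are hypothetical.

```python
from collections import defaultdict

# Hypothetical (subjectNumber, spO2UnsignedErrorIntegral) rows: two scenarios per subject.
rows = [
    (1, 40.0), (1, 35.0),
    (2, 10.0), (2, 12.0),
    (3, 90.0), (3, 80.0),
    (4, 20.0), (4, 25.0),
    (5, 60.0), (5, 55.0),
    (6, 5.0), (6, 8.0),
]


def error_tails(rows, stratification_size):
    """Return (lowest, highest) subject lists ranked by total error."""
    totals = defaultdict(float)
    for subject, error in rows:
        totals[subject] += error  # same role as groupby('subjectNumber').sum()
    ranked = sorted(totals, key=totals.get)  # ascending total error
    return ranked[:stratification_size], ranked[-stratification_size:]


lowest, highest = error_tails(rows, stratification_size=2)
print('Best subjects:', lowest)
print('Worst subjects:', highest)
```

Note that the two tails overlap whenever there are fewer than `2 * stratification_size` subjects, so the stratum size should be chosen with the number of subjects in mind.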


### Accumulated Absolute Error Stratification, Hard

In [4]:
error_stratification = statistics.build_stratification(scenario_split.dfs[1], 'absoluteError', ['lowest', 'highest'], masks[1])
error_stratification.describe(tests=False)

Pairing against absoluteError:
  0: lowest vs. 1: highest
  10 0 vs. 1 pairs.


In [5]:
error_display_pairings = [
    statistics.build_pairing(df, 'displayType')
    for df in error_stratification.dfs
]

#### Lowest Errors

In [6]:
error_display_pairings[0].describe()

Pairing against displayType:
  0: minimal vs. 1: full
  5 0 vs. 1 pairs.
  Paired t-test alternative hypotheses:
    Ha left-tailed (diff < 0): mean 0 - mean 1 < 0
    Ha two-tailed (|diff| > 0): mean 0 - mean 1 != 0
    Ha right-tailed (diff > 0): mean 0 - mean 1 > 0
  Wilcoxon signed-rank alternative hypotheses:
    Ha left-tailed (P(x > y) < 0.5)
    Ha two-tailed (P(x > y) != 0.5)
    Ha right-tailed (P(x > y) > 0.5)
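The sign convention matters when reading the results that follow: differences are computed as group 0 minus group 1, that is, `minimal` minus `full`. A minimal sketch with made-up timing values illustrates this. The helper name `paired_differences` and the data are hypothetical and not part of the project's `statistics` module.

```python
import math

# Hypothetical per-subject outcome (e.g. a start time in seconds) under each display.
minimal = {7: 30.0, 9: 42.0, 12: 25.0}
full = {7: 36.0, 9: 40.0, 12: 35.0}


def paired_differences(group0, group1):
    """Per-subject differences group0 - group1, over subjects present in both."""
    subjects = sorted(set(group0) & set(group1))
    return [group0[s] - group1[s] for s in subjects]


diffs = paired_differences(minimal, full)
mean_diff = sum(diffs) / len(diffs)
# Population standard deviation (divides by n); whether the notebook's
# "stdev diff" divides by n or n - 1 is an assumption here.
stdev_diff = math.sqrt(sum((d - mean_diff) ** 2 for d in diffs) / len(diffs))

print(diffs)
print(f'mean diff = {mean_diff:.3f}; stdev diff = {stdev_diff:.3f}')
```

A negative mean difference on a timing outcome means the event happened later with the `full` display; on a duration outcome it means more time was spent in that state with the `full` display.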


In [7]:
statistics.test_tracing_outcomes(error_display_pairings[0])

sensorPlacementTime:
  mean diff = -2.200; stdev diff = 13.746
  Paired t-test:
    |diff| > 0: p = 0.765
    diff < 0: p = 0.382
    diff > 0: p = 0.618
  Wilcoxon signed-rank test:
    P(x > y) != 0.5: p = 0.893
    P(x > y) < 0.5: p = 0.446
    P(x > y) > 0.5: p = 0.554
ppvStartTime:
  mean diff = 5.600; stdev diff = 5.499
  Paired t-test:
   ~|diff| > 0: p = 0.111
    diff < 0: p = 0.944
   *diff > 0: p = 0.056
  Wilcoxon signed-rank test:
    P(x > y) != 0.5: p = 0.221
    P(x > y) < 0.5: p = 0.890
   ~P(x > y) > 0.5: p = 0.110
ccStartTime:
  mean diff = 6.400; stdev diff = 15.895
  Paired t-test:
    |diff| > 0: p = 0.466
    diff < 0: p = 0.767
    diff > 0: p = 0.233
  Wilcoxon signed-rank test:
    P(x > y) != 0.5: p = 0.345
    P(x > y) < 0.5: p = 0.827
   ~P(x > y) > 0.5: p = 0.173
inSpO2TargetRangeStartTime:
  mean diff = 35.600; stdev diff = 44.729
  Paired t-test:
   ~|diff| > 0: p = 0.187
    diff < 0: p = 0.907
   *diff > 0: p = 0.093
  Wilcoxon signed-rank test:
   ~P(



Observations:

* The Wilcoxon signed-rank and paired t-tests generally agree, though the t-tests tend to yield smaller p-values. We disregard the Wilcoxon results because the sample size (5 pairs per group) is too small.
* With the new display, participants spend significantly more time in the FiO2 target range and achieve significantly lower signed and squared SpO2 error integrals.
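The paired t-test p-values can be recovered from the printed summary statistics, which is a useful check on how to read them. The sketch below is not the project's implementation. It assumes 5 pairs (4 degrees of freedom, for which the Student t CDF has a closed form) and assumes the printed `stdev diff` is the population standard deviation (dividing by n). Under those assumptions it reproduces the `sensorPlacementTime` p-values printed above.

```python
import math


def t_cdf_df4(t):
    """CDF of Student's t distribution with exactly 4 degrees of freedom."""
    u = 1.0 + t * t / 4.0
    return 0.5 + 0.375 * (t / math.sqrt(u)) * (1.0 - (t * t / u) / 12.0)


def paired_t_pvalues(mean_diff, pop_stdev_diff, n=5):
    """Left-, two- and right-tailed p-values for a paired t-test with n = 5."""
    if n != 5:
        raise ValueError('t_cdf_df4 only covers n = 5 (4 degrees of freedom)')
    # With a population stdev (ddof=0), the standard error is stdev / sqrt(n - 1).
    t = mean_diff / (pop_stdev_diff / math.sqrt(n - 1))
    left = t_cdf_df4(t)           # Ha: diff < 0
    right = 1.0 - left            # Ha: diff > 0
    two = 2.0 * min(left, right)  # Ha: |diff| > 0
    return left, two, right


# sensorPlacementTime, lowest-error group, hard scenario (values from the output above).
left, two, right = paired_t_pvalues(-2.200, 13.746)
print(f'|diff| > 0: p = {two:.3f}')
print(f'diff < 0: p = {left:.3f}')
print(f'diff > 0: p = {right:.3f}')
```

The two one-tailed p-values always sum to 1, and the two-tailed value is twice the smaller of them, which is why a one-tailed result can be marked significant while the two-tailed one is not.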

#### Highest Errors

In [8]:
error_display_pairings[1].describe()

Pairing against displayType:
  0: minimal vs. 1: full
  5 0 vs. 1 pairs.
  Paired t-test alternative hypotheses:
    Ha left-tailed (diff < 0): mean 0 - mean 1 < 0
    Ha two-tailed (|diff| > 0): mean 0 - mean 1 != 0
    Ha right-tailed (diff > 0): mean 0 - mean 1 > 0
  Wilcoxon signed-rank alternative hypotheses:
    Ha left-tailed (P(x > y) < 0.5)
    Ha two-tailed (P(x > y) != 0.5)
    Ha right-tailed (P(x > y) > 0.5)


In [9]:
statistics.test_tracing_outcomes(error_display_pairings[1])

sensorPlacementTime:
  mean diff = -10.200; stdev diff = 34.114
  Paired t-test:
    |diff| > 0: p = 0.582
    diff < 0: p = 0.291
    diff > 0: p = 0.709
  Wilcoxon signed-rank test:
    P(x > y) != 0.5: p = 0.686
    P(x > y) < 0.5: p = 0.343
    P(x > y) > 0.5: p = 0.657
ppvStartTime:
  mean diff = 8.400; stdev diff = 7.552
  Paired t-test:
   *|diff| > 0: p = 0.090
    diff < 0: p = 0.955
  **diff > 0: p = 0.045
  Wilcoxon signed-rank test:
   *P(x > y) != 0.5: p = 0.078
    P(x > y) < 0.5: p = 0.961
  **P(x > y) > 0.5: p = 0.039
ccStartTime:
  mean diff = -11.200; stdev diff = 38.426
  Paired t-test:
    |diff| > 0: p = 0.591
    diff < 0: p = 0.296
    diff > 0: p = 0.704
  Wilcoxon signed-rank test:
    P(x > y) != 0.5: p = 0.593
    P(x > y) < 0.5: p = 0.296
    P(x > y) > 0.5: p = 0.704
inSpO2TargetRangeStartTime:
  mean diff = -15.200; stdev diff = 42.494
  Paired t-test:
    |diff| > 0: p = 0.514
    diff < 0: p = 0.257
    diff > 0: p = 0.743
  Wilcoxon signed-rank test:
  




    diff > 0: p = 0.296
  Wilcoxon signed-rank test:
    P(x > y) != 0.5: p = 0.593
    P(x > y) < 0.5: p = 0.704
    P(x > y) > 0.5: p = 0.296
aboveSpO2TargetRangeDuration:
  mean diff = 0.000; stdev diff = 0.000
    Skipped t-test.
    Skipped Wilcoxon test.
belowSpO2TargetRangeDuration:
  mean diff = -6.400; stdev diff = 29.486
  Paired t-test:
    |diff| > 0: p = 0.687
    diff < 0: p = 0.343
    diff > 0: p = 0.657
  Wilcoxon signed-rank test:
    P(x > y) != 0.5: p = 0.785
    P(x > y) < 0.5: p = 0.393
    P(x > y) > 0.5: p = 0.607
inFiO2TargetRangeStartTime:
  mean diff = -29.600; stdev diff = 55.647
  Paired t-test:
    |diff| > 0: p = 0.347
   ~diff < 0: p = 0.174
    diff > 0: p = 0.826
  Wilcoxon signed-rank test:
    P(x > y) != 0.5: p = 0.285
   ~P(x > y) < 0.5: p = 0.143
    P(x > y) > 0.5: p = 0.857
inFiO2TargetRangeDuration:
  mean diff = 22.800; stdev diff = 43.462
  Paired t-test:
    |diff| > 0: p = 0.353
    diff < 0: p = 0.823
   ~diff > 0: p = 0.177
  Wilcoxon si



Observations:

* As before, we ignore the Wilcoxon signed-rank tests.
* With the new display, participants start PPV significantly earlier (one-tailed paired t-test, p = 0.045), but don't achieve significantly better outcomes.
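One way to see why 5 pairs is too few for the Wilcoxon signed-rank test: under the null hypothesis every assignment of signs to the ranks 1..n is equally likely, so with n = 5 there are only 2^5 = 32 outcomes, and the smallest exact one-sided p-value is 1/32. The sketch below enumerates them. It assumes no zero differences and no tied absolute differences, and the function name and data are made up for illustration. It computes an exact p-value, which is not necessarily how the values printed above were obtained.

```python
from itertools import product


def exact_wilcoxon_right_p(diffs):
    """Exact one-sided p-value P(W+ >= observed) for untied, nonzero diffs."""
    n = len(diffs)
    # Rank the absolute differences from 1 (smallest) to n (largest).
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0] * n
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    observed = sum(r for r, d in zip(ranks, diffs) if d > 0)
    # Enumerate all 2^n equally likely sign assignments under H0.
    at_least = 0
    for signs in product((False, True), repeat=n):
        w_plus = sum(r for r, positive in zip(ranks, signs) if positive)
        if w_plus >= observed:
            at_least += 1
    return at_least / 2 ** n


# Hypothetical data: the most extreme case, all 5 differences positive.
p_right = exact_wilcoxon_right_p([3.0, 8.0, 1.5, 12.0, 6.0])
print(p_right, 2 * p_right)
```

Even when all five differences share a sign, the exact two-sided p-value is 2/32 = 0.0625, so it cannot fall below 0.05 with this sample size.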

#### Summary

* The 5 participants with the lowest accumulated SpO2 errors on the hard scenario were significantly helped by the display, but the 5 participants with the highest accumulated SpO2 errors on the hard scenario weren't.

### Accumulated Absolute Error Stratification, Easy

In [10]:
error_stratification = statistics.build_stratification(scenario_split.dfs[0], 'absoluteError', ['lowest', 'highest'], masks[0])
error_stratification.describe(tests=False)

Pairing against absoluteError:
  0: lowest vs. 1: highest
  10 0 vs. 1 pairs.


In [11]:
error_display_pairings = [
    statistics.build_pairing(df, 'displayType')
    for df in error_stratification.dfs
]

#### Lowest Errors

In [12]:
error_display_pairings[0].describe()

Pairing against displayType:
  0: minimal vs. 1: full
  5 0 vs. 1 pairs.
  Paired t-test alternative hypotheses:
    Ha left-tailed (diff < 0): mean 0 - mean 1 < 0
    Ha two-tailed (|diff| > 0): mean 0 - mean 1 != 0
    Ha right-tailed (diff > 0): mean 0 - mean 1 > 0
  Wilcoxon signed-rank alternative hypotheses:
    Ha left-tailed (P(x > y) < 0.5)
    Ha two-tailed (P(x > y) != 0.5)
    Ha right-tailed (P(x > y) > 0.5)


In [13]:
statistics.test_tracing_outcomes(error_display_pairings[0])

sensorPlacementTime:
  mean diff = -19.200; stdev diff = 13.452
  Paired t-test:
  **|diff| > 0: p = 0.046
  **diff < 0: p = 0.023
    diff > 0: p = 0.977
  Wilcoxon signed-rank test:
   *P(x > y) != 0.5: p = 0.080
  **P(x > y) < 0.5: p = 0.040
    P(x > y) > 0.5: p = 0.960
ppvStartTime:
  mean diff = -19.400; stdev diff = 19.704
  Paired t-test:
   ~|diff| > 0: p = 0.120
   *diff < 0: p = 0.060
    diff > 0: p = 0.940
  Wilcoxon signed-rank test:
   ~P(x > y) != 0.5: p = 0.138
   *P(x > y) < 0.5: p = 0.069
    P(x > y) > 0.5: p = 0.931
ccStartTime:
  mean diff = 0.000; stdev diff = 0.000
    Skipped t-test.
    Skipped Wilcoxon test.
inSpO2TargetRangeStartTime:
  mean diff = -30.000; stdev diff = 20.239
  Paired t-test:
  **|diff| > 0: p = 0.041
  **diff < 0: p = 0.021
    diff > 0: p = 0.979
  Wilcoxon signed-rank test:
  **P(x > y) != 0.5: p = 0.043
  **P(x > y) < 0.5: p = 0.022
    P(x > y) > 0.5: p = 0.978
inSpO2TargetRangeDuration:
  mean diff = 25.200; stdev diff = 37.621
  Pair



Observations:

* As before, we ignore the Wilcoxon tests.
* With the new display, participants place the sensor and enter the SpO2 and FiO2 target ranges significantly later; no other differences are significant.
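The `~`, `*` and `**` prefixes in the test output are not defined in this notebook. Judging only from the p-values printed next to them, they appear to flag p < 0.2, p < 0.1 and p < 0.05 respectively. A minimal sketch under that inferred assumption follows. The thresholds and the function name are guesses, not taken from the project's `statistics` module.

```python
def significance_marker(p):
    """Marker for a p-value, using thresholds inferred from the printed output."""
    if p < 0.05:
        return '**'
    if p < 0.1:
        return '*'
    if p < 0.2:
        return '~'
    return ''


# p-values copied from the sensorPlacementTime and ppvStartTime output above.
for label, p in [('|diff| > 0', 0.046), ('diff < 0', 0.060),
                 ('|diff| > 0', 0.120), ('diff > 0', 0.977)]:
    print(f'{significance_marker(p):>4}{label}: p = {p:.3f}')
```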

#### Highest Errors

In [14]:
error_display_pairings[1].describe()

Pairing against displayType:
  0: minimal vs. 1: full
  5 0 vs. 1 pairs.
  Paired t-test alternative hypotheses:
    Ha left-tailed (diff < 0): mean 0 - mean 1 < 0
    Ha two-tailed (|diff| > 0): mean 0 - mean 1 != 0
    Ha right-tailed (diff > 0): mean 0 - mean 1 > 0
  Wilcoxon signed-rank alternative hypotheses:
    Ha left-tailed (P(x > y) < 0.5)
    Ha two-tailed (P(x > y) != 0.5)
    Ha right-tailed (P(x > y) > 0.5)


In [15]:
statistics.test_tracing_outcomes(error_display_pairings[1])

sensorPlacementTime:
  mean diff = -12.000; stdev diff = 16.432
  Paired t-test:
    |diff| > 0: p = 0.218
   ~diff < 0: p = 0.109
    diff > 0: p = 0.891
  Wilcoxon signed-rank test:
    P(x > y) != 0.5: p = 0.225
   ~P(x > y) < 0.5: p = 0.112
    P(x > y) > 0.5: p = 0.888
ppvStartTime:
  mean diff = -2.000; stdev diff = 9.695
  Paired t-test:
    |diff| > 0: p = 0.701
    diff < 0: p = 0.351
    diff > 0: p = 0.649
  Wilcoxon signed-rank test:
    P(x > y) != 0.5: p = 0.588
    P(x > y) < 0.5: p = 0.294
    P(x > y) > 0.5: p = 0.706
ccStartTime:
  mean diff = 0.000; stdev diff = 0.000
    Skipped t-test.
    Skipped Wilcoxon test.
inSpO2TargetRangeStartTime:
  mean diff = 52.800; stdev diff = 79.404
  Paired t-test:
    |diff| > 0: p = 0.254
    diff < 0: p = 0.873
   ~diff > 0: p = 0.127
  Wilcoxon signed-rank test:
    P(x > y) != 0.5: p = 0.273
    P(x > y) < 0.5: p = 0.863
   ~P(x > y) > 0.5: p = 0.137
inSpO2TargetRangeDuration:
  mean diff = -18.800; stdev diff = 17.781
  Paired t-test:
   ~|diff| > 0: p = 0.102
   *diff < 0: p = 0.051
    diff > 0: p = 0.949
  Wilcoxon signed-rank test:
   *P(x > y) != 0.5: p = 0.068
  **P(x > y) < 0.5: p = 0.034
    P(x > y) > 0.5: p = 0.966
inSpO2LooseTargetRangeDuration:
  mean diff = -50.400; stdev diff = 44.531
  Paired t-test:
   *|diff| > 0: p = 0.086
  **diff < 0: p = 0.043
    diff > 0: p = 0.957
  Wilcoxon signed-rank test:
   *P(x > y) != 0.5: p = 0.068
  **P(x > y) < 0.5: p = 0.034
    P(x > y) > 0.5: p = 0.966
aboveSpO2TargetRangeDuration:
  mean diff = -32.400; stdev diff = 60.085
  Paired t-test:
    |diff| > 0: p = 0.342
   ~diff < 0: p = 0.171
    diff > 0: p = 0.829
  Wilcoxon signed-rank test:
    P(x > y) != 0.5: p = 0.465
    P(x > y) < 0.5: p = 0.233
    P(x > y) > 0.5: p = 0.767
belowSpO2TargetRangeDuration:
  mean diff = 53.200; stdev diff = 56.379
  Paired t-test:
   ~|diff| > 0: p = 0.132
    diff < 0: p = 0.934
   *diff > 0: p = 0.066
  Wilcoxon signed-rank test:
   *P(x > y) != 0.5: p = 0.

Observations:

* As before, we ignore the Wilcoxon signed-rank tests.
* With the new display, participants spend significantly more time in the loose SpO2 target range, and their total absolute SpO2 error is lower, though it falls just short of significance.

#### Summary

* On the easy scenario, the 5 participants with the lowest accumulated SpO2 errors weren't significantly helped by the display (they placed the sensor and entered the target ranges significantly later), but the 5 participants with the highest accumulated SpO2 errors were helped: they spent significantly more time in the loose SpO2 target range.