# Imports

In [1]:
%matplotlib inline

import scipy as sp
from scipy import stats
import numpy as np
import pandas as pd

# Project-local module; note that its name shadows Python's standard-library `statistics`.
import statistics

# Load Data

In [2]:
tracing_df = statistics.load_tracing_features()
recording_df = statistics.load_recording_relations()
recording_df = statistics.associate_recordings(recording_df, tracing_df)
gaze_dfs = statistics.load_gaze_features()
statistics.check_gaze_recording_associations(recording_df, gaze_dfs)
statistics.compute_gaze_features(gaze_dfs)
gaze_df = statistics.combine_gaze_features(gaze_dfs)
full_df = statistics.combine_all_features(recording_df, gaze_df)
full_df.columns

Index(['id', 'subjectNumber', 'scenarioNumber', 'newAfterOld', 'scenarioType',
       'displayType', 'sensorPlacementTime', 'ppvStartTime', 'ccStartTime',
       'inSpO2TargetRangeDuration', 'inSpO2LooseTargetRangeDuration',
       'inSpO2TargetRangeStartTime', 'aboveSpO2TargetRangeDuration',
       'belowSpO2TargetRangeDuration', 'inFiO2TargetRangeDuration',
       'inFiO2TargetRangeStartTime', 'aboveFiO2TargetRangeDuration',
       'belowFiO2TargetRangeDuration', 'spO2SignedErrorIntegral',
       'spO2UnsignedErrorIntegral', 'spO2SquaredErrorIntegral',
       'fiO2LargeAdjustments', 'visitDuration_fiO2Dial',
       'visitDuration_infant', 'visitDuration_monitorApgarTimer',
       'visitDuration_monitorBlank', 'visitDuration_monitorFiO2',
       'visitDuration_monitorFull', 'visitDuration_monitorGraph',
       'visitDuration_monitorHeartRate', 'visitDuration_monitorSpO2',
       'visitDuration_spO2ReferenceTable',
       'visitDuration_warmerInstrumentPanel', 'visitDuration_combinedFi

# Stratification

### Accumulated Absolute Error Stratification, Hard
We stratify participants by their total unsigned SpO2 error integral, summed across both hard scenarios (`scenarioType == 1`).

In [3]:
hard_df = full_df[full_df.scenarioType == 1]
hard_errors = (
    hard_df[['subjectNumber', 'spO2UnsignedErrorIntegral']]
    .groupby('subjectNumber')
    .sum()
    .sort_values(by=['spO2UnsignedErrorIntegral'])
)
hard_errors

subjectNumber,spO2UnsignedErrorIntegral
16,41.74
6,45.95
9,50.816667
7,54.253333
18,58.126667
12,62.636667
10,62.716667
19,64.206667
20,65.553333
11,71.296667


In [4]:
stratification_size = 5
lowest = hard_errors.index[:stratification_size].values
highest = hard_errors.index[-stratification_size:].values
masks = [
    hard_df.subjectNumber.isin(lowest),
    hard_df.subjectNumber.isin(highest)
]
error_stratification = statistics.build_stratification(hard_df, 'absoluteError', ['lowest', 'highest'], masks)
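
The selection step above can be sketched on synthetic data. This is a minimal illustration, assuming only pandas; the values and the names `demo_df`, `totals`, and `k` are made up for the example and are not part of the project's `statistics` module. It also adds a guard, not present in the original cell, that the two strata cannot overlap.

```python
import pandas as pd

# Synthetic data: two hard-scenario recordings per participant (values are made up).
demo_df = pd.DataFrame({
    'subjectNumber': [1, 1, 2, 2, 3, 3, 4, 4],
    'spO2UnsignedErrorIntegral': [20.0, 25.0, 40.0, 35.0, 10.0, 12.0, 30.0, 31.0],
})

# Total unsigned error per participant, sorted from lowest to highest.
totals = (
    demo_df.groupby('subjectNumber')['spO2UnsignedErrorIntegral']
    .sum()
    .sort_values()
)

k = 2  # hypothetical stratum size
# Guard: with fewer than 2 * k participants the two strata would share members.
assert 2 * k <= len(totals), 'strata would overlap'

lowest = totals.index[:k].to_numpy()
highest = totals.index[-k:].to_numpy()

# Boolean masks select every recording belonging to each stratum.
lowest_df = demo_df[demo_df.subjectNumber.isin(lowest)]
highest_df = demo_df[demo_df.subjectNumber.isin(highest)]
```

Slicing the sorted index from both ends keeps the strata tied to the per-participant totals rather than to individual recordings, so both recordings of a participant always land in the same stratum.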

In [5]:
error_stratification.describe()

Pairing against absoluteError:
  0: lowest vs. 1: highest
  10 0 vs. 1 pairs.
  Paired t-test alternative hypotheses:
    Ha left-tailed (diff < 0): mean 0 - mean 1 < 0
    Ha two-tailed (|diff| > 0): mean 0 - mean 1 != 0
    Ha right-tailed (diff > 0): mean 0 - mean 1 > 0
  Wilcoxon signed-rank alternative hypotheses:
    Ha left-tailed (P(x > y) < 0.5)
    Ha two-tailed (P(x > y) != 0.5)
    Ha right-tailed (P(x > y) > 0.5)
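
The three alternatives listed above map onto the `alternative` argument of SciPy's paired tests. The following is a rough sketch on made-up numbers, assuming a SciPy version in which `scipy.stats.ttest_rel` and `scipy.stats.wilcoxon` accept `alternative`; it is not the implementation behind `statistics.build_stratification` or `describe()`, which is not shown in this notebook.

```python
import numpy as np
from scipy import stats

# Made-up paired measurements for 5 participants (condition 0 vs. condition 1).
x0 = np.array([30.0, 28.0, 35.0, 40.0, 25.0])
x1 = np.array([24.0, 26.0, 30.0, 31.0, 28.0])

# Paired t-test on diff = x0 - x1 under each alternative hypothesis.
t_two = stats.ttest_rel(x0, x1, alternative='two-sided').pvalue
t_left = stats.ttest_rel(x0, x1, alternative='less').pvalue      # mean 0 - mean 1 < 0
t_right = stats.ttest_rel(x0, x1, alternative='greater').pvalue  # mean 0 - mean 1 > 0

# Wilcoxon signed-rank test on the same pairs.
w_two = stats.wilcoxon(x0, x1, alternative='two-sided').pvalue
w_left = stats.wilcoxon(x0, x1, alternative='less').pvalue
w_right = stats.wilcoxon(x0, x1, alternative='greater').pvalue

print(f't-test:   two={t_two:.3f} left={t_left:.3f} right={t_right:.3f}')
print(f'Wilcoxon: two={w_two:.3f} left={w_left:.3f} right={w_right:.3f}')
```

For the t-test the two one-sided p-values sum to 1 and the two-sided p-value is twice the smaller one, which is why the outputs below always show one tail near `p` and the other near `1 - p`.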


In [6]:
error_stratification.dfs[0][['subjectNumber', 'scenarioNumber', 'displayType', 'spO2UnsignedErrorIntegral']]

recording,subjectNumber,scenarioNumber,displayType,spO2UnsignedErrorIntegral
Recording008,6,3,0,26.273333
Recording007,6,2,1,19.676667
Recording011,7,2,0,26.693333
Recording012,7,3,1,27.56
Recording020,9,3,0,27.616667
Recording019,9,2,1,23.2
Recording043,16,2,0,24.34
Recording044,16,3,1,17.4
Recording051,18,2,0,29.303333
Recording052,18,3,1,28.823333


In [7]:
error_stratification.dfs[1][['subjectNumber', 'scenarioNumber', 'displayType', 'spO2UnsignedErrorIntegral']]

recording,subjectNumber,scenarioNumber,displayType,spO2UnsignedErrorIntegral
Recording016,8,3,0,27.926667
Recording015,8,2,1,52.843333
Recording028,11,3,0,41.383333
Recording027,11,2,1,29.913333
Recording035,13,2,0,39.123333
Recording036,13,3,1,38.223333
Recording039,15,2,0,63.483333
Recording040,15,3,1,65.283333
Recording048,17,3,0,32.74
Recording047,17,2,1,41.713333


In [8]:
error_display_pairings = [
    statistics.build_pairing(df, 'displayType')
    for df in error_stratification.dfs
]

#### Lowest Errors

In [9]:
error_display_pairings[0].describe()

Pairing against displayType:
  0: minimal vs. 1: full
  5 0 vs. 1 pairs.
  Paired t-test alternative hypotheses:
    Ha left-tailed (diff < 0): mean 0 - mean 1 < 0
    Ha two-tailed (|diff| > 0): mean 0 - mean 1 != 0
    Ha right-tailed (diff > 0): mean 0 - mean 1 > 0
  Wilcoxon signed-rank alternative hypotheses:
    Ha left-tailed (P(x > y) < 0.5)
    Ha two-tailed (P(x > y) != 0.5)
    Ha right-tailed (P(x > y) > 0.5)


In [10]:
statistics.test_tracing_outcomes(error_display_pairings[0])

sensorPlacementTime:
  mean diff = -2.200; stdev diff = 13.746
  Paired t-test:
    |diff| > 0: p = 0.765
    diff < 0: p = 0.382
    diff > 0: p = 0.618
  Wilcoxon signed-rank test:
    P(x > y) != 0.5: p = 0.893
    P(x > y) < 0.5: p = 0.446
    P(x > y) > 0.5: p = 0.554
ppvStartTime:
  mean diff = 5.600; stdev diff = 5.499
  Paired t-test:
   ~|diff| > 0: p = 0.111
    diff < 0: p = 0.944
   *diff > 0: p = 0.056
  Wilcoxon signed-rank test:
    P(x > y) != 0.5: p = 0.221
    P(x > y) < 0.5: p = 0.890
   ~P(x > y) > 0.5: p = 0.110
ccStartTime:
  mean diff = 6.400; stdev diff = 15.895
  Paired t-test:
    |diff| > 0: p = 0.466
    diff < 0: p = 0.767
    diff > 0: p = 0.233
  Wilcoxon signed-rank test:
    P(x > y) != 0.5: p = 0.345
    P(x > y) < 0.5: p = 0.827
   ~P(x > y) > 0.5: p = 0.173
inSpO2TargetRangeStartTime:
  mean diff = 35.600; stdev diff = 44.729
  Paired t-test:
   ~|diff| > 0: p = 0.187
    diff < 0: p = 0.907
   *diff > 0: p = 0.093
  Wilcoxon signed-rank test:
   ~P(

  z = (T - mn - correction) / se


Observations:

* Wilcoxon signed-rank and paired t-tests generally agree in direction, though the t-tests tend to give smaller p-values. We ignore the Wilcoxon tests because the sample sizes (5 pairs) are too small.
* With the new display, participants are in the FiO2 target range for significantly longer and achieve significantly lower signed and squared SpO2 error integrals.
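
To make the sample-size concern concrete, the sketch below enumerates the exact null distribution of the Wilcoxon signed-rank statistic for a given number of pairs, assuming no ties and no zero differences. The function name `min_exact_wilcoxon_p` is made up for this illustration. It shows the smallest p-value an exact test could ever report, which is separate from the normal approximation that the truncated `z = (T - mn - correction) / se` warning above suggests was in use.

```python
from itertools import product

def min_exact_wilcoxon_p(n):
    """Smallest attainable exact p-values (one-sided, two-sided) for n pairs.

    Assumes no ties and no zero differences, so under the null hypothesis
    each of the 2**n sign assignments to ranks 1..n is equally likely.
    """
    ranks = range(1, n + 1)
    # Sum of positive ranks (W+) for every possible sign assignment.
    w_plus = [
        sum(r for r, positive in zip(ranks, signs) if positive)
        for signs in product([False, True], repeat=n)
    ]
    # The most extreme outcome is all differences positive (W+ at its maximum).
    p_one_sided = w_plus.count(max(w_plus)) / len(w_plus)
    p_two_sided = min(1.0, 2.0 * p_one_sided)
    return p_one_sided, p_two_sided

p_one, p_two = min_exact_wilcoxon_p(5)
print(p_one, p_two)  # 0.03125 0.0625
```

With 5 pairs, even a perfectly consistent effect cannot push the exact two-sided p-value below 0.0625, so the test has very little room to reach conventional significance thresholds.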

#### Highest Errors

In [14]:
error_display_pairings[1].describe()

Pairing against displayType:
  0: minimal vs. 1: full
  5 0 vs. 1 pairs.
  Paired t-test alternative hypotheses:
    Ha left-tailed (diff < 0): mean 0 - mean 1 < 0
    Ha two-tailed (|diff| > 0): mean 0 - mean 1 != 0
    Ha right-tailed (diff > 0): mean 0 - mean 1 > 0
  Wilcoxon signed-rank alternative hypotheses:
    Ha left-tailed (P(x > y) < 0.5)
    Ha two-tailed (P(x > y) != 0.5)
    Ha right-tailed (P(x > y) > 0.5)


In [12]:
statistics.test_tracing_outcomes(error_display_pairings[1])

sensorPlacementTime:
  mean diff = -10.200; stdev diff = 34.114
  Paired t-test:
    |diff| > 0: p = 0.582
    diff < 0: p = 0.291
    diff > 0: p = 0.709
  Wilcoxon signed-rank test:
    P(x > y) != 0.5: p = 0.686
    P(x > y) < 0.5: p = 0.343
    P(x > y) > 0.5: p = 0.657
ppvStartTime:
  mean diff = 8.400; stdev diff = 7.552
  Paired t-test:
   *|diff| > 0: p = 0.090
    diff < 0: p = 0.955
  **diff > 0: p = 0.045
  Wilcoxon signed-rank test:
   *P(x > y) != 0.5: p = 0.078
    P(x > y) < 0.5: p = 0.961
  **P(x > y) > 0.5: p = 0.039
ccStartTime:
  mean diff = -11.200; stdev diff = 38.426
  Paired t-test:
    |diff| > 0: p = 0.591
    diff < 0: p = 0.296
    diff > 0: p = 0.704
  Wilcoxon signed-rank test:
    P(x > y) != 0.5: p = 0.593
    P(x > y) < 0.5: p = 0.296
    P(x > y) > 0.5: p = 0.704
inSpO2TargetRangeStartTime:
  mean diff = -15.200; stdev diff = 42.494
  Paired t-test:
    |diff| > 0: p = 0.514
    diff < 0: p = 0.257
    diff > 0: p = 0.743
  Wilcoxon signed-rank test:
  



    |diff| > 0: p = 0.730
    diff < 0: p = 0.635
    diff > 0: p = 0.365
  Wilcoxon signed-rank test:
    P(x > y) != 0.5: p = 1.000
    P(x > y) < 0.5: p = 0.500
    P(x > y) > 0.5: p = 0.500
inSpO2LooseTargetRangeDuration:
  mean diff = 10.000; stdev diff = 34.316
  Paired t-test:
    |diff| > 0: p = 0.591
    diff < 0: p = 0.704
    diff > 0: p = 0.296
  Wilcoxon signed-rank test:
    P(x > y) != 0.5: p = 0.593
    P(x > y) < 0.5: p = 0.704
    P(x > y) > 0.5: p = 0.296
aboveSpO2TargetRangeDuration:
  mean diff = 0.000; stdev diff = 0.000
    Skipped t-test.
    Skipped Wilcoxon test.
belowSpO2TargetRangeDuration:
  mean diff = -6.400; stdev diff = 29.486
  Paired t-test:
    |diff| > 0: p = 0.687
    diff < 0: p = 0.343
    diff > 0: p = 0.657
  Wilcoxon signed-rank test:
    P(x > y) != 0.5: p = 0.785
    P(x > y) < 0.5: p = 0.393
    P(x > y) > 0.5: p = 0.607
inFiO2TargetRangeStartTime:
  mean diff = -29.600; stdev diff = 55.647
  Paired t-test:
    |diff| > 0: p = 0.347
   ~dif

  z = (T - mn - correction) / se


Observations:

* As before, we ignore the Wilcoxon signed-rank tests.
* With the new display, participants start PPV significantly earlier but don't achieve significantly better outcomes.
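
As a sanity check on the PPV result, the reported right-tailed p-value can be approximately recovered from the summary line `mean diff = 8.400; stdev diff = 7.552` with 5 pairs. The sketch below uses the closed-form CDF of Student's t distribution with 4 degrees of freedom, so it needs only the standard library. Whether the printed `stdev diff` uses `ddof=0` or `ddof=1` is not stated in the notebook; the helper names are made up, and the code simply tries both conventions.

```python
import math

def t_cdf_df4(t):
    """CDF of Student's t distribution with 4 degrees of freedom (closed form)."""
    u = 1.0 + t * t / 4.0
    return 0.5 + 0.375 * (t / math.sqrt(u)) * (1.0 - (t * t / u) / 12.0)

def paired_t_statistic(mean_diff, stdev_diff, n, ddof):
    """t statistic from summary values; `ddof` is the convention of stdev_diff."""
    # Convert the reported stdev to the sample (n - 1) standard deviation.
    sample_sd = stdev_diff * math.sqrt((n - ddof) / (n - 1))
    return mean_diff / (sample_sd / math.sqrt(n))

n = 5  # pairs, so the t distribution has n - 1 = 4 degrees of freedom
mean_diff, stdev_diff = 8.400, 7.552  # ppvStartTime summary from the output above

# Right-tailed p-value (diff > 0) under each assumed stdev convention.
p_ddof0 = 1.0 - t_cdf_df4(paired_t_statistic(mean_diff, stdev_diff, n, ddof=0))
p_ddof1 = 1.0 - t_cdf_df4(paired_t_statistic(mean_diff, stdev_diff, n, ddof=1))

print(f'ddof=0: p = {p_ddof0:.3f}')  # ddof=0: p = 0.045
print(f'ddof=1: p = {p_ddof1:.3f}')
```

The `ddof=0` variant reproduces the reported `p = 0.045`, which suggests (but does not prove) that the printed standard deviation is the population form, as NumPy's `std` computes by default.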

#### Summary

* The 5 participants with the lowest accumulated SpO2 errors on the hard scenarios were significantly helped by the display, whereas the 5 participants with the highest accumulated SpO2 errors were not: they started PPV earlier but showed no significant improvement in outcomes.

### Scenario Order Pairing, Split by Scenario

In [13]:
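# NOTE: `scenario_pairing` is not defined anywhere in this notebook, so this cell
# raises the NameError shown below; an earlier cell must build it first.
scenario_order_pairings = {
    0: statistics.build_pairing(scenario_pairing[0], 'scenarioNumber', values=(1, 4), check_validity=False),
    1: statistics.build_pairing(scenario_pairing[1], 'scenarioNumber', values=(2, 3), check_validity=False),
}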

NameError: name 'scenario_pairing' is not defined

#### Easy Scenarios

In [None]:
scenario_order_pairings[0].describe()

In [None]:
statistics.test_gaze_duration_outcomes(scenario_order_pairings[0])

Observations:

* Participants in scenario 1 look more at the infant, less at the monitor, and maybe less at combined SpO2 compared to scenario 4.
* These results suggest an adaptation effect between scenarios 1 and 4.
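
The within-participant comparison behind these observations can be sketched with a pivot. This is an illustration on made-up numbers, assuming only pandas; it is not how `statistics.build_pairing` works internally, which this notebook does not show, and the name `demo` is hypothetical.

```python
import pandas as pd

# Synthetic gaze durations in seconds (values are made up), one row per recording.
demo = pd.DataFrame({
    'subjectNumber': [1, 1, 2, 2, 3, 3],
    'scenarioNumber': [1, 4, 1, 4, 4, 1],
    'visitDuration_infant': [50.0, 42.0, 61.0, 55.0, 38.0, 47.0],
})

# One row per participant, one column per scenario, so rows form the pairs.
wide = demo.pivot(
    index='subjectNumber',
    columns='scenarioNumber',
    values='visitDuration_infant',
)

# Paired difference: scenario 1 minus scenario 4 for each participant.
diff = wide[1] - wide[4]
print(diff.mean())
```

A consistently positive difference would correspond to participants looking at the infant longer in scenario 1 than in scenario 4, which is the pattern the adaptation interpretation relies on.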

#### Hard Scenarios

In [None]:
scenario_order_pairings[1].describe()

In [None]:
statistics.test_gaze_duration_outcomes(scenario_order_pairings[1])

Observations:

* Compared to scenario 3, participants in scenario 2 look less at the blank part of the monitor, and possibly more at the Apgar timer and combined SpO2 and less at the instrument panel.
* These results are weak but might suggest adaptation between scenarios 2 and 3.

#### Summary

* There seems to be a learning effect between scenarios 1 and 4, but maybe not a strong effect between scenarios 2 and 3.