# Imports

In [1]:
%matplotlib inline

import scipy as sp
from scipy import stats
import numpy as np
import pandas as pd

# Project-local helper module; note that it shadows the standard-library `statistics` module.
import statistics

# Load Data

In [2]:
tracing_df = statistics.load_tracing_features()
recording_df = statistics.load_recording_relations()
recording_df = statistics.associate_recordings(recording_df, tracing_df)
gaze_dfs = statistics.load_gaze_features()
statistics.check_gaze_recording_associations(recording_df, gaze_dfs)
statistics.compute_gaze_features(gaze_dfs)
gaze_df = statistics.combine_gaze_features(gaze_dfs)
full_df = statistics.combine_all_features(recording_df, gaze_df)
full_df.columns

Index(['id', 'subjectNumber', 'scenarioNumber', 'newAfterOld', 'scenarioType',
       'displayType', 'sensorPlacementTime', 'ppvStartTime', 'ccStartTime',
       'inSpO2TargetRangeDuration', 'inSpO2LooseTargetRangeDuration',
       'inSpO2TargetRangeStartTime', 'aboveSpO2TargetRangeDuration',
       'belowSpO2TargetRangeDuration', 'inFiO2TargetRangeDuration',
       'inFiO2TargetRangeStartTime', 'aboveFiO2TargetRangeDuration',
       'belowFiO2TargetRangeDuration', 'spO2SignedErrorIntegral',
       'spO2UnsignedErrorIntegral', 'spO2SquaredErrorIntegral',
       'fiO2LargeAdjustments', 'visitDuration_fiO2Dial',
       'visitDuration_infant', 'visitDuration_monitorApgarTimer',
       'visitDuration_monitorBlank', 'visitDuration_monitorFiO2',
       'visitDuration_monitorFull', 'visitDuration_monitorGraph',
       'visitDuration_monitorHeartRate', 'visitDuration_monitorSpO2',
       'visitDuration_spO2ReferenceTable',
       'visitDuration_warmerInstrumentPanel', 'visitDuration_combinedSp

# Error Stratification
For each scenario type, we stratify subjects by their total accumulated unsigned SpO2 error, summed across both scenarios of that type.

In [3]:
stratification_size = 5
scenario_split = statistics.build_split(full_df, 'scenarioType')
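# Total unsigned SpO2 error per subject, sorted ascending, for each scenario type.
errors = [
    df[['subjectNumber', 'spO2UnsignedErrorIntegral']]
    .groupby('subjectNumber')
    .sum()
    .sort_values(by=['spO2UnsignedErrorIntegral'])
    for df in scenario_split.dfs
]
# Subject numbers in the lowest-error and highest-error tails.
tails = [
    [
        df.index[:stratification_size].values,
        df.index[-stratification_size:].values,
    ]
    for df in errors
]

# Boolean row masks selecting each tail's subjects within each scenario type.
masks = [
    [df.subjectNumber.isin(lowest_tail), df.subjectNumber.isin(highest_tail)]
    for (df, (lowest_tail, highest_tail)) in zip(scenario_split.dfs, tails)
]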
print('Lowest-error subjects, easy:', tails[0][0])
print('Highest-error subjects, easy:', tails[0][1])
print('Lowest-error subjects, hard:', tails[1][0])
print('Highest-error subjects, hard:', tails[1][1])

Lowest-error subjects, easy: [ 9  8  7 12 10]
Highest-error subjects, easy: [17 13  4 19 20]
Lowest-error subjects, hard: [16  6  9  7 18]
Highest-error subjects, hard: [11 17 13  8 15]
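
The tail selection can be illustrated on toy data without the project's helper module. The sketch below is a dependency-free approximation of the same sum, sort, and slice logic. The subject numbers and error values are made up, and `stratify_tails` is a hypothetical name that is not part of the notebook's `statistics` module.

```python
from collections import defaultdict


def stratify_tails(rows, size):
    """Return (lowest, highest) subject lists ranked by total error.

    rows: iterable of (subject_number, error) pairs, one per scenario run.
    size: number of subjects to keep in each tail.
    """
    totals = defaultdict(float)
    for subject, error in rows:
        totals[subject] += error  # sum across runs, like groupby(...).sum()
    ranked = sorted(totals, key=totals.get)  # ascending by total error
    return ranked[:size], ranked[-size:]


# Made-up data: two runs per subject.
toy_rows = [
    (1, 10.0), (1, 12.0),
    (2, 3.0), (2, 4.0),
    (3, 30.0), (3, 25.0),
    (4, 8.0), (4, 7.5),
]
lowest, highest = stratify_tails(toy_rows, size=2)
print("Lowest-error subjects:", lowest)
print("Highest-error subjects:", highest)
```

The two tails only stay disjoint when there are at least `2 * size` distinct subjects, which is worth checking before building masks from them.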


### Accumulated Absolute Error Stratification, Hard

In [4]:
error_stratification = statistics.build_stratification(scenario_split.dfs[1], 'absoluteError', ['lowest', 'highest'], masks[1])
error_stratification.describe(tests=False)

Pairing against absoluteError:
  0: lowest vs. 1: highest
  10 0 vs. 1 pairs.


In [5]:
error_display_pairings = [
    statistics.build_pairing(df, 'displayType')
    for df in error_stratification.dfs
]

#### Lowest Errors

In [6]:
error_display_pairings[0].describe()

Pairing against displayType:
  0: minimal vs. 1: full
  5 0 vs. 1 pairs.
  Paired t-test alternative hypotheses:
    Ha left-tailed (diff < 0): mean 0 - mean 1 < 0
    Ha two-tailed (|diff| > 0): mean 0 - mean 1 != 0
    Ha right-tailed (diff > 0): mean 0 - mean 1 > 0
  Wilcoxon signed-rank alternative hypotheses:
    Ha left-tailed (P(x > y) < 0.5)
    Ha two-tailed (P(x > y) != 0.5)
    Ha right-tailed (P(x > y) > 0.5)


In [7]:
statistics.test_gaze_duration_outcomes(error_display_pairings[0])

visitDuration_infant:
  mean diff = 17.004; stdev diff = 21.697
  Paired t-test:
   ~|diff| > 0: p = 0.192
    diff < 0: p = 0.904
   *diff > 0: p = 0.096
  Wilcoxon signed-rank test:
  **P(x > y) != 0.5: p = 0.043
    P(x > y) < 0.5: p = 0.978
  **P(x > y) > 0.5: p = 0.022
visitDuration_warmerInstrumentPanel:
  mean diff = 0.704; stdev diff = 5.876
  Paired t-test:
    |diff| > 0: p = 0.822
    diff < 0: p = 0.589
    diff > 0: p = 0.411
  Wilcoxon signed-rank test:
    P(x > y) != 0.5: p = 0.893
    P(x > y) < 0.5: p = 0.554
    P(x > y) > 0.5: p = 0.446
visitDuration_fiO2Dial:
  mean diff = 0.620; stdev diff = 2.416
  Paired t-test:
    |diff| > 0: p = 0.635
    diff < 0: p = 0.683
    diff > 0: p = 0.317
  Wilcoxon signed-rank test:
    P(x > y) != 0.5: p = 0.686
    P(x > y) < 0.5: p = 0.657
    P(x > y) > 0.5: p = 0.343
visitDuration_spO2ReferenceTable:
  mean diff = 1.832; stdev diff = 1.684
  Paired t-test:
   *|diff| > 0: p = 0.095
    diff < 0: p = 0.952
  **diff > 0: p = 0.0



Observations:

* As before, we ignore the Wilcoxon tests because the sample sizes are too small.
* With the new display, participants spend less time looking at the SpO2 reference table, the Apgar timer, and the heart rate on the monitor, and more time looking at the combined SpO2 elements.
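
To make the small-sample concern concrete, the sketch below enumerates the exact null distribution of the signed-rank statistic for 5 pairs and compares its most extreme outcome with a normal approximation. It assumes no ties or zero differences, and it assumes the reported Wilcoxon p-values come from a normal approximation without continuity correction. That second assumption is inferred from the printed numbers, not confirmed by the helper module's source.

```python
import math
from itertools import product

n = 5  # number of minimal-vs-full pairs in each stratum
ranks = range(1, n + 1)

# Exact null distribution of W+ (sum of ranks with positive differences):
# every sign pattern is equally likely when there are no ties or zeros.
w_plus_values = [
    sum(rank for rank, positive in zip(ranks, signs) if positive)
    for signs in product([False, True], repeat=n)
]
total = len(w_plus_values)  # 2**n sign patterns

# Most extreme outcome: all differences share one sign, so W+ is 0 (or its max).
exact_one_sided = w_plus_values.count(0) / total
exact_two_sided = 2 * exact_one_sided

# Normal approximation for the same extreme outcome, no continuity correction.
mean_w = n * (n + 1) / 4
sd_w = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
z = (0 - mean_w) / sd_w
approx_two_sided = math.erfc(abs(z) / math.sqrt(2))
approx_one_sided = approx_two_sided / 2

print(f"exact:  one-sided {exact_one_sided:.4f}, two-sided {exact_two_sided:.4f}")
print(f"approx: one-sided {approx_one_sided:.3f}, two-sided {approx_two_sided:.3f}")
```

With 5 pairs the exact two-sided p-value can never fall below 2/32 = 0.0625, so values such as 0.043 are only reachable through the approximation, which is unreliable at this sample size.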

#### Highest Errors

In [8]:
error_display_pairings[1].describe()

Pairing against displayType:
  0: minimal vs. 1: full
  5 0 vs. 1 pairs.
  Paired t-test alternative hypotheses:
    Ha left-tailed (diff < 0): mean 0 - mean 1 < 0
    Ha two-tailed (|diff| > 0): mean 0 - mean 1 != 0
    Ha right-tailed (diff > 0): mean 0 - mean 1 > 0
  Wilcoxon signed-rank alternative hypotheses:
    Ha left-tailed (P(x > y) < 0.5)
    Ha two-tailed (P(x > y) != 0.5)
    Ha right-tailed (P(x > y) > 0.5)


In [9]:
statistics.test_gaze_duration_outcomes(error_display_pairings[1])

visitDuration_infant:
  mean diff = -7.270; stdev diff = 32.861
  Paired t-test:
    |diff| > 0: p = 0.681
    diff < 0: p = 0.341
    diff > 0: p = 0.659
  Wilcoxon signed-rank test:
    P(x > y) != 0.5: p = 0.893
    P(x > y) < 0.5: p = 0.446
    P(x > y) > 0.5: p = 0.554
visitDuration_warmerInstrumentPanel:
  mean diff = 1.024; stdev diff = 12.326
  Paired t-test:
    |diff| > 0: p = 0.876
    diff < 0: p = 0.562
    diff > 0: p = 0.438
  Wilcoxon signed-rank test:
    P(x > y) != 0.5: p = 0.500
    P(x > y) < 0.5: p = 0.750
    P(x > y) > 0.5: p = 0.250
visitDuration_fiO2Dial:
  mean diff = 1.032; stdev diff = 2.740
  Paired t-test:
    |diff| > 0: p = 0.493
    diff < 0: p = 0.753
    diff > 0: p = 0.247
  Wilcoxon signed-rank test:
    P(x > y) != 0.5: p = 0.686
    P(x > y) < 0.5: p = 0.657
    P(x > y) > 0.5: p = 0.343
visitDuration_spO2ReferenceTable:
  mean diff = 1.216; stdev diff = 2.248
  Paired t-test:
    |diff| > 0: p = 0.340
    diff < 0: p = 0.830
   ~diff > 0: p = 0.



Observations:

* As before, we ignore the Wilcoxon signed-rank tests.
* With the new display, participants spend significantly more time looking at the combined FiO2 elements, and marginally more time (just short of significance) looking at the combined SpO2 elements.
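
The three paired t-test p-values can be recomputed from the printed summary statistics alone. The sketch below does this for the `visitDuration_infant` rows of both strata, using the closed-form CDF of Student's t distribution with 4 degrees of freedom (5 pairs). It assumes that `stdev diff` is a population standard deviation (`ddof=0`). That assumption is inferred because it reproduces the printed p-values, not because the helper module's source confirms it. The function names are hypothetical.

```python
import math


def t_cdf_df4(t):
    """CDF of Student's t distribution with exactly 4 degrees of freedom."""
    u = t * t / (1 + t * t / 4)
    return 0.5 + 0.375 * (t / math.sqrt(1 + t * t / 4)) * (1 - u / 12)


def paired_t_p_values(mean_diff, stdev_diff, n=5):
    """Return (two_tailed, left_tailed, right_tailed) p-values.

    Assumes stdev_diff is the population standard deviation (ddof=0),
    so the standard error is stdev_diff / sqrt(n - 1). Only n=5 is
    supported because the CDF above is specific to 4 degrees of freedom.
    """
    if n != 5:
        raise ValueError("this sketch only handles n=5 (4 degrees of freedom)")
    t = mean_diff / (stdev_diff / math.sqrt(n - 1))
    left = t_cdf_df4(t)   # P(T <= t): evidence for diff < 0
    right = 1 - left      # P(T >= t): evidence for diff > 0
    two = 2 * min(left, right)
    return two, left, right


# Summary statistics copied from the visitDuration_infant output lines.
for label, mean_diff, stdev_diff in [
    ("lowest errors", 17.004, 21.697),
    ("highest errors", -7.270, 32.861),
]:
    two, left, right = paired_t_p_values(mean_diff, stdev_diff)
    print(
        f"{label}: |diff| > 0: p = {two:.3f}; "
        f"diff < 0: p = {left:.3f}; diff > 0: p = {right:.3f}"
    )
```

The left-tailed and right-tailed values always sum to 1, and the two-tailed value is twice the smaller of the two, so each output block carries only one independent t-test result per feature.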

#### Summary

* On the hard scenario, the 5 participants with the lowest accumulated SpO2 error spent less time looking at the SpO2 reference table, Apgar timer, and heart rate display, and more time looking at the SpO2 elements on the screen. The 5 participants with the highest accumulated SpO2 error instead spent more time looking at FiO2, and possibly also at SpO2 in general, with no significant difference in either direction for the SpO2 reference table.

### Accumulated Absolute Error Stratification, Easy

In [10]:
error_stratification = statistics.build_stratification(scenario_split.dfs[0], 'absoluteError', ['lowest', 'highest'], masks[0])
error_stratification.describe(tests=False)

Pairing against absoluteError:
  0: lowest vs. 1: highest
  10 0 vs. 1 pairs.


In [11]:
error_display_pairings = [
    statistics.build_pairing(df, 'displayType')
    for df in error_stratification.dfs
]

#### Lowest Errors

In [12]:
error_display_pairings[0].describe()

Pairing against displayType:
  0: minimal vs. 1: full
  5 0 vs. 1 pairs.
  Paired t-test alternative hypotheses:
    Ha left-tailed (diff < 0): mean 0 - mean 1 < 0
    Ha two-tailed (|diff| > 0): mean 0 - mean 1 != 0
    Ha right-tailed (diff > 0): mean 0 - mean 1 > 0
  Wilcoxon signed-rank alternative hypotheses:
    Ha left-tailed (P(x > y) < 0.5)
    Ha two-tailed (P(x > y) != 0.5)
    Ha right-tailed (P(x > y) > 0.5)


In [13]:
statistics.test_gaze_duration_outcomes(error_display_pairings[0])

visitDuration_infant:
  mean diff = -0.388; stdev diff = 18.461
  Paired t-test:
    |diff| > 0: p = 0.968
    diff < 0: p = 0.484
    diff > 0: p = 0.516
  Wilcoxon signed-rank test:
    P(x > y) != 0.5: p = 0.893
    P(x > y) < 0.5: p = 0.446
    P(x > y) > 0.5: p = 0.554
visitDuration_warmerInstrumentPanel:
  mean diff = 3.059; stdev diff = 8.856
  Paired t-test:
    |diff| > 0: p = 0.528
    diff < 0: p = 0.736
    diff > 0: p = 0.264
  Wilcoxon signed-rank test:
    P(x > y) != 0.5: p = 0.893
    P(x > y) < 0.5: p = 0.554
    P(x > y) > 0.5: p = 0.446
visitDuration_fiO2Dial:
  mean diff = -0.128; stdev diff = 7.721
  Paired t-test:
    |diff| > 0: p = 0.975
    diff < 0: p = 0.488
    diff > 0: p = 0.512
  Wilcoxon signed-rank test:
    P(x > y) != 0.5: p = 0.893
    P(x > y) < 0.5: p = 0.446
    P(x > y) > 0.5: p = 0.554
visitDuration_spO2ReferenceTable:
  mean diff = 0.480; stdev diff = 6.066
  Paired t-test:
    |diff| > 0: p = 0.882
    diff < 0: p = 0.559
    diff > 0: p = 0.



Observations:

* As before, we ignore the Wilcoxon tests.
* With the new display, participants spent marginally less time (just short of significance) looking at the Apgar timer, significantly less time looking at SpO2, and significantly more time looking at combined FiO2 elements.
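
Phrases such as "significantly" and "almost significantly" correspond to the markers printed in front of each p-value. The sketch below encodes one reading of those markers. The thresholds (0.05, 0.1, 0.2) are inferred from the printed results rather than taken from the helper module's source, `significance_marker` is a hypothetical name, and the behaviour at exact boundary values is a guess.

```python
def significance_marker(p):
    """Map a p-value to the marker that appears to precede it in the output.

    Thresholds are inferred from the printed results, not from the helper
    module: '**' for p < 0.05, '*' for p < 0.1, '~' for p < 0.2.
    """
    if p < 0.05:
        return "**"
    if p < 0.1:
        return "*"
    if p < 0.2:
        return "~"
    return ""


# p-values copied from outputs in this notebook, one per marker level.
for p in (0.043, 0.096, 0.192, 0.225):
    print(f"{significance_marker(p):>2} p = {p:.3f}")
```

Under this reading, "significantly" maps to `**`, and "almost" or "maybe" significantly maps to `*`.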

#### Highest Errors

In [14]:
error_display_pairings[1].describe()

Pairing against displayType:
  0: minimal vs. 1: full
  5 0 vs. 1 pairs.
  Paired t-test alternative hypotheses:
    Ha left-tailed (diff < 0): mean 0 - mean 1 < 0
    Ha two-tailed (|diff| > 0): mean 0 - mean 1 != 0
    Ha right-tailed (diff > 0): mean 0 - mean 1 > 0
  Wilcoxon signed-rank alternative hypotheses:
    Ha left-tailed (P(x > y) < 0.5)
    Ha two-tailed (P(x > y) != 0.5)
    Ha right-tailed (P(x > y) > 0.5)


In [15]:
statistics.test_gaze_duration_outcomes(error_display_pairings[1])

visitDuration_infant:
  mean diff = -7.186; stdev diff = 22.444
  Paired t-test:
    |diff| > 0: p = 0.557
    diff < 0: p = 0.278
    diff > 0: p = 0.722
  Wilcoxon signed-rank test:
    P(x > y) != 0.5: p = 0.225
   ~P(x > y) < 0.5: p = 0.112
    P(x > y) > 0.5: p = 0.888
visitDuration_warmerInstrumentPanel:
  mean diff = -1.348; stdev diff = 5.446
  Paired t-test:
    |diff| > 0: p = 0.647
    diff < 0: p = 0.323
    diff > 0: p = 0.677
  Wilcoxon signed-rank test:
    P(x > y) != 0.5: p = 0.500
    P(x > y) < 0.5: p = 0.250
    P(x > y) > 0.5: p = 0.750
visitDuration_fiO2Dial:
  mean diff = 3.879; stdev diff = 6.262
  Paired t-test:
    |diff| > 0: p = 0.283
    diff < 0: p = 0.858
   ~diff > 0: p = 0.142
  Wilcoxon signed-rank test:
    P(x > y) != 0.5: p = 0.225
    P(x > y) < 0.5: p = 0.888
   ~P(x > y) > 0.5: p = 0.112
visitDuration_spO2ReferenceTable:
  mean diff = 2.699; stdev diff = 5.711
  Paired t-test:
    |diff| > 0: p = 0.398
    diff < 0: p = 0.801
   ~diff > 0: p = 0.




   ~|diff| > 0: p = 0.150
    diff < 0: p = 0.925
   *diff > 0: p = 0.075
  Wilcoxon signed-rank test:
    P(x > y) != 0.5: p = 0.225
    P(x > y) < 0.5: p = 0.888
   ~P(x > y) > 0.5: p = 0.112
visitDuration_monitorFiO2:
  mean diff = -8.322; stdev diff = 1.483
  Paired t-test:
  **|diff| > 0: p = 0.000
  **diff < 0: p = 0.000
    diff > 0: p = 1.000
  Wilcoxon signed-rank test:
  **P(x > y) != 0.5: p = 0.043
  **P(x > y) < 0.5: p = 0.022
    P(x > y) > 0.5: p = 0.978
visitDuration_monitorGraph:
  mean diff = -24.460; stdev diff = 7.566
  Paired t-test:
  **|diff| > 0: p = 0.003
  **diff < 0: p = 0.001
    diff > 0: p = 0.999
  Wilcoxon signed-rank test:
  **P(x > y) != 0.5: p = 0.043
  **P(x > y) < 0.5: p = 0.022
    P(x > y) > 0.5: p = 0.978
visitDuration_monitorSpO2:
  mean diff = -5.715; stdev diff = 22.334
  Paired t-test:
    |diff| > 0: p = 0.636
    diff < 0: p = 0.318
    diff > 0: p = 0.682
  Wilcoxon signed-rank test:
    P(x > y) != 0.5: p = 0.500
    P(x > y) < 0.5: p = 0

Observations:

* As before, we ignore the Wilcoxon signed-rank tests.
* With the new display, participants spent significantly more time looking at combined SpO2 elements.

#### Summary

* On the hard scenario, the 5 participants with the lowest accumulated SpO2 error shifted their attention away from the SpO2 reference table, Apgar timer, and heart rate and toward the SpO2 graph with the new display. The 5 participants with the highest accumulated SpO2 error didn't shift their attention as much (and looked at FiO2 more).
* On the easy scenario, the 5 participants with the lowest accumulated SpO2 error shifted their attention from SpO2 to the SpO2 graph and from the Apgar timer to FiO2 with the new display. The 5 participants with the highest accumulated SpO2 error also shifted their attention to the SpO2 graph, but the areas they shifted attention away from probably differed among participants.

# Duration Stratification
For each scenario type, we stratify by the total time in the loose SpO2 target range across both scenarios of the same type per subject. Stratifying by the strict SpO2 target range doesn't yield any significant differences.

In [16]:
stratification_size = 5
scenario_split = statistics.build_split(full_df, 'scenarioType')
# Total time in the loose SpO2 target range per subject, sorted descending.
durations = [
    df[['subjectNumber', 'inSpO2LooseTargetRangeDuration']]
    .groupby('subjectNumber')
    .sum()
    .sort_values(by=['inSpO2LooseTargetRangeDuration'], ascending=False)
    for df in scenario_split.dfs
]
# Because of the descending sort, the first tail holds the highest durations.
tails = [
    [
        df.index[:stratification_size].values,
        df.index[-stratification_size:].values,
    ]
    for df in durations
]

# Boolean row masks selecting each tail's subjects within each scenario type.
masks = [
    [df.subjectNumber.isin(highest_tail), df.subjectNumber.isin(lowest_tail)]
    for (df, (highest_tail, lowest_tail)) in zip(scenario_split.dfs, tails)
]
print('Highest-duration subjects, easy:', tails[0][0])
print('Lowest-duration subjects, easy:', tails[0][1])
print('Highest-duration subjects, hard:', tails[1][0])
print('Lowest-duration subjects, hard:', tails[1][1])

Highest-duration subjects, easy: [ 8  9 12  7 10]
Lowest-duration subjects, easy: [18 13  4 19 20]
Highest-duration subjects, hard: [16  6 10  7 17]
Lowest-duration subjects, hard: [19  8 13 15 18]
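
Because low error and long time in range both indicate good performance, the duration tails can be compared with the error tails printed earlier. The sketch below does that with plain sets. The subject numbers are copied from the two printouts, while the `best` and `worst` labels are hypothetical shorthand introduced here.

```python
# Subject numbers copied from the two stratification printouts in this notebook.
error_tails = {
    ("easy", "best"): {9, 8, 7, 12, 10},     # lowest error
    ("easy", "worst"): {17, 13, 4, 19, 20},  # highest error
    ("hard", "best"): {16, 6, 9, 7, 18},
    ("hard", "worst"): {11, 17, 13, 8, 15},
}
duration_tails = {
    ("easy", "best"): {8, 9, 12, 7, 10},     # highest duration
    ("easy", "worst"): {18, 13, 4, 19, 20},  # lowest duration
    ("hard", "best"): {16, 6, 10, 7, 17},
    ("hard", "worst"): {19, 8, 13, 15, 18},
}

# Subjects that land in the same tail under both stratification criteria.
overlap = {
    key: sorted(error_tails[key] & duration_tails[key])
    for key in error_tails
}
for (scenario, group), shared in overlap.items():
    print(f"{scenario:>4} {group:>5}: {len(shared)} of 5 shared {shared}")
```

On the easy scenario the lowest-error and highest-duration groups contain the same five subjects, which is why the easy highest-duration results repeat the easy lowest-error results. On the hard scenario the two criteria agree less: subject 17 is in the highest-error tail but also the highest-duration tail, and subject 18 is in the lowest-error tail but also the lowest-duration tail.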


### Duration Stratification, Hard

In [17]:
duration_stratification = statistics.build_stratification(scenario_split.dfs[1], 'spO2LooseTargetDuration', ['highest', 'lowest'], masks[1])
duration_stratification.describe(tests=False)

Pairing against spO2LooseTargetDuration:
  0: highest vs. 1: lowest
  10 0 vs. 1 pairs.


In [18]:
duration_display_pairings = [
    statistics.build_pairing(df, 'displayType')
    for df in duration_stratification.dfs
]

#### Highest Durations

In [19]:
duration_display_pairings[0].describe()

Pairing against displayType:
  0: minimal vs. 1: full
  5 0 vs. 1 pairs.
  Paired t-test alternative hypotheses:
    Ha left-tailed (diff < 0): mean 0 - mean 1 < 0
    Ha two-tailed (|diff| > 0): mean 0 - mean 1 != 0
    Ha right-tailed (diff > 0): mean 0 - mean 1 > 0
  Wilcoxon signed-rank alternative hypotheses:
    Ha left-tailed (P(x > y) < 0.5)
    Ha two-tailed (P(x > y) != 0.5)
    Ha right-tailed (P(x > y) > 0.5)


In [20]:
statistics.test_gaze_duration_outcomes(duration_display_pairings[0])

visitDuration_infant:
  mean diff = 17.779; stdev diff = 26.983
  Paired t-test:
    |diff| > 0: p = 0.258
    diff < 0: p = 0.871
   ~diff > 0: p = 0.129
  Wilcoxon signed-rank test:
    P(x > y) != 0.5: p = 0.225
    P(x > y) < 0.5: p = 0.888
   ~P(x > y) > 0.5: p = 0.112
visitDuration_warmerInstrumentPanel:
  mean diff = 1.524; stdev diff = 4.411
  Paired t-test:
    |diff| > 0: p = 0.528
    diff < 0: p = 0.736
    diff > 0: p = 0.264
  Wilcoxon signed-rank test:
    P(x > y) != 0.5: p = 0.500
    P(x > y) < 0.5: p = 0.750
    P(x > y) > 0.5: p = 0.250
visitDuration_fiO2Dial:
  mean diff = 0.252; stdev diff = 1.883
  Paired t-test:
    |diff| > 0: p = 0.802
    diff < 0: p = 0.599
    diff > 0: p = 0.401
  Wilcoxon signed-rank test:
    P(x > y) != 0.5: p = 0.686
    P(x > y) < 0.5: p = 0.657
    P(x > y) > 0.5: p = 0.343
visitDuration_spO2ReferenceTable:
  mean diff = 0.416; stdev diff = 2.699
  Paired t-test:
    |diff| > 0: p = 0.773
    diff < 0: p = 0.613
    diff > 0: p = 0.3



Observations:

* As before, we ignore the Wilcoxon tests because the sample sizes are too small.
* With the new display, participants look at combined FiO2 elements significantly more.

#### Lowest Durations

In [21]:
duration_display_pairings[1].describe()

Pairing against displayType:
  0: minimal vs. 1: full
  5 0 vs. 1 pairs.
  Paired t-test alternative hypotheses:
    Ha left-tailed (diff < 0): mean 0 - mean 1 < 0
    Ha two-tailed (|diff| > 0): mean 0 - mean 1 != 0
    Ha right-tailed (diff > 0): mean 0 - mean 1 > 0
  Wilcoxon signed-rank alternative hypotheses:
    Ha left-tailed (P(x > y) < 0.5)
    Ha two-tailed (P(x > y) != 0.5)
    Ha right-tailed (P(x > y) > 0.5)


In [22]:
statistics.test_gaze_duration_outcomes(duration_display_pairings[1])

visitDuration_infant:
  mean diff = -7.002; stdev diff = 29.533
  Paired t-test:
    |diff| > 0: p = 0.660
    diff < 0: p = 0.330
    diff > 0: p = 0.670
  Wilcoxon signed-rank test:
    P(x > y) != 0.5: p = 0.500
    P(x > y) < 0.5: p = 0.250
    P(x > y) > 0.5: p = 0.750
visitDuration_warmerInstrumentPanel:
  mean diff = -3.899; stdev diff = 14.923
  Paired t-test:
    |diff| > 0: p = 0.629
    diff < 0: p = 0.314
    diff > 0: p = 0.686
  Wilcoxon signed-rank test:
    P(x > y) != 0.5: p = 0.500
    P(x > y) < 0.5: p = 0.250
    P(x > y) > 0.5: p = 0.750
visitDuration_fiO2Dial:
  mean diff = -1.232; stdev diff = 3.221
  Paired t-test:
    |diff| > 0: p = 0.487
    diff < 0: p = 0.244
    diff > 0: p = 0.756
  Wilcoxon signed-rank test:
    P(x > y) != 0.5: p = 0.345
   ~P(x > y) < 0.5: p = 0.173
    P(x > y) > 0.5: p = 0.827
visitDuration_spO2ReferenceTable:
  mean diff = 1.508; stdev diff = 1.250
  Paired t-test:
   *|diff| > 0: p = 0.073
    diff < 0: p = 0.963
  **diff > 0: p = 



  Paired t-test:
   *|diff| > 0: p = 0.055
  **diff < 0: p = 0.028
    diff > 0: p = 0.972
  Wilcoxon signed-rank test:
   *P(x > y) != 0.5: p = 0.080
  **P(x > y) < 0.5: p = 0.040
    P(x > y) > 0.5: p = 0.960
visitDuration_combinedSpO2:
  mean diff = -4.963; stdev diff = 12.570
  Paired t-test:
    |diff| > 0: p = 0.474
    diff < 0: p = 0.237
    diff > 0: p = 0.763
  Wilcoxon signed-rank test:
    P(x > y) != 0.5: p = 0.345
   ~P(x > y) < 0.5: p = 0.173
    P(x > y) > 0.5: p = 0.827


Observations:

* As before, we ignore the Wilcoxon signed-rank tests.
* With the new display, participants look at the SpO2 reference table significantly less, and the combined FiO2 elements significantly more.

#### Summary

* The top 5 participants who spent the most time within +/- 5 percentage points of the SpO2 target range shifted their attention to FiO2 on the new display, but the areas they shifted their attention away from probably differed among participants.
* The bottom 5 participants who spent the least time within +/- 5 percentage points of the SpO2 target range shifted their attention away from the SpO2 reference table and other areas (probably different across participants) to the FiO2 on the new display.

### Duration Stratification, Easy

In [23]:
duration_stratification = statistics.build_stratification(scenario_split.dfs[0], 'spO2LooseTargetDuration', ['highest', 'lowest'], masks[0])
duration_stratification.describe(tests=False)

Pairing against spO2LooseTargetDuration:
  0: highest vs. 1: lowest
  10 0 vs. 1 pairs.


In [24]:
duration_display_pairings = [
    statistics.build_pairing(df, 'displayType')
    for df in duration_stratification.dfs
]

#### Highest Durations

In [25]:
duration_display_pairings[0].describe()

Pairing against displayType:
  0: minimal vs. 1: full
  5 0 vs. 1 pairs.
  Paired t-test alternative hypotheses:
    Ha left-tailed (diff < 0): mean 0 - mean 1 < 0
    Ha two-tailed (|diff| > 0): mean 0 - mean 1 != 0
    Ha right-tailed (diff > 0): mean 0 - mean 1 > 0
  Wilcoxon signed-rank alternative hypotheses:
    Ha left-tailed (P(x > y) < 0.5)
    Ha two-tailed (P(x > y) != 0.5)
    Ha right-tailed (P(x > y) > 0.5)


In [26]:
statistics.test_gaze_duration_outcomes(duration_display_pairings[0])

visitDuration_infant:
  mean diff = -0.388; stdev diff = 18.461
  Paired t-test:
    |diff| > 0: p = 0.968
    diff < 0: p = 0.484
    diff > 0: p = 0.516
  Wilcoxon signed-rank test:
    P(x > y) != 0.5: p = 0.893
    P(x > y) < 0.5: p = 0.446
    P(x > y) > 0.5: p = 0.554
visitDuration_warmerInstrumentPanel:
  mean diff = 3.059; stdev diff = 8.856
  Paired t-test:
    |diff| > 0: p = 0.528
    diff < 0: p = 0.736
    diff > 0: p = 0.264
  Wilcoxon signed-rank test:
    P(x > y) != 0.5: p = 0.893
    P(x > y) < 0.5: p = 0.554
    P(x > y) > 0.5: p = 0.446
visitDuration_fiO2Dial:
  mean diff = -0.128; stdev diff = 7.721
  Paired t-test:
    |diff| > 0: p = 0.975
    diff < 0: p = 0.488
    diff > 0: p = 0.512
  Wilcoxon signed-rank test:
    P(x > y) != 0.5: p = 0.893
    P(x > y) < 0.5: p = 0.446
    P(x > y) > 0.5: p = 0.554
visitDuration_spO2ReferenceTable:
  mean diff = 0.480; stdev diff = 6.066
  Paired t-test:
    |diff| > 0: p = 0.882
    diff < 0: p = 0.559
    diff > 0: p = 0.



Observations:

* As before, we ignore the Wilcoxon tests.
* With the new display, participants look at the Apgar timer marginally less (just short of significance), the SpO2 significantly less, and the combined FiO2 significantly more.

#### Lowest Durations

In [27]:
duration_display_pairings[1].describe()

Pairing against displayType:
  0: minimal vs. 1: full
  5 0 vs. 1 pairs.
  Paired t-test alternative hypotheses:
    Ha left-tailed (diff < 0): mean 0 - mean 1 < 0
    Ha two-tailed (|diff| > 0): mean 0 - mean 1 != 0
    Ha right-tailed (diff > 0): mean 0 - mean 1 > 0
  Wilcoxon signed-rank alternative hypotheses:
    Ha left-tailed (P(x > y) < 0.5)
    Ha two-tailed (P(x > y) != 0.5)
    Ha right-tailed (P(x > y) > 0.5)


In [28]:
statistics.test_gaze_duration_outcomes(duration_display_pairings[1])

visitDuration_infant:
  mean diff = -10.673; stdev diff = 19.069
  Paired t-test:
    |diff| > 0: p = 0.326
   ~diff < 0: p = 0.163
    diff > 0: p = 0.837
  Wilcoxon signed-rank test:
    P(x > y) != 0.5: p = 0.225
   ~P(x > y) < 0.5: p = 0.112
    P(x > y) > 0.5: p = 0.888
visitDuration_warmerInstrumentPanel:
  mean diff = 0.108; stdev diff = 6.517
  Paired t-test:
    |diff| > 0: p = 0.975
    diff < 0: p = 0.512
    diff > 0: p = 0.488
  Wilcoxon signed-rank test:
    P(x > y) != 0.5: p = 0.893
    P(x > y) < 0.5: p = 0.554
    P(x > y) > 0.5: p = 0.446
visitDuration_fiO2Dial:
  mean diff = 4.879; stdev diff = 6.087
  Paired t-test:
   ~|diff| > 0: p = 0.184
    diff < 0: p = 0.908
   *diff > 0: p = 0.092
  Wilcoxon signed-rank test:
   ~P(x > y) != 0.5: p = 0.138
    P(x > y) < 0.5: p = 0.931
   *P(x > y) > 0.5: p = 0.069
visitDuration_spO2ReferenceTable:
  mean diff = 2.695; stdev diff = 5.717
  Paired t-test:
    |diff| > 0: p = 0.399
    diff < 0: p = 0.800
   ~diff > 0: p = 0.




  mean diff = -22.378; stdev diff = 10.709
  Paired t-test:
  **|diff| > 0: p = 0.014
  **diff < 0: p = 0.007
    diff > 0: p = 0.993
  Wilcoxon signed-rank test:
  **P(x > y) != 0.5: p = 0.043
  **P(x > y) < 0.5: p = 0.022
    P(x > y) > 0.5: p = 0.978
visitDuration_monitorSpO2:
  mean diff = -5.587; stdev diff = 22.314
  Paired t-test:
    |diff| > 0: p = 0.643
    diff < 0: p = 0.321
    diff > 0: p = 0.679
  Wilcoxon signed-rank test:
    P(x > y) != 0.5: p = 0.500
    P(x > y) < 0.5: p = 0.250
    P(x > y) > 0.5: p = 0.750
visitDuration_combinedFiO2:
  mean diff = -3.215; stdev diff = 6.695
  Paired t-test:
    |diff| > 0: p = 0.391
   ~diff < 0: p = 0.196
    diff > 0: p = 0.804
  Wilcoxon signed-rank test:
    P(x > y) != 0.5: p = 0.500
    P(x > y) < 0.5: p = 0.250
    P(x > y) > 0.5: p = 0.750
visitDuration_combinedSpO2:
  mean diff = -25.269; stdev diff = 21.692
  Paired t-test:
   *|diff| > 0: p = 0.080
  **diff < 0: p = 0.040
    diff > 0: p = 0.960
  Wilcoxon signed-rank 

Observations:

* As before, we ignore the Wilcoxon signed-rank tests.
* With the new display, participants look at the Apgar timer marginally less (just short of significance) and the combined SpO2 elements significantly more.

#### Summary

* Both groups of participants looked at the Apgar timer less (marginally significant, per the observations above), and their attention shifted towards FiO2 (for the top participants) or SpO2 (for the bottom participants). For both groups, the areas their attention shifted away from seemed to differ among participants.