# Imports

In [1]:
%matplotlib inline

import scipy as sp
from scipy import stats
import numpy as np
import pandas as pd

import statistics

# Load Data

In [2]:
tracing_df = statistics.load_tracing_features()
recording_df = statistics.load_recording_relations()
recording_df = statistics.associate_recordings(recording_df, tracing_df)
gaze_dfs = statistics.load_gaze_features()
statistics.check_gaze_recording_associations(recording_df, gaze_dfs)
statistics.compute_gaze_features(gaze_dfs)
gaze_df = statistics.combine_gaze_features(gaze_dfs)
full_df = statistics.combine_all_features(recording_df, gaze_df)
full_df.columns

Index(['id', 'subjectNumber', 'scenarioNumber', 'newAfterOld', 'scenarioType',
       'displayType', 'sensorPlacementTime', 'ppvStartTime', 'ccStartTime',
       'inSpO2TargetRangeDuration', 'inSpO2LooseTargetRangeDuration',
       'inSpO2TargetRangeStartTime', 'aboveSpO2TargetRangeDuration',
       'belowSpO2TargetRangeDuration', 'inFiO2TargetRangeDuration',
       'inFiO2TargetRangeStartTime', 'aboveFiO2TargetRangeDuration',
       'belowFiO2TargetRangeDuration', 'spO2SignedErrorIntegral',
       'spO2UnsignedErrorIntegral', 'spO2SquaredErrorIntegral',
       'fiO2LargeAdjustments', 'visitDuration_fiO2Dial',
       'visitDuration_infant', 'visitDuration_monitorApgarTimer',
       'visitDuration_monitorBlank', 'visitDuration_monitorFiO2',
       'visitDuration_monitorFull', 'visitDuration_monitorGraph',
       'visitDuration_monitorHeartRate', 'visitDuration_monitorSpO2',
       'visitDuration_spO2ReferenceTable',
       'visitDuration_warmerInstrumentPanel', 'visitDuration_combinedFi

# Pairing

### Scenario Type Pairing

In [3]:
scenario_pairing = statistics.build_pairing(full_df, 'scenarioType')

In [4]:
scenario_pairing.describe()

Pairing against scenarioType:
  0: easy vs. 1: hard
  22 0 vs. 1 pairs.
  Paired t-test alternative hypotheses:
    Ha left-tailed (diff < 0): mean 0 - mean 1 < 0
    Ha two-tailed (|diff| > 0): mean 0 - mean 1 != 0
    Ha right-tailed (diff > 0): mean 0 - mean 1 > 0
  Wilcoxon signed-rank alternative hypotheses:
    Ha left-tailed (P(x > y) < 0.5)
    Ha two-tailed (P(x > y) != 0.5)
    Ha right-tailed (P(x > y) > 0.5)
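
The three alternatives listed for each test map onto the `alternative` argument of SciPy's paired tests. The sketch below is a minimal illustration under stated assumptions, not the implementation behind `statistics.build_pairing`: the helper `paired_tests` and the sample arrays are hypothetical, the values are randomly generated rather than taken from this study, and `alternative` on `ttest_rel` assumes SciPy 1.6 or newer.

```python
import numpy as np
from scipy import stats

# Made-up paired measurements for illustration only (not data from this study).
rng = np.random.default_rng(0)
x = rng.normal(loc=12.0, scale=4.0, size=22)  # condition 0
y = rng.normal(loc=10.0, scale=4.0, size=22)  # condition 1


def paired_tests(x, y):
    """Return (paired t-test p, Wilcoxon p) for each alternative hypothesis."""
    results = {}
    for alternative in ('less', 'two-sided', 'greater'):
        # 'less' tests mean(x - y) < 0, 'greater' tests mean(x - y) > 0.
        t_p = stats.ttest_rel(x, y, alternative=alternative).pvalue
        w_p = stats.wilcoxon(x, y, alternative=alternative).pvalue
        results[alternative] = (t_p, w_p)
    return results


results = paired_tests(x, y)
for alternative, (t_p, w_p) in results.items():
    print(f'{alternative:>9}: paired t p = {t_p:.3f}; Wilcoxon p = {w_p:.3f}')
```

For the paired t-test, the two one-tailed p-values sum to 1 and the two-tailed p-value is twice the smaller of them, which is the pattern visible in the printed results throughout this notebook.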


In [5]:
statistics.test_gaze_count_outcomes(scenario_pairing)

visitCount_infant:
  mean diff = -3.273; stdev diff = 12.174
  Paired t-test:
    |diff| > 0: p = 0.232
   ~diff < 0: p = 0.116
    diff > 0: p = 0.884
  Wilcoxon signed-rank test:
    P(x > y) != 0.5: p = 0.222
   ~P(x > y) < 0.5: p = 0.111
    P(x > y) > 0.5: p = 0.889
visitCount_warmerInstrumentPanel:
  mean diff = 2.182; stdev diff = 11.504
  Paired t-test:
    |diff| > 0: p = 0.395
    diff < 0: p = 0.803
   ~diff > 0: p = 0.197
  Wilcoxon signed-rank test:
    P(x > y) != 0.5: p = 0.548
    P(x > y) < 0.5: p = 0.726
    P(x > y) > 0.5: p = 0.274
visitCount_fiO2Dial:
  mean diff = 1.818; stdev diff = 6.478
  Paired t-test:
    |diff| > 0: p = 0.212
    diff < 0: p = 0.894
   ~diff > 0: p = 0.106
  Wilcoxon signed-rank test:
    P(x > y) != 0.5: p = 0.216
    P(x > y) < 0.5: p = 0.892
   ~P(x > y) > 0.5: p = 0.108
visitCount_spO2ReferenceTable:
  mean diff = 3.955; stdev diff = 7.719
  Paired t-test:
  **|diff| > 0: p = 0.029
    diff < 0: p = 0.986
  **diff > 0: p = 0.014
  Wilcox

Observations:

* Wilcoxon signed-rank and paired t-tests seem to behave generally the same way.
* In the easy scenario, compared with the hard scenario, subjects look more frequently at the SpO2 reference table, the graph, and the combined SpO2 elements, and possibly more frequently at the monitor, the SpO2 reading, and the combined FiO2 elements.
* These results generally correspond to the duration results.
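
To show how the `mean diff` and `stdev diff` lines relate to the paired t-test p-values, the sketch below recomputes the test from summary statistics alone. It is a worked example under an assumption, not the notebook's own code: the helper `t_test_from_summary` is hypothetical, and it assumes the printed `stdev diff` is a population standard deviation (`ddof=0`, NumPy's default), which the source does not state.

```python
import math
from scipy import stats


def t_test_from_summary(mean_diff, stdev_diff, n, ddof=0):
    """Paired t-test p-values from summary statistics of the differences.

    ddof=0 treats stdev_diff as a population standard deviation; this is
    an assumption about the printed output, not something the source states.
    """
    # Convert the given standard deviation to the sample (ddof=1) version.
    sample_sd = stdev_diff * math.sqrt((n - ddof) / (n - 1))
    t_stat = mean_diff / (sample_sd / math.sqrt(n))
    df = n - 1
    p_less = stats.t.cdf(t_stat, df)            # Ha: diff < 0
    p_greater = stats.t.sf(t_stat, df)          # Ha: diff > 0
    p_two_sided = 2 * stats.t.sf(abs(t_stat), df)  # Ha: |diff| > 0
    return t_stat, p_less, p_two_sided, p_greater


# Summary values printed above for visitCount_spO2ReferenceTable (22 pairs).
t_stat, p_less, p_two_sided, p_greater = t_test_from_summary(3.955, 7.719, 22)
print(f't = {t_stat:.3f}')
print(f'|diff| > 0: p = {p_two_sided:.3f}')
print(f'diff < 0: p = {p_less:.3f}')
print(f'diff > 0: p = {p_greater:.3f}')
```

Under the `ddof=0` assumption the recomputed values should land close to the printed p-values for that outcome; passing `ddof=1` instead gives a slightly smaller two-tailed p-value.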

### Display Type Pairing

In [6]:
display_pairing = statistics.build_pairing(full_df, 'displayType')

In [7]:
display_pairing.describe()

Pairing against displayType:
  0: minimal vs. 1: full
  26 0 vs. 1 pairs.
  Paired t-test alternative hypotheses:
    Ha left-tailed (diff < 0): mean 0 - mean 1 < 0
    Ha two-tailed (|diff| > 0): mean 0 - mean 1 != 0
    Ha right-tailed (diff > 0): mean 0 - mean 1 > 0
  Wilcoxon signed-rank alternative hypotheses:
    Ha left-tailed (P(x > y) < 0.5)
    Ha two-tailed (P(x > y) != 0.5)
    Ha right-tailed (P(x > y) > 0.5)


In [8]:
statistics.test_gaze_count_outcomes(display_pairing)

visitCount_infant:
  mean diff = 1.000; stdev diff = 9.081
  Paired t-test:
    |diff| > 0: p = 0.587
    diff < 0: p = 0.707
    diff > 0: p = 0.293
  Wilcoxon signed-rank test:
    P(x > y) != 0.5: p = 0.399
    P(x > y) < 0.5: p = 0.801
   ~P(x > y) > 0.5: p = 0.199
visitCount_warmerInstrumentPanel:
  mean diff = 0.615; stdev diff = 9.507
  Paired t-test:
    |diff| > 0: p = 0.749
    diff < 0: p = 0.626
    diff > 0: p = 0.374
  Wilcoxon signed-rank test:
    P(x > y) != 0.5: p = 0.830
    P(x > y) < 0.5: p = 0.585
    P(x > y) > 0.5: p = 0.415
visitCount_fiO2Dial:
  mean diff = 2.500; stdev diff = 6.362
  Paired t-test:
   *|diff| > 0: p = 0.061
    diff < 0: p = 0.970
  **diff > 0: p = 0.030
  Wilcoxon signed-rank test:
   *P(x > y) != 0.5: p = 0.069
    P(x > y) < 0.5: p = 0.966
  **P(x > y) > 0.5: p = 0.034
visitCount_spO2ReferenceTable:
  mean diff = 1.692; stdev diff = 5.750
  Paired t-test:
   ~|diff| > 0: p = 0.154
    diff < 0: p = 0.923
   *diff > 0: p = 0.077
  Wilcoxon 

Observations:

* Wilcoxon signed-rank and paired t-tests seem to behave generally the same way.
* With the full display, compared with the minimal display, subjects look less frequently at the FiO2 dial, the monitor, the blank parts of the monitor, the Apgar timer, and the heart rate, and possibly less frequently at the SpO2 reference table; they look more frequently at the combined FiO2 elements and the combined SpO2 elements.
* These results generally correspond to duration results.

### Display Type Pairing, Split by Scenario

In [9]:
scenario_display_pairings = {
    scenario: statistics.build_pairing(scenario_subset, 'displayType')
    for (scenario, scenario_subset) in enumerate(scenario_pairing)
}
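
The idea of splitting by scenario and then pairing on display type can be sketched with plain pandas. This is only an illustration under stated assumptions: `paired_differences` and `toy_df` are hypothetical, the values are made up, and the actual behavior of `statistics.build_pairing` is not shown in this notebook. The column names follow `full_df.columns` above.

```python
import pandas as pd

# Toy rows for illustration only; values are made up, not study data.
toy_df = pd.DataFrame({
    'subjectNumber': [1, 1, 1, 1, 2, 2, 2, 2],
    'scenarioType': [0, 0, 1, 1, 0, 0, 1, 1],
    'displayType': [0, 1, 0, 1, 0, 1, 0, 1],
    'visitCount_fiO2Dial': [9, 6, 12, 11, 7, 7, 10, 6],
})


def paired_differences(df, pair_column, outcome):
    """Per-subject difference: outcome at level 0 minus outcome at level 1."""
    # One row per subject, one column per level of the pairing variable.
    wide = df.pivot(index='subjectNumber', columns=pair_column, values=outcome)
    return wide[0] - wide[1]


# Split by scenario type first, then pair on display type within each subset.
diffs_by_scenario = {
    scenario: paired_differences(subset, 'displayType', 'visitCount_fiO2Dial')
    for scenario, subset in toy_df.groupby('scenarioType')
}

for scenario, diffs in diffs_by_scenario.items():
    print(f'scenarioType {scenario}: mean diff = {diffs.mean():.3f}')
```

Each resulting series holds one minimal-minus-full difference per subject, which is the quantity the paired tests operate on.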

#### Easy Scenarios

In [10]:
scenario_display_pairings[0].describe()

Pairing against displayType:
  0: minimal vs. 1: full
  11 0 vs. 1 pairs.
  Paired t-test alternative hypotheses:
    Ha left-tailed (diff < 0): mean 0 - mean 1 < 0
    Ha two-tailed (|diff| > 0): mean 0 - mean 1 != 0
    Ha right-tailed (diff > 0): mean 0 - mean 1 > 0
  Wilcoxon signed-rank alternative hypotheses:
    Ha left-tailed (P(x > y) < 0.5)
    Ha two-tailed (P(x > y) != 0.5)
    Ha right-tailed (P(x > y) > 0.5)


In [11]:
statistics.test_gaze_count_outcomes(scenario_display_pairings[0])

visitCount_infant:
  mean diff = 0.091; stdev diff = 10.317
  Paired t-test:
    |diff| > 0: p = 0.978
    diff < 0: p = 0.511
    diff > 0: p = 0.489
  Wilcoxon signed-rank test:
    P(x > y) != 0.5: p = 0.858
    P(x > y) < 0.5: p = 0.571
    P(x > y) > 0.5: p = 0.429
visitCount_warmerInstrumentPanel:
  mean diff = -0.909; stdev diff = 8.426
  Paired t-test:
    |diff| > 0: p = 0.740
    diff < 0: p = 0.370
    diff > 0: p = 0.630
  Wilcoxon signed-rank test:
    P(x > y) != 0.5: p = 0.635
    P(x > y) < 0.5: p = 0.318
    P(x > y) > 0.5: p = 0.682
visitCount_fiO2Dial:
  mean diff = 2.727; stdev diff = 6.648
  Paired t-test:
    |diff| > 0: p = 0.224
    diff < 0: p = 0.888
   ~diff > 0: p = 0.112
  Wilcoxon signed-rank test:
   ~P(x > y) != 0.5: p = 0.138
    P(x > y) < 0.5: p = 0.931
   *P(x > y) > 0.5: p = 0.069
visitCount_spO2ReferenceTable:
  mean diff = -0.818; stdev diff = 6.176
  Paired t-test:
    |diff| > 0: p = 0.684
    diff < 0: p = 0.342
    diff > 0: p = 0.658
  Wilcox



Observations:

* Wilcoxon signed-rank and paired t-tests seem to behave generally the same way.
* With the full display, compared with the minimal display, subjects look less frequently at the Apgar timer and the heart rate, and possibly less frequently at the blank parts of the monitor; they look more frequently at the combined FiO2 elements and the combined SpO2 elements.
* This is generally similar to the results for easy and hard scenarios combined.

#### Hard Scenarios

In [12]:
scenario_display_pairings[1].describe()

Pairing against displayType:
  0: minimal vs. 1: full
  11 0 vs. 1 pairs.
  Paired t-test alternative hypotheses:
    Ha left-tailed (diff < 0): mean 0 - mean 1 < 0
    Ha two-tailed (|diff| > 0): mean 0 - mean 1 != 0
    Ha right-tailed (diff > 0): mean 0 - mean 1 > 0
  Wilcoxon signed-rank alternative hypotheses:
    Ha left-tailed (P(x > y) < 0.5)
    Ha two-tailed (P(x > y) != 0.5)
    Ha right-tailed (P(x > y) > 0.5)


In [13]:
statistics.test_gaze_count_outcomes(scenario_display_pairings[1])

visitCount_infant:
  mean diff = 0.091; stdev diff = 6.097
  Paired t-test:
    |diff| > 0: p = 0.963
    diff < 0: p = 0.518
    diff > 0: p = 0.482
  Wilcoxon signed-rank test:
    P(x > y) != 0.5: p = 0.789
    P(x > y) < 0.5: p = 0.605
    P(x > y) > 0.5: p = 0.395
visitCount_warmerInstrumentPanel:
  mean diff = 1.273; stdev diff = 11.104
  Paired t-test:
    |diff| > 0: p = 0.725
    diff < 0: p = 0.638
    diff > 0: p = 0.362
  Wilcoxon signed-rank test:
    P(x > y) != 0.5: p = 0.722
    P(x > y) < 0.5: p = 0.639
    P(x > y) > 0.5: p = 0.361
visitCount_fiO2Dial:
  mean diff = 1.455; stdev diff = 5.229
  Paired t-test:
    |diff| > 0: p = 0.400
    diff < 0: p = 0.800
   ~diff > 0: p = 0.200
  Wilcoxon signed-rank test:
    P(x > y) != 0.5: p = 0.422
    P(x > y) < 0.5: p = 0.789
    P(x > y) > 0.5: p = 0.211
visitCount_spO2ReferenceTable:
  mean diff = 2.182; stdev diff = 3.157
  Paired t-test:
   *|diff| > 0: p = 0.054
    diff < 0: p = 0.973
  **diff > 0: p = 0.027
  Wilcoxon

Observations:

* Wilcoxon signed-rank and paired t-tests seem to behave generally the same way.
* With the full display, compared with the minimal display, subjects look less frequently at the SpO2 reference table, the blank parts of the monitor, the Apgar timer, and the heart rate, and possibly less frequently at the monitor; they look more frequently at the combined FiO2 elements and the combined SpO2 elements.
* These results resemble those for easy and hard scenarios combined only in the combined FiO2 test.

#### Summary

* In both scenario types, subjects look at the combined FiO2 elements and combined SpO2 elements more frequently, and at the Apgar timer and heart rate less frequently, with the full display (which shows the FiO2 reading and the SpO2 graph) than with the minimal display (which shows neither).

### Scenario Order Pairing, Split by Scenario

In [14]:
scenario_order_pairings = {
    0: statistics.build_pairing(scenario_pairing[0], 'scenarioNumber', values=(1, 4), check_validity=False),
    1: statistics.build_pairing(scenario_pairing[1], 'scenarioNumber', values=(2, 3), check_validity=False),
}

#### Easy Scenarios

In [15]:
scenario_order_pairings[0].describe()

Pairing against scenarioNumber:
  0: first vs. 1: second
  11 0 vs. 1 pairs.
  Paired t-test alternative hypotheses:
    Ha left-tailed (diff < 0): mean 0 - mean 1 < 0
    Ha two-tailed (|diff| > 0): mean 0 - mean 1 != 0
    Ha right-tailed (diff > 0): mean 0 - mean 1 > 0
  Wilcoxon signed-rank alternative hypotheses:
    Ha left-tailed (P(x > y) < 0.5)
    Ha two-tailed (P(x > y) != 0.5)
    Ha right-tailed (P(x > y) > 0.5)


In [16]:
statistics.test_gaze_count_outcomes(scenario_order_pairings[0])

visitCount_infant:
  mean diff = 3.182; stdev diff = 9.815
  Paired t-test:
    |diff| > 0: p = 0.329
    diff < 0: p = 0.835
   ~diff > 0: p = 0.165
  Wilcoxon signed-rank test:
    P(x > y) != 0.5: p = 0.339
    P(x > y) < 0.5: p = 0.831
   ~P(x > y) > 0.5: p = 0.169
visitCount_warmerInstrumentPanel:
  mean diff = 2.727; stdev diff = 8.024
  Paired t-test:
    |diff| > 0: p = 0.308
    diff < 0: p = 0.846
   ~diff > 0: p = 0.154
  Wilcoxon signed-rank test:
    P(x > y) != 0.5: p = 0.236
    P(x > y) < 0.5: p = 0.882
   ~P(x > y) > 0.5: p = 0.118
visitCount_fiO2Dial:
  mean diff = 1.818; stdev diff = 6.952
  Paired t-test:
    |diff| > 0: p = 0.428
    diff < 0: p = 0.786
    diff > 0: p = 0.214
  Wilcoxon signed-rank test:
    P(x > y) != 0.5: p = 0.413
    P(x > y) < 0.5: p = 0.793
    P(x > y) > 0.5: p = 0.207
visitCount_spO2ReferenceTable:
  mean diff = 1.364; stdev diff = 6.079
  Paired t-test:
    |diff| > 0: p = 0.494
    diff < 0: p = 0.753
    diff > 0: p = 0.247
  Wilcoxon 



No significant differences here.

#### Hard Scenarios

In [17]:
scenario_order_pairings[1].describe()

Pairing against scenarioNumber:
  0: first vs. 1: second
  11 0 vs. 1 pairs.
  Paired t-test alternative hypotheses:
    Ha left-tailed (diff < 0): mean 0 - mean 1 < 0
    Ha two-tailed (|diff| > 0): mean 0 - mean 1 != 0
    Ha right-tailed (diff > 0): mean 0 - mean 1 > 0
  Wilcoxon signed-rank alternative hypotheses:
    Ha left-tailed (P(x > y) < 0.5)
    Ha two-tailed (P(x > y) != 0.5)
    Ha right-tailed (P(x > y) > 0.5)


In [18]:
statistics.test_gaze_count_outcomes(scenario_order_pairings[1])

visitCount_infant:
  mean diff = -1.000; stdev diff = 6.015
  Paired t-test:
    |diff| > 0: p = 0.611
    diff < 0: p = 0.305
    diff > 0: p = 0.695
  Wilcoxon signed-rank test:
    P(x > y) != 0.5: p = 0.476
    P(x > y) < 0.5: p = 0.238
    P(x > y) > 0.5: p = 0.762
visitCount_warmerInstrumentPanel:
  mean diff = -5.273; stdev diff = 9.854
  Paired t-test:
   ~|diff| > 0: p = 0.122
   *diff < 0: p = 0.061
    diff > 0: p = 0.939
  Wilcoxon signed-rank test:
   ~P(x > y) != 0.5: p = 0.130
   *P(x > y) < 0.5: p = 0.065
    P(x > y) > 0.5: p = 0.935
visitCount_fiO2Dial:
  mean diff = -1.455; stdev diff = 5.229
  Paired t-test:
    |diff| > 0: p = 0.400
   ~diff < 0: p = 0.200
    diff > 0: p = 0.800
  Wilcoxon signed-rank test:
    P(x > y) != 0.5: p = 0.448
    P(x > y) < 0.5: p = 0.224
    P(x > y) > 0.5: p = 0.776
visitCount_spO2ReferenceTable:
  mean diff = -1.273; stdev diff = 3.620
  Paired t-test:
    |diff| > 0: p = 0.292
   ~diff < 0: p = 0.146
    diff > 0: p = 0.854
  Wilco

Observations:

* Participants possibly look less frequently at the warmer's instrument panel and at the blank parts of the monitor in scenario 2 than in scenario 3.
* No other significant differences.

#### Summary

* There doesn't seem to be a learning or adaptation effect when we look at gaze counts. This is unlike the results for gaze durations.