# Analyse the results from the various hyperparameter variations on the i2b2 dataset

macro F1 scores are of the form ```8 way, Problem-Treatment, Problem-Test, Problem-Problem```

In [5]:
%load_ext autoreload

In [6]:
%autoreload

from scipy.stats import ttest_rel

In [7]:
def paired_ttest(score1, score2):
    """Print a paired t-test per metric; each input holds one 4-tuple of macro F1 scores per paired run."""
    # Transpose the per-run tuples into one column per metric (8 way plus three relation pairs).
    per_metric_score1 = list(zip(*score1))
    per_metric_score2 = list(zip(*score2))
    ttests = [ttest_rel(macro_f1_score1, macro_f1_score2) 
            for macro_f1_score1, macro_f1_score2 in zip(per_metric_score1, per_metric_score2)]
    print('8 way evaluation: \t', ttests[0])
    print('Problem-Treatment: \t', ttests[1])
    print('Problem-Test: \t\t', ttests[2])
    print('Problem-Problem: \t\t', ttests[3])


baseline model with default hyperparameters (no preprocessing, no handling of the other class, but with ranking loss)

In [8]:
baseline = [(90.35, 84.26, 92.58, 92.86), (88.71, 77.25, 92.89, 93.27), (89.57, 81.2, 92.55, 93.16), 
            (86.16, 75.21, 89.89, 91.82), (87.79, 78.66, 92.47, 89.47)]

## filter sizes variation

In [9]:
filter_234 = [(90.12, 81.33, 94.09, 92.73), (88.24, 76.07, 92.39, 93.69), (90.05, 82.05, 92.91, 93.45), 
              (86.63, 76.15, 90.19, 91.89), (87.56, 76.86, 92.71, 90.27)]

In [10]:
paired_ttest(baseline, filter_234)

8 way evaluation: 	 Ttest_relResult(statistic=-0.02028184785788131, pvalue=0.984789917560547)
Problem-Treatment: 	 Ttest_relResult(statistic=1.090051992778353, pvalue=0.33695740087286746)
Problem-Test: 		 Ttest_relResult(statistic=-1.185192872687789, pvalue=0.3015421300970829)
Problem-Problem: 		 Ttest_relResult(statistic=-1.8308285121912307, pvalue=0.14109181493838494)
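
To make the output above easier to read, here is a minimal sketch of the statistic that `ttest_rel` reports, computed by hand for the 8 way column of `baseline` vs `filter_234`. The helper name `paired_t_statistic` is hypothetical and only the standard library is used; the p-value additionally needs a t distribution with n - 1 degrees of freedom, which this sketch leaves to scipy.

```python
import math
import statistics

# 8 way macro F1 per run, copied from `baseline` and `filter_234` above.
baseline_8way = [90.35, 88.71, 89.57, 86.16, 87.79]
filter_234_8way = [90.12, 88.24, 90.05, 86.63, 87.56]


def paired_t_statistic(a, b):
    """Hypothetical helper: t = mean(d) / (stdev(d) / sqrt(n)) with d = a - b."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    # statistics.stdev is the sample standard deviation (n - 1 in the denominator).
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))


t = paired_t_statistic(baseline_8way, filter_234_8way)
print(round(t, 4))  # -0.0203, the 8 way statistic printed above
```

With only five paired runs the test has four degrees of freedom, so small mean differences like this one are far from significant.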


In [11]:
filter_345 = [(88.94, 81.33, 92.27, 91.4), (88.94, 79.15, 92.89, 92.31), (90.52, 82.55, 92.88, 94.78), 
              (85.92, 75.31, 90.0, 90.41), (88.03, 78.66, 92.15, 90.91)]

In [12]:
paired_ttest(baseline, filter_345)

8 way evaluation: 	 Ttest_relResult(statistic=0.11783025003278201, pvalue=0.9118820017251688)
Problem-Treatment: 	 Ttest_relResult(statistic=-0.10042747032174407, pvalue=0.924837241734427)
Problem-Test: 		 Ttest_relResult(statistic=0.30410706453338954, pvalue=0.7762100004139154)
Problem-Problem: 		 Ttest_relResult(statistic=0.22204036423310525, pvalue=0.8351583502608262)


## batch size

In [13]:
batch_70 = [(89.88, 83.9, 92.07, 92.38), (89.41, 81.2, 92.39, 92.79), (90.52, 84.12, 92.88, 93.1), 
            (86.4, 77.37, 89.66, 90.83), (88.73, 82.5, 92.43, 89.08)]    

In [14]:
paired_ttest(baseline, batch_70)

8 way evaluation: 	 Ttest_relResult(statistic=-1.7586249604151594, pvalue=0.15346291817209728)
Problem-Treatment: 	 Ttest_relResult(statistic=-3.1814351191075465, pvalue=0.03348920662616892)
Problem-Test: 		 Ttest_relResult(statistic=1.2101665350671142, pvalue=0.2928361845091716)
Problem-Problem: 		 Ttest_relResult(statistic=3.21937454634215, pvalue=0.03229970131798839)


In [15]:
batch_30 = [(88.24, 78.86, 91.62, 92.79), (88.71, 76.99, 92.82, 94.12), (88.15, 78.81, 91.4, 92.37), 
            (84.49, 74.31, 89.12, 88.46), (86.62, 77.73, 91.34, 88.39)]

In [16]:
paired_ttest(baseline, batch_30)

8 way evaluation: 	 Ttest_relResult(statistic=3.595571984216195, pvalue=0.02284806369508635)
Problem-Treatment: 	 Ttest_relResult(statistic=2.1375144205141874, pvalue=0.09936492100177781)
Problem-Test: 		 Ttest_relResult(statistic=4.106989462863282, pvalue=0.01476841806568388)
Problem-Problem: 		 Ttest_relResult(statistic=1.2683906604324637, pvalue=0.2734535337122975)


In [17]:
paired_ttest(batch_70, batch_30)

8 way evaluation: 	 Ttest_relResult(statistic=6.071216649722227, pvalue=0.0037180498383699136)
Problem-Treatment: 	 Ttest_relResult(statistic=11.237431176231217, pvalue=0.0003571876219289638)
Problem-Test: 		 Ttest_relResult(statistic=1.932743267937933, pvalue=0.12541911380143606)
Problem-Problem: 		 Ttest_relResult(statistic=0.6598157954083025, pvalue=0.5454282013681919)


Batch size of 70 is significantly better than batch size of 30 on the 8 way and Problem-Treatment scores. Against the original batch size the picture is mixed: batch 70 is better for Problem-Treatment (p=0.033), but baseline is better for Problem-Problem (p=0.032). This makes me think that changing batch size is not the right option, especially because it does not affect the overall 8 way evaluation (p=0.15).
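
The direction of each effect can be read from the sign of the statistic: `ttest_rel(a, b)` tests the mean of `a - b`, so in `paired_ttest(baseline, batch_70)` a negative statistic means `batch_70` scored higher. As a rough cross-check, the sketch below prints the mean per-run change for each metric. `mean_gain` and `METRICS` are hypothetical names introduced for illustration.

```python
METRICS = ("8 way", "Problem-Treatment", "Problem-Test", "Problem-Problem")

# Scores copied from the cells above.
baseline = [(90.35, 84.26, 92.58, 92.86), (88.71, 77.25, 92.89, 93.27), (89.57, 81.2, 92.55, 93.16),
            (86.16, 75.21, 89.89, 91.82), (87.79, 78.66, 92.47, 89.47)]
batch_70 = [(89.88, 83.9, 92.07, 92.38), (89.41, 81.2, 92.39, 92.79), (90.52, 84.12, 92.88, 93.1),
            (86.4, 77.37, 89.66, 90.83), (88.73, 82.5, 92.43, 89.08)]


def mean_gain(reference, candidate):
    """Hypothetical helper: mean per-run change (candidate - reference) for each metric."""
    gains = {}
    for name, ref_col, cand_col in zip(METRICS, zip(*reference), zip(*candidate)):
        diffs = [c - r for r, c in zip(ref_col, cand_col)]
        gains[name] = sum(diffs) / len(diffs)
    return gains


gains = mean_gain(baseline, batch_70)
for name, gain in gains.items():
    # Positive means batch_70 is higher than baseline on average.
    print(f"{name}: {gain:+.2f}")
```

The Problem-Treatment gain is roughly 2.5 points while the Problem-Problem loss is roughly 0.5 points, which is worth keeping in mind even though both differences reach p < 0.05.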

## num of epochs (worth exploring)

In [18]:
epochs_50 = [(89.65, 83.98, 91.92, 91.48), (88.24, 80.17, 90.91, 91.89), (90.28, 83.12, 92.59, 93.62), 
             (88.31, 81.22, 91.34, 90.35), (88.5, 81.67, 92.43, 89.08)] 

In [19]:
paired_ttest(baseline, epochs_50)

8 way evaluation: 	 Ttest_relResult(statistic=-0.9423002205056477, pvalue=0.39939274814304343)
Problem-Treatment: 	 Ttest_relResult(statistic=-2.676926321218678, pvalue=0.0554053238401088)
Problem-Test: 		 Ttest_relResult(statistic=0.42820440664183107, pvalue=0.6905518507940647)
Problem-Problem: 		 Ttest_relResult(statistic=2.195457007835645, pvalue=0.09312326892853189)


In [20]:
epochs_100 = [(90.35, 85.34, 92.39, 91.96), (88.71, 80.34, 91.65, 92.31), (90.05, 83.26, 92.06, 93.56), 
              (87.11, 77.97, 91.2, 89.87), (88.26, 80.0, 92.39, 90.04)]

In [21]:
paired_ttest(baseline, epochs_100)

8 way evaluation: 	 Ttest_relResult(statistic=-2.1380053305099627, pvalue=0.0993101237372443)
Problem-Treatment: 	 Ttest_relResult(statistic=-5.308335982824204, pvalue=0.00605299530981918)
Problem-Test: 		 Ttest_relResult(statistic=0.3326953259669462, pvalue=0.7560700112103909)
Problem-Problem: 		 Ttest_relResult(statistic=1.2101717241273928, pvalue=0.2928344003975201)


In [22]:
epochs_150 =  [(90.12, 84.12, 92.58, 92.04), (88.0, 76.07, 92.39, 92.79), (90.05, 83.62, 91.82, 93.56), 
               (86.16, 76.67, 89.76, 90.32), (89.44, 81.17, 93.44, 91.38)]

In [23]:
paired_ttest(baseline, epochs_150)

8 way evaluation: 	 Ttest_relResult(statistic=-0.5925349970564563, pvalue=0.5853663493950922)
Problem-Treatment: 	 Ttest_relResult(statistic=-1.395529720580799, pvalue=0.2353410521208601)
Problem-Test: 		 Ttest_relResult(statistic=0.2667325346846324, pvalue=0.8028614167287009)
Problem-Problem: 		 Ttest_relResult(statistic=0.16669077450791, pvalue=0.8757003732351485)


In [24]:
epochs_200 = [(90.12, 82.35, 93.3, 92.86), (88.0, 77.97, 91.6, 92.31), (90.05, 82.91, 91.82, 94.37), 
              (86.4, 76.99, 89.42, 91.4), (89.2, 80.67, 92.91, 91.85)]

In [25]:
paired_ttest(baseline, epochs_200)

8 way evaluation: 	 Ttest_relResult(statistic=-0.6665490036878363, pvalue=0.5415375543961087)
Problem-Treatment: 	 Ttest_relResult(statistic=-1.1849494127561362, pvalue=0.30162818414580445)
Problem-Test: 		 Ttest_relResult(statistic=0.7139659776618404, pvalue=0.5146966075570344)
Problem-Problem: 		 Ttest_relResult(statistic=-0.7341382437574384, pvalue=0.5035767927305833)


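100 epochs seems like a good idea: it is significantly better than baseline on Problem-Treatment (p=0.006) and not significantly worse on any other metric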

## Learning rate changes (worth exploring)

learning rate steps through 0.001, 0.0001, 0.00001, with the drops at 60 and 120 epochs
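
A minimal sketch of such a piecewise-constant schedule, for illustration only. The function name is hypothetical, the parameter names mirror the `lr_values` and `lr_boundaries` settings mentioned later in these notes, and it assumes the lower rate applies from the boundary epoch onward (epoch 60 already uses 0.0001); the actual training code may differ on that detail.

```python
from bisect import bisect_right


def piecewise_constant_lr(epoch, lr_values=(0.001, 0.0001, 0.00001), lr_boundaries=(60, 120)):
    """Hypothetical helper: learning rate for a 0-indexed epoch.

    Assumption: the rate drops at the boundary epoch itself.
    """
    # bisect_right counts how many boundaries are <= epoch, which indexes the active rate.
    return lr_values[bisect_right(lr_boundaries, epoch)]


for epoch in (0, 59, 60, 119, 120):
    print(epoch, piecewise_constant_lr(epoch))
```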

In [26]:
lr_decay = [(89.88, 84.21, 91.92, 92.04), (88.71, 80.17, 91.92, 91.89), (90.28, 83.62, 92.06, 94.02), 
            (88.07, 78.97, 92.11, 90.67), (88.73, 81.67, 92.11, 90.52)]

In [27]:
paired_ttest(baseline, lr_decay)

8 way evaluation: 	 Ttest_relResult(statistic=-1.5106728585730524, pvalue=0.20539377446754556)
Problem-Treatment: 	 Ttest_relResult(statistic=-3.7010846218944997, pvalue=0.020815663264479135)
Problem-Test: 		 Ttest_relResult(statistic=0.0901044036514339, pvalue=0.9325357573064759)
Problem-Problem: 		 Ttest_relResult(statistic=0.5580577990338955, pvalue=0.6065631216378966)


lr of 0.01

In [28]:
lr_high = [(82.59, 71.32, 89.06, 84.62), (83.29, 68.57, 90.03, 87.85), 
           (82.46, 70.12, 89.12, 85.19), (78.28, 61.85, 86.56, 82.95), (84.98, 73.42, 90.77, 87.11)]

In [29]:
paired_ttest(baseline, lr_high)

8 way evaluation: 	 Ttest_relResult(statistic=6.49793684448928, pvalue=0.0028934133845599573)
Problem-Treatment: 	 Ttest_relResult(statistic=6.827751379762409, pvalue=0.002406315455049697)
Problem-Test: 		 Ttest_relResult(statistic=8.812360866382733, pvalue=0.0009149337964702109)
Problem-Problem: 		 Ttest_relResult(statistic=5.449723081567165, pvalue=0.005507496252817981)


lr decay seems to help only Problem-Treatment (p=0.021), so it is worth exploring more; lr of 0.01 is significantly worse than baseline on every metric

## SGD momentum

In [30]:
sgd_momentum = [(88.0, 79.17, 90.21, 93.69), (87.29, 73.95, 90.62, 95.61), (87.91, 76.19, 91.44, 93.72), 
                (89.02, 82.46, 89.82, 94.27), (88.73, 82.25, 90.58, 92.05)]

In [31]:
paired_ttest(baseline, sgd_momentum)

8 way evaluation: 	 Ttest_relResult(statistic=0.33600820280283605, pvalue=0.753750808390631)
Problem-Treatment: 	 Ttest_relResult(statistic=0.20395486514170735, pvalue=0.848345165343118)
Problem-Test: 		 Ttest_relResult(statistic=3.5898928056394013, pvalue=0.02296401736789849)
Problem-Problem: 		 Ttest_relResult(statistic=-4.024938881496008, pvalue=0.0157996322773076)


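contradicting results for individual relations (baseline is better on Problem-Test, SGD momentum is better on Problem-Problem), but overall it does not seem to make a difference.

tests reveal that neither learning rate decay nor SGD momentum helps the overall 8 way evaluation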

## Border Size (default is best)

In [32]:
border_20 = [(90.12, 83.9, 92.35, 92.79), (88.94, 78.79, 92.89, 92.44), (89.81, 81.2, 92.55, 94.02), 
             (86.87, 75.63, 90.67, 92.44), (88.03, 77.31, 92.63, 91.45)]

In [33]:
paired_ttest(baseline, border_20)

8 way evaluation: 	 Ttest_relResult(statistic=-1.6011786190295072, pvalue=0.1845926891852554)
Problem-Treatment: 	 Ttest_relResult(statistic=-0.1055547867025079, pvalue=0.9210171353234092)
Problem-Test: 		 Ttest_relResult(statistic=-0.8295162373921146, pvalue=0.4534466539886332)
Problem-Problem: 		 Ttest_relResult(statistic=-1.0881948213156576, pvalue=0.33768508393083546)


In [34]:
border_minus1 = [(89.41, 81.51, 92.03, 93.27), (87.76, 75.74, 91.6, 93.69), (89.57, 80.0, 92.35, 94.47), 
                 (84.49, 73.9, 88.71, 89.4), (88.03, 78.51, 92.15, 91.23)]

In [35]:
paired_ttest(baseline, border_minus1)

8 way evaluation: 	 Ttest_relResult(statistic=1.9056713469377615, pvalue=0.12938986941425473)
Problem-Treatment: 	 Ttest_relResult(statistic=3.3352205703831355, pvalue=0.02896438998822179)
Problem-Test: 		 Ttest_relResult(statistic=3.174090145759352, pvalue=0.03372541516115184)
Problem-Problem: 		 Ttest_relResult(statistic=-0.4070383333681037, pvalue=0.7048196678884107)


In [36]:
border_1 = [(89.65, 79.67, 93.23, 94.22), (86.82, 74.68, 91.33, 91.56), (89.57, 84.21, 90.77, 92.83), 
            (86.4, 74.19, 91.69, 91.24), (89.91, 82.16, 93.12, 92.7)]

In [37]:
paired_ttest(baseline, border_1)

8 way evaluation: 	 Ttest_relResult(statistic=0.07013499061401354, pvalue=0.9474525919337373)
Problem-Treatment: 	 Ttest_relResult(statistic=0.21237788780010725, pvalue=0.842195819739582)
Problem-Test: 		 Ttest_relResult(statistic=0.06901126296413315, pvalue=0.9482928433277223)
Problem-Problem: 		 Ttest_relResult(statistic=-0.45684851972593227, pvalue=0.6714889018536513)


baseline (border 50) seems better than border minus 1 for 2 relation types (doesn't make a difference otherwise)

## Pos embedding size (Worth trying 50)

In [38]:
pos_10 = [(88.71, 81.51, 92.07, 90.5), (87.29, 76.07, 91.92, 90.91), (89.1, 80.0, 92.02, 93.86),
          (85.92, 76.15, 89.42, 90.5), (88.73, 80.33, 93.23, 89.96)]

In [39]:
paired_ttest(baseline, pos_10)

8 way evaluation: 	 Ttest_relResult(statistic=1.2254757323179004, pvalue=0.28761698362210264)
Problem-Treatment: 	 Ttest_relResult(statistic=0.6300659634835066, pvalue=0.5628514655696966)
Problem-Test: 		 Ttest_relResult(statistic=1.1838443804709968, pvalue=0.3020190608603854)
Problem-Problem: 		 Ttest_relResult(statistic=1.4535036614466064, pvalue=0.21975074732684807)


In [40]:
pos_50 = [(90.35, 83.54, 93.33, 92.38), (89.41, 79.66, 93.33, 92.86), (90.05, 79.48, 93.16, 95.32), 
          (86.4, 75.11, 89.54, 92.98), (88.26, 76.86, 93.37, 91.85)]

In [41]:
paired_ttest(baseline, pos_50)

8 way evaluation: 	 Ttest_relResult(statistic=-3.1694282750286904, pvalue=0.03387635516795916)
Problem-Treatment: 	 Ttest_relResult(statistic=0.5027439735877641, pvalue=0.6415628359178364)
Problem-Test: 		 Ttest_relResult(statistic=-2.149505130733302, pvalue=0.0980360224997812)
Problem-Problem: 		 Ttest_relResult(statistic=-1.5765323691783786, pvalue=0.190032887741472)


In [42]:
pos_80 = [(89.41, 82.16, 91.91, 92.92), (89.41, 81.86, 92.35, 92.31), (89.57, 80.85, 92.02, 94.42), 
          (87.11, 76.99, 90.37, 92.44), (88.5, 79.01, 92.27, 92.31)]

In [43]:
paired_ttest(baseline, pos_80)

8 way evaluation: 	 Ttest_relResult(statistic=-0.8238080954933396, pvalue=0.4563327432804614)
Problem-Treatment: 	 Ttest_relResult(statistic=-0.7614316297326074, pvalue=0.4888180565539835)
Problem-Test: 		 Ttest_relResult(statistic=1.4036839401305974, pvalue=0.23308328136329956)
Problem-Problem: 		 Ttest_relResult(statistic=-1.2046577730047159, pvalue=0.29473600635135333)


position embedding size 50 seems superior to baseline on the 8 way evaluation (p=0.034), but 80 doesn't

## Number of filters

In [44]:
filter_50 = [(89.88, 83.54, 92.27, 92.44), (88.71, 76.92, 93.09, 93.33), (89.57, 80.51, 92.02, 94.83), 
             (84.49, 73.6, 88.77, 89.72), (89.44, 81.48, 93.12, 91.77)]

In [45]:
paired_ttest(baseline, filter_50)

8 way evaluation: 	 Ttest_relResult(statistic=0.18382447836286242, pvalue=0.8630936863185209)
Problem-Treatment: 	 Ttest_relResult(statistic=0.13923452002205414, pvalue=0.895993730339263)
Problem-Test: 		 Ttest_relResult(statistic=0.7304291546421722, pvalue=0.5056079265734786)
Problem-Problem: 		 Ttest_relResult(statistic=-0.3864674270954, pvalue=0.7188284381821282)


In [46]:
filter_150 = [(89.41, 82.01, 91.95, 92.92), (89.88, 79.83, 93.61, 93.81), (89.81, 81.39, 92.27, 94.12), 
              (87.11, 78.33, 89.25, 92.92), (89.67, 81.67, 93.44, 91.77)]

In [47]:
paired_ttest(baseline, filter_150)

8 way evaluation: 	 Ttest_relResult(statistic=-1.3808558823522028, pvalue=0.23945877699881377)
Problem-Treatment: 	 Ttest_relResult(statistic=-1.2766937834956362, pvalue=0.2707913624551339)
Problem-Test: 		 Ttest_relResult(statistic=-0.08185385137617517, pvalue=0.9386951524551894)
Problem-Problem: 		 Ttest_relResult(statistic=-2.652899581063497, pvalue=0.05680956845917672)


150 filters helps for Problem-Problem (borderline, p=0.057) but also makes the model slower

## Early stop

In [48]:
early_stop = [(90.35, 82.7, 93.57, 92.86), (89.18, 80.52, 92.11, 92.92), (90.52, 83.26, 92.59, 94.42), 
              (85.44, 76.23, 88.06, 91.24), (88.73, 80.17, 92.95, 90.52)]

In [49]:
paired_ttest(baseline, early_stop)

8 way evaluation: 	 Ttest_relResult(statistic=-1.0404889156397397, pvalue=0.35686798954116095)
Problem-Treatment: 	 Ttest_relResult(statistic=-1.577924335332977, pvalue=0.18972127044632103)
Problem-Test: 		 Ttest_relResult(statistic=0.4432296133575453, pvalue=0.6805165046079849)
Problem-Problem: 		 Ttest_relResult(statistic=-0.7418765152232051, pvalue=0.49935883001607495)


no significant difference

So it looks like number of epochs, learning rate decay and position embedding size make a difference
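
As a rough way to double-check this summary, the sketch below lists, for three of the configurations, the metrics where the candidate beats baseline at p < 0.05. The `(statistic, pvalue)` pairs are rounded copies of the outputs above; `significant_gains`, `results` and `ALPHA` are hypothetical names, and no correction for running many tests is applied.

```python
ALPHA = 0.05
METRICS = ("8 way", "Problem-Treatment", "Problem-Test", "Problem-Problem")

# (statistic, pvalue) per metric from paired_ttest(baseline, <config>), rounded from the outputs above.
results = {
    "epochs_100": [(-2.138, 0.0993), (-5.308, 0.0061), (0.333, 0.7561), (1.210, 0.2928)],
    "lr_decay": [(-1.511, 0.2054), (-3.701, 0.0208), (0.090, 0.9325), (0.558, 0.6066)],
    "pos_50": [(-3.169, 0.0339), (0.503, 0.6416), (-2.150, 0.0980), (-1.577, 0.1900)],
}


def significant_gains(per_metric, alpha=ALPHA):
    """Hypothetical helper: metrics where the candidate is significantly higher than baseline."""
    # A negative statistic means the second argument (the candidate) scored higher.
    return [name for name, (stat, p) in zip(METRICS, per_metric) if p < alpha and stat < 0]


summary = {config: significant_gains(values) for config, values in results.items()}
for config, metrics in summary.items():
    print(config, "->", metrics)
```

Only `pos_50` moves the 8 way score at this threshold; the other two show their gain on Problem-Treatment.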

## More variations to num epochs, learning rate decay and position embedding size

num_epoches=100, pos_embed_size=50

In [51]:
epochs_100_pos_50 = [(90.12, 83.9, 92.82, 91.96), (89.41, 80.0, 92.86, 93.27), 
                      (90.28, 81.2, 92.84, 95.28), (88.54, 82.1, 89.71, 93.04), (88.5, 79.34, 93.44,89.96)]

In [52]:
paired_ttest(epochs_100, epochs_100_pos_50)

8 way evaluation: 	 Ttest_relResult(statistic=-1.6891339480712448, pvalue=0.1664618614912973)
Problem-Treatment: 	 Ttest_relResult(statistic=0.0676965702107323, pvalue=0.9492759893444138)
Problem-Test: 		 Ttest_relResult(statistic=-0.808675375217455, pvalue=0.4640543575629432)
Problem-Problem: 		 Ttest_relResult(statistic=-1.913167411639602, pvalue=0.12827691572669606)


In [54]:
paired_ttest(baseline, epochs_100_pos_50)

8 way evaluation: 	 Ttest_relResult(statistic=-2.0216429796412725, pvalue=0.11328550973117937)
Problem-Treatment: 	 Ttest_relResult(statistic=-1.4891760881954892, pvalue=0.2106774232824587)
Problem-Test: 		 Ttest_relResult(statistic=-1.303862028124255, pvalue=0.2622540621632008)
Problem-Problem: 		 Ttest_relResult(statistic=-1.1366412042887972, pvalue=0.319162804172232)


num_epoches=100, lr_values 0.001 0.0001, lr_boundaries 70

In [55]:
epochs_100_lr_decay = [(89.18, 83.84, 91.18, 91.07), (88.47, 78.97, 91.92, 92.31), (90.28, 83.62, 92.31, 93.62), 
                       (87.59, 78.11, 92.06, 89.87), (88.5, 80.0, 92.63, 90.52)]

In [56]:
paired_ttest(epochs_100, epochs_100_lr_decay)

8 way evaluation: 	 Ttest_relResult(statistic=0.31325513951613165, pvalue=0.7697411167331684)
Problem-Treatment: 	 Ttest_relResult(statistic=1.193852404892653, pvalue=0.298496215260095)
Problem-Test: 		 Ttest_relResult(statistic=-0.23855561210479237, pvalue=0.8231733142813319)
Problem-Problem: 		 Ttest_relResult(statistic=0.3127993773638927, pvalue=0.7700628702497245)


In [57]:
paired_ttest(baseline, epochs_100_lr_decay)

8 way evaluation: 	 Ttest_relResult(statistic=-0.6388275261258322, pvalue=0.5576808977223613)
Problem-Treatment: 	 Ttest_relResult(statistic=-2.7873089488994274, pvalue=0.04944775079453674)
Problem-Test: 		 Ttest_relResult(statistic=0.09038538315387412, pvalue=0.9323260915279812)
Problem-Problem: 		 Ttest_relResult(statistic=1.0628109130311894, pvalue=0.3477740151272388)


num_epoches=100, lr_values 0.001 0.0001, lr_boundaries 70, pos_embed_size=50

In [59]:
epochs_100_lr_decay_pos_embed_50 = [(89.88, 83.76, 92.35, 91.96), 
                                    (89.41, 81.03, 92.39, 92.86), (89.57, 80.69, 92.06, 94.42), (88.78, 81.74, 90.58, 92.92), (88.5, 80.83, 92.43, 89.96)]

In [60]:
paired_ttest(baseline, epochs_100_lr_decay_pos_embed_50)

8 way evaluation: 	 Ttest_relResult(statistic=-1.3522492463819107, pvalue=0.24769194708931258)
Problem-Treatment: 	 Ttest_relResult(statistic=-1.7137517856100406, pvalue=0.16172865555150615)
Problem-Test: 		 Ttest_relResult(statistic=0.521500239191536, pvalue=0.6295609025087125)
Problem-Problem: 		 Ttest_relResult(statistic=-0.7319363526981685, pvalue=0.5047818366148028)


maybe keep epochs 100, definitely keep pos embed size 50 if the epochs is 250, lr decay is worth keeping. 

finally, I think it is better to prioritize 100 epochs (run time is less, lr_decay)