# Plot Results for SST-2 Development Set 

In this notebook, we load the SST-2 development-set predictions of the models trained with different random seeds.
We print the mistakes made by each checkpoint and then measure the agreement between the models, both on all predictions and on their mistakes, with Fleiss' Kappa.

## Import and Load

In [1]:
import json
import pandas as pd
from nltk.metrics.agreement import AnnotationTask
from collections import Counter

In [2]:
def load_predictions_eval(file):
    """Load a list of predicted labels stored as JSON in `file`."""
    with open(file, "r") as f:
        preds = json.load(f)

    return preds

rs0_preds = load_predictions_eval("results/sst2-dev/rs0-shuffle-train-predictions.txt")
rs1_preds = load_predictions_eval("results/sst2-dev/rs1-shuffle-train-predictions.txt")
rs2_preds = load_predictions_eval("results/sst2-dev/rs2-shuffle-train-predictions.txt")
rs3_preds = load_predictions_eval("results/sst2-dev/rs3-shuffle-train-predictions.txt")
rs4_preds = load_predictions_eval("results/sst2-dev/rs4-shuffle-train-predictions.txt")
rs0_swa_preds = load_predictions_eval("results/sst2-dev/rs0-swa-linear-60-start2-drop-shuffle-predictions.txt")
rs1_swa_preds = load_predictions_eval("results/sst2-dev/rs1-swa-linear-75-start2-drop-shuffle-predictions.txt")
rs2_swa_preds = load_predictions_eval("results/sst2-dev/rs2-swa-linear-60-start2-drop-shuffle-predictions.txt")
rs3_swa_preds = load_predictions_eval("results/sst2-dev/rs3-swa-linear-60-start2-drop-shuffle-predictions.txt")
rs4_swa_preds = load_predictions_eval("results/sst2-dev/rs4-swa-linear-75-start2-drop-shuffle-predictions.txt")

In [3]:
rs5_preds = load_predictions_eval("results/sst2-dev/rs5-shuffle-train-4-predictions.txt")
rs6_preds = load_predictions_eval("results/sst2-dev/rs6-shuffle-train-2-predictions.txt")
rs7_preds = load_predictions_eval("results/sst2-dev/rs7-shuffle-train-2-predictions.txt")
rs8_preds = load_predictions_eval("results/sst2-dev/rs8-shuffle-train-3-predictions.txt")
rs9_preds = load_predictions_eval("results/sst2-dev/rs9-shuffle-train-4-predictions.txt")
rs5_swa_preds = load_predictions_eval("results/sst2-dev/rs5-swa-linear-60-start2-drop-shuffle-3-predictions.txt")
rs6_swa_preds = load_predictions_eval("results/sst2-dev/rs6-swa-linear-60-start2-drop-shuffle-7-predictions.txt")
rs7_swa_preds = load_predictions_eval("results/sst2-dev/rs7-swa-linear-60-start2-drop-shuffle-6-predictions.txt")
rs8_swa_preds = load_predictions_eval("results/sst2-dev/rs8-swa-linear-60-start2-drop-shuffle-4-predictions.txt")
rs9_swa_preds = load_predictions_eval("results/sst2-dev/rs9-swa-linear-75-start2-drop-shuffle-6-predictions.txt")

In [4]:
dev_df = pd.read_csv("~/Downloads/SST-2/dev.tsv", sep='\t')
og_labels = dev_df["label"].to_list()
og_samples = dev_df["sentence"].to_list()
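
The comparisons below index `og_labels` by position, so they rely on every prediction list having exactly one entry per development example. A minimal sketch of a sanity check follows; `check_aligned` and the toy lists are hypothetical names introduced only for illustration and are not part of the original notebook.

```python
def check_aligned(labels, preds_by_name):
    """Return the names of prediction lists whose length differs from the labels."""
    return [
        name
        for name, preds in preds_by_name.items()
        if len(preds) != len(labels)
    ]


# Toy data standing in for og_labels and the loaded prediction lists.
toy_labels = [0, 1, 1, 0]
toy_preds = {"rs0": [0, 1, 0, 0], "rs1": [0, 1, 1]}

misaligned = check_aligned(toy_labels, toy_preds)
print(misaligned)
```

In the notebook this could be called as `check_aligned(og_labels, {"rs0": rs0_preds, "rs1": rs1_preds})`; an empty list means all lengths match.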

## Printing Mistakes of Random Seeds

### Random Seed 0

In [78]:
# Print every development example that random seed 0 misclassifies
for i, pred in enumerate(rs0_preds):
    if og_labels[i] != pred:
        print(f"{i}: ", og_samples[i], "label: ", og_labels[i], "prediction: ", pred)

20:  pumpkin takes an admirable look at the hypocrisy of political correctness , but it does so with such an uneven tone that you never know when humor ends and tragedy begins .  label:  0 prediction:  1
22:  holden caulfield did it better .  label:  0 prediction:  1
44:  the title not only describes its main characters , but the lazy people behind the camera as well .  label:  0 prediction:  1
64:  the script kicks in , and mr. hartley 's distended pace and foot-dragging rhythms follow .  label:  0 prediction:  1
66:  if you 're hard up for raunchy college humor , this is your ticket right here .  label:  1 prediction:  0
87:  jaglom ... put ( s ) the audience in the privileged position of eavesdropping on his characters  label:  1 prediction:  0
92:  you wo n't like roger , but you will quickly recognize him .  label:  0 prediction:  1
93:  if steven soderbergh 's ` solaris ' is a failure it is a glorious failure .  label:  1 prediction:  0
95:  this riveting world war ii moral suspe
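
The same loop is repeated for every seed in the cells that follow. As a sketch of how it could be factored into a reusable function, assuming the labels, sentences, and predictions are plain lists of equal length (`find_mistakes`, `print_mistakes`, and the toy data are hypothetical names, not part of the original notebook):

```python
def find_mistakes(labels, preds):
    """Return (index, gold label, prediction) for every misclassified example."""
    return [
        (i, gold, pred)
        for i, (gold, pred) in enumerate(zip(labels, preds))
        if gold != pred
    ]


def print_mistakes(labels, samples, preds):
    """Print each misclassified example in the same layout as the cells here."""
    for i, gold, pred in find_mistakes(labels, preds):
        print(f"{i}: ", samples[i], "label: ", gold, "prediction: ", pred)


# Toy data standing in for og_labels, og_samples and one prediction list.
toy_labels = [0, 1, 1]
toy_samples = ["dull .", "a delight .", "worth seeing ."]
toy_preds = [0, 0, 1]

toy_mistakes = find_mistakes(toy_labels, toy_preds)
print_mistakes(toy_labels, toy_samples, toy_preds)
```

With such a helper, each per-seed cell would reduce to a single call such as `print_mistakes(og_labels, og_samples, rs1_preds)`.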

### Random Seed 1

In [79]:
# Print every development example that random seed 1 misclassifies
for i, pred in enumerate(rs1_preds):
    if og_labels[i] != pred:
        print(f"{i}: ", og_samples[i], "label: ", og_labels[i], "prediction: ", pred)

62:  the primitive force of this film seems to bubble up from the vast collective memory of the combatants .  label:  1 prediction:  0
92:  you wo n't like roger , but you will quickly recognize him .  label:  0 prediction:  1
93:  if steven soderbergh 's ` solaris ' is a failure it is a glorious failure .  label:  1 prediction:  0
95:  this riveting world war ii moral suspense story deals with the shadow side of american culture : racial prejudice in its ugly and diverse forms .  label:  0 prediction:  1
102:  does paint some memorable images ... , but makhmalbaf keeps her distance from the characters  label:  1 prediction:  0
112:  hilariously inept and ridiculous .  label:  1 prediction:  0
118:  every nanosecond of the the new guy reminds you that you could be doing something else far more pleasurable .  label:  0 prediction:  1
123:  turns potentially forgettable formula into something strangely diverting .  label:  1 prediction:  0
183:  the lower your expectations , the more you

### Random Seed 2

In [80]:
# Print every development example that random seed 2 misclassifies
for i, pred in enumerate(rs2_preds):
    if og_labels[i] != pred:
        print(f"{i}: ", og_samples[i], "label: ", og_labels[i], "prediction: ", pred)

85:  the movie achieves as great an impact by keeping these thoughts hidden as ... ( quills ) did by showing them .  label:  1 prediction:  0
92:  you wo n't like roger , but you will quickly recognize him .  label:  0 prediction:  1
93:  if steven soderbergh 's ` solaris ' is a failure it is a glorious failure .  label:  1 prediction:  0
95:  this riveting world war ii moral suspense story deals with the shadow side of american culture : racial prejudice in its ugly and diverse forms .  label:  0 prediction:  1
102:  does paint some memorable images ... , but makhmalbaf keeps her distance from the characters  label:  1 prediction:  0
112:  hilariously inept and ridiculous .  label:  1 prediction:  0
123:  turns potentially forgettable formula into something strangely diverting .  label:  1 prediction:  0
149:  the volatile dynamics of female friendship is the subject of this unhurried , low-key film that is so off-hollywood that it seems positively french in its rhythms and resonance 

### Random Seed 3

In [81]:
# Print every development example that random seed 3 misclassifies
for i, pred in enumerate(rs3_preds):
    if og_labels[i] != pred:
        print(f"{i}: ", og_samples[i], "label: ", og_labels[i], "prediction: ", pred)

66:  if you 're hard up for raunchy college humor , this is your ticket right here .  label:  1 prediction:  0
92:  you wo n't like roger , but you will quickly recognize him .  label:  0 prediction:  1
93:  if steven soderbergh 's ` solaris ' is a failure it is a glorious failure .  label:  1 prediction:  0
95:  this riveting world war ii moral suspense story deals with the shadow side of american culture : racial prejudice in its ugly and diverse forms .  label:  0 prediction:  1
112:  hilariously inept and ridiculous .  label:  1 prediction:  0
123:  turns potentially forgettable formula into something strangely diverting .  label:  1 prediction:  0
143:  a solid film ... but more conscientious than it is truly stirring .  label:  1 prediction:  0
149:  the volatile dynamics of female friendship is the subject of this unhurried , low-key film that is so off-hollywood that it seems positively french in its rhythms and resonance .  label:  1 prediction:  0
171:  rarely has leukemia lo

### Random Seed 4

In [82]:
# Print every development example that random seed 4 misclassifies
for i, pred in enumerate(rs4_preds):
    if og_labels[i] != pred:
        print(f"{i}: ", og_samples[i], "label: ", og_labels[i], "prediction: ", pred)

92:  you wo n't like roger , but you will quickly recognize him .  label:  0 prediction:  1
93:  if steven soderbergh 's ` solaris ' is a failure it is a glorious failure .  label:  1 prediction:  0
95:  this riveting world war ii moral suspense story deals with the shadow side of american culture : racial prejudice in its ugly and diverse forms .  label:  0 prediction:  1
112:  hilariously inept and ridiculous .  label:  1 prediction:  0
139:  it 's not the ultimate depression-era gangster movie .  label:  0 prediction:  1
149:  the volatile dynamics of female friendship is the subject of this unhurried , low-key film that is so off-hollywood that it seems positively french in its rhythms and resonance .  label:  1 prediction:  0
171:  rarely has leukemia looked so shimmering and benign .  label:  0 prediction:  1
183:  the lower your expectations , the more you 'll enjoy it .  label:  0 prediction:  1
189:  its story may be a thousand years old , but why did it have to seem like it t

### Random Seed SWA 0

In [83]:
# Print every development example that the SWA model of random seed 0 misclassifies
for i, pred in enumerate(rs0_swa_preds):
    if og_labels[i] != pred:
        print(f"{i}: ", og_samples[i], "label: ", og_labels[i], "prediction: ", pred)

20:  pumpkin takes an admirable look at the hypocrisy of political correctness , but it does so with such an uneven tone that you never know when humor ends and tragedy begins .  label:  0 prediction:  1
22:  holden caulfield did it better .  label:  0 prediction:  1
33:  if the movie succeeds in instilling a wary sense of ` there but for the grace of god , ' it is far too self-conscious to draw you deeply into its world .  label:  0 prediction:  1
37:  ( w ) hile long on amiable monkeys and worthy environmentalism , jane goodall 's wild chimpanzees is short on the thrills the oversize medium demands .  label:  0 prediction:  1
83:  though it 's become almost redundant to say so , major kudos go to leigh for actually casting people who look working-class .  label:  1 prediction:  0
92:  you wo n't like roger , but you will quickly recognize him .  label:  0 prediction:  1
93:  if steven soderbergh 's ` solaris ' is a failure it is a glorious failure .  label:  1 prediction:  0
95:  thi

### Random Seed SWA 1

In [84]:
# Print every development example that the SWA model of random seed 1 misclassifies
for i, pred in enumerate(rs1_swa_preds):
    if og_labels[i] != pred:
        print(f"{i}: ", og_samples[i], "label: ", og_labels[i], "prediction: ", pred)

92:  you wo n't like roger , but you will quickly recognize him .  label:  0 prediction:  1
93:  if steven soderbergh 's ` solaris ' is a failure it is a glorious failure .  label:  1 prediction:  0
95:  this riveting world war ii moral suspense story deals with the shadow side of american culture : racial prejudice in its ugly and diverse forms .  label:  0 prediction:  1
112:  hilariously inept and ridiculous .  label:  1 prediction:  0
123:  turns potentially forgettable formula into something strangely diverting .  label:  1 prediction:  0
183:  the lower your expectations , the more you 'll enjoy it .  label:  0 prediction:  1
200:  the format gets used best ... to capture the dizzying heights achieved by motocross and bmx riders , whose balletic hotdogging occasionally ends in bone-crushing screwups .  label:  1 prediction:  0
230:  reign of fire looks as if it was made without much thought -- and is best watched that way .  label:  1 prediction:  0
266:  a coda in every sense , 

### Random Seed SWA 2

In [85]:
# Print every development example that the SWA model of random seed 2 misclassifies
for i, pred in enumerate(rs2_swa_preds):
    if og_labels[i] != pred:
        print(f"{i}: ", og_samples[i], "label: ", og_labels[i], "prediction: ", pred)

92:  you wo n't like roger , but you will quickly recognize him .  label:  0 prediction:  1
93:  if steven soderbergh 's ` solaris ' is a failure it is a glorious failure .  label:  1 prediction:  0
95:  this riveting world war ii moral suspense story deals with the shadow side of american culture : racial prejudice in its ugly and diverse forms .  label:  0 prediction:  1
112:  hilariously inept and ridiculous .  label:  1 prediction:  0
121:  it seems to me the film is about the art of ripping people off without ever letting them consciously know you have done so  label:  0 prediction:  1
158:  by getting myself wrapped up in the visuals and eccentricities of many of the characters , i found myself confused when it came time to get to the heart of the movie .  label:  0 prediction:  1
171:  rarely has leukemia looked so shimmering and benign .  label:  0 prediction:  1
183:  the lower your expectations , the more you 'll enjoy it .  label:  0 prediction:  1
200:  the format gets used

### Random Seed SWA 3

In [86]:
# Print every development example that the SWA model of random seed 3 misclassifies
for i, pred in enumerate(rs3_swa_preds):
    if og_labels[i] != pred:
        print(f"{i}: ", og_samples[i], "label: ", og_labels[i], "prediction: ", pred)

92:  you wo n't like roger , but you will quickly recognize him .  label:  0 prediction:  1
93:  if steven soderbergh 's ` solaris ' is a failure it is a glorious failure .  label:  1 prediction:  0
95:  this riveting world war ii moral suspense story deals with the shadow side of american culture : racial prejudice in its ugly and diverse forms .  label:  0 prediction:  1
112:  hilariously inept and ridiculous .  label:  1 prediction:  0
123:  turns potentially forgettable formula into something strangely diverting .  label:  1 prediction:  0
171:  rarely has leukemia looked so shimmering and benign .  label:  0 prediction:  1
183:  the lower your expectations , the more you 'll enjoy it .  label:  0 prediction:  1
200:  the format gets used best ... to capture the dizzying heights achieved by motocross and bmx riders , whose balletic hotdogging occasionally ends in bone-crushing screwups .  label:  1 prediction:  0
218:  all that 's missing is the spontaneity , originality and deligh

### Random Seed SWA 4

In [87]:
for i, pred in enumerate(rs4_swa_preds):
    if og_labels[i] != pred:
        print(f"{i}: ", og_samples[i], "label: ", og_labels[i], "prediction: ", pred)

92:  you wo n't like roger , but you will quickly recognize him .  label:  0 prediction:  1
93:  if steven soderbergh 's ` solaris ' is a failure it is a glorious failure .  label:  1 prediction:  0
95:  this riveting world war ii moral suspense story deals with the shadow side of american culture : racial prejudice in its ugly and diverse forms .  label:  0 prediction:  1
112:  hilariously inept and ridiculous .  label:  1 prediction:  0
115:  sam mendes has become valedictorian at the school for soft landings and easy ways out .  label:  0 prediction:  1
171:  rarely has leukemia looked so shimmering and benign .  label:  0 prediction:  1
183:  the lower your expectations , the more you 'll enjoy it .  label:  0 prediction:  1
189:  its story may be a thousand years old , but why did it have to seem like it took another thousand to tell it to us ?  label:  0 prediction:  1
200:  the format gets used best ... to capture the dizzying heights achieved by motocross and bmx riders , whose

## Fleiss' Kappa

### Agreement on predictions of the vanilla models
We measure Fleiss' Kappa over the predictions of the vanilla models on every development example, treating each model as an annotator.

In [27]:
# Uncomment this for predictions with all random seeds. 
# vanilla_preds = [rs0_preds, rs1_preds, rs2_preds, rs3_preds, rs4_preds, rs5_preds, rs6_preds, rs7_preds, rs8_preds, rs9_preds]

# Uncomment this for predictions with all random seeds, except random seed 0. 
# vanilla_preds = [rs1_preds, rs2_preds, rs3_preds, rs4_preds, rs5_preds, rs6_preds, rs7_preds, rs8_preds, rs9_preds]

# Uncomment this for predictions with initial five random seeds. 
# vanilla_preds = [rs0_preds, rs1_preds, rs2_preds, rs3_preds, rs4_preds]

# Uncomment this for predictions with random seeds 1-4 (the initial five without random seed 0).
vanilla_preds = [rs1_preds, rs2_preds, rs3_preds, rs4_preds]

# Build (coder, item, label) triples: each model is a coder and each example an item.
triples = []
for i, preds in enumerate(vanilla_preds):
    for j, pred in enumerate(preds): 
        triples.append((i, j, pred))

AnnotationTask(data=triples).multi_kappa()

0.9258441643835617
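
To make the statistic concrete, here is a minimal pure-Python sketch of the classic Fleiss (1971) formula, $\kappa = (\bar{P} - \bar{P}_e) / (1 - \bar{P}_e)$, computed from the labels all raters assign to each item. The function `fleiss_kappa` and the toy ratings are hypothetical and only illustrative. To our knowledge, NLTK documents `multi_kappa` as the Davies and Fleiss (1982) coefficient, so its value may differ slightly from this formula.

```python
from collections import Counter


def fleiss_kappa(ratings_per_item):
    """Fleiss' kappa; each inner list holds the labels all raters gave one item.

    Assumes every item is rated by the same number of raters (at least two)
    and that more than one label occurs overall.
    """
    n_items = len(ratings_per_item)
    n_raters = len(ratings_per_item[0])
    category_totals = Counter()
    observed = 0.0
    for ratings in ratings_per_item:
        counts = Counter(ratings)
        category_totals.update(counts)
        # Share of rater pairs that agree on this item.
        agreeing_pairs = sum(c * (c - 1) for c in counts.values())
        observed += agreeing_pairs / (n_raters * (n_raters - 1))
    p_bar = observed / n_items
    # Chance agreement from the overall label proportions.
    total_ratings = n_items * n_raters
    p_e = sum((t / total_ratings) ** 2 for t in category_totals.values())
    return (p_bar - p_e) / (1 - p_e)


# Toy example: four raters (models) labelling four items.
toy_ratings = [[1, 1, 1, 1], [0, 0, 0, 0], [1, 1, 0, 0], [1, 0, 0, 0]]
toy_kappa = fleiss_kappa(toy_ratings)
print(round(toy_kappa, 4))
```

Per-model prediction lists such as `vanilla_preds` can be turned into the per-item layout expected here with `list(zip(*vanilla_preds))`.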

### Agreement on predictions of the SWA models
We measure Fleiss' Kappa over the predictions of the SWA models on every development example, treating each model as an annotator.

In [28]:
# Uncomment this for predictions with all random seeds. 
# swa_preds = [rs0_swa_preds, rs1_swa_preds, rs2_swa_preds, rs3_swa_preds, rs4_swa_preds, rs5_swa_preds, rs6_swa_preds, rs7_swa_preds, rs8_swa_preds, rs9_swa_preds]

# Uncomment this for predictions with all random seeds, except random seed 0. 
# swa_preds = [rs1_swa_preds, rs2_swa_preds, rs3_swa_preds, rs4_swa_preds, rs5_swa_preds, rs6_swa_preds, rs7_swa_preds, rs8_swa_preds, rs9_swa_preds]

# Uncomment this for predictions with initial five random seeds. 
# swa_preds = [rs0_swa_preds, rs1_swa_preds, rs2_swa_preds, rs3_swa_preds, rs4_swa_preds]

# Uncomment this for predictions with random seeds 1-4 (the initial five without random seed 0).
swa_preds = [rs1_swa_preds, rs2_swa_preds, rs3_swa_preds, rs4_swa_preds]

# Build (coder, item, label) triples: each model is a coder and each example an item.
triples = []
for i, preds in enumerate(swa_preds):
    for j, pred in enumerate(preds): 
        triples.append((i, j, pred))

AnnotationTask(data=triples).multi_kappa()

0.9537450418570723

### Agreement on mistakes of the vanilla models
We measure Fleiss' Kappa restricted to the examples that at least one vanilla model misclassifies.

In [29]:
# Collect the indices of the examples that each vanilla model misclassifies
vanilla_mistakes_idxs = []
for model_preds in vanilla_preds: 
    model_idxs = []
    for i, pred in enumerate(model_preds):
        if og_labels[i] != pred:
            model_idxs.append(i)
    vanilla_mistakes_idxs.extend(model_idxs)
    
vanilla_mistakes_idxs_count = Counter(vanilla_mistakes_idxs)

triples = []
vanilla_mistakes_idxs = list(set(vanilla_mistakes_idxs))
for i, preds in enumerate(vanilla_preds):
    for j in vanilla_mistakes_idxs: 
        triples.append((i, j, preds[j]))

AnnotationTask(data=triples).multi_kappa()

0.22672487425263324

### Agreement on mistakes of the SWA models
We measure Fleiss' Kappa restricted to the examples that at least one SWA model misclassifies.

In [30]:
# Collect the indices of the examples that each SWA model misclassifies
swa_mistakes_idxs = []
for model_preds in swa_preds: 
    model_idxs = []
    for i, pred in enumerate(model_preds):
        if og_labels[i] != pred:
            model_idxs.append(i)
    swa_mistakes_idxs.extend(model_idxs)
    
swa_mistakes_idxs_count = Counter(swa_mistakes_idxs)
            
triples = []
swa_mistakes_idxs = list(set(swa_mistakes_idxs))
for i, preds in enumerate(swa_preds):
    for j in swa_mistakes_idxs: 
        triples.append((i, j, preds[j]))

AnnotationTask(data=triples).multi_kappa()

0.3603171980835949
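
The per-index counters built in the two cells above (`vanilla_mistakes_idxs_count` and `swa_mistakes_idxs_count`) are not inspected further. As a complementary view to the kappa values, one could summarize how many examples are misclassified by exactly k models. The sketch below uses a hypothetical helper `mistake_overlap` and toy data, not the notebook's actual predictions.

```python
from collections import Counter


def mistake_overlap(labels, preds_per_model):
    """Map 'number of models that got an example wrong' to 'number of such examples'."""
    wrong_counts = Counter(
        i
        for preds in preds_per_model
        for i, pred in enumerate(preds)
        if labels[i] != pred
    )
    return Counter(wrong_counts.values())


# Toy data: five examples, three models.
toy_labels = [0, 1, 1, 0, 1]
toy_preds = [
    [0, 0, 1, 1, 1],  # wrong on examples 1 and 3
    [0, 0, 1, 0, 1],  # wrong on example 1
    [1, 0, 1, 0, 1],  # wrong on examples 0 and 1
]

overlap = mistake_overlap(toy_labels, toy_preds)
print(dict(overlap))
```

Here one example is missed by all three models and two examples are missed by a single model each. A large share of mistakes shared by every model would be consistent with higher agreement on mistakes.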