#### Models evaluated in this notebook:
1. Zero-shot multi-label classifier (MLC), with trait descriptions (or keywords) in the prompt, no training.
2. Random classifier: a trivial baseline that randomly picks output labels with equal probabilities.
3. Few-shot MLC, with trait descriptions (or keywords) in the prompt, and with training.

## 1) Zero-shot classifier, with trait descriptions in prompt, no training
The file "templates.py" was modified to include trait descriptions (or keywords) in the prompt.
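The modified "templates.py" is not shown in this notebook, so the following is only a hypothetical sketch of how a prompt could list each trait together with its keywords. The function name `build_prompt`, the prompt wording, and the placeholder keywords are all assumptions made for illustration; they are not the template or keywords used in the experiments.

```python
# Hypothetical sketch: the real templates.py is not shown in this notebook.
# The keyword entries are placeholders, not the keywords used in the experiments.
TRAIT_KEYWORDS = {
    "work processes": ["<keyword 1>", "<keyword 2>"],
    "questioning attitude": ["<keyword 1>", "<keyword 2>"],
}


def build_prompt(text, trait_keywords, max_labels=4):
    """Assemble a multi-label prompt that lists every trait with its keywords."""
    trait_lines = [
        f"- {trait}: {', '.join(keywords)}"
        for trait, keywords in trait_keywords.items()
    ]
    return (
        f"Assign up to {max_labels} of the following safety traits to the text.\n"
        "Traits and related keywords:\n"
        + "\n".join(trait_lines)
        + f"\n\nText: {text}\n"
        "Answer with a Python list of trait names."
    )


prompt = build_prompt("The procedure was not followed.", TRAIT_KEYWORDS)
print(prompt)
```

Swapping the keyword lists for one-sentence trait descriptions would give the "descriptions" variant of the same prompt.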

#### Importing libraries

In [1]:
from gpt_zero_shot_clf import MultiLabelZeroShotGPTClassifier
import pandas as pd
import numpy as np
import ast
from sklearn.model_selection import train_test_split

#### Importing test set (Gold standard)

In [2]:
df_test = pd.read_csv("./Labels - data.csv")
df_test = df_test.replace(np.nan, '')
df_test.head()

Unnamed: 0,File Name,Power Plant,Sentence/Paragraph,Safety Trait #1,Safety Trait #2,Safety Trait #3,Safety Trait #4,Labelled by,Reviewed by,Review Notes,Corrected?
0,3462002008.pdf,Davis-Besse,"Following unit shutdown on February 16, 2002, ...",problem_identification,questioning_attitude,,,Anood,Ruturaj,,
1,3462002009.pdf,Davis-Besse,The cause of the cracks appears to be high cyc...,problem_identification,questioning_attitude,work_processes,,Anood,Ruturaj,,
2,3462003001.pdf,Davis-Besse,"These conditions, apparently caused by design ...",questioning_attitude,work_processes,,,Anood,Ruturaj,work_process ?,Yes
3,3462003002.pdf,Davis-Besse,The apparent cause of the HPI pump debris tole...,work_processes,continuous_learning,questioning_attitude,,Anood,Ruturaj,,
4,3462003004.pdf,Davis-Besse,The previous procedures used to calibrate the ...,work_processes,,,,Anood,Ruturaj,,


In [3]:
replacements = {
    'problem_identification':'problem identification and resolution',
    'work_processes':'work processes',
    'questioning_attitude':'questioning attitude',
    'continuous_learning':'continuous learning',
    'personal_accountability':'personal accountability',
    'respectful_environment':'respectful work environment',
    'decision_making':'decision making',
    'leadership_values':'leadership safety values and actions',
    'safety_communication':'effective safety communication',
    'environment_raising_concerns':'environment for raising concerns'
}
df_test = df_test.replace(replacements)
df_test.head()

Unnamed: 0,File Name,Power Plant,Sentence/Paragraph,Safety Trait #1,Safety Trait #2,Safety Trait #3,Safety Trait #4,Labelled by,Reviewed by,Review Notes,Corrected?
0,3462002008.pdf,Davis-Besse,"Following unit shutdown on February 16, 2002, ...",problem identification and resolution,questioning attitude,,,Anood,Ruturaj,,
1,3462002009.pdf,Davis-Besse,The cause of the cracks appears to be high cyc...,problem identification and resolution,questioning attitude,work processes,,Anood,Ruturaj,,
2,3462003001.pdf,Davis-Besse,"These conditions, apparently caused by design ...",questioning attitude,work processes,,,Anood,Ruturaj,work_process ?,Yes
3,3462003002.pdf,Davis-Besse,The apparent cause of the HPI pump debris tole...,work processes,continuous learning,questioning attitude,,Anood,Ruturaj,,
4,3462003004.pdf,Davis-Besse,The previous procedures used to calibrate the ...,work processes,,,,Anood,Ruturaj,,


#### Extracting paragraphs and labels

In [44]:
X = [sentence for sentence in df_test["Sentence/Paragraph"].tolist()]
len(X)

100

In [45]:
df_test["All Labels"] = df_test["Safety Trait #1"].astype(str) + "," + df_test["Safety Trait #2"].astype(str) + "," + df_test["Safety Trait #3"].astype(str) + "," + df_test["Safety Trait #4"].astype(str)
y = [[label for label in labels.split(",") if label] for labels in df_test["All Labels"].tolist()]
len(y)

100
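The label extraction above joins the four trait columns with commas and splits them again, which works as long as no trait name contains a comma. A minimal alternative sketch, assuming blank cells have already been replaced by empty strings as in the cells above, collects the non-empty cells directly. The helper name `collect_labels` is an assumption introduced for this example.

```python
def collect_labels(*cells):
    """Return the non-empty trait labels from one row, in column order."""
    labels = []
    for cell in cells:
        label = str(cell).strip()
        # Skip blank cells and avoid listing the same trait twice
        if label and label not in labels:
            labels.append(label)
    return labels


# One row of the four "Safety Trait" columns, with two blank cells
row = ("questioning attitude", "work processes", "", "")
labels = collect_labels(*row)
print(labels)
```

With pandas, the same helper could be applied row by row over the four trait columns.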

#### Candidate labels to be used by the gpt classifier

In [6]:
candidate_labels = [
'personal accountability',
'effective safety communication',
'respectful work environment',
'continuous learning',
'decision making',
'questioning attitude',
'leadership safety values and actions',
'environment for raising concerns',
'work processes',
'problem identification and resolution'
]

#### Fitting the gpt model (no training data provided)

In [7]:
import os

# Read the OpenAI key from the environment instead of hardcoding it in the notebook.
api_key2 = os.environ["OPENAI_API_KEY"]

clf = MultiLabelZeroShotGPTClassifier(openai_key=api_key2)

In [8]:
clf.fit(None, [candidate_labels])

#### Predicting on one instance to test

In [9]:
#Test on one instance
print("sentence:", X[1],'\n')
print("true:", y[1],'\n')
print("predicted:",clf.predict([X[1]]))

sentence: The cause of the cracks appears to be high cyclic thermal fatigue. This event was determined to not meet the requirements of a reportable condition under 10 CFR 50.73. However, due to the industry interest in HPI thermal sleeve failure, this event is being reported voluntarily as a Licensee Event Report in accordance with the guidance provided in Section 2.7 of NUREG-1022, Revision 2, Event Reporting Guidelines. 

true: ['problem identification and resolution', 'questioning attitude', 'work processes'] 



100%|█████████████████████████████████████████████| 1/1 [00:00<00:00,  1.07it/s]

predicted: [['problem identification and resolution', 'work processes']]





#### Predicting labels for all test set (100 instances)

In [10]:
labels_zeroshot1 = clf.predict(X)

100%|█████████████████████████████████████████| 100/100 [01:25<00:00,  1.17it/s]


In [12]:
# labels_zeroshot1[:10]

#### Saving results in CSV file for analysis

In [13]:
columnNamesForResult = ["Power Plant", "Sentence/Paragraph", "Predicted Labels", "True Labels"]
powerplants = []
sentenceOrParagraphs = []
predictedLabels = []
trueLabels = []

# Walk through paragraphs, predictions and gold labels in parallel
for causeOfEventDescription, pred_labels, true_labels in zip(X, labels_zeroshot1, y):
    # Look up the plant name by matching the paragraph text in the gold-standard table
    row = df_test.loc[df_test['Sentence/Paragraph'] == causeOfEventDescription]
    powerplants.append(row['Power Plant'].item())
    sentenceOrParagraphs.append(causeOfEventDescription)
    predictedLabels.append(sorted(pred_labels))
    trueLabels.append(sorted(true_labels))

resultingDF = pd.DataFrame(columns=columnNamesForResult, data=
                           {"Power Plant": powerplants,
                            "Sentence/Paragraph": sentenceOrParagraphs,
                            "Predicted Labels": predictedLabels,
                            "True Labels": trueLabels})

In [14]:
resultingDF
resultingDF.to_csv("ZeroShot_withKeywords.csv", index=False)
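The label lists are written to CSV as text, which is why the metric code later parses them with `ast.literal_eval`. The sketch below illustrates that round trip using only the standard library `csv` module rather than pandas; the in-memory buffer and the single example row are assumptions for illustration.

```python
import ast
import csv
import io

rows = [
    {
        "Predicted Labels": ["work processes"],
        "True Labels": ["questioning attitude", "work processes"],
    }
]

# Write the row: each list is stored as its string representation
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["Predicted Labels", "True Labels"])
writer.writeheader()
writer.writerows(rows)

# Read it back: the cell is now a plain string, not a list
buffer.seek(0)
loaded = list(csv.DictReader(buffer))
raw = loaded[0]["True Labels"]

# literal_eval safely turns the string back into a Python list
parsed = ast.literal_eval(raw)
print(type(raw).__name__, "->", type(parsed).__name__, parsed)
```

`ast.literal_eval` only accepts Python literals, so it is a safer choice than `eval` for parsing cells read from a file.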

#### Calculating Metrics

In [21]:
# Metrics functions (computed per sample)
def accuracy(list1, list2):
    # Jaccard index: shared labels divided by all distinct labels
    # (assumes no duplicate labels within a list)
    intersection = len(set(list1).intersection(list2))
    union = (len(list1) + len(list2)) - intersection
    return float(intersection)/union if union else 0.0

def precision(list1, list2):
    # Shared labels divided by the size of list1 (0 if list1 is empty)
    intersection = len(set(list1).intersection(list2))
    return float(intersection)/len(list1) if len(list1) else 0.0

def recall(list1, list2):
    # Shared labels divided by the size of list2 (0 if list2 is empty)
    intersection = len(set(list1).intersection(list2))
    return float(intersection)/len(list2) if len(list2) else 0.0

def f1(list1, list2):
    # Harmonic mean of precision and recall (0 if both are 0)
    precision_score = precision(list1, list2)
    recall_score = recall(list1, list2)
    try:
        f1_score = float(2*precision_score*recall_score)/(precision_score+recall_score)
    except ZeroDivisionError:
        f1_score = 0
    return f1_score
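As a worked example of these per-sample metrics, the sketch below computes them with sets for one prediction that misses one of three true labels. The helper name `sample_metrics` is an assumption introduced here; note that the "accuracy" used in this notebook is the Jaccard index of the predicted and true label sets.

```python
def sample_metrics(predicted, true):
    """Per-sample Jaccard index, precision, recall and F1 for two label lists."""
    pred_set, true_set = set(predicted), set(true)
    overlap = len(pred_set & true_set)
    union = len(pred_set | true_set)
    jaccard = overlap / union if union else 0.0
    precision = overlap / len(pred_set) if pred_set else 0.0
    recall = overlap / len(true_set) if true_set else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return jaccard, precision, recall, f1


predicted = ["problem identification and resolution", "work processes"]
true = [
    "problem identification and resolution",
    "questioning attitude",
    "work processes",
]

# 2 shared labels, 3 distinct labels overall, 2 predicted, 3 true:
# jaccard = 2/3, precision = 2/2, recall = 2/3, f1 = 0.8
jaccard, precision, recall, f1 = sample_metrics(predicted, true)
print(round(jaccard, 3), precision, round(recall, 3), round(f1, 3))
```

Working on sets also guards against duplicate labels and empty predictions, which would otherwise distort the union or cause a division by zero.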

In [22]:
def calculate_metrics(df: pd.DataFrame):
    all_accuracies = []
    all_precision = []
    all_recall = []
    all_f1 = []

    for index, row in df.iterrows():
        print(row["Sentence/Paragraph"],'\n')
        true_labels = ast.literal_eval(row["True Labels"])
        pred_labels = ast.literal_eval(row["Predicted Labels"])
        print("true:",true_labels,"\n")
        print("predicted:",pred_labels,"\n")
        accuracy_score = accuracy(pred_labels, true_labels)
        precision_score = precision(pred_labels, true_labels)
        recall_score = recall(pred_labels, true_labels)
        f1_score = f1(pred_labels, true_labels)

        all_accuracies.append(accuracy_score)
        all_precision.append(precision_score)
        all_recall.append(recall_score)
        all_f1.append(f1_score)
        print("\nAccuracy score:", accuracy_score)
        print("\nPrecision score:", precision_score)
        print("\nRecall score:", recall_score)
        print("\nF1 score:", f1_score)
        print(100*'-')

    avg_acc = sum(all_accuracies)/len(all_accuracies)
    avg_prec = sum(all_precision)/len(all_precision)
    avg_rec = sum(all_recall)/len(all_recall)
    avg_f1 = sum(all_f1)/len(all_f1)

    print("Average Accuracy on Test Data:", round(avg_acc*100,2), "%")
    print("Average Precision on Test Data:", round(avg_prec*100,2), "%")
    print("Average Recall on Test Data:", round(avg_rec*100,2), "%")
    print("Average F1-Score on Test Data:", round(avg_f1*100,2), "%")

#### Results when prompt included trait descriptions

In [24]:
df = pd.read_csv("ZeroShot_withDescriptions.csv")
print(f"Total test size: {len(df)}","\n")
calculate_metrics(df)

Total test size: 100 

Following unit shutdown on February 16, 2002, various degraded conditions were identified associated with the CACs which were documented in condition reports (CRs). The issues were related to structural integrity (seismic adequacy, boric acid corrosion, and post accident thermal stress); maintenance, test, and configuration control; thermal performance; and 10 CFR 21 reports. 

true: ['problem identification and resolution', 'questioning attitude', 'work processes'] 

predicted: ['problem identification and resolution', 'work processes'] 


Accuracy score: 0.6666666666666666

Precision score: 1.0

Recall score: 0.6666666666666666

F1 score: 0.8
----------------------------------------------------------------------------------------------------
The cause of the cracks appears to be high cyclic thermal fatigue. This event was determined to not meet the requirements of a reportable condition under 10 CFR 50.73. However, due to the industry interest in HPI thermal sl

Precision score: 0.5

Recall score: 0.5

F1 score: 0.5
----------------------------------------------------------------------------------------------------
On September 4, 2002, with the reactor defueled, investigations determined that a gap in the sump screen larger than allowed by design basis (greater than 1/4-inch openings) existed. Also, the existing amount of unqualified coatings and other debris inside containment could have potentially blocked the emergency sump intake screen, rendering the sump inoperable, following a loss of coolant accident. 

true: ['personal accountability', 'work processes'] 

predicted: ['problem identification and resolution', 'questioning attitude', 'work processes'] 


Accuracy score: 0.25

Precision score: 0.3333333333333333

Recall score: 0.5

F1 score: 0.4
----------------------------------------------------------------------------------------------------
The unprotected EDG Exhaust Piping and unprotected MSSVs have been in this condition since ini

#### Results when prompt included trait keywords

In [19]:
df2 = pd.read_csv("ZeroShot_withKeywords.csv")
print(f"Total test size: {len(df2)}")
calculate_metrics(df2)

Total test size: 100
Following unit shutdown on February 16, 2002, various degraded conditions were identified associated with the CACs which were documented in condition reports (CRs). The issues were related to structural integrity (seismic adequacy, boric acid corrosion, and post accident thermal stress); maintenance, test, and configuration control; thermal performance; and 10 CFR 21 reports. 

true: ['problem identification and resolution', 'questioning attitude'] 

predicted: ['problem identification and resolution', 'questioning attitude', 'work processes'] 


Accuracy score: 0.6666666666666666

Precision score: 0.6666666666666666

Recall score: 1.0

F1 score: 0.8
----------------------------------------------------------------------------------------------------
The cause of the cracks appears to be high cyclic thermal fatigue. This event was determined to not meet the requirements of a reportable condition under 10 CFR 50.73. However, due to the industry interest in HPI therma

Accuracy score: 1.0

Precision score: 1.0

Recall score: 1.0

F1 score: 1.0
----------------------------------------------------------------------------------------------------
An apparent cause evaluation concluded that PG&E did not have a design process requirement in place to evaluate GDC-2 design criteria relative to SSC functional requirements other than for structural integrity issues. 

true: ['work processes'] 

predicted: ['problem identification and resolution', 'work processes'] 


Accuracy score: 0.5

Precision score: 0.5

Recall score: 1.0

F1 score: 0.6666666666666666
----------------------------------------------------------------------------------------------------
Knowledge Gap — Operators did not have adequate knowledge of the established standards contained in OP 0-36 to prevent inadvertent operational events. 

true: ['continuous learning', 'personal accountability', 'questioning attitude'] 

predicted: ['continuous learning', 'personal accountability'] 


Accuracy 

#### Conclusions:
Using keywords related to each trait in the prompt made the model perform better than using trait descriptions.

## 2) Baseline Model: Random classifier
The trivial baseline: a classifier that randomly picks an output label, with equal probabilities.

#### Importing test set, merging all labels into single column

In [25]:
df_test2 = df_test.copy()
df_test2['True_Labels'] = df_test2.apply(lambda x: [x['Safety Trait #1'], x['Safety Trait #2'], x['Safety Trait #3'], x['Safety Trait #4']], axis=1)
df_test2.drop(['Safety Trait #1', 'Safety Trait #2', 'Safety Trait #3', 'Safety Trait #4'], axis=1, inplace=True)
df_test2.drop(['Labelled by', 'Reviewed by', 'Review Notes', 'Corrected?', 'All Labels'], axis=1, inplace=True)
df_test2['True_Labels'] = df_test2['True_Labels'].apply(lambda x: [val for val in x if not pd.isna(val)])
df_test2.head()

Unnamed: 0,File Name,Power Plant,Sentence/Paragraph,True_Labels
0,3462002008.pdf,Davis-Besse,"Following unit shutdown on February 16, 2002, ...","[problem identification and resolution, questi..."
1,3462002009.pdf,Davis-Besse,The cause of the cracks appears to be high cyc...,"[problem identification and resolution, questi..."
2,3462003001.pdf,Davis-Besse,"These conditions, apparently caused by design ...","[questioning attitude, work processes, , ]"
3,3462003002.pdf,Davis-Besse,The apparent cause of the HPI pump debris tole...,"[work processes, continuous learning, question..."
4,3462003004.pdf,Davis-Besse,The previous procedures used to calibrate the ...,"[work processes, , , ]"


#### Encoding labels (converting them to numerical format)

In [26]:
#Encoding labels (converting them to numerical format)
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(df_test2['True_Labels'].explode().unique())

# Convert true labels to numerical format
df_test2['true_labels_num'] = df_test2['True_Labels'].apply(lambda x: le.transform(x))
df_test2.head()

Unnamed: 0,File Name,Power Plant,Sentence/Paragraph,True_Labels,true_labels_num
0,3462002008.pdf,Davis-Besse,"Following unit shutdown on February 16, 2002, ...","[problem identification and resolution, questi...","[7, 8, 0, 0]"
1,3462002009.pdf,Davis-Besse,The cause of the cracks appears to be high cyc...,"[problem identification and resolution, questi...","[7, 8, 10, 0]"
2,3462003001.pdf,Davis-Besse,"These conditions, apparently caused by design ...","[questioning attitude, work processes, , ]","[8, 10, 0, 0]"
3,3462003002.pdf,Davis-Besse,The apparent cause of the HPI pump debris tole...,"[work processes, continuous learning, question...","[10, 1, 8, 0]"
4,3462003004.pdf,Davis-Besse,The previous procedures used to calibrate the ...,"[work processes, , , ]","[10, 0, 0, 0]"


#### Function to randomly generate labels (up to 4 labels per paragraph)

In [27]:
# For each example in the dataframe, randomly generate predicted labels.
# random.choices() samples with replacement, so duplicates are rejected until 4 unique labels are collected.

import random

def generate_random_labels(prob_dict):
    # Weighted variant (not used below): prob_dict maps label -> probability of occurrence
    labels = list(prob_dict.keys())
    probs = list(prob_dict.values())
    choices = []
    while len(choices) < 4:
        # random.choices returns a list, so take its single element
        selection = random.choices(labels, weights=probs)[0]
        if selection not in choices:
            choices.append(selection)
    return choices

def equal_prob_labels():
    choices = []
    while len(choices) < 4:
        selection = random.randint(0, 9)
        if selection not in choices:
            choices.append(selection)
    return choices

#NEW BASELINE, ASSUMING EQUAL PROBABILITIES
df_test2['baseline_labels2'] = df_test2.apply(lambda x: equal_prob_labels(), axis=1)

df_test2.head()

Unnamed: 0,File Name,Power Plant,Sentence/Paragraph,True_Labels,true_labels_num,baseline_labels2
0,3462002008.pdf,Davis-Besse,"Following unit shutdown on February 16, 2002, ...","[problem identification and resolution, questi...","[7, 8, 0, 0]","[3, 4, 1, 9]"
1,3462002009.pdf,Davis-Besse,The cause of the cracks appears to be high cyc...,"[problem identification and resolution, questi...","[7, 8, 10, 0]","[8, 5, 7, 6]"
2,3462003001.pdf,Davis-Besse,"These conditions, apparently caused by design ...","[questioning attitude, work processes, , ]","[8, 10, 0, 0]","[2, 8, 3, 1]"
3,3462003002.pdf,Davis-Besse,The apparent cause of the HPI pump debris tole...,"[work processes, continuous learning, question...","[10, 1, 8, 0]","[8, 0, 4, 1]"
4,3462003004.pdf,Davis-Besse,The previous procedures used to calibrate the ...,"[work processes, , , ]","[10, 0, 0, 0]","[2, 7, 1, 4]"


#### Monte Carlo Simulation
To estimate the average performance of the baseline model over 500 random draws.

In [28]:
#MONTE CARLO SIMULATION TO FIND AVERAGE PERFORMANCE OF BASELINE MODEL (EQUAL PROBABILITIES)
n = 0
avg_acc_list = []
avg_prec_list = []
avg_rec_list = []
avg_f1_list = []

while n < 500:
    df_test2['baseline_labels2'] = df_test2.apply(lambda x: equal_prob_labels(), axis=1)
    
    all_accuracies = []
    all_precision = []
    all_recall = []
    all_f1 = []

    for index, row in df_test2.iterrows():
        accuracy_score = accuracy(row['true_labels_num'], row['baseline_labels2'])
        precision_score = precision(row['true_labels_num'], row['baseline_labels2'])
        recall_score = recall(row['true_labels_num'], row['baseline_labels2'])
        f1_score = f1(row['true_labels_num'], row['baseline_labels2'])

        all_accuracies.append(accuracy_score)
        all_precision.append(precision_score)
        all_recall.append(recall_score)
        all_f1.append(f1_score)

    avg_acc = sum(all_accuracies)/len(all_accuracies)
    avg_prec = sum(all_precision)/len(all_precision)
    avg_rec = sum(all_recall)/len(all_recall)
    avg_f1 = sum(all_f1)/len(all_f1)

    avg_acc_list.append(avg_acc)
    avg_prec_list.append(avg_prec)
    avg_rec_list.append(avg_rec)
    avg_f1_list.append(avg_f1)
    
    n+=1

avg_acc = sum(avg_acc_list)/len(avg_acc_list)
avg_prec = sum(avg_prec_list)/len(avg_prec_list)
avg_rec = sum(avg_rec_list)/len(avg_rec_list)
avg_f1 = sum(avg_f1_list)/len(avg_f1_list)
print("Average Accuracy on Test Data:", round(avg_acc*100,2), "%")
print("Average Precision on Test Data:", round(avg_prec*100,2), "%")
print("Average Recall on Test Data:", round(avg_rec*100,2), "%")
print("Average F1-Score on Test Data:", round(avg_f1*100,2), "%")

Average Accuracy on Test Data: 17.39 %
Average Precision on Test Data: 27.29 %
Average Recall on Test Data: 27.29 %
Average F1-Score on Test Data: 27.29 %
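In the encoded table above, blank label slots are mapped to 0 and 'work processes' to 10, while `equal_prob_labels` draws integers from 0 to 9. A random 0 can therefore match the blank padding, and 'work processes' is never drawn. The sketch below shows one possible variant, assuming blanks should be ignored and labels drawn from the ten trait names; the helper names, the seed and the single example row are assumptions, and the sketch is not meant to reproduce the figures above.

```python
import random

CANDIDATE_LABELS = [
    "personal accountability",
    "effective safety communication",
    "respectful work environment",
    "continuous learning",
    "decision making",
    "questioning attitude",
    "leadership safety values and actions",
    "environment for raising concerns",
    "work processes",
    "problem identification and resolution",
]


def strip_blanks(labels):
    """Remove the empty strings left over from unused label columns."""
    return [label for label in labels if label]


def mean_random_precision(true_label_sets, n_runs=500, k=4, seed=0):
    """Monte Carlo estimate of precision when k distinct labels are drawn uniformly."""
    rng = random.Random(seed)
    run_means = []
    for _ in range(n_runs):
        scores = []
        for true in true_label_sets:
            predicted = rng.sample(CANDIDATE_LABELS, k)  # k distinct labels
            scores.append(len(set(predicted) & set(true)) / k)
        run_means.append(sum(scores) / len(scores))
    return sum(run_means) / len(run_means)


true_sets = [strip_blanks(["questioning attitude", "work processes", "", ""])]
estimate = mean_random_precision(true_sets)
# For a true set of size t, the expected precision is t / 10 (here 2 / 10)
print(round(estimate, 3))
```

The closed form follows because each of the k draws hits a true label with probability t/10, so the expected overlap is k·t/10 and the expected precision is t/10.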


## 3) Few-shot classifier, with traits descriptions in prompt, with training
The file "templates.py" was modified to include descriptions (or keywords) in prompt

#### Importing libraries

In [4]:
import pandas as pd
import numpy as np
import pickle
from gpt_few_shot_clf import MultiLabelFewShotGPTClassifier
from sklearn.model_selection import train_test_split
import ast

In [5]:
df_test.head()

Unnamed: 0,File Name,Power Plant,Sentence/Paragraph,Safety Trait #1,Safety Trait #2,Safety Trait #3,Safety Trait #4,Labelled by,Reviewed by,Review Notes,Corrected?
0,3462002008.pdf,Davis-Besse,"Following unit shutdown on February 16, 2002, ...",problem identification and resolution,questioning attitude,,,Anood,Ruturaj,,
1,3462002009.pdf,Davis-Besse,The cause of the cracks appears to be high cyc...,problem identification and resolution,questioning attitude,work processes,,Anood,Ruturaj,,
2,3462003001.pdf,Davis-Besse,"These conditions, apparently caused by design ...",questioning attitude,work processes,,,Anood,Ruturaj,work_process ?,Yes
3,3462003002.pdf,Davis-Besse,The apparent cause of the HPI pump debris tole...,work processes,continuous learning,questioning attitude,,Anood,Ruturaj,,
4,3462003004.pdf,Davis-Besse,The previous procedures used to calibrate the ...,work processes,,,,Anood,Ruturaj,,


#### Counting number of trait occurrences in the test set

In [29]:
df_test["All Labels"] = df_test["Safety Trait #1"].astype(str) + "," + df_test["Safety Trait #2"].astype(str) + "," + df_test["Safety Trait #3"].astype(str) + "," + df_test["Safety Trait #4"].astype(str)
y = [[label for label in labels.split(",") if label] for labels in df_test["All Labels"].tolist()]

count_traits = {}
for sublist in y:
    for trait in sublist:
        count_traits[trait] = count_traits.get(trait,0)+1
count_traits

{'problem identification and resolution': 37,
 'questioning attitude': 28,
 'work processes': 60,
 'continuous learning': 19,
 'personal accountability': 42,
 'leadership safety values and actions': 17,
 'decision making': 20,
 'effective safety communication': 16,
 'environment for raising concerns': 6,
 'respectful work environment': 1}
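The counts above show a strong imbalance: 'work processes' appears 60 times and 'respectful work environment' only once. The same tally can be sketched with `collections.Counter`, which also gives a sorted view; the three-row `y_example` list is an assumption for illustration, not the real label data.

```python
from collections import Counter
from itertools import chain

# Toy stand-in for the nested label list y
y_example = [
    ["work processes", "questioning attitude"],
    ["work processes"],
    ["personal accountability", "work processes"],
]

# Flatten the nested lists and count each trait
trait_counts = Counter(chain.from_iterable(y_example))

# most_common() sorts traits from most to least frequent
for trait, count in trait_counts.most_common():
    share = count / len(y_example)
    print(f"{trait}: {count} ({share:.0%} of paragraphs)")
```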

#### Importing training data
A train/test split was not used. Instead, a small, balanced training set of 15 samples was created to ensure that all safety traits are represented.

In [30]:
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.70, random_state=7)

In [31]:
df_train = pd.read_csv('Training_data.csv')
df_train

Unnamed: 0,File Name,Power Plant,Sentence/Paragraph,Safety Trait #1,Safety Trait #2,Safety Trait #3,Safety Trait #4,Unnamed: 7,Unnamed: 8,Unnamed: 9
0,3462002008.pdf,Davis-Besse,"Following unit shutdown on February 16, 2002, ...",problem_identification,work_processes,questioning_attitude,,,,
1,2752012003.pdf,Diablo Canyon,Electrical Maintenance (EM) supervisors have n...,leadership_values,safety_communication,personal_accountability,,,,
2,2752013003.pdf,Diablo Canyon,"Degraded insulation, contamination, and weathe...",questioning_attitude,problem_identification,environment_raising_concerns,,,,
3,2752000005.pdf,Diablo Canyon,The shift foreman did not review the TS prior ...,personal_accountability,work_processes,continuous_learning,safety_communication,,,
4,4982002001.pdf,South Texas,Verbal communications between Control Room per...,safety_communication,environment_raising_concerns,continuous_learning,,,,
5,4982000002.pdf,South Texas,"On February 14, 1995, Technical Specifications...",safety_communication,work_processes,leadership_values,,,,
6,2752015002.pdf,Diablo Canyon,PG&E completed an ACE and determined that the ...,questioning_attitude,continuous_learning,,,,,
7,3461999003.pdf,Davis-Besse,The cause of this event was determined to be i...,decision_making,leadership_values,safety_communication,work_processes,,,
8,3462000004.pdf,Davis-Besse,The electrician checking the status of the loc...,personal_accountability,decision_making,work_processes,,,,
9,4982000005.pdf,South Texas,The root cause of this event was the lack of q...,questioning_attitude,safety_communication,work_processes,,,,


In [32]:
df_train = df_train.replace(np.nan, '')
df_train = df_train.replace(replacements)
X_train = [sentence for sentence in df_train["Sentence/Paragraph"].tolist()]
df_train["All Labels"] = df_train["Safety Trait #1"].astype(str) + "," + df_train["Safety Trait #2"].astype(str) + "," + df_train["Safety Trait #3"].astype(str) + "," + df_train["Safety Trait #4"].astype(str)
y_train = [[label for label in labels.split(",") if label] for labels in df_train["All Labels"].tolist()]

In [34]:
len(X_train)

15
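Since the training set is meant to represent every safety trait, a quick coverage check can confirm that. The sketch below is a minimal illustration on toy data; the helper name `missing_labels` and the example lists are assumptions, and in the notebook it would be called with `y_train` and `candidate_labels`.

```python
def missing_labels(label_sets, candidate_labels):
    """Return the candidate labels that never occur in any label set."""
    seen = {label for labels in label_sets for label in labels}
    return [label for label in candidate_labels if label not in seen]


# Toy example: 'decision making' is not covered by the training labels
toy_candidates = ["work processes", "questioning attitude", "decision making"]
toy_y_train = [
    ["work processes", "questioning attitude"],
    ["work processes"],
]

uncovered = missing_labels(toy_y_train, toy_candidates)
print(uncovered)
```

An empty result would mean every trait has at least one training example.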

#### Training the few-shot classifier

In [11]:
import os
# Read the OpenAI key from the environment instead of hardcoding it in the notebook.
api_key2 = os.environ["OPENAI_API_KEY"]
clf_fewshot = MultiLabelFewShotGPTClassifier(openai_model="gpt-3.5-turbo", openai_key=api_key2)
clf_fewshot.fit(X_train, y_train)

#### Predicting on one instance

In [35]:
#Test on one instance
print("sentence:", X_train[1],'\n')
print("true:", y_train[1],'\n')
print("predicted:",clf_fewshot.predict([X_train[1]]))

sentence: Electrical Maintenance (EM) supervisors have not consistently reinforced self-checking standards. 

true: ['leadership safety values and actions', 'effective safety communication', 'personal accountability'] 



100%|█████████████████████████████████████████████| 1/1 [00:01<00:00,  1.60s/it]

predicted: [['leadership safety values and actions', 'effective safety communication', 'personal accountability']]





#### Predicting labels for test set
85 samples

In [13]:
train_paragraphs = list(df_train['Sentence/Paragraph'].values)
condition = df_test['Sentence/Paragraph'].isin(train_paragraphs)
df_test2 = df_test.drop(df_test[condition].index)
df_test2

Unnamed: 0,File Name,Power Plant,Sentence/Paragraph,Safety Trait #1,Safety Trait #2,Safety Trait #3,Safety Trait #4,Labelled by,Reviewed by,Review Notes,Corrected?,All Labels
1,3462002009.pdf,Davis-Besse,The cause of the cracks appears to be high cyc...,problem identification and resolution,questioning attitude,work processes,,Anood,Ruturaj,,,"problem identification and resolution,question..."
2,3462003001.pdf,Davis-Besse,"These conditions, apparently caused by design ...",questioning attitude,work processes,,,Anood,Ruturaj,work_process ?,Yes,"questioning attitude,work processes,,"
3,3462003002.pdf,Davis-Besse,The apparent cause of the HPI pump debris tole...,work processes,continuous learning,questioning attitude,,Anood,Ruturaj,,,"work processes,continuous learning,questioning..."
4,3462003004.pdf,Davis-Besse,The previous procedures used to calibrate the ...,work processes,,,,Anood,Ruturaj,,,"work processes,,,"
5,3462003010.pdf,Davis-Besse,"Therefore, the cause of the loss of taper pins...",personal accountability,work processes,,,Anood,Ruturaj,,,"personal accountability,work processes,,"
...,...,...,...,...,...,...,...,...,...,...,...,...
95,4992001002.pdf,South Texas,The linkage mechanism that operates the breake...,personal accountability,,problem identification and resolution,,Nikki,Vijay,LGTM,,"personal accountability,,problem identificatio..."
96,4992001003.pdf,South Texas,The majority of the defective tubes detected i...,problem identification and resolution,respectful work environment,decision making,,Nikki,Vijay,Decision making: Seems like some of the design...,,"problem identification and resolution,respectf..."
97,4992001004.pdf,South Texas,1. Personnel did not recognize the laptop comp...,personal accountability,work processes,,,Nikki,Vijay,Work processes: Would add this as sufficient t...,,"personal accountability,work processes,,"
98,4992002001.pdf,South Texas,The root cause of this incident was a lack of ...,problem identification and resolution,continuous learning,,,Nikki,Vijay,"Work processes: Don't think it fits into this,...",,"problem identification and resolution,continuo..."


In [14]:
X2 = [sentence for sentence in df_test2["Sentence/Paragraph"].tolist()]
len(X2)

85

In [15]:
df_test2["All Labels"] = df_test2["Safety Trait #1"].astype(str) + "," + df_test2["Safety Trait #2"].astype(str) + "," + df_test2["Safety Trait #3"].astype(str) + "," + df_test2["Safety Trait #4"].astype(str)
y2 = [[label for label in labels.split(",") if label] for labels in df_test2["All Labels"].tolist()]
len(y2)

85
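Removing the training paragraphs keeps the few-shot examples out of the evaluation data. The sketch below shows the same filtering on plain lists while keeping texts and labels aligned, with a sanity check that no training text remains. The helper name `hold_out` and the toy data are assumptions for illustration.

```python
def hold_out(texts, labels, train_texts):
    """Drop every (text, labels) pair whose text also appears in the training set."""
    train_set = set(train_texts)
    kept = [(text, lab) for text, lab in zip(texts, labels) if text not in train_set]
    held_texts = [text for text, _ in kept]
    held_labels = [lab for _, lab in kept]
    # Sanity check: nothing from the training set is left in the evaluation data
    assert train_set.isdisjoint(held_texts)
    return held_texts, held_labels


all_texts = ["paragraph A", "paragraph B", "paragraph C"]
all_labels = [["work processes"], ["decision making"], ["questioning attitude"]]
train_texts = ["paragraph B"]

test_texts, test_labels = hold_out(all_texts, all_labels, train_texts)
print(len(test_texts), test_labels)
```

The held-out size equals the full size minus the training size only if every training paragraph occurs exactly once in the full set, which matches the 100 − 15 = 85 count above.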

In [16]:
labels_fewshot1 = clf_fewshot.predict(X2)

100%|███████████████████████████████████████████| 85/85 [01:28<00:00,  1.04s/it]


#### Predicting labels for all instances (training and test)
100 samples

In [46]:
labels_fewshot_all = clf_fewshot.predict(X)

100%|█████████████████████████████████████████| 100/100 [01:54<00:00,  1.15s/it]


#### Saving test results in CSV file for analysis

In [39]:
columnNamesForResult = ["Power Plant", "Sentence/Paragraph", "Predicted Labels", "True Labels"]
powerplants = []
sentenceOrParagraphs = []
predictedLabels = []
trueLabels = []

# Walk through held-out paragraphs, predictions and gold labels in parallel
for causeOfEventDescription, pred_labels, true_labels in zip(X2, labels_fewshot1, y2):
    # Look up the plant name by matching the paragraph text in the gold-standard table
    row = df_test.loc[df_test['Sentence/Paragraph'] == causeOfEventDescription]
    powerplants.append(row['Power Plant'].item())
    sentenceOrParagraphs.append(causeOfEventDescription)
    predictedLabels.append(sorted(pred_labels))
    trueLabels.append(sorted(true_labels))

resultingDF = pd.DataFrame(columns=columnNamesForResult, data=
                           {"Power Plant": powerplants,
                            "Sentence/Paragraph": sentenceOrParagraphs,
                            "Predicted Labels": predictedLabels,
                            "True Labels": trueLabels})

In [40]:
resultingDF
resultingDF.to_csv("FewShot_Keywords_FewExamples.csv", index=False)

#### Evaluating model (Calculating Metrics)

In [41]:
df = pd.read_csv("FewShot_Keywords_FewExamples.csv")
print(f"Total test size: {len(df)}")
calculate_metrics(df)

Total test size: 85
The cause of the cracks appears to be high cyclic thermal fatigue. This event was determined to not meet the requirements of a reportable condition under 10 CFR 50.73. However, due to the industry interest in HPI thermal sleeve failure, this event is being reported voluntarily as a Licensee Event Report in accordance with the guidance provided in Section 2.7 of NUREG-1022, Revision 2, Event Reporting Guidelines. 

true: ['problem identification and resolution', 'questioning attitude', 'work processes'] 

predicted: ['problem identification and resolution', 'work processes'] 


Accuracy score: 0.6666666666666666

Precision score: 1.0

Recall score: 0.6666666666666666

F1 score: 0.8
----------------------------------------------------------------------------------------------------
These conditions, apparently caused by design assumptions that were not fully adequate, are being reported in accordance with 10CFR50.73(a)(2)(i)(B) as operation .or condition prohibited by

Accuracy score: 0.16666666666666666

Precision score: 0.3333333333333333

Recall score: 0.25

F1 score: 0.28571428571428575
----------------------------------------------------------------------------------------------------
The linkage mechanism that operates the breaker Y600 C phase pole failed due to a linkage connection pin falling out of the linkage. Inspection revealed that a bushing that is required between the linkage pin and the operating linkage was not installed. The bushing was most likely left out during fabrication. The lack of a bushing creates increased friction and tolerances leading to accelerated wear of the components. 

true: ['personal accountability', 'problem identification and resolution'] 

predicted: ['problem identification and resolution', 'work processes'] 


Accuracy score: 0.3333333333333333

Precision score: 0.5

Recall score: 0.5

F1 score: 0.5
----------------------------------------------------------------------------------------------------
The majo

#### Saving all results (test and training) in CSV file for analysis

In [47]:
columnNamesForResult = ["Power Plant", "Sentence/Paragraph", "Predicted Labels", "True Labels"]
powerplants = []
sentenceOrParagraphs = []
predictedLabels = []
trueLabels = []

# Walk through all paragraphs, predictions and gold labels in parallel
for causeOfEventDescription, pred_labels, true_labels in zip(X, labels_fewshot_all, y):
    # Look up the plant name by matching the paragraph text in the gold-standard table
    row = df_test.loc[df_test['Sentence/Paragraph'] == causeOfEventDescription]
    powerplants.append(row['Power Plant'].item())
    sentenceOrParagraphs.append(causeOfEventDescription)
    predictedLabels.append(sorted(pred_labels))
    trueLabels.append(sorted(true_labels))

resultingDF = pd.DataFrame(columns=columnNamesForResult, data=
                           {"Power Plant": powerplants,
                            "Sentence/Paragraph": sentenceOrParagraphs,
                            "Predicted Labels": predictedLabels,
                            "True Labels": trueLabels})

resultingDF
resultingDF.to_csv("FewShot_Keywords_FewExamples_All.csv", index=False)