University of Aberdeen\
Atanas Komsiyski
## Evaluating GPT-3.5-turbo for Action Item Extraction in Meeting Transcripts

This Jupyter notebook contains most of our quantitative analysis:
- BERTScore calculations for each meeting
- Our study of consistency between outputs, measured by the standard deviation of BERTScore F1


#### Libraries
To ensure all libraries are installed before executing the notebook, run `pip install -r requirements.txt`.

In [6]:
# imports
import pandas as pd
import xml.etree.ElementTree as ET
from bert_score import score
from transformers import logging

# Set the pandas display options to show all rows and columns of tables
pd.set_option('display.max_rows', None)  # unlimited rows
pd.set_option('display.max_columns', None)  # unlimited columns
pd.set_option("display.precision", 5) # number of decimal places shown

logging.set_verbosity_error() # suppress all messages from the scoring model except errors, for cleaner and more readable output

In [7]:
# function that reads the contents of our XML files into a nested dictionary ({meeting name: {iteration number: [action items]}}) for easier access
def read_xml(file_path):
    meetings = {}
    tree = ET.parse(file_path)
    root = tree.getroot()
    for meeting in root.findall('Meeting'):
        meeting_name = meeting.get('Name')
        meetings[meeting_name] = {}
        for iteration in meeting.findall('Iteration'):
            iteration_number = int(iteration.get('Number'))
            items = [item.text for item in iteration.findall('Item')]
            meetings[meeting_name][iteration_number] = items
    return meetings
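
The parser above implies a particular XML layout: `Meeting` elements with a `Name` attribute, each holding `Iteration` elements with an integer `Number` attribute, which in turn hold `Item` elements. The sketch below illustrates that layout and the nested dictionary it produces. It rests on assumptions: the root tag `Meetings`, the meeting name and the action items are invented for the example, and `parse_meetings` is a hypothetical variant of `read_xml` that parses a string instead of a file.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample mirroring the layout read_xml expects.
# The root tag, meeting name and action items are made up for illustration.
SAMPLE_XML = """<Meetings>
  <Meeting Name="Example001.txt">
    <Iteration Number="0">
      <Item>Alice to send the agenda by Friday.</Item>
      <Item>Bob to book the meeting room.</Item>
    </Iteration>
    <Iteration Number="1">
      <Item>Alice will circulate the agenda.</Item>
    </Iteration>
  </Meeting>
</Meetings>"""


def parse_meetings(xml_text):
    """Same logic as read_xml, but parses an XML string instead of a file path."""
    meetings = {}
    root = ET.fromstring(xml_text)
    for meeting in root.findall('Meeting'):
        iterations = {}
        for iteration in meeting.findall('Iteration'):
            number = int(iteration.get('Number'))
            iterations[number] = [item.text for item in iteration.findall('Item')]
        meetings[meeting.get('Name')] = iterations
    return meetings


parsed = parse_meetings(SAMPLE_XML)
print(parsed["Example001.txt"][0])  # the action items of iteration 0
```

The result is keyed first by meeting name and then by iteration number, which is why `compute_bert_score` can look up the human annotation with `.get(meeting_name, {}).get(0, [])`.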


In [8]:
# the main function used to compute the BERTScore metrics
def compute_bert_score(gpt_meetings, human_meetings): # takes as input the two nested dictionaries read from the XMLs
    bert_scores = []
    for meeting_name, gpt_iterations in gpt_meetings.items():              # for each gpt meeting
        human_iteration = human_meetings.get(meeting_name, {}).get(0, [])  # get the single ground truth iteration from human annotators
        for iteration_number, gpt_items in gpt_iterations.items():
            gpt_text = ' '.join(gpt_items)                                 # feed all action items as a single paragraph
            human_text = ' '.join(human_iteration)
            # note: each list is joined into one paragraph because score() requires the same number of candidates and references; passing the raw item lists would raise an error whenever their lengths differ
            P, R, F1 = score([gpt_text], [human_text], lang='en', verbose=False, model_type="microsoft/deberta-xlarge-mnli")
            bert_scores.append({
                "Meeting": meeting_name,
                "GPT Iteration Number": iteration_number,
                "Precision": P.item(),
                "Recall": R.item(),
                "F1": F1.item()
            })
    return bert_scores
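
For readers unfamiliar with how the three numbers returned by `score` relate to each other, the sketch below illustrates the greedy matching that BERTScore is built on. Each candidate token is matched to its most similar reference token (precision), each reference token to its most similar candidate token (recall), and F1 is their harmonic mean. This is a toy illustration only: the 2-D vectors are invented stand-ins for contextual embeddings, and `greedy_bertscore` is a hypothetical helper, not part of the `bert_score` library, which also handles tokenisation, layer selection and optional IDF weighting.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as tuples."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def greedy_bertscore(cand_vecs, ref_vecs):
    """Toy BERTScore: greedy max-similarity matching in both directions."""
    # precision: how well each candidate token is supported by the reference
    precision = sum(max(cosine(c, r) for r in ref_vecs) for c in cand_vecs) / len(cand_vecs)
    # recall: how well each reference token is covered by the candidate
    recall = sum(max(cosine(r, c) for c in cand_vecs) for r in ref_vecs) / len(ref_vecs)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Invented "embeddings": the candidate matches two of the three reference tokens exactly
candidate = [(1.0, 0.0), (0.0, 1.0)]
reference = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]

p, r, f1 = greedy_bertscore(candidate, reference)
print(f"P={p:.5f} R={r:.5f} F1={f1:.5f}")
```

In this toy case precision is perfect because every candidate token has an exact match, while recall is lower because one reference token is only partially covered. The same asymmetry explains why recall and precision can differ noticeably for a single meeting in the tables below.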


### Version 1: Base prompt
"Extract all action items from the following meeting transcript and display them in the form of a numbered list."

In [9]:
# reading from XML files into nested dictionaries
gpt_meetings = read_xml("GPT_action_items_v1.xml")
human_meetings = read_xml("Human_action_items.xml")

# computing BERTScore
bert_scores = compute_bert_score(gpt_meetings, human_meetings)

# convert to DataFrame
df_v1 = pd.DataFrame(bert_scores)

# display DataFrame
display(df_v1) 

# Table containing all meetings and their iterations with their respective scores

Unnamed: 0,Meeting,GPT Iteration Number,Precision,Recall,F1
0,Bed002.txt,0,0.6306,0.66248,0.64615
1,Bed002.txt,1,0.66171,0.64967,0.65563
2,Bed002.txt,2,0.64829,0.64754,0.64791
3,Bed003.txt,0,0.58179,0.66153,0.6191
4,Bed003.txt,1,0.57063,0.65658,0.61059
5,Bed003.txt,2,0.61801,0.68796,0.65111
6,Bed004.txt,0,0.60137,0.60853,0.60493
7,Bed004.txt,1,0.60589,0.60857,0.60722
8,Bed004.txt,2,0.59659,0.60128,0.59893
9,Bed005.txt,0,0.5512,0.57072,0.56079


In [10]:
# group by 'Meeting' and compute the mean of the three iterations' scores
mean_df_v1 = df_v1.groupby('Meeting').mean()

# reset the index to keep 'Meeting' as a column
mean_df_v1.reset_index(inplace=True)

# drop the GPT Iteration Number from the table as not relevant after computing the mean
mean_df_v1.drop('GPT Iteration Number', axis=1, inplace=True)

# display the resulting DataFrame
display(mean_df_v1)

# Table containing the mean of scores from all iterations grouped by meeting

Unnamed: 0,Meeting,Precision,Recall,F1
0,Bed002.txt,0.64687,0.65323,0.6499
1,Bed003.txt,0.59014,0.66869,0.62694
2,Bed004.txt,0.60128,0.60613,0.60369
3,Bed005.txt,0.58473,0.61005,0.59709
4,Bed006.txt,0.53554,0.58284,0.55816
5,Bed008.txt,0.57722,0.74206,0.64926
6,Bed009.txt,0.60217,0.60674,0.60443
7,Bed010.txt,0.66107,0.65256,0.65662
8,Bmr001.txt,0.56738,0.56836,0.56782
9,Bmr002.txt,0.64092,0.67985,0.65971
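
As a sanity check on the grouping step, the per-meeting means can be reproduced by hand. The sketch below is a minimal standard-library illustration that recomputes the `Bed002.txt` row from its three Version 1 iterations as displayed in the per-iteration table. The variable names are ours, and the inputs are the rounded values shown above rather than the full-precision scores.

```python
from statistics import mean

# Version 1 scores for Bed002.txt, copied from the per-iteration table above
bed002_v1 = [
    {"Precision": 0.6306, "Recall": 0.66248, "F1": 0.64615},
    {"Precision": 0.66171, "Recall": 0.64967, "F1": 0.65563},
    {"Precision": 0.64829, "Recall": 0.64754, "F1": 0.64791},
]

# average each metric over the three iterations, as groupby('Meeting').mean() does
bed002_means = {
    metric: round(mean(row[metric] for row in bed002_v1), 5)
    for metric in ("Precision", "Recall", "F1")
}
print(bed002_means)
```

The rounded results agree with the `Bed002.txt` row of the table (0.64687, 0.65323, 0.6499). Note that the F1 column is the average of the per-iteration F1 values, not an F1 recomputed from the averaged precision and recall.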


In [11]:
mean_df_v1.drop('Meeting', axis=1, inplace=True) # drop the meeting name from the table as not needed for next step
display(mean_df_v1.mean()) # display the mean score for each metric

# Table displaying the mean of scores of all meetings

Precision    0.62618
Recall       0.70091
F1           0.66009
dtype: float64

### Version 2: Base prompt and Examples
"Examples of action items: Arrange a meeting with Amy next Friday, Call Ben after the presentation, Allison to begin data collection today."

In [12]:
# reading from XML files into nested dictionaries
gpt_meetings = read_xml("GPT_action_items_v2.xml")
human_meetings = read_xml("Human_action_items.xml")

# computing BERTScore
bert_scores = compute_bert_score(gpt_meetings, human_meetings)

# convert to DataFrame
df_v2 = pd.DataFrame(bert_scores)

# display DataFrame
display(df_v2)

# Table containing all meetings and their iterations with their respective scores

Unnamed: 0,Meeting,GPT Iteration Number,Precision,Recall,F1
0,Bed002.txt,0,0.64249,0.67893,0.66021
1,Bed002.txt,1,0.6072,0.68907,0.64555
2,Bed002.txt,2,0.67312,0.6843,0.67866
3,Bed003.txt,0,0.6042,0.65063,0.62656
4,Bed003.txt,1,0.60356,0.63913,0.62084
5,Bed003.txt,2,0.61793,0.64985,0.63349
6,Bed004.txt,0,0.656,0.61619,0.63547
7,Bed004.txt,1,0.60973,0.60618,0.60795
8,Bed004.txt,2,0.6538,0.62955,0.64144
9,Bed005.txt,0,0.59065,0.67844,0.63151


In [13]:
# group by 'Meeting' and compute the mean of the three iterations' scores
mean_df_v2 = df_v2.groupby('Meeting').mean()

# reset the index to keep 'Meeting' as a column
mean_df_v2.reset_index(inplace=True)

# Drop the GPT Iteration Number as not relevant
mean_df_v2.drop('GPT Iteration Number', axis=1, inplace=True)

# display the resulting DataFrame
display(mean_df_v2)

# Table containing the mean of scores from all iterations grouped by meeting

Unnamed: 0,Meeting,Precision,Recall,F1
0,Bed002.txt,0.64094,0.6841,0.66147
1,Bed003.txt,0.60856,0.64654,0.62696
2,Bed004.txt,0.63984,0.6173,0.62829
3,Bed005.txt,0.58063,0.65415,0.61507
4,Bed006.txt,0.58805,0.57597,0.58132
5,Bed008.txt,0.57972,0.71352,0.63964
6,Bed009.txt,0.57682,0.60361,0.58991
7,Bed010.txt,0.67688,0.62364,0.64853
8,Bmr001.txt,0.57123,0.57132,0.57066
9,Bmr002.txt,0.6132,0.58029,0.59178


In [14]:
mean_df_v2.drop('Meeting', axis=1, inplace=True) # drop the meeting name from the table as not needed for next step
display(mean_df_v2.mean()) # display the mean score for each metric

# Table displaying the mean of scores of all meetings

Precision    0.61009
Recall       0.65681
F1           0.63063
dtype: float64

### Version 3: Base prompt and Definition
"Action items must contain information on who needs to complete what action item and when if known."

In [15]:
# reading from XML files into nested dictionaries
gpt_meetings = read_xml("GPT_action_items_v3.xml")
human_meetings = read_xml("Human_action_items.xml")

# computing BERTScore
bert_scores = compute_bert_score(gpt_meetings, human_meetings)

# convert to DataFrame
df_v3 = pd.DataFrame(bert_scores)

# display DataFrame
display(df_v3)

# Table containing all meetings and their iterations with their respective scores

Unnamed: 0,Meeting,GPT Iteration Number,Precision,Recall,F1
0,Bed002.txt,0,0.57546,0.65625,0.61321
1,Bed002.txt,1,0.61912,0.66491,0.6412
2,Bed002.txt,2,0.62576,0.67463,0.64927
3,Bed003.txt,0,0.5802,0.67633,0.62459
4,Bed003.txt,1,0.53632,0.64592,0.58604
5,Bed003.txt,2,0.57883,0.66306,0.61809
6,Bed004.txt,0,0.58116,0.60055,0.59069
7,Bed004.txt,1,0.57699,0.63131,0.60293
8,Bed004.txt,2,0.56441,0.59862,0.58101
9,Bed005.txt,0,0.5996,0.69301,0.64293


In [16]:
# group by 'Meeting' and compute the mean of the three iterations' scores
mean_df_v3 = df_v3.groupby('Meeting').mean()

# reset the index to keep 'Meeting' as a column
mean_df_v3.reset_index(inplace=True)

# drop the GPT Iteration Number as not relevant
mean_df_v3.drop('GPT Iteration Number', axis=1, inplace=True)

# display the resulting DataFrame
display(mean_df_v3)

# Table containing the mean of scores from all iterations grouped by meeting

Unnamed: 0,Meeting,Precision,Recall,F1
0,Bed002.txt,0.60678,0.66526,0.63456
1,Bed003.txt,0.56511,0.66177,0.60957
2,Bed004.txt,0.57418,0.61016,0.59154
3,Bed005.txt,0.58588,0.68473,0.63144
4,Bed006.txt,0.53162,0.57049,0.55026
5,Bed008.txt,0.56125,0.73815,0.63725
6,Bed009.txt,0.5814,0.63249,0.60581
7,Bed010.txt,0.64293,0.64916,0.64469
8,Bmr001.txt,0.53638,0.60144,0.56697
9,Bmr002.txt,0.62109,0.63844,0.62941


In [17]:
mean_df_v3.drop('Meeting', axis=1, inplace=True) # drop the meeting name from the table as not needed for next step
display(mean_df_v3.mean()) # display the mean score for each metric

# Table displaying the mean of scores of all meetings

Precision    0.58874
Recall       0.67420
F1           0.62762
dtype: float64

## Standard deviation calculation for determining consistency between iterations using BERTScore F1 values

Standard deviation of the F1 scores across the three iterations of each meeting using prompt Version 1

In [18]:
std_deviation_df_v1 = df_v1.groupby('Meeting')['F1'].std() # compute the standard deviations of F1 values by grouping them by meeting

display(std_deviation_df_v1)

print("Mean of all standard deviations: ")
display(std_deviation_df_v1.mean())

# Table of the standard deviation of the F1 scores across the three iterations of each meeting (shows us the consistency between iterations)

Meeting
Bed002.txt    0.00504
Bed003.txt    0.02136
Bed004.txt    0.00429
Bed005.txt    0.04824
Bed006.txt    0.01535
Bed008.txt    0.01398
Bed009.txt    0.01091
Bed010.txt    0.01728
Bmr001.txt    0.00213
Bmr002.txt    0.01225
Bmr003.txt    0.01205
Bmr005.txt    0.00740
Bmr006.txt    0.04848
Bmr007.txt    0.02169
Bmr008.txt    0.02204
Bmr009.txt    0.02356
Bmr010.txt    0.01799
Bro003.txt    0.02985
Bro004.txt    0.06520
Bro005.txt    0.06600
Bro007.txt    0.05334
Bro008.txt    0.01549
Bro010.txt    0.02508
Bro011.txt    0.05336
Bro012.txt    0.02726
Name: F1, dtype: float64

Mean of all standard deviations: 


0.02558486549883283
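
It is worth being explicit about which standard deviation is reported: pandas' `std()` uses the sample formula (dividing by n - 1) by default. The sketch below is a minimal standard-library illustration that reproduces the `Bed002.txt` value from its three Version 1 F1 scores and contrasts it with the population formula. The variable names are ours, and the inputs are the rounded scores displayed earlier.

```python
from statistics import pstdev, stdev

# Version 1 F1 scores for the three iterations of Bed002.txt (from the first table)
bed002_v1_f1 = [0.64615, 0.65563, 0.64791]

sample_sd = stdev(bed002_v1_f1)       # divides by n - 1, like pandas' default ddof=1
population_sd = pstdev(bed002_v1_f1)  # divides by n, shown only for comparison

print(round(sample_sd, 5))
print(round(population_sd, 5))
```

The sample value rounds to 0.00504, matching the `Bed002.txt` entry above. With only three iterations per meeting, the choice of divisor changes the result noticeably, so the formula should be kept in mind when comparing these figures with other work.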

Standard deviation of the F1 scores across the three iterations of each meeting using prompt Version 2

In [19]:
std_deviation_df_v2 = df_v2.groupby('Meeting')['F1'].std() # compute the standard deviations of F1 values by grouping them by meeting

display(std_deviation_df_v2)

print("Mean of all standard deviations: ")
display(std_deviation_df_v2.mean())

# Table of the standard deviation of the F1 scores across the three iterations of each meeting (shows us the consistency between iterations)

Meeting
Bed002.txt    0.01659
Bed003.txt    0.00634
Bed004.txt    0.01787
Bed005.txt    0.02576
Bed006.txt    0.01989
Bed008.txt    0.01947
Bed009.txt    0.02444
Bed010.txt    0.01388
Bmr001.txt    0.00881
Bmr002.txt    0.05952
Bmr003.txt    0.02284
Bmr005.txt    0.01849
Bmr006.txt    0.03534
Bmr007.txt    0.00914
Bmr008.txt    0.00586
Bmr009.txt    0.01696
Bmr010.txt    0.01894
Bro003.txt    0.00387
Bro004.txt    0.00000
Bro005.txt    0.02076
Bro007.txt    0.04257
Bro008.txt    0.00934
Bro010.txt    0.01579
Bro011.txt    0.02803
Bro012.txt    0.03681
Name: F1, dtype: float64

Mean of all standard deviations: 


0.019891938188148092

Standard deviation of the F1 scores across the three iterations of each meeting using prompt Version 3

In [20]:
std_deviation_df_v3 = df_v3.groupby('Meeting')['F1'].std() # compute the standard deviations of F1 values by grouping them by meeting

display(std_deviation_df_v3)

print("Mean of all standard deviations: ")
display(std_deviation_df_v3.mean())

# Table of the standard deviation of the F1 scores across the three iterations of each meeting (shows us the consistency between iterations)

Meeting
Bed002.txt    0.01893
Bed003.txt    0.02064
Bed004.txt    0.01098
Bed005.txt    0.01022
Bed006.txt    0.02818
Bed008.txt    0.00696
Bed009.txt    0.02935
Bed010.txt    0.06178
Bmr001.txt    0.02024
Bmr002.txt    0.01512
Bmr003.txt    0.01943
Bmr005.txt    0.00075
Bmr006.txt    0.04501
Bmr007.txt    0.00974
Bmr008.txt    0.02129
Bmr009.txt    0.00417
Bmr010.txt    0.00412
Bro003.txt    0.02003
Bro004.txt    0.00462
Bro005.txt    0.00635
Bro007.txt    0.00779
Bro008.txt    0.04723
Bro010.txt    0.00984
Bro011.txt    0.01466
Bro012.txt    0.01762
Name: F1, dtype: float64

Mean of all standard deviations: 


0.018202025926441302
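
Since the overall figures for the three prompt versions are spread across several cells, a side-by-side view may make them easier to read. The sketch below simply collects the mean F1 and the mean F1 standard deviation reported above, rounded to five decimal places, and reads off the extremes. The `summary` dictionary and its key names are ours, and no significance testing is implied.

```python
# Overall figures copied from the notebook outputs above (rounded to 5 d.p.)
summary = {
    "Version 1": {"mean_f1": 0.66009, "mean_f1_std": 0.02558},
    "Version 2": {"mean_f1": 0.63063, "mean_f1_std": 0.01989},
    "Version 3": {"mean_f1": 0.62762, "mean_f1_std": 0.01820},
}

# highest mean F1 and lowest mean standard deviation across prompt versions
best_f1 = max(summary, key=lambda v: summary[v]["mean_f1"])
most_consistent = min(summary, key=lambda v: summary[v]["mean_f1_std"])

for version, row in summary.items():
    print(f"{version}: mean F1 = {row['mean_f1']:.5f}, mean F1 std = {row['mean_f1_std']:.5f}")
print("Highest mean F1:", best_f1)
print("Lowest mean F1 standard deviation:", most_consistent)
```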