# Examine temporal evaluation beyond token-level similarity
In temporal QA benchmarks, the go-to evaluation metrics are F1, exact match (EM), ROUGE, or accuracy. All of them assess the LLM's performance at the token level, i.e., they quantify how much the output overlaps with the expected answer token by token. However, the same time span can be expressed in different ways, e.g., _12 months_ describes the same duration as _one year_. An LLM is therefore penalised for giving the expected answer in a different format. Researchers have already raised concerns that evaluation needs to account for this. Some have therefore introduced post-processing that normalises the LLM's output and the expected answer before comparison, or force the LLM to produce its output in a fixed format.
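
To make the problem concrete, here is a minimal sketch of a token-level F1 score. It is a simplified, SQuAD-style variant written for illustration only; the function name `token_f1` and the whitespace tokenisation are assumptions and not the scorer used by TempTabQA.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 over lower-cased whitespace tokens (simplified sketch)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match, one empty as a miss
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often as it occurs in both
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Same duration, no shared tokens: the score is zero
print(token_f1("12 months", "one year"))
# Correct number, extra unit words: the score is only partial
print(round(token_f1("47 years old", "47"), 2))
```

Both predictions are temporally correct, yet neither receives full credit under a token-overlap metric.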

In the following, I am going to re-evaluate the results of TempTabQA under the assumption that temporal answers are evaluated in a normalised setting.

__Planned steps__
1. Exclude data where output and expected answers are already an exact match
1. Find expected answers that are a date or a time
1. For these temporal answers, normalise the LLM's output to allow for temporal arithmetic
1. Quantify the error of the LLM for those answers temporally

In [1]:
import pandas as pd
import plotly.graph_objects as go
import plotly.express as px


## Load data
I select the GPT-4 results on the in-domain dataset because we expect the highest performance there. Quantifying the cost of not normalising in this best-case setting gives an estimate of the minimal gain from normalisation.

In [2]:
gpt4_df = pd.read_csv("models/predictions/gpt4/fewshot_without_reasoning/indomain_eval_gpt_4_few_shot_single.csv")

In [3]:
gpt4_df.head()

Unnamed: 0,table,predicted_answer,actual_answer,question
0,"<html><body><table class=""infobox biography vc...",47 years old,47,How old was Art Carney when he first got divor...
1,"<html><body><table class=""infobox biography vc...",Jean Myers,Jean Myers,Who was Art Carney married to while he served ...
2,"<html><body><table class=""infobox biography vc...",Barbara Isaac,Barbara Isaac,Which spouse was Art Carney married to the least?
3,"<html><body><table class=""infobox biography vc...",Barbara Isaac,Barbara Isaac,Who was the spouse of Art Carney in 1970?
4,"<html><body><table class=""infobox biography vc...",54 years,54 Years,How many years did Art Carney as actor since 1...


In [4]:
gpt4_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1901 entries, 0 to 1900
Data columns (total 4 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   table             1901 non-null   object
 1   predicted_answer  1901 non-null   object
 2   actual_answer     1901 non-null   object
 3   question          1901 non-null   object
dtypes: object(4)
memory usage: 59.5+ KB


In [5]:
gpt4_df_wrong_answers = gpt4_df.query("predicted_answer.str.lower() != actual_answer.str.lower()")

In [6]:
gpt4_df_wrong_answers.head()

Unnamed: 0,table,predicted_answer,actual_answer,question
0,"<html><body><table class=""infobox biography vc...",47 years old,47,How old was Art Carney when he first got divor...
5,"<html><body><table class=""infobox biography vc...",25 years,28 years,How many total years was Art Carney married to...
7,"<html><body><table class=""infobox biography vc...",23 years,23,How many years before he died was Art Carney m...
10,"<html><body><table class=""infobox biography vc...",22 years old,22,How old was Cumberbatch when his career began?
11,"<html><body><table class=""infobox biography vc...",7 years,7,For how many years has Benedict Cumberbatch be...


In [7]:
gpt4_df_wrong_answers.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1208 entries, 0 to 1900
Data columns (total 4 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   table             1208 non-null   object
 1   predicted_answer  1208 non-null   object
 2   actual_answer     1208 non-null   object
 3   question          1208 non-null   object
dtypes: object(4)
memory usage: 47.2+ KB


In [8]:
print(f"Number of answers that are an exact match when all text is lower-cased: {len(gpt4_df) - len(gpt4_df_wrong_answers)}")

Number of answers that are an exact match when all text is lower-cased: 693


The in-domain dataset consists of 1901 data points. 693 predictions are already an exact match if we lower-case predictions and actual answers. 1208 data points are not an exact match. 
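
As a small illustration of this baseline, the sketch below applies a lower-cased exact match to three answer pairs taken from the preview of the data and derives the exact-match rate from the counts reported above. The helper name `exact_match` is an assumption made for this example.

```python
def exact_match(prediction: str, reference: str) -> bool:
    """Case-insensitive exact match, mirroring the lower-cased comparison above."""
    return prediction.lower() == reference.lower()


# Answer pairs from the data preview
pairs = [
    ("54 years", "54 Years"),      # differs only in case
    ("47 years old", "47"),        # same number, extra unit words
    ("Jean Myers", "Jean Myers"),  # identical
]
flags = [exact_match(pred, ref) for pred, ref in pairs]
print(flags)

# Exact-match rate from the counts reported above
total, not_matching = 1901, 1208
em_rate = (total - not_matching) / total
print(f"{em_rate:.2%}")
```

Roughly 36% of the predictions count as correct under this baseline, so every normalisation step below is measured against that figure.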

## Evaluate QA pairs where digits are an exact match
To assess how much temporal normalisation would help or hurt model evaluation, we need to investigate several normalisation techniques. The first one compares only the digits in the predicted and the actual answer.
When we take all predictions that were wrong at the token level and extract the digits from the predicted and actual answers using a regex, 366 of the formerly 1208 wrong answers suddenly count as correct. Let us investigate whether this approach reflects what we are trying to measure.
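
The following sketch shows what the `\d+` extraction returns for a few answer strings that occur in the data, including cases where it behaves in ways worth keeping in mind. The helper name `extract_numbers` is an assumption for this example.

```python
import re
from typing import List


def extract_numbers(text: str) -> List[str]:
    """Return all maximal digit runs in the text, as strings."""
    return re.findall(r"\d+", text)


# Unit words are dropped, so "47 years old" and "47" compare equal
print(extract_numbers("47 years old"), extract_numbers("47"))

# Thousands separators and decimal points split one number into several
print(extract_numbers("2,121,560 casualties"))
print(extract_numbers("0.254 km"), extract_numbers("0.25 km"))

# Spelled-out numbers yield nothing, so "one year" can never match "12 months"
print(extract_numbers("four"), extract_numbers("one year"))
```

Digit matching therefore ignores units entirely: it treats _4 years_ and _4 days_ as equal and cannot relate _12 months_ to _one year_.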

In [9]:
gpt4_df_extracted_nums = gpt4_df_wrong_answers.assign(
    # Raw strings avoid the invalid "\d" escape sequence in a normal string literal
    predicted_answer_nums_only=lambda x: x["predicted_answer"].str.findall(r"\d+"),
    actual_answer_nums_only=lambda x: x["actual_answer"].str.findall(r"\d+"),
)

In [10]:
gpt4_df_extracted_nums_are_eq = gpt4_df_extracted_nums.query("predicted_answer_nums_only==actual_answer_nums_only and actual_answer_nums_only")
print(gpt4_df_extracted_nums_are_eq.shape)
gpt4_df_extracted_nums_are_eq.head()

(366, 6)


Unnamed: 0,table,predicted_answer,actual_answer,question,predicted_answer_nums_only,actual_answer_nums_only
0,"<html><body><table class=""infobox biography vc...",47 years old,47,How old was Art Carney when he first got divor...,[47],[47]
7,"<html><body><table class=""infobox biography vc...",23 years,23,How many years before he died was Art Carney m...,[23],[23]
10,"<html><body><table class=""infobox biography vc...",22 years old,22,How old was Cumberbatch when his career began?,[22],[22]
11,"<html><body><table class=""infobox biography vc...",7 years,7,For how many years has Benedict Cumberbatch be...,[7],[7]
14,"<html><body><table class=""infobox biography vc...",22 years old,22,How old was Benedict Cumberbatch when he start...,[22],[22]


We notice that the predictions that now count as correct contain the word _year_ most of the time. Therefore, the first normalisation step to investigate should focus on questions that ask for a year or a number of years. Then, I will compare the numbers in the predicted and actual answers to measure the error in units of time.

### Answers mentioning _years_

In [11]:
gpt4_df_extracted_nums_are_eq.query("predicted_answer.str.contains('year')").shape

(241, 6)

In [12]:
gpt4_df_extracted_nums_are_eq.query("predicted_answer.str.contains('year')").loc[
    :, "question"
].str.lower().str.split().apply(lambda x: tuple(x[:3])).value_counts()

question
(how, many, years)            109
(how, old, was)                56
(what, was, the)               22
(at, what, age)                18
(for, how, many)                7
(how, long, after)              6
(what, is, the)                 3
(how, old, did)                 3
(how, long, ago)                2
(steven, paul, jobs)            1
(what, age, did)                1
(how, many, year)               1
(what, was, nicki)              1
(what, was, age)                1
(how, many, total)              1
(what, was, shabalin's)         1
(when, was, the)                1
(what, age, was)                1
(when, alphonse, gallegos)      1
(how, long, was)                1
(the, general, election)        1
(what, number, of)              1
(the, space, shuttle's)         1
(the, pby, catalina)            1
Name: count, dtype: int64

In [13]:
gpt4_df_extracted_nums_are_eq.query(
    "predicted_answer.str.contains('year') " 
    "and ~question.str.lower().str.contains('how many year') "
    "and ~question.str.lower().str.startswith('how old was') " 
    "and ~question.str.lower().str.startswith('what was the age') " 
    "and ~question.str.lower().str.startswith('at what age') " 
    "and ~question.str.lower().str.startswith('for how many years') "
    "and ~question.str.lower().str.startswith('how old did') "
    "and ~question.str.lower().str.contains(' age ') "
    "and ~question.str.lower().str.contains('number of years') "
    "and ~question.str.lower().str.contains('total years')"
).sort_values(by="question").loc[:, "question"].values

array(["How long after Bot's son Ben was born did he take office as State Secretary for the Interior?",
       'How long after achieving a personal best score for short program did Aron retire?',
       'How long after production stopped on the Firefly was it officially retired?',
       'How long after the Firefly was introduced did production stop?',
       'How long after the First Flight was the Firefly retired?',
       'How long after the beginning of the dissolution of the Soviet Union did Chechnya become an unrecognized breakaway state?',
       'How long ago did Faith die?',
       'How long ago was the first edition of Arkham Horror: The Card Game released?',
       'How long was Arnold Schwarzenegger married for?'], dtype=object)

Most questions whose answers count years ask either for a number of years explicitly or for an age, which is usually measured in years. The remaining 9 questions ask for a duration that happens to be answered in years, presumably because this is the most appropriate unit.
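
Since durations may be given in different units, comparing them requires a common unit. The sketch below converts simple duration strings to months. It is an assumption-laden illustration: the function `duration_in_months`, the small `WORD_NUMBERS` vocabulary, and the restriction to years and months are all choices made for this example, not part of the original analysis.

```python
import re
from typing import Optional

# Hypothetical, deliberately small vocabulary of spelled-out numbers
WORD_NUMBERS = {
    "one": 1, "two": 2, "three": 3, "four": 4, "five": 5, "six": 6,
    "seven": 7, "eight": 8, "nine": 9, "ten": 10, "eleven": 11, "twelve": 12,
}
MONTHS_PER_UNIT = {"year": 12, "month": 1}


def duration_in_months(text: str) -> Optional[int]:
    """Sum all '<number> year(s)/month(s)' parts of the text; None if there are none."""
    total = 0
    found = False
    for value, unit in re.findall(r"(\d+|[a-z]+)\s+(year|month)s?\b", text.lower()):
        number = int(value) if value.isdigit() else WORD_NUMBERS.get(value)
        if number is None:
            continue  # e.g. "the year": a word that is not a known number
        total += number * MONTHS_PER_UNIT[unit]
        found = True
    return total if found else None


print(duration_in_months("12 months"), duration_in_months("one year"))
print(duration_in_months("About 14 months"), duration_in_months("1 year 2 months"))
print(duration_in_months("four"))
```

With such a normalisation, _About 14 months_ and _1 year 2 months_ compare as equal, and the difference between two durations becomes a number that can be reported as an error.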

### Answers not mentioning _years_
Let us investigate whether questions whose answers don't mention the string _year_ still ask for a year. 125 answers are an exact match on digits only but don't contain _year_.

In [14]:
gpt4_df_extracted_nums_are_eq.query("~predicted_answer.str.contains('year')").shape

(125, 6)

In [15]:
gpt4_df_extracted_nums_are_eq.query("~predicted_answer.str.contains('year')").loc[:, "question"].str.lower().str.split().apply(lambda x: tuple(x[:3])).value_counts()

question
(when, was, the)              9
(how, many, months)           8
(what, was, the)              6
(what, year, did)             6
(how, many, days)             5
                             ..
(how, many, world)            1
(how, many, personal)         1
(how, many, championships)    1
(how, many, are)              1
(how, many, publications)     1
Name: count, Length: 71, dtype: int64

The pattern in the question types is weaker here than for the exact digit matches where the answer contains the word _year_. In the following, I manually filter out the questions that are still temporal, e.g., asking for a year. In these cases, a digit match makes sense. What are the other questions that lead to a numerical answer?

In [16]:
matching_numbs_that_arent_temp = gpt4_df_extracted_nums_are_eq.query(
    "~predicted_answer.str.contains('year') "
    "and ~question.str.lower().str.startswith('when') "
    "and ~question.str.lower().str.contains('what year') "
    "and ~question.str.lower().str.contains('how many days') "
    "and ~question.str.lower().str.contains('how many months') "
    "and ~question.str.lower().str.contains('which year') "
    "and ~question.str.lower().str.contains('final year') "
    "and ~question.str.lower().str.contains('how long') "
    "and ~question.str.lower().str.contains('first date')"
)
matching_numbs_that_arent_temp.head()

Unnamed: 0,table,predicted_answer,actual_answer,question,predicted_answer_nums_only,actual_answer_nums_only
21,"<html><body><table class=""infobox biography vc...",4 different personas,4,How many different personas has professional w...,[4],[4]
32,"<html><body><table class=""infobox biography vc...",4 times,4,How many times Joan Crawford have been married?,[4],[4]
169,"<html><body><table class=""infobox vcard"" style...",1 gold medal,1,How many gold medals had Carolina Albuquerque ...,[1],[1]
185,"<html><body><table class=""infobox vcard""><capt...",5 gold medals,5,How many gold medals did Matt Biondi win in th...,[5],[5]
238,"<html><body><table class=""infobox vcard""><tbod...",2 medals,2,How many times in 2009 did Wang Xiaoli receive...,[2],[2]


In [17]:
matching_numbs_that_arent_temp.shape

(71, 6)

In [18]:
print(f"Out of 125 exact digit matches where the answer does not contain the word 'year', {125-71} questions are still temporal, i.e., asking for when, how many years, days, or months, or for the first or final date/time of some event/entity.")

Out of 125 exact digit matches where the answer does not contain the word 'year', 54 questions are still temporal, i.e., asking for when, how many years, days, or months, or for the first or final date/time of some event/entity.
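
The keyword filter used above can be written as a small reusable predicate. The sketch below is one possible formulation; the name `looks_temporal` and the two constant tuples are assumptions for this example, and the heuristic only covers the phrases used in the filter.

```python
# Phrases taken from the query filter above
TEMPORAL_PREFIXES = ("when",)
TEMPORAL_PHRASES = (
    "what year", "which year", "final year",
    "how many days", "how many months",
    "how long", "first date",
)


def looks_temporal(question: str) -> bool:
    """Keyword heuristic: True if the question matches one of the temporal phrases."""
    q = question.lower()
    return q.startswith(TEMPORAL_PREFIXES) or any(phrase in q for phrase in TEMPORAL_PHRASES)


print(looks_temporal("How long ago did Faith die?"))
print(looks_temporal("How long was Arnold Schwarzenegger married for?"))
print(looks_temporal("How many times Joan Crawford have been married?"))
```

A keyword heuristic like this is easy to audit, but a question can carry a temporal constraint without containing any of these phrases, so it misses some temporal questions.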


In [19]:
# To get a better overview of the 71 non-temporal questions, I filter out some questions that appear more than one time. 
# The removed questions and the remaining questions shed light onto the types of questions that lead to an exact digit match but that aren't temporal
matching_numbs_that_arent_temp.query(
    "~predicted_answer.str.contains('medals') "
    "and ~predicted_answer.str.contains('[$]') "
    "and ~predicted_answer.str.contains('awards') "
    "and ~predicted_answer.str.contains('times') "
    "and ~predicted_answer.str.contains('medal') "
    "and ~predicted_answer.str.contains('children') "
    "and ~predicted_answer.str.contains('%') "
    "and ~predicted_answer.str.contains('million') "
).sort_values("question")

Unnamed: 0,table,predicted_answer,actual_answer,question,predicted_answer_nums_only,actual_answer_nums_only
1558,"<table class=""infobox vevent""><caption class=""...",2 (Artsakh and Transnistria),2,Currently how many unrecognized states are sti...,[2],[2]
1066,"<table class=""infobox vcard""><caption class=""i...",6 different events,6,For how many different events did Charlotte At...,[6],[6]
812,"<table class=""infobox vcard"" style=""width:25em...",14 seasons,14,For how many seasons did Brooks play for Tampa...,[14],[14]
755,"<table class=""infobox vcard"" style=""width:24em...",8 races,8,How many NASCAR Cup Series and IndyCar Series ...,[8],[8]
1456,"<table class=""infobox vevent"" style=""width:25....","2,121,560-2,260,000 casualties","2,121,560–2,260,000",How many allied casualties were there a year a...,"[2, 121, 560, 2, 260, 000]","[2, 121, 560, 2, 260, 000]"
1334,"<table class=""infobox vcard""><tbody><tr><th cl...",5,"5, including Katherine, Patrick and Joseph Baena",How many are children's Arnold Schwarzenegger ?,[5],[5]
1314,"<table class=""infobox vcard""><tbody><tr><th cl...",8 championships,8,How many championships has Murilo Bustamante a...,[8],[8]
335,"<table class=""infobox biography vcard"" style=""...",4 different series,4,How many different NASCAR series has Cody Ware...,[4],[4]
1461,"<table class=""infobox vevent"" style=""width:25....",11 countries,11,How many different companies fought on the all...,[11],[11]
21,"<html><body><table class=""infobox biography vc...",4 different personas,4,How many different personas has professional w...,[4],[4]


## Evaluate QA pairs where digits are not an exact match
Here, I investigate the QA pairs that did not match on digits only. I intend to identify temporal question-answer pairs and subsequently measure the prediction error in units of time.

In [20]:
no_digits_match = gpt4_df_extracted_nums.query("predicted_answer_nums_only!=actual_answer_nums_only")

In [21]:
no_digits_match.head()

Unnamed: 0,table,predicted_answer,actual_answer,question,predicted_answer_nums_only,actual_answer_nums_only
5,"<html><body><table class=""infobox biography vc...",25 years,28 years,How many total years was Art Carney married to...,[25],[28]
17,"<html><body><table class=""infobox biography vc...",4 years,2,How many years prior to Cumberbatch turning 45...,[4],[2]
20,"<html><body><table class=""infobox biography vc...",8 years,23 Years,How many years did Dwayne Johnson was in wrest...,[8],[23]
22,"<html><body><table class=""infobox biography vc...",8 years (1996-2004) and sporadically thereafter,23 years,How many active years did Dwayne Johnson wrestle?,"[8, 1996, 2004]",[23]
25,"<html><body><table class=""infobox biography vc...",2013,2019,When was the last time that Dwayne Johnson com...,[2013],[2019]


In [22]:
no_digits_match.shape

(712, 6)

### Evaluate error in QA pairs of the format "\d+ years"

In [23]:
# Extract the first number from each list and convert it to int.
# Only keep QA pairs where both answers have the format "\d+ years".
# Note: str.match anchors at the start of the string, str.contains does not.
no_digits_match_year_year = no_digits_match.query(
    r"predicted_answer.str.lower().str.match('\d+ years$') and actual_answer.str.lower().str.contains('\d+ years$')"
).assign(
    predicted_answer_nums_only_flat=lambda x: x["predicted_answer_nums_only"].apply(lambda y: y[0]).astype(int),
    actual_answer_nums_only_flat=lambda x: x["actual_answer_nums_only"].apply(lambda y: y[0]).astype(int),
    error=lambda x: x["predicted_answer_nums_only_flat"] - x["actual_answer_nums_only_flat"],
    rel_error=lambda x: round((x["error"] / x["actual_answer_nums_only_flat"]) * 100, 2)
)

In [24]:
no_digits_match_year_year.head()

Unnamed: 0,table,predicted_answer,actual_answer,question,predicted_answer_nums_only,actual_answer_nums_only,predicted_answer_nums_only_flat,actual_answer_nums_only_flat,error,rel_error
5,"<html><body><table class=""infobox biography vc...",25 years,28 years,How many total years was Art Carney married to...,[25],[28],25,28,-3,-10.71
20,"<html><body><table class=""infobox biography vc...",8 years,23 Years,How many years did Dwayne Johnson was in wrest...,[8],[23],8,23,-15,-65.22
45,"<html><body><table class=""infobox ib-country v...",18 years,3 years,How much longer was Habte Giyorgis prime minis...,[18],[3],18,3,15,500.0
57,"<html><body><table class=""infobox vcard"" style...",2 years,15 years,How long after playing for Monterrey did Moham...,[2],[15],2,15,-13,-86.67
70,"<html><body><table class=""infobox vcard"" style...",8 years,9 years,How much longer was Zidane's senior career tha...,[8],[9],8,9,-1,-11.11


In [25]:
no_digits_match_year_year.shape

(62, 10)

In [26]:
no_digits_match_year_year.loc[:, "error"].value_counts()

error
 1      11
-1       9
 10      3
-2       3
 4       3
-3       2
 3       2
 2       2
 12      2
-100     2
-6       2
 15      2
 6       1
 11      1
 8       1
-8       1
-9       1
-40      1
-10      1
-19      1
-18      1
-4       1
-39      1
 20      1
-38      1
 76      1
-15      1
 16      1
-34      1
-13      1
-24      1
Name: count, dtype: int64

In [27]:
no_digits_match_year_year.loc[:, "error"].abs().value_counts()

error
1      20
2       5
3       4
4       4
10      4
6       3
15      3
8       2
12      2
100     2
34      1
16      1
76      1
38      1
20      1
39      1
18      1
19      1
40      1
9       1
13      1
11      1
24      1
Name: count, dtype: int64
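
To summarise this distribution in a few numbers, the sketch below recomputes simple statistics from the absolute-error counts printed above. The dictionary is copied from that output; the variable names are assumptions for this example.

```python
import statistics

# Absolute error in years -> number of QA pairs, copied from the output above
abs_error_counts = {
    1: 20, 2: 5, 3: 4, 4: 4, 10: 4, 6: 3, 15: 3, 8: 2, 12: 2, 100: 2,
    34: 1, 16: 1, 76: 1, 38: 1, 20: 1, 39: 1, 18: 1, 19: 1, 40: 1,
    9: 1, 13: 1, 11: 1, 24: 1,
}

# Expand the counts back into one value per QA pair
abs_errors = [err for err, count in abs_error_counts.items() for _ in range(count)]

n_pairs = len(abs_errors)
mean_abs_error = sum(abs_errors) / n_pairs
median_abs_error = statistics.median(abs_errors)
share_within_one_year = abs_error_counts[1] / n_pairs

print(f"pairs: {n_pairs}")
print(f"mean absolute error: {mean_abs_error:.1f} years")
print(f"median absolute error: {median_abs_error} years")
print(f"share off by exactly one year: {share_within_one_year:.1%}")
```

About a third of these wrong predictions miss by a single year, and the median error is 4 years, while a few large outliers pull the mean up to roughly 12 years.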

In [28]:
fig = px.violin(no_digits_match_year_year, y="error")
fig.update_layout(width=500)
fig.show()

In [29]:
fig = px.histogram(no_digits_match_year_year, nbins=50, x="error")
fig.update_layout(width=500)
fig.show()

In [30]:
fig = px.box(no_digits_match_year_year, y="rel_error")
fig.update_layout(width=500)
fig.show()

In [31]:
no_digits_match_one_digit_one_digit = no_digits_match.query(
    r"~(predicted_answer.str.lower().str.match('\d+ years$') and actual_answer.str.lower().str.contains('\d+ years$')) "
    "and question.str.lower().str.startswith('how many years') "
    "and predicted_answer_nums_only.str.len() == 1 "
    "and actual_answer_nums_only.str.len() == 1 "
).assign(
    predicted_answer_nums_only_flat=lambda x: x["predicted_answer_nums_only"].apply(lambda y: y[0]).astype(int),
    actual_answer_nums_only_flat=lambda x: x["actual_answer_nums_only"].apply(lambda y: y[0]).astype(int),
    error=lambda x: x["predicted_answer_nums_only_flat"] - x["actual_answer_nums_only_flat"],
    rel_error=lambda x: round((x["error"] / x["actual_answer_nums_only_flat"]) * 100, 2)
)

In [32]:
no_digits_match_one_digit_one_digit.head()

Unnamed: 0,table,predicted_answer,actual_answer,question,predicted_answer_nums_only,actual_answer_nums_only,predicted_answer_nums_only_flat,actual_answer_nums_only_flat,error,rel_error
17,"<html><body><table class=""infobox biography vc...",4 years,2,How many years prior to Cumberbatch turning 45...,[4],[2],4,2,2,100.0
26,"<html><body><table class=""infobox biography vc...",8 years (active wrestling),23,How many years long was Dwayne Johnson's profe...,[8],[23],8,23,-15,-65.22
47,"<html><body><table class=""infobox ib-country v...",1871,601,How many years after the Ethiopian was first e...,[1871],[601],1871,601,1270,211.31
67,"<html><body><table class=""infobox vcard"" style...",8 years,9,How many years did Zinedine Zidane spend in th...,[8],[9],8,9,-1,-11.11
103,"<html><body><table class=""infobox vcard"" style...",35 years,31,How many years after winning his first competi...,[35],[31],35,31,4,12.9


In [33]:
no_digits_match_one_digit_one_digit.shape

(50, 10)

In [34]:
fig = px.histogram(no_digits_match_one_digit_one_digit, nbins=50, x="error")
fig.update_layout(width=500)
fig.show()

In [35]:
fig = px.box(no_digits_match_one_digit_one_digit, y="rel_error")
fig.update_layout(width=500)
fig.show()

In [36]:
fig = px.box(no_digits_match_one_digit_one_digit.query("rel_error < 1000"), y="rel_error")
fig.update_layout(width=500)
fig.show()

In [46]:
no_digits_match.assign(
    num_digits_predicted=lambda x: x["predicted_answer_nums_only"].apply(len),
    num_digits_actual=lambda x: x["actual_answer_nums_only"].apply(len),
).query(
    r"~(predicted_answer.str.lower().str.match('\d+ years$') and actual_answer.str.lower().str.contains('\d+ years$')) "
    # "and question.str.lower().str.startswith('how many years') "
    # Parenthesise the `or`: `and` binds tighter than `or` in query expressions
    "and (num_digits_predicted != 1 or num_digits_actual != 1)"
)

Unnamed: 0,table,predicted_answer,actual_answer,question,predicted_answer_nums_only,actual_answer_nums_only,num_digits_predicted,num_digits_actual
22,"<html><body><table class=""infobox biography vc...",8 years (1996-2004) and sporadically thereafter,23 years,How many active years did Dwayne Johnson wrestle?,"[8, 1996, 2004]",[23],3,1
27,"<html><body><table class=""infobox biography vc...","4 years, 4 years, 4 years, and 4 years",four,How long did each of Joan Crawford's marriages...,"[4, 4, 4, 4]",[],4,0
30,"<html><body><table class=""infobox biography vc...",4 years,4 Years (from 1955 to 1959),How many years did Joan Crawford and Alfred St...,[4],"[4, 1955, 1959]",1,3
35,"<html><body><table class=""infobox biography vc...",50 years,50 Years (1924–1974),How many years did Joan Crawford was active ca...,[50],"[50, 1924, 1974]",1,3
40,"<html><body><table class=""infobox biography vc...","Retired March 30, 2003 (age 38)",Age of 39,What was the the age when Hebar Pazardzhik ori...,"[30, 2003, 38]",[39],3,1
...,...,...,...,...,...,...,...,...
1892,"<table class=""infobox""><tbody><tr><th class=""i...",Approximately 0.254 km per flight,0.25 km,Approximately how many km did Ingenuity move p...,"[0, 254]","[0, 25]",2,2
1894,"<table class=""infobox""><tbody><tr><th class=""i...","April 3, 2021 to June 11, 2022",14 months,How long did Ingenuity take to go from deploym...,"[3, 2021, 11, 2022]",[14],4,1
1897,"<table class=""infobox""><tbody><tr><th class=""i...",About 14 months,1 year 2 months,How much time since deployment did it take for...,[14],"[1, 2]",1,2
1898,"<table class=""infobox""><tbody><tr><th class=""i...","April 7, 2021 (sol 46)",4 days,How long after Ingenuity was deployed was the ...,"[7, 2021, 46]",[4],3,1
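
Several of these rows pair a date range in the prediction with a duration in the expected answer, e.g., _April 3, 2021 to June 11, 2022_ against _14 months_. A minimal sketch of how such a range could be turned into a duration is shown below. The function names, the `" to "` separator, and the date format are assumptions for this example; parsing `%B` month names also assumes an English locale.

```python
from datetime import datetime
from typing import Optional


def months_between(start: str, end: str, fmt: str = "%B %d, %Y") -> int:
    """Number of full calendar months between two dates given as strings."""
    start_date = datetime.strptime(start, fmt)
    end_date = datetime.strptime(end, fmt)
    months = (end_date.year - start_date.year) * 12 + (end_date.month - start_date.month)
    if end_date.day < start_date.day:
        months -= 1  # the last month is not complete yet
    return months


def range_to_months(answer: str) -> Optional[int]:
    """Convert '<date> to <date>' into full months; None if the answer is not such a range."""
    start, separator, end = answer.partition(" to ")
    if not separator:
        return None
    return months_between(start.strip(), end.strip())


print(range_to_months("April 3, 2021 to June 11, 2022"))
print(range_to_months("14 months"))
```

Under this normalisation the date-range prediction agrees with the expected _14 months_, although both token-level and digit-level matching count it as wrong.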


In [49]:
gpt4_df.query("predicted_answer != actual_answer")

Unnamed: 0,table,predicted_answer,actual_answer,question
0,"<html><body><table class=""infobox biography vc...",47 years old,47,How old was Art Carney when he first got divor...
4,"<html><body><table class=""infobox biography vc...",54 years,54 Years,How many years did Art Carney as actor since 1...
5,"<html><body><table class=""infobox biography vc...",25 years,28 years,How many total years was Art Carney married to...
7,"<html><body><table class=""infobox biography vc...",23 years,23,How many years before he died was Art Carney m...
8,"<html><body><table class=""infobox biography vc...",19 years ago,19 Years ago,How many years ago did Art Carney was died?
...,...,...,...,...
1892,"<table class=""infobox""><tbody><tr><th class=""i...",Approximately 0.254 km per flight,0.25 km,Approximately how many km did Ingenuity move p...
1894,"<table class=""infobox""><tbody><tr><th class=""i...","April 3, 2021 to June 11, 2022",14 months,How long did Ingenuity take to go from deploym...
1897,"<table class=""infobox""><tbody><tr><th class=""i...",About 14 months,1 year 2 months,How much time since deployment did it take for...
1898,"<table class=""infobox""><tbody><tr><th class=""i...","April 7, 2021 (sol 46)",4 days,How long after Ingenuity was deployed was the ...


## Summary

In [63]:
fig = go.Figure(
    data=[
        go.Sankey(
            node=dict(
                label=[
                    "Full in-domain dataset", # 0
                    "Exact match", # 1
                    "Not exact match", # 2
                    "Matches on digits only", # 3
                    "No match on digits only", # 4
                    '"Year" in answer', # 5
                    '"Year" not in answer', # 6 
                    "Temporal question",# 7
                    "Non temporal question", # 8
                    '"Year" in answer and question', # 9,
                    "One digit in answer and question" # 10
                ],
                align="left",
                color="blue",
                pad=15,
                thickness=20,
                line=dict(color="black", width=0.5),
            ),
            link=dict(
                source=[
                    0,
                    0,
                    2,
                    2,
                    3,
                    3,
                    6,
                    6,
                    4,
                    4
                ],  # indices refer to positions in the `label` list above
                target=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                # 62 is the row count of no_digits_match_year_year reported above
                value=[693, 1208, 366, 842, 241, 125, 54, 71, 62, 50],
            ),
        )
    ]
)

fig.update_layout(title_text="TempTabQA in-domain GPT-4 predictions: breakdown by match type", font_size=10, width=1200)
fig.show()
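
Because the link values are entered by hand, a quick consistency check helps catch typos. The sketch below verifies that each fully broken-down node splits into parts that add up to its total. The counts are taken from the cells above; the dictionary layout and variable names are assumptions for this example. Note that 842 is larger than the 712 rows in `no_digits_match`, presumably because it also includes the answers in which neither side contains any digits.

```python
# Node -> (total, parts it is split into), using the counts reported in the notebook
splits = {
    "Full in-domain dataset": (1901, [693, 1208]),
    "Not exact match": (1208, [366, 842]),
    "Matches on digits only": (366, [241, 125]),
    '"Year" not in answer': (125, [54, 71]),
}

# Collect every node whose parts do not sum to its total
unbalanced = {
    name: (total, sum(parts))
    for name, (total, parts) in splits.items()
    if sum(parts) != total
}

for name, (total, parts) in splits.items():
    status = "ok" if name not in unbalanced else "MISMATCH"
    print(f"{name}: {total} vs {sum(parts)} -> {status}")
```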