In [2]:
import pandas as pd
from transformers import AutoTokenizer



# Evaluation Results

After fine-tuning the models, we need to evaluate their performance. We use [GERBIL](https://gerbil-qa.aksw.org/gerbil/config) to calculate the F1 score of each model and language individually. We convert the questions in the test dataset to SPARQL queries, send them to [Wikidata Query Service](https://query.wikidata.org/) to get answers, and build a dataset in QALD format. Then, we upload this file to GERBIL and evaluate it.
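The tables below report both micro- and macro-averaged scores, and the two can differ widely. The following is a minimal sketch of the standard definitions over answer sets. It is not GERBIL's implementation, the toy data is invented for illustration, and the `Macro F1 QALD` column is not modelled here.

```python
from typing import List, Set, Tuple

# One evaluation item: (gold answer set, predicted answer set).
Pair = Tuple[Set[str], Set[str]]


def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0 when both are 0)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def macro_f1(pairs: List[Pair]) -> float:
    """Average of per-question F1: every question counts equally."""
    scores = []
    for gold, pred in pairs:
        tp = len(gold & pred)
        precision = tp / len(pred) if pred else 0.0
        recall = tp / len(gold) if gold else 0.0
        scores.append(f1(precision, recall))
    return sum(scores) / len(scores)


def micro_f1(pairs: List[Pair]) -> float:
    """F1 over pooled counts: questions with many answers dominate."""
    tp = sum(len(gold & pred) for gold, pred in pairs)
    n_pred = sum(len(pred) for _, pred in pairs)
    n_gold = sum(len(gold) for gold, _ in pairs)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    return f1(precision, recall)


# Toy data (invented): one hit, one miss, one unanswered question
# whose gold answer set is very large.
toy_pairs: List[Pair] = [
    ({"Q1"}, {"Q1"}),
    ({"Q2"}, {"Q9"}),
    ({f"Q{i}" for i in range(100, 1100)}, set()),
]

toy_macro = macro_f1(toy_pairs)
toy_micro = micro_f1(toy_pairs)
print(f"macro F1 = {toy_macro:.4f}, micro F1 = {toy_micro:.4f}")
```

In this toy case a single unanswered question with a large gold answer set pulls micro recall close to zero while macro F1 is barely affected. That is one possible reading of the very low Micro F1 values in the tables, not a confirmed explanation.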

We conduct ablation studies using different models with different tokenizers. We fine-tune the models on the [QALD-9-plus](https://github.com/KGQA/QALD_9_plus) training dataset, a multilingual dataset that extends QALD-9 with more questions in nine languages and covers both DBpedia and Wikidata as knowledge graphs, and evaluate them on the test dataset. Because QALD-9-plus contains only European languages, we also add Chinese and Japanese translations of the questions. This broadens the language coverage for KGQA research and applications and lets multilingual KGQA models learn from more diverse data.

Note that, unless otherwise stated, the answers to each question are limited to 50 results to avoid excessive file size.
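As a rough illustration of the last two steps, the sketch below turns a SPARQL JSON response into one QALD-style entry and applies the 50-result limit. The function name `to_qald_entry`, the field layout, and the fake response are assumptions made for this example. The notebook does not show the actual conversion code.

```python
import json
from typing import Any, Dict

MAX_ANSWERS = 50  # limit stated in the text


def to_qald_entry(
    qid: int,
    question: str,
    language: str,
    sparql: str,
    response: Dict[str, Any],
    limit: int = MAX_ANSWERS,
) -> Dict[str, Any]:
    """Build one QALD-style entry from a SPARQL JSON response (hypothetical helper)."""
    if "boolean" in response:
        # ASK queries return a single boolean instead of bindings.
        answer = {"head": {}, "boolean": response["boolean"]}
    else:
        # SELECT queries: keep at most `limit` bindings.
        bindings = response["results"]["bindings"][:limit]
        answer = {"head": response["head"], "results": {"bindings": bindings}}
    return {
        "id": str(qid),
        "question": [{"language": language, "string": question}],
        "query": {"sparql": sparql},
        "answers": [answer],
    }


# Fake response with 120 bindings, standing in for a query service reply.
fake_response = {
    "head": {"vars": ["uri"]},
    "results": {
        "bindings": [
            {"uri": {"type": "uri", "value": f"http://www.wikidata.org/entity/Q{i}"}}
            for i in range(1, 121)
        ]
    },
}

entry = to_qald_entry(
    1,
    "Which states border Illinois?",
    "en",
    "SELECT DISTINCT ?uri WHERE { wd:Q1204 wdt:P47 ?uri . }",
    fake_response,
)
qald_file = {"questions": [entry]}
print(json.dumps(qald_file)[:80])
```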

## Experiment 1

The first experiment uses a multilingual text-to-text transformer model, [mT5-base](https://huggingface.co/google/mt5-base), to answer natural language questions over linked data.
The model is first pre-trained on [LC-QuAD](https://github.com/AskNowQA/LC-QuAD), a large-scale dataset that contains over 5000 questions and their corresponding SPARQL queries.
We also extract the entities and relations from the LC-QuAD dataset and add them to the mT5 tokenizer to improve the tokenization of knowledge graph entities. This new tokenizer is trained on the LC-QuAD training dataset.

We fine-tune the pre-trained model on datasets that contain varying numbers of languages. We experiment with four settings: one-shot, few-shot, no-en, and all languages.
- In the one-shot setting, we fine-tune the model on English questions only.
- In the few-shot setting (reported below as four-shot), we fine-tune the model on a small number of languages.
- In the no-en setting, we exclude English questions from the dataset while including all other languages.
- In the all languages setting, we fine-tune the model on all the available languages.
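The four settings can be sketched as language filters over the training data. The setting names, the record layout (QALD-style `question` and `query` fields), and the helper `select_training_pairs` are assumptions for illustration. The language lists are the ones named in this notebook.

```python
from typing import Any, Dict, List, Tuple

# Languages listed for the "all languages" setting of Experiment 1.
ALL_LANGUAGES = ["en", "de", "zh", "ja", "ru", "fr", "hy", "ba", "be", "uk", "lt"]

SETTINGS: Dict[str, List[str]] = {
    "one-shot": ["en"],
    "four-shot": ["en", "de", "zh", "ru"],
    "no-en": [lang for lang in ALL_LANGUAGES if lang != "en"],
    "all": list(ALL_LANGUAGES),
}


def select_training_pairs(
    records: List[Dict[str, Any]], setting: str
) -> List[Tuple[str, str]]:
    """Return (question, SPARQL) pairs for the languages of one setting."""
    languages = set(SETTINGS[setting])
    pairs = []
    for record in records:
        for question in record["question"]:
            if question["language"] in languages:
                pairs.append((question["string"], record["query"]["sparql"]))
    return pairs


# Toy record with placeholder translations (not taken from the dataset).
toy_records = [
    {
        "question": [
            {"language": "en", "string": "Which states border Illinois?"},
            {"language": "de", "string": "(German question text)"},
            {"language": "fr", "string": "(French question text)"},
        ],
        "query": {"sparql": "SELECT DISTINCT ?uri WHERE { wd:Q1204 wdt:P47 ?uri . }"},
    }
]

for name in SETTINGS:
    print(name, len(select_training_pairs(toy_records, name)))
```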

The experiment runs for 300 epochs before evaluation.

### mT5 tokenizer + LC-QuAD tokens

We first examine how the mT5 tokenizer tokenizes SPARQL queries.
Note that these SPARQL queries are preprocessed to simplify tokenization and reduce syntax errors.
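The preprocessing itself is not shown in this notebook. Judging from the example queries, it replaces SPARQL punctuation and prefixes with plain-word placeholders. The sketch below reconstructs such a mapping under that assumption. The function names and regular expressions are hypothetical and cover only the patterns that appear in the examples.

```python
import re


def encode_sparql(query: str) -> str:
    """Replace SPARQL syntax with word-like placeholders (assumed mapping)."""
    query = re.sub(r"\bwdt:(P\d+)", r"wdt_\1", query)   # wdt:P47  -> wdt_P47
    query = re.sub(r"\bwd:(Q\d+)", r"wd_\1", query)     # wd:Q1204 -> wd_Q1204
    query = re.sub(r"\?(\w+)", r"var_\1", query)        # ?uri     -> var_uri
    query = query.replace("{", " bra_open ").replace("}", " bra_close ")
    query = re.sub(r"\s\.(\s|$)", r" sep_dot\1", query)  # triple separator
    return " ".join(query.split())                       # normalize whitespace


def decode_sparql(encoded: str) -> str:
    """Invert encode_sparql so the output can be sent to a SPARQL endpoint."""
    query = re.sub(r"\bwdt_(P\d+)", r"wdt:\1", encoded)
    query = re.sub(r"\bwd_(Q\d+)", r"wd:\1", query)
    query = re.sub(r"\bvar_(\w+)", r"?\1", query)
    query = query.replace("bra_open", "{").replace("bra_close", "}")
    return query.replace("sep_dot", ".")


raw_query = "SELECT DISTINCT ?uri WHERE { wd:Q154216 wdt:P1303 ?uri . }"
encoded_query = encode_sparql(raw_query)
print(encoded_query)
print(decode_sparql(encoded_query))
```

The decoding direction matters as much as the encoding one: the model generates the placeholder form, which has to be mapped back before the query can be executed.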

We take three example SPARQL queries. 

query1: 
- Which instruments does Cat Stevens play?
- `SELECT DISTINCT var_uri WHERE bra_open wd_Q154216 wdt_P1303 var_uri sep_dot bra_close`

In [7]:
query1 = "SELECT DISTINCT var_uri WHERE bra_open wd_Q154216 wdt_P1303 var_uri sep_dot bra_close"

query2:
- Is horse racing a sport?
- `ASK WHERE bra_open wd_Q187916 wdt_P279* wd_Q349 sep_dot bra_close`

In [8]:
query2 = "ASK WHERE bra_open wd_Q187916 wdt_P279* wd_Q349 sep_dot bra_close"

query3:
- Which states border Illinois?
- `SELECT DISTINCT var_uri WHERE bra_open wd_Q1204 wdt_P47 var_uri sep_dot bra_close`

In [9]:
query3 = "SELECT DISTINCT var_uri WHERE bra_open wd_Q1204 wdt_P47 var_uri sep_dot bra_close"

We start with the original mT5 tokenizer.

In [11]:
mt5_tokenizer = AutoTokenizer.from_pretrained("google/mt5-base")



In [16]:
tokens = mt5_tokenizer.tokenize(query1)
print(tokens, end=", ")

['▁', 'SELECT', '▁D', 'ISTI', 'NCT', '▁var', '_', 'uri', '▁W', 'HERE', '▁bra', '_', 'open', '▁w', 'd', '_', 'Q', '1542', '16', '▁w', 'd', 't', '_', 'P', '1303', '▁var', '_', 'uri', '▁sep', '_', 'dot', '▁bra', '_', 'close'], 

In [17]:
tokens = mt5_tokenizer.tokenize(query2)
print(tokens, end=", ")

['▁', 'ASK', '▁W', 'HERE', '▁bra', '_', 'open', '▁w', 'd', '_', 'Q', '1879', '16', '▁w', 'd', 't', '_', 'P', '279', '*', '▁w', 'd', '_', 'Q', '349', '▁sep', '_', 'dot', '▁bra', '_', 'close'], 

In [18]:
tokens = mt5_tokenizer.tokenize(query3)
print(tokens, end=", ")

['▁', 'SELECT', '▁D', 'ISTI', 'NCT', '▁var', '_', 'uri', '▁W', 'HERE', '▁bra', '_', 'open', '▁w', 'd', '_', 'Q', '1204', '▁w', 'd', 't', '_', 'P', '47', '▁var', '_', 'uri', '▁sep', '_', 'dot', '▁bra', '_', 'close'], 

We see that entities and relations are split into single letters and short number fragments (for example, `wd_Q154216` becomes `▁w`, `d`, `_`, `Q`, `1542`, `16`), so their meaning is lost.
To address this, we extract the entities and relations from the LC-QuAD dataset and use them to expand the vocabulary of the mT5 tokenizer.
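One way to expand a tokenizer vocabulary is sketched below. This is an assumption, not the authors' procedure: the notebook does not show how `../lcquad_tokenizer` was built. `collect_kg_tokens` and `extend_tokenizer` are hypothetical helpers. `add_tokens` and `resize_token_embeddings` are existing Hugging Face Transformers methods, but they are only called if a tokenizer and model are passed in.

```python
import re
from typing import Iterable, List

# Matches the preprocessed entity and relation identifiers used in this notebook.
KG_TOKEN = re.compile(r"\b(?:wdt_P\d+|wd_Q\d+)\b")


def collect_kg_tokens(queries: Iterable[str]) -> List[str]:
    """Collect the distinct entity/relation identifiers found in the queries."""
    found = set()
    for query in queries:
        found.update(KG_TOKEN.findall(query))
    return sorted(found)


def extend_tokenizer(tokenizer, model, new_tokens: Iterable[str]) -> int:
    """Add tokens to a Hugging Face tokenizer and resize the model embeddings.

    Returns the number of tokens that were actually added.
    """
    added = tokenizer.add_tokens(list(new_tokens))
    model.resize_token_embeddings(len(tokenizer))
    return added


example_queries = [
    "SELECT DISTINCT var_uri WHERE bra_open wd_Q154216 wdt_P1303 var_uri sep_dot bra_close",
    "ASK WHERE bra_open wd_Q187916 wdt_P279* wd_Q349 sep_dot bra_close",
]
kg_tokens = collect_kg_tokens(example_queries)
print(kg_tokens)
```

With a loaded tokenizer and model, the call would look like `extend_tokenizer(mt5_tokenizer, model, kg_tokens)`, where `model` is whichever mT5 model is being fine-tuned.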

Next, we show the tokenization results of our newly trained tokenizer.

In [19]:
lcquad_tokenizer = AutoTokenizer.from_pretrained("../lcquad_tokenizer")

In [20]:
tokens = lcquad_tokenizer.tokenize(query1)
print(tokens, end=", ")

['▁', 'SELECT', '▁D', 'ISTI', 'NCT', '▁var', '_', 'uri', '▁W', 'HERE', '▁', 'bra_open', '▁', 'wd_Q154', '▁216', '▁', 'wdt_P1303', '▁var', '_', 'uri', '▁', 'sep_dot', '▁', 'bra_close'], 

In [21]:
tokens = lcquad_tokenizer.tokenize(query2)
print(tokens, end=", ")

['▁', 'ASK', '▁W', 'HERE', '▁', 'bra_open', '▁', 'wd_Q18', '▁7', '916', '▁', 'wdt_P279', '▁*', '▁', 'wd_Q349', '▁', 'sep_dot', '▁', 'bra_close'], 

In [22]:
tokens = lcquad_tokenizer.tokenize(query3)
print(tokens, end=", ")

['▁', 'SELECT', '▁D', 'ISTI', 'NCT', '▁var', '_', 'uri', '▁W', 'HERE', '▁', 'bra_open', '▁', 'wd_Q1204', '▁', 'wdt_P47', '▁var', '_', 'uri', '▁', 'sep_dot', '▁', 'bra_close'], 

This time, some entities and relations are kept as single tokens (for example, `wd_Q1204` and `wdt_P1303`), while others are still split (for example, `wd_Q154216`).
Although many tokens from LC-QuAD are added to the mT5 tokenizer, it does not cover all Wikidata entities and relations, so some of them are still not tokenized correctly.
It remains to be seen how well SPARQL queries over smaller knowledge graphs are tokenized when their entities and relations are added to the tokenizer.

### one-shot


We first fine-tune with English questions only, which shortens the fine-tuning time.

result: https://gerbil-qa.aksw.org/gerbil/experiment?id=202304030010

In [16]:
pd.read_csv("../pred_files/pretrain_lcqald/1/results.csv")

Unnamed: 0,Annotator,Micro F1,Micro Precision,Micro Recall,Macro F1,Macro Precision,Macro Recall,Macro F1 QALD
0,1-shot ba (uploaded),0.0,0.0,0.0,0.0441,0.0441,0.0441,0.0844
1,1-shot be (uploaded),0.007,0.3323,0.0036,0.0923,0.0953,0.0907,0.1632
2,1-shot de (uploaded),0.0055,0.1401,0.0028,0.0819,0.0821,0.0822,0.1475
3,1-shot en (uploaded),0.0094,0.2216,0.0048,0.1748,0.1785,0.1777,0.2891
4,1-shot fr (uploaded),0.0067,0.7955,0.0034,0.0776,0.0806,0.076,0.1409
5,1-shot hy (uploaded),0.0031,0.2526,0.0015,0.0411,0.0438,0.0399,0.0767
6,1-shot ja (uploaded),0.0004,0.0216,0.0002,0.07,0.0701,0.0699,0.1293
7,1-shot lt (uploaded),0.0066,0.2178,0.0034,0.0776,0.0806,0.076,0.1402
8,1-shot ru (uploaded),0.0039,0.1148,0.002,0.0793,0.0817,0.0846,0.151
9,1-shot uk (uploaded),0.0081,0.2075,0.0041,0.0666,0.0746,0.0639,0.1175


Conclusion:

- `en` has the best performance in this experiment (Macro F1 0.1748)
- all other languages perform poorly, with Macro F1 below 0.1

### four-shot 

We fine-tune on questions in four languages: en, de, zh, and ru.

result: https://gerbil-qa.aksw.org/gerbil/experiment?id=202304250003

In [25]:
pd.read_csv("../pred_files/pretrain_lcqald/4/results.csv")

Unnamed: 0,Annotator,Micro F1,Micro Precision,Micro Recall,Macro F1,Macro Precision,Macro Recall,Macro F1 QALD
0,pretrain_lcquad_few-shot_ba (uploaded),0.004,0.3316,0.002,0.0897,0.0938,0.0887,0.1607
1,pretrain_lcquad_few-shot_be (uploaded),0.0086,0.1741,0.0044,0.1072,0.1137,0.1055,0.1839
2,pretrain_lcquad_few-shot_de (uploaded),0.0088,0.1932,0.0045,0.1299,0.1363,0.1281,0.2173
3,pretrain_lcquad_few-shot_en (uploaded),0.0089,0.222,0.0045,0.1518,0.1582,0.15,0.2502
4,pretrain_lcquad_few-shot_fr (uploaded),0.0067,0.5866,0.0034,0.0775,0.0804,0.0761,0.1412
5,pretrain_lcquad_few-shot_hy (uploaded),0.0034,0.3354,0.0017,0.0484,0.051,0.0472,0.0901
6,pretrain_lcquad_few-shot_ja (uploaded),0.0088,0.1944,0.0045,0.1146,0.1213,0.1129,0.1961
7,pretrain_lcquad_few-shot_lt (uploaded),0.0091,0.1929,0.0047,0.0971,0.1097,0.0971,0.1729
8,pretrain_lcquad_few-shot_ru (uploaded),0.0094,0.2104,0.0048,0.1416,0.1468,0.1425,0.2387
9,pretrain_lcquad_few-shot_uk (uploaded),0.0086,0.1532,0.0044,0.115,0.1215,0.1132,0.1944


Conclusion:

- languages in the training dataset perform better than the other languages

### all languages (11)

In the all languages setting, we include en, de, zh, ja, ru, fr, hy, ba, be, uk, and lt.
es is not included because our Spanish translation of the questions was merged after this experiment was run.

result: https://gerbil-qa.aksw.org/gerbil/experiment?id=202304250002

In [26]:
pd.read_csv("../pred_files/pretrain_lcqald/11/results.csv")

Unnamed: 0,Annotator,Micro F1,Micro Precision,Micro Recall,Macro F1,Macro Precision,Macro Recall,Macro F1 QALD
0,pretrain_lcquad_ba (uploaded),0.0044,0.2438,0.0022,0.1266,0.1307,0.1256,0.2157
1,pretrain_lcquad_be (uploaded),0.0103,0.2195,0.0053,0.1316,0.1394,0.1293,0.2178
2,pretrain_lcquad_de (uploaded),0.0091,0.2117,0.0047,0.1666,0.1731,0.1648,0.27
3,pretrain_lcquad_en (uploaded),0.0088,0.1796,0.0045,0.1373,0.151,0.1355,0.2275
4,pretrain_lcquad_fr (uploaded),0.0067,0.7,0.0034,0.0775,0.0804,0.0761,0.1412
5,pretrain_lcquad_hy (uploaded),0.0035,0.432,0.0017,0.0631,0.0657,0.0619,0.1165
6,pretrain_lcquad_ja (uploaded),0.0091,0.2362,0.0047,0.1403,0.146,0.1402,0.2338
7,pretrain_lcquad_lt (uploaded),0.0102,0.2414,0.0052,0.126,0.1319,0.1248,0.2145
8,pretrain_lcquad_ru (uploaded),0.0089,0.2319,0.0045,0.1446,0.1582,0.1427,0.2376
9,pretrain_lcquad_uk (uploaded),0.0089,0.1759,0.0046,0.1519,0.1584,0.1501,0.2469


Conclusion: 
- `de` has the best performance (Macro F1 0.1666)
- `fr` and `hy` perform poorly (Macro F1 0.0775 and 0.0631)
- there is no obvious difference between the other languages (Macro F1 between 0.126 and 0.152)

### no-en

We exclude the English questions from training to test how this affects performance on English.

result: https://gerbil-qa.aksw.org/gerbil/experiment?id=202304250002

In [23]:
pd.read_csv("../pred_files/pretrain_lcqald/no_en/results.csv")

Unnamed: 0,Annotator,Micro F1,Micro Precision,Micro Recall,Macro F1,Macro Precision,Macro Recall,Macro F1 QALD
0,pretrain_lcquad_no-en_ba (uploaded),0.0043,0.2334,0.0021,0.1191,0.1232,0.1181,0.2056
1,pretrain_lcquad_no-en_be (uploaded),0.0088,0.1972,0.0045,0.1226,0.1291,0.1207,0.207
2,pretrain_lcquad_no-en_de (uploaded),0.0088,0.1898,0.0045,0.1298,0.1363,0.128,0.2173
3,pretrain_lcquad_no-en_en (uploaded),0.0087,0.1673,0.0045,0.1371,0.1435,0.1353,0.229
4,pretrain_lcquad_no-en_fr (uploaded),0.0067,0.5412,0.0034,0.0775,0.0804,0.0761,0.1411
5,pretrain_lcquad_no-en_hy (uploaded),0.0034,0.3776,0.0017,0.0631,0.0657,0.0619,0.1165
6,pretrain_lcquad_no-en_ja (uploaded),0.0089,0.2129,0.0046,0.1444,0.151,0.1426,0.2367
7,pretrain_lcquad_no-en_lt (uploaded),0.0092,0.1929,0.0047,0.1165,0.1228,0.1148,0.198
8,pretrain_lcquad_no-en_ru (uploaded),0.0087,0.2315,0.0044,0.1224,0.1288,0.1206,0.2082
9,pretrain_lcquad_no-en_uk (uploaded),0.0086,0.1823,0.0044,0.1224,0.1288,0.1206,0.2067


Conclusion:

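- most languages perform worse than in the all languages setting, which includes English questions
- `ja` improves slightly (Macro F1 0.1403 to 0.1444)
- excluding the English questions makes no significant difference for `en` (Macro F1 0.1373 vs. 0.1371)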

## Experiment 2

In the second experiment, we use the google/mt5-base model with the original mT5 tokenizer and without any pre-training on the LC-QuAD dataset, to observe whether pre-training on LC-QuAD and adding tokens to the tokenizer improve performance.

As in Experiment 1, each run is fine-tuned for 300 epochs before evaluation.

### one-shot

First, we try with only English questions.

results: https://gerbil-qa.aksw.org/gerbil/experiment?id=202304050000

In [3]:
pd.read_csv("../pred_files/mt5/1_mt5/results.csv")

Unnamed: 0,Annotator,Micro F1,Micro Precision,Micro Recall,Macro F1,Macro Precision,Macro Recall,Macro F1 QALD
0,0-shot mt5 ba (uploaded),0.0001,0.0039,0.0,0.0515,0.0515,0.0515,0.0971
1,0-shot mt5 be (uploaded),0.0046,0.1299,0.0023,0.086,0.0934,0.0854,0.1524
2,0-shot mt5 de (uploaded),0.0088,0.1519,0.0046,0.1152,0.1289,0.1112,0.189
3,0-shot mt5 en (uploaded),0.0098,0.2193,0.005,0.2185,0.2324,0.2143,0.3207
4,0-shot mt5 fr (uploaded),0.0064,0.6061,0.0032,0.065,0.0732,0.0625,0.1175
5,0-shot mt5 hy (uploaded),0.0031,0.8727,0.0015,0.0485,0.0512,0.0473,0.0902
6,0-shot mt5 ja (uploaded),0.0032,0.0741,0.0016,0.0554,0.0657,0.0521,0.0963
7,0-shot mt5 lt (uploaded),0.0063,0.1477,0.0032,0.0503,0.0585,0.048,0.0903
8,0-shot mt5 ru (uploaded),0.0049,0.0936,0.0025,0.1007,0.1086,0.1001,0.1724
9,0-shot mt5 uk (uploaded),0.0046,0.0941,0.0023,0.0567,0.0645,0.0621,0.1139


Conclusion:

- `en` has a higher F1 score than in the one-shot setting of Experiment 1 (Macro F1 0.2185 vs. 0.1748)
- relative to Experiment 1, Macro F1 rises for `de` and `ru` (and marginally for `ba` and `hy`) and falls for the remaining languages

### four-shot

We again choose `en`, `de`, `zh`, and `ru` for four-shot training.

results: https://gerbil-qa.aksw.org/gerbil/experiment?id=202304040000

In [5]:
pd.read_csv("../pred_files/mt5/4_mt5/results.csv")

Unnamed: 0,Annotator,Micro F1,Micro Precision,Micro Recall,Macro F1,Macro Precision,Macro Recall,Macro F1 QALD
0,4-shot mt5 ba (uploaded),0.0045,0.2023,0.0023,0.1267,0.1342,0.1256,0.2116
1,4-shot mt5 be (uploaded),0.0096,0.1835,0.0049,0.1489,0.1593,0.1514,0.227
2,4-shot mt5 de (uploaded),0.0098,0.1708,0.0051,0.2062,0.2177,0.2033,0.2821
3,4-shot mt5 en (uploaded),0.0096,0.134,0.005,0.1915,0.203,0.1886,0.2596
4,4-shot mt5 fr (uploaded),0.0068,0.5408,0.0034,0.0775,0.0804,0.0761,0.1409
5,4-shot mt5 hy (uploaded),0.0037,0.3118,0.0019,0.0579,0.0597,0.0595,0.1119
6,4-shot mt5 ja (uploaded),0.0082,0.1904,0.0042,0.143,0.1526,0.1396,0.2144
7,4-shot mt5 lt (uploaded),0.0088,0.1956,0.0045,0.134,0.1428,0.1311,0.211
8,4-shot mt5 ru (uploaded),0.0095,0.1313,0.0049,0.1548,0.1664,0.1518,0.2236
9,4-shot mt5 uk (uploaded),0.0094,0.1421,0.0048,0.1664,0.1766,0.1661,0.2391


Conclusion:
- `en` performs slightly worse than in the one-shot setting (Macro F1 0.1915 vs. 0.2185)
- The F1 score improves significantly for the other languages in the training dataset, `de` and `ru`.
- The F1 score also increases significantly for `ja`, which has a connection to `zh`, and for `ba`, `be`, and `uk`, which have a strong connection to `ru`.

### all languages (11)

We also fine-tuned for all 11 languages. 

results: https://gerbil-qa.aksw.org/gerbil/experiment?id=202304080000

In [11]:
pd.read_csv("../pred_files/mt5/11_mt5/results.csv")

Unnamed: 0,Annotator,Micro F1,Micro Precision,Micro Recall,Macro F1,Macro Precision,Macro Recall,Macro F1 QALD
0,11-shot mt5 ba (uploaded),0.0056,0.1589,0.0029,0.1571,0.1628,0.1557,0.2457
1,11-shot mt5 be (uploaded),0.0098,0.1442,0.0051,0.1986,0.2171,0.1956,0.2704
2,11-shot mt5 de (uploaded),0.0099,0.1477,0.0051,0.1962,0.2137,0.192,0.2634
3,11-shot mt5 en (uploaded),0.0099,0.1439,0.0051,0.1914,0.2134,0.1884,0.2585
4,11-shot mt5 es (uploaded),0.01,0.1481,0.0052,0.2059,0.2137,0.2035,0.2787
5,11-shot mt5 fr (uploaded),0.0068,0.5354,0.0034,0.0775,0.0804,0.0761,0.1408
6,11-shot mt5 hy (uploaded),0.0034,0.1957,0.0017,0.0558,0.0584,0.0546,0.103
7,11-shot mt5 ja (uploaded),0.0104,0.1583,0.0054,0.1966,0.2052,0.1945,0.2625
8,11-shot mt5 lt (uploaded),0.0104,0.1307,0.0054,0.19,0.2089,0.1854,0.2641
9,11-shot mt5 ru (uploaded),0.0104,0.1369,0.0054,0.1775,0.1907,0.1742,0.2393


To see how much the limit of 50 results lowers the F1 score, we generate QALD files without the limit. Fortunately, these files are not very large.

results: https://gerbil-qa.aksw.org/gerbil/experiment?id=202304080001

In [12]:
pd.read_csv("../pred_files/mt5/11_mt5_all/results.csv")

Unnamed: 0,Annotator,Micro F1,Micro Precision,Micro Recall,Macro F1,Macro Precision,Macro Recall,Macro F1 QALD
0,11-shot mt5 ba (uploaded),0.0047,0.0116,0.003,0.1573,0.1625,0.1563,0.2457
1,11-shot mt5 be (uploaded),0.0071,0.0064,0.008,0.2164,0.2316,0.2151,0.2852
2,11-shot mt5 de (uploaded),0.0079,0.0077,0.008,0.2139,0.228,0.2115,0.2828
3,11-shot mt5 en (uploaded),0.0072,0.0066,0.0081,0.2093,0.228,0.2083,0.2749
4,11-shot mt5 es (uploaded),0.0066,0.0055,0.0082,0.2237,0.2283,0.2234,0.295
5,11-shot mt5 fr (uploaded),0.0111,0.4137,0.0056,0.088,0.0877,0.0882,0.1613
6,11-shot mt5 hy (uploaded),0.006,0.0144,0.0038,0.066,0.0659,0.0662,0.1235
7,11-shot mt5 ja (uploaded),0.0069,0.0059,0.0083,0.2144,0.2198,0.214,0.2761
8,11-shot mt5 lt (uploaded),0.0076,0.0071,0.0083,0.1931,0.2087,0.1902,0.265
9,11-shot mt5 ru (uploaded),0.007,0.006,0.0083,0.1953,0.2051,0.1938,0.2553


Without the limit, the Macro F1 score improves by roughly 0 to 0.02 (0 to 2 percentage points), depending on the language.
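The per-language gain can be checked directly from the two tables above. The snippet below copies the `Macro F1` column of both runs by hand. The variable names are chosen for this example.

```python
# Macro F1 with the 50-result limit (experiment id 202304080000).
limited = {
    "ba": 0.1571, "be": 0.1986, "de": 0.1962, "en": 0.1914, "es": 0.2059,
    "fr": 0.0775, "hy": 0.0558, "ja": 0.1966, "lt": 0.19, "ru": 0.1775,
}

# Macro F1 without the limit (experiment id 202304080001).
unlimited = {
    "ba": 0.1573, "be": 0.2164, "de": 0.2139, "en": 0.2093, "es": 0.2237,
    "fr": 0.088, "hy": 0.066, "ja": 0.2144, "lt": 0.1931, "ru": 0.1953,
}

# Gain from lifting the limit, rounded to the precision of the tables.
gain = {lang: round(unlimited[lang] - limited[lang], 4) for lang in limited}

for lang, delta in sorted(gain.items(), key=lambda item: item[1]):
    print(f"{lang}: +{delta:.4f}")
```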

Conclusion:

- As in Experiment 1, `fr` and `hy` still perform worst
- 9 of 12 languages reach an F1 score above 0.19
- In the all languages setting, Experiment 2 outperforms Experiment 1 for every language in the tables except `fr` (unchanged) and `hy` (slightly lower)

In both Experiment 1 and Experiment 2, the one-shot setting delivers the highest F1 score for `en`. We conclude that multilingual fine-tuning hurts performance on a single language. However, in a multilingual KGQA system, performance on the other languages also matters. With more languages in the training dataset, the average F1 score increases.
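Both observations can be reproduced for Experiment 2 from the `Macro F1` columns above. The values below are copied by hand from the three tables that use the 50-result limit. Note that the all languages table lists `es` instead of `uk`, so the averages are taken over slightly different language sets and should be read as a rough comparison only.

```python
from statistics import mean

# Macro F1 per language for Experiment 2, copied from the tables above.
exp2_macro_f1 = {
    "one-shot": {
        "ba": 0.0515, "be": 0.086, "de": 0.1152, "en": 0.2185, "fr": 0.065,
        "hy": 0.0485, "ja": 0.0554, "lt": 0.0503, "ru": 0.1007, "uk": 0.0567,
    },
    "four-shot": {
        "ba": 0.1267, "be": 0.1489, "de": 0.2062, "en": 0.1915, "fr": 0.0775,
        "hy": 0.0579, "ja": 0.143, "lt": 0.134, "ru": 0.1548, "uk": 0.1664,
    },
    "all": {
        "ba": 0.1571, "be": 0.1986, "de": 0.1962, "en": 0.1914, "es": 0.2059,
        "fr": 0.0775, "hy": 0.0558, "ja": 0.1966, "lt": 0.19, "ru": 0.1775,
    },
}

# English score and unweighted mean over the listed languages, per setting.
en_scores = {name: scores["en"] for name, scores in exp2_macro_f1.items()}
mean_scores = {name: mean(scores.values()) for name, scores in exp2_macro_f1.items()}

for name in exp2_macro_f1:
    print(f"{name:>9}: en={en_scores[name]:.4f} mean={mean_scores[name]:.4f}")
```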

Compare to Experiment 1:



## Experiment 3

Config:
- model: mt5-base
- tokenizer: mt5 tokenizer with added tokens from lcquad
- epochs: 300

### one-shot

en

results: https://gerbil-qa.aksw.org/gerbil/experiment?id=202304050004

In [15]:
pd.read_csv("../pred_files/mt5_lcquad_tokenizer/1_mt5_lcquadtokenizer/results.csv")

Unnamed: 0,Annotator,Micro F1,Micro Precision,Micro Recall,Macro F1,Macro Precision,Macro Recall,Macro F1 QALD
0,1-shot mt5 lcquad tokenizer ba (uploaded),0.0032,0.1932,0.0016,0.0585,0.0588,0.0581,0.1089
1,1-shot mt5 lcquad tokenizer be (uploaded),0.0086,0.1854,0.0044,0.1107,0.1213,0.1091,0.1892
2,1-shot mt5 lcquad tokenizer de (uploaded),0.0082,0.1813,0.0042,0.1005,0.1072,0.0986,0.171
3,1-shot mt5 lcquad tokenizer en (uploaded),0.0091,0.2416,0.0046,0.1666,0.1726,0.1662,0.2718
4,1-shot mt5 lcquad tokenizer fr (uploaded),0.0067,0.3608,0.0034,0.0776,0.0806,0.076,0.1407
5,1-shot mt5 lcquad tokenizer hy (uploaded),0.0,0.0,0.0,0.0368,0.0368,0.0368,0.0707
6,1-shot mt5 lcquad tokenizer ja (uploaded),0.0038,0.169,0.0019,0.0886,0.0956,0.0879,0.1569
7,1-shot mt5 lcquad tokenizer lt (uploaded),0.0071,0.1728,0.0036,0.0793,0.0861,0.0771,0.1406
8,1-shot mt5 lcquad tokenizer ru (uploaded),0.0077,0.2111,0.0039,0.1122,0.1236,0.1088,0.1882
9,1-shot mt5 lcquad tokenizer uk (uploaded),0.0085,0.1648,0.0044,0.1034,0.1216,0.1003,0.1749


### four-shot

en, de, zh, ru

results: https://gerbil-qa.aksw.org/gerbil/experiment?id=202304050001

In [14]:
pd.read_csv("../pred_files/mt5_lcquad_tokenizer/4_mt5_lcquadtokenizer/results.csv")

Unnamed: 0,Annotator,Micro F1,Micro Precision,Micro Recall,Macro F1,Macro Precision,Macro Recall,Macro F1 QALD
0,4-shot mt5 lcquad tokenizer ba (uploaded),0.0039,0.202,0.002,0.0897,0.0938,0.0887,0.1591
1,4-shot mt5 lcquad tokenizer be (uploaded),0.0083,0.1421,0.0043,0.1071,0.1135,0.1054,0.1819
2,4-shot mt5 lcquad tokenizer de (uploaded),0.0089,0.1736,0.0046,0.1513,0.1578,0.1496,0.2406
3,4-shot mt5 lcquad tokenizer en (uploaded),0.0089,0.1725,0.0046,0.1591,0.1653,0.1642,0.2578
4,4-shot mt5 lcquad tokenizer fr (uploaded),0.0067,0.7955,0.0034,0.0775,0.0804,0.0761,0.1411
5,4-shot mt5 lcquad tokenizer hy (uploaded),0.0034,0.53,0.0017,0.0558,0.0584,0.0546,0.1034
6,4-shot mt5 lcquad tokenizer ja (uploaded),0.0086,0.2036,0.0044,0.1218,0.1281,0.1201,0.2041
7,4-shot mt5 lcquad tokenizer lt (uploaded),0.0086,0.1568,0.0044,0.0925,0.0992,0.0908,0.1595
8,4-shot mt5 lcquad tokenizer ru (uploaded),0.0087,0.2094,0.0044,0.122,0.1356,0.1202,0.2029
9,4-shot mt5 lcquad tokenizer uk (uploaded),0.0087,0.1893,0.0044,0.1291,0.1356,0.1274,0.2135


Conclusion:

- 

## Experiment 4

Config:
- model: mt5-base
- tokenizer: mt5 tokenizer
- epochs: 300
- training data: natural language questions with POS tags, dependency relation, and dependency level
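The notebook does not show how the linguistic features are attached to the questions. The sketch below illustrates one possible encoding, assuming each token carries its POS tag, dependency relation, and depth in the dependency tree. The `text|POS|dep|level` format, the helper names, and the hand-written parse are all assumptions. In practice the parse would come from a parser such as spaCy, where the root token's head is the token itself.

```python
from typing import List, Tuple

# (text, POS tag, dependency relation, index of the head token)
Token = Tuple[str, str, str, int]


def dependency_levels(tokens: List[Token]) -> List[int]:
    """Depth of each token in the dependency tree (root = 0)."""
    levels = []
    for index in range(len(tokens)):
        depth, node = 0, index
        # Walk up to the root; the root is its own head.
        while tokens[node][3] != node:
            node = tokens[node][3]
            depth += 1
        levels.append(depth)
    return levels


def annotate(tokens: List[Token]) -> str:
    """Join tokens as text|POS|dep|level (hypothetical input format)."""
    levels = dependency_levels(tokens)
    return " ".join(
        f"{text}|{pos}|{dep}|{level}"
        for (text, pos, dep, _), level in zip(tokens, levels)
    )


# Hand-written parse of one example question (not parser output).
parsed_question: List[Token] = [
    ("Which", "DET", "det", 1),
    ("states", "NOUN", "nsubj", 2),
    ("border", "VERB", "ROOT", 2),
    ("Illinois", "PROPN", "dobj", 2),
    ("?", "PUNCT", "punct", 2),
]

annotated = annotate(parsed_question)
print(annotated)
```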

### all languages (11)

en, de, zh, ja, ru, fr, ba, be, uk, lt, es

There is no spaCy model for Armenian, so Armenian (`hy`) is excluded.

results: https://gerbil-qa.aksw.org/gerbil/experiment?id=202304170000

In [2]:
pd.read_csv("../pred_files/mt5_linguistic/11_linguistic_mt5/results.csv")

Unnamed: 0,Annotator,Micro F1,Micro Precision,Micro Recall,Macro F1,Macro Precision,Macro Recall,Macro F1 QALD
0,linguistic_mt5_11_ba (uploaded),0.0066,0.1858,0.0034,0.159,0.1682,0.1571,0.2498
1,linguistic_mt5_11_be (uploaded),0.0101,0.1567,0.0052,0.2133,0.2245,0.2114,0.2851
2,linguistic_mt5_11_de (uploaded),0.0101,0.1397,0.0053,0.2253,0.2392,0.2212,0.2939
3,linguistic_mt5_11_en (uploaded),0.0103,0.1515,0.0053,0.2432,0.2543,0.2433,0.32
4,linguistic_mt5_11_es (uploaded),0.01,0.145,0.0052,0.2353,0.2467,0.2335,0.3161
5,linguistic_mt5_11_fr (uploaded),0.0067,0.438,0.0034,0.0849,0.0878,0.0835,0.1531
6,linguistic_mt5_11_ja (uploaded),0.0099,0.166,0.0051,0.2008,0.2098,0.2047,0.2832
7,linguistic_mt5_11_lt (uploaded),0.0095,0.135,0.0049,0.2179,0.2312,0.2138,0.2988
8,linguistic_mt5_11_ru (uploaded),0.0099,0.1348,0.0052,0.213,0.2243,0.2102,0.2775
9,linguistic_mt5_11_uk (uploaded),0.0101,0.1349,0.0052,0.2424,0.2537,0.2396,0.3019
