This notebook is a guide to experimenting with the tools set up for translating keywords from the GoTriple platform and linking them to a controlled vocabulary. Before running it, please make sure that you have installed all the dependencies listed in the file requirements.txt (this is not necessary if you run the notebook with Binder, as described in the GitHub folder of the project). The notebook uses functions defined in the files main_functions.py, data_utils.py and tools_utils.py, so the first step is to import these modules.

In [7]:
import os

import main_functions
import data_utils
import tools_utils

import importlib
importlib.reload(data_utils)
importlib.reload(tools_utils)
importlib.reload(main_functions)

<module 'main_functions' from 'c:\\Users\\paolo\\Desktop\\T3.4.1_KeywordsTranslation\\main_functions.py'>

The next part of this tutorial shows how to extract article data in a format suitable for the translation tool. We cover two cases: extracting data for an article given its GoTriple ID, and manually entering data for your own article.
Given the ID of an article on the GoTriple platform, the following line of code retrieves all the data the tools need (title, abstract, keywords) in the right format. Note that the slash in the ID is passed URL-encoded (%2F). The example uses an article about French feminism.

In [2]:
data = data_utils.get_item_by_id('10670%2F1.nxegh0')
data

{'Language': 'fr',
 'Id': '10670/1.nxegh0',
 'Keywords': ['anthropologie',
  'France',
  'politique',
  'représentation',
  'image',
  'États-Unis',
  'art plastique',
  'cinéma'],
 'Title_eng': '12- from French kiss to French Feminism, the French cultural exception',
 'Title_or': '12- Du French kiss au French Feminism, l’exception culturelle française',
 'Abstract_eng': 'Most of the images of France originating in the United States are stereotypical, in particular those concerning love and sex. The French kiss and the French lover are seen as cynosures and French woman are often idealised; to such an extent in fact that charm and seduction are regarded as their defining attributes. While trying to define French feminism as presented both by French and American feminists, this paper will attempt to show how images and stereotypes play an important role in dividing the feminist movement and related cultural mores.',
 'Abstract_or': None}
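
Since every tool below expects an item with this layout, it can help to check an item before passing it on. The following is a minimal sketch: the key list is copied from the output above, and `REQUIRED_KEYS` and `missing_fields` are hypothetical names that are not part of data_utils.

```python
# Hypothetical helper (not part of data_utils): checks an item for the
# keys that appear in the output above.
REQUIRED_KEYS = (
    "Language", "Id", "Keywords",
    "Title_eng", "Title_or", "Abstract_eng", "Abstract_or",
)


def missing_fields(item):
    """Return the expected keys that are absent from an item dictionary."""
    return [key for key in REQUIRED_KEYS if key not in item]


complete = dict.fromkeys(REQUIRED_KEYS)  # every key present, all values None
partial = {"Language": "fr", "Keywords": ["France"]}

print(missing_fields(complete))
print(missing_fields(partial))
```

An empty list means the item has every expected key. The check says nothing about whether the values themselves are usable.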

Suppose now that we want to apply our tool to an article that is not on the GoTriple platform. The following interactive function asks you to enter your own article data and automatically formats it in a way that is suitable for the tool functions.

In [3]:
data = data_utils.get_item_from_user()

In [4]:
data

{'Language': 'fr',
 'Id': None,
 'Keywords': ['anthropologie',
  'France',
  'politique',
  'représentation',
  'image'],
 'Title_eng': 'Most of the images of France',
 'Title_or': 'Du French kiss au French Feminism',
 'Abstract_eng': 'unknown',
 'Abstract_or': 'unknown'}
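
If you prefer to skip the interactive prompts (for example in a script), an item with the same keys can be built directly. This is only a sketch: `make_item` is a hypothetical helper, and using 'unknown' for a missing abstract is an assumption based on the output above, not a documented convention of data_utils.

```python
def make_item(language, keywords, title_or, title_eng,
              abstract_or="unknown", abstract_eng="unknown", item_id=None):
    """Build an item dictionary with the same keys as the output above.

    Hypothetical helper; the 'unknown' defaults mirror the printed output
    and are an assumption about what the tools accept.
    """
    return {
        "Language": language,
        "Id": item_id,
        "Keywords": list(keywords),
        "Title_eng": title_eng,
        "Title_or": title_or,
        "Abstract_eng": abstract_eng,
        "Abstract_or": abstract_or,
    }


item = make_item(
    "fr",
    ["anthropologie", "France", "politique"],
    title_or="Du French kiss au French Feminism",
    title_eng="Most of the images of France",
)
print(item["Keywords"])
print(item["Abstract_or"])
```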

For the rest of this tutorial, we will work with articles from the GoTriple platform, using a sample of data in different languages. We start by getting a sample of items (keywords with title and abstract) to experiment with, using the function get_sample. The function lets you choose the size of the sample and the languages it includes (each language is equally represented in the sample). Let's try with English, French and Portuguese. Please check the file data_utils.py if you are interested in more detail about the function.

In [3]:
data = data_utils.get_sample(['en', 'fr', 'pt'], 100)

Now we pick one item from the extracted sample for each language and print it.

In [5]:
test_item_en = data[0]
test_item_fr = data[10]
test_item_pt = data[15]
print(test_item_en)
print(test_item_fr)
print(test_item_pt)

{'Language': 'en', 'Id': 'oai:revues.org:lhomme/21745', 'Keywords': ['Dravidian Kinship', 'Africa', 'Evolutionism', 'Historicism'], 'Title_eng': 'Dravidian Kinship Systems in Africa', 'Title_or': 'Dravidian Kinship Systems in Africa', 'Abstract_eng': 'AbstractDravidianate kinship systems based on a rule of bilateral cross-cousin marriage are usually taken as the starting point in universal theories of kinship evolution while Iroquois systems, which lack such a rule, are regarded as devolved versions of Dravidian systems. Dravidian and Iroquois systems, however, have an uneven geographical distribution. The former are well known from South Asia, Australia and America but not from Europe or Africa, while the latter are known from many regions of the world but not from South Asia. The purpose of this paper is to describe a Dravidian kinship system in a Bantu-speaking society and to suggest the presence or former presence of Dravidianate systems elsewhere in Africa.', 'Abstract_or': 'Abstr
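
The fixed indices above rely on the order in which the sample is returned. A more defensive option is to select items by their 'Language' field. The sketch below assumes only that the sample is a list of dictionaries with a 'Language' key, as in the printed item; `first_item_per_language` is a hypothetical helper, and the toy sample stands in for the real one.

```python
def first_item_per_language(sample):
    """Map each language code to the first item in the sample with that language."""
    picked = {}
    for item in sample:
        # setdefault keeps the first item seen for each language
        picked.setdefault(item["Language"], item)
    return picked


# Toy stand-in for the list returned by get_sample
toy_sample = [
    {"Language": "en", "Id": "a"},
    {"Language": "en", "Id": "b"},
    {"Language": "fr", "Id": "c"},
    {"Language": "pt", "Id": "d"},
]

picked = first_item_per_language(toy_sample)
print(sorted(picked))
print(picked["en"]["Id"])
```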

Now that we have items to experiment with, let's try each of the tools that have been set up, starting with DBpedia Spotlight, an application for named entity linking (see https://www.dbpedia-spotlight.org/). First, we extract URIs for the English example.

In [5]:
output = main_functions.useDBPediaSpotlight(test_item_en, False)
output

[{'Form': 'anthropologie',
  'DBPediaURI': 'http://fr.dbpedia.org/resource/Anthropologie',
  'WikidataURI': None},
 {'Form': 'France',
  'DBPediaURI': 'http://fr.dbpedia.org/resource/France',
  'WikidataURI': None}]

Although DBpedia Spotlight can work with different languages, its performance is poorer for languages other than English. With the French example, only a subset of the keywords is linked in the final output. Moreover, the functionality that links DBpedia URIs to Wikidata URIs is not available (an analysis of the reasons can be found in the file main_functions.py, before the definition of useDBPediaSpotlight). The same can be seen with the Portuguese example. In some cases, we can get an error because DBpedia Spotlight does not find any link in the given text.

In [5]:
output = main_functions.useDBPediaSpotlight(test_item_fr, False)
output

[{'Form': 'féminicide',
  'DBPediaURI': 'http://fr.dbpedia.org/resource/Féminicide',
  'WikidataURI': None}]

In [6]:
output = main_functions.useDBPediaSpotlight(test_item_pt, False)
output

[{'Form': 'feminismo',
  'DBPediaURI': 'http://pt.dbpedia.org/resource/Feminismo',
  'WikidataURI': None}]
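
Because the call can fail when no link is found, one option when processing many items is to wrap it so that a failure yields an empty list instead of stopping the loop. This is a sketch under assumptions: the exception type raised by useDBPediaSpotlight is not documented here, so the wrapper catches Exception broadly, and `safe_link`, `fake_ok` and `fake_fail` are hypothetical names used only for illustration.

```python
def safe_link(link_fn, item, *args, **kwargs):
    """Call a linking function and return [] if it raises.

    Hypothetical wrapper; catching Exception is deliberately broad because
    the concrete error type is an unknown here.
    """
    try:
        return link_fn(item, *args, **kwargs)
    except Exception as err:
        print(f"Linking failed for item {item.get('Id')}: {err}")
        return []


# Stand-ins for a linking function that succeeds and one that fails
def fake_ok(item, flag):
    return [{"Form": "France"}]


def fake_fail(item, flag):
    raise ValueError("no annotations found")


print(safe_link(fake_ok, {"Id": "x"}, False))
print(safe_link(fake_fail, {"Id": "y"}, False))
```

In the notebook, the call would then look like `safe_link(main_functions.useDBPediaSpotlight, test_item_pt, False)`.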

Now we will use an LLM to get the linking. More precisely, the LLM translates keywords into entities that should have a page on Wikidata (this is what the prompt asks for). Then, the Wikidata API is used to get the URIs. The file main_functions.py provides functions to experiment with open-source LLMs (using quantized models via the llama.cpp library, which allows running without a GPU) and with proprietary models by OpenAI. Let's first use Mistral-7B-Instruct (the default model loaded by the function tools_utils.loadLLM), starting with the English example.

Via the parameter context, we give the model only the title of the keywords' article as context.

As can be seen, the execution is quite slow. The model tends to produce a richer set of keywords. 

In [9]:
llm = tools_utils.loadLLM()

output = main_functions.useLLM(test_item_en, llm, context="Title")
output

[{'Keyword': 'Dravidian_people',
  'URI': 'http://www.wikidata.org/entity/Q69798'},
 {'Keyword': 'Dravidian_languages',
  'URI': 'http://www.wikidata.org/entity/Q33311'},
 {'Keyword': 'Dravidian_culture', 'URI': ''},
 {'Keyword': 'Social_organization_of_the_Dravidian_people', 'URI': ''},
 {'Keyword': 'Africa', 'URI': 'http://www.wikidata.org/entity/Q15'},
 {'Keyword': 'History_of_Africa',
  'URI': 'http://www.wikidata.org/entity/Q149813'},
 {'Keyword': 'Demography_of_Africa', 'URI': ''},
 {'Keyword': 'Culture_of_Africa',
  'URI': 'http://www.wikidata.org/entity/Q149416'},
 {'Keyword': 'Evolutionism', 'URI': 'http://www.wikidata.org/entity/Q1076026'},
 {'Keyword': 'Theory_of_evolution',
  'URI': 'http://www.wikidata.org/entity/Q11640129'},
 {'Keyword': 'Charles_Darwin', 'URI': 'http://www.wikidata.org/entity/Q1035'},
 {'Keyword': 'Social_Darwinism',
  'URI': 'http://www.wikidata.org/entity/Q202261'},
 {'Keyword': 'Historicism', 'URI': 'http://www.wikidata.org/entity/Q277466'},
 {'Keywor
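
To illustrate the second step (from an entity name to a URI), here is a minimal sketch based on the public Wikidata `wbsearchentities` endpoint. Whether main_functions.py uses this exact endpoint is an assumption; `build_search_url` and `first_uri` are hypothetical helpers, and the response is a hand-written stand-in, so no request is sent.

```python
from urllib.parse import urlencode

WIKIDATA_API = "https://www.wikidata.org/w/api.php"


def build_search_url(name, language="en", limit=1):
    """Build a wbsearchentities request URL for an entity name."""
    params = {
        "action": "wbsearchentities",
        "search": name,
        "language": language,
        "format": "json",
        "limit": limit,
    }
    return f"{WIKIDATA_API}?{urlencode(params)}"


def first_uri(response):
    """Return the concept URI of the top hit, or '' when there is none."""
    hits = response.get("search", [])
    return hits[0].get("concepturi", "") if hits else ""


# Hand-written stand-in for a parsed JSON response (no network call here)
stub = {"search": [{"id": "Q15", "concepturi": "http://www.wikidata.org/entity/Q15"}]}

print(build_search_url("Africa"))
print(first_uri(stub))
print(repr(first_uri({"search": []})))
```

Returning an empty string when nothing matches mirrors the empty 'URI' values in the output above.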

Let's now try with the French and the Portuguese examples. As can be seen (especially from the Portuguese example), the output is richer than the DBpedia Spotlight one.

In [10]:
output = main_functions.useLLM(test_item_fr, llm, context="Title")
output



[{'Keyword': 'Femicide', 'URI': 'http://www.wikidata.org/entity/Q1342425'},
 {'Keyword': 'Femicide', 'URI': 'http://www.wikidata.org/entity/Q1342425'},
 {'Keyword': 'Legislation', 'URI': 'http://www.wikidata.org/entity/Q49371'},
 {'Keyword': 'Repression (criminal law)', 'URI': ''},
 {'Keyword': 'Prevention (health and social welfare)', 'URI': ''}]

In [11]:
output = main_functions.useLLM(test_item_pt, llm, context="Title")
output



[{'Keyword': 'Islam', 'URI': 'http://www.wikidata.org/entity/Q432'},
 {'Keyword': 'Muslims', 'URI': 'http://www.wikidata.org/entity/Q1137457'},
 {'Keyword': 'Marriage', 'URI': 'http://www.wikidata.org/entity/Q8445'},
 {'Keyword': 'Matrimony', 'URI': 'http://www.wikidata.org/entity/Q3851974'},
 {'Keyword': 'Woman', 'URI': 'http://www.wikidata.org/entity/Q467'},
 {'Keyword': 'Female', 'URI': 'http://www.wikidata.org/entity/Q6581072'},
 {'Keyword': "Women's rights",
  'URI': 'http://www.wikidata.org/entity/Q223569'},
 {'Keyword': 'Dialogue', 'URI': 'http://www.wikidata.org/entity/Q131395'},
 {'Keyword': 'Communication', 'URI': 'http://www.wikidata.org/entity/Q11024'},
 {'Keyword': 'Feminism', 'URI': 'http://www.wikidata.org/entity/Q7252'},
 {'Keyword': "Women's rights movement",
  'URI': 'http://www.wikidata.org/entity/Q53028786'}]
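
The French output above contains a repeated entry and entries with an empty URI. If that is not wanted, the list can be cleaned after the call. The sketch below assumes only the 'Keyword' and 'URI' keys shown in the outputs; `dedupe_links` is a hypothetical helper.

```python
def dedupe_links(links, drop_unlinked=True):
    """Keep the first entry per (Keyword, URI) pair, optionally dropping empty URIs."""
    seen = set()
    result = []
    for link in links:
        if drop_unlinked and not link["URI"]:
            continue  # entity name that got no Wikidata match
        key = (link["Keyword"], link["URI"])
        if key not in seen:
            seen.add(key)
            result.append(link)
    return result


# Entries copied from the French output above
french_output = [
    {"Keyword": "Femicide", "URI": "http://www.wikidata.org/entity/Q1342425"},
    {"Keyword": "Femicide", "URI": "http://www.wikidata.org/entity/Q1342425"},
    {"Keyword": "Legislation", "URI": "http://www.wikidata.org/entity/Q49371"},
    {"Keyword": "Repression (criminal law)", "URI": ""},
]

for link in dedupe_links(french_output):
    print(link["Keyword"], link["URI"])
```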

Now, we will experiment with a proprietary OpenAI model. Authentication is required to use the OpenAI API. The following code block handles authentication (please insert your OpenAI key).

In [6]:
import os

os.environ['OPENAI_API_KEY'] = "YOUR OPENAI KEY HERE"
client = tools_utils.openAI_authentication(os.environ.get("OPENAI_API_KEY"))
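
To avoid leaving a key in the notebook, one alternative is to read it from the environment and prompt for it only when it is missing. This is a sketch, not part of tools_utils: `get_api_key` is a hypothetical helper, and the `prompt_fn` parameter exists only so the prompt can be replaced in non-interactive runs.

```python
import os
from getpass import getpass


def get_api_key(var_name, prompt_fn=getpass):
    """Return the key stored in an environment variable, prompting if it is unset."""
    key = os.environ.get(var_name)
    if not key:
        # getpass hides the typed value, so the key does not end up in the cell output
        key = prompt_fn(f"Enter value for {var_name}: ")
        os.environ[var_name] = key
    return key


# In the notebook, the result would be passed on as in the cell above, e.g.:
# client = tools_utils.openAI_authentication(get_api_key("OPENAI_API_KEY"))
```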

Now that we are authenticated, we can experiment with the model, starting with the English example. The function takes the model name as an argument, so you can choose which model to use. In this example, we use gpt-4o-mini. As with the open-source model, we give only the title of the article as context. Of the tools tried so far, the proprietary model gives the best results.

In [17]:
output = main_functions.useOpenAILLM(test_item_en, "gpt-4o-mini", "Title", client)
output

<s>[INST] {Map each keyword to one or more relevant WikiData entities.
        Keywords are from a scientific article. 
        The keyword list is: anthropologie, France, politique, représentation, image. 
        An example of answer for the list of keywords: literary life, literary fact, doing things
        is: literary life: [literature]; literary fact: [literature], [fact]; doing things: [activity]
        Please, don't match keywords to the code of WikiData entities (e.g., Q123456), but to the entity name.
        INCLUDE EACH SEPARATE ENTITY BETWEEN [] IN THE ANSWER } [/INST]
    
anthropologie: [anthropology]; France: [France]; politique: [politics]; représentation: [representation], [depiction]; image: [image], [representation]


[{'Keyword': 'anthropology', 'URI': 'http://www.wikidata.org/entity/Q23404'},
 {'Keyword': 'France', 'URI': 'http://www.wikidata.org/entity/Q142'},
 {'Keyword': 'politics', 'URI': 'http://www.wikidata.org/entity/Q7163'},
 {'Keyword': 'representation',
  'URI': 'http://www.wikidata.org/entity/Q4393498'},
 {'Keyword': 'depiction', 'URI': 'http://www.wikidata.org/entity/Q115491052'},
 {'Keyword': 'image', 'URI': 'http://www.wikidata.org/entity/Q478798'},
 {'Keyword': 'representation',
  'URI': 'http://www.wikidata.org/entity/Q4393498'}]
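
The answer format requested in the prompt above (`keyword: [entity], [entity]; ...`) is easy to turn into a dictionary. The following is a minimal sketch of such a parser. It is not the parsing code in main_functions.py; `parse_answer` is a hypothetical name, and the sketch assumes that keywords contain no ';' and that entities contain no ']'.

```python
import re


def parse_answer(answer):
    """Parse 'kw: [e1], [e2]; kw2: [e3]' into {'kw': ['e1', 'e2'], 'kw2': ['e3']}."""
    mapping = {}
    for part in answer.split(";"):
        keyword, sep, entities = part.partition(":")
        if not sep:
            continue  # skip fragments without a 'keyword:' prefix
        mapping[keyword.strip()] = re.findall(r"\[([^\]]+)\]", entities)
    return mapping


# Model answer copied from the output above
answer = ("anthropologie: [anthropology]; France: [France]; politique: [politics]; "
          "représentation: [representation], [depiction]; image: [image], [representation]")

parsed = parse_answer(answer)
print(parsed["représentation"])
print(parsed["image"])
```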

To switch to another language example, replace the first argument of the function, as in the code below (which also switches the model to gpt-3.5-turbo).

In [39]:
output = main_functions.useOpenAILLM(test_item_fr, "gpt-3.5-turbo", "Title", client)
output

[{'Keyword': 'femicide', 'URI': 'http://www.wikidata.org/entity/Q1342425'},
 {'Keyword': 'violence against women',
  'URI': 'http://www.wikidata.org/entity/Q1800556'},
 {'Keyword': 'femicide', 'URI': 'http://www.wikidata.org/entity/Q1342425'},
 {'Keyword': 'gender-based violence',
  'URI': 'http://www.wikidata.org/entity/Q81552270'},
 {'Keyword': 'legislation', 'URI': 'http://www.wikidata.org/entity/Q49371'},
 {'Keyword': 'law', 'URI': 'http://www.wikidata.org/entity/Q7748'},
 {'Keyword': 'legal system', 'URI': 'http://www.wikidata.org/entity/Q2478386'},
 {'Keyword': 'suppression', 'URI': 'http://www.wikidata.org/entity/Q23056310'},
 {'Keyword': 'repression', 'URI': 'http://www.wikidata.org/entity/Q106781680'},
 {'Keyword': 'law enforcement',
  'URI': 'http://www.wikidata.org/entity/Q44554'},
 {'Keyword': 'prevention', 'URI': 'http://www.wikidata.org/entity/Q1717246'},
 {'Keyword': 'public health', 'URI': 'http://www.wikidata.org/entity/Q189603'},
 {'Keyword': 'risk reduction',
  'URI'

The tool also integrates efficient use of open-source LLMs via the Groq API.

In [8]:
os.environ['GROQ_KEY'] = "YOUR GROQ KEY HERE"
groq_client = tools_utils.groq_authentication(os.environ.get("GROQ_KEY"))

Now that we are authenticated, we can experiment with open models through the Groq API. As before, the function takes the model name as an argument. In this example, we use a light model, llama-3.2-3b-preview. The results are promising, which suggests that smaller open models could be a good fit for this task.

In [15]:
output = main_functions.useGroqLLM(test_item_fr, "llama-3.2-3b-preview", "Title", groq_client)

In [16]:
output

[{'Keyword': 'Femicide', 'URI': 'http://www.wikidata.org/entity/Q1342425'},
 {'Keyword': 'Violence Against Women',
  'URI': 'http://www.wikidata.org/entity/Q1800556'},
 {'Keyword': 'Femicide', 'URI': 'http://www.wikidata.org/entity/Q1342425'},
 {'Keyword': 'Violence Against Women',
  'URI': 'http://www.wikidata.org/entity/Q1800556'},
 {'Keyword': 'Legislation', 'URI': 'http://www.wikidata.org/entity/Q49371'},
 {'Keyword': 'Law', 'URI': 'http://www.wikidata.org/entity/Q7748'},
 {'Keyword': 'Policy', 'URI': 'http://www.wikidata.org/entity/Q1156854'},
 {'Keyword': 'Suppression', 'URI': 'http://www.wikidata.org/entity/Q23056310'},
 {'Keyword': 'Repression', 'URI': 'http://www.wikidata.org/entity/Q106781680'},
 {'Keyword': 'Oppression', 'URI': 'http://www.wikidata.org/entity/Q252000'},
 {'Keyword': 'Prevention', 'URI': 'http://www.wikidata.org/entity/Q1717246'},
 {'Keyword': 'Preventive medicine',
  'URI': 'http://www.wikidata.org/entity/Q1773974'},
 {'Keyword': 'Public health', 'URI': 'htt
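
To compare two tools more systematically than by eye, we can look at which URIs they share. The sketch below is only an illustration: `linked_uris` and `compare` are hypothetical helpers, and the two short lists are excerpts from the gpt-3.5-turbo and llama-3.2-3b-preview outputs above, not the full results.

```python
def linked_uris(links):
    """Return the set of non-empty URIs in a list of {'Keyword', 'URI'} entries."""
    return {link["URI"] for link in links if link["URI"]}


def compare(links_a, links_b):
    """Split the URIs of two outputs into shared and tool-specific sets."""
    a, b = linked_uris(links_a), linked_uris(links_b)
    return {"shared": a & b, "only_a": a - b, "only_b": b - a}


# Excerpts from the two French outputs above
openai_excerpt = [
    {"Keyword": "femicide", "URI": "http://www.wikidata.org/entity/Q1342425"},
    {"Keyword": "law", "URI": "http://www.wikidata.org/entity/Q7748"},
    {"Keyword": "legal system", "URI": "http://www.wikidata.org/entity/Q2478386"},
]
groq_excerpt = [
    {"Keyword": "Femicide", "URI": "http://www.wikidata.org/entity/Q1342425"},
    {"Keyword": "Law", "URI": "http://www.wikidata.org/entity/Q7748"},
    {"Keyword": "Policy", "URI": "http://www.wikidata.org/entity/Q1156854"},
]

result = compare(openai_excerpt, groq_excerpt)
print(sorted(result["shared"]))
print(sorted(result["only_a"]))
print(sorted(result["only_b"]))
```

Comparing URIs rather than keyword strings sidesteps differences in capitalization between models.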