# swisscoveryIA: natural-language search of the UNIGE library collection

Project carried out as part of Hackademia 2024, Battelle, 22-23 November 2024  

Authors: Antoine, Cédric, Abdoulaye, Nicolas Prongué (nicolas.prongue@unige.ch), Pablo Iriarte (pablo.iriarte@unige.ch)  
Created: 23.11.2024  
Last modified: 22.11.2024  

* LLM: LLAMA2 -> file llama-2-7b-chat.Q5_K_S.gguf (4 GB), available at https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF
* Technical building blocks: LlamaCPP and LangChain -> open-source libraries for interacting with the LLM

## Workflow and tasks

1. Classify the question into one of two categories, Boolean search (BOOL) or conceptual search (CONCEPT), based on the presence of certain words in the question (bag of words)
2. Build a prompt that wraps the question according to the search type, to get the best possible result
3. Send the prompt to the locally installed AI
4. Process the AI's response to obtain:
   1. BOOL: a JSON file with the search criteria extracted by the AI, split into separate fields (author, date and subject)
   2. CONCEPT: a summary written by the AI and search terms with synonyms in French and English for a simple search
5. BOOL: process the question to add a document-type search criterion if certain words are present in the question ("article", "livres", etc.)
6. Send the query to the swisscovery API to get the 10 most relevant documents (relevance ranking specific to swisscovery)
7. Parse the JSON returned by the swisscovery API into a formatted list of results for display in the interface
8. Present the result:
   1. BOOL: the list of 10 results, each linking to the document in swisscovery, followed by a link to run the advanced query in swisscovery
   2. CONCEPT: the AI's text, followed by a library disclaimer asking the user to check the information against the documents found in swisscovery, and then a link to run the simple query in swisscovery
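
The classification and document-type steps are not implemented in the cells of this notebook. The following is a minimal sketch of how a bag-of-words approach could look. The keyword lists, the function names (`tokenize`, `classify`, `detect_doc_type`) and the document-type labels are all assumptions made for illustration, not the project's actual values.

```python
import re
import unicodedata

# Hypothetical trigger words: the lists actually used by the project are not shown here.
BOOL_KEYWORDS = {"auteur", "author", "ecrit", "publie", "publies",
                 "article", "articles", "livre", "livres"}

# Hypothetical mapping from trigger words to a document-type label.
DOC_TYPE_KEYWORDS = {
    "article": "articles",
    "articles": "articles",
    "livre": "books",
    "livres": "books",
}


def tokenize(text):
    """Lowercase the text, strip accents and return the set of words it contains."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    without_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    return set(re.findall(r"[a-z0-9]+", without_accents))


def classify(question):
    """Return 'BOOL' if any trigger word is present, otherwise 'CONCEPT'."""
    return "BOOL" if tokenize(question) & BOOL_KEYWORDS else "CONCEPT"


def detect_doc_type(question):
    """Return a document-type label if a trigger word is present, otherwise None."""
    words = tokenize(question)
    for keyword, doc_type in DOC_TYPE_KEYWORDS.items():
        if keyword in words:
            return doc_type
    return None


print(classify("Je veux savoir ce qu'est le bosson de Higgs"))
print(classify("Je cherche des articles de Peter Higgs publiés en 2012"))
print(detect_doc_type("Je cherche des articles de Peter Higgs publiés en 2012"))
```

Stripping accents before matching means that "publiés" and "publies" are treated the same way, which keeps the keyword lists short.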

## Results

Simple web interface with:
  * a header
  * a short introductory text
  * a search field
  * a results area
  * a footer

In [8]:
from langchain_community.llms import LlamaCpp
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain
from datetime import datetime
import re
import urllib.parse
import json
import requests

# Parameters
swisscovery_api = 'https://api-eu.hosted.exlibrisgroup.com/primo/v1/search?vid=41SLSP_UGE:VU1&tab=CentralIndex'
swisscovery_key = 'xxx' # Sandbox

# Test search
recherche = "Je veux savoir ce qu'est le bosson de Higgs"

In [47]:
# chat function
def chat(question):
    # Load the LlamaCpp language model, adjust GPU usage based on your hardware
    llm = LlamaCpp(
        model_path="C:/Users/iriarte/AI/LLMs/llama-2-7b-chat.Q4_K_M.gguf",
        temperature=0.8,
        n_ctx=4096,
        n_gpu_layers=-1,
        # n_gpu_layers=40,
        n_batch=4096,  # Should be between 1 and n_ctx; consider the amount of VRAM in your GPU
        max_tokens=4096,
        # repeat_penalty=1.5,
        # top_p=0.5,
        verbose=False,  # Set to True for detailed logging when debugging
    )
    
    # Define the prompt template with a placeholder for the question
    template = """
    Question: I want a concise explanation about the question "{question}" and also to transform the question into a Boolean search strategy that can be used in a library catalog.
    The search query can be constructed using several synonyms of the subject in French and English.
    All the terms of the query can be combined only with 'OR' or 'AND', don't use other Boolean operators or filters.
    Put the string '[explanation]' before your summary about the question.
    Put the string '[query]' before the search query.
    Don't add comments inside the query, give only the search strategy that we can use in a library database and don't add explanations about the search strategy after the query.
    Please respect this structure in the answer: '[explanation] the subject explanation. [query] the search strategy.'
    
    Answer:
    """
    prompt = PromptTemplate(template=template, input_variables=["question"])
    
    # Create an LLMChain to manage interactions with the prompt and model
    llm_chain = LLMChain(prompt=prompt, llm=llm)
    
    # Single call: time the model invocation and return the result dict
    print('start AI : ' + str(datetime.now()))
    answer = llm_chain.invoke(question)
    print("Answer done")
    print('end AI : ' + str(datetime.now()))
    return answer

In [48]:
# chat()
answer = chat(recherche)

start AI : 2024-11-27 18:49:31.536599
Answer done
end AI : 2024-11-27 18:50:04.562414


In [49]:
print (answer)

{'question': "Je veux savoir ce qu'est le bosson de Higgs", 'text': ' [explanation] The Boson of Higgs is a fundamental particle in the Standard Model of Particle Physics that is responsible for giving mass to other particles. It was first proposed by physicist Peter Higgs and others in the 1960s, and was discovered in 2012 at the Large Hadron Collider (LHC) at CERN. The boson is a scalar particle with zero spin, which means it has no intrinsic angular momentum, and it is the quantum of the Higgs field, a field that permeates all of space and determines the mass of fundamental particles. [query] (Higgs OR Higgs boson OR boson de Higgs)\n    '}


In [50]:
answer['question']

"Je veux savoir ce qu'est le bosson de Higgs"

In [51]:
answer['text']

' [explanation] The Boson of Higgs is a fundamental particle in the Standard Model of Particle Physics that is responsible for giving mass to other particles. It was first proposed by physicist Peter Higgs and others in the 1960s, and was discovered in 2012 at the Large Hadron Collider (LHC) at CERN. The boson is a scalar particle with zero spin, which means it has no intrinsic angular momentum, and it is the quantum of the Higgs field, a field that permeates all of space and determines the mass of fundamental particles. [query] (Higgs OR Higgs boson OR boson de Higgs)\n    '

In [52]:
# extract the query
query = answer['text']
query = query.replace('`', '')
query = query.replace('´', '')
# drop at most one leading/trailing quote, then at most one leading/trailing space
if query.startswith('\''):
    query = query[1:]
if query.endswith('\''):
    query = query[:-1]
if query.startswith(' '):
    query = query[1:]
if query.endswith(' '):
    query = query[:-1]

# replace the markers and line breaks with sentinel strings
query = query.replace('[explanation]', ':::')
query = query.replace('[query]', '|||')
query = query.replace('\n', '£££')
query = query + '_;_'
print(query)

::: The Boson of Higgs is a fundamental particle in the Standard Model of Particle Physics that is responsible for giving mass to other particles. It was first proposed by physicist Peter Higgs and others in the 1960s, and was discovered in 2012 at the Large Hadron Collider (LHC) at CERN. The boson is a scalar particle with zero spin, which means it has no intrinsic angular momentum, and it is the quantum of the Higgs field, a field that permeates all of space and determines the mass of fundamental particles. ||| (Higgs OR Higgs boson OR boson de Higgs)£££   _;_


In [53]:
if ('|||' in query) and (':::' in query):
    print("Separators OK")
    pattern = r"^(.*)\|\|\|(.*)_;_"
    match = re.search(pattern, query)
    answer_prof = match.group(1)
    query_clean = match.group(2)
    print('Prof : ' + answer_prof)
    print('Query : ' + query_clean)
else:
    # without both separators the answer cannot be split
    print("Separators missing")


Separators OK
Prof : ::: The Boson of Higgs is a fundamental particle in the Standard Model of Particle Physics that is responsible for giving mass to other particles. It was first proposed by physicist Peter Higgs and others in the 1960s, and was discovered in 2012 at the Large Hadron Collider (LHC) at CERN. The boson is a scalar particle with zero spin, which means it has no intrinsic angular momentum, and it is the quantum of the Higgs field, a field that permeates all of space and determines the mass of fundamental particles. 
Query :  (Higgs OR Higgs boson OR boson de Higgs)£££   


In [54]:
# clean up the explanation part
answer_prof = answer_prof.replace(':::' , '')
answer_prof = answer_prof.replace('£££' , ' ')
# collapse runs of spaces (two passes handle up to four consecutive spaces)
answer_prof = answer_prof.replace('  ' , ' ')
answer_prof = answer_prof.replace('  ' , ' ')
answer_prof = answer_prof.strip()
print(answer_prof)

The Boson of Higgs is a fundamental particle in the Standard Model of Particle Physics that is responsible for giving mass to other particles. It was first proposed by physicist Peter Higgs and others in the 1960s, and was discovered in 2012 at the Large Hadron Collider (LHC) at CERN. The boson is a scalar particle with zero spin, which means it has no intrinsic angular momentum, and it is the quantum of the Higgs field, a field that permeates all of space and determines the mass of fundamental particles.


In [55]:
# clean up the query part
query_clean = query_clean.replace(':::' , '')
if ('£££' in query_clean) :
    query_clean = query_clean[:query_clean.find('£££')]
query_clean = query_clean.strip()
print(query_clean)

(Higgs OR Higgs boson OR boson de Higgs)
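
The sentinel-based splitting above can also be written as a single regular expression applied directly to the `[explanation]` and `[query]` markers. This is a minimal sketch, not the notebook's method: `parse_llm_answer` is a hypothetical helper name, and it assumes the model respects the marker order requested in the prompt.

```python
import re


def parse_llm_answer(text):
    """Split '[explanation] ... [query] ...' into (explanation, query).

    Returns None when either marker is missing or they appear out of order.
    """
    match = re.search(r"\[explanation\](.*?)\[query\](.*)", text, flags=re.DOTALL)
    if match is None:
        return None
    # Collapse all whitespace runs (including line breaks) in the explanation.
    explanation = " ".join(match.group(1).split())
    # Keep only the first non-empty line after the marker as the query.
    query_lines = match.group(2).strip().splitlines()
    query = query_lines[0].strip() if query_lines else ""
    return explanation, query


sample = " [explanation] The Higgs boson gives mass\nto other particles. [query] (Higgs OR Higgs boson OR boson de Higgs)\n    "
print(parse_llm_answer(sample))
```

Unlike the cells above, this version returns `None` when a marker is missing, so the caller can decide how to handle a malformed answer. It also tolerates a line break directly after `[query]`.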


In [56]:
# build the API request URL
query_url = swisscovery_api + '&q=any,contains,' + urllib.parse.quote_plus(query_clean) + '&apikey=' + swisscovery_key
print(query_url)

https://api-eu.hosted.exlibrisgroup.com/primo/v1/search?vid=41SLSP_UGE:VU1&tab=CentralIndex&q=any,contains,%28Higgs+OR+Higgs+boson+OR+boson+de+Higgs%29&apikey=xxx
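
The URL above is assembled by string concatenation. As a sketch of an alternative, the same URL can be built from a parameter dictionary with `urllib.parse.urlencode`, which encodes every value consistently. `build_search_url` is a hypothetical helper name, and the default parameter values are simply copied from the cells above. Passing `safe=',:'` is a choice made here to keep the commas and the colon unescaped, as in the printed URL.

```python
from urllib.parse import urlencode

PRIMO_SEARCH_ENDPOINT = "https://api-eu.hosted.exlibrisgroup.com/primo/v1/search"


def build_search_url(query, api_key, vid="41SLSP_UGE:VU1", tab="CentralIndex"):
    """Build the search URL for a simple 'any,contains' query."""
    params = {
        "vid": vid,
        "tab": tab,
        "q": "any,contains," + query,
        "apikey": api_key,
    }
    # safe=',:' leaves commas and colons literal; spaces become '+'.
    return PRIMO_SEARCH_ENDPOINT + "?" + urlencode(params, safe=",:")


print(build_search_url("(Higgs OR Higgs boson OR boson de Higgs)", "xxx"))
```

Keeping the parameters in a dictionary also makes it easier to add or change one of them later without editing a long string.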


In [20]:
r = requests.get(query_url)
print (r.json())

{'info': {'totalResultsLocal': 419, 'totalResultsPC': -1, 'total': 419, 'first': 1, 'last': 10}, 'highlights': {'creator': ['Higgs', 'boson'], 'contributor': ['Higgs', 'boson'], 'subject': ['Modèle', 'standard', 'physique', 'MODÈLE', 'STANDARD', 'PHYSIQUE', 'DES', 'PARTICULES'], 'description': ['Higgs', 'boson'], 'toc': ['Higgs', 'Boson', 'boson'], 'title': ['Higgs', 'boson', 'Boson'], 'termsUnion': ['Higgs', 'boson', 'Modèle', 'standard', 'physique', 'MODÈLE', 'STANDARD', 'PHYSIQUE', 'DES', 'PARTICULES', 'Boson']}, 'docs': [{'context': 'L', 'adaptor': 'Local Search Engine', '@id': 'https://eu03.alma.exlibrisgroup.com/primaws/rest/pub/pnxs/L/991012021606305502', 'pnx': {'display': {'source': ['UGE_ArchiveOuverte'], 'type': ['ARTICLES'], 'language': ['eng'], 'title': ['Search for the Standard Model Higgs boson at LEP'], 'identifier': ['$$CISSN$$V0370-2693'], 'description': ['The four LEP Collaborations, ALEPH, DELPHI, L3 and OPAL, have collected a total of 2461 pb−1 of e+e− collision 

In [21]:
# number of results
sw_json = r.json()
print ('Number of results in swisscovery : ' + str(sw_json['info']['total']))

Number of results in swisscovery : 419


In [35]:
# Display the results
mylist = ''
docs = sw_json['docs']
for doc in docs:
    display = doc['pnx']['display']
    mylist = mylist + display['title'][0]
    if 'contributor' in display:
        mylist = mylist + '. - '
        # keep at most three contributors; drop the '$$...' suffix when present
        for contributor in display['contributor'][:3]:
            mylist = mylist + contributor.split('$', 1)[0] + ' ; '
    if 'ispartof' in display:
        mylist = mylist + display['ispartof'][0].split('$', 1)[0] + ', '
    if 'date' in display:
        mylist = mylist + display['date'][0] + ', '
    if 'mms' in display:
        mylist = mylist + ' URL : https://slsp-unige.primo.exlibrisgroup.com/discovery/search?tab=41SLSP_UGE_MyInst_CI&search_scope=MyInst_and_CI&vid=41SLSP_UGE:VU1&offset=0&query=any,contains,' + display['mms'][0]
    mylist = mylist + '\n\n'
print(mylist)

Search for the Standard Model Higgs boson at LEP. - Achard, Pablo ; Bourquin, Maurice ; Braccini, Saverio ; Physics letters. B ; Vol. 565 (2003) p. 61-75,  URL : https://slsp-unige.primo.exlibrisgroup.com/discovery/search?tab=41SLSP_UGE_MyInst_CI&search_scope=MyInst_and_CI&vid=41SLSP_UGE:VU1&offset=0&query=any,contains,991012021606305502

Measurement of the properties of the Standard Model Higgs boson in the H →ZZ∗→4l decay channel with the ATLAS Experiment at CERN. - Benhar Noccioli, Eleonora ;  URL : https://slsp-unige.primo.exlibrisgroup.com/discovery/search?tab=41SLSP_UGE_MyInst_CI&search_scope=MyInst_and_CI&vid=41SLSP_UGE:VU1&offset=0&query=any,contains,991012073819505502

Search for a CP-odd Higgs boson decaying to Zh in pp collisions at √s=8&nbsp;TeV with the ATLAS detector. - Alexandre, Gauthier ; Ancu, Lucian Stefan ; Barone, Gaetano ; Physics letters. B ; Vol. 744 (2015) p. 163-183,  URL : https://slsp-unige.primo.exlibrisgroup.com/discovery/search?tab=41SLSP_UGE_MyInst_CI&se