# Topic Determination with a Generative Model

In this notebook, we use the clusters produced by the best combination of embeddings, clustering method, and number of clusters identified in the "detaiiled_analysis" notebook: SVD embeddings with KMeans and 12 clusters. For each cluster, the article titles are collected and passed to an LLM, which is asked to find a common, differentiating theme. The GPT-4o-mini model is used through Azure OpenAI.

This method could be extended to the full text of the articles, but it is limited to titles here to keep token consumption down. To rerun the code, replace the access keys used in the notebook with your own. Run the second cell to display the cluster plot.
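Because the notebook depends on credentials read from a local `.env` file, a quick check before calling the API can catch a missing key early. This is a minimal sketch, assuming the three variable names loaded in the next cell; the helper `missing_settings` and the sample mapping are hypothetical and not part of the original notebook.

```python
import os

# Variable names taken from the notebook's key-loading cell.
REQUIRED_SETTINGS = ("OPENAI_API_KEY", "GPT_API_ENDPOINT", "GPT_API_VERSION")


def missing_settings(env=None, required=REQUIRED_SETTINGS):
    """Return the names of required settings that are absent or empty."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


# Illustration with an explicit mapping instead of the real environment.
example_env = {"OPENAI_API_KEY": "placeholder", "GPT_API_ENDPOINT": ""}
print(missing_settings(example_env))
```

Calling `missing_settings()` with no argument after `load_dotenv(".env")` would check the real environment instead of the sample mapping.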

In [None]:
# Azure OpenAI helpers used to infer the themes with GPT
from openai import AzureOpenAI
import os 
from dotenv import load_dotenv



# Load the API keys from the .env file
load_dotenv(".env")
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
GPT_API_ENDPOINT = os.getenv("GPT_API_ENDPOINT")
GPT_API_VERSION = os.getenv("GPT_API_VERSION")

# Create the Azure OpenAI client
def init_gpt(key, endpoint, api_version):
  client = AzureOpenAI(
    api_key = key,  
    api_version = api_version,
    azure_endpoint = endpoint,
  )
  return client

# Ask the model for one differentiating theme per group of titles
def topic_gpt(prompt, client_gpt):
  response = client_gpt.chat.completions.create(
      model="gpt-4o-mini", 
      messages=[
          {"role": "system", "content": """
              You are an analyst tasked with finding a precise and differentiating theme from a provided list of groups of article names. 
              Please, for each group, I need you to find what makes it different from the other groups.
              It is very important that you show the specificity and uniqueness of each group and differentiate it from the other groups. 
              Try your best not to repeat words across the groups' unique traits, as they should all be unique and different.
              
              Input format : [{"labels" : 1, "titres" : [title 1, title 2]}, {"labels" : 2, "titres" : [title 3, title 4]}, ...]
              
              Output format list: [{"labels" : 1, "theme" : unique traits 1}, {"labels" : 2, "theme" : unique traits 2}, ...]
    
          """},
          {"role": "user", "content": prompt},
      ],
  )

  return response

In [3]:
# Compute the clusters with the optimal parameters
from main2 import pipeline, Kmeans_clustering, SVD_embeddings, display_tsne, score_function
import pandas as pd

def pipeline_labels(dataframe, embedding_method, clustering_method, nb_cluster=11, reduction_method=None):
    # Embed the documents, cluster the embeddings, then score the clustering
    print(f"start embedding for {embedding_method.__name__} and {clustering_method.__name__}")
    embeddings = embedding_method(dataframe)
    print("clustering")
    labels = clustering_method(nb_cluster, embeddings, embedding_method.__name__)
    print("scoring")
    scores = score_function(embeddings, labels)
    print(f"silhouette_score: {scores[0]}, davies_bouldin_score: {scores[1]}, calinski_harabasz_score: {scores[2]}")
    if reduction_method is not None:
        # Optional dimensionality reduction used to plot the clusters
        reduction_method(embeddings, dataframe, labels, embedding_method.__name__, clustering_method.__name__)
    return embeddings, scores, labels

data_df = pd.read_csv("Data_csv/data_preprocessed.csv")
# Best configuration: SVD embeddings + KMeans with 12 clusters, plotted with display_tsne
embeddings, scores, labels = pipeline_labels(
    dataframe=data_df,
    embedding_method=SVD_embeddings,
    clustering_method=Kmeans_clustering,
    nb_cluster=12,
    reduction_method=display_tsne,
)

[nltk_data] Downloading package punkt to /home/damien/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


start embedding for SVD_embeddings and Kmeans_clustering
clustering
scoring
silhouette_score: 0.35619165905230044, davies_bouldin_score: 0.7227711169570644, calinski_harabasz_score: 862.0129961572617
(457, 2)


In [18]:
# Group the titles by cluster label
from collections import defaultdict

titres = data_df["Name of the document"]
groupes = defaultdict(list)

for label, titre in zip(labels, titres):
    groupes[label].append(titre)
titres_par_labels = [{"labels": label, "titres": titres} for label, titres in groupes.items()]
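Since all grouped titles are sent in a single prompt, it can help to estimate the size of that payload before calling the API. The sketch below is a rough approximation only: it assumes about four characters per token, a common rule of thumb for English text rather than an exact tokenizer count. `estimate_tokens` and the second sample group are illustrative and not part of the original notebook.

```python
def estimate_tokens(text, chars_per_token=4):
    """Rough token estimate; chars_per_token is an assumed heuristic."""
    return -(-len(text) // chars_per_token)  # ceiling division


# Small sample shaped like titres_par_labels (second group is a placeholder).
sample_groups = [
    {"labels": 8, "titres": ["AI Now 2017 Report", "AI and Human Rights"]},
    {"labels": 0, "titres": ["placeholder title"]},
]
prompt = str(sample_groups)
print(len(prompt), estimate_tokens(prompt))
```

In the notebook, the same call could be made on `str(titres_par_labels)` to get an order of magnitude for the request size.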

In [37]:
# Print each group with its number of titles, separated by a blank line
for elt in titres_par_labels:
    print(len(elt["titres"]), elt)
    print("\n")

66 {'labels': 8, 'titres': ['Intel Recommends Public Policy Principles for Artificial Intelligence', 'Artificial intelligence in Healthcare', 'Human rights in the age of Artificial Intelligence', 'AI Now 2017 Report', 'AI Now 2019 Report', 'From Principles to Practice: An interdisciplinary framework to operationalise AI ethics', 'AI and Human Rights', "Artificial Intelligence. Australia's Ethics Framework. A discussion paper", 'Human Rights and Technology', 'Artificial Intelligence Economic importance, social challenges, human responsibility', 'The AI Manifesto', 'The Brussels Effect and Artificial Intelligence: How EU regulation will impact the global AI market', 'Toward a G20 Framework for Artificial Intelligence in the Workplace', 'Comparing European and Canadian AI Regulation', 'Opinion on artificial intelligence and fundamental rights', 'Self-assessment guide for artificial intelligence (AI) systems', 'Towards regulation of AI Systems', 'Study on Discrimination, artificial intelli

In [42]:
# Theme inference
client_gpt = init_gpt(OPENAI_API_KEY, GPT_API_ENDPOINT, GPT_API_VERSION)
response = topic_gpt(str(titres_par_labels), client_gpt)
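The model's answer comes back as plain text, and in the runs shown in this notebook it is a JSON list wrapped in a Markdown code fence. Here is a minimal sketch for turning such text into a dictionary, assuming the model followed the requested output format. `parse_themes` is a hypothetical helper and `raw_answer` is made-up sample text, not a real model response.

```python
import json

# Markdown fence marker, built programmatically so it never starts a line here.
FENCE = "`" * 3


def parse_themes(raw):
    """Turn the model's text answer into a {label: theme} dictionary."""
    text = raw.strip()
    if text.startswith(FENCE):
        # Drop the opening fence line (which may carry a "json" tag)...
        text = text.split("\n", 1)[1] if "\n" in text else ""
        # ...and everything from the closing fence onward, if there is one.
        text = text.rsplit(FENCE, 1)[0]
    items = json.loads(text)
    return {item["labels"]: item["theme"] for item in items}


raw_answer = FENCE + 'json\n[{"labels": 8, "theme": "example theme"}]\n' + FENCE
print(parse_themes(raw_answer))
```

With the notebook's variables, this would be applied to `response.choices[0].message.content`. A truncated or malformed answer would make `json.loads` raise an error, which is probably the behavior you want here.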

In [45]:
list_theme = response.choices[0].message.content
print(list_theme)


# Prompt used:
# You are an analyst tasked with finding a precise and differentiating theme from a provided list of groups of article names.
# Please, for each group, I need you to find the theme of the articles inside the group.
# It is very important that the themes returned show the specificity and uniqueness of each group and differentiate it from the other groups.
# Try your best not to repeat words between themes, as they should all be unique and different.


```json
[
    {"labels": 8, "theme": "Comprehensive policy frameworks and ethical considerations for AI development and governance"},
    {"labels": 6, "theme": "Addressing human rights implications and ethical standards in AI and algorithmic decision-making"},
    {"labels": 11, "theme": "Regulatory strategies and international cooperation for AI governance in Europe"},
    {"labels": 1, "theme": "Evaluation and implementation of ethical frameworks in AI technologies and practices"},
    {"labels": 2, "theme": "Promoting responsible AI practices and citizen-centric applications in digital environments"},
    {"labels": 5, "theme": "Strategic initiatives for AI development, governance, and ethical implications across sectors"},
    {"labels": 10, "theme": "Building national AI strategies with a focus on trust, ethics, and regional cooperation"},
    {"labels": 7, "theme": "Safeguarding ethical standards and responsibilities in advanced technology applications"},
    {"labels": 4, "them
```

In [47]:
client_gpt = init_gpt(OPENAI_API_KEY, GPT_API_ENDPOINT, GPT_API_VERSION)
response = topic_gpt(str(titres_par_labels), client_gpt)
list_theme = response.choices[0].message.content
print(list_theme)

# Prompt used:
# You are an analyst tasked with finding a precise and differentiating theme from a provided list of groups of article names.
# Please, for each group, I need you to find what makes it different from the other groups.
# It is very important that you show the specificity and uniqueness of each group and differentiate it from the other groups.
# Try your best not to repeat words across the groups' unique traits, as they should all be unique and different.

```json
[
    {"labels": 8, "theme": "Focus on public policy, governance, ethical frameworks, and human rights implications of AI across various sectors."},
    {"labels": 6, "theme": "Emphasis on bias, transparency, accountability, and ethical frameworks specifically related to algorithmic decision-making and public health."},
    {"labels": 11, "theme": "Concentration on regulatory proposals, international cooperation, and fostering trust in AI technologies with a European perspective."},
    {"labels": 1, "theme": "Engagement with societal implications, accountability, and responsibility of technologies, particularly in legislative and justice contexts."},
    {"labels": 2, "theme": "Prioritization of citizen engagement, national strategies for AI deployment, and establishing frameworks for trustworthy AI practices."},
    {"labels": 5, "theme": "Developing comprehensive governance strategies, ethical frameworks, and international collaboration to address AI's societal impacts."},
```

**Comments:**

The first prompt, which asks the LLM to generate a unique theme for each cluster, returns very similar topics with only slight variations, which suggests that the main themes of the corpus are fairly homogeneous. AI governance and ethics appear in most clusters, with a few variations such as a focus on Europe for cluster 11, citizen-centric applications for cluster 2, or algorithmic decision-making for cluster 6.

The second prompt focuses more on the differences between clusters. The LLM then returns more specific themes, which makes the topics easier to tell apart and brings out sub-themes that are more developed than in the results of the first prompt.
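One way to put a number on how much the themes overlap is a word-level Jaccard similarity. The sketch below is illustrative only. It uses the first three themes returned by each prompt, copied from the outputs above. The helper names and the four-letter word filter (which drops short words such as "AI" and "and") are assumptions, not part of the original analysis.

```python
import re
from itertools import combinations


def words(theme):
    """Lowercased words of four letters or more, as a set."""
    return set(re.findall(r"[a-z]{4,}", theme.lower()))


def jaccard(a, b):
    """Jaccard similarity between two sets (0 = disjoint, 1 = identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def mean_pairwise_overlap(themes):
    """Average Jaccard similarity over all pairs of themes."""
    pairs = list(combinations([words(t) for t in themes], 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs) if pairs else 0.0


# Themes for labels 8, 6 and 11, copied from the two runs.
first_prompt = [
    "Comprehensive policy frameworks and ethical considerations for AI development and governance",
    "Addressing human rights implications and ethical standards in AI and algorithmic decision-making",
    "Regulatory strategies and international cooperation for AI governance in Europe",
]
second_prompt = [
    "Focus on public policy, governance, ethical frameworks, and human rights implications of AI across various sectors.",
    "Emphasis on bias, transparency, accountability, and ethical frameworks specifically related to algorithmic decision-making and public health.",
    "Concentration on regulatory proposals, international cooperation, and fostering trust in AI technologies with a European perspective.",
]
print(mean_pairwise_overlap(first_prompt), mean_pairwise_overlap(second_prompt))
```

A lower score means less shared vocabulary between themes. With only three themes per prompt, the numbers are indicative at best. Repeating the comparison on all 12 themes would give a fairer picture.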

As a possible improvement, the characterization of cluster themes could also draw on the full content of each article, but this would raise a token consumption problem.
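One possible compromise, sketched here under assumptions, is to send only the beginning of each article rather than its full text. The `textes` key, the `build_payload` helper, and the 300-character cap are all hypothetical choices for illustration. The original notebook only sends titles.

```python
def build_payload(groups, max_chars_per_doc=300):
    """Keep only the beginning of each document to cap the prompt size."""
    return [
        {
            "labels": group["labels"],
            "textes": [doc[:max_chars_per_doc] for doc in group["textes"]],
        }
        for group in groups
    ]


# Placeholder documents standing in for full article texts.
sample = [{"labels": 1, "textes": ["a" * 1000, "short text"]}]
payload = build_payload(sample, max_chars_per_doc=300)
print([len(doc) for doc in payload[0]["textes"]])
```

The cap trades coverage for cost: a smaller value keeps the request cheap but may miss what distinguishes an article beyond its introduction.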