# Analyse comparative OMS vs Forbes 
**Objectif :** mesurer dans quelle mesure les problématiques de santé mises en avant par l'OMS en Afrique sont présentes et comment elles sont traitées dans Forbes Afrique. Le notebook couvre :

- extraction et définition des *topics* OMS (référence thématique),
- scoring lexical (coverage) des articles Forbes par topic OMS,
- scoring sémantique (similarité embeddings) des articles Forbes par topic OMS,
- analyse de sentiment (document-level + aspect-based pour phrases liées aux topics OMS),
- framing (rule-based),
- extraction d'entités et réseau co-occurrence,
- appariement OMS → Forbes (nearest neighbors),
- tests statistiques simples.

**Remarques importantes :**
- La **temporal analysis** est volontairement exclue de ce notebook.
- Le notebook suppose que les fichiers produits par le pipeline existent dans `/data :
  - `all_articles_processed.csv` (articles nettoyés),
  - `tfidf.joblib` 
  - `embeddings.npy` 

## 0. Dépendances & installation

**Objectif :** installer les paquets requis (si nécessaire). Exécute en terminal :

```bash
pip install pandas numpy scikit-learn umap-learn hdbscan sentence-transformers transformers joblib spacy gensim plotly matplotlib seaborn wordcloud nltk faiss-cpu
python -m spacy download fr_core_news_sm
python -m spacy download en_core_web_sm
```


In [2]:
# 1) Imports & chemins - exécute cette cellule en premier
import os, json
from collections import Counter, defaultdict
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity
import joblib
import matplotlib.pyplot as plt
import seaborn as sns

OUT_DIR = '../data/outputs' if os.path.exists('../data/outputs') else 'outputs'
CSV_PATH = os.path.join(OUT_DIR, 'all_articles_processed.csv')
TFIDF_PATH = os.path.join(OUT_DIR, 'tfidf.joblib')
EMB_PATH = os.path.join(OUT_DIR, 'embeddings.npy')
RESULTS_DIR = os.path.join(OUT_DIR, 'analysis_results')
os.makedirs(RESULTS_DIR, exist_ok=True)
print('Expected processed CSV at:', CSV_PATH)


Expected processed CSV at: outputs\all_articles_processed.csv


In [18]:
# 2) Fusion des fichiers

oms = pd.read_csv("../outputs/oms_articles.csv")      
forbes = pd.read_csv("../outputs/forbes_articles_2.csv")

oms['source'] = 'OMS'
forbes['source'] = 'Forbes'
df = pd.concat([oms, forbes], ignore_index=True)

# aperçu
print(df.shape)
print(df.columns.tolist())
df.head()

(461, 44)
['source', 'titre', 'date', 'lien', 'texte', 'Unnamed: 5', 'Unnamed: 6', 'Unnamed: 7', 'Unnamed: 8', 'Unnamed: 9', 'Unnamed: 10', 'Unnamed: 11', 'Unnamed: 12', 'Unnamed: 13', 'Unnamed: 14', 'Unnamed: 15', 'Unnamed: 16', 'Unnamed: 17', 'Unnamed: 18', 'Unnamed: 19', 'Unnamed: 20', 'Unnamed: 21', 'Unnamed: 22', 'Unnamed: 23', 'Unnamed: 24', 'Unnamed: 25', 'Unnamed: 26', 'Unnamed: 27', 'Unnamed: 28', 'Unnamed: 29', 'Unnamed: 30', 'Unnamed: 31', 'Unnamed: 32', 'Unnamed: 33', 'Unnamed: 34', 'Unnamed: 35', 'Unnamed: 36', 'Unnamed: 37', 'Unnamed: 38', 'Unnamed: 39', 'Unnamed: 40', 'Unnamed: 41', 'Unnamed: 42', 'Unnamed: 43']


Unnamed: 0,source,titre,date,lien,texte,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,...,Unnamed: 34,Unnamed: 35,Unnamed: 36,Unnamed: 37,Unnamed: 38,Unnamed: 39,Unnamed: 40,Unnamed: 41,Unnamed: 42,Unnamed: 43
0,OMS,"Soins, compassion et guérison du plus jeune pa...",2025-11-05T12:00:00Z,https://www.afro.who.int/fr/countries/democrat...,"Soins, compassion et guérison du plus jeune pa...",,,,,,...,,,,,,,,,,
1,OMS,"Comprendre la prévalence, les risques et les m...",2025-11-03T12:00:00Z,https://www.afro.who.int/fr/countries/senegal/...,"Comprendre la prévalence, les risques et les m...",,,,,,...,,,,,,,,,,
2,OMS,Les centres de prise en charge au cœur de la l...,2025-10-31T12:00:00Z,https://www.afro.who.int/fr/countries/burundi/...,Les centres de prise en charge au cœur de la l...,,,,,,...,,,,,,,,,,
3,OMS,L’épidémie de choléra en recul au Congo,2025-10-21T12:00:00Z,https://www.afro.who.int/fr/countries/congo/ne...,L’épidémie de choléra en recul au Congo\n21 oc...,,,,,,...,,,,,,,,,,
4,OMS,RDC : le deuil coutumier de 40 jours aide à fr...,2025-10-18T12:00:00Z,https://www.afro.who.int/fr/countries/democrat...,RDC : le deuil coutumier de 40 jours aide à fr...,,,,,,...,,,,,,,,,,


In [19]:
df = df.iloc[:, [0, 1, 2, 3, 4]]

In [20]:
df.head()

Unnamed: 0,source,titre,date,lien,texte
0,OMS,"Soins, compassion et guérison du plus jeune pa...",2025-11-05T12:00:00Z,https://www.afro.who.int/fr/countries/democrat...,"Soins, compassion et guérison du plus jeune pa..."
1,OMS,"Comprendre la prévalence, les risques et les m...",2025-11-03T12:00:00Z,https://www.afro.who.int/fr/countries/senegal/...,"Comprendre la prévalence, les risques et les m..."
2,OMS,Les centres de prise en charge au cœur de la l...,2025-10-31T12:00:00Z,https://www.afro.who.int/fr/countries/burundi/...,Les centres de prise en charge au cœur de la l...
3,OMS,L’épidémie de choléra en recul au Congo,2025-10-21T12:00:00Z,https://www.afro.who.int/fr/countries/congo/ne...,L’épidémie de choléra en recul au Congo\n21 oc...
4,OMS,RDC : le deuil coutumier de 40 jours aide à fr...,2025-10-18T12:00:00Z,https://www.afro.who.int/fr/countries/democrat...,RDC : le deuil coutumier de 40 jours aide à fr...


In [23]:
# 3) Nettoyage de base (HTML, espaces, encodage, dates)

# === Nettoyage léger (pour BERT) et prétraitement pour TF-IDF ===
from langdetect import detect
from bs4 import BeautifulSoup
import spacy
import re

nlp_fr = spacy.load("fr_core_news_sm", disable=["parser","ner"])
nlp_en = spacy.load("en_core_web_sm", disable=["parser","ner"])

def clean_html_only(text):
    """Nettoyage léger : retire script/style/URLs, conserve phrases intactes."""
    soup = BeautifulSoup(str(text), "html.parser")
    for s in soup(["script","style"]):
        s.decompose()
    out = soup.get_text(separator=" ")
    out = re.sub(r"https?://\S+|www\.\S+|\S+@\S+", " ", out)
    out = out.replace("’", "'").replace("‘","'")
    out = re.sub(r"\s+", " ", out).strip()
    txt = re.sub(r'javascript.*disabled.*', ' ', text, flags=re.I)
    txt = re.sub(r'cookie.*', ' ', text, flags=re.I)
    return out

def preprocess_for_tfidf(text, lang_hint='fr', min_tok=2):
    """Nettoyage plus agressif pour TF-IDF / LDA : lowercase, remove stopwords, lemmatisation."""
    if not isinstance(text, str):
        text = str(text or "")
    text = clean_html_only(text)
    text = text.lower()
    # keep letters & accents & apostrophes and spaces
    text = re.sub(r"[^a-z0-9àâäçéèêëîïôöùûüÿœæ'\s-]", " ", text)
    doc = (nlp_fr if str(lang_hint).startswith("fr") else nlp_en)(text)
    toks = []
    for t in doc:
        if t.is_stop or t.is_punct or t.is_space or t.like_num:
            continue
        lemma = t.lemma_.lower().strip()
        if len(lemma) < min_tok:
            continue
        toks.append(lemma)
    return " ".join(toks)

# detection de langue

def safe_detect(s):
    try: return detect(s)
    except: return 'unknown'


In [24]:
df_clean = df.copy()


df_clean['texte_clean_bert'] = df['texte'].astype(str).apply(clean_html_only)
df_clean['lang'] = df_clean['texte_clean_bert'].apply(lambda s: safe_detect(s) if s.strip() else 'unknown')
df_clean['texte_clean_tfidf'] = df_clean.apply(lambda r: preprocess_for_tfidf(r['texte'], lang_hint=r['lang']), axis=1)

# vérifier quelques exemples
df_clean[['texte','texte_clean_bert','texte_clean_tfidf']].head(3)

Unnamed: 0,texte,texte_clean_bert,texte_clean_tfidf
0,"Soins, compassion et guérison du plus jeune pa...","Soins, compassion et guérison du plus jeune pa...",soin compassion guérison jeune patient épidémi...
1,"Comprendre la prévalence, les risques et les m...","Comprendre la prévalence, les risques et les m...",comprendre prévalence risque mécanisme prévent...
2,Les centres de prise en charge au cœur de la l...,Les centres de prise en charge au cœur de la l...,centre prise charge cœur lutte contre mpo buru...


In [26]:
# doublons exacts sur le texte
df_clean = df_clean.drop_duplicates(subset=['texte']).reset_index(drop=True)

In [27]:
# 3) Chargement des données et séparation OMS / Forbes


for c in ['source','texte_clean_bert','texte_clean_tfidf','lang','titre','summary','keywords']:
    if c not in df_clean.columns:
        df_clean[c] = ''

mask_oms = df_clean['source'].str.contains('OMS|WHO', case=False, na=False)
mask_forbes = df_clean['source'].str.contains('Forbes', case=False, na=False)
df_oms = df_clean[mask_oms].copy()
df_forbes = df_clean[mask_forbes].copy()
print('OMS docs:', df_oms.shape[0], 'Forbes docs:', df_forbes.shape[0])


OMS docs: 59 Forbes docs: 151


## 3) Extraction des topics OMS (référence)
Objectif : créer des topics à partir du corpus OMS qui serviront de référence. Méthodes proposées : LDA via gensim; si gensim absent, prévoir une liste manuelle de keywords.

In [28]:
# 3a) Tokenization simple pour LDA (utilise texte_clean_tfidf)
texts_oms = df_oms['texte_clean_tfidf'].astype(str).tolist()
texts_oms = [t.split() for t in texts_oms if isinstance(t, str) and t.strip()]
print('Number of OMS tokenized docs:', len(texts_oms))

# 3b) Run gensim LDA if available
topics = {}
try:
    import gensim
    dictionary = gensim.corpora.Dictionary(texts_oms)
    dictionary.filter_extremes(no_below=2, no_above=0.8, keep_n=20000)
    corpus = [dictionary.doc2bow(text) for text in texts_oms]
    num_topics = 8
    lda = gensim.models.LdaModel(corpus=corpus, id2word=dictionary, num_topics=num_topics, passes=8, random_state=42)
    for i in range(num_topics):
        topics[i] = [word for word, prob in lda.show_topic(i, topn=15)]
        print(f'Topic {i}:', topics[i])
    with open(os.path.join(RESULTS_DIR, 'oms_topics.json'), 'w', encoding='utf-8') as f:
        json.dump(topics, f, ensure_ascii=False, indent=2)
    print('Saved OMS topics to', os.path.join(RESULTS_DIR, 'oms_topics.json'))
except Exception as e:
    print('Gensim/LDA unavailable or failed:', e)
    # Fallback example: small manual topic lists (à adapter)
    topics = {
        0:['vaccin','vaccination','immunisation','dose','campagne'],
        1:['covid','pandemi','virus','epidemie','sars'],
        2:['sante','systeme','soins','hopital','personnel']
    }
    print('Using fallback topic keywords (manual).')


Number of OMS tokenized docs: 59
Topic 0: ['enfant', 'charge', 'soin', 'service', 'vaccination', 'personne', 'centre', 'région', 'accès', 'mobile', 'contre', 'zone', 'jeune', 'cas', 'effort']
Topic 1: ['traitement', 'soin', 'maladie', 'ebola', 'centre', 'patient', 'national', 'dépistage', 'service', 'cancer', 'mettre', 'vih', 'contre', 'femme', 'lutte']
Topic 2: ['ambulance', 'enfant', 'jour', 'communautaire', 'cas', 'charge', 'sanitaire', 'contact', 'il', 'congo', '-t', 'agent', 'prise', 'district', 'prendre']
Topic 3: ['femme', 'maternel', 'contre', 'communauté', 'enfant', 'maladie', 'pays', 'soin', 'an', 'communautaire', 'campagne', 'décès', 'population', 'sanitaire', 'accouchement']
Topic 4: ['soin', 'centre', 'charge', 'eau', 'service', 'choléra', 'accès', 'prise', 'cas', 'décès', 'former', 'communauté', 'femme', 'épidémie', 'traitement']
Topic 5: ['tuberculose', 'traitement', 'diagnostic', 'maladie', 'contre', 'pays', 'patient', 'lutte', 'cas', 'radio', 'mpo', 'laboratoire', 'dép

In [29]:
# 3c) Prepare topic keyword sets (top-K tokens per topic)
TOP_K = 30
topic_topk = {t: (list(v)[:TOP_K] if isinstance(v, list) else list(v)[:TOP_K]) for t,v in topics.items()}
print('Prepared topic keyword lists for', len(topic_topk), 'topics')


Prepared topic keyword lists for 8 topics


## 4) Embeddings (Sentence-BERT)
Objectif : encoder les documents pour analyses sémantiques. Charger `embeddings.npy` si présent sinon calculer avec `sentence-transformers`.

In [30]:
# 4a) Load existing embeddings or compute them
embeddings = None
if os.path.exists(EMB_PATH):
    try:
        embeddings = np.load(EMB_PATH)
        print('Loaded embeddings shape:', embeddings.shape)
    except Exception as e:
        print('Failed to load embeddings.npy:', e)

if embeddings is None:
    try:
        from sentence_transformers import SentenceTransformer
        model_name = 'all-mpnet-base-v2'
        print('Loading SBERT model:', model_name)
        sbert = SentenceTransformer(model_name)
        texts_all = df_clean['texte_clean_bert'].astype(str).tolist()
        embeddings = sbert.encode(texts_all, show_progress_bar=True, convert_to_numpy=True)
        np.save(EMB_PATH, embeddings)
        print('Saved embeddings to', EMB_PATH)
    except Exception as e:
        print('SBERT encoding failed (offline or missing):', e)
        embeddings = None

# map indices per subset
if embeddings is not None:
    idx_oms = df_clean.index[mask_oms].tolist()
    idx_forbes = df_clean.index[mask_forbes].tolist()
    emb_oms = embeddings[idx_oms] if idx_oms else np.empty((0, embeddings.shape[1]))
    emb_forbes = embeddings[idx_forbes] if idx_forbes else np.empty((0, embeddings.shape[1]))
    print('emb_oms:', emb_oms.shape, 'emb_forbes:', emb_forbes.shape)


  from .autonotebook import tqdm as notebook_tqdm


Loading SBERT model: all-mpnet-base-v2


Batches: 100%|██████████| 7/7 [00:58<00:00,  8.40s/it]

Saved embeddings to outputs\embeddings.npy
emb_oms: (59, 768) emb_forbes: (151, 768)





## 5) Centroids sémantiques des topics OMS
Objectif : pour chaque topic, calculer un centroid (moyenne des embeddings) sur les documents OMS associés au topic. Nous utilisons LDA assignments si disponibles, sinon un heuristic par mots-clés.

In [32]:
# 5a) Assign OMS docs to topics (LDA assignment if available, else keywords overlap)
noms_topic_assign = {}
if 'lda' in globals():
    # If we used gensim LDA, compute doc-topic for each OMS doc
    for i, bow in enumerate(corpus):
        dist = lda.get_document_topics(bow)
        if dist:
            topic_id = max(dist, key=lambda x: x[1])[0]
            noms_topic_assign.setdefault(topic_id, []).append(i)
    print('Assigned OMS docs via LDA')
else:
    # fallback: keyword overlap
    oms_texts = df_oms['texte_clean_tfidf'].astype(str).tolist()
    for i, text in enumerate(oms_texts):
        tokens = set(text.split())
        best_t = None; best_overlap = 0
        for t, kws in topic_topk.items():
            ov = len(tokens & set(kws))
            if ov > best_overlap:
                best_overlap = ov; best_t = t
        if best_t is not None and best_overlap>0:
            noms_topic_assign.setdefault(best_t, []).append(i)
    print('Assigned OMS docs via keyword overlap (fallback)')

# 5b) Compute centroids
topic_centroids = {}
if embeddings is not None and len(noms_topic_assign)>0:
    oms_global_indices = df_clean.index[mask_oms].tolist()
    for t, local_idxs in noms_topic_assign.items():
        global_idxs = [oms_global_indices[i] for i in local_idxs if i < len(oms_global_indices)]
        if not global_idxs: continue
        vecs = embeddings[global_idxs]
        topic_centroids[t] = vecs.mean(axis=0)
    print('Computed centroids for topics:', list(topic_centroids.keys()))
else:
    print('Embeddings or topic assignments missing; cannot compute centroids')


Assigned OMS docs via LDA
Computed centroids for topics: [1, 5, 6, 4, 2, 0, 3, 7]


## 6) Lexical coverage — combien d'éléments du vocabulaire d'un topic OMS apparaissent dans un article Forbes ?
Objectif : score simple basé sur overlap tokens / topic_keywords.

In [33]:
# 6a) lexical coverage function
def lexical_coverage(text, topic_words):
    toks = [t for t in str(text).split() if t]
    if not toks: return 0.0
    overlap = sum(1 for t in toks if t in set(topic_words))
    return overlap / len(toks)

# 6b) compute coverage for Forbes articles
lex_rows = []
for idx in df_forbes.index.tolist():
    row = {'global_index': int(idx)}
    txt = df_clean.loc[idx, 'texte_clean_tfidf']
    for t, kws in topic_topk.items():
        row[f'lex_topic_{t}'] = lexical_coverage(txt, kws)
    lex_rows.append(row)
lex_df = pd.DataFrame(lex_rows).fillna(0)
lex_df.to_csv(os.path.join(RESULTS_DIR, 'lexical_coverage_forbes.csv'), index=False)
print('Saved lexical coverage to', os.path.join(RESULTS_DIR, 'lexical_coverage_forbes.csv'))


Saved lexical coverage to outputs\analysis_results\lexical_coverage_forbes.csv


## 7) Similarité sémantique (embeddings) — mesure des distances entre articles Forbes et les centroids OMS
Objectif : score cosinus entre embedding article Forbes et centroid topic OMS.

In [34]:
# 7) compute semantic similarity if available
if embeddings is not None and topic_centroids:
    sim_rows = []
    for idx in df_forbes.index.tolist():
        emb = embeddings[int(idx)]
        row = {'global_index': int(idx)}
        for t, cent in topic_centroids.items():
            sim = float(cosine_similarity(emb.reshape(1,-1), cent.reshape(1,-1))[0,0])
            row[f'sim_topic_{t}'] = sim
        sim_rows.append(row)
    sim_df = pd.DataFrame(sim_rows).fillna(0)
    sim_df.to_csv(os.path.join(RESULTS_DIR, 'semantic_similarity_forbes.csv'), index=False)
    print('Saved semantic similarity to', os.path.join(RESULTS_DIR, 'semantic_similarity_forbes.csv'))
else:
    print('Embeddings or centroids missing; semantic similarity skipped')


Saved semantic similarity to outputs\analysis_results\semantic_similarity_forbes.csv


## 8) Coverage combined (lexical OR semantic) — créer des flags de couverture
Objectif : combiner lexical + semantic pour marquer si un article Forbes couvre un topic OMS (seuils à ajuster).

In [35]:
# 8) Combine coverage
lex_path = os.path.join(RESULTS_DIR, 'lexical_coverage_forbes.csv')
sim_path = os.path.join(RESULTS_DIR, 'semantic_similarity_forbes.csv')
lex = pd.read_csv(lex_path) if os.path.exists(lex_path) else pd.DataFrame()
sim = pd.read_csv(sim_path) if os.path.exists(sim_path) else pd.DataFrame()
if not lex.empty:
    cov = lex.copy()
    if not sim.empty:
        cov = cov.merge(sim, on='global_index', how='left')
    # define thresholds (modifiable)
    for t in topic_topk.keys():
        cov[f'covered_topic_{t}'] = ((cov.get(f'lex_topic_{t}',0) > 0.08) | (cov.get(f'sim_topic_{t}',0) > 0.55)).astype(int)
    cov.to_csv(os.path.join(RESULTS_DIR, 'coverage_combined_forbes.csv'), index=False)
    print('Saved combined coverage flags to', os.path.join(RESULTS_DIR, 'coverage_combined_forbes.csv'))
else:
    print('Lexical coverage missing; run lexical step first')


Saved combined coverage flags to outputs\analysis_results\coverage_combined_forbes.csv


## 9) Sentiment (document-level) — pipeline transformers
Objectif : attribuer un label de sentiment (POS/NEG/NEU) pour chaque document. Utiliser avec prudence pour le français; préférence pour un modèle francophone si majoritairement FR.

In [39]:
# 9) Document-level sentiment

import os
import numpy as np
import pandas as pd
from transformers import pipeline
from tqdm import tqdm



# 4) Choix du modèle
# Option A : modèle multilingue utile si corpus FR+EN -> nlptown/bert-base-multilingual-uncased-sentiment (1-5 stars)
# Option B : modèle francophone si corpus majoritairement FR -> 'nlptown/bert-base-multilingual-uncased-sentiment' is still ok
# Option C : modèle anglais si corpus majoritairement EN -> 'distilbert-base-uncased-finetuned-sst-2-english'
MODEL = 'nlptown/bert-base-multilingual-uncased-sentiment'  # recommandé pour FR+EN simple

# 5) Créer pipeline explicitement (device = 0 for GPU, -1 for CPU)
# If you have GPU available, set device=0; otherwise leave device=-1
sent_pipe = pipeline('sentiment-analysis', model=MODEL, tokenizer=MODEL, device=-1)

# 6) Utility: map model outputs to POS/NEU/NEG
# Note: nlptown returns labels like "1 star", "2 stars"... we map them:
def map_multilang_label(label):
    # examples: '1 star', '2 stars', ..., '5 stars'
    if isinstance(label, str) and 'star' in label:
        n = int(label.split()[0])
        if n <= 2:
            return 'NEG'
        elif n == 3:
            return 'NEU'
        else:
            return 'POS'
    # fallback for English-style labels
    if label.upper().startswith('NEG'):
        return 'NEG'
    if label.upper().startswith('POS'):
        return 'POS'
    return 'NEU'

# 7) Batch inference (safe: truncate long texts)
BATCH = 8
results = []
for i in tqdm(range(0, len(df), BATCH), desc='Sentiment batches'):
    batch_texts = df_clean['texte_clean_bert'].iloc[i:i+BATCH].tolist()
    # truncate each text to ~1000 chars (adjust) to avoid pipeline failures/long times
    batch_texts_trunc = [t[:1200] for t in batch_texts]
    try:
        outs = sent_pipe(batch_texts_trunc)
    except Exception as e:
        # fallback: try per-item to isolate problematic examples
        outs = []
        for t in batch_texts_trunc:
            try:
                outs.append(sent_pipe(t)[0])
            except Exception as ee:
                outs.append({'label': 'NEU', 'score': 0.0})
    # normalize outputs
    for out in outs:
        label = out.get('label', '')
        score = float(out.get('score', 0.0))
        norm_label = map_multilang_label(label)
        results.append({'label_raw': label, 'label': norm_label, 'score': score})

# 8) Attacher résultats au dataframe et sauvegarder
res_df = pd.DataFrame(results)
res_df.index = df_clean.index  # align indices
df_clean['sent_label_raw'] = res_df['label_raw']
df_clean['sent_label'] = res_df['label']
df_clean['sent_score'] = res_df['score']

OUT_CSV = os.path.join(OUT_DIR, 'all_with_sentiment.csv')
df.to_csv(OUT_CSV, index=False)
print('Saved sentiment-annotated CSV to', OUT_CSV)

Device set to use cpu
Sentiment batches: 100%|██████████| 58/58 [00:48<00:00,  1.20it/s]

Saved sentiment-annotated CSV to outputs\all_with_sentiment.csv





## 10) Aspect-based sentiment (phrase-level for topic keywords)
Objectif : pour chaque article Forbes qui couvre un topic OMS, extraire phrases contenant topic keywords et évaluer le sentiment de ces phrases (approche simple).

In [40]:
# 10) Aspect-based sentiment
import re
try:
    import nltk
    nltk.download('punkt', quiet=True)
    from nltk import sent_tokenize
except Exception:
    def sent_tokenize(x):
        return str(x).split('. ')

aspect_records = []
if os.path.exists(os.path.join(RESULTS_DIR, 'coverage_combined_forbes.csv')) and 'sent_pipe' in globals():
    cov = pd.read_csv(os.path.join(RESULTS_DIR, 'coverage_combined_forbes.csv'))
    for _, r in cov.iterrows():
        gidx = int(r['global_index'])
        art_text = df_clean.loc[gidx, 'texte_clean_bert']
        for t in topic_topk.keys():
            if r.get(f'covered_topic_{t}',0) == 1:
                kws = topic_topk[t]
                sents = [s for s in sent_tokenize(str(art_text)) if any(re.search(r'\b'+re.escape(k)+r'\b', s, flags=re.I) for k in kws)]
                if not sents:
                    sents = sent_tokenize(str(art_text))[:3]
                scores = []
                for s in sents:
                    try:
                        out = sent_pipe(s[:1000])[0]
                        scores.append({'label': out['label'], 'score': float(out.get('score',0.0))})
                    except Exception:
                        pass
                if scores:
                    mapping = {'POSITIVE':1,'NEGATIVE':-1}
                    vals = [mapping.get(x['label'].upper(),0)*x['score'] for x in scores]
                    aspect_records.append({'global_index': gidx, 'topic': t, 'aspect_sentiment': float(np.mean(vals)), 'n_sentences': len(sents)})
    aspect_df = pd.DataFrame(aspect_records)
    aspect_df.to_csv(os.path.join(RESULTS_DIR, 'aspect_sentiment_forbes.csv'), index=False)
    print('Saved aspect-level sentiment to', os.path.join(RESULTS_DIR, 'aspect_sentiment_forbes.csv'))
else:
    print('Either coverage file missing or sentiment pipeline not available; skip aspect-based sentiment')


Saved aspect-level sentiment to outputs\analysis_results\aspect_sentiment_forbes.csv


## 11) Framing (rule-based)
Objectif : catégoriser le cadrage d'un article (economic / health / neutral / mixed) pour les articles Forbes couvrant un topic OMS via dictionnaires de mots-clés simples.

In [41]:
# 11) Framing rule-based
ECON = set(['investissement','investir','marché','startup','business','venture','financement','investor','profit','entreprise'])
HEALTH = set(['prévention','vaccin','vaccination','soins','sante','epidemie','mortalite','clinique','hopital','OMS','ministere'])
framing = []
if os.path.exists(os.path.join(RESULTS_DIR, 'coverage_combined_forbes.csv')):
    cov = pd.read_csv(os.path.join(RESULTS_DIR, 'coverage_combined_forbes.csv'))
    for _, r in cov.iterrows():
        gidx = int(r['global_index'])
        txt = set(str(df_clean.loc[gidx,'texte_clean_bert']).lower().split())
        for t in topic_topk.keys():
            if r.get(f'covered_topic_{t}',0) == 1:
                econ = len(txt & ECON)
                heal = len(txt & HEALTH)
                if econ > heal and econ>=1:
                    f = 'economic'
                elif heal > econ and heal>=1:
                    f = 'health'
                elif econ==0 and heal==0:
                    f = 'neutral'
                else:
                    f = 'mixed'
                framing.append({'global_index': gidx, 'topic': t, 'framing': f})
    framing_df = pd.DataFrame(framing)
    framing_df.to_csv(os.path.join(RESULTS_DIR, 'framing_forbes.csv'), index=False)
    print('Saved framing to', os.path.join(RESULTS_DIR, 'framing_forbes.csv'))
else:
    print('Coverage file missing; run coverage step first')


Saved framing to outputs\analysis_results\framing_forbes.csv


## 12) Entity extraction (spaCy) et co-occurrence
Objectif : extraire entités (ORG, PERSON, GPE) depuis les articles Forbes et produire un CSV utilisable pour construire un réseau d'acteurs.

In [42]:
# 12) Entity extraction
try:
    import spacy
    nlp_fr = spacy.load('fr_core_news_sm')
    nlp_en = spacy.load('en_core_web_sm')
except Exception as e:
    print('spaCy models not installed or failed to load:', e)
    nlp_fr = nlp_en = None

ent_rows = []
if nlp_fr is not None:
    for idx, row in df_forbes.iterrows():
        lang = row.get('lang','fr')
        nlp = nlp_fr if str(lang).startswith('fr') else nlp_en
        doc = nlp(str(row['texte_clean_bert']))
        for ent in doc.ents:
            if ent.label_ in ('PER','ORG','GPE','LOC','MISC','PERSON'):
                ent_rows.append({'global_index': int(idx), 'entity': ent.text, 'label': ent.label_})
    ent_df = pd.DataFrame(ent_rows)
    ent_df.to_csv(os.path.join(RESULTS_DIR, 'forbes_entities.csv'), index=False)
    print('Saved entities to', os.path.join(RESULTS_DIR, 'forbes_entities.csv'))
else:
    print('spaCy not available — install the models to run NER')


Saved entities to outputs\analysis_results\forbes_entities.csv


## 13) Appariement OMS → Forbes (nearest neighbors)
Objectif : pour chaque document OMS, trouver les K articles Forbes les plus proches en similarité cosinus (embeddings). Permet une revue qualitative paire-à-paire.

In [43]:
# 13) Nearest neighbors (brute force)
if embeddings is not None and emb_oms.size and emb_forbes.size:
    sims = cosine_similarity(emb_oms, emb_forbes)
    top_k = 5
    pairs = []
    oms_global = df_clean.index[mask_oms].tolist()
    for i in range(sims.shape[0]):
        best_idx = np.argsort(sims[i])[-top_k:][::-1]
        for rank, j in enumerate(best_idx):
            pairs.append({'oms_index': int(oms_global[i]), 'forbes_index': int(df_clean.index[mask_forbes].tolist()[j]), 'rank': rank+1, 'similarity': float(sims[i, j])})
    pairs_df = pd.DataFrame(pairs)
    pairs_df.to_csv(os.path.join(RESULTS_DIR, 'oms_to_forbes_pairs.csv'), index=False)
    print('Saved pairs to', os.path.join(RESULTS_DIR, 'oms_to_forbes_pairs.csv'))
else:
    print('Embeddings or subsets missing; cannot compute NN pairs')


Saved pairs to outputs\analysis_results\oms_to_forbes_pairs.csv


## 14) Tests statistiques simples
Objectif : exemple de test pour vérifier si la couverture d'un topic OMS par Forbes est significativement différente (illustratif). Interpréter avec prudence sur petits corpus.

In [44]:
# 14) Chi-square example (illustrative)
import scipy.stats as stats
cov_file = os.path.join(RESULTS_DIR, 'coverage_combined_forbes.csv')
if os.path.exists(cov_file):
    cov = pd.read_csv(cov_file)
    topic = list(topic_topk.keys())[0]
    covered = cov[f'covered_topic_{topic}'].sum()
    not_covered = cov.shape[0] - covered
    oms_count = len(noms_topic_assign.get(topic, [])) if 'noms_topic_assign' in globals() else 0
    table = np.array([[covered, not_covered],[oms_count, max(1, len(df_oms)-oms_count)]])
    chi2, p, dof, ex = stats.chi2_contingency(table)
    print('Chi2:', chi2, 'p-value:', p)
    with open(os.path.join(RESULTS_DIR, 'stat_tests.txt'), 'w', encoding='utf-8') as f:
        f.write(f'Chi2 topic {topic}: {chi2}, p={p}\n')
    print('Saved simple stat test results to', os.path.join(RESULTS_DIR, 'stat_tests.txt'))
else:
    print('Coverage file missing; run coverage steps first')


Chi2: 1.7949457711303183 p-value: 0.18032472947879363
Saved simple stat test results to outputs\analysis_results\stat_tests.txt


## 15) Mots Cle

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
import nltk

# ------------- Mots-clés: TF-IDF top-n par document -------------
# Paramètres
TOP_K = 10


# Construire TF-IDF global (unigrammes; on peut filtrer stopwords via vectorizer si besoin)
tfidf = TfidfVectorizer(max_features=15000, ngram_range=(1,2))
X_tfidf = tfidf.fit_transform(df_clean['texte_clean_tfidf'].fillna(''))
feature_names = np.array(tfidf.get_feature_names_out())

# Fonction pour obtenir top-k tokens d'une ligne TF-IDF
def top_k_tfidf_row(row_vec, k=TOP_K):
    if row_vec.nnz == 0:
        return []
    # row_vec is sparse row; get indices and data
    idx = row_vec.indices
    data = row_vec.data
    # sort by value
    order = np.argsort(data)[::-1]
    top_idx = idx[order][:k]
    return feature_names[top_idx].tolist()

# Appliquer per-document
keywords_list = []
for i in range(X_tfidf.shape[0]):
    kws = top_k_tfidf_row(X_tfidf[i], k=TOP_K)
    keywords_list.append(", ".join(kws))

df_clean['keywords'] = keywords_list


## 16) Export & rapport
Les résultats (CSV/JSON) sont sauvegardés dans `<OUT_DIR>/analysis_results` :
- `oms_topics.json` (topics OMS),
- `lexical_coverage_forbes.csv`, `semantic_similarity_forbes.csv`, `coverage_combined_forbes.csv`,
- `document_sentiment.csv`, `aspect_sentiment_forbes.csv`, `framing_forbes.csv`,
- `forbes_entities.csv`, `oms_to_forbes_pairs.csv`, `stat_tests.txt`.

Next steps recommandés : ajuster seuils, inspecter manuellement paires, calibrer les modèles de sentiment pour le français, et itérer.

In [53]:
df_clean.to_csv("../data/all_data_processed.csv", index = False)