# Similar document search

## Objective
Given a text entered by the user, retrieve the most similar documents
from the BBC News corpus using TF-IDF
and cosine similarity.


# Cell 2: Library imports

In [3]:
import pandas as pd
import numpy as np

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Cell 3: Loading the cleaned data

In [4]:
df = pd.read_csv("data/bbc-text-cleaned.csv")
texts = df["clean_text"]

# Cell 4: TF-IDF vectorization

In [5]:
tfidf_vectorizer = TfidfVectorizer(
    max_df=0.9,
    min_df=5,
    ngram_range=(1, 2),
    stop_words="english"
)

X_tfidf = tfidf_vectorizer.fit_transform(texts)

# Cell 5: Preprocessing function for the user text

In [6]:
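import re
import string
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

# One-time downloads of the NLTK resources used below (no-op if already installed).
# Recent NLTK releases may also require "punkt_tab" for word_tokenize.
nltk.download("stopwords", quiet=True)
nltk.download("punkt", quiet=True)
nltk.download("wordnet", quiet=True)

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def preprocess_query(text):
    """Clean a raw query so that it resembles the cleaned corpus text."""
    text = text.lower()
    # Remove digits, then punctuation.
    text = re.sub(r"\d+", "", text)
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = word_tokenize(text)
    # Drop stop words and tokens of one or two characters.
    tokens = [w for w in tokens if w not in stop_words and len(w) > 2]
    # Reduce each remaining token to its lemma.
    tokens = [lemmatizer.lemmatize(w) for w in tokens]
    return " ".join(tokens)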

# Cell 6: Query text (user example)

In [7]:
query = """
The government announced new economic policies
to support businesses and reduce inflation.
"""

# Cell 7: Preprocessing and vectorization of the query

In [8]:
query_clean = preprocess_query(query)
query_vector = tfidf_vectorizer.transform([query_clean])
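
`transform` silently ignores query terms that are absent from the fitted vocabulary, so it can help to check how much of a query is actually represented. The sketch below shows one way to do that on a hypothetical toy corpus; `vocabulary_coverage` is an illustrative helper, not part of scikit-learn.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus standing in for the cleaned BBC texts.
toy_docs = [
    "government economic policy",
    "inflation business support",
    "football match result",
]
toy_vectorizer = TfidfVectorizer()
toy_vectorizer.fit(toy_docs)

def vocabulary_coverage(clean_query, vectorizer):
    """Split query terms into those the fitted vectorizer knows and those it ignores."""
    # build_analyzer applies the vectorizer's own tokenization and n-gram settings.
    terms = vectorizer.build_analyzer()(clean_query)
    known = [t for t in terms if t in vectorizer.vocabulary_]
    unknown = [t for t in terms if t not in vectorizer.vocabulary_]
    return known, unknown

toy_query = "government inflation cryptocurrency"
known, unknown = vocabulary_coverage(toy_query, toy_vectorizer)
toy_query_vector = toy_vectorizer.transform([toy_query])

print("known terms:", known)
print("ignored terms:", unknown)
print("non-zero entries in the query vector:", toy_query_vector.nnz)
```

If every term is ignored, the query vector is all zeros and every similarity score will be 0.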

# Cell 8: Computing cosine similarity

In [9]:
similarities = cosine_similarity(query_vector, X_tfidf)
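
`TfidfVectorizer` L2-normalizes each row by default (`norm="l2"`), so the cosine similarity between two of its vectors reduces to a plain dot product. The following sketch checks this on a hypothetical toy corpus, comparing `cosine_similarity` with `linear_kernel`; the toy names are illustrative only.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Hypothetical toy corpus, used only for illustration.
toy_docs = [
    "bank interest rate",
    "football match result",
    "interest rate rise",
]
toy_vectorizer = TfidfVectorizer()  # norm="l2" is the default
toy_matrix = toy_vectorizer.fit_transform(toy_docs)
toy_query_vector = toy_vectorizer.transform(["interest rate"])

cos = cosine_similarity(toy_query_vector, toy_matrix)  # shape (1, n_documents)
dot = linear_kernel(toy_query_vector, toy_matrix)      # plain dot products

print(cos.shape)
print(np.allclose(cos, dot))  # both agree because the rows are unit length
```

The result has one row per query and one column per document, which is why the retrieval step indexes `similarities[0]`.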

# Cell 9: Retrieving the most similar documents

In [10]:
top_n = 5

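# Indices of the top_n highest scores, best first.
top_indices = similarities[0].argsort()[-top_n:][::-1]

# Work on a copy so that adding a column does not trigger SettingWithCopyWarning.
df_results = df.iloc[top_indices][["category", "clean_text"]].copy()
df_results["similarity_score"] = similarities[0][top_indices]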

df_results

| | category | clean_text | similarity_score |
|---|---|---|---|
| 393 | business | market signal brazilian recovery brazilian sto... | 0.211913 |
| 322 | business | bank opts leave rate hold bank england left in... | 0.118574 |
| 704 | business | newest member underpin growth european union n... | 0.112868 |
| 625 | business | nigeria boost cocoa production government nige... | 0.107549 |
| 845 | business | brazil jobless rate hit new low brazil unemplo... | 0.106474 |
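
A full `argsort` sorts every score even though only a handful are kept. On a larger corpus, one possible alternative is `np.argpartition`, which selects the top candidates first and sorts only those. The sketch below is an illustration on made-up scores; `top_n_indices` is a hypothetical helper, not the method used in this notebook.

```python
import numpy as np

def top_n_indices(scores, n):
    """Return the indices of the n largest scores, best first."""
    scores = np.asarray(scores)
    n = min(n, scores.size)
    if n <= 0:
        return np.array([], dtype=int)
    # argpartition moves the n largest values into the last n slots (in arbitrary order).
    candidates = np.argpartition(scores, -n)[-n:]
    # Sort only those n candidates by descending score.
    order = np.argsort(scores[candidates])[::-1]
    return candidates[order]

# Made-up similarity scores for five documents.
toy_scores = np.array([0.05, 0.21, 0.00, 0.12, 0.11])
print(top_n_indices(toy_scores, 3))
```

For a corpus of a few thousand documents the difference is negligible, so the simpler `argsort` form is a reasonable choice here.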


# Cell 10: Readable display of the results

In [11]:
for _, row in df_results.iterrows():
    print(f"Similarity score: {row['similarity_score']:.3f}")
    print(f"Category: {row['category']}")
    print(f"Excerpt: {row['clean_text'][:300]}...")
    print("-" * 80)

Similarity score: 0.212
Category: business
Excerpt: market signal brazilian recovery brazilian stock market risen record high investor display growing confidence durability country economic recovery main bovespa index sao paolo stock exchange closed point friday topping previous record market close reached previous day market buoyancy reflects optimi...
--------------------------------------------------------------------------------
Similarity score: 0.119
Category: business
Excerpt: bank opts leave rate hold bank england left interest rate hold sixth month row bank monetary policy committee mpc decided take action amid mixed signal economy economist predict rise cost borrowing come later year interest rate rose five time november august soaring house price buoyant consumer data...
--------------------------------------------------------------------------------
Similarity score: 0.113
Category: business
Excerpt: newest member underpin growth european union newest m

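## Conclusion of the similarity search

- The system retrieves the documents closest to a given text. For the example query, the five best matches all belong to the business category, with scores between 0.106 and 0.212.
- Cosine similarity is well suited to TF-IDF vectors: it compares the direction of the vectors rather than their length, so document length has little influence on the score.
- This approach amounts to a simple text search engine.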
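
The steps of this notebook can be gathered into a single function to make the "simple search engine" idea concrete. The following is a minimal sketch under stated assumptions: `search` is a hypothetical helper, the toy DataFrame is made up, and the query is assumed to be already preprocessed. Filtering out zero scores is an added design choice, not something the notebook does.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def search(clean_query, vectorizer, doc_matrix, documents, top_n=5):
    """Return the top_n rows of `documents` most similar to an already-cleaned query."""
    query_vector = vectorizer.transform([clean_query])
    scores = cosine_similarity(query_vector, doc_matrix)[0]
    best = scores.argsort()[::-1][:top_n]
    results = documents.iloc[best].copy()
    results["similarity_score"] = scores[best]
    # Drop documents that share no vocabulary with the query.
    return results[results["similarity_score"] > 0]

# Hypothetical toy data with the same column names as the notebook's DataFrame.
toy_df = pd.DataFrame({
    "category": ["business", "sport", "business"],
    "clean_text": [
        "government economic policy inflation",
        "football match result",
        "bank interest rate inflation",
    ],
})
toy_vectorizer = TfidfVectorizer()
toy_matrix = toy_vectorizer.fit_transform(toy_df["clean_text"])

toy_results = search("government inflation", toy_vectorizer, toy_matrix, toy_df, top_n=2)
print(toy_results)
```

With the notebook's own objects, the equivalent call would be `search(preprocess_query(query), tfidf_vectorizer, X_tfidf, df)`.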
