# Word2Vec Training

Objective: train a Word2Vec model on the cleaned review corpus to obtain word embeddings.

Part of story **SAE-77**.

In [1]:
import sys
import os
import pandas as pd
import gensim
from gensim.models import Word2Vec
import logging

# Setup logging for Gensim
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

print(f"Gensim version: {gensim.__version__}")

Gensim version: 4.4.0


## Loading the Prepared Data
We use the preprocessed dataset (output of the full SAE-74 pipeline) if it exists; otherwise we preprocess a subset of the cleaned dataset on the fly.

In [2]:
preprocessed_path = '../../outputs/reviews_preprocessed.pkl'
cleaned_path = '../../data/cleaned/reviews_clean.parquet'

if os.path.exists(preprocessed_path):
    print("Loading preprocessed data...")
    reviews = pd.read_pickle(preprocessed_path)
    corpus = reviews['tokens_final'].tolist()
    print(f"Loaded {len(corpus)} documents from preprocessed file.")
elif os.path.exists(cleaned_path):
    print("Preprocessed file not found. Loading clean data and applying simple tokenization...")
    # Fallback if SAE-74 output missing (just for demo/dev flow)
    reviews = pd.read_parquet(cleaned_path)
    # Use a sample for speed if needed
    reviews = reviews.head(5000).copy()
    
    # Minimal tokenization for W2V
    sys.path.append(os.path.abspath(os.path.join('../..', 'src')))
    from text_preprocessing import preprocess_pipeline
    
    print("Preprocessing on the fly...")
    reviews['tokens_final'] = reviews['text'].apply(preprocess_pipeline)
    corpus = reviews['tokens_final'].tolist()
    print(f"Processed {len(corpus)} documents.")
else:
    print("No data found. Using dummy corpus.")
    corpus = [['good', 'food'], ['bad', 'service'], ['great', 'restaurant']]

Loading preprocessed data...
Loaded 2000 documents from preprocessed file.


## Configuration and Training

In [3]:
# Parameters
vector_size = 100
window = 5
min_count = 2  # Lowered min_count for testing on small samples
sg = 1 # Skip-gram

print("Training Word2Vec model...")
model = Word2Vec(
    sentences=corpus,
    vector_size=vector_size,
    window=window,
    min_count=min_count,
    workers=4,
    sg=sg,
    epochs=5
)

print("Training complete.")
print(f"Vocabulary size: {len(model.wv)}")

2026-02-06 10:42:41,017 : INFO : collecting all words and their counts


2026-02-06 10:42:41,018 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types


2026-02-06 10:42:41,031 : INFO : collected 11390 word types from a corpus of 107126 raw words and 2000 sentences


2026-02-06 10:42:41,032 : INFO : Creating a fresh vocabulary


2026-02-06 10:42:41,047 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=2 retains 5830 unique words (51.19% of original 11390, drops 5560)', 'datetime': '2026-02-06T10:42:41.047406', 'gensim': '4.4.0', 'python': '3.9.0 (tags/v3.9.0:9cf6752, Oct  5 2020, 15:34:40) [MSC v.1927 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.26100-SP0', 'event': 'prepare_vocab'}


2026-02-06 10:42:41,048 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=2 leaves 101566 word corpus (94.81% of original 107126, drops 5560)', 'datetime': '2026-02-06T10:42:41.048410', 'gensim': '4.4.0', 'python': '3.9.0 (tags/v3.9.0:9cf6752, Oct  5 2020, 15:34:40) [MSC v.1927 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.26100-SP0', 'event': 'prepare_vocab'}


2026-02-06 10:42:41,070 : INFO : deleting the raw counts dictionary of 11390 items


2026-02-06 10:42:41,072 : INFO : sample=0.001 downsamples 48 most-common words


2026-02-06 10:42:41,072 : INFO : Word2Vec lifecycle event {'msg': 'downsampling leaves estimated 95038.52799510738 word corpus (93.6%% of prior 101566)', 'datetime': '2026-02-06T10:42:41.072143', 'gensim': '4.4.0', 'python': '3.9.0 (tags/v3.9.0:9cf6752, Oct  5 2020, 15:34:40) [MSC v.1927 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.26100-SP0', 'event': 'prepare_vocab'}


Training Word2Vec model...


2026-02-06 10:42:41,114 : INFO : estimated required memory for 5830 words and 100 dimensions: 7579000 bytes


2026-02-06 10:42:41,115 : INFO : resetting layer weights


2026-02-06 10:42:41,119 : INFO : Word2Vec lifecycle event {'update': False, 'trim_rule': 'None', 'datetime': '2026-02-06T10:42:41.119914', 'gensim': '4.4.0', 'python': '3.9.0 (tags/v3.9.0:9cf6752, Oct  5 2020, 15:34:40) [MSC v.1927 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.26100-SP0', 'event': 'build_vocab'}


2026-02-06 10:42:41,120 : INFO : Word2Vec lifecycle event {'msg': 'training model with 4 workers on 5830 vocabulary and 100 features, using sg=1 hs=0 sample=0.001 negative=5 window=5 shrink_windows=True', 'datetime': '2026-02-06T10:42:41.120914', 'gensim': '4.4.0', 'python': '3.9.0 (tags/v3.9.0:9cf6752, Oct  5 2020, 15:34:40) [MSC v.1927 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.26100-SP0', 'event': 'train'}


2026-02-06 10:42:41,279 : INFO : EPOCH 0: training on 107126 raw words (94998 effective words) took 0.2s, 618667 effective words/s


2026-02-06 10:42:41,434 : INFO : EPOCH 1: training on 107126 raw words (95057 effective words) took 0.2s, 632269 effective words/s


2026-02-06 10:42:41,591 : INFO : EPOCH 2: training on 107126 raw words (95034 effective words) took 0.2s, 619024 effective words/s


2026-02-06 10:42:41,748 : INFO : EPOCH 3: training on 107126 raw words (95081 effective words) took 0.2s, 623341 effective words/s


2026-02-06 10:42:41,903 : INFO : EPOCH 4: training on 107126 raw words (95015 effective words) took 0.2s, 632184 effective words/s


2026-02-06 10:42:41,904 : INFO : Word2Vec lifecycle event {'msg': 'training on 535630 raw words (475185 effective words) took 0.8s, 607339 effective words/s', 'datetime': '2026-02-06T10:42:41.904342', 'gensim': '4.4.0', 'python': '3.9.0 (tags/v3.9.0:9cf6752, Oct  5 2020, 15:34:40) [MSC v.1927 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.26100-SP0', 'event': 'train'}


2026-02-06 10:42:41,905 : INFO : Word2Vec lifecycle event {'params': 'Word2Vec<vocab=5830, vector_size=100, alpha=0.025>', 'datetime': '2026-02-06T10:42:41.905409', 'gensim': '4.4.0', 'python': '3.9.0 (tags/v3.9.0:9cf6752, Oct  5 2020, 15:34:40) [MSC v.1927 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.26100-SP0', 'event': 'created'}


Training complete.
Vocabulary size: 5830
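
The `prepare_vocab` log lines above report that `min_count=2` keeps 5830 of 11390 word types (51.19%) while keeping 94.81% of the tokens. Both counts drop by 5560 because a threshold of 2 removes only words that occur exactly once. The sketch below reproduces that frequency rule with the standard library on a hypothetical toy corpus. The helper `vocab_retention` is an illustrative name, and the code mimics the threshold only, not Gensim's internal vocabulary building.

```python
from collections import Counter


def vocab_retention(corpus, min_count):
    """Return (kept_types, total_types, kept_tokens, total_tokens)
    after discarding words that occur fewer than min_count times."""
    counts = Counter(token for doc in corpus for token in doc)
    kept = {word: n for word, n in counts.items() if n >= min_count}
    return len(kept), len(counts), sum(kept.values()), sum(counts.values())


# Hypothetical toy corpus, not taken from the review dataset.
toy_corpus = [
    ["good", "food", "good", "service"],
    ["bad", "service", "slow"],
    ["good", "pizza"],
]

kept_types, total_types, kept_tokens, total_tokens = vocab_retention(
    toy_corpus, min_count=2
)
print(f"Word types kept: {kept_types}/{total_types}")
print(f"Tokens kept: {kept_tokens}/{total_tokens}")
```

As in the real run, pruning rare words removes a large share of the vocabulary but a much smaller share of the running text.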


## Qualitative Evaluation

In [4]:
test_words = ['good', 'bad', 'food', 'service', 'pizza']

for word in test_words:
    if word in model.wv:
        try:
            similar = model.wv.most_similar(word, topn=5)
            print(f"\nMost similar to '{word}':")
            for neighbor, score in similar:
                print(f"  - {neighbor}: {score:.4f}")
        except Exception as e:
            print(f"Error with word '{word}': {e}")
    else:
        print(f"\nWord '{word}' not in vocabulary.")


Most similar to 'good':
  - pretty: 0.9365
  - choice: 0.9164
  - quality: 0.9155
  - ok: 0.9145
  - tasty: 0.9088

Most similar to 'bad':
  - anything: 0.9419
  - point: 0.9417
  - say: 0.9411
  - review: 0.9399
  - reason: 0.9372

Most similar to 'food':
  - meal: 0.9355
  - waiter: 0.9332
  - fast: 0.9296
  - pretty: 0.9254
  - quick: 0.9207

Most similar to 'service':
  - quick: 0.9064
  - fast: 0.9014
  - attentive: 0.9007
  - pleasant: 0.8965
  - food: 0.8920

Most similar to 'pizza':
  - loved: 0.9611
  - favorite: 0.9486
  - tried: 0.9478
  - breakfast: 0.9465
  - seafood: 0.9447
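
The scores printed by `most_similar` are cosine similarities between word vectors. As a minimal sketch of that ranking, the code below computes cosine similarity by hand on hypothetical 2-dimensional vectors. The names `cosine_similarity`, `rank_neighbors`, and `toy_vectors` are illustrative and are not part of Gensim. The vectors are made up for the example and are not taken from the trained model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_neighbors(word, vectors, topn=5):
    """Rank every other word by cosine similarity to `word`, highest first."""
    query = vectors[word]
    scored = [
        (other, cosine_similarity(query, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


# Hypothetical 2-dimensional embeddings, chosen by hand for illustration.
toy_vectors = {
    "good": [1.0, 0.0],
    "tasty": [1.0, 1.0],
    "slow": [0.0, 1.0],
    "bad": [-1.0, 0.0],
}

neighbors = rank_neighbors("good", toy_vectors, topn=3)
for neighbor, score in neighbors:
    print(f"  - {neighbor}: {score:.4f}")
```

Because cosine similarity ranges from -1 to 1, results in which every neighbor scores around 0.9 or higher, as in the output above, may indicate that the vectors are not yet well separated, which is plausible for a 2000-document training corpus.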


## Saving

In [5]:
output_dir = '../../outputs/models'
os.makedirs(output_dir, exist_ok=True)
model_path = os.path.join(output_dir, 'word2vec_yelp.model')

model.save(model_path)
print(f"Model saved to {model_path}")

2026-02-06 10:42:41,950 : INFO : Word2Vec lifecycle event {'fname_or_handle': '../../outputs/models\\word2vec_yelp.model', 'separately': 'None', 'sep_limit': 10485760, 'ignore': frozenset(), 'datetime': '2026-02-06T10:42:41.950183', 'gensim': '4.4.0', 'python': '3.9.0 (tags/v3.9.0:9cf6752, Oct  5 2020, 15:34:40) [MSC v.1927 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.26100-SP0', 'event': 'saving'}


2026-02-06 10:42:41,951 : INFO : not storing attribute cum_table


2026-02-06 10:42:41,959 : INFO : saved ../../outputs/models\word2vec_yelp.model


Model saved to ../../outputs/models\word2vec_yelp.model
