<a href="https://colab.research.google.com/github/crystalloide/RAG/blob/main/LAB01_Visualisation_tokenisation_embeddings_open_source2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lab 1: Visualizing Tokenization and Embeddings (Open Source)

---

**Objectives:**
- Understand how text is split into tokens and converted into numeric vectors (embeddings).
- Learn to visualize and interpret these structures.
- **Use only free, open-source models.**

**Estimated duration:**
- 90–120 minutes

**Deliverable:**
- A notebook with plots showing the token segmentation and a 2D visualization of the embedding vectors.

## Step 1: Prerequisites (5 min)

Install the required libraries:

In [None]:
!pip install transformers sentence-transformers matplotlib scikit-learn pandas numpy networkx scipy -q

## Step 2: Inspecting tokenization (15 min)

Use `AutoTokenizer` from `transformers` to split text into tokens:

In [None]:
from transformers import AutoTokenizer

# Initialize tokenizer with a popular open-source model (DistilBERT)
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

# Sample text
sample_text = "Agentic AI agents can plan, reason, and use tools."

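# Tokenize: encode() returns token IDs and adds the special [CLS]/[SEP] tokens,
# while tokenize() returns the subword strings without them.
tokens = tokenizer.encode(sample_text)
tokens_decoded = tokenizer.tokenize(sample_text)

# Display results
print("="*70)
print(f"Original text: {sample_text}")
print(f"Number of tokens (including [CLS]/[SEP]): {len(tokens)}")
print(f"Token IDs: {tokens}")
print("\nTokens (individually, without special tokens):")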
for i, token in enumerate(tokens_decoded, 1):
    print(f" {i}. '{token}'")
print("="*70)
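To make the subword splitting less opaque, here is a minimal sketch of the greedy longest-match-first strategy used by WordPiece-style tokenizers such as DistilBERT's. The `wordpiece_tokenize` helper and `toy_vocab` are hypothetical and exist only for illustration; the real vocabulary has roughly 30,000 entries and may split these words differently.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single, already lowercased word."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate substring until it is found in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # prefix marking a continuation piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary (assumption, not the DistilBERT vocabulary)
toy_vocab = {"agent", "##ic", "##s", "plan", "tool"}

for word in ["agentic", "agents", "tools", "🔥"]:
    print(word, "->", wordpiece_tokenize(word, toy_vocab))
```

With this toy vocabulary, `agentic` splits into `agent` + `##ic`, and the emoji, which has no matching piece, falls back to `[UNK]`.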

### Experiment with different kinds of text

Let's see how punctuation, whitespace, and special characters affect tokenization:

In [None]:
test_texts = [
    "Hello, World!",
    "HelloWorld",
    "hello world",
    "AI is 🔥.",
    "def my_function():\n return True",
    "AGENTIC",
    "agentic",
]

print("\n" + "="*70)
print("Tokenization analysis on different text examples")
print("="*70)

for text in test_texts:
    tokens = tokenizer.encode(text)
    decoded = tokenizer.tokenize(text)
    print(f"\nText: {repr(text)}")
    print(f"Tokens: {len(tokens)} → {tokens}")
    print(f"Decoded: {decoded}")
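`"AGENTIC"` and `"agentic"` should yield identical tokens because an uncased tokenizer normalizes the text before splitting it. The sketch below approximates that normalization with lowercasing and accent stripping. `uncased_normalize` is a hypothetical helper, not the library's own function, and the real pipeline performs further cleanup steps.

```python
import unicodedata


def uncased_normalize(text):
    """Approximate the normalization of an uncased BERT-style tokenizer."""
    lowered = text.lower()
    # NFD separates base letters from their combining accent marks.
    decomposed = unicodedata.normalize("NFD", lowered)
    # Drop combining marks (Unicode category "Mn"), keeping the base letters.
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


for raw in ["AGENTIC", "agentic", "Café"]:
    print(repr(raw), "->", repr(uncased_normalize(raw)))
```

Because normalization happens first, case differences disappear before any vocabulary lookup takes place.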

## Step 3: Comparing token counts across texts (10 min)

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

sentences = [
    "AI is amazing.",
    "Artificial Intelligence is amazing.",
    "AI is 🔥.",
    "The quick brown fox jumps over the lazy dog.",
    "The quick brown fox jumps over the lazy dog. The quick brown fox jumps over the lazy dog.",
]

# Collect data
data = []
for sentence in sentences:
    token_list = tokenizer.encode(sentence)
    data.append({
        "Text": sentence,
        "Character Count": len(sentence),
        "Token Count": len(token_list),
        "Ratio (Chars/Token)": len(sentence) / len(token_list)
    })

df = pd.DataFrame(data)
print("\n" + "="*100)
print("Token count comparison")
print("="*100)
print(df.to_string(index=False))
print("="*100)

# Visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Bar chart: Character vs Token count
x = range(len(df))
axes[0].bar([i - 0.2 for i in x], df['Character Count'], width=0.4, label='Characters', color='steelblue')
axes[0].bar([i + 0.2 for i in x], df['Token Count'], width=0.4, label='Tokens', color='coral')
axes[0].set_xlabel('Sentence Index', fontsize=11, fontweight='bold')
axes[0].set_ylabel('Count', fontsize=11, fontweight='bold')
axes[0].set_title('Character Count vs Token Count', fontsize=12, fontweight='bold')
axes[0].set_xticks(x)
axes[0].legend()
axes[0].grid(axis='y', alpha=0.3)

# Line chart: Char/Token ratio
axes[1].plot(x, df['Ratio (Chars/Token)'], marker='o', linewidth=2, markersize=8, color='green')
axes[1].set_xlabel('Sentence Index', fontsize=11, fontweight='bold')
axes[1].set_ylabel('Ratio (Chars/Token)', fontsize=11, fontweight='bold')
axes[1].set_title('Character-to-Token Ratio', fontsize=12, fontweight='bold')
axes[1].set_xticks(x)
axes[1].grid(alpha=0.3)

plt.tight_layout()
plt.show()

print("\n💡 Note: Special characters tend to use more tokens per character! (Token counts here include the [CLS]/[SEP] special tokens.)")
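Since `tokenizer.encode` adds two special tokens, the characters-per-token ratio is noticeably lower for short sentences than the content tokens alone would suggest. The sketch below shows the arithmetic both ways. `chars_per_token` is a hypothetical helper, and the ID list is made up for illustration rather than produced by the tokenizer.

```python
def chars_per_token(text, token_ids, num_special_tokens=2):
    """Return the chars/token ratio with and without special tokens."""
    with_special = len(text) / len(token_ids)
    content_tokens = len(token_ids) - num_special_tokens
    without_special = len(text) / content_tokens if content_tokens > 0 else float("nan")
    return with_special, without_special


# Hypothetical IDs: one [CLS]-like ID, four content tokens, one [SEP]-like ID
example_ids = [101, 1, 2, 3, 4, 102]
with_special, without_special = chars_per_token("AI is amazing.", example_ids)
print(f"With special tokens:    {with_special:.2f}")
print(f"Without special tokens: {without_special:.2f}")
```

For long texts the two ratios converge, because the two extra tokens become negligible.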

## Step 4: Generating embeddings (20 min)

Use the open-source `sentence-transformers` library to create numeric vectors:

In [None]:
from sentence_transformers import SentenceTransformer
import numpy as np

# Load a pre-trained model (free and open-source)
# Options: 'all-MiniLM-L6-v2', 'all-mpnet-base-v2', 'distiluse-base-multilingual-cased-v2'
model = SentenceTransformer('all-MiniLM-L6-v2')

# Texts to embed
texts = [
    "Agentic AI",
    "Autonomous agents",
    "Bananas are yellow",
    "Machine learning models",
    "Fruit is delicious",
]

embeddings = []

print("\n" + "="*70)
print("GENERATING EMBEDDINGS")
print("="*70)

for text in texts:
    vector = model.encode(text)
    embeddings.append(vector)
    print(f"✓ '{text}' → Vector length: {len(vector)}")

print("="*70)
print(f"\n📊 Embedding dimension: {len(embeddings[0])}")
print(f"Number of texts embedded: {len(embeddings)}")

# Show a sample embedding (first 20 values)
print(f"\nSample embedding (first 20 values) for '{texts[0]}':")
print(embeddings[0][:20])
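A sentence embedding has a fixed length regardless of how many tokens the text contains. Sentence-embedding models commonly achieve this by averaging the per-token vectors (mean pooling) while ignoring padding. The sketch below illustrates that idea on toy 2-dimensional vectors. `mean_pool` is a hypothetical helper, and this is an assumption about the general technique, not the library's internal code.

```python
import numpy as np


def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask is 1, ignoring padding rows."""
    vectors = np.asarray(token_vectors, dtype=float)           # (seq_len, dim)
    mask = np.asarray(attention_mask, dtype=float)[:, None]    # (seq_len, 1)
    summed = (vectors * mask).sum(axis=0)
    return summed / max(mask.sum(), 1.0)  # guard against an all-zero mask


# Toy data: two real tokens and one padding row (masked out)
token_vectors = [[1.0, 0.0], [3.0, 2.0], [9.0, 9.0]]
attention_mask = [1, 1, 0]
print(mean_pool(token_vectors, attention_mask))
```

The padding row does not influence the result, so texts of different lengths map to vectors of the same dimension.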

## Step 5: Dimensionality reduction and visualization (25 min)

Reduce the embeddings to 2D so they can be plotted:

In [None]:
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Convert to numpy array
embeddings_array = np.array(embeddings)

# Apply PCA for dimensionality reduction
pca = PCA(n_components=2)
embeddings_2d = pca.fit_transform(embeddings_array)

# Get explained variance
explained_var = pca.explained_variance_ratio_

print("\n📈 PCA Explained Variance:")
print(f" PC1: {explained_var[0]:.4f} ({explained_var[0]*100:.2f}%)")
print(f" PC2: {explained_var[1]:.4f} ({explained_var[1]*100:.2f}%)")
print(f" Total: {sum(explained_var):.4f} ({sum(explained_var)*100:.2f}%)")

# Visualization
plt.figure(figsize=(10, 8))

colors = ['#FF6B6B', '#FF6B6B', '#4ECDC4', '#4ECDC4', '#4ECDC4']

for i, (txt, color) in enumerate(zip(texts, colors)):
    x, y = embeddings_2d[i]
    plt.scatter(x, y, s=300, alpha=0.7, color=color, edgecolors='black', linewidth=2)
    plt.text(x + 0.05, y + 0.05, txt, fontsize=10, fontweight='bold',
            bbox=dict(boxstyle='round,pad=0.3', facecolor=color, alpha=0.3))

plt.xlabel(f'PC1 ({explained_var[0]*100:.1f}%)', fontsize=12, fontweight='bold')
plt.ylabel(f'PC2 ({explained_var[1]*100:.1f}%)', fontsize=12, fontweight='bold')
plt.title('Embedding Visualization (PCA 2D)', fontsize=14, fontweight='bold')
plt.grid(alpha=0.3)
plt.axhline(y=0, color='k', linewidth=0.5)
plt.axvline(x=0, color='k', linewidth=0.5)
plt.tight_layout()
plt.show()

print("\n💡 Note: Semantically similar texts tend to end up close to each other!")
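To see what the PCA step computes, the projection can be sketched with NumPy alone: center the data, take a singular value decomposition, and project onto the first two right singular vectors. `pca_2d` is a hypothetical helper and the input is random toy data. The signs of the components may differ from scikit-learn's output, which applies its own sign convention.

```python
import numpy as np


def pca_2d(X):
    """Project the rows of X onto their first two principal components."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)            # PCA operates on centered data
    _, S, Vt = np.linalg.svd(centered, full_matrices=False)
    projected = centered @ Vt[:2].T          # coordinates along PC1 and PC2
    variance = S**2 / (len(X) - 1)           # variance captured by each component
    ratio = variance[:2] / variance.sum()    # explained variance ratio
    return projected, ratio


# Toy stand-in for five embeddings of dimension 8 (random, for illustration only)
rng = np.random.default_rng(0)
toy = rng.normal(size=(5, 8))
points, ratio = pca_2d(toy)
print(points.shape)
print(ratio)
```

With only a handful of points, two components can capture a large share of the variance, so the explained-variance percentages on small samples should be read with caution.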

## Step 6: Utility class for a complete analysis

Let's create a reusable class to analyze any text:

In [None]:
from scipy.spatial.distance import cosine

class EmbeddingAnalyzer:
    def __init__(self, model, tokenizer):
        self.model = model
        self.tokenizer = tokenizer
        self.reference_embedding = None

    def set_reference(self, text):
        """Set reference text for similarity comparison"""
        self.reference_embedding = self.model.encode(text)

    def compute_similarity(self, embedding):
        """Compute cosine similarity with reference embedding"""
        if self.reference_embedding is None:
            return None
        return 1 - cosine(self.reference_embedding, embedding)

    def display_analysis(self, text):
        """Display complete analysis for a text"""
        # Tokenization
        tokens = self.tokenizer.encode(text)
        tokens_decoded = self.tokenizer.tokenize(text)

        # Embedding
        embedding = self.model.encode(text)

        # Similarity
        similarity = self.compute_similarity(embedding)

        # Display
        print("\n" + "="*80)
        print(f"TEXT: '{text}'")
        print("="*80)
        print(f"Number of characters: {len(text)}")
        print(f"Number of tokens: {len(tokens)}")
        print(f"Embedding dimension: {len(embedding)}")

        if similarity is not None:
            print(f"\nSimilarity to the reference: {similarity:.4f}")

        print("\nTokens:")
        for i, token in enumerate(tokens_decoded, 1):
            print(f" {i}. {token}")

        print(f"\nEmbedding preview (first 10 values): {embedding[:10].tolist()}")
        print("="*80)

# Initialize analyzer
analyzer = EmbeddingAnalyzer(model, tokenizer)
analyzer.set_reference("AI and agents")

# Test with custom texts
custom_texts = [
    "Your sentence here",
    "Another sentence",
    "Add as many as you want",
]

for text in custom_texts:
    analyzer.display_analysis(text)
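The expression `1 - cosine(a, b)` used in `compute_similarity` is the cosine similarity: the dot product of the two vectors divided by the product of their norms. A minimal NumPy sketch of the same quantity, assuming non-zero vectors (`cosine_similarity` is a hypothetical helper name):

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors, in [-1, 1]."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


print(cosine_similarity([1, 2], [2, 4]))    # same direction
print(cosine_similarity([1, 0], [0, 1]))    # orthogonal
print(cosine_similarity([1, 0], [-1, 0]))   # opposite direction
```

The value depends only on direction, not on vector length, which is why it suits comparing embeddings.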

## Step 7: Advanced visualization: similarity network

Let's build a network graph that shows the similarity relationships:

In [None]:
import matplotlib.patches as mpatches
import networkx as nx

# Prepare all texts for analysis
all_texts = texts + custom_texts[:2]  # Use first 2 custom texts

# Generate embeddings for all texts
all_embeddings = [model.encode(text) for text in all_texts]

# Compute similarity matrix
similarity_matrix = []
for emb1 in all_embeddings:
    row = []
    for emb2 in all_embeddings:
        sim = 1 - cosine(emb1, emb2)
        row.append(sim)
    similarity_matrix.append(row)

similarity_matrix = np.array(similarity_matrix)

# Create network graph
G = nx.Graph()

# Add nodes
for i, text in enumerate(all_texts):
    G.add_node(i, label=text)

# Add edges for high similarity (>0.5)
for i in range(len(all_texts)):
    for j in range(i+1, len(all_texts)):
        if similarity_matrix[i][j] > 0.5:
            G.add_edge(i, j, weight=similarity_matrix[i][j])

# Draw the network
fig, ax = plt.subplots(figsize=(12, 10))

# Layout (fixed seed so the node positions are reproducible between runs)
pos = nx.spring_layout(G, k=0.5, iterations=50, seed=42)

# Node colors (categories)
node_colors = []
for i, text in enumerate(all_texts):
    if 'AI' in text or 'agent' in text.lower():
        node_colors.append('#FF6B6B')  # Red for AI/Agents
    else:
        node_colors.append('#4ECDC4')  # Teal for other

# Draw nodes
nx.draw_networkx_nodes(G, pos, node_color=node_colors, node_size=3000, alpha=0.8, ax=ax)

# Draw edges
edges = G.edges()
weights = [G[u][v]['weight'] for u, v in edges]
nx.draw_networkx_edges(G, pos, width=[w*3 for w in weights], alpha=0.6, ax=ax)

# Draw labels
labels = {i: all_texts[i][:20] + '...' if len(all_texts[i]) > 20 else all_texts[i]
          for i in range(len(all_texts))}
nx.draw_networkx_labels(G, pos, labels, font_size=8, font_weight='bold', ax=ax)

# Legend (facecolor instead of color, so the black edge color is not overridden)
ai_patch = mpatches.Patch(facecolor='#FF6B6B', label='AI/Agents', edgecolor='black')
other_patch = mpatches.Patch(facecolor='#4ECDC4', label='Other', edgecolor='black')
ax.legend(handles=[ai_patch, other_patch], loc='upper left', fontsize=12)

plt.title('Embedding Similarity Network (Threshold > 0.5)', fontsize=14, fontweight='bold')
plt.axis('off')
plt.tight_layout()
plt.show()

print("\n🔗 Network graph: edges connect texts whose cosine similarity exceeds 0.5!")
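The nested loops that build the similarity matrix can also be written as a single matrix product: normalize every vector to unit length, then multiply the matrix by its transpose. The sketch below does this and extracts the pairs above the threshold. `similarity_edges` is a hypothetical helper, the toy vectors stand in for real embeddings, and non-zero vectors are assumed.

```python
import numpy as np


def similarity_edges(vectors, threshold=0.5):
    """Return (i, j, similarity) for every pair i < j above the threshold."""
    X = np.asarray(vectors, dtype=float)
    unit = X / np.linalg.norm(X, axis=1, keepdims=True)  # unit-length rows
    sims = unit @ unit.T                                  # all pairwise cosines
    rows, cols = np.triu_indices(len(X), k=1)             # upper triangle, i < j
    return [
        (int(i), int(j), float(sims[i, j]))
        for i, j in zip(rows, cols)
        if sims[i, j] > threshold
    ]


# Toy vectors: the first two point in nearly the same direction
toy_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
for i, j, sim in similarity_edges(toy_vectors):
    print(f"edge {i}-{j}: similarity {sim:.3f}")
```

Each returned tuple could be passed to `G.add_edge(i, j, weight=sim)` to build the same graph.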

## Step 8: Summary and conclusions

In [None]:
print("\n" + "="*80)
print("LEARNING SUMMARY")
print("="*80)
print("""
✓ Tokenization: how text is split into smaller units
✓ Embeddings: converting text into high-dimensional numeric vectors
✓ Semantic similarity: similar texts have nearby embeddings
✓ Visualization: 2D reduction to explore the semantic space

Models and tools used (100% open source and free):
- Tokenizer: distilbert-base-uncased
- Embeddings: Sentence-Transformers (all-MiniLM-L6-v2)
- Dimensionality reduction: PCA (scikit-learn)
- Graph: NetworkX + Matplotlib
""")
print("="*80)

## Key points

| Concept | Explanation |
|---------|-------------|
| **Tokenization** | Splitting text into discrete units (tokens) |
| **Token** | An individual unit (word, subword, special character) |
| **Embedding** | Numeric vector representing the meaning of a text |
| **Dimension** | Number of values in the vector (e.g. 384 for all-MiniLM-L6-v2) |
| **Similarity** | How close two embeddings are, measured here by cosine similarity (1 minus cosine distance) |
| **PCA** | Technique that reduces dimensionality while preserving as much variance as possible |

---

## Recommended open-source models

**Tokenizers:**
- `distilbert-base-uncased` (very lightweight; the model has 66M parameters)
- `roberta-base` (more powerful; 125M parameters)
- `bert-base-multilingual-cased` (multilingual)

**Embeddings (Sentence-Transformers):**
- `all-MiniLM-L6-v2`: **Recommended** (22M parameters, very fast)
- `all-mpnet-base-v2`: more powerful (109M parameters)
- `distiluse-base-multilingual-cased-v2`: multilingual

**Advantages:**
- ✅ Free, with no usage limits
- ✅ Work offline once the models have been downloaded
- ✅ No API keys required
- ✅ Fast local execution on GPU or CPU