<a href="https://colab.research.google.com/github/rouakhadhraoui/Text-Mining-Labs-/blob/main/Text_summarization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lab objective:
Compare three extractive summarization algorithms:

TF-IDF
TextRank
LSA

→ For each one, generate a summary of the provided text,
→ then evaluate its quality with the BLEU score by comparing it to a human-written reference summary.


Step 0: Prepare the environment

In [17]:
# Install the required libraries
!pip install nltk sumy yake
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('punkt_tab')  # tokenizer resource required since NLTK 3.9




[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

Step 1: Install dependencies, imports, and the text / reference

In [19]:
# 1) Install libs (Colab)
!pip install -q nltk scikit-learn networkx numpy gensim sumy

# 2) Imports
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
import networkx as nx
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lsa import LsaSummarizer

# Ensure punkt tokenizer
nltk.download('punkt')

# 3) Input text and reference (given in the lab)
text = ("You can learn a lot about yourself through travelling. You can observe how you feel being far from your "
"country. You will find out how you feel about your homeland. You will realize how you really feel about "
"foreign people. You will find out how much you know/do not know about the world. You will be able to "
"observe how you react in completely new situations. You will test your language, orientational and "
"social skills. You will not be the same person after returning home. During travelling you will meet "
"people that are very different from you. If you travel enough, you will learn to accept and appreciate "
"these differences. Traveling makes you more open and accepting.")

reference = "Travelling teaches you about yourself. It makes you more tolerant."


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


The source text is a paragraph about the personal benefits of travel.
The reference summary, short and semantically dense, serves as the ground truth for evaluating our algorithms.

Step 2: Sentence tokenization (preprocessing)

In [18]:
sentences = sent_tokenize(text)
print("Number of sentences:", len(sentences))
for i,s in enumerate(sentences,1):
    print(f"{i}. {s}")



Number of sentences: 11
1. You can learn a lot about yourself through travelling.
2. You can observe how you feel being far from your country.
3. You will find out how you feel about your homeland.
4. You will realize how you really feel about foreign people.
5. You will find out how much you know/do not know about the world.
6. You will be able to observe how you react in completely new situations.
7. You will test your language, orientational and social skills.
8. You will not be the same person after returning home.
9. During travelling you will meet people that are very different from you.
10. If you travel enough, you will learn to accept and appreciate these differences.
11. Traveling makes you more open and accepting.


The sentence is the unit on which extractive summarization operates: every method below scores whole sentences and keeps the best ones.

Step 3: Summarization with TF-IDF

In [20]:
def tfidf_summarize(sentences, top_k=2):
    # Build TF-IDF over sentences (each sentence = document)
    vectorizer = TfidfVectorizer(stop_words='english', ngram_range=(1,2))
    tfidf = vectorizer.fit_transform(sentences)
    # Score each sentence by the sum of its TF-IDF weights
    scores = np.asarray(tfidf.sum(axis=1)).ravel()
    # Indices of the top_k highest-scoring sentences, as plain ints in document order
    top_idx = sorted(int(i) for i in np.argsort(scores)[-top_k:])
    # Preserve the original sentence order in the final summary
    summary = ' '.join(sentences[i] for i in top_idx)
    return summary, scores, top_idx

tfidf_sum, tfidf_scores, tfidf_idx = tfidf_summarize(sentences, top_k=2)
print("TF-IDF Summary:\n", tfidf_sum)
print("Selected sentence indices:", tfidf_idx)
print("Sentence scores:", np.round(tfidf_scores,3))



TF-IDF Summary:
 You will be able to observe how you react in completely new situations. You will test your language, orientational and social skills.
Selected sentence indices: [5, 6]
Sentence scores: [2.23  2.633 1.718 2.988 1.89  3.314 3.    2.236 2.64  2.997 2.646]


TF-IDF weights each word by its frequency within a sentence and its rarity across all sentences (each sentence is treated as a document).
The sentences whose words carry the most total weight are selected.
→ Advantage: simple and fast.
→ Limitation: it does not capture meaning and is sensitive to noise.
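
To make the weighting concrete, here is a minimal sketch that computes TF-IDF by hand on three toy sentences. It rests on several assumptions that differ from the cell above: plain whitespace tokenization, no stop-word removal, unigrams only, and the smoothed IDF formula ln((1 + n) / (1 + df)) + 1 followed by L2 normalization, which is how scikit-learn's default behaves to my understanding. The names `tfidf_weights` and `toy_sentences` are hypothetical and exist only for this illustration.

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """Return one {term: weight} dict per document (L2-normalized TF-IDF)."""
    n_docs = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    # Document frequency: number of documents that contain each term
    df = Counter(term for toks in tokenized for term in set(toks))
    weights = []
    for toks in tokenized:
        tf = Counter(toks)
        # Assumed smoothed IDF: ln((1 + n) / (1 + df)) + 1
        raw = {t: tf[t] * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t in tf}
        # L2-normalize so that every sentence vector has length 1
        norm = math.sqrt(sum(w * w for w in raw.values()))
        weights.append({t: w / norm for t, w in raw.items()})
    return weights

# Hypothetical toy input: "travel" occurs everywhere, the other words once
toy_sentences = [
    "travel teaches patience",
    "travel tests language skills",
    "travel changes people",
]
toy_weights = tfidf_weights(toy_sentences)

# Same scoring rule as tfidf_summarize: sum the weights of each sentence
sentence_scores = [sum(w.values()) for w in toy_weights]
for sentence, score in zip(toy_sentences, sentence_scores):
    print(f"{score:.3f}  {sentence}")
```

The word "travel" appears in every sentence, so it gets the smallest weight in each of them. In this toy input the four-word sentence receives the highest summed score, which hints at a side effect of summing weights: sentences with more distinct terms tend to be favored.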

Step 4: Summarization with TextRank

In [21]:
def textrank_summarize(sentences, top_k=2):
    # Vectorize sentences (TF-IDF)
    vectorizer = TfidfVectorizer(stop_words='english')
    tfidf = vectorizer.fit_transform(sentences)
    # Cosine similarity matrix
    sim_mat = cosine_similarity(tfidf)
    # Build graph and rank
    nx_graph = nx.from_numpy_array(sim_mat)
    # nx.pagerank_numpy was removed in NetworkX 3.0; nx.pagerank is the supported call
    scores = nx.pagerank(nx_graph, weight='weight')
    # top-k sentences by score
    ranked = sorted(((scores[i], i) for i in range(len(sentences))), reverse=True)
    top_idx = [idx for (_, idx) in ranked[:top_k]]
    summary = ' '.join([sentences[i] for i in sorted(top_idx)])
    return summary, scores, sorted(top_idx)

tr_sum, tr_scores, tr_idx = textrank_summarize(sentences, top_k=2)
print("TextRank Summary:\n", tr_sum)
print("Selected sentence indices:", tr_idx)
print("PageRank scores:", {i:round(tr_scores[i],3) for i in tr_scores})


TextRank Summary:
 You can observe how you feel being far from your country. You will realize how you really feel about foreign people.
Selected sentence indices: [1, 3]
PageRank scores: {0: 0.095, 1: 0.097, 2: 0.091, 3: 0.095, 4: 0.091, 5: 0.083, 6: 0.091, 7: 0.091, 8: 0.091, 9: 0.085, 10: 0.091}


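TextRank ranks each sentence by how strongly it is connected to the others in a similarity graph, so it favors sentences that are representative of the document as a whole. This "relational" view tends to work best on informative or descriptive texts with a lot of redundancy. On this short paragraph the PageRank scores fall in a narrow range (0.083 to 0.097), so the ranking separates the sentences only weakly.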
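
The ranking step can be sketched without networkx as a plain power iteration. This is an illustrative approximation under stated assumptions, not the library's implementation: the function name `pagerank_power_iteration`, the toy matrix `toy_sim`, and the stopping rule are hypothetical, and the damping factor 0.85 is the commonly used value.

```python
def pagerank_power_iteration(sim, damping=0.85, n_iter=200, tol=1e-10):
    """Weighted PageRank on a symmetric similarity matrix (list of lists)."""
    n = len(sim)
    # Total edge weight leaving each node, used to normalize its row
    out_weight = [sum(row) for row in sim]
    scores = [1.0 / n] * n
    for _ in range(n_iter):
        new_scores = []
        for i in range(n):
            # Rank flowing into sentence i from every sentence j linked to it.
            # Simplification: isolated nodes pass no rank on.
            incoming = sum(
                scores[j] * sim[j][i] / out_weight[j]
                for j in range(n)
                if out_weight[j] > 0
            )
            new_scores.append((1 - damping) / n + damping * incoming)
        converged = sum(abs(a - b) for a, b in zip(new_scores, scores)) < tol
        scores = new_scores
        if converged:
            break
    return scores

# Hypothetical similarities: sentence 0 resembles both other sentences,
# while sentences 1 and 2 share nothing with each other.
toy_sim = [
    [0.0, 0.6, 0.5],
    [0.6, 0.0, 0.0],
    [0.5, 0.0, 0.0],
]
toy_scores = pagerank_power_iteration(toy_sim)
print([round(s, 3) for s in toy_scores])
```

Sentence 0 comes out on top because it is the only one linked to both others. The toy matrix has a zero diagonal, whereas `cosine_similarity` returns 1.0 on the diagonal, so the graph built in the cell above also contains self-loops.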

Step 5: Summarization with LSA

In [22]:
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lsa import LsaSummarizer

def lsa_summarize(text, top_k=2):
    parser = PlaintextParser.from_string(text, Tokenizer("english"))
    summarizer = LsaSummarizer()
    sents = summarizer(parser.document, top_k)
    summary = ' '.join(str(s) for s in sents)
    return summary, list(sents)

lsa_sum, lsa_sents = lsa_summarize(text, top_k=2)
print("LSA Summary:\n", lsa_sum)
print("LSA selected sentences:")
for s in lsa_sents:
    print("-", s)


LSA Summary:
 You can learn a lot about yourself through travelling. You will not be the same person after returning home.
LSA selected sentences:
- You can learn a lot about yourself through travelling.
- You will not be the same person after returning home.


LSA applies singular value decomposition (SVD) to a term-by-sentence matrix to extract latent concepts, which helps it handle synonymy and noise.
It selects the sentences that best cover the main concepts of the text.
→ Advantage: more semantic, more robust to lexical variation.
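
The idea can be sketched with NumPy on a tiny term-by-sentence matrix. This is a simplified illustration and is not claimed to reproduce what sumy's `LsaSummarizer` does internally: the matrix `A`, the term list, the choice `k = 2`, and the scoring rule (length of each sentence's concept vector, with each concept weighted by its singular value) are all assumptions made for this example.

```python
import numpy as np

# Hypothetical term-by-sentence count matrix: rows = terms, columns = sentences
terms = ["travel", "learn", "people", "home"]
A = np.array([
    [1, 1, 0],   # travel
    [1, 1, 0],   # learn
    [0, 1, 0],   # people
    [0, 0, 1],   # home
], dtype=float)

# SVD: A = U @ diag(sigma) @ Vt. Each row of Vt describes one latent
# "concept" in terms of how strongly every sentence expresses it.
U, sigma, Vt = np.linalg.svd(A, full_matrices=False)

# Keep the k strongest concepts and score each sentence by the length of
# its concept vector, weighting each concept by its singular value.
k = 2
sentence_scores = np.sqrt(((sigma[:k, None] * Vt[:k, :]) ** 2).sum(axis=0))

best = int(np.argmax(sentence_scores))
print("singular values:", np.round(sigma, 3))
print("sentence scores:", np.round(sentence_scores, 3))
print("top sentence index:", best)
```

The middle sentence wins here because it loads most heavily on the dominant concept shared by "travel" and "learn". Dropping the weakest concept is what lets the method discount minor topics; with all concepts kept, this particular score reduces to the length of each sentence's column in `A`.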

Step 6: Compute BLEU between each summary and the reference

In [23]:
smooth = SmoothingFunction().method1

def bleu_score(candidate, reference):
    cand_toks = [w.lower() for w in word_tokenize(candidate)]
    ref_toks = [w.lower() for w in word_tokenize(reference)]
    return sentence_bleu([ref_toks], cand_toks, smoothing_function=smooth)

print("Reference:", reference, "\n")
print("TF-IDF BLEU:", bleu_score(tfidf_sum, reference))
print("TextRank BLEU:", bleu_score(tr_sum, reference))
print("LSA BLEU:", bleu_score(lsa_sum, reference))


Reference: Travelling teaches you about yourself. It makes you more tolerant. 

TF-IDF BLEU: 0.010713701843513142
TextRank BLEU: 0.012384901282810543
LSA BLEU: 0.02642138995497447


The highest BLEU score indicates the summary that is lexically closest to the human reference; here that is LSA (0.026), ahead of TextRank (0.012) and TF-IDF (0.011).
However, a low BLEU score does not necessarily mean a poor summary: BLEU measures n-gram overlap only, not semantic equivalence.
For example, the summary "Traveling helps you discover yourself and become more open-minded" is excellent, but it would get a low BLEU score because it shares few words with the reference (for instance "open-minded" instead of "tolerant").
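
The two ingredients of BLEU, clipped n-gram precision and the brevity penalty, can be computed by hand for that paraphrase. This is a minimal sketch under stated assumptions: tokens are lowercased and split by hand rather than with `word_tokenize`, and the helper names `ngrams`, `modified_precision`, and `brevity_penalty` are hypothetical, written for this illustration rather than taken from NLTK.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, reference, n):
    """Clipped n-gram precision: each candidate n-gram counts at most as
    often as it appears in the reference."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    total = sum(cand_counts.values())
    return clipped / total if total else 0.0

def brevity_penalty(candidate, reference):
    """Penalize candidates that are shorter than the reference."""
    c, r = len(candidate), len(reference)
    if c == 0:
        return 0.0
    return 1.0 if c > r else math.exp(1 - r / c)

# Hand-tokenized, lowercased reference and paraphrase (an assumption)
ref = "travelling teaches you about yourself . it makes you more tolerant .".split()
cand = "traveling helps you discover yourself and become more open-minded".split()

p1 = modified_precision(cand, ref, 1)
p2 = modified_precision(cand, ref, 2)
print("unigram precision:", round(p1, 3))
print("bigram precision:", round(p2, 3))
print("brevity penalty:", round(brevity_penalty(cand, ref), 3))
```

Only "you", "yourself", and "more" match at the unigram level, and no bigram matches at all. Without smoothing, a zero n-gram precision drives the geometric mean, and therefore BLEU, to zero, which is why the cell above passes a `SmoothingFunction` to `sentence_bleu`.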