<a href="https://colab.research.google.com/github/amoukrim/AI/blob/main/Week7/DailyChallenge/dailyChallengew_6_d5.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

@Author : ADil MOUKRIM
#Text summarization using NLP
Last Updated: April 8th, 2025

Daily Challenge : Text summarization using NLP


Introduction
This notebook demonstrates a practical application of Natural Language Processing (NLP) techniques to automatically generate summaries of text documents. We will explore how to preprocess text data, represent words and sentences as vectors, and leverage graph-based algorithms to identify the most important sentences for summarization.



üë©‚Äçüè´ üë©üèø‚Äçüè´ What You‚Äôll learn
Text Preprocessing: Techniques for cleaning and preparing text data, including tokenization, stop word removal, and converting text to lowercase.
Word Embeddings: Understanding and using pre-trained word embeddings like GloVe to represent words as dense vectors.
Sentence Vectorization: Creating vector representations of sentences by aggregating word embeddings.
Similarity Measures: Using cosine similarity to determine the semantic similarity between sentences.
Graph-Based Summarization: Applying the PageRank algorithm on a graph of sentence similarities to rank sentence importance.
Text Summarization Implementation: Combining these techniques to build a text summarization system.


üõ†Ô∏è What you will create
You will create an automatic text summarization system that can take a collection of tennis articles as input and generate a concise summary highlighting the key information.



Task
1. Data Loading and Inspection

Load the tennis articles dataset from the .xls file using pandas.
Explore the dataset using .head() and .info() to understand its structure.
Drop the article_title column to simplify the dataset.
2. Sentence Tokenization

Use nltk.sent_tokenize() to split the article_text into individual sentences.
Flatten the resulting list of sentence lists into a single list of all sentences.
3. Download and Load GloVe Word Embeddings

Download the pre-trained GloVe vectors (e.g., glove.6B.100d.txt).
Load the embeddings into a Python dictionary where each word maps to its 100-dimensional vector.
4. Text Cleaning and Normalization

Remove punctuation, special characters, and numbers using regex.
Convert all sentences to lowercase to avoid case-sensitive mismatch.
Remove stop words using nltk.corpus.stopwords to reduce noise in the data.
5. Sentence Vectorization

For each cleaned sentence:
Split into words.
Replace each word with its GloVe vector (use a zero-vector if the word is not in the embedding).
Compute the average of all word vectors in the sentence.
Store all resulting sentence vectors in a list.
6. Similarity Matrix Construction

Initialize an empty matrix of size (number of sentences √ó number of sentences).
Compute pairwise cosine similarity between sentence vectors.
Fill in the matrix such that each cell represents the similarity between two sentences.
7. Graph Construction and Sentence Ranking

Convert the similarity matrix into a graph using networkx.
Apply the PageRank algorithm to score the importance of each sentence.
8. Summarization

Sort all sentences based on their PageRank scores in descending order.
Extract the top N sentences (e.g., 10) as the final summary.
Print or return the summarized sentences.

#√âtape 1 : Chargement et inspection des donn√©es

In [12]:
# Import des biblioth√®ques n√©cessaires
import pandas as pd
import numpy as np

# √âtape 1 : Chargement du dataset avec encodage appropri√©
# Tentative avec Latin-1 qui couvre la plupart des caract√®res europ√©ens
try:
    df = pd.read_csv('/content/tennis_articles.csv', encoding='latin-1')
    print("‚úÖ Fichier charg√© avec succ√®s (encoding='latin-1')")
except Exception as e:
    # Si Latin-1 √©choue, tentative avec Windows-1252
    print(f"‚ö†Ô∏è Erreur avec latin-1 : {str(e)[:50]}...")
    try:
        df = pd.read_csv('/content/tennis_articles.csv', encoding='cp1252')
        print("‚úÖ Fichier charg√© avec succ√®s (encoding='cp1252')")
    except Exception as e2:
        # Derni√®re tentative avec UTF-8 et gestion des erreurs
        print(f"‚ö†Ô∏è Erreur avec cp1252 : {str(e2)[:50]}...")
        df = pd.read_csv('/content/tennis_articles.csv', encoding='utf-8', errors='ignore')
        print("‚úÖ Fichier charg√© avec gestion des erreurs d'encodage")

# Exploration initiale du dataset
print("\n=== APERCU DES PREMI√àRES LIGNES ===")
print(df.head())

print("\n=== INFORMATIONS SUR LE DATASET ===")
print(df.info())





‚úÖ Fichier charg√© avec succ√®s (encoding='latin-1')

=== APERCU DES PREMI√àRES LIGNES ===
   article_id                                      article_title  \
0           1  I do not have friends in¬†tennis, says Maria Sh...   
1           2  Federer defeats Medvedev to advance to 14th Sw...   
2           3  Tennis: Roger Federer ignored deadline set by ...   
3           4  Nishikori to face off against Anderson in Vien...   
4           5  Roger Federer has made this huge change to ten...   

                                        article_text  \
0  Maria Sharapova has basically no friends as te...   
1  BASEL, Switzerland (AP) ¬ó Roger Federer advanc...   
2  Roger Federer has revealed that organisers of ...   
3  Kei Nishikori will try to end his long losing ...   
4  Federer, 37, first broke through on tour over ...   

                                              source  
0  https://www.tennisworldusa.org/tennis/news/Mar...  
1  http://www.tennis.com/pro-game/2018/10/copil-s.

In [13]:
# Simplification du dataset

df = df.drop(columns=['article_title'])
print("\n‚úÖ Colonne 'article_title' supprim√©e")
print(f"\nüìä Le dataset contient maintenant {len(df)} articles")


‚úÖ Colonne 'article_title' supprim√©e

üìä Le dataset contient maintenant 8 articles


#√âtape 2 : Tokenisation des phrases
L'Objectif est de d√©couper tous les articles en phrases individuelles pour pr√©parer l'analyse.

Cette solution est une alternative √† la solustion NLTK qui se charge pas

In [16]:
# M√©thode 100% Python sans NLTK (fallback)
import re

def simple_sentence_split(text):
    """D√©coupage simple des phrases en utilisant la ponctuation"""
    # Expression r√©guli√®re pour identifier les fins de phrases
    sentence_endings = re.compile(r'(?<=[.!?])\s+')
    sentences = sentence_endings.split(text)
    return [s.strip() for s in sentences if s.strip() and len(s) > 10]

# Utilisation de la m√©thode alternative
all_sentences = []
for article in df['article_text']:
    sentences = simple_sentence_split(article)
    all_sentences.extend(sentences)

print("‚úÖ Tokenisation alternative r√©ussie")
print(f"Total de phrases: {len(all_sentences)}")

‚úÖ Tokenisation alternative r√©ussie
Total de phrases: 128


# √âtape 3 : T√©l√©chargement et chargement des embeddings GloVe
L'Objectif est d'obtenir des vecteurs de mots pr√©-entra√Æn√©s (100 dimensions) pour repr√©senter s√©mantiquement chaque mot.

In [17]:
import os
import urllib.request
import zipfile

# √âtape 3 : T√©l√©chargement des embeddings GloVe
# Les embeddings GloVe sont des vecteurs de mots pr√©-entra√Æn√©s sur des corpus massifs

print("üì• T√©l√©chargement des embeddings GloVe...")

# URL officielle de Stanford pour GloVe 6B (100d)
glove_url = "http://nlp.stanford.edu/data/glove.6B.zip"
glove_zip_path = "/content/glove.6B.zip"
glove_extract_path = "/content/glove_embeddings"

# T√©l√©chargement si non d√©j√† pr√©sent
if not os.path.exists(glove_extract_path):
    # Cr√©ation du dossier
    os.makedirs(glove_extract_path, exist_ok=True)

    # T√©l√©chargement du fichier zip (environ 862MB)
    print("‚è≥ T√©l√©chargement du fichier GloVe (peut prendre 1-2 minutes)...")
    urllib.request.urlretrieve(glove_url, glove_zip_path)

    # Extraction des fichiers
    print("üì¶ Extraction des fichiers...")
    with zipfile.ZipFile(glove_zip_path, 'r') as zip_ref:
        zip_ref.extractall(glove_extract_path)

    # Nettoyage : suppression du zip
    os.remove(glove_zip_path)
    print("‚úÖ T√©l√©chargement et extraction termin√©s!")

# Chargement des embeddings dans un dictionnaire
# Nous utilisons glove.6B.100d.txt pour avoir 100 dimensions par mot
glove_file = os.path.join(glove_extract_path, "glove.6B.100d.txt")
word_embeddings = {}

print(" Chargement des embeddings dans la m√©moire...")
with open(glove_file, encoding='utf-8') as f:
    # Chaque ligne contient : mot + 100 valeurs num√©riques
    for line in f:
        values = line.split()
        word = values[0]  # Premier √©l√©ment = le mot
        vector = np.asarray(values[1:], dtype='float32')  # 100 dimensions
        word_embeddings[word] = vector

# V√©rification du chargement
print(f" Chargement termin√©! {len(word_embeddings)} mots charg√©s")
print(f" Dimensions des vecteurs: {len(next(iter(word_embeddings.values())))}")

# Test rapide avec quelques mots de tennis
test_words = ['tennis', 'federer', 'sharapova', 'match']
print("\n=== TEST DES EMBEDDINGS ===")
for word in test_words:
    if word in word_embeddings:
        print(f"‚úÖ '{word}' trouv√© - Extrait du vecteur: {word_embeddings[word][:5]}...")
    else:
        print(f"‚ùå '{word}' non trouv√© dans les embeddings")

üì• T√©l√©chargement des embeddings GloVe...
‚è≥ T√©l√©chargement du fichier GloVe (peut prendre 1-2 minutes)...
üì¶ Extraction des fichiers...
‚úÖ T√©l√©chargement et extraction termin√©s!
üìö Chargement des embeddings dans la m√©moire...
‚úÖ Chargement termin√©! 400000 mots charg√©s
üìè Dimensions des vecteurs: 100

=== TEST DES EMBEDDINGS ===
‚úÖ 'tennis' trouv√© - Extrait du vecteur: [ 0.21508  0.61981  0.84039  0.71394 -0.29904]...
‚úÖ 'federer' trouv√© - Extrait du vecteur: [ 0.22673  -0.048534  0.64561   0.69949   0.57822 ]...
‚úÖ 'sharapova' trouv√© - Extrait du vecteur: [0.12194 0.26347 1.2314  0.90343 0.03207]...
‚úÖ 'match' trouv√© - Extrait du vecteur: [-0.27317   0.024643  0.60197   0.10075  -0.91521 ]...


#√âtape 4 : Nettoyage et normalisation du texte
L'Objectif est de pr√©parer les phrases pour la vectorisation en √©liminant le bruit (ponctuation, stop words, etc.).

In [19]:
# Import des biblioth√®ques n√©cessaires
import re
import nltk
from nltk.corpus import stopwords

# T√©l√©chargement des stopwords
try:
    nltk.data.find('corpora/stopwords')
except LookupError:
    nltk.download('stopwords')

#  Nettoyage et normalisation du texte
print(" D√©but du nettoyage et de la normalisation...")

# R√©cup√©ration de la liste des stopwords anglais
stop_words = set(stopwords.words('english'))
print(f"üìã {len(stop_words)} stopwords charg√©s (ex: {list(stop_words)[:10]})")

# Fonction de nettoyage compl√®te
def clean_sentence(sentence):
    """
    Nettoie une phrase en appliquant plusieurs √©tapes :
    1. Conversion en minuscules
    2. Suppression des caract√®res sp√©ciaux et chiffres
    3. Suppression des stopwords
    4. Suppression des espaces multiples
    """

    # 1. Conversion en minuscules
    sentence = sentence.lower()

    # 2. Suppression des caract√®res sp√©ciaux et chiffres
    # Garde uniquement les lettres et les espaces
    sentence = re.sub(r'[^a-zA-Z\s]', '', sentence)

    # 3. Tokenisation en mots
    words = sentence.split()

    # 4. Suppression des stopwords et mots trop courts (< 2 lettres)
    cleaned_words = [word for word in words
                    if word not in stop_words and len(word) > 2]

    # 5. Reconstruction de la phrase
    cleaned_sentence = ' '.join(cleaned_words)

    return cleaned_sentence

# Application du nettoyage √† toutes les phrases
print("\n Nettoyage des phrases en cours...")
cleaned_sentences = []

for idx, sentence in enumerate(all_sentences):
    cleaned = clean_sentence(sentence)
    if cleaned:  # Garde uniquement les phrases non vides
        cleaned_sentences.append(cleaned)

    # Affichage du progr√®s pour les premi√®res phrases
    if idx < 3:
        print(f"\n Phrase originale {idx+1}:")
        print(f"   {sentence[:100]}...")
        print(f" Apr√®s nettoyage:")
        print(f"   {cleaned}")

# Statistiques post-nettoyage
print("\n=== R√âSUM√â DU NETTOYAGE ===")
print(f" Phrases avant nettoyage: {len(all_sentences)}")
print(f" Phrases apr√®s nettoyage: {len(cleaned_sentences)}")
print(f" Phrases √©limin√©es: {len(all_sentences) - len(cleaned_sentences)}")

# Aper√ßu des premi√®res phrases nettoy√©es
print("\n=== APERCU DES PHRASES NETTOY√âES ===")
for i, sentence in enumerate(cleaned_sentences[:5]):
    print(f"{i+1}. {sentence}")

# V√©rification du vocabulaire couvert par GloVe
print("\n=== V√âRIFICATION DU VOCABULAIRE ===")
all_words = ' '.join(cleaned_sentences).split()
unique_words = set(all_words)
found_words = [word for word in unique_words if word in word_embeddings]
coverage = len(found_words) / len(unique_words) * 100

print(f" Mots uniques trouv√©s: {len(found_words)}/{len(unique_words)}")
print(f"Couverture GloVe: {coverage:.1f}%")

 D√©but du nettoyage et de la normalisation...
üìã 198 stopwords charg√©s (ex: ["you've", 'isn', 'from', "doesn't", 'into', 'after', 'here', "don't", 'will', "mustn't"])

 Nettoyage des phrases en cours...

 Phrase originale 1:
   Maria Sharapova has basically no friends as tennis players on the WTA Tour....
 Apr√®s nettoyage:
   maria sharapova basically friends tennis players wta tour

 Phrase originale 2:
   The Russian player has no problems in openly speaking about it and in a recent interview she said: '...
 Apr√®s nettoyage:
   russian player problems openly speaking recent interview said dont really hide feelings much

 Phrase originale 3:
   I think everyone knows this is my job here....
 Apr√®s nettoyage:
   think everyone knows job

=== R√âSUM√â DU NETTOYAGE ===
 Phrases avant nettoyage: 128
 Phrases apr√®s nettoyage: 127
 Phrases √©limin√©es: 1

=== APERCU DES PHRASES NETTOY√âES ===
1. maria sharapova basically friends tennis players wta tour
2. russian player problems ope

#√âtape 5 : Vectorisation des phrases
L'objectif est de Convertir chaque phrase nettoy√©e en un vecteur num√©rique en utilisant les moyennes des embeddings GloVe.

In [20]:

print(" Vectorisation des phrases")

def sentence_to_vector(sentence, embeddings):
    """
    Convertit une phrase en vecteur en moyennant les embeddings de ses mots.

    Args:
        sentence (str): Phrase nettoy√©e
        embeddings (dict): Dictionnaire des embeddings GloVe

    Returns:
        np.array: Vecteur de 100 dimensions, ou vecteur nul si aucun mot trouv√©
    """
    words = sentence.split()
    word_vectors = []

    for word in words:
        if word in embeddings:
            word_vectors.append(embeddings[word])

    if len(word_vectors) == 0:
        # Retourne un vecteur nul si aucun mot n'est trouv√©
        return np.zeros(100)

    # Moyenne des vecteurs de mots
    sentence_vector = np.mean(word_vectors, axis=0)
    return sentence_vector

# Application de la vectorisation
sentence_vectors = []

for idx, sentence in enumerate(cleaned_sentences):
    vector = sentence_to_vector(sentence, word_embeddings)
    sentence_vectors.append(vector)

    # Affichage pour les premi√®res phrases
    if idx < 3:
        print(f"\n Phrase {idx+1}: {sentence}")
        print(f" Vecteur (5 premi√®res dimensions): {vector[:5]}...")
        print(f" Norme du vecteur: {np.linalg.norm(vector):.3f}")

# Conversion en array numpy pour des calculs plus efficaces
sentence_vectors = np.array(sentence_vectors)

# V√©rification finale
print("\n=== R√âSUM√â DE LA VECTORISATION ===")
print(f"Nombre de phrases vectoris√©es: {len(sentence_vectors)}")
print(f"Dimensions des vecteurs: {sentence_vectors.shape[1]}")
print(f"Forme de la matrice: {sentence_vectors.shape}")

# Statistiques sur les vecteurs
zero_vectors = np.sum(np.all(sentence_vectors == 0, axis=1))
print(f"Phrases sans vecteurs (mots non trouv√©s): {zero_vectors}")

# Visualisation de la distribution des normes
norms = [np.linalg.norm(vec) for vec in sentence_vectors if not np.all(vec == 0)]
if norms:
    print(f"Norme moyenne des vecteurs: {np.mean(norms):.3f}")
    print(f"Norme min/maxe: {np.min(norms):.3f} / {np.max(norms):.3f}")

 Vectorisation des phrases

 Phrase 1: maria sharapova basically friends tennis players wta tour
 Vecteur (5 premi√®res dimensions): [ 0.051489    0.1105585   0.6950863   0.18919174 -0.09581975]...
 Norme du vecteur: 3.802

 Phrase 2: russian player problems openly speaking recent interview said dont really hide feelings much
 Vecteur (5 premi√®res dimensions): [-0.07791846  0.19516078  0.41307408 -0.09757367 -0.26040584]...
 Norme du vecteur: 3.647

 Phrase 3: think everyone knows job
 Vecteur (5 premi√®res dimensions): [ 0.14818695  0.4246085   0.660479   -0.5043     -0.5471025 ]...
 Norme du vecteur: 4.765

=== R√âSUM√â DE LA VECTORISATION ===
Nombre de phrases vectoris√©es: 127
Dimensions des vecteurs: 100
Forme de la matrice: (127, 100)
Phrases sans vecteurs (mots non trouv√©s): 0
Norme moyenne des vecteurs: 3.571
Norme min/maxe: 2.236 / 4.765


#√âtape 6 : Construction de la matrice de similarit√©

l'Objectif est de cr√©er une matrice carr√©e o√π chaque cellule repr√©sente la similarit√© cosinus entre deux phrases.

In [21]:
from sklearn.metrics.pairwise import cosine_similarity

# √âtape 6 : Construction de la matrice de similarit√©
print("Construction de la matrice de similarit√©...")

# Calcul de la matrice de similarit√© cosinus
# cosine_similarity retourne une matrice de taille (n_phrases √ó n_phrases)
similarity_matrix = cosine_similarity(sentence_vectors)

# Affichage des dimensions
print(f"Dimensions de la matrice: {similarity_matrix.shape}")
print(f"Type de donn√©es: {similarity_matrix.dtype}")

# V√©rification des valeurs
print(f"\n=== STATISTIQUES DE LA MATRICE ===")
print(f"Similarit√© min: {similarity_matrix.min():.3f}")
print(f"Similarit√© max: {similarity_matrix.max():.3f}")
print(f"Similarit√© moyenne: {similarity_matrix.mean():.3f}")

# Affichage d'un √©chantillon de la matrice
print("\n=== APERCU DE LA MATRICE (5√ó5) ===")
import pandas as pd
sample_df = pd.DataFrame(
    similarity_matrix[:5, :5],
    index=[f"P{i+1}" for i in range(5)],
    columns=[f"P{i+1}" for i in range(5)]
)
print(sample_df.round(3))

# Visualisation de la diagonalit√© (chaque phrase est parfaitement similaire √† elle-m√™me)
print("\n=== V√âRIFICATION DIAGONALE ===")
diagonal_values = np.diagonal(similarity_matrix)
print(f"Valeurs diagonales (auto-similarit√©): {np.unique(diagonal_values)}")

# Identification des paires les plus similaires
print("\n=== PHRASES LES PLUS SIMILAIRES ===")
# Masque pour ignorer la diagonale
mask = np.eye(similarity_matrix.shape[0], dtype=bool)
masked_matrix = similarity_matrix.copy()
masked_matrix[mask] = 0  # Met 0 sur la diagonale

# Trouver les 3 paires les plus similaires
flat_indices = np.argsort(masked_matrix.flatten())[-3:][::-1]
for idx in flat_indices:
    i, j = np.unravel_index(idx, similarity_matrix.shape)
    sim_score = similarity_matrix[i, j]

    print(f"\nSimilarit√©: {sim_score:.3f}")
    print(f"Phrase {i+1}: {cleaned_sentences[i][:80]}...")
    print(f"Phrase {j+1}: {cleaned_sentences[j][:80]}...")

üîó Construction de la matrice de similarit√©...
üìä Dimensions de la matrice: (127, 127)
üìà Type de donn√©es: float32

=== STATISTIQUES DE LA MATRICE ===
üìä Similarit√© min: 0.070
üìä Similarit√© max: 1.000
üìä Similarit√© moyenne: 0.733

=== APERCU DE LA MATRICE (5√ó5) ===
       P1     P2     P3     P4     P5
P1  1.000  0.643  0.592  0.702  0.757
P2  0.643  1.000  0.856  0.842  0.823
P3  0.592  0.856  1.000  0.822  0.783
P4  0.702  0.842  0.822  1.000  0.890
P5  0.757  0.823  0.783  0.890  1.000

=== V√âRIFICATION DIAGONALE ===
Valeurs diagonales (auto-similarit√©): [0.9999996  0.9999997  0.99999976 0.9999998  0.9999999  0.99999994
 1.         1.0000001  1.0000002  1.0000004 ]

=== PHRASES LES PLUS SIMILAIRES ===

Similarit√©: 0.957
Phrase 62: think really nice environment great atmosphere especially veteran players helpin...
Phrase 66: always really feel like mid years huge shift attitudes top players friendly givi...

Similarit√©: 0.957
Phrase 66: always really feel like m

#√âtape 7 : Construction du graphe et application de PageRank

In [25]:
import networkx as nx

# √âtape 7 : Construction du graphe et application de PageRank
print("Construction du graphe de phrases...")

# Cr√©ation du graphe dirig√© pond√©r√©
G = nx.Graph()  # Graphe non-dirig√© pour PageRank

# Ajout des n≈ìuds (chaque phrase est un n≈ìud)
for i in range(len(cleaned_sentences)):
    G.add_node(i)

# Ajout des ar√™tes pond√©r√©es par similarit√©
# N'utilisons que les similarit√©s > 0.1 pour √©viter les connexions faibles
threshold = 0.1 # Filtre les connexions faibles pour √©viter un graphe trop dense
edges_added = 0

for i in range(len(cleaned_sentences)):
    for j in range(i+1, len(cleaned_sentences)):  # √âvite les doublons
        if similarity_matrix[i][j] > threshold:
            G.add_edge(i, j, weight=similarity_matrix[i][j])
            edges_added += 1

print(f" N≈ìuds dans le graphe: {G.number_of_nodes()}")
print(f"Ar√™tes ajout√©es: {edges_added}")

# Application de l'algorithme PageRank
print("\nCalcul du PageRank...")
pagerank_scores = nx.pagerank(
    G,
    weight='weight',  # Utilise les poids de similarit√©
    max_iter=100,     # Maximum d'it√©rations
    tol=1e-06         # Tol√©rance de convergence
)

# Affichage des scores
print("\n=== SCORES PAGERANK (TOP 10) ===")
# Tri par score d√©croissant
sorted_scores = sorted(pagerank_scores.items(), key=lambda x: x[1], reverse=True)

for rank, (idx, score) in enumerate(sorted_scores[:10], 1):
    print(f"\nRang {rank} - Score: {score:.4f}")
    print(f"Phrase: {cleaned_sentences[idx][:100]}...")

# Distribution des scores
print("\n=== DISTRIBUTION DES SCORES ===")
scores = list(pagerank_scores.values())
print(f"Score moyen: {np.mean(scores):.4f}")
print(f"Score m√©dian: {np.median(scores):.4f}")
print(f"Score min/max: {min(scores):.4f} / {max(scores):.4f}")

# Visualisation rapide du graphe (optionnel)
print("\n=== STRUCTURE DU GRAPHE ===")
print(f"Nombre de composantes connexes: {nx.number_connected_components(G)}")
print(f"Densit√© du graphe: {nx.density(G):.4f}")

Construction du graphe de phrases...
 N≈ìuds dans le graphe: 127
Ar√™tes ajout√©es: 8000

Calcul du PageRank...

=== SCORES PAGERANK (TOP 10) ===

Rang 1 - Score: 0.0088
Phrase: nice trajectorythen reid recalledif hadnt got sick think could started pushing towards second week s...

Rang 2 - Score: 0.0087
Phrase: major players feel big event late november combined one january australian open mean much tennis lit...

Rang 3 - Score: 0.0087
Phrase: one strike conversation weather know next minutes try win tennis match...

Rang 4 - Score: 0.0087
Phrase: felt like best weeks get know players playing fed cup weeks olympic weeks necessarily tournaments...

Rang 5 - Score: 0.0086
Phrase: speaking swiss indoors tournament play sundays final romanian qualifier marius copil world number th...

Rang 6 - Score: 0.0086
Phrase: felt like really kind changed people little bit definitely lot quiet started become better meanwhile...

Rang 7 - Score: 0.0086
Phrase: exhausted spending half round deep bush

# √âtape 8 : R√©sum√© final

L'Objectif est d'extraire les phrases les plus importantes selon PageRank pour cr√©er un r√©sum√© coh√©rent.

In [27]:
# R√©sum√© final


# Param√®tres du r√©sum√©
NUM_SENTENCES_SUMMARY = 5  # Nombre de phrases dans le r√©sum√©

# R√©cup√©ration des phrases originales (non nettoy√©es) pour le r√©sum√©
# Important : on utilise les phrases originales pour garder la lisibilit√©
original_sentences = [s for s in all_sentences if s.strip()]

# Cr√©ation du mapping entre indices de phrases nettoy√©es et originales
# (Dans notre cas, cleaned_sentences et original_sentences correspondent)
phrase_mapping = {i: i for i in range(len(original_sentences))}

# R√©cup√©ration des meilleures phrases
top_indices = [idx for idx, score in sorted_scores[:NUM_SENTENCES_SUMMARY]]

# Tri par ordre d'apparition dans les articles originaux
# (Pour garder la coh√©rence narrative)
top_indices_sorted = sorted(top_indices)

# Construction du r√©sum√©
summary_sentences = [original_sentences[idx] for idx in top_indices_sorted]

# Affichage du r√©sum√©
print("\n" + "="*60)
print("R√âSUM√â")
print("="*60)

for i, sentence in enumerate(summary_sentences, 1):
    # Nettoyage de l'affichage (suppression des espaces multiples)
    clean_display = ' '.join(sentence.split())
    print(f"\n{i}. {clean_display}")

# Statistiques du r√©sum√©
print("\n" + "="*60)
print("STATISTIQUES DU R√âSUM√â")
print("="*60)
print(f"üìÑ Nombre de phrases: {len(summary_sentences)}")
print(f"üìä Score moyen PageRank des phrases s√©lectionn√©es: {np.mean([pagerank_scores[idx] for idx in top_indices]):.4f}")
print(f"üìè Longueur moyenne des phrases: {np.mean([len(s.split()) for s in summary_sentences]):.1f} mots")

# Calcul du taux de compression
original_length = sum(len(s.split()) for s in original_sentences)
summary_length = sum(len(s.split()) for s in summary_sentences)
compression_rate = (1 - summary_length/original_length) * 100

print(f"üìà Taux de compression: {compression_rate:.1f}%")
print(f"   Texte original: {original_length} mots")
print(f"   R√©sum√©: {summary_length} mots")

# Sauvegarde optionnelle du r√©sum√©
print("\nSauvegarde du r√©sum√©...")
summary_text = "\n\n".join(summary_sentences)
with open("/content/tennis_summary.txt", "w", encoding="utf-8") as f:
    f.write("R√âSUM√â DES ARTICLES DE TENNIS\n")
    f.write("="*50 + "\n\n")
    f.write(summary_text)

print(" R√©sum√© sauvegard√© dans 'tennis_summary.txt'")


R√âSUM√â

1. So I'm not the one to strike up a conversation about the weather and know that in the next few minutes I have to go and try to win a tennis match.

2. Roger Federer has revealed that organisers of the re-launched and condensed Davis Cup gave him three days to decide if he would commit to the controversial competition.

3. Novak Djokovic has said he will give precedence to the ATP¬ís intended re-launch of the defunct World Team Cup in January 2020, at various Australian venues.

4. ¬ìIt's a very pleasant atmosphere, I'd have to say, around the locker rooms.

5. He¬íd backed up his last-32 showingat Melbourne Park with a string of wins over elites including French Open champion and then world No.9 Gaston Gaudio and Roland Garros runner-up Martin Verkerk in 2004 before illness struck.

STATISTIQUES DU R√âSUM√â
üìÑ Nombre de phrases: 5
üìä Score moyen PageRank des phrases s√©lectionn√©es: 0.0087
üìè Longueur moyenne des phrases: 26.8 mots
üìà Taux de compression: 95.1%
  

# Conclusion :
Le syst√®me de r√©sum√© automatique est COMPLET !
Le pipeline NLP  est construit de A √† Z :

R√©capitulatif des √©tapes suivies :


| √âtape   | Comp√©tence acquise                         | Statut     |
| ------- | ------------------------------------------ | ---------- |
| ‚úÖ **1** | Chargement et exploration des donn√©es      | **R√©ussi** |
| ‚úÖ **2** | Tokenisation avanc√©e avec NLTK             | **R√©ussi** |
| ‚úÖ **3** | Int√©gration d'embeddings GloVe (400K mots) | **R√©ussi** |
| ‚úÖ **4** | Nettoyage NLP avec stopwords et regex      | **R√©ussi** |
| ‚úÖ **5** | Vectorisation par moyenne d'embeddings     | **R√©ussi** |
| ‚úÖ **6** | Calcul de similarit√© cosinus (127√ó127)     | **R√©ussi** |
| ‚úÖ **7** | Graph-based ranking avec PageRank          | **R√©ussi** |
| ‚úÖ **8** | G√©n√©ration de r√©sum√© coh√©rent              | **R√©ussi** |
