**Step 1: Loading the data**

In [43]:
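import pandas as pd

# SMS spam dataset, tab-separated: one label and one message per line
url = "https://raw.githubusercontent.com/justmarkham/pycon-2016-tutorial/master/data/sms.tsv"
# Note: the file has no header row, so with the default settings the first
# message is consumed as column names (hence the 5571 rows reported below).
# Passing header=None, names=new_columns would keep that line as data.
data = pd.read_csv(url, sep='\t')

# Give the two columns explicit names
new_columns = ['label', 'message']
data.columns = new_columns

data.head(15)  # not displayed: only the last expression of a cell is rendered
print(data.columns)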


New columns: Index(['label', 'message'], dtype='object')
Index(['label', 'message'], dtype='object')
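
The row count reported further down (5571) suggests that the first line of `sms.tsv` was read as a header rather than as data. A minimal sketch of the difference, assuming a small in-memory TSV in place of the real file (the three sample rows are made up for illustration):

```python
import io

import pandas as pd

# Hypothetical three-line TSV without a header row, mimicking the layout of sms.tsv
raw = "ham\tsee you later\nspam\twin a free prize\nham\tok thanks\n"

# Default behaviour: the first line becomes the column names and is lost as data
inferred = pd.read_csv(io.StringIO(raw), sep="\t")
inferred.columns = ["label", "message"]

# header=None keeps every line as a data row; names= sets the column names directly
explicit = pd.read_csv(io.StringIO(raw), sep="\t", header=None, names=["label", "message"])

print(len(inferred), len(explicit))  # 2 rows versus 3 rows
```

Applied to the real file, the second form would presumably yield one more row than the shapes shown in this notebook, so the recorded outputs would shift accordingly.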


**Step 2: Text preprocessing**

In [44]:
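import spacy
from nltk.corpus import stopwords
import nltk

# The NLTK stop-word list is downloaded here but not used below;
# stop words are filtered with spaCy's token.is_stop instead.
nltk.download('stopwords')

# Small English spaCy pipeline
nlp = spacy.load('en_core_web_sm')

def preprocess_text(text):
    """Lemmatize the text, drop stop words, and lowercase the result."""
    doc = nlp(text)
    lemmatized = ' '.join([token.lemma_ for token in doc if not token.is_stop])
    return lemmatized.lower()

# Apply the preprocessing to every message
data['clean_text'] = data['message'].apply(preprocess_text)

data[['message', 'clean_text']].head()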


[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/clementgranjou/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


| | message | clean_text |
|---|---|---|
| 0 | Ok lar... Joking wif u oni... | ok lar ... joke wif u oni ... |
| 1 | Free entry in 2 a wkly comp to win FA Cup fina... | free entry 2 wkly comp win fa cup final tkts 2... |
| 2 | U dun say so early hor... U c already then say... | u dun early hor ... u c ... |
| 3 | Nah I don't think he goes to usf, he lives aro... | nah think go usf , live |
| 4 | FreeMsg Hey there darling it's been 3 week's n... | freemsg hey darle 3 week word ! like fun ? tb ... |


**Step 3: Vectorization with TF-IDF**

In [45]:
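from sklearn.feature_extraction.text import TfidfVectorizer

# Turn the cleaned messages into a sparse TF-IDF matrix (one row per message).
# Note: the vectorizer is fitted on the full dataset, before the train/test split.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(data['clean_text'])
# Encode the labels as integers: ham -> 0, spam -> 1
y = data['label'].map({'ham': 0, 'spam': 1})

X.shape, y.shape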


((5571, 7613), (5571,))
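
To make the shape above easier to read, here is a minimal sketch of what `TfidfVectorizer` produces on a tiny corpus. The three documents are made up for illustration; only the default settings of the vectorizer are assumed.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up corpus, for illustration only
docs = ["free entry win", "ok see you later", "win free prize now"]

toy_vectorizer = TfidfVectorizer()
X_toy = toy_vectorizer.fit_transform(docs)

# One row per document, one column per distinct token
n_docs, n_terms = X_toy.shape
print(n_docs, n_terms)

# With the default norm='l2', every row has unit Euclidean length
row_norms = np.asarray(np.sqrt(X_toy.multiply(X_toy).sum(axis=1))).ravel()
print(row_norms)

# A token seen in fewer documents receives a larger IDF weight
idf_free = toy_vectorizer.idf_[toy_vectorizer.vocabulary_["free"]]    # in 2 documents
idf_prize = toy_vectorizer.idf_[toy_vectorizer.vocabulary_["prize"]]  # in 1 document
print(idf_free, idf_prize)
```

Read the same way, the `(5571, 7613)` shape above means 5571 messages described by 7613 distinct tokens.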

**Step 4: Splitting the data (train-test split)**

In [46]:
from sklearn.model_selection import train_test_split

# Hold out 20% of the messages for testing; random_state makes the shuffle reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

(X_train.shape, X_test.shape), (y_train.shape, y_test.shape)


(((4456, 7613), (1115, 7613)), ((4456,), (1115,)))
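
The classes are imbalanced (the evaluation further down counts 160 spam among the 1115 test messages). One option, not used in this notebook, is the `stratify` argument of `train_test_split`, which keeps the class proportions identical in both subsets. A minimal sketch on synthetic labels, which are an assumption made for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels for illustration: 80 "ham" (0) and 20 "spam" (1)
y_toy = np.array([0] * 80 + [1] * 20)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)  # placeholder features

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Both subsets keep the original 20% share of spam
print(y_tr.mean(), y_te.mean())
```

Without `stratify`, the spam share of the test set depends on the random shuffle, which matters most when the minority class is small.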

**Step 5: Training the Naïve Bayes model**


In [47]:
from sklearn.naive_bayes import MultinomialNB

# Multinomial Naive Bayes with default smoothing (alpha=1.0)
nb_model = MultinomialNB()
nb_model.fit(X_train, y_train)


**Step 6: Evaluating the Naïve Bayes model**

In [48]:
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

# Predict on the held-out test set
y_pred_nb = nb_model.predict(X_test)

print("Naïve Bayes Performance:")
# Rows are true classes, columns are predicted classes (0 = ham, 1 = spam)
print(confusion_matrix(y_test, y_pred_nb))
print(classification_report(y_test, y_pred_nb))
print("Accuracy:", accuracy_score(y_test, y_pred_nb))


Naïve Bayes Performance:
[[955   0]
 [ 35 125]]
              precision    recall  f1-score   support

           0       0.96      1.00      0.98       955
           1       1.00      0.78      0.88       160

    accuracy                           0.97      1115
   macro avg       0.98      0.89      0.93      1115
weighted avg       0.97      0.97      0.97      1115

Accuracy: 0.968609865470852
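
The figures in the report can be recomputed by hand from the confusion matrix, which helps when reading it. A short sketch using only the four counts printed above; the variable names are chosen here for illustration:

```python
# Confusion matrix reported above: rows are true classes, columns are predictions
tn, fp = 955, 0    # true ham:  predicted ham, predicted spam
fn, tp = 35, 125   # true spam: predicted ham, predicted spam

precision = tp / (tp + fp)   # share of predicted spam that really is spam
recall = tp / (tp + fn)      # share of actual spam that was caught
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.4f}")
```

The model never flags a legitimate message as spam (precision 1.00) but lets 35 of the 160 spam messages through (recall 0.78), a trade-off that the overall accuracy alone does not show.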


**Step 7: Trying another model (SVM)**

In [49]:
from sklearn.svm import SVC

# Support vector classifier with a linear kernel
svm_model = SVC(kernel='linear')
svm_model.fit(X_train, y_train)


**Step 8: Evaluating the SVM model**

In [50]:
# Predict on the same held-out test set as for Naïve Bayes
y_pred_svm = svm_model.predict(X_test)

print("\nSVM Performance:")
print(confusion_matrix(y_test, y_pred_svm))
print(classification_report(y_test, y_pred_svm))
print("Accuracy:", accuracy_score(y_test, y_pred_svm))



SVM Performance:
[[952   3]
 [ 18 142]]
              precision    recall  f1-score   support

           0       0.98      1.00      0.99       955
           1       0.98      0.89      0.93       160

    accuracy                           0.98      1115
   macro avg       0.98      0.94      0.96      1115
weighted avg       0.98      0.98      0.98      1115

Accuracy: 0.9811659192825112


**Step 9: Cross-validation on each model**

In [51]:
from sklearn.model_selection import cross_val_score
import numpy as np

# 5-fold cross-validation on the full matrix; the default score for classifiers is accuracy
nb_cv_scores = cross_val_score(nb_model, X, y, cv=5)
print("\nNaïve Bayes Cross-validation scores:", nb_cv_scores)
print("Mean accuracy:", np.mean(nb_cv_scores))

svm_cv_scores = cross_val_score(svm_model, X, y, cv=5)
print("\nSVM Cross-validation scores:", svm_cv_scores)
print("Mean accuracy:", np.mean(svm_cv_scores))



Naïve Bayes Cross-validation scores: [0.97578475 0.96499102 0.96229803 0.96768402 0.96768402]
Mean accuracy: 0.9676883689850335

SVM Cross-validation scores: [0.98206278 0.97935368 0.97576302 0.97486535 0.97486535]
Mean accuracy: 0.9773820354074921
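
In the cells above, the TF-IDF vectorizer is fitted on all messages before cross-validation, so the vocabulary and IDF weights of each fold have already seen that fold's test messages. One common way to avoid this is to wrap the vectorizer and the classifier in a scikit-learn pipeline. The sketch below is an alternative to the notebook's approach, not what it does; the eight messages are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up messages and labels (0 = ham, 1 = spam), for illustration only
texts = [
    "see you at lunch", "call me when you are home",
    "are we still meeting today", "thanks for the lift",
    "win a free prize now", "free entry to win cash",
    "claim your free prize today", "urgent win cash now",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

# The vectorizer is refitted on the training part of every fold,
# so the test part never influences the vocabulary or the IDF weights
pipeline = make_pipeline(TfidfVectorizer(), MultinomialNB())
scores = cross_val_score(pipeline, texts, labels, cv=2)

print(scores, scores.mean())
```

With the notebook's data, the same pipeline would take `data['clean_text']` and `y` in place of the toy lists.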


**Step 10: Interpreting the results**

In [52]:
print("Comparison of the two models' performance:")
print(f"Naïve Bayes Mean Accuracy: {np.mean(nb_cv_scores)}")
print(f"SVM Mean Accuracy: {np.mean(svm_cv_scores)}")


Comparison of the two models' performance:
Naïve Bayes Mean Accuracy: 0.9676883689850335
SVM Mean Accuracy: 0.9773820354074921
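
Read together, the outputs show the linear SVM ahead by about one percentage point in mean cross-validated accuracy (0.977 against 0.968). On the held-out test set it misses 18 of the 160 spam messages where Naïve Bayes misses 35, at the cost of 3 false positives against none. A short sketch that puts the fold scores side by side, using only the values printed in Step 9 (the variable names are chosen here for illustration):

```python
from statistics import mean, stdev

# Fold accuracies copied from the cross-validation output of Step 9
nb_scores = [0.97578475, 0.96499102, 0.96229803, 0.96768402, 0.96768402]
svm_scores = [0.98206278, 0.97935368, 0.97576302, 0.97486535, 0.97486535]

for name, scores in [("Naive Bayes", nb_scores), ("SVM", svm_scores)]:
    print(f"{name}: mean={mean(scores):.4f} std={stdev(scores):.4f}")

# Per-fold difference: is the SVM ahead on every fold, or only on average?
diffs = [s - n for s, n in zip(svm_scores, nb_scores)]
print("SVM minus NB per fold:", [round(d, 4) for d in diffs])
print("SVM ahead on every fold:", all(d > 0 for d in diffs))
```

Five folds are a small sample, so this comparison is indicative rather than a formal significance test.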
