**Data sources for the SMSSpamCollection.txt file**

[1] Almeida, T.A., Gómez Hidalgo, J.M., Yamakami, A. Contributions to the Study of SMS Spam Filtering: New Collection and Results. Proceedings of the 2011 ACM Symposium on Document Engineering (DOCENG'11), Mountain View, CA, USA, 2011.

[2] Gómez Hidalgo, J.M., Almeida, T.A., Yamakami, A. On the Validity of a New SMS Spam Collection. Proceedings of the 11th IEEE International Conference on Machine Learning and Applications (ICMLA'12), Boca Raton, FL, USA, 2012.

[3] Almeida, T.A., Gómez Hidalgo, J.M., Silva, T.P. Towards SMS Spam Filtering: Results under a New Dataset. International Journal of Information Security Science (IJISS), 2(1), 1-18, 2013.

http://dcomp.sor.ufscar.br/talmeida/smspamcollection/

The SMSSpamCollection.txt file is a simple two-column table. The first column gives the category of a message: spam for unwanted messages, ham for legitimate ones. The second column holds the text of the message. In short, each row pairs the text of one message with its category, spam or ham.

The goal of this example is to show how to apply machine learning algorithms to text data such as the contents of SMSSpamCollection.txt. We will build a model that analyzes the text of a message and decides whether it is spam or not.

In [1]:
import nltk
import pandas as pd
import re
from sklearn.feature_extraction.text import TfidfVectorizer
import string
from sklearn.metrics import precision_recall_fscore_support as score
from sklearn.model_selection import train_test_split
from sklearn import svm
from sklearn.preprocessing import LabelEncoder

The following code reads SMSSpamCollection.txt into the DataFrame data and names its two columns: label for the class of a message and Content for its text.

In [2]:
data = pd.read_csv("../Data/SMSSpamCollection.txt", sep='\t', header=None)
data.columns = ['label', 'Content']
data

Unnamed: 0,label,Content
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."
...,...,...
5567,spam,This is the 2nd time we have tried 2 contact u...
5568,ham,Will ü b going to esplanade fr home?
5569,ham,"Pity, * was in mood for that. So...any other s..."
5570,ham,The guy did some bitching but I acted like i'd...


The following code adds a new column, Content_len, to the DataFrame data using the apply method covered in the chapter on the Pandas library. This column stores the length of each message, counted as its number of characters excluding spaces. Message length can be a useful signal for spam detection.

In [3]:
data['Content_len'] = data['Content'].apply(lambda x: len(x) - x.count(" "))
data

Unnamed: 0,label,Content,Content_len
0,ham,"Go until jurong point, crazy.. Available only ...",92
1,ham,Ok lar... Joking wif u oni...,24
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,128
3,ham,U dun say so early hor... U c already then say...,39
4,ham,"Nah I don't think he goes to usf, he lives aro...",49
...,...,...,...
5567,spam,This is the 2nd time we have tried 2 contact u...,131
5568,ham,Will ü b going to esplanade fr home?,29
5569,ham,"Pity, * was in mood for that. So...any other s...",48
5570,ham,The guy did some bitching but I acted like i'd...,100


The following code defines the function count_punctuation, which computes the percentage of punctuation characters in the message passed as its argument. The attribute string.punctuation is a string containing the ASCII punctuation characters. The list binary_array is built with a list comprehension, a construct introduced in the chapter on the Python language.

The list binary_array therefore contains one 1 for every punctuation character in the string text. To obtain the punctuation rate of a message, we divide the sum of the elements of binary_array by the number of non-space characters in text, then express the result as a percentage.

In [4]:
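def count_punctuation(text):
    # One 1 per punctuation character found in the message
    binary_array = [1 for ch in text if ch in string.punctuation]
    nb_ponctuation = sum(binary_array)
    # Number of characters, spaces excluded
    total = len(text) - text.count(" ")
    # Percentage rounded to two decimals (rounding last avoids float residue)
    return round(100 * nb_ponctuation / total, 2)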

In [5]:
data['punctuation_rate'] = data['Content'].apply(count_punctuation)
data

Unnamed: 0,label,Content,Content_len,punctuation_rate
0,ham,"Go until jurong point, crazy.. Available only ...",92,9.78
1,ham,Ok lar... Joking wif u oni...,24,25.00
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,128,4.69
3,ham,U dun say so early hor... U c already then say...,39,15.38
4,ham,"Nah I don't think he goes to usf, he lives aro...",49,4.08
...,...,...,...,...
5567,spam,This is the 2nd time we have tried 2 contact u...,131,6.11
5568,ham,Will ü b going to esplanade fr home?,29,3.45
5569,ham,"Pity, * was in mood for that. So...any other s...",48,14.58
5570,ham,The guy did some bitching but I acted like i'd...,100,1.00
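As a quick check of the two derived columns, the sketch below recomputes Content_len and punctuation_rate by hand for the two rows whose text appears untruncated in the table (rows 1 and 5568). The helper name punctuation_percent is hypothetical and only used here; it follows the same definition as count_punctuation.

```python
import string

def punctuation_percent(text):
    # Hypothetical helper for this sketch: share of punctuation characters
    # among the non-space characters, as a percentage
    n_punct = sum(1 for ch in text if ch in string.punctuation)
    n_chars = len(text) - text.count(" ")
    return round(100 * n_punct / n_chars, 2)

row_1 = "Ok lar... Joking wif u oni..."
row_5568 = "Will ü b going to esplanade fr home?"

# Non-space length, then punctuation percentage
print(len(row_1) - row_1.count(" "), punctuation_percent(row_1))          # 24 25.0
print(len(row_5568) - row_5568.count(" "), punctuation_percent(row_5568))  # 29 3.45
```

Both results agree with the values shown in the table: six dots out of 24 characters give 25 %, and one question mark out of 29 characters gives about 3.45 %.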


The following code defines the function clean_email, whose role is to build a list of tokens for each message: all the words of the message, without punctuation marks or whitespace. The function then applies stemming to the words of that list.

In other words, the function splits a string into its words, stores them in a list, and applies stemming to those words that are not English stopwords.

In [6]:
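nltk.download('stopwords')
en_stopwords = nltk.corpus.stopwords.words('english')
ps = nltk.PorterStemmer()

def clean_email(email):
    # Remove punctuation character by character
    result = "".join([ch for ch in email if ch not in string.punctuation])
    # Split on runs of non-word characters (raw string avoids an invalid escape)
    tokens = re.split(r'\W+', result)
    # Stem every token that is not a stopword (the comparison is case-sensitive)
    text = [ps.stem(word) for word in tokens if word not in en_stopwords]
    return text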

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/amoussaid/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
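To see what each step of the cleaning does, here is a minimal sketch that traces the first stages on sample strings using only the standard library. It is not the function above: stemming is left out, and STOPWORDS is a small stand-in set assumed for illustration rather than NLTK's actual list.

```python
import re
import string

# Small stand-in for NLTK's English stopword list (assumption for this sketch)
STOPWORDS = {"now", "its", "the", "to"}

def tokenize(message):
    # 1. Drop every punctuation character
    no_punct = "".join(ch for ch in message if ch not in string.punctuation)
    # 2. Split on runs of non-word characters (here, whitespace)
    tokens = re.split(r"\W+", no_punct)
    # 3. Keep tokens that are not stopwords (case-sensitive comparison)
    return [tok for tok in tokens if tok not in STOPWORDS]

print(tokenize("Free entry!! Call now, it's urgent."))
# ['Free', 'entry', 'Call', 'urgent']

print(tokenize("Ok lar... "))
# ['Ok', 'lar', '']
```

The second call shows a side effect of splitting with re.split: a message that ends with a separator yields an empty trailing token, which then flows into the vocabulary unless it is filtered out.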


The following code uses a TfidfVectorizer object to vectorize the text with the TF-IDF method. When creating the object vectorisation_full, we pass the function clean_email as the analyzer, so it is applied to every value of the Content column. The vectorization itself is carried out by the fit_transform method, called on the Content column.

In [7]:
vectorisation_full = TfidfVectorizer(analyzer=clean_email)
vect_final = vectorisation_full.fit_transform(data['Content'])
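To make the TF-IDF weights less of a black box, the sketch below computes them by hand on three tiny token lists. It assumes the defaults documented for scikit-learn's TfidfVectorizer (smoothed IDF, idf = ln((1 + n) / (1 + df)) + 1, followed by L2 normalization of each row); the token lists and the helper tfidf_row are made up for illustration.

```python
import math
from collections import Counter

# Hypothetical, already-cleaned messages (lists of stemmed tokens)
docs = [["free", "entri", "win"], ["ok", "lar"], ["free", "call", "free"]]

vocab = sorted({tok for doc in docs for tok in doc})
n_docs = len(docs)

# Document frequency: in how many messages each token appears
df = Counter(tok for doc in docs for tok in set(doc))

# Smoothed inverse document frequency
idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocab}

def tfidf_row(tokens):
    # Raw weight = term count * idf, then scale the row to unit L2 norm
    tf = Counter(tokens)
    raw = [tf[tok] * idf[tok] for tok in vocab]
    norm = math.sqrt(sum(v * v for v in raw)) or 1.0
    return [v / norm for v in raw]

matrix = [tfidf_row(doc) for doc in docs]

print(vocab)
# ['call', 'entri', 'free', 'lar', 'ok', 'win']
print([round(v, 4) for v in matrix[1]])
# [0.0, 0.0, 0.0, 0.7071, 0.7071, 0.0]
```

Each row has one entry per vocabulary token, which is why the real matrix is so wide and mostly zero. A token that appears in several messages, such as "free" here, gets a lower IDF than a token that appears in only one.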

The following code builds the new DataFrame all_data by combining the result of the previous vectorization, stored in vect_final, with the two columns Content_len and punctuation_rate created earlier.

At this point, the text data of SMSSpamCollection.txt has been turned into numeric data stored in the DataFrame all_data; the class of each row is stored in the label column of the DataFrame data.

In [8]:
all_data = pd.concat([pd.DataFrame(vect_final.toarray()), data['Content_len'], data['punctuation_rate']], axis=1)
all_data

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,8185,8186,8187,8188,8189,8190,8191,8192,Content_len,punctuation_rate
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,92,9.78
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,24,25.00
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,128,4.69
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,39,15.38
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,49,4.08
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5567,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,131,6.11
5568,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.311086,0.0,0.0,29,3.45
5569,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,48,14.58
5570,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,100,1.00


In [9]:
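#all_data.drop(['Content_len', 'punctuation_rate'], axis=1)
# Column names mix integers (TF-IDF features) and strings; cast them all to str
# so that scikit-learn accepts the DataFrame as input
all_data.columns = all_data.columns.astype(str)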

In [10]:
all_data

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,8185,8186,8187,8188,8189,8190,8191,8192,Content_len,punctuation_rate
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,92,9.78
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,24,25.00
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,128,4.69
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,39,15.38
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,49,4.08
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5567,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,131,6.11
5568,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.311086,0.0,0.0,29,3.45
5569,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,48,14.58
5570,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,100,1.00


In [11]:
x_train, x_test, y_train, y_test = train_test_split(all_data, data['label'], test_size=0.2)

In [12]:
y_train

672     spam
2257     ham
2632    spam
2703     ham
1936     ham
        ... 
2289     ham
5191     ham
2840     ham
5196    spam
2375     ham
Name: label, Length: 4457, dtype: object

In [13]:
le = LabelEncoder()
le.fit(y_train)
_y_train = le.transform(y_train)
_y_train

array([1, 0, 1, ..., 0, 1, 0])
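The encoded array above can be read with a simple rule: LabelEncoder sorts the distinct labels and replaces each one by its position in that sorted list, so ham becomes 0 and spam becomes 1. The sketch below reproduces that mapping in plain Python on a made-up list of labels; the variable names are illustrative.

```python
# Hypothetical labels standing in for y_train
labels = ["spam", "ham", "spam", "ham", "ham"]

# Sorted distinct classes: the index of each class is its code
classes = sorted(set(labels))
to_code = {cls: i for i, cls in enumerate(classes)}

encoded = [to_code[label] for label in labels]

print(classes)   # ['ham', 'spam']
print(encoded)   # [1, 0, 1, 0, 0]
```

Because the mapping is learned from the labels seen at fit time, the same fitted encoder should be used for both the training and the test labels.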

In [None]:
# Support vector classifier with a linear kernel, trained on the encoded labels
alg_svm = svm.SVC(kernel='linear')
alg_svm.fit(x_train, _y_train)

In [None]:
set(y_test)

In [None]:
# Reuse the encoder fitted on the training labels so that both sets share one mapping
_y_test = le.transform(y_test)
set(_y_test)

In [None]:
predictions = alg_svm.predict(x_test)

In [None]:
precision, recall, fscore, _ = score(_y_test, predictions, pos_label=1, average='binary')
print('Precision: {} / Recall: {} / Accuracy: {}'.format(round(precision, 3),
                                                        round(recall, 3),
                                                        round((predictions==_y_test).sum() / len(predictions),3)))
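The three scores printed by the last cell can be recomputed by hand from the counts of correct and incorrect predictions. The sketch below does so on made-up labels, with 1 standing for spam as in the encoding above; it illustrates the definitions and is not output from the model.

```python
# Hypothetical encoded labels: 1 = spam, 0 = ham
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 0, 1, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # spam correctly flagged
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # ham wrongly flagged
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # spam missed

precision = tp / (tp + fp)   # share of flagged messages that really are spam
recall = tp / (tp + fn)      # share of spam messages that were flagged
accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)

print(precision, recall, accuracy)   # 1.0 0.75 0.875
```

Here the classifier never flags a legitimate message (precision 1.0) but misses one spam out of four (recall 0.75), a trade-off that accuracy alone would hide.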