
# $\textbf{Workshop: NEWS-Classification}$


# $\textbf{1. Objective}$


The objective of this workshop is to explore text document classification with an SVM classifier, applied to a dataset of news stories that appeared on the Reuters newswire in 1987.

The dataset can be downloaded from this link: https://archive.ics.uci.edu/ml/datasets/reuters-21578+text+categorization+collection.

In [1]:
import warnings
warnings.simplefilter('ignore')

In [2]:
import nltk
nltk.download('reuters')

[nltk_data] Downloading package reuters to /home/alain/nltk_data...
[nltk_data]   Package reuters is already up-to-date!


True

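scikit-learn also provides a loader for a Reuters corpus, `fetch_rcv1`. Note that it fetches RCV1, a separate and much larger Reuters collection, not Reuters-21578.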


In [3]:
from sklearn.datasets import fetch_rcv1

# $\textbf{2. Loading the data}$



We use the Reuters dataset as shipped with NLTK.

In [4]:
from nltk.corpus import reuters

# All tokens of the corpus, in order (a token sequence, not a deduplicated vocabulary)
vocabulaire = reuters.words()

# All category names
categories = reuters.categories()

# Ids of all files belonging to a given category
ids_coffe = reuters.fileids("coffee")

# Words contained in the documents of a given category
coffe_words = reuters.words(reuters.fileids("coffee"))

# Raw text of the first document of a given category
cofee_docs = reuters.raw(reuters.fileids("coffee")[0])

# All categories attached to the documents annotated with a given category
classes_Annotated_coffee = reuters.categories(reuters.fileids("coffee"))

# Training set: file ids start with 'training/'
train_categories = [reuters.categories(i) for i in reuters.fileids() if i.startswith('training/')]
train_documents = [reuters.raw(i) for i in reuters.fileids() if i.startswith('training/')]

# Test set: file ids start with 'test/'
test_documents = [reuters.raw(i) for i in reuters.fileids() if i.startswith('test/')]
test_categories = [reuters.categories(i) for i in reuters.fileids() if i.startswith('test/')]

# Whole corpus
whole_docs = [reuters.raw(i) for i in reuters.fileids()]
whole_cats = [reuters.categories(i) for i in reuters.fileids()]
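
The training/test split above relies only on the prefix of each file id. The sketch below reproduces that filtering and counts how often each category occurs. It uses a small hypothetical mapping, `toy_corpus`, in place of the NLTK corpus; its ids and categories are made up for illustration.

```python
from collections import Counter

# Hypothetical file ids and categories standing in for reuters.fileids()
# and reuters.categories(fileid); they are not real corpus entries.
toy_corpus = {
    "training/1": ["coffee"],
    "training/2": ["grain", "wheat"],
    "training/3": ["coffee", "trade"],
    "test/4": ["wheat"],
    "test/5": ["coffee"],
}

# Same prefix-based filtering as in the cell above.
train_ids = [i for i in toy_corpus if i.startswith("training/")]
test_ids = [i for i in toy_corpus if i.startswith("test/")]

# A document can carry several categories, so flatten the label lists before counting.
train_label_counts = Counter(c for i in train_ids for c in toy_corpus[i])

print(len(train_ids), len(test_ids))
print(train_label_counts.most_common())
```

Counting labels this way on the real corpus is a quick way to see how unbalanced the categories are before reading the per-class reports.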


# $\textbf{3. Preprocessing}$



$\textbf{TF-IDF vector representations}$

In [5]:
# TF-IDF
# fit(raw_documents[, y]): Learn vocabulary and idf from training set.
# fit_transform(raw_documents[, y]): Learn vocabulary and idf, return document-term matrix.
# transform(raw_documents): Transform documents to document-term matrix.

from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(stop_words='english')

# Note: the vocabulary and idf weights are learned on the whole corpus, test documents
# included, so the train/test scores reported later may be somewhat optimistic.
vect_whole_docs = vectorizer.fit_transform(whole_docs)
vect_train_docs = vectorizer.transform(train_documents)
vect_test_docs = vectorizer.transform(test_documents)

# Encode the category lists as a binary indicator matrix (one column per category)
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
train_labels = mlb.fit_transform(train_categories)
test_labels = mlb.transform(test_categories)
# This call refits the binarizer on the whole corpus
whole_labels = mlb.fit_transform(whole_cats)

print(whole_labels.shape)

(10788, 90)


In [6]:
vect_whole_docs.shape

(10788, 30627)
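
To make the difference between `fit_transform` and `transform` concrete, here is a minimal sketch on a hypothetical toy corpus (the documents and categories are invented, not taken from Reuters). Unlike the cell above, it fits the vectorizer on the training documents only, which is one way to keep test-set vocabulary out of the model; that choice is an assumption of this sketch, not what the workshop code does.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical toy data, for illustration only.
toy_train_docs = ["coffee prices rise", "wheat exports fall", "coffee and wheat trade"]
toy_train_cats = [["coffee"], ["wheat"], ["coffee", "wheat"]]
toy_test_docs = ["coffee futures rise"]
toy_test_cats = [["coffee"]]

# Learn vocabulary and idf on the training documents only ...
toy_vectorizer = TfidfVectorizer()
X_toy_train = toy_vectorizer.fit_transform(toy_train_docs)
# ... then reuse them for the test documents; unseen words are ignored.
X_toy_test = toy_vectorizer.transform(toy_test_docs)

# One indicator column per category, learned from the training labels.
toy_mlb = MultiLabelBinarizer()
Y_toy_train = toy_mlb.fit_transform(toy_train_cats)
Y_toy_test = toy_mlb.transform(toy_test_cats)

print(X_toy_train.shape, X_toy_test.shape)
print(toy_mlb.classes_)
print(Y_toy_train)
```

The word "futures" appears only in the test document, so it gets no column: both matrices share the vocabulary learned during `fit_transform`.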

# $\textbf{4. Classification with SVM}$

In [7]:
"""SVM"""

import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report

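# One binary linear SVM per category (one-vs-rest)
classifier_svm = OneVsRestClassifier(LinearSVC())
classifier_svm.fit(vect_train_docs, train_labels)
test_labels_predict = classifier_svm.predict(vect_test_docs)

print(classification_report(test_labels, test_labels_predict))
# For multilabel targets, score() returns the subset accuracy (exact match of all labels)
scores = classifier_svm.score(vect_test_docs, test_labels)

print(scores)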

              precision    recall  f1-score   support

           0       0.99      0.95      0.97       719
           1       1.00      0.43      0.61        23
           2       1.00      0.64      0.78        14
           3       0.95      0.60      0.73        30
           4       0.88      0.39      0.54        18
           5       0.00      0.00      0.00         1
           6       1.00      0.94      0.97        18
           7       1.00      0.50      0.67         2
           8       0.00      0.00      0.00         3
           9       0.96      0.96      0.96        28
          10       1.00      0.78      0.88        18
          11       0.00      0.00      0.00         1
          12       0.95      0.71      0.82        56
          13       1.00      0.50      0.67        20
          14       0.00      0.00      0.00         2
          15       0.92      0.43      0.59        28
          16       0.00      0.00      0.00         1
          17       0.91    
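
The value returned by `score` on multilabel targets is the subset accuracy: a document counts as correct only if every one of its labels is predicted correctly. The sketch below contrasts it with a per-label accuracy on hypothetical indicator matrices (the arrays are invented for illustration).

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical indicator matrices: 3 documents, 3 categories.
y_true = np.array([[1, 0, 0],
                   [0, 1, 1],
                   [1, 1, 0]])
y_pred = np.array([[1, 0, 0],
                   [0, 1, 0],   # one label missed, so the whole row counts as wrong
                   [1, 1, 0]])

# Subset accuracy: fraction of rows predicted exactly.
subset_acc = accuracy_score(y_true, y_pred)
# Per-label accuracy: fraction of individual cells predicted correctly.
label_acc = (y_true == y_pred).mean()

print(subset_acc, label_acc)
```

Subset accuracy is therefore a strict metric, which is worth keeping in mind when comparing it with the per-class precision and recall of the report.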

# $\textbf{5. SVM classification with cross-validation}$

In [8]:
from sklearn.model_selection import cross_validate
# Sample-averaged metrics for multilabel output, plus subset accuracy
scoring = ['precision_samples', 'recall_samples', 'f1_samples', 'accuracy']
# 5-fold cross-validation on the whole corpus
scores_svm = cross_validate(classifier_svm, vect_whole_docs, whole_labels, cv=5, scoring=scoring)
print(scores_svm['fit_time'])
print(scores_svm['score_time'])
print(scores_svm['test_precision_samples'])
print(scores_svm['test_recall_samples'])
print(scores_svm['test_f1_samples'])

[1.96312261 2.05393624 1.8967886  1.87351632 1.85604215]
[0.07709217 0.07549524 0.07638812 0.07514453 0.07815814]
[0.8686747  0.88214396 0.8836886  0.89460671 0.89554165]
[0.85052207 0.86513196 0.86309211 0.87661217 0.86938649]
[0.85113967 0.86563286 0.86497342 0.87832754 0.87432217]


In [9]:
print(np.mean(scores_svm['test_precision_samples']))
print(np.mean(scores_svm['test_recall_samples']))
print(np.mean(scores_svm['test_f1_samples']))
print(np.mean(scores_svm['test_accuracy']))

0.8849311228008213
0.8649489602866532
0.8668791306612522
0.8130354734440062
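
The `*_samples` metrics used above are computed per document and then averaged over documents. A plain-Python sketch of that averaging follows. It is written from the definition, not taken from scikit-learn's source, and it assumes that a ratio with an empty denominator counts as 0. The label sets are hypothetical.

```python
def sample_averaged_scores(true_sets, pred_sets):
    """Return (precision, recall, f1), each averaged over documents."""
    precisions, recalls, f1s = [], [], []
    for true, pred in zip(true_sets, pred_sets):
        hits = len(true & pred)  # labels that are both true and predicted
        # Assumption: a ratio with an empty denominator is counted as 0.
        precisions.append(hits / len(pred) if pred else 0.0)
        recalls.append(hits / len(true) if true else 0.0)
        total = len(true) + len(pred)
        f1s.append(2 * hits / total if total else 0.0)
    n = len(true_sets)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Hypothetical label sets for three documents.
toy_true = [{"coffee"}, {"wheat", "grain"}, {"trade"}]
toy_pred = [{"coffee"}, {"wheat"}, {"trade", "coffee"}]

precision, recall, f1 = sample_averaged_scores(toy_true, toy_pred)
print(round(precision, 4), round(recall, 4), round(f1, 4))
```

Because F1 is averaged per document, it is not the harmonic mean of the averaged precision and recall, which is why the three means printed above do not satisfy that relation exactly.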


# $\textbf{6. Questions}$


Q1. Repeat the same process with KNN and compare the performance with that of SVM.

Q2. Repeat the same process with an ensemble method and compare the performance with that of KNN and SVM.

Q3. Repeat the same process after applying feature selection with SelectKBest from the sklearn library.

$\textbf{KNN, SVM }$

In [10]:
"""Classification with KNN"""

from sklearn.neighbors import KNeighborsClassifier
    
classifier_knn = OneVsRestClassifier(KNeighborsClassifier())
classifier_knn.fit(vect_train_docs, train_labels)
test_labels_predict = classifier_knn.predict(vect_test_docs)

#print(classification_report(test_labels,test_labels_predict))
scores = classifier_knn.score(vect_test_docs, test_labels)
print(scores)

0.728055647565419


In [11]:
test_labels_predict = classifier_knn.predict(vect_test_docs)
print(classification_report(test_labels,test_labels_predict))

              precision    recall  f1-score   support

           0       0.95      0.68      0.80       719
           1       0.77      0.43      0.56        23
           2       0.86      0.43      0.57        14
           3       0.62      0.53      0.57        30
           4       0.73      0.61      0.67        18
           5       0.00      0.00      0.00         1
           6       1.00      0.94      0.97        18
           7       1.00      1.00      1.00         2
           8       0.00      0.00      0.00         3
           9       0.83      0.89      0.86        28
          10       0.74      0.78      0.76        18
          11       0.00      0.00      0.00         1
          12       0.70      0.59      0.64        56
          13       0.82      0.45      0.58        20
          14       0.00      0.00      0.00         2
          15       0.82      0.50      0.62        28
          16       0.00      0.00      0.00         1
          17       0.81    

SVM gives a better score than KNN.

In [12]:
"""Classification with RandomForestClassifier"""
from sklearn.ensemble import RandomForestClassifier

    
classifier_rf = OneVsRestClassifier(RandomForestClassifier())
classifier_rf.fit(vect_train_docs, train_labels)
test_labels_predict = classifier_rf.predict(vect_test_docs)

print(classification_report(test_labels,test_labels_predict))
scores = classifier_rf.score(vect_test_docs, test_labels)
print(scores)

# Use here RandomForest selector, best covariates,

              precision    recall  f1-score   support

           0       0.97      0.92      0.94       719
           1       0.00      0.00      0.00        23
           2       1.00      0.07      0.13        14
           3       1.00      0.03      0.06        30
           4       0.00      0.00      0.00        18
           5       0.00      0.00      0.00         1
           6       0.00      0.00      0.00        18
           7       0.00      0.00      0.00         2
           8       0.00      0.00      0.00         3
           9       1.00      0.29      0.44        28
          10       1.00      0.06      0.11        18
          11       0.00      0.00      0.00         1
          12       1.00      0.23      0.38        56
          13       0.00      0.00      0.00        20
          14       0.00      0.00      0.00         2
          15       0.80      0.14      0.24        28
          16       0.00      0.00      0.00         1
          17       0.98    

In [69]:
"""Classification with AdaBoostClassifier"""
from sklearn.ensemble import AdaBoostClassifier

classifier_ada = OneVsRestClassifier(AdaBoostClassifier())
classifier_ada.fit(vect_train_docs, train_labels)
test_labels_predict = classifier_ada.predict(vect_test_docs)

scores = classifier_ada.score(vect_test_docs, test_labels)
print(scores)

In [102]:
test_labels_predict = classifier_ada.predict(vect_test_docs)
print(classification_report(test_labels,test_labels_predict))

SVM remains better than KNN, random forests (RandomForestClassifier) and AdaBoost.

$\textbf{SelectKBest}$

In [67]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

In [88]:
vect_whole_docs.shape, whole_labels.shape

((10788, 30627), (10788, 90))
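
The values of `k` passed to `SelectKBest` in this section correspond to 5%, 10% and 15% of the 30627 TF-IDF features, rounded down. A quick check of that arithmetic (the variable names are only illustrative):

```python
# Number of TF-IDF features, from the shape printed above.
n_features = 30627

# Integer arithmetic: floor of pct percent of the feature count.
k_values = {pct: n_features * pct // 100 for pct in (5, 10, 15)}

for pct, k in k_values.items():
    print(f"{pct}% of {n_features} features -> k = {k}")
```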

$\textbf{Selection of 5\% of the independent variables}$

In [73]:
X_new = SelectKBest(chi2, k=1531).fit_transform(vect_whole_docs, whole_labels)

scoring = ['precision_samples', 'recall_samples', 'f1_samples', 'accuracy']
scores_svm = cross_validate(classifier_svm, X_new, whole_labels, cv=5, scoring=scoring)

print(scores_svm['fit_time'])
print(scores_svm['score_time'])
print(scores_svm['test_precision_samples'])
print(scores_svm['test_recall_samples'])
print(scores_svm['test_f1_samples'], "\n")

print(np.mean(scores_svm['test_precision_samples']))
print(np.mean(scores_svm['test_recall_samples']))
print(np.mean(scores_svm['test_f1_samples']))
print(np.mean(scores_svm['test_accuracy']))

[0.69789314 0.90614653 0.71261907 0.8198657  0.75821567]
[0.0524838  0.06912756 0.05868912 0.07013392 0.05280447]
[0.85911338 0.8699413  0.85641026 0.86957194 0.86077113]
[0.84411696 0.85885299 0.83879364 0.85703602 0.84128419]
[0.8432317  0.85583085 0.8389015  0.85572424 0.8432789 ] 

0.8631616011494356
0.8480167604692965
0.847393438957688
0.7903236783659726
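
In the cell above, `SelectKBest` is fitted on the whole corpus before cross-validation, so each fold's selected features were chosen with knowledge of that fold's test labels. One common alternative, sketched here under stated assumptions, is to place the selector inside a `Pipeline` so that it is refitted on the training part of every fold. The data come from `make_multilabel_classification`, a synthetic stand-in for the Reuters matrices, and the sizes and `k=10` are arbitrary.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import cross_validate
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Synthetic multilabel data with non-negative features (chi2 requires X >= 0).
X_toy, Y_toy = make_multilabel_classification(
    n_samples=200, n_features=40, n_classes=3, random_state=0
)

# Feature selection is a pipeline step, so it is refitted on the
# training part of each fold instead of on the full dataset.
pipeline = Pipeline([
    ("select", SelectKBest(chi2, k=10)),
    ("svm", OneVsRestClassifier(LinearSVC())),
])

cv_scores = cross_validate(
    pipeline, X_toy, Y_toy, cv=5, scoring=["f1_samples", "accuracy"]
)
print(cv_scores["test_f1_samples"].mean())
print(cv_scores["test_accuracy"].mean())
```

Applied to the Reuters data, the same pipeline could replace the separate `fit_transform` call; the resulting scores would not necessarily match those printed in this section.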


$\textbf{Selection of 10\% of the independent variables}$

In [78]:
X_new = SelectKBest(chi2, k=3062).fit_transform(vect_whole_docs, whole_labels)

scoring = ['precision_samples', 'recall_samples', 'f1_samples', 'accuracy']
scores_svm = cross_validate(classifier_svm, X_new, whole_labels, cv=5, scoring=scoring)

print(scores_svm['fit_time'])
print(scores_svm['score_time'])
print(scores_svm['test_precision_samples'])
print(scores_svm['test_recall_samples'])
print(scores_svm['test_f1_samples'], "\n")

print(np.mean(scores_svm['test_precision_samples']))
print(np.mean(scores_svm['test_recall_samples']))
print(np.mean(scores_svm['test_f1_samples']))
print(np.mean(scores_svm['test_accuracy']))

[0.99409676 0.9782083  0.92197371 0.91986752 0.84991503]
[0.0640173  0.07003403 0.06291127 0.05480576 0.07793283]
[0.8718721  0.8853877  0.87546339 0.88842528 0.88715809]
[0.85626114 0.8705382  0.85600994 0.87395767 0.86444908]
[0.85581149 0.87000419 0.85718954 0.87428618 0.86749795] 

0.8816613144923047
0.8642432053727443
0.8649578693150837
0.8096977618401281


$\textbf{Selection of 15\% of the independent variables}$

In [81]:
X_new = SelectKBest(chi2, k=4594).fit_transform(vect_whole_docs, whole_labels)

scoring = ['precision_samples', 'recall_samples', 'f1_samples', 'accuracy']
scores_svm = cross_validate(classifier_svm, X_new, whole_labels, cv=5, scoring=scoring)

print(scores_svm['fit_time'])
print(scores_svm['score_time'])
print(scores_svm['test_precision_samples'])
print(scores_svm['test_recall_samples'])
print(scores_svm['test_f1_samples'], "\n")

print(np.mean(scores_svm['test_precision_samples']))
print(np.mean(scores_svm['test_recall_samples']))
print(np.mean(scores_svm['test_f1_samples']))
print(np.mean(scores_svm['test_accuracy']))

[1.08230186 1.03436089 1.01106167 0.96012235 0.92841911]
[0.05676699 0.0741477  0.08689499 0.06278205 0.05581307]
[0.86837349 0.88809082 0.87855267 0.89360223 0.89175552]
[0.85430366 0.87173529 0.85813382 0.87873282 0.8689847 ]
[0.85327828 0.87184231 0.8596413  0.87904808 0.87221349] 

0.8840749482004908
0.8663780587876555
0.8672046944941944
0.8115520174202748


In [82]:
""" KNN Classifier """

X_new = SelectKBest(chi2, k=4594).fit_transform(vect_whole_docs, whole_labels)

scoring = ['precision_samples', 'recall_samples', 'f1_samples', 'accuracy']
classifier_knn = OneVsRestClassifier(KNeighborsClassifier())
# Store the cross-validation results under their own name instead of overwriting the classifier
scores_knn = cross_validate(classifier_knn, X_new, whole_labels, cv=5, scoring=scoring)

print(scores_knn['fit_time'])
print(scores_knn['score_time'])
print(scores_knn['test_precision_samples'])
print(scores_knn['test_recall_samples'])
print(scores_knn['test_f1_samples'], "\n")

print(np.mean(scores_knn['test_precision_samples']))
print(np.mean(scores_knn['test_recall_samples']))
print(np.mean(scores_knn['test_f1_samples']))
print(np.mean(scores_knn['test_accuracy']))

[0.24561357 0.29466677 0.24288511 0.24265838 0.34293342]
[81.04196477 76.25177646 73.68394494 79.26062226 80.26495886]
[0.69176707 0.72265987 0.68534137 0.71743162 0.71076341]
[0.68595812 0.715378   0.67926063 0.7110699  0.69702519]
[0.68558593 0.71483811 0.67905946 0.71065002 0.69924257] 

0.7055926655876384
0.6977383673753298
0.697875219440144
0.6694483508013008
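
In the output above, fitting KNN takes well under a second per fold while scoring takes 70 to 80 seconds. This is the usual profile of a nearest-neighbour method: fitting mostly stores the training data, and the distance computations happen at prediction time. The plain-Python sketch below illustrates the idea on hypothetical 2-D points; it is not scikit-learn's implementation.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Majority vote among the k training points closest to `query`."""
    # All the work happens at query time: one distance per training point.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical 2-D points, chosen only for illustration.
toy_points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (1.0, 1.0), (0.9, 1.1)]
toy_labels = ["coffee", "coffee", "coffee", "wheat", "wheat"]

print(knn_predict(toy_points, toy_labels, (0.15, 0.15)))
print(knn_predict(toy_points, toy_labels, (0.95, 1.0)))
```

Each prediction scans every training point, so the cost grows with both the number of training documents and the number of features.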


In [83]:
""" Random forest classifier """

X_new = SelectKBest(chi2, k=4594).fit_transform(vect_whole_docs, whole_labels)

scoring = ['precision_samples', 'recall_samples', 'f1_samples', 'accuracy']
classifier_rf = OneVsRestClassifier(RandomForestClassifier())
# Store the cross-validation results under their own name instead of overwriting the classifier
scores_rf = cross_validate(classifier_rf, X_new, whole_labels, cv=5, scoring=scoring)

print(scores_rf['fit_time'])
print(scores_rf['score_time'])
print(scores_rf['test_precision_samples'])
print(scores_rf['test_recall_samples'])
print(scores_rf['test_f1_samples'], "\n")

print(np.mean(scores_rf['test_precision_samples']))
print(np.mean(scores_rf['test_recall_samples']))
print(np.mean(scores_rf['test_f1_samples']))
print(np.mean(scores_rf['test_accuracy']))

[110.49026394 118.3114078  113.67388725 109.51805329 113.13453937]
[3.54869795 3.43697834 3.14020848 2.99070048 3.02524662]
[0.7713701  0.81093605 0.77527804 0.79578118 0.79103693]
[0.74215415 0.77819961 0.74292555 0.76878143 0.75970484]
[0.74842083 0.78528951 0.75010767 0.77521302 0.76671045] 

0.7888804588920211
0.7583531148881703
0.7651482969051553
0.712552617660113
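
To compare execution times, the per-fold `fit_time` and `score_time` arrays printed for the three runs on 15% of the features can be averaged and summed. The numbers below are copied from those outputs; the dictionary layout is only an illustrative choice.

```python
from statistics import mean

# Per-fold times in seconds, copied from the three 15% cross-validation outputs above.
timings = {
    "SVM": {
        "fit": [1.08230186, 1.03436089, 1.01106167, 0.96012235, 0.92841911],
        "score": [0.05676699, 0.0741477, 0.08689499, 0.06278205, 0.05581307],
    },
    "KNN": {
        "fit": [0.24561357, 0.29466677, 0.24288511, 0.24265838, 0.34293342],
        "score": [81.04196477, 76.25177646, 73.68394494, 79.26062226, 80.26495886],
    },
    "RF": {
        "fit": [110.49026394, 118.3114078, 113.67388725, 109.51805329, 113.13453937],
        "score": [3.54869795, 3.43697834, 3.14020848, 2.99070048, 3.02524662],
    },
}

# Mean fit time plus mean score time, per fold.
mean_total = {name: mean(t["fit"]) + mean(t["score"]) for name, t in timings.items()}

for name, seconds in sorted(mean_total.items(), key=lambda item: item[1]):
    print(f"{name}: {seconds:.2f} s per fold")
```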


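$\textbf{Conclusion: SVM}$ gives the best cross-validated metrics among the algorithms tested in this workshop. With $15\%$ of the independent variables it reaches a mean F1 of about 0.867 and a mean accuracy of about 0.812, against 0.698 and 0.669 for KNN and 0.765 and 0.713 for random forests on the same features. It is also the fastest: fitting and scoring take roughly 1 s per fold, against about 78 s for KNN and about 116 s for random forests. Its scores with $15\%$ of the independent variables are essentially the same as those obtained with all of them (mean F1 0.867, mean accuracy 0.813), so $15\%$ of the independent variables are sufficient.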