# TD2: text data classification

This notebook was developed for the Deep Learning course taught by J. Velcin at Université de Lyon 2.

We start by loading the spam dataset distributed with the SciPy 2017 tutorial by A. Gramfort and A. Mueller:
https://github.com/amueller/scipy-2017-sklearn

In [11]:
import numpy as np

import os

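# Each line of the file is "<label><TAB><message>", with label "ham" or "spam".
with open(os.path.join("datasets", "smsspam", "SMSSpamCollection")) as f:
    lines = [line.strip().split("\t") for line in f]

text = [x[1] for x in lines]
# Binary target: 1 for spam, 0 for ham.
y = [int(x[0] == "spam") for x in lines]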

In [12]:
print(text[0:5])
print(y[0:5])

['Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...', 'Ok lar... Joking wif u oni...', "Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's", 'U dun say so early hor... U c already then say...', "Nah I don't think he goes to usf, he lives around here though"]
[0, 0, 1, 0, 0]
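
As a minimal sketch of the parsing step, the snippet below applies the same label/message split to a few in-memory lines instead of the file. The helper name `parse_sms_lines` and the sample list are hypothetical; passing `maxsplit=1` is an optional safeguard that is not part of the original cell, so that a tab inside a message does not split it further.

```python
def parse_sms_lines(lines):
    # Hypothetical helper: split "label<TAB>message" lines into
    # a list of messages and a list of binary labels (1 = spam).
    texts, labels = [], []
    for line in lines:
        # maxsplit=1 keeps any tab inside the message body intact.
        label, message = line.strip().split("\t", 1)
        texts.append(message)
        labels.append(int(label == "spam"))
    return texts, labels


# Two sample lines modelled on the messages printed above.
sample = [
    "ham\tOk lar... Joking wif u oni...\n",
    "spam\tFree entry in 2 a wkly comp\n",
]
texts, labels = parse_sms_lines(sample)
print(texts)
print(labels)
```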


The scikit-learn library provides very handy tools for vectorizing text; see the SciPy 2017 tutorial and the text mining course.

In [13]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Learn the vocabulary and IDF weights, then build a dense document-term matrix.
vectorize_spamdata = TfidfVectorizer()
vectorize_spamdata.fit(text)
data = vectorize_spamdata.transform(text).toarray()

In [14]:
# Vocabulary size, reused later as the input dimension of the MLP.
dim = data.shape[1]
print(data.shape)
data[10:20, 4:8]

(5574, 8716)


array([[0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.]])
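
Calling `.toarray()` converts the sparse TF-IDF matrix into a dense one, which is convenient for Keras but stores every zero explicitly. On the full 5574 × 8716 matrix, the dense float64 version holds 48,582,984 values, or roughly 389 MB. The sketch below uses a small hypothetical corpus (`toy_corpus` is not part of the dataset) to compare the number of stored values in both representations.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus standing in for the SMS messages.
toy_corpus = [
    "free entry to win a prize",
    "ok see you later",
    "win a free ticket now",
]

vectorizer = TfidfVectorizer()
sparse_X = vectorizer.fit_transform(toy_corpus)  # SciPy sparse matrix
dense_X = sparse_X.toarray()                     # NumPy array, zeros included

print("shape:", sparse_X.shape)
print("non-zero values stored (sparse):", sparse_X.nnz)
print("values stored (dense):", dense_X.size)
```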

We split the dataset into training and test sets, stratifying on the labels so that both sets keep the same class proportions.

In [15]:
from sklearn.model_selection import train_test_split

train_X, test_X, train_y, test_y = train_test_split(data, y,
                                                    train_size=0.7,
                                                    test_size=0.3,
                                                    random_state=123,
                                                    stratify=y)
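
To see what `stratify` does, the sketch below splits a synthetic, imbalanced label vector (hypothetical data, not the SMS corpus) with the same parameters and compares the fraction of positive labels in each subset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with 15% positives; the features are placeholders.
labels = np.array([1] * 150 + [0] * 850)
features = np.arange(len(labels)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    features, labels,
    train_size=0.7, test_size=0.3,
    random_state=123, stratify=labels,
)

# With stratification, each subset keeps close to the overall class ratio.
print("overall positive rate:", labels.mean())
print("train positive rate:  ", y_tr.mean())
print("test positive rate:   ", y_te.mean())
```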

We build a simple MLP with one hidden layer, as in TD 1.

In [6]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

def mlp():
    # One hidden layer of 8 ReLU units, then a sigmoid output for binary classification.
    model = Sequential()
    model.add(Dense(8, input_dim=dim, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

In [16]:
simple_mlp = mlp()
simple_mlp.summary()

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense_2 (Dense)             (None, 8)                 69736     
                                                                 
 dense_3 (Dense)             (None, 1)                 9         
                                                                 
Total params: 69,745
Trainable params: 69,745
Non-trainable params: 0
_________________________________________________________________
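
The parameter counts in the summary can be checked by hand: a `Dense` layer with n inputs and m units has n × m weights plus m biases. Below is a short sketch of that arithmetic, using the input dimension 8716 printed earlier (the helper name `dense_params` is hypothetical).

```python
def dense_params(n_inputs, n_units):
    """Number of trainable parameters in a fully connected layer with biases."""
    return n_inputs * n_units + n_units


hidden = dense_params(8716, 8)  # hidden layer: 8716 inputs, 8 units
output = dense_params(8, 1)     # output layer: 8 inputs, 1 unit
print(hidden, output, hidden + output)  # 69736 9 69745
```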


The label lists need to be converted to NumPy arrays.

In [17]:
test_y = np.array(test_y)
train_y = np.array(train_y)

We train for 10 epochs with batches of 10 texts.

In [18]:
simple_mlp.fit(train_X, train_y, epochs=10, batch_size=10)

Epoch 1/10
 11/391 [..............................] - ETA: 2s - loss: 0.6809 - accuracy: 0.8455  


Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x2a6ccb7f0>
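
The `391` in the progress bar is the number of batches per epoch. Here is a back-of-the-envelope check, assuming the 70% training share of the 5574 messages is rounded down to a whole number of samples and that the last, smaller batch is still counted:

```python
import math

n_samples = 5574
batch_size = 10

# Assumption: the 70% training share is rounded down to an integer.
n_train = math.floor(0.7 * n_samples)
# The final, incomplete batch still counts as one step.
steps_per_epoch = math.ceil(n_train / batch_size)
print(n_train, steps_per_epoch)  # 3901 391
```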

Let's look at how well the model generalizes.

In [19]:
# evaluate() returns the loss first, then the metrics passed to compile().
score = simple_mlp.evaluate(test_X, test_y)
print("test score: ", score[0])
print("test accuracy: ", score[1])




test score:  0.051393888890743256
test accuracy:  0.9868500232696533


For reference, logistic regression reaches about 96% accuracy on the same dataset.
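
A baseline of this kind can be sketched as follows. This is not the code behind the figure quoted above: the function name `logreg_baseline`, the toy messages, and the `LogisticRegression` settings are assumptions made for illustration. On the real data, one would pass `text` and `y` instead of the toy lists.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def logreg_baseline(texts, labels, random_state=123):
    """Fit TF-IDF + logistic regression and return the test accuracy."""
    train_txt, test_txt, train_lab, test_lab = train_test_split(
        texts, labels, train_size=0.7, test_size=0.3,
        random_state=random_state, stratify=labels,
    )
    # Fit the vectorizer on the training messages only, so that no
    # test-set vocabulary statistics leak into the model.
    vectorizer = TfidfVectorizer()
    train_X = vectorizer.fit_transform(train_txt)
    test_X = vectorizer.transform(test_txt)

    clf = LogisticRegression(max_iter=1000)
    clf.fit(train_X, train_lab)
    return clf.score(test_X, test_lab)


# Hypothetical toy messages, only there to make the sketch runnable.
toy_texts = (["win a free prize now", "free entry win cash"] * 10
             + ["see you at lunch", "ok call me later"] * 10)
toy_labels = [1] * 20 + [0] * 20
accuracy = logreg_baseline(toy_texts, toy_labels)
print("toy test accuracy:", accuracy)
```

Unlike the vectorization cell earlier in the notebook, which fits the vectorizer on all messages, this sketch fits it on the training split only; this is a design choice of the sketch, not something the notebook prescribes.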

In [21]:
# Kernel and bias of the hidden layer.
simple_mlp.layers[0].get_weights()

[array([[ 0.13646051,  0.179735  , -0.10603371, ...,  0.13543199,
         -0.08062363, -0.07995619],
        [ 0.22492523,  0.33349097, -0.16878554, ...,  0.23677073,
         -0.15987249, -0.15640791],
        [-0.02432143, -0.02261983,  0.03619136, ..., -0.03094219,
          0.049872  ,  0.02965823],
        ...,
        [-0.00252159,  0.01166426,  0.01861603, ..., -0.00199095,
          0.04667639,  0.04463157],
        [ 0.10752779,  0.08941199, -0.1062703 , ...,  0.08843637,
         -0.0732681 , -0.10260005],
        [-0.06009427, -0.05941159,  0.05423395, ..., -0.0587045 ,
          0.0473928 ,  0.05170109]], dtype=float32),
 array([0.16885541, 0.18907182, 0.4353391 , 0.4006136 , 0.48576194,
        0.15885757, 0.4409305 , 0.40575182], dtype=float32)]
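
`get_weights()` on a `Dense` layer returns two arrays: the kernel, of shape (inputs, units), and the bias, of shape (units,). As a sketch of how these arrays are used, the NumPy code below reproduces the forward pass of a network shaped like `simple_mlp`. The weights are random placeholders, not the trained values, and the input dimension is shrunk to 5 to keep the example small.

```python
import numpy as np

rng = np.random.default_rng(0)
n_inputs, n_hidden = 5, 8  # placeholder sizes; the notebook uses 8716 inputs

# Random stand-ins for layers[0].get_weights() and layers[1].get_weights().
W1, b1 = rng.normal(size=(n_inputs, n_hidden)), np.zeros(n_hidden)
W2, b2 = rng.normal(size=(n_hidden, 1)), np.zeros(1)


def forward(x):
    """ReLU hidden layer followed by a sigmoid output unit."""
    hidden = np.maximum(0.0, x @ W1 + b1)
    logits = hidden @ W2 + b2
    return 1.0 / (1.0 + np.exp(-logits))


x = rng.random(size=(3, n_inputs))  # three fake TF-IDF rows
probs = forward(x)
print(probs.shape)  # (3, 1)
```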