<a href="https://colab.research.google.com/github/ZakariaabGit/zakariaabGit.github.io/blob/main/classification_NN.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install tensorflow-text

Successfully installed tensorflow-text-2.7.3


In [None]:
import numpy as np
#from collections import Counter
import tensorflow_datasets as tfds
from tensorflow.keras.layers import Input, LSTM, GRU, SimpleRNN, Masking, Embedding, Dense, Flatten
from tensorflow.keras import Model
from tensorflow.keras.preprocessing.sequence import pad_sequences

# 1 The data

The dataset used here is IMDB, a well-known benchmark in text analysis. It consists of
movie reviews labelled as "positive" (class 1) or "negative" (class 0).
The dataset contains 25,000 positive reviews and 25,000 negative reviews.

In [None]:
def get_texts_and_labels(data):
  texts, labels = [], []
  for text, label in data:
    texts.append(text.numpy().decode('utf-8'))
    labels.append(label.numpy())
  return texts, labels

def load_data():

  train_data = tfds.load(
    'imdb_reviews',
    split='train',
    # batch_size=BATCH_SIZE,  # None 
    shuffle_files=True,
    as_supervised=True)
  
  test_data = tfds.load(
    'imdb_reviews',
    split='test',
    # batch_size=BATCH_SIZE,  # None
    shuffle_files=True,
    as_supervised=True)

  return train_data, test_data


train_data, test_data = load_data()
train_texts, train_labels = get_texts_and_labels(train_data)
test_texts, test_labels = get_texts_and_labels(test_data)

In [None]:
train_texts[:5]

In [None]:
len(train_texts), len(test_texts)

(25000, 25000)

In [None]:
small_train_texts = train_texts[:5]
small_train_labels = train_labels[:5]

# 2 Data preprocessing

## 2.1 Inspecting the vocabulary

Note that a word is delimited only by whitespace: punctuation
attached to a word creates a distinct token.
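
As a small illustration of this point, the sketch below counts whitespace-separated tokens in a toy sentence. The sentence is invented for the example and is not taken from the dataset.

```python
from collections import Counter

# Toy sentence invented for illustration (not from IMDB).
sample = "A great movie. Not a great movie, but a movie!"

# Splitting on whitespace keeps punctuation and case attached to each token,
# so "movie.", "movie," and "movie!" are counted as three different words.
raw_counts = Counter(sample.split())

print(raw_counts)
print(len(raw_counts))  # 8 distinct tokens
```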

In [None]:
def get_words(lines):
  dict_words = {}
  index_words = {}
  index = 1
  for line in lines:
    tokens = line.split()
    for token in tokens:
      if token in dict_words:
        dict_words[token] += 1
      else:
        dict_words[token] = 1
        index_words[token] = index
        index += 1
  return dict_words, index_words

dict_words, index_words = get_words(small_train_texts)
print(dict_words)
print(len(dict_words))

{'This': 3, 'was': 6, 'an': 2, 'absolutely': 2, 'terrible': 1, 'movie.': 1, "Don't": 1, 'be': 5, 'lured': 1, 'in': 7, 'by': 1, 'Christopher': 2, 'Walken': 2, 'or': 2, 'Michael': 1, 'Ironside.': 1, 'Both': 1, 'are': 6, 'great': 2, 'actors,': 1, 'but': 6, 'this': 8, 'must': 1, 'simply': 1, 'their': 3, 'worst': 1, 'role': 1, 'history.': 1, 'Even': 1, 'acting': 1, 'could': 2, 'not': 1, 'redeem': 1, "movie's": 1, 'ridiculous': 1, 'storyline.': 1, 'movie': 2, 'is': 3, 'early': 1, 'nineties': 1, 'US': 1, 'propaganda': 1, 'piece.': 1, 'The': 3, 'most': 1, 'pathetic': 2, 'scenes': 1, 'were': 2, 'those': 1, 'when': 3, 'the': 26, 'Columbian': 1, 'rebels': 1, 'making': 1, 'cases': 1, 'for': 8, 'revolutions.': 1, 'Maria': 1, 'Conchita': 1, 'Alonso': 1, 'appeared': 1, 'phony,': 1, 'and': 13, 'her': 1, 'pseudo-love': 1, 'affair': 1, 'with': 4, 'nothing': 1, 'a': 15, 'emotional': 1, 'plug': 1, 'that': 5, 'devoid': 1, 'of': 15, 'any': 2, 'real': 2, 'meaning.': 1, 'I': 7, 'am': 1, 'disappointed': 1, 'th

## 2.2 Cleaning the data

Note that cleaning reduces the number of distinct words, and that some words in the
vocabulary now occur more often, which helps when training
a machine learning model.
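
To make the effect concrete, here is a minimal sketch on an invented toy sentence. It uses `str.translate` with `string.punctuation`, which is an assumption made for brevity and not the filter list used in this notebook.

```python
import string

# Toy sentence invented for illustration (not from IMDB).
sample = "A great movie. Not a great movie, but a movie!"

raw_vocab = set(sample.split())

# Strip ASCII punctuation, lowercase, then split on whitespace.
table = str.maketrans("", "", string.punctuation)
clean_tokens = sample.translate(table).lower().split()
clean_vocab = set(clean_tokens)

print(len(raw_vocab), len(clean_vocab))  # 8 5
print(clean_tokens.count("movie"))       # 3
```

The vocabulary shrinks from 8 to 5 entries, and "movie" now has three occurrences instead of being spread over three distinct tokens.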

In [None]:
filters = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n\''

def clean_textlines(lines):
  new_lines = []
  for line in lines:

    # remove every character listed in the filter
    for item in list(filters):
      line = line.replace(item, '')

    # lowercase the text
    line = line.lower()

    # drop tokens that are not purely alphabetic (e.g. words containing digits)
    line = ' '.join([word for word in line.split(' ') if word.isalpha()])

    new_lines.append(line)
  return new_lines

clean_small_train_texts = clean_textlines(small_train_texts)
print(clean_small_train_texts)

dict_words, index_words = get_words(clean_small_train_texts)
print(dict_words)
print(len(dict_words))


['this was an absolutely terrible movie dont be lured in by christopher walken or michael ironside both are great actors but this must simply be their worst role in history even their great acting could not redeem this movies ridiculous storyline this movie is an early nineties us propaganda piece the most pathetic scenes were those when the columbian rebels were making their cases for revolutions maria conchita alonso appeared phony and her pseudolove affair with walken was nothing but a pathetic emotional plug in a movie that was devoid of any real meaning i am disappointed that there are movies like this ruining actors like christopher walkens good name i could barely sit through it', 'i have been known to fall asleep during films but this is usually due to a combination of things including really tired being warm and comfortable on the sette and having just eaten a lot however on this occasion i fell asleep because the film was rubbish the plot development was constant constantly s

# 3 Data formatting

## 3.1 Vector encoding

In [None]:
def vector_count_representation(lines, id_words):

  nb_words = len(id_words)
  nb_lines = len(lines)

  vrep_lines = np.zeros((nb_lines, nb_words))
  for i,line in enumerate(lines):
    words = line.split(' ')
    for word in words:
      vrep_lines[i,id_words[word]-1] += 1

  return vrep_lines

vector_rep_st = vector_count_representation(clean_small_train_texts, index_words)
print(vector_rep_st.shape)


(5, 299)
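
The count encoding can be sketched on a toy vocabulary as follows. The helper `count_vector` and the index `toy_index` are hypothetical names introduced for this example, and plain lists stand in for the NumPy array.

```python
def count_vector(line, word_index):
    """Return word counts for one line; word_index maps word -> 1-based index."""
    vec = [0] * len(word_index)
    for word in line.split():
        # Indices start at 1, so column 0 holds the word with index 1.
        vec[word_index[word] - 1] += 1
    return vec

# Toy vocabulary invented for illustration.
toy_index = {"this": 1, "movie": 2, "is": 3, "great": 4}
toy_vec = count_vector("this movie is great great", toy_index)
print(toy_vec)  # [1, 1, 1, 2]
```

Each text becomes one fixed-length row whose width equals the vocabulary size, which is why the five sample reviews give a `(5, 299)` array.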


## 3.2 Sequential encoding

In [None]:
def sequential_representation(lines, id_words):

  vseq_lines = []
  for i,line in enumerate(lines):
    seq = [id_words[word] for word in line.split(' ')]
    vseq_lines.append(seq)

  return vseq_lines

sequential_rep_st = sequential_representation(clean_small_train_texts, index_words)
print(sequential_rep_st)

[[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 1, 22, 23, 8, 24, 25, 26, 10, 27, 28, 24, 19, 29, 30, 31, 32, 1, 33, 34, 35, 1, 6, 36, 3, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 42, 49, 50, 46, 51, 24, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 13, 2, 65, 21, 66, 44, 67, 68, 10, 66, 6, 69, 2, 70, 71, 72, 73, 74, 75, 76, 77, 69, 78, 18, 33, 79, 1, 80, 20, 79, 12, 81, 82, 83, 75, 30, 84, 85, 86, 87], [75, 88, 89, 90, 91, 92, 93, 94, 95, 21, 1, 36, 96, 97, 91, 66, 98, 71, 99, 100, 101, 102, 103, 104, 60, 105, 106, 42, 107, 60, 108, 109, 110, 66, 111, 112, 106, 1, 113, 75, 114, 93, 115, 42, 116, 2, 117, 42, 118, 119, 2, 120, 121, 122, 60, 123, 99, 124, 91, 125, 21, 64, 126, 127, 71, 128, 2, 129, 130, 14, 131, 75, 132, 75, 133, 88, 134, 135, 71, 42, 116, 21, 75, 136, 42, 137, 71, 87, 60, 138, 109, 124, 91, 125, 71, 139, 140, 141, 142, 72, 73, 143, 53, 144, 145, 75, 146, 147, 1, 116, 148, 149], [150, 151, 42, 152, 153, 154, 10, 66, 155, 156, 60, 
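
A short sketch of what the sequential encoding keeps that the count encoding loses: word order. The vocabulary and the two sentences are invented for the example, and `to_sequence` / `to_counts` are hypothetical helper names.

```python
# Toy vocabulary invented for illustration (1-based indices, as in get_words).
toy_index = {"not": 1, "good": 2, "but": 3, "bad": 4}

def to_sequence(line, word_index):
    """Map each word to its index, preserving order."""
    return [word_index[w] for w in line.split()]

def to_counts(line, word_index):
    """Count occurrences of each vocabulary word, ignoring order."""
    vec = [0] * len(word_index)
    for w in line.split():
        vec[word_index[w] - 1] += 1
    return vec

a = "not good but bad"
b = "not bad but good"

print(to_counts(a, toy_index) == to_counts(b, toy_index))  # True
print(to_sequence(a, toy_index))  # [1, 2, 3, 4]
print(to_sequence(b, toy_index))  # [1, 4, 3, 2]
```

The two sentences have opposite meanings yet identical count vectors; only the sequences differ, which is the information a recurrent network can exploit.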

## 3.3 Validation set

In [None]:
def extract_valid_data(x_train_data, y_train_data, valid_proportion=0.2):

  split_point = int(len(x_train_data) * valid_proportion)
  x_valid = x_train_data[:split_point]
  x_train = x_train_data[split_point:]

  y_valid = y_train_data[:split_point]
  y_train = y_train_data[split_point:]

  return x_train, y_train, x_valid, y_valid

valid_proportion = .2
x_train_small, y_train_small, x_valid, y_valid = extract_valid_data(vector_rep_st, small_train_labels, valid_proportion=valid_proportion)
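
The split sizes follow directly from `int(len(x) * valid_proportion)`. The sketch below reproduces only that arithmetic; `split_sizes` is a hypothetical helper, not part of the notebook.

```python
def split_sizes(n_samples, valid_proportion=0.2):
    """Return (n_train, n_valid) for a head-of-list validation split."""
    split_point = int(n_samples * valid_proportion)
    return n_samples - split_point, split_point

print(split_sizes(5))      # (4, 1)
print(split_sizes(25000))  # (20000, 5000)
```

With the five-review sample, validation therefore rests on a single example, so validation metrics on this toy set are not meaningful.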

## 3.4 Data formatting

In [None]:
y_train_small = np.asarray(y_train_small).astype('float32')
y_valid = np.asarray(y_valid).astype('float32')
y_test = np.asarray(test_labels).astype('float32')

# 4 Neural networks


## 4.1 Feedforward network

In [None]:
def model_mlp(num_words):

  input_layer = Input(shape=(num_words,))
  dense1 = Dense(64, activation='relu')(input_layer)
  dense2 = Dense(16, activation='relu')(dense1)
  dense3 = Dense(1, activation='sigmoid')(dense2)
  model = Model(input_layer, dense3)
  model.compile(optimizer='sgd',
              loss='binary_crossentropy', # for more than two classes: loss='categorical_crossentropy'
              metrics=['accuracy'])
  model.summary()
  return model

my_model_mlp = model_mlp(len(index_words))

Model: "model_11"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_12 (InputLayer)       [(None, 299)]             0         
                                                                 
 dense_17 (Dense)            (None, 64)                19200     
                                                                 
 dense_18 (Dense)            (None, 16)                1040      
                                                                 
 dense_19 (Dense)            (None, 1)                 17        
                                                                 
Total params: 20,257
Trainable params: 20,257
Non-trainable params: 0
_________________________________________________________________
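
The parameter counts in the summary can be checked by hand: a dense layer has one weight per input-output pair plus one bias per output. A minimal sketch (`dense_params` is a hypothetical helper):

```python
def dense_params(n_in, n_out):
    """Weights (n_in * n_out) plus biases (n_out) of a fully connected layer."""
    return n_in * n_out + n_out

# (inputs, outputs) for the three Dense layers of the summary above.
layers = [(299, 64), (64, 16), (16, 1)]
counts = [dense_params(i, o) for i, o in layers]
print(counts, sum(counts))  # [19200, 1040, 17] 20257
```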


In [None]:
history = my_model_mlp.fit(x=x_train_small, y=y_train_small,
                    epochs=20, batch_size=4,
                    validation_data=(x_valid, y_valid))

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


## 4.2 Simple recurrent network

In [None]:
T_max = max([len(x) for x in sequential_rep_st])
x_train_seq = pad_sequences(sequential_rep_st, maxlen=T_max, padding='post', truncating='post')
print(T_max)

x_train_small, y_train_small, x_valid, y_valid = extract_valid_data(x_train_seq, small_train_labels, valid_proportion=valid_proportion)
y_train_small = np.asarray(y_train_small).astype('float32')
y_valid = np.asarray(y_valid).astype('float32')
print(x_train_small.shape)
#x_train_small = np.reshape(x_train_small, (4, 131, 1))
print(x_train_small.shape)

131
(4, 131)
(4, 131)
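
What `padding='post'` and `truncating='post'` do can be sketched in pure Python. `pad_post` is a hypothetical stand-in written for illustration, not the Keras implementation.

```python
def pad_post(seq, maxlen, value=0):
    """Cut the sequence at maxlen, then fill the end with `value`."""
    trimmed = list(seq[:maxlen])
    return trimmed + [value] * (maxlen - len(trimmed))

print(pad_post([1, 2, 3], 5))           # [1, 2, 3, 0, 0]
print(pad_post([1, 2, 3, 4, 5, 6], 4))  # [1, 2, 3, 4]
```

Because `get_words` starts its indices at 1, the value 0 never denotes a real word and can safely serve as padding.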


In [None]:
def model_rnn(T_max):

  input_layer = Input(shape=(T_max,1))
  srnn1 = SimpleRNN(64, return_sequences=True)(input_layer)
  srnn2 = SimpleRNN(32, return_sequences=True)(srnn1)
  srnn3 = SimpleRNN(1, activation='sigmoid')(srnn2)
  model = Model(input_layer, srnn3)
  model.compile(optimizer='sgd',
              loss='binary_crossentropy', # for more than two classes: loss='categorical_crossentropy'
              metrics=['accuracy'])
  model.summary()
  return model

my_model_rnn = model_rnn(T_max)

Model: "model_12"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_13 (InputLayer)       [(None, 131, 1)]          0         
                                                                 
 simple_rnn_12 (SimpleRNN)   (None, 131, 64)           4224      
                                                                 
 simple_rnn_13 (SimpleRNN)   (None, 131, 32)           3104      
                                                                 
 simple_rnn_14 (SimpleRNN)   (None, 1)                 34        
                                                                 
Total params: 7,362
Trainable params: 7,362
Non-trainable params: 0
_________________________________________________________________
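
The counts in this summary follow from the structure of a simple recurrent layer: an input kernel, a recurrent kernel, and a bias. A minimal check (`simple_rnn_params` is a hypothetical helper):

```python
def simple_rnn_params(input_dim, units):
    """Input kernel + recurrent kernel + bias of a simple recurrent layer."""
    return units * input_dim + units * units + units

# (input dimension, units) for the three stacked layers of the summary above.
stack = [(1, 64), (64, 32), (32, 1)]
rnn_counts = [simple_rnn_params(d, u) for d, u in stack]
print(rnn_counts, sum(rnn_counts))  # [4224, 3104, 34] 7362
```

The count does not depend on the sequence length, since the same weights are reused at every time step.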


In [None]:
history = my_model_rnn.fit(x=x_train_small, y=y_train_small,
                    epochs=10, batch_size=4,
                    validation_data=(x_valid, y_valid))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


# 5 Application to the full dataset

In [None]:
clean_train_texts = clean_textlines(train_texts)
dict_words, index_words = get_words(clean_train_texts)
print(len(dict_words))

117382


## 5.1 Vocabulary reduction

In [None]:
def get_frequent_words(dict_words, n_freq_words, train_texts):
  list_value_key = [(v, k) for k,v in dict_words.items()]
  print(list_value_key[:10])
  freq_words = sorted(list_value_key, reverse=True)[:n_freq_words]
  print(freq_words[:10])
  _, selected_words = zip(*freq_words)
  selected_words = set(selected_words)  # constant-time membership tests in the loop below

  reduced_text = []
  for line in train_texts:
    line = ' '.join([word if word in selected_words else 'unk' for word in line.split(' ')])
    reduced_text.append(line)

  return reduced_text

reduced_train_texts = get_frequent_words(dict_words, 250, clean_train_texts)
reduced_dict_words, reduced_index_words = get_words(reduced_train_texts)
print(len(reduced_dict_words))
for i in range(5):
  print(reduced_train_texts[i])

[(75189, 'this'), (48007, 'was'), (21486, 'an'), (1481, 'absolutely'), (1585, 'terrible'), (41803, 'movie'), (8471, 'dont'), (26630, 'be'), (28, 'lured'), (93024, 'in')]
[(334678, 'the'), (162210, 'and'), (161936, 'a'), (145323, 'of'), (135041, 'to'), (106854, 'is'), (93024, 'in'), (77084, 'it'), (75717, 'i'), (75189, 'this')]
251
this was an unk unk movie dont be unk in by unk unk or unk unk both are great actors but this must unk be their worst role in unk even their great acting could not unk this movies unk unk this movie is an unk unk us unk unk the most unk scenes were those when the unk unk were making their unk for unk unk unk unk unk unk and her unk unk with unk was nothing but a unk unk unk in a movie that was unk of any real unk i am unk that there are movies like this unk actors like unk unk good unk i could unk unk through it
i have been unk to unk unk unk films but this is unk unk to a unk of things unk really unk being unk and unk on the unk and unk just unk a lot howeve

In [None]:
# MLP on vector (count) representations
vect_rep = vector_count_representation(reduced_train_texts, reduced_index_words)
x_train, y_train, x_valid, y_valid = extract_valid_data(vect_rep, train_labels, valid_proportion=valid_proportion)
y_train = np.asarray(y_train).astype('float32')
y_valid = np.asarray(y_valid).astype('float32')
y_test = np.asarray(test_labels).astype('float32')
print(x_train.shape)
my_model_mlp = model_mlp(len(reduced_index_words))
history_mlp = my_model_mlp.fit(x=x_train, y=y_train,
                    epochs=10, batch_size=32,
                    validation_data=(x_valid, y_valid))

(20000, 251)
Model: "model_13"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_14 (InputLayer)       [(None, 251)]             0         
                                                                 
 dense_20 (Dense)            (None, 64)                16128     
                                                                 
 dense_21 (Dense)            (None, 16)                1040      
                                                                 
 dense_22 (Dense)            (None, 1)                 17        
                                                                 
Total params: 17,185
Trainable params: 17,185
Non-trainable params: 0
_________________________________________________________________
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [None]:
# RNN on sequential representations
seq_rep = sequential_representation(reduced_train_texts, reduced_index_words)
T_max = min(max([len(x) for x in seq_rep]), 80)
print(T_max)
x_train, y_train, x_valid, y_valid = extract_valid_data(seq_rep, train_labels, valid_proportion=valid_proportion)
x_train = pad_sequences(x_train, maxlen=T_max, padding='post', truncating='post')
x_valid = pad_sequences(x_valid, maxlen=T_max, padding='post', truncating='post')
y_train = np.asarray(y_train).astype('float32')
y_valid = np.asarray(y_valid).astype('float32')
y_test = np.asarray(test_labels).astype('float32')

print(x_train.shape)
print(x_valid.shape)
my_model_rnn = model_rnn(T_max)

80
(20000, 80)
(5000, 80)
Model: "model_14"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_15 (InputLayer)       [(None, 80, 1)]           0         
                                                                 
 simple_rnn_15 (SimpleRNN)   (None, 80, 64)            4224      
                                                                 
 simple_rnn_16 (SimpleRNN)   (None, 80, 32)            3104      
                                                                 
 simple_rnn_17 (SimpleRNN)   (None, 1)                 34        
                                                                 
Total params: 7,362
Trainable params: 7,362
Non-trainable params: 0
_________________________________________________________________


In [None]:
history_rnn = my_model_rnn.fit(x=x_train, y=y_train,
                    epochs=10, batch_size=32,
                    validation_data=(x_valid, y_valid))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


## 5.2 Application

In [None]:
def model_gru(T_max):

  input_layer = Input(shape=(T_max,1))
  srnn1 = GRU(64, return_sequences=True)(input_layer)
  srnn2 = GRU(32, return_sequences=True)(srnn1)
  srnn3 = GRU(1, activation='sigmoid')(srnn2)
  model = Model(input_layer, srnn3)
  model.compile(optimizer='sgd',
              loss='binary_crossentropy', # for more than two classes: loss='categorical_crossentropy'
              metrics=['accuracy'])
  model.summary()
  return model

my_model_gru = model_gru(T_max)

Model: "model_15"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_16 (InputLayer)       [(None, 80, 1)]           0         
                                                                 
 gru_5 (GRU)                 (None, 80, 64)            12864     
                                                                 
 gru_6 (GRU)                 (None, 80, 32)            9408      
                                                                 
 gru_7 (GRU)                 (None, 1)                 105       
                                                                 
Total params: 22,377
Trainable params: 22,377
Non-trainable params: 0
_________________________________________________________________
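
The GRU counts can be verified the same way. The sketch assumes the Keras default `reset_after=True`, under which each of the three gates has an input kernel, a recurrent kernel, and two bias vectors. `gru_params` is a hypothetical helper.

```python
def gru_params(input_dim, units):
    """Parameter count of a GRU layer, assuming reset_after=True (two biases per gate)."""
    per_gate = units * input_dim + units * units + 2 * units
    return 3 * per_gate  # update gate, reset gate, candidate state

# (input dimension, units) for the three stacked layers of the summary above.
stack = [(1, 64), (64, 32), (32, 1)]
gru_counts = [gru_params(d, u) for d, u in stack]
print(gru_counts, sum(gru_counts))  # [12864, 9408, 105] 22377
```

For the same layer sizes, the GRU stack has roughly three times as many parameters as the simple recurrent stack, one set per gate.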


In [None]:
history_gru = my_model_gru.fit(x=x_train, y=y_train,
                    epochs=10, batch_size=32,
                    validation_data=(x_valid, y_valid))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [None]:
def model_gru_better(T_max, n_words, emb_dim):

  input_layer = Input(shape=(T_max,))
  mask_layer = Masking(mask_value=0.0)(input_layer)
  embedding_layer = Embedding(input_dim=n_words,     # vocabulary size
                              output_dim=emb_dim,    # dimension of the word vectors
                              input_length=T_max)(mask_layer) # length of a text
  srnn1 = GRU(32, return_sequences=True)(embedding_layer)
  srnn2 = GRU(16, return_sequences=True)(srnn1)
  flatten_layer = Flatten()(srnn2)
  dense1 = Dense(32, activation='relu')(flatten_layer)
  dense2 = Dense(1, activation='sigmoid')(dense1)
  #srnn3 = GRU(1, activation='sigmoid')(srnn2)
  model = Model(input_layer, dense2)
  model.compile(optimizer='rmsprop',
              loss='binary_crossentropy', # for more than two classes: loss='categorical_crossentropy'
              metrics=['accuracy'])
  model.summary()
  return model

my_model_gru_better = model_gru_better(T_max, len(dict_words), 128)

Model: "model_16"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_17 (InputLayer)       [(None, 80)]              0         
                                                                 
 masking_3 (Masking)         (None, 80)                0         
                                                                 
 embedding_1 (Embedding)     (None, 80, 128)           15024896  
                                                                 
 gru_8 (GRU)                 (None, 80, 32)            15552     
                                                                 
 gru_9 (GRU)                 (None, 80, 16)            2400      
                                                                 
 flatten_1 (Flatten)         (None, 1280)              0         
                                                                 
 dense_23 (Dense)            (None, 32)                409

In [None]:
history_gru_better = my_model_gru_better.fit(x=x_train, y=y_train,
                    epochs=10, batch_size=32,
                    validation_data=(x_valid, y_valid))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [None]:
clean_test_texts = clean_textlines(test_texts)
reduced_test_texts = get_frequent_words(dict_words, 250, clean_test_texts)
test_vect_rep = vector_count_representation(reduced_test_texts, reduced_index_words)
test_seq_rep = sequential_representation(reduced_test_texts, reduced_index_words)
test_seq_rep = pad_sequences(test_seq_rep, maxlen=T_max, padding='post', truncating='post')
print("MLP")
print(my_model_mlp.evaluate(test_vect_rep, y_test))
print("RNN vanilla")
print(my_model_rnn.evaluate(test_seq_rep, y_test))
print("GRU")
print(my_model_gru.evaluate(test_seq_rep, y_test))
print("GRU OPTIM")
print(my_model_gru_better.evaluate(test_seq_rep, y_test))

[(75189, 'this'), (48007, 'was'), (21486, 'an'), (1481, 'absolutely'), (1585, 'terrible'), (41803, 'movie'), (8471, 'dont'), (26630, 'be'), (28, 'lured'), (93024, 'in')]
[(334678, 'the'), (162210, 'and'), (161936, 'a'), (145323, 'of'), (135041, 'to'), (106854, 'is'), (93024, 'in'), (77084, 'it'), (75717, 'i'), (75189, 'this')]
MLP
[0.569901704788208, 0.7027199864387512]
RNN vanilla
[0.695966899394989, 0.5066400170326233]
GRU
[0.6922540068626404, 0.5143200159072876]
GRU OPTIM
101/782 [==>...........................] - ETA: 10s - loss: 0.8066 - accuracy: 0.6782
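
One way to read the `[loss, accuracy]` pairs above: on balanced binary data, a model that always predicts 0.5 has a cross-entropy of ln 2 ≈ 0.693. The simple RNN and GRU test losses (about 0.696 and 0.692) and their accuracies near 0.5 are close to that baseline. The sketch below only checks the baseline value; the labels are invented and `binary_crossentropy` is a hand-written helper, not the Keras function.

```python
import math

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy, with probabilities clipped away from 0 and 1."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# Balanced toy labels, invented for illustration.
labels = [0, 1, 1, 0]
constant = [0.5] * len(labels)
baseline_loss = binary_crossentropy(labels, constant)
print(round(baseline_loss, 4))  # 0.6931
```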