## I - Introduction

##### 2018 was a decisive year for NLP. Transfer learning, in particular models such as ELMo from Allen AI, GPT from OpenAI and BERT from Google, gave the rest of the NLP community pre-trained models that can be fine-tuned with less data and less compute time to produce results that exceed the previous state of the art. In this project, we show how to use BERT with the PyTorch library to quickly and efficiently fine-tune a classification model on OpenStack logs, with the goal of detecting anomalies in future logs.


- **Dataset** - we use an open-source OpenStack dataset (207,820 logs in total, of which we use only about 20K). It is distributed as a raw log file, so we used Excel to load the logs into tables that can be read as dataframes (Python would also work, but Excel was the simplest option).

- **Objective** - develop a solution that detects anomalies in the logs.

- **Methodology** - we treat the task as a text classification problem and build a deep learning model to reach that objective.
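
The dataset bullet above mentions that Python could replace Excel for loading the raw log file. A minimal sketch of that alternative follows, assuming one log entry per line; the function names and the file names in the usage comment are hypothetical and are not part of the original notebook.

```python
import csv


def log_file_to_rows(path):
    """Read a raw log file and return one {'log': line} dict per non-empty line."""
    rows = []
    with open(path, encoding="utf-8", errors="replace") as handle:
        for line in handle:
            line = line.rstrip("\r\n")
            if line.strip():
                rows.append({"log": line})
    return rows


def rows_to_csv(rows, csv_path):
    """Write the rows to a CSV file with a single 'log' column."""
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=["log"])
        writer.writeheader()
        writer.writerows(rows)


# Hypothetical usage (file names are placeholders):
# rows_to_csv(log_file_to_rows("openstack_normal.log"), "opennormal.csv")
```

The resulting CSV has the same single `log` column that the notebook later selects, so it could be read directly with `pd.read_csv`.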

# II - Different Models
Several types of models could be used to build a text classification solution. Some examples:


*   **1D Conv Net**: 
CNNs can be used for text classification. **Advantage**: they are faster to train, and a CNN can reach decent accuracy. **Disadvantage**: they struggle to capture long-range dependencies and capture little of the sequential information in the text.

*   **RNN-based models (LSTM, GRU)**: **Advantage**: they can capture the sequential nature of a text. **Disadvantage**: slower to train.

* **Transformer-based models (BERT, GPT-2)** -
Transformer-based models stack several Transformer blocks and use a multi-head attention mechanism. Their advantage is that they rely solely on attention, with no recurrence or convolution.
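
To make the attention mechanism mentioned in the last bullet more concrete, here is a minimal pure-Python sketch of single-head scaled dot-product attention. It is an illustration only: a real Transformer layer first applies learned linear projections to obtain the queries, keys and values and runs several heads in parallel, which this sketch omits. The toy vectors are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """Return (outputs, weights) for single-head attention on lists of vectors."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this query with every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Each output is a weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


# Three toy token vectors; in self-attention Q, K and V all come from the same tokens.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = scaled_dot_product_attention(tokens, tokens, tokens)
```

Every token attends to every other token in one step, which is why distance within the sequence matters less here than in an RNN.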

#### In this project we focus on a Transformer-based model (encoder part only): BERT.


# III - BERT for Anomaly Detection

### 1. Importing libraries and modules

In [None]:
!pip install transformers

Collecting transformers
  Downloading https://files.pythonhosted.org/packages/d5/43/cfe4ee779bbd6a678ac6a97c5a5cdeb03c35f9eaebbb9720b036680f9a2d/transformers-4.6.1-py3-none-any.whl (2.2MB)
Collecting huggingface-hub==0.0.8
  Downloading https://files.pythonhosted.org/packages/a1/88/7b1e45720ecf59c6c6737ff332f41c955963090a18e72acbcbeac6b25e86/huggingface_hub-0.0.8-py3-none-any.whl
Collecting sacremoses
  Downloading https://files.pythonhosted.org/packages/75/ee/67241dc87f266093c533a2d4d3d69438e57d7a90abb216fa076e7d475d4a/sacremoses-0.0.45-py3-none-any.whl (895kB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading https://files.pythonhosted.org/packages/ae/04/5b870f26a858552025a62f1649c20d29d2672c02ff3c3fb4c688ca46467a/tokenizers-0.10.2-cp37-cp37m-manylinux2010_x86_64.whl (3.3MB)
Installing 

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.utils import shuffle
import re
from google.colab import drive
from sklearn.model_selection import train_test_split
import torch
import torch.nn as nn
from sklearn.metrics import classification_report
import transformers
from transformers import AutoModel, BertTokenizerFast
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
from transformers import AdamW
from sklearn.utils.class_weight import compute_class_weight


### 2. Loading Data

#### **Importing files from Google Drive in Colab**
Our dataset is stored in Google Drive, so we need to link our Google Drive account to the notebook.
1. The first step is to mount Google Drive by running the code below.
2. We obtain the authorization code by signing in to our Google account.
3. We paste the authorization code and press Enter.

In [None]:
drive.mount('/gdrive')
%cd /gdrive

Mounted at /gdrive
/gdrive


In [None]:
PATH_normal=r'/gdrive/My Drive/opensn.xlsx'
PATH_abnormal=r'/gdrive/My Drive/Opensa.xlsx'

In [None]:
data_normal = pd.read_excel(PATH_normal)
data_abnormal = pd.read_excel(PATH_abnormal)

In [None]:
data_normal.to_csv(r'/gdrive/My Drive/opennormall.csv')
data_abnormal.to_csv(r'/gdrive/My Drive/openabnormall.csv')

In [None]:
df_normal = pd.read_csv(r'/gdrive/My Drive/opennormall.csv')
df_abnormal = pd.read_csv(r'/gdrive/My Drive/openabnormall.csv')

In [None]:
df_normal=df_normal[['log']]

In [None]:
df_abnormal=df_abnormal[['log']]

### 3. Data Preprocessing

* Add the labels (0 = normal log, 1 = abnormal log)

In [None]:
df_normal['label']=0
df_abnormal['label']=1

* Concatenate and shuffle the data

In [None]:
# concatenate
df=pd.concat([df_normal,df_abnormal])
# Shuffle the data
df = shuffle(df).reset_index(drop=True)
df.sample(10)

Unnamed: 0,log,label
18061,nova-api.log.1.2017-05-17_12:02:19 2017-05-16 ...,0
19009,nova-api.log.1.2017-05-17_12:02:19 2017-05-16 ...,0
13322,nova-api.log.2017-05-14_21:27:04 2017-05-14 19...,1
7770,nova-compute.log.1.2017-05-17_12:02:35 2017-05...,0
14855,nova-api.log.1.2017-05-17_12:02:19 2017-05-16 ...,0
2824,nova-api.log.1.2017-05-17_12:02:19 2017-05-16 ...,0
3342,nova-api.log.1.2017-05-17_12:02:19 2017-05-16 ...,0
613,nova-api.log.2017-05-14_21:27:04 2017-05-14 19...,1
16650,nova-api.log.1.2017-05-17_12:02:19 2017-05-16 ...,0
2322,nova-compute.log.1.2017-05-17_12:02:35 2017-05...,0


In [None]:
df.shape

(21451, 2)

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21451 entries, 0 to 21450
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   log     21451 non-null  object
 1   label   21451 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 335.3+ KB


In [None]:
print('Number of normal log : ', df["label"].value_counts()[0])
print('Number of abnormal log : ', df["label"].value_counts()[1])

Number of normal log :  14319
Number of abnormal log :  7132


* Next, we remove non-alphanumeric characters.

In [None]:
def clean_data(log):
    log = re.sub("'", "", log)
    log = re.sub("_", "", log)
    log = re.sub("(\\W)+", " ", log)
    log = log.lower()
    return log

df['log'] = df['log'].apply(clean_data)
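
To see what this cleaning does to a single entry, the sketch below repeats the same three substitutions on a shortened, hypothetical log line built from the file-name prefix visible in the table above (the trailing `INFO` is added for illustration).

```python
import re


def clean_data(log):
    # Same steps as the cell above: drop quotes and underscores,
    # collapse every run of non-word characters into one space, lowercase.
    log = re.sub("'", "", log)
    log = re.sub("_", "", log)
    log = re.sub(r"(\W)+", " ", log)
    return log.lower()


sample = "nova-api.log.1.2017-05-17_12:02:19 2017-05-16 INFO"  # hypothetical sample
cleaned = clean_data(sample)
print(cleaned)
# nova api log 1 2017 05 1712 02 19 2017 05 16 info
```

Note that deleting the underscore fuses the day and the hour (`17_12` becomes `1712`); replacing `_` with a space instead would keep them as separate tokens, if that distinction matters for the model.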

In [None]:
df.sample(5)

Unnamed: 0,log,label
717,nova api log 2017 05 1421 27 04 2017 05 14 20 ...,1
3654,nova api log 1 2017 05 1712 02 19 2017 05 16 1...,0
7721,nova api log 1 2017 05 1712 02 19 2017 05 16 1...,0
15574,nova compute log 1 2017 05 1712 02 35 2017 05 ...,0
12216,nova api log 1 2017 05 1712 02 19 2017 05 16 1...,0


* Split the dataset into train (90%), validation (5%) and test (5%) sets

In [None]:
train_text, temp_text, train_labels, temp_labels = train_test_split(df['log'], df['label'], 
                                                                    random_state=2018, 
                                                                    test_size=0.1, 
                                                                    stratify=df['label'])

# val set & test set
val_text, test_text, val_labels, test_labels = train_test_split(temp_text, temp_labels, 
                                                                random_state=2018, 
                                                                test_size=0.5, 
                                                                stratify=temp_labels)

### 4. Tokenization

* BERT expects its inputs in a specific format, so we need to:
1. Add special tokens at the start ([CLS]) and at the end ([SEP]) of each log.
2. Pad or truncate all logs to a single constant length ("padding").
3. Explicitly distinguish real tokens from padding tokens with the "attention mask" (1 = real token, 0 = padding).

PS: from the output of the last (12th) Transformer layer, only the first embedding (corresponding to the [CLS] token) is used by the classifier.
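
The three preparation steps listed above can be sketched in plain Python. This is an illustration of what the tokenizer does for us, not its actual implementation; the function name and the example token ids are hypothetical, while 101, 102 and 0 are the ids of `[CLS]`, `[SEP]` and `[PAD]` in the `bert-base-uncased` vocabulary.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # ids in the bert-base-uncased vocabulary


def prepare_for_bert(token_ids, max_len):
    """Add [CLS]/[SEP], truncate or pad to max_len, and build the attention mask."""
    # Step 1: keep room for the two special tokens, then wrap the sequence.
    body = token_ids[: max_len - 2]
    input_ids = [CLS_ID] + body + [SEP_ID]
    # Step 3: real tokens (including the special ones) get 1 in the mask.
    attention_mask = [1] * len(input_ids)
    # Step 2: fill the rest with [PAD], marked with 0 in the mask.
    pad = max_len - len(input_ids)
    input_ids += [PAD_ID] * pad
    attention_mask += [0] * pad
    return input_ids, attention_mask


ids, mask = prepare_for_bert([2053, 2213, 3449], max_len=8)  # hypothetical token ids
print(ids)
# [101, 2053, 2213, 3449, 102, 0, 0, 0]
print(mask)
# [1, 1, 1, 1, 1, 0, 0, 0]
```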

In [None]:
# use the GPU when available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [None]:
# Load the BERT tokenizer
tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')


* Measure the length of the longest log

In [None]:
max_len = 0
for sent in df.log:

    # Tokenize the text and add `[CLS]` and `[SEP]` tokens.
    input_ids = tokenizer.encode(sent, add_special_tokens=True)

    # Update the maximum sentence length.
    max_len = max(max_len, len(input_ids))

print('Max sentence length: ', max_len)


Max sentence length:  178


* Set a fixed sequence length for all logs

In [None]:
max_seq_len = 180

* Tokenize DataSet

In [None]:
# tokenize & encode sequences training set
tokens_train = tokenizer.batch_encode_plus(
    train_text.tolist(),
    add_special_tokens = True, # Add '[CLS]' and '[SEP]'
    max_length = max_seq_len,
    padding='max_length', # Pad all sentences to max_length (replaces the deprecated pad_to_max_length).
    truncation=True, # Truncate sentences longer than max_length.
    return_token_type_ids=False
)

# tokenize & encode sequences validation set
tokens_val = tokenizer.batch_encode_plus(
    val_text.tolist(),
    max_length = max_seq_len,
    padding='max_length',
    truncation=True,
    return_token_type_ids=False
)

# tokenize & encode sequences test set
tokens_test = tokenizer.batch_encode_plus(
    test_text.tolist(),
    max_length = max_seq_len,
    padding='max_length',
    truncation=True,
    return_token_type_ids=False
)



### 5. Convert Data to Tensors (PyTorch Data Types)

Our model expects PyTorch tensors rather than the Python lists returned by the tokenizer, so we convert all our variables.

In [None]:
# for train set
train_seq = torch.tensor(tokens_train['input_ids'])
train_mask = torch.tensor(tokens_train['attention_mask'])
train_y = torch.tensor(train_labels.tolist())

# for validation set
val_seq = torch.tensor(tokens_val['input_ids'])
val_mask = torch.tensor(tokens_val['attention_mask'])
val_y = torch.tensor(val_labels.tolist())

# for test set
test_seq = torch.tensor(tokens_test['input_ids'])
test_mask = torch.tensor(tokens_test['attention_mask'])
test_y = torch.tensor(test_labels.tolist())

### 6. Create DataLoaders

We also create an iterator over our dataset with the torch DataLoader class. It serves the data in mini-batches, so only one batch at a time has to be moved to the GPU during training instead of the whole dataset.
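
As a rough picture of what batch iteration means, here is a pure-Python sketch that yields mini-batches, optionally in random order. It only mimics the behaviour of `DataLoader` with a `RandomSampler` or `SequentialSampler`; it is not how PyTorch implements it, and the function name and toy data are hypothetical.

```python
import random


def iterate_batches(samples, batch_size, shuffle=False, seed=None):
    """Yield successive mini-batches (lists) from `samples`."""
    order = list(range(len(samples)))
    if shuffle:
        random.Random(seed).shuffle(order)  # random order, similar in spirit to RandomSampler
    for start in range(0, len(order), batch_size):
        yield [samples[i] for i in order[start:start + batch_size]]


toy = list(range(50))  # hypothetical stand-in for (input_ids, mask, label) triples
batches = list(iterate_batches(toy, batch_size=16, shuffle=True, seed=0))
print(len(batches), [len(b) for b in batches])
# 4 [16, 16, 16, 2]
```

The last batch is smaller whenever the number of samples is not a multiple of the batch size.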

In [None]:
#define a batch size
batch_size = 16

# wrap tensors
train_data = TensorDataset(train_seq, train_mask, train_y)

# sampler for sampling the data during training
train_sampler = RandomSampler(train_data)

# dataLoader for train set
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=batch_size)

# wrap tensors
val_data = TensorDataset(val_seq, val_mask, val_y)

# sampler for reading the validation data in a fixed order
val_sampler = SequentialSampler(val_data)

# dataLoader for validation set
val_dataloader = DataLoader(val_data, sampler = val_sampler, batch_size=batch_size)

### 7. Define Model Architecture

In [None]:
# import BERT-base pretrained model
bert = AutoModel.from_pretrained('bert-base-uncased')


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.decoder.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [None]:
bert.to(device)

BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0): BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          

* Freeze BERT Parameters

In [None]:
# freeze all the parameters
for param in bert.parameters():
    param.requires_grad = False

In [None]:
class BERT_Arch(nn.Module):

    def __init__(self, bert):
      
      super(BERT_Arch, self).__init__()

      self.bert = bert 
      
      # dropout layer
      self.dropout = nn.Dropout(0.1)
      
      # relu activation function
      self.relu =  nn.ReLU()

      # dense layer 1
      self.fc1 = nn.Linear(768,512)
      
      # dense layer 2 (Output layer)
      self.fc2 = nn.Linear(512,2)

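      # log-softmax activation (pairs with nn.NLLLoss)
      self.softmax = nn.LogSoftmax(dim=1)

    # define the forward pass
    def forward(self, sent_id, mask):

      # pass the inputs to BERT; with return_dict=False it returns
      # (sequence_output, pooled_output), and we keep the pooled [CLS] output
      _, cls_hs = self.bert(sent_id, attention_mask=mask, return_dict=False)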
      
      x = self.fc1(cls_hs)

      x = self.relu(x)

      x = self.dropout(x)

      # output layer
      x = self.fc2(x)
      # apply log-softmax activation
      x = self.softmax(x)

      return x

In [None]:
# pass the pre-trained BERT to our define architecture
model = BERT_Arch(bert)

# Tell pytorch to run this model on the GPU.
model = model.to(device)

Now that our model is loaded, we need to set the training hyperparameters.
- Batch size: 16 (chosen when we created the DataLoaders).
- Learning rate (AdamW): we use 2e-5.
- Number of epochs: we use 4.

In [None]:
# define the optimizer
optimizer = AdamW(model.parameters(), lr =2e-5)

* Find Class Weights

In [None]:
# compute the class weights (keyword arguments are required by recent scikit-learn versions)
class_wts = compute_class_weight(class_weight='balanced', classes=np.unique(train_labels), y=train_labels)

print(class_wts)

[0.74906876 1.5037389 ]
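
The `'balanced'` heuristic gives each class the weight `n_samples / (n_classes * count)`, so the rarer class weighs more in the loss. The sketch below applies that formula to the class counts of the full dataset reported earlier; the notebook computes the weights on the training split only, so its values differ slightly. The helper name is hypothetical.

```python
def balanced_class_weights(counts):
    """weights[c] = n_samples / (n_classes * count[c]), the 'balanced' heuristic."""
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {label: n_samples / (n_classes * n) for label, n in counts.items()}


# Class counts of the full dataset reported earlier (0 = normal, 1 = abnormal).
weights = balanced_class_weights({0: 14319, 1: 7132})
print({label: round(w, 3) for label, w in weights.items()})
# {0: 0.749, 1: 1.504}
```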


* Convert class weights to a tensor and define the loss

In [None]:
# convert class weights to tensor
weights= torch.tensor(class_wts,dtype=torch.float)
weights = weights.to(device)

# loss function
cross_entropy  = nn.NLLLoss(weight=weights) 

# number of training epochs
epochs = 4
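
Because the model ends with `LogSoftmax`, pairing it with `NLLLoss` amounts to a weighted cross-entropy. The following pure-Python sketch shows the computation under the default mean reduction, where the class weights also normalise the average; it is an illustration with hypothetical helper names and rounded weights, not the PyTorch implementation.

```python
import math


def log_softmax(logits):
    """Log of the softmax of one row of logits (numerically stable)."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return [z - log_norm for z in logits]


def weighted_nll(log_probs, labels, class_weights):
    """Weighted mean of -log p(true class); the weights also normalise the mean."""
    total = sum(-class_weights[y] * row[y] for row, y in zip(log_probs, labels))
    return total / sum(class_weights[y] for y in labels)


# A model that cannot separate the classes outputs equal logits for both.
log_probs = [log_softmax([0.0, 0.0]), log_softmax([0.0, 0.0])]
loss = weighted_nll(log_probs, labels=[0, 1], class_weights=[0.749, 1.504])
print(round(loss, 3))
# 0.693
```

The value ln 2 ≈ 0.693 is therefore a useful reference point: it is the loss of a two-class model that assigns equal probability to both classes, and the losses printed during training can be read against it.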

### 8. Fine-Tune BERT

In [None]:
# function to train the model
def train():
  
  model.train()

  total_loss, total_accuracy = 0, 0
  
  # empty list to save model predictions
  total_preds=[]
  
  # iterate over batches
  for step,batch in enumerate(train_dataloader):
    
    # progress update after every 50 batches.
    if step % 50 == 0 and not step == 0:
      print('  Batch {:>5,}  of  {:>5,}.'.format(step, len(train_dataloader)))

    # push the batch to gpu
    batch = [r.to(device) for r in batch]
 
    sent_id, mask, labels = batch

    # clear previously calculated gradients 
    model.zero_grad()        

    # get model predictions for the current batch
    preds = model(sent_id, mask)

    # compute the loss between actual and predicted values
    loss = cross_entropy(preds, labels)

    # add on to the total loss
    total_loss = total_loss + loss.item()

    # backward pass to calculate the gradients
    loss.backward()

    # clip the gradient norm to 1.0; this helps prevent the exploding gradient problem
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

    # update parameters
    optimizer.step()

    # model predictions are stored on GPU. So, push it to CPU
    preds=preds.detach().cpu().numpy()

    # append the model predictions
    total_preds.append(preds)

  # compute the training loss of the epoch
  avg_loss = total_loss / len(train_dataloader)
  
  # predictions are in the form of (no. of batches, size of batch, no. of classes).
  # reshape the predictions in form of (number of samples, no. of classes)
  total_preds  = np.concatenate(total_preds, axis=0)

  #returns the loss and predictions
  return avg_loss, total_preds
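
The gradient clipping step used in `train()` rescales all gradients by one common factor whenever their joint L2 norm exceeds the threshold. Here is a minimal pure-Python sketch of that idea, with a hypothetical helper name and toy gradients; it mirrors the behaviour of `clip_grad_norm_` but is not its implementation.

```python
import math


def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Scale all gradients by one factor so their joint L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    scale = min(1.0, max_norm / (total_norm + eps))
    return [[g * scale for g in vec] for vec in grads], total_norm


# Two hypothetical parameter gradients with a joint norm of 5.0.
clipped, norm_before = clip_by_global_norm([[3.0], [4.0]], max_norm=1.0)
```

The direction of the update is preserved; only its length is capped, and gradients already below the threshold are left untouched.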

In [None]:
# function for evaluating the model
def evaluate():
  
  print("\nEvaluating...")
  
  # deactivate dropout layers
  model.eval()

  total_loss, total_accuracy = 0, 0
  
  # empty list to save the model predictions
  total_preds = []

  # iterate over batches
  for step,batch in enumerate(val_dataloader):
    
    # progress update every 50 batches.
    if step % 50 == 0 and not step == 0:
      print('  Batch {:>5,}  of  {:>5,}.'.format(step, len(val_dataloader)))

    # push the batch to gpu
    batch = [t.to(device) for t in batch]

    sent_id, mask, labels = batch

    # deactivate autograd
    with torch.no_grad():
      
      # model predictions
      preds = model(sent_id, mask)
      # compute the validation loss between actual and predicted values
      loss = cross_entropy(preds,labels)

      total_loss = total_loss + loss.item()

      preds = preds.detach().cpu().numpy()

      total_preds.append(preds)

  # compute the validation loss of the epoch
  avg_loss = total_loss / len(val_dataloader) 

  # reshape the predictions in form of (number of samples, no. of classes)
  total_preds  = np.concatenate(total_preds, axis=0)

  return avg_loss, total_preds

### 9. Model Training

In [None]:
# set initial loss to infinite
best_valid_loss = float('inf')

# empty lists to store training and validation loss of each epoch
train_losses=[]
valid_losses=[]

#for each epoch
for epoch in range(epochs):
     
    print('\n Epoch {:} / {:}'.format(epoch + 1, epochs))
    
    #train model
    train_loss, _ = train()
    
    #evaluate model
    valid_loss, _ = evaluate()
    
    # append training and validation loss
    train_losses.append(train_loss)
    valid_losses.append(valid_loss)
    
    print(f'\nTraining Loss: {train_loss:.3f}')
    print(f'Validation Loss: {valid_loss:.3f}')


 Epoch 1 / 4
  Batch    50  of  1,207.
  Batch   100  of  1,207.
  Batch   150  of  1,207.
  Batch   200  of  1,207.
  Batch   250  of  1,207.
  Batch   300  of  1,207.
  Batch   350  of  1,207.
  Batch   400  of  1,207.
  Batch   450  of  1,207.
  Batch   500  of  1,207.
  Batch   550  of  1,207.
  Batch   600  of  1,207.
  Batch   650  of  1,207.
  Batch   700  of  1,207.
  Batch   750  of  1,207.
  Batch   800  of  1,207.
  Batch   850  of  1,207.
  Batch   900  of  1,207.
  Batch   950  of  1,207.
  Batch 1,000  of  1,207.
  Batch 1,050  of  1,207.
  Batch 1,100  of  1,207.
  Batch 1,150  of  1,207.
  Batch 1,200  of  1,207.

Evaluating...
  Batch    50  of    121.
  Batch   100  of    121.

Training Loss: 0.691
Validation Loss: 0.679

 Epoch 2 / 4
  Batch    50  of  1,207.
  Batch   100  of  1,207.
  Batch   150  of  1,207.
  Batch   200  of  1,207.
  Batch   250  of  1,207.
  Batch   300  of  1,207.
  Batch   350  of  1,207.
  Batch   400  of  1,207.
  Batch   450  of  1,207.
  

### 10. Get Predictions for Test Data

In [None]:
# get predictions for test data
with torch.no_grad():
  preds = model(test_seq.to(device), test_mask.to(device))
  preds = preds.detach().cpu().numpy()

In [None]:
# model's performance
preds = np.argmax(preds, axis = 1)
print(classification_report(test_y, preds))

              precision    recall  f1-score   support

           0       0.74      0.85      0.79       144
           1       0.56      0.38      0.45        71

    accuracy                           0.70       215
   macro avg       0.65      0.62      0.62       215
weighted avg       0.68      0.70      0.68       215
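
To help read the report, the sketch below shows how precision, recall and F1 follow from confusion-matrix counts. The counts are HYPOTHETICAL: they were chosen only because they are consistent with the rounded figures and supports in the report above, and they are not taken from the actual run.

```python
def per_class_metrics(tp, fp, fn):
    """Precision, recall and F1 for one class from its confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# HYPOTHETICAL counts, chosen only to be consistent with the rounded report:
# 144 normal logs (123 predicted normal), 71 abnormal logs (27 predicted abnormal).
tn, fp, fn, tp = 123, 21, 44, 27

abnormal = per_class_metrics(tp=tp, fp=fp, fn=fn)
normal = per_class_metrics(tp=tn, fp=fn, fn=fp)  # roles swap for the other class
accuracy = (tp + tn) / (tp + tn + fp + fn)
print([round(v, 2) for v in abnormal], [round(v, 2) for v in normal], round(accuracy, 2))
# [0.56, 0.38, 0.45] [0.74, 0.85, 0.79] 0.7
```

Under these assumed counts, a recall of 0.38 on class 1 would mean that roughly 44 of the 71 abnormal logs are missed, which is the figure that matters most for anomaly detection.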



BERT did not manage to classify the logs well. The length of the logs played a large role in this result (a limitation of BERT), and BERT is pre-trained on general-purpose words rather than on symbol-heavy strings such as:
"nova-compute.log.2017-05-14_21:27:09 2017-05-14 19:39:15.195 2931 INFO nova.virt.libvirt.imagecache req-addc1839-2ed5-4778-b57e-5854eb7b8b09 - - - - - image 0673dd71-34c5-4fbb-86c4-40623fbe45b4 at (/var/lib/nova/instances/_base/a489c868f0c37da93b76227c91bb03908ac0e742): checking".