## Simple Example of Normal Autoencoder

The idea behind this approach is that the model must be able to reconstruct a log entry after compressing it into a low-dimensional representation. If we train the model on normal logs (the majority of the logs in our dataset), it learns to reconstruct normal logs well. When an anomalous log is presented, the reconstruction should be noticeably worse, and that drop in reconstruction quality is how we identify the anomaly.
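
As a rough illustration of that idea, the sketch below scores each entry by its reconstruction error and flags the entries whose error is unusually high. The tiny `nn.Sequential` model, the random data, and the 99th-percentile cutoff are placeholders chosen for the example; none of them come from this notebook.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Placeholder autoencoder: 8 features squeezed through a 2-dimensional bottleneck
toy_autoencoder = nn.Sequential(nn.Linear(8, 2), nn.ReLU(), nn.Linear(2, 8))

def reconstruction_errors(model, batch):
    """Return one mean-squared reconstruction error per row of `batch`."""
    model.eval()
    with torch.no_grad():
        reconstructed = model(batch)
    return ((reconstructed - batch) ** 2).mean(dim=1)

# Hypothetical data: random rows standing in for "normal" log vectors
normal_batch = torch.randn(200, 8)
errors = reconstruction_errors(toy_autoencoder, normal_batch)

# Assumed cutoff: anything above the 99th percentile of the reference errors is flagged
threshold = torch.quantile(errors, 0.99)
flagged = errors > threshold
print(f"threshold={threshold.item():.4f}, flagged={int(flagged.sum())} of {len(errors)}")
```

In practice the cutoff would be derived from errors on held-out normal logs rather than from random data, and the right percentile depends on how many false alarms are acceptable.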

In [1]:
from google.colab import drive

# Mount Google Drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
# imports:

import pandas as pd
from torch.utils.data import Dataset, DataLoader
import torch.optim as optim
import torch
import torch.nn as nn
import numpy as np
from tqdm import tqdm
from sklearn.model_selection import train_test_split

from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline


from skipgram import *
import os

from torch.nn.utils.rnn import pad_sequence
from torch.nn.functional import cosine_similarity

In [3]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [4]:
ROOT_DIR = os.path.dirname(os.path.abspath(""))

### Step 1: Preprocessing

To save time, the code that merges all the CSVs into a single DataFrame should go here; I ran out of time before adding it. That is the only piece still missing, along with training the model and downloading it.

In [5]:
# Load the CSV file into a pandas DataFrame
logs_df = pd.read_csv('/content/drive/MyDrive/oscar/sitges_access_prepared_whole_set_but_last.csv')

### Step 2: Dataset


In [6]:
# Split the data into train, validation, and test sets
X_train, X_temp = train_test_split(logs_df, test_size=0.4, random_state=42)
X_val, X_test = train_test_split(X_temp, test_size=0.5, random_state=42)
X_train.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,minute_cos,petition_-,petition_GET,petition_HEAD,petition_POST,petition_other,status_1,status_2,status_3,status_4
71104,-0.333113,-0.09744,-0.154752,-0.094864,-0.728179,-0.086234,0.191327,0.193671,-0.67362,-0.995385,...,-0.978148,0,1,0,0,0,0,1,0,0
372547,-0.123155,-0.437729,0.092965,0.2728,-0.367416,-0.61879,-0.213928,0.213864,-0.437109,-1.328275,...,0.913545,0,1,0,0,0,0,1,0,0
16053,-0.392114,0.186366,-0.197917,0.008702,-0.552015,-0.151153,-0.445999,0.399997,-0.108754,-0.268425,...,-0.913545,0,1,0,0,0,0,1,0,0
429070,-0.152999,-0.088565,-0.096604,0.014419,-0.143944,-0.537258,-0.268547,0.428526,-0.534418,-0.766975,...,0.866025,0,1,0,0,0,0,1,0,0
138182,-0.274839,0.249271,0.163601,0.248794,0.203307,-0.59914,-0.152062,0.453232,-0.485348,-0.475886,...,-0.951057,0,1,0,0,0,0,1,0,0
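
The two chained `train_test_split` calls give a 60/20/20 split: 40% is held out first, and that held-out part is then cut in half. A minimal sketch on a made-up 1000-row frame (a stand-in for `logs_df`) makes the resulting sizes explicit:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical frame standing in for logs_df
toy_df = pd.DataFrame(np.random.default_rng(0).normal(size=(1000, 4)))

# Same two-step split as in the cell above
train_part, temp_part = train_test_split(toy_df, test_size=0.4, random_state=42)
val_part, test_part = train_test_split(temp_part, test_size=0.5, random_state=42)

sizes = {"train": len(train_part), "val": len(val_part), "test": len(test_part)}
print(sizes)
print({name: n / len(toy_df) for name, n in sizes.items()})
```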


In [7]:
class CustomDataset(Dataset):
    def __init__(self, dataframe):
        self.data = dataframe.values.astype(np.float32) # Assuming dataframe is a pandas DataFrame

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        sample = self.data[idx]
        return torch.tensor(sample)

In [8]:
train_dataset = CustomDataset(X_train)
train_loader = DataLoader(train_dataset, batch_size=1000, shuffle=False)

val_dataset = CustomDataset(X_val)
val_loader = DataLoader(val_dataset, batch_size=1000, shuffle=False)

test_dataset = CustomDataset(X_test)
test_loader = DataLoader(test_dataset, batch_size=1000, shuffle=False)

### Step 3: Model

In [9]:
# Define Autoencoder Model
class LogAnomalyDetector(nn.Module):
    def __init__(self, input_size, hidden_size):
        super(LogAnomalyDetector, self).__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_size, hidden_size[0]),
            nn.ReLU(),
            nn.Linear(hidden_size[0], hidden_size[1]),
            nn.ReLU(),
            nn.Linear(hidden_size[1], hidden_size[2])
        )
        self.decoder = nn.Sequential(
            nn.Linear(hidden_size[2], hidden_size[1]),
            nn.ReLU(),
            nn.Linear(hidden_size[1], hidden_size[0]),
            nn.ReLU(),
            nn.Linear(hidden_size[0], input_size)
        )

    def forward(self, x):
        emb = self.encoder(x)
        #print("Latent space:", emb)
        x = self.decoder(emb)
        return x

input_size = len(logs_df.columns)
hidden_size = [50, 20, 5]  # input_size // 4
num_layers = 1  # not used by this model

# Create an instance of the model
model = LogAnomalyDetector(input_size=input_size, hidden_size=hidden_size)
model.to(device)

# Print model summary
print(model)

LogAnomalyDetector(
  (encoder): Sequential(
    (0): Linear(in_features=115, out_features=50, bias=True)
    (1): ReLU()
    (2): Linear(in_features=50, out_features=20, bias=True)
    (3): ReLU()
    (4): Linear(in_features=20, out_features=5, bias=True)
  )
  (decoder): Sequential(
    (0): Linear(in_features=5, out_features=20, bias=True)
    (1): ReLU()
    (2): Linear(in_features=20, out_features=50, bias=True)
    (3): ReLU()
    (4): Linear(in_features=50, out_features=115, bias=True)
  )
)
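
To get a feel for how tight the bottleneck is, the following sketch rebuilds the layer sizes from the summary above as plain `nn.Sequential` stacks (a stand-in for `LogAnomalyDetector`, not the class itself), checks the shape of the latent vector, and counts the trainable parameters. The random batch is made up for the example.

```python
import torch
import torch.nn as nn

# Same layer sizes as the printed summary: 115 -> 50 -> 20 -> 5 -> 20 -> 50 -> 115
encoder = nn.Sequential(
    nn.Linear(115, 50), nn.ReLU(),
    nn.Linear(50, 20), nn.ReLU(),
    nn.Linear(20, 5),
)
decoder = nn.Sequential(
    nn.Linear(5, 20), nn.ReLU(),
    nn.Linear(20, 50), nn.ReLU(),
    nn.Linear(50, 115),
)

def count_parameters(module):
    """Number of trainable weights and biases in a module."""
    return sum(p.numel() for p in module.parameters() if p.requires_grad)

batch = torch.randn(4, 115)  # hypothetical batch of 4 log vectors
latent = encoder(batch)
reconstruction = decoder(latent)

print("latent shape:", tuple(latent.shape))
print("reconstruction shape:", tuple(reconstruction.shape))
print("trainable parameters:", count_parameters(encoder) + count_parameters(decoder))
```

Each 115-value entry is squeezed into 5 numbers, a 23-fold reduction, so the network can only keep the structure that most entries share.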


### Step 4: Train the model

In [10]:
# Function for training the model; returns the average loss of each epoch
def train_model(model, train_loader, criterion, optimizer, num_epochs=10):
    model.train()
    epoch_losses = []
    for epoch in range(num_epochs):
        running_loss = 0.0
        for i, inputs in tqdm(enumerate(train_loader)):
            inputs = inputs.to(device)

            optimizer.zero_grad()

            # The autoencoder is trained to reproduce its own input
            outputs = model(inputs)
            loss = criterion(outputs.squeeze(), inputs.squeeze())

            loss.backward()
            optimizer.step()

            running_loss += loss.item()

        epoch_loss = running_loss / len(train_loader)
        epoch_losses.append(epoch_loss)
        print(f"Epoch [{epoch+1}/{num_epochs}], Loss: {epoch_loss}")

    return epoch_losses



In [11]:
# Define the criterion (loss function)
criterion = nn.MSELoss()

# Define the optimizer
optimizer = optim.Adam(model.parameters(), lr=0.001)  # Adjust learning rate as needed

# Train the model
num_epochs = 20

train_loss = train_model(model, train_loader, criterion, optimizer, num_epochs=num_epochs)

534it [00:05, 92.27it/s]
Epoch [1/20], Loss: 0.08764286366490166
534it [00:06, 86.35it/s]
Epoch [2/20], Loss: 0.049868681297152676
534it [00:05, 104.31it/s]
Epoch [3/20], Loss: 0.047958169718471805
534it [00:05, 90.56it/s]
Epoch [4/20], Loss: 0.04538564170511921
534it [00:05, 96.07it/s]
Epoch [5/20], Loss: 0.043962398090500954
534it [00:05, 106.57it/s]
Epoch [6/20], Loss: 0.04331939338884327
534it [00:06, 83.38it/s]
Epoch [7/20], Loss: 0.04264342656519529
534it [00:05, 105.40it/s]
Epoch [8/20], Loss: 0.04213571413123652
534it [00:06, 88.39it/s]
Epoch [9/20], Loss: 0.041813429047384956
534it [00:05, 96.86it/s]
Epoch [10/20], Loss: 0.04158894991160332
534it [00:05, 105.04it/s]
Epoch [11/20], Loss: 0.04142337720771407
534it [00:06, 85.38it/s]
Epoch [12/20], Loss: 0.04129487549544274
534it [00:05, 100.27it/s]
Epoch [13/20], Loss: 0.04118955597867457
534it [00:06, 82.35it/s]
Epoch [14/20], Loss: 0.04109313778495521
534it [00:05, 102.32it/s]
Epoch [15/20], Loss: 0.04100658329489749
534it [00:05, 97.75it/s]
Epoch [16/20], Loss: 0.040924259318450416
534it [00:06, 86.61it/s]
Epoch [17/20], Loss: 0.04084469608209106
534it [00:05, 104.16it/s]
Epoch [18/20], Loss: 0.04076496230556947
534it [00:06, 82.06it/s]
Epoch [19/20], Loss: 0.04067158990640765
534it [00:05, 104.61it/s]
Epoch [20/20], Loss: 0.040562229740262475
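
Training loss alone does not show whether the model is overfitting. A common complement is to record the validation loss after every epoch so the two curves can be compared. The sketch below does this on made-up data with a small stand-in model; `fit_with_history` and `mean_loss` are hypothetical helpers, not functions from this notebook.

```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

def mean_loss(model, loader, criterion):
    """Average the per-batch loss over a loader without updating the weights."""
    model.eval()
    total = 0.0
    with torch.no_grad():
        for (inputs,) in loader:
            total += criterion(model(inputs), inputs).item()
    return total / len(loader)

def fit_with_history(model, train_loader, val_loader, criterion, optimizer, num_epochs):
    """Train an autoencoder and record train/validation loss after each epoch."""
    history = {"train": [], "val": []}
    for _ in range(num_epochs):
        model.train()
        running = 0.0
        for (inputs,) in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(inputs), inputs)
            loss.backward()
            optimizer.step()
            running += loss.item()
        history["train"].append(running / len(train_loader))
        history["val"].append(mean_loss(model, val_loader, criterion))
    return history

# Hypothetical stand-in data and model
toy_train = DataLoader(TensorDataset(torch.randn(256, 8)), batch_size=64)
toy_val = DataLoader(TensorDataset(torch.randn(64, 8)), batch_size=64)
toy_model = nn.Sequential(nn.Linear(8, 3), nn.ReLU(), nn.Linear(3, 8))
toy_optimizer = optim.Adam(toy_model.parameters(), lr=0.01)

history = fit_with_history(toy_model, toy_train, toy_val, nn.MSELoss(), toy_optimizer, num_epochs=5)
for epoch, (tr, va) in enumerate(zip(history["train"], history["val"]), start=1):
    print(f"epoch {epoch}: train={tr:.4f} val={va:.4f}")
```

The two lists in `history` can be passed straight to a plotting library to draw the curves.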





### Step 5: Test the model

In [12]:
# Test the model; returns the average loss over the loader
def test_model(model, test_loader, criterion):

    model.eval()  # Set model to evaluation mode

    test_loss = 0.0

    with torch.no_grad():
        for inputs in test_loader:
            inputs = inputs.to(device)

            outputs = model(inputs)
            loss = criterion(outputs.squeeze(), inputs.squeeze())

            test_loss += loss.item()

    avg_test_loss = test_loss / len(test_loader)
    print(f"Average Test Loss: {avg_test_loss}")
    return avg_test_loss

In [13]:
val_loss = test_model(model, val_loader, criterion)

Average Test Loss: 0.04044552532474646


In [14]:
test_loss = test_model(model, test_loader, criterion)

Average Test Loss: 0.04053921034831679


## From here on I run tests with logs and write notes on how to continue with the model; you don't need to look at this

In [15]:
# Sample log entry (CustomDataset already returns a tensor)
log_entry = test_dataset[35]

# Add a batch dimension and move the entry to the model's device
log_tensor = log_entry.unsqueeze(0).to(device)

# Pass the log tensor through the model
output_log_tensor = model(log_tensor)

# Convert the output tensor back to a numpy array
#output_log_entry = output_log_tensor.squeeze().detach().cpu().numpy()

# Output the reconstructed log entry
print("Reconstructed Log Entry:")
print("Input:")
print(log_tensor)
print("Output:")
print(output_log_tensor)


Reconstructed Log Entry:
Input:


tensor([[-0.2920, -0.4493, -0.1840,  0.1897, -0.4050, -0.3815, -0.0033,  0.1816,
         -0.5851, -1.2037,  1.1592, -0.5351, -0.0380, -0.3548, -0.8190, -0.7348,
          0.3678,  0.1169,  0.3443, -0.6040, -0.9051,  0.7812,  0.0525, -0.1780,
         -0.4797,  0.1021, -0.5613, -0.9422, -0.2094,  0.0615,  2.2997, -0.6250,
          0.1860,  0.5531, -0.8339,  0.1611, -0.0180,  0.6361, -0.1663, -0.4541,
          0.3472,  0.2737, -0.3329, -0.5535,  0.8550, -0.1287,  0.3145,  0.3071,
          0.4009,  0.2895,  0.2259,  0.3145, -0.2274,  0.4788, -0.0969, -0.2542,
         -0.0932, -0.2083,  0.6219, -0.3068,  0.4183,  0.2237, -0.6051, -0.1068,
          0.2934,  0.3182, -0.0763, -0.6612, -0.1029, -0.3324, -0.0741,  0.0672,
         -0.0412, -0.3145, -0.7690, -0.2783, -0.3512,  0.2889, -0.0064, -0.1137,
          0.0753, -0.0852,  0.2571, -0.0048,  0.0234,  0.0731,  0.5162, -0.2457,
          0.3249,  0.1492, -0.3155,  0.0000, -0.7377, -1.2431,  0.2670,  0.7229,
          0.5000,  0.8660, -

In [16]:
print(cosine_similarity(log_tensor, output_log_tensor))

tensor([0.9227], device='cuda:0', grad_fn=<SumBackward1>)


In [17]:
print(len(test_dataset))
count = 0
for log_entry in test_dataset:
    # Each dataset item is already a tensor: add a batch dimension and move it to the device
    log_tensor = log_entry.unsqueeze(0).to(device)

    # Reconstruct the entry and compare it with the input
    output_log_tensor = model(log_tensor)
    similarity = cosine_similarity(log_tensor, output_log_tensor)
    if similarity < 0.5:
        count += 1
        print(similarity)
print(count)

177894


tensor([0.4863], device='cuda:0', grad_fn=<SumBackward1>)
1
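
The loop above calls the model once per entry. The same check can be run a batch at a time, which is usually much faster. The sketch below is one way to do it, using a small stand-in model and random data; `low_similarity_indices` is a hypothetical helper, and it assumes the loader is not shuffled so that positions map back to the dataset order.

```python
import torch
import torch.nn as nn
from torch.nn.functional import cosine_similarity
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

def low_similarity_indices(model, loader, threshold=0.5):
    """Return dataset positions whose input/reconstruction cosine similarity is below `threshold`."""
    model.eval()
    flagged, offset = [], 0
    with torch.no_grad():
        for (inputs,) in loader:
            # One similarity value per row of the batch
            sims = cosine_similarity(inputs, model(inputs), dim=1)
            hits = torch.nonzero(sims < threshold).flatten() + offset
            flagged.extend(hits.tolist())
            offset += inputs.shape[0]
    return flagged

# Hypothetical stand-ins for the test loader and the trained model
toy_loader = DataLoader(TensorDataset(torch.randn(10, 6)), batch_size=4, shuffle=False)
toy_model = nn.Sequential(nn.Linear(6, 2), nn.ReLU(), nn.Linear(2, 6))
print(low_similarity_indices(toy_model, toy_loader))
```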


In [18]:
# To get the scoring system, compute the similarity between the output vector and the input vector. (This is just an idea)
# An anomaly should give low similarity (assuming the model accurately reconstructs its input vector)
# Find out how to tell which log is the bad one: given a result I only know the vector, not the log. Find a way to associate each log with its vector.
# Do the training offline and use the model online.
# To train properly, build a CSV with all the logs except those from the day of the attack. Then filter out the logs with a high level to remove them, and create a new clean CSV.
# The last log file is the one that contains an attack.

# The median of the word2vec part
# Try a dimension of 26 for the hidden layers, because the similarity should be 1
# Look into the TimeStep
# Justify properly why I built this model and how I wanted to achieve it. The professor does not understand why a single log is used for an LSTM
# Plot the training and validation loss curves

# Build an interface to view the bad logs and the score each model gives them. We could use the last CSV to compare the results of the 3 models

# In the report, talk about what we have learned and how we have evolved

# If the level of a log is greater than 10, it is most likely an attack, so use this to discard logs. For the levels I would need Wazuh installed, so leave it for now. Ask Erik whether he can do it for my CSVs or share his.

# The clean file does not add the embeddings. That is done in the guide file, but I don't know why it is not done with the right dimensions. I need to investigate it. If I manage to do it with all the features, tell the group in case anyone is interested.
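
One possible way to tie a flagged vector back to its log, as raised in the notes above, is to have the dataset return each row's DataFrame index label together with the tensor. The sketch below assumes integer index labels; `IndexedDataset` is a hypothetical variant of `CustomDataset`, and the small frame is made up for the example.

```python
import numpy as np
import pandas as pd
import torch
from torch.utils.data import Dataset, DataLoader

class IndexedDataset(Dataset):
    """Hypothetical variant of CustomDataset that also returns each row's index label."""
    def __init__(self, dataframe):
        self.labels = dataframe.index.to_numpy()  # assumed to be integers
        self.data = dataframe.values.astype(np.float32)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return int(self.labels[idx]), torch.from_numpy(self.data[idx])

# Made-up frame whose index mimics the original row labels of the log file
toy_df = pd.DataFrame(
    np.arange(12, dtype=np.float32).reshape(4, 3),
    index=[71104, 372547, 16053, 429070],
)
loader = DataLoader(IndexedDataset(toy_df), batch_size=2, shuffle=False)
for labels, batch in loader:
    # `labels` says which original rows the vectors in `batch` came from
    print(labels.tolist(), tuple(batch.shape))
```

With the labels available, a low-similarity result can be looked up in the original DataFrame with `toy_df.loc[label]`.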

In [19]:
percentage=count/len(test_dataset)
percentage

5.621325058742847e-06

In [20]:
# Save the model state dictionary to Google Drive
torch.save(model.state_dict(), '/content/drive/MyDrive/oscar/normalAutoencoder.pt')
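
Since only the state dictionary is saved, loading it later requires rebuilding the same architecture first. The sketch below shows the round trip with a small stand-in model and a temporary file; the file name and `build_model` are placeholders, and the notebook itself would rebuild `LogAnomalyDetector` with the same `input_size` and `hidden_size`.

```python
import os
import tempfile

import torch
import torch.nn as nn

def build_model():
    # Stand-in architecture; it must match the one whose weights were saved
    return nn.Sequential(nn.Linear(8, 3), nn.ReLU(), nn.Linear(3, 8))

trained = build_model()
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_autoencoder.pt")  # hypothetical file name
    torch.save(trained.state_dict(), path)

    # Recreate the architecture, then load the saved weights into it
    restored = build_model()
    restored.load_state_dict(torch.load(path, map_location="cpu"))
restored.eval()

sample = torch.randn(2, 8)
with torch.no_grad():
    same = torch.equal(trained(sample), restored(sample))
print("outputs match:", same)
```

`map_location="cpu"` lets weights saved from a GPU session be loaded on a machine without CUDA.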