# Intrusion Detection for Unix Processes

## Methodology Overview

The dataset comprises system call sequences for Unix processes, both normal and anomalous, which we aim to classify accurately.

### Steps followed:

1. **Preprocessing Sequences:**
   The training and test sequences are processed from `.train` and `.test` files into fixed-length chunks. This is a crucial step since the Negative Selection Algorithm requires fixed-length patterns for training and classification.

2. **Parameter Selection:**
   We choose values for `n` (the detector length) and `r` (the number of contiguous positions at which a detector must agree with a string in order to match it). Only detectors that match none of the self-strings are kept, so both parameters are critical for the algorithm's performance.

3. **Training the Algorithm & Classification of Test Sequences:**
   - The algorithm is trained on the preprocessed training sequences to create a set of detectors, which will help in identifying anomalies in the test sequences.
   - The trained algorithm is applied to the preprocessed test sequences. We calculate the number of matching patterns for each chunk and merge these per-chunk counts into a composite anomaly score for the sequence (for example, their average). This score serves as the basis for classifying the sequence as normal or anomalous.

4. **AUC Analysis:**
   Finally, we perform an AUC analysis using the `.labels` files to evaluate the quality of our classification. The `.labels` files contain the actual classification of the sequences, providing us with a ground truth to measure against.
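
The merging step described in item 3 can be sketched as follows. This is a minimal illustration that assumes the composite score is the mean of the per-chunk scores; the helper name `merge_chunk_scores` and both of its arguments are hypothetical and not part of the original pipeline.

```python
def merge_chunk_scores(chunk_scores, chunks_per_sequence):
    """Average flat per-chunk scores back into one score per sequence.

    chunk_scores: flat list of numbers, in the order the chunks were written.
    chunks_per_sequence: how many chunks each sequence produced (assumed known).
    Sequences that produced no chunks get None instead of a score.
    """
    if sum(chunks_per_sequence) != len(chunk_scores):
        raise ValueError("chunk counts do not add up to the number of scores")

    sequence_scores = []
    start = 0
    for count in chunks_per_sequence:
        group = chunk_scores[start:start + count]
        sequence_scores.append(sum(group) / count if count else None)
        start += count
    return sequence_scores


# Made-up scores: three sequences with 2, 0 and 3 chunks respectively
print(merge_chunk_scores([1.0, 3.0, 0.0, 4.0, 5.0], [2, 0, 3]))  # [2.0, None, 3.0]
```

Other aggregations (maximum, sum) would fit the same interface; the mean is used here only because it does not grow with sequence length.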

In [109]:
# Importing necessary libraries
import os

## Preprocessing sequences
### Preprocessing Train and Test Sequences
The first step in our intrusion detection analysis using the negative selection algorithm involves preprocessing the raw sequences from `.train` and `.test` files. These sequences vary in length and need to be transformed into fixed-length chunks for compatibility with the negative selection algorithm. The preprocessing includes two main approaches:
- **Sliding Window**: Chunks are created by sliding a window of the desired length across the sequence one character at a time, allowing for overlap between consecutive chunks.
- **Non-Overlapping**: Chunks are created by dividing the sequence into contiguous segments of the desired length without any overlap between them.

The choice between the two approaches depends on the nature of the data and the desired sensitivity of the detection algorithm. After preprocessing, the training chunks for the `snd-cert` and `snd-unm` datasets are stored in their respective `.train` files.
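
To make the difference between the two approaches concrete, here is a small sketch on a made-up sequence. The function `chunk` is a hypothetical stand-in written only for this illustration, and the input string is not taken from the datasets.

```python
def chunk(sequence, chunk_length, overlap):
    """Split a string into fixed-length chunks (illustrative helper)."""
    step = 1 if overlap else chunk_length
    return [sequence[i:i + chunk_length]
            for i in range(0, len(sequence) - chunk_length + 1, step)]


example = "abcdefghij"  # made-up sequence of length 10

sliding = chunk(example, 4, overlap=True)
disjoint = chunk(example, 4, overlap=False)

print(sliding)   # ['abcd', 'bcde', 'cdef', 'defg', 'efgh', 'fghi', 'ghij']
print(disjoint)  # ['abcd', 'efgh']
```

For a sequence of length `L` and chunk length `n`, the sliding window yields `L - n + 1` chunks, while the non-overlapping split yields `L // n` chunks and drops any remainder shorter than `n` (here the trailing `ij`).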


### Test Dataset Chunking and Labeling
The test datasets are similarly processed into fixed-length chunks, with two main files resulting from this step:
- **Chunk Test Files**: Each test sequence is divided into chunks using the desired approach (overlapping or non-overlapping), and these chunks are stored in `.test` files. There is one chunk file per test set, giving three chunk files for each of the `snd-cert` and `snd-unm` datasets.
- **Label Files**: For each chunk created from the test sequences, a corresponding label indicating normal (0) or anomalous (1) behavior is assigned. These labels are stored in `.labels` files, with a separate label file for each of the three test sets for both `snd-cert` and `snd-unm`.

In [110]:
# Define the chunk length
chunk_length = 7

directory_path = 'syscalls/'

# File paths for snd-cert
snd_cert_train_file_path = os.path.join(directory_path, 'snd-cert/snd-cert.train')
snd_cert_test_file_paths = [os.path.join(directory_path, f'snd-cert/snd-cert.{i}.test') for i in range(1, 4)]
snd_cert_label_file_paths = [os.path.join(directory_path, f'snd-cert/snd-cert.{i}.labels') for i in range(1, 4)]
snd_cert_alpha_file_path = os.path.join(directory_path, 'snd-cert/snd-cert.alpha')

# File paths for snd-unm
snd_unm_train_file_path = os.path.join(directory_path, 'snd-unm/snd-unm.train')
snd_unm_test_file_paths = [os.path.join(directory_path, f'snd-unm/snd-unm.{i}.test') for i in range(1, 4)]
snd_unm_label_file_paths = [os.path.join(directory_path, f'snd-unm/snd-unm.{i}.labels') for i in range(1, 4)]
snd_unm_alpha_file_path = os.path.join(directory_path, 'snd-unm/snd-unm.alpha')

In [111]:
def preprocess_sequence(sequence, chunk_length, overlap=True):
    """
    Preprocesses a given sequence into fixed-length chunks.
    
    This function splits a sequence into substrings of a specified fixed length. It supports both overlapping
    and non-overlapping chunking methods. Overlapping chunks (substrings) are generated by moving one character
    at a time, while non-overlapping chunks are generated by moving the entire length of the chunk each time.
    
    Parameters:
    - sequence (str): The input sequence to be chunked.
    - chunk_length (int): The length of each chunk (substring) to be generated.
    - overlap (bool): Determines the chunking method. If True, overlapping chunks are created. If False,
                      non-overlapping chunks are generated.
    
    Returns:
    - chunks (list of str): A list of substrings (chunks) of the input sequence. If the input sequence is shorter
                            than the specified chunk length and cannot be chunked, an empty list is returned.
    
    Note:
    - The function strips trailing newlines and spaces from the input sequence before chunking.
    - If 'overlap' is True, each chunk will be shifted by one character from the previous chunk, leading to
      a higher number of generated chunks, each sharing a part with its neighbors.
    - If 'overlap' is False, each chunk starts right after the previous one ends, with no shared characters
      between consecutive chunks, leading to a lower number of generated chunks.
    """

    # Remove trailing whitespace (including the newline left by readlines())
    # so that it is never counted or included in a chunk
    sequence = sequence.rstrip()

    # A sequence shorter than the chunk length cannot be chunked
    if len(sequence) < chunk_length:
        return []

    # Determine the step size based on whether overlapping chunks are desired
    step = 1 if overlap else chunk_length

    # Generate the chunks; a trailing remainder shorter than chunk_length is dropped
    chunks = [sequence[i:i + chunk_length]
              for i in range(0, len(sequence) - chunk_length + 1, step)]

    return chunks


In [112]:
# Define the function to read labels from a given file path
def read_labels(file_path):
    """
    This function reads a file with labels (one per line) and returns them as a list.
    """
    with open(file_path, 'r') as file:
        labels = [int(line.strip()) for line in file]
    return labels

In [113]:
# Read and preprocess the .train file first for snd-cert and snd-unm
with open(snd_cert_train_file_path, 'r') as file:
    snd_cert_train_sequences = file.readlines()

with open(snd_unm_train_file_path, 'r') as file:
    snd_unm_train_sequences = file.readlines()

# Define paths for output files
snd_cert_train_chunks_file = os.path.join(directory_path, 'snd-cert/snd_cert_train_chunks.train')
snd_unm_train_chunks_file = os.path.join(directory_path, 'snd-unm/snd_unm_train_chunks.train')

# Preprocess train sequences for snd-cert and save to file
with open(snd_cert_train_chunks_file, 'w') as output_file:
    snd_cert_train_chunks = []
    for sequence in snd_cert_train_sequences:
        # Call the preprocess function for each sequence
        chunks = preprocess_sequence(sequence, chunk_length, overlap=False)
        snd_cert_train_chunks.extend(chunks)
        # Write each chunk to the file
        for chunk in chunks:
            output_file.write(chunk + '\n')

# Print the first 5 chunks to check
print("SND-CERT first 5 train chunks:")
for i, chunk in enumerate(snd_cert_train_chunks[:5], start=1):
    print(f"Chunk {i}: {chunk}")

# Preprocess train sequences for snd-unm and save to a file
with open(snd_unm_train_chunks_file, 'w') as output_file:
    snd_unm_train_chunks = []
    for sequence in snd_unm_train_sequences:
        # Call the preprocess function for each sequence
        chunks = preprocess_sequence(sequence, chunk_length, overlap=False)
        snd_unm_train_chunks.extend(chunks)
        # Write each chunk to the file
        for chunk in chunks:
            output_file.write(chunk + '\n')

# Print the first 5 chunks to check
print("\n" + "-" * 50 + "\n")
print("SND-UNM first 5 train chunks:")
for i, chunk in enumerate(snd_unm_train_chunks[:5], start=1):
    print(f"Chunk {i}: {chunk}")


SND-CERT first 5 train chunks:
Chunk 1: AEEEEEE
Chunk 2: DBccD=c
Chunk 3: EOVDPcE
Chunk 4: DBccEDB
Chunk 5: ccEEhEE

--------------------------------------------------

SND-UNM first 5 train chunks:
Chunk 1: pooqpoo
Chunk 2: qpooqED
Chunk 3: EESSprs
Chunk 4: NNpooqd
Chunk 5: spooqdN


In [114]:
# Function to preprocess test sequences and assign labels, then save to files
def preprocess_and_save_test_sequences_with_labels(test_file_path, label_file_path, chunk_length, overlap=True,
                                                   output_chunk_file_path=None, output_label_file_path=None):
    # Read test sequences and labels
    with open(test_file_path, 'r') as file:
        test_sequences = file.read().splitlines()
    labels = read_labels(label_file_path)

    chunks = []
    chunk_labels = []
    for sequence, label in zip(test_sequences, labels):
        sequence_chunks = preprocess_sequence(sequence, chunk_length, overlap)
        chunks.extend(sequence_chunks)
        chunk_labels.extend([label] * len(sequence_chunks))  # Assign the same label to all chunks from this sequence

    # Save chunks and labels to the specified output files
    if output_chunk_file_path and output_label_file_path:
        with open(output_chunk_file_path, 'w') as chunk_file, open(output_label_file_path, 'w') as label_file:
            for chunk, label in zip(chunks, chunk_labels):
                chunk_file.write(chunk + '\n')
                label_file.write(str(label) + '\n')

    return chunks, chunk_labels

In [115]:
# Initialize lists to hold chunks and labels for snd-cert test data
snd_cert_test_chunks_list = []
snd_cert_labels_list = []

# Define output file paths for chunks and labels
output_files = [
    (os.path.join(directory_path, f'snd-cert/snd_cert_test_set_{i}_chunks.test'),
     os.path.join(directory_path, f'snd-cert/snd_cert_test_set_{i}_labels.labels'))
    for i in range(1, 4)]

# Iterate over test and label file paths along with output file paths
for (test_file, label_file), (output_chunk_file, output_label_file) in zip(
        zip(snd_cert_test_file_paths, snd_cert_label_file_paths), output_files):
    chunks, labels = preprocess_and_save_test_sequences_with_labels(test_file,
                                                                    label_file,
                                                                    chunk_length,
                                                                    overlap=False,
                                                                    output_chunk_file_path=output_chunk_file,
                                                                    output_label_file_path=output_label_file)
    snd_cert_test_chunks_list.append(chunks)
    snd_cert_labels_list.append(labels)

# Initialize lists to hold chunks and labels for snd-unm test data
snd_unm_test_chunks_list = []
snd_unm_labels_list = []

# Generate output file paths dynamically
output_files = [
    (os.path.join(directory_path, f'snd-unm/snd_unm_test_set_{i}_chunks.test'),
     os.path.join(directory_path, f'snd-unm/snd_unm_test_set_{i}_labels.labels'))
    for i in range(1, 4)]

# Iterate over test and label file paths along with output file paths
for (test_file, label_file), (output_chunk_file, output_label_file) in zip(
        zip(snd_unm_test_file_paths, snd_unm_label_file_paths), output_files):
    chunks, labels = preprocess_and_save_test_sequences_with_labels(test_file, 
                                                                    label_file, 
                                                                    chunk_length, 
                                                                    overlap=False,
                                                                    output_chunk_file_path=output_chunk_file,
                                                                    output_label_file_path=output_label_file
                                                                    )
    snd_unm_test_chunks_list.append(chunks)
    snd_unm_labels_list.append(labels)

# View the first 5 chunks and labels from the first test set for snd-cert
print("First 5 chunks and their corresponding labels from SND-CERT Test Set 1:")
for chunk, label in zip(snd_cert_test_chunks_list[0][:5], snd_cert_labels_list[0][:5]):
    print(f"Chunk: {chunk}, Label: {label}")

print("\n" + "-" * 70 + "\n")

# View the first 5 chunks and labels from the first test set for snd-unm
print("First 5 chunks and their corresponding labels from SND-UNM Test Set 1:")
for chunk, label in zip(snd_unm_test_chunks_list[0][:5], snd_unm_labels_list[0][:5]):
    print(f"Chunk: {chunk}, Label: {label}")

First 5 chunks and their corresponding labels from SND-CERT Test Set 1:
Chunk: srrtsuv, Label: 0
Chunk: NNsrrtf, Label: 0
Chunk: vsrrtfN, Label: 0
Chunk: DlmEvNl, Label: 0
Chunk: oW-kwEE, Label: 0

----------------------------------------------------------------------

First 5 chunks and their corresponding labels from SND-UNM Test Set 1:
Chunk: BJLDDPM, Label: 1
Chunk: BLsNNQQ, Label: 1
Chunk: GEHIHDK, Label: 1
Chunk: sNsDBEs, Label: 1
Chunk: sVPDPg-, Label: 1


## Parameter Selection for Negative Selection Algorithm

With the preprocessing complete, the next critical step is parameter selection for the negative selection algorithm. This involves setting the value of `n` and determining an appropriate value for `r`.

- **Length of Detectors (n)**: For our analysis, we set the length of the detectors `n` equal to the length of the fixed-size chunks, which is 7 in our case, since the detectors are compared against chunks of that length.

- **Contiguous Chunk Length (r)**: The value of `r` has not yet been determined. It must lie between 1 and `n`, and candidate values still need to be evaluated.
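
The role of `r` can be illustrated with a short sketch of r-contiguous matching. It assumes the usual rule, in which a detector matches a string of the same length when the two agree in at least `r` consecutive positions; the function name is hypothetical, and the code is an illustration rather than the implementation used by `negsel2.jar`.

```python
def r_contiguous_match(detector, string, r):
    """Return True if detector and string agree in at least r consecutive positions."""
    if len(detector) != len(string):
        raise ValueError("detector and string must have the same length")

    run = 0  # length of the current run of agreeing positions
    for a, b in zip(detector, string):
        run = run + 1 if a == b else 0
        if run >= r:
            return True
    return False


# Made-up strings of length n = 7; they agree on 'bcd' (3 positions) and 'fg' (2 positions)
print(r_contiguous_match("abcdefg", "xbcdxfg", 3))  # True
print(r_contiguous_match("abcdefg", "xbcdxfg", 4))  # False
```

Under this rule a small `r` makes detectors match very easily, so few candidates survive the comparison against the self-strings, while `r` close to `n` makes each detector cover only strings nearly identical to it.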


## Training the Algorithm & Classification of Test Sequences
The anomaly scores were generated using the Negative Selection Algorithm and written to text files for subsequent analysis. Below is the command structure used for each `r` value, laid out for readability:

- **Command structure that includes the alphabet file for the Unix process task for snd-cert**:
    ```bash
    java -jar negsel2.jar -alphabet snd-cert.alpha -self snd_cert_train_chunks.train -n <chosen_n_value> -r <r_value> -c -l < snd_cert_test_set_<test_no>_chunks.test >snd_cert_test_set_<test_no>_chunks_scores_r<r_value>.txt
    ```
- **Same for snd-unm**:
    ```bash
    java -jar negsel2.jar -alphabet snd-unm.alpha -self snd_unm_train_chunks.train -n <chosen_n_value> -r <r_value> -c -l < snd_unm_test_set_<test_no>_chunks.test >snd_unm_test_set_<test_no>_chunks_scores_r<r_value>.txt
    ```
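
When several `r` values and test sets have to be processed, the commands above can be generated programmatically. The sketch below only assembles the flags and file names already shown in the commands above; the helper names `build_negsel_command`, `io_paths` and `run_negsel` are hypothetical, and it is assumed that `java` and `negsel2.jar` are available in the working directory.

```python
import subprocess


def build_negsel_command(dataset, n, r):
    """Build the argument list for one run; dataset is 'snd-cert' or 'snd-unm'."""
    prefix = dataset.replace("-", "_")
    return ["java", "-jar", "negsel2.jar",
            "-alphabet", f"{dataset}.alpha",
            "-self", f"{prefix}_train_chunks.train",
            "-n", str(n), "-r", str(r), "-c", "-l"]


def io_paths(dataset, test_no, r):
    """Return the (input chunk file, output score file) names for one run."""
    prefix = dataset.replace("-", "_")
    return (f"{prefix}_test_set_{test_no}_chunks.test",
            f"{prefix}_test_set_{test_no}_chunks_scores_r{r}.txt")


def run_negsel(dataset, test_no, n, r):
    """Run one command, feeding the chunk file on stdin and saving stdout."""
    in_path, out_path = io_paths(dataset, test_no, r)
    with open(in_path, "r") as fin, open(out_path, "w") as fout:
        subprocess.run(build_negsel_command(dataset, n, r),
                       stdin=fin, stdout=fout, check=True)


print(" ".join(build_negsel_command("snd-cert", 7, 4)))
print(io_paths("snd-cert", 1, 4))
```

Calling `run_negsel(dataset, test_no, 7, r)` in a loop over both datasets, the three test sets and the candidate `r` values would produce one score file per combination.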

**Now, with the generated text files containing the scores for each `r` value and each dataset, we are ready to proceed with the analysis.**
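
As a starting point for that analysis, the AUC can be computed directly from scores and labels. The sketch below uses the pairwise definition of AUC (the probability that a randomly chosen anomalous item scores higher than a randomly chosen normal one, with ties counted as one half). The function names are hypothetical, and `read_scores` assumes that each score file contains one number per line.

```python
def read_scores(path):
    """Read one score per line from a text file (assumed output format)."""
    with open(path, "r") as file:
        return [float(line) for line in file if line.strip()]


def auc_from_scores(scores, labels):
    """Pairwise AUC: fraction of (anomalous, normal) pairs ranked correctly."""
    anomalous = [s for s, y in zip(scores, labels) if y == 1]
    normal = [s for s, y in zip(scores, labels) if y == 0]
    if not anomalous or not normal:
        raise ValueError("both classes must be present to compute the AUC")

    wins = 0.0
    for a in anomalous:
        for b in normal:
            if a > b:
                wins += 1.0
            elif a == b:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(anomalous) * len(normal))


# Made-up scores and labels (0 = normal, 1 = anomalous)
print(auc_from_scores([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1]))  # 0.75
```

The double loop is quadratic in the number of items, which is fine for a quick check; for large score files a rank-based computation or a library routine would be the more practical choice.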
