In [25]:
import tarfile
import os

def create_tar_gz_of_directory(directory_path, output_archive):
    with tarfile.open(output_archive, "w:gz") as tar:
        # Walk through the directory
        for root, dirs, files in os.walk(directory_path):
            for file in files:
                # Create the path to your file
                file_path = os.path.join(root, file)
                # Calculate the arcname (name within the archive)
                arcname = os.path.relpath(file_path, directory_path)
                # Add the file to the archive; arcname controls the name inside the archive
                tar.add(file_path, arcname=arcname)

# Example usage
directory_path = 'model'  # The directory to tar.gz
output_archive = 'model.tar.gz'  # The output archive path
create_tar_gz_of_directory(directory_path, output_archive)

print(f"Archive created at: {output_archive}")

Archive created at: model.tar.gz
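
To confirm that `arcname` stored paths relative to the archived directory, the archive's member names can be listed afterwards. The following is a minimal sketch: the helper `list_archive_members` and the throwaway file names are made up for illustration, and it repeats the same `os.path.relpath` logic on a temporary directory rather than on the real `model` folder.

```python
import os
import tarfile
import tempfile

def list_archive_members(archive_path):
    """Return the sorted names of all members stored in a tar.gz archive."""
    with tarfile.open(archive_path, "r:gz") as tar:
        return sorted(tar.getnames())

# Build a throwaway directory tree (hypothetical file names) and archive it
# with the same relpath-based arcname logic used in the cell above.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "model")
    os.makedirs(os.path.join(src, "code"))
    for rel in ("model.pth", os.path.join("code", "inference.py")):
        with open(os.path.join(src, rel), "w") as f:
            f.write("placeholder")

    archive = os.path.join(tmp, "model.tar.gz")
    with tarfile.open(archive, "w:gz") as tar:
        for root, _dirs, files in os.walk(src):
            for name in files:
                path = os.path.join(root, name)
                tar.add(path, arcname=os.path.relpath(path, src))

    # Member names are relative to the archived directory, not to tmp
    members = list_archive_members(archive)

print(members)
```

None of the listed names should start with the temporary directory or with `model/`; they should be relative to the root of the archived directory.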


# A Simple Client for the SageMaker Model Endpoint
In this notebook, we go through a bare-bones implementation of a client application for the MalConv model deployed on SageMaker. The client takes the path of an executable file and uses the EMBER library to extract its relevant features, which are then post-processed to match the input format the model expects. It then uses the Boto3 library to connect to AWS, authenticate us, and interact with the SageMaker service that is serving our endpoint.

As always, we start by taking care of a few dependencies:

In [2]:
!pip install awscli boto3

Collecting awscli
  Downloading awscli-1.32.71-py3-none-any.whl (4.4 MB)
Collecting boto3
  Downloading boto3-1.34.71-py3-none-any.whl (139 kB)
Collecting botocore==1.34.71 (from awscli)
  Downloading botocore-1.34.71-py3-none-any.whl (12.0 MB)
Collecting docutils<0.17,>=0.10 (from awscli)
  Downloading docutils-0.16-py2.py3-none-any.whl (548 kB)
Collecting s3transfer<0.11.0,>=0.10.0 (from awscli)
  Downloading s3transfer-0.10.1-py3-none-any.whl (82 kB)

In [12]:
!pip install ember



In [3]:
# Version 1.23 is the latest that is compatible with the original EMBER code.
!pip install numpy==1.23

Collecting numpy==1.23
  Downloading numpy-1.23.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.0 MB)
Installing collected packages: numpy
  Attempting uninstall: numpy
    Found existing installation: numpy 1.25.2
    Uninstalling numpy-1.25.2:
      Successfully uninstalled numpy-1.25.2
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
chex 0.1.86 requires numpy>=1.24.1, but you have numpy 1.23.0 which is incompatible.
tensorflow 2.15.0 requires numpy<2.0.0,>=1.23.5, but you have numpy 1.23.0 which is incompatible.
Successfully installed numpy-1.23.0


In [37]:
# Version 0.12 is the latest that is compatible with the original EMBER code.
!pip install lief==0.12

ERROR: Operation cancelled by user

# Converting the Executable File to Processed Feature Vectors

The code in this section prepares Portable Executable (PE) files for malware classification by a machine learning model hosted on AWS SageMaker. It defines two functions: extract_features, which extracts features from a PE file using the EMBER feature extractor, and format_features, which converts those features into the input format your model expects.

**extract_features Function**
This function takes the path to a PE file as its input and returns a feature vector extracted using the EMBER feature extraction methodology. The steps are as follows:

*   PEFeatureExtractor Initialization: An instance of PEFeatureExtractor is created with a specified version (1 or 2), which determines the feature extraction method. The choice of version impacts the feature set and extraction behavior.
*   Reading the PE File: The PE file is opened in binary mode ("rb"), and its contents are read into the variable bytez. This binary data is what the EMBER extractor operates on.
*   Feature Extraction: The feature_vector method is called with the binary data of the PE file, returning a feature vector representing the file's characteristics from a cybersecurity perspective.
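
Before handing a file to the extractor, it can be useful to check that it looks like a PE file at all. The sketch below is an optional pre-check, not part of EMBER: the helper `looks_like_pe` is a hypothetical name, and it only tests for the two-byte `MZ` DOS header magic with which PE files begin, so it is a cheap sanity check rather than full validation.

```python
import os
import tempfile

def looks_like_pe(path):
    """Return True if the file begins with the 'MZ' DOS header magic bytes."""
    with open(path, "rb") as f:
        return f.read(2) == b"MZ"

# Demonstrate on two throwaway files (contents are made up for illustration)
with tempfile.TemporaryDirectory() as tmp:
    fake_pe = os.path.join(tmp, "fake.exe")
    text_file = os.path.join(tmp, "notes.txt")
    with open(fake_pe, "wb") as f:
        f.write(b"MZ\x90\x00" + b"\x00" * 60)
    with open(text_file, "wb") as f:
        f.write(b"hello")

    pe_ok = looks_like_pe(fake_pe)
    text_ok = looks_like_pe(text_file)

print(pe_ok, text_ok)
```

A client could call such a check at the top of extract_features and raise a clear error for non-PE input instead of sending meaningless features to the endpoint.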

**format_features Function**
After extracting the features, they must be formatted correctly before submission to your machine learning model. This function performs such formatting:

*   Array Conversion: The extracted features are converted into a NumPy array of type float32, giving the data a uniform numerical type for further processing.
*   Tensor Conversion: The NumPy array is then converted into a PyTorch tensor of type long (64-bit integer), because the deployed model expects its input in this format. Note that this cast discards the fractional part of each feature value.

In [31]:
import boto3
import json
import numpy as np
import torch
from ember import read_vectorized_features, PEFeatureExtractor
import io

def extract_features(pe_file_path):
    """
    Extract features from a PE file using the EMBER feature extractor.
    """
    extractor = PEFeatureExtractor(2)  # The version parameter can be 1 or 2
    with open(pe_file_path, "rb") as f:
        bytez = f.read()
    features = extractor.feature_vector(bytez)
    return features

def format_features(features):
    """
    Formats the extracted features for the model. Adjust this function based
    on how your model expects the input data.
    """
    # This is a placeholder; adapt the formatting based on your model's needs
    features = np.array(features, dtype=np.float32)
    # Convert to a long tensor as expected by your model
    features_tensor = torch.tensor(features, dtype=torch.long)
    return features_tensor






In the next cell, we define the submit_to_endpoint() function, which uses boto3 to (1) authenticate us to AWS, and (2) interact with AWS services. The AWS credentials are available from AWS Academy Learner Labs Panel -> AWS Details. Please note that these credentials change every time the lab session is restarted.
![](https://github.com/UNHSAILLab/S24-AISec/blob/main/Midterm%20Tutorial/AWSDetails.png?raw=true)

Also, this implementation takes the hard path: my model interface is implemented to accept a serialized input to ensure fidelity in transmission. The easy path would be to define the model input interface to accept JSON objects. However, since the model is already deployed with an inference.py that implements the hard way, we need an extra step here: use a buffer!

The io.BytesIO class is used in Python as an in-memory bytes buffer. It behaves like a file object that can be read from and written to, but instead of reading from or writing to a physical file on the disk, it operates on an in-memory byte stream. This makes io.BytesIO particularly useful for cases where you need a file-like interface for data that doesn't necessarily need to be stored on disk, enabling faster read/write operations and reducing the need for disk I/O.

Here's why io.BytesIO is used in the provided code and what it accomplishes:

*   Efficient Data Serialization: When you need to serialize data (in this case, a PyTorch tensor) to a format that can be transmitted over a network or consumed outside the current Python process (like a SageMaker endpoint expecting byte streams), io.BytesIO provides a convenient way to capture the serialized data without writing to and reading from disk.

*   Compatibility with File-like Interfaces: Many Python libraries, including torch.save for serializing PyTorch models or tensors, accept a file-like object as their output target. io.BytesIO lets these libraries operate on data in memory as if they were writing to a file.

*   Network Transmission: When sending data over the network, such as submitting input features to a machine learning model hosted on SageMaker, the data needs to be in a byte format. Once an object has been serialized into an io.BytesIO buffer, the buffer's content is exactly the byte string that can be sent over the network.

In the following code,

*   A BytesIO buffer is explicitly created before calling torch.save. This buffer acts as an in-memory file, which torch.save can write to.

*   The tensor features is saved to this buffer using torch.save(features, buffer).

*   After saving, the buffer's pointer is reset to the start with buffer.seek(0). After writing, the pointer sits at the end of the written content, so a subsequent buffer.read() without this reset would return an empty result. buffer.getvalue(), by contrast, always returns the full content regardless of the pointer position, so the reset is a precaution here rather than a requirement.

*   Finally, buffer.getvalue() is called to retrieve the byte stream content of the buffer, which is then sent in the request to the SageMaker endpoint.
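
The behaviour of the buffer position can be seen in a small stand-alone sketch. It uses the standard-library `pickle` module as a stand-in for `torch.save` so that it runs without PyTorch; the `payload` dictionary is made-up data, not the real feature tensor.

```python
import io
import pickle

payload = {"features": [1, 2, 3]}  # made-up data standing in for the tensor

buffer = io.BytesIO()
pickle.dump(payload, buffer)       # stand-in for torch.save(features, buffer)

# After writing, the position sits at the end of the written data.
end_position = buffer.tell()
read_at_end = buffer.read()        # empty: nothing lies past the position
all_bytes = buffer.getvalue()      # full content, independent of the position

buffer.seek(0)                     # rewind so read() starts at the beginning
read_after_seek = buffer.read()

# The captured bytes round-trip back to the original object.
restored = pickle.loads(all_bytes)

print(end_position == len(all_bytes), read_at_end, restored)
```

In this sketch `read()` depends on the current position while `getvalue()` does not, which is why the request body can be taken from `getvalue()` directly.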

In [None]:
def submit_to_endpoint(endpoint_name, features):
    """
    Submit the formatted features to the SageMaker endpoint for prediction.
    """
    # Create a BytesIO buffer and save the tensor to this buffer
    buffer = io.BytesIO()
    torch.save(features, buffer)
    buffer.seek(0)  # Move to the start of the buffer

    # Paste the temporary credentials from the AWS Details panel here.
    # Never commit or share real credentials in a notebook.
    runtime = boto3.client('sagemaker-runtime',
                          aws_access_key_id='YOUR_AWS_ACCESS_KEY_ID',
                          aws_secret_access_key='YOUR_AWS_SECRET_ACCESS_KEY',
                          aws_session_token='YOUR_AWS_SESSION_TOKEN'
                          )
    response = runtime.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/octet-stream",
        Body=buffer.getvalue()  # Use the buffer's content
    )
    # Deserialize the response
    result = json.loads(response['Body'].read().decode())
    return result
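
Because the lab credentials change with every session, one option is to keep them out of the notebook and read them from environment variables instead. The sketch below assumes the variables `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, and `AWS_SESSION_TOKEN` have been set beforehand; the helper name `aws_credentials_from_env` is hypothetical, and the example values are made up.

```python
import os

def aws_credentials_from_env(env=None):
    """Collect temporary AWS credentials from environment variables.

    Returns a dict of keyword arguments that can be passed to boto3.client().
    Raises KeyError naming the first variable that is missing.
    """
    env = os.environ if env is None else env
    mapping = {
        "aws_access_key_id": "AWS_ACCESS_KEY_ID",
        "aws_secret_access_key": "AWS_SECRET_ACCESS_KEY",
        "aws_session_token": "AWS_SESSION_TOKEN",
    }
    return {kwarg: env[var] for kwarg, var in mapping.items()}

# Example with made-up values instead of the real environment
example_env = {
    "AWS_ACCESS_KEY_ID": "EXAMPLE_KEY_ID",
    "AWS_SECRET_ACCESS_KEY": "EXAMPLE_SECRET",
    "AWS_SESSION_TOKEN": "EXAMPLE_TOKEN",
}
kwargs = aws_credentials_from_env(example_env)
print(sorted(kwargs))
```

The resulting dictionary could then be unpacked into the client call, for example `boto3.client('sagemaker-runtime', **aws_credentials_from_env())`. Boto3 also picks up these three environment variables on its own when no credentials are passed explicitly.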

In the following step, we test the API endpoint by passing the extracted features of an executable file to the model and displaying the prediction.
The user-provided parameters in this cell are (1) the path of the executable file that you wish to classify, and (2) the name of the SageMaker endpoint serving your MalConv model. To find the latter, open the Inference category in the main navigation menu on the left side of the AWS SageMaker dashboard and select the Endpoints item. This takes you to a page listing all of your endpoints. If you have several, you will probably want the one that was created most recently.
![](https://github.com/UNHSAILLab/S24-AISec/blob/main/Midterm%20Tutorial/SageMakerEndPoints.png?raw=true)
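
Picking the most recently created endpoint can also be done in code. The sketch below is a pure-Python helper with the hypothetical name `pick_latest_endpoint`; it assumes a list of dictionaries carrying `EndpointName`, `CreationTime`, and `EndpointStatus` keys, which is the shape of the `Endpoints` list returned by the SageMaker `list_endpoints` API. The sample entries are made up.

```python
from datetime import datetime, timezone

def pick_latest_endpoint(endpoints, status="InService"):
    """Return the name of the most recently created endpoint with the given status."""
    candidates = [e for e in endpoints if e["EndpointStatus"] == status]
    if not candidates:
        raise ValueError(f"no endpoint with status {status!r}")
    return max(candidates, key=lambda e: e["CreationTime"])["EndpointName"]

# Made-up sample data in the assumed shape
sample_endpoints = [
    {
        "EndpointName": "example-endpoint-old",
        "CreationTime": datetime(2024, 3, 20, tzinfo=timezone.utc),
        "EndpointStatus": "InService",
    },
    {
        "EndpointName": "example-endpoint-new",
        "CreationTime": datetime(2024, 3, 27, tzinfo=timezone.utc),
        "EndpointStatus": "InService",
    },
    {
        "EndpointName": "example-endpoint-failed",
        "CreationTime": datetime(2024, 3, 28, tzinfo=timezone.utc),
        "EndpointStatus": "Failed",
    },
]

latest = pick_latest_endpoint(sample_endpoints)
print(latest)
```

With valid credentials, the input list could come from `boto3.client('sagemaker').list_endpoints()['Endpoints']`. Filtering on status avoids selecting an endpoint that was created recently but failed to deploy.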

In [35]:
if __name__ == "__main__":
    pe_file_path = "calc.exe"
    endpoint_name = "pytorch-inference-2024-03-27-05-44-37-080"

    # Extract features from the PE file
    features = extract_features(pe_file_path)

    print(features)

    # Format the features as required by the model
    formatted_features = format_features(features)


    # Submit the formatted features to the SageMaker endpoint
    prediction = submit_to_endpoint(endpoint_name, formatted_features)
    print("Prediction result:", prediction)

[0.5253462  0.00397283 0.0012429  ... 0.         0.         0.        ]
Prediction result: [[0.999998927116394]]
