# Malicious URL Detection With Machine Learning on Amazon SageMaker and Offline Predictions

This workshop is based on an article on Medium: https://medium.com/@ismaelbouarfa/malicious-url-detection-with-machine-learning-d57890443dec <br/>
The goal of this workshop is to classify input URLs as either dangerous or harmless.
We use **good** as the label for legitimate URLs and **bad** for malicious ones. We'll train our model on a CSV file containing many URLs (as text) that are already labeled.

## Anatomy of a URL

![](https://miro.medium.com/max/1050/0*EaleKgM4uK8sCpvy)<br>
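
To make the anatomy concrete, the following minimal sketch splits a URL into its parts with Python's standard `urllib.parse` module. The sample URL is one of the test URLs used later in this workshop; the field names are those of `urlsplit`, not terms from the original article.

```python
from urllib.parse import urlsplit

# One of the test URLs used later in this workshop.
url = "https://www.kdnuggets.com/2016/10/machine-learning-detect-malicious-urls.html"

parts = urlsplit(url)

print("scheme  :", parts.scheme)    # protocol, here "https"
print("hostname:", parts.hostname)  # host / domain name
print("path    :", parts.path)      # resource path on the host
print("query   :", parts.query)     # empty here: no "?key=value" part
```

Note that URLs without a scheme, such as `yahoo.fr`, are parsed with the whole string in `path`, which is one reason the workshop tokenizes the raw text instead of relying on parsed components.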


Please replace **ENTER YOUR BUCKET NAME HERE** with the name of the bucket you created earlier.

In [None]:
import datetime
import time
import tarfile
import re
import nltk

import boto3
import pandas as pd
import numpy as np

import sagemaker
from sagemaker import get_execution_role

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

bucket = "ENTER YOUR BUCKET NAME HERE"

sm_boto3 = boto3.client("sagemaker")

sess = sagemaker.Session()

region = sess.boto_session.region_name

print("Using bucket " + bucket)

## Prepare data
Now we copy our datasets from a public S3 bucket. The command that clones the Git repository of the original dataset is commented out and provided for your information.
An additional dataset, gdms_data.csv, contains further URLs and their statuses to augment the training data.

In [None]:
# Original Source dataset 
#! git clone https://github.com/faizann24/Using-machine-learning-to-detect-malicious-URLs.git
! aws s3 cp s3://ml-materials/malicious-url-dataset/data.csv .
! aws s3 cp s3://ml-materials/malicious-url-dataset/gdms_data.csv .

In [None]:
# Load the datasets into dataframes, skipping malformed rows
# (on_bad_lines requires pandas >= 1.3; older versions use error_bad_lines=False)
df = pd.read_csv('./data.csv', sep=',', on_bad_lines='skip')
df2 = pd.read_csv('./gdms_data.csv', sep=',', on_bad_lines='skip')

# Create a random sample set of 10000 rows
df = df.sample(n=10000)

# Append the custom dataset to the training dataset
# (DataFrame.append was removed in pandas 2.0, so use pd.concat)
df = pd.concat([df, df2])

# Keep only the label and url columns, in that order
df = df[['label', 'url']]

# Delete rows with a missing URL
df = df[pd.notnull(df['url'])]

In [None]:
# Tokenizer function for URLs by Faizan Ahmad, CEO FSecurify
def getTokens(input):
    # Split the URL on '/', then each segment on '-', then each piece on '.'
    tokensBySlash = str(input).split('/')
    allTokens = []
    for i in tokensBySlash:
        tokens = str(i).split('-')
        tokensByDot = []
        for j in range(0, len(tokens)):
            tempTokens = str(tokens[j]).split('.')
            tokensByDot = tokensByDot + tempTokens
        allTokens = allTokens + tokens + tokensByDot
    # Remove duplicate tokens
    allTokens = list(set(allTokens))
    # 'com' occurs in a large share of URLs, so it is dropped from the features
    if 'com' in allTokens:
        allTokens.remove('com')
    return allTokens
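
As a rough illustration of what the tokenizer produces, the sketch below splits a URL on `/`, `-`, and `.` with a single regular expression. `tokenize_url` is a hypothetical, simplified stand-in for `getTokens`, not the author's function: unlike `getTokens`, it does not keep the undivided segments (such as the full host name) as extra tokens, and it sorts the result for readability.

```python
import re


def tokenize_url(url):
    """Hypothetical simplified tokenizer: split on '/', '-' and '.'."""
    tokens = re.split(r"[/\-.]", url)
    # Drop empty strings (e.g. from '//') and the very common 'com' token.
    unique = {t for t in tokens if t and t != "com"}
    return sorted(unique)


print(tokenize_url("www.radsport-voggel.de/wp-admin/includes/log.exe"))
# ['admin', 'de', 'exe', 'includes', 'log', 'radsport', 'voggel', 'wp', 'www']
```

Tokens such as `exe`, `wp`, and `admin` are the kind of signal the classifier can pick up on once the tokens are turned into numeric features.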

In [None]:
vectorizer = TfidfVectorizer( tokenizer=getTokens ,use_idf=True, smooth_idf=True, sublinear_tf=False)
features = vectorizer.fit_transform(df.url).toarray().astype('float32')
labels = df.label
features.shape
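
To show what the vectorizer computes, here is a minimal pure-Python sketch of TF-IDF weighting under the settings used above (`use_idf=True`, `smooth_idf=True`, `sublinear_tf=False`). It assumes scikit-learn's documented smoothed formula, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalisation of each row. The toy corpus and the `tfidf_row` helper are made up for illustration, and this is not the library's implementation.

```python
import math
from collections import Counter

# Toy corpus of already-tokenized "URLs" (made-up tokens for illustration).
docs = [
    ["www", "example", "login"],
    ["www", "example", "exe"],
    ["mail", "example"],
]

n_docs = len(docs)
# Document frequency: in how many documents each token occurs.
doc_freq = Counter(token for doc in docs for token in set(doc))


def tfidf_row(doc):
    """TF-IDF weights for one tokenized document, L2-normalised."""
    counts = Counter(doc)  # raw term frequency
    weights = {
        tok: tf * (math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1)
        for tok, tf in counts.items()
    }
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {tok: w / norm for tok, w in weights.items()}


row = tfidf_row(docs[0])
print(row)
```

In this toy corpus `login` appears in one document and `example` in all three, so `login` receives the larger weight: rare tokens count for more than ubiquitous ones.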

Now that we have the fitted vectorizer, we need to serialize it and store it locally. We will use Joblib to serialize the vectorizer model (Pickle would also work). We then copy it to the S3 bucket so we can use it at the end of this workshop for inference outside of the AWS environment.

In [None]:
import joblib

# Passing a filename lets joblib open and close the file itself
joblib.dump(vectorizer, "tfidf1.joblib")

In [None]:
! aws s3 cp tfidf1.joblib s3://{bucket}/models/vectorizer/

Now let's split the dataset 80/20 into training and testing datasets.

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test, indices_train, indices_test = train_test_split(features, labels, df.index, test_size=0.20, random_state=0)
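
The effect of `test_size=0.20` and a fixed `random_state` can be pictured with the short sketch below. `split_indices` is a hypothetical helper written only for illustration; it is not how scikit-learn implements `train_test_split`, and its shuffle order will differ from the library's.

```python
import random


def split_indices(n_rows, test_size=0.20, seed=0):
    """Hypothetical helper: shuffle row indices, then cut off the test share."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # a fixed seed gives the same shuffle every run
    n_test = round(n_rows * test_size)
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(10)
print(len(train_idx), len(test_idx))  # 8 2
```

Because the seed is fixed, repeated runs yield the same partition, which keeps the reported training and test accuracies comparable between runs.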

Now that the dataset is split, we can assemble the features and labels for the training and testing datasets and save them as CSV files.

In [None]:
train_pd_df = pd.DataFrame(data = X_train, 
                        columns = ['Column_' + str(i + 1) 
                        for i in range(X_train.shape[1])])

train_pd_df['LABEL'] = y_train.values

test_pd_df = pd.DataFrame(data = X_test, 
                        columns = ['Column_' + str(i + 1) 
                        for i in range(X_test.shape[1])])

test_pd_df['LABEL'] = y_test.values

train_pd_df.to_csv("mal_url_train.csv", index=False)
test_pd_df.to_csv("mal_url_test.csv", index=False)

We now copy the CSV files to your bucket

In [None]:
# send data to S3. SageMaker will take training data from s3
trainpath = sess.upload_data(
    path="mal_url_train.csv", bucket=bucket, key_prefix="sagemaker/sklearncontainer"
)

testpath = sess.upload_data(
    path="mal_url_test.csv", bucket=bucket, key_prefix="sagemaker/sklearncontainer"
)

## Writing a *Script Mode* script
The script below contains both training and inference functionality and can run either on SageMaker Training hardware or locally (desktop, SageMaker notebook, on-premises, etc.). Detailed guidance is available here: https://sagemaker.readthedocs.io/en/stable/using_sklearn.html#preparing-the-scikit-learn-training-script

In [None]:
%%writefile script.py

import argparse
import joblib
import os

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression


# inference functions ---------------
def model_fn(model_dir):
    clf = joblib.load(os.path.join(model_dir, "model.joblib"))
    return clf

def predict_fn(input_data, model):
    prediction = model.predict(input_data)
    pred_prob = model.predict_proba(input_data)
    return np.array([prediction, pred_prob[:,0], pred_prob[:,1]])

if __name__ == "__main__":

    print("extracting arguments")
    parser = argparse.ArgumentParser()

    # hyperparameters sent by the client are passed as command-line arguments to the script.
    # to simplify the demo we don't expose all sklearn LogisticRegression hyperparameters
    parser.add_argument("--random_state", type=int, default=0)
    #parser.add_argument("--target", type=str, default="")

    # Data, model, and output directories
    parser.add_argument("--model-dir", type=str, default=os.environ.get("SM_MODEL_DIR"))
    parser.add_argument("--train", type=str, default=os.environ.get("SM_CHANNEL_TRAIN"))
    parser.add_argument("--test", type=str, default=os.environ.get("SM_CHANNEL_TEST"))
    parser.add_argument("--train-file", type=str, default="mal_url_train.csv")
    parser.add_argument("--test-file", type=str, default="mal_url_test.csv")
    parser.add_argument(
        "--features", type=str
    )  # in this script we ask user to explicitly name features
    parser.add_argument(
        "--target", type=str
    )  # in this script we ask user to explicitly name the target

    args, _ = parser.parse_known_args()

    print("reading data")
    train_df = pd.read_csv(os.path.join(args.train, args.train_file))
    test_df = pd.read_csv(os.path.join(args.test, args.test_file))

    print("building training and testing datasets")
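    # drop the target column by name (a positional axis argument is no longer accepted by recent pandas)
    X_train = train_df.drop(columns=[args.target])
    X_test = test_df.drop(columns=[args.target])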
    y_train = train_df[args.target]
    y_test = test_df[args.target]

    # train
    print("training model")
    
    model = LogisticRegression(random_state=args.random_state)

    model.fit(X_train, y_train)

    print("validating model")
    
    train_score = model.score(X_train, y_train)
    print(f"Training Accuracy = {train_score}")
    
    test_score = model.score(X_test, y_test)
    print(f"Test Accuracy = {test_score}")

    # persist model
    path = os.path.join(args.model_dir, "model.joblib")
    joblib.dump(model, path)
    print("model persisted at " + path)
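
The script relies on two conventions: SageMaker exposes data and model locations through environment variables such as `SM_MODEL_DIR` and `SM_CHANNEL_TRAIN`, and hyperparameters arrive as command-line flags. The minimal sketch below imitates that locally. The directory values and the `build_parser` helper are placeholders chosen for illustration, and only a subset of the script's arguments is reproduced.

```python
import argparse
import os


def build_parser():
    """Hypothetical helper mirroring a subset of the script's arguments."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--model-dir", type=str, default=os.environ.get("SM_MODEL_DIR"))
    parser.add_argument("--train", type=str, default=os.environ.get("SM_CHANNEL_TRAIN"))
    parser.add_argument("--target", type=str)
    return parser


# Outside SageMaker these variables are unset, so set placeholder values by hand.
os.environ["SM_MODEL_DIR"] = "./model"
os.environ["SM_CHANNEL_TRAIN"] = "./data"

# Simulates the command line: script.py --target LABEL --unknown-flag 1
args, unknown = build_parser().parse_known_args(["--target", "LABEL", "--unknown-flag", "1"])

print(args.model_dir, args.train, args.target)  # ./model ./data LABEL
print(unknown)                                  # ['--unknown-flag', '1']
```

`parse_known_args` ignores flags the script does not declare instead of failing. For a local run of the real script, passing the paths explicitly should work, for example `python script.py --train . --test . --model-dir . --target LABEL`, assuming the two CSV files are in the current directory.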

## SageMaker Training

### Launching a training job with the Python SDK

Here we set up the training job to use the pre-built SKLearn container and our script for training.

In [None]:
# We use the Estimator from the SageMaker Python SDK
from sagemaker.sklearn.estimator import SKLearn

FRAMEWORK_VERSION = "0.23-1"

sklearn_estimator = SKLearn(
    entry_point="script.py",
    role=get_execution_role(),
    instance_count=1,
    instance_type="ml.m4.2xlarge",
    framework_version=FRAMEWORK_VERSION,
    base_job_name="mal-url-scikit",
    hyperparameters={
        "target": "LABEL",
    },
)

Now that everything is set up, we can start the training job, providing the locations of the training and test datasets.

In [None]:
# launch the training job; wait=True blocks until the job completes
sklearn_estimator.fit({"train": trainpath, "test": testpath}, wait=True)

## Making Predictions

### SageMaker Hosting Predictions

Now that training has completed successfully, let us deploy the model to a SageMaker endpoint so we can make real-time predictions.

In [None]:
predictor = sklearn_estimator.deploy(instance_type="ml.m4.2xlarge", initial_instance_count=1)

The endpoint is up and running. We can now do predictions against this endpoint

In [None]:
X_predict = ['yahoo.fr','www.radsport-voggel.de/wp-admin/includes/log.exe','hello.ru','auction-korea.co.kr/technote7/peace/','https://www.kdnuggets.com/2016/10/machine-learning-detect-malicious-urls.html','https://mail.google.com/','https://gmail.com']
X_feature_vectors = vectorizer.transform(X_predict).toarray()
y_Predict = predictor.predict(X_feature_vectors)
x = len(y_Predict[1,:])
for y  in range(0,x):
    print(y_Predict[:, y])

### External Predictions from a SageMaker Endpoint

#### Loading and using the vectorizer

To make predictions outside of AWS, we start by installing the joblib library, which we need to deserialize the vectorizer model we saved.

In [None]:
! pip install joblib

We will now copy the vectorizer model from your S3 bucket to your local environment.

In [None]:
vectorizer_uri = f"s3://{bucket}/models/vectorizer/tfidf1.joblib"
! aws s3 cp {vectorizer_uri} .

Now that the vectorizer model file is available locally, we can load it with joblib and run it on our test URLs.

In [None]:
import joblib

local_test_urls = ['gmail.com','www.radsport-voggel.de/wp-admin/includes/log.exe','hello.ru','auction-korea.co.kr/technote7/peace/','https://www.kdnuggets.com/2016/10/machine-learning-detect-malicious-urls.html']

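# The saved vectorizer refers to getTokens by name, so that function must be
# defined in this session before loading
local_vectorizer = joblib.load('tfidf1.joblib')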

test_urls_vectorized = local_vectorizer.transform(local_test_urls).toarray()

#### Getting predictions from a SageMaker endpoint

Install boto3 if not already present

In [None]:
! pip install boto3

In [None]:
import boto3

sm_runtime_client = boto3.client('sagemaker-runtime',
                #        aws_access_key_id='<your_access_key_id>', 
                #        aws_secret_access_key='<your_secret_access_key>',
                #        region_name='us-east-1',
                                )

In [None]:
import json

payload = json.dumps(test_urls_vectorized.tolist())

**Replace "PUT YOUR ENDPOINT NAME HERE" with your endpoint name**

In [None]:
endpoint_name = "PUT YOUR ENDPOINT NAME HERE"

response = sm_runtime_client.invoke_endpoint(
    EndpointName = endpoint_name,
    Body = payload,
    ContentType='application/json',
)

In [None]:
result = json.loads(response['Body'].read().decode())

In [None]:
x = len(result[0])
for y  in range(0,x):
    print(result[0][y],result[1][y],result[2][y])
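
The endpoint returns the array built in `predict_fn`: one list of labels followed by two lists of class probabilities, each with one entry per URL. The sketch below pairs those lists back up with the input URLs. The response values are invented placeholders, not real model output, and `rows_from_response` is a hypothetical helper.

```python
import json


def rows_from_response(urls, result):
    """Hypothetical helper: turn [labels, p_first, p_second] into per-URL rows."""
    labels, p_first, p_second = result
    return [
        {"url": u, "label": lab, "p_first": float(a), "p_second": float(b)}
        for u, lab, a, b in zip(urls, labels, p_first, p_second)
    ]


urls = ["gmail.com", "hello.ru"]
# Placeholder body shaped like the endpoint response (values are made up).
body = json.dumps([["good", "bad"], ["0.1", "0.8"], ["0.9", "0.2"]])

for row in rows_from_response(urls, json.loads(body)):
    print(row)
```

Because `predict_fn` stacks string labels and float probabilities into a single NumPy array, the probabilities may arrive as strings; `float()` handles both strings and numbers.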

### Getting predictions directly from a model without using endpoints

#### Loading and using the vectorizer

To make predictions outside of AWS, we start by installing the joblib library, which we need to deserialize the vectorizer model we saved.

In [None]:
! pip install joblib

We will now copy the vectorizer model from your S3 bucket to your local environment. The trained SKLearn model is copied in a later step.

In [None]:
vectorizer_uri = f"s3://{bucket}/models/vectorizer/tfidf1.joblib"
! aws s3 cp {vectorizer_uri} .

Now that the vectorizer model file is available locally, we can load it with joblib and run it on our test URLs.

In [None]:
import joblib

local_test_urls = ['gmail.com','www.radsport-voggel.de/wp-admin/includes/log.exe','hello.ru','auction-korea.co.kr/technote7/peace/','https://www.kdnuggets.com/2016/10/machine-learning-detect-malicious-urls.html']

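# The saved vectorizer refers to getTokens by name, so that function must be
# defined in this session before loading
local_vectorizer = joblib.load('tfidf1.joblib')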

test_urls_vectorized = local_vectorizer.transform(local_test_urls).toarray()

#### Loading and using the SKLearn Trained Model

We are now going to copy the trained SKLearn model file to our local filesystem 

In [None]:
model_uri = f"{sklearn_estimator.output_path}{sklearn_estimator._current_job_name}/output/model.tar.gz"

In [None]:
! aws s3 cp {model_uri} .

We now uncompress and untar the model file

In [None]:
! tar zxvf model.tar.gz

We will reuse the test URLs we vectorized in Step (1).

Now we load the trained Scikit-Learn Logistic Regression model and compute predictions and their confidence for our array of vectorized URLs.

In [None]:
import pandas as pd

# Load the saved model with joblib
local_model = joblib.load('model.joblib')

# Predict labels and class probabilities for the vectorized URLs
pred = pd.Series(local_model.predict(test_urls_vectorized))
prob = local_model.predict_proba(test_urls_vectorized)

Print out the predictions and associated confidence ratings of the predictions

In [None]:
# Columns of prob follow local_model.classes_ (the labels in sorted order)
x = len(pred)
for y in range(0, x):
    print(pred[y], prob[y][0], prob[y][1])

## Don't forget to delete the endpoint!

In [None]:
# SageMaker Python SDK v2 renamed predictor.endpoint to predictor.endpoint_name
sm_boto3.delete_endpoint(EndpointName=predictor.endpoint_name)