fcdalgic/GramBeddings

GRAM

Overview

The name Grambeddings comes from N-gram Embeddings. Grambeddings is a novel and scalable deep learning model that specializes in recognizing malicious websites from a given URL by employing both character and multiple n-gram embeddings in parallel sub-networks. The proposed network combines CNN, LSTM, and Attention (CLA) layers to extract discriminative syntactical and temporal features and achieve high accuracy.

This repository contains the original implementation of the paper "Grambeddings: An End-To-End Neural Model for Phishing URL-Classification Through N-Gram Embeddings".

You can read the paper here. You can also download the published Grambeddings dataset from here.

Project Structure

| Folder/File Name | Purpose |
| --- | --- |
| /data | Where the dataset files are stored under their unique dataset names. For a detailed explanation, please refer to this file. |
| /Utils | Where the common helper classes and functions are stored. For a detailed explanation, please refer to this file. |
| /outputs | Where the output/deployed files are stored. |
| /outputs/features | Stores the feature_extraction.py results (deep features). |
| /outputs/tensorboard | Stores the TensorBoard results. |
| /outputs/training | Stores the result.csv file, which contains the training metrics. |
| train.py | Trains the Grambeddings model according to the given arguments. |
| test.py | Tests the pre-trained Grambeddings model according to the given arguments. |
| single_channel_train.py | The proposed GramBeddings architecture is composed of multiple CLA (CNN+LSTM+Attention) structures. This file trains a single CLA structure while preserving the original implementation of the proposed architecture, so you can compare the scores of each n-gram level. |
| NGramSequenceTransformer.py | Grambeddings involves two phases. In the first, we extract all n-grams from the given text corpus, select the top-k n-grams, and prepare the input for the Embedding layer by converting a given text input into a one-dimensional n-gram array. This file performs all of these operations through its fit and transform processes. |
| Model.py | Contains the class that builds the Grambeddings architecture. |
| feature_extraction.py | Extracts deep features from a pre-trained model file by hooking the input of the Fully Connected Layer. The output features can be used for visualization with other libraries, as we did with UMAP in our implementation/paper. |
| adverserial_generator.py | An additional file that generates adversarial examples from specified samples. |
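To make the fit and transform steps described for NGramSequenceTransformer.py more concrete, here is a minimal sketch of the idea in plain Python. It is not the repository's implementation: the class name `SimpleNGramTransformer`, the reserved indices for padding and unknown n-grams, and the use of character n-grams are all assumptions made for illustration.

```python
from collections import Counter


def char_ngrams(text, n):
    """Return all overlapping character n-grams of `text`."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


class SimpleNGramTransformer:
    """Hypothetical stand-in for the fit/transform idea, not the repository class."""

    PAD, UNK = 0, 1  # assumed reserved indices for padding and unseen n-grams

    def __init__(self, n, max_features, max_seq_len):
        self.n = n
        self.max_features = max_features
        self.max_seq_len = max_seq_len
        self.vocab = {}

    def fit(self, urls):
        # Count every n-gram in the corpus, then keep only the top-k.
        counts = Counter()
        for url in urls:
            counts.update(char_ngrams(url, self.n))
        top = counts.most_common(self.max_features)
        # Indices start at 2 because 0 and 1 are reserved above.
        self.vocab = {gram: idx for idx, (gram, _) in enumerate(top, start=2)}
        return self

    def transform(self, url):
        # Map n-grams to indices, then trim or pad to max_seq_len.
        ids = [self.vocab.get(g, self.UNK) for g in char_ngrams(url, self.n)]
        ids = ids[:self.max_seq_len]
        return ids + [self.PAD] * (self.max_seq_len - len(ids))
```

With `max_features=1`, only the most frequent n-gram keeps its own index; every other n-gram maps to the unknown index, and short inputs are padded up to `max_seq_len`.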

Benchmarking with Your Work

Dataset-wise: To benchmark your dataset against the Grambeddings dataset, please download our 800k samples from here.
Algorithm-wise: To benchmark your implementation against ours, please follow the instructions below:
- Copy your dataset into the /data folder and convert your files into the expected file formats.
- Inside the DataUtils.py file, locate the DatasetOptions class and add your dataset name with its relative folder name there.
- Train the model with your dataset and do not forget to cite us :)
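The second step above, registering a new dataset name together with its folder, might look roughly like the sketch below. This is only an illustration of the pattern: the real `DatasetOptions` class in DataUtils.py may be structured differently, and the member `my_dataset`, the `DATA_ROOT` constant, and the `dataset_dir` helper are hypothetical names.

```python
from enum import Enum
from pathlib import Path

DATA_ROOT = Path("data")  # the /data folder from the project structure


class DatasetOptions(Enum):
    """Illustrative only; the members in the real DataUtils.py may differ."""

    grambeddings = "grambeddings"
    my_dataset = "my_dataset"  # hypothetical new entry: value is the folder under /data


def dataset_dir(option):
    """Resolve the folder that holds the files of the selected dataset."""
    return DATA_ROOT / option.value
```

Looking a member up by name, for example `DatasetOptions["my_dataset"]`, is one way a command-line value could be mapped to its folder.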

Training

To train our model, use the train.py file. Each input argument that affects the training phase is described below.

| Argument | Description |
| --- | --- |
| dataset | An enumeration value that specifies which dataset is used to train the model. |
| output | The output directory. By default it is the /output folder. |
| model_name | The output model file name. If it is None, the name is generated automatically from the given arguments. |
| ngram_1 | N-gram value of the first n-gram embedding layer. |
| ngram_2 | N-gram value of the second n-gram embedding layer. |
| ngram_3 | N-gram value of the third n-gram embedding layer. |
| max_features | Maximum number of features used to initialize the Embedding Matrix. |
| maxf | Maximum Term Frequency, used to filter out n-grams whose Term Frequency value is higher than the given threshold. We ignored it in our implementation (set to 1) but added this functionality for future work. |
| max_df | Maximum Document Frequency, used to filter out n-grams whose DF value is higher than the given threshold. We ignored it in our implementation (set to 1) but added this functionality for future work. |
| min_df | Minimum Document Frequency, used to filter out n-grams whose DF value is lower than the given threshold. We ignored it in our implementation (set to 1) but added this functionality for future work. |
| max_seq_len | Defines the length of the one-dimensional input array of the Embedding Matrix. Transformed n-gram sequences are either trimmed or padded to this length. |
| case_insensitive | Another additional functionality that we did not use in our paper but added for future work. Enables or disables case sensitivity while selecting n-grams, which affects the whole model input data representation. |
| warm_start | Experimental use only, not fully implemented yet. We are trying to find out whether pre-defining the Embedding Matrix's initial weights yields better results. |
| warm_mode | Experimental use only, not fully implemented yet. The method used to initialize the Embedding Layer weights. Only matters when warm_start is set to True. |
| embed_dim | Embedding dimension of the Embedding Layer. |
| attn_width | The attention layer width. |
| rnn_cell_size | The recurrent cell size. |
| batch_size | Batch size. |
| epochs | The number of epochs to train the model. |
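As a rough illustration of what the document-frequency thresholds do, the sketch below keeps only the n-grams whose document frequency lies between `min_df` and `max_df`. It assumes that both thresholds are fractions of the number of URLs and that n-grams are taken at character level; the function names are hypothetical and the repository's own filtering may be implemented differently.

```python
from collections import Counter


def document_frequency(urls, n):
    """Fraction of URLs that contain each character n-gram at least once."""
    if not urls:
        return {}
    df = Counter()
    for url in urls:
        # A set ensures each n-gram is counted once per URL.
        grams = {url[i:i + n] for i in range(len(url) - n + 1)}
        df.update(grams)
    total = len(urls)
    return {gram: count / total for gram, count in df.items()}


def filter_by_df(urls, n, min_df=0.0, max_df=1.0):
    """Keep n-grams whose document frequency is within [min_df, max_df]."""
    df = document_frequency(urls, n)
    return {gram for gram, freq in df.items() if min_df <= freq <= max_df}
```

Lowering `max_df` removes n-grams that occur in almost every URL, while raising `min_df` removes n-grams that are too rare to be informative.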
#  Basic/General Usage:
python .\train.py dataset=grambeddings --ngram_1=3 --ngram_2=4 --ngram_3=5 --max_seq_len=128 --attn_width=10 --embed_dim=15 --max_features=160000 --max_df=0.7 --min_df=1e-06 --rnn_cell_size=256
python .\train.py dataset=ebubekir --ngram_1=4 --ngram_2=5 --ngram_3=6 --max_seq_len=16 --attn_width=5 --embed_dim=20 --max_features=1200 --max_df=0.9 --min_df=1e-06 --rnn_cell_size=256
# To enable warm_start just specify the related argument:
python .\train.py warm_start    # warm_start is enabled
python .\train.py               # warm_start is disabled

# To enable case_insensitive just specify the related argument:
python .\train.py case_insensitive    # case_insensitive is enabled
python .\train.py                     # case_insensitive is disabled
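Switches that are enabled simply by being present are commonly implemented with argparse's `store_true` action. The sketch below shows that pattern under the assumption that the options are declared with a `--` prefix; it is not taken from train.py, whose actual argument definitions may differ.

```python
import argparse


def build_parser():
    """Build a parser with two presence-only switches (illustrative only)."""
    parser = argparse.ArgumentParser(description="Sketch of boolean switches")
    # store_true: the value is False unless the flag appears on the command line.
    parser.add_argument("--warm_start", action="store_true")
    parser.add_argument("--case_insensitive", action="store_true")
    return parser


# Passing an explicit list stands in for the real command line.
args = build_parser().parse_args(["--warm_start"])
print(args.warm_start, args.case_insensitive)  # prints: True False
```

Omitting a flag leaves its value at `False`, which matches the enabled/disabled behavior described above.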
