<h1 style="color: #4169E1; text-align:center">Domain Generation Algorithm classification</h1>

### I. Introduction

<p>
In this second notebook, exploratory data analysis is first used to better understand the behavior and trends observed in the 17 features that characterize a domain name. Then, a decision tree classifier is used to classify suspicious domains. By transforming raw domain strings into machine learning features and building a DGA classifier, it can be determined whether a given domain is legitimate or not. Overview:

<ul>
    <li>Data Processing - exploration and data cleaning</li>
    <li>Feature Engineering - from raw domain strings to features.</li>
    <li>Classification - predict whether a domain is legitimate or not using different classifiers (for this first experiment I have only trained a DecisionTreeClassifier; more classifiers will follow in later experiments).</li>
</ul>
</p>
<br>
<p>This project focuses on a machine learning-based approach to tackle the DGA attack by detecting and classifying suspicious domain names...
</p>

In [105]:
# Load Libraries
import pandas as pd
import numpy as np
import regex as re
import tldextract
from collections import Counter
from pickle import dump
from pickle import load

### II. Dataset Description
The dataset consists of 160,000 domains: 80,000 legitimate domains and 80,000 DGA domains. Each record consists of the following four fields:
<ul>
    <li><b>isDGA</b>: the classification of the domain (either <i>legit</i> or <i>dga</i>).</li>
    <li><b>domain</b>: the second level domain (SLD) (e.g., <i>facebook</i>).</li>
    <li><b>host</b>: the second level domain together with the top level domain (TLD) (e.g., <i>facebook.com</i>, <i>apple.co.uk</i>).</li>
    <li><b>subclass</b>: the malware family the domain belongs to (e.g., <i>cryptolocker</i>).</li>
</ul>

In [106]:
df = pd.read_csv('C:/Users/Jorge Payà/Desktop/4Geeks/Final Project/Code/DGA-Detection-project2/data/raw/dga_data_full.csv')
df.head()

Unnamed: 0,isDGA,domain,host,subclass
0,dga,6xzxsw3sokvg1tc752y1a6p0af,6xzxsw3sokvg1tc752y1a6p0af.com,gameoverdga
1,dga,glbtlxwwhbnpxs,glbtlxwwhbnpxs.ru,cryptolocker
2,dga,xxmamopyipbfpk,xxmamopyipbfpk.ru,cryptolocker
3,dga,zfd5szpi18i85wj9uy13l69rg,zfd5szpi18i85wj9uy13l69rg.net,newgoz
4,dga,jpqftymiuver,jpqftymiuver.ru,cryptolocker


In [107]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 160000 entries, 0 to 159999
Data columns (total 4 columns):
 #   Column    Non-Null Count   Dtype 
---  ------    --------------   ----- 
 0   isDGA     160000 non-null  object
 1   domain    159998 non-null  object
 2   host      160000 non-null  object
 3   subclass  160000 non-null  object
dtypes: object(4)
memory usage: 4.9+ MB


In [108]:
print(df.shape)
missing_values = df.isnull().sum()
print(missing_values)
print("")

(160000, 4)
isDGA       0
domain      2
host        0
subclass    0
dtype: int64



In [109]:
df = df.dropna()
print(df.shape)

(159998, 4)


In [110]:
print(f"Dimensions of the data before dropping duplicates: {df.shape}")
duplicates = df.duplicated().sum()
print(f"Number of duplicate rows: {duplicates}")
df = df.drop_duplicates()
print(f"Dimensions of the data after dropping duplicates: {df.shape}")

Dimensions of the data before dropping duplicates: (159998, 4)
Number of duplicate rows: 0
Dimensions of the data after dropping duplicates: (159998, 4)


In [111]:
print("Distribution of the target variable 'isDGA': ")
print(f"DGA: {df['isDGA'].value_counts()['dga']}")
print(f"Legit: {df['isDGA'].value_counts()['legit']}")

Distribution of the target variable 'isDGA': 
DGA: 80000
Legit: 79998


<div class="alert alert-block alert-warning">
<b>Note:</b> To ensure a fair comparison, I use a balanced dataset with 50% legit domains and 50% DGA-generated domains so that the classifier is not biased towards a majority class. However, in real-world scenarios the distribution of data is often imbalanced, with a majority of samples belonging to legit domains and a minority belonging to DGA-generated domains. This can pose challenges for the model, as it may struggle to learn from the minority class due to the overwhelming presence of the majority class. To mitigate this I will try various algorithms, especially ensemble methods that tend to be more robust to imbalanced datasets, such as Random Forests or boosting algorithms (e.g., XGBoost or LightGBM). Also, since accuracy can be misleading on imbalanced datasets, I'll consider more informative evaluation metrics, such as precision, recall, F1-score, or area under the ROC curve (AUC-ROC).
</div>
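
To make the point about accuracy concrete, here is a minimal pure-Python sketch. The label lists are made up for illustration (they are not taken from the dataset), and `binary_metrics` is a hypothetical helper, not a function used elsewhere in this notebook.

```python
def binary_metrics(y_true, y_pred):
    """Return accuracy, precision, recall and F1 for labels where 1 = DGA, 0 = legit."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Made-up imbalanced sample: 95 legit domains followed by 5 DGA domains
y_true = [0] * 95 + [1] * 5
# A degenerate classifier that labels every domain as legit
always_legit = [0] * 100
# A classifier that raises 2 false alarms and catches 4 of the 5 DGA domains
better = [0] * 93 + [1] * 2 + [1] * 4 + [0] * 1

print(binary_metrics(y_true, always_legit))
print(binary_metrics(y_true, better))
```

The always-legit classifier reaches 95% accuracy while its recall on the DGA class is zero, which is why precision, recall and F1 are worth reporting alongside accuracy.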

### III. Exploratory Data Analysis (EDA) and Feature Engineering
<p>A domain name is a sequence of labels separated by dots (e.g., <i>www.example.com</i>). It contains a chosen prefix, also known as the second level domain (e.g., <i>example</i>), and a public suffix, also known as the top level domain (e.g., <i>.com</i>, <i>.co.uk</i>). The top level domain can contain more than one label (e.g., <i>.co.uk</i>). A domain name can also be organized into further subdomains (e.g., <i>ssh.example.com</i>):</p>

<img src="./img/domains.png" width="" height="" />

<p>
Look at the blue and red boxes above: on the left are legitimate domains that almost everyone has visited, and on the right are malicious domains. Asking intuitively what makes them different, I clearly see that the distribution of characters in the malicious domains is very different from that in the legitimate ones: it looks more random. This is the intuition I hope to capture in the feature engineering process.

Therefore, I look at the malicious and benign domains, think about what my human brain uses to tell them apart, and then try to encode that intuition so that a machine learning model can use it algorithmically.</p>

#### A. Domain name preprocessing
For this experiment I consider the second level domain to understand the characteristics of suspicious domain names. Therefore, I will split the second level domain and top level domain. 
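
As a rough illustration of the split, the sketch below separates a host into subdomain, second level domain and public suffix using only the standard library. It assumes a tiny hard-coded suffix set (`KNOWN_SUFFIXES`, hypothetical and far from complete), and the fallback for unknown suffixes is my own choice. The notebook itself relies on `tldextract`, which consults the full Public Suffix List.

```python
# Hypothetical, deliberately tiny stand-in for the Public Suffix List
KNOWN_SUFFIXES = {"com", "net", "ru", "co.uk"}

def split_host(host):
    """Split a host into (subdomain, second_level_domain, suffix)."""
    labels = host.lower().strip(".").split(".")
    # Try the longest candidate suffix first so 'co.uk' wins over 'uk'
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in KNOWN_SUFFIXES:
            rest = labels[:i]
            sld = rest[-1] if rest else ""
            subdomain = ".".join(rest[:-1])
            return subdomain, sld, candidate
    # Fallback chosen for this sketch: no known suffix, last label is the SLD
    return ".".join(labels[:-1]), labels[-1], ""

print(split_host("ssh.example.com"))
print(split_host("apple.co.uk"))
```

Multi-label suffixes such as <i>co.uk</i> are the reason a plain split on the last dot is not enough.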

In [112]:
df.drop(['domain', 'subclass'], axis=1, inplace=True)
df.head()

Unnamed: 0,isDGA,host
0,dga,6xzxsw3sokvg1tc752y1a6p0af.com
1,dga,glbtlxwwhbnpxs.ru
2,dga,xxmamopyipbfpk.ru
3,dga,zfd5szpi18i85wj9uy13l69rg.net
4,dga,jpqftymiuver.ru


In [113]:
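# Returns the second level domain joined with the public suffix, e.g. 'example.com'
def extract_tld(host):
    ext = tldextract.extract(host)
    return '.'.join(part for part in [ext.domain, ext.suffix] if part)

# Returns the host without its public suffix: subdomain (if any) plus second level domain
def extract_subdomain_and_domain(host):
    ext = tldextract.extract(host)
    return ext.domain if ext.subdomain == '' else '.'.join([ext.subdomain, ext.domain])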

df['tld'] = df['host'].apply(extract_tld)
df['host'] = df['host'].apply(extract_subdomain_and_domain)
df.head()

Unnamed: 0,isDGA,host,tld
0,dga,6xzxsw3sokvg1tc752y1a6p0af,6xzxsw3sokvg1tc752y1a6p0af.com
1,dga,glbtlxwwhbnpxs,glbtlxwwhbnpxs.ru
2,dga,xxmamopyipbfpk,xxmamopyipbfpk.ru
3,dga,zfd5szpi18i85wj9uy13l69rg,zfd5szpi18i85wj9uy13l69rg.net
4,dga,jpqftymiuver,jpqftymiuver.ru


In [114]:
def extract_top_level_domain(tld):
    return '.'.join(tld.split('.')[1:]) if '.' in tld else tld

# Apply the function to the 'tld' column
df['tld'] = df['tld'].apply(extract_top_level_domain)
df.head()

Unnamed: 0,isDGA,host,tld
0,dga,6xzxsw3sokvg1tc752y1a6p0af,com
1,dga,glbtlxwwhbnpxs,ru
2,dga,xxmamopyipbfpk,ru
3,dga,zfd5szpi18i85wj9uy13l69rg,net
4,dga,jpqftymiuver,ru


In [115]:
# I first transform the target variable into a binary one
df['isDGA'] = df['isDGA'].apply(lambda x: 1 if x == 'dga' else 0)
df.head()

Unnamed: 0,isDGA,host,tld
0,1,6xzxsw3sokvg1tc752y1a6p0af,com
1,1,glbtlxwwhbnpxs,ru
2,1,xxmamopyipbfpk,ru
3,1,zfd5szpi18i85wj9uy13l69rg,net
4,1,jpqftymiuver,ru


In [116]:
# Check for empty or null values
missing_values = df.isnull().sum()
print(missing_values)

isDGA    0
host     0
tld      0
dtype: int64


In [117]:
# Save as csv named 'preprocessed_data.csv'
df.to_csv('C:/Users/Jorge Payà/Desktop/4Geeks/Final Project/Code/DGA-Detection-project2/data/interim/preprocessed_data.csv', index=False)

In [118]:
df.shape

(159998, 3)

#### B. Feature Engineering
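Next, 16 features are extracted from the domain name and explained one by one. They characterize the domain name and can therefore be supplied to the model. Together with the <i>tld</i> column created above, this gives the 17 features mentioned in the introduction.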

<div class="alert alert-block alert-info">
<b>d_length - </b>This feature represents the length of the domain name. DGAs may generate domain names with varying lengths, so this feature helps capture potential patterns related to the length of DGA-generated domain names.
</div>

In [119]:
# Function to get the length of the domain
df['d_length'] = df['host'].apply(lambda x: len(x))
df.head()

Unnamed: 0,isDGA,host,tld,d_length
0,1,6xzxsw3sokvg1tc752y1a6p0af,com,26
1,1,glbtlxwwhbnpxs,ru,14
2,1,xxmamopyipbfpk,ru,14
3,1,zfd5szpi18i85wj9uy13l69rg,net,25
4,1,jpqftymiuver,ru,12


<div class="alert alert-block alert-info">
<b>unique_char_count - </b>This feature counts the total number of unique characters in the domain name. DGAs often generate domain names with repetitive patterns or random sequences of characters, so a higher unique character count may indicate a domain name that is less likely to be generated by a DGA.
</div>

In [120]:
# Function to get the number of unique characters in a domain name
def unique_char_count(domain):
    return len(set(domain))

df['unique_char_count'] = df['host'].apply(unique_char_count)
df.head()

Unnamed: 0,isDGA,host,tld,d_length,unique_char_count
0,1,6xzxsw3sokvg1tc752y1a6p0af,com,26,21
1,1,glbtlxwwhbnpxs,ru,14,10
2,1,xxmamopyipbfpk,ru,14,10
3,1,zfd5szpi18i85wj9uy13l69rg,net,25,19
4,1,jpqftymiuver,ru,12,12


<div class="alert alert-block alert-info">
<b>unique_letter_count - </b>Similar to <i>unique_char_count</i>, this feature specifically counts the number of unique letters (alphabetic characters) in the domain name. It helps capture the diversity of letters used in the domain name, which can be indicative of a non-DGA-generated domain name.
</div>

In [121]:
# Function to get the number of unique letters in a domain name
def unique_letter_count(domain):
    return len(set(re.sub(r'[^a-z]', '', domain)))

df['unique_letter_count'] = df['host'].apply(unique_letter_count)
df.head()

Unnamed: 0,isDGA,host,tld,d_length,unique_char_count,unique_letter_count
0,1,6xzxsw3sokvg1tc752y1a6p0af,com,26,21,14
1,1,glbtlxwwhbnpxs,ru,14,10,10
2,1,xxmamopyipbfpk,ru,14,10,10
3,1,zfd5szpi18i85wj9uy13l69rg,net,25,19,13
4,1,jpqftymiuver,ru,12,12,12


<div class="alert alert-block alert-info">
<b>unique_digit_count - </b>This feature counts the number of unique digits (numeric characters) in the domain name. DGAs may incorporate numeric sequences into domain names, so a higher number of unique digits may suggest a DGA-generated domain name.
</div>
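
The three unique-count features can be checked by hand on a sample domain. The sketch below is a set-based variant, assuming lowercase ASCII hosts; `unique_counts` is a hypothetical helper name and not part of the notebook's pipeline.

```python
import string

def unique_counts(name):
    """Return (unique characters, unique lowercase letters, unique digits)."""
    chars = set(name)
    letters = chars & set(string.ascii_lowercase)
    digits = chars & set(string.digits)
    return len(chars), len(letters), len(digits)

# First DGA sample from the dataset preview
print(unique_counts("6xzxsw3sokvg1tc752y1a6p0af"))
# A letters-only DGA sample in which every character is distinct
print(unique_counts("jpqftymiuver"))
```

For hosts made only of lowercase letters and digits, the letter and digit counts add up to the character count.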

In [122]:
# Function to get the number of unique digits in a domain name
def unique_digit_count(domain):
    return len(set(re.sub(r'[^0-9]', '', domain)))

df['unique_digit_count'] = df['host'].apply(unique_digit_count)
df.head()

Unnamed: 0,isDGA,host,tld,d_length,unique_char_count,unique_letter_count,unique_digit_count
0,1,6xzxsw3sokvg1tc752y1a6p0af,com,26,21,14,7
1,1,glbtlxwwhbnpxs,ru,14,10,10,0
2,1,xxmamopyipbfpk,ru,14,10,10,0
3,1,zfd5szpi18i85wj9uy13l69rg,net,25,19,13,6
4,1,jpqftymiuver,ru,12,12,12,0


<div class="alert alert-block alert-info">
<b>letter_ratio - </b>This feature calculates the ratio of letters (alphabetic characters) to the total number of characters in the domain name. It provides insight into the proportion of letters used in the domain name, which can help distinguish between DGA-generated and non-DGA-generated domain names.
</div>

In [124]:
# Function to get the ratio of letters to the length of the domain name
def letter_ratio(domain):
    letters = re.sub(r'[^a-z]', '', domain)
    return len(letters) / len(domain) if domain else 0

df['letter_ratio'] = df['host'].apply(letter_ratio)
df.head()

Unnamed: 0,isDGA,host,tld,d_length,unique_char_count,unique_letter_count,unique_digit_count,letter_ratio
0,1,6xzxsw3sokvg1tc752y1a6p0af,com,26,21,14,7,0.653846
1,1,glbtlxwwhbnpxs,ru,14,10,10,0,1.0
2,1,xxmamopyipbfpk,ru,14,10,10,0,1.0
3,1,zfd5szpi18i85wj9uy13l69rg,net,25,19,13,6,0.6
4,1,jpqftymiuver,ru,12,12,12,0,1.0


<div class="alert alert-block alert-info">
<b>digit_ratio - </b>Similar to <i>letter_ratio</i>, this feature calculates the ratio of digits (numeric characters) to the total number of characters in the domain name. It helps capture the proportion of numeric sequences in the domain name, which can be indicative of DGA-generated domain names.
</div>

In [127]:
# Function to get the ratio of digits to the length of the domain name
def digit_ratio(domain):
    digits = re.sub(r'[^0-9]', '', domain)
    return len(digits) / len(domain) if domain else 0

df['digit_ratio'] = df['host'].apply(digit_ratio)
df.head()

Unnamed: 0,isDGA,host,tld,d_length,unique_char_count,unique_letter_count,unique_digit_count,letter_ratio,digit_ratio
0,1,6xzxsw3sokvg1tc752y1a6p0af,com,26,21,14,7,0.653846,0.346154
1,1,glbtlxwwhbnpxs,ru,14,10,10,0,1.0,0.0
2,1,xxmamopyipbfpk,ru,14,10,10,0,1.0,0.0
3,1,zfd5szpi18i85wj9uy13l69rg,net,25,19,13,6,0.6,0.4
4,1,jpqftymiuver,ru,12,12,12,0,1.0,0.0


<div class="alert alert-block alert-info">
<b>unique_letter_ratio - </b>This feature calculates the ratio of unique letters to the number of unique characters in the domain name. It measures the diversity of letters used in the domain name, which can help differentiate between DGA-generated and non-DGA-generated domain names.
</div>

In [129]:
# Function to get the ratio of unique letters to unique characters 
def unique_letter_ratio(domain):
    letters = re.sub(r'[^a-z]', '', domain)
    return len(set(letters)) / len(set(domain)) if domain else 0

df['unique_letter_ratio'] = df['host'].apply(unique_letter_ratio)
df.head()

Unnamed: 0,isDGA,host,tld,d_length,unique_char_count,unique_letter_count,unique_digit_count,letter_ratio,digit_ratio,unique_letter_ratio
0,1,6xzxsw3sokvg1tc752y1a6p0af,com,26,21,14,7,0.653846,0.346154,0.666667
1,1,glbtlxwwhbnpxs,ru,14,10,10,0,1.0,0.0,1.0
2,1,xxmamopyipbfpk,ru,14,10,10,0,1.0,0.0,1.0
3,1,zfd5szpi18i85wj9uy13l69rg,net,25,19,13,6,0.6,0.4,0.684211
4,1,jpqftymiuver,ru,12,12,12,0,1.0,0.0,1.0


<div class="alert alert-block alert-info">
<b>unique_digit_ratio - </b>Similar to <i>unique_letter_ratio</i>, this feature calculates the ratio of unique digits to the number of unique characters in the domain name. It captures the diversity of numeric sequences in the domain name, which can aid in DGA detection and classification.
</div>

In [131]:
# Function to get the ratio of unique digits to unique characters
def unique_digit_ratio(domain):
    digits = re.sub(r'[^0-9]', '', domain)
    return len(set(digits)) / len(set(domain)) if domain else 0

df['unique_digit_ratio'] = df['host'].apply(unique_digit_ratio)
df.head()

Unnamed: 0,isDGA,host,tld,d_length,unique_char_count,unique_letter_count,unique_digit_count,letter_ratio,digit_ratio,unique_letter_ratio,unique_digit_ratio
0,1,6xzxsw3sokvg1tc752y1a6p0af,com,26,21,14,7,0.653846,0.346154,0.666667,0.333333
1,1,glbtlxwwhbnpxs,ru,14,10,10,0,1.0,0.0,1.0,0.0
2,1,xxmamopyipbfpk,ru,14,10,10,0,1.0,0.0,1.0,0.0
3,1,zfd5szpi18i85wj9uy13l69rg,net,25,19,13,6,0.6,0.4,0.684211,0.315789
4,1,jpqftymiuver,ru,12,12,12,0,1.0,0.0,1.0,0.0


<div class="alert alert-block alert-info">
<b>special_char_ratio - </b>This feature calculates the ratio of special characters (non-alphanumeric characters) to the total number of characters in the domain name. DGAs may or may not incorporate special characters into generated domain names, so this feature helps capture potential patterns related to the presence of special characters.
</div>
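
Since every character is either a lowercase letter, a digit, or something else, <i>letter_ratio</i>, <i>digit_ratio</i> and <i>special_char_ratio</i> always sum to 1 for a non-empty name. The sketch below shows this with a hypothetical helper, `char_class_ratios`, which is not part of the notebook's pipeline. The second sample name is made up.

```python
def char_class_ratios(name):
    """Return (letter_ratio, digit_ratio, special_char_ratio) for a name."""
    if not name:
        return 0.0, 0.0, 0.0
    n = len(name)
    letters = sum(1 for c in name if "a" <= c <= "z")
    digits = sum(1 for c in name if "0" <= c <= "9")
    # Everything that is neither a lowercase letter nor a digit counts as special
    special = n - letters - digits
    return letters / n, digits / n, special / n

for sample in ["zfd5szpi18i85wj9uy13l69rg", "my-site1"]:
    ratios = char_class_ratios(sample)
    print(sample, ratios, sum(ratios))
```

Note that with these character classes an uppercase letter would land in the special bucket, so hosts are assumed to be lowercase.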

In [132]:
# Function to get the ratio of special characters to the length of the domain name
def special_char_ratio(domain):
    special_chars = re.sub(r'[a-z0-9]', '', domain)
    return len(special_chars) / len(domain) if domain else 0

df['special_char_ratio'] = df['host'].apply(special_char_ratio)
df.head()

Unnamed: 0,isDGA,host,tld,d_length,unique_char_count,unique_letter_count,unique_digit_count,letter_ratio,digit_ratio,unique_letter_ratio,unique_digit_ratio,special_char_ratio
0,1,6xzxsw3sokvg1tc752y1a6p0af,com,26,21,14,7,0.653846,0.346154,0.666667,0.333333,0.0
1,1,glbtlxwwhbnpxs,ru,14,10,10,0,1.0,0.0,1.0,0.0,0.0
2,1,xxmamopyipbfpk,ru,14,10,10,0,1.0,0.0,1.0,0.0,0.0
3,1,zfd5szpi18i85wj9uy13l69rg,net,25,19,13,6,0.6,0.4,0.684211,0.315789,0.0
4,1,jpqftymiuver,ru,12,12,12,0,1.0,0.0,1.0,0.0,0.0


<div class="alert alert-block alert-info">
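<b>consonant_ratio - </b>This feature calculates the ratio of consonant letters to the total number of characters in the domain name, so digits and special characters count toward the denominator. Every letter other than <i>a, e, i, o, u</i> is treated as a consonant, including <i>y</i>. It provides insight into the distribution of consonant letters, which can be useful for distinguishing between different types of domain name patterns.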
</div>

In [134]:
# Function to calculate ratio of consonant characters in domain name
def consonant_ratio(domain):
    consonants = sum(1 for char in domain if char.isalpha() and char.lower() not in 'aeiou')
    return consonants / len(domain) if domain else 0

df['consonant_ratio'] = df['host'].apply(consonant_ratio)
df.head()

Unnamed: 0,isDGA,host,tld,d_length,unique_char_count,unique_letter_count,unique_digit_count,letter_ratio,digit_ratio,unique_letter_ratio,unique_digit_ratio,special_char_ratio,consonant_ratio
0,1,6xzxsw3sokvg1tc752y1a6p0af,com,26,21,14,7,0.653846,0.346154,0.666667,0.333333,0.0,0.538462
1,1,glbtlxwwhbnpxs,ru,14,10,10,0,1.0,0.0,1.0,0.0,0.0,1.0
2,1,xxmamopyipbfpk,ru,14,10,10,0,1.0,0.0,1.0,0.0,0.0,0.785714
3,1,zfd5szpi18i85wj9uy13l69rg,net,25,19,13,6,0.6,0.4,0.684211,0.315789,0.0,0.48
4,1,jpqftymiuver,ru,12,12,12,0,1.0,0.0,1.0,0.0,0.0,0.75


<div class="alert alert-block alert-info">
<b>vowel_ratio - </b>Similar to <i>consonant_ratio</i>, this feature calculates the ratio of vowel letters (<i>a, e, i, o, u</i>) to the total number of characters in the domain name. It helps capture the distribution of vowel letters, which can contribute to the characterization of domain name patterns.
</div>

In [135]:
# Function to calculate ratio of vowels in domain name
def vowel_ratio(domain):
    vowels = sum(1 for char in domain if char.lower() in 'aeiou')
    return vowels / len(domain) if domain else 0

df['vowel_ratio'] = df['host'].apply(vowel_ratio)
df.head()

Unnamed: 0,isDGA,host,tld,d_length,unique_char_count,unique_letter_count,unique_digit_count,letter_ratio,digit_ratio,unique_letter_ratio,unique_digit_ratio,special_char_ratio,consonant_ratio,vowel_ratio
0,1,6xzxsw3sokvg1tc752y1a6p0af,com,26,21,14,7,0.653846,0.346154,0.666667,0.333333,0.0,0.538462,0.115385
1,1,glbtlxwwhbnpxs,ru,14,10,10,0,1.0,0.0,1.0,0.0,0.0,1.0,0.0
2,1,xxmamopyipbfpk,ru,14,10,10,0,1.0,0.0,1.0,0.0,0.0,0.785714,0.214286
3,1,zfd5szpi18i85wj9uy13l69rg,net,25,19,13,6,0.6,0.4,0.684211,0.315789,0.0,0.48,0.12
4,1,jpqftymiuver,ru,12,12,12,0,1.0,0.0,1.0,0.0,0.0,0.75,0.25


<div class="alert alert-block alert-info">
<b>longest_consonant_string - </b>This feature identifies the longest consecutive string of consonant letters in the domain name. It helps capture patterns related to the arrangement of consonant letters, which can be informative for DGA detection and classification.
</div>

In [136]:
# Function to get the length of the longest consonant string
def longest_consonant_string(domain):
    consonants = 'bcdfghjklmnpqrstvwxyzBCDFGHJKLMNPQRSTVWXYZ'
    max_length = 0
    current_length = 0
    for char in domain:
        if char in consonants:
            current_length += 1
            max_length = max(max_length, current_length)
        else:
            current_length = 0
    return max_length

df['long_consonant_str'] = df['host'].apply(longest_consonant_string)
df.head()

Unnamed: 0,isDGA,host,tld,d_length,unique_char_count,unique_letter_count,unique_digit_count,letter_ratio,digit_ratio,unique_letter_ratio,unique_digit_ratio,special_char_ratio,consonant_ratio,vowel_ratio,long_consonant_str
0,1,6xzxsw3sokvg1tc752y1a6p0af,com,26,21,14,7,0.653846,0.346154,0.666667,0.333333,0.0,0.538462,0.115385,5
1,1,glbtlxwwhbnpxs,ru,14,10,10,0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,14
2,1,xxmamopyipbfpk,ru,14,10,10,0,1.0,0.0,1.0,0.0,0.0,0.785714,0.214286,5
3,1,zfd5szpi18i85wj9uy13l69rg,net,25,19,13,6,0.6,0.4,0.684211,0.315789,0.0,0.48,0.12,3
4,1,jpqftymiuver,ru,12,12,12,0,1.0,0.0,1.0,0.0,0.0,0.75,0.25,7


<div class="alert alert-block alert-info">
<b>longest_vowel_string - </b>Similar to <i>longest_consonant_string</i>, this feature identifies the longest consecutive string of vowel letters in the domain name. It contributes to the characterization of vowel letter patterns, which can aid in distinguishing between different types of domain names.
</div>

In [137]:
# Function to get the length of the longest vowel sequence
def longest_vowel_string(domain):
    vowels = 'aeiouAEIOU'
    max_length = 0
    current_length = 0
    for char in domain:
        if char in vowels:
            current_length += 1
            max_length = max(max_length, current_length)
        else:
            current_length = 0
    return max_length

df['long_vowel_str'] = df['host'].apply(longest_vowel_string)
df.head()

Unnamed: 0,isDGA,host,tld,d_length,unique_char_count,unique_letter_count,unique_digit_count,letter_ratio,digit_ratio,unique_letter_ratio,unique_digit_ratio,special_char_ratio,consonant_ratio,vowel_ratio,long_consonant_str,long_vowel_str
0,1,6xzxsw3sokvg1tc752y1a6p0af,com,26,21,14,7,0.653846,0.346154,0.666667,0.333333,0.0,0.538462,0.115385,5,1
1,1,glbtlxwwhbnpxs,ru,14,10,10,0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,14,0
2,1,xxmamopyipbfpk,ru,14,10,10,0,1.0,0.0,1.0,0.0,0.0,0.785714,0.214286,5,1
3,1,zfd5szpi18i85wj9uy13l69rg,net,25,19,13,6,0.6,0.4,0.684211,0.315789,0.0,0.48,0.12,3,1
4,1,jpqftymiuver,ru,12,12,12,0,1.0,0.0,1.0,0.0,0.0,0.75,0.25,7,2


<div class="alert alert-block alert-info">
<b>longest_number_string - </b>This feature identifies the longest consecutive string of numeric digits in the domain name. It helps capture patterns related to numeric sequences, which may be indicative of DGA-generated domain names.
</div>
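
The three "longest string" features share the same logic: find the longest run of consecutive characters from a given alphabet. The sketch below is one alternative formulation using `itertools.groupby`; `longest_run` and the three alphabet constants are hypothetical names, and this is not the implementation used in the cells of this notebook.

```python
from itertools import groupby

VOWELS = set("aeiou")
DIGITS = set("0123456789")
CONSONANTS = set("bcdfghjklmnpqrstvwxyz")

def longest_run(name, alphabet):
    """Length of the longest run of consecutive characters drawn from alphabet."""
    runs = (
        sum(1 for _ in group)
        for in_alphabet, group in groupby(name.lower(), key=lambda c: c in alphabet)
        if in_alphabet
    )
    return max(runs, default=0)

print(longest_run("xxmamopyipbfpk", CONSONANTS))
print(longest_run("jpqftymiuver", VOWELS))
print(longest_run("6xzxsw3sokvg1tc752y1a6p0af", DIGITS))
```

A single helper parameterized by the alphabet avoids repeating the same loop three times, at the cost of being slightly less explicit.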

In [138]:
# Function to get the length of the longest string of numbers
def longest_number_string(domain):
    numbers = '0123456789'
    max_length = 0
    current_length = 0
    for char in domain:
        if char in numbers:
            current_length += 1
            max_length = max(max_length, current_length)
        else:
            current_length = 0
    return max_length

df['long_number_str'] = df['host'].apply(longest_number_string)
df.head()

Unnamed: 0,isDGA,host,tld,d_length,unique_char_count,unique_letter_count,unique_digit_count,letter_ratio,digit_ratio,unique_letter_ratio,unique_digit_ratio,special_char_ratio,consonant_ratio,vowel_ratio,long_consonant_str,long_vowel_str,long_number_str
0,1,6xzxsw3sokvg1tc752y1a6p0af,com,26,21,14,7,0.653846,0.346154,0.666667,0.333333,0.0,0.538462,0.115385,5,1,3
1,1,glbtlxwwhbnpxs,ru,14,10,10,0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,14,0,0
2,1,xxmamopyipbfpk,ru,14,10,10,0,1.0,0.0,1.0,0.0,0.0,0.785714,0.214286,5,1,0
3,1,zfd5szpi18i85wj9uy13l69rg,net,25,19,13,6,0.6,0.4,0.684211,0.315789,0.0,0.48,0.12,3,1,2
4,1,jpqftymiuver,ru,12,12,12,0,1.0,0.0,1.0,0.0,0.0,0.75,0.25,7,2,0


<div class="alert alert-block alert-info">
<b>Entropy - </b>In detecting DGA, entropy plays a crucial role as it measures the <u>randomness or unpredictability of domain names</u>. <u>Legitimate domain names typically exhibit lower entropy</u>, as they follow patterns related to the organization's branding or naming conventions. In contrast, malicious <u>DGA-generated domains often have higher entropy, as they lack recognizable patterns and are randomly generated to evade detection</u>.
</div>
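
Concretely, the Shannon entropy of a string is H = -sum(p_c * log2(p_c)), where p_c is the relative frequency of character c. The sketch below uses only the standard library, and `shannon_entropy` is a hypothetical name for illustration. A string in which all k characters are distinct reaches the maximum log2(k), while a string that repeats a single character has entropy 0.

```python
import math
from collections import Counter

def shannon_entropy(text):
    """Shannon entropy, in bits per character, of the character distribution of text."""
    if not text:
        return 0.0
    n = len(text)
    return -sum((count / n) * math.log2(count / n) for count in Counter(text).values())

print(shannon_entropy("jpqftymiuver"))  # 12 distinct characters, so log2(12)
print(shannon_entropy("abab"))          # two equally likely characters, so 1 bit
print(shannon_entropy("aaaa"))          # a single repeated character, so 0
```

Because the maximum grows with the number of distinct characters, entropy is partly correlated with length and with <i>unique_char_count</i>.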

In [140]:
# Function to get the entropy of a domain name
def entropy(domain):
    p, lns = Counter(domain), float(len(domain))
    return -sum( count/lns * np.log2(count/lns) for count in p.values())

df['entropy'] = df['host'].apply(entropy)
df.head()

Unnamed: 0,isDGA,host,tld,d_length,unique_char_count,unique_letter_count,unique_digit_count,letter_ratio,digit_ratio,unique_letter_ratio,unique_digit_ratio,special_char_ratio,consonant_ratio,vowel_ratio,long_consonant_str,long_vowel_str,long_number_str,entropy
0,1,6xzxsw3sokvg1tc752y1a6p0af,com,26,21,14,7,0.653846,0.346154,0.666667,0.333333,0.0,0.538462,0.115385,5,1,3,4.315824
1,1,glbtlxwwhbnpxs,ru,14,10,10,0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,14,0,0,3.235926
2,1,xxmamopyipbfpk,ru,14,10,10,0,1.0,0.0,1.0,0.0,0.0,0.785714,0.214286,5,1,0,3.182006
3,1,zfd5szpi18i85wj9uy13l69rg,net,25,19,13,6,0.6,0.4,0.684211,0.315789,0.0,0.48,0.12,3,1,2,4.163856
4,1,jpqftymiuver,ru,12,12,12,0,1.0,0.0,1.0,0.0,0.0,0.75,0.25,7,2,0,3.584963


<div class="alert alert-block alert-info">
<b>N-grams - </b> In DGA detection, n-grams are utilized to <u>identify patterns in domain names that may indicate malicious activity</u>. By extracting n-grams (sequences of characters) from domain names, <u>algorithms can analyze the frequency and distribution of character sequences, identifying anomalies or patterns consistent with DGA-generated domains</u>. For example, DGA-generated domains often exhibit unusual combinations of characters or character sequences not commonly found in legitimate domain names.<br>
<br>N-gram features capture the pronounceability of a domain name, a well-studied problem in linguistics that can be reduced to quantifying the extent to which a string adheres to the phonotactics of the English language (<a href="https://seclab.cs.ucsb.edu/files/publications/Schiavoni2014Phoenix_DGA-Based.pdf" target="_blank">Schiavoni 2014: "Phoenix: DGA-Based Botnet Tracking and Intelligence"</a>, see the section Linguistic Features).
</div>
<img src="./img/ngrams.png" width="" height="" />

<div class="alert alert-block alert-info">In a nutshell, converting a domain into n-gram features means exploding the domain into substrings of different lengths. The example above takes the sample domain 'iglxbvkw' and explodes it into unigrams, bigrams and trigrams, which are substrings of 1, 2 and 3 characters respectively. For the n-gram functions below, I use the 10,000 most common English words in order of frequency (available in this <a href="https://github.com/first20hours/google-10000-english" target="_blank">repo</a>), as determined by <a href="https://en.wikipedia.org/wiki/Frequency_analysis" target="_blank">n-gram frequency analysis</a> of <a href="https://books.google.com/ngrams/info" target="_blank">Google's Trillion Word Corpus</a>.
</div>
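
The explosion into n-grams and the lookup against a ranked word list can be sketched as follows. The word list here is a made-up six-word stand-in (`toy_words`), not the 10,000-word file, and `char_ngrams` and `rank_score` are hypothetical names. The score sums the ranks of the n-grams that happen to be words in the list and divides by the number of n-grams.

```python
def char_ngrams(text, n):
    """Return all contiguous substrings of length n, from left to right."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

domain = "iglxbvkw"
print(char_ngrams(domain, 1))
print(char_ngrams(domain, 2))
print(char_ngrams(domain, 3))

# Hypothetical toy word list, ordered by frequency (rank 1 = most common)
toy_words = ["the", "of", "and", "to", "a", "in"]
toy_rank = {word: rank for rank, word in enumerate(toy_words, 1)}

def rank_score(text, n, rank):
    """Sum of the ranks of the n-grams found in the word list, per n-gram."""
    grams = char_ngrams(text, n)
    return sum(rank.get(g, 0) for g in grams) / len(grams) if grams else 0.0

print(rank_score("intothe", 2, toy_rank))
print(rank_score(domain, 2, toy_rank))
```

A random-looking string such as 'iglxbvkw' contains no n-gram that matches a listed word, so its score under this toy list is 0.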

In [141]:
top_english_words = pd.read_csv('C:/Users/Jorge Payà/Desktop/4Geeks/Final Project/Code/DGA-Detection-project2/data/raw/google-10000-english.txt', header=None, names=['words'])
d = top_english_words
# Persist the word list; the context manager makes sure the file handle is closed
with open('C:/Users/Jorge Payà/Desktop/4Geeks/Final Project/Code/DGA-Detection-project2/data/raw/top_english_words.pkl', 'wb') as f:
    dump(d, f)

In [142]:
# Function to generate all n-grams of a word (or list of words) for each size in n
def ngrams(word, n):
    l_ngrams = []
    n = n if isinstance(n, list) else [n]
    word = word if isinstance(word, list) else [word]

    for w in word:
        for curr_n in n:
            # Slide a window of width curr_n over the word
            grams = [w[i:i+curr_n] for i in range(0, len(w)-curr_n+1)]
            l_ngrams.extend(grams)
    return l_ngrams

# Function to calculate the n-gram feature for a domain:
# sum of the word-list ranks of its n-grams, divided by the number of n-grams
def ngram_feature(domain, d, n):
    l_ngrams = ngrams(domain, n)
    count_sum = sum(d.get(ngram, 0) for ngram in l_ngrams)
    n_windows = len(domain)-n+1
    # Guard against domains shorter than n (zero or negative window count)
    feature = count_sum/n_windows if n_windows > 0 else 0
    return feature

# Function to calculate the average n-gram feature for a domain    
def average_ngram_feature(l_ngram_feature):    
    return sum(l_ngram_feature)/len(l_ngram_feature) if l_ngram_feature else 0

# Map each word to its rank in the frequency-ordered list (1 = most common)
dict_freq = { word[0]: num for num, word in enumerate(d.values, 1) }

df['ngrams'] = df['host'].apply(lambda x: average_ngram_feature([ngram_feature(x, dict_freq, n) for n in [1,2,3]]))
df.head()

Unnamed: 0,isDGA,host,tld,d_length,unique_char_count,unique_letter_count,unique_digit_count,letter_ratio,digit_ratio,unique_letter_ratio,unique_digit_ratio,special_char_ratio,consonant_ratio,vowel_ratio,long_consonant_str,long_vowel_str,long_number_str,entropy,ngrams
0,1,6xzxsw3sokvg1tc752y1a6p0af,com,26,21,14,7,0.653846,0.346154,0.666667,0.333333,0.0,0.538462,0.115385,5,1,3,4.315824,390.462051
1,1,glbtlxwwhbnpxs,ru,14,10,10,0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,14,0,0,3.235926,1133.379121
2,1,xxmamopyipbfpk,ru,14,10,10,0,1.0,0.0,1.0,0.0,0.0,0.785714,0.214286,5,1,0,3.182006,1005.406593
3,1,zfd5szpi18i85wj9uy13l69rg,net,25,19,13,6,0.6,0.4,0.684211,0.315789,0.0,0.48,0.12,3,1,2,4.163856,382.083333
4,1,jpqftymiuver,ru,12,12,12,0,1.0,0.0,1.0,0.0,0.0,0.75,0.25,7,2,0,3.584963,1477.137879


In [143]:
print(f"After having applied a feature engineering process, the total new features are: ")
print(df.shape[1]-2)  

After having applied a feature engineering process, the total new features are: 
17


In [144]:
df_final = df
df_final.to_csv('C:/Users/Jorge Payà/Desktop/4Geeks/Final Project/Code/DGA-Detection-project2/data/processed/dga_features_final.csv', index=False)
df_final.head()

Unnamed: 0,isDGA,host,tld,d_length,unique_char_count,unique_letter_count,unique_digit_count,letter_ratio,digit_ratio,unique_letter_ratio,unique_digit_ratio,special_char_ratio,consonant_ratio,vowel_ratio,long_consonant_str,long_vowel_str,long_number_str,entropy,ngrams
0,1,6xzxsw3sokvg1tc752y1a6p0af,com,26,21,14,7,0.653846,0.346154,0.666667,0.333333,0.0,0.538462,0.115385,5,1,3,4.315824,390.462051
1,1,glbtlxwwhbnpxs,ru,14,10,10,0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,14,0,0,3.235926,1133.379121
2,1,xxmamopyipbfpk,ru,14,10,10,0,1.0,0.0,1.0,0.0,0.0,0.785714,0.214286,5,1,0,3.182006,1005.406593
3,1,zfd5szpi18i85wj9uy13l69rg,net,25,19,13,6,0.6,0.4,0.684211,0.315789,0.0,0.48,0.12,3,1,2,4.163856,382.083333
4,1,jpqftymiuver,ru,12,12,12,0,1.0,0.0,1.0,0.0,0.0,0.75,0.25,7,2,0,3.584963,1477.137879
