# Exploration of DNS-over-HTTPS Traffic Dataset

The purpose of this notebook is to visualize the characteristics of the DNS-over-HTTPS (DoH) dataset. We calculate and visualize several aspects of the data to understand its underlying patterns, using standard data exploration techniques and a variety of plots, including correlation heatmaps, scatter plots, and histograms.

## Data Preparation
We start by loading two existing datasets of statistical features of TCP connections carrying DoH traffic into Pandas dataframes. One dataset contains normal traffic and the other contains malicious traffic.

The datasets are in CSV format with a column for each feature. The Pandas dataframe allows us to read the CSV file into a Python data structure that is very similar to an Excel sheet. 

In [None]:
# The file that contains the normal DoH traffic data. In this case, we are using traffic generated
# with a Cloudflare DoH server, as the file name indicates.
normal_doh_traffic_dataset = '../doh_traffic_datasets/normal_doh_traffic_cloudflare_server.csv'

# The file that contains the malicious DoH traffic. We are using DoH traffic that carries a dnscat 
# tunnel for data exfiltration. 
malicious_doh_traffic_dataset = '../doh_traffic_datasets/dnscat2_data_4.csv'

# Import the pandas library
import pandas as pd

# Load the datasets into Pandas dataframes
normal_traffic_df    = pd.read_csv(normal_doh_traffic_dataset)
malicious_traffic_df = pd.read_csv(malicious_doh_traffic_dataset)

### Let's see what we have in the dataframes

In [None]:
normal_traffic_df

In [None]:
malicious_traffic_df

### The contents of the dataset

We see that the normal and malicious datasets have 16 columns. These columns are statistical features of the TCP connections related to round trip times, number of bytes, number of packets, etc. We explain what these columns mean later in this notebook.

The is_doh column indicates the type of traffic: 1 for normal traffic and a number greater than 1 for malicious traffic.
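
A quick way to check the label convention is to count the distinct values in the label column. The sketch below uses a small made-up dataframe (`toy_df` and its values are assumptions for illustration, not the real CSV contents); the same two calls should work on `normal_traffic_df` or `malicious_traffic_df`.

```python
import pandas as pd

# Toy stand-in for the real CSV data (values are made up for illustration)
toy_df = pd.DataFrame({
    "bytes_in": [120, 340, 560, 90, 410],
    "is_doh":   [1, 1, 4, 1, 4],
})

# Count how many rows carry each label value
label_counts = toy_df["is_doh"].value_counts().sort_index()
print(label_counts)

# Rows whose label is greater than 1 are the malicious ones
num_malicious_rows = int((toy_df["is_doh"] > 1).sum())
print(num_malicious_rows)
```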

## Preparing the datasets

The first step is to prepare a dataset that contains both benign and malicious data. The number of malicious samples is set to 30% of the number of normal samples, capped at the number of malicious samples available. Depending on the model, we will need both malicious and benign data for training, or only benign data.
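
To make the contamination arithmetic concrete, here is a minimal sketch with hypothetical sample counts (the counts are assumptions, not the sizes of the real datasets). Because the 30% is taken relative to the number of normal samples, the malicious share of the combined dataset comes out lower than 30%.

```python
# Hypothetical sample counts, chosen only to illustrate the arithmetic
num_normal_available = 1000
num_malicious_available = 5000
contamination = 0.3

# Same rule as the notebook: cap at the number of malicious samples available
num_malicious = min(num_malicious_available, int(num_normal_available * contamination))

# Share of malicious rows in the combined dataset
malicious_share = num_malicious / (num_normal_available + num_malicious)
print(num_malicious, round(malicious_share, 4))
```

With these numbers, 300 malicious samples are drawn and they make up about 23% of the 1300 combined rows.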

In [None]:
# We get a random set of malicious samples based on the contamination percentage.
contamination              = 0.3
seed                       = 1  
num_malicious              = min(malicious_traffic_df.shape[0], int(normal_traffic_df.shape[0]*contamination))
data_evaluation_malicious  = malicious_traffic_df.sample(num_malicious, random_state=seed)


# Concatenate the normal testing data and the malicious data to
# create the evaluation data set
data_evaluation_df = pd.concat([normal_traffic_df, data_evaluation_malicious])


# Shuffle the combined dataset so that malicious samples are mixed in with the normal ones.
data_evaluation_df = data_evaluation_df.sample(frac=1, random_state=seed)

data_evaluation_df

## Label standardization

To make things easier for our models, we set the label for normal traffic to 1 and the label for malicious traffic to -1. The original label for malicious traffic is 4.

In [None]:
# Standardize labels. A label greater than 1 means the traffic is malicious. We set those samples to -1
# as required by the ML Python libraries.
normal_traffic_label = 1
label_col            = 'is_doh'
# Use .loc for the assignment; chained indexing can trigger SettingWithCopyWarning or silently do nothing
data_evaluation_df.loc[data_evaluation_df[label_col] > normal_traffic_label, label_col] = -1


## The training and testing datasets
We now split the dataset into training and testing datasets. The training dataset helps the model learn the difference between normal and malicious data. The testing dataset is used to evaluate how well it learned this difference.
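
The notebook uses a plain random split. As an optional variant, and only as a sketch on made-up data, `train_test_split` also accepts a `stratify` argument that keeps the benign/malicious proportions the same in both splits. The dataframe `toy_df` and its 70/30 label mix are assumptions for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 70 benign rows (label 1) and 30 malicious rows (label -1)
toy_df = pd.DataFrame({
    "bytes_in": range(100),
    "is_doh": [1] * 70 + [-1] * 30,
})

# stratify keeps the label proportions the same in both splits
toy_train, toy_test = train_test_split(
    toy_df, test_size=0.2, random_state=1, stratify=toy_df["is_doh"]
)

# Fraction of malicious rows in each split
train_malicious_share = (toy_train["is_doh"] == -1).mean()
test_malicious_share = (toy_test["is_doh"] == -1).mean()
print(len(toy_train), len(toy_test))
print(train_malicious_share, test_malicious_share)
```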

In [None]:
# Split the data into training and testing sets. Resulting dataframes will be randomized
from sklearn.model_selection import train_test_split

# We are assigning 20% of the data for testing
split = 0.2 
                                                                         
# Split the data into training and testing data sets
data_training, data_testing = train_test_split(data_evaluation_df, test_size=split, random_state=seed)

## Save the datasets
Finally, we save the training and testing datasets. 

In [None]:
# Save the datasets for later use
data_evaluation_df.to_csv('data_evaluation.csv', index=False)
data_training.to_csv('data_training.csv', index=False)
data_testing.to_csv('data_testing.csv', index=False)

### Let's see how our datasets look now

In [None]:
data_testing

In [None]:
data_training

### The contents of the dataset

We see that the training and testing datasets have the same 16 columns, labeled 0 to 15. These columns are statistical features of the TCP connections related to round trip times, number of bytes, number of packets, etc. We explain what these columns mean in notebook 0.

After label standardization, the is_doh column holds 1 for normal traffic and -1 for malicious traffic.

## Correlation Matrix
One of the first steps in our analysis is to create a correlation matrix to assess the relationships between numerical variables in the dataset. The matrix is visualized as a heatmap, making it easy to identify strong correlations, which can be important for feature selection in machine learning models.
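
Besides reading the heatmap by eye, the strongest correlations can also be listed numerically. The following is a minimal sketch on a toy dataframe (the columns `a`, `b`, `c` and their values are made up); it keeps the upper triangle of the matrix so each feature pair appears once, then ranks the pairs by absolute correlation.

```python
import numpy as np
import pandas as pd

# Toy features: "b" is an exact multiple of "a", "c" is only weakly related
toy_df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0],
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],
})

corr = toy_df.corr()

# Keep only the upper triangle (k=1 skips the diagonal) so each pair appears once
upper_mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper_mask).stack().dropna()

# Rank pairs by absolute correlation, strongest first
ranked_pairs = pairs.sort_values(key=lambda s: s.abs(), ascending=False)
print(ranked_pairs)
```

Pairs near +1 or -1 carry largely redundant information, which is one common reason to drop one feature of the pair during feature selection.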

In [None]:
# Import the plotting libraries
import matplotlib.pyplot as plt
import seaborn as sns

# We take a random sample of the training dataset for visualization
data_evaluation_df = data_training.sample(n=700, random_state=seed)

# Calculate the correlations and save in new variable.
correlation_matrix = data_evaluation_df.corr()

# Create a figure
plt.figure(figsize=(12,10))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Heatmap')

# Show figure
plt.show()

## Can you drop the mindelay row and column from the correlation matrix?
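
As a hint, here is how `DataFrame.drop` behaves on a small toy matrix. The feature names `feat_a`, `feat_b`, and `feat_c` are hypothetical, so the exercise below is still yours to complete.

```python
import pandas as pd

# Toy correlation-style matrix with hypothetical feature names
toy_corr = pd.DataFrame(
    [[1.0, 0.8, 0.1],
     [0.8, 1.0, 0.3],
     [0.1, 0.3, 1.0]],
    index=["feat_a", "feat_b", "feat_c"],
    columns=["feat_a", "feat_b", "feat_c"],
)

# Drop the same label from both axes so the matrix stays square
reduced = toy_corr.drop(index="feat_c", columns="feat_c")
print(reduced)
```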

In [None]:
# Drop the mindelay row and column and replot the correlation matrix

# Enter the row and column name that should be dropped
correlation_matrix = correlation_matrix.drop(index='enter_row_name_here', columns='enter_column_name_here')

# recreate the figure
plt.figure(figsize=(12,10))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Heatmap')

# Show figure
plt.show()

## Pair Plots
Pair plots provide an efficient way to visualize multiple pairwise relationships at once. By plotting all combinations of numerical values, we can uncover potential patterns and dependencies within the data.

## Scatter Plots and Histograms
Scatter plots are useful for visualizing relationships between pairs of numerical variables. By creating scatter plots, we can explore how two variables interact with each other, which can help us identify patterns, clusters, or trends within the data. The scatter plots are in the off-diagonal cells of the plot matrices; each cell's position is given by the two features used for the plot.

Histograms help us understand the distribution of individual numerical variables. They allow us to assess the central tendency and spread of each variable and help identify potential outliers. The distribution plots are on the diagonal of the plot matrices; because the pair plot cells in this notebook pass diag_kind='kde', they are drawn as smoothed kernel density estimates rather than binned histograms.
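
The same questions about central tendency, spread, and outliers can be checked numerically alongside the plots. The sketch below uses a made-up series of packet sizes and the 1.5 × IQR rule of thumb; both the values and the choice of rule are assumptions for illustration, not part of the original analysis.

```python
import pandas as pd

# Toy feature values with one obvious outlier (made-up numbers)
toy_sizes = pd.Series([100, 102, 98, 101, 99, 103, 97, 500], name="av_pkt_size_in")

# Central tendency and spread
print(toy_sizes.describe())

# Flag values outside 1.5 * IQR from the quartiles (a common rule of thumb)
q1 = toy_sizes.quantile(0.25)
q3 = toy_sizes.quantile(0.75)
iqr = q3 - q1
outliers = toy_sizes[(toy_sizes < q1 - 1.5 * iqr) | (toy_sizes > q3 + 1.5 * iqr)]
print(outliers.tolist())
```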

In [None]:
# Assign labels "malicious" and "benign" based on values in 'label_col'.
# replace() avoids chained assignment and is safe to re-run once the labels are already strings.
data_evaluation_df[label_col] = data_evaluation_df[label_col].replace({-1: "malicious", 1: "benign"})

# Define a subset of variables for the pair plot
subset1 = ['bytes_in', 'bytes_out', 'num_pkts_in', 'num_pkts_out']

# Create a pair plot with specified attributes
sns.pairplot(data_evaluation_df, vars= subset1, diag_kind='kde', hue=label_col, palette= 'Set1')

# Display the pair plot
plt.show()

In [None]:
# Assign labels "malicious" and "benign" based on values in 'label_col'.
# replace() avoids chained assignment and is safe to re-run once the labels are already strings.
data_evaluation_df[label_col] = data_evaluation_df[label_col].replace({-1: "malicious", 1: "benign"})

# Define a subset of variables for the pair plot
subset2 = ['av_pkt_size_in','av_pkt_size_out', 'var_pkt_size_in', 'var_pkt_size_out']

# Create a pair plot with specified attributes
sns.pairplot(data_evaluation_df, vars= subset2, diag_kind='kde', hue=label_col, palette= 'Set1')

# Display the pair plot
plt.show()

In [None]:
# Assign labels "malicious" and "benign" based on values in 'label_col'.
# replace() avoids chained assignment and is safe to re-run once the labels are already strings.
data_evaluation_df[label_col] = data_evaluation_df[label_col].replace({-1: "malicious", 1: "benign"})

# Define a subset of variables for the pair plot
subset3 = ['median_in', 'median_out', 'mindelay', 'maxdelay']

# Create a pair plot with specified attributes
sns.pairplot(data_evaluation_df, vars= subset3, diag_kind='kde', hue=label_col, palette= 'Set1')

# Display the pair plot
plt.show()

In [None]:
# Assign labels "malicious" and "benign" based on values in 'label_col'.
# replace() avoids chained assignment and is safe to re-run once the labels are already strings.
data_evaluation_df[label_col] = data_evaluation_df[label_col].replace({-1: "malicious", 1: "benign"})

# Define a subset of variables for the pair plot
subset4 = ['bytes_ration', 'num_pkts_ration', 'time', 'avgdelay']

# Create a pair plot with specified attributes
sns.pairplot(data_evaluation_df, vars= subset4, diag_kind='kde', hue=label_col, palette= 'Set1')

# Display the pair plot
plt.show()

## Examples of scatter plots of one feature vs another feature

In [None]:
# Assign labels "malicious" and "benign" based on values in 'label_col'.
# replace() avoids chained assignment and is safe to re-run once the labels are already strings.
data_evaluation_df[label_col] = data_evaluation_df[label_col].replace({-1: "malicious", 1: "benign"})

# Create a scatter plot with specified attributes
sns.scatterplot(x='bytes_in', y='bytes_out', data=data_evaluation_df, marker='o', hue=label_col, palette= 'Set1')

# Set labels for the x and y axes
plt.xlabel('bytes_in')
plt.ylabel('bytes_out')

# Set the title for the scatter plot
plt.title('Scatter Plot of bytes_in vs bytes_out')

# Display the plot
plt.show()

In [None]:
# Assign labels "malicious" and "benign" based on values in 'label_col'.
# replace() avoids chained assignment and is safe to re-run once the labels are already strings.
data_evaluation_df[label_col] = data_evaluation_df[label_col].replace({-1: "malicious", 1: "benign"})

# Create a scatter plot with specified attributes
sns.scatterplot(x='num_pkts_in', y='num_pkts_out', data=data_evaluation_df, marker='o', hue=label_col, palette= 'Set1')

# Set labels for the x and y axes
plt.xlabel('num_pkts_in')
plt.ylabel('num_pkts_out')

# Set the title for the scatter plot
plt.title('Scatter Plot of num_pkts_in vs num_pkts_out')

# Display the plot
plt.show()

In [None]:
# Assign labels "malicious" and "benign" based on values in 'label_col'.
# replace() avoids chained assignment and is safe to re-run once the labels are already strings.
data_evaluation_df[label_col] = data_evaluation_df[label_col].replace({-1: "malicious", 1: "benign"})

# Create a scatter plot with specified attributes
sns.scatterplot(x='bytes_ration', y='num_pkts_ration', data=data_evaluation_df, marker='o', hue=label_col, palette= 'Set1')

# Set labels for the x and y axes
plt.xlabel('bytes_ration')
plt.ylabel('num_pkts_ration')

# Set the title for the scatter plot
plt.title('Scatter Plot of bytes_ration vs num_pkts_ration')

# Display the plot
plt.show()

In [None]:
# Assign labels "malicious" and "benign" based on values in 'label_col'.
# replace() avoids chained assignment and is safe to re-run once the labels are already strings.
data_evaluation_df[label_col] = data_evaluation_df[label_col].replace({-1: "malicious", 1: "benign"})

# Create a scatter plot with specified attributes
sns.scatterplot(x='av_pkt_size_in', y='av_pkt_size_out', data=data_evaluation_df, marker='o', hue=label_col, palette= 'Set1')

# Set labels for the x and y axes
plt.xlabel('av_pkt_size_in')
plt.ylabel('av_pkt_size_out')

# Set the title for the scatter plot
plt.title('Scatter Plot of av_pkt_size_in vs av_pkt_size_out')

# Display the plot
plt.show()

In [None]:
# Assign labels "malicious" and "benign" based on values in 'label_col'.
# replace() avoids chained assignment and is safe to re-run once the labels are already strings.
data_evaluation_df[label_col] = data_evaluation_df[label_col].replace({-1: "malicious", 1: "benign"})

# Create a scatter plot with specified attributes
sns.scatterplot(x='var_pkt_size_in', y='var_pkt_size_out', data=data_evaluation_df, marker='o', hue=label_col, palette= 'Set1')

# Set labels for the x and y axes
plt.xlabel('var_pkt_size_in')
plt.ylabel('var_pkt_size_out')

# Set the title for the scatter plot
plt.title('Scatter Plot of var_pkt_size_in vs var_pkt_size_out')

# Display the plot
plt.show()

In [None]:
# Assign labels "malicious" and "benign" based on values in 'label_col'.
# replace() avoids chained assignment and is safe to re-run once the labels are already strings.
data_evaluation_df[label_col] = data_evaluation_df[label_col].replace({-1: "malicious", 1: "benign"})

# Create a scatter plot with specified attributes
sns.scatterplot(x='median_in', y='median_out', data=data_evaluation_df, marker='o', hue=label_col, palette= 'Set1')

# Set labels for the x and y axes
plt.xlabel('median_in')
plt.ylabel('median_out')

# Set the title for the scatter plot
plt.title('Scatter Plot of median_in vs median_out')

# Display the plot
plt.show()

In [None]:
# Assign labels "malicious" and "benign" based on values in 'label_col'.
# replace() avoids chained assignment and is safe to re-run once the labels are already strings.
data_evaluation_df[label_col] = data_evaluation_df[label_col].replace({-1: "malicious", 1: "benign"})

# Create a scatter plot with specified attributes
sns.scatterplot(x='time', y='avgdelay', data=data_evaluation_df, marker='o', hue=label_col, palette= 'Set1')

# Set labels for the x and y axes
plt.xlabel('time')
plt.ylabel('avgdelay')

# Set the title for the scatter plot
plt.title('Scatter Plot of time vs avgdelay')

# Display the plot
plt.show()

By using these standard data exploration techniques and creating these diverse plots, we aim to gain a comprehensive understanding of the DoH traffic dataset. These visualizations serve as a foundation for subsequent steps in our analysis, including feature engineering, model selection, and any data preprocessing needed to ensure the dataset is suitable for machine learning tasks.