## IoT Botnet Detection

Due to Increasing usage of digital communication in this digital era, cyber security is crucial to maintain a high level of safety. To prevent an increasing number of cyber attacks, traditional security system firewalls, encryption is not enough. There is a need for intrusion detection systems that can integrate with traditional systems and assure a high level of security of data.


Data set contain 11 different csv file. each file represent different type of attack data. There is two malware used on this dataset. Mirai and Gafgyt.

* benign_traffic.csv - Benign Traffic
* mirai_attacks/ack.csv - Mallicious 1
* mirai_attacks/scan.csv - mallicious 2
* mirai_attacks/syn.csv - mallicious 3
* mirai_attacks/udp.csv - mallicious 4
* mirai_attacks/udpplain.csv - mallicious 5
* gafgyt_attacks/combo.csv - mallicious 6
* gafgyt_attacks/junk.csv - mallicious 7
* gafgyt_attacks/scan.csv - mallicious 8
* gafgyt_attacks/tcp.csv - mallicious 9
* gafgyt_attacks/udp.csv - mallicious 10

In [10]:
# Importing necssary modules
import pandas as pd
import numpy as np
import seaborn as sns
import os, glob

In [8]:
# Data folder path
base_directory = '../rawdata/'

The data contain 9 different IoT devices, each devices having benign / mallicious traffic data. since all csv files are in seperate folder, i created simple function to load the csv file into data frame

In [34]:
# This csv file divided into three category of dataframes, benign, mirai, gafgyt
def create_three_class_data_frame(directory):
    """
    Creates a data frame consisting of all the .csv-files in a given directory. The directory should
    be where the unzipped data files are stored. Assumes the file structurce is
        device name
            mirai_attacks(folder)
            gafgyt_attacks(folder)
            benign_traffic.csb
            
    Parameters
    ----------
    directory : str
        The directory in which the data files are stored. 
        
    Returns
    -------
    benign_data : pandas data frame 
        consisting of all the bengin data.
    mirai_data : pandas data frame
        consisting of all mirai attack csv file into one data frame
    gafgyt_data : pandas data frame
        consisting of all gafgyt attack csv file into one data frame
    """
    try:
    
        benign_data = pd.read_csv('{directory}benign_traffic.csv'.format(directory=directory))
        benign_data['label'] = 'benign_traffic'

        # Mirai
        if os.path.isdir(directory + 'mirai_attacks'):
            mirai_data = pd.concat(map(pd.read_csv, glob.glob(directory + 'mirai_attacks/*.csv')))
            mirai_data['label'] = 'mirai_traffic'
        else:
            mirai_data = 0

        # Gafgyt
        if os.path.isdir(directory + 'gafgyt_attacks'):
            gafgyt_data = pd.concat(map(pd.read_csv, glob.glob(directory + 'gafgyt_attacks/*.csv')))
            gafgyt_data['label'] = 'gafgyt_data'
        else:
            gafgyt_data = 0
    
        
        return(benign_data, mirai_data, gafgyt_data)
    except:
        return 'Failed'

#### Danmini_Doorbell

In [13]:
danmini_doorbell_benign,\
danmini_doorbell_mirai,\
danmin_doorbell_gafgyt = create_three_class_data_frame(base_directory + 'Danmini_Doorbell' + '/')

In [14]:
print(f' Benign_data_shape: {danmini_doorbell_benign.shape}')
print(f' Mirai_data_shape: {danmini_doorbell_mirai.shape}')
print(f' Gafgyt_data_shape: {danmin_doorbell_gafgyt.shape}')

 Benign_data_shape: (49548, 116)
 Mirai_data_shape: (652100, 116)
 Gafgyt_data_shape: (316650, 116)


In [16]:
danmini_doorbell_benign.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 49548 entries, 0 to 49547
Columns: 116 entries, MI_dir_L5_weight to label
dtypes: float64(115), object(1)
memory usage: 43.9+ MB


In [17]:
danmini_doorbell_mirai.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 652100 entries, 0 to 81981
Columns: 116 entries, MI_dir_L5_weight to label
dtypes: float64(115), object(1)
memory usage: 582.1+ MB


In [18]:
danmin_doorbell_gafgyt.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 316650 entries, 0 to 105873
Columns: 116 entries, MI_dir_L5_weight to label
dtypes: float64(115), object(1)
memory usage: 282.7+ MB


There are 49548 rows of benign data, 652100 rows of mirai data, 316650 rows of gafgyt data. Mirai and gafgyt is combined dataframe of all 5 class of attacks.

Check if there ay missing values in the dataframe

In [25]:
## Chcking is there any missing value in the dataset
def check_dataset_having_null_value(dataframe):
    missing = pd.concat([dataframe.isnull().sum(), 100 * dataframe.isnull().mean()], axis=1)
    missing.columns=['count', '%']
    missing.sort_values(by='%', ascending=False, inplace=True)
    return missing

In [26]:
check_dataset_having_null_value(danmini_doorbell_benign)

Unnamed: 0,count,%
MI_dir_L5_weight,0,0.0
HH_jit_L1_variance,0,0.0
HpHp_L5_covariance,0,0.0
HpHp_L5_radius,0,0.0
HpHp_L5_magnitude,0,0.0
...,...,...
HH_L5_radius,0,0.0
HH_L5_magnitude,0,0.0
HH_L5_std,0,0.0
HH_L5_mean,0,0.0


In [27]:
check_dataset_having_null_value(danmini_doorbell_mirai)

Unnamed: 0,count,%
MI_dir_L5_weight,0,0.0
HH_jit_L1_variance,0,0.0
HpHp_L5_covariance,0,0.0
HpHp_L5_radius,0,0.0
HpHp_L5_magnitude,0,0.0
...,...,...
HH_L5_radius,0,0.0
HH_L5_magnitude,0,0.0
HH_L5_std,0,0.0
HH_L5_mean,0,0.0


In [28]:
check_dataset_having_null_value(danmin_doorbell_gafgyt)

Unnamed: 0,count,%
MI_dir_L5_weight,0,0.0
HH_jit_L1_variance,0,0.0
HpHp_L5_covariance,0,0.0
HpHp_L5_radius,0,0.0
HpHp_L5_magnitude,0,0.0
...,...,...
HH_L5_radius,0,0.0
HH_L5_magnitude,0,0.0
HH_L5_std,0,0.0
HH_L5_mean,0,0.0


Looks like there is no null values in the dataset

#### Ecobee_Thermostat

In [35]:
ecobee_thermostat_benign,\
ecobee_thermostat_mirai,\
ecobee_thermostat_gafgyt = create_three_class_data_frame(base_directory + 'Ecobee_Thermostat' + '/')

In [36]:
print(f' Benign_data_shape: {ecobee_thermostat_benign.shape}')
print(f' Mirai_data_shape: {ecobee_thermostat_mirai.shape}')
print(f' Gafgyt_data_shape: {ecobee_thermostat_gafgyt.shape}')

 Benign_data_shape: (13113, 116)
 Mirai_data_shape: (512133, 116)
 Gafgyt_data_shape: (310630, 116)


There are 13113 rows of benign data, 512133 rows of mirai data, 310630 rows of gafgyt data. Mirai and gafgyt is combined dataframe of all 5 class of attacks.

#### Ennio Doorbell

In [37]:
ennio_doorbell_benign,\
ennio_doorbell_mirai,\
ennio_doorbell_gafgyt = create_three_class_data_frame(base_directory + 'Ennio_Doorbell' + '/')

ValueError: too many values to unpack (expected 3)

In [None]:
print(f' Benign_data_shape: {ennio_doorbell_benign.shape}')
print(f' Mirai_data_shape: {ennio_doorbell_mirai.shape}')
print(f' Gafgyt_data_shape: {ennio_doorbell_gafgyt.shape}')