# Network Monitoring
## Intrusion Detection: Lakhina Entropy Method
<div>
Group 9:
<ul><li>AMATU Jonathan</li><li>BERCY Victor</li><li>SEMPERE Nicolas</li></ul>
</div>
<div></div>
<div>CTU-13 dataset: <a href=https://www.stratosphereips.org/datasets-ctu13>link</a></div>
<div></div>
<div>Original paper: An empirical comparison of botnet detection methods, S. García et al.</div>
<div>Lakhina Entropy paper: Mining anomalies using traffic feature distributions, A. Lakhina et al.</div>

### Imports

In [1]:
# Scientific libraries
import math
import pandas as pd
import numpy as np
from numpy import linalg as LA

# Machine learning library
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
from sklearn.metrics import confusion_matrix, accuracy_score, balanced_accuracy_score

# Plotting library
import matplotlib.pyplot as plt

# Utilities
from datetime import datetime

### Analysis of one scenario from the CTU-13 dataset

We load the data of one scenario into a dataframe for statistical analysis.

In [23]:
num_scenario = 42
kept_fields = ["StartTime","Dur","Proto","SrcAddr","Sport","DstAddr","Dport","TotPkts","TotBytes","Label"]

scenario = pd.read_csv(f'./Datasets/CTU13_{num_scenario}.binetflow', usecols=kept_fields)
scenario

Unnamed: 0,StartTime,Dur,Proto,SrcAddr,Sport,DstAddr,Dport,TotPkts,TotBytes,Label
0,2011/08/10 09:46:53.047277,3550.182373,udp,212.50.71.179,39678,147.32.84.229,13363,12,875,flow=Background-UDP-Established
1,2011/08/10 09:46:53.048843,0.000883,udp,84.13.246.132,28431,147.32.84.229,13363,2,135,flow=Background-UDP-Established
2,2011/08/10 09:46:53.049895,0.000326,tcp,217.163.21.35,80,147.32.86.194,2063,2,120,flow=Background
3,2011/08/10 09:46:53.053771,0.056966,tcp,83.3.77.74,32882,147.32.85.5,21857,3,180,flow=Background
4,2011/08/10 09:46:53.053937,3427.768066,udp,74.89.223.204,21278,147.32.84.229,13363,42,2856,flow=Background-UDP-Established
...,...,...,...,...,...,...,...,...,...,...
2824631,2011/08/10 15:54:07.352393,0.000393,udp,147.32.86.92,36363,147.32.80.9,53,2,208,flow=To-Background-UDP-CVUT-DNS-Server
2824632,2011/08/10 15:54:07.353854,0.000935,udp,58.165.41.84,60122,147.32.84.229,13363,2,539,flow=Background-UDP-Established
2824633,2011/08/10 15:54:07.357302,0.000000,tcp,147.32.84.171,47077,78.191.168.43,13754,1,74,flow=Background-TCP-Attempt
2824634,2011/08/10 15:54:07.366830,0.002618,udp,93.79.39.15,10520,147.32.84.229,13363,2,520,flow=Background-UDP-Established


#### Analysis of traffic labelling

The original label names are too long and complex for the processing we want to apply, so we group them into 3 classes as presented in the paper: Background, Normal, Botnet. We then assign label 0 to the Background and Normal classes and label 1 to the Botnet class for the detection step that follows.

In [24]:
def get_class(full_name):
    """
    Get the class of the netflow (Normal, Background, Botnet) from the full name of the label given by CTU-13 dataset
    """
    if "Background" in full_name:
        return "Background"
    elif "Botnet" in full_name:
        return "Botnet"
    elif "Normal" in full_name:
        return "Normal"
    else:
        return "None"
    
def get_label(full_name):
    """
    Assign a label to the netflow (0 for Normal and Background, 1 for Botnet, -1 for unknown label) from the full name of the label given by CTU-13 dataset
    """
    if "Background" in full_name or "Normal" in full_name:
        return 0
    elif "Botnet" in full_name:
        return 1
    else:
        return -1
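
On a dataframe with millions of rows, the same mapping can also be written with vectorised string operations instead of a row-by-row `apply`. The block below is a minimal sketch, not the notebook's method: the helper name `classify_labels` is hypothetical, and only the first sample label comes from the dataset output above (the other three are illustrative).

```python
import numpy as np
import pandas as pd


def classify_labels(labels: pd.Series) -> pd.DataFrame:
    """Vectorised counterpart of get_class / get_label (hypothetical helper)."""
    is_background = labels.str.contains("Background", regex=False)
    is_botnet = labels.str.contains("Botnet", regex=False)
    is_normal = labels.str.contains("Normal", regex=False)

    # np.select keeps the first matching condition, mirroring the if/elif order.
    classes = np.select(
        [is_background, is_botnet, is_normal],
        ["Background", "Botnet", "Normal"],
        default="None",
    )
    binary = np.select(
        [is_background | is_normal, is_botnet],
        [0, 1],
        default=-1,
    )
    return pd.DataFrame({"Class": classes, "Label": binary}, index=labels.index)


sample = pd.Series([
    "flow=Background-UDP-Established",  # label seen in the dataset output
    "flow=From-Botnet-Example",         # illustrative botnet label
    "flow=From-Normal-Example",         # illustrative normal label
    "flow=Unknown",                     # illustrative unmatched label
])
result = classify_labels(sample)
print(result)
```

The order of the conditions matters: a label containing both "Background" and "Botnet" would be classed as Background, exactly as in `get_class`.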

In [25]:
scenario = scenario.assign(
    Class=scenario['Label'].apply(get_class),
    Label=scenario['Label'].apply(get_label)
)
scenario.tail(5)

Unnamed: 0,StartTime,Dur,Proto,SrcAddr,Sport,DstAddr,Dport,TotPkts,TotBytes,Label,Class
2824631,2011/08/10 15:54:07.352393,0.000393,udp,147.32.86.92,36363,147.32.80.9,53,2,208,0,Background
2824632,2011/08/10 15:54:07.353854,0.000935,udp,58.165.41.84,60122,147.32.84.229,13363,2,539,0,Background
2824633,2011/08/10 15:54:07.357302,0.0,tcp,147.32.84.171,47077,78.191.168.43,13754,1,74,0,Background
2824634,2011/08/10 15:54:07.366830,0.002618,udp,93.79.39.15,10520,147.32.84.229,13363,2,520,0,Background
2824635,2011/08/10 15:54:07.368340,0.001122,udp,78.56.231.126,29419,147.32.84.229,13363,2,137,0,Background


We now analyse how these classes are distributed within the scenario.

In [26]:
nb_rows = len(scenario)

labels_count = pd.Series(
    scenario['Class'].value_counts(),
    name='Count'
)

labels_percentage = pd.Series(
    labels_count
    .apply(lambda x: f"{np.round(x/nb_rows*100, 3)} %"),
    index=labels_count.index,
    name='Percentage'
)

pd.concat([labels_count, labels_percentage], axis=1)

Class,Count,Percentage
Background,2753288,97.474 %
Botnet,40961,1.45 %
Normal,30387,1.076 %
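
The percentages above are stored as formatted strings, which is convenient for display but not for further computation. As a small sketch under that observation, the shares can be kept numeric and rounded only at the end; the counts are copied from the table above, and the variable names are illustrative.

```python
import pandas as pd

# Class counts copied from the table above.
counts = pd.Series(
    {"Background": 2753288, "Botnet": 40961, "Normal": 30387},
    name="Count",
)

# Keep the percentage as a float column so it can be reused in later computations.
share = (counts / counts.sum() * 100).round(3).rename("Percentage")

summary = pd.concat([counts, share], axis=1)
print(summary)
```

On the full dataframe, `scenario['Class'].value_counts(normalize=True)` would give the same proportions directly.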


#### Analysis of IP addresses and ports

We look at how source and destination IP addresses and ports are distributed across the different flow classes.

In [27]:
scenario.groupby("Class", group_keys=True)[["SrcAddr", "Sport", "DstAddr", "Dport"]].nunique()


Class,SrcAddr,Sport,DstAddr,Dport
Background,542087,64737,115092,73781
Botnet,1,3975,4190,27
Normal,19,18631,545,93


In [28]:
print("The IP address of the infected host is: {}".format(
    scenario[scenario['Label']==1]['SrcAddr'].unique()
))

The IP address of the infected host is: ['147.32.84.165']


In most scenarios there is a single infected host, which sends from several source ports to many destination IP addresses and ports. In the remaining scenarios, namely no. 44 (2 IP addresses), no. 50 (10 IP addresses), no. 51 (10 IP addresses), no. 52 (3 IP addresses) and no. 53 (9 IP addresses), the bots all belong to the same subnet, 147.32.84.0/24.
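
Subnet membership of this kind can be checked with Python's standard `ipaddress` module rather than by string matching. The sketch below assumes the /24 subnet that contains the infected host 147.32.84.165; the helper name `in_subnet` is hypothetical, and the candidate addresses are taken from the outputs in this notebook.

```python
import ipaddress

# /24 subnet containing the infected host of this scenario (147.32.84.165).
BOTNET_SUBNET = ipaddress.ip_network("147.32.84.0/24")


def in_subnet(address: str, network=BOTNET_SUBNET) -> bool:
    """Return True if the dotted-quad address lies inside the network."""
    return ipaddress.ip_address(address) in network


candidates = ["147.32.84.165", "147.32.80.9", "74.125.232.195"]
for addr in candidates:
    print(addr, in_subnet(addr))
```

Applied to the unique botnet source addresses of a scenario, `all(in_subnet(a) for a in addresses)` would confirm or refute the observation for that scenario.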

In [29]:
unique_src_addr = scenario[scenario['Label']==0]['SrcAddr'].unique()

print("There are {} distinct source IP addresses among non-botnet flows; here is a sample: {}.".format(
    len(unique_src_addr),
    unique_src_addr
))

print("Of these, {} % are of the form 147.32.84.X and {} % belong to the 80.0.0.0/4 block.".format(
    np.round(pd.Series(unique_src_addr)
     .str.contains('147.32.84.', regex=False)
     .value_counts()
     .loc[True]
     /len(unique_src_addr)*100,
     2
     ),
    # Non-capturing group: avoids the pandas warning about match groups
    np.round(pd.Series(unique_src_addr)
     .str.contains(r'^(?:8[0-9]|9[0-5])', regex=True)
     .value_counts()
     .loc[True]
     /len(unique_src_addr)*100,
     2
    )
))

There are 542092 distinct source IP addresses among non-botnet flows; here is a sample: ['212.50.71.179' '84.13.246.132' '217.163.21.35' ... '83.46.238.157'
 '98.87.173.219' '88.222.4.220'].
Of these, 0.02 % are of the form 147.32.84.X and 33.1 % belong to the 80.0.0.0/4 block.


In [30]:
unique_dst_addr = scenario['DstAddr'].unique()

print("There are {} distinct destination IP addresses in the dataset; here is a sample: {}.".format(
    len(unique_dst_addr),
    unique_dst_addr
))

print("Of these, {} % are of the form 147.32.84.X and {} % belong to the 80.0.0.0/4 block.".format(
    np.round(pd.Series(unique_dst_addr)
     .str.contains('147.32.84.', regex=False)
     .value_counts()
     .loc[True]
     /len(unique_dst_addr)*100,
     2
     ),
    # Non-capturing group: avoids the pandas warning about match groups
    np.round(pd.Series(unique_dst_addr)
     .str.contains(r'^(?:8[0-9]|9[0-5])', regex=True)
     .value_counts()
     .loc[True]
     /len(unique_dst_addr)*100,
     2
    )
))

There are 119296 distinct destination IP addresses in the dataset; here is a sample: ['147.32.84.229' '147.32.86.194' '147.32.85.5' ... '93.80.227.24'
 '87.244.129.22' '86.147.113.119'].
Of these, 0.21 % are of the form 147.32.84.X and 27.15 % belong to the 80.0.0.0/4 block.


Little can be said about the non-botnet source addresses or about the destination addresses of the flows. We can nevertheless hypothesise that, across all scenarios, between 20% and 25% of both kinds of address fall within the common block 80.0.0.0/4 (first octet from 80 to 95); in this scenario the measured shares are somewhat higher, at 33.1 % and 27.15 %.
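
To make the link between the regular expression and the address block explicit: 80.0.0.0/4 spans the first octets 80 to 95, which is what the pattern tests. The sketch below is an illustration under the assumption that every address is an IPv4 dotted quad (it would fail on IPv6 or MAC-style entries); `share_in_block` is a hypothetical helper and the sample addresses are copied from the outputs above.

```python
import ipaddress

import pandas as pd

BLOCK = ipaddress.ip_network("80.0.0.0/4")


def share_in_block(addresses: pd.Series) -> float:
    """Percentage of IPv4 addresses whose first octet lies in 80..95."""
    first_octet = addresses.str.split(".", n=1).str[0].astype(int)
    return round(first_octet.between(80, 95).mean() * 100, 2)


# Sample addresses copied from the outputs above.
sample = pd.Series(["84.13.246.132", "147.32.84.229", "93.79.39.15", "212.50.71.179"])

print(BLOCK[0], BLOCK[-1])  # first and last address of the block
print(share_in_block(sample))
```

Taking the mean of a boolean series also avoids the `KeyError` that `.value_counts().loc[True]` raises when no address matches.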

#### Dataset preprocessing

In [31]:
def to_timestamp(date_string):
    """
    Convert a date string of the dataframe ("YYYY/MM/DD HH:MM:SS.ffffff") into a POSIX timestamp.
    The date is interpreted in the local timezone of the machine.
    """
    # %f parses the fractional seconds, so "0.5" is read as 500000 microseconds
    return datetime.strptime(date_string, "%Y/%m/%d %H:%M:%S.%f").timestamp()
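
With close to three million rows, parsing the column in one call is usually much faster than applying a Python function to each value. The following is a sketch of that alternative, not the notebook's method. It assumes the timestamps are treated as UTC, whereas `to_timestamp` uses the machine's local timezone, so the absolute values differ by the local UTC offset.

```python
import pandas as pd

# Two StartTime values copied from the dataset output above.
start_times = pd.Series([
    "2011/08/10 09:46:53.047277",
    "2011/08/10 15:54:07.352393",
])

# Parse the whole column at once; the explicit format avoids per-row inference.
parsed = pd.to_datetime(start_times, format="%Y/%m/%d %H:%M:%S.%f")

# Assumption: naive datetimes are treated as UTC when converting to epoch seconds.
epoch = pd.Timestamp("1970-01-01")
timestamps = (parsed - epoch) / pd.Timedelta(seconds=1)
print(timestamps)
```

Differences between timestamps, which are what time-window aggregation relies on, are the same under either timezone convention as long as no daylight-saving change occurs within the capture.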

In [32]:
scenario = scenario.assign(
    StartTime=scenario['StartTime'].apply(to_timestamp)
)
scenario

Unnamed: 0,StartTime,Dur,Proto,SrcAddr,Sport,DstAddr,Dport,TotPkts,TotBytes,Label,Class
0,1.312962e+09,3550.182373,udp,212.50.71.179,39678,147.32.84.229,13363,12,875,0,Background
1,1.312962e+09,0.000883,udp,84.13.246.132,28431,147.32.84.229,13363,2,135,0,Background
2,1.312962e+09,0.000326,tcp,217.163.21.35,80,147.32.86.194,2063,2,120,0,Background
3,1.312962e+09,0.056966,tcp,83.3.77.74,32882,147.32.85.5,21857,3,180,0,Background
4,1.312962e+09,3427.768066,udp,74.89.223.204,21278,147.32.84.229,13363,42,2856,0,Background
...,...,...,...,...,...,...,...,...,...,...,...
2824631,1.312984e+09,0.000393,udp,147.32.86.92,36363,147.32.80.9,53,2,208,0,Background
2824632,1.312984e+09,0.000935,udp,58.165.41.84,60122,147.32.84.229,13363,2,539,0,Background
2824633,1.312984e+09,0.000000,tcp,147.32.84.171,47077,78.191.168.43,13754,1,74,0,Background
2824634,1.312984e+09,0.002618,udp,93.79.39.15,10520,147.32.84.229,13363,2,520,0,Background


In [33]:
normal_data = scenario[scenario['Label']==0]
botnet_data = scenario[scenario['Label']==1]
botnet_data

Unnamed: 0,StartTime,Dur,Proto,SrcAddr,Sport,DstAddr,Dport,TotPkts,TotBytes,Label,Class
675537,1.312967e+09,0.000278,udp,147.32.84.165,1025,147.32.80.9,53,2,203,1,Botnet
675872,1.312967e+09,0.020525,udp,147.32.84.165,1025,147.32.80.9,53,2,590,1,Botnet
675877,1.312967e+09,0.045125,tcp,147.32.84.165,1027,74.125.232.195,80,7,882,1,Botnet
689920,1.312967e+09,0.336250,udp,147.32.84.165,1025,147.32.80.9,53,2,215,1,Botnet
689955,1.312967e+09,3514.083496,tcp,147.32.84.165,1039,60.190.222.139,65520,120,7767,1,Botnet
...,...,...,...,...,...,...,...,...,...,...,...
2785281,1.312984e+09,0.000000,udp,147.32.84.165,2077,89.149.254.87,53,1,72,1,Botnet
2785303,1.312984e+09,0.000000,tcp,147.32.84.165,1081,202.59.166.29,25,1,62,1,Botnet
2785326,1.312984e+09,0.000405,udp,147.32.84.165,2079,147.32.80.9,53,2,138,1,Botnet
2785382,1.312984e+09,0.056870,udp,147.32.84.165,2077,188.65.208.29,53,2,144,1,Botnet
