## Preprocessing II

### KDD99

KDD99 was used in research articles 133 times between 2010 and 2015 alone, as reported in the paper ‘A Review of KDD99 Dataset Usage in Intrusion Detection and Machine Learning between 2010 and 2015’. The history of the KDD99 dataset begins with an earlier dataset, called DARPA, created between 1998 and 1999.


The Cyber Security and Information Sciences Group of MIT Lincoln Laboratory, sponsored by the Defense Advanced Research Projects Agency (DARPA) and in coordination with the U.S. Air Force Research Laboratory, collected and distributed the first standard corpus for evaluating network intrusion detection systems.

This was the first formal, repeatable, and statistically significant evaluation of intrusion detection systems.

Because in the real world it is impossible to know with complete certainty whether a connection is malicious, the data were created artificially in a closed environment, using network traffic generators and injected network attacks. The goal was to simulate the network traffic of a medium-sized U.S. Air Force base.

Once created, the DARPA dataset was publicly available, but most data mining researchers found it hard to use directly because the data were stored as tcpdump files.

Salvatore J. Stolfo and Wenke Lee created the KDD99 dataset from the DARPA data. The network traces were converted into connection records, with features generated using the Bro IDS software, which made the data better suited to the work of data mining researchers.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
kdd = pd.read_csv('https://goo.gl/HrnDk5')

In [3]:
kdd.head()

| | duration | protocol_type | service | flag | src_bytes | dst_bytes | land | wrong_fragment | urgent | hot | ... | dst_host_srv_count | dst_host_same_srv_rate | dst_host_diff_srv_rate | dst_host_same_src_port_rate | dst_host_srv_diff_host_rate | dst_host_serror_rate | dst_host_srv_serror_rate | dst_host_rerror_rate | dst_host_srv_rerror_rate | label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | tcp | http | SF | 181 | 5450 | 0 | 0 | 0 | 0 | ... | 9 | 1.0 | 0.0 | 0.11 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | normal |
| 1 | 0 | tcp | http | SF | 239 | 486 | 0 | 0 | 0 | 0 | ... | 19 | 1.0 | 0.0 | 0.05 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | normal |
| 2 | 0 | tcp | http | SF | 235 | 1337 | 0 | 0 | 0 | 0 | ... | 29 | 1.0 | 0.0 | 0.03 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | normal |
| 3 | 0 | tcp | http | SF | 219 | 1337 | 0 | 0 | 0 | 0 | ... | 39 | 1.0 | 0.0 | 0.03 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | normal |
| 4 | 0 | tcp | http | SF | 217 | 2032 | 0 | 0 | 0 | 0 | ... | 49 | 1.0 | 0.0 | 0.02 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | normal |


In [4]:
kdd.shape

(494020, 42)

In [6]:
kdd.isnull().any().any()

False

In [10]:
kdd.dtypes

duration                         int64
protocol_type                   object
service                         object
flag                            object
src_bytes                        int64
dst_bytes                        int64
land                             int64
wrong_fragment                   int64
urgent                           int64
hot                              int64
num_failed_logins                int64
logged_in                        int64
lnum_compromised                 int64
lroot_shell                      int64
lsu_attempted                    int64
lnum_root                        int64
lnum_file_creations              int64
lnum_shells                      int64
lnum_access_files                int64
lnum_outbound_cmds               int64
is_host_login                    int64
is_guest_login                   int64
count                            int64
srv_count                        int64
serror_rate                    float64
srv_serror_rate          

In [11]:
# Print the columns that hold a single constant value
for col_name in kdd.columns:
    if kdd[col_name].nunique() == 1:
        print(col_name)

lnum_outbound_cmds
is_host_login


In [12]:
kdd["lnum_outbound_cmds"].value_counts()

0    494020
Name: lnum_outbound_cmds, dtype: int64

In [13]:
kdd["is_host_login"].value_counts()

0    494020
Name: is_host_login, dtype: int64

In [14]:
kdd.drop(['lnum_outbound_cmds'], axis=1, inplace=True)

In [15]:
kdd.drop(['is_host_login'], axis=1, inplace=True)

In [16]:
kdd.shape[1]

40
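
The constant-column check and the two `drop` calls above could also be folded into one step. The following is only a sketch on a toy frame: `toy` and the helper `drop_constant_columns` are illustrative names that do not appear in the notebook.

```python
import pandas as pd

def drop_constant_columns(df):
    """Return (df without its constant columns, list of dropped column names)."""
    # nunique() counts the distinct non-null values in each column
    constant = [col for col in df.columns if df[col].nunique() == 1]
    return df.drop(columns=constant), constant

# Toy data standing in for kdd (assumed values, for illustration only)
toy = pd.DataFrame({
    "duration": [0, 3, 7],
    "lnum_outbound_cmds": [0, 0, 0],
    "is_host_login": [0, 0, 0],
})
reduced, dropped = drop_constant_columns(toy)
print(dropped)
print(reduced.shape)
```

Returning the list of dropped names keeps a record of what was removed, which helps if the same columns must later be removed from new data.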

In [17]:
kdd["label"].value_counts()

smurf              280790
neptune            107201
normal              97277
back                 2203
satan                1589
ipsweep              1247
portsweep            1040
warezclient          1020
teardrop              979
pod                   264
nmap                  231
guess_passwd           53
buffer_overflow        30
land                   21
warezmaster            20
imap                   12
rootkit                10
loadmodule              9
ftp_write               8
multihop                7
phf                     4
perl                    3
spy                     2
Name: label, dtype: int64

In [19]:
# Grouping of values with little frequency
# Labels listed here are kept; every other label is mapped to 'other'
frequent_labels = {'smurf', 'neptune', 'normal', 'back', 'satan', 'ipsweep',
                   'portsweep', 'warezclient', 'teardrop', 'pod', 'nmap'}

def grouping(x):
    return x if x in frequent_labels else 'other'

kdd['label'] = kdd['label'].apply(grouping)
print('Grouping of values with little frequency finished')

Grouping of values with little frequency finished


In [20]:
kdd["label"].value_counts()

smurf          280790
neptune        107201
normal          97277
back             2203
satan            1589
ipsweep          1247
portsweep        1040
warezclient      1020
teardrop          979
pod               264
nmap              231
other             179
Name: label, dtype: int64
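
The grouping above lists the kept labels by hand. One possible alternative is to derive them from a frequency threshold. This is a sketch under assumptions: the helper `group_rare_labels`, its `min_count` parameter, and the toy series are hypothetical and not part of the notebook.

```python
import pandas as pd

def group_rare_labels(labels, min_count, other="other"):
    """Replace labels that occur fewer than min_count times with `other`."""
    counts = labels.value_counts()
    frequent = counts[counts >= min_count].index
    # Series.where keeps values where the condition holds and replaces the rest
    return labels.where(labels.isin(frequent), other)

# Toy labels (assumed values, for illustration only)
toy_labels = pd.Series(["smurf"] * 5 + ["neptune"] * 3 + ["spy"] + ["perl"])
grouped = group_rare_labels(toy_labels, min_count=3)
print(grouped.value_counts())
```

With the label counts shown earlier, any `min_count` between 54 and 231 would reproduce the same twelve classes, since `nmap` (231 records) is the rarest label kept and `guess_passwd` (53 records) is the most frequent one grouped.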

In [24]:
from sklearn.preprocessing import LabelEncoder
LE = LabelEncoder()
LE.fit(kdd["label"])
list(LE.classes_)

['back',
 'ipsweep',
 'neptune',
 'nmap',
 'normal',
 'other',
 'pod',
 'portsweep',
 'satan',
 'smurf',
 'teardrop',
 'warezclient']

In [25]:
kdd["label"] = LE.transform(kdd["label"])

In [26]:
kdd["label"].value_counts()

9     280790
2     107201
4      97277
0       2203
8       1589
1       1247
7       1040
11      1020
10       979
6        264
3        231
5        179
Name: label, dtype: int64
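
After encoding, the label counts are keyed by integer codes. `LabelEncoder` sorts its classes, so the codes follow the alphabetical order of the names, and `inverse_transform` maps them back. A small sketch with toy values (the variable names are illustrative):

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
names = ["smurf", "normal", "neptune", "normal"]  # toy labels, for illustration
codes = encoder.fit_transform(names)

# classes_ is sorted, so code i corresponds to classes_[i]
print(list(encoder.classes_))
print(list(codes))
# inverse_transform recovers the original names from the codes
print(list(encoder.inverse_transform(codes)))
```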

In [35]:
for col_name in kdd.columns: 
    if kdd[col_name].dtypes == 'object':
        unique_cat = len(kdd[col_name].unique()) 
        print("Feature '{col_name}' = {unique_cat} unique categories".format(col_name=col_name, unique_cat=unique_cat))


Feature 'protocol_type' = 3 unique categories
Feature 'service' = 66 unique categories
Feature 'flag' = 11 unique categories


In [36]:
kdd.select_dtypes(include=['object']).head()

| | protocol_type | service | flag |
|---|---|---|---|
| 0 | tcp | http | SF |
| 1 | tcp | http | SF |
| 2 | tcp | http | SF |
| 3 | tcp | http | SF |
| 4 | tcp | http | SF |


In [37]:
kdd["service"].value_counts()

ecr_i          281400
private        110893
http            64292
smtp             9723
other            7237
domain_u         5863
ftp_data         4721
eco_i            1642
ftp               798
finger            670
urp_i             538
telnet            513
ntp_u             380
auth              328
pop_3             202
time              157
csnet_ns          126
remote_job        120
gopher            117
imap4             117
domain            116
discard           116
iso_tsap          115
systat            115
echo              112
shell             112
rje               111
whois             110
sql_net           110
printer           109
                ...  
uucp_path         106
uucp              106
bgp               106
klogin            106
nnsp              105
ssh               105
supdup            105
login             104
hostnames         104
efs               103
daytime           103
link              102
netbios_ns        102
ldap              101
pop_2     

In [38]:
dummies = pd.get_dummies(kdd["protocol_type"]) 
kdd = pd.concat([kdd, dummies], axis=1) 
del kdd["protocol_type"]

In [39]:
dummies = pd.get_dummies(kdd["service"])
kdd = pd.concat([kdd, dummies], axis=1)
del kdd["service"] 

In [40]:
dummies = pd.get_dummies(kdd["flag"])
kdd = pd.concat([kdd, dummies], axis=1)
del kdd["flag"] 
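
The three cells above apply the same encode, concatenate, and delete pattern to each categorical column. As a sketch of an alternative, `pd.get_dummies` can encode several columns in a single call through its `columns` parameter. Note that this variant prefixes the new columns with the original column name, so the resulting names differ from those produced by the notebook. The toy frame is an assumption for illustration.

```python
import pandas as pd

# Toy data standing in for kdd (assumed values, for illustration only)
toy = pd.DataFrame({
    "protocol_type": ["tcp", "udp", "icmp"],
    "service": ["http", "domain_u", "ecr_i"],
    "flag": ["SF", "SF", "REJ"],
    "src_bytes": [181, 45, 1032],
})

# columns= encodes only the listed columns and removes the originals;
# the new columns are named <original>_<value>, e.g. protocol_type_tcp
encoded = pd.get_dummies(toy, columns=["protocol_type", "service", "flag"])
print(list(encoded.columns))
```

The prefixes make it clear which original feature each indicator column came from, and they avoid possible name clashes between category values and existing columns.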

In [41]:
X = kdd.drop("label", axis=1)


In [42]:
y = kdd["label"]

In [43]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1, stratify=y)
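
The `stratify=y` argument matters here because the classes are heavily imbalanced: it asks `train_test_split` to keep the class proportions the same in both partitions. A minimal sketch on toy data (the names `X_toy`, `y_toy` and the 80/20 class ratio are assumptions for illustration):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy imbalanced target: 80 samples of class 0, 20 of class 1
y_toy = pd.Series([0] * 80 + [1] * 20)
X_toy = pd.DataFrame({"feature": range(100)})

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.30, random_state=1, stratify=y_toy
)

# Both partitions keep the 80/20 class ratio of the full data
print(y_tr.value_counts(normalize=True))
print(y_te.value_counts(normalize=True))
```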


In [44]:
from sklearn.model_selection import cross_val_score

In [45]:
from sklearn.tree import DecisionTreeClassifier 
DTC = DecisionTreeClassifier(random_state=1)

In [50]:
print(cross_val_score(DTC, X_train, y_train, cv=3, scoring='accuracy'), "Decision tree - Training data")

[0.99957493 0.99949685 0.99962695] Decision tree - Training data


In [51]:
DTC.fit(X_train, y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=1,
            splitter='best')

In [52]:
y_test_pred = DTC.predict(X_test)

In [53]:
from sklearn import metrics

In [54]:
print(metrics.accuracy_score(y_test, y_test_pred), "Decision tree - Test data")

0.9996626317423046 Decision tree - Test data


In [56]:
# Pair importances with the training columns (X), not kdd, which still contains "label"
for name, importance in zip(X.columns, DTC.feature_importances_):
    print(name, importance)

duration 0.00011826099488086687
src_bytes 0.007204472957071258
dst_bytes 0.0006473859151003633
land 9.680969171038845e-06
wrong_fragment 0.006618004455340361
urgent 0.0
hot 0.0016118432880662839
num_failed_logins 0.00032767338192025443
logged_in 2.7438053492070796e-05
lnum_compromised 0.012982459307259803
lroot_shell 2.9806542263293217e-05
lsu_attempted 0.0
lnum_root 2.008094361766279e-05
lnum_file_creations 3.196382479462524e-05
lnum_shells 9.577980137304436e-06
lnum_access_files 9.590411928706086e-06
is_guest_login 0.00015824329078476508
count 0.0001414382983742737
srv_count 0.6047664215645189
serror_rate 0.0002530516847069509
srv_serror_rate 8.387437647606245e-06
rerror_rate 0.00025531246023770406
srv_rerror_rate 1.8553590633219443e-05
same_srv_rate 0.32605842388196493
diff_srv_rate 0.0003562354699174849
srv_diff_host_rate 9.622254856837166e-06
dst_host_count 0.00027842071308286585
dst_host_srv_count 0.0001532314993875599
dst_host_same_srv_rate 0.00025183257402315676
dst_host_diff_s
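
A long unsorted list of importances is hard to scan. One way to read it more easily is to wrap the values in a `pd.Series` indexed by the training columns and sort it. The sketch below uses a toy tree and toy data (all names and values are assumptions for illustration); with the notebook's objects the same idea would use `DTC.feature_importances_` and `X.columns`.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Toy data (assumed): "signal" determines the class, "noise" is constant
X_toy = pd.DataFrame({
    "signal": [0, 0, 1, 1, 0, 1],
    "noise": [5, 5, 5, 5, 5, 5],
})
y_toy = [0, 0, 1, 1, 0, 1]

tree = DecisionTreeClassifier(random_state=1).fit(X_toy, y_toy)

# Index the importances by the training columns, then sort in descending order
importances = pd.Series(tree.feature_importances_, index=X_toy.columns)
print(importances.sort_values(ascending=False))
```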

In [58]:
from sklearn.metrics import classification_report

In [61]:
class_names = ['back',
 'ipsweep',
 'neptune',
 'nmap',
 'normal',
 'other',
 'pod',
 'portsweep',
 'satan',
 'smurf',
 'teardrop',
 'warezclient']


In [62]:
print(classification_report(y_test, y_test_pred, target_names=class_names))

             precision    recall  f1-score   support

       back       1.00      1.00      1.00       661
    ipsweep       1.00      1.00      1.00       374
    neptune       1.00      1.00      1.00     32160
       nmap       0.96      0.99      0.97        69
     normal       1.00      1.00      1.00     29183
      other       0.84      0.80      0.82        54
        pod       0.99      1.00      0.99        79
  portsweep       0.99      0.99      0.99       312
      satan       0.99      0.98      0.99       477
      smurf       1.00      1.00      1.00     84237
   teardrop       1.00      1.00      1.00       294
warezclient       0.99      0.99      0.99       306

avg / total       1.00      1.00      1.00    148206
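
In the report above, the `other` class reaches an f1-score of only 0.82 while the overall accuracy is about 0.9997, because the largest classes dominate the total. A macro-averaged F1, which weights every class equally, is one way to surface this. The sketch below uses toy labels (assumed values) to show how the two metrics can diverge.

```python
from sklearn.metrics import accuracy_score, f1_score

# Toy labels (assumed): class 0 dominates, class 1 is rare and half of it is missed
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 8 + [1, 0]

acc = accuracy_score(y_true, y_pred)
# average="macro" computes F1 per class and takes the unweighted mean
macro_f1 = f1_score(y_true, y_pred, average="macro")
print(acc, macro_f1)
```

With the notebook's objects, the equivalent call would be `f1_score(y_test, y_test_pred, average="macro")`.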

