APRENDIZAJE DEL DETECTOR DE INTRUSIONES

El software para detectar intrusiones en la red protege una red informática de usuarios no autorizados, incluidos, quizás, personas con información privilegiada. La tarea de aprendizaje del detector de intrusiones es construir un modelo predictivo (es decir, un clasificador) capaz de distinguir entre conexiones "malas", llamadas intrusiones o ataques, y conexiones normales "buenas".

El Programa de evaluación de detección de intrusiones de DARPA de 1998 fue preparado y administrado por MIT Lincoln Labs. El objetivo era relevar y evaluar la investigación en detección de intrusos. Se proporcionó un conjunto estándar de datos para auditar, que incluye una amplia variedad de intrusiones simuladas en un entorno de red militar. El concurso de detección de intrusos KDD de 1999 utiliza una versión de este conjunto de datos.

Lincoln Labs estableció un entorno para adquirir nueve semanas de datos de volcado de TCP sin procesar para una red de área local (LAN) que simula una LAN típica de la Fuerza Aérea de EE. UU. Operaron la LAN como si fuera un verdadero entorno de la Fuerza Aérea, pero la salpicaron con múltiples ataques.

Los datos de entrenamiento sin procesar eran aproximadamente cuatro gigabytes de datos de volcado TCP binarios comprimidos de siete semanas de tráfico de red. Esto se procesó en aproximadamente cinco millones de registros de conexión. Del mismo modo, las dos semanas de datos de prueba arrojaron alrededor de dos millones de registros de conexión.

Una conexión es una secuencia de paquetes TCP que comienzan y terminan en momentos bien definidos, entre los cuales los datos fluyen hacia y desde una dirección IP de origen a una dirección IP de destino bajo algún protocolo bien definido. Cada conexión se etiqueta como normal o como un ataque, con exactamente un tipo de ataque específico. Cada registro de conexión consta de unos 100 bytes.

Los ataques se dividen en cuatro categorías principales:

    DOS: denegación de servicio, por ejemplo, syn flood;
    R2L: acceso no autorizado desde una máquina remota, por ejemplo, adivinar una contraseña;
    U2R: acceso no autorizado a privilegios de superusuario local (root), por ejemplo, varios ataques de "desbordamiento de búfer";
    Sondeo: vigilancia y otros sondeos, por ejemplo, escaneo de puertos.
    
Fuente: https://kdd.ics.uci.edu/databases/kddcup99/task.html

In [1]:
import numpy as np
import pandas as pd
import math as math
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
from sklearn.datasets import fetch_kddcup99

In [3]:
X, y = fetch_kddcup99(percent10=True, download_if_missing=True, return_X_y=True)

In [4]:
X.shape

(494021, 41)

In [5]:
y.shape

(494021,)

In [6]:
dt = [('duration', int),
      ('protocol_type', 'S4'),
      ('service', 'S11'),
      ('flag', 'S6'),
      ('src_bytes', int),
      ('dst_bytes', int),
      ('land', int),
      ('wrong_fragment', int),
      ('urgent', int),
      ('hot', int),
      ('num_failed_logins', int),
      ('logged_in', int),
      ('num_compromised', int),
      ('root_shell', int),
      ('su_attempted', int),
      ('num_root', int),
      ('num_file_creations', int),
      ('num_shells', int),
      ('num_access_files', int),
      ('num_outbound_cmds', int),
      ('is_host_login', int),
      ('is_guest_login', int),
      ('count', int),
      ('srv_count', int),
      ('serror_rate', float),
      ('srv_serror_rate', float),
      ('rerror_rate', float),
      ('srv_rerror_rate', float),
      ('same_srv_rate', float),
      ('diff_srv_rate', float),
      ('srv_diff_host_rate', float),
      ('dst_host_count', int),
      ('dst_host_srv_count', int),
      ('dst_host_same_srv_rate', float),
      ('dst_host_diff_srv_rate', float),
      ('dst_host_same_src_port_rate', float),
      ('dst_host_srv_diff_host_rate', float),
      ('dst_host_serror_rate', float),
      ('dst_host_srv_serror_rate', float),
      ('dst_host_rerror_rate', float),
      ('dst_host_srv_rerror_rate', float),
      ('labels', 'S16')]

In [7]:
column_names = [c[0] for c in dt]
df_conexiones = pd.DataFrame(data=np.column_stack((X,y)),columns=column_names)

In [8]:
# df_conexiones = df_conexiones.astype(dt)
df_conexiones = df_conexiones.infer_objects()

In [9]:
df_conexiones.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 494021 entries, 0 to 494020
Data columns (total 42 columns):
 #   Column                       Non-Null Count   Dtype  
---  ------                       --------------   -----  
 0   duration                     494021 non-null  int64  
 1   protocol_type                494021 non-null  object 
 2   service                      494021 non-null  object 
 3   flag                         494021 non-null  object 
 4   src_bytes                    494021 non-null  int64  
 5   dst_bytes                    494021 non-null  int64  
 6   land                         494021 non-null  int64  
 7   wrong_fragment               494021 non-null  int64  
 8   urgent                       494021 non-null  int64  
 9   hot                          494021 non-null  int64  
 10  num_failed_logins            494021 non-null  int64  
 11  logged_in                    494021 non-null  int64  
 12  num_compromised              494021 non-null  int64  
 13 

In [10]:
df_conexiones.head()

Unnamed: 0,duration,protocol_type,service,flag,src_bytes,dst_bytes,land,wrong_fragment,urgent,hot,...,dst_host_srv_count,dst_host_same_srv_rate,dst_host_diff_srv_rate,dst_host_same_src_port_rate,dst_host_srv_diff_host_rate,dst_host_serror_rate,dst_host_srv_serror_rate,dst_host_rerror_rate,dst_host_srv_rerror_rate,labels
0,0,b'tcp',b'http',b'SF',181,5450,0,0,0,0,...,9,1.0,0.0,0.11,0.0,0.0,0.0,0.0,0.0,b'normal.'
1,0,b'tcp',b'http',b'SF',239,486,0,0,0,0,...,19,1.0,0.0,0.05,0.0,0.0,0.0,0.0,0.0,b'normal.'
2,0,b'tcp',b'http',b'SF',235,1337,0,0,0,0,...,29,1.0,0.0,0.03,0.0,0.0,0.0,0.0,0.0,b'normal.'
3,0,b'tcp',b'http',b'SF',219,1337,0,0,0,0,...,39,1.0,0.0,0.03,0.0,0.0,0.0,0.0,0.0,b'normal.'
4,0,b'tcp',b'http',b'SF',217,2032,0,0,0,0,...,49,1.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0,b'normal.'


In [11]:
df_conexiones.labels.unique()

array([b'normal.', b'buffer_overflow.', b'loadmodule.', b'perl.',
       b'neptune.', b'smurf.', b'guess_passwd.', b'pod.', b'teardrop.',
       b'portsweep.', b'ipsweep.', b'land.', b'ftp_write.', b'back.',
       b'imap.', b'satan.', b'phf.', b'nmap.', b'multihop.',
       b'warezmaster.', b'warezclient.', b'spy.', b'rootkit.'],
      dtype=object)

**Ejercicio 1**

CONSIGNA:

1. Mediante la función SelectKBest de ScikitLearn, detectar los features más relevantes del dataset provisto

In [12]:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()

In [13]:
le.fit(df_conexiones['protocol_type'])

In [14]:
df_conexiones['protocol_type_le']=le.transform(df_conexiones['protocol_type'])

In [15]:
df_conexiones.protocol_type_le.unique()

array([1, 2, 0])

In [29]:
df_conexiones['service_le']=le.fit_transform(df_conexiones['service'])
df_conexiones.service_le.unique()

array([22, 50, 17, 11,  3, 56, 18, 13, 39, 14, 40, 45, 43, 19, 48, 59, 31,
       29, 47, 20, 52, 32, 65, 10, 30, 24,  8,  7, 38, 49,  0, 37, 23, 16,
       44, 15,  5, 62, 26, 27, 12,  9, 55, 54, 25, 21,  6, 42, 53, 63, 34,
       35, 33, 51, 64,  4,  2, 28, 36, 60,  1, 61, 41, 57, 58, 46])

In [30]:
df_conexiones['flag_le']=le.fit_transform(df_conexiones['flag'])

In [31]:
le.classes_

array([b'OTH', b'REJ', b'RSTO', b'RSTOS0', b'RSTR', b'S0', b'S1', b'S2',
       b'S3', b'SF', b'SH'], dtype=object)

In [16]:
di={b'normal.':'normal'}

In [17]:
df_conexiones.labels.unique()

array([b'normal.', b'buffer_overflow.', b'loadmodule.', b'perl.',
       b'neptune.', b'smurf.', b'guess_passwd.', b'pod.', b'teardrop.',
       b'portsweep.', b'ipsweep.', b'land.', b'ftp_write.', b'back.',
       b'imap.', b'satan.', b'phf.', b'nmap.', b'multihop.',
       b'warezmaster.', b'warezclient.', b'spy.', b'rootkit.'],
      dtype=object)

In [50]:
df_conexiones.loc[df_conexiones['labels']== 'normal','labels_le']=1


In [53]:
print(df_conexiones.loc[df_conexiones['labels']== 'normal','labels_le'])

Series([], Name: labels_le, dtype: float64)


In [51]:
df_conexiones.loc[df_conexiones['labels']!= 'normal','labels_le']=0

In [54]:
print(df_conexiones.loc[df_conexiones['labels']!= 'normal','labels_le'])

0         0.0
1         0.0
2         0.0
3         0.0
4         0.0
         ... 
494016    0.0
494017    0.0
494018    0.0
494019    0.0
494020    0.0
Name: labels_le, Length: 494021, dtype: float64


In [52]:
df_conexiones.labels_le.unique()

array([0.])

In [35]:
df_conexiones.labels_le.value_counts()

0.0    494021
Name: labels_le, dtype: int64

In [36]:
df_corr=df_conexiones.corr()

In [37]:
df_corr

Unnamed: 0,duration,src_bytes,dst_bytes,land,wrong_fragment,urgent,hot,num_failed_logins,logged_in,num_compromised,...,dst_host_same_src_port_rate,dst_host_srv_diff_host_rate,dst_host_serror_rate,dst_host_srv_serror_rate,dst_host_rerror_rate,dst_host_srv_rerror_rate,protocol_type_le,labels_le,service_le,flag_le
duration,1.0,0.004258,0.00544,-0.000452,-0.003235,0.003786,0.013213,0.005239,-0.017265,0.058095,...,0.042642,-0.006983,-0.0304,-0.030612,0.006739,0.010465,0.163251,,0.078995,0.019739
src_bytes,0.004258,1.0,-2e-06,-2e-05,-0.000139,-5e-06,0.004483,-2.7e-05,0.001701,0.000119,...,-0.000724,0.001186,-0.000718,0.001122,-0.000393,0.001328,0.001904,,-0.001206,-0.00288
dst_bytes,0.00544,-2e-06,1.0,-0.000175,-0.001254,0.016288,0.004365,0.04933,0.047814,0.023298,...,-0.020143,0.008707,-0.011334,-0.011235,-0.005,-0.005471,0.024519,,0.006135,0.013191
land,-0.000452,-2e-05,-0.000175,1.0,-0.000318,-1.7e-05,-0.000295,-6.5e-05,-0.002784,-3.8e-05,...,0.003799,0.08332,0.012658,0.007795,-0.001511,-0.001665,0.006178,,-0.002285,-0.008427
wrong_fragment,-0.003235,-0.000139,-0.001254,-0.000318,1.0,-0.000123,-0.002106,-0.000467,-0.019908,-0.000271,...,-0.031803,0.012092,-0.019091,-0.022104,0.029774,-0.011904,0.113568,,0.0672,0.024541
urgent,0.003786,-5e-06,0.016288,-1.7e-05,-0.000123,1.0,0.000356,0.141996,0.006164,0.014285,...,-0.002002,-0.000408,-0.001194,-0.001191,-0.000648,-0.000641,0.002381,,0.004074,0.001322
hot,0.013213,0.004483,0.004365,-0.000295,-0.002106,0.000356,1.0,0.00874,0.105305,0.007348,...,-0.052923,-0.004467,-0.019491,-0.020201,-0.006541,-0.007749,0.040859,,-0.012767,0.021437
num_failed_logins,0.005239,-2.7e-05,0.04933,-6.5e-05,-0.000467,0.141996,0.00874,1.0,-0.001145,0.006907,...,-0.009565,0.016001,-0.001945,-0.002453,0.024753,0.023584,0.009056,,0.023183,-0.014497
logged_in,-0.017265,0.001701,0.047814,-0.002784,-0.019908,0.006164,0.105305,-0.001145,1.0,0.013612,...,-0.461558,0.140493,-0.190955,-0.191704,-0.090868,-0.087885,0.386216,,0.0664,0.211729
num_compromised,0.058095,0.000119,0.023298,-3.8e-05,-0.000271,0.014285,0.007348,0.006907,0.013612,1.0,...,-0.006715,0.000621,-0.001978,-0.001631,-0.000843,-0.000873,0.005257,,0.007513,0.002822


**Ejercicio 2**

CONSIGNA:

1. Usando las variables más relevantes y separando los datos en "train" y "test", utilizar un árbol de decisión para realizar la predicción sobre si es una conexión normal o no:

In [55]:
X = df_conexiones[['count','srv_count','protocol_type_le','logged_in']]
y = df_conexiones[['labels_le']]

In [56]:
from sklearn.tree import DecisionTreeClassifier
arbol = DecisionTreeClassifier(max_depth=4)

In [57]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0, stratify=y)

In [58]:
arbol.fit(X_train,y_train)

In [61]:
from sklearn.metrics import accuracy_score
y_train_pred = arbol.predict(X_train)
y_test_pred = arbol.predict(X_test)

print('Porcentaje de aciertos sobre el conjunto de entrenamiento:', accuracy_score(y_train_pred, y_train))
print('Porcentaje de aciertos sobre el conjunto de prueba:', accuracy_score(y_test_pred, y_test))

Porcentaje de aciertos sobre el conjunto de entrenamiento: 1.0
Porcentaje de aciertos sobre el conjunto de prueba: 1.0


CONSIGNA:

1. Usando las variables más relevantes y separando los datos en "train" y "test", utilizar un árbol de decisión para realizar la predicción sobre que tipo de conexión es pero teniendo en cuenta todas las posibles:

**Ejercicio 3**

CONSIGNA:

1. Con el modelo del ejercicio 3, realizar esta vez una validación cruzada.
2. Mostrar la curva de validación

3. Aplicar GridSearch