<a href="https://colab.research.google.com/github/bdwalker1/UCSD_MLE_Bootcamp_Capstone/blob/master/PreviousWorkReview/malware_detection_on_IoT_Bruce_Walker_Version.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **20.6.2 Capstone: Survey Existing Research and Reproduce Available Solutions**

*This notebook, created by Jaime Moranchel, was retrieved from Kaggle (https://www.kaggle.com/code/jaimemoranchel/malware-dection-on-iot) and adapted to work in my Google CoLab account.*

I've added inline notes (look for comments starting with "NOTE:") where I have observations or have modified the code.

My observations and conclusions are at the bottom of the Notebook.

---
---


# Network Malware Detection Connection Analysis exercise

## Initialize the dataset and create the dataframe

In [1]:
# NOTE: Connect to my Google drive for data
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import preprocessing

In [3]:
# NOTE: Setting the path to the data files on my Google drive

# Path for CoLab
data_path = "/content/drive/MyDrive/UCSD_MLE_Bootcamp_Capstone/data/MalwareDetectionInNetworkTrafficData/raw"


In [4]:
# NOTE: Moranchel wrote his notebook to only work on one data file at a time. I change
#       the filename as needed to try the code on different files from the dataset

# Load the dataset from a CSV file
df = pd.read_csv( data_path + '/CTU-IoT-Malware-Capture-35-1conn.log.labeled.csv', delimiter='|')

In [5]:
# Dataset information
print(df.head())
print("------------------------------")
print(df.info())
print("------------------------------")
print(f'registros totales: {len(df)}')
print(df['label'].value_counts())

             ts                 uid      id.orig_h  id.orig_p       id.resp_h  \
0  1.545403e+09  CdNmOg26ZIaBRzPvWj  192.168.1.196    59932.0  104.248.160.24   
1  1.545403e+09  CgzGV333k9WCximeu8  192.168.1.196    59932.0  104.248.160.24   
2  1.545403e+09  CLm5Pd3ZnqmYVjrZ44  192.168.1.196    59932.0  104.248.160.24   
3  1.545403e+09  CDn2pd1rDD1lCMXAia  192.168.1.196    35883.0     192.168.1.1   
4  1.545403e+09  C1NKkV3tB4rImzbpDj  192.168.1.196    43531.0     192.168.1.1   

   id.resp_p proto service  duration orig_bytes  ... local_resp missed_bytes  \
0       80.0   tcp       -  3.097754          0  ...          -          0.0   
1       80.0   tcp       -         -          -  ...          -          0.0   
2       80.0   tcp       -         -          -  ...          -          0.0   
3       53.0   udp     dns  5.005148         78  ...          -          0.0   
4       53.0   udp     dns  5.005145         78  ...          -          0.0   

  history orig_pkts  orig_ip_byt

Since there are too many records, I will use 100,000 to speed up the process.

In [6]:
# NOTE: Though Moranchel's code limited the working dataset to 100,000 records, I
#       have increased the sample size to see if model performance changes.

if len(df) > 3000000:
    sample_df = df.sample(n=3000000)
else:
    sample_df = df.copy()
copy_df = sample_df.copy()  # Work on a copy so the original dataframe is not modified
print(f'registros totales: {len(sample_df)}')


registros totales: 3000000


Are there any null records?

In [7]:
null_values = copy_df.isnull().sum()
print(null_values)


ts                     0
uid                    0
id.orig_h              0
id.orig_p              0
id.resp_h              0
id.resp_p              0
proto                  0
service                0
duration               0
orig_bytes             0
resp_bytes             0
conn_state             0
local_orig             0
local_resp             0
missed_bytes           0
history                0
orig_pkts              0
orig_ip_bytes          0
resp_pkts              0
resp_ip_bytes          0
tunnel_parents         0
label                  0
detailed-label    627464
dtype: int64


## Analyze 'history'

In [8]:
copy_df['history'].value_counts()  # Shows TCP connection states

history,count
S,2370659
I,608502
DTT,18840
D,722
Sr,548
-,458
Dd,178
F,19
ShAfFa,10
ShAdDaR,7


In [9]:
label_encoder = preprocessing.LabelEncoder()  # Encoder that maps categorical values to integer codes
copy_df['history'] = label_encoder.fit_transform(copy_df['history'])  # Replace 'history' with integer codes
copy_df['history'].unique()  # Unique encoded history values

array([ 7,  9,  1,  3,  4,  0, 36, 14, 23, 26, 33,  5, 21, 32, 29, 19, 28,
       17, 12, 31, 35, 16,  8, 27, 15, 34,  2, 18,  6, 25, 10, 30, 13, 24,
       20, 22, 11])
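
`LabelEncoder` assigns integer codes according to the sorted order of the distinct values, and the fitted `classes_` array records which code belongs to which original string. A minimal sketch on a few made-up `history` strings (the toy values and the names `toy_history`, `encoder`, and `mapping` are illustrative, not taken from the capture file):

```python
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for the 'history' column (illustrative values only)
toy_history = ["S", "I", "DTT", "S", "D", "-"]

encoder = LabelEncoder()
codes = encoder.fit_transform(toy_history)

# classes_ holds the distinct values in sorted order; the position is the code
mapping = {str(value): int(code) for code, value in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform recovers the original strings from the integer codes
restored = [str(v) for v in encoder.inverse_transform(codes)]
print(restored)
```

Because the notebook refits the same `label_encoder` object on each column, only the most recently fitted mapping stays available afterwards; keeping one encoder per column would preserve all of them.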

## Analyze 'detailed-label'
Here I will use one-hot encoding because these are not labels with a discrete set.

In [10]:
copy_df['detailed-label'].value_counts()

detailed-label,count
-,2372531
FileDownload,5
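
The null count earlier in the notebook shows 627,464 missing `detailed-label` values. By default `pd.get_dummies` creates no indicator column for missing values, so those rows end up with no indicator set in any generated column. A small sketch on made-up data (the frame `toy` is illustrative) that uses the `dummy_na=True` option to make the missing rows explicit:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the 'detailed-label' column (illustrative values only)
toy = pd.DataFrame({"detailed-label": ["-", np.nan, "FileDownload", np.nan, "-"]})

# Default behaviour: NaN rows get no indicator set in any column
default_dummies = pd.get_dummies(toy["detailed-label"])
print(default_dummies)

# dummy_na=True adds a separate indicator column for missing values
with_na = pd.get_dummies(toy["detailed-label"], dummy_na=True)
print(with_na)
```

Whether a separate missing-value column is useful here is a modelling decision; the sketch only shows what the default silently does.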


In [11]:
# One-hot encode 'detailed-label' and replace the original column with the indicator columns
onehot = pd.get_dummies(copy_df['detailed-label'])
copy_df = copy_df.join(onehot)
copy_df.drop(['detailed-label'], axis=1, inplace=True)
copy_df.head()

Unnamed: 0,ts,uid,id.orig_h,id.orig_p,id.resp_h,id.resp_p,proto,service,duration,orig_bytes,...,missed_bytes,history,orig_pkts,orig_ip_bytes,resp_pkts,resp_ip_bytes,tunnel_parents,label,-,FileDownload
4810957,1545452000.0,CKBdi02J9s1LhAtaj8,192.168.1.196,15590.0,209.97.190.136,80.0,tcp,-,3.588951,0,...,0.0,7,4.0,160.0,0.0,0.0,-,Malicious DDoS,False,False
1259960,1545416000.0,CpqE8mtQOhxjy1ksc,192.168.1.196,35094.0,223.136.170.87,23.0,tcp,-,3.125969,0,...,0.0,9,3.0,180.0,0.0,0.0,-,Benign,True,False
9153200,1545476000.0,C57Ua224Zbompq800c,192.168.1.196,34286.0,199.174.182.165,23.0,tcp,-,3.092500,0,...,0.0,9,3.0,180.0,0.0,0.0,-,Benign,True,False
9233881,1545477000.0,C4SdIs4CpuCV54PvW8,192.168.1.196,48580.0,175.249.244.183,23.0,tcp,-,3.148744,0,...,0.0,9,3.0,180.0,0.0,0.0,-,Benign,True,False
9263654,1545477000.0,C59aEG2tjXJtmpSX5f,192.168.1.196,52898.0,134.90.82.108,23.0,tcp,-,-,-,...,0.0,9,1.0,60.0,0.0,0.0,-,Benign,True,False


Are there any nulls?

In [12]:
null_values = copy_df.isnull().sum()
print(null_values)

ts                0
uid               0
id.orig_h         0
id.orig_p         0
id.resp_h         0
id.resp_p         0
proto             0
service           0
duration          0
orig_bytes        0
resp_bytes        0
conn_state        0
local_orig        0
local_resp        0
missed_bytes      0
history           0
orig_pkts         0
orig_ip_bytes     0
resp_pkts         0
resp_ip_bytes     0
tunnel_parents    0
label             0
-                 0
FileDownload      0
dtype: int64


We transform the rest of the categorical fields.


In [13]:
# Timestamp
copy_df['ts'] = pd.to_numeric(copy_df['ts'])
# Connection identifiers
copy_df['uid'] = label_encoder.fit_transform(copy_df['uid'])
# Origin host
copy_df['id.orig_h'] = label_encoder.fit_transform(copy_df['id.orig_h'])
# Destination host
copy_df['id.resp_h'] = label_encoder.fit_transform(copy_df['id.resp_h'])
# Origin port
copy_df['id.orig_p'] = label_encoder.fit_transform(copy_df['id.orig_p'])
# Destination port
copy_df['id.resp_p'] = label_encoder.fit_transform(copy_df['id.resp_p'])

In [14]:
# One-hot encode 'proto' and replace the original column with the indicator columns
onehot = pd.get_dummies(copy_df['proto'])
copy_df = copy_df.join(onehot)
copy_df.drop(['proto'], axis=1, inplace=True)
copy_df.head()

Unnamed: 0,ts,uid,id.orig_h,id.orig_p,id.resp_h,id.resp_p,service,duration,orig_bytes,resp_bytes,...,orig_ip_bytes,resp_pkts,resp_ip_bytes,tunnel_parents,label,-,FileDownload,icmp,tcp,udp
4810957,1545452000.0,976822,169,15590,1331044,9,-,3.588951,0,0,...,160.0,0.0,0.0,-,Malicious DDoS,False,False,False,True,False
1259960,1545416000.0,2509415,169,35093,1441374,7,-,3.125969,0,0,...,180.0,0.0,0.0,-,Benign,True,False,False,True,False
9153200,1545476000.0,247352,169,34285,1230526,7,-,3.092500,0,0,...,180.0,0.0,0.0,-,Benign,True,False,False,True,False
9233881,1545477000.0,215302,169,48579,968350,7,-,3.148744,0,0,...,180.0,0.0,0.0,-,Benign,True,False,False,True,False
9263654,1545477000.0,249013,169,52897,415598,7,-,-,-,-,...,60.0,0.0,0.0,-,Benign,True,False,False,True,False


In [15]:
# Connection state
copy_df['conn_state'] = label_encoder.fit_transform(copy_df['conn_state'])
# Missed bytes: drop the column if it is all zeros, otherwise encode it
if (copy_df['missed_bytes'] == 0).all():
    copy_df.drop(['missed_bytes'], axis=1, inplace=True)
else:
    copy_df['missed_bytes'] = label_encoder.fit_transform(copy_df['missed_bytes'])
# Packets sent by the originator
copy_df['orig_pkts'] = pd.to_numeric(copy_df['orig_pkts'])
# Packets sent by the responder
copy_df['resp_pkts'] = pd.to_numeric(copy_df['resp_pkts'])
# IP bytes sent by the originator
copy_df['orig_ip_bytes'] = pd.to_numeric(copy_df['orig_ip_bytes'])
# IP bytes sent by the responder
copy_df['resp_ip_bytes'] = pd.to_numeric(copy_df['resp_ip_bytes'])
# Labels (the prediction target)
copy_df['label'] = label_encoder.fit_transform(copy_df['label'])
# Service
copy_df['service'] = label_encoder.fit_transform(copy_df['service'])
# Duration
copy_df['duration'] = label_encoder.fit_transform(copy_df['duration'])
# Origin bytes
copy_df['orig_bytes'] = label_encoder.fit_transform(copy_df['orig_bytes'])
# Destination bytes
copy_df['resp_bytes'] = label_encoder.fit_transform(copy_df['resp_bytes'])
# Local connection?
copy_df['local_orig'] = label_encoder.fit_transform(copy_df['local_orig'])
copy_df['local_resp'] = label_encoder.fit_transform(copy_df['local_resp'])
# Does it belong to a tunnel?
copy_df['tunnel_parents'] = label_encoder.fit_transform(copy_df['tunnel_parents'])

copy_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 3000000 entries, 4810957 to 6899871
Data columns (total 26 columns):
 #   Column          Dtype  
---  ------          -----  
 0   ts              float64
 1   uid             int64  
 2   id.orig_h       int64  
 3   id.orig_p       int64  
 4   id.resp_h       int64  
 5   id.resp_p       int64  
 6   service         int64  
 7   duration        int64  
 8   orig_bytes      int64  
 9   resp_bytes      int64  
 10  conn_state      int64  
 11  local_orig      int64  
 12  local_resp      int64  
 13  missed_bytes    int64  
 14  history         int64  
 15  orig_pkts       float64
 16  orig_ip_bytes   float64
 17  resp_pkts       float64
 18  resp_ip_bytes   float64
 19  tunnel_parents  int64  
 20  label           int64  
 21  -               bool   
 22  FileDownload    bool   
 23  icmp            bool   
 24  tcp             bool   
 25  udp             bool   
dtypes: bool(5), float64(5), int64(16)
memory usage: 582.3 MB
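
The `info()` output lists `duration`, `orig_bytes`, and `resp_bytes` as `int64` because they were label-encoded, which replaces each measured value with a category code assigned in string sort order. One alternative, sketched here on toy data as an assumption rather than as the original notebook's method, is to treat the `-` placeholder as missing and keep the columns numeric. The fill value of 0 and the `duration_missing` flag column are hypothetical choices for illustration.

```python
import pandas as pd

# Toy rows using '-' as the "not available" placeholder, as seen in the data preview
toy = pd.DataFrame({
    "duration": ["3.097754", "-", "5.005148"],
    "orig_bytes": ["0", "-", "78"],
})

numeric_cols = ["duration", "orig_bytes"]
for col in numeric_cols:
    # errors='coerce' turns anything unparsable (here '-') into NaN
    toy[col] = pd.to_numeric(toy[col], errors="coerce")

# Record which rows were missing, then fill with a sentinel (hypothetical choice: 0)
toy["duration_missing"] = toy["duration"].isna()
toy[numeric_cols] = toy[numeric_cols].fillna(0)
print(toy.dtypes)
print(toy)
```

Keeping the values numeric preserves their ordering and magnitude, which a label code does not.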


## Splitting data for training and testing

In [16]:
from sklearn.model_selection import train_test_split

X = copy_df.drop('label', axis=1)
y = copy_df['label']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
y_train.head()
y_train.value_counts()

label,count
0,1897679
2,502294
1,27
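
Class 1 has only 27 training rows here, so a purely random split can leave very few of them in the test set. A hedged sketch, on synthetic labels, of passing `stratify` to `train_test_split` so that each class keeps roughly the same proportion in both parts (the arrays `toy_X` and `toy_y` are made up for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels (illustrative only): 790 / 10 / 200 rows
toy_y = np.array([0] * 790 + [1] * 10 + [2] * 200)
toy_X = np.arange(len(toy_y)).reshape(-1, 1)  # placeholder feature

toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42, stratify=toy_y
)

# With stratification the rare class appears in both parts
print(np.bincount(toy_y_train))
print(np.bincount(toy_y_test))
```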


Normalize the data

In [17]:
from sklearn.preprocessing import Normalizer

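# Normalizer rescales each sample (row) to unit norm; it does not standardize features (columns).
# X_train_scaled and X_test_scaled are not used by the model cells below, which fit on X_train.
scaler = Normalizer()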
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
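
`Normalizer` rescales each row to unit norm, whereas `StandardScaler` standardizes each column to zero mean and unit variance. A short sketch on a toy matrix (the values and the name `toy_X` are illustrative) showing the difference:

```python
import numpy as np
from sklearn.preprocessing import Normalizer, StandardScaler

# Toy feature matrix (illustrative): rows 0 and 1 differ only in magnitude
toy_X = np.array([[3.0, 4.0], [30.0, 40.0], [1.0, 0.0]])

# Normalizer works per row: every sample ends up with L2 norm 1
row_normalized = Normalizer().fit_transform(toy_X)
row_norms = np.linalg.norm(row_normalized, axis=1)
print(row_norms)

# StandardScaler works per column: every feature ends up with mean 0, std 1
col_standardized = StandardScaler().fit_transform(toy_X)
print(col_standardized.mean(axis=0), col_standardized.std(axis=0))
```

In this toy example the first two rows become identical after row normalization, so the difference in magnitude between them is lost.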

## Naive Bayes model

In [18]:
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report, accuracy_score
import time

nb_model = GaussianNB()
start_time = time.time()
nb_model.fit(X_train, y_train)
training_time = time.time() - start_time

start_time = time.time()
y_pred_train = nb_model.predict(X_train)
prediction_time_train = time.time() - start_time
accuracy_train = accuracy_score(y_train, y_pred_train)
report_train = classification_report(y_train, y_pred_train)
print("\n\nResultados en datos de entrenamiento (Naive Bayes):")
print(f"Exactitud: {accuracy_train}")
print(f"Reporte de clasificación:\n{report_train}")
print(f"\nTiempo de ejecución: {prediction_time_train}")


start_time = time.time()
y_pred_test = nb_model.predict(X_test)
prediction_time_test = time.time() - start_time
accuracy_test = accuracy_score(y_test, y_pred_test)
report_test = classification_report(y_test, y_pred_test)
print("\nResultados en datos de prueba (Naive Bayes):")
print(f"Exactitud: {accuracy_test}")
print(f"Reporte de clasificación:\n{report_test}")
print(f"\nTiempo de ejecución: {prediction_time_test}")



Resultados en datos de entrenamiento (Naive Bayes):
Exactitud: 0.7977454166666667
Reporte de clasificación:
              precision    recall  f1-score   support

           0       0.80      1.00      0.89   1897679
           1       1.00      0.19      0.31        27
           2       0.98      0.03      0.07    502294

    accuracy                           0.80   2400000
   macro avg       0.92      0.41      0.42   2400000
weighted avg       0.83      0.80      0.71   2400000


Tiempo de ejecución: 0.8573720455169678

Resultados en datos de prueba (Naive Bayes):
Exactitud: 0.79846
Reporte de clasificación:
              precision    recall  f1-score   support

           0       0.80      1.00      0.89    474852
           1       0.00      0.00      0.00         4
           2       0.98      0.03      0.07    125144

    accuracy                           0.80    600000
   macro avg       0.59      0.34      0.32    600000
weighted avg       0.83      0.80      0.72    6000

  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


In [19]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score
import time

rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
start_time = time.time()
rf_model.fit(X_train, y_train)
training_time = time.time() - start_time

start_time = time.time()
y_pred_train_rf = rf_model.predict(X_train)
prediction_time_train = time.time() - start_time
accuracy_train_rf = accuracy_score(y_train, y_pred_train_rf)
report_train_rf = classification_report(y_train, y_pred_train_rf)
print("\n\nResultados en datos de entrenamiento (Random Forest):")
print(f"Exactitud: {accuracy_train_rf}")
print(f"Reporte de clasificación:\n{report_train_rf}")
print(f"\nTiempo de ejecución: {prediction_time_train}")
start_time = time.time()
y_pred_test_rf = rf_model.predict(X_test)
prediction_time_test = time.time() - start_time
accuracy_test_rf = accuracy_score(y_test, y_pred_test_rf)
report_test_rf = classification_report(y_test, y_pred_test_rf)
print("\n\nResultados en datos de prueba (Random Forest):")
print(f"Exactitud: {accuracy_test_rf}")
print(f"Reporte de clasificación:\n{report_test_rf}")
print(f"\nTiempo de ejecución: {prediction_time_test}")



Resultados en datos de entrenamiento (Random Forest):
Exactitud: 1.0
Reporte de clasificación:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00   1897679
           1       1.00      1.00      1.00        27
           2       1.00      1.00      1.00    502294

    accuracy                           1.00   2400000
   macro avg       1.00      1.00      1.00   2400000
weighted avg       1.00      1.00      1.00   2400000


Tiempo de ejecución: 4.923268795013428


Resultados en datos de prueba (Random Forest):
Exactitud: 1.0
Reporte de clasificación:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00    474852
           1       1.00      1.00      1.00         4
           2       1.00      1.00      1.00    125144

    accuracy                           1.00    600000
   macro avg       1.00      1.00      1.00    600000
weighted avg       1.00      1.00      1.00    600000


Tiempo de 

---
---
# **Observations:**

* I believe Jaime Moranchel made a fundamental mistake in leaving the "detailed-label" information in his feature dataset. The detailed label provides additional information about the records not labeled as benign; it is therefore an additional target, not a feature.
  * This can be seen in the lower performance of the Naive Bayes model on data files with little to no detailed label information (e.g. CTU-IoT-Malware-Capture-9-1conn.log.labeled.csv and CTU-IoT-Malware-Capture-35-1conn.log.labeled.csv).
* I believe Moranchel should have eliminated more of the columns from the feature dataset before training. In particular:
  * Leaving the origin and response IP address information in the feature set means the model may not generalize well to catch malicious network traffic that originates from new hosts or targets new destination devices.
  * Without any sort of time series prediction, leaving the timestamp column in the feature set does not seem helpful.
  * As the "uid" field is unique to every record, I believe it has no predictive value.
  * Some fields have only one value for the entire dataset and thus have no predictive value. Though the model would probably ignore these fields during training, removing them before fitting may save time and computing power.
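
The column removals suggested in these observations could be sketched as follows. This is a minimal illustration on a toy frame, not code from the original notebook; the helper name `drop_uninformative_columns` and the exact column list are assumptions based on the observations.

```python
import pandas as pd

# Columns the observations suggest removing (assumed list, adjust as needed)
LEAKY_OR_ID_COLUMNS = ["detailed-label", "uid", "ts", "id.orig_h", "id.resp_h"]

def drop_uninformative_columns(df: pd.DataFrame, target: str = "label") -> pd.DataFrame:
    """Return a copy without identifier/leaky columns and constant columns."""
    out = df.drop(columns=[c for c in LEAKY_OR_ID_COLUMNS if c in df.columns])
    # A column with a single distinct value carries no information for the model
    constant = [c for c in out.columns
                if c != target and out[c].nunique(dropna=False) <= 1]
    return out.drop(columns=constant)

# Toy frame (illustrative values only)
toy = pd.DataFrame({
    "ts": [1.0, 2.0, 3.0],
    "uid": ["a", "b", "c"],
    "id.orig_h": ["192.168.1.196"] * 3,
    "id.resp_h": ["10.0.0.1", "10.0.0.2", "10.0.0.3"],
    "local_orig": ["-", "-", "-"],  # constant column
    "orig_pkts": [4.0, 3.0, 1.0],
    "detailed-label": ["-", None, "-"],
    "label": ["Benign", "Malicious DDoS", "Benign"],
})
features = drop_uninformative_columns(toy)
print(list(features.columns))
```

Applying the filter before any encoding step keeps the one-hot columns derived from "detailed-label" out of the feature set as well.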



# **Conclusions:**

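*   While the Naive Bayes and Random Forest models both showed high accuracy, the Naive Bayes model scored lower in precision and recall. In the run shown above, Random Forest reached 1.00 on every test metric, whereas Naive Bayes reached about 0.80 test accuracy with a recall of only 0.03 for class 2 and 0.00 for class 1.
*   Despite the high scores, I believe Moranchel's model is flawed by inclusion of the "detailed-label" information in the training feature sets.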
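
Because the classes are heavily imbalanced, accuracy alone can look good even for a model that rarely predicts the minority classes. A hedged sketch on synthetic labels (the class proportions are only loosely modelled on the split above, and `toy_X` and `toy_y` are made up) comparing a majority-class baseline's accuracy with its balanced accuracy and macro F1:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score, f1_score

# Synthetic labels (illustrative): 79% class 0, about 21% class 2, a few class 1
toy_y = np.array([0] * 790 + [1] * 2 + [2] * 208)
toy_X = np.zeros((len(toy_y), 1))  # features are irrelevant for this baseline

# Baseline that always predicts the most frequent class
baseline = DummyClassifier(strategy="most_frequent").fit(toy_X, toy_y)
pred = baseline.predict(toy_X)

acc = accuracy_score(toy_y, pred)
bal_acc = balanced_accuracy_score(toy_y, pred)  # mean of per-class recall
macro_f1 = f1_score(toy_y, pred, average="macro", zero_division=0)
print(f"accuracy={acc:.2f} balanced_accuracy={bal_acc:.2f} macro_f1={macro_f1:.2f}")
```

The Naive Bayes test accuracy of about 0.80 reported above is close to the share of class 0 in the test set (474,852 of 600,000 rows), which is why the per-class recall is the more informative figure here.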


