# **Task 2: IoTMal2020-CDMC: IoT Malware Detection**

Using the byte sequences collected at the entry points of ELF files as discriminant features and the programs' malware families as training labels, participants must perform a classification task that predicts the malware families of the test samples.

The dataset consists of 72,638 samples generated as follows. First, malicious and benign Linux programs in ELF format were collected from various sources. Then, from each program, the first 2K bytes starting at the file's entry point were extracted (zero-padded if the file was not long enough). These byte sequences were encoded with a simple cipher to remove sensitive information and then passed through a base64 encoder to yield readable radix-64 representations. The labels (malware family types) of the binary files were determined by state-of-the-art anti-virus engines.
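
The preprocessing described above can be sketched as follows. This is a minimal illustration, not the committee's actual pipeline: the cipher is not specified in the task description, so a single-byte XOR with a made-up key (`HYPOTHETICAL_KEY`) stands in for it, and `encode_entry_bytes` is a hypothetical helper name.

```python
import base64

WINDOW = 2048            # "first 2K bytes" from the task description
HYPOTHETICAL_KEY = 0x5A  # assumption: the real cipher and key are not disclosed


def encode_entry_bytes(code: bytes) -> str:
    """Zero-pad or truncate to WINDOW bytes, obfuscate, then base64-encode."""
    window = code[:WINDOW].ljust(WINDOW, b"\x00")
    obfuscated = bytes(b ^ HYPOTHETICAL_KEY for b in window)  # stand-in cipher
    return base64.b64encode(obfuscated).decode("ascii")


encoded = encode_entry_bytes(b"\x7fELF example bytes")
print(len(encoded))  # 2048 bytes -> ceil(2048 / 3) * 4 = 2732 base64 characters
```

The length of 2732 matches the number of `data` columns built later in the notebook. Because 2048 is not a multiple of 3, every encoded string ends with a single `=` padding character (ASCII code 61).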

## **File Format**

The two .csv files share the same format as described below.

Apart from the header line, each line provides three characterizing features for a unique ELF file.

- The "Family" column gives the family type of the binary file and serves as the class label for this classification task.
- The "CPU" column gives the CPU architecture for which the file was compiled. Participants may decide whether to use it as side information to improve prediction performance.
- The "ByteSequence" column holds the encoded first 2K bytes of the ELF file, produced by the steps described above.

The test file differs from the training file only in that the "Family" column is set to "Unknown" on every line.
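
As a small illustration of the layout described above, the sketch below parses a tiny in-memory CSV with the same three columns and separates labelled rows from `Unknown` ones. The two rows and their shortened `ByteSequence` values are made up for the example; the real files hold much longer base64 strings.

```python
import io

import pandas as pd

# Made-up rows that mimic the documented columns (ByteSequence shortened).
sample_csv = io.StringIO(
    "Family,CPU,ByteSequence\n"
    "Mirai,mipsel,KmaXqyZm\n"
    "Unknown,x86el,cd6hA7V8\n"
)
df = pd.read_csv(sample_csv)

labelled = df[df["Family"] != "Unknown"]    # rows usable for training
unlabelled = df[df["Family"] == "Unknown"]  # rows whose family must be predicted
print(list(df.columns), len(labelled), len(unlabelled))
```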

## **Task**

Participants are required to predict the labels of the test samples using the information provided in the task.

## **Credits and Clarifications**

The original analysis data for the IoT malware classification task were kindly contributed by the Taiwan Information Security Center (TWISC). The data were processed by the CDMC 2020 committee, with all sensitive information removed.

## Import packages and dependencies

In [1]:
import gc
import joblib
import string
import numpy as np
import pandas as pd
from random import Random
import xgboost
from sklearn.preprocessing import LabelEncoder
from sklearn.decomposition import PCA
from wordcloud import STOPWORDS
from sklearn.metrics import precision_score, recall_score, f1_score
from sklearn.preprocessing import OneHotEncoder
from imblearn.combine import SMOTETomek

import tensorflow as tf
import tensorflow_hub as hub
from tensorflow import keras
import lightgbm
from tensorflow.keras.optimizers import SGD, Adam
from tensorflow.keras.layers import Dense, Input, Dropout, GlobalAveragePooling1D
from tensorflow.keras.models import Model, Sequential
from tensorflow.keras.callbacks import ModelCheckpoint, EarlyStopping, Callback
from sklearn.model_selection import StratifiedKFold, StratifiedShuffleSplit, train_test_split, cross_validate, KFold
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.ensemble import RandomForestClassifier

SEED = 1337

In [2]:
df_train = pd.read_csv('.././Data/CDMC2020IoTMalware.train.csv')
df_test = pd.read_csv('.././Data/CDMC2020IoTMalware.test.csv')

In [3]:
df_train

Unnamed: 0,Family,CPU,ByteSequence
0,Mirai,mipsel,KmaXqyZmLYFmZmZmiGb2N31ARwYqlyCrKjZmZv4TDutmZr...
1,Mirai,x86el,cd6hA7V8vHvPzlal/NiIjqVESoGOmPelTI+Bjv4dCWZm0r...
2,Mirai,mipsel,KmaXqyZmLYFmZmZmFWb2N33DRwYqlyCrKjZmZnCkDutmZr...
3,Mirai,mipseb,q5dmKoEtZiZmZmZmN/ZmFQZH3+erIJcqZmY2KusOpHDrvm...
4,Mirai,ppceb,/W7mhs4qZvMClsutNLfla05mZmZEKux7/Y6rW70mZmZKA2...
...,...,...,...
36323,BenignWare,x86_64el,cd6VA9mheQOweXy8e8/OlZFoaPHTZnmRic/x02Z5kZFL2N...
36324,BenignWare,mips64eb,N61mM2bbyyruGHsh8ttKbBRMZqtmZmZmq0xmjmZmZmarl2...
36325,BenignWare,x86el,cd6hA7V8vHvPzlalRvGIjqWejIGOmPel0cWBjv7KO+zs0h...
36326,BenignWare,mipseb,q5dmKoEtZiZmZmZmN/ZmGAZHlJNmZjYq6w5Kk+u+ZmYGW2...


In [4]:
df_test

Unnamed: 0,Family,CPU,ByteSequence
0,Unknown,x86el,cd6hA7V8vHvPzlalpiKIjqVESoGOmPelTI+Bjv5aQ2Zm0r...
1,Unknown,mipsel,KmaXqyZmLYFmZmZmFWb2N33rRwYqlyCrKjZmZv4TDutmZr...
2,Unknown,mipseb,q5dmKoEtZiZmZmZmN/ZmiAZHgwWrIJcqZmY2KusOE/7rvm...
3,Unknown,sparceb,nstmZkc4lgubq5ZNRKuW4y1mZq+aZmbT6mZm0b3MTHBfzJ...
4,Unknown,mipsel,KmaXqyZmLYFmZmZmFWb2N+c+RwYqlyCrKjZmZoSkDutmZr...
...,...,...,...
36303,Unknown,mipsel,KmaXqyZmLYFmZmZmFWb2N4R9RwYqlyCrKjZmZtGkDutmZr...
36304,Unknown,x86el,cd6hA7V8vHvPzlalKVqIjqVHUoGOmPelvaiBjv7Ee+zs0h...
36305,Unknown,armel,ZkaWFmaXlhaBy/C8YEyWtYFMXVmBZl1Zy2ggWYFoXVklZi...
36306,Unknown,armel,ZkaWFmaXlhaBy/C8YEyWtYFMXVmBZl1ZZZYgWe6T67Crlp...


In [5]:
# Check for rows with missing values.
df_train.isnull().sum(axis=1).value_counts()

0    36328
dtype: int64

## Substring extraction

In [6]:
# def cut(s):
#     global n
#     tmp = [s[i:i + x + 1] for x in range(len(s)) for i in range(len(s) - x)]
#     pd.value_counts(tmp).to_csv(str(n)+'.csv')
#     n+=1
#     del tmp
#     gc.collect()
#     return n

In [None]:
# n=0
# df_train[df_train['Family']=='Dofloo']['ByteSequence'].map(lambda x: cut(x))

In [11]:
# len(result)

3733278

## Model stacking

In [6]:
ascii = [np.frombuffer(x.encode(), dtype=np.uint8).tolist()
         for x in df_train['ByteSequence']]

In [7]:
train_ascii = pd.DataFrame(ascii)

In [8]:
columns_data = ['data' + str(x) for x in range(2732)]
train_ascii.columns = columns_data

In [9]:
train = df_train.join(train_ascii).drop(['ByteSequence'], axis=1)

In [10]:
train

Unnamed: 0,Family,CPU,data0,data1,data2,data3,data4,data5,data6,data7,...,data2722,data2723,data2724,data2725,data2726,data2727,data2728,data2729,data2730,data2731
0,Mirai,mipsel,75,109,97,88,113,121,90,109,...,117,66,90,114,50,107,52,79,115,61
1,Mirai,x86el,99,100,54,104,65,55,86,56,...,116,121,90,75,43,121,57,119,103,61
2,Mirai,mipsel,75,109,97,88,113,121,90,109,...,86,98,114,85,98,115,108,115,115,61
3,Mirai,mipseb,113,53,100,109,75,111,69,116,...,87,89,122,119,43,87,109,79,77,61
4,Mirai,ppceb,47,87,55,109,104,115,52,113,...,70,109,74,113,99,117,71,109,73,61
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
36323,BenignWare,x86_64el,99,100,54,86,65,57,109,104,...,120,53,110,101,51,82,43,101,119,61
36324,BenignWare,mips64eb,78,54,49,109,77,50,98,98,...,49,109,90,103,56,43,90,109,89,61
36325,BenignWare,x86el,99,100,54,104,65,55,86,56,...,51,115,55,79,122,43,51,57,103,61
36326,BenignWare,mipseb,113,53,100,109,75,111,69,116,...,82,75,73,87,98,84,98,67,111,61
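
The cells above use the ASCII codes of the base64 characters as features. One possible alternative, shown here only as a sketch and not something this notebook does, is to undo the base64 step first so that each feature is one (still enciphered) byte, giving 2048 columns instead of 2732. `to_byte_features` is a hypothetical helper, and the input string is synthetic.

```python
import base64

import numpy as np


def to_byte_features(b64_text: str) -> np.ndarray:
    """Decode a base64 string into a vector of uint8 byte values."""
    raw = base64.b64decode(b64_text)
    return np.frombuffer(raw, dtype=np.uint8)


# Synthetic stand-in for one ByteSequence value: 2048 arbitrary bytes.
fake_sequence = base64.b64encode(bytes(range(256)) * 8).decode("ascii")

# The notebook's approach: one feature per base64 character.
char_codes = np.frombuffer(fake_sequence.encode(), dtype=np.uint8)
# The decoded alternative: one feature per underlying byte.
byte_values = to_byte_features(fake_sequence)
print(char_codes.shape, byte_values.shape)
```

Whether the decoded representation helps the classifiers is an empirical question that would need to be checked with cross-validation.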


In [11]:
ascii = [np.frombuffer(x.encode(), dtype=np.uint8).tolist()
         for x in df_test['ByteSequence']]

In [12]:
test_ascii = pd.DataFrame(ascii)

In [13]:
test_ascii.columns = columns_data

In [14]:
test = df_test.join(test_ascii).drop(['ByteSequence'], axis=1)

In [15]:
test

Unnamed: 0,Family,CPU,data0,data1,data2,data3,data4,data5,data6,data7,...,data2722,data2723,data2724,data2725,data2726,data2727,data2728,data2729,data2730,data2731
0,Unknown,x86el,99,100,54,104,65,55,86,56,...,116,121,90,75,43,121,57,119,103,61
1,Unknown,mipsel,75,109,97,88,113,121,90,109,...,117,66,90,114,50,107,52,79,115,61
2,Unknown,mipseb,113,53,100,109,75,111,69,116,...,119,50,79,117,54,66,90,105,89,61
3,Unknown,sparceb,110,115,116,109,90,107,99,52,...,48,104,103,98,77,108,50,69,81,61
4,Unknown,mipsel,75,109,97,88,113,121,90,109,...,89,55,54,56,47,76,115,119,89,61
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
36303,Unknown,mipsel,75,109,97,88,113,121,90,109,...,89,55,54,52,104,109,48,56,115,61
36304,Unknown,x86el,99,100,54,104,65,55,86,56,...,54,103,83,79,122,115,102,75,103,61
36305,Unknown,armel,90,107,97,87,70,109,97,88,...,90,109,100,77,117,84,116,70,107,61
36306,Unknown,armel,90,107,97,87,70,109,97,88,...,120,86,116,87,97,84,118,86,107,61


### One-hot encoding

In [16]:
df = pd.concat([train,test])

In [17]:
df = pd.concat([df.drop(['CPU'], axis=1), pd.get_dummies(df['CPU'])], axis=1)

In [72]:
train = df.iloc[:36328,:].copy()
test = df.iloc[36328:,:].copy()
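
The hard-coded `36328` above is the number of training rows. A slightly more defensive variant, sketched here on toy frames, takes the split point from the length of the training frame and resets the index so that the concatenated frame has unique row labels. The toy column values are invented for illustration.

```python
import pandas as pd

toy_train = pd.DataFrame({"Family": ["Mirai", "BenignWare"], "CPU": ["mipsel", "x86el"]})
toy_test = pd.DataFrame({"Family": ["Unknown"], "CPU": ["armel"]})

# Encode CPU on the combined frame so train and test share the same dummy columns.
combined = pd.concat([toy_train, toy_test], ignore_index=True)
combined = pd.concat(
    [combined.drop(columns="CPU"), pd.get_dummies(combined["CPU"])], axis=1
)

n_train = len(toy_train)  # split point derived from the data, not hard-coded
train_enc = combined.iloc[:n_train].copy()
test_enc = combined.iloc[n_train:].copy()
print(list(train_enc.columns))
```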

In [73]:
train

Unnamed: 0,Family,data0,data1,data2,data3,data4,data5,data6,data7,data8,...,armel,mips64eb,mipseb,mipsel,ppceb,sh4el,sparceb,unknown,x86_64el,x86el
0,Mirai,75,109,97,88,113,121,90,109,76,...,0,0,0,1,0,0,0,0,0,0
1,Mirai,99,100,54,104,65,55,86,56,118,...,0,0,0,0,0,0,0,0,0,1
2,Mirai,75,109,97,88,113,121,90,109,76,...,0,0,0,1,0,0,0,0,0,0
3,Mirai,113,53,100,109,75,111,69,116,90,...,0,0,1,0,0,0,0,0,0,0
4,Mirai,47,87,55,109,104,115,52,113,90,...,0,0,0,0,1,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
36323,BenignWare,99,100,54,86,65,57,109,104,101,...,0,0,0,0,0,0,0,0,1,0
36324,BenignWare,78,54,49,109,77,50,98,98,121,...,0,1,0,0,0,0,0,0,0,0
36325,BenignWare,99,100,54,104,65,55,86,56,118,...,0,0,0,0,0,0,0,0,0,1
36326,BenignWare,113,53,100,109,75,111,69,116,90,...,0,0,1,0,0,0,0,0,0,0


In [20]:
test

Unnamed: 0,Family,data0,data1,data2,data3,data4,data5,data6,data7,data8,...,armel,mips64eb,mipseb,mipsel,ppceb,sh4el,sparceb,unknown,x86_64el,x86el
0,Unknown,99,100,54,104,65,55,86,56,118,...,0,0,0,0,0,0,0,0,0,1
1,Unknown,75,109,97,88,113,121,90,109,76,...,0,0,0,1,0,0,0,0,0,0
2,Unknown,113,53,100,109,75,111,69,116,90,...,0,0,1,0,0,0,0,0,0,0
3,Unknown,110,115,116,109,90,107,99,52,108,...,0,0,0,0,0,0,1,0,0,0
4,Unknown,75,109,97,88,113,121,90,109,76,...,0,0,0,1,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
36303,Unknown,75,109,97,88,113,121,90,109,76,...,0,0,0,1,0,0,0,0,0,0
36304,Unknown,99,100,54,104,65,55,86,56,118,...,0,0,0,0,0,0,0,0,0,1
36305,Unknown,90,107,97,87,70,109,97,88,108,...,1,0,0,0,0,0,0,0,0,0
36306,Unknown,90,107,97,87,70,109,97,88,108,...,1,0,0,0,0,0,0,0,0,0


### Model training

In [21]:
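# Separate the training features from the training target.
# (x_train and y_train are used by the predict_proba and stacking cells below.)
x_train = train.iloc[:, 1:]
y_train = train['Family']

# Split the data into training and validation sets.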
# x_train, x_val, y_train, y_val = train_test_split(
#     data, target, test_size=0.2, stratify=target, random_state=0)

# pca = PCA(n_components='mle')
# x_train = pca.fit_transform(x_train)
# joblib.dump(pca, '.././Data/pca_mle.m')

In [13]:
# pca = joblib.load('.././Data/pca_mle.m')
# x_val = pca.transform(x_val)

In [20]:
# train = pd.DataFrame(x_train).join(y_train)

In [23]:
kfold = KFold(n_splits=5, random_state=0, shuffle=True)  # cv
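
Because the family labels are heavily imbalanced, a stratified splitter may give more stable per-fold scores than plain `KFold`. The sketch below uses a synthetic 90/10 label vector (not the competition data) to show that `StratifiedKFold` keeps the minority share the same in every fold.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic labels: 90 majority samples and 10 minority samples.
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((len(y), 1))  # features are irrelevant for the split itself

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
minority_per_fold = [int(y[test_idx].sum()) for _, test_idx in skf.split(X, y)]
print(minority_per_fold)
```

Note that scikit-learn warns when a class has fewer members than `n_splits`, which may well apply to the rarest families in this dataset.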

In [17]:
# Use SMOTE + Tomek links (combined over- and under-sampling) to address class imbalance.
# smote_tomek = SMOTETomek(random_state=0, n_jobs=-1)

### Model for families with many samples

In [29]:
x_train1 = train[train['Family'].isin(['Bashlite', 'BenignWare', 'Mirai'])].iloc[:, 1:]
y_train1 = train[train['Family'].isin(['Bashlite', 'BenignWare', 'Mirai'])]['Family']
# LabelEncoder
lab1 = LabelEncoder().fit(y_train1)
y_train1 = lab1.transform(y_train1)

# x_train1, y_train1 = smote_tomek.fit_sample(x_train1, y_train1)

In [31]:
rf1 = RandomForestClassifier(n_jobs=-1)
rf1_results = cross_validate(
    rf1, x_train1, y_train1, cv=kfold, scoring=['accuracy', 'f1_macro'])

In [32]:
joblib.dump(rf1_results, '.././Data/rf1_results.m')

['.././Data/rf1_results.m']

In [33]:
# Reload the saved cross-validation results and assign them before printing.
rf1_results = joblib.load('.././Data/rf1_results.m')
print(rf1_results)

{'fit_time': array([7.96464324, 6.06417489, 5.84753418, 6.08489251, 6.41700125]), 'score_time': array([0.22233653, 0.20639396, 0.21478105, 0.22105765, 0.19232178]), 'test_accuracy': array([0.98731859, 0.98435959, 0.98450049, 0.98435959, 0.98661218]), 'test_f1_macro': array([0.98208646, 0.97782807, 0.97843142, 0.97837564, 0.98142116])}


In [34]:
rf1.fit(x_train1, y_train1)
joblib.dump(rf1, ".././Data/rf1.m")

['.././Data/rf1.m']

In [35]:
rf1 = joblib.load(".././Data/rf1.m")
rf1_y_prob = rf1.predict_proba(x_train)

### Model for families with a moderate number of samples

In [36]:
x_train2 = train[train['Family'].isin(['Tsunami', 'Android'])].iloc[:, 1:]
y_train2 = train[train['Family'].isin(['Tsunami', 'Android'])]['Family']
# LabelEncoder
lab2 = LabelEncoder().fit(y_train2)
y_train2 = lab2.transform(y_train2)

# x_train2, y_train2 = smote_tomek.fit_sample(x_train2, y_train2)

In [37]:
rf2 = RandomForestClassifier(n_jobs=-1)
rf2_results = cross_validate(
    rf2, x_train2, y_train2, cv=kfold, scoring=['accuracy', 'f1_macro'])

In [38]:
joblib.dump(rf2_results, '.././Data/rf2_results.m')

['.././Data/rf2_results.m']

In [39]:
# Reload the saved cross-validation results and assign them before printing.
rf2_results = joblib.load('.././Data/rf2_results.m')
print(rf2_results)

{'fit_time': array([0.27347469, 0.27059388, 0.30560088, 0.28422713, 0.28769565]), 'score_time': array([0.03774905, 0.03794289, 0.04108906, 0.03642726, 0.04271793]), 'test_accuracy': array([0.98675497, 0.97350993, 0.96      , 0.98666667, 0.98666667]), 'test_f1_macro': array([0.9855861 , 0.96204123, 0.95404412, 0.98553241, 0.98484542])}


In [40]:
rf2.fit(x_train2, y_train2)
joblib.dump(rf2, ".././Data/rf2.m")

['.././Data/rf2.m']

In [41]:
rf2 = joblib.load(".././Data/rf2.m")
rf2_y_prob = rf2.predict_proba(x_train)

### Model for families with few samples

In [42]:
x_train3 = train[train['Family'].isin(['Dofloo', 'Hajime', 'Pnscan', 'Xorddos'])].iloc[:, 1:]
y_train3 = train[train['Family'].isin(['Dofloo', 'Hajime', 'Pnscan', 'Xorddos'])]['Family']
# LabelEncoder
lab3 = LabelEncoder().fit(y_train3)
y_train3 = lab3.transform(y_train3)

# x_train3, y_train3 = SMOTETomek(random_state=0, n_jobs=-1, n_neighbors=2).fit_sample(x_train3, y_train3)

In [43]:
rf3 = RandomForestClassifier(n_jobs=-1)
rf3_results = cross_validate(
    rf3, x_train3, y_train3, cv=kfold, scoring=['accuracy', 'f1_macro'])

In [44]:
joblib.dump(rf3_results, '.././Data/rf3_results.m')

['.././Data/rf3_results.m']

In [45]:
# Reload the saved cross-validation results and assign them before printing.
rf3_results = joblib.load('.././Data/rf3_results.m')
print(rf3_results)

{'fit_time': array([0.18182659, 0.1747489 , 0.18157911, 0.19614291, 0.16912532]), 'score_time': array([0.03251386, 0.03144741, 0.04611158, 0.03215623, 0.03117323]), 'test_accuracy': array([0.78947368, 0.84210526, 0.83333333, 0.88888889, 0.94444444]), 'test_f1_macro': array([0.52083333, 0.40740741, 0.62760181, 0.61666667, 0.86666667])}
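
The per-fold arrays are easier to compare as a mean and standard deviation. The sketch below copies the `test_f1_macro` values printed above for the three models and summarises them with the standard library; the dictionary keys are just labels chosen for this example.

```python
from statistics import mean, stdev

# test_f1_macro values copied from the cross_validate outputs above.
f1_macro = {
    "rf1": [0.98208646, 0.97782807, 0.97843142, 0.97837564, 0.98142116],
    "rf2": [0.9855861, 0.96204123, 0.95404412, 0.98553241, 0.98484542],
    "rf3": [0.52083333, 0.40740741, 0.62760181, 0.61666667, 0.86666667],
}

summary = {name: (mean(vals), stdev(vals)) for name, vals in f1_macro.items()}
for name, (m, s) in summary.items():
    print(f"{name}: mean={m:.3f}, std={s:.3f}")
```

The model for the rare families has both the lowest mean macro F1 and by far the largest spread across folds, which is consistent with how few samples those families have.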


In [46]:
rf3.fit(x_train3, y_train3)
joblib.dump(rf3, ".././Data/rf3.m")

['.././Data/rf3.m']

In [47]:
rf3 = joblib.load(".././Data/rf3.m")
rf3_y_prob = rf3.predict_proba(x_train)

In [48]:
stacking_train = pd.concat([pd.DataFrame(rf1_y_prob), pd.DataFrame(rf2_y_prob), pd.DataFrame(rf3_y_prob)], axis=1)

In [49]:
xgb = xgboost.XGBClassifier()
xgb.fit(stacking_train.values, y_train.values.ravel())

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.300000012, max_delta_step=0, max_depth=6,
              min_child_weight=1, missing=nan, monotone_constraints='()',
              n_estimators=100, n_jobs=0, num_parallel_tree=1,
              objective='multi:softprob', random_state=0, reg_alpha=0,
              reg_lambda=1, scale_pos_weight=None, subsample=1,
              tree_method='exact', validate_parameters=1, verbosity=None)
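
The meta-learner above is trained on probabilities that the base models produced for rows some of which they had already seen during fitting, which can make those features look more reliable than they would be on new data. A common remedy is to build the stacking features from out-of-fold predictions. The sketch below illustrates the idea on synthetic data with a single base model; it uses `LogisticRegression` as the meta-learner purely to keep the example light on dependencies, not because this notebook does.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Synthetic 3-class problem standing in for the malware features.
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=5, n_classes=3, random_state=0
)

base = RandomForestClassifier(n_estimators=50, random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Each row's probabilities come from a model that never saw that row.
oof_proba = cross_val_predict(base, X, y, cv=cv, method="predict_proba")

meta = LogisticRegression(max_iter=1000).fit(oof_proba, y)
base.fit(X, y)  # refit on all data for use at prediction time
print(oof_proba.shape)
```

At prediction time the refitted base model would supply `predict_proba` features for the test rows, and the meta-learner would map them to the final labels.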

In [52]:
x_test = test.iloc[:, 1:]

In [53]:
x_test

Unnamed: 0,data0,data1,data2,data3,data4,data5,data6,data7,data8,data9,...,armel,mips64eb,mipseb,mipsel,ppceb,sh4el,sparceb,unknown,x86_64el,x86el
0,99,100,54,104,65,55,86,56,118,72,...,0,0,0,0,0,0,0,0,0,1
1,75,109,97,88,113,121,90,109,76,89,...,0,0,0,1,0,0,0,0,0,0
2,113,53,100,109,75,111,69,116,90,105,...,0,0,1,0,0,0,0,0,0,0
3,110,115,116,109,90,107,99,52,108,103,...,0,0,0,0,0,0,1,0,0,0
4,75,109,97,88,113,121,90,109,76,89,...,0,0,0,1,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
36303,75,109,97,88,113,121,90,109,76,89,...,0,0,0,1,0,0,0,0,0,0
36304,99,100,54,104,65,55,86,56,118,72,...,0,0,0,0,0,0,0,0,0,1
36305,90,107,97,87,70,109,97,88,108,104,...,1,0,0,0,0,0,0,0,0,0
36306,90,107,97,87,70,109,97,88,108,104,...,1,0,0,0,0,0,0,0,0,0


In [54]:
rf1_y_prob = rf1.predict_proba(x_test)
rf2_y_prob = rf2.predict_proba(x_test)
rf3_y_prob = rf3.predict_proba(x_test)

stacking_test = pd.concat([pd.DataFrame(rf1_y_prob), pd.DataFrame(rf2_y_prob), pd.DataFrame(rf3_y_prob)], axis=1)
xgb_y_pre = xgb.predict(stacking_test.values)

In [58]:
pd.DataFrame(xgb_y_pre).to_csv('CDMC2020IoTMalware_predict.csv', header=False)

In [71]:
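# Note: y_val is not defined in the cells above. The report below appears to come from a
# run that used the commented-out train_test_split (20% hold-out), in which xgb_y_pre
# held the predictions for the validation rows rather than for the test file.
print(classification_report(y_val, xgb_y_pre))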

              precision    recall  f1-score   support

     Android       0.92      0.92      0.92        49
    Bashlite       0.99      0.93      0.96      1418
  BenignWare       1.00      0.99      0.99      3995
      Dofloo       0.50      0.75      0.60         4
      Hajime       0.60      1.00      0.75        12
       Mirai       0.98      0.94      0.96      1684
      Pnscan       0.00      0.00      0.00         1
     Tsunami       0.35      0.92      0.51       101
     Xorddos       1.00      0.50      0.67         2

    accuracy                           0.97      7266
   macro avg       0.70      0.77      0.71      7266
weighted avg       0.98      0.97      0.97      7266
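
The gap between the macro and weighted averages in the report comes from how the per-class scores are combined. As a check, the sketch below recomputes both F1 averages from the rounded per-class values printed above; because the inputs are rounded to two decimals, the results only match the report approximately.

```python
# (f1-score, support) per family, copied from the classification report above.
report = {
    "Android": (0.92, 49),
    "Bashlite": (0.96, 1418),
    "BenignWare": (0.99, 3995),
    "Dofloo": (0.60, 4),
    "Hajime": (0.75, 12),
    "Mirai": (0.96, 1684),
    "Pnscan": (0.00, 1),
    "Tsunami": (0.51, 101),
    "Xorddos": (0.67, 2),
}

total = sum(support for _, support in report.values())
# Macro average: every family counts equally, however rare it is.
macro_f1 = sum(f1 for f1, _ in report.values()) / len(report)
# Weighted average: each family counts in proportion to its support.
weighted_f1 = sum(f1 * support for f1, support in report.values()) / total
print(total, round(macro_f1, 2), round(weighted_f1, 2))
```

The macro average is pulled down by the rare families (Pnscan, Dofloo, Xorddos, Tsunami), while the weighted average is dominated by BenignWare, Mirai and Bashlite.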



