# **SEMA-Classifier**

![SEMA_illustration.png](attachment:SEMA_illustration.png)

## **Using SEMA-Classifier**

In [1]:
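%matplotlib inline
%matplotlib notebook
import os
# Move to the src directory; the path is relative, so run this cell only once per session.
os.chdir('../../src')
os.getcwd()
# Print the command-line help of the classifier.
!python3 ToolChainClassifier/ToolChainClassifier.py -h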

usage: ToolChainClassifier.py [-h] [--train] [--classifier CLASSIFIER]
                              [--threshold THRESHOLD]
                              [--biggest_subgraph BIGGEST_SUBGRAPH]
                              [--support SUPPORT] [--nthread NTHREAD]
                              [--verbose_classifier] [--ctimeout CTIMEOUT]
                              [--families FAMILIES [FAMILIES ...]]
                              [--mode MODE] [--epoch EPOCH]
                              binaries

Classification module arguments

optional arguments:
  -h, --help            show this help message and exit
  --biggest_subgraph BIGGEST_SUBGRAPH
                        Biggest subgraph consider for Gspan (default: 5)

Classification module arguments:
  --train               Launch training process, else classify/detect new
                        sample with previously computed model
  --classifier CLASSIFIER
                        Classifier used for the analysis am

### **Classify SCDG**
We classify SCDGs extracted from different malware families. We first train a model using an SVM with a Weisfeiler-Lehman kernel:

In [2]:
!python3 ToolChainClassifier/ToolChainClassifier.py --train --classifier=wl ../Tutorial/DATA/CLASS_DATA/TRAIN/

INFO - 2022-05-16 11:15:24,710 - ToolChainClassifier - ['../Tutorial/DATA/CLASS_DATA/TRAIN/ircbot', '../Tutorial/DATA/CLASS_DATA/TRAIN/shiz', '../Tutorial/DATA/CLASS_DATA/TRAIN/bancteian', '../Tutorial/DATA/CLASS_DATA/TRAIN/sfone', '../Tutorial/DATA/CLASS_DATA/TRAIN/simbot', '../Tutorial/DATA/CLASS_DATA/TRAIN/delf', '../Tutorial/DATA/CLASS_DATA/TRAIN/sytro', '../Tutorial/DATA/CLASS_DATA/TRAIN/autoit', '../Tutorial/DATA/CLASS_DATA/TRAIN/sillyp2p', '../Tutorial/DATA/CLASS_DATA/TRAIN/wabot']
INFO - 2022-05-16 11:15:24,711 - ToolChainClassifier - Namespace(biggest_subgraph=5, binaries='../Tutorial/DATA/CLASS_DATA/TRAIN/', classifier='wl', ctimeout=3, epoch=5, families=None, mode='classification', nthread=8, support=0.75, threshold=0.45, train=True, verbose_classifier=False)
INFO - 2022-05-16 11:15:24,801 - SVMWLClassifier - Path: ../Tutorial/DATA/CLASS_DATA/TRAIN/
N/A% (0 of 10) |                         | Elapsed Time: 0:00:00 ETA:  --:--:--
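
To make the classifier choice more concrete, here is a minimal sketch of the idea behind a Weisfeiler-Lehman subtree kernel. It is not the code used by ToolChainClassifier. It assumes undirected graphs given as adjacency dictionaries with string node labels; the names `wl_features` and `wl_kernel` and the toy graphs are hypothetical.

```python
from collections import Counter


def wl_features(adjacency, labels, iterations=2):
    """Count every node label seen during the WL refinement rounds."""
    current = dict(labels)
    features = Counter(current.values())
    for _ in range(iterations):
        refined = {}
        for node, neighbours in adjacency.items():
            # New label = own label plus the sorted labels of the neighbours.
            neighbour_labels = sorted(current[n] for n in neighbours)
            refined[node] = current[node] + "(" + ",".join(neighbour_labels) + ")"
        current = refined
        features.update(current.values())
    return features


def wl_kernel(graph_a, graph_b, iterations=2):
    """Dot product of the WL label histograms of two graphs."""
    fa = wl_features(*graph_a, iterations=iterations)
    fb = wl_features(*graph_b, iterations=iterations)
    return sum(count * fb[label] for label, count in fa.items())


# Toy graphs (hypothetical): (adjacency, node labels).
chain = ({0: [1], 1: [0, 2], 2: [1]}, {0: "open", 1: "read", 2: "close"})
pair = ({0: [1], 1: [0]}, {0: "open", 1: "close"})

print(wl_kernel(chain, chain, iterations=1))  # 6
print(wl_kernel(chain, pair, iterations=1))   # 2
```

A matrix of such pairwise kernel values is the kind of input a kernel SVM is trained on: graphs that share many refined labels get a high similarity score.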

We now have a trained model that we can use to classify new data:

In [3]:
!python3 ToolChainClassifier/ToolChainClassifier.py --classifier=wl ../Tutorial/DATA/CLASS_DATA/TEST/

INFO - 2022-05-16 11:15:31,681 - ToolChainClassifier - ['../Tutorial/DATA/CLASS_DATA/TEST/ircbot', '../Tutorial/DATA/CLASS_DATA/TEST/shiz', '../Tutorial/DATA/CLASS_DATA/TEST/bancteian', '../Tutorial/DATA/CLASS_DATA/TEST/sfone', '../Tutorial/DATA/CLASS_DATA/TEST/simbot', '../Tutorial/DATA/CLASS_DATA/TEST/delf', '../Tutorial/DATA/CLASS_DATA/TEST/sytro', '../Tutorial/DATA/CLASS_DATA/TEST/autoit', '../Tutorial/DATA/CLASS_DATA/TEST/sillyp2p', '../Tutorial/DATA/CLASS_DATA/TEST/wabot']
INFO - 2022-05-16 11:15:31,681 - ToolChainClassifier - Namespace(biggest_subgraph=5, binaries='../Tutorial/DATA/CLASS_DATA/TEST/', classifier='wl', ctimeout=3, epoch=5, families=None, mode='classification', nthread=8, support=0.75, threshold=0.45, train=False, verbose_classifier=False)
100% (10 of 10) |########################| Elapsed Time: 0:00:00 Time:  0:00:00
INFO:SVMWLClassifier:Accuracy 98.00 %
INFO:SVMWLClassifier:Precision 98.33 %
INFO:SVMWLClassifier:Recall 98.00 %
INFO:SVM

We get the following confusion matrix:
![Conf_matrix.png](attachment:Conf_matrix.png)
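
The accuracy, precision and recall reported in the log can all be derived from a confusion matrix. The sketch below shows one way to do this, assuming macro-averaged precision and recall (the averaging used by the tool is not stated here). The function name `metrics_from_confusion` and the 10-class matrix with five samples per class and a single error are hypothetical; the real test-set size is not given in this tutorial.

```python
def metrics_from_confusion(matrix):
    """matrix[i][j] = number of samples of true class i predicted as class j."""
    n_classes = len(matrix)
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(n_classes))
    precisions, recalls = [], []
    for k in range(n_classes):
        predicted_k = sum(matrix[i][k] for i in range(n_classes))  # column sum
        actual_k = sum(matrix[k])                                   # row sum
        precisions.append(matrix[k][k] / predicted_k if predicted_k else 0.0)
        recalls.append(matrix[k][k] / actual_k if actual_k else 0.0)
    return {
        "accuracy": correct / total,
        "precision": sum(precisions) / n_classes,  # macro average (assumption)
        "recall": sum(recalls) / n_classes,        # macro average (assumption)
    }


# Hypothetical matrix: 10 classes, 5 samples each, all correctly classified...
toy = [[5 if i == j else 0 for j in range(10)] for i in range(10)]
# ...except one sample of class 0 that is predicted as class 1.
toy[0][0], toy[0][1] = 4, 1

for name, value in metrics_from_confusion(toy).items():
    print(f"{name}: {value * 100:.2f} %")
```

With this hypothetical matrix, a single misclassified sample lowers the recall of its true class and the precision of the class it was assigned to, which is why precision and recall can differ even when only one error occurs.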

In a similar way, we can train a gSpan-based model and then use it to classify data. Note: this requires the SEMA-quickspan submodule (/src/submodules/) to be correctly installed and compiled (https://github.com/csvl/SEMA-quickspan).
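
Before launching the gSpan training, it can help to check that the quickspan binary has actually been built. The sketch below is a hypothetical helper (`find_gspan_binary` is not part of SEMA); it assumes the binary lives at `submodules/SEMA-quickspan/build/gspan` relative to the `src` directory, which is the location shown in the training log of this tutorial.

```python
import os


def find_gspan_binary(src_dir="."):
    """Return the path of the compiled gspan binary, or None if it is missing."""
    # Location assumed from the tutorial log; adapt it if your build differs.
    candidate = os.path.join(src_dir, "submodules", "SEMA-quickspan", "build", "gspan")
    if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
        return candidate
    return None


binary = find_gspan_binary()
print(binary if binary else "gspan binary not found: build SEMA-quickspan first")
```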

In [4]:
!python3 ToolChainClassifier/ToolChainClassifier.py --train --classifier=gspan ../Tutorial/DATA/CLASS_DATA/TRAIN/

INFO - 2022-05-16 11:15:33,505 - ToolChainClassifier - ['../Tutorial/DATA/CLASS_DATA/TRAIN/ircbot', '../Tutorial/DATA/CLASS_DATA/TRAIN/shiz', '../Tutorial/DATA/CLASS_DATA/TRAIN/bancteian', '../Tutorial/DATA/CLASS_DATA/TRAIN/sfone', '../Tutorial/DATA/CLASS_DATA/TRAIN/simbot', '../Tutorial/DATA/CLASS_DATA/TRAIN/delf', '../Tutorial/DATA/CLASS_DATA/TRAIN/sytro', '../Tutorial/DATA/CLASS_DATA/TRAIN/autoit', '../Tutorial/DATA/CLASS_DATA/TRAIN/sillyp2p', '../Tutorial/DATA/CLASS_DATA/TRAIN/wabot']
INFO - 2022-05-16 11:15:33,505 - ToolChainClassifier - Namespace(biggest_subgraph=5, binaries='../Tutorial/DATA/CLASS_DATA/TRAIN/', classifier='gspan', ctimeout=3, epoch=5, families=None, mode='classification', nthread=8, support=0.75, threshold=0.45, train=True, verbose_classifier=False)
INFO - 2022-05-16 11:15:33,507 - GSpanClassifier - Input Path = ../Tutorial/DATA/CLASS_DATA/TRAIN/
N/A% (0 of 10) |                         | Elapsed Time: 0:00:00 ETA:  --:-

INFO - 2022-05-16 11:15:46,933 - GSpanClassifier - Gspan_path = /home/bertrandvano/Documents/SEMA-ToolChain/src/submodules/SEMA-quickspan/build/gspan --input_file /home/bertrandvano/Documents/SEMA-ToolChain/src/ToolChainClassifier/sig/simbot_merge.gs --output_file /home/bertrandvano/Documents/SEMA-ToolChain/src/ToolChainClassifier/sig/simbot_sig.gs --pattern --biggest_subgraphs 5 --threads 8 --timeout  3 --support 0.75
INFO - 2022-05-16 11:15:49,992 - GSpanClassifier - b''
INFO - 2022-05-16 11:15:49,992 - GSpanClassifier - b'I0516 11:15:46.945909 420564 quickspan.cc:92] quickspan timeout: 3\nI0516 11:15:46.946831 420564 quickspan.cc:118] quickspan read input time: 0.000872\nI0516 11:15:46.948264 420564 quickspan_execute.cc:27] quickspan construct graph time: 0.001406\nI0516 11:15:46.948278 420564 quickspan_execute.cc:54] quickspan thread 0 create\nI0516 11:15:46.948285 420564 quickspan_execute.cc:54] quickspan thread 1 create\nI0516 11:15:46.948287

INFO - 2022-05-16 11:16:03,279 - GSpanClassifier - b''
INFO - 2022-05-16 11:16:03,279 - GSpanClassifier - b'I0516 11:16:00.249399 420663 quickspan.cc:92] quickspan timeout: 3\nI0516 11:16:00.251344 420663 quickspan.cc:118] quickspan read input time: 0.001778\nI0516 11:16:00.253849 420663 quickspan_execute.cc:27] quickspan construct graph time: 0.002464\nI0516 11:16:00.253873 420663 quickspan_execute.cc:54] quickspan thread 0 create\nI0516 11:16:00.253880 420663 quickspan_execute.cc:54] quickspan thread 1 create\nI0516 11:16:00.253883 420663 quickspan_execute.cc:54] quickspan thread 2 create\nI0516 11:16:00.253890 420663 quickspan_execute.cc:54] quickspan thread 3 create\nI0516 11:16:00.253892 420663 quickspan_execute.cc:54] quickspan thread 4 create\nI0516 11:16:00.253898 420663 quickspan_execute.cc:54] quickspan thread 5 create\nI0516 11:16:00.253904 420663 quickspan_execute.cc:54] quickspan thread 6 create\nI0516 11:16:00.253906 420663 quickspan_execute.cc:54]

In [5]:
!python3 ToolChainClassifier/ToolChainClassifier.py --classifier=gspan ../Tutorial/DATA/CLASS_DATA/TEST/

INFO - 2022-05-16 11:16:09,167 - ToolChainClassifier - ['../Tutorial/DATA/CLASS_DATA/TEST/ircbot', '../Tutorial/DATA/CLASS_DATA/TEST/shiz', '../Tutorial/DATA/CLASS_DATA/TEST/bancteian', '../Tutorial/DATA/CLASS_DATA/TEST/sfone', '../Tutorial/DATA/CLASS_DATA/TEST/simbot', '../Tutorial/DATA/CLASS_DATA/TEST/delf', '../Tutorial/DATA/CLASS_DATA/TEST/sytro', '../Tutorial/DATA/CLASS_DATA/TEST/autoit', '../Tutorial/DATA/CLASS_DATA/TEST/sillyp2p', '../Tutorial/DATA/CLASS_DATA/TEST/wabot']
INFO - 2022-05-16 11:16:09,167 - ToolChainClassifier - Namespace(biggest_subgraph=5, binaries='../Tutorial/DATA/CLASS_DATA/TEST/', classifier='gspan', ctimeout=3, epoch=5, families=None, mode='classification', nthread=8, support=0.75, threshold=0.45, train=False, verbose_classifier=False)
100% (10 of 10) |########################| Elapsed Time: 0:03:33 Time:  0:03:33
INFO:GSpanClassifier:Precision obtained : 0.8833333333333333
INFO:GSpanClassifier:Recall obtained : 0.88
INFO:GSpanCla

This gives the following confusion matrix:
![Conf_matrix_gspan.png](attachment:Conf_matrix_gspan.png)
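
The training log shows gspan being called with `--support 0.75`, which suggests that a pattern is kept only when it occurs in at least 75 % of the graphs of a family. The sketch below illustrates that notion of support in a deliberately simplified form: it checks containment of labelled edges instead of performing the subgraph-isomorphism search that gSpan does. The function names and the toy graphs (lists of `(source_label, target_label)` edges) are hypothetical.

```python
def pattern_support(pattern_edges, graphs):
    """Fraction of graphs whose edge set contains every edge of the pattern."""
    pattern = set(pattern_edges)
    hits = sum(1 for edges in graphs if pattern <= set(edges))
    return hits / len(graphs)


def frequent_edges(graphs, min_support=0.75):
    """Single-edge patterns whose support reaches the threshold."""
    candidates = set().union(*(set(g) for g in graphs))
    return sorted(e for e in candidates if pattern_support([e], graphs) >= min_support)


# Hypothetical graphs of one family, each reduced to a list of labelled edges.
family_graphs = [
    [("CreateFile", "WriteFile"), ("WriteFile", "CloseHandle")],
    [("CreateFile", "WriteFile"), ("RegOpenKey", "RegSetValue")],
    [("CreateFile", "WriteFile"), ("WriteFile", "CloseHandle")],
    [("RegOpenKey", "RegSetValue")],
]

print(frequent_edges(family_graphs, min_support=0.75))
# [('CreateFile', 'WriteFile')]
```

Only the edge present in three of the four graphs reaches the 0.75 threshold; lowering `min_support` keeps more patterns, at the cost of less specific ones.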

### **Detect malware from cleanware**
Finally, we can also train and use a classifier to distinguish malware from cleanware.

In [6]:
!python3 ToolChainClassifier/ToolChainClassifier.py --mode=detection --train --classifier=wl ../Tutorial/DATA/BIN_DATA/TRAIN/

INFO - 2022-05-16 11:19:45,061 - ToolChainClassifier - ['../Tutorial/DATA/BIN_DATA/TRAIN/malware', '../Tutorial/DATA/BIN_DATA/TRAIN/cleanware']
INFO - 2022-05-16 11:19:45,061 - ToolChainClassifier - Namespace(biggest_subgraph=5, binaries='../Tutorial/DATA/BIN_DATA/TRAIN/', classifier='wl', ctimeout=3, epoch=5, families=None, mode='detection', nthread=8, support=0.75, threshold=0.45, train=True, verbose_classifier=False)
INFO - 2022-05-16 11:19:45,230 - SVMWLClassifier - Path: ../Tutorial/DATA/BIN_DATA/TRAIN/
N/A% (0 of 2) |                          | Elapsed Time: 0:00:00 ETA:  --:--:--
INFO - 2022-05-16 11:19:45,236 - SVMWLClassifier - Subpath: /home/bertrandvano/Documents/SEMA-ToolChain/src/../Tutorial/DATA/BIN_DATA/TRAIN/malware/
INFO - 2022-05-16 11:19:45,330 - SVMWLClassifier - Subpath: /home/bertrandvano/Documents/SEMA-ToolChain/src/../Tutorial/DATA/BIN_DATA/TRAIN/cleanware/
100% (2 of 2) |######################

In [7]:
!python3 ToolChainClassifier/ToolChainClassifier.py --mode=detection --classifier=wl ../Tutorial/DATA/BIN_DATA/TEST/

INFO - 2022-05-16 11:19:52,134 - ToolChainClassifier - ['../Tutorial/DATA/BIN_DATA/TEST/malware', '../Tutorial/DATA/BIN_DATA/TEST/cleanware']
INFO - 2022-05-16 11:19:52,134 - ToolChainClassifier - Namespace(biggest_subgraph=5, binaries='../Tutorial/DATA/BIN_DATA/TEST/', classifier='wl', ctimeout=3, epoch=5, families=None, mode='detection', nthread=8, support=0.75, threshold=0.45, train=False, verbose_classifier=False)
100% (2 of 2) |##########################| Elapsed Time: 0:00:00 Time:  0:00:00
INFO - 2022-05-16 11:19:52,467 - ToolChainClassifier - Total detection time: 0.3332078456878662


We can observe that, with this classifier, the demo dataset is perfectly classified for this task:
![Conf_matrix_detection.png](attachment:Conf_matrix_detection.png)
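
All the commands used in this tutorial follow the same pattern: an optional `--mode`, an optional `--train`, a `--classifier`, and the data directory. As a convenience, the sketch below builds these command lines from Python. The helper `build_command` is hypothetical and not part of SEMA; it only uses the flags shown above and assumes the current directory is `src`.

```python
import shlex

SCRIPT = "ToolChainClassifier/ToolChainClassifier.py"


def build_command(data_dir, classifier="wl", train=False, mode="classification"):
    """Return the argument list for one ToolChainClassifier run."""
    cmd = ["python3", SCRIPT]
    if mode != "classification":  # 'classification' is the default mode in the logs
        cmd.append(f"--mode={mode}")
    if train:
        cmd.append("--train")
    cmd.append(f"--classifier={classifier}")
    cmd.append(data_dir)
    return cmd


# Train, then test, the malware/cleanware detector from this section.
for train, folder in ((True, "TRAIN"), (False, "TEST")):
    cmd = build_command(f"../Tutorial/DATA/BIN_DATA/{folder}/", train=train, mode="detection")
    print(shlex.join(cmd))
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` to execute the run instead of printing it.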