# **SEMA-Classifier**

![SEMA_illustration.png](attachment:SEMA_illustration.png)

## **Using SEMA-Classifier**

In [1]:
%matplotlib inline
import os

# Run everything from the src/ directory so the relative paths below resolve.
os.chdir('../../src')
print(os.getcwd())
!python3 ToolChainClassifier/ToolChainClassifier.py -h

usage: ToolChainClassifier.py [-h] [--train] [--classifier CLASSIFIER]
                              [--threshold THRESHOLD]
                              [--biggest_subgraph BIGGEST_SUBGRAPH]
                              [--support SUPPORT] [--nthread NTHREAD]
                              [--verbose_classifier] [--ctimeout CTIMEOUT]
                              [--families FAMILIES [FAMILIES ...]]
                              [--mode MODE] [--epoch EPOCH]
                              binaries

Classification module arguments

optional arguments:
  -h, --help            show this help message and exit
  --biggest_subgraph BIGGEST_SUBGRAPH
                        Biggest subgraph consider for Gspan (default: 5)

Classification module arguments:
  --train               Launch training process, else classify/detect new
                        sample with previously computed model
  --classifier CLASSIFIER
                        Classifier used for the analysis am

### **Classify SCDG**
Here we classify SCDGs (system call dependency graphs) extracted from samples of several malware families. We first train an SVM classifier with a Weisfeiler-Lehman graph kernel (`--classifier=wl`):

In [2]:
!python3 ToolChainClassifier/ToolChainClassifier.py --train --classifier=wl ../Tutorial/DATA/CLASS_DATA/TRAIN/

INFO - 2022-05-12 10:03:32,111 - ToolChainClassifier - ['../Tutorial/DATA/CLASS_DATA/TRAIN/ircbot', '../Tutorial/DATA/CLASS_DATA/TRAIN/shiz', '../Tutorial/DATA/CLASS_DATA/TRAIN/bancteian', '../Tutorial/DATA/CLASS_DATA/TRAIN/sfone', '../Tutorial/DATA/CLASS_DATA/TRAIN/simbot', '../Tutorial/DATA/CLASS_DATA/TRAIN/delf', '../Tutorial/DATA/CLASS_DATA/TRAIN/sytro', '../Tutorial/DATA/CLASS_DATA/TRAIN/autoit', '../Tutorial/DATA/CLASS_DATA/TRAIN/sillyp2p', '../Tutorial/DATA/CLASS_DATA/TRAIN/wabot']
INFO - 2022-05-12 10:03:32,111 - ToolChainClassifier - Namespace(biggest_subgraph=5, binaries='../Tutorial/DATA/CLASS_DATA/TRAIN/', classifier='wl', ctimeout=3, epoch=5, families=None, mode='classification', nthread=8, support=0.75, threshold=0.45, train=True, verbose_classifier=False)
INFO - 2022-05-12 10:03:32,206 - SVMWLClassifier - Path: ../Tutorial/DATA/CLASS_DATA/TRAIN/
N/A% (0 of 10) |                         | Elapsed Time: 0:00:00 ETA:  --:--:--
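
To give an intuition for what a Weisfeiler-Lehman kernel compares, here is a minimal pure-Python sketch of the WL subtree idea: each node label is repeatedly extended with the sorted labels of its neighbours, and two graphs are scored by how many refined labels they share. This is an illustration only, not the code SEMA runs; the function names and the toy graphs (three-node chains of API call names) are assumptions made up for the example.

```python
from collections import Counter

def wl_features(adjacency, labels, iterations=2):
    """Count every node label seen during `iterations` rounds of WL refinement."""
    current = dict(labels)
    features = Counter(current.values())
    for _ in range(iterations):
        refined = {}
        for node, neighbours in adjacency.items():
            # New label = own label plus the sorted multiset of neighbour labels.
            neighbour_labels = sorted(current[n] for n in neighbours)
            refined[node] = current[node] + "(" + ",".join(neighbour_labels) + ")"
        current = refined
        features.update(current.values())
    return features

def wl_kernel(graph_a, graph_b, iterations=2):
    """Dot product of the two WL label histograms (higher means more similar)."""
    fa = wl_features(*graph_a, iterations=iterations)
    fb = wl_features(*graph_b, iterations=iterations)
    return sum(count * fb[label] for label, count in fa.items())

# Hypothetical toy graphs: a chain of three calls, as (adjacency, labels).
chain = {0: [1], 1: [0, 2], 2: [1]}
g1 = (chain, {0: "CreateFile", 1: "WriteFile", 2: "CloseHandle"})
g2 = (chain, {0: "CreateFile", 1: "WriteFile", 2: "Connect"})

same = wl_kernel(g1, g1)
different = wl_kernel(g1, g2)
print(same, different)
```

A graph scores highest against itself, and the score drops as soon as one label changes, because the change propagates to the neighbours' refined labels. An SVM can then be trained on the matrix of such pairwise scores.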

The trained model can now be used to classify new samples:

In [3]:
!python3 ToolChainClassifier/ToolChainClassifier.py --classifier=wl ../Tutorial/DATA/CLASS_DATA/TEST/

INFO - 2022-05-12 10:03:42,962 - ToolChainClassifier - ['../Tutorial/DATA/CLASS_DATA/TEST/ircbot', '../Tutorial/DATA/CLASS_DATA/TEST/shiz', '../Tutorial/DATA/CLASS_DATA/TEST/bancteian', '../Tutorial/DATA/CLASS_DATA/TEST/sfone', '../Tutorial/DATA/CLASS_DATA/TEST/simbot', '../Tutorial/DATA/CLASS_DATA/TEST/delf', '../Tutorial/DATA/CLASS_DATA/TEST/sytro', '../Tutorial/DATA/CLASS_DATA/TEST/autoit', '../Tutorial/DATA/CLASS_DATA/TEST/sillyp2p', '../Tutorial/DATA/CLASS_DATA/TEST/wabot']
INFO - 2022-05-12 10:03:42,962 - ToolChainClassifier - Namespace(biggest_subgraph=5, binaries='../Tutorial/DATA/CLASS_DATA/TEST/', classifier='wl', ctimeout=3, epoch=5, families=None, mode='classification', nthread=8, support=0.75, threshold=0.45, train=False, verbose_classifier=False)
100% (10 of 10) |########################| Elapsed Time: 0:00:00 Time:  0:00:00
INFO:SVMWLClassifier:Accuracy 97.00 %
INFO:SVMWLClassifier:Precision 97.31 %
INFO:SVMWLClassifier:Recall 97.00 %
INFO:SVM

This gives the following confusion matrix:
![Conf_matrix.png](attachment:Conf_matrix.png)
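
The accuracy, precision and recall figures logged above can be reproduced from pairs of true and predicted family labels. The sketch below assumes support-weighted averaging over families, which is an assumption about how SEMA aggregates per-family scores, although it is consistent with the log, where recall and accuracy are both 97.00 % (weighted recall always equals accuracy). The label lists are made-up toy data, not the tutorial's results.

```python
from collections import Counter

def weighted_scores(true_labels, predicted_labels):
    """Return (accuracy, precision, recall), averaging per class by class size."""
    counts = Counter(zip(true_labels, predicted_labels))  # confusion matrix cells
    support = Counter(true_labels)                        # samples per true class
    predicted_totals = Counter(predicted_labels)          # samples per predicted class
    total = len(true_labels)

    accuracy = sum(counts[(cls, cls)] for cls in support) / total
    precision = 0.0
    recall = 0.0
    for cls, n_true in support.items():
        true_positives = counts[(cls, cls)]
        n_predicted = predicted_totals[cls]
        class_precision = true_positives / n_predicted if n_predicted else 0.0
        class_recall = true_positives / n_true
        # Weight each class by its share of the test set.
        precision += class_precision * n_true / total
        recall += class_recall * n_true / total
    return accuracy, precision, recall

# Hypothetical predictions for six samples from three families.
y_true = ["shiz", "shiz", "delf", "delf", "wabot", "wabot"]
y_pred = ["shiz", "delf", "delf", "delf", "wabot", "wabot"]

accuracy, precision, recall = weighted_scores(y_true, y_pred)
print(f"Accuracy {accuracy:.2%}, precision {precision:.2%}, recall {recall:.2%}")
```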

Other classifiers follow the same train-then-classify workflow. Below we use the gSpan-based classifier (`--classifier=gspan`). Note: make sure the SEMA-quickspan submodule (/src/submodules/) is correctly installed and compiled (https://github.com/csvl/SEMA-quickspan).

In [4]:
!python3 ToolChainClassifier/ToolChainClassifier.py --train --classifier=gspan ../Tutorial/DATA/CLASS_DATA/TRAIN/

INFO - 2022-05-12 10:05:53,977 - ToolChainClassifier - ['../Tutorial/DATA/CLASS_DATA/TRAIN/ircbot', '../Tutorial/DATA/CLASS_DATA/TRAIN/shiz', '../Tutorial/DATA/CLASS_DATA/TRAIN/bancteian', '../Tutorial/DATA/CLASS_DATA/TRAIN/sfone', '../Tutorial/DATA/CLASS_DATA/TRAIN/simbot', '../Tutorial/DATA/CLASS_DATA/TRAIN/delf', '../Tutorial/DATA/CLASS_DATA/TRAIN/sytro', '../Tutorial/DATA/CLASS_DATA/TRAIN/autoit', '../Tutorial/DATA/CLASS_DATA/TRAIN/sillyp2p', '../Tutorial/DATA/CLASS_DATA/TRAIN/wabot']
INFO - 2022-05-12 10:05:53,977 - ToolChainClassifier - Namespace(biggest_subgraph=5, binaries='../Tutorial/DATA/CLASS_DATA/TRAIN/', classifier='gspan', ctimeout=3, epoch=5, families=None, mode='classification', nthread=8, support=0.75, threshold=0.45, train=True, verbose_classifier=False)
INFO - 2022-05-12 10:05:53,979 - GSpanClassifier - Input Path = ../Tutorial/DATA/CLASS_DATA/TRAIN/
N/A% (0 of 10) |                         | Elapsed Time: 0:00:00 ETA:  --:-

INFO - 2022-05-12 10:06:10,136 - GSpanClassifier - b''
INFO - 2022-05-12 10:06:10,136 - GSpanClassifier - b'I0512 10:06:07.119973 346129 quickspan.cc:92] quickspan timeout: 3\nI0512 10:06:07.121060 346129 quickspan.cc:118] quickspan read input time: 0.001018\nI0512 10:06:07.122628 346129 quickspan_execute.cc:27] quickspan construct graph time: 0.001545\nI0512 10:06:07.122658 346129 quickspan_execute.cc:54] quickspan thread 0 create\nI0512 10:06:07.122664 346129 quickspan_execute.cc:54] quickspan thread 1 create\nI0512 10:06:07.122668 346129 quickspan_execute.cc:54] quickspan thread 2 create\nI0512 10:06:07.122671 346129 quickspan_execute.cc:54] quickspan thread 3 create\nI0512 10:06:07.122676 346129 quickspan_execute.cc:54] quickspan thread 4 create\nI0512 10:06:07.122679 346129 quickspan_execute.cc:54] quickspan thread 5 create\nI0512 10:06:07.122684 346129 quickspan_execute.cc:54] quickspan thread 6 create\nI0512 10:06:07.122687 346129 quickspan_execute.cc:54]

INFO - 2022-05-12 10:06:23,747 - GSpanClassifier - Gspan_path = /home/bertrandvano/Documents/SEMA-ToolChain/src/submodules/SEMA-quickspan/build/gspan --input_file /home/bertrandvano/Documents/SEMA-ToolChain/src/ToolChainClassifier/sig/wabot_merge.gs --output_file /home/bertrandvano/Documents/SEMA-ToolChain/src/ToolChainClassifier/sig/wabot_sig.gs --pattern --biggest_subgraphs 5 --threads 8 --timeout  3 --support 0.75
INFO - 2022-05-12 10:06:27,576 - GSpanClassifier - b''
INFO - 2022-05-12 10:06:27,576 - GSpanClassifier - b'I0512 10:06:23.754863 346278 quickspan.cc:92] quickspan timeout: 3\nI0512 10:06:23.764564 346278 quickspan.cc:118] quickspan read input time: 0.009644\nI0512 10:06:23.775964 346278 quickspan_execute.cc:27] quickspan construct graph time: 0.011357\nI0512 10:06:23.775992 346278 quickspan_execute.cc:54] quickspan thread 0 create\nI0512 10:06:23.776001 346278 quickspan_execute.cc:54] quickspan thread 1 create\nI0512 10:06:23.776006 3

In [5]:
!python3 ToolChainClassifier/ToolChainClassifier.py --classifier=gspan ../Tutorial/DATA/CLASS_DATA/TEST/

INFO - 2022-05-12 10:06:29,843 - ToolChainClassifier - ['../Tutorial/DATA/CLASS_DATA/TEST/ircbot', '../Tutorial/DATA/CLASS_DATA/TEST/shiz', '../Tutorial/DATA/CLASS_DATA/TEST/bancteian', '../Tutorial/DATA/CLASS_DATA/TEST/sfone', '../Tutorial/DATA/CLASS_DATA/TEST/simbot', '../Tutorial/DATA/CLASS_DATA/TEST/delf', '../Tutorial/DATA/CLASS_DATA/TEST/sytro', '../Tutorial/DATA/CLASS_DATA/TEST/autoit', '../Tutorial/DATA/CLASS_DATA/TEST/sillyp2p', '../Tutorial/DATA/CLASS_DATA/TEST/wabot']
INFO - 2022-05-12 10:06:29,843 - ToolChainClassifier - Namespace(biggest_subgraph=5, binaries='../Tutorial/DATA/CLASS_DATA/TEST/', classifier='gspan', ctimeout=3, epoch=5, families=None, mode='classification', nthread=8, support=0.75, threshold=0.45, train=False, verbose_classifier=False)
100% (10 of 10) |########################| Elapsed Time: 0:04:41 Time:  0:04:41
INFO:GSpanClassifier:Precision obtained : 0.8976190476190476
INFO:GSpanClassifier:Recall obtained : 0.8599999999999999

This gives the following confusion matrix:
![Conf_matrix_gspan.png](attachment:Conf_matrix_gspan.png)
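
The `support=0.75` value visible in the logs above is passed to quickspan as `--support 0.75`. In frequent-subgraph mining, support usually denotes the fraction of input graphs that contain a pattern; whether SEMA-quickspan defines it in exactly this way is an assumption here. The sketch below illustrates the idea on the simplest possible patterns, single labelled edges, which is a deliberate simplification of gSpan (gSpan grows larger subgraphs from such frequent edges). The function name and toy graphs are invented for the example.

```python
from collections import Counter

def frequent_edges(graphs, min_support=0.75):
    """Return {edge: support} for edges present in at least `min_support` of the graphs."""
    counts = Counter()
    for edges in graphs:
        # A pattern counts once per graph, however often it occurs inside it.
        counts.update(set(edges))
    n_graphs = len(graphs)
    return {
        edge: count / n_graphs
        for edge, count in counts.items()
        if count / n_graphs >= min_support
    }

# Hypothetical graphs of one family, each given as a list of (source, target) call labels.
family_graphs = [
    [("CreateFile", "WriteFile"), ("WriteFile", "CloseHandle")],
    [("CreateFile", "WriteFile"), ("WriteFile", "CloseHandle"), ("socket", "connect")],
    [("CreateFile", "WriteFile"), ("WriteFile", "CloseHandle")],
    [("CreateFile", "WriteFile")],
]

signature = frequent_edges(family_graphs, min_support=0.75)
print(signature)
```

Lowering `min_support` keeps rarer patterns in the family signature, while raising it keeps only behaviour shared by nearly all samples.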

### **Detect malware from cleanware**
Finally, the same classifiers can be trained to distinguish malware from cleanware by passing `--mode=detection`.

In [6]:
!python3 ToolChainClassifier/ToolChainClassifier.py --mode=detection --train --classifier=wl ../Tutorial/DATA/BIN_DATA/TRAIN/

INFO - 2022-05-12 10:11:13,212 - ToolChainClassifier - ['../Tutorial/DATA/BIN_DATA/TRAIN/malware', '../Tutorial/DATA/BIN_DATA/TRAIN/cleanware']
INFO - 2022-05-12 10:11:13,212 - ToolChainClassifier - Namespace(biggest_subgraph=5, binaries='../Tutorial/DATA/BIN_DATA/TRAIN/', classifier='wl', ctimeout=3, epoch=5, families=None, mode='detection', nthread=8, support=0.75, threshold=0.45, train=True, verbose_classifier=False)
INFO - 2022-05-12 10:11:13,307 - SVMWLClassifier - Path: ../Tutorial/DATA/BIN_DATA/TRAIN/
N/A% (0 of 2) |                          | Elapsed Time: 0:00:00 ETA:  --:--:--
INFO - 2022-05-12 10:11:13,309 - SVMWLClassifier - Subpath: /home/bertrandvano/Documents/SEMA-ToolChain/src/../Tutorial/DATA/BIN_DATA/TRAIN/malware/
INFO - 2022-05-12 10:11:13,392 - SVMWLClassifier - Subpath: /home/bertrandvano/Documents/SEMA-ToolChain/src/../Tutorial/DATA/BIN_DATA/TRAIN/cleanware/
100% (2 of 2) |######################

In [7]:
!python3 ToolChainClassifier/ToolChainClassifier.py --mode=detection --classifier=wl ../Tutorial/DATA/BIN_DATA/TEST/

INFO - 2022-05-12 10:11:19,895 - ToolChainClassifier - ['../Tutorial/DATA/BIN_DATA/TEST/malware', '../Tutorial/DATA/BIN_DATA/TEST/cleanware']
INFO - 2022-05-12 10:11:19,895 - ToolChainClassifier - Namespace(biggest_subgraph=5, binaries='../Tutorial/DATA/BIN_DATA/TEST/', classifier='wl', ctimeout=3, epoch=5, families=None, mode='detection', nthread=8, support=0.75, threshold=0.45, train=False, verbose_classifier=False)
100% (2 of 2) |##########################| Elapsed Time: 0:00:00 Time:  0:00:00
INFO - 2022-05-12 10:11:21,003 - ToolChainClassifier - Total detection time: 1.1106009483337402


On this demo dataset, the `wl` classifier separates malware from cleanware perfectly:
![Conf_matrix_detection.png](attachment:Conf_matrix_detection.png)
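
Every experiment in this notebook follows the same pattern: one run with `--train` on a TRAIN directory, then one run without it on a TEST directory. When scripting several such runs, the command lines can be assembled in Python. The helper below is a hypothetical convenience (`build_command` is not part of SEMA); it only uses the `--mode`, `--classifier` and `--train` options shown in the cells above and executes nothing by itself.

```python
import shlex

SCRIPT = "ToolChainClassifier/ToolChainClassifier.py"

def build_command(data_dir, classifier="wl", mode="classification", train=False):
    """Return the argument list for one ToolChainClassifier run (nothing is executed)."""
    argv = ["python3", SCRIPT, f"--mode={mode}", f"--classifier={classifier}"]
    if train:
        argv.append("--train")
    argv.append(data_dir)  # positional `binaries` argument
    return argv

# Same train/test pair as the detection cells above.
train_cmd = build_command("../Tutorial/DATA/BIN_DATA/TRAIN/", mode="detection", train=True)
test_cmd = build_command("../Tutorial/DATA/BIN_DATA/TEST/", mode="detection")

print(shlex.join(train_cmd))
print(shlex.join(test_cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` from the `src/` directory, training first and testing second.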