<a href="https://colab.research.google.com/github/chandrajitpal/Cybersecurity/blob/main/cyberlabs/LAB3_CYBER.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Malware Threat Detection**

**Clustering malware with K-Means**


In the following example, we apply the K-Means clustering algorithm to our
previously created dataset of artifacts.
Recall that this dataset contains the fields extracted from the PE file format
of the individual samples, that is, the previously stored .exe files, both
legitimate and suspect.
Since there are two categories, the number of clusters assigned to the k parameter when
initializing the algorithm is 2, while the features selected as distinguishing criteria for
possible malware are the MajorLinkerVersion, MajorImageVersion,
MajorOperatingSystemVersion, and DllCharacteristics fields.
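As a minimal sketch of the feature selection step, the four fields can be picked by name rather than by position. The rows below are hypothetical stand-ins for `MajorLinkerVersion`, `MajorImageVersion`, `MajorOperatingSystemVersion`, and `DllCharacteristics` values, and the `label` column name is an assumption made only for this illustration.

```python
import pandas as pd

# Hypothetical rows standing in for the artifacts dataset (values are made up)
toy_dataset = pd.DataFrame({
    "AddressOfEntryPoint": [4096, 8192, 12288],
    "MajorLinkerVersion": [9, 6, 14],
    "MajorImageVersion": [0, 6, 10],
    "MajorOperatingSystemVersion": [4, 5, 10],
    "DllCharacteristics": [0, 33088, 49504],
    "label": [0, 1, 1],  # assumed name for the target column
})

# The four fields used as clustering features
feature_columns = [
    "MajorLinkerVersion",
    "MajorImageVersion",
    "MajorOperatingSystemVersion",
    "DllCharacteristics",
]

# Select the columns by name and convert to a NumPy array for scikit-learn
toy_samples = toy_dataset[feature_columns].values
print(toy_samples.shape)
```

Selecting by name gives the same array as positional indexing when the column order matches, but it keeps working if columns are later reordered.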

Once the fields of interest have been selected from our dataset, we can instantiate the
KMeans class of scikit-learn, passing the k value as an input parameter representing
the number of clusters, equal to 2 ( n_clusters = 2 ), and setting the maximum number
of iterations that the algorithm can execute, equal to 300 ( max_iter = 300 ) in our case:

`k_means = KMeans(n_clusters=2,max_iter=300)`

We can then invoke the fit() method on the k_means object, thus proceeding to start the
iterative algorithm process:

`k_means.fit(samples)`
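To see what `fit()` leaves behind on the estimator, here is a small sketch on hypothetical toy data (two well-separated groups of 2-D points, not the malware dataset). The `n_init` and `random_state` arguments are additions made here only to keep the toy run reproducible.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Hypothetical toy data: two well-separated groups of 2-D points
group_a = rng.normal(loc=0.0, scale=0.5, size=(50, 2))
group_b = rng.normal(loc=10.0, scale=0.5, size=(50, 2))
toy_samples = np.vstack([group_a, group_b])

# Same k and max_iter as in the text; n_init and random_state are assumptions
toy_k_means = KMeans(n_clusters=2, max_iter=300, n_init=10, random_state=0)
toy_k_means.fit(toy_samples)

# After fitting, the estimator exposes the assignments and the centroids
print(toy_k_means.labels_[:5], toy_k_means.labels_[-5:])  # cluster id per sample
print(toy_k_means.cluster_centers_)                       # one centroid per cluster
print(toy_k_means.n_iter_)                                # iterations actually run
```

The cluster ids (0 and 1) are arbitrary: K-Means has no notion of which cluster is "legitimate" or "suspect", so the mapping to real categories has to be read off afterwards.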

All that remains is to evaluate the results obtained by the algorithm. To this end, we will use the
Silhouette coefficient we introduced previously, calculated by using the Euclidean distance
as a metric, together with the confusion matrix of the results. The latter is a table that
cross-tabulates the observed categories against the assigned clusters, separating correct from incorrect assignments:

In [17]:


import pandas as pd
import numpy as np

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

import warnings
warnings.simplefilter('ignore')


from google.colab import drive
drive.mount('/content/drive')

malware_dataset = pd.read_csv('/content/drive/My Drive/PICTURES_CYBER/MalwareArtifacts.csv' , delimiter=',')

# Extract the artifact sample fields 'MajorLinkerVersion', 'MajorImageVersion', 'MajorOperatingSystemVersion', 'DllCharacteristics'
samples = malware_dataset.iloc[:, [1,2,3,4]].values
targets = malware_dataset.iloc[:, 8].values


k_means = KMeans(n_clusters=2,max_iter=300)
k_means.fit(samples)

print("K-means labels: " + str(k_means.labels_))
print("\nK-means Clustering Results:\n\n", pd.crosstab(targets, k_means.labels_, rownames=["Observed"], colnames=["Predicted"]))
print("\nSilhouette coefficient: %0.3f" % silhouette_score(samples, k_means.labels_, metric='euclidean'))


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
K-means labels: [0 0 0 ... 0 1 0]

K-means Clustering Results:

 Predicted      0      1
Observed               
0          83419  13107
1           8836  32082

Silhouette coefficient: 0.975


The clustering algorithm assigned a cluster label to each individual sample, and the
confusion matrix shows that 83,419 samples (out of a total of 96,526)
belonging to the suspect category were correctly identified (having been grouped
under label 0), while 13,107 (13.58% of that category) were mistakenly considered
legitimate.
In the same way, 8,836 samples (out of a total of 40,918) were grouped as suspect
(equal to 21.59% of that category) despite being truly legitimate, compared to 32,082
samples correctly grouped as legitimate.
The Silhouette coefficient is equal to 0.975, which is very close to 1, indicating
compact, well-separated clusters.
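The percentages quoted for the confusion matrix can be recomputed directly from the counts printed in the cell output. The sketch below only does that arithmetic; the variable names are illustrative.

```python
import numpy as np

# Counts copied from the crosstab output (rows: observed 0/1, columns: cluster 0/1)
confusion = np.array([
    [83419, 13107],   # observed 0 (suspect)
    [8836, 32082],    # observed 1 (legitimate)
])

row_totals = confusion.sum(axis=1)

# Share of each observed category that landed in the "wrong" cluster
suspect_missed_pct = 100.0 * confusion[0, 1] / row_totals[0]
legit_flagged_pct = 100.0 * confusion[1, 0] / row_totals[1]

# Overall agreement between observed categories and clusters (diagonal / total)
agreement_pct = 100.0 * np.trace(confusion) / confusion.sum()

print(row_totals)                    # [96526 40918]
print(round(suspect_missed_pct, 2))  # 13.58
print(round(legit_flagged_pct, 2))   # 21.59
print(round(agreement_pct, 1))       # 84.0
```

Reading the diagonal as "correct" assumes that cluster 0 corresponds to the suspect category and cluster 1 to the legitimate one; since K-Means cluster ids are arbitrary, a different run could swap them.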

**Detecting malware with decision trees**

In [9]:


import pandas as pd
import numpy as np
from sklearn.metrics import accuracy_score
import warnings
warnings.simplefilter('ignore')

from google.colab import drive
drive.mount('/content/drive')

malware_dataset = pd.read_csv('/content/drive/My Drive/PICTURES_CYBER/MalwareArtifacts.csv', delimiter=',')


# Extract the artifact sample fields "AddressOfEntryPoint" and "DllCharacteristics"
samples = malware_dataset.iloc[:, [0, 4]].values
targets = malware_dataset.iloc[:, 8].values

from sklearn.model_selection import train_test_split

training_samples, testing_samples, training_targets, testing_targets = train_test_split(
         samples, targets, test_size=0.2, random_state=0)


from sklearn import tree
tree_classifier = tree.DecisionTreeClassifier()

tree_classifier.fit(training_samples, training_targets)



predictions = tree_classifier.predict(testing_samples)

accuracy = 100.0 * accuracy_score(testing_targets, predictions)
print ("Decision Tree accuracy: " + str(accuracy))



Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Decision Tree accuracy: 96.27487358579796


**Random Forest Malware Classifier**

In [10]:


import pandas as pd
import numpy as np
from sklearn import ensemble

import warnings
warnings.simplefilter('ignore')


from google.colab import drive
drive.mount('/content/drive')

malware_dataset = pd.read_csv('/content/drive/My Drive/PICTURES_CYBER/MalwareArtifacts.csv', delimiter=',')

# Extract the artifact sample fields "AddressOfEntryPoint" and "DllCharacteristics"
samples = malware_dataset.iloc[:, [0,4]].values
targets = malware_dataset.iloc[:, 8].values

from sklearn.model_selection import train_test_split

training_samples, testing_samples, training_targets, testing_targets = train_test_split(samples, targets, test_size=0.2)

rfc =  ensemble.RandomForestClassifier(n_estimators=50)
rfc.fit(training_samples, training_targets)
accuracy = rfc.score(testing_samples, testing_targets)

print("Random Forest Classifier accuracy: " + str(accuracy*100) )



Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Random Forest Classifier accuracy: 96.62046636836553


**Detecting metamorphic malware with HMMs**

In [16]:
!pip install hidden_markov



In our example, the possible observations are as follows:

`ob_types = ('W', 'N')`

Here, W stands for Working and N for Not Working, while the hidden states are as follows:

`states = ('L', 'M')`

Here, M corresponds to Malicious and L corresponds to Legitimate.

Next comes the sequence of observations, each of which is associated with a single instruction executed by the program:

`observations = ('W', 'W', 'W', 'N')`

This sequence of observations tells us that after the execution of the first three instructions
of the program, the machine worked properly, while it stopped working only after
executing the fourth instruction.

On the basis of this sequence of observable events, we can now set up the
HMM. To this end, we will pass our probability matrices (as defined previously) to the
algorithm, starting with the start matrix:

`start = np.matrix('0.1 0.9')`

*The transition matrix is as follows:*

`transition = np.matrix('0.7 0.3 ; 0.1 0.9')`

*The emission matrix is as follows:*

`emission = np.matrix('0.2 0.8 ; 0.4 0.6')`
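Before handing these matrices to an HMM, it can help to check that each one is a valid probability distribution. The sketch below uses plain NumPy arrays and assumes that rows follow the order of `states` (L, M) and that the emission columns follow the order of `ob_types` (W, N).

```python
import numpy as np

# Same values as above, as plain arrays (assumed order: rows L, M; emission columns W, N)
start = np.array([0.1, 0.9])
transition = np.array([[0.7, 0.3],
                       [0.1, 0.9]])
emission = np.array([[0.2, 0.8],
                     [0.4, 0.6]])

# The start vector and every row of the other two matrices must sum to 1
start_ok = bool(np.isclose(start.sum(), 1.0))
transition_ok = bool(np.allclose(transition.sum(axis=1), 1.0))
emission_ok = bool(np.allclose(emission.sum(axis=1), 1.0))

print(start_ok, transition_ok, emission_ok)  # True True True
```

Under that reading, `transition[1, 1] = 0.9` says a Malicious state tends to stay Malicious, and `emission[0, 1] = 0.8` says a Legitimate state emits N with probability 0.8.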

In [15]:
import numpy as np
from hidden_markov import hmm

ob_types = ('W','N' )

states = ('L', 'M')

observations = ('W','W','W','N')

start = np.matrix('0.1 0.9')
transition = np.matrix('0.7 0.3 ; 0.1 0.9')
emission = np.matrix('0.2 0.8 ; 0.4 0.6')

_hmm = hmm(states,ob_types,start,transition,emission)

print("Forward algorithm: ")
print ( _hmm.forward_algo(observations) )

print("\nViterbi algorithm: ")
print( _hmm.viterbi(observations) )

Forward algorithm: 
0.033196

Viterbi algorithm: 
['M', 'M', 'M', 'M']


The Forward algorithm gives us the probability of the observed sequence under the HMM, while
the Viterbi algorithm finds the most likely sequence of hidden states that could have
generated the given observations.
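To make the two algorithms concrete, here is a minimal NumPy sketch of both, written independently of the `hidden_markov` package (it is not that library's implementation). It assumes the same ordering as above: rows follow `states`, emission columns follow `ob_types`. The function names are illustrative.

```python
import numpy as np

states = ('L', 'M')
ob_types = ('W', 'N')
observations = ('W', 'W', 'W', 'N')

start = np.array([0.1, 0.9])
transition = np.array([[0.7, 0.3], [0.1, 0.9]])
emission = np.array([[0.2, 0.8], [0.4, 0.6]])

# Map each observation symbol to its column index in the emission matrix
obs_idx = [ob_types.index(o) for o in observations]

def forward_probability(obs_idx):
    # alpha[i] = P(observations so far, current state = i)
    alpha = start * emission[:, obs_idx[0]]
    for o in obs_idx[1:]:
        alpha = (alpha @ transition) * emission[:, o]
    return float(alpha.sum())

def viterbi_path(obs_idx):
    # delta[i] = probability of the best state path ending in state i
    delta = start * emission[:, obs_idx[0]]
    backpointers = []
    for o in obs_idx[1:]:
        scores = delta[:, None] * transition      # scores[i, j]: best path into i, then i -> j
        backpointers.append(scores.argmax(axis=0))
        delta = scores.max(axis=0) * emission[:, o]
    # Backtrack from the most probable final state
    best = int(delta.argmax())
    path = [best]
    for bp in reversed(backpointers):
        best = int(bp[best])
        path.append(best)
    return [states[i] for i in reversed(path)]

sequence_probability = forward_probability(obs_idx)
most_likely_states = viterbi_path(obs_idx)
print(round(sequence_probability, 6))  # 0.033196
print(most_likely_states)              # ['M', 'M', 'M', 'M']
```

Both values match the output reported above, which supports the assumed row and column ordering. The only difference between the two recursions is that the Forward algorithm sums over predecessor states while Viterbi takes the maximum and remembers which predecessor achieved it.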