# HW-5: Malware Classification (Due 5th January, 2023)

**Instructions:**

Suppose your company has been struggling with a series of computer virus attacks for the past several months. With some effort, the viruses were grouped into a few types. However, it takes a long time to determine which type of virus is involved each time the company is hit. Thus, as a senior member of the IT department, you have undertaken a project to classify the virus as quickly as possible. You have been given a dataset of features that may (or may not) be handy, along with the associated virus type (the target variable).

You are supposed to try different classification methods and apply the best practices we have seen in the lectures, such as grid search, cross-validation, and regularization. To increase your grade, you can elaborate further, for example by using ensembling or by exploiting feature selection/extraction techniques. **An evaluation rubric is provided.**

Please prepare a Python notebook that describes the steps and presents the results along with your comments.

You can download the data (csv file) [here](https://drive.google.com/file/d/1yxbibzUU8bjOyChDVFPfQ4viLduYdk29/view?usp=sharing).


In [1]:
import numpy as np
import pandas as pd
from sklearn import datasets
from matplotlib import pyplot
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
df = pd.read_csv("hw5_data.csv")

# K-fold cross-validation
X = df.drop(columns='target')   # feature matrix
y = df['target']                # virus type labels
# Hold out 20% of the data as a final test set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=22)
tree = DecisionTreeClassifier(max_depth=4)
# cross_val_score clones and fits the tree on each fold, so no separate fit is needed here.
scores = cross_val_score(tree, X_train, y_train, cv=5)
print(scores)
print("Avg Acc: " , scores.mean() , " std dev: ", scores.std(), " max: ", scores.max())

[0.7475    0.7625    0.7175    0.76125   0.7146433]
Avg Acc:  0.7406786608260326  std dev:  0.020789283887866186  max:  0.7625
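
The cross-validation step above can also be run with an explicit fold splitter, which makes the shuffling and the random seed visible. The following is a minimal sketch under stated assumptions: the data is a synthetic stand-in generated with `make_classification` (not `hw5_data.csv`), and the names `Xk`, `yk`, `folds`, and `fold_scores` are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3-class data (assumption: stands in for the homework dataset).
Xk, yk = make_classification(n_samples=400, n_features=10, n_informative=5,
                             n_classes=3, random_state=22)

# Stratified folds keep the class proportions roughly equal in every fold.
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=22)
fold_scores = cross_val_score(DecisionTreeClassifier(max_depth=4, random_state=22),
                              Xk, yk, cv=folds)

print("Per-fold accuracy:", fold_scores)
print("Mean:", fold_scores.mean(), "Std:", fold_scores.std())
```

Fixing `random_state` on both the splitter and the tree should make repeated runs comparable, which helps when judging whether a change in settings actually improved the score.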


In [None]:
#HyperParameterSearch 
param_grid = [
  {'min_samples_leaf': [1, 5, 10, 20],
   'max_depth': [3, 5, 9, 15],
   'criterion': ['gini', 'entropy']},
 ]


clf = GridSearchCV(tree, param_grid, cv=5)   # 5-fold CV over every parameter combination
clf.fit(X_train, y_train)                    # search on the training split, then refit the best model on it
print('Test set score after CV: ', accuracy_score(y_test, clf.predict(X_test)))
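
The grid search above reports only the test accuracy. It may also help to see which parameter combination won and what its cross-validated score was. Here is a small sketch, assuming synthetic data from `make_classification` in place of the homework file and a reduced grid; `demo_grid`, `demo_search`, and the `Xd_*`/`yd_*` names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the homework dataset).
X_demo, y_demo = make_classification(n_samples=400, n_features=10, n_informative=5,
                                     n_classes=3, random_state=22)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=22)

# A smaller grid than the one in the notebook, to keep the sketch fast.
demo_grid = {'min_samples_leaf': [1, 5, 10], 'max_depth': [3, 5, 9]}
demo_search = GridSearchCV(DecisionTreeClassifier(random_state=22), demo_grid, cv=5)
demo_search.fit(Xd_train, yd_train)

# best_params_ and best_score_ come from cross-validation on the training split;
# score() evaluates the refit best model on the held-out test split.
print("Best parameters:", demo_search.best_params_)
print("Best mean CV accuracy:", demo_search.best_score_)
print("Test accuracy of refit model:", demo_search.score(Xd_test, yd_test))
```

Comparing `best_score_` with the test accuracy gives a rough indication of whether the selected settings generalize beyond the folds they were tuned on.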


In [None]:
#Ensemble
import time
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import RandomForestClassifier
# First, compare the 3 model types (each is fit and scored separately; no voting happens here)
classifiers = [ DecisionTreeClassifier(), 
               BaggingClassifier(DecisionTreeClassifier(), n_estimators=100, max_samples=0.5, oob_score=True) , 
               RandomForestClassifier(n_estimators=100, oob_score=True)]
names = ["DecisionTree", "Bagged Trees", "RandomForest"]



for myModel, m_name in zip(classifiers, names):
    t = time.process_time()
    myModel.fit(X_train, y_train)
    fit_time = time.process_time() - t
    print("\t", m_name, accuracy_score(y_test, myModel.predict(X_test)), "\ttime:", fit_time)
    if hasattr(myModel, "oob_score_"):
        print("\t\tOOB:", myModel.oob_score_)
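
The loop above scores each model on its own. If an actual vote among the three model types is wanted, scikit-learn's `VotingClassifier` is one way to do it. The sketch below is illustrative only: it uses synthetic data rather than the homework file, smaller ensembles (25 estimators) to keep it quick, and hypothetical names such as `voter` and `vote_acc`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, VotingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the homework dataset).
Xv, yv = make_classification(n_samples=400, n_features=10, n_informative=5,
                             n_classes=3, random_state=22)
Xv_train, Xv_test, yv_train, yv_test = train_test_split(
    Xv, yv, test_size=0.2, random_state=22)

# Hard voting: each fitted model predicts a class and the majority class wins.
voter = VotingClassifier(
    estimators=[
        ("tree", DecisionTreeClassifier(random_state=0)),
        ("bag", BaggingClassifier(DecisionTreeClassifier(), n_estimators=25, random_state=0)),
        ("rf", RandomForestClassifier(n_estimators=25, random_state=0)),
    ],
    voting="hard",
)
voter.fit(Xv_train, yv_train)
vote_acc = accuracy_score(yv_test, voter.predict(Xv_test))
print("Voting ensemble accuracy:", vote_acc)
```

Since all three members are tree-based, their errors are likely to be correlated, so the vote may not improve much over the random forest alone; mixing in a different model family is a common variation.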

In [2]:
#Filter Method
from sklearn.feature_selection import chi2
from sklearn.feature_selection import mutual_info_classif
from sklearn.feature_selection import GenericUnivariateSelect
#from sklearn.feature_selection import SelectKBest
mutual_info = mutual_info_classif(X, y)     # per-feature mutual information with the target (not used below)
mfSelected = GenericUnivariateSelect(score_func=mutual_info_classif, mode="k_best", param=20)
mfSelected.fit(X, y)                        # score every feature on the full dataset
X_mf = mfSelected.transform(X)              # keep only the 20 highest-scoring features
lr = LogisticRegression(solver='liblinear')
print("Logistic Regression Mutual Info for k=20 ---> ", cross_val_score(lr, X_mf, y, cv=5).mean())

Logistic Regression Mutual Info for k=20 --->  0.41348208208208204
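
The cell above selects features on the full dataset before cross-validating, so each validation fold has already influenced which features were kept. One way to avoid this is to put the selector and the classifier in a pipeline, so selection is redone inside every training fold. This is a sketch under assumptions: the data is synthetic, and the default `lbfgs` solver with `max_iter=1000` is swapped in for `liblinear`; `pipe` and `pipe_scores` are illustrative names.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import GenericUnivariateSelect, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic stand-in data with 30 features (assumption: not the homework dataset).
Xp, yp = make_classification(n_samples=300, n_features=30, n_informative=6,
                             n_classes=3, random_state=22)

# The selector is fit only on the training part of each fold,
# so the held-out part never affects which 20 features are chosen.
pipe = make_pipeline(
    GenericUnivariateSelect(score_func=mutual_info_classif, mode="k_best", param=20),
    LogisticRegression(max_iter=1000),
)
pipe_scores = cross_val_score(pipe, Xp, yp, cv=5)
print("Pipeline CV accuracy (k=20):", pipe_scores.mean())
```

The same pipeline object can be passed to `GridSearchCV` if the number of selected features should be tuned together with the classifier settings.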
