# Final Demo - Trojan File Modeling and Analysis
    Outline: 
        1) Our Feature Extractor - Analyze the makeup of a file without executing it on the system
            - We will show the Feature Extractor on a benign file, a Trojan file, and a non-Trojan malicious file
        
        2) Using our models (Random Forest Classifiers)
            - Show how we can load in our pre-made models to identify the files live
            
        3) Show a decision tree visual from our specialized Trojan Classifier
            - How does our model identify a Trojan attack?
        
        4) Feature Importance Breakdown 
            - Which features matter most when identifying a Trojan attack
            - Map those features to the demo files
            - Explain the context of these features (why are they important)

# Using our Feature Extractor Function
    Demonstrate how we can parse a file for these key attributes 

    All files used (and their labels) come from the Dike Project: https://github.com/iosifache/DikeDataset
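
As a rough illustration of what analyzing a file "without executing it" can look like, the sketch below reads raw bytes and derives a few simple static attributes: the size, whether the data starts with the `MZ` magic bytes of a Windows executable, and the Shannon entropy of the contents. This is not the project's `FeatureExtractor`; the helper names `shannon_entropy` and `static_summary` and the choice of attributes are assumptions made for illustration only.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Return the Shannon entropy of the bytes, in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    counts = Counter(data)
    total = len(data)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def static_summary(data: bytes) -> dict:
    """Summarize raw file bytes without running the file (hypothetical helper)."""
    return {
        "size_bytes": len(data),
        # Windows executables begin with the DOS header magic b"MZ"
        "has_mz_header": data[:2] == b"MZ",
        "entropy": shannon_entropy(data),
    }


# Usage with in-memory bytes; for a real file, read it with open(path, "rb").read()
sample = b"MZ" + bytes(range(256))
print(static_summary(sample))
```

Entropy is a commonly used static hint because packed or encrypted content tends to score close to 8 bits per byte, but on its own it does not say whether a file is malicious.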

In [1]:
benign_file_path = r"C:\Users\caleb\PycharmProjects\security_project\security_project\Data\benign.exe"
trojan_file_path = r"C:\Users\caleb\PycharmProjects\security_project\security_project\Data\trojan.exe"
malicious_file_path = r"C:\Users\caleb\PycharmProjects\security_project\security_project\Data\malicious.exe"

In [2]:
all_files = [benign_file_path, trojan_file_path, malicious_file_path]

In [3]:
# Import our FeatureExtractor Library
from FeatureExtractor.featureEx import FeatureExtractor
import pandas as pd
import re

ModuleNotFoundError: No module named 'FeatureExtractor'

    Run the Feature Extractor on the 3 example files

In [None]:
from pathlib import PureWindowsPath

feature_frames = []
for file in all_files:
    print(file)
    # Instead of ..\Data\foo.exe we need foo (the file name without its extension)
    hash_name = PureWindowsPath(file).stem
    # Use FeatureExtractor
    newFeatureExtractor = FeatureExtractor(path=file)
    attributesToTest_df = newFeatureExtractor.getFileFeatures()
    attributesToTest_df["hash"] = hash_name
    # Collect the per-file features (assumes getFileFeatures returns a DataFrame);
    # DataFrame.append was removed in pandas 2.0, so we concatenate once at the end
    feature_frames.append(attributesToTest_df)

# Combine all per-file features into the master DF
master_df = pd.concat(feature_frames, ignore_index=True)

In [None]:
master_df.shape

        Perform some minor cleanup before modeling

In [None]:
master_df = master_df.drop(columns=["ID","md5","VersionInformationSize","legitimate"])

In [None]:
master_df.head()

# Using our Deployable Models
    We trained and saved two Random Forest Classifier models, which we demonstrate here

In [None]:
# Load in our pre-made model
# Note: Our Model is a RandomForestClassifier from the sklearn library
import pickle

with open('master_model', 'rb') as f:
    model = pickle.load(f)
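
For reference, the saving side of this workflow might look like the sketch below. It trains a small `RandomForestClassifier` on synthetic data, writes it with `pickle.dump`, and loads it back with `pickle.load`. The synthetic data, the file name `example_model`, and the hyperparameters are assumptions for illustration; they are not the settings used to produce `master_model`.

```python
import os
import pickle
import tempfile

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption: the real training set is not shown here)
X, y = make_classification(n_samples=100, n_features=5, random_state=0)
model = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "example_model")  # hypothetical file name
    # Save the fitted model to disk
    with open(model_path, "wb") as f:
        pickle.dump(model, f)
    # Load it back, mirroring how master_model is loaded
    with open(model_path, "rb") as f:
        restored = pickle.load(f)

# The restored model should reproduce the original model's predictions
same_predictions = (restored.predict(X) == model.predict(X)).all()
print(same_predictions)
```

Only unpickle files from a source you trust, since `pickle.load` can execute arbitrary code, and where possible load a model with the same scikit-learn version that saved it.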

In [None]:
# Use the model
# Here we predict - Benign, Trojan, or Generic (non-Trojan Attack)
predictions = model.predict(master_df.drop(columns=["hash"]))

In [None]:
predictions

In [None]:
master_df["prediction"] = predictions

In [None]:
master_df.head()

# Explain how we Identify Trojan Attacks 
    We can predict a Trojan attack, but what gives it away?
    This is one of the key takeaways from this project

## Show Decision Tree
    This is a single tree from the forest, to give an idea of the model's complexity

In [None]:
from IPython.display import SVG
SVG(filename="single_trojan_tree.svg")
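
To connect a single tree to the full forest: scikit-learn's `RandomForestClassifier` predicts by averaging the class probabilities produced by its individual trees and choosing the class with the highest mean. The sketch below shows that averaging step in plain Python. The per-tree probabilities and class labels are made-up values for illustration, not output from our model, and `average_vote` is a hypothetical helper.

```python
def average_vote(tree_probabilities, class_labels):
    """Average per-tree class probabilities and return (label, mean_probs)."""
    n_trees = len(tree_probabilities)
    mean_probs = [
        sum(tree[i] for tree in tree_probabilities) / n_trees
        for i in range(len(class_labels))
    ]
    # Pick the class whose averaged probability is highest
    best_index = max(range(len(class_labels)), key=lambda i: mean_probs[i])
    return class_labels[best_index], mean_probs


# Illustrative values only: three trees, each giving one probability per class
labels = ["Benign", "Trojan", "Generic"]
per_tree = [
    [0.1, 0.8, 0.1],
    [0.3, 0.4, 0.3],
    [0.6, 0.3, 0.1],
]
label, mean_probs = average_vote(per_tree, labels)
print(label, mean_probs)
```

Note that the third tree on its own favors a different class than the averaged result, which is why a single tree is only a partial picture of how the forest decides.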

## Show Feature Importance 
    We can open up the "forest" and see which features carry the most weight

In [None]:
# Here we are loading in the feature importance of our Trojan focused classifier
# We opened up the model to see which attributes carry the heaviest weight
trojan_feature_importances_df = pd.read_pickle("trojan_feature_importance.p")

In [None]:
trojan_feature_importances_df.head(10)
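
The table loaded above was prepared ahead of time. As a sketch of how such a table can be derived, a fitted scikit-learn random forest exposes a `feature_importances_` array (impurity-based importances that sum to 1), which can be paired with column names and sorted. The synthetic data and the placeholder names `feature_0` through `feature_5` are assumptions; this is not necessarily how `trojan_feature_importance.p` was produced.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with placeholder column names (assumptions)
X, y = make_classification(
    n_samples=200, n_features=6, n_informative=3, n_redundant=0, random_state=0
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Pair each column with its importance and rank from most to least important
importances_df = (
    pd.DataFrame({"feature": feature_names, "importance": forest.feature_importances_})
    .sort_values("importance", ascending=False)
    .reset_index(drop=True)
)
print(importances_df.head(10))
```

Impurity-based importances can favor features with many distinct values, so the ranking is best read as a guide to what the model relies on rather than as proof of what causes a Trojan label.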

## Put Context to the Key Features
    Why are these important in identifying a Trojan Attack?
    Relate to other papers, or knowledge in the security field

    TODO: Explain the table above