In [1]:
import os

from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.pipeline import Pipeline

We begin by importing os to read the files and the scikit-learn components needed to set up
a machine learning pipeline (Step 1). In Steps 2 and 3, we collect the non-obfuscated and
obfuscated JavaScript files into a list and assign each file its label (0 for non-obfuscated,
1 for obfuscated).

In [2]:
js_path = "../input/obfuscated-javascript-dataset/JavascriptSamplesNotObfuscated/JavascriptSamples"
obfuscated_js_path = "../input/obfuscated-javascript-dataset/JavascriptSamplesObfuscated/JavascriptSamplesObfuscated"

In [3]:
corpus = []
labels = []
file_types_and_labels = [(js_path,0), (obfuscated_js_path, 1)]

In [4]:
for files_path, label in file_types_and_labels:
    for file_name in os.listdir(files_path):
        file_path = os.path.join(files_path, file_name)
        try:
            with open(file_path, "r") as myfile:
                # Collapse each file into a single line of text.
                data = myfile.read().replace("\n", "")
            corpus.append(data)
            labels.append(label)
        except Exception as e:
            # Skip files that cannot be read or decoded, but report why.
            print(e)

In [5]:
len(corpus), len(labels)

(3375, 3375)

Note that the main challenge in building this classifier is assembling a large and
representative dataset. One way to address this is to collect a large number of JavaScript
samples and then obfuscate them with several different tools.
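
As a rough illustration of that idea, the sketch below pairs each clean sample with an obfuscated copy and labels the two accordingly. The `toy_obfuscate` function is a hypothetical stand-in written for this example (it simply hex-escapes the source and wraps it in `eval`); a real dataset would rely on actual obfuscation tools, and the two sample strings are made up.

```python
def toy_obfuscate(js_source):
    """Hypothetical toy obfuscator: hex-escape every character and wrap in eval()."""
    if not js_source.isascii():
        # \xNN escapes only cover single bytes, so this toy version assumes ASCII.
        raise ValueError("toy example only handles ASCII input")
    escaped = "".join(f"\\x{ord(ch):02x}" for ch in js_source)
    return 'eval("' + escaped + '")'


# Made-up clean samples, used only for illustration.
clean_samples = ["var a = 1;", "console.log('hi');"]

toy_corpus, toy_labels = [], []
for sample in clean_samples:
    # Label 0: original (non-obfuscated) source.
    toy_corpus.append(sample)
    toy_labels.append(0)
    # Label 1: obfuscated copy of the same source.
    toy_corpus.append(toy_obfuscate(sample))
    toy_labels.append(1)

for text, label in zip(toy_corpus, toy_labels):
    print(label, text)
```

Running several different obfuscators over the same clean samples would give the classifier more varied positive examples than a single transformation like this one.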

In [6]:
X_train, X_test, y_train, y_test = train_test_split(
    corpus, labels, test_size=0.33, random_state=42
)

text_clf = Pipeline(
    [
        ("vect", HashingVectorizer(input="content", ngram_range=(1, 3))),
        ("tfidf", TfidfTransformer(use_idf=True)),
        ("rf", RandomForestClassifier(class_weight="balanced")),
    ]
)


Having collected the data, we split it into training and testing subsets (Step 4). We also
set up a pipeline that applies NLP methods to the JavaScript code itself (hashed word n-grams
followed by TF-IDF weighting) and then trains a random forest classifier.
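
To make the feature-extraction stage more concrete, the following sketch runs the same two transformers on two short snippets. The snippets are made up for illustration and are not taken from the dataset; the vectorizer settings mirror the pipeline above.

```python
from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer

# Two made-up snippets, used only for illustration (not taken from the dataset).
samples = [
    "var total = price + tax;",
    "var _0x1a2b = ['\\x68\\x69']; eval(_0x1a2b[0]);",
]

vect = HashingVectorizer(input="content", ngram_range=(1, 3))

# The analyzer shows which word 1-, 2-, and 3-grams are extracted before hashing.
ngrams = vect.build_analyzer()(samples[0])
print(ngrams)

# HashingVectorizer learns no vocabulary: each n-gram is hashed to a column index,
# so the output width is fixed by n_features (2**20 by default).
hashed = vect.fit_transform(samples)
print(hashed.shape)

# TfidfTransformer reweights the hashed counts by inverse document frequency.
tfidf = TfidfTransformer(use_idf=True).fit_transform(hashed)
print(tfidf.shape)
```

Because the default tokenizer keeps only runs of two or more word characters, operators and punctuation are dropped, and the features reflect identifier and keyword sequences rather than raw syntax.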

Finally, we fit the pipeline on the training set and measure its performance on the test set:

In [7]:
text_clf.fit(X_train, y_train)
y_test_pred = text_clf.predict(X_test)

print(accuracy_score(y_test, y_test_pred))
print(confusion_matrix(y_test, y_test_pred))

0.9712746858168761
[[619  16]
 [ 16 463]]
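
The confusion matrix carries more detail than the accuracy figure alone. As a small sketch, the numbers printed above can be unpacked into per-class metrics by hand, assuming scikit-learn's convention of true labels in rows and predicted labels in columns, with class 0 (non-obfuscated) first:

```python
# Confusion matrix as printed above; rows are true labels, columns are
# predicted labels, and classes are ordered 0 (clean), 1 (obfuscated).
cm = [[619, 16],
      [16, 463]]
tn, fp = cm[0]
fn, tp = cm[1]

total = tn + fp + fn + tp
accuracy = (tn + tp) / total

# Metrics for the obfuscated class (label 1).
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"test samples: {total}")
print(f"accuracy:  {accuracy:.4f}")
print(f"precision: {precision:.4f}")
print(f"recall:    {recall:.4f}")
print(f"f1:        {f1:.4f}")
```

Here the 16 false positives and 16 false negatives happen to be equal, so precision and recall for the obfuscated class coincide at 463/479, or roughly 0.967.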
