# Making Synthetic TI reports

The goal here is to fine-tune/train a language model on Threat Intelligence reports, and then use that model to generate new, fake, TI reports. I'm doing this mostly because I think it will be funny.

For this experiment, I'm using a very basic LSTM rather than any of the fancier transformer models such as GPT (walk before you run and all that).

In any case, the process here is:
 * collect a ton of TI reports. This is what https://github.com/g-clef/ThreatIntelCollector is for.
 * parse them all to extract the text (that's what the prefect job in this repo is for)
 * make a fastai DataLoaders object out of them
 * instantiate a language learner
 * train it
 * predict text
 
The process should be fairly straightforward, though I imagine the results are going to be wonky at best. Ways to improve it that I can see right off the top:
 * replace the vendor URLs at the bottom of every page
 * parse out other URLs/hashes/DNS names/file paths in the report?
 * try GPT-2 (will it fit on my video card?)
 * try other fancier transformer libraries
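
As a rough idea of what the first two improvements could look like, here is a minimal sketch that swaps URLs and hash-like strings for placeholder tokens before training. The function name `scrub_indicators` and the placeholder tokens `xxurl` and `xxhash` are made up for illustration; they are not fastai special tokens, and the regexes are deliberately simple, so they will miss defanged or oddly formatted indicators.

```python
import re

# Hypothetical placeholder tokens; these names are assumptions, not fastai built-ins.
URL_TOKEN = "xxurl"
HASH_TOKEN = "xxhash"

# Anything that starts with http:// or https:// up to the next whitespace.
URL_RE = re.compile(r"https?://\S+")
# Hex strings with the length of a SHA-256, SHA-1, or MD5 digest.
HASH_RE = re.compile(r"\b(?:[a-fA-F0-9]{64}|[a-fA-F0-9]{40}|[a-fA-F0-9]{32})\b")


def scrub_indicators(text: str) -> str:
    """Replace URLs and hash-like strings with placeholder tokens."""
    text = URL_RE.sub(URL_TOKEN, text)
    text = HASH_RE.sub(HASH_TOKEN, text)
    return text


sample = "see http://blog.ensilo.com/atombombing and 85321dee31100bd3ece5b586ac3e6557"
print(scrub_indicators(sample))
```

The idea is that the model would then learn "a URL goes here" rather than memorizing specific vendor links. DNS names and file paths would need their own patterns.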
 
For now, though, let's just do the default/easy one.

In [10]:
from fastai.text.all import *
from fastai import *

So, first I'm going to pull in the data from the prefect job (I copy the relevant files to a local drive on my ML rig so that I'm not paying a network round-trip price for every file), then make a fastai DataLoaders object out of that data, setting aside 10% as a validation set. Note: this will also do some standard tokenization of the text (using spaCy, I believe).

In [11]:
base_dir = Path("/home/g-clef/local_ml_data_copy/ti-reports/bec011c7-0344-49dd-8daa-891f62a8c8f2")

In [19]:
dls_lm = TextDataLoaders.from_folder(base_dir, is_lm=True, valid_pct=0.1)

In [20]:
dls_lm.show_batch(max_n=5)

Unnamed: 0,text,text_
0,xxbos 12 / 5 / 2019 xxmaj obfuscation xxmaj tools xxmaj found in the xxmaj capesand xxmaj exploit xxmaj kit xxmaj possibly xxmaj used in “ kurdishcoder ” xxmaj campaign - trendlabs xxmaj security xxmaj intelligence xxmaj blog \n\n▁ xxmaj trend xxmaj micro \n▁ xxmaj about trendlabs xxmaj security xxmaj intelligence xxmaj blog \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n▁ xxmaj search : \n\n▁ xxmaj go to … \n\n▁ homecategories \n\n\n▁ xxmaj home » xxmaj exploits » xxmaj,12 / 5 / 2019 xxmaj obfuscation xxmaj tools xxmaj found in the xxmaj capesand xxmaj exploit xxmaj kit xxmaj possibly xxmaj used in “ kurdishcoder ” xxmaj campaign - trendlabs xxmaj security xxmaj intelligence xxmaj blog \n\n▁ xxmaj trend xxmaj micro \n▁ xxmaj about trendlabs xxmaj security xxmaj intelligence xxmaj blog \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n▁ xxmaj search : \n\n▁ xxmaj go to … \n\n▁ homecategories \n\n\n▁ xxmaj home » xxmaj exploits » xxmaj obfuscation
1,"all reported on widespread campaigns xxmaj education , government , financial services , energy , \n▁ similar to the activity we describe in this and the entertainment industries appear to be \n▁ report . xxmaj in xxmaj august 2014 xxmaj kaspersky described the most affected . xxmaj figure 5 depicts the share \n▁ the “ epic xxmaj turla ” xxunk while in xxmaj january of detection alerts for xxup witchcoven among \n▁","reported on widespread campaigns xxmaj education , government , financial services , energy , \n▁ similar to the activity we describe in this and the entertainment industries appear to be \n▁ report . xxmaj in xxmaj august 2014 xxmaj kaspersky described the most affected . xxmaj figure 5 depicts the share \n▁ the “ epic xxmaj turla ” xxunk while in xxmaj january of detection alerts for xxup witchcoven among \n▁ 2015"
2,"than 100 referenced sources for known as xxup apt28 ) , discovered [ 44 ] by xxup eset and the first xxup uefi rootkit found in the wild . \n\n▁ the xxmaj mobile matrix . \n▁ xxmaj in the xxmaj groups category , one of eset ’s contributions is xxmaj machete ( xxunk ) [ 45 ] , a xxunk \n\n▁ onage group with high - profile targets in xxmaj latin xxmaj","100 referenced sources for known as xxup apt28 ) , discovered [ 44 ] by xxup eset and the first xxup uefi rootkit found in the wild . \n\n▁ the xxmaj mobile matrix . \n▁ xxmaj in the xxmaj groups category , one of eset ’s contributions is xxmaj machete ( xxunk ) [ 45 ] , a xxunk \n\n▁ onage group with high - profile targets in xxmaj latin xxmaj american"
3,ngav and other endpoint in�ltration \n▁ prevention solutions . \n\n▁ xxmaj mitigation \n\n\n\n▁ atombombing is performed just by using the underlying xxmaj windows mechanisms . xxmaj there is no need to exploit \n▁ operating system bugs or vulnerabilities . \n▁ http : / / blog.ensilo.com / atombombing - a - code - injection - that - bypasses - current - security - solutions 2 / 6 \n▁ 3 / 1 / 2017,and other endpoint in�ltration \n▁ prevention solutions . \n\n▁ xxmaj mitigation \n\n\n\n▁ atombombing is performed just by using the underlying xxmaj windows mechanisms . xxmaj there is no need to exploit \n▁ operating system bugs or vulnerabilities . \n▁ http : / / blog.ensilo.com / atombombing - a - code - injection - that - bypasses - current - security - solutions 2 / 6 \n▁ 3 / 1 / 2017 atombombing
4,"spyware fork in its third version and \n\n▁ then to the complex spyware that is version 4 . xxmaj this last step is especially interesting , \n▁ showing a big leap from straightforward code functionality to highly sophisticated malware . \n\n▁ xxmaj this suggests the latest version may have been bought from vendors of specialist \n▁ surveillance tools . xxmaj that would n’t be surprising , as the market for these espionage","fork in its third version and \n\n▁ then to the complex spyware that is version 4 . xxmaj this last step is especially interesting , \n▁ showing a big leap from straightforward code functionality to highly sophisticated malware . \n\n▁ xxmaj this suggests the latest version may have been bought from vendors of specialist \n▁ surveillance tools . xxmaj that would n’t be surprising , as the market for these espionage tools"
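
The two columns above look nearly identical because, for a language model, the target (`text_`) is the input (`text`) shifted forward by one token: at every position the model is asked to predict the next token. A tiny sketch of that relationship, using a made-up token list rather than real data from the batch:

```python
# Made-up token sequence in the style of the batch above (not real data).
tokens = ["xxbos", "xxmaj", "trend", "xxmaj", "micro", "published", "a", "report"]

seq_len = 6
inputs = tokens[:seq_len]          # what the model sees
targets = tokens[1:seq_len + 1]    # what it should predict: the same text, one token later

for inp, tgt in zip(inputs, targets):
    print(f"{inp:>10} -> {tgt}")
```

This is why each row in the `text_` column drops the first token of `text` and gains one extra token at the end.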


Next I'm going to make a language model learner, in this case using a pre-trained AWD-LSTM model, and point it at the DataLoaders from before.

In [22]:
learner = language_model_learner(dls_lm, AWD_LSTM, metrics=[accuracy, Perplexity()], path=base_dir, wd=0.1).to_fp16()

By default, pre-trained fastai language models are created in a "frozen" state, which means that only the head of the model will change with training, not the underlying layers. We'll see how well that works by just running a single epoch of training.

In [24]:
learner.fit_one_cycle(1)

epoch,train_loss,valid_loss,accuracy,perplexity,time
0,4.565845,4.394007,0.287921,80.964211,21:44
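
A note on reading this table: the perplexity column is the exponential of valid_loss (the average cross-entropy), so it carries the same information on a different scale. A quick check against the row above:

```python
import math

valid_loss = 4.394007           # copied from the table above
reported_perplexity = 80.964211  # copied from the table above

perplexity = math.exp(valid_loss)
print(f"exp({valid_loss}) = {perplexity:.3f} (reported: {reported_perplexity})")
```

Loosely, a perplexity of about 81 means the model is, on average, as uncertain as if it were choosing uniformly among about 81 tokens at each step.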


In [25]:
learner.save("one_epoch")

Path('/home/g-clef/local_ml_data_copy/ti-reports/bec011c7-0344-49dd-8daa-891f62a8c8f2/models/one_epoch.pth')

To train the whole model (fine-tuning the already-trained model), I have to "unfreeze" the model, and train again. I don't know off the top of my head how long to train it for, but I'll take a guess at 10 epochs.

In [26]:
learner.unfreeze()
learner.fit_one_cycle(10)

epoch,train_loss,valid_loss,accuracy,perplexity,time
0,3.956028,4.040853,0.326732,56.874836,22:26
1,3.593564,3.834614,0.345887,46.275551,22:10
2,3.460929,3.716889,0.357832,41.136227,22:33
3,3.234648,3.64072,0.366255,38.119286,21:48
4,3.179806,3.578173,0.373663,35.808056,21:39
5,3.142587,3.5416,0.376357,34.522125,21:38
6,3.009695,3.504174,0.381908,33.253979,21:48
7,3.001116,3.471208,0.387079,32.175575,21:32
8,3.037449,3.45209,0.389489,31.566305,21:42
9,2.90421,3.450108,0.390049,31.503803,21:31
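
One quick way to see how much each extra epoch is still helping is to look at the epoch-to-epoch drop in valid_loss. This sketch just diffs the column from the table above; the values are copied by hand.

```python
# valid_loss per epoch, copied from the table above
valid_loss = [4.040853, 3.834614, 3.716889, 3.64072, 3.578173,
              3.5416, 3.504174, 3.471208, 3.45209, 3.450108]

# Improvement gained by each epoch relative to the previous one
improvements = [prev - cur for prev, cur in zip(valid_loss, valid_loss[1:])]

for epoch, gain in enumerate(improvements, start=1):
    print(f"epoch {epoch}: valid_loss improved by {gain:.4f}")
```

Part of the flattening at the end comes from the one-cycle schedule itself, which anneals the learning rate toward the end of the run, so this is a hint rather than proof that the model has converged.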


It's looking like it's starting to level off here, so 10 epochs may be enough. Let's save it so I don't have to redo those roughly 220 minutes of training time, and then see how it does.

In [27]:
learner.save("eleven_epochs")
learner.save_encoder("first_finetuning")

To make predictions, we give the model a starting point (or starting words, in this case), and ask it to predict a number of words forward. Fortunately, lots of the reports begin with a title, so we'll make one up. You can also pass a "temperature" parameter here, which (to my understanding) rescales the predicted word distribution to be more or less "spiky": lower values concentrate probability on the likeliest words, and higher values flatten it out. I'm not sure how that will affect the final results here, so I'm going to leave it at the default for the first try.
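
To make the temperature idea concrete, here is a generic sketch of temperature-scaled softmax over a toy set of scores. The function name and the scores are made up for illustration, and this is the textbook formulation rather than fastai's exact internals.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities, sharpened or flattened by temperature."""
    scaled = [x / temperature for x in logits]
    largest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - largest) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Toy scores for three candidate next words (made up).
logits = [2.0, 1.0, 0.1]

cold = softmax_with_temperature(logits, temperature=0.5)     # spikier
neutral = softmax_with_temperature(logits, temperature=1.0)  # unchanged
hot = softmax_with_temperature(logits, temperature=2.0)      # flatter

for name, probs in [("T=0.5", cold), ("T=1.0", neutral), ("T=2.0", hot)]:
    print(name, [round(p, 3) for p in probs])
```

With a low temperature the top word dominates, so generated text is safer and more repetitive; with a high temperature the less likely words get sampled more often, so the text gets stranger.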

In [28]:
first_try = learner.predict("Underpants Gnomes - A Targetted Intrusion", 2000)

In [29]:
print(first_try)

Xxunk Xxunk - a Targetted Intrusion on a Unioncryptoupdater`memory_exec2 
▁ Internetopena 


▁ Whatsapp 
▁ Pwn2own 

▁ tty - security 

▁ Posted on November 18 , 2012 .  Keep notified and encouraged to follow full categories 

▁ high-ﬁdelity 85321dee31100bd3ece5b586ac3e6557 부트 difﬁcult 

▁ Detecting Malware 


▁ Running tslow.pyc ! 

▁ Linux Backdoor [ mechanized ] 


▁ First Determines : False Positives 

▁ Successful Kill Chain : The RLO 


▁ Although the cyberthreat is considered in a just very segmented manner , which would better hearing the next attacks relevant 
▁ to APT , it may be even more difficult and thoroughly 1761 . Just like all the Linux threat 
▁ actors , this case envyscout Additionally , the overall form of obfuscation 

▁ with a slight bb28 splits out three odds of choice : it does n’t need to apply a different cybersecurity 

▁ capability when Sofacy works . We will expect these types of attacks to be conducted out of that 

▁ multiple kinds — among other governm