In [1]:
from fastai.text.all import get_text_files, Path, TextDataLoaders, language_model_learner, text_classifier_learner, accuracy, Perplexity, AWD_LSTM
import pandas as pd

In [2]:
path = Path("/home/g-clef/local_ml_data_copy/whois")

Goal: explore the accuracy of using pre-trained NLP models on pre-classified whois data, to see how accurate it can get from purely whois-related information.

To try this out, I grabbed 100k domains from spark for each of the categories we track (spam, malware, phishing, benign). I then pulled the whois records for each of those domains from our HTTP api. I've made csv files that include them in combinations (spam+benign, malware+benign), and one "everything" csv that has all of them. 

My only modification of the original files from the DT starting point is that I removed the newlines and commas in the whois records and replaced them with spaces...that was mostly just expediency to make the csv easier to deal with.
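For reference, that cleanup step can be sketched roughly like this. It is a minimal illustration rather than the exact script used: `flatten_whois` is a hypothetical helper name, and collapsing repeated spaces is an extra step assumed here for tidiness.

```python
import re


def flatten_whois(record: str) -> str:
    """Replace newlines and commas with spaces so a record fits in one CSV cell."""
    # Every carriage return, newline, or comma becomes a single space.
    flattened = re.sub(r"[\r\n,]", " ", record)
    # Assumption: collapse runs of spaces left behind by the substitution.
    return re.sub(r" {2,}", " ", flattened).strip()


raw = "Domain Name: EXAMPLE.COM\r\nRegistrar: Example Registrar, Inc.\nName Server: NS1.EXAMPLE.COM"
print(flatten_whois(raw))
```

The trade-off is that line structure is lost, so any later parsing has to rely on the field labels inside the text.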

Given the size of the info, it took several days to pull all the whois data, so there is a chance some of them changed between when I collected the list and when I got the whois. I'm going to have to accept that risk.


I included a bunch of pre-parsed information from DT's whois data, which theoretically should be in the full "content" field, but was easy to pull out individually. The fields in the csv files are: `id, registryNameServer, registryExpires, registryUpdated, content, registryCreated, registry, registryStatus, classification`.

`content` is the big one...it's the full contents of the whois record. I may later run analyses on just that field, in case some of the pre-parsed data confuses things. `id` is also interesting, since it's the domain name in question, and `classification` is the DT-assigned class for that domain (benign/malicious/spam/phish).

First up, let's look at the pairings, like "malware+benign". (one note: have to turn down the batch size here since my video card doesn't have enough RAM to do 64-size batches. That shouldn't have a big impact on the accuracy, just on how long it takes to train.)

In [3]:
text_cols = ["id", "registryNameServer", "registryExpires", "registryUpdated", "content", "registryCreated", "registry", "registryStatus"]

In [4]:
dls = TextDataLoaders.from_csv(path=path, csv_fname="malware_plus_benign.csv", text_col=text_cols, label_col="classification", valid_pct=0.2, bs=16)

  return array(a, dtype, copy=False, order=order)


In [5]:
dls.show_batch(max_n=3)

Unnamed: 0,text,category
0,xxbos xxfld 1 cherylpamelasawesomesite.com xxfld 2 xxup ns4.wixdns.net;;ns5.wixdns.net xxfld 3 2022 - 01 - 04 xxfld 4 2021 - 01 - 05 xxfld 5 xxmaj domain xxmaj name : xxup cherylpamelasawesomesite.com xxmaj registry xxmaj domain xxup i d : 2475798900_domain_com - vrsn xxmaj registrar xxup whois xxmaj server : whois.wix.com xxmaj registrar xxup url : http : / / xxrep 3 w .wix.com xxmaj updated xxmaj date : 2021 - 01 - 05t08:25:55z xxmaj creation xxmaj date : 2020 - 01 - 04t17:51:22z xxmaj registry xxmaj expiry xxmaj date : 2022 - 01 - 04t17:51:22z xxmaj registrar : xxmaj wix.com xxmaj ltd . xxmaj registrar xxup iana xxup i d : 3817 xxmaj registrar xxmaj abuse xxmaj contact xxmaj email : domain-abuse@wix.com xxmaj registrar xxmaj abuse xxmaj contact xxmaj phone : +14154291173 xxmaj domain xxmaj status : clienttransferprohibited https : / / icann.org / epp # clienttransferprohibited xxmaj domain xxmaj,benign
1,xxbos xxfld 1 xxunk xxfld 2 nan xxfld 3 nan xxfld 4 nan xxfld 5 xxmaj domain xxmaj name : xxunk xxmaj registry xxmaj domain xxup i d : xxup xxunk xxmaj registrar xxup whois xxmaj server : whois.markmonitor.com xxmaj registrar xxup url : xxrep 3 w .markmonitor.com xxmaj updated xxmaj date : 2021 - 04 - xxunk xxmaj creation xxmaj date : 2020 - 03 - xxunk xxmaj registry xxmaj expiry xxmaj date : 2022 - 03 - xxunk xxmaj registrar : markmonitor xxmaj inc . xxmaj registrar xxup iana xxup i d : 292 xxmaj registrar xxmaj abuse xxmaj contact xxmaj email : abusecomplaints@markmonitor.com xxmaj registrar xxmaj abuse xxmaj contact xxmaj phone : +1.2083895740 xxmaj domain xxmaj status : clienttransferprohibited https : / / icann.org / epp # clienttransferprohibited xxmaj registry xxmaj registrant xxup i d : xxup xxunk xxmaj registrant xxmaj name : xxmaj global xxmaj internet,malware
2,xxbos xxfld 1 xxunk xxfld 2 nan xxfld 3 nan xxfld 4 nan xxfld 5 xxmaj domain xxmaj name : xxup xxunk xxmaj registry xxmaj domain xxup i d : xxup xxunk _ xxrep 5 0 xxunk - beer xxmaj registrar xxup whois xxmaj server : xxmaj registrar xxup url : xxmaj updated xxmaj date : 2020 - 09 - xxunk xxmaj creation xxmaj date : 2019 - 10 - xxunk xxmaj registry xxmaj expiry xxmaj date : 2021 - 10 - xxunk xxmaj registrar : xxmaj tucows xxmaj domains xxmaj inc . xxmaj registrar xxup iana xxup i d : 69 xxmaj registrar xxmaj abuse xxmaj contact xxmaj email : nicrelations@opensrs.com xxmaj registrar xxmaj abuse xxmaj contact xxmaj phone : +49.2283296859 xxmaj domain xxmaj status : clienttransferprohibited https : / / icann.org / epp # clienttransferprohibited xxmaj domain xxmaj status : clientupdateprohibited https : / / icann.org / epp,benign


A few interesting things right off the top: the fastai text data loader uses the spaCy tokenizer, which cares about things like capitalization and punctuation. That isn't always appropriate for domain-name-like strings. It's entirely possible that we could gain some accuracy here by building a tokenizer that understands domain names and doesn't tokenize them as word+punctuation.
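As a rough idea of what "understands domain names" could mean, here is a minimal pure-Python sketch that keeps anything shaped like a hostname as a single lowercase token and splits the rest on whitespace. `DOMAIN_RE` and `tokenize_keep_domains` are hypothetical names, the pattern is a simplification, and this is not wired into fastai here.

```python
import re

# Hypothetical pattern: dot-separated labels of letters, digits, and hyphens,
# ending in an alphabetic TLD of at least two characters.
DOMAIN_RE = re.compile(
    r"(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z]{2,}", re.IGNORECASE
)


def tokenize_keep_domains(text):
    """Split on whitespace, but emit each domain-like string as one token."""
    tokens = []
    pos = 0
    for match in DOMAIN_RE.finditer(text):
        # Ordinary text before the domain is split on whitespace.
        tokens.extend(text[pos:match.start()].split())
        # The domain itself stays whole; lowercase it since DNS is case-insensitive.
        tokens.append(match.group(0).lower())
        pos = match.end()
    # Whatever follows the last domain is split on whitespace too.
    tokens.extend(text[pos:].split())
    return tokens


print(tokenize_keep_domains("Name Server: NS4.WIXDNS.NET Registrar WHOIS Server: whois.wix.com"))
```

One known limitation of this sketch: an email address such as `domain-abuse@wix.com` gets split into `domain-abuse@` and `wix.com`, which may or may not be what you want.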


In [6]:
learn = text_classifier_learner(dls, AWD_LSTM, drop_mult=0.3, metrics=[accuracy, Perplexity()])

In [7]:
learn.fit_one_cycle(1)

epoch,train_loss,valid_loss,accuracy,perplexity,time
0,0.664428,0.559479,0.740336,1.74976,30:44


Accuracy of 74% right off the top? Wow. That's...super encouraging. Let's do this right, and run a bunch more epochs of training to see if we can pull that up.

As I found when messing with the NLP fake news classifier, there's a bit of an art to training an NLP classifier, as opposed to an image classifier. You don't just run it against the data set on multiple epochs all at once, at least not when you're transfer-learning an already-trained model (like `AWD_LSTM`) against your data. What you do instead is multiple rounds of training, gradually unfreezing lower and lower layers of the model.

So, we're going to try that.
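Since the same schedule gets repeated for every dataset in this notebook, it can be wrapped in a small helper. This is just a sketch: `gradual_unfreeze_fit` is a hypothetical name, and it only calls the learner methods already used in the cells here (`fit_one_cycle`, `freeze_to`, `unfreeze`).

```python
def gradual_unfreeze_fit(learn, final_epochs=5):
    """Train in rounds, unfreezing more of the model before each round."""
    # Round 1: train the learner as created (for a pretrained model I'd expect
    # only the new classifier head to be trainable at this point).
    learn.fit_one_cycle(1)
    # Round 2: unfreeze the last two layer groups.
    learn.freeze_to(-2)
    learn.fit_one_cycle(1)
    # Round 3: unfreeze the last three layer groups.
    learn.freeze_to(-3)
    learn.fit_one_cycle(1)
    # Final rounds: everything trainable.
    learn.unfreeze()
    learn.fit_one_cycle(final_epochs)
    return learn
```

With a learner built as in the cells here, usage would be `gradual_unfreeze_fit(learn)` followed by `learn.save(...)`.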

In [8]:
learn = text_classifier_learner(dls, AWD_LSTM, drop_mult=0.3, metrics=[accuracy, Perplexity()])
learn.fit_one_cycle(1)

epoch,train_loss,valid_loss,accuracy,perplexity,time
0,0.645336,0.546571,0.741411,1.727319,30:04


In [9]:
learn.freeze_to(-2)
learn.fit_one_cycle(1)

epoch,train_loss,valid_loss,accuracy,perplexity,time
0,0.560158,0.370851,0.837276,1.448967,35:40


In [10]:
learn.freeze_to(-3)
learn.fit_one_cycle(1)

epoch,train_loss,valid_loss,accuracy,perplexity,time
0,0.513881,0.358086,0.851503,1.430589,48:07


In [11]:
learn.unfreeze()
learn.fit_one_cycle(5)

epoch,train_loss,valid_loss,accuracy,perplexity,time
0,0.481729,0.32699,0.861404,1.386787,1:00:31
1,0.42477,0.285852,0.885483,1.330896,1:00:35
2,0.375334,0.326296,0.886658,1.385826,1:00:47
3,0.361011,0.325241,0.891709,1.384364,1:10:01
4,0.355986,0.284599,0.892484,1.329228,1:00:56


Almost 90% accuracy. That's cool. It may benefit from still more training, since train_loss is still decreasing with each epoch (valid_loss is bouncing around more), but at 60 mins per epoch, I'll take this for now.

On the timing front, it's interesting that it consistently takes this long per cycle, almost double what the training runs took with more layers frozen. I guess that's not totally unexpected (more stuff to learn, more variables to change), but it's interesting to me that later runs didn't get faster the way they did during the image model training. I'd assumed the OS would be able to cache the files, so the pipeline to the GPU would fill faster on later runs, but the GPU utilization was running a consistent 80%-ish during training, so it may be that it is already full.

One of the things I will need to research after this is inspecting the model and learned parameters to identify what exactly it's triggering on. If this is getting to ~90% accuracy on just whois, it's an interesting question to ask what it's doing to get there.
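One model-agnostic way to start on that question is token occlusion: mask one token at a time and see how much the predicted probability drops. The sketch below assumes a hypothetical `predict_proba(tokens)` callable that wraps the trained model and returns the probability of the class of interest. It is not a fastai API, and `xxunk` is used as the mask only because it is the unknown-token marker visible in the batch output.

```python
def occlusion_scores(tokens, predict_proba, mask="xxunk"):
    """Rank tokens by how much masking each one lowers the predicted probability."""
    base = predict_proba(tokens)
    scores = []
    for i, tok in enumerate(tokens):
        # Replace only position i with the mask token and re-score the record.
        masked = tokens[:i] + [mask] + tokens[i + 1:]
        scores.append((tok, base - predict_proba(masked)))
    # Largest drop first: these are the tokens the prediction leaned on most.
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy stand-in for the model, purely for illustration.
def toy_predict(tokens):
    return 0.9 if "badregistrar" in tokens else 0.1


print(occlusion_scores(["registrar", ":", "badregistrar"], toy_predict))
```

This costs one forward pass per token, so on full whois records it would only be practical for a small sample of domains.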

In [12]:
learn.save("malicious_benign")

Path('/home/g-clef/local_ml_data_copy/whois/models/malicious_benign.pth')

One obvious next question is whether the type of label makes a difference for the accuracy of the predictions...so far we've only looked at `malware_plus_benign`. We should probably do the same thing with each of the other two pairs, and then with the `everything` set.

So, next up: `phish_plus_benign`.

In [4]:
text_cols = ["id", "registryNameServer", "registryExpires", "registryUpdated", "content", "registryCreated", "registry", "registryStatus"]
dls = TextDataLoaders.from_csv(path=path, csv_fname="phish_plus_benign.csv", text_col=text_cols, label_col="classification", valid_pct=0.2, bs=32)
learn = text_classifier_learner(dls, AWD_LSTM, drop_mult=0.3, metrics=[accuracy, Perplexity()])
learn.fit_one_cycle(1)
learn.freeze_to(-2)
learn.fit_one_cycle(1)
learn.freeze_to(-3)
learn.fit_one_cycle(1)
learn.unfreeze()
learn.fit_one_cycle(5)

  return array(a, dtype, copy=False, order=order)


epoch,train_loss,valid_loss,accuracy,perplexity,time
0,0.657397,0.620844,0.661833,1.860497,21:46


epoch,train_loss,valid_loss,accuracy,perplexity,time
0,0.567969,0.485089,0.766771,1.624319,24:09


epoch,train_loss,valid_loss,accuracy,perplexity,time
0,0.528673,0.441808,0.80135,1.555516,30:32


epoch,train_loss,valid_loss,accuracy,perplexity,time
0,0.505351,0.39253,0.830929,1.480723,38:33
1,0.693471,0.690988,0.509789,1.995686,38:23
2,0.620316,31.926788,0.630904,73388500451328.0,38:20
3,0.416142,0.34386,0.851956,1.41038,38:19
4,0.377683,0.365152,0.851506,1.440733,38:22


In [5]:
learn.save("phish_benign")

Path('/home/g-clef/local_ml_data_copy/whois/models/phish_benign.pth')

Not sure what happened in the 2nd and 3rd epochs of the final run (accuracy fell to 51%, then valid_loss spiked to ~32), but it recovered, and this looks like it maxes out around 85% accurate.

On reflection, phishing is actively trying to look like benign, so it shouldn't be that surprising that this category is a little harder than the others, I guess. This makes me wonder a bit if the model is leaning hard on the domain name itself in its classification, and since phishing sites try to impersonate normal ones, the phishing domains read more normally than malicious ones. The obvious way to test that is to leave out the domain name column in the training, though the domain name is still in the whois full record. 

As I think about it, it may be an interesting test to leave off everything but the "content" field, to see if that changes the results at all.

In the name of completeness, let's see what it looks like for the last pairing: spam+benign

In [3]:
text_cols = ["id", "registryNameServer", "registryExpires", "registryUpdated", "content", "registryCreated", "registry", "registryStatus"]
dls = TextDataLoaders.from_csv(path=path, csv_fname="spam_plus_benign.csv", text_col=text_cols, label_col="classification", valid_pct=0.2, bs=32)
learn = text_classifier_learner(dls, AWD_LSTM, drop_mult=0.3, metrics=[accuracy, Perplexity()])
learn.fit_one_cycle(1)
learn.freeze_to(-2)
learn.fit_one_cycle(1)
learn.freeze_to(-3)
learn.fit_one_cycle(1)
learn.unfreeze()
learn.fit_one_cycle(5)

  return array(a, dtype, copy=False, order=order)


epoch,train_loss,valid_loss,accuracy,perplexity,time
0,0.609746,0.530079,0.796137,1.699066,28:11


epoch,train_loss,valid_loss,accuracy,perplexity,time
0,0.511294,0.336027,0.883121,1.399376,27:05


epoch,train_loss,valid_loss,accuracy,perplexity,time
0,0.428254,0.297486,0.901283,1.34647,34:25


epoch,train_loss,valid_loss,accuracy,perplexity,time
0,0.420518,0.280797,0.91902,1.324185,41:16
1,0.32407,0.197272,0.942611,1.218075,42:50
2,0.237189,0.165071,0.947765,1.179476,44:02
3,0.234629,0.185698,0.950141,1.204059,42:24
4,0.223163,0.160225,0.952418,1.173774,43:10


In [4]:
learn.save("spam_benign")

Path('/home/g-clef/local_ml_data_copy/whois/models/spam_benign.pth')

## (note to self: save the notebook, and shut it down between big runs like this, or Python will allocate so much system RAM that it'll make the OS swap, and make these tests run *absurdly* slowly.)

This one's *really* accurate, especially compared to the others. > 95% accurate? Wow. And train_loss was still > valid_loss, with both decreasing (though sometimes one increased while the other fell; that seems to be normal). So it's possible that this one isn't yet overfitted and could be improved even more.

95% accurate, though, is a good enough start for me. That's lovely.

Lastly, let's have a look at the "big bang" of running all the classes at once. I kinda expect this to perform much worse than the others, since we allow domains to be in multiple classes here, so confusion may be high.

In [3]:
text_cols = ["id", "registryNameServer", "registryExpires", "registryUpdated", "content", "registryCreated", "registry", "registryStatus"]
dls = TextDataLoaders.from_csv(path=path, csv_fname="everything.csv", text_col=text_cols, label_col="classification", valid_pct=0.2, bs=32)
learn = text_classifier_learner(dls, AWD_LSTM, drop_mult=0.3, metrics=[accuracy, Perplexity()])
learn.fit_one_cycle(1)
learn.freeze_to(-2)
learn.fit_one_cycle(1)
learn.freeze_to(-3)
learn.fit_one_cycle(1)
learn.unfreeze()
learn.fit_one_cycle(5)

  return array(a, dtype, copy=False, order=order)


epoch,train_loss,valid_loss,accuracy,perplexity,time
0,1.303066,1.17395,0.49435,3.234745,08:59


epoch,train_loss,valid_loss,accuracy,perplexity,time
0,1.156142,0.986218,0.60295,2.681076,10:51


epoch,train_loss,valid_loss,accuracy,perplexity,time
0,1.04133,0.917421,0.6242,2.502827,15:48


epoch,train_loss,valid_loss,accuracy,perplexity,time
0,0.91036,0.78128,0.68135,2.184267,19:54
1,0.823245,0.693002,0.7275,1.99971,19:56
2,0.704262,0.663274,0.7356,1.941137,19:54
3,0.615001,0.674567,0.73045,1.963182,19:55
4,0.57965,0.728279,0.7235,2.071513,19:56


In [4]:
learn.save("everything")

Path('/home/g-clef/local_ml_data_copy/whois/models/everything.pth')

First thing to note: the initial building of the `TextDataLoaders` took up an obscene amount of RAM on my machine. It led to the system swapping like mad, and slowed the actual computation of the dls to a crawl. The GPU during this time was completely quiescent...there were no jobs being sent to it, this was purely building the initial dls. It appears that 400k rows of this data is perhaps too much for my local system. Tried it again with swap turned off, to force Python to accept that it doesn't get any more RAM after a certain point, and that just froze the machine. Also tried setting `num_workers=0` to force it to not copy data between processes during the initial setup.

None of those worked, so I ended up truncating the everything dataset to be 100k, comprised of 25k samples of each category. That's disappointing, and I'm not sure how it will impact the results at this point.  
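One way to do that kind of truncation without loading the whole file into memory is per-class reservoir sampling over a stream of rows. This is a sketch of the idea, not necessarily how the truncated file was actually produced; `sample_per_class` is a hypothetical helper, and in practice the rows could come from `csv.DictReader` over `everything.csv`.

```python
import random
from collections import defaultdict


def sample_per_class(rows, label_key, per_class, seed=0):
    """Keep a uniform random sample of up to `per_class` rows for each label."""
    rng = random.Random(seed)
    kept = defaultdict(list)   # label -> sampled rows
    seen = defaultdict(int)    # label -> number of rows seen so far
    for row in rows:
        label = row[label_key]
        seen[label] += 1
        if len(kept[label]) < per_class:
            # Reservoir not full yet: keep the row unconditionally.
            kept[label].append(row)
        else:
            # Reservoir full: replace a random slot with probability per_class / seen.
            j = rng.randrange(seen[label])
            if j < per_class:
                kept[label][j] = row
    return kept


# Synthetic demo: 4 labels, 100 rows each, keep 25 of each.
demo_rows = [{"id": f"d{i}.example", "classification": label}
             for label in ("benign", "malware", "phish", "spam")
             for i in range(100)]
demo_sample = sample_per_class(demo_rows, "classification", per_class=25)
print({label: len(rows) for label, rows in demo_sample.items()})
```

Only `per_class` rows per label are ever held in memory, which sidesteps the swapping problem at the sampling stage at least.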

Anyway, now that I got it to run, it did, indeed, perform worse than the others. Don't mistake me, 72% accuracy on four categories isn't bad given that it's purely looking at the whois info. It may also be that the truncated set simply has fewer examples of each category to train from.

I think it might be overtrained as well, since train_loss is lower than valid_loss, and valid_loss started increasing. Still, those results are nothing to be ashamed of for a naive language model. 
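That eyeball check can be written down as a small heuristic: find the epoch with the lowest valid_loss, then flag the run if valid_loss has risen since then while train_loss kept falling. The function names are hypothetical and the rule is only a rough rule of thumb; the numbers are the final five epochs from the `everything` run.

```python
def best_epoch(valid_losses):
    """Index of the epoch with the lowest validation loss."""
    return min(range(len(valid_losses)), key=lambda i: valid_losses[i])


def looks_overfit(train_losses, valid_losses):
    """Heuristic: valid_loss rose after its best epoch while train_loss kept dropping."""
    best = best_epoch(valid_losses)
    return (
        best < len(valid_losses) - 1
        and valid_losses[-1] > valid_losses[best]
        and train_losses[-1] < train_losses[best]
    )


# train_loss / valid_loss from the unfrozen epochs of the `everything` run.
train = [0.91036, 0.823245, 0.704262, 0.615001, 0.57965]
valid = [0.78128, 0.693002, 0.663274, 0.674567, 0.728279]
print(best_epoch(valid), looks_overfit(train, valid))
```

By this rule the best checkpoint for this run would have been epoch 2 of the unfrozen phase, not the last one.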

Now, having done that, let's come back to some of the earlier questions. Firstly, I'd like to see if looking just at the whois itself, and skipping the other columns, changes anything. We'll go back to the first, `malware_plus_benign` set, and see if we can get better than 89% with just the whois, or if this is worse.

In [3]:
text_cols = ["content",]
dls = TextDataLoaders.from_csv(path=path, csv_fname="malware_plus_benign.csv", text_col=text_cols, label_col="classification", valid_pct=0.2, bs=32)
learn = text_classifier_learner(dls, AWD_LSTM, drop_mult=0.3, metrics=[accuracy, Perplexity()])
learn.fit_one_cycle(1)
learn.freeze_to(-2)
learn.fit_one_cycle(1)
learn.freeze_to(-3)
learn.fit_one_cycle(1)
learn.unfreeze()
learn.fit_one_cycle(5)

  return array(a, dtype, copy=False, order=order)


epoch,train_loss,valid_loss,accuracy,perplexity,time
0,0.653912,0.525352,0.745937,1.691054,20:27


epoch,train_loss,valid_loss,accuracy,perplexity,time
0,0.501098,0.369633,0.837776,1.447204,24:03


epoch,train_loss,valid_loss,accuracy,perplexity,time
0,0.513324,0.357666,0.846177,1.429987,29:08


epoch,train_loss,valid_loss,accuracy,perplexity,time
0,0.497036,0.378999,0.843026,1.460822,36:41
1,0.42443,0.330132,0.878507,1.391151,36:44
2,0.680645,0.908663,0.584713,2.481003,36:37
3,0.457402,0.341828,0.872631,1.407519,36:34
4,0.391443,0.480057,0.880407,1.616166,36:40


In [4]:
learn.save("malicious_benign_content_only")

Path('/home/g-clef/local_ml_data_copy/whois/models/malicious_benign_content_only.pth')

So, the first time I ran this I f'd up the data columns, and had one more data column than there were headers. That made this perform very badly (like no better than 65% accurate)...probably because it was fitting the wrong column. I'm not sure how pandas will handle a dataframe with mis-matched headers & columns, but it was a mistake in any case, so I re-ran these analyses.
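A cheap guard against that kind of mistake is to count fields per row before handing the file to fastai. The sketch below uses only the standard `csv` module; `find_ragged_rows` is a hypothetical helper, and the reported numbers are row counts, which match physical line numbers only because the newlines were already stripped from the records.

```python
import csv
import io


def find_ragged_rows(lines, max_report=5):
    """Return the header width and the first few rows whose field count differs."""
    reader = csv.reader(lines)
    header = next(reader)
    bad = []
    # Data rows start on line 2, right after the header.
    for line_no, row in enumerate(reader, start=2):
        if len(row) != len(header):
            bad.append((line_no, len(row)))
            if len(bad) >= max_report:
                break
    return len(header), bad


# Tiny in-memory example: the second data row has one field too many.
sample = io.StringIO(
    "id,content,classification\n"
    "example.com,Domain Name: EXAMPLE.COM,benign\n"
    "example.net,Domain Name: EXAMPLE.NET,extra,malware\n"
)
print(find_ragged_rows(sample))
```

For a real file, the same function can be pointed at `open(csv_path, newline="")`.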

Now that this is fixed, it's working surprisingly well. Accuracy in the high 80s from just the raw, unprocessed content of the whois record is not bad. The version with all the columns got to 89.2%, so 88.0% with just the unparsed whois data is comparable. It may be overtrained at this point, given that train_loss is less than valid_loss, but high 80s is still quite good.

After looking at the content alone, I was curious to see what the impact would be of adding the other fields back in one by one, to see which ones had the most impact on additional accuracy. Since I have a personal hunch that the nameservers are going to matter here, we'll start with those.

After that, we'll add back the domain name, then both. 

In [3]:
text_cols = ["content","registryNameServer"]
dls = TextDataLoaders.from_csv(path=path, csv_fname="malware_plus_benign.csv", text_col=text_cols, label_col="classification", valid_pct=0.2, bs=32)
learn = text_classifier_learner(dls, AWD_LSTM, drop_mult=0.3, metrics=[accuracy, Perplexity()])
learn.fit_one_cycle(1)
learn.freeze_to(-2)
learn.fit_one_cycle(1)
learn.freeze_to(-3)
learn.fit_one_cycle(1)
learn.unfreeze()
learn.fit_one_cycle(5)

  return array(a, dtype, copy=False, order=order)


epoch,train_loss,valid_loss,accuracy,perplexity,time
0,0.657151,0.551934,0.740661,1.736609,21:20


epoch,train_loss,valid_loss,accuracy,perplexity,time
0,0.531476,0.364293,0.844627,1.439496,23:58


epoch,train_loss,valid_loss,accuracy,perplexity,time
0,0.502845,0.390263,0.841726,1.477369,29:34


epoch,train_loss,valid_loss,accuracy,perplexity,time
0,0.473423,0.30786,0.870281,1.36051,37:22
1,0.429192,0.298816,0.882732,1.348261,37:30
2,0.381557,0.309226,0.885158,1.36237,37:25
3,0.336094,0.278081,0.891209,1.320593,37:21
4,0.308111,0.298538,0.892159,1.347887,37:15


Adding the nameservers back brought it to almost exactly the accuracy it had with all the columns, but it was already close to that (88.0%) with just the whois. So this did have a positive impact, but a fairly small one.

Let's try adding just the domain to the full whois content, see if that has the same effect.

In [3]:
text_cols = ["content","id"]
dls = TextDataLoaders.from_csv(path=path, csv_fname="malware_plus_benign.csv", text_col=text_cols, label_col="classification", valid_pct=0.2, bs=32)
learn = text_classifier_learner(dls, AWD_LSTM, drop_mult=0.3, metrics=[accuracy, Perplexity()])
learn.fit_one_cycle(1)
learn.freeze_to(-2)
learn.fit_one_cycle(1)
learn.freeze_to(-3)
learn.fit_one_cycle(1)
learn.unfreeze()
learn.fit_one_cycle(5)

  return array(a, dtype, copy=False, order=order)


epoch,train_loss,valid_loss,accuracy,perplexity,time
0,0.659657,0.565272,0.725209,1.759927,21:26


epoch,train_loss,valid_loss,accuracy,perplexity,time
0,0.562479,0.373781,0.837326,1.453218,24:09


epoch,train_loss,valid_loss,accuracy,perplexity,time
0,0.506287,0.349964,0.851228,1.419016,29:26


epoch,train_loss,valid_loss,accuracy,perplexity,time
0,0.438413,0.319075,0.871631,1.375855,37:23
1,0.430771,0.356392,0.873831,1.428168,37:19
2,0.356327,0.314858,0.883708,1.370065,37:20
3,0.340277,0.297356,0.891984,1.346295,37:18
4,0.260013,0.30845,0.889808,1.361314,37:16


This looks slightly worse than the nameservers one, but not appreciably. This may be an unfruitful path, but let's finish it off.

Two other possibilities occur to me: `registryCreated` and `registry`. I can totally see a time-based correlation for malicious domains, or a registry-based one. In fact, I pretty strongly suspect the registry one is true. Let's try that next.

In [3]:
text_cols = ["content","registry"]
dls = TextDataLoaders.from_csv(path=path, csv_fname="malware_plus_benign.csv", text_col=text_cols, label_col="classification", valid_pct=0.2, bs=32)
learn = text_classifier_learner(dls, AWD_LSTM, drop_mult=0.3, metrics=[accuracy, Perplexity()])
learn.fit_one_cycle(1)
learn.freeze_to(-2)
learn.fit_one_cycle(1)
learn.freeze_to(-3)
learn.fit_one_cycle(1)
learn.unfreeze()
learn.fit_one_cycle(5)

  return array(a, dtype, copy=False, order=order)


epoch,train_loss,valid_loss,accuracy,perplexity,time
0,0.663452,0.533253,0.747387,1.704467,21:36


epoch,train_loss,valid_loss,accuracy,perplexity,time
0,0.566597,0.38657,0.825874,1.471924,24:21


epoch,train_loss,valid_loss,accuracy,perplexity,time
0,0.509276,0.368582,0.838301,1.445684,29:45


epoch,train_loss,valid_loss,accuracy,perplexity,time
0,0.445605,0.352681,0.856453,1.422877,37:12
1,0.40048,0.321578,0.877507,1.379303,37:17
2,0.379589,0.48872,0.880757,1.630228,37:12
3,0.300424,0.329104,0.882082,1.389722,37:13
4,0.318386,0.325499,0.885908,1.384721,37:14


Huh. Adding the registry didn't accomplish much. It's basically the same accuracy as with just the domain. I'm actually quite surprised by that. My instinct would have been that there would be a fairly strong correlation between registry and malware domains (that bad actors would prefer certain registrars).

Following up above, let's try `registryCreated`

In [3]:
text_cols = ["content","registryCreated"]
dls = TextDataLoaders.from_csv(path=path, csv_fname="malware_plus_benign.csv", text_col=text_cols, label_col="classification", valid_pct=0.2, bs=32)
learn = text_classifier_learner(dls, AWD_LSTM, drop_mult=0.3, metrics=[accuracy, Perplexity()])
learn.fit_one_cycle(1)
learn.freeze_to(-2)
learn.fit_one_cycle(1)
learn.freeze_to(-3)
learn.fit_one_cycle(1)
learn.unfreeze()
learn.fit_one_cycle(5)

  return array(a, dtype, copy=False, order=order)


epoch,train_loss,valid_loss,accuracy,perplexity,time
0,0.654669,0.516791,0.770691,1.676639,21:45


epoch,train_loss,valid_loss,accuracy,perplexity,time
0,0.503632,0.39582,0.832525,1.485601,24:41


epoch,train_loss,valid_loss,accuracy,perplexity,time
0,0.494965,0.366697,0.847052,1.44296,29:34


epoch,train_loss,valid_loss,accuracy,perplexity,time
0,0.483581,0.364735,0.857304,1.440132,37:21
1,0.394297,0.298961,0.883758,1.348457,37:28
2,0.37883,0.288536,0.890184,1.334473,37:22
3,0.319705,0.331098,0.893509,1.392496,37:20
4,0.310072,0.310836,0.891134,1.364566,37:19


So, that got back to the original accuracy, but didn't make much of a dent beyond that. 

I wonder what happens if I add all the timestamps?

In [3]:
text_cols = ["registryExpires", "registryUpdated", "content", "registryCreated"]
dls = TextDataLoaders.from_csv(path=path, csv_fname="malware_plus_benign.csv", text_col=text_cols, label_col="classification", valid_pct=0.2, bs=32)
learn = text_classifier_learner(dls, AWD_LSTM, drop_mult=0.3, metrics=[accuracy, Perplexity()])
learn.fit_one_cycle(1)
learn.freeze_to(-2)
learn.fit_one_cycle(1)
learn.freeze_to(-3)
learn.fit_one_cycle(1)
learn.unfreeze()
learn.fit_one_cycle(5)

  return array(a, dtype, copy=False, order=order)


epoch,train_loss,valid_loss,accuracy,perplexity,time
0,0.664266,0.526443,0.76624,1.692899,22:29


epoch,train_loss,valid_loss,accuracy,perplexity,time
0,0.541039,0.380852,0.835525,1.463531,25:35


epoch,train_loss,valid_loss,accuracy,perplexity,time
0,0.510377,0.377457,0.844677,1.45857,30:14


epoch,train_loss,valid_loss,accuracy,perplexity,time
0,0.450346,0.347307,0.86683,1.415252,38:23
1,0.440764,0.337866,0.86873,1.401953,38:06
2,0.357215,0.335284,0.890534,1.398337,38:02
3,0.321744,0.345346,0.892584,1.412478,38:01
4,0.247602,0.394439,0.894184,1.483552,38:03


Huh. So adding all the timestamps got this back to the original state, but no higher or lower. It seems like the full whois content is dominating everything else, which is not a huge surprise given that it's enormously more data than a single date or name.

Given that, the next obvious thing to do is to *remove* the full whois content, and see what the performance would be if I left that out. First let's try taking out just the full whois, and using everything else.

In [3]:
text_cols = ["id", "registryNameServer", "registryExpires", "registryUpdated", "registryCreated", "registry", "registryStatus"]
dls = TextDataLoaders.from_csv(path=path, csv_fname="malware_plus_benign.csv", text_col=text_cols, label_col="classification", valid_pct=0.2, bs=32)
learn = text_classifier_learner(dls, AWD_LSTM, drop_mult=0.3, metrics=[accuracy, Perplexity()])
learn.fit_one_cycle(1)
learn.freeze_to(-2)
learn.fit_one_cycle(1)
learn.freeze_to(-3)
learn.fit_one_cycle(1)
learn.unfreeze()
learn.fit_one_cycle(5)

  return array(a, dtype, copy=False, order=order)


epoch,train_loss,valid_loss,accuracy,perplexity,time
0,0.691307,0.662306,0.624069,1.939259,02:43


epoch,train_loss,valid_loss,accuracy,perplexity,time
0,0.631688,0.600314,0.674301,1.82269,03:08


epoch,train_loss,valid_loss,accuracy,perplexity,time
0,0.608227,0.571405,0.685428,1.770753,03:35


epoch,train_loss,valid_loss,accuracy,perplexity,time
0,0.59544,0.691332,0.689978,1.996374,04:25
1,0.586165,0.562467,0.703355,1.754997,04:25
2,0.580344,0.559158,0.708831,1.749199,04:26
3,0.576549,0.526633,0.709856,1.693222,04:24
4,0.572886,0.535067,0.711382,1.707564,04:26


That's a lot less accurate, but still much better than even odds, with a lot less time per epoch. It also seems to have stabilized at close to 70% but may still benefit from some more training. Given the path, though, I wouldn't expect more training to pull this above 75% accurate, where the full whois was tracking to closer to 89%.


Let's try re-running the other analyses leaving out the raw whois also. 

In [3]:
text_cols = ["id", "registryNameServer", "registryExpires", "registryUpdated", "registryCreated", "registry", "registryStatus"]
dls = TextDataLoaders.from_csv(path=path, csv_fname="phish_plus_benign.csv", text_col=text_cols, label_col="classification", valid_pct=0.2, bs=32)
learn = text_classifier_learner(dls, AWD_LSTM, drop_mult=0.3, metrics=[accuracy, Perplexity()])
learn.fit_one_cycle(1)
learn.freeze_to(-2)
learn.fit_one_cycle(1)
learn.freeze_to(-3)
learn.fit_one_cycle(1)
learn.unfreeze()
learn.fit_one_cycle(5)

  return array(a, dtype, copy=False, order=order)


epoch,train_loss,valid_loss,accuracy,perplexity,time
0,0.689834,0.658959,0.612702,1.93278,02:41


epoch,train_loss,valid_loss,accuracy,perplexity,time
0,0.647577,0.59328,0.651481,1.809916,03:09


epoch,train_loss,valid_loss,accuracy,perplexity,time
0,0.625498,0.592574,0.659357,1.808637,03:37


epoch,train_loss,valid_loss,accuracy,perplexity,time
0,0.588473,0.586338,0.670809,1.797394,04:33
1,0.60275,0.600147,0.667983,1.822386,04:32
2,0.577022,0.534752,0.658132,1.707024,04:31
3,0.556424,0.51106,0.667058,1.667058,04:33
4,0.548785,0.514143,0.665983,1.672204,04:31


Well, this is interesting. It flatlined quickly, and never got past the high 60s in terms of accuracy. The one with the full whois maxed out around 85% accurate, so this is a fairly big step down. It's better than a coin toss, but not by much.


In [3]:
text_cols = ["id", "registryNameServer", "registryExpires", "registryUpdated", "registryCreated", "registry", "registryStatus"]
dls = TextDataLoaders.from_csv(path=path, csv_fname="spam_plus_benign.csv", text_col=text_cols, label_col="classification", valid_pct=0.2, bs=32)
learn = text_classifier_learner(dls, AWD_LSTM, drop_mult=0.3, metrics=[accuracy, Perplexity()])
learn.fit_one_cycle(1)
learn.freeze_to(-2)
learn.fit_one_cycle(1)
learn.freeze_to(-3)
learn.fit_one_cycle(1)
learn.unfreeze()
learn.fit_one_cycle(5)

  return array(a, dtype, copy=False, order=order)


epoch,train_loss,valid_loss,accuracy,perplexity,time
0,0.672253,0.628687,0.656568,1.875147,02:41


epoch,train_loss,valid_loss,accuracy,perplexity,time
0,0.585202,0.624904,0.706877,1.868067,03:09


epoch,train_loss,valid_loss,accuracy,perplexity,time
0,0.566745,0.540087,0.709129,1.716156,03:38


epoch,train_loss,valid_loss,accuracy,perplexity,time
0,0.499739,0.508869,0.730218,1.663409,04:32
1,0.50336,0.451244,0.728892,1.570264,04:33
2,0.490418,0.442655,0.737923,1.556835,04:33
3,0.492845,0.434042,0.738298,1.543484,04:32
4,0.45378,0.436137,0.736747,1.54672,04:33


This maxed out at 95% accurate with the raw whois, and 74% without. That's a pretty serious step down. 

Lastly, the "everything.csv" file.

In [3]:
text_cols = ["id", "registryNameServer", "registryExpires", "registryUpdated", "registryCreated", "registry", "registryStatus"]
dls = TextDataLoaders.from_csv(path=path, csv_fname="everything.csv", text_col=text_cols, label_col="classification", valid_pct=0.2, bs=32)
learn = text_classifier_learner(dls, AWD_LSTM, drop_mult=0.3, metrics=[accuracy, Perplexity()])
learn.fit_one_cycle(1)
learn.freeze_to(-2)
learn.fit_one_cycle(1)
learn.freeze_to(-3)
learn.fit_one_cycle(1)
learn.unfreeze()
learn.fit_one_cycle(5)

  return array(a, dtype, copy=False, order=order)


epoch,train_loss,valid_loss,accuracy,perplexity,time
0,1.376452,1.33761,0.3237,3.809927,01:20


epoch,train_loss,valid_loss,accuracy,perplexity,time
0,1.344559,1.26916,0.3662,3.557864,01:33


epoch,train_loss,valid_loss,accuracy,perplexity,time
0,1.279905,1.267,0.4185,3.550187,01:48


epoch,train_loss,valid_loss,accuracy,perplexity,time
0,1.227397,1.647639,0.35025,5.194699,02:13
1,1.11306,1.189788,0.4765,3.286386,02:14
2,1.144386,1.128097,0.48345,3.089771,02:13
3,1.118706,1.124633,0.4873,3.079087,02:14
4,1.101728,1.124326,0.49005,3.078141,02:14


Originally this maxed out at 72% accurate with the full whois `content`. Leaving it out, we get to about 49%. With 4 categories, random guessing would land around 25%, so it's still clearly better than chance, but that's not great.
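To put the with/without comparison in one place, here is a small tally of the final-epoch accuracies from the runs in this notebook, next to the chance level for each task. The chance figure assumes the classes are balanced, which is true by construction for the truncated `everything` set and only presumed for the pairs; `RUNS` and `summarize` are names made up for this sketch.

```python
# (number of classes, final accuracy with `content`, final accuracy without it),
# all read off the last epoch of the corresponding run.
RUNS = {
    "malware_plus_benign": (2, 0.892484, 0.711382),
    "phish_plus_benign": (2, 0.851506, 0.665983),
    "spam_plus_benign": (2, 0.952418, 0.736747),
    "everything": (4, 0.7235, 0.49005),
}


def summarize(runs):
    """Compute chance level and accuracy lost by dropping `content` for each run."""
    summary = {}
    for name, (n_classes, with_content, without_content) in runs.items():
        summary[name] = {
            "chance": 1.0 / n_classes,  # assumes balanced classes
            "with": with_content,
            "without": without_content,
            "drop": with_content - without_content,
        }
    return summary


for name, row in summarize(RUNS).items():
    print(f"{name:22s} chance={row['chance']:.2f} with={row['with']:.3f} "
          f"without={row['without']:.3f} drop={row['drop']:.3f}")
```

Read this way, dropping `content` costs roughly 18 to 23 points of accuracy on every dataset, but each run still lands well above its chance level.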

So...the whole point of this exercise was to see if an NLP model could do effective classification based just on the whois field, and the answer to that is clearly "yes", which is neat. I'm a little surprised (well, more disappointed) that parsed fields (timestamps, etc) aren't very useful, but the full whois data is. I would like to be able to introspect the full whois content models a bit more to see what they're flagging on that the parsed fields don't have...but some of the pure-whois models got up to 95% accurate, which is really impressive.

So, I think the next step is to build a random forest classifier based on the non-`content` fields. What I'm hoping to get out of that is a better map of which of those fields are "useful" (for lack of a better phrase) in classification, and to experiment a bit with the date-ifying stuff that fastai has (take a date, add day-of-week, day-of-month, day-of-year, etc. columns) to see if there are any interesting correlations there. But that's for another notebook.
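As a preview of what that date expansion produces, here is a hand-rolled stand-in using only the standard library. It is not fastai's own helper, `date_parts` is a hypothetical name, and it assumes the registry timestamps are plain `YYYY-MM-DD` strings, which is what the tokenized batch output suggests.

```python
from datetime import datetime


def date_parts(timestamp):
    """Expand a YYYY-MM-DD string into separate calendar features."""
    dt = datetime.strptime(timestamp, "%Y-%m-%d")
    return {
        "year": dt.year,
        "month": dt.month,
        "day_of_month": dt.day,
        "day_of_week": dt.weekday(),          # Monday is 0, Sunday is 6
        "day_of_year": dt.timetuple().tm_yday,
    }


print(date_parts("2020-01-04"))
```

Each of these would become its own column for the random forest, so it can split on, say, day of week independently of year.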