# DomainTools

The DomainTools service is [releasing](https://www.domaintools.com/resources/blog/free-covid-19-threat-list-domain-risk-assessments-for-coronavirus-threats) a set of COVID-19 related domains that have been flagged as "high risk":

> Drawing upon data points from over 330 million current Internet domains, DomainTools Risk Score predicts how likely a domain is to be malicious, often before it is weaponized. The score comes from two distinct algorithms: Proximity and Threat Profile. Proximity evaluates the likelihood a domain may be part of an attack campaign by analyzing how closely connected it is to other known-bad domains. Threat Profile leverages machine learning to model how closely the domain’s intrinsic properties resemble those of others used for spam, phishing, or malware. The strongest signal from either of those algorithms becomes the combined Domain Risk Score.
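
The last sentence of the quote says the combined score is the stronger of the two component signals. A minimal sketch of that rule, assuming both components are reported on the same 0-100 scale; the function name and the component values are hypothetical and are not part of any DomainTools API:

```python
def combined_risk_score(proximity: int, threat_profile: int) -> int:
    """Return the stronger of the two component scores (hypothetical helper)."""
    return max(proximity, threat_profile)


# Made-up component scores, for illustration only.
print(combined_risk_score(proximity=42, threat_profile=99))
print(combined_risk_score(proximity=85, threat_profile=70))
```

One consequence of taking the maximum is that a domain scores high if either algorithm flags it, even when the other sees nothing unusual.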

The data is released daily at https://covid-19-threat-list.domaintools.com/. The `utils/domaintools.py` script can be run from cron to collect the files on a schedule. The collected files are also stored in this repository under `data/domaintools/`.

Despite the `.csv` file extension, each file is a gzipped, tab-separated file with three columns:

* domain name
* create date
* risk score

The files have no header row, so we need to supply the column names ourselves when we read in the data.
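
To make the file layout concrete, here is a small standard-library sketch that reads a gzipped, tab-separated payload with no header row. The bytes are built in memory from two sample rows so the snippet runs without a download; the `COLUMNS` names are our own labels, not something the file provides.

```python
import csv
import gzip
import io

# Two sample rows in the layout described above: no header, tab separated.
raw = (
    "covid19bolivia.app\t2020-04-07\t99\n"
    "coronacat.site\t2020-04-07\t99\n"
)
# Stand-in for a downloaded file: gzip-compressed bytes held in memory.
compressed = gzip.compress(raw.encode("utf-8"))

COLUMNS = ["domain", "created", "risk"]  # names supplied by us, not by the file

with gzip.open(io.BytesIO(compressed), mode="rt", encoding="utf-8", newline="") as fh:
    reader = csv.DictReader(fh, fieldnames=COLUMNS, delimiter="\t")
    # Convert the risk column from text to an integer as each row is read.
    rows = [dict(row, risk=int(row["risk"])) for row in reader]

print(rows[0])
```

Passing a real path such as `data/domaintools/2020-04-08.csv.gz` to `gzip.open` in place of the `BytesIO` object should work the same way.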

## Risk Scores

In [1]:
import pandas

df = pandas.read_csv('data/domaintools/2020-04-08.csv.gz',
    parse_dates=['created'], 
    sep='\t',
    names=['domain', 'created', 'risk']
)
df.head()

Unnamed: 0,domain,created,risk
0,covid19bolivia.app,2020-04-07,99
1,puppyloveinthetimeofcorona.com,2020-04-07,99
2,covid19-finances.online,2020-04-07,99
3,coronavirusppestore.net,2020-04-07,99
4,coronacat.site,2020-04-07,99


We know from their description of the dataset that it only includes domains with a risk score of greater than or equal to 70. Here is a general statistical picture of the domains.

In [2]:
df.describe()

Unnamed: 0,risk
count,126156.0
mean,97.865793
std,3.553246
min,70.0
25%,99.0
50%,99.0
75%,99.0
max,99.0


We can see that the mean risk score is about 97.9. Because the 25th percentile is already 99, which is also the maximum, at least three quarters of these domains carry the highest score in the dataset.

In [3]:
df[df.risk >= 99].count()

domain     98588
created    98588
risk       98588
dtype: int64
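
As a quick check on scale, the two counts shown above can be turned into a share. This is plain arithmetic on the numbers already printed (126,156 rows in total, 98,588 with a risk of 99); the variable names are ours.

```python
total_domains = 126156  # row count reported by df.describe()
at_max_score = 98588    # rows where risk >= 99

share = at_max_score / total_domains
print(f"{share:.1%}")  # 78.1%
```

With the DataFrame loaded, `(df.risk >= 99).mean()` should give the same fraction directly.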

## Create Dates

We can also take a look at the number of domains that are created or tracked per day. The [documentation](https://www.domaintools.com/resources/blog/free-covid-19-threat-list-domain-risk-assessments-for-coronavirus-threats) for the dataset indicates that the date can either be the day that the domain was registered, or the day that DomainTools started monitoring the domain.

In [4]:
import altair as alt

counts = df.created.value_counts().reset_index()
counts.columns = ['created', 'count']

alt.Chart(counts, width=700, title="Domains Created/Tracked per Day").mark_bar().encode(
    alt.X("created:T", title="Time (Days)"),
    alt.Y('count:Q', title="Domains")
)

## Top-Level Domains
 
Which registries are being used the most? We can group the results by top-level domain to get a sense of that.

In [86]:
import tld

df['tld'] = df['domain'].map(lambda d: tld.get_tld(d, fix_protocol=True))
df

Unnamed: 0,domain,created,risk,tld
0,thepostcovid19.com,2020-04-05,99,com
1,covid19consultation.com,2020-04-05,99,com
2,coronaexpedite.com,2020-04-05,99,com
3,coronavirus-social-distancing-strategy.com,2020-04-05,99,com
4,covid-19accountants.com,2020-04-05,99,com
...,...,...,...,...
120267,fargerike-corona.no,2020-01-28,70,no
120268,coronamask.online,2020-01-27,70,online
120269,corona-protect.online,2020-01-25,70,online
120270,wuhanfapai.net,2020-01-10,70,net
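
A note on why a dedicated package is used here instead of splitting on the last dot: some suffixes, such as `co.uk`, span two labels. The sketch below contrasts a naive split with a lookup against a tiny, hand-picked suffix set. The set, the function names, and the `example-covid.co.uk` domain are all illustrative assumptions; real code should rely on a complete suffix list, which is what the `tld` package appears to do.

```python
# Tiny, hand-picked set of two-label suffixes; an assumption for illustration.
# Real code should rely on a complete public suffix list instead.
MULTI_LABEL_SUFFIXES = {"co.uk", "org.uk", "com.au"}


def naive_tld(domain: str) -> str:
    """Return everything after the last dot."""
    return domain.rsplit(".", 1)[-1]


def suffix_aware_tld(domain: str) -> str:
    """Prefer a known two-label suffix, else fall back to the last label."""
    labels = domain.lower().split(".")
    last_two = ".".join(labels[-2:])
    if len(labels) > 2 and last_two in MULTI_LABEL_SUFFIXES:
        return last_two
    return labels[-1]


# The second domain is made up for this example.
for name in ["coronamask.online", "example-covid.co.uk"]:
    print(name, naive_tld(name), suffix_aware_tld(name))
```

The naive version would fold every `.co.uk` domain into a single `uk` bucket, which would change the counts in the table that follows.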


A quick look at the top 25 TLDs shows that .com has by far the most domains.

In [87]:
tld_counts = df.tld.value_counts().reset_index()
tld_counts.columns = ['tld', 'count']
tld_counts.head(25)

Unnamed: 0,tld,count
0,com,62394
1,org,8077
2,net,4530
3,info,3842
4,de,3565
5,co.uk,2768
6,online,2234
7,nl,1837
8,ru,1739
9,eu,1631


In [88]:
com_count = tld_counts.iloc[0]['count']
com_count / tld_counts['count'].sum()

0.518774112012771

Just over half the domains are .com, which is managed by Verisign. Does the .com registration curve look similar to the overall trend?

In [89]:
com = df[df.tld == 'com'].groupby('created').count()
alt.Chart(com.reset_index(), width=700, title=".com Created per Day").mark_line().encode(
    alt.X("created:T", title="Time (Days)"),
    alt.Y('domain:Q', title="Created")
)

Yes, it does. How about the most common country-code TLD, .de?

In [90]:
de = df[df.tld == 'de'].groupby('created').count()
alt.Chart(de.reset_index(), width=700, title=".de Created per Day").mark_line().encode(
    alt.X("created:T", title="Time (Days)"),
    alt.Y('domain:Q', title="Created")
)

In [91]:
co_uk = df[df.tld == 'co.uk'].groupby('created').count()
alt.Chart(co_uk.reset_index(), width=700, title=".co.uk Created per Day").mark_line().encode(
    alt.X("created:T", title="Time (Days)"),
    alt.Y('domain:Q', title="Created")
)

In [92]:
ru = df[df.tld == 'ru'].groupby('created').count()
alt.Chart(ru.reset_index(), width=700, title=".ru Created per Day").mark_line().encode(
    alt.X("created:T", title="Time (Days)"),
    alt.Y('domain:Q', title="Created")
)

At this point it's probably worth pointing out that these views may say more about DomainTools' own processing pipeline than about the creation of domains, especially since the created date could simply be the day DomainTools started tracking a domain.