# DomainTools

DomainTools is [releasing](https://www.domaintools.com/resources/blog/free-covid-19-threat-list-domain-risk-assessments-for-coronavirus-threats) a list of COVID-19-related domains that it has flagged as "high risk":

> Drawing upon data points from over 330 million current Internet domains, DomainTools Risk Score predicts how likely a domain is to be malicious, often before it is weaponized. The score comes from two distinct algorithms: Proximity and Threat Profile. Proximity evaluates the likelihood a domain may be part of an attack campaign by analyzing how closely connected it is to other known-bad domains. Threat Profile leverages machine learning to model how closely the domain’s intrinsic properties resemble those of others used for spam, phishing, or malware. The strongest signal from either of those algorithms becomes the combined Domain Risk Score.

The data is released daily at https://covid-19-threat-list.domaintools.com/. The `utils/domaintools.py` script can be run from cron to collect the files on a schedule. The collected files are also stored in this repository under `data/domaintools/`.

Despite the `.csv` file extension, each file is a gzipped, tab-separated file with three columns:

* domain name
* create date
* risk score

The columns are not labeled in the data, so we need to supply the column names when we read a file.
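
As a minimal sketch of the layout described above, the following builds a tiny gzipped, tab-separated payload in memory and reads it back with explicit column names. The sample rows are invented for illustration. `compression="gzip"` is passed explicitly on the assumption that a file named with a plain `.csv` extension would not have its compression inferred from the name.

```python
import gzip
import io

import pandas

# Invented sample rows in the same three-column, tab-separated layout
# (domain name, create date, risk score). These are not real entries.
raw = "example-one.com\t2020-04-04\t99\nexample-two.org\t2020-04-03\t70\n"
payload = gzip.compress(raw.encode("utf-8"))

# A gzip stream starts with the bytes 0x1f 0x8b, whatever the file is called.
is_gzip = payload[:2] == b"\x1f\x8b"

sample = pandas.read_csv(
    io.BytesIO(payload),
    compression="gzip",  # explicit, since there is no .gz suffix to infer from
    sep="\t",
    names=["domain", "created", "risk"],  # the data has no header row
    parse_dates=["created"],
)
print(sample.dtypes)
```

Checking the first two bytes is a quick way to confirm that a downloaded file really is gzipped before handing it to pandas.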

## Risk Scores

In [27]:
import pandas

df = pandas.read_csv('data/domaintools/2020-04-05.csv.gz',
    parse_dates=['created'], 
    sep='\t',
    names=['domain', 'created', 'risk']
)
df.head()

|   | domain | created | risk |
|---|---|---|---|
| 0 | covid19fashions.com | 2020-04-04 | 99 |
| 1 | coronaviruscardgame.com | 2020-04-04 | 99 |
| 2 | covidonline.me | 2020-04-04 | 99 |
| 3 | covid-19contained.com | 2020-04-04 | 99 |
| 4 | covidcrime-chicago.com | 2020-04-04 | 99 |


We know from the DomainTools description of the dataset that it only includes domains with a risk score of 70 or higher. Here is a general statistical picture of the domains.

In [28]:
df.describe()

|   | risk |
|---|---|
| count | 116711.0 |
| mean | 97.580022 |
| std | 3.957183 |
| min | 70.0 |
| 25% | 98.0 |
| 50% | 99.0 |
| 75% | 99.0 |
| max | 99.0 |


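The mean risk score is about 97.6, and the minimum of 70 matches the documented threshold. The percentiles show how skewed the distribution is: the median equals the maximum of 99, so at least half of the domains have the highest score in the data, and the 25th percentile of 98 means at least three quarters score 98 or higher.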
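
To look past the summary statistics, one option is to count how many domains fall at each score. The sketch below uses a small invented `risk` series so that it runs on its own; in the notebook, `df["risk"]` from the cell above would take its place.

```python
import pandas

# Invented scores for illustration only; substitute df["risk"] in the notebook.
risk = pandas.Series([99, 99, 99, 98, 85, 70], name="risk")

# Fraction of domains at each score, highest score first.
share = risk.value_counts(normalize=True).sort_index(ascending=False)

# Fraction of domains sitting at the maximum score.
at_max = (risk == risk.max()).mean()

# Check that every score meets the documented floor of 70.
floor_ok = bool((risk >= 70).all())

print(share)
print(at_max, floor_ok)
```

With the real data, `at_max` gives the exact proportion of domains at the top score, which the percentiles only bound from below.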

## Create Dates

We can also look at the number of domains created or tracked per day. The [documentation](https://www.domaintools.com/resources/blog/free-covid-19-threat-list-domain-risk-assessments-for-coronavirus-threats) for the dataset indicates that the date is either the day the domain was registered or the day DomainTools started monitoring it.

In [29]:
import altair as alt

# Count domains per date, then turn the result into a two-column frame for Altair.
counts = df.created.value_counts().reset_index()
counts.columns = ['created', 'count']

alt.Chart(counts, width=700, title="Domains Created/Tracked per Day").mark_bar().encode(
    alt.X("created:T", title="Time (Days)"),
    alt.Y('count:Q', title="Domains")
)