# T4 - Final Project

Semester 2221, CSEC 520/620, Team 4\
Final Project - URL Classification\
Due by December 14, 2022 11:59 PM EST.\
Accounts for 18% of total grade.

## 0. Preliminary Requirements

This section installs the `python-whois` package if it is missing.
It also pulls our raw datasets, which are stored in a Git repository hosted on GitHub.

**Make sure you are running this notebook in an isolated directory, as it will be turned into a Git working directory.**

In [1]:
# Ensure the WHOIS package is installed
print(f'{"":#^{36}}\n{"## Installing Packages ":#<{36}}\n{"":#^{36}}')
!pip install python-whois

# Download our repo, which contains the RAW datasets
print(f'\n{"":#^{36}}\n{"## Updating Repository ":#<{36}}\n{"":#^{36}}')
!git init
!git remote add origin https://github.com/aisgbnok/T4-Project.git
!git pull origin main --allow-unrelated-histories

####################################
## Installing Packages #############
####################################

####################################
## Updating Repository #############
####################################
Reinitialized existing Git repository in C:/Users/Anthony/Documents/T4-Project/.git/


usage: git remote add [<options>] <name> <url>

    -f, --fetch           fetch the remote branches
    --tags                import all tags and associated objects when fetching
                          or do not fetch any tag at all (--no-tags)
    -t, --track <branch>  branch(es) to track
    -m, --master <branch>
                          master branch
    --mirror[=(push|fetch)]
                          set up remote as a mirror to push to or fetch from



Already up to date.


From https://github.com/aisgbnok/T4-Project
 * branch            main       -> FETCH_HEAD


In [2]:
from os import makedirs, path, listdir
from sklearn.svm import SVC
import threading
import numpy
import pandas as pd
import whois

## 1. Merge Raw Datasets

In [5]:
def merge_raw(seed=75, output=False, save=False):
    """
    Loads our separate dataframes and merges them together.
    Ensures the columns are equal, normalizes the labels,
    drops duplicate and conflicting URLs, and finally shuffles the rows using the seed.

    :param seed: Integer value that ensures reproduction of resulting dataframe.
    :param output: Whether to print dataframes or not. True or False.
    :param save: Whether to save the pandas dataframe. Can be False (default), 'CSV', or 'PICKLE'.
    :return: A pandas dataframe that contains a URL and label column.
             The label column is 0 for benign and 1 for malicious.
    """
    # Get all of our datasets
    df_aj = pd.read_csv(path.join('datasets', 'raw', 'urls-antonyj.csv'))
    df_ms = pd.read_csv(path.join('datasets', 'raw', 'urls-manu-siddhartha.csv'))

    if output:
        print(f'{"":#^{36}}\n{"## Original ":#<{36}}\n{"":#^{36}}')
        print('## urls-antonyj.csv')
        display(df_aj)
        print('## urls-manu-siddhartha.csv')
        display(df_ms)

    # Ensure Columns Match
    df_ms.columns = df_aj.columns

    # Normalize Data, 1 is malicious, 0 is benign
    df_aj['label'] = (df_aj['label'] == 'bad').astype(int)
    df_ms['label'] = (df_ms['label'] != 'benign').astype(int)

    # Merge dataframes
    df = pd.merge(df_aj, df_ms, how='outer')

    # Keep first exact matches
    df = df.drop_duplicates()

    # Drop all duplicate urls with conflicting labels
    # Prevents some data poisoning, and promotes data integrity
    df = df.drop_duplicates(subset='url', keep=False)

    # Shuffle using seed value
    df = df.sample(frac=1, random_state=seed)

    # Reset Index
    df = df.reset_index(drop=True)

    if output:
        print(f'{"":#^{36}}\n{"## Resulting ":#<{36}}\n{"":#^{36}}')
        print('## urls-antonyj.csv')
        display(df_aj)
        print('## urls-manu-siddhartha.csv')
        display(df_ms)
        print('## Final')
        display(df)

    if save == 'PICKLE':
        df.to_pickle('datasets/t4-urls.zip')
    elif save == 'CSV':
        df.to_csv('datasets/t4-urls.csv')

    return df

In [6]:
dataset = merge_raw()
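
The two `drop_duplicates` calls in `merge_raw` do different jobs: the first collapses rows that agree on both URL and label, and the second removes every copy of a URL whose labels still disagree. A minimal sketch on a made-up frame (the URLs and labels below are invented for illustration and are not taken from either dataset):

```python
import pandas as pd

# Toy stand-in for the merged frame; these rows are invented for illustration.
toy = pd.DataFrame({
    'url':   ['a.com', 'a.com', 'b.com', 'b.com', 'c.com'],
    'label': [0,       0,       0,       1,       1],
})

# Step 1: collapse exact duplicates (same URL and same label), keeping the first.
step1 = toy.drop_duplicates()

# Step 2: any URL that still appears more than once must have conflicting labels,
# so keep=False drops every copy of it.
step2 = step1.drop_duplicates(subset='url', keep=False).reset_index(drop=True)

print(step2)
```

Here `a.com` survives once, `b.com` disappears entirely because its labels conflict, and `c.com` is untouched.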

In [None]:
datasetsize = len(dataset.index)
mals = dataset[dataset["label"] == 1]
malsize = len(mals.index)
bensize = datasetsize - malsize
print(datasetsize, malsize, bensize)

707514 66447 641067


## 2. Gather ICANN WHOIS Lookup Data

In [9]:
def multithreading(data, max_threads=10):
    """
    Wrapper for gather_whois that utilizes multithreading to achieve faster results.

    :param data: The entire dataset that you want to gather information for.
                 Must be a dataframe with a "url" column.
    :param max_threads: The maximum number of threads to run at a given time.
                        The default is 10.
    :return: None.
    """
    percent = 0.0
    threads = []
    thread_number = 1

    # Generate one thread per 5% section of the data
    while percent < 1:
        print(f'There are {len(threads)} threads.')

        # Once the limit is reached, wait for the whole batch, then start a new one
        if len(threads) >= max_threads:
            for thread in threads:
                print('Trying to join.')
                thread.join()
                print(f'Thread {thread_number} done!')
                thread_number += 1
            threads = []

        print(f'Starting percent: {percent:.0%}')
        thread_data = data['url'].iloc[int(percent * data.shape[0]): int((percent + 0.05) * data.shape[0])].values
        thread = threading.Thread(target=gather_whois, args=(thread_data, percent))
        thread.start()
        threads.append(thread)

        percent = round((percent + 0.05), 3)

    # Wait for the final batch so every section is saved before returning
    for thread in threads:
        thread.join()
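
As an alternative to managing `threading.Thread` objects by hand, the standard library's `concurrent.futures.ThreadPoolExecutor` caps the number of running threads on its own. The following is only a sketch of that idea: `split_sections`, `run_sections`, and `fake_worker` are hypothetical names, and `fake_worker` stands in for `gather_whois` so that the example runs without any network access.

```python
from concurrent.futures import ThreadPoolExecutor


def split_sections(items, step=0.05):
    """Yield (start_fraction, chunk) pairs that together cover every item once."""
    total = len(items)
    sections = round(1 / step)
    for k in range(sections):
        # Integer arithmetic keeps the boundaries exact, with no float truncation
        start = (k * total) // sections
        stop = ((k + 1) * total) // sections
        yield round(k * step, 3), items[start:stop]


def run_sections(items, worker, max_threads=10, step=0.05):
    """Run worker(chunk, start_fraction) for every section on a bounded thread pool."""
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        futures = [pool.submit(worker, chunk, frac)
                   for frac, chunk in split_sections(items, step)]
        # result() re-raises any exception from the worker instead of losing it
        return [future.result() for future in futures]


# Hypothetical stand-in for gather_whois: it only reports the section size.
def fake_worker(chunk, frac):
    return frac, len(chunk)


results = run_sections(list(range(103)), fake_worker)
print(len(results), sum(size for _, size in results))
```

The pool starts a new section as soon as any worker finishes, rather than waiting for a whole batch of ten to complete.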

In [None]:

def multithreading(data, percent=1.0, max_threads=10):
    """
    Variant of the wrapper above that only processes the first `percent` of the data.

    :param data: Dataframe with a "url" column.
    :param percent: Fraction of the data to process, between 0 and 1. The default is 1.0.
    :param max_threads: The maximum number of threads to run at a given time.
    :return: None.
    """
    threads = []
    thread_number = 1
    total = len(data)

    # Generate one thread per 5% section of the data
    for i in numpy.arange(0, percent, 0.05).round(3):
        print(f'There are {len(threads)} threads.')

        # Once the limit is reached, wait for the whole batch, then start a new one
        if len(threads) >= max_threads:
            for thread in threads:
                print('Trying to join.')
                thread.join()
                print(f'Thread {thread_number} done!')
                thread_number += 1
            threads = []

        print(f'Starting percent: {i} of {percent}%')
        start = int(i * total)
        stop = int(min(i + 0.05, percent) * total)
        thread = threading.Thread(target=gather_whois, args=(data['url'].iloc[start:stop].values, i))
        thread.start()
        threads.append(thread)

    # Wait for the final batch so every section is saved before returning
    for thread in threads:
        thread.join()

In [10]:
def gather_whois(urls, section):
  """
  Queries and saves WHOIS information for the entire list of URLs.

  :param urls: A list of URLs to query.
  :param section: Where this section starts, as a fraction of the dataset (e.g. 0.05).
  :return: None. Saves data into CSV.
  """
  # Ensure Directory Exists
  directory = path.join('datasets', 'whois')
  makedirs(directory, exist_ok=True)
  # Convert the fraction to a whole percentage; round first to avoid float truncation
  section = int(round(section * 100))

  # Query WHOIS for all URLs
  samples = []

  for url in urls:
    try:
      response = whois.whois(url)
    except Exception:
      # Skip URLs whose lookup fails (timeouts, refused connections, unknown hosts)
      continue

    if response.domain_name != "null" and response.domain_name is not None:
      response['sample_url'] = url
      samples.append(dict(response))

  # Save the WHOIS information into a CSV
  df = pd.DataFrame(samples)
  df_name = path.join(directory, f'whois-{section}.csv')
  df.to_csv(df_name)
  print(f"Generated {path.basename(df_name)}")
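
Many URLs in a dataset of this size share a host, so one way to cut down the number of WHOIS queries is to look up each hostname only once. The sketch below uses only the standard library; `hostname_of` and `unique_hostnames` are hypothetical helpers that are not part of the notebook, and the sample URLs are invented. It stops at the hostname: reducing `sub.example.org` to its registrable domain would need a public suffix list, which is not attempted here.

```python
import re
from urllib.parse import urlsplit

# Matches a leading scheme such as "http://" or "ftp://"
_SCHEME = re.compile(r'^[A-Za-z][A-Za-z0-9+.-]*://')


def hostname_of(url):
    """Return the lowercase hostname of a URL, or '' if none can be parsed."""
    url = url.strip()
    if not _SCHEME.match(url):
        # urlsplit only recognises the host when a scheme or leading '//' is present
        url = '//' + url
    try:
        return urlsplit(url).hostname or ''
    except ValueError:
        # Malformed input, e.g. an unbalanced IPv6 bracket
        return ''


def unique_hostnames(urls):
    """Hostnames in first-seen order, so each one is queried only once."""
    return list(dict.fromkeys(h for h in map(hostname_of, urls) if h))


sample = ['http://Example.com/login', 'example.com/about',
          'https://sub.example.org:8443/x?y=1']
print(unique_hostnames(sample))
```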

In [None]:
# Import os and the drive module from the Google Colab library
import os
from google.colab import drive

# Mount your personal Google Drive
drive.mount('/content/drive/')

# Immediately change the current directory to the shared drive.
# This will reduce the chance that your personal drive will be modified erroneously.
os.chdir('/content/drive/Shareddrives/CSEC 620 Group 4/Final Project')

Mounted at /content/drive/


In [None]:
multithreading(dataset)

There are 0 threads.
Starting percent: 0.0 of 1.0%
There are 1 threads.
Starting percent: 0.05 of 1.0%
There are 2 threads.
Starting percent: 0.1 of 1.0%
There are 3 threads.
Starting percent: 0.15 of 1.0%
There are 4 threads.
Starting percent: 0.2 of 1.0%
There are 5 threads.
Starting percent: 0.25 of 1.0%
There are 6 threads.
Starting percent: 0.3 of 1.0%
There are 7 threads.
Starting percent: 0.35 of 1.0%
There are 8 threads.
Starting percent: 0.4 of 1.0%
There are 9 threads.
Starting percent: 0.45 of 1.0%
There are 10 threads.
Trying to join.
Error trying to connect to socket: closing socket - [Errno 11001] getaddrinfo failed
Error trying to connect to socket: closing socket - timed out
Error trying to connect to socket: closing socket - timed out
Error trying to connect to socket: closing socket - timed out
Error trying to connect to socket: closing socket - timed out
Error trying to connect to socket: closing socket - [Errno 11001] getaddrinfo failed
Error trying to connect to so

## 3. Join WHOIS Data and Look Up Labels

In [None]:
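def join_saved_whois():
    """
    Loads every CSV saved in datasets/whois and joins them into one dataframe.

    :return: A pandas dataframe containing all saved WHOIS records.
    """
    directory = path.join('datasets', 'whois')
    filelist = listdir(directory)

    df = pd.DataFrame()
    for f in filelist:
        # Only read CSV files, and read them from the WHOIS directory
        if f.endswith('.csv'):
            print(f)
            currentframe = pd.read_csv(path.join(directory, f))
            df = pd.concat([df, currentframe], ignore_index=True)

    return df


newframe = join_saved_whois()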

whoisdata35.0.csv
whoisdata20.0.csv
whoisdata30.0.csv
whoisdata5.0.csv
whoisdata40.0.csv
whoisdata25.0.csv
whoisdata10.0.csv
whoisdata45.0.csv
whoisdata15.0.csv
whoisdata0.0.csv




In [None]:
benmal = []
nummal = 0
numbenign = 0
for index, row in newframe.iterrows():
    thisurl = row["originalurl"]
    mergedrow = dataset.loc[dataset["url"] == thisurl]
    benigncheck = mergedrow.iloc[0]["label"]
    benmal.append(benigncheck)
    if benigncheck == 0:
        numbenign = numbenign + 1
    else:
        nummal = nummal + 1
print(nummal)
print(numbenign)

Error trying to connect to socket: closing socket - [Errno 111] Connection refused
Error trying to connect to socket: closing socket - timed out
Error trying to connect to socket: closing socket - [Errno -2] Name or service not known
Error trying to connect to socket: closing socket - [Errno 104] Connection reset by peer
Error trying to connect to socket: closing socket - timed out
Error trying to connect to socket: closing socket - timed out
Error trying to connect to socket: closing socket - [Errno 104] Connection reset by peer
Error trying to connect to socket: closing socket - timed out
Error trying to connect to socket: closing socket - timed out
Error trying to connect to socket: closing socket - timed out
Error trying to connect to socket: closing socket - timed out
Error trying to connect to socket: closing socket - timed out
Error trying to connect to socket: closing socket - [Errno 104] Connection reset by peer
Error trying to connect to socket: closing socket - [Errno -2] Na
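
The label lookup above scans the whole `dataset` once for every WHOIS row. Because `merge_raw` leaves each URL with a single label, the same result can be obtained with one dictionary built up front. This is a sketch on invented values: the three lists stand in for `dataset["url"]`, `dataset["label"]`, and `newframe["originalurl"]`, and the `_sketch` names are hypothetical.

```python
# Toy stand-ins for dataset["url"], dataset["label"] and newframe["originalurl"];
# the values are invented for illustration.
dataset_urls = ['a.com/x', 'b.com/y', 'c.com/z']
dataset_labels = [0, 1, 0]
whois_urls = ['c.com/z', 'a.com/x', 'b.com/y', 'a.com/x']

# Build the URL -> label mapping once
label_by_url = dict(zip(dataset_urls, dataset_labels))

# One dictionary lookup per WHOIS row replaces one full scan of the dataset per row
benmal_sketch = [label_by_url[u] for u in whois_urls]
nummal_sketch = sum(benmal_sketch)
numbenign_sketch = len(benmal_sketch) - nummal_sketch

print(benmal_sketch, nummal_sketch, numbenign_sketch)
```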

## 4. Splitting Data

In [None]:
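# The first 30% of rows form testing_set1, the next 30% form training_set,
# and the remaining rows form testing_set2.
# The `+ 1` offsets skip the rows at positions splitVal and splitVal * 2,
# so those two rows end up in no set.
splitVal = int(len(newframe) * .3)
testing_set1 = newframe.iloc[:splitVal, ]
testing_set2 = newframe.iloc[(splitVal * 2) + 1:, ]
training_set = newframe.iloc[splitVal + 1:(splitVal * 2), ]
benmaltraining = benmal[splitVal + 1:(splitVal * 2)]
benmaltesting = benmal[:splitVal]
benmaltesting2 = benmal[(splitVal * 2) + 1:]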

print(len(testing_set1))
print(len(testing_set2))
print(len(training_set))
print(len(benmaltraining))
print(len(benmaltesting))
print(len(benmaltesting2))

1618
2157
1617
1617
1618
2157
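
Python slice stops are exclusive, so adjacent slices can share a boundary index without overlapping. The sketch below computes three contiguous slices that cover every row; `three_way_bounds` is a hypothetical helper, and the row count of 5394 is inferred from the sizes printed above (1618 + 1617 + 2157, plus the two rows that the `+ 1` offsets skip).

```python
def three_way_bounds(n, first=0.3, second=0.3):
    """Return (start, stop) index pairs for three contiguous slices covering range(n)."""
    a = int(n * first)
    b = a + int(n * second)
    return (0, a), (a, b), (b, n)


# Row count inferred from the printed set sizes; treat it as an assumption
bounds = three_way_bounds(5394)
sizes = [stop - start for start, stop in bounds]
print(bounds, sizes)
```

Each pair can be passed straight to `iloc[start:stop]` or to a list slice, and the three sizes add up to the full row count.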


In [None]:
countries = {}
for index, row in training_set.iterrows():
    rowcountry = row["country"]
    if rowcountry in countries.keys():
        countries[rowcountry] = countries[rowcountry] + 1
    else:
        countries[rowcountry] = 1

for c in countries:
    print(c, countries[c])

US 647
CA 59
nan 674
IS 51
IN 16
GB 37
HK 1
AU 7
BR 23
SG 2
ES 3
REDACTED FOR PRIVACY 5
CN 19
VN 1
FR 7
UK 5
TW 1
DE 11
BG 2
NL 6
CY 4
PA 6
JP 11
DK 2
IT 5
PT 1
RU 7
PK 1
China 3
ID 2
TH 4
RO 2
NZ 2
Austria 2
SE 3
AT 2
LU 1
SI 1
CZ 3
SN 1
CR 2
PE 2
SK 1
KR 2
MX 1
TR 2
MY 1
IE 1
UA 1
CH 3
United Kingdom of Great Britain and Northern Ireland (the) 1
my 1
BE 1
KN 1
Malaysia 1
PL 1
PH 1
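
The counts above contain several spellings of the same country (`UK` and `GB`, `China` and `CN`, `my` and `MY`), which a label encoder would treat as unrelated categories. One possible cleanup is sketched below. The alias table is hypothetical and only covers the variants visible in this output, and treating `REDACTED FOR PRIVACY` as missing is an assumption.

```python
from collections import Counter

# Hypothetical alias table built from the variants visible in the counts above;
# a real run would likely need more entries.
COUNTRY_ALIASES = {
    'UK': 'GB',
    'UNITED KINGDOM OF GREAT BRITAIN AND NORTHERN IRELAND (THE)': 'GB',
    'CHINA': 'CN',
    'AUSTRIA': 'AT',
    'MALAYSIA': 'MY',
}
# Assumption: these values carry no country information
MISSING = {'', 'NAN', 'REDACTED FOR PRIVACY'}


def normalize_country(value):
    """Map a raw WHOIS country value to an upper-case code, or None if unknown."""
    key = str(value).strip().upper()
    if key in MISSING:
        return None
    return COUNTRY_ALIASES.get(key, key)


raw = ['US', 'UK', 'GB', 'China', 'CN', 'my', 'Malaysia',
       float('nan'), 'REDACTED FOR PRIVACY']
counts = Counter(normalize_country(v) for v in raw)
print(counts)
```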


In [None]:
regs = {}
for index, row in training_set.iterrows():
    rowreg = row["registrar"]
    if rowreg in regs.keys():
        regs[rowreg] = regs[rowreg] + 1
    else:
        regs[rowreg] = 1

for r in regs:
    print(r, regs[r])

MarkMonitor, Inc. 151
TUCOWS, INC. 44
GoDaddy.com, LLC 277
NAMECHEAP INC 61
Hosting Concepts B.V. d/b/a Registrar.eu 3
PDR Ltd. d/b/a PublicDomainRegistry.com 14
Safenames Ltd 3
CSC CORPORATE DOMAINS, INC. 81
Network Solutions, LLC 111
Key-Systems GmbH 16
Google LLC 19
MarkMonitor Inc. 68
RegistrarSafe, LLC 29
nan 138
DNC Holdings, Inc 6
ENOM, INC. 37
Dreamscape Networks International Pte Ltd 2
DIAMATRIX C.C. 2
Heart Internet Ltd t/a Heart Internet [Tag = HEARTINTERNET] 2
Krystal Hosting Ltd [Tag = KRYSTAL] 1
123-Reg Limited 4
Network Solutions Inc. 1
Wild West Domains, LLC 13
TurnCommerce, Inc. DBA NameBright.com 14
Corporation Service Company 3
MASLEN s.r.o. 1
Gabia, Inc. 2
DYNADOT LLC 2
eName Technology Co.,Ltd. 3
Xiamen 35.Com Technology Co., Ltd. 1
Cloudflare, Inc. 13
Gandi SAS 2
Register.com, Inc. 11
DREAMHOST 1
Internet Domain Service BS Corp 7
mat bao corporation 1
OVH, SAS 5
Domain.com, LLC 10
Nom-IQ Limited t/a Com Laude [Tag = NOMIQ] 1
IONOS SE 7
Tucows.com Co. 4
Misk.com, I

In [None]:
dns = {}
for index, row in training_set.iterrows():
    rowdns = row["dnssec"]
    if rowdns in dns.keys():
        dns[rowdns] = dns[rowdns] + 1
    elif 'unsigned' in str(rowdns).lower() and (rowdns != 'unsigned'):
        dns["unsigned"] = dns.get("unsigned", 0) + 1
    else:
        dns[rowdns] = 1

for r in dns:
    print(r, dns[r])

unsigned 1355
nan 250
Signed delegation 7
signedDelegation 30
Inactive 8
signed delegation 5
no 6
yes 2


In [None]:

percentages = {}
percentagerow = []

for index, row in training_set.iterrows():
    rowdomain = row["domain_name"]
    num = 0
    lett = 0
    for char in rowdomain:
        if char.isalpha():
            lett += 1
        elif char.isnumeric():
            num += 1
    perc = round(num / (num + lett), 1)
    percentagerow.append(perc)
    if perc in percentages.keys():
        percentages[perc] = percentages[perc] + 1
    else:
        percentages[perc] = 1

for p in percentages:
    print(p, percentages[p])

0.0 1598
0.5 3
0.2 23
0.7 2
0.1 24
0.4 6
0.3 7


In [None]:
from sklearn import preprocessing
from sklearn.naive_bayes import GaussianNB

le1 = preprocessing.LabelEncoder()
url1 = le1.fit_transform(training_set["originalurl"])
countries = le1.fit_transform(training_set["country"])
features1 = [[url1[i], countries[i]] for i in range(0, len(url1))]
label1 = le1.fit_transform(benmaltraining)

model1 = GaussianNB()
model1.fit(features1, label1)

GaussianNB()

In [None]:
le2 = preprocessing.LabelEncoder()
url2 = le2.fit_transform(training_set["originalurl"])
regs = le2.fit_transform(training_set["registrar"])
features2 = [[url2[i], regs[i]] for i in range(0, len(url2))]
label2 = le2.fit_transform(benmaltraining)
model2 = GaussianNB()
model2.fit(features2, label2)

GaussianNB()

In [None]:
le3 = preprocessing.LabelEncoder()
url3 = le3.fit_transform(training_set["originalurl"])
dnses = le3.fit_transform(training_set["dnssec"])
features3 = [[url3[i], dnses[i]] for i in range(0, len(url3))]
label3 = le3.fit_transform(benmaltraining)
model3 = GaussianNB()
model3.fit(features3, label3)

GaussianNB()

In [None]:
le4 = preprocessing.LabelEncoder()
url4 = le4.fit_transform(training_set["originalurl"])
percents = le4.fit_transform(percentagerow)
features4 = [[url4[i], percents[i]] for i in range(0, len(url4))]
label4 = le4.fit_transform(benmaltraining)
model4 = GaussianNB()
model4.fit(features4, label4)

GaussianNB()

In [None]:
test1url = le1.fit_transform(testing_set1["originalurl"])
test1country = le1.fit_transform(testing_set1["country"])
test1feats = [[test1url[i], test1country[i]] for i in range(0, len(test1url))]
test1final = model1.predict(test1feats)
test1final

array([0, 0, 0, ..., 0, 0, 0])
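
Calling `fit_transform` on the test data relearns the category-to-integer mapping from the test values, so the code assigned to a given country during testing is not guaranteed to match the code the model saw during training. The usual pattern is to fit once on the training data and only transform afterwards. The class below is a tiny illustrative encoder written for this sketch, not scikit-learn's API; scikit-learn's `OrdinalEncoder` with `handle_unknown='use_encoded_value'` offers comparable behaviour.

```python
class FittedCategoryEncoder:
    """Tiny illustrative encoder: codes are learned once and reused for new data."""

    def fit(self, values):
        categories = sorted({str(v) for v in values})
        # Code 0 is reserved for categories never seen during fit
        self.codes_ = {cat: i + 1 for i, cat in enumerate(categories)}
        return self

    def transform(self, values):
        return [self.codes_.get(str(v), 0) for v in values]


# Invented sample values
train_countries = ['US', 'CA', 'US', 'IS']
test_countries = ['IS', 'US', 'BR']

encoder = FittedCategoryEncoder().fit(train_countries)
train_codes = encoder.transform(train_countries)
test_codes = encoder.transform(test_countries)
print(train_codes, test_codes)
```

`US` and `IS` keep the same codes in both lists, while the unseen `BR` falls back to the reserved code 0.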

In [None]:
test2url = le2.fit_transform(testing_set1["originalurl"])
test2reg = le2.fit_transform(testing_set1["registrar"])
test2feats = [[test2url[i], test2reg[i]] for i in range(0, len(test2url))]
test2final = model2.predict(test2feats)
test2final

array([0, 0, 0, ..., 0, 0, 0])

In [None]:
test3url = le3.fit_transform(testing_set1["originalurl"])
test3dns = le3.fit_transform(testing_set1["dnssec"])
test3feats = [[test3url[i], test3dns[i]] for i in range(0, len(test3url))]
test3final = model3.predict(test3feats)
test3final

array([0, 0, 0, ..., 0, 0, 1])

In [None]:
percenttest = []

for index, row in testing_set1.iterrows():
    rowdomain = row["domain_name"]
    num = 0
    lett = 0
    for char in rowdomain:
        if char.isalpha():
            lett += 1
        elif char.isnumeric():
            num += 1
    perc = round(num / (num + lett), 1)
    percenttest.append(perc)

test4url = le4.fit_transform(testing_set1["originalurl"])
test4percent = le4.fit_transform(percenttest)
test4feats = [[test4url[i], test4percent[i]] for i in range(0, len(test4url))]
test4final = model4.predict(test4feats)
test4final

array([0, 0, 0, ..., 0, 0, 0])

In [None]:
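# Runs all four Naive Bayes models on a given testing set and returns their predictions, which are later fed to the SVM.
# The printed reports compare against benmaltesting2, so they are only meaningful when testingset is testing_set2.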
def nb(testingset):
    test1url = le1.fit_transform(testingset["originalurl"])
    test1country = le1.fit_transform(testingset["country"])
    test1feats = [[test1url[i], test1country[i]] for i in range(0, len(test1url))]
    test1finalf = model1.predict(test1feats)
    print("Country NB:")
    print(classification_report(benmaltesting2, test1finalf))

    test2url = le2.fit_transform(testingset["originalurl"])
    test2reg = le2.fit_transform(testingset["registrar"])
    test2feats = [[test2url[i], test2reg[i]] for i in range(0, len(test2url))]
    test2finalf = model2.predict(test2feats)
    print("Registrar NB:")
    print(classification_report(benmaltesting2, test2finalf))

    test3url = le3.fit_transform(testingset["originalurl"])
    test3dns = le3.fit_transform(testingset["dnssec"])
    test3feats = [[test3url[i], test3dns[i]] for i in range(0, len(test3url))]
    test3finalf = model3.predict(test3feats)
    print("DNSSEC NB:")
    print(classification_report(benmaltesting2, test3finalf))

    percenttest = []

    for index, row in testingset.iterrows():
        rowdomain = row["domain_name"]
        num = 0
        lett = 0
        for char in rowdomain:
            if char.isalpha():
                lett += 1
            elif char.isnumeric():
                num += 1
        perc = round(num / (num + lett), 1)
        percenttest.append(perc)

    test4url = le4.fit_transform(testingset["originalurl"])
    test4percent = le4.fit_transform(percenttest)
    test4feats = [[test4url[i], test4percent[i]] for i in range(0, len(test4url))]
    test4finalf = model4.predict(test4feats)
    print("Percent of Number NB:")
    print(classification_report(benmaltesting2, test4finalf))

    return [test1finalf, test2finalf, test3finalf, test4finalf]

## 5. SVM Aggregation

In [None]:
# The SVM itself. trainingdata and testingdata are 2D matrices in which each row is one sample
# and each column is the prediction of one Naive Bayes model.
# Changing the kernel may be worth trying, since the current results are poor.
from sklearn.metrics import classification_report


def svm_aggregate(trainingdata, testingdata):
    y_train = benmaltesting
    y_test = benmaltesting2
    x_train = trainingdata
    x_test = testingdata
    svmclassifier = SVC(kernel='linear')
    svmclassifier.fit(x_train, y_train)
    y_predictions = svmclassifier.predict(x_test)
    print("SVM:")
    print(classification_report(y_test, y_predictions))

In [None]:
# Helper that transposes the Naive Bayes results: it takes one array of predictions per model
# and returns one row per sample.
def aggregator(data):
    outputdata = []
    i = 0
    row = []
    while i < len(data[0]):
        for d in data:
            row.append(d[i])
        outputdata.append(row)
        row = []
        i += 1
    return outputdata
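
The helper above is a transpose, which `zip` can express in one line. A sketch under the assumption that all prediction arrays have the same length (the name `aggregator_zip` and the sample predictions are made up):

```python
def aggregator_zip(data):
    """Transpose a list of equal-length prediction arrays into one row per sample."""
    return [list(row) for row in zip(*data)]


# Made-up predictions from four models on three samples
predictions = [[0, 0, 1], [0, 1, 1], [0, 0, 0], [1, 0, 1]]
print(aggregator_zip(predictions))
```

One difference worth knowing: if the arrays had unequal lengths, `zip` would silently stop at the shortest one, whereas the index-based loop would raise an `IndexError`.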

In [None]:

# Handler for the SVM: it reshapes the data into a usable form and then calls svm_aggregate.
# d2aggregate is the training data to aggregate, i.e. the Naive Bayes predictions on the first testing set.
def svm_handler(d2aggregate):
    trainingdata = aggregator(d2aggregate)
    testingdata = aggregator(nb(testing_set2))
    svm_aggregate(trainingdata, testingdata)


svm_handler([test1final, test2final, test3final, test4final])

Country NB:
              precision    recall  f1-score   support

           0       0.93      1.00      0.97      2073
           1       0.00      0.00      0.00       146

    accuracy                           0.93      2219
   macro avg       0.47      0.50      0.48      2219
weighted avg       0.87      0.93      0.90      2219

Registrar NB:
              precision    recall  f1-score   support

           0       0.93      1.00      0.97      2073
           1       0.00      0.00      0.00       146

    accuracy                           0.93      2219
   macro avg       0.47      0.50      0.48      2219
weighted avg       0.87      0.93      0.90      2219

DNSSEC NB:
              precision    recall  f1-score   support

           0       0.94      0.99      0.96      2073
           1       0.14      0.03      0.05       146

    accuracy                           0.92      2219
   macro avg       0.54      0.51      0.50      2219
weighted avg       0.88      0.92    

  _warn_prf(average, modifier, msg_start, len(result))
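
The 0.93 accuracy in the Country and Registrar reports is what a classifier would score by answering "benign" for every sample, which is consistent with their 0.00 precision and recall on class 1. The arithmetic, using the support counts from the reports above:

```python
# Support counts taken from the classification reports above
benign, malicious = 2073, 146
total = benign + malicious

# A classifier that always answers "benign" gets every benign sample right...
baseline_accuracy = benign / total
# ...but finds none of the malicious ones
baseline_recall_malicious = 0 / malicious

print(round(baseline_accuracy, 2), baseline_recall_malicious)
```

With classes this imbalanced, the per-class recall and the macro average say more about the models than the overall accuracy does.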
