# Swordphish Testing

A Python notebook that shows how Swordphish features can be used and how the API testing tool works.


**Requirements:**

* Python 3.5+
* tldextract
* Pandas (version)
* requests
* json
* re
* sys
* os.listdir
* colorama
* numpy


In [1]:
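import time  # used below for time.time(); imported explicitly rather than relying on the star imports
import math
import pandas as pd
import numpy as np
from extract_urls import *
from swordphish_api import *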

SWORDPHISH_API = 'https://api.easysol.io/swordphish/'
SWORDPHISH_APIKEY = '' # Please specify your API KEY
SAMPLE_DIRECTORY = 'sample/'

### These are the options the user has for extracting the URLs

In [2]:
# 1. Default extraction from a CSV file
# This option reads the file and extracts the first column that contains urls
url_array = extract_urls_default(SAMPLE_DIRECTORY)

# 2. Override the default and choose column
# In this option the user chooses the column to be extracted
url_array = extract_urls_override(SAMPLE_DIRECTORY, 1)

# 3. Users can read the CSV file themselves
# and extract the URLs manually
file_content = pd.read_csv(SAMPLE_DIRECTORY + 'combined.csv').values.tolist()
file_content = pd.DataFrame(file_content)
file_content.columns = ['url','classification']
file_content.sample(10, random_state=42)
# file_content = pd.read_csv(SAMPLE_DIRECTORY + 'test.csv', usecols=[1])
# file_content.columns = ['url']
# file_content.sample(10, random_state=42)

| | url | classification |
|---|---|---|
| 1860 | http://of8fbkm65.biennale.info/ | 1 |
| 353 | http://www.conservapedia.com/index.php?title=S... | 0 |
| 1333 | http://fouadstationary.com/sparkasse/sparkasse... | 1 |
| 905 | http://www.sunadvocate.com/index.php?tier=1&ar... | 0 |
| 1289 | http://www.aokpage.com/%7Exinhvl/loginphpsited... | 1 |
| 1273 | http://www.turnoverball.com/pp/reception@brown... | 1 |
| 938 | https://www.gfg.com/cgibin/shopping.cgi?card=3... | 0 |
| 1731 | http://nqq2129uff4106.ks600.de/weber/ | 1 |
| 65 | http://www.sporcle.com/games/sunshine423/picactr | 0 |
| 1323 | http://www.pp-ermittlung-deutsch.net/konflikt/... | 1 |
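
The default option above is described as picking the first column that contains URLs. The following is a minimal sketch of that idea using only the standard library. The helper name `first_url_column` and the two inline sample rows are hypothetical, and the real `extract_urls_default` may work differently.

```python
import csv
import io
import re

# Assumption: a cell counts as a URL if it starts with http:// or https://
URL_PATTERN = re.compile(r"^https?://", re.IGNORECASE)


def first_url_column(rows):
    """Return the index of the first column whose cells all look like URLs, or None."""
    if not rows:
        return None
    for col in range(len(rows[0])):
        cells = [row[col] for row in rows if len(row) > col]
        if cells and all(URL_PATTERN.match(cell.strip()) for cell in cells):
            return col
    return None


# Illustrative in-memory CSV (index, url, label); not the real sample file
sample_csv = io.StringIO(
    "0,http://of8fbkm65.biennale.info/,1\n"
    "1,http://www.sporcle.com/games/sunshine423/picactr,0\n"
)
rows = list(csv.reader(sample_csv))
url_col = first_url_column(rows)
urls = [row[url_col] for row in rows]
print(url_col, urls)
```

Detecting the column from the cell contents, rather than from a header name, keeps the sketch usable on files with or without a header row, as long as the header row is skipped before the check.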


### Swordphish can process only 1000 URLs at a time. If the URL array is longer than that, we need to send it in batches of at most 1000

In [3]:
final_array = []
length = file_content.shape[0]
print("Number of URLs being tested: " + str(length))
url_array = file_content[['url']]
index = np.array_split(np.arange(0,length), math.ceil(length / 1000))
for index_ in index:
    final_array.append(url_array.iloc[index_].values.T.tolist()[0])

Number of URLs being tested: 2000
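
The same batching can be sketched without NumPy by slicing a plain list. This is only an illustration: `make_batches` is a hypothetical helper, and the `example.com` URLs are placeholders rather than data from the sample file.

```python
def make_batches(items, batch_size=1000):
    """Split a list into consecutive batches of at most batch_size items."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


# Placeholder URLs, used only to show the batch sizes
urls = ["http://example.com/page{}".format(i) for i in range(2500)]
batches = make_batches(urls)
print([len(batch) for batch in batches])  # [1000, 1000, 500]
```

Slicing fills each batch up to the limit and leaves the remainder in the last one, whereas `np.array_split` spreads the items almost evenly across the batches. Both approaches keep every batch within the 1000-URL limit.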


### Now we call Swordphish once per batch and measure the time it takes to run all the queries

In [4]:
start_time = time.time()  # starts counting time
final_results = []
for batch in final_array:
    params = {
      "urlArray": batch,
      "force_clf": False
    }
    results = call_swordphish(SWORDPHISH_APIKEY, params)  # calls Swordphish
    final_results += results
sphish_time = round((time.time() - start_time)*1000,2)  # ends the counter
avg_query_time = round(sphish_time / length, 2)  # calculates average time per query

print("** SWORDPHISH PROCESS TIMING ** ")
print("-- Total time elapsed:     " + str(sphish_time) + "ms")
print("-- Average time per query: " + str(avg_query_time) + "ms")

** SWORDPHISH PROCESS TIMING ** 
-- Total time elapsed:     26572.03ms
-- Average time per query: 13.29ms
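
The timing logic can also be wrapped in a small helper. The sketch below assumes a stand-in function `fake_call` in place of the real API call, which needs a key and network access. Both `fake_call` and `timed_batches` are hypothetical names. It uses `time.perf_counter`, which is intended for measuring elapsed intervals.

```python
import time


def fake_call(batch):
    """Hypothetical stand-in for the API call: returns one result per URL."""
    return [(url, 0.0) for url in batch]


def timed_batches(batches, call):
    """Run call on every batch; return results, total ms and average ms per query."""
    start = time.perf_counter()
    results = []
    for batch in batches:
        results.extend(call(batch))
    elapsed_ms = (time.perf_counter() - start) * 1000
    # Guard against division by zero when no results come back
    avg_ms = elapsed_ms / len(results) if results else 0.0
    return results, round(elapsed_ms, 2), round(avg_ms, 2)


results, total_ms, avg_ms = timed_batches([["a", "b"], ["c"]], fake_call)
print("-- Total time elapsed:     " + str(total_ms) + "ms")
print("-- Average time per query: " + str(avg_ms) + "ms")
```

The average reported in the run above follows the same arithmetic: 26572.03 ms divided by 2000 URLs is about 13.29 ms per query.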


## Now we can see the results for each of the different calculations:
### 1. Phishing

In [5]:
phishing_stats = calculate_stats("PHISHING", 2, final_results)
print(phishing_stats)

** PHISHING STATS  **
50.65% of the links have been categorized as PHISHING.
2.05% of the links have not been marked completely safe.
47.3% of the links are safe.



### 2. DGA

In [6]:
dga_stats = calculate_stats("DGA", 3, final_results)
print(dga_stats)

** DGA STATS  **
2.0% of the links have been categorized as DGA.
2.05% of the links have not been marked completely safe.
95.95% of the links are safe.



### 3. Malware 

In [7]:
malware_stats = calculate_stats("MALWARE", 4, final_results)
print(malware_stats)

** MALWARE STATS  **
28.9% of the links have been categorized as MALWARE.
8.75% of the links have not been marked completely safe.
62.35% of the links are safe.



### Results comparison:

In [8]:
test_labels = file_content[['classification']].values.T.tolist()[0]
final_results = classify(final_results)
final_results = pd.DataFrame(final_results)
final_results.columns = ['URL', 'Rank', 'Phishing Score', 'DGA Score', 'Malware Score', 'classification']
class_labels = final_results[['classification']].values.T.tolist()[0]

In [9]:
tp, fp, tn, fn = 0,0,0,0
for i in range(len(test_labels)):
    if(test_labels[i] == 0 and class_labels[i] == 0):
        tn += 1
    elif(test_labels[i] == 1 and class_labels[i] == 0):
        fn += 1
    elif(test_labels[i] == 1 and class_labels[i] == 1):
        tp += 1
    else:
        fp += 1

In [10]:
tp_percentage = round(tp/length*100,2)
fp_percentage = round(fp/length*100,2)
tn_percentage = round(tn/length*100,2)
fn_percentage = round(fn/length*100,2)

In [11]:
print(str(tp_percentage) + '% of the classifications were True Positives')
print(str(fp_percentage) + '% of the classifications were False Positives')
print(str(tn_percentage) + '% of the classifications were True Negatives')
print(str(fn_percentage) + '% of the classifications were False Negatives')

48.0% of the classifications were True Positives
2.65% of the classifications were False Positives
47.35% of the classifications were True Negatives
2.0% of the classifications were False Negatives


In [12]:
correct = round((tp+tn)/len(test_labels)*100, 2)
wrong = round((fp+fn)/len(test_labels)*100, 2)
print(str(correct) + '% of the classifications were correct')
print(str(wrong) + '% of the classifications were wrong')

95.35% of the classifications were correct
4.65% of the classifications were wrong
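
The four counts also give precision, recall and F1, which are often more informative than raw percentages. The sketch below is an illustration, not part of the Swordphish tooling: `classification_metrics` is a hypothetical helper, and the counts are back-calculated from the percentages printed above and the 2000 tested URLs (48.0% is 960, 2.65% is 53, 47.35% is 947, 2.0% is 40).

```python
def classification_metrics(tp, fp, tn, fn):
    """Derive standard binary-classification metrics from confusion-matrix counts."""
    total = tp + fp + tn + fn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (tp + tn) / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Counts derived from the printed percentages, not read from the API results
metrics = classification_metrics(tp=960, fp=53, tn=947, fn=40)
for name in ("precision", "recall", "f1", "accuracy"):
    print(name + ": " + str(round(metrics[name] * 100, 2)) + "%")
```

The accuracy computed here, (960 + 947) / 2000, agrees with the 95.35% of correct classifications reported above.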


In [None]:
compared_res = []
# file_content = file_content.values.tolist()
final_results = pd.DataFrame(final_results)
final_results.columns = ['url','rank', 'phishing', 'dga', 'malware', 'classification']
get_res = final_results[['url', 'phishing', 'classification']].values.tolist()
for i in range(len(final_results)):
    tup = tuple(get_res[i]) + (test_labels[i],)
    compared_res.append(tup)

df_url = pd.DataFrame(compared_res, columns = ['URL', 'score', 'Classification', 'original'])
df_url.to_csv('comp_res.csv')

#### Finally, we can create a CSV file that contains all the results

In [None]:
create_csv(final_results, 'sample')

In [None]:
results_csv = pd.read_csv('swordphish_sample_results.csv', index_col=0, header=None, names=['url','rank', 'phishing', 'dga', 'malware'])
print(results_csv.iloc[:5])

### We can select which results we want to see, such as the phishing results:

In [None]:
phish_res = results_csv[['url','phishing']]
print(phish_res[:10])

#### The whole process can also be run on the domains extracted from the URLs

In [None]:
url_array = pd.read_csv(SAMPLE_DIRECTORY + 'combined.csv', usecols=[0]).values.T.tolist()[0]
domain_array = extract_domains(url_array)
domain_array = pd.DataFrame(domain_array)
domain_array.columns = ['domain']
domain_array.sample(10, random_state=42)
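
The notebook does not show the internals of `extract_domains`. As a rough illustration of the idea, the sketch below pulls the hostname out of each URL with the standard library's `urllib.parse`. This is a stand-in technique: the real helper may instead use `tldextract` (listed in the requirements) to reduce hostnames to registered domains, and `hostnames_from_urls` is a hypothetical name.

```python
from urllib.parse import urlparse


def hostnames_from_urls(urls):
    """Return the hostname of every URL that has one, skipping unparsable entries."""
    hosts = []
    for url in urls:
        host = urlparse(url).hostname  # None when the string has no network location
        if host:
            hosts.append(host)
    return hosts


sample_urls = [
    "http://of8fbkm65.biennale.info/",
    "http://www.sporcle.com/games/sunshine423/picactr",
    "not a url",
]
print(hostnames_from_urls(sample_urls))
```

Skipping entries without a hostname shortens the list, so it no longer lines up row by row with the original labels. If the labels are needed for comparison, keep a placeholder for those entries instead.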

#### Now the whole process is repeated

In [None]:
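final_array = []
length = domain_array.shape[0]
# Batch the extracted domains (not the original URLs); np.array_split also
# handles lengths that do not divide evenly, where np.split would raise an error
index = np.array_split(np.arange(0, length), math.ceil(length / 1000))
for index_ in index:
    final_array.append(domain_array.iloc[index_].values.T.tolist()[0])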

In [None]:
start_time = time.time()  # starts counting time
final_results = []
for batch in final_array:
    params = {
      "urlArray": batch,
      "force_clf": True
    }
    results = call_swordphish(SWORDPHISH_APIKEY, params)  # calls Swordphish
    final_results += results
sphish_time = round((time.time() - start_time) * 1000, 2)  # ends the counter
avg_query_time = round(sphish_time / length, 2)  # calculates average time per query
print("** SWORDPHISH PROCESS TIMING ** ")
print("-- Total time elapsed:     " + str(sphish_time) + "ms")
print("-- Average time per query: " + str(avg_query_time) + "ms")

In [None]:
phishing_stats = calculate_stats("PHISHING", 2, final_results)
print(phishing_stats)
dga_stats = calculate_stats("DGA", 3, final_results)
print(dga_stats)
malware_stats = calculate_stats("MALWARE", 4, final_results)
print(malware_stats)