# Swordphish Testing

A Python notebook that shows how Swordphish features can be used and how the API testing tool works.


**Requirements:**

* Python 3.5+
* tldextract
* Pandas (version)
* requests
* json
* re
* sys
* os.listdir
* colorama
* numpy


In [1]:
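import math
import time  # used below to time the Swordphish calls

import numpy as np
import pandas as pd

# Local helper modules: URL extraction and the Swordphish API client
from extract_urls import *
from swordphish_api import *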

SWORDPHISH_API = 'https://api.easysol.io/swordphish/'
SWORDPHISH_APIKEY = '' # Please specify your API KEY
SAMPLE_DIRECTORY = 'sample/'

### These are the options the user has for extracting the URLs

In [2]:
# 1. Default extraction from a CSV file
# Reads the file and extracts the first column that contains URLs
url_array = extract_urls_default(SAMPLE_DIRECTORY)

# 2. Override the default and choose the column
# The user chooses which column to extract
url_array = extract_urls_override(SAMPLE_DIRECTORY, 1)

# 3. Read the CSV file and extract the URLs manually
file_content = pd.read_csv(SAMPLE_DIRECTORY + 'combined.csv')
file_content.columns = ['url', 'classification']
file_content.sample(10, random_state=42)

| | url | classification |
|---|---|---|
| 1860 | http://libertyhotelsitges.com/remaxlistings/vi... | 1 |
| 353 | http://cakejournal.com/tutorials/cupcake-decor... | 0 |
| 1333 | https://srv34.prodns.com.br/~psjb/update-your-... | 1 |
| 905 | http://www.gardenguides.com/113739-different-k... | 0 |
| 1289 | http://hermesbookmarks.com/includes/player/bom... | 1 |
| 1273 | http://californiaimport.de/administrator/cache... | 1 |
| 938 | http://www.xlathlete.com/view_exercise.jsp?exe... | 0 |
| 1731 | http://www.conceptplace.com.br/%7Enative/image... | 1 |
| 65 | http://www.mysoutex.com/pages/full_story_landi... | 0 |
| 1323 | http://pf.unze.ba/nova/CV | 1 |
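
The implementation of `extract_urls_default` is not shown in this notebook. As a rough sketch of what "extract the first column that contains URLs" could look like, the hypothetical helper `first_url_column` below scans parsed CSV rows and returns the first column whose cells all start with `http://` or `https://`. The helper name, the prefix check, and the inline sample data are assumptions made for illustration only.

```python
import csv
import io


def first_url_column(rows):
    """Return the values of the first column whose cells all look like URLs.

    `rows` is a list of equally long lists (one per CSV row, header excluded).
    Hypothetical helper; it is not part of extract_urls.
    """
    if not rows:
        return []
    for col in range(len(rows[0])):
        values = [row[col] for row in rows]
        if all(v.startswith(("http://", "https://")) for v in values):
            return values
    return []


# Inline sample standing in for a file such as sample/combined.csv
sample = io.StringIO(
    "id,url,classification\n"
    "0,http://pf.unze.ba/nova/CV,1\n"
    "1,https://example.com/page,0\n"
)
reader = csv.reader(sample)
next(reader)  # skip the header row
urls = first_url_column(list(reader))
print(urls)
```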


### Swordphish can process only 1000 URLs at a time. If the array of URLs is longer than that, it must be sent in batches of at most 1000

In [3]:
final_array = []
length = file_content.shape[0]
print("Number of URLs being tested: " + str(length))
url_array = file_content[['url']]
# Split the row positions into batches of at most 1000
index = np.array_split(np.arange(0, length), math.ceil(length / 1000))
for index_ in index:
    final_array.append(url_array.iloc[index_].values.T.tolist()[0])

Number of URLs being tested: 2000
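
The same batching idea can be sketched without NumPy. The helper name `make_batches` and the generated URLs below are assumptions for illustration; the sketch simply slices a list into consecutive chunks of at most 1000 items.

```python
def make_batches(items, batch_size=1000):
    """Split `items` into consecutive lists of at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


# Placeholder URLs, only used to demonstrate the batch sizes
urls = ["http://example.com/%d" % i for i in range(2500)]
batches = make_batches(urls)
print([len(b) for b in batches])  # [1000, 1000, 500]
```

Note the difference in design: plain slicing fills every batch except the last, whereas `np.array_split` produces batches of nearly equal size (for 2500 items, 834, 833 and 833). Either approach keeps each batch within the 1000-URL limit.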


### Now we call Swordphish once per batch and measure the time it takes to run all the queries

In [4]:
start_time = time.time()  # starts counting time
final_results = []
for batch in final_array:
    params = {
      "urlArray": batch,
      "force_clf": True
    }
    results = call_swordphish(SWORDPHISH_APIKEY, params)  # calls Swordphish
    final_results += results
sphish_time = round((time.time() - start_time)*1000,2)  # ends the counter
avg_query_time = round(sphish_time / length, 2)  # calculates average time per query

print("** SWORDPHISH PROCESS TIMING ** ")
print("-- Total time elapsed:     " + str(sphish_time) + "ms")
print("-- Average time per query: " + str(avg_query_time) + "ms")

** SWORDPHISH PROCESS TIMING ** 
-- Total time elapsed:     6219.46ms
-- Average time per query: 3.11ms
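
In the run above, the average follows directly from the total: 6219.46 ms / 2000 URLs ≈ 3.11 ms per URL. The timing pattern can be isolated in a small helper, sketched here under assumptions: `time_batches` and `fake_call` are hypothetical names, `fake_call` stands in for the real API call, and `time.perf_counter` is used instead of `time.time` because it is intended for measuring elapsed intervals.

```python
import time


def time_batches(batches, call):
    """Run `call` on each batch; return (results, total_ms, avg_ms_per_item)."""
    start = time.perf_counter()
    results = []
    n_items = 0
    for batch in batches:
        results.extend(call(batch))
        n_items += len(batch)
    total_ms = (time.perf_counter() - start) * 1000
    avg_ms = total_ms / n_items if n_items else 0.0
    return results, round(total_ms, 2), round(avg_ms, 2)


def fake_call(batch):
    # Stand-in for the API call: returns one [url, score] row per URL
    return [[url, 0.0] for url in batch]


results, total_ms, avg_ms = time_batches([["a", "b"], ["c"]], fake_call)
print(len(results), total_ms, avg_ms)
```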


## Now we can see the results for each of the different calculations:
### 1. Phishing

In [5]:
phishing_stats = calculate_stats("PHISHING", 2, final_results)
print(phishing_stats)

IndexError: string index out of range
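
The output above does not show where the error originates. One possible cause, offered only as an assumption: if `call_swordphish` returns a string (for example an error message when `SWORDPHISH_APIKEY` is left empty) instead of a list of rows, `final_results += results` adds that string one character at a time, and indexing a one-character string at position 2 raises exactly this `IndexError`. The sketch below reproduces that behaviour and shows a hypothetical guard, `extend_with_rows`, that rejects non-list responses early.

```python
def extend_with_rows(all_rows, response):
    """Append `response` rows to `all_rows`, rejecting anything that is not a list of rows.

    Hypothetical guard; the real return type of call_swordphish is not shown here.
    """
    is_rows = isinstance(response, list) and all(
        isinstance(row, (list, tuple)) for row in response
    )
    if not is_rows:
        raise TypeError("unexpected API response: %r" % (response,))
    all_rows.extend(response)
    return all_rows


# Without a guard, a string response is added one character at a time
rows = []
rows += "error"
print(rows)  # ['e', 'r', 'r', 'o', 'r']
try:
    rows[0][2]
except IndexError as exc:
    print(exc)  # string index out of range
```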

### 2. DGA

In [None]:
dga_stats = calculate_stats("DGA", 3, final_results)
print(dga_stats)

### 3. Malware 

In [None]:
malware_stats = calculate_stats("MALWARE", 4, final_results)
print(malware_stats)

### Results comparison:

In [None]:
test_labels = file_content[['classification']].values.T.tolist()[0]
final_results = classify(final_results)
final_results = pd.DataFrame(final_results)
final_results.columns = ['URL', 'Rank', 'Phishing Score', 'DGA Score', 'Malware Score', 'classification']
class_labels = final_results[['classification']].values.T.tolist()[0]

In [None]:
tp, fp, tn, fn = 0,0,0,0
for i in range(len(test_labels)):
    if(test_labels[i] == 0 and class_labels[i] == 0):
        tn += 1
    elif(test_labels[i] == 1 and class_labels[i] == 0):
        fn += 1
    elif(test_labels[i] == 1 and class_labels[i] == 1):
        tp += 1
    else:
        fp += 1

In [None]:
tp_percentage = round(tp/length*100,2)
fp_percentage = round(fp/length*100,2)
tn_percentage = round(tn/length*100,2)
fn_percentage = round(fn/length*100,2)

In [None]:
print(str(tp_percentage) + '% of the classifications were True Positives')
print(str(fp_percentage) + '% of the classifications were False Positives')
print(str(tn_percentage) + '% of the classifications were True Negatives')
print(str(fn_percentage) + '% of the classifications were False Negatives')

In [None]:
correct = round((tp+tn)/len(test_labels)*100, 2)
wrong = round((fp+fn)/len(test_labels)*100, 2)
print(str(correct) + '% of the classifications were correct')
print(str(wrong) + '% of the classifications were wrong')
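
The comparison above amounts to a binary confusion matrix. As a compact sketch, the hypothetical helper `confusion_counts` below counts the four outcomes with 1 as the positive class and, unlike a catch-all `else`, raises on any label other than 0 or 1. The toy labels, and the precision and recall lines, are additions for illustration and are not results from Swordphish.

```python
def confusion_counts(y_true, y_pred):
    """Count (tp, fp, tn, fn) for binary labels where 1 is the positive class."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 0 and pred == 1:
            fp += 1
        elif truth == 0 and pred == 0:
            tn += 1
        elif truth == 1 and pred == 0:
            fn += 1
        else:
            raise ValueError("labels must be 0 or 1")
    return tp, fp, tn, fn


# Toy labels for illustration only
y_true = [1, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 1]
tp, fp, tn, fn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp) if tp + fp else 0.0
recall = tp / (tp + fn) if tp + fn else 0.0
print(tp, fp, tn, fn)  # 2 1 1 1
print(accuracy)  # 0.6
```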

#### Finally, we can create a CSV file that contains all the results

In [None]:
# Note: `results` holds only the last batch returned by call_swordphish in the loop above
create_csv(results, 'sample')

In [None]:
results_csv = pd.read_csv(
    'swordphish_sample_results.csv',
    index_col=0,
    header=None,
    names=['url', 'rank', 'phishing', 'dga', 'malware'],
)
print(results_csv.iloc[:5])

### We can select which results we want to see, such as the phishing results:

In [None]:
phish_res = results_csv[['url','phishing']]
print(phish_res[:10])

#### The whole process can also be run on the domains extracted from the URLs

In [None]:
url_array = pd.read_csv(SAMPLE_DIRECTORY + 'combined.csv', usecols=[0]).values.T.tolist()[0]
domain_array = extract_domains(url_array)
domain_array = pd.DataFrame(domain_array)
domain_array.columns = ['domain']
domain_array.sample(10, random_state=42)
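
`extract_domains` comes from the wildcard imports and its implementation is not shown here; given the requirements list it may rely on `tldextract`, but that is an assumption. As a standard-library stand-in, which is a different and simpler technique, the sketch below pulls the hostname out of each URL with `urllib.parse.urlparse`. Unlike `tldextract`, it does not separate the registered domain from its subdomains. The helper name `hostnames` is hypothetical.

```python
from urllib.parse import urlparse


def hostnames(urls):
    """Return the lower-cased hostname of each URL, or '' when none can be parsed."""
    return [urlparse(url).hostname or "" for url in urls]


hosts = hostnames([
    "http://pf.unze.ba/nova/CV",
    "https://www.example.com/a?b=1",
    "not a url",
])
print(hosts)  # ['pf.unze.ba', 'www.example.com', '']
```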

#### Now the whole process is repeated

In [None]:
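final_array = []
length = domain_array.shape[0]
# np.array_split (unlike np.split) also works when the length is not evenly divisible
index = np.array_split(np.arange(0, length), math.ceil(length / 1000))
for index_ in index:
    # Batch the extracted domains rather than the original URLs
    final_array.append(domain_array.iloc[index_].values.T.tolist()[0])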

In [None]:
start_time = time.time()  # starts counting time
final_results = []
for batch in final_array:
    params = {
      "urlArray": batch,
      "force_clf": True
    }
    results = call_swordphish(SWORDPHISH_APIKEY, params)  # calls Swordphish
    final_results += results
sphish_time = round((time.time() - start_time) * 1000, 2)  # ends the counter
avg_query_time = round(sphish_time / length, 2)  # calculates average time per query
print("** SWORDPHISH PROCESS TIMING ** ")
print("-- Total time elapsed:     " + str(sphish_time) + "ms")
print("-- Average time per query: " + str(avg_query_time) + "ms")

In [None]:
phishing_stats = calculate_stats("PHISHING", 2, final_results)
print(phishing_stats)
dga_stats = calculate_stats("DGA", 3, final_results)
print(dga_stats)
malware_stats = calculate_stats("MALWARE", 4, final_results)
print(malware_stats)