### Module 6: Phishing

10/10/2019

Benny Cohen

In this notebook we analyze a dataset of features derived from URLs and website source code, used to judge whether a website is a phishing site. We find that certain features assumed to be telling of phishing don't reveal much.

In [2]:
from scipy.io import arff

# loadarff returns the records as a NumPy structured array plus a MetaData object
data, metadata = arff.loadarff('data/Training Dataset.arff')

These are the names and types of the columns:

In [3]:
metadata

Dataset: phishing
	having_IP_Address's type is nominal, range is ('-1', '1')
	URL_Length's type is nominal, range is ('1', '0', '-1')
	Shortining_Service's type is nominal, range is ('1', '-1')
	having_At_Symbol's type is nominal, range is ('1', '-1')
	double_slash_redirecting's type is nominal, range is ('-1', '1')
	Prefix_Suffix's type is nominal, range is ('-1', '1')
	having_Sub_Domain's type is nominal, range is ('-1', '0', '1')
	SSLfinal_State's type is nominal, range is ('-1', '1', '0')
	Domain_registeration_length's type is nominal, range is ('-1', '1')
	Favicon's type is nominal, range is ('1', '-1')
	port's type is nominal, range is ('1', '-1')
	HTTPS_token's type is nominal, range is ('-1', '1')
	Request_URL's type is nominal, range is ('1', '-1')
	URL_of_Anchor's type is nominal, range is ('-1', '0', '1')
	Links_in_tags's type is nominal, range is ('1', '-1', '0')
	SFH's type is nominal, range is ('-1', '1', '0')
	Submitting_to_email's type is nominal, range is ('-1', '1')
	

Each row corresponds to one feature used to judge whether the website is legitimate or a phishing website. Everything up to the `'s` is the name of the feature. The word nominal indicates that the feature is categorical: each value is one of a fixed set of labels, here the strings '-1', '0', and '1'. The values -1, 0, and 1 appear to correspond to legitimate, suspicious, and phishing, based on the rule that the dataset documentation gives for each feature.


Here are paraphrased descriptions of some of the features, to give an idea of how the judgements were made.
The rest of the features are similar; they are all oddities not found in regular websites but found in phishing websites.

1. having_IP_Address - Normally websites use DNS to map host names to IP addresses. If an IP address is used instead of a domain name, it is deemed phishing.
2. URL_Length - Phishers sometimes make URLs long to hide the suspicious part. If the URL length is over 75 it is deemed phishing; if it is between 54 and 75 it is suspicious.
3. Shortining_Service - If the website uses a URL shortener it is deemed phishing.
4. having_At_Symbol - If the URL contains an @, the browser ignores everything before the @, which hides the real address, so it is deemed phishing.
5. double_slash_redirecting - A double slash in a URL signals a redirect. If one appears past the 7th position it is deemed phishing.
6. Prefix_Suffix - If the domain name contains a dash it is deemed phishing, since legitimate websites rarely use dashes.
7. having_Sub_Domain - Domains are what come after the dot in URLs; for example, .com and .net are top-level domains. If a site has multiple subdomains, it is considered phishing.
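
A few of these rules can be sketched in code. The following is a minimal illustration, not the dataset authors' extraction code: the helper `url_features` and its regular expression are hypothetical, it covers only rules 1, 2, 4, 5, and 6, and it assumes the encoding used in this notebook (1 for phishing, 0 for suspicious, -1 for legitimate).

```python
import re
from urllib.parse import urlparse

# Encoding assumed in this notebook: 1 = phishing, 0 = suspicious, -1 = legitimate.
PHISHING, SUSPICIOUS, LEGITIMATE = 1, 0, -1


def url_features(url):
    """Hypothetical helper that scores a URL on five of the rules above."""
    host = urlparse(url).hostname or ""
    features = {}

    # Rule 1: the host is a dotted IPv4 address rather than a domain name.
    is_ip = re.fullmatch(r"\d{1,3}(\.\d{1,3}){3}", host) is not None
    features["having_IP_Address"] = PHISHING if is_ip else LEGITIMATE

    # Rule 2: over 75 characters is phishing, 54 to 75 is suspicious.
    if len(url) > 75:
        features["URL_Length"] = PHISHING
    elif len(url) >= 54:
        features["URL_Length"] = SUSPICIOUS
    else:
        features["URL_Length"] = LEGITIMATE

    # Rule 4: an @ anywhere in the URL.
    features["having_At_Symbol"] = PHISHING if "@" in url else LEGITIMATE

    # Rule 5: the last "//" starts past the 7th character (1-based position).
    last_double_slash = url.rfind("//") + 1
    features["double_slash_redirecting"] = (
        PHISHING if last_double_slash > 7 else LEGITIMATE
    )

    # Rule 6: a dash in the host name.
    features["Prefix_Suffix"] = PHISHING if "-" in host else LEGITIMATE
    return features


print(url_features("http://125.98.3.123/fake.html"))
print(url_features("http://pay-pal.example.com//login@evil"))
```

The real dataset was built with more rules than these, so the output of this helper should not be expected to reproduce its columns exactly.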

The descriptions for the rest can be found in the 3rd download here: https://archive.ics.uci.edu/ml/machine-learning-databases/00327/

The last column, 'Result', says whether the site is phishing or not. Let's assume that 1 is positive for phishing.

Let's turn it into a dataframe. The data is stored as bytes, so let's convert the values to signed ints so we don't have to see the `b` prefix on every value.

In [4]:
import pandas as pd
df = pd.DataFrame(data, dtype = 'i')
df.head(5)

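| | having_IP_Address | URL_Length | Shortining_Service | having_At_Symbol | double_slash_redirecting | Prefix_Suffix | having_Sub_Domain | SSLfinal_State | Domain_registeration_length | Favicon | ... | popUpWidnow | Iframe | age_of_domain | DNSRecord | web_traffic | Page_Rank | Google_Index | Links_pointing_to_page | Statistical_report | Result |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -1 | 1 | 1 | 1 | -1 | -1 | -1 | -1 | -1 | 1 | ... | 1 | 1 | -1 | -1 | -1 | -1 | 1 | 1 | -1 | -1 |
| 1 | 1 | 1 | 1 | 1 | 1 | -1 | 0 | 1 | -1 | 1 | ... | 1 | 1 | -1 | -1 | 0 | -1 | 1 | 1 | 1 | -1 |
| 2 | 1 | 0 | 1 | 1 | 1 | -1 | -1 | -1 | -1 | 1 | ... | 1 | 1 | 1 | -1 | 1 | -1 | 1 | 0 | -1 | -1 |
| 3 | 1 | 0 | 1 | 1 | 1 | -1 | -1 | -1 | 1 | 1 | ... | 1 | 1 | -1 | -1 | 1 | -1 | 1 | -1 | 1 | -1 |
| 4 | 1 | 0 | -1 | 1 | 1 | -1 | 1 | 1 | -1 | 1 | ... | -1 | 1 | -1 | -1 | 0 | -1 | 1 | 1 | 1 | 1 |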
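
To see what the bytes-to-int conversion does without loading the ARFF file, the sketch below builds a small stand-in for the array that `loadarff` returns and decodes it step by step. The two rows of values are made up, and decoding column by column is one possible approach rather than what the cell above does.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the structured array from arff.loadarff (values are made up).
toy = np.array(
    [(b"-1", b"1"), (b"0", b"-1")],
    dtype=[("URL_Length", "S2"), ("Result", "S2")],
)
toy_df = pd.DataFrame(toy)

# Each cell is a bytes object such as b'-1'; decode it to text, then cast to int.
toy_df = toy_df.apply(lambda col: col.str.decode("utf-8").astype(int))
print(toy_df)
```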


In [6]:
phishing_df = df[df['Result'] == 1]
non_phishing_df =  df[df['Result'] == -1]

phishing_df.to_csv('data/phishing.csv')
non_phishing_df.to_csv('data/non_phishing.csv')

#### Analysis

In [7]:
len(phishing_df)

6157

In [8]:
len(non_phishing_df)

4898

First we note that there are more phishing rows than non-phishing rows (6157 vs. 4898).

Let's try to find the most common attributes among both phishing and non-phishing websites. All values are -1, 0, or 1, so a high mean indicates many 1's.
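
As a quick check on how to read these means: for a column whose values are only -1, 0, or 1, the mean equals the number of 1's minus the number of -1's, divided by the number of rows. The values below are made up for illustration.

```python
# Made-up column of five feature values, each -1, 0, or 1.
values = [1, 1, 1, 0, -1]

mean = sum(values) / len(values)
ones = values.count(1)
minus_ones = values.count(-1)

# Both expressions give the same number: (3 - 1) / 5 = 0.4
print(mean, (ones - minus_ones) / len(values))
```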

In [9]:
phishing_means = phishing_df.mean().sort_values(ascending = False)

In [10]:
non_phishing_means = non_phishing_df.mean().sort_values(ascending = False)

In [11]:
means = pd.DataFrame([phishing_means, non_phishing_means]).T
means.columns = ['Phishing', 'NonPhishing']
means

| | Phishing | NonPhishing |
|---|---|---|
| Result | 1.0 | -1.0 |
| RightClick | 0.918467 | 0.908126 |
| SSLfinal_State | 0.832223 | -0.479788 |
| Iframe | 0.81517 | 0.81911 |
| Google_Index | 0.801202 | 0.621478 |
| on_mouseover | 0.78626 | 0.731727 |
| Statistical_report | 0.769043 | 0.657411 |
| port | 0.750528 | 0.700286 |
| having_At_Symbol | 0.734286 | 0.658228 |
| double_slash_redirecting | 0.718369 | 0.770519 |


Now let's add a column of differences to see which features are most prevalent only in phishing websites. This step is needed because some features, like RightClick (disabling right clicks, which blocks downloading a website's source code), are common in both groups and therefore reveal little.

In [12]:
# .copy() makes an independent dataframe; plain assignment would only alias `means`
means_copy = means.copy()
means_copy['Differences'] = means['Phishing'] - means['NonPhishing']
means_copy.sort_values(by='Differences', ascending=False)

| | Phishing | NonPhishing | Differences |
|---|---|---|---|
| Result | 1.0 | -1.0 | 2.0 |
| SSLfinal_State | 0.832223 | -0.479788 | 1.312011 |
| URL_of_Anchor | 0.365438 | -0.632095 | 0.997532 |
| web_traffic | 0.542797 | -0.033891 | 0.576688 |
| Request_URL | 0.408803 | -0.092283 | 0.501086 |
| having_Sub_Domain | 0.281468 | -0.209473 | 0.490942 |
| Prefix_Suffix | -0.524119 | -1.0 | 0.475881 |
| Links_in_tags | 0.050999 | -0.330747 | 0.381746 |
| SFH | -0.445834 | -0.784198 | 0.338364 |
| age_of_domain | 0.169401 | -0.074724 | 0.244125 |


We can draw a few conclusions from this data. A positive difference indicates, as expected, that phishing websites are more likely to have the feature. A negative difference indicates that non-phishing websites are more likely to have it, which goes against the assumption that the feature signals phishing.

1. SSLfinal_State has the biggest difference: the value 1 is common among phishing websites (mean 0.83) and uncommon among non-phishing websites (mean -0.48). This makes a lot of sense. SSLfinal_State records a -1 only if the site uses HTTPS AND its certificate was issued by a trustworthy source AND the certificate is over 1 year old. It is highly unlikely that a phishing website would meet all of these requirements.
2. Domain_registeration_length: if the domain expires within 1 year, the website is marked with a 1 for phishing. The opposite seems to be true, though. The Domain_registeration_length mean for phishing websites was -0.526, indicating that many phishing websites did not have this feature, while about half of the non-phishing websites did. We can conclude that this feature is not very useful in judging whether a page is phishing.
3. There are several negative differences, indicating that the corresponding features don't really indicate phishing (e.g., Favicon, Iframe, Redirect, double_slash_redirecting, HTTPS_token, Abnormal_URL, Shortining_Service). Differences that are close to 0 likewise indicate that the feature says little about whether a site is phishing.

Similar conclusions can be made for all of the features using the differences series. A high positive difference indicates that the feature is a notable one to look at when judging whether a page is a phishing website.
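
The negative and near-zero cases can be pulled out of the differences directly. The sketch below uses four rows copied from the means table above; the 0.05 cutoff for "close to 0" is an arbitrary value chosen for illustration, not one used in this notebook.

```python
import pandas as pd

# Four rows copied from the means table shown earlier.
toy_means = pd.DataFrame(
    {
        "Phishing": [0.918467, 0.832223, 0.81517, 0.718369],
        "NonPhishing": [0.908126, -0.479788, 0.81911, 0.770519],
    },
    index=["RightClick", "SSLfinal_State", "Iframe", "double_slash_redirecting"],
)
toy_means["Differences"] = toy_means["Phishing"] - toy_means["NonPhishing"]

# Features that are more common among the non-phishing rows.
negative = toy_means[toy_means["Differences"] < 0]

# Features whose difference is small either way (0.05 is an assumed cutoff).
near_zero = toy_means[toy_means["Differences"].abs() < 0.05]

print(negative.index.tolist())
print(near_zero.index.tolist())
```

Applied to the full `means_copy` dataframe, the same two filters would list every feature that falls into each group.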