# PREPROCESSING

In [1]:
import pandas as pd

path = "dataset_phishing.csv"
phishing_data = pd.read_csv(path, header=0)
phishing_data.head()

Unnamed: 0,url,length_url,length_hostname,ip,nb_dots,nb_hyphens,nb_at,nb_qm,nb_and,nb_or,...,domain_in_title,domain_with_copyright,whois_registered_domain,domain_registration_length,domain_age,web_traffic,dns_record,google_index,page_rank,status
0,http://www.crestonwood.com/router.php,37,19,0,3,0,0,0,0,0,...,0,1,0,45,-1,0,1,1,4,legitimate
1,http://shadetreetechnology.com/V4/validation/a...,77,23,1,1,0,0,0,0,0,...,1,0,0,77,5767,0,0,1,2,phishing
2,https://support-appleld.com.secureupdate.duila...,126,50,1,4,1,0,1,2,0,...,1,0,0,14,4004,5828815,0,1,0,phishing
3,http://rgipt.ac.in,18,11,0,2,0,0,0,0,0,...,1,0,0,62,-1,107721,0,0,3,legitimate
4,http://www.iracing.com/tracks/gateway-motorspo...,55,15,0,2,2,0,0,0,0,...,0,1,0,224,8175,8725,0,0,6,legitimate


# Splitting into training and test sets

Before preparing the training and test sets, X and y must be configured properly. From the preview above, we know that the url and status columns must be excluded from X: url is the raw string that identifies each sample rather than a numeric feature, and status is the label we want to predict. This is why both columns are dropped. y, on the other hand, is the classification of each sample, so it contains only the status column.

The data is then split into the training and test sets with a 70:30 (training:test) split.

In [2]:
from sklearn.model_selection import train_test_split

X = phishing_data.drop(['url', 'status'], axis=1)
y = phishing_data['status']

# test_size=0.30 gives the 70:30 (training:test) split described above
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30)
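
As a minimal sketch of the split described above, the following uses a small toy DataFrame in place of dataset_phishing.csv. The toy values, the names `toy`, `X_toy` and `y_toy`, and the `stratify` and `random_state` arguments are assumptions added for illustration and are not part of the original notebook. Stratifying keeps the legitimate/phishing ratio the same in both sets, and a fixed seed makes the split reproducible.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the phishing dataset (hypothetical values, not real data)
toy = pd.DataFrame({
    "url": [f"http://example.com/{i}" for i in range(20)],
    "length_url": list(range(20, 40)),
    "nb_dots": [i % 4 for i in range(20)],
    "status": ["legitimate", "phishing"] * 10,
})

# Same configuration as in the notebook: drop the identifier and the label
X_toy = toy.drop(["url", "status"], axis=1)
y_toy = toy["status"]

# 70:30 split; stratify and random_state are illustrative additions
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.30, stratify=y_toy, random_state=42
)

print(X_tr.shape, X_te.shape)
print(y_te.value_counts().to_dict())
```

Whether stratification is needed depends on how balanced the status column is in the real dataset, which is not shown here.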

# Normalization/Standardization

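The feature columns have very different ranges (compare nb_dots with web_traffic in the preview above), so we standardize them to zero mean and unit variance. This matters for scale-sensitive models such as SVMs. The scaler is fit on the training set only and then applied to the test set, so that no information from the test set leaks into the preprocessing.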

In [3]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
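
To make the effect of this step concrete, here is a small sketch on synthetic data. The arrays `train` and `test` and the name `toy_scaler` are hypothetical and only loosely mimic two features on different scales. After `fit_transform`, each training column has mean close to 0 and standard deviation close to 1. The test columns are transformed with the training statistics, so their means are only approximately 0.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales (illustrative values only)
rng = np.random.default_rng(0)
train = rng.normal(loc=[50.0, 3.0], scale=[20.0, 1.0], size=(200, 2))
test = rng.normal(loc=[50.0, 3.0], scale=[20.0, 1.0], size=(50, 2))

toy_scaler = StandardScaler()
train_scaled = toy_scaler.fit_transform(train)  # learns mean/std from train only
test_scaled = toy_scaler.transform(test)        # reuses the training mean/std

print("train mean:", train_scaled.mean(axis=0).round(6))
print("train std: ", train_scaled.std(axis=0).round(6))
print("test mean: ", test_scaled.mean(axis=0).round(3))
```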

# Training the model

After standardizing the feature columns, we can train the model. Finally, we check the accuracy_score of the model by comparing the true labels of the test set with the obtained predictions.

In [None]:
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

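# Train on the standardized features produced by the scaler above
model = SVC(kernel='linear').fit(X_train_scaled, y_train)

# Predict on the standardized test set and compare with the true labels
predictions = model.predict(X_test_scaled)
accuracy_test_svm = accuracy_score(y_test, predictions)

print("SVM")
print(accuracy_test_svm)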
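
One way to keep scaling and training together is a scikit-learn pipeline. The sketch below is not the notebook's method: it runs on synthetic data from `make_classification` rather than the phishing dataset, and the names `X_syn`, `y_syn` and `clf`, along with the confusion matrix, are additions for illustration. The pipeline fits the scaler on the training data only and applies it automatically at prediction time.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic binary classification data (stand-in for the real features)
X_syn, y_syn = make_classification(
    n_samples=300, n_features=10, n_informative=5, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.30, stratify=y_syn, random_state=0
)

# Scaling and the linear SVM are chained, so the test set is never
# used when estimating the scaling statistics
clf = make_pipeline(StandardScaler(), SVC(kernel="linear"))
clf.fit(X_tr, y_tr)

pred = clf.predict(X_te)
acc = accuracy_score(y_te, pred)
cm = confusion_matrix(y_te, pred)  # rows: true class, columns: predicted class

print("accuracy:", acc)
print(cm)
```

The confusion matrix complements accuracy by showing how the errors are split between the two classes.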
