# <font color="#00adb5">👉👨‍💻 Using Machine Learning (Logistic Regression) for Phishing Detection☠</font>
* This Kaggle notebook walks through how a machine learning model is trained to predict whether a URL is malicious.

In [15]:
# First, load the input data
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [16]:
from IPython.display import Image
import os
!ls ../input/

In [17]:
import pandas as pd

In [18]:
phishing_data = pd.read_csv('/kaggle/input/phising-urls/phishing_site_urls.csv')

In [19]:
phishing_data.head()

In [20]:
phishing_data.tail()

In [21]:
# Get summary info about 'phishing_site_urls.csv':
phishing_data.info()

In [22]:
# Check whether there are any missing values
phishing_data.isnull().sum()

In [23]:
# Now build a DataFrame of the class counts
lbl_counts = pd.DataFrame(phishing_data.Label.value_counts())

In [24]:
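# Seaborn is used here for visualization
import seaborn as sns
sns.set_style('darkgrid')
# Recent seaborn releases require keyword arguments; iloc avoids depending on
# the name of the count column, which differs across pandas versions.
sns.barplot(x=lbl_counts.index, y=lbl_counts.iloc[:, 0])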

### <font color="#1597bb">Regexp Tokenizer</font>

In [25]:
# RegexpTokenizer splits text into tokens that match a regular expression:
from nltk.tokenize import RegexpTokenizer

In [26]:
# We keep only alphabetic runs here to keep the feature set lightweight
tokenizer = RegexpTokenizer(r'[A-Za-z]+')

In [27]:
print(phishing_data.URL[0]) # 0 is the first row

In [28]:
# Extract all alphabetic tokens from the first URL
clean_text = tokenizer.tokenize(phishing_data.URL[0])
print(clean_text)
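
To make the tokenization step concrete, here is a minimal sketch that uses the standard-library `re` module as a stand-in for `RegexpTokenizer`. With its default `gaps=False`, `RegexpTokenizer` returns the non-overlapping matches of the pattern, which is what `re.findall` gives for this pattern. The helper name `tokenize_url` and the URL are made up for illustration; the URL is not a row from the dataset.

```python
import re

# Same pattern as the notebook's tokenizer: runs of ASCII letters only.
ALPHA_RUN = re.compile(r'[A-Za-z]+')

def tokenize_url(url):
    """Return the alphabetic tokens of a URL; digits and punctuation are dropped."""
    return ALPHA_RUN.findall(url)

# Hypothetical URL, used only for illustration.
sample_url = 'secure-login.example.com/account2/verify.php?id=42'
tokens = tokenize_url(sample_url)
print(tokens)
# ['secure', 'login', 'example', 'com', 'account', 'verify', 'php', 'id']
```

Note that digits vanish entirely (`account2` becomes `account`), so look-alike hosts that differ only by a digit can share tokens with the legitimate name.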

### <font color="#ff1a75">⏰ Time module</font>

In [29]:
# Measure the execution time
import time
start = time.time()
phishing_data['text_tokenized'] = phishing_data.URL.map(lambda text: tokenizer.tokenize(text))
end = time.time()
time_req = end - start
formatted_time = "{:.2f}".format(time_req)
print(f"Time required to tokenize text is: \n{formatted_time} sec")
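
The same start/stop/format pattern recurs in several cells. One possible way to factor it out is a small context manager; the name `timed` is hypothetical and the workload below is a stand-in, not the notebook's data. It uses `time.perf_counter`, which is a monotonic clock intended for measuring short durations.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label):
    """Print how long the enclosed block took."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        print(f"{label}: {elapsed:.2f} sec")

# Usage on a small stand-in workload.
with timed("Time required to square 100,000 numbers"):
    squares = [n * n for n in range(100_000)]
```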

In [30]:
# Look at a sample of the result so far:
phishing_data.sample(7)

### <font color="#f875aa">Snowball Stemmer NLTK</font>

In [32]:
# Now use the Snowball stemmer to reduce each token to its stem
from nltk.stem.snowball import SnowballStemmer

sbs = SnowballStemmer("english")

# Time the stemming step.
start = time.time()
phishing_data['text_stemmed'] = phishing_data['text_tokenized'].map(lambda text: [sbs.stem(word) for word in text])
end = time.time()
time_req = end - start
formatted_time = "{:.2f}".format(time_req)
print(f"⏳ Time required for stemming all the tokenized text is: \n{formatted_time} sec")

In [33]:
# Now let's see the sample stemmed text:
phishing_data.sample(7)

In [34]:
# Join the stemmed tokens back into one space-separated string, and time it
start = time.time()
phishing_data['text_to_sent'] = phishing_data['text_stemmed'].map(lambda text: ' '.join(text))
end = time.time()
time_req = end - start
formatted_time = "{:.2f}".format(time_req)
print(f"Time required for joining text to sentence is: \n{formatted_time} sec")

In [35]:
phishing_data.sample(10)

### <font color="#ff3366">Creating Model</font>

In [36]:
from sklearn.feature_extraction.text import CountVectorizer

In [37]:
# Create the CountVectorizer object
CV = CountVectorizer()

# Show the documentation to see what CountVectorizer does
help(CountVectorizer)

In [38]:
# Learn the vocabulary and transform the text into a sparse token-count matrix
feature = CV.fit_transform(phishing_data.text_to_sent)

In [39]:
# Convert the first five rows of the sparse matrix to a dense array
feature[:5].toarray()
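
To show what the count matrix contains, here is a small sketch on three made-up, already-stemmed strings standing in for `text_to_sent`. The names `docs`, `toy_cv`, and `toy_matrix` are hypothetical. Each row is a document, each column is a vocabulary term, and each cell is how often the term occurs.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stemmed "sentences", used only for illustration.
docs = ['login paypal com verifi', 'youtub com watch', 'paypal com']

toy_cv = CountVectorizer()
toy_matrix = toy_cv.fit_transform(docs)   # sparse matrix, one row per document

# vocabulary_ maps each term to its column index; sort terms by that index.
vocab = sorted(toy_cv.vocabulary_, key=toy_cv.vocabulary_.get)
dense = toy_matrix.toarray()
print(vocab)
print(dense)
# ['com', 'login', 'paypal', 'verifi', 'watch', 'youtub']
# [[1 1 1 1 0 0]
#  [1 0 0 0 1 1]
#  [1 0 1 0 0 0]]
```

With the default settings, `CountVectorizer` lowercases the text and keeps only tokens of two or more characters, so single-letter tokens never become columns.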

In [40]:
# Split the data into features (for the model) and target:
from sklearn.model_selection import train_test_split

# Report of the evaluation metrics (precision, recall, F1-score)
from sklearn.metrics import classification_report

# Table of actual versus predicted labels
from sklearn.metrics import confusion_matrix

# Splitting data:
train_X, test_X, train_Y, test_Y = train_test_split(feature, phishing_data.Label)
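
When no size is given, `train_test_split` holds out 25% of the rows for testing, and the shuffle differs on every run unless `random_state` is set. The sketch below, on made-up labels, shows one way to make the split repeatable and to keep the class ratio the same in both parts with `stratify`. The toy data and the seed value are assumptions for illustration only.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 'good' and 20 'bad'.
toy_X = list(range(100))
toy_y = ['good'] * 80 + ['bad'] * 20

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y,
    test_size=0.25,     # same as the default when no size is given
    random_state=42,    # fixed seed so the split is repeatable
    stratify=toy_y,     # keep the good/bad ratio equal in both parts
)
print(Counter(y_tr), Counter(y_te))
```

Both parts keep the 4:1 ratio here: 60 good and 15 bad for training, 20 good and 5 bad for testing.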

### <font color="#e6e600">Logistic Regression</font>

In [41]:
from sklearn.linear_model import LogisticRegression

# Create the LogisticRegression object and fit it on the training data
lr = LogisticRegression()
lr.fit(train_X, train_Y)

In [42]:
# Now compute the accuracy on the test set:
lr.score(test_X, test_Y)

### <font color="#00ffaa">LR Score</font>
* Nice, the Logistic Regression model scores about 96% accuracy on the test set. Hooray!
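
An accuracy figure is easier to judge next to a baseline. The sketch below uses made-up labels (not the notebook's results) to compare a model's accuracy with the score obtained by always predicting the most frequent class. The helper name `accuracy` and all values are hypothetical.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical labels: 7 good URLs and 3 bad ones; one bad URL is missed.
y_true = ['good'] * 7 + ['bad'] * 3
y_pred = ['good'] * 7 + ['good', 'bad', 'bad']

# Baseline: always predict the most common class.
majority_label, majority_count = Counter(y_true).most_common(1)[0]
baseline = majority_count / len(y_true)
model_acc = accuracy(y_true, y_pred)
print(f"baseline={baseline:.2f}, model={model_acc:.2f}")
# baseline=0.70, model=0.90
```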

In [46]:
# Evaluation metrics (precision / recall / F1-score and confusion matrix)
import matplotlib.pyplot as plt
import numpy as np

Score_ml = {}
Score_ml['Logistic Regression'] = np.round(lr.score(test_X, test_Y), 2)

print('Training Accuracy: ',lr.score(train_X, train_Y))
print('Testing Accuracy: ',lr.score(test_X, test_Y))
# Build the confusion matrix. scikit-learn expects (y_true, y_pred),
# so rows are actual labels and columns are predicted labels.
conf_mat = pd.DataFrame(confusion_matrix(test_Y, lr.predict(test_X)),
                       columns = ['Predicted: Phishing', 'Predicted: Not Phishing'],
                       index = ['Actual: Phishing', 'Actual: Not Phishing'])

print('\nClassification Report: \n')
print(classification_report(test_Y, lr.predict(test_X),
                           target_names = ['Bad', 'Good']))

print('\nconfusion Matrix: \n')
plt.figure(figsize = (6, 4))
sns.heatmap(conf_mat, annot = True, fmt='d', cmap="RdYlBu")
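
The numbers in the classification report can be derived from the four cells of a binary confusion matrix. Here is a minimal sketch with phishing treated as the positive class; the function name `binary_metrics` and the counts are hypothetical and are not taken from this notebook's output.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute precision, recall, F1 and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp)   # of the URLs flagged as phishing, how many were
    recall = tp / (tp + fn)      # of the real phishing URLs, how many were caught
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {'precision': precision, 'recall': recall, 'f1': f1, 'accuracy': accuracy}

# Hypothetical counts: 90 phishing URLs caught, 30 missed, 10 false alarms.
m = binary_metrics(tp=90, fp=10, fn=30, tn=870)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

With these made-up counts the accuracy is 0.96 while recall is only 0.75, which is why the per-class report is worth reading alongside the single accuracy score.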

# <font color="Red">Building a Pipeline for the Model</font>
* A pipeline chains the vectorizer and the classifier into a single estimator, so raw URLs can be passed straight to `fit` and `predict`.

In [51]:
from sklearn.pipeline import make_pipeline
pipeline_ls = make_pipeline(CountVectorizer(tokenizer = RegexpTokenizer(r'[A-Za-z]+').tokenize, stop_words='english'), LogisticRegression())
train_X, test_X, train_Y, test_Y = train_test_split(phishing_data.URL, phishing_data.Label)
pipeline_ls.fit(train_X, train_Y)

In [52]:
# Check the pipeline's score on the test set
pipeline_ls.score(test_X, test_Y)
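
As a small illustration of how a pipeline accepts raw strings, the sketch below fits a toy pipeline on four made-up URLs and asks for class probabilities instead of hard labels. The URLs, labels, and the name `toy_pipeline` are hypothetical, and a model trained on four rows is not meaningful beyond showing the API shape.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical training data, for illustration only.
toy_urls = [
    'paypal-secure.example.net/login/verify.php',
    'bank-update.example.org/account/confirm',
    'www.example.com/about',
    'docs.example.com/guide/intro',
]
toy_labels = ['bad', 'bad', 'good', 'good']

# The vectorizer and classifier are fitted together in one call.
toy_pipeline = make_pipeline(CountVectorizer(), LogisticRegression())
toy_pipeline.fit(toy_urls, toy_labels)

# make_pipeline names each step after its lowercased class name.
classes = list(toy_pipeline.named_steps['logisticregression'].classes_)
proba = toy_pipeline.predict_proba(['example.com/login/verify'])
print(classes)   # column order of the probabilities below
print(proba)
```

The probability columns follow the order of `classes_`, so the first column here is the estimated probability of `bad`.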

In [53]:
# Evaluation metrics for the pipeline
print("Training Accuracy: ",pipeline_ls.score(train_X, train_Y))
print("Testing Accuracy: ",pipeline_ls.score(test_X, test_Y))

# scikit-learn expects (y_true, y_pred): rows are actual, columns are predicted.
conf_mat = pd.DataFrame(confusion_matrix(test_Y, pipeline_ls.predict(test_X)),
                       columns = ["Predicted: Phishing", "Predicted: Not Phishing"],
                       index = ["Actual: Phishing", "Actual: Not Phishing"])

print("\nClassification Report \n")
print(classification_report(test_Y, pipeline_ls.predict(test_X),
                            target_names = ['Bad', 'Good']))

print("\nConfusion Matrix \n")
plt.figure(figsize = (6,4))
sns.heatmap(conf_mat, annot = True, fmt = 'd', cmap="Blues")

In [54]:
# Save the fitted pipeline to a pickle file so it can be reused for prediction
import pickle
with open('phishing.pkl', 'wb') as f:
    pickle.dump(pipeline_ls, f)

# Reload it and confirm that the score is unchanged
with open('phishing.pkl', 'rb') as f:
    loaded_model = pickle.load(f)
result = loaded_model.score(test_X,test_Y)
print(result)

# <font color="#009999">🙌CONCLUSION</font>
* So, we get an accuracy of about 96%, which is high. Next, let's test the model on a few other links and see whether it predicts them correctly.

##### <font color="#ff0000">❌ Bad Links</font> 
* __The websites below are PHISHING 😡😡😡😡!__
   1. www.yeniik.com.tr/wp-admin/js/login.alibaba.com/login.jsp.php
   2. www.fazan-pacir.rs/temp/libraries/ipad
   3. www.tubemoviez.exe
   4. www.svision-online.de/mgfi/administrator/components/com_babackup/classes/fx29id1.txt

##### <font color="#77ff33">✔ Good Links</font>
* __The websites below are normal websites 👍👍👍👍__
   1. www.youtube.com/
   2. www.kominfo.go.id
   3. www.digitalent.kominfo.go.id
   4. www.netacad.com

In [57]:
# Input data for prediction
import pickle
predict_pertama = ['www.yeniik.com.tr/wp-admin/js/login.alibaba.com/login.jsp.php',
               'www.fazan-pacir.rs/temp/libraries/ipad',
               'www.tubemoviez.exe/',
               'www.svision-online.de/mgfi/administrator/components/com_babackup/classes/fx29id1.txt']

predict_kedua = ['www.youtube.com',
                'www.kominfo.go.id',
                'www.digita1ent.kominfo.go.id',
                'www.br1.com']

# Load the saved pipeline; it vectorizes the raw URLs itself
with open('phishing.pkl', 'rb') as f:
    loaded_model = pickle.load(f)

result_1 = loaded_model.predict(predict_pertama)
result_2 = loaded_model.predict(predict_kedua)

print(f"{result_1} \n {'-'*26} \n{result_2}")

# `DONE. THANK YOU 😎`