This repo contains my capstone project for the 2023 Machine Learning Zoomcamp, which I completed.

1. About the project

Malicious URLs are web links created for scams, fraud, or phishing. Some are designed to install malware on a user's device when clicked. They are crafted to look like normal, safe URLs, which lures unsuspecting users into clicking them. Conventional approaches to combating these threats depend on blacklists and whitelists, which are lists of URLs classified as malicious or safe. However, blacklists are limited in coverage and cannot protect against new, previously unseen malicious URLs.

In this project, natural language processing techniques were applied to address this challenge. A machine learning model was trained on a dataset of URL samples containing both malicious and legitimate URLs. The model learns from the data to distinguish legitimate URLs from malicious ones, so that when given new URL samples it can classify each as either malicious or legitimate. After training, the model can be deployed in a production environment. This is a machine learning classification problem.
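To illustrate what treating a URL as text can look like, here is a minimal sketch that splits a URL into word-like tokens and counts them. The tokenisation rule and the helper names `tokenize_url` and `bag_of_words` are assumptions made for illustration; the notebook and train.py may process URLs differently.

```python
import re
from collections import Counter

# Assumption: split on any run of characters that is not a letter or digit.
_TOKEN_SPLIT = re.compile(r"[^A-Za-z0-9]+")


def tokenize_url(url: str) -> list[str]:
    """Lowercase a URL and break it into alphanumeric tokens."""
    return [tok for tok in _TOKEN_SPLIT.split(url.lower()) if tok]


def bag_of_words(url: str) -> Counter:
    """Count how often each token occurs in a single URL."""
    return Counter(tokenize_url(url))


# One of the URL samples used in the project demo.
sample_url = "www.yeniik.com.tr/wp-admin/js/login.alibaba.com/login.jsp.php"
tokens = tokenize_url(sample_url)
counts = bag_of_words(sample_url)
print(tokens)
print(counts.most_common(3))
```

Token counts like these are the kind of numeric features a text classifier can be trained on. Repeated tokens such as "login" and "com" in a single URL are the sort of pattern a model may pick up.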


2. Project important files and folders explained

phishing_site_urls.csv: This is the dataset used to train the machine learning model. It contains 549846 unique URL entries and 2 columns, one for URLs and one for labels. The URL column contains samples of different URLs, for example, mistershortcut.bravepages.com/tlfp1n9, cnajs.com/pic/Agatha%20Christie/dropbox/dropbox/, pro-football-reference.com/players/H/HarrIk00.htm. The label column contains two categories: Good and Bad. Good is a non-malicious URL. Bad is a malicious URL. The dataset is available to download on Kaggle.
malicious_url.ipynb: This is a Jupyter Notebook where I carried out exploratory data analysis (EDA) on the dataset, processed the data, and experimented with different machine learning techniques to build this project.
final_model.pkl: This is the trained model.
train.py: This is the Python script used to train the model.
predict.py: This is the Python script for the web service, which was built with Flask. It serves the model as a RESTful API so that the request.py script can make inferences with the model.
request.py: This Python script is used to make inferences with the model. It prompts the user to enter a URL sample and sends it as a POST request to the web service, which returns a prediction.
Pipfile & Pipfile.lock: These are configuration files used for managing Python dependencies and packages in this project.
dockerfile: This file defines the Docker image used to deploy the model in a Docker container.
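As a quick sanity check on the dataset described above, the label balance can be counted with the standard library. This is a sketch only: the column name `Label`, the helper `count_labels`, and the sample rows are assumptions for illustration, and the rows are made up rather than taken from phishing_site_urls.csv.

```python
import csv
import io
from collections import Counter


def count_labels(csv_file, label_column="Label"):
    """Count rows per label in a CSV that has a header row.

    The default column name is an assumption; adjust it to match the file.
    """
    reader = csv.DictReader(csv_file)
    return Counter(row[label_column].strip().lower() for row in reader)


# Made-up rows for illustration only; they are not taken from the dataset.
sample = io.StringIO(
    "URL,Label\n"
    "example.com/home,good\n"
    "example.org/login.php,bad\n"
    "example.net/docs,good\n"
)
label_counts = count_labels(sample)
print(label_counts)
```

To run it on the real file, open phishing_site_urls.csv with `open(path, newline="", encoding="utf-8")` and pass the file object to `count_labels`.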

3. Running this project

To try out this project, follow these steps:

Clone this repository by running this command: git clone https://github.com/cyberholics/Malicious-URL-detector.git
Enter the project directory: cd Malicious-URL-detector
Build the docker image: docker build -t malicious-url-detector . (Docker image names must be lowercase.)
Run the docker container: docker run -it --rm -p 9696:9696 malicious-url-detector (Make sure you have docker running.)
Open another terminal and run cd Malicious-URL-detector. Next, run python request.py. This will prompt you to enter a URL sample, so enter any URL sample to get a prediction of whether the URL is malicious or good.
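For readers who want to call the service from their own code, the sketch below shows roughly what a client like request.py has to do: wrap a URL sample in a JSON POST request and send it to the container. Port 9696 comes from the docker run command above, but the `/predict` path, the `{"url": ...}` payload shape, and the function names are assumptions; check predict.py and request.py for the actual route and field names.

```python
import json
import urllib.request

# Assumed endpoint: the port matches the docker run command, but the
# "/predict" path and the payload shape are guesses, not taken from predict.py.
SERVICE_URL = "http://localhost:9696/predict"


def build_request(url_sample: str, service_url: str = SERVICE_URL) -> urllib.request.Request:
    """Wrap a URL sample in a JSON POST request for the prediction service."""
    body = json.dumps({"url": url_sample}).encode("utf-8")
    return urllib.request.Request(
        service_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(url_sample: str) -> dict:
    """Send the request and decode the JSON reply (the container must be running)."""
    with urllib.request.urlopen(build_request(url_sample), timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))


# Building the request does not touch the network, so this runs without Docker.
req = build_request("www.youtube.com/")
print(req.get_method(), req.full_url)
```

Calling `predict("www.youtube.com/")` only works while the container is running and only if the assumed route and payload match the real service.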

4. Project demo

I tried the following URL samples: www.youtube.com/, www.grammarly.com/plagiarism-checker, www.yeniik.com.tr/wp-admin/js/login.alibaba.com/login.jsp.php, www.tubemoviez.exe/, www.svision-online.de/mgfi/administrator/components/com_babackup/classes/fx29id1.txt.

If you have any questions, reach out to me via email at victorkingoshimua@gmail.com
