This project focuses on exploring and analyzing malicious URLs using Exploratory Data Analysis (EDA) techniques. The analysis involves the use of Logistic Regression and Random Forest models to gain insights into patterns within URLs, allowing users to identify potential threats and take preventive measures.
- Avadhi Shah
- Urmil Trivedi
- Dhyey Dave
The primary goal of this project is to enable users to closely examine suspicious web links by employing Logistic Regression and Random Forest models for Exploratory Data Analysis. By studying patterns in data and extracting essential features, users can make informed decisions about the potential harm associated with a given URL.
The initial section imports essential libraries such as NumPy, Pandas, Matplotlib, Seaborn, and tools for Google Colab file uploads. The code prompts users to upload a CSV file containing URL data and performs basic checks.
The URLDataProcessor class is designed to process URLs and extract various features. It includes methods for calculating URL length, hostname length, path length, first directory length, extracting the top-level domain (TLD), and counting characters and substrings. These features provide insights into the structure and content of a given URL.
The project includes comprehensive data analysis, such as calculating URL attributes (length, hostname length, path length, etc.) and generating visualizations like heatmaps and histograms to identify patterns and correlations.
The code implements machine learning models,including Decision Tree and Random Forest classifiers and Logistic Regression , to predict whether a URL is malicious or benign based on its features. The models are trained and evaluated using accuracy scores and confusion matrices.
A section of the code demonstrates how to extract features for new URLs and predict their labels using a pre-trained Decision Tree model.
-
Uploading Data:
- Ensure your dataset is in CSV format.
- Use the provided code to upload your CSV file to the Google Colab environment.
-
Data Analysis:
- Explore the data analysis section to understand various attributes of the URLs.
- Examine visualizations like heatmaps and histograms to identify patterns.
-
URL Data Processing:
- Utilize the
URLDataProcessorclass to extract features from URLs for further analysis.
- Utilize the
-
Machine Learning Models:
- Train and evaluate machine learning models (Decision Tree, Random Forest) for URL classification.
-
Feature Extraction for New URLs:
- Use the provided code to extract features for new URLs and predict their labels using a pre-trained model.
- NumPy
- Pandas
- Matplotlib
- Seaborn
- Google Colab (for file uploads)
- scikit-learn
- tld (Top-Level Domain extraction)
- tabulate
- Regex
- Warning (for handling convergence warnings)
If you wish to contribute to this project, please follow standard GitHub contribution practices. Fork the repository, create a new branch for your changes, and submit a pull request.