This project implements a heuristic-based phishing website detector. It analyzes URLs and fetches web page content to extract various features, calculate a phishing score, and flag potentially malicious websites. This tool is designed for educational purposes to demonstrate how phishing detection mechanisms work.
Disclaimer: This tool is a simplified educational project and is NOT a substitute for professional browser security features or dedicated anti-phishing solutions. It relies on a set of heuristics and a simple blacklist, which may not catch all phishing attempts and can produce false positives. The author is not responsible for any damage or misuse of this software.
Phishing detection is a critical aspect of cybersecurity, but its development and use come with significant ethical responsibilities:
- Educational Use Only: This detector is for learning purposes. Do not rely on it for absolute protection against phishing attacks.
- False Positives: Heuristic-based detection is prone to incorrectly flagging legitimate websites as phishing. Always verify suspicious URLs independently.
- Privacy: When fetching external URLs, be aware that your IP address might be logged by the target server. Avoid scanning sensitive or private URLs without explicit permission.
- Scope and Limitations: Clearly understand that this is a basic detector. It does not employ machine learning, advanced behavioral analysis, or real-time threat intelligence feeds that commercial solutions use.
- URL Feature Extraction: Analyzes URL components such as length, presence of suspicious characters (`@`, `//` in path), number of subdomains, HTTPS status, and IP address in hostname.
- HTML Content Feature Extraction: Examines webpage content for indicators like the presence of forms, iframes, scripts, suspicious keywords (e.g., "login", "verify"), and external links.
- Blacklist Integration: Checks analyzed domains and HTML content against a local blacklist of known malicious domains and keywords.
- Heuristic-Based Scoring: Assigns a phishing score based on a weighted sum of detected suspicious features.
- Configurable Threshold: Allows users to adjust the phishing score threshold for flagging websites.
- Detailed Output: Provides a breakdown of extracted features, the calculated phishing score, and detected indicators.
- JSON Report Generation: Can generate a detailed JSON report of the analysis for further examination.
- Command-Line Interface: Easy-to-use interface for analyzing URLs.
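The URL-level features listed above can be sketched with the standard library alone. This is a minimal illustration, not the project's actual code: the function name `extract_url_features` and the dictionary keys are hypothetical, and the subdomain count is a naive approximation that does not account for multi-part suffixes such as `co.uk`.

```python
import ipaddress
from urllib.parse import urlsplit


def extract_url_features(url: str) -> dict:
    """Collect simple URL-level indicators (hypothetical helper for illustration)."""
    parts = urlsplit(url)
    host = parts.hostname or ""

    # A raw IP address in the hostname parses cleanly with ipaddress.
    try:
        ipaddress.ip_address(host)
        host_is_ip = True
    except ValueError:
        host_is_ip = False

    # Naive subdomain count: every label beyond the last two.
    labels = [label for label in host.split(".") if label]
    subdomain_count = 0 if host_is_ip else max(len(labels) - 2, 0)

    return {
        "url_length": len(url),
        "has_at_symbol": "@" in url,
        "double_slash_in_path": "//" in parts.path,
        "subdomain_count": subdomain_count,
        "uses_https": parts.scheme == "https",
        "host_is_ip": host_is_ip,
    }
```

Calling `extract_url_features("http://user@192.168.0.1//login")` would report the `@` symbol, the `//` in the path, the missing HTTPS, and the IP-address hostname in one pass.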
.
├── phishing_website_detector/
│ ├── __init__.py
│ ├── main.py
│ └── blacklists.txt
├── tests/
│ ├── __init__.py
│ └── test_detector.py
├── .gitignore
├── conceptual_analysis.txt
├── README.md
└── requirements.txt
- Python 3.7+
- `pip` for installing dependencies
- Clone the repository:

      git clone https://github.com/your-username/Python-Phishing-Website-Detector.git
      cd Python-Phishing-Website-Detector

- Create a virtual environment (recommended):

      python -m venv venv
      source venv/bin/activate # On Windows, use `venv\Scripts\activate`

- Install the dependencies:

      pip install -r requirements.txt

The phishing_website_detector/blacklists.txt file contains default blacklisted domains and keywords. You can edit this file to add more entries, one per line.
- Domains: Enter full domain names (e.g., `malicious-site.com`). The detector will convert these to lowercase for matching.
- Keywords: Enter suspicious words or phrases (e.g., `urgent_security_alert`). These will be searched for in the HTML content.
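One way to read and apply such a file is sketched below. How the real detector separates domains from keywords is not stated here, so this sketch assumes every entry is treated uniformly: it is lowercased, compared exactly against the hostname, and searched as a substring in the HTML. All function names are hypothetical.

```python
from pathlib import Path
from typing import Iterable, List, Set


def parse_blacklist(lines: Iterable[str]) -> Set[str]:
    """Return the non-empty entries, stripped and lowercased."""
    return {line.strip().lower() for line in lines if line.strip()}


def load_blacklist(path: str) -> Set[str]:
    """Read a one-entry-per-line blacklist file (hypothetical helper)."""
    with Path(path).open(encoding="utf-8") as handle:
        return parse_blacklist(handle)


def domain_is_blacklisted(hostname: str, entries: Set[str]) -> bool:
    """Exact, case-insensitive match of the hostname against the entries."""
    return hostname.lower() in entries


def keyword_hits(html: str, entries: Set[str]) -> List[str]:
    """Entries found as substrings of the lowercased HTML, sorted for stable output."""
    lowered = html.lower()
    return sorted(entry for entry in entries if entry in lowered)
```

Substring matching is deliberately simple and will also match keywords embedded in longer words, which is one source of the false positives mentioned earlier.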
Provide the URL you wish to analyze as a command-line argument:

    python phishing_website_detector/main.py <url_to_analyze>

Example:

    python phishing_website_detector/main.py https://www.example-phishing.com/login

- `-b`, `--blacklist <file_path>`: Specify a custom path to the blacklist file (default: `blacklists.txt`).
- `-t`, `--threshold <score>`: Adjust the phishing score threshold for flagging a website (default: 0.5). Scores equal to or above this threshold will flag the site as "Potentially Phishing."
- `-o`, `--output <file_path>`: Output the detailed analysis results to a JSON file.
- Analyze a URL with a custom blacklist and output to JSON:

      python phishing_website_detector/main.py http://suspicious.site/ --blacklist my_custom_blacklist.txt --output analysis_report.json

- Analyze a URL with a higher phishing threshold:

      python phishing_website_detector/main.py http://another-suspicious.net/ --threshold 0.7

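For readers curious how options like these are typically wired up, here is a minimal `argparse` sketch. The flag names and defaults mirror the ones documented above, but the function name `build_parser` and the help strings are assumptions, and the real `main.py` may be organized differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the documented options (hypothetical layout)."""
    parser = argparse.ArgumentParser(description="Heuristic phishing website detector")
    parser.add_argument("url", help="URL to analyze")
    parser.add_argument(
        "-b", "--blacklist",
        default="blacklists.txt",
        help="path to the blacklist file",
    )
    parser.add_argument(
        "-t", "--threshold",
        type=float,
        default=0.5,
        help="phishing score at or above which a site is flagged",
    )
    parser.add_argument(
        "-o", "--output",
        default=None,
        help="write the detailed analysis to this JSON file",
    )
    return parser
```

Parsing `["http://another-suspicious.net/", "--threshold", "0.7"]` with this parser yields a float threshold and leaves the blacklist path and output file at their defaults.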
The detector calculates a phishing score based on various features. Each feature, when detected, adds a certain weight to the total score. A higher score indicates a greater likelihood of being a phishing site.
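The weighted-sum idea can be illustrated in a few lines. The weights and feature names below are invented for the example and are not the values used by the project; only the "equal to or above the threshold" rule is taken from the usage notes.

```python
from typing import Dict, List, Tuple

# Hypothetical weights for illustration only; the project's real values may differ.
WEIGHTS: Dict[str, float] = {
    "has_at_symbol": 0.2,
    "double_slash_in_path": 0.1,
    "no_https": 0.2,
    "host_is_ip": 0.3,
    "blacklist_match": 0.5,
}


def phishing_score(features: Dict[str, bool]) -> Tuple[float, List[str]]:
    """Sum the weights of every detected feature and list which ones fired."""
    fired = [name for name in WEIGHTS if features.get(name, False)]
    score = sum((WEIGHTS[name] for name in fired), 0.0)
    return score, fired


def is_flagged(score: float, threshold: float = 0.5) -> bool:
    """A score equal to or above the threshold flags the site."""
    return score >= threshold
```

Returning the list of fired features alongside the score keeps the result explainable, which matches the detailed breakdown the tool reports.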
Some key indicators and their conceptual impact:
- URL Length: Very long URLs can be a characteristic of phishing sites.
- `@` Symbol in URL: Often used to trick users by displaying a legitimate-looking domain before the `@` symbol.
- `//` in URL Path: Can be used for confusing redirects.
- Excessive Subdomains: Many subdomains can obscure the actual registered domain.
- No HTTPS: The absence of HTTPS, especially on login pages, is a major red flag.
- IP Address in Hostname: Legitimate sites rarely use raw IP addresses in their URLs.
- Forms/Iframes/Scripts: While common, their presence, especially with other suspicious indicators, can suggest data harvesting or malicious injections.
- Suspicious Keywords in HTML: Words like "login," "verify account," etc., when combined with other factors, can point to a phishing intent.
- Blacklist Match: A direct hit on a blacklisted domain or keyword significantly increases the phishing score.
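The HTML-side indicators in the list above (forms, iframes, scripts, and suspicious keywords) can be counted with Python's built-in `html.parser`. This is only a sketch under assumptions: the class and function names are hypothetical, the default keywords are the two examples mentioned earlier, and the project itself may rely on a different parsing library.

```python
from html.parser import HTMLParser
from typing import Dict, Iterable


class TagCounter(HTMLParser):
    """Count form, iframe, and script start tags."""

    def __init__(self) -> None:
        super().__init__()
        self.counts = {"form": 0, "iframe": 0, "script": 0}

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names already lowercased.
        if tag in self.counts:
            self.counts[tag] += 1


def extract_html_features(
    html: str, keywords: Iterable[str] = ("login", "verify")
) -> Dict[str, object]:
    """Return tag counts and the keywords found in the page (case-insensitive)."""
    counter = TagCounter()
    counter.feed(html)
    counter.close()
    lowered = html.lower()
    return {
        "form_count": counter.counts["form"],
        "iframe_count": counter.counts["iframe"],
        "script_count": counter.counts["script"],
        "keywords_found": [kw for kw in keywords if kw.lower() in lowered],
    }
```

Because forms and scripts appear on most legitimate pages, counts like these are best treated as weak signals that only matter in combination with the other indicators.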
To run the automated tests, execute the following command from the project's root directory:

    python -m unittest discover tests

Contributions are welcome! If you have ideas for improvements or have found a bug, please open an issue or submit a pull request.
- Fork the repository.
- Create a new branch: `git checkout -b feature/your-feature-name`
- Make your changes and commit them: `git commit -m 'Add some feature'`
- Push to the branch: `git push origin feature/your-feature-name`
- Open a pull request.
This project is licensed under the MIT License. See the LICENSE file for details.