A Python-based web scraper that gathers publicly available information from websites and flags potentially exposed sensitive data, such as email addresses or API keys. The project shows how automation can be used to find this kind of exposure, which can pose a security risk.
- Scrapes and parses website content using requests and BeautifulSoup.
- Detects sensitive data patterns like:
  - Emails
  - API Keys
- Generates reports with the detected information.
- Customizable regex patterns for different data types (e.g., phone numbers, credit cards).
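
The detection and reporting features listed above can be sketched roughly as follows. This is a minimal illustration, not the code in scraper.py: the names `PATTERNS`, `detect_sensitive_data`, and `format_report` are hypothetical, and both regexes are assumptions (in particular, the API key pattern simply looks for a key-like name followed by a long token). The input text is assumed to be page content already fetched with requests and extracted with BeautifulSoup.

```python
import re

# Hypothetical patterns for illustration; the project's own regexes may differ.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    # Assumed shape: a key-like name, then ':' or '=', then a 16+ character token.
    "api_key": re.compile(
        r"(?i)\b(?:api[_-]?key|token|secret)\b\s*[:=]\s*['\"]?([A-Za-z0-9_\-]{16,})"
    ),
}


def detect_sensitive_data(text, patterns=PATTERNS):
    """Return {label: sorted unique matches} for every pattern that matches."""
    findings = {}
    for label, pattern in patterns.items():
        matches = set()
        for m in pattern.finditer(text):
            # Use the capture group when the pattern defines one, else the whole match.
            matches.add(m.group(1) if pattern.groups else m.group(0))
        if matches:
            findings[label] = sorted(matches)
    return findings


def format_report(url, findings):
    """Build a plain-text report for one scanned URL."""
    lines = [f"Report for {url}"]
    if not findings:
        lines.append("  No sensitive patterns detected.")
    for label, values in findings.items():
        lines.append(f"  {label}: {len(values)} found")
        for value in values:
            lines.append(f"    - {value}")
    return "\n".join(lines)
```

Because `patterns` is a plain dictionary, adding a new data type (for example, phone numbers) would only require adding another labelled regex to it.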
- Clone the repository:

      git clone https://github.com/aavila02/web-scraping-data-security.git
      cd web-scraping-data-security

- Install the required dependencies:

      pip install -r requirements.txt

- Run the scraper:

      python scraper.py

- Integrate with the Have I Been Pwned API to check whether scraped emails appear in known breaches.
- Add multi-page scraping for large websites.
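
As a rough idea of what multi-page scraping could look like, the sketch below performs a breadth-first crawl restricted to the starting site's host. It is an assumption about one possible design, not part of the current project: `crawl` and its `fetch_links` parameter are hypothetical names. `fetch_links(url)` is expected to return the `href` values found on a page, which could be implemented with `requests.get` and BeautifulSoup's `find_all("a", href=True)`.

```python
from collections import deque
from urllib.parse import urldefrag, urljoin, urlparse


def crawl(start_url, fetch_links, max_pages=20):
    """Visit pages breadth-first, staying on start_url's host.

    fetch_links is a hypothetical callable: it takes a URL and returns
    the href strings found on that page.
    """
    host = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for href in fetch_links(url):
            # Resolve relative links and drop #fragments so pages are not revisited.
            absolute, _fragment = urldefrag(urljoin(url, href))
            if urlparse(absolute).netloc == host and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return visited
```

The `max_pages` cap keeps the crawl bounded on large websites; each visited URL could then be passed to the existing single-page scan.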