A Python tool for crawling websites and saving their content as markdown files.
- Crawls websites from a CSV list
- Extracts main content from HTML
- Converts HTML to clean markdown
- Saves files with metadata frontmatter
- Handles errors gracefully
- Create a virtual environment (optional but recommended):

  ```
  python -m venv venv
  source venv/bin/activate  # On Windows: venv\Scripts\activate
  ```
- Install required packages:

  ```
  pip install requests beautifulsoup4 markdownify
  ```
- Create a CSV file named `server.csv` with the following format:

  ```
  https://example.com/page,variable key,Site Name,Category,Start marker,End marker
  https://another-site.com,variable key,Another Site,Another Category,Start marker,End marker
  ```
- Run the crawler:

  ```
  python webcrawler.py
  ```
- The crawler will:
  - Read URLs from `server.csv`
  - Crawl each website
  - Save content as markdown files in the current directory
- Files are named in the format `YYYYMMDD_SiteName_variableValue.md`, where the variable value is a number looked up by the key given in the CSV file.
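Assuming the column order shown in the `server.csv` format above (URL, variable key, site name, category, start marker, end marker), reading the CSV and building the dated filename could be sketched as follows; `read_targets` and the dictionary field names are illustrative, not the tool's actual API:

```python
import csv
from datetime import date

def create_filename(site_name: str, variable_value: int) -> str:
    # YYYYMMDD_SiteName_variableValue.md
    return f"{date.today():%Y%m%d}_{site_name}_{variable_value}.md"

def read_targets(path: str = "server.csv"):
    # Column order assumed from the example rows above:
    # url, variable key, site name, category, start marker, end marker
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.reader(f):
            if len(row) >= 6:
                url, key, site, category, start, end = row[:6]
                yield {"url": url, "key": key, "site": site,
                       "category": category, "start": start, "end": end}
```

Each yielded dictionary is one crawl target; the start and end markers can then be used to trim the extracted content.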
Each markdown file contains:
- YAML frontmatter with metadata
- Title as H1 heading
- Main content of the webpage
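Assembling such a file might look like the sketch below; the exact frontmatter fields are an assumption based on the list above, not the tool's guaranteed output:

```python
from datetime import date

def build_markdown(title: str, url: str, site: str,
                   category: str, body_md: str) -> str:
    # YAML frontmatter, then an H1 title, then the converted content
    frontmatter = "\n".join([
        "---",
        f"title: {title}",
        f"url: {url}",
        f"site: {site}",
        f"category: {category}",
        f"date: {date.today():%Y-%m-%d}",
        "---",
    ])
    return f"{frontmatter}\n\n# {title}\n\n{body_md}\n"
```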
- Edit the `extract_main_content` function to customize content extraction
- Modify the `clean_content` function to remove unwanted elements
- Change the filename format in the `create_filename` function
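For example, a customized `clean_content`/`extract_main_content` pair might look like this sketch; the tag list and selector chain are illustrative starting points, not the tool's actual defaults:

```python
from bs4 import BeautifulSoup

def clean_content(soup: BeautifulSoup) -> BeautifulSoup:
    # Drop elements that rarely belong in the saved markdown
    for tag in soup.find_all(["script", "style", "nav", "footer", "aside"]):
        tag.decompose()
    return soup

def extract_main_content(html: str) -> str:
    soup = clean_content(BeautifulSoup(html, "html.parser"))
    # Adjust this selector chain for sites with different layouts
    main = soup.find("main") or soup.find("article") or soup.body
    return str(main)
```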
- Python 3.6+
- BeautifulSoup4
- Markdownify
- Requests