A Python tool for crawling websites and saving their content as markdown files.
- Crawls websites from a CSV list
- Extracts main content from HTML
- Converts HTML to clean markdown
- Saves files with metadata frontmatter
- Handles errors gracefully
- Create a virtual environment (optional but recommended):

  ```
  python -m venv venv
  source venv/bin/activate  # On Windows: venv\Scripts\activate
  ```
- Install required packages:

  ```
  pip install requests beautifulsoup4 markdownify
  ```
- Create a CSV file named `server.csv` with the following format:

  ```
  https://example.com/page,variable key,Site Name,Category,Start marker,End marker
  https://another-site.com,variable key,Another Site,Another Category,Start marker,End marker
  ```
- Run the crawler:

  ```
  python webcrawler.py
  ```
- The crawler will:
  - Read URLs from `server.csv`
  - Crawl each website
  - Save content as markdown files in the current directory
- Files are named in the format `YYYYMMDD_SiteName_variableValue.md`, where the variable value is a number looked up by the key given in the CSV file.
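Assuming the column order shown in the `server.csv` format above (URL, variable key, site name, category, start marker, end marker), reading the CSV and building the dated filename could be sketched as follows; `read_targets` and the dictionary field names are illustrative, not the tool's actual API:

```python
import csv
from datetime import date

def create_filename(site_name: str, variable_value: int) -> str:
    # YYYYMMDD_SiteName_variableValue.md
    return f"{date.today():%Y%m%d}_{site_name}_{variable_value}.md"

def read_targets(path: str = "server.csv"):
    # Column order assumed from the example rows above:
    # url, variable key, site name, category, start marker, end marker
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.reader(f):
            if len(row) >= 6:
                url, key, site, category, start, end = row[:6]
                yield {"url": url, "key": key, "site": site,
                       "category": category, "start": start, "end": end}
```

Each yielded dictionary is one crawl target; the start and end markers can then be used to trim the extracted content.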
Each markdown file contains:
- YAML frontmatter with metadata
- Title as H1 heading
- Main content of the webpage
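Assembling such a file might look like the sketch below; the exact frontmatter fields are an assumption based on the list above, not the tool's guaranteed output:

```python
from datetime import date

def build_markdown(title: str, url: str, site: str,
                   category: str, body_md: str) -> str:
    # YAML frontmatter, then an H1 title, then the converted content
    frontmatter = "\n".join([
        "---",
        f"title: {title}",
        f"url: {url}",
        f"site: {site}",
        f"category: {category}",
        f"date: {date.today():%Y-%m-%d}",
        "---",
    ])
    return f"{frontmatter}\n\n# {title}\n\n{body_md}\n"
```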
- Edit the `extract_main_content` function to customize content extraction
- Modify the `clean_content` function to remove unwanted elements
- Change the filename format in the `create_filename` function
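For example, a customized `clean_content`/`extract_main_content` pair might look like this sketch; the tag list and selector chain are illustrative starting points, not the tool's actual defaults:

```python
from bs4 import BeautifulSoup

def clean_content(soup: BeautifulSoup) -> BeautifulSoup:
    # Drop elements that rarely belong in the saved markdown
    for tag in soup.find_all(["script", "style", "nav", "footer", "aside"]):
        tag.decompose()
    return soup

def extract_main_content(html: str) -> str:
    soup = clean_content(BeautifulSoup(html, "html.parser"))
    # Adjust this selector chain for sites with different layouts
    main = soup.find("main") or soup.find("article") or soup.body
    return str(main)
```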
- Python 3.6+
- BeautifulSoup4
- Markdownify
- Requests