Try only accepting html files in the crawl download
schbetsy committed Oct 12, 2020
1 parent 8eb419d commit 2ab81ec
Showing 1 changed file with 2 additions and 2 deletions.
4 changes: 2 additions & 2 deletions crawl.sh
@@ -6,10 +6,10 @@ time wget \
  --follow-tags=a \
  --limit-rate=200k \
  --random-wait \
- --reject '*.css,*.doc,*.docx,*.epub,*.gif,*.ico,*.jpg,*.js,*.mp3,*.pdf,*.PDF,*.png,*.pptx,*.tmp,*.txt,*.wav,*.woff,*.woff2,*.xls,*xlsx,*.xml,*.zip' \
+ --accept html \
  --reject-regex "topics=|authors=|categories=|filter_blog_category=|ext_url=|search_field=|issuer_name=" \
  --recursive \
- --level=8 \
+ --level=5 \
  --trust-server-names \
  --no-verbose \
  --no-clobber \