# body_scraper

This script scrapes the body text from a list of URLs and writes the text from each URL to a separate `.txt` file.

It was built on OS X 10.10 and uses bash and Ruby with the Nokogiri library.
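
Nokogiri is not part of Ruby's standard library, so if it isn't already present it can typically be installed with:

```
gem install nokogiri
```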

To use it, open a terminal in the project directory and run:

```
./scrape.sh urls.txt
```

This will look at every URL in `urls.txt` and create a new text file for each one in a subdirectory called `texts`. Before running the script, make sure all your URLs are in `urls.txt`, with exactly one URL per line and no blank lines.
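
For example, a `urls.txt` with three entries (the URLs here are placeholders) might look like:

```
https://example.com/articles/first-post
https://example.com/articles/second-post
https://example.com/articles/third-post
```

Running `./scrape.sh urls.txt` against that file would produce `texts/1.txt`, `texts/2.txt`, and `texts/3.txt`.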

## Notes

This script first grabs what is inside a web page's `<title>` tags, then extracts the text of the paragraph elements matched by the XPath `//body/p`. This works for about 90% of web pages, but some pages don't put their article text inside `<p>` tags (for example, important text sometimes lives in `<ul>` or `<li>` elements).
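
In outline, the extraction could look something like the Ruby sketch below. This is illustrative only, not the actual `scrape.rb`: the use of `open-uri` for fetching and the variable names are assumptions.

```ruby
# Illustrative sketch of the extraction step (assumed, not the real scrape.rb).
require 'nokogiri'
require 'open-uri' # assumption: pages are fetched over HTTP with open-uri

url = 'https://example.com/article' # placeholder URL
doc = Nokogiri::HTML(URI.open(url))

# Grab whatever is inside the page's <title> tags.
title_node = doc.at_xpath('//title')
title = title_node ? title_node.text.strip : ''

# Collect the text of every <p> that sits directly under <body>.
paragraphs = doc.xpath('//body/p').map { |p| p.text.strip }

puts title
puts paragraphs.join("\n\n")
```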

The text files generated by the script are simply named `1.txt`, `2.txt`, `3.txt`, and so on. A nice improvement would be to use the article title (or part of it) as each file's name.
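
One way to do that would be to slugify the title and fall back to the numeric name when no title was found; the `filename_for` helper below is hypothetical, not part of the current script.

```ruby
# Hypothetical helper: derive a safe filename from an article title,
# falling back to the numeric scheme when the title is empty.
def filename_for(title, index)
  slug = title.downcase
              .gsub(/[^a-z0-9]+/, '-') # collapse anything non-alphanumeric
              .gsub(/\A-+|-+\z/, '')   # trim leading/trailing dashes
  slug.empty? ? "#{index}.txt" : "#{slug[0, 60]}.txt"
end

filename_for('A Very: Long! Article Title', 4)
# => "a-very-long-article-title.txt"
```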
