DannyChoo.com Post Scraper v0.2.1
DannyChoo.com Post Scraper is a PHP library that allows you to glean data from DannyChoo.com and store it in a MySQL database.
Check out http://dc-api.danielbough.com for an API that uses data gathered by this library.
- Post URL
- Post Title
- Post Description
- Post Publish Date
- Post Category
- Category URL
- First Photo URL
daniel.bough AT gmail.com
This software is free to use under the GPL license.
- MySQL Database (v5+)
- PHP Simple HTML DOM Parser (http://sourceforge.net/projects/simplehtmldom/files/).
- Download the latest version of PHP Simple HTML DOM Parser from http://sourceforge.net/projects/simplehtmldom/files/, which, at the time of this writing, is 1.5. This should be stored in the same location as
dc_post_scraper.php. (If you'd like to store it somewhere else, you will need to change the path at the top of
- Add your MySQL database info to
- Create a MySQL database (InnoDB) and import the table structure in
-dfor debug output).
- You can adjust the number of archive pages scanned by updating the
PAGE_DEPTHglobal variable in dc_include.php. The default is 1, however there are (at the time of this writing) at least 179 pages.
- The more archive pages you scan at a time, the longer this script will take. There are nearly 6000 posts - it could take a few hours to parse them all. If you don't mind waiting, I'd set the
PAGE_DEPTHvariable to the max number of pages, run the script, and walk away. After you have them all, change the
PAGE_DEPTHto 1 and run the script once per day.
- It takes approximately 5 minutes to scrape & store 1 archive page worth of posts (36 max).
- Change the format of the Changelog section of the README file.
- Fixed a few bugs caused by HTML changes on DannyChoo.com.
- Added basic debug functionality to
- Added code to grab the first photo url of a post. Cleaned up some typo-s.