GitHub - axieyangb/WebCrawler: Use BFS Alogrithm to implement a simple crawler to fetch all the embedded jpg/png urls in html pages derived from one root page. Fetching gigbytes of images will cost several hours. I am still working on the performance improvement. This is the first version. Any advise and suggestion are welcome!

Branches Tags

Name		Name	Last commit message	Last commit date
Latest commit History 8 Commits
ImageDownload		ImageDownload
ImgCrawler		ImgCrawler
.gitattributes		.gitattributes
.gitignore		.gitignore
README.txt		README.txt

Repository files navigation

If you just want to run the program  to test the functions, you only need put following files in one folder

In WebCrawler/ImgCrawler/ImgCrawler/bin/Debug/
MySql.Data.dll
ImgCrawler.exe
ImgCrawler.exe.config

In WebCrawler/ImageDownload/ImageDownload/bin/Debug/
Dapper.dll
ImageDownload.exe
ImageDownload.exe.config


Basically the final directory will looks like this:


MySql.Data.dll
Dapper.dll
ImgCrawler.exe
ImgCrawler.exe.config
ImageDownload.exe
ImageDownload.exe.config



1.You need have your own mysql database( local/remote)
replace the connection string in ImgCrawler.exe.config with  your own value:

 <add name="mySqlConnString" connectionString="server=*****;uid=***;pwd=***;database=crawlerDb"/>

2. also the connection string in ImageDownload.exe.config.


3. <add key="ImgOutDirectory" value="D://GrabPics"/>  (any place you want)
Change the directory path where you prefer to store your download images.

In database crawlerDb

there are two tables 
 1. UrlContent is used to store the html url
 2.ImgContent is used to store the image url

CREATE TABLE `UrlContent` (
   `ListId` bigint(20) NOT NULL AUTO_INCREMENT,
   `Url` varchar(255) DEFAULT NULL,
   `CreateDate` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
   `Depth` int(11) DEFAULT '0',
   `IsVisited` int(11) DEFAULT '0',
   PRIMARY KEY (`ListId`)
 ) ENGINE=InnoDB AUTO_INCREMENT=64846 DEFAULT CHARSET=latin1

CREATE TABLE `ImgContent` (
   `ImgId` bigint(20) NOT NULL AUTO_INCREMENT,
   `ImgUrl` varchar(255) DEFAULT NULL,
   `UrlId` bigint(20) DEFAULT NULL,
   `IsVisited` int(11) DEFAULT '0',
   `CreateDate` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
   PRIMARY KEY (`ImgId`),
   KEY `fk_urlId` (`UrlId`),
   CONSTRAINT `fk_urlId` FOREIGN KEY (`UrlId`) REFERENCES `UrlContent` (`ListId`)
 ) ENGINE=InnoDB AUTO_INCREMENT=34917 DEFAULT CHARSET=latin1



Start : 
  Insert one root url into your UrlContent table:
  <your start url> replace with your  root website ,set depth to be 0:
  Here is the insert query.
  
  insert into UrlContent (Url,depth) values (<your start url>, 0);
  
  RUN The ImgCrawler.exe firstly,
  The records in UrlContent will increase dramatically.
  Then, run the ImageDownload.exe to download the files if there are lots of urls in ImgContent.
  Normally, You need wait serveral mins after ImgCrawler.exe has grap bunch of imgs.
 
 
 If your ImageDownload.exe normally exit, which means all found images have been downloaded. 
 You can repeatly execute it if new images found.
 Thats all.
 
 Hope you will enjoy it.

About

Use BFS Alogrithm to implement a simple crawler to fetch all the embedded jpg/png urls in html pages derived from one root page. Fetching gigbytes of images will cost several hours. I am still working on the performance improvement. This is the first version. Any advise and suggestion are welcome!

Readme