Web Crawler using Go Language

References:

https://golang.org/
https://golang.org/doc/
https://github.com/jackdanger/collectlinks
https://godoc.org/github.com/PuerkitoBio/goquery

Downloads:

https://golang.org/dl/
https://golang.org/doc/install

Install Go on Linux (Step-1)

1) Use the apt package installer:
sudo add-apt-repository ppa:longsleep/golang-backports
sudo apt update
sudo apt install golang-go

2) Or simply use the Snap installer:
sudo snap install --classic go

Configure Go Path (Step-2)

1. Set the $GOPATH environment variable to point to your Go project directory.
2. Add $GOPATH/bin to $PATH.

NOTE: Both can be set with a Bash alias. Run the alias from each Go project root to reset the path.
Define the alias in Linux with the command below, then run gopath from the project root:
alias gopath='export GOPATH=$(pwd);export PATH=$PATH:$GOPATH/bin'

If Go is already installed on your system and the Go path is configured, follow the steps below to clone the repo and run the script in a Linux console:

Install the 3rd-party package (required dependency) (Step-3)

go get "github.com/jackdanger/collectlinks"

Git Clone the repo (Step-4)

cd go/src
git clone https://github.com/agrawalnaman/web-crawler-using-Golang.git

Run crawl.go (Step-5)

Command for running the script: go run crawl.go <Base URL>

cd web-crawler-using-Golang/
go run crawl.go http://rescale.com/
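
The crawler takes a base URL on the command line, and the sample output lists absolute URLs. How crawl.go builds those URLs is not shown in this README. The sketch below is one common approach using only the standard library: read the base URL from the arguments and resolve each link against it. The helper name resolveLink and the sample hrefs are hypothetical, chosen for illustration.

```go
package main

import (
	"fmt"
	"net/url"
	"os"
)

// resolveLink turns href (possibly relative) into an absolute URL using base.
// It returns "" when either value cannot be parsed.
// (Hypothetical helper; not taken from crawl.go.)
func resolveLink(base, href string) string {
	b, err := url.Parse(base)
	if err != nil {
		return ""
	}
	h, err := url.Parse(href)
	if err != nil {
		return ""
	}
	return b.ResolveReference(h).String()
}

func main() {
	// Fall back to the README's example site when no argument is given.
	base := "http://rescale.com/"
	if len(os.Args) > 1 {
		base = os.Args[1]
	}
	// Sample hrefs as they might appear in a page's anchor tags.
	for _, href := range []string{"/products/", "jp/", "https://docs.rescale.com/"} {
		fmt.Println(resolveLink(base, href))
	}
}
```

Resolving through net/url leaves absolute links unchanged and makes relative ones absolute, so every link can be fetched regardless of how the page wrote it.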

Testing

Crawled some small websites and manually checked the output for loops and infinite callbacks:
go run crawl.go https://www.crawler-test.com/links/repeated_internal_links
Other websites were tested too, for example:
go run crawl.go https://www.crawler-test.com/links/page_with_external_links

A good website for testing the crawler is:

go run crawl.go https://www.crawler-test.com/
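
The loop checks above matter because pages often link back to themselves or to pages already crawled (the sample output lists http://rescale.com/ more than once under the same page). A usual guard is a set of already-seen URLs. The sketch below is an assumption about how such a guard could look, not the code in crawl.go; visitedSet and markNew are hypothetical names.

```go
package main

import (
	"fmt"
	"sync"
)

// visitedSet records URLs that have already been queued.
// The mutex makes it safe to share between goroutines.
type visitedSet struct {
	mu   sync.Mutex
	seen map[string]bool
}

func newVisitedSet() *visitedSet {
	return &visitedSet{seen: make(map[string]bool)}
}

// markNew reports true the first time a URL is seen and false afterwards.
func (v *visitedSet) markNew(u string) bool {
	v.mu.Lock()
	defer v.mu.Unlock()
	if v.seen[u] {
		return false
	}
	v.seen[u] = true
	return true
}

func main() {
	v := newVisitedSet()
	links := []string{
		"http://rescale.com/",
		"http://rescale.com/products/",
		"http://rescale.com/", // repeated link
	}
	for _, l := range links {
		if v.markNew(l) {
			fmt.Println("crawl:", l)
		} else {
			fmt.Println("skip: ", l)
		}
	}
}
```

Checking and marking under a single lock means two goroutines that find the same link at the same moment cannot both decide to crawl it.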

Start and stop timestamps are printed with the URLs of each page; the overlapping times show that multiple threads are parsing the URLs simultaneously.
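
To illustrate how timestamps can show concurrent crawling, here is a minimal sketch that assumes one goroutine per URL and a sync.WaitGroup to wait for them. It is not the code from crawl.go: crawlAll, crawlResult, and the simulated fetch are hypothetical stand-ins, and the sleep replaces a real page download.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// crawlResult holds the timing of one simulated page crawl.
type crawlResult struct {
	url         string
	start, stop time.Time
}

// crawlAll starts one goroutine per URL and waits for all of them.
// fetch is a stand-in for the real page download.
func crawlAll(urls []string, fetch func(string)) []crawlResult {
	results := make([]crawlResult, len(urls))
	var wg sync.WaitGroup
	for i, u := range urls {
		wg.Add(1)
		go func(i int, u string) {
			defer wg.Done()
			start := time.Now()
			fetch(u)
			// Each goroutine writes only to its own slot, so no lock is needed.
			results[i] = crawlResult{url: u, start: start, stop: time.Now()}
		}(i, u)
	}
	wg.Wait()
	return results
}

func main() {
	urls := []string{
		"http://rescale.com/products/",
		"http://rescale.com/chs/",
		"http://rescale.com/jp/",
	}
	slowFetch := func(string) { time.Sleep(50 * time.Millisecond) } // simulated network delay
	for _, r := range crawlAll(urls, slowFetch) {
		fmt.Println("Start time of crawl of this URL:", r.start)
		fmt.Println("Stop time of crawl of this URL :", r.stop)
		fmt.Println(r.url)
	}
}
```

When the crawls run concurrently, the printed start times sit close together and the start-to-stop intervals overlap, as in the test run output in this README. A sequential crawl would show each start time falling after the previous stop time.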

Output

Output for the test input go run crawl.go http://rescale.com/:


ubuntu@ip-10-0-0-136:~/go/src/web-crawler-using-Golang$ go run crawl.go http://rescale.com/
[http://rescale.com/]
Start time of crawl of this URL: 2020-03-13 22:07:37.810176103 +0000 UTC m=+0.000699272
Stop time of crawl of this URL : 2020-03-13 22:07:38.578908454 +0000 UTC m=+0.769431501
http://rescale.com/
	 http://rescale.com/
	 http://rescale.com/jp/
	 http://rescale.com/chs/
	 http://rescale.com/kr/
	 http://rescale.com/products/
	 http://rescale.com/products/enterprise/
	 http://rescale.com/products/advanced/
	 http://rescale.com/products/elements/
	 http://rescale.com/products/government/
	 http://rescale.com/products/universities/
	 http://rescale.com/products/developer/
	 http://rescale.com/features/
	 http://rescale.com/features/visualization/
	 http://rescale.com/security/
	 http://rescale.com/pricing/
	 http://rescale.com/infrastructure/
	 http://rescale.com/features/file-management/
	 http://rescale.com/features/admin-portal/
	 http://rescale.com/features/api/
	 http://rescale.com/features/workflow/
	 http://rescale.com/
	 http://rescale.com/aws/
	 http://rescale.com/azure/
	 http://rescale.com/gcp/
	 http://rescale.com/ibm/
	 http://rescale.com/ansys/
	 http://rescale.com/avl/
	 http://rescale.com/siemens/
	 http://rescale.com/solutions/
	 http://rescale.com/solutions/engineers-and-scientists/
	 http://rescale.com/solutions/cxos/
	 http://rescale.com/solutions/hpc-management/
	 http://rescale.com/solutions/academia/
	 http://rescale.com/solutions/startups/
	 http://rescale.com/solutions/aerospace/
	 http://rescale.com/solutions/automotive/
	 http://rescale.com/solutions/consumer-products/
	 http://rescale.com/solutions/eda-and-electronics/
	 http://rescale.com/solutions/oil-and-gas/
	 http://rescale.com/solutions/life-sciences/
	 http://rescale.com/solutions/autonomous-driving/
	 http://rescale.com/solutions/automation-api-cli/
	 http://rescale.com/solutions/cae-mdo/
	 http://rescale.com/solutions/digital-twin/
	 http://rescale.com/solutions/driver-assistance-adas/
	 http://rescale.com/solutions/disaster-recovery/
	 http://rescale.com/solutions/iot-big-data/
	 http://rescale.com/solutions/machine-learning/
	 http://rescale.com/solutions/spdm/
	 http://rescale.com/storage/
	 http://rescale.com/software/
	 http://rescale.com/partners/
	 http://rescale.com/partners-list/
	 https://resources.rescale.com/
	 https://docs.rescale.com/
	 https://resources.rescale.com/?wpv-resource-type=white-paper
	 http://rescale.com/about/
	 http://rescale.com/investors/
	 https://resources.rescale.com//blog
	 https://resources.rescale.com//news
	 https://resources.rescale.com//events
	 http://rescale.com/jobs/
	 http://rescale.com/legal/
	 http://rescale.com/signup/
	 http://rescale.com/login/
	 https://info.rescale.com/case-studies/nissan
	 https://info.rescale.com/white-papers/cloud-3.0-the-rise-of-big-compute
	 https://info.rescale.com/case-studies/dinex-reduces-time-to-market-of-exhaust-systems-by-25-percent
	 https://www.youtube.com/watch?v=05HfJ8dZJXE
	 https://info.rescale.com/case-studies/boom-supersonic
	 https://www.youtube.com/watch?v=umiGy7fe5zc
	 https://www.youtube.com/watch?v=h1nsUGuklHw
	 https://www.youtube.com/watch?v=tPaq3Hmeg5Y
	 https://resources.rescale.com/?wpv-resource-type=video
	 https://resources.rescale.com/resource/a3-project-vahana-rescale-power-personal-flight/
	 https://resources.rescale.com/boom-technology-leverages-rescale-platform-to-enable-a-rebirth-of-supersonic-passenger-travel/
	 https://resources.rescale.com/resource/the-need-for-speed-drives-nascars-richard-childress-racing-to-the-cloud/
	 https://support.rescale.com/customer/en/portal/articles/2778993-trek-bicycle-uses-rescale-to-run-cutting-edge-coupled-optimization-analysis
	 http://rescale.com/booking/
	 https://resources.rescale.com/events/
	 https://resources.rescale.com/news/
	 https://resources.rescale.com/rescale-enables-faster-time-to-market-for-nissan/
	 https://resources.rescale.com/announcements/rescale-announces-strategic-partnership-offering-with-siemens-plm/
	 https://resources.rescale.com/announcements/rescale-announces-innovations-to-accelerate-time-to-results/
	 https://resources.rescale.com/announcements/rescale-receives-2018-hpcwire-editors-choice-award-for-best-hpc-in-the-cloud-platform/
	 https://resources.rescale.com/blog
	 http://info.rescale.com/contact_sales
	 https://www.linkedin.com/company/rescale/
	 https://twitter.com/rescaleinc
	 https://www.facebook.com/rescaleinc/
Start time of crawl of this URL: 2020-03-13 22:07:38.586394942 +0000 UTC m=+0.776918083
Stop time of crawl of this URL : 2020-03-13 22:07:39.315722492 +0000 UTC m=+1.506245536
http://rescale.com/products/
	 http://rescale.com/
	 http://rescale.com/products/
	 http://rescale.com/jp/products/
	 http://rescale.com/chs/products/
	 http://rescale.com/kr/products/
	 http://rescale.com/products/enterprise/
	 http://rescale.com/products/advanced/
	 http://rescale.com/products/elements/
	 http://rescale.com/products/government/
	 http://rescale.com/products/universities/
	 http://rescale.com/products/developer/
	 http://rescale.com/features/
	 http://rescale.com/features/visualization/
	 http://rescale.com/security/
	 http://rescale.com/pricing/
	 http://rescale.com/infrastructure/
	 http://rescale.com/features/file-management/
	 http://rescale.com/features/admin-portal/
	 http://rescale.com/features/api/
	 http://rescale.com/features/workflow/
	 http://rescale.com/products/
	 http://rescale.com/aws/
	 http://rescale.com/azure/
	 http://rescale.com/gcp/
	 http://rescale.com/ibm/
	 http://rescale.com/ansys/
	 http://rescale.com/avl/
	 http://rescale.com/siemens/
	 http://rescale.com/solutions/
	 http://rescale.com/solutions/engineers-and-scientists/
	 http://rescale.com/solutions/cxos/
	 http://rescale.com/solutions/hpc-management/
	 http://rescale.com/solutions/academia/
	 http://rescale.com/solutions/startups/
	 http://rescale.com/solutions/aerospace/
	 http://rescale.com/solutions/automotive/
	 http://rescale.com/solutions/consumer-products/
	 http://rescale.com/solutions/eda-and-electronics/
	 http://rescale.com/solutions/oil-and-gas/
	 http://rescale.com/solutions/life-sciences/
	 http://rescale.com/solutions/autonomous-driving/
	 http://rescale.com/solutions/automation-api-cli/
	 http://rescale.com/solutions/cae-mdo/
	 http://rescale.com/solutions/digital-twin/
	 http://rescale.com/solutions/driver-assistance-adas/
	 http://rescale.com/solutions/disaster-recovery/
	 http://rescale.com/solutions/iot-big-data/
	 http://rescale.com/solutions/machine-learning/
	 http://rescale.com/solutions/spdm/
	 http://rescale.com/storage/
	 http://rescale.com/software/
	 http://rescale.com/partners/
	 http://rescale.com/partners-list/
	 https://resources.rescale.com/
	 https://docs.rescale.com/
	 https://resources.rescale.com/?wpv-resource-type=white-paper
	 http://rescale.com/about/
	 http://rescale.com/investors/
	 https://resources.rescale.com//blog
	 https://resources.rescale.com//news
	 https://resources.rescale.com//events
	 http://rescale.com/jobs/
	 http://rescale.com/legal/
	 http://rescale.com/signup/
	 http://rescale.com/login/
	 http://rescale.com/products/pro/
	 http://rescale.com/booking/
	 http://info.rescale.com/contact_sales
	 https://resources.rescale.com/events/
	 https://resources.rescale.com/news/
	 https://resources.rescale.com/rescale-enables-faster-time-to-market-for-nissan/
	 https://resources.rescale.com/announcements/rescale-announces-strategic-partnership-offering-with-siemens-plm/
	 https://resources.rescale.com/announcements/rescale-announces-innovations-to-accelerate-time-to-results/
	 https://resources.rescale.com/announcements/rescale-receives-2018-hpcwire-editors-choice-award-for-best-hpc-in-the-cloud-platform/
	 https://resources.rescale.com/blog
	 https://www.linkedin.com/company/rescale/
	 https://twitter.com/rescaleinc
	 https://www.facebook.com/rescaleinc/
Start time of crawl of this URL: 2020-03-13 22:07:38.585727075 +0000 UTC m=+0.776250248
Stop time of crawl of this URL : 2020-03-13 22:07:39.326520789 +0000 UTC m=+1.517043764
http://rescale.com/chs/
	 http://rescale.com/chs/
	 http://rescale.com/
	 http://rescale.com/jp/
	 http://rescale.com/kr/
	 http://rescale.com/chs/products/
	 http://rescale.com/chs/products/enterprise/
	 http://rescale.com/chs/products/advanced/
	 http://rescale.com/chs/products/elements/
	 http://rescale.com/chs/products/government/
	 http://rescale.com/chs/products/universities/
	 http://rescale.com/chs/products/developer/
	 http://rescale.com/chs/features/
	 http://rescale.com/chs/features/visualization/
	 http://rescale.com/chs/security/
	 http://rescale.com/chs/pricing/
	 http://rescale.com/chs/infrastructure/
	 http://rescale.com/chs/features/file-management/
	 http://rescale.com/chs/features/admin-portal/
	 http://rescale.com/chs/features/api/
	 http://rescale.com/chs/features/workflow/
	 http://rescale.com/chs/
	 http://rescale.com/chs/aws/
	 http://rescale.com/chs/azure/
	 http://rescale.com/chs/gcp/
	 http://rescale.com/chs/ibm/
	 http://rescale.com/chs/ansys/
	 http://rescale.com/chs/avl/
	 http://rescale.com/chs/siemens/
	 http://rescale.com/chs/solutions/
	 http://rescale.com/chs/solutions/engineers-and-scientists/
	 http://rescale.com/chs/solutions/cxos/
	 http://rescale.com/chs/solutions/hpc-management/
	 http://rescale.com/chs/solutions/academia/
	 http://rescale.com/chs/solutions/startups/
	 http://rescale.com/chs/solutions/aerospace/
	 http://rescale.com/chs/solutions/automotive/
	 http://rescale.com/chs/solutions/consumer-products/
	 http://rescale.com/chs/solutions/eda-and-electronics/
	 http://rescale.com/chs/solutions/oil-and-gas/
	 http://rescale.com/chs/solutions/life-sciences/
	 http://rescale.com/chs/solutions/autonomous-driving/
	 http://rescale.com/chs/solutions/automation-api-cli/
	 http://rescale.com/chs/solutions/cae-mdo/
	 http://rescale.com/chs/solutions/digital-twin/
	 http://rescale.com/chs/solutions/driver-assistance-adas/
	 http://rescale.com/chs/solutions/disaster-recovery/
	 http://rescale.com/chs/solutions/iot-big-data/
	 http://rescale.com/chs/solutions/machine-learning/
	 http://rescale.com/chs/solutions/spdm/
	 http://rescale.com/chs/storage/
	 http://rescale.com/chs/software/
	 http://rescale.com/chs/partners/
	 http://rescale.com/chs/partners-list/
	 https://resources.rescale.com/
	 https://docs.rescale.com/
	 https://resources.rescale.com/?wpv-resource-type=white-paper
	 http://rescale.com/chs/about/
	 http://rescale.com/chs/investors/
	 https://resources.rescale.com//blog
	 https://resources.rescale.com//news
	 https://resources.rescale.com//events
	 http://rescale.com/chs/jobs/
	 http://rescale.com/chs/legal/
	 http://rescale.com/signup/
	 http://rescale.com/login/
	 https://info.rescale.com/case-studies/nissan
	 https://info.rescale.com/white-papers/cloud-3.0-the-rise-of-big-compute
	 https://info.rescale.com/case-studies/dinex-reduces-time-to-market-of-exhaust-systems-by-25-percent
	 https://www.youtube.com/watch?v=05HfJ8dZJXE
	 https://info.rescale.com/case-studies/boom-supersonic
	 https://www.youtube.com/watch?v=umiGy7fe5zc
	 https://www.youtube.com/watch?v=h1nsUGuklHw
	 https://www.youtube.com/watch?v=tPaq3Hmeg5Y
	 https://resources.rescale.com/?wpv-resource-type=video
	 https://resources.rescale.com/resource/a3-project-vahana-rescale-power-personal-flight/
	 https://resources.rescale.com/boom-technology-leverages-rescale-platform-to-enable-a-rebirth-of-supersonic-passenger-travel/
	 https://resources.rescale.com/resource/the-need-for-speed-drives-nascars-richard-childress-racing-to-the-cloud/
	 https://support.rescale.com/customer/en/portal/articles/2778993-trek-bicycle-uses-rescale-to-run-cutting-edge-coupled-optimization-analysis
	 http://rescale.com/chs/booking/
	 https://resources.rescale.com/events/
	 https://resources.rescale.com/blog
	 http://info.rescale.com/contact_sales
	 https://www.linkedin.com/company/rescale/
	 https://twitter.com/rescaleinc
	 https://www.facebook.com/rescaleinc/
Start time of crawl of this URL: 2020-03-13 22:07:38.5852948 +0000 UTC m=+0.775817962
Stop time of crawl of this URL : 2020-03-13 22:07:39.337943349 +0000 UTC m=+1.528466402
http://rescale.com/jp/
	 http://rescale.com/jp/
	 http://rescale.com/
	 http://rescale.com/chs/
	 http://rescale.com/kr/
	 http://rescale.com/jp/products/
	 http://rescale.com/jp/products/enterprise/
	 http://rescale.com/jp/products/advanced/
	 http://rescale.com/jp/products/elements/
	 http://rescale.com/jp/products/government/
	 http://rescale.com/jp/products/universities/
	 http://rescale.com/jp/products/developer/
	 http://rescale.com/jp/features/
	 http://rescale.com/jp/features/visualization/
	 http://rescale.com/jp/security/
	 http://rescale.com/jp/pricing/
	 http://rescale.com/jp/infrastructure/
	 http://rescale.com/jp/features/file-management/
	 http://rescale.com/jp/features/admin-portal/
	 http://rescale.com/jp/features/api/
	 http://rescale.com/jp/features/workflow/
	 http://rescale.com/jp/
	 http://rescale.com/jp/aws/
	 http://rescale.com/jp/azure/
	 http://rescale.com/jp/gcp/
	 http://rescale.com/jp/ibm/
	 http://rescale.com/jp/ansys/
	 http://rescale.com/jp/avl/
	 http://rescale.com/jp/siemens/
	 http://rescale.com/jp/solutions/
	 http://rescale.com/jp/solutions/engineers-and-scientists/
	 http://rescale.com/jp/solutions/cxos/

About

A web crawler that fetches URLs and prints the crawl results to the console as the crawl proceeds.
