Web Crawler

Table of Contents

  1. About The Project
  2. Getting Started
  3. Usage
  4. Environment Variables
  5. Start crawling
  6. Features
  7. Contributing
  8. License
  9. Contact

About The Project

[Screenshot: tracking output]

A fast, high-level web crawling and web scraping framework written in Go. It crawls websites concurrently and extracts structured data from their pages. The crawler is hosted on a web server that implements a REST API and runs in a Docker container on AWS. The project includes a Dockerfile that works out of the box and can also be hosted locally. Simply follow the instructions below to get started!

Design

[Design diagram]

Built With

  • Go
  • Docker
  • AWS (EC2, S3)
  • gorilla/mux
  • sirupsen/logrus
  • aws-sdk-go

(back to top)

Getting Started

Prerequisites

  1. These instructions assume you already have a working Go environment; if not, please see this page first.

  2. These instructions assume you already have a working Docker environment; if not, please see this page first.

  3. These instructions assume you already have a working Discord bot; if not, please see this page first.

  4. If you are deploying the container to AWS, configure your AWS credentials first. Please see this page for assistance.

Installation

  1. Clone the repo

    git clone https://github.com/cody6750/web-crawler
  2. Get Go packages

     go get github.com/aws/aws-sdk-go
     go get github.com/gorilla/mux
     go get github.com/sirupsen/logrus 

(back to top)

Usage

The Web Crawler is designed to be deployed on AWS EC2 as a Docker container implementing a REST API, though it can also be deployed locally, either by building and running the executable directly or by running the Docker container on your machine. This section covers both approaches. Please note that these instructions are for macOS using a bash terminal.

Build locally without Docker

  1. Navigate to the webcrawler repo location.
  2. Navigate to the /web directory and build the Go executable. Set environment variables if you want to override any defaults.
go build -o app
  3. Run the executable.
./app

[Screenshot: go build locally]

Build locally with Docker

  1. Navigate to the webcrawler repo location.

  2. Navigate to the /web directory and build the Go executable.

go build -o app
  3. Build the Docker image using the Dockerfile. Set environment variables in the Dockerfile if you want to override any defaults.
docker build -t webcrawler .
  4. Run the Docker container.
  docker run -d -p 9090:9090 --network=discord --name webcrawler webcrawler
  5. Check that the Docker container is running.
  docker ps -a
  6. Check the Docker logs.
  docker logs webcrawler

[Screenshot: Docker build locally]

(back to top)

Environment Variables

The Web Crawler uses environment variables for configuration. Set them in the Dockerfile or export them from your shell.

| Environment Variable | Default Value | Description |
|---|---|---|
| ALLOW_EMPTY_ITEM | false | Allows the web crawler to return scrape responses with empty items. |
| AWS_WRITE_OUTPUT_TO_S3 | false | Determines whether scrape responses are written to S3. |
| AWS_MAX_RERIES | discord/token | If AWS_WRITE_OUTPUT_TO_S3 is true, the maximum number of retries when creating the AWS session. |
| AWS_REGION | us-east-1 | If AWS_WRITE_OUTPUT_TO_S3 is true, the region used to configure the AWS session. |
| AWS_S3_BUCKET | webcrawler-results | If AWS_WRITE_OUTPUT_TO_S3 is true, the S3 bucket that scrape responses are sent to. |
| CRAWL_DELAY | 5 | Delay between crawls per web scraper worker. |
| HEADER_KEY | User-Agent | Header key used for HTTP requests. |
| HEADER_VALUE | Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36 | Header value used for HTTP requests. |
| LOG_LEVEL | INFO | Determines the log level. |
| IDLE_TIMEOUT | 120 | Maximum time to wait for the next request when keep-alives are enabled. |
| MAX_DEPTH | 1 | Maximum crawl depth during an execution of a crawl. |
| MAX_GO_ROUTINES | 10000 | Maximum number of goroutines deployed during an execution of a crawl. |
| MAX_VISITED_URLS | 20 | Maximum number of visited URLs during an execution of a crawl. |
| MAX_ITEMS_FOUND | 5000 | Maximum number of items extracted during an execution of a crawl. |
| PORT | :9090 | Port used to expose the web server. |
| READ_TIMEOUT | 60 | Maximum duration for reading the entire request, including the body. |
| WEB_SCRAPER_WORKER_COUNT | 5 | Number of web scraper workers during an execution of a crawl. |
| WRITE_TIMEOUT | 60 | Maximum duration for writing the response. |
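To override a setting locally, export the variable before starting the executable (for example, export MAX_DEPTH=2). The sketch below shows the common Go pattern for reading such variables with fallbacks to the defaults above; the getEnv helper is purely illustrative and is not necessarily how this repository's configuration code is organized.

package main

import (
    "fmt"
    "os"
    "strconv"
)

// getEnv returns the environment variable's value if it is set,
// otherwise the supplied default. Illustrative helper only.
func getEnv(key, fallback string) string {
    if v, ok := os.LookupEnv(key); ok {
        return v
    }
    return fallback
}

func main() {
    // Defaults mirror the table above.
    port := getEnv("PORT", ":9090")
    maxDepth, err := strconv.Atoi(getEnv("MAX_DEPTH", "1"))
    if err != nil {
        maxDepth = 1
    }
    fmt.Printf("listening on %s, max crawl depth %d\n", port, maxDepth)
}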

Start crawling

The web crawler runs behind a web server that exposes a REST API. To call the web crawler, make an HTTP request to the web server that hosts it.

  1. Run the Go executable locally or start the web crawler Docker container. Confirm that it is ready to receive traffic.
  2. Generate the payload.

Example empty payload: web/example/empty_playoad.json

{
    "RootURL" :"",
    "ScrapeItemConfiguration": [ 
        {
            "ItemName" : "",
            "ItemToGet" :  {
                "Tag" : "",
                "Attribute" : "",
                "AttributeValue" : "",
                "AttributeToGet" : ""
            },
            "ItemDetails" : {              
                "<ITEM_NAME>" : {
                    "Tag": "",
                    "Attribute": "",
                    "AttributeValue" : "",
                    "AttributeToGet" : "",
                    "FilterConfiguration": {
                        "IsLessThan" : "",
                        "IsGreaterThan" : "", 
                        "IsEqualTo" : "",
                        "IsNotEqualTo": "",
                        "Contains" : "",
                        "ConvertStringToNumber" : ""
                    },
                    "FormatAttributeConfiguration" : {
                        "SuffixExist" : "",
                        "SuffixToAdd" : "",
                        "SuffixToRemove" : "",
                        "PrefixToAdd" : "",
                        "PrefixExist" : "",
                        "PrefixToRemove" : "",
                        "ReplaceOldString" : "",
                        "ReplaceNewString" : ""
                    },
                    "SkipToken" :""
                }
            }
        }
    ],
    "ScrapeURLConfiguration": [
        {
           "ExtractFromTokenConfig": {
            "Tag": "",
            "Attribute": "",
            "AttributeValue" : ""
            },
            "FormatURLConfiguration": {
                "SuffixExist" : "",
                "SuffixToAdd" : "",
                "SuffixToRemove" : "",
                "PrefixToAdd" : "",
                "PrefixExist" : "",
                "PrefixToRemove" : "",
                "ReplaceOldString" : "",
                "ReplaceNewString" : ""
            }
        }                       
    ]
}
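Before looking at a filled-in example, it may help to see how this schema could map onto Go types. The sketch below is inferred purely from the JSON keys in the template above; the struct and field names (CrawlRequest, Selector, and so on) are illustrative, and the actual request types in this repository may be named and organized differently.

// Approximate Go representation of the request payload, inferred from
// the JSON keys in the template above.
type CrawlRequest struct {
    RootURL                 string
    ScrapeItemConfiguration []ScrapeItemConfig
    ScrapeURLConfiguration  []ScrapeURLConfig
}

type ScrapeItemConfig struct {
    ItemName    string
    ItemToGet   Selector
    ItemDetails map[string]Selector
}

// Selector describes the HTML tag/attribute to match. The nested
// FilterConfiguration, FormatAttributeConfiguration, and SkipToken
// fields from the template are omitted here for brevity.
type Selector struct {
    Tag            string
    Attribute      string
    AttributeValue string
    AttributeToGet string
}

type ScrapeURLConfig struct {
    ExtractFromTokenConfig Selector
    FormatURLConfiguration FormatURLConfig
}

type FormatURLConfig struct {
    SuffixExist      string
    SuffixToAdd      string
    SuffixToRemove   string
    PrefixToAdd      string
    PrefixExist      string
    PrefixToRemove   string
    ReplaceOldString string
    ReplaceNewString string
}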

Example payload: web/example/playoad.json

{
    "RootURL" :"https://www.ebay.com/sch/i.html?_nkw=rtx+3050+graphics+card&_sop=15&rt=nc&LH_BIN=1",
    "ScrapeItemConfiguration": [ 
        {
            "ItemName" : "Graphics Card",
            "ItemToGet" :  {
                "Tag" : "li",
                "Attribute" : "class",
                "AttributeValue" : "s-item s-item__pl-on-bottom s-item--watch-at-corner"
            },
            "ItemDetails" : {
                "title" : {
                    "Tag": "h3",
                    "Attribute": "class",
                    "AttributeValue" : "s-item__title"
                },
                "link" : {
                    "Tag": "a",
                    "Attribute": "class",
                    "AttributeValue" : "s-item__link",
                    "AttributeToGet": "href"
                },                 
                "price" : {
                    "Tag": "span",
                    "Attribute": "class",
                    "AttributeValue" : "s-item__price",
                    "FilterConfiguration": {
                        "IsLessThan" : 450,
                        "IsGreaterThan" : 200,
                        "ConvertStringToNumber" : "true"
                    }
                }
            }
        }
    ],
    "ScrapeURLConfiguration": [
        {
            "FormatURLConfiguration": {
                "PrefixExist":    "////",
                "PrefixToRemove": "////",
                "PrefixToAdd":    "http://"
            }
        },
        {
            "FormatURLConfiguration": {
                "PrefixExist":    "///",
                "PrefixToRemove": "///",
                "PrefixToAdd":    "http://"
            }
        },
        {
            "FormatURLConfiguration": {
                "PrefixExist":    "//",
                "PrefixToRemove": "//",
                "PrefixToAdd":    "http://"
            }
        },
        {
            "FormatURLConfiguration": {
                "PrefixExist":    "/",
                "PrefixToAdd":    "http://ebay.com"
            }
        }                        
    ]
}
  3. Send a GET request to <HOST_NAME>:9090/crawler/item with the payload as the request body. There are more examples in web/example; a minimal Go client sketch is shown below.
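The sketch below loads one of the example payload files and sends it as the body of a GET request to the endpoint above. The localhost host name is a placeholder; replace it with your EC2 host when running on AWS, and note that curl or Postman work just as well.

package main

import (
    "bytes"
    "fmt"
    "io"
    "log"
    "net/http"
    "os"
)

func main() {
    // Load one of the example payloads shipped with the repo.
    payload, err := os.ReadFile("web/example/playoad.json")
    if err != nil {
        log.Fatal(err)
    }

    // The crawler expects the payload as the body of a GET request.
    req, err := http.NewRequest(http.MethodGet, "http://localhost:9090/crawler/item", bytes.NewReader(payload))
    if err != nil {
        log.Fatal(err)
    }
    req.Header.Set("Content-Type", "application/json")

    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        log.Fatal(err)
    }
    defer resp.Body.Close()

    body, err := io.ReadAll(resp.Body)
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println(resp.Status)
    fmt.Println(string(body))
}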

[Screenshot: Postman tracking log]

Features

The web crawler includes various features:

  • Crawls multiple requests concurrently
  • REST API
  • JSON validation middleware
  • Crawl depth restrictions
  • Liveness and readiness health checks
  • Respects robots.txt
  • Metrics
  • Generates output files in JSON
  • Sends output files to an S3 bucket
  • Item and URL validation
  • Unit tests
  • Sends metrics to a metrics channel

(back to top)

Contributing

Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.

If you have a suggestion that would make this better, please fork the repo and create a pull request. You can also simply open an issue with the tag "enhancement". Don't forget to give the project a star! Thanks again!

  1. Fork the Project
  2. Create your Feature Branch (git checkout -b feature/AmazingFeature)
  3. Commit your Changes (git commit -m 'Add some AmazingFeature')
  4. Push to the Branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

(back to top)

License

Distributed under the MIT License. See LICENSE.txt for more information.

(back to top)

Contact

Cody Kieu - cody6750@gmail.com

Project Link: https://github.com/cody6750/web-crawler

(back to top)