Web Crawler

Table of Contents

  1. About The Project
  2. Getting Started
  3. Usage
  4. Environment Variables
  5. Start crawling
  6. Features
  7. Contributing
  8. License
  9. Contact

About The Project

[Screenshot: tracking output]

A fast, high-level web crawling and web scraping framework written in Go. It crawls websites concurrently and extracts structured data from their pages. The crawler is hosted on a web server that implements a REST API and runs in a Docker container on AWS. The project includes a Dockerfile that works out of the box and can also be hosted locally. Simply follow the instructions below to get started!

Design

[Design diagram]

Built With

  • Go
  • Docker
  • AWS (EC2, S3)
  • gorilla/mux
  • sirupsen/logrus
  • aws-sdk-go

(back to top)

Getting Started

Prerequisites

  1. These instructions assume you already have a working Go environment; if not, please see this page first.

  2. These instructions assume you already have a working Docker environment; if not, please see this page first.

  3. These instructions assume you already have a working Discord bot; if not, please see this page first.

  4. If you are deploying the container to AWS, configure your AWS credentials first. Please see this page for assistance.

Installation

  1. Clone the repo

    git clone https://github.com/cody6750/web-crawler
  2. Get Go packages

     go get github.com/aws/aws-sdk-go
     go get github.com/gorilla/mux
     go get github.com/sirupsen/logrus 

(back to top)

Usage

The Web Crawler is designed to be deployed on AWS EC2 as a Docker container implementing a REST API, though it can also be deployed locally, either by building and running the executable directly or by running the Docker container on your machine. This section covers both approaches. Please note that these instructions are for macOS using a bash terminal.

Build locally without Docker

  1. Navigate to the webcrawler repo location.
  2. Navigate to the /web directory and build the Go executable. Set environment variables if you want to override any defaults.
go build -o app
  3. Run the executable.
./app

[Screenshot: go build locally]

Build locally with Docker

  1. Navigate to the webcrawler repo location.

  2. Navigate to the /web directory and build the Go executable.

go build -o app
  3. Build the Docker image using the Dockerfile. Set environment variables in the Dockerfile if you want to override any defaults.
docker build -t webcrawler .
  4. Run the Docker container.
  docker run -d -p 9090:9090 --network=discord --name webcrawler webcrawler
  5. Check that the Docker container is running.
  docker ps -a
  6. Check the Docker logs.
  docker logs webcrawler

[Screenshot: Docker build locally]

(back to top)

Environment Variables

The Web Crawler uses environment variables for configuration. Set them in the Dockerfile or export them from your shell.

| Environment Variable | Default Value | Description |
|---|---|---|
| ALLOW_EMPTY_ITEM | false | Allows the web crawler to return scrape responses with empty items. |
| AWS_WRITE_OUTPUT_TO_S3 | false | Determines whether scrape responses are written to S3. |
| AWS_MAX_RERIES | discord/token | If AWS_WRITE_OUTPUT_TO_S3 is true, the maximum number of retries when creating the AWS session. |
| AWS_REGION | us-east-1 | If AWS_WRITE_OUTPUT_TO_S3 is true, the region used to configure the AWS session. |
| AWS_S3_BUCKET | webcrawler-results | If AWS_WRITE_OUTPUT_TO_S3 is true, the S3 bucket that scrape responses are sent to. |
| CRAWL_DELAY | 5 | Delay between crawls per web scraper worker. |
| HEADER_KEY | User-Agent | Header key used for HTTP requests. |
| HEADER_VALUE | Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36 | Header value used for HTTP requests. |
| LOG_LEVEL | INFO | Determines the log level. |
| IDLE_TIMEOUT | 120 | Maximum time to wait for the next request when keep-alives are enabled. |
| MAX_DEPTH | 1 | Maximum crawl depth during an execution of a crawl. |
| MAX_GO_ROUTINES | 10000 | Maximum number of goroutines deployed during an execution of a crawl. |
| MAX_VISITED_URLS | 20 | Maximum number of visited URLs during an execution of a crawl. |
| MAX_ITEMS_FOUND | 5000 | Maximum number of items extracted during an execution of a crawl. |
| PORT | :9090 | Port used to expose the web server. |
| READ_TIMEOUT | 60 | Maximum duration for reading the entire request, including the body. |
| WEB_SCRAPER_WORKER_COUNT | 5 | Number of web scraper workers during an execution of a crawl. |
| WRITE_TIMEOUT | 60 | Maximum duration for writing the response. |
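To override a setting locally, export the variable before starting the executable (for example, export MAX_DEPTH=2). The sketch below shows the common Go pattern for reading such variables with fallbacks to the defaults above; the getEnv helper is purely illustrative and is not necessarily how this repository's configuration code is organized.

package main

import (
    "fmt"
    "os"
    "strconv"
)

// getEnv returns the environment variable's value if it is set,
// otherwise the supplied default. Illustrative helper only.
func getEnv(key, fallback string) string {
    if v, ok := os.LookupEnv(key); ok {
        return v
    }
    return fallback
}

func main() {
    // Defaults mirror the table above.
    port := getEnv("PORT", ":9090")
    maxDepth, err := strconv.Atoi(getEnv("MAX_DEPTH", "1"))
    if err != nil {
        maxDepth = 1
    }
    fmt.Printf("listening on %s, max crawl depth %d\n", port, maxDepth)
}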

Start crawling

The web crawler runs behind a web server that exposes a REST API. To call the web crawler, make an HTTP request to the web server that hosts it.

  1. Run the Go executable locally or start the web crawler Docker container. Confirm that it is ready to receive traffic.
  2. Generate the payload.

Example empty payload: web/example/empty_playoad.json

{
    "RootURL" :"",
    "ScrapeItemConfiguration": [ 
        {
            "ItemName" : "",
            "ItemToGet" :  {
                "Tag" : "",
                "Attribute" : "",
                "AttributeValue" : "",
                "AttributeToGet" : ""
            },
            "ItemDetails" : {              
                "<ITEM_NAME>" : {
                    "Tag": "",
                    "Attribute": "",
                    "AttributeValue" : "",
                    "AttributeToGet" : "",
                    "FilterConfiguration": {
                        "IsLessThan" : "",
                        "IsGreaterThan" : "", 
                        "IsEqualTo" : "",
                        "IsNotEqualTo": "",
                        "Contains" : "",
                        "ConvertStringToNumber" : ""
                    },
                    "FormatAttributeConfiguration" : {
                        "SuffixExist" : "",
                        "SuffixToAdd" : "",
                        "SuffixToRemove" : "",
                        "PrefixToAdd" : "",
                        "PrefixExist" : "",
                        "PrefixToRemove" : "",
                        "ReplaceOldString" : "",
                        "ReplaceNewString" : ""
                    },
                    "SkipToken" :""
                }
            }
        }
    ],
    "ScrapeURLConfiguration": [
        {
           "ExtractFromTokenConfig": {
            "Tag": "",
            "Attribute": "",
            "AttributeValue" : ""
            },
            "FormatURLConfiguration": {
                "SuffixExist" : "",
                "SuffixToAdd" : "",
                "SuffixToRemove" : "",
                "PrefixToAdd" : "",
                "PrefixExist" : "",
                "PrefixToRemove" : "",
                "ReplaceOldString" : "",
                "ReplaceNewString" : ""
            }
        }                       
    ]
}
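Before looking at a filled-in example, it may help to see how this schema could map onto Go types. The sketch below is inferred purely from the JSON keys in the template above; the struct and field names (CrawlRequest, Selector, and so on) are illustrative, and the actual request types in this repository may be named and organized differently.

// Approximate Go representation of the request payload, inferred from
// the JSON keys in the template above.
type CrawlRequest struct {
    RootURL                 string
    ScrapeItemConfiguration []ScrapeItemConfig
    ScrapeURLConfiguration  []ScrapeURLConfig
}

type ScrapeItemConfig struct {
    ItemName    string
    ItemToGet   Selector
    ItemDetails map[string]Selector
}

// Selector describes the HTML tag/attribute to match. The nested
// FilterConfiguration, FormatAttributeConfiguration, and SkipToken
// fields from the template are omitted here for brevity.
type Selector struct {
    Tag            string
    Attribute      string
    AttributeValue string
    AttributeToGet string
}

type ScrapeURLConfig struct {
    ExtractFromTokenConfig Selector
    FormatURLConfiguration FormatURLConfig
}

type FormatURLConfig struct {
    SuffixExist      string
    SuffixToAdd      string
    SuffixToRemove   string
    PrefixToAdd      string
    PrefixExist      string
    PrefixToRemove   string
    ReplaceOldString string
    ReplaceNewString string
}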

Example payload: web/example/playoad.json

{
    "RootURL" :"https://www.ebay.com/sch/i.html?_nkw=rtx+3050+graphics+card&_sop=15&rt=nc&LH_BIN=1",
    "ScrapeItemConfiguration": [ 
        {
            "ItemName" : "Graphics Card",
            "ItemToGet" :  {
                "Tag" : "li",
                "Attribute" : "class",
                "AttributeValue" : "s-item s-item__pl-on-bottom s-item--watch-at-corner"
            },
            "ItemDetails" : {
                "title" : {
                    "Tag": "h3",
                    "Attribute": "class",
                    "AttributeValue" : "s-item__title"
                },
                "link" : {
                    "Tag": "a",
                    "Attribute": "class",
                    "AttributeValue" : "s-item__link",
                    "AttributeToGet": "href"
                },                 
                "price" : {
                    "Tag": "span",
                    "Attribute": "class",
                    "AttributeValue" : "s-item__price",
                    "FilterConfiguration": {
                        "IsLessThan" : 450,
                        "IsGreaterThan" : 200,
                        "ConvertStringToNumber" : "true"
                    }
                }
            }
        }
    ],
    "ScrapeURLConfiguration": [
        {
            "FormatURLConfiguration": {
                "PrefixExist":    "////",
                "PrefixToRemove": "////",
                "PrefixToAdd":    "http://"
            }
        },
        {
            "FormatURLConfiguration": {
                "PrefixExist":    "///",
                "PrefixToRemove": "///",
                "PrefixToAdd":    "http://"
            }
        },
        {
            "FormatURLConfiguration": {
                "PrefixExist":    "//",
                "PrefixToRemove": "//",
                "PrefixToAdd":    "http://"
            }
        },
        {
            "FormatURLConfiguration": {
                "PrefixExist":    "/",
                "PrefixToAdd":    "http://ebay.com"
            }
        }                        
    ]
}
  3. Send a GET request to <HOST_NAME>:9090/crawler/item with the payload as the request body. There are more examples in web/example; a minimal Go client sketch is shown below.
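The sketch below loads one of the example payload files and sends it as the body of a GET request to the endpoint above. The localhost host name is a placeholder; replace it with your EC2 host when running on AWS, and note that curl or Postman work just as well.

package main

import (
    "bytes"
    "fmt"
    "io"
    "log"
    "net/http"
    "os"
)

func main() {
    // Load one of the example payloads shipped with the repo.
    payload, err := os.ReadFile("web/example/playoad.json")
    if err != nil {
        log.Fatal(err)
    }

    // The crawler expects the payload as the body of a GET request.
    req, err := http.NewRequest(http.MethodGet, "http://localhost:9090/crawler/item", bytes.NewReader(payload))
    if err != nil {
        log.Fatal(err)
    }
    req.Header.Set("Content-Type", "application/json")

    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        log.Fatal(err)
    }
    defer resp.Body.Close()

    body, err := io.ReadAll(resp.Body)
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println(resp.Status)
    fmt.Println(string(body))
}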

[Screenshot: Postman tracking log]

Features

The web crawler includes various features:

  • Crawls multiple requests concurrently
  • REST API
  • JSON validation middleware
  • Crawl depth restrictions
  • Liveness and readiness health checks
  • Respects robots.txt
  • Metrics
  • Generates output files in JSON
  • Sends output files to an S3 bucket
  • Item and URL validation
  • Unit tests
  • Sends metrics to a metrics channel

(back to top)

Contributing

Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.

If you have a suggestion that would make this better, please fork the repo and create a pull request. You can also simply open an issue with the tag "enhancement". Don't forget to give the project a star! Thanks again!

  1. Fork the Project
  2. Create your Feature Branch (git checkout -b feature/AmazingFeature)
  3. Commit your Changes (git commit -m 'Add some AmazingFeature')
  4. Push to the Branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

(back to top)

License

Distributed under the MIT License. See LICENSE.txt for more information.

(back to top)

Contact

Cody Kieu - cody6750@gmail.com

Project Link: https://github.com/cody6750/web-crawler

(back to top)