- About The Project
- Getting Started
- Usage
- Environment Variables
- Start crawling
- Features
- Contributing
- License
- Contact
A fast, high-level web crawling and web scraping framework written in Go. It crawls websites concurrently and extracts structured data from their pages, and is designed to run as a REST API on a web server inside a Docker container on AWS. The project includes a Dockerfile that works out of the box and can also be hosted locally. Simply follow the instructions below to get started!
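The concurrency model is not spelled out in this README; as an illustrative sketch only (not the project's actual code), concurrent crawling in Go typically fans fetches out across goroutines and collects results through a channel:

```go
package main

import (
	"fmt"
	"sync"
)

// fetchTitle is a hypothetical stand-in for a real page fetch; the
// actual crawler issues HTTP requests and parses HTML.
func fetchTitle(url string) string {
	return "title of " + url
}

// crawlAll fetches every URL concurrently, one goroutine per URL,
// and gathers the results through a buffered channel.
func crawlAll(urls []string) []string {
	results := make(chan string, len(urls))
	var wg sync.WaitGroup
	for _, u := range urls {
		wg.Add(1)
		go func(u string) {
			defer wg.Done()
			results <- fetchTitle(u)
		}(u)
	}
	wg.Wait()
	close(results)
	var out []string
	for r := range results {
		out = append(out, r)
	}
	return out
}

func main() {
	fmt.Println(len(crawlAll([]string{"https://a.example", "https://b.example"})))
}
```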
- This assumes you already have a working Go environment; if not, please see this page first.
- This assumes you already have a working Docker environment; if not, please see this page first.
- This assumes you already have a working Discord bot; if not, please see this page first.
- If you are deploying the container to AWS, configure your AWS credentials. Please see this page for assistance.
Clone the repo
git clone https://github.com/cody6750/web-crawler
-
Get Go packages
go get github.com/aws/aws-sdk-go
go get github.com/gorilla/mux
go get github.com/sirupsen/logrus
The Web Crawler is designed to be deployed on AWS EC2 as a Docker container implementing a REST API, though it can also be run locally, either by building and executing the executable or by running the Docker container on your machine. This section covers both approaches. Please note that these instructions are for macOS using a bash terminal.
- Navigate to the webcrawler repo location.
- Navigate to the /web directory and build the Go executable. Set environment variables if you want to override any defaults.

go build -o app

- Run the Go executable.

./app
- Navigate to the webcrawler repo location.
- Navigate to the /web directory and build the Go executable.

go build -o app
- Build the Docker image using the Dockerfile. Set environment variables in the Dockerfile if you want to override any defaults.
docker build -t webcrawler .
- Run the Docker image.
docker run -d -p 9090:9090 --network=discord --name webcrawler webcrawler
- Check that the Docker container is running.
docker ps -a
- Check the Docker logs.
docker logs webcrawler
The Web Crawler uses environment variables for configuration. Set them in the Dockerfile or export them through your shell.
| Environment Variable | Default Value | Description |
|---|---|---|
| ALLOW_EMPTY_ITEM | false | Allows the webcrawler to return scrape responses with empty items. |
| AWS_WRITE_OUTPUT_TO_S3 | false | Determines whether to write scrape responses to S3. |
| AWS_MAX_RETRIES | discord/token | If AWS_WRITE_OUTPUT_TO_S3 is set to true, the maximum number of retries during creation of the AWS session. |
| AWS_REGION | us-east-1 | If AWS_WRITE_OUTPUT_TO_S3 is set to true, the region used to configure the AWS session. |
| AWS_S3_BUCKET | webcrawler-results | If AWS_WRITE_OUTPUT_TO_S3 is set to true, the S3 bucket to which scrape responses are written. |
| CRAWL_DELAY | 5 | Delay between crawls per web scraper worker. |
| HEADER_KEY | User-Agent | Header key used during HTTP requests. |
| HEADER_VALUE | Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36 | Header value used during HTTP requests. |
| LOG_LEVEL | INFO | Determines the log level. |
| IDLE_TIMEOUT | 120 | Maximum amount of time to wait for the next request when keep-alives are enabled. |
| MAX_DEPTH | 1 | Maximum crawl depth during an execution of a crawl. |
| MAX_GO_ROUTINES | 10000 | Maximum goroutines deployed during an execution of a crawl. |
| MAX_VISITED_URLS | 20 | Maximum visited URLs during an execution of a crawl. |
| MAX_ITEMS_FOUND | 5000 | Maximum items extracted during an execution of a crawl. |
| PORT | :9090 | Port used to expose the web server. |
| READ_TIMEOUT | 60 | Maximum duration for reading the entire request, including the body. |
| WEB_SCRAPER_WORKER_COUNT | 5 | Number of web scraper workers during an execution of a crawl. |
| WRITE_TIMEOUT | 60 | Maximum duration for writing the response. |
The web crawler runs on a web server exposed through a REST API. To invoke the crawler, make an HTTP request to the web server that hosts it.
- Run the Go executable locally, or run the web crawler Docker container. Confirm that it is ready to receive traffic.
- Generate the payload.
Example empty payload: web/example/empty_playoad.json
{
"RootURL" :"",
"ScrapeItemConfiguration": [
{
"ItemName" : "",
"ItemToGet" : {
"Tag" : "",
"Attribute" : "",
"AttributeValue" : "",
"AttributeToGet" : ""
},
"ItemDetails" : {
"<ITEM_NAME>" : {
"Tag": "",
"Attribute": "",
"AttributeValue" : "",
"AttributeToGet" : "",
"FilterConfiguration": {
"IsLessThan" : "",
"IsGreaterThan" : "",
"IsEqualTo" : "",
"IsNotEqualTo": "",
"Contains" : "",
"ConvertStringToNumber" : ""
},
"FormatAttributeConfiguration" : {
"SuffixExist" : "",
"SuffixToAdd" : "",
"SuffixToRemove" : "",
"PrefixToAdd" : "",
"PrefixExist" : "",
"PrefixToRemove" : "",
"ReplaceOldString" : "",
"ReplaceNewString" : ""
},
"SkipToken" :""
}
}
}
],
"ScrapeURLConfiguration": [
{
"ExtractFromTokenConfig": {
"Tag": "",
"Attribute": "",
"AttributeValue" : ""
},
"FormatURLConfiguration": {
"SuffixExist" : "",
"SuffixToAdd" : "",
"SuffixToRemove" : "",
"PrefixToAdd" : "",
"PrefixExist" : "",
"PrefixToRemove" : "",
"ReplaceOldString" : "",
"ReplaceNewString" : ""
}
}
]
}
Example payload: web/example/playoad.json
{
"RootURL" :"https://www.ebay.com/sch/i.html?_nkw=rtx+3050+graphics+card&_sop=15&rt=nc&LH_BIN=1",
"ScrapeItemConfiguration": [
{
"ItemName" : "Graphics Card",
"ItemToGet" : {
"Tag" : "li",
"Attribute" : "class",
"AttributeValue" : "s-item s-item__pl-on-bottom s-item--watch-at-corner"
},
"ItemDetails" : {
"title" : {
"Tag": "h3",
"Attribute": "class",
"AttributeValue" : "s-item__title"
},
"link" : {
"Tag": "a",
"Attribute": "class",
"AttributeValue" : "s-item__link",
"AttributeToGet": "href"
},
"price" : {
"Tag": "span",
"Attribute": "class",
"AttributeValue" : "s-item__price",
"FilterConfiguration": {
"IsLessThan" : 450,
"IsGreaterThan" : 200,
"ConvertStringToNumber" : "true"
}
}
}
}
],
"ScrapeURLConfiguration": [
{
"FormatURLConfiguration": {
"PrefixExist": "////",
"PrefixToRemove": "////",
"PrefixToAdd": "http://"
}
},
{
"FormatURLConfiguration": {
"PrefixExist": "///",
"PrefixToRemove": "///",
"PrefixToAdd": "http://"
}
},
{
"FormatURLConfiguration": {
"PrefixExist": "//",
"PrefixToRemove": "//",
"PrefixToAdd": "http://"
}
},
{
"FormatURLConfiguration": {
"PrefixExist": "/",
"PrefixToAdd": "http://ebay.com"
}
}
]
}
- Send a GET request to <HOST_NAME>:9090/crawler/item using the payload. There are examples in web/example.
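A minimal Go client for this call might look like the following. The endpoint path comes from the step above; sending a body with GET is unusual but matches the documented interface, and the host and payload here are placeholder assumptions:

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
)

// buildCrawlRequest constructs the GET request described above.
// host is e.g. "localhost:9090"; payload is the JSON request body.
func buildCrawlRequest(host string, payload []byte) (*http.Request, error) {
	url := "http://" + host + "/crawler/item"
	req, err := http.NewRequest(http.MethodGet, url, bytes.NewReader(payload))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	return req, nil
}

func main() {
	req, err := buildCrawlRequest("localhost:9090", []byte(`{"RootURL":""}`))
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Method, req.URL.String())
	// To actually send it: resp, err := http.DefaultClient.Do(req)
}
```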
The webcrawler includes various features:
- Ability to crawl multiple requests concurrently
- REST API
- JSON validation middleware
- Crawl depth restrictions
- Liveness and readiness health checks
- Respects robots.txt
- Metrics
- Generates output files in JSON
- Sends output files to an S3 bucket
- Item and URL validation
- Unit tests
- Sends metrics to a metrics channel
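The crawl depth restriction in the list above (configured via MAX_DEPTH) can be illustrated with a short sketch; the function and helper names here are assumptions, not the project's implementation:

```go
package main

import "fmt"

// crawl visits a URL, then recurses into its links until maxDepth
// is exceeded. linksOf stands in for real link extraction, and
// visited prevents revisiting the same URL.
func crawl(url string, depth, maxDepth int, linksOf func(string) []string, visited map[string]bool) {
	if depth > maxDepth || visited[url] {
		return
	}
	visited[url] = true
	for _, next := range linksOf(url) {
		crawl(next, depth+1, maxDepth, linksOf, visited)
	}
}

func main() {
	links := func(u string) []string { return []string{u + "/a", u + "/b"} }
	visited := map[string]bool{}
	// With maxDepth 1 (the MAX_DEPTH default), this visits the root
	// plus its direct links, but nothing deeper.
	crawl("root", 0, 1, links, visited)
	fmt.Println(len(visited))
}
```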
Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.
If you have a suggestion that would make this better, please fork the repo and create a pull request. You can also simply open an issue with the tag "enhancement". Don't forget to give the project a star! Thanks again!
- Fork the Project
- Create your Feature Branch (git checkout -b feature/AmazingFeature)
- Commit your Changes (git commit -m 'Add some AmazingFeature')
- Push to the Branch (git push origin feature/AmazingFeature)
- Open a Pull Request
Distributed under the MIT License. See LICENSE.txt for more information.
Cody Kieu - cody6750@gmail.com
Project Link: https://github.com/cody6750/web-crawler